Valid Simultaneous Inference in HighDimensional Settings (with the hdm package for R)
Abstract.
Due to the increasing availability of highdimensional empirical applications in many research disciplines, valid simultaneous inference becomes more and more important. For instance, highdimensional settings might arise in economic studies due to very rich data sets with many potential covariates or in the analysis of treatment heterogeneities. Also the evaluation of potentially more complicated (nonlinear) functional forms of the regression relationship leads to many potential variables for which simultaneous inferential statements might be of interest. Here we provide a review of classical and modern methods for simultaneous inference in (highdimensional) settings and illustrate their use by a case study using the R package hdm. The R package hdm implements valid joint powerful and efficient hypothesis tests for a potentially large number of coefficients as well as the construction of simultaneous confidence intervals and, therefore, provides useful methods to perform valid postselection inference based on the LASSO.
R and the package hdm are opensource software projects and can be freely downloaded from CRAN: http://cran.rproject.org.
Contents
 1 Introduction
 2 The Setting
 3 Simultaneous PostSelection Inference in High Dimensions – An Overview
 4 Methods for Testing multiple Hypotheses
 5 Implementation in R
 6 A Simulation Study for Valid Simultaneous Inference in High Dimensions
 7 A RealData Example for Simultaneous Inference in High Dimensions  The Gender Wage Gap Analysis
 8 Conclusion
 9 Appendix
1. Introduction
Valid simultaneous inference becomes very important when multiple hypotheses are tested at the same time. For instance, suppose a researcher wants to estimate a potentially large number of regression coefficients and to assess which of these coefficients are significantly different from zero at a given significance level . It is well known that in such a situation, the naive approach, i.e. simply ignoring multiple testing issues, will generally lead to flawed conclusions due to a large number of mistakenly rejected hypotheses. Indeed, the actual probability of incorrectly rejecting one or more hypotheses will in general exceed the claimed/desired level by large.
The statistical literature has proposed various approaches to mitigate the consequences of testing multiple hypotheses at the same time. These methods can be grouped into two approaches according to the underlying criterion. The first approach, initiated by the famous Bonferroni correction, seeks to control the probability of at least one false rejection which is called the familywise error rate (FWER). Since the definition of the FWER refers to the probability of making at least one type I error, the FWERcriterion is appealing from an intuitive point of view. However, FWER control is often criticized to be conservative and, instead, the false discovery rate (FDR) control is used as a criterion in many test procedures leading to the second major class of multiple testing correction methods, e.g. Benjamini and Hochberg (1995). The FDR refers to the expected share of falsely rejected null hypotheses and, hence, results from FDRprocedures differ from classical tests results in terms of interpretation. Various approaches aim at maintaining control of the FWER and reducing conservativeness at the same time by incorporating a stepwise procedure, for instance, the stepdown method of Holm (1979). Moreover, taking the dependence structure of test statistics into consideration allows to reduce conservativeness of FWERprocedures, as in the stepdown procedure of Romano and Wolf (2005a, b) which is based on resampling methods.
In the following, we review classical methods and will describe how valid simultaneous inference can be conducted in a highdimensional regression setting, i.e. if the number of covariates exceeds the number of observations, and give examples how the presented methods can be applied with the statistical software R . It is wellknown that classical regression methods, such as ordinary least squares, break down in highdimensional settings. Instead, regularization methods, e.g. the lasso, can be used for estimation. However, postselection inference is nontrivial and requires modification of the estimators.
The R package hdm implements the doubleselection approach of Belloni, Chernozhukov, and Hansen (2014) that allows to perform valid inference based on modelselection with the lasso. Moreover, hdm provides powerful tests for a large number of hypotheses using the multiplier bootstrap, potentially in combination with the Romano and Wolf stepdown procedure (starting with Version 0.3.0) . The implemented methods are less conservative / more powerful than traditional approaches as they allow to take the dependence structure among test statistics into account and proceed in a stepwise manner. At the same time, the implemented procedures are attractive from an intuitive point of view as they guarantee control of the familywise error rate (FWER).
The remainder of the paper is organized as follows. First, the setting is introduced and an overview on valid postselection inference in high dimensions is provided. Second, a short and selective review on traditional and recent methods to adjust for multiple testing is presented. Third, we give an overview on the tools for valid postselection inference in highdimensions available in the R package hdm. Fourth, the use of the software is illustrated in a simulated and a replicable realdata example. A conclusion is provided in the last section.
2. The Setting
We are interested in testing a set of hypotheses in a highdimensional regression model, i.e. a regression where the number of covariates is large, potentially much larger than the number of observations , i.e. we have . The ultimate objective in this setting is to perform inference on a subset of the regression coefficients, i.e. a vector of socalled “target” coefficients with and .
(1) 
where is an intercept and denote the regression coefficients of the control variables . Moreover, it is assumed that . In this setting, hypotheses are tested for the coefficients that correspond to the effect of the “target” variables on the outcome
(2) 
For instance, such a highdimensional regression setting arises in causal program evaluation studies, where a large number of regressors is included to approximate a potentially complicated, nonlinear population regression function using transformations with dictionaries, e.g. interactions, splines or polynomials. Alternatively, an analysis of heterogeneous treatment effects across possibly many subgroups as in the example in Section 7 might require a large number of interactions of the regressors.
Suppose, there is a test procedure for each of the hypotheses leading to test statistics and unadjusted pvalues . In the context of multiple testing, it is often helpful to sort the pvalues in an increasing order (i.e. the “most significant” test result as the first in the row) and the hypotheses likewise, i.e. and with . Also the test statistics are ordered by the same logic . A researcher decides whether to accept or to reject a null hypothesis if the corresponding pvalue is above or below a prespecified significance level . Generally, the significance level corresponds to the probability of erroneously rejecting a true null hypothesis. However, if the conclusions are based on a comparison of unadjusted pvalues and the significance level, the probability of incorrectly rejecting at least one of the hypotheses will, potentially by large, exceed the claimed level . Hence, adjustment for multiple testing becomes necessary to draw appropriate inferential conclusions.
3. Simultaneous PostSelection Inference in High Dimensions – An Overview
In highdimensional settings traditional regression methods such as ordinary least squares break down and testing the hypotheses will severely suffer from the shortcomings of the underlying estimation method. Penalization methods, for instance the lasso or other machine learning techniques, provide an opportunity to overcome the failure of traditional least squares estimation as they regularize the regression problem in Equation 1 by introducing a penalization term. In the example of lasso, the ordinary least squares minimization problem is extended by a penalization of the regression coefficients using the norm. The lasso estimator is the solution to the maximization problem
(3) 
with being the norm, is a penalization parameter and denotes a diagonal matrix of penalty loadings. More details on the choice of and as implemented in the hdm package can be found in the package vignette available at CRAN and Chernozhukov, Hansen, and Spindler (2016). As a consequence of the penalization, some of the coefficients are shrunk towards zero and some of them are set exactly equal to zero. In general, inference after such a selection step is only valid under the strong assumption of perfect model selection, i.e. the lasso does only set those coefficients to zero that truly have no effect on . However perfect model selection and the underlying assumptions are often considered unrealistic in realworld applications leading to a breakdown of the naive inferential framework and, thus, flawed inferential conclusions.
In contrast to the naive procedure, the socalled doubleselection approach of Belloni, Chernozhukov, and Hansen (2014) tolerates small model selection mistakes so that valid confidence intervals and test procedures can be based on the lasso. The doubleselection method is based on orthogonalized moment equations and introduces an auxiliary (lasso) regression step for each of the target coefficients in order to avoid that variable selection erraneously excludes variables that are related to both the outcome and the target regressors by setting their coefficients equal to zero. Double selection proceeds as follows: (1) For each of the target variables in , , a lasso regression is estimated to identify the most important predictors among the covariates and the remaining target variables . (2) A lasso regression of the outcome variable on all explanatory variables, except for , is estimated to identify predictors of . This step is executed for each of the target variables with . (3) The target coefficients are estimated from a linear regression of the outcome on all target variables as well as all covariates that have been selected in either step (1) or (2). As a consequence of the doubleselection procedure, the risk of an omitted variable bias that might arise due to imperfect model selection is reduced. It can be shown that the doubleselection estimator is asymptotically normally distributed under a set of regularity assumptions. Probably, the most important of these assumptions is (approximate) sparsity. This assumption states that only a subset of the regressors suffice to describe the relationship of the outcome variable and the covariates and that all other regressors have no or only a negligible effect on the outcome. In general, valid postselection inference is compatible with other tools from the machine learning literature, for instance elastic nets or regression trees, as long as these methods satisfy some regularity conditions (Belloni, Chernozhukov, and Kato, 2015; Belloni, Chernozhukov, and Hansen, 2014).
In the multiple testing scenario described above, the doubleselection approach can be used to test the null hypotheses , . Belloni, Chernozhukov, and Kato (2015) show that a valid ()confidence interval can be constructed by using the multiplier boostrap as established in Chernozhukov, Chetverikov, and Kato (2013). Moreover, Chernozhukov, Chetverikov, and Kato (2013) and Belloni, Chernozhukov, and Kato (2015) show that a multiplier bootstrap version of the RomanoWolf method can be used to construct a joint significance test in a highdimensional setting such that asymptotic control of the FWER is obtained.
An advantage of the stepdown method of Romano and Wolf (2005a, b) is that it is based on resampling methods. Hence, it is able to account for the dependence structure underlying the test statistics and to give less conservative results as compared to methods such as the Bonferroni and Holm correction. The idea of the RomanoWolf procedure is to construct rectangular simultaneous confidence intervals in subsequent steps whereas in each step, the coverage probability is kept above a level of . If in step , the confidence set does not contain zero in dimension , the corresponding is rejected. In step , the algorithm proceeds analogously by constructing a rectangular joint confidence region for those coefficients for which the null hypotheses has not been rejected in step or before. The algorithm stops if no hypothesis is rejected anymore. To take the dependence structure of the test hypotheses into account, the classical RomanoWolf stepdown procedure uses the bootstrap to compute the constant which is needed to construct a rectangular confidence interval. This constant is estimated by the ()quantile of the maxima of the bootstrapped test statistics in each step to guarantee the coverage probability of (). The computational burden of the RomanoWolf stepdown procedure can be reduced by using the multipler bootstrap. This bootstrap method requires the calculation of the solution of an orthogonal moment equation only once and then operates on pertubations by realizations of an independently and standard normally distributed random variable (Chernozhukov, Chetverikov, and Kato, 2013).
4. Methods for Testing multiple Hypotheses
4.1. A Global Test for Joint Significance with Lasso Regression
A basic question frequently arising in empirical work is whether the Lasso regression has explanatory power, comparable to a Ftest for the classical linear regression model. The construction of a joint significance test follows Chernozhukov, Chetverikov, and Kato (2013) (Appendix M). Based on the model with intercept , the null hypothesis of joint statistical insignificance is . The null hypothesis implies that
and the restriction can be tested using the supscore statistic:
where . The critical value for this statistic can be approximated by the multiplier bootstrap procedure, which simulates the statistic:
where ’s are i.i.d. , conditional on the data. The quantile of serves as the critical value, . We reject the null if in favor of statistical significance, and we keep the null of nonsignificance otherwise.
4.2. Multiple Hypotheses Testing with Control of the Familywise Error Rate
The FWER is defined as the probability of falsely rejecting at least one hypothesis. The goal is to control the FWER and to secure that it does not exceed a prespecified level . We assume that for the individual tests the significance level is set uniformly to .
Bonferroni Correction
According to the Bonferroni correction the cutoff of the pvalues is set to and all hypotheses with pvalues below the adjusted level are rejected. Bool’s inequality then gives directly that the FWER is smaller or equal to . Instead of adjusting the level of to , it is possible to adjust the pvalues so that we reject a hypothesis if . A drawback of the procedure is that it is quite conservative, meaning that in many applications, in particular in highdimensional settings when many hypotheses are tested simultaneously, often no or very few hypotheses are rejected, increasing the risk of accepting false null hypotheses (i.e. of a type II error).
BonferroniHolm Correction
We again assume that the pvalues are ordered (from lowest to highest) with corresponding hypotheses . The BonferroniHolm procedure controls the FWER by the following procedure: Let be the smallest index such that the corresponding pvalue exceeds the adjusted cutoff .
Reject the null hypothesis and accept . The BonferroniHolm procedure can be considered as general improvement over the Bonferroni correction that maintains control of the FWER and reduces the risk of a type II error at the same time. The adjusted pvalue according to the BonferroniHolm correction are computed as with .
Joint Confidence Region Using Multiplier Bootstrap
Belloni, Chernozhukov, and Kato (2015) derive valid ()confidence regions for the vector of target coefficients, , in the highdimensional regression setting in Equation 1 estimated with lasso. The confidence regions which are constructed with the multiplier bootstrap can be used equivalently to a joint signficance test of the hypotheses. Accordingly, the null hypotheses , , would be rejected at the level if the simultaneous ()confidence region does not cover zero in dimension .
RomanoWolf Stepdown Procedure
While the Bonferroni and BonferroniHolm correction do not take into account the dependence structure of the test statistics, further improvements in the control of the type II error can be achieved by modeling the dependence structure using resampling. A very popular and powerful method in this regard is the RomanoWolf stepdown procedure. Stepdown methods proceed in several rounds where in each round a decision is taken on the set of hypotheses being rejected. The algorithm continues until no further hypotheses are rejected. The RomanoWolf stepdown procedure guarantees asymptotic control of the FWER at level by constructing a sequence of simultaneous tests. We present the recent version of the RomanoWolf method from Chernozhukov, Chetverikov, and Kato (2013) and Belloni, Chernozhukov, and Kato (2015) who prove the validity of the procedure in combination with the multiplier bootstrap.

Sort the test statistics in a decreasing order (in terms of their absolute values):

Draw multiplier bootstrap versions for each of the test statistics , , and ,

For each and determine the maximum of the bootstrapped test statistics .

Compute initial pvalues, for

Compute adjusted pvalues by ensuring monotonicity

if

if

The padjustment algorithm parallels that in Romano and Wolf (2016) with the only difference that the bootstrap test statistics are computed efficiently with the multiplier bootstrap procedure instead of the classical bootstrap. In contrast to traditional bootstrap methods, the multiplier bootstrap does not require reestimation of the lasso regression for each bootstrap sample. Romano and Wolf (2005b, 2016) recommend a high number of bootstrap repetitions .
If the data stem from a randomized experiment, the method introduced in List, Shaikh, and Xu (2016) can be used. It is a variant of the RomanoWolf procedure under uncounfoundedness. Moreover, it allows to compare the effect of different treatments and several outcome variables simultaneously.
4.3. Multiple Hypotheses Testing with Control of the False Discovery Rate
The FWER is a very strict criterion which is often very conservative. This means that in settings when thousands or hundred thousands of hypotheses are tested simultaneously, the FWER does often not detect useful signals. Hence, in largescale settings frequently a less strict criterion, the socalled false discovery rate (FDR) is employed. The false discovery proportion (FDP) is defined as the ratio of the number of hypotheses which are wrongly classified as significant (false positives) and the total number of positives. If the latter is zero, it is defined as zero. The FDR is defined as the expected value of the FDP : . The FDR concept reflects the tradeoff between false discoveries and true discoveries.
BenjaminiHochberg Procedure
To control the FDR, the BenjaminiHochberg (BH) procedure ranks the hypotheses according to the corresponding pvalues and then chooses a cutoff along the ranking to control the FDR at a prespecified level of . The BH procedure first uses a stepup comparision to find a cutoff pvalue:
and then rejects all hypotheses . In most applications, is chosen.
5. Implementation in R
Estimation of the highdimensional regression model in Equation 1 and simultaneous inference on the target coefficients is implemented in the R package hdm available at CRAN. hdm provides an implementation of the doubleselection approach of Belloni, Chernozhukov, and Kato (2015) using the lasso as the variable selection device. The function rlassoEffects() does valid inference on a specified set of target parameters and returns an object of S3 class rlassoEffects. This output object is used to perform simultaneous inference subsequently as described in the following. More details on the hdm package and introductory examples are provided in the hdm vignette available at CRAN and Chernozhukov, Hansen, and Spindler (2016). The package hdm offers three ways to perform valid simultaneous inference in highdimensional settings:

Overall Significance Test
The hdm provides a joint significance test that is comparable to a Ftest known from classical ordinary least squares regression. Based on Chernozhukov, Chetverikov, and Kato (2013) (Appendix M), the null hypothesis that no covariate has explanatory power for the outcome is tested, i.e.
The test is performed automatically if summary() is called for an object of the S3 class rlasso. This object corresponds to the output of the function rlasso() which implements the lasso estimator using a theorybased rule for determining the penalization parameter.

Joint Confidence Interval
Based on an object of the S3 class rlassoEffects, a valid joint confidence interval with coverage probability can be constructed for the specified target coefficients using the command confint() with the option joint=TRUE.

Multiple Testing Adjustment of pValues
Starting with Version 0.3.0, the hdm package offers the S3 method p_adjust() for objects inheriting from classes rlassoEffects and lm. By default, p_adjust() implements the RomanoWolf stepdown procedure using the computationally efficient multiplier bootstrap procedure (option method = "RW"). Hence, the hdm offers an implementation of the pvalue adjustment that corresponds to a joint test à la RomanoWolf for both postselection inference based on double selection with the lasso as well as for ordinary least squares regression. Moreover, the p_adjust() call offers classical adjustment methods in a userfriendly way, i.e. the function can be executed directly on the output object returned from a regression with rlassoEffects() or lm(). The hosted correction methods are the methods provided in the p.adjust() command of the basic stats package, i.e. Bonferroni, BonferroniHolm, and BenjaminiHochberg among others. If an object of class lm is used, the user can provide an index of the coefficients to be tested simultaneously. By default, all coefficients are tested.
6. A Simulation Study for Valid Simultaneous Inference in High Dimensions
The simulation study provides a finitesample comparison of different multiple testing corrections in a highdimensional setting, i.e. the Bonferroni method, the BonferroniHolm procedure, BenjaminiHochberg adjustment and the RomanoWolf stepdown method. In addition, the study illustrates the failure of the naive approach that ignores the problem of simultaneous hypotheses testing, i.e. without any correction of the significance level or pvalues.
6.1. Simulation Setting
We consider a regression of a continuous outcome variable on a set of regressors, ,
(4) 
with and variance . In our setting, the realizations of are generated by a joint normal distribution with and with . We consider the case of an i.i.d. sample with and observations. The setting is sparse in that only few, , regressors are truly nonzero whereas the location of the nonzero coefficient is only known by an inaccessible oracle. Thus, the lasso is used to select the set of explanatory variables with a nonzero coefficient. Figure 1 presents the regression coefficients in the simulation study.
The regression in Equation 4 is estimated wih postlasso and inference for all regression coefficients is based on double selection.
6.2. Results
The simulation results are summarized in Table 1. The reported results refer to averages over repetitions in terms of correct and incorrect rejections of null hypotheses at a prespecified level of as well as the empirical FWER and FDR.
The results show that multiple testing adjustment is of great importance since naive inferential statements might be invalid. If each of the hypotheses is tested at a significance level of without adjustment of the pvalues, the naive procedure leads to at least one incorrect rejection with a probability of almost 1. Also the BenjaminiHochberg procedure with incorrectly rejects a true null in more than 3000 out of the 5000 repetitions. On average, more than 5 () and more than 4 () true hypotheses are rejected without adjustment of pvalues. The simulation study illustrates that control of the FDR is achieved by the BenjaminiHochberg correction. Accordingly, one would incorrectly reject more than 1 hypothesis on average. Over all 5000 repetitions 9.5% () and 8.7% () of all rejections are incorrect (false positives) which is below the specified level of .
Methods with asymptotic control of the FWER are much less likely to erraneously reject true null hypotheses. In the setting with observations, the number of incorrect rejections is on average around 0.1 to 0.17 with the Bonferroni correction being most conservative. The empirical familywise error rates are very close to the desired level , despite the small number of observations and the relatively large number of tested hypotheses. With larger sample size () the empirical FWERs approach the level . The price for control of the probability of at least one type I error is paid in terms of power with less correct rejections for the FWERmethods as compared to the BenjaminiHochberg correction. The largest number of correct rejections while still maintaining control of the FWER is achieved by the RomanoWolf stepdown procedure. Hence, taking the dependence structure of the test statistics into account is favorable in case of dependencies among test statistics.
Naive  BenjaminiHochberg  Bonferroni  BonferroniHolm  RomanoWolf  
Correct Rejections  
Incorrect Rejections  
Familywise Error Rate  
False Discovery Rate  
Correct Rejections  
Incorrect Rejections  
Familywise Error Rate  
False Discovery Rate  
The Table presents the average number of correct or incorrect rejections at a significance level and the FWER over all repetitions. For the BenjaminiHochberg adjustment is chosen. Standard deviation in parentheses.
7. A RealData Example for Simultaneous Inference in High Dimensions  The Gender Wage Gap Analysis
The following section demonstrates the methods for valid simultaneous inference implemented in the package hdm and provides a comparison of the classical correction methods in a replicable realdata example. The gender wage gap, i.e. the relative difference in wages emerging between male and female employees, is a central topic in labor economics. Frequently, studies report an average gender wage gap estimate, i.e. how much less women earn as compared to men in terms of average (i.e. mean or median) wages. However, it might be helpful for policy makers to gain a more detailed impression on the gender wage gap and to assess whether and to what extent the gender wage gap differs across individuals. A simplistic although frequently encountered approach to assess the wage gap heterogeneity is to compare the relative wage gap across female and male employees in subgroups that are defined in terms of a particular characteristic, e.g. industry. It is obvious that this approach neglects the role of other characteristics relevant for the wage income and, hence, the wage gap, e.g. educational background, experience etc. As an examplary illustration, the gender gap in average (mean) earnings in 12 different industrial categories is presented in Figure 2, suggesting that the wage gap differs by large across the subgroups. In contrast to the approach that is simply based on descriptive statistics, an extended regression equation including interaction terms of the gender indicator with observable characteristics is able to take the role of other labor market characteristics into account and, hence, allows to give insights on the determinants of the gender wage gap. As the regression approach leads to a large number of coefficients that are tested simultaneously, an appropriate multiple testing adjustment is required. Thus, the heterogeneous gender wage gap example is used to demonstrate the adjustment methods provided in the R package hdm. The presented example is an illustration of the more extensive analysis of a heterogeneous gender wage gap in Bach, Chernozhukov, and Spindler (2018).
7.1. Data Preparation
The examplary data is a subsample of the 2016 American Community Survey.
# Load the hdm package rm(list=ls()) library(hdm) # load the ACS data load("ACS2016_gender.rda")
7.2. Valid Simultaneous Inference on a Heterogeneous Gender Wage Gap
In order to answer the question whether the gender wage gap differs according to the observable characteristics of female employees in a valid way, it is necessary to account for regressors that affect women’s job market environment. In the example, variables on marriage, the presence of own children, geographic variation, job characteristics (industry, occupation, hours worked), human capital variables (years of education, experience (squared)), and field of degree are considered. A wage regression is set up that includes all twoway interactions of female with the available characteristics in addition to the baseline regressors, .
(5) 
The analysis begins with constructing a model matrix that implements the regression relationship of interest. \MakeFramed\@setminipage
# Weekly log wages as outcome variable y = data$lwwage # Model Matrix containing 2way interaction of female # with relevant regressors + covariates X = model.matrix( ~ 1 + fem + fem:(ind + occ + hw + deg + yos + exp + exp2 + married + chld19 + region + msa ) + (married + chld19 + region + msa + ind + occ + hw + deg + yos + exp + exp2), data = data) # Exclude the constant variables X = X[,which(apply(X, 2, var)!=0)] dim(X)
## [1] 70473 123
Accordingly, the regression model considered has regressors in total and is estimated on the basis of observations. The wage Equation 5 is estimated with the lasso with the theorybased choice of the penalty term (“rlasso”). To answer the question whether the included regressors have any explanatory power for the outcome variable, the global test of overall significance is run by calling summary() on the output object of the rlasso() function. \MakeFramed\@setminipage
# Estimate rlasso lasso1 = rlasso(X,y) # Run test summary(lasso1, all = FALSE) # Output shifted to the Appendix.
The hypotheses that all coefficients in the model are zero can be rejected at all common significance levels. The main objective of the analysis is to estimate the magnitude of the effects associated with the gender interactions and to assess whether these effects are jointly significantly different from zero. The socalled “target” variables, in total regressors, are specified in the index option of the rlassoEffects() function. Hence, it is necessary to indicate the columns of the created model matrix that correspond to interactions with the female dummy.
# Construct index for gender variable and interactions (target parameters) index.female = grep("fem", colnames(X)) K = length(index.female) # Perform inference on target coefficients # estmation might take some time (10 minutes) effects = rlassoEffects(x=X,y=y, index = index.female) summary(effects) # Output shifted to the Appendix.
The output presented in the appendix shows the estimated coefficients together with tstatistics and unadjusted pvalues. The next step is to adjust the pvalues for multiple testing. Starting with Version 0.3.0, the hdm offers the S3 method p_adjust() for objects inheriting from classes rlassoEffects and lm. It hosts the correction methods from the function p.adjust() of the stats Package, e.g. Bonferroni, BonferroniHolm, BenjaminiHochberg as well as no correction at all. First, the naive approach without a multiple testing correction given a significance level is presented. Table 2 shows the number of rejections at significance levels . \MakeFramed\@setminipage
# Extract (unadjusted) pvalues pvals.unadj = p˙adjust(effects, method = "none") # Coefficients and Pvals head(pvals.unadj)
## Estimate. pval ## femTRUE 0.0799 0.20800 ## femTRUE:indAGRI 0.1307 0.00349 ## femTRUE:indCONSTR 0.0521 0.15226 ## femTRUE:indMANUF 0.0097 0.70716 ## femTRUE:indTRANS 0.0403 0.15392 ## femTRUE:indRETAIL 0.0264 0.30256
Thus, without a correction for multiple testing, , , and hypotheses could be rejected given the significance levels of 1%, 5% and 10%. If one returns to the initial example on variation by industry, one would find significant variation of the wage gap by industry (as compared to the baseline category “Wholesale”) in categories, namely “Agriculture”, “Finance, Insurance, and Real Estate” and “Professional and Related Services” at a significance level of 0.1.
Second, classical correction methods like the Bonferroni, BonferroniHolm, and the BenjaminiHochberg adjustments are used to account for testing the hypotheses at the same time.
Method  Significance Level  

0.01  0.05  0.10  
Naive  
BenjaminiHochberg  
Bonferroni  
Holm  
Joint Confidence Region  
RomanoWolf 
# Bonferroni pvals.bonf = p˙adjust(effects, method = "bonferroni") # Holm pvals.holm = p˙adjust(effects, method = "holm") head(pvals.bonf)
## Estimate. pval ## femTRUE 0.0799 1.000 ## femTRUE:indAGRI 0.1307 0.217 ## femTRUE:indCONSTR 0.0521 1.000 ## femTRUE:indMANUF 0.0097 1.000 ## femTRUE:indTRANS 0.0403 1.000 ## femTRUE:indRETAIL 0.0264 1.000
head(pvals.holm)
## Estimate. pval ## femTRUE 0.0799 1.000 ## femTRUE:indAGRI 0.1307 0.178 ## femTRUE:indCONSTR 0.0521 1.000 ## femTRUE:indMANUF 0.0097 1.000 ## femTRUE:indTRANS 0.0403 1.000 ## femTRUE:indRETAIL 0.0264 1.000
As a general improvement, the Holmcorrected pvalues are smaller or equal to those obtained from a Bonferroni adjustment. At significance levels 1%, 5% and 10%, it is possible to reject fewer hypotheses if pvalues are corrected for multiple testing.
According to the BenjaminiHochberg (BH) correction (Benjamini and Hochberg, 1995) of pvalues that achieves control of the FDR it is possible to reject and null hypotheses at specified values of the FDR, , at 0.01, 0.05 and 0.1. \MakeFramed\@setminipage
pvals.BH = p˙adjust(effects, method = "BH") head(pvals.BH)
## Estimate. pval ## femTRUE 0.0799 0.3778 ## femTRUE:indAGRI 0.1307 0.0174 ## femTRUE:indCONSTR 0.0521 0.3078 ## femTRUE:indMANUF 0.0097 0.8412 ## femTRUE:indTRANS 0.0403 0.3078 ## femTRUE:indRETAIL 0.0264 0.4690
Regarding variation by industry, the Bonferroni and Holm procedure find a significantly different wage gap (at the 10% significance level) only for industry “Finance, Insurance, and Real Estate” whereas the BenjaminiHochberg correction with leads to the same conclusion as obtained without any adjustment. These results can now be compared to results obtained from joint significance test with and without the RomanoWolf stepdown procedure. We can start with construction of a joint confidence region for the coefficients using the option joint=T in the confint() function for objects of the class rlassoEffects. \MakeFramed\@setminipage
# valid joint 0.95confidence interval alpha = 0.1 CI = confint(effects, level = 1alpha, joint=T) head(CI)
## 5 % 95 % ## femTRUE 0.2758 0.1160 ## femTRUE:indAGRI 0.2923 0.0310 ## femTRUE:indCONSTR 0.1632 0.0589 ## femTRUE:indMANUF 0.0912 0.0718 ## femTRUE:indTRANS 0.1305 0.0499 ## femTRUE:indRETAIL 0.0582 0.1110
The results from the confidence intervals are equivalent to a test at significance level so that hypotheses can be rejected. However, the RomanoWolf stepdown procedure allows to increase power. For instance with a significance level of 5%, the stepdown correction allows to reject one hypothesis more than with the joint confidence interval. The pvalues can be adjusted according to the RomanoWolfstepdown algorithm by setting the option method = “RW” (default) of the p_adjust() call. The number of repetitions can be varied by specifying the option B, by default. \MakeFramed\@setminipage
# RomanoWolf stepdown adjustment pvals.RW = p˙adjust(effects, method = "RW", B=1000) head(pvals.RW)
## Estimate. pval ## femTRUE 0.0799 0.997 ## femTRUE:indAGRI 0.1307 0.369 ## femTRUE:indCONSTR 0.0521 0.986 ## femTRUE:indMANUF 0.0097 1.000 ## femTRUE:indTRANS 0.0403 0.992 ## femTRUE:indRETAIL 0.0264 0.999
Using the joint confidence interval and the RomanoWolf stepdown adjustment allows to reject more hypotheses than with traditional methods at significance levels 5% and 10%. Hence, taking into account the dependence of test statistics is beneficial in terms of power in the realdata example.
8. Conclusion
The previous sections provide a short overview on important methods for multiple testing adjustment in a highdimensional regression setting. Throughout the paper, our intention was to present the concepts and the necessity of a multiple adjustment in a comprehensive way. Similarly, the tools for valid simultaneous inference in highdimensional settings that are available in the R package hdm are intended to be easy to use in empirical applications. The demonstration of the methods in the realdata example are intended to motivate applied statisticians to (i) use modern statistical methods for highdimensional regression, i.e. the lasso, and (ii) to appropriately adjust if multiple hypotheses are tested simultaneously. Since the hdm provides userfriendly adjustment methods for objects of the S3 class lm, we hope that uses will use the correction methods more frequently even in classical least squares regression.
9. Appendix
9.1. Details of Simulation Study
The simulation study was implemented using the statistical software R (R version 3.3.3 (20170306)) on a x86_64redhatlinuxgnu (64 bit) platform. For the sake of replicability, the R code for the simulation study is available as supplemental material on the website https://www.bwl.unihamburg.de/en/statistik/forschung/softwareunddaten.html. In the lasso regression, the theorybased and datadependent choice of the penalty term for homoscedastic errors is implemented (Chernozhukov, Hansen, and Spindler, 2016).
9.2. Additional Simulation Results
We additionally provide the simulation results if significance tests are based on a level of or .
Naive  BenjaminiHochberg  Bonferroni  BonferroniHolm  RomanoWolf  
Correct Rejections  
Incorrect Rejections  
Familywise Error Rate  
False Discovery Rate  
Correct Rejections  
Incorrect Rejections  
Familywise Error Rate  
False Discovery Rate  
The Table presents the average number of correct or incorrect rejections at a significance level and the FWER over all repetitions. For the BenjaminiHochberg adjustment is chosen. Standard deviation in parantheses.
Naive  BenjaminiHochberg  Bonferroni  BonferroniHolm  RomanoWolf  
Correct Rejections  
Incorrect Rejections  
Familywise Error Rate  
False Discovery Rate  
Correct Rejections  
Incorrect Rejections  
Familywise Error Rate  
False Discovery Rate  
The Table presents the average number of correct or incorrect rejections at a significance level and the FWER over all repetitions. For the BenjaminiHochberg adjustment is chosen. Standard deviation in parantheses.
9.3. Additional Results, RealData Example
For the sake of brevity, the full summary outputs of the global significance test and the joint significance tes for the target coefficients with lasso and double selection are omitted in the main text. For completeness, the output is presented in the following.
lasso1 = rlasso(X,y) # Run test summary(lasso1, all = FALSE)
## ## Call: ## rlasso.default(x = X, y = y) ## ## PostLasso Estimation: TRUE ## ## Total number of variables: 123 ## Number of selected variables: 58 ## ## Residuals: ## Min 1Q Median 3Q Max ## 2.50948 0.27489 0.00787 0.25579 2.66745 ## ## Estimate ## (Intercept) 4.70 ## femTRUE 0.01 ## married 0.12 ## chld19 0.09 ## regionMiddle Atlantic Division 0.07 ## regionEast North Central Div. 0.04 ## regionWest North Central Div. 0.07 ## regionEast South Central Div. 0.12 ## regionMountain Division 0.06 ## regionPacific Division 0.13 ## msa 0.18 ## indAGRI 0.22 ## indMANUF 0.08 ## indTRANS 0.09 ## indRETAIL 0.14 ## indFINANCE 0.18 ## indBUISREPSERV 0.12 ## indENTER 0.08 ## indPROFE 0.06 ## indADMIN 0.05 ## occBus Operat Spec 0.04 ## occComput/Math 0.02 ## occLife/Physical/Soc Sci. 0.18 ## occComm/Soc Serv 0.34 ## occLegal 0.11 ## occEduc/Training/Libr 0.35 ## occArts/Design/Entert/Sports/Media 0.17 ## occHealthc Pract/Technic 0.10 ## occProtect Serv 0.07 ## occOffice/Administr Supp 0.32 ## occProd 0.32 ## hw50to59 0.21 ## hw60to69 0.28 ## hw70plus 0.25 ## degComp/Inform Sci 0.16 ## degEngin 0.21 ## degEnglish/Lit/Compos 0.06 ## degLib Arts/Hum 0.06 ## degBio/Life Sci 0.04 ## degMath/Stats 0.16 ## degPhys Fit/Parks/Recr/Leis 0.08 ## degPsych 0.04 ## degCrim Just/Fire Prot 0.04 ## degPubl Aff/Policy/Soc Wo 0.05 ## degSoc Sci 0.08 ## degFine Arts 0.08 ## degBus 0.08 ## degHist 0.06 ## yos 0.11 ## exp 0.03 ## femTRUE:married 0.05 ## femTRUE:chld19 0.04 ## femTRUE:regionMountain Division 0.01 ## femTRUE:indAGRI 0.07 ## femTRUE:indFINANCE 0.13 ## femTRUE:occArchit/Engin 0.03 ## femTRUE:occOffice/Administr Supp 0.02 ## femTRUE:degPubl Aff/Policy/Soc Wo 0.01 ## femTRUE:exp2 0.02 ## ## Residual standard error: 0.463 ## Multiple Rsquared: 0.391 ## Adjusted Rsquared: 0.39 ## Joint significance test: ## the sup score statistic for joint significance test is 280 with a pvalue of 0
# Summary of significance test summary(effects)
## [1] "Estimates and significance testing of the effect of target variables" ## Estimate. Std. Error t value ## femTRUE 0.079926 0.063479 1.26 ## femTRUE:indAGRI 0.130650 0.044735 2.92 ## femTRUE:indCONSTR 0.052142 0.036422 1.43 ## femTRUE:indMANUF 0.009699 0.025818 0.38 ## femTRUE:indTRANS 0.040300 0.028265 1.43 ## femTRUE:indRETAIL 0.026415 0.025622 1.03 ## femTRUE:indFINANCE 0.135372 0.024672 5.49 ## femTRUE:indBUISREPSERV 0.035614 0.026010 1.37 ## femTRUE:indPERSON 0.047695 0.039551 1.21 ## femTRUE:indENTER 0.050513 0.039630 1.27 ## femTRUE:indPROFE 0.054828 0.024235 2.26 ## femTRUE:indADMIN 0.011253 0.028358 0.40 ## femTRUE:occBus Operat Spec 0.025938 0.016180 1.60 ## femTRUE:occFinanc Spec 0.048161 0.016859 2.86 ## femTRUE:occComput/Math 0.006376 0.018829 0.34 ## femTRUE:occArchit/Engin 0.043310 0.027374 1.58 ## femTRUE:occLife/Physical/Soc Sci. 0.061113 0.024642 2.48 ## femTRUE:occComm/Soc Serv 0.156487 0.024372 6.42 ## femTRUE:occLegal 0.009461 0.021722 0.44 ## femTRUE:occEduc/Training/Libr 0.115386 0.015943 7.24 ## femTRUE:occArts/Design/Entert/Sports/Media 0.046201 0.019747 2.34 ## femTRUE:occHealthc Pract/Technic 0.006586 0.017898 0.37 ## femTRUE:occProtect Serv 0.003907 0.036230 0.11 ## femTRUE:occSales 0.018405 0.015144 1.22 ## femTRUE:occOffice/Administr Supp 0.000996 0.015553 0.06 ## femTRUE:occProd 0.001608 0.036690 0.04 ## femTRUE:hw40to49 0.053407 0.017716 3.01 ## femTRUE:hw50to59 0.071987 0.019040 3.78 ## femTRUE:hw60to69 0.123902 0.022724 5.45 ## femTRUE:hw70plus 0.202611 0.031436 6.45 ## femTRUE:degAgri 0.008232 0.035918 0.23 ## femTRUE:degComm 0.039552 0.021445 1.84 ## femTRUE:degComp/Inform Sci 0.080594 0.030006 2.69 ## femTRUE:degEngin 0.007807 0.025530 0.31 ## femTRUE:degEnglish/Lit/Compos 0.019353 0.024121 0.80 ## femTRUE:degLib Arts/Hum 0.046775 0.037581 1.24 ## femTRUE:degBio/Life Sci 0.037057 0.022191 1.67 ## femTRUE:degMath/Stats 0.063706 0.034123 1.87 ## femTRUE:degPhys Fit/Parks/Recr/Leis 0.022403 0.030424 0.74 ## femTRUE:degPhys Sci 0.075693 0.026715 2.83 ## femTRUE:degPsych 0.007874 0.023112 0.34 ## femTRUE:degCrim Just/Fire Prot 0.085437 0.029390 2.91 ## femTRUE:degPubl Aff/Policy/Soc Wo 0.017049 0.041864 0.41 ## femTRUE:degSoc Sci 0.052175 0.019783 2.64 ## femTRUE:degFine Arts 0.017200 0.022201 0.77 ## femTRUE:degMed/Hlth Sci Serv 0.026064 0.025231 1.03 ## femTRUE:degBus 0.020136 0.017776 1.13 ## femTRUE:degHist 0.071948 0.026570 2.71 ## femTRUE:yos 0.005787 0.003359 1.72 ## femTRUE:exp 0.001838 0.003762 0.49 ## femTRUE:exp2 0.008556 0.009151 0.93 ## femTRUE:married 0.050840 0.009097 5.59 ## femTRUE:chld19 0.051834 0.009268 5.59 ## femTRUE:regionMiddle Atlantic Division 0.024007 0.015626 1.54 ## femTRUE:regionEast North Central Div. 0.009862 0.015658 0.63 ## femTRUE:regionWest North Central Div. 0.011769 0.018628 0.63 ## femTRUE:regionSouth Atlantic Division 0.001011 0.015414 0.07 ## femTRUE:regionEast South Central Div. 0.006562 0.020696 0.32 ## femTRUE:regionWest South Central Div. 0.072252 0.017598 4.11 ## femTRUE:regionMountain Division 0.028934 0.019022 1.52 ## femTRUE:regionPacific Division 0.057981 0.016211 3.58 ## femTRUE:msa 0.002253 0.014194 0.16 ## Pr(>t) ## femTRUE 0.20800 ## femTRUE:indAGRI 0.00349 ** ## femTRUE:indCONSTR 0.15226 ## femTRUE:indMANUF 0.70716 ## femTRUE:indTRANS 0.15392 ## femTRUE:indRETAIL 0.30256 ## femTRUE:indFINANCE 4.1e08 *** ## femTRUE:indBUISREPSERV 0.17091 ## femTRUE:indPERSON 0.22785 ## femTRUE:indENTER 0.20245 ## femTRUE:indPROFE 0.02368 * ## femTRUE:indADMIN 0.69150 ## femTRUE:occBus Operat Spec 0.10892 ## femTRUE:occFinanc Spec 0.00428 ** ## femTRUE:occComput/Math 0.73491 ## femTRUE:occArchit/Engin 0.11361 ## femTRUE:occLife/Physical/Soc Sci. 0.01314 * ## femTRUE:occComm/Soc Serv 1.4e10 *** ## femTRUE:occLegal 0.66316 ## femTRUE:occEduc/Training/Libr 4.6e13 *** ## femTRUE:occArts/Design/Entert/Sports/Media 0.01930 * ## femTRUE:occHealthc Pract/Technic 0.71291 ## femTRUE:occProtect Serv 0.91413 ## femTRUE:occSales 0.22424 ## femTRUE:occOffice/Administr Supp 0.94896 ## femTRUE:occProd 0.96504 ## femTRUE:hw40to49 0.00257 ** ## femTRUE:hw50to59 0.00016 *** ## femTRUE:hw60to69 5.0e08 *** ## femTRUE:hw70plus 1.2e10 *** ## femTRUE:degAgri 0.81872 ## femTRUE:degComm 0.06514 . ## femTRUE:degComp/Inform Sci 0.00723 ** ## femTRUE:degEngin 0.75976 ## femTRUE:degEnglish/Lit/Compos 0.42237 ## femTRUE:degLib Arts/Hum 0.21327 ## femTRUE:degBio/Life Sci 0.09494 . ## femTRUE:degMath/Stats 0.06191 . ## femTRUE:degPhys Fit/Parks/Recr/Leis 0.46152 ## femTRUE:degPhys Sci 0.00461 ** ## femTRUE:degPsych 0.73334 ## femTRUE:degCrim Just/Fire Prot 0.00365 ** ## femTRUE:degPubl Aff/Policy/Soc Wo 0.68383 ## femTRUE:degSoc Sci 0.00836 ** ## femTRUE:degFine Arts 0.43848 ## femTRUE:degMed/Hlth Sci Serv 0.30160 ## femTRUE:degBus 0.25730 ## femTRUE:degHist 0.00677 ** ## femTRUE:yos 0.08494 . ## femTRUE:exp 0.62519 ## femTRUE:exp2 0.34980 ## femTRUE:married 2.3e08 *** ## femTRUE:chld19 2.2e08 *** ## femTRUE:regionMiddle Atlantic Division 0.12446 ## femTRUE:regionEast North Central Div. 0.52878 ## femTRUE:regionWest North Central Div. 0.52751 ## femTRUE:regionSouth Atlantic Division 0.94770 ## femTRUE:regionEast South Central Div. 0.75119 ## femTRUE:regionWest South Central Div. 4.0e05 *** ## femTRUE:regionMountain Division 0.12824 ## femTRUE:regionPacific Division 0.00035 *** ## femTRUE:msa 0.87386 ##  ## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Footnotes
 The setting with a potentially infinitedimensional vector of target coefficients is theoretically considered in Belloni, Chernozhukov, Chetverikov, and Wei (forthcoming) and Belloni, Chernozhukov, and Kato (2015).
 More details on implementation of the simulation study are provided in the appendix and the supplemental material available at https://www.bwl.unihamburg.de/en/statistik/forschung/softwareunddaten.html.
 It can be replicated with the documentation “Appendix: Replicable Data Example” that is available online. Further information and the code is provided at https://www.bwl.unihamburg.de/en/statistik/forschung/softwareunddaten.html.
References
 Bach, P., V. Chernozhukov, and M. Spindler (2018): “An econometric (re)analysis of the gender wage gap in a highdimensional setting,” Work in Progress.
 Belloni, A., V. Chernozhukov, D. Chetverikov, and Y. Wei (forthcoming): “Uniformly valid postregularization confidence regions for many functional parameters in zestimation framework,” Annals of Statistics, arXiv preprint arXiv:1512.07619.
 Belloni, A., V. Chernozhukov, and C. Hansen (2014): “Inference on Treatment Effects After Selection Amongst HighDimensional Controls,” Review of Economic Studies, 81, 608–650, ArXiv, 2011.
 Belloni, A., V. Chernozhukov, and K. Kato (2015): “Uniform postselection inference for least absolute deviation regression and other Zestimation problems,” Biometrika, 102(1), 77–94.
 Benjamini, Y., and Y. Hochberg (1995): “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” Journal of the royal statistical society. Series B (Methodological), pp. 289–300.
 Chernozhukov, V., D. Chetverikov, and K. Kato (2013): “Gaussian approximations and multiplier bootstrap for maxima of sums of highdimensional random vectors,” Annals of Statistics, 41, 2786–2819.
 Chernozhukov, V., C. Hansen, and M. Spindler (2016): “hdm: HighDimensional Metrics,” The R Journal, 8(2), 185–199.
 Holm, S. (1979): “A simple sequentially rejective multiple test procedure,” Scandinavian journal of statistics, pp. 65–70.
 List, J. A., A. M. Shaikh, and Y. Xu (2016): “Multiple Hypothesis Testing in Experimental Economics,” Working Paper 21875, National Bureau of Economic Research.
 Romano, J. P., and M. Wolf (2005a): “Exact and Approximate Stepdown Methods for Multiple Hypothesis Testing,” Journal of the American Statistical Association, 100(469), 94–108.
 Romano, J. P., and M. Wolf (2005b): “Stepwise Multiple Testing as Formalized Data Snooping,” Econometrica, 73(4), 1237–1282.
 (2016): “Efficient computation of adjusted pvalues for resamplingbased stepdown multiple testing,” Statistics & Probability Letters, 113, 38 – 40.