1 INTRODUCTION
###### Abstract

The Model-X knockoff procedure has recently emerged as a powerful approach for feature selection with statistical guarantees. The advantage of knockoff is that if we have a good model of the features , then we can identify salient features without knowing anything about how the outcome depends on . An important drawback of knockoffs is its instability: running the procedure twice can result in very different selected features, potentially leading to different conclusions. Addressing this instability is critical for obtaining reproducible and robust results. Here we present a generalization of the knockoff procedure that we call simultaneous multi-knockoffs. We show that multi-knockoff guarantees false discovery rate (FDR) control, and is substantially more stable and powerful compared to the standard (single) knockoff. Moreover we propose a new algorithm based on entropy maximization for generating Gaussian multi-knockoffs. We validate the improved stability and power of multi-knockoffs in systematic experiments. We also illustrate how multi-knockoffs can improve the accuracy of detecting genetic mutations that are causally linked to phenotypes.

Improving the Stability of the Knockoff Procedure: Multiple Simultaneous Knockoffs and Entropy Maximization

Jaime Roquero Gimenez                        James Zou

Department of Statistics Stanford University Stanford, CA 94305 roquero@stanford.edu                        Department of Biomedical Data Science Stanford University Stanford, CA 94305 jamesz@stanford.edu

## 1 Introduction

In many machine learning and statistics settings, we have a supervised learning problem where the outcome depends on a subset of the features , potentially in complex ways, and we would like to identify these salient features. Take medical genetics as an example, the features are the genotypes at variants in the genome, and is a binary indicator for the presence/absence of disease. The true model could be that , where is the salient subset of the variants, and is some noise/randomness. It is tremendously important to identify which feature/variant is in .

If we assume that is simple, say a linear function, then we might hope to use the fitted parameter of a model to select salient features. For example, we might fit a Generalized Linear Models (GLM) on with LASSO penalty to promote sparsity in the coefficients (Tibshirani, 1996), and subsequently select those features with non-zero coefficients. Step-wise procedures where we sequentially modify the model is another way of doing feature selection (Mallows, 1973).

A clear limitation of this parametric approach is the need to to have a good model for . For the genetics example, there is no great model. Moreover the standard feature selection methods are all plagued by correlations between the features: a feature that is not really relevant for the outcome, i.e. not in , can be selected by LASSO or Step-wise procedure, because it is correlated with relevant features. In these settings we usually lack statistical guarantees on the validity of the selected features. Finally, even procedures with statistical guarantees usually depend on having valid -values, which are based on a correct modeling of and (sometimes) assume some asymptotic regime. However there are many common settings where these assumptions fail and we cannot perform inference based on those -values (Sur and Candès, 2018).

A powerful new approach called Model-X knockoff procedure (Candès et al., 2018) has recently emerged to deal with these issues. This method introduces a new paradigm: we no longer assume any model for the distribution of in order to do feature selection (and therefore do not compute -values), but we assume that we have full knowledge of the feature distribution – or at least we can accurately model it, though there are some robustness results (Barber et al., 2018). This knowledge of the ground truth allows us to sample new knockoff variables satisfying some precise distributional conditions. Although we make no assumption on , we can use the knockoff procedure to select features while controlling the False Discovery Rate (FDR), which is the average proportion of the selected features that are not in .

One major obstacle for the widespread application of knockoffs is its instability. The entire procedure sensitively depends on the knockoff sample , which is random. Therefore, running the knockoff procedure twice may lead to very different selected sets of features. Our analysis in Section 4 shows that instability is especially severe when the number of salient features (i.e. the size of ) is small, as is often the case. Also, whenever the number of features is very large, previous methods for generating knockoffs failed to consistently generate good samples for , leading to inconsistent selection sets if several runs of the procedure were done simultaneously. Power also decreases drastically under those previous methods. This makes it challenging to confirm the selected variants in a replication experiment. Addressing the instability of knockoffs is therefore an important problem.

##### Our Contributions.

We generalize the standard (single) knockoff procedure to simultaneous multiple knockoffs (or multi-knockoff for short). Our multi-knockoff procedure guarantees FDR control and has better statistical properties than the original knockoff, especially when the number of salient features is small. We propose a new entropy maximization algorithm to sample Gaussian multi-knockoffs. Our systematic experiments demonstrate that multi-knockoff is more stable and more powerful than the original (single) knockoff. Moreover we illustrate how multi-knockoff can improve the ability to select causal variants in Genome Wide Association Studies (GWAS).

## 2 Background on Knockoffs

We begin by introducing the usual setting of feature selection procedures. We consider the data as a sequence of i.i.d. samples from some unknown joint distribution: , . We then define the set of null features by if and only if (where the subscript indicates all variables except the th and bold letters indicate vectors). The non-null features, also called alternatives, are important because they capture the truly salient effects and the goal of selection procedures is to identify them. Running the knockoff procedure gives us a selected set , while controlling for False Discovery Rate (FDR), which stands for the expected rate of false discoveries: . The ratio is also called False Discovery Proportion (FDP).

Assuming we know the ground truth for the distribution , the first step of the standard knockoff procedure is to obtain a knockoff sample that satisfies the following conditions:

###### Definition 2.1 (Knockoff sample).

A knockoff sample of a d-dimensional random variable is a d-dimensional random variable such that two properties are satisfied:

• Conditional independence:

• Exchangeability :

 [X,~X]swap(S)\lx@stackreld=[X,~X]∀S⊂{1,…,d}

where stands for equality in distribution and refers to the vector where the original th feature and the th knockoff feature have been swapped whenever .

The first condition is immediately satisfied as long as knockoffs are sampled conditionally on the sample without considering any information about (which will be the case in our sampling methods so we will not mention it again). The second condition ensures that the knockoff of each feature is sufficiently similar to the original feature in order to be a good comparison baseline. We also denote by the matrix where we stack the i.i.d. -dimensional samples into one matrix (this is acceptable as the i.i.d. assumption allows for all sampling procedures to be done sample-wise).

The next step of the knockoff procedure constructs what we call feature statistics , such that a high value for is evidence that the th feature is non-null. Feature statistics described in Candès et al. (2018) depend only on such that for each we can write for some function . The only restriction these statistics must satisfy is the flip-sign property: swapping the th feature and its corresponding knockoff feature should flip the sign of the statistic while leaving other feature statistics unchanged. More formally, for a subset of features, denoting the data matrix where the original th variable and its corresponding knockoff have been swapped whenever , we have:

 wj([X,~X]swap(S),Y)={−wj([X,~X],Y),if% j∈S.wj([X,~X],Y),% otherwise.

As suggested in Candès et al. (2018), the choice of feature statistics can be done in two steps: first, find a statistic where each coordinate corresponds to the “importance” — hence we will call them importance scores — of the corresponding feature (either original or knockoff). For example, would be the absolute value of the regression coefficient of the th feature.

After obtaining the importance score for the original and knockoff feature, we take the difference to compute the feature statistic . The intuition is that importance scores of knockoffs serve as a control, a larger importance score of the th feature compared to that of its knockoff implies a larger positive (and therefore is evidence against the null). Given some target FDR level that we fix in advance, we define the selection set based on the following threshold :

 ^τ= min{t>0:1+#{j:Wj≤−t}#{j:Wj≥t}∨1≤q} (1) ^S= {j:Wj≥^τ} (2)

According to Theorem 3.4 in Candès et al. (2018), this procedure controls FDR at level (actually called Knockoff + procedure). The mechanism behind this procedure is the Selective SeqStep+ introduced in Barber et al. (2015). The intuition is that we try to maximize the number of selections while bounding by an upwardly biased estimate of the FDP, which is the fraction:

 ˆFDPKN+=1+#{j:Wj≤−t}#{j:Wj≥t}∨1 (3)

The added constant in the numerator is called the “offset”, equal to one in our case; a different FDP estimate with offset equal to 0 leads to a slightly different procedure that controls a modified, less stringent version of the FDR.

##### Instability of Knockoffs

If we generate multiple replication datasets —multiple versions of , each of which is sampled from the common —then the knockoff procedure guarantees that on average, the proportion of false discoveries is less than the desired threshold. However for a particular dataset , the selected features could be very different from that of another sample, as we empirically demonstrate in Section 4. There are settings where in half of the experiments, the knockoff procedure selects a large number of features, and it selects zero features in the other half of the times. This instability is a major issue if we want to ensure that the discoveries from data are reproducible.

The instability in the selected features is partially due to the randomness in and in the knockoff sample . The knockoff procedure, equations 1-3, is also sensitive to the sample . The knockoff selection set is based on a conservative estimate of the FDP given by equation (3). The threshold that determines the selected set requires such FDP estimate to be below some target FDR level set in advance, which in turn requires us to select at least features due to the presence of the offset in the numerator. This requirement is a great source of instability of the knockoff procedure: whenever the number of non-nulls is close to that threshold value , we can end up either selecting a fairly large number of non-nulls, or not selecting any, even when the signal is strong. Our goal is to develop a new knockoff procedure which controls FDR and is more stable. We achieve this by introducing simultaneous multiple knockoffs, called multi-knockoffs for short, which extends the standard knockoff procedure.

## 3 Simultaneous Multiple Knockoffs

##### A Naive Flawed Approach to Multi-knockoffs

One approach to improve the stability of the selected features is to run the standard knockoff procedure multiple times in parallel and to take some type of consensus. This approach is flawed and does not control FDR. The reason is that by running knockoff multiple times in parallel, the symmetry between and the knockoff samples is broken. To maintain symmetry and guarantee FDR, we need to simultaneously sample multiple knockoffs. We will make this more precise now.

### 3.1 Multi-knockoff Selection Procedure

Fix an positive integer , the multi-knockoff parameter (the usual single knockoff case corresponds to , for which all of our results are also valid). The goal is to extend the previous distributional properties of knockoffs of Definition 2.1 to settings where we simultaneously sample knockoff copies of the same -valued dataset (where, again, denotes either the -valued random variable when making distributional statements or a matrix when referring to the feature set of i.i.d. samples). As in the single knockoff setting, we can define an equivalence notion and notation for swapping multiple vectors. Instead of defining for some subset of indices, we consider a collection of permutations over the set of integers , one for each of the initial dimensions. Whenever we will use multi-knockoffs, we will index the original features by . We define the permuted vector , where for all , . Each permutes the features corresponding to the th dimension of each vector , leaving the other dimensions unchanged. Once this generalized swap notion is defined, we extend the exchangeability property based on the invariance of the joint distribution to such transformations.

###### Definition 3.1.

We say that the concatenated vectors satisfy the extended exchangeability property if the equality in distribution holds for any as defined above.

###### Definition 3.2.

We say that is a multi-knockoff vector of (or that they are multi-knockoffs of ) if the joint vector satisfies extended exchangeability and the conditional independence requirement .

We will later on give examples of how to generate such multi-knockoffs. We state a lemma that is a direct generalization of Lemma 3.2 in Candès et al. (2018) and give a proof in Appendix B.1.

###### Lemma 3.1.

Consider a subset of nulls . Define a generalized swap as above, where is the identity permutation whenever , and otherwise can be any permutation. Then we have the following equality in distribution for a multi-knockoff:

 ([X0,X1,…,Xκ],Y)\lx@stackreld=([X0,X1,…,Xκ]swap(σ),Y)

Once the multi-knockoff vector is sampled, consider the joint vector which takes values in . As for the single knockoff setting, we construct importance scores where each is a -dimensional vector. The importance scores are associated to the features in the following sense: if we generate importance scores on a swapped joint vector (as in Definition 3.1), then we obtain the same result as if we had swapped the importance scores of the initial joint vector. That is, the function defining the importance scores must satisfy . Common examples of such constructions are the absolute values of the coefficients associated to each feature when regressing on , with eventually a penalty for sparsity. Denoting an ordered sequence by indexing in parenthesis (i.e. for any real-valued sequence , we have ), we can define feature-wise ordered importance scores for each feature . For all , define:

 κi=argmax0≤k≤κTkiτi=T(0)i−T(1)i

We no longer have the possibility of generating feature statistics by taking an antisymmetric function of the importance scores. The extension to multi-knockoffs is done through these newly defined variables by noticing the analogy that if (single knockoff), then and corresponds to the sign of ( if and only if ). In the single knockoff setting, the crucial distributional result is that, conditionally on , the signs of the null are i.i.d. flip coins. In the multi-knockoff case, the information encoded by the sign of is contained in , which indicates whether among a given dimension the original feature has a higher importance score than that of its knockoffs. In Appendix B.4 we provide a geometric explanation for such choices of . The crucial result is that null behave uniformly and independently in distribution and can be used to estimate the number of false discoveries.

###### Lemma 3.2.

The random variables are i.i.d. distributed uniformly on the set , and independent of the remaining variables , and of the feature-wise ordered importance scores . In particular, conditionally on the variables and , the random variables are i.i.d. distributed uniformly on .

We prove this lemma in Appendix B.2. Following the steps that build the knockoff procedure as a particular case of the SeqStep+ procedure, we construct the following threshold that defines the rejection set of our multi-knockoff procedure based on a FDP estimate .

Essentially, the multi-knockoff procedure returns the features where the original feature has higher importance score than any knockoffs ( ), and the gap with the 2nd largest importance score is above some threshold.

###### Proposition 3.3.

Fix a target FDR level . The procedure that selects the features in the set given by Algorithm 1 controls FDR at level .

We prove this result in Appendix B.3. One advantage of the multi-knockoff selection procedure lies on the new value of the offset parameter. By averaging over the multi-knockoffs, we are able to decrease the threshold of minimum number of rejections from to , leading to an improvement in power and stability. We call the detection threshold of the multi-knockoff. We experimentally confirm such results in Section 4.

### 3.2 Gaussian Multi-knockoffs Based on Entropy Maximization

Most of the research and applications have focused around generating standard knockoffs when comes from a multivariate Gaussian distribution , although more universal sampling algorithms exist, for which we provide in Appendix A a generalization to multi-knockoffs. Here we extend the existing procedures for Gaussian knockoffs to generate Gaussian multi-knockoffs for . A sufficient condition for to be a multi-knockoff vector is for to be jointly Gaussian such that: 1) all the has the same mean ; and 2) the covariance matrix has the form :

 Σκ=⎛⎜ ⎜ ⎜ ⎜ ⎜⎝ΣΣ−D…Σ−DΣ−DΣ…Σ−D⋮⋮⋱⋮Σ−DΣ−D…Σ⎞⎟ ⎟ ⎟ ⎟ ⎟⎠κ+1blocks

where is a diagonal matrix chosen so that is positive semi-definite to ensure that it is a valid covariance.

###### Proposition 3.4.

If has the mean and covariance structure given above, then is a valid multi-knockoff of .

If a diagonal term is zero, then for . This generates a valid multi-knockoff but it has no power to discover the th feature (regardless of whether it is null or non-null) since each multi-knockoff is indistinguishable from the original feature. The general intuition is that the more independent the knockoffs are from the original , the greater the power of discovering the non-null features (Candès et al., 2018). Therefore previous work for the standard single knockoff (corresponding to ) has focused on finding as large as possible in some sense, while maintaining the positive semi-definiteness of the covariance matrix.

To construct for Gaussian multi-knockoffs, we propose maximizing the entropy (which has a simple closed form for Gaussian distributions). This is equivalent in the single knockoff case to minimizing mutual information, as suggested in Candès et al. (2018). Indeed, , and do not depend on , hence the equivalence.

Entropy Knockoffs The diagonal matrix for constructing entropy multi-knockoffs is given by the following convex optimization problem:

 argmins−logdet(κ+1κΣ−D(s))−κd∑i=1log(si) subject to{κ+1κΣ−D(s)≻0si≥0∀i∈{1,…,d}

This optimization problem is a convex optimization problem, by noticing that is convex. It can be solved efficiently and our implementation is based on the Python package CVXOPT (M. S. Andersen and Vandenberghe, 2012). This knockoff construction method avoids solutions where diagonal terms are extremely close to , and we provide the following lower bound on the diagonal terms of :

 (λmin(s))1κ−2≤min1≤j≤dsj

where is the smallest (positive) eigenvalue of . The fact that we maximize the value of the determinant of such matrix implies that we avoid having any extremely small eigenvalue, hence this bound proves useful. We provide additional analysis on the formulation of entropy maximization as a convex optimization problem and prove this lower bound in Appendix A. Once the diagonal matrix is computed, we can generate the Gaussian multi-knockoffs by writing the conditional distribution given the original features .

 (X1,…,Xκ)|X0∼N((μ1,…,μκ),~Σ),where ⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩μi=DΣ−1μ+(Id−DΣ−1)X0∀1≤i≤κ~Σ=⎛⎜ ⎜ ⎜ ⎜⎝C…C−DC−D…C−D⋮⋮⋮C−D…C⎞⎟ ⎟ ⎟ ⎟⎠andC=2D−DΣ−1D

In the single knockoff setting (), the standard approach in literature is to solve a semidefinite program (SDP) to optimize . An alternative approach in the literature, called equicorrelation, is to restrict to be and solve for (where we consider as a correlation matrix, the goal being having the same correlation between original features and knockoffs in every dimension). We provide natural generalizations of the SDP and equicorrelation to optimize the matrix for multi-knockoffs (see Appendix A). The SDP knockoffs are based on an optimization problem that promotes sparsity: the fact that the objective function is a -distance between the identity matrix and the diagonal matrix implies that many diagonal terms of the optimal solution will be set almost equal to 0. In addition, Candès et al. (2018) noticed that the equicorrelated knockoff method tends to have very low power, as in high dimensions the diagonal terms of are proportional to the lowest eigenvalue of the covariance matrix , which in high-dimensional settings tends to be extremely small. Currently, SDP knockoffs are chosen by default.

We perform experiments to demonstrate the advantage of entropy over SDP and equicorrelation. We randomly generate correlation matrices with the function make_spd_matrix from the Python package scikit-learn (Pedregosa et al., 2011), and compute the diagonal matrix with the SDP and equicorrelated methods (the diagonal terms of are necessarily in the interval so that we can compare them across several runs). The lower a diagonal term is, the higher the correlation is between the original feature and its corresponding knockoff, and the less powerful is the knockoff. In Figure 1, we plot the density of the distribution of the logarithm of the diagonal terms of (that we approximate with the empirical distribution based on 50 runs).

We see that a significant proportion of diagonal terms based on the SDP construction have values extremely small, several orders of magnitude smaller than , which effectively behave as whenever we sample knockoffs and thus the corresponding features are essentially undiscoverable because their knockoffs are too similar. In Appendix C.1 we show more such comparisons of the distribution of the diagonal terms for varying dimensions of the correlation matrices and strength of the correlation. More concerning in the SDP construction case, the set of almost-zero diagonal terms is very unstable to perturbations in the correlation matrix. We report in Appendix C.2 the simulations proving the instability of such sets. The outcome is simple: the Jaccard similarity between two sets of SDP undiscoverable features generated from two empirical covariance matrices obtained from two batches of i.i.d. samples is on average very low. That is, two parallel runs of the knockoff procedure on different datasets coming from the exact same original distribution lead to different sets of undiscoverable features.

The equicorrelated construction does not suffer such issue, although the diagonal terms tend to be smaller compared to the SDP diagonal terms that are not almost . The SDP construction, due to its objective function, maximizes some diagonal terms at the expense of many others that are effectively set to 0, whereas the equicorrelated construction treats all coordinates more equally.

Finally, the entropy construction achieves the best performance: the diagonal terms it constructs are generally a couple of orders of magnitude higher than the equicorrelated method, and when comparing to SDP, the entropy construction does not generate almost-zero terms, so that it does not create any catastrophic knockoff. This means that entropy knockoffs will generally have higher power than those from the other methods, and that on top of that the whole procedure will be more stable: the selection set will not vary from one run to another because the set of undiscoverable features changes completely.

## 4 Experiments

We first conduct systematic experiments on synthetic data, so that we know the ground truth. For each experiment, we evaluate both the power and the stability of multi-knockoffs and the standard knockoff. Then we evaluate the performance of knockoff on a real set from Genome Wide Association Studies (GWAS).

### 4.1 Analyzing Improvements with Synthetic Data

We run simulations with synthetic data to confirm the threshold phenomenon and the improvements brought by multi-knockoffs. We randomly generate a feature matrix from a random covariance matrix, fix a number of non-nulls and create a binary response Y based on a logistic response of a weighted linear combination of the non-null features. Then, we sample multi-knockoffs with (single knockoff), and from that same and run the knockoff procedure based on a logistic regression to obtain a selection set, along with values for the power and an FDP. We then repeat this whole procedure 50 times to obtain estimates of the variance and get an empirical FDR. Knockoffs are generated based on the entropy construction to show that our multi-knockoff based improvement is made on top of the entropy improvement (which is only specific for Gaussian knockoffs).

We set our target FDR level at , and compare the single knockoff setting with multi-knockoffs (with ) over a range of number of non-null features. We report our results in Figure 2. We first point out that FDR is strongly controlled in all the experiments, as expected. We then estimate the threshold values for detection given by the estimates : for single knockoffs (), rejections for multi-knockoffs with , and whenever . By plotting power as a function of the number of non-nulls we clearly confirm this threshold behavior. All three settings attain a high power regime whenever the number of non-nulls exceeds the expected detection threshold. This shows the advantage of using multi-knockoffs in settings where we expect a priori the number of non-nulls to be small, and want to make sure that our method has a chance of selecting such small set of non-nulls. We also see there is a small price to pay for using multi-knockoffs. Whenever the number of non-null features increases so that we are beyond the detection thresholds, power decreases with the number of multi-knockoffs. This is due to the fact that sampling multi-knockoffs imposes a more stringent constraint to construct the knockoff conditional distribution (cf. Appendix A), and therefore multi-knockoffs can have slightly “worse” power as increases.

Finally, multi-knockoffs not only substantially improve the power of the procedure in settings with a small number of non-nulls; they also help stabilizing the procedure. We plot in Figure 3, as a function of the selection frequency, the density of the distribution of the non-nulls. In order to get the selection frequency, we run the same procedure as before, except this time we sample 200 (multi-)knockoffs out of one same and run the procedure each time keeping that same and the response we had generated. This allows us to compute how frequently each non-null is selected (by repeatedly sampling knockoffs from a same , FDR is no longer controlled. The point of these simulations is to stress the improvement in stability). For different settings where we vary the number of non-nulls, we see that the multi-knockoffs consistently reject a large fraction of the non-nulls whenever the threshold of detection is attained. In contrast, most non-null features are selected by the standard knockoff at low frequency, indicating instability.

The key aspect here is that the improvement in power whenever a threshold is crossed is not because of an overall increase in selection frequency of all the non-nulls: the densities in the above figures do not concentrate around intermediate selection frequency values. That is, multi-knockoffs do not increase the power by increasing instability.

### 4.2 Applications: GWAS Causal Variants

We apply our stabilizing procedures for fine mapping the causal variants in a genome wide association study (GWAS). These studies scan the whole genome in search of single nucleotide polymorphisms (SNPs) that are associated with a particular phenotype. In practice, they compute correlation scores for each SNP with respect to the phenotype, and select those beyond a certain significance threshold. Often times, the high correlation between SNPs (called linkage disequilibrium) implies that a large number of consecutive SNPs have a large association score and thus are selected. Fine mapping consists in finding the precise causal SNPs that really help explain the phenotype. Knockoffs can be useful in this setting, but the threshold phenomenon described earlier is an impediment to the application of knockoffs. We want to analyze several dozens, maybe hundreds of SNPs that have passed the selection threshold of the GWAS. However, the number of true causal SNPs may be very low, possibly less than 10. If we set a target FDR level of , the single knockoff procedure may be unable to make any detection.

We follow the lines of Hormozdiari et al. (2016, 2014) and run simulations analogous to those presented in Figure 2, where the features now correspond to individual genotypes. As it is not possible to actually know, for a given phenotype, which are the true causal SNPs without experimental confirmation, we generate synthetic responses (phenotypes) by randomly choosing a given number of SNPs as causal. Such semi-synthetic data (real and simulated ) is standard in literature Hormozdiari et al. (2014). In addition, we run a selection procedure without statistical guarantees that is commonly used: we pick the top correlated SNPs with the response. We give more details in Appendix C.3. We recover the results obtained with synthetic data and report them in Figure 4: FDR is controlled with the multi-knockoff procedure, and the top correlation method fails to control FDR. We also observe the detection threshold effect: for a low number of causal SNPs, single knockoffs have almost no power, and multi-knockoffs have better power than picking the most correlated SNPs. Note that here we use the real SNPs data from 2000 individuals, and we are approximating the SNP covariance matrix by a Gaussian in order to apply multi-knockoffs (SNPs take value in {0,1,2}). The multi-knockoff is robust enough that FDR is controlled even with this model mismatch.

## 5 Discussion

In this paper, we propose multi-knockoffs, an extension of the standard knockoff procedure. We show that multi-knockoff guarantees FDR control, and demonstrate how to generate Gaussian multi-knockoffs via a new entropy-maximization algorithm. Our extensive experiments show that multi-knockoffs are more stable and more powerful compared to the standard (single) knockoff. Finally we illustrate on the important problem of identifying causal GWAS mutations that multi-knockoff substantially outperforms the popular approach of selecting mutations with the highest correlation with the phenotype. The main contribution of this paper is in proposing the mathematical framework of multi-knockoffs; additional empirical analysis and applications is an important direction of future work.

#### Acknowledgments

J.R.G. was supported by a Stanford Graduate Fellowship. J.Z. is supported by a Chan–Zuckerberg Biohub Investigator grant and National Science Foundation (NSF) Grant CRII 1657155. The authors thank Emmanuel Candès and Nikolaos Ignatiadis for helpful discussions.

## References

• Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
• Mallows (1973) Colin L Mallows. Some comments on c p. Technometrics, 15(4):661–675, 1973.
• Sur and Candès (2018) Pragya Sur and Emmanuel J Candès. A modern maximum-likelihood theory for high-dimensional logistic regression. arXiv preprint arXiv:1803.06964, 2018.
• Candès et al. (2018) Emmanuel Candès, Yingying Fan, Lucas Janson, and Jinchi Lv. Panning for gold:‘model-x’knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2018.
• Barber et al. (2018) Rina Foygel Barber, Emmanuel J Candès, and Richard J Samworth. Robust inference with knockoffs. arXiv preprint arXiv:1801.03896, 2018.
• Barber et al. (2015) Rina Foygel Barber, Emmanuel J Candès, et al. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055–2085, 2015.
• M. S. Andersen and Vandenberghe (2012) J. Dahl M. S. Andersen and L. Vandenberghe. Cvxopt: A python package for convex optimization, version 1.1.5. 2012.
• Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
• Hormozdiari et al. (2016) Farhad Hormozdiari, Martijn van de Bunt, Ayellet V Segrè, Xiao Li, Jong Wha J Joo, Michael Bilow, Jae Hoon Sul, Sriram Sankararaman, Bogdan Pasaniuc, and Eleazar Eskin. Colocalization of gwas and eqtl signals detects target genes. The American Journal of Human Genetics, 99(6):1245–1260, 2016.
• Hormozdiari et al. (2014) Farhad Hormozdiari, Emrah Kostem, Eun Yong Kang, Bogdan Pasaniuc, and Eleazar Eskin. Identifying causal variants at loci with multiple signals of association. Genetics, pages genetics–114, 2014.
• Lei and Fithian (2018) Lihua Lei and William Fithian. Adapt: an interactive procedure for multiple testing with side information. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):649–679, 2018.
• Lei et al. (2017) Lihua Lei, Aaditya Ramdas, and William Fithian. Star: A general interactive framework for fdr control under structural constraints. arXiv preprint arXiv:1710.02776, 2017.
• Ignatiadis et al. (2016) Nikolaos Ignatiadis, Bernd Klaus, Judith B Zaugg, and Wolfgang Huber. Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nature methods, 13(7):577, 2016.
• Consortium et al. (2015) 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature, 526(7571):68, 2015.

## Appendix A Sampling Multiple Knockoffs

### a.1 Gaussian Multi-knockoffs

We generalize the knockoff generation procedure to have multi-knockoffs, starting with the Gaussian case. We see that a sufficient condition for to be a multi-knockoff vector -besides all vectors having the same mean - is that the joint vector has a covariance matrix of the form:

 Σκ=⎛⎜ ⎜ ⎜ ⎜ ⎜⎝ΣΣ−D…Σ−DΣ−DΣ…Σ−D⋮⋮⋱⋮Σ−DΣ−D…Σ⎞⎟ ⎟ ⎟ ⎟ ⎟⎠κ+1blocks

We can easily generalize previous diagonal matrix constructions to the multi-knockoff setting. The mathematical formulation of the heuristic behind SDP and equicorrelated knockoffs -as an objective function in the convex optimization problem- does not change when sampling multi-knockoffs, as the correlation between an original feature and any of its multi-knockoffs is the same as a consequence of exchangeability. However, the positive semi-definite constraint that defines the feasible set changes with . For the Entropy knockoffs, the objective function depends also on .

###### Proposition A.1.

We generalize the diagonal construction methods SDP, equicorrelated and entropy when sampling multi-knockoffs from a multivariate Gaussian, by the following convex optimization problems. We recover the formulations for the single knockoff setting by replacing .

• SDP Multi-knockoffs For a covariance matrix whose diagonal entries are equal to one, the diagonal matrix for constructing SDP knockoffs is given by the following convex optimization problem:

 minimized∑i=1|1−si| subject to{κ+1κΣ−D(s)≻0si≥0∀i∈{1,…,d}
• Equicorrelated Multi-knockoffs For a covariance matrix whose diagonal entries are equal to one, the diagonal matrix for constructing equicorrelated knockoffs is given by the following convex optimization problem:

 maximizessubject to{κ+1κΣ−sId≻0s≥0

The solution of this optimization problem has a closed form expression: , where is the smallest (positive) eigenvalue of .

• Entropy Multi-knockoffs The diagonal matrix for constructing entropy knockoffs is given by the following convex optimization problem (as is convex):

 argmins−logdet(κ+1κΣ−D(s))−κd∑i=1log(si) subject to{κ+1κΣ−D(s)≻0si≥0∀i∈{1,…,d}

The entropy knockoff construction method avoids solutions where diagonal terms are extremely close to , and we provide the following lower bound on the diagonal terms of :

 (λmin(s))1κ−2≤minj∈{1,…,d}sj

where is the smallest (positive) eigenvalue of .

For the SDP method and the equicorrelated method, increasing the number of multi-knockoffs constrains the feasible set of the convex optimization problem. However, diagonal terms can always be as close to 0 as they want, and we empirically observe a slight decrease in power as we increase indicating that the added constraints limit the choice of “good” values for the diagonal terms.

###### Proof.

The heuristic behind the different construction methods looks for different optimal solutions to convex optimization problems. Depending on the multi-knockoff parameter , we need to adapt two parts of the convex optimization formulations: the objective function and the feasible set. Objective functions in the SDP and equicorrelated constructions remain unchanged as they do not depend on the number of multi-knockoffs.

We first look at how the constraints defining the feasible set change as we go from simple knockoffs to multi-knockoffs. All three methods (SDP, equicorrelated, entropy) define the feasible set for by constraining to be positive definite. We show that this constraint is equivalent to , which we prove by induction. Suppose that at step , for any positive definite matrix , for positive definite diagonal matrix,

 ⎛⎜ ⎜ ⎜ ⎜ ⎜⎝SS−D…S−DS−DS…S−D⋮⋮⋱⋮S−DS−D…S⎞⎟ ⎟ ⎟ ⎟ ⎟⎠κ+1blocks≻0 ⇔1+κκS−D≻0

where we write for symmetric positive definite. We have :

 Σκ+1=⎛⎜ ⎜ ⎜ ⎜ ⎜⎝ΣΣ−D…Σ−DΣ−DΣ…Σ−D⋮⋮⋱⋮Σ−DΣ−D…Σ⎞⎟ ⎟ ⎟ ⎟ ⎟⎠κ+2blocks≻0
 ⇔ ⎛⎜ ⎜⎝Σ…Σ−D⋮⋱⋮Σ−D…Σ⎞⎟ ⎟⎠κ+1blocks −⎛⎜ ⎜⎝Σ−D⋮Σ−D⎞⎟ ⎟⎠Σ−1(Σ−D…Σ−D)≻0 ⇔ ⎛⎜ ⎜⎝C…C−D⋮⋱⋮C−D…C⎞⎟ ⎟⎠κ+1blocks≻0,C % defined below ⇔ 1+κκC−D≻0,by induction as C≻0 ⇔ 2+κ1+κD−DΣ−1D≻0 ⇔ (ΣDD2+κ1+κD)≻0 ⇔ Σ−D(2+κ1+κD)−1D≻0 asΣ≻0and this is a Schur complement ⇔ 2+κ1+κΣ−D≻0

Hence the recursive step and we conclude the proof. We have

 C=Σ−(Σ−D)Σ−1(Σ−D)=2D−DΣ−1D≻0

given that so is the Schur complement of :

 (ΣΣ−DΣ−DΣ)≻0
##### Objective Function for Entropy Construction

In addition to this, we need to formulate the objective function for the entropy construction. The entropy of a multivariate Gaussian has a simple closed formula.

 H(X0,X1,…,Xκ)=12logdet(2πeΣκ)

We rearrange the expression of to show that minimizing is equivalent to minimizing

 −logdet(κ+1κΣ−D(s))−κd∑i=1log(si)

(We showed in the main text that minimizing the entropy in a Gaussian setting is equivalent to minimizing this log-determinant). In order to do so, it suffices to show by induction that the following holds for all :

 det(Σκ)∝det(D)κdet(κ+1κΣ−D)

where the multiplicative constant is a real number depending only on . We first show this for .

 det(Σ1)= det(ΣΣ−D(s)Σ−D(s)Σ) = det(Σ)det(Σ−(Σ−D(s))Σ−1(Σ−D(s))) = det(Σ)det(2D(s)−D(s)Σ−1D(s)) = det(ΣD(s)D(s)2D(s)) = det(2ΣD(s)−D(s)D(s)) = det(2Σ−D(s))d∏i=1si

Suppose the result holds for a given . We use the notation . We have:

 |Σκ+1|= ∣∣ ∣ ∣ ∣ ∣∣ΣΣ−D…Σ−DΣ−DΣ…Σ−D⋮⋮⋱⋮Σ−DΣ−D…Σ∣∣ ∣ ∣ ∣ ∣∣κ+2blocks = |Σ|∣∣∣⎛⎜ ⎜⎝Σ…Σ−D⋮⋱⋮Σ−D…Σ⎞⎟ ⎟⎠κ+1blocks −⎛⎜ ⎜⎝Σ−D⋮Σ−D⎞⎟ ⎟⎠Σ−1(Σ−D…Σ−D)∣∣∣
 = |Σ|∣∣ ∣ ∣∣C…C−D⋮⋱⋮C−D…C∣∣ ∣ ∣∣κ+1blocks ∝ |Σ||D|κ∣∣1+κκC−D∣∣by induction ∝ |Σ||D|κ∣∣2+2κκD−1+κκDΣ−1D−D∣∣ ∝ |Σ||D|κ+1∣∣2+κκI−1+κκΣ−1D∣∣ ∝ |D|κ+1∣∣2+κκ+1Σ−D∣∣

Hence the result, where is the same as before. We used the following two formulae to compute determinants of block matrices:

• If is invertible, then

 det(ABCD)=det(A)det(D−CA−1B)
• If and commute and all the blocks are square matrices, then

##### Lower Bound for Diagonal Terms in Entropy Construction

For the entropy construction, in order to give a lower bound for the , we derive an expression for the solution of the minimization problem. Without loss of generality, fix so that we compute the partial derivative with respect to . Denote . Using Jacobi’s formula for the derivative of a determinant, we get:

 ddsj(|R(s)|d∏i=1sκi) =(ddsj|R(s)|)d∏i=1sκi+|R(s)|(d∏i≠jsi)sκ−1j =|R(s)|tr(R(s)−1dR(s)dsj)d∏i=1si+|R(s)|sκ−1jd∏i≠jsi =(sκ−1j−sjR(s)−1jj)(|R(s)|d∏i≠jsi)

given that where is a matrix where the only non-zero term equal to one is in position . Therefore . Setting this expression to we get that the solution of the convex optimization problem satisfies

 1sκ−2j=R(s)−1jj∀j∈{1,…,d}

Now we can write the diagonal term in the inverse matrix as a quotient between two determinants: where is the principal minor of when removing the th row and column. As both and can be written as a product of eigenvalues, the Cauchy interlacing theorem gives the following lower bound:

 λmin(R(s))≤minj∈{1,…,d}sκ−2j

where is the smallest (positive) eigenvalue of . ∎

### a.2 General Multi-knockoff Sampling Based on SCIP

We can also generalize to the multi-knockoff setting a universal (although possibly intractable) knockoff sampling algorithm introduced in Candès et al. (2018): the Sequential Conditional Independent Pairs (SCIP). Fix the number of multi-knockoffs to sample (so that SCIP corresponds to ). We iterate for over the features, at each step sampling knockoffs for the th feature, independently one of another, from the conditional distribution of the original feature given all the available variables sampled so far. We formulate this in Algorithm 2 and prove that the resulting samples satisfy exchangeability.

###### Proof.

We need to prove the following equality in distribution, using the notations of Definition 3.1:

 [X0,X1,…Xκ]swap(σ)\lx@stackreld=[X0,X1,…Xκ]

We follow the same proof as in Candès et al. (2018), where we have the following induction hypothesis:

##### Induction Hypothesis:

After i steps, we have

 [X0,X11:i,…Xκ1:i]swap(σ)\lx@stackreld=[X0,X11:i,…Xκ1:i]

where now with arbitrary permutations over . After the first step the equality holds for given that all have the same conditional distribution and are independent one of another. Now, if the hypothesis holds at step , then at step we have that the joint distribution of can be decomposed as a product of conditional distributions given the sampling procedure so that we have:

 L(X0,X11:i,…Xκ1:i)=∏κk=0L(Xki|X0−i,X1:κ1:i−1)L(X0−i,X1:κ1:i−1)

Now, by induction hypothesis, the expression in the denominator satisfies the extended exchangeability for the first dimensions (we marginalize out over which doesn’t matter as at step the permutations are over ). And so are the terms in the numerator, as again we permute only elements among the first dimensions. And, because of the conditional independent sampling, the numerator expression is also exchangeable for the th dimension. In conclusion, is exchangeable for the first dimensions, hence concluding the proof. ∎

## Appendix B Proofs

### b.1 Proof of Lemma 3.1

###### Proof.

Given that the operation is the concatenation of the action of each permutation onto and that we can write as the composition of transpositions, we see that it is enough to show the result for a simple transposition of two features (original or multi-knockoff) corresponding to a null dimension. This leads us directly to the proof of Lemma 3.2 in Candès et al. (2018), where the difference is that we add all the extra multi-knockoffs in the conditioning set. ∎

### b.2 Proof of Lemma 3.2

###### Proof.

Consider any collection of permutations on the set , and for , set the identity permutation. In order to prove the result we need to show the following equality in distribution:

 ([σi(κi)]1≤i≤d, [(T(k)i)0≤k≤κ]1≤i≤d) \lx@stackreld=([κi]1≤i≤d,[(T(k)i)0≤k≤κ]1≤i≤d)

Define for every and . Using the notation for the extended swap this is equivalent to , where for each null index the th features of and its knockoffs have been permuted according to (and the non-null remained at their place). By construction, is a function of and which associates to each feature in a “score” for its importance (for simplicity here we will denote by the whole concatenated vector of ). The choice of such function is restricted so that . By the multi-knockoff exchangeability property, and our specific choice of that does not permute non-null features, we also have