Variable selection with error control: Another look at Stability Selection
Stability Selection was recently introduced by Meinshausen and Bühlmann (2010) as a very general technique designed to improve the performance of a variable selection algorithm. It is based on aggregating the results of applying a selection procedure to subsamples of the data. We introduce a variant, called Complementary Pairs Stability Selection (CPSS), and derive bounds both on the expected number of variables included by CPSS that have low selection probability under the original procedure, and on the expected number of high selection probability variables that are excluded. These results require no (e.g. exchangeability) assumptions on the underlying model or on the quality of the original selection procedure. Under reasonable shape restrictions, the bounds can be further tightened, yielding improved error control, and therefore increasing the applicability of the methodology.
Key words: Complementary Pairs Stability Selection, -concavity, subagging, subsampling, variable selection
The problem of variable selection has received a huge amount of attention over the last 15 years, motivated by the desire to understand structure in massive data sets that are now routinely encountered across many scientific disciplines. It is now very common, e.g. in biological applications, image analysis and portfolio allocation problems as well as many others, for the number of variables (or predictors) that are measured to exceed the number of observations . In such circumstances, variable selection is essential for model interpretation.
In a notable recent contribution to the now vast literature on this topic, Meinshausen and Bühlmann (2010) proposed Stability Selection as a very general technique designed to improve the performance of a variable selection algorithm. The basic idea is that instead of applying one’s favourite algorithm to the whole data set to determine the selected set of variables, one instead applies it several times to random subsamples of the data of size , and chooses those variables that are selected most frequently on the subsamples. Stability Selection is therefore intimately connected with bagging (Breiman, 1996, 1999) and subagging (Bühlmann and Yu, 2002).
A particularly attractive feature of Stability Selection is the error control provided by an upper bound on the expected number of falsely selected variables (Meinshausen and Bühlmann, 2010, Theorem 1). Such control is typically unavailable when applying the original selection procedure to the whole data set, and allows the practitioner to select the threshold for the proportion of subsamples for which a variable must be selected in order for it to be declared significant.
However, the bound does have a couple of drawbacks. Firstly, it applies to the ‘population version’ of the subsampling process, i.e. to the version of the procedure that aggregates results over the non-random choice of all subsamples. Even for as small as 15, it is unrealistic to expect this version to be used in practice, and in fact choosing around random subsamples is probably typical. More seriously, the bound is derived under a very strong exchangeability assumption on the selection of noise variables (as well as a weak one on the quality of the original selection procedure, namely that it is not worse than random guessing).
In this paper, we develop the methodology and conceptual understanding of Stability Selection in several respects. We introduce a variant of Stability Selection, where the subsamples are drawn as complementary pairs from . Thus the subsampling procedure outputs index sets , where each is a subset of of size , and . We call this variant Complementary Pairs Stability Selection (CPSS).
At first glance it would seem that CPSS would be expected to yield very similar results to the original version of Stability Selection. However, we show that CPSS in fact has the following properties:
The Meinshausen–Bühlmann bound holds for CPSS regardless of the number of complementary pairs chosen – even with .
There is a corresponding bound for the number of important variables excluded by CPSS.
Our results have no conditions on the original selection procedure, and in particular do not require the strong exchangeability assumption on the selection of noise variables. Indeed, we argue that even a precise definition of ‘signal’ and ‘noise’ variables is not helpful in trying to understand the properties of CPSS, and we instead state the bounds in terms of the expected number of variables chosen by CPSS that have low selection probability under the base selection procedure, and the expected number of high selection probability variables that are excluded by CPSS. See Section 2 for further discussion.
The bound on the number of low selection probability variables chosen by CPSS can be significantly sharpened under mild shape restrictions (e.g. unimodality or -concavity) on the distribution of the proportion of times a variable is selected in both and . We discuss these conditions in detail in Sections 3.2 and 3.3 respectively, and compare both the original and new bounds to demonstrate the marked improvement.
Our improved bounds are based on new versions of Markov’s inequality that hold for random variables whose distributions are unimodal or -concave. However, it is important to note at this point that the results are not just a theoretical contribution; they allow the practitioner to reduce (and therefore select more variables) for the same control of the number of low selection probability variables chosen by CPSS. In Section 3.4, we give recommendations on how a practitioner can make use of the bounds in applying CPSS.
In Section 4.1, we present the results of an extensive simulation study designed to illustrate the appropriateness of our shape restrictions, and to compare Stability Selection and CPSS with their base selection procedures.
A review of some of the extensive literature on variable selection can be found in Fan and Lv (2010). Work related more specifically to Stability Selection includes Bach (2008), who studied the Bolasso (short for Bootstrapped enhanced Lasso). This involves applying the Lasso to bootstrap (with replacement) samples from the original data, rather than subsampling without replacement. A final estimate is obtained by applying the Lasso to the intersection of the set of variables selected across the bootstrap samples. Various authors, particularly in the machine learning literature, have considered the stability of a feature selection algorithm, i.e. the insensitivity of the output of the algorithm to variations in the training set; such studies include Lange et al. (2003), Kalousis, Prados and Hilario (2007), Kuncheva (2007), Loscalzo, Yu and Ding (2009) and Han and Yu (2010). Saeys, Abeel and Peer (2008) consider obtaining a final feature ranking by aggregating the rankings across bootstrap samples.
2 Complementary Pairs Stability Selection
In order to keep our discussion rather general, we only assume that we have vector-valued data which we take to be a realisation of independent and identically distributed random elements . Informally, we think of some of the components of as being ‘signal variables’, and others as being ‘noise variables’, though for our purposes it is not necessary to define these notions precisely. Formally, we let and , thought of as the index sets of the signal and noise variables respectively. A variable selection procedure is a statistic taking values in the set of all subsets of , and we think of as an estimator of . As a typical example, we may often write with the covariate and the response , and our (pseudo) log-likelihood might be of the form
for some . In this context, we regard as the signal indices, as noise indices. Examples from graphical modelling can also be cast within our framework. Note however that we do not require a (pseudo) log-likelihood of the form (1).
We define the selection probability of a variable index under as
We take the view that for understanding the properties of Stability Selection, the selection probabilities are the fundamental quantities of interest. Since an application of Stability Selection is contingent on a choice of base selection procedure , all we can hope is that it selects variables having high selection probability under the base procedure, and avoids selecting those variables with low selection probability. Indeed this turns out to be the case; see Theorem 1 below.
Of course, has a Bernoulli distribution with parameter , so we may view as an unbiased estimator of (though is not a model parameter in the conventional sense). The key idea of Stability Selection is to improve on this simple estimator of through subsampling.
For a subset with , we shall write
Definition 1 (Complementary Pairs Stability Selection).
Let be randomly chosen independent pairs of subsets of of size such that . For , the Complementary Pairs Stability Selection version of a variable selection procedure is , where the function is given by
Note that is an unbiased estimator of , but, in general, a biased estimator of . However, by means of the averaging involved in (3), we hope that will have reduced variance compared with , and that this increased stability will more than compensate for the bias incurred. Indeed, this is the case in other situations where bagging and subagging have been successfully applied, such as classification trees (Breiman, 1996) or nearest neighbour classifiers (Hall and Samworth, 2005; Biau, Cérou and Guyader, 2010; Samworth, 2011).
An alternative to subsampling complementary pairs would be to use bootstrap sampling. We have found that this gives very similar estimates of , though most of our theoretical arguments do not apply when the bootstrap is used (the approach in Section 3.3.1 is an exception in this regard). In fact, taking subsamples of size can be thought of as the subsampling scheme that most closely mimics the bootstrap (e.g. Dümbgen, Samworth and Schuhmacher, 2011).
It is convenient at this stage to define another related selection procedure based on sample splitting.
Definition 2 (Simultaneous Selection).
Let be randomly chosen independent pairs of subsets of of size such that . For , the Simultaneous Selection version of is , where
For our purposes, Simultaneous Selection is a tool for understanding the properties of CPSS. However, the special case of of Simultaneous Selection was studied by Fan, Samworth and Wu (2009), and a variant involving all possible disjoint pairs of subsets was considered in Meinshausen and Bühlmann (2010).
3 Theoretical properties
3.1 Worst-case bounds
In Theorem 1 below, we show that the expected number of low selection probability variables chosen by CPSS is controlled in terms of the expected number chosen by the original selection procedure, with a corresponding result for the expected number of high selection probability variables not chosen by CPSS. The appealing feature of these results is their generality: they require no assumptions on the underlying model or on the quality of the original selection procedure, and they apply regardless of the number of complementary pairs of subsets chosen.
For , let denote the set of variable indices that have low selection probability under , and let denote the set of those that have high selection probability.
If , then
Let and . If , then
In many applications, and for a good base selection procedure, we imagine that the set of selection probabilities is positively skewed in , with many selection probabilities being very low (predominantly noise variables), and with just a few being large (including at least some of the signal variables). To illustrate Theorem 1(i), consider a situation with variables and where the base selection procedure chooses 50 of them. Then Theorem 1(i) shows that on average CPSS with selects no more than a quarter of the below average selection probability variables chosen by .
Our Theorem 1(i) is analogous to Theorem 1 of Meinshausen and Bühlmann (2010). The differences are that we do not require the condition that is exchangeable, nor that the original procedure is no worse than random guessing, and our result holds for all . The price we pay is that the bound is stated in terms of the expected number of low selection probability variables chosen by CPSS, rather than the expected number of noise variables, which we do for the reasons described in Section 2. If the exchangeability and random guessing conditions mentioned above do hold, then, writing , we recover
The final bound here was obtained in Theorem 1 of Meinshausen and Bühlmann (2010) for the population version of Stability Selection.
3.2 Improved bounds under unimodality
Despite the attractions of Theorem 1, the following observations suggest there may be scope for improvement. Firstly, we expect we should be able to obtain tighter bounds as increases. Secondly, and more importantly, examination of the proof of Theorem 1(i) shows that our bound relies on first noting that
and then applying Markov’s inequality to . For equality in Markov’s inequality, must be a mixture of point masses at and , but Figure 1 suggests that the distribution of , which is supported on , can be very different from this. Indeed, our experience, based on extensive simulation studies, is that when is close to (which is where the bound in Theorem 1(i) is probably of most interest), the distribution of over is remarkably consistent over different data generating processes, and Figure 1 is typical.
It is therefore natural to consider placing shape restrictions on the distribution of which encompass what we see in practice, and which yield stronger versions of Markov’s inequality. As a first step in this direction, we consider the assumption of unimodality.
Suppose that the distribution of is unimodal for each . If , then
where, when ,
The proof of Theorem 2 is based on a new version of Markov’s inequality (Theorem 9 in the Appendix) for random variables with unimodal distributions supported on a finite lattice. There is also an explicit expression for when , which follows from Theorem 9 in the same way, but we do not present it here because it is a little more complicated, and because we anticipate the bound when is (much) smaller than being of most use in practice. See Section 3.4 for further discussion.
3.3 Further improvements under -concavity
The unimodal assumption allows for a significant improvement in the bounds attainable from a naive application of Markov’s inequality. However, Figure 1 suggests that further gains may be realised by placing tighter constraints on the family of distributions for that we consider, in order to match better the empirical distributions that we see in practice.
A very natural constraint to impose on the distribution of is log-concavity. By this, we mean that, if denotes the probability mass function of , then the linear interpolant to is a log-concave function on . Log-concavity is a shape constraint that has received a great deal of attention recently (e.g. Walther (2002); Dümbgen and Rufibach (2009); Cule, Samworth and Stewart (2010)), and at first sight it seems reasonable in our context, because if the summands in (4) were independent, then we would have , which is log-concave.
It is indeed possible to obtain a version of Markov’s inequality under log-concavity that leads to another improvement in the bound on . However, we found that in practice, the dependence structure of the summands in (4) meant that the log-concavity constraint was a little too strong. We therefore consider instead the class of -concave distributions, which we claim defines a continuum of constraints that interpolate between log-concavity and unimodality (see Propositions 3 and 4 below). This constraint has also been studied recently in the context of density estimation by Seregin and Wellner (2010) and Koenker and Mizera (2010); see also Dharmadhikari and Joag-Dev (1988).
To define the class, we recall that the generalised mean of is given by
for . This is also well-defined for if we take when , and define . In addition, we may define
We are now in a position to define -concavity.
A non-negative function on an interval is -concave if for every and , we have
A probability mass function supported on is -concave if the linear interpolant to is -concave.
When , it is easy to see that is -concave if and only if is convex. Let denote the class of -concave probability mass functions on . Then each is unimodal, and as is non-decreasing in for fixed and , we have for . Furthermore, is unimodal if it is -concave, and is log-concave if it is -concave. The following two results further support the interpretation of -concavity for as an interpolation between log-concavity and unimodality.
A function is log-concave if and only if it is -concave for every .
Let be a unimodal probability mass function supported on and suppose both that and that , for some . Then is -concave for some .
In Proposition 11 in the Appendix, we present a result that characterises those -concave distributions that attain equality in a version of Markov’s inequality for random variables with -concave distributions on . If we assume that is -concave for all , using (5), for these variables we can obtain a bound of the form
where denotes the maximum of over all -concave random variables supported on with . Although does not appear to have a closed form, it is straightforward to compute numerically, as we describe in Section A.4. The lack of a simple form means a direct analogue Theorem 2 is not available. We can nevertheless obtain the following bound on the expected number of low selection probability variables chosen by CPSS:
Our simulation studies suggest that is a sensible choice to use for the bound. In other words, if denotes the probability mass function of , then the linear interpolant to is typically well approximated by a convex function. This is illustrated in the bottom left panel of Figure 1 (note that the right-hand tail in this plot corresponds to tiny probabilities).
3.3.1 Lowering the threshold
The bounds obtained thus far have used the relationship (5) to convert a Markov bound for into a corresponding one for the statistic of interest, . The advantage of this approach is that is much smaller than for variables with low selection probability, so the Markov bound is quite tight. However, for close to , the inequality (5) starts to become weak, and bounds can only be obtained for in any case.
To solve this problem, we can apply our versions of Markov’s inequality directly to . We have found, through our simulations, that for variables with low selection probability, the distribution of can be modelled very well as a -concave distribution (see Figure 3). That the distribution of is closer to log-concavity than that of is intuitive because although the summands in (3) are not independent, terms involving subsamples which have little overlap will be close to independent. If we assume that is -concave and that is -concave for all , we can obtain our best bound
which is valid for all , provided we adopt the convention that for . The resulting improvements in the bounds can been seen in Figure 2. Note the kink in Figure 2 for the -concave bound (8) just before . This corresponds to the transition from where is smaller to where is smaller.
3.4 How to use these bounds in practice
The quantities and , which appear on the right hand sides of the bounds, will in general be unknown to the statistician. Thus when using the bounds, they will typically need to be replaced by and respectively. In addition, several parameters must be selected, and in this section we go through each of these in turn and give guidance on how to choose them.
Choice of .
We recommend as a default value. Choosing larger than this increases the computational burden, and may lead to the -concavity assumptions being violated.
Choice of .
As mentioned at the beginning of Section 3.2, is a natural choice. In other words, we regard the below average selection probability variables as the irrelevant variables. Other choices of are possible, but the use of (6) and (7) to construct the bound suggests that the inequality will be tightest when most of the variables have a selection probability close to .
Choice of and threshold .
One can regard the choice of (which is usually fixed through a tuning parameter ) as part of the choice of the base selection procedure. One option is to fix by varying at each evaluation of the selection procedure until it selects variables. However, if the number of variables selected at each iteration is unknown in advance (e.g. if is fixed, or if cross-validation is used to choose at each iteration), then can be estimated by .
An important point to note is that although choosing or is usually crucial when carrying out variable selection, this is not the case when using CPSS. Our experience is that the performance of CPSS is surprisingly insensitive to the choice of (see also Meinshausen and Bühlmann (2010)). That is to say, does not vary much as varies, and also the final selected sets for different values of tend to be similar (where different thresholds are chosen to control the selection of variables in at a pre-specified level). Thus, when using CPSS, it is the threshold that plays a role similar to that of a tuning parameter for the base procedure. The great advantage of CPSS is that our bounds allow one to choose to control the expected number of low selection probability variables selected.
To summarise: we recommend as a sensible default CPSS procedure taking and . We then choose using the bound (8) with replaced by to control the expected number of low selection probability variables chosen.
4 Numerical properties
4.1 Simulation Study
In this section we investigate the performance and validity of the bounds derived in the previous section by applying CPSS to simulated data. We consider both linear and logistic regression and different values of and . In each of these settings, we first generate independent explanatory vectors with each . We use a Toeplitz covariance matrix with entries
and we look at various values of in . So the correlation between the components decays exponentially with the distance between them in .
For linear regression, we generate a vector of errors and set
where the design matrix has row . The error variance is chosen to achieve different values of the signal-to-noise ratio (SNR), which we define here by
For logistic regression, we generate independent responses
Here is a scaling factor which is chosen to achieve a particular Bayes error rate.
In both cases, we fix the -dimensional vector of coefficients to have non-zero components, of which we choose as equally spaced points within with the remaining equally spaced in . The indices of the non-zero components, , are chosen to follow a geometric progression up to rounding, with first term 1 and term . The values are then randomly assigned to each index in , but this choice is then fixed for each particular simulation setting.
With , this setup will have several signal variables correlated amongst themselves, and also some signal correlated with noise. In this way, the framework above includes a very wide variety of different data generating processes on which we can test the theory of the previous section.
By varying the base selection procedure, its tuning parameters, the values of , , , and also the SNR and Bayes error rates, we have applied CPSS in several hundred different simulation settings. For reasons of space, we present only a subset of these numerical experiments below, but the results from those omitted are not qualitatively different.
In the graphs which follow, we look at CPSS applied to the Lasso (Tibshirani, 1996), which we implemented using the package glmnet (Friedman, Hastie and Tibshirani, 2010) in R (R Development Core Team, 2010). We follow the original stability selection procedure put forward in Meinshausen and Bühlmann (2010) and compare this to the method suggested by our -concave bound (8). Thus we first choose the level at which we wish to control the expected number of low selection probability variables (so we aim to have ). Then we fix and set the threshold at 0.9. This ensures that, according to the original worst case bound, we control the expected number of low selection probability variables selected at the required level. In the -concave case, we take our threshold as
We also give the results one would obtain using the Lasso alone, but with the benefit of an oracle which knows the optimal value of the tuning parameter . That is, we take as our selected set, where
and is the selected set when using the Lasso with tuning parameter applied to the whole data set.
We present all of our results relative to the performance of CPSS using an oracle-driven threshold , where is defined by
respectively. Thus the heights of the black and grey bars relate to the loss of power in using the threshold suggested by the corresponding bounds. In all of our simulations, we used . Each scenario was run 500 times, and in order to determine the set , in each scenario, we applied the particular selection procedure to 50,000 independent data sets.
It is immediately obvious from the results that using the -concave bound, we are able to recover significantly more variables in than when using the the worst case bound. Furthermore, though it is not shown in the graphs explicitly, we also achieve the required level of error control in all but one case (where the -concavity assumption fails). In fact the one particular example is hardly exceptional in that we have . Thus in close accordance with our theory, there are no significant violations of the -concave bound.
We also see that the loss in power due to using rather than , is very low. In almost all of the scenarios, we are able to select more than 75% of the signal we could select with the benefit of an oracle, and usually much more than this. It is interesting that the performance of the oracle CPSS and oracle Lasso procedures are fairly similar. The key advantage of CPSS is that it allows for error control whereas there is in general no way of determining (or even approximating) the optimal that achieves the required error control. In fact, the performance of CPSS with our bound is only slightly worse then that of the oracle Lasso procedure, and in a few cases, particularly when is small, it is even slightly better. In the cases where , we see that CPSS is not quite as powerful. This is because having such large correlations between variables causes to be relatively spread out in . As explained in Section 3.4, we expect our bound to weaken in this situation. However, even when the correlation is as high as 0.9, we recover a sizeable proportion of the signal we would select had we used the optimal .
4.2 Real data example
Here we illustrate our CPSS methodology on the widely studied colon data set of Alon et al. (1999), freely available at http://microarray.princeton.edu/oncology/affydata/index.html. The data consist of 2000 gene expression levels from 40 colon tumour samples and 22 normal colon tissue samples, measured using Affymetrix oligonucleotide arrays. Our goal is to identify a small subset of genes which we are confident are linked with the development of colon cancer. Such a task is important for improving scientific understanding of the disease and for selecting genes as potential drug targets.
The data were first preprocessed by averaging over the expression levels for repeated genes (which had been tiled more than once on each array), log-transforming each gene expression level, standardising each row to have mean zero and unit variance, and finally removing the columns corresponding to control genes, so that genes remained. The transformation and standardisation are very common preprocessing steps to reduce skewness in the data and help eliminate the effects of systematic variations between different microarrays (see for example Amaratunga and Cabrera (2004) and Dudoit et al. (2002)).
We applied CPSS with (Lasso) penalised logistic regression as the base procedure, with , and choosing both using the -concave bound of Section 3.4, and the original bound of Meinshausen and Bühlmann (2010). We estimated the expected classification error in the two cases by averaging over 128 repetitions of stratified random subsampling validation, taking 8 cancerous and 4 normal observations in each test set. Thus when applying CPSS, we had . We looked at and , and set to control with and .
Rather than subsampling completely at random when using CPSS, we also stratified these subsamples to include the same proportion of cancerous to normal samples as in the training data supplied to the procedure. Without this step, some of the subsamples may not include any samples from one of the classes, and applying to such a subsample would give misleading results. Using stratified random subsampling is still compatible with our theory, provided that is interpreted as an expectation over random data which contain the same class proportions as observed in the original data. In general, this approach of stratified random subsampling is useful when the response is categorical.
The results in Table 1 show that, as expected, the new error bounds allow one to select more variables than the conservative bounds of Meinshausen and Bühlmann (2010) for the same level of error control, and as a consequence, the expected prediction error is reduced. Figure 8 demonstrates the robustness of the selected set to the different values of . Finally, we also applied CPSS on the entire dataset with and and using the -concave bound of Section 3.4 to choose to control (cf. Figure 9). We see that with just 5 genes out of 1908, we manage to separate the two classes quite well.
Appendix A Appendix
a.1 Proof of Theorem 1
The proof of Theorem 1 requires the following lemma.
If , then
If , then
(i) Let be randomly chosen independent pairs of subsets of of size such that . Then
Now because and are independent conditional on . It follows using (9) that
where we have used Markov’s inequality in the final step.
(ii) Define and by replacing with in the definitions of and respectively. Then, using the bound corresponding to (9) and Markov’s inequality again,
a.2 Proof of Theorem 2
The proof of Theorem 2 requires several preliminary results, and we use the following notation. Let denote the finite lattice . If is a probability mass function on , we write for , thereby associating with .
For , we denote the probability that a random variable distributed according to takes values greater than or equal to by . We also write for the expectation of this random variable and for the support of .
Let be the set of all unimodal probability mass functions on , and let . We consider the problem of maximising over . Since the cases and are trivial, there is no loss of generality in assuming throughout that and , so in particular .
There exists a maximiser of in .
Since is linear and therefore continuous, it suffices to show that is closed and bounded. Now is bounded as . Moreover, the hyperplane is closed. Also, is a continuous function on , so is closed. Now let . If then there must exist such that . Clearly this inequality must hold for all in a sufficiently small open ball about , so is open. We see that
Thus is an intersection of closed sets and hence is closed. ∎
We will make frequent use of the following simple proposition in subsequent proofs.
Suppose that and satisfy
and that there exists some with for all and for all . Then
with equality if and only if for .
The following result characterises the extremal elements of in the sense of maximising the tail probability . In particular, it shows that such extremal elements can take only one of two simple forms.
Any maximiser of satisfies
writing for , we have either
(i) Suppose maximises , but that . Define . As , we must have . Define by
where are chosen such that , but are small enough that . Then but , a contradiction.
(ii) Suppose first that there exists a mode of which is at least . Let be such that for and for . As , we can apply Proposition 7 to see that