Feature Selection by Higher Criticism Thresholding:Optimal Phase Diagram

# Feature Selection by Higher Criticism Thresholding: Optimal Phase Diagram

David Donoho and Jiashun Jin
Department of Statistics, Stanford University
Department of Statistics, Carnegie Mellon University
December 11, 2008
###### Abstract

We consider two-class linear classification in a high-dimensional, low-sample size setting. Only a small fraction of the features are useful, the useful features are unknown to us, and each useful feature contributes weakly to the classification decision – this setting was called the rare/weak model (RW Model) in [11].

We select features by thresholding feature -scores. The threshold is set by higher criticism (HC) [11]. Let denote the -value associated to the -th -score and denote the -th order statistic of the collection of -values. The HC threshold (HCT) is the order statistic of the -score corresponding to index maximizing . The ideal threshold optimizes the classification error. In [11] we showed that HCT was numerically close to the ideal threshold.

We formalize an asymptotic framework for studying the RW model, considering a sequence of problems with increasingly many features and relatively fewer observations. We show that along this sequence, the limiting performance of ideal HCT is essentially just as good as the limiting performance of ideal thresholding. Our results describe two-dimensional phase space, a two-dimensional diagram with coordinates quantifying “rare” and “weak” in the RW model. Phase space can be partitioned into two regions – one where ideal threshold classification is successful, and one where the features are so weak and so rare that it must fail. Surprisingly, the regions where ideal HCT succeeds and fails make the exact same partition of the phase diagram. Other threshold methods, such as FDR threshold selection, are successful in a substantially smaller region of the phase space than either HCT or Ideal thresholding.

The False Feature Detection Rate (FDR) and local FDR of the ideal and HC threshold selectors have surprising phase diagrams which we also describe. Results showing asymptotic equivalence of HCT with ideal HCT can be found in a companion paper [12].

Keywords: Asymptotic rare weak model (ARW), False Discovery Rate (FDR), linear classification, local False Discovery Rate (Lfdr), phase diagram, Fisher’s separation measure (SEP), feature selection by thresholding.

AMS 2000 subject classifications: Primary-62H30, 62H15. Secondary-62G20, 62C32, 62G10.

Acknowledgments: The authors would like to thank the Isaac Newton Mathematical Institute of Cambridge University for hospitality during the program Statistical Theory and Methods for Complex, High-Dimensional Data, and for a Rothschild Visiting Professorship held by DLD. DLD was partially supported by NSF DMS 05-05303, and JJ was partially supported by NSF CAREER award DMS 06-39980.

## 1 Introduction

The modern era of high-throughput data collection creates data in abundance. Some devices – spectrometers and gene chips come to mind – automatically generate measurements on thousands of standard features of each observational unit.

What hasn’t changed in science is the difficulty of obtaining good observational units. High-throughput devices don’t help us to find and enroll qualified patients, or catch exotic butterflies, or observe primate mating behaviors. Hence, in many fields the number of observational units – eg. patients, butterflies, or matings – has not increased, and stays today in the dozens or hundreds. But each of those few observational units can now be subjected to a large battery of automatic feature measurements

Many of those automatically measured features will have little relevance to any given project. This new era of feature glut poses a needle-in-a-haystack problem: we must detect a relatively few valuable features among many useless ones. Unfortunately, the combination of small sample sizes (few observational units) and high dimensions (many feature measurements) makes it hard to tell needles from straw.

Orthodox statistical methods assumed a quite different set of conditions: more observations than features, and all features highly relevant. Modern statistical research is intensively developing new tools and theory to address the new unorthodox setting; such research comprised much of the activity in the recent 6-month Newton Institute program Statistical Theory and Methods for Complex, High-Dimensional Data.

In this article we focus on this new setting, this time addressing the challenges that modern high-dimensional data pose to linear classification schemes. New data analysis tools and new types of mathematical analysis of those tools will be introduced.

### 1.1 Multivariate normal classification

Consider a simple model of linear classifier training. We have a set of labelled training samples , , where each label is and each feature vector . For simplicity, we assume the training set contains equal numbers of ’s and ’s and that the feature vectors obey , , for an unknown mean contrast vector ; here denotes the feature covariance matrix and is the training set size. In this simple setting, one ordinarily uses linear classifiers to classify an unlabeled test vector , taking the general form , for a sequence of ‘feature weights’ .

Classical theory going back to RA Fisher [3] shows that the optimal classifier has feature weights ; at first glance linear classifier design seems straightforward and settled. However, in many of today’s most active application areas, it is a major challenge to construct linear classifiers which work well.

### 1.2 p larger than n

The culprit can be called the “ problem”. A large number of measurements is automatically made on thousands of standard features, but in a given project, the number of observational units, , might be in the dozens or hundreds. The fact that makes it difficult or impossible to estimate the feature covariance with any precision.

It is well known that naive application of the formula to empirical data in the setting is problematic; at a minimum, because the matrix of empirical feature covariances is not invertible. But even if we use the generalized inverse , the resulting naive classification weights, , often give very “noisy” classifiers with low accuracy. The modern feature glut thus seriously damages the applicability of ‘textbook’ approaches.

A by-now standard response to this problem is to simply ignore feature covariances, and standardize the features to mean zero and variance one. One in effect pretends that the feature covariance matrix is the identity matrix, and uses the formula [5, 16]. Even after this reduction, further challenges remain.

### 1.3 When features are rare and weak

In fields such as genomics, with a glut of feature measurements per observational unit, it is expected that few measured features will be useful in any given project; nevertheless, they all get measured, because researchers can’t say in advance which ones will be useful. Moreover, reported misclassification rates are relatively high. Hence there may be numerous useful features, but they are relatively rare and individually quite weak.

Such thinking motivated the following rare/weak feature model (RW Feature Model) in [11]. Under this model,

• Useful features are rare: the contrast vector is nonzero in only out of elements, where is small, i.e. close to zero. As an example, think of , , so . In addition,

• Useful features are weak: the nonzero elements of have common amplitude , which is assumed not to be ‘large’. ‘Large’ can be measured using ; values of in the range to imply that corresponding values of are not large.

Since the elements of the feature vector where the class contrast are entirely uninformative about the value of , only the features where are useful. The problem is how to identify and benefit from those rare, weak features.

We speak of and as the sparsity and strength parameters for the Rare/Weak model, and refer to the model. Models with a ‘sparsity’ parameter are common in estimation settings [14, 13, 27], but not with the feature strength constraint . Also closely related to the RW model is work in multiple testing by Ingster and the authors [23, 9, 25].

### 1.4 Feature selection by thresholding

Feature selection - i.e. working only with an empirically-selected subset of features - is a standard response to feature glut. We are supposing, as announced in Section 1.2, that feature correlations can be ignored and that features are already standardized to variance one. We therefore focus on the vector of feature -scores, with components , . These are the -scores of two-sided tests of : . Under our assumptions with and the feature contrast vector. Features with nonzero typically have significantly nonzero while all other features will have values largely consistent with the null hypothesis . In such a setting, selecting features with -scores above a threshold makes sense.

We identify three useful threshold functions: , . These are: Clipping, which ignores the size of the -score, provided it is large; Hard Thresholding, which uses the size of the -score, provided it is large; and Soft Thresholding, which uses a shrunken -score, provided it is large.

###### Definition 1.1

Let . The threshold feature selection classifier makes its decision based on where , and .

In words, the classifier sums across features with large training-set -scores, and a simple function of the -score generates the feature weight.

Several methods for linear classification in bioinformatics follow this approach: the Shrunken Centroids method [29] is a variant of soft thresholding in this two-class setting; the highly-cited methods in [17] and [22] are variants of hard thresholding. Clipping makes sense in the theoretical setting of the RW model (since then the useful features have all the same strength) and is simpler to analyse than the other nonlinearities.

Thresholding has been popular in estimation for more than a decade [14]; it is known to be succesful in ‘sparse’ settings where the estimand has many coordinates, of which only a relatively few coordinates are significantly nonzero. However classification is not the same as estimation, and performance characteristics are driven by quite different considerations.

One crucial question remains: how to choose the threshold based on the data? Popular methods for threshold choice include cross-validation [29]; control of the false discovery rate [1, 2, 4, 10]; and control of the local false discovery rate [15].

### 1.5 Higher Criticism

In [11] we proposed a method of threshold choice based on recent work in the field of multiple comparisons. We now very briefly mention work in that field and then introduce the threshold choice method.

#### 1.5.1 HC testing

Suppose we have a collection of -values which under the global null hypothesis are uniformly distributed: . Consider the order statistics: . Under the null hypothesis, these order statistics have the usual properties of uniform order statistics, including the asymptotic normality . HC forms a -score comparing with its mean under the null, and then maximizes over a wide range of . Formally:

###### Definition 1.2

(HC Testing) [9]. The Higher Criticism objective is

 HC(i;π(i))=√Ni/N−π(i)√i/N(1−i/N). (1.1)

Fix (eg ). The HC test statistic is .

HC seems insensitive to the selection of , in Rare/Weak situations; here we always use .

In words, we look for the largest standardized discrepancy for any between the observed behavior and the expected behavior under the null. When this is large, the whole collection of -values is not consistent with the global null hypothesis. The phrase “Higher Criticism” is due to John Tukey, and reflects the shift in emphasis from single test results to the whole collection of tests; see discussion in [9]. Note: there are several variants of HC statistic; we discuss only one variant in this brief note; the main results of [9] still apply to this variant; for full discussion see [9, 11].

#### 1.5.2 HC thresholding

Return to the classification setting in previous sections. We have a vector of feature -scores . To apply HC notions, translate -scores into two-sided -values, and maximizes the HC objective over index in the appropriate range. Define the feature -values , ; and define the increasing rearrangement , the HC objective function , and the increasing rearrangement correspondingly. Here is our proposal.

###### Definition 1.3

(HC Thresholding). Apply the HC procedure to the feature -values. Let the maximum HC objective be achieved at index . The Higher Criticism threshold (HCT) is the value . The HC threshold feature selector selects features with -scores exceeding in magnitude.

Figure 1 illustrates the procedure. Panel (a) shows a sample of -scores, Panel (b) shows a PP-plot of the corresponding ordered -values versus and Panel (c) shows a standardized PP-plot. The standardized PP-Plot has its largest deviation from zero at ; and this generates the threshold value.

#### 1.5.3 Previously-reported results for HCT

Our article [11] reported several findings about behavior of HCT based on numerical and empirical evidence. In the RW model, we can define an ideal threshold, i.e. a threshold based on full knowledge of the RW parameters and and chosen to minimize the misclassification rate of the threshold classifier – see Section 2 below. We showed in [11] that:

• HCT gives a threshold value which is numerically very close to the ideal threshold.

• In the case of very weak feature z-scores, HCT has a False Feature Discovery Rate (FDR) substantially higher than other popular approaches, but a Feature Missed Detection Rate (MDR) substantially lower than those other approaches;

• At the same time, HCT has FDR and MDR very closely matching those of the ideal threshold.

In short, HCT has very different operating characteristics from those other thresholding schemes like FDR thresholding [1, 2] and Bonferroni thresholding, but very similar operating characteristics to the ideal threshold.

### 1.6 Asymptotic RW model, and the phase diagram

In this paper we further support the findings reported in [11], this time using an asymptotic analysis. In our analysis the number of observations and the number of features tend to infinity in a linked fashion, with remaining very small compared to . (Empirical results in [11] show our large- theory is applicable at moderate and ).

More precisely, we consider a sequence of problems with increasingly more features, increasingly more rare useful features, and relatively small numbers of observations compared to the number of features.

###### Definition 1.4

The phrase asymptotic RW model refers to the following combined assumptions.

• Asymptotic Setting. We consider a sequence of problems, where the number of observations and the number of features both tend to along the sequence.

• dramatically larger than . Along this sequence, , so there are dramatically more features per observational unit than there are observational units.

• Increasing Rarity. The sparsity varies with and according to , .

• Decreasing Strength. The strength varies with and according to , .

The symbol refers to the model combining these assumptions.

In this model, because , useful features are individually too weak to detect and because , useful features are increasingly rare with increasing , while increasing in total number with . It turns out that and are incidental, while and are the driving parameters. Hence we always simply write below.

There is a large family of choices of where successful classification is possible, and another large family of choices where it is impossible. To understand this fact, we use the concept of phase space, the two-dimensional domain . We show that this domain is partitioned into two regions or ‘phases’. In the “impossible” phase, useful features are so rare and so weak that classification is asymptotically impossible even with the ideal choice of threshold. In the “possible” phase, successfully separating the two groups is indeed possible - if one has access to the ideal threshold. Figure 2 displays this domain and its partition into phases. Because of the partition into two phases, we also call this display the phase diagram. An explicit formula for the graph bounding these phases is given in (3.2) below.

The phase diagram provides a convenient platform for comparing different procedures. A threshold choice is optimal if it gives the same partition of phase space as the one obtained with the ideal choice of threshold.

How does HCT compare to the ideal threshold, and what partition in the phase space does HCT yield? For reasons of space, we focus in this paper on the Ideal HC threshold, which is obtained upon replacing the empirical distribution of feature -scores by it expected value. The Ideal HC threshold is thus the threshold which HCT is ‘trying to estimate’; in the companion paper [12] we give a full analysis showing that the ideal HCT and HCT are close.

The central surprise of our story is that HC behaves surprisingly well: the partition of phase space describing the two regions where ideal thresholding fails and/or succeeds also describes the two regions where Ideal HCT fails and/or succeeds in classifying accurately. The situation is depicted in the table below:

Here by ‘succeeds’, we mean asymptotically zero misclassification rate and by ‘fails’, we mean asymptotically 50% misclassification rate.

In this sense of size of regions of success, HCT is just as good as the ideal threshold. Such statements cannot be made for some other popular thresholding schemes, such as False Discovery threshold selection. As will be shown in [12] even the very popular Cross-Validated choice of Threshold will fail if the training set size is bounded, while HCT will still succeed in the RW model in that case.

The full proof of a broader set of claims – with a considerably more general treatment – will appear elsewhere. To elaborate the whole story on HCT needs three connected papers including [11], [12], and the current one. In [11], we reported numerical results both with simulated data from the RW model and with certain real data often used as standard benchmarks for classifier performance. In [12], we will develop a more mathematical treatment of many results we cite here and in [11]. The current article, logically second in the triology, develops an analysis of Ideal HCT which is both transparent and which provides the key insights underlying our lengthy arguments in [12]. We also take the time to explain the notions of phase diagram and phase regions. We believe this paper will be helpful to readers who want to understand HCT and its performance, but who would be overwhelmed by the epsilontics of the analysis in [12].

The paper is organized as follows. Section 2 introduces a functional framework and several ideal quantities. These include the proxy classification error where Fisher’s separation (SEP) [3] plays a key role, the ideal threshold as a proxy for the optimal threshold, and the ideal HCT as a proxy of the HCT. Section 3 introduces the main results on the asymptotic behavior of the HC threshold under the asymptotic RW model, and the focal point is the phase diagram. Section 4 outlines the basic idea behind the main results followed by the proofs. Section 5 discuss the connection between the ideal threshold and the ideal HCT. Section 6 discusses the ideal behavior of Bonferroni threshold feature selection and FDR-controlling feature selection. Section 7 discusses the link between ideal HCT and ordinary HCT, the finite phase diagram, and other appearances of HC in recent literature.

## 2 Sep functional and ideal threshold

Suppose is a fixed, nonrandom linear classifier, with decision boundary . Will correctly classify the future realization from simple model of Section 1.3. Then equiprobable and . The misclassification probability can be written

 P{YL(X)<0|μ}=Φ(−12Sep(L;μ)), (2.1)

where denotes the standard normal distribution function and measures the standardized interclass distance:

 Sep(L;μ) = E{L(X)|Y=1}−E{L(X)|Y=−1}SD(L(X)) = 2∑w(j)μ(j)(∑w(j)2)1/2=2⟨w,μ⟩∥w∥2,

The ideal linear classifier with feature weights , and decision threshold implements the likelihood ratio test. It also maximizes , since for every other linear classifier , .

### 2.1 Certainty-equivalent heuristic

Threshold selection rules give random linear classifiers: the classifier weight vector is a random variable, because it depends on the -scores of the realized training sample. If denotes a linear classifier constructed based on such a realized vector of -scores, then the misclassification error can be written as

 Err(LZ,μ)=Φ(−12Sep(LZ;μ)); (2.3)

this is a random variable depending on and on . Heuristically, because there is a large number of coordinates, some statistical regularity appears, and we anticipate that random quantities can be replaced by expected values. We proceed as if

 Sep(LZ;μ)≈2EZ|μ⟨w,μ⟩(EZ|μw2)1/2 (2.4)

where the expectation is over the conditional distribution of conditioned on . Our next step derives from the fact that itself is random, having about nonzero coordinates, in random locations. Now as we write heuristically

 EZ|μ⟨w,μ⟩≈p⋅ϵ⋅μ0⋅E{ηt(Z1)|μ1=μ0},

while

 EZ|μ⟨w,w⟩≈p⋅(ϵ⋅E{η2t(Z1)|μ1=μ0}+(1−ϵ)⋅E{η2t(Z1)|μ1=0}).
###### Definition 2.1

Let the threshold be fixed and chosen independently of the training set. In the model we use the following expressions for proxy separation

 ˜Sep(t;ϵ,τ)=2A√B,

where

 A(t,ϵ,τ)=ϵ⋅τ⋅E{ηt(τ+W)}.
 B(t,ϵ,τ)=ϵ⋅E{η2t(τ+W)}+(1−ϵ)E{η2t(W)}.

and denotes a standard normal random variable. By proxy classification error we mean

 ˜Err(t;ϵ,τ,p,n)=Φ(−12√pn⋅˜Sep(t,ϵ,τ)).

Normalizations are chosen here so that, in large samples

 Sep(LZ,μ)≈√pn⋅˜Sep(t,ϵ,τ).

While ordinarily, we expect averages to “behave like expectations” in large samples, we use the word proxy to remind us that there is a difference (presumably small). Software to compute these proxy expressions has been developed by the authors, and some numerical results using them were reported in [11].

Of course, the rationale for our interest in these proxy expressions is our heuristic understanding that they accurately describe the exact large-sample behavior of certain threshold selection schemes. This issue is settled in the affirmative, after considerable effort, in [12].

### 2.2 Certainty-equivalent threshold functionals

In general the best threshold to use in a given instance of the RW model depends on both and . It also depends on the specific realization of and even of . However, dependence on and is simply “noise” that goes away in large samples, while the dependence on and remains.

###### Definition 2.2

The ideal threshold functional maximizes the proxy separation

 Tideal(ϵ,τ)=arg maxt˜Sep(t,ϵ,τ).

Heuristically, represents a near-optimal threshold in all sufficiently large samples; it is what we “ought” to be attempting to use.

###### Definition 2.3

Folding. The following concepts and notations will be used in connection with distributions of absolute values of random variables.

• The Half Normal distribution function .

• The noncentral Half-Normal distribution .

• Given a distribution function , the folded distribution is . The Half Normal is the folded version of the standard Normal, and the noncentral Half Normal is the folded version of a Normal with unit standard deviation and nonzero mean equal to the noncentrality parameter.

• Let denote the 2-point mixture

 Fϵ,τ(t)=(1−ϵ)Φ(t)+ϵΦ(t−τ);

denotes the corresponding folded distribution:

 Gϵ,τ(t)=(1−ϵ)Ψ0(t)+ϵΨτ(t),

We now define an HCT functional representing the target that HC thresholding aims for.

###### Definition 2.4

Let be a distribution function which is not the standard normal . At such a distribution, we define the HCT functional by

 THC(F)=argmaxt>t0¯G(t)−¯Ψ(t)√G(t)⋅¯G(t);

here is the folding of , and is a fixed parameter of the HCT method (eg. ). The HC threshold in the model may be written, in an abuse of notation,

 THC(ϵ,τ)

meaning .

Let denote the usual empirical distribution of the feature -scores . The HCT of Definition 1.3 can be written as . Let denote the expected value of ; then will be called the ideal HC threshold. Heuristically, we expect the usual sampling fluctuations and that

 THC(Fn,p)≈THC(F)

with a discrepancy decaying as and increase. This issue is carefully considered in the companion paper [12], which shows that the empirical HC threshold in the ARW model indeed closely matches the ideal HC threshold.

For comparison purposes, we considered two other threshold schemes. First, (ideal) False-Discovery Rate thresholding. For a threshold , and parameters , the expected number of useful features selected is

 E(TP)(t;ϵ,τ,p)=p⋅ϵ⋅¯Gϵ,τ(t);

and the expected number of useless features selected is

 E(FP)(t;ϵ,τ,p)=p⋅(1−ϵ)⋅¯Ψ(t).

Let denote the expected rate of useful features above threshold and denote the expected rate of usless features above threshold. In analogy with our earlier heuristic, we define the proxy False Discovery Rate (FDR)

 ˜FDR(t;ϵ,τ,p)=FPR(t)TPR(t)+FPR(t)

(The term “proxy” reminds us that

 E(FP)(t)E(TP)(t)+E(FP)(t)≠EFP(t)TP(t)+FP(t),

although for large the difference will often be small.)

We define the FDRT- functional by

 TFDR,α(ϵ,τ)=min{t:˜FDR<α,t>t0}.

Heuristically, this is the threshold that FDRT is ‘trying’ to learn from noisy empirical data. We will also need the proxy Local FDR.

 ˜Lfdr(t;ϵ,τ,p)=FPR′(t)TPR′(t)+FPR′(t).

Here denotes the derivative of , which exists, using smoothness properties of ; similarly for and . Intuitively, denotes the expected fraction of useless features among those features having observed -scores near level .

Second, we considered Bonferroni-based thresholding.

 TBon=¯Φ−1(p−1).

This threshold level is set at the level that would cause on average one false alarm in a set of null cases.

In [11, Figures 2-3], we presented numerical calculations of all these functionals and their separation behavior in two cases.

• , and .

• and .

Although our calculations are exact numerical finite- calculations, we remark that they correspond to sparsity exponents and , respectively. The figures show the following.

• There is a very close numerical approximation of the HCT to the ideal threshold, not just at large but also even at quite small .

• FDR and Bonferroni thresholds behave very differently from the ideal and from HC.

• The separation behavior of the HCT is nearly ideal. For the constant FDR rules, the separation behavior is close to ideal at some but becomes noticeably sub-ideal at other .

• The False discovery rate behavior of HCT and Ideal thresholding depends on . At small , both rules tolerate a high FDR while at large , both rules obtain a small FDR.

• The Missed detection rate of HCT and Ideal thresholding also depends on . At small , the missed detection rate is high, but noticeably less than 100%. At large , the missed detection rate falls, but remains noticeably above . In contrast the MDR for FDR procedures is essentially 100% for small and falls below that of HCT/ideal for large .

These numerical examples illustrate the idealized behavior of different procedures. We can think of the HCT functional as the threshold which is being estimated by the actual HCT rule. On an actual dataset sampled from the underlying , the HC threshold will behave differently, primarly due to stochastic fluctuations . Nevertheless, the close approximation of the HCT threshold to the ideal one is striking and, to us, compelling.

## 3 Behavior of ideal threshold, asymptotic RW model

We now study the ideal threshold in the asymptotic RW model of Definition 1.4. That is, we fix parameters , in that model and study the choice of threshold maximizing class separation.

We first make precise a structural fact about the ideal threshold, first observed informally in [11].

###### Definition 3.1

ROC Curve. The feature detection receiver operating characteristic curve (ROC) is the curve parameterized by . The tangent to this curve at is

 tan(t)=TPR′(t)FPR′(t)

and the secant is

 sec(t)=TPR(t)FPR(t).

Note that in the model, , , and all depend on , , and , although we may, as here, indicate only dependence on .

###### Theorem 1

Tangent-Secant Rule. In the model. we have

 tan(TIdeal)1+tan(TIdeal)∼12(1+sec(TIdeal)1+sec(TIdeal)),p→∞. (3.1)

Here , and , as in Definition 1.4.

###### Definition 3.2

Success Region. The region of asymptotically successful ideal threshold feature selection in the plane is the interior of the subset where the ideal threshold choice obeys

 ˜Err(Tideal(ϵ,τ);ϵ,τ,p)→∞,p→∞;

here we are in the model of Definition 1.4.

The interesting range involves . The following function is important for our analysis, and has previously appeared in important roles in other (seemingly unrelated) problems; see Section 7.4.

 ρ∗(β)=⎧⎪ ⎪⎨⎪ ⎪⎩0,0<β≤1/2,β−1/2,1/2<β≤3/4,(1−√1−β)2,3/4<β<1. (3.2)

As it turns out, it marks the boundary between success and failure for threshold feature selection.

###### Theorem 2

Existence of Phases. The success region is precisely , . In the interior of the complementary region , , even the ideal threshold cannot send the proxy separation to infinity with increasing .

###### Definition 3.3

Regions I,II, III. The Success Region can be split into three regions, referred to here and below as Regions I-III. The interiors of the regions are as follows:

I.

and ; .

II.

and ; .

III.

and ; .

See Figure 2.

In the asymptotic RW model, the optimal threshold must behave asymptotically like for a certain . Surprisingly we need not have .

###### Theorem 3

Formula for Ideal Threshold. Under the Asymptotic RW model , with , the ideal threshold has the form where

 q∗={4r,Region I,(β+r)24r,Region II,III. (3.3)

Note in particular that in Regions I and II, , and hence . Although the features truly have strength , the threshold is best set higher than .

We now turn to FDR properties. The Tangent-Secant rule implies immediately

 Lfdr(TIdeal)(1+FDR(TIdeal))/2→1,p→∞. (3.4)

Hence any result about FDR is tied to one about local FDR, and vice versa.

###### Theorem 4

Under the Asymptotic RW model , at the ideal threshold proxy FDR obeys

 ˜FDR(Tideal,ϵ,τ)→⎧⎪ ⎪⎨⎪ ⎪⎩1,Region Iβ−r2r,Region II(note β/3

as , and the proxy local FDR obeys

 ˜Lfdr(Tideal,ϵ,τ)→⎧⎪ ⎪⎨⎪ ⎪⎩1,Region Ir+β4r,Region II(note β/3

as .

Several aspects of the above solution are of interest.

• Threshold Elevation. The threshold is significantly higher than in Regions I and II. Instead of looking for features at the amplitude they can be expected to have, we look for them at much higher amplitudes.

• Fractional Harvesting. Outside of Region III, we are selecting only a small fraction of the truly useful features.

• False Discovery Rate. Outside Region III, we actually have a very large false discovery rate, which is very close to in Region I. Surprisingly even though most of the selected features are useless, we still correctly classify!

• Training versus Test performance. The quantity can be interpreted as a ratio: strength of useful features in training / strength of those features in test. From (3.3) we learn that, in Region I, the selected useful features perform about half as well in training as we might expect from their performance in test.

## 4 Behavior of ideal clipping threshold

We now sketch some of the arguments involved in the full proof of the theorems stated above. In the RW model, it makes particular sense to use the clipping threshold function , since all nonzeros are known to have the same amplitude. The ideal clipping threshold is also very easy to analyze heuristically. But it turns out that all the statements in Theorems 1-3 are equally valid for all three types of threshold functions, so we prefer to explain the derivations using clipping.

### 4.1 ˜Sep in terms of true and false discoveries

In the RW model, we can express the components of the proxy separation very simply when using the clipping threshold:

 Aclip(t,ϵ,τ)=ϵ⋅τ⋅Esgn(τ+W)1{|τ+W|>t}.
 Bclip(t,ϵ,τ)=ϵ⋅E1{|τ+W|>t}+(1−ϵ)⋅E1{|W|>t}

where denotes an random variable. Recall the definitions of useful selections TP and useless selections FP; we also must count Inverted Detections, for the case where the but . Put

 E(ID)(t;ϵ,τ,p)=ϵ⋅p⋅Φ(−t−τ),

with again the standard normal distribution, and define the inverted detecton rate by . Then

 Aclip(t;ϵ,τ)=τ⋅(TPR(t)−2IDR)(t)).
 Bclip(t;ϵ,τ)=TPR(t)+FPR(t).

We arrive at an identity for in the case of clipping:

 ˜Sep(t;ϵ,τ)=(2√pτ)⋅(TPR(t)−2IDR(t))√TPR(t)+FPR(t).

We now explain Theorem 1, the Tangent-Secant rule. Consider the alternate proxy

 ¯¯¯¯¯¯¯¯¯Sep(t)=(2τ)⋅TPR(t)√TPR(t)+FPR(t)=2A(t)B1/2(t), say;

i.e. drop the term . It turns out that for the alternate proxy, the tangent secant rule and resulting FDR-Lfdr balance equation are exact identites.

###### Lemma 4.1

Let and . The threshold maximizing as a function of satisfies the Tangent-Secant rule as an exact identity; at this threshold we have

 Lfdr(talt)=12(1+FDR(talt)). (4.1)

Proof. Now, and are both smooth functions of , so at the optimizing we have

 0=B−1/2(A′−12ABB′)=B′B1/2⋅(A′B′−12AB).

By inspection for every . Hence,

 A′τB′=12AτB. (4.2)

The Tangent-Secant Rule follows. We now remark that

 FDR(t)=1− TPR(t)TPR(t)+FPR(t)=1−Aτ⋅B,

and

 Lfdr(t)=1− TPR′(t)TPR′(t)+FPR′(t)=1−A′τ⋅B′.

Display (4.1) follows

The full proof of Theorem 1, which we omit, simply shows that the discrepancy caused by has an asymptotically negligible effect; the two objectives have very similar maximizers.

### 4.2 Analysis in the asymptotic RW model

We now invoke the model: , , , , . Let . The classical Mills’ ratio can be written in terms of the Normal survival function as:

 ¯Φ(tp(q))∼p−q/tp(q),p→∞.

Correspondingly, the half Normal obeys:

 ¯Ψ0(tp(q))∼2⋅p−q/tp(q),p→∞. (4.3)

We also need a notation for poly-log terms.

###### Definition 4.1

Any occurrence of the symbol denotes a term which is and as for some . Different occurrences of this symbol may stand for different such terms.

In particular, we may well have , , as , and yet as . However, certainly , .

The following Lemma exposes the main phenomena driving Theorems 1-4. It follows by simple algebra, and several uses of Mills’ Ratio (4.3) in the convenient form .

###### Lemma 4.2

In the asymptotic RW model , we have:

1. Quasi power-law for useful feature discoveries:

 E(TP)(tq(p),ϵ,τ)=\sc PL(p)⋅pδ(q,r,β),p→∞. (4.4)

where the useful feature discovery exponent obeys

 δ(q;β,r)≡{1−β,0
2. Quasi power-law for useless feature discoveries:

 E(FP)(tq(p),ϵ,τ)=\sc PL(p)⋅p1−q,p→∞, (4.5)
3. Negligibility of inverted detections:

 E(ID)(tq(p),ϵ,τ)=o(E(TP)(tq(p),ϵ,τ)),p→∞. (4.6)

As an immediate corollary, under , we have:

 ˜Sep(tq(p),ϵ,τ)⋅(12√n)=\sc PL% (p)⋅pδ(q;β,r)√pδ(q;β,r)+p1−q,p→∞. (4.7)

On the right side of this display, the poly-log term is relatively unimportant. The driving effect is the power-law behavior of the fraction. The following Lemma contains the core idea behind the appearance of in Theorems 1 and 2, and the distinction between Regions I and Region II,III.

###### Lemma 4.3

Let . Let denote the rate at which

 pδ(q;β,r)√pδ(q;β,r)+p1−q

tends to as , for fixed , , and . Then if and only if . A choice of maximizing this ratio is given by (3.3).

Proof Sketch. By inspection of , it is enough to consider . The ratio grows to infinity with like , where

 γ(q;r,β)=δ(q;β,r)−max((1−q),δ(q;β,r))/2;

provided there exists obeying .

Let denote the value of maximizing the rate of separation:

 q∗(r,β)=argmaxq∈[0,1]γ(q;r,β);

in the event of ties, we take the smallest value of . Let’s define without recourse to the earlier formula (3.2) but instead by the functional role claimed for it by this lemma:

 ρ∗(β)=inf{r:γ(q∗(r,β);r,β)>0,r>0}. (4.8)

We will derive the earlier formula (3.2) from this. Now where

 γ1(q)=δ(q;β,r)−(1−q)/2; and γ2(q)=δ(q;β,r)/2.

We have

 γ(q∗(r,β);r,β)=maxq∈[0,1]mini=1,2γi(q;rβ). (4.9)

In dealing with this maximin, two special choices of will recur below.

• : Viewed as a function of , is maximized on the interval (use calculus!) at , and is monotone on either side of the maximum.

• : On the other hand, is monotone decreasing as a function of on . Hence the maximizing value of in (4.9) over the set of -values where achieves the minimum will occur at the minimal value of achieving the minimum, i.e. at the solution to:

 (1−q)=δ(q;β,r). (4.10)

(4.10) is satisfied uniquely on by .

The behavior of varies by cases; see Table 1. To see how the table was derived, note that

 γ1<γ2 iff δ<1−q.

Consider the first and second rows. For , . Hence on iff iff . Consider the third and fourth rows. For , . Hence on iff .

Derivatives , , are laid out in Table 2.

Table 3 presents results of formally combining the two previous tables. There are four different cases, depending on the ordering of ,, and . In only one case does the above information leave undefined. (We note that this is a purely formal calculation; Lemma 4.4, Display (4.13) below shows that rows 2 and 4 never occur.)

To see how Table 3 is derived, consider the first row. Using the derivative table above, we see that is increasing on , constant on and decreasing on . Hence the maximin value is achieved at any . For row 2, is increasing on , constant on on increasing on and decreasing on . For row 3, is constant on , increasing on and monotone decreasing on . For row 4, is increasing on and decreasing on .