How much does your data exploration overfit?
Controlling bias via information usage.
Modern data is messy and high-dimensional, and it is often not clear a priori what are the right questions to ask. Instead, the analyst typically needs to use the data to search for interesting analyses to perform and hypotheses to test. This is an adaptive process, where the choice of analysis to be performed next depends on the results of the previous analyses on the same data. Ultimately, which results are reported can be heavily influenced by the data. It is widely recognized that this process, even if well-intentioned, can lead to biases and false discoveries, contributing to the crisis of reproducibility in science. But while any data-exploration renders standard statistical theory invalid, experience suggests that different types of exploratory analysis can lead to disparate levels of bias, and the degree of bias also depends on the particulars of the data set. In this paper, we propose a general information usage framework to quantify and provably bound the bias and other error metrics of an arbitrary exploratory analysis. We prove that our mutual information based bound is tight in natural settings, and then use it to give rigorous insights into when commonly used procedures do or do not lead to substantially biased estimation. Through the lens of information usage, we analyze the bias of specific exploration procedures such as filtering, rank selection and clustering. Our general framework also naturally motivates randomization techniques that provably reduce exploration bias while preserving the utility of the data analysis. We discuss the connections between our approach and related ideas from differential privacy and blinded data analysis, and supplement our results with illustrative simulations.
Modern data is messy and high dimensional, and it is often not clear a priori what is the right analysis to perform. To extract the most insight, the analyst typically needs to perform exploratory analysis to make sense of the data and identify interesting hypotheses. This is invariably an adaptive process; patterns in the data observed in the first stages of analysis inform which tests are run next and the process iterates. Ultimately, the data itself may influence which results the analyst chooses to report, introducing researcher degrees of freedom: an additional source of over-fitting that isn’t accounted for in reported statistical estimates [simmons2011false]. Even if the analyst is well-intentioned, this exploration can lead to false discovery or large bias in reported estimates.
The practice of data-exploration is largely outside the domain of classical statistical theory. Standard tools of multiple hypothesis testing and false discovery rate (FDR) control assume that all the hypotheses to be tested, and the procedure for testing them, are chosen independently of the dataset. Any “peeking” at the data before committing to an analysis procedure renders classical statistical theory invalid. Nevertheless, data exploration is ubiquitous, and folklore and experience suggest the risk of false discoveries differs substantially depending on how the analyst explores the data. This creates a glaring gap between the messy practice of data analysis, and the standard theoretical frameworks used to understand statistical procedures. In this paper, we aim to narrow this gap. We develop a general framework based on the concept of information usage and systematically study the degree of bias introduced by different forms of exploratory analysis, in which the choice of which function of the data to report is made after observing and analyzing the dataset.
To concretely illustrate the challenges of data exploration, consider two data scientists Alice and Bob.
Example 1. Alice has a dataset of 1000 individuals for a weight-loss biomarker study. For each individual, she has their weight measured at 3 time points and the current expression values of 2000 genes assayed from blood samples. There are three possible weight changes that Alice could have looked at (the difference between time points 1 and 2, 2 and 3, or 1 and 3), but Alice decides ahead of time to only analyze the weight change between 1 and 3. She computes the correlation across individuals between the expression of each gene and the weight change, and reports the gene with the highest correlation along with its value. This is a canonical setting where we have tools for controlling error in multiple-hypothesis testing and the false-discovery rate (FDR). It is well-recognized that even if the reported gene passes the multiple-testing threshold, its correlation in independent replication studies tends to be smaller than the reported correlation in the current study. This phenomenon is also called the Winner’s Curse selection bias.
Example 2. Bob has the same data, and he performs some simple data exploration. He first uses data visualization to investigate the average expression of all the genes across all the individuals at each of the time points, and observes that there is very little difference between time 1 and 2 and there is a large jump between time 2 and 3 in the average expression. So he decides to focus on these later two time points. Next, he realizes that half of the genes always have low expression values and decides to simply filter them out. Finally, he computes the correlations between the expression of the 1000 post-filtered genes and the weight change between time 2 and 3. He selects the gene with the largest correlation and reports its value. Bob’s analysis consists of three steps and the results of each step depend on the results and decisions made in the previous steps. This adaptivity in Bob’s exploration makes it difficult to apply standard statistical frameworks. We suspect there is also a selection bias here leading to the reported correlation being systematically larger than the real correlations if those genes are tested again. How do we think about and quantify the selection bias and overfitting due to this more complex data exploration? When is it larger or smaller than Alice’s selection bias?
The toy examples of Alice and Bob illustrate several subtleties of bias due to data exploration. First, the adaptivity of Bob’s analysis makes it more difficult to quantify its bias compared to Alice’s analysis. Second, for the same analysis procedure, the amount of selection bias depends on the dataset. Take Alice for example: if across the population one gene is substantially more correlated with weight change than all other genes, then we expect the magnitude of the Winner’s Curse to decrease. Third, different steps of data exploration introduce different amounts of selection bias. Intuitively, Bob’s visualizing of aggregate expression values in the beginning should not introduce as much selection bias as his selection of the top gene at the last step.
This paper introduces a mathematical framework to formalize these intuitions and to study selection bias from data exploration. The main tool we develop is a metric of the bad information usage in the data exploration. The true signal in a dataset is the signal that is preserved in a replication dataset, and the noise is what changes across different replications. Using Shannon’s mutual information, we quantify the degree of dependence between the noise in the data and the choice of which result is reported. We then prove that the bias of an arbitrary data-exploration process is bounded by this measure of its bad information usage. This bound provides a quantitative measure of researcher degrees of freedom, and offers a single lens through which we investigate different forms of exploration.
In Section II, we present a general model of exploratory data-analysis that encompasses the procedures used by Alice and Bob. Then we define information usage and show how it upper and lower bounds various measures of bias and estimation error due to data exploration in Section IV. In Section V, we study specific examples of data exploration through the lens of information usage, which gives insight into Bob’s practices of filtering, visualization, and maximum selection. Information usage naturally motivates randomization approaches to reduce bias and we explore this in Section VI. In Section VI, we also study a model of a data analyst who–like Bob–interacts adaptively with the data many times before selecting values to report.
II A Model of Data Exploration
We consider a general framework in which a dataset $D$ is drawn from a probability distribution $\mathcal{P}$ over a set of possible datasets $\mathcal{D}$. The analyst is considering a large number of possible analyses on the data, but wants to report only the most interesting results. She decides to report the result of a single analysis, and chooses which one after observing the realized dataset, $D$, or some summary statistics of $D$. More formally, the data analyst considers $m$ functions $\phi_1, \ldots, \phi_m : \mathcal{D} \to \mathbb{R}$ of the data, where $\phi_i(D)$ denotes the output of the $i$th analysis on the realization $D$. Each function $\phi_i$ is typically called an estimator; each $\phi_i(D)$ is an estimate or statistic calculated from the sampled data, and is a random variable due to the randomness in the realization of $D$. After observing the sampled data, the analyst chooses to report the value $\phi_T(D)$ for $T \in \{1, \ldots, m\}$. The selection rule $T : \mathcal{D} \to \{1, \ldots, m\}$ captures how the analyst uses the data and chooses which result to report. Because the choice made by $T$ is itself a function of the sampled data, the reported value $\phi_{T(D)}(D)$ may be significantly biased. For example, $E[\phi_{T(D)}(D)]$ could be very far from zero even if each fixed function $\phi_i$ has zero mean.
Note that although the number of estimators $m$ is assumed to be finite, it could be arbitrarily large; in particular $m$ can be exponential in the number of samples in the dataset. The $\phi_i$’s represent the set of all estimators that the analyst potentially could have considered during the course of exploration. Also, while for simplicity we focus on the case where exactly one estimate is selected and reported, our results apply in settings where the analyst selects and reports many estimates. (Footnote 1: For example, if the analyst chooses to report several results, our framework can be used to bound the average bias of the reported values by letting $T$ be a random draw from the selected analyses.)
Example 1. For Alice, $D$ is a 1000-by-2003 matrix, where the rows are the individuals and the columns are the 2000 genes plus the three possible weight changes. Here there are $m = 2000$ potential estimators and $\phi_i$ is the correlation between the $i$th gene and the weight change between times 1 and 3. Alice’s analysis corresponds to the selection procedure $T = \arg\max_i \phi_i$.
Example 2. Bob has the same dataset $D$. Because his exploration could have led him to use any of the three possible weight-change measures, the set of potential estimators consists of the correlations between the expression of one gene and one of the three weight changes, and there are $m = 3 \times 2000 = 6000$ such $\phi_i$’s. Bob’s adaptive exploration also corresponds to a selection procedure $T$ that takes the dataset $D$ and picks out a particular correlation value $\phi_T$ to report.
Selection Bias. Denote the true value of estimator $\phi_i$ as $\mu_i = E[\phi_i(D)]$; this is the value that we expect if we apply $\phi_i$ on multiple independent replication datasets. On a particular dataset $D$, if $T(D) = i$ is the selected test, the output of data exploration is the value $\phi_i(D)$. The output and true value can be written more concisely as $\phi_T$ and $\mu_T$. The difference $\phi_T - \mu_T$ captures the error in the reported value. We are interested in quantifying the bias due to data exploration, which is defined as the average error $E[\phi_T - \mu_T]$. We will also quantify other metrics of error, such as the expected absolute error $E[|\phi_T - \mu_T|]$ or the squared error $E[(\phi_T - \mu_T)^2]$. In each case, the expectation is over all the randomness in the dataset $D$ and any intrinsic randomness in $T$.
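As a rough illustration of these three error metrics, the following sketch estimates them by Monte Carlo for a given selection rule. It assumes Gaussian estimates $\phi_i = \mu_i + \text{noise}$ with a common noise scale; the helper name `exploration_errors` and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def exploration_errors(select, mu, sigma=1.0, n_rep=20000, seed=0):
    """Monte Carlo estimates of bias, absolute error and squared error
    of the reported value phi_T under the selection rule `select`."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    # Each row is one replication dataset, summarized by its estimates:
    # phi_i = mu_i + Gaussian noise (a modelling assumption of this sketch).
    phi = mu + sigma * rng.standard_normal((n_rep, mu.size))
    # The selection rule maps the realized estimates to an index T.
    T = np.array([select(row) for row in phi])
    err = phi[np.arange(n_rep), T] - mu[T]
    return err.mean(), np.abs(err).mean(), (err ** 2).mean()

mu = np.zeros(50)  # no true signal in any of the 50 estimators
bias_max, abs_max, sq_max = exploration_errors(np.argmax, mu)
bias_fixed, abs_fixed, sq_fixed = exploration_errors(lambda row: 0, mu)

print(f"max selection:   bias={bias_max:.3f} abs={abs_max:.3f} sq={sq_max:.3f}")
print(f"fixed selection: bias={bias_fixed:.3f} abs={abs_fixed:.3f} sq={sq_fixed:.3f}")
```

A rule fixed ahead of time should show essentially zero bias while still incurring absolute and squared error from noise alone, whereas maximum selection is biased upward.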
III Related Work
There is a large body of work on methods for providing meaningful statistical inference and preventing false discovery. Much of this literature has focused on controlling the false discovery rate in multiple-hypothesis testing where the hypotheses are not adaptively chosen [benjamini1995controlling, benjamini2001control]. Another line of work studies confidence intervals and significance tests for parameter estimates in sparse high dimensional linear regression (see [belloni2014inference, van2014asymptotically, javanmard2014confidence, lockhart2014significance] and the references therein).
One recent line of work [fithian2014optimal, Taylor2015] proposes a framework for assigning significance and confidence intervals in selective inference, where model selection and significance testing are performed on the same dataset. These papers correct for selection bias by explicitly conditioning on the event that a particular model was chosen. While some powerful results can be derived in the selective inference framework (e.g. [taylor2014exact, lee2016exact]), it requires that the conditional distribution is known and can be directly analyzed. This requires that the candidate models and the selection procedure are mathematically tractable and specified by the analyst before looking at the data. Our approach does not explicitly adjust for selection bias, but it enables us to formalize insights that apply to very general selection procedures. For example, the selection rule could represent the choice made by a data-analyst, like Bob, after performing several rounds of exploratory analysis.
A powerful line of work in computer science and learning theory [bousquet2002stability, poggio2004general, shalev2010learnability] has explored the role of algorithmic stability in preventing overfitting. Related to stability is PAC-Bayes analysis, which provides powerful generalization bounds in terms of KL-divergence [mcallester2013pac]. There are two key differences between stability and our framework of information usage. First, stability is typically defined in the worst case setting and is agnostic of the data distribution. An algorithm is stable if, no matter the data distribution, changing one training point does not affect the predictions too much. Information usage gives more fine-grained bias bounds that depend on the data distribution. For example, in Section V-C we show the same learning algorithm has lower bias and lower information usage as the signal in the data increases. The second difference is that stability analysis has been traditionally applied to prediction problems—i.e. to bounding generalization loss in prediction tasks. Information usage applies to prediction—e.g. could be the squared loss of a classifier—but it also applies to model estimation where could be the value of the th parameter.
Exciting recent work in computer science [blum2015ladder, hardt2014preventing, dwork2015generalization, dwork2015reusable] has leveraged the connection between algorithmic stability and differential privacy to design specific differentially private mechanisms that reduce bias in adaptive data analysis. In this framework, the data analyst interacts with a dataset indirectly, and sees only the noisy output of a differentially private mechanism. In Section VI, we discuss how information usage also motivates using various forms of randomization to reduce bias. In the Appendix, we discuss the connections between mutual information and a recently introduced measure called max-information [dwork2015reusable]. The results from this privacy literature are designed for worst-case, adversarial data analysts. We provide guarantees that vary with the selection rule, but apply to all possible selection procedures, including ones that are not differentially private. The results in algorithmic stability and differential privacy are complementary to our framework: these approaches are specific techniques that guarantee low bias for worst-case analysts, while our framework quantifies the bias of any general data-analyst.
Finally it is also important to note the various practical approaches used in specific settings to quantify or reduce bias from exploration. Using random subsets of data for validation is a common prescription against overfitting. This is feasible if the data points are independent and identically distributed samples. However, for structured data—e.g. time-series or network data—it is not clear how to create a validation set. The bounds on overfitting we derive based on information usage do not assume independence and apply to structured data. Special cases of selection procedures corresponding to filtering by summary statistics of biomarkers [bourgon2010independent] and selection matrix factorization based on a stability criterion [wu2016stability] have been studied. The insights from these specific settings agree with our general result that low information usage limits selection bias.
IV Controlling Exploration Bias via Information Usage
Information usage upper bounds bias. In this paper, we bound the degree of bias in terms of an information-theoretic quantity: the mutual information between the choice $T$ of which estimate to report, and the actual realized value of the estimates $\phi = (\phi_1, \ldots, \phi_m)$. We state this result in a general framework, where $T$ and $\phi$ are any random variables defined on a common probability space. Let $\mu_i = E[\phi_i]$ denote the mean of $\phi_i$. Recall that a real-valued random variable $X$ is $\sigma$-sub-Gaussian if $E[e^{\lambda X}] \le e^{\lambda^2 \sigma^2 / 2}$ for all $\lambda \in \mathbb{R}$, so that the moment generating function of $X$ is dominated by that of a normal random variable. Zero-mean Gaussian random variables are sub-Gaussian, as are bounded random variables.
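Proposition 1. If $\phi_i - \mu_i$ is $\sigma$-sub-Gaussian for each $i \in \{1, \ldots, m\}$, then
$$\left| E[\phi_T - \mu_T] \right| \le \sigma \sqrt{2\, I(T; \phi)},$$
where $I$ denotes mutual information. (Footnote 2: The mutual information between two random variables $X, Y$ is defined as $I(X;Y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x)P(y)}$.)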
The randomness of $\phi$ is due to the randomness in the realization of the data $D$. This captures how each estimate varies if a replication dataset is collected, and hence captures the noise in the statistics. The mutual information $I(T;\phi)$, which we call information usage, then quantifies the dependence of the selection process on the noise in the estimates. Intuitively, a selection process that is more sensitive to the noise (high $I(T;\phi)$) is at a greater risk for bias. We will also refer to $I(T;\phi)$ as bad information usage to highlight the intuition that it really captures how much information about the noise in the data goes into selecting which estimate to report. We normally think of data analysis as trying to extract the good information, i.e. the true signal, from data. The more bad information is used, the more likely the analysis procedure is to overfit.
When $T$ is determined entirely from the values $\phi$, mutual information $I(T;\phi)$ is equal to entropy $H(T)$. This quantifies how much $T$ varies over different independent replications of the data.
The parameter $\sigma$ provides the natural scaling for the values of $\phi_i$. The condition that $\phi_i - \mu_i$ is $\sigma$-sub-Gaussian ensures that its tail is not too heavy. (Footnote 3: A random variable $X$ is said to be $\sigma$-sub-Gaussian if $E[e^{\lambda X}] \le e^{\lambda^2 \sigma^2 / 2}$ for all $\lambda$.) In the Appendix, we show how this condition can be relaxed to treat cases where $\phi_i - \mu_i$ is a sub-exponential random variable (Proposition 9) as well as settings where the $\phi_i$’s have different scaling $\sigma_i$’s (Proposition 8).
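Proposition 1 applies in a very general setting. The magnitude of overfitting depends on the generating distribution of the data set, and on the size of the data, and this is all implicitly captured by the mutual information $I(T;\phi)$. For example, a common type of estimate of interest is $\phi_i = n^{-1} \sum_{j=1}^{n} f_i(X_j)$, the sample average of some function $f_i$ based on an iid sequence $X_1, \ldots, X_n$. Note that if $f_i(X_j) - E[f_i(X_j)]$ is sub-Gaussian with parameter $\sigma$, then $\phi_i - \mu_i$ is sub-Gaussian with parameter $\sigma / \sqrt{n}$ and therefore
$$\left| E[\phi_T - \mu_T] \right| \le \sigma \sqrt{\frac{2\, I(T; \phi)}{n}}.$$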
To illustrate Proposition 1, we consider two extreme settings: one where $T$ is chosen independently of the data and one where $T$ heavily depends on the values of all the $\phi_i$’s. The subsequent sections will investigate the applications of information usage in depth in settings that interpolate between these two extremes.
Example: data-agnostic exploration. Suppose $T$ is independent of $\phi$. This may happen if the choice of which estimate to report is decided ahead of time and cannot change based on the actual data. It may also occur when the dataset can be split into two statistically independent parts, and separate parts are reserved for data exploration and estimation. In such cases, one expects there is no bias because the selection does not depend on the actual values of the estimates. This is reflected in our bound: since $T$ is independent of $\phi$, $I(T;\phi) = 0$ and therefore $E[\phi_T - \mu_T] = 0$.
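The sample-splitting case can be checked numerically. The sketch below is only an illustration under assumed settings (i.i.d. standard normal data with all true means equal to zero, and hypothetical sizes `n`, `m`, `n_rep`): it compares selecting and reporting on the same data with selecting on one half and reporting on the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, n_rep = 40, 100, 1000  # hypothetical sample size, estimators, replications

# Pure-noise data: every true mean mu_i is 0, so any nonzero average is bias.
X = rng.standard_normal((n_rep, n, m))

# Same-data selection: pick the largest sample mean and report that same value.
full_means = X.mean(axis=1)
bias_same = full_means.max(axis=1).mean()

# Sample splitting: choose T on the first half, report the mean on the second half.
explore_means = X[:, : n // 2, :].mean(axis=1)
holdout_means = X[:, n // 2 :, :].mean(axis=1)
T = explore_means.argmax(axis=1)
bias_split = holdout_means[np.arange(n_rep), T].mean()

print(f"same-data bias: {bias_same:.3f}, split-sample bias: {bias_split:.3f}")
```

Because the held-out half is independent of the half used to choose $T$, the split-sample estimate of bias should be indistinguishable from zero up to Monte Carlo error.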
Example: maximum of Gaussians. Suppose each $\phi_i$ is an independent sample from the zero-mean normal $N(0, \sigma^2)$. If $T = \arg\max_i \phi_i$, then $I(T;\phi) = H(T) = \log m$ because all $\phi_i$’s are symmetric and have equal chance of being selected by $T$. Applying Proposition 1 gives $E[\max_i \phi_i] \le \sigma \sqrt{2 \log m}$. This is the well known inequality for the maximum of Gaussian random variables. Moreover, it is also known that this inequality approaches equality as the number of Gaussians, $m$, increases, implying that the information usage $I(T;\phi)$ precisely measures the bias of max-selection in this setting. It is illustrative to also consider a more general selection $T$ which first ranks the $\phi_i$’s from the largest to the smallest and then uniformly randomly selects one of the $m_0$ largest $\phi_i$’s to report. Here $I(T;\phi) = H(T) - H(T \mid \phi)$, where $H(T) = \log m$ (by the symmetry of $\phi_i$ as before) and $H(T \mid \phi) = \log m_0$ (since given the values of $\phi_i$’s there is still uniform randomness over which of the top $m_0$ is selected). We immediately have the following corollary.
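Corollary. Suppose for each $i$, $\phi_i$ is a zero-centered sub-Gaussian random variable with parameter $\sigma$. Let $\phi_{(1)} \ge \phi_{(2)} \ge \cdots \ge \phi_{(m)}$ denote the values of $\phi$ sorted from the largest to the smallest. Then
$$E\left[ \frac{1}{m_0} \sum_{i=1}^{m_0} \phi_{(i)} \right] \le \sigma \sqrt{2 \log \frac{m}{m_0}}.$$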
In Appendix C, we show that this bound is also tight as $m$ and $m_0$ increase.
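The two bounds in this example can be compared against simulation. The sketch below is a minimal check, assuming independent $N(0, \sigma^2)$ estimates and hypothetical values of `m`, `m0` and the number of replications; it computes the average of the maximum and the average of the top `m0` values alongside $\sigma\sqrt{2\log m}$ and $\sigma\sqrt{2\log(m/m_0)}$.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, m, m0, n_rep = 1.0, 1000, 10, 5000  # hypothetical settings

# Each row holds m independent zero-mean Gaussian estimates (no true signal).
phi = sigma * rng.standard_normal((n_rep, m))

# Max selection: T = argmax_i phi_i, so the information usage is log m.
bias_max = phi.max(axis=1).mean()
bound_max = sigma * np.sqrt(2 * np.log(m))

# Top-m0 selection: reporting a uniformly random one of the m0 largest values
# has expectation equal to the mean of the top-m0 order statistics.
top = np.sort(phi, axis=1)[:, -m0:]
bias_top = top.mean()
bound_top = sigma * np.sqrt(2 * np.log(m / m0))

print(f"max:    bias={bias_max:.3f}  bound={bound_max:.3f}")
print(f"top-{m0}: bias={bias_top:.3f}  bound={bound_top:.3f}")
```

Randomizing over the top `m0` values lowers both the information usage and the observed bias relative to reporting the single maximum.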
Information usage bounds other metrics of exploration error. So far we have discussed how mutual information $I(T;\phi)$ upper bounds the bias $E[\phi_T - \mu_T]$. In different application settings, it might be useful to control other measures of exploration error, such as the absolute error deviation $E[|\phi_T - \mu_T|]$ and the squared error $E[(\phi_T - \mu_T)^2]$.
Here we extend Proposition 1 and show how $I(T;\phi)$ and $\sigma$ can be used to bound absolute error deviation and squared error. Note that due to inherent noise even in the absence of selection bias, the absolute or squared error can be of order $\sigma$ or $\sigma^2$, respectively. The next result effectively bounds the additional error introduced by data exploration in terms of information usage.
Suppose for each , is sub-Gaussian. Then
where and are universal constants.
Information usage also lower bounds error. In the maximum of Gaussians example, we have already seen a setting where information usage precisely quantifies bias. Here we show that this is a more general phenomenon by exhibiting a much broader setting in which mutual-information lower bounds expected-error. This complements the upper bounds of Proposition 1 and Proposition 2.
Suppose where . Because is a deterministic function of , mutual information is equal to entropy. The probability is a complicated function of the mean vector , and the entropy provides a single number measuring the uncertainty in the selection process. Proposition 2 upper bounds the average squared distance between and by entropy. The next proposition provides a matching lower bound, and therefore establishes a fundamental link between information usage and selection-risk in a natural family of models.
Let where . There exist universal numerical constants , , , and such that for any and ,
Recall that the entropy of $T$ is defined as
$$H(T) = \sum_{i=1}^{m} P(T = i) \log \frac{1}{P(T = i)}.$$
Here $\log(1 / P(T = i))$ is often interpreted as the “surprise” associated with the event $\{T = i\}$ and entropy is interpreted as expected surprise in the realization of $T$. Proposition 3 relies on a link between the surprise associated with the selection of statistic $i$, and the squared error on events when it is selected.
To understand this result, it is instructive to instead consider a simpler setting; imagine , always, , and the selection rule is . When is large,
and so the surprise associated with the event scales with the squared gap between the selection threshold and the true mean of . One can show that as ,
where denotes the selection rule with threshold and if as .
In the Appendix, we investigate additional threshold-based selection policies applied to Gaussian and exponential random variables, allowing for arbitrary correlation among the ’s, and show that also provides a natural lower bound on estimation-error.
V When is bias large or small? The view from information usage
In this section, we consider several simple but commonly used procedures of feature selection and parameter estimation. In many applications, such feature selection and estimation are performed on the same dataset. Information usage provides a unified framework to understand selection bias in these settings. Our results inform when these procedures introduce significant selection bias and when they do not. The key idea is to understand which structures in the data and the selection procedure make the mutual information $I(T;\phi)$ significantly smaller than the worst-case value of $\log m$. We provide several simulation experiments as illustrations.
V-A Filtering by marginal statistics
Imagine that $T$ is chosen after observing some dataset $D$. This dataset determines the values of $\phi$, but may also contain a great deal of other information. Manipulating the mutual information shows
$$I(T; \phi) = I(T; D) - I(T; D \mid \phi),$$
where $I(T; D \mid \phi)$ captures the fraction of the uncertainty in $T$ that is explained by the data in $D$ beyond the values $\phi$. In many cases, instead of being a function of $\phi$, the choice $T$ is a function of data that is more loosely coupled with $\phi$, and therefore we expect that $I(T;\phi)$ is much smaller than $H(T)$ (which itself can be less than $\log m$).
One setting where the selection of $T$ depends on statistics of $D$ that are only loosely coupled with $\phi$ is variance based feature selection [ewasher, variancefilter]. Suppose we have $n$ samples and $m$ bio-markers. Let $X_{i,j}$ denote the value of the $i$-th bio-marker on sample $j$. Here $D = \{X_{i,j}\}$. Let $\hat{\mu}_i = n^{-1} \sum_{j=1}^{n} X_{i,j}$ be the empirical mean value of the $i$-th biomarker. We are interested in identifying the markers that show significant non-zero mean. Many studies first perform a filtering step to select only the markers that have high variance and remove the rest. The rationale is that markers that do not vary could be measurement errors or are likely to be less important. A natural question is whether such variance filtering introduces bias.
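In our framework, variance selection is exemplified by the selection rule $T = \arg\max_i \hat{\sigma}_i^2$ where $\hat{\sigma}_i^2$ is the sample variance of the $i$-th bio-marker. Here we consider the case where only the marker with the largest variance is selected, but all the discussion applies to softer selection when we select the $m_0$ markers with the largest variance. The resulting bias is $E[\hat{\mu}_T - \mu_T]$, where $\hat{\mu}_i$ is the empirical mean of the $i$-th bio-marker. Proposition 1 states that variance selection has low bias if $I(T; \hat{\mu})$ is small, which is the case if the empirical means and variances, $\hat{\mu}$ and $\hat{\sigma}^2$, are not too dependent. In fact, when the samples of each bio-marker are i.i.d. Gaussian, $\hat{\mu}$ is independent of $\hat{\sigma}^2$. Therefore $I(T; \hat{\mu}) = 0$ and we can guarantee that there is no bias from variance selection.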
This illustrates an important point that the bias bound depends on $I(T;\phi)$ instead of $I(T;D)$. The selection process may depend heavily on the dataset and $I(T;D)$ could be large. However as long as the statistics of the data used for selection have low mutual information with the estimators $\phi$, there is low bias on the reported values.
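The independence of sample means and sample variances for Gaussian data can be seen in a small simulation. The sketch below is an illustration under assumed settings (i.i.d. standard normal bio-marker values with zero true mean, hypothetical sizes); it reports the mean of the bio-marker chosen by largest sample variance and, for contrast, the mean of the bio-marker chosen by largest sample mean.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, n_rep = 20, 100, 2000  # hypothetical samples, bio-markers, replications

# X[r, j, i]: value of bio-marker i on sample j in replication r; true means are 0.
X = rng.standard_normal((n_rep, n, m))
means = X.mean(axis=1)             # empirical means, one per bio-marker
variances = X.var(axis=1, ddof=1)  # sample variances, one per bio-marker
rows = np.arange(n_rep)

# Variance filtering: select the bio-marker with the largest sample variance.
bias_variance_filter = means[rows, variances.argmax(axis=1)].mean()

# Rank selection on the means themselves, for comparison.
bias_rank = means[rows, means.argmax(axis=1)].mean()

print(f"variance filter bias: {bias_variance_filter:.3f}")
print(f"rank selection bias:  {bias_rank:.3f}")
```

Both rules depend heavily on the dataset, but only the second one depends on the noise in the reported statistics, which is what the bound tracks.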
We can apply our framework to analyze biases that arise from feature filtering more generally. A common practice in data analysis is to reduce multiple hypotheses testing burden and increase discovery power by first filtering out covariates or features that are unlikely to be relevant or interesting [bourgon2010independent]. This can be viewed as a two-step procedure. For each feature $i$, two marginal statistics are computed from the data, $\psi_i$ and $\phi_i$. Filtering corresponds to a selection protocol on $\psi = (\psi_1, \ldots, \psi_m)$. Since $I(T;\phi) \le I(\psi;\phi)$, if the $\psi_i$’s do not reveal too much information about the $\phi_i$’s then the filtering step does not create too much bias. In our example above, $\psi_i$ is the sample variance and $\phi_i$ is the sample mean of feature $i$. General principles for creating independent $\psi_i$ and $\phi_i$ are given in [bourgon2010independent].
More generally, suppose the dataset $D$ determines two sets of statistics $\psi$ and $\phi$. We report $\phi_T$ and want to quantify its bias, but the selection rule depends only on the $\psi_i$’s, i.e. $T$ can be expressed as a function of the $\psi_i$’s. This captures the general situation where data processing and feature selection uses one set of summary statistics ($\psi$) and we want to quantify the bias introduced in these steps on another set of statistics ($\phi$). The dependence structure can be expressed as a Markov chain $T - \psi - \phi$, where this notation indicates that conditioned on $\psi$, $T$ is independent of $\phi$. The data processing inequality implies $I(T;\phi) \le I(\psi;\phi)$, which, combined with our bound, formalizes the intuition that the selection rule cannot be substantially biased when $\psi$ and $\phi$ share limited information in common. However, this bound may be quite loose. We instead turn to strong data processing inequalities.
A pair of random variables $(X, Y)$ satisfies a strong data-processing inequality with contraction coefficient $\eta$ if for all random variables $U$ with $U - X - Y$,
$$I(U; Y) \le \eta\, I(U; X). \qquad (1)$$
Let $\eta(X, Y)$ be the smallest constant $\eta$ such that (1) is satisfied for all valid $U$.
The contraction coefficient satisfies several natural properties. First, it tensorizes [anantharam2013maximal]. That is, if is an independent sequence, then . Also, if and follow a Markov chain then .
Example. Suppose consists of iid random variables and is a subsample of data points. Then [kamath2015strong].
Example.(Noisy Channels) If corresponds to a binary symmetric channel with error rate then [polyanskiy2016dissipation].
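To make the contraction concrete, the sketch below numerically checks a strong data-processing inequality for a binary symmetric channel. It relies on the standard value $(1 - 2\delta)^2$ of the contraction coefficient for a channel with crossover probability $\delta$, taken here as an assumption, and on a hypothetical test family in which $U$ is a fair bit, $X$ is $U$ flipped with probability $\alpha$, and $Y$ is $X$ flipped with probability $\delta$.

```python
import numpy as np

def h2(p):
    """Binary entropy in nats, clipped so that 0 log 0 is treated as 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def ratio(alpha, delta):
    """I(U;Y) / I(U;X) for the chain U - X - Y with U a fair bit,
    X = U xor Bernoulli(alpha) and Y = X xor Bernoulli(delta)."""
    # Composing two binary symmetric channels gives another one
    # with this end-to-end flip probability.
    end_to_end = alpha * (1 - delta) + (1 - alpha) * delta
    return (np.log(2) - h2(end_to_end)) / (np.log(2) - h2(alpha))

alphas = np.linspace(0.01, 0.45, 45)
deltas = (0.05, 0.1, 0.2, 0.3, 0.4)

for d in deltas:
    print(f"delta={d:.2f}  max ratio={ratio(alphas, d).max():.4f}  "
          f"(1-2*delta)^2={(1 - 2 * d) ** 2:.4f}")

# Smallest gap between the assumed coefficient and the observed ratios.
worst_slack = min((1 - 2 * d) ** 2 - ratio(alphas, d).max() for d in deltas)
```

In this family the ratio approaches the coefficient as $\alpha$ nears $1/2$, that is, when $U$ carries very little information about $X$.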
Note that the contraction coefficient depends only on the distribution of and , and not on the selection rule . A benefit of our mutual information framework for bounding the exploration bias is that we can immediately apply Strong Data Processing to obtain tighter bounds on bias:
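Suppose $\phi_i - \mu_i$ is $\sigma$-sub-Gaussian for each $i$. Then if the selection $T$ is independent of $\phi$ conditioned on $\psi$,
$$\left| E[\phi_T - \mu_T] \right| \le \sigma \sqrt{2\, \eta(\psi, \phi)\, I(T; \psi)}.$$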
V-B Bias due to data visualization
Data visualization, using clustering for example, is a common technique to explore data and it can inform subsequent analysis. How much selection bias can be introduced by such visualization? While in principle a visualization could reveal details about every data point, a human analyst typically only extracts certain salient features from plots. For concreteness, we use clustering as an example, and imagine the analyst extracts the number of clusters $K$ from the analysis. In our framework the natural object of study is the information usage $I(K;\phi)$, since if the final selection $T$ is a function of $K$, then $I(T;\phi) \le I(K;\phi)$ by the data-processing inequality. In general, $K$ is a random variable that can take on values 1 to $n$ (if each point is assigned its own cluster). When there is structure in the data and the clustering algorithm captures it, then $K$ can be strongly concentrated around a specific number of clusters and $I(K;\phi) \le H(K) \approx 0$. In this setting, clustering is informative to the analyst but does not lead to “bad information-usage” and therefore does not increase exploration bias. This is a stylized example; if the analyst uses additional information beyond the number of clusters $K$, then the bias could increase.
V-C Rank selection with signal
Rank selection is the procedure for selecting the with the largest value (or the top ’s with the largest values). It is the simplest selection policy and the one that we are instinctively most likely to use. We have seen previously how rank selection can introduce significant bias. In the bio-marker example in Subsection V-A, suppose there is no signal in the data, so and . Under rank selection, would have a bias close to .
What is the bias of rank selection when there is signal in the data? Our framework cleanly illustrates how signal in the data can reduce rank selection bias. As before, this insight follows transparently from studying the mutual information $I(T;\phi)$. Recall that mutual information is bounded by entropy: $I(T;\phi) \le H(T)$. When the data provides a strong signal of which $T$ to select, the distribution of $T$ is far from uniform, and $H(T)$ is much smaller than its worst case value of $\log m$.
Consider the following simple example. Assume
where . The data analyst would like to identify and report the value of . To do this, she selects . When , there is no true signal in the data and is equally likely to take on any value in , . As increases, however, concentrates on , causing and the bias to diminish. We simulated this example with ’s, all but one of which are i.i.d. samples from and for . The simulation results, averaged over 1000 independent runs, are shown in Figure 1.
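A simulation in the spirit of this example can be sketched as follows. The settings are assumptions made for illustration and are not the configuration behind Figure 1: unit-variance Gaussian estimates, a single estimator whose mean equals a hypothetical `signal` value, and a plug-in estimate of the entropy of the selected index.

```python
import numpy as np

def rank_selection_stats(signal, m=200, n_rep=4000, seed=3):
    """Bias of max selection and a plug-in estimate of H(T) when one
    estimator has mean `signal` and the remaining m - 1 have mean 0."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(m)
    mu[0] = signal
    phi = mu + rng.standard_normal((n_rep, m))
    T = phi.argmax(axis=1)  # rank selection: report the largest estimate
    bias = (phi[np.arange(n_rep), T] - mu[T]).mean()
    # Empirical distribution of T across replications, then its entropy in nats.
    p = np.bincount(T, minlength=m) / n_rep
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum()
    return bias, entropy

results = {s: rank_selection_stats(s) for s in (0.0, 2.0, 4.0, 8.0)}
for s, (bias, entropy) in results.items():
    print(f"signal={s:4.1f}  bias={bias:6.3f}  H(T) estimate={entropy:5.3f}")
```

As the signal grows, the selected index concentrates on the estimator carrying the signal, and both the entropy estimate and the bias shrink toward zero. The plug-in entropy is biased downward for finite `n_rep`, so it should be read as a rough diagnostic rather than an exact value of $H(T)$.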
V-D Information usage along the Least Angle Regression path
Our analyses illustrate that in certain stylized settings, information usage tightly bounds the bias of optimization selections. Here we show that information usage also accurately captures the bias of a more complex selection procedure corresponding to Least Angle Regression (LARS) [efron2004least]. LARS is an interesting example for two reasons. First, it is widely used as a practical tool for sparse regression and is closely related to LASSO. Second, LARS composes a sequence of maximum selections and thus provides a more complex example of selection. In Figure 2, we show the simulation results for LARS under three data settings corresponding to low, medium and high signal-to-noise ratios. We use bootstrapping to empirically estimate the information usage and, since we know the ground truth of the experiment, we can easily compute the bias of LARS. As the signal in the data increases, the information usage of LARS decreases and, consistent with the predictions of our theory, the bias of LARS also decreases. Moreover, as the number of selected features increases, the average (per feature) information usage of LARS decreases and, consistent with this, the average bias of LARS also decreases monotonically. Details of the experiment are in the Appendix.
V-E Differentially private algorithms
Recent papers [dwork2014preserving, dwork2015reusable] have shown that techniques from differential privacy, which were initially inspired by the need to protect the security and privacy of datasets, can be used to develop adaptive data analysis algorithms with provable bounds on over-fitting. These differentially private algorithms satisfy worst case bounds on certain likelihood ratios, and are guaranteed to have low information-usage. On the other hand, many algorithms have low information-usage without being differentially private. Moreover, as we have seen, the exploration bias of an algorithm could be large or small depending on the particular dataset (e.g. the signal-to-noise ratio of the data) and information usage captures this. Differentially private algorithms have low information usage for all datasets, even when the analysis is designed adversarially to exploit the dataset, so this is a much stricter condition. In [dwork2015reusable], the authors also define and study a notion of max-information, which can be viewed as a worst-case analogue of mutual information. We discuss the relationship between these measures further in the Appendix.
V-F Information usage and classification overfitting
This section applies our framework to the problem of overfitting in classification. A classifier is trained on a dataset consisting of examples, with input features and corresponding labels . We consider here a setting where the features of the training examples are fixed, and study overfitting of the noisy labels. Each label is drawn independently of the other labels from an unknown distribution . A classifier associates a label with each input . The training error of a fixed classifier is
while its true error rate is
is the expected fraction of examples it mis-classifies on a random draw of the labels . The process of training a classifier corresponds to selecting, as a function of the observed data, a particular classification rule from a large family of possible rules. Such a procedure may overfit the training data, causing the average training error to be much smaller than its true error rate .
As an example, suppose each is a –dimensional feature vector, and consists of all linear classifiers of the form . A training algorithm might set by choosing the parameter vector that minimizes the number of mis-classifications on the training set. This procedure tends to overfit the noise in the training data, and as a result the average training error of can be much smaller than its true error rate. The risk of over-fitting tends to increase with the dimension , since higher dimensional models allow the algorithm to fit more complicated, but spurious, patterns in the training set.
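The overfitting effect can be illustrated with a small simulation. The sketch below swaps the linear classifiers for a finite class of fixed, randomly generated rules, which is an assumption made to keep the example short; the labels are pure noise, so every rule has true error exactly 0.5. The function name `erm_gap` and its parameters are hypothetical. It reports the average gap between the true error and the training error of the rule chosen by empirical risk minimization.

```python
import numpy as np

def erm_gap(n, num_classifiers, runs, seed=0):
    """Average (true error - training error) of the empirical risk minimizer."""
    rng = np.random.default_rng(seed)
    # Fixed predictions of each candidate rule on the n fixed inputs.
    preds = rng.integers(0, 2, size=(num_classifiers, n))
    gaps = np.empty(runs)
    for r in range(runs):
        labels = rng.integers(0, 2, size=n)        # fresh pure-noise labels
        train_err = (preds != labels).mean(axis=1)
        gaps[r] = 0.5 - train_err.min()            # true error is 0.5 for every rule
    return gaps.mean()
```

With a single rule there is no selection and the gap is close to zero; with many rules the gap grows, but it stays below the standard maximal-inequality value `sqrt(log(num_classifiers) / (2 * n))`, which is what an entropy bound of `log(num_classifiers)` on the selection would suggest.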
The field of statistical learning provides numerous bounds on the magnitude of overfitting based on more general notions of the complexity of an arbitrary function class , with the most influential being the Vapnik-Chervonenkis dimension, or VC-dimension. (The VC-dimension of is the size of the largest set it shatters. A set is shattered by if for any choice of labels , there is some with for all .) While the focus is on overfitting of the training data, similar concerns apply to overfitting the validation data.
The next proposition provides information-usage bounds on the degree of over-fitting, and then shows that mutual information is upper-bounded by the VC-dimension of . Therefore, information-usage is always constrained by function-class complexity.
Let , , and . Then,
If has VC-dimension , then
The proof of the information usage bound follows by an easy reduction to Proposition 1. The proof of the second claim relies on a known link between VC-dimension and a notion of the log-covering numbers of the function-class.
It is worth highlighting that because VC-dimension depends only on the class of functions , bounds based on this measure can’t shed light on which types of data-generating distributions and fitting procedures allow for effective generalization. Information usage depends on both, and as a result could be much smaller than VC-dimension; for example, this occurs when some classifiers in are much more likely to be selected after training than others. This can occur naturally due to properties of the training procedure, like regularization, or properties of the data-generating distribution.
V-G Approximately independent data splitting
A data scientist has access to data in the form of samples from a Markov chain. She would like to mimic the honest data-splitting she uses with i.i.d. data. To do this, she splits the sequence of samples into three consecutive parts. The first part is used for selection, the third for estimation, and the middle part is thrown away. In particular, the selection is a function of the first part only, and each reported estimate is a function of the third part only. One expects that if the discarded middle part is long, so there is a sufficient delay between the two retained samples, then the risk of bias and overfitting will be low. We’ll see that this is easy to formalize via an information usage lens.
We assume the Markov process is stationary and time homogeneous with stationary distribution . Moreover, it satisfies a uniform mixing condition
We then claim that
and so a sufficient delay between the sample used for selection and the sample used for estimation guarantees low bias. We have immediately that,
where we used that . Then, by the data processing inequality
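To make the effect of the delay concrete, the following sketch computes the exact mutual information between two samples of a stationary finite-state Markov chain separated by a given lag. This is an illustration under assumptions: it uses a small discrete chain and exact computation rather than the uniform mixing condition above, and the function name `lagged_mutual_information` is hypothetical. By the Markov property, the dependence between the selection block and the estimation block passes entirely through the two samples at the boundary of the discarded middle part, so this quantity is the relevant one for the data-processing argument.

```python
import numpy as np

def lagged_mutual_information(P, lag):
    """I(X_0; X_lag) in nats for a stationary chain with transition matrix P."""
    P = np.asarray(P, dtype=float)
    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    pi = pi / pi.sum()
    # Joint law of (X_0, X_lag) and the product of its marginals.
    joint = pi[:, None] * np.linalg.matrix_power(P, lag)
    indep = pi[:, None] * pi[None, :]
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / indep[mask])).sum())
```

For a two-state chain such as `[[0.9, 0.1], [0.2, 0.8]]`, the returned value decays rapidly as `lag` grows, which is the sense in which a sufficiently long discarded segment makes the split approximately independent.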
V-H Bias control via FDR control
There has been intense interest in large-scale hypothesis testing procedures that control the false-discovery rate. Here we consider the bias and error incurred when estimation is performed after variables are selected in this manner, and bound this in terms of the false discovery rate and the rates of type I and type II errors.
As motivation, consider analysis of a large micro-array experiment. There is a large set of gene-expression data consisting of gene expression levels drawn from samples, where the first samples were taken from tissue with a cancerous tumor and the remaining were taken from healthy tissue. A scientist would like to identify genes with large differential between the expression levels across the two tissue types. She casts this as a multiple hypothesis testing problem, where rejecting a given null hypothesis indicates strength of evidence that an observed differential is unlikely due to random chance. Many procedures exist to control the false discovery rate, which is the expected proportion of type I errors among rejected null hypotheses.
Consider for example the procedure proposed by Benjamini and Hochberg. One first constructs p-values for separate hypothesis testing problems. These are then sorted as . To guarantee the false discovery rate is controlled at some level , their procedure specifies the selection of the first hypotheses, where is the largest number such that . Framed differently, all hypotheses with p-values less than a random threshold are rejected. To gain some insight, let us consider a simple model where each p-value is drawn either from a uniform distribution (i.e. the null distribution) or an alternative distribution . Consider an asymptotic regime where the number of hypotheses grows to infinity, but the proportion of hypotheses following the null distribution stays fixed. Then [genovese2002operating] show that under regularity conditions on , the random threshold converges in probability to a deterministic limit . Therefore, the rates of type I and type II errors, as well as the proportion of false discoveries, all tend to fixed levels asymptotically. Whether a particular hypothesis is accepted or rejected is still random and data-dependent, but when the number of hypotheses is large the overall proportions are nearly deterministic.
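The Benjamini and Hochberg step-up rule described above can be sketched as follows. The names `benjamini_hochberg`, `alpha` (the target false discovery rate level) and `m` (the number of hypotheses) are chosen here for illustration, and the rule implemented is the standard one: reject the `k` smallest p-values, where `k` is the largest index whose sorted p-value is at most `k * alpha / m`.

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha):
    """Return a boolean mask of hypotheses rejected by the BH step-up rule."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices of sorted p-values
    thresholds = alpha * np.arange(1, m + 1) / m   # k * alpha / m for k = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0]) + 1       # largest k meeting its threshold
        reject[order[:k]] = True                   # reject the k smallest p-values
    return reject
```

Note the step-up behavior: a p-value can be rejected even if it exceeds its own threshold, provided some larger sorted p-value meets its threshold. The value `alpha * k / m` plays the role of the random, data-dependent threshold mentioned in the text.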
We consider a more abstract framework. There is some random matrix , and a vector that is a function of with . The indices are partitioned into two sets and . A selection procedure is a map , where indicates variable was selected. We set to be the set of selected variables and to be its complement.
To form the analogy with the story above, we think of as a vector of summary statistics of the columns of (e.g. the observed gene expression differential between tumor tissue and healthy tissue) and think of as the set for which the null distribution holds (e.g. across repeated samples there would not be an observed differential). The set of selected variables is the set for which the null hypothesis was rejected. Set and to be analogues of the proportion of type I and type II errors. Note that is the fraction of false discoveries relative to the total number of nulls, and is different from what is called the False Discovery Proportion or FDP. To simplify the discussion, we assume there is always at least one selected variable, so is nonempty. We are interested in the average error or bias in reported estimates among selected variables, which leads to the study of quantities like
These can be rewritten as , or where, conditioned on , is drawn uniformly at random from the set of selected variables . This leads naturally to the study of information usage , which bounds these quantities. The quantities in (1) reflect whether the estimation procedures applied to the selected variables produce accurate results on average. For this reason, we are able to provide meaningful guarantees that do not degrade as , a regime in which it is impossible to guarantee that every selected variable is estimated accurately.
Now, let us define to be the false discovery rate. This is the expected proportion of selected variables that are contained within the null set . The next lemma bounds information usage in terms of the false discovery rate, the rates of type I and II error, and an extra error term that vanishes as the random proportions of realized type I and II errors concentrate around their expected values. A short proof is given in Appendix E.
For the FDR control problem defined above,
where denotes the binary entropy function, and denote the type I and II error proportions relative to the total number of true nulls and true alternatives, respectively. The error term is
This result further formalizes the insight that estimation after selection is unlikely to overfit in settings where the selection procedure works reliably. When the rates of false discovery, type I error, and type II error are small, information usage is guaranteed to also be low. The implied bounds on estimation error after selection grow smoothly as the reliability of the selection procedure degrades.
VI Limiting information usage and bias via randomization
We have seen how information usage provides a unified framework to investigate the magnitude of exploration bias across different analysis procedures and datasets. It also suggests that methods that reduce the mutual information between the selection and the statistics can reduce bias. In this section, we explore simple procedures that leverage randomization to reduce information usage and hence bias, while still preserving the utility of the data analysis.
We first revisit the rank-selection policy considered in the previous subsection, and derive a variant of this scheme that uses randomization to limit information usage. We then consider a model of a human data analyst who interacts sequentially with the data. We use a stylized model to show that, even if the analyst’s procedure is unknown or difficult to describe, adding noise during the data-exploration process can provably limit the bias incurred. Many authors have investigated adding noise as a technique to reduce selection bias in specialized settings [dwork2014preserving, chaudhuri2011differentially]. The main goal of this section is to illustrate how the effects of adding noise are transparent through the lens of information usage.
VI-A Regularization via randomized selection
Subsection V-C illustrates how signal in the data intrinsically reduces the bias of rank selection by reducing the entropy term in the decomposition of mutual information as entropy minus conditional entropy. A complementary approach to potentially reduce bias is to increase the conditional entropy by adding randomization to the selection policy. Note that while this randomization increases the conditional entropy of the selection given the data, it may also increase the entropy of the selection, and thus could increase information usage. It is easy to maximize conditional entropy by choosing the selection uniformly at random from all candidates, independently of the data. Imagine however that we want not only to ensure that conditional entropy is large, but also to choose the selection so that the selected value is large. After observing the statistics, it is natural then to set the probability of selecting each candidate by solving a maximization problem
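The solution to this problem is the maximum entropy or “Gibbs” distribution, which sets the probability $\pi_i$ of selecting candidate $i$, given the observed values $\phi_1, \dots, \phi_m$, to
$$\pi_i = \frac{\exp(\beta \phi_i)}{\sum_{j=1}^{m} \exp(\beta \phi_j)} \qquad (2)$$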
for that is chosen so that . This procedure effectively adds stability, or a kind of regularization, to the selection strategy by adding randomization. Whereas tiny perturbations to may change the identity of , the distribution is relatively insensitive to small changes in . Note that the strategy (2) is one of the most widely studied algorithms in the field of online learning [cesa2006prediction], where it is often called exponential weights. It is also known as the exponential mechanism in differential privacy. In our framework it is transparent how it reduces bias.
To illustrate the effect of randomized selection, we use simulations to explore the tradeoff between bias and accuracy. We consider the following simple, max-entropy randomization scheme:
Take as input parameters and , and observations . Here is the inverse temperature in the Gibbs distribution and is the number of ’s we need to select.
Sample without replacement indices from given in (2). Report the corresponding values .
We consider settings where we have two groups of ’s: after relabeling assume that and for . We define the bias of the selection to be and the accuracy of the selection to be , which is the fraction of reported with true signal . In Figure 3, we illustrate the tradeoff between accuracy and bias for (i.e. there are many more false signals than true signals), randomization strength , and the signal strength varying from 1 to 5. Consistent with the theoretical analysis, max-entropy selection significantly decreased bias. In the low signal regime (), both rank selection and max-entropy selection have low accuracy because the signal is overwhelmed by the large number of false positives. In the high signal regime (), both selection methods have accuracy close to one and max-entropy selection has significantly less bias. In the intermediate regime (), max-entropy selection has substantially less bias but is less accurate than rank selection.
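The randomized selection scheme above can be sketched in a few lines. This is an illustrative sketch under assumptions: the names `gibbs_weights`, `max_entropy_select`, `beta` (inverse temperature) and `k` (number of indices to select) are hypothetical, and sampling without replacement is delegated to NumPy's `Generator.choice` with a probability vector, which may differ in detail from the sampling used in the paper's experiments.

```python
import numpy as np

def gibbs_weights(phi, beta):
    """Gibbs (exponential-weights) probabilities proportional to exp(beta * phi_i)."""
    phi = np.asarray(phi, dtype=float)
    z = beta * (phi - phi.max())       # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def max_entropy_select(phi, beta, k, rng):
    """Sample k distinct indices, favoring large phi values; beta = 0 is uniform."""
    phi = np.asarray(phi, dtype=float)
    probs = gibbs_weights(phi, beta)
    # Raises ValueError if fewer than k probabilities are nonzero after underflow.
    return rng.choice(phi.size, size=k, replace=False, p=probs)
```

Setting `beta = 0` ignores the data entirely, while a very large `beta` approaches rank selection; intermediate values trade accuracy of the selection against bias, as in the simulations described above.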
Formally, unless the Gibbs distribution is degenerate with probability 1,
so information usage is strictly smaller than its worst-case value of . It is worth highlighting, however, that the Gibbs mechanism described above does not reduce bias or information usage for all possible data–generating distributions because it could increase entropy .
VI-B Randomization for a multi-step analyst
We next study how randomization can decrease information usage and bias even when we have very little knowledge of what the analyst is doing. To illustrate this idea, we analyze in detail a simple example of a very flexible data analyst who performs multiple steps of analysis. Flexibility in multi-step data analysis presents a challenge to current statistical approaches for quantifying selection bias. Recent developments in post-selection inference have focused on settings where the selection rule is simple and analytically tractable, and the full analysis procedure is fixed and specified before any data analysis is performed. While powerful results can be derived in this framework, including exact bias corrections and valid post-selection confidence intervals [fithian2014optimal, Taylor2015], these methods do not apply to exploratory analysis where the procedure can be quite flexible.
In this section, we show how our mutual information framework can be used to analyze bias for a flexible multi-step analyst. We show that even if one does not know, or can’t fully describe, the selection procedure , one can control its bias by controlling the information it uses. The main idea is to inject a small amount of randomization at each step of the analysis. This randomization is guaranteed to keep the bad information usage low no matter what the analyst does.
The idea of adding randomization during data analysis to reduce overfitting has been implemented as a practical rule of thumb in several communities. Particle physicists, for example, have advocated blind data analysis: when deciding which results to report, the analyst interacts with a dataset that has been obfuscated through various means, such as adding noise to observations, removing some data points, or switching data-labels. The raw, uncorrupted, dataset is only used in computing the final reported values [maccoun2015blind]. Adding noise is also closely related to a recent line of work inspired by differential privacy [blum2015ladder, dwork2015generalization, dwork2015reusable, hardt2014preventing].
A model of a flexible, multi-step analyst. We consider a model of adaptive data analysis similar to that of [dwork2015reusable, dwork2015generalization]. In this setting, the analyst learns about the data by running a series of analyses on the dataset. Each analysis is modeled by a function of the data , and the choice of which analysis to run may depend on the results from all the earlier analyses. More formally, we define the model as follows:
At step 1, the analyst selects a statistic to query for and observes a result .
In the -th iteration, the analyst chooses a statistic as a function of the results that she has received so far, , and receives result .
After iterations, the analyst selects as a function of
The simplest setting is when the result of the analysis is just the value of on the data : . An example of this is the rank selection considered before. At the -th step, is queried (i.e. the order is fixed and does not depend on the previous results) and is returned. The analyst queries all ’s and returns the one with maximal value.
In general, we allow the analysis output to differ from the empirical value of the test and a particularly useful form is . This captures blind analysis settings, where the analyst intentionally adds noise throughout the data analysis in order to reduce over-fitting. A natural goal is to ensure that for every query used in the adaptive analysis, the reported result is close to the true value . We will show through analyzing the information usage that noise addition can indeed guarantee such accuracy.
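The interaction model can be sketched as a small query-answering object. Everything named here is hypothetical and chosen for illustration: the class `NoisyOracle`, the parameter `noise_std`, and the choice of independent Gaussian noise (one natural instance of a noisy response, but an assumption at this point in the text). The example analyst is the rank-selection analyst described above, who queries the statistics in a fixed order and selects the one with the largest response.

```python
import numpy as np

class NoisyOracle:
    """Answers statistic queries on a fixed dataset, adding independent Gaussian noise."""

    def __init__(self, data, noise_std, seed=0):
        self.data = np.asarray(data, dtype=float)
        self.noise_std = float(noise_std)
        self.rng = np.random.default_rng(seed)
        self.history = []                  # responses released to the analyst so far

    def query(self, statistic):
        exact = float(statistic(self.data))            # empirical value on the data
        response = exact + self.rng.normal(0.0, self.noise_std)
        self.history.append(response)
        return response

def rank_selection_analyst(oracle, num_columns):
    """Query every column mean in a fixed order, then select the largest response."""
    responses = [oracle.query(lambda d, j=j: d[:, j].mean())
                 for j in range(num_columns)]
    return int(np.argmax(responses)), responses
```

With `noise_std = 0` the oracle returns the empirical values and the analyst reduces to plain rank selection. A more flexible analyst could inspect `oracle.history` before choosing each next statistic; the oracle does not need to know how that choice is made.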
This adaptive analysis protocol can be viewed as a Markov chain
where recall that denotes the vector . By the data processing inequality [cover2012elements], . Therefore, a procedure that controls the mutual information between the history of feedback and the statistics will automatically control the mutual information . By exploiting the structure of the adaptive analysis model, we can decompose the cumulative mutual information into a sum of terms. This is formalized in the following composition lemma for mutual information.
Let denote the history of interaction up to time . Then, under the adaptive analysis model
The important takeaway from this lemma is that by bounding the conditional mutual information between the response and the queried value at each step, , we can bound and hence bound the bias after rounds of adaptive queries. Given a dataset , we can imagine the analyst having a (mutual) information budget, , which is decided a priori based on the size of the data and her tolerance for bias. At each step of the adaptive data analysis, the analyst’s choice of statistic to query next (as a function of her analysis history) incurs an information cost quantified by . The information costs accumulate additively over the analysis steps, until they reach , at which point the guarantee on bias requires the analysis to stop.
A trivial way to reduce mutual information is to return a response that is independent of the query , in which case the analyst learns nothing about the data and incurs no bias. However, in order for the data to be useful for the analyst, we would like the results of the queries to also be accurate.
Adding randomization to reduce bias. As before let denote the true answer of query . If each is –sub-Gaussian, then . Using Proposition 2, we can bound the average excess error of the response , by the sum of two terms,
Response accuracy degrades with distortion, a measure of the magnitude of the noise added to responses, but this distortion also controls the degree of selection bias in future rounds. We will explicitly analyze the tradeoff between these terms in a stylized case of the general model.
Gaussian noise protocol. We analyze the following special case.
Suppose and is jointly Gaussian for any .
For the th query , , the protocol returns a distorted response where . Note that unlike , the sequence is independent.
The term can be thought of as the number of samples in the data-set. Indeed, if is the empirical average of samples from a distribution, then . The ratio is the signal-to-noise ratio of the th response. We want to choose the distortion levels so as to guarantee that a large number of queries can be answered accurately. In order to do this, we will use the next lemma to relate the distortion levels to the information provided by a response. The lemma gives a form for the mutual information where and are independent Gaussian random variables. As one would expect, this shows that mutual information is very small when the variance of is much larger than the variance of . Lemma 3, provided in the Appendix, provides a similar result when is a general (not necessarily Gaussian) random variable.
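If $X \sim N(0, \sigma_X^2)$ and $Y = X + W$, where $W \sim N(0, \sigma_W^2)$ is independent of $X$, then
$$I(X; Y) = \frac{1}{2} \log\left(1 + \frac{\sigma_X^2}{\sigma_W^2}\right)$$
where $\sigma_X^2 / \sigma_W^2$ is the signal to noise ratio.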
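The relationship between noise level and information cost can be made concrete with the standard Gaussian-channel identity, under which the mutual information between a Gaussian signal and its noisy version is half the logarithm of one plus the signal-to-noise ratio. The sketch below is illustrative: the names `gaussian_mi` and `queries_within_budget` are hypothetical, and the budget calculation assumes every query is charged the same per-step cost, a simplification of the composition argument in the text.

```python
import math

def gaussian_mi(signal_var, noise_var):
    """I(X; X + W) in nats for independent Gaussians with the given variances."""
    return 0.5 * math.log1p(signal_var / noise_var)

def queries_within_budget(budget, signal_var, noise_var):
    """Number of equal-noise queries whose summed per-step cost stays within budget."""
    return int(budget // gaussian_mi(signal_var, noise_var))
```

Increasing `noise_var` lowers the cost of each response, so more queries fit within a fixed information budget, at the price of less accurate individual responses. This is the tradeoff between distortion and selection bias discussed above.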
Using Lemma 2, we provide an explicit bound on the accuracy of as a function of and . Note that this result places no restriction on the procedure that generates (