
A Rademacher Complexity Based Method for Controlling Power and Confidence Level in Adaptive Statistical Analysis

This paper was submitted to IEEE/ACM/ASA DSAA 2019 on 05/20/2019 and accepted on 07/26/2019. This research was funded by NSF Award RI-18134446, and by DARPA/USAF grant W911NF-16-1-0553.

Lorenzo De Stefani
Department of Computer Science
Brown University
Providence, United States of America
lorenzo@cs.brown.edu
   Eli Upfal
Department of Computer Science
Brown University
Providence, United States of America
eli@cs.brown.edu
Abstract

While standard statistical inference techniques and machine learning generalization bounds assume that tests are run on data selected independently of the hypotheses, practical data analysis and machine learning are usually iterative and adaptive processes where the same holdout data is often used for testing a sequence of hypotheses (or models), each of which may depend on the outcome of the previous tests on the same data. In this work, we present RadaBound, a rigorous, efficient, and practical procedure for controlling the generalization error when a holdout sample is used for multiple adaptive tests. Our solution is based on a new application of the Rademacher Complexity generalization bounds, adapted to dependent tests. We demonstrate the statistical power and practicality of our method through extensive simulations and comparisons to alternative approaches. In particular, we show that our rigorous solution is substantially more powerful and efficient than the differential privacy based approach proposed in Dwork et al. [Dwork15generalization, dwork2015reusable, dwork2015preserving].

Adaptive Analysis, Rademacher Complexity, Statistical Learning

I Introduction

The goal of data analysis and statistical learning is to model the stochastic process, or distribution, that explains the observed data. A major risk in statistical learning is overfitting, that is, learning a model that fits the observed data well but does not predict new data. The standard practice in machine learning is to split the data into training and holdout (or testing) sets. A learning algorithm then learns a model using the training data and tests the model on the holdout set to obtain a confidence interval for the expected error or for the value of the loss function of the model. If the process halts after a single iteration, then the statistical analysis is easy. However, in most cases, the learning process is iterative and adaptive. One uses successive tests for model selection, feature selection, parameter tuning, etc., and the choice of the tests themselves often depends on the outcomes of previous tests. Ideally, each hypothesis should be tested on a fresh data sample. However, it is common practice to reuse the same holdout data to evaluate a sequence of hypotheses. While widespread, this practice is known to lead to overfitting; that is, the learned model becomes representative of the sample rather than the actual process. Evaluating the accumulated error in testing a sequence of related hypotheses on the same data set is a major challenge in both machine learning and modern statistics. In machine learning, the problem of “preventing overfitting” is usually phrased and analyzed in terms of bounding the generalization error [ShalevSBD14]. In inferential statistics, the goal is controlling the Family Wise Error Rate (FWER) or the False Discovery Rate (FDR) of a sequence of hypothesis tests [benjamini1995controlling].

Our Results: We develop and analyze RadaBound, a rigorous, efficient, and practical procedure for online evaluation of the accumulated generalization error in a sequence of statistical inferences applied to the same sample. RadaBound can evaluate fully adaptive sequences of tests. The choice of a test may depend on the information obtained from previous tests, and the total number of tests is not fixed in advance.

One way to quantify the risk of overfitting after $k$ queries is by considering the probability of the condition defined by the results of the first $k$ queries. If the probability of this condition is close to 1, the results of the queries evaluated so far do not significantly restrict the sample space, and there is, therefore, no risk of overfitting. Conversely, if the probability of the observed condition is small, the sample space defined by the true distribution conditioned on the results of the queries is noticeably different from the true distribution, and there is thus a significant risk of overfitting. In general, it is hard to bound the probability of the observed condition, as the true distribution over the samples is unknown. However, in the special case in which the queries correspond to evaluating averages of functions (such as evaluating the average risk or loss functions of alternative learning procedures), we can design an adaptive process, based on an empirical estimate of the Rademacher Complexity of the set of queries, which correctly bounds this probability and halts the procedure when the risk of overfitting exceeds a threshold fixed by the user.

Our method builds on the concept of Rademacher Complexity [BartlettBL02, Koltchinskii01], which has emerged as a powerful alternative to VC-dimension and related uniform convergence methods for characterizing generalization error and sample complexity. A fundamental advantage of the Rademacher Complexity approach, in contrast to standard uniform convergence tools such as VC-dimension that capture the complexity with respect to the worst-case input distribution, is that it yields a data-dependent bound: it is computed with respect to the input (sample) distribution and can be efficiently approximated from the sample.

Our solution employs three major components: (1) for a set of functions chosen independently of the sample, the Rademacher Complexity [BartlettBL02, Koltchinskii01] provides a powerful and efficient bound on the error in estimating the expectations of all these functions using one sample; (2) as long as the outcome of the sequence of tests does not significantly overfit to the sample, conditioning on these outcomes has only a minor effect on the distribution; and (3) the Rademacher Complexity of a sequence of tests can be estimated efficiently from a given sample, requiring computation time similar to that of running the actual tests. To fully utilize our technique, we need computationally efficient methods for rigorously estimating the Rademacher Complexity from a sample. We introduce two novel methods based on Bernstein’s Inequality for martingales [freedman1975tail] and the Martingale Central Limit Theorem [hall2014martingale]. Our analysis and extensive experiments prove and demonstrate that our method guarantees statistical validity while retaining statistical power and practical efficiency.

Related Work: Classic statistics offers a variety of procedures for controlling the Family Wise Error Rate (FWER) in the context of multiple hypotheses testing, ranging from the simple Bonferroni correction [bonferroni1936teoria] to Holm’s step-down [holm1979simple] and Hochberg’s step-up [hochberg1988sharper] procedures. While controlling the FWER under weak assumptions about the hypotheses, these methods are too conservative, giving many false negative results, in particular for large sets of hypotheses. Less conservative procedures, such as Benjamini and Hochberg’s [benjamini1995controlling], which control the False Discovery Rate (FDR) (i.e., the expected fraction of false discoveries), still do not scale well to a very large number of hypotheses. Moreover, none of these procedures can be applied in the adaptive setting, as they require the set of hypotheses to be fixed at the beginning of the testing procedure (i.e., before any data evaluation).

In statistics, “sequential analysis” or “sequential hypothesis testing” is a paradigm for statistical testing where, for a fixed family of hypotheses to be tested, the sample size is not fixed in advance. Instead, data are evaluated as they are collected, and further sampling is stopped in accordance with a pre-defined stopping rule as soon as significant results are observed. Despite the sequential, iterative nature of this practice, the hypotheses being considered are fixed beforehand, so sequential analysis procedures are not suitable for adaptive analysis, where the set of queries (hypotheses) being considered depends in general on the outcome of previous evaluations of the data itself. Other “sequential” hypothesis testing procedures, such as the sequential False Discovery Rate control by G’Sell et al. [g2016sequential], assume that the order in which the hypotheses are to be evaluated is fixed beforehand, and hence cannot be adaptively selected. Similar considerations apply to the “Alpha Investing” sequential testing by Foster and Stine [foster2008alpha], which achieves control of the “marginal False Discovery Rate”. While these procedures apply to the setting of hypothesis testing, the method proposed in this work allows adaptive evaluation of statistical queries while maintaining rigorous guarantees on the accuracy of the obtained estimates.

A series of recent papers [Dwork15generalization, dwork2015reusable, dwork2015preserving] explored an interesting relation between “Differential Privacy” [dwork2006calibrating] and overfit prevention in adaptive analysis. The basic idea is to limit the user’s access to the holdout data so that the answers to the sequence of queries are differentially private. A differentially private access to the holdout data limits the risk of overfitting to that data set. Unfortunately, the practical applicability of this elegant mathematics is limited. Differential privacy is achieved through random perturbation of the data (or of the replies to the queries). The larger the number of adaptive queries, the larger the required perturbation. However, the amount of perturbation is limited by the need to preserve the actual signal in the data. As a result, rigorous application of this approach is either limited to a small number of queries or is computationally intractable [dwork2015preserving], making it less useful than alternative methods [hardt2014preventing, steinke2015interactive]. Our experiments in Section V show that RadaBound allows an orders-of-magnitude reduction in the required holdout dataset compared to Dwork et al.’s method [Dwork15generalization] while offering the same guarantees. Further, our technique is much simpler, as it does not require the introduction of additional noise. We discuss in detail the advantages of our solution compared to [Dwork15generalization] in Section VI. A more practical solution for a restricted setting inspired by machine learning competitions was presented in [blum2015ladder]. Their solution, “the Ladder”, provides a loss estimate only for submissions that make a significant improvement over the previous best. This restricted setting makes it possible to sidestep the hardness results discussed in [hardt2014preventing, steinke2015interactive]. Note, however, that the guarantee achieved by the Ladder is fundamentally different from the one achieved in Dwork et al. [dwork2015reusable] and in this work: it ensures accuracy in the relative ordering of the performance of multiple classifiers, while the latter ensure accurate evaluations of adaptively selected queries. Hence, the former is not comparable with the latter.

Paper organization: The presentation is organized as follows. In Section II we introduce our RadaBound method for adaptive statistical analysis. In Section III we discuss the use of uniform convergence bounds based on Rademacher Complexity in our setting, and we present two methods for estimating the Rademacher Complexity of a class of adaptively selected functions from the data. In Section IV we present the details of RadaBound and the guarantees it provides. In Section V, we present an experimental validation of the correctness and power of our methods using synthetic data. Finally, in Section VI we compare our approach to the state-of-the-art approach based on Differential Privacy by Dwork et al. [dwork2015reusable]. In particular, we show that a Rademacher Complexity based solution gives significantly better results than the more complicated differential privacy based solution of Dwork et al.

II The RadaBound

For concreteness, we focus on the following setup. We have a holdout sample $S = (X_1, \ldots, X_n)$ composed of $n$ independent observations, each drawn from a distribution $\mathcal{D}$, and parameters $\epsilon, \delta \in (0,1)$ fixed by the user.

The process: In an iterative process, at each step $k$, the user (or an adaptive algorithm) submits a function $f_k$ and receives the empirical average $\mathbb{E}_S[f_k] = \frac{1}{n}\sum_{i=1}^{n} f_k(X_i)$ as an estimate of the “ground truth” value $\mathbb{E}_{\mathcal{D}}[f_k]$. The user has no direct access to the sample $S$. That is, they can only acquire information regarding $S$ from the confidence intervals for the expectations $\mathbb{E}_{\mathcal{D}}[f_1], \ldots, \mathbb{E}_{\mathcal{D}}[f_k]$, which the testing procedure has returned as answers to the queries considered so far. Let $\mathcal{F}_k = \{f_1, \ldots, f_k\}$ denote the set of the first $k$ functions evaluated during the adaptive process. The maximum error in estimating the expectations of the functions in $\mathcal{F}_k$ is given by:

$$\Delta_k = \max_{f \in \mathcal{F}_k} \left| \mathbb{E}_S[f] - \mathbb{E}_{\mathcal{D}}[f] \right|.$$
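To make this setup concrete, the following Python sketch (ours, not part of the paper; all names are illustrative) computes the answers $\mathbb{E}_S[f]$ returned to the analyst and the resulting maximum estimation error for a small set of queries on a synthetic holdout sample drawn from a known distribution:

```python
import numpy as np

# Toy illustration of the setup: a holdout sample of n points, a few query
# functions with values in [-1, 1], and the maximum estimation error over
# the queries asked so far.
rng = np.random.default_rng(0)
n = 1000
sample = rng.normal(size=n)                       # holdout S, here drawn from N(0, 1)
queries = [np.tanh, np.sin, lambda x: np.clip(x, -1.0, 1.0)]

answers = [float(np.mean(f(sample))) for f in queries]  # returned estimates E_S[f]
true_means = [0.0, 0.0, 0.0]       # E_D[f] = 0 for each query, by symmetry of N(0, 1)
max_error = max(abs(a - m) for a, m in zip(answers, true_means))
print(answers, max_error)
```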

In this work, we use the expression “overfitting” as follows: a given set of functions $\mathcal{F}_k$ is said to overfit the sample $S$ if for any $f \in \mathcal{F}_k$ the value $\mathbb{E}_S[f]$ evaluated on $S$ differs from the true value $\mathbb{E}_{\mathcal{D}}[f]$ by more than the user-given threshold $\epsilon$. Our adaptive testing process halts at the first step $k$ for which it cannot guarantee that the probability of overfitting is at most $\delta$, that is, when it cannot guarantee that $\Pr\left(\Delta_k \geq \epsilon\right) \leq \delta$.

The process is fully adaptive. The choice of the function $f_k$ evaluated at the $k$-th step may depend on the information obtained during the first $k-1$ steps. We make no assumptions on the process according to which the functions are adaptively chosen to be tested, nor do we require the total number of tests to be fixed in advance. For simplicity, we assume that all functions take values in the range $[-1, 1]$. More general settings are discussed later in the paper.

Bounding the generalization error for the iterative process: The sequence of answers to the queries defines a filtration such that

The $k$-th query is chosen with respect to, and is answered in, the corresponding filtered distribution. Our first step in developing RadaBound is to adapt the Rademacher Complexity results to an iterative, adaptive sequence of queries.

Let $A_j$ denote the event that the answer to the $j$-th query was within $\epsilon$ of the correct value, that is, $A_j = \left\{\left|\mathbb{E}_S[f_j] - \mathbb{E}_{\mathcal{D}}[f_j]\right| \leq \epsilon\right\}$. Then, in the filtration process, the $k$-th query is answered conditioned on the events $A_1, \ldots, A_{k-1}$.

By the definition of the events $A_1, \ldots, A_k$ we thus have:

(1)

where $\Pr$ with no subscript refers to the probability in the unfiltered distribution $\mathcal{D}$.

The fact that the distribution of the generalization error in the adaptive case is related to the probability of an error in the non-adaptive case is not surprising. In order to fit the sample differently from the original distribution $\mathcal{D}$, the process needs to detect a pattern whose frequency in the sample is considerably different from its frequency in the actual distribution. However, the first query that observes such a pattern is chosen when the process has not yet observed a significant difference between the sample and the distribution. This is due to the fact that the process halts as soon as such a difference is detected. Thus, the probability of overfitting in $k$ queries is related to the probability that the sample gives a bad estimate for the correct value of one of the queries in the non-adaptive case.

The challenge is to compute a tight bound on this probability. We achieve this through two novel bounds on the estimate of the Rademacher Complexity of $\mathcal{F}_k$.

III Bounding $\Pr\left(\Delta_k \geq \epsilon\right)$ Using Rademacher Complexity

Our solution is based on iterative applications of Rademacher Complexity bounds.

Definition 1.

[mitzenmacher2017probability] Let $\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_n)$ be a vector of independent Rademacher random variables, such that for all $i$, $\Pr(\sigma_i = 1) = \Pr(\sigma_i = -1) = 1/2$. The Empirical Rademacher Complexity of a class of functions $\mathcal{F}$ with respect to a sample $S = (X_1, \ldots, X_n)$, with $X_i \sim \mathcal{D}$, is

$$\hat{\mathcal{R}}_S(\mathcal{F}) = \mathbb{E}_{\boldsymbol{\sigma}}\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(X_i)\right].$$

The Rademacher Complexity of $\mathcal{F}$ for samples of size $n$ is defined as $\mathcal{R}_n(\mathcal{F}) = \mathbb{E}_{S \sim \mathcal{D}^n}\left[\hat{\mathcal{R}}_S(\mathcal{F})\right]$.
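Since the expectation in Definition 1 is over the Rademacher vector only, for a small sample and a finite query class it can be computed exactly by enumerating all $2^n$ sign assignments. The following sketch (illustrative only, not the paper's code) does exactly that:

```python
import itertools
import numpy as np

def empirical_rademacher(sample, functions):
    """Exact Empirical Rademacher Complexity of a finite function class on a
    small sample, computed by enumerating all 2^n Rademacher sign vectors."""
    n = len(sample)
    vals = np.array([[f(x) for x in sample] for f in functions])  # shape (|F|, n)
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        sigma = np.array(signs)
        total += np.max(vals @ sigma) / n   # sup over F of the signed empirical average
    return total / 2 ** n                   # expectation over the uniform sign vectors

# Toy usage: three threshold queries (values in {-1, 1}) on a sample of 8 points.
rng = np.random.default_rng(0)
sample = rng.uniform(size=8)
queries = [lambda x, c=c: 1.0 if x <= c else -1.0 for c in (0.25, 0.5, 0.75)]
print(empirical_rademacher(sample, queries))
```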

The relation between the maximum estimation error and the Rademacher Complexity of $\mathcal{F}$ is given by the following results (in our setting, in order to apply the results with absolute values we assume that for any $f \in \mathcal{F}$ we also have $-f \in \mathcal{F}$, i.e., that $\mathcal{F}$ is closed under negation):

Lemma 1 (Lemma 26.2, [ShalevSBD14]).

$$\mathbb{E}_{S \sim \mathcal{D}^n}\left[\sup_{f \in \mathcal{F}}\left(\mathbb{E}_{\mathcal{D}}[f] - \mathbb{E}_S[f]\right)\right] \leq 2\,\mathcal{R}_n(\mathcal{F}).$$
Lemma 2 (Theorem 14.21, [mitzenmacher2017probability]).

Assume that for all $f \in \mathcal{F}$ and all $x$ we have $a \leq f(x) \leq b$; then:

$$\Pr\left(\sup_{f \in \mathcal{F}}\left(\mathbb{E}_{\mathcal{D}}[f] - \mathbb{E}_S[f]\right) \geq 2\,\mathcal{R}_n(\mathcal{F}) + \epsilon\right) \leq e^{-\frac{2n\epsilon^2}{(b-a)^2}}. \qquad (2)$$

Note that in our context (a) we need a one-sided bound, and (b) for all $f \in \mathcal{F}$ and all $x$ we have $f(x) \in [-1, 1]$, so that $b - a = 2$.

For algorithmic applications, two important consequences of these results are that: (a) for bounded functions the generalization error is concentrated around its expectation, and (b) the Rademacher Complexity can be estimated from the sample. In order for this bound to be usable in practical applications, it is necessary to compute an estimate of the Rademacher Complexity given the dataset $S$ and to bound its error. In the “textbook” treatment, the difference between the Rademacher Complexity and its empirical counterpart is bounded using a second application of McDiarmid’s Inequality [Barlett02, ShalevSBD14]. However, this bound is often too loose for practical applications such as ours.

In this work, we propose an alternative, direct estimate of the Rademacher Complexity, and we develop two methods for tightly bounding its estimation error.

III-A Tight bounds on the Rademacher Complexity estimate

Given a finite-size sample $S = (X_1, \ldots, X_n)$ and $t$ independent Rademacher vectors $\boldsymbol{\sigma}^1, \ldots, \boldsymbol{\sigma}^t$, each composed of $n$ independent Rademacher random variables (i.e., $\Pr(\sigma^j_i = 1) = \Pr(\sigma^j_i = -1) = 1/2$), we estimate $\hat{\mathcal{R}}_S(\mathcal{F})$ with

$$\hat{R}_t(\mathcal{F}, S) = \frac{1}{t} \sum_{j=1}^{t} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma^j_i f(X_i). \qquad (3)$$

Clearly, $\mathbb{E}_{\boldsymbol{\sigma}^1, \ldots, \boldsymbol{\sigma}^t}\left[\hat{R}_t(\mathcal{F}, S)\right] = \hat{\mathcal{R}}_S(\mathcal{F})$. To bound the error $\mathcal{R}_n(\mathcal{F}) - \hat{R}_t(\mathcal{F}, S)$, we model the process as a Doob martingale ([mitzenmacher2017probability, Chapter 13.1]) as follows:

$$Z_i = \mathbb{E}\left[\hat{R}_t(\mathcal{F}, S) \mid Y_1, \ldots, Y_i\right], \qquad i = 0, 1, \ldots, n + nt,$$

where the $Y_i$ are the random variables that determine the value of the estimate $\hat{R}_t(\mathcal{F}, S)$. The first $n$ variables correspond to the values of the sample $S$, i.e., $Y_i = X_i$ for $1 \leq i \leq n$, and the remaining $nt$ variables correspond to the Rademacher random variables $\sigma^j_i$. It is easy to verify that $Z_0 = \mathcal{R}_n(\mathcal{F})$ and $Z_{n+nt} = \hat{R}_t(\mathcal{F}, S)$.
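A direct implementation of the estimator in (3) is straightforward; the sketch below (function and variable names are ours, not the paper's) averages the supremum of the signed empirical means over $t$ freshly drawn sign vectors:

```python
import numpy as np

def rademacher_estimate(sample, functions, t, seed=0):
    """Monte Carlo estimate of the Rademacher Complexity as in (3): the average,
    over t independent Rademacher vectors, of the supremum over the class of the
    signed empirical means."""
    rng = np.random.default_rng(seed)
    n = len(sample)
    vals = np.array([[f(x) for x in sample] for f in functions])  # shape (|F|, n)
    sigmas = rng.choice([-1.0, 1.0], size=(t, n))                 # t sign vectors
    sups = (sigmas @ vals.T).max(axis=1) / n                      # one supremum per vector
    return float(sups.mean())
```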

Next, we define the martingale difference sequence $W_i = Z_i - Z_{i-1}$ with respect to the martingale $Z_0, Z_1, \ldots, Z_{n+nt}$, and note that $\sum_{i=1}^{n+nt} W_i = \hat{R}_t(\mathcal{F}, S) - \mathcal{R}_n(\mathcal{F})$.

Application of Bernstein’s Inequality for Martingales: Our first bound builds on Bernstein’s Inequality for Martingales (BIM). We use the following version due to Freedman [freedman1975tail], as presented in [dzhaparidze2001bernstein] and adapted to one-sided error.

Theorem 1.

[dzhaparidze2001bernstein, freedman1975tail] Let $W_1, W_2, \ldots, W_m$ be a martingale difference sequence with respect to a certain filtration $\mathcal{G}_0 \subseteq \mathcal{G}_1 \subseteq \cdots \subseteq \mathcal{G}_m$.

Thus, $\mathbb{E}\left[W_i \mid \mathcal{G}_{i-1}\right] = 0$ for $1 \leq i \leq m$. The process $Z_k = \sum_{i=1}^{k} W_i$ is thus a martingale with respect to this filtration. Further, assume that $|W_i| \leq M$ for $1 \leq i \leq m$, and that the conditional variance satisfies $\sum_{i=1}^{m} \mathbb{E}\left[W_i^2 \mid \mathcal{G}_{i-1}\right] \leq V$. For $\epsilon > 0$, we have:

$$\Pr\left(\sum_{i=1}^{m} W_i \geq \epsilon\right) \leq \exp\left(-\frac{\epsilon^2}{2\left(V + M\epsilon/3\right)}\right) \qquad (4)$$

Note that the bound presented here is slightly different from the one in [freedman1975tail], as for our purposes we only require a one-sided bound. A careful analysis of the conditional variance in our application allows us to obtain a significantly stronger bound than the one obtained using McDiarmid’s Inequality [Barlett02, ShalevSBD14], which depends on the maximum variation of the martingale.
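For reference, the one-sided bound (4), in the form stated above, is easy to evaluate numerically; the helper below is a sketch under that stated form, with purely illustrative inputs (the variance and increment values are not taken from the paper):

```python
import math

def freedman_tail_bound(eps, variance_bound, max_increment):
    """One-sided Bernstein/Freedman tail bound for a martingale whose total
    conditional variance is at most `variance_bound` and whose increments are
    bounded in absolute value by `max_increment`."""
    return math.exp(-eps ** 2 / (2.0 * (variance_bound + max_increment * eps / 3.0)))

# Illustrative values only.
print(freedman_tail_bound(eps=0.05, variance_bound=1e-3, max_increment=2e-3))
```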

Theorem 2.

Given a sample $S$ of size $n$, a family of functions $\mathcal{F}$ which take values in $[-1, 1]$, $t$ independent vectors of Rademacher random variables, and $\epsilon > 0$, we have:

(5)
Proof:

Recall the definition of the Doob martingale

for $0 \leq i \leq n + nt$, and the definition of the corresponding martingale difference sequence $W_i = Z_i - Z_{i-1}$.

By definition, for every , we have , and hence, .

In order to apply Bernstein’s Inequality, we need a bound $M$ such that $|W_i| \leq M$ for all $i$, and an upper bound $V$ on the conditional variance, such that $\sum_{i} \mathbb{E}\left[W_i^2 \mid \mathcal{G}_{i-1}\right] \leq V$.

We consider separately the variables corresponding to the sample points and those corresponding to the Rademacher random variables:

  • : For , let us consider

    According to our definitions, we have

    Since and , , changing the value of any of the points in can change by at most , and thus we have , and with .

    From Popoviciu’s Inequality on variance [popoviciu1935equations], the variance of a random variable which takes values in an interval $[a, b]$ is bounded from above by $(b-a)^2/4$. Hence, by applying Popoviciu’s Inequality, we obtain an upper bound on the conditional variance of each $W_i$.

    As we are considering the expectation over the unassigned values of the Rademacher random variables, and as we are averaging over the values obtained using independent and identically distributed vectors of Rademacher random variables, we can conclude that , and .

  • : Changing the value of any of the Rademacher random variables can change the value of by at most , and thus we have , and with . By applying Popoviciu’s Inequality, we thus have .

By linearity of expectation, . Further, we have , and for all . The statement follows by applying Theorem 1. ∎

Note that for a sufficiently large (constant) $t$, the term that does not vanish as $t$ grows dominates the denominator of the exponent on the right-hand side of (5), giving a fast rate of convergence for the estimate. Our estimate fully characterizes the benefit achieved by using multiple independent vectors of Rademacher random variables in estimating the Rademacher Complexity.
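The benefit of averaging over several independent Rademacher vectors can also be seen empirically. The short simulation below (an illustration we add here, not an experiment from the paper) fixes a sample and a finite query class and measures how the spread of the estimate in (3) shrinks as $t$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
sample = rng.uniform(size=n)
thresholds = np.linspace(0.1, 0.9, 9)
vals = np.where(sample[None, :] <= thresholds[:, None], 1.0, -1.0)  # (|F|, n), values in {-1, 1}

for t in (1, 10, 100):
    estimates = []
    for _ in range(200):                              # repeated draws of the t sign vectors
        sigmas = rng.choice([-1.0, 1.0], size=(t, n))
        estimates.append((sigmas @ vals.T).max(axis=1).mean() / n)
    print(f"t = {t:3d}   std of the estimate = {np.std(estimates):.5f}")
```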

Combining the results of Theorem 2, Lemma 1, and Lemma 2 using a union bound, we obtain an empirical bound on the generalization error.

Theorem 3.

Given a sample $S$ of size $n$, a family of functions $\mathcal{F}$ which take values in $[-1, 1]$, $t$ independent vectors of Rademacher random variables, and $\epsilon > 0$, we have:

(6)
Proof:

From Lemmas 1 and 2, we have:

Theorem 2 characterizes the quality of the estimate of the Rademacher Complexity given by $\hat{R}_t(\mathcal{F}, S)$, computed as specified in (3):

Combining the two results, we obtain:

By substituting and in the previous equation, we have:

The statement follows. ∎

Alternative bound with a single application of Bernstein’s Inequality for Martingales: We now present an alternative to the result in Theorem 3, which can be obtained with a single application of BIM. This bound is tighter than the one in Theorem 3 when the number $t$ of independent vectors of Rademacher random variables is very large.

Theorem 4.
Proof:

Consider the Doob supermartingale:

where for , , and the remaining correspond to the independent Rademacher random variables in the vectors; that is, for and . It is easy to verify that . Further, , and due to Theorem 1, .

Let us define the corresponding martingale difference sequence . For each , due to linearity of expectation, we have , where:

In order to apply Bernstein’s Inequality, we need an upper bound for the increments and an upper bound on the conditional variance.

Given our definition of , we have that for every , , and thus: . From the properties of covariance, we have , and thus, .

We consider separately the variables corresponding to the sample points and those corresponding to the Rademacher random variables:

  • : In our setting and , , changing the value of any of the points in can change by at most . Therefore, , and with . By applying Popoviciu’s Inequality, we have: .

    The analysis for follows the same reasoning discussed in the proof of Theorem 2 for bounding in the case , and thus . We can thus conclude:

  • : Changing the value of any of the Rademacher random variables does not change the value of . Hence, .

    Given fixed values for the random variables corresponding to the points in , changing the value of one Rademacher random variable can change the value of by at most . Thus, , and with . By applying Popoviciu’s Inequality, we have:

We, therefore, have for all , and . By linearity of expectation, and by applying Theorem 1:

The statement follows by applying BIM (Theorem 1). ∎

This result can be used in RadaBound in place of the bound given by Theorem 3. Note that with this result, it is easier to compute the bound on the probability of overfitting (denoted as $\delta_k$ in line 11 of Algorithm 1).

Application of the Martingale Central Limit Theorem: In practical applications, one may prefer the standard practice in statistics of applying central limit asymptotic bounds. We develop here a bound based on the Martingale Central Limit Theorem (MCLT). Our experimental results in Section V show that the bound obtained using the MCLT is more powerful while still preserving statistical validity.

We adapt the following version of the MCLT (formally, the asymptotics are defined on a triangular array, where rows are samples of growing sizes; we also assume that all expectations are well defined in the corresponding filtration):

Theorem 5 (Corollary 3.2, [hall2014martingale]).

Let $W_1, W_2, \ldots, W_m$ be a martingale difference sequence with bounded absolute increments. Assume that (1) $\sum_{i=1}^{m} \mathbb{E}\left[W_i^2 \mid \mathcal{G}_{i-1}\right] \to \eta^2$ in probability for a finite constant $\eta^2$, and (2) a conditional Lindeberg condition holds; then $\sum_{i=1}^{m} W_i$ converges in distribution to $\mathcal{N}(0, \eta^2)$.
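As a quick numerical sanity check of the statement (ours, not the paper's), sums of bounded martingale differences whose conditional variances sum to a constant are approximately Gaussian; here we use a particularly simple difference sequence of independent signed weights:

```python
import numpy as np

rng = np.random.default_rng(2)
m, reps = 2000, 5000
weights = np.full(m, 1.0 / np.sqrt(m))   # W_i = sigma_i * a_i; conditional variances sum to 1
sums = (rng.choice([-1.0, 1.0], size=(reps, m)) * weights).sum(axis=1)
print(sums.mean(), sums.var())           # empirically close to 0 and eta^2 = 1
```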

When applying the MCLT, there is no advantage in bounding the estimation error of the Rademacher Complexity and the generalization error separately. Instead, we compute a bound on the distribution of the combined error by analyzing the Doob supermartingale

for $0 \leq i \leq n + nt$, with respect to the same random variables $Y_i$ defined as in Section III-A.

As in the finite-sample case, the following theorem relies on a careful analysis of the conditional variances of the martingale difference sequence.

Theorem 6.

Given a sample $S$ of size $n$, a family of functions $\mathcal{F}$ which take values in $[-1, 1]$, $t$ independent vectors of Rademacher random variables, and $\epsilon > 0$, we have:

where $\Phi$ denotes the cumulative distribution function of the standard normal distribution.

Proof:

The proof closely follows the steps of the proof of Theorem 4. Consider the Doob supermartingale for the function :

where for , , and the remaining correspond to the independent Rademacher random variables in the vectors. That is, , for and . Further, let us define the corresponding martingale difference sequence .

In order to apply the MCLT, we need to bound the sum of the conditional variances from above, and we need to verify that the increments are bounded.

Note that the sequence defined here corresponds to the martingale difference sequence by the same name that we studied in the proof of Theorem 4. As shown in the proof of Theorem 4, we have , and for all .

Applying the MCLT, we have that as goes to infinity,

converges in distribution to , and thus:

By linearity of expectation, and by applying Theorem 1:

The statement follows. ∎

Due to its asymptotic nature, it is not possible to directly compare the tightness of the bound in Theorem 6 with that of finite-sample bounds such as the one in Theorem 3. Still, this bound is of great interest in many practical scenarios, as it yields a much tighter bound on the generalization error.
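In code, an MCLT-style control reduces to evaluating a normal tail. The helper below is a generic sketch of that computation (the exact argument of $\Phi$ used by RadaBound is the one in Theorem 6, which we do not reproduce here):

```python
import math

def normal_cdf(x):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mclt_tail_bound(eps, variance_bound):
    """Generic asymptotic one-sided tail estimate 1 - Phi(eps / sqrt(V)), where V
    upper-bounds the sum of conditional variances of the martingale differences."""
    return 1.0 - normal_cdf(eps / math.sqrt(variance_bound))

# Illustrative values only.
print(mclt_tail_bound(eps=0.05, variance_bound=1e-3))
```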

IV The RADABOUND Algorithm

The algorithm starts by drawing $t$ independent vectors of $n$ Rademacher variables each. These vectors are fixed throughout the execution of the algorithm. The advantage of fixing the Rademacher vectors is that (1) we deal with a nested sequence of events, and (2) the actual computation of the Rademacher Complexity estimate is simple and efficient.

Computing the estimate: At the end of each round $k$, the algorithm stores, for each of the $t$ Rademacher vectors $\boldsymbol{\sigma}^j$, the value $s_j^{(k)} = \sup_{f \in \mathcal{F}_k} \frac{1}{n}\sum_{i=1}^{n} \sigma^j_i f(X_i)$. To update these values, at iteration $k$, the algorithm computes

$$s_j^{(k)} = \max\left\{ s_j^{(k-1)},\ \frac{1}{n}\sum_{i=1}^{n} \sigma^j_i f_k(X_i) \right\}, \qquad j = 1, \ldots, t.$$

The estimate of the Rademacher Complexity at round $k$ is then given by $\hat{R}_t(\mathcal{F}_k, S) = \frac{1}{t}\sum_{j=1}^{t} s_j^{(k)}$.
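The update above is cheap to maintain online. The sketch below (our illustration; class and attribute names are not from the paper) keeps one running supremum per sign vector and returns their mean as the current estimate:

```python
import numpy as np

class RademacherTracker:
    """Incrementally maintains, for each of t fixed sign vectors, the running
    supremum of the signed empirical averages over the queries seen so far; the
    mean of these suprema is the current Rademacher Complexity estimate."""

    def __init__(self, sample, t, seed=0):
        self.sample = np.asarray(sample, dtype=float)
        n = len(self.sample)
        rng = np.random.default_rng(seed)
        self.sigmas = rng.choice([-1.0, 1.0], size=(t, n))  # fixed for the whole run
        self.sups = np.full(t, -np.inf)                     # s_j, one per sign vector

    def update(self, f):
        """Account for a newly submitted query f and return the updated estimate."""
        f_vals = np.array([f(x) for x in self.sample])
        signed_means = self.sigmas @ f_vals / len(self.sample)
        self.sups = np.maximum(self.sups, signed_means)
        return float(self.sups.mean())
```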

1:  procedure RADABOUND($S$, $\epsilon$, $\delta$, $t$)
2:      $n \leftarrow |S|$  ▷ Size of the input sample
        ▷ Initialization of the estimator for the Rademacher Complexity
3:      for $j = 1, \ldots, t$ do
4:          $\boldsymbol{\sigma}^j \leftarrow$ vector of $n$ i.i.d. Rademacher RVs
5:          $s_j \leftarrow -\infty$
        ▷ Main execution body
6:      while new $k$-th query $f_k$ from the stream do
7:          $a_k \leftarrow \frac{1}{n}\sum_{i=1}^{n} f_k(X_i)$
            ▷ Rademacher Average estimation update
8:          for $j = 1, \ldots, t$ do
9:              $s_j \leftarrow \max\left\{ s_j,\ \frac{1}{n}\sum_{i=1}^{n} \sigma^j_i f_k(X_i) \right\}$
10:         $\hat{R}_k \leftarrow \frac{1}{t}\sum_{j=1}^{t} s_j$
            ▷ Control with BIM
11:         $\delta_k \leftarrow$ bound on the overfitting probability computed from Theorem 3 (or Theorem 4) using $\hat{R}_k$
            ▷ Control with MCLT - Alternative to line 11
12:         $\delta_k \leftarrow$ bound on the overfitting probability computed from Theorem 6 using $\hat{R}_k$
            ▷ Overfit control test
13:         if $\delta_k \leq \delta$ then
14:             return $a_k$
15:         else
16:             Halt: Cannot guarantee the statistical validity of further queries.
ALGORITHM 1 RADABOUND - Adaptive data analysis with Rademacher Complexity control
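Putting the pieces together, the following self-contained Python sketch mirrors the structure of Algorithm 1. It is illustrative only: for the overfitting-probability bound it uses the simple McDiarmid-style bound (2) for functions with values in $[-1, 1]$ and ignores the Monte Carlo error of the estimate, which the bounds of Theorems 3, 4, and 6 account for.

```python
import math
import numpy as np

def radabound(sample, queries, eps, delta, t, seed=0):
    """Illustrative end-to-end loop in the spirit of Algorithm 1: answer each
    adaptively chosen query with its empirical mean, update the Rademacher
    Complexity estimate over the queries answered so far, and halt once the
    (simplified) overfitting-probability bound exceeds delta."""
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    rng = np.random.default_rng(seed)
    sigmas = rng.choice([-1.0, 1.0], size=(t, n))   # sign vectors, drawn once and fixed
    sups = np.full(t, -np.inf)                      # running suprema, one per sign vector
    answers = []
    for k, f in enumerate(queries, start=1):
        f_vals = np.array([f(x) for x in sample])
        sups = np.maximum(sups, sigmas @ f_vals / n)
        r_hat = sups.mean()                         # Rademacher estimate after k queries
        slack = eps - 2.0 * r_hat
        # Simplified stand-in for delta_k: bound (2) with b - a = 2.
        delta_k = 1.0 if slack <= 0 else math.exp(-n * slack ** 2 / 2.0)
        if delta_k > delta:
            print(f"Halting at query {k}: cannot certify the validity of further answers.")
            break
        answers.append(float(f_vals.mean()))
    return answers

# Toy usage: thirty threshold queries on a holdout of 2000 points.
rng = np.random.default_rng(7)
holdout = rng.uniform(-1.0, 1.0, size=2000)
qs = [lambda x, c=c: 1.0 if x <= c else -1.0 for c in np.linspace(-0.9, 0.9, 30)]
print(len(radabound(holdout, qs, eps=0.2, delta=0.05, t=50)))
```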

Stopping rule: Given the user-defined values $\epsilon, \delta \in (0,1)$, the procedure halts at the first step $k$ for which it cannot guarantee that $\Pr\left(\Delta_k \geq \epsilon\right) \leq \delta$.

Recall from (1) that

(7)

Since , it is sufficient to require to have