
Oracle inequalities for computationally adaptive model selection

Abstract

We analyze general model selection procedures using penalized empirical loss minimization under computational constraints. While classical model selection approaches do not consider computational aspects of performing model selection, we argue that any practical model selection procedure must not only trade off estimation and approximation error, but also the computational effort required to compute empirical minimizers for different function classes. We provide a framework for analyzing such problems, and we give algorithms for model selection under a computational budget. These algorithms satisfy oracle inequalities that show that the risk of the selected model is not much worse than if we had devoted all of our computational budget to the optimal function class.



Alekh Agarwal Peter L. Bartlett John C. Duchi
alekh@eecs.berkeley.edu bartlett@stat.berkeley.edu jduchi@eecs.berkeley.edu


Department of Statistics and Department of EECS, University of California, Berkeley, CA, USA
Department of Mathematical Sciences, Queensland University of Technology, Brisbane, Australia




March 6, 2018



1 Introduction

In decision-theoretic statistical settings, one receives samples drawn i.i.d. from some unknown distribution over a sample space , and given a loss function , seeks a function to minimize the risk

(1)

Since is unknown, the typical approach is to compute estimates based on the empirical risk, , over a function class . Through this, one seeks a function with a risk close to the Bayes risk, the minimal risk over all measurable functions, which is . There is a natural tradeoff based on the class one chooses, since

which decomposes the excess risk of into estimation error (left) and approximation error (right).

A common approach to addressing this tradeoff is to express as a union of classes

(2)

The model selection problem is to choose a class and a function that give the best tradeoff between estimation error and approximation error. A standard approach to the model selection problem is the now classical idea of complexity regularization, which arose out of early works by Mallows [21] and Akaike [1]. The complexity regularization approach balances two competing objectives: the minimum empirical risk of a model class (approximation error) and a complexity penalty (to control estimation error) for the class. Different choices of the complexity penalty give rise to different model selection criteria and algorithms (for example, see the lecture notes by Massart [23] and the references therein). The complexity regularization approach uses penalties associated with each class to perform model selection, where is a complexity penalty for class when samples are available; usually the functions decrease to zero in and increase in the index . The actual algorithm is as follows: for each , choose

(3)

as the output of the model selection procedure, where denotes the -sample empirical risk. Results of several authors [11, 20, 23] show that with appropriate penalties and given a dataset of size , the output of the procedure roughly satisfies

(4)

Several approaches to complexity regularization are possible, and an incomplete bibliography includes the papers [28, 16, 25, 5, 11, 20].
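To make the classical procedure concrete, the following minimal sketch (not part of the original development) implements penalized empirical risk minimization in the spirit of the rule (3); the routines fit_erm, empirical_risk, and penalty are placeholders for a user-supplied estimator, loss, and complexity penalty.

```python
def complexity_regularization(data, num_classes, fit_erm, empirical_risk, penalty):
    """Classical (computationally unconstrained) model selection in the spirit of (3):
    fit an empirical risk minimizer in every class, then return the class and function
    minimizing the penalized empirical risk."""
    n = len(data)
    best_score, best_index, best_fn = float("inf"), None, None
    for i in range(1, num_classes + 1):
        f_i = fit_erm(i, data)                        # (approximate) ERM over class i
        score = empirical_risk(f_i, data) + penalty(i, n)
        if score < best_score:
            best_score, best_index, best_fn = score, i, f_i
    return best_index, best_fn
```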

Oracle inequalities of the form (4) show that, for a given sample size, complexity regularization procedures trade off the approximation and estimation errors, often optimally [23]. A drawback of the above approaches is that in order to provide guarantees on the result of the model selection procedure, one needs to be able to optimize over each model in the hierarchy (that is, compute the estimates for each ). This is reasonable when the sample size is the key limitation, and it is computationally feasible when is small and the samples are low-dimensional. However, the cost of fitting a large number of model classes on a large, high-dimensional dataset can be prohibitive; such data is common in modern statistical settings. In such cases, it is the computational resources—rather than the sample size—that form the key inferential bottleneck. In this paper, we consider model selection from this computational perspective, viewing the amount of computation, rather than the sample size, as the quantity whose effects on estimation we must understand. Specifically, we study model selection methods that work within a given computational budget.

An interesting and difficult aspect of the problem that we must address is the interaction between model class complexity and computation time. It is natural to assume that, for a fixed sample size, it is more expensive to estimate a model from a complex class than from a simple class. Conversely, under a fixed computational bound, a simple model class can be fit to a much larger sample than a rich model class. Any strategy for model selection under a computational constraint must therefore trade off two criteria: (i) the relative training cost of different model classes, which allows simpler classes to receive far more data (thus making them resilient to overfitting), and (ii) the lower approximation error of the more complex model classes.

In addressing these computational and statistical issues, this paper makes two main contributions. First, we propose a novel computational perspective on the model selection problem, which we believe should be a natural consideration in statistical learning problems. Second, within this framework, we provide algorithms for model selection in many different scenarios and establish oracle inequalities for their estimates under different assumptions. Our first two results address the case where we have a model hierarchy that is ordered by inclusion, that is, . The first result provides an inequality that is competitive with an oracle knowing the optimal class, incurring at most an additional logarithmic penalty in the computational budget. The second result extends our approach to obtaining faster rates for model selection under conditions that guarantee sharper concentration results for empirical risk minimization procedures; oracle inequalities under these conditions, but without computational constraints, have been obtained, for example, by Bartlett [8] and Koltchinskii [18]. Both of our results refine existing complexity-regularized risk minimization techniques by a careful consideration of the structure of the problem. Our third result applies to model classes that do not necessarily share any common structure. Here we present a novel algorithm—exploiting techniques for multi-armed bandit problems—that uses confidence bounds based on concentration inequalities to select a good model under a given computational budget. We also prove a minimax optimal oracle inequality on the performance of the selected model. All of our algorithms are computationally simple and efficient.

The remainder of this paper is organized as follows. We begin in Section 2 by formalizing our setting for a nested hierarchy of models, providing an estimator and oracle inequalities for the model selection problem. In Section 3, we refine our estimator and its analysis to obtain fast rates for model selection under additional standard conditions. We study the setting of unstructured model collections in Section 4. Detailed technical arguments and various auxiliary results needed to establish our main theorems and corollaries can be found in the appendices.

2 Model selection over nested hierarchies

In many practical scenarios, the family of models with which one works has some structure. One of the most common model selection settings has the model classes ordered by inclusion with increasing complexity (e.g. [11]). In this section, we study such model selection problems; we begin by formally stating our assumptions and giving a few natural examples, proceeding thereafter to oracle inequalities for a computationally efficient model selection procedure.

2.1 Assumptions

Our first main assumption is a natural inclusion assumption, which is perhaps the most common assumption in prior work on model selection (e.g. [11, 20]): {assumption} The function classes are ordered by inclusion:

(5)

We provide two examples of such problems in the next section. In addition to the inclusion assumption, we make a few assumptions on the computational aspects of the problem. Most algorithms used in the framework of complexity regularization rely on the computation of estimators of the form

(6)

either exactly or approximately, for each class . Since the model classes are ordered by inclusion, it is natural to assume that the computational cost of computing an empirical risk minimizer from is higher than that for a class when . Said differently, given a fixed computational budget , it may be impossible to use as many samples to compute an estimator from as it is to compute an estimator from (again, when ). We formalize this in the next assumption, which is stated in terms of an (arbitrary) algorithm that selects functions for each index based on a set of samples. {assumption} Given a computational budget , there is a sequence such that

  1. for .

  2. The complexity penalties satisfy for .

  3. For each class , the computational cost of using the algorithm with samples is . That is, estimation within class using samples has the same computational complexity for each .

  4. For all , the output of the algorithm , given a computational budget , satisfies

  5. As , for any fixed .

The first two assumptions formalize a natural notion of computational budget in the context of our model selection problem: given equal computation time, a simpler model can be fit using a larger number of samples than a complex model. Assumption 2.1(c) says that the number of samples is chosen to roughly equate the computational complexity of estimation within each class. Assumption 2.1(d) simply states that we compute approximate empirical minimizers for each class . Our choice of the accuracy of computation to be in part (d) is made mainly for notational convenience in the statements of our results; one could use an alternate constant or function and achieve similar results. Finally, part (e) rules out degenerate cases where the penalty function approaches a finite upper bound; this assumption is needed for our estimator to be well-defined for infinite model hierarchies. In the sequel, we use the shorthand to denote when the number of samples is clear from context.

Certainly many choices are possible for the penalty functions , and work studying appropriate penalties is classical (see e.g. [1, 21]). Our focus in this paper is on complexity estimates derived from concentration inequalities, which have been extensively studied by a number of researchers [11, 23, 4, 8, 18]. Such complexity estimates are convenient since they ensure that the penalized empirical risk bounds the true risk with high probability. Formally, we have {assumption} For all and for each , there are constants such that for any budget the output satisfies,

(7)

In addition, for any fixed function , .

2.2 Some illustrative examples

We now provide two concrete examples to illustrate the preceding assumptions.

{example}

[Linear classification with nested balls] In a classification problem, each sample consists of a covariate vector and label . In margin-based linear classification, the predictions are the sign of the linear function , where . A natural sequence of model classes consists of sets indexed by norm balls of increasing radii: , where . By inspection, so that this sequence satisfies Assumption 2.1.

The empirical and expected risks of a function are often measured using the sample average and expectation, respectively, of a convex upper bound on the 0-1 loss . Examples of such losses include the hinge loss, , or the logistic loss, . Assume that and let be independent uniform -valued random variables. Then we may use a penalty function based on Rademacher complexity of the class ,

Setting to be the Rademacher complexity satisfies the conditions of Assumption 2.1 [9] for both the logistic and the hinge losses, which are 1-Lipschitz. Hence, using the standard Lipschitz contraction bound [9, Theorem 12], we may take .

To illustrate Assumption 2.1, we take stochastic gradient descent [26] as an example. Assuming that the computation time to process a single sample is equal to the dimension , Nemirovski et al. [24] show that the computation time required by this algorithm to output a function satisfying Assumption 2.1(d) (that is, a -optimal empirical minimizer) is at most

Substituting the bound on above, we see that the computational time for class is at most . In other words, given a computational time , we can satisfy Assumption 2.1 by setting for each class ; the number of samples remains constant across the hierarchy in this example.
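As a concrete illustration of the kind of algorithm this example has in mind (though not the authors' specific implementation), the following sketch runs a single pass of projected stochastic gradient descent for the hinge loss over an l2-ball; the step-size schedule and iterate averaging are standard choices assumed only for illustration.

```python
import numpy as np

def projected_sgd_hinge(X, y, radius, step=1.0):
    """One pass of projected SGD for the hinge loss over {w : ||w||_2 <= radius}.
    X is an (n, d) array of covariates, y an array of labels in {-1, +1}.
    Processing each sample costs O(d), so the total cost is O(n d)."""
    n, d = X.shape
    w = np.zeros(d)
    w_avg = np.zeros(d)
    for t in range(n):
        if y[t] * X[t].dot(w) < 1:                  # subgradient step for the hinge loss
            w = w + step / np.sqrt(t + 1) * y[t] * X[t]
        norm = np.linalg.norm(w)
        if norm > radius:                           # project back onto the l2-ball
            w = w * (radius / norm)
        w_avg += (w - w_avg) / (t + 1)              # running average of the iterates
    return w_avg
```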

{example}

[Linear classification in increasing dimensions] Staying within the linear classification domain, we index the complexity of the model classes by an increasing sequence of dimensions . Formally, we set

where . This structure captures a variable selection problem where we have a prior ordering on the covariates.

In special scenarios, such as when the design matrix satisfies certain incoherence or irrepresentability assumptions [12], variable selection can be performed using -regularization or related methods. However, in general an oracle inequality for variable selection requires some form of exhaustive search over subsets. In the sequel, we show that in this simpler setting of variable selection over nested subsets, we can provide oracle inequalities without computing an estimator for each subset and without any assumptions on the design matrix .

For this function hierarchy, we consider complexity penalties arising from VC-dimension arguments [27, 9], in which case we may set

which satisfies Assumption 2.1. Using arguments similar to those for Example 2.2, we may conclude that the computational assumption 2.1 can be satisfied for this hierarchy, where the algorithm requires time to select . Thus, given a computational budget , we set the number of samples for class to be proportional to .
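As a small illustration of the budget-to-sample-size relation in this example, the sketch below allocates to each class a number of samples proportional to the budget divided by its dimension; the unit cost per coordinate is an assumption made only for this sketch.

```python
def samples_per_class(budget, dims, cost_per_coordinate=1.0):
    """Number of samples each class can process on its own under a budget.
    Class i works in dimension dims[i-1], so its per-sample cost scales with that
    dimension, giving a sample size proportional to budget / dims[i-1]."""
    return [int(budget // (cost_per_coordinate * d)) for d in dims]

# For example, samples_per_class(10_000, [1, 2, 4, 8, 16]) returns
# [10000, 5000, 2500, 1250, 625]: richer classes see fewer samples.
```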

We provide only classification examples above since they demonstrate the essential aspects of our formulation. Similar quantities can also be obtained for a variety of other problems, such as parametric and non-parametric regression, and for a number of model hierarchies including polynomial or Fourier expansions, wavelets, or Sobolev classes, among others (for more instances, see, e.g. [23, 4, 11]).

2.3 The computationally-aware model selection algorithm

Having specified our assumptions and given examples satisfying them, we turn to describing our first computationally-aware model selection algorithm. Let us begin with the simpler scenario where we have only finitely many model classes (we extend this to infinite classes below). Perhaps the most obvious computationally budgeted model selection procedure is the following: allocate a budget of to each model class . As a result, class ’s estimator is computed using samples. Let denote the output of the basic model selection algorithm (3) with the choices , using samples to evaluate the empirical risk for class , and modifying the penalty to be . Then very slight modifications of standard arguments [23, 11] yield the oracle inequality

with high probability, where is a universal constant. This approach can be quite poor. For instance, in Example 2.2, we have , and the above inequality incurs a penalty that grows as . This is much worse than the logarithmic scaling in that is typically possible in computationally unconstrained settings [11]. It is thus natural to ask whether we can use the nested structure of our model hierarchy to allocate computational budget more efficiently.

To answer this question, we introduce the notion of coarse-grid sets, which use the growth structure of the complexity penalties , to construct a scheme for allocating the budget across the hierarchy. Recall the constant from Assumption 2.1 and let be an arbitrary constant (we will see that controls the probability of error in our results). Given (), we define

(8)

Notice that, to simplify the notation, we hide the dependence of on . With the definition (8) in place, we now give a definition that captures the growth of the penalties and sample sizes. {definition} Given a budget , for a set , we say that satisfies the coarse grid condition with parameters , , and if and for each there is an index such that

(9)

Figure 1 gives an illustration of the coarse-grid set. For simplicity in presentation, we set in the statements of our results in the sequel.

Figure 1: Construction of the coarse-grid set . The horizontal axis is the class index , and the vertical axis represents the corresponding complexity penalty . When the penalty function grows steeply early on, we include a large number of models; far fewer complex models need to be included in once the growth of the penalty function tapers off.
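The doubling construction behind the coarse-grid set can be sketched as follows for a finite hierarchy, with a factor-of-two growth between consecutive grid points corresponding to the default parameter setting; the list penalties is a placeholder for the (nondecreasing) complexity penalties of the classes at their budgeted sample sizes.

```python
def coarse_grid(penalties):
    """Greedy doubling construction of a coarse-grid set over classes 1..M.
    `penalties` is a nondecreasing list; consecutive grid points have penalties
    within a factor of two of each other, so every class is covered by some
    grid point whose penalty is at most twice its own."""
    M = len(penalties)
    grid = [1]
    while grid[-1] < M:
        prev = grid[-1]
        # largest class whose penalty is at most twice that of the previous grid point
        nxt = max(i for i in range(prev, M + 1)
                  if penalties[i - 1] <= 2 * penalties[prev - 1])
        grid.append(max(nxt, prev + 1))   # always advance by at least one class
    return grid
```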

If the coarse-grid set is finite and, say, , then the set presents a natural collection of indices over which to perform model selection. We simply split the budget uniformly amongst the coarse-grid set , giving budget to each class in the set. Indeed, the main theorem of this section shows that for a large class of problems, it always suffices to restrict our attention to a finite grid set , allowing us to present both a computationally tractable estimator and a good oracle inequality for the estimator. In some cases, there may be no finite coarse grid set. Thus we look for a way to restrict our selection to finite sets, which we can do with the following assumption (the assumption is unnecessary if the hierarchy is finite). {assumption}

  1. There is a constant such that .

  2. For all the penalty function .

Assumption 1(a) is satisfied, for example, if the loss function is bounded, or even if there is a function with finite risk. Assumption 1(b) also is mild; unless the class is trivial, in general classes satisfying Assumption 2.1 have .

Under these assumptions, we provide our computationally budgeted model selection procedure in Algorithm 1. We will see in the proof of Theorem 2.4 below that the assumptions ensure that we can build a coarse grid of size

In particular, Assumption 2.1(e) ensures that the complexity penalties continue to increase with the class index . Hence, there is a class such that the complexity penalty is larger than the penalized risk of the smallest class , at which point no class larger than can be a minimizer in the oracle inequality. The above choice of ensures that there is at least one class so that , allowing us to restrict our attention only to the function classes .

Require: Model hierarchy with corresponding penalty functions , computational budget , upper bound on the minimum risk of class 1, and confidence parameter .
Construction of the coarse-grid set :
  Set .
  for  to  do
     Set  to be the largest class for which .
  end for
  Set .
Model selection estimate:
  Set  for .
  Select a class  that satisfies
(10)
  Output the function .
Algorithm 1: Computationally budgeted model selection over nested hierarchies
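A minimal executable sketch of Algorithm 1's selection phase follows. It assumes placeholder routines train(i, budget) (fit an approximate empirical minimizer in class i using the samples that the given budget allows), emp_risk(f, i) (its empirical risk on those samples), and penalty(i, budget) (the complexity penalty at the corresponding sample size); the selection rule is written as a penalized empirical-risk comparison in the spirit of (3), since the exact constants of the rule (10) are not reproduced here.

```python
def algorithm_1(grid, total_budget, train, emp_risk, penalty):
    """Computationally budgeted model selection over a coarse grid of nested classes:
    split the budget evenly over the grid, fit each grid class on its share, and
    select the class minimizing a penalized empirical risk."""
    share = total_budget // len(grid)                       # equal budget per grid class
    fitted = {i: train(i, share) for i in grid}
    scores = {i: emp_risk(fitted[i], i) + penalty(i, share) for i in grid}
    i_hat = min(scores, key=scores.get)                     # selected class index
    return i_hat, fitted[i_hat]
```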

2.4 Main result and some consequences

With the above definitions in place, we can now provide an oracle inequality on the performance of the model selected by Algorithm 1. We start with our main theorem, and then provide corollaries to help explain various aspects of it.

{theorem}

Let be the output of the algorithm for the class specified by the procedure (10), and let the assumptions of Section 2 be satisfied. With probability at least

(11)

Furthermore, if then .

The assumption that the computational cost grows linearly in the sample size is mild: unless the model class is trivial, any algorithm for it must at least observe the data, and hence must use computation at least linear in the sample size.

Remarks: To better understand the result of Theorem 2.4, we turn to a few brief remarks.

  1. We may ask what an omniscient oracle with access to the same computational algorithm could do. Such an oracle would know the optimal class and allocate the entire budget to compute . By Assumption 2.1, the output of this oracle satisfies, with probability at least ,

    (12)

    Comparing this to the right hand side of the inequality of Theorem 2.4, we observe that not knowing the optimal class incurs a penalty in the computational budget of roughly a factor of . This penalty is only logarithmic in the computational budget in most settings of interest.

  2. Algorithm 1 and Theorem 2.4, as stated, require a priori knowledge of the computational budget . We can address this using a standard doubling argument (see e.g. [13, Sec. 2.3]), sketched in code after these remarks. Initially we assume and run Algorithm 1 accordingly. If we do not exhaust the budget, we assume , and rerun Algorithm 1 for another round. If there is more computational time at our disposal, we update our guess to and so on. Suppose the real budget is with . After rounds of this doubling strategy, we have exhausted a budget of , with the last round getting a budget of for . In particular, the last round, with a net budget of , is of length at least . Since Theorem 2.4 applies to each individual round, we obtain an oracle inequality where we replace with ; that is, we can dispense with prior knowledge of the budget at the expense of slightly worse constants.

  3. For ease of presentation, Algorithm 1 and Theorem 2.4 use a specific setting of the coarse-grid size, which corresponds to setting in Definition 8. In our proofs, we establish the theorem for arbitrary . As a consequence, to obtain slightly sharper bounds, we may optimize this choice of ; we do not pursue this here.
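The guess-and-double strategy from the second remark above can be sketched as follows; run_algorithm_1 and budget_remaining are placeholders for a single run of Algorithm 1 with a guessed budget and for a check that the (unknown) true budget still permits a round of that size.

```python
def doubling_wrapper(run_algorithm_1, budget_remaining, initial_guess=1):
    """Run Algorithm 1 with guessed budgets of 1, 2, 4, ... times `initial_guess`
    until the true (unknown) budget is exhausted; return the output of the
    last round that completed."""
    guess = initial_guess
    output = None
    while budget_remaining(guess):          # enough time left for a round of this size?
        output = run_algorithm_1(guess)     # one full run of Algorithm 1 with this guess
        guess *= 2                          # double the guess for the next round
    return output
```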

Now let us turn to a specialization of Theorem 2.4 to the settings outlined in the two examples of Section 2.2. The following corollary gives oracle inequalities under the computational restrictions that are only logarithmically worse than those attainable by the computationally unconstrained model selection procedure (3). {corollary} Let be a specified constant.

  1. In the setting of Example 2.2 (linear classification with nested balls), define so that is the number of samples that can be processed by the inference algorithm using units of computation. Assume that is large enough that and . With probability at least , the output of Algorithm 1 satisfies

  2. In the setting of Example 2.2 (linear classification in increasing dimensions), define so that is the number of samples that can be processed by the inference algorithm using units of computation. Assume that is large enough that and . With probability at least , the output of Algorithm 1 satisfies

2.5 Proofs

As remarked after Theorem 2.4, we will present our proofs for general settings of . For the proofs of Theorem 2.4 and Corollary 2.4 in this slight generalization, we define as a set satisfying the coarse grid condition with parameters , and , with satisfying

(13)

First, we show that this inequality is ensured by the choice given in Algorithm 1. To see this, notice that

Thus, for , choosing suffices.

We require the additional notation

(14)

where

(15)

is the natural generalization of the set defined in Algorithm 1: is chosen as the largest index for which . We begin the proof of Theorem 2.4 by showing that any satisfying (13) ensures that any class must have penalty too large to be optimal, so we can focus on classes . We then show that the output of Algorithm 1 satisfies an oracle inequality for each class in , which is possible by an adaptation of arguments in prior work [11]. Using the definition of our coarse grid set (Definition 8), we can then infer an oracle inequality that applies to each class , and our earlier reduction to a finite model hierarchy completes the argument.

Proof of Theorem 2.4

First we show that the selection of the set satisfies Definition 8. {lemma} Let be an increasing sequence of positive numbers, and for each set to be the largest index such that . Then for each such that , there exists a such that .

Proof.

Let and choose the smallest such that . Assume for the sake of contradiction that . There exists some such that and , and thus we obtain

(16)

Let be the largest element smaller than in the collection . Then by our construction, is the largest index satisfying . In particular, combining with our earlier inequality (16) leads to the conclusion that , which contradicts the fact that is the smallest index in satisfying . ∎
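A small numerical illustration of the lemma, using an arbitrary increasing sequence chosen only for this check: every element of the sequence has a grid value lying between it and twice its value.

```python
a = [1.0, 1.3, 2.1, 2.2, 4.0, 4.1, 9.0, 20.0]   # an arbitrary increasing sequence

# doubling construction: start at the first index, and repeatedly jump to the
# largest index whose value is at most twice the value at the current grid point
grid = [0]
while grid[-1] < len(a) - 1:
    prev = grid[-1]
    nxt = max(i for i in range(prev, len(a)) if a[i] <= 2 * a[prev])
    grid.append(max(nxt, prev + 1))

# coverage claimed by the lemma: for every index i there is a grid index j
# with a[i] <= a[j] <= 2 * a[i]
assert all(any(a[i] <= a[j] <= 2 * a[i] for j in grid) for i in range(len(a)))
print([j + 1 for j in grid])   # grid as 1-based class indices: [1, 2, 4, 6, 7, 8]
```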

Next, we show that, for satisfying (13), once the complexity penalty of a class becomes too large, it can never be the minimizer of the penalized risk in the oracle inequality (11). See Appendix A for the proof. {lemma} Fix and , recall the definition (14) of , and let be a class that attains the minimum on the right-hand side of the bound (11). We have . Equipped with these two lemmas, we can restrict our attention to the classes . To that end, the next result establishes an oracle inequality for our algorithm relative to all the classes in this set.

{proposition}

Let be the function chosen from the class selected by the procedure (10), where and . Under the conditions of Theorem 2.4, with probability at least

The proof of the proposition follows from an argument similar to that given in [11], though we must carefully reason about the different number of independent samples used to estimate within each class . We present a proof in Appendix A. We can now complete the proof of Theorem 2.4 using the proposition.

{proof-of-theorem}

[2.4] Let be any class (not necessarily in ) and be the smallest class satisfying . Then, by construction of , we know from Lemma 2.5.1 that

In particular, we can lower bound the penalized risk of class as

where we used the inclusion assumption 2.1 to conclude that . Now applying Proposition 2.5.1, the above lower bound, and Lemma 2.5.1 in turn, we see that with probability at least

For (which we have seen satisfies (13)), this is the desired statement of the theorem.

Proof of Corollary 2.4

Under the conditions of Example 2.2 (nested balls), and the assumption that , the assumptions of Section 2 are satisfied with and . (In particular, implies that satisfies Assumption 1(b).) Also, since , we have

Substituting into Theorem 2.4 gives the first part of the corollary.

Similarly, under the conditions of Example 2.2 (increasing dimensions) and the assumption that , the assumptions of Section 2 are satisfied with and . (In particular, implies that satisfies Assumption 1(b).) Also, since , we have as before. Substituting into Theorem 2.4 gives the second part of the corollary.

3 Fast rates for model selection

Looking at the result given by Theorem 2.4, we observe that irrespective of the dependence of the penalties on the sample size, there are terms in the oracle inequality that always decay as . A similar phenomenon is noted in [8] for classical model selection results in computationally unconstrained settings; under conditions similar to Assumption 2.1, this inverse-root dependence on the number of samples is the best possible, due to lower bounds on the fluctuations of the empirical process (e.g. [10, Theorem 2.3]). On the other hand, under suitable low noise conditions [22] or curvature properties of the risk functional [6, 18, 7], it is possible to obtain estimation guarantees of the form

where (approximately) minimizes the -sample empirical risk. Under suitable assumptions, complexity regularization can also achieve fast rates for model selection [8, 17]. In this section, we show that similar results can be obtained in computationally constrained inferential settings.

3.1 Assumptions and example

We begin by modifying our concentration assumption and providing a motivating example. {assumption} For each , let . Then there are constants such that for any budget and the corresponding sample size

(17a)
(17b)

Contrasting this with our earlier Assumption 2.1, we see that the probability bounds (17a) and (17b) decay exponentially in rather than , which leads to faster sub-exponential rates for estimation procedures. Concentration inequalities of this form are now well known [6, 18, 7], and the paper [8] uses an identical assumption.

Before continuing, we give an example to illustrate the assumption. {example}[Fast rates for classification] We consider the function class hierarchy based on increasing dimensions from the second example of Section 2.2. We assume that the risk and that the loss function is either the squared loss or the exponential loss used in boosting. Each of these losses satisfies Assumption 3.1 with

(18)

for a universal constant . This follows from Theorem 3 of [8] (which in turn follows from Theorem 3.3 in [6] combined with an argument based on Dudley’s entropy integral [15]). The other parameter settings and computational considerations are identical to those of Example 2.2.

If we define , then using Assumption 2.1(d) (that ) in conjunction with the concentration bound (17a), we can conclude that for any time budget , with probability at least ,

(19)

One might thus expect that by following arguments similar to those in [8], it would be possible to show fast rates for model selection based on Algorithm 1. Unfortunately, the results of [8] heavily rely on the fact that the data used for computing the estimators is the same for each class , so that the fluctuations of the empirical processes corresponding to the different classes are positively correlated. In our computationally constrained setting, however, each class’s estimator is computed on a different sample. It is thus more difficult to relate the estimators than in previous work, necessitating a modification of our earlier Algorithm 1 and a new analysis, which follows.

3.2 Algorithm and oracle inequality

As in Section 2, our approach is based on performing model selection over a coarsened version of the collection . To construct the coarser collection of indices, we define the composite penalty term (based on Assumption 3.1)

(20)

Based on the above penalty term, we define our analogue of the coarse grid set (9).

We give our modified model selection procedure in Algorithm 2. In the algorithm and in our subsequent analysis, we use the shorthand to denote the empirical risk of the function on the samples associated with class . Our main oracle inequality is the following:

Require: Model hierarchy with corresponding penalty functions , computational budget , upper bound on the minimum risk of class 1, and confidence parameter .
Construction of the coarse-grid set :
  Set .
  for  to  do
     Set  to be the largest class for which .
  end for
  Set .
Model selection estimate:
  Set  for .
  Select the class  to be the largest class that satisfies
(21)
  for all  such that .
  Output the function .
Algorithm 2: Computationally budgeted model selection over hierarchies with fast concentration
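The structure of Algorithm 2's selection step can be sketched as follows: it keeps the largest grid class whose estimator passes a comparison test against the other grid classes. In this sketch the comparison set is taken to be the smaller grid classes, and the test itself, the rule (21) involving the composite penalties (20), is abstracted into a placeholder passes_test(i, j); both are assumptions of the illustration rather than a reproduction of the exact rule.

```python
def algorithm_2_select(grid, passes_test):
    """Return the largest class in the coarse grid whose estimator passes the
    pairwise test against every smaller grid class (the test itself, rule (21),
    is supplied by the caller as `passes_test`)."""
    chosen = min(grid)                       # the smallest class qualifies vacuously
    for i in sorted(grid):
        if all(passes_test(i, j) for j in grid if j < i):
            chosen = i                       # i passed against all smaller classes
    return chosen
```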
{theorem}

Let be the output of the algorithm for the class specified by the procedure (21), and let the assumptions of Section 2 and Assumption 3.1 be satisfied. With probability at least

(22)

Furthermore, if then .

By inspection of the bound (19)—achieved by devoting the full computational budget to the optimal class—we see that the dependence of Theorem 3.2’s oracle inequality on the computational budget is within logarithmic factors of the best possible.

The following corollary shows the application of Theorem 3.2 to the classification problem discussed in the example of Section 3.1. {corollary} In the setting of that example, define so that is the number of samples that can be processed by the inference algorithm using units of computation. Assume that , , and choose the constant in the definition (18) of such that . With probability at least , the output of Algorithm 2 satisfies

3.3 Proofs of main results

In this section, we provide proofs of Theorem 3.2 and its corollary. As in the proof of Theorem 2.4, we present the proof of Theorem 3.2 for general settings of . The proof broadly follows that of Theorem 2.4, in that we establish an analogue of Proposition 2.5.1, which provides an oracle inequality for each class in the coarse-grid set . We then extend this inequality to every function class in the hierarchy using the definition (9) of the grid set.

{proof-of-theorem}

[3.2] Let be shorthand for , the number of samples available to class , and let denote the empirical risk of the function using the samples for class . In addition, let be shorthand for , the penalty value for class using samples. With these definitions, we adopt the following shorthand for the events in the probability bounds (17a) and (17b). Let be an -dimensional vector with (arbitrary for now) positive entries. For each pair of indices and define

(23a)