Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions
We obtain estimation error rates and sharp oracle inequalities for regularization procedures of the form
$$\hat f \in \operatorname{argmin}_{f\in F}\left(\frac{1}{N}\sum_{i=1}^N \ell(f(X_i), Y_i) + \lambda\|f\|\right)$$
when $\|\cdot\|$ is any norm, $F$ is a convex class of functions and $\ell$ is a Lipschitz loss function satisfying a Bernstein condition over $F$. We explore both the bounded and subgaussian stochastic frameworks for the distribution of the $f(X_i)$'s, with no assumption on the distribution of the $Y_i$'s. The general results rely on two main objects: a complexity function, and a sparsity equation, that depend on the specific setting in hand (loss $\ell$ and norm $\|\cdot\|$).
As a proof of concept, we obtain minimax rates of convergence in the following problems: 1) matrix completion with any Lipschitz loss function, including the hinge and logistic loss for the so-called 1-bit matrix completion instance of the problem, and quantile losses for the general case, which makes it possible to estimate any quantile of the entries of the matrix; 2) logistic LASSO and variants such as the logistic SLOPE; 3) kernel methods, where the loss is the hinge loss, and the regularization function is the RKHS norm.
Many classification and prediction problems are solved in practice by regularized empirical risk minimizers (RERM). The risk is measured by a loss function and the quadratic loss function is the most popular function for regression. It has been extensively studied (cf. [LM_sparsity, MR2829871] among others). Still many other loss functions are popular among practitioners and are indeed extremely useful in specific situations.
First, let us mention the quantile loss in regression problems. The $0.5$-quantile loss (also known as absolute or $L_1$ loss) is known to provide an indicator of conditional central tendency more robust to outliers than the quadratic loss. An alternative to the absolute loss for robustification is provided by the Huber loss. On the other hand, general quantile losses are used to estimate conditional quantile functions and are extremely useful to build confidence intervals and measures of risk, like Values at Risk (VaR) in finance.
Let us now turn to classification problems. The natural loss in this context, the so-called $0/1$ loss, very often leads to computationally intractable estimators. Thus, it is usually replaced by a convex loss function, such as the hinge loss or the logistic loss. A thorough study of convex loss functions in classification can be found in [zhang2004statistical].
All the aforementioned loss functions (quantile, Huber, hinge and logistic) share a common property: they are Lipschitz functions. This motivates a general study of RERM with any Lipschitz loss. Note that some examples were already studied in the literature: the $\ell_1$-penalty with a quantile loss was studied in [belloni2011l1] under the name “quantile LASSO” while the same penalty with the logistic loss was studied in [van2008high] under the name “logistic LASSO” (cf. [MR3526202]). ERM strategies with Lipschitz proxies of the $0/1$ loss are studied in [MR1892654]. The loss functions we will consider in the examples of this paper are recalled below:
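hinge loss: $\ell(y', y) = (1 - yy')_+ = \max(0, 1 - yy')$ for every $y\in\{-1,+1\}$, $y'\in\mathbb{R}$;
logistic loss: $\ell(y', y) = \log(1 + \exp(-yy'))$ for every $y\in\{-1,+1\}$, $y'\in\mathbb{R}$;
quantile regression loss: for some parameter $\tau\in(0,1)$, $\ell(y', y) = \rho_\tau(y - y')$ for every $y\in\mathbb{R}$, $y'\in\mathbb{R}$, where $\rho_\tau(z) = z(\tau - I(z\le 0))$ for all $z\in\mathbb{R}$.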
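The three losses listed above can be sketched numerically as follows. This is a minimal illustration, assuming the standard forms of these losses (hinge $(1-yy')_+$, logistic $\log(1+e^{-yy'})$, and the pinball form $\rho_\tau(z) = z(\tau - I(z\le 0))$ for the quantile loss); the function names are hypothetical. The final lines check empirically, on random inputs, that each loss is 1-Lipschitz in its prediction argument.

```python
import numpy as np

def hinge_loss(y_pred, y):
    """Hinge loss (1 - y * y_pred)_+ for labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * y_pred)

def logistic_loss(y_pred, y):
    """Logistic loss log(1 + exp(-y * y_pred)), computed stably."""
    return np.logaddexp(0.0, -y * y_pred)

def quantile_loss(y_pred, y, tau):
    """Pinball loss rho_tau(y - y_pred) with rho_tau(z) = z * (tau - 1{z <= 0})."""
    z = y - y_pred
    return z * (tau - (z <= 0))

# Empirical check of the 1-Lipschitz property in the prediction argument:
# |loss(u, y) - loss(v, y)| <= |u - v| for random u, v, y.
rng = np.random.default_rng(0)
u, v = rng.normal(size=1000), rng.normal(size=1000)
labels = rng.choice([-1.0, 1.0], size=1000)   # classification outputs
targets = rng.normal(size=1000)               # regression outputs
gap = np.abs(u - v) + 1e-12                   # small slack for rounding

lipschitz_ok = bool(
    np.all(np.abs(hinge_loss(u, labels) - hinge_loss(v, labels)) <= gap)
    and np.all(np.abs(logistic_loss(u, labels) - logistic_loss(v, labels)) <= gap)
    and np.all(np.abs(quantile_loss(u, targets, 0.9) - quantile_loss(v, targets, 0.9)) <= gap)
)
print(lipschitz_ok)
```

The random check is only a sanity test, not a proof; the Lipschitz property itself follows from the fact that each loss has (sub)derivatives bounded by one in absolute value.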
The two main theoretical results of the paper, stated in Section 2, are general in the sense that they do not rely on a specific loss function or a specific regularization norm. We develop two different settings that handle different assumptions on the design. In the first one, we assume that the family of predictors is subgaussian; in the second one, we assume that the predictors are uniformly bounded. The latter setting is well suited for classification tasks, including the 1-bit matrix completion problem. The rates of convergence rely on quantities that measure the complexity of the model and the size of the subdifferential of the norm.
To be more precise, the method works for any regularization function as long as it is a norm. If this norm has some sparsity inducing power, like the $\ell_1$ or nuclear norms, then the statistical bounds depend on the underlying sparsity around the oracle because the subdifferential is large. We refer to these bounds as sparsity dependent bounds. If the norm does not induce sparsity, it is still possible to derive bounds that depend on the norm of the oracle, because the subdifferential of the norm is very large at $0$. We call them norm dependent bounds (aka “complexity dependent bounds” in [LM_comp]).
We study several applications that give new insights on diverse problems. The first one is a classification problem with logistic loss and LASSO or SLOPE regularizations. We prove that the rate of the SLOPE estimator is minimax in this framework. The second one is about matrix completion. We derive new excess risk bounds for the 1-bit matrix completion problem with both logistic and hinge loss. We also study the quantile loss for matrix completion and prove that it reaches sharp bounds. We show several examples in order to assess the general methods, as well as simulation studies. The last example involves the SVM and shows that a “classic” regularization method with no special sparsity inducing power can be analyzed in the same way as sparsity inducing regularization methods.
A remarkable fact is that no assumption on the output $Y$ is needed (while most results for the quadratic loss rely on an assumption on the tails of the distribution of $Y$). Neither do we assume any statistical model relating the “output variable” $Y$ to the “input variable” $X$.
Mathematical background and notations.
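The observations are $N$ i.i.d. pairs $(X_i, Y_i)_{i=1}^N$ where the $(X_i, Y_i)\in\mathcal{X}\times\mathcal{Y}$ are distributed according to $P$. We consider the case where $\mathcal{Y}$ is a subset of $\mathbb{R}$ and let $\mu$ denote the marginal distribution of $X_i$. Let $L_2$ be the set of real valued functions $f$ defined on $\mathcal{X}$ such that $\mathbb{E} f(X)^2 < \infty$ where the distribution of $X$ is $\mu$. In this space, we define the $L_2$-norm as $\|f\|_{L_2} = (\mathbb{E} f(X)^2)^{1/2}$ and the $L_\infty$ norm such that $\|f\|_{L_\infty} = \operatorname{esssup}|f(X)|$. We consider a set of predictors $F\subseteq E$, where $E$ is a subspace of $L_2$ and $\|\cdot\|$ is a norm over $E$ (actually, in some situations we will simply have $F = E$, but in some natural examples we will consider bounded sets of predictors, in the sense that $\sup_{f\in F}\|f\|_{L_\infty} < \infty$, which implies that $F$ cannot be a subspace of $L_2$).
For every $f\in F$, the loss incurred when we predict $f(x)$, while the true output / label is actually $y$, is measured using a loss function $\ell$: $\ell(f(x), y)$. For short, we will also write $\ell_f(x, y) = \ell(f(x), y)$ for the loss function associated with $f$. In this work, we focus on loss functions that are nonnegative, and Lipschitz, in the following sense.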
Assumption 1.1 (Lipschitz loss function).
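For every $f_1, f_2\in F$, $x\in\mathcal{X}$ and $y\in\mathcal{Y}$, we have
$$|\ell(f_1(x), y) - \ell(f_2(x), y)| \le |f_1(x) - f_2(x)|.$$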
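Note that we chose a Lipschitz constant equal to one in Assumption 1.1. This can always be achieved by a proper normalization of the loss function. We define the oracle predictor as
$$f^* = \operatorname{argmin}_{f\in F} P\ell_f \quad\text{where}\quad P\ell_f = \mathbb{E}\,\ell_f(X, Y)$$
and $(X, Y)$ is distributed like the $(X_i, Y_i)$'s. The objective of machine learning is to provide an estimator $\hat f$ that predicts almost as well as $f^*$. We usually formalize this notion by introducing the excess risk $\mathcal{E}(f)$ of $f\in F$ by
$$\mathcal{L}_f = \ell_f - \ell_{f^*} \quad\text{and}\quad \mathcal{E}(f) = P\mathcal{L}_f.$$
Thus we consider estimators of the form
$$\hat f \in \operatorname{argmin}_{f\in F}\left\{P_N\ell_f + \lambda\|f\|\right\} \qquad (1)$$
where $P_N\ell_f = \frac{1}{N}\sum_{i=1}^N \ell_f(X_i, Y_i)$ and $\lambda$ is a regularization parameter to be chosen. Such estimators are usually called Regularized Empirical Risk Minimization procedures (RERM).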
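For the rest of the paper, we will use the following notations: let $rB$ and $rS$ denote the radius $r$ ball and sphere for the norm $\|\cdot\|$, i.e. $rB = \{f\in E: \|f\|\le r\}$ and $rS = \{f\in E: \|f\| = r\}$. For the $L_2$-norm, we write $rB_{L_2} = \{f\in L_2: \|f\|_{L_2}\le r\}$ and $rS_{L_2} = \{f\in L_2: \|f\|_{L_2} = r\}$, and so on for the other norms.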
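Even though our results are valid in the general setting introduced above, we will develop the examples mainly in two directions that we will refer to as the vector and matrix cases. The vector case involves $\mathcal{X}$ as a subset of $\mathbb{R}^p$; we then consider the class of linear predictors, i.e. $E = \{\langle t, \cdot\rangle: t\in\mathbb{R}^p\}$. In this case, we denote for $q\in[1, +\infty]$ the $\ell_q$-norm in $\mathbb{R}^p$ as $\|\cdot\|_{\ell_q}$. The matrix case is also referred to as the trace regression model: $X$ is a random matrix in $\mathbb{R}^{m\times T}$ and we consider the class of linear predictors $E = \{\langle M, \cdot\rangle: M\in\mathbb{R}^{m\times T}\}$ where $\langle A, B\rangle = \operatorname{Trace}(A^\top B)$ for any matrices $A, B$ in $\mathbb{R}^{m\times T}$. The norms we consider are then, for $q\in[1, +\infty)$, the Schatten-$q$-norm of a matrix: $\|M\|_{S_q} = \left(\sum_i \sigma_i(M)^q\right)^{1/q}$ where $\sigma_1(M)\ge\sigma_2(M)\ge\cdots$ is the family of the singular values of $M$. The Schatten-$1$ norm is also called trace norm or nuclear norm. The Schatten-$2$ norm is also known as the Frobenius norm. The $S_\infty$ norm, defined as $\|M\|_{S_\infty} = \sigma_1(M)$, is known as the operator norm.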
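The Schatten norms used in the matrix case can be computed directly from the singular values. The following is a minimal sketch with NumPy; the helper name `schatten_norm` is hypothetical, and the example matrix is chosen only so that its singular values (4 and 3) are easy to read off.

```python
import numpy as np

def schatten_norm(M, q):
    """Schatten-q norm: the l_q norm of the vector of singular values of M."""
    s = np.linalg.svd(M, compute_uv=False)
    if np.isinf(q):
        return float(s.max())              # operator norm
    return float((s ** q).sum() ** (1.0 / q))

# A 3 x 2 matrix with singular values 4 and 3.
M = np.array([[3.0, 0.0],
              [0.0, 4.0],
              [0.0, 0.0]])

nuclear = schatten_norm(M, 1)          # trace / nuclear norm: 4 + 3
frobenius = schatten_norm(M, 2)        # Frobenius norm: sqrt(16 + 9)
operator = schatten_norm(M, np.inf)    # largest singular value
print(nuclear, frobenius, operator)
```

The same three quantities are available in NumPy as `np.linalg.norm(M, "nuc")`, `np.linalg.norm(M, "fro")` and `np.linalg.norm(M, 2)`.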
The notation $C$ will be used to denote positive constants, that might change from one instance to the other. For any real numbers $a, b$, we write $a\lesssim b$ when there exists a positive constant $C$ such that $a\le Cb$. When $a\lesssim b$ and $b\lesssim a$, we write $a\sim b$.
Proof of Concept.
We now briefly present one of the outputs of our global approach: an oracle inequality for the 1-bit matrix completion problem with hinge loss (we refer the reader to Section 4 for a detailed exposition of this example). While the general matrix completion problem has been extensively studied in the case of a quadratic loss, see [MR2906869, LM_sparsity] and the references therein, we believe that there is no satisfying solution to the so-called 1-bit matrix completion problem, that is for binary observations $Y\in\{-1,+1\}$. Indeed, the attempts in [srebro2004maximum, Cottet2016] to use the hinge loss did not lead to rank dependent learning rates. On the other hand, [lafond2014probabilistic] studied a RERM procedure using a statistical modeling approach and the logistic loss. While these authors prove optimal rates of convergence of their estimator with respect to the Frobenius norm, the excess classification risk is not studied in their paper. However, we believe that the essence of machine learning is to focus on this quantity, since it is directly related to the average number of errors in prediction.
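From now on we assume that $\mathcal{Y} = \{-1,+1\}$ and we consider the matrix framework. In matrix completion, we write the observed location as a mask matrix $X$: it is an element of the canonical basis $(E_{1,1},\dots,E_{m,T})$ of $\mathbb{R}^{m\times T}$ where, for any $(p,q)\in\{1,\dots,m\}\times\{1,\dots,T\}$, the entries of $E_{p,q}$ are $0$ everywhere except for the $(p,q)$-th entry, which equals $1$. We assume that there are constants $0 < \underline{c}\le\bar{c} < \infty$ such that, for any $(p,q)$, $\underline{c}/(mT)\le\mathbb{P}(X = E_{p,q})\le\bar{c}/(mT)$ (this extends the uniform sampling distribution, for which $\underline{c} = \bar{c} = 1$). These assumptions are encompassed in the following definition.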
Assumption 1.2 (Matrix completion design).
The sample size is in and takes value in the canonical basis of . There are positive constants such that for any ,
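To make the sampling model concrete, here is a small simulation sketch of 1-bit matrix completion under uniform sampling of the canonical basis, together with an empirical hinge risk penalized by the nuclear norm. The target matrix, the 10% label-flip noise, the value of the penalty level and all names are hypothetical choices made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
m, T, N = 8, 6, 200

# Hypothetical rank-one sign matrix playing the role of the target.
u = rng.choice([-1.0, 1.0], size=m)
v = rng.choice([-1.0, 1.0], size=T)
M_true = np.outer(u, v)

# Uniform sampling of the observed locations: X_i = E_{p_i, q_i}.
rows = rng.integers(0, m, size=N)
cols = rng.integers(0, T, size=N)

# Binary outputs in {-1, +1}: the true sign, flipped with probability 0.1.
flip = rng.random(N) < 0.1
Y = np.where(flip, -1.0, 1.0) * M_true[rows, cols]

def hinge_objective(M, rows, cols, Y, lam):
    """Empirical hinge risk plus lam times the nuclear norm of M."""
    # Since X_i is a canonical basis matrix, <X_i, M> is the entry M[p_i, q_i].
    margins = Y * M[rows, cols]
    empirical_risk = np.mean(np.maximum(0.0, 1.0 - margins))
    return empirical_risk + lam * np.linalg.norm(M, "nuc")

lam = 0.01  # arbitrary illustrative value, not the theoretical choice
obj_zero = hinge_objective(np.zeros((m, T)), rows, cols, Y, lam)
obj_true = hinge_objective(M_true, rows, cols, Y, lam)
print(obj_zero, obj_true)
```

At the zero matrix every margin is $0$, so the objective equals $1$; at the sign matrix only the flipped observations contribute to the hinge risk, each with loss $2$.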
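A predictor can be seen, for this problem, as the natural inner product with a real $m\times T$ matrix: $f(X) = \langle M, X\rangle = \operatorname{Trace}(X^\top M)$. The class $F$ that we consider in Section 4 is the set of linear predictors where every entry of the matrix is bounded: $F = \{\langle\cdot, M\rangle: M\in bB_\infty\}$ where $bB_\infty = \{M = (M_{pq}): \max_{p,q}|M_{pq}|\le b\}$ for a specific $b$. This set is very common in matrix completion studies. But it is especially natural in this setting: indeed, the Bayes classifier, defined by $\bar M_{p,q} = \operatorname{sign}(\mathbb{E}[Y | X = E_{p,q}])$, has entries in $\{-1,+1\}$. So, by taking $b = 1$ in the definition of $F$, we ensure that the oracle $M^*$ satisfies $M^* = \bar M$, so there would be no point in taking $b > 1$. We will therefore consider the following RERM (using the hinge loss)
$$\hat M \in \operatorname{argmin}_{M\in B_\infty}\left(\frac{1}{N}\sum_{i=1}^N (1 - Y_i\langle X_i, M\rangle)_+ + \lambda\|M\|_{S_1}\right) \qquad (2)$$
where $\lambda$ is some parameter to be chosen. We prove in Section 4 the following result.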
Theorem 1.1.
Assume that Assumption 1.2 holds and there is such that, for any ,
There is a , that depends only on and , and that is formally introduced in Section 4 below, such that if one chooses the regularization parameter
then, with probability at least
the RERM estimator defined in (2) satisfies for every ,
and as a special case for ,
and its excess hinge risk is such that
where the notation is used for constants that might change from one instance to the other but depend only on , and .
The excess hinge risk bound from Theorem 1.1 is of special interest as it can be related to the classic excess risk. The excess risk of a procedure is really the quantity we want to control since it measures the difference between the average number of mistakes of a procedure with the best possible theoretical classification rule. Indeed, let us define the risk of by . It is clear that . Then, it follows from Theorem 2.1 in [zhang2004statistical] that for some universal constant , for every ,
where depends on , , and . This yields a bound on the average excess number of mistakes of . To our knowledge such a prediction bound was not available in the literature on the 1-bit matrix completion problem. Let us compare Theorem 1.1 to the main result in [lafond2014probabilistic]. In [lafond2014probabilistic], the authors focus on the estimation error , which seems less relevant for practical applications. In order to connect such a result to the excess classification risk, one can use the results in [zhang2004statistical] and in this case, the best bound that can be derived is of the order of . Note that other authors focused on the classification error: [srebro2004maximum] proved an excess error bound, but the bound does not depend on the rank of the oracle. The rate derived from Theorem 1.1 for the $0/1$-classification excess risk was only reached in [Cottet2016], but in the very restrictive noiseless setting, which is equivalent to .
We hope that this example convinced the reader of the practical interest of the general study of $\hat f$ in (1). The rest of the paper is organized as follows. In Section 2 we introduce the concepts necessary to the general study of (1): namely, a complexity parameter, and a sparsity parameter. Thanks to these parameters, we define the assumptions necessary to our general results: the Bernstein condition, which is classic in learning theory to obtain fast rates [LM_sparsity], and a stochastic assumption on $F$ (subgaussian, or bounded). The general results themselves are eventually presented. The remaining sections are devoted to applications of our results to different estimation methods: the logistic LASSO and logistic SLOPE in Section 3, matrix completion in Section 4 and Support Vector Machines (SVM) in Section 5. For matrix completion, the optimality of the rates for the logistic and the hinge loss, which was not known, is also derived. In Section 6 we discuss the Bernstein condition for the three main loss functions of interest: hinge, logistic and quantile.
2 Theoretical Results
2.1 Applications of the main results: the strategy
The two main theorems in Sections 2.5 and 2.6 below are general in the sense that they allow the user to deal with any (nonnegative) Lipschitz loss function and any norm for regularization, but they involve quantities that depend on the loss and the norm. The aim of this Section is first to provide the definition of these objects and some hints on their interpretation, through examples. The theorems are then stated in both settings. Basically, the assumptions for the theorems are of three types:
the so-called Bernstein condition, which is a quantification of the identifiability condition. It basically tells how the excess risk $P\mathcal{L}_f$ is related to the $L_2$ norm $\|f - f^*\|_{L_2}$.
a stochastic assumption on the distribution of the $f(X)$'s for $f\in F$. In this work, we consider both a subgaussian assumption and a uniform boundedness assumption. The analyses of the two setups differ only in the way the “statistical complexity of $F$” is measured (cf. below the functions $r(\cdot)$ in Definition 8.1 and Definition 8.2).
finally, we introduce a sparsity parameter as in [LM_sparsity]. It reflects how the norm $\|\cdot\|$ used as a regularizer can induce sparsity: for example, think of the “sparsity inducing power” of the $\ell_1$-norm used to construct the LASSO estimator.
Given a scenario, that is a loss function $\ell$, a random design $X$, a convex class $F$ and a regularization norm, statistical results (exact oracle inequalities and estimation bounds w.r.t. the $L_2$ and regularization norms) for the associated regularized estimator together with the choice of the regularization parameter follow from the derivation of the three parameters $(\kappa, r, \rho^*)$ as explained in the next box together with Theorem 2.1 and Theorem 2.2.
For the sake of simplicity, we present the two settings in different subsections with both the exact definition of the complexity function and the theorem. As the sparsity equation is the same in both settings, we define it before even though it involves the complexity function.
2.2 The Bernstein condition
The first assumption needed is called the Bernstein assumption and is classical when dealing with Lipschitz losses.
Assumption 2.1 (Bernstein condition).
There exist $\kappa\ge 1$ and $A > 0$ such that for every $f\in F$, $\|f - f^*\|_{L_2}^{2\kappa}\le A\,P\mathcal{L}_f$.
The most important parameter is $\kappa$ and will be involved in the rate of convergence. As usual, fast rates will be derived when $\kappa = 1$. In many situations, this assumption is satisfied and we present various cases in Section 6. In particular, we prove that it is satisfied with $\kappa = 1$ for the logistic loss in both the bounded and Gaussian frameworks, and we exhibit explicit conditions to ensure that Assumption 2.1 holds for the hinge and the quantile loss functions.
We call Assumption 2.1 a Bernstein condition following [MR2240689]. Note that it is different from the margin assumption from [Mammen1999, Tsy04]: in the so-called margin assumption, the oracle $f^*$ in $F$ is replaced by the minimizer $\bar f$ of the risk function over all measurable functions, sometimes called the Bayes rule. We refer the reader to Section 6 and to the discussions in [MR2933668] and Chapter 1.3 in [HDR-lecue] for more details on the difference between the margin assumption and the Bernstein condition.
The careful reader will actually realize that the proofs of Theorem 2.1 and Theorem 2.2 require only a weaker version of this assumption, that is: there exist $\kappa\ge 1$ and $A > 0$ such that for every $f\in\mathcal{C}$, $\|f - f^*\|_{L_2}^{2\kappa}\le A\,P\mathcal{L}_f$, where $\mathcal{C}$ is defined in terms of the complexity function $r(\cdot)$ and the sparsity parameter $\rho^*$ to be defined in the next subsections,
Note that the set $\mathcal{C}$ appears to play a central role in the analysis of regularization methods, cf. [LM_sparsity]. However, in all the examples presented in this paper, we prove that the Bernstein condition holds on the entire set $F$.
2.3 The complexity function
The complexity function is defined by
$$r(\rho) = \left[\frac{A\,\mathrm{comp}(B)\,\rho}{\sqrt{N}}\right]^{1/(2\kappa)}$$
where $A$ is the constant in Assumption 2.1 and where $\mathrm{comp}(B)$ is a measure of the complexity of the unit ball $B$ associated to the regularization norm. Note that this complexity measure will depend on the stochastic assumption on $F$. In the bounded setting, $\mathrm{comp}(B) = C\,\mathrm{Rad}(B)$ where $C$ is an absolute constant and $\mathrm{Rad}(B)$ is the Rademacher complexity of $B$ (whose definition will be recalled in Subsection 2.6). In the subgaussian setting, $\mathrm{comp}(B) = C L\, w(B)$ where $C$ is an absolute constant, $L$ is the subgaussian parameter of the class $F - F$ and $w(B)$ is the Gaussian mean-width of $B$ (here again, exact definitions of $L$ and $w(B)$ will be recalled in Subsection 2.5).
Note that sharper (localized) versions of are provided in Section 8. However, as it is the simplest version that is used in most examples, we only introduce this version for now.
2.4 The sparsity parameter
The size of the sub-differential of the regularization function $\|\cdot\|$ in a neighborhood of the oracle $f^*$ will play a central role in our analysis as well. We now recall its definition: for every $f\in E$,
$$(\partial\|\cdot\|)_f = \{z^*\in E^*: \|f + h\| - \|f\|\ge z^*(h) \text{ for every } h\in E\}.$$
It is well-known that $(\partial\|\cdot\|)_f$ is a subset of the unit sphere of the dual norm of $\|\cdot\|$ when $f\ne 0$. Note also that when $f = 0$, $(\partial\|\cdot\|)_f$ is the entire unit dual ball, a fact we will also use in two situations: either when the regularization norm has no “sparsity inducing power” (in particular, when it is a smooth function as in the RKHS case treated in Section 5), or when one wants extra norm dependent upper bounds (cf. [LM_comp] for more details, where these bounds are called complexity dependent) in addition to sparsity dependent upper bounds. In the latter case, the statistical bounds that we get are the minimum between an error rate that depends on the notion of sparsity naturally associated to the regularization norm (when it exists) and an error rate that depends on $\|f^*\|$.
Definition 2.1 (From [LM_sparsity]).
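The sparsity parameter is the function $\Delta(\cdot)$ defined by
$$\Delta(\rho) = \inf_{h\in\rho S\cap r(2\rho)B_{L_2}}\ \sup_{g\in\Gamma_{f^*}(\rho)}\langle h, g\rangle \quad\text{where}\quad \Gamma_{f^*}(\rho) = \bigcup_{f\in f^* + (\rho/20)B}(\partial\|\cdot\|)_f.$$
Note that there is a slight difference with the definition of the sparsity parameter from [LM_sparsity], where $\Delta(\rho)$ is defined by taking the infimum over the sphere $\rho S$ intersected with an $L_2$-ball of radius $r(\rho)$, whereas in Definition 2.1, $\rho S$ is intersected with an $L_2$-ball of radius $r(2\rho)$. Up to absolute constants this has no effect on the behavior of $\Delta(\rho)$, and the difference comes from technical details in our analysis (a peeling argument that we use below, whereas a direct homogeneity argument was enough in [LM_sparsity]).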
In the following, estimation rates with respect to the regularization norm $\|\cdot\|$ and the $L_2$ norm, as well as sharp oracle inequalities, are given. All the convergence rates depend on a single radius $\rho^*$ that satisfies the sparsity equation as introduced in [LM_sparsity].
The radius $\rho^*$ is any solution of the sparsity equation:
$$\Delta(\rho^*)\ge\frac{4}{5}\rho^*. \qquad (8)$$
Since $\rho^*$ is central in the results and drives the convergence rates, finding a solution to the sparsity equation will play an important role in all the examples that we work out in the following. Roughly speaking, if the regularization norm induces sparsity, a sparse element in $f^* + (\rho^*/20)B$ (that is, an element $f$ for which $(\partial\|\cdot\|)_f$ is almost extremal, i.e. almost as large as the dual sphere) yields the existence of a small $\rho^*$. In this case, $\rho^*$ satisfies the sparsity equation.
In addition, if one takes $\rho = 20\|f^*\|$ then $0\in\Gamma_{f^*}(\rho)$ and since $(\partial\|\cdot\|)_0$ is the entire dual ball associated to $\|\cdot\|$, one has directly that $\Delta(\rho) = \rho$ and so $\rho$ satisfies the sparsity Equation (8). We will use this observation to obtain norm dependent upper bounds, i.e. rates of convergence depending on $\|f^*\|$ and that do not depend on any sparsity parameter. Such a bound holds for any norm; in particular, for norms with no sparsity inducing power as in Section 5.
2.5 Theorem in the subgaussian setting
First, we introduce the subgaussian framework (then we will turn to the bounded case in the next section).
Definition 2.3 (Subgaussian class).
We say that a class of functions is -subgaussian (w.r.t. ) for some constant when for all and all ,
We will use the following operations on sets: for any and ,
Assumption 2.2 (Subgaussian assumption).
The class $F - F$ is $L$-subgaussian.
Note that there are many equivalent formulations of the subgaussian property of a random variable based on -Orlicz norms, deviations inequalities, exponential moments, moments growth characterization, etc. (cf., for instance Theorem 1.1.5 in [MR3113826]). The one we should use later is as follows: there exists some absolute constant such that is -subgaussian if and only if for all and ,
There are several examples of subgaussian classes. For instance, when is a class of linear functionals for and is a random variable in then is -subgaussian in the following cases:
is a Gaussian vector in ,
has independent coordinates that are subgaussian, that is, there are constants and such that , ,
for , is uniformly distributed over (cf. [MR2123199]),
is an unconditional vector (meaning that for every signs , has the same distribution as ), for some and almost surely then one can choose (cf. [LM13]).
In the subgaussian framework, a natural way to measure the statistical complexity of the problem is via Gaussian mean-width that we introduce now.
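Let $H$ be a subset of $L_2$ and denote by $d$ the natural metric in $L_2$. Let $(G_h)_{h\in H}$ be the canonical centered Gaussian process indexed by $H$ (in particular, the covariance structure of $G$ is given by $d$: $(\mathbb{E}(G_{h_1} - G_{h_2})^2)^{1/2} = (\mathbb{E}(h_1(X) - h_2(X))^2)^{1/2}$ for all $h_1, h_2\in H$). The Gaussian mean-width of $H$ (as a subset of $L_2$) is
$$w(H) = \mathbb{E}\sup_{h\in H} G_h.$$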
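We refer the reader to Section 12 in [MR1932358] for the construction of Gaussian processes in $L_2$. There are many natural situations where Gaussian mean-widths can be computed. To get familiar with this quantity, let us consider an example in the matrix framework. Let $H = \{\langle\cdot, A\rangle: \|A\|_{S_1}\le 1\}$ be the class of linear functionals indexed by the unit ball of the $S_1$-norm and $d$ be the distance associated with the Frobenius norm (i.e. $d(\langle\cdot, A\rangle, \langle\cdot, B\rangle) = \|A - B\|_{S_2}$); then
$$w(H) = \mathbb{E}\sup_{\|A\|_{S_1}\le 1}\langle G, A\rangle = \mathbb{E}\|G\|_{S_1}^* = \mathbb{E}\|G\|_{S_\infty}\sim\sqrt{m + T}$$
where $G$ is a standard Gaussian matrix in $\mathbb{R}^{m\times T}$ and $\|\cdot\|_{S_1}^*$ is the dual norm of the nuclear norm, which is the operator norm $\|\cdot\|_{S_\infty}$.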
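Because the dual of the nuclear norm is the operator norm, the Gaussian mean-width of the nuclear-norm unit ball (for the Frobenius metric) is the expected operator norm of a standard Gaussian matrix. The sketch below estimates this expectation by Monte Carlo; the function name, the dimensions and the number of draws are hypothetical choices made for illustration.

```python
import numpy as np

def nuclear_ball_mean_width(m, T, n_draws=200, seed=0):
    """Monte Carlo estimate of E ||G||_op for an m x T standard Gaussian matrix G."""
    rng = np.random.default_rng(seed)
    # np.linalg.norm(., 2) on a 2-D array returns the largest singular value.
    draws = [np.linalg.norm(rng.standard_normal((m, T)), 2) for _ in range(n_draws)]
    return float(np.mean(draws))

m, T = 30, 20
width = nuclear_ball_mean_width(m, T)
upper = np.sqrt(m) + np.sqrt(T)   # classical upper bound on E ||G||_op
print(width, upper)
```

The estimate can be compared with the classical bound $\mathbb{E}\|G\|_{S_\infty}\le\sqrt{m} + \sqrt{T}$, which is of the same order as $\sqrt{m + T}$.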
We are now in position to define the complexity parameter as announced previously.
The complexity parameter is the non-decreasing function $r(\cdot)$ defined for every $\rho\ge 0$,
$$r(\rho) = \left(\frac{A C L\, w(B)\,\rho}{\sqrt{N}}\right)^{1/(2\kappa)}$$
where $\kappa, A$ are the Bernstein parameters from Assumption 2.1, $L$ is the subgaussian parameter from Assumption 2.2 and $C > 0$ is an absolute constant (the exact value of $C$ can be deduced from the proof of Proposition 8.2). The Gaussian mean-width of $B$ is computed with respect to the metric associated with the covariance structure of $X$, i.e. $d(f_1, f_2) = \|f_1 - f_2\|_{L_2}$ for every $f_1, f_2\in F$.
After the computation of the Bernstein parameter $\kappa$, the complexity function $r(\cdot)$ and the radius $\rho^*$, it is now possible to state our main result in the subgaussian framework.
and satisfying (8). Then, with probability larger than
where denotes positive constants that might change from one instance to the other and depend only on , , and .
Remark 2.2 (Deviation parameter).
Replacing by any upper bound does not affect the validity of the result. As a special case, it is possible to increase the confidence level of the bound by replacing by : then, with probability at least
we have in particular
Remark 2.3 (Norm and sparsity dependent error rates).
Theorem 2.1 holds for any radius $\rho^*$ satisfying the sparsity equation (8). We have noticed in Section 2.4 that $\rho^* = 20\|f^*\|$ satisfies the sparsity equation since in that case $0\in\Gamma_{f^*}(\rho^*)$ and so $\Delta(\rho^*) = \rho^*$. Therefore, one can apply Theorem 2.1 to both $\rho^* = 20\|f^*\|$ (this leads to norm dependent upper bounds) and to the smallest $\rho^*$ satisfying the sparsity equation (8) (this leads to sparsity dependent upper bounds) at the same time. Both will lead to meaningful results (a typical example of such a combined result is Theorem 9.2 from [MR2829871] or Theorem 3.1 below).
2.6 Theorem in the bounded setting
We now turn to the bounded framework; that is, we assume that all the functions in $F$ are uniformly bounded in $L_\infty$. This assumption is very different in nature from the subgaussian assumption, which is in fact a norm equivalence assumption (i.e. Definition 2.3 is equivalent to $\|f\|_{\psi_2}\lesssim L\|f\|_{L_2}$ for all $f\in F$, where $\|\cdot\|_{\psi_2}$ is the $\psi_2$ Orlicz norm, cf. [MR1113700]).
Assumption 2.3 (Boundedness assumption).
There exists a constant $b > 0$ such that for all $f\in F$, $\|f\|_{L_\infty}\le b$.
The main motivation to consider the bounded setup is sampling over the canonical basis of a finite dimensional space like $\mathbb{R}^p$ or $\mathbb{R}^{m\times T}$. Note that this type of sampling is stricto sensu subgaussian, but with a constant depending on the dimensions $p$ and $m\times T$, which yields sub-optimal rates. This is the reason why the results in the bounded setting are more relevant in this situation. This is especially true for the 1-bit matrix completion problem as introduced in Section 1. For this example, the $X_i$'s are chosen randomly in the canonical basis $(E_{p,q})$ of $\mathbb{R}^{m\times T}$. Moreover, in that example, the class $F$ is the class of all linear functionals indexed by $bB_\infty$: $F = \{\langle\cdot, M\rangle: \max_{p,q}|M_{pq}|\le b\}$, and therefore the study of this problem falls naturally in the bounded framework studied in this section.
Under the boundedness assumption, the “statistical complexity” can no longer be characterized by the Gaussian mean-width. We therefore introduce another complexity parameter, known as the Rademacher complexity. This complexity measure has been extensively studied in the learning theory literature (cf., for instance, [MR2329442, MR2829871, MR2166554]).
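Let $H$ be a subset of $L_2$. Let $(\epsilon_i)_{i=1}^N$ be $N$ i.i.d. Rademacher variables (i.e. $\mathbb{P}(\epsilon_i = 1) = \mathbb{P}(\epsilon_i = -1) = 1/2$) independent of the $X_i$'s. The Rademacher complexity of $H$ is
$$\mathrm{Rad}(H) = \mathbb{E}\sup_{h\in H}\left|\frac{1}{\sqrt{N}}\sum_{i=1}^N\epsilon_i h(X_i)\right|.$$
Note that when $(h(X))_{h\in H}$ is a version of the isonormal process over $L_2$ (cf. Chapter 12 in [MR1932358]) restricted to $H$ then the Gaussian mean-width and the Rademacher complexity coincide: $w(H) = \mathrm{Rad}(H)$. But, in that case, $H$ is not bounded in $L_\infty$ and, in general, the two complexity measures are different.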
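As an illustration, the Rademacher complexity of the class of linear functionals indexed by the $\ell_1$ unit ball can be estimated by Monte Carlo, since the supremum of $\langle t, v\rangle$ over $\|t\|_{\ell_1}\le 1$ is $\|v\|_{\ell_\infty}$. The sketch below assumes the $1/\sqrt{N}$ normalization of the Rademacher sum; the two designs (standard Gaussian, and uniform sampling over the canonical basis of $\mathbb{R}^p$) and all names are hypothetical choices.

```python
import numpy as np

def rademacher_l1_ball(sample_design, N, n_draws=500, seed=0):
    """Monte Carlo estimate of E sup_{||t||_1 <= 1} |N^{-1/2} sum_i eps_i <X_i, t>|."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_draws):
        X = sample_design(rng, N)                 # design matrix of shape (N, p)
        eps = rng.choice([-1.0, 1.0], size=N)     # i.i.d. Rademacher signs
        v = (eps[:, None] * X).sum(axis=0) / np.sqrt(N)
        values.append(np.abs(v).max())            # l_infinity norm = sup over l_1 ball
    return float(np.mean(values))

def gaussian_design(p):
    """Rows are independent standard Gaussian vectors in R^p."""
    return lambda rng, N: rng.standard_normal((N, p))

def basis_design(p):
    """Rows are drawn uniformly from the canonical basis of R^p."""
    def sample(rng, N):
        X = np.zeros((N, p))
        X[np.arange(N), rng.integers(0, p, size=N)] = 1.0
        return X
    return sample

p, N = 50, 200
rad_gauss = rademacher_l1_ball(gaussian_design(p), N)
rad_basis = rademacher_l1_ball(basis_design(p), N)
print(rad_gauss, rad_basis)
```

For the Gaussian design the normalized signed sum is itself a standard Gaussian vector, so the estimate is close to the Gaussian mean-width of the $\ell_1$ ball; for the canonical-basis design each coordinate only receives about $N/p$ observations, which gives a markedly smaller value.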
There are many examples where Rademacher complexities have been computed (cf. [MR2075996]). As in the previous subgaussian setting, the statistical complexity is given by a function $r(\cdot)$ (we use the same name in the bounded and subgaussian setups because this function plays exactly the same role in both scenarios, even though it uses a different notion of complexity).
The complexity parameter is the non-decreasing function $r(\cdot)$ defined for every $\rho\ge 0$ by
$$r(\rho) = \left(\frac{C A\,\mathrm{Rad}(B)\,\rho}{\sqrt{N}}\right)^{1/(2\kappa)}$$
where denotes positive constants that might change from one instance to the other and depend only on , , and is the function introduced in Definition 2.7.
3 Application to logistic LASSO and logistic SLOPE
The first example of application of the main results in Section 2 involves one very popular method developed during the last two decades in binary classification which is the Logistic LASSO procedure (cf. [MR2699823, MR2412631, MR2427362, MR2721710, MR3073790]).
We consider the vector framework, where $(X_1, Y_1),\dots,(X_N, Y_N)$ are $N$ i.i.d. pairs with values in $\mathbb{R}^p\times\{-1,+1\}$ distributed like $(X, Y)$. Both the bounded and subgaussian frameworks can be analyzed in this example. For the sake of brevity, and since an example in the bounded case is provided in the next section, only the subgaussian case is considered here and we leave the bounded case to the interested reader. We shall therefore apply Theorem 2.1 to get estimation and prediction bounds for the well known logistic LASSO and the new logistic SLOPE.
In this section, we consider the class of linear functionals indexed by $RB_{\ell_2}$ for some radius $R\ge 1$ and the logistic loss:
$$F = \{\langle\cdot, t\rangle: t\in RB_{\ell_2}\},\qquad \ell_f(x, y) = \log(1 + \exp(-yf(x))).$$
As usual the oracle is denoted by $f^* = \operatorname{argmin}_{f\in F}\mathbb{E}\,\ell_f(X, Y)$; we also introduce $t^*$ such that $f^* = \langle\cdot, t^*\rangle$.
3.1 Logistic LASSO
The logistic loss function is Lipschitz with constant $1$, so Assumption 1.1 is satisfied. It follows from Proposition 6.2 in Section 6.1 that Assumption 2.1 is satisfied when the design $X$ is the standard Gaussian variable in $\mathbb{R}^p$ and the considered class is $F$. In that case, the Bernstein parameter is $\kappa = 1$, and we have for some absolute constant which can be deduced from the proof of Proposition 6.2. We consider the $\ell_1$ norm for regularization. We will therefore obtain statistical results for the RERM estimator that is defined by
where $\lambda$ is a regularization parameter to be chosen according to Theorem 2.1.
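For readers who want to experiment, the logistic LASSO can be computed by proximal gradient descent (ISTA) with soft-thresholding. The sketch below is a simplified illustration, not the procedure analyzed in the paper: it drops the constraint that the parameter lies in an $\ell_2$-ball, the simulated data and the names are hypothetical, and the regularization level $\lambda = \sqrt{\log(p)/N}$ is a common scaling taken here as an assumption rather than the constant prescribed by Theorem 2.1.

```python
import numpy as np

def soft_threshold(t, thr):
    """Proximal operator of thr * ||.||_1."""
    return np.sign(t) * np.maximum(np.abs(t) - thr, 0.0)

def logistic_lasso_objective(t, X, Y, lam):
    """Empirical logistic risk plus lam * ||t||_1."""
    return float(np.mean(np.logaddexp(0.0, -Y * (X @ t))) + lam * np.abs(t).sum())

def logistic_lasso_ista(X, Y, lam, n_iter=500):
    """Proximal gradient descent with constant step 1/L, L = ||X||_op^2 / (4N)."""
    N, p = X.shape
    step = 4.0 * N / np.linalg.norm(X, 2) ** 2
    t = np.zeros(p)
    for _ in range(n_iter):
        margins = Y * (X @ t)
        # sigmoid(-margins), written with tanh for numerical stability
        weights = 0.5 * (1.0 - np.tanh(0.5 * margins))
        grad = -(X.T @ (Y * weights)) / N
        t = soft_threshold(t - step * grad, step * lam)
    return t

# Hypothetical simulated data: Gaussian design, 3-sparse parameter.
rng = np.random.default_rng(0)
N, p, s = 400, 50, 3
t_true = np.zeros(p)
t_true[:s] = 2.0
X = rng.standard_normal((N, p))
prob = 1.0 / (1.0 + np.exp(-(X @ t_true)))
Y = np.where(rng.random(N) < prob, 1.0, -1.0)

lam = np.sqrt(np.log(p) / N)   # assumed scaling, see the text above
t_hat = logistic_lasso_ista(X, Y, lam)
obj_start = logistic_lasso_objective(np.zeros(p), X, Y, lam)
obj_end = logistic_lasso_objective(t_hat, X, Y, lam)
print(obj_start, obj_end, np.count_nonzero(t_hat))
```

With a step size equal to the inverse of the Lipschitz constant of the gradient, the penalized objective is non-increasing along the iterations, and the soft-thresholding step produces exact zeros in the estimate.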
The two final ingredients needed to apply Theorem 2.1 are 1) the computation of the Gaussian mean-width of the unit ball of the regularization function, and 2) a solution to the sparsity equation (8).
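Let us first deal with the complexity parameter of the problem. If one assumes that the design vector $X$ is isotropic, i.e. $\mathbb{E}\langle X, t\rangle^2 = \|t\|_{\ell_2}^2$ for every $t\in\mathbb{R}^p$, then the metric naturally associated with $X$ is the canonical $\ell_2$-distance in $\mathbb{R}^p$. In that case, it is straightforward to check that $w(B_{\ell_1})\le c_1\sqrt{\log p}$ for some (known) absolute constant $c_1$ and so we define, for all $\rho\ge 0$,
$$r(\rho) = C\left(\rho\sqrt{\frac{\log p}{N}}\right)^{1/2}$$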
for the complexity parameter of the problem (from now and until the end of Section 3, the constants depends only on , , and ).
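Now let us turn to a solution of the sparsity equation (8). First note that when the design is isotropic the sparsity parameter is the function
$$\Delta(\rho) = \inf\left\{\sup_{g\in\Gamma_{t^*}(\rho)}\langle h, g\rangle: h\in\rho S_{\ell_1}\cap r(2\rho)B_{\ell_2}\right\}.$$
A first solution to the sparsity equation is $\rho^* = 20\|t^*\|_{\ell_1}$ because it leads to $0\in\Gamma_{t^*}(\rho^*)$. This solution is called norm dependent.
Another radius $\rho^*$ solution to the sparsity equation (8) is obtained when $t^*$ is close to a sparse vector, that is a vector with a small support. We denote by $\|v\|_0 = |\mathrm{supp}(v)|$ the size of the support of $v\in\mathbb{R}^p$. Now, we recall a result from [LM_sparsity].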
Lemma 3.1 (Lemma 4.2 in [LM_sparsity]).
If there exists some $v\in t^* + (\rho/20)B_{\ell_1}$ such that $\|v\|_0\le c_0(\rho/r(2\rho))^2$ then $\Delta(\rho)\ge 4\rho/5$, where $0 < c_0 < 1$ is an absolute constant.
In particular, we get that $\rho^*\sim s\sqrt{(\log p)/N}$ is a solution to the sparsity equation if there is an $s$-sparse vector which is $(\rho^*/20)$-close to $t^*$ in $\ell_1$. This radius leads to the so-called sparsity dependent bounds.
After the derivation of the Bernstein parameter $\kappa = 1$, the complexity function $r(\cdot)$ and a solution $\rho^*$ to the sparsity equation, we are now in a position to apply Theorem 2.1 to get statistical bounds for the logistic LASSO.
Assume that is a standard Gaussian vector in . Let . Assume that there exists a -sparse vector in . Then, with probability larger than , for every , the logistic LASSO estimator with regularization parameter
and the excess logistic risk of is such that