Neyman-Pearson classification, convexity and stochastic constraints
Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss. Given a finite collection of classifiers, we combine them and obtain a new classifier that simultaneously satisfies the following two properties with high probability: (i) its probability of type I error is below a pre-specified level and (ii) its probability of type II error is close to the minimum possible. The proposed classifier is obtained by solving an optimization problem with an empirical objective and an empirical constraint. New techniques to handle such problems are developed and have consequences for chance constrained programming.
keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, stochastic constraint, convexity, empirical risk minimization, chance constrained optimization.
The Neyman-Pearson (NP) paradigm in statistical learning extends the objective of classical binary classification in that, while the latter focuses on minimizing classification error that is a weighted sum of type I and type II errors, the former minimizes type II error with an upper bound on type I error. With slight abuse of language, in verbal discussion we do not distinguish type I/II error from probability of type I/II error.
For learning with the NP paradigm, it is essential to avoid one kind of error at the expense of the other. As an illustration, consider the following problem in medical diagnosis: failing to detect a malignant tumor has far more severe consequences than flagging a benign tumor. Other scenarios include spam filtering, machine monitoring, target recognition, etc.
In the learning context, as true errors are inaccessible, we cannot enforce almost surely the desired upper bound for type I error. The best we can hope for is that a data dependent classifier has type I error bounded with high probability. Hence, there are two goals in this project. The first is to design a learning procedure so that type I error of the learned classifier is upper bounded by a pre-specified level with pre-specified high probability; the second is to show that the learned classifier has good performance bounds for excess type II error.
This paper is organized as follows. In Section 2, the classical setup for binary classification is reviewed and the main notation is introduced. A parallel between binary classification and statistical hypothesis testing is drawn in Section 3 with emphasis on the NP paradigm in both frameworks. The main propositions, theorems and their proofs are stated in Section 4 while secondary, technical results are relegated to the Appendix. Finally, Section 5 illustrates an application of our results to chance constrained optimization.
In the rest of the paper, we denote by $x_j$ the $j$-th coordinate of a vector $x \in \mathbb{R}^d$.
2 Binary classification
2.1 Classification risk and classifiers
Let be a random couple where is a vector of covariates and is a label that indicates to which class belongs. A classifier is a mapping whose sign returns the predicted class given . An error occurs when and it is therefore natural to define the classification loss by , where denotes the indicator function.
The expectation of the classification loss with respect to the joint distribution of is called (classification) risk and is defined by
Clearly, the indicator function is not convex and for computation, a common practice is to replace it by a convex surrogate (see, e.g. Bartlett et al., 2006, and references therein).
To this end, we rewrite the risk function as
where . Convex relaxation can be achieved by simply replacing the indicator function by a convex surrogate.
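A function $\varphi : [-1, 1] \to \mathbb{R}^+$ is called a convex surrogate if it is non-decreasing, continuous and convex and if $\varphi(0) = 1$.
Commonly used examples of convex surrogates are the hinge loss $\varphi(x) = (1 + x)_+$, the logit loss $\varphi(x) = \log_2(1 + e^x)$ and the exponential loss $\varphi(x) = e^x$.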
For a given choice of , define the -risk
Hereafter, we assume that is fixed and refer to as the risk. In our subsequent analysis, this convex relaxation will also be the ground to analyze a stochastic convex optimization problem subject to stochastic constraints. A general treatment of such problems can be found in Section 5.
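As a quick numerical illustration of the convex relaxation, the sketch below codes the three surrogates named above in Python. The closed forms used here, hinge (1 + x)_+, logit log_2(1 + e^x) and exponential e^x on [-1, 1], each equal to 1 at 0, are assumptions made for the example, as are all function names.

```python
import math


def hinge(x):
    """Hinge surrogate (1 + x)_+ (assumed form)."""
    return max(0.0, 1.0 + x)


def logit(x):
    """Logit surrogate log_2(1 + e^x) (assumed form)."""
    return math.log2(1.0 + math.exp(x))


def exponential(x):
    """Exponential surrogate e^x (assumed form)."""
    return math.exp(x)


def indicator(x):
    """0-1 loss as a function of its argument: 1 if x >= 0, else 0."""
    return 1.0 if x >= 0 else 0.0


SURROGATES = {"hinge": hinge, "logit": logit, "exponential": exponential}


def dominates_indicator(phi, grid_size=201):
    """Check phi(x) >= 1(x >= 0) on an evenly spaced grid of [-1, 1]."""
    xs = [-1.0 + 2.0 * k / (grid_size - 1) for k in range(grid_size)]
    return all(phi(x) >= indicator(x) - 1e-12 for x in xs)
```

Each of these functions is non-decreasing and convex and sits above the indicator on the grid, which is why a bound on a surrogate risk also bounds the corresponding 0-1 risk.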
Because of overfitting, it is unreasonable to look for mappings minimizing empirical risk over all classifiers. Indeed, one could have a small empirical risk but a large true risk. Hence, we resort to regularization. There are in general two ways to proceed. The first is to restrict the candidate classifiers to a specific class , and the second is to change the objective function by, for example, adding a penalty term. The two approaches can be combined, and are sometimes obviously equivalent.
In this paper, we pursue the first idea by defining the class of candidate classifiers as follows. Let be a given collection of classifiers. In our setup, we allow to be large. In particular, our results remain asymptotically meaningful as long as . Such classifiers are usually called base classifiers and can be constructed in a very naive manner. Typical examples include decision stumps or small trees. While the base classifiers may have no satisfactory classifying power individually, for over two decades, boosting-type algorithms have successfully exploited the idea that a suitable weighted majority vote among these classifiers may result in low classification risk (Schapire, 1990). Consequently, we restrict our search for classifiers to the set of functions consisting of convex combinations of the base classifiers:
where denotes the flat simplex of and is defined by . In effect, classification rules given by the sign of are exactly the set of rules produced by the weighted majority votes among the base classifiers .
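The construction of a weighted majority vote can be sketched in a few lines of Python. The helper names (`make_stump`, `in_simplex`, `convex_combination`, `predict`) are hypothetical, decision stumps are used only because the text lists them as typical base classifiers, and sending a tie at 0 to class +1 is an arbitrary convention of this sketch.

```python
def make_stump(coordinate, threshold):
    """Decision stump: +1 if x[coordinate] > threshold, else -1."""
    def stump(x):
        return 1.0 if x[coordinate] > threshold else -1.0
    return stump


def in_simplex(weights, tol=1e-9):
    """True if the weights are nonnegative and sum to one."""
    return all(w >= -tol for w in weights) and abs(sum(weights) - 1.0) <= tol


def convex_combination(base_classifiers, weights):
    """Return the map x -> sum_j weights[j] * base_classifiers[j](x)."""
    if len(base_classifiers) != len(weights):
        raise ValueError("one weight per base classifier is required")
    if not in_simplex(weights):
        raise ValueError("weights must lie in the simplex")

    def h(x):
        return sum(w * hj(x) for w, hj in zip(weights, base_classifiers))
    return h


def predict(h, x):
    """Predicted class is the sign of h(x); ties go to +1 in this sketch."""
    return 1 if h(x) >= 0 else -1
```

Because each stump takes values in {-1, +1} and the weights lie in the simplex, the combined function takes values in [-1, 1], and its sign is the outcome of the weighted vote.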
By restricting our search to classifiers in , the best attainable -risk is called oracle risk and is abusively denoted by . As a result, we have for any and a natural measure of performance for a classifier is given by its excess risk defined by .
The excess risk of a data driven classifier is a random quantity and we are interested in bounding it with high probability. Formally, the statistical goal of binary classification is to construct a classifier such that the oracle inequality
holds with probability , where should be as small as possible.
In the scope of this paper, we focus on candidate classifiers in the class . Some of the following results such as Theorem 4.1 can be extended to more general classes of classifiers with known complexity such as classes with bounded VC-dimension, as for example in Cannon et al. (2002). However, our main argument for bounding type II error relies on Proposition 4.1 which, in turn, depends heavily on the convexity of the problem, and it is not clear how it can be extended to more general classes of classifiers.
2.2 The Neyman-Pearson paradigm
In classical binary classification, the risk function can be expressed as a convex combination of type I error and of type II error :
More generally, we can define the -type I and -type II errors respectively by
Following the NP paradigm, for a given class of classifiers, we seek to solve the constrained minimization problem:
where , the significance level, is a constant specified by the user.
NP classification is closely related to the NP approach to statistical hypothesis testing. We now recall a few key concepts about the latter. Many classical works have addressed the theory of statistical hypothesis testing; in particular, Lehmann and Romano (2005) provide a thorough treatment of the subject.
Statistical hypothesis testing bears strong resemblance with binary classification if we assume the following model. Let and be two probability distributions on . Let and assume that is a random variable defined by
Assume further that the conditional distribution of given is given by . Given such a model, the goal of statistical hypothesis testing is to determine whether was generated from or . To that end, we construct a test and the conclusion of the test based on is that is generated from with probability and from with probability . Note that randomness here comes from an exogenous randomization process such as flipping a biased coin. Two kinds of errors arise: type I error occurs when rejecting when it is true, and type II error occurs when accepting when it is false. The Neyman-Pearson paradigm in hypothesis testing amounts to choosing that solves the following constrained optimization problem
where is the significance level of the test. In other words, we specify a significance level on type I error, and minimize type II error. We call a solution to this problem a most powerful test of level . The Neyman-Pearson Lemma gives mild sufficient conditions for the existence of such a test.
Theorem 2.1 (Neyman-Pearson Lemma).
Let and be probability distributions possessing densities and respectively with respect to some measure . Let , where the likelihood ratio and is such that and . Then,
is a level most powerful test.
For a given level , the most powerful test of level is defined by
Notice that in the learning framework, cannot be computed since it requires the knowledge of the likelihood ratio and of the distributions and . Therefore, it remains merely a theoretical proposition. Nevertheless, the result motivates the NP paradigm pursued here.
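When both distributions are known, the likelihood ratio test can be written down explicitly. The sketch below works through a textbook case chosen for this illustration (it is not taken from the paper): a standard normal null against a normal alternative with mean `shift` > 0 and unit variance. The likelihood ratio exp(shift * x - shift^2 / 2) is increasing in x, so thresholding it amounts to rejecting when x exceeds a cutoff, and no randomization is needed because the distributions are continuous.

```python
from statistics import NormalDist


def np_test_gaussian_shift(alpha, shift=1.0):
    """Level-alpha likelihood ratio test of N(0, 1) against N(shift, 1).

    Returns (cutoff, type_one_error, type_two_error) for the test that
    rejects the null when the observation exceeds the cutoff.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if shift <= 0.0:
        raise ValueError("shift must be positive")
    null = NormalDist(0.0, 1.0)
    alternative = NormalDist(shift, 1.0)
    cutoff = null.inv_cdf(1.0 - alpha)      # upper alpha-quantile under the null
    type_one = 1.0 - null.cdf(cutoff)       # probability of rejecting a true null
    type_two = alternative.cdf(cutoff)      # probability of accepting a false null
    return cutoff, type_one, type_two
```

For `alpha = 0.05` and `shift = 1.0` the cutoff is about 1.645 and the type II error is about 0.74; lowering `alpha` raises the cutoff and therefore the type II error, which is the trade-off the NP paradigm makes explicit.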
3 Neyman-Pearson classification via convex optimization
Recall that in NP classification, the goal is to solve the problem (2.3). This cannot be done directly as conditional distributions and , and hence and , are unknown. In statistical applications, information about these distributions is available through two i.i.d. samples , and , , where and . We do not assume that the two samples and are mutually independent. Presently the sample sizes and are assumed to be deterministic and will appear in the subsequent finite sample bounds. A different sampling scheme, where these quantities are random, is investigated in subsection 4.3.
3.1 Previous results and new input
While the binary classification problem has been extensively studied, theoretical propositions on how to implement the NP paradigm remain scarce. To the best of our knowledge, Cannon et al. (2002) initiated the theoretical treatment of the NP classification paradigm and an early empirical study can be found in Casasent and Chen (2003). The framework of Cannon et al. (2002) is the following. Fix a constant and let be a given set of classifiers with finite VC dimension. They study a procedure that consists of solving the following relaxed empirical optimization problem
denote the empirical type I and empirical type II errors respectively. Let be a solution to (3.1). Denote by a solution to the original Neyman-Pearson optimization problem:
The main result of Cannon et al. (2002) states that, simultaneously with high probability, the type II error is bounded from above by , for some and the type I error of is bounded from above by . In a later paper, Cannon et al. (2003) consider problem (3.1) for a data-dependent family of classifiers , and bound estimation errors accordingly. Several results for traditional statistical learning such as PAC bounds or oracle inequalities have been studied in Scott (2005) and Scott and Nowak (2005) in the same framework as the one laid down by Cannon et al. (2002). A noteworthy departure from this setup is Scott (2007) where sensible performance measures for NP classification that go beyond analyzing separately two kinds of errors are introduced. Furthermore, Blanchard et al. (2010) develop a general solution to semi-supervised novelty detection by reducing it to NP classification. Recently, Han et al. (2008) transposed several results of Cannon et al. (2002) and Scott and Nowak (2005) to NP classification with convex loss.
The present work departs from previous literature in our treatment of type I error. As a matter of fact, the classifiers in all the papers mentioned above can only ensure that is small, for some . However, it is our primary interest to make sure that with high probability, following the original principle of the Neyman-Pearson paradigm that type I error should be controlled by a pre-specified level . As will be illustrated, to control , it is necessary to have be a solution to some program with a strengthened constraint on empirical type I error. If our concern is only on type I error, we can just do so. However, we also want to control excess type II error simultaneously.
The difficulty was foreseen in the seminal paper Cannon et al. (2002), where it is claimed without justification that if we use for the empirical program, “it seems unlikely that we can control the estimation error in a distribution independent way”. The following proposition confirms this opinion in a certain sense.
Fix and . Let be the classifier defined as any solution of the following optimization problem:
The following negative result holds not only for this estimator but also for the oracle defined as the solution of
Note that is not a classifier but only a pseudo-classifier since it depends on the unknown distribution of the data.
There exist base classifiers and a probability distribution for for which, regardless of the sample sizes and , for any pseudo-classifier such that , it holds
In particular, the excess type II risk of does not converge to zero as sample sizes increase even if . Moreover, when for any (pseudo-)classifier such that , it holds
with probability at least . In particular, the excess type II risk of does not converge to zero with positive probability, as sample sizes increase even if .
The proof of this result is postponed to the appendix. The fact that the oracle satisfies the lower bound indicates that the problem comes from using a strengthened constraint. Note that the condition is purely technical and can be removed. Nevertheless, it is always the case in practice that .
In view of this negative result, it seems that our rightful insistence on controlling type I error does not go well with the ambition to control type II error simultaneously. To overcome this dilemma, we resort to a continuous convex surrogate as our loss function. In particular, we design a modified version of the empirical risk minimization method such that the data-driven classifier has type I error bounded by with high probability. Moreover, we consider here a class that allows a different treatment of the empirical processes involved.
This new approach comes with new technical challenges which we summarize here. In the approach of Cannon et al. (2002) and of Scott and Nowak (2005), the relaxed constraint on the type I error is constructed such that the constraint on type I error in (3.1) is satisfied by (defined in (3.2)) with high probability, and that this classifier accommodates excess type II error well. As a result, the control of type II error mainly follows as a standard exercise to control suprema of empirical processes. This is not the case here; we have to develop methods to control the optimum value of a convex optimization problem under a stochastic constraint. Such methods have consequences not only for NP classification but also for chance constrained programming as explained in Section 5.
3.2 Convexified NP classifier
To solve the problem of NP classification (2.3) where the distribution of the observations is unknown, we resort to empirical risk minimization. In view of the arguments presented in the previous subsection, we cannot simply replace the unknown true risk functions by their empirical counterparts. The treatment of the convex constraint should be done carefully and we proceed as follows.
For any classifier and a given convex surrogate , define and to be the empirical counterparts of and respectively by
Moreover, for any , let be the set of classifiers in whose convexified type I errors are bounded from above by , and let be the set of classifiers in whose empirical convexified type I errors are bounded by . To make our analysis meaningful, we assume that .
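The empirical quantities above can be sketched as plain sample averages. The code below assumes that class -1 plays the role of the null, so that the convexified type I error averages phi(h(x)) over the class -1 sample and the convexified type II error averages phi(-h(x)) over the class +1 sample; this sign convention, the hinge default and all function names are assumptions of the sketch.

```python
def hinge(x):
    """Hinge surrogate (1 + x)_+ (assumed form)."""
    return max(0.0, 1.0 + x)


def empirical_phi_type_one(h, negatives, phi=hinge):
    """Average of phi(h(x)) over the class -1 sample."""
    return sum(phi(h(x)) for x in negatives) / len(negatives)


def empirical_phi_type_two(h, positives, phi=hinge):
    """Average of phi(-h(x)) over the class +1 sample."""
    return sum(phi(-h(x)) for x in positives) / len(positives)


def satisfies_empirical_constraint(h, negatives, level, phi=hinge):
    """True if the empirical convexified type I error is at most `level`."""
    return empirical_phi_type_one(h, negatives, phi) <= level


def clipped_identity(x):
    """Toy classifier on the real line with values in [-1, 1]."""
    return max(-1.0, min(1.0, x))
```

For instance, with `negatives = [-0.8, -0.5, 0.1]` the toy classifier `clipped_identity` has empirical hinge type I error (0.2 + 0.5 + 1.1) / 3 = 0.6, so it belongs to the empirical constraint set only for levels of at least 0.6.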
We are now in a position to construct a classifier in according to the Neyman-Pearson paradigm. For any such that , define the convexified NP classifier as any classifier that solves the following optimization problem
Note that this problem consists of minimizing a convex function subject to a convex constraint and can therefore be solved by standard algorithms (see, e.g., Boyd and Vandenberghe, 2004, and references therein).
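To make the structure of this optimization problem concrete, here is a minimal sketch for the special case of two base classifiers, where the simplex reduces to a single weight theta in [0, 1]. A brute-force grid search stands in for a proper convex solver; it is not the algorithm of the paper. The sign convention (class -1 as the null), the hinge surrogate and the name `np_grid_search` are assumptions of the example.

```python
def hinge(x):
    """Hinge surrogate (1 + x)_+ (assumed form)."""
    return max(0.0, 1.0 + x)


def np_grid_search(h1, h2, negatives, positives, level, phi=hinge,
                   grid_size=1001):
    """Search theta in [0, 1] for h = theta * h1 + (1 - theta) * h2.

    Minimizes the empirical phi-type II error subject to the empirical
    phi-type I error being at most `level`. Returns the tuple
    (theta, type_one, type_two), or None if no grid point is feasible.
    """
    best = None
    for k in range(grid_size):
        theta = k / (grid_size - 1)

        def h(x, t=theta):
            return t * h1(x) + (1.0 - t) * h2(x)

        type_one = sum(phi(h(x)) for x in negatives) / len(negatives)
        if type_one > level:
            continue  # violates the empirical constraint
        type_two = sum(phi(-h(x)) for x in positives) / len(positives)
        if best is None or type_two < best[2]:
            best = (theta, type_one, type_two)
    return best
```

As a toy check, take `h1` constantly -1 and `h2` constantly +1. The combination equals 1 - 2 * theta, its hinge type I error is 2 - 2 * theta and its hinge type II error is 2 * theta, so with `level = 0.5` the search returns theta = 0.75: the constraint is active at the optimum, and tightening the level directly raises the achievable type II error.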
In the next section, we present a series of results on type I and type II errors of classifiers that are more general than .
4 Performance Bounds
4.1 Control of type I error
The first challenge is to identify classifiers such that with high probability. This is done by enforcing that its empirical counterpart be bounded from above by the quantity
for a proper choice of positive constant .
Fix constants and let be a given -Lipschitz convex surrogate. Define
Then for any (random) classifier that satisfies , we have
with probability at least . Equivalently
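The effect of strengthening the empirical constraint can be explored numerically. The exact tolerance of the theorem is not legible in this text, so the sketch below assumes a generic form, constant * L * sqrt(log(2M / delta) / n), where n is the size of the sample used for the type I constraint, M the number of base classifiers, L the Lipschitz constant and delta the confidence parameter. This form and the placeholder `constant` are hypothetical; they only mimic the usual rate of a uniform deviation bound over a finite dictionary.

```python
import math


def tolerance(n, num_base, delta, lipschitz=1.0, constant=1.0):
    """Hypothetical tolerance constant * L * sqrt(log(2M / delta) / n)."""
    if n <= 0 or num_base <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need n > 0, num_base > 0 and delta in (0, 1)")
    return constant * lipschitz * math.sqrt(math.log(2.0 * num_base / delta) / n)


def strengthened_level(alpha, n, num_base, delta, lipschitz=1.0, constant=1.0):
    """Level alpha minus the tolerance; raises if nothing is left."""
    level = alpha - tolerance(n, num_base, delta, lipschitz, constant)
    if level <= 0.0:
        raise ValueError("sample too small for this alpha and delta")
    return level
```

Under this assumed form the tolerance shrinks like the inverse square root of the sample size and grows only logarithmically with the number of base classifiers, which is consistent with the remark that the dictionary is allowed to be large.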
4.2 Simultaneous control of the two errors
Theorem 4.1 guarantees that any classifier that satisfies the strengthened constraint on the empirical -type I error will have -type I error and true type I error bounded from above by . We now check that the constraint is not too strong, so that the type II error is not overly deteriorated. Indeed, an extremely small would certainly ensure a good control of type I error but would deteriorate significantly the best achievable type II error. Below, we show not only that this is not the case for our approach but also that the convexified NP classifier defined in subsection 3.2 with suffers only a small degradation of its type II error compared to the best achievable. Analogous to classical binary classification, a desirable result is that with high probability,
where goes to as .
The following proposition is pivotal to our argument.
Fix constant and let be a given continuous convex surrogate. Assume further that there exists such that the set of classifiers is nonempty. Then, for any ,
This proposition ensures that if the convex surrogate is continuous, strengthening the constraint on type I error does not deteriorate too much the optimal type II error. We should mention that the proof does not use the Lipschitz property of , but only that it is uniformly bounded by on . This proposition has direct consequences on chance constrained programming as discussed in Section 5.
The next theorem shows that the NP classifier defined in subsection 3.2 is a good candidate to perform classification with the Neyman-Pearson paradigm. It relies on the following assumption which is necessary to verify the condition of Proposition 4.1.
There exists a positive constant such that the set of classifiers is nonempty.
Note that this assumption can be tested using (4.1) for large enough . Indeed, it follows from this inequality that with probability ,
Thus, it is sufficient to check if is nonempty for some . Before stating our main theorem, we need the following definition. Under Assumption 1, let denote the smallest such that and let be the smallest integer such that
In particular, as , and all go to infinity and other quantities are held fixed, (4.5) yields
Note here that Theorem 4.2 is not exactly of the type (4.2). The right hand side of (4.5) goes to zero if both and go to infinity. Moreover, inequality (4.5) conveys the message that the accuracy of the estimate depends on information from both classes of labeled data. This concern motivates us to consider a different sampling scheme.
4.3 A Different Sampling Scheme
Let be independent copies of the random couple . Denote by the marginal distribution of and by the regression function of onto . Denote by the probability of positive label and observe that
In what follows, we assume that so that .
Let be the random number of instances labeled and . In this setup, the NP classifier is defined as in subsection 3.2 where and are replaced by and respectively. To distinguish this classifier from the one previously defined, we denote the NP classifier obtained with this sampling scheme by .
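Under this sampling scheme the two class sample sizes are no longer fixed in advance: they are obtained by splitting one i.i.d. sample according to the labels. The sketch below illustrates this with a toy generative model chosen for the example (label +1 with probability `p_positive`, covariate drawn from a unit-variance normal centered at the label); the model and the function names are assumptions, not part of the paper.

```python
import random


def draw_sample(n, p_positive, rng):
    """Draw n i.i.d. pairs (x, y) from a toy model.

    y = +1 with probability p_positive, else -1; x ~ N(y, 1).
    """
    sample = []
    for _ in range(n):
        y = 1 if rng.random() < p_positive else -1
        sample.append((rng.gauss(float(y), 1.0), y))
    return sample


def split_by_label(sample):
    """Split labeled pairs into the class -1 and class +1 covariates."""
    negatives = [x for x, y in sample if y == -1]
    positives = [x for x, y in sample if y == 1]
    return negatives, positives


rng = random.Random(0)
negatives, positives = split_by_label(draw_sample(200, 0.3, rng))
# The two counts are random but always add up to the total sample size.
n_negative, n_positive = len(negatives), len(positives)
```

The count of class -1 instances is a binomial random variable, so any bound stated in terms of fixed sample sizes has to be combined with a concentration argument for this count before it applies here.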
Let the event be defined by
Denote . Although the event is different from the event , symmetry leads to the following key observation:
Therefore, under the conditions of Theorem 4.2, we find that for the event satisfies
We obtain the following corollary of Theorem 4.2.
5 Chance constrained optimization
Implementing the Neyman-Pearson paradigm for the convexified binary classification bears strong connections with chance constrained optimization. A recent account of such problems can be found in Ben-Tal et al. (2009, Chapter 2) and we refer to this book for references and applications. A chance constrained optimization problem is of the following form:
where is a random vector, is convex, is a small positive number and is a deterministic real valued convex function. Problem (5.1) can be viewed as a relaxation of robust optimization. Indeed, for the latter, the goal is to solve the problem
and this essentially corresponds to (5.1) for the case . For simplicity, we take to be scalar valued but extensions to vector valued functions and conic orders have been considered (see, e.g., Ben-Tal et al., 2009, Chapter 10). Moreover, it is standard to assume that is convex almost surely.
Problem (5.1) may not be convex because the chance constraint is not convex in general and thus may not be tractable. To solve this problem, Prékopa (1995) and Lagoa et al. (2005) have derived sufficient conditions on the distribution of for the chance constraint to be convex. On the other hand, Calafiore and Campi (2006) initiated a different treatment of the problem where no assumption on the distribution of is made, in line with the spirit of statistical learning. In that paper, they introduced the so-called scenario approach based on a sample of independent copies of . The scenario approach consists of solving
Calafiore and Campi (2006) showed that under certain conditions, if the sample size is bigger than some , then with probability , the optimal solution of (5.3) is feasible for (5.1). The authors did not address the control of the term where denotes the optimal objective value in (5.1). However, in view of Proposition 3.1, it is very unlikely that this term can be controlled well.
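A one-dimensional toy problem, constructed for this illustration and not taken from the cited works, shows both sides of the scenario approach. Suppose we maximize x over [0, 1] subject to x <= xi with probability at least 1 - epsilon, where xi is uniform on (0, 1). The chance constrained optimum is x = epsilon, while the scenario solution is the smallest sampled value. All names below are hypothetical.

```python
import random


def scenario_solution(scenarios):
    """Maximize x subject to x <= xi for every sampled scenario xi."""
    return min(scenarios)


def violation_probability_uniform(x):
    """P(x > xi) for xi ~ Uniform(0, 1), clipped to [0, 1]."""
    return min(1.0, max(0.0, x))


def feasibility_probability(n, epsilon):
    """Probability that the scenario solution satisfies the chance constraint.

    The solution min(xi_1, ..., xi_n) is feasible iff it is at most epsilon,
    which fails only when all n draws exceed epsilon.
    """
    return 1.0 - (1.0 - epsilon) ** n


rng = random.Random(0)
draws = [rng.random() for _ in range(100)]
x_scenario = scenario_solution(draws)
```

With 100 scenarios and epsilon = 0.05, the scenario solution is feasible with probability above 0.99, but its expected value 1 / 101 is well below the optimum 0.05, and the gap widens as more scenarios are added. In this toy case, feasibility is bought at the price of the objective value.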
In an attempt to overcome this limitation, a new analytical approach was introduced by Nemirovski and Shapiro (2006). It amounts to solving the following convex optimization problem
in which is some additional instrumental variable and where is convex. The problem (5.4) provides a conservative convex approximation to (5.1), in the sense that every feasible for (5.4) is also feasible for (5.1). Nemirovski and Shapiro (2006) considered a particular class of conservative convex approximation where the key step is to replace by in (5.1), where a nonnegative, nondecreasing, convex function that takes value at . Nemirovski and Shapiro (2006) discussed several choices of including hinge and exponential losses, with a focus on the latter that they name Bernstein Approximation.
The idea of a conservative convex approximation is also what we employ in our paper. Recall that the conditional distribution of given . In a parallel form of (5.1), we cast our target problem as
where is the flat simplex of .
However, there are two important differences in our setting, so that we cannot use directly Scenario Approach or Bernstein Approximation or other analytical approaches to (5.1). First, is an unknown function of . Second, we assume minimum knowledge about . On the other hand, chance constrained optimization techniques in previous literature assume knowledge about the distribution of the random vector . For example, Nemirovski and Shapiro (2006) require that the moment generating function of the random vector is efficiently computable to study the Bernstein Approximation.
Given a finite sample, it is not feasible to construct a strictly conservative approximation to the constraint in (5.6). Instead, what is possible is to ensure that if we learned from the sample, this constraint is satisfied with high probability , i.e., the classifier is approximately feasible for (5.6). In retrospect, our approach to (5.6) is an innovative hybrid between the analytical approach based on convex surrogates and the scenario approach.
We do have structural assumptions on the problem. Let be arbitrary functions that take values in and . Consider a convexified version of (5.1):
where is a -Lipschitz convex surrogate, . Suppose that we observe a sample that are independent copies of . We propose to approximately solve the above problem by
for some to be defined. Denote by any solution to this problem and by the value of the objective at the optimum in (5.7). The following theorem summarizes our contribution to chance constrained optimization.
Fix constants and let be a given -Lipschitz convex surrogate. Define
Then, the following hold with probability at least
is feasible for (5.1).
If there exists such that the constraint is feasible for some , then for
In particular, as and go to infinity with all other quantities kept fixed, we obtain
The proof essentially follows that of Theorem 4.2 and we omit it. The limitations of Theorem 5.1 include rigid structural assumptions on the function and on the set . While the latter can be easily relaxed using more sophisticated empirical process theory, the former is inherent to our analysis. Also, we did not address the effect of replacing the indicator function by a convex surrogate; this investigation is beyond the scope of this paper.
6.1 Proof of Proposition 3.1
Let the base classifiers be defined as
For any , denote the convex combination of and by , i.e.,
Suppose the conditional distributions of given or , denoted respectively by and , are both uniform on . Recall that and . Then, we have
Therefore, for any , we have
Observe now that
For any , it yields
This completes the first part of the proposition. Moreover, in the same manner as (6.1), it can be easily proved that
It remains to show that with positive probability for any classifier such that for some . Note that a sufficient condition for a classifier to satisfy this constraint is to have . It is therefore sufficient to find a lower bound on the probability of the event . Such a lower bound is provided by Lemma 6.4, which guarantees that .
6.2 Proof of Theorem 4.1
We begin with the following lemma, which is extensively used in the sequel. Its proof relies on standard arguments to bound suprema of empirical processes. Recall that is a family of classifiers such that and that for any in the simplex , denotes the convex combination defined by
The following standard notation in empirical process theory will be used. Let be i.i.d. random variables with marginal distribution . Then for any measurable function , we write
Moreover, the Rademacher average of is defined as
where are i.i.d. Rademacher random variables such that for .
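For a finite dictionary the Rademacher average can be computed directly. The sketch below assumes the common normalization sup over f of |(1/n) * sum_i eps_i f(X_i)|; the absolute value and the 1/n factor are assumptions, since the displayed definition is not legible in this text. It relies on the fact that a convex function of the weights attains its maximum over the simplex at a vertex, so the supremum over convex combinations equals the maximum over the base functions.

```python
import random


def rademacher_average(function_values, signs):
    """max_j |(1/n) sum_i signs[i] * function_values[j][i]|.

    function_values[j][i] holds f_j(X_i) for base function j and point i.
    """
    n = len(signs)
    return max(
        abs(sum(s * v for s, v in zip(signs, row))) / n
        for row in function_values
    )


def expected_rademacher_average(function_values, num_draws, rng):
    """Monte Carlo estimate of the expectation over the random signs."""
    n = len(function_values[0])
    total = 0.0
    for _ in range(num_draws):
        signs = [1 if rng.random() < 0.5 else -1 for _ in range(n)]
        total += rademacher_average(function_values, signs)
    return total / num_draws
```

For functions with values in [-1, 1], averaging over many sign draws gives an estimate in [0, 1] that decreases as the sample grows, which is the quantity the symmetrization argument controls.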
Fix . Let be i.i.d. random variables on with marginal distribution . Moreover, let be an -Lipschitz function. Then, with probability at least , it holds
Proof. Define , so that is an -Lipschitz function that satisfies . Moreover, for any , it holds
Let be a given convex increasing function. Applying successively the symmetrization and the contraction inequalities (see, e.g., Koltchinskii, 2008, Section 2), we find