Supervised classification via minimax probabilistic transformations

Santiago Mazuelas
BCAM - Basque Center for Applied Mathematics
Bilbao, Spain
smazuelas@bcamath.org

Andrea Zanoni
École Polytechnique Fédérale de Lausanne
Lausanne, Switzerland
andrea.zanoni@epfl.ch

Aritz Pérez
BCAM - Basque Center for Applied Mathematics
Bilbao, Spain
aperez@bcamath.org
Abstract

Conventional techniques for supervised classification constrain the classification rules considered and use surrogate losses in place of the classification 0-1 loss. Favored families of classification rules are those that enjoy parametric representations suitable for surrogate loss minimization, and low complexity properties suitable for overfitting control. This paper presents classification techniques based on robust risk minimization (RRM) that we call linear probabilistic classifiers (LPCs). The proposed techniques consider unconstrained classification rules, optimize the classification 0-1 loss, and provide performance bounds during learning. LPCs enable efficient learning by using linear optimization, and avoid overfitting by using RRM over polyhedral uncertainty sets of distributions. We also provide finite-sample generalization bounds for LPCs and show their competitive performance with state-of-the-art techniques using benchmark datasets.

1 Introduction

Supervised classification uses training data to find a classification rule with small risk (out-of-sample error). Risk minimization cannot be addressed directly in practice since the probability distribution of feature-label pairs is unknown. Therefore, learning techniques for supervised classification obtain classification rules by addressing a surrogate for risk minimization. The most common surrogate is empirical risk minimization (ERM), which is based on minimizing the loss achieved on the training examples. Such an approach may suffer from overfitting, usually addressed by constraining the classification rules considered to have reduced complexity [1, 2]. Another surrogate for risk minimization is robust risk minimization (RRM), which is based on minimizing the worst-case risk against a set of probability distributions consistent with the training data. Such an approach avoids overfitting as long as the probability distribution of feature-label pairs belongs to the uncertainty set considered, but it requires solving a minimax optimization problem [3, 4, 5, 6, 7, 8, 9, 10, 11].

Conventional learning techniques for supervised classification constrain the classification rules considered and use surrogate losses. Favored families of classification rules are those that enjoy parametric representations suitable for surrogate loss minimization, and low complexity properties suitable for overfitting control. Techniques based on regularization in reproducing kernel Hilbert spaces (RKHSs) [1], such as support vector machines and kernel logistic regression, consider classification rules obtained from functions with reduced norm in an RKHS, with different design choices such as kernel and regularization parameters. Techniques based on artificial neural networks [12] consider classification rules with a hierarchical structure, with different design choices such as network architecture and activation functions. Techniques based on ensemble learning [13], such as AdaBoost and random forests, consider classification rules obtained by combinations of weak rules, with different design choices such as the type of weak rules and the aggregation method. In addition, conventional techniques enable tractable optimization of the classification rule's parameters by using a surrogate loss (e.g., hinge, logistic, cross-entropy, or exponential) instead of the original target given by the classification 0-1 loss.

Main contributions

This paper presents techniques for supervised classification based on RRM that we call linear probabilistic classifiers (LPCs). The proposed techniques consider unconstrained classification rules, optimize the classification 0-1 loss, and provide performance bounds during learning. Current techniques based on RRM utilize uncertainty sets of distributions similar to the empirical distribution in terms of several metrics such as moments and marginals fits [3, 4, 5, 6], Wasserstein distances [7, 8], and f-divergences [9, 10]. The proposed LPCs utilize uncertainty sets of distributions given by constraining the expectations of a chosen function that we call the generating function. Such distributions are similar in terms of the probability metric given by the generating function [14]. Most RRM methods enable efficient minimax optimization by using parametric families of classification rules and surrogate losses. Techniques based on Wasserstein distances use linear functions or RKHSs and surrogate log losses [7, 8], while techniques based on f-divergences can use more general parametric families of classification rules and surrogate losses as long as they result in convex losses [9, 10]. Like the proposed LPCs, the techniques in [6] consider unconstrained classification rules exploiting Lagrange duality. That work uses uncertainty sets defined by equality constraints, and its learning stage is enabled by approximate optimization with a stochastic gradient descent algorithm. On the other hand, LPCs consider uncertainty sets that contain the actual distribution with a tunable confidence, and LPC learning is enabled by the reformulation of the minimax problem as a linear program.

More detailed comparisons with related techniques are provided in the remarks to the paper’s main new results, organized as follows:

  • Learning techniques that determine LPCs as the solution of a linear optimization problem (Theorem 1 in Section 2).

  • Techniques that obtain upper and lower bounds for the expected loss of general classification rules (Theorem 1 and Proposition 1 in Section 2).

  • Finite-sample generalization bounds for the risk of LPCs in terms of training size and parameters describing the complexity of the generating function (Theorem 2 in Section 3).

In addition, Section 4 describes efficient implementations for LPCs and proposes a simple generating function, and Section 5 shows the suitability of the presented performance bounds and compares the classification error of LPCs with that of state-of-the-art techniques.

Notation: calligraphic upper case letters denote sets; real-valued functions and vector-valued functions are denoted by lower and upper case letters, respectively; vectors and matrices are denoted by bold lower and upper case letters, respectively; $\mathbf{v}^{\mathrm{T}}$, $(\mathbf{v})_+$, and $\|\mathbf{v}\|$ denote the transpose, positive part, and mixed norm of a vector $\mathbf{v}$, respectively; $\mathbb{E}_p$ denotes expectation with respect to probability distribution $p$; $\preceq$ and $\succeq$ denote vector (component-wise) inequalities; $\mathbf{1}$ denotes a vector with all components equal to $1$; and $|\mathcal{Z}|$ denotes the cardinality of a set $\mathcal{Z}$. We represent real-valued and vector-valued functions with finite domains by vectors and matrices, respectively; specifically, we represent a function $q$ with finite domain $\mathcal{Z}$ by the vector $\mathbf{q}$ with components $q(z)$ for $z \in \mathcal{Z}$, and a vector function $Q$ by the matrix $\mathbf{Q}$ with columns given by $Q(z)$ for $z \in \mathcal{Z}$. In addition, if $q$ is a function with domain $\mathcal{X} \times \mathcal{Y}$, for each $x \in \mathcal{X}$ we denote by $q_x$ the function with domain $\mathcal{Y}$ given by $q_x(y) = q(x,y)$. Finally, we denote by $\Delta(\mathcal{Z})$ the set of probability distributions with support $\mathcal{Z}$, and represent each $p \in \Delta(\mathcal{Z})$ for a finite set $\mathcal{Z}$ by its probability mass function, i.e., the vector $\mathbf{p}$ with $\mathbf{p} \succeq \mathbf{0}$ and $\mathbf{1}^{\mathrm{T}}\mathbf{p} = 1$.

2 Minimax classification over polyhedral uncertainty sets

This section first briefly describes the problem statement for supervised classification, and then presents techniques to learn LPCs and to bound expected losses. In what follows, features and labels are elements of sets $\mathcal{X}$ and $\mathcal{Y}$, respectively. We assume that both sets are finite; commonly the cardinality of $\mathcal{X}$ is very large while that of $\mathcal{Y}$ is very small. Such a finiteness assumption does not lose any generality in practice, at least when using digital computers.

A deterministic classification rule is a function from $\mathcal{X}$ to $\mathcal{Y}$. In this paper we also consider classification rules that are allowed to classify each feature randomly, so that a general classification rule is given by a probabilistic transformation, also known as a Markov transition or channel [15]. We denote by $\mathrm{T}(\mathcal{X},\mathcal{Y})$ the set of probabilistic transformations from $\mathcal{X}$ to $\mathcal{Y}$, that is, functions from $\mathcal{X}$ to $\Delta(\mathcal{Y})$. In what follows we represent each $h \in \mathrm{T}(\mathcal{X},\mathcal{Y})$ by a Markov kernel function $h(y|x)$ that is a probability mass function in $y \in \mathcal{Y}$ for any $x \in \mathcal{X}$ (i.e., $h(y|x) \ge 0$ and $\sum_{y \in \mathcal{Y}} h(y|x) = 1$). A classification rule $h$ classifies each feature $x$ as label $y$ with probability $h(y|x)$. In particular, deterministic classification rules correspond to $h$ that takes only values $0$ and $1$.

The classification 0-1 loss (called just loss in the following) of a classification rule $h$ at $(x,y)$ is $0$ if it classifies $x$ with $y$, and is $1$ otherwise. Hence, the expected loss of classification rule $h$ with respect to a probability distribution $p \in \Delta(\mathcal{X} \times \mathcal{Y})$ is

$$\ell(h,p) = \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p(x,y)\,\bigl(1 - h(y|x)\bigr).$$

The risk of a classification rule $h$ (denoted $R(h)$) is its expected loss with respect to the actual distribution $p^* \in \Delta(\mathcal{X} \times \mathcal{Y})$ of feature-label pairs, that is,

$$R(h) = \ell(h, p^*).$$

The minimum risk is known as the Bayes risk and becomes

$$R_{\mathrm{Bayes}} = 1 - \sum_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} p^*(x,y)$$

since it is achieved by the Bayes rule that classifies each $x$ with a label attaining the maximum of $p^*(\cdot|x)$.

The goal of supervised classification is to determine a classification rule with reduced risk by using a set of training samples. The ERM approach is based on minimizing the empirical risk $\ell(h, p_n)$, where $p_n$ is the empirical distribution of the training samples [1, 2]. The RRM approach is based on minimizing the maximum (worst-case) risk over probability distributions in an uncertainty set obtained from the training samples [3, 4, 5, 6, 7, 8, 9, 10, 11]. The following shows how uncertainty sets defined by linear inequalities enable efficient RRM without constraining the set of classification rules.

Given vectors $\mathbf{a}, \mathbf{b} \in \mathbb{R}^m$ with $\mathbf{a} \preceq \mathbf{b}$ and a vector function $\Phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^m$, we denote by $\mathcal{U}^{\mathbf{a},\mathbf{b}}$ the set

$$\mathcal{U}^{\mathbf{a},\mathbf{b}} = \bigl\{ p \in \Delta(\mathcal{X} \times \mathcal{Y}) :\ \mathbf{a} \preceq \mathbb{E}_p\{\Phi\} \preceq \mathbf{b} \bigr\}.$$

In addition, we call the function $\Phi$ the generating function, and the vectors $\mathbf{a}$ and $\mathbf{b}$ the lower and upper endpoints of the expectation interval estimates. The minimax expected loss against uncertainty set $\mathcal{U}^{\mathbf{a},\mathbf{b}}$ is

$$R^{\mathbf{a},\mathbf{b}} = \min_{h \in \mathrm{T}(\mathcal{X},\mathcal{Y})}\ \max_{p \in \mathcal{U}^{\mathbf{a},\mathbf{b}}} \ell(h,p) = \max_{p \in \mathcal{U}^{\mathbf{a},\mathbf{b}}}\ \min_{h \in \mathrm{T}(\mathcal{X},\mathcal{Y})} \ell(h,p) \qquad (1)$$

where the second equality is obtained since the minimax coincides with the maximin because $\mathrm{T}(\mathcal{X},\mathcal{Y})$ and $\mathcal{U}^{\mathbf{a},\mathbf{b}}$ are closed convex sets [16]. In the following, whenever we use an expectation point estimate, i.e., $\mathbf{a} = \mathbf{b}$, we drop one of the vectors from the superscripts; for instance, we denote $\mathcal{U}^{\mathbf{a},\mathbf{a}}$ as $\mathcal{U}^{\mathbf{a}}$.

Uncertainty sets $\mathcal{U}^{\mathbf{a},\mathbf{b}}$ are polyhedra in $\Delta(\mathcal{X} \times \mathcal{Y})$ defined by affine inequality constraints, since $\mathbb{E}_p\{\Phi\} = \mathbf{\Phi}\mathbf{p}$ for the matrix $\mathbf{\Phi}$ representing $\Phi$. They contain probability distributions that are similar in terms of the generating function's expectations; for instance, two distributions are in the same uncertainty set $\mathcal{U}^{\mathbf{a},\mathbf{b}}$ for some $\mathbf{a} \preceq \mathbf{b}$ if their distance is zero for the integral probability semi-metric generated by $\Phi$ [14].

The following result determines minimax classification rules against the above uncertainty sets as well as the corresponding minimax expected loss.

Theorem 1.

Let $\mathbf{a}, \mathbf{b} \in \mathbb{R}^m$ with $\mathbf{a} \preceq \mathbf{b}$. If a classification rule $h^{\mathbf{a},\mathbf{b}} \in \mathrm{T}(\mathcal{X},\mathcal{Y})$ satisfies

$$h^{\mathbf{a},\mathbf{b}}(y|x) \ \ge\ 1 - \Phi(x,y)^{\mathrm{T}}(\boldsymbol{\mu}_b^* - \boldsymbol{\mu}_a^*) - \nu^* \quad \text{for all } x \in \mathcal{X},\ y \in \mathcal{Y} \qquad (2)$$

for $(\boldsymbol{\mu}_a^*, \boldsymbol{\mu}_b^*, \nu^*)$ a solution of the optimization problem

$$\begin{aligned} \min_{\boldsymbol{\mu}_a,\, \boldsymbol{\mu}_b,\, \nu}\ & \mathbf{b}^{\mathrm{T}}\boldsymbol{\mu}_b - \mathbf{a}^{\mathrm{T}}\boldsymbol{\mu}_a + \nu \\ \text{s.t.}\ & \sum_{y \in \mathcal{Y}} \bigl(1 - \Phi(x,y)^{\mathrm{T}}(\boldsymbol{\mu}_b - \boldsymbol{\mu}_a) - \nu\bigr)_+ \le 1 \quad \forall\, x \in \mathcal{X} \\ & \boldsymbol{\mu}_a, \boldsymbol{\mu}_b \succeq \mathbf{0} \end{aligned} \qquad (3)$$

then $h^{\mathbf{a},\mathbf{b}}$ is a minimax classification rule against uncertainty set $\mathcal{U}^{\mathbf{a},\mathbf{b}}$, that is,

$$\max_{p \in \mathcal{U}^{\mathbf{a},\mathbf{b}}} \ell(h^{\mathbf{a},\mathbf{b}}, p) \ =\ \min_{h \in \mathrm{T}(\mathcal{X},\mathcal{Y})}\ \max_{p \in \mathcal{U}^{\mathbf{a},\mathbf{b}}} \ell(h,p).$$

In addition, the minimax expected loss against uncertainty set $\mathcal{U}^{\mathbf{a},\mathbf{b}}$ is given by

$$R^{\mathbf{a},\mathbf{b}} = \mathbf{b}^{\mathrm{T}}\boldsymbol{\mu}_b^* - \mathbf{a}^{\mathrm{T}}\boldsymbol{\mu}_a^* + \nu^*. \qquad (4)$$
Proof.

See Appendix A.2. ∎

Classification rules satisfying (2) always exist since $\sum_{y \in \mathcal{Y}} \bigl(1 - \Phi(x,y)^{\mathrm{T}}(\boldsymbol{\mu}_b^* - \boldsymbol{\mu}_a^*) - \nu^*\bigr)_+ \le 1$ for any $x \in \mathcal{X}$, due to the constraints in (3). In addition, a classification rule satisfying (2) can be directly obtained from a solution of (3) as

$$h^{\mathbf{a},\mathbf{b}}(y|x) = \bigl(1 - \Phi(x,y)^{\mathrm{T}}(\boldsymbol{\mu}_b^* - \boldsymbol{\mu}_a^*) - \nu^*\bigr)_+ + \frac{1}{|\mathcal{Y}|}\Bigl(1 - \sum_{y' \in \mathcal{Y}} \bigl(1 - \Phi(x,y')^{\mathrm{T}}(\boldsymbol{\mu}_b^* - \boldsymbol{\mu}_a^*) - \nu^*\bigr)_+\Bigr) \qquad (5)$$

for each $x \in \mathcal{X}$, $y \in \mathcal{Y}$. In what follows, we refer to such classification rules as LPCs for generating function $\Phi$, that is, classification rules $h^{\mathbf{a},\mathbf{b}}$ for $\mathbf{a} \preceq \mathbf{b}$ given by (5) for $(\boldsymbol{\mu}_a^*, \boldsymbol{\mu}_b^*, \nu^*)$ a solution of (3).

The learning process of an LPC consists of solving the convex optimization problem (3). The inputs of such a learning process are expectation interval estimates given by $\mathbf{a} \preceq \mathbf{b}$ or expectation point estimates given by $\mathbf{a} = \mathbf{b}$. Such estimates can be obtained by averaging the values that the generating function takes over the training samples. Then, the prediction process with an LPC for a specific $x$ consists of randomly sampling a label with probability given by (5) using the $(\boldsymbol{\mu}_a^*, \boldsymbol{\mu}_b^*, \nu^*)$ obtained during learning (a code sketch is given after (6) below).

Optimization problem (3) is equivalent to a linear optimization problem with at most $2^{|\mathcal{Y}|}|\mathcal{X}|$ constraints. Specifically, (3) is equivalent to

$$\begin{aligned} \min_{\boldsymbol{\mu}_a,\, \boldsymbol{\mu}_b,\, \nu}\ & \mathbf{b}^{\mathrm{T}}\boldsymbol{\mu}_b - \mathbf{a}^{\mathrm{T}}\boldsymbol{\mu}_a + \nu \\ \text{s.t.}\ & \sum_{y \in \mathcal{C}} \bigl(1 - \Phi(x,y)^{\mathrm{T}}(\boldsymbol{\mu}_b - \boldsymbol{\mu}_a) - \nu\bigr) \le 1 \quad \forall\, x \in \mathcal{X},\ \mathcal{C} \subseteq \mathcal{Y} \\ & \boldsymbol{\mu}_a, \boldsymbol{\mu}_b \succeq \mathbf{0} \end{aligned}$$

because for any vector $\mathbf{u}$

$$\sum_{y \in \mathcal{Y}} (u_y)_+ = \max_{\mathcal{C} \subseteq \mathcal{Y}}\ \sum_{y \in \mathcal{C}} u_y$$

and hence the constraint $\sum_{y \in \mathcal{Y}} (u_y)_+ \le 1$ holds if and only if $\sum_{y \in \mathcal{C}} u_y \le 1$ for every subset $\mathcal{C} \subseteq \mathcal{Y}$.
In case of using expectation point estimates, i.e., $\mathbf{a} = \mathbf{b}$, we can take the variables in (3) to be $\boldsymbol{\mu} = \boldsymbol{\mu}_b - \boldsymbol{\mu}_a$ and $\nu$. In that case, (3) is equivalent to

$$\min_{\boldsymbol{\mu},\, \nu}\ \mathbf{a}^{\mathrm{T}}\boldsymbol{\mu} + \nu \quad \text{s.t.}\ \sum_{y \in \mathcal{C}} \bigl(1 - \Phi(x,y)^{\mathrm{T}}\boldsymbol{\mu} - \nu\bigr) \le 1 \quad \forall\, x \in \mathcal{X},\ \mathcal{C} \subseteq \mathcal{Y} \qquad (6)$$

which is an optimization problem with fewer dimensions and fewer constraints than (3).
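To make the learning and prediction steps concrete, the following is a minimal Python sketch that solves (6) with its constraints enforced only at the features observed in training (the approximation described in Section 4), and that classifies through the form of rule (5). The function names, the use of scipy.optimize.linprog, and the uniform spreading of the residual probability mass are our own illustrative choices, not prescribed by the paper.

```python
# Sketch of LPC learning with expectation point estimates, i.e., problem (6)
# with the constraints enforced only at the training features (approximation
# ii) of Section 4). Assumes labels are integers 0, ..., n_labels - 1.
import itertools
import numpy as np
from scipy.optimize import linprog

def learn_lpc(phi, tau):
    """Solve (6): min_{mu, nu} tau^T mu + nu subject to, for every feature x
    and nonempty subset C of labels, sum_{y in C} (1 - Phi(x,y)^T mu - nu) <= 1.

    phi: array (n_x, n_labels, m) with entries Phi(x, y).
    tau: array (m,), point estimate of the expectation of Phi.
    Returns (mu, nu, R) with R the minimax expected loss (4).
    """
    n_x, n_labels, m = phi.shape
    c = np.concatenate([tau, [1.0]])  # objective: tau^T mu + nu
    rows, rhs = [], []
    subsets = itertools.chain.from_iterable(
        itertools.combinations(range(n_labels), k)
        for k in range(1, n_labels + 1))
    for ys in subsets:
        # sum_{y in C} (1 - Phi(x,y)^T mu - nu) <= 1
        # <=> -sum_{y in C} Phi(x,y)^T mu - |C| nu <= 1 - |C|
        rows.append(np.hstack([-phi[:, ys, :].sum(axis=1),
                               -len(ys) * np.ones((n_x, 1))]))
        rhs.append((1 - len(ys)) * np.ones(n_x))
    res = linprog(c, A_ub=np.vstack(rows), b_ub=np.concatenate(rhs),
                  bounds=[(None, None)] * (m + 1))  # mu and nu are free
    return res.x[:m], res.x[m], res.fun

def lpc_rule(phi_x, mu, nu):
    """Probabilities h(.|x) following the form of rule (5): positive part of
    1 - Phi(x,y)^T mu - nu, with the remaining mass spread uniformly.
    phi_x: array (n_labels, m) with rows Phi(x, y)."""
    s = np.maximum(1.0 - phi_x @ mu - nu, 0.0)
    if s.sum() > 1.0:  # can happen for features unseen in training
        return s / s.sum()
    return s + (1.0 - s.sum()) / len(s)
```

In this sketch, the optimal value returned by the solver coincides with the minimax expected loss (4), which is also the upper performance bound used in Section 5.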

The learning process in [6] determines approximately minimax classification rules by addressing an optimization such as that in (1) for the case $\mathbf{a} = \mathbf{b}$ using a stochastic gradient descent algorithm. Such an approach is enabled by using the training samples' empirical distribution as a surrogate for the features' marginal of the distributions in the uncertainty set. The proposed learning process for LPCs described in Theorem 1 does not rely on approximations and finds minimax classification rules by using linear optimization.

The following result shows that the usage of polyhedral uncertainty sets also makes it possible to obtain performance guarantees (bounds for expected losses) by solving two linear optimization problems.

Proposition 1.

Let

$$\underline{\ell}^{\mathbf{a},\mathbf{b}}(h) = \min_{p \in \mathcal{U}^{\mathbf{a},\mathbf{b}}} \ell(h,p), \qquad \overline{\ell}^{\mathbf{a},\mathbf{b}}(h) = \max_{p \in \mathcal{U}^{\mathbf{a},\mathbf{b}}} \ell(h,p) \qquad (7)$$

for a function $h \in \mathrm{T}(\mathcal{X},\mathcal{Y})$. Then, for any $p \in \mathcal{U}^{\mathbf{a},\mathbf{b}}$ and $h \in \mathrm{T}(\mathcal{X},\mathcal{Y})$

$$\underline{\ell}^{\mathbf{a},\mathbf{b}}(h) \ \le\ \ell(h,p) \ \le\ \overline{\ell}^{\mathbf{a},\mathbf{b}}(h). \qquad (8)$$

In addition, $\ell(h,p) = \underline{\ell}^{\mathbf{a},\mathbf{b}}(h)$ (resp. $\overline{\ell}^{\mathbf{a},\mathbf{b}}(h)$) if $p$ minimizes (resp. maximizes) the expected loss of $h$ over distributions in $\mathcal{U}^{\mathbf{a},\mathbf{b}}$.

Proof.

See Appendix A.3. ∎

For an LPC $h^{\mathbf{a},\mathbf{b}}$, the upper bound above is directly given by the learning phase, that is, $R^{\mathbf{a},\mathbf{b}}$ given in (4) equals $\overline{\ell}^{\mathbf{a},\mathbf{b}}(h^{\mathbf{a},\mathbf{b}})$. On the other hand, the lower bound for $h^{\mathbf{a},\mathbf{b}}$, denoted $\underline{\ell}^{\mathbf{a},\mathbf{b}}(h^{\mathbf{a},\mathbf{b}})$, requires solving an additional linear optimization problem.
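As an illustration, the lower bound in (7) can be computed through the linear programming dual of the minimization over $p$. The following sketch makes the choices explicit; the dual parametrization and the restriction of the constraints to the training features are our own, and may differ from the exact formulation used in the paper.

```python
# Sketch of the lower bound in (7): min_{p in U^{a,b}} l(h, p), computed via
# its linear programming dual with constraints enforced only at the training
# features (an approximation when X is larger than the observed feature set).
import numpy as np
from scipy.optimize import linprog

def lpc_lower_bound(phi, h, a, b):
    """phi: array (n_x, n_labels, m) with entries Phi(x, y);
    h: array (n_x, n_labels) with rows h(.|x); a, b: arrays (m,).
    Dual: max a^T la - b^T lb + nu
          s.t. Phi(x,y)^T (la - lb) + nu <= 1 - h(y|x), with la, lb >= 0."""
    n_x, n_labels, m = phi.shape
    c = np.concatenate([-a, b, [-1.0]])  # linprog minimizes, so negate
    A_ub = np.hstack([phi.reshape(-1, m), -phi.reshape(-1, m),
                      np.ones((n_x * n_labels, 1))])
    b_ub = (1.0 - h).reshape(-1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * m) + [(None, None)])
    return -res.fun
```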

Techniques based on f-divergences and Wasserstein distances in [9, 7, 8] obtain analogous upper and lower bounds for the corresponding uncertainty sets. Note that the bounds for expected losses become risk bounds if the actual distribution of feature-label pairs belongs to the uncertainty set. Such a case can be attained with a tunable confidence using uncertainty sets defined by Wasserstein distances as in [7, 8] or using the proposed LPCs with expectation confidence intervals. However, the bounds are only asymptotic risk bounds using uncertainty sets defined by f-divergences as in [9] or using the proposed LPCs with expectation point estimates.

3 Generalization bounds

In this section we develop finite-sample risk bounds for LPCs with respect to the smallest worst-case risk for generating function $\Phi$. If the actual distribution $p^*$ of feature-label pairs is contained in $\mathcal{U}^{\mathbf{a},\mathbf{b}}$, the minimax expected loss $R^{\mathbf{a},\mathbf{b}}$ is the worst-case risk of $h^{\mathbf{a},\mathbf{b}}$ since $R(h^{\mathbf{a},\mathbf{b}}) \le R^{\mathbf{a},\mathbf{b}}$, with equality if $p^*$ is a worst-case distribution in $\mathcal{U}^{\mathbf{a},\mathbf{b}}$.

The smallest worst-case risk of LPCs for generating function $\Phi$ is $R^{\boldsymbol{\tau}^*}$ with $\boldsymbol{\tau}^* = \mathbb{E}_{p^*}\{\Phi\}$ because

$$p^* \in \mathcal{U}^{\boldsymbol{\tau}^*} \subseteq \mathcal{U}^{\mathbf{a},\mathbf{b}} \ \Rightarrow\ R^{\boldsymbol{\tau}^*} \le R^{\mathbf{a},\mathbf{b}} \quad \text{for any } \mathbf{a} \preceq \boldsymbol{\tau}^* \preceq \mathbf{b}.$$

Such smallest worst-case risk corresponds to the LPC that would require an infinite amount of training samples to exactly determine the expectation of the generating function $\Phi$.

The following result bounds the excess risk of LPCs with respect to the smallest worst-case risk, as well as the difference between the risk of LPCs and the corresponding minimax expected loss.

Theorem 2.

Let $(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$ be independent samples following distribution $p^*$, $\Phi$ a generating function, $\delta \in (0,1)$, and

$$\mathbf{a}_n = \boldsymbol{\tau}_n - \boldsymbol{\varepsilon}_n, \qquad \mathbf{b}_n = \boldsymbol{\tau}_n + \boldsymbol{\varepsilon}_n \qquad (9)$$

with $\boldsymbol{\tau}_n = \frac{1}{n}\sum_{i=1}^{n} \Phi(x_i,y_i)$ the sample average of the generating function and $\boldsymbol{\varepsilon}_n$ a Hoeffding-type confidence width for its components.

We have that

  • With probability at least $1 - \delta$,

    (10)
    (11)
    (12)
    (13)

    where , and for

  • If (6) for $\boldsymbol{\tau}^*$ has a unique solution, then with probability

Proof.

See Appendix A.4. ∎

Inequality (13) and the third inequality in (10) bound the excess risk of LPCs with respect to the smallest worst-case risk $R^{\boldsymbol{\tau}^*}$; inequality (11) and the second inequality in (10) bound the difference between the risk of LPCs and the corresponding minimax expected loss; and inequality (12) and the first inequality in (10) bound the difference between the lower bound for the corresponding uncertainty set and the risk of LPCs. These bounds show differences that decrease with $n$ as $1/\sqrt{n}$, with proportionality constants that depend on the confidence $\delta$ and on other parameters describing the complexity of the generating function, such as its dimensionality $m$, the difference between its maximum and minimum values, and bounds for the solutions of (6) for vectors in the convex hull of the range of $\Phi$.

The vector $\boldsymbol{\varepsilon}_n$ above can result in over-pessimistic interval estimates $\mathbf{a}_n$ and $\mathbf{b}_n$ for the expectation of $\Phi$ since it is based on Hoeffding's inequality and the union bound [17] for the components of $\Phi$. In practice, LPCs can be developed by using tighter interval estimates for the expectation of $\Phi$. Such tighter intervals can be obtained, for instance, by using bootstrapping methods, the central limit theorem, or better estimates of the sub-Gaussian parameters.
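As one plausible instantiation of (9), the following sketch computes Hoeffding/union-bound interval estimates from evaluations of the generating function on the training samples; the exact constants in the paper's expression (9) may differ, so this should be read as an assumption-laden illustration rather than the paper's formula.

```python
# Sketch of Hoeffding/union-bound interval estimates in the spirit of (9).
# ranges[k] is (an upper bound on) the length of the interval of values taken
# by the k-th component of Phi; delta is the overall confidence parameter.
import numpy as np

def expectation_intervals(phi_train, ranges, delta):
    """phi_train: array (n, m) with rows Phi(x_i, y_i). Returns (a_n, b_n)
    such that a_n <= E{Phi} <= b_n componentwise w.p. at least 1 - delta."""
    n, m = phi_train.shape
    tau = phi_train.mean(axis=0)  # point estimate of E{Phi}
    # two-sided Hoeffding bound for each component, union bound over m
    eps = ranges * np.sqrt(np.log(2 * m / delta) / (2 * n))
    return tau - eps, tau + eps
```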

The generalization bounds for the excess risk provided in Theorem 3 of [5] and Theorems 2 and 3 of [11] for RRM with moments fits and Wasserstein distances, respectively, are analogous to those in inequality (13) and the third inequality in (10) above. In particular, they also show risk bounds with respect to the minimax risk corresponding to an infinite number of samples. The generalization bounds in Corollary 3.2 of [10] and Theorem 2 of [7] for RRM with f-divergences and Wasserstein distances, respectively, are analogous to those in inequality (11) and the second inequality in (10). In particular, they also show how the risk can be upper bounded (asymptotically in [10] and inequality (11), or with a certain confidence in [7] and the second inequality in (10)) by the corresponding finite-sample minimax expected loss.

4 Efficient implementation and choice of generating function

The learning stage of LPCs entails solving optimization problem (3) using expectation interval estimates $\mathbf{a}$ and $\mathbf{b}$, or optimization problem (6) using expectation point estimates $\mathbf{a} = \mathbf{b}$. Training samples are used to obtain such estimates for the expectations of $\Phi$, and can also be used to select the generating function $\Phi$ as described below. In the prediction stage, each $x$ is classified as $y$ with probability given by (5) using the generating function $\Phi$ and the $(\boldsymbol{\mu}_a^*, \boldsymbol{\mu}_b^*, \nu^*)$ obtained in the learning stage. The upper bound for the risk of the LPC is directly obtained from the learning stage as $R^{\mathbf{a},\mathbf{b}}$ given by (4), while the lower bound is obtained by solving an additional linear optimization problem for $\underline{\ell}^{\mathbf{a},\mathbf{b}}$ given by (7).

The main complexity of LPCs lies in the possibly large number of constraints in the optimization problem solved for learning. As described in Section 2, the optimization problems given in (3), (6), and (7) can have up to $2^{|\mathcal{Y}|}|\mathcal{X}|$ linear constraints, and $|\mathcal{X}|$ is usually large. Such complexity can be controlled by i) using a generating function that takes a reduced number of values, and ii) approximately solving the optimization problems enforcing only a subset of the constraints. Specifically, for i), if the generating function takes a reduced number of distinct values, repeated constraints can be removed, and the optimization problems in (3) and (6) for learning have a correspondingly reduced number of linear constraints; e.g., (3) is equivalent to the problem (14) that enforces one block of constraints per distinct value of $\Phi(x,\cdot)$.

For ii), if $\mathcal{X}_0$ is a subset of $\mathcal{X}$ (e.g., the features observed in training), then (3), (6), and (7) can be approximated by optimization problems with up to $2^{|\mathcal{Y}|}|\mathcal{X}_0|$ linear constraints; e.g., (3) can be approximated by the problem that enforces its constraints only for $x \in \mathcal{X}_0$. This is the approximation used in the earlier code sketches, and a sketch of reduction i) follows.
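The following is a minimal sketch of reduction i): features with identical values of $\Phi(x,\cdot)$ produce identical constraint blocks in (3) and (6), so duplicates can be dropped before building the linear program (the helper name is ours).

```python
# Sketch of complexity control i): features x with identical Phi(x, .) yield
# identical constraints in (3)/(6), so duplicated blocks can be removed.
import numpy as np

def unique_feature_blocks(phi):
    """phi: array (n_x, n_labels, m) -> deduplicated array (t, n_labels, m)."""
    flat = phi.reshape(phi.shape[0], -1)
    _, idx = np.unique(flat, axis=0, return_index=True)
    return phi[np.sort(idx)]
```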

The generating function plays an analogous role to that of predicates in [18], which represent the contribution of a so-called Intelligent Teacher to the training process. Such functions are also used in other methods for RRM [5, 6] and are usually obtained from certain moments of the features; as pointed out in [18], improved performance can be obtained with more elaborate functions, possibly defined algorithmically.

The generating function used by an LPC has to be highly discriminative for classification and, at the same time, simple enough to enable efficient learning (reduced dimensionality and range of values). An ideal generating function would be one given by the Bayes rule, since in that case the minimax expected loss coincides with the Bayes risk. Other generating functions that achieve the Bayes risk are those such that the Bayes rule is obtained from a function in the linear span of the components of $\Phi$.

In the numerical results of the next section we use a simple generating function given by $K$ classifiers $r_1, r_2, \ldots, r_K$, where for each $(x,y) \in \mathcal{X} \times \mathcal{Y}$, $\Phi(x,y)$ is a vector of size $|\mathcal{Y}|^{K+1}$ with zeros and a $1$ at the component corresponding to the $(K+1)$-tuple $(y, r_1(x), r_2(x), \ldots, r_K(x))$, that is,

$$\Phi(x,y) = \mathbf{e}_{I(y,\, r_1(x),\, \ldots,\, r_K(x))} \qquad (15)$$

where $I$ assigns the integer index of each tuple in $\mathcal{Y}^{K+1}$ for a chosen order such as lexicographic, and $\mathbf{e}_i$ denotes the $i$-th canonical basis vector. For this function, the matrices defining the constraints in (14) have one block of rows per distinct tuple of classifier predictions.

Note that such matrices are highly sparse since each row has only one non-zero component, so highly efficient optimization methods can be exploited for (3) and (6). On the other hand, for this type of generating function the components of $\Phi$ take values in $\{0,1\}$, the dimensionality becomes $|\mathcal{Y}|^{K+1}$, and the number of linear constraints in (14) becomes at most $2^{|\mathcal{Y}|}|\mathcal{Y}|^{K}$. Therefore, this type of generating function requires using a reduced number of classifiers $K$. In the next section we use $K = 3$, so the learning process entails solving linear optimization problems with up to $2^{|\mathcal{Y}|}|\mathcal{Y}|^{3}$ linear constraints and $|\mathcal{Y}|^{4}$ dimensions. In the next section, the expectations of the proposed generating function are estimated from training data using stratified $k$-fold cross-validation. Specifically, each validation sample provides an evaluation of the generating function, and the final estimate is obtained by averaging the estimates corresponding to each data partition.
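The construction in (15) and the cross-validated expectation estimate can be sketched as follows; the helper names, the use of nearest neighbor classifiers, and the specific neighbor counts are illustrative assumptions rather than the exact experimental configuration of the paper.

```python
# Sketch of the generating function (15) and of the cross-validated estimate
# of its expectation. Labels are assumed to be integers 0, ..., n_labels - 1;
# the K = 3 nearest neighbor classifiers and their neighbor counts are
# illustrative choices.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def make_phi(rules, n_labels):
    """Return Phi(x, y) one-hot encoding the tuple (y, r_1(x), ..., r_K(x))."""
    def phi(x, y):
        preds = [int(r.predict(x.reshape(1, -1))[0]) for r in rules]
        v = np.zeros(n_labels ** (len(rules) + 1))
        v[np.ravel_multi_index([y] + preds,
                               (n_labels,) * (len(rules) + 1))] = 1.0
        return v
    return phi

def estimate_tau(X, y, n_labels, k=10, n_neighbors=(3, 5, 7)):
    """Point estimate of E{Phi} by stratified k-fold cross-validation: each
    validation sample provides one evaluation of the generating function."""
    evals = []
    for train, val in StratifiedKFold(n_splits=k).split(X, y):
        rules = [KNeighborsClassifier(n).fit(X[train], y[train])
                 for n in n_neighbors]
        phi = make_phi(rules, n_labels)
        evals.extend(phi(X[i], y[i]) for i in val)
    return np.mean(evals, axis=0)
```

With these pieces, an LPC with point estimates can then be obtained as learn_lpc(phi_blocks, estimate_tau(X, y, n_labels)), where phi_blocks stacks the evaluations of $\Phi$ at the training features.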

5 Experimental results

In this section we show numerical results for LPCs using synthetic data and UCI datasets. The first set of results shows the suitability of the upper and lower bounds and for LPCs, while the second set of results compares the classification error of LPCs with respect to state-of-the-art techniques.

In the first set of experimental results, we use synthetic data for classification with multidimensional features and 3 classes. Specifically, the features for each class are obtained as random samples from a mixture of two Gaussians with class-specific weights, means, and covariances. The LPC in this set of results uses the generating function in (15) with $r_1$, $r_2$, and $r_3$ given by nearest neighbor (NN) classifiers with three different numbers of neighbors.

Figure 1: Upper and lower LPC risk bounds (risk versus training size $n$, showing the LPC risk, upper bound, lower bound, and Bayes risk).

Figure 1 shows the risk of an LPC that uses the interval estimates $\mathbf{a}_n$ and $\mathbf{b}_n$ given by (9). For each training size, one instantiation of training samples is used for learning, and the LPC's risk is estimated using a separate set of test samples. It can be observed from the figure that the lower and upper bounds can offer accurate estimates for the risk without using test samples.

In the second set of experimental results, we use datasets from the UCI repository (first column of Table 1). LPCs are compared with 8 classifiers: decision tree (DT), quadratic discriminant analysis (QDA), k-nearest neighbors (KNN), support vector machine (SVM), random forest (RF), logistic regression (LR), learning using statistical invariants (LUSI), and adversarial cost-sensitive classifier (ACSC). The first 6 classifiers were implemented using the scikit-learn package with the default parameters, LUSI was implemented as in [18], and ACSC was implemented as in [6]. Two versions of the proposed classifiers, LPC1 and LPC2, were implemented using different classifiers $r_1$, $r_2$, and $r_3$ to define $\Phi$ in (15); LPC1 uses DT, QDA, and KNN, while LPC2 uses SVM, RF, and LR. The errors in Table 1 have been estimated using paired and stratified $k$-fold cross-validation. It can be observed from the table that the performance of LPCs is competitive with state-of-the-art techniques.

data set LPC1 LPC2 QDA DT KNN SVM RF LR ACSC LUSI
mammog. .21 .19 .20 .24 .22 .18 .21 .17 .22 .18
vehicle .16 .21 .15 .28 .29 .21 .25 .21 .37 .22
glass .42 .38 .49 .39 .35 .35 .40 .40 .41 .35
haberman .26 .27 .24 .39 .30 .26 .35 .26 .28 .26
column 3C .15 .16 .16 .20 .21 .15 .17 .15 .27 .17
indian liver .29 .28 .45 .35 .34 .29 .30 .28 .33 .27
diabetes .26 .23 .26 .30 .26 .24 .26 .23 .29 .23
adult .15 .15 .20 .18 .17 .15 .15 .18 .20 .15
credit .15 .17 .22 .22 .14 .16 .17 .15 .22 .16
satellite .12 .12 .16 .17 .12 .12 .11 .18 .18 .11
Table 1: Classification error of LPCs in comparison with state-of-the-art techniques.

6 Conclusion

The proposed LPCs consider unconstrained classification rules, optimize the classification 0-1 loss, and provide performance guarantees during learning. We present finite-sample generalization bounds for LPCs, and describe practical and efficient implementations. This paper shows that supervised classification does not require selecting from the outset a family of classification rules or surrogate losses with favorable tractability properties. Differently from conventional techniques, the inductive bias exploited by LPCs comes from a chosen generating function that represents the classification-discriminative characteristics of examples. Learning with LPCs is achieved without further design choices by solving linear optimization problems given by expectation estimates obtained from training data. Finally, we propose a simple choice for the generating function that results in LPCs achieving classification errors competitive with state-of-the-art techniques.

References

  • [1] Vladimir Vapnik. Statistical learning theory. Wiley, New York, 1998.
  • [2] Theodoros Evgeniou, Massimiliano Pontil, and Tomaso Poggio. Regularization networks and support vector machines. Advances in computational mathematics, 13(1):1–50, 2000.
  • [3] Gert R.G. Lanckriet, Laurent El Ghaoui, Chiranjib Bhattacharyya, and Michael I. Jordan. A robust minimax approach to classification. Journal of Machine Learning Research, 3:555–582, December 2002.
  • [4] Erick Delage and Yinyu Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612, 2010.
  • [5] Farzan Farnia and David Tse. A minimax approach to supervised learning. In Advances in Neural Information Processing Systems, pages 4240–4248, 2016.
  • [6] Kaiser Asif, Wei Xing, Sima Behpour, and Brian D. Ziebart. Adversarial cost-sensitive classification. In Conference on Uncertainty in Artificial Intelligence, pages 92–101, 2015.
  • [7] Soroosh Shafieezadeh-Abadeh, Peyman Mohajerin Esfahani, and Daniel Kuhn. Distributionally robust logistic regression. In Advances in Neural Information Processing Systems, pages 1576–1584, 2015.
  • [8] Soroosh Shafieezadeh-Abadeh, Daniel Kuhn, and Peyman Mohajerin Esfahani. Regularization via mass transportation. arXiv preprint, arXiv:1710.10016, 2017.
  • [9] John Duchi, Peter Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv preprint, arXiv:1610.03425, 2016.
  • [10] Hongseok Namkoong and John C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2971–2980, 2017.
  • [11] Jaeho Lee and Maxim Raginsky. Minimax statistical learning with Wasserstein distances. In Advances in Neural Information Processing Systems, pages 2692–2701, 2018.
  • [12] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
  • [13] Robert E. Schapire and Yoav Freund. Boosting: Foundations and algorithms. MIT Press, 2012.
  • [14] Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
  • [15] Brendan van Rooyen and Robert C. Williamson. A theory of learning with corrupted labels. Journal of Machine Learning Research, 18:1–50, July 2018.
  • [16] Peter D. Grünwald and A. Philip Dawid. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. Annals of Statistics, 32(4):1367–1433, 2004.
  • [17] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
  • [18] Vladimir Vapnik and Rauf Izmailov. Rethinking statistical learning theory: learning using statistical invariants. Machine Learning, pages 1–43, July 2018.
  • [19] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.
  • [20] Robert R. Phelps. Convex functions, monotone operators and differentiability. Springer-Verlag, Berlin, Heidelberg, second edition, 2009.

Appendix

A.1 Auxiliary lemmas

The proofs of Theorem 1 and Proposition 1 require the lemmas provided below.

Lemma 1.

The norms and are dual.

Proof.

The dual norm of assigns each , the real number

We have that for with

So, to prove the result we just need to find a vector such that and . Let , then given by

satisfies and .

Lemma 2.

Let , and and be the functions and for , where

Then, their conjugate functions are

Proof.

By definition of conjugate function we have

  • If , for each , we have

    and by definition of dual norm we get

    which implies

    Moreover, , so we have that .

  • If , by definition of dual norm and using Lemma 1 there exists such that and . Define as

    By definition of and we have

    and

    Now let and take , then we have

    which tends to infinity as because , so we have that .

Finally, the expression for is straightforward since

A.2 Proof of Theorem 1

Let

In the first step of the proof we show that a classification rule satisfying (2) is a solution of the optimization problem above, and in the second step of the proof we show that a solution of that problem is also a solution of the minimax problem (1).

For the first step, note that

Then, optimization problem is equivalent to

that is separable and has a solution given by

for any . The inner maximization above is given in closed-form by

that takes its minimum value for any .

For the second step, if is a solution of we have that

(16)

where the first inequality is due to the fact that and for because by definition of and since .

Since is bounded, is finite, and and are closed and convex, the min and the max in can be interchanged (see, e.g., Theorem 5.1 in [16]) and we have that . In addition,

because the optimization problem above is separable for and