Satisfying Real-world Goals with Dataset Constraints
The goal of minimizing misclassification error on a training set is often just one of several real-world goals that might be defined on different datasets. For example, one may require a classifier to also make positive predictions at some specified rate for some subpopulation (fairness), or to achieve a specified empirical recall. Other real-world goals include reducing churn with respect to a previously deployed model, or stabilizing online training. In this paper we propose handling multiple goals on multiple datasets by training with dataset constraints, using the ramp penalty to accurately quantify costs, and present an efficient algorithm to approximately optimize the resulting non-convex constrained optimization problem. Experiments on both benchmark and real-world industry datasets demonstrate the effectiveness of our approach.
This is a slightly expanded version of the paper presented at the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016).
1 Real-world goals
We consider a broad set of design goals important for making classifiers work well in real-world applications, and discuss how metrics quantifying many of these goals can be represented in a particular optimization framework. The key theme is that these metrics, which range from the standard precision and recall, to less well-known examples such as coverage and fairness (Mann and McCallum, 2007; Zafar et al., 2015; Hardt et al., 2016), and including some new proposals, can be expressed in terms of the positive and negative classification rates on multiple datasets.
One may wish to control how often a classifier predicts the positive (or negative) class. For example, one may want to ensure that only a budgeted fraction of customers is selected to receive a printed catalog, or perhaps to compensate for a biased training set. In practice, constraining the “coverage rate” (the expected proportion of positive predictions) is often easier than measuring e.g. accuracy or precision, because coverage can be computed on unlabeled data—labeling data can be expensive, but acquiring a large number of unlabeled examples is often very easy.
Coverage was also considered by Mann and McCallum (2007), who proposed what they call “label regularization”, in which one adds a regularizer penalizing the relative entropy between the mean score for each class and the desired distribution, with an additional correction to avoid degeneracies.
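The coverage rate described above is simple to estimate empirically. The following is a minimal sketch (not the paper's implementation; the names `coverage_rate`, `w`, and `b` are illustrative) showing that no labels are needed:

```python
# Sketch: estimating the coverage rate (proportion of positive predictions)
# of a linear classifier on an unlabeled dataset.

def coverage_rate(examples, w, b):
    """Fraction of examples receiving a positive prediction."""
    positives = sum(1 for x in examples
                    if sum(wi * xi for wi, xi in zip(w, x)) + b > 0)
    return positives / len(examples)

# Unlabeled 1-D examples: no labels are required to measure coverage.
unlabeled = [(-2.0,), (-0.5,), (0.5,), (3.0,)]
print(coverage_rate(unlabeled, w=(1.0,), b=0.0))  # 0.5
```

A coverage constraint would then upper- or lower-bound this quantity on a large unlabeled sample.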
Work does not stop once a machine learning model has been adopted. There will be new training data, improved features, and potentially new model structures. Hence, in practice, one will deploy a series of models, each improving slightly upon the last. In this setting, determining whether each candidate should be deployed is surprisingly challenging: if we evaluate on the same held-out testing set every time a new candidate is proposed, and deploy it if it outperforms its predecessor, then every compare-and-deploy decision will increase the statistical dependence between the deployed model and the testing dataset, causing the model sequence to fit the originally-independent testing data. This problem is magnified if, as is typical, the candidate models tend to disagree only on a relatively small number of examples near the true decision boundary.
A simple and safe solution is to draw a fresh testing sample every time one wishes to compare two models in the sequence, only considering examples on which the two models disagree. Because labeling data is expensive, one would like these freshly sampled testing datasets to be as small as possible. It is here that the problem of “churn” arises. Imagine that model A, our deployed model, is fairly accurate, and that model B, our candidate, is slightly more accurate. In the best case, only a small fraction of test samples would be labeled differently, and all differences would be “wins” for classifier B. Then only a dozen or so examples would need to be labeled in order to establish with high confidence that B is the statistically significantly better classifier. In the worst case, model A would be correct and model B incorrect on some fraction of examples, model B correct and model A incorrect on a slightly larger fraction, and both models correct on the remainder. Then a much larger share of testing examples would be labeled differently, and far more examples would need to be labeled to determine that model B is better.
We define the “churn rate” as the expected proportion of examples on which the prediction of the model being considered (model B above) differs from that of the currently-deployed model (model A). During training, we propose constraining the empirical churn rate with respect to a given deployed model on a large unlabeled dataset (see also Fard et al. (2016) for an alternative approach).
A special case of minimizing churn is to ensure stability of an online classifier as it evolves, by constraining it to not deviate too far from a trusted classifier on a large held-out unlabeled dataset.
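The churn rate defined above can be estimated on unlabeled data, since it only compares the two models' predictions. A minimal sketch (names `churn_rate`, `predict_a`, `predict_b` are illustrative stand-ins, not the paper's API):

```python
# Sketch: empirical churn rate between a candidate model B and a deployed
# model A, measured on a (possibly unlabeled) dataset.

def churn_rate(examples, predict_a, predict_b):
    """Proportion of examples on which the two models' predictions differ."""
    disagreements = sum(1 for x in examples if predict_a(x) != predict_b(x))
    return disagreements / len(examples)

predict_a = lambda x: x > 0.0   # deployed model A
predict_b = lambda x: x > 0.5   # candidate model B
data = [-1.0, 0.2, 0.4, 1.0]
print(churn_rate(data, predict_a, predict_b))  # 0.5: they disagree on 0.2 and 0.4
```

A churn constraint upper-bounds this quantity on a large unlabeled dataset during training.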
A practitioner may be required to guarantee fairness of a learned classifier, in the sense that it makes positive predictions on different subgroups at certain rates. For example, one might require that housing loans be given equally to people of different genders. Hardt et al. (2016) identify three types of fairness: (i) demographic parity, in which positive predictions are made at the same rate on each subgroup, (ii) equal opportunity, in which only the true positive rates must match, and (iii) equalized odds, in which both the true positive rates and false positive rates must match. Fairness can also be specified by a proportion, such as the 80% rule in US law that certain decisions must be in favor of group B individuals at least 80% as often as group A individuals (e.g. Biddle, 2005; Vuolo and Levy, 2013; Zafar et al., 2015; Hardt et al., 2016).
Zafar et al. (2015) propose learning fair classifiers by imposing linear constraints on the covariance between the predicted labels and the values of certain features, while Hardt et al. (2016) propose first learning an “unfair” classifier, and then choosing population-dependent thresholds to satisfy the desired fairness criterion. In our framework, rate constraints such as those mentioned above can be imposed directly, at training time.
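The fairness notions above reduce to comparing simple empirical rates across subgroups. The following sketch computes them from held-out predictions (all names are illustrative; the paper instead imposes such conditions as training-time rate constraints):

```python
# Sketch of the fairness quantities from Hardt et al. (2016), computed
# empirically from predictions and labels on two subgroups.

def positive_rate(preds):
    """Demographic parity compares this rate across subgroups."""
    return sum(preds) / len(preds)

def true_positive_rate(preds, labels):
    """Equal opportunity compares this rate across subgroups;
    equalized odds additionally compares false positive rates."""
    pos = [p for p, y in zip(preds, labels) if y == 1]
    return sum(pos) / len(pos)

preds_a, labels_a = [1, 1, 0, 0], [1, 0, 1, 0]   # subgroup A
preds_b, labels_b = [1, 0, 0, 0], [1, 0, 0, 0]   # subgroup B
ratio = positive_rate(preds_b) / positive_rate(preds_a)
print(ratio)  # 0.5 -- this would violate an 80% (four-fifths) rule constraint
```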
Recall and Precision:
Requirements of real-world classifiers are often expressed in terms of precision and recall, especially when examples are highly imbalanced between positives and negatives. In our framework, we can handle this problem via Neyman-Pearson classification (e.g. Scott and Nowak, 2005; Davenport et al., 2010), in which one seeks to minimize the false negative rate subject to a constraint on the false positive rate. Indeed, our ramp-loss formulation is equivalent to that of Gasso et al. (2011) in this setting.
For certain classification applications, examples may be discovered that are particularly embarrassing if classified incorrectly. One standard approach to handling such examples is to increase their weights during training, but this is difficult to get right: too large a weight may distort the classifier too much in the surrounding feature space, whereas too small a weight may not fix the problem. Worse, over time the dataset will often be augmented with new training examples and new features, causing the ideal weights to drift. We propose instead simply adding a constraint ensuring that some proportion of a set of such egregious examples is correctly classified. Such constraints should be used with extreme care, since they can cause the problem to become infeasible.
2 Optimization problem
- Sets of examples labeled positive/negative, respectively
- Sets of examples with ground-truth positive/negative labels, and for which a baseline classifier makes positive/negative predictions
- Sets of examples belonging to subpopulations A and B, respectively
- #TP, #TN, #FP, #FN: true/false positive/negative counts, expressed in terms of the rates on the positively- and negatively-labeled datasets
- Fairness constraint: the positive prediction rates on subpopulations A and B must be comparable
- Equal opportunity constraint: the true positive rates on subpopulations A and B must be comparable
- Egregious example constraint: a minimum correct-classification rate on a dataset of egregious examples
A key aspect of many of the goals of Section 1 is that they are defined on different datasets. For example, we might seek to maximize the accuracy on a set of labeled examples drawn in some biased manner, require that the classifier's recall be at least some specified value on small datasets sampled in an unbiased manner from different countries, desire low churn relative to a deployed classifier on a large unbiased unlabeled dataset, and require that given egregious examples be classified correctly.
Another characteristic common to the metrics of Section 1 is that they can be expressed in terms of the positive and negative classification rates on various datasets. We consider only unlabeled datasets, as described in Table 1—a dataset with binary labels, for example, would be handled by partitioning it into two unlabeled datasets containing the positive and negative examples, respectively. We wish to learn a linear classification function parameterized by a weight vector $w$ and bias $b$, for which the positive and negative classification rates on a dataset $D_i$ are:

$$s_i^+(w, b) = \frac{1}{|D_i|} \sum_{x \in D_i} \mathbb{1}\left( \langle w, x \rangle + b > 0 \right), \qquad s_i^-(w, b) = 1 - s_i^+(w, b),$$

where $\mathbb{1}(\cdot)$ is an indicator function that is $1$ if its argument is positive, and $0$ otherwise. In words, $s_i^+$ and $s_i^-$ denote the proportion of positive or negative predictions, respectively, that the classifier makes on $D_i$. Table 2 specifies how the metrics of Section 1 can be expressed in terms of these positive and negative rates.
We propose handling these goals by minimizing an $\ell_2$-regularized positive linear combination of prediction rates on different datasets, subject to upper-bound constraints on other positive linear combinations of such prediction rates:
Starting point: discontinuous constrained problem
Here, $\lambda$ is the parameter on the regularizer, and the problem is posed over a collection of unlabeled datasets and a set of constraints. The metrics minimized by the objective and bounded by the constraints are specified via the choices of the nonnegative coefficients and the upper bound of each constraint—a user should base these choices on Table 2. Note that because the positive and negative rates on any dataset sum to one, it is possible to transform any linear combination of rates into an equivalent positive linear combination, plus a constant (see Appendix B for an example; appendices may be found in the supplementary material).
We cannot optimize Problem 1 directly because the rate functions are discontinuous. We can, however, work around this difficulty by training a classifier that makes randomized predictions based on a ramp function $\sigma$ taking values in $[0, 1]$ (Collobert et al., 2006): the randomized classifier parameterized by $w$ and $b$ will make a positive prediction on $x$ with probability $\sigma(\langle w, x \rangle + b)$, and a negative prediction otherwise (see Appendix A for more on this randomized classification rule). For this randomized classifier, the expected positive and negative rates on a dataset $D_i$ will be:

$$r_i^+(w, b) = \frac{1}{|D_i|} \sum_{x \in D_i} \sigma\left( \langle w, x \rangle + b \right), \qquad r_i^-(w, b) = 1 - r_i^+(w, b).$$
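This randomized rule is straightforward to state in code. A minimal sketch, assuming the common clipped-affine form of the ramp, $\sigma(z) = \max(0, \min(1, (1+z)/2))$ (this particular parameterization is an assumption for illustration, not taken from the paper):

```python
# Sketch of the randomized classification rule: the ramp maps the decision
# value into a prediction probability. The specific form of the ramp here
# (clipped affine, sigma(z) = max(0, min(1, (1 + z) / 2))) is an assumption.
import random

def ramp(z):
    return max(0.0, min(1.0, (1.0 + z) / 2.0))

def predict_randomized(x, w, b, rng=random):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    # positive prediction with probability ramp(score)
    return 1 if rng.random() < ramp(score) else 0

print(ramp(-2.0), ramp(0.0), ramp(2.0))  # 0.0 0.5 1.0
```

Far from the decision boundary the rule is deterministic (probability 0 or 1); near the boundary it interpolates, which is what smooths the discontinuous rates.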
Using these expected rates yields a continuous (but non-convex) analogue of Problem 1:
Ramp version of Problem 1
Efficient optimization of this problem is the ultimate goal of this section. In Section 2.1, we will propose a majorization-minimization approach that sequentially minimizes convex upper bounds on Problem 2, and, in Section 2.2, will discuss how these convex upper bounds may themselves be efficiently optimized.
2.1 Optimizing the ramp problem
2. Construct an instance of Problem 3 at the current candidate solution
3. Optimize this convex optimization problem to yield the next candidate
To address the non-convexity of Problem 2, we will iteratively optimize approximations: starting from a feasible initial candidate solution, we construct a convex optimization problem upper-bounding Problem 2 that is tight at the current candidate, optimize this convex problem to yield the next candidate, and repeat.
Our choice of a ramp for $\sigma$ makes finding such tight convex upper bounds easy: both the hinge function and the constant-$1$ function are upper bounds on $\sigma$, with the former being tight below a threshold argument, and the latter above it (see Figure 1). We'll therefore define upper bounds on the positive and negative ramp terms, with an additional parameter determining which of the two bounds (hinge or constant) will be used for each example, such that the bounds are always tight at the current candidate:
Based upon these we define the following upper bounds on the expected rates:
which have the properties that both are convex in $w$ and $b$, upper-bound the original ramp-based rates, and are tight at the current candidate solution.
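The hinge/constant bounding structure can be verified numerically. A sketch, again assuming the clipped-affine ramp $\sigma(z) = \max(0, \min(1, (1+z)/2))$ and the matching scaled hinge (both parameterizations are assumptions for illustration):

```python
# Numerical check of the two convex upper bounds on the ramp used in the
# majorization step: a scaled hinge, tight on one side, and the constant-1
# function, tight on the other.

def ramp(z):
    return max(0.0, min(1.0, (1.0 + z) / 2.0))

def hinge(z):
    return max(0.0, (1.0 + z) / 2.0)

for z in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    assert hinge(z) >= ramp(z)      # the hinge upper-bounds the ramp ...
    assert 1.0 >= ramp(z)           # ... as does the constant 1
    if z <= 1.0:
        assert hinge(z) == ramp(z)  # hinge is tight for z <= 1
    if z >= 1.0:
        assert ramp(z) == 1.0       # constant bound is tight for z >= 1
print("bounds verified")
```

At any candidate solution, one of the two bounds is tight for each example, which is exactly what makes the majorization tight at the current iterate.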
Substituting these bounds into Problem 2 yields:
Convex upper bound on Problem 2
As desired, this problem upper bounds Problem 2, is tight at the current candidate, and is convex (because any positive linear combination of convex functions is convex).
Algorithm 1 contains our proposed procedure for approximately solving Problem 2. Given an initial feasible solution, it's straightforward to verify inductively, using the fact that we construct tight convex upper bounds at every step, that every convex subproblem will have a feasible solution, every candidate pair will be feasible w.r.t. Problem 2, and every candidate will have an objective function value no larger than that of its predecessor. In other words, no iteration can make negative progress. The non-convexity of Problem 2, however, will cause Algorithm 1 to arrive at a suboptimal solution that depends on the initial candidate.
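The majorization-minimization skeleton can be sketched generically. In this toy illustration (not the paper's Algorithm 1; `build_and_solve_upper_bound` is a hypothetical stand-in for the convex-subproblem solver of Section 2.2), each step minimizes a convex upper bound that is tight at the current point, which guarantees monotone progress:

```python
# Skeleton of a majorization-minimization loop: at each step, build a convex
# upper bound tight at the current candidate, then minimize it.

def majorize_minimize(objective, build_and_solve_upper_bound, x0, iters=10):
    x = x0
    for _ in range(iters):
        x_next = build_and_solve_upper_bound(x)
        # a tight upper bound at x guarantees no negative progress:
        assert objective(x_next) <= objective(x) + 1e-12
        x = x_next
    return x

# Toy non-convex 1-D objective: f(x) = min(x**2, (x - 2)**2 + 0.5).
f = lambda x: min(x**2, (x - 2.0)**2 + 0.5)
# Majorizer: whichever quadratic branch is active at the current point is a
# tight upper bound on f; return that branch's minimizer (0.0 or 2.0).
step = lambda x: 0.0 if x**2 <= (x - 2.0)**2 + 0.5 else 2.0
print(majorize_minimize(f, step, x0=3.0))  # 2.0
```

Note how the final iterate depends on the starting point, mirroring the suboptimality caveat above: starting at `x0=3.0` converges to the local minimum at 2.0, not the global minimum at 0.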
2.2 Optimizing the convex subproblems
Algorithm 2 (excerpt): initialize the multiplier vector to all zeros, iteratively insert cuts at chosen indices, and return the final candidate solution and bounds.
The first step in optimizing Problem 3 is to introduce Lagrange multipliers over the constraints, yielding an equivalent saddle-point problem whose Lagrangian is convex in $w$ and $b$, and concave in the multipliers. For the purposes of this section, the candidate solution found in the previous iteration of Algorithm 1 is a fixed constant.
Because this is a convex-concave saddle point problem, many optimization techniques could be successfully applied. For example, in settings similar to our own, Eban et al. (2016) simply perform SGD jointly over all parameters (including the multipliers), while Gasso et al. (2011) use the Uzawa algorithm, which alternates between (i) optimizing exactly over $w$ and $b$, and (ii) taking gradient steps on the multipliers.
We instead propose an approach for which, in our setting, it is particularly easy to create an efficient implementation. The key insight is that evaluating the inner minimum is, thanks to our use of hinge and constant upper bounds on our ramp, equivalent to optimizing a support vector machine (SVM) with per-example weights—see Appendix F for details. This observation enables us to solve the saddle system in an inside-out manner. On the “inside”, we optimize over $w$ and $b$ for fixed multipliers using an off-the-shelf SVM solver (e.g. Chang and Lin, 2011). On the “outside”, the resulting optimizer is used as a component in a cutting-plane optimization over the multipliers. Notice that this outer optimization is very low-dimensional: there is one multiplier per dataset constraint.
Algorithm 2 contains a skeleton of the cutting-plane algorithm that we use for this outer optimization over the multipliers. Because this algorithm is intended to be used as an outer loop in a nested optimization routine, it does not expect that the objective can be evaluated or differentiated exactly. Rather, it's based upon the idea of possibly making “shallow” cuts (Bland et al., 1981) by choosing a desired accuracy at each iteration, and expecting the SVMOptimizer to return a solution with suboptimality at most that accuracy. More precisely, the SVMOptimizer function approximately evaluates the inner minimum for given fixed multipliers by constructing the corresponding SVM problem and finding a solution for which the primal and dual objective function values differ by at most the desired accuracy.
After finding such a solution, the SVMOptimizer evaluates the dual objective function value of the SVM to determine a lower bound. The primal objective function value and its gradient w.r.t. the multipliers (calculated on line 10 of Algorithm 2) define a cut. Notice that since the Lagrangian is a linear function of the multipliers for fixed $w$ and $b$, it is equal to this cut function, which therefore upper-bounds the quantity being maximized.
One advantage of this cutting-plane formulation is that typical CutChooser implementations will choose the desired accuracy to be large in the early iterations, and will only shrink it once we're close to convergence. We leave the details of the analysis to Appendices E and F—a summary can be found in Appendix G.
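Since the experiments of Section 4 use a one-dimensional variant of this outer loop, a minimal sketch of that special case may help fix ideas. Here `oracle` is a hypothetical stand-in for the SVMOptimizer: it returns an (approximate) value and supergradient of the concave dual at a query point, and the loop localizes the maximizer by the gradient's sign (the full Algorithm 2 handles the multidimensional case with shallow cuts):

```python
# Sketch of a one-dimensional cutting-plane outer loop maximizing a concave
# dual function over an interval, given a value/supergradient oracle.

def cutting_plane_1d(oracle, lo, hi, iters=40):
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        _, grad = oracle(mid)
        if grad > 0:      # maximizer lies to the right of the cut
            lo = mid
        else:             # maximizer lies to the left (or at) the cut
            hi = mid
    return 0.5 * (lo + hi)

# Toy concave dual g(lam) = -(lam - 0.3)**2, maximized at lam = 0.3.
oracle = lambda lam: (-(lam - 0.3)**2, -2.0 * (lam - 0.3))
lam_star = cutting_plane_1d(oracle, 0.0, 1.0)
print(round(lam_star, 3))  # 0.3
```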
3 Related work
The problem of finding optimal trade-offs in the presence of multiple objectives has been studied generically in the field of multi-objective optimization (Miettinen, 2012). Two common approaches are (i) linear scalarization (Miettinen, 2012, Section 3.1), and (ii) the method of -constraints (Miettinen, 2012, Section 3.2). Linear scalarization reduces to the common heuristic of reweighting groups of examples. The method of -constraints puts hard bounds on the magnitudes of secondary objectives, like our dataset constraints. Notice that, in our formulation, the Lagrange multipliers play the role of the weights in the linear scalarization approach, with the difference being that, rather than being provided directly by the user, they are dynamically chosen to satisfy constraints. The user controls the problem through these constraint choices, which have concrete real-world meanings.
While the hinge loss is one of the most commonly-used convex upper bounds on the 0/1 loss (Rockafellar and Uryasev, 2000), we use the ramp loss, trading off convexity for tightness. For our purposes, the main disadvantage of the hinge loss is that it is unbounded, and therefore cannot distinguish a single very bad example from, say, 10 slightly bad ones, making it ill-suited for constraints on rates. In contrast, for the ramp loss the contribution of any single datum is bounded, no matter how far it is from the decision boundary.
The ramp loss has also been investigated in Collobert et al. (2006) (without constraints). Gasso et al. (2011) use the ramp loss both in the objective and constraints, but their algorithm only tackles the Neyman-Pearson problem. They compared their classifier to that of Davenport et al. (2010), which differs in that it uses a hinge relaxation instead of the ramp loss, and found that with the ramp loss they achieved similar or slightly better results with substantially less computation (our approach does not enjoy this computational speedup).
Narasimhan et al. (2015) considered optimizing the F-measure and other quantities that can be written as concave functions of the TP and TN rates. Their proposed stochastic dual solver adaptively linearizes concave functions of the rate functions (Equation 1). Joachims (2005) indirectly optimizes upper bounds on functions of the true/false positive/negative counts using a hinge loss approximation.
Finally, for some simple problems (particularly when there is only one constraint), the goals in Section 1 can be coarsely handled by simple bias-shifting, i.e. first training an unconstrained classifier, and then attempting to adjust the decision threshold to satisfy the constraints as a second step.
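For a single rate constraint, the bias-shifting baseline amounts to sorting held-out decision scores and picking a threshold. A minimal sketch (the name `shift_bias_for_coverage` and the choice of coverage as the constrained rate are illustrative):

```python
# Sketch of the bias-shifting baseline: train an unconstrained classifier,
# then adjust only the decision threshold to hit a target coverage rate.

def shift_bias_for_coverage(scores, target_coverage):
    """Choose a threshold so the fraction of scores at or above it hits the target."""
    ranked = sorted(scores, reverse=True)
    k = int(round(target_coverage * len(scores)))
    k = max(1, min(k, len(scores)))
    return ranked[k - 1]  # predict positive on the top-k scoring examples

scores = [0.9, 0.7, 0.4, 0.2, 0.1]
t = shift_bias_for_coverage(scores, target_coverage=0.4)
print(sum(s >= t for s in scores) / len(scores))  # 0.4
```

With multiple constraints on multiple datasets, a single threshold generally cannot satisfy them all, which is what motivates handling the constraints at training time instead.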
4 Experiments
We evaluate the performance of the proposed approach in two experiments: the first uses a benchmark dataset for fairness, and the second a real-world problem with churn and recall constraints.
We compare training for fairness on the Adult dataset (“a9a” from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html), the same dataset used by Zafar et al. (2015). The training and testing examples, derived from the 1994 Census, are 123-dimensional and sparse, encoding categorical attributes such as race, gender, education level and relationship status. A positive class label means that the individual's income exceeds $50,000. Consider the sets of male and female examples: the number of positive labels among the male examples is roughly six times that among the female examples. The goal is to train a classifier that respects a fairness constraint requiring the female positive prediction rate to be at least a specified multiple of the male positive prediction rate (a multiple of 0.8 corresponds to the 80% rule mentioned in Section 1).
Our publicly-available Julia implementation (https://github.com/gabgoh/svmc.jl) for these experiments uses LIBLINEAR (Fan et al., 2008) with the default parameters to implement the SVMOptimizer function, and does not include an unregularized bias term. The outer optimization over the multiplier does not use the multidimensional cutting-plane algorithm of Algorithm 2, instead using a simpler one-dimensional variant (observe that these experiments involve only one constraint). The majorization-minimization procedure starts from the all-zeros vector.
We compare to the method of Zafar et al. (2015), which proposed handling fairness with a linear constraint on the covariance between the predicted labels and the values of the sensitive feature. An SVM subject to this constraint (see Appendix D for details), for a range of constraint values, is our baseline.
Results in Figure 2 show that the proposed method is much more accurate for any desired fairness, and achieves fairness ratios not reachable with the approach of Zafar et al. (2015) for any choice of their constraint value. It is also easier to control: the constraint values in Zafar et al. (2015) do not have a clear interpretation, whereas our constraint parameter is an effective proxy for the fairness ratio.
Our second set of experiments demonstrates meeting real-world requirements on a proprietary problem from Google: predicting whether a user interface element should be shown to a user, based on a low-dimensional vector of informative features, which is mapped to a much higher-dimensional feature vector via a fixed kernel function. We train classifiers that are linear with respect to this mapped representation. We are given the currently-deployed model, and seek to train a classifier that (i) has high accuracy, (ii) has no worse recall than the deployed model, and (iii) has low churn w.r.t. the deployed model.
We are given three datasets, two of which are hand-labeled while the third is unlabeled. The first labeled dataset was chosen via active sampling, while the second labeled dataset and the unlabeled dataset are sampled i.i.d. from the underlying data distribution. For all three datasets, we split out a portion for training and reserved the remainder for testing. We address the three goals in the proposed framework by simultaneously training the classifier to minimize the number of errors on the actively-sampled labeled dataset plus the number of false positives on the i.i.d. labeled dataset, subject to the constraints that the recall on the i.i.d. labeled dataset be at least as high as the deployed model's recall (we're essentially performing Neyman-Pearson classification on it), and that the churn w.r.t. the deployed model on the unlabeled dataset be no larger than a given target parameter.
These experiments use a proprietary C++ implementation of Algorithm 2, using the combined SDCA and cutting-plane approach of Appendix F to implement the inner optimizations over $w$ and $b$, with the CutChooser helper functions being as described in Appendices E.1 and F.2.1. We performed a fixed number of iterations of the majorization-minimization procedure of Algorithm 1.
Our baseline is an unconstrained SVM that is thresholded after training to achieve the desired recall, but makes no effort to minimize churn. We chose the regularization parameter using a logarithmically-spaced grid search, found the value that was best for this baseline, and then used it for all experiments.
The plots in Figure 3 show the achieved churn and error rates on the training and testing sets for a range of churn constraint values (red and blue curves), compared to the baseline thresholded SVM (green lines). When using deterministic thresholding of the learned classifier (the blue curves), which significantly outperformed randomized classification (the red curves), the proposed method achieves lower churn and better accuracy for all targeted churn rates, while also meeting the recall constraint.
As expected, the empirical churn is extremely close to the targeted churn on the training set when using randomized classification (red curve, top left plot), but less so on the held-out test set (top right plot). We hypothesize that this disparity is due to overfitting, as the classifier has a large number of parameters and the churn dataset is rather small (please see Appendix C for a discussion of the generalization performance of our approach). However, except for the lowest targeted churn, the actual classifier churn (blue curves) is substantially lower than the targeted churn. Compared to the thresholded SVM baseline, our approach significantly reduces churn without paying an accuracy cost.
- Ball (1997) K. Ball. An elementary introduction to modern convex geometry. Flavors of Geometry, 31:1–58, 1997.
- Bartlett and Mendelson (2002) P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3:463–482, 2002.
- Biddle (2005) D. Biddle. Adverse Impact and Test Validation: A Practitioner’s Guide to Valid and Defensible Employment Testing. Gower, 2005.
- Bland et al. (1981) R. G. Bland, D. Goldfarb, and M. J. Todd. Feature article—the ellipsoid method: A survey. Operations Research, 29(6):1039–1091, November 1981.
- Boyd and Vandenberghe (2011) S. Boyd and L. Vandenberghe. Localization and cutting-plane methods, April 2011. Stanford EE 364b lecture notes.
- Chang and Lin (2011) C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
- Collobert et al. (2006) R. Collobert, F. Sinz, J. Weston, and L. Bottou. Trading convexity for scalability. In ICML, 2006.
- Cotter et al. (2013) A. Cotter, S. Shalev-Shwartz, and N. Srebro. Learning optimally sparse support vector machines. In ICML, pages 266–274, 2013.
- Davenport et al. (2010) M. Davenport, R. G. Baraniuk, and C. D. Scott. Tuning support vector machines for minimax and Neyman-Pearson classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
- Eban et al. (2016) E. E. Eban, M. Schain, A. Gordon, R. A. Saurous, and G. Elidan. Large-scale learning with global non-decomposable objectives, 2016. URL https://arxiv.org/abs/1608.04802.
- Fan et al. (2008) R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, 2008.
- Fard et al. (2016) M. M. Fard, Q. Cormier, K. Canini, and M. Gupta. Launch and iterate: Reducing prediction churn. In NIPS, 2016.
- Gasso et al. (2011) G. Gasso, A. Pappaionannou, M. Spivak, and L. Bottou. Batch and online learning algorithms for nonconvex Neyman-Pearson classification. ACM Transactions on Intelligent Systems and Technology, 2011.
- Grünbaum (1960) B. Grünbaum. Partitions of mass-distributions and convex bodies by hyperplanes. Pacific Journal of Mathematics, 10(4):1257–1261, December 1960.
- Hardt et al. (2016) M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In NIPS, 2016.
- Joachims (2005) T. Joachims. A support vector method for multivariate performance measures. In ICML, 2005.
- Mann and McCallum (2007) G. S. Mann and A. McCallum. Simple, robust, scalable semi-supervised learning with expectation regularization. In ICML, 2007.
- Miettinen (2012) K. Miettinen. Nonlinear multiobjective optimization, volume 12. Springer Science & Business Media, 2012.
- Narasimhan et al. (2015) H. Narasimhan, P. Kar, and P. Jain. Optimizing non-decomposable performance measures: a tale of two classes. In ICML, 2015.
- Nemirovski (1994) A. Nemirovski. Lecture notes: Efficient methods in convex programming. 1994. URL http://www2.isye.gatech.edu/~nemirovs/Lect_EMCO.pdf.
- Rademacher (2007) L. Rademacher. Approximating the centroid is hard. In SoCG, pages 302–305, 2007.
- Rockafellar and Uryasev (2000) R. T. Rockafellar and S. Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.
- Scott and Nowak (2005) C. D. Scott and R. D. Nowak. A Neyman-Pearson approach to statistical learning. IEEE Transactions on Information Theory, 2005.
- Shalev-Shwartz and Zhang (2013) S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. JMLR, 14(1):567–599, Feb. 2013.
- Shalev-Shwartz et al. (2011) S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. Mathematical Programming, 127(1):3–30, March 2011.
- Vuolo and Levy (2013) M. S. Vuolo and N. B. Levy. Disparate impact doctrine in fair housing. New York Law Journal, 2013.
- Zafar et al. (2015) M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi. Fairness constraints: A mechanism for fair classification. In ICML Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2015.
|Section 2||Number of datasets|
|Section 2||Number of dataset constraints|
|Section 2||th dataset|
|,||Section 2, Equation 1||Positive and negative indicator-based rates|
|Section 2, Problem 1||Regularization parameter|
|,||Section 2, Problem 1||Coefficients defining the objective function|
|,||Section 2, Problem 1||Coefficients defining the th dataset constraint|
|Section 2, Problem 1||Given upper bound of the th dataset constraint|
|Section 2, Equation 2||Ramp function:|
|,||Section 2, Equation 3||Positive and negative ramp-based rates|
|,||Section 2.1, Equation 4||Convex upper bounds on ramp functions|
|,||Section 2.1, Equation 5||Convex upper bounds on ramp-based rates|
|Section 2.2, Equation 7||SVM objective (for minimizing over and )|
|Section 2.2, Equation 6||Optimum of (for maximizing over )|
|Section 2.2||Lagrange multipliers associated with dataset constraints|
|Section 2.2, Algorithm 2||Set of allowed s|
|Section 2.2, Algorithm 2||Candidate solution at the th iteration|
|,||Section 2.2, Algorithm 2||Lower and upper bounds on|
|Section 2.2, Algorithm 2||Gradient of the cutting plane inserted at the th iteration|
|Section 2.2, Algorithm 2||Concave function upper-bounding|
|,||Section 2.2, Algorithm 2||Lower and upper bounds on|
|Appendix C||Maximum allowed :|
|,||Appendix C, Equation 10||Expected positive and negative indicator-based rates|
|Appendix E||Lebesgue measure|
|Appendix E.2, Equation 11||Superlevel set|
|Appendix E.2, Equation 12||Superlevel hypograph|
|Appendix F.1||Total size of datasets:|
|,||Appendix F.1, Equation 13||Coefficients defining the convex objective function|
|,||Appendix F.1, Equation 14||Coefficients defining the th convex dataset constraint|
|Appendix F.1, Equation 15||Given upper bound of the th convex dataset constraint|
|Appendix F.1, Equation 16||Loss of example in dataset , in the SVM objective|
|,||Appendix F.1, Equation 17||Coefficients defining the SVM objective function|
|Appendix F.1, Equation 18||Lipschitz constant of the s|
|Appendix F.1, Equation 19||SVM dual variables|
|Appendix F.1, Equation 19||SVM dual objective (for maximizing over )|
|Appendix F.2, Algorithm 3||Candidate solution at the th iteration|
|,||Appendix F.2, Algorithm 3||Lower and upper bounds on|
|Appendix F.2, Algorithm 3||Derivative of the cutting plane inserted at the th iteration|
|Appendix F.2, Algorithm 3||Convex function lower-bounding|
|,||Appendix F.2, Algorithm 3||Lower and upper bounds on|
Appendix A Randomized classification
The use of the ramp loss in Problem 2 can be interpreted in two ways, which are exactly equivalent at training time, but lead to the use of different classification rules at evaluation time.
This is the obvious interpretation: we would like to optimize Problem 1, but cannot do so because the indicator-based rates are discontinuous, so we approximate them with the ramp-based rates, and hope that this approximation doesn't cost us too much in terms of performance. The result is Problem 2. At evaluation time, on an example $x$, we make a positive prediction if the classification function is nonnegative, and a negative prediction otherwise.
In this interpretation (also used by Cotter et al., 2013), we reinterpret the ramp loss as the expected 0/1 loss suffered by a randomized classifier, with the result that the rates aren't being approximated at all—instead, we're using the indicator-based rates throughout, but randomizing the classifier and taking expectations to smooth out the discontinuities in the objective function. To be precise, at evaluation time, on an example $x$, we make a positive prediction with probability given by the ramp function of Equation 2 applied to the value of the classification function, and a negative prediction otherwise. Taking expectations of the indicator-based rates over the randomness of this classification rule yields the ramp-based rates, resulting, once again, in Problem 2.
This use of a randomized prediction isn’t as unfamiliar as it may at first seem: in logistic regression, the classifier provides probability estimates at evaluation time (with being a sigmoid instead of a ramp). Furthermore, at training time, the learned classifier is assumed to be randomized, so that the optimization problem can be interpreted as maximizing the data log-likelihood.
In the setting of this paper, the main advantages of the use of a randomized classification rule are that (i) we can say something about generalization performance (Appendix C), and (ii) because the rates are never being approximated, the dataset constraints will be satisfied tightly on the training dataset, in expectation (this is easily seen in the red curve in the top left plot of Figure 3). Despite these apparent advantages, deterministic classifiers seem to work better in practice.
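The two evaluation rules can be sketched as follows. This assumes a unit-slope ramp transitioning between -1/2 and 1/2, which may differ from the exact parameterization of Equation 2:

```python
import numpy as np

def ramp(z):
    """Ramp function: 0 below -1/2, 1 above 1/2, linear in between.
    One common parameterization; Equation 2 may use a different
    slope or offset."""
    return np.clip(z + 0.5, 0.0, 1.0)

def predict_deterministic(score):
    """Interpretation 1: threshold the classification score at zero."""
    return 1 if score >= 0 else 0

def predict_randomized(score, rng):
    """Interpretation 2: predict positive with probability ramp(score)."""
    return 1 if rng.random() < ramp(score) else 0

rng = np.random.default_rng(0)
scores = np.array([-2.0, -0.25, 0.0, 0.25, 2.0])
# The expected positive-prediction rate of the randomized rule equals
# the ramp-based rate, while the deterministic rule thresholds at zero.
print([predict_deterministic(s) for s in scores])  # [0, 0, 1, 1, 1]
print(ramp(scores))                                # [0.  0.25 0.5  0.75 1. ]
```

Averaging the randomized rule's predictions over many draws recovers the ramp-based rate, which is why the dataset constraints hold in expectation under this interpretation.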
Appendix B Ratio metrics
Problem 1 minimizes an objective function and imposes upper-bound constraints, all of which are written as linear combinations of positive and negative rates—we refer to these as “linear combination metrics”. Some metrics of interest, however, cannot be written in this form. One important subclass consists of the so-called “ratio metrics”, which are ratios of linear combinations of rates. Examples of ratio metrics are precision, F-score, win/loss ratio and win/change ratio (recall is a linear combination metric, since its denominator is a constant).
Ratio metrics may not be used directly in the objective function, but can be included in constraints by multiplying through by the denominator, then shifting the constraint coefficients to be non-negative. For example, the constraint that precision must be at least some target value can be expressed as follows:
where the last line uses the fact that the positive and negative rates sum to one—this is an example of a fact that we noted in Section 2: since positive and negative rates must sum to one, it is possible to write any linear combination of rates as a positive linear combination, plus a constant.
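As a sanity check of the multiply-through rewriting, the sketch below verifies on hypothetical true-positive and false-positive counts that the precision constraint and its linearized form agree (counts stand in for rates, which only rescales both sides by the dataset size):

```python
def precision_constraint_holds_direct(tp, fp, tau):
    """Precision constraint in ratio form: tp / (tp + fp) >= tau."""
    return tp / (tp + fp) >= tau

def precision_constraint_holds_linear(tp, fp, tau):
    """Same constraint after multiplying through by the positive
    denominator tp + fp:
        tp >= tau * (tp + fp)  <=>  (1 - tau) * tp - tau * fp >= 0
    """
    return (1 - tau) * tp - tau * fp >= 0

# The two forms agree on every (count, threshold) combination tried.
for tp, fp in [(90, 10), (70, 30), (50, 50)]:
    for tau in (0.6, 0.8, 0.95):
        assert (precision_constraint_holds_direct(tp, fp, tau)
                == precision_constraint_holds_linear(tp, fp, tau))
```

The linear form is what can be plugged into the rate-constrained framework, since it is a linear combination of rates with a constant right-hand side.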
Multiplying through by the denominator is fine for Problem 1, but a natural question is whether, by using a randomized classifier and optimizing Problem 2, we’re doing the “right thing” in expectation. The answer is: not quite. Since the expectation of a ratio is not the ratio of expectations, e.g. a precision constraint in our original problem (Problem 1) becomes only a constraint on a precision-like quantity (the ratio of the expectations of the precision’s numerator and denominator) in our relaxed problem.
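A two-outcome toy example makes the gap between the two quantities concrete; the numbers are arbitrary:

```python
# Two equally likely outcomes for (numerator, denominator)—think of the
# number of true positives and of positive predictions realized by a
# randomized classifier.
outcomes = [(1.0, 2.0), (3.0, 10.0)]

# Expectation of the ratio: mean of 1/2 and 3/10.
e_ratio = sum(n / d for n, d in outcomes) / len(outcomes)

# Ratio of the expectations: mean numerator over mean denominator.
ratio_e = (sum(n for n, _ in outcomes) / len(outcomes)) / \
          (sum(d for _, d in outcomes) / len(outcomes))

print(e_ratio)  # ≈ 0.4
print(ratio_e)  # ≈ 0.3333
```

So a constraint enforced on the ratio of expectations can differ from the expected value of the ratio itself, which is the distinction drawn above.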
Appendix C Generalization
In this appendix, we’ll provide generalization bounds for an algorithm that is nearly identical to Algorithm 1. The two differences are that (i) we assume that the optimizer used on line 3 will prefer smaller biases to larger ones, i.e. that if Problem 3 has multiple equivalent minima, then the optimizer will return one for which the magnitude of the bias is minimized, and (ii) that the Lagrange multipliers are upper-bounded by a parameter, i.e. that instead of optimizing Equation 6, line 3 of Algorithm 1 will optimize:
the difference being the upper bound on the Lagrange multipliers. If this bound is large enough that no multiplier is forced against it by a constraint, then it will have no effect on the solution. If, however, it is too small, then the solution might not satisfy the dataset constraints. Notice that Algorithm 2 assumes that the multipliers lie in a compact set—hence, for our proposed optimization procedure, the assumption is that they are bounded by this parameter.
With these assumptions in place, we’re ready to move on to defining a function class that contains any solution that could be found by our algorithm, and bounding its Rademacher complexity.
Define the function class to be the set of all linear functions with bounded weight vector and bias, where the bounds involve a uniform upper bound on the magnitudes of all training examples, and:
This class then contains all bias-minimizing optimal solutions of Equation 9, for any choice of multipliers and any training dataset.
Consider the objective function of Problem 3 and its constraints. Then it follows that:
Differentiating Equation 7 and setting the result equal to zero shows that any optimal solution must satisfy the stationarity KKT condition:
implying, by the triangle inequality, the claimed bound on the weight vector, with the constant as defined in the theorem statement.
Now let’s turn our attention to the bias. The above bound implies that, at any optimum, the hinge functions will be nondecreasing in the bias once it is sufficiently large. Problem 3 seeks to minimize a positive linear combination of such hinge functions subject to upper-bound constraints on positive linear combinations of such hinge functions, so our assumption that the optimizer used on line 3 of Algorithm 1 will always choose the smallest optimal bias gives the claimed bound. ∎
The Rademacher complexity of this function class is:
where the expectations are taken over the i.i.d. Rademacher random variables and the i.i.d. training sample, and the relevant constant is as in Lemma 1. Applying the Khintchine inequality and substituting the definition of the class yields the claimed bound. ∎
We can now apply the results of Bartlett and Mendelson to prove bounds on the generalization error. To this end, we assume that each of our training datasets is drawn i.i.d. from some underlying unknown distribution. We will bound the expected positive and negative prediction rates w.r.t. these distributions:
for a binary classification function.
Suppose that each training dataset is drawn i.i.d. from its corresponding underlying distribution. Then, with high probability over the training samples, uniformly over all pairs that are optimal solutions of Equation 9 under the assumptions listed at the start of this appendix, the expected rates will satisfy:
the above holding for every dataset, where:
Observe that the ramp rates are Lipschitz. Applying Theorems 8 and 12 (part 4) of Bartlett and Mendelson gives that each of the following inequalities holds with the required probability:
where the complexity term is as in Lemma 2. The union bound implies that all of these inequalities hold simultaneously with the claimed probability. The LHSs above are the expected ramp-based rates of a deterministic classifier but, as was explained in Appendix A, these are identical to the expected indicator-based rates of a randomized classifier, which is what is claimed. ∎
An immediate consequence of this result is that (with the stated probability) if the learned classifier suffers the training loss:
then the expected loss on previously-unseen data (drawn i.i.d. from the same distributions) will be upper-bounded by:
Likewise, if the learned classifier satisfies the constraint:
then the corresponding rate constraint on previously-unseen data will be violated by no more than:
in expectation, where, here and above, the complexity term is as in Theorem 1.
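The exact constants depend on the quantities defined in Theorem 1, but Rademacher-complexity-based bounds of this kind contain a confidence term shrinking with the sample size. A generic illustration (the function below has the standard shape of such a deviation term, and is not the paper’s exact expression):

```python
import math

def confidence_term(n, delta):
    """Generic deviation term of the form sqrt(log(1/delta) / (2n)),
    the shape appearing in Rademacher-complexity-based bounds; the
    exact constants in Theorem 1 may differ."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))

# The expected constraint violation shrinks as the training datasets grow.
for n in (100, 10_000, 1_000_000):
    print(n, round(confidence_term(n, 0.05), 4))
```

This is why larger constraint datasets yield tighter expected satisfaction of the rate constraints on unseen data.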
Appendix D Fairness constraints of Zafar et al. 
The constraints of Zafar et al. can be interpreted as a relaxation of the fairness constraint under the linear approximation
and solving the hinge-constrained optimization problem described in Problem 3. Going further, we could implement these constraints as egregious examples using the constraint:
permitting us to perform an analogue of their approximations in ramp form.
Appendix E Cutting plane algorithm
We’ll now discuss some variants of Algorithm 2. We assume that the function we wish to maximize satisfies the following conditions:
its domain is compact and convex;
it has a (not necessarily unique) maximizer.
We’re primarily interested in proving convergence rates, and will do so in Appendix E.2. With that said, there is one easy-to-implement variant of Algorithm 2 for which we have not proved a convergence rate, but that we use in some of our experiments due to its simplicity:
(Maximization-based Algorithm 2) CutChooser chooses the maximizer of the current piecewise-linear upper bound, along with the corresponding upper-bound value.
Observe that this choice can be found at the same time as the upper bound is computed, since both result from optimizing the same linear program. However, despite the ease of implementing this variant, we have not proved any convergence rates for it.
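As a sketch of the general shape of such a variant (not the paper’s Algorithm 2 itself), the following hypothetical helper performs Kelley-style cutting-plane maximization of a one-dimensional concave function, choosing each new candidate as the maximizer of the current piecewise-linear upper bound:

```python
import numpy as np

def cutting_plane_maximize(g, grad_g, lo, hi, iters=30):
    """Kelley-style cutting-plane maximization of a concave g on [lo, hi].

    At each step we evaluate g and a (super)gradient at the maximizer of
    the current piecewise-linear upper bound. Illustrative helper only;
    the paper's CutChooser operates inside a larger linear program.
    """
    cuts = []  # each cut: g(x) <= value + slope * (x - point)
    x = 0.5 * (lo + hi)
    best = -np.inf
    for _ in range(iters):
        v, s = g(x), grad_g(x)
        best = max(best, v)
        cuts.append((v, s, x))
        # Maximize the upper-bound model over a fine grid (crude but simple).
        grid = np.linspace(lo, hi, 1001)
        ub = np.min([cv + cs * (grid - cp) for cv, cs, cp in cuts], axis=0)
        x = grid[int(np.argmax(ub))]
    return best

# Example: g(x) = -(x - 1)^2 + 3, maximized at x = 1 with value 3.
val = cutting_plane_maximize(lambda x: -(x - 1) ** 2 + 3,
                             lambda x: -2 * (x - 1), -5.0, 5.0)
print(round(val, 3))  # close to 3.0
```

Each cut is a valid upper bound on a concave function, so the model’s maximum never underestimates the true maximum, and the evaluated values approach it from below.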
E.2 Center of mass-based
We’ll now discuss a variant of Algorithm 2 that chooses its next candidate based on the center of mass of the “superlevel hypograph” determined by the current upper-bound model and lower bound, which we define as the intersection of the hypograph of the model (the set of points lying on or below its graph) and the half-space of points lying on or above the lower bound. Notice that, in the context of Algorithm 2, this superlevel hypograph corresponds to the set of pairs of candidate maximizers and their possible function values at the current iteration. Because this variant is based on finding a cut center in the hypograph, rather than in a level set (which is arguably more typical), it is an instance of what Boyd and Vandenberghe call an “epigraph cutting plane method”.
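To make the geometry concrete, here is a hypothetical rejection-sampling sketch that estimates the center of mass of such a superlevel hypograph for a one-dimensional concave function; the Monte Carlo approach and all names are illustrative only, not the paper’s actual procedure:

```python
import numpy as np

def hypograph_center_of_mass(g, lo, hi, level, n_samples=200_000, seed=0):
    """Monte Carlo estimate of the center of mass of the region
    {(x, z) : level <= z <= g(x), lo <= x <= hi} for a concave g.
    Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, n_samples)
    gx = g(x)
    z = rng.uniform(level, gx.max(), n_samples)
    inside = z <= gx  # keep only samples under the graph of g
    return x[inside].mean(), z[inside].mean()

# Example: g(x) = 1 - x^2 on [-1, 1] with level 0. By symmetry the
# center of mass sits at x = 0, and integration gives z = 2/5.
cx, cz = hypograph_center_of_mass(lambda x: 1 - x**2, -1.0, 1.0, 0.0)
print(round(cx, 2), round(cz, 2))  # approximately 0.0 and 0.4
```

Cutting through the center of mass of this region is what guarantees that each cut removes a constant fraction of its volume, which is the usual motivation for center-of-mass cutting-plane schemes.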
Throughout this section, we will take