# Metric-Optimized Example Weights

###### Abstract

Real-world machine learning applications often have complex test metrics, and may have training and test data that follow different distributions. We propose addressing these issues by using a weighted loss function with a standard convex loss, but with weights on the training examples that are learned to optimize the test metric of interest on the validation set. These metric-optimized example weights can be learned for any test metric, including black box losses and customized metrics for specific applications. We illustrate the performance of our proposal with public benchmark datasets and real-world applications with domain shift and custom loss functions that balance multiple objectives, impose fairness policies, and are non-convex and non-decomposable.

## 1 Introduction

In machine learning applications, each training example is usually weighted equally when training. Such uniform training example weights deliver satisfactory performance when the noise is homoscedastic, training examples follow the same distribution as the test data, and when the training loss matches the test metric. However, these requirements are often violated in real-world applications. Furthermore, the training loss does not always correlate sufficiently with the test metric we care about, such as revenue impact or some fairness metrics. This discrepancy between the loss and metric can lead to inferior test performance [1, 2, 3].

There are many proposals to address some of these issues, as we review in Section 2, but none of them address all of them at once. In this paper, we propose a framework to learn a weighting function on examples, which estimates training example weights that will produce a machine learned model with optimal test metrics. Our proposal, Metric-Optimized Example Weights (MOEW), solves the aforementioned issues simultaneously and is suitable for any standard or customized test metrics. MOEW can be interpreted as learning a transformation of the loss function to better mimic the test metric. To use our proposal, one needs a small set of labeled validation examples that are ideally distributed the same as the test examples. While getting a large number of labeled test examples is sometimes prohibitive, labeling a small amount of them is usually feasible, e.g., through human evaluation.

As an illustrative example, Figure 1 shows a simulated non-IID toy dataset and our learned example weighting. The ground truth decision boundary is the diagonal line from the upper left to lower right corners. The feature values in the training and validation/test data follow a beta distribution: specifically, in the training data, , whereas in the validation/test data, . The goal is to maximize precision at 95% recall on the test distribution. With uniform example weights during training, we get 20.8% precision at 95% recall on the test data. With importance weighting, we get 21.8% precision at 95% recall. With MOEW, we obtain 23.2% precision at 95% recall (the Bayes classifier achieves around 25.0% precision at 95% recall). Comparing Figures 1(c) and 1(d), one can see that MOEW learned to upweight the negative training examples, and to upweight examples closer to the center.
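The metric in this toy example, precision at a fixed recall, can be computed from ranked scores. The sketch below is illustrative only: the function name `precision_at_recall` and the handling of ties (ignored) are assumptions, not details from the paper.

```python
def precision_at_recall(scores, labels, target_recall=0.95):
    """Precision at the smallest score cutoff that reaches the target recall.

    scores: predicted scores (higher means more likely positive).
    labels: 0/1 ground-truth labels.
    """
    # Rank examples from highest to lowest score (ties are broken arbitrarily).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(labels)
    true_pos = 0
    for rank, i in enumerate(order, start=1):
        true_pos += labels[i]
        # Stop at the first cutoff whose recall meets the target.
        if true_pos >= target_recall * total_pos:
            return true_pos / rank
    return 0.0
```

Because the cutoff depends on the whole ranking, this metric does not decompose into a sum of per-example losses, which is why it is awkward to optimize directly.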

## 2 Related Work

Our proposal simultaneously addresses three potential issues: non-constant noise levels across training examples, non-identical distributions between training and test examples (aka covariate shift or domain shift), and training and test objectives mismatch. Several existing methods summarized below address some of these issues, but none of them provide a method to address all of them at once.

Maximum likelihood based inference is the classical approach to address the issue with non-constant noise levels in the training data. For example, maximizing Gaussian likelihood is equivalent to minimizing squared loss with training weights that are inversely proportional to the variance of the training examples. In the case with over-dispersed binary data, beta-binomial models are often used. In addition, several common machine learning techniques, such as ensemble and graph-based methods, are found to be robust to binary label noise; see [4] for a survey of these methods.
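To make the Gaussian case concrete, here is a short derivation. The symbols $\sigma_j^2$ (a known per-example noise variance) and $h(x_j; \theta)$ (the model prediction) are notation assumed for this sketch.

$$
-\log \prod_{j} \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\!\left(-\frac{(y_j - h(x_j;\theta))^2}{2\sigma_j^2}\right)
= \sum_{j} \frac{1}{2\sigma_j^2}\,\bigl(y_j - h(x_j;\theta)\bigr)^2 + \frac{1}{2}\sum_{j} \log\bigl(2\pi\sigma_j^2\bigr)
$$

The last sum does not depend on $\theta$, so maximizing the likelihood is the same as minimizing a squared loss in which example $j$ has weight proportional to $1/\sigma_j^2$.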

To address the issue of covariate shift, classical approaches include propensity score matching [5, 6, 7] or importance weighting [8, 9, 10, 11]. These techniques can also be adapted for model selection tasks [12]. In addition, [13] proposed a discriminative approach of learning under covariate shift.

To address the issue of loss/metric mismatch, researchers have formulated several plug-in approaches and surrogate loss functions. These approaches can be used to optimize some non-decomposable ranking based metrics, such as AUC [14, 15, 1, 16, 17, 18, 19], F score [20, 21, 22], and other ranking metrics [20, 23, 24, 25, 26, 27, 28, 29]. Our work differs in that the test metric can be a black box, and we adapt for it by learning a corresponding weighting function over the examples.

## 3 Metric Adaptive Weight Optimization

We propose learning an example weighting function, trained on validation scores for different example weightings. Our proposal allows covariate shift between training and validation/test examples, and is suitable for any customized metrics.

### 3.1 Overview

Let $\mathcal{T}$ and $\mathcal{V}$ denote the sets of training and validation examples, respectively. We assume $\mathcal{V}$ is drawn IID with the test set, but that $\mathcal{T}$ and $\mathcal{V}$ may not be IID. Consider a classifier or regressor, $h(x; \theta)$ for a feature vector $x$, which is parameterized by $\theta$. Let $y$ be the label of the data, and $\hat{y} = h(x; \theta)$ be the predicted score. The optimal $\theta^*$ is obtained by optimizing a weighted loss function $L$:

$$\theta^*(\alpha) = \arg\min_{\theta} \sum_{j \in \mathcal{T}} w(x_j, y_j; \alpha)\, L\bigl(h(x_j; \theta), y_j\bigr) \tag{1}$$

where $\alpha$ parameterizes the example weighting function $w$. Our goal is to learn an optimal example weighting function $w$ (detailed below). Given the learned weighting function $w$, the model is trained by solving (1) using usual training methods such as SGD.
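As a minimal sketch of solving the weighted problem in (1), the code below fits a logistic regression by gradient descent on a per-example weighted log loss. The model class, the function names, the step size, and the normalization of the weights are all assumptions made for illustration; the paper does not prescribe them.

```python
import numpy as np

def train_weighted_logreg(X, y, weights, lr=0.5, steps=500):
    """Minimize sum_j weights[j] * logloss(h(x_j; theta), y_j) by gradient descent."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # rescale so the step size does not depend on the weight scale
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    theta = np.zeros(Xb.shape[1])
    for _ in range(steps):
        scores = 1.0 / (1.0 + np.exp(-Xb @ theta))
        grad = Xb.T @ (w * (scores - y))  # gradient of the weighted log loss
        theta -= lr * grad
    return theta

def predict_scores(X, theta):
    """Predicted probabilities for the fitted model."""
    Xb = np.hstack([np.asarray(X, dtype=float), np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ theta))
```

Changing only `weights` changes which examples the fitted model favors, which is the lever the weighting function pulls.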

Let $M_{\mathcal{V}}(\theta)$ be the metric of interest evaluated for the model $h$ with parameters $\theta$ on the validation dataset $\mathcal{V}$. Without loss of generality, we assume a larger validation metric is more desirable. We propose learning the example weighting function $w$, such that

$$\alpha^* = \arg\max_{\alpha} M_{\mathcal{V}}\bigl(\theta^*(\alpha)\bigr) \tag{2}$$

That is, we propose finding the optimal parameters $\alpha^*$ for the example weighting function such that the model trained with this example weighting function achieves the best validation score. To simplify the notation, we note that in this paper, the parameters $\theta^*$ always depend on the weight parameters $\alpha$ and henceforth we simplify the notation $M_{\mathcal{V}}(\theta^*(\alpha))$ to $M_{\mathcal{V}}(\alpha)$.

The validation metric $M_{\mathcal{V}}$ as a function of the model parameters $\theta$ and the weight parameters $\alpha$ is likely non-convex and non-differentiable, which makes it hard to optimize directly through, e.g., SGD. Instead, we adopt an iterative algorithm to optimize for $\alpha$ and $\theta$, which is detailed in Algorithm 1. Specifically, we start with a random sample of $B$ choices of weight parameters, $\alpha_1, \dots, \alpha_B$. For each of the randomly generated weight parameters $\alpha_b$, $b = 1, \dots, B$, we solve (1) to obtain the corresponding model parameters $\theta^*(\alpha_b)$, and use those to compute the corresponding validation metrics, $M_{\mathcal{V}}(\alpha_b)$. Note that this step can be parallelized. Then, based on the batch of weight parameters and validation metrics, $\{(\alpha_b, M_{\mathcal{V}}(\alpha_b))\}_{b=1}^{B}$, we determine a new set of weight parameter candidates, and repeat this process for a pre-specified number of iterations, $K$. At the end, we choose the candidate $\alpha$ that produced the best validation metric.
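The outer loop just described can be sketched as follows. This is an illustration under stated assumptions, not the paper's Algorithm 1: `train_fn`, `metric_fn`, and `get_candidates` are hypothetical callables, the initial batch is drawn uniformly from a cube, and `perturb_best` is a deliberately simple stand-in for the candidate-generation subroutine.

```python
import random

def moew_search(train_fn, metric_fn, get_candidates, num_batches, batch_size, dim, seed=0):
    """Batched search over weight parameters; returns (best_alpha, best_metric, history)."""
    rng = random.Random(seed)
    # Initial batch: random weight parameters, here uniform in [-1, 1]^dim (an assumption).
    batch = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(batch_size)]
    history = []  # (alpha, validation metric) pairs from all batches so far
    for k in range(num_batches):
        for alpha in batch:  # these trainings are independent and could run in parallel
            theta = train_fn(alpha)  # solve the weighted training problem for this alpha
            history.append((alpha, metric_fn(theta)))  # score it on the validation set
        if k + 1 < num_batches:
            batch = get_candidates(history, batch_size, rng)
    best_alpha, best_metric = max(history, key=lambda pair: pair[1])
    return best_alpha, best_metric, history

def perturb_best(history, batch_size, rng, scale=0.2):
    """Stand-in candidate generator: Gaussian perturbations of the best alpha so far."""
    best_alpha, _ = max(history, key=lambda pair: pair[1])
    return [[a + rng.gauss(0.0, scale) for a in best_alpha] for _ in range(batch_size)]
```

Any derivative-free proposal rule can be plugged in as `get_candidates`, since the loop only needs validation scores and never a gradient of the metric.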

In the following subsections, we describe the function class of the weight model, and the subroutine used in Algorithm 1: GetCandidateAlphas.

### 3.2 Function Class for the Example Weighting Model

Recall that the final optimal $\alpha$ is taken to be the best out of all sampled candidate $\alpha$'s. To ensure a sufficient coverage of the weight parameter space for a small number of validation samples, we found it best to use a function class with a small number of parameters.

There are arguably many reasonable strategies for defining $w$. We chose the functional form:

$$w(x, y; \alpha) = c\, \sigma(\alpha^\top z)\, p(x) \tag{3}$$

where $c$ is a constant that normalizes the weights over (a batch from) the training set $\mathcal{T}$, and $\sigma(\alpha^\top z)$ is a sigmoid transformation of a linear function of a low-dimensional embedding $z$ of $(x, y)$. We use the standard importance function $p(x) = p_{\mathcal{V}}(x) / p_{\mathcal{T}}(x)$, where $p_{\mathcal{V}}$ and $p_{\mathcal{T}}$ denote the probability density function of $x$ in the validation and training data, respectively. Note that we can use more generic propensity models on the pair $(x, y)$ as well. Methods for estimating the importance function are summarized in Section 2.
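A minimal sketch of the weighting function in (3): a sigmoid of a linear function of the embedding, multiplied by the importance ratio and then normalized. The function name, the argument layout, and the choice of normalizing the weights to average one over the batch are assumptions for illustration.

```python
import numpy as np

def example_weights(z, alpha, importance=None):
    """Weights proportional to sigmoid(alpha . z) * importance, normalized to mean 1.

    z: (n, d) array of low-dimensional embeddings of the (x, y) pairs.
    alpha: (d,) weight-function parameters.
    importance: optional (n,) density ratios; defaults to all ones.
    """
    z = np.asarray(z, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    gate = 1.0 / (1.0 + np.exp(-(z @ alpha)))  # sigmoid of the linear function
    if importance is None:
        importance = np.ones(len(z))
    raw = gate * np.asarray(importance, dtype=float)
    return raw * len(raw) / raw.sum()  # normalizing constant: weights average to 1
```

With `alpha` set to zero and no importance ratios, every example gets weight one, so uniform weighting is a special case of this function class.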

While there are many possible ways to form a low-dimensional embedding $z$ of $(x, y)$, we choose to use an autoencoder [30] to form an embedding, as it can be applied to a wide range of problems. Specifically, to train the autoencoder, we minimize the weighted sum of the reconstruction loss for $x$ and $y$:

$$\sum_{j \in \mathcal{T}} \Bigl[\gamma\, L_x(\tilde{x}_j, x_j) + (1 - \gamma)\, L_y(\tilde{y}_j, y_j)\Bigr] \tag{4}$$

where $\tilde{x}_j$ and $\tilde{y}_j$ are the autoencoder's reconstructions of $x_j$ and $y_j$, $L_x$ is an appropriate loss for the feature vector $x$, and $L_y$ is an appropriate loss for the label $y$. The hyperparameter $\gamma$ in (4) is used to adjust the relative importance of features and the label in the embedding: using $\gamma = 0$ is similar to weighting based solely on the value of the label. For all the experiments in this paper we used a fixed value of $\gamma$, but it may need tuning for some problems.
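The combined reconstruction objective can be sketched for a single example as below. The convention that `gamma` multiplies the feature term and `1 - gamma` the label term is an assumption of this sketch, as are the function names and the specific losses (mean squared error for features, cross entropy for a one-hot label).

```python
import math

def mse(reconstruction, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(reconstruction, target)) / len(target)

def cross_entropy(probs, onehot, eps=1e-12):
    """Cross entropy of predicted class probabilities against a one-hot label."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(probs, onehot))

def reconstruction_loss(x, y, x_rec, y_rec, gamma):
    """gamma * feature loss + (1 - gamma) * label loss (assumed convention)."""
    return gamma * mse(x_rec, x) + (1.0 - gamma) * cross_entropy(y_rec, y)
```

Under this convention, `gamma = 0` trains the embedding to encode only the label, and `gamma = 1` only the features.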

The inclusion of the importance function in (3) may seem unnecessary – in the ideal setting, the embedding should contain enough information for us to recover the importance. However, due to limited flexibility of the weight model and our desire to learn the weight model with a small number of sample weightings, we experimentally found it useful to explicitly include the importance function in our formulation.

### 3.3 Global Optimization of Weight Function Parameters

The validation metric may have multiple maxima as a function of the example-weighting parameters $\alpha$. In order to find the maximum validation score, the sampled candidate $\alpha$'s should achieve two goals. First, they should cover the weight parameter space with a sufficiently fine resolution to discover all peaks. In addition, we also need a large number of candidate $\alpha$'s sampled near the most promising peaks. In other words, there is an exploration (spread out candidate $\alpha$'s more evenly) and exploitation (make candidate $\alpha$'s closer to the local optima) trade-off when choosing candidate $\alpha$'s.

One can treat this as a global optimization problem and sample candidate $\alpha$'s with a derivative-free optimization algorithm [31], such as simulated annealing [32], particle swarm optimization [33, 34], and differential evolution [35]. We chose to base our algorithm on Gaussian process regression (GPR), specifically on the Gaussian Process Upper-Confidence-Bound (GP-UCB) [36, 37] adapted to batched sampling of parameters [38, 39].

As detailed in Algorithm 2, after getting a batch of candidate $\alpha$'s and their corresponding validation metrics, we build a GPR model to fit the validation metrics on $\alpha$ for all previous observations. The next batch of candidate $\alpha$'s is then selected sequentially: we first sample an $\alpha$ based on the upper bound of a prediction interval of the validation metric, where the coverage level of the interval is a hyperparameter. A larger value of this hyperparameter encourages exploration (we sample from regions where we are uncertain), whereas a smaller value encourages exploitation (we sample from regions predicted to generate good validation metrics). After this $\alpha$ is sampled, we refit a GPR model with an added observation for it, as if we had observed a validation metric equal to the lower bound of a prediction interval of the validation metric at that $\alpha$. A second hyperparameter, the coverage level of this interval, controls how much the refitted GPR model trusts the old GPR model. Using a larger value of it encourages wider exploration within each batch. We then use the refitted GPR model to generate another candidate $\alpha$, and continue this process until all the candidate $\alpha$'s in the next batch are generated. Note that to ensure convergence, in practice, we usually generate candidate $\alpha$'s within a bounded domain.
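The sequential batch construction can be sketched with a small NumPy-only Gaussian process. This is an illustration under several assumptions rather than the paper's Algorithm 2: the kernel is a fixed unit-variance RBF, the two prediction-interval levels are replaced by standard-deviation multipliers `z_explore` and `z_trust`, and candidates are chosen from a random pool inside a ball instead of by continuous optimization.

```python
import numpy as np

def rbf(A, B, length=1.0):
    """Unit-variance RBF kernel between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length ** 2)

def gp_posterior(X, y, Xq, noise=1e-4):
    """Posterior mean and standard deviation of a GP at the query points Xq."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(Xq, X)
    offset = y.mean()  # constant prior mean set to the average observation
    mean = offset + Ks @ np.linalg.solve(K, y - offset)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def next_batch(X, y, batch_size, radius, z_explore=2.0, z_trust=1.0, n_pool=2000, seed=0):
    """Pick a batch by repeated UCB selection with pessimistic pseudo-observations."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    dim = X.shape[1]
    # Candidate pool drawn uniformly from a ball of the given radius.
    dirs = rng.normal(size=(n_pool, dim))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    pool = dirs * radius * rng.uniform(size=(n_pool, 1)) ** (1.0 / dim)
    available = np.ones(n_pool, dtype=bool)
    batch = []
    for _ in range(batch_size):
        mean, std = gp_posterior(X, y, pool)
        ucb = np.where(available, mean + z_explore * std, -np.inf)
        pick = int(np.argmax(ucb))  # highest upper confidence bound
        available[pick] = False
        batch.append(pool[pick])
        # Refit as if we had observed the lower bound at the picked point.
        X = np.vstack([X, pool[pick]])
        y = np.append(y, mean[pick] - z_trust * std[pick])
    return np.array(batch)
```

The pseudo-observation lowers the predicted metric around each picked point, so later picks in the same batch are pushed elsewhere; a larger `z_trust` strengthens that effect.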

With GPR models, we can easily control the balance of exploration and exploitation through the two prediction-interval hyperparameters. We can also stop the candidate generation process early if the width of the prediction interval throughout the bounded space is smaller than a certain threshold.

Figure 2 shows 200 sampled candidate $\alpha$'s in batches in the example shown in Section 4.5. It shows that at the beginning of the candidate sampling process (e.g., Figures 2(a)-2(d)), GPR explores more evenly across the domain. As the sampling process continues, GPR begins to exploit more heavily near the optimal $\alpha$ in the lower right corner.

## 4 Experimental Results

In this section, we illustrate the value of our proposal by comparing it to uniform and importance weightings on a diverse set of example problems. For our proposal, we first create a $d$-dimensional embedding of training $(x, y)$ pairs by training an autoencoder that has $d$ nodes in one of the hidden layers. We sample candidate $\alpha$'s limited to a $d$-dimensional ball of radius $R$, in $K$ rounds, each of $B$ samples. For a fair comparison, for uniform and importance weights, we also train the same number of models ($K \times B$), with fixed example weighting but random initialization, and pick the model with the best validation metric. To mitigate the randomness in the sampling and optimization processes, we repeat the whole process 100 times and report the average.

### 4.1 MNIST Handwritten Digits

If the model is flexible enough, and the training data clean and sufficient to fit a perfectly accurate model for all parts of the feature space, then the proposed MOEW is not needed. MOEW will be most valuable when the model must take some trade-off, and we can sculpt the loss surface through example weighting in a way that best benefits the test metric. We illustrate this effect with this experiment, by studying the efficacy of our proposed method as a function of the model complexity for a fixed size train set. In order to have a controlled experiment, we take a problem that is essentially solved, the MNIST handwritten digit database [40], and train on it with classifiers of varying complexity.

We used training/validation/test split of sizes 55k/5k/10k respectively. Features were greyscale pixels and the label was one-hot encoded in $\{0, 1\}^{10}$. To learn the example weights, we used a 5-dimensional embedding (the overall results are similar with embedding dimensions 3 to 8; we chose 5 based on initial experiments), created by training a sigmoid-activation autoencoder network on the $(x, y)$ pair, with nodes in each layer, and took the activation of the middle layer as the embedding. We used mean squared error for reconstruction of $x$ and cross entropy loss for reconstruction of $y$. The actual classifier was a sigmoid-activation network with nodes in each layer. We ran our analysis with varying values for the number of hidden units. We used batches of $\alpha$'s in our proposed method and compared against the best-of-300 uniform weighted models (smaller batch sizes result in better exploration, but the runtime can be longer with less parallelism). The error metric was taken to be the maximum of the error rates for each digit: $\max_{i \in \{0, \dots, 9\}} P(\hat{y} \neq i \mid y = i)$.
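The worst-digit error metric can be sketched as follows; the function name and the decision to skip classes that are absent from the evaluation set are assumptions.

```python
def max_per_class_error(y_true, y_pred, classes=range(10)):
    """Largest per-class error rate: max over classes c of P(pred != c | true == c)."""
    worst = 0.0
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        if not idx:
            continue  # class absent from this evaluation set
        err = sum(1 for i in idx if y_pred[i] != c) / len(idx)
        worst = max(worst, err)
    return worst
```

Because the metric is a maximum over classes, improving it means trading accuracy on easy digits for accuracy on the single hardest one.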

Figure 3(a) shows the error metric calculated on the testing dataset for increasing model complexities, each averaged over 100 runs. We observe that our proposal clearly outperforms uniform weighting for models of limited complexity. The benefit is smaller for models that are more accurate on the dataset. In most real-world situations, the model complexity is limited either by practical constraints on computational complexity of training/evaluations, or otherwise to avoid overfitting to smaller training datasets. In such cases there might be an inherent trade-off in the learning process (e.g., how much of the model's capacity to allocate to digit 3's that look like 8's vs. 4's that look like 7's), and we expect our proposed method to apply such a trade-off through MOEW.

### 4.2 Wine Price

In this experiment, we study the choice of the embedding dimension for the MOEW algorithm. We use the Wine reviews dataset from www.kaggle.com/zynicide/wine-reviews. The task is to predict the price of the wine using 39 Boolean characteristic features describing the wine and the quality score (points), for a total of 40 features. We calculate the error in percentage of the correct price, and want the model to have good accuracy across all price ranges. To that end, we set the test metric to be the worst of the errors among 4 quantiles of the price (thresholds are 17, 25 and 42 dollars): $\max_{i \in \{1, \dots, 4\}} \mathbb{E}\bigl[\,|\hat{y} - y| / y \;\big|\; y \text{ in price quantile } i\,\bigr]$.
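A sketch of this worst-bucket relative error follows. The function name and the boundary handling (a price equal to a threshold falls in the lower bucket) are assumptions; the thresholds are the ones quoted above.

```python
def worst_bucket_relative_error(prices, preds, thresholds=(17, 25, 42)):
    """Largest mean relative error |pred - price| / price over the price buckets."""
    buckets = [[] for _ in range(len(thresholds) + 1)]
    for price, pred in zip(prices, preds):
        b = sum(price > t for t in thresholds)  # index of the bucket for this price
        buckets[b].append(abs(pred - price) / price)
    # Empty buckets are skipped rather than counted as zero error.
    return max(sum(b) / len(b) for b in buckets if b)
```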

We used training/validation/test split of sizes 85k/12k/24k respectively. Because the test metric is normalized by the price, and because the difference in the log space is the log of the ratio, $\log \hat{y} - \log y = \log(\hat{y} / y)$, we apply a log transformation to the label (price) and use mean squared error for the training loss on the log-transformed prices (for all weightings). For MOEW, we illustrate the effect of using a $d$-dimensional embedding for varying $d$, created by training a sigmoid-activation autoencoder network on the $(x, y)$ pair, with nodes in each layer, where we took the activation of the middle layer as the embedding. We used mean squared error for the autoencoder reconstruction of $x$ and $y$. The actual regressor was a sigmoid-activation network with nodes in each layer. We used batches of $\alpha$'s in our proposed method and compared against the best-of-200 uniform weighted models.

The model trained with uniform weights had an average test error metric of . The MOEW method resulted in significantly better test error metrics (i.e., more uniform accuracy across all price ranges). The results with different choices of the embedding dimension, shown in Figure 3(b), suggest that we can effectively learn a weighting function in low-dimensional spaces. For larger embedding dimensions, we might need to use more $\alpha$'s or a larger validation set to achieve further gains.

### 4.3 Community Crime Rate

In this example, we examine the performance of our proposal with a very small dataset and a complicated testing metric. We use the Communities and Crime dataset from the UCI Machine Learning Repository [41], which contains the violent crime rate of 994/500/500 communities for training/evaluation/testing. The goal is to predict whether a community has a violent crime rate per 100K population above 0.28.

In addition to obtaining an accurate classifier, we also aim to improve its fairness. To this end, we divided the communities into 4 groups based on the quantiles of the percentage of white population in each community (thresholds are 63%, 84% and 94%). We seek a classifier with good accuracy, and at the same time have similar false positive rates (FPR) across racial groups. Therefore, we evaluate classifiers based on two metrics: overall accuracy across all communities and the difference between the highest and lowest FPR across four racial groups (fairness violation).
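The fairness-violation metric, the gap between the highest and lowest group false positive rates, can be sketched as below. The function name and the choice to skip groups with no negative examples are assumptions.

```python
def fpr_gap(y_true, y_pred, groups):
    """Difference between the highest and lowest false positive rate across groups."""
    rates = {}
    for g in set(groups):
        # Predictions for the true negatives belonging to group g.
        negatives = [p for t, p, gg in zip(y_true, y_pred, groups) if gg == g and t == 0]
        if negatives:
            rates[g] = sum(negatives) / len(negatives)
    return max(rates.values()) - min(rates.values())
```

Like the other metrics in this section, the gap depends on group-level aggregates and so cannot be written as a sum of per-example losses.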

We used a linear classifier with 95 features, including the percentage of African American, Asian, Hispanic and white population. For MOEW, those 95 features plus the binary label were projected onto a 4-dimensional space using an autoencoder with nodes. We sampled candidate $\alpha$'s in 10 batches of size 5, and compared our proposal against the best-of-50 uniform weighted models.

In practice, there is usually a trade-off between accuracy and fairness of classifiers (see, e.g., [42, 43]). To explore this trade-off, we considered two approaches. With the first (vanilla) approach, we used MOEW to minimize the difference between the highest and lowest FPR across the 4 groups (i.e., minimize fairness violation), with identical decision thresholds for all groups. With the second approach, after training the model, we set a different decision threshold for each racial group to achieve the same FPR on the training data while maintaining overall fixed coverage (post-shifting) [42], and used our proposal to maximize the overall final accuracy.

The results are summarized in Table 1. With the first approach, MOEW reduces the fairness violation by over 20% and yet achieves the same overall accuracy compared to uniform weighting. With the second approach, MOEW improves both accuracy and reduces the fairness violation.

| | Accuracy (Uniform) | Accuracy (MOEW) | Fairness Violation (Uniform) | Fairness Violation (MOEW) |
|---|---|---|---|---|
| Approach 1 | | | | |
| Approach 2 | | | | |

### 4.4 Spam Blocking

For this problem from a large internet services company, the goal is to classify whether a result is spam, and this decision affects whether the result provider receives ads revenue from the company. Thus, it is more important to block more expensive spam results, but it is also important not to block any results that are not spam, especially results with many impressions. We use a simplified test metric that captures these different goals (the actual metric is more complex and proprietary). Specifically, for each method we set the classifier decision threshold so that 5% of the test set is labelled as spam. We then sum the costs saved by blocking correctly identified spam results and divide it by the total number of blocked impressions of incorrectly-identified spam results.
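The simplified metric described above can be sketched as follows. The function and argument names are hypothetical, and returning infinity when no non-spam result is blocked is an assumption made to keep the ratio defined.

```python
def spam_metric(scores, is_spam, cost, impressions, block_fraction=0.05):
    """Cost saved on correctly blocked spam per impression of wrongly blocked results."""
    n_block = max(1, round(block_fraction * len(scores)))
    # Block the highest-scoring fraction of results.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    blocked = order[:n_block]
    saved = sum(cost[i] for i in blocked if is_spam[i])
    wrongly_blocked = sum(impressions[i] for i in blocked if not is_spam[i])
    return saved / wrongly_blocked if wrongly_blocked else float("inf")
```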

The datasets contain 12 features. The 180k training dataset is 25% spam, and is not IID with the validation/test datasets, which are IID and have 10k/30k examples respectively with 5% spam.

We trained an autoencoder on the 12 features plus label, with layers of nodes, and used the middle layer as a 3-dimensional embedding. For each weighting method, we built a sigmoid-activation network classifier with architecture . Candidate $\alpha$'s were sampled at a time in rounds of sampling.

Table 2 compares the two non-uniform example weighting methods to the uniform test metric, where we have normalized the reported scores so that the average uniformly weighted test metric is 1.0. Our proposed method clearly outperforms both uniform and importance weightings.

### 4.5 Web Page Quality

This binary classifier example is from a large internet services company. The goal is to identify high quality webpages, and the classifier threshold is set such that 40% of examples are labelled as high quality. The company performed several rounds of human ratings of example web pages. The label generated in the early rounds was a binary label (high/low quality). Then in later rounds, the human raters provided a small number of examples with a finer-grained label, scoring the quality in . The test metric for this problem is the average numeric score of the positively classified test examples.
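This test metric, the average fine-grained score of the examples the classifier selects, can be sketched as below. The function name and the rounding used to turn the 40% selection rate into a count are assumptions.

```python
def mean_score_of_selected(pred_scores, quality, select_fraction=0.4):
    """Average fine-grained quality score of the top-scoring fraction of examples."""
    n_select = max(1, round(select_fraction * len(pred_scores)))
    # Select the examples the classifier ranks highest.
    order = sorted(range(len(pred_scores)), key=lambda i: pred_scores[i], reverse=True)
    selected = order[:n_select]
    return sum(quality[i] for i in selected) / n_select
```

The classifier is trained on binary labels but scored on the fine-grained ratings, which is the mismatch between training loss and test metric that the example weights are meant to absorb.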

On the 62k training web pages that have the binary labels, we trained a six-feature sigmoid-activation classifier with nodes. To learn the proposed MOEW, we trained an autoencoder that mapped the six features plus label onto a 3-dimensional space: . We sampled candidate $\alpha$'s for rounds.

The validation data and test data each have 10k web pages labeled with a fine-grained score in . The datasets are not IID: the training data is the oldest data, the test data is the newest data, with validation data in-between. We summarize the average quality score of 40% selected web pages in the test data in Table 2, together with the 95% error margins. Note that in this example, importance weighting results in uniform weighting.

| | Uniform | Importance | MOEW |
|---|---|---|---|
| Spam Blocking | | | |
| Web Page Quality | | | |

## 5 Conclusions

In this paper, we proposed learning example weights to sculpt a standard convex loss function into one that better optimizes the test metric for the test distribution. We demonstrated substantial benefits on public benchmark datasets and real-world applications, for problems with non-IID data, heterogeneous noise, and custom metrics that incorporated multiple objectives and fairness metrics.

To limit the need for validation data and re-trainings, we tried to minimize the free parameters of the example weighting function. To that end, we used the low-dimensional embedding of $(x, y)$ provided by an autoencoder, but hypothesize that a discriminatively-trained embedding could be better.

We hypothesize that MOEW could also be useful for other purposes. For example, the learned weights may be useful for guiding active sampling, suggesting one should sample more examples in feature regions with high training weights. We also hypothesize that one could downsample areas of low weight to reduce the size of the training data for faster training and iterations, without sacrificing test performance.

## References

- [1] Corinna Cortes and Mehryar Mohri. AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems 16, pages 313–320, 2004.
- [2] Claudia Perlich, Foster Provost, and Jeffrey S. Simonoff. Tree induction vs. logistic regression: A learning-curve analysis. Journal of Machine Learning Research, 4:211–255, 2003.
- [3] Jesse Davis and Mark Goadrich. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pages 233–240, 2006.
- [4] B. Frenay and M. Verleysen. Classification in the presence of label noise: A survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869, 2014.
- [5] Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
- [6] Paul R. Rosenbaum and Donald B. Rubin. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39(1):33–38, 1985.
- [7] Jared K. Lunceford and Marie Davidian. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine, 23(19):2937–2960, 2004.
- [8] James M. Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866, 1994.
- [9] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
- [10] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems 20, pages 1433–1440, 2008.
- [11] Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10:1391–1445, 2009.
- [12] Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985–1005, 2007.
- [13] Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10:2137–2155, 2009.
- [14] César Ferri, Peter A. Flach, and José Hernández-Orallo. Learning decision trees using the area under the ROC curve. In Proceedings of the 19th International Conference on Machine Learning, pages 139–146, 2002.
- [15] Lian Yan, Robert Dodier, Michael C. Mozer, and Richard Wolniewicz. Optimizing classifier performance via an approximation to the Wilcoxon-Mann-Whitney statistic. In Proceedings of the 20th International Conference on Machine Learning, pages 848–855, 2003.
- [16] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.
- [17] Alan Herschtal and Bhavani Raskutti. Optimising area under the ROC curve using gradient descent. In Proceedings of the 21st International Conference on Machine Learning, pages 49–56, 2004.
- [18] Cynthia Rudin and Robert E. Schapire. Margin-based ranking and an equivalence between AdaBoost and RankBoost. Journal of Machine Learning Research, 10:2193–2232, 2009.
- [19] Peilin Zhao, Steven C. H. Hoi, Rong Jin, and Tianbao Yang. Online AUC maximization. In Proceedings of the 28th International Conference on Machine Learning, pages 233–240, 2011.
- [20] Thorsten Joachims. A support vector method for multivariate performance measures. In Proceedings of the 22nd International Conference on Machine Learning, pages 377–384, 2005.
- [21] Martin Jansche. Maximum expected F-measure training of logistic regression models. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 692–699, 2005.
- [22] Krzysztof Dembczynski, Arkadiusz Jachnik, Wojciech Kotlowski, Willem Waegeman, and Eyke Huellermeier. Optimizing the F-Measure in multi-label classification: Plug-in rule approach versus structured loss minimization. In Proceedings of the 30th International Conference on Machine Learning, pages 1130–1138, 2013.
- [23] Christopher J. Burges, Robert Ragno, and Quoc V. Le. Learning to rank with nonsmooth cost functions. In Advances in Neural Information Processing Systems 19, pages 193–200, 2007.
- [24] Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. A support vector method for optimizing average precision. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 271–278, 2007.
- [25] Elad Eban, Mariano Schain, Alan Mackey, Ariel Gordon, Ryan Rifkin, and Gal Elidan. Scalable learning of non-decomposable objectives. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54, pages 832–840, 2017.
- [26] Harikrishna Narasimhan, Rohit Vaish, and Shivani Agarwal. On the statistical consistency of plug-in classifiers for nondecomposable performance measures. In Advances in Neural Information Processing Systems 27, pages 1493–1501, 2014.
- [27] Purushottam Kar, Harikrishna Narasimhan, and Prateek Jain. Online and stochastic gradient methods for non-decomposable loss functions. In Advances in Neural Information Processing Systems 27, pages 694–702, 2014.
- [28] Purushottam Kar, Harikrishna Narasimhan, and Prateek Jain. Surrogate functions for maximizing precision at the top. In Proceedings of the 32nd International Conference on Machine Learning, pages 189–198, 2015.
- [29] Harikrishna Narasimhan, Purushottam Kar, and Prateek Jain. Optimizing nondecomposable performance measures: A tale of two classes. In Proceedings of the 32nd International Conference on Machine Learning, pages 199–208, 2015.
- [30] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
- [31] Andrew R. Conn, Katya Scheinberg, and Luis N. Vicente. Introduction to Derivative-Free Optimization. SIAM, 2009.
- [32] Peter J. M. van Laarhoven and Emile H. L. Aarts. Simulated Annealing: Theory and Applications. Springer, 1987.
- [33] James Kennedy and Russell Eberhart. Particle swarm optimization. In Proceedings of IEEE International Conference on Neural Networks, pages 1942–1948, 1995.
- [34] Yuhui Shi and Russell Eberhart. A modified particle swarm optimizer. In Proceedings of IEEE International Conference on Evolutionary Computation, pages 69–73, 1998.
- [35] Rainer Storn and Kenneth Price. Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 11(4):341–359, 1997.
- [36] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
- [37] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
- [38] Emile Contal, David Buffoni, Alexandre Robicquet, and Nicolas Vayatis. Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 225–240. Springer, 2013.
- [39] Thomas Desautels, Andreas Krause, and Joel W. Burdick. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research, 15(1):3873–3923, 2014.
- [40] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
- [41] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.
- [42] Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems 29, pages 3315–3323, 2016.
- [43] Gabriel Goh, Andrew Cotter, Maya Gupta, and Michael P Friedlander. Satisfying real-world goals with dataset constraints. In Advances in Neural Information Processing Systems 29, pages 2415–2423, 2016.