An Adaptable Framework for Deep Adversarial Label Learning from Weak Supervision
In this paper, we propose a general framework for using adversarial label learning (ALL) Arachie and Huang (2019) for multiclass classification when the data is weakly supervised. We introduce a new variant of ALL that incorporates human knowledge through multiple constraint types. Like adversarial label learning, we learn by adversarially finding labels constrained to be partially consistent with the weak supervision. However, we describe a different approach to solve the optimization that enjoys faster convergence when training large deep models. Our framework allows for human knowledge to be encoded into the algorithm as a set of linear constraints. We then solve a two-player game optimization subject to these constraints. We test our method on three data sets by training convolutional neural network models that learn to classify image objects with limited access to training labels. Our approach is able to learn even in settings where the weak supervision confounds state-of-the-art weakly supervised learning methods. The results of our experiments demonstrate the applicability of this approach to general classification tasks.
Chidubem Arachie Department of Computer Science Virginia Tech firstname.lastname@example.org Bert Huang Department of Computer Science Virginia Tech email@example.com
Preprint. Under review.
1 Introduction
Researchers and industry practitioners are increasingly turning to weak supervision as an alternative for training machine learning models. Weak supervision is often preferable due to the high cost of labeling training data. The advent of deep learning has brought an explosion of interest in machine learning; however, training deep models often involves collecting massive amounts of training data whose labels are not easily obtained or available. Weak supervision alleviates some of the difficulties and cost associated with supervised learning by only requiring annotators to provide rules or approximate indicators that automatically label the data. Ideally, these annotators or human experts provide several weak rules, and we combine the different weak supervision signals to train a model that is robust to redundancies and errors in the weak supervision. Our work builds on adversarial label learning (ALL), a recent method introduced by Arachie and Huang (2019) for building a robust model by combining multiple sources of weak supervision. ALL works by training classifiers to perform well on adversarially labeled instances that are consistent with the weak supervision. The authors provide an algorithm for solving binary classification tasks on simple linear models using weak supervision signals constrained by error bounds. We generalize this method by developing a framework that encodes multiple linear constraints on the weak supervision signals. Our framework allows for other forms of learning such as multiclass classification, multilabel classification, and structured prediction for advanced models like deep neural networks. In our experiments, we supply error constraints and additional precision constraints to our framework to solve multiclass image classification tasks using a deep neural network when the data is either weakly supervised or semi-supervised.
2 Related Work
Our work builds on progress in three main topic areas: weak supervision, learning with constraints, and adversarial learning.
Weak Supervision: We expand on some of the recent advances in weakly supervised learning, which is the paradigm where models are trained using large amounts of unlabeled data and low-cost, often noisy annotation. One key contribution is the Snorkel system Ratner et al. (2017, 2016), a weak supervision approach where annotators write different labeling functions that are applied to the unlabeled data to create noisy labels. The noisy labels are combined using a generative model to learn the correlation and dependencies between the noisy signals. Snorkel then reasons with this generative model to produce probabilistic labels for the training data. Our method is related to this approach in that we use noisy labels, or weak signals, to learn adversarial labels for the training data, but our focus is on model training rather than outputting training labels for the unlabeled data. Nevertheless, we show in experiments that the quality of labels learned using ALL compares favorably to that of labels inferred by Snorkel’s generative modeling. While not the focus of our contributions, the type of human-provided weak signals in our experiments is motivated by techniques in crowdsourcing Gao et al. (2011).
Learning with Constraints: Our framework incorporates error constraints that are reminiscent of boosting Schapire et al. (2002); however, our bounds are more general and allow for other forms of constraints like precision. Our work is also related to techniques for estimating accuracies of classifiers using only unlabeled data Jaffe et al. (2016); Platanios et al. (2014); Steinhardt and Liang (2016) and combining classifiers for transductive learning using unlabeled data Balsubramani and Freund (2015a); Balsubramani and Freund (2015b). Other methods like posterior regularization (PR) Ganchev et al. (2010) and generalized expectation (GE) criteria Druck et al. (2008); Mann and McCallum (2010, 2008) have been developed to incorporate human knowledge or side information into an objective function. These methods provide parameter estimates as constraints such that the label distributions adhere to these constraints. While GE and PR allow incorporation of weak supervision and parameter estimates as constraints, they do not explicitly consider cases where redundant weak signals that satisfy provided constraints conspire to confound the learner.
Adversarial Learning and Games: Researchers have become increasingly interested in adversarial learning Lowd and Meek (2005) as a method for training models that are robust to input perturbations of the data. These methods Miyato et al. (2018); Torkamani and Lowd (2013, 2014) regularize the learned model using different techniques to defend against adversarial attacks, with the added benefit of improved generalization guarantees. Our approach focuses on adversarial manipulation of the output labels to combat redundancy among multiple sources of weak supervision. Game analyses are gaining importance in machine learning because they generalize optimization frameworks by assigning different objective functions to different players or optimizing agents. The generative adversarial network (GAN) Goodfellow et al. (2014) framework sets up a two-player game between a generator and a discriminator, with the aim of learning realistic data distributions for the generator. Our method does not learn a generative model but instead sets up a two-player nonzero-sum game between an adversary that assigns labels for the classification task and a model that trains parameters to minimize a cross-entropy loss with respect to the adversarial labels.
Our work is most closely related to adversarial label learning (ALL) (Arachie and Huang, 2019), which integrates these topics. We describe ALL in detail as we introduce our enhancements in the next section.
3 Adaptable Framework for Adversarial Label Learning
Adversarial label learning (ALL) was originally proposed by Arachie and Huang (2019) for training binary classifiers from weak supervision. The weak supervision was in the form of approximate probabilistic classifications of the unlabeled training data. The algorithm simultaneously optimizes model parameters and estimated labels for the training data subject to the constraint that the error of the weak signals on the estimated labels is within annotator-provided bounds. We introduce here a generalization of the ALL approach that enables multiclass classification, and we detail modifications to the optimization method proposed by Arachie and Huang (2019) that allow better performance when training deep neural networks.
Let the unlabeled training data be $X = [x_1, \ldots, x_n]$, and let $f_\theta$ be a classifier parameterized by $\theta$. The primal objective function we solve for generalized ALL is
$$\min_{\theta} \max_{Y \in \Delta_C^n} L(f_\theta(X), Y) \quad \text{s.t.} \quad g_j(Y) \le 0 \;\; \forall j \in \{1, \ldots, m\}, \tag{1}$$
where $\Delta_C^n$ is the space of label matrices where each row is on the simplex of dimension $C$ (i.e., a matrix that can represent a set of multinomial distributions), $Y$ is the estimated label matrix, $L$ is a loss function, and $\{g_1, \ldots, g_m\}$ is a set of linear constraint functions on $Y$. The estimated labels are optimized adversarially, against the objective of the learning minimization, so we refer to them as adversarial labels.
3.1 Linear Label Constraints
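As originally proposed by Arachie and Huang (2019), one set of possible linear constraints that tie the adversarial labels to the true labels is a bound on the error rate of each weak signal. Let $q_{s,c} \in [0, 1]^n$ be a weak signal that indicates, in a one-versus-rest sense, the probability that each example is in class $c$, and let $y_c$ denote the $c$th column of matrix $Y$, which is the current label’s estimated probability that each example is in class $c$. The expected empirical error for the one-versus-rest task under these two probabilistic labelings is
$$\epsilon(q_{s,c}, y_c) = \frac{1}{n} \left( q_{s,c}^\top (1 - y_c) + (1 - q_{s,c})^\top y_c \right) = \frac{1}{n} \left( y_c^\top (1 - 2 q_{s,c}) + \textstyle\sum q_{s,c} \right), \tag{2}$$
where we use $\sum q_{s,c}$ as a vector notation for the sum of the entries of $q_{s,c}$ (or its dot product with the ones vector). Combined with an annotator-provided estimate $b_{s,c}$ of a bound on reasonable errors for their weak signals, an error-based constraint function for weak signal $s$ on class $c$ would have the form
$$g^{\mathrm{err}}_{s,c}(Y) = \frac{1}{n} \left( y_c^\top (1 - 2 q_{s,c}) + \textstyle\sum q_{s,c} \right) - b_{s,c} \le 0. \tag{3}$$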
This error-based constraint function can be insufficient to capture the informativeness of a weak signal, especially in cases where there is class imbalance. For multiclass classification, one-versus-rest signals will almost always be class-imbalanced. In such settings, we can allow annotators to indicate their estimates of weak-signal quality by indicating bounds on the precision. In the ALL setting, expected precision can also be expressed as a linear function of $Y$:
$$\mathrm{prec}(q_{s,c}, y_c) = \frac{q_{s,c}^\top y_c}{\sum q_{s,c}}. \tag{4}$$
Since $\sum q_{s,c}$ is a constant with respect to the learning optimization, its appearance in the denominator of this expression does not affect the linearity. We can then define a precision constraint function for each weak signal $s$ on class $c$, given an annotator-provided lower bound $\beta_{s,c}$ on its precision:
$$g^{\mathrm{prec}}_{s,c}(Y) = \beta_{s,c} - \frac{q_{s,c}^\top y_c}{\sum q_{s,c}} \le 0. \tag{5}$$
Including precision constraints better captures the confusion matrix across different classes, but it may also be possible to design other linear constraints. As long as the constraints are linear, the feasible region for the maximization over $Y$ remains convex.
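As a minimal sketch of how the error and precision constraints of this section could be evaluated, the following assumes NumPy vectors for one weak signal `q` and one label column `y_c`. The function names and the bounds `b` and `beta` are hypothetical, chosen for illustration rather than taken from the authors' implementation. A constraint counts as satisfied when its value is at most zero.

```python
import numpy as np

def expected_error(q, y_c):
    """Expected one-versus-rest disagreement between weak signal q and label column y_c."""
    n = q.shape[0]
    return (q @ (1.0 - y_c) + (1.0 - q) @ y_c) / n

def expected_precision(q, y_c):
    """Expected precision of q against y_c; q.sum() is constant with respect to y_c."""
    return (q @ y_c) / q.sum()

def error_constraint(q, y_c, b):
    # Value <= 0 means the expected error is within the annotator's bound b.
    return expected_error(q, y_c) - b

def precision_constraint(q, y_c, beta):
    # Value <= 0 means the expected precision is at least beta.
    return beta - expected_precision(q, y_c)

# Toy example: four examples, the first two belong to the class.
q = np.array([0.9, 0.8, 0.1, 0.2])
y_c = np.array([1.0, 1.0, 0.0, 0.0])
err = expected_error(q, y_c)        # (0.3 + 0.3) / 4
prec = expected_precision(q, y_c)   # 1.7 / 2.0
```

Both functions are linear in `y_c`, which is what keeps the adversary's feasible region convex.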
3.2 Nonzero-Sum Game
Our adaptable framework for ALL deviates from the original formulation by Arachie and Huang (2019) because we formulate a nonzero-sum game between two agents: the adversary that optimizes $Y$ and the learner that optimizes $\theta$. In this game, the objective of the adversary is to assign labels that maximize the error of the model subject to the provided constraints. The model objective is to minimize its loss with respect to the adversarial labels. The loss functions must be differentiable; however, the choice of loss function is task-dependent and can have an important impact on optimization. For multiclass classification using deep neural networks, our model uses the popular cross-entropy loss. However, for the adversarial labeling, we instead use an expected error as the loss function, which the adversary maximizes. Formally, the model’s loss is the cross-entropy
$$L_f(f_\theta(X), Y) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log \left[ f_\theta(x_i) \right]_c, \tag{6}$$
while the adversarial labeler’s loss is the expected error
$$L_Y(f_\theta(X), Y) = \frac{1}{n} \sum_{i=1}^{n} \left( 1 - y_i^\top f_\theta(x_i) \right). \tag{7}$$
The loss function for the adversary is linear in $Y$ and therefore concave, so we are maximizing a concave function subject to linear constraints. This makes the adversarial optimization a linear program, whose optimal value is unique for any fixed $\theta$ whenever the constraints are feasible. This form is relevant for the initialization scheme described in Section 3.3. We optimize the loss functions using Adagrad (Duchi et al., 2011).
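The two losses can be sketched in NumPy as follows. This is an illustration under assumptions: `probs` stands for the model's softmax outputs, `Y` for the adversarial label matrix, and the multiclass expected error is taken to be one minus the inner product of each label row with the corresponding prediction row. The function names are hypothetical.

```python
import numpy as np

def cross_entropy_loss(probs, Y, eps=1e-12):
    """Model loss: mean cross-entropy of predictions against soft labels Y."""
    return -np.mean(np.sum(Y * np.log(probs + eps), axis=1))

def expected_error_loss(probs, Y):
    """Adversary loss: mean probability that label and prediction disagree (linear in Y)."""
    return np.mean(1.0 - np.sum(Y * probs, axis=1))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
Y1 = np.array([[1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0]])
Y2 = np.array([[0.0, 1.0, 0.0],
               [0.0, 1.0, 0.0]])

ce = cross_entropy_loss(probs, Y1)
ee1 = expected_error_loss(probs, Y1)   # mean of 0.3 and 0.9
ee2 = expected_error_loss(probs, Y2)   # mean of 0.8 and 0.2
# Linearity check: the loss of a mixture equals the mixture of the losses.
ee_mix = expected_error_loss(probs, 0.5 * Y1 + 0.5 * Y2)
```

The linearity of the adversary's loss in `Y` is what makes its problem a linear program for fixed model outputs.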
Arachie and Huang (2019) use a primal-dual optimization that jointly solves an augmented Lagrangian relaxation of Eq. 1. Since we extend the method to a nonzero-sum form, we have two separate optimizations. The analogous optimizations are
$$\max_{Y \in \Delta_C^n} \min_{\gamma \ge 0} \; L_Y(f_\theta(X), Y) - \gamma^\top G(Y) - \frac{\rho}{2} \left\| G(Y) \right\|_+^2 \tag{8}$$
and
$$\min_{\theta} \; L_f(f_\theta(X), Y), \tag{9}$$
where $\gamma$ is the vector of Karush-Kuhn-Tucker (KKT) multipliers, $G(Y)$ is the vector of constraint function outputs (i.e., $G(Y) = [g_1(Y), \ldots, g_m(Y)]^\top$), $\rho$ is a positive parameter, and $\|\cdot\|_+$ denotes the norm of positive terms. The adversary optimization maximizes the linear loss from Eq. 7 while the learner minimizes the model loss from Eq. 6. A primal-dual solver for this problem updates the free variables using interleaved variations of gradient ascent and descent.
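To make the augmented Lagrangian concrete, here is a small sketch that evaluates the adversary's objective for given constraint values and takes one projected descent step on the multipliers. The names `augmented_lagrangian` and `dual_step`, the step size `lr`, and the toy numbers are assumptions for illustration. The "norm of positive terms" is read here as the Euclidean norm of the violated (positive) constraint values.

```python
import numpy as np

def augmented_lagrangian(adv_loss, G, gamma, rho):
    """Adversary objective: loss minus multiplier term minus penalty on violated constraints."""
    positive = np.maximum(G, 0.0)          # only violated constraints are penalized
    return adv_loss - gamma @ G - 0.5 * rho * (positive @ positive)

def dual_step(gamma, G, lr):
    """One descent step on the multipliers, then clipping to keep them non-negative."""
    # The gradient of the objective with respect to gamma is -G, so descent adds lr * G.
    return np.maximum(gamma + lr * G, 0.0)

G = np.array([0.1, -0.2])       # first constraint violated, second satisfied
gamma = np.array([1.0, 1.0])
value = augmented_lagrangian(0.5, G, gamma, rho=2.0)
new_gamma = dual_step(gamma, G, lr=0.5)
```

The multiplier of the violated constraint grows and that of the satisfied one shrinks, which pushes the adversarial labels back toward feasibility on later steps.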
The advantage of this primal-dual approach is that it enables inexpensive updates for the gaming agents and other variables being optimized, thereby allowing learning to occur without waiting for the solution of the inner optimization. At every iteration, the primal variables take maximization steps and the dual variables take minimization steps. However, for training deep neural networks, the primal-dual approach is not always ideal.
The large datasets needed to fit large models such as deep neural networks often require stochastic optimization to train efficiently. The key computational benefit of stochastic optimization is that it avoids the $O(n)$ cost of computing the true gradient update. Using a primal-dual approach to optimize Eq. 8 would also incur an $O(n)$ cost for each update to $Y$. Therefore, we instead design our optimization scheme to update $Y$ and $\gamma$ only after a fixed number of epochs. Since each epoch costs $O(n)$ computation, the added overhead does not change the asymptotic cost of training.
This optimization scheme has an added benefit that it increases the stability of the learning algorithm. By updating $Y$ and $\gamma$ only after a few epochs of training $\theta$, we are solving the minimization over $\theta$ nearly to convergence. We still retain the advantages of primal-dual optimization over the variables, but without the added instability of simultaneous nonconvex optimization.
To preserve domain constraints on the variables $Y$ and $\gamma$, we use projection steps that preserve feasibility. After each update to $Y$ and $\gamma$, we project $Y$ to the simplex using the sorting method described by Blondel et al. (2014), and we clip $\gamma$ to be non-negative.
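The projection steps can be sketched as below. This is a standard sort-based Euclidean projection onto the probability simplex, applied row by row. Whether it matches the cited method in every detail is an assumption, and the function names are hypothetical.

```python
import numpy as np

def project_rows_to_simplex(V):
    """Euclidean projection of each row of V onto the probability simplex (sort-based)."""
    n, C = V.shape
    U = np.sort(V, axis=1)[:, ::-1]              # each row sorted in descending order
    css = np.cumsum(U, axis=1) - 1.0             # cumulative sums minus the target total 1
    k = np.arange(1, C + 1)
    support = np.sum(U - css / k > 0, axis=1)    # number of entries that stay positive
    tau = css[np.arange(n), support - 1] / support
    return np.maximum(V - tau[:, None], 0.0)

def clip_multipliers(gamma):
    """Keep the KKT multipliers non-negative."""
    return np.maximum(gamma, 0.0)

V = np.array([[2.0, 0.0],
              [0.5, 0.5],
              [0.3, -0.4]])
P = project_rows_to_simplex(V)
g = clip_multipliers(np.array([-0.3, 0.2]))
```

Rows already on the simplex are left unchanged, and every projected row is non-negative and sums to one.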
Initialization Scheme: To further facilitate faster convergence toward a local equilibrium, we warm start the optimization with a phase of optimization updating only $Y$ and $\gamma$. The effect of this warm-start phase is that we begin learning with a near-feasible $Y$, one that is nearly consistent with the weak-supervision-based constraints. Since this phase uses the fixed output of a randomly initialized model $f_\theta$, it does not require repeated forward- or back-propagation through the deep neural network, so it is quite fast, even for large datasets.
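The warm-start phase might look like the following sketch, which holds the model outputs `P` fixed and alternates projected ascent on the labels with projected descent on the multipliers. Several choices here are assumptions made for illustration: only error constraints are used, the labels start uniform, plain gradient steps replace Adagrad, and the step size, penalty weight `rho`, and toy data are hypothetical.

```python
import numpy as np

def project_rows_to_simplex(V):
    """Sort-based Euclidean projection of each row of V onto the simplex."""
    n, C = V.shape
    U = np.sort(V, axis=1)[:, ::-1]
    css = np.cumsum(U, axis=1) - 1.0
    support = np.sum(U - css / np.arange(1, C + 1) > 0, axis=1)
    tau = css[np.arange(n), support - 1] / support
    return np.maximum(V - tau[:, None], 0.0)

def error_constraints(Y, Q, classes, bounds):
    """G[j] = expected error of weak signal j on label column classes[j], minus its bound."""
    cols = Y[:, classes]
    return np.mean(Q * (1.0 - cols) + (1.0 - Q) * cols, axis=0) - bounds

def warm_start(P, Q, classes, bounds, steps=500, lr=0.5, rho=10.0):
    """Update only the labels Y and multipliers gamma against fixed model outputs P."""
    n, C = P.shape
    Y = np.full((n, C), 1.0 / C)                 # assumed uniform initial labels
    gamma = np.zeros(Q.shape[1])
    for _ in range(steps):
        G = error_constraints(Y, Q, classes, bounds)
        weight = gamma + rho * np.maximum(G, 0.0)
        grad = -P / n                            # gradient of the expected-error loss in Y
        for j, c in enumerate(classes):          # subtract weighted constraint gradients
            grad[:, c] -= weight[j] * (1.0 - 2.0 * Q[:, j]) / n
        # Scale by n so the per-example step does not shrink with dataset size (sketch choice).
        Y = project_rows_to_simplex(Y + lr * n * grad)
        gamma = np.maximum(gamma + lr * G, 0.0)  # descent on gamma, then clip
    return Y, gamma

rng = np.random.default_rng(0)
n, C = 8, 3
logits = rng.normal(size=(n, C))
P = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # stand-in for a random model
true = rng.integers(0, C, size=n)
classes = np.array([0, 1, 2])
Q = (true[:, None] == classes) * 0.8 + 0.1       # one toy weak signal per class
bounds = np.full(3, 0.2)
Y, gamma = warm_start(P, Q, classes, bounds)
```

Because `P` never changes inside the loop, no forward or backward pass through a network is needed, which is the property that makes this phase cheap.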
4 Experiments
We validate our approach on three fine-grained image classification tasks, comparing the performance of models trained with our approach to a baseline averaging method and a model trained using labels generated from Snorkel (Ratner et al., 2017). Each of these methods trains from weak signals, and our experiments evaluate how well they can integrate noisy signals and how robust they are to confounding signals.
4.1 Quality of Constrained Labels
Before using our custom weak annotation framework, we first compare the quality of labels generated by our framework to existing methods for fusing weak signals. We follow the experiment design from a tutorial (https://github.com/HazyResearch/snorkel/blob/master/tutorials/images/Images_Tutorial.ipynb) designed by Ratner et al. (2017) to demonstrate their Snorkel system’s ability to fuse weak signals and generate significantly higher quality labels than naive approaches. The experiment uses the Microsoft COCO: Common Objects in Context Plummer et al. (2015) dataset to train detectors of whether a person is riding a bike within each image. We use the 903 images from the tutorial and weak signals generated by the labeling functions based on object occurrence metadata. To demonstrate the power of ALL’s consideration of error bounds, we calculate the error and precision of each rule and use those to define the ALL constraints. We then run the ALL initialization scheme (the first while loop in Algorithm 1), which finds feasible labels adversarially fit against a random initialization, i.e., arbitrary feasible labels. For an increasing number of weak signals, we compare ALL with error constraints (ALL-E), ALL with both error and precision constraints (ALL-PE), Snorkel, and majority voting.
We plot the resulting error rate of the generated labels in Figure 1. For all numbers of weak signals, ALL-PE obtains the highest accuracy labels. ALL-PE represents our adaptable, multi-constraint framework, while ALL-E represents the error-only approach proposed by Arachie and Huang (2019). The labels generated by Snorkel have the same label error using two and three weak signals, but adding additional weak signals starts to confound Snorkel. Our framework is not confounded by these additional weak signals. Finally, corroborating the results reported by Snorkel’s designers, the naive majority vote method has significantly higher error compared to any of the more sophisticated weak supervision techniques.
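For reference, the naive baselines in this comparison can be sketched as follows for a binary task. The exact voting and averaging rules used in the experiments are not specified here, so the 0.5 thresholds and the function names are assumptions.

```python
import numpy as np

def majority_vote(Q, threshold=0.5):
    """Binarize each weak signal, then label an example positive if at least half vote yes."""
    votes = (Q >= threshold).astype(float)
    return (votes.mean(axis=1) >= 0.5).astype(int)

def average_signals(Q):
    """Soft label for each example: the mean of its weak-signal probabilities."""
    return Q.mean(axis=1)

# Rows are examples, columns are weak signals.
Q = np.array([[0.9, 0.6, 0.2],
              [0.4, 0.3, 0.8],
              [0.7, 0.1, 0.2]])
hard = majority_vote(Q)
soft = average_signals(Q)
```

Neither rule uses any information about how reliable each signal is, so redundant copies of a poor signal can dominate the result.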
4.2 Multiclass Image Classification
For our main experiments, we train multiclass image classifiers from weak supervision. We are interested in evaluating the effectiveness of the weak supervision approach, so we use the same deep neural network architecture for all experiments: a six-layer convolutional neural network with a softmax output. Table 1 summarizes the error obtained by each method on each dataset using all the weak signals we provide to the learners. In every experiment, ALL outperforms both Snorkel and averaging, showing a strong ability to fuse noisy signals and to avoid being confounded by redundant signals. We describe our form of weak supervision and each experiment in detail in the rest of the section.
4.2.1 Weak Signals
We ask human annotators to provide weak signals for image datasets. To generate each weak signal, we sample 50 random images belonging to different classes. We then ask the annotator to select a representative image and mark distinguishing regions of the image that indicate its belonging to a specific class. We then calculate pairwise comparisons between the pixels in the region of the reference image selected by the user and the pixels in the same region for all other images in the dataset. We measure the Euclidean distance between the pairs of images and convert the scores to probabilities with a logistic transform. Through this process, the annotator is guiding the design of simple nearest-neighbor one-versus-rest classifiers, where images most similar to the reference image are more likely to belong to its class. We ask annotators to generate many of these rules for the different classes, and we provide the computed probabilities as weak labels for the weakly supervised learners.
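The nearest-neighbor style weak signal described above can be sketched as below. The text does not specify the form of the logistic transform, so standardizing the distances before applying a logistic function is an assumption, as are the function name, the `region` tuple layout, and the `sharpness` parameter.

```python
import numpy as np

def region_weak_signal(images, ref_index, region, sharpness=1.0):
    """One-versus-rest probabilities from pixel distances to a reference image region."""
    top, bottom, left, right = region
    patches = images[:, top:bottom, left:right].reshape(len(images), -1)
    # Euclidean distance between each image's region and the reference region.
    dists = np.linalg.norm(patches - patches[ref_index], axis=1)
    # Assumed transform: standardize distances, then map small distances to high probability.
    z = (dists - dists.mean()) / (dists.std() + 1e-12)
    return 1.0 / (1.0 + np.exp(sharpness * z))

rng = np.random.default_rng(0)
images = rng.random((50, 28, 28))          # stand-in for 50 sampled grayscale images
signal = region_weak_signal(images, ref_index=3, region=(8, 20, 8, 20))
```

The reference image has distance zero to itself, so it always receives the highest probability.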
In practice, we found that these weak signals were noisy. In some experiments, they were insufficient to provide enough information for the classification task. However, our experiments show a proof-of-concept analysis of how different weakly supervised learners behave with informative but noisy signals. We discuss ideas on how to design better interfaces and better weak signals for the image classification task in Section 5.
We assume we have access to a labeled validation set consisting of 1% of the available data. We use this validation set to compute the precision and error bounds for the weak signals. This validation set is meant to simulate a human expert’s estimate of error and precision. To encourage a fair comparison, we allow all methods to use these labels in addition to weak signals when training by appending the validation set to the dataset with its true labels. Since these bounds are evaluated on a very tiny set of the training data, they are noisy, and prone to the same type of estimation mistakes an expert annotator may make. Therefore, they make a good test for how robust ALL is to imperfect bounds.
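A sketch of how the error and precision bounds might be computed from a small labeled validation set follows. The array layout (one column per weak signal, with `signal_classes` giving the class each signal targets) and the function name are assumptions for illustration.

```python
import numpy as np

def estimate_bounds(Q_val, y_val, signal_classes):
    """Empirical one-versus-rest error and precision of each weak signal on validation data."""
    # targets[i, j] is 1 when validation example i belongs to the class of signal j.
    targets = (y_val[:, None] == signal_classes[None, :]).astype(float)
    error = np.mean(Q_val * (1.0 - targets) + (1.0 - Q_val) * targets, axis=0)
    precision = np.sum(Q_val * targets, axis=0) / np.sum(Q_val, axis=0)
    return error, precision

Q_val = np.array([[0.9], [0.8], [0.1], [0.2]])   # one weak signal for class 0
y_val = np.array([0, 0, 1, 2])
error, precision = estimate_bounds(Q_val, y_val, np.array([0]))
```

With only a handful of validation examples per class, these estimates are noisy, which is the situation the experiments are designed to probe.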
Table 1: Test error obtained by each method on each dataset.
| fashion-mnist (semi + weak) | 0.228 | 0.315 | 0.320 |
| SVHN (semi + weak) | 0.231 | 0.435 | 0.525 |
4.2.2 Weakly Supervised Image Classification
In this experiment, we train a deep neural network using only human-provided weak labels as described in Section 4.2.1. We use the fashion-mnist Xiao et al. (2017) dataset, which represents an image-classification task where each example is a 28×28 grayscale image. The images are categorized into 10 classes of clothing types with 60,000 training examples and 10,000 test examples. We have annotators generate five one-versus-rest weak signals for each class, resulting in 50 total weak signals.
We plot analyses of models trained using weak supervision in Figure 2, where Fig. 2(a) plots the test error, and Fig. 2(b) and Fig. 2(c) are histograms of the error and precision bounds for the weak signals evaluated on the validation set. Since our weak signal is a one-versus-rest prediction of an image belonging to a particular class, the baseline precision and error should be 0.1 for training data with balanced classes. The histograms indicate that there is a wide range of precisions and errors for the different weak signals. The error rates in Fig. 2(a) suggest that the test error of the models decreases as we add more weak signals. ALL with both precision and error bounds outperforms Snorkel and the average baseline for all the weak signals. The min and max curves in the plots represent the best and worst possible label errors for labels that satisfy the provided constraints. The high error in the max curve indicates that the constraints alone still allow highly erroneous labels, yet the ALL framework trains models that perform well. The min curve indicates how close feasible adversarial labels could be to the true labels. In this experiment, the min curve is close to zero, which suggests that the inaccuracies in the provided bounds are not overly restrictive.
4.2.3 Semi-Supervised Image Classification
In the previous experiments, the human-provided weak signals are informative enough to train models to perform better than random guessing, but the resulting error rate is still significantly higher than that of supervised methods. To further boost the performance, we combine the human weak signals with pseudolabels: predictions of our deep model trained on the validation set and applied to the unlabeled training data (Lee, 2013). By training on the available 1% labels and predicting labels for the remaining 99% unlabeled examples, we create a new, high-quality weak signal. We calculate error and precision bounds for the pseudolabels with four-fold cross-validation on the validation set. We report the results of the models trained on the fashion-mnist dataset using this combination of pseudolabels and human weak signals.
Fig. 3 contains plots of the results. The error and precision histograms now include higher precision bounds and lower error bounds, as a result of the pseudolabel signals being higher quality. Additionally, the min and max error curves have lower values, indicating that we get better quality labels with these signals. However, the error trends in Fig. 3(a) are quite different compared to the previous experiment (Fig. 2(a)). All the methods have good performance with the pseudolabel signals, but as we add the human signals, Snorkel and the average baseline are confounded and produce increasingly worse predictions. ALL, however, is minimally affected by the human weak signals. The slight variation in the curve can be attributed to the inaccuracy of estimated bounds for the weak signals.
We hypothesize that this trend occurs because of the nature of our weak supervision. Since the weak signals are based on the selection of exemplar images, they may be effectively subsumed by a fully semi-supervised approach such as pseudolabeling. That is, the information provided by each human weak signal is already included in the pseudolabeling signal. This type of redundant information is an important consideration when using weak supervision. Many signals can have dependencies and redundancies. And despite the Snorkel system’s modeling of dependencies among weak signals, it is still confounded by them while ALL’s model-free approach is robust.
4.2.4 Street View House Numbers
We test the performance of the different models using pseudolabels and human weak labels on another image classification task. We use the Street View House Numbers (SVHN) Netzer et al. (2018) dataset, which represents the task of recognizing digits on real images of house numbers taken by Google Street View. Each image is a 32×32 RGB image. The dataset has 10 classes consisting of 73,257 training images and 26,032 test images.
Figure 4 plots the results of the experiment. Fig. 4(a) features the same trend as Fig. 3(a). For this task, the human weak signals perform poorly in labeling the images, so they do not provide additional information to the learners. This fact is evident in the horizontal slope of the max curve. The min curve suggests that the human weak signals are redundant with poorly estimated bounds, and adding them decreases the space of possible labels for ALL. Comparing models, ALL’s performance is not affected by the redundancies in the weak signals. Since the human weak signals are very similar, Snorkel seems to mistakenly trust the information from these signals more as we add more of them, thus hurting its model performance. ALL uses the extra information provided to it as bounds on the weak signals to protect against placing higher emphasis on the redundant human weak signals.
We introduced an adaptable framework for adversarial label learning that enables users to encode information about the data as a set of linear constraints. Our experiments evaluate the method using precision and error constraints; however, the framework allows for other forms of linear constraints for different tasks. Our evaluation demonstrates that our adaptable framework is able to generate high-quality labels for a learning task and is also able to combine different sources of weak supervision to increase the performance of a model. Our experiments show that our framework outperforms state-of-the-art weak supervision methods on different image-classification tasks and is better at handling redundancies among weak supervision signals. Although we have only shown results for image classification, users can adapt our framework for other classification tasks like text classification.
In our work, we use simple nearest-neighbor style weak signals provided by human annotators. We observe that these weak signals performed well in some experiments, but in other experiments, they did not provide adequate information to the learner. In future work, we plan to explore avenues for generating higher quality human supervision signals. One idea for improving the human signals is by learning latent representations of the data (e.g., with an autoencoder) and comparing the latent representations, rather than the pixel values of the images. Since our annotators have no expertise about the data, we simulated this expert knowledge by evaluating the weak signals on a tiny set of the training data. We can also estimate these bounds using agreements and disagreements of the weak signals, but this involves advanced modeling techniques we plan to incorporate in future work.
We thank NVIDIA for their support through the GPU Grant Program and Amazon for their support via the AWS Cloud Credits for Research program.
References
- Arachie and Huang [2019] Chidubem Arachie and Bert Huang. Adversarial label learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
- Balsubramani and Freund [2015a] Akshay Balsubramani and Yoav Freund. Optimally combining classifiers using unlabeled data. arXiv preprint arXiv:1503.01811, 2015a.
- Balsubramani and Freund [2015b] Akshay Balsubramani and Yoav Freund. Scalable semi-supervised aggregation of classifiers. In Advances in Neural Information Processing Systems, pages 1351–1359, 2015b.
- Blondel et al. [2014] Mathieu Blondel, Akinori Fujino, and Naonori Ueda. Large-scale multiclass support vector machine training via Euclidean projection onto the simplex. In 2014 22nd International Conference on Pattern Recognition, pages 1289–1294. IEEE, 2014.
- Druck et al. [2008] Gregory Druck, Gideon Mann, and Andrew McCallum. Learning from labeled features using generalized expectation criteria. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 595–602. ACM, 2008.
- Duchi et al. [2011] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
- Ganchev et al. [2010] Kuzman Ganchev, Jennifer Gillenwater, and Ben Taskar. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11(Jul):2001–2049, 2010.
- Gao et al. [2011] Huiji Gao, Geoffrey Barbier, and Rebecca Goolsby. Harnessing the crowdsourcing power of social media for disaster relief. IEEE Intelligent Systems, 26(3):10–14, 2011.
- Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
- Jaffe et al. [2016] Ariel Jaffe, Ethan Fetaya, Boaz Nadler, Tingting Jiang, and Yuval Kluger. Unsupervised ensemble learning with dependent classifiers. In Artificial Intelligence and Statistics, pages 351–360, 2016.
- Lee [2013] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, page 2, 2013.
- Lowd and Meek [2005] Daniel Lowd and Christopher Meek. Adversarial learning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 641–647. ACM, 2005.
- Mann and McCallum [2008] Gideon S Mann and Andrew McCallum. Generalized expectation criteria for semi-supervised learning of conditional random fields. Proceedings of ACL-08: HLT, pages 870–878, 2008.
- Mann and McCallum [2010] Gideon S Mann and Andrew McCallum. Generalized expectation criteria for semi-supervised learning with weakly labeled data. Journal of Machine Learning Research, 11(Feb):955–984, 2010.
- Miyato et al. [2018] Takeru Miyato, Shin-ichi Maeda, Shin Ishii, and Masanori Koyama. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
- Netzer et al. [2018] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and A Ng. The street view house numbers (SVHN) dataset. Technical report, Accessed 2016-08-01. [Online]. Available: http://ufldl. stanford. edu …, 2018.
- Platanios et al. [2014] Emmanouil Antonios Platanios, Avrim Blum, and Tom Mitchell. Estimating accuracy from unlabeled data. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 682–691. AUAI Press, 2014.
- Plummer et al. [2015] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, pages 2641–2649, 2015.
- Ratner et al. [2016] Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. Data programming: Creating large training sets, quickly. In Advances in Neural Information Processing Systems, pages 3567–3575, 2016.
- Ratner et al. [2017] Alexander J Ratner, Stephen H Bach, Henry R Ehrenberg, and Chris Ré. Snorkel: Fast training set generation for information extraction. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1683–1686. ACM, 2017.
- Schapire et al. [2002] Robert E Schapire, Marie Rochery, Mazin Rahim, and Narendra Gupta. Incorporating prior knowledge into boosting. In ICML, volume 2, pages 538–545, 2002.
- Steinhardt and Liang [2016] Jacob Steinhardt and Percy S Liang. Unsupervised risk estimation using only conditional independence structure. In Advances in Neural Information Processing Systems, pages 3657–3665, 2016.
- Torkamani and Lowd [2013] Mohamad Ali Torkamani and Daniel Lowd. Convex adversarial collective classification. In International Conference on Machine Learning, pages 642–650, 2013.
- Torkamani and Lowd [2014] Mohamad Ali Torkamani and Daniel Lowd. On robustness and regularization of structural support vector machines. In International Conference on Machine Learning, pages 577–585, 2014.
- Xiao et al. [2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.