# Adversarial Labeling for Learning without Labels

## Abstract

We consider the task of training classifiers without labels. We propose a weakly supervised method—adversarial label learning—that trains classifiers to perform well against an adversary that chooses labels for training data. The weak supervision constrains what labels the adversary can choose. The method therefore minimizes an upper bound of the classifier’s error rate using projected primal-dual subgradient descent. Minimizing this bound protects against bias and dependencies in the weak supervision. Experiments on three real datasets show that our method can train without labels and outperforms other approaches for weakly supervised learning.

## 1 Introduction

This paper introduces adversarial label learning (ALL), a method for training classifiers without labels by making use of weak supervision. ALL works by training classifiers to perform well on adversarially labeled instances that are consistent with the weak supervision. Many machine learning models require large amounts of labeled training data, which is usually hand labeled or observed and recorded. In real applications, large amounts of training data are often not easily accessible or are expensive to acquire, making labeled training data a critical bottleneck for machine learning.

An alternative for training machine learning models without labeled training data is weak supervision. Weak supervision uses domain knowledge about the specific problem or heuristics to approximate the true labels. A key challenge for weak supervision is the fact that there may be bias in the errors made by the weak supervision signals. Using multiple sources of weak supervision can somewhat alleviate this concern, but dependencies among these weak supervision functions can be misconstrued as independent confirmation of erroneous labels. For example, in a classification task to identify diabetic patients, physicians know that obesity can indicate diabetes, and they also know the rate at which this indicator is wrong. However, since the indicator is biased, one may also consider high blood pressure as an indicator. Unfortunately, these indicators are correlated and may make dependent errors.

ALL trains using weak supervision and aims to mitigate these problems by adversarially labeling the data. The adversarial labeling can construct scenarios where dependencies in the weak supervision are as confounding as possible while preserving the partial correctness of the weak supervision. The learner then trains a model that can perform well against this adversarial labeling. ALL solves these two competing optimizations using primal-dual subgradient descent. The inner optimization finds a worst-case distribution of the labels for the current weight parameter of the model, while the outer optimization finds the best weights for the model for the current label distribution. The inner optimization’s maximized error rate can also be viewed as an upper bound on the true error rate, which the outer optimization aims to minimize. By training to perform well on the worst-case labeling, ALL is robust against dependent and biased errors in weak supervision signals.

The inputs to ALL are a set of unlabeled data examples, a set of weak supervision signals that approximately label the data, and a corresponding set of estimated error bounds on these weak supervision signals. Domain experts can design the weak supervision signals—e.g., by defining approximate labeling rules—and they can use their knowledge to set bounds on the errors of these signals. When designing weak supervision signals, experts often have mental estimates of how noisy the signals are, so this error estimate is an inexpensive yet valuable input for the learning algorithm.

We consider a binary classification setting where a parameterized model is trained to classify the data. We make use of multiple weak signals that represent different approximations of the true model. These weak signals can be interpreted as having different views of the data. The estimated error rates of these weak signals are passed as constraints to our optimization. Importantly, we show that ALL works in cases where these weak signals make dependent errors. Our experiments also show that ALL trains classifiers that are better than the weak supervision signals, even when the error estimates are incorrect. The performance of ALL in this setting is significant because domain experts will often imperfectly estimate the noisiness of the weak supervision signals.

## 2 Related Work

Weak supervision has become an important topic in the context of data-hungry deep learning models. A recent line of research on data programming has produced a paradigm for weak supervision where data scientists write labeling functions that create noisy labels [ratner2017snorkel; ratner2016data]. The approach then discovers relationships among the noisy labeling functions and is able to combine them and train data-hungry models. Other related approaches provide weak supervision in the form of constraints on the output space [stewart2017label], such as those that encode physical laws. Another related effort is on meta-learning for neural networks via weak supervision [dehghani2017learning], using semi-supervised data to train an algorithm to learn from weak supervision.

Our work is related to methods developed to estimate the error of classifiers without labeled data [jaffe2016unsupervised; platanios2014estimating; steinhardt2016unsupervised] that rely on statistical relationships between the error rates of different classifiers. Many of these approaches extend classical statistics methods [dawid1979maximum] by allowing the errors of the different classifiers to be dependent variables. A key goal of these approaches is to infer the error rate of these classifiers given only unlabeled data. In contrast, our setting assumes that we have reasonably good estimates of the error rates for the weak supervision provided by experts.

A different form of adversarial learning has recently become popular for deep learning [goodfellow2014generative]. Generative adversarial networks (GANs) pit a data generator and a discriminator against each other to train generative models that imitate realistic data distributions. Though our goal is not to train generative models, the stochastic optimization techniques developed for GANs may help our future work. Other research [lowd2005adversarial; madry2017towards] has considered variants of adversarial learning, training a classifier to learn sufficient information about another classifier to construct adversarial attacks. These efforts primarily focus on training models to be robust against malicious attacks, which is of interest in secure applications.

The type of optimization we use is reminiscent of saddle-point objectives in structured output learning [joachims2009cutting; taskar2005learning; wainwright2008graphical], where the inner optimization of the learning objective is the structured inference procedure. Our inner optimization is an adversarial labeling of unstructured outputs, but the similarities suggest future work combining these two concepts.

## 3 Adversarial Label Learning

The principle behind adversarial label learning (ALL) is that we train a model to perform well under the worst possible conditions. The conditions being considered are the possible true labels of the training data. We consider the setting in which the learner has access to a training set of examples, and weak supervision is given in the form of some approximate indicators of the target classification along with expert estimates of the error rates of these indicators. Formally, let the data be $X = [x_1, \ldots, x_n]$. (We consider these examples to be ordered for notational convenience, but the order does not matter.) These examples belong to classes $y = [y_1, \ldots, y_n] \in \{0, 1\}^n$, but the true classes are unavailable to the learner. Instead, the learner has access to $m$ weak supervision signals $\{q_1, \ldots, q_m\}$, where each weak signal is a soft labeling of the data, i.e., $q_i \in [0, 1]^n$. These soft labelings are estimated probabilities that each example is in the positive class. In conjunction with the weak signals, the learner also receives estimated expected error rate bounds of the weak signals $b = [b_1, \ldots, b_m]$. These values bound the expected error of the weak signals, i.e.,

$$b_i \ge \mathbb{E}_{\hat{y} \sim q_i}\left[\frac{1}{n} \sum_{j=1}^{n} \mathbb{1}\left[\hat{y}_j \ne y_j\right]\right], \tag{1}$$

which can be equivalently expressed as

$$b_i \ge \frac{1}{n}\left(q_i^\top (1 - y) + (1 - q_i)^\top y\right). \tag{2}$$
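As a quick numerical check that the forms in Eq. 1 and Eq. 2 agree, the sketch below computes the expected error of a soft labeling in two ways: by enumerating every hard labeling that could be drawn from the weak signal, and by the closed form. The function names and the toy values are illustrative assumptions, not part of the original method description.

```python
from itertools import product


def expected_error_closed_form(q, y):
    """Closed-form expected error of soft labels q against labels y (Eq. 2 form)."""
    n = len(q)
    return sum(qj * (1 - yj) + (1 - qj) * yj for qj, yj in zip(q, y)) / n


def expected_error_enumerated(q, y):
    """Expected fraction of mismatches when hard labels are drawn from q (Eq. 1 form)."""
    n = len(q)
    total = 0.0
    for y_hat in product([0, 1], repeat=n):
        # Probability of this hard labeling under independent Bernoulli draws.
        prob = 1.0
        for qj, yh in zip(q, y_hat):
            prob *= qj if yh == 1 else 1 - qj
        mismatches = sum(yh != yj for yh, yj in zip(y_hat, y))
        total += prob * mismatches / n
    return total


# Hypothetical weak signal over three examples and hypothetical true labels.
q = [0.9, 0.2, 0.6]
y = [1, 0, 0]
closed = expected_error_closed_form(q, y)
enumerated = expected_error_enumerated(q, y)
print(closed, enumerated)
```

The enumeration is exponential in the number of examples, so it is useful only as a sanity check; the closed form is what makes the constraint linear in the labels.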

While the learned classifier does not have access to the true labels $y$, it will use the assumption that this bound holds to define the space of possible labelings. Let the current estimates of learned label probabilities be $p \in [0, 1]^n$. We relax the space of discrete labelings to the space of independent probabilistic labels, such that the value $y_j \in [0, 1]$ represents the probability that the true label of example $x_j$ is positive. The adversarial labeling then is the vector of class probabilities $y$ that maximizes the expected error rate of the learned probabilities subject to the constraints given by the weak supervision signals and bounds, which can be found by solving the following linear program:

$$\begin{aligned} \max_{y \in [0, 1]^n} \quad & \frac{1}{n}\left(p^\top (1 - y) + (1 - p)^\top y\right) \\ \text{s.t.} \quad & \frac{1}{n}\left(q_i^\top (1 - y) + (1 - q_i)^\top y\right) \le b_i, \quad \forall i \in \{1, \ldots, m\}, \end{aligned} \tag{3}$$

which we present in this unsimplified form to convey the intuition behind its objective and constraints; some simple algebra simplifies this optimization into a more standard form.
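One way to carry out that algebra is sketched here. The symbols are assumptions introduced for illustration: $p$ for the learned probabilities, $q_i$ and $b_i$ for the $i$th weak signal and its bound, and $\mathbf{1}$ for the all-ones vector. Each expected-error term expands as $\frac{1}{n}\left(\mathbf{1}^\top p + (\mathbf{1} - 2p)^\top y\right)$, so dropping the factor and the term that do not depend on $y$ gives

$$\max_{y \in [0, 1]^n} \; (\mathbf{1} - 2p)^\top y \quad \text{s.t.} \quad (\mathbf{1} - 2q_i)^\top y \le n\, b_i - \mathbf{1}^\top q_i, \quad \forall i \in \{1, \ldots, m\}.$$

In this form the objective rewards setting a label toward 1 wherever the learner predicts below 0.5 and toward 0 wherever it predicts above 0.5, while each weak signal contributes a single linear inequality.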

Having defined adversarial labeling, we can describe the learning method. ALL trains a parameterized prediction function $f_\theta$ that reads the data as input and outputs estimated class probabilities, i.e., $[f_\theta(x_1), \ldots, f_\theta(x_n)] = p$. We write $p_\theta$ to mean $p$ when it is important to note that these probabilities are generated from the parameterized function $f_\theta$. For now, we assume a general form for this parameterized function. For our optimization method described later in Section 3.2, we assume that the function is sub-differentiable with respect to its parameters $\theta$. The goal of learning is then to minimize the expected error relative to the adversarial labeling. This principle leads to the following saddle-point optimization:

$$\begin{aligned} \min_{\theta} \max_{y \in [0, 1]^n} \quad & \frac{1}{n}\left(p_\theta^\top (1 - y) + (1 - p_\theta)^\top y\right) \\ \text{s.t.} \quad & \frac{1}{n}\left(q_i^\top (1 - y) + (1 - q_i)^\top y\right) \le b_i, \quad \forall i \in \{1, \ldots, m\}. \end{aligned} \tag{4}$$

We can view the outer optimization as optimizing a primal objective that is the maximum of the constrained inner optimization. Define this primal function as $g(\theta)$, such that Eq. 4 can be equivalently written as $\min_\theta g(\theta)$. If the weak supervision error bounds are true, this primal objective value is an upper bound on the true error rate. This fact follows because the true labels satisfy the constraints, and the inner optimization seeks a feasible labeling that maximizes the classifier’s expected error rate. In the next section, we visualize this primal function and the behavior of adversarial labeling before describing how we efficiently solve this optimization in Section 3.2.
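The upper-bound argument can be written as a single inequality. The notation here is an assumption made for illustration: $y^*$ denotes the true labels, $\mathcal{Y}$ the set of labelings in $[0, 1]^n$ that satisfy every weak-signal constraint, and $\mathrm{err}(p, y) = \frac{1}{n}\left(p^\top (1 - y) + (1 - p)^\top y\right)$ the expected error. If the bounds are correct, then $y^* \in \mathcal{Y}$, and therefore

$$\mathrm{err}(p_\theta, y^*) \;\le\; \max_{y \in \mathcal{Y}} \mathrm{err}(p_\theta, y).$$

The right-hand side is the primal objective, so minimizing it over the parameters minimizes a quantity that can never fall below the true expected error. If a bound is set too tightly and excludes $y^*$ from $\mathcal{Y}$, the inequality is no longer guaranteed.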

### 3.1 Visualizing Adversarial Label Learning

In this section, we investigate a simple case that illustrates the behavior of the primal objective function on a two-example dataset ($n = 2$). With so few examples, we can visualize several concepts in two dimensions.

In Fig. 1(a), we illustrate the constraints set by the two weak supervision signals. The first signal estimates that $x_1$ is positive with probability 0.3 and that $x_2$ is positive with probability 0.2. The second signal estimates that $x_1$ is positive with probability 0.6 and that $x_2$ is positive with probability 0.1. The bounds for each weak signal error are set to . Note that both weak signals agree that $x_2$ is most likely negative, but they disagree on whether $x_1$ is more likely to be positive or negative.

**Constraints on $y$.** The shaded regions represent the feasible regions determined by the linear constraint corresponding to each weak signal. The intersection of these feasible regions is the search space for label vectors. Note how the pink region determined by allows to be either extreme of 0 or 1. With more examples ($n > 2$), the possibility of ambiguous labels increases significantly.

**Primal objective function.** The contour lines illustrate the objective value of the primal function, which finds the expected error for the adversarially set labels $y$. Since the adversarial inner optimization is a linear program, the solution jumps between vertices of the constrained polytope, making the primal expected error a piecewise linear convex function of $p$.

**Adversarial labeling.** In Fig. 1(a), the blue square is the minimum of the primal function, i.e., the solution to the ALL objective. This solution shows that the ideal learned model should predict $x_1$ to be positive with probability 0.18 and $x_2$ to be positive with probability 0. In the optimal state, the adversarial labeling of the examples is illustrated as the orange triangle at . This is the label probability vector that induces the most error for the current predicted probabilities while still satisfying the constraints set by $q_1$ and $q_2$.

**Robustness to redundant and dependent errors.** A key feature of ALL is that it is robust to redundant and dependent errors in the weak supervision. In Fig. 1(b), we plot a variation of the setup from Fig. 1(a), except we include two noisy copies of weak signal $q_2$. Since our optimal solution disagreed with weak signal $q_2$ on the most likely label for $x_1$, one might expect that adding more weak signals that agree with $q_2$ would “outvote” the solution and pull it to a higher probability of $x_1$ being positive. But if weak signal $q_2$ is highly correlated with its two noisy copies, they may suffer from the same errors. Instead of these extra signals inducing a majority vote behavior on the solution, their effect on ALL is that they slightly change the feasible region of the adversarial labels, which leaves the optimum unchanged.

These two-dimensional visualizations illustrate the behavior of ALL on a simple input. In higher dimensions, i.e., when there are more examples in the training set, there is more freedom in the constraints set by each weak signal, so there will be more facets to the piecewise linear objective.
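The two-example case is small enough to reproduce numerically. The sketch below is an illustration under stated assumptions rather than the computation behind the figure: it uses the two weak signals described above, but the error bound (0.4 for both signals) is a hypothetical value chosen here, and the helper names are invented for this example. Because the inner problem is a linear program, its maximum lies at a vertex of the feasible polygon, so in two dimensions it is enough to enumerate the vertices.

```python
from itertools import combinations


def expected_error(p, y):
    """Expected error of predictions p against (soft) labels y."""
    n = len(p)
    return sum(pj * (1 - yj) + (1 - pj) * yj for pj, yj in zip(p, y)) / n


def weak_constraint(q, b):
    """Rewrite (1/n)(q.(1-y) + (1-q).y) <= b as a1*y1 + a2*y2 <= c."""
    return (1 - 2 * q[0], 1 - 2 * q[1], len(q) * b - sum(q))


def feasible_vertices(constraints, tol=1e-9):
    """Vertices of the 2-D polygon {y : a1*y1 + a2*y2 <= c for every constraint}."""
    vertices = []
    for (a1, a2, c), (d1, d2, e) in combinations(constraints, 2):
        det = a1 * d2 - a2 * d1
        if abs(det) < tol:
            continue  # parallel boundary lines never intersect
        # Cramer's rule for the intersection of the two boundary lines.
        y1 = (c * d2 - a2 * e) / det
        y2 = (a1 * e - d1 * c) / det
        if all(f1 * y1 + f2 * y2 <= g + tol for f1, f2, g in constraints):
            vertices.append((y1, y2))
    return vertices


def primal(p, vertices):
    """Worst-case expected error of p over the feasible labelings."""
    return max(expected_error(p, v) for v in vertices)


BOUND = 0.4  # assumed error bound; the value used in the original figure is not known here
box = [(-1, 0, 0), (1, 0, 1), (0, -1, 0), (0, 1, 1)]  # encodes 0 <= y1, y2 <= 1
constraints = box + [weak_constraint((0.3, 0.2), BOUND),
                     weak_constraint((0.6, 0.1), BOUND)]
vertices = feasible_vertices(constraints)

# Coarse grid search over predictions p = (p1, p2) for the minimizer of the primal.
grid = [(i / 100, j / 100) for i in range(101) for j in range(101)]
best_p = min(grid, key=lambda p: primal(p, vertices))
worst_y = max(vertices, key=lambda v: expected_error(best_p, v))
print(best_p, worst_y)
```

With this assumed bound, the grid minimum lands near $p = (0.17, 0)$, close to but not identical to the solution reported for the figure, which is expected given that the bound here is a guess. Vertex enumeration does not scale beyond toy cases; it is used only because it makes the piecewise linear structure of the primal easy to see.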

### 3.2 Optimization Approach

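We use projected primal-dual updates for an augmented Lagrangian relaxation to efficiently optimize the learning objective. The advantage of this approach is that it allows inexpensive updates for all variables being optimized over, and it allows learning to occur without waiting for the solution of the inner optimization. The augmented Lagrangian form of the objective is

$$\begin{aligned} L(\theta, y, \gamma) = {} & \frac{1}{n}\left(p_\theta^\top (1 - y) + (1 - p_\theta)^\top y\right) \\ & - \sum_{i=1}^{m} \gamma_i \left(\frac{1}{n}\left(q_i^\top (1 - y) + (1 - q_i)^\top y\right) - b_i\right) \\ & - \frac{\rho}{2} \sum_{i=1}^{m} \left[\frac{1}{n}\left(q_i^\top (1 - y) + (1 - q_i)^\top y\right) - b_i\right]_+^2, \end{aligned} \tag{5}$$

where $[\cdot]_+$ is the hinge function that returns its input if positive and zero otherwise. This form uses Karush-Kuhn-Tucker (KKT) multipliers $\gamma = [\gamma_1, \ldots, \gamma_m]$ to relax the linear constraints on $y$ and a squared augmented penalty term, weighted by $\rho$, on the constraint violation.

We then take projected gradient steps to update the variables $\theta$, $y$, and $\gamma$. The update step for the parameters is

$$\theta \leftarrow \theta - \frac{\alpha_t}{n} \left(\frac{\partial p_\theta}{\partial \theta}\right)^{\top} (1 - 2y), \tag{6}$$

where $\frac{\partial p_\theta}{\partial \theta}$ is the Jacobian matrix for the classifier over the full dataset and $\alpha_t$ is a gradient step size that can decrease over time. This Jacobian can be computed for a variety of models by back-propagating through the classification computation. The update for the adversarial labels is

$$y \leftarrow \mathrm{clip}_{[0, 1]}\left(y + \frac{\alpha_t}{n}\left((1 - 2p_\theta) - \sum_{i=1}^{m} \left(\gamma_i + \rho\, [c_i(y)]_+\right)(1 - 2q_i)\right)\right), \tag{7}$$

where (to fit onto the page) $c_i(y) = \frac{1}{n}\left(q_i^\top (1 - y) + (1 - q_i)^\top y\right) - b_i$ denotes the violation of the $i$th constraint, and $\mathrm{clip}_{[0, 1]}$ clips the label vector to the space $[0, 1]^n$, projecting it into its domain. The update for each KKT multiplier is

$$\gamma_i \leftarrow \left[\gamma_i + \rho\, c_i(y)\right]_+, \tag{8}$$

which is clipped to be non-negative and uses a fixed step size $\rho$ as dictated by the augmented Lagrangian method [hestenes1969multiplier].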

These primal-dual updates for the optimization converge given that the objective function satisfies the necessary conditions for convergence derived in [du2018linear]: The objective is strongly convex in $\theta$ and $\gamma$ and concave in $y$, while the penalty term for the augmented Lagrangian is strongly convex. The full algorithm is summarized in Algorithm 1.
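To make the update loop concrete, here is a minimal pure-Python sketch of one possible implementation for a logistic regression model. It is not a transcription of Algorithm 1: the function names, the toy data, the step-size schedule, and the penalty weight are all assumptions chosen for illustration, and the gradients are derived from an augmented Lagrangian of the kind described above (expected error, minus multiplier-weighted constraint violations, minus a squared hinge penalty).

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def violation(q, y, b):
    """Expected error of weak signal q on labels y, minus its bound b."""
    n = len(y)
    return sum(qj * (1 - yj) + (1 - qj) * yj for qj, yj in zip(q, y)) / n - b


def primal_dual_step(theta, y, gamma, X, Q, bounds, alpha, rho):
    """One projected primal-dual step (a sketch under the assumptions above)."""
    n = len(X)
    p = [sigmoid(sum(t * xk for t, xk in zip(theta, x))) for x in X]
    # Descent on theta: d(error)/dp_j = (1 - 2 y_j)/n, dp_j/dtheta = p_j (1 - p_j) x_j.
    new_theta = [
        t - alpha / n * sum(pj * (1 - pj) * (1 - 2 * yj) * x[k]
                            for pj, yj, x in zip(p, y, X))
        for k, t in enumerate(theta)
    ]
    # Ascent on y, followed by clipping each label probability to [0, 1].
    weights = [g + rho * max(0.0, violation(q, y, b))
               for g, q, b in zip(gamma, Q, bounds)]
    new_y = []
    for j in range(n):
        grad = (1 - 2 * p[j]) - sum(w * (1 - 2 * q[j]) for w, q in zip(weights, Q))
        new_y.append(min(1.0, max(0.0, y[j] + alpha / n * grad)))
    # Multiplier update with fixed step size rho, clipped to be non-negative.
    new_gamma = [max(0.0, g + rho * violation(q, new_y, b))
                 for g, q, b in zip(gamma, Q, bounds)]
    return new_theta, new_y, new_gamma


random.seed(0)
# Hypothetical toy data: one informative feature plus a constant bias term.
hidden_labels = [0] * 10 + [1] * 10
X = [[random.gauss(2.0 * lab - 1.0, 1.0), 1.0] for lab in hidden_labels]
# Two hypothetical weak signals: noisy soft versions of the hidden labels.
Q = [[min(1.0, max(0.0, lab + random.uniform(-0.4, 0.4))) for lab in hidden_labels]
     for _ in range(2)]
bounds = [0.3, 0.3]

theta, y, gamma = [0.0, 0.0], [0.5] * len(X), [0.0, 0.0]
for t in range(500):
    theta, y, gamma = primal_dual_step(theta, y, gamma, X, Q, bounds,
                                       alpha=1.0 / math.sqrt(t + 1), rho=1.0)
print(theta, gamma)
```

All three variables are updated in every iteration, which reflects the point made above that learning does not wait for the inner optimization to be solved. The hidden labels are used only to synthesize the toy inputs; the update itself never reads them.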

## 4 Experiments

We run experiments on three different datasets to measure the predictive power of adversarial label learning (ALL). For each dataset, we generate weak supervision signals and estimate their error rates. We then compare the accuracy of the model trained by ALL against (1) the different weak supervision signals and (2) baseline models trained by treating the weighted vote of the weak supervision signals as labels.

### 4.1 Weak Supervision and Baseline Models

In practice, domain experts provide weak supervision in the form of noisy indicators or simple labeling functions. This weak supervision generates probabilities that the examples in a sample of the data belong to the positive class. Since we do not have explicit domain knowledge for the datasets used in our experiments, we generate the weak signals by training simple, one-dimensional classifiers on subsets of the data.

We split each dataset into training, validation, and testing subsets. We train weak supervision models by selecting a feature and training one-dimensional logistic regression models on the training subset. We select the weak supervision features based on our non-expert understanding of which features could reasonably serve as indicators of the target class. Once each one-dimensional classifier is trained, we no longer use the training subset.
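A weak signal of this kind can be sketched in a few lines. The code below fits a one-dimensional logistic regression by plain gradient descent and then applies it to held-out feature values to produce soft labels. The function names, learning rate, step count, and toy values are assumptions made for illustration; the original experiments may have used a different training procedure.

```python
import math


def train_weak_signal(feature_values, labels, lr=0.5, steps=2000):
    """Fit p(y=1 | x) = sigmoid(w*x + c) on a single feature by gradient descent."""
    w, c = 0.0, 0.0
    n = len(feature_values)
    for _ in range(steps):
        grad_w = grad_c = 0.0
        for x, lab in zip(feature_values, labels):
            p = 1.0 / (1.0 + math.exp(-(w * x + c)))
            grad_w += (p - lab) * x / n
            grad_c += (p - lab) / n
        w -= lr * grad_w
        c -= lr * grad_c
    return w, c


def weak_signal(w, c, feature_values):
    """Soft labels: estimated probability that each example is positive."""
    return [1.0 / (1.0 + math.exp(-(w * x + c))) for x in feature_values]


# Hypothetical training subset: one feature value per example, with labels.
w, c = train_weak_signal([0.0, 1.0, 2.0, 3.0], [0, 0, 1, 1])
# Hypothetical validation feature values; their labels are never used here.
q = weak_signal(w, c, [0.0, 3.0, 1.4])
print(q)
```

The output list plays the role of one weak signal: it is computed from a single feature of the validation examples and carries no access to their labels.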

We evaluate the one-dimensional classifiers on the validation subset, generating the weak signals. In our first set of experiments, we measure the true error rate of each weak signal on the validation subset and use that as its error bound. In later experiments, we set all bounds to 0.3 as an arbitrary guess.

**Baseline Models.** The input to a weakly supervised learning task includes the weak supervision signals, their error bounds, and the validation set without labels. A straightforward approach that a reasonable data scientist could take to this training task is to compute pseudo-labels using the weak signals. One can then train a classifier on these pseudo-labels with any standard supervised learning approach. For our baseline comparisons, we generate baseline models by treating the rounded average of weak signals as a label.
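The pseudo-labeling step of this baseline is simple enough to show directly. In the sketch below, the function name and the tie-breaking rule (an average of exactly 0.5 is rounded up to the positive class) are assumptions, since the text does not specify how ties are handled.

```python
def average_pseudo_labels(weak_signals):
    """Round the per-example average of the weak signals to a hard 0/1 label."""
    m = len(weak_signals)
    n = len(weak_signals[0])
    averages = [sum(q[j] for q in weak_signals) / m for j in range(n)]
    # Assumed tie-breaking: an average of exactly 0.5 maps to the positive class.
    return [1 if avg >= 0.5 else 0 for avg in averages]


# Hypothetical weak signals over three examples.
pseudo = average_pseudo_labels([[0.9, 0.2, 0.3],
                                [0.7, 0.4, 0.6]])
print(pseudo)
```

Because every weak signal gets an equal vote, several correlated signals that share an error can dominate the average, which is the failure mode the adversarial formulation is designed to avoid.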

In each of our experiments, we consider three different weak signals. We run ALL on the first weak signal (ALL-1), the first and second weak signals (ALL-2), or all three weak signals (ALL-3). We compare against the accuracy of directly using the individual weak signals as the classifier (WS-1, WS-2, and WS-3). And finally, we train models to mimic the average of the first weak signal (AVG-1), the first and second weak signals (AVG-2), and all three weak signals (AVG-3). For the ALL and AVG models, we train logistic regression classifiers, i.e., single-layer neural networks with a sigmoid squashing function.

[Figure: Fashion-MNIST accuracy bar charts, with one panel for the validation set and one for the test set.]

[Figure: pixel locations used to build the Fashion-MNIST weak signals.]

### 4.2 Fashion-MNIST

The first dataset we experiment with is the Fashion-MNIST dataset xiao2017fashion, which is an image-classification dataset where each example is a grayscale image. The images are categorized into ten classes of clothing types. Each class contains 6,000 training examples and 1,000 test examples. We consider the binary classification between three pairs of classes: dresses/sneakers, sandals/ankle boots, and coats/bags. We used the pixel value at the one-quarter, center, and three-quarter locations along the vertical center line (see Fig. 3) to build the respective weak supervision models. We split the training examples in half, using 3,000 examples for the training set and 3,000 examples for the validation set. We use the full 1,000 test examples as our test set.

We plot the performance of ALL, the weak supervision baselines, and the averaging baselines in Fig. 2. We evaluate on both the validation set and the completely held-out test set. The learning algorithms observe the features of the validation set but not the labels, and no algorithms observe the test set until evaluation. For all three class pairs, ALL trains models that perform significantly better than the weak signals and the baselines on both validation and test data. The baselines perform better with an increasing number of weak signals, but their best accuracy score on any class is significantly worse than that of ALL. Despite the fact that the first weak signal (WS-1)—which uses the pixel at the one-quarter location on the training data—has rather low accuracy, ALL trained using it is able to achieve high accuracy.

### 4.3 Breast Cancer Dataset

The second dataset we experiment with is a breast cancer dataset [uci; street1993nuclear]. Each example represents a fine needle aspirate of a breast mass. The features are 30 real-valued characteristics of the cell nuclei in the image. The classification task is to diagnose if the cell is from a malignant (positive) or benign (negative) case of breast cancer. We use the mean radius of the nucleus (WS-1), the radius standard error (WS-2), and worst radius (WS-3) of the cell nucleus as features to train the three different weak supervision models. The dataset contains 569 samples from which we use 132 for training, 266 for validation, and 171 for testing.

Figure 4 (left groups) plots the accuracies obtained on the validation and test sets by ALL and the baseline methods. ALL is again able to achieve higher accuracy than each individual weak signal as well as the baseline trained on the averages of weak signals. The second weak signal (WS-2) is the least accurate weak signal, and adding it to ALL decreases its accuracy, but even this lowered accuracy is higher than any single weak signal or baseline.

### 4.4 Burst Header Packet Flooding Attack Detection

The third dataset we experiment with is a collection of statistics about nodes in an optical burst switching (OBS) network. The classification task is to detect network nodes based on their behavior, identifying whether they should be blocked for potentially malicious behavior [uci; rajab2016countering]. For the first weak signal, we train using a feature representing the percentage of flood per node (WS-1). For the second weak signal, we train a model on the average packet drop rate (WS-2). And for the third weak signal, we train on the average utilized bandwidth (WS-3). The original dataset contains four classes, so we select the two classes with the most examples, resulting in a total of 795 examples, which we split into 185 for training, 371 for validation, and 239 for testing.

Figure 4 (right groups) plots the accuracies obtained on the validation and test sets by ALL and the baseline methods. For this task, ALL trains a model that is approximately as good as the best weak signal on validation, and slightly better on the test set. In contrast, the baseline method, which averages the weak signals, is hurt by the inclusion of the first two less accurate weak signals.

In all sets of experiments so far, we computed the error of the weak signals on the validation set to set the error bounds. Because we used this exact error, the primal objective is exactly an upper bound on the true validation error.

[Figure: accuracy bar charts for the breast cancer and OBS datasets, with one panel for the validation set and one for the test set.]

### 4.5 Fixed Bounds During Optimization

Instead of using the true validation error as the bounds, we consider a more realistic scenario in which the experts are less precise about their error estimates. In practice, the true error rate may be difficult to estimate, so these experiments test whether our approach continues to work well when these bounds are inaccurate. We use a fixed upper bound of 0.3 and report the performance of the ALL model and baselines in this setting.

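[Figure: Fashion-MNIST accuracy with fixed error bounds, with one panel for the validation set and one for the test set.]

[Figure: breast cancer and OBS accuracy with fixed error bounds, with one panel for the validation set and one for the test set.]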

Figures 6 and 5 plot the accuracies obtained by the methods. The ALL model performs well on all the datasets except on the sandals vs. ankle boot Fashion-MNIST classification when using only one or two weak signals. Once the third signal is included, ALL again achieves the best performance. This result indicates that with an error bound of 0.3, the one-quarter and middle pixels are not good weak supervision signals for separating sandals from ankle boots. Since these weak signals have errors of over 0.4 (see the bars for WS-1 and WS-2), the error bound of 0.3 is violated by the true data. This violation means that the adversarial labeling was more constrained than it should have been. This extra constraint on the adversary then causes the learned model to be overly optimistic.

The accuracy scores from the dress vs. sneaker task are marginally higher than the results from the previous experiments, which used the true error rate (see Fig. 2). Again, the bound of 0.3 is overly constraining to the adversary, considering that WS-1 on this task performs no better than random. But in this case, the learner was able to extract information from the unlabeled data to separate the classes. Qualitatively, this comparison is the easiest of the three, since the dress images tend to have much larger non-background area than sneakers.

The fixed bound of 0.3 seems to help ALL on the breast cancer and OBS datasets. In fact, while ALL on the OBS data was only slightly better than the best weak signal when using the true error rates, ALL’s improvement on the baselines when using the 0.3 bound is more pronounced. More importantly, it again is not confounded by the less accurate weak signals as the averaging baseline is.

## 5 Conclusion

We introduced adversarial label learning (ALL), a method to train classifiers without labeled data by making use of weak supervision. The method trains a model to minimize the error rate for adversarial labels, which are subject to constraints defined by the weak supervision. We demonstrated that our method is robust against weak supervision signals that make dependent errors. Our experiments confirm that ALL is able to learn models that outperform the weak supervision and directly training classifiers to mimic the weak supervision.

While our contribution is a significant methodological advance, there are some limitations that future work can address. Though our algorithm can train any sub-differentiable classification function, we only experimented with logistic regression classifiers. Additional experiments are needed to verify if our optimization procedure works as well on nonlinear classifiers. We focused on training binary classifiers, but the principles underlying our method should extend to multi-class, regression, and even structured output settings. Our algorithm also requires reasoning over the entire training dataset, so we will explore ideas for scalability such as stochastic variations of our optimization procedure.