# Beyond Perturbations: Learning Guarantees with Arbitrary Adversarial Test Examples

## Abstract

We present a transductive learning algorithm that takes as input training examples from a distribution P and arbitrary (unlabeled) test examples, possibly chosen by an adversary. This is unlike prior work that assumes that test examples are small perturbations of examples drawn from P. Our algorithm outputs a selective classifier, which abstains from predicting on some examples. By considering selective transductive learning, we give the first nontrivial guarantees for learning classes of bounded VC dimension with arbitrary train and test distributions—no prior guarantees were known even for simple classes of functions such as intervals on the line. In particular, for any function in a class F of bounded VC dimension, we guarantee a low test error rate and a low rejection rate with respect to P. Our algorithm is efficient given an Empirical Risk Minimizer (ERM) for F. Our guarantees hold even for test examples chosen by an unbounded white-box adversary. We also give guarantees for generalization, agnostic, and unsupervised settings.

## 1 Introduction

Consider binary classification where test examples are not from the training distribution. Specifically, consider learning a binary function f where training examples are assumed to be iid from a distribution P over the example domain X, while the test examples are arbitrary. This includes both the possibility that test examples are chosen by an adversary and that they are drawn from a different distribution Q (sometimes called “covariate shift”). For a disturbing example of covariate shift, consider learning to classify abnormal lung scans. A system trained on scans prior to 2019 may miss abnormalities due to COVID-19 since there were none in the training data. As a troubling adversarial example, consider explicit content detectors which are trained to classify normal vs. explicit images. Adversarial spammers synthesize endless variations of explicit images that evade these detectors for purposes such as advertising and phishing (Yuan et al., 2019).

A recent line of work on adversarial learning has designed algorithms that are robust to imperceptible perturbations. However, perturbations do not cover all types of test examples. In the explicit image detection example, Yuan et al. (2019) find adversaries using conspicuous image distortion techniques (e.g., overlaying a large colored rectangle on an image) rather than imperceptible perturbations. In the lung scan example, Fang et al. (2020) find noticeable signs of COVID in many scans.

In general, there are several reasons why learning with arbitrary test examples may seem impossible. First of all, one may not be able to predict the labels of test examples that are far from training examples, as illustrated by the examples in group (1) of Figure 1. Secondly, as illustrated by group (2), given any classifier h, an adversary or test distribution may concentrate on or near an error. High error rates are thus unavoidable, since an adversary can simply repeat any single erroneous example it can find. This could also arise naturally, as in the COVID example, if Q contains a concentration of new examples near one another: individually they appear “normal” (but are suspicious as a group). This is true even under the standard realizable assumption that the target function f is in a known class F of bounded VC dimension d.

As we now argue, learning with arbitrary test examples requires selective classifiers and transductive learning, which have each been independently studied extensively. We refer to the combination as classification with redaction, a term which refers to the removal/obscuring of certain information when documents are released. A selective classifier (SC) is one which is allowed to abstain from predicting on some examples. In particular, it specifies both a classifier h and a subset S ⊆ X of examples to classify, and rejects the rest. Equivalently, one can think of a SC as a function h|S : X → {0, 1, ⊥}, where ⊥ indicates rejection, i.e., abstinence.

We say the learner classifies x if x ∈ S, and otherwise it rejects x. Following standard terminology, if x ∉ S (i.e., h|S(x) = ⊥) we say the classifier rejects x (the term is not meant to indicate anything negative about the example but merely that its classification may be unreliable). We say that h|S misclassifies or errs on x ∈ S if h(x) ≠ f(x). There is a long literature on SCs, starting with the work of Chow (1957) on character recognition. In standard classification, transductive learning refers to the simple learning setting where the goal is to classify a given unlabeled test set that is presented together with the training examples (see e.g., Vapnik, 1998). We will also consider the generalization error of the learned classifier.
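For concreteness, a selective classifier h|S can be represented as an ordinary function that returns a special rejection symbol outside S. The sketch below is our own illustration, not part of the paper's formal development; the names `selective_classifier`, `REJECT`, and the threshold example are ours:

```python
# A minimal sketch (ours) of a selective classifier h|S: a pair of a base
# classifier h and a prediction set S; points outside S are rejected
# (output REJECT, playing the role of the "bottom" symbol).

REJECT = "⊥"

def selective_classifier(h, in_S):
    """Build h|S from a classifier h and a membership test for S."""
    def h_S(x):
        return h(x) if in_S(x) else REJECT
    return h_S

# Example: h is a threshold classifier on the line; S excludes a gray zone.
h = lambda x: 1 if x >= 0.5 else 0
in_S = lambda x: not (0.4 <= x <= 0.6)  # abstain near the boundary

clf = selective_classifier(h, in_S)
```

Points in the abstention zone are mapped to the rejection symbol, while all other points receive h's prediction unchanged.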

This raises the question: When are unlabeled test examples available in advance? In some applications, test examples are classified all at once (or in batches). Otherwise, redaction can also be beneficial in retrospect. For instance, even if image classifications are necessary immediately, an offensive image detector may be run daily with rejections flagged for inspection; and images may later be blocked if they are deemed offensive. Similarly, if a group of unusual lung scans showing COVID were detected after a period of time, the recognition of the new disease could be valuable even in hindsight. Furthermore, in some applications, one cannot simply label a sample of test examples. For instance, in learning to classify messages on an online platform, test data may contain both public and private data while training data may consist only of public messages. Due to privacy concerns, labeling data from the actual test distribution may be prohibited.

It is clear that a SC is necessary to guarantee few test misclassifications: e.g., if Q is concentrated on a single point x on which h errs, rejection is necessary to guarantee few errors on arbitrary test points. However, no prior guarantees (even statistical guarantees) were known for learning elementary classes such as intervals or halfspaces with arbitrary Q. This is because learning such classes is impossible without unlabeled examples.

To illustrate how redaction (transductive SC) is useful, consider learning an interval f on the real line with arbitrary Q. This is illustrated below, with (blue) dots indicating test examples:

With positive training examples as in (a), one can guarantee 0 test errors by rejecting the two (grey) regions adjacent to the positive examples. When there are no positive training examples,^{1}
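The one-dimensional redaction rule described in (a) admits a simple rendering. The sketch below is our own toy implementation, assuming the training data contains negatives on both sides of the positives: predict + on the hull of the positive examples, − beyond the flanking negatives, and reject the two uncertain “gray” regions in between, where the target interval's endpoints may lie.

```python
# A hedged toy sketch (ours) of the interval example: any interval
# consistent with the training data then makes 0 errors on classified points.

REJECT = "⊥"

def interval_redactor(pos, neg):
    lo, hi = min(pos), max(pos)
    # nearest negative training examples flanking the positives
    # (assumed to exist in this toy instance)
    left = max(x for x in neg if x < lo)
    right = min(x for x in neg if x > hi)
    def clf(x):
        if lo <= x <= hi:
            return 1            # inside the hull of positives
        if x <= left or x >= right:
            return 0            # certified negative region
        return REJECT           # gray region: a target endpoint may lie here
    return clf

clf = interval_redactor(pos=[0.4, 0.5], neg=[0.1, 0.9])
```

Every target interval containing [0.4, 0.5] and excluding 0.1 and 0.9 agrees with `clf` on all classified points.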

Note that our redaction model assumes that the target function f remains the same at train and test times. This assumption holds in several (but not all) applications of interest. For instance, in explicit image detection, U.S. laws regarding what constitutes an illegal image are based solely on the image itself (U.S.C., 1996). Of course, if laws change between train and test time, then f itself may change. Label shift problems, where f changes from train to test, are also important but not addressed here. Our focus is primarily the well-studied realizable setting, where f ∈ F, though we analyze an agnostic setting as well.

#### A note of caution.

Inequities may be caused by using training data that differs from the test distribution on which the classifier is used. For instance, in classifying a person’s gender from a facial image, Buolamwini and Gebru (2018) have demonstrated that commercial classifiers are highly inaccurate on dark-skinned faces, likely because they were trained on predominantly light-skinned faces. In such cases, it is preferable to collect a more diverse training sample even if it comes at greater expense, or to abstain from using machine learning altogether, since an unbalanced distribution of rejections can also be harmful.^{2}

### 1.1 Redaction model and guarantees

Our goal is to learn a target function f ∈ F, where F has VC dimension d, with training distribution P over X. In the redaction model, the learner first chooses h based on n iid training examples x = (x_1, …, x_n) ∼ P^n and their labels y_i = f(x_i). (In other words, it trains a standard binary classifier.) Next, a “white box” adversary selects ñ arbitrary test examples x̃ based on all available information, including h, x, y, and the learning algorithm. Using the unlabeled test examples (and the labeled training examples), the learner finally outputs S ⊆ X. Errors are those test examples in S that were misclassified, i.e., x̃_i ∈ S with h(x̃_i) ≠ f(x̃_i).

Rather than jumping straight into the transductive setting, we first describe the simpler generalization setting. We define the model in which the test examples are drawn iid by nature from an arbitrary distribution Q over X. While it will be easier to quantify generalization error and rejections in this simpler model, the model does not permit a white-box adversary to choose test examples based on S. To measure performance here, define the rejection and error rates for a distribution D (taken below to be P or Q), respectively:

rej_D(h|S) := Pr_{x∼D}[ x ∉ S ],   (1)

err_D(h|S) := Pr_{x∼D}[ x ∈ S ∧ h(x) ≠ f(x) ].   (2)

We write rej and err when the distribution D, h, and S are clear from context. We extend the definition of PAC learning to the case Q ≠ P as follows:

###### Definition 1.1 (PQ learning).

Learner L (ε, δ)-PQ-learns F if, for any distributions P, Q over X and any f ∈ F, its output h|S = L(x, y, x̃) satisfies

Pr[ err_Q(h|S) ≤ ε ∧ rej_P(h|S) ≤ ε ] ≥ 1 − δ,

where the probability is over the training examples x ∼ P^n (labeled by f) and the test examples x̃ ∼ Q^ñ.

L PQ-learns F if L runs in polynomial time and there is a polynomial p such that, for every ε, δ > 0, L (ε, δ)-PQ-learns F using p(1/ε, 1/δ) examples.

Now, at first it may seem strange that the definition bounds rej_P rather than rej_Q, but as mentioned rej_Q cannot be bounded absolutely. Instead, it can be bounded relative to rej_P and the total variation distance (also called statistical distance) |P − Q|_TV, as follows:

rej_Q(h|S) ≤ rej_P(h|S) + |P − Q|_TV.

This new perspective, of bounding the rejection probability with respect to P, as opposed to Q, facilitates the analysis. Of course, when P = Q, rej_Q = rej_P and |P − Q|_TV = 0; when P and Q have disjoint supports (no overlap), |P − Q|_TV = 1 and the above bound is vacuous. We also discuss tighter bounds relating rej_Q to rej_P.
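The total variation bound can be checked numerically on finite supports. The following toy computation is ours; the specific distributions are arbitrary:

```python
# A small numeric illustration (ours) of rej_Q <= rej_P + |P - Q|_TV:
# for distributions on {0,1,2,3} and an arbitrary rejection region R
# (the complement of S), the Q-mass of R can exceed its P-mass by at
# most the total variation distance.

P = {0: 0.4, 1: 0.4, 2: 0.2, 3: 0.0}
Q = {0: 0.1, 1: 0.3, 2: 0.2, 3: 0.4}

tv = 0.5 * sum(abs(P[x] - Q[x]) for x in P)  # |P - Q|_TV
R = {2, 3}                                   # rejected points
rej_P = sum(P[x] for x in R)
rej_Q = sum(Q[x] for x in R)

assert rej_Q <= rej_P + tv + 1e-12
```

For this choice of R the bound is in fact tight: rej_Q = 0.6 while rej_P + tv = 0.2 + 0.4.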

We provide two redactive learning algorithms: a supervised algorithm called Rejectron, and an unsupervised algorithm called URejectron. Rejectron takes as input labeled training data (x, y) and test data x̃ (and an error parameter ε). It can be implemented efficiently using any ERM oracle for F, which outputs a function of minimal error on any given set of labeled examples. It is formally presented in Figure 2. At a high level, it chooses h = ERM(x, y) and chooses S in an iterative manner. It starts with S = X and then iteratively chooses c_t ∈ F that disagrees significantly with h on the test examples x̃ but agrees with the labels y on the training examples x; it then rejects all x’s such that c_t(x) ≠ h(x). As we show in Lemma 5.1, choosing c_t can be done efficiently given oracle access to ERM.
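To make the iterative scheme concrete, here is a simplified, hypothetical rendering with a small finite class, so that the ERM and maximization steps can be brute-forced. It is a sketch of the idea, not the paper's exact Figure 2; the names `rejectron`, `Lam`, the scoring rule, and the threshold class are ours:

```python
# A hedged toy sketch (ours) of the iterative supervised procedure:
# train h by (brute-force) ERM, then repeatedly find c that disagrees
# with h on surviving test points but has low training error, and
# reject the disagreement region.

def rejectron(F, train, labels, test, eps, Lam):
    # h: minimal training error over the finite class F
    h = min(F, key=lambda c: sum(c(x) != y for x, y in zip(train, labels)))
    rejected = []  # classifiers whose disagreement with h defines rejections
    n, m = len(train), len(test)
    for _ in range(int(1 / eps) + 1):
        in_S = lambda x: all(c(x) == h(x) for c in rejected)
        def s(c):
            # disagreement with h on surviving test points,
            # minus Lam times the training error
            dis = sum(c(x) != h(x) and in_S(x) for x in test) / m
            err = sum(c(x) != y for x, y in zip(train, labels)) / n
            return dis - Lam * err
        c_t = max(F, key=s)
        if s(c_t) <= eps:
            break
        rejected.append(c_t)
    in_S = lambda x: all(c(x) == h(x) for c in rejected)
    return h, in_S

# Thresholds on the line as a toy class
F = [lambda x, t=t: int(x >= t) for t in [0.0, 0.25, 0.5, 0.75, 1.0]]
train, labels = [0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]
test = [0.1, 0.9, 0.3]  # 0.3 lies in the ambiguous region
h, in_S = rejectron(F, train, labels, test, eps=0.1, Lam=len(train) + 1)
```

On this instance the test point 0.3, which several zero-training-error thresholds label differently, is rejected, while the unambiguous test points are classified.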

Theorem 5.2 shows that Rejectron PQ-learns any class F of bounded VC dimension d, specifically with ε = Õ(√(d/n)). (The Õ notation hides logarithmic factors, including the dependence on the failure probability δ.) This is worse than the standard ε = Õ(d/n) bound of supervised learning when P = Q, though Theorem 5.4 shows this is necessary, with an Ω(√(d/n)) lower bound for PQ learning.

Our unsupervised learning algorithm URejectron, formally presented in Figure 3, computes S only from unlabeled training and test examples, and has similar guarantees (Theorem 5.6). The algorithm tries to distinguish training and test examples and then rejects whatever is almost surely a test example. More specifically, as above, it chooses S in an iterative manner, starting with S = X. It iteratively chooses two functions c_t, c′_t ∈ F such that c_t and c′_t have high disagreement on the test examples and low disagreement on the (unlabeled) training examples, and rejects all x’s on which c_t and c′_t disagree. As we show in Lemma B.1, choosing c_t and c′_t can be done efficiently given a (stronger) oracle for the class of disagreements between pairs of functions in F. We emphasize that URejectron can also be used for multi-class learning, as it does not use training labels, and can be paired with any classifier trained separately. This advantage of URejectron over Rejectron comes at the cost of requiring a stronger base oracle, and may lead to examples being unnecessarily rejected.
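The unsupervised variant can be sketched analogously, again as our own toy rendering over a small finite class (names and the brute-force pair search are ours, not the paper's Figure 3):

```python
# A hedged toy sketch (ours) of the unsupervised procedure: iteratively
# find a pair (c, c') that disagree often on surviving test points but
# rarely on (unlabeled) training points, and reject the disagreement
# region.

from itertools import product

def urejectron(F, train, test, eps, Lam):
    pairs = []  # disagreeing pairs defining the rejected region
    n, m = len(train), len(test)
    for _ in range(int(1 / eps) + 1):
        in_S = lambda x: all(c(x) == c2(x) for c, c2 in pairs)
        def score(cc):
            c, c2 = cc
            dis_test = sum(c(x) != c2(x) and in_S(x) for x in test) / m
            dis_train = sum(c(x) != c2(x) for x in train) / n
            return dis_test - Lam * dis_train
        best = max(product(F, F), key=score)
        if score(best) <= eps:
            break
        pairs.append(best)
    return lambda x: all(c(x) == c2(x) for c, c2 in pairs)

F = [lambda x, t=t: int(x >= t) for t in [0.25, 0.5, 0.75]]
train = [0.1, 0.2, 0.8, 0.9]   # unlabeled training examples
test = [0.1, 0.9, 0.3, 0.6]    # 0.3 and 0.6 look like test-only points
in_S = urejectron(F, train, test, eps=0.1, Lam=len(train) + 1)
```

No labels are used: the test points 0.3 and 0.6 are rejected purely because a pair of thresholds separates them from the training data.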

In Figure 1 we illustrate our algorithms for the class of halfspaces. A natural idea would be to train a halfspace to distinguish unlabeled training and test examples—intuitively, one can safely reject anything that is clearly distinguishable as test without increasing rej_P. However, this on its own is insufficient: see, for example, group (2) of examples in Figure 1, which cannot be distinguished from training data by a halfspace. This is precisely why having the test examples is absolutely necessary. Indeed, it allows us to use an ERM oracle for F to PQ-learn F.

We also present:

#### Transductive analysis

A similar analysis of Rejectron in a transductive setting gives error and rejection bounds directly on the test examples. The bounds here are with respect to a stronger white-box adversary, who need not even choose a test set iid from a distribution. Such an adversary chooses the test set with knowledge of h and the training data. In particular, first h is chosen based on x and y; then the adversary chooses the test set x̃ based on all available information; and finally, S is chosen. We introduce a novel notion of false rejection, in which we reject a test example that was in fact chosen from Q and not modified by an adversary. Theorem 5.3 gives bounds that are similar in spirit to Theorem 5.2 but for the harsher transductive setting.

#### Agnostic bounds

Thus far, we have considered the realizable setting, where the target f ∈ F. In agnostic learning (Kearns et al., 1992), there is an arbitrary distribution over labeled examples, and the goal is to learn a classifier that is nearly as accurate as the best classifier in F. In our setting, we assume that there is a known η ≥ 0 such that the train and test distributions satisfy that some function in F has error at most η with respect to both. Unfortunately, we show that in such a setting one cannot guarantee error and rejection rates below Ω(√η), but we show that Rejectron nearly achieves such guarantees.

#### Experiments

As a proof of concept, we perform simple controlled experiments on the task of handwritten-letter classification, using lower-case English letters from the EMNIST dataset (Cohen et al., 2017). In one setup, to mimic a spamming adversary, after a classifier is trained, test examples on which it errs are identified and repeated many times in the test set. Existing SC algorithms (no matter how robust) will fail on such an example, since they all choose S without using unlabeled test examples: as long as an adversary can find even a single erroneous example, it can simply repeat it. In the second setup, we consider a natural test distribution consisting of a mix of lower- and upper-case letters, while the training set contains only lower-case letters. The simplest version of URejectron achieves high accuracy while rejecting mostly adversarial examples or capital letters.

#### Organization

Section 2 surveys related work. Section 3 introduces preliminaries and notation, Section 4 presents the redaction model in its two settings, and Section 5 presents our algorithms and guarantees.

## 2 Related work

The redaction model combines SC and transductive learning, which have each been extensively studied, separately. We first discuss prior work on these topics, which (with the notable exception of online SC) has generally been considered when test examples are from the same distribution as training examples.

#### Selective classification

Selective classification goes by various names, including “classification with a reject option” and “reliable learning.” To the best of our knowledge, prior work has not considered SC using unlabeled samples from the test distribution. Early learning-theory work by Rivest and Sloan (1988) required a guarantee of 0 test errors and few rejections. However, Kivinen (1990) showed that, under this definition, even learning rectangles under uniform distributions requires an exponential number of examples (as cited by Hopkins et al. (2019), which, like much other work, therefore makes further assumptions on the class and distributions). Most of this work assumes the same training and test distributions, without adversarial modification. Kanade et al. (2009) give a SC reduction to an agnostic learner (similar in spirit to our reduction to ERM), but again for the case of identical train and test distributions.

A notable exception is the work in online SC, where an arbitrary sequence of examples is presented one-by-one with immediate error feedback. This work includes the “knows-what-it-knows” algorithm (Li et al., 2011), and Sayedi et al. (2010) exhibit an interesting trade-off between the number of mistakes and the number of rejections in such settings. However, basic classes such as intervals on the line are impossible to learn in these harsh online formulations. Interestingly, our division into labeled train and unlabeled test seems to make the problem easier than in the harsh online model.

#### Transductive (and semi-supervised) learning.

In transductive learning, the classifier is given the test examples to classify all at once, or in batches, rather than individually (e.g., Vapnik, 1998). Performance is measured with respect to those test examples. It is related to semi-supervised learning, where unlabeled examples are given but performance is measured with respect to future examples from the same distribution. There, since training and test examples are assumed iid from the same distribution, the unlabeled examples are generally taken to greatly outnumber the labeled ones, since otherwise they would provide limited additional value.

We now discuss related work which considers test distributions that differ from the training distribution, but where classifiers must predict everywhere, without the possibility of outputting ⊥.

#### Robustness to Adversarial Examples

There is ongoing effort to devise methods for learning predictors that are robust to adversarial examples (Szegedy et al., 2013; Biggio et al., 2013; Goodfellow et al., 2015) at test time. Such work typically assumes that the adversarial examples are perturbations of honest examples chosen from . The main objective is to learn a classifier that has high robust accuracy, meaning that with high probability, the classifier will answer correctly even if the test point was an adversarially perturbed example. Empirical work has mainly focused on training deep learning based classifiers to be more robust (e.g., Madry et al., 2018; Wong and Kolter, 2018; Zhang et al., 2019). Kang et al. (2019) consider the fact that perturbations may not be known in advance, and some work (e.g., Pang et al., 2018) addresses the problem of identifying adversarial examples. We emphasize that as opposed to this line of work, we consider arbitrary test examples and use SC.

Detecting adversarial examples has been studied in practice, but Carlini and Wagner (2017) study ten proposed heuristics and are able to bypass all of them. Our algorithms also require a sufficiently large set of unlabeled test examples. The use of unlabeled data for improving robustness has also been empirically explored recently (e.g., Carmon et al., 2019; Stanforth et al., 2019; Zhai et al., 2019).

In work on real-world adversarial images, Yuan et al. (2019) find adversaries using highly visible transformations rather than imperceptible perturbations. They categorize seven major types of such transformations and write:

“Compared with the adversarial examples studied by the ongoing adversarial learning, such adversarial explicit content does not need to be optimized in a sense that the perturbation introduced to an image remains less perceivable to humans…. today’s cybercriminals likely still rely on a set of predetermined obfuscation techniques… not gradient descent.”

#### Covariate Shift

The literature on learning with covariate shift is too large to survey here; see, e.g., the book by Quionero-Candela et al. (2009) and the references therein. To achieve guarantees, it is often assumed that the support of Q is contained in the support of P. Like our work, many of these approaches use unlabeled data from Q (e.g., Huang et al., 2007; Ben-David and Urner, 2012). Ben-David and Urner (2012) show that learning with covariate shift is intractable, in the worst case, without such assumptions. In this work we overcome this negative result, and obtain guarantees for arbitrary P and Q, using SC. In summary, prior work on covariate shift that guarantees low test/target error requires strong assumptions regarding the distributions. This motivates our model of covariate shift with rejections.

## 3 Preliminaries and notation

Henceforth, we assume a fixed class F of functions from a domain X to {0, 1}.^{3}

## 4 Learning with redaction

We now describe the two settings for SC. We use the same algorithm in both settings, so it can be viewed as two justifications for the same algorithm. The PQ model provides guarantees with respect to future examples from the test distribution, while the transductive model provides guarantees with respect to arbitrary test examples chosen by an all-powerful adversary. Interestingly, the transductive analysis is somewhat simpler and is used in the PQ analysis.

### 4.1 PQ learning

In the PQ setting, an SC learner L is given n labeled examples x drawn iid from P, with labels y_i = f(x_i) for some unknown f ∈ F, and ñ unlabeled examples x̃ drawn iid from Q. L outputs h and S. The adversary (or nature) chooses Q based only on P, f, and knowledge of the learning algorithm L. The definition of PQ learning is given in Definition 1.1. Performance is measured in terms of err_Q and rej_P on future examples from Q and P (rather than the more obvious rej_Q). Rejection rates on P (and Q) can be estimated from held-out data, if so desired. The quantities are related: a small rej_P implies few rejections on future examples from Q wherever Q “overlaps” with P, by which we mean points x where Q(x) ≤ C · P(x) for some constant C.
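The overlap remark can be checked numerically. The following toy computation (ours, with arbitrary finite distributions) verifies that on any region where Q puts at most C times the mass of P, the Q-mass of rejections is at most C times the P-rejection rate:

```python
# A quick numeric check (ours) of the overlap remark: on points where
# Q(x) <= C * P(x), the Q-mass of the rejected region is at most
# C times rej_P.

C = 2.0
P = {0: 0.5, 1: 0.3, 2: 0.2}
Q = {0: 0.2, 1: 0.6, 2: 0.2}

overlap = {x for x in P if Q[x] <= C * P[x]}  # here: all of {0, 1, 2}
R = {1, 2}                                    # rejected points
rej_P = sum(P[x] for x in R)
rej_Q_overlap = sum(Q[x] for x in R & overlap)

assert rej_Q_overlap <= C * rej_P + 1e-12
```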

###### Lemma 4.1.

For any h|S and any distributions P, Q over X:

rej_Q(h|S) ≤ rej_P(h|S) + |P − Q|_TV.   (3)

Further, for any f, h, and S,

err_P(h|S) ≤ err_Q(h|S) + |P − Q|_TV.   (4)

###### Proof.

For eq. 3, note that one can sample a point from Q by first sampling x ∼ P and then, with probability |P − Q|_TV, replacing it; this coupling follows from the definition of total variation distance. Thus, the probability that a point from Q is rejected is at most the probability that a point from P is rejected plus the probability of replacement, establishing eq. 3. To see eq. 4, note

err_P(h|S) − err_Q(h|S) = Pr_{x∼P}[ x ∈ S ∧ h(x) ≠ f(x) ] − Pr_{x∼Q}[ x ∈ S ∧ h(x) ≠ f(x) ].

Clearly the above is at most |P − Q|_TV, as it is the difference of the probabilities of the same event under P and Q. ∎

If rej_P(h|S) = 0, then all x that lie in P’s support would necessarily be classified (i.e., supp(P) ⊆ S). Note that the bound of eq. 3 can be quite loose; a tighter bound is given in Appendix G.

It is also worth mentioning that a PQ-learner can be used to guarantee a small err_P as well, meaning that it has accuracy with respect to P (like a normal PAC learner) but is simultaneously robust to Q. The following claim shows this (by taking P itself as one of the test distributions), along with an additional property: PQ learners can be made robust with respect to any polynomial number of different Q’s.

###### Claim 4.2.

Let P and Q_1, …, Q_k be distributions over X. Given an (ε, δ)-PQ-learner L for F, training data x ∼ P^n with labels y = f(x), and additional unlabeled samples from each Q_i, one can generate h|S such that, with probability at least 1 − δ, h|S satisfies rej_P(h|S) ≤ ε and err_{Q_i}(h|S) ≤ kε for every i ∈ [k].

###### Proof of Claim 4.2.

Let Q̄ = (1/k) Σ_i Q_i be the blended distribution. Given samples from each of Q_1, …, Q_k, one can straightforwardly construct iid samples from Q̄ (draw i ∈ [k] uniformly and take a fresh sample from Q_i). Running L on x, y, and these samples gives the guarantee that err_Q̄(h|S) ≤ ε and rej_P(h|S) ≤ ε with probability at least 1 − δ, which implies the claim since err_{Q_i}(h|S) ≤ k · err_Q̄(h|S). ∎
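The blending step can be sanity-checked numerically: under a uniform mixture of k distributions, no component assigns any event more than k times the mixture's probability. A toy check (ours, with arbitrary two-point distributions):

```python
# A small numeric check (ours) of the mixture inequality behind the
# blending argument: for the uniform mixture Qbar of k distributions,
# any event's Q_i-probability is at most k times its Qbar-probability,
# so err_{Q_i} <= k * err_{Qbar}.

k = 3
Qs = [{0: 0.7, 1: 0.3}, {0: 0.2, 1: 0.8}, {0: 0.5, 1: 0.5}]
Qbar = {x: sum(Q[x] for Q in Qs) / k for x in (0, 1)}

E = {1}  # any event, e.g. "classified and misclassified"
for Q in Qs:
    p_i = sum(Q[x] for x in E)
    p_bar = sum(Qbar[x] for x in E)
    assert p_i <= k * p_bar + 1e-12
```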

### 4.2 Transductive setting with white-box adversary

In the transductive setting, there is no Q; instead, empirical analogs of the error and rejection rates are defined as follows, for an arbitrary sequence z = (z_1, …, z_ñ) ∈ X^ñ:

err_z(h|S) := (1/ñ) |{ i : z_i ∈ S ∧ h(z_i) ≠ f(z_i) }|,   (5)

rej_z(h|S) := (1/ñ) |{ i : z_i ∉ S }|.   (6)

Again, the subscripts and arguments may be omitted when clear from context.

In this setting, the learner first chooses h using only x and y. Then, a true test set z ∼ Q^ñ is drawn. Based on all available information (h, x, y, z, and the code for learner L), the adversary modifies any number of the examples in z to create an arbitrary test set x̃ ∈ X^ñ. Finally, the learner chooses S based on x, y, and x̃. Performance is measured in terms of err_x̃ and rej_z rather than rej_x̃, because an adversary can always force the rejection of the examples it has modified. One can bound rej_x̃ in terms of rej_z, for any z and x̃, as follows:

rej_x̃(h|S) ≤ rej_z(h|S) + Δ(z, x̃)/ñ.   (7)

The Hamming distance Δ(z, x̃) := |{ i : z_i ≠ x̃_i }| is the transductive analog of |P − Q|_TV. The following bounds the “false rejections,” those unmodified examples that are rejected:

(1/ñ) |{ i : z_i = x̃_i ∧ x̃_i ∉ S }| ≤ rej_z(h|S).   (8)
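The empirical rates and the Hamming-distance bound can be illustrated on a toy instance (ours; the points and the rejection region are arbitrary):

```python
# An illustrative check (ours) of the transductive bounds: rejections on
# the modified test set exceed rejections on the original draw z by at
# most the fraction of modified coordinates, and false rejections
# (unmodified points that are rejected) are at most the rejection rate
# on z.

z  = [0.1, 0.9, 0.3, 0.8]        # true iid test draw
xt = [0.1, 0.9, 0.55, 0.55]      # adversary modified two coordinates
in_S = lambda x: not (0.5 <= x <= 0.6)  # rejection region

rej = lambda pts: sum(not in_S(p) for p in pts) / len(pts)
ham = sum(a != b for a, b in zip(z, xt)) / len(z)

assert rej(xt) <= rej(z) + ham + 1e-12
false_rej = sum(a == b and not in_S(a) for a, b in zip(z, xt)) / len(z)
assert false_rej <= rej(z)
```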

#### White-box adversaries

The all-powerful transductive adversary is sometimes called “white box” in the sense that it can choose its examples while looking “inside” h, rather than using h as a black box. While it cannot choose x̃ with knowledge of S, it can know what S will be as a function of x̃ if the learner is deterministic, as our algorithms are. Also, we note that the generalization analysis may be extended to a white-box model where the adversary chooses Q knowing h, but it is cumbersome even to denote probabilities over Q when Q itself can depend on h.

## 5 Algorithms and guarantees

We assume that we have a deterministic oracle ERM which, given a set of labeled examples, outputs a classifier in F of minimal error on them. Figure 2 describes our algorithm Rejectron. It takes as input a set of labeled training examples (x, y), where x ∈ X^n and y_i = f(x_i), and a set of test examples x̃ ∈ X^ñ, along with an error parameter ε > 0 that trades off errors and rejections. A value of ε that theoretically balances these is given in Theorems 5.3 and 5.2.

###### Lemma 5.1 (Computational efficiency).

For any x ∈ X^n, y ∈ {0, 1}^n, x̃ ∈ X^ñ, and ε > 0, Rejectron outputs h|S_{T+1} for T ≤ ⌊1/ε⌋. Further, each iteration can be implemented using one call to ERM on an artificial dataset of O(nñ) examples, together with at most ñ evaluations of classifiers in F.

###### Proof.

To maximize s_t using the ERM oracle for F, construct an artificial dataset consisting of each training example x_i, labeled by y_i and repeated an appropriate number w of times (matching the weight that s_t places on training error), together with each test example x̃_j ∈ S_t, labeled 1 − h(x̃_j) and included just once. Running ERM on this artificial dataset returns a classifier of minimal error on it. But the number of errors of a classifier c on this artificial dataset is

w · |{ i : c(x_i) ≠ y_i }| + |{ j : x̃_j ∈ S_t ∧ c(x̃_j) = h(x̃_j) }|,

which equals |{ j : x̃_j ∈ S_t }| minus an increasing affine function of s_t(c). Hence c minimizes error on this artificial dataset if and only if it maximizes the objective s_t of the algorithm.

Next, let T be the number of rejection iterations of the algorithm, so its output is h|S_{T+1}. We must show that T ≤ ⌊1/ε⌋. To this end, note that, by definition, for every t ≤ T it holds that s_t(c_t) > ε, and moreover,

(1/ñ) |{ j : x̃_j ∈ S_t ∧ c_t(x̃_j) ≠ h(x̃_j) }| ≥ s_t(c_t) > ε.   (9)

Hence, the fraction of test examples additionally rejected in each iteration is greater than ε, and hence T < 1/ε. Since T is an integer, this means that T ≤ ⌊1/ε⌋.

For efficiency, of course each S_t is not explicitly stored, since even X could be infinite. Instead, note that to execute the algorithm, we only need to maintain: (a) the subset of indices of test examples which are in the prediction set, and (b) the classifiers h, c_1, …, c_t. Also note that updating S_{t+1} from S_t requires evaluating c_t at most ñ times. In this fashion, membership in S_t can be computed efficiently and the output h|S represented in a succinct manner. ∎
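The reduction in the proof can be checked end-to-end on a toy class. The weight `w` below is our own integer stand-in for the algorithm's weighting, and the class, points, and names are ours; the point is only that the ERM minimizer on the artificial dataset coincides with the maximizer of the disagreement-minus-weighted-training-error objective:

```python
# A toy check (ours) of the artificial-dataset reduction: minimizing
# errors on (w copies of each labeled training example) + (each surviving
# test example labeled opposite to h) maximizes
# (test disagreement with h) - w * (training errors).

F = [lambda x, t=t: int(x >= t) for t in [0.0, 0.3, 0.6, 1.0]]
train, labels = [0.1, 0.2, 0.8], [0, 0, 1]
h = lambda x: int(x >= 0.3)
test_in_S = [0.4, 0.5, 0.9]
w = 10  # weight on training errors (our stand-in)

def artificial_errors(c):
    e_train = sum(c(x) != y for x, y in zip(train, labels))
    e_test = sum(c(x) != 1 - h(x) for x in test_in_S)  # penalize agreeing with h
    return w * e_train + e_test

def objective(c):  # disagreement with h on test, minus weighted train error
    dis = sum(c(x) != h(x) for x in test_in_S)
    return dis - w * sum(c(x) != y for x, y in zip(train, labels))

best_erm = min(F, key=artificial_errors)
best_obj = max(F, key=objective)
```

Since `artificial_errors(c)` equals `len(test_in_S) - objective(c)` by construction, the two optimizers agree.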

Note that, since we assume ERM is deterministic, the algorithm Rejectron is also deterministic. This efficient reduction to ERM, together with the following theorem, implies that Rejectron is a PQ learner:

###### Theorem 5.2 (PQ guarantees).

For any f ∈ F and any distributions P, Q over X, with high probability Rejectron outputs h|S satisfying

err_Q(h|S) ≤ ε̃ and rej_P(h|S) ≤ ε̃,

where ε̃ = Õ(√(d/n)) and d is the VC dimension of F.

More generally, Theorem A.5 shows that, by varying the parameter ε, one can achieve other trade-offs between errors and rejections. The analogous transductive guarantee is:

###### Theorem 5.3 (Transductive).

For any f ∈ F and distributions P, Q over X: