# A General Method for Robust Learning from Batches

## Abstract

In many applications, data is collected in batches, some of which are corrupt or even adversarial. Recent work derived optimal robust algorithms for estimating discrete distributions in this setting. We consider a general framework of robust learning from batches, and determine the limits of both classification and distribution estimation over arbitrary, including continuous, domains. Building on these results, we derive the first robust agnostic computationally-efficient learning algorithms for piecewise-interval classification, and for piecewise-polynomial, monotone, log-concave, and Gaussian-mixture distribution estimation.

## 1 Introduction

### 1.1 Motivation

In many learning applications, some samples are inadvertently or maliciously corrupted. A simple and intuitive example shows that this erroneous data limits the extent to which a distribution can be learned, even with infinitely many samples. Consider that could be one of two possible binary distributions: and . Given any number of samples from , an adversary who observes a fraction of the samples and can determine the rest, could use the observed samples to learn , and set the remaining samples to make the distribution always appear to be . Even with arbitrarily many samples, any estimator for fails to decide which is in effect, hence incurs a total-variation (TV) distance , that we call the adversarial lower bound.
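The indistinguishability argument above can be checked with a few lines of arithmetic. The sketch below assumes, purely for illustration, that the two candidates are Bernoulli distributions with success probabilities 0 and `beta`, where `beta` is the corrupted fraction; the concrete pair, the value of `beta`, and the name `observed_success_rate` are assumptions rather than details taken from the text.

```python
import math

def observed_success_rate(true_rate, beta, adversary_rate):
    """Success rate seen when a beta fraction of the samples is adversarial."""
    return (1 - beta) * true_rate + beta * adversary_rate

beta = 0.1  # assumed corrupted fraction

# Candidate A: genuine samples are always 0; the adversary sets its share to 1.
seen_if_a = observed_success_rate(0.0, beta, 1.0)

# Candidate B: genuine samples are 1 with probability beta; the adversary mimics them.
seen_if_b = observed_success_rate(beta, beta, beta)

# TV distance between Bernoulli(0) and Bernoulli(beta).
tv_between_candidates = abs(0.0 - beta)

print(math.isclose(seen_if_a, seen_if_b), tv_between_candidates)
```

Both cases produce the same observed distribution, so no estimator can tell the candidates apart. By the triangle inequality, any estimate is then at TV distance at least half of `tv_between_candidates` from one of them.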

The example may seem to suggest the pessimistic conclusion that if an adversary can corrupt a fraction of the data, a TV-loss of is inevitable. Fortunately, that is not necessarily so.

In the following applications, and many others, data is collected in batches, most of which are genuine, but some possibly corrupted. Data may be gathered by sensors, each providing a large amount of data, and some sensors may be faulty. The word frequency of an author may be estimated from several large texts, some of which are mis-attributed. Or user preferences may be learned by querying several users, but some users may intentionally bias their feedback. Interestingly, for data arriving in batches, more can be said even when a -fraction of the batches is corrupted.

Recently, Qiao and Valiant (2017) formalized the problem for discrete domains. They considered estimating a distribution over in TV-distance when the samples are provided in batches of size . A total of batches are provided, of which a fraction may be arbitrarily and adversarially corrupted, while in every other batch the samples are drawn according to a distribution satisfying , allowing for the possibility that slightly different distributions generate samples in each batch.

For , they derived an estimation algorithm that approximates any over a discrete domain to TV-distance , surprisingly, much lower than the individual samples limit of . They also derived a matching lower bound, showing that even for binary distributions, for any number of batches, and hence for general discrete distributions, the lowest achievable total variation distance is . We refer to this result as the adversarial batch lower bound.

Their estimator requires batches of samples, or equivalently samples in total, which is not always optimal. It also runs in time exponential in the domain size, rendering it impractical.

Recently, Chen et al. (2019) reduced the exponential time complexity. Allowing quasi-polynomially many samples, they derived an estimator that achieves TV distance and runs in quasi-polynomial time. When a sufficiently larger distance is permitted, their estimator has polynomial time and sample complexities. Concurrently, Jain and Orlitsky (2019) derived a polynomial-time, hence computationally efficient, estimator that achieves the same TV distance, and for domain size uses the optimal samples.

When learning general distributions in TV-distance, the sample complexity’s linear dependence on the domain size is inevitable even when all samples are genuine. Hence, learning general distributions over large discrete, let alone continuous domains, is infeasible. To circumvent this difficulty, Chen et al. (2019) considered robust batch learning of structured discrete distributions, and studied the class of -piecewise degree- polynomials over the discrete set .

They first reduced the noise with respect to an distance described later, and used existing methods on this cleaned data to estimate the distribution. This allowed them to construct an estimator that approximates these distributions with number of batches that grows only poly-logarithmically in the domain size . Yet this number still grows with , and is quasi-polynomial in other parameters , , batch size , and . Additionally, its computational complexity is quasi-polynomial in these parameters and the domain size . Part of our paper generalizes and improves this technique.

The above results suffer from several shortcomings. While for general distributions there are sample-optimal polynomial-time algorithms, for structured distributions existing algorithms have suboptimal quasi-polynomial sample and time complexity. Furthermore, both their sample and time complexities grow to infinity in the domain size, making them impractical for many complex applications, and essentially impossible for the many practical applications with continuous domains such as or .

This leaves several natural questions. For sample efficiency, can distributions over non-discrete spaces be estimated to the adversarial batch lower bound using finitely many samples, and if so, what is their sample complexity? For computational efficiency, are there estimators whose computational complexity is independent of the domain size, and can their run time be polynomial rather than quasi-polynomial in the other parameters? More broadly, can similar robustness results be derived for other important learning scenarios, such as classification? And most importantly, is there a more general theory of robust learning from batches?

### 1.2 Summary of techniques and contributions

To answer these questions, we first briefly foray into VC theory. Consider estimation of an unknown target distribution to a small -distance, where is a family of subsets with finite VC-dimension. Without adversarial batches, the empirical distribution of samples from estimates it to a small -distance. When some of the batches are adversarial, the empirical distribution could be far from . We construct an algorithm that “cleans” the batches and returns a sub-collection of batches whose empirical distribution approximates to near optimal -distance.

While the algorithm is near sample optimal, as expected from the setting’s broad generality, for some subset families it is necessarily not computationally efficient. We then consider the natural and important family of all unions of at most intervals in . We provide a computationally efficient algorithm that estimates distributions to near-optimal distance and requires only a small factor more samples than the best possible.

Building on these techniques, we return to estimation in total variation (TV) distance. We consider the family of distributions whose Yatracos Class Yatracos (1985) has finite VC dimension. This family consists of both discrete and continuous distributions, and includes piecewise polynomials, Gaussians in one or more dimensions, and arguably most practical distribution families. We provide a nearly-tight upper bound on the TV-distance to which these distributions can be learned robustly from batches.

Here too, the algorithms’ broad generality makes them computationally inefficient for some distribution classes. For one-dimensional -piecewise degree- polynomials, we derive a polynomial-time algorithm whose sample complexity has optimal linear dependence on and moderate dependence on other parameters. This is the first efficient algorithm for robust learning of general continuous distributions from batches.

The general formulation also allows us to extend robust distribution-estimation results to other learning tasks. We apply this framework to derive the first robust classification results, where the goal is to minimize the excess risk in comparison to the best hypothesis, in the presence of adversarial batches. We obtain tight upper bounds on the excess risk and number of samples required to achieve it for general binary classification problems. We then apply the results to derive a computationally efficient algorithm for hypotheses consisting of one-dimensional intervals using only samples.

The rest of the paper is organized as follows. Section 2 describes the paper’s main technical results and their applications to distribution estimation and classification. Section 3 introduces basic notation and techniques. Section 4 recounts basic tools from VC theory used to derive the results. Section 5 derives a framework for robust distribution estimation in -distance from corrupt and adversarial sample batches, and obtains upper bounds on the estimation accuracy and sample complexity. Finally, Section 6 develops computationally efficient algorithms for learning in distance.

### 1.3 General related work

The current results extend several long lines of work on estimating structured distributions, including O’Brien (2016); Diakonikolas (2016); Ashtiani and Mehrabian (2018). The results also relate to classical robust-statistics work Tukey (1960); Huber (1992). There has also been significant recent work leading to practical distribution learning algorithms that are robust to adversarial contamination of the data. For example, Diakonikolas et al. (2016); Lai et al. (2016) presented algorithms for learning the mean and covariance matrix of high-dimensional sub-gaussian and other distributions with bounded fourth moments in the presence of adversarial samples. Their estimation guarantees are typically in terms of , and do not yield the - distance results required for discrete distributions.

The work was extended in Charikar et al. (2017) to the case when more than half of the samples are adversarial. Their algorithm returns a small set of candidate distributions, one of which is a good approximation of the underlying distribution. For a more extensive survey of robust learning algorithms in the continuous setting, see Steinhardt et al. (2017); Diakonikolas et al. (2019).

## 2 Results

We consider learning from batches of samples, when a fraction of batches are adversarial.

More precisely, is a collection of batches, composed of two unknown sub-collections: a good sub-collection of good batches, where each batch consists of independent samples, all distributed according to the same distribution satisfying , and an adversarial sub-collection of the remaining batches, each consisting of the same number of arbitrary elements, that for simplicity we call samples as well. Note that the adversarial samples may be chosen in any way, including after observing the good samples.

Section 2.1 of Jain and Orlitsky (2019) shows that for discrete domains, results for the special case , where all batch distributions are the target distribution , can be easily extended to the general case. The same can be shown for our more general result, hence for simplicity we assume that .

The next subsection describes our main technical results for learning in distance. The subsections thereafter derive applications of these results for learning distributions in total variation distance and for binary classification.

### 2.1 Estimating distributions in distance

Let be a family of subsets of a domain . The -distance between two distributions and over is the largest difference between the probabilities and assign to any subset in ,

The -distance clearly generalizes the total-variation and distances. For the collection of all subsets of ,

Our goal is to use samples generated by a target distribution to approximate it to a small -distance. For general families , this goal cannot be accomplished even with just good batches. Let be the collection of all subsets of the real interval domain . For any total number of samples, with high probability, it is impossible to distinguish the uniform distribution over from a uniform discrete distribution over a random collection of elements in . Hence any estimator must incur TV-distance 1 for some distribution.
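The family-based distance described above can be sketched for a finite domain. The code assumes the definition given in the prose (the largest absolute difference between the probabilities two distributions assign to any subset in the family); the names `mass`, `family_distance`, and `all_subsets`, and the toy distributions, are hypothetical.

```python
from itertools import combinations

def mass(dist, subset):
    """Probability that dist (a dict element -> probability) assigns to subset."""
    return sum(dist.get(x, 0.0) for x in subset)

def family_distance(p, q, family):
    """Largest absolute difference in mass over the subsets in family."""
    return max(abs(mass(p, s) - mass(q, s)) for s in family)

def all_subsets(domain):
    """Every subset of a finite domain, as frozensets."""
    items = list(domain)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

p = {"a": 0.5, "b": 0.5, "c": 0.0}
q = {"a": 0.25, "b": 0.25, "c": 0.5}

# With the family of all subsets, the distance coincides with the TV distance.
tv = family_distance(p, q, all_subsets(p))

# A smaller family can only give a smaller (or equal) distance.
coarse = family_distance(p, q, [frozenset({"a"})])

print(tv, coarse)
```

Restricting the family is what makes estimation feasible over large or continuous domains: fewer subsets have to be matched simultaneously.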

This difficulty is addressed by Vapnik-Chervonenkis (VC) Theory. The collection shatters a subset if every subset of is the intersection of with a subset in . The VC-dimension of is the size of the largest subset shattered by .
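For small finite examples, shattering and VC dimension can be verified by brute force. The sketch below follows the definition in the preceding paragraph; the function names and the toy family of integer intervals are assumptions chosen for illustration, and the exhaustive search is only practical for tiny domains.

```python
from itertools import combinations

def is_shattered(points, family):
    """True if every subset of points is the intersection of points with a set in family."""
    points = frozenset(points)
    traces = {points & s for s in family}
    return len(traces) == 2 ** len(points)

def vc_dimension(domain, family):
    """Size of the largest subset of domain shattered by family (exhaustive search)."""
    best = 0
    for k in range(1, len(domain) + 1):
        if any(is_shattered(c, family) for c in combinations(domain, k)):
            best = k
        else:
            # Subsets of a shattered set are shattered, so no larger set can work.
            break
    return best

domain = [0, 1, 2, 3]
# All contiguous integer intervals within the domain, including the empty set.
intervals = {frozenset(range(i, j)) for i in range(5) for j in range(i, 5)}

print(vc_dimension(domain, intervals))
```

Single intervals shatter any pair of points but no triple, because the two outer points cannot be selected without the middle one.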

Let , be independent samples from a distribution . The empirical probability of is

The fundamental uniform deviation inequality of VC theory Vapnik and Chervonenkis (1971); Talagrand (1994) states that if has finite VC-dimension , then estimates well in distance. For all , with probability ,

It can also be shown that achieves the lowest possible -distance, that we call the information-theoretic limit.

In the adversarial-batch scenario, a fraction of the batches may be corrupted. It is easy to see that for any number of batches, however large, the adversary can cause to approximate to -distance , namely .

Let be the empirical distribution induced by the samples in a collection . Our first result states that if has a finite VC-dimension, for batches, can be “cleaned” to a sub-collection where , recovering with a simple empirical estimator.

###### Theorem 1.

For any , , , , and , there is an algorithm that with probability returns a sub-collection such that and

The -distance bound matches the adversarial limit up to a small factor. The bound on the number of batches required to achieve this bound is also tight up to a logarithmic factor.

The theorem applies to all families with finite VC dimension, and like most other results of this generality, it is necessarily non-constructive in nature. Yet it provides a road map for constructing efficient algorithms for many specific natural problems. In Section 6 we use this approach to derive a polynomial-time algorithm that learns distributions with respect to one of the most important and practical VC classes, where , and is the collection of all unions of at most intervals.

###### Theorem 2.

For any , , , , and , there is an algorithm that runs in time polynomial in all parameters, and with probability returns a sub-collection , such that and

The sample complexities in both theorems are independent of the domain and depend linearly on the VC dimension of the family .

### 2.2 Approximating distributions in total-variation distance

Our ultimate objective is to estimate the target distribution in total variation (TV) distance, one of the most common measures in distribution estimation. In this and the next subsection, we follow a framework developed in Devroye and Lugosi (2001), see also Diakonikolas (2016).

The sample complexity of estimating distributions in TV-distance grows with the domain size, becoming infeasible for large discrete domains and impossible for continuous domains. A natural approach to address this intractability is to assume that the underlying distribution belongs to, or is near, a structured class of distributions.

Let be the TV-distance of from the closest distribution in . For example, for , . Given , we try to use samples from to find an estimate such that, with probability ,

for a universal constant , namely, to approximate about as well as the closest distribution in .

Following Devroye and Lugosi (2001), we utilize a connection between distribution estimation and VC dimension. Let be a class of distributions over . The Yatracos class Yatracos (1985) of is the family of subsets

It is easy to verify that for distributions ,

The Yatracos minimizer of a distribution is its closest distribution, by -distance, in ,

where ties are broken arbitrarily. Using this definition and equations, and a sequence of triangle inequalities, Theorem 6.3 in Devroye and Lugosi (2001) shows that, for any distributions , , and any class ,

(1)

Therefore, given a distribution that approximates in -distance, it is possible to find a distribution in approximating in TV-distance. In particular, when , the opt term is zero.
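The Yatracos construction can be sketched for a finite class over a finite domain. The code assumes the standard form of the Yatracos class (for each ordered pair of candidates, the set of points where the first assigns strictly more mass than the second) and selects the candidate closest to a given distribution in the induced distance. All names and the toy candidates are hypothetical, and at least two distinct candidates are assumed so that the family is non-empty.

```python
from itertools import permutations

def mass(dist, subset):
    """Probability that dist (a dict element -> probability) assigns to subset."""
    return sum(dist.get(x, 0.0) for x in subset)

def yatracos_class(candidates, domain):
    """Sets {x : p(x) > q(x)} over all ordered pairs of candidates."""
    return {
        frozenset(x for x in domain if p.get(x, 0.0) > q.get(x, 0.0))
        for p, q in permutations(candidates, 2)
    }

def yatracos_minimizer(candidates, target, domain):
    """Candidate closest to target in the distance induced by the Yatracos class."""
    family = yatracos_class(candidates, domain)

    def distance(p):
        return max(abs(mass(p, s) - mass(target, s)) for s in family)

    return min(candidates, key=distance)  # ties broken by list order

domain = [0, 1]
c1 = {0: 0.9, 1: 0.1}
c2 = {0: 0.5, 1: 0.5}
estimate_of_target = {0: 0.8, 1: 0.2}  # e.g. an empirical distribution

chosen = yatracos_minimizer([c1, c2], estimate_of_target, domain)
print(chosen)
```

Only the masses of the Yatracos sets matter to the selection, which is why a good approximation in that distance is enough to pick a good candidate.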

If the Yatracos class has finite VC dimension, the VC uniform deviation inequality ensures that for the empirical distribution of i.i.d. samples from , decreases to zero, and can be used to approximate in TV-distance. This general method has led to many sample- and computationally-efficient algorithms for estimating structured distributions in TV-distance.

However, as discussed earlier, with a -fraction of adversarial batches, the empirical distribution of all samples can be at a -distance as large as from , leading to a large TV-distance.

Yet Theorem 1 shows that data can be “cleaned” to remove outlier batches and retain batches whose empirical distribution approximates to a much smaller -distance of . Combined with Equation (1), we obtain a much better approximation of in total variation distance.

###### Theorem 3.

For a distribution class with Yatracos Class of finite VC dimension , for any , , , and , there is an algorithm that with probability returns a distribution such that

The estimation error achieved in the theorem for TV-distance matches the lower bound up to a small logarithmic factor of , and is valid for any class whose Yatracos class has finite VC dimension.

Moreover, the upper bound on the number of samples (or batches) required by the algorithm to estimate to the above distance matches, up to a log factor, a similar general upper bound obtained for the non-adversarial setting. This result shows for the first time that it is possible to learn a wide variety of distributions robustly using batches, even over continuous domains.

### 2.3 Learning univariate structured distributions

We apply the general results in the last two subsections to estimate distributions over the real line. We start with one of the most studied, and important, distribution families, the class of piecewise-polynomial distributions, and then observe that it can be generalized to even broader classes.

A distribution over is -piecewise, degree-, if there is a partition of into intervals , and degree- polynomials such that and , . The definition extends naturally to discrete distributions over .

Let denote the collection of all -piecewise degree- distributions. is interesting in its own right, as it contains important distribution classes such as histograms. In addition, it approximates other important distribution classes, such as monotone, log-concave, Gaussians, and their mixtures, arbitrarily well, e.g., Acharya et al. (2017).

Note that for any two distributions , the difference is a -piecewise degree- polynomial, hence every set in the Yatracos class of ,

is the union of at most intervals in . Therefore, . And since for any , has VC dimension .

Theorem 3 can then be applied to show that any target distribution can be estimated by a distribution in to a TV-distance that is within a small factor of the adversarial lower bound, using a number of samples, and hence batches, that is within a logarithmic factor of the information-theoretic lower bound Chan et al. (2014).

###### Corollary 4.

Let be a distribution over . For any , , , , , and , there is an algorithm that with probability returns a distribution such that

Next we provide a polynomial-time algorithm for estimating to the same TV-distance, but with an extra factor in sample complexity.

Theorem 2 provides a polynomial time algorithm that returns a sub-collection of batches whose empirical distribution is close to in -distance. Acharya et al. (2017) provides a polynomial time algorithm that for any distribution returns a distribution in minimizing to an additive error. Then Equation (1) and Theorem 2 yield the following result.

###### Theorem 5.

Let be any distribution over . For any , , , , , and , there is a polynomial time algorithm that with probability returns a distribution such that

### 2.4 Binary classification

The framework developed in this paper extends beyond distribution estimation. Here we describe its application to binary classification. Consider a family of Boolean functions, and a distribution over . Let , where and . The loss of hypothesis for distribution is

The optimal classifier for distribution is

and the optimal loss is

The goal is to return a hypothesis whose loss is close to the optimal loss .

Consider the following natural extension of VC-dimension from families of subsets to families of Boolean functions. For a boolean-function family , define the family

of subsets of , and let the VC dimension of be .

The largest difference between the loss of a classifier for two distributions and over is related to their -distance,

(2)

The next simple lemma, proved in the appendix, upper bounds the excess loss of the optimal classifier in for a distribution for another distribution in terms of distance between the distributions.

###### Lemma 6.

For any two distributions and and hypothesis class ,

When is the empirical distribution of non-adversarial i.i.d. samples from , is called the empirical risk minimizer, and the excess loss of the empirical risk minimizer in the above equation goes to zero if the VC dimension of is finite.

Yet as discussed earlier, when a -fraction of the batches, and hence samples, are chosen by an adversary, the empirical distribution of all samples can be at a large -distance from , leading to an excess classification loss up to for the empirical-risk minimizer.

Theorem 1 states that the collection of batches can be “cleaned” to obtain a sub-collection whose empirical distribution has a lower -distance from . The above lemma then implies that the optimal classifier for the empirical distribution of the cleaner batches will have a small excess risk for as well. The resulting non-constructive algorithm has excess risk and sample complexity that are optimal to a logarithmic factor.

###### Theorem 7.

For any , , , , and , there is an algorithm that with probability returns a sub-collection such that and

To derive a computationally efficient algorithm, we focus on the following class of binary functions. For let denote the collection of all binary functions over whose decision region, namely values mapping to 1, consists of at most -intervals. The VC dimension of is clearly .

Theorem 2 describes a polynomial time algorithm that returns cleaner data w.r.t. distance. From Lemma 6, the hypothesis that minimizes the loss for the empirical distribution of this cleaner data will have a small excess loss. Furthermore, Maass (1994) derived a polynomial time algorithm to find the hypothesis that minimizes the loss for a given empirical distribution. Combining these results, we obtain a computationally efficient classifier in that achieves the excess loss in the above theorem.
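The final step, minimizing the empirical loss over interval hypotheses, can be illustrated with a brute-force search. This is a minimal sketch for the single-interval case only, and it is not the algorithm of Maass (1994). It assumes the cleaned samples are already available as parallel lists `xs` and `ys`, and the name `interval_erm` is hypothetical.

```python
def interval_erm(xs, ys):
    """Return (loss, interval) minimizing the empirical 0-1 loss over hypotheses
    that predict 1 exactly on a closed interval [a, b]; interval is None when
    predicting 0 everywhere is at least as good."""
    n = len(xs)
    best_loss = sum(ys) / n  # empty decision region: every positive label is an error
    best_interval = None
    endpoints = sorted(set(xs))  # endpoints at sample points suffice
    for i, a in enumerate(endpoints):
        for b in endpoints[i:]:
            errors = sum((a <= x <= b) != bool(y) for x, y in zip(xs, ys))
            if errors / n < best_loss:
                best_loss, best_interval = errors / n, (a, b)
    return best_loss, best_interval

xs = [0.1, 0.2, 0.3, 0.4, 0.5]
ys = [0, 1, 1, 0, 1]
loss, interval = interval_erm(xs, ys)
print(loss, interval)
```

The search is cubic in the number of samples, which is fine for illustration; a dynamic program over the sorted samples is the usual way to handle several intervals efficiently.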

###### Theorem 8.

For any , , , , and , there is a polynomial time algorithm that with probability returns a sub-collection such that and

## 3 Preliminaries

We introduce terminology that helps describe the approach and results. Some of the work builds on results in Jain and Orlitsky (2019), and we keep the notation consistent.

Recall that , , and are the collections of all-, good-, and adversarial-batches. Let , , and , denote sub-collections of all-, good-, and bad-batches. We also let denote a subset of the Borel -field .

Let denote the samples in a batch , and let denote the indicator random variable for a subset . Every batch induces an empirical measure over the domain , where for each ,

Similarly, any sub-collection of batches induces an empirical measure defined by

We use two different symbols to denote the empirical distribution defined by a single batch and by a sub-collection of batches, to make them easily distinguishable. Note that is the mean of the empirical measures defined by the batches .

Recall that is the batch size. For , let , the variance of a Binomial random variable. Observe that

(3)

where the second property follows as .

For , the random variables for are distributed i.i.d. , and since is their average,

For batch collection and subset , the empirical probability of will vary with the batch . The empirical variance of these empirical probabilities is
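The batch-level quantities introduced in this section can be sketched as follows. The code assumes equal-size batches, represents a subset by a membership predicate, and takes the empirical variance to be the average squared deviation of the per-batch probabilities from the collection-level probability. That normalization and all function names are assumptions for illustration.

```python
def batch_probability(batch, in_subset):
    """Fraction of the samples in one batch that fall in the subset."""
    return sum(1 for x in batch if in_subset(x)) / len(batch)

def collection_probability(batches, in_subset):
    """Mean of the per-batch empirical probabilities (batches of equal size)."""
    probs = [batch_probability(b, in_subset) for b in batches]
    return sum(probs) / len(probs)

def empirical_variance(batches, in_subset):
    """Average squared deviation of per-batch probabilities from their mean."""
    mean = collection_probability(batches, in_subset)
    return sum((batch_probability(b, in_subset) - mean) ** 2
               for b in batches) / len(batches)

batches = [[0.1, 0.2, 0.9, 0.8], [0.1, 0.6, 0.7, 0.9]]

def below_half(x):
    return x < 0.5

print(collection_probability(batches, below_half),
      empirical_variance(batches, below_half))
```

For good batches, this empirical variance should be close to the variance of a scaled Binomial, which is the kind of property a cleaning procedure can test for.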

## 4 Vapnik-Chervonenkis (VC) theory

We recall some basic concepts and results in VC theory, and derive some of their simple consequences that we use later in deriving our main results.

The VC shatter coefficient of is

the largest number of subsets of elements in obtained by intersections with subsets in . The VC dimension of is

the largest number of elements that are “fully shattered” by . The following lemma of Devroye and Lugosi (2001) bounds the shatter coefficient for a VC family of subsets.

###### Lemma 9 (Devroye and Lugosi (2001)).

For all , .

Next we state the VC-inequality for relative deviation Vapnik and Chervonenkis (1974); Anthony and Shawe-Taylor (1993).

###### Theorem 10.

Let be a distribution over , let be a VC-family of subsets of , and let denote the empirical distribution from i.i.d. samples from . Then for any , with probability ,

Another important ingredient commonly used in VC Theory is the concept of covering number that reflects the smallest number of subsets that approximate each subset in the collection.

Let be any probability measure over and be a family of subsets. A collection of subsets is an -cover of if for any , there exists a with . The -covering number of is

If is an -cover of , then is an -self-cover of .

The -self-covering number is

Clearly, . The next lemma establishes a reverse relation.

###### Lemma 11.

For any , .

###### Proof.

If , the lemma clearly holds. Otherwise, let be an -cover of size . We construct an -self-cover of equal or smaller size.

For every subset , there is a subset with . Otherwise, could be removed from to obtain a strictly smaller cover, which is impossible.

The collection has size , and it is an -self-cover of because for any , there is an with , and by the triangle inequality, . ∎
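A self-cover can be built greedily for a finite family over a finite domain. The sketch assumes that closeness is measured by the probability of the symmetric difference and that "covered" means this probability is at most `eps`; both the threshold convention and the function names are assumptions. The greedy result is a valid self-cover but is not guaranteed to be of minimal size.

```python
def sym_diff_mass(measure, s, t):
    """Probability that measure assigns to the symmetric difference of s and t."""
    return sum(measure[x] for x in s ^ t)

def greedy_self_cover(family, measure, eps):
    """Pick sets from family so that every set in family is within eps of a picked set."""
    cover = []
    for s in family:
        if not any(sym_diff_mass(measure, s, t) <= eps for t in cover):
            cover.append(s)
    return cover

measure = {x: 0.25 for x in range(4)}               # uniform over {0, 1, 2, 3}
prefixes = [frozenset(range(k)) for k in range(5)]  # {}, {0}, {0,1}, ...
cover = greedy_self_cover(prefixes, measure, 0.25)
print([sorted(s) for s in cover])
```

Every set either joins the cover or was within `eps` of an earlier member, so the output covers the whole family using only sets from the family itself.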

Let and be the largest covering numbers under any distribution.

The next theorem bounds the covering number of in terms of its VC-dimension.

###### Theorem 12 (Vaart and Wellner (1996)).

There exists a universal constant such that for any , and any family with VC dimension ,

Combining the theorem and Lemma 11, we obtain the following corollary.

###### Corollary 13.

For any distribution and family , let be any minimal-size -self-cover for of size .

## 5 A framework for distribution estimation from corrupted sample batches

We develop a general framework to learn in distance and derive Theorem 1. Recall that the distance between two distributions and is

The algorithms presented enhance the algorithm of Jain and Orlitsky (2019), developed for of a discrete domain , to any VC-family of subsets of any sample space . We retain the part of the analysis and notation that are common in our enhanced algorithm and the one presented in Jain and Orlitsky (2019).

At a high level, we remove the adversarial, or “outlier” batches, and return a sub-collection of batches whose empirical distribution is close to in distance. The uniform deviation inequality in VC theory states that the sub-collection of good batches has empirical distribution that approximates in distance, thereby ensuring the existence of such a sub-collection.

The family can be potentially uncountable, hence learning a distribution to a given distance may entail simultaneously satisfying infinitely many constraints. To decrease the constraints to a finite number, Corollary 13 shows that for any distribution and any , there exists a finite -cover of w.r.t. this distribution.

Our goal therefore is to find an -cover of w.r.t. an appropriate distribution such that if for some sub-collection the empirical distribution approximates in -distance it would also approximate in -distance. The two natural distribution choices are the target distribution or empirical distribution from its samples. Yet the distribution is unknown to us, and its samples provided in the collection of batches are corrupted by an adversary.

The next theorem overcomes this challenge by showing that although the collection includes adversarial batches, for small enough , for any -cover of w.r.t. the empirical distribution , a small -distance , between and the empirical distribution induced by a sub-collection would imply a small -distance , between the two distributions.

Note that the theorem allows the -cover of to include sets in the subset family containing .

###### Theorem 14.

For and , let be an -cover of family w.r.t. the empirical distribution . Then with probability , for any sub-collection of batches of size ,

###### Proof.

Consider any batch sub-collection . For every , by the triangle inequality,

(4)

Since is an -cover w.r.t. , for every there is an such that . For such pairs, we bound the second term on the right in the above equation.

(5)

Choosing in the above equation and using gives,

(6)

Then

with probability , where (a) uses equation (6) and (b) follows from Lemma 22. Combining equations (4) and (5) with the above equation completes the proof. ∎

The above theorem reduces the problem of estimating in distance to finding a sub-collection of at least batches such that for an -cover of w.r.t. distribution , the distance is small. If we choose a finite -cover of , the theorem would ensure that the number of constraints is finite.

To find a sub-collection of batches as suggested above, we show that with high probability, certain concentration properties hold for all subsets in . Note that the cover is chosen after seeing the samples in , but since , the results also hold for all subsets in .

The following discussion develops some notation and intuition that lead to these properties.

We start with the following observation. Consider a subset . For every good batch , has a sub-gaussian distribution with variance . Therefore, most of the good batches assign the empirical probability . Moreover, the empirical mean and variance of over converge to the expected values and , respectively.

In addition to the good batches, the collection of batches also includes an adversarial sub-collection of batches that constitute up to a fraction of . If the difference between and the average of over all adversarial batches is , namely comparable to the standard deviation of for the good batches , then the adversarial batches can change the overall mean of empirical probabilities by at most , which is within our tolerance. Hence, the mean of will deviate significantly from only in the presence of a large number of adversarial batches whose empirical probability differs from by .

To quantify this effect, for a subset let