# Adversarial Multi-Source PAC Learning

## Abstract

We study the problem of learning from multiple untrusted data sources, a scenario of increasing practical relevance given the recent emergence of crowdsourcing and collaborative learning paradigms. Specifically, we analyze the situation in which a learning system obtains datasets from multiple sources, some of which might be biased or even adversarially perturbed. It is known that in the single-source case, an adversary with the power to corrupt a fixed fraction of the training data can prevent PAC-learnability: even in the limit of infinite training data, no learning system can approach the optimal test error. In this work we show that, surprisingly, the same is not true in the multi-source setting, where the adversary can arbitrarily corrupt a fixed fraction of the data sources. Our main results are a generalization bound that provides finite-sample guarantees for this learning setting, as well as corresponding lower bounds. Besides establishing PAC-learnability, our results also show that in a cooperative learning setting, sharing data with other parties has provable benefits, even if some participants are malicious.

## 1 Introduction

An important problem of current machine learning research is to make learned systems more trustworthy. One particular aspect of this is robustness against data of unexpected or even adversarial nature. Robustness at prediction time has recently received a lot of attention, in particular with work on the detection of out-of-distribution conditions Hendrycks and Gimpel (2017); Liang et al. (2018); Lee et al. (2018) and protection against adversarial examples Raghunathan et al. (2018); Singh et al. (2018); Cohen et al. (2019). Robustness at training time, however, is represented less prominently, despite also being of great importance. One reason might be that learning from a potentially adversarial data source is very hard: a classic result states that when a fixed fraction of the training dataset is adversarially corrupted, successful learning in the PAC sense is no longer possible Kearns and Li (1993). In other words, there exists no robust learning algorithm that could overcome the effects of adversarial corruptions in a constant fraction of the training dataset and approach the optimal model, even in the limit of infinite data.

In this work, we study the question of robust learning in the multi-source case, i.e. when more than one dataset is available for training. This is a situation of increasing relevance in the era of big data, where machine learning models tend to be trained on very large datasets. To create these, one commonly distributes the task of collecting and annotating data, e.g. by relying on crowdsourcing services Sheng and Zhang (2019), or by adopting a collective or federated learning scenario McMahan and Ramage (2017).

Unfortunately, relying on data from other parties comes with the danger that some of the sources might produce data of lower quality than desired, be it due to negligence, bias or malicious behaviour. Consequently, the analogous question to the classic problem described above is the following, which we refer to as adversarial multi-source learning. Given a number of i.i.d. datasets, a constant fraction of which might have been adversarially manipulated, is there a learning algorithm that overcomes the effect of the corruptions and approaches the optimal model?

In this work, we study this problem formally and provide a positive answer. Specifically, our main result is an upper bound on the sample complexity of adversarial multi-source learning, which holds as long as fewer than half of the sources are manipulated (Theorem 1).

A number of interesting results follow as immediate corollaries. First, we show that any hypothesis class that is uniformly convergent and hence PAC-learnable in the classical i.i.d. sense is also PAC-learnable in the adversarial multi-source scenario. This is in stark contrast to the single-source situation where, as mentioned above, no non-trivial hypothesis class is robustly PAC-learnable. As a second consequence, we obtain the insight that in a cooperative learning scenario, every honest party can benefit from sharing their data with others, as compared to using their own data only, even if some of the participants are malicious.

Besides our main result we prove two additional theorems that shed light on the difficulty of adversarial multi-source learning. First, we prove that the naïve but common strategy of simply merging all data sources and training with some robust procedure on the joint dataset cannot result in a robust learning algorithm (Theorem 2). Second, we prove a lower bound on the sample complexity under very weak conditions (Theorem 3). This result shows that under adversarial conditions a slowdown of convergence is unavoidable, and that in order to approach optimal performance, the number of samples per source must necessarily grow, while increasing the number of sources need not help.

## 2 Related work

To our knowledge, our results are the first that formally characterize the statistical hardness of learning from multiple i.i.d. sources when a constant fraction of them might be adversarially corrupted. There are, however, a number of conceptually related works, which we discuss in the rest of this section.

Qiao and Valiant (2018), as well as the follow-up works of Chen et al. (2019); Jain and Orlitsky (2019), aim at estimating discrete distributions from multiple batches of data, some of which have been adversarially corrupted. The main difference to our results is the focus on finite data domains and estimating the underlying probability distribution rather than learning a hypothesis.

Qiao (2018) studies collaborative binary classification: a learning system has access to multiple training datasets and a subset of them can be adversarially corrupted. In this setup, the uncorrupted sources are allowed to have different input distributions, but share a common labelling function. The author proves that it is possible to robustly learn individual hypotheses for each source, but a single shared hypothesis cannot be learned robustly. For the specific case that all data distributions are identical, the setup matches ours, though only for binary classification in the realizable case, and with a different adversarial model.

In a similar setting, Mahloujifar et al. (2019) show, in particular, that an adversary can increase the probability of any "bad property" of the learned hypothesis by a term at least proportional to the fraction of manipulated sources. These results differ from ours in their assumption that different sources have different distributions, which renders the learning problem much harder.

In Konstantinov and Lampert (2019), a learning system has access to multiple datasets, some of which are manipulated, and the authors prove a generalization bound and propose an algorithm based on learning with a weighted combination of all datasets. The main difference to our work is that their proposed method crucially relies on a trusted subset of the data being known to the learner. Their adversary is also weaker, as it cannot influence the data points directly, but only change the distribution from which they are sampled, and the work also does not provide finite sample guarantees.

There are a number of classic results on the fundamental limits of PAC learning from a single labelled set of samples, a fraction of which can be arbitrarily corrupted, e.g. Kearns and Li (1993); Bshouty et al. (2002). We compare our results against this classic scenario in Section 4.1.

Another related general direction is the research on Byzantine-resilient distributed learning, which has seen significant interest recently, e.g. Blanchard et al. (2017); Chen et al. (2017); Yin et al. (2018, 2019); Alistarh et al. (2018). There the focus is on learning by exchanging gradient updates between nodes in a distributed system, an unknown fraction of which might be corrupted by an omniscient adversary and may behave arbitrarily. These works tend to design defences for specific gradient-based optimization algorithms, such as SGD, and their theoretical analysis usually assumes strict conditions on the objective function, such as convexity or smoothness. Nevertheless, the (nearly) tight sample complexity upper and lower bounds developed for Byzantine-resilient gradient descent Yin et al. (2018) and its stochastic variant Alistarh et al. (2018) are relevant to our results and are therefore discussed in detail in Sections 4.2 and 5.2.

The work of Awasthi et al. (2017) considers learning from crowdsourced data, where some of the workers might behave arbitrarily. However, they only focus on label corruptions. Feng (2017) considers the fundamental limits of learning from adversarial distributed data, but in the case when each of the nodes can iteratively send corrupted updates with a certain probability. Feng et al. (2014) provide a method for distributing the computation of any robust learning algorithm that operates on a single large dataset. There is also a large body of literature on attacks and defences for federated learning, e.g. Bhagoji et al. (2019); Fung et al. (2018). Apart from focusing on iterative gradient-based optimization procedures, these works also allow for natural variability in the distributions of the uncorrupted data sources.

## 3 Preliminaries

In this section we introduce the technical definitions that are necessary to formulate and prove our main results. We start by reminding the reader of the classical notion of PAC-learnability and uniform convergence, as they can be found in most machine learning textbooks. We then introduce the setting of learning from multiple sources and notions of adversaries of different strengths.

### 3.1 Notation and Background

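Let $\mathcal{X}$ and $\mathcal{Y}$ be given input and output sets, respectively, and $\mathcal{D}$ be a fixed but unknown probability distribution over $\mathcal{X} \times \mathcal{Y}$. By $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ we denote a loss function, and by $\mathcal{H} \subset \{h : \mathcal{X} \to \mathcal{Y}\}$ a set of hypotheses. All of these quantities are assumed arbitrary but fixed for the purpose of this work.

A (statistical) learner is a function $\mathcal{L} : \bigcup_{m=1}^{\infty} (\mathcal{X} \times \mathcal{Y})^m \to \mathcal{H}$. In the classic supervised learning scenario, the learner has access to a training set of $m$ labelled examples, $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$, sampled i.i.d. from $\mathcal{D}$, and aims at learning a hypothesis $h \in \mathcal{H}$ with small risk, i.e. expected loss, under the unknown data distribution,

$$R(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\big(\ell(h(x), y)\big). \tag{1}$$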

PAC-learnability is a key property of the hypothesis set, which ensures the existence of an algorithm that guarantees successful learning:

###### Definition 1 (PAC-Learnability).

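We call $\mathcal{H}$ (agnostic) probably approximately correct (PAC) learnable with respect to $\ell$, if there exists a learner $\mathcal{L}$ and a function $m_{\mathcal{H},\ell} : (0,1) \times (0,1) \to \mathbb{N}$, such that for any $\epsilon, \delta \in (0,1)$, whenever $S$ is a set of $m \geq m_{\mathcal{H},\ell}(\epsilon, \delta)$ i.i.d. labelled samples from $\mathcal{D}$, then with probability at least $1 - \delta$ over the sampling of $S$:

$$R(\mathcal{L}(S)) \leq \min_{h \in \mathcal{H}} R(h) + \epsilon. \tag{2}$$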

Another important concept related to PAC-learnability is that of uniform convergence.

###### Definition 2 (Uniform convergence).

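We say that $\mathcal{H}$ has the uniform convergence property with respect to $\ell$ with rate $s_{\mathcal{H},\ell}$, if there exists a function $s_{\mathcal{H},\ell} : \mathbb{N} \times (0,1) \times \bigcup_{m=1}^{\infty} (\mathcal{X} \times \mathcal{Y})^m \to \mathbb{R}$, such that for any distribution $\mathcal{D}$ and any $\delta \in (0,1)$:

• given $m$ samples $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$ drawn i.i.d. from $\mathcal{D}$, with probability at least $1 - \delta$ over the data $S$:

$$\sup_{h \in \mathcal{H}} |R(h) - \hat{R}(h)| \leq s_{\mathcal{H},\ell}(m, \delta, S), \tag{3}$$

where $\hat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} \ell(h(x_i), y_i)$ is the empirical risk of the hypothesis $h$.

• $s_{\mathcal{H},\ell}(m, \delta, S_m) \to 0$ as $m \to \infty$, for any sequence $(S_m)_{m \in \mathbb{N}}$ with $S_m \in (\mathcal{X} \times \mathcal{Y})^m$.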

Throughout the paper we drop the dependence on $\mathcal{H}$ and $\ell$ and simply write $s$ for $s_{\mathcal{H},\ell}$. Note that the above definition is equivalent to the classic definition of uniform convergence (e.g. Chapter 4 in Shalev-Shwartz and Ben-David (2014)). We only introduce an explicit notation, $s$, for the sample complexity rate of uniform convergence, as this simplifies the layout of our analysis later. It is well-known that uniform convergence implies PAC-learnability and that the opposite is also true for agnostic binary classification Shalev-Shwartz and Ben-David (2014).
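To make Definition 2 concrete, the following sketch estimates the left-hand side of (3) for a small finite class. Everything specific in it is an illustrative assumption that does not come from the text: the threshold classifiers $x \mapsto \mathbb{1}[x \geq t]$, the uniform input distribution on $[0,1]$, the noiseless labelling rule and the zero-one loss. They are chosen only because the true risk then has a closed form.

```python
import numpy as np

def empirical_risk(threshold, xs, ys):
    """Average zero-one loss of the classifier x -> 1[x >= threshold]."""
    predictions = (xs >= threshold).astype(int)
    return float(np.mean(predictions != ys))

def true_risk(threshold, t_star=0.5):
    """Exact risk when x ~ Uniform[0, 1] and y = 1[x >= t_star] (assumed setup)."""
    return abs(threshold - t_star)

def uniform_deviation(thresholds, xs, ys, t_star=0.5):
    """sup over the finite class of |R(h) - R_hat(h)|, cf. equation (3)."""
    return max(abs(true_risk(t, t_star) - empirical_risk(t, xs, ys))
               for t in thresholds)

def sample(m, rng, t_star=0.5):
    """Draw m i.i.d. labelled points from the assumed distribution."""
    xs = rng.uniform(0.0, 1.0, size=m)
    return xs, (xs >= t_star).astype(int)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    thresholds = np.linspace(0.0, 1.0, 21)
    for m in (50, 500, 5000):
        xs, ys = sample(m, rng)
        print(m, round(uniform_deviation(thresholds, xs, ys), 4))
```

In this toy setup the printed deviation should shrink as $m$ grows, which is the behaviour the second bullet of Definition 2 requires of the rate function.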

### 3.2 Multi-source learning

Our focus in this paper is on learning from multiple data sources. For simplicity of exposition, we assume that they all provide the same number of data points, i.e. the training data consists of $N$ groups of $m$ samples each, where $N, m \in \mathbb{N}$ are fixed integers.

Formally, we denote by $(\mathcal{X} \times \mathcal{Y})^{N \times m}$ the set of all possible collections (i.e. unordered sequences) of $N$ groups of $m$ datapoints each. A (statistical) multi-source learner is a function $\mathcal{L} : (\mathcal{X} \times \mathcal{Y})^{N \times m} \to \mathcal{H}$ that takes such a collection of datasets and returns a predictor from $\mathcal{H}$.

### 3.3 Robust Multi-Source Learning

Informally, one considers a learning system robust if it is able to learn a good hypothesis, even when the training data is not perfectly i.i.d., but contains some artifacts, e.g. annotation errors, a selection bias or even malicious manipulations. Formally, one models this by assuming the presence of an adversary, that observes the original datasets and outputs potentially manipulated versions. The learner then has to operate on the manipulated data without knowledge of what the original one had been or what manipulations have been made.

###### Definition 3 (Adversary).

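An adversary is any function $\mathcal{A} : (\mathcal{X} \times \mathcal{Y})^{N \times m} \to (\mathcal{X} \times \mathcal{Y})^{N \times m}$.

Throughout the paper, we denote by $S' = \{S'_1, \dots, S'_N\}$ the original, uncorrupted datasets, drawn i.i.d. from $\mathcal{D}$, and by $S = \{S_1, \dots, S_N\} = \mathcal{A}(S')$ the datasets returned by the adversary.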

Different scenarios are obtained by giving the adversary different amounts of power. For example, a weak adversary might only be able to randomly flip labels, i.e. simulate the presence of label noise. A much stronger adversary would be one that can potentially manipulate all data and do so with knowledge not only of all of the datasets but also of the underlying data distribution and the learning algorithm to be used later.

In this work, we adopt the latter view, as it leads to much stronger robustness guarantees. We define two adversary types that can make arbitrary manipulations to data sources, but only influence a certain subset of them.

###### Definition 4 (Fixed-Set Adversary).

Let $G \subset [N]$. An adversary is called fixed-set (with preserved set $G$), if it only influences the datasets outside of $G$. That is, $S_i = S'_i$ for all $i \in G$.

###### Definition 5 (Flexible-Set Adversary).

Let $k \in \{0, 1, \dots, N\}$. An adversary is called flexible-set (with preserved size $k$), if it can influence any $N - k$ of the $N$ given datasets.

In both cases, we call the fraction of corrupted datasets the power of the adversary, i.e. $\alpha = \frac{N - |G|}{N}$ for the fixed-set and $\alpha = \frac{N - k}{N}$ for the flexible-set adversaries.

While similarly defined, the fixed-set adversary is strictly weaker than the flexible-set one, as the latter one can first inspect all data and then choose which subset to modify, while the former one is restricted to a fixed, data-independent subset of sources.
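The difference between the two adversary types can be sketched in code. The example below is a minimal illustration under assumptions that are not part of the source: sources are pairs of NumPy arrays with binary labels in {0, 1}, the corruption is a plain label flip, and the flexible-set adversary's selection rule (attack the sources with the most positive labels) is a hypothetical choice. The function names are likewise made up for this sketch.

```python
import numpy as np

def fixed_set_adversary(sources, preserved):
    """Flip the labels of every source whose index is NOT in `preserved`.

    `preserved` is chosen before any data is seen (data-independent).
    """
    return [(xs.copy(), ys.copy() if i in preserved else 1 - ys)
            for i, (xs, ys) in enumerate(sources)]

def flexible_set_adversary(sources, k):
    """Inspect all sources first, then corrupt N - k of them.

    Hypothetical selection rule: attack the sources with the most
    positive labels. Any data-dependent rule would fit the definition.
    """
    n_corrupt = len(sources) - k
    positives = [int(ys.sum()) for _, ys in sources]
    targets = set(np.argsort(positives)[::-1][:n_corrupt].tolist())
    return [(xs.copy(), 1 - ys if i in targets else ys.copy())
            for i, (xs, ys) in enumerate(sources)]

def power(n_sources, n_preserved):
    """Fraction of corrupted sources."""
    return (n_sources - n_preserved) / n_sources

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = [(rng.uniform(size=20), rng.integers(0, 2, size=20)) for _ in range(5)]
    corrupted = fixed_set_adversary(data, preserved={0, 1, 2})
    print(power(5, 3), [bool(np.array_equal(c[1], d[1])) for c, d in zip(corrupted, data)])
```

Both functions corrupt the same number of sources. Only the second one gets to look at the data before deciding which ones, which is what makes it the stronger model.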

Both adversary models are inspired by real-world considerations and analogs have appeared in a number of other research areas. The fixed-set adversaries can model a situation in which parties collaborate on a single learning task, but an unknown and fixed set of them are compromised, e.g. by hackers, that can act maliciously and collude with each other. This is a similar reasoning as in Byzantine-robust optimization, where an unknown subset of computing nodes are assumed to behave arbitrarily, thereby disrupting the optimization progress.

The second adversary corresponds to a situation where a malicious party can observe all of the available datasets and choose which ones to corrupt, up to a certain budget. This is similar to classic models in the fields of robust PAC learning, e.g. Bshouty et al. (2002), and robust mean estimation, e.g. Diakonikolas et al. (2019), where the adversary itself can influence which subset of the data to modify once the whole dataset is observed.

Whether robust learning in the presence of an adversary is possible for a certain hypothesis set or not is captured by the following definition:

###### Definition 6.

A hypothesis set, $\mathcal{H}$, is called multi-source PAC-learnable against the class of fixed-set/flexible-set adversaries of power $\alpha$ and with respect to $\ell$, if there exists a multi-source learner $\mathcal{L}$ and a function $m_{\mathcal{H},\ell} : (0,1) \times (0,1) \to \mathbb{N}$, such that for any $\epsilon, \delta \in (0,1)$ and any fixed-set/flexible-set adversary $\mathcal{A}$ of power $\alpha$, whenever $S'$ is a collection of $N$ datasets of $m \geq m_{\mathcal{H},\ell}(\epsilon, \delta)$ i.i.d. labelled samples from $\mathcal{D}$ each, then with probability at least $1 - \delta$ over the sampling of $S'$:

$$R(\mathcal{L}(\mathcal{A}(S'))) \leq \min_{h \in \mathcal{H}} R(h) + \epsilon. \tag{4}$$

A learner, $\mathcal{L}$, with this property is called an $\alpha$-robust multi-source learner for $\mathcal{H}$.

Note that the robust learner should achieve optimal error as $m \to \infty$, while $N$ can stay constant. This reflects that we want to study adversarial multi-source learning in the context of a constant and potentially not very large number of sources. In fact, our lower bound results in Section 5 show that the adversary can always prevent the learner from approaching optimal risk in the opposite regime of constant $m$ and $N \to \infty$.

## 4 Sample Complexity of Robust Multi-Source Learning

In this section, we present our main result, a theorem that states that whenever $\mathcal{H}$ has the uniform convergence property, there exists an algorithm that guarantees a bounded excess risk against both the fixed-set and the flexible-set adversary. We then derive and discuss some instantiations of the general result that shed light on the sample complexity of PAC learning in the adversarial multi-source learning setting. Finally, we provide a high-level sketch of the theorem's proof.

### 4.1 Main result

###### Theorem 1.

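Let $N, m, k$ be integers, such that $k \in (N/2, N]$. Let $\alpha = \frac{N-k}{N} < \frac{1}{2}$ be the proportion of corrupted sources. Assume that $\mathcal{H}$ has the uniform convergence property with rate function $s$. Then there exists a learner $\mathcal{L}$ with the following two properties.

• Let $G$ be a fixed subset of $[N]$ of size $k$. For $S'$ drawn i.i.d. from $\mathcal{D}$, denote by $S = \mathcal{A}(S')$ the dataset modified by any fixed-set adversary $\mathcal{A}$ with preserved set $G$. Let $S_G = \bigcup_{i \in G} S_i$ be the set of all uncorrupted data. Then, with probability at least $1 - \delta$ over the sampling of $S'$:

$$R(\mathcal{L}(\mathcal{A}(S'))) - \min_{h \in \mathcal{H}} R(h) \leq 2\, s\Big(km, \frac{\delta}{2}, S_G\Big) + 6\alpha \max_{i \in [N]} s\Big(m, \frac{\delta}{2N}, S_i\Big). \tag{5}$$

• For $S'$ drawn i.i.d. from $\mathcal{D}$, denote by $S = \mathcal{A}(S')$ the dataset modified by any flexible-set adversary $\mathcal{A}$ with preserved size $k$. Let $G$ be the set of sources not modified by the adversary and $S_G = \bigcup_{i \in G} S_i$ be the set of all uncorrupted data. Then, with probability at least $1 - \delta$ over the sampling of $S'$:

$$R(\mathcal{L}(\mathcal{A}(S'))) - \min_{h \in \mathcal{H}} R(h) \leq 2\, s\Big(km, \frac{\delta}{2\binom{N}{k}}, S_G\Big) + 6\alpha \max_{i \in [N]} s\Big(m, \frac{\delta}{2N}, S_i\Big). \tag{6}$$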

The learner is in fact explicit; we define and discuss it in the proof sketch in Section 4.3. The complete proof is provided in the supplementary material.

As an immediate consequence we obtain:

###### Corollary 1.

Assume that $\mathcal{H}$ has the uniform convergence property. Then $\mathcal{H}$ is multi-source PAC-learnable against the class of fixed-set and the class of flexible-set adversaries of power $\alpha < \frac{1}{2}$.

###### Proof.

It suffices to show that for any fixed $N$, $k$ and $\delta$, the right hand sides of (5) and (6) converge to $0$ for $m \to \infty$. This is true, since $s(m, \delta, S) \to 0$ as $m \to \infty$ for any $\delta$ and $S$, by the definition of uniform convergence. ∎

Discussion. Corollary 1 is in sharp contrast with the situation of single dataset PAC robustness. In particular, Bshouty et al. (2002) study a setup where an adversary can manipulate a fraction $\alpha$ of the datapoints out of a dataset with $m$ i.i.d.-sampled elements. The authors show that in the binary realizable case, for any hypothesis space with at least two functions, no learning algorithm can learn a hypothesis with risk less than with probability greater than . Similarly, Kearns and Li (1993) showed that for an adversary that modifies each data point with constant probability $\alpha$, no algorithm can learn a hypothesis with accuracy better than . Both results hold regardless of the value of $m$, thus showing that PAC-learnability is not fulfilled.

### 4.2 Rates of convergence

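While Theorem 1 is most general, it does not yet provide much insight into the actual sample complexity of the adversarial multi-source PAC learning problem, because the rate function $s$ might behave in different ways. In this section we give more explicit upper bounds in terms of a standard complexity measure of hypothesis spaces, the Rademacher complexity. Let

$$\mathfrak{R}_S(\ell \circ \mathcal{H}) = \mathbb{E}_{\sigma}\Big(\sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i\, \ell(h(x_i), y_i)\Big), \tag{7}$$

be the (empirical) Rademacher complexity of $\mathcal{H}$ with respect to the loss function $\ell$ on a sample $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$. Here $\sigma_1, \dots, \sigma_n$ are i.i.d. Rademacher random variables. Let $\mathfrak{R}_i = \mathfrak{R}_{S_i}(\ell \circ \mathcal{H})$ for $i \in [N]$, and $\mathfrak{R}_G = \mathfrak{R}_{S_G}(\ell \circ \mathcal{H})$.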
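For a finite hypothesis class, the expectation over the Rademacher signs in (7) can be approximated by Monte Carlo sampling. The sketch below does this for a precomputed loss matrix. The function names, the threshold classifiers used in the demo and the number of sign draws are assumptions made for illustration and are not taken from the source.

```python
import numpy as np

def empirical_rademacher(loss_matrix, n_draws=500, seed=0):
    """Monte Carlo estimate of equation (7) for a finite hypothesis class.

    loss_matrix[j, i] holds the loss of hypothesis j on sample point i.
    """
    rng = np.random.default_rng(seed)
    n_hypotheses, n = loss_matrix.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))   # Rademacher signs
    correlations = sigma @ loss_matrix.T / n              # (n_draws, n_hypotheses)
    # Supremum over hypotheses for each draw, then average over the draws.
    return float(correlations.max(axis=1).mean())

def threshold_losses(xs, ys, thresholds):
    """Zero-one loss of each classifier x -> 1[x >= t] on every sample point."""
    preds = (xs[None, :] >= thresholds[:, None]).astype(int)
    return (preds != ys[None, :]).astype(float)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    thresholds = np.linspace(0.0, 1.0, 21)
    for n in (20, 200, 2000):
        xs = rng.uniform(size=n)
        ys = (xs >= 0.5).astype(int)
        print(n, round(empirical_rademacher(threshold_losses(xs, ys, thresholds)), 4))
```

For this small class the estimate should decay roughly like $1/\sqrt{n}$, in line with the scaling assumed later in this section.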

#### Rates for the fixed-set adversary

An application of Theorem 1 with a standard uniform concentration result gives:

###### Corollary 2.

In the setup of Theorem 1, against a fixed-set adversary, it holds that

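$$R(\mathcal{L}(\mathcal{A}(S'))) - \min_{h \in \mathcal{H}} R(h) \leq 4\mathfrak{R}_G + 6\sqrt{\frac{\log(\frac{4}{\delta})}{2km}} + \alpha\Bigg(18\sqrt{\frac{\log(\frac{4N}{\delta})}{2m}} + 12 \max_{i \in [N]} \mathfrak{R}_i\Bigg). \tag{8}$$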

The full proof is included in the supplementary material.
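The constants in (8) can be traced back to (5) under one assumption that is not stated in the text: that the uniform concentration result supplies a rate function of the form given in the first line below, which is the usual shape of Rademacher-based bounds for bounded losses. This form is inferred from matching the constants, not quoted from the proof. Writing $\mathfrak{R}_G$ and $\mathfrak{R}_i$ for the empirical Rademacher complexities on the uncorrupted data and on source $i$, the substitution reads:

```latex
\begin{align*}
s(n, \delta, S) &= 2\,\mathfrak{R}_S(\ell \circ \mathcal{H})
  + 3\sqrt{\frac{\log(2/\delta)}{2n}}
  && \text{(assumed form)} \\
2\, s\Big(km, \tfrac{\delta}{2}, S_G\Big) &= 4\,\mathfrak{R}_G
  + 6\sqrt{\frac{\log(4/\delta)}{2km}} \\
6\alpha \max_{i \in [N]} s\Big(m, \tfrac{\delta}{2N}, S_i\Big) &=
  \alpha\Big(18\sqrt{\frac{\log(4N/\delta)}{2m}}
  + 12 \max_{i \in [N]} \mathfrak{R}_i\Big)
\end{align*}
```

Adding the last two lines gives the right-hand side of (8).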

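In many common learning settings, the Rademacher complexity scales as $\mathcal{O}(1/\sqrt{n})$ with the sample size $n$ (see e.g. Bousquet et al. (2004)). Thereby, we obtain the following rates against the fixed-set adversary:

$$\tilde{\mathcal{O}}\Big(\frac{1}{\sqrt{km}} + \alpha \frac{1}{\sqrt{m}}\Big), \tag{9}$$

where the $\tilde{\mathcal{O}}$-notation hides constant and logarithmic factors.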

Discussion. We can make a number of observations from Equation (9). The $\frac{1}{\sqrt{km}}$-term is the rate one expects when learning from $k$ (uncorrupted) sources of $m$ samples each, that is from all the available uncorrupted data. The $\frac{1}{\sqrt{m}}$-term reflects the rate when learning from any single source of $m$ samples, i.e. without the benefit of sharing information between sources. The latter enters weighted by $\alpha$, i.e. it is directly proportional to the power of the adversary. In the limit of $\alpha \to 0$ (i.e. all sources are uncorrupted, $k = N$), the bound becomes $\tilde{\mathcal{O}}(1/\sqrt{Nm})$. Thus, we recover the classic convergence rate for learning from $Nm$ samples in the non-realizable case. This fact is interesting, as the robust learner of Theorem 1 actually does not need to know the value of $\alpha$ for its operation. Consequently, the same algorithm works robustly if the data contains manipulations, but incurs no unnecessary overhead (i.e. it attains the optimal rate) if all data sources are in fact uncorrupted.

Another insight follows from the fact that for reasonably small $\alpha$, we have:

$$\tilde{\mathcal{O}}\Big(\frac{1}{\sqrt{km}} + \alpha \frac{1}{\sqrt{m}}\Big) \ll \tilde{\mathcal{O}}\Big(\frac{1}{\sqrt{m}}\Big), \tag{10}$$

so learning from multiple, even potentially manipulated, datasets converges to a good hypothesis faster than learning from a single uncorrupted dataset. This fact can be interpreted as encouraging cooperation: any of the honest parties in the multi-source setting with fixed-set adversary will benefit from making their data available for multi-source learning, even if some of the other parties are malicious.
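A quick numerical check makes the comparison in (10) tangible. The sketch below evaluates only the leading-order expressions, with all constants and logarithmic factors dropped, so it illustrates the shape of the rates rather than the actual bounds. The function names are made up for this example.

```python
import math

def multi_source_rate(k, m, alpha):
    """Leading-order shape of (9): 1/sqrt(k*m) + alpha/sqrt(m)."""
    return 1.0 / math.sqrt(k * m) + alpha / math.sqrt(m)

def single_source_rate(m):
    """Leading-order rate when learning from one clean source of m samples."""
    return 1.0 / math.sqrt(m)

def sharing_helps(k, alpha):
    """The ratio of the two rates equals 1/sqrt(k) + alpha, independent of m."""
    return 1.0 / math.sqrt(k) + alpha < 1.0

if __name__ == "__main__":
    # N = 10 sources, 2 of them corrupted: k = 8, alpha = 0.2.
    m = 1000
    print(multi_source_rate(8, m, 0.2), single_source_rate(m), sharing_helps(8, 0.2))
```

At this level of detail the multi-source rate is smaller exactly when $1/\sqrt{k} + \alpha < 1$. The condition does not depend on $m$, and it can fail when only very few sources are available, for instance with $k = 2$ and $\alpha = 1/3$.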

Comparison to Byzantine-robust optimization. Our obtained rates for the fixed-set adversary can also be compared to the state-of-the-art convergence results for Byzantine-robust distributed optimization, where the compromised nodes are also fixed, but unknown. Yin et al. (2018) and Alistarh et al. (2018) develop robust algorithms for gradient descent and stochastic gradient descent respectively, achieving convergence rates of order

$$\tilde{\mathcal{O}}\Big(\frac{1}{\sqrt{km}} + \alpha \frac{1}{\sqrt{m}} + \frac{1}{m}\Big) \tag{11}$$

for $\alpha$ unknown. Clearly, these rates resemble ours, except for the additional $\frac{1}{m}$-term, which matters when $\alpha$ is $0$ or very small. As shown in Yin et al. (2018), this term can also be made to disappear if an upper bound on $\alpha$ is assumed to be known a priori.

Overall, these similarities should not be over-interpreted, as the results for Byzantine-robust optimization describe practical gradient-based algorithms for distributed optimization under various technical assumptions, such as convexity, smoothness of the loss function and bounded variance of the gradients. In contrast, our work is purely statistical, not taking computational cost into account, but holds in a much broader context, for any hypothesis space that has the uniform convergence property of suitable rate and without constraints on the optimization method to be used. Additionally, our rates improve automatically in situations where uniform convergence is faster.

#### Rates for the flexible-set adversary

An analogous result to Corollary 2 holds also for flexible-set adversaries:

###### Corollary 3.

In the setup of Theorem 1, against a flexible-set adversary, it holds that

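$$R(\mathcal{L}(\mathcal{A}(S'))) - \min_{h \in \mathcal{H}} R(h) \leq 4\mathfrak{R}_G + 12\alpha \max_{i \in [N]} \mathfrak{R}_i + \mathcal{O}\Bigg(\frac{\sqrt[4]{\alpha}}{\sqrt{m}} + \alpha\sqrt{\frac{\log(N)}{m}}\Bigg). \tag{12}$$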

The proof is provided in the supplemental material.

Making the same assumptions as above, we obtain a sample complexity rate

$$\tilde{\mathcal{O}}\Big(\frac{1}{\sqrt{km}} + \alpha \frac{1}{\sqrt{m}} + \frac{\sqrt[4]{\alpha}}{\sqrt{m}}\Big), \tag{13}$$

which differs from (9) only in the additional third term, which, if at all, matters only for very small (but non-zero) $\alpha$. Despite the difference, most of our discussion above still applies. In particular, even for the flexible-set adversary the same learning algorithm exhibits robustness for $\alpha > 0$ and achieves optimal rates for $\alpha = 0$.

### 4.3 Proof Sketch for Theorem 1

The proof of Theorem 1 consists of two parts. First, we introduce a filtering algorithm, that attempts to determine which of the data sources can be trusted, meaning that it should be safe to use them for training a hypothesis. Note that this can be because they were not manipulated, or because the manipulations are too small to have negative consequences. The output of the algorithm is a new filtered training set, consisting of all data from the trusted sources only. Second, we show that training a standard single-source learner on the filtered training set yields the desired results.

Step 1. Pseudo-code for the filtering algorithm is provided in Algorithm 1. The crucial component is a carefully chosen notion of distance between the datasets, called discrepancy, that we define and discuss below. It guarantees that if two sources are close to each other then the difference of training on one of them compared to the other is small.

To identify the trusted sources, the algorithm checks for each source how close it is to all other sources with respect to the discrepancy distance. If it finds the source to be closer than a threshold to at least half of the other sources, it is marked as trusted, otherwise it is not. To show that this procedure does what it is intended to do it suffices to show that two properties hold with high probability: 1) all trusted sources are safe to be used for training, 2) at least all uncorrupted sources will be trusted.

Property 1) follows from the fact that if a source has small distance to at least half of the other datasets, it must be close to at least one of the uncorrupted sources. By the property of the discrepancy distance, including it in the training set will therefore not affect the learning of the hypothesis very negatively. Property 2) follows from a concentration of mass argument, which guarantees that for any uncorrupted source its distance to all other uncorrupted sources will approach zero at a well-understood rate. Therefore, with a suitably selected threshold, at least all uncorrupted sources will be close to each other and end up in the trusted subset with high probability.

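Discrepancy Distance. For any dataset $S_i$, let

$$\hat{R}_i(h) = \frac{1}{m} \sum_{(x,y) \in S_i} \ell(h(x), y) \tag{14}$$

be the empirical risk of $h$ with respect to the loss $\ell$. The (empirical) discrepancy distance between two datasets, $S_i$ and $S_j$, is defined as

$$d_{\mathcal{H}}(S_i, S_j) = \sup_{h \in \mathcal{H}} \big(|\hat{R}_i(h) - \hat{R}_j(h)|\big). \tag{15}$$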

This is the empirical counterpart of the so-called discrepancy distance, which, together with its unsupervised form, is widely adopted within the field of domain adaptation Kifer et al. (2004); Ben-David et al. (2010); Mohri and Medina (2012). Typically, the discrepancy is used to bound the maximum possible effect of distribution drift on a learning system. The metric was also used in Konstantinov and Lampert (2019) to measure the effect of training on sources that have been sampled randomly, but from adversarially chosen distributions. As shown in Kifer et al. (2004); Ben-David et al. (2010), for randomly sampled datasets, the empirical discrepancy concentrates with known rates to its distributional value, i.e. to zero, if two sources have the same underlying data distributions. The empirical discrepancy is well-defined even for data not sampled from a distribution, though, and together with the uniform convergence property it allows us to bound the effect of training on one dataset rather than another.

Step 2. Let $S_T = \bigcup_{i \in T} S_i$, where $T \subseteq [N]$ indexes the trusted sources, be the output of the filtering algorithm, i.e. the union of all trusted datasets. Then, for any $h \in \mathcal{H}$, the empirical risk over $S_T$ can be written as

$$\hat{R}_T(h) = \frac{1}{|T|} \sum_{i \in T} \hat{R}_i(h). \tag{16}$$

We need to show that training on $S_T$, e.g. by minimizing $\hat{R}_T$, with high probability leads to a hypothesis with small risk under the true data distribution $\mathcal{D}$.

By construction, we know that for any trusted source $S_i$, the difference between $\hat{R}_i(h)$ and $\hat{R}_j(h)$ for some uncorrupted source $S_j$ is bounded by a suitably chosen constant (that depends on the rate function $s$). By the uniform convergence property of $\mathcal{H}$, we know that for any uncorrupted source $S_j$, the difference between $\hat{R}_j(h)$ and the true risk $R(h)$ can also be bounded in terms of the rate function $s$. In combination, we obtain that $\hat{R}_T$ is a suitably good estimator of the true risk, uniformly over all $h \in \mathcal{H}$. Consequently, $S_T$ can be used for successful learning.
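The two steps of the proof sketch can be illustrated end to end on a toy problem. The code below is a minimal sketch under stated assumptions, not the authors' Algorithm 1: the hypothesis class is a finite grid of threshold classifiers, the "adversary" simply flips all labels of the last sources, the data distribution is uniform on $[0,1]$, and the threshold `tau` is a hand-picked value (the text defers the actual choice of thresholds to the supplemental material). All function names are hypothetical.

```python
import numpy as np

def risk_vector(xs, ys, thresholds):
    """Empirical zero-one risk (14) of each classifier x -> 1[x >= t] on one source."""
    preds = (xs[None, :] >= thresholds[:, None]).astype(int)
    return (preds != ys[None, :]).mean(axis=1)

def discrepancy(risks_i, risks_j):
    """Empirical discrepancy (15), restricted to the finite class."""
    return float(np.max(np.abs(risks_i - risks_j)))

def filter_sources(risks, tau):
    """Step 1: trust source i if it is within tau of at least half of the others."""
    n = len(risks)
    trusted = []
    for i in range(n):
        close = sum(discrepancy(risks[i], risks[j]) <= tau
                    for j in range(n) if j != i)
        if 2 * close >= n - 1:
            trusted.append(i)
    return trusted

def robust_erm(sources, thresholds, tau):
    """Step 2: minimise the pooled empirical risk (16) over the trusted sources."""
    risks = [risk_vector(xs, ys, thresholds) for xs, ys in sources]
    trusted = filter_sources(risks, tau)
    if not trusted:
        raise ValueError("no source passed the filter; tau may be too small")
    pooled = np.mean([risks[i] for i in trusted], axis=0)
    return trusted, float(thresholds[int(np.argmin(pooled))])

def make_sources(rng, n_sources=5, m=2000, n_corrupt=2):
    """Toy data: y = 1[x >= 0.5]; the last n_corrupt sources get flipped labels."""
    sources = []
    for i in range(n_sources):
        xs = rng.uniform(0.0, 1.0, size=m)
        ys = (xs >= 0.5).astype(int)
        if i >= n_sources - n_corrupt:
            ys = 1 - ys
        sources.append((xs, ys))
    return sources

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    trusted, best = robust_erm(make_sources(rng), np.linspace(0.0, 1.0, 21), tau=0.1)
    print(trusted, best)
```

In this setup the two flipped sources are close to each other but far from every clean source, so each of them is near only one of its four peers and fails the "at least half" test. The three clean sources pass, and minimising the pooled risk over them recovers the threshold 0.5. A subtler adversary could stay within `tau` of the clean sources, which is exactly the case the discrepancy argument in the text shows to be harmless.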

For the formal derivations and, in particular, the choice of thresholds, please see the supplemental material.

## 5 Hardness of Robust Multi-Source Learning

We now take an orthogonal view compared to Section 4, and study where the hardness of the multi-source PAC learning stems from and what allows us to nevertheless overcome it. For this, we prove two additional results that describe fundamental limits of how well a learner can perform in the multi-source adversarial setting.

For simplicity of exposition we focus on binary classification. Let $\mathcal{Y}$ contain exactly two labels and $\ell$ be the zero-one loss, i.e. $\ell(y, y') = \mathbb{1}\{y \neq y'\}$. Following Bshouty et al. (2002), we define:

###### Definition 7.

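A hypothesis space $\mathcal{H}$ over an input set $\mathcal{X}$ is said to be non-trivial, if there exist two points $x_1, x_2 \in \mathcal{X}$ and two hypotheses $h_1, h_2 \in \mathcal{H}$, such that $h_1(x_1) = h_2(x_1)$, but $h_1(x_2) \neq h_2(x_2)$.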

### 5.1 What makes robust learning possible?

We show that if the learner does not make use of the multi-source structure of the data, i.e. it behaves as a single-source learner on the union of all data samples, then a (multi-source) fixed-set adversary can always prevent PAC-learnability.

###### Theorem 2.

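Let $\mathcal{H}$ be a non-trivial hypothesis space. Let $m$ and $N$ be any positive integers and let $G$ be a fixed subset of $[N]$ of size $k$. Let $\mathcal{L}$ be a multi-source learner that acts by merging the data from all sources and then calling a single-source learner. Let $S'$ be drawn i.i.d. from $\mathcal{D}$. Then there exists a distribution $\mathcal{D}$ with $\min_{h \in \mathcal{H}} R(h) = 0$ and a fixed-set adversary $\mathcal{A}$ with index set $G$, such that:

$$\mathbb{P}_{S' \sim \mathcal{D}}\Big(R(\mathcal{L}(\mathcal{A}(S'))) > \frac{\alpha}{8(1 - \alpha)}\Big) > \frac{1}{20}, \tag{17}$$

where $\alpha = \frac{N - k}{N}$ is the power of the adversary.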

The proof is provided in the supplemental material. Note that, since the theorem holds for the fixed-set adversary, it automatically also holds for the stronger flexible-set adversary.

The theorem sheds light on why PAC-learnability is possible in the multi-source setting, while in the single source setting it is not. The reason is not simply that the adversary is weaker, because it is restricted to manipulating samples in a subset of datasets instead of being able to choose freely. Inequality (17) implies that even against such a weaker adversary, a single-source learner cannot be adversarially robust. Consequently, it is the additional information that the data comes in multiple datasets, some of which remain uncorrupted even after the adversary was active, that gives the multi-source learner the power to learn robustly.

An immediate consequence of Theorem 2 is also that the common practice of merging the data from all sources and performing a form of empirical risk minimization on the resulting dataset is not a robust learner and therefore suboptimal in the studied context.

### 5.2 How hard is robust learning?

As a tool for understanding the limiting factors of learning in the adversarial multi-source setting, we now establish a lower bound on the achievable excess risk in terms of the number of samples per source and the power of the adversary.

###### Theorem 3.

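Let $\mathcal{H}$ be a hypothesis space, let $m$ and $N$ be any integers and let $G$ be a fixed subset of $[N]$ of size $k$. Let $S'$ be drawn i.i.d. from $\mathcal{D}$. Then the following statements hold for any multi-source learner $\mathcal{L}$:

• Suppose that $\mathcal{H}$ is non-trivial. Then there exists a distribution $\mathcal{D}$ on $\mathcal{X}\times\mathcal{Y}$ with $\min_{h\in\mathcal{H}} R(h) = 0$, and a fixed-set adversary $\mathcal{A}$ with index set $G$, such that:

$$\mathbb{P}_{S'}\left(R\big(\mathcal{L}(\mathcal{A}(S'))\big) > \frac{\alpha}{8m}\right) > \frac{1}{20}. \tag{18}$$

• Suppose that $\mathcal{H}$ has VC dimension $d$. Then there exists a distribution $\mathcal{D}$ on $\mathcal{X}\times\mathcal{Y}$ and a fixed-set adversary $\mathcal{A}$ with index set $G$, such that:

$$\mathbb{P}_{S'}\left(R\big(\mathcal{L}(\mathcal{A}(S'))\big) - \min_{h\in\mathcal{H}} R(h) > \sqrt{\frac{d}{1280\,Nm}} + \frac{\alpha}{16m}\right) > \frac{1}{64}. \tag{19}$$

In both cases, $\alpha = \frac{N-k}{N}$ is the power of the adversary.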

The proof is provided in the supplemental material. As for Theorem 2, it is clear that the same result also holds for flexible-set adversaries with preserved size $k$.

Analysis. Inequality (18) shows that even in the realizable scenario, the risk might not shrink faster than with rate $\Omega(1/m)$, regardless of how many data sources, and therefore data samples, are available. This is in contrast to the i.i.d. situation, where the corresponding rate is $\Omega(1/(Nm))$. The difference shows that robust learning with a constant fraction of corrupted sources is only possible if the number of samples per dataset grows. Conversely, if the number of corrupted datasets is constant, regardless of the total number of sources, i.e., $\alpha = O(1/N)$, we recover the rates for learning without an adversary up to constants.

In inequality (19), the term $\sqrt{d/(1280\,Nm)}$ is due to the classic no-free-lunch theorem for binary classification and corresponds to the fundamental limits of learning, now in the non-realizable case. The $\frac{\alpha}{m}$-term appears as the price of robustness, and as before, it implies that for constant $\alpha$, $m \to \infty$ is necessary in order to achieve arbitrarily small excess risk, while just $N \to \infty$ does not suffice.
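To make the role of the two terms concrete, the following minimal sketch evaluates the thresholds on the right-hand sides of (18) and (19) for a fixed number of samples per source and a growing number of sources. The function names and the chosen parameter values are illustrative assumptions, not part of the original analysis.

```python
import math


def lower_bound_realizable(alpha: float, m: int) -> float:
    """Threshold in inequality (18): alpha / (8 m)."""
    return alpha / (8 * m)


def lower_bound_agnostic(alpha: float, m: int, N: int, d: int) -> float:
    """Threshold in inequality (19): sqrt(d / (1280 N m)) + alpha / (16 m)."""
    return math.sqrt(d / (1280 * N * m)) + alpha / (16 * m)


if __name__ == "__main__":
    # Hypothetical values: a quarter of the sources corrupted, 100 samples each.
    alpha, m, d = 0.25, 100, 10
    for N in (10, 1_000, 100_000):
        # The first term vanishes as N grows; the alpha / (16 m) term does not.
        print(N, lower_bound_agnostic(alpha, m, N, d))
```

With $m$ fixed, the printed values decrease towards, but never fall below, $\alpha/(16m)$, which is the behaviour described above.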

Relation to prior work. Lower bounds of similar structure as in Theorem 3 have also been derived for Byzantine optimization and collaborative learning. In particular, Yin et al. (2018) prove that in the case of distributed mean estimation of a -dimensional Gaussian on machines, an fraction of which can be Byzantine, any algorithm would incur loss of . Alistarh et al. (2018) construct specific examples of a Lipschitz continuous and a strongly convex function, such that no distributed stochastic optimization algorithm, working with an -fraction of Byzantine machines, can optimize the function to error less than , where is the number of parameters. For realizable binary classification in the context of collaborative learning, Qiao (2018) prove that there exists a hypothesis space of VC dimension , such that no learner can achieve excess risk less than .

Besides the different application scenario, the main difference between these results and Theorem 3 is that our bounds hold for any hypothesis space that is non-trivial (Ineq. (18)), or has VC dimension $d$ (Ineq. (19)), while the mentioned references construct explicit examples of hypothesis spaces or stochastic optimization problems where the bounds hold. In particular, our results show that the limitations on the learner due to the finite total number of samples, the finite number of samples per source and the fraction of unreliable sources are inherent and not specific to a subset of hard-to-learn hypotheses.

## 6 Conclusion

We studied the problem of robust learning from multiple unreliable datasets. Rephrasing this task as learning from datasets that might be adversarially corrupted, we introduced the formal problem of adversarial learning from multiple sources, which we studied in the classic PAC setting.

Our main results provide a characterization of the hardness of this learning task from above and below. First, we showed that adversarial multi-source PAC learning is possible for any hypothesis class with the uniform convergence property, and we provided explicit rates for the excess risk (Theorem 1 and its corollaries). The proof is constructive and also shows that integrating robustness comes at a minor statistical cost, as our robust learner achieves optimal rates when run on data without manipulations. Second, we proved that adversarial PAC learning from multiple sources is far from trivial. In particular, it is impossible to achieve for learners that ignore the multi-source structure of the data (Theorem 2). Third, we proved lower bounds on the excess risk under very general conditions (Theorem 3), which highlight an unavoidable slowdown of the convergence rate proportional to the adversary’s strength compared to the i.i.d. (adversary-free) case. Furthermore, in order to facilitate successful learning with a constant fraction of corrupted sources, the number of samples per source has to grow.

A second emphasis of our work was to highlight connections of the adversarial multi-source learning task to related methods in robust optimization, cryptography and statistics. We believe that a better understanding of these connections will allow us to derive tighter bounds, to design algorithms that are statistically efficient (as was the focus of this work), and to obtain insight into the trade-offs with computational complexity.

## Appendix A Proof of Theorem 1 and its corollaries

###### Theorem 1.

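Let $N, m, k$ be integers, such that $k > N/2$. Let $\alpha = \frac{N-k}{N}$ be the proportion of corrupted sources. Assume that $\mathcal{H}$ has the uniform convergence property with rate function $s$. Then there exists a learner $\mathcal{L}$ with the following two properties.

• Let $G$ be a fixed subset of $[N]$ of size $k$. For $S'$, denote by $S = \mathcal{A}(S')$ the dataset modified by any fixed-set adversary with preserved set $G$. Let $S_G = \bigcup_{i\in G} S_i$ be the set of all uncorrupted data. Then, with probability at least $1-\delta$ over the sampling of $S'$:

$$R\big(\mathcal{L}(\mathcal{A}(S'))\big) - \min_{h\in\mathcal{H}} R(h) \le 2\,s\!\left(km, \frac{\delta}{2}, S_G\right) + 6\alpha \max_{i\in[N]} s\!\left(m, \frac{\delta}{2N}, S_i\right). \tag{20}$$

• For $S'$, denote by $S = \mathcal{A}(S')$ the dataset modified by any flexible-set adversary with preserved size $k$. Let $G$ be the set of sources not modified by the adversary and $S_G = \bigcup_{i\in G} S_i$ be the set of all uncorrupted data. Then, with probability at least $1-\delta$ over the sampling of $S'$:

$$R\big(\mathcal{L}(\mathcal{A}(S'))\big) - \min_{h\in\mathcal{H}} R(h) \le 2\,s\!\left(km, \frac{\delta}{2\binom{N}{k}}, S_G\right) + 6\alpha \max_{i\in[N]} s\!\left(m, \frac{\delta}{2N}, S_i\right). \tag{21}$$
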
###### Proof.

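Denote by $S'_i = \{(x'_{i,j}, y'_{i,j})\}_{j=1}^{m}$ for $i = 1, \dots, N$ the initial datasets and by $S_i = \{(x_{i,j}, y_{i,j})\}_{j=1}^{m}$ the datasets after the modifications of the adversary. As explained in the main body of the paper, we denote by:

$$\hat{R}_i(h) = \frac{1}{m}\sum_{j=1}^{m} \ell\big(h(x_{i,j}), y_{i,j}\big) \tag{22}$$

the empirical risk of any hypothesis $h \in \mathcal{H}$ on the dataset $S_i$ and by:

$$d_{\mathcal{H}}(S_i, S_j) = \sup_{h\in\mathcal{H}} \left|\hat{R}_i(h) - \hat{R}_j(h)\right| \tag{23}$$

the empirical discrepancy between the datasets $S_i$ and $S_j$.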
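As an illustration of (22) and (23), the sketch below computes the empirical risk and the empirical discrepancy for a small finite class. The 0-1 loss, the one-dimensional threshold classifiers and the helper `make_threshold` are assumptions made only for this example; for a finite class the supremum in (23) reduces to a maximum.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[float, int]
Hypothesis = Callable[[float], int]


def make_threshold(t: float) -> Hypothesis:
    """Hypothetical 1-D threshold classifier: predicts 1 iff x >= t."""
    return lambda x: 1 if x >= t else 0


def empirical_risk(h: Hypothesis, data: Sequence[Example]) -> float:
    """Average 0-1 loss of h on one dataset, as in (22)."""
    return sum(1 for x, y in data if h(x) != y) / len(data)


def discrepancy(
    hypotheses: Sequence[Hypothesis],
    data_i: Sequence[Example],
    data_j: Sequence[Example],
) -> float:
    """Largest gap in empirical risk over the finite class, as in (23)."""
    return max(
        abs(empirical_risk(h, data_i) - empirical_risk(h, data_j))
        for h in hypotheses
    )


hypotheses = [make_threshold(t) for t in (0.0, 0.5, 1.0)]
clean = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
flipped = [(x, 1 - y) for x, y in clean]  # every label flipped

print(discrepancy(hypotheses, clean, clean))    # 0.0
print(discrepancy(hypotheses, clean, flipped))  # 1.0
```

The flipped dataset is maximally far from the clean one because the threshold at 0.5 has risk 0 on the former and risk 1 on the latter.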

We show that a learner that first runs a certain filtering algorithm (Algorithm 1) based on the discrepancy metric and then performs empirical risk minimization on the remaining data to compute a hypothesis satisfies the properties stated in the theorem. The full algorithm for the learner is therefore given in Algorithm 2.
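The two-stage structure of the learner can be sketched as follows. This is not a transcription of Algorithm 1 or Algorithm 2: the selection rule (keep a source if it is within a threshold of a strict majority of the sources, counting itself), the single fixed threshold `tau` standing in for sums of rate-function values, the finite hypothesis class and the 0-1 loss are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]
Hypothesis = Callable[[float], int]


def make_threshold(t: float) -> Hypothesis:
    """Hypothetical 1-D threshold classifier: predicts 1 iff x >= t."""
    return lambda x: 1 if x >= t else 0


def empirical_risk(h: Hypothesis, data: Sequence[Example]) -> float:
    """Average 0-1 loss of h on a dataset."""
    return sum(1 for x, y in data if h(x) != y) / len(data)


def discrepancy(hyps, data_i, data_j) -> float:
    """Maximum gap in empirical risk over a finite class."""
    return max(
        abs(empirical_risk(h, data_i) - empirical_risk(h, data_j)) for h in hyps
    )


def filter_sources(hyps, sources, tau: float) -> List[int]:
    """Assumed rule: trust source i if it is tau-close to a strict majority."""
    n = len(sources)
    trusted = []
    for i in range(n):
        close = sum(
            1 for j in range(n) if discrepancy(hyps, sources[i], sources[j]) <= tau
        )
        if close > n / 2:
            trusted.append(i)
    return trusted


def robust_erm(hyps, sources, tau: float) -> Tuple[int, List[int]]:
    """Filter the sources, then run ERM on the pooled trusted data."""
    trusted = filter_sources(hyps, sources, tau)
    if not trusted:
        raise ValueError("no source passed the filter")
    pooled = [ex for i in trusted for ex in sources[i]]
    best = min(range(len(hyps)), key=lambda idx: empirical_risk(hyps[idx], pooled))
    return best, trusted


hyps = [make_threshold(t) for t in (0.0, 0.5, 1.0)]
clean_1 = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
clean_2 = [(0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1)]
clean_3 = [(0.05, 0), (0.45, 0), (0.55, 1), (0.95, 1)]
corrupted = [(x, 1 - y) for x, y in clean_1]  # adversary flips all labels
print(robust_erm(hyps, [clean_1, clean_2, clean_3, corrupted], tau=0.25))
```

In this toy run the three clean sources agree with each other and are kept, while the label-flipped source is close only to itself and is discarded before empirical risk minimization.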

(a) The key idea of the proof is that the clean sources are close to each other with high probability, so they get selected when running Algorithm 1. On the other hand, if a bad source has been selected, it must be close to at least one of the good sources, so it cannot have too bad an effect on the empirical risk.

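For all $i \in G$, let $E_i$ be the event that:

$$\sup_{h\in\mathcal{H}} \left|R(h) - \hat{R}_i(h)\right| \le s\!\left(m, \frac{\delta}{2N}, S_i\right). \tag{24}$$

Further, let $E_G$ be the event that:

$$\sup_{h\in\mathcal{H}} \left|R(h) - \hat{R}_G(h)\right| \le s\!\left(km, \frac{\delta}{2}, S_G\right), \tag{25}$$

where

$$\hat{R}_G(h) = \frac{1}{km}\sum_{i\in G}\sum_{j=1}^{m} \ell\big(h(x_{i,j}), y_{i,j}\big).$$

Denote by $E_G^c$ and $E_i^c$ the complements of these events. Then we know that $\mathbb{P}(E_G^c) \le \frac{\delta}{2}$, and $\mathbb{P}(E_i^c) \le \frac{\delta}{2N}$ for all $i \in G$. Therefore, if $E = E_G \wedge \big(\bigwedge_{i\in G} E_i\big)$, we have:

$$\mathbb{P}(E^c) = \mathbb{P}\Big(E_G^c \vee \big(\textstyle\bigvee_{i\in G} E_i^c\big)\Big) \le \mathbb{P}(E_G^c) + \sum_{i\in G} \mathbb{P}(E_i^c) \le \frac{\delta}{2} + \frac{k\delta}{2N} \le \delta. \tag{26}$$

Hence, the probability of the event $E$ that all of (24) and (25) hold, is at least $1-\delta$. We now show that under the event $E$, Algorithm 2 returns a hypothesis that satisfies the condition in (a).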

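Whenever $E$ holds, for all $i, j \in G$ we have:

$$\begin{aligned}
d_{\mathcal{H}}(S_i, S_j) &= \sup_{h\in\mathcal{H}} \left|\hat{R}_i(h) - \hat{R}_j(h)\right| \\
&\le \sup_{h\in\mathcal{H}} \left|\hat{R}_i(h) - R(h)\right| + \sup_{h\in\mathcal{H}} \left|R(h) - \hat{R}_j(h)\right| \\
&\le s\!\left(m, \frac{\delta}{2N}, S_i\right) + s\!\left(m, \frac{\delta}{2N}, S_j\right).
\end{aligned} \tag{27}$$

Now since $k > N/2$, we get that $G \subseteq T$, where $T$ is the set of sources selected by Algorithm 1. Moreover, for any $i \in T \setminus G$, there exists at least one $j \in G$, such that $d_{\mathcal{H}}(S_i, S_j) \le s\!\left(m, \frac{\delta}{2N}, S_i\right) + s\!\left(m, \frac{\delta}{2N}, S_j\right)$. For any $i \in T \setminus G$, denote by $f(i)$ the smallest such $j$. Therefore, for any $i \in T \setminus G$ and $h \in \mathcal{H}$:

$$\begin{align}
\left|\hat{R}_i(h) - R(h)\right| &\le \left|\hat{R}_i(h) - \hat{R}_{f(i)}(h)\right| + \left|\hat{R}_{f(i)}(h) - R(h)\right| \notag\\
&\le d_{\mathcal{H}}(S_i, S_{f(i)}) + s\!\left(m, \frac{\delta}{2N}, S_{f(i)}\right) \tag{28}\\
&\le s\!\left(m, \frac{\delta}{2N}, S_i\right) + 2\,s\!\left(m, \frac{\delta}{2N}, S_{f(i)}\right). \tag{29}
\end{align}$$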

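Denote by

$$\hat{R}_T(h) = \frac{1}{|T|}\sum_{i\in T} \hat{R}_i(h) = \frac{1}{|S_T|}\sum_{(x,y)\in S_T} \ell\big(h(x), y\big) \tag{30}$$

the loss over all the trusted data. Then for any $h \in \mathcal{H}$ we have:

$$\begin{align}
\left|\hat{R}_T(h) - R(h)\right| &\le \frac{1}{|T|m}\left(\left|\sum_{i\in G}\sum_{l=1}^{m} \big(\ell(h(x_{i,l}), y_{i,l}) - R(h)\big)\right| + \sum_{i\in T\setminus G}\left|\sum_{l=1}^{m} \big(\ell(h(x_{i,l}), y_{i,l}) - R(h)\big)\right|\right) \tag{31}\\
&= \frac{k}{|T|}\left|\hat{R}_G(h) - R(h)\right| + \frac{1}{|T|}\sum_{i\in T\setminus G}\left|\hat{R}_i(h) - R(h)\right| \tag{32}\\
&\le \frac{k}{|T|}\,s\!\left(km, \frac{\delta}{2}, S_G\right) + \frac{1}{|T|}\sum_{i\in T\setminus G}\left|\hat{R}_i(h) - R(h)\right| \tag{33}\\
&\le \frac{k}{|T|}\,s\!\left(km, \frac{\delta}{2}, S_G\right) + \frac{1}{|T|}\sum_{i\in T\setminus G}\left(s\!\left(m, \frac{\delta}{2N}, S_i\right) + 2\,s\!\left(m, \frac{\delta}{2N}, S_{f(i)}\right)\right) \tag{34}\\
&\le \frac{k}{|T|}\,s\!\left(km, \frac{\delta}{2}, S_G\right) + 3\,\frac{|T|-k}{|T|}\max_{i\in[N]} s\!\left(m, \frac{\delta}{2N}, S_i\right) \tag{35}\\
&\le s\!\left(km, \frac{\delta}{2}, S_G\right) + 3\,\frac{N-k}{N}\max_{i\in[N]} s\!\left(m, \frac{\delta}{2N}, S_i\right). \tag{36}
\end{align}$$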

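Finally, let $h_A = \operatorname{argmin}_{h\in\mathcal{H}} \hat{R}_T(h)$ and $h^* = \operatorname{argmin}_{h\in\mathcal{H}} R(h)$. Then:

$$\begin{align}
R(h_A) - R(h^*) &= \big(R(h_A) - \hat{R}_T(h_A)\big) + \big(\hat{R}_T(h_A) - R(h^*)\big) \notag\\
&\le \big(R(h_A) - \hat{R}_T(h_A)\big) + \big(\hat{R}_T(h^*) - R(h^*)\big) \tag{37}\\
&\le 2\sup_{h\in\mathcal{H}} \left|\hat{R}_T(h) - R(h)\right| \tag{38}
\end{align}$$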

and the result follows.
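For completeness, the final step can be written out by chaining (38) with (36). Here $h_A$ denotes the hypothesis returned by the learner, $h^*$ a risk minimizer in the class, and $\alpha = \frac{N-k}{N}$ as in the theorem statement:

$$\begin{aligned}
R(h_A) - R(h^*) &\le 2\sup_{h\in\mathcal{H}} \left|\hat{R}_T(h) - R(h)\right| \\
&\le 2\left(s\!\left(km, \frac{\delta}{2}, S_G\right) + 3\,\frac{N-k}{N}\max_{i\in[N]} s\!\left(m, \frac{\delta}{2N}, S_i\right)\right) \\
&= 2\,s\!\left(km, \frac{\delta}{2}, S_G\right) + 6\alpha\max_{i\in[N]} s\!\left(m, \frac{\delta}{2N}, S_i\right),
\end{aligned}$$

which is the right-hand side of (20).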

(b) The crucial difference in the case of the flexible-set adversary is that the set $G$ is chosen after the clean data is observed. We thus need concentration results for all of the subsets of $[N]$ of size $k$, as well as all individual sources.
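The cost of covering all size-$k$ subsets shows up only in the confidence parameter passed to the rate function. The following sketch compares the confidence levels appearing in (20) and (21); the function names and the example values are illustrative assumptions.

```python
import math
from typing import Tuple


def confidence_levels(N: int, k: int, delta: float) -> Tuple[float, float, float]:
    """Confidence parameters used for the pooled clean data and single sources.

    Returns (fixed-set level, flexible-set level, per-source level).
    """
    fixed = delta / 2                         # one fixed subset G
    flexible = delta / (2 * math.comb(N, k))  # union over all subsets of size k
    per_source = delta / (2 * N)              # union over the N sources
    return fixed, flexible, per_source


def extra_log_factor(N: int, k: int) -> float:
    """Increase in log(1 / confidence) caused by the union over subsets."""
    return math.log(math.comb(N, k))


if __name__ == "__main__":
    N, k, delta = 20, 15, 0.05  # hypothetical values
    print(confidence_levels(N, k, delta))
    print(extra_log_factor(N, k))
```

For rate functions that depend on the confidence level only through a logarithmic term, the flexible-set bound therefore pays an additive $\log\binom{N}{k}$ inside that term.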

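For all $i \in [N]$, let $E_i$ be the event that:

$$\sup_{h\in\mathcal{H}} \left|R(h) - \hat{R}'_i(h)\right| \le s\!\left(m, \frac{\delta}{2N}, S'_i\right), \tag{39}$$

where

$$\hat{R}'_i(h) = \frac{1}{m}\sum_{j=1}^{m} \ell\big(h(x'_{i,j}), y'_{i,j}\big). \tag{40}$$

Further, for any $A \subseteq [N]$ of size $k$, let $E_A$ be the event that:

$$\sup_{h\in\mathcal{H}} \left|R(h) - \hat{R}'_A(h)\right| \le s\!\left(km, \frac{\delta}{2\binom{N}{k}}, S'_A\right), \tag{41}$$

where $S'_A = \bigcup_{i\in A} S'_i$ and

$$\hat{R}'_A(h) = \frac{1}{km}\sum_{i\in A}\sum_{l=1}^{m} \ell\big(h(x'_{i,l}), y'_{i,l}\big). \tag{42}$$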

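Then we know that $\mathbb{P}(E_i^c) \le \frac{\delta}{2N}$ for all $i \in [N]$ and $\mathbb{P}(E_A^c) \le \frac{\delta}{2\binom{N}{k}}$ for all $A \subseteq [N]$ with $|A| = k$. Therefore, if $E = \big(\bigwedge_{A} E_A\big) \wedge \big(\bigwedge_{i\in[N]} E_i\big)$, we have:

$$\mathbb{P}(E^c) = \mathbb{P}\Big(\big(\textstyle\bigvee_{A} E_A^c\big) \vee \big(\textstyle\bigvee_{i\in[N]} E_i^c\big)\Big) \le \sum_{A} \mathbb{P}(E_A^c) + \sum_{i\in[N]} \mathbb{P}(E_i^c) \le \binom{N}{k}\frac{\delta}{2\binom{N}{k}} + N\,\frac{\delta}{2N} = \delta. \tag{43}$$

Hence, the probability of the event $E$ that all of (39) and (41) hold, is at least $1-\delta$. In particular, under $E$:

$$\sup_{h\in\mathcal{H}} \left|R(h) - \hat{R}_G(h)\right| = \sup_{h\in\mathcal{H}} \left|R(h) - \hat{R}'_G(h)\right| \le s\!\left(km, \frac{\delta}{2\binom{N}{k}}, S'_G\right) = s\!\left(km, \frac{\delta}{2\binom{N}{k}}, S_G\right) \tag{44}$$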

and

 suph∈H∣∣R(h)−ˆRi(h)∣∣=suph∈H∣∣R(h)−ˆR′i(h)∣∣≤s(m,δ2N,