# Sample Complexity Bounds on Differentially Private Learning via Communication Complexity

Vitaly Feldman, IBM Research - Almaden. E-mail: vitaly@post.harvard.edu. Part of this work done while visiting LIAFA, Université Paris 7.

David Xiao, CNRS, Université Paris 7. E-mail: dxiao@liafa.univ-paris-diderot.fr. Part of this work done while visiting Harvard’s Center for Research on Computation and Society (CRCS).
###### Abstract

In this work we analyze the sample complexity of classification by differentially private algorithms. Differential privacy is a strong and well-studied notion of privacy introduced by Dwork et al. (2006) that ensures that the output of an algorithm leaks little information about the data point provided by any of the participating individuals. Sample complexity of private PAC and agnostic learning was studied in a number of prior works starting with (Kasiviswanathan et al., 2011). However, a number of basic questions still remain open (Beimel et al., 2010; Chaudhuri and Hsu, 2011; Beimel et al., 2013a, b), most notably whether learning with privacy requires more samples than learning without privacy.

We show that the sample complexity of learning with (pure) differential privacy can be arbitrarily higher than the sample complexity of learning without the privacy constraint or the sample complexity of learning with approximate differential privacy. Our second contribution and the main tool is an equivalence between the sample complexity of (pure) differentially private learning of a concept class $C$ (or $SCDP(C)$) and the randomized one-way communication complexity of the evaluation problem for concepts from $C$. Using this equivalence we prove the following bounds:

• $SCDP(C) = \Omega(LDim(C))$, where $LDim(C)$ is Littlestone’s dimension characterizing the number of mistakes in the online-mistake-bound learning model (Littlestone, 1987). Known bounds on $LDim(C)$ then imply that $SCDP(C)$ can be much higher than the VC-dimension of $C$.

• For any $t$, there exists a class $C$ such that $LDim(C) = 2$ but $SCDP(C) \ge t$.

• For any $t$, there exists a class $C$ such that the sample complexity of (pure) $\alpha$-differentially private PAC learning is $\Omega(t/\alpha)$ but the sample complexity of the approximate $(\alpha,\beta)$-differentially private PAC learning is $O(\log(1/\beta)/\alpha)$. This resolves an open problem from (Beimel et al., 2013b).

A preliminary version of this work appeared in the Conference on Learning Theory (COLT), 2014.

## 1 Introduction

In machine learning tasks, the training data often consists of information collected from individuals. This data can be highly sensitive, for example in the case of medical or financial information, and therefore privacy-preserving data analysis is becoming an increasingly important area of study in machine learning, data mining and statistics (Dwork and Smith, 2009; Sarwate and Chaudhuri, 2013; Dwork and Roth, 2014).

In this work we focus on the task of learning to classify from labeled examples. Two standard and closely related models of this task are PAC learning (Valiant, 1984) and agnostic learning (Haussler, 1992; Kearns et al., 1994). In the PAC learning model the algorithm is given random examples in which each point is sampled i.i.d. from some unknown distribution over the domain and is labeled by an unknown function from a set of functions $C$ (called a concept class). In the agnostic learning model the algorithm is given examples sampled i.i.d. from an arbitrary (and unknown) distribution over labeled points. The goal of the learning algorithm in both models is to output a hypothesis whose prediction error on the distribution from which examples are sampled is not higher (up to an additive $\varepsilon$) than the prediction error of the best function in $C$ (which is $0$ in the PAC model). See Section 2.1 for formal definitions.

We rely on the well-studied differential privacy model of privacy. Differential privacy gives a formal semantic guarantee of privacy, saying intuitively that no single individual’s data has too large of an effect on the output of the algorithm, and therefore observing the output of the algorithm does not leak much information about an individual’s private data (Dwork et al., 2006) (see Section 2.2 for the formal definition). The downside of this desirable guarantee is that for some problems achieving it has an additional cost: both in terms of the number of examples, or sample complexity, and computation.

The cost of differential privacy in PAC and agnostic learning was first studied by Kasiviswanathan et al. (2011). They showed that the sample complexity of differentially privately learning a concept class $C$ over domain $X$, denoted by $SCDP(C)$, is $O(\log|C|)$ (for now we ignore the dependence on other parameters and consider them to be small constants) and left open the natural question of whether $SCDP(C)$ is different from the VC dimension of $C$ which, famously, characterizes the sample complexity of learning $C$ (without privacy constraints). By Sauer’s lemma, $\log|C| = O(VC(C)\cdot\log|X|)$ and therefore the multiplicative gap between these two measures can be as large as $\log|X|$.

Subsequently, Beimel et al. (2010) showed that there exists a large concept class, specifically single points, for which the sample complexity of learning with privacy is a constant. They also show that differentially private proper learning (the output hypothesis has to be from $C$) of single points $Point_b$ and threshold functions $Thr_b$ on the set $I_b = \{0,1,\ldots,2^b-1\}$ requires $\Omega(b)$ samples. These results demonstrate that the sample complexity can be lower than $O(\log|C|)$ and also that lower bounds on the sample complexity of proper learning do not necessarily apply to the non-proper learning that we consider here. A similar lower bound on proper learning of thresholds on an interval was given by Chaudhuri and Hsu (2011) in a continuous setting where the sample complexity becomes infinite. They also showed that the sample complexity can be reduced to essentially $VC(C)$ by either adding distributional assumptions or by requiring only the privacy of the labels.

The upper bound of Beimel et al. (2010) is based on an observation from (Kasiviswanathan et al., 2011) that if there exists a class of functions $H$ such that for every $f \in C$ and every distribution $D$ over the domain, there exists $h \in H$ such that $\Pr_{x\sim D}[f(x)\neq h(x)] \le \varepsilon$, then the sample complexity of differentially private PAC learning with error $2\varepsilon$ can be reduced to $O(\log(|H|)/\varepsilon)$. They refer to such $H$ as an $\varepsilon$-representation of $C$, and define the (deterministic) $\varepsilon$-representation dimension of $C$, denoted as $DRDim_\varepsilon(C)$, as $\log|H|$ for the smallest $H$ that $\varepsilon$-represents $C$. We note that this natural notion can be seen as a distribution-independent version of the standard notion of $\varepsilon$-covering of $C$ in which the distribution over the domain is fixed (e.g. Benedek and Itai, 1991).

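Beimel et al. (2013a) then defined a probabilistic relaxation of $\varepsilon$-representation as follows. A distribution $\mathcal{H}$ over sets of boolean functions on $X$ is said to $(\varepsilon,\delta)$-probabilistically represent $C$ if for every $f \in C$ and distribution $D$ over $X$, with probability $1-\delta$ over the choice of $H \sim \mathcal{H}$, there exists $h \in H$ such that $\Pr_{x\sim D}[h(x)\neq f(x)] \le \varepsilon$. The $(\varepsilon,\delta)$-probabilistic representation dimension $PRDim_{\varepsilon,\delta}(C)$ is the minimal $\max_{H\in \mathrm{supp}(\mathcal{H})}\log|H|$, where the minimum is over all $\mathcal{H}$ that $(\varepsilon,\delta)$-probabilistically represent $C$. Beimel et al. (2013a) demonstrated that $PRDim(C)$ characterizes the sample complexity of differentially private PAC learning. In addition, they show that $DRDim$ can be upper-bounded by $PRDim$ as $DRDim(C) = O(PRDim(C) + \log\log|X|)$, where we omit $\varepsilon$ and $\delta$ when they are equal to $1/4$.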

Beimel et al. (2013b) consider PAC learning with approximate -differential privacy where the privacy guarantee holds with probability (the basic notion is also referred to as pure to distinguish it from the approximate version). They show that can be PAC learned using samples ( is a constant as before). Their algorithm is proper so this separates the sample complexity of pure differentially private proper PAC learning from the approximate version. This work leaves open the question of whether such a separation can be proved for (non-proper) PAC learning.

### 1.1 Our results

In this paper we resolve the open problems described above. In the process we also establish a new relation between $SCDP$ and Littlestone’s dimension, a well-studied measure of sample complexity of online learning (Littlestone, 1987) (see Section 2.5 for the definition). The main ingredient of our work is a characterization of $DRDim$ and $PRDim$ in terms of randomized one-way communication complexity of associated evaluation problems (Kremer et al., 1999). In such a problem Alice is given as input a function $f \in C$ and Bob is given an input $x \in X$. Alice sends a single message to Bob, and Bob’s goal is to compute $f(x)$. The question is how many bits Alice must communicate to Bob in order for Bob to be able to compute $f(x)$ correctly, with probability at least $2/3$ over the randomness used by Alice and Bob.

In the standard or “private-coin” version of this model, Alice and Bob each have their own source of random coins. The minimal number of bits needed to solve the problem for all $f \in C$ and $x \in X$ is denoted by $R^{\rightarrow}(C)$. In the stronger “public-coin” version of the model, Alice and Bob share access to the same source of random coins. The minimal number of bits needed to evaluate $C$ (with probability at least $2/3$) in this setting is denoted by $R^{\rightarrow,pub}(C)$. See Section 2.4 for formal definitions.

We show that these communication problems are equivalent to deterministic and probabilistic representation dimensions of $C$ and, in particular, $SCDP(C) = \Theta(R^{\rightarrow,pub}(C))$ (for clarity we omit the accuracy and confidence parameters, see Theorem 3.1 and Theorem 3.2 for details).

###### Theorem 1.1.

$DRDim(C) = \Theta(R^{\rightarrow}(C))$ and $PRDim(C) = \Theta(R^{\rightarrow,pub}(C))$.

The evaluation of threshold functions on a (discretized) interval corresponds to the well-studied “greater than” function in communication complexity denoted as $GT_b$. $GT_b(x,y) = 1$ if and only if $x > y$, where $x, y \in \{0,1\}^b$ are viewed as binary representations of integers. It is known that $R^{\rightarrow,pub}(GT_b) = \Omega(b)$ (Miltersen et al., 1998). By combining this lower bound with Theorem 1.1 we obtain a class whose VC dimension is 1 yet it requires at least $\Omega(b)$ samples to PAC learn differentially privately.

This equivalence also shows that some of the known results in (Beimel et al., 2010, 2013a) are implied by well-known results from communication complexity, sometimes also giving simpler proofs. For example, (1) the constant upper bound on the sample complexity of single points follows from the communication complexity of the equality function and (2) the bound $DRDim(C) = O(PRDim(C) + \log\log|X|)$ follows from the classical result of Newman (1991) on the relationship between the public and private coin models. See Section 3.1 for more details and additional examples.

Our second contribution is a relationship of $SCDP(C)$ (via the equivalences with $R^{\rightarrow,pub}(C)$) to Littlestone’s (1987) dimension of $C$. Specifically, we prove

###### Theorem 1.2.
1. $R^{\rightarrow,pub}(C) = \Omega(LDim(C))$.

2. For any $t$, there exists a class $C$ such that $LDim(C) = 2$ but $R^{\rightarrow,pub}(C) \ge t$.

The first result follows from a natural reduction to the augmented index problem, which is well-studied in communication complexity (Bar-Yossef et al., 2004). While new in our context, the relationship of Littlestone’s dimension to quantum communication complexity was shown by Zhang (2011). Together with numerous known bounds on (e.g. Littlestone, 1987; Maass and Turán, 1994b), our result immediately yields a number of new lower bounds on . In particular, results of Maass and Turán (1994b) imply that linear threshold functions over require samples to learn differentially privately. This implies that differentially private learners need to pay an additional dimension factor as well as a bit complexity of point representation factor over non-private learners. To the best of our knowledge such strong separation was not known before for problems defined over i.i.d. samples from a distribution (as opposed to worst case inputs). Note that this lower bound is also almost tight since (e.g. Muroga, 1971).

In the second result of Theorem 1.2 we use the class of lines in $\mathbb{Z}_p^2$ (a plane over a finite field $\mathbb{Z}_p$). A lower bound on the one-way quantum communication complexity of this class was first given by Aaronson (2004) using his method based on a trace distance.

Finally, we consider PAC learning with -differential privacy. Our lower bound of on of thresholds together with the upper bound of from (Beimel et al., 2013b) immediately imply a separation between the sample complexities of pure and approximate differential privacy. We show a stronger separation for the concept class :

###### Theorem 1.3.

The sample complexity of -differentially privately learning is .

Our upper bound is also simpler than the upper bound in (Beimel et al., 2013b). See Section 6 for details.

### 1.2 Related work

There is now an extensive amount of literature on differential privacy in machine learning and related areas which we cannot hope to cover here. The reader is referred to the excellent surveys in (Sarwate and Chaudhuri, 2013; Dwork and Roth, 2014).

Blum et al. (2005) showed that algorithms that can be implemented in the statistical query (SQ) framework of Kearns (1998) can also be easily converted to differentially private algorithms. This result implies polynomial upper bounds on the sample (and computational) complexity of all learning problems that can be solved using statistical queries (which includes the vast majority of problems known to be solvable efficiently). Formal treatment of differentially private PAC and agnostic learning was initiated in the seminal work of Kasiviswanathan et al. (2011). Aside from the results we already mentioned, they separated SQ learning from differentially private learning. Further, they showed that SQ learning is (up to polynomial factors) equivalent to local differential privacy, a more stringent model in which each data point is privatized before reaching the learning algorithm.

The results of this paper are for the distribution-independent learning, where the learner does not know the distribution over the domain. Another commonly-considered setting is distribution-specific learning in which the learner only needs to succeed with respect to a single fixed distribution known to the learner. Differentially private learning in this setting and its relaxation in which the learner only knows a distribution close to were studied by Chaudhuri and Hsu (2011). restricted to a fixed distribution is denoted by and equals to the logarithm of the smallest -cover of with respect to the disagreement metric given by (also referred to as the metric entropy). The standard duality between packing and covering numbers also implies that , and therefore these notions are essentially identical. It also follows from the prior work (Kasiviswanathan et al., 2011; Chaudhuri and Hsu, 2011), that characterizes the complexity of differentially private PAC and agnostic learning up to the dependence on the error parameter in the same way as it does for (non-private) learning (Benedek and Itai, 1991). Namely, samples are necessary to learn -differentially privately with error (and even if only weaker label differentially-privacy is desired (Chaudhuri and Hsu, 2011))and samples suffice for -differentially private PAC learning. This implies that in this setting there are no dimension or bit-complexity costs incurred by differentially private learners. Chaudhuri and Hsu (2011) also show that doubling dimension at an appropriate scale can be used to give upper and lower bounds on sample complexity of distribution-specific private PAC learning that match up to logarithmic factors.

In a related problem of sanitization of queries from the concept class the input is a database of points in and the goal is to output differentially privately a “synthetic” database such that for every , . This problem was first considered by Blum et al. (2013) who showed an upper bound of on the size of the database sufficient for this problem and also showed a lower bound of on the number of samples required for solving this problem when for . It is easy to see that from the point of view of sample complexity this problem is at least as hard as (differentially private) proper agnostic learning of (e.g. Gupta et al., 2011). Therefore lower bounds on proper learning such as those in (Beimel et al., 2010) and (Chaudhuri and Hsu, 2011) apply to this problem and can be much larger than that we study. That said, to the best of our knowledge, the lower bound for linear threshold functions that we give was not known even for this harder problem. Aside from sample complexity this problem is also computationally intractable for many interesting classes (see (Ullman, 2013) and references therein for recent progress).

Sample complexity of more general problems in statistics was investigated in several works starting with Dwork and Lei (2009) (measured alternatively via convergence rates of statistical estimators) (Smith, 2011; Chaudhuri and Hsu, 2012; Duchi et al., 2013a, b). A recent work of Duchi et al. (2013a) shows a number of -dimensional problems where differentially private algorithms must incur an additional factor cost in sample complexity. However their lower bounds apply only to a substantially more stringent local model of differential privacy and are known not to hold in the model we consider here.

Differentially private communication protocols were studied by McGregor et al. (2010) who showed that differential-privacy can be exploited to obtain a low-communication protocol and vice versa. Conceptually this result is similar to the characterization of sample complexity using given in (Beimel et al., 2013a). Our contribution is orthogonal to (McGregor et al., 2010) since the main step in our work is going from a learning problem to a communication protocol for a different problem.

## 2 Preliminaries

### 2.1 Learning models

###### Definition 2.0.

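An algorithm $A$ PAC learns a concept class $C$ from $n$ examples if for every $\varepsilon > 0$, $\delta > 0$, $f \in C$ and distribution $D$ over $X$, $A$ given access to $S = \{(x_i, y_i)\}_{i\in[n]}$ where each $x_i$ is drawn randomly from $D$ and $y_i = f(x_i)$, outputs, with probability at least $1-\delta$ over the choice of $S$ and the randomness of $A$, a hypothesis $h$ such that $\Pr_{x\sim D}[f(x)\neq h(x)] \le \varepsilon$.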

Agnostic learning: The agnostic learning model was introduced by Haussler (1992) and Kearns et al. (1994) in order to model situations in which the assumption that examples are labeled by some does not hold. In its least restricted version the examples are generated from some unknown distribution over . The goal of an agnostic learning algorithm for a concept class is to produce a hypothesis whose error on examples generated from is close to the best possible by a concept from . For a Boolean function and a distribution over let . Define . Kearns et al. (1994) define agnostic learning as follows.

###### Definition 2.0.

An algorithm agnostically learns a concept class if for every , distribution over , , given access to where each is drawn randomly from , outputs, with probability at least over the choice of and the randomness of , a hypothesis such that .

In both the PAC and the agnostic learning models, an algorithm that outputs a hypothesis in $C$ is referred to as proper.

### 2.2 Differentially Private Learning

Two sample sets $S = \{(x_i,y_i)\}_{i\in[n]}$ and $S' = \{(x'_i,y'_i)\}_{i\in[n]}$ are said to be neighboring if there exists $i \in [n]$ such that $(x_i,y_i)\neq(x'_i,y'_i)$, and for all $j \neq i$ it holds that $(x_j,y_j) = (x'_j,y'_j)$. For $\alpha,\beta > 0$, an algorithm $A$ is $(\alpha,\beta)$-differentially private if for all neighboring $S, S'$ and for all sets of outputs $T$:

$$\Pr[A(S)\in T] \le e^{\alpha}\Pr[A(S')\in T] + \beta,$$

where the probability is over the randomness of $A$ (Dwork et al., 2006). When $A$ is $(\alpha,0)$-differentially private we say that it satisfies pure differential privacy, which we also write as $\alpha$-differential privacy.

Intuitively, each sample used by a learning algorithm is the record of one individual, and the privacy definition guarantees that by changing one record the output distribution of the learner does not change by much. We remark that, in contrast to the accuracy of learning requirement, the differential privacy requirement holds in the worst case for all neighboring sets of examples $S, S'$, not just those sampled i.i.d. from some distribution. We refer the reader to the literature for a further justification of this notion of privacy (Dwork et al., 2006).
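As a concrete instance of the definition, the following minimal Python sketch checks pure $\alpha$-differential privacy for randomized response applied to a data set consisting of a single private bit. Randomized response is not used in this paper; it and the function names below are illustrative assumptions chosen only because the output distribution is small enough to inspect exhaustively.

```python
import math

def randomized_response_probs(bit: int, alpha: float) -> dict:
    """Output distribution of randomized response on one private bit.

    The true bit is reported with probability e^alpha / (1 + e^alpha)
    and flipped otherwise (illustrative mechanism, not from the paper).
    """
    keep = math.exp(alpha) / (1.0 + math.exp(alpha))
    return {bit: keep, 1 - bit: 1.0 - keep}

def worst_case_ratio(alpha: float) -> float:
    """Largest ratio Pr[A(S) = o] / Pr[A(S') = o] over the two neighboring
    one-record data sets S = {0}, S' = {1} and both outputs o."""
    p0 = randomized_response_probs(0, alpha)
    p1 = randomized_response_probs(1, alpha)
    return max(max(p0[o] / p1[o], p1[o] / p0[o]) for o in (0, 1))

# The worst-case ratio equals e^alpha, so the mechanism is alpha-DP with beta = 0.
ratio = worst_case_ratio(0.5)
```

Checking single outputs suffices for the pure ($\beta = 0$) case, since the probability of any event $T$ is a sum of probabilities of individual outputs.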

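The sample complexity $SCDP_{\alpha,\varepsilon,\delta}(C)$ is the minimal $n$ such that it is information-theoretically possible to $(\varepsilon,\delta)$-accurately and $\alpha$-differentially privately PAC learn $C$ with $n$ examples. $SCDP$ without subscripts refers to $SCDP_{1/4,1/4,1/4}$.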

### 2.3 Representation Dimension

###### Definition 2.0 (Beimel et al., 2010).

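A class of functions $H$ $\varepsilon$-represents $C$ if for every $f \in C$ and every distribution $D$ over the input domain of $f$, there exists $h \in H$ such that $\Pr_{x\sim D}[f(x)\neq h(x)] \le \varepsilon$. The deterministic representation dimension of $C$, denoted as $DRDim_\varepsilon(C)$, equals $\log|H|$ for the smallest $H$ that $\varepsilon$-represents $C$. We also let $DRDim(C) = DRDim_{1/4}(C)$.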

###### Definition 2.0 (Beimel et al., 2013a).

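A distribution $\mathcal{H}$ over sets of boolean functions on $X$ is said to $(\varepsilon,\delta)$-probabilistically represent $C$ if for every $f \in C$ and distribution $D$ over $X$, with probability $1-\delta$ over the choice of $H \sim \mathcal{H}$, there exists $h \in H$ such that $\Pr_{x\sim D}[h(x)\neq f(x)] \le \varepsilon$. The $(\varepsilon,\delta)$-probabilistic representation dimension $PRDim_{\varepsilon,\delta}(C)$ equals the minimal value of $\max_{H\in \mathrm{supp}(\mathcal{H})}\log|H|$, where the minimum is over all $\mathcal{H}$ that $(\varepsilon,\delta)$-probabilistically represent $C$. We also let $PRDim(C) = PRDim_{1/4,1/4}(C)$.

Beimel et al. (2013a) proved the following characterization of $SCDP$ by $PRDim$.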

###### Theorem 2.1 (Kasiviswanathan et al., 2011; Beimel et al., 2013a).
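$$SCDP_{\alpha,\varepsilon,\delta}(C) = O\left(\frac{1}{\alpha\varepsilon}\left(\log(1/\varepsilon)\cdot\left(PRDim_{1/4,1/4}(C) + \log\log\frac{1}{\varepsilon\delta}\right) + \log\frac{1}{\delta}\right)\right),$$

$$SCDP_{\alpha,\varepsilon,\delta}(C) = \Omega\left(\frac{1}{\alpha\varepsilon}\,PRDim_{1/4,1/4}(C)\right).$$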

For agnostic learning we have that sample complexity is at most

This form of upper bound combines accuracy and confidence boosting from (Beimel et al., 2013a) to first obtain -probabilistic representation and then the use of exponential mechanism as in (Kasiviswanathan et al., 2011). The results in (Kasiviswanathan et al., 2011) show the extension of this bound to agnostic learning. Note that the characterization for PAC learning is tight up to logarithmic factors.
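To make the role of the exponential mechanism concrete, here is a minimal Python sketch of private selection of a hypothesis from a finite set $H$, with the empirical error count as the score. The function names, the toy threshold hypotheses, and the sample are assumptions made for illustration; the sketch is not the learner of Kasiviswanathan et al. (2011) in full, only its selection step.

```python
import math
import random

def exponential_mechanism_probs(hypotheses, sample, alpha):
    """Selection probabilities Pr[h] proportional to exp(-alpha * errors(h) / 2).

    Replacing one labeled example changes every error count by at most 1,
    which is the sensitivity bound that makes this selection alpha-DP.
    """
    errors = [sum(1 for x, y in sample if h(x) != y) for h in hypotheses]
    best = min(errors)  # shift scores for numerical stability; probabilities are unchanged
    weights = [math.exp(-alpha * (e - best) / 2.0) for e in errors]
    total = sum(weights)
    return [w / total for w in weights]

def private_select(hypotheses, sample, alpha, rng=random):
    """Draw one hypothesis according to the exponential mechanism."""
    probs = exponential_mechanism_probs(hypotheses, sample, alpha)
    return rng.choices(hypotheses, weights=probs, k=1)[0]

# Toy example (hypothetical data): thresholds t(x) = 1 iff x <= t on {0, 1, 2, 3},
# with a sample labeled by the threshold t = 2.
H = [lambda x, t=t: int(x <= t) for t in range(4)]
S = [(0, 1), (1, 1), (2, 1), (3, 0)]
probs = exponential_mechanism_probs(H, S, alpha=1.0)
```

The number of examples needed for the selected hypothesis to be accurate grows with $\log|H|$, which is why a small representation translates into a small sample complexity.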

### 2.4 Communication Complexity

Let $X$ and $Y$ be some sets. A private-coin one-way protocol $\pi(x,y)$ from Alice who holds $x \in X$ to Bob who holds $y \in Y$ is given by Alice’s randomized algorithm producing a communication $\sigma$ and Bob’s randomized algorithm which outputs a boolean value. We describe Alice’s algorithm by a function $\pi_A(x; r_A)$ of the input $x$ and random bits $r_A$, and Bob’s algorithm by a function $\pi_B(\sigma, y; r_B)$ of input $y$, communication $\sigma$ and random bits $r_B$. (These algorithms need not be efficient.) The (randomized) output of the protocol on input $(x,y)$ is the value of $\pi(x,y; r_A, r_B) = \pi_B(\pi_A(x; r_A), y; r_B)$ on a randomly and uniformly chosen $r_A$ and $r_B$. The cost of the protocol is given by the maximum of $|\pi_A(x; r_A)|$ over all $x$ and all possible random coins.

A public-coin one-way protocol $\pi(x,y)$ is given by a randomized Alice’s algorithm described by a function $\pi_A(x; r)$ and a randomized Bob’s algorithm described by a function $\pi_B(\sigma, y; r)$. The (randomized) output of the protocol on input $(x,y)$ is the value of $\pi(x,y; r) = \pi_B(\pi_A(x; r), y; r)$ on a randomly and uniformly chosen $r$. The cost of the protocol is defined as in the private-coin case.

Let denote the class of all private-coin one-way protocols computing with error , namely private-coin one-way protocols satisfying for all

$$\Pr_{r_A,r_B}[\pi(x,y;r_A,r_B) = g(x,y)] \ge 1-\varepsilon.$$

Define similarly as the class of all public-coin one-way protocols computing . Define and .

A deterministic one-way protocol and its cost are defined as above but without dependence on random bits. We will also require distributional notions of complexity, where there is a fixed input distribution from which are drawn. For a distribution over , we define to be all deterministic one-way protocols such that

$$\Pr_{(x,y)\sim\mu}[\pi(x,y) = g(x,y)] \ge 1-\varepsilon.$$

Define . A standard averaging argument shows that the quantity remains unchanged even if we took the minimum over randomized (either public or private coin) protocols computing with error (i.e. since there must exist a fixing of the private coins that achieves as good error as the average error).

Yao’s minimax principle (Yao, 1977) states that for all functions :

$$R^{\rightarrow,pub}_{\varepsilon}(g) = \max_{\mu} D^{\rightarrow}_{\varepsilon}(g;\mu). \tag{2.1}$$

Error in both public and private-coin protocols can be reduced by using several independent copies of the protocol and then taking a majority vote of the result. This implies that for every ,

$$R^{\rightarrow,pub}_{\varepsilon}(f) = O\left(R^{\rightarrow,pub}_{1/2-\gamma}(f)\cdot\log(1/\varepsilon)/\gamma^2\right). \tag{2.2}$$

Analogous statement holds for . This allows us to treat protocols with constant errors in range as equivalent up to a constant factor in the communication complexity.
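The effect of the majority vote behind eq. (2.2) can be computed exactly from the binomial distribution. The short Python sketch below is only an illustration (the function name is hypothetical): it evaluates the probability that a majority of $k$ independent copies errs when each copy errs independently with probability $p < 1/2$. Sending $k$ copies multiplies the communication by $k$.

```python
from math import comb

def majority_error(p: float, k: int) -> float:
    """Probability that the majority of k independent runs is wrong,
    when each run errs independently with probability p (k must be odd)."""
    if k % 2 == 0:
        raise ValueError("use an odd number of repetitions to avoid ties")
    # The majority is wrong exactly when more than k/2 of the runs are wrong.
    return sum(comb(k, j) * p**j * (1 - p)**(k - j)
               for j in range(k // 2 + 1, k + 1))

# Error after 1, 3, 5, ..., 15 repetitions of a protocol with error 1/4.
errors = [majority_error(0.25, k) for k in range(1, 16, 2)]
```

The error decays exponentially in $k$ at a rate governed by $\gamma = 1/2 - p$, which matches the $\log(1/\varepsilon)/\gamma^2$ factor in eq. (2.2).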

### 2.5 Littlestone’s Dimension

While in this work we will not use the definition of the online mistake-bound model itself, we briefly describe it for completeness. In the online mistake-bound model learning proceeds in rounds. At the beginning of round $t$, a learning algorithm has some hypothesis $h_t$. In round $t$, the learner sees a point $x_t$ and predicts $h_t(x_t)$. At the end of the round, the correct label $y_t$ is revealed and the learner makes a mistake if $h_t(x_t) \neq y_t$. The learner then updates its hypothesis to $h_{t+1}$ and this process continues. When learning a concept class $C$ in this model, $y_t = f(x_t)$ for some unknown $f \in C$. The (sample) complexity of such learning is defined as the largest number of mistakes that any learning algorithm can be forced to make when learning $C$. Littlestone (1987) proved that it is exactly characterized by a dimension defined as follows.

Let $C$ be a concept class over domain $X$. A mistake tree $T$ over $X$ and $C$ is a binary tree in which each internal node $v$ is labelled by a point $x_v \in X$ and each leaf $\ell$ is labelled by a concept $c_\ell \in C$. Further, for every node $v$ and leaf $\ell$: if $\ell$ is in the right subtree of $v$ then $c_\ell(x_v) = 1$, otherwise $c_\ell(x_v) = 0$. We remark that a mistake tree over $X$ and $C$ does not necessarily include all concepts from $C$ in its leaves. Such a tree is called complete if all its leaves are at the same depth. Littlestone’s dimension $LDim(C)$ is defined as the depth of the deepest complete mistake tree over $X$ and $C$ (Littlestone, 1987). Littlestone’s dimension is also known to exactly characterize the number of (general) equivalence queries required to learn $C$ in Angluin’s (1988) exact model of learning (Littlestone, 1987).
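For small finite classes the depth of the deepest complete mistake tree can be computed by brute force, which may help build intuition for the definition. The Python sketch below is an illustration under assumed conventions: concepts are given as truth tables (0/1 tuples) over a finite domain, thresholds are taken to be $x \mapsto 1$ iff $x \le y$, and all function names are hypothetical.

```python
from functools import lru_cache

def littlestone_dimension(concepts):
    """Depth of the deepest complete mistake tree for a finite class.

    `concepts` is an iterable of equal-length 0/1 tuples (truth tables).
    Exponential time in general; intended for tiny examples only.
    """
    concepts = frozenset(concepts)
    if not concepts:
        raise ValueError("the concept class must be non-empty")
    domain_size = len(next(iter(concepts)))

    @lru_cache(maxsize=None)
    def ldim(cs):
        best = 0  # a single leaf is a complete tree of depth 0
        for x in range(domain_size):
            left = frozenset(c for c in cs if c[x] == 0)
            right = cs - left
            if left and right:
                # Root labelled x; both subtrees must be complete of equal depth.
                best = max(best, 1 + min(ldim(left), ldim(right)))
        return best

    return ldim(concepts)

def thresholds(b):
    """Truth tables of x -> [x <= y] for y in {0, ..., 2^b - 1}."""
    n = 2 ** b
    return [tuple(int(x <= y) for x in range(n)) for y in range(n)]

def points(b):
    """Truth tables of x -> [x == y] for y in {0, ..., 2^b - 1}."""
    n = 2 ** b
    return [tuple(int(x == y) for x in range(n)) for y in range(n)]
```

On these toy classes the recursion gives dimension $b$ for the $2^b$ thresholds (binary search forces $b$ mistakes) and dimension 1 for point functions.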

## 3 Equivalence between representation dimension and communication complexity

We relate communication complexity to private learning by considering the communication problem associated with evaluating a function $f$ from a concept class $C$ on an input $x \in X$. Formally, for a Boolean concept class $C$ over domain $X$, define $Eval_C : C \times X \to \{0,1\}$ to be the function defined as $Eval_C(f,x) = f(x)$. In a slight abuse of notation we use $R^{\rightarrow}(C)$ to denote $R^{\rightarrow}(Eval_C)$ (and similarly for $R^{\rightarrow,pub}(C)$).

Our main result is the following two bounds.

###### Theorem 3.1.

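For any $\varepsilon \in [0,1/2]$ and $\delta \in [0,1]$, and any concept class $C$, it holds that:

• $PRDim_{\varepsilon,\delta}(C) \le R^{\rightarrow,pub}_{\varepsilon\delta}(C)$.

• $PRDim_{\varepsilon,\delta}(C) \ge R^{\rightarrow,pub}_{\varepsilon+\delta-\varepsilon\delta}(C)$.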

###### Proof.

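$PRDim_{\varepsilon,\delta}(C) \le R^{\rightarrow,pub}_{\varepsilon\delta}(C)$: let $\pi$ be the public-coin one-way protocol that achieves the optimal communication complexity $c = R^{\rightarrow,pub}_{\varepsilon\delta}(C)$. For each choice of the public random coins $r$, let $H_r$ denote the set of functions $h_\sigma(x) = \pi_B(\sigma, x; r)$ over all possible messages $\sigma$. Thus, each $H_r$ has size at most $2^c$. Let the distribution $\mathcal{H}$ be to choose uniformly random $r$ and then output $H_r$.

We show that this family $(\varepsilon,\delta)$-probabilistically represents $C$. We know from the fact that $\pi$ computes $Eval_C$ with error $\varepsilon\delta$ that it must hold for all $f \in C$ and $x \in X$ that:

$$\Pr_{r}[\pi_B(\pi_A(f;r),x;r)\neq f(x)] \le \varepsilon\delta.$$

In particular, it must hold for any distribution $D$ over $X$ that:

$$\Pr_{x\sim D,\,r}[\pi_B(\pi_A(f;r),x;r)\neq f(x)] \le \varepsilon\delta.$$

Therefore, by Markov’s inequality applied to the inner probability as a function of $r$, it must hold that

$$\Pr_{r}\left[\Pr_{x\sim D}[\pi_B(\pi_A(f;r),x;r)\neq f(x)] > \varepsilon\right] < \delta.$$

Note that $\pi_B(\pi_A(f;r),\cdot\,;r) \in H_r$ and therefore, with probability at least $1-\delta$ over the choice of $H_r \sim \mathcal{H}$, there exists $h \in H_r$ such that $\Pr_{x\sim D}[h(x)\neq f(x)] \le \varepsilon$.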

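$PRDim_{\varepsilon,\delta}(C) \ge R^{\rightarrow,pub}_{\varepsilon+\delta-\varepsilon\delta}(C)$: let $\mathcal{H}$ be the distribution over sets of boolean functions that achieves $PRDim_{\varepsilon,\delta}(C)$. We will show that for each distribution $\mu$ over inputs $(f,x)$, we can construct a $(\varepsilon+\delta-\varepsilon\delta)$-correct protocol for $Eval_C$ over $\mu$ that has communication bounded by $PRDim_{\varepsilon,\delta}(C)$. Namely, we will prove that

$$\max_{\mu} D^{\rightarrow}_{\varepsilon+\delta-\varepsilon\delta}(Eval_C;\mu) \le PRDim_{\varepsilon,\delta}(C). \tag{3.1}$$

By Yao’s minimax principle (Equation 2.1) (Yao, 1977) this implies that

$$R^{\rightarrow,pub}_{\varepsilon+\delta-\varepsilon\delta}(C) \le PRDim_{\varepsilon,\delta}(C).$$

Fix $\mu$. This induces a marginal distribution $F$ over functions $f \in C$ and for every $f$ a distribution $D_f$ which is $\mu$ conditioned on the function being $f$ (note that $\mu$ is equivalent to drawing $f$ from $F$ and then $x$ from $D_f$). The protocol $\pi$ is defined as follows: use public coins $r$ to sample $H \sim \mathcal{H}$. Alice knows $f$ and so knows the distribution $D_f$. Alice sends the index of $h \in H$ such that $\Pr_{x\sim D_f}[h(x)\neq f(x)] \le \varepsilon$ if such $h$ exists, or an arbitrary $h \in H$ otherwise. Bob returns $h(x)$.

The error of this protocol can be analyzed as follows. Fix $f$ and let $G_f$ denote the event that $H$ contains $h$ such that $\Pr_{x\sim D_f}[h(x)\neq f(x)] \le \varepsilon$. Observe that $G_f$ is independent of $x$, so that even conditioned on $G_f$, $x$ remains distributed according to $D_f$. Also, since $\mathcal{H}$ $(\varepsilon,\delta)$-probabilistically represents $C$, we know that for every $f$, $\Pr_r[G_f] \ge 1-\delta$. Therefore we can deduce that:

$$\begin{aligned}
\Pr_{r,(f,x)\sim\mu}[\pi(f,x;r)=f(x)] &= \Pr_{r,(f,x)\sim\mu}[\pi(f,x;r)=f(x)\wedge G_f] + \Pr_{r,(f,x)\sim\mu}[\pi(f,x;r)=f(x)\wedge\neg G_f] \\
&\ge \Pr_{r,f\sim F}[G_f]\cdot\Pr_{r,x\sim D_f}[\pi(f,x;r)=f(x)\mid G_f] \\
&\ge (1-\delta)(1-\varepsilon) = 1-\delta-\varepsilon+\varepsilon\delta.
\end{aligned}$$

Thus $\pi$ computes $Eval_C$ with error at most $\varepsilon+\delta-\varepsilon\delta$ and it has communication bounded by $PRDim_{\varepsilon,\delta}(C)$.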

We also establish an analogous equivalence for $DRDim$ and private-coin protocols.

###### Theorem 3.2.

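For any $\varepsilon \in [0,1/2]$, it holds that:

• $DRDim_{\varepsilon}(C) \le R^{\rightarrow}_{\varepsilon/2}(C)$.

• $DRDim_{\varepsilon}(C) \ge R^{\rightarrow}_{\varepsilon}(C)$.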

###### Proof.

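$DRDim_{\varepsilon}(C) \le R^{\rightarrow}_{\varepsilon/2}(C)$: let $c = R^{\rightarrow}_{\varepsilon/2}(C)$ and fix the private-coin one-way protocol $\pi$ that achieves $c$. We define the deterministic representation $H$ to be all functions $h_\sigma(x) = \mathrm{maj}_{r_B}\{\pi_B(\sigma,x;r_B)\}$, i.e. the majority value of Bob’s outputs on input $x$ and communication $\sigma$. Observe that there are at most $2^c$ such functions (one for each possible $\sigma$) and therefore it suffices to show that $H$ $\varepsilon$-deterministically represents $C$. To see this, observe that for each $f \in C$, and all $x \in X$, it holds that:

$$\Pr_{r_B,\ \sigma\leftarrow_R\pi_A(f;r_A)}[f(x)=\pi_B(\sigma,x;r_B)] \ge 1-\varepsilon/2.$$

In particular, this means that for all distributions $D$ over $X$, it holds that

$$\Pr_{x\sim D,\ r_B,\ \sigma\leftarrow_R\pi_A(f;r_A)}[f(x)=\pi_B(\sigma,x;r_B)] \ge 1-\varepsilon/2.$$

By a standard averaging argument, there must exist at least one $\sigma$ such that

$$\Pr_{x\sim D,\ r_B}[f(x)=\pi_B(\sigma,x;r_B)] \ge 1-\varepsilon/2.$$

Now say that $x$ is bad if $\Pr_{r_B}[f(x)=\pi_B(\sigma,x;r_B)] \le 1/2$. Each bad $x$ contributes error at least $1/2$, so by the above it follows that $\Pr_{x\sim D}[x \text{ is bad}] \le \varepsilon$. By definition, if $x$ is not bad then $h_\sigma(x) = f(x)$, since $h_\sigma(x)$ is the majority of $\pi_B(\sigma,x;r_B)$ over all $r_B$. Therefore

$$\Pr_{x\sim D}[f(x)=h_\sigma(x)] \ge 1-\varepsilon.$$

This implies that $H$ $\varepsilon$-deterministically represents $C$.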

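$DRDim_{\varepsilon}(C) \ge R^{\rightarrow}_{\varepsilon}(C)$: We first apply von Neumann’s Minimax theorem to the definition of deterministic representation. In particular, suppose $H$ is the family of functions that achieves $DRDim_{\varepsilon}(C)$. Thus, for each $f \in C$ and each distribution $D$ over $X$, there exists $h \in H$ such that $\Pr_{x\sim D}[h(x)=f(x)] \ge 1-\varepsilon$. We define a zero-sum game for each $f$ with the first player choosing a point $x \in X$ and the second player choosing a hypothesis $h \in H$ and the payoff of the second player being $1-|h(x)-f(x)|$. The definition of $H$ implies that for every mixed strategy of the first player the second player has a pure strategy that achieves payoff of at least $1-\varepsilon$. By the Minimax theorem there exists a distribution $\mathbf{h}_f$ over $H$ such that, for every $x \in X$, it holds that

$$\mathbb{E}_{h\sim\mathbf{h}_f}[1-|h(x)-f(x)|] = \Pr_{h\sim\mathbf{h}_f}[h(x)=f(x)] \ge 1-\varepsilon.$$

Our private-coin protocol for $Eval_C$ will be the following: on input $f$, Alice will use her private randomness to sample $h \sim \mathbf{h}_f$ and send the index of $h$ to Bob. Bob then outputs $h(x)$. Thus, for each $f \in C$ and $x \in X$, it holds that

$$\Pr_{r_A}[\pi(f,x;r_A)=f(x)] = \Pr_{h\sim\mathbf{h}_f}[h(x)=f(x)] \ge 1-\varepsilon$$

and so the protocol computes $Eval_C$ with error $\varepsilon$.

An immediate corollary of these equivalences and eq. (2.2) is that $DRDim(C) = \Theta(R^{\rightarrow}(C))$ and $PRDim(C) = \Theta(R^{\rightarrow,pub}(C))$, as we stated in Theorem 1.1.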

### 3.1 Applications

Our equivalence theorems allow us to import many results from communication complexity into the context of private PAC learning, both proving new facts and simplifying proofs of previously known results in the process.

#### Separating SCDP and VC dimension.

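Define $Thr_b$ as the family of functions $t_y : I_b \to \{0,1\}$ for $y \in I_b = \{0,1,\ldots,2^b-1\}$, where $t_y(x) = 1$ if and only if $x \le y$. The lower bound follows from an observation that $Eval_{Thr_b}$ is equivalent to the “greater-than” function: $GT_b(x,y) = 1$ if and only if $x > y$, where $x, y \in \{0,1\}^b$ are viewed as binary representations of integers in $I_b$. Note that $Eval_{Thr_b}(t_y, x) = 1 - GT_b(x,y)$ and therefore these functions are the same up to negation. $GT_b$ is a well studied function in communication complexity and it is known that $R^{\rightarrow,pub}(GT_b) = \Omega(b)$ (Miltersen et al., 1998). By combining this lower bound with Theorem 3.1 we obtain that $VC(Thr_b) = 1$ yet $PRDim(Thr_b) = \Omega(b)$. From Theorem 2.1 it follows that $SCDP(Thr_b) = \Omega(b)$.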
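The two elementary facts used in this argument, that evaluating a threshold is the negation of greater-than and that thresholds have VC dimension 1, can be verified by brute force for small $b$. The Python sketch below assumes the convention that the threshold with parameter $y$ outputs 1 exactly when $x \le y$; the function names are hypothetical, and the check says nothing about the communication lower bound itself.

```python
from itertools import combinations

def thr(y):
    """Threshold concept with parameter y: x -> 1 iff x <= y (assumed convention)."""
    return lambda x: int(x <= y)

def gt(x, y):
    """Greater-than on integers: 1 iff x > y."""
    return int(x > y)

def vc_dimension(concepts, domain):
    """Size of the largest subset of `domain` shattered by `concepts` (brute force)."""
    d = 0
    for size in range(1, len(domain) + 1):
        shattered = any(
            len({tuple(c(x) for x in pts) for c in concepts}) == 2 ** size
            for pts in combinations(domain, size)
        )
        if not shattered:
            break  # if no set of this size is shattered, no larger set is
        d = size
    return d

b = 3
domain = list(range(2 ** b))
concepts = [thr(y) for y in domain]
vc = vc_dimension(concepts, domain)
negation_holds = all(thr(y)(x) == 1 - gt(x, y) for x in domain for y in domain)
```

No pair $x_1 < x_2$ can receive labels $(0,1)$ from a threshold, which is why the search stops at size 1 regardless of $b$, while the class itself has $2^b$ members.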

We note that it is known that VC dimension corresponds to the maximal distributional one-way communication complexity over all product input distributions. Hence this separation is analogous to separation of distributional one-way complexity over product distributions and the maximal distributional complexity over all distributions achieved using the greater-than function (Kremer et al., 1999).

We also give more such separations using lower bounds on based on Littlestone’s dimension. These are discussed in Section 4.

#### Accuracy and confidence boosting.

Our equivalence theorems give a simple alternative way to reduce error in probabilistic and deterministic representations without using sequential boosting as was done in (Beimel et al., 2013a). Given a private PAC learner with constant error, say , one can first convert the learner to a communication protocol with error , use simple independent repetitions (as in eq.(2.2)) to reduce the error to , and then convert the protocol back into a -probabilistic representation. The “magic” here happens when we convert between the communication complexity and probabilistic representation using min-max type arguments. This is the same tool that can be used to prove (computationally inefficient) boosting theorems.

#### Probabilistic vs. deterministic representation dimension.

It was shown by Newman (1991) that public and private coin complexity are the same up to additive logarithmic terms. In our setting (and with a specific choice of error bounds to simplify presentation), Newman’s theorem implies that

$$R^{\rightarrow}_{1/8}(C) \le R^{\rightarrow,\mathrm{pub}}_{1/9}(C) + O(\log\log(|C||X|)). \tag{3.2}$$

We know by Sauer’s lemma that $\log |C| \le O(\mathrm{VC}(C) \cdot \log |X|)$, therefore we deduce that:

$$R^{\rightarrow}_{1/8}(C) \le R^{\rightarrow,\mathrm{pub}}_{1/9}(C) + O(\log\log \mathrm{VC}(C) + \log\log |X|).$$

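By our equivalence theorems, $\mathrm{DRDim}(C) = \Theta(R^{\rightarrow}_{1/8}(C))$ and $\mathrm{PRDim}(C) = \Theta(R^{\rightarrow,\mathrm{pub}}_{1/9}(C))$. This implies that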

$$\mathrm{DRDim}(C) = O(\mathrm{PRDim}(C) + \log\log |X|).$$

A version of this was first proved in (Beimel et al., 2013a), whose proof is similar in spirit to the proof of Newman’s theorem. We also remark that the fact that $\mathrm{DRDim}(\mathsf{Point}_b) = \Omega(\log b)$ while $\mathrm{PRDim}(\mathsf{Point}_b) = O(1)$ (Beimel et al., 2010, 2013a) corresponds to the fact that the private-coin complexity of the equality function on $b$-bit inputs is $\Theta(\log b)$, while the public-coin complexity is $O(1)$. Here $\mathsf{Point}_b$ is the family of point functions over $\{0,1\}^b$, i.e. functions that are zero everywhere except on a single point.

#### Simpler learning algorithms.

Using our equivalence theorems, we can “import” results from communication complexity to give simple private PAC learners. For example, the well-known constant communication equality protocol using inner-product-based hashing can be converted to a probabilistic representation using Theorem 3.1, which can then be used to learn point functions. The resulting learning algorithm is somewhat simpler than the constant sample complexity learner for point functions described in (Beimel et al., 2010) and we believe that this view also provides useful intuition. We remark that the probabilistic representation for point functions that results from the communication protocol is known and was used for learning by Feldman (2009) in the context of evolvability. A closely related representation is also mentioned in (Beimel et al., 2013a).
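For illustration, the inner-product-based equality protocol and the hypotheses it induces can be sketched as follows. This is a minimal sketch under stated assumptions: inputs are bit tuples of length $b$, the shared randomness consists of $k$ uniformly random bit tuples, and all names are hypothetical rather than taken from the cited works.

```python
import random

def inner_product_bit(u, r):
    """Inner product of two equal-length bit tuples over GF(2)."""
    return sum(a & c for a, c in zip(u, r)) % 2

def shared_randomness(b, k, rng):
    """k public random strings of b bits each."""
    return [tuple(rng.randrange(2) for _ in range(b)) for _ in range(k)]

def alice_message(x, shared):
    """Alice's one-way message: k hash bits of her input x."""
    return tuple(inner_product_bit(x, r) for r in shared)

def bob_output(message, y, shared):
    """Bob accepts iff his own hash bits agree with Alice's message."""
    return 1 if message == alice_message(y, shared) else 0

def point_hypothesis(message, shared):
    """Hypothesis induced by a fixed message: y -> Bob's output on y."""
    return lambda y: bob_output(message, y, shared)
```

If $x = y$ Bob always accepts. If $x \neq y$, each random string agrees on the two inputs with probability $1/2$, so Bob wrongly accepts with probability $2^{-k}$. The message length $k$ does not depend on $b$, and each of the $2^k$ possible messages yields one candidate hypothesis.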

Furthermore, in some cases this connection can lead to efficient private agnostic learning algorithms. Namely, if there is a communication protocol for the evaluation problem for $C$ where Bob’s algorithm is polynomial-time, then one can run the exponential mechanism, in time exponential in the communication cost of the protocol, to differentially privately agnostically learn $C$.
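The selection step can be sketched as follows. This is a generic exponential mechanism over a finite list of hypotheses, scored by the number of misclassified examples; it is a sketch under the assumption that the hypotheses are given explicitly as Python callables (for instance, one per possible message of the protocol), and the function names are hypothetical.

```python
import math
import random

def exponential_mechanism_probs(error_counts, eps):
    """Selection probabilities proportional to exp(-eps * errors / 2).
    The error count changes by at most 1 when one example changes."""
    best = min(error_counts)
    # Subtracting the minimum only rescales the weights (numerical stability).
    weights = [math.exp(-eps * (e - best) / 2.0) for e in error_counts]
    total = sum(weights)
    return [w / total for w in weights]

def private_agnostic_select(hypotheses, sample, eps, rng=random):
    """Pick a hypothesis privately; sample is a list of (point, label) pairs."""
    error_counts = [sum(1 for (x, label) in sample if h(x) != label)
                    for h in hypotheses]
    probs = exponential_mechanism_probs(error_counts, eps)
    return rng.choices(hypotheses, weights=probs, k=1)[0]
```

The running time is dominated by evaluating every hypothesis on every example, so it is linear in the number of hypotheses, which is where the exponential dependence on the message length enters.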

## 4 Lower Bounds via Littlestone’s Dimension

In this section, we show that Littlestone’s dimension lower bounds the sample complexity of differentially private learning. Let $C$ be a concept class over $X$ with $\mathrm{LDim}(C) = d$. Our proof is based on a reduction of the Augmented Index problem on $d$ bits to the evaluation problem for $C$. Augmented Index, denoted $\mathsf{AugIndex}_d$, is the promise problem where Alice gets a string $x = (x_1, \ldots, x_d) \in \{0,1\}^d$ and Bob gets an index $i \in [d]$ and the prefix $x_{[i-1]} = (x_1, \ldots, x_{i-1})$, and where $\mathsf{AugIndex}_d(x, (i, x_{[i-1]})) = x_i$. A variant of this problem in which the length of the prefix is not necessarily $i - 1$ but some additional parameter was first explicitly defined by Bar-Yossef et al. (2004), who proved a lower bound on its randomized one-way communication complexity. The version defined above is from (Ba et al., 2010), where it is also shown that a lower bound for $\mathsf{AugIndex}_d$ follows from an earlier work of Miltersen et al. (1998). We use the following lower bound for $\mathsf{AugIndex}_d$.

###### Lemma 4.0.

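$R^{\rightarrow,\mathrm{pub}}_{\epsilon}(\mathsf{AugIndex}_d) \ge (1 - H(\epsilon)) \cdot d$, where $H(\epsilon) = \epsilon \log(1/\epsilon) + (1 - \epsilon) \log(1/(1 - \epsilon))$ is the binary entropy function.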

A proof of this lower bound can be easily derived by adapting the proof in (Bar-Yossef et al., 2004) and we include it in Section A.

We now show that if $\mathrm{LDim}(C) = d$ then one can reduce the Augmented Index problem on $d$-bit inputs to the evaluation problem for $C$.

###### Lemma 4.0.

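Let $C$ be a concept class over $X$ and $d = \mathrm{LDim}(C)$. For $x \in \{0,1\}^d$ and $i \in [d]$, let $x_{[i-1]} = (x_1, \ldots, x_{i-1})$ denote the prefix of $x$ of length $i - 1$. There exist two mappings $m_C : \{0,1\}^d \to C$ and $m_X : \bigcup_{i \in [d]} \{0,1\}^{i-1} \to X$ such that for every $x \in \{0,1\}^d$ and $i \in [d]$, the value of $m_C(x)$ on point $m_X(x_{[i-1]})$ is equal to $x_i$, that is, to the value of Augmented Index on the input $(x, (i, x_{[i-1]}))$.

###### Proof.

By the definition of $\mathrm{LDim}$, there exists a complete mistake tree over $X$ and $C$ of depth $d$. Recall that a mistake tree over $X$ and $C$ is a binary tree in which each internal node is labelled by a point in $X$ and each leaf is labelled by a concept in $C$. For $x \in \{0,1\}^d$, consider the path from the root of the tree such that at step $j$ we go to the left subtree if $x_j = 0$ and to the right subtree if $x_j = 1$. Such a path ends in a leaf, which we denote by $\ell_x$, and we denote the concept that labels it by $c_x$. For a prefix $x_{[i-1]}$, let $v_{x_{[i-1]}}$ denote the internal node at depth $i - 1$ on this path (with $v_{\emptyset}$ being the root) and let $z_{x_{[i-1]}}$ denote the point in $X$ which labels $v_{x_{[i-1]}}$.

We define the mapping $m_C$ as $m_C(x) = c_x$ for all $x \in \{0,1\}^d$ and the mapping $m_X$ as $m_X(y) = z_y$ for all $y \in \bigcup_{i \in [d]} \{0,1\}^{i-1}$. By the definition of a mistake tree over $X$ and $C$, the value of the concept $c_x$ on the point $z_{x_{[i-1]}}$ is determined by whether the leaf $\ell_x$ is in the right (1) or the left (0) subtree of the node $v_{x_{[i-1]}}$. Recall that the turns in the path from the root of the tree to $\ell_x$ are defined by the bits of $x$. At the node $v_{x_{[i-1]}}$, the bit $x_i$ determines whether $\ell_x$ will be in the right or the left subtree. Therefore $c_x(z_{x_{[i-1]}}) = x_i$, and hence the mappings we defined reduce the Augmented Index problem on $d$ bits to the evaluation problem for $C$.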
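The reduction in the proof can be made concrete for one simple class. The Python sketch below instantiates it for threshold functions over $\{0, \ldots, 2^d - 1\}$, assuming the convention that the threshold with parameter $a$ outputs 1 on $y$ exactly when $y \ge a$, and using the binary-search tree as the complete mistake tree of depth $d$. The function names are hypothetical and chosen only for this illustration.

```python
from itertools import product

def threshold(a):
    """Threshold concept with parameter a (assumed convention: 1 iff y >= a)."""
    return lambda y: 1 if y >= a else 0

def walk(bits, d):
    """Follow bits from the root of the binary-search mistake tree.
    Returns the interval [lo, hi) of threshold parameters in the reached subtree."""
    lo, hi = 0, 2 ** d
    for bit in bits:
        mid = (lo + hi) // 2
        if bit == 1:
            hi = mid   # right subtree: concepts that output 1 on the node's point
        else:
            lo = mid   # left subtree: concepts that output 0 on the node's point
    return lo, hi

def map_concept(x):
    """Alice's mapping: the concept labelling the leaf reached by all d bits of x."""
    lo, _ = walk(x, len(x))
    return threshold(lo)

def map_point(prefix, d):
    """Bob's mapping: the point labelling the node reached by the prefix."""
    lo, hi = walk(prefix, d)
    return (lo + hi) // 2 - 1

def reduction_is_correct(d):
    """Check that the leaf concept evaluates to x_i on the point of the
    depth-(i - 1) node, for every x and every i."""
    for x in product((0, 1), repeat=d):
        c = map_concept(x)
        for i in range(1, d + 1):
            if c(map_point(x[:i - 1], d)) != x[i - 1]:
                return False
    return True
```

Alice needs only her full string to compute her concept and Bob needs only the prefix to compute his point, so any one-way protocol for evaluating thresholds also solves Augmented Index on $d$ bits.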

An immediate corollary of the two lemmas above is the following lower bound.

###### Corollary 4.0.

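Let $C$ be a concept class over $X$ and $d = \mathrm{LDim}(C)$. Then $R^{\rightarrow,\mathrm{pub}}_{\epsilon}(C) \ge (1 - H(\epsilon)) \cdot d$.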

A stronger form of this lower bound was proved by Zhang (2011), who showed that the power of the Partition Tree lower bound technique of Nayak (1999) for one-way quantum communication complexity can be expressed in terms of Littlestone’s dimension of the concept class associated with the communication problem.

### 4.1 Applications

We can now use numerous known lower bounds for Littlestone’s dimension of $C$ to obtain lower bounds on the sample complexity of private PAC learning. Here we list several examples of known results where $\mathrm{LDim}(C)$ is (asymptotically) larger than the VC dimension of $C$.

1. (Littlestone, 1987). .

2. Let denote the class of all axis-parallel rectangles over , namely all concepts for defined as if and only if for all , . (Littlestone, 1987). .

3. Let denote class of all linear threshold functions over . . This lower bound is stated in (Maass and Turán, 1994a). We are not aware of a published proof and therefore a proof based on counting arguments in (Muroga, 1971) appears in Section B for completeness. .

4. Let denote class of all balls over , that is all functions obtained by restricting a Euclidean ball in to . Then