# Communication with Contextual Uncertainty

Badih Ghazi. Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge MA 02139. Supported in part by NSF STC Award CCF 0939370 and NSF Award CCF-1217423. Email: badih@mit.edu.

Ilan Komargodski. Weizmann Institute of Science, Israel. Email: ilan.komargodski@weizmann.ac.il. Work done while an intern at MSR New England. Supported in part by a grant from the I-CORE Program of the Planning and Budgeting Committee, the Israel Science Foundation, BSF and the Israeli Ministry of Science and Technology.

Pravesh Kothari. UT Austin, USA. Email: kothari@cs.utexas.edu. Work done while an intern at MSR New England.

Madhu Sudan. Microsoft Research, One Memorial Drive, Cambridge, MA 02142, USA. Email: madhu@mit.edu.

July 14, 2019
###### Abstract

We introduce a simple model illustrating the role of context in communication and the challenge posed by uncertainty of knowledge of context. We consider a variant of distributional communication complexity where Alice gets some information $x$ and Bob gets $y$, where $(x,y)$ is drawn from a known distribution, and Bob wishes to compute some function $g(x,y)$ (with high probability over $(x,y)$). In our variant, Alice does not know $g$, but only knows some function $f$ which is an approximation of $g$. Thus, the function being computed forms the context for the communication, and knowing it imperfectly models (mild) uncertainty in this context.

A naive solution would be for Alice and Bob to first agree on some common function $h$ that is close to both $f$ and $g$ and then use a protocol for $h$ to compute $h(x,y)$. We show that any such agreement leads to a large overhead in communication, ruling out such a universal solution.

In contrast, we show that if $g$ has a one-way communication protocol with complexity $k$ in the standard setting, then it has a communication protocol with complexity $O(k\cdot(1+I))$ in the uncertain setting, where $I$ denotes the mutual information between $x$ and $y$. In the particular case where the input distribution is a product distribution, the protocol in the uncertain setting only incurs a constant factor blow-up in communication and error.

Furthermore, we show that the dependence on the mutual information $I$ is required. Namely, we construct a class of functions along with a non-product distribution over $(x,y)$ for which the communication complexity is a single bit in the standard setting but at least $\Omega(\sqrt{n})$ bits in the uncertain setting.

## 1 Introduction

Most forms of communication involve communicating players that share a large common context and use this context to compress communication. In natural settings, the context may include understanding of language, and knowledge of the environment and laws. In designed (computer-to-computer) settings, the context includes knowledge of the operating system, communication protocols, and encoding/decoding mechanisms. Remarkably, especially in the natural setting, context can seemingly be used to compress communication, even when it is not shared perfectly. This ability to communicate despite a major source of uncertainty has led to a series of works attempting to model various forms of communication amid uncertainty, starting with Goldreich, Juba and Sudan [JS08, GJS12] followed by [JKKS11, JS11, JW13, HS14, CGMS15]. This current work introduces a new theme to this series of works by introducing a functional notion of uncertainty and studying this model. We start by describing our model and results below and then contrast our model with some of the previous works.

#### Model.

Our model builds upon the classical setup of communication complexity due to Yao [Yao79], and we develop it here. The classical model considers two interacting players Alice and Bob each possessing some private information $x$ and $y$, with $x$ known only to Alice and $y$ to Bob. They wish to compute some joint function $g(x,y)$ and would like to do so while exchanging the minimum possible number of bits. In this work, we suggest that the function $g$ is the context of the communication and consider a setting where it is shared imperfectly. Specifically, we say that Bob knows the function $g$ and Alice knows some approximation $f$ to $g$ (with $f$ not being known to Bob). This leads to the question: when can Alice and Bob interact to compute $g(x,y)$ with limited communication?

It is clear that if $x\in\{0,1\}^n$, then $n$ bits of communication suffice: Alice can simply ignore $f$ and send $x$ to Bob. We wish to consider settings that improve on this. To do so correctly on every input, a necessary condition is that $g$ must have low communication complexity in the standard model. However, this necessary condition does not seem to be sufficient, since Alice only has an approximation $f$ to $g$. Thus, we settle for a weaker goal: determining $g$ correctly only on most inputs. This puts us in a distributional communication complexity setting. A necessary condition now is that $g$ must have a low-error low-communication protocol in the standard setting. The question is then: can $g$ be computed with low error and low communication when Alice only knows an approximation $f$ to $g$ (with $f$ being unknown to Bob)?

More precisely, in this setting, the input to Alice is a pair $(f,x)$ and the input to Bob is a pair $(g,y)$. The functions $(f,g)$ are adversarially chosen subject to the restrictions that they are close to each other (under some distribution $\mu$ on the inputs) and that $g$ (and hence $f$) has a low-error low-communication protocol. The pair $(x,y)$ is drawn from the distribution $\mu$ (independent of the choice of $f$ and $g$). The players both know $\mu$ in addition to their respective inputs.

#### Results.

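In order to describe our results, we first introduce some notation. Let $\delta_{\mu}(f,g)$ denote the (weighted and normalized) Hamming distance between $f$ and $g$ with respect to the distribution $\mu$. Let $\mathsf{CC}^{\mu}_{\epsilon}(f)$ denote the minimum communication complexity of a protocol computing $f$ correctly on all but an $\epsilon$ fraction of the inputs. Let $\mathsf{owCC}^{\mu}_{\epsilon}(f)$ denote the corresponding one-way communication complexity of $f$. Given a family $\mathcal{F}$ of pairs of functions $(f,g)$, we denote the uncertain complexity $\mathsf{CCU}^{\mu}_{\epsilon}(\mathcal{F})$ to be the minimum over all public-coin protocols $\Pi$ of the maximum over $(f,g)\in\mathcal{F}$, $(x,y)$ in the support of $\mu$ and settings of public coins, of the communication cost of $\Pi$, subject to the condition that for every $(f,g)\in\mathcal{F}$, $\Pi$ outputs $g(x,y)$ with probability $1-\epsilon$ over the choice of $(x,y)$ and the shared randomness. That is,

$$\mathsf{CCU}^{\mu}_{\epsilon}(\mathcal{F}) \triangleq \min_{\{\Pi \,\mid\, \forall (f,g)\in\mathcal{F}:\ \delta_{\mu}(\Pi,g)\le\epsilon\}} \ \max_{\{(f,g)\in\mathcal{F},\ (x,y)\in\mathrm{supp}(\mu),\ \text{public coins}\}} \big\{\text{Comm. cost of } \Pi((f,x),(g,y))\big\}.$$

Similarly, let $\mathsf{owCCU}^{\mu}_{\epsilon}(\mathcal{F})$ denote the one-way uncertain communication complexity of $\mathcal{F}$.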

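Our first result (Theorem 1.1) shows that if $\mu$ is a distribution on which $f$ and $g$ are close and each has a one-way protocol with communication $k$ bits in the standard model, then the pair $(f,g)$ has one-way uncertain communication complexity of at most $O(k\cdot(1+I))$ bits with $I$ being the mutual information of $(X,Y)\sim\mu$. (Footnote 1: Given a distribution $\mu$ over a pair $(X,Y)$ of random variables with marginals $\mu_X$ and $\mu_Y$ over $X$ and $Y$ respectively, the mutual information of $X$ and $Y$ is defined as $I(X;Y)\triangleq D(\mu\,\|\,\mu_X\mu_Y)$.) More precisely, let $\mathsf{ow}\mathcal{F}^{\mu}_{k,\epsilon,\delta}$ denote the family of all pairs of functions $(f,g)$ with $\mathsf{owCC}^{\mu}_{\epsilon}(f),\mathsf{owCC}^{\mu}_{\epsilon}(g)\le k$ and $\delta_{\mu}(f,g)\le\delta$. We prove the following theorem.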

###### Theorem 1.1.

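There exists an absolute constant $c$ such that for every pair of finite sets $\mathcal{X}$ and $\mathcal{Y}$, every distribution $\mu$ over $\mathcal{X}\times\mathcal{Y}$ and every $\theta>0$, it holds that

$$\mathsf{owCCU}^{\mu}_{\epsilon+2\delta+\theta}\big(\mathsf{ow}\mathcal{F}^{\mu}_{k,\epsilon,\delta}\big)\le\frac{c\,(k+\log(1/\theta))}{\theta^{2}}\cdot\left(1+\frac{I(X;Y)}{\theta^{2}}\right). \tag{1}$$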

In the special case where $\mu$ is a product distribution, we have $I(X;Y)=0$ and we obtain the following particularly interesting corollary of Theorem 1.1.

###### Corollary 1.2.

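There exists an absolute constant $c$ such that for every pair of finite sets $\mathcal{X}$ and $\mathcal{Y}$, every product distribution $\mu$ over $\mathcal{X}\times\mathcal{Y}$ and every $\theta>0$, it holds that

$$\mathsf{owCCU}^{\mu}_{\epsilon+2\delta+\theta}\big(\mathsf{ow}\mathcal{F}^{\mu}_{k,\epsilon,\delta}\big)\le\frac{c\,(k+\log(1/\theta))}{\theta^{2}}.$$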

In words, Corollary 1.2 says that for product distributions and for constant error probabilities, communication in the uncertain model is only a constant factor larger than in the standard model.

Our result is significant in that it achieves (moderately) reliable communication despite uncertainty about the context, even when the uncertainty itself is hard to resolve. To elaborate on this statement, note that one hope for achieving a low-communication protocol for $g$ would be for Alice and Bob to first agree on some function $q$ that is close to $f$ and $g$, and then apply some low-communication protocol for this common function $q$. This would be the “resolve the uncertainty first” approach. We prove (Theorem 3.2) that resolving the uncertainty can be very expensive (much more so than even the trivial protocol of sending $x$) and hence, this would not be a way to prove Theorem 1.1. Instead, we show a path around the inherent uncertainty to computing the desired function, and this leads to a proof of Theorem 1.1. To handle non-product distributions in Theorem 1.1, we in particular use a one-way distributional variant of the correlated sampling protocol of Braverman and Rao [BR11]. For a high-level overview of the proof of Theorem 1.1, we refer the reader to Section 4.1.

We now describe our lower bound. Given the upper bound in Theorem 1.1, a natural question is whether the dependence on $I(X;Y)$ in the right-hand side of Equation 1 is actually needed. In other words, is it also the case that for non-product distributions, contextual uncertainty can only cause a constant-factor blow-up in communication (for constant error probabilities)? Perhaps surprisingly, the answer to this question turns out to be negative. Namely, we show that a dependence of the communication in the uncertain setting on $I(X;Y)$ is required.

###### Theorem 1.3.

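There exist a distribution $\mu$ and a function class $\mathcal{F}\subseteq\mathsf{ow}\mathcal{F}^{\mu}_{1,0,\delta}$ such that for every $\epsilon>0$,

$$\mathsf{CCU}^{\mu}_{\frac{1}{2}-\epsilon}(\mathcal{F})\ge\Omega\big(\sqrt{\delta n}\big)-\log(1/\epsilon).$$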

In particular, if $\delta$ is any small constant, then Theorem 1.3 asserts the existence of a distribution and a class of distance-$\delta$ functions for which the zero-error (one-way) communication complexity in the standard model is a single bit, but under contextual uncertainty, any two-way protocol (with an arbitrary number of rounds of interaction) having a noticeable advantage over random guessing requires $\Omega(\sqrt{n})$ bits of communication! We note that the distribution $\mu$ in Theorem 1.3 has mutual information $\approx n$, so Theorem 1.3 rules out improving the dependence on the mutual information in Equation 1 to anything smaller than $\sqrt{I(X;Y)}$. It is an interesting open question to determine the correct exponent of $I(X;Y)$ in Equation 1.

In order to prove Theorem 1.3, the function class $\mathcal{F}$ will essentially consist of the set of all close-by pairs of parity functions and the distribution $\mu$ will correspond to the noisy Boolean hypercube. We are then able to reduce the problem of computing $\mathcal{F}$ under $\mu$ with contextual uncertainty to the problem of computing a related function in the standard distributional communication complexity model (i.e., without uncertainty) under a related distribution. We then use the discrepancy method to prove a lower bound on the communication complexity of the new problem. This task itself reduces to upper bounding the spectral norm of a certain communication matrix. The choice of our underlying distribution then implies a tensor structure for this matrix, which reduces the spectral norm computation to bounding the largest singular value of an explicit family of matrices. For more details about the proof of Theorem 1.3, we refer the reader to Section 5.

#### Contrast with prior work.

The first works to consider communication with uncertainty in a manner similar to this work were those of [JS08, GJS12]. Their goal was to model an extreme form of uncertainty, where Alice and Bob do not have any prior (known) commonality in context and indeed both come with their own “protocol” which tells them how to communicate. So communication is needed even to resolve this uncertainty. While their setting is thus very broad, the solutions they propose are much slower and typically involve resolving the uncertainty as a first step.

The later works [JKKS11, HS14, CGMS15] tried to restrict the forms of uncertainty to see when it could lead to more efficient communication solutions. For instance, Juba et al. [JKKS11] consider the compression problem when Alice and Bob do not completely agree on the prior. This introduces some uncertainty in the beliefs, and they provide fairly efficient solutions by restricting the uncertainty to a manageable form. Canonne et al. [CGMS15] were the first to connect this stream of work to communication complexity, which seems to be the right umbrella to study the broader communication problems. The imperfectness they study is however restricted to the randomness shared by the communicating parties, and does not incorporate any other elements. They suggest studying imperfect understanding of the function being computed as a general direction, though they do not suggest specific definitions, which we in particular do in this work.

#### Organization

In Section 2, we carefully develop the uncertain communication complexity model after recalling the standard distributional communication complexity model. In Section 3, we prove the hardness of contextual agreement. In Section 4, we prove our main upper bound (Theorem 1.1). In Section 5, we prove our main lower bound (Theorem 1.3). For a discussion of some intriguing future directions that arise from this work, we refer the reader to the conclusion in Section 6.

## 2 The Uncertain Communication Complexity Model

We start by recalling the classical communication complexity model of Yao [Yao79] and then present our definition and measures.

### 2.1 Communication Complexity

We start with some basic notation. For an integer $n$, we denote by $[n]$ the set $\{1,\dots,n\}$. We use $\log x$ to denote a logarithm in base $2$. For two sets $A$ and $B$, we denote by $A\triangle B$ their symmetric difference. For a distribution $\mu$, we denote by $x\sim\mu$ the process of sampling a value $x$ from the distribution $\mu$. Similarly, for a set $\mathcal{X}$ we denote by $x\sim\mathcal{X}$ the process of sampling a value $x$ from the uniform distribution over $\mathcal{X}$. For any event $E$, let $\mathbb{1}(E)$ be the $0$-$1$ indicator of $E$. For a probability distribution $\mu$ over $\mathcal{X}\times\mathcal{Y}$, we denote by $\mu_X$ the marginal of $\mu$ over $\mathcal{X}$. By $\mu_{Y|x}$, we denote the conditional distribution of $\mu$ over $\mathcal{Y}$ conditioned on $X=x$.

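Given a distribution $\mu$ supported on $\mathcal{X}$ and functions $f,g:\mathcal{X}\to\Sigma$, we let $\delta_{\mu}(f,g)$ denote the (weighted and normalized) Hamming distance between $f$ and $g$, i.e., $\delta_{\mu}(f,g)\triangleq\Pr_{x\sim\mu}[f(x)\neq g(x)]$. (Note that this definition extends naturally to probabilistic functions $f$ and $g$, by letting $f(x)$ and $g(x)$ be sampled independently.)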
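
As a small illustration of the distance measure just defined, the following sketch computes the weighted Hamming distance for a finite distribution. The function name `hamming_distance`, the dictionary representation of the distribution, and the two toy parity functions are assumptions made for this example only; they do not come from the paper.

```python
from typing import Callable, Dict, Hashable


def hamming_distance(mu: Dict[Hashable, float],
                     f: Callable[[Hashable], int],
                     g: Callable[[Hashable], int]) -> float:
    """Return Pr_{x ~ mu}[f(x) != g(x)] for a finite distribution {point: probability}."""
    if abs(sum(mu.values()) - 1.0) > 1e-9:
        raise ValueError("mu must sum to 1")
    # Add up the probability mass of the points on which f and g disagree.
    return sum(p for x, p in mu.items() if f(x) != g(x))


# Toy example (hypothetical): uniform distribution over 3-bit strings.
points = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
mu = {x: 1 / len(points) for x in points}


def parity_all(x):
    return (x[0] + x[1] + x[2]) % 2


def parity_two(x):
    return (x[0] + x[1]) % 2


# The two parities disagree exactly when the last bit is 1.
dist = hamming_distance(mu, parity_all, parity_two)
```

Under the uniform distribution the two parities disagree on half of the points, so they are far apart in this measure even though their descriptions differ in a single coordinate.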

We now turn to the definition of communication complexity. A more thorough introduction can be found in [KN97]. Let $f:\mathcal{X}\times\mathcal{Y}\to\{0,1\}$ be a function and Alice and Bob be two parties. A protocol $\Pi$ between Alice and Bob specifies how and what Alice and Bob communicate given their respective inputs and communication thus far. It also specifies when they stop and produce an output (that we require to be produced by Bob). A protocol is said to be one-way if it involves a single message from Alice to Bob, followed by Bob producing the output. The protocol $\Pi$ is said to compute $f$ if for every $(x,y)\in\mathcal{X}\times\mathcal{Y}$ it holds that $\Pi(x,y)=f(x,y)$. The communication complexity of $\Pi$ is the number of bits transmitted during the execution of the protocol between Alice and Bob. The communication complexity of $f$ is the minimal communication complexity of a protocol computing $f$.

It is standard to relax the above setting by introducing a distribution $\mu$ over the input space $\mathcal{X}\times\mathcal{Y}$ and requiring the protocol to succeed with high probability (rather than with probability 1). We say that a protocol $\Pi$ $\epsilon$-computes a function $f$ under distribution $\mu$ if $\delta_{\mu}(\Pi,f)\le\epsilon$.

###### Definition 2.1 (Distributional Communication Complexity).

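Let $f:\mathcal{X}\times\mathcal{Y}\to\{0,1\}$ be a Boolean function and $\mu$ be a probability distribution over $\mathcal{X}\times\mathcal{Y}$. The distributional communication complexity of $f$ under $\mu$ with error $\epsilon$, denoted by $\mathsf{CC}^{\mu}_{\epsilon}(f)$, is defined as the minimum over all protocols $\Pi$ that $\epsilon$-compute $f$ over $\mu$, of the communication complexity of $\Pi$. The one-way communication complexity $\mathsf{owCC}^{\mu}_{\epsilon}(f)$ is defined similarly by minimizing over one-way protocols $\Pi$.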

We note that it is also standard to provide Alice and Bob with a shared random string $R$ which is independent of $x$, $y$ and $f$. In the distributional communication complexity model, it is a known fact that any protocol with shared randomness can be used to get a protocol that does not use shared randomness without increasing its distributional communication complexity.

In this paper, unless stated otherwise, whenever we refer to a protocol, we think of the input pair as coming from a distribution.

### 2.2 Uncertain Communication Complexity

We now turn to the central definition of this paper, namely uncertain communication complexity. Our goal is to understand how Alice and Bob can communicate when the function that Bob wishes to determine is not known to Alice. In this setting, we make the functions $g$ (that Bob wants to compute) and $f$ (Alice’s estimate of $g$) explicitly part of the input to the protocol $\Pi$. Thus, in this setting a protocol $\Pi$ specifies how Alice with input $(f,x)$ and Bob with input $(g,y)$ communicate, and how they stop and produce an output. We say that $\Pi$ computes $(f,g)$ if for every $(x,y)\in\mathcal{X}\times\mathcal{Y}$, the protocol outputs $g(x,y)$. We say that a (public-coin) protocol $\Pi$ $\epsilon$-computes $(f,g)$ over $\mu$ if $\delta_{\mu}(g,\Pi)\le\epsilon$.

Next, one may be tempted to define the communication complexity of a pair of functions $(f,g)$ as the minimum over all protocols that compute $(f,g)$ of their maximum communication. But this does not capture the uncertainty! (Rather, a protocol that works for the pair corresponds to both Alice and Bob knowing both $f$ and $g$.) To model uncertainty, we have to consider the communication complexity of a whole class of pairs of functions, from which the pair $(f,g)$ is chosen (in our case by an adversary).

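Let $\mathcal{F}\subseteq\{f:\mathcal{X}\times\mathcal{Y}\to\{0,1\}\}^{2}$ be a family of pairs of Boolean functions with domain $\mathcal{X}\times\mathcal{Y}$. We say that a public-coin protocol $\Pi$ $\epsilon$-computes $\mathcal{F}$ over $\mu$ if for every $(f,g)\in\mathcal{F}$, we have that $\Pi$ $\epsilon$-computes $(f,g)$ over $\mu$. We are now ready to present our main definition.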

###### Definition 2.2 (Contextually Uncertain Communication Complexity).

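Let $\mu$ be a distribution on $\mathcal{X}\times\mathcal{Y}$ and $\mathcal{F}\subseteq\{f:\mathcal{X}\times\mathcal{Y}\to\{0,1\}\}^{2}$. The communication complexity of $\mathcal{F}$ under contextual uncertainty, denoted $\mathsf{CCU}^{\mu}_{\epsilon}(\mathcal{F})$, is the minimum over all public-coin protocols $\Pi$ that $\epsilon$-compute $\mathcal{F}$ over $\mu$, of the maximum communication complexity of $\Pi$ over $(f,g)\in\mathcal{F}$, $(x,y)$ from the support of $\mu$ and settings of the public coins.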

As usual, the one-way contextually uncertain communication complexity $\mathsf{owCCU}^{\mu}_{\epsilon}(\mathcal{F})$ is defined similarly.

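We remark that while in the standard distributional model of Subsection 2.1, shared randomness can be assumed without loss of generality, this is not necessarily the case in Definition 2.2. This is because in principle, shared randomness can help fool the adversary who is selecting the pair $(f,g)\in\mathcal{F}$. Also, observe that in the special case where $\mathcal{F}=\{(f,g)\}$, Definition 2.2 boils down to the standard definition of distributional communication complexity (i.e., Definition 2.1) for the function $g$, and we thus have $\mathsf{CCU}^{\mu}_{\epsilon}(\{(f,g)\})=\mathsf{CC}^{\mu}_{\epsilon}(g)$. Furthermore, the uncertain communication complexity is monotone, i.e., if $\mathcal{F}\subseteq\mathcal{F}'$ then $\mathsf{CCU}^{\mu}_{\epsilon}(\mathcal{F})\le\mathsf{CCU}^{\mu}_{\epsilon}(\mathcal{F}')$. Hence, we conclude that $\mathsf{CCU}^{\mu}_{\epsilon}(\mathcal{F})\ge\max_{(f,g)\in\mathcal{F}}\mathsf{CC}^{\mu}_{\epsilon}(g)$.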

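In this work, we attempt to identify a setting under which the above lower bound can be matched. If the set of functions $\Gamma(g)\triangleq\{f\mid(f,g)\in\mathcal{F}\}$ is not sufficiently informative about $g$, then it seems hard to conceive of settings where Alice can do non-trivially well. We thus pick a simple and natural restriction on $\Gamma(g)$, namely, that it contains functions that are close to $g$ (in $\delta_{\mu}$-distance). This leads us to our main target classes. For parameters $k,\epsilon,\delta>0$, define the sets of pairs of functions

$$\mathcal{F}_{k,\epsilon,\delta}\triangleq\big\{(f,g)\;\big|\;\delta_{\mu}(f,g)\le\delta\ \text{ and }\ \mathsf{CC}^{\mu}_{\epsilon}(f),\mathsf{CC}^{\mu}_{\epsilon}(g)\le k\big\}$$

and

$$\mathsf{ow}\mathcal{F}_{k,\epsilon,\delta}\triangleq\big\{(f,g)\;\big|\;\delta_{\mu}(f,g)\le\delta\ \text{ and }\ \mathsf{owCC}^{\mu}_{\epsilon}(f),\mathsf{owCC}^{\mu}_{\epsilon}(g)\le k\big\}.$$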

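In words, $\mathcal{F}_{k,\epsilon,\delta}$ (resp. $\mathsf{ow}\mathcal{F}_{k,\epsilon,\delta}$) considers all possible functions $g$ with communication complexity (resp. one-way communication complexity) at most $k$, with Alice's function $f$ ranging over all possible functions within distance $\delta$ of Bob's function $g$. (Footnote 2: For the sake of symmetry, we insist that $\mathsf{CC}^{\mu}_{\epsilon}(f)\le k$ (resp. $\mathsf{owCC}^{\mu}_{\epsilon}(f)\le k$). We need not have insisted on it but since the other conditions anyhow imply that $\mathsf{CC}^{\mu}_{\epsilon+\delta}(f)\le k$ (resp. $\mathsf{owCC}^{\mu}_{\epsilon+\delta}(f)\le k$), we decided to include this stronger condition for aesthetic reasons.)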

It is clear that $\mathsf{owCCU}^{\mu}_{\epsilon}(\mathsf{ow}\mathcal{F}_{k,\epsilon,\delta})\ge k$. Our first main result, Theorem 1.1, gives an upper bound on this quantity, which in the particular case of product distributions is comparable to $k$ (up to a constant factor increase in the error and communication complexity). In Theorem 3.2 we show that a naive strategy that attempts to reduce the uncertain communication problem to a “function agreement problem” (where Alice and Bob agree on a function $q$ that is close to $f$ and $g$ and then use a protocol for $q$) cannot work. Furthermore, our second main result, Theorem 1.3, shows that for general non-product distributions, $\mathsf{CCU}^{\mu}_{\epsilon}(\mathcal{F}_{k,\epsilon,\delta})$ can be much larger than $k$. More precisely, we construct a function class along with a distribution for which the one-way communication complexity in the standard model is a single bit whereas, under contextual uncertainty, the two-way communication complexity is at least $\Omega(\sqrt{n})$!

## 3 Hardness of Contextual Agreement

In this section, we show that even if both $f$ and $g$ have small one-way distributional communication complexity on some distribution $\mu$, agreeing on a $q$ such that $\delta_{\mu}(q,f)$ is small takes communication that is roughly the size of the bit representation of $f$ (which is exponential in the size of the input). Thus, agreeing on $q$ before simulating a protocol for $q$ is exponentially costlier than even the trivial protocol where Alice sends her input $x$ to Bob. Formally, we consider the following communication problem:

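###### Definition 3.1 ($\textsc{Agree}_{\delta,\gamma}(\mathcal{F})$).

For a family of pairs of functions $\mathcal{F}\subseteq\{f:\mathcal{X}\times\mathcal{Y}\to\{0,1\}\}^{2}$, the $\mathcal{F}$-agreement problem with parameters $\delta,\gamma\ge 0$ is the communication problem where Alice gets $f$ and Bob gets $g$ such that $(f,g)\in\mathcal{F}$ and their goal is for Alice to output $q_A$ and Bob to output $q_B$ such that $\delta(q_A,f),\delta(q_B,g)\le\delta$ and $\Pr[q_A=q_B]\ge\gamma$.

Somewhat abusing notation, we will use $\textsc{Agree}_{\delta,\gamma}(\mathcal{D})$ to denote the distributional problem where $\mathcal{D}$ is a distribution on $\{f:\mathcal{X}\times\mathcal{Y}\to\{0,1\}\}^{2}$ and the goal now is to get agreement with probability $\gamma$ over the randomness of the protocol and the input.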

If the agreement problem could be solved with low communication for the family as defined at the end of Section 2, then this would turn into a natural protocol for for some positive and as well. Our following theorem proves that agreement is a huge overkill.

###### Theorem 3.2.

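For every $\delta,\delta_2>0$, there exists $\alpha>0$ and a family $\mathcal{F}\subseteq\mathcal{F}_{0,0,\delta}$ such that for every $\gamma>0$, it holds that $\mathsf{CC}(\textsc{Agree}_{\delta_2,\gamma}(\mathcal{F}))\ge\alpha|\mathcal{Y}|-\log(1/\gamma)$.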

In words, Theorem 3.2 says that there is a family of pairs of functions supported on functions of zero communication complexity (with zero error) for which agreement takes communication polynomial in the size of the domain of the functions. Note that this is exponentially larger than the trivial communication complexity for any function $g$, which is at most $\log|\mathcal{X}|$.

We stress that while an agreement lower bound for zero-communication functions may feel like a lower bound for a toy problem, a lower bound for this setting is inherent in any separation between agreement complexity for $\mathcal{F}$ and communication complexity with uncertainty for $\mathcal{F}$. To see this, note that given any input to the agreement problem, Alice and Bob can execute any protocol for the uncertain communication problem, pinning down the value of the function to be computed with high probability and low communication. If one considers the remaining challenge to agreement, it comes from a zero-communication problem.

Our proof of Theorem 3.2 uses a lower bound on the communication complexity of the agreement distillation problem (with imperfectly shared randomness) defined in [CGMS15], which in turn relies on a lower bound for randomness extraction from correlated sources due to Bogdanov and Mossel [BM11].

We describe their problem below and the result that we use. We note that their context is slightly different and our description below is a reformulation. First, we define the notion of $\rho$-perturbed sequences of bits. A pair of bits $(a,b)$ is said to be a pair of $\rho$-perturbed uniform bits if $a$ is uniform over $\{0,1\}$, and $b=a$ with probability $1-\rho$ and $b=1-a$ with probability $\rho$. A pair of sequences of bits $(r,s)$ is said to be $\rho$-perturbed if $r=(r_1,\dots,r_n)$ and $s=(s_1,\dots,s_n)$ and each coordinate pair $(r_i,s_i)$ is a $\rho$-perturbed uniform pair drawn independently of all other pairs. For a random variable $W$, we define its min-entropy as $H_{\infty}(W)\triangleq\min_{w\in\mathrm{supp}(W)}\{-\log\Pr[W=w]\}$.
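
The two notions in the preceding paragraph can be made concrete with a short sketch. It assumes the standard definitions (each coordinate of the second string is an independent flip of the first with some fixed probability, and min-entropy is the negative base-2 logarithm of the largest point probability). The names `perturbed_pair` and `min_entropy` are hypothetical helpers introduced only for this illustration.

```python
import math
import random


def perturbed_pair(n, rho, rng):
    """Sample (r, s): r is uniform over {0,1}^n, and each s[i] equals r[i]
    except that it is flipped independently with probability rho."""
    r = [rng.randrange(2) for _ in range(n)]
    s = [bit ^ int(rng.random() < rho) for bit in r]
    return r, s


def min_entropy(dist):
    """Min-entropy of a finite distribution {outcome: probability}:
    the minimum over the support of -log2 Pr[W = w]."""
    return -math.log2(max(p for p in dist.values() if p > 0))


rng = random.Random(0)
r, s = perturbed_pair(1000, 0.1, rng)
# Empirical fraction of flipped coordinates; it concentrates around rho.
flip_fraction = sum(a != b for a, b in zip(r, s)) / len(r)

# A uniform 3-bit string has min-entropy exactly 3 bits.
uniform_3_bits = {w: 1 / 8 for w in range(8)}
entropy_uniform = min_entropy(uniform_3_bits)
```

Min-entropy is governed by the single most likely outcome, which is why a protocol whose output is concentrated on a few values cannot satisfy a high min-entropy requirement.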

###### Definition 3.3 ($\textsc{Agreement-Distillation}^{k}_{\gamma,\rho}$).

In this problem, Alice and Bob get as inputs $r$ and $s$, where $(r,s)$ form a $\rho$-perturbed sequence of bits. Their goal is to communicate deterministically and produce as outputs $w_A$ (Alice’s output) and $w_B$ (Bob’s output) with the following properties: (i) $H_{\infty}(w_A),H_{\infty}(w_B)\ge k$ and (ii) $\Pr_{(r,s)}[w_A=w_B]\ge\gamma$.

###### Lemma 3.4 ([CGMS15, Theorem 2]).

For every $\rho>0$, there exists $\epsilon>0$ such that for every $k$ and $\gamma$, it holds that every deterministic protocol $\Pi$ that computes $\textsc{Agreement-Distillation}^{k}_{\gamma,\rho}$ has communication complexity at least $\epsilon k-\log(1/\gamma)$.

We note that while the agreement distillation problem is very similar to our agreement problem, there are some syntactic differences. We are considering pairs of functions with low communication complexity, whereas the agreement-distillation problem considers arbitrary random sequences. Also, our output criterion is proximity to the input functions, whereas in the agreement-distillation problem, we need to produce high-entropy outputs. Finally, we want a lower bound for our agreement problem when Alice and Bob are allowed to share perfect randomness while the agreement-distillation bound only holds for deterministic protocols. Nevertheless, we are able to reduce to their setting quite easily as we will see shortly.

Our proof of Theorem 3.2 uses the standard Chernoff-Hoeffding tail inequality on random variables that we include below. Denote $\exp(x)\triangleq e^{x}$, where $e$ is the base of the natural logarithm.

###### Proposition 3.5 (Chernoff bound).

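Let $X=\sum_{i=1}^{n}X_i$ be a sum of identically distributed independent random variables $X_1,\dots,X_n\in\{0,1\}$. Let $\mu=\mathbb{E}[X]=\sum_{i=1}^{n}\mathbb{E}[X_i]$. It holds that for $\delta\in(0,1)$,

$$\Pr[X<(1-\delta)\mu]\le\exp(-\delta^{2}\mu/2)$$

and

$$\Pr[X>(1+\delta)\mu]\le\exp(-\delta^{2}\mu/3),$$

and for $a>0$,

$$\Pr[X>\mu+a]\le\exp(-2a^{2}/n).$$
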
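
As a numerical sanity check of the two upper-tail bounds above, the sketch below compares them against the exact binomial tail. It assumes the summands are independent Bernoulli variables with a common success probability; the helper `binomial_upper_tail` and the specific parameters are illustrative choices, not part of the paper.

```python
import math


def binomial_upper_tail(n, p, threshold):
    """Exact Pr[X > threshold] for X ~ Binomial(n, p)."""
    first = math.floor(threshold) + 1
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(first, n + 1))


n, p = 200, 0.1          # hypothetical parameters
mu = n * p               # expectation of the sum
delta, a = 0.5, 10       # relative and absolute deviations

# Multiplicative bound: Pr[X > (1 + delta) * mu] <= exp(-delta^2 * mu / 3)
exact_mult = binomial_upper_tail(n, p, (1 + delta) * mu)
bound_mult = math.exp(-delta**2 * mu / 3)

# Additive bound: Pr[X > mu + a] <= exp(-2 * a^2 / n)
exact_add = binomial_upper_tail(n, p, mu + a)
bound_add = math.exp(-2 * a**2 / n)
```

For these parameters both thresholds coincide, and the exact tail sits well below either bound, which is typical: the bounds are convenient rather than tight.
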
###### Proof of Theorem 3.2.

We prove the theorem for , in which case we may assume since otherwise the right-hand side is non-positive.

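Let $\mathcal{F}_B$ denote the set of functions that depend only on Bob’s inputs, i.e., $f\in\mathcal{F}_B$ if there exists $f':\mathcal{Y}\to\{0,1\}$ such that $f(x,y)=f'(y)$ for all $x,y$. Our family $\mathcal{F}$ will be a subset of $\mathcal{F}_B\times\mathcal{F}_B$, the subset that contains functions that are at most $\delta$ apart.

$$\mathcal{F}\triangleq\{(f,g)\in\mathcal{F}_B\times\mathcal{F}_B \mid \delta(f,g)\le\delta\}.$$

It is clear that the communication complexity of every function in the support of $\mathcal{F}$ is zero, with zero error (Bob can compute it on his own) and so $\mathcal{F}\subseteq\mathcal{F}_{0,0,\delta}$. So it remains to prove a lower bound on $\mathsf{CC}(\textsc{Agree}_{\delta_2,\gamma}(\mathcal{F}))$.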

We prove our lower bound by picking a distribution supported mostly on and by giving a lower bound on . Let . The distribution is a simple one. It samples as follows. The function is drawn uniformly at random from . Then, is chosen to be a “-perturbation” of , namely for every , is chosen to be equal to with probability and with probability . For every , we now set .

By the Chernoff bound (see Proposition 3.5), we have that . So with overwhelmingly high probability, draws elements from . In particular, if some protocol solves , then it would also solve .

We thus need to show a lower bound on the communication complexity of . We now note that since this is a distributional problem, by Yao’s min-max principle, if there is randomized protocol to solve with bits of communication, then there is also a deterministic protocol for the same problem and with the same complexity. Thus, it suffices to lower bound the deterministic communication complexity of . Claim 3.6 shows that any such protocol gives a deterministic protocol for Agreement-Distillation with . Combining this with Lemma 3.4 gives us the desired lower bound on and hence on . ∎

###### Claim 3.6.

Every protocol for is also a protocol for for , where is the binary entropy function given by .

###### Proof.

Suppose Alice and Bob are trying to solve . They can sample -pertubed strings and interpret them as functions or equivalently as functions . They can now simulate the protocol for and output and . By definition of Agree, we have with probability at least . So it suffices to show that . But this is obvious since any function is output only if and we have that . Since the probability of sampling for any is at most , we have that the probability of outputting for any is at most . In other words, . Similarly, we can lower bound and thus we have that the outputs of the protocol for Agree solve Agreement-Distillation with . ∎

## 4 One-way Communication with Contextual Uncertainty

In this section, we prove Theorem 1.1. We start with a high-level description of the protocol.

### 4.1 Overview of Protocol

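Let $\mu$ be a distribution over an input space $\mathcal{X}\times\mathcal{Y}$. For any function $f:\mathcal{X}\times\mathcal{Y}\to\{0,1\}$ and any $x\in\mathcal{X}$, we define the restriction of $f$ to $x$ to be the function $f_x:\mathcal{Y}\to\{0,1\}$ given by $f_x(y)=f(x,y)$ for any $y\in\mathcal{Y}$.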

We now give a high-level overview of the protocol. First, we consider the particular case of Theorem 1.1 where $\mu$ is a product distribution, i.e., $\mu=\mu_X\times\mu_Y$. Note that in this case, $I(X;Y)=0$ in the right-hand side of Equation 1. We will handle the case of general (not necessarily product) distributions later on.

The general idea is that given inputs $(f,x)$, Alice can determine the restriction $f_x$, and she will try to describe it to Bob. For most values of $x$, $f_x$ will be close (in $\delta_{\mu_Y}$-distance) to the function $g_x$. Bob will try to use the (yet unspecified) description given by Alice in order to determine some function $B$ that is close to $g_x$. If he succeeds in doing so, he can output $B(y)$, which would equal $g_x(y)$ with high probability over $y$.

We next explain how Alice will describe $f_x$, and how Bob will determine some function $B$ that is close to $g_x$ based on Alice’s description. For the first part, we let Alice and Bob use shared randomness in order to sample $y_1,\dots,y_m$, where the $y_i$’s are drawn independently with $y_i\sim\mu_Y$, and $m$ is a parameter to be chosen later. Alice’s description of $f_x$ will then be $(f_x(y_1),\dots,f_x(y_m))\in\{0,1\}^{m}$. Thus, the length of the communication is $m$ bits and we need to show that setting $m$ to be roughly $O(k)$ suffices. Before we explain this, we first need to specify what Bob does with Alice’s message.

As a first cut, let us consider the following natural strategy: Bob picks an $\tilde{x}$ such that $g_{\tilde{x}}$ is close to $f_x$ on $y_1,\dots,y_m$, and sets $B=g_{\tilde{x}}$. It is clear that if $\tilde{x}=x$, then $B=g_x$, and for every $y$, we would have $B(y)=g_x(y)$. Moreover, if $\tilde{x}$ is such that $g_{\tilde{x}}$ is close to $g_x$ (which is itself close to $f_x$), then $B(y)$ would now equal $g_x(y)$ with high probability. It remains to deal with $\tilde{x}$ such that $g_{\tilde{x}}$ is far from $g_x$. Note that if we first fix any such $\tilde{x}$ and then sample $y_1,\dots,y_m$, then with high probability, we would reveal that $g_{\tilde{x}}$ is far from $g_x$. This is because $g_x$ is close to $f_x$, so $g_{\tilde{x}}$ should also be far from $f_x$. However, this idea alone cannot deal with all possible $\tilde{x}$: using a naive union bound over all possible $\tilde{x}\in\mathcal{X}$ would require a failure probability of $1/|\mathcal{X}|$, which would itself require setting $m$ to be roughly $\log|\mathcal{X}|$. Indeed, smaller values of $m$ should not suffice since we have not yet used the fact that $\mathsf{owCC}^{\mu}_{\epsilon}(g)\le k$, but we do so next.

Suppose that $\Pi$ is a one-way protocol for $g$ with $k$ bits of communication. Then, note that Alice’s message partitions $\mathcal{X}$ into $2^{k}$ sets, one corresponding to each message. Our modified strategy for Bob is to let him pick a representative from each set in this partition, and then set $B=g_{\tilde{x}}$ for an $\tilde{x}$ among the representatives for which $g_{\tilde{x}}$ and $f_x$ are the closest on the samples $y_1,\dots,y_m$. A simple analysis shows that the $g_x$’s that lie inside the same set in this partition are close, and thus, if we pick $\tilde{x}$ to be the representative of the set containing $x$, then $g_{\tilde{x}}$ and $f_x$ will be close on the sampled points. For another representative, once again if $g_{\tilde{x}}$ is close to $g_x$, then $B(y)$ will equal $g_x(y)$ with high probability. For a representative $\tilde{x}$ such that $g_{\tilde{x}}$ is far from $g_x$ (which is itself close to $f_x$), we can proceed as in the previous paragraph, and now the union bound works out since the total number of representatives is only $2^{k}$. (Footnote 3: We note that a similar idea was used in a somewhat different context by [BJKS02] (following on [KNR99]) in order to characterize one-way communication complexity of any function under product distributions in terms of its VC-dimension.)
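
The representative-based strategy for product distributions can be sketched on a toy instance. Everything concrete below is an assumption made for illustration: the 6-bit domains, the particular function `g` (which depends on Alice's input only through its two low bits, so a zero-error 2-bit one-way protocol exists), the 5% noise pattern defining `f`, and the helper names. It is a simplified simulation, not the protocol analyzed in the formal proof.

```python
import random

N_BITS = 6
DOMAIN = range(2 ** N_BITS)   # toy choice: both input sets are {0, ..., 63}


def g(x, y):
    """Bob's function (hypothetical): depends on x only through x & 3."""
    return bin((x & 3) & y).count("1") % 2


# Alice's function f: g with about 5% of its entries flipped (fixed noise pattern).
_noise = random.Random(0)
FLIPPED = {(x, y) for x in DOMAIN for y in DOMAIN if _noise.random() < 0.05}


def f(x, y):
    return g(x, y) ^ int((x, y) in FLIPPED)


# A 2-bit one-way protocol for g sends x & 3, which splits Alice's inputs
# into 4 classes. Bob keeps one representative per class.
REPRESENTATIVES = [0, 1, 2, 3]


def run_protocol(f, g, x, y, representatives, samples):
    """Alice knows (f, x), Bob knows (g, y); both see the shared samples."""
    # Alice's one-way message: the restriction f_x evaluated on the samples.
    message = [f(x, yi) for yi in samples]

    # Bob: count disagreements between g restricted to a representative
    # and Alice's message, then pick the closest representative.
    def disagreements(x_rep):
        return sum(g(x_rep, yi) != bit for yi, bit in zip(samples, message))

    x_tilde = min(representatives, key=disagreements)
    return g(x_tilde, y)


def error_rate(trials=500, m=40, seed=1):
    """Fraction of trials in which Bob's output differs from g(x, y)."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        x, y = rng.choice(DOMAIN), rng.choice(DOMAIN)
        samples = [rng.choice(DOMAIN) for _ in range(m)]  # shared randomness
        errors += run_protocol(f, g, x, y, REPRESENTATIVES, samples) != g(x, y)
    return errors / trials
```

In this sketch Alice sends `m` bits regardless of the size of her input set, and Bob only compares against one candidate per message class rather than against every possible input of Alice.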

We now turn to the case of general (not necessarily product) distributions. In this case, we would like to run the above protocol with $y_1,\dots,y_m$ sampled independently from $\mu_{Y|x}$ (instead of $\mu_Y$). Note that Alice knows $x$ and hence knows the distribution $\mu_{Y|x}$. Unfortunately, Bob does not know $\mu_{Y|x}$; he only knows $\mu_Y$ as a “proxy” for $\mu_{Y|x}$. While Alice and Bob cannot jointly sample such $y_i$’s without communicating (as in the product case), they can still run the correlated sampling protocol of [BR11] in order to agree on such samples while communicating at most $O(m\cdot I(X;Y))$ bits. The original correlated sampling procedure of [BR11] inherently used multiple rounds of communication, but we are able in our case to turn it into a one-way protocol by leveraging the fact that our setup is distributional (see Subsection 4.2 for more details).

The outline of the rest of this section is the following. In Subsection 4.2, we describe the properties of the correlated sampling procedure that we will use. In Subsection 4.3, we give the formal proof of Theorem 1.1.

### 4.2 Correlated Sampling

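We start by recalling two standard notions from information theory. Given two distributions $P$ and $Q$, the KL divergence between $P$ and $Q$ is defined as $D(P\|Q) \triangleq \mathbb{E}_{u\sim P}\left[\log\left(P(u)/Q(u)\right)\right]$. Given a joint distribution $\mu$ of a pair $(X,Y)$ of random variables with $\mu_X$ and $\mu_Y$ being the marginals of $\mu$ over $X$ and $Y$ respectively, the mutual information of $X$ and $Y$ is defined as $I(X;Y) \triangleq D(\mu\|\mu_X\mu_Y)$.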
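As a small numerical illustration of these two notions, the sketch below computes the KL divergence $D(P\|Q) = \mathbb{E}_{u\sim P}[\log(P(u)/Q(u))]$ and the mutual information $I(X;Y)$, taken as the KL divergence between a joint distribution and the product of its marginals, for finite distributions. The function names, the dictionary representation and the use of base-2 logarithms are assumptions made for this example only.

```python
import math

def kl_divergence(p, q):
    """D(P||Q) in bits; p and q are dicts mapping outcome -> probability."""
    total = 0.0
    for u, pu in p.items():
        if pu == 0:
            continue  # terms with P(u) = 0 contribute nothing
        qu = q.get(u, 0.0)
        if qu == 0:
            return math.inf  # P puts mass where Q has none
        total += pu * math.log2(pu / qu)
    return total

def mutual_information(joint):
    """I(X;Y) as the KL divergence between the joint and the product of marginals.

    joint is a dict mapping (x, y) -> probability.
    """
    mu_x, mu_y = {}, {}
    for (x, y), p in joint.items():
        mu_x[x] = mu_x.get(x, 0.0) + p
        mu_y[y] = mu_y.get(y, 0.0) + p
    product = {(x, y): mu_x[x] * mu_y[y] for x in mu_x for y in mu_y}
    return kl_divergence(joint, product)

# Two toy joint distributions (hypothetical, for illustration only).
independent_bits = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
copied_bit = {(0, 0): 0.5, (1, 1): 0.5}
```

For a product distribution such as `independent_bits` the mutual information vanishes, which is the regime in which the uncertain-setting protocol pays only a constant-factor overhead.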

The following lemma summarizes the properties of the correlated sampling protocol of [BR11].

###### Lemma 4.1 ([BR11]).

Let Alice be given a distribution and Bob be given a distribution over a common universe . There is an interactive public-coin protocol that uses an expected

 D(P||Q)+2log(1/ϵ)+O(√D(P||Q)+1)

bits of communication such that at the end of the protocol:

• Alice outputs an element distributed according to .

• Bob outputs an element such that for each , .

Moreover, the message that Bob sends to Alice in any given round consists of a single bit indicating if the protocol should terminate or if Alice should send the next message.

We point out that in general, the correlated sampling procedure in Lemma 4.1 can take more than one round of communication. This is because initially, neither Alice nor Bob knows and they will need to interactively “discover” it. In our case, we will be using correlated sampling in a “distributional setup”. It turns out that this allows us to use a one-way version of correlated sampling which is described in Lemma 4.2 below.

###### Lemma 4.2.

Let be a distribution over with marginal over , and assume that is known to both Alice and Bob. Fix and let Alice be given . There is a one-way public-coin protocol that uses at most

 O(m⋅I(X;Y)/ϵ+log(1/ϵ)/ϵ)

bits of communication such that with probability at least over the public coins of the protocol and the randomness of , Alice and Bob agree on samples at the end of the protocol.

###### Proof.

When is Alice’s input, we can consider running the protocol in Lemma 4.1 on the distributions and and with error parameter . Let be the resulting protocol transcript. The expected communication cost of is at most

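$$\begin{aligned}
\mathbb{E}_{x\sim\mu_X}\big[O(D(P\|Q)) + O(\log(1/\epsilon))\big]
&= O\big(\mathbb{E}_{x\sim\mu_X}[D(P\|Q)]\big) + O(\log(1/\epsilon)) \\
&= O(m\cdot I(X;Y)) + O(\log(1/\epsilon)),
\end{aligned} \tag{2}$$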

where the last equality follows from the fact that

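$$\mathbb{E}_{x\sim\mu_X}[D(P\|Q)] = \mathbb{E}_{x\sim\mu_X}\left[\mathbb{E}_{y_1|x,\dots,y_m|x}\left[\log\left(\frac{\prod_{i=1}^{m}\mu(y_i|x)}{\prod_{i=1}^{m}\mu(y_i)}\right)\right]\right] = m\cdot I(X;Y).$$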
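The intermediate steps of this identity can be expanded as follows. This is a sketch under the assumption, suggested by the product form inside the logarithm, that $P$ is the distribution of $m$ independent draws from $\mu_{Y|x}$ and $Q$ is the distribution of $m$ independent draws from $\mu_Y$; it also uses the standard identity $I(X;Y) = \mathbb{E}_{x\sim\mu_X}[D(\mu_{Y|x}\|\mu_Y)]$.

```latex
\begin{align*}
\mathbb{E}_{x\sim\mu_X}\big[D(P\|Q)\big]
  &= \mathbb{E}_{x\sim\mu_X}\Big[\mathbb{E}_{y_1,\dots,y_m\sim\mu_{Y|x}}
       \Big[\sum_{i=1}^{m}\log\frac{\mu(y_i|x)}{\mu(y_i)}\Big]\Big]
     && \text{(log of a product is a sum of logs)} \\
  &= \sum_{i=1}^{m}\mathbb{E}_{x\sim\mu_X}\big[D(\mu_{Y|x}\,\|\,\mu_Y)\big]
     && \text{(linearity; each } y_i \sim \mu_{Y|x}\text{)} \\
  &= m\cdot I(X;Y).
\end{align*}
```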

By Markov’s inequality applied to (2), we get that with probability at least , the length of the transcript is at most

 ℓ≜O(m⋅I(X;Y)/ϵ)+O(log(1/ϵ)/ϵ).

Conditioned on the event that the length of is at most bits, the total number of bits sent by Alice to Bob is also at most .

Note that Lemma 4.1 guarantees that each message of Bob in consists of a single bit indicating if the protocol should terminate or if Alice should send the next message. Hence, Bob’s messages do not influence the actual bits sent by Alice; they only determine how many bits are sent by her.

In the new one-way protocol , Alice sends to Bob, in a single shot, the first bits that she would have sent him in protocol if he kept refusing to terminate. Upon receiving this message, Bob completes the simulation of protocol . The error probability of the new protocol is the probability that either Alice did not send enough bits or that the protocol makes an error, which by a union bound is at most

$$\Pr[\overline{E}] + \epsilon/2 \le \epsilon/2 + \epsilon/2 = \epsilon$$

where denotes the complement of event . ∎
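The conversion from an interactive protocol to a one-way protocol used in this proof can be sketched in a few lines. The sketch below is a toy model and not the correlated sampling protocol of [BR11] itself: it assumes, for simplicity, that Alice sends one bit per round, and the names `run_interactive`, `run_one_way`, `alice_stream` and `stop_after_three_ones` are hypothetical. The only property it relies on is the one stated in Lemma 4.1, namely that Bob's replies merely say "stop" or "continue" and therefore never change which bits Alice sends.

```python
from itertools import islice

def run_interactive(alice_bits, bob_wants_stop):
    """Interactive run: after each bit from Alice, Bob replies stop or continue."""
    received = []
    for bit in alice_bits:
        received.append(bit)
        if bob_wants_stop(received):
            break
    return received

def run_one_way(alice_bits, bob_wants_stop, budget):
    """One-way run: Alice sends her first `budget` bits in a single message.

    Bob replays his own stop rule on growing prefixes of that message.
    Returns None when the budget was too small, which is the failure event
    whose probability the proof bounds via Markov's inequality.
    """
    message = list(islice(alice_bits, budget))
    for t in range(1, len(message) + 1):
        if bob_wants_stop(message[:t]):
            return message[:t]
    return None

def alice_stream():
    """Hypothetical stand-in for the bits Alice would send if never told to stop."""
    pattern = [0, 1, 1, 0, 1, 0, 0, 1]
    while True:
        yield from pattern

def stop_after_three_ones(received):
    """Hypothetical stop rule for Bob, depending only on what he has received."""
    return sum(received) >= 3
```

Whenever the interactive transcript fits within the budget, the two runs leave Bob with exactly the same view, so the one-way protocol inherits the guarantees of the interactive one on that event.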

### 4.3 Proof of Theorem 1.1

Recall that in the contextual setting, Alice’s input is and Bob’s input is , where and . Let be the one-way protocol for in the standard setting that shows that . Note that can be described by an integer and functions and , such that Alice’s message on input is , and Bob’s output on message from Alice and on input is . We use this notation below. We also set the parameter , which is chosen such that .

#### The protocol.

Algorithm 1 describes the protocol we employ in the contextual setting. Roughly speaking, the protocol works as follows. First, Alice and Bob run the one-way correlated sampling procedure given by Lemma 4.2 in order to sample . Then, Alice sends the sequence to Bob. Bob enumerates over and counts the fraction of for which . For the index which minimizes this fraction, Bob outputs and halts.
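Bob's selection step can be illustrated with a short sketch. It assumes that the sequence Alice sends consists of her function's values on the agreed samples, and that Bob's candidate output functions (one per possible message of the original protocol) are available as a list of callables. The names `bob_select_and_output`, `bob_outputs` and `alice_labels` are hypothetical, and the toy candidates at the end are not from the paper.

```python
def bob_select_and_output(bob_outputs, alice_labels, samples, y):
    """Pick the candidate that disagrees least with Alice's labels, then apply it to y.

    bob_outputs  -- list of callables, one per candidate message index
    alice_labels -- values Alice sent, one per sample (assumed to be her
                    function evaluated on each sample)
    samples      -- the samples both parties agreed on
    y            -- Bob's actual input
    Returns (output, chosen index, empirical disagreement of the chosen index).
    """
    m = len(samples)
    best_index, best_err = None, None
    for i, candidate in enumerate(bob_outputs):
        # Fraction of samples on which this candidate disagrees with Alice.
        err = sum(1 for z, label in zip(samples, alice_labels)
                  if candidate(z) != label) / m
        if best_err is None or err < best_err:
            best_index, best_err = i, err
    return bob_outputs[best_index](y), best_index, best_err

# Toy instance: Alice's function is the parity of y; Bob has three candidates.
samples = [1, 2, 3, 4, 5, 6]
alice_labels = [z % 2 for z in samples]
candidates = [lambda z: 0, lambda z: z % 2, lambda z: 1 - z % 2]
```

Ties are broken in favor of the smallest index here; the analysis only needs that the chosen index minimizes the empirical disagreement.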

#### Analysis.

Observe that by Lemma 4.2, the correlated sampling procedure requires bits of communication. Thus, the total communication of our protocol is at most

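$$O\left(\frac{m\cdot I(X;Y)}{\theta^2} + \frac{\log(1/\theta)}{\theta^2}\right) + m = \frac{c\left(k + \log\left(\frac{1}{\theta}\right)\right)}{\theta^2}\cdot\left(1 + \frac{I(X;Y)}{\theta^2}\right) \text{ bits}$$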

for some absolute constant , as promised. The next lemma establishes the correctness of the protocol.

.

###### Proof.

We start with some notation. For , let and let . Note that by definition, and . For , let . Note that by the triangle inequality,

$$\gamma_{\pi(x),x} = \delta_{\mu_{Y|x}}\left(f_x, B_{\pi(x)}\right) \le \delta_x + \epsilon_x. \tag{3}$$

In what follows, we will analyze the probability that by analyzing the estimate and the index computed in the above protocol. Note that computed above attempts to estimate , and that both and are functions of .

Note that Lemma 4.2 guarantees that correlated sampling succeeds with probability at least . Henceforth, we condition on the event that correlated sampling succeeds (we will account for the event where this does not happen at the end). By the Chernoff bound, we have for every and

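$$\Pr_{y_1,\dots,y_m\sim\mu_{Y|x}}\left[\left|\gamma_{i,x} - \mathrm{err}_i\right| > \frac{\theta}{5}\right] \le \exp\left(-\frac{\theta^2\cdot m}{75}\right).$$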

By a union bound, we have for every ,

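$$\Pr_{y_1,\dots,y_m\sim\mu_{Y|x}}\left[\exists\, i\in[L] \text{ s.t. } \left|\gamma_{i,x} - \mathrm{err}_i\right| > \frac{\theta}{5}\right] \le L\cdot\exp\left(-\frac{\theta^2\cdot m}{75}\right) \le \frac{2\theta}{5},$$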

where the last inequality follows from our choice of .
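To make the role of the sample count concrete, the sketch below solves the union-bound inequality for the number of samples: given the number of candidates, the accuracy parameter and a target failure probability, it returns the smallest integer m for which (number of candidates) · exp(−θ²·m/75) is at most the target. The constant 75 is taken from the Chernoff bound displayed above; the function name `samples_needed` and its parameter names are hypothetical, and the target is left as a free parameter rather than fixed to the paper's choice.

```python
import math

def union_bound_failure(num_candidates, theta, m):
    """Right-hand side of the union bound: num_candidates * exp(-theta^2 * m / 75)."""
    return num_candidates * math.exp(-(theta ** 2) * m / 75.0)

def samples_needed(num_candidates, theta, failure_budget):
    """Smallest integer m >= 0 with union_bound_failure(...) <= failure_budget.

    Obtained by solving num_candidates * exp(-theta^2 * m / 75) <= failure_budget
    for m, i.e. m >= (75 / theta^2) * ln(num_candidates / failure_budget).
    """
    if failure_budget <= 0 or theta <= 0:
        raise ValueError("theta and failure_budget must be positive")
    m = math.ceil(75.0 / theta ** 2 * math.log(num_candidates / failure_budget))
    return max(0, m)
```

The resulting m grows like (log(number of candidates) + log(1/target)) / θ², which matches the shape of the communication bound stated in the analysis.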

Now assume that for all , we have that , which we refer to below as the “Good Event”. Then, for , we have

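$$\begin{aligned}
\gamma_{i_{\min},x} &\le \mathrm{err}_{i_{\min}} + \theta/5 && \text{(since we assumed the Good Event)} \\
&\le \mathrm{err}_{\pi(x)} + \theta/5 && \text{(by definition of } i_{\min}\text{)} \\
&\le \gamma_{\pi(x),x} + 2\theta/5 && \text{(since we assumed the Good Event)} \\
&\le \delta_x + \epsilon_x + 2\theta/5. && \text{(by Equation 3)}
\end{aligned}$$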

Let be the set of all for which correlated sampling succeeds with probability at least (over the internal randomness of the protocol). By Lemma 4.2 and an averaging argument, . Thus,

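$$\begin{aligned}
\Pr_{\Pi,(x,y)\sim\mu}\left[B_{i_{\min}}(y)\ne f(x,y)\right]
&\le \mathbb{E}_{x\sim\mu_X|x\in W}\left[\Pr_{\Pi,\,y\sim\mu_{Y|x}}\left[B_{i_{\min}}(y)\ne f(x,y)\right]\right] + \theta/10 \\
&\le \mathbb{E}_{x\sim\mu_X|x\in W}\left[\Pr_{y_1,\dots,y_m,\,y\sim\mu_{Y|x}}\left[B_{i_{\min}}(y)\ne f(x,y)\right]\right] + \theta/5 \\
&\le \mathbb{E}_{x\sim\mu_X|x\in W}\left[\delta_x + \epsilon_x\right] + 3\theta/5 \\
&= \delta + \epsilon + \theta
\end{aligned}$$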

where the third inequality follows from the fact that the Good Event occurs with probability at least , and from the corresponding upper bound on . The other inequalities above follow from the definition of the set and the fact that . Finally, since , we have that Bob’s output does not equal (which is the desired output) with probability at most . ∎

## 5 Lower Bound for Non-Product Distributions

In this section, we prove Theorem 1.3. We start by defining the class of function pairs and distributions that will be used. Consider the parity functions on subsets of bits of the string . Specifically, for every , let