Geometric Decompositions of Bell Polytopes with Practical Applications
Abstract
In the well-studied (2,2,2) Bell experiment consisting of two parties, two measurement settings per party, and two possible outcomes per setting, it is known that if the experiment obeys no-signaling constraints, then the set of admissible experimental probability distributions is fully characterized as the convex hull of 24 distributions: 8 Popescu-Rohrlich (PR) boxes and 16 local deterministic distributions. Furthermore, it turns out that in the (2,2,2) case, any nonlocal nonsignaling distribution can always be uniquely expressed as a convex combination of exactly one PR box and (up to) eight local deterministic distributions. In this representation each PR box will always occur only with a fixed set of eight local deterministic distributions with which it is affiliated. In this paper, we derive multiple practical applications of this result: we demonstrate an analytical proof that the detection efficiency must exceed 2/3 for nonlocality to be observed, even for theories constrained only by the no-signaling principle, and we develop new algorithms that speed up the calculation of important statistical functions of Bell test data. Finally, we enumerate the vertices of the no-signaling polytope for the “chained Bell” scenario and find that similar decomposition results are possible in this general case. Here, our results allow us to prove the optimality of a bound, derived in Barrett et al. [1], on the proportion of local theories in a local/nonlocal mixture that can be inferred from the experimental violation of a chained Bell inequality.
PACS: 03.65.Ud, 03.65.Ta
AMS: 81P15, 81Qxx
1 Introduction
Ever since Bell’s landmark 1964 work [2], it has been known that quantum mechanics can predict nonlocal behavior in certain experimental scenarios. The most widely studied scenario is the (2,2,2) setting, in which there are two spatially separated parties with measuring devices where each measuring device can be put into one of two configurations (“measurement settings”) and has two possible measurement outcomes. In such a scenario, local realist theories must obey constraints, generally referred to as Bell inequalities, on the probabilities of certain outcomes. These constraints, which include the Clauser-Horne-Shimony-Holt (CHSH) inequality [3] and the CH/Eberhard inequalities [4, 5], can be violated by quantum mechanics. In recent years, it has been shown that experimental violations of these inequalities can be exploited for practical purposes, such as device-independent quantum key distribution [6] and device-independent quantum random number expansion [7].
In a (2,2,2) experiment, the relevant quantum predictions form just one example from a class of probability distributions collectively known as the no-signaling polytope. This polytope contains all possible quantum distributions for the experiment, as well as some additional distributions that cannot be realized by a quantum mechanical system. The no-signaling polytope can be expressed as the convex hull of 24 special extremal distributions: 16 “local deterministic” (LD) distributions and 8 nonsignaling nonlocal distributions [8, 9]. Any convex combination of these 24 extremal distributions represents a probability distribution over possible experimental outcomes that is consistent with the “no-signaling” principle, a constraint imposed by special relativity; conversely, any nonsignaling distribution can be expressed as a convex combination of these 24 extremal distributions.
Any convex combination consisting only of the 16 LD distributions will be compatible with local realism and cannot violate any Bell inequalities; such a distribution cannot be used for device-independent quantum information applications. It is only distributions that contain some weight on the 8 nonlocal distributions, commonly referred to as Popescu-Rohrlich (PR) boxes following the work [10], that can display nonlocality. In particular, the CHSH inequality [3],
(1)  $E(AB \mid ab) + E(AB \mid ab') + E(AB \mid a'b) - E(AB \mid a'b') \le 2$
is maximally violated by one of the eight PR boxes. There are eight distinct versions of the inequality (1) that can be obtained by symmetry transformations (i.e., relabeling of settings and/or outcomes), which we can refer to as “CHSH symmetry inequalities.” As noted in [8], each of the eight PR boxes is associated with a unique CHSH symmetry inequality that it maximally violates.
In this paper, we highlight a useful fact related to the above results: any nonlocal nonsignaling distribution can always be written as a convex combination of exactly one PR box and at most eight LD distributions, where the LD distributions all saturate the CHSH symmetry inequality that is violated by the specific PR box occurring in the convex combination. (There are exactly eight such saturating LD distributions for each CHSH symmetry inequality.) This result refines what can be said about the no-signaling polytope using only standard results of convex geometry, such as Carathéodory’s theorem as can be found in (for example) [11].
While the existence of the decomposition into a single PR box and 8 LD distributions is a straightforward consequence of previous work on the (2,2,2) setting [8, 12, 13, 14], it offers valuable new insights into the nature of (2,2,2) experiments. The decomposition is surprisingly easy to construct given nothing more than a table of nonsignaling empirical frequencies that violate a Bell inequality. This allows for useful applications, such as a new proof that detection efficiency must exceed 2/3 in order for a (2,2,2) experiment to display nonlocality even for a theory constrained only by the no-signaling principle. Another application concerns the statistics of a Bell experiment. Specifically, finding the closest local distribution (in statistical distance) to a given nonlocal empirical distribution is important for quantifying the statistical evidence against local realism (LR) [15, 16, 17], and distributions that have higher CHSH violations can have more information-theoretic potential. For two important measures of statistical distance, the calculation of the closest local distribution can be meaningfully simplified by using this decomposition: for the total variation distance (the L1 norm), the calculation is rendered trivial and the statistical distance to the closest local distribution is shown to always be a constant multiple of the CHSH violation; for the Kullback-Leibler divergence, the calculation still requires a computer but the search space can be significantly reduced.
A natural question is whether this result can be extended to more general situations. We provide an example by extending the result to the scenario of the chained Bell inequalities that were introduced in [18] and studied extensively in [19], and later shown to be relevant for quantum protocols for key distribution [20] and randomness amplification [21] where security can be guaranteed by the no-signaling conditions alone. To address the chained Bell scenario, we first classify the nonsignaling polytope for this situation, showing that the extremal points consist of local deterministic distributions and nonlocal nonsignaling distributions that generalize the concept of the PR box. Once the polytope is classified, we then prove the general version of our earlier result: that any nonsignaling nonlocal distribution in the chained Bell scenario can be expressed as a convex combination of exactly one generalized PR box and LD distributions. As an application of this result, we prove the optimality of a bound derived in Barrett et al. [1] on the proportion of local theories in a local/nonlocal mixture that can be inferred from the experimental violation of a chained Bell inequality.
The structure of the paper is as follows: in Section 2, we introduce the notation and definitions that we will use and prove the main result for the (2,2,2) situation. In Section 3, we demonstrate the applications of this result to statistical problems, and in Section 4, we study the chained Bell scenario, followed by concluding remarks in Section 5.
2 Decomposition Theorems for the (2,2,2) Scenario
2.1 Defining the No-Signaling Polytope
In the (2,2,2) setting, there are two spatially separated parties which we call Alice and Bob. Alice and Bob each have a measuring apparatus, and each apparatus has two configurations which we call measurement settings. We label the two measurement settings for Alice with the symbols a and a′, and the two measurement settings for Bob with the symbols b and b′. Whatever the choice of setting, each apparatus always returns one of two outcomes “+” or “0” in an experimental trial.
In a given experiment, there are then four possible setting configurations and for each there will be an associated probability distribution over the four possible outcomes +_A+_B, +_A0_B, 0_A+_B, and 0_A0_B, where (for example) +_A0_B denotes outcome + for Alice and outcome 0 for Bob. Henceforth, we will omit the A and B subscripts when possible, with the understanding that if a pair of outcomes is written without subscripts, then the first outcome is Alice’s, and the second is Bob’s. We can conveniently represent the four settings-conditional distributions as a table with one row for each conditional distribution, such as the following example:
++  +0  0+  00  
1/4  1/4  1/4  1/4  
1/2  0  0  1/2  
1/4  1/4  1/4  1/4  
0  1/2  1/2  0 
So for an experiment obeying the distribution in Table 1, we would expect to see, for instance, + for Alice and 0 for Bob with probability 1/4 under the setting of the first row. Bell inequalities like (1) are constraints on these outcome probabilities that can either be obeyed or violated. In particular, each term appearing in (1) is the expectation, conditioned on one of the four settings, of the product of random variables A and B, where A (B) equals +1 or −1 if Alice’s (Bob’s) outcome is + or 0, respectively.
The complete collection of all 16 entries in Table 1 is an example of what we will refer to as a distribution matrix. Mathematically, a valid distribution matrix is just an element of R^16 satisfying certain linear constraints. To enumerate these constraints, first note that each row of Table 1 must be a valid probability distribution. Referring to the entries of the table using the notation P(xy | s) for “the probability of outcome xy given that the setting is s,” the associated constraints can be written as follows:
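(2)  $P(xy \mid s) \ge 0$ for every outcome $xy$ and every setting $s$
(3)  $P(++ \mid ab) + P(+0 \mid ab) + P(0+ \mid ab) + P(00 \mid ab) = 1$
(4)  $P(++ \mid ab') + P(+0 \mid ab') + P(0+ \mid ab') + P(00 \mid ab') = 1$
(5)  $P(++ \mid a'b) + P(+0 \mid a'b) + P(0+ \mid a'b) + P(00 \mid a'b) = 1$
(6)  $P(++ \mid a'b') + P(+0 \mid a'b') + P(0+ \mid a'b') + P(00 \mid a'b') = 1$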
Furthermore, the example in Table 1 is just one instance of a distribution matrix obeying the no-signaling conditions:
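(7)  $P(++ \mid ab) + P(+0 \mid ab) = P(++ \mid ab') + P(+0 \mid ab')$
(8)  $P(++ \mid a'b) + P(+0 \mid a'b) = P(++ \mid a'b') + P(+0 \mid a'b')$
(9)  $P(++ \mid ab) + P(0+ \mid ab) = P(++ \mid a'b) + P(0+ \mid a'b)$
(10)  $P(++ \mid ab') + P(0+ \mid ab') = P(++ \mid a'b') + P(0+ \mid a'b')$
(11)  $P(0+ \mid ab) + P(00 \mid ab) = P(0+ \mid ab') + P(00 \mid ab')$
(12)  $P(0+ \mid a'b) + P(00 \mid a'b) = P(0+ \mid a'b') + P(00 \mid a'b')$
(13)  $P(+0 \mid ab) + P(00 \mid ab) = P(+0 \mid a'b) + P(00 \mid a'b)$
(14)  $P(+0 \mid ab') + P(00 \mid ab') = P(+0 \mid a'b') + P(00 \mid a'b')$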
The no-signaling conditions capture the notion that Alice and Bob should not be able to exploit the experiment to send signals to each other, in the sense that Alice’s marginal outcome distribution should not depend on Bob’s measurement choice and vice versa. To illustrate, note that if the equality (7) did not hold, then if Alice were to choose setting a, she could gain some information about whether Bob chose b or b′ because her probability of seeing “+” changes according to Bob’s choice. Equalities (7)-(14) encapsulate all of the different versions of this scenario for different settings and outcomes.
We are interested in the collection of distribution matrices that satisfy all of the constraints (2)-(14). These constraints are all linear equalities and inequalities so the collection will form a polyhedron in R^16. In particular, we can say that it is a convex polytope (the convex hull of a finite set) because it is bounded. The set of distribution matrices satisfying (2)-(14) is known as the no-signaling polytope, and any element of the set is expressible as a convex combination of elements of a special finite collection of distribution matrices known as the extremal points or vertices. There are 24 extremal points [8] which we list in Appendix A. As noted earlier, 16 of these extremal points are local (i.e., satisfying all Bell inequalities) and 8 are nonlocal PR boxes.
We can observe a few things about the polytope by studying the constraints that define it. It is straightforward to check that equations (11)-(14) can be derived from (3)-(6) along with (7)-(10). Removing (11)-(14) from the set of equalities, we are left with equalities (3)-(6) and (7)-(10), which are linearly independent. Thus these equalities reduce the dimension of the no-signaling polytope to 8 (down from the 16 dimensions of the ambient space R^16). A theorem of convex geometry, due to Carathéodory, tells us that for a given element in the convex hull of a finite set of points, it is always possible to express it as a convex combination of an affinely independent subset of those points [11]. For the no-signaling polytope, this indicates that any distribution matrix can always be expressed as a convex combination of no more than 9 extremal points. We can refine this result for our particular polytope by analyzing various properties belonging to this collection of 9 extremal points.
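The dimension count above can be checked mechanically. The following is a minimal sketch, not the authors' code: it assumes entries are indexed by setting bits (s, t) and outcome bits (x, y), and the helper names `idx`, `equality_rows`, and `rank` are hypothetical. It builds the coefficient rows of the four normalization equalities and the eight no-signaling equalities and computes their rank exactly.

```python
from fractions import Fraction as F

def idx(s, t, x, y):
    """Flat index of P(xy|st); s, t are setting bits and x, y are outcome bits."""
    return 4 * (2 * s + t) + (2 * x + y)

def equality_rows():
    rows = []
    # Normalization: the four entries of each settings row sum to a constant.
    for s in (0, 1):
        for t in (0, 1):
            r = [F(0)] * 16
            for x in (0, 1):
                for y in (0, 1):
                    r[idx(s, t, x, y)] = F(1)
            rows.append(r)
    # No-signaling for Alice: her marginal for outcome x does not depend on t.
    for s in (0, 1):
        for x in (0, 1):
            r = [F(0)] * 16
            for y in (0, 1):
                r[idx(s, 0, x, y)] += 1
                r[idx(s, 1, x, y)] -= 1
            rows.append(r)
    # No-signaling for Bob: his marginal for outcome y does not depend on s.
    for t in (0, 1):
        for y in (0, 1):
            r = [F(0)] * 16
            for x in (0, 1):
                r[idx(0, t, x, y)] += 1
                r[idx(1, t, x, y)] -= 1
            rows.append(r)
    return rows

def rank(rows):
    """Rank by exact Gauss-Jordan elimination over the rationals."""
    m = [r[:] for r in rows]
    rk = 0
    for col in range(len(m[0])):
        piv = next((i for i in range(rk, len(m)) if m[i][col] != 0), None)
        if piv is None:
            continue
        m[rk], m[piv] = m[piv], m[rk]
        for i in range(len(m)):
            if i != rk and m[i][col] != 0:
                f = m[i][col] / m[rk][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[rk])]
        rk += 1
    return rk

eqs = equality_rows()
dimension = 16 - rank(eqs)  # dimension of the affine subspace cut out by the equalities
```

Of the twelve equality rows, four are redundant, which leaves an 8-dimensional affine subspace in agreement with the text.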
2.2 Decomposing the Polytope
We will later generalize these theorems to the chained Bell setting, but we include the following proofs of the special case for ease of understanding.
Theorem 2.1
In the (2,2,2) Bell scenario, any nonsignaling distribution matrix that is an equal mixture of two distinct PR boxes is expressible as an equal mixture of four local deterministic distributions.
Proof. We prove this by inspection, referring to the table in Appendix A. We start by fixing the PR box labeled “1.” There are seven different possibilities for PR box 1 to be mixed with another PR box in equal amounts. For each of these seven pairs, we can list a group of four LD distributions that reproduce the equal mixture distribution. Since an arbitrary pair of PR boxes can be transformed into a pair containing PR box 1 by relabelling the outcomes and/or settings, this is sufficient to show that the theorem is true for all possible distinct PR box pairs.
For visualization purposes, we examine how this works for the equal mixture of PR box 1 and PR box 6, which results in the distribution matrix that was written earlier in Table 1. An equal mixture of LD distributions 9, 12, 14, and 15 will induce the same distribution. One way to see this is by proceeding in steps, first considering the equal mixture of LD distributions 9 and 14 (below left – to reduce clutter, we leave blank the cells that contain zero), and then considering an equal mixture of LD distributions 12 and 15 (below right):
++  +0  0+  00  
1/2  1/2  
1  
1/2  1/2  
1 
++  +0  0+  00  
1/2  1/2  
1  
1/2  1/2  
1 
It is easy to see that if we take an equal mixture of the above two tables – which is an even mixture over all the four LD distributions 9, 12, 14, and 15 – we get the same distribution matrix as the one in Table 1. The table below lists how to achieve this for each pairing of PR box 1 with another PR box, and thus completes the proof.
PR box mixture  1,2  1,3  1,4  1,5  1,6  1,7  1,8 
Det. collection  1,2,3,4  1,4,9,12  5,8,14,15  1,4,5,8  9,12,14,15  1,4,14,15  5,8,9,12 
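The inspection argument can also be reproduced by brute force. The sketch below is illustrative only: it enumerates the 16 LD distributions and 8 PR boxes in its own (hypothetical) order, which is not the numbering of Appendix A, and encodes outcome “+” as bit 0 and outcome “0” as bit 1. For every pair of distinct PR boxes it searches for four LD distributions whose equal mixture matches the equal mixture of the pair.

```python
from fractions import Fraction as F
from itertools import combinations, product

SETTINGS = list(product((0, 1), repeat=2))  # (Alice setting bit, Bob setting bit)
OUTCOMES = list(product((0, 1), repeat=2))  # (Alice outcome bit, Bob outcome bit)

def ld(a0, a1, b0, b1):
    """Local deterministic distribution: Alice outputs a0 or a1, Bob b0 or b1."""
    return tuple(F(int((x, y) == ((a0, a1)[s], (b0, b1)[t])))
                 for s, t in SETTINGS for x, y in OUTCOMES)

def pr(al, be, ga):
    """PR box supported on x XOR y = (s AND t) XOR al*s XOR be*t XOR ga."""
    return tuple(F(1, 2) if (x ^ y) == ((s & t) ^ (al & s) ^ (be & t) ^ ga) else F(0)
                 for s, t in SETTINGS for x, y in OUTCOMES)

def mix(points):
    """Equal mixture of a list of distribution matrices (flattened to 16 entries)."""
    return tuple(sum(col) / len(points) for col in zip(*points))

LDS = [ld(*bits) for bits in product((0, 1), repeat=4)]
PRS = [pr(*bits) for bits in product((0, 1), repeat=3)]

def four_ld_decomposition(pr_a, pr_b):
    """Indices (into LDS) of four LD distributions reproducing the PR pair mixture."""
    target = mix([pr_a, pr_b])
    for quad in combinations(range(16), 4):
        if mix([LDS[i] for i in quad]) == target:
            return quad
    return None

results = {(i, j): four_ld_decomposition(PRS[i], PRS[j])
           for i, j in combinations(range(8), 2)}
```

Checking all 28 pairs directly avoids the symmetry reduction to pairs containing PR box 1, at the cost of a slightly larger search.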
Corollary 2.1
In the (2,2,2) scenario, any nonsignalling distribution matrix can always be expressed as a convex combination of extremal points that contains at most one PR box.
Proof. Any nonsignaling distribution matrix has a representation as a convex combination of the extremal points of the polytope. We can express this algebraically as
(15) 
where  and  and  refer to the distribution matrices given in Tables 4 and 5 of Appendix A and the  and  are nonnegative numbers satisfying . We can arrange the terms in decreasing order relative to the magnitude of . Then if  and  are respectively the smallest and second smallest s,
where we have replaced with an equal mixture of deterministic strategies by using Theorem 2.1. We can then shift the ’s appearing above to the second sum of (15), and we have removed one PR box from the expression. This process can then be applied to the term and to remove from the expression, and similarly repeated until there is only one remaining PR box term.
Theorem 2.2
In the (2,2,2) scenario, any convex combination of a fixed PR box and 16 local deterministic distributions can be reexpressed as either a) a convex combination consisting only of local deterministic distributions, or b) a convex combination of the same PR box and the eight local deterministic distributions that saturate the symmetry of the CHSH inequality maximally violated by the PR box.
Proof. We will explicitly prove the claim of the theorem for the particular case of PR box 1. Then a symmetry argument implies that the theorem holds for all 8 PR boxes, as the other PR boxes and their corresponding CHSH symmetries are equivalent to PR box 1 paired with inequality (1), up to relabeling of settings and/or outcomes.
By assumption, we are given a distribution that can be written as a convex combination of PR box 1 and the 16 LD distributions. Now consider the subset of distributions listed in Appendix A that saturate the CHSH inequality (1). By inspection, this is seen to be the eight distributions whose index falls in the set {1, 4, 5, 8, 9, 12, 14, 15}; these are the LD distributions that “go with” PR box 1. The complement of this set is {2, 3, 6, 7, 10, 11, 13, 16}, so we can rewrite our distribution as
Our goal is to show that the distribution above can be induced by a similar convex combination that either a) does not include the PR box term, or b) does not include any of the terms in the sum over the complement {2, 3, 6, 7, 10, 11, 13, 16}. To do this, we will show that for any deterministic distribution whose index is in the complement, a linear combination of PR box 1 with that distribution, in suitable proportion, can be reexpressed equivalently as a multiple of a sum of three LD distributions whose indices are in the set {1, 4, 5, 8, 9, 12, 14, 15}. Then for each index in the complement, we can methodically replace the corresponding term with a collection of LD distributions in the saturating set until we either a) “run out” of PR box weight because its coefficient was not big enough, or b) cast out all of the complement-indexed LD distributions and end up with a convex combination of PR box 1 and LD distributions whose index is in {1, 4, 5, 8, 9, 12, 14, 15}. These two contingencies correspond to cases (a) and (b) in the statement of the theorem.
For visualization purposes, we again demonstrate one particular case. Suppose we have a linear combination consisting of of and of , with . Then the induced pseudodistribution is given by the following table:
++  +0  0+  00  

By inspection, we can see that the above table is equivalent to the linear combination of LD distributions . Similarly, we can table off all the possibilities as follows:
mixed with:  2  3  6  7  10  11  13  16 

Alt. Det. collection:  5,12,14  8,9,15  1,12,14  4,9,15  4,5,14  1,8,15  4,5,9  1,8,12 
This completes the proof.
Together, Corollary 2.1 and Theorem 2.2 tell us that any nonsignaling distribution matrix is either a) local (i.e., can be expressed as a convex combination of local deterministic distributions and thus violates no Bell inequalities), or b) can be expressed as a convex combination of exactly one PR box and (up to) 8 local deterministic distributions that saturate the specific CHSH symmetry violated by that PR box. The characterization (b), which we will call the 1 PR + 8 LD representation, offers a deeper understanding of nonlocal distributions, as illustrated in the following sections.
2.3 Relationships between Bell inequalities
Consider the following distribution matrix, which is nonsignaling and approximates the empirical data reported in Table SII of the supplementary material of [22], a recent loophole-free Bell test:
++  +0  0+  00  
0.0001422  0.0000743  0.0000699  0.9997136  
0.0001530  0.0000635  0.0005249  0.9992586  
0.0001476  0.0004795  0.0000644  0.9993084  
0.0000024  0.0006247  0.0006755  0.9986974 
This distribution weakly violates the CHSH inequality (1), and therefore it has a 1 PR + 8 LD representation. But at first glance, it is not at all obvious how to find the coefficients of this representation. However, this task turns out to be rather easy. To see how, note that the 1 PR + 8 LD representation for a nonsignalling distribution violating the CHSH inequality is as follows:
(16) 
where the nine coefficients are nonnegative and sum to 1. Referring to A, we can fill in a table for the distribution matrix based on the expression above by filling in a everywhere the corresponding has a “1,” putting a everywhere has a “1/2,” and adding these entries together to get the following:
++  +0  0+  00  

We notice immediately that there are eight cells in the table that are uniquely determined by the coefficient of a specific deterministic distribution in expression (16). One can thus work backwards: presented with a nonsignaling distribution matrix that violates the CHSH inequality like the one at the beginning of the section, the construction of the representation (16) requires only one calculation to find and the remaining coefficients can be copied directly from the table (so for instance in the 1 PR + 8 LD representation of the distribution matrix given earlier.) The same process can also be applied to nonsignaling distributions violating other symmetries of the CHSH inequality, with appropriate changes to the collection of local distributions appearing in (16).
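The "read it off the table" procedure can be sketched in code. This is an illustration under stated assumptions rather than the authors' implementation: rows are taken in the order ab, ab′, a′b, a′b′ and columns in the order ++, +0, 0+, 00 (the header of the data table is damaged in this copy), the relevant PR box is taken to be the one supported on cells with x XOR y = s AND t, and LD distributions are labeled by bit tuples (a0, a1, b0, b1) instead of the Appendix A indices. The names `decompose` and `rebuild` are hypothetical. Each of the eight cells where the PR box has probability zero is visited by exactly one CHSH-saturating LD distribution, so that cell's entry is that distribution's coefficient, and the PR box weight is whatever remains.

```python
from itertools import product

SETTINGS = list(product((0, 1), repeat=2))  # assumed row order: ab, ab', a'b, a'b'
OUTCOMES = list(product((0, 1), repeat=2))  # assumed column order: ++, +0, 0+, 00

def decompose(P):
    """Return (p_pr, coeffs); coeffs maps LD bit tuples (a0, a1, b0, b1) to weights."""
    coeffs = {}
    for r, (s, t) in enumerate(SETTINGS):
        for c, (x, y) in enumerate(OUTCOMES):
            if (x ^ y) != (s & t):  # cell where the PR box has probability 0
                a, b = [0, 0], [0, 0]
                a[s], b[t] = x, y
                # Fix the remaining outputs so that the other rows match the PR box.
                b[1 - t] = a[s] ^ (s & (1 - t))
                a[1 - s] = b[t] ^ ((1 - s) & t)
                coeffs[(a[0], a[1], b[0], b[1])] = P[r][c]
    return 1.0 - sum(coeffs.values()), coeffs

def rebuild(p_pr, coeffs):
    """Recombine the PR box and LD terms into a distribution matrix."""
    Q = [[0.0] * 4 for _ in range(4)]
    for r, (s, t) in enumerate(SETTINGS):
        for c, (x, y) in enumerate(OUTCOMES):
            if (x ^ y) == (s & t):
                Q[r][c] += p_pr / 2
            for (a0, a1, b0, b1), w in coeffs.items():
                if ((a0, a1)[s], (b0, b1)[t]) == (x, y):
                    Q[r][c] += w
    return Q

DATA = [
    [0.0001422, 0.0000743, 0.0000699, 0.9997136],
    [0.0001530, 0.0000635, 0.0005249, 0.9992586],
    [0.0001476, 0.0004795, 0.0000644, 0.9993084],
    [0.0000024, 0.0006247, 0.0006755, 0.9986974],
]
p_pr, coeffs = decompose(DATA)
rebuilt = rebuild(p_pr, coeffs)
max_err = max(abs(rebuilt[r][c] - DATA[r][c]) for r in range(4) for c in range(4))
```

For the table above this gives a PR box weight of roughly 2.4e-5, and the rebuilt matrix agrees with the data to within the rounding of the published digits.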
Expression (16) and its counterpart, Table 2, offer a useful new way of visualizing Bell inequalities. For instance, consider the following Eberhard-type inequality [5, 23, 24], recently tested in loophole-free Bell experiments [22, 25]:
(17)  $P(++ \mid ab) - P(+0 \mid ab') - P(0+ \mid a'b) - P(++ \mid a'b') \le 0$
The left side of this Eberhard inequality is a linear combination of four of the entries in Table 2. Furthermore, the value of this linear combination is half the PR box coefficient, so estimators of the Eberhard quantity are in fact estimators of the amount of PR box (up to a scale factor of 2), which illustrates how positive values indicate nonlocality.
Expression (17) isolates the PR box weight by subtracting three table entries, each equal to a single LD coefficient, from an entry consisting of half the PR box weight plus those same three coefficients. We immediately see from Table 2 that there are eight different ways to do this, one for each of the table entries that consists of half the PR box weight plus three LD coefficients. Thus the Eberhard inequality is just one of a class of eight related inequalities associated with PR box 1 and the CHSH inequality. Figure 1 is a diagram that depicts these eight related inequalities. Each CHSH symmetry inequality will have its own version of Figure 1, and as there are 8 CHSH symmetries, this yields 64 total variant-Eberhard inequalities, but a given nonsignaling distribution can violate at most one CHSH symmetry.
For each triangle in Figure 1, one can generate a variant-Eberhard inequality by adding the unique cell of Table 2 containing all three of the LD coefficients in the triangle’s vertices, and subtracting the three cells of Table 2 where these coefficients appear individually. While these variant inequalities can also be generated by adding various linear combinations of the no-signaling conditions (7)-(14) to inequality (17), this does not reveal the way in which the variant-Eberhard inequalities all estimate the same parameter, the PR box weight. Note also that if one adds all the variant-Eberhard inequalities together, one obtains the following inequality:
The inequality above is equivalent to the CHSH inequality (1), which can be seen by adding 2 to both sides and using the fact that the four outcome probabilities sum to 1 for each of the four fixed setting choices. Thus the amount of violation of the CHSH inequality is a constant multiple of the amount of PR box present in the 1 PR + 8 LD representation (16) of a nonlocal nonsignaling distribution matrix. Indeed, one can add variant-Eberhard inequalities together with different positive weightings to obtain a variety of Bell inequalities, each a linear inequality over a subset of joint settings and outcomes.
This wide latitude in generating Bell inequalities can yield useful results. For instance, the left side of a variant-Eberhard inequality will take a value equal to half the PR box weight when applied to a nonsignaling distribution that violates the CHSH inequality, and the same will be true for any linear combination of left sides of variant-Eberhard inequalities so long as the sum of the coefficients in the linear combination is 1. The resulting expression can thus induce an unbiased linear estimator of this quantity by replacing the conditional probabilities with outcome tallies and accounting for the setting probabilities. So in an experimental situation where the setting probabilities are known, a minimum variance linear unbiased estimator of the “amount of PR box” can be generated by positing a best guess for the expected true distribution and then optimizing over linear combinations of variant-Eberhard inequalities using standard techniques. (Note this requires an assumption of independent, identically distributed experimental trials to be statistically valid.) This could yield smaller error bars for estimates of CHSH parameters in Bell experiments, especially for lossy experiments whose distributions lack symmetry or otherwise deviate substantially from optimal quantum statistics.
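The claim that all eight variant-Eberhard expressions estimate the same quantity can be checked exactly on a synthetic example. The sketch below is illustrative: the mixture weights are arbitrary made-up values, rows and columns follow the assumed orders ab, ab′, a′b, a′b′ and ++, +0, 0+, 00, and the helper names are hypothetical. For each cell in the support of the PR box, it subtracts the three cells that carry the individual coefficients of the saturating LD distributions passing through that cell.

```python
from fractions import Fraction as F
from itertools import product

SETTINGS = list(product((0, 1), repeat=2))
OUTCOMES = list(product((0, 1), repeat=2))

def in_support(r, c):
    """True if cell (r, c) lies in the support of the PR box x XOR y = s AND t."""
    (s, t), (x, y) = SETTINGS[r], OUTCOMES[c]
    return (x ^ y) == (s & t)

def cells(ld):
    """The (row, col) cell occupied by a deterministic strategy in each row."""
    a0, a1, b0, b1 = ld
    return [(r, OUTCOMES.index(((a0, a1)[s], (b0, b1)[t])))
            for r, (s, t) in enumerate(SETTINGS)]

# CHSH-saturating strategies agree with the PR box in exactly three rows.
SATURATING = [ld for ld in product((0, 1), repeat=4)
              if sum(in_support(r, c) for r, c in cells(ld)) == 3]

def mixture(p_pr, weights):
    """Build p_pr * (PR box) + sum of weights[i] * (i-th saturating LD)."""
    P = [[F(0)] * 4 for _ in range(4)]
    for r, c in product(range(4), repeat=2):
        if in_support(r, c):
            P[r][c] += p_pr / 2
    for ld, w in zip(SATURATING, weights):
        for r, c in cells(ld):
            P[r][c] += w
    return P

def variant_eberhard_values(P):
    values = []
    for r, c in product(range(4), repeat=2):
        if not in_support(r, c):
            continue
        total = P[r][c]
        for ld in SATURATING:
            if (r, c) in cells(ld):
                # Subtract the cell that holds this LD's coefficient alone.
                (mr, mc), = [rc for rc in cells(ld) if not in_support(*rc)]
                total -= P[mr][mc]
        values.append(total)
    return values

def chsh(P):
    """CHSH value: support cells count +1, all other cells count -1."""
    return sum(P[r][c] * (1 if in_support(r, c) else -1)
               for r, c in product(range(4), repeat=2))

P = mixture(F(1, 5), [F(k, 40) for k in (1, 2, 3, 4, 5, 6, 7, 4)])
values = variant_eberhard_values(P)
```

With a PR box weight of 1/5, every one of the eight values equals 1/10 and the CHSH value equals 2 + 2(1/5) = 12/5, in line with the relationship described above.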
3 Applications
The 1 PR + 8 LD representation of nonlocal nonsignaling distributions has useful applications, which we explore in this section.
3.1 A proof that detection efficiency must exceed 2/3 in order to witness nonlocality
Any real-world Bell test experiment will contain imperfections. In particular, an experimenter cannot always detect all of the particles being generated; some invariably evade detection. This can lead to the detection loophole: if enough particles are not detected, the observed empirical probabilities can deviate from the ideal quantum probabilities to such an extent that the observed probabilities do not violate any Bell inequalities.
The 1 PR + 8 LD representation can help us understand how this issue causes problems in a “one-channel” experiment [22, 23, 25, 26], which we briefly describe. In a single trial of a one-channel experiment, a particle flies through an obstacle which probabilistically either deflects the particle in an unmonitored direction, or allows the particle to pass where it then is (hopefully) registered by an awaiting particle detector. If the particle is deflected, the detector will not see anything and we assign this the outcome “0.” If the particle passes through and is detected, the detector registers a click, which we assign the outcome “+.” But if the imperfect detector fails to see a particle that is present, this will result in the outcome “0.” The quantum efficiency is the proportion of present particles that are actually detected by the detector.
Thus the effect of a missed detection is to convert a “+” count to a “0” count with probability 1 − η, where η denotes the detection efficiency. If the efficiency η is the same for both Alice and Bob, the effect of this transition on a PR box is shown in Figure 2: the PR box remains a PR box with probability η² (all particles detected), becomes an equal mixture of two LD distributions if only Bob’s detector fails (probability η(1 − η)), becomes an equal mixture of two other LD distributions if only Alice’s detector fails (probability η(1 − η)), and becomes a single LD distribution if both detectors fail (probability (1 − η)²).
With this understanding, we can now prove that a detection efficiency strictly exceeding 2/3 is necessary to witness nonlocality. Suppose that we start with an ideal experimental system described by a nonsignaling nonlocal distribution. The distribution matrix can then be expressed in the form (16). If η is less than 1, the observed distribution will be a transformation of (16). Focusing on the PR box term, the transformation replaces the PR box with a mixture of the PR box and LD distributions, as illustrated by Figure 2. Note that we have picked up some weight on LD distributions that do not saturate the CHSH inequality. As the transformation takes the local distributions in (16) to other local distributions (this can be easily checked), this means that the proportion of PR box post-transformation is at most η² times the original proportion. However, from the proof of Theorem 2.2, we recall that a non-saturating LD distribution, taken together with a suitable amount of the PR box, can be replaced by a mixture of saturating LD distributions. Since under the transformation, the PR box itself generates some non-saturating LD distributions, this will decrease the subsequent amount of PR box further. When η is small enough, the transformation will replace the entire coefficient of the PR box with coefficients of LD distributions; this occurs when η is at most 2/3. Thus we can analytically prove the bound of 2/3 for a general nonsignaling distribution. This complements a numerical proof of the same fact in [27], as well as related results in [5, 28, 29] where nondetection events are formulated as a third outcome.
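The threshold can be illustrated numerically for the PR box itself. The following is a sketch under assumptions, not the proof in the text: `eta` is a hypothetical name for the detection efficiency, outcome “+” is encoded as bit 0 and “0” as bit 1, and locality is tested by evaluating all eight CHSH symmetries (a nonsignaling distribution in this scenario is local exactly when none of them exceeds 2). Each party's “+” is independently converted to “0” with probability 1 − eta.

```python
from fractions import Fraction as F
from itertools import product

SETTINGS = list(product((0, 1), repeat=2))
OUTCOMES = list(product((0, 1), repeat=2))

# PR box supported on cells with x XOR y = s AND t.
PR1 = [[F(1, 2) if (x ^ y) == (s & t) else F(0) for x, y in OUTCOMES]
       for s, t in SETTINGS]

def lossy(P, eta):
    """Apply independent detection loss (efficiency eta) to both parties."""
    def flip(src, dst):
        if src == 1:                      # a "0" outcome always stays "0"
            return F(int(dst == 1))
        return eta if dst == 0 else 1 - eta
    return [[sum(P[r][c] * flip(x, u) * flip(y, v)
                 for c, (x, y) in enumerate(OUTCOMES))
             for u, v in OUTCOMES] for r in range(4)]

def chsh_values(P):
    """Values of the eight CHSH symmetries, one per PR box."""
    vals = []
    for al, be, ga in product((0, 1), repeat=3):
        vals.append(sum(
            P[r][c] * (1 if (x ^ y) == ((s & t) ^ (al & s) ^ (be & t) ^ ga) else -1)
            for r, (s, t) in enumerate(SETTINGS)
            for c, (x, y) in enumerate(OUTCOMES)))
    return vals

at_threshold = max(chsh_values(lossy(PR1, F(2, 3))))
above_threshold = max(chsh_values(lossy(PR1, F(3, 4))))
below_threshold = max(chsh_values(lossy(PR1, F(1, 2))))
```

For the PR box the violated CHSH expression evaluates to 2 − 4·eta + 6·eta², which exceeds 2 only when eta is greater than 2/3. Since the PR box is the maximally nonlocal starting point, this is consistent with the general bound argued above.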
We make one last point about the transformation. The transformation has an unequal effect on different distributions appearing in (16): some of these (, , ) are always taken to other that saturate the CHSH inequality, but others can be mapped to “bad” LD distributions – for instance, is mapped to with probability . Thus a distribution matrix with a fixed is “more hurt” by the transformation if it has more weight on LD distributions such as . So the search for partially entangled states that are noise tolerant, as introduced by Eberhard [5], is essentially the search for states whose representation (16) has less weight on LD distributions such as and more weight on the LD distributions , , and .
3.2 Calculating minimum-statistical-distance distributions
In (2,2,2) experiments and their applications, a fundamental concern is how to distinguish a given quantum distribution matrix from alternative local distributions. Thus we consider the problem of finding the local distribution that comes closest to approximating a given nonsignaling nonlocal distribution. Depending on which measure of statistical distance is being used, this can be a nontrivial calculation. Here we show that the 1 PR + 8 LD representation meaningfully simplifies this calculation for two important measures of statistical distance: the total variation distance and the Kullback-Leibler divergence.
The following lemma will be useful for this purpose. It states that if we draw a straight line connecting a local distribution to a CHSH-violating nonsignaling distribution, then at some point the straight line intersects the set of convex combinations of the eight local deterministic CHSH-saturating distributions. This result is useful because we expect that measures of statistical distance should decrease as we travel along lines that connect to the target distribution. So for any local distribution matrix that is not already a convex combination of the saturating distributions, there is always a “better” distribution matrix that is such a convex combination.
Lemma 3.1
Let be a nonsignaling distribution matrix that violates the CHSH inequality (1), and let be a local distribution. Then there is a for which the mixture distribution is a convex combination of the eight local deterministic distributions that saturate the CHSH inequality.
Proof. can be written in the form (16) so where the set enumerates the local deterministic distributions that saturate the CHSH inequality. can similarly be written in the form . Let and let . Then
and by Theorem 2.2 each bracketed term in the second sum above is equivalent to a linear combination of distributions whose indices are in the saturating set, so the above expression can be rewritten as a convex combination of the eight local deterministic distributions saturating the CHSH inequality.
If we start already on the CHSH-saturating face, then the required mixing weight is zero. Furthermore, it is clear that if we move any closer towards the nonlocal distribution past the CHSH-saturating face, we leave the set of local distributions.
Total Variation Distance. For two distribution matrices and , we define the total variation distance between them as follows:
(18) 
where is the probability that the distribution matrix assigns to outcome conditioned on setting , and similarly for . Note that this definition involves conditional probabilities, and thus is different from the usual definition of total variation distance. According to the usual definition, (18) is really a sum of four total variation distances, one for each of the setting configurations. (If we assume that all setting configurations are equiprobable, the conditional could be dropped, and the two definitions of total variation distance would align up to a scale factor.)
We consider now the problem of finding the local distribution that is of minimum total variation distance from a given nonlocal nonsignaling distribution matrix. The following result gives a solution and shows that the distance to the closest local distribution is equal to the proportion of PR box in the 1 PR + 8 LD representation of the nonlocal nonsignaling distribution matrix (and thus also equal to a constant multiple of the CHSH violation).
Theorem 3.1
Theorem 3.1 has intuitive appeal: starting with the nonsignaling nonlocal distribution given by (16), one can reach the closest local distribution (with respect to total variation distance) by throwing away the component and replacing it with an even mixture of the eight local distributions, which is then added to the original weight on these eight distributions. (Note that this is not in general the same as throwing away the coefficient and then renormalizing the local coefficients according to the formula .) Interestingly, the distribution in the statement of Theorem 3.1 is not unique in minimizing the quantity (18) – for instance, the distribution
will also achieve the minimum – the expression given in (19) is somewhat canonical in that it allocates the divergence from the quantum distribution matrix equally between all four measurement settings.
Prior to knowing Theorem 3.1, the problem of identifying a distribution that minimizes total variation distance from a nonsignaling nonlocal distribution would be a linear-programming-type problem likely requiring the use of a computer. With the result of Theorem 3.1, the calculation is immediate: from the conditional probabilities in the distribution matrix, one can immediately find the coefficients by referring to Table 2, from which the coefficients in (19) can be obtained directly.
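The replacement described after Theorem 3.1 can be sketched in a few lines of Python. The snippet builds a distribution as one PR box plus its eight affiliated LD distributions, swaps the PR component for an even mixture of those eight, and checks that the resulting distance equals the PR weight. It assumes the half-sum form of (18), takes "affiliated" to mean the LD distributions that match the PR box on three of the four rows, and uses illustrative weights; all function and variable names are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

SETTINGS = [(0, 0), (0, 1), (1, 0), (1, 1)]   # assumed row order
COLS = ["++", "+0", "0+", "00"]               # column order of an outcome table

def pr_box():
    """One PR box: correlated rows except at setting (1, 1) (a choice made here)."""
    corr = [F(1, 2), F(0), F(0), F(1, 2)]
    anti = [F(0), F(1, 2), F(1, 2), F(0)]
    return [anti if s == (1, 1) else corr for s in SETTINGS]

def ld_table(alice, bob):
    """Outcome table of the LD distribution with outcomes alice[x] and bob[y]."""
    return [[F(int(col == alice[x] + bob[y])) for col in COLS] for x, y in SETTINGS]

def affiliated_lds():
    """The LD distributions whose row type matches pr_box() on three of four rows."""
    found = []
    for a0, a1, b0, b1 in product("+0", repeat=4):
        alice, bob = (a0, a1), (b0, b1)
        agree = sum((alice[x] == bob[y]) != ((x, y) == (1, 1)) for x, y in SETTINGS)
        if agree == 3:
            found.append(ld_table(alice, bob))
    return found

def mix(weights, tables):
    """Entrywise weighted sum of outcome tables."""
    return [[sum(w * t[r][c] for w, t in zip(weights, tables)) for c in range(4)]
            for r in range(4)]

def tv_distance(P, Q):
    """Assumed form of (18): half the summed absolute differences."""
    return sum(abs(p - q) for rp, rq in zip(P, Q) for p, q in zip(rp, rq)) / 2

lds = affiliated_lds()
p = F(1, 5)                                  # illustrative PR weight
q = [F(k, 45) for k in range(1, 9)]          # illustrative LD weights, summing to 4/5
nonlocal_dist = mix([p] + q, [pr_box()] + lds)
local_dist = mix([qi + p / 8 for qi in q], lds)
distance = tv_distance(nonlocal_dist, local_dist)
```

The sketch only confirms that this particular local distribution lies at distance equal to the PR weight; that no local distribution is closer is the content of Theorem 3.1 and is not tested here.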
Kullback-Leibler Divergence. If one is conducting a hypothesis test against local realism according to the prediction-based-ratio (PBR) protocol of [16, 17], calculations involving the Kullback-Leibler divergence are of central importance. Specifically, to define a test statistic that will be optimal for a given nonsignaling nonlocal distribution matrix, one must compute the closest local distribution with respect to the Kullback-Leibler divergence, which is in general not a distribution that minimizes the total variation distance. It may not always be possible to find the closest local distribution by analytical methods, in which case a numerical optimization can be performed following the procedures outlined in [16]. This optimization problem can be simplified by using the results of this paper.
For two discrete probability distributions and over the same set , the Kullback-Leibler divergence from to is defined as
(20) 
where and denote the probability of outcome according to distribution and , respectively. We cannot directly evaluate (20) for pairs of distribution matrices in the setting, because such a distribution matrix actually consists of four separate conditional probability distributions, one for each setting configuration. Thus to compute expression (20), one multiplies by the known settings probabilities. So for a distribution matrix , let us define the induced probability distribution over 16 outcomes according to the rule
Note that if we have two distribution matrices and , and a third distribution is a convex combination of the two such that for some , then equivalently .
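A small Python sketch of the induced distribution: each settings-conditional row is multiplied by the probability of its setting configuration and the rows are concatenated into a single distribution over 16 outcomes. The function names, the example settings probabilities, and the two example matrices are assumptions made for illustration. The final lines check the remark above, that mixing distribution matrices and then inducing gives the same result as inducing and then mixing.

```python
from fractions import Fraction as F

def induced(P, setting_probs):
    """Flatten a 4x4 distribution matrix into a distribution over 16 outcomes."""
    return [s * p for s, row in zip(setting_probs, P) for p in row]

def mix_matrices(lam, P, Q):
    """Entrywise convex combination lam * P + (1 - lam) * Q of two matrices."""
    return [[lam * p + (1 - lam) * q for p, q in zip(rp, rq)]
            for rp, rq in zip(P, Q)]

CORR = [F(1, 2), F(0), F(0), F(1, 2)]
ANTI = [F(0), F(1, 2), F(1, 2), F(0)]
P = [CORR, CORR, CORR, ANTI]                 # a PR box
Q = [[F(1, 4)] * 4 for _ in range(4)]        # the uniform (local) distribution

# Hypothetical, deliberately non-uniform settings probabilities.
probs = [F(1, 2), F(1, 4), F(1, 8), F(1, 8)]
lam = F(1, 3)

mixed_then_induced = induced(mix_matrices(lam, P, Q), probs)
induced_then_mixed = [lam * a + (1 - lam) * b
                      for a, b in zip(induced(P, probs), induced(Q, probs))]
```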
Now consider a nonsignaling nonlocal distribution matrix violating the CHSH inequality. We assert that a local distribution of minimum possible Kullback-Leibler divergence from to must be expressible as a convex combination of only the eight local deterministic distributions that saturate the CHSH inequality.
This is a consequence of Lemma 3.1; we just need to show that for , is strictly less than for . This can be derived as a consequence of the well-known fact that is convex in its second argument:
Thus if is not already a convex combination of the LD distributions that saturate (1), Lemma 3.1 tells us there is a choice of for which is such a convex combination and . So any local distribution of minimum Kullback-Leibler divergence from is necessarily a convex combination of the eight LD distributions that saturate the CHSH inequality. The entire space of local distributions consists of convex combinations of all sixteen LD distributions, so this observation halves the number of vertices over which the search must range (from sixteen to eight), thereby simplifying the optimization problem of finding the local distribution of minimum Kullback-Leibler divergence from .
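The convexity step can be illustrated numerically. The Python sketch below assumes the standard form of the Kullback-Leibler divergence (the sum of p log(p/q) over outcomes, with p the nonlocal distribution in the first argument and the local candidate in the second) and equiprobable settings. It checks that moving a local distribution part of the way toward the nonlocal one strictly lowers the divergence. The example matrices, the mixing weight, and all names are illustrative assumptions.

```python
from math import log, inf

def induced(P, setting_probs):
    """Flatten a distribution matrix into a single distribution over 16 outcomes."""
    return [s * p for s, row in zip(setting_probs, P) for p in row]

def kl(p, q):
    """Sum of p * log(p / q) over outcomes; infinite if q misses part of p's support."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return inf
            total += pi * log(pi / qi)
    return total

CORR = [0.5, 0.0, 0.0, 0.5]
ANTI = [0.0, 0.5, 0.5, 0.0]
P = [CORR, CORR, CORR, ANTI]                  # nonlocal: a PR box
Q = [[0.25] * 4 for _ in range(4)]            # local: uniform distribution
probs = [0.25] * 4                            # equiprobable settings (assumption)

lam = 0.5                                     # illustrative mixing weight
M = [[lam * p + (1 - lam) * q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]

kl_far = kl(induced(P, probs), induced(Q, probs))
kl_mixed = kl(induced(P, probs), induced(M, probs))
# Convexity in the second argument gives kl_mixed <= (1 - lam) * kl_far < kl_far.
```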
4 Extension to the Chained Bell Setting
Having studied the (2,2,2) scenario extensively, we naturally ask whether these results generalize to other Bell scenarios. In this section, we study one particular generalization, the “chained Bell” scenario of two parties, $n$ settings, and two outcomes per setting. We will find that analogs of Theorems 2.1 and 2.2 do indeed hold for this scenario, and will demonstrate an application.
4.1 The Chained Bell Polytope
To state the chained-Bell analogs of our previous theorems, it is necessary to enumerate the extremal points of the chained Bell polytope. To formulate the problem, recall that while the $n$-fold chained Bell scenario involves $n$ settings per party, it does not consider all possible setting configurations, as the situation studied in [30] does. Rather, the chained Bell scenario considers only a subset of all setting configurations of cardinality $2n$. Thus a distribution matrix will consist of $8n$ entries, an element of $\mathbb{R}^{8n}$. We can organize distribution matrices into outcome tables as follows:
++  +0  0+  00  
To be a valid nonsignaling distribution matrix for a fixed $n$, an element of dimension $8n$ must satisfy the probability constraints (all entries nonnegative and each row sums to one) and the following linearly independent collection of no-signaling conditions:
(21) 
We refer to the subset of $\mathbb{R}^{8n}$ satisfying these conditions as the $n$-fold chained Bell polytope. This polytope can be thought of as a projection of the larger polytope studied in [30] that considers all possible measurement configurations, but for our current purposes this viewpoint is not essential. Any quantum-induced distribution matrix for the $n$-fold chained Bell scenario will obey the no-signaling conditions (21) and thus will be an element of the $n$-fold chained Bell polytope.
Based on an understanding of the (2,2,2) polytope, one would reasonably expect the extremal points for the $n$-fold chained Bell polytope to include analogs of the PR boxes as well as local deterministic distributions. In preparation for our classification of the extremal points, we define a generalized PR box to be an $n$-fold chained Bell distribution matrix consisting of an odd number of rows of the form $(1/2, 0, 0, 1/2)$, an odd number of rows of the form $(0, 1/2, 1/2, 0)$, and no other types of rows. Two examples of generalized PR boxes for $n = 3$ are as follows:
++  +0  0+  00  

1/2  1/2  
1/2  1/2  
1/2  1/2  
1/2  1/2  
1/2  1/2  
1/2  1/2 
++  +0  0+  00  

1/2  1/2  
1/2  1/2  
1/2  1/2  
1/2  1/2  
1/2  1/2  
1/2  1/2 
In the $n$-fold chained Bell setting, there are $2^{2n-1}$ generalized PR boxes, as there are $2n-1$ independent rows that can each be assigned one of two different formats; after these choices are made, the format of the last row is fixed by the requirement that the number of rows of each type be odd.
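The counting argument can be checked by brute force. The Python sketch below writes n for the number of settings per party (so an outcome table has 2n rows), labels each row "C" (correlated) or "A" (anticorrelated), and keeps the patterns with an odd number of each. The function names and the float encoding of the rows are illustrative choices, not notation from the text.

```python
from itertools import product

ROW = {"C": (0.5, 0.0, 0.0, 0.5),   # correlated row over outcomes ++, +0, 0+, 00
       "A": (0.0, 0.5, 0.5, 0.0)}   # anticorrelated row

def generalized_pr_boxes(n):
    """Row-type patterns of length 2n with an odd number of 'C' and of 'A' rows."""
    return [pattern for pattern in product("CA", repeat=2 * n)
            if pattern.count("C") % 2 == 1 and pattern.count("A") % 2 == 1]

def outcome_table(pattern):
    """Expand a row-type pattern into the rows of an outcome table."""
    return [ROW[t] for t in pattern]

# Number of generalized PR boxes for a few small n.
counts = {n: len(generalized_pr_boxes(n)) for n in (2, 3, 4)}
```

For n = 2 the enumeration returns eight patterns, matching the eight PR boxes of the two-setting scenario.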
As for the local deterministic distributions, it is no longer practical to list these explicitly as we did for the special case $n = 2$. Instead, we enumerate them as the set of mappings that assign each of the $2n$ measurement settings to either 0 or +. For example, here is one possible table of assignments for $n = 4$:
+  0  0  +  0  0  +  + 
In the above assignment table, we adopt a useful convention of separating all columns with a vertical line and adding an additional vertical line at the end. This way, each vertical line corresponds to a row of an outcome table: a vertical line between and corresponds to the row, while the last vertical line corresponds to the row. This will allow us to easily move from an assignment table like the one above to the corresponding outcome table of a distribution matrix – for instance, as is mapped to + and is mapped to 0, then the outcome row will have a “1” in the +0 cell and a zero in the other three cells, etc., and so the distribution matrix for the above assignment table is as follows:
++  +0  0+  00  

1  
1  
1  
1  
1  
1  
1  
1 
We define a local deterministic distribution to be a distribution matrix that can be generated from an assignment table like the one in Table 3. There are $2^{2n}$ distinct ways to fill an assignment table with 0 and +, and thus $2^{2n}$ LD distributions in the $n$-fold chained Bell scenario.
The following theorem classifies the extremal points of the $n$-fold chained Bell polytope. It is proved in Appendix C.
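The passage from an assignment table to an outcome table can be sketched as follows. The sketch assumes that the columns of the assignment table alternate between the first party's and the second party's settings, that each row of the outcome table pairs two adjacent columns (with the last column wrapping around to the first), and that the first party's outcome is written first in the labels ++, +0, 0+, 00. These conventions and the function name are assumptions made for illustration.

```python
from itertools import product

COLS = ["++", "+0", "0+", "00"]

def outcome_table(assignment):
    """Outcome table of the LD distribution given by a string of '+' and '0'.

    assignment has even length 2n; even positions are assumed to hold the first
    party's settings and odd positions the second party's settings.
    """
    m = len(assignment)
    table = []
    for k in range(m):
        left, right = assignment[k], assignment[(k + 1) % m]
        # The first party's entry is whichever of the pair sits at an even position.
        first, second = (left, right) if k % 2 == 0 else (right, left)
        table.append([int(col == first + second) for col in COLS])
    return table

# The example assignment shown above: + 0 0 + 0 0 + +
example = outcome_table("+00+00++")

# Every one of the 2**8 assignments for eight columns gives a different table.
distinct = {tuple(map(tuple, outcome_table(a))) for a in product("+0", repeat=8)}
```

Because consecutive rows share a column, the table determines the assignment, which is why the number of distinct LD distributions equals the number of ways to fill the assignment table.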
Theorem 4.1
The set of extremal points of the nonsignaling polytope corresponding to the $n$-fold chained Bell scenario consists of the local deterministic distributions and the “generalized PR boxes.”
4.2 Decomposition Theorems for the $n$-Fold Chained Bell Polytope
Having enumerated the vertices of the $n$-fold chained Bell polytope, we can now proceed to state and prove generalized versions of Theorems 2.1 and 2.2.
Theorem 4.2
In the $n$-fold chained Bell scenario, an equal mixture of any two (nonidentical) generalized PR boxes is equivalent to a mixture of local deterministic distributions.
Proof. Consider two nonidentical generalized PR boxes called “PR A” and “PR B.” Their equal mixture will yield a distribution matrix for which each row of the outcome table takes one of the following three forms: $(1/2, 0, 0, 1/2)$, $(0, 1/2, 1/2, 0)$, and $(1/4, 1/4, 1/4, 1/4)$, where these vectors represent settings-conditional assignments of probabilities to the outcomes $(++, +0, 0+, 00)$. We call these three row types correlated, anticorrelated, and uniform, respectively; a row of the mixture is uniform exactly when PR A and PR B have different types in that row. We claim that the number of uniform rows is nonzero and even. First, note that to have zero uniform rows would require that PR A and PR B be identical, a possibility that we are excluding. To see that the number of uniform rows is even, recall that PR A and PR B each have an odd number of correlated rows and an odd number of anticorrelated rows. Now, out of the complete set of outcome rows, consider the collection of rows where PR A is anticorrelated and PR B is correlated. If the number of such rows is odd (even), then the number of rows where A is anticorrelated and B is anticorrelated must be even (odd), and then the number of rows where A is correlated and B is anticorrelated must be odd (even). The total number of rows where A and B are inequivalent (either A anticorrelated/B correlated or A correlated/B anticorrelated) is therefore the sum of two odd numbers or the sum of two even numbers, and hence even. Thus the equal mixture has a positive even number of uniform rows.
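The parity claim can be confirmed exhaustively for a small case. The Python sketch below enumerates all generalized PR boxes with six rows (three settings per party), encodes each as a pattern of "C" (correlated) and "A" (anticorrelated) rows, and records the number of rows on which each pair of distinct boxes disagrees, which is the number of uniform rows in their equal mixture. The encoding and the names are illustrative assumptions.

```python
from itertools import combinations, product

def generalized_pr_boxes(n):
    """Row-type patterns of length 2n with an odd number of 'C' and of 'A' rows."""
    return [pattern for pattern in product("CA", repeat=2 * n)
            if pattern.count("C") % 2 == 1 and pattern.count("A") % 2 == 1]

def uniform_rows(box_a, box_b):
    """Rows where the two boxes differ in type; these become uniform in the mixture."""
    return sum(a != b for a, b in zip(box_a, box_b))

boxes = generalized_pr_boxes(3)
# Set of all uniform-row counts observed over pairs of distinct boxes.
observed = {uniform_rows(a, b) for a, b in combinations(boxes, 2)}
```

Every observed count is positive and even, in agreement with the argument in the proof.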
Now we will show how to select four LD distributions so that their mixture is equivalent to the equal mixture of PR A and PR B. To illustrate how this can work, first consider the case where PR A and PR B are perfectly misaligned so that their mixture consists entirely of uniform rows. Then the following suite of four LD distributions can replicate this distribution matrix:
1:  +  +  +  +  +  +  +  +  +  +  +  +  +  +  +  +  
2:  +  0  +  0  +  0  +  0  +  0  +  0  +  0  +  0  
3:  0  +  0  +  0  +  0  +  0  +  0  +  0  +  0  +  
4:  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 
To see why, notice that each vertical line in the assignment table above is always straddled by ++ once, +0 once, 0+ once, and 00 once. Thus putting weight $1/4$ on each of the four LD distributions generates the distribution $(1/4, 1/4, 1/4, 1/4)$ in each row of the outcome table.
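This observation can be checked mechanically. The Python sketch below rebuilds the four assignment rows shown above for an arbitrary number of columns, converts each to an outcome table, and averages them with weight 1/4. It assumes that the columns alternate between the two parties' settings, that each outcome row pairs adjacent columns (the last wrapping to the first), and that the first party's outcome is written first; these conventions and all names are assumptions of the sketch.

```python
from fractions import Fraction as F

COLS = ["++", "+0", "0+", "00"]

def outcome_table(assignment):
    """Outcome table for an assignment string of '+' and '0' of even length."""
    m = len(assignment)
    table = []
    for k in range(m):
        left, right = assignment[k], assignment[(k + 1) % m]
        # The first party's entry is assumed to sit at the even position of the pair.
        first, second = (left, right) if k % 2 == 0 else (right, left)
        table.append([F(int(col == first + second)) for col in COLS])
    return table

def four_ld_suite(n):
    """The four assignments above: all '+', alternating '+0', alternating '0+', all '0'."""
    return ["+" * (2 * n), "+0" * n, "0+" * n, "0" * (2 * n)]

def equal_mixture(tables):
    """Entrywise average of outcome tables with equal weights."""
    w = F(1, len(tables))
    return [[sum(w * t[r][c] for t in tables) for c in range(4)]
            for r in range(len(tables[0]))]

# The table above has sixteen columns, that is, eight settings per party.
mixture = equal_mixture([outcome_table(a) for a in four_ld_suite(8)])
```

Under these conventions the result does not depend on which party's outcome is listed first, since the two alternating assignments contribute +0 and 0+ once each to every row either way.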
Now consider the general case, where we have noted that the two PR boxes will be misaligned on a nonzero even number of rows. We describe an algorithm to generate four LD distributions whose mixture replicates the mixture of the two PR boxes. First, start with an empty assignment table like the one above, with the vertical lines labeled to indicate the nature of the corresponding row of the PR box mixture – correlated, anticorrelated, or uniform. For example, this would look something like this:
1:  
2:  
3:  
4: 
Now fill the table according to the following procedure:

Locate the leftmost line marked “uniform” and enter in descending order in the column to the right of this line.

Moving to the right, fill in the next blank column according to the following set of rules:
If the last filled column contained… … and the line label to the left of the blank column is… …then fill the blank column as follows: