# Sample Amplification: Increasing Dataset Size even when Learning is Impossible

Brian Axelrod        Shivam Garg        Vatsal Sharan        Gregory Valiant
{baxelrod, shivamgarg, vsharan, valiant}@stanford.edu
Stanford University
This work was supported by NSF awards AF-1813049 and CCF-1704417, an ONR Young Investigator Award N00014-18-1-2295, and an NSF Graduate Fellowship.
###### Abstract

Given data drawn from an unknown distribution, $D$, to what extent is it possible to “amplify” this dataset and faithfully output an even larger set of samples that appear to have been drawn from $D$? We formalize this question as follows: an $(n,m)$ amplification procedure takes as input $n$ independent draws from an unknown distribution $D$, and outputs a set of $m > n$ “samples”. An amplification procedure is valid if no algorithm can distinguish the set of $m$ samples produced by the amplifier from a set of $m$ independent draws from $D$, with probability greater than $2/3$. Perhaps surprisingly, in many settings, a valid amplification procedure exists, even in the regime where the size of the input dataset, $n$, is significantly less than what would be necessary to learn distribution $D$ to non-trivial accuracy. Specifically we consider two fundamental settings: the case where $D$ is an arbitrary discrete distribution supported on at most $k$ elements, and the case where $D$ is a $d$-dimensional Gaussian with unknown mean and fixed covariance matrix. In the first case, we show that an $(n, n + \Theta(n/\sqrt{k}))$ amplifier exists. In particular, given $n = O(\sqrt{k})$ samples from $D$, one can output a set of $m = n+1$ datapoints, whose total variation distance from the distribution of $m$ i.i.d. draws from $D$ is a small constant, despite the fact that one would need quadratically more data, $n = \Theta(k)$, to learn $D$ up to small constant total variation distance. In the Gaussian case, we show that an $(n, n + \Theta(n/\sqrt{d}))$ amplifier exists, even though learning the distribution to small constant total variation distance requires $\Theta(d)$ samples. In both the discrete and Gaussian settings, we show that these results are tight, to constant factors. Beyond these results, we formalize a number of curious directions for future research along this vein.

## 1 Learning, Testing, and Sample Amplification

How much do you need to know about a distribution, $D$, in order to produce a dataset of size $m$ that is indistinguishable from a set of independent draws from $D$? Do you need to learn $D$ to nontrivial accuracy in some natural metric, or does it suffice to have access to a smaller dataset of size $n < m$ drawn from $D$, and then “amplify” this dataset to create one of size $m$? In this work we formalize this question, and show that for two natural classes of distributions, discrete distributions with bounded support and $d$-dimensional Gaussians, non-trivial data “amplification” is possible even in the regime in which you are given too few samples to learn.

From a theoretical perspective, this question is related to the meta-question underlying work on distributional property testing and estimation: To answer basic hypothesis testing or property estimation questions regarding a distribution $D$, to what extent must one first learn $D$, and can such questions be reliably answered given a relatively modest amount of data drawn from $D$? Much of the excitement surrounding distributional property testing and estimation stems from the fact that, for many such testing and estimation questions, a surprisingly small set of samples from $D$ suffices, significantly fewer samples than would be required to learn $D$. These surprising answers have been revealed over the past two decades of work on property testing and estimation. The question posed in our work fits with this body of work, though instead of asking how much data is required to perform a hypothesis test, we are asking how much data is required to fool an optimal hypothesis test: in this case an “identity tester” which knows $D$ and is trying to distinguish a set of $m$ independent samples drawn from $D$ from $m$ datapoints constructed in some other fashion.

From a more practical perspective, the question we consider also seems timely. Deep neural network based systems, trained on a set of samples, can be designed to perform many tasks, including testing whether a given input was drawn from a distribution in question (i.e. “discrimination”), as well as sampling (often via the popular Generative Adversarial Network approach). There are many relevant questions regarding the extent to which current systems are successful in accomplishing these tasks, and the question of how to quantify the performance of these systems is still largely open. In this work, however, we ask a different question: Suppose a system can accomplish such a task; what would that actually mean? If a system can produce a dataset that is indistinguishable from a set of independent draws from a distribution, $D$, does that mean the system knows $D$, or are there other ways of accomplishing this task?

### 1.1 Formal Problem Definition

We begin by formally stating two essentially equivalent definitions of sample amplification and then provide an illustrative example. Our first definition states that a function $f$ mapping a set of $n$ datapoints to a set of $m$ datapoints is a valid amplification procedure for a class of distributions $C$ if, for all $D \in C$, letting $X_n$ denote the random variable corresponding to $n$ independent draws from $D$, the distribution of $f(X_n)$ has small total variation distance to the distribution defined by $m$ independent draws from $D$. (We overload the notation $D_{TV}$ for total variation distance, and also use it when the argument is a random variable instead of the distribution of the random variable, whenever convenient.)

###### Definition 1.

A class $C$ of distributions over domain $S$ admits an $(n,m)$ amplification procedure if there exists a (possibly randomized) function $f_{C,n,m}: S^n \to S^m$, mapping a dataset of size $n$ to a dataset of size $m$, such that for every distribution $D \in C$,

$$D_{TV}\big(f_{C,n,m}(X_n),\, D^m\big) \leq 1/3,$$

where $X_n$ is the random variable denoting $n$ independent draws from $D$, and $D^m$ denotes the distribution of $m$ independent draws from $D$. If no such function exists, we say that $C$ does not admit an $(n,m)$ amplification scheme.

Crucially, in the above definition we are considering the random variable $f_{C,n,m}(X_n)$, whose randomness comes from the randomness of $X_n$ as well as any randomness in the function $f_{C,n,m}$ itself. For example, every class of distributions admits an $(n,n)$ amplification procedure, corresponding to taking the function $f_{C,n,n}$ to be the identity function. If, instead, our definition had required that the conditional distribution of $f_{C,n,m}(X_n)$ given $X_n$ be close to $D^m$, then the above definition would simply correspond to asking how well we can learn $D$, given the $n$ samples denoted by $X_n$.

Definition 1 is also equivalent, up to the choice of constant in the bound on total variation distance, to the following intuitive formulation of sample amplification as a game between two parties: the amplifier, who will produce a dataset of size $m$, and a “verifier”, who knows $D$ and will either accept or reject that dataset. The verifier’s protocol, however, must satisfy the condition that given $m$ independent draws from the true distribution in question, the verifier must accept with probability at least $3/4$, where the probability is with respect to both the randomness of the set of samples and any internal randomness of the verifier. We briefly describe this formulation, as a number of natural directions for future work (such as if the verifier is computationally bounded, or only has sample access to $D$) are easier to articulate in this setting.

###### Definition 2.

The sample amplification game consists of two parties, an amplifier corresponding to a function $f_{C,n,m}: S^n \to S^m$ which maps a set of $n$ datapoints in domain $S$ to a set of $m$ datapoints, and a verifier corresponding to a function $v: S^m \to \{\mathrm{ACCEPT}, \mathrm{REJECT}\}$. We say that a verifier $v$ is valid for distribution $D$ if, when given as input a set of $m$ independent draws from $D$, the verifier accepts with probability at least $3/4$, where the probability is over both the randomness of the draws and any internal randomness of $v$:

$$\Pr_{X_m \leftarrow D^m}\big[v(X_m) = \mathrm{ACCEPT}\big] \geq 3/4.$$

A class $C$ of distributions over domain $S$ admits an $(n,m)$ amplification procedure if, and only if, there is an amplifier function $f_{C,n,m}$ that, for every $D \in C$, can “win” the game with probability at least $2/3$; namely, such that for every $D \in C$ and valid verifier $v_D$ for $D$,

$$\Pr_{X_n \leftarrow D^n}\big[v_D(f_{C,n,m}(X_n)) = \mathrm{ACCEPT}\big] \geq 2/3,$$

where the probability is with respect to the randomness of the choice of the $n$ samples, $X_n$, and any internal randomness in the amplifier and verifier, $f_{C,n,m}$ and $v_D$.

As was the case in Definition 1, in the above definition it is essential that the verifier only observes the output produced by the amplifier. If the verifier sees the original data, $X_n$, in addition to the amplified samples, then the above definition also becomes equivalent to asking how well the class of distributions in question can be learned given $n$ samples.

###### Example 1.

Consider the class of distributions $C$ corresponding to i.i.d. flips of a coin with unknown bias $p$. We claim that there are constants $0 < c < c'$ such that $(n, n + cn)$ sample amplification is possible, but $(n, n + c'n)$ amplification is not possible. To see this, consider the amplification strategy corresponding to returning a random permutation of the $n$ original samples together with $cn$ additional tosses of a coin with bias $\hat{p}$, where $\hat{p}$ is the empirical bias of the $n$ original samples. Because of the random permutation, the total variation distance between these samples and $n + cn$ i.i.d. tosses of the $p$-biased coin is a function of only the distribution of the total number of heads. Hence this is equivalent to the distance between $\mathrm{Binomial}(n + cn, p)$ and the distribution corresponding to first drawing $h \leftarrow \mathrm{Binomial}(n, p)$, and then returning $h + \mathrm{Binomial}(cn, h/n)$. It is not hard to show that the total variation distance between these two can be bounded by any small constant by taking $c$ to be a sufficiently small constant. Intuitively, this is because both distributions have the same mean, they are both unimodal, and they have variances that differ by a small constant factor for small constant $c$. For the lower bound, to see that amplification by more than a constant factor is impossible, note that if it were possible, then one could learn $p$ to error $o(1/\sqrt{n})$, with small constant probability of failure, by first amplifying the original samples and then returning the empirical estimate of $p$ based on the amplified samples.
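The reduction to head counts in Example 1 can be checked numerically. The following is a minimal sketch, not part of the original analysis: it computes the exact total variation distance between the head count of truly i.i.d. tosses and the head count produced by the permute-and-append strategy. The function names and the parameter `extra` (the number of appended tosses) are illustrative assumptions.

```python
from math import comb


def binom_pmf(trials, p, k):
    """Probability of exactly k heads in `trials` tosses of a p-biased coin."""
    if k < 0 or k > trials:
        return 0.0
    return comb(trials, k) * p**k * (1.0 - p) ** (trials - k)


def amplified_heads_pmf(n, extra, p):
    """Head-count distribution of the amplified set.

    First draw h ~ Binomial(n, p) (the original samples), then add the heads
    among `extra` tosses of a coin with the empirical bias h / n.
    """
    pmf = [0.0] * (n + extra + 1)
    for h in range(n + 1):
        weight = binom_pmf(n, p, h)
        p_hat = h / n
        for j in range(extra + 1):
            pmf[h + j] += weight * binom_pmf(extra, p_hat, j)
    return pmf


def tv_distance(n, extra, p):
    """Exact TV distance between amplified and true head-count distributions."""
    amplified = amplified_heads_pmf(n, extra, p)
    true = [binom_pmf(n + extra, p, t) for t in range(n + extra + 1)]
    return 0.5 * sum(abs(a - b) for a, b in zip(amplified, true))
```

Calling `tv_distance(50, 1, 0.3)` and `tv_distance(50, 25, 0.3)` illustrates the qualitative claim: appending a single toss perturbs the head-count distribution far less than appending a constant fraction of new tosses.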

In the setting, this constant factor amplification is not surprising, since the amplifier can learn the distribution to non-trivial accuracy. It is worth observing, however, that the above amplification scheme corresponding to a amplifier will return a set of samples, whose total variation distance from i.i.d. samples is only ; this is despite the fact that the amplifier can only learn the distribution to total variation distance

### 1.2 Summary of Results

Our main results provide tight bounds on the extent to which sample amplification is possible for two fundamental settings: unstructured discrete distributions, and $d$-dimensional Gaussians with unknown mean and fixed covariance. Our first result is for discrete distributions with support size at most $k$. In this case, we show that sample amplification is possible given only $O(\sqrt{k})$ samples from the distribution, and tightly characterize the extent to which amplification is possible. Note that learning the distribution to small total variation distance requires $\Theta(k)$ samples in this case.

###### Theorem 1.

Let $C$ denote the class of discrete distributions with support size at most $k$. For sufficiently large $k$ and $m = n + O(n/\sqrt{k})$, $C$ admits an $(n,m)$ amplification procedure.

This bound is tight up to constants, i.e., there is a constant $c$ such that for every sufficiently large $k$, $C$ does not admit an $(n, n + cn/\sqrt{k})$ amplification procedure.

Our amplification procedure for discrete distributions is extremely simple: roughly, we generate additional samples from the empirical distribution of the initial set of samples, and then randomly shuffle together the original and the new samples. For technical reasons, we do not exactly sample from the empirical distribution but from a suitable modification which facilitates the analysis.
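As a rough illustration of the simple scheme just described, the sketch below resamples from the empirical distribution and shuffles. It implements only the naive sample-from-empirical-then-shuffle idea, not the modified procedure that the analysis actually uses; the function name and parameters are hypothetical.

```python
import random


def amplify_by_resampling(samples, extra, rng=None):
    """Return the original samples plus `extra` draws from their empirical
    distribution, in uniformly random order.

    Naive sketch only: the analyzed procedure samples from a modified
    version of the empirical distribution.
    """
    rng = rng or random.Random()
    # Drawing a uniformly random original sample is exactly a draw from
    # the empirical distribution of the input.
    new_samples = [rng.choice(samples) for _ in range(extra)]
    output = list(samples) + new_samples
    # Shuffling hides which positions hold original versus new samples.
    rng.shuffle(output)
    return output
```

By construction the output never contains an element outside the observed support, which is why a verifier can only hope to catch this amplifier through the pattern of repeated elements.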

Our second result concerns $d$-dimensional Gaussian distributions with unknown mean and fixed covariance. We show that we can amplify even with only $O(\sqrt{d})$ samples from the distribution. In contrast, learning the distribution to small constant total variation distance requires $\Theta(d)$ samples. As in the discrete case, we also tightly characterize the extent to which amplification is possible in this setting.

###### Theorem 2.

Let $C$ denote the class of $d$-dimensional Gaussian distributions $N(\mu, \Sigma)$ with unknown mean $\mu$ and fixed covariance $\Sigma$. For all $d, n > 0$ and $m = n + O(n/\sqrt{d})$, $C$ admits an $(n,m)$ amplification procedure.

This bound is tight up to constants, i.e., there is a fixed constant $c$ such that for all $d, n > 0$, $C$ does not admit an $(n,m)$ amplification procedure for $m \geq n + cn/\sqrt{d}$.

The amplification algorithm in the Gaussian case first computes the empirical mean $\hat{\mu}$ of the original set $X_n$, and then draws $m - n$ new samples from $N(\hat{\mu}, \Sigma)$. We then shift the original samples to “decorrelate” the original set and the new samples; intuitively, this step hides the fact that the new samples were generated based on the empirical mean of the original samples. The final set of returned samples consists of the shifted versions of the original samples along with the freshly generated ones.
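The three steps above can be sketched as follows for the identity-covariance case. The text does not spell out the shift, so the particular choice here is an assumption: each original point is shifted by minus the sum of the fresh noise vectors divided by the number of original points. With that choice, the covariance between any new point and any shifted original point is $I/n - I/n = 0$. All names are illustrative.

```python
import random


def amplify_gaussian(samples, m, rng=None):
    """Sketch of a Gaussian amplifier for covariance I (assumed).

    samples: list of n points, each a list of d floats.
    Returns n shifted originals followed by m - n freshly generated points.
    """
    rng = rng or random.Random()
    n, d = len(samples), len(samples[0])
    extra = m - n

    # Step 1: empirical mean of the original points.
    mu_hat = [sum(x[j] for x in samples) / n for j in range(d)]

    # Step 2: new points are mu_hat plus fresh standard Gaussian noise.
    noise = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(extra)]
    new_points = [[mu_hat[j] + e[j] for j in range(d)] for e in noise]

    # Step 3 (assumed form of the shift): subtract the summed noise / n
    # from every original point, which cancels the correlation that the
    # shared mu_hat would otherwise create between old and new points.
    shift = [sum(e[j] for e in noise) / n for j in range(d)]
    shifted = [[x[j] - shift[j] for j in range(d)] for x in samples]

    return shifted + new_points
```

Note that, unlike the discrete sketch, none of the returned points coincides with an input point (almost surely), matching the description of the Gaussian procedure.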

In contrast to the amplification procedure in the discrete setting, where the final set of returned samples contains the (unmodified) original set, in this Gaussian setting, none of the returned samples are contained in the original set of samples. A natural question is whether this is necessary: Does there exist a procedure which achieves optimal amplification in the Gaussian setting, that returns a superset of the original samples? We show a lower bound proving that, for there is no amplification procedure which always returns a superset of the original samples. Curiously, however, there is such an amplification procedure in the regime where , even though learning is not possible until . Additionally, as goes from to , such amplification procedures go from being unable to amplify at all, to being able to amplify by nearly samples. This is formalized in the following proposition.

###### Proposition 1.

Let denote the class of dimensional Gaussian distributions with unknown mean and covariance . There is an absolute constant, , such that for sufficiently large , if there is no amplification procedure that always returns a superset of the original points.

On the other hand, there is a constant such that for any , for , and for sufficiently large , there is an amplification protocol for that returns a superset of the original samples.

### 1.3 Open Directions

From a technical perspective, there are a number of natural open directions for future work, including establishing tight worst-case bounds on amplification for other natural distribution classes, such as $d$-dimensional Gaussians with unknown mean and covariance. More conceptually, it seems worth getting a broader understanding of the range of potential amplification algorithms, and the settings to which each can be applied.

In the discrete distribution setting, our amplification results are tight (to constant factors) in a worst-case sense, and our amplifier essentially just returns the original samples, together with additional samples drawn from the empirical distribution of those samples, and then randomly permutes the order of these datapoints. This raises the question: In the case of discrete distributions, is there any benefit to considering more sophisticated amplification schemes? Below we sketch one example motivating a more clever amplification approach.

###### Example 2.

Consider obtaining samples corresponding to independent draws from a discrete distribution that puts probability on a single domain element, and with probability draws a sample from the uniform distribution over some infinite discrete domain. If then the amplification approach that adds samples from the empirical distribution of the data to the original set of samples, will fail. Indeed, with probability at least it will introduce a second sample of one of the “rare” elements, and such samples can be rejected by the verifier. For this setting, the optimal amplifier would always introduce extra samples corresponding to the element of probability .

The above example motivates a more sophisticated amplification strategy for the discrete distribution setting. Approaches such as Good-Turing frequency estimation, or more modern variants of it, adjust the empirical probabilities to more accurately reflect the true probabilities (see e.g. [14, 18, 22]). Indeed, in a setting such as Example 2, based on the fact that only one domain element is observed more than once, it is easy to conclude that the total probability mass of all the elements observed just once, is likely at most , which implies that a successful amplification scheme cannot duplicate any of them. While inserting samples from a Good-Turing adjusted empirical distribution will not improve the amplification in a worst-case sense for discrete distributions with a bounded support size, such schemes seem strictly better than the more naive schemes we analyze. The following question outlines one intriguing potential avenue for quantifying this, along the lines of the recent work on “instance optimal”/“competitive” distribution testing and estimation (see e.g. [1, 19, 22]):

Is there an “instance optimal” amplification scheme, which, for every distribution $D$, amplifies as well as could be hoped? Specifically, to what extent is there an amplification scheme which performs nearly as well as a hypothetical optimal scheme that knows distribution $D$ up to relabeling/permuting the domain?

In a completely different direction, our results showing that non-trivial amplification is possible even in the regime in which learning is not possible rely on the modeling assumption that the verifier gets no information about the amplifier’s training set, $X_n$ (the set of $n$ i.i.d. samples). If this dataset is revealed to the verifier, then the question of amplification is equivalent to learning. This prompts the question about a middle ground, where the verifier has some information about the set $X_n$, but does not see the entire set; this middle ground also seems the most practically relevant (e.g. how much do I need to know about a GAN’s training set to decide whether it actually understands a distribution of images?).

How does the power of the amplifier vary depending on how much information the verifier has about ? If the verifier is given a uniformly random subsample of of size how does the amount of possible amplification scale with ?

Finally, rather than considering how to increase the power of the verifier, as the above question asks, it might also be worth considering the consequences of decreasing either the computational power, or information theoretic power of the verifier.

If the verifier, instead of knowing distribution , receives only a set of independent draws from , how much more power does this give the amplifier? Alternately, if the verifier is constrained to be an efficiently computable function, does this provide additional power to the amplifier in any natural settings?

### 1.4 Related Work

The question of deciding whether a set of samples consists of independent draws from a specified distribution is one of the fundamental problems at the core of distributional property testing. Interest in this problem was sparked by the seminal work of Goldreich and Ron [13], who considered the specific problem of determining whether a set of samples was drawn from a uniform distribution of support size . This sparked a line of work on the slightly more general problem of “identity testing” whether a set of samples was drawn from a specified distribution, , versus a distribution with distance at least from . This includes the early work [2], and work of Paninski [20] who showed that if is the uniform distribution over elements, samples are necessary and sufficient, and the more recent work [23] who gave tight bounds on the sample complexity as a function of the distribution in question (which also established that samples suffice for any distribution supported on at most elements). While the identity testing problem is clearly related to the amplification problem we consider, these appear to be quite distinct problems. In particular, in the identity testing setting, the main technical challenge is understanding what statistics of a set of i.i.d. samples are capable of distinguishing samples drawn from the prescribed distribution, versus samples drawn from any distribution that is at least -far from that distribution. In contrast, in the amplification setting, the core of the question is how the amplifier can leverage a set of independent samples from to generate a larger set of (presumably) non-independent samples that can successfully masquerade as a set of independent samples drawn from ; of course, the catch is that the amplifier must do this in the data regime in which it is impossible for them to learn much about .

Beyond the specific question of identity testing, there is an enormous body of work on other distributional property testing questions, including the “tolerant” version of identity testing where one wishes to distinguish whether samples were drawn independently from a distribution that is close to a specified distribution, $D$, versus far from $D$, as well as the multi-distribution analogs where one obtains two (or more) sets of i.i.d. samples, drawn respectively from two (or more) unknown distributions, and wishes to distinguish the case that the distributions are identical (or close) versus have significant total variation distance (see e.g. [3, 24, 8, 19, 5, 16, 12]). In the majority of these works, the assumption is that the given samples consist of independent draws from some fixed distribution, and the common theme in these results is the punchline that such tests can typically be accomplished with far less data than would be required to learn the distribution in question.

Within this line of work on distributional property testing and estimation, there is also a recent thread of work on designing estimators for specific properties (such as entropy, or distance to uniformity), whose performance given independent draws from the distribution in question is comparable to the expected performance of a naive “plugin” estimator (which returns the property value of the empirical distribution) based on independent draws [22, 27]. The term “data amplification” has been applied to this line of work, although it is a different problem from the one we consider. In particular, we are considering the extent to which the samples can be used to create a larger set of samples; the work on property estimation is asking to what extent one can craft superior estimators whose performance is comparable to the performance that a more basic estimator would achieve with a larger sample size.

The recent work on sampling correctors [7] also considers the question of how to produce a “good” set of draws from a given distribution. That work assumes access to independent draws from a distribution, , which is close to having some desired structural property, such as monotonicity or uniformity, and considers how to “correct” or “improve” those samples to produce a set of samples that appear to have been drawn from a different distribution that possesses the desired property (or is closer to possessing the property). Part of that work also considers the question of whether such a protocol requires access to additional randomness.

Our formulation of sample amplification as a game between an amplifier and a verifier closely resembles the setup for pseudo-randomness (see [21] for a relatively recent survey). There, the pseudo-random generator takes a set of independent fair coin flips, and outputs a longer string of outcomes. The verifier’s job is to distinguish the output of the generator from a set of independent tosses of the fair coin (i.e. truly random bits). In contrast to our setting, in pseudo-randomness, both players know that the distribution in question is the uniform distribution; the catch is that the generator does not have access to randomness, and the verifier is computationally bounded. Beyond the superficial similarity in setup, we are not aware of any deeper connections between our notion of amplification and pseudorandomness.

Finally, it is also worth mentioning the work of Viola on the complexity of sampling from distributions [25]. That work also considers the challenge of generating samples from a specified distribution, though the problem is posed as the computational challenge of producing samples from a specified distribution, given access to uniformly random bits. One of the punchlines of that work is that there are distributions, such as the distribution over pairs where is a uniformly random length- string, and where small circuits can sample from the distribution, yet no small circuit can compute given . A different way of phrasing that punchline is that there are distributions that are easy to sample from, for which it is much harder to sample from their conditional distributions (e.g. in the parity case, sampling given is hard).

## 2 Algorithms and Proof Overview

In this section, we describe our algorithms for data amplification for discrete and Gaussian distributions. We also give an intuitive overview of the proofs of both the upper and lower bounds.

### 2.1 Discrete Distributions with Bounded Support

We begin by providing some intuition for amplification in the discrete distribution setting, by considering the simple case where the distribution in question is a uniform distribution over an unknown support. We then describe how our more general amplification algorithm extends this intuition.

#### 2.1.1 Intuition via the Uniform Distribution

Consider the problem of generating $(n+1)$ samples from a uniform distribution $D$ over $k$ unknown elements, given a set $X_n$ of $n$ samples from the distribution. Suppose $n \ll \sqrt{k}$. Then with high probability, no element appears more than once in a set of $(n+1)$ samples from $D$. Therefore, as the amplifier only knows $n$ elements of the support with $n$ samples, it cannot produce a set of $(n+1)$ samples such that each element only appears once in the set. Hence, no amplification is possible in this regime. Now consider the case when $n = c\sqrt{k}$ for a large constant $c$. By the birthday paradox, we now expect some elements to appear more than once, and the number of elements appearing twice has expectation $\approx c^2/2$ and standard deviation $\Theta(c)$. In light of this fact, consider an $(n, n+1)$ amplification procedure which takes any element that appears only once in the set $X_n$, adds an additional copy of this element to the set $X_n$, and then randomly shuffles these $n+1$ samples to produce the final set $Z_{n+1}$. It is easy to verify that the distribution of $Z_{n+1}$ will be close in total variation distance to a set of $n+1$ i.i.d. samples drawn from the original uniform distribution. Since the standard deviation of the number of elements in $X_n$ that appear twice is $\Theta(c)$, intuitively, we should be able to amplify by an additional $\Theta(c) = \Theta(n/\sqrt{k})$ samples, by taking elements which appear only once and repeating them, and then randomly permuting these samples. Note that with high probability, most elements only appear once in the set $X_n$, and hence the previous amplifier is almost equivalent to an amplifier which generates new samples by sampling from the empirical distribution of the original samples, and then randomly shuffles together the original and new samples. Our amplification procedure for general discrete distributions is based on this sampling-from-empirical procedure.
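The birthday-paradox intuition can be explored with a short simulation. This is a sketch under stated assumptions: it counts colliding pairs, which coincides with the number of elements appearing twice whenever no element appears three or more times (the typical case when the sample size is on the order of the square root of the support size). The function names are hypothetical.

```python
import random
from collections import Counter


def expected_colliding_pairs(n, k):
    """Exact expected number of colliding pairs among n uniform draws from
    k elements: each of the n(n-1)/2 pairs collides with probability 1/k."""
    return n * (n - 1) / (2 * k)


def count_colliding_pairs(samples):
    """Number of unordered pairs of positions holding the same element."""
    return sum(c * (c - 1) // 2 for c in Counter(samples).values())


def duplicate_singleton(samples, rng=None):
    """Toy amplifier that adds one sample: copy a uniformly random element
    that appears exactly once, then shuffle the result."""
    rng = rng or random.Random()
    counts = Counter(samples)
    singletons = [x for x in samples if counts[x] == 1]
    output = list(samples)
    # Fall back to any element if every element is already repeated.
    output.append(rng.choice(singletons) if singletons else rng.choice(output))
    rng.shuffle(output)
    return output
```

With a support of size 10000 and 300 samples, `expected_colliding_pairs(300, 10000)` is about 4.5, so adding one more repeated element through `duplicate_singleton` changes the collision count by an amount comparable to its natural fluctuation.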

#### 2.1.2 Algorithm and Upper Bound

To facilitate the analysis, our general amplification procedure, which applies to any discrete distribution , deviates from the sampling-from-empirical-then-shuffle scheme in two ways. First, we use the “Poissonization” trick and go from working with the multinomial distribution to the Poisson distribution—making the element counts independent for all elements. And second, instead of generating samples from the empirical distribution and shuffling, we (i) divide the input samples into two sets, (ii) use the first set to estimate the empirical distribution, (iii) generate new samples using this empirical distribution, and (iv) randomly shuffle these new samples with the samples in the second set. More precisely, we simulate two sets and , of samples from the distribution , using the original set of samples from . This is straightforward to do, as a random variable is with high probability. We then estimate the probabilities of the elements using the first set , and use these estimated probabilities to generate more samples from a Poisson distribution, which are then randomly shuffled with the samples in to produce . Then the set of output samples just consist of the samples in concatenated with those in . This describes the main steps in the procedure, more technical details can be found in the full description in Algorithm 3. We show that this procedure achieves a amplifier for sufficiently large and .

To prove this upper bound, first note that the counts of each element in a shuffled set are a sufficient statistic for the probability of observing that set, as the ordering of the elements is uniformly random. Hence we only need to show that the distribution of the counts in the shuffled set is close in total variation distance to the distribution of counts in a set of the same size drawn i.i.d. from $D$. Note that as the first set is independent of the second set, the additional samples added to the second set are independent of the samples originally in it, which avoids additional dependencies in the analysis. Using this independence, we show a technical lemma that with high probability over the first set, the KL-divergence between the distribution of the shuffled set and that of i.i.d. samples from $D$ is small. Then using Pinsker’s inequality, it follows that the total variation distance is also small. The final result then follows by a coupling argument, and by showing that the Poissonization steps are successful with high probability.
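For reference, the step from KL-divergence to total variation distance uses the standard form of Pinsker’s inequality, stated here for generic distributions $P$ and $Q$ (placeholder symbols, with KL-divergence measured in nats):

```latex
D_{TV}(P, Q) \;\le\; \sqrt{\tfrac{1}{2}\, D_{KL}(P \,\|\, Q)}
```

In particular, a KL-divergence bounded by a small constant $\delta$ gives total variation distance at most $\sqrt{\delta/2}$.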

#### 2.1.3 Lower Bound

We now describe the intuition for showing our lower bound that the class of discrete distributions with support at most does not admit an amplification scheme for , where is a fixed constant. For , we show this lower bound for the class of uniform distributions on some unknown elements. In this case, a verifier can distinguish between true samples from and a set of amplified samples by counting the number of unique samples in the set. Note that as the support of is unknown, the number of unique samples in the amplified set is at most the number of unique samples in the original set , unless the amplifier includes samples that are outside the support of , in which case the verifier will trivially reject this set. The expected number of unique samples in and draws from differs by , for some fixed constant . We use a Doob martingale and martingale concentration bounds to show that the number of unique samples in samples from concentrates within a margin of its expectation with high probability, for some fixed constant . This implies that there will be a large gap between the number of unique samples in and draws from . The verifier uses this to distinguish between true samples from and an amplified set, which cannot have sufficiently many unique samples.
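The unique-count test described above can be sketched as a small verifier. This is an illustrative sketch only: the verifier is assumed to know the support (it knows the distribution), and the acceptance `margin` is a hypothetical parameter that the actual argument would set using the martingale concentration bound.

```python
def expected_distinct(m, k):
    """Expected number of distinct elements among m uniform draws from a
    support of size k: each element is missed with probability (1 - 1/k)^m."""
    return k * (1.0 - (1.0 - 1.0 / k) ** m)


def unique_count_verifier(samples, support, margin):
    """Accept iff all samples lie in the support and the number of distinct
    elements is at most `margin` below its expectation under true sampling.

    `margin` is a placeholder for the concentration width.
    """
    support = set(support)
    # Elements outside the support are a trivial giveaway.
    if any(x not in support for x in samples):
        return False
    distinct = len(set(samples))
    return distinct >= expected_distinct(len(samples), len(support)) - margin
```

For example, with a support of size 10000, a set of 110 samples that contains only 100 distinct elements falls far below the roughly 109.4 distinct elements expected from true draws, and is rejected for any margin of a few units.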

Finally, we show that for , a amplification procedure for discrete distributions on elements implies a amplification procedure for the uniform distribution on elements, and for sufficiently large this is a contradiction to the previous part. This reduction follows by considering the distribution which has mass on one element and mass uniformly distributed on the remaining elements. With sufficiently large probability, the number of samples in the uniform section will be , and hence we can apply the previous result.

### 2.2 Gaussian Distributions with Unknown Mean and Fixed Covariance

Given the success of the simple sampling-from-empirical scheme for the discrete case, it is natural to consider the analogous algorithm for -dimensional Gaussian distributions with unknown mean and fixed covariance. In this section, we first show that this analogous procedure achieves non-trivial amplification for . We then describe the idea behind the lower bound that any procedure which does not modify the input samples does not work for . Inspired by the insights from this lower bound, we then discuss a more sophisticated procedure, which is optimal and achieves non-trivial amplification for as small as .

#### 2.2.1 Upper Bound for Algorithm which Samples from the Empirical Distribution

Let be the empirical mean of the original set . Consider the amplification scheme which draws new samples from and then randomly shuffles together the original samples and the new samples. We show that for any , this procedure—with a small modification to facilitate the analysis—achieves amplification for . This is despite the empirical distribution being far in total variation distance from the true distribution , for .
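A minimal sketch of the sampling-from-empirical scheme described above, assuming identity covariance and omitting the small modification that the analysis relies on. The function names and the list-of-lists representation of samples are choices made for this example, not part of the paper.

```python
import random


def empirical_mean(samples):
    """Coordinate-wise mean of a list of d-dimensional points."""
    n, d = len(samples), len(samples[0])
    return [sum(x[c] for x in samples) / n for c in range(d)]


def amplify_from_empirical(samples, m, rng):
    """Return m points: the untouched inputs plus m - n fresh draws from
    N(mu_hat, I), shuffled together (identity covariance is assumed)."""
    mu_hat = empirical_mean(samples)
    extra = [[rng.gauss(mu_c, 1.0) for mu_c in mu_hat]
             for _ in range(m - len(samples))]
    out = [list(x) for x in samples] + extra
    rng.shuffle(out)  # hide which positions hold the generated points
    return out
```

Because the input points are returned unchanged, this procedure falls into the class of amplifiers that output a superset of their input.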

We now provide the proof intuition for this result. First, note that it is sufficient to prove the result for . This is because all the operations performed by our amplification procedure are invariant under linear transformations. The intuition for the result in the identity covariance case is as follows. Consider . In this case, with high probability, the empirical mean satisfies for a fixed constant . If we center and rotate the coordinate system, such that has the coordinates , then the distribution of samples from and only differs along the first axis, and is independent across different axes. Hence, with some technical work, our problem reduces to the following univariate problem: what is the total variation distance between samples from the univariate distributions and , where is a mixture distribution where each sample is drawn from with probability and from with probability ? We show that the total variation distance between these distributions is small, by bounding the squared Hellinger distance between them. Intuitively, the reason for the total variation distance being small is that, even though one sample from is easy to distinguish from one sample from , for sufficiently small it is difficult to distinguish between these two samples in the presence of other samples from . This is because for draws from , with high probability there are samples in a constant length interval around , and hence it is difficult to detect the presence or absence of one extra sample in this interval.

#### 2.2.2 Lower Bound for any Procedure which Returns a Superset of the Input Samples

We show that procedures which return a superset of the input samples are inherently limited in this Gaussian setting, in the sense that they cannot achieve amplification for , where is a fixed constant.

The idea behind the lower bound is as follows. If we consider any arbitrary direction and project a true sample from along that direction, then with high probability, the projection lies close to the projection of the mean. However, for input set with mean , the projection of an extra sample added by any amplification procedure along the direction will be far from the projection of the mean . This is because after seeing just samples, any amplification procedure will have high uncertainty about the location of relative to . Based on this, we construct a verifier which can distinguish between a set of true samples and a set of amplified samples, for .

We now explain this more formally. Let be the -th sample returned by the procedure, and let be the mean of all except the -th sample. Let “new” be the index of the additional point added by the amplifier to the original set , hence the amplifier returns the set . Note that , hence with high probability. Suppose the verifier evaluates the following inner product for the additional point ,

$$\left\langle x'_{\text{new}} - \hat{\mu}_{-\text{new}},\; \mu - \hat{\mu}_{-\text{new}} \right\rangle. \tag{1}$$

Note that as the amplifier has not modified any of the original samples in . For a point drawn from , this inner product concentrates around . We now argue that if the true mean is drawn from the distribution , then the above inner product is much smaller than with high probability over . The reason for this is as follows. After seeing the samples in , the amplification algorithm knows that lies in a ball of radius centered at , but could lie along any direction in that ball. Formally, we can show that if is drawn from the distribution , then the posterior distribution of is a Gaussian with and . As is a random direction, for any that the algorithm returns, the inner product in (1) is with high probability over the randomness in . The verifier checks and ensures that . Hence for any amplification scheme, the inner product in (1) is at most with high probability over . In contrast, we argued before that this inner product is for a true sample from .

Finally, note that the algorithm can randomly shuffle the samples, and hence the verifier needs to do the above inner product test for every returned sample , for a total of tests. If tests are performed, then the inner product is expected to deviate by around its expected value of , even for true samples drawn from the distribution. But if , then , and hence any amplification scheme in this regime fails at least one of the following tests with high probability over :

1. ,

2. .

As true samples pass all the tests with high probability, this shows that amplification without modifying the provided samples is impossible for .
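The statistic in (1) is straightforward to compute for every returned index. The sketch below evaluates only the inner product against the leave-one-out mean; it deliberately leaves out the acceptance thresholds, and the function names are assumptions made for this example. The verifier is assumed to know the true mean `mu`.

```python
def leave_one_out_mean(samples, i):
    """Mean of all returned points except the i-th one."""
    n, d = len(samples), len(samples[0])
    return [(sum(x[c] for x in samples) - samples[i][c]) / (n - 1)
            for c in range(d)]


def inner_product_statistic(samples, mu, i):
    """<x_i - mu_hat_{-i}, mu - mu_hat_{-i}> for the i-th returned point."""
    loo = leave_one_out_mean(samples, i)
    return sum((samples[i][c] - loo[c]) * (mu[c] - loo[c])
               for c in range(len(mu)))


def all_inner_products(samples, mu):
    """One statistic per returned point, since the amplifier may shuffle."""
    return [inner_product_statistic(samples, mu, i)
            for i in range(len(samples))]
```

A verifier in the spirit of this section would compare each of these values, together with the distance of each point from the mean, against thresholds derived from the concentration bounds.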

#### 2.2.3 Optimal Amplification Procedure for Gaussians: Algorithm and Upper Bound

The above lower bound shows that it is necessary to modify the input samples to achieve amplification for . It also shows what a modification to cross the threshold must achieve—the inner product in (1) should be driven towards its expected value of for a true sample drawn from the distribution. Note that the inner product is too small for the algorithm which samples from the empirical distribution as the generated point is too correlated with the mean of the remaining points. We can fix this by shifting the original points in themselves, to hide the correlation between and the original mean of . The full procedure is quite simple to state, and is described in Algorithm 1. Note that unlike our other amplification procedures, this procedure does not involve any random shuffling of the samples. We show that this procedure achieves amplification for all and .

We now provide a brief proof sketch for this upper bound, for the case when . First, we show that the returned samples in can also be thought of as a single sample from a -dimensional Gaussian distribution , as the returned samples are linear combinations of Gaussian random variables. Hence, it is sufficient to find their mean and covariance, and use that to bound their total variation distance to true samples from the distribution (which can also be thought of as a single sample from a -dimensional Gaussian distribution ). Therefore, our problem reduces to ensuring that the total variation distance between these two Gaussian distributions is small, and this distance is proportional to . Our modification procedure removes the correlations between the original samples and the generated samples to ensure that the non-diagonal entries of are small, and hence the total variation distance is also small. For example, the original correlation between the first coordinates of the original sample and the generated sample is too large, but it is easy to verify that the correlation between the first coordinates of the modified sample and the generated sample is zero.

#### 2.2.4 General Lower Bound for Gaussians

We show a lower bound that there is no amplification procedure for Gaussian distributions with unknown mean for , where is a fixed constant. The idea behind the lower bound is similar to the lower bound for procedures which return a superset of the input samples in Section 2.2.2. We define a verifier such that for and , true samples from are accepted by the verifier with high probability over the randomness in the samples, but samples generated by any amplification scheme are rejected by the verifier with high probability over the randomness in the samples and . As before, let be the -th sample returned by the procedure, and let be the mean of all except the -th sample. Our verifier for this setting evaluates the inner product and the distance of the sample from the mean as in Section 2.2.2, and in addition, also checks . In total, it evaluates the following,

1. ,

2. ,

3. .

Note that unlike Section 2.2.2, we show that the verifier only needs to do the above tests for any one index —chosen to be above—instead of . We now explain why these tests are sufficient to prove the lower bound. Note that for true samples drawn from , . Also, the squared distance of the mean of the original set from is concentrated around . Using this, for , we can show that no algorithm can find a which satisfies with decent probability over . This is because such an algorithm could be used to find the true mean with much smaller error than , which is not possible with samples. This argument works for , but that does not rule out amplification schemes for . To show that amplification is not possible for , we use the inner product test along with the test for to distinguish between true samples from and those produced by an amplification scheme, with high probability over and the samples. The analysis is similar to the analysis in Section 2.2.2.

## 3 Proofs: Gaussian with Unknown Mean and Fixed Covariance

### 3.1 Upper Bound

In this section, we prove the upper bound in Theorem 2 by showing that Algorithm 1 can be used as a amplification procedure.

First, note that it is sufficient to prove the theorem for the case when input samples come from an identity covariance Gaussian. This is because, for the purpose of analysis, we can transform our samples to those coming from an identity covariance Gaussian, as our amplification procedure is invariant to linear transformations of the samples. In particular, let denote our amplification procedure for samples coming from , and, denote the random variable corresponding to samples from . Let denote samples from , such that . Due to invariance of our amplification procedure to linear transformations, we get that is equal in distribution to . This gives us

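$$\begin{aligned}
D_{TV}\big(f_{\Sigma}(Y_n),\, Y_m\big)
&= D_{TV}\Big(f_{\Sigma}\big(\Sigma^{\frac{1}{2}}(X_n-\mu)+\mu\big),\; \Sigma^{\frac{1}{2}}(X_m-\mu)+\mu\Big) \\
&= D_{TV}\Big(\Sigma^{\frac{1}{2}}\big(f_{I}(X_n)-\mu\big)+\mu,\; \Sigma^{\frac{1}{2}}(X_m-\mu)+\mu\Big) \\
&\le D_{TV}\big(f_{I}(X_n),\, X_m\big),
\end{aligned}$$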

where the last inequality is true because the total variation distance between two distributions cannot increase if we apply the same transformation to both distributions. Hence, it is sufficient to prove our results for the identity covariance case. This is true for both of the amplification procedures for Gaussians that we have discussed. So throughout this section, we will work with identity covariance Gaussian distributions.

###### Proposition 2.

Let denote the class of dimensional Gaussian distributions with unknown mean . For all and , admits an amplification procedure.

###### Proof.

The amplification procedure consists of two parts. The first uses the provided samples to learn the empirical mean and generate new samples from . The second part adjusts the first samples to “hide” the correlations that would otherwise arise from using the empirical mean to generate additional samples.

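Let $\epsilon_{n+1}, \epsilon_{n+2}, \dots, \epsilon_m$ be i.i.d. samples generated from $N(0, I)$, and let $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$. The amplification procedure will return $x'_1, x'_2, \dots, x'_m$ with: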

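$$x'_i = \begin{cases} x_i - \dfrac{\sum_{j=n+1}^{m} \epsilon_j}{n}, & \text{for } i \in \{1, 2, \dots, n\} \\[2ex] \hat{\mu} + \epsilon_i, & \text{for } i \in \{n+1, n+2, \dots, m\}. \end{cases} \tag{2}$$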

We will show later in this proof that subtracting $\frac{1}{n}\sum_{j=n+1}^{m} \epsilon_j$ will serve to decorrelate the first $n$ samples from the remaining $m-n$ samples.
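The map in (2) can be written out directly. The following is a sketch under the identity covariance assumption used in this section; the function name and the list-of-lists representation are choices made for the example.

```python
import random


def amplify_with_shift(samples, m, rng):
    """Apply the map in (2): shift the n inputs by the averaged noise and
    append m - n new points mu_hat + eps_i, with eps_i ~ N(0, I)."""
    n, d = len(samples), len(samples[0])
    mu_hat = [sum(x[c] for x in samples) / n for c in range(d)]
    # Noise vectors eps_{n+1}, ..., eps_m.
    eps = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m - n)]
    # Common shift (1/n) * sum_j eps_j subtracted from every input point.
    shift = [sum(e[c] for e in eps) / n for c in range(d)]
    shifted = [[x[c] - shift[c] for c in range(d)] for x in samples]
    generated = [[mu_hat[c] + e[c] for c in range(d)] for e in eps]
    return shifted + generated  # no shuffling
```

One consequence that is easy to check by hand: the noise terms cancel when the returned points are summed, so the mean of the $m$ returned points equals the empirical mean of the $n$ inputs exactly.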

Let be the random function denoting the map from to as described above, where . We need to show

$$D_{TV}\big(Z_m = f_{C,n,m}(X_n),\; X_m\big) \le 1/3,$$

where $X_n$ and $X_m$ denote $n$ and $m$ independent samples from $N(\mu, I)$ respectively.

For ease of understanding, we first prove this result for the univariate case, and then extend it to the general setting.

So consider the setting where . In this case, corresponds to i.i.d. samples from a Gaussian with mean , and variance . can also be thought of as a single sample from an dimensional Gaussian . Now, is a map that takes i.i.d. samples from , i.i.d. samples () from , and outputs samples that are a linear combination of the input samples. So, can be thought of as a dimensional random variable obtained by applying a linear transformation to a sample drawn from . As a linear transformation applied to a Gaussian random variable outputs a Gaussian random variable, we get that is distributed according to , where and denote the mean and covariance. Note that as

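$$E[x'_i] = \begin{cases} E[x_i] - E\left[\dfrac{\sum_{j=n+1}^{m} \epsilon_j}{n}\right] = \mu - 0 = \mu, & \text{for } i \in \{1, 2, \dots, n\} \\[2ex] E[\hat{\mu}] + E[\epsilon_i] = \mu + 0 = \mu, & \text{for } i \in \{n+1, n+2, \dots, m\} \end{cases} \tag{3}$$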

Next, we compute the covariance matrix .

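For $i = j$, and $i \in \{1, 2, \dots, n\}$, we get

$$\Sigma_{ii} = E\big[(x'_i - \mu)^2\big] = E\big[(x_i - \mu)^2\big] + E\left[\left(\frac{\sum_{j=n+1}^{m} \epsilon_j}{n}\right)^2\right] = 1 + \frac{m-n}{n^2}.$$

For $i = j$, and $i \in \{n+1, n+2, \dots, m\}$, we get

$$\Sigma_{ii} = E\big[(x'_i - \mu)^2\big] = E\big[(\hat{\mu} - \mu)^2\big] + E\big[\epsilon_i^2\big] = \frac{1}{n} + 1.$$

For $i \in \{1, 2, \dots, n\}$ and $j \in \{n+1, n+2, \dots, m\}$, we get

$$\begin{aligned}
\Sigma_{ij} = E\big[(x'_i - \mu)(x'_j - \mu)\big]
&= E\left[\left(x_i - \frac{\sum_{k=n+1}^{m} \epsilon_k}{n} - \mu\right)\big(\hat{\mu} + \epsilon_j - \mu\big)\right] \\
&= E\big[(x_i - \mu)(\hat{\mu} - \mu)\big] - E\left[\left(\frac{\sum_{k=n+1}^{m} \epsilon_k}{n}\right)\epsilon_j\right] = \frac{1}{n} - \frac{1}{n} = 0.
\end{aligned}$$

For $i \neq j$ with $i, j \in \{1, 2, \dots, n\}$, we get

$$\begin{aligned}
\Sigma_{ij} = E\big[(x'_i - \mu)(x'_j - \mu)\big]
&= E\left[\left(x_i - \frac{\sum_{k=n+1}^{m} \epsilon_k}{n} - \mu\right)\left(x_j - \frac{\sum_{k=n+1}^{m} \epsilon_k}{n} - \mu\right)\right] \\
&= E\big[(x_i - \mu)(x_j - \mu)\big] + E\left[\left(\frac{\sum_{k=n+1}^{m} \epsilon_k}{n}\right)^2\right] = \frac{m-n}{n^2}.
\end{aligned}$$

For $i \neq j$ with $i, j \in \{n+1, n+2, \dots, m\}$, we get

$$\Sigma_{ij} = E\big[(x'_i - \mu)(x'_j - \mu)\big] = E\big[(\hat{\mu} + \epsilon_i - \mu)(\hat{\mu} + \epsilon_j - \mu)\big] = E\big[(\hat{\mu} - \mu)^2\big] = \frac{1}{n}.$$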

This gives us

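$$\Sigma_{m \times m} = \begin{bmatrix}
1+\frac{m-n}{n^2} & \frac{m-n}{n^2} & \cdots & \frac{m-n}{n^2} & 0 & 0 & \cdots & 0 \\
\frac{m-n}{n^2} & 1+\frac{m-n}{n^2} & \cdots & \frac{m-n}{n^2} & 0 & 0 & \cdots & 0 \\
\vdots & & \ddots & \vdots & \vdots & & & \vdots \\
\frac{m-n}{n^2} & \cdots & \frac{m-n}{n^2} & 1+\frac{m-n}{n^2} & 0 & 0 & \cdots & 0 \\
0 & \cdots & \cdots & 0 & 1+\frac{1}{n} & \frac{1}{n} & \cdots & \frac{1}{n} \\
0 & \cdots & \cdots & 0 & \frac{1}{n} & 1+\frac{1}{n} & \cdots & \frac{1}{n} \\
\vdots & & & \vdots & \vdots & & \ddots & \vdots \\
0 & \cdots & \cdots & 0 & \frac{1}{n} & \cdots & \frac{1}{n} & 1+\frac{1}{n}
\end{bmatrix}.$$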

Now, finding reduces to computing . From [11, Theorem 1.1], we know that . This gives us

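$$D_{TV}\big(N(\tilde{\mu}, I_{m \times m}),\, N(\tilde{\mu}, \Sigma_{m \times m})\big) \le \min\left(1, \frac{3}{2}\,\|\Sigma - I\|_F\right) \le \sqrt{\frac{3}{2}\left(\left(\frac{m-n}{n^2}\right)^2 n^2 + \frac{1}{n^2}(m-n)^2\right)} = \frac{\sqrt{3}\,(m-n)}{n}. \tag{4}$$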
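As a sanity check on the covariance entries in the univariate case, one can write the amplification map as an $m \times m$ matrix $A$ acting on the vector $(x_1, \dots, x_n, \epsilon_{n+1}, \dots, \epsilon_m)$, which has identity covariance, so that $\Sigma = A A^{\top}$. The sketch below builds $A$ and evaluates $\|\Sigma - I\|_F$ numerically; the function names are assumptions made for this example.

```python
import math


def linear_map(n, m):
    """Matrix A with x' = A z, where z = (x_1..x_n, eps_{n+1}..eps_m)."""
    A = [[0.0] * m for _ in range(m)]
    for i in range(n):            # x'_i = x_i - (1/n) * sum_j eps_j
        A[i][i] = 1.0
        for j in range(n, m):
            A[i][j] = -1.0 / n
    for i in range(n, m):         # x'_i = (1/n) * sum_k x_k + eps_i
        A[i][i] = 1.0
        for j in range(n):
            A[i][j] = 1.0 / n
    return A


def covariance(A):
    """Sigma = A A^T, valid because z has identity covariance."""
    m = len(A)
    return [[sum(A[i][k] * A[j][k] for k in range(m)) for j in range(m)]
            for i in range(m)]


def frobenius_distance_from_identity(S):
    """||S - I||_F."""
    m = len(S)
    return math.sqrt(sum((S[i][j] - (1.0 if i == j else 0.0)) ** 2
                         for i in range(m) for j in range(m)))
```

For instance, with the hypothetical sizes $n = 4$ and $m = 6$, the two blocks of $\Sigma - I$ each contribute $(m-n)^2/n^2$ to the squared Frobenius norm, giving $\|\Sigma - I\|_F = \sqrt{2}\,(m-n)/n$, and the cross-block entries vanish.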

Now, for , by a similar argument as above, can be thought of as independent samples from the following distributions: