# Perfect $L_p$ Sampling in a Data Stream

Rajesh Jayaram
Carnegie Mellon University
rkjayara@cs.cmu.edu
David P. Woodruff
Carnegie Mellon University
dwoodruf@cs.cmu.edu
The authors gratefully acknowledge partial support from the National Science Foundation under Grant No. CCF-1815840.
###### Abstract

In this paper, we resolve the one-pass space complexity of perfect $L_p$ sampling for $p \in (0,2)$ in a stream. Given a stream of updates (insertions and deletions) to the coordinates of an underlying vector $f \in \mathbb{R}^n$, a perfect $L_p$ sampler must output an index $i$ with probability $|f_i|^p/\|f\|_p^p$, and is allowed to fail with some probability $\delta$. So far, for $p > 0$ no algorithm has been shown to solve the problem exactly using $\mathrm{poly}(\log n)$-bits of space. In 2010, Monemizadeh and Woodruff introduced an approximate $L_p$ sampler, which outputs $i$ with probability $(1 \pm \nu)|f_i|^p/\|f\|_p^p$, using space polynomial in $\nu^{-1}$ and $\log(n)$. The space complexity was later reduced by Jowhari, Sağlam, and Tardos to roughly $O(\nu^{-p} \log^2 n \log \delta^{-1})$ for $p \in (0,2)$, which matches the $\Omega(\log^2 n \log \delta^{-1})$ lower bound in terms of $n$ and $\delta$, but is loose in terms of $\nu$.

Given these nearly tight bounds, it is perhaps surprising that no lower bound exists in terms of $\nu$; not even a bound of $\Omega(\nu^{-1})$ is known. In this paper, we explain this phenomenon by demonstrating the existence of an $O(\log^2 n \log \delta^{-1})$-bit perfect $L_p$ sampler for $p \in (0,2)$. This shows that $\nu$ need not factor into the space of an $L_p$ sampler, which closes the complexity of the problem for this range of $p$. For $p=2$, our bound is $O(\log^3 n \log \delta^{-1})$-bits, which matches the prior best known upper bound of $O(\nu^{-2}\log^3 n \log \delta^{-1})$, but has no dependence on $\nu$. For $p<2$, our bound holds in the random oracle model, matching the lower bounds in that model. Moreover, we show that our algorithm can be derandomized with only a $O((\log \log n)^2)$ blow-up in the space (and no blow-up for $p=2$). Our derandomization technique is quite general, and can be used to derandomize a large class of linear sketches, including the more accurate count-sketch variant of [MP14], resolving an open question in that paper.

Finally, we show that a $(1 \pm \epsilon)$ relative error estimate of the frequency $f_i$ of the sampled index $i$ can be obtained using an additional $O(\epsilon^{-p} \log n)$-bits of space for $p<2$, and $O(\epsilon^{-2} \log^2 n)$ bits for $p=2$, which was possible before only by running the prior algorithms with $\nu = \epsilon$.

## 1 Introduction

The streaming model of computation has become increasingly important for the analysis of massive datasets, where the sheer size of the input imposes stringent restrictions on the resources available to algorithms. Examples of such datasets include internet traffic logs, sensor networks, financial transaction data, database logs, and scientific data streams (such as huge experiments in particle physics, genomics, and astronomy). Given their prevalence, there is a large body of literature devoted to designing extremely efficient one-pass algorithms for analyzing data streams. We refer the reader to [BBD02, M05] for surveys of these algorithms and their applications.

More recently, the technique of sampling has proven to be tremendously powerful for the analysis of data streams. Substantial literature has been devoted to the study of sampling for problems in big data [M05, Haa16, Coh15, CDK09, CDK14, CCD11, EV03, GM98a, Knu98, MM12, Vit85b, CCD12, GLH08, GLH06], with applications to network traffic analysis [TLJ10, HNG07, GKMS01, MCS06, Duf04], databases [Olk93, Haa16, HNSS96, HS92, LNS90, LN95], distributed computation [WZ16, CMYZ10, CMYZ12, TW11], and low-rank approximation [WZ16, FKV04, DV06]. While several models for sampling in data streams have been proposed [BDM02, AKO10, CMYZ10], among the most widely studied are the $L_p$ samplers introduced in [MW10]. Roughly speaking, given a vector $f \in \mathbb{R}^n$, the goal of an $L_p$ sampler is to return an index $i \in \{1,2,\dots,n\}$ with probability $|f_i|^p/\|f\|_p^p$. In the data stream setting, the vector $f$ is given by a sequence of updates (insertions or deletions) to its coordinates of the form $f_i \leftarrow f_i + \Delta$, where $\Delta$ can either be positive or negative. A $1$-pass $L_p$ sampler must return an index given only one pass through the updates of the stream.

Since their introduction, $L_p$ samplers have been utilized to develop alternative algorithms for important streaming problems, such as the heavy hitters problem, $L_p$ estimation, cascaded norm estimation, and finding duplicates in data streams [AKO10, MW10, JST11, BOZ12]. For the case of $p=1$ and insertion only streams, where the updates to $f$ are strictly positive, the problem is easily solved using $O(\log n)$ bits of space with the well-known reservoir sampling algorithm [Vit85a]. When deletions to the stream are allowed or when $p \neq 1$, however, the problem is more complicated. In fact, the question of whether such samplers even exist was posed by Cormode, Muthukrishnan, and Rozenbaum in [CMR05]. Later on, Monemizadeh and Woodruff demonstrated that, if one permits the sampler to be approximately correct, such samplers are indeed possible [MW10]. We formally state the guarantee given by an approximate $L_p$ sampler below.

###### Definition 1.

Let and . For , an approximate sampler with -relative error is an algorithm which returns an index such that for every

$$\Pr[i=j] = \frac{|f_j|^p}{\|f\|_p^p}(1 \pm \nu) + O(n^{-c})$$

where $c$ is some arbitrarily large constant. For , the problem is to return with probability . If $\nu = 0$, then the sampler is said to be perfect. An $L_p$ sampler is allowed to output FAIL with some probability $\delta$. However, in this case it must not output any index.
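As a point of reference for Definition 1, the distribution that a perfect sampler targets can be computed offline by materializing the frequency vector. The following is a minimal sketch, assuming the stream is given as a list of `(index, delta)` pairs and $p > 0$; the name `lp_distribution` is illustrative and does not appear in the paper, and the function is not a streaming algorithm, since it stores the whole vector.

```python
from collections import defaultdict


def lp_distribution(updates, p):
    """Return {i: |f_i|^p / ||f||_p^p} for the vector f defined by `updates`.

    `updates` is an iterable of (index, delta) pairs; deltas may be negative.
    This is an offline reference computation, not a small-space sampler.
    """
    f = defaultdict(int)
    for index, delta in updates:
        f[index] += delta  # the update f_i <- f_i + delta

    # Coordinates that cancelled to zero carry no probability mass.
    weights = {i: abs(v) ** p for i, v in f.items() if v != 0}
    total = sum(weights.values())
    if total == 0:
        return {}
    return {i: w / total for i, w in weights.items()}


# Insertions and deletions: f ends as {0: 2, 1: 0, 2: 4}.
stream = [(0, 3), (1, 5), (0, -1), (2, 4), (1, -5)]
dist = lp_distribution(stream, p=2)
```

A sampler with $\nu$-relative error is allowed to deviate from each of these probabilities by a factor of $(1 \pm \nu)$, whereas a perfect sampler may deviate only by the additive $O(n^{-c})$ term.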

The one-pass approximate sampler introduced in [MW10] requires space, albeit with rather large exponents. Later on, in [AKO10], the complexity was reduced significantly to -bits for , using a technique known as precision sampling. (Footnote 1: We note that previous works [JST11, KNP17] have cited the sampler of [AKO10] as using -bits of space; however, the space bound given in their paper is in machine words, and is therefore a bit bound with . In order to obtain an bit sampler, their algorithm must be modified to use fewer repetitions.) Roughly, the technique of precision sampling consists of scaling the coordinates by random variable coefficients as the updates arrive, resulting in a new stream vector with . The algorithm then searches for all which cross a certain threshold . Observe that if where is uniform on , then the probability that is precisely . By running an estimation algorithm to obtain , an sampler can then return any with as its output. These heavy coordinates can be found using any of the well-known -heavy hitters algorithms for a sufficiently small precision .
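The threshold-crossing probability behind precision sampling can be verified in one line. The notation below is assumed for illustration: $t_i$ is uniform on $[0,1]$, the scaled coordinate is $z_i = f_i / t_i^{1/p}$, and $T \geq |f_i|$ is the threshold.

```latex
\Pr\big[\,|z_i| \ge T\,\big]
  = \Pr\!\left[\frac{|f_i|^p}{t_i} \ge T^p\right]
  = \Pr\!\left[t_i \le \frac{|f_i|^p}{T^p}\right]
  = \frac{|f_i|^p}{T^p}.
```

Under these assumptions, choosing $T$ to be a constant-factor approximation of $\|f\|_p$ makes the crossing probability proportional to $|f_i|^p/\|f\|_p^p$, which is the sampling probability an $L_p$ sampler aims for.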

Using a tighter analysis of this technique with the same scaling variables , Jowhari, Sağlam, and Tardos reduced the space complexity of sampling for to -bits for , and bits of space for [JST11]. Roughly speaking, their improvements result from a more careful consideration of the precision needed to determine when a crosses the threshold, which they do via the tighter tail-error guarantee of the well-known count-sketch heavy hitters algorithm [CCFC02a]. In addition, they give an perfect sampler, and demonstrate an -bit lower bound for samplers for any . Recently, this lower bound was extended to bits [KNP17], which closes the complexity of the problem for .

For $p \in (0,2)$, this means that the upper and lower bounds for $L_p$ samplers are tight in terms of $n$ and $\delta$, but a gap exists in the dependency on $\nu$. This being the case, it would seem natural to search for an $\Omega(\nu^{-p} \log^2 n \log \delta^{-1})$ lower bound to close the complexity of the problem. It is perhaps surprising, therefore, that no lower bound in terms of $\nu$ exists; not even an $\Omega(\nu^{-1})$ bound is known. This poses the question of whether the $\Omega(\nu^{-p} \log^2 n \log \delta^{-1})$ lower bound is in fact correct.

### 1.1 Our Contributions

In this paper, we explain the phenomenon of the lack of an lower bound by showing that need not enter the space complexity of an sampler at all. In other words, we demonstrate the existence of perfect samplers using -bits of space for , thus resolving the space complexity of the problem up to terms. (Footnote 2: A previous version of this work claimed bits of space for , but contained an error in the derandomization. Thus, this bound only held in the random oracle model. In the present version we correct this derandomization using a slightly different algorithm, albeit with a blow-up in the space. The algorithm from the previous version can be found in Appendix A, along with a new analysis of its derandomization which allows it to run in -bits of space.) In the random oracle model, where we are given random access to an arbitrarily long tape of random bits which do not count against the space of the algorithm, our upper bound is , which matches the lower bound in the random oracle model. For , our space is -bits, which matches the best known upper bounds in terms of , yet again has no dependency on . In addition, for and the high probability regime of , we obtain a -bit perfect sampler, which also tightly matches the lower bound without paying the extra factor. A summary of the prior upper bounds for sampling, along with the contributions of this work, is given in Figure 1.

In addition to outputting a perfect sample from the stream, for we also show that, conditioned on an index being output, given an additional additive -bits we can provide a approximation of the frequency with probability . This separates the space dependence on and for frequency approximation, allowing us to obtain a approximation of in bits of space with constant probability, whereas before this required bits of space. For , our bound is , which still improves upon the prior best known bounds for estimating the frequency by an -factor. Finally, we show an bits of space lower bound for producing the estimate (conditioned on an index being returned).

### 1.2 Applications

Since their introduction, it has been observed that $L_p$ samplers can be used as a building block in algorithms for many important streaming problems, such as finding heavy hitters, $L_p$-norm estimation, cascaded norm estimation, and finding duplicates in data streams [AKO10, MW10, JST11, BOZ12]. $L_p$ samplers, particularly for , are often used as a black-box subroutine to design representative histograms of $f$ on which more complicated algorithms are run [GMP, GM98b, Olk93, GKMS02, HNG07, CMR05]. For these black-box applications, the only property of the samplers needed is the distribution of their samples. Samplers with relative error are statistically biased and, in the analysis of more complicated algorithms built upon such samplers, this bias and its propagation over multiple samples must be accounted for and bounded. The analysis and development of such algorithms would therefore be simplified dramatically under the assumption that the samples are truly uniform (i.e., from a perfect sampler). In this case, no error terms or variational distance need be accounted for. Our results show that such an assumption is possible without affecting the space complexity of the sampler.

Note that in Definition 1, we allow a perfect sampler to have variation distance to the true distribution. We note that this definition is in line with prior work, observing that even the perfect sampler of [JST11] incurs such an error from derandomizing with Nisan’s PRG. Nevertheless, this error will never be detected if the sampler is run polynomially many times in the course of constructing a histogram, and such a sampler is therefore statistically indistinguishable from a truly uniform sampler and can be used as a black box.

Another motivation for utilizing perfect samplers comes from applications in privacy. Here is some underlying dataset, and we would like to reveal a sample drawn from the distribution over to some external party without revealing too much global information about itself. Using an approximate sampler introduces a multiplicative bias into the sampling probabilities, and this bias can depend on global properties of the data. For instance, such a sampler might bias the sampling probabilities of a large set of coordinates by a factor if a certain global property holds for , and may instead bias them by if a disjoint property holds. Using only a small number of samples, an adversary would then be able to distinguish whether or holds by determining how these coordinates were biased. On the other hand, the bias in the samples produced by a perfect sampler is polynomially small, and thus the leakage of global information could be substantially smaller when using one, though one would need to formally define a notion of leakage and privacy for the given application.

### 1.3 Our Techniques

Our main algorithm is inspired by the precision sampling technique used in prior works [AKO10, JST11], but with some marked differences. To describe how our sampler achieves the improvements mentioned above, we begin by observing that all sampling algorithms since [AKO10] have adhered to the same algorithmic template (shown in Figure 2). This template employs the classic count-sketch algorithm of [CCFC02b] as a subroutine, which is easily introduced. For , let denote the set . Given a precision parameter , count-sketch selects pairwise independent hash functions and , for where . Then for all , it computes the following linear function , and outputs an approximation of given by . We will discuss the estimation guarantee of count-sketch at a later point.

The algorithmic template is as follows. First, perform some linear transformation on the input vector to obtain a new vector . Next, run an instance of count-sketch on to obtain the estimate . Finally, run some statistical test on . If the test fails, then output FAIL, otherwise output the index of the largest coordinate (in magnitude) of . We first describe how the sampler of [JST11] implements the steps in this template. Afterwards we describe the different implementation decisions made in our algorithm that allow it to overcome the limitations of prior approaches.

#### Prior Algorithms.

The samplers of [JST11, AKO10] utilize the technique known as precision sampling, which employs the following linear transformation. The algorithms first generate random variables with limited independence, where each . Next, each coordinate is scaled by the coefficient to obtain the transformed vector given by , thus completing Step of Figure 2. For simplicity, we now restrict to the case of and the algorithm of [JST11]. The goal of the algorithm is then to return an item that crosses the threshold , where is a constant factor approximation of the . Note the probability that this occurs is proportional to .

Next, implementing the second step of Figure 2, the vector is hashed into count-sketch to find an item that has crossed the threshold. Using the stronger tail-guarantee of count-sketch, the estimate vector satisfies , where is with the largest coordinates (in magnitude) set to . Now the algorithm runs into trouble when it incorrectly identifies as crossing the threshold when it has not, or vice-versa. However, if the tail error is at most , then since is a uniform variable the probability that is close enough to the threshold to be misidentified is , which results in at most relative error in the sampling probabilities. Thus it will suffice to have with probability . To show that this is the case, consider the level sets , and note . We observe here that results of [JST11] can be partially attributed to the fact that for , the total contribution of the level sets to decreases geometrically with , and so with constant probability we have . Moreover, if one removes the top largest items, the contribution of the remaining items to the is with probability . So taking , the tail error from count-sketch has the desired size. Since the tail error does not include the largest coordinates, this holds even conditioned on a fixed value of the maximizer.

Now with probability the guarantee on the error from the prior paragraph does not hold, and in this case one cannot still output an index , as this would result in a -additive error sampler. Thus, as in Step of Figure 2, the algorithm must implement a statistical test to check that the guarantee holds. To do this, using the values of the largest coordinates of , they produce an estimate of the tail-error and output FAIL if it is too large. Otherwise, the item is output if . The whole algorithm is run times so that an index is output with probability .

#### Our Algorithm.

Our first observation is that, in order to obtain a truly perfect sampler, one needs to use different scaling variables . Notice that the approach of scaling by inverse uniform variables and returning a coordinate which reaches a certain threshold faces the obvious issue of what to return when more than one of the variables crosses . This is solved by simply outputting the maximum of all such coordinates. However, the probability of an index becoming the maximum and reaching a threshold is drawn from an entirely different distribution, and for uniform variables this distribution does not appear to be the correct one. To overcome this, we must use a distribution where the maximum index of the variables is drawn exactly according to the distribution . We observe that the distribution of exponential random variables has precisely this property, and thus to implement Step of Figure 2 we set where is an exponential random variable. We remark that exponential variables have been used in the past, such as for moment estimation, , in [AKO10] and regression in [WZ13]. However it appears that their applicability to sampling has never before been exploited.
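The claim that exponential scaling makes the maximizer follow the $L_p$ distribution can be checked empirically. The sketch below is a simulation under stated assumptions, not the paper's streaming algorithm: it stores $f$ explicitly, draws fresh rate-1 exponentials $t_i$ with Python's `random.expovariate`, scales each coordinate to $|f_i| / t_i^{1/p}$, and records the index of the maximum. The function names are illustrative, and $f$ is assumed to have at least one nonzero coordinate.

```python
import random


def exponential_scaling_sample(f, p, rng):
    """Return argmax_i |f_i| / t_i^(1/p) for fresh rate-1 exponentials t_i."""
    best_index, best_value = None, -1.0
    for i, fi in enumerate(f):
        if fi == 0:
            continue  # a zero coordinate can never be the maximizer
        # Guard against the (measure-zero) event that the draw is exactly 0.
        t = max(rng.expovariate(1.0), 1e-300)
        z = abs(fi) / t ** (1.0 / p)
        if z > best_value:
            best_index, best_value = i, z
    return best_index


def empirical_distribution(f, p, trials, seed=0):
    """Estimate the distribution of the maximizer over independent trials."""
    rng = random.Random(seed)
    counts = [0] * len(f)
    for _ in range(trials):
        counts[exponential_scaling_sample(f, p, rng)] += 1
    return [c / trials for c in counts]


f = [1, -2, 0, 3]
p = 1.5
empirical = empirical_distribution(f, p, trials=20000)

norm = sum(abs(x) ** p for x in f)
target = [abs(x) ** p / norm for x in f]  # |f_i|^p / ||f||_p^p
```

Up to sampling noise, `empirical` should agree with `target` coordinate by coordinate. Repeating the experiment with inverse uniform variables in place of exponentials would not be expected to reproduce `target`, which is the issue the paragraph above describes.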

Next, we carry out the count-sketch step by hashing our vector into a count-sketch data structure . Because we are only interested in the maximizer of , we develop a modified version of count-sketch, called count-max. Instead of producing an estimate such that is small, count-max simply checks, for each , how many times hashed into the largest bucket (in absolute value) of a row of . If this number is at least a -fraction of the total number of rows, count-max declares that is the maximizer of . We show that with high probability, count-max never incorrectly declares an item to be the maximizer, and moreover if , then count-max will declare to be the maximizer. Using the min-stability property of exponential random variables, we can show that the maximum item is distributed as , where is another exponential random variable. Thus with constant probability. Using a more general analysis of the norm of the level sets , we can show that . If all these events occur together (with sufficiently large constants), count-max will correctly determine the coordinate . However, just as in [JST11], we cannot output an index anyway if these conditions do not hold, so we will need to run a statistical test to ensure that they do.
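The min-stability step can be written out explicitly. The notation here is assumed for illustration: $t_1, \dots, t_n$ are i.i.d. exponentials with rate 1, $z_i = f_i / t_i^{1/p}$, and $E$ denotes a further rate-1 exponential.

```latex
\begin{align*}
\max_i |z_i|^p
  &= \max_i \frac{|f_i|^p}{t_i}
   = \Big( \min_i \frac{t_i}{|f_i|^p} \Big)^{-1}, \\
\frac{t_i}{|f_i|^p} &\sim \mathrm{Exp}\big(\text{rate } |f_i|^p\big)
  \;\Longrightarrow\;
  \min_i \frac{t_i}{|f_i|^p} \sim \mathrm{Exp}\big(\text{rate } \|f\|_p^p\big)
  \;\overset{d}{=}\; \frac{E}{\|f\|_p^p}, \\
\max_i |z_i| &\overset{d}{=} \frac{\|f\|_p}{E^{1/p}},
\qquad
\Pr\Big[\arg\max_i |z_i| = j\Big] = \frac{|f_j|^p}{\|f\|_p^p}.
\end{align*}
```

The second line uses the standard facts that the minimum of independent exponentials with rates $\lambda_i$ is exponential with rate $\sum_i \lambda_i$, and that index $j$ attains the minimum with probability $\lambda_j / \sum_i \lambda_i$.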

#### The Statistical Test.

To implement Step of the template, our algorithm simply tests whether count-max declares any coordinate to be the maximizer, and we output FAIL if it does not. This approach guarantees that we correctly output the maximizer conditioned on not failing. The primary technical challenge will be to show that, conditioned on for some , the probability of failing the statistical test does not depend on . In other words, conditioning on being the maximum does not change the failure probability. Let be the -th order statistic of (i.e., ). Here the ’s are known as anti-ranks. To analyze the conditional dependence, we must first obtain a closed form for which separates the dependencies on and . Hypothetically, if depended only on , then our statistical test would be completely independent of , in which case we could safely fail whenever such an event occurred. Of course, in reality this is not the case. Consider the vector and . Clearly we expect to be the maximizer, and moreover we expect a gap of between and . On the other hand, if one were told that , it is tempting to think that just barely beat out for its spot as the max, and so would not be far behind. Indeed, this intuition would be correct, and one can show that the probability that conditioned on changes by an additive constant depending on whether or not . Conditioned on this gap being smaller or larger, we are more or less likely (respectively) to output FAIL. In this setting, the probability of conditional failure can change by an factor depending on the value of .

To handle scenarios of this form, our algorithm will utilize an additional linear transformation in Step of the template. Instead of only scaling by the random coefficients , our algorithm first duplicates the coordinates to remove all heavy items from the stream. If is the vector from the example above and is the duplicated vector, then after duplications all copies of the heavy item will have weight at most . By uniformizing the relative weight of the coordinates, this washes out the dependency of on , since after duplications, for any . Notice that this transformation blows up the dimension of by a factor. However, since our space usage is always , the result is only a constant factor increase in the complexity.
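The effect of duplication on the relative weight of a heavy coordinate can be seen in a small example. This is a sketch under assumptions: the helper names and the number of copies (`copies=50`) are chosen for illustration only and are not the parameters used in the paper.

```python
def duplicate(f, copies):
    """Return the vector F in which each coordinate of f appears `copies` times."""
    return [x for x in f for _ in range(copies)]


def original_index(j, copies):
    """Map a coordinate j of the duplicated vector back to its index in f."""
    return j // copies


def max_relative_weight(v, p):
    """Largest value of |v_i|^p / ||v||_p^p over all coordinates."""
    total = sum(abs(x) ** p for x in v)
    return max(abs(x) ** p for x in v) / total


f = [100, 1, 1, 1]  # one heavy coordinate
p = 1
copies = 50         # hypothetical duplication factor
F = duplicate(f, copies)

before = max_relative_weight(f, p)  # heavy item dominates f
after = max_relative_weight(F, p)   # no single copy dominates F

# Total L_p mass of all copies of index 0, relative to ||F||_p^p.
mass_of_index_0 = sum(
    abs(F[j]) ** p for j in range(len(F)) if original_index(j, copies) == 0
) / sum(abs(x) ** p for x in F)
```

Every copy's relative weight shrinks by the duplication factor, yet the combined mass of the copies of each original index is unchanged, so sampling a copy from the duplicated vector and mapping it back with `original_index` targets the same $L_p$ distribution over the original coordinates.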

After duplication, we scale by the coefficients , and the rest of the algorithm proceeds as described above. Using expressions for the order statistics which separate the dependence into the anti-ranks and a set of exponentials independent of the anti-ranks, after duplication we can derive tight concentration of the ’s conditioned on fixed values of the ’s. Using this concentration result, we decompose our count-max data structure into two component variables: one independent of the anti-ranks (the independent component), and a small adversarial noise of relative weight . In order to bound the effect of the adversarial noise on the outcome of our tests we must randomize the threshold for our failure condition and demonstrate the anti-concentration of the resulting distribution over the independent components of . This will demonstrate that with high probability, the result of the statistical test is completely determined by the value of the independent component, which allows us to fail without affecting the conditional probability of outputting .

#### Derandomization.

Now the correctness of our sampler crucially relies on the full independence of the ’s to show that the variable is drawn from precisely the correct distribution (namely, the distribution ). This being the case, we cannot directly implement our algorithm using any method of limited independence. In order to remove the algorithm’s requirement of full independence, we will use a combination of Nisan’s pseudorandom generator [Nis92] and an extension of the recent PRG of [GKM15] which fools certain classes of Fourier transforms. We first use a closer analysis of the seed length Nisan’s generator requires to fool the randomness required for the count-max data structure, which avoids the standard -space blowup which would be incurred by using Nisan’s generator as a black box. Once the count-max has been derandomized, we demonstrate how the PRG of [GKM15] can be used to fool arbitrary functions of -halfspaces, so long as these half-spaces have bounded bit-complexity. We use this result to derandomize the exponential variables with a seed of length , which will allow for the total derandomization of our algorithm for and in the same space.

Our derandomization technique is in fact fairly general, and can be applied to streaming algorithms beyond the sampler in this work. Namely, we demonstrate that any streaming algorithm which stores a linear sketch , where the entries of are independent and can be sampled from with -bits, can be derandomized with only a -factor increase in the space requirements (see Theorem 5). This improves the -blow up incurred from black-box usage of Nisan’s PRG. As an application, we derandomize the count-sketch variant of Minton and Price [MP14] to use -bits of space, which gives improved concentration results for count-sketch when the hash functions are fully-independent. The problem of improving the derandomization of [MP14] beyond the black-box application of Nisan’s PRG was an open problem. We remark that using -bits of space in the classic count sketch of [CCFC02b] has strictly better error guarantees than those obtained from derandomizing [MP14] with Nisan’s PRG to run in the same space. Our derandomization, in contrast, demonstrates a strong improvement on this, obtaining the same bounds with an instead of an factor blowup.

#### Case of p=2.

Recall for , we could show that the norm of the level sets decays geometrically with . More precisely, for any we have with probability . Using this, we actually do not need the tight concentration of the ’s, since we can show that the top coordinates change by at most depending on , and the norm of the remaining coordinates is only an fraction of the whole , and can thus be absorbed into the adversarial noise. For however, each level set contributes weight to , so even for . Therefore, for it is essential that we show concentration of the ’s for nearly all . Since will now be larger than by a factor of with high probability, count-max will only succeed in outputting the largest coordinate when it is an factor larger than expected. This event occurs with probability , so we will need to run the algorithm times in parallel to get constant probability, for a total -bits of space. Using the same -bit Nisan PRG seed for all repetitions, we show that the entire algorithm for can be derandomized to run in -bits of space.

#### Optimizing the Runtime.

In addition to our core sampling algorithm, we show how the linear transformation step to construct can be implemented via a parameterized rounding scheme to improve the update time of the algorithm without affecting the space complexity, giving a run-time/relative sampling error trade-off. By rounding the scaling variables to powers of , we discretize their support to have size . We then simulate the update procedure by sampling from the distribution over updates to our count-max data-structure of duplicating an update and hashing each duplicate independently into . Our simulation utilizes results on efficient generation of binomial random variables, through which we can iteratively reconstruct the updates to bin-by-bin instead of duplicate-by-duplicate. In addition, by using an auxiliary heavy-hitter data structure, we can improve our query time from the naïve to without increasing the space complexity.

#### Estimating the Frequency.

We show that allowing an additional additive bits of space, we can provide an estimate of the outputted frequency with probability when . To achieve this, we use our more general analysis of the contribution of the level sets to , and give concentration bounds on the tail error when the top items are removed. When , for similar reasons as described in the sampling algorithm, we require another factor in the space complexity to obtain a estimate. Finally, we demonstrate an lower bound for this problem, which is nearly tight when . To do so, we adapt a communication problem introduced in [JW13], known as Augmented-Indexing on Large Domains. We weaken the problem so that it need only succeed with constant probability, and then show that the same lower bound still holds. Using a reduction to this problem, we show that our lower bound for samplers holds even if the output index is from a distribution with constant additive error from the true distribution .

## 2 Preliminaries

For , we write to denote the containment . For positive integer , we use to denote the set , and notation to hide terms. For any vector , we write to denote the -th largest coordinate of in absolute value. In other words, . For any , we define to be but with the top coordinates (in absolute value) set equal to . For any , we define to be with the -th coordinate set to . We write to denote the entry-wise absolute value of , so for all . All space bounds stated will be in bits. For our runtime complexity, we assume the unit cost RAM model, where a word of -bits can be operated on in constant time, where is the dimension of the input streaming vector. Finally, we will use notation to hide poly factors; in other words for any constant .

Formally, a data stream is given by an underlying vector , called the frequency vector, which is initialized to . The frequency vector then receives a stream of updates of the form for some and . The update causes the change . For simplicity, we make the common assumption ([BCIW16]) that , though our results generalize naturally to arbitrary . In this paper, we will need Khintchine’s and McDiarmid’s inequalities.

###### Fact 1 (Khintchine inequality [Haa81]).

Let and for i.i.d. random variables uniform on . Then .

###### Fact 2 (McDiarmid’s inequality [McD89]).

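Let $X_1, \dots, X_n$ be independent random variables, and let $\psi(x_1, \dots, x_n)$ be any function that satisfies

$$\sup_{x_1, \dots, x_n, \hat{x}_i} \left| \psi(x_1, x_2, \dots, x_n) - \psi(x_1, \dots, x_{i-1}, \hat{x}_i, x_{i+1}, \dots, x_n) \right| \le c_i \quad \text{for } 1 \le i \le n.$$

Then for any $\epsilon > 0$, we have

$$\Pr\Big[ \big| \psi(X_1, \dots, X_n) - \mathbb{E}[\psi(X_1, \dots, X_n)] \big| \ge \epsilon \Big] \le 2 \exp\left( \frac{-2\epsilon^2}{\sum_{i=1}^n c_i^2} \right).$$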

Our analysis will use stability properties of Gaussian random variables.

###### Definition 2.

A distribution is said to be -stable if whenever are drawn independently, we have

$$\sum_{i=1}^{n} a_i X_i = \|a\|_p X$$

for any fixed vector , where is again distributed as a -stable. In particular, the Gaussian random variables are -stable for (i.e., , where are Gaussian).

### 2.1 Count-Sketch and Count-Max

Our sampling algorithm will utilize a modification of the well-known data structure known as count-sketch (see [CCFC02b] for further details). We now introduce the description of count-sketch which we will use for the remainder of the paper. The count-sketch data structure is a table with rows and columns. When run on a stream , for each row , count-sketch picks a uniform random mapping and . Generally, and need only be -wise independent hash functions, but in this paper we will use fully-independent hash functions (and later relax this condition when derandomizing). Whenever an update to item occurs, count-sketch performs the following updates:

$$A_{i, h_i(v)} \leftarrow A_{i, h_i(v)} + \Delta\, g_i(v) \quad \text{for } i = 1, 2, \dots, d$$

Note that while we will not implement the ’s as explicit hash functions, and instead generate i.i.d. random variables , we will still use the terminology of hash functions. In other words, by hashing the update into the row of count-sketch, we mean that we are updating by . By hashing the coordinate into , we mean updating by for each . Using this terminology, each row of count-sketch corresponds to randomly hashing the indices in into buckets, and then each bucket in the row is a sum of the frequencies of the items which hashed to it multiplied by random signs. In general, count-sketch is used to obtain an estimate vector such that is small. Here the estimate is given for all . This vector satisfies the following guarantee.

###### Theorem 1.

If and , then for a fixed we have with probability . Moreover, if and is any constant, then we have with probability . Furthermore, if we instead set , then the same two bounds above hold replacing with .
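The count-sketch update rule and the usual median-based point query can be sketched as follows. This is a minimal illustration under assumptions: the hash values $h_i(v)$ and signs $g_i(v)$ are stored in explicit tables (mirroring the fully independent setting, not a small-space implementation), the estimate is taken to be the standard median of $g_i(v) \cdot A_{i, h_i(v)}$ over the rows, and the parameters `n`, `d`, `k` are chosen arbitrarily.

```python
import random
from statistics import median


class CountSketch:
    """A d-by-k count-sketch table over the universe {0, ..., n-1}."""

    def __init__(self, n, d, k, seed=0):
        rng = random.Random(seed)
        self.d, self.k = d, k
        self.table = [[0.0] * k for _ in range(d)]
        # h[i][v] is the bucket of item v in row i; g[i][v] is its random sign.
        self.h = [[rng.randrange(k) for _ in range(n)] for _ in range(d)]
        self.g = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(d)]

    def update(self, v, delta):
        """Apply the stream update f_v <- f_v + delta to every row."""
        for i in range(self.d):
            self.table[i][self.h[i][v]] += delta * self.g[i][v]

    def estimate(self, v):
        """Median over rows of the signed bucket that item v hashes to."""
        return median(
            self.g[i][v] * self.table[i][self.h[i][v]] for i in range(self.d)
        )


cs = CountSketch(n=16, d=5, k=4)
cs.update(3, 7)
cs.update(3, -2)           # deletions are handled by linearity
single = cs.estimate(3)    # item 3 is the only nonzero coordinate
```

Because the sketch is linear in the updates, a coordinate with no colliding mass is recovered exactly; in general, the error of each row's estimate comes from the other items hashed to the same bucket.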

In this work, however, we are only interested in determining the index of the heaviest item in , that is . So we utilize a simpler estimation algorithm based on the count-sketch data structure that tests, for a fixed , whether . For analysis purposes, instead of having the ’s be random signs, we draw as i.i.d. Gaussian variables. Then for a fixed , set , and we declare to be the maximizer if . The algorithm computes for all , and outputs the first index that satisfies (there will only be one with high probability). To distinguish this modified querying protocol from the classic count-sketch, we refer to this algorithm as count-max. To refer to the data structure itself, we will use the terms count-sketch and count-max interchangeably.

We will prove our result for the guarantee of count-max in the presence of the following generalization. Before computing the values of and reporting a maximizer as above, we will scale each bucket of count-max by a uniform random variable . This generalization will be used for technical reasons in our analysis of Lemma 3. Namely, we will need it to ensure that our failure threshold of our algorithm is randomized, which will allow us to handle small adversarial error.
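The count-max query described above can be sketched in a few lines. This is an illustrative version under assumptions: the fraction of rows required to declare a maximizer is exposed as a parameter whose default of 0.8 is a guess rather than a value taken from this text, the optional uniform rescaling of buckets is omitted, and hash values and Gaussians are stored in explicit tables.

```python
import random


class CountMax:
    """Count-sketch table with Gaussian coefficients and a max-bucket query."""

    def __init__(self, n, d, k, seed=0):
        rng = random.Random(seed)
        self.n, self.d, self.k = n, d, k
        self.table = [[0.0] * k for _ in range(d)]
        self.h = [[rng.randrange(k) for _ in range(n)] for _ in range(d)]
        # Gaussian coefficients replace the random signs of count-sketch.
        self.g = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(d)]

    def update(self, v, delta):
        """Hash the update (v, delta) into every row of the table."""
        for i in range(self.d):
            self.table[i][self.h[i][v]] += delta * self.g[i][v]

    def query(self, fraction=0.8):
        """Return the first index that lands in the largest-magnitude bucket
        in more than `fraction` of the rows, or None if no index qualifies."""
        max_bucket = []
        for row in self.table:
            magnitudes = [abs(x) for x in row]
            top = max(magnitudes)
            # An all-zero row has no meaningful largest bucket.
            max_bucket.append(magnitudes.index(top) if top > 0 else None)
        for v in range(self.n):
            hits = sum(
                1 for i in range(self.d) if self.h[i][v] == max_bucket[i]
            )
            if hits > fraction * self.d:
                return v
        return None


# One dominant coordinate: index 5 has frequency 1000, all others have 1.
heavy = CountMax(n=32, d=30, k=8, seed=1)
for v in range(32):
    heavy.update(v, 1)
heavy.update(5, 999)
heavy_result = heavy.query()

# A flat vector has no dominant coordinate, so no index should be declared.
flat = CountMax(n=32, d=30, k=8, seed=2)
for v in range(32):
    flat.update(v, 1)
flat_result = flat.query()
```

Returning `None` corresponds to the case in which count-max declares no maximizer, which is the event the sampler's statistical test treats as FAIL.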

###### Lemma 1.

Let be an arbitrarily large constant, set and , and let be an instance of count-max run on using fully independent hash functions and Gaussian random variables . Then with probability the following holds: for every , if then count-max declares to be the maximum, and if , then count-max does not declare to be the maximum. Thus if count-max declares to be the largest coordinate of , it will be correct with high probability. Moreover, this result still holds if each bucket is scaled by a before reporting.

###### Proof.

First suppose , and consider a fixed row of . WLOG hashes to , thus and . By -stability (Definition 2), the probability that is less than the probability that one Gaussian is times larger than another, which can be bounded by by direct computation. Thus hashes into the max bucket in a row of with probability at least , so by Chernoff bounds, taking , with probability we have that is in the largest bucket at least a fraction of the time, which completes the first claim.

Now suppose is not a unique max, and let be such that is maximal. Then conditioned on not hashing to the same bucket, the probability that hashes to a larger bucket than is at most . To see this, note that conditioned on this, one bucket is distributed as and the other as , where are identically distributed random variables. Thus the probability that is in the maximal bucket is at most , and so by Chernoff bounds will hash to strictly less than of the maximal buckets with probability . Union bounding over all gives the desired result. ∎

###### Corollary 1.

In the setting of Lemma 1, with probability , count-max will never report an index as being the maximum if .

###### Proof.

Suppose , and in a given row WLOG hashes to . Then we have and , where is restricted to the coordinates that hash to bucket , and . Since are i.i.d., with probability we have . Conditioned on this, we have . So conditioned on , we have whenever one Gaussian is times larger than another in magnitude, which occurs with probability greater than . So hashes into the max bucket with probability at most , and thus by Chernoff bounds, taking sufficiently large and union bounding over all , will hash into the max bucket at most a fraction of the time with probability , as needed. ∎

## 3 Exponential Order Statistics

In this section, we discuss several useful properties of the order statistics of independent non-identically distributed exponential random variables. Let be independent exponential random variables where has mean (equivalently, has rate ). Recall that is given by the cumulative distribution function . Our main sampling algorithm will require a careful analysis of the distribution of values , which we will now describe. We begin by noting that constant factor scalings of an exponential variable result in another exponential variable.

###### Fact 3 (Scaling of exponentials).

Let $t$ be exponentially distributed with rate $\lambda$, and let $\alpha > 0$. Then $\alpha t$ is exponentially distributed with rate $\lambda / \alpha$.

###### Proof.

The cdf of $\alpha t$ is given by $\Pr[\alpha t < x] = \Pr[t < x/\alpha] = 1 - e^{-\lambda x / \alpha}$, which is the cdf of an exponential with rate $\lambda / \alpha$. ∎
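Applied to the sampler, the scaling fact gives the rate of each rescaled exponential directly. As a worked instance, with notation assumed for illustration ($t$ a rate-1 exponential and $f_i \neq 0$), taking $\lambda = 1$ and $\alpha = 1/|f_i|^p$ yields:

```latex
\Pr\!\left[\frac{t}{|f_i|^p} < x\right]
  = \Pr\!\left[t < |f_i|^p\, x\right]
  = 1 - e^{-|f_i|^p x},
```

so $t / |f_i|^p$ is exponential with rate $|f_i|^p$, and heavier coordinates correspond to exponentials that tend to be smaller.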

We would now like to study the order statistics of the variables , where has rate . To do so, we introduce the anti-rank vector , where for , is a random variable which gives the index of the -th smallest exponential.

Let