Binary sampling from discrete distributions

Hiroyuki Masuyama
Abstract   This paper considers direct sampling methods for discrete target distributions. The inverse transform sampling (ITS) method is one of the most popular direct sampling methods. The main purpose of this paper is to propose a direct sampling algorithm that supersedes the binary-search ITS method (an improvement of the ITS method with binary search). The proposed algorithm is based on binarizing the support set of the target distribution; thus, it is referred to as binary sampling (BS). The BS algorithm consists of two procedures: backward binary sampling (BBS) and forward binary sampling (FBS). The BBS procedure draws a single sample (the first sample) from the target distribution while constructing a one-way random walk on a binary tree for the FBS procedure. By running the random walk, the FBS procedure generates the second and subsequent samples. The BBS and FBS procedures have $O(N)$ and $O(\log N)$ time complexities, respectively, and they also have $O(N)$ space complexity, where $N$ is the cardinality of the support set of the target distribution. Therefore, the time and space complexities of the BS algorithm are equivalent to those of the standard (possibly best) binary-search ITS algorithm. However, the BS algorithm has two advantages over the standard binary-search ITS algorithm. First, the BBS procedure is parallelizable, and thus the total running time of the BS algorithm can be reduced. Second, the BS algorithm is more accurate in terms of the relative rounding error that influences generated samples.

 

This research was supported in part by JSPS KAKENHI Grant Number JP15K00034.

 

H. Masuyama
Email: masuyama@sys.i.kyoto-u.ac.jp
Department of Systems Science, Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan

Keywords: Distribution; Binary tree; Pairwise summation; Parallelizability; Inverse transform sampling

Mathematics Subject Classification: 65C05; 65C10

1 Introduction

In this paper, we consider sampling from discrete target (probability) distributions. Sampling from target distributions is crucial for Monte Carlo methods. Sampling methods can be categorized into two groups: Markov chain Monte Carlo (MCMC) methods (see, e.g., Brooks et al. 2011) and direct sampling methods (i.e., non-MCMC methods; see, e.g., Devroye 1986).

MCMC methods include the Metropolis–Hastings algorithm, Gibbs sampling, slice sampling, etc. Basically, MCMC methods are approximate sampling methods, with the notable exception of “Coupling From The Past (CFTP)” (see, e.g., Huber 2016). The CFTP algorithm achieves exact sampling (or perfect sampling), i.e., it generates samples that exactly (or perfectly) follow the target distribution.

Direct sampling methods achieve exact sampling; they include inverse transform sampling (ITS), acceptance-rejection sampling, importance sampling, etc. These methods are not, in general, suitable for high-dimensional target distributions. However, they do not have to construct Markov chains and are therefore more easily implementable than MCMC methods.

Among these direct sampling methods, we focus on the ITS method (see, e.g., Devroye 1986, Section III.2.1). There are three reasons for this choice: (i) the ITS method is often used to generate samples from proposal distributions in acceptance-rejection sampling and importance sampling; (ii) it does not itself require any proposal distribution; and (iii) it is flexible and easily implementable for discrete target distributions.

It should be noted that the naive algorithm of the ITS method (called the naive ITS algorithm, for short) requires the cumulative distribution function of the target distribution in order to generate samples. Thus, if we know only the probability mass function of the target distribution, we have to compute its cumulative distribution function. This preprocessing has time complexity of $O(N)$, where $O(\cdot)$ represents Big-$O$ notation and $N$ (following the definition introduced later) denotes the cardinality of the support set (called the size, for short) of the target distribution. Furthermore, the naive ITS algorithm takes, at worst, $O(N)$ time to generate a sample by mapping a uniform random number to an element of the support set of the target distribution.

To reduce the running time of this mapping, we can use binary search. For simplicity, we call such an improvement of the ITS method with binary search the binary-search ITS method. The binary-search ITS method admits several algorithms, depending on what type of binary tree is constructed. The standard (and probably best) binary-search ITS algorithm constructs a complete binary tree such that its leaves store the probabilities (masses) of the target distribution and the other nodes (the root and internal nodes) store the sums of the probabilities of the leaves retrieved sequentially by inorder traversal (see Devroye 1986, Section III.2). Although this standard binary-search ITS algorithm generates a sample in $O(\log N)$ time, its preprocessing (constructing the binary tree) has $O(N)$ time complexity and produces $O(N)$ relative rounding error in computing the probabilities stored in the root and internal nodes.

The main contribution of this paper is to propose a direct sampling algorithm that supersedes the binary-search ITS method and, of course, the ITS method. The proposed algorithm is based on binarizing the support set of the target distribution. Hence, we refer to the proposed algorithm as binary sampling (BS). The BS algorithm consists of two procedures: backward binary sampling (BBS) and forward binary sampling (FBS). Although the BBS procedure is the preprocessing of the FBS one, the former generates a single sample while constructing a one-way random walk on a binary tree for the latter, which is achieved by the pairwise summation of the target distribution. By running the one-way random walk, the FBS procedure generates samples.

The BBS and FBS procedures have $O(N)$ and $O(\log N)$ time complexities, respectively, which are equivalent to those of the preprocessing and main processing of the standard binary-search ITS algorithm. It should be noted that the BBS procedure (the preprocessing of the BS algorithm) generates a sample whereas the preprocessing of the standard binary-search ITS algorithm does not. In addition, since the BBS procedure performs the pairwise summation of the target distribution, it causes only $O(\log N)$ relative rounding error and is parallelizable. Therefore, our BS algorithm is more accurate and scalable than the standard binary-search ITS algorithm.

The rest of this paper is divided into four sections. Section 2 presents preliminary results together with basic definitions and notation. Section 3 describes the proposed algorithm, i.e., the BS algorithm. Section 4 compares the BS algorithm with the naive ITS algorithm and the standard binary-search ITS algorithm. Finally, Section 5 considers the adaptability of the BS algorithm to high-dimensional target distributions.

2 Preliminaries

We consider sampling from a target distribution with support set $\mathbb{S} := \{0, 1, \dots, N-1\}$, where $N - 1$ denotes a nonnegative integer, i.e., $N \in \{1, 2, 3, \dots\}$. Let $p := (p(x))_{x \in \mathbb{S}}$ denote the target distribution. Note here that $p(x) > 0$ for all $x \in \mathbb{S}$ and

$\sum_{x \in \mathbb{S}} p(x) = 1.$  (1)

Furthermore, let $n$ denote an integer such that $2^{n-1} < N \le 2^{n}$, or equivalently,

$n = \lceil \log_2 N \rceil.$  (2)

For convenience, we set $p(x) = 0$ for $x = N, N+1, \dots, 2^{n} - 1$.
We need more definitions. For $m = 1, 2, \dots, n$, let $\mathbb{B}_m$ denote

$\mathbb{B}_m = \{ (b_1, b_2, \dots, b_m) : b_i \in \{0, 1\},\ i = 1, 2, \dots, m \},$

where $\mathbb{B}_0 := \{\varnothing\}$ consists of the empty vector $\varnothing$ only. For $m = 1, 2, \dots, n$, let $\sigma_m$ denote a function from $\mathbb{B}_m$ to $\{0, 1, \dots, 2^{m} - 1\}$ such that, for $\boldsymbol{b} = (b_1, b_2, \dots, b_m) \in \mathbb{B}_m$,

$\sigma_m(\boldsymbol{b}) = \sum_{i=1}^{m} b_i 2^{m-i},$

which is equivalent to the binary number $(b_1 b_2 \cdots b_m)_2$. We then define $q_m(\boldsymbol{b})$'s, $\boldsymbol{b} \in \mathbb{B}_m$, $m = n, n-1, \dots, 0$, by the recursion:

$q_n(\boldsymbol{b}) = p(\sigma_n(\boldsymbol{b})), \qquad \boldsymbol{b} \in \mathbb{B}_n,$  (3)

and, for $m = n-1, n-2, \dots, 0$,

$q_m(\boldsymbol{b}) = q_{m+1}((\boldsymbol{b}, 0)) + q_{m+1}((\boldsymbol{b}, 1)), \qquad \boldsymbol{b} \in \mathbb{B}_m.$  (4)

Note here that the computation of the $q_m(\boldsymbol{b})$'s, $m = n-1, n-2, \dots, 0$, is the pairwise summation of the target distribution $p$ (see Fig. 1).

Figure 1: Computation of the $q_m(\boldsymbol{b})$'s by pairwise summation
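To make the recursion (3)–(4) concrete, here is a minimal sketch in Python (ours, not the author's code; the function name pairwise_levels and the list-of-levels layout are our choices). It pads $p$ with zeros up to length $2^n$ and computes every level $q_m$ by pairwise summation:

import math

def pairwise_levels(p):
    """Compute the levels q_m, m = 0, 1, ..., n, by pairwise summation.

    Level n holds the zero-padded target distribution (cf. (3)); each
    lower level is obtained by summing adjacent pairs, as in (4).
    """
    N = len(p)
    n = math.ceil(math.log2(N)) if N > 1 else 0   # (2): 2**(n-1) < N <= 2**n
    level = list(p) + [0.0] * (2**n - N)          # q_n: pad p with zeros
    levels = [level]
    for _ in range(n):                            # m = n-1, n-2, ..., 0
        level = [level[2*i] + level[2*i + 1] for i in range(len(level) // 2)]
        levels.append(level)
    levels.reverse()                              # levels[m] now stores q_m
    return levels

levels = pairwise_levels([0.1, 0.2, 0.3, 0.25, 0.15])
print(levels[0][0])   # q_0 = total mass; 1.0 (up to rounding) if p is normalized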

It follows from (3) and (4) that, for $\boldsymbol{b} \in \mathbb{B}_m$, $m = 0, 1, \dots, n$,

$q_m(\boldsymbol{b}) = \sum_{\boldsymbol{c} \in \mathbb{B}_{n-m}} q_n((\boldsymbol{b}, \boldsymbol{c})),$  (5)

where

$(\boldsymbol{b}, \boldsymbol{c}) = (b_1, \dots, b_m, c_1, \dots, c_{n-m}) \in \mathbb{B}_n$  (6)

denotes the concatenation of $\boldsymbol{b} = (b_1, \dots, b_m)$ and $\boldsymbol{c} = (c_1, \dots, c_{n-m})$.

It also follows from (1) and (3) that $q_n(\boldsymbol{b}) > 0$ if and only if $\sigma_n(\boldsymbol{b}) \in \mathbb{S}$. For later use, let $\mathbb{B}_m^{+}$ denote

$\mathbb{B}_m^{+} = \{ \boldsymbol{b} \in \mathbb{B}_m : q_m(\boldsymbol{b}) > 0 \}, \qquad m = 0, 1, \dots, n.$  (7)

For $\boldsymbol{b} = (b_1, b_2, \dots, b_m) \in \mathbb{B}_m$ and $k = 0, 1, \dots, m$, let $\boldsymbol{b}_{[k]}$ denote

$\boldsymbol{b}_{[k]} = (b_1, b_2, \dots, b_k),$  (8)

where $\boldsymbol{b}_{[0]} = \varnothing$ and $\boldsymbol{b}_{[m]} = \boldsymbol{b}$. Equations (5), (7) and (8) imply that

$\boldsymbol{b} \in \mathbb{B}_m^{+} \ \Longrightarrow\ \boldsymbol{b}_{[k]} \in \mathbb{B}_k^{+} \quad \text{for all } k = 0, 1, \dots, m.$  (9)

We now prove a lemma, which presents a basic idea behind our sampling algorithm.

Lemma 2.1

For $\boldsymbol{b} = (b_1, b_2, \dots, b_n) \in \mathbb{B}_n^{+}$,

$p(\sigma_n(\boldsymbol{b})) = \prod_{m=0}^{n-1} \big( \ell_m(\boldsymbol{b}_{[m]}) \big)^{1 - b_{m+1}} \big( r_m(\boldsymbol{b}_{[m]}) \big)^{b_{m+1}},$  (10)

where the $\ell_m(\boldsymbol{c})$'s and $r_m(\boldsymbol{c})$'s, $\boldsymbol{c} \in \mathbb{B}_m^{+}$, $m = 0, 1, \dots, n-1$, are given by

$\ell_m(\boldsymbol{c}) = \dfrac{q_{m+1}((\boldsymbol{c}, 0))}{q_m(\boldsymbol{c})} = \dfrac{q_{m+1}((\boldsymbol{c}, 0))}{q_{m+1}((\boldsymbol{c}, 0)) + q_{m+1}((\boldsymbol{c}, 1))},$  (11)
$r_m(\boldsymbol{c}) = \dfrac{q_{m+1}((\boldsymbol{c}, 1))}{q_m(\boldsymbol{c})} = 1 - \ell_m(\boldsymbol{c}).$  (12)
Proof.

Fix $\boldsymbol{b} \in \mathbb{B}_n^{+}$ arbitrarily. It then follows from (9), (11) and (12) that the $\ell_m(\boldsymbol{b}_{[m]})$'s and $r_m(\boldsymbol{b}_{[m]})$'s are well-defined for $m = 0, 1, \dots, n-1$. Note here that the second equality of (11) holds due to (4). Note also that (11) and (12) yield

$q_{m+1}((\boldsymbol{c}, 0)) = \ell_m(\boldsymbol{c})\, q_m(\boldsymbol{c}), \qquad q_{m+1}((\boldsymbol{c}, 1)) = r_m(\boldsymbol{c})\, q_m(\boldsymbol{c}),$

and thus

$q_{m+1}(\boldsymbol{b}_{[m+1]}) = \big( \ell_m(\boldsymbol{b}_{[m]}) \big)^{1 - b_{m+1}} \big( r_m(\boldsymbol{b}_{[m]}) \big)^{b_{m+1}} q_m(\boldsymbol{b}_{[m]}).$  (13)

From (3) and (13), we have

$p(\sigma_n(\boldsymbol{b})) = q_n(\boldsymbol{b}) = q_0(\varnothing) \prod_{m=0}^{n-1} \big( \ell_m(\boldsymbol{b}_{[m]}) \big)^{1 - b_{m+1}} \big( r_m(\boldsymbol{b}_{[m]}) \big)^{b_{m+1}},$

which, together with $q_0(\varnothing) = 1$ (due to (1) and (5)), shows that (10) holds. ∎
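Lemma 2.1 can also be checked numerically. The following sketch (ours; it reuses pairwise_levels from the snippet above) multiplies the branch ratios (11)–(12) along the root-to-leaf path and compares the product with the leaf probability, as in (10):

def path_product(levels, x):
    """Multiply the branch probabilities along the root-to-leaf path of x."""
    n = len(levels) - 1
    prod, j = 1.0, 0                      # start at the root
    for m in range(n):
        b = (x >> (n - 1 - m)) & 1        # the (m+1)-th bit of x
        child = 2 * j + b
        prod *= levels[m + 1][child] / levels[m][j]   # (11)-(12)
        j = child
    return prod

p = [0.1, 0.2, 0.3, 0.25, 0.15]
levels = pairwise_levels(p)
print(all(abs(path_product(levels, x) - p[x]) < 1e-12 for x in range(len(p))))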

3 The proposed sampling algorithm: binary sampling

In this section, we describe our sampling algorithm. As mentioned in Section 1, the algorithm consists of two procedures: backward binary sampling (BBS) and forward binary sampling (FBS). In what follows, we provide the details of the BBS and FBS procedures.

The BBS procedure is the preprocessing of the FBS procedure. The BBS procedure computes the probabilities $q_m(\boldsymbol{b})$, $\boldsymbol{b} \in \mathbb{B}_m$, $m = n, n-1, \dots, 1$, by the pairwise summation of the target distribution $p$. Using the computed probabilities, the BBS procedure constructs a one-way random walk on a binary tree, which is used by the FBS procedure.

To describe this one-way random walk, we introduce some definitions. Let $\mathbb{T}$ denote

$\mathbb{T} = \bigcup_{m=0}^{n} \mathbb{B}_m^{+},$

where $\mathbb{B}_0^{+} = \mathbb{B}_0 = \{\varnothing\}$. Let $\boldsymbol{W} := \{W_m;\ m = 0, 1, \dots, n\}$ denote a random walk with state space $\mathbb{T}$, which evolves according to the following law:

$\mathsf{P}(W_0 = \varnothing) = 1,$  (14)

and, for $m = 0, 1, \dots, n-1$, $\boldsymbol{c} \in \mathbb{B}_m^{+}$ and $b \in \{0, 1\}$,

$\mathsf{P}\big( W_{m+1} = (\boldsymbol{c}, b) \mid W_m = \boldsymbol{c} \big) = \big( \ell_m(\boldsymbol{c}) \big)^{1-b} \big( r_m(\boldsymbol{c}) \big)^{b},$  (15)

where $(\boldsymbol{c}, b) = (c_1, c_2, \dots, c_m, b)$, and where $\ell_m(\boldsymbol{c})$ and $r_m(\boldsymbol{c})$ are easily calculated by (11) and (12) with the stored probabilities $q_{m+1}((\boldsymbol{c}, 0))$ and $q_{m+1}((\boldsymbol{c}, 1))$.

The state space $\mathbb{T}$ of the one-way random walk is considered a binary tree such that the root is labeled with $\varnothing$ and every other node has a label in binary-vector form that consists of its parent's label and one additional binary digit, where “0” and “1” correspond to left and right children, respectively. For example, the left and right children (if any) of node $\boldsymbol{c}$ are labeled with $(\boldsymbol{c}, 0)$ and $(\boldsymbol{c}, 1)$, respectively (see Fig. 2).

Figure 2: One-way random walk on a binary tree for the FBS procedure

In this perspective, the one-way random walk starts from the root of the binary tree, moves down according to the transition probabilities $\ell_m(\cdot)$'s and $r_m(\cdot)$'s, and ends at one of the leaves.

From Lemma 2.1, we have the following result.

Lemma 3.1

For $\boldsymbol{b} \in \mathbb{B}_n^{+}$, $\mathsf{P}(W_n = \boldsymbol{b}) = p(\sigma_n(\boldsymbol{b}))$.

Proof.

It follows from (14), (15) and Lemma 2.1 that, for $\boldsymbol{b} \in \mathbb{B}_n^{+}$,

$\mathsf{P}(W_n = \boldsymbol{b}) = \prod_{m=0}^{n-1} \mathsf{P}\big( W_{m+1} = \boldsymbol{b}_{[m+1]} \mid W_m = \boldsymbol{b}_{[m]} \big) = \prod_{m=0}^{n-1} \big( \ell_m(\boldsymbol{b}_{[m]}) \big)^{1 - b_{m+1}} \big( r_m(\boldsymbol{b}_{[m]}) \big)^{b_{m+1}} = p(\sigma_n(\boldsymbol{b})),$

which completes the proof. ∎

Lemma 3.1 implies that the one-way random walk generates samples following the target distribution. Indeed, the FBS procedure achieves such sampling by choosing the values of the $W_m$'s in the forward order, i.e., in the order $W_1, W_2, \dots, W_n$. The way of choosing the $W_m$'s is such that

$W_{m+1} = (W_m, \beta_{m+1}), \qquad m = 0, 1, \dots, n-1,$  (16)

where $W_0 = \varnothing$ and the $\beta_{m+1}$'s are binary random numbers such that

$\mathsf{P}(\beta_{m+1} = 1 \mid W_m) = r_m(W_m) = 1 - \mathsf{P}(\beta_{m+1} = 0 \mid W_m).$  (17)

The obtained vector $W_n$ is converted to the integer $\sigma_n(W_n)$, which is a sample from the target distribution. The description of the FBS procedure is summarized in Procedure 1.

Procedure 1 (FBS: Forward binary sampling)

Input: The probabilities $q_m(\boldsymbol{b})$'s stored by the BBS procedure (Procedure 2)

Output: Sample from $p$

  1. For $m = 0, 1, \dots, n-1$, choose $W_{m+1}$ by (16).

  2. Return $\sigma_n(W_n)$.
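The following is a minimal sketch of Procedure 1 (ours; it assumes the levels $q_m$ are stored as returned by pairwise_levels above, with the node $\boldsymbol{b}$ at depth $m$ stored at index $\sigma_m(\boldsymbol{b})$):

import random

def fbs(levels, rng=random):
    """FBS (Procedure 1): walk from the root down to a leaf."""
    j = 0                                  # index of the root within its level
    n = len(levels) - 1
    for m in range(n):
        left, right = levels[m + 1][2*j], levels[m + 1][2*j + 1]
        # Move to the left child with probability l_m = left / (left + right),
        # cf. (16)-(17); the strict "<" keeps the walk away from zero-mass nodes.
        j = 2*j if rng.random() * (left + right) < left else 2*j + 1
    return j                               # sigma_n(W_n): the path read as an integer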

The following theorem is an immediate consequence of Lemma 3.1 and Procedure 1. Thus, we omit its proof.

Theorem 3.1

The FBS procedure generates samples following the target distribution $p$.
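As a quick empirical check of Theorem 3.1 (our sketch, reusing fbs and the levels built above), the relative frequencies of repeated FBS calls should approach the target probabilities:

from collections import Counter

counts = Counter(fbs(levels) for _ in range(100_000))
print([counts[x] / 100_000 for x in range(5)])
# Close to [0.1, 0.2, 0.3, 0.25, 0.15] for the example distribution above.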

The FBS procedure uses the one-way random walk $\boldsymbol{W}$, which is constructed by the BBS procedure. It is remarkable that the BBS procedure not only constructs this random walk but also draws a sample from the target distribution $p$. More specifically, the BBS procedure draws a binary vector from $\mathbb{B}_n^{+}$ by choosing branching bits as in (16)–(17) in the backward order, i.e., in the order of $m = n-1, n-2, \dots, 0$. The description of the BBS procedure is summarized in Procedure 2. In addition, Fig. 3 provides an example of the behavior of the BBS procedure.

Procedure 2 (BBS: Backward binary sampling)

Input: Target distribution $p$

Output: Sample from $p$ and the probabilities $q_m(\boldsymbol{b})$'s (i.e., the one-way random walk $\boldsymbol{W}$)

  (i) Set $\mathbb{X} = \mathbb{B}_n^{+}$.

  (ii) For $m = n-1, n-2, \dots, 0$, execute the following iteration.

    Iteration:

    For each $\boldsymbol{c} \in \mathbb{B}_m$, perform Steps (a)–(c):

    (a) If $m = n-1$, store the probabilities

      $q_n((\boldsymbol{c}, 0)) = p(\sigma_n((\boldsymbol{c}, 0))), \qquad q_n((\boldsymbol{c}, 1)) = p(\sigma_n((\boldsymbol{c}, 1)));$

      otherwise (i.e., if $m < n-1$) compute the probabilities $q_{m+1}((\boldsymbol{c}, b))$'s, $b \in \{0, 1\}$, by (4) and store the results;

    (b) if $q_m(\boldsymbol{c}) > 0$, choose the value of the branching bit $\beta_{m+1}(\boldsymbol{c}) \in \{0, 1\}$ as in (16)–(17), i.e., with probabilities $\ell_m(\boldsymbol{c})$ and $r_m(\boldsymbol{c})$ for “0” and “1”, respectively (otherwise skip Steps (b) and (c)); and

    (c) delete from $\mathbb{X}$ the vector (if any) whose first $m+1$ elements are $(\boldsymbol{c}, 1 - \beta_{m+1}(\boldsymbol{c}))$.

  (iii) Return $\sigma_n(\boldsymbol{b}^{*})$, where $\boldsymbol{b}^{*}$ is the (unique) element of $\mathbb{X}$.

Figure 3: Example of the behavior of the BBS procedure
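A compact way to realize Procedure 2 is to run the pairwise summation and the coin tosses in the same sweep, keeping one candidate leaf per node (a sketch of ours; the candidate bookkeeping mirrors Steps (a)–(c)):

import math
import random

def bbs(p, rng=random):
    """BBS (Procedure 2): pairwise summation plus per-merge coin tosses.

    Returns the surviving candidate leaf (a sample from p) together with
    all levels q_m, which the FBS procedure reuses afterwards.
    """
    N = len(p)
    n = math.ceil(math.log2(N)) if N > 1 else 0
    level = list(p) + [0.0] * (2**n - N)
    cand = list(range(2**n))            # one candidate leaf per node
    levels = [level]
    while len(level) > 1:               # iterations m = n-1, ..., 0
        nxt, nxt_cand = [], []
        for i in range(len(level) // 2):
            left, right = level[2*i], level[2*i + 1]
            nxt.append(left + right)    # Step (a): (4), one pairwise sum
            # Steps (b)-(c): keep the left candidate with probability l_m.
            keep_left = rng.random() * (left + right) < left
            nxt_cand.append(cand[2*i] if keep_left else cand[2*i + 1])
        level, cand = nxt, nxt_cand
        levels.append(level)
    levels.reverse()
    return cand[0], levels              # first sample and the tree for FBS

sample, levels = bbs([0.1, 0.2, 0.3, 0.25, 0.15])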

The following theorem guarantees that the BBS procedure (i.e., Procedure 2) works well. The proof of this theorem is given in the Appendix.

Theorem 3.2

Steps (i) and (ii) of the BBS procedure result in the set $\mathbb{X}$ consisting of only one element, $\boldsymbol{b}^{*}$ say. Furthermore, Step (iii) of the BBS procedure returns $\sigma_n(\boldsymbol{b}^{*}) = x$ with probability $p(x)$ for each $x \in \mathbb{S}$.

Remark 3.1

In each iteration of Step (ii), the BBS procedure selects, by coin tosses (appropriately biased in each selection), the candidates for the binary expression of a possible sample from the target distribution (see Fig. 3). Theorem 3.2 implies that whichever candidate survives the whole process of selections is a correctly distributed sample. Therefore, the coin tosses within one iteration need not be independent, though those in different iterations must be independent.

Remark 3.2

The BBS procedure works well even though the target distribution $p$ is not normalized, that is, $p$ is expressed as

$p(x) = C g(x), \qquad x \in \mathbb{S},$  (18)

where $C$ is an unknown positive constant and $(g(x))_{x \in \mathbb{S}}$ is a given sequence of positive numbers. In such a case, we define

$g(x) := 0, \qquad x = N, N+1, \dots, 2^{n} - 1,$

and compute, for $\boldsymbol{b} \in \mathbb{B}_m$ and $m = n, n-1, \dots, 0$,

$\tilde{q}_m(\boldsymbol{b})$  (19)

by the recursion (3) and (4) with the $p(x)$'s replaced by the $g(x)$'s. Note that

$q_m(\boldsymbol{b}) = C\, \tilde{q}_m(\boldsymbol{b}), \qquad \boldsymbol{b} \in \mathbb{B}_m,$  (20)

which follows from (5), (18) and (19). We then calculate, for $\boldsymbol{c} \in \mathbb{B}_m$ with $\tilde{q}_m(\boldsymbol{c}) > 0$ and $m = 0, 1, \dots, n-1$,

$\tilde{\ell}_m(\boldsymbol{c}) = \dfrac{\tilde{q}_{m+1}((\boldsymbol{c}, 0))}{\tilde{q}_m(\boldsymbol{c})}, \qquad \tilde{r}_m(\boldsymbol{c}) = \dfrac{\tilde{q}_{m+1}((\boldsymbol{c}, 1))}{\tilde{q}_m(\boldsymbol{c})}.$  (21)

Equations (11), (20) and (21) show that $\tilde{\ell}_m(\boldsymbol{c}) = \ell_m(\boldsymbol{c})$ and $\tilde{r}_m(\boldsymbol{c}) = r_m(\boldsymbol{c})$ for all $\boldsymbol{c}$ and $m$.
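In code, Remark 3.2 simply means that the sketches above may be fed unnormalized weights, since the ratios in (21) cancel the unknown constant $C$:

# Same law as bbs([0.1, 0.2, 0.3, 0.25, 0.15]): the weights below are the
# same probabilities scaled by an (unknown) factor of 10.
sample, levels = bbs([1.0, 2.0, 3.0, 2.5, 1.5])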

We are now ready to describe our BS algorithm, which is summarized in Algorithm 1.

Algorithm 1 (BS: Binary sampling)

Input: Target distribution $p$
Output: Samples from $p$

  1. Perform the BBS procedure (Procedure 2) once, which generates a sample.

  2. Repeat the FBS procedure (Procedure 1) as many times as necessary.

  3. Return the generated samples.

In the rest of this section, we discuss the performance of Algorithm 1. Clearly, the time complexity of the BBS procedure is dominated by Step (ii) of Procedure 2, where the probabilities

$q_m(\boldsymbol{b}), \qquad \boldsymbol{b} \in \mathbb{B}_m,\ m = 1, 2, \dots, n,$  (22)

are computed (if necessary) and stored. The total number of these probabilities is $\sum_{m=1}^{n} 2^{m} = 2^{n+1} - 2 < 4N$. Thus, the time and space complexities of the BBS procedure are $O(N)$. It should be noted that the probabilities in (22) are computed by pairwise summation. Therefore, the computation of these probabilities is parallelizable, and the results include only $O(\log N)$ relative rounding error (see, e.g., Higham 1993). On the other hand, the FBS procedure determines the values of $W_1, W_2, \dots, W_n$ by running the one-way random walk $\boldsymbol{W}$. Thus, the FBS procedure has time complexity of

$O(n) = O(\log N),$

where the equality follows from (2). The FBS procedure also has $O(N)$ space complexity for the probabilities in (22) and the values of $W_1, W_2, \dots, W_n$.

As a result, the BS algorithm has $O(N)$ time and space complexities, though this algorithm generates the first sample in $O(N)$ time and the second and subsequent samples in $O(\log N)$ time. The obtained samples are influenced by only $O(\log N)$ relative rounding error. Finally, Table 1 summarizes the performance of the BS algorithm.

                                   BS
                          BBS           FBS
Time complexity           O(N)          O(log N)
Space complexity          O(N)          O(N)
Relative rounding error   O(log N)      O(log N)

Table 1: Performance of the BS algorithm
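The rounding-error entries in Table 1 can be illustrated numerically. The sketch below (ours, using math.fsum as a correctly rounded reference) compares sequential summation, as used by the ITS preprocessing discussed in Section 4, with the pairwise summation used by the BBS procedure:

import math
import random

xs = [random.random() for _ in range(1 << 20)]
exact = math.fsum(xs)                  # correctly rounded reference sum

seq = 0.0
for x in xs:                           # sequential: error grows like O(N)
    seq += x

def pairwise(v):                       # pairwise: error grows like O(log N)
    if len(v) <= 2:
        return sum(v)
    mid = len(v) // 2
    return pairwise(v[:mid]) + pairwise(v[mid:])

print(abs(seq - exact) / exact, abs(pairwise(xs) - exact) / exact)
# The pairwise relative error is typically orders of magnitude smaller.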

4 Comparison with ITS algorithms

In this section, we compare our BS algorithm with two ITS algorithms: (a) the naive ITS algorithm; and (b) the standard binary-search ITS algorithm. To this end, we define $F := (F(x))_{x \in \mathbb{S}}$ as the cumulative distribution function of the target distribution $p$, i.e.,

$F(x) = \sum_{y=0}^{x} p(y), \qquad x \in \mathbb{S}.$

We then assume that no explicit expression of the cumulative distribution function $F$ is given, which implies that we have to compute $F$, or its equivalent, in order to perform the ITS method.

4.1 Naive ITS

We begin with the description of the naive ITS algorithm, which is summarized in Algorithm 2.

Algorithm 2 (Naive ITS)

Input: Target distribution $p$
Output: Sample from $p$

  (i) Set $F(-1) = 0$, and, for $x = 0, 1, \dots, N-1$, compute and store $F(x) = F(x-1) + p(x)$.

  (ii) Generate a uniform random number $U$ in $(0, 1)$.

  (iii) Return the $x \in \mathbb{S}$ such that $F(x-1) < U \le F(x)$, scanning $x = 0, 1, 2, \dots$ in increasing order.
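A minimal Python sketch of Algorithm 2 (ours):

import itertools
import random

def naive_its(p, rng=random):
    """Naive ITS (Algorithm 2): linear scan over the cumulative sums."""
    cdf = list(itertools.accumulate(p))    # Step (i): O(N) preprocessing
    u = rng.random()                       # Step (ii)
    for x, f in enumerate(cdf):            # Step (iii): first x with U <= F(x)
        if u <= f:
            return x
    return len(p) - 1                      # guard against rounding in F(N-1)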

Step (i) of Algorithm 2 is the preprocessing step of the naive ITS algorithm, which has $O(N)$ time and space complexities, and produces $O(N)$ relative rounding error. Step (iii) of Algorithm 2 is the main processing of the naive ITS algorithm, which is equivalent to identifying the $x \in \mathbb{S}$ such that

$F(x-1) < U \le F(x).$  (23)

Therefore, the average time complexity of the main processing, denoted by $T_{\mathrm{naive}}$, is given by

$T_{\mathrm{naive}} = O\!\left( \sum_{x \in \mathbb{S}} (x+1)\, p(x) \right) = O(1 + \mathsf{E}[X]),$  (24)

where $X$ denotes a random variable distributed according to $p$. By definition, $\mathsf{E}[X] \le N - 1$ and thus $T_{\mathrm{naive}} = O(N)$. Indeed, Examples 4.1–4.3 below show that $T_{\mathrm{naive}}$ ranges from $O(1)$ to $O(N)$.

Example 4.1

Suppose that

$p(x) = \dfrac{2^{-(x+1)}}{1 - 2^{-N}}, \qquad x \in \mathbb{S},$

with $N \ge 1$. In this artificial case, $\mathsf{E}[X] \le (1 - 2^{-N})^{-1} \le 2$ and thus

$T_{\mathrm{naive}} = O(1).$

Furthermore, $\mathsf{E}[X] \to 1$ as $N \to \infty$.

Example 4.2

Suppose that $p$ is a Zipf distribution with index two, i.e.,

$p(x) = \dfrac{(x+1)^{-2}}{\sum_{y \in \mathbb{S}} (y+1)^{-2}}, \qquad x \in \mathbb{S}.$

We then have

$\sum_{x \in \mathbb{S}} (x+1)\, p(x) = \dfrac{\sum_{x=1}^{N} x^{-1}}{\sum_{y=1}^{N} y^{-2}},$

which leads to

$\sum_{x \in \mathbb{S}} (x+1)\, p(x) \sim \dfrac{\log N}{\zeta(2)}, \qquad N \to \infty,$

where $\zeta$ is the Riemann zeta function. Therefore, $T_{\mathrm{naive}} = O(\log N)$.

Example 4.3

Suppose that $p$ is a binomial distribution with parameters $N-1$ and $q \in (0, 1)$, i.e.,

$p(x) = \binom{N-1}{x} q^{x} (1-q)^{N-1-x}, \qquad x \in \mathbb{S}.$

We then have $\mathsf{E}[X] = (N-1)q$ and thus $T_{\mathrm{naive}} = O(N)$.

Based on the above discussion, the performance of the naive ITS algorithm is summarized in Table 2.

Average time complexity   O(1) at best; O(N) at worst
Space complexity          O(N)
Relative rounding error   O(N)

Table 2: Performance of the naive ITS algorithm

Tables 1 and 2 show that our BS algorithm has space complexity of the same order as that of the naive ITS algorithm. The tables also show that our BS algorithm is much more accurate than the naive ITS algorithm in terms of the relative rounding error in cumulating the target distribution: $O(\log N)$ versus $O(N)$. As for the efficiency of generating samples, the BBS procedure of our BS algorithm generates a sample whereas its counterpart in the naive ITS algorithm (i.e., Step (i) of Algorithm 2) does not. In addition, the FBS procedure of the BS algorithm generates the second and subsequent samples in $O(\log N)$ time. Therefore, the BS algorithm generally achieves high performance. On the other hand, the naive ITS algorithm can achieve extremely high performance in some cases, such as Example 4.1.

Examples 4.1–4.3 imply that a nonincreasing $p$ is basically convenient for the naive ITS algorithm. We now consider the suitability of the naive ITS algorithm for nondecreasing target distributions. For this purpose, we suppose that

$p(x) = \dfrac{h(x)}{\sum_{y \in \mathbb{S}} h(y)}, \qquad x \in \mathbb{S},$

where $h$ is nondecreasing. In this case,

$\mathsf{E}[X] \ge \dfrac{N-1}{2}.$  (25)

Substituting (25) into (24) yields $T_{\mathrm{naive}} = O(N)$. Thus, it may seem that a nondecreasing $p$ is inconvenient for the naive ITS algorithm. In fact, this is not necessarily the case. It should be noted that (23) is equivalent to

$1 - F(x) \le 1 - U < 1 - F(x-1).$  (26)

Using (26), we can perform Step (iii) of Algorithm 2 by scanning the support in decreasing order of $x$, whose time complexity is given by

$T'_{\mathrm{naive}} = O\!\left( \sum_{x \in \mathbb{S}} (N - x)\, p(x) \right) = O(N - \mathsf{E}[X]).$  (27)

From (25) and (27), we have $T'_{\mathrm{naive}} = O(N - \mathsf{E}[X])$, which is small whenever the mass of $p$ concentrates near $x = N-1$.

Consequently, the naive ITS algorithm is expected to achieve high performance for monotone target distributions. Of course, the target distribution is not in general monotone. In such a general case, we can sort the target distribution by an appropriate sorting algorithm, e.g., heap sort, though this preprocessing takes $O(N \log N)$ time. Note that, in $O(N \log N)$ time, our BS algorithm generates $\Theta(N)$ samples, because the time complexities of the BBS and FBS procedures are $O(N)$ and $O(\log N)$, respectively (see Table 1). Thus, the combination of the naive ITS algorithm and sorting is not competitive with our BS algorithm.

4.2 Binary-search ITS

Instead of sorting, there is a technique that reduces the running time of generating a sample by the ITS method; more specifically, one that efficiently maps a uniform random number to an element of the support set of the target distribution. As mentioned in the introduction, such an efficient mapping is achieved by binary search. We refer to the combination of the ITS method and binary search as the binary-search ITS method. The binary-search ITS method is realized by several algorithms, depending on what type of binary tree is constructed for the mapping procedure. The standard construction of such binary trees is described in Procedure 3.

Procedure 3

(Standard construction of a binary tree for binary-search ITS)

Input: Target distribution $p$
Output: Complete binary tree $\mathcal{T}$

Construct a complete binary tree $\mathcal{T}$ from the labels in $\{0, 1, \dots, 2^{n+1} - 2\}$ such that

  1. the root of this tree is labeled with zero;

  2. for $k = 1, 2, \dots, 2^{n+1} - 2$, an internal node (not the root or a leaf) with label $k$ has a parent with label $\lfloor (k-1)/2 \rfloor$, and has children with labels $2k+1$ and $2k+2$, where a left child has a smaller label than its paired right child;

  3. for $x \in \mathbb{S}$, the probability $p(x)$ is assigned to the node with label $2^{n} - 1 + x$ (the remaining leaves, if any, are assigned probability zero); and

  4. each of all the nodes, except the leaves, stores the sum of the probabilities assigned to the leaves visited before the present node in the inorder traversal.

It should be noted that, although the size of the target distribution $p$ is equal to $N$, Procedure 3 constructs a complete binary tree with $2^{n+1} - 1$ nodes. The last $2^{n}$ nodes (which are all leaves) correspond to the elements of the zero-padded support set of the target distribution $p$, and, for each $x \in \mathbb{S}$, node $2^{n} - 1 + x$ (the node with label $2^{n} - 1 + x$) stores the probability $p(x)$. On the other hand, the first $2^{n} - 1$ nodes are the root and internal nodes, and each of them stores the sum of the probabilities $p(x)$'s retrieved from the leaves visited before it according to the inorder traversal. Fig. 4 provides a simple example of a complete binary tree for the binary-search ITS method, where the inorder visiting order of the nodes is also indicated.

Figure 4: Example of a complete binary tree for the binary-search ITS method
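For concreteness, the following sketch builds such a tree as a flat array (our illustration, under the zero-padded reading of Procedure 3 described above; heap labels, with node $k$ having children $2k+1$ and $2k+2$). Note that the sequential accumulation along the inorder traversal is precisely the source of the $O(N)$ rounding error discussed below:

import math

def build_its_tree(p):
    """Procedure 3 (sketch): heap-labeled complete binary tree.

    Leaf (2**n - 1) + x stores p(x) (zero for padded x >= N); every
    non-leaf node stores the sum of the probabilities of the leaves
    visited before it in the inorder traversal.
    """
    N = len(p)
    n = math.ceil(math.log2(N)) if N > 1 else 0
    L = 2 ** n                              # number of leaves
    tree = [0.0] * (2 * L - 1)
    for x in range(N):
        tree[L - 1 + x] = p[x]

    prefix = 0.0
    def inorder(k):
        nonlocal prefix
        if k >= L - 1:                      # leaf: accumulate its probability
            prefix += tree[k]
        else:                               # children are 2k+1 and 2k+2
            inorder(2 * k + 1)
            tree[k] = prefix                # sum of the leaves visited so far
            inorder(2 * k + 2)

    inorder(0)
    return tree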

Algorithm 3 below describes the standard binary-search ITS algorithm based on Procedure 3.

Algorithm 3 (Standard binary-search ITS)

Input: Complete binary tree $\mathcal{T}$ from Procedure 3
Output: Sample from $p$

  (i) Generate a uniform random number $U$ in $(0, 1)$, and then repeat the following operation, starting from the root and ending at one of the leaves.

    (a) If $U$ is not greater than the probability stored in the current node, move to its left child;

    (b) otherwise move to its right child.

  (ii) Return $k - (2^{n} - 1)$, where $k$ is the label of the leaf arrived at through Step (i).
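And a sketch of the descent in Algorithm 3 (ours; the final clamp guards against the rare event that rounding pushes $U$ past the stored totals into a padded zero leaf):

import random

def binary_search_its(tree, N, rng=random):
    """Algorithm 3 (sketch): descend from the root using the stored sums."""
    u = rng.random()
    L = (len(tree) + 1) // 2          # number of leaves (a power of two)
    k = 0
    while k < L - 1:                  # repeat until a leaf is reached
        k = 2 * k + 1 if u <= tree[k] else 2 * k + 2
    return min(k - (L - 1), N - 1)

tree = build_its_tree([0.1, 0.2, 0.3, 0.25, 0.15])
print(binary_search_its(tree, 5))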

We consider the performance of the standard binary-search ITS algorithm, i.e., Algorithm 3 together with Procedure 3. Procedure 3 is the preprocessing of Algorithm 3, and this procedure adds up, one by one, the probabilities $p(x)$'s retrieved according to the inorder traversal. Thus, the procedure has $O(N)$ time and space complexities. The procedure also causes $O(N)$ relative rounding error, which influences the accuracy of samples generated by Algorithm 3. Algorithm 3, as well as Procedure 3, needs $O(N)$ space to keep the complete binary tree $\mathcal{T}$. The time complexity of Algorithm 3 is $O(\log N)$, because the complete binary tree $\mathcal{T}$ has depth $n = \lceil \log_2 N \rceil$. As a result, the performance of the standard binary-search ITS algorithm is summarized in Table 3.

                          Standard binary-search ITS
                          Procedure 3   Algorithm 3
Time complexity           O(N)          O(log N)
Space complexity          O(N)          O(N)
Relative rounding error   O(N)          O(N)

Table 3: Performance of the standard binary-search ITS algorithm

Tables 1 and 3 show that our BS algorithm has time and space complexities of the same order as those of the standard binary-search ITS algorithm. In the two algorithms, the most costly parts are their preprocessing. However, the preprocessing of our BS algorithm (i.e., Procedure 2) generates a sample, and it is parallelizable and thus scalable. These features do not appear in the standard binary-search ITS algorithm. In addition, our BS algorithm has a significant advantage over the standard binary-search ITS algorithm in terms of relative rounding error.

Remark 4.1

It is stated in Devroye (1986, Section III.2) that the Huffman tree is optimal for the binary-search ITS method in the sense that the Huffman tree minimizes the average running time of mapping a uniform random number to an element of the support set of the target distribution. In fact, the binary-search ITS method using the Huffman tree performs such mapping in $O(1 + \log(1 + \mathsf{E}[X]))$ time on average, where $\mathsf{E}[X]$ is the mean of the target distribution (for details, see Devroye 1986, Section III.2, Theorem 2.1). Unfortunately, we need $O(N \log N)$ time to construct the Huffman tree for the binary-search ITS method. Therefore, the binary search by the Huffman tree, as well as the sorting of the target distribution, is not a good strategy for improving the ITS method.

5 Adaptability to multidimensional distributions

In this section, we discuss the adaptability of our BS algorithm to multidimensional target distributions. Let $\mathbb{S}^{(d)}$ denote

$\mathbb{S}^{(d)} = \prod_{i=1}^{d} \{0, 1, \dots, N_i - 1\},$

where $d$ is a positive integer and the $N_i$'s, $i = 1, 2, \dots, d$, are positive integers. Let $\boldsymbol{x} = (x_1, x_2, \dots, x_d)$ denote a vector in $\mathbb{S}^{(d)}$. We then define $p^{(d)} := (p^{(d)}(\boldsymbol{x}))_{\boldsymbol{x} \in \mathbb{S}^{(d)}}$ as a $d$-dimensional target distribution.

To draw a sample from this $d$-dimensional target distribution $p^{(d)}$, we transform it into a one-dimensional distribution $p = (p(x))$ such that

$p(x) = p^{(d)}(\boldsymbol{x}), \qquad \boldsymbol{x} \in \mathbb{S}^{(d)},$

where

$x = \sum_{i=1}^{d} x_i \prod_{j=i+1}^{d} N_j \in \{0, 1, \dots, N_1 N_2 \cdots N_d - 1\}.$

In this setting, we can obtain samples $x^{(k)}$'s, $k = 1, 2, \dots$, from the transformed target distribution $p$ by the BS algorithm. We then convert the obtained samples $x^{(k)}$'s to $d$-dimensional vectors $\boldsymbol{x}^{(k)} = (x_1^{(k)}, \dots, x_d^{(k)})$'s satisfying

$x^{(k)} = \sum_{i=1}^{d} x_i^{(k)} \prod_{j=i+1}^{d} N_j.$

This operation can be implemented regardless of the dimension $d$ of the target distribution. Nevertheless, the BS algorithm, as well as the ITS method, cannot escape from the “Curse of Dimensionality” because its total time complexity is $O(N_1 N_2 \cdots N_d)$.
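One way to implement the transformation between $\boldsymbol{x}$ and $x$ (a sketch of ours; the row-major index matches the formula above, but any bijection works):

def flatten(x, dims):
    """Map a d-dimensional index x to its one-dimensional (row-major) index."""
    k = 0
    for xi, Ni in zip(x, dims):
        k = k * Ni + xi
    return k

def unflatten(k, dims):
    """Invert flatten(): recover the d-dimensional index from k."""
    x = []
    for Ni in reversed(dims):
        k, r = divmod(k, Ni)
        x.append(r)
    return tuple(reversed(x))

assert unflatten(flatten((1, 2, 3), (4, 5, 6)), (4, 5, 6)) == (1, 2, 3)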

In what follows, we present a brief discussion of approximate sampling by the BS algorithm, which could be a solution to the “Curse of Dimensionality” in some “lucky” cases. We assume that the support set $\mathbb{S}^{(d)}$ of $p^{(d)}$ is possibly infinite. We also assume that $p^{(d)}$ denotes a probability distribution such that

$p^{(d)}(\boldsymbol{x}) = C g(\boldsymbol{x}), \qquad \boldsymbol{x} \in \mathbb{S}^{(d)},$  (28)

where $C$ is an unknown positive constant and $g$ is a given function such that $g(\boldsymbol{x})$ is easily calculated for all $\boldsymbol{x} \in \mathbb{S}^{(d)}$. For any finite $\mathbb{F} \subseteq \mathbb{S}^{(d)}$, we define $p_{\mathbb{F}} := (p_{\mathbb{F}}(\boldsymbol{x}))_{\boldsymbol{x} \in \mathbb{F}}$ as a finite discrete distribution such that

$p_{\mathbb{F}}(\boldsymbol{x}) = \dfrac{g(\boldsymbol{x})}{G(\mathbb{F})}, \qquad \boldsymbol{x} \in \mathbb{F},$  (29)

where

$G(\mathbb{F}) = \sum_{\boldsymbol{x} \in \mathbb{F}} g(\boldsymbol{x}).$  (30)

According to Remark 3.2, we can draw samples from the finite distribution $p_{\mathbb{F}}$ by applying the BS algorithm to $(g(\boldsymbol{x}))_{\boldsymbol{x} \in \mathbb{F}}$. Note that the distribution $p_{\mathbb{F}}$ can be considered an approximation to the distribution $p^{(d)}$. Therefore, we can say that the samples from the distribution $p_{\mathbb{F}}$ are approximations of those from the distribution $p^{(d)}$.

To evaluate this approximate sampling, we estimate the total variation distance between $p^{(d)}$ and $p_{\mathbb{F}}$, denoted by $\| p^{(d)} - p_{\mathbb{F}} \|_{\mathrm{TV}}$, i.e.,

$\| p^{(d)} - p_{\mathbb{F}} \|_{\mathrm{TV}} = \dfrac{1}{2} \sum_{\boldsymbol{x} \in \mathbb{S}^{(d)}} \big| p^{(d)}(\boldsymbol{x}) - p_{\mathbb{F}}(\boldsymbol{x}) \big|,$

where $p_{\mathbb{F}}(\boldsymbol{x}) := 0$ for $\boldsymbol{x} \notin \mathbb{F}$. Substituting (28), (29) and (30) into the above equation yields

$\| p^{(d)} - p_{\mathbb{F}} \|_{\mathrm{TV}} = 1 - C\, G(\mathbb{F}) \le \dfrac{\sum_{\boldsymbol{x} \notin \mathbb{F}} g(\boldsymbol{x})}{G(\mathbb{F})}.$

Note that $G(\mathbb{F})$ is computed by the BBS procedure (Procedure 2). Thus, we can obtain an upper bound for $\| p^{(d)} - p_{\mathbb{F}} \|_{\mathrm{TV}}$ if we can estimate $\sum_{\boldsymbol{x} \notin \mathbb{F}} g(\boldsymbol{x})$.

We now define $\mathbb{F}_{\varepsilon}$, $\varepsilon > 0$, as

$\mathbb{F}_{\varepsilon} = \{ \boldsymbol{x} \in \mathbb{S}^{(d)} : g(\boldsymbol{x}) > \varepsilon \}.$

If we find an $\mathbb{F}_{\varepsilon}$ containing a small number of elements for a sufficiently small $\varepsilon > 0$, then we can perform approximate sampling from the distribution $p^{(d)}$ with high accuracy and efficiency.

Acknowledgments

The author acknowledges stimulating discussions with Kousei Sakaguchi. This research was supported in part by JSPS KAKENHI Grant Number JP15K00034.

Appendix A: Proof of Theorem 3.2

We first show that $\mathbb{X}$ has only one element when Steps (i) and (ii) (of Procedure 2) are completed. To facilitate the discussion, let $\mathbb{X}_n$ denote the set $\mathbb{X}$ before Step (ii) starts, i.e.,

$\mathbb{X}_n = \mathbb{B}_n^{+}.$  (A.1)

For $m = n-1, n-2, \dots, 0$, let