# Stochastic Enumeration with Importance Sampling

Alathea Jensen
December 5, 2017
###### Abstract

Many hard problems in the computational sciences are equivalent to counting the leaves of a decision tree, or, more generally, to summing a cost function over the nodes. These problems include calculating the permanent of a matrix, finding the volume of a convex polyhedron, and counting the number of linear extensions of a partially ordered set. Many approximation algorithms exist to estimate such sums. One of the most recent is Stochastic Enumeration (SE), introduced in 2013 by Rubinstein. In 2015, Vaisman and Kroese provided a rigorous analysis of the variance of SE, and showed that SE can be extended to a fully polynomial randomized approximation scheme for certain cost functions on random trees. We present an algorithm that incorporates an importance function into SE, and provide theoretical analysis of its efficacy. We also present the results of numerical experiments to measure the variance of an application of the algorithm to the problem of counting linear extensions of a poset, and show that introducing importance sampling results in a significant reduction of variance as compared to the original version of SE.

## Acknowledgments

This is a pre-print of an article published in Methodology and Computing in Applied Probability. The final authenticated version is available online at:

The author would like to thank Isabel Beichl and Francis Sullivan for the idea for this project. The author would also like to thank the Applied and Computational Mathematics Division of the Information Technology Laboratory at the National Institute of Standards and Technology for hosting the author as a guest researcher during the preparation of this article.

## 1 Introduction

Many hard problems in mathematics, computer science, and the physical sciences are equivalent to summing a cost function over a tree. These problems include calculating the permanent of a matrix, finding the volume of a convex polyhedron, and counting the number of linear extensions of a partially ordered set.

There are tree-searching algorithms which give an exact answer by simply traversing every node in the tree; however, in many cases, the tree is far too large for this to be practical. Indeed, the problem of computing tree cost is #P-complete (Valiant, 1979). The class #P consists of counting problems which ask for the number of solutions to a corresponding NP-complete decision problem.

Accordingly, there are various approximation algorithms for tree cost, and the two main types of these are Markov Chain Monte Carlo (MCMC) and sequential importance sampling (SIS). Both of these perform random sampling on a suitably defined set.

The original version of SIS is Knuth’s algorithm (Knuth, 1975), which samples tree cost by walking a random path from the root to a leaf, where each node in the path is chosen uniformly from the children of the previously chosen node. There have been several major adaptations to Knuth’s algorithm, all of which attempt to reduce the variance of the estimates produced.

One modification of Knuth’s algorithm is to choose the nodes of the path non-uniformly, proportional to an importance function on the nodes. Of course, choosing a good importance function requires some knowledge about the structure of the tree, and so this approach is not suitable for random trees, but rather for families of trees which share some general characteristics. Some cases where this approach has produced good results can be found in Beichl and Sullivan (1999), Blitzstein and Diaconis (2011), Harris, Sullivan, and Beichl (2014), Karp and Luby (1983), for example.

There have also been adaptations of Knuth’s algorithm which change the algorithm in a more structural way, such as stratified sampling, which was introduced by Knuth’s student, Chen (1992).

Stochastic Enumeration (SE) is the most recent of the structural adaptations. It was originally introduced by Rubinstein (2013), and further developed in Rubinstein, Ridder, and Vaisman (2014). Its approach to the problem is to run many non-independent trajectories through the tree in parallel, combining their effect on the estimate at each level of the tree to produce a single final estimate of the tree cost. Alternatively, one can view SE as operating on a hypertree associated with the original tree. A similar approach to the problem was taken by Cloteaux and Valentin (2011).

In Rubinstein’s original definition, the SE algorithm was only able to count the leaves of a tree. Vaisman and Kroese (2017) updated SE to estimate tree cost for any cost function, and provided a rigorous analysis of the variance. They also showed that SE can be extended to a fully polynomial randomized approximation scheme (FPRAS) for random trees with a cost function that is 1 on every node.

In this paper, we follow up on the work of Vaisman and Kroese to develop an adaptation of SE which we call Stochastic Enumeration with Importance (SEI). This algorithm chooses paths through the tree with non-uniform probability, according to a user-defined importance function on the nodes of the tree. We provide a detailed analysis of the theoretical properties of the algorithm, including ways to bound the variance.

Just as with SIS, SEI is not suitable for random trees, but rather for families of trees which share some characteristics. Therefore, in addition to theoretical analysis in which the importance function is not specified, we also provide a detailed example, with numerical results, of a family of trees and importance functions for which SEI provides a lower variance than SE.

## 2 Definitions and Preliminaries

Consider a tree $T$ with node set $V$, where each node $v$ has some cost given by a cost function $c(v)$. We wish to estimate the total cost of the tree, denoted $\mathrm{Cost}(T)$ and given by

$$\mathrm{Cost}(T)=\sum_{v\in V}c(v)$$

If our tree is uniform, in the sense that all the nodes on a given level have the same number of children, then it is very easy to determine the number of nodes on each level.

We will call the root node level 0, the root’s children level 1, and so on. Suppose the root has $D_0$ children, the root’s children all have $D_1$ children, and so on. Then there is 1 node on level 0, $D_0$ nodes on level 1, $D_0D_1$ nodes on level 2, and, in general, $D_0D_1\cdots D_{k-1}$ nodes on level $k$.

If the cost of nodes is also uniform across each level, then we can easily add up the cost of the entire tree. For each level $k$, let the cost of any node on level $k$ be denoted $c_k$. Then the cost of our tree is

$$\mathrm{Cost}(T)=c_0+c_1D_0+c_2D_0D_1+\cdots+c_nD_0D_1\cdots D_{n-1} \qquad (1)$$

where $n$ is the lowest level of the tree.

Of course, most trees are not uniform in the sense described above, but the central idea of Knuth’s algorithm (Knuth, 1975) for estimating tree cost is to pretend as though they are. In Knuth’s algorithm, we walk a single path from the root to a leaf, and note the number of children that we see from each node in our path (the $D_k$), as well as the cost of each node in our path (the $c_k$). We then calculate the cost of the tree using Formula (1), which is no longer exact but is now an unbiased estimator of the tree cost.
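As a concrete illustration, Knuth’s estimator can be sketched in a few lines. This is our own minimal rendering, not code from the paper; the tree interface (a `children` function and a `cost` function) is an assumption made for the sketch.

```python
import random

def knuth_estimate(root, children, cost):
    """One run of Knuth's estimator: walk a uniformly random path from the
    root to a leaf. D is the running product of observed branching factors,
    an estimate of the number of nodes on the current level."""
    D = 1
    total = 0.0
    v = root
    while True:
        total += D * cost(v)      # estimated total cost of the current level
        kids = children(v)
        if not kids:              # reached a leaf, so the walk ends
            return total
        D *= len(kids)            # pretend the whole level branches like v
        v = random.choice(kids)   # uniform choice of the next node
```

On a uniform tree the estimate is exact; in general it is unbiased but may have high variance.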

In the SE algorithm, just as in Knuth’s algorithm, we work our way down the tree level by level from the root to the leaves. The main difference is that instead of choosing a single node on each level of the tree, we choose multiple nodes on each level. We can also think of this as choosing a single hypernode from each level of a hypertree constructed from the original tree. The following definitions are necessary to describe the structure of the hypertree.

We define a hypernode $\mathbf{v}$ to be a set of distinct nodes that are in the same level of the tree. We can extend the definition of the cost function to hypernodes by letting

$$c(\mathbf{v})=\sum_{v\in\mathbf{v}}c(v)$$

Let $S(v)$ denote the set of successors (or children) of a node $v$ in the tree. Then we can define the set of successors of a hypernode $\mathbf{v}$ as

$$S(\mathbf{v})=\bigcup_{v\in\mathbf{v}}S(v)$$

Throughout the SE algorithm, each time we move to a new level, we choose a new hypernode from among the successors of the previous hypernode. We make no distinction between these successors in terms of which node in the previous hypernode they came from. This means that some nodes in the previous hypernode may have multiple children chosen to be in the next hypernode, while other nodes in the previous hypernode may not have any children chosen to be in the next hypernode.

Obviously there is some limit on our computing power, so we have to limit the size of the hypernodes we work with to be within a budget, which we will denote $B$. At each level, we will choose $B$ nodes to be in the next hypernode, as long as $|S(\mathbf{v})|$ is larger than $B$. If $|S(\mathbf{v})|\le B$, then we will take all of $S(\mathbf{v})$ to be the next hypernode.

Thus, if our current hypernode is $\mathbf{v}$, the candidates for our next hypernode, which we call the hyperchildren of $\mathbf{v}$, are the elements of the set

$$H(\mathbf{v})=\{\mathbf{w}\subseteq S(\mathbf{v}):|\mathbf{w}|=\min(B,|S(\mathbf{v})|)\}$$
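For small successor sets, $H(\mathbf{v})$ can be enumerated directly; a sketch (our own helper, with illustrative names, not part of the paper):

```python
from itertools import combinations

def hyperchildren(successors, budget):
    """Enumerate H(v): all subsets of the successor set S(v) whose size
    is exactly min(budget, |S(v)|)."""
    k = min(budget, len(successors))
    return [frozenset(w) for w in combinations(sorted(successors), k)]
```

In practice the algorithms never build this list explicitly; they only sample a single hyperchild from it.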

Many of the statements and proofs throughout this paper are in a recursive form that refers to subforests of a tree, and so we lastly need to define a forest rooted at a hypernode. For a hypernode $\mathbf{v}$, the forest rooted at $\mathbf{v}$, denoted $T_{\mathbf{v}}$, is simply the union of all the trees rooted at each of the nodes in $\mathbf{v}$.

$$T_{\mathbf{v}}=\bigcup_{v\in\mathbf{v}}T_v$$

We can also extend the notion of the total cost of a tree to a forest rooted at a hypernode by letting

$$\mathrm{Cost}(T_{\mathbf{v}})=\sum_{v\in\mathbf{v}}\mathrm{Cost}(T_v)$$
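On small trees these quantities can be computed exactly by recursion, which is useful for checking the estimators later on. A brute-force sketch (our own helper, with an assumed tree interface):

```python
def forest_cost(hypernode, children, cost):
    """Exact Cost(T_v) for the forest rooted at a hypernode: the sum of
    the exact tree costs rooted at each node of the hypernode."""
    def tree_cost(v):
        return cost(v) + sum(tree_cost(w) for w in children(v))
    return sum(tree_cost(v) for v in hypernode)
```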

Let’s look at an example to familiarize ourselves further with the notation.

###### Example 2.1.

Consider the tree in Figure 1. It is labeled with a possible sequence of hypernodes that could be chosen by the SE algorithm, using a budget of $B=2$.

On level 0, the root is automatically chosen to be the first hypernode, $\mathbf{x}_0$. We could then refer to the entire tree as $T_{\mathbf{x}_0}$. On level 1, we have $|S(\mathbf{x}_0)|=2$. Since $|S(\mathbf{x}_0)|\le B$, we take all of $S(\mathbf{x}_0)$ to be our next hypernode, so $\mathbf{x}_1=S(\mathbf{x}_0)$.

On level 2, we have $|S(\mathbf{x}_1)|=3>B$, so our choices for $\mathbf{x}_2$ are the elements of $H(\mathbf{x}_1)$, the two-element subsets of $S(\mathbf{x}_1)$. Let's choose one of them. Similarly, on level 3, we have $|S(\mathbf{x}_2)|=3>B$, so our choices for $\mathbf{x}_3$ are the elements of $H(\mathbf{x}_2)$, and again we choose a two-element subset.

Finally, on level 4, we have $|S(\mathbf{x}_3)|=1$. Since $|S(\mathbf{x}_3)|\le B$, we take all of $S(\mathbf{x}_3)$ to be our next hypernode, so $\mathbf{x}_4=S(\mathbf{x}_3)$.∎

## 3 Stochastic Enumeration with Arbitrary Probability

We are now ready to state the first algorithm, Stochastic Enumeration with arbitrary probability (SEP). It is a generalization of the updated Stochastic Enumeration algorithm in Vaisman and Kroese (2017), which used uniform probabilities.

Note that the quantity $D_k$ is an estimate of the average number of children of the nodes in level $k$, so that after each update in line 5, $D=D_0D_1\cdots D_k$ is an estimate of the number of nodes in level $k+1$ of the tree.

Likewise, the quantity $c(\mathbf{x}_{k+1})/|\mathbf{x}_{k+1}|$ is an estimate of the average cost of nodes on level $k+1$, so that by adding $D\,c(\mathbf{x}_{k+1})/|\mathbf{x}_{k+1}|$ to $C_{\mathrm{SEP}}$ on line 5, we are adding the estimated cost of all of level $k+1$ of the tree.

Before analyzing this algorithm further, let’s look at an example to get a better idea of how it works.

###### Example 3.1.

Consider the tree in Figure 2. To keep things simple, we’ll use a budget of $B=2$ and a cost function that is 1 on every node. Clearly the total cost of the tree is the number of nodes, 14. This choice makes $c(\mathbf{x}_{k+1})/|\mathbf{x}_{k+1}|$ simplify to 1, so the update command for $C_{\mathrm{SEP}}$ becomes

$$C_{\mathrm{SEP}}\leftarrow C_{\mathrm{SEP}}+D$$

Let’s choose hypernodes with a uniform probability, meaning $P(\mathbf{x}_{k+1})=1/|H(\mathbf{x}_k)|=1/\binom{|S(\mathbf{x}_k)|}{|\mathbf{x}_{k+1}|}$. Since $|\mathbf{x}_{k+1}|\binom{|S(\mathbf{x}_k)|}{|\mathbf{x}_{k+1}|}=|S(\mathbf{x}_k)|\binom{|S(\mathbf{x}_k)|-1}{|\mathbf{x}_{k+1}|-1}$, this makes the formula for $D_k$ simplify to $|S(\mathbf{x}_k)|/|\mathbf{x}_k|$, so the update command for $D$ becomes

$$D\leftarrow\frac{|S(\mathbf{x}_k)|}{|\mathbf{x}_k|}\,D$$

Note that $|S(\mathbf{x}_k)|/|\mathbf{x}_k|$ is the average number of children of the nodes in $\mathbf{x}_k$. In the original SE algorithm, the update command for $D$ always looks like this.

Now let’s examine a possible sequence of hypernodes produced by Algorithm 1, as shown in Figure 2; the tree is the same as in the previous example.

We initialize with $k=0$, $\mathbf{x}_0=\{\text{root}\}$, $D=1$, and $C_{\mathrm{SEP}}=c(\mathbf{x}_0)/|\mathbf{x}_0|=1$. Then we compute $|S(\mathbf{x}_0)|=2\le B$, which means $\mathbf{x}_1=S(\mathbf{x}_0)$, and advance to level 1 with $|\mathbf{x}_1|=2$. We update

$$D\leftarrow\frac{|S(\mathbf{x}_0)|}{|\mathbf{x}_0|}\,D=2$$
$$C_{\mathrm{SEP}}\leftarrow C_{\mathrm{SEP}}+D=3$$

We advance to $k=1$ and loop. We compute $|S(\mathbf{x}_1)|=3>B$, which means we choose $\mathbf{x}_2$ at random from $H(\mathbf{x}_1)$, and advance to level 2 with $|\mathbf{x}_2|=2$. We update

$$D\leftarrow\frac{|S(\mathbf{x}_1)|}{|\mathbf{x}_1|}\,D=3$$
$$C_{\mathrm{SEP}}\leftarrow C_{\mathrm{SEP}}+D=6$$

We advance to $k=2$ and loop. We compute $|S(\mathbf{x}_2)|=3>B$, which means we choose $\mathbf{x}_3$ at random from $H(\mathbf{x}_2)$, and we advance to level 3 with $|\mathbf{x}_3|=2$. We update

$$D\leftarrow\frac{|S(\mathbf{x}_2)|}{|\mathbf{x}_2|}\,D=4.5$$
$$C_{\mathrm{SEP}}\leftarrow C_{\mathrm{SEP}}+D=10.5$$

We advance to $k=3$ and loop. We compute $|S(\mathbf{x}_3)|=1\le B$, which means $\mathbf{x}_4=S(\mathbf{x}_3)$, and we advance to level 4 with $|\mathbf{x}_4|=1$. We update

$$D\leftarrow\frac{|S(\mathbf{x}_3)|}{|\mathbf{x}_3|}\,D=2.25$$
$$C_{\mathrm{SEP}}\leftarrow C_{\mathrm{SEP}}+D=12.75$$

We increase to $k=4$ and loop. We compute $S(\mathbf{x}_4)=\emptyset$, so we are in the terminal position and we stop. The algorithm returns $C_{\mathrm{SEP}}=12.75$ as an estimator of the cost of the tree. This completes the example.∎
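The whole procedure traced above can be condensed into a short routine. The following is our sketch of SE with uniform hyperchild selection (the special case of Algorithm 1 used in this example), not the paper’s pseudocode; the tree interface is an assumption.

```python
import random

def se_estimate(root, children, cost, budget):
    """One run of SE with uniform hyperchild choice, estimating Cost(T).
    x is the current hypernode; D estimates the size of the current level."""
    x = [root]
    D, C = 1.0, 0.0
    while True:
        C += D * sum(cost(v) for v in x) / len(x)  # add this level's estimate
        S = [w for v in x for w in children(v)]    # successors S(x_k)
        if not S:                                  # terminal position
            return C
        D *= len(S) / len(x)                       # average branching factor
        x = random.sample(S, min(budget, len(S)))  # uniform hyperchild
```

When the budget is at least the width of the tree, every level is enumerated in full and the run returns the exact cost.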

Now we begin our analysis of Algorithm 1. In general, the output of Algorithm 1 is a random variable

$$C_{\mathrm{SEP}}(T_{\mathbf{x}_0})=\frac{c(\mathbf{x}_0)}{|\mathbf{x}_0|}+D_0\frac{c(\mathbf{x}_1)}{|\mathbf{x}_1|}+D_0D_1\frac{c(\mathbf{x}_2)}{|\mathbf{x}_2|}+\cdots+D_0D_1\cdots D_{\tau-1}\frac{c(\mathbf{x}_\tau)}{|\mathbf{x}_\tau|}$$
$$=\frac{c(\mathbf{x}_0)}{|\mathbf{x}_0|}+D_0\left(\frac{c(\mathbf{x}_1)}{|\mathbf{x}_1|}+D_1\frac{c(\mathbf{x}_2)}{|\mathbf{x}_2|}+\cdots+D_1D_2\cdots D_{\tau-1}\frac{c(\mathbf{x}_\tau)}{|\mathbf{x}_\tau|}\right)$$

where $\tau$ is some height less than or equal to the height of $T_{\mathbf{x}_0}$.

This naturally suggests a recursive formulation of the output,

$$C_{\mathrm{SEP}}(T_{\mathbf{x}_0})=\frac{c(\mathbf{x}_0)}{|\mathbf{x}_0|}+D_0\,C_{\mathrm{SEP}}(T_{\mathbf{x}_1})$$

Let $\mathbf{w}$ be a hyperchild of $\mathbf{v}$ selected from $H(\mathbf{v})$ with probability $P(\mathbf{w})$. Then we have

$$C_{\mathrm{SEP}}(T_{\mathbf{v}})=\frac{c(\mathbf{v})}{|\mathbf{v}|}+\frac{|\mathbf{w}|\,C_{\mathrm{SEP}}(T_{\mathbf{w}})}{|\mathbf{v}|\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}P(\mathbf{w})} \qquad (2)$$

Before proceeding to a proof of the correctness of Algorithm 1, we stop to note a lemma that we will use in this and other proofs throughout the paper.

###### Lemma 3.1.
$$\mathrm{Cost}(T_{S(\mathbf{v})})=\sum_{\mathbf{w}\in H(\mathbf{v})}\frac{\mathrm{Cost}(T_{\mathbf{w}})}{\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}}$$
###### Proof.

We begin by expanding the right hand side of the proposed equation.

$$\sum_{\mathbf{w}\in H(\mathbf{v})}\frac{\mathrm{Cost}(T_{\mathbf{w}})}{\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}}=\sum_{\mathbf{w}\in H(\mathbf{v})}\frac{1}{\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}}\sum_{w\in\mathbf{w}}\mathrm{Cost}(T_w)$$

Since $|\mathbf{w}|=\min(B,|S(\mathbf{v})|)$ does not depend on the particular choice of $\mathbf{w}$, we can move the factor in which it appears outside the summation.

$$\sum_{\mathbf{w}\in H(\mathbf{v})}\frac{\mathrm{Cost}(T_{\mathbf{w}})}{\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}}=\frac{1}{\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}}\sum_{\mathbf{w}\in H(\mathbf{v})}\sum_{w\in\mathbf{w}}\mathrm{Cost}(T_w)$$

Each $w\in S(\mathbf{v})$ appears in precisely $\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}$ of the hypernodes $\mathbf{w}\in H(\mathbf{v})$, therefore we can simplify the double summation.

$$\sum_{\mathbf{w}\in H(\mathbf{v})}\frac{\mathrm{Cost}(T_{\mathbf{w}})}{\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}}=\frac{1}{\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}}\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}\sum_{w\in S(\mathbf{v})}\mathrm{Cost}(T_w)=\sum_{w\in S(\mathbf{v})}\mathrm{Cost}(T_w)=\mathrm{Cost}(T_{S(\mathbf{v})})$$

∎
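The lemma is easy to sanity-check numerically. The following brute-force check (our own construction, with cost 1 on every node so that tree costs are node counts) enumerates $H(\mathbf{v})$ for a small tree and verifies the identity.

```python
from itertools import combinations
from math import comb

# A small tree: root 0 with children 1, 2, 3; node 1 has children 4, 5.
children = {0: [1, 2, 3], 1: [4, 5], 2: [], 3: [], 4: [], 5: []}

def tree_cost(v):
    # Exact Cost(T_v) with cost 1 on every node (a subtree node count).
    return 1 + sum(tree_cost(w) for w in children[v])

S = children[0]                          # S(v) for the hypernode v = {0}
B = 2
k = min(B, len(S))                       # the common hypernode size |w|
lhs = sum(tree_cost(w) for w in S)       # Cost(T_S(v))
rhs = sum(
    sum(tree_cost(w) for w in hyper)     # Cost(T_w) for each w in H(v)
    for hyper in combinations(S, k)
) / comb(len(S) - 1, k - 1)
assert lhs == rhs                        # the identity of Lemma 3.1
```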

###### Theorem 3.1.

Algorithm 1 is an unbiased estimator of tree cost, meaning

$$\mathbb{E}[C_{\mathrm{SEP}}(T_{\mathbf{v}})]=\frac{\mathrm{Cost}(T_{\mathbf{v}})}{|\mathbf{v}|}$$
###### Proof.

The proof proceeds by induction over the height of the tree. For a forest of height 0, we have $S(\mathbf{v})=\emptyset$, so the algorithm returns the exact answer

$$\frac{c(\mathbf{v})}{|\mathbf{v}|}=\frac{\mathrm{Cost}(T_{\mathbf{v}})}{|\mathbf{v}|}$$

Assuming that the proposition is correct for forests with heights strictly less than the height of $T_{\mathbf{v}}$, we have

$$\mathbb{E}[C_{\mathrm{SEP}}(T_{\mathbf{v}})]=\frac{c(\mathbf{v})}{|\mathbf{v}|}+\sum_{\mathbf{w}\in H(\mathbf{v})}P(\mathbf{w})\,\frac{|\mathbf{w}|\,\mathbb{E}[C_{\mathrm{SEP}}(T_{\mathbf{w}})]}{|\mathbf{v}|\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}P(\mathbf{w})}=\frac{c(\mathbf{v})}{|\mathbf{v}|}+\frac{1}{|\mathbf{v}|}\sum_{\mathbf{w}\in H(\mathbf{v})}\frac{\mathrm{Cost}(T_{\mathbf{w}})}{\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}}$$

Applying Lemma 3.1, we get

$$\mathbb{E}[C_{\mathrm{SEP}}(T_{\mathbf{v}})]=\frac{c(\mathbf{v})}{|\mathbf{v}|}+\frac{\mathrm{Cost}(T_{S(\mathbf{v})})}{|\mathbf{v}|}=\frac{\mathrm{Cost}(T_{\mathbf{v}})}{|\mathbf{v}|}$$

∎

Now that we know Algorithm 1 works, we can start thinking about how to improve the variance of the estimates it produces.

The purpose of using a non-uniform probability distribution to select each hypernode is to reduce the variance of the estimates. Therefore, it is important to know the optimal probability distribution, in other words, the probability distribution that would yield the exact answer for every estimate.

As with Knuth’s algorithm, it turns out that the optimal probability for choosing a hypernode is proportional to the cost of the forest rooted at the hypernode. Details are given below.

###### Theorem 3.2.

In Algorithm 1, if each hypernode $\mathbf{w}$ is chosen from all possible hypernodes in $H(\mathbf{v})$ with probability

$$P(\mathbf{w})=\frac{\mathrm{Cost}(T_{\mathbf{w}})}{\sum_{\mathbf{x}\in H(\mathbf{v})}\mathrm{Cost}(T_{\mathbf{x}})}$$

then $C_{\mathrm{SEP}}$ is a zero-variance estimator, meaning

$$C_{\mathrm{SEP}}(T_{\mathbf{v}})=\frac{\mathrm{Cost}(T_{\mathbf{v}})}{|\mathbf{v}|}\quad\text{with probability }1.$$

###### Proof.

The proof proceeds by induction over the height of the tree. For a tree of height 0, we have $S(\mathbf{v})=\emptyset$, so the algorithm returns the exact answer

$$C_{\mathrm{SEP}}(T_{\mathbf{v}})=\frac{c(\mathbf{v})}{|\mathbf{v}|}=\frac{\mathrm{Cost}(T_{\mathbf{v}})}{|\mathbf{v}|}$$

Assuming that the proposition is correct for forests with heights strictly less than the height of $T_{\mathbf{v}}$, we have

$$C_{\mathrm{SEP}}(T_{\mathbf{v}})=\frac{c(\mathbf{v})}{|\mathbf{v}|}+\frac{|\mathbf{w}|\,C_{\mathrm{SEP}}(T_{\mathbf{w}})}{|\mathbf{v}|\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}P(\mathbf{w})}=\frac{c(\mathbf{v})}{|\mathbf{v}|}+\frac{\mathrm{Cost}(T_{\mathbf{w}})}{|\mathbf{v}|\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}P(\mathbf{w})}$$
$$=\frac{c(\mathbf{v})}{|\mathbf{v}|}+\frac{\mathrm{Cost}(T_{\mathbf{w}})\sum_{\mathbf{x}\in H(\mathbf{v})}\mathrm{Cost}(T_{\mathbf{x}})}{|\mathbf{v}|\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}\,\mathrm{Cost}(T_{\mathbf{w}})}=\frac{c(\mathbf{v})}{|\mathbf{v}|}+\frac{1}{|\mathbf{v}|}\sum_{\mathbf{x}\in H(\mathbf{v})}\frac{\mathrm{Cost}(T_{\mathbf{x}})}{\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}}$$

Applying Lemma 3.1, we get

$$C_{\mathrm{SEP}}(T_{\mathbf{v}})=\frac{c(\mathbf{v})}{|\mathbf{v}|}+\frac{\mathrm{Cost}(T_{S(\mathbf{v})})}{|\mathbf{v}|}=\frac{\mathrm{Cost}(T_{\mathbf{v}})}{|\mathbf{v}|}$$

∎

We are now ready to discuss using an importance function to implement a probability distribution.

## 4 Stochastic Enumeration with Importance

The information in Theorem 3.2 suggests that we should use a probability distribution in which each hypernode has a probability that is proportional to the cost of the forest beginning at that hypernode. Obviously this will be difficult to achieve even as an estimate, since it is the same problem that we are trying to address with our algorithms.

However, even supposing that we did have some way of estimating the ideal probability for each hypernode, there is another problem with trying to implement a non-uniform probability distribution on the hypernodes. Simply put, $H(\mathbf{v})$ may be extremely large, and so, if we hope to keep the running time of the algorithm under control, we need a way of choosing hypernodes that does not require us to calculate or store the probability of each individual hypernode in $H(\mathbf{v})$.

It turns out that there is an easy way to do this. Consider a function $r$ from the nodes of a tree to the positive real numbers. For a node $v$, we will call $r(v)$ the weight of $v$, or the importance of $v$. We can extend the domain of $r$ to sets of nodes by defining the weight of a set of nodes $\mathbf{v}$ as $r(\mathbf{v})=\sum_{v\in\mathbf{v}}r(v)$.

Given this weighting scheme, there is a way to choose a hypernode with probability

$$P(\mathbf{w})=\frac{r(\mathbf{w})}{\sum_{\mathbf{x}\in H(\mathbf{v})}r(\mathbf{x})}$$

that only requires us to calculate the weights of the nodes in $S(\mathbf{v})$, and not the weights of all the hypernodes in $H(\mathbf{v})$. This method is described in Algorithm 2.
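The selection step just described is easy to sketch: draw one node with probability proportional to its weight, then fill the remaining slots uniformly at random. The helper below is our own rendering with weights stored in a dictionary, not the paper’s pseudocode.

```python
import random

def choose_hypernode(S, r, budget):
    """Choose the next hypernode from the successor list S: one node is
    drawn with probability proportional to its weight r[v], and the rest
    of the hypernode is filled uniformly at random."""
    k = min(budget, len(S))
    first = random.choices(S, weights=[r[v] for v in S], k=1)[0]
    rest = random.sample([v for v in S if v != first], k - 1)
    return [first] + rest
```

Only the $|S(\mathbf{v})|$ node weights are ever touched, so a selection costs time linear in $|S(\mathbf{v})|$ rather than in $|H(\mathbf{v})|$.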

It may not be obvious, but Algorithm 2 is simply Algorithm 1 with a specific probability distribution implemented, as we shall prove now.

###### Theorem 4.1.

Algorithm 2 is an unbiased estimator of tree cost, meaning

$$\mathbb{E}[C_{\mathrm{SEI}}(T_{\mathbf{v}})]=\frac{\mathrm{Cost}(T_{\mathbf{v}})}{|\mathbf{v}|}$$
###### Proof.

We begin by calculating the probability with which each hypernode $\mathbf{x}_{k+1}$ is being selected.

Since one element, $x$, is selected separately from the rest of $\mathbf{x}_{k+1}$, there are $|\mathbf{x}_{k+1}|$ different and mutually exclusive ways in which we can get the same $\mathbf{x}_{k+1}$. This is because each element in $\mathbf{x}_{k+1}$ can play the role of $x$.

Once an $x$ has been selected from $S(\mathbf{x}_k)$ with probability $r(x)/r(S(\mathbf{x}_k))$, the rest of the elements are selected uniformly at random from the remaining elements in $S(\mathbf{x}_k)$, so the remaining $|\mathbf{x}_{k+1}|-1$ elements are collectively selected with probability $1/\binom{|S(\mathbf{x}_k)|-1}{|\mathbf{x}_{k+1}|-1}$.

Therefore the probability with which any given is selected is

$$P(\mathbf{x}_{k+1})=\sum_{x\in\mathbf{x}_{k+1}}\frac{r(x)}{r(S(\mathbf{x}_k))}\cdot\frac{1}{\binom{|S(\mathbf{x}_k)|-1}{|\mathbf{x}_{k+1}|-1}}=\frac{r(\mathbf{x}_{k+1})}{r(S(\mathbf{x}_k))\binom{|S(\mathbf{x}_k)|-1}{|\mathbf{x}_{k+1}|-1}}$$

The formula for $D_k$ in Algorithm 2 is then obtained by a simple substitution of this probability into the formula given in Algorithm 1, and so the proposition follows from Theorem 3.1. ∎

Let $\mathbf{w}$ be selected from $H(\mathbf{v})$ as described in Algorithm 2. Then the probability with which $\mathbf{w}$ is selected is

$$P(\mathbf{w})=\frac{r(\mathbf{w})}{r(S(\mathbf{v}))\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}}=\frac{r(\mathbf{w})}{\sum_{x\in S(\mathbf{v})}r(x)\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}}$$

Since each $x\in S(\mathbf{v})$ appears in precisely $\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}$ of the hypernodes in $H(\mathbf{v})$, we can also write this as

$$P(\mathbf{w})=\frac{r(\mathbf{w})}{\sum_{\mathbf{x}\in H(\mathbf{v})}r(\mathbf{x})}$$

which was the desired probability. Clearly, from Theorem 3.2, the ideal importance function would be $r(\mathbf{w})=\mathrm{Cost}(T_{\mathbf{w}})$.

Before analyzing this algorithm any further, let’s look at an example to get a better idea of how it works.

###### Example 4.1.

Consider the tree in Figure 3, which is the same as that in the previous examples, except that it has been labeled with importance function values in addition to the names of the nodes.

To keep things simple, we are reusing as many parameters as possible from Example 3.1, so the budget is $B=2$ and the cost function is 1 on every node. Again, the total cost of the tree is the number of nodes, 14, and this choice makes $c(\mathbf{x}_{k+1})/|\mathbf{x}_{k+1}|$ simplify to 1, so the update command for $C_{\mathrm{SEI}}$ becomes

$$C_{\mathrm{SEI}}\leftarrow C_{\mathrm{SEI}}+D$$

The importance function $r(v)$ we are using for each node $v$ is the number of leaves under $v$, including $v$ itself if it is a leaf. We have labeled the importance of each node after the node’s name in the figure.

Now let’s examine a possible sequence of hypernodes produced by Algorithm 2, as shown in the figure.

We initialize with $k=0$, $\mathbf{x}_0=\{\text{root}\}$, $D=1$, and $C_{\mathrm{SEI}}=c(\mathbf{x}_0)/|\mathbf{x}_0|=1$. Then we compute $S(\mathbf{x}_0)=\{b,c\}$, with $r(b)=2$ and $r(c)=3$. We choose $c$ with probability

$$P(c)=\frac{r(c)}{r(S(\mathbf{x}_0))}=\frac{3}{2+3}=\frac{3}{5}$$

and then choose one more element uniformly at random from the remaining elements, to give us $\mathbf{x}_1=\{b,c\}$. We update

$$D_0\leftarrow\frac{|\mathbf{x}_1|}{|\mathbf{x}_0|}\cdot\frac{r(S(\mathbf{x}_0))}{r(\mathbf{x}_1)}=\frac{2}{1}\cdot\frac{2+3}{2+3}=2$$
$$D\leftarrow D\cdot D_0=2$$
$$C_{\mathrm{SEI}}\leftarrow C_{\mathrm{SEI}}+D=3$$

We increase to $k=1$ and loop. We compute $S(\mathbf{x}_1)$, whose three nodes have weights 2, 2, and 1. We choose $e$, a node of weight 2, with probability

$$P(e)=\frac{r(e)}{r(S(\mathbf{x}_1))}=\frac{2}{2+2+1}=\frac{2}{5}$$

and then choose one more element uniformly at random from the remaining elements, giving us a hypernode $\mathbf{x}_2$ with $r(\mathbf{x}_2)=2+2=4$. We update

$$D_1\leftarrow\frac{|\mathbf{x}_2|}{|\mathbf{x}_1|}\cdot\frac{r(S(\mathbf{x}_1))}{r(\mathbf{x}_2)}=\frac{2}{2}\cdot\frac{2+2+1}{2+2}=\frac{5}{4}$$
$$D\leftarrow D\cdot D_1=\frac{5}{2}$$
$$C_{\mathrm{SEI}}\leftarrow C_{\mathrm{SEI}}+D=\frac{11}{2}$$

We increase to $k=2$ and loop. We compute $S(\mathbf{x}_2)$, whose three nodes have weights 2, 1, and 1. We choose $i$, a node of weight 1, with probability

$$P(i)=\frac{r(i)}{r(S(\mathbf{x}_2))}=\frac{1}{2+1+1}=\frac{1}{4}$$

and then choose one more element uniformly at random from the remaining elements, giving us a hypernode $\mathbf{x}_3$ with $r(\mathbf{x}_3)=1+1=2$. We update

$$D_2\leftarrow\frac{|\mathbf{x}_3|}{|\mathbf{x}_2|}\cdot\frac{r(S(\mathbf{x}_2))}{r(\mathbf{x}_3)}=\frac{2}{2}\cdot\frac{2+1+1}{1+1}=2$$
$$D\leftarrow D\cdot D_2=5$$
$$C_{\mathrm{SEI}}\leftarrow C_{\mathrm{SEI}}+D=\frac{21}{2}$$

We increase to $k=3$ and loop. We compute $S(\mathbf{x}_3)=\{m\}$, and we choose $m$ with probability

$$P(m)=\frac{r(m)}{r(S(\mathbf{x}_3))}=\frac{1}{1}=1$$

Since there are no remaining elements to be chosen, we have $\mathbf{x}_4=\{m\}$. We then update

$$D_3\leftarrow\frac{|\mathbf{x}_4|}{|\mathbf{x}_3|}\cdot\frac{r(S(\mathbf{x}_3))}{r(\mathbf{x}_4)}=\frac{1}{2}\cdot\frac{1}{1}=\frac{1}{2}$$
$$D\leftarrow D\cdot D_3=\frac{5}{2}$$
$$C_{\mathrm{SEI}}\leftarrow C_{\mathrm{SEI}}+D=\frac{26}{2}=13$$

We increase to $k=4$ and loop. We compute $S(\mathbf{x}_4)=\emptyset$, so we are in the terminal position and we stop. The algorithm returns $C_{\mathrm{SEI}}=13$ as an estimator of the cost of the tree. This completes the example.∎
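The run traced above generalizes to the following sketch of SEI (our own rendering of Algorithm 2, with an assumed tree interface and the importance supplied as a function `r`). By Theorem 3.2, taking $r(v)=\mathrm{Cost}(T_v)$ makes every run return the exact answer, which gives a convenient correctness check.

```python
import random

def sei_estimate(root, children, cost, r, budget):
    """One run of SEI: each hypernode is chosen by a weighted first pick
    plus a uniform fill, and D carries the importance-corrected factor
    D_k = (|x_{k+1}| / |x_k|) * r(S(x_k)) / r(x_{k+1})."""
    x = [root]
    D, C = 1.0, 0.0
    while True:
        C += D * sum(cost(v) for v in x) / len(x)
        S = [w for v in x for w in children(v)]
        if not S:
            return C
        k = min(budget, len(S))
        first = random.choices(S, weights=[r(v) for v in S], k=1)[0]
        rest = random.sample([v for v in S if v != first], k - 1)
        nxt = [first] + rest
        D *= (len(nxt) / len(x)) * sum(r(v) for v in S) / sum(r(v) for v in nxt)
        x = nxt
```

With an approximate importance such as the leaf counts of Example 4.1, runs are no longer exact, but they remain unbiased by Theorem 4.1.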

## 5 Variance

Recall that in Equation 2, we found a recursive expression for the output of Algorithm 1 as

$$C_{\mathrm{SEP}}(T_{\mathbf{v}})=\frac{c(\mathbf{v})}{|\mathbf{v}|}+\frac{|\mathbf{w}|\,C_{\mathrm{SEP}}(T_{\mathbf{w}})}{|\mathbf{v}|\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}P(\mathbf{w})}$$

By substituting for $P(\mathbf{w})$ with the expression we found in the proof of Theorem 4.1, we get another recursive formula for the output of Algorithm 2.

$$C_{\mathrm{SEI}}(T_{\mathbf{v}})=\frac{c(\mathbf{v})}{|\mathbf{v}|}+\frac{|\mathbf{w}|}{|\mathbf{v}|}\cdot\frac{r(S(\mathbf{v}))}{r(\mathbf{w})}\,C_{\mathrm{SEI}}(T_{\mathbf{w}})$$

With this information we can begin to analyze the variance of $C_{\mathrm{SEI}}(T_{\mathbf{v}})$, or rather, the variance of $|\mathbf{v}|\,C_{\mathrm{SEI}}(T_{\mathbf{v}})$, which is the actual estimate of tree cost produced by Algorithm 2.

###### Theorem 5.1.

For a forest rooted at a hypernode $\mathbf{v}$, the variance produced by Algorithm 2 is

$$\mathrm{Var}(|\mathbf{v}|\,C_{\mathrm{SEI}}(T_{\mathbf{v}}))=\sum_{\mathbf{w}\in H(\mathbf{v})}\frac{1}{\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}}\frac{r(S(\mathbf{v}))}{r(\mathbf{w})}\Big(\mathrm{Var}(|\mathbf{w}|\,C_{\mathrm{SEI}}(T_{\mathbf{w}}))+\mathrm{Cost}(T_{\mathbf{w}})^2\Big)-\mathrm{Cost}(T_{S(\mathbf{v})})^2$$

###### Proof.

We know

$$C_{\mathrm{SEI}}(T_{\mathbf{v}})=\frac{c(\mathbf{v})}{|\mathbf{v}|}+\frac{|\mathbf{w}|}{|\mathbf{v}|}\cdot\frac{r(S(\mathbf{v}))}{r(\mathbf{w})}\,C_{\mathrm{SEI}}(T_{\mathbf{w}})$$

which implies

$$|\mathbf{v}|\,C_{\mathrm{SEI}}(T_{\mathbf{v}})=c(\mathbf{v})+\frac{r(S(\mathbf{v}))}{r(\mathbf{w})}\,|\mathbf{w}|\,C_{\mathrm{SEI}}(T_{\mathbf{w}})$$

Taking the variance of both sides, and noting that $c(\mathbf{v})$ is a constant, we get

$$\mathrm{Var}(|\mathbf{v}|\,C_{\mathrm{SEI}}(T_{\mathbf{v}}))=\mathrm{Var}\!\left(\frac{r(S(\mathbf{v}))}{r(\mathbf{w})}\,|\mathbf{w}|\,C_{\mathrm{SEI}}(T_{\mathbf{w}})\right)$$

and so

$$\mathrm{Var}(|\mathbf{v}|\,C_{\mathrm{SEI}}(T_{\mathbf{v}}))=\mathbb{E}\!\left[\left(\frac{r(S(\mathbf{v}))}{r(\mathbf{w})}\,|\mathbf{w}|\,C_{\mathrm{SEI}}(T_{\mathbf{w}})\right)^{\!2}\right]-\left(\mathbb{E}\!\left[\frac{r(S(\mathbf{v}))}{r(\mathbf{w})}\,|\mathbf{w}|\,C_{\mathrm{SEI}}(T_{\mathbf{w}})\right]\right)^{\!2} \qquad (3)$$

We will tackle each of these terms separately. First,

$$\mathbb{E}\!\left[\left(\frac{r(S(\mathbf{v}))}{r(\mathbf{w})}\,|\mathbf{w}|\,C_{\mathrm{SEI}}(T_{\mathbf{w}})\right)^{\!2}\right]=\sum_{\mathbf{w}\in H(\mathbf{v})}P(\mathbf{w})\left(\frac{r(S(\mathbf{v}))}{r(\mathbf{w})}\right)^{\!2}\mathbb{E}\!\left[(|\mathbf{w}|\,C_{\mathrm{SEI}}(T_{\mathbf{w}}))^2\right]$$
$$=\sum_{\mathbf{w}\in H(\mathbf{v})}\frac{1}{\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}}\frac{r(\mathbf{w})}{r(S(\mathbf{v}))}\left(\frac{r(S(\mathbf{v}))}{r(\mathbf{w})}\right)^{\!2}\mathbb{E}\!\left[(|\mathbf{w}|\,C_{\mathrm{SEI}}(T_{\mathbf{w}}))^2\right]=\sum_{\mathbf{w}\in H(\mathbf{v})}\frac{1}{\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}}\frac{r(S(\mathbf{v}))}{r(\mathbf{w})}\,\mathbb{E}\!\left[(|\mathbf{w}|\,C_{\mathrm{SEI}}(T_{\mathbf{w}}))^2\right]$$
$$=\sum_{\mathbf{w}\in H(\mathbf{v})}\frac{1}{\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}}\frac{r(S(\mathbf{v}))}{r(\mathbf{w})}\left(\mathrm{Var}(|\mathbf{w}|\,C_{\mathrm{SEI}}(T_{\mathbf{w}}))+\left(\mathbb{E}[|\mathbf{w}|\,C_{\mathrm{SEI}}(T_{\mathbf{w}})]\right)^2\right)$$
$$=\sum_{\mathbf{w}\in H(\mathbf{v})}\frac{1}{\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}}\frac{r(S(\mathbf{v}))}{r(\mathbf{w})}\left(\mathrm{Var}(|\mathbf{w}|\,C_{\mathrm{SEI}}(T_{\mathbf{w}}))+\mathrm{Cost}(T_{\mathbf{w}})^2\right) \qquad (4)$$

Next,

$$\mathbb{E}\!\left[\frac{r(S(\mathbf{v}))}{r(\mathbf{w})}\,|\mathbf{w}|\,C_{\mathrm{SEI}}(T_{\mathbf{w}})\right]=\sum_{\mathbf{w}\in H(\mathbf{v})}P(\mathbf{w})\,\frac{r(S(\mathbf{v}))}{r(\mathbf{w})}\,\mathbb{E}[|\mathbf{w}|\,C_{\mathrm{SEI}}(T_{\mathbf{w}})]=\sum_{\mathbf{w}\in H(\mathbf{v})}\frac{1}{\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}}\,\mathbb{E}[|\mathbf{w}|\,C_{\mathrm{SEI}}(T_{\mathbf{w}})]=\sum_{\mathbf{w}\in H(\mathbf{v})}\frac{\mathrm{Cost}(T_{\mathbf{w}})}{\binom{|S(\mathbf{v})|-1}{|\mathbf{w}|-1}}$$