Tree pyramidal adaptive importance sampling

This paper introduces Tree-Pyramidal Adaptive Importance Sampling (TP-AIS), a novel iterated sampling method that outperforms current state-of-the-art approaches like deterministic mixture population Monte Carlo (DM-PMC Elvira et al. (2017)), mixture population Monte Carlo (M-PMC Cappé et al. (2008)) and layered adaptive importance sampling (LAIS Martino et al. (2017)).

TP-AIS iteratively builds a proposal distribution parameterized by a tree pyramid, where each tree leaf spans a convex subspace and represents its importance density. After each new sample operation, a set of tree leaves is subdivided, improving the approximation of the proposal distribution to the target density. Unlike the rest of the methods in the literature, TP-AIS is parameter-free and requires zero manual tuning to achieve its best performance.

Our proposed method is evaluated on randomized target probability density functions of varying complexity, and its behavior is analyzed across different dimensionalities. The results are compared to state-of-the-art iterative importance sampling approaches and other baseline MCMC approaches using the Normalized Effective Sample Size (N-ESS), the Jensen-Shannon Divergence, and time complexity.

1 Introduction

A central task in the application of probabilistic models is the estimation of latent or unknown variables from observed noisy data. Within the Bayesian framework, this involves combining prior knowledge about the latent variables with the information in the observations to obtain a joint posterior probability distribution. Inference using such models typically involves evaluating queries such as finding values that maximize the posterior (MAP queries), or computing the posterior probability of some variables given evidence on the others (conditional queries). Often, it involves computing expectations of some function of interest with respect to the posterior. In most real-world applications, exact inference is infeasible either because the dimensionality of the latent space is too high or because the posterior distribution has a highly complex form for which expectations are not analytically tractable.

In such situations, we resort to approximate inference. Historically, sampling methods, also known as Monte Carlo (MC) methods, have been the method of choice for such problems. The goal of MC methods is to generate random numbers from a target distribution of interest, p(x). These serve as an approximate representation of p(x) and can be used to numerically compute approximations to desired quantities such as expectations. Sampling methods are asymptotically exact in that they can generate exact results given infinite computational resources. Within the rich variety of sampling-based methods, a particularly important class are the Markov Chain Monte Carlo (MCMC) methods. In these, samples are not drawn directly from p(x) but instead from a Markov process whose stationary distribution is equal to p(x). These methods have the elegant property that (under reasonable constraints) the state distribution of the Markov process converges to p(x) irrespective of the starting distribution. Further, they scale very well with the dimensionality of the sample space. However, there are often significant challenges in practical implementation. One is that the samples are generated sequentially and hence these methods cannot be easily parallelized unless some underlying structure in either the distribution or the transition mechanism is assumed. More serious are the problems of burn-in and mixing. Burn-in time refers to the number of steps we must wait before being able to collect samples from the chain. This happens because the initial state distribution is arbitrary and we need to wait until it comes close to p(x). Mixing time of a Markov chain is the time until the state distribution of the chain is close to p(x). Long mixing times occur in multi-modal distributions where the regions between modes have low probability. In such situations, the chain might have explored one region well, but it can take a long time for it to transition between the modes.

An alternative to MCMC is the class of importance sampling (IS) methods. Historically, IS has been used to approximate expectations of the form E_p[f(x)], rather than generating samples from p(x). Samples are drawn from an alternative distribution, q(x), known as the proposal distribution, from which it is easier to generate samples. Each sample is assigned an importance weight to correct the bias introduced by sampling from the wrong distribution. Unlike MCMC methods, there is no phenomenon of burn-in; the sample generation process can be parallelized; and all the generated samples along with their weights are retained. The samples with their weights represent an approximation of the target distribution. The key challenge lies in choosing good proposal distributions. For poor choices of q(x), the set of importance weights may be dominated by a few weights having large values, resulting in an effective sample size much smaller than the apparent sample size. This situation gets exacerbated at higher dimensions and can result in an exponential blowup in the number of samples required, resulting in very poor sampling efficiency.
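As a minimal illustration of these ideas (a sketch, not the paper's implementation; the target and proposal below are arbitrary choices), a self-normalized importance sampler and its effective sample size can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a standard normal truncated to x > 0, known only up to a constant.
def target_unnorm(x):
    return np.exp(-0.5 * x**2) * (x > 0)

# Proposal q(x): an exponential distribution, easy to sample from.
xs = rng.exponential(scale=1.0, size=100_000)
w = target_unnorm(xs) / np.exp(-xs)   # importance weights p(x)/q(x), up to a constant
w_norm = w / w.sum()                  # self-normalized weights

# Estimate E_p[x]; the true value is sqrt(2/pi) ~ 0.798.
est = np.sum(w_norm * xs)

# Effective sample size: collapses when a few weights dominate.
ess = 1.0 / np.sum(w_norm**2)
```

When q(x) matches p(x) poorly, the normalized weights concentrate on a few samples and the effective sample size collapses, which is exactly the degradation that N-ESS measures.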

We present here, therefore, the Tree Pyramid Adaptive Importance Sampling (TP-AIS) method, which retains the desirable properties of importance sampling (parallelizability, no burn-in) while achieving a higher sampling efficiency relative to state-of-the-art IS methods. We use a hierarchical data structure called Tree Pyramids to describe the proposal distribution over the input probability space. Each node represents a convex K-dimensional subspace, with nodes further down the tree representing increasingly finer subspaces. Our algorithm adaptively divides the input space such that more samples are used to represent regions of higher probability mass, while fewer samples are used for lower-mass regions. This enables us to efficiently approximate the target distribution p(x). Unlike other state-of-the-art IS methods, TP-AIS is completely parameter-free and has an anytime property that enables trade-offs between computation time and approximation quality.

The paper is structured as follows: Section 2 provides an overview of sampling methods focusing on importance sampling algorithms related to our contribution and comments on their differences. Section 3 details the proposed sampling algorithm. Section 4 describes the methods used for evaluation and Section 5 shows the results and discussion. The paper concludes with the final remarks and future directions in Section 6.

2 Related work

Sampling methods rely on the law of large numbers to guarantee asymptotic convergence to the exact solution. The rate of convergence depends on the algorithm and the problem at hand, and often it is one of the factors that determines the selection of the sampling method to use for a specific application.

2.1 Direct sampling and Markov Chain Monte Carlo

When direct sampling from the target distribution is possible, a common option is to draw a set of samples and use them to build Monte Carlo approximations for the quantities of interest or a Kernel Density Estimate (KDE) that approximates the posterior. Sometimes, the target joint distribution can be represented as a product of factors. If these factors represent conditional distributions p(x_i | x_pa(i)), where x_pa(i) is the parent node of x_i, then it is possible to sample sequentially from the conditional distributions using an approach known as Ancestral Sampling Bishop (2006). For more general factors, approaches such as Gibbs sampling can be adopted Geman and Geman (1984). As the complexity of the models increases, direct sampling might become challenging. For these cases, two closely related approaches exist: Importance Sampling (IS) and Approximate Sampling.

The performance of IS methods on high-dimensional problems is poor, and Markov Chain Monte Carlo (MCMC) or Variational Inference (VI) methods are used instead Geman and Geman (1984); Hastings (1970); Hoffman and Gelman (2014); Duane et al. (1987); Jaakkola (2001). Although in this paper we do not focus on those methods, one of our evaluation baselines is based on the MCMC Metropolis-Hastings algorithm Hastings (1970), and some adaptive IS methods include MCMC steps in their adaptation Martino et al. (2017). Among other problems, MCMC does not approximate the partition function, and VI does not deal well with multi-modal target distributions.

2.2 Multi-Adaptive Importance sampling

As will be discussed in Section 3, the selection of the proposal distribution has a big impact on algorithm performance. The criteria for proposal distribution selection lead to different IS methods. A well-known method is Rejection Sampling and its generalizations Casella et al. (2004). Although it is an exact simulation method, its rejection mechanism requires many more sampling operations than other methods to produce a useful number of samples (low-variance estimates).

Multi-IS methods use multiple proposal distributions to generate samples Veach (1997). Moreover, the different mixture components can be weighted and their weights tuned depending on previous samples Owen and Zhou (2000). IS is not naturally iterative, but in practice implementations are. The sequential generation of samples is exploited by adaptive IS methods to adapt the proposal distribution based on previous samples, making the proposal closer to the target, improving sampling efficiency Karamchandani et al. (1989).

Current state-of-the-art methods combine the previous ideas into what is known as adaptive multiple-IS algorithms. Different sampling strategies, weighting schemes, and, most importantly, adaptation algorithms result in several methods with different performance and properties Martino et al. (2017); Cappé et al. (2008); Cornuet et al. (2012); Martino et al. (2015); Cappé et al. (2004b). A recent review that includes the aforementioned methods can be found in Bugallo et al. (2017). In this paper, we propose a new adaptation algorithm and a proposal distribution parameterized by a tree pyramid that, in combination, improve the performance over previous methods.

2.3 Stratified and quasi-Monte Carlo methods

For low-dimensional problems, there are deterministic procedures with better convergence rate and accuracy than MC and IS methods; examples are generalized stratified sampling Wessing (2017), Latin hypercube sampling Owen (2013) and other grid-based methods Joshi et al. (2016). However, deterministic sampling strategies scale even worse with dimensionality than IS and do not adapt iteratively as new samples are generated. The algorithm proposed in this paper is at the intersection of stratified sampling and multi-adaptive IS: it uses a non-uniform multi-resolution tree structure to define the strata that parameterize the mixture components of the proposal distribution.

The usage of tree structures has been explored in the sampling literature, especially in the computer graphics application domain. Agarwal et al. use a hierarchical representation of the image space to create different regular strata, with the subdivisions guided by a custom importance metric Agarwal et al. (2003). Clarberg et al. propose an approach based on wavelet function decomposition that is hierarchically constructed using a KD-tree; after its construction, it is used to warp an initial grid of samples towards regions with more light Clarberg et al. (2005). More recently, Conty and Kulla have shown the computational benefits of an adaptive tree sampling strategy that selects the light source used to compute a sample for a pixel value Estevez and Kulla (2018). Canevet et al., inspired by Monte-Carlo Tree Search, proposed an adaptive binary tree to reduce training time in neural networks; the idea focuses on generating the more important samples from the training dataset, weighting them using statistics from the loss function Canevet et al. (2016). These approaches are related to the method proposed in this paper, but are tailored to their specific applications.

3 Adaptive importance sampling with tree pyramids

As discussed in Section 1, Importance Sampling (IS) methods focus on approximating unknown quantities (expected value, variance, skewness, etc.) of an arbitrary function f(x) under an unknown PDF p(x). Those computations often involve intractable integrals like:

E_p[f(x)] = ∫ f(x) p(x) dx,

that are approximated with a Monte Carlo estimator built by sampling from p(x):

E_p[f(x)] ≈ (1/N) Σ_{i=1..N} f(x_i),  x_i ∼ p(x).

When direct sampling from p(x) is not possible, IS draws samples from a known proposal distribution q(x) and weights them by the ratio of likelihoods w(x_i) = p(x_i)/q(x_i), namely the importance weight:

E_p[f(x)] ≈ (1/N) Σ_{i=1..N} w(x_i) f(x_i),  x_i ∼ q(x),

where the variance of the IS estimator is determined by:

Var_q[ f(x) p(x)/q(x) ].

Therefore, the selection of the proposal distribution has a huge impact on sample efficiency by directly influencing the estimator variance. Usually, the closer q(x) and p(x) are, the better the method performs. It is common for sampling methods to have an iterative nature (e.g. Markov Chain Monte Carlo). Adaptive methods take advantage of previous samples drawn from q(x) to compute a new q(x) that is closer to p(x), improving the estimator performance.

This section provides the details of the adaptive importance sampling method proposed in this paper, which uses Tree Pyramids to describe a proposal distribution that focuses on generating samples in regions of high probability mass in order to efficiently approximate the target posterior PDF p(x). After each sampling step, the tree is expanded to accommodate the new sample and improve the approximation of the proposal to the target distribution.

3.1 Tree pyramid parameterized proposal distribution

A K-dimensional tree-pyramid (KD-TP) is a full tree T where each node n_i represents a K-dimensional convex subspace with center c_i, radius r_i, a sample x_i with corresponding importance weight w_i, and a pointer to the set of children of n_i. The radius is given by r_i = r_root / 2^l_i, where r_root is the radius of the root node and l_i is the node level. The level is defined as the minimum number of edges that connect a node with the tree root. A node has either zero or 2^K children. The span of a tree is limited by the root radius. However, it is possible to dynamically re-root the tree to double the radius of the representable convex subspace in each dimension. The most common instances of KD-TP are full binary trees (K=1), quadtrees (K=2) and octrees (K=3).

Without loss of generality, we use the same radius for all dimensions, making the subspaces hyper-cubes. This is not a limitation of the method, as it is possible to represent the subspaces with different shapes by allowing dimension-wise radii. If required, for example by the nature of the data, the implementation of dimension-wise radii is a straightforward extension of the proposed method, obtained by defining the radius as a K-dimensional vector.
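A leaf node of this structure can be sketched as a small record type; the Python below is an illustrative sketch (field names are our own, not the paper's):

```python
from dataclasses import dataclass, field

import numpy as np

@dataclass
class TPNode:
    center: np.ndarray     # c_i: center of the K-dimensional hyper-cube
    radius: float          # r_i = r_root / 2**level (same radius in every dimension)
    level: int             # minimum number of edges to the root
    sample: np.ndarray = None   # x_i drawn inside this cell (undefined at creation)
    weight: float = None        # importance weight w_i of that sample
    children: list = field(default_factory=list)  # empty, or 2**K child nodes

    @property
    def volume(self) -> float:
        # Hyper-cube volume (2r)^K, used later as part of the importance density.
        return (2.0 * self.radius) ** len(self.center)
```

Keeping the sample and weight inside the node is what allows the adaptation step to rank leaves without re-evaluating the target.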

Creation and population of a KD-TP are described by the algorithm that expands a node, namely, subdivides one of the convex subspaces represented by a leaf node into its 2^K children; see Algorithm 1. An example of a 1-D tree construction by node expansion is shown in Figure 1.

The probability distribution parameterized by a KD-TP T is defined as a mixture model of distributions q_k(x):

q(x) = Σ_{k ∈ L} λ_k q_k(x | c_k, r_k),   (1)

where the number of mixture components is determined by the cardinality |L| of the set of leaf nodes L, with the location of the kernels described by each leaf node center c_k and the scale determined by its radius r_k. For the experimental evaluation of the method, we plugged into Eq. 1 two types of distribution: a Uniform U(c_k − r_k, c_k + r_k) and a Multivariate Normal with mean c_k and covariance determined by r_k.
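To make Eq. 1 concrete for the Uniform kernel choice, the sketch below evaluates q(x) over a set of leaves, under the simplifying assumption of equal mixture weights 1/|L| (the actual weighting is defined by the algorithm variants discussed later):

```python
import numpy as np

def mixture_proposal_pdf(x, centers, radii):
    """Evaluate q(x) for a TP with Uniform kernels and equal weights 1/|L|.

    Each leaf k contributes density 1/(2 r_k)^K inside its hyper-cube
    [c_k - r_k, c_k + r_k] and zero outside. Equal mixture weights are an
    illustrative simplification, not the paper's exact weighting.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    K = x.size
    dens = 0.0
    for c, r in zip(centers, radii):
        if np.all(np.abs(x - np.asarray(c, dtype=float)) <= r):
            dens += 1.0 / (2.0 * r) ** K
    return dens / len(centers)
```

Because leaf cells at any fixed tree depth tile the space without overlap, at most one component is non-zero at a typical query point.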

Figure 1: Example of construction of a 1D-TP over the domain X by leaf node expansion as implemented in Algorithm 1.
1:function expandNode(n)
2:     Λ ← {−r_n/2, +r_n/2}^K ▷ Cartesian product of the offsets ±r_n/2 in K dims.
3:     T′ ← ∅ ▷ Initialize set of child nodes to null set.
4:     for λ ∈ Λ do
5:          c′ ← c_n + λ ▷ Obtain the set of centers for the new nodes.
6:          r′ ← r_n/2 ▷ Compute the radii of the new nodes.
7:          C′ ← ∅ ▷ New node has no child nodes.
8:          x′ ← ∅, w′ ← ∅ ▷ Initial sample and importance values are undefined.
9:          n′ ← (c′, r′, x′, w′, C′) ▷ Create child node.
10:         T′ ← T′ ∪ {n′} ▷ Insert new node into expansion set.
11:     return T′ ▷ Return the set of new nodes.
Algorithm 1 Leaf expansion algorithm for a K-Dimensional Tree Pyramid
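A direct Python transcription of Algorithm 1 could look as follows (an illustrative sketch; the node representation and names are our own):

```python
import itertools

import numpy as np

def expand_node(center, radius, level):
    """Split a K-dimensional hyper-cubic leaf into its 2**K children.

    Follows Algorithm 1: child centers are offset by +-r/2 along every
    dimension, child radii are r/2, and samples/weights start undefined.
    """
    center = np.asarray(center, dtype=float)
    children = []
    # Cartesian product of the two offsets {-r/2, +r/2} across K dimensions.
    for offsets in itertools.product((-radius / 2.0, radius / 2.0), repeat=len(center)):
        children.append({
            "center": center + np.asarray(offsets),
            "radius": radius / 2.0,
            "level": level + 1,
            "sample": None,   # initial sample undefined
            "weight": None,   # initial importance weight undefined
            "children": [],   # new node has no child nodes
        })
    return children
```

For K=2 this yields the four quadrants of the parent cell, matching the quadtree special case mentioned above.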

3.2 Quasi-Monte Carlo adaptive importance sampling with tree pyramids

Quasi-Monte Carlo methods lie between stratified and Monte Carlo sampling. In our method, the tree structure and the subdivision into convex subspaces are deterministic, while obtaining a sample from each hyper-cube is stochastic. This paper introduces the concept of non-uniform stratification that adapts the proposal distribution q(x) to the target distribution p(x). See a sequence of sampling and adaptation steps in Figure 2.

Like other AIS algorithms, TP-AIS can be divided into three major blocks: 1) sampling, 2) weighting and 3) adaptation. Algorithm 2 describes the process of tree pyramid adaptive importance sampling, which: 1) draws samples from the proposal distribution parameterized by a tree pyramid (lines 8-12), 2) computes the importance weights (line 13), and 3) adapts the proposal after new samples are drawn (line 14).

The resulting proposal distribution (in this case parameterized by the resulting tree after N samples) can be used as a generative model that approximates the target PDF. It is important to note that this sampling algorithm is anytime: there is no need to wait for all N samples to be generated to obtain a good approximation, since the tree generation can be interrupted and the resulting set of samples and tree structure used as an approximation to the posterior PDF. In what follows, TP-AIS (Algorithm 2) is described in more detail.

First, the tree is initialized by adding the root node at the center of the space, with a radius that spans the desired sampling space determined by the space limits x_min and x_max (lines 2-3). An initial sample is drawn from the proposal and added to the sample set (line 4). The importance weight of the initial sample is computed, and the root node is updated with the sample and its weight (lines 5-6).

The next steps are repeated until the desired number of samples N is obtained: 1) sort the set of leaf nodes by importance density, computed as the product of the volume of the leaf node hyper-cube and its importance weight (lines 7-9); 2) expand the first leaves (line 11); and 3) generate samples from the new mixture components induced by the new nodes (line 12) and compute their weights (line 13). Depending on the choices for sampling, weighting and resampling, different flavors of TP-AIS can be implemented.

1:function SampleTreePyramid(N, x_min, x_max)
2:     c ← (x_min + x_max)/2, r ← (x_max − x_min)/2 ▷ Compute root node center and radius.
4:     x_0 ∼ q_0(x), X ← {x_0} ▷ Initialize the sample set from the root.
5:     w_0 ← p(x_0)/q_0(x_0), W ← {w_0} ▷ Compute the importance weight.
6:     n_0 ← (c, r, x_0, w_0, ∅) ▷ Define the root node.
7:     T ← {n_0} ▷ Initialize the tree with the root node.
10:    while |X| < N do
11:         L ← leaves(T) ▷ Obtain the set of tree leaves.
12:         n ← argmax_{n_k ∈ L} w_k V_k ▷ Choose node with max estimated evidence.
13:         T′ ← expandNode(n) ▷ Expand n into its children.
14:         for n′ ∈ T′ do
15:              x′ ∼ q_{n′}(x) ▷ Sample the proposal of each new leaf.
16:              w′ ← p(x′)/q_{n′}(x′) ▷ Compute the importance weight.
18:              T ← T ∪ {n′} ▷ Insert the expanded nodes into the tree T.
19:              X ← X ∪ {x′} ▷ Insert the new samples.
20:              W ← W ∪ {w′} ▷ Insert the new weights.
21:     return X, W, T
Algorithm 2 Standard tree pyramid adaptive importance sampling algorithm.
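The loop of Algorithm 2 can be sketched in Python for a 1D target with Uniform leaf kernels (a simplified illustration under our own naming, not the paper's reference implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

def target(x):
    # Unnormalized 1D target: a Gaussian bump centered at 0.2.
    return np.exp(-0.5 * ((x - 0.2) / 0.3) ** 2)

def tp_ais(n_samples, lo=-1.0, hi=1.0):
    """Minimal 1D sketch of Algorithm 2 with Uniform leaf kernels.

    Each step expands the leaf with the largest importance density w * V
    (V = 2r in 1D) and draws one sample from each of its two children.
    """
    def make_leaf(c, r):
        x = rng.uniform(c - r, c + r)   # sample the leaf's Uniform proposal
        w = target(x) * (2 * r)         # w = p(x)/q(x) with q(x) = 1/(2r)
        return {"c": c, "r": r, "x": x, "w": w}

    root = make_leaf((lo + hi) / 2.0, (hi - lo) / 2.0)
    leaves, X, W = [root], [root["x"]], [root["w"]]
    while len(X) < n_samples:
        # Choose the leaf with maximum estimated evidence w * V.
        leaves.sort(key=lambda n: n["w"] * 2 * n["r"], reverse=True)
        node = leaves.pop(0)
        for off in (-node["r"] / 2.0, node["r"] / 2.0):
            child = make_leaf(node["c"] + off, node["r"] / 2.0)
            leaves.append(child)
            X.append(child["x"])
            W.append(child["w"])
    return np.array(X), np.array(W)

X, W = tp_ais(200)
est_mean = np.sum(W * X) / np.sum(W)   # self-normalized IS estimate of E_p[x]
```

Note how high-weight cells are refined first, so samples concentrate around the mode while large low-probability cells remain coarse.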

Standard and Deterministic Mixture weights

The algorithm described in Algorithm 2 corresponds to the standard TP-AIS (sTP-AIS), an implementation that follows the standard multiple importance sampling weighting Cappé et al. (2004a), where importance weights are computed using each individual mixture component q_{k_i} (the component from which x_i was drawn) as the proposal distribution:

w_i = p(x_i) / q_{k_i}(x_i).

Instead, weights can be computed using all the mixture components, obtaining the Deterministic Mixture (DM) weighting. DM weights come at a higher computational cost (they require the evaluation of all the proposals to compute each weight) but are known to perform better in terms of variance Elvira et al. (2019). DM-TP-AIS can be implemented by plugging into lines 4 and 13 the following weighting:

w_i = p(x_i) / ( (1/|L|) Σ_{k ∈ L} q_k(x_i) ).

In the special case of choosing a Uniform distribution to represent each of the subspace proposal distributions, the DM approach reduces to the standard case after normalizing the importance weights.
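The two weighting schemes differ only in the denominator; a sketch (array shapes and names are our own convention):

```python
import numpy as np

def standard_weights(p_vals, q_own):
    """sTP-AIS weighting: w_i = p(x_i) / q_{k_i}(x_i), using only the
    component q_{k_i} that generated x_i."""
    return p_vals / q_own

def dm_weights(p_vals, q_all):
    """Deterministic-mixture weighting: w_i = p(x_i) / ((1/M) sum_k q_k(x_i)).

    q_all[i, k] holds component k evaluated at x_i, so every weight costs
    M proposal evaluations instead of one.
    """
    return p_vals / q_all.mean(axis=1)
```

The extra cost of `dm_weights` is the full M-column evaluation of `q_all`, which is the trade-off for the lower weight variance noted above.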

Leaf resampling

Another variation of sTP-AIS is the resampling strategy. As described in Algorithm 2, each time a sample is generated in a new subspace, it is never removed from the sample set or replaced by another sample. This can be problematic if the target distribution is peaked: early unlucky samples in low-probability regions may span a large subspace, requiring many samples before that subspace is selected for expansion. This problem can be addressed by implementing a resampling strategy. Before obtaining a new set of tree leaves (line 8), all the leaf nodes of the tree are resampled and re-weighted. This leads to a lower acceptance rate but reduces the variance and increases robustness to multi-modal target distributions and unlucky samples. We believe other resampling strategies can lead to improved robustness and acceptance rate, which will be the topic of our future work.

Mixture TP-AIS

Instead of ordering the tree leaves by importance density to select which one to expand, it is possible to generate the new samples directly from the mixture distribution parameterized by the tree pyramid. This sampling approach is similar to the Mixture Population Monte Carlo (M-PMC Cappé et al. (2008)) approach, which, instead of generating one new sample from each individual component of the proposal, samples from the mixture itself. To implement this approach, the selection of the node to expand depends on a sample drawn from the weighted mixture model. The mixture weights depend on the normalized importance volume of the component:

λ_k = w_k V_k / Σ_{j ∈ L} w_j V_j.

Then, a sample from the weighted mixture model is obtained by Algorithm 3 in a two-step process. First, a random number is generated to select the sampling distribution (lines 3-6). Second, the sample is obtained by sampling from the selected mixture component. This approach replaces the sort operation in line 9 of Algorithm 2 with a sampling operation and adds a tree search operation to select the node to expand.

1:function SampleMixtureModel(T)
2:     L ← leaves(T) ▷ Obtain the set of tree leaves.
3:     u ∼ U(0, 1) ▷ Random number to select the mixture component.
4:     k ← 1, Λ ← λ_1 ▷ Initialize the cumulative mixture weight.
5:     while Λ < u do
6:          k ← k + 1, Λ ← Λ + λ_k ▷ Accumulate the normalized importance volumes.
7:     return x ∼ q_k(x) ▷ Sample the selected component.
Algorithm 3 Weighted mixture model sampling.
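The component-selection step of this weighted mixture sampling can be sketched as follows (the importance volumes below are hypothetical values for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_mixture_component(lambdas):
    """Select a mixture component by walking the cumulative mixture weights,
    as in the first step of Algorithm 3; the caller then samples from the
    chosen component's kernel."""
    u = rng.uniform()
    k, acc = 0, lambdas[0]
    while acc < u and k < len(lambdas) - 1:
        k += 1
        acc += lambdas[k]
    return k

# Hypothetical importance volumes w_k * V_k for three leaves, normalized
# into mixture weights lambda_k.
wv = np.array([0.2, 3.0, 0.8])
lambdas = wv / wv.sum()
counts = np.bincount(
    [sample_mixture_component(lambdas) for _ in range(5000)], minlength=3)
```

The leaf with the largest importance volume is selected most often, so expansion effort follows the current estimate of where the probability mass lies.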
Figure 2: Construction of a 1D-TP. Left: Target distribution (green), proposal distribution (red) and samples from each leaf (red cross). Right: TP parameterizing the proposal distribution. Top to bottom: Sequence of node expansions following the procedure described in Algorithm 2 for N=4, N=6, N=8 and N=50.

4 Evaluation

4.1 Baselines

We compare the presented TP-AIS method and some of its variations to a variety of state-of-the-art AIS methods: LAIS Martino et al. (2017), APIS Martino et al. (2015), M-PMC Cappé et al. (2008) and DM-PMC Elvira et al. (2017). The comparison is extended beyond AIS methods and includes an MCMC baseline with the Metropolis-Hastings algorithm Hastings (1970) and Multi-Nested sampling Feroz et al. (2014), an evolution of Nested Sampling Skilling (2006) (an algorithm designed for evidence estimation) that addresses the problems of Nested Sampling with multi-modal distributions.

4.2 Evaluation metrics

When the task at hand is to draw samples from a target distribution, the Normalized Effective Sample Size (N-ESS) is a popular measure of sampling efficiency for MCMC and IS methods. It can be interpreted as a factor that determines the number of samples required from the evaluated algorithm to generate an independent sample from the target distribution. This metric takes into account that samples in IS are biased by the proposal distribution, and MCMC samples are autocorrelated. In IS, an approximation is often defined as the inverse sum of the squared normalized importance weights:

ESS = 1 / Σ_{i=1..N} ŵ_i²,  with ŵ_i = w_i / Σ_{j=1..N} w_j,

while in MCMC the definition of ESS is:

ESS = N / (1 + 2 Σ_i ρ_i),

where N is the number of samples and ρ_i is the autocorrelation at the i-th MCMC step. Although this metric has received some criticism, it is still widely used and generally provides good performance Martino et al. (2017).
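Both definitions are straightforward to compute; a sketch:

```python
import numpy as np

def ness_is(weights):
    """N-ESS for IS: (1 / sum_i w_hat_i^2) / N, with normalized weights w_hat."""
    w = np.asarray(weights, dtype=float)
    w_hat = w / w.sum()
    return (1.0 / np.sum(w_hat**2)) / w.size

def ess_mcmc(rho, n):
    """ESS for MCMC: N / (1 + 2 sum_i rho_i), rho_i the lag-i autocorrelation."""
    return n / (1.0 + 2.0 * np.sum(rho))
```

Uniform importance weights give N-ESS = 1, while a single dominant weight drives it toward 1/N; positive autocorrelation similarly shrinks the MCMC ESS below N.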

Besides ESS, we evaluate the approximation quality of the adapted proposal distributions obtained by AIS methods with a probability distribution similarity metric, the Jensen-Shannon Divergence (a symmetric version of the more popular Kullback-Leibler divergence), shown in Eq. 3:

JSD(P ‖ Q) = ½ KL(P ‖ M) + ½ KL(Q ‖ M),  with M = ½(P + Q).   (3)

Unfortunately, the JSD requires the computation of integrals of intractable densities. Because the parameter space we are considering is bounded, we could approximate them with enough precision by considering a sufficiently small discretization step. However, this becomes rapidly intractable when the number of dimensions increases. To address this issue, we use a Monte Carlo estimate of the metric by drawing uniformly distributed samples over the domain and computing the empirical JSD of P and Q, see Eq. 4.
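Such an empirical JSD estimate can be sketched as follows (a simplified 1D version under our own conventions):

```python
import numpy as np

rng = np.random.default_rng(3)

def jsd_mc(p_fn, q_fn, lo, hi, n=20_000):
    """Empirical JSD between two densities on a bounded 1D domain [lo, hi].

    Both densities are evaluated at uniform samples and normalized into
    discrete distributions over those points, after which the discrete JSD
    (in nats, bounded by log 2) is computed.
    """
    xs = rng.uniform(lo, hi, size=n)
    p = p_fn(xs)
    p = p / p.sum()
    q = q_fn(xs)
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # 0 * log 0 is taken as 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical densities give a divergence of 0, and densities with disjoint support give the maximum value log 2.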

Figure 3: Example evaluation of our sTP-AIS with N=25. a) Ground truth PDF . b) Importance samples (red) generated from TP sampling algorithm and proposal PDF (green). c) Overlaid ground truth distribution (blue) and adapted proposal (green) with the corresponding Jensen-Shannon Divergence (red).

The MCMC-based baseline methods do not have a proposal distribution that is adapted to approximate the target distribution. Therefore, to evaluate how close the generated samples are to the target distribution, we use a Kernel Density Estimate (KDE) of the density computed from the generated samples.

Some sampling methods are based on accept/reject mechanisms Hastings (1970); Skilling (2006) or have complex sampling procedures Feroz et al. (2014) that can yield good ESS at an increased computational cost; hence we include runtime among the evaluation metrics. This is of special interest in cases where the computational cost of the likelihood function is negligible and the sampling strategy becomes the computational bottleneck. An example evaluation with N=25 is depicted in Figure 3. The evaluation proceeds as follows:

  1. Select a known PDF p(x) as the ground truth distribution.

  2. Use the evaluated sampling method to draw N samples.

  3. If the IS method used is adaptive, the proposal distribution q(x) obtained after drawing the N samples is used. If the evaluated method only generates samples and does not adapt the proposal, the drawn samples are used to compute a KDE approximate PDF q(x).

  4. Compute evaluation metrics: ESS, elapsed time, and JSD(p ‖ q).

4.3 Ground truth distributions

Figure 4: Example of 2D ground truth PDFs used to evaluate the sampling methods.

The performance of sampling methods often depends on the properties of the target distribution p(x). Thus, we evaluate the methods using different ground truth parametric PDFs with a varied number of modes and randomized moments; see some examples in Figure 4. We use Gaussian Mixture Models (GMMs) to parameterize the ground truth distributions as follows:

p(x) = Σ_{m=1..M} π_m N(x; μ_m, Σ_m),  with Σ_m π_m = 1,

where μ_m is a vector of means, Σ_m is a covariance matrix and π_m is the normalized mixture weight. The sub-index m is used to reference each of the M mixture components of dimensionality K that compose the mixture model. Examples of 2-D GMMs used for evaluation are shown in Figure 4.

For evaluating the sampling methods, we use three different types of GMMs with randomized first and second order moments: i) the first target distribution, referred to as "normal", has just one mixture component with randomized mean and covariance; ii) the second target distribution is a GMM with 5 mixture components with randomized moments; iii) the third target distribution is a GMM version of the egg crate function, with 4 equidistant modes per dimension. An example of each ground truth distribution is depicted in Figure 4.
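A randomized GMM ground truth of this kind can be generated as follows (the randomization ranges and diagonal covariances below are our own illustrative choices, not the paper's exact settings):

```python
import numpy as np

rng = np.random.default_rng(4)

def random_gmm(n_modes, dim):
    """Build a randomized GMM ground truth with random means, random diagonal
    covariances and normalized mixture weights."""
    means = rng.uniform(-1.0, 1.0, size=(n_modes, dim))
    stds = rng.uniform(0.05, 0.2, size=(n_modes, dim))
    weights = rng.uniform(0.5, 1.0, size=n_modes)
    weights = weights / weights.sum()

    def pdf(x):
        x = np.atleast_2d(x)
        dens = np.zeros(x.shape[0])
        for mu, s, w in zip(means, stds, weights):
            norm = np.prod(1.0 / (np.sqrt(2.0 * np.pi) * s))
            dens += w * norm * np.exp(-0.5 * np.sum(((x - mu) / s) ** 2, axis=1))
        return dens

    return pdf, weights

pdf, weights = random_gmm(n_modes=5, dim=2)
```

Fixing the random seed, as done in our experiments, makes the randomized targets reproducible across the evaluated methods.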

Figure 5: Qualitative comparison of our TP-AIS and the baseline methods after 50 samples for a Gaussian Mixture Model target distribution (shown in green). The proposal distribution after 50 samples is shown in red, with the 50 samples drawn as red crosses.

5 Results and discussion

In this evaluation we have considered the simple version of TP-AIS with resampling, which will be the flavor of TP-AIS used unless otherwise mentioned. For more results with other variations of the method, we refer the reader to the additional material.

To obtain the results presented in this section, we have conducted the procedure detailed in Section 4 a hundred times for each combination of dimensionality and randomly sampled target PDF. The random seed was fixed to evaluate all the methods with the same target distributions.

In Figure 5 we show a visualization of the posterior approximation obtained with the different methods evaluated. It can be seen that, for the cases shown, the TP-AIS approach provides a much closer approximation to the target distribution with the same number of samples. In the remainder of this section, the evaluated metrics vs. the number of samples are discussed.

One of the most impactful aspects of importance sampling performance is the similarity between the proposal distribution and the target distribution. We have evaluated this similarity using the Jensen-Shannon Divergence detailed in Section 4. For lower-dimensional (1D-4D) problems, the left column of Figure 6 shows that our TP-AIS method greatly outperforms the baselines by converging much faster to a better approximate solution in the three types of distributions evaluated. For higher-dimensional cases, TP-AIS performance is on par with the other methods; see the right column of Figure 6.

Figure 6: Experimental result statistics for the Jensen-Shannon Divergence metric for different target distributions and dimensionality.

N-ESS measures the sample efficiency of a method; results for this metric are shown in Figure 7. TP-AIS outperforms existing methods in terms of sample efficiency in low dimensions. In higher dimensions, all the methods yield poor performance, as can be seen in the right column of Figure 7. This low performance is an expected result: in higher dimensions, density functions become more concentrated, causing the variance of the importance weights to sky-rocket and consequently reducing the ESS.

Figure 7: Experimental result statistics for the Normalized Effective Sample Size metric for different target distributions and dimensionality.
Figure 8: Experimental result statistics for the computational time complexity metric for different target distributions and dimensionality.

Regarding the time complexity metric, Figure 8 shows how the evaluated methods scale with the number of samples. Although TP-AIS implements a more sophisticated sampling method, the time required to generate N samples is comparable to the rest of the methods for a reasonable number of samples. A surprising result observed in higher-dimensional problems (Figure 8, right) is the sub-par performance of the MCMC-MH approach, which is supposed to scale well with dimensions. With the increase in dimensions, the target density is more concentrated, causing many of the proposed MC moves to have very low acceptance probability and be rejected, dramatically increasing the time required to generate N samples. This is a well-known problem that MCMC methods suffer from, and it can be alleviated by fine-tuning the MC proposal distribution. It can be seen that the other methods are not impacted by this issue.


The presented method provides a significant speed-up of the sampling procedure by increasing the sampling efficiency (see Figure 7). It might seem that the more complex formulation of the proposal distribution and its adaptation could impact the computational complexity of the sampling algorithm. However, although in some cases TP-AIS requires more time to generate a specific number of samples (especially when the desired number of samples is high), its sampling efficiency allows the method to obtain a better proposal distribution approximation with fewer samples. For example, in the 1D GMM case shown in Figure 6, TP-AIS obtains better accuracy (i.e. lower JSD) with 100 samples than the other benchmarked methods after 1000 samples.

In contrast to other quasi-Monte Carlo approaches Wessing (2017); Joshi et al. (2016), our proposed algorithm retains the anytime property of MCMC methods: the sampling process can be halted at any time, allowing for accuracy-time trade-offs.

The main drawback of the proposed approach is dimensionality, which limits its application to low- and mid-dimensional problems. Each new sampling step subdivides the space into subspaces, which constrains the number of samples that must be generated at each step. For example, in a 9D problem each sampling step needs to generate 512 samples which, depending on the application, might be overwhelming. On the other hand, this fact opens the door to parallel implementations that compute the likelihood of all new samples concurrently, further improving the number of samples per unit of time that the approach can deliver. In any case, this is a well-known limitation of IS methods in general, and TP-AIS is not an exception.
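The per-step cost above follows from splitting a leaf along every axis, which yields 2^d children (this splitting rule is inferred from the 9D example in the text, where one step produces 512 = 2^9 samples); a tiny illustration:

```python
def samples_per_subdivision(dim):
    """Number of new samples required when a tree-pyramid leaf spanning
    a d-dimensional subspace is split along every axis: each split
    produces 2**dim children, and each child needs one new
    sample/likelihood evaluation.
    """
    return 2 ** dim

# The per-step cost grows exponentially with dimension:
# 3D -> 8 evaluations per split, 9D -> 512.
```

This exponential growth is also what makes the parallel evaluation of all children attractive: the 2^d likelihood computations within one step are independent of each other.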

Another limitation of TP-AIS is the need for boundaries on the sampling space. Regions of non-zero probability may lie outside the approximated space; the probability mass outside the space is then redistributed over the sampling space, resulting in density over-estimation. However, in practice it is reasonable to assume known boundaries on the domain of the sampled function, and in our evaluation this issue did not prove to be a limitation.

Besides the sampling domain, the parameter-free nature of TP-AIS is one of its strengths: all the other methods compared have parameters that need to be tuned. Proposal distribution design is known to have a large impact on performance, and because these parameters govern the adaptation of the proposal distribution, good parameter selection strongly influences the inference result. TP-AIS circumvents this problem and lets the user exploit its full potential without deep domain knowledge.

6 Conclusions

In this paper, we have presented a new adaptive importance sampling algorithm that parameterizes the proposal distribution using tree pyramids, structuring the proposal into partitioned convex subspaces. An exhaustive evaluation on a variety of complex target posteriors has shown the accuracy and sample efficiency of the proposed approach compared to several well-known and state-of-the-art approaches.

We presented the first steps of a promising sampling scheme that combines quasi-Monte Carlo techniques with adaptive importance sampling. Further research into resampling strategies, kernel selection, and the representation of subspaces could lead to improved sample efficiency and better approximations. A future extension is to explore a Rao-Blackwellized representation of subspaces instead of discarding previous samples.

The convergence speed of this sampling technique and its high N-ESS may enable the application of IS to real-time inference in applications of mid-to-low dimensionality, such as object 6D pose estimation.

Extended state-of-the-art comparison results

Extended TP-AIS flavors comparison results


  1. S. Agarwal, R. Ramamoorthi, S. Belongie and H. W. Jensen (2003) Structured Importance Sampling of Environment Maps. In ACM SIGGRAPH 2003 Papers, SIGGRAPH ’03, New York, NY, USA, pp. 605–612. External Links: Document, ISBN 1-58113-709-5, Link Cited by: §2.3.
  2. C. M. Bishop (2006) Pattern recognition and machine learning. springer. Cited by: §2.1.
  3. M. F. Bugallo, V. Elvira, L. Martino, D. Luengo, J. Miguez and P. M. Djuric (2017-07) Adaptive Importance Sampling: The past, the present, and the future. IEEE Signal Processing Magazine 34 (4), pp. 60–79. External Links: Document, ISSN 1053-5888 Cited by: §2.2.
  4. O. Canevet, C. Jose and F. Fleuret (2016) Importance Sampling Tree for Large-scale Empirical Expectation. Proceedings of The 33rd International Conference on Machine Learning 48, pp. 1454–1462. External Links: ISBN 9781510829008, Link Cited by: §2.3.
  5. O. Cappé, A. Guillin, J. M. Marin and C. P. Robert (2004) Population Monte Carlo. Journal of Computational and Graphical Statistics 13 (4), pp. 907–929. External Links: Document, ISSN 10618600 Cited by: §2.2, §3.2.1.
  7. O. Cappé, R. Douc, A. Guillin, J. M. Marin and C. P. Robert (2008) Adaptive importance sampling in general mixture classes. Statistics and Computing 18 (4), pp. 447–459. External Links: Document, 0710.4242, ISSN 09603174, Link Cited by: Tree pyramidal adaptive importance sampling, §2.2, §3.2.3, §4.1.
  8. G. Casella, C. P. Robert and M. T. Wells (2004) Generalized accept-reject sampling schemes. In A Festschrift for Herman Rubin, Lecture Notes–Monograph Series, Vol. Volume 45, pp. 342–347. External Links: Document, Link Cited by: §2.2.
  9. P. Clarberg, W. Jarosz, T. Akenine-Möller and H. W. Jensen (2005) Wavelet importance sampling. ACM Transactions on Graphics 24 (3), pp. 1166. External Links: Document, ISSN 07300301, Link Cited by: §2.3.
  10. J. Cornuet, J. Marin, A. Mira and C. P. Robert (2012) Adaptive multiple importance sampling. Scandinavian Journal of Statistics 39 (4), pp. 798–812. Cited by: §2.2.
  11. S. Duane, A. D. Kennedy, B. J. Pendleton and D. Roweth (1987) Hybrid Monte Carlo. Physics Letters B 195 (2), pp. 216–222. External Links: Document, ISSN 0370-2693, Link Cited by: §2.1.
  12. V. Elvira, L. Martino, D. Luengo and M. F. Bugallo (2017) Improving population Monte Carlo: Alternative weighting and resampling schemes. Signal Processing 131 (Mc), pp. 77–91. External Links: Document, arXiv:1607.02758v1, ISSN 01651684, Link Cited by: Tree pyramidal adaptive importance sampling, §4.1.
  13. V. Elvira, L. Martino, D. Luengo and M. F. Bugallo (2019-02) Generalized multiple importance sampling. Statist. Sci. 34 (1), pp. 129–155. External Links: Document, Link Cited by: §3.2.1.
  14. A. C. Estevez and C. Kulla (2018-08) Importance Sampling of Many Lights with Adaptive Tree Splitting. Proc. ACM Comput. Graph. Interact. Tech. 1 (2), pp. 25:1–25:17. External Links: Document, ISSN 2577-6193, Link Cited by: §2.3.
  15. F. Feroz, M. P. Hobson, E. Cameron and A. N. Pettitt (2014) Importance Nested Sampling and the MULTINEST Algorithm. arXiv preprint arXiv:1306.2144. External Links: Document, ISSN 00144754, Link Cited by: §4.1, §4.2.
  16. S. Geman and D. Geman (1984-11) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-6 (6), pp. 721–741. External Links: Document Cited by: §2.1.
  17. W. K. Hastings (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 (1), pp. 97–109. External Links: Document, ISSN 0006-3444, Link Cited by: §2.1, §4.1, §4.2.
  18. M. D. Hoffman and A. Gelman (2014) The no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo.. Journal of Machine Learning Research 15 (1), pp. 1593–1623. Cited by: §2.1.
  19. T. Jaakkola (2001) Tutorial on variational approximation methods. Advanced mean field methods: theory and practice, pp. 129. Cited by: §2.1.
  20. C. Joshi, P. T. Brown and S. Joe (2016) Improving grid based bayesian methods. arXiv preprint arXiv:1609.09174. Cited by: §2.3, §5.
  21. A. Karamchandani, P. Bjerager and C. Cornell (1989) Adaptive importance sampling. In Structural Safety and Reliability, pp. 855–862. Cited by: §2.2.
  22. L. Martino, V. Elvira, D. Luengo and J. Corander (2017-05) Layered adaptive importance sampling. Statistics and Computing 27 (3), pp. 599–623. External Links: Document, ISSN 1573-1375, Link Cited by: Tree pyramidal adaptive importance sampling, §2.1, §2.2, §4.1.
  23. L. Martino, V. Elvira and F. Louzada (2017) Effective sample size for importance sampling based on discrepancy measures. Signal Processing 131, pp. 386–401. External Links: Document, ISSN 0165-1684, Link Cited by: §4.2.
  24. L. Martino, V. Elvira, D. Luengo and J. Corander (2015) An adaptive population importance sampler: Learning from uncertainty. IEEE Transactions on Signal Processing 63 (16), pp. 4422–4437. Cited by: §2.2, §4.1.
  25. A. B. Owen (2013) Monte carlo theory, methods and examples. Online. Cited by: §2.3.
  26. A. Owen and Y. Zhou (2000) Safe and effective importance sampling. Journal of the American Statistical Association 95 (449), pp. 135–143. Cited by: §2.2.
  27. J. Skilling (2006) Nested Sampling for Bayesian Computations. Bayesian Analysis 4, pp. 833–860. External Links: Document, ISBN 0-7354-0217-5, ISSN 19360975, Link Cited by: §4.1, §4.2.
  28. E. Veach (1997) Robust monte carlo methods for light transport simulation. Vol. 1610, Stanford University PhD thesis. Cited by: §2.2.
  29. S. Wessing (2017) Experimental analysis of a generalized stratified sampling algorithm for hypercubes. arXiv preprint arXiv:1705.03809. Cited by: §2.3, §5.