Generalized Rejection Sampling Schemes and Applications in Signal Processing

Abstract

Bayesian methods and their implementations by means of sophisticated Monte Carlo techniques, such as Markov chain Monte Carlo (MCMC) and particle filters, have become very popular in signal processing in recent years. However, in many problems of practical interest these techniques demand procedures for sampling from probability distributions with non-standard forms; hence, we are often brought back to the consideration of fundamental simulation algorithms, such as rejection sampling (RS). Unfortunately, the use of RS techniques demands the calculation of tight upper bounds for the ratio of the target probability density function (pdf) over the proposal density from which candidate samples are drawn. Except for the class of log-concave target pdf’s, for which an efficient algorithm exists, there are no general methods to analytically determine this bound, which has to be derived from scratch for each specific case. In this paper, we introduce new schemes for (a) obtaining upper bounds for likelihood functions and (b) adaptively computing proposal densities that approximate the target pdf closely. The former class of methods provides the tools to easily sample from a posteriori probability distributions (that appear very often in signal processing problems) by drawing candidates from the prior distribution. However, these methods are even more useful when they are exploited to derive the generalized adaptive RS (GARS) algorithm introduced in the second part of the paper. The proposed GARS method yields a sequence of proposal densities that converge towards the target pdf and enable a very efficient sampling of a broad class of probability distributions, possibly with multiple modes and non-standard forms. We provide some simple numerical examples to illustrate the use of the proposed techniques, including an example of target localization using range measurements, often encountered in sensor network applications.

1 Introduction

Bayesian methods have become very popular in signal processing during the past decades and, with them, there has been a surge of interest in the Monte Carlo techniques that are often necessary for the implementation of optimal a posteriori estimators [6]. Indeed, Monte Carlo statistical methods are powerful tools for numerical inference and optimization [13]. Currently, there exist several classes of Monte Carlo techniques, including the popular Markov chain Monte Carlo (MCMC) [6] and particle filtering [3] families of algorithms, which enjoy numerous applications. However, in many problems of practical interest these techniques demand procedures for sampling from probability distributions with non-standard forms; hence, we are often brought back to the consideration of fundamental simulation algorithms, such as importance sampling [2], inversion procedures [13] and the accept/reject method, also known as rejection sampling (RS).

The RS approach [13] is a classical Monte Carlo technique for “universal sampling”. It can be used to generate samples from a target probability density function (pdf) by drawing from a possibly simpler proposal density. The sample is either accepted or rejected by an adequate test of the ratio of the two pdf’s, and it can be proved that accepted samples are actually distributed according to the target density. RS can be applied as a tool by itself, in problems where the goal is to approximate integrals with respect to (w.r.t.) the pdf of interest, but more often it is a useful building block for more sophisticated Monte Carlo procedures [8]. An important limitation of RS methods is the need to analytically establish a bound for the ratio of the target and proposal densities, since there is a lack of general methods for the computation of exact bounds.

One exception is the so-called adaptive rejection sampling (ARS) method [8] which, given a target density, provides a procedure to obtain both a suitable proposal pdf (easy to draw from) and the upper bound for the ratio of the target density over this proposal. Unfortunately, this procedure is only valid when the target pdf is strictly log-concave, which is often not the case in practical applications. Although an extension has been proposed [9] that enables the application of the ARS algorithm to T-concave distributions (where T is a monotonically increasing function, not necessarily the logarithm), it does not address the main limitations of the original method (e.g., the impossibility to draw from multimodal distributions) and is hard to apply, due to the difficulty of finding adequate transformations other than the logarithm. Another algorithm, called adaptive rejection Metropolis sampling (ARMS) [14], is an attempt to extend the ARS to multimodal densities by adding Metropolis-Hastings steps. However, the use of an MCMC procedure has two important consequences. First, the resulting samples are correlated (unlike in the original ARS method) and, second, for multimodal distributions the Markov chain often tends to get trapped in a single mode.

In this paper we propose general procedures to apply RS when the target pdf is the posterior density of a signal of interest (SoI) given a collection of observations. Unlike the ARS technique, our methods can handle target pdf’s with several modes (hence non-log-concave) and, unlike the ARMS algorithm, they do not involve MCMC steps. Hence, the resulting samples are independent and come exactly from the target pdf.

We first tackle the problem of computing an upper bound for the likelihood of the SoI given fixed observations. The proposed solutions, which include both closed-form bounds and iterative procedures, are useful when we draw the candidate samples from the prior pdf.

In the second part of the paper, we extend our approach to devise a generalization of the ARS method that can be applied to a broad class of pdf’s, possibly multimodal. The generalized algorithm yields an efficient proposal density, tailored to the target density, that can attain a much better acceptance rate than the prior distribution. We remark that the accepted samples from the target pdf are independent and identically distributed (i.i.d.).

The remainder of the paper is organized as follows. We formally describe the signal model in Section 2. Some useful definitions and basic assumptions are introduced in Section 3. In Section 4, we propose a general procedure to compute upper bounds for a large family of likelihood functions. The ARS method is briefly reviewed in Section 5, while the main contribution of the paper, the generalization of the ARS algorithm, is introduced in Section 6. Section 7 is devoted to simple numerical examples and we conclude with a brief summary in Section 8.

2 Model and Problem Statement

2.1 Notation

Scalar magnitudes are denoted using regular face letters (e.g., x, y_i), while vectors are displayed as bold-face letters (e.g., the observation vector y). We indicate random variates with upper-case letters (e.g., X, Y) and use lower-case letters to denote the corresponding realizations (e.g., x, y). We use the letter p to denote the true probability density function (pdf) of a random variable or vector. This is an argument-wise notation, common in Bayesian analysis: for two random variables X and Y, p(x) is the true pdf of X and p(y) is the true pdf of Y, possibly different. The conditional pdf of Y given X = x is written p(y|x). Sets are denoted with calligraphic upper-case letters (e.g., the partition sets X_j introduced in Section 3).

2.2 Signal Model

Many problems in science and engineering involve the estimation of an unobserved SoI, X, from a sequence of related observations. We assume an arbitrary prior probability density function (pdf) for the SoI, p(x), and consider scalar random observations, Y_i, i = 1, ..., n, which are obtained through nonlinear transformations of the signal X contaminated with additive noise. Formally, we write

Y_i = g_i(X) + Θ_i,   i = 1, ..., n,   (Equation 1)

where Y = [Y_1, ..., Y_n]^T is the random observation vector, g_i, i = 1, ..., n, are nonlinearities and Θ_1, ..., Θ_n are independent noise variables, possibly with different distributions for each i. We write y = [y_1, ..., y_n]^T for the vector of available observations, i.e., a realization of Y.

We assume exponential-type noise pdf's, of the form

p(ϑ_i) = k_i exp{−V̄_i(ϑ_i)},   (Equation 2)

where k_i is a real constant and V̄_i(ϑ_i) is a function, subsequently referred to as marginal potential, with the following properties:

  1. It is real and non-negative, i.e., V̄_i(ϑ_i) ≥ 0 for all ϑ_i.

  2. It is increasing (dV̄_i/dϑ_i > 0) for ϑ_i > 0 and decreasing (dV̄_i/dϑ_i < 0) for ϑ_i < 0.

These conditions imply that V̄_i(ϑ_i) has a unique minimum at ϑ_i = 0 and, as a consequence, p(ϑ_i) has only one maximum (mode) at ϑ_i = 0. Since the noise variables Θ_1, ..., Θ_n are independent, the joint pdf p(ϑ_1, ..., ϑ_n) = p(ϑ_1) × ... × p(ϑ_n) is easy to construct and we can define a joint potential function V̄(ϑ_1, ..., ϑ_n) as

V̄(ϑ_1, ..., ϑ_n) = Σ_{i=1}^{n} V̄_i(ϑ_i).   (Equation 3)

Substituting (Equation 2) into (Equation 3) yields

p(ϑ_1, ..., ϑ_n) = k exp{−V̄(ϑ_1, ..., ϑ_n)},   (Equation 4)

where k = k_1 × ... × k_n is a constant. In subsequent sections we will be interested in a particular class of joint potential functions denoted as

V̄_p(ϑ_1, ..., ϑ_n) = Σ_{i=1}^{n} |ϑ_i|^p,

where the subscript p > 0 identifies the specific member of the class. In particular, the function obtained for p = 2 is termed quadratic potential.

Let g(x) = [g_1(x), ..., g_n(x)]^T be the vector-valued nonlinearity built from the scalar functions g_1, ..., g_n. The scalar observations are conditionally independent given a realization of the SoI, X = x, hence the likelihood function ℓ(x; y, g) ∝ p(y|x) can be factorized as

ℓ(x; y, g) ∝ Π_{i=1}^{n} p(y_i|x),   (Equation 5)

where p(y_i|x) = k_i exp{−V̄_i(y_i − g_i(x))} by (Equation 1) and (Equation 2). The likelihood in (Equation 5) induces a system potential function V(x; y, g), defined (up to an additive constant) as

V(x; y, g) = −log ℓ(x; y, g),   (Equation 6)

that depends on x, the observations y, and the function g. Using (Equation 4) and (Equation 6), we can write the system potential in terms of the joint potential,

V(x; y, g) = V̄(y_1 − g_1(x), ..., y_n − g_n(x)).
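As a standard illustration of these definitions (ours, not taken from the original derivation): for zero-mean Gaussian noise with variance σ_i², the exponential-type form (Equation 2) holds with k_i = (2π σ_i²)^{−1/2}, the marginal potential is a (weighted) quadratic, and the system potential reduces to a weighted least-squares cost,

p(ϑ_i) = (2π σ_i²)^{−1/2} exp{−ϑ_i²/(2σ_i²)},   V̄_i(ϑ_i) = ϑ_i²/(2σ_i²),   V(x; y, g) = Σ_{i=1}^{n} (y_i − g_i(x))²/(2σ_i²),

so the familiar model "y_i = g_i(x) plus Gaussian noise" is a particular case of the setting in this section.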

2.3 Rejection Sampling

Assume that we wish to approximate, by sampling, some integral of the form I(f) = ∫ f(x) p(x|y) dx, where f is some measurable function of x and p(x|y) ∝ ℓ(x; y, g) p(x) is the posterior pdf of the SoI given the observations. Unfortunately, it may not be possible in general to draw directly from p(x|y), so we need to apply simulation techniques to generate adequate samples. One appealing possibility is to perform RS using the prior, p(x), as a proposal function. In such a case, let γ be a lower bound for the system potential, γ ≤ V(x; y, g) for all x, so that L = exp{−γ} is an upper bound for the likelihood, ℓ(x; y, g) ≤ L. We can generate samples according to the standard RS algorithm.

  1. Set i = 1.

  2. Draw a sample x' from p(x) and u' from U(0, 1), where U(0, 1) is the uniform pdf in [0, 1].

  3. If u' ≤ ℓ(x'; y, g)/L then accept x^(i) = x', else discard x' and go back to step 2.

  4. Set i = i + 1. If i > N then stop, else go back to step 2.

Then, I(f) can be approximated as I(f) ≈ (1/N) Σ_{i=1}^{N} f(x^(i)). The fundamental figure of merit of a rejection sampler is the acceptance rate, i.e., the mean number of accepted samples over the total number of proposed candidates.
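The following Python sketch (ours, not part of the paper) implements the loop above: candidates are drawn from the prior and accepted with probability ℓ(x')/L. The toy model in the usage lines (an exponential observation function, Gaussian noise and a standard Gaussian prior, with names y_obs and sigma) is purely hypothetical.

import numpy as np

def rejection_sampler(prior_sampler, likelihood, L, N, rng=None):
    # Standard RS with the prior as proposal (steps 1-4 above).
    # prior_sampler(rng): draws one candidate from the prior p(x).
    # likelihood(x): evaluates ell(x; y, g); L must satisfy likelihood(x) <= L for all x.
    rng = np.random.default_rng() if rng is None else rng
    samples, proposed = [], 0
    while len(samples) < N:
        x = prior_sampler(rng)              # step 2: candidate from the prior
        u = rng.uniform()                   # step 2: u ~ U(0, 1)
        proposed += 1
        if u <= likelihood(x) / L:          # step 3: accept/reject test
            samples.append(x)               # accepted samples are i.i.d. from p(x|y)
    return np.array(samples), len(samples) / proposed   # samples and acceptance rate

# Hypothetical toy model: y = exp(x) + Gaussian noise, standard Gaussian prior on x.
y_obs, sigma = 5.0, 0.5
lik = lambda x: np.exp(-0.5 * (y_obs - np.exp(x)) ** 2 / sigma ** 2)   # <= 1 everywhere
xs, rate = rejection_sampler(lambda r: r.normal(), lik, L=1.0, N=500)
# I(f) is then approximated by np.mean(f(xs)).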

In Section 4, we address the problem of analytically calculating the bound L = exp{−γ}. Note that, since the exponential function is monotonic, it is equivalent to maximize the likelihood ℓ(x; y, g) w.r.t. x and to minimize the system potential V(x; y, g) also w.r.t. x. As a consequence, we may focus on the calculation of a lower bound γ for V(x; y, g). Note that this problem is far from trivial. Even for very simple marginal potentials (e.g., quadratic ones), the system potential V(x; y, g) can be highly multimodal w.r.t. x. See the example in Section 7.1 for an illustration.

3 Definitions and Assumptions

Hereafter, we restrict our attention to the case of a scalar SoI. This is done for the sake of clarity, since dealing with the general case requires additional definitions and notation. The techniques to be described in Sections 4-6 can be extended to the general case, although this extension is not trivial. The example in Section 7.3 illustrates how the proposed methodology is also useful in higher-dimensional spaces, though.

For a given vector of observations y = [y_1, ..., y_n]^T, we define the set of simple estimates of the SoI as the set of all solutions of the equations

y_i = g_i(x),   i = 1, ..., n.

Each equation y_i = g_i(x), in general, can yield zero, one or more simple estimates. We also introduce the maximum likelihood (ML) SoI estimator x̂, as

x̂ ∈ arg max_x ℓ(x; y, g) = arg min_x V(x; y, g),

not necessarily unique.

Let us use X to denote the support of the vector function g, i.e., the set of values of x for which g(x) is defined. We assume that there exists a partition {X_j}_{j=1}^{q} of X (i.e., X_1 ∪ ... ∪ X_q = X and X_j ∩ X_k = ∅ for j ≠ k) such that the subsets X_j are intervals in R and we can define functions g_{i,j}, i = 1, ..., n, j = 1, ..., q, as

g_{i,j}(x) = g_i(x) for x ∈ X_j,

i.e., g_{i,j} is the restriction of g_i to the interval X_j. We further assume that (a) every function g_{i,j} is invertible in X_j and (b) every function g_{i,j} is either convex in X_j or concave in X_j. Assumptions (a) and (b) together mean that, for every j and all x ∈ X_j, the first derivative of g_{i,j} is either strictly positive or strictly negative and the second derivative is either non-negative or non-positive. As a consequence, there are exactly n simple estimates (one per observation) in each subset of the partition, namely x̂_{i,j} = g_{i,j}^{-1}(y_i) for i = 1, ..., n. We write the set of simple estimates in X_j as S_j = {x̂_{1,j}, ..., x̂_{n,j}}. Due to the additivity of the noise in (Equation 1), if X_j is bounded there may be a non-negligible probability that y_i lies above the maximum (or below the minimum) of g_{i,j} over cl(X_j), where cl(X_j) denotes the closure of set X_j, hence x̂_{i,j} may not exist for some realization Y = y. In such a case, we define x̂_{i,j} as the extreme of X_j where g_{i,j} attains that maximum (respectively, that minimum), and admit it as a valid simple estimate.
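For illustration (borrowing the nonlinearities of the example quoted in the caption of Figure 3, with our own reading of the notation): take n = 2 observations with g_1(x) = exp(x) and g_2(x) = exp(−x). Both functions are invertible and convex over the whole real line, so a single-set partition suffices and the simple estimates follow at once,

X_1 = X = R,   x̂_{1,1} = g_{1,1}^{-1}(y_1) = log y_1,   x̂_{2,1} = g_{2,1}^{-1}(y_2) = −log y_2   (for y_1, y_2 > 0);

with y_1 = 5 and y_2 = 2 (the values used in Figure 3) this gives x̂_{1,1} ≈ 1.61 and x̂_{2,1} ≈ −0.69.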

4 Computation of upper bounds on the likelihood

4.1 Basic method

Let y be an arbitrary but fixed realization of the observation vector Y. Our goal is to obtain an analytical method for the computation of a scalar γ such that γ ≤ V(x; y, g) for all x. Hereafter, we omit the dependence on the observation vector from the notation whenever it is not needed. The main difficulty in carrying out this calculation is the nonlinearity g, which renders the problem not directly tractable. To circumvent this obstacle, we split the problem into q subproblems and address the computation of bounds γ_j for each set X_j, j = 1, ..., q, in the partition of X. Within X_j, we build adequate linear functions r_{i,j} in order to replace the nonlinearities g_{i,j}. We require that, for every j, the inequalities

|y_i − r_{i,j}(x)| ≤ |y_i − g_{i,j}(x)|   (Equation 8)

(y_i − r_{i,j}(x)) (y_i − g_{i,j}(x)) ≥ 0   (Equation 9)

hold jointly for all i = 1, ..., n, and all x ∈ I_j, where I_j is any closed interval in X_j such that x̂_j (i.e., any ML estimator of the SoI restricted to X_j, possibly non-unique) is contained in I_j. The latter requirement can be fulfilled if we choose I_j = [min(S_j), max(S_j)] (see the Appendix for a proof).
If (Equation 8) and (Equation 9) hold, we can write

V̄_i(y_i − r_{i,j}(x)) ≤ V̄_i(y_i − g_{i,j}(x)),   ∀x ∈ I_j, i = 1, ..., n,   (Equation 10)

which follows easily from the properties (P1) and (P2) of the marginal potential functions as described in Section 2.2. Moreover, since V(x; y, g_j) = Σ_{i=1}^{n} V̄_i(y_i − g_{i,j}(x)) and V(x; y, r_j) = Σ_{i=1}^{n} V̄_i(y_i − r_{i,j}(x)) (this function will be subsequently referred to as the modified system potential), where g_j = (g_{1,j}, ..., g_{n,j}) and r_j = (r_{1,j}, ..., r_{n,j}), Eq. (Equation 10) implies that V(x; y, r_j) ≤ V(x; y, g_j), ∀x ∈ I_j, and, as a consequence,

min_{x∈I_j} V(x; y, r_j) ≤ V(x; y, g_j),   ∀x ∈ I_j.

Therefore, it is possible to find a lower bound in I_j for the system potential V(x; y, g_j), denoted γ_j, by minimizing the modified potential V(x; y, r_j) in I_j.

All that remains is to actually build the linearities r_{i,j}. This construction is straightforward and can be described graphically by splitting the problem into two cases. Case 1 corresponds to nonlinearities such that (dg_{i,j}/dx)(d²g_{i,j}/dx²) ≥ 0 (i.e., g_{i,j} is either increasing and convex or decreasing and concave), while case 2 corresponds to functions that comply with (dg_{i,j}/dx)(d²g_{i,j}/dx²) ≤ 0 (i.e., g_{i,j} is either increasing and concave or decreasing and convex), for x ∈ I_j.

Figure 2 (a)-(b) depicts the construction of r_{i,j} in case 1. We choose the linear function that connects one extreme of the interval I_j, i.e., the point (x_e, g_{i,j}(x_e)), and the point corresponding to the simple estimate, (x̂_{i,j}, y_i). In the figure, d_r and d_g denote the distances |y_i − r_{i,j}(x)| and |y_i − g_{i,j}(x)|, respectively. It is apparent that d_r ≤ d_g for all x ∈ I_j, hence inequality (Equation 8) is granted. Inequality (Equation 9) also holds for all x ∈ I_j, since r_{i,j}(x) and g_{i,j}(x) are either simultaneously greater than (or equal to) y_i, or simultaneously lesser than (or equal to) y_i.

Figure 2 (c)-(d) depicts the construction of r_{i,j} in case 2. We choose the linear function that connects the other extreme of the interval I_j and the point corresponding to the simple estimate, (x̂_{i,j}, y_i). Again, d_r and d_g denote the distances |y_i − r_{i,j}(x)| and |y_i − g_{i,j}(x)|, respectively. It is apparent from the two plots that inequalities (Equation 8) and (Equation 9) hold for all x ∈ I_j.

A special subcase of case 1 (respectively, of case 2) occurs when the simple estimate x̂_{i,j} coincides with the extreme of I_j used in the construction. Then, r_{i,j} is the tangent to g_{i,j} at x̂_{i,j}. If x̂_{i,j} = ±∞ then r_{i,j} is a horizontal asymptote of g_{i,j}.
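A minimal Python sketch of one way to realize this construction (our own reading of the description above, assuming the simple estimate lies in the interior of I_j and g_{i,j} is differentiable): the secant is anchored at the lower extreme of I_j in case 1 and at the upper extreme in case 2, and the resulting line can be checked numerically against (Equation 8) and (Equation 9) on I_j.

import numpy as np

def build_linearity(g, x_hat, y_i, a, b, case):
    # Secant line r(x) through the simple-estimate point (x_hat, y_i) and one
    # extreme of I_j = [a, b]: the lower extreme in case 1 (g'*g'' >= 0) and the
    # upper extreme in case 2 (g'*g'' <= 0). Degenerate situations (x_hat equal
    # to the chosen extreme, or x_hat = +-inf) are not handled in this sketch.
    x0 = a if case == 1 else b
    slope = (y_i - g(x0)) / (x_hat - x0)
    return lambda x: y_i + slope * (x - x_hat)

# Quick numerical check of (Equation 8) and (Equation 9) for g increasing and convex.
g = np.exp
x_hat, a, b = 2.0, 1.0, 3.0
y_i = g(x_hat)
r = build_linearity(g, x_hat, y_i, a, b, case=1)
x = np.linspace(a, b, 201)
assert np.all(np.abs(y_i - r(x)) <= np.abs(y_i - g(x)) + 1e-12)   # (Equation 8)
assert np.all((y_i - r(x)) * (y_i - g(x)) >= -1e-12)              # (Equation 9)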

It is often possible to find γ_j in closed form. If we choose γ = min{γ_1, ..., γ_q}, then γ is a global lower bound of the system potential. Table 1 shows an outline of the proposed method, which will be subsequently referred to as bounding method 1 (BM1) for conciseness.

Figure 2: Construction of the auxiliary linearities {r_{i,j}}_{i=1}^{n}. We indicate d_r = |y_i − r_{i,j}(x)| and d_g = |y_i − g_{i,j}(x)|, respectively. It is apparent that d_r ≤ d_g and that r_{i,j}(x) and g_{i,j}(x) are either simultaneously greater than (or equal to) y_i, or simultaneously lesser than (or equal to) y_i, for all x ∈ I_j. Hence, the inequalities (Equation 8) and (Equation 9) are satisfied ∀x ∈ I_j. (a) Function g_{i,j} is increasing and convex (case 1). (b) Function g_{i,j} is decreasing and concave (case 1). (c) Function g_{i,j} is decreasing and convex (case 2). (d) Function g_{i,j} is increasing and concave (case 2).
Table 1: Bounding Method 1.
1. Find a partition {X_j}_{j=1}^{q} of the space of the SoI.
2. Compute the simple estimates x̂_{i,j} ∈ S_j for each X_j.
3. Calculate I_j = [min(S_j), max(S_j)] and build r_{i,j}, for i = 1, ..., n and j = 1, ..., q.
4. Replace g_j with r_j, and minimize V(x; y, r_j) in I_j to find the lower bound γ_j.
5. Find γ = min{γ_1, ..., γ_q}.

4.2 Iterative Implementation

The quality of the bound γ_j depends, to a large extent, on the length of the interval I_j, denoted |I_j|. This is clear if we think of r_{i,j} as a linear approximation on I_j of the nonlinearity g_{i,j}. Since we have assumed that g_{i,j} is continuous and bounded in I_j, the procedure to build r_{i,j} in BM1 implies that

|g_{i,j}(x) − r_{i,j}(x)| → 0 as |I_j| → 0,

for all x ∈ I_j. Therefore, if we consider intervals which are shorter and shorter, then the modified potential function V(x; y, r_j) will be closer and closer to the true potential function V(x; y, g_j), and hence the bound γ_j will be tighter.

The latter observation suggests a procedure to improve the bound γ_j. Indeed, let us subdivide X_j into subintervals denoted I_{v,v+1} = [s_v, s_{v+1}], where s_1 < s_2 < ... < s_{k+1}. We refer to the elements in the collection S_{j,k} = {s_1, ..., s_{k+1}}, with s_1 = min(X_j) and s_{k+1} = max(X_j), as support points in the interval X_j. We can build linear functions r^{(v)}_{i,j} for every subinterval I_{v,v+1}, using the procedure described in Section 4.1. We recall that this procedure is graphically depicted in Figure 2, where we simply need to

  • substitute I_j by I_{v,v+1}, and

  • when the simple estimate x̂_{i,j} ∉ I_{v,v+1}, substitute x̂_{i,j} by s_{v+1} (by s_v) if x̂_{i,j} > s_{v+1} (if x̂_{i,j} < s_v, respectively).

Using r^{(v)}_j = (r^{(v)}_{1,j}, ..., r^{(v)}_{n,j}) we compute a bound γ^{(v)}_j = min_{x∈I_{v,v+1}} V(x; y, r^{(v)}_j) for each subinterval, and then select γ_{j,k} = min_v γ^{(v)}_j. Note that the subscript k in γ_{j,k} grows with the number of support points used to compute the bound (which becomes tighter as k increases). Moreover, if we take a new (arbitrary) support point s* from the subinterval I_{v*,v*+1} that contains the latest bound, and extend the set of support points with it, S_{j,k+1} = S_{j,k} ∪ {s*}, then we can iterate the proposed procedure and obtain a refined version of the bound, denoted γ_{j,k+1}.

The proposed iterative algorithm is described in detail in Table 2. Note that k is an iteration index that makes explicit the number of support points in S_{j,k}. If we plug this iterative procedure for the computation of γ_j into BM1 (specifically, replacing steps 3 and 4 of Table 1), we obtain a new technique that we will hereafter term bounding method 2 (BM2).

As an illustration, Figure 3 shows four steps of the iterative algorithm. In Figure 3 (a) there are two support points, S_{j,1} = {min(X_j), max(X_j)}, which yield a single interval. In Figure 3 (b)-(d), we successively add a point s* chosen in the subinterval I_{v*,v*+1} that contains the latest bound. In this example, the point s* is chosen deterministically as the mean of the extremes of the subinterval I_{v*,v*+1}.

Table 2: Iterative algorithm to improve γ_j.
1. Start with k = 1 and S_{j,1} = {s_1, s_2}. Let s_1 = min(X_j) and s_2 = max(X_j).
2. Choose an arbitrary interior point s* in I_{v*,v*+1}, and update the set of support points S_{j,k+1} = S_{j,k} ∪ {s*}.
3. Sort S_{j,k+1} in ascending order, so that s_1 < s_2 < ... < s_m, where s_1 = min(X_j), s_m = max(X_j),
and m is the number of elements of S_{j,k+1}.
4. Build r^{(v)}_{i,j} for each interval I_{v,v+1} = [s_v, s_{v+1}], with v = 1, ..., m − 1.
5. Find γ^{(v)}_j = min_{x∈I_{v,v+1}} V(x; y, r^{(v)}_j), for v = 1, ..., m − 1.
6. Set the refined bound γ_{j,k+1} = min_v γ^{(v)}_j, and set k = k + 1.
7. To iterate, go back to step 2.
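The following Python sketch mirrors Table 2 at a high level (it is ours, not the paper's code). The helper subinterval_bound(lo, hi), assumed to be supplied by the user, should return a lower bound of the system potential valid on [lo, hi], e.g., the minimum of the modified potential built on that subinterval as in Sections 4.1 and 4.3; the new support point is taken as the midpoint, as in Figure 3.

import numpy as np

def refine_bound(subinterval_bound, s_min, s_max, n_iter):
    # Iterative refinement of the lower bound gamma_j (sketch of Table 2).
    support = [s_min, s_max]                                     # step 1: two support points
    for _ in range(n_iter):
        bounds = [subinterval_bound(lo, hi)
                  for lo, hi in zip(support[:-1], support[1:])]  # steps 4-5
        v_star = int(np.argmin(bounds))                          # subinterval holding the bound
        s_new = 0.5 * (support[v_star] + support[v_star + 1])    # step 2: midpoint, as in Figure 3
        support.insert(v_star + 1, s_new)                        # step 3: support stays sorted
    bounds = [subinterval_bound(lo, hi) for lo, hi in zip(support[:-1], support[1:])]
    return min(bounds), support                                  # step 6: refined bound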
Figure 3: Four steps of the iterative algorithm choosing s* as the middle point of the subinterval I_{v*,v*+1}. The solid line shows the system potential V(x; y, g) = (y_1 − exp(x))^2 − log(y_2 − exp(−x) + 1) + (y_2 − exp(−x)) + 1 (see the example in Section 7.1), with y_1 = 5 and y_2 = 2, while the dashed line shows the modified potential V(x; y, r_j). We start in plot (a) with two points, S_{j,1} = {min(X_j), max(X_j)}. At each iteration, we add a new point chosen in the subinterval I_{v*,v*+1} that contains the latest bound. It is apparent that V(x; y, r_j) becomes a better approximation of V(x; y, g_j) each time we add a new support point.

4.3 Lower bound for quadratic potentials

Assume that the joint potential is quadratic, i.e., V̄_i(ϑ_i) = ϑ_i^2 for each i = 1, ..., n, and construct the set of linearities r_{i,j}(x) = a_{i,j} x + b_{i,j}, for i = 1, ..., n and j = 1, ..., q. The modified system potential in I_j becomes

V(x; y, r_j) = Σ_{i=1}^{n} (y_i − a_{i,j} x − b_{i,j})^2,

and it turns out straightforward to compute γ_j. Indeed, if we denote A_j = Σ_{i=1}^{n} a_{i,j}^2 and B_j = Σ_{i=1}^{n} a_{i,j}(y_i − b_{i,j}), then we can readily obtain the minimizer

x*_j = B_j / A_j,

and γ_j = V(x*_j; y, r_j) = Σ_{i=1}^{n} (y_i − a_{i,j} x*_j − b_{i,j})^2. It is apparent that γ_j ≤ V(x; y, g_j) for all x ∈ I_j. Furthermore, x*_j is an approximation of the ML estimator restricted to X_j.
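In Python, using the slope/intercept notation of the previous paragraph (the symbols a_i, b_i are ours), the minimizer and the bound read as follows; note that the unconstrained minimizer already yields a valid (if slightly conservative) lower bound even when it falls outside I_j.

import numpy as np

def quadratic_bound(a, b, y):
    # Closed-form minimization of V(x; y, r_j) = sum_i (y_i - a_i*x - b_i)^2.
    a, b, y = map(np.asarray, (a, b, y))
    x_star = np.sum(a * (y - b)) / np.sum(a ** 2)   # least-squares minimizer x*_j
    gamma_j = np.sum((y - a * x_star - b) ** 2)     # lower bound of the modified potential
    return x_star, gamma_j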

4.4 Adaptation of the bound for generic system potentials

If the joint potential is not quadratic, in general it can still be difficult to minimize the modified function V(x; y, r_j), despite the replacement of the nonlinearities g_{i,j} with the linear functions r_{i,j}. In this section, we propose a method to transform the bound obtained for a quadratic potential, denoted γ_quad, into a bound for some other, non-quadratic, potential function.

Consider an arbitrary joint potential V̄ and assume the availability of an invertible increasing function R such that R ∘ V̄_2 ≤ V̄, where ∘ denotes the composition of functions. Then, for the system potential we can write

V(x; y, g) = V̄(y_1 − g_1(x), ..., y_n − g_n(x)) ≥ R(V̄_2(y_1 − g_1(x), ..., y_n − g_n(x)))

and, as a consequence, V(x; y, g) ≥ R(γ_quad), since R is increasing and V̄_2(y_1 − g_1(x), ..., y_n − g_n(x)) ≥ γ_quad. Hence R(γ_quad) is a lower bound for the non-quadratic system potential constructed from V̄.

For instance, consider the family of joint potentials V̄_p(ϑ_1, ..., ϑ_n) = Σ_{i=1}^{n} |ϑ_i|^p. Using the monotonicity of the ℓ_p norms, it is possible to prove [16] that

(Σ_{i=1}^{n} |ϑ_i|^p)^{1/p} ≥ (Σ_{i=1}^{n} ϑ_i^2)^{1/2},   for 0 < p ≤ 2,   (Equation 13)

(Σ_{i=1}^{n} |ϑ_i|^p)^{1/p} ≥ n^{1/p − 1/2} (Σ_{i=1}^{n} ϑ_i^2)^{1/2},   for p ≥ 2.   (Equation 14)

Let R(v) = v^{p/2}. Since this function is, indeed, strictly increasing, we can transform the inequality (Equation 13) into

Σ_{i=1}^{n} |ϑ_i|^p ≥ (Σ_{i=1}^{n} ϑ_i^2)^{p/2},

which yields

V̄_p ≥ R ∘ V̄_2,

hence the transformation R(γ_quad) = γ_quad^{p/2} of the quadratic bound is a lower bound for the system potential constructed from V̄_p with 0 < p ≤ 2. Similarly, if we let R(v) = n^{1 − p/2} v^{p/2}, the inequality (Equation 14) yields

Σ_{i=1}^{n} |ϑ_i|^p ≥ n^{1 − p/2} (Σ_{i=1}^{n} ϑ_i^2)^{p/2},

hence the transformation R(γ_quad) = n^{1 − p/2} γ_quad^{p/2} is a lower bound for the system potential when p ≥ 2.

It is possible to devise a systematic procedure to find a suitable function R given an arbitrary joint potential V̄(ϑ), where ϑ = (ϑ_1, ..., ϑ_n). Let us define the manifold Γ_v = {ϑ : V̄(ϑ) = v}. We can construct R^{-1} by assigning to each value v the maximum of the quadratic potential when ϑ ∈ Γ_v, i.e., we define

R^{-1}(v) = max_{ϑ∈Γ_v} V̄_2(ϑ).   (Equation 17)

Note that (Equation 17) is a constrained optimization problem that can be solved using, e.g., Lagrange multipliers.

From the definition in (Equation 17) we obtain that, for all ϑ ∈ Γ_v, V̄_2(ϑ) ≤ R^{-1}(v). In particular, since v = V̄(ϑ) from the definition of Γ_v, we obtain the desired relationship,

V̄_2(ϑ) ≤ R^{-1}(V̄(ϑ)),   i.e.,   R(V̄_2(ϑ)) ≤ V̄(ϑ).

We additionally need to check whether R^{-1}(v) is a strictly increasing function of v. The two functions R in the earlier examples of this section, R(v) = v^{p/2} and R(v) = n^{1 − p/2} v^{p/2}, can be readily found using this method.
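As a worked instance of (Equation 17) (our own derivation, consistent with the two functions quoted above), take V̄(ϑ) = Σ_{i=1}^{n} |ϑ_i|^p:

• for 0 < p ≤ 2, the maximum of Σ_i ϑ_i^2 subject to Σ_i |ϑ_i|^p = v is attained when all the mass sits in a single coordinate, ϑ = (v^{1/p}, 0, ..., 0), so R^{-1}(v) = v^{2/p} and R(v) = v^{p/2};

• for p ≥ 2, the maximum is attained at the symmetric point |ϑ_1| = ... = |ϑ_n| = (v/n)^{1/p}, so R^{-1}(v) = n^{1 − 2/p} v^{2/p} and R(v) = n^{1 − p/2} v^{p/2}.

Both choices of R^{-1} are strictly increasing in v, as required. A quick numeric check with n = 2, p = 1 and ϑ = (3, 4): V̄_1(ϑ) = 7, while R(V̄_2(ϑ)) = (3^2 + 4^2)^{1/2} = 5 ≤ 7.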

4.5 Convex marginal potentials

Assume that we have already found the interval I_j and the linearities r_{i,j}, i = 1, ..., n, using the technique in Section 4.1. If a marginal potential V̄_i is convex, the function V̄_i(y_i − r_{i,j}(x)) is also convex in x. Indeed, for all