An n-dimensional Rosenbrock Distribution for MCMC Testing


Filippo Pagani    Martin Wiegand    Saralees Nadarajah
The University of Manchester, School of Mathematics
September 20, 2019
Abstract

The Rosenbrock function is a ubiquitous benchmark problem for numerical optimisation, and variants have been proposed to test the performance of Markov Chain Monte Carlo algorithms. In this work we discuss the two-dimensional Rosenbrock density, its current n-dimensional extensions, and their advantages and limitations. We then propose our own extension to arbitrary dimensions, which is engineered to preserve the key features of the density – such as the curved correlation structure – and is analytically tractable. We conclude with numerical experiments that show how a naively tuned Random Walk fails to give a representative sample in reasonable time, and how even a Simplified Manifold MALA algorithm struggles on this target.

1 Introduction

The Rosenbrock function is a ubiquitous test problem in the optimisation literature [12], due to its challenging features: its minimum is located at the bottom of a narrow parabolic valley, where a small change in direction can lead to a steep increase of the gradient. The Rosenbrock function can be turned into a probability density that maintains the same basic features of the original problem, and it has been adopted by the Markov Chain Monte Carlo (MCMC) community as a benchmark problem for testing MCMC algorithms.

The first MCMC method (Random Walk Metropolis Hastings) dates back to the 1950s, when the scientists working on the Manhattan Project in Los Alamos used it to simulate systems of particles [7]. After that, MCMC remained largely confined to the physics literature until the 1990s, when [3] popularised it in the statistics community. That spurred a new wave of research efforts that yielded advanced algorithms such as MALA [11], Reversible Jump MCMC [6], HMC [2] (HMC was actually developed in the 1980s in the physics literature, but its properties have been studied extensively only recently) and others [10].

One of the current frontiers of research in this field is developing algorithms – e.g. [4, 9] – that can sample efficiently from densities with non-constant correlation structure, where it is difficult to move efficiently from one region of the density to an entirely different region. Researchers developing new methods for this class of problems often test their algorithms on a handful of test models, amongst which the (2-dimensional) Rosenbrock density is quite popular. However, few properties of the density have been investigated and formalised, especially regarding multivariate extensions for the purpose of MCMC sampling.

In section 2 we review the current state of knowledge on the 2-d Rosenbrock and the available n-dimensional extensions. In section 3 we show how to calculate the normalisation constant for the 2-d case, and why it cannot be calculated in the same way for the Full Rosenbrock kernel. In section 4 we present our n-dimensional extension, and we discuss how it improves on the limits of the current solutions. In section 5 we show how well some popular MCMC algorithms explore the new distribution.

2 Current Literature

2.1 The 2-d Rosenbrock Distribution

The simplest non-trivial case of the Rosenbrock density is the 2-d case, which can be written as

$$\pi(x_1, x_2) \propto \exp\left\{ -\left[ 100\,(x_2 - x_1^2)^2 + (1 - x_1)^2 \right] \right\} \quad (1)$$

We follow [5] in rescaling (1) by $1/20$, so that the distribution – shown in figure 1 – takes the shape of a curved narrow ridge, which is normally quite challenging for MCMC algorithms.

Figure 1: Two dimensional histogram of the Rosenbrock density as described in equation (1), with parameters $a = 1/20$, $b = 100/20$, $\mu = 1$ (in the notation of equation (6)).

It is not clear from the literature how the shape of the kernel in (1) is affected by changes in the parameters. Moreover, the normalisation constant is generally unknown, and there is no clear recipe to extend the distribution to more than two dimensions. Two methods have been proposed in the literature, and we review them below, pointing out their advantages and limitations.
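For reference, a minimal sketch (ours, not from the original paper) of the rescaled log-kernel in Python, with the parameter values used in figure 1:

```python
import numpy as np

def log_rosenbrock_2d(x1, x2, a=1/20, b=100/20, mu=1.0):
    """Unnormalised log-density of the rescaled 2-d Rosenbrock kernel;
    a, b, mu follow the parametrisation introduced in section 3."""
    return -a * (x1 - mu)**2 - b * (x2 - x1**2)**2
```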

2.2 Full Rosenbrock Distribution

What we call the Full Rosenbrock is an n-dimensional extension of the 2-d case, with the following kernel:

$$\pi(x) \propto \exp\left\{ -\sum_{i=1}^{n-1} \left[ 100\,(x_{i+1} - x_i^2)^2 + (1 - x_i)^2 \right] / 20 \right\} \quad (2)$$

with $x \in \mathbb{R}^n$. The normalisation constant is unknown. In three dimensions the kernel above can be written as

$$\pi(x_1, x_2, x_3) \propto \exp\left\{ -\left[ 100\,(x_2 - x_1^2)^2 + (1 - x_1)^2 + 100\,(x_3 - x_2^2)^2 + (1 - x_2)^2 \right] / 20 \right\} \quad (3)$$

with $x \in \mathbb{R}^3$. Running a sMMALA algorithm on equation (3) with the SoftAbs metric, $\alpha = 10^6$, for 2 million samples, with starting point $(1, 1, 1)$, yields the joint plots shown in figure 2.
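A minimal sketch of this target (our code, with the $1/20$ rescaling assumed throughout), for arbitrary dimension $n$:

```python
import numpy as np

def log_full_rosenbrock(x):
    """Unnormalised log-density of the Full Rosenbrock kernel (2),
    for x of any length n >= 2, with the 1/20 rescaling."""
    x = np.asarray(x, dtype=float)
    return -np.sum(100.0 * (x[1:] - x[:-1]**2)**2 + (1.0 - x[:-1])**2) / 20.0
```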

Figure 2: Histogram of a 3-d Full Rosenbrock density, as described in equation (3), obtained via sMMALA. The numbers in the upper panels show the correlations. The mass is much more concentrated around the mode than expected.

It is easy to see that the joint plot between $x_1$ and $x_2$ has changed significantly compared to figure 1: just looking at the scales of the graph, it is easy to see that the tails are now much thinner. The distribution has changed from a long narrow ridge to a shape much more concentrated around the mode. This is a serious drawback when the density has been tuned to pose an adequate challenge to the algorithms being studied.

This distribution does have some desirable features: as each new variable is the square of the previous one, the scale of each new dimension added increases quite steeply. Targets with wildly different scales – e.g. Neal’s Normal [8] – are known to pose a challenge to MCMC algorithms.

However, as new dimensions are added, their variance grows very quickly (depending on the choice of parameters), as is reasonable to expect when quantities are repeatedly squared. With variances growing that fast, numerical computations may not be stable beyond a certain number of dimensions.

Furthermore, as $i$ increases, the joint distribution of $x_i$ and $x_{i+1}$ becomes progressively flatter around the mode, and progressively steeper further away from the origin. For the purpose of MCMC sampling this is not a very challenging shape, as most of the mass is concentrated in a flat region of constant correlation structure. This shortcoming can be addressed by changing the parameter $\mu$, but only to some extent: it is hard to find a single value of $\mu$ such that the mode of each marginal is not too far from a region where the correlation changes.

These considerations suggest that it may not be advisable to extend this distribution beyond a certain number of dimensions.

2.3 Even Rosenbrock Distribution

In the optimisation literature, [1] proposes a simpler version of the Full Rosenbrock function of section 2.2, which can easily be turned into a kernel:

$$\pi(x) \propto \exp\left\{ -\sum_{i=1}^{n/2} \left[ 100\,(x_{2i-1}^2 - x_{2i})^2 + (x_{2i-1} - 1)^2 \right] / 20 \right\} \quad (4)$$

maintaining the $1/20$ scaling, with $x \in \mathbb{R}^n$, where $n$ must be an even number. The normalisation constant is unknown. This density can be described as a sequence of 2-d Rosenbrock densities which are pairwise independent. With $n = 4$, equation (4) reduces to

$$\pi(x_1, \dots, x_4) \propto \exp\left\{ -\left[ 100\,(x_1^2 - x_2)^2 + (x_1 - 1)^2 + 100\,(x_3^2 - x_4)^2 + (x_3 - 1)^2 \right] / 20 \right\} \quad (5)$$

Sampling from (5) with sMMALA (tuned exactly as in the previous section, for 2 million samples), the result is shown in figure 3.
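A minimal sketch of this kernel (ours); note how the sum decouples into independent 2-d blocks:

```python
import numpy as np

def log_even_rosenbrock(x):
    """Unnormalised log-density of the Even Rosenbrock kernel (4);
    len(x) must be even, and each pair (x_{2i-1}, x_{2i}) forms an
    independent 2-d Rosenbrock block."""
    x = np.asarray(x, dtype=float)
    odd, even = x[0::2], x[1::2]  # x_{2i-1} and x_{2i} in the 1-based notation
    return -np.sum(100.0 * (odd**2 - even)**2 + (odd - 1.0)**2) / 20.0
```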

Figure 3: Histogram of a 4-d Even Rosenbrock density, as described in equation (4). The numbers in the upper part of the matrix represent correlations. Most of the joint distributions are uncorrelated.

This is clearly a simpler problem than equation (3), with fewer dependencies, as the round shapes and lack of ridges in the lower left corner of figure 3 confirm. This distribution – unlike the Full Rosenbrock – does maintain the shape of the joint distributions as dimensions are added, and by construction the variance remains stable. However, only a small fraction of the joint distributions are curved narrow ridges: the majority are uncorrelated and, unlike in the Full Rosenbrock, they all have similar scales. That significantly lowers the difficulty of the problem.

3 Normalisation Constant and Interpretation of the Parameters

The normalisation constant is unknown in all the examples seen in section 2. However, we found that we can rewrite the 2-d Rosenbrock function from equation (1) as

$$\pi(x_1, x_2) \propto \exp\left\{ -a\,(x_1 - \mu)^2 - b\,(x_2 - x_1^2)^2 \right\} \quad (6)$$
$$= \exp\left\{ -a\,(x_1 - \mu)^2 \right\} \exp\left\{ -b\,(x_2 - x_1^2)^2 \right\} \quad (7)$$

with $a, b > 0$ and $\mu \in \mathbb{R}$; in our specific case, $a = 1/20$, $b = 100/20$, $\mu = 1$. That should make it easy to see that the density is composed of two normal kernels, i.e. $\pi(x_1, x_2) = \pi(x_1)\,\pi(x_2 \mid x_1)$, where
$$X_1 \sim N\!\left(\mu, \tfrac{1}{2a}\right), \qquad X_2 \mid X_1 \sim N\!\left(X_1^2, \tfrac{1}{2b}\right).$$

As we know how to integrate the normal kernels in (6), this provides us with a viable method to calculate the normalisation constant.

Lemma 1.

The 2-d Rosenbrock kernel shown in (1) and (6) integrates to $\pi / \sqrt{ab}$.

Proof.

We begin by integrating equation (6) over the domain $\mathbb{R}^2$:
$$\int_{\mathbb{R}} \int_{\mathbb{R}} \exp\left\{ -a(x_1 - \mu)^2 - b(x_2 - x_1^2)^2 \right\} dx_2\, dx_1$$
$$= \int_{\mathbb{R}} \exp\left\{ -a(x_1 - \mu)^2 \right\} \left[ \int_{\mathbb{R}} \exp\left\{ -b(x_2 - x_1^2)^2 \right\} dx_2 \right] dx_1$$
(using Fubini's theorem to exchange the order of integration)
$$= \int_{\mathbb{R}} \exp\left\{ -a(x_1 - \mu)^2 \right\} \sqrt{\frac{\pi}{b}}\, dx_1$$
(changing variables in the second integral to $y = x_2 - x_1^2$, so that $dy = dx_2$ and the inner integral is a standard Gaussian integral)
$$= \sqrt{\frac{\pi}{b}}\, \sqrt{\frac{\pi}{a}} = \frac{\pi}{\sqrt{ab}}. \qquad \blacksquare$$
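Lemma 1 is easy to check numerically; a sketch (ours) using quadrature over a box that contains essentially all of the mass:

```python
import numpy as np
from scipy.integrate import dblquad

a, b, mu = 1/20, 100/20, 1.0

# Integrate the unnormalised kernel of equation (6); the box is wide enough
# because sd(X1) = sqrt(1/(2a)) ~ 3.2 and X2 stays close to X1^2.
val, _ = dblquad(
    lambda x2, x1: np.exp(-a * (x1 - mu)**2 - b * (x2 - x1**2)**2),
    -20, 20,                         # outer integral over x1
    lambda x1: -50, lambda x1: 450,  # inner integral over x2
)
print(val, np.pi / np.sqrt(a * b))   # both should be close to 2*pi
```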

Formulating the problem in this particular way also provides us with a simple interpretation of the coefficients: $a$ is the precision of the first normal kernel in the $x_1$ dimension, while $b$ is the precision of the second normal kernel, which is concentrated in the $x_2$ dimension around the manifold given by the parabola $x_2 = x_1^2$. Therefore, increasing $a$ increases the slope of the distribution along the parabola, while increasing $b$ decreases the dispersion around the parabola. Naturally, the variances of the marginals will be slightly different, as the distribution has to be projected onto the corresponding axes.

Moreover, the structure of the 2-d problem is such that we can use the conditional structure of the random variables to split the joint density into conditional densities as factors, from which we may easily sample, and from which we can calculate Credibility Regions (CR) and QQ-plots.
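For example, a direct sample of the 2-d density can be drawn in two lines (a sketch, ours), using the two normal kernels of (6):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, mu = 1/20, 100/20, 1.0

# X1 ~ N(mu, 1/(2a)), then X2 | X1 ~ N(X1^2, 1/(2b))
x1 = rng.normal(mu, np.sqrt(1 / (2 * a)), size=1_000_000)
x2 = rng.normal(x1**2, np.sqrt(1 / (2 * b)))
# (x1, x2) is an exact sample, usable for credibility regions and QQ-plots
```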

Given all these desirable properties, it is highly convenient to find an n-dimensional extension of the 2-d Rosenbrock density that preserves this structure, together with the other features we want our test problem to have.

In light of all this, it is now easy to show why the Full Rosenbrock kernel changes shape as its dimension increases. Going back to equation (3), it can be rewritten as

$$\pi(x_1, x_2, x_3) \propto \exp\left\{ -a\,(x_1 - \mu)^2 - b\,(x_2 - x_1^2)^2 - a\,(x_2 - \mu)^2 - b\,(x_3 - x_2^2)^2 \right\} \quad (8)$$

it is possible to see that while the first and fourth terms are normal kernels in $x_1$ and $x_3$, and can easily be isolated, the kernel for $x_2$ is now composed of two terms. Expanding the squares in the second and third terms of equation (8) into a sum of monomials, and adding and subtracting terms, it is possible to obtain a single normal kernel for $x_2$ by completing the square:

$$b\,(x_2 - x_1^2)^2 + a\,(x_2 - \mu)^2 = (a + b) \left( x_2 - \frac{b\,x_1^2 + a\,\mu}{a + b} \right)^2 + \frac{ab}{a + b} \left( x_1^2 - \mu \right)^2 \quad (9)$$

which has to be substituted back into (8). The first term in (9) represents the new normal kernel for $x_2$, therefore

$$X_2 \mid X_1 \sim N\!\left( \frac{b\,X_1^2 + a\,\mu}{a + b},\; \frac{1}{2(a + b)} \right) \quad (10)$$

The other terms are remainder terms from the calculation that we cannot simply absorb into the normalisation constant, because they depend on $x_1$, which is an actual variable of the problem. This changes the distribution of $X_1$, which is not normal any more: $X_1$ now follows an unknown density from which we cannot easily sample directly. The variance of $X_2$ also decreases, as is easy to see from equation (10), going from $1/(2b)$ to $1/(2(a+b))$. For $n > 3$, the same changes happen to all the variables but the very last one, $x_n$.
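The completed-square identity (9) can be verified symbolically; a short check (ours) with sympy:

```python
import sympy as sp

x1, x2, mu, a, b = sp.symbols('x1 x2 mu a b', positive=True)

lhs = b * (x2 - x1**2)**2 + a * (x2 - mu)**2
rhs = ((a + b) * (x2 - (b * x1**2 + a * mu) / (a + b))**2
       + a * b / (a + b) * (x1**2 - mu)**2)

print(sp.simplify(lhs - rhs))  # prints 0, confirming identity (9)
```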

4 Hybrid Rosenbrock Distribution

The ultimate goal of this paper is to find an n-dimensional distribution that possesses four main properties. Firstly, it must have the curved correlation structure of a 2-d Rosenbrock distribution. Secondly, it must be extendable to higher dimensions while preserving that correlation structure. Thirdly, we want to maintain the difference in scales of the variables. All these properties are desirable, as together they present a very good challenge to most MCMC algorithms. Lastly, the problem must be analytically tractable, and allow for direct sampling. That is of paramount importance when testing new MCMC algorithms: if the shape of the test problem is unknown – which is often the case in high dimension – it may be very difficult to assess whether the algorithm is working correctly or not. The Hybrid Rosenbrock density fulfills all of these criteria, providing a problem where every single joint distribution has a challenging structure. The Hybrid Rosenbrock can be written as

$$\pi(x) \propto \exp\left\{ -a\,(x_1 - \mu)^2 - \sum_{j=1}^{n_2} \sum_{i=2}^{n_1} b_{j,i} \left( x_{j,i} - x_{j,i-1}^2 \right)^2 \right\} \quad (11)$$

where $x_{j,1} := x_1$ for all $j$; $a, b_{j,i} > 0$, $\mu \in \mathbb{R}$, and the total dimension is $n = (n_1 - 1)\,n_2 + 1$. This equation is quite general: taking $n_1 = 2$ and $n_2 = 1$, it reduces to the 2-d Rosenbrock. Taking $n_1 = n$ and $n_2 = 1$, it reduces to

$$\pi(x) \propto \exp\left\{ -a\,(x_1 - \mu)^2 - \sum_{i=2}^{n} b_i \left( x_i - x_{i-1}^2 \right)^2 \right\} \quad (12)$$

which is a distribution not dissimilar from the Full Rosenbrock of equation (8): its variance grows very quickly as the dimension increases, but the original shape is maintained, and the conditional normal structure is preserved. That is because we omitted the extra kernel that was causing problems, as explained in section 3. Taking $n_1 = 2$, this distribution looks like what it would be reasonable to expect from a multivariate version of the 2-d Rosenbrock density: every joint plot is either a parabolic ridge or a straight ridge, and the scales of the variables remain the same as in the 2-d case. We can go further: taking $n_1 = 3$ and $n_2 = 2$, the result in $n = 5$ dimensions can be seen in figure 4. For simplicity, we rename the variables $x_1, \dots, x_5$, and the kernel is

$$\pi(x_1, \dots, x_5) \propto \exp\left\{ -a\,(x_1 - \mu)^2 - b_2\,(x_2 - x_1^2)^2 - b_3\,(x_3 - x_2^2)^2 - b_4\,(x_4 - x_1^2)^2 - b_5\,(x_5 - x_4^2)^2 \right\} \quad (13)$$

where $(x_2, x_3)$ corresponds to the first block $(x_{1,2}, x_{1,3})$, $(x_4, x_5)$ to the second block $(x_{2,2}, x_{2,3})$, and $x_{1,1} = x_{2,1} = x_1$.

Figure 4: Histogram of a Hybrid Rosenbrock density as described in equation (11), with parameters $a = 1/20$, $b_{j,i} = 100/20$, $\mu = 1$, $n_1 = 3$, $n_2 = 2$, obtained via direct sampling. The numbers in the upper part of the graph represent correlations. Every joint distribution is either a straight or curved ridge.

This distribution is much more challenging for MCMC algorithms than any of the previous ones. The blocks are no longer independent, as every variable is correlated with every other through the initial variable $x_1$. That gives rise to straight and curved ridges with very long tails – see figure 4. However, all the kernels of the distribution are still conditionally or unconditionally normal, which allows us to calculate the normalisation constant almost as easily as before.

Lemma 2.

The integral of (11) is $\pi^{n/2} \Big/ \left( \sqrt{a}\, \prod_{j=1}^{n_2} \prod_{i=2}^{n_1} \sqrt{b_{j,i}} \right)$.

Proof.

The proof is similar to that of lemma 1, using the conditional structure of the density to split the integral into the normal kernels, and solving them one at a time. ∎

Using the conditional normal structure of the problem, it is possible to sample from the joint distribution directly using the following algorithm.

for $k = 1, \dots, K$ do
    draw $X_1^{(k)} \sim N\!\left(\mu, \frac{1}{2a}\right)$
    for $j = 1, \dots, n_2$ do
        set $X_{j,1}^{(k)} = X_1^{(k)}$
        for $i = 2, \dots, n_1$ do
            draw $X_{j,i}^{(k)} \sim N\!\left( \left(X_{j,i-1}^{(k)}\right)^2, \frac{1}{2\,b_{j,i}} \right)$
        end for
    end for
end for
return $\left\{ X^{(k)} \right\}_{k=1}^{K}$
Algorithm 1: Pseudo-code to sample from a Hybrid Rosenbrock
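A runnable version of Algorithm 1 (our sketch, vectorised over the $K$ samples; the default $b_{j,i} = 100/20$ matches figure 4):

```python
import numpy as np

def sample_hybrid_rosenbrock(K, n1, n2, a=1/20, b=100/20, mu=1.0, seed=0):
    """Direct sampling from the Hybrid Rosenbrock density (Algorithm 1).
    Returns an array of shape (K, n) with n = (n1 - 1) * n2 + 1; b may be
    a scalar or an (n2, n1 - 1) array of block precisions b_{j,i}."""
    rng = np.random.default_rng(seed)
    b = np.broadcast_to(np.asarray(b, dtype=float), (n2, n1 - 1))
    X = np.empty((K, (n1 - 1) * n2 + 1))
    X[:, 0] = rng.normal(mu, np.sqrt(1 / (2 * a)), size=K)  # X_1
    col = 1
    for j in range(n2):              # loop over blocks
        prev = X[:, 0]               # each block starts from X_1
        for i in range(n1 - 1):      # loop within the block
            X[:, col] = rng.normal(prev**2, np.sqrt(1 / (2 * b[j, i])))
            prev = X[:, col]
            col += 1
    return X

# The 5-d density of equation (13):
samples = sample_hybrid_rosenbrock(K=2_000_000, n1=3, n2=2)
```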

5 Numerical Tests

In this section we compare the quality of samples drawn from a Hybrid Rosenbrock density using a Random Walk Metropolis and a Simplified Manifold MALA algorithm against a direct sample from the distribution. We repeated the same experiments doubling the number of samples (pre and post thinning) for all the algorithms involved, and the results are overall very similar, leading us to believe that the Monte Carlo error in the direct sampling process is negligible.

Figure 5: Joint credibility regions for a Hybrid Rosenbrock density as described in equation (11), with parameters $a = 1/20$, $b_{j,i} = 100/20$, $\mu = 1$; $n_1 = 3$, $n_2 = 2$. The sMMALA algorithm performs adequately, while the Random Walk fails to converge in reasonable time.

Naturally, the difficulty of the problem affects MCMC convergence, especially for unsophisticated algorithms like the RW. Figure 5, and the top left plot of figure 6, show the joint credibility regions for pairs of variables from equation (13). These are all the unique joint distributions from figure 4, the rest being identical (save for randomness) to one of those already shown. The sMMALA algorithm was run with the SoftAbs metric, $\alpha = 10^6$, with the chain thinned down to 2 million samples. The RW was run with a spherical proposal covariance for 200 million iterations, thinned down to 2 million. The distribution was sampled directly for 2 million observations. The step size of each algorithm was chosen so that both achieve a near-optimal acceptance rate, i.e. around $0.234$ for the RW and around $0.574$ for sMMALA.
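For reproducibility, a minimal sketch (ours) of the Random Walk Metropolis baseline, targeting the 5-d kernel of equation (13); the step size shown is illustrative, not the tuned value:

```python
import numpy as np

def log_hybrid_5d(x, a=1/20, b=100/20, mu=1.0):
    """Unnormalised log-density of equation (13)."""
    x1, x2, x3, x4, x5 = x
    return -(a * (x1 - mu)**2 + b * (x2 - x1**2)**2 + b * (x3 - x2**2)**2
             + b * (x4 - x1**2)**2 + b * (x5 - x4**2)**2)

def rw_metropolis(logpi, x0, step, n_iter, seed=0):
    """Random Walk Metropolis with a spherical Gaussian proposal."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    lp = logpi(x)
    out = np.empty((n_iter, x.size))
    for t in range(n_iter):
        prop = x + step * rng.standard_normal(x.size)
        lp_prop = logpi(prop)
        if np.log(rng.random()) < lp_prop - lp:  # Metropolis accept/reject
            x, lp = prop, lp_prop
        out[t] = x
    return out

chain = rw_metropolis(log_hybrid_5d, x0=np.ones(5), step=0.5, n_iter=100_000)
```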

Figure 6: The top left plot shows the credibility region for two variables of a Hybrid Rosenbrock density as described in equation (11), with parameters $a = 1/20$, $b_{j,i} = 100/20$, $\mu = 1$; $n_1 = 3$, $n_2 = 2$. The bottom left, top right and bottom right plots show the QQ-plots for variables $x_1$, $x_2$ and $x_3$. Again, the sMMALA algorithm performs adequately, while the Random Walk fails to converge in reasonable time.

Looking at the top left plot in figure 5, it is possible to see some discrepancies between the credibility regions from direct sampling and from sMMALA. The sMMALA algorithm seems to reach further than necessary in the left tail, while falling short in the right tail. The same happens in the bottom left plot. In both plots on the right, sMMALA seems to capture the density correctly near zero (where the mode is, see figure 4), but once more performs poorly in the tail regions. However, in the top left plot of figure 6, sMMALA seems to spend slightly more time in the tail than it should. This shows how difficult this problem already is in only five dimensions: not even algorithms designed specifically for this class of densities can sample from it with ease.

The Random Walk Metropolis, instead, completely fails to give an accurate description of the shape of the distribution: it does not move past the region of high density at all, neither in figure 5 nor in the top left plot of figure 6, despite the very large original MCMC sample of 200 million. The QQ-plots in figure 6 give a more accurate picture of convergence in the tails. Only variables $x_2$ and $x_3$ are shown, as $x_4$ and $x_5$ are just copies of $x_2$ and $x_3$, save for randomness in the sampling process.

The bottom left plot of figure 6 shows the quantiles obtained from sMMALA and the RW against the quantiles based on a direct sample of the variable $x_1$. As before, it looks like sMMALA works appropriately while sampling near the mode, but as soon as we leave roughly the range of the credibility region shown in figure 5, the quantiles start to diverge: the farther away from the mode, the thinner the tails obtained from sMMALA look, compared to what a direct sample suggests. The same behaviour appears in the other two plots, for variables $x_2$ and $x_3$: the farther away from the mode, the thinner the tails become, relative to how they should be. This is due to the fact that the distribution is a narrow curved ridge, and to cover some ground in one direction the dynamics often has to cover a lot more ground in one or more other directions, making the exhaustive exploration of the tails quite difficult. The Random Walk, on the other hand, again shows severe problems in sampling from the density correctly for all the variables, as it never moves past the mode.

6 Conclusions

The (2-d) Rosenbrock distribution is a very common test problem in Markov Chain Monte Carlo, when testing algorithms on targets with curved correlation structure. However, as its normalisation constant is generally not known, it is often hard to assess the quality of the results. That is especially true in higher dimensions, as the optimisation literature provides multiple ways to extend the 2-d Rosenbrock function, but the shape and statistical properties of the resulting distributions are not well studied or documented. This may lead to confusion in the interpretation of the results, as it may seem that an algorithm is working appropriately while it is in fact completely neglecting to explore entire regions of the support.

In this paper we give the normalisation constant for the 2-d Rosenbrock density, using the fact that it can be divided into conditional normal kernels. That fact can also be used to obtain a direct sample from the density with ease. Furthermore, we showed that by carefully extending the 2-d distribution to $n$ dimensions, it is possible to obtain a test problem with some important features. Firstly, it is well defined, and its statistical properties can be derived by simple integration. Secondly, the problem has a very challenging correlation structure where all the joint distributions are straight or curved ridges. Thirdly, the variables have very different scales, a feature that adds an additional layer of difficulty to the problem. Lastly, it is possible to sample from the distribution directly with ease. We also showed the effect of the parameters on the shape of the distribution, which can be tweaked at pleasure, making the Hybrid Rosenbrock an excellent test problem for Markov Chain Monte Carlo algorithms, easily extendable to any desired dimension and degree of difficulty.

Finally, we showed how sophisticated algorithms like the Simplified Manifold MALA have to reduce their step size to very small values in order to maintain a reasonable acceptance ratio. Instead, basic algorithms like the Random Walk Metropolis struggle to the point of failing to explore the Hybrid Rosenbrock density appropriately in reasonable time, often in a way that is hard to detect if the test problem is not known in advance.

Acknowledgements

We want to thank Dr. Tim Waite for the productive discussions, and Dr. Simon Cotter for the many useful comments.

References

  • [1] L. C. W. Dixon and D. J. Mills. Effect of Rounding Errors on the Variable Metric Method. Journal of Optimization Theory and Applications, 80(1):175–179, Jan 1994.
  • [2] Simon Duane, A. D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
  • [3] Alan E. Gelfand and Adrian F. M. Smith. Sampling-based approaches to calculating marginal densities. J. Amer. Statist. Assoc., 85(410):398–409, 1990.
  • [4] Mark Girolami, Ben Calderhead, and Siu A. Chin. Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods. Journal of the Royal Statistical Society, Series B (Methodological), 2011.
  • [5] J. Goodman and J. Weare. Ensemble samplers with affine invariance. Communications in Applied Mathematics and Computational Science, 5(1):65–80, 2010.
  • [6] Peter J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732, 1995.
  • [7] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of State Calculations by Fast Computing Machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.
  • [8] Radford M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 54:113–162, 2010.
  • [9] Matthew Parno. Transport Maps for Accelerated Bayesian Inference. Ph.D. Thesis, MIT Computational Science and Engineering, 2014.
  • [10] Christian Robert and George Casella. A Short History of Markov Chain Monte Carlo: Subjective Recollections from Incomplete Data. Statist. Sci., 26(1):102–115, February 2011.
  • [11] Gareth O. Roberts and Jeffrey S. Rosenthal. Optimal Scaling of Discrete Approximations to Langevin Diffusions. J. R. Statist. Soc. B, 60:255–268, 1997.
  • [12] H. H. Rosenbrock. An Automatic Method for Finding the Greatest or Least Value of a Function. The Computer Journal, 3(3):175–184, 1960.