\fnmsCyril \snmChimisov\thanksreflabel=e1]K.Chimisov@warwick.ac.uk [    \fnmsKrzysztof \snmŁatuszyński\thanksreflabel=e2]K.G.Latuszynski@warwick.ac.uk [    \fnmsGareth O. \snmRoberts\thanksref label=e3]Gareth.O.Roberts@warwick.ac.uk [ University of Warwick\thanksmarkm1 Department of Statistics
University of Warwick
Coventry
CV4 7AL
United Kingdom
###### Abstract

The popularity of Adaptive MCMC has been fueled on the one hand by its success in applications, and on the other hand, by mathematically appealing and computationally straightforward optimisation criteria for the Metropolis algorithm acceptance rate (and, equivalently, proposal scale). Similarly principled and operational criteria for optimising the selection probabilities of the Random Scan Gibbs Sampler have not been devised to date.

In the present work we close this gap and develop a general purpose Adaptive Random Scan Gibbs Sampler that adapts the selection probabilities. The adaptation is guided by optimising the L_{2}-spectral gap for the target’s Gaussian analogue [2, 35], gradually, as target’s global covariance is learned by the sampler. The additional computational cost of the adaptation represents a small fraction of the total simulation effort.

We present a number of moderately- and high-dimensional examples, including truncated Gaussians, Bayesian Hierarchical Models and Hidden Markov Models, where significant computational gains are empirically observed for both, Adaptive Gibbs, and Adaptive Metropolis within Adaptive Gibbs version of the algorithm. We argue that Adaptive Random Scan Gibbs Samplers can be routinely implemented and substantial computational gains will be observed across many typical Gibbs sampling problems.

We shall give conditions under which ergodicity of the adaptive algorithms can be established.

\startlocaldefs\endlocaldefs\runtitle

{aug}

, and

## 1 Introduction

Markov Chain Monte Carlo (MCMC) methods is a powerful tool to estimate integrals \int_{\mathcal{X}}f(x)\pi({\rm d}x) of some function f with respect to (w.r.t.) some probability measure \pi on a measurable space \mathcal{X}.

The idea behind the MCMC technique is fairly simple. First, we need to construct a Markov kernel P that has \pi as its stationary distribution, i.e., \int_{\mathcal{X}}P(x,\cdot)\pi({\rm d}x)=\pi(A). Then, we run a Markov chain using the kernel P to obtain samples \{X_{i}\}_{i=1}^{n}, which can be used to estimate \int f{\rm d}\pi by the average \frac{1}{n}\sum_{i=1}^{n}f(X_{i}) (see, e.g., [29]).

While designing the kernel P is easy (e.g., one can come up with dozens of proposals in the Random Walk Metropolis (RWM) scheme), identifying kernels P for which \frac{1}{n}\sum_{i=1}^{n}f(X_{i}) does not converge excessively slowly is a hard problem.

Typically, the user has to choose a kernel from a parametrised family P_{\gamma}, \gamma\in\Gamma with a common target stationary distribution \pi. For example, \Gamma may represent a collection of proposals for the RWM algorithm or a set of selection probabilities (weights) for the Random Scan Gibbs Sampler (RSGS) that are used to decide which coordinate to update next.

A naive approach to find a good parameter \gamma would require the user to re-run the MCMC algorithm many times before a good Markov kernel candidate P_{\gamma} is found.

An alternative idea is to come up with an adaptation rule which changes the value of \gamma during the run of the Markov chain, gradually, as further information is acquired by the chain. This approach is known as Adaptive MCMC (AMCMC) algorithms and is very attractive in practice since it frees users from the cumbersome process of hand-tuning the parameters and potentially accelerates convergence to the target distribution. Formally, an AMCMC algorithm produces a chain X_{n} by repeating the following two steps.

[label*=(0)]
1. 1.

Sample X_{n+1} from P_{\gamma_{n}}\left(X_{n},\cdot\right);

2. 2.

Given \{X_{0},..,X_{n+1},\gamma_{0},..,\gamma_{n}\} update \gamma_{n+1} according to some adaptation rule.

After running an adaptive chain, we can use its output in the same way as if it were a usual MCMC chain in order to estimate \int f{\rm d}\pi. Note that the adaptive chain is not Markov in general, making its analysis particularly complicated.

First guidance on how to construct an ergodic AMCMC is proposed by Gilks et al. [22], where the authors allow any kind of adaptations to take place but only at the regeneration times of the underlying Markov chains. Unfortunately, the algorithm is inefficient in high dimensional settings since the regeneration rate deteriorates exponentially in dimension. More practical conditions are due to Roberts & Rosenthal [39] and are known as diminishing and containment conditions 1, 2, which we discuss in Section 7.

Even though one has theoretical results that help establish convergence of the adaptive algorithms, no less important challenge is to come up with an adaptation scheme for the Step 2 of the AMCMC algorithms. For the RWM, the adaptation rule is based on the approximation of the optimal Gaussian proposal that has been studied by [20, 34, 37, 9, 10]. The authors have noticed that in high dimensional spaces, the RWM behaves like a diffusion, so that one has to optimise the proposal variance in order to maximise convergence speed of the limiting diffusion. In dimensions d\geq 5, the optimal covariance matrix for the Gaussian proposal according to [34, 37] is \alpha\Sigma, where \Sigma is a d\times d covariance matrix of the target distribution and the scaling parameter \alpha>0 is chosen so that the average acceptance ratio of the algorithm is 0.234. In practice both \Sigma and \alpha are not known in advance but can be learned in the Step 2 of the AMCMC algorithm. Successful adaptive RWM algorithms have been proposed by [24, 40, 47]. Convergence properties of the Adaptive RWM have been extensively studied in the literature, e.g., [5, 6, 7, 44, 46, 47].

At the same time, the Random Scan Gibbs Sampler (RSGS) and Metropo-
lis-within-Gibbs (MwG) algorithms are very popular in practice. Recall that the RSGS at every iteration chooses a coordinate i with probability p_{i} and updates it from its full conditional distribution. If the full conditional distribution is expensive or impossible to sample from, then a proposal is generated for the direction i from some proposal distribution Q_{i}, followed by the Metropolis-Hastings acceptance/rejection procedure. The corresponding algorithm is called Metropolis-within-Gibbs.

Usually, uniform selection probabilities p_{i} are used, while we argue that this is often a sub-optimal strategy. To date, there is no guidance on the optimal choice of the selection probabilities as noticed in [28].

A possible solution is to use those probabilities that maximise the L_{2}-spectral gap (hereafter, spectral gap) of the corresponding algorithm. Of course, estimating the spectral gap is a challenging problem. On the other hand, if the target distribution is normal, then for the RSGS there is an explicit formula (4) for the spectral gap. Since (4) depends only on the correlation structure of the target distribution, the equation (4) may be optimised for an arbitrary target distribution resulting in some selection probabilities p^{\mathrm{opt}}, that we call pseudo-optimal. The corresponding value of (4) at p^{\mathrm{opt}} is the pseudo-spectral gap.

In Bayesian Analysis, by virtue of Bernstein-von Mises Theorem (see Section 10.2 of [45]), under certain conditions, given sufficient amount of observations, the posterior distribution is well approximated by an appropriate Gaussian. Thus if one applies the RSGS to sample from the posterior, the pseudo-optimal weights p_{i} might represent a good approximation to the true optimal weights that maximise the spectal gap. Interestingly, as we demonstrate by simulations in Section 8, even if the target distribution is discrete, the pseudo-optimal weights might still be advantageous over the uniform selection probabilities.

Since the pseudo-optimal selection probabilities are a function of the correlation structure of the target distribution, which is usually not known, and optimising the pseudo-spectral gap function (4) is a hard problem (see, e.g., [31]), we develop a general purpose Adaptive Random Scan Gibbs Sampler (ARSGS) that adapts the selection probabilities on the fly.

We also find that a special case of the MwG algorithm, namely, Random Walk Metropolis within Gibbs (RWMwG) algorithm, may be significantly improved by adapting both the proposal distribution (for instance, as suggested in [43]) and the underlying selection probabilities in the same manner as for the RSGS.

Because the implementation of the adaptive algorithms is easy and the additional computational cost is often negligible compared to the total computational effort, we argue that the algorithms could be routinely implemented. We demonstrate in Section 8 that the ARSGS and ARWMwAG algorithms speed up convergence to the target distribution for many typical Gibbs sampling problems.

Finally, we introduce a notion of local simultaneous geometric drift condition 3 in Section 7. It turns out that for the RSGS it is a natural property to have as we demonstrate in Theorem 10. In Theorem 13 we prove convergence of the modified ARSGS under the local simultaneous geometric drift condition.

The paper is organised as follows. In Section 2 we exploit ideas of Amit [1, 2] and Roberts & Sahu [35] to derive the formula for the spectral gap for a particular case of sampling from the Multivariate Normal distribution using the RSGS. For a general target distribution, we introduce the concept of pseudo-spectral gap and pseudo-optimal selection probabilities in Section 3 and demonstrate potential advantage of the pseudo-optimal weights on toy examples studied in Section 4. Derivation of the ARSGS and ARWMwAG algorithms is presented in Sections 5 and 6 respectively. Convergence properties of the adaptive algorithms are discussed in Section 7. We provide simulation study and discuss computational cost of the adaptive algorithms in Section 8. Unless stated otherwise, the proofs are presented in the Supplementary Material section 10.

## 2 RSGS spectral gap for Multivariate Gaussian distribution

In this section we consider the RSGS for the normal target distribution and establish an explicit representation of the spectral gap in Theorem 2. One may skip all the technical details and notice only that the spectral gap in this case relies solely on the correlation structure of the target distribution and the selection probabilities.

Let \pi be a distribution of interest in \mathbb{R}^{d}. Let \Sigma and Q=\Sigma^{-1} denote the covariance matrix of \pi and its inverse respectively, where we assume throughout the paper that \Sigma is positive-definite. Partition Q into blocks Q=\left(Q_{ij}\right)_{i,j=1}^{s} where Q_{ij} is a r_{i}\times r_{j} matrix, \sum_{i=1}^{s}r_{i}=d. For vectors x\in\mathbb{R}^{d} introduce splitting x=(x_{1},..,x_{s}), where x_{i} is a vector in \mathbb{R}^{r_{i}} so that x_{i}=\left(x_{i1},..,x_{ir_{i}}\right).

Given a probability vector p=(p_{1},..,p_{s}) (i.e., p_{i}>0,\ \sum_{i=1}^{s}p_{i}=1), RSGS(p) is a Markov kernel that at every iteration chooses a subvector x_{i}=\left(x_{i1},..,x_{ir_{i}}\right) with probability p_{i} and updates it from the conditional distribution \pi(x_{i}|x_{-i}) of x_{i} given x_{-i}:=(x_{1},..,x_{i-1},x_{i+1},..,x_{s}). In other words, the RSGS(p) is a Markov chain with kernel

 \displaystyle P_{p}(x,A)=\sum_{i=1}^{s}p_{i}Pr_{i}(x,A), (1)

where A is a \pi-measurable set, x\in\mathbb{R}^{d} and Pr_{i} is a kernel that stands for updating x_{i} from the full conditional distribution \pi(x_{i}|x_{-i}). We call the kernel Pr since it is in fact a projection operator (i.e., Pr^{2}=Pr) acting on the set of the space of square integrable functions L_{2}(\mathbb{R}^{d},\pi) with respect to \pi. For \pi-integrable functions f, let \pi(f):=\int f{\rm d}\pi and \left(P_{p}f\right)(x):=\int f(y)P(x,{\rm d}y).

###### Definition.

Let \rho=\rho(p)>0 be the minimum number such that for all f\in L_{2}(\mathbb{R}^{d},\pi) and r>\rho,

 \displaystyle\lim_{n\to\infty}r^{-2n}\mathbf{E}_{\pi}[\{\left(P_{p}^{n}f\right% )(x)-\pi(f)\}^{2}]=0. (2)

Then \rho is called the {L_{2}-}rate of convergence in L_{2}(\mathbb{R}^{d},\pi) of the Markov chain with the kernel P_{p}. The value 1-\rho is called the L_{2}-spectral gap (or simply spectral gap) of the kernel P_{p}.

In the case when s=d and the selection probabilities are uniform, i.e., p=(\frac{1}{d},..,\frac{1}{d}), Amit [2] provides a formula for the spectral gap. Here we generalise Amit’s result by essentially changing p_{i} for \frac{1}{s} in the proof of Theorem 1 in [2].

It is easy to see that the RSGS kernel is reversible w.r.t. the target distribution \pi. It is known that if the spectrum of kernel P_{p} (considered as an operator on L_{2}(\mathbb{R}^{d},\pi)) consists of eigenvalues only, the L_{2}-rate of convergence is given by the second largest eigenvalue of the kernel P_{p} (follows, e.g, from Theorem 2 and the following remark in [36]).

There are two key steps to establish an explicit formula for the rate of convergence \rho(p).

Step 1. For the kernels P_{p}, find finite dimensional invariant subspaces S_{k} (i.e., P_{p}S_{k}\subset S_{k}) in L_{2}(\mathbb{R}^{d},\pi) by considering action of P_{p} on the orthonormal basis of Hermite polynomials.

Step 2. Identify the subspace S_{k} with the maximum eigenvalue less than one.

To clarify the steps we need to introduce some additional notations. Without loss of generality, suppose that \pi has zero mean.

Let K=\sqrt{Q} be the symmetric square root of Q defined through the spectral decomposition , i.e., if for an orthogonal matrix U (i.e., U^{\mathsf{T}}U=I), Q=U\rm diag(\lambda_{1},..,\lambda_{n})U^{\mathrm{T}}, then K=U\rm diag(\sqrt{\lambda_{1}},..,\sqrt{\lambda_{n}})U^{\mathrm{T}}. Set

 \displaystyle D_{i}={\rm diag}(0,..,Q^{-1}_{ii},..,0), (3)

where we stress that D_{i} is a d\times d matrix with Q^{-1}_{ii} being at the same place as in partition (Q^{-1}_{ii})_{i=1}^{s}.

For \alpha=(\alpha_{1},..,\alpha_{d})\in Z^{d}_{+} let \alpha!=\alpha_{1}!..\alpha_{d}!, |\alpha|=\alpha_{1}+..+\alpha_{d}. Define h_{k} to be the Hermite polynomial of order k, i.e.,

 h_{k}(x)=(-1)^{k}\exp\left(\frac{x^{2}}{2}\right)\frac{d}{dx^{k}}\exp\left(-% \frac{x^{2}}{2}\right),\ x\in\mathbb{R}^{d}.

Set H_{\alpha}(x)=\frac{1}{\sqrt{\alpha!}}h_{\alpha_{1}}(x_{1})..h_{\alpha_{d}}(x_% {d}), H_{0}(x):=1. The next lemma summarizes Steps 1 and 2 above.

###### Lemma 1.

\left\{{H_{\alpha}(Kx)}|\ \alpha\in Z^{n}_{+}\right\} form an orthonormal basis in L_{2}(\mathbb{R}^{d},\pi) and for all integers k\geq 0, spaces

 S_{k}:={\rm span}\Big{\{}H_{\alpha}(Kx)\Big{|}\ |\alpha|=k\Big{\}},

spanned by \{H_{\alpha}(Kx)|\ |\alpha|=k\}, are finite dimensional and P_{p}-invariant (i.e., P_{p}(f)\in S_{k} for all f\in S_{k}). Moreover, for all k\geq 0,

 \lambda_{max}(P_{p}|_{S_{1}})\geq\lambda_{max}(P_{p}|_{S_{k}}),

where \lambda_{max}(\cdot) is the maximum eigenvalue and P_{p}|_{S_{k}} is a restriction of P_{p} on S_{k}.

Lemma 1 immediately implies that \mathrm{Gap}(p)=1-\lambda_{max}(P_{p}|_{S_{1}}) and the next theorem provides a representation of \mathrm{Gap}(p) through the correlation structure of the target distribution.

###### Theorem 2.

The L_{2}-spectral gap in the RSGS(p) scheme for the Gaussian target distribution with precision matrix Q is given by

 \displaystyle\mathrm{Gap}(p)=1-\lambda_{max}(F_{1}), (4)

where

 \displaystyle F_{1}=I-K\left(\sum_{i=1}^{s}p_{i}D_{i}\right)K, (5)

D_{i} is given by (3), and K=\sqrt{Q}.

Since S_{1} is a set of linear functions, Lemma 1 also implies

###### Theorem 3.

Consider a Gibbs kernel P_{p} that corresponds to a normal target distribution \pi. Then the second largest eigenfunction of P_{p} in L_{2}(\mathbb{R}^{d},\pi) is a linear function.

We end this section by comparing formula (4) with the results by Roberts & Sahu [35]. Consider the case when p_{1}=..=p_{s}=\frac{1}{s} and introduce a matrix

 \displaystyle A=I-{\rm diag}(Q^{-1}_{11},..,Q^{-1}_{ss})Q, (6)

The following lemma will be useful throughout the paper and can be easily obtained.

###### Lemma 4.

Let A and B be two d\times d matrices. Then AB and BA have the same eigenvalues.

Lemma 4 implies that the spectrum (the set of all eigenvalues) of A defined in (6) is equal to the spectrum of

 I-K{\rm diag}(Q^{-1}_{11},..,Q^{-1}_{ss})K.

One can easily see that T^{(i)}:=I-KD_{i}K is a projection matrix, hence

 \displaystyle I-K{\rm diag}(Q^{-1}_{11},..,Q^{-1}_{ss})K=I+\sum_{i=1}^{s}T^{(i% )}-sI\geq(1-s)I,

and the minimum eigenvalue of A is bounded below by (1-s). Therefore, (4) is equivalent to

 \mathrm{Gap}\left(\frac{1}{s}\right)=\lambda_{max}\left(\frac{1}{s}\left((s-1)% I+A\right)\right)=\frac{1}{s}\Big{(}s-1+\lambda_{max}(A)\Big{)},

where \mathrm{Gap}\left(\frac{1}{s}\right) is the spectral gap of the RSGS with the uniform selection probabilities.

The last equation is the representation of the spectral gap in Theorem 2 of [35].

## 3 Pseudo-spectral gap

For a general target distribution computing the spectral gap is not feasible. But one can always deal with its normal counterpart (4) which we call pseudo-spectral gap. Optimizing (4) over all possible selection probabilities p leads to the notion of pseudo-optimal selection probabilities.

As mentioned in Section 1, in many Bayesian settings Bernstein-von Mises theorem (see, e.g, Section 10.2 of [45]) applies, that is, under certain conditions the posterior distribution converges to normal in the total variation norm. Thus we hope that the pseudo-spectral gap of RSGS is a meaningful approximation to the true value of the spectral gap and the pseudo-optimal weights are close to the ones that maximise the spectral gap.

In fact, as we will see in Section 8, where we sample from the Truncated Multivariate Normal distribution and the posterior in Markov Switching Model, if the correlation matrix is well-informative about the dependency structure of the target distribution, running the RSGS with the pseudo-optimal weights instead of the uniform ones, may substantially fasten the convergence, even if the target distribution has discrete components.

To formally define the pseudo-spectral gap, we need a couple of additional notations.

 {\Delta}_{s-1}:=\{\bar{p}\in\mathbb{R}^{s-1}|\bar{p}_{i}>0,\ i=1,..,{s-1};\ 1-% \bar{p}_{1}-..-\bar{p}_{s-1}>0\}

is a convex set in \mathbb{R}^{s-1}, so that \Delta_{s-1} defines a set of s-dimensional probability vectors p=(p_{1},..,p_{s}) and we write p\in\Delta_{s-1} meaning (p_{1},..,p_{s-1})\in\Delta_{s-1}.

Let \lambda_{min}(\cdot) and \lambda_{max}(\cdot) denote the minimum and the maximum eigenvalues of a matrix respectively. As before, for a covariance matrix \Sigma, Q=\Sigma^{-1}, K=\sqrt{Q}. For probability weights p=(p_{1},..,p_{s}), let

 \displaystyle D_{p}=\rm diag(p_{1}Q_{11}^{-1},...,p_{s}Q_{ss}^{-1}) (7)

be a d\times d block-diagonal matrix.

###### Definition (Pseudo-spectral gap).

For arbitrary distribution \pi with precision matrix Q, and any probability vector p\in\Delta_{s-1}, the pseudo-spectral gap for RSGS(p) is defined as

 \displaystyle\mathop{\mbox{$\rm P$-$\rm Gap$}}(p):=1-\lambda_{max}\left(I-KD_{% p}K\right), (8)

which due to Lemma 4 can be written as

 \displaystyle\mathop{\mbox{$\rm P$-$\rm Gap$}}(p)=1-\lambda_{max}\left(I-D_{p}% Q\right)=\lambda_{min}\left(D_{p}Q\right). (9)

Weights p^{\mathrm{opt}}=\left(p^{\mathrm{opt}}_{1},..,p^{\mathrm{opt}}_{s}\right)\in% \Delta_{s-1} are called pseudo-optimal for RSGS if they maximize the corresponding pseudo-spectral gap, i.e,

 \displaystyle p^{\mathrm{opt}}=\underset{{p\in\Delta_{s-1}}}{\rm argmax}% \lambda_{min}\left(D_{p}Q\right). (10)

Remark. It follows from Section 2, that for RSGS(p) the pseudo-spectral and the spectral gap are the same if the target distribution is normal.

Useful observation for both theoretical and practical purposes is the uniqueness of the pseudo-optimal weights.

###### Theorem 5.

There exists a unique solution for (10).

We conclude this section by presenting an upper bound on the possible improvement of the spectral gap of RSGS(p^{\mathrm{opt}}) compared to the spectral gap of the vanilla chain, i.e., the chain with uniform selection probabilities.

###### Theorem 6.

Let \mathrm{Gap}(p) be the spectral gap of RSGS(p) and \mathrm{Gap}\left(\frac{1}{s}\right) be the spectral gap of the vanilla chain, i.e., the RSGS with uniform selection probabilities. Then for any probability vectors p and q

 \mathrm{Gap}(p)\leq\left(\max_{i=1,..,s}{\frac{p_{i}}{q_{i}}}\right)\mathrm{% Gap}(q),

in particular,

 \displaystyle\mathrm{Gap}(p)\leq\left(\max_{i=1,..,s}{sp_{i}}\right)\mathrm{% Gap}\left(\frac{1}{s}\right), (11)

where s is the number of components in the Gibbs sampling scheme.

Remark. Theorem 6 implies

 \displaystyle\mathop{\mbox{$\rm P$-$\rm Gap$}}(p)\leq\left(\max_{i=1,..,s}{sp_% {i}}\right)\mathop{\mbox{$\rm P$-$\rm Gap$}}\left(\frac{1}{s}\right), (12)

where \mathop{\mbox{$\rm P$-$\rm Gap$}}(\frac{1}{s}) is the pseudo-spectral gap for the vanilla chain.

Theorem 6 states that the maximum gain one can get by using non-uniform selection probabilities is bounded by s times - the number of blocks in the Gibbs sampling scheme. Thus we expect the pseudo-optimal weights to be particularly useful in high dimensional settings.

## 4 Motivating examples

The pseudo-optimal weights (10) have complicated interpretation as we will see in the following examples.

Example 1 . In case where the correlation matrix of the target distribution has blocks of highly correlated coordinates, one would prefer to update them more frequently than the others. In this section we construct an artificial example where the upper bound in (12) is \frac{d}{2}\mathrm{Gap}\left(\frac{1}{d}\right). Consider a target distribution in \mathbb{R}^{d}, d=2k with correlation and normalised precision (inverse covariance) matrices given respectively by their block form, i.e., {\rm Corr}=\left(C_{ij}\right)_{i,j=1}^{k}, Q=\left(Q_{ij}\right)_{i,j=1}^{k}, where C_{ij} and Q_{ij} are 2\times 2 matrices such that all Q_{ij}, C_{ij} are zero matrices if i\neq j and for all i=1,..,k

 \displaystyle C_{ii}=\left(\begin{array}[]{cc}1&-\rho_{i}\\ -\rho_{i}&1\end{array}\right),\ \ \ Q_{ii}=\left(\begin{array}[]{cc}1&\rho_{i}% \\ \rho_{i}&1\end{array}\right),

where we assume \rho_{i}\geq 0, i=1,..,k. Assume one wants to apply the coordinate-wise RSGS to sample from a distribution with the above correlation matrix.

###### Proposition 7.

Let the inverse covariance matrix Q be as above. Define

 \displaystyle\alpha_{i}=\frac{\prod_{l=1,l\neq i}^{k}(1-\rho_{l})}{\sum_{l=1}^% {k}\prod_{j=1,j\neq l}^{k}(1-\rho_{j})}. (13)

Then the pseudo-optimal weights are given by

 \displaystyle p^{\mathrm{opt}}_{2i-1}=p^{\mathrm{opt}}_{2i}=\frac{\alpha_{i}}{% 2}. (14)

The corresponding \mathop{\mbox{$\rm P$-$\rm Gap$}} is

 \displaystyle\mathop{\mbox{$\rm P$-$\rm Gap$}}\left(p^{\mathrm{opt}}\right)=% \frac{\prod_{l=1}^{k}(1-\rho_{l})}{2\sum_{l=1}^{k}\prod_{j=1,j\neq l}^{k}(1-% \rho_{j})}. (15)

Without loss of generality assume \rho_{1}=\max\{\rho_{1},...,\rho_{k}\}. We shall compare pseudo-spectral gaps of the vanilla chain with RSGS(p^{\mathrm{opt}}). One can easily obtain that the pseudo-spectral gap of the vanilla chain is given by

 \mathop{\mbox{$\rm P$-$\rm Gap$}}\left(\frac{1}{d}\right)=\frac{1}{d}(1-\rho_{% 1}).

Simple calculations yield

 \displaystyle\lim_{\rho_{1}\to 1}\frac{\mathop{\mbox{$\rm P$-$\rm Gap$}}\left(% \frac{1}{d}\right)}{\mathop{\mbox{$\rm P$-$\rm Gap$}}\left(p^{\mathrm{opt}}% \right)}=\lim_{\rho_{1}\to 1}\frac{1-\rho_{1}}{2k\left(\frac{\prod_{l=1}^{k}(1% -\rho_{l})}{2\sum_{l=1}^{k}\prod_{j=1,j\neq l}^{k}(1-\rho_{j})}\right)}= \displaystyle=\lim_{\rho_{1}\to 1}\frac{1}{k}\frac{\left(\sum_{l=1}^{k}\prod_{% j=1,j\neq l}^{k}(1-\rho_{j})\right)}{\prod_{l=2}^{k}(1-\rho_{l})}=\frac{1}{k}=% \frac{2}{d}.

Moreover,

 \lim_{\rho_{1}\to 1}\left(\max_{i}dp^{\mathrm{opt}}_{i}\right)=\frac{1}{k}=% \frac{2}{d}.

Thus we obtained a sequence of precision matrices for which the pseudo-optimal weights improve the pseudo-spectral gap by \frac{d}{2} times in the limit which is the upper bound in (12). Notice, if the underlying target distribution is normal, the upper bound in (11) for the L_{2}-spectral gap is approximated.

Remark. Corollary 1 to Theorem 5 of [35] implies that the spectral gap of Deterministic Update Gibbs Sampler (denoted by \mathrm{Gap}\left({\rm DUGS}\right)) for the normal target with a 3-diagonal precision Q is greater than the gap of the vanilla RSGS (i.e., with the uniform selection probabilities). Moreover, from Corollary 2 to Theorem 5 of [35], \underset{\rho_{1}\to 1}{\lim}\frac{\mathrm{Gap}\left({\rm DUGS}\right)}{% \mathop{\mbox{$\rm P$-$\rm Gap$}}\left(\frac{1}{d}\right)}=2. We constructed an example of a 3-diagonal precision matrix, where in dimensions greater than 6, RSGS with pseudo-optimal weights p^{\mathrm{opt}} converges \frac{d}{4} times faster than DUGS for \rho_{1}\to 1.

Example 2 . One mistakenly might conclude that significant gain from using the pseudo-optimal weights is achieved only if some of the off-diagonal entries of the covariance matrix are close to one . Here we provide a somewhat counter-intuitive example that demonstrates fallacy of such statement.

Consider a correlation matrix matrix given by \Sigma^{(2)}=\left(C_{ij}\right)_{i,j=1}^{d}, where C_{ii}=1 for i=1,..,d, C_{1i}=C_{i1}:=c_{i}\geq 0 for i=2,..,d and all other entries C_{ij}=0.

One can easily work out that the smallest eigenvalue of \Sigma^{(2)}, \lambda_{min}=1-\sqrt{\sum_{i=2}^{d}c^{2}_{i}}. Thus if \lambda_{min}>0, then \Sigma^{(2)} is a valid correlation matrix. Set d=50 and c_{i}=\frac{1}{7.01}\approx 0.143 for i=2,..,50.

We run the subgradient optimisation algorithm presented in Section 5 in order to estimate p^{\mathrm{opt}}. We estimate p^{\mathrm{opt}}_{1}\approx 0.484, p^{\mathrm{opt}}_{i}\approx 0.01 for i=2,..,50. From (9) the pseudo-spectral gap is roughly \frac{1}{1496}, whilst \mathop{\mbox{$\rm P$-$\rm Gap$}}\left(\frac{1}{50}\right) is roughly \frac{1}{18294}. Thus if the target distribution is normal, the spectral-gap of the vanilla RSGS is improved by more than 12 times. Note, however, all off-diagonal correlations are less than 0.143.

## 5 Adapting the Gibbs Sampler

In this section we derive the Adaptive Random Scan Gibbs Sampler (ARSGS) Algorithm 5. We provide all the steps and intuition leading towards the final working version of the algorithm presented in the end of the section.

The goal is to compute the pseudo-optimal weights (10) for the RSGS (1). However, in practice the correlation matrix of the target distribution is usually not known. Thus we could proceed in the adaptive way, similarly to Haario et al. [24]. Given output of the chain of length n, let \widehat{\Sigma}_{n}, \widehat{Q}_{n}, and \widehat{D(p)}_{n} be estimators of \Sigma, Q, and D_{p} respectively built upon the chain output. For instance, one may choose the naive estimator

 \displaystyle\widehat{\Sigma}_{n}=\frac{1}{n}\left(\sum_{i=0}^{n}X_{i}X_{i}^{% \mathrm{T}}-(n+1)\overline{X}_{n}\overline{X}_{n}^{\mathrm{T}}\right), (16)

where X_{n} is the chain output at time n and \overline{X}_{n} is a sample mean of the output up to time n.

{algorithm}

Adaptive Random Scan Gibbs Sampler (general idea) Generate a starting location X_{0}\in\mathbb{R}^{d}. Set an initial value of p^{0}\in\Delta_{s-1}. Choose a sequence of positive integers \left(k_{m}\right)_{m=0}^{\infty}. Set n=0, i=0.
Beginning of the loop

[label=0.,ref=0]
1. 1.

n:=n+k_{i}. Run RSGS(p^{i}) for k_{i} steps;

2. 2.

Re-estimate \widehat{\Sigma}_{n} and \widehat{D(p)}_{n};

3. 3.

Compute p^{i+1}=\rm argmax_{p\in\Delta_{s-1}}\lambda_{min}\left(\widehat{D(p)}_{n}% \widehat{Q}_{n}\right);

4. 4.

i:=i+1.

Go to Beginning of the loop

The Algorithm 5 summarises the above ideas. The algorithm is limited by Step 3, where one needs to maximize the minimum eigenvalue. Maximising the minimum eigenvalue is known to be a complicated optimisation problem. There is vast literature covering optimisation problem in Step 3 and we refer to [31, 32], [17], and references therein. Unfortunately, the existing optimisation algorithms require computation of the minimum eigenvalues of \lambda_{min}\left(\widehat{D(p)}_{n}\widehat{Q}_{n}\right) which is not a reasonable way to waste computational resources since we do not know the covariance matrix \Sigma anyway. Therefore, we develop a new algorithm based on the subgradient method for convex functions (see Chapter 8 of [13]) applied to (10).

For \varepsilon>0, introduce a contraction set of {\Delta}_{s}:

 \displaystyle{\Delta}^{\varepsilon}_{s}:=\{w\in\mathbb{R}^{s}|w_{i}\geq% \varepsilon,\ i=1,..,{s-1};\ 1-w_{1}-..-w_{s}\geq\varepsilon\}, (17)

and consider (d+1)\times(d+1) matrices

 \displaystyle\begin{aligned} &\displaystyle{Q^{\rm ext}}={\rm diag}\left(Q,1% \right),\\ &\displaystyle{\Sigma^{\rm ext}_{n}}={\rm diag}\left(\widehat{\Sigma}_{n},1% \right),\\ &\displaystyle{D^{\rm ext}_{w}}={\rm diag}\left(D_{w},\ 1-\sum_{i=1}^{s}w_{i}% \right),\end{aligned}\qquad\begin{aligned} &\displaystyle{\Sigma^{\rm ext}}={% \rm diag}\left(\Sigma,1\right),\\ &\displaystyle{Q^{\rm ext}_{n}}={\rm diag}\left(\widehat{Q}_{n}.1\right),\\ &\displaystyle{D^{\rm ext}_{n}(w)}={\rm diag}\left(\widehat{D(w)}_{n},\ 1-\sum% _{i=1}^{s}w_{i}\right).\end{aligned}

Let us denote the target function

 \displaystyle f(w)=\lambda_{min}\left({D^{\rm ext}_{w}}{Q^{\rm ext}}\right)=% \lambda_{min}\left(\sqrt{{Q^{\rm ext}}}{D^{\rm ext}_{w}}\sqrt{{Q^{\rm ext}}}% \right), (18)

where the last equality holds in view of Lemma 4.

Using the definition of the pseudo-optimal selection probabilities (10), one can easily verify the following proposition

###### Proposition 8.

The pseudo-optimal weights (10) can be obtained as a normalised solution of

 \displaystyle w^{\star}=\underset{{w\in\Delta_{s}}}{\rm argmax}f(w),

i.e.,

 \displaystyle p^{\mathrm{opt}}_{j}=\frac{w^{\star}_{j}}{w^{\star}_{1}+..+w^{% \star}_{s}},\ j=1,..,s, (19)

Moreover,

 \mathop{\mbox{$\rm P$-$\rm Gap$}}(p^{\mathrm{opt}})=\frac{1}{(w^{\star}_{1}+..% +w^{\star}_{s})}f(w^{\star}),

where f is defined in (18).

Remark. One could easily avoid introducing the extended matrices {\Sigma^{\rm ext}},{Q^{\rm ext}} by simply setting p_{s}=1-p_{1}-..-p_{s-1} and treating function \lambda_{min}\left(\widehat{D(p)}_{n}\widehat{Q}_{n}\right) as a function of s-1 variables. However, we found empiricallym that such approach can significantly slow down convergence of the ARSGS Algorithm 5 introduced later in this section.

It is easy to prove concavity of the function f (18).

###### Proposition 9.

Function f defined in (18) is concave in \Delta_{s}.

[3] show that f is differentiable at w\in\Delta_{s} if and only if f(w) is a simple eigenvalue of \sqrt{{Q^{\rm ext}}}{D^{\rm ext}_{w}}\sqrt{{Q^{\rm ext}}}. It is also known that convex functions in Euclidean spaces are differentiable almost everywhere w.r.t. Lebesgue measure (see [15], Section 2.5). [3] also provide exact formulas for computing derivatives of f where they exist. Thus we are motivated to adapt subgradient method for convex functions in order to modify Step 3 in the above algorithm.

Let <\cdot,\cdot> denote scalar product in \mathbb{R}^{d}. Recall the definition of subgradient and subdifferential.

###### Definition.

Let h:\mathbb{R}^{d}\to\mathbb{R} be a convex function. We say v is a subgradient of h at point x if for all y\in\mathbb{R}^{d},

 h(y)\geq h(x)+\left.

If h is concave, we say that v is a supergradient of h at a point x, if (-v) is a subgradient of the convex function (-h) at x. The set of all sub-(super-)gradients at the point x is called sub-(super-)differential at x and is denoted by \partial h(x).

In other word, \partial h(x) parametrises a collection of all tangent hyperplanes at a point x.

Note that f(w)=0 on the boundary of \Delta_{s}. Therefore, the maximum of f is attained inside \Delta_{s}. One may apply the subgradient optimisation method in order to estimate p^{\mathrm{opt}}. The method is described in Algorithm 5.

{algorithm}

Subgradient optimisation algorithm Set an initial value of w^{0}=(w^{0}_{1},..,w^{0}_{s})\in\Delta_{s}. Define a sequence of non-negative numbers \left(a_{m}\right)_{m=1}^{\infty} such that \sum_{m=1}^{\infty}a_{m}=\infty and \lim_{m\to\infty}a_{m}=0. Set i=0.
Beginning of the loop

[label=0.,ref=0]
1. 1.

Compute any d^{i}\in\partial f(w^{i}). Normalise d^{i}:=\frac{d^{i}}{|d^{i}_{1}|+..+|d^{i}_{s}|};

2. 2.

w^{\rm new}_{j}:=w_{j}^{i}+a_{i+1}d_{j}^{i}, j=1,..s;

3. 3.

w^{i+1}:=\mathrm{Pr}_{\Delta_{s}}\left(w^{\rm new}\right) , where \mathrm{Pr}_{\Delta_{s}} is the projection operator on \Delta_{s};

4. 4.

i:=i+1.

Go to Beginning of the loop

It is known that Algorithm 5 produces a sequence \{w^{i}\} such that w^{i}\to w^{\star} as i\to\infty (see Chapter 8 of [13]). Therefore, it is reasonable to combine the ARSGS 5 with the subgradient algorithm. In order to do so, define a sequence of approximations of (18):

 f_{n}(w)=\lambda_{min}\left({D^{\rm ext}_{n}(w)}{Q^{\rm ext}_{n}}\right)=% \lambda_{min}\left(\sqrt{{Q^{\rm ext}_{n}}}{D^{\rm ext}_{n}(w)}\sqrt{{Q^{\rm ext% }_{n}}}\right).
{algorithm}

Adaptive Gibbs Sampler based on subgradient optimisation method (not implementable) Generate a starting location X_{0}\in\mathbb{R}^{d}. Fix \frac{1}{s+1}>\varepsilon>0. Set an initial value of w^{0}=(w^{0}_{1},..,w^{0}_{s})\in\Delta^{\varepsilon}_{s}. Define a sequence of non-negative numbers \left(a_{m}\right)_{m=1}^{\infty} such that \sum_{m=1}^{\infty}a_{m}=\infty and \lim_{m\to\infty}a_{m}=0 . Set i=0. Choose a sequence of positive integers \left(k_{m}\right)_{m=0}^{\infty}.
Beginning of the loop

[label=0.,ref=0]
1. 1.

n:=n+k_{i}. p^{i}_{j}:=\frac{w_{j}^{i}}{w^{i}_{1}+..+w^{i}_{s}}, j=1,..,s. Run RSGS(p^{i}) for k_{i} steps;

2. 2.

Re-estimate \widehat{\Sigma}_{n} and {D^{\rm ext}_{n}(w)};

[label=3.0.,ref=3.0]
1. 1.

Compute d^{i}\in\partial f_{n}(w^{i}). Normalise d^{i}:=\frac{d^{i}}{|d^{i}_{1}|+..+|d^{i}_{s}|};

2. 2.

w^{\rm new}_{j}:=w_{j}^{i}+a_{i+1}d_{j}^{i}, j=1,..s;

3. 3.

w^{i+1}:=\mathrm{Pr}_{\Delta_{s}^{\varepsilon}}\left(w^{\rm new}\right) , where \mathrm{Pr}_{\Delta^{\varepsilon}_{s}} is the projection operator on \Delta^{\varepsilon}_{s};

[label*=0.]
1. 4.

i:=i+1.

Go to Beginning of the loop

Algorithm 5 resembles the aforementioned ideas. Here we consider iterations w^{i} to be in \Delta^{\varepsilon}_{s} for \varepsilon>0 because of three reasons. Firstly, the RSGS with selection probabilities that are on the boundary of \Delta_{s-1} is not ergodic. Secondly, this assumption is motivated by the results of [28], where it is a minimum requirement to establish convergence of an Adaptive Gibbs Sampler. Finally, in the final Algorithm 5, it is an essential assumption to be able to perform power iteration Step a. Note, however, that \varepsilon>0 may be chosen arbitrary small.

In order to construct an implementable and practical ARSGS algorithm, we still need to find a way to approximate the subgradient \partial f_{n}(w^{i}) in Step 1 and also find a cheap way of computing the projection \mathrm{Pr}_{\Delta_{s}^{\varepsilon}}\left(w^{\rm new}\right) in Step 3.

An efficient algorithm to compute the projection on \Delta_{s}^{\varepsilon} is presented in [48] and summarized in Algorithm 5. First, we increase all small coordinates to be \varepsilon in Step 1. If the resulting point is outside \Delta_{s}^{\varepsilon}, we need to project it on the hyperplane \{w\in\mathbb{R}^{s}|1-\sum_{j=1}^{s}w=\varepsilon,\ w_{i}\geq\varepsilon\}. In order to find the projection, we first rescale the coordinates in Step 3. Then we use the algorithm of [48] to compute the projection on \{w\in\mathbb{R}^{s}|\sum_{j=1}^{s}w=1,\ w_{i}\geq 0\} in Steps 4 - 6. Finally, we rescale the resulting point in Step 7 and thus obtain the desired projection.

{algorithm}

projection on \Delta_{s}^{\varepsilon} The output of the algorithm is w^{\rm proj} - projection of w\in\mathbb{R}^{s} onto \Delta_{s}^{\varepsilon}.

[label=0.,ref=0]
1. 1.

Define an auxiliary variable w^{\rm aux}:=w. For j=1,..,s, if w^{\rm aux}_{j}<\varepsilon, set w^{\rm aux}_{j}:=\varepsilon;

2. 2.

If 1-\sum_{j=1}^{s}w^{\rm aux}_{j}>\varepsilon, then w^{\rm proj}:=w^{\rm aux} and go to Step 8; else go to Step 3;

3. 3.

For j=1,..,s, w^{\rm temp}_{j}:=\frac{1}{1-\varepsilon(s+1)}(w^{\rm aux}_{j}-\varepsilon);

4. 4.

Sort vector (w^{\rm temp}_{1},..,w^{\rm temp}_{s}) into u:\ u_{1}\geq...\geq u_{s};

5. 5.

\rho:=\max\Bigg{\{}1\leq j\leq s:\ u_{j}+\frac{1}{j}\left(1-\sum_{k=1}^{j}u_{k% }\right)>0\Bigg{\}};

6. 6.

Define \lambda=\frac{1}{\rho}\left(1-\sum_{k=1}^{\rho}u_{k}\right);

7. 7.

For j=1,..,s, w^{\rm proj}_{j}:=\varepsilon+(1-\varepsilon(s+1))\max\{w^{\rm temp}_{j}+% \lambda,0\};

8. 8.

Return w^{\rm proj}.

We are left to construct a procedure that approximates a supergradient d^{i}\in\partial f_{n}(w^{i}) in Step 1 of Algorithm 5. Since f_{n}(w) is the minimum eigenvalue of a self-adjoint matrix, f_{n}(w) may be obtained as

 f_{n}(w)=\min_{x:\|x\|=1}\left<\sqrt{{Q^{\rm ext}_{n}}}{D^{\rm ext}_{n}(w)}% \sqrt{{Q^{\rm ext}_{n}}}x,x\right>,

where x\in\mathbb{R}^{d+1} and \left<\cdot,\cdot\right> denotes scalar product in \mathbb{R}^{d}. Define

 g^{n}_{x}(w)=\left<\sqrt{{Q^{\rm ext}_{n}}}{D^{\rm ext}_{n}(w)}\sqrt{{Q^{\rm ext% }_{n}}}x,x\right>.

Let \nabla denote a gradient w.r.t. w. Then

 \displaystyle\nabla g^{n}_{x}(w)=\left(\left<\sqrt{{Q^{\rm ext}_{n}}}\frac{% \partial{D^{\rm ext}_{n}(w)}}{\partial w_{1}}\sqrt{{Q^{\rm ext}_{n}}}x,x\right% >,..,\left<\sqrt{{Q^{\rm ext}_{n}}}\frac{\partial{D^{\rm ext}_{n}(w)}}{% \partial w_{s}}\sqrt{{Q^{\rm ext}_{n}}}x,x\right>\right). (20)

Here \frac{\partial}{\partial w_{i}} stands for the element-wise derivative w.r.t. w_{i}, i=1,..,s. Ioffe-Tikhomirov theorem (see, e.g., [49]) implies that the superdifferential of f_{n} at a point w\in\Delta^{\varepsilon}_{s} can be computed as

 \partial f_{n}(w)={\rm conv}\left\{\nabla g_{x}(w)\ \Bigg{|}\ x:\ \sqrt{{Q^{% \rm ext}_{n}}}{D^{\rm ext}_{n}(w)}\sqrt{{Q^{\rm ext}_{n}}}x=f_{n}(w)x,\ \|x\|=% 1\right\},

where {\rm conv}\{A\} denotes a convex hull of the set A.

Computing elements of the set \partial f_{n}(w) is computationally expensive, since one has to calculate the minimum eigenvectors of \sqrt{{Q^{\rm ext}_{n}}}{D^{\rm ext}_{n}(w)}\sqrt{{Q^{\rm ext}_{n}}}. Therefore, we look for a cheap approximation of the points \nabla g^{n}_{x}(w) in \partial f_{n}(w).

Let y=\sqrt{{Q^{\rm ext}_{n}}}x. Since we are interested in minimum eigenvectors x, such that

 \sqrt{{Q^{\rm ext}_{n}}}{D^{\rm ext}_{n}(w)}\sqrt{{Q^{\rm ext}_{n}}}x=f_{n}(w)x,

we can rewrite this equation as

 \displaystyle\frac{1}{f_{n}(w)}y=\left({D^{\rm ext}_{n}(w)}\right)^{-1}{\Sigma% ^{\rm ext}_{n}}y. (21)

That is, computing the minimum eigenvector of \sqrt{{Q^{\rm ext}_{n}}}{D^{\rm ext}_{n}(w)}\sqrt{{Q^{\rm ext}_{n}}} is equivalent to computing the maximum eigenvector of \left({D^{\rm ext}_{n}(w)}\right)^{-1}\widehat{\Sigma}_{n}. Given y that solves (21) and substituting x=\frac{1}{\|\sqrt{{\Sigma^{\rm ext}_{n}}}y\|}\sqrt{{\Sigma^{\rm ext}_{n}}}y into (20), we obtain

 \displaystyle\nabla g^{n}_{x}(w)=\frac{1}{\|\sqrt{{\Sigma^{\rm ext}_{n}}}y\|^{% 2}}\Bigg{(}\Big{<}\frac{\partial{D^{\rm ext}_{n}(w)}}{\partial w_{1}}y,y\Big{>% },..,\Big{<}\frac{\partial{D^{\rm ext}_{n}(w)}}{\partial w_{s}}y,y\Big{>}\Bigg% {)}. (22)

We can do further transformations. Let

 \displaystyle\left({D^{\rm ext}_{n}(w)}\right)^{-1}=L_{n}(w)L_{n}^{\mathrm{T}}% (w) (23)

be the Cholesky decomposition of \left({D^{\rm ext}_{n}(w)}\right)^{-1}, where L_{n}(w) is a lower triangular matrix. Define z:=L^{-1}_{n}(w)y and

 \displaystyle R_{i}(w):=\rm diag\left(0,..,0,\frac{1}{w_{i}},..,\frac{1}{w_{i}% },0,..,0,-\frac{1}{1-w_{1}-..-w_{s}}\right), (24)

where \frac{1}{w_{i}} are placed exactly on the positions of the diagonal elements of Q_{ii} in the partition Q=\left(Q_{ij}\right)_{i,j=1}^{s}. Then after simple manipulations, (21) and (22) are equivalent respectively to

 L_{n}^{\mathrm{T}}(w){\Sigma^{\rm ext}_{n}}L_{n}(w)z=\frac{1}{f_{n}(w)}z

and

 \displaystyle\nabla g^{n}_{x}(w)=\frac{1}{\left}\Bigg{(}\left,..,\left\Bigg{)}, (25)

where we used the block-diagonal structure of L_{n}(w) and a representation

 \frac{\partial{D^{\rm ext}_{n}(w)}}{\partial w_{i}}={\rm diag}\left(0,..,0,Q^{% -1}_{ii},0,..,0,-1\right).

Because of the normalisation in Step 1 of the Adaptive Gibbs Sampler 5, (22) and (25) imply that a supergradient of f_{n}(w) is proportional to

 \displaystyle d_{y}(w)=\left(\left<({D^{\rm ext}_{n}(w)})_{1}y,y\right>,..,% \left<({D^{\rm ext}_{n}(w)})_{s-1}y,y\right>\right), (26)

or, in terms of z, to

 \displaystyle d_{z}(w)=\left(\left,..,\left% \right), (27)

where y and z are the maximum eigenvectors of \left({D^{\rm ext}_{n}(w)}\right)^{-1}\widehat{\Sigma}_{n} and
L_{n}^{\mathrm{T}}(w){\Sigma^{\rm ext}_{n}}L_{n}(w), respectively. Here the lower triangular matrix L_{n}(w) is defined by the Cholesky decomposition (23).

Power iteration step may be performed in order to approximate y and z. Let z_{0} and y_{0} be randomly generated unit vectors. Then at every iteration of the algorithm, we compute

 \displaystyle y_{i+1}={Q^{\rm ext}_{n}}D^{\rm ext}_{n}(w^{i})y_{i}, (28) \displaystyle z_{i+1}=L_{n}^{\mathrm{T}}(w^{i}){\Sigma^{\rm ext}_{n}}L_{n}(w^{% i})z_{i}. (29)

and use the normalized vectors y_{i+1} and z_{i+1} instead of y and z when computing the directions (26) and (27)

Given the intuition above, we present two versions of the ARSGS in the Algorithm 5, where in round brackets we denote an alternative version of the algorithm.

One might notice the perturbation term b_{i+1}\xi_{i+1} in the Step a. In fact, without the perturbation we may break the algorithm due to the fact that the power iteration step may fail to approach the maximum eigenvalue. It happens when z_{i} (or y_{i}) "slips" into the eigenspace of a wrong eigenvalue and can’t get out of it for the subsequent algorithm steps.

The simplest example one can think of is sampling from N(0,I_{2}) using coordinatewise RSGS. Set {\Sigma^{\rm ext}_{n}}=I_{3}. If one starts from w^{0}=(\frac{1}{4},\frac{1}{4},\frac{1}{2}) and steps a_{m} (see Algorithm 5 for the meaning of a_{m}) are chosen to be tiny, (0,0,1) is the maximum eigenvector of L_{n}^{\mathrm{T}}(w^{i}){\Sigma^{\rm ext}_{n}}L_{n}(w^{i})=\rm diag(w^{i}_{1}% ,w^{i}_{2},1-w^{i}_{1}-w^{i}_{1}) for i=1,..,N_{0}, where N_{0} depends on the sequence a_{m}. If N_{0} is big enough (equivalently, a_{m} is small enough), eventually z_{i}=(0,0,1) for all i due to the computational precision error and we will not get out of this eigenspace. Therefore, there is a possibility that eventually w^{i} sticks to the boundary of \Delta_{s}^{\varepsilon}. To surpass the issue we modify the power iteration Step a by perturbing the values of z_{i},

 z_{i+1}=L_{n}^{\mathrm{T}}(w^{i}){\Sigma^{\rm ext}_{n}}L_{n}(w^{i})z_{i}+b_{i}% \xi_{i},

where b_{i} is a non-negative sequence convergent to 0, and \xi_{i} is i.i.d. sequence of points uniformly distributed on the unit sphere.

Remark. In Step a of the ARSGS Algorithm 5 , \frac{1}{\|L_{n}^{\mathrm{T}}(w^{i})\widehat{\Sigma}_{n}L_{n}(w^{i})z_{i}\|} and \frac{1}{\|\left({D_{n}^{\rm ext}(w^{i})}\right)^{-1}\widehat{\Sigma}_{n}y_{i}\|} are approximations of \underset{w\in\Delta_{s}^{\varepsilon}}{\max}f(w), where f is defined in (18). Therefore, taking into an account Proposition 8, we can estimate \mathop{\mbox{$\rm P$-$\rm Gap$}}(p^{\mathrm{opt}}) by

 \displaystyle\mathop{\mbox{$\rm P$-$\rm Gap$}}(p^{\mathrm{opt}})\approx\left((% w^{i}_{1}+..+w^{i}_{s})\ \|L_{n}^{\mathrm{T}}(w^{i}){\Sigma^{\rm ext}_{n}}L_{n% }(w^{i})z_{i}\|\right)^{-1} (30)

or

 \displaystyle\mathop{\mbox{$\rm P$-$\rm Gap$}}(p^{\mathrm{opt}})\approx\left((% w^{i}_{1}+..+w^{i}_{s})\ \|\left({D_{n}^{\rm ext}(w^{i})}\right)^{-1}{\Sigma^{% \rm ext}_{n}}y_{i}\|\right)^{-1}. (31)
{algorithm}

Adaptive Random Scan Gibbs Sampler (final version) Generate a starting location X_{0}\in\mathbb{R}^{d}. Fix \frac{1}{s+1}>\varepsilon>0. Set an initial value of w^{0}=(w^{0}_{1},..,w^{0}_{s})\in\Delta^{\varepsilon}_{s}, generate a random unit vectors z_{0}\in\mathbb{R}^{d+1} (or y_{0}\in\mathbb{R}^{d+1}). Define two sequences of non-negative numbers \left(b_{m}\right)_{m=1}^{\infty} and \left(a_{m}\right)_{m=1}^{\infty} such that \sum_{m=1}^{\infty}a_{m}=\infty, a_{m}\to 0 and b_{m}\to 0 as m\to\infty. Set i=0. Choose a sequence of positive integers \left(k_{m}\right)_{m=0}^{\infty}.
Beginning of the loop

[label=0.,ref=0]
1. 1.

n:=n+k_{i}. p^{i}_{j}:=\frac{w_{j}^{i}}{w^{i}_{1}+..+w^{i}_{s}}, j=1,..,s. Run RSGS(p^{i}) for k_{i} steps;

2. 2.

Re-estimate \widehat{\Sigma}_{n}. Recompute {\Sigma^{\rm ext}_{n}}, D_{n}^{\rm ext}(w^{i}), L_{n}(w^{i});

[label=3.0.,ref=3.0]
1. 1.

[label=3.1.0.,ref=3.1.0]
1. (a)

Generate \xi_{i+1}\sim N(0,I_{d+1}). \xi_{i+1}:=\frac{\xi_{i+1}}{\|\xi_{i+1}\|}.
Compute

 z_{i+1}:=L_{n}^{\mathrm{T}}(w^{i}){\Sigma^{\rm ext}_{n}}L_{n}(w^{i})z_{i}+b_{i% +1}\xi_{i+1}.
 \Bigg{(}y_{i+1}:=\left({D_{n}^{\rm ext}(w^{i})}\right)^{-1}{\Sigma^{\rm ext}_{% n}}y_{i}+b_{i+1}\xi_{i+1}\Bigg{)}

Normalise z_{i+1}:=\frac{z_{i+1}}{\|z_{i+1}\|}.\ \Bigg{(}y_{i+1}:=\frac{y_{i+1}}{\|y_{i+% 1}\|}\Bigg{)}.

2. (b)

Compute d^{i}=d_{z_{i+1}}(w^{i})\ \mbox{ from }(\ref{gradient alternative 1}) \Bigg{(}d^{i}=d_{y_{i+1}}(w^{i})\mbox{ from }(\ref{gradient alternative 2})% \Bigg{)}. Normalise d^{i}:=\frac{d^{i}}{|d^{i}_{1}|+..+|d^{i}_{s}|};

2. 2.

w^{\rm new}_{j}:=w_{j}^{i}+a_{i+1}d_{j}^{i}, j=1,..s;

3. 3.

Using Algorithm 5 compute projection w^{i+1} of w^{\rm new} onto \Delta^{\varepsilon}_{s};

[label=0.,ref=0]
1. 4.

i:=i+1.

Go to Beginning of the loop

Sometimes one can not or does not want to sample from the full conditionals Pr_{i} of the target distribution. In this case one may want to proceed with the Metropolis-within-Gibbs algorithm. For simplicity, we restrict ourselves to the coordinate-wise update Random Walk Metropolis-within-Gibbs (RWMwG) Algorithm 6, though the idea presented below goes beyond this particular case.

One should not get confused with the parameter q in Step 2 of the Algorithm 6. If q=1, one recovers the RWMwG algorithm in its canonical form.

It is often not clear how to choose proposal variances \beta_{i} to speed up the convergence. We follow [20] suggestion that the average acceptance rate \alpha should be around 0.44 and adapt \beta_{i} on the fly to keep up with this acceptance rate. Algorithm 6 is the adaptive version of the RWMwG as suggested in [28].

One could also adapt the selection probabilities p_{i} but, as noted in [28], there is no to-date guidance on the optimal choice of p_{i}. Heuristically, we would expect the Adaptive RWMwG to mimic the RSGS, so that we find it to be reasonable to adapt the selection probabilities p_{i} in the same manner as for the RSGS. Therefore, we introduce Adaptive Random Walk Metropolis within Adaptive Gibbs (ARWMwAG) Sampler described in Algorithm 6, where running the ARWMwG sampler in Step 1 alternates with adaptation of the selection probabilities in Step 2.

{algorithm}

[H] Random Walk Metropolis-within-Gibbs (RWMwG) Generate a starting location X_{0}\in\mathbb{R}^{d}. Let (p_{1},..,p_{s}) be a probability vector, 0<q\leq 1, \sigma^{2}>0. Fix variances \beta_{1},..,\beta_{s} and choose starting location \left(X^{0}_{1},..,X^{0}_{d}\right)\in\mathbb{R}^{d}. n:=0.
Beginning of the loop

[label=0.,ref=0]
1. 1.

Sample i\in\{1,..,s\} from probability distribution (p_{1},..,p_{s});

2. 2.

Draw Y\sim\left\{\begin{array}[]{rcl}N(X_{i},\beta^{2}_{i})&\mbox{with probability}% &q,\\ N(X_{i},\sigma^{2})&\mbox{with probability }&1-q;\end{array}\right.

3. 3.

Compute acceptance rate \alpha=\min\Bigg{\{}1,\frac{\pi(Y|X^{n}_{-i})}{\pi(X^{n}_{i}|X^{n}_{-i})}\Bigg% {\}};

4. 4.

With probability \alpha accept the proposal and set

 X^{n+1}=\left(X^{n}_{1},..,X^{n}_{i-1},Y,X^{n}_{i+1},..,X^{n}_{s}\right),

otherwise, reject the proposal and set X^{n+1}=X^{n};

5. 5.

n=n+1.

Go to Beginning of the loop

{algorithm}

Adaptive Random Walk Metropolis-within-Gibbs (ARWMwG) Generate a starting location X_{0}\in\mathbb{R}^{d}. Let (p_{1},..,p_{s}) be a probability vector, 0<q\leq 1, \sigma^{2}>0. Fix variances \beta_{1}^{0},..,\beta_{s}^{0} and choose starting location \left(X^{0}_{1},..,X^{0}_{d}\right)\in\mathbb{R}^{d}. n:=0.
Beginning of the loop

[label=0.,ref=0]
1. 1.

Do Steps 1 - 4 of RWMwG Algorithm 6 with proposal variances (\beta_{1}^{n},..,\beta_{s}^{n});

2. 2.

\beta_{i}^{n+1}=\beta_{i}^{n}\cdot\exp\left(\frac{1}{n^{0.7}}(\alpha-0.44)\right), where \alpha is the acceptance rate in Step 3 of RWMwG Algorithm 6;

3. 3.

n=n+1.

Go to Beginning of the loop

{algorithm}

[H] Adaptive Random Walk Metropolis within Adaptive Gibbs Generate a starting location X_{0}\in\mathbb{R}^{d}. Fix variances \beta_{1}^{0},..,\beta_{s}^{0}, 0<q\leq 1, and \sigma^{2}>0. Choose also \frac{1}{s+1}>\varepsilon>0. Set an initial value of w^{0}=(w^{0}_{1},..,w^{0}_{s})\in\Delta^{\varepsilon}_{s}, generate a random unit vector z_{0}\in\mathbb{R}^{d+1} (or y_{0}\in\mathbb{R}^{d+1}). Define two sequences of non-negative numbers \left(b_{m}\right)_{m=1}^{\infty} and \left(a_{m}\right)_{m=1}^{\infty} such that \sum_{m=1}^{\infty}a_{m}=\infty, a_{m}\to 0 and b_{m}\to 0 as m\to\infty. Set i=0. Choose a sequence of positive integers \left(k_{m}\right)_{m=0}^{\infty}.
Beginning of the loop

[label=0.,ref=0]
1. 1.

n:=n+k_{i}. p^{i}_{j}:=\frac{w_{j}^{i}}{w^{i}_{1}+..+w^{i}_{s}}, j=1,..,s. Iterate k_{i} times Steps 1 and
2 of ARWMwG Algorithm 6 with sampling weights (p^{i}_{1},..,p^{i},s) and proposal variances (\beta_{1}^{n},..,\beta_{s}^{n});

2. 2.

Do steps 2 - 4 of ARSGS Algorithm 5.

Go to Beginning of the loop

## 7 Ergodicity of the Adaptive Gibbs Sampler

Here \{P_{\gamma}\}_{\gamma\in\Gamma} is a collection of Markov kernels with a common stationary distribution \pi. For example, this can be a collection of RSGS kernels (1) or the kernels of the Random Walk Metropolis Algorithm 6.

The main result is presented in Theorem 13, where ergodicity of the modified ARSGS Algorithm 13 is established under the local simultaneous geometric drift condition 3. We shall show in Theorem 10 that the local simultaneous geometric drift is a natural condition for the ARSGS to have. More generally, if the condition 3 holds, we prove ergodicity for a class of modified AMCMC Algorithms 12 in Theorem 12.

Ergodicity of the ARWMwG and ARWMwAG (Algorithms 6 and 6) is established under various conditions on the tails of the target distribution \pi in Section 5 of [28]. In order to fulfil these conditions we, for example, could take arbitrary 0\leq q<1 and large enough \sigma^{2} in the settings of the adaptive algorithms (see Theorems 5.6, 5.9 and Remark 5.8 in [28]).

For the ARSGS, we shall utilise Theorem 2 of [39]. The theorem guarantees ergodicity of an adaptive MCMC algorithm under the diminishing adaptation and containment conditions 1 and 2.

[label*=(C0)]
1. 1.

 \sup_{x\in\mathcal{X}}\|P_{\gamma_{n}}(x,\cdot)-P_{\gamma_{n+1}}(x,\cdot)\|_{% TV}\to^{\rm P}0\mbox{ as }n\to\infty,

where \|\cdot\|_{TV} is the total variation norm, \gamma_{n}\in\Gamma - random sequence of parameters, and \to^{\rm P} denotes the convergence in probability. Recall, for a signed measure \mu, its total variation norm \|\mu\|_{TV}=\sup_{A\in\mathcal{B}(\mathcal{X})}|\mu(A)|, where the supremum is taken over all measurable sets.

2. 2.

Containment condition. For x\in\mathbb{R}^{d}, \gamma\in\Gamma and all \varepsilon>0 define a function

 M_{\varepsilon}(x,\gamma):=\inf\Bigg{\{}N\geq 1\Bigg{|}\|P^{N}_{\gamma}(x,% \cdot)-\pi(\cdot)\|_{TV}\leq\varepsilon\Bigg{\}}.

We say that an adaptive chain \{X_{n},\gamma_{n}\} satisfies the containment condition, if for all \varepsilon>0, the sequence \{M_{\varepsilon}(X_{n},\gamma_{n})\}_{n=0}^{\infty} is bounded in probability, i.e., \underset{N\to\infty}{\lim}\underset{n}{\sup}\ {\rm P}\Bigg{(}M_{\varepsilon}(% X_{n},\gamma_{n})>N\Bigg{)}=0, where {\rm P} is the probability measure induced by the chain.

Theorem 2 of [39]. Let \{X_{n},\gamma_{n}\} be an adaptive chain with \{\gamma_{n}\} being the corresponding sequence of parameters. If \{X_{n},\gamma_{n}\} satisfies 1 and 2, then the adaptive chain is ergodic, i.e.,

 \|\mathcal{L}(X_{n})-\pi\|_{TV}\to 0\mbox{ as }n\to\infty,

where \mathcal{L}(X_{n}) is the probability distribution law of X_{n} and \pi is the target distribution.

One can easily see that 1 holds for the ARSGS since |p^{n+1}-p^{n}|\xrightarrow{P}0, due to the choice of decaying to zero adaptation rate a_{m} in the Step 2 of Algorithm 5.

Verifying the containment condition 2 is less so trivial. In Theorem 3 of [8], the containment is established if the simultaneous geometric drift conditions hold, i.e., if the following assumptions are fulfilled:

[label*=(A0)]
1. 0.

Uniform small set. There exist a uniform (\nu_{\gamma},m)-small set C, i.e., there exists a measurable set C\in\mathbb{R}^{d}, an integer m\geq 1, a constant \delta>0 and a probability measure \nu_{\gamma} probably depending on \gamma\in\Gamma, such that

 \displaystyle P^{m}_{\gamma}(x,\cdot)\geq\delta\nu_{\gamma}(\cdot). (32)
2. 1.

Simultaneous geometric drift. There exist numbers b<\infty, 0<\lambda<1, and a function 1\leq V<\infty, such that \underset{x\in C}{\sup}V(x)<\infty and for all \gamma\in\Gamma,

 P_{\gamma}V\leq\lambda V+bI_{C},

where P_{\gamma}V(x)=\underset{\mathbb{R}^{d}}{\int}V(y)P_{\gamma}(x,{\rm d}y) and the small set C is defined in 7.

Where the entire state space is small (i.e., C=\mathbb{R}^{d} in 7) for some RSGS kernel P_{p}, p\in\Delta_{s-1}^{\varepsilon}, ergodicity of the ARSGS is established in Section 4 of [28] (under additional \pi-irreducibility and aperiodicity assumptions).

In general, one could establish the simultaneous geometric drift condition 1 and use Theorem 5.1 of [28] to derive the ergodicity. For the ARSGS, it might be hard to find a drift function V that satisfies 1. Nevertheless, we show that the local simultaneous geometric drift condition holds, provided that P_{p} is geometrically ergodic for some p\in\Delta_{s-1}^{\varepsilon}.

[label*=(A0)]
1. 2.

Geometric ergodicity. There exists \gamma\in\Gamma such that P_{\gamma} is geometrically ergodic. That is, P_{\gamma} is \pi-irreducible, aperiodic (see Section 3.2 of [38] for definitions), and there exist drift coefficients (\lambda_{\gamma},V_{\gamma},b_{\gamma},C_{\gamma}) such that

 \displaystyle P_{\gamma}V_{\gamma}\leq\lambda_{\gamma}V_{\gamma}+b_{\gamma}I_{% \{C_{\gamma}\}}, (33)

where b_{\gamma}<\infty, 0\leq\lambda_{\gamma}<1, V_{\gamma} is a function such that \pi-almost surely 1\leq V_{\gamma}<\infty , and I_{\{C_{\gamma}\}} is an indicator function of a small set C_{\gamma} (that is, for all x\in C_{\gamma}, (32) holds).

[label*=(A0)]
1. 3.

Local simultaneous geometric drift. For every \gamma\in\Gamma, there exists a measurable function 1\leq V_{\gamma}<\infty, a small set C_{\gamma} and an open neighbourhood B_{\gamma} such that

(a)

C_{\gamma} is a uniform small set for \hat{\gamma}\in B_{\gamma}, i.e., (32) holds for all \hat{\gamma}\in B_{\gamma} and x\in C_{\gamma};

(b)

for all \hat{\gamma}\in B_{\gamma},

 \displaystyle P_{\hat{\gamma}}V_{\gamma}\leq\tilde{\lambda}_{\gamma}V_{\gamma}% +\tilde{b}_{\gamma}I_{\{C_{\gamma}\}} (34)

for some \tilde{b}_{\gamma}<\infty and \tilde{\lambda}_{\gamma}<1.

###### Theorem 10.

Assume 2 for the RSGS kernels P_{p}, p\in\Delta_{s-1}^{\varepsilon}=\Gamma. Then P_{p} is geometrically ergodic for each p\in\Delta_{s-1}^{\varepsilon} and satisfies the local simultaneous drift condition 3.

Proof of Theorem 10. Since for reversible \pi-irreducible chains, geometric ergodicity and existence of L_{2}-spectral gap are equivalent (see Theorem 2 of [41]), the first statement follows from Theorem 6.

Let (\lambda_{p},V_{p},b_{p},C_{p}) be the drift conditions that satisfy (33). For every selection probability vecor p=(p_{1},..,p_{s}) let m=m(p)=\underset{i\in\{1,..,s-1\}}{\min}p_{i}. Define norm |p|=\underset{i\in\{1,..,s-1\}}{\max}\left|p_{i}\right| and take \delta=\delta(p)>0 such that (1+(s-1)\delta)\lambda_{p}\leq 1. Set \tilde{\lambda}_{p}=(1+(s-1)\delta)\lambda_{p}. Then for every \hat{p} such that \left|\hat{p}-p\right|\leq m\delta,

 \displaystyle P_{\hat{p}}V_{p}=\sum_{i=1}^{s}\hat{p}_{i}Pr_{i}V_{p}\leq\sum_{i% =1}^{s}(p_{i}+(s-1)m\delta)Pr_{i}V_{p}\leq(1+(s-1)\delta)\sum_{i=1}^{s}p_{i}Pr% _{i}V_{p}= \displaystyle=(1+(s-1)\delta)P_{p}V_{p}\leq\tilde{\lambda}_{p}V_{p}+(1+(s-1)% \delta)b_{p}I_{\{C_{p}\}},

where we used representation (1) for P_{p} and the bound

 |p_{s}-\hat{p}_{s-1}|\leq\sum_{i=1}^{s-1}|p_{i}-\hat{p}_{i}|\leq(s-1)m\delta.

We are left to show that the condition (a) of 3 is satisfied. Indeed, fix any probability vector p\in\Delta_{s-1}^{\varepsilon}. Since C_{p} is a small set, P^{m}_{p}(x,\cdot)\geq\delta_{0}\nu(\cdot) for some m\geq 1, \delta_{0}>0, some probability measure \nu and all x\in C_{p}. Then for all \hat{p}\in\Delta_{s-1}^{\varepsilon},

 P^{m}_{\hat{p}}(x,\cdot)\geq\left(\frac{\varepsilon}{\underset{i\in\{1,..,s\}}% {\max}\ p_{i}}\right)^{m}P_{p}^{m}(x,\cdot)\geq\left(\frac{\varepsilon}{% \underset{i\in\{1,..,s\}}{\max}\ p_{i}}\right)^{m}\delta_{0}\nu(\cdot),

whence the condition (a) follows.
\square

In order to derive the ergodicity of the ARSGS, we will need the following crucial consequence of the assumption 3.

###### Theorem 11.

Assume that \Gamma is compact in some topology and that the collection of Markov kernels P_{\gamma} satisfy 3. Then there exists a finite partition of \Gamma into k sets F_{i} such that

 \cup_{i=1}^{k}F_{i}=\Gamma,

and simultaneous geometric drift conditions 7 and 1 hold inside F_{i} with coefficients (\lambda,V_{i},b,C_{i}), where 0\leq\lambda<1, 1\leq V_{i}<\infty, \pi-a.s., b<\infty, and C_{i} is the uniform small set for \gamma\in F_{i}.

Proof of Theorem 11. Notice,

 \Gamma\subset\cup_{\gamma\in\Gamma}B_{\gamma},

where B_{\gamma} is an open neighbourhood of \gamma as in the assumption 3. Since every open coverage of a compact set \Gamma has a finite subcoverage (see, e.g., Theorem 6.37 of [25]), there exist a finite number of B_{\gamma} that cover \Gamma, say B_{\gamma_{1}},..,B_{\gamma_{k}}. Then one can take \lambda:=\max\{\lambda_{\gamma_{1}},..,\lambda_{\gamma_{k}}\}, F_{i}:=B_{\gamma_{i}}, b=\max\{b_{\gamma_{1}},..,b_{\gamma_{k}}\}, C_{i}:=C_{\gamma_{i}}.
\square

[label*=(A0)]
1. 4.

Assumption 3 holds and for a chosen set B\in\mathbb{R}^{d}, and the corresponding drift functions V_{\gamma},\gamma\in\Gamma are bounded on B, i.e.,
\underset{x\in B}{\sup}V_{\gamma}(x)<\infty.

We are now ready to state the main ergodicity result.

###### Theorem 12.

Fix a measurable set B\subset\mathcal{X}. Assume \Gamma is compact in some topology and let P_{\gamma}, \gamma\in\Gamma be a collection of \pi-irreducible, aperiodic Markov kernels with a common stationary distribution \pi. Consider an AMCMC Algorithm 12, where the adaptations are allowed to take place only when the adaptive chain \{X_{n}\} visits B. Let the conditions 1, 3 and 4 hold and assume that for a starting location (X_{0},\gamma_{0}) of the adaptive chain, \mathbb{E}_{(X_{0},\gamma_{0})}V_{\gamma_{0}}(X_{0})<\infty, where V_{\gamma_{0}} is the drift function for the initial kernel P_{\gamma_{0}}. Then the adaptive chain \{X_{n}\} produced by the Algorithm 12 is ergodic.

{algorithm}

[H] Modified AMCMC Set some initial values for X_{0}\in\mathcal{X}; \gamma_{0}\in\Gamma; \overline{\gamma}:=\gamma_{0}; k:=1; n:=0. Fix any measurable set B\subset\mathcal{X}.
Beginning of the loop

[label=0.,ref=0]
1. 1.

sample X_{n+1}\sim P_{\overline{\gamma}}(X_{n},\cdot);

2. 2.

given \{X_{0},..,X_{n+1},\gamma_{0},..,\gamma_{n}\} update \gamma_{n+1} according to some adaptation rule;

3. 3.

If X_{n+1}\in B, \overline{\gamma}:=\gamma_{n+1}.

Go to Beginning of the loop

Proof of Theorem 12. Since we assume the diminishing adaptation condition 1, the proof follows once we establish the containment 2.

Theorem 11 yields there exists a finite partition \{F_{i}\}_{i=1}^{k} such that

 \cup_{i=1}^{k}F_{i}=\Gamma,

where simultaneous geometric drift conditions hold within every F_{i} with some drift coefficients (\lambda,V_{i},b,C_{i}) (as in Theorem 11).

On \Gamma define a function r such that r(p)=j if \gamma\in F_{j}. 4 yields there exists M<\infty such that V_{i}(x)<M for x\in B, i=1,..,k.

As in the proof of Theorem 3 of [8], to verify the containment condition, it suffices to prove that

 \sup_{n}\mathbb{E}[V_{r(\gamma_{n})}(X_{n})]<\infty,

where hereafter \mathbb{E}=\mathbb{E}_{(X_{0},\gamma_{0})} is the expectation with respect to the probability measure generated by the adaptive chain started from (X_{0},p^{0}).

Drift condition (34) implies

 \displaystyle\mathbb{E}\left[V_{r(\gamma_{n+1})}(X_{n+1})|X_{n}=x,\gamma_{n}=% \gamma\right]= \displaystyle=\mathbb{E}\left[V_{r(\gamma_{n+1})}(X_{n+1})I_{\{X_{n+1}\notin B% \}}|X_{n}=x,\gamma_{n}=\gamma\right]+ \displaystyle+\mathbb{E}\left[V_{r(\gamma_{n+1})}(X_{n+1})I_{\{X_{n+1}\in B\}}% |X_{n}=x,\gamma_{n}=\gamma\right]\leq \displaystyle\leq\mathbb{E}\left[V_{r(\gamma_{n+1})}(X_{n+1})I_{\{X_{n+1}% \notin B\}}|X_{n}=x,\gamma_{n}=\gamma\right]+M= \displaystyle=\mathbb{E}\left[V_{r(\gamma_{n})}(X_{n+1})I_{\{X_{n+1}\notin B\}% }|X_{n}=x,\gamma_{n}=\gamma\right]+M\leq \displaystyle\leq\mathbb{E}\left[V_{r(\gamma)}(X_{n+1})|X_{n}=x,\gamma_{n}=% \gamma\right]+M= \displaystyle=P_{r(\gamma)}V_{r(\gamma)}(x)+M\leq\lambda V_{r(\gamma)}(x)+b+M,

where in the first inequality we used the condition 4 and in the last one we used the fact that \gamma_{n+1}=\gamma_{n}, if X_{n+1}\notin B. Here 0<\lambda<1 and b<\infty are as in Theorem 11. Integrating out \gamma_{n} and X_{n} leads to

 \mathbb{E}\left[V_{r(\gamma_{n+1})}(X_{n+1})\right]\leq\lambda E\left[V_{r(% \gamma_{n})}(X_{n})\right]+b+M,

implying (see Lemma 2 of [39]),

 \sup_{n}\mathbb{E}[V_{r(\gamma_{n})}(X_{n})]\leq\max\left\{\mathbb{E}V_{r(% \gamma_{0})}(X_{0}),\frac{b+M}{1-\lambda}\right\}<\infty.

\square

The reader can easily see that Theorem 12 can be applied to a modified version of the ARSGS Algorithm 13, where the adaptations are allowed to happen only when the adaptive chain hits a set B that satisfies 4.

###### Theorem 13.

Fix a measurable set B\in\mathbb{R}^{d}. Consider an Adaptive Random Scan Gibbs Sampler (ARSGS) Algorithm 13 that produces a chain \{X_{n}\} for which the selection probabilities p^{n+1} are allowed to be changed only if X_{n+1}\in B (i.e., p^{n+1}=p^{n} if X_{n+1}\notin B).
Let the assumption 2 hold. Then the assumption 3 holds.
Let also 4 be satisfied for the set B and assume that for the starting location (X_{0},p^{0}) of the adaptive chain, \mathbb{E}_{(X_{0},p^{0})}V_{p^{0}}(X_{0})<\infty, where V_{p^{0}} is the drift function for the initial kernel P_{p^{0}}. Then the adaptive chain \{X_{n}\} produced by the ARSGS Algorithm 13 is ergodic.

Proof of Theorem 13. Since \Delta_{s-1}^{\varepsilon} is closed and bounded, it is compact (see Heine-Borel Theorem in [25]). Theorem 10 implies that 3 holds. The diminishing adaptation condition 1 holds since \underset{i\in\{1,..,s\}}{\max}|p_{i}^{n+1}-p_{i}^{n}|\to 0 as n\to\infty, by the construction of the ARSGS Algorithm 5. Therefore, we are in a position to apply Theorem 12 to derive the desired ergodicity of the adaptive chain.
\square

Remarks

[label*=0)]
1. 1.

We do not have a proof that the ARSGS Algorithm 5 presented in Section 5 is ergodic. However, the modified Algorithm 13 is ergodic under the assumptions of Theorem 13. The only difference of the ergodic modification from the original version of the ARSGS is that we do not change the sampling weights p_{i} if the chain is not in the set B.

2. 2.

The idea of introducing the set B to an adaptive algorithm comes from the work of Craiu et al.[18], where the authors study stability properties (e.g., recurrence) of adaptive chains where the adaptations are allowed to occur only in the set B.

3. 3.

Assumption 4 is satisfied for level sets B=B(N)=\cap_{i=1}^{k}\{x:V_{i}(x)<N\}, where V_{i} are the drift functions as in the proof of the theorem. Theorem 14.2.5. of [30] implies that for large N, B(N) covers most of the support of \pi, meaning that the adaptation will occur in most of the iterations of the Algorithm 13.

4. 4.

In practice often one can choose B to be any bounded set in \mathbb{R}^{d}.

{algorithm}

Adaptive Random Scan Gibbs Sampler (ergodic modification) Generate a starting location X_{0}\in\mathbb{R}^{d}. Fix a measurable set B\subset\mathbb{R}^{d}. Fix \frac{1}{s+1}>\varepsilon>0. Set an initial value of w^{0}=(w^{0}_{1},..,w^{0}_{s})\in\Delta^{\varepsilon}_{s}, generate random unit vectors z_{0},y_{0}\in\mathbb{R}^{d+1}. Define two sequences of non-negative numbers \left(b_{m}\right)_{m=1}^{\infty} and \left(a_{m}\right)_{m=1}^{\infty} such that \sum_{m=1}^{\infty}a_{m}=\infty, a_{m}\to 0 and b_{m}\to 0 as m\to\infty. Set i=0. Choose a sequence of positive integers \left(k_{m}\right)_{m=0}^{\infty}.
Beginning of the loop

[label*=0.]
1. 1.

n:=n+k_{i}. Run RSGS(p^{i}) for k_{i} steps;

2. 2.

Do Steps 2 - 4 of ARSGS Algorithm 5.

3. 3.

If current state of chain X_{n}\in B, then p^{i}:=\frac{w^{i}}{w^{i}_{1}+..+w^{i}_{s}};
Otherwise, p^{i}:=p^{i-1}.

Go to Beginning of the loop

## 8 Simulations

It is known that for reversible Markov chains existence of the spectral gap is equivalent to geometric ergodicity (see Theorem 2 of [41]). Moreover, geometric ergodicity implies that the Central Limit Theorem holds (see, e.g., [12]) . The following theorem of Kipnis & Varadhan [27] states an important relation between the asymptotic variance in CLT and the spectral gap.

###### Theorem 14.

Assume that P_{p} is a RSGS kernel (1). Then the following upper bound holds, connecting notions of the asymptotic variance with the spectral gap:

 \sigma^{2}_{as}(f)\leq\frac{2-\mathrm{Gap}(p)}{\mathrm{Gap}(p)}Var_{\pi}(f), (35)

where Var_{\pi} denotes variance w.r.t. \pi and \mathrm{Gap} is the spectral gap of P_{p}. Moreover, if the spectrum of P_{p} is discrete, then the equality in (35) is attained on a second largest eigenfunction of P_{p}.

Theorem 14 states that by increasing the spectral gap one decreases the worst case asymptotic variance. Theorem 3 states that the second largest eigenfunction of the RSGS kernel for the normal target distribution is a linear function. Of course, for arbitrary distribution Theorem 3 is false. Nevertheless, we believe that comparing the maximum asymptotic variance over linear functions for the adatptive and non-adaptive algorithms is a reasonable thing to do. Define

 \displaystyle l_{i}=l_{i}(x)=\frac{x_{i}}{\sqrt{Var_{\pi}(x_{i})}} (36)

to be normalized linear functions depending on one coordinate only.

We compute the maximum asymptotic variance in CLT, \underset{i=1,..,d}{\max}\ \sigma^{2}_{as}(l_{i}), for the adative and vanilla RSGS. We hope that in certain situations the ratio between the estimated pseudo-spectral gaps is close to the ratio of the maximum asymptotic variances over l_{i} as follows from Theorem 14. We study also how the pseudo-optimal weights affect the autocorrelation function (ACF) of l_{i}.

Three different examples are studied where we implement coordinate-wise ARSGS, ARWMwAG and their non-adaptive versions. Two of the examples are in moderate dimension 50: sampling from the posterior in a Poisson Hierarchical Model (PHM) and sampling form the Truncated Multivariate Normal (TMVN) distribution. We also consider sampling from a posterior in a Markov Switching Model (MSM) in 200-dimensional space.

All the asymptotic variances are obtained using the batch-means estimator (see [26] and [11]). Below we outline settings for every problem.

### Poisson Hierarchical Model

Gibbs Sampler arises naturally for Hierarchical Models, where our goal is to sample from a posterior distribution. In the present model data Y_{i} comes from the Poisson distribution with intensity \lambda_{i}:

 \displaystyle Y_{i}\sim{\rm Poisson}(\lambda_{i}),i=1,..,n, (37)

where

 \displaystyle\lambda_{i}=\exp\left(\sum_{j=1}^{d}x_{ij}\beta_{j}\right), (38)

with \beta=(\beta_{1},..,\beta_{d}) being the parameter of interest. Stress that here d is the dimensionality of the problem and n is the number of observations. We set d=50, n=100.

Gibbs sampling through the adaptive rejection sampling presented by Gilks & Wild in [23] is utilised for this problem. See [19] for details and formulas for the full conditionals.

We fix the true parameter \beta_{0}:=(1,..,1) and take the prior distribution on \beta to be normal with mean (-1,..,-1) and variance matrix I_{d}. We consider two different examples of the design matrix X={(x_{ij})_{i=1}^{n}}_{j=1}^{d}.

Design matrix X^{(1)} is formed as follows. First, we set all the elements to be zero. Let k=\frac{n}{d}=2. Then, we form two upper blocks of ones: X^{(1)}_{ij}=1 for i\in\{1,..,2k\}, j\in\{1,2\}, and X^{(1)}_{ij}=1 for i\in\{2k+1,..,5k\},\ j\in\{3,4,5\}. Now there are at least two blocks of correlated variables in the posterior. For every other variable \beta_{j}, j=5,..,d, set X^{(1)}_{ij}=1 for i\in\{jk+1,..,(j+1)k\}. In order to enforce dependency between all variables, we perturb every entry of X^{(1)}: X^{(1)}_{ij}=x^{(1)}_{ij}+0.1\xi_{ij}, where \xi_{ij} are independent beta distributed variables with parameters (0.1,0.1).

The second design matrix X^{(2)} is formed as follows. For i=1,..,n and j=1,..,d set

 X^{(2)}_{ij}=0.3\left(\delta_{ij}+\frac{\xi_{i}}{i}\right),

where \delta_{ij} is the Kronecker symbol and \xi_{i} are i.i.d. beta distributed with parameters (0.1,0.1).

Remark. Correlation matrix of the posterior with the design matrix X^{(1)} has blocks of highly correlated coordinates, whereas in case of the design matrix X^{(2)}, the correlation matrix seems to have only moderate non-diagonal entries.

### Truncated Multivariate Normal Distribution

Gibbs sampler is a natural algorithm to sample from the TMVN distribution as suggested by [21]. We consider linear truncation domain b_{1}\leq Ax\leq b_{2} where x\in\mathbb{R}^{d}, A is some d\times d matrix. [21] suggested to transform the underlying normal distribution so that one needs to sample from N(0,\Sigma_{0}) truncated to a rectangle c_{1}\leq x\leq c_{2}.

We set c_{1}=(1,..,1)\in\mathbb{R}^{d}, c_{2}=(3,..,3)\in\mathbb{R}^{d} and generate two different covariance matrices \Sigma_{0}.

 \Sigma^{(1)}_{0}={\rm Corr}\left(0.01I_{d}+v_{1}v_{1}^{\mathtt{T}}\right),
 \Sigma^{(2)}_{0}={\rm Corr}\left(0.01I_{d}+v_{2}v_{2}^{\mathtt{T}}\right),

where v_{1}=(\xi_{1},..,\xi_{d}), v_{2}=\left(\frac{\xi_{1}}{\log(2)},\frac{\xi_{2}}{\log(3)},..,\frac{\xi_{d}}{% \log(d+1)}\right), and \xi_{i} are i.i.d. beta distributed with parameters (0.1,0.2). Here {\rm Corr}(M) denotes a correlation matrix that corresponds to M.

Note that the truncation domain does not contain the mode of the distribution, making it very different from the non-truncated normal distribution.c

### Markov Switching Model

Let x_{1:i}=(x_{1},..,x_{i}). We consider a version of stochastic volatility model where the underlying chain may be in either high or low volatility mode. Namely, the latent data X_{i} forms an AR(1) process:

 X_{i}\sim N(X_{i-1},\sigma^{2}_{r(i)}),

where the chain can be in one of the two volatility regimes r(i)\in\{0,1\}. r(i) itself forms a Markov chain with a transition matrix

 \left(\begin{array}[]{cc}1-a_{1}&a_{1}\\ a_{2}&1-a_{2}\end{array}\right).

Here a_{1} and a_{2} are called switching probabilities and assumed to be known. The observed data Y_{i} is normally distributed:

 Y_{i}\sim N(X_{i},\beta^{2})

with known variance \beta^{2}. We consider data of n=100 observations and aim to sample from the posterior

 X_{1},..,X_{n},r(1),..,r(n)|Y_{1},..,Y_{n}.

Since n=100, the total number of parameters d=200. We fix a_{1}=0.001 and a_{2}=0.005.

The underlying hidden Markov chain (X_{i},r(i)) is obtained as follows. We start chain r(i) at it’s stationary distribution and X_{0}\sim N(0,\sigma^{2}_{r(0)}). We then randomly generate chain r(i) so that there are two switchings occur. Thus we obtain r(i)=1 for i=57,..,79, and r(i)=0, otherwise.

We consider data for 3 different combinations of \sigma^{2}_{0},\sigma^{2}_{1}, and \beta^{2}:

[label=()]
1. 1.

\sigma^{2}_{0}=1, \sigma^{2}_{1}=10, \beta^{2}=1;

2. 2.

\sigma^{2}_{0}=1, \sigma^{2}_{1}=10, \beta^{2}=3;

3. 3.

\sigma^{2}_{0}=1, \sigma^{2}_{1}=5, \beta^{2}=1.

Full conditionals may be obtained and are easy to sample from. We shall demonstrate performance of the coordinate-wise Gibbs Sampler for this problem.

One might notice that the even and odd blocks of the coordinates can be updated simultaneously, meaning that coordinate-wise updates might be suboptimal. However, we are not motivated to find the best algorithms, but rather to demonstrate that the ARSGS can provide speed up even when the target distribution is discrete. Our intuition is such that the variables around the switching points will mix much slower, meaning that the ARSGS should update those coordinates more frequently.

### 8.1 Adaptive Random Scan Gibbs Sampler

We implement the ARSGS Algorithm 5 for all the examples. To do so, we need to specify a number of parameters in the algorithm. Since in the Euclidean space \mathbb{R}^{d} distance from the origin to a simplex is \frac{1}{\sqrt{n}}, we find it reasonable to choose a_{m}=\frac{\log(50\sqrt{d}+m)}{50\sqrt{d}+m}. In fact, one may choose arbitrary positive constant instead of 50. We do not know the right scaling for b_{m} and thus set b_{m}=a_{m}.

We set the lower bound \varepsilon:=\frac{1}{d^{2}}. The choice of \varepsilon is motivated by Theorem 6, where it is shown that the maximum improvement of the pseudo-spectral gap is d (dimensionality of the space), which can only happen when one of the coordinates gets all of the probability mass. With the above choice of \varepsilon, the maximum probability mass that coordinate can get is 1-\frac{(d-1)}{d^{2}}, meaning that we might not be able to identify the optimal selection probabilities. On the other hand, the pseudo-spectral gap that corresponds to the selection probabilities obtained by the adaptive algorithm (with the specified value of \varepsilon=\frac{1}{d^{2}}) will be close to the optimal value.

We choose the sequence k_{m} to be k_{m}=5000 for PHM and TMVN examples, and k_{m}=8\times 10^{5} for MSM. We discuss an effective choice of k_{m} in Section 8.3.

We run the coordinate-wise ARSGS and the vanilla RSGS to obtain 5 million samples with 50 iterations thinning (i.e., we record every 50-th iteration of the chain) in PHM and TMVN examples. Whereas, for the MSM example thinning is 8000 and number of samples is 10 million in cases 1, 2, and 30 million in the case 3.

### Poisson Hierarchical Model

For this example, we show that in a long run the ARSGS not only outperforms the vanilla RSGS but performs similarly to the RSGS with the pseudo-optimal weights (that are estimated from the adaptive chain run).

The data Y_{i} is generated separately for the design matrices X^{(1)} and X^{(2)}. We summarize results in Tables 8 and 8, respectively. In a view of the von Mises theorem (see [45]), we expect the ARSGS to work well in both examples. One can observe reduction in the maximum asymptotic variance over the linear functions (36) by 9.27 and 6.97 times respectively.

In the example with the design matrix X^{(1)}, the correlation between the 1st and 2nd coordinate is -0.98 and they are both nearly uncorrelated with the other coordinates. However, the corresponding optimal weights are such that p^{\mathrm{opt}}_{1}\approx p^{\mathrm{opt}}_{2}\approx 0.085. The most of the probability mass is put on the coordinate 5: p^{\mathrm{opt}}_{5}\approx 0.29. The maximum correlation of the coordinate 5 with other directions is at most 0.57 in absolute value. In fact, excluding coordinates 1, 2 and 5, all the off-diagonal correlations do not exceed 0.57 in absolute value.

For the second example with the design matrix X^{(2)}, all the correlations are less than 0.21. In some sense this example is consistent with the toy Example LABEL:example2 from Section 4. Here p^{\mathrm{opt}}_{1}\approx 0.188, whereas all other optimal selection probabilities are in a range between 0.001 and 0.01.

Note that the RSGS with the optimal weights performs nearly the same as the adaptive counterpart. For each design matrix ACF plots are produced for two coordinates with high and low optimal weights in Figures 3 and 3.

Empirically, we observe that the adaptive algorithm tries to allocate the selection probabilities in such a way that the effective number of independent samples (effective sample size) for every direction is the same. Hence, all the coordinates have about the same autocorrelation function (see Figure 8 and 8). Note that the autocorrelation changes proportionally to the reduction in the asymptotic variance.

\captionof

tablePHM. Gibbs (d=50). Example 1 \arraybackslash \arraybackslash1/\mathop{\mbox{$\rm P$-$\rm Gap$}} \arraybackslash\underset{i}{\max}\ \sigma^{2}_{as}(l_{i}) \arraybackslashvanilla \arraybackslash13435 \arraybackslash482 \arraybackslashadaptive \arraybackslash1355 \arraybackslash52 \arraybackslashoptimal \arraybackslash- \arraybackslash54 \arraybackslash
\arraybackslash9.9 \arraybackslash9.27

\captionof

tablePHM. Gibbs (d=50). Example 2 \arraybackslash \arraybackslash1/\mathop{\mbox{$\rm P$-$\rm Gap$}} \arraybackslash\underset{i}{\max}\ \sigma^{2}_{as}(l_{i}) \arraybackslashvanilla \arraybackslash7340 \arraybackslash272 \arraybackslashadaptive \arraybackslash919 \arraybackslash39 \arraybackslashoptimal \arraybackslash- \arraybackslash40 \arraybackslash
\arraybackslash7.97 \arraybackslash6.97

Remark. We use the Adaptive Rejection Sampling algorithm (see [23]) in order to sample from the full conditionals. Since the normalising constant is not known in this case, we could not establish geometric ergodicity of the RSGS in this case. On the other hand, results of Latuszynski et al. [28] ensure that the RWMwG is geometrically ergodic. Since typically the RSGS converges faster than the corresponding RWMwG, we suggest that the RSGS is also geometrically ergodic for the Poisson Hierarchical Model. This means that heuristically the modified ARGS Algorithm 13 is ergodic in the current settings.

### Truncated Multivariate Normal Distribution

For the first correlation matrix \Sigma_{0}^{(1)} the reduction in the maximum asymptotic variance over the linear functions(36) is 3.44, which is surprisingly very close to the ratio of the pseudo-spectral gaps (Table 8). The autocorrelation plot of 2nd and 47th coordinates is in Figure 4. The same effect of keeping the same autocorrelations for all the coordinates is observed.

\captionof

tableTMVN (d=50). Example 1 \arraybackslash \arraybackslash1/\mathop{\mbox{$\rm P$-$\rm Gap$}} \arraybackslash\underset{i}{\max}\ \sigma^{2}_{as}(l_{i}) \arraybackslashvanilla \arraybackslash6449 \arraybackslash239 \arraybackslashadaptive \arraybackslash1857 \arraybackslash72 \arraybackslash
\arraybackslash3.47 \arraybackslash3.32

For the second correlation matrix \Sigma_{0}^{(2)}, the improvement of the asymptotic variance is only half of the improvement of the spectral gap, as seen from Table 4. However, we still observe that the ARSGS assigns more weight to the coordinates that mix slower.

\captionof

tableTMVN (d=50). Example 2 \arraybackslash \arraybackslash1/\mathop{\mbox{$\rm P$-$\rm Gap$}} \arraybackslash\underset{i}{\max}\ \sigma^{2}_{as}(l_{i}) \arraybackslashvanilla \arraybackslash467 \arraybackslash12.6 \arraybackslashadaptive \arraybackslash161 \arraybackslash8.3 \arraybackslash
\arraybackslash2.9 \arraybackslash1.5

Remark. In this example, the RSGS kernels (1) satisfy the uniform minorisation condition 7. The corresponding small set is the whole domain, because it is compact. The diminishing adaptation condition 1 holds by construction. Therefore, the ARSGS algorithm is ergodic by virtue of Theorem 4.2 [28].

### Markov Switching Model

Note that half of the coordinates in the target distribution are discrete. The naive estimator (16) of the covariance structure is often singular (i.e., non-invertible) in these settings. Therefore, to implement the ARSGS, in Step 2 of the Algorithm 5, we use a perturbed naive estimator \widehat{\Sigma}_{n}+\frac{1}{d^{3}}I_{d}, where d=200 is dimensionality of the target distribution. We did not observe any significant impact of the added perturbation on the estimated optimal selection probabilities.

We expect that if the dependency structure is described by the correlations (i.e., zero correlation implies weak in some sense dependency), then the ARSGS shall outperform the vanilla RSGS. We observe that this happens in cases 1 and 2. More precisely, the ARSGS tends to put more weight on coordinates that have larger asymptotic variance and results are found in Table 8, where the improvement of the pseudo spectral gap for each case is presented, and in Table 8, where the corresponding improvement in asymptotic variance over the linear functions (36) is presented. Notice, in all cases the maximum asymptotic variance \underset{i}{\max}\ \sigma^{2}_{as}(l_{i}) is attained for the coordinate i that corresponds to some discrete direction r(i).

\captionof

tableMSM (d=200). 1/\mathop{\mbox{$\rm P$-$\rm Gap$}} \arraybackslash \arraybackslash(a) \arraybackslash(b) \arraybackslash(c) \arraybackslashvanilla \arraybackslash18875 \arraybackslash71106 \arraybackslash117127 \arraybackslashadaptive \arraybackslash3450 \arraybackslash9924 \arraybackslash19301 \arraybackslash
\arraybackslash5.47 \arraybackslash7.17 \arraybackslash6.07

\captionof

tableMSM (d=200). \underset{i}{\max}\ \sigma^{2}_{as}(l_{i}) \arraybackslash \arraybackslash(a) \arraybackslash(b) \arraybackslash(c) \arraybackslashvanilla \arraybackslash21.7 \arraybackslash45.7 \arraybackslash197 \arraybackslashadaptive \arraybackslash6 \arraybackslash12.6 \arraybackslash204 \arraybackslash
\arraybackslash3.6 \arraybackslash3.63 \arraybackslash0.97

The case 3 is special in a sense that the correlation structure does not reveal the dependency structure. Here the maximum selection probability is assigned to the coordinate that corresponds to the variable r(99). The asymptotic variance for the corresponding linear function l_{i} drops roughly by 4.5 times from 40.56 to 9.06. However, the maximum asymptotic variance over the linear functions (36) is attained on the coordinate i that corresponds to r(64). The estimated optimal weight p^{\mathrm{opt}}_{i} corresponding to r(64) is only slightly larger than the uniform weight 1/200.

To justify convergence of the ARSGS and the results in Tables 8 and 8, we provide a proof of the geometric ergodicity.

###### Proposition 15.

In cases 1, 2 and 3, the RSGS for the Markov Switching Model described above is geometrically ergodic, i.e., satisfies 2.

As before, we sample from the same PHM in \mathbb{R}^{50}. We consider the same algorithm settings for the ARWMwAG Algorithm 6 as for the ARSGS Algorithm 5. We compare performance of the Random Walk Metropolis-within-Gibbs algorithm with it’s adaptive versions ARWMwG and ARWMwAG.

For the RWMwG the proposal variances \beta_{i} are chosen to be ones. For demonstration purposes, the parameter q of the adaptive versions of the algorithm, is chosen to be q:=1.

### Poisson Hierarchical Model

Tables 8 and 8 provide the analysis of the asymptotic variances. For the first design matrix X^{(1)}, we observe a 7 times improvement of the ARWMwAG over the ARWMwG algorithm, and the total improvement of almost 15 times over the non-adaptive RWMwG. For the second design matrix X^{(2)}, the corresponding improvement is 6.1 and 12.3 times respectively. On Figure 7 we present the improvements to the ACF.

\captionof

tablePHM. MwG (d=50). Example 1 \arraybackslash \arraybackslash1/\mathop{\mbox{$\rm P$-$\rm Gap$}} \arraybackslash\underset{i}{\max}\ \sigma^{2}_{as}(l_{i}) \arraybackslashRWMwG (non-adaptive) \arraybackslash \arraybackslash1993 \arraybackslashARWMwG (partially adaptive) \arraybackslash13244 \arraybackslash971 \arraybackslashARWMwAG (fully adaptive) \arraybackslash1376 \arraybackslash138 \arraybackslash\frac{\mbox{partially adaptive}}{\mbox{fully adaptive}} \arraybackslash9.63 \arraybackslash7 \arraybackslash\frac{\mbox{non-adaptive}}{\mbox{fully adaptive}} \arraybackslash \arraybackslash14.45

\captionof

tablePHM. MwG (d=50). Example 2 \arraybackslash \arraybackslash1/\mathop{\mbox{$\rm P$-$\rm Gap$}} \arraybackslash\underset{i}{\max}\ \sigma^{2}_{as}(l_{i}) \arraybackslashRWMwG (non-adaptive) \arraybackslash \arraybackslash1276 \arraybackslashARWMwG (partially adaptive) \arraybackslash7461 \arraybackslash639 \arraybackslashARWMwAG (fully adaptive) \arraybackslash970 \arraybackslash104 \arraybackslash\frac{\mbox{fully adaptive}}{\mbox{partially adaptive}} \arraybackslash7.69 \arraybackslash6.14 \arraybackslash\frac{\mbox{fully adaptive}}{\mbox{non-adaptive}} \arraybackslash \arraybackslash12.27

Remark. The target distribution satisfies the Assumption 5.4 of [28]. If one chooses the proposal in Step 2 of RWMwG Algorithm 6 to be a mixture of normals, i.e., 0<q<1, or restricts the proposal variances \beta_{i} to be in some interval [c_{1},c_{2}], \infty>c_{2}>c_{1}>0, then the ARWMwG and ARWMwAG is ergodic by the virtue of Theorem 5.5 of [28].

### 8.3 Computational cost of the adaptation

By doing the adaptations of an MCMC algorithm we increase the total running time of the algorithm. Complexity of the projection Algorithm 5 is bounded by the complexity of a sorting algorithm used in Step 4, which is usually of order \mathcal{O}(d\log(d)). Thus one can easily see that the total adaptation cost of Steps 2 - 4 of the ARSGS Algorithm 5 is bounded by the complexity of the Step 2, which requires finding the diagonal blocks of the inverted covariance matrix. Usual Gauss matrix inversion is of order \mathcal{O}(d^{3}). In high dimensional settings it is an expensive procedure. However, one can choose the sequence k_{m} in the setting of the ARSGS Algorithm 5 in order to make the adaptation cost negligible comparing to the sampling Step 1.

Turn back to the Poisson Hierarchical Model example with the design matrix X^{(1)}. The sequnce k_{m} was chosen to be k_{m}=5000. In column 2 of Table 8.3 we put the average real time in seconds spent on sampling Step 1 of the ARSGS and ARWMwAG algorithms. The average time spent for one adaptation (i.e., to perform Steps 2 - 4 of the ARSGS Algorithm 5 is in column 3. The maximum asymptotic variance over the linear functions (36) is in column 1.

\captionof

tablePHM. Example 1 (d=50) \arraybackslash \arraybackslash\underset{i}{\max}\ \sigma^{2}_{as}(l_{i}) \arraybackslashCost per 5000 iterations \arraybackslashCost of adaptation \arraybackslashARSGS \arraybackslash52 \arraybackslash0.37 \arraybackslash0.0025 \arraybackslashARWMwAG \arraybackslash138 \arraybackslash0.028 \arraybackslash0.0025

Gibbs Sampling for the PHM requires the use of the adaptive rejection sampling (see [19, 23]), which significantly increases the time needed to obtain a sample. Therefore, even though the ARSGS has 2.65 times lower asymptotic variance than the ARWMwAG algorithm, it samples more than 10 times slower.

By adjusting the sequence k_{m} one can tune the ratio of the adaptation time over the sampling time. In fact, the sampling and adaptations can be performed independently in a sense that they may be computed on different CPUs as demonstrated in the Algorithm 8.3.

{algorithm}

Parallel versions of ARSGS and ARWMwAG Set all initial parameters for the ARSGS \Big{(}ARWMwAG \Big{)}.
Do on different CPUs:

• n=n+k_{i}. p^{i}=\frac{w^{i}}{w^{i}_{1}+..+w^{i}_{s}} Run RSGS(p^{i}) \Big{(}or ARWMwG(p^{i})\Big{)} for k_{i} steps.

• Do steps the steps 2 - 4 of ARSGS based on available chain output.

Wait till both CPUs finish their jobs. Then iterate the procedure.

## 9 Discussion

We have analysed ergodicity property of the adaptive algorithms in Section 7. We have developed a concept of the local simultaneous drift condition 3. We have shown in Theorem 10 that the condition is natural for the ARSGS. Under this condition, in Theorem 12 we have established ergodicity of modified AMCMC Algorithms 12. In particular, in Theorem 13 we have proved ergodicity of the modified ARSGS Algorithm 13.

In order to establish convergence in Theorem 13, we do not require the sequence of estimated optimal sampling probabilities weights p^{n} to converge at all. Instead, we require only the diminishing condition to hold, i.e., |p_{n}-p_{n-1}|\overset{P}{\rightarrow}0 as n\to\infty. In fact, it is not clear whether the estimated weights converge to the pseudo-optimal ones, even if one knows the target covariance matrix. Empirically, for numerous examples, we have observed that the adapted selection probabilities do converge to a unique solution, where the uniqueness is guaranteed by Theorem 5.

Open problem. Assume that the covariance matrix of the target distribution is known, i.e., \widehat{\Sigma}_{n}=\Sigma for all n. Prove that the estimated weights p^{i} in the ARSGS algorithm converge to the pseudo-optimal weights p^{\mathrm{opt}}.

We emphasise that there is no universal algorithm to optimise the pseudo-spectral gap function (18), given that the covariance structure \Sigma is unknown.

Various other modification of the ARSGS algorithm are possible. For instance, we can think of using some other optimisation algorithm instead of the subgradient method (described in Algorithm 5) in order to estimate the pseudo-optimal weights (10). Also, the user may know the structure of the covariance matrix \Sigma in advance, so that the naive estimator (16) could be improved. For example, if the covariance matrix is banded, a more efficient threshold estimator should be used (see [14]).

In Section 5.1 of [16] we introduce a modified (Air) version of the ARSGS Algorithm 13, for which we prove the SLLN and the Mean Squared Error convergence under the local simultaneous geometric drift assumption 3. If, additionally, the sequence of adapted selection probabilities p^{n} converges, we derive the CLT.

## 10 SUPPLEMENTARY MATERIAL

Proof of Lemma 1. The proof is a modification of Theorem 1 of [2], and thus we outline only the key points.

One can easily check that \{H_{\alpha}(Kx)\} form an orthonormal system in L_{2}(\mathbb{R}^{d},\pi) using the definition. From Theorem 6.5.3 of [4], it follows that one dimensional Hermite polynomials h_{k} form a complete orthogonal basis of L_{2}(\mathbb{R},\exp(-x^{2}/2)), implying \{H_{\alpha}(Kx)\} form an orthogonal basis of L_{2}(\mathbb{R}^{d},\pi).

For c\in\mathbb{R}^{d}, define a generating function

 \displaystyle f_{c}(x):=\sum_{\alpha}c^{\alpha}\frac{H_{\alpha}(Kx)}{\sqrt{% \alpha!}}, (39)

where c^{\alpha}:=c_{1}^{\alpha_{1}},..,c_{d}^{\alpha_{d}} and 0^{0}:=1.

From Section 4.2.1 of [42], we know that the generating function can be represented as

 f_{c}(x)=\exp\left(\left-\frac{\|c\|^{2}}{2}\right).

Recall that Pr_{i} stands for full conditional update of x_{i}=(x_{i1},..x_{ir_{i}}) from its full conditional. For functions f\in L_{2}(\mathbb{R}^{d},\pi), let

 (Pr_{i}f)(x):=\int f(y_{i},x_{-i})\pi(y_{i}|x_{-i})dy_{i}.

Define T^{(i)}:=I-KD_{i}K. Note that T^{(i)}=\left(T^{(i)}_{ij}\right)_{i,j=1}^{d} is a d-dimensional matrix.

The key property to prove the first part of the lemma is the following statement that can be obtained via direct calculations.

###### Lemma 16.

For all c\in\mathbb{R}^{d}

 (Pr_{i}f_{c})(x)=f_{T^{(i)}c}(x),

where f_{c} is defined in (39).

Let \Pi_{k} be the set of all sequences (\varepsilon_{1},..,\varepsilon_{k}) of length k with elements from \{1,..,d\}. Partition \Pi_{k} into equivalence classes R_{\alpha}, \alpha=(\alpha_{1},..,\alpha_{d})\in Z^{n}_{+}, |\alpha|=k, such that \varepsilon=(\varepsilon_{1},..,\varepsilon_{k})\in R_{\alpha} if and only if the sequence \varepsilon has \alpha_{1} 1’s,.., \alpha_{d} d’s. In other words, R_{\alpha} forms a set of all permutations of the elements of (\varepsilon_{1},..,\varepsilon_{k}), implying that the number of elements in R_{\alpha} is |R_{\alpha}|=\frac{k!}{\alpha!}.

Lemma 16 implies

 \displaystyle Pr_{i}\left(\sum_{\alpha}c^{\alpha}\frac{H_{\alpha}(K\cdot)}{% \sqrt{\alpha!}}\right)(x)=\sum_{\alpha}(T^{(i)}c)^{\alpha}\frac{H_{\alpha}(Kx)% }{\sqrt{\alpha!}}. (40)

Fix \beta such that |\beta|=k. For each \alpha, |\alpha|=k, fix some representative \sigma=\sigma(\alpha)\in R_{\alpha}. Rewrite T^{(i)}c as

 T^{(i)}c=\left(T^{(i)}_{j1}c_{1}+..+T^{(i)}_{jd}c_{d}\right)_{j=1}^{d}.

Since c\in\mathbb{R}^{d} is arbitrary, the coefficient of c^{\beta} on both sides of (40) should coincide for all \beta\in Z^{d}_{+}, providing a formula for the image of H_{\beta}(Kx):

 \displaystyle Pr_{i}\left(\frac{H_{\beta}(K\cdot)}{\sqrt{\beta!}}\right)(x)=% \sum_{|\alpha|=k}\frac{1}{\sqrt{\alpha!}}\left(\sum_{\varepsilon\in R_{\beta}}% T^{(i)}_{\sigma_{1}\varepsilon_{1}}..T^{(i)}_{\sigma_{k}\varepsilon_{k}}\right% )H_{\alpha}(Kx).

Since \sigma=\sigma(\alpha) was chosen arbitrary, the above sum is equal to

 \displaystyle Pr_{i}\left(\frac{H_{\beta}(K\cdot)}{\sqrt{\beta!}}\right)(x)=% \sum_{|\alpha|=k}\frac{1}{|R_{\alpha}|\sqrt{\alpha!}}\left(\sum_{\varepsilon% \in R_{\beta},\ \sigma\in R_{\alpha}}T^{(i)}_{\sigma_{1}\varepsilon_{1}}..T^{(% i)}_{\sigma_{k}\varepsilon_{k}}\right)H_{\alpha}(Kx)= \displaystyle=\sum_{|\alpha|=k}\frac{\sqrt{\alpha!}}{k!}\left(\sum_{% \varepsilon\in R_{\beta},\ \sigma\in R_{\alpha}}T^{(i)}_{\sigma_{1}\varepsilon% _{1}}..T^{(i)}_{\sigma_{k}\varepsilon_{k}}\right)H_{\alpha}(Kx).

We conclude that

 \displaystyle Pr_{i}(H_{\beta}(K\cdot))(x)=\sum_{|\alpha|=k}\frac{\sqrt{\alpha% !}\sqrt{\beta!}}{k!}\left(\sum_{\varepsilon\in R_{\beta},\ \sigma\in R_{\alpha% }}T^{(i)}_{\sigma_{1}\varepsilon_{1}}..T^{(i)}_{\sigma_{k}\varepsilon_{k}}% \right)H_{\alpha}(Kx), (41)

implying the first part of the Lemma.

We are left to show that the maximum eigenvalue of P_{p} on S_{k} is non-increasing, revealing that the second largest eigenvalue of P_{p} is attained on S_{1}.

Recall that \{R_{\alpha}|\ \alpha\in Z^{n}_{+},\ |\alpha|=k\} form a partition of all possible sequences (\varepsilon_{1},..,\varepsilon_{k}), \varepsilon_{i}\in\{1,..,d\}. Moreover, as we have just seen, Pr_{i} is invariant on S_{k} (S_{k} is defined in the statement of the lemma), that is Pr_{i} acts like a matrix on S_{k}. (41) implies that Pr_{i} can be represented as

 Pr_{i}(H_{\beta}(K\cdot))(x)=\sum_{|\alpha|=k}\frac{1}{\sqrt{|R_{\alpha}||R_{% \beta}|}}\left(\sum_{\varepsilon\in R_{\beta},\ \sigma\in R_{\alpha}}T^{(i)}_{% \sigma_{1}\varepsilon_{1}}..T^{(i)}_{\sigma_{k}\varepsilon_{k}}\right)H_{% \alpha}(Kx),

implying that the matrix that corresponds to Pr_{i} consists of entries

 \frac{1}{\sqrt{|R_{\alpha}||R_{\beta}|}}\left(\sum_{\varepsilon\in R_{\beta},% \ \sigma\in R_{\alpha}}T^{(i)}_{\sigma_{1}\varepsilon_{1}}..T^{(i)}_{\sigma_{k% }\varepsilon_{k}}\right)

Thus we have shown that on S_{k}, P_{p} acts as a matrix with corresponding entries obtained as normalised block sums of

 F_{k}=\sum_{i=1}^{s}p_{i}\ (T^{(i)})^{\otimes k},

where (T^{(i)})^{\otimes k} is the k-th Kronocker product of T^{(i)} (i.e., the k-th tensor product, see [33], VIII.10).

The next statement is Lemma 1 from [2] and we do not prove it.

###### Lemma 17.

Let A be a non-negative definite r\times r matrix. Let R_{1},..,R_{q} be a partition of \{1,..,r\}. Define matrix B to be the q\times q matrix,

 B_{kl}=\frac{1}{\sqrt{|R_{k}||R_{l}|}}\sum_{i\in R_{k},\ j\in R_{l}}A_{ij}.

Then the maximum eigenvalue of B is less or equal than the maximum eigenvalue of A.

Lemma 17 shows that the maximum eigenvalue of P_{p} restricted to S_{k} is dominated by the maximum eigenvalue of F_{k}.

Rewrite F_{k+1} as a difference of two positive semi-definite operators

 F_{k+1}=F_{k}\otimes I-\sum_{i=1}^{s}p_{i}\left(T^{(i)}\right)^{\otimes k}% \otimes\left(I-T^{(i)}\right)

It follows that for all k\geq 0, F_{k}\otimes I\geq F_{k+1} (i.e., for all vectors x, \left<F_{k}\otimes Ix,x\right>\geq\left<F_{k+1}x,x\right>). Since \|F_{k}\|=\|F_{k}\otimes I\| (see [33], VIII.10), the largest eigenvalue of F_{k} (that is equal to largest eigenvalue of F_{k}\otimes I) is greater than the one of the operator F_{k+1}. Thus the largest eigenvalues of P_{p}|_{S_{k}}, k\geq 0 form a non-increasing sequence.

Note that P_{p}|_{S_{0}} corresponds to the unit eigenvalue and that the matrix that corresponds to P_{p}|_{S_{1}} is exactly F_{1}, as easily seen from (41). Therefore, the second largest eigenvalue of P_{p} is attained on S_{1} and is equal to the maximum eigenvalue of F_{1}.
\square

Proof of Theorem 2. From the formula (41) it follows that for k=1, a matrix that corresponds to P_{p}|S_{1} is exactly F_{1}.
\square

Proof of Lemma 4. Let \lambda\neq 0 such that

 ABx=\lambda x

for some non-zero x. Multiply both sides by B

 BA(Bx)=\lambda Bx.

If \lambda=0 and x\neq 0 such that

 ABx=0,

we may have either Bx\neq 0 or Bx=0. In the first case multiply both sides by B so that BA(Bx)=0. Otherwise, if A is invertible, find y such that Ay=x so that BAy=0. If A is not invertible, there exists y\neq 0 such that Ay=0 so that again BAy=0.
\square

Proof of Theorem 5. Define

 \displaystyle h(p)=\lambda_{min}\left(D_{p}Q\right)=\lambda_{min}\left(\sqrt{Q% }D_{p}\sqrt{Q}\right). (42)

Assume p^{\mathrm{opt}}_{1} and p^{\mathrm{opt}}_{2} are two different points that maximise h. Then a function

 g(\alpha):=h(\alpha p^{\mathrm{opt}}_{1}+(1-\alpha)p^{\mathrm{opt}}_{2})

is constant on [0,1] due to concavity of h (see Proposition 9) and equals, say, \lambda. Since h(p) is itself the minimum eigenvalue of \sqrt{Q}D_{p}\sqrt{Q}, there exist unit vectors x_{0}, x_{1}, x_{2} such that

 \displaystyle\sqrt{Q}D_{\frac{1}{2}p^{\mathrm{opt}}_{1}+\frac{1}{2}p^{\mathrm{% opt}}_{2}}\sqrt{Q}x_{0}=\lambda x_{0} \displaystyle\sqrt{Q}D_{p^{\mathrm{opt}}_{1}}\sqrt{Q}x_{1}=\lambda x_{1} \displaystyle\sqrt{Q}D_{p^{\mathrm{opt}}_{2}}\sqrt{Q}x_{2}=\lambda x_{2}

Since g is constant on [0,1],

 \displaystyle\frac{1}{2}\left<\sqrt{Q}D_{p^{\mathrm{opt}}_{1}}\sqrt{Q}x_{0},x_% {0}\right>+\frac{1}{2}\left<\sqrt{Q}D_{p^{\mathrm{opt}}_{2}}\sqrt{Q}x_{0},x_{0% }\right>=g\left(\frac{1}{2}\right)=\frac{g(0)}{2}+\frac{g(1)}{2}= \displaystyle=\frac{1}{2}\left<\sqrt{Q}D_{p^{\mathrm{opt}}_{1}}\sqrt{Q}x_{1},x% _{1}\right>+\frac{1}{2}\left<\sqrt{Q}D_{p^{\mathrm{opt}}_{2}}\sqrt{Q}x_{2},x_{% 2}\right>\leq \displaystyle\leq\frac{1}{2}\left<\sqrt{Q}D_{p^{\mathrm{opt}}_{1}}\sqrt{Q}x_{0% },x_{0}\right>+\frac{1}{2}\left<\sqrt{Q}D_{p^{\mathrm{opt}}_{2}}\sqrt{Q}x_{0},% x_{0}\right>,

where the last inequality holds since x_{1}, x_{2} are the minimum eigenvectors. Hence,

 \displaystyle\left<\sqrt{Q}D_{p^{\mathrm{opt}}_{1}}\sqrt{Q}x_{0},x_{0}\right>=% \left<\sqrt{Q}D_{p^{\mathrm{opt}}_{1}}\sqrt{Q}x_{1},x_{1}\right>=\lambda, \displaystyle\left<\sqrt{Q}D_{p^{\mathrm{opt}}_{2}}\sqrt{Q}x_{0},x_{0}\right>=% \left<\sqrt{Q}D_{p^{\mathrm{opt}}_{2}}\sqrt{Q}x_{2},x_{2}\right>=\lambda.

Therefore,

 \displaystyle\sqrt{Q}D_{p^{\mathrm{opt}}_{1}}\sqrt{Q}x_{0}=\lambda x_{0}\mbox{% and }\sqrt{Q}D_{p^{\mathrm{opt}}_{2}}\sqrt{Q}x_{0}=\lambda x_{0}, (43)

which follows from the following simple statement.

###### Lemma 18.

Let A be a n\times n symmetric matrix, and x be a unit vector, such that \left<Ax,x\right>=\lambda_{min}(A). Then Ax=\lambda_{min}(A)x.

Let y_{0}=\sqrt{Q}x_{0}. Then (43) is equivalent to

 \displaystyle D_{p^{\mathrm{opt}}_{1}}y_{0}=\lambda\Sigma y_{0}\mbox{ and }D_{% p^{\mathrm{opt}}_{2}}y_{0}=\lambda\Sigma y_{0}.

Using the definition of (7), the last equalities yield

 (p^{\mathrm{opt}}_{i})_{j}=\lambda\frac{\left<\left(\Sigma y_{0}\right)_{j},(y% _{0})_{j}\right>}{\left},

for i=1,2, if (y_{0})_{j}\neq 0.

Let p:=\frac{1}{2}p^{\mathrm{opt}}_{1}+\frac{1}{2}p^{\mathrm{opt}}_{2}. It is left to show that for every j\in\{1,..,s\}, one can find a minimum eigenvector x_{0} of \sqrt{Q}D_{p}\sqrt{Q}, such that for the corresponding vector y_{0}=\sqrt{Q}x_{0}, we have (y_{0})_{j}\neq 0.

Define a space S_{0}=\{x_{0}|\sqrt{Q}D_{p}\sqrt{Q}x_{0}=\lambda x_{0}\} as a space generated by all the minimum eigenvectors of \sqrt{Q}D_{p}\sqrt{Q}.

Assume, on the contrary, that for some j\in\{1,..,s\}, and all x_{0}\in S_{0}, we have (y_{0})_{j}=0.

Define a space S_{\perp}:=\{x\ |x\mbox{ is orthogonal to }S_{0}\}. For \varepsilon\geq 0, let

 A_{\varepsilon}:=\sqrt{Q}D_{p^{(\varepsilon)}}\sqrt{Q},

where p^{(\varepsilon)}_{k}=(1+\varepsilon)p_{k} if k\neq j, and p^{(\varepsilon)}_{j}=p_{j}-\varepsilon\sum_{k\neq j}p_{k}. Note that for all x_{0}\in S_{0} and \varepsilon\geq 0,

 \displaystyle A_{\varepsilon}x_{0}=\sqrt{Q}D_{p^{(\varepsilon)}}\sqrt{Q}x_{0}=% \sqrt{Q}D_{p^{(\varepsilon)}}y_{0}=(1+\varepsilon)\lambda\sqrt{\Sigma}y_{0}=(1% +\varepsilon)\lambda x_{0}, (44)

since y_{0}=\sqrt{Q}x_{0}, D_{p}y_{0}=\lambda\Sigma y_{0}, and (y_{0})_{j}=0 by the assumption. That is, (1+\varepsilon)\lambda is an eigenvalue of A_{\varepsilon}, and S_{0} is the subspace of the corresponding eigenvectors.

Also, S_{\perp} is invariant under A_{\varepsilon} (i.e., A_{\varepsilon}S_{\perp}\subset S_{\perp}). Indeed, for all x\in S_{\perp} and x_{0}\in S_{0},

 \left=\left=% \lambda(1+\varepsilon)\left=0.

Let {\lambda}^{(\varepsilon)} be the minimum eigenvalue of A_{\varepsilon} restricted to the space S_{\perp}. Since S_{0} contains all possible minimum eigenvectors of \sqrt{Q}D_{p}\sqrt{Q}, we have {\lambda}^{(0)}>\lambda. Therefore, we can find small enough r>0, such that

 {\lambda}^{(0)}>(1+r)\lambda.

Recall that {\lambda}^{(\varepsilon)} is a continuous function of \varepsilon (since it is a concave function by the Proposition 9). Thus there exists \delta\in(0,r), such that for all \varepsilon\in[0,\delta], {\lambda}^{(\varepsilon)}>(1+r)\lambda and also p^{(\varepsilon)}\in\Delta_{s-1}. In particular,

 {\lambda}^{(\delta)}>(1+r)\lambda>(1+\delta)\lambda.

Since S_{\perp}\cup S_{0}=\mathbb{R}^{d}, and A_{\delta} is a symmetric, invariant operator on S_{\perp} and S_{0}, we obtain,

 \lambda_{min}(A_{\delta})=\min\{\lambda_{min}(A_{\delta}|_{S_{\perp}}),\lambda% _{min}(A_{\delta}|_{S_{0}})\}=\min\{{\lambda}^{(\delta)},(1+\delta)\lambda\}=(% 1+\delta)\lambda,

meaning \lambda(1+\delta) is the minimum eigenvalue of A_{\delta}=\sqrt{Q}D_{p^{(\delta)}}\sqrt{Q}, p^{(\delta)}\in\Delta_{s-1}. Hence \lambda is not the maximum of (42), which contradicts to the definition of \lambda. Thus there exists x_{0}\in S_{0}, such that (y_{0})_{j}\neq 0.
\square

Proof of Lemma 18. Let \{x_{i}\} be an orthonormal basis of eigenvectors of A with \{\lambda_{i}\} being the corresponding eigenvalues. Then x=\sum_{i=1}^{n}\left<x,x_{i}\right>x_{i}. We need to show that \left<x,x_{i}\right>=0 for all x_{i} that are not the minimum eigenvectors. Assume there are at least two vectors x_{k} and x_{j} such that \lambda_{k}\neq\lambda_{j}, \left<x,x_{k}\right>^{2}>0, and \left<x,x_{j}\right>^{2}>0. Then

 \left=\sum_{i=1}^{n}\lambda_{i}\left^{2}>\lambda_{% min}(A)\sum_{i=1}^{n}\left^{2}=\lambda_{min}(A),

\square

Proof of Theorem 6. Let P_{p} be the Gibbs kernel as in (1) with corresponding weights p and let P_{\frac{1}{s}} be the kernel of the vanilla chain. For functions f,g\in L_{2}(\mathbb{R}^{d},\pi) let

 =\int fg{\rm d}\pi,\ \|f\|^{2}=\int f^{2}{\rm d}\pi.

Using an equivalent representation for the spectral gap (see a remark to Theorem 2 of [36]), inequality (11) is equivalent to

 \displaystyle\inf_{\|f\|=1,\pi(f)=0}\Bigg{<}(I-P_{p})f,f\Bigg{>}\leq\max_{i}% \left(\frac{p_{i}}{q_{i}}\right)\inf_{\|f\|=1,\pi(f)=0}\Bigg{<}(I-P_{q})f,f% \Bigg{>}

It suffices to establish

 \Bigg{<}(I-P_{p})f,f\Bigg{>}\leq\max_{i}\left(\frac{p_{i}}{q_{i}}\right)\Bigg{% <}(I-P_{q})f,f\Bigg{>}

for all f, \|f\|=1,\pi(f)=0. Let j=\underset{i}{\rm argmax}\ \frac{p_{i}}{q_{i}} Using the representation (1) of P_{p}, the last inequality is equivalent to

 \sum_{i=1}^{s}p_{i}\left(\frac{q_{i}}{p_{i}}-\frac{q_{j}}{p_{j}}\right)\left<% Pr_{i}f,f\right>+\frac{q_{j}}{p_{j}}\|f\|^{2}\leq\|f\|^{2}.

Since \frac{q_{j}}{p_{j}}\leq\frac{q_{i}}{p_{i}} and \left<Pr_{i}f,f\right>\leq\|f\|^{2}, the last inequality follows:

 \sum_{i=1}^{s}p_{i}\left(\frac{q_{i}}{p_{i}}-\frac{q_{j}}{p_{j}}\right)\left<% Pr_{i}f,f\right>+\frac{q_{j}}{p_{j}}\|f\|^{2}\leq\sum_{i=1}^{s}p_{i}\left(% \frac{q_{i}}{p_{i}}-\frac{q_{j}}{p_{j}}\right)\|f\|^{2}+\frac{q_{j}}{p_{j}}\|f% \|^{2}=\|f\|^{2},

where in the last equality we used the fact that \sum_{i=1}^{s}p_{i}=\sum_{i=1}^{s}q_{i}=1.
\square

Proof of Proposition 7. For i=1,..,k let

 A_{i}=\left(\begin{array}[]{cc}p_{2i-1}&p_{2i-1}\rho_{i}\\ p_{2i}\rho_{i}&p_{2i}\\ \end{array}\right).

One can see that the pseudo-optimal weights p^{\mathrm{opt}} satisfy

 \displaystyle p^{\mathrm{opt}}=\rm argmax_{p\in\Delta_{2k-1}}\min\{\lambda_{% min}(A_{1}-\lambda I),...,\lambda_{\min}(A_{k}-\lambda I)\}. (45)

Set \alpha_{i}=p_{2i-1}+p_{2i}, i=1,..,k. We obtain

 \rm argmax_{p\in\Delta_{2k-1}}\lambda_{min}(A_{i}-\lambda I)=p^{\mathrm{opt}}_% {2i}=p^{\mathrm{opt}}_{2i-1}=\frac{\alpha_{i}}{2},

so that (45) takes the form

 \displaystyle p^{\mathrm{opt}}=\rm argmax_{p\in\Delta_{2k-1}}\min\left\{\frac{% \alpha_{1}(1-\rho_{1})}{2},...,\frac{\alpha_{k}(1-\rho_{k})}{2}\right\}. (46)

It is easy to verify that p^{\mathrm{opt}} should satisfy

 \frac{\alpha_{1}(1-\rho_{1})}{2}=...=\frac{\alpha_{k}(1-\rho_{k})}{2}.

The last relation leads to (13) and we conclude that the optimal selection probabilities p^{\mathrm{opt}}_{i}, i=1,..,k are computed as in (14). Finally, (15) follows from (46).
\square

Proof of Proposition 8. Note that the target function f in (18) is the minimum eigenvalue of

 \left(\begin{array}[]{cc}\sqrt{Q}D_{w}\sqrt{Q}&0\\ 0&1-w_{1}-..-w_{s}\end{array}\right),

so that we can rewrite f as

 \begin{split}&\displaystyle f(w)=\lambda_{min}\left({D^{\rm ext}_{w}}{Q^{\rm ext% }}\right)=\lambda_{min}\left(\sqrt{{Q^{\rm ext}}}{D^{\rm ext}_{w}}\sqrt{{Q^{% \rm ext}}}\right)=\\ &\displaystyle=\min\{\lambda_{min}(\sqrt{Q}D_{w}\sqrt{Q}),1-w_{1}-..-w_{s}\}=% \\ &\displaystyle=\min\{\mathop{\mbox{$\rm P$-$\rm Gap$}}(w),1-w_{1}-..-w_{s}\}.% \end{split} (47)

Let w^{\star}=\underset{{w\in\Delta_{s}}}{\rm argmax}f(w), and denote p^{\star}_{j}=\frac{w^{\star}_{j}}{w^{\star}_{1}+..+w^{\star}_{s}},\ j=1,..,s,. To prove the proposition it suffices to show that for the pseudo-optimal weights p^{\mathrm{opt}},

 \displaystyle\mathop{\mbox{$\rm P$-$\rm Gap$}}(p^{\mathrm{opt}})\leq\mathop{% \mbox{$\rm P$-$\rm Gap$}}(p^{\star}). (48)

Let k^{\star}=\sum_{i=1}^{s}w^{\star}_{i}. It is easy to see from (47) that

 1-k^{\star}=\mathop{\mbox{$\rm P$-$\rm Gap$}}(w^{\star})=\mathop{\mbox{$\rm P$% -$\rm Gap$}}(k^{\star}p^{\star})=f(k^{\star}p^{\star}).

Since for any k>0 and any p\in\Delta_{s}, \mathop{\mbox{$\rm P$-$\rm Gap$}}(kp)=k\mathop{\mbox{$\rm P$-$\rm Gap$}}(p), we can choose k<1 such that

 1-k=\mathop{\mbox{$\rm P$-$\rm Gap$}}(kp^{\mathrm{opt}})=f(kp^{\mathrm{opt}}).

Hence, by definition of w^{\star},

 \mathop{\mbox{$\rm P$-$\rm Gap$}}(kp^{\mathrm{opt}})=1-k\leq 1-k*=\mathop{% \mbox{$\rm P$-$\rm Gap$}}(k^{\star}p^{\star}),

implying k^{\star}\leq k. Therefore,

 \displaystyle\mathop{\mbox{$\rm P$-$\rm Gap$}}(p^{\mathrm{opt}})=\frac{1}{k}% \mathop{\mbox{$\rm P$-$\rm Gap$}}(kp^{\mathrm{opt}})\leq\frac{1}{k}\mathop{% \mbox{$\rm P$-$\rm Gap$}}(k^{\star}p^{\star})= \displaystyle=\frac{k^{\star}}{k}\mathop{\mbox{$\rm P$-$\rm Gap$}}(p^{\star})% \leq\mathop{\mbox{$\rm P$-$\rm Gap$}}(p^{\star}),

whence we conclude (48).
\square

Proof of Proposition 9. Note that \left<\sqrt{{Q^{\rm ext}_{n}}}{D^{\rm ext}_{n}(w)}\sqrt{{Q^{\rm ext}_{n}}}x,x\right> is linear in w for all x\in\mathbb{R}^{d+1}. That is, there exist functions a_{0}(x),..,a_{s}(x) such that

 \left<\sqrt{{Q^{\rm ext}_{n}}}{D^{\rm ext}_{n}(w)}\sqrt{{Q^{\rm ext}_{n}}}x,x% \right>=a_{0}(x)+w_{1}a_{1}(x)+..+w_{s}a_{s}(x).

Thus \left<\sqrt{{Q^{\rm ext}_{n}}}{D^{\rm ext}_{n}(w)}\sqrt{{Q^{\rm ext}_{n}}}x,x\right> is concave for all x. Then f_{n} is concave as the minimum over concave functions.
\square

Proof of Proposition 15. Geometric ergodicity follows if we find drift coefficients to establish 2. We argue that

 V(x_{1:n})=\frac{x_{1}^{2}}{2}+x_{2}^{2}..+x_{n-1}^{2}+\frac{x_{n}^{2}}{2}

is an appropriate drift function for the vanilla RSGS is cases 1, 2 and 3. Note that V does not depend on the regimes r(i). One can work out the full conditionals for X_{i},

 \begin{split}&\displaystyle X_{i}|X_{-i},Y_{1:n},r(1),..,r(n)\sim N(\mu,q_{r(i% ),r(i+1)}),\\ &\displaystyle\mu=q_{r(i),r(i+1)}\left(\frac{X_{i-1}I_{\{i>1\}}}{\sigma^{2}_{r% (i)}}+\frac{X_{i+1}I_{\{i

where

 q_{r(i),r(i+1)}=\frac{1}{\frac{1}{\beta^{2}}+\frac{I_{\{i>1\}}}{\sigma^{2}_{r(% i)}}+\frac{I_{\{i

Here we set X_{n+1}=X_{0}:=0.

Let f_{i}(x_{1:n})=x_{i}^{2}. For Pr_{i}, i=1,..,n, defined in (1), where Pr_{i} corresponds for updating X_{i} from its full conditional, one see it is obvious that

 \displaystyle Pr_{j}f_{i}(x_{1:n})=f_{i}(x_{1:n}),\ i\neq j. (50)

From (49), using the Cauchy-Schwartz inequality, we get

 \begin{split}&\displaystyle Pr_{i}f_{i}(x_{1:n})=\mu^{2}+q_{r(i),r(i+1)}\leq\\ &\displaystyle\leq q^{2}_{r(i),r(i+1)}\left(\frac{I_{\{i>1\}}}{\sigma^{4}_{r(i% )}}+\frac{I_{\{i1\}}+f_{i+1}(x_{1:n})I_{\{i<% n\}}\Big{)}+L_{i}(x_{i-1},x_{i+1}),\end{split} (51)

where L_{i}(x_{i-1},x_{i+1}) is a linear function. Note that for the considered cases 1, 2 and 3, for any configuration of (r(1),..,r(n)) and i\in\{2,..,n-1\},

 \displaystyle 2q^{2}_{r(i),r(i+1)}\left(\frac{1}{\sigma^{4}_{r(i)}}+\frac{1}{% \sigma^{4}_{r(i+1)}}\right)<0.99. (52)

Moreover, for i\in\{1,n\},

 \displaystyle q^{2}_{r(i),r(i+1)}\left(\frac{I_{\{i>1\}}}{\sigma^{4}_{r(i)}}+% \frac{I_{\{i

It follows from (51), (52) and (53),

 \begin{split}&\displaystyle\frac{1}{2}Pr_{1}f_{1}(x_{1:n})+\sum_{i=2}^{n-1}Pr_% {i}f_{i}(x_{1:n})+\frac{1}{2}Pr_{n}f_{n}(x_{1:n})\leq\\ &\displaystyle\leq 0.99V(x_{1:n})+L(x_{1:n}),\end{split} (54)

where L(x_{1:n}) is a linear function. Together with (50), the inequality (54) yields

 \displaystyle\frac{1}{d}\sum_{i=1}^{n}Pr_{i}V(x_{1:n})\leq\frac{\lambda}{2}V(x% _{1:n})+A, (55)

for some \lambda<1 and A<\infty.

For Pr_{i+n} that corresponds for updating r(i) from its full conditional, since V does not depend on r(i), we get

 \displaystyle Pr_{i+n}V=V. (56)

Let P_{\frac{1}{d}} be the RSGS kernel that corresponds to the vanilla chain with uniform sampling weights \frac{1}{d}. Combining (55) and (56) together, we obtain,

 P_{\frac{1}{d}}V\leq\frac{\lambda}{2}V+\frac{1}{2}V+A.

Since for all C<\infty, set \{\left(x_{1:n},r(1),..,r(n)\right)\ |\ V(x_{1:n})<C\} is small, Lemma 15.2.8 of [30] yields that V is a geometric drift function.

For the RSGS with non-uniform selection probabilities p=(p_{1},..,p_{d}), we can use theorem 10 to conclude the geometric ergodicity.
\square

## References

• 1 Yali Amit. On rates of convergence of stochastic relaxation for Gaussian and non-Gaussian distributions. J. Multivariate Anal., 38(1):82–99, 1991.
• 2 Yali Amit. Convergence properties of the Gibbs sampler for perturbations of Gaussians. Ann. Statist., 24(1):122–140, 1996.
• 3 Alan L. Andrew, K.-w. Eric Chu, and Peter Lancaster. Derivatives of eigenvalues and eigenvectors of matrix functions. SIAM J. Matrix Anal. Appl., 14(4):903–926, 1993.
• 4 George E. Andrews, Richard Askey, and Ranjan Roy. Special functions, volume 71 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, Cambridge, 1999.
• 5 Christophe Andrieu and Yves F. Atchadé. On the efficiency of adaptive MCMC algorithms. Electron. Comm. Probab., 12:336–349 (electronic), 2007.
• 6 Yves Atchadé and Gersende Fort. Limit theorems for some adaptive MCMC algorithms with subgeometric kernels. Bernoulli, 16(1):116–154, 2010.
• 7 Yves F. Atchadé and Jeffrey S. Rosenthal. On adaptive Markov chain Monte Carlo algorithms. Bernoulli, 11(5):815–828, 2005.
• 8 Yan Bai, Gareth O. Roberts, and Jeffrey S. Rosenthal. On the containment condition for adaptive Markov chain Monte Carlo algorithms. Adv. Appl. Stat., 21(1):1–54, 2011.
• 9 Mylène Bédard. Weak convergence of Metropolis algorithms for non-i.i.d. target distributions. Ann. Appl. Probab., 17(4):1222–1244, 2007.
• 10 Mylène Bédard and Jeffrey S. Rosenthal. Optimal scaling of Metropolis algorithms: heading toward general target distributions. Canad. J. Statist., 36(4):483–503, 2008.
• 11 Witold Bednorz and Krzysztof Łatuszyński. A few remarks on “Fixed-width output analysis for Markov chain Monte Carlo” by Jones et al. [mr2279478]. J. Amer. Statist. Assoc., 102(480):1485–1486, 2007.
• 12 Witold Bednorz, Krzysztof Łatuszyński, and Rafał Latała. A regeneration proof of the central limit theorem for uniformly ergodic Markov chains. Electron. Commun. Probab., 13:85–98, 2008.
• 13 Dimitri P. Bertsekas. Convex analysis and optimization. Athena Scientific, Belmont, MA, 2003. With Angelia Nedić and Asuman E. Ozdaglar.
• 14 Peter J. Bickel and Elizaveta Levina. Covariance regularization by thresholding. Ann. Statist., 36(6):2577–2604, 2008.
• 15 Jonathan M. Borwein and Jon D. Vanderwerff. Convex functions: constructions, characterizations and counterexamples, volume 109 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, Cambridge, 2010.
• 16 Cyril Chimisov, Krzysztof Łatuszynski, and Roberts Gareth. Air Markov Chain Monte Carlo.
• 17 King-wah Eric Chu. On multiple eigenvalues of matrices depending on several parameters. SIAM J. Numer. Anal., 27(5):1368–1385, 1990.
• 18 Radu V. Craiu, Lawrence Gray, Krzysztof Łatuszyński, Neal Madras, Gareth O. Roberts, and Jeffrey S. Rosenthal. Stability of adversarial Markov chains, with an application to adaptive MCMC algorithms. Ann. Appl. Probab., 25(6):3592–3623, 2015.
• 19 Hani Doss and B Narasimhan. Bayesian Poisson Regression Using The Gibbs Sampler: Sensitivity analysis through Dynamical Graphics, 1994.
• 20 A. Gelman, G. O. Roberts, and W. R. Gilks. Efficient Metropolis jumping rules. pages 599–607, 1996.
• 21 John Geweke. Efficient Simulation from the Multivariate Normal and Student-t Distributions Subject to Linear Constraints and the Evaluation of Constraint Probabilities. 1991 Computing Science and Statistics: the Twenty-Third Symposium on the Interface, pages 1–14, 1991.
• 22 Walter R. Gilks, Gareth O. Roberts, and Sujit K. Sahu. Adaptive Markov chain Monte Carlo through regeneration. J. Amer. Statist. Assoc., 93(443):1045–1054, 1998.
• 23 W.R. Gilks and P. Wild. Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society. Series C (Applied Statistics), 41(2):337–348, 1992.
• 24 Heikki Haario, Eero Saksman, and Johanna Tamminen. An adaptive metropolis algorithm. Bernoulli, 7(2):223–242, 2001.
• 25 Edwin Hewitt and Karl Stromberg. Real and abstract analysis. A modern treatment of the theory of functions of a real variable. Springer-Verlag, New York, 1965.
• 26 Galin L. Jones, Murali Haran, Brian S. Caffo, and Ronald Neath. Fixed-width output analysis for Markov chain Monte Carlo. J. Amer. Statist. Assoc., 101(476):1537–1547, 2006.
• 27 C. Kipnis and S. R. S. Varadhan. Central limit theorem for additive functionals of reversible Markov processes and applications to simple exclusions. Comm. Math. Phys., 104(1):1–19, 1986.
• 28 Krzysztof Łatuszyński, Gareth O. Roberts, and Jeffrey S. Rosenthal. Adaptive Gibbs samplers and related MCMC methods. Ann. Appl. Probab., 23(1):66–98, 2013.
• 29 Jun S. Liu. Monte Carlo strategies in scientific computing. Springer Series in Statistics. Springer, New York, 2008.
• 30 Sean Meyn and Richard L. Tweedie. Markov chains and stochastic stability. Cambridge University Press, Cambridge, second edition, 2009. With a prologue by Peter W. Glynn.
• 31 Michael L. Overton. On minimizing the maximum eigenvalue of a symmetric matrix. SIAM J. Matrix Anal. Appl., 9(2):256–268, 1988. SIAM Conference on Linear Algebra in Signals, Systems, and Control (Boston, Mass., 1986).
• 32 Michael L. Overton. Large-scale optimization of eigenvalues. SIAM J. Optim., 2(1):88–120, 1992.
• 33 Michael Reed and Barry Simon. Methods of modern mathematical physics. I. Academic Press, Inc. [Harcourt Brace Jovanovich, Publishers], New York, second edition, 1980. Functional analysis.
• 34 G. O. Roberts, A. Gelman, and W. R. Gilks. Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab., 7(1):110–120, 1997.
• 35 G. O. Roberts and S. K. Sahu. Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. J. Roy. Statist. Soc. Ser. B, 59(2):291–317, 1997.
• 36 Gareth O. Roberts and Jeffrey S. Rosenthal. Geometric ergodicity and hybrid Markov chains. Electron. Comm. Probab., 2:no. 2, 13–25 (electronic), 1997.
• 37 Gareth O. Roberts and Jeffrey S. Rosenthal. Optimal scaling for various Metropolis-Hastings algorithms. Statist. Sci., 16(4):351–367, 2001.
• 38 Gareth O. Roberts and Jeffrey S. Rosenthal. General state space Markov chains and MCMC algorithms. Probab. Surv., 1:20–71, 2004.
• 39 Gareth O. Roberts and Jeffrey S. Rosenthal. Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. J. Appl. Probab., 44(2):458–475, 2007.
• 40 Gareth O. Roberts and Jeffrey S. Rosenthal. Examples of adaptive MCMC. J. Comput. Graph. Statist., 18(2):349–367, 2009.
• 41 Gareth O. Roberts and Richard L. Tweedie. Geometric L^{2} and L^{1} convergence are equivalent for reversible Markov chains. J. Appl. Probab., 38A:37–41, 2001. Probability, statistics and seismology.
• 42 Steven Roman. The umbral calculus, volume 111 of Pure and Applied Mathematics. Academic Press, Inc. [Harcourt Brace Jovanovich, Publishers], New York, 1984.
• 43 Jeffrey S. Rosenthal. Optimal proposal distributions and adaptive MCMC. pages 93–111, 2011.
• 44 Eero Saksman and Matti Vihola. On the ergodicity of the adaptive Metropolis algorithm on unbounded domains. Ann. Appl. Probab., 20(6):2178–2203, 2010.
• 45 A. W. van der Vaart. Asymptotic statistics, volume 3 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 1998.
• 46 Matti Vihola. On the stability and ergodicity of adaptive scaling Metropolis algorithms. Stochastic Process. Appl., 121(12):2839–2860, 2011.
• 47 Matti Vihola. Robust adaptive Metropolis algorithm with coerced acceptance rate. Stat. Comput., 22(5):997–1008, 2012.
• 48 W. Wang and M. Á. Carreira-Perpiñán. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. ArXiv e-prints, September 2013.
• 49 C. Zălinescu. Convex analysis in general vector spaces. World Scientific Publishing Co., Inc., River Edge, NJ, 2002.
113358