A Review of Dirac Notation

# Quantum Inspired Training for Boltzmann Machines

## Abstract

We present an efficient classical algorithm for training deep Boltzmann machines (DBMs) that uses rejection sampling in concert with variational approximations to estimate the gradients of the training objective function. Our algorithm is inspired by a recent quantum algorithm for training DBMs [1]. We obtain rigorous bounds on the errors in the approximate gradients; in turn, we find that choosing the instrumental distribution to minimize the divergence with the Gibbs state minimizes the asymptotic algorithmic complexity. Our rejection sampling approach can yield more accurate gradients than low-order contrastive divergence training and the costs incurred in finding increasingly accurate gradients can be easily parallelized. Finally our algorithm can train full Boltzmann machines and scales more favorably with the number of layers in a DBM than greedy contrastive divergence training.

\nipsfinalcopy

## 1 Introduction

In 2002, Hinton provided the first efficient algorithm for training Boltzmann machines [2], a type of stochastic recurrent neural network with undirected edges, called contrastive divergence (CD). Due to the emergence of CD, Boltzmann machines, specifically restricted Boltzmann machines (RBMs) and their deep layered counterparts (DBMs), have become standard tools for solving problems in vision and speech recognition [3, 4]. Despite the remarkable successes that contrastive divergence has achieved, there are a number of theoretical and practical drawbacks to the use of the algorithm. A major drawback of contrastive divergence training is that it implicitly adds directionality to the edges in the graphical model, which lessens any advantages Boltzmann machines may have over feed–forward neural nets. On a more practical level, parallelism cannot be used to increase the accuracy of contrastive divergence training which means that only the lowest–order (and least accurate) contrastive divergence approximation is used.

Recent work in quantum computing has revealed a class of training algorithms that use a quantum form of rejection sampling to overcome these problems [1]. The approach hinges on refining quantum states that crudely approximate the joint and conditional probability distributions for the units in the Boltzmann machine into ones that closely mimic them. This process is expected to yield accurate and efficient approximations to the distribution if sufficiently strong regularization is used in the training process [5, 6]. The gradients of the training objective function (the average log–likelihood of the BM producing the training data) are then estimated by either sampling from the resulting quantum states or by using techniques like quantum amplitude amplification and estimation. Amplitude amplification results in a quadratic speedup with respect to the acceptance probability of the rejection step and amplitude estimation quadratically reduces the number of training vectors (at the price of quadratically worsening the scaling with [1]. These advantages are summarized in Table 1.

This quantum approach has several significant features. First, it is a very natural method for training a Boltzmann machine using a quantum computer because it only involves state preparation and measurement. Second, it does not explicitly depend on the interaction graph used: it can be used to train full Boltzmann machines rather than just DBMs. Third, it is mathematically easy to verify that the approximate gradients converge to the true gradients in the appropriate limit. Finally, it does not have a well known classical counterpart, unlike many existing quantum machine learning results.

Unfortunately, quantum computers are currently limited to tens of quantum bits, which means that [1] can only be used to train impractically small Boltzmann machines using present-day hardware. Consequently, determining a classical analogue of the quantum approach would embue classical training methods with the quantum advantages. The challenge is whether the classical analogue would still result in efficient training. We present such a classical analogue in this paper.

We call our approach Instrumental Rejection Sampling (IRS). It retains all of the theoretical advantages of the quantum algorithm, while providing practical advantages for training highly optimized deep Boltzmann machines in the presence of regularization. It is worth noting that our approach is not specific to Boltzmann machines: it also applies to more general classes of undirected graphical models with latent variables.

Here we investigate the quality of the gradients yielded by this method and provide theoretical and numerical evidence that training using IRS is not only efficient, but it may also convey practical advantages for training certain classes of deep Boltzmann machines. Since these insights stemmed from recent progress on quantum machine learning, our work underscores the value of investigating quantum paradigms for machine learning even in the absence of large–scale quantum computers.

## 2 Instrumental Rejection Sampling

The main insight behind our result, and stemming from [1], is that variational approximations to the Gibbs state can be used to make training with rejection sampling much more efficient than it would initially appear. This result is conceptually related to work by Murray and Gharhramani [7] in the context of MCMC algorithms; whereas, our results hold for rejection sampling and also show how to choose the approximation to minimize the error in the sample distribution. We present the general theory of IRS here and apply it to training DBMs in Section 3.

Rejection sampling seeks to draw samples from a distribution that cannot be sampled from directly by sampling instead from an instrumental distribution, , and rejecting the samples with a probability . Here is a normalizing constant introduced to ensure that the rejection probability is well defined. In other words, we draw samples from and reject them with probability

 Praccept(x|Q(x),κ,Z)=P(x)ZκQ(x), (1)

until a sample is accepted. This can be implemented by drawing uniformly from and accepting if .

In applications such as training Boltzmann machines, the constants needed to normalize (1) are not known or may be prohibitively large for some . This can be addressed using a form of approximate rejection sampling where we use and such that for some set of configurations which we call bad. The approximate rejection sampling algorithm then proceeds as the precise rejection sampling algorithm except that the sample will always be accepted if . This means that the samples yielded by approximate rejection sampling are not precisely drawn from . Error estimates for this sampling process are given below.

###### Theorem 1.

Let be an efficiently computable probability distribution that can be efficiently sampled from. Assume that can be efficiently computed and where

There then exists an efficient classical algorithm that samples from a distribution such that . The probability of accepting a sample is at least .

Our proof of this theorem uses quantum techniques and is given in the appendix. Theorem 1 shows that if the conditions normally required of rejection sampling are not met then an approximate sampling algorithm exists that is promised to be close to the distribution if the chosen value of is sufficiently large. Since the acceptance rate of the sampling algorithm scales inversely with , elementary choices of (such as the uniform distribution) are unlikely to efficiently produce accurate samples from for polynomially large . We call our approach Instrumental Rejection Sampling training to emphasize that we use a non-trivial to draw the samples.

In order to minimize the error in Theorem 1, we want to choose to minimize an appropriate divergence between and . The choice of divergence that minimizes is by no means unique. The result of [1] chooses to be a product distribution that minimizes known as the mean–field distribution; however, this choice only provides a polynomial penalty for using an exponentially poor approximation to the tail probability.

We can address this problem by using the method of [8] to find to minimize a divergence that does not de–emphasize these tail probabilities. This method finds that minimizes where

 Dα(p||q)=∫xαp(x)+(1−α)q(x)−p(x)αq(x)1−αdxα(1−α). (2)

Note that and hence the method of [8] generalizes the mean–field approximation. The following lemma shows that choosing to be the that minimizes also minimizes an upper bound on the rejection sampling error.

###### Lemma 1.

Let and be probability distributions then

where is the divergence and we consider the asymptotic regime where .

###### Proof.

There are two cases, and . If we have the former case then and the result trivially follows. Now let us focus on the case where the set is non–empty. Since is a probability distribution and

In order to estimate the probability, it is helpful to think of the problem as a sampling problem for the random variable where . The desired probability is then bounded above (using Markov’s inequality) by

 ∑x∈badP(x)/Z = Prx∼P(x)/Z(P(x)/ZQQ(x)>κA) (4) ≤ Prx∼P(x)/Z(P(x)/ZQQ(x)≥κA) ≤ E(P/Q)κAZQ=∑xP2(x)ZQ(x)κAZQ.

Rewriting the –divergence as a sum and using the fact that and are normalized to leads to

 D2(P/Z||Q)=12∑xP2(x)Z2Q(x)−12. (5)

Thus (4) and (5) imply

which completes the proof under the assumption that . ∎

Lemma 1 shows that choosing to minimize will asymptotically minimize the approximation error. In contrast, no corresponding result has been rigorously shown for .

## 3 Approximate training of Boltzmann machines using IRS

We now show how instrumental rejection sampling can be applied to training Boltzmann machines. Boltzmann machines model the training data as an Ising model in thermal equilibrium with its environment. The goal of training is to adjust the parameters of the Ising model to maximize the likelihood that the observed training data would emerge from the thermal distribution in the system.

A Boltzmann machine consists of spins (bits) which are composed of visible units and hidden units [2, 9]. The visible units represent the input and output of the model and the hidden units are used to generate appropriate correlations between the features in the input and output. The correlations can be visualized as edges in a graphical model between pairs of hidden and visible units. In general, the underlying graph can be a complete graph; however, in practice a layered bipartite graph is usually preferred since it admits efficient training.

The restricted Boltzmann machine (RBM) consists of two layers of units, where each layer consists of either exclusively hidden or exclusively visible units. Edges are restricted to inter-layer correlations. The unnormalized probability of a given configuration of hidden and visible units is

 P(v,h)=e−E(v,h), (7)

where for is the normalized joint probability distribution and

 E(v,h)=−∑ibivi−∑jdjhj−∑i,jwi,jvihj, (8)

for binary units and . The vectors and are called biases and set the probability of a unit being zero or one irrespective of the values of the adjacent units. The weights provide energy penalties for two adjacent units taking the value .

Training the model formally involves maximizing the average log–likelihood of the training data emerging from the model. This can be thought of as an optimization problem with objective function

 OML=−1Ntrain∑x∈xtrainlog(∑he−E(x,h))−log(Z)−λ2wTw, (9)

where is a regularization term introduced to combat overfitting.

Unfortunately, calculation of the objective function is in general intractable as calculation of the partition function is complete for non–planar Ising models. Nonetheless, the gradient of can be estimated without knowing the log–partition function [2].

 ∂OML∂wi,j=⟨vihj⟩data−⟨vihj⟩model−λwi,j. (10)

Here the expectation value over the data refers to the average of the corresponding pair of visible and hidden units given the constraint that the visible units are clamped to the values of the individual training vectors (the hidden units are allowed to vary and take values given by the Gibbs distribution). The expectation value over the model corresponds to the average value of the product of the pair of units for the Ising model when the visible units are not constrained to take values in the training set. The derivatives of in the directions of the biases are given in the appendix.

Despite the innocuous appearance of the gradient in (10), it can still be challenging to compute because the expectation values require drawing samples from a Gibbs distribution described by (7), which is itself an \NP-hard task in general. This intractability is sidestepped through the use of approximate methods such as contrastive divergence [2, 5]. CD training is efficient, but it has theoretical and practical shortcomings [10] that we aim to address with Instrumental Rejection Sampling.

### 3.1 Approximate Gibbs distributions

Many methods can be employed to provide an estimate of the Gibbs distribution . Perhaps the simplest is the mean–field approximation (MF), which takes to be a factorized probability distribution over all hidden and visible units in the graphical model. Among these factorized distributions, is taken to be the one that is closest to , where closest means that it minimizes . In particular, for an RBM

 QMF(v,h)=nv∏j=1μvjj(1−μj)1−vjnh∏k=1νhkk(1−νk)1−hk,

where

 μj=(1+e−bj−∑kwjkνk)−1, νk=(1+e−dk−∑jwjkμj)−1. (11)

The optimal mean–field parameters can be approximated by solving (11) using fixed point iteration. The MF approximation for generic Boltzmann machines takes a very similar form [5]. Note that here is a product distribution, but multi–modal or structured mean–field approximations can be used instead in cases where the MF approximation is expected to break down [6].

Although the mean–field approximation is expedient to compute, Lemma 1 suggests that choosing to minimize will asymptotically minimize the algorithm’s complexity. We refer to the product distribution that as . The minimization strategy used to find does not work for because does not contain logarithms and hence more general methods, such as fractional belief propagation, are needed. Fractional belief propagation works by choosing to variationally minimize an upper bound on the log–partition function that corresponds to the choice . The algorithm is explained in detail in [11, 8].

In order to maximize the probability of success, it is useful to have an estimate of the partition function . The log–partition function can be estimated for any product distribution using

 log(Z)≥log(ZQ):=∑xQ(x)log(e−E(x)Q(x))=−⟨E⟩−H[Q(x)], (12)

where is the Shannon entropy of and is the expected energy in the state . Equation (12) holds with equality if and only if . The slack in the inequality is the Kullback–Leibler divergence , which means that if then the approximation will become more accurate as approaches the Gibbs distribution [8]. We denote this mean–field approximation to the partition function as . Other tractable approximations to the log–partition function can also be used [8, 12]. For simplicity, we use for the majority of the subsequent numerical experiments.

### 3.2 Training algorithm

Our training algorithm for Boltzmann machines is given in Algorithm 1. The algorithm assumes that the user has (1) a method that approximates the model distribution and (2) a family of distributions that estimates the data distribution when the visible units are clamped to data vector . We assume in both cases that the resulting distributions are product distributions. Note that is expected to be approximately unimodal for trained models [5] and hence it is reasonable to expect that it will provide a good approximation to the true probability. We assume the that drawing a sample from or costs one operation. Arithmetic operations such as addition, multiplication and exponentiation are each assumed to cost a single operation.

###### Theorem 2.

The expected number of query and arithmetic operations required to train a –layer connected DBM containing edges using rejection sampling for epochs is , if the assumptions of Theorem 1 hold with and the mean–field approximation is used to estimate the partition function.

The proof of the theorem is a straightforward exercise in counting the number of expected steps in Algorithm 1 and is given for completeness in the appendix.

In contrast, the cost of using greedy contrastive divergence training with CD- is  [9]. Thus IRS training will have an advantage over contrastive divergence for training sufficiently deep networks if . It is worth noting that MF-CD training [5] also offers advantages for training deep networks but can be less accurate than CD training [13].

A further advantage of IRS over CD training is that the rejection sampling step can be parallelized. Conversely, the sampling steps used in CD- cannot be parallelized. Parallelism can boost the accuracy of IRS training without increasing the runtime of the algorithm, as discussed in Table 1.

As with contrastive divergence training, the accuracy of the gradients yielded by IRS training is controllable. The two parameters that affect the quality of the gradients are and . As both of these quantities approach infinity, the error in the gradient goes to zero. We formalize this in the following corollary, which is proven in the appendix.

###### Corollary 1.

The expected Euclidean norm of the difference between the gradient computed by samples using rejection sampling with instrumental distribution using and for a connected binary Boltzmann machine with edges computed using is

 O(√E(1√Nsamp+ZD2(P/Z||Q)κAZQ)).

The proof of Corollary 1 follows directly from Lemma 1, standard error bounds on statistical sampling and norm inequalities. Note that the choice in Algorithm 1 is not necessary.

Figure 1 confirms the expectations of Corollary 1 by showing that the error in the estimated gradient shrinks as if is sufficiently large. Furthermore, because there are twice as many weights in the RBM with and as there are in the RBM with and , the ratio of the two errors should be a factor of . The numerical experiments on these small RBMs agree with this assumption, suggesting that the scaling with also agrees with Corollary 1. Also, proves to be sufficient for the majority of the data sets considered, while appears to be the threshold below which the quality of the gradient is negatively impacted by .

We examine the value of obtained using and case in Table 2 for a random ensemble of RBMs in order to assess the benefits of IRS training using . We choose a large weight distribution for the data to emphasize the discrepancies between the two. For more modest weight distributions, the two quantities become comparable (see appendix). We find that using the mean–field approximation instead of the optimal distribution increases the expected predicted by Corollary 1 by up to orders of magnitude for a unit RBM. In particular, the values of needed are sufficiently low such that the majority of these distributions can be exactly prepared from using a reasonable number of samples.

### 3.4 Accuracy of training

We will now compare the performance of IRS training algorithm to CD-1 training for small restricted Boltzmann machines. We use the following synthetic training data:

 [x1]j = 1 if j≤nv/2, else 0 [x2]j = j mod 2, (13)

and their bitwise complements. We further add Bernoulli noise of strength to each of the bits in the training vectors to make the data set more challenging to learn.

We assess the quality of the training methods by using Algorithm 1 or CD-1 to find an approximation to the optimal model parameters and then compute the exact gradients of ) to find the location of the true optima that these methods estimate. The data is presented in Figure 2 for a small RBM with and . The data shows that while contrastive divergence finds an optima that is on average within of that exact ML training can provide, IRS training deviates by and continues to converge to the ML optima as increases. Further numerical evidence is provided in the appendix. Such differences are expected to be even more striking for deep restricted Boltzmann machines because IRS does not greedily optimize the weights [1].

## 4 Outlook

Our results open a number of further avenues for further inquiry. Although IRS has asymptotic advantages over existing methods for training DBMs in the presence of sufficiently strong regularization, our work only shows that it can provide accurate and efficient approximations to the gradients of the training objective under such circumstances. Further work will be needed to examine the performance of IRS training in practical machine learning problems.

Since IRS achieves its goals by combining results from the disparate fields of variational approximations and deep learning, it is natural to suspect that it could be optimized by going beyond the simple unimodal approximations used in the main body. Indeed, we see in the appendix that the use of bimodal approximations can substantially reduce the sample complexity of training. Further study may reveal even more practical variational approximations to .

Our work also illustrates that quantum machine learning may be an important avenue of inquiry for understanding machine learning in a broader context. This utility of thinking from a quantum perspective is not likely to be unique to training Boltzmann machines since classical computing is in a subset of quantum computing and hence every result shown for classical machine learning also applies to a subset of quantum machine learning. Conversely, lower bounds and no-go theorems proven for the quantum setting also apply to the classical setting, which makes quantum computing ideally suited for understanding the limitations and opportunities that physics places on a machine’s ability to learn. Just as quantum insights inspired the present work, re-examining other areas of machine learning through the lens of quantum computing may not only lead to new classical algorithms but also provide deep insights into the nature of learning and inference.

#### Acknowledgments

We thank Tom Minka for valuable discussions and for the code for computing –divergences.

## Appendix A Review of Dirac Notation

We will derive much of the theory behind our method using language from quantum computing. This language not only proves to be useful for representing rejection sampling based algorithms, but also is useful because it provides a clear method for dequantizing the method of [1]. The central object in quantum computing is the quantum state (which we will take to mean a pure state in the following). A quantum state is simply a unit vector in such that the magnitude squared of each of its components yields the probability of the corresponding outcome.

A quantum state is typically represented (using Dirac notation) as which can be interpreted to be a column vector in . Similarly, is its Hermitian transpose. The notation is linear, which means that if is a complete orthonormal vector space on then the state can be written as

 |ψ⟩=N∑j=1√Pj|j⟩. (14)

Physically, such states describe the probability distribution of outcomes that emerges when the state is measured in this basis where each gives the probability of finding the system in basis state . Classically, this measurement process is equivalent to sampling from a probability distribution but in quantum mechanics this is more subtle because the resultant probability distribution changes depending on the basis that the system is measured in; whereas in classical applications there is implicitly only one such basis. The fact that Dirac notation is basis independent gives it a distinct advantage for concisely describing distributions relative to column–vector notation.

Dirac notation also has an implicit tensor product structure built in, meaning that

 |ψ⟩|ϕ⟩:=|ψ⟩⊗|ϕ⟩. (15)

This convention is useful for describing large sets of uncorrelated variables because the tensor product structure implicitly captures the lack of correlations between the variables described by and those described by . Correlated distributions can be described by combining (14) and (15) via

 ∑i,jai,j|ui⟩∣∣vj⟩, (16)

where and represent orthonormal basis vectors that span the space that the variables described by and are supported in. Also, for the probability of drawing a particular sample from the marginal distribution over is, for example, and the resultant marginal state over the is

 ∑iai,j√∑i|ai,j|2|ui⟩. (17)

## Appendix B Proofs

### b.1 De–quantization of Quantum Rejection Sampling and Proof of Theorem 1

While the states in quantum computing are the fundamental objects of the computational model, quantum computers also need a complete set of operations that are capable of performing an arbitrary (unitary) transformation on input states. This means that a quantum computer has to have the ability to transform an arbitrary input unit vector into any other such vector. For the present purposes, however, it suffices to note that probability distributions are first class entities in quantum computing and that quantum computers provide a method for preparing any such distribution. As a consequence, it should come as no surprise that notation developed to describing quantum devices should also be useful for describing classical sampling algorithms.

Quantum rejection sampling is one of the most important tools used to design quantum algorithms not only because it can be used to enable exponential speedups for certain tasks, but also because it is very natural in quantum computing [14]. Let us assume that we have a (potentially un–normalized) distribution that we cannot directly prepare and also have an efficiently preparable distribution . Furthermore, let us also assume that a constant is known such that . We then prepare the state

 ∑x√Q(x)|x⟩(√P(x)Q(x)κ|1⟩+√1−P(x)Q(x)κ|0⟩). (18)

If we measure the right–most register, which we will refer to as the coin, to be one then the resultant state over the sample register is

 ∑x√P(x)∑xP(x)|x⟩:=∑x√P(x)Z|x⟩, (19)

and thus if we consider the state post–selected on successfully measuring the right most register in (18) to be then we prepare the desired distribution over the remaining register. The probability of success for this is .

Although quantum notation is used in this procedure, there is nothing inherently quantum about it as written. In fact, the exponential advantages that quantum algorithms accrued through quantum rejection sampling arise only because the distributions or cannot be efficiently computed or because the distribution cannot be efficiently sampled from. This is summarized in the following lemma.

###### Lemma 2.

Let be an efficiently computable probability distribution that can be efficiently sampled from. Assume that can be efficiently computed and then the task of drawing samples from the distribution yielded by quantum rejection sampling can be efficiently simulated by a classical computer.

###### Proof.

Let us assume that we are provided with a state as per (18). Drawing a sample from the correct distribution corresponds to measuring the coin register in (18) and conditioned on measuring the sample register is measured and the result is output as the sample from . Because the partial trace is a commutative operation, the order of these two measurements is arbitrary. Instead, we could first measure the sample register resulting and if the result is then the marginal distribution over the coin is . Now the coin register can be measured and the sample will be accepted if and only if the result is , which occurs with probability . This process is equivalent to the original quantum algorithm.

By assumption the distribution can be sampled from efficiently be a classical computer. This means that the first step in the re–ordered quantum algorithm can be efficiently simulated. Next, using the fact that and that and are efficiently computable it follows that we can efficiently simulate drawing a sample from the coin register by sampling from a Bernoulli distribution with . This process, also known as rejection sampling, is efficient and hence quantum rejection sampling can be efficiently simulated using a classical computer under these assumptions. ∎

This consequently shows that the GEQS algorithm for training deep networks given in [1] can be efficiently simulated and has a direct classical analog in the IRS algorithm. Similarly, the GEQAE algorithm in [1] can also be efficiently simulated but IRS is not a natural analog of it since GEQAE uses a manifestly quantum method (known as amplitude estimation) for inferring the expectation values needed to train the DBM. This method may be of particular importance because it can lead to quadratic reductions in the number of times the database of training vectors needs to be queries, which leads to significant cost savings for typical machine learning problems.

Despite the fact that Lemma 2 shows that a classical computer can often simulate quantum rejection sampling efficiently quantum computers can nonetheless provide huge advantages for rejection sampling. In fact, speedups relative to classical algorithms can arise from the use of quantum subroutines such as amplitude amplification to quadratically reduce the rejection rate in the sampling process [15]. However, since we focus on classical algorithms for rejection sampling we will ignore such quantum algorithms in the following.

• Proof of Theorem 1.

The proof here follows one given in [1]. The algorithm described in Lemma 2 will fail as writted if because the marginal distribution over the coin register will no longer be normalized. This can be addressed, at the price of introducing errors in the distribution, by clipping the probabilities used in the coin register to for the configurations where . Using Dirac notation, we wish to sample from the following state.

 Missing or unrecognized delimiter for \right (20)

where . The probability of measuring the coin to be is

We denote the resultant state

Now the fidelity of the marginal state that results from post–selected measurement of the coin register (i.e. accepting the sample from ) is

This shows that there exists a quantum algorithm that has the desired success probabilities and incurs error at most in the resultant distribution. The remainder of the proof then follows by the same logic as that used in Lemma 2 to show that the algorithm can be de–quantized by exchanging the order that the coin and sample registers are measured in. Thus there is an equivalent classical algorithm under the assumptions that and are efficiently computable and can be efficiently sampled from. ∎

### b.2 Proof of Corollary 1

• Proof of Corollary 1.

Let us focus on the problem of approximating the model expectation present in the gradient with respect to . The triangle inequality and Theorem 1 imply that if is the probability distribution that is obtained after rejection sampling is performed using and are the samples drawn from in the rejection sampling protocol

 ∣∣∣1NsampNsamp∑k=1δx,y(k)xixj−⟨xixj⟩model∣∣∣ ≤∣∣∣1NsampNsamp∑k=1δx,y(k)xixj−∑xixj(~P(x))+∑xixj~P(x)−⟨xixj⟩model∣∣∣ Missing or unrecognized delimiter for \biggr ≤∣∣∣∑xixj(~P(x)−δx∈{yk}/Nsamp)∣∣∣+∣∣∣∑(~P(x)−P/Z)∣∣∣. ∈O(∣∣∣∑xixj(~P(x)−δx∈{yk}/Nsamp)∣∣∣+ZD2(P/Z||Q)κAZQ). (24)

Because the units are binary the variance of is at most , which means that the sampling error can be bounded and hence (24) is

 O(1√Nsamp+ZD2(P/Z||Q)κAZQ). (25)

The exact same argument can be applied to the model average and the gradients of the biases. The conclusion is identical in each case, that the contribution to the error from sampling and using an insufficient value of is at most (25). For a connected graph, the maximum number of components of any of these vectors is . The triangle inequality and the fact that then gives us our result.

### c.1 Scaling with λ

In the main text we examined the scaling of for random RBMs with edges whose weights are drawn from for different sized graphical models. Such Boltzmann machines are not expected to be typical of those that emerge from training in the presence of regularization, which is expected to lead to substantial weight decay for large models. We examine the accuracy of rejection sampling, as a function of , for both mean–field approximations and in Figure 3. Specifically, consider an RBM with and whose weights were found by exact training and training vectors drawn from the synthetic set described in the main text with . We also take in order to simplify the comparison.

The data in Figure 3 shows that for , the mean values of the probability such that are graphically indistinguishable for both the mean–field approximation and . This is to be expected since corresponds to relatively strong regularization and it is perhaps not surprising that if the Gibbs distribution is nearly a product distribution that optimizing either or should lead to comparable results. The sum of the bad probabilities was found to fall off, for large , as for and . In contrast the asymptotic scaling for for is unclear for this regularization constant as the values of considered are insufficient to see the asymptotic behavior of the curve.

There are substantial differences between the fraction of configurations that are correctly handled for large for . As expected by Corollary 1, outperforms in the asymptotic limit. It perhaps is unsurprising that can outperform for modestly small because choosing the distribution to minimize the average value of may not minimize the tail probabilities as effectively as one may like owing to the looseness of the Markov inequality. In practice, this suggests that minimizing for even more general divergences may be useful for trading off asymptotic versus short–time performance of the sampling algorithm.

We examine the same problem but for fixed and variable in Figure 4. We note that for these networks that the value of needed to obtain of the probability mass scales roughly as for . This scaling should be taken with a grain of salt as we do not have enough data to meaningfully extrapolate to large system sizes. However, the salient feature is that there is no sign of exponential divergence despite the number of edges in the graphical model tripling.

For we notice no such trend: the sum of the good probabilities at is on average less than that observed for . This is likely because the model contains edges for and if the effect of the regularization term on is potentially less significant than it would be if the same distribution of weights were taken at where edges are present. As such it is not helpful to consider how the cost of refining the mean–field approximation into the Gibbs state scales with the size of the graphical model. More important features such as the sparsity of the graph and the presence or absence of frustration are expected to better determine the viability of these variational approximations [6].

Another interesting feature is that at the mean values of the sum of the good probabilities in the distribution that results from rejection sampling for and are within of each other if . This is despite the fact that the Hilbert space for the case is times larger and the number of edges in the model is larger. This shows that in the presence of regularization the error in variational approximations to the Gibbs distribution need not strongly depend on the size of the graph. Similar results have been noticed for mean–field state preparations in [1]. The results for is much more significant with an average difference of . We suspect these differences occur because the graphs considered are not yet large enough for weak regularization to push the Gibbs state towards an approximately unimodal distribution.

### c.2 Scaling with κA

The quantity is perhaps the most important factor for determining the viability of our method relative to contrastive divergence because it dictates the success probability of the rejection step. Here we will provide numerical evidence in small samples that illustrates that small values of provide comparable, or greater, accuracy than contrastive divergence training. For all these results we again use an equal mixture of and the uniform distribution as our instrumental distribution. We anticipate that the use of will tend to result in better results for a fixed value of .

Figure 5 shows the difference in the value of the objective functions obtained relative to those found by exactly following the ML–objective function for an RBM with and and . We observe for that data that suffices to provide a mean discrepancy that is comparable to contrastive divergence. It is important to note though that a substantial fraction of the results for even are dramatically better than the results for contrastive divergence; however, the worst cases are nearly twice as bad. The results for show evidence of saturating and by computing the values that they saturate at we see that the data agrees well with an scaling and has relatively poor agreement with powerlaw or linear scalings. This captures the observed fact that the quality of the gradients rapidly improves as is increased. In particular, for , the discrepancies between the worst case scalings and the best case scalings of the data collapse and nearly all of the examples tested were found to perform better than contrastive divergence.

These results do reveal an interesting feature of contrastive divergence, namely its consistency. The best and worst case performances seem to be tightly clustered relative to those observed for our sampling algorithm. It also seems to perform better for the first few epochs of optimization than IRS does, even in the limit of large . This illustrates that contrastive divergence is likely to remain an algorithm of choice for serial (rather than parallel) training environments where variational approximations to the Gibbs states are expected to abjectly fail.

We examine a similar case with much weaker regularization in Figure 6. There we note that much larger values of are needed to obtain good approximations to the gradient. In particular, we observe that is the first example where the best case performance beats that of contrastive divergence. The data set is otherwise qualitatively similar to that considered in Figure 5 except no evidence of training plateauing is observed within the number of epochs considered (with the exception of some of the data for ).

## Appendix D Hedging strategies

In the main text, we showed that choosing our product distribution to minimize the rather than . However, as we noted in Figure 3, the mean–field approximation may actually yield a better approximation to the Gibbs state than does if is small. The reason for this is that attempts to minimize the average ratio between the two, but in practice it may not necessarily model the high probability regions of the distribution accurately. In contrast, the mean–field distribution aims to find the distribution that is closest to the Gibbs distribution but requires a large value of to accurately model the tails of the probability distribution. This begs the question of whether choosing a different instrumental distribution that combines the best features of both distributions can be used.

A strategy proposed in [1] is to choose the instrumental distribution to be a combination of the mean–field distribution and one that captures the tail probability more effectively. In [1] they mix uniform distribution with the mean–field distribution. While this works well for small systems, it is not expected to work well for high–dimensional systems because the prediction shrinks exponentially with the dimension of the Hilbert space that the probability distribution is supported over. Instead, a more sensible approach is to combine the mean–field and in the following way

 Q(v,h)=γQMF(v,h)+(1−γ)Qα=2(v,h), (26)

for .

We examine this strategy in Figure 7 where we observe that for small the quality of the approximation yielded at is comparable to that at . This is surprising because averaging the mean–field approximation with one that is known to give a worse approximation may be expected to result in an inferior approximation. For larger values of , the hedged approximation outperforms either or individually. In fact, at the distribution with outperforms both of the distributions by roughly . This not only suggests that hedging can lead to substantial improvements but also suggests that choosing different values of to interpolate between the performance of and (mean–field) may also improve the performance of IRS for small kalues of .

## Appendix E Detailed algorithm for computing gradients

We provide a more detailed algorithm below for computing the gradient of the ML objective function using IRS. The algorithm utilizes subroutines and that provide an estimate of the probability of a given configuration of hidden and visible units. These functions will often correspond to a tractable approximation to the Gibbs distribution such as the mean–field approximation or a product distribution that minimizes the divergence with the Gibbs state. We also assume that a sampling procedure is known for these distributions.

### References

1. Nathan Wiebe, Ashish Kapoor, and Krysta M Svore. Quantum deep learning. arXiv preprint arXiv:1412.3489, 2014.
2. Geoffrey Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002.
3. Navdeep Jaitly and Geoffrey Hinton. Learning a better representation of speech soundwaves using restricted boltzmann machines. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5884–5887. IEEE, 2011.
4. SM Ali Eslami, Nicolas Heess, Christopher KI Williams, and John Winn. The shape boltzmann machine: a strong model of object shape. International Journal of Computer Vision, 107(2):155–176, 2014.
5. Max Welling and Geoffrey E Hinton. A new learning algorithm for mean field boltzmann machines. In Artificial Neural NetworksâICANN 2002, pages 351–357. Springer, 2002.
6. Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
7. Iain Murray and Zoubin Ghahramani. Bayesian learning in undirected graphical models: approximate mcmc algorithms. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 392–399. AUAI Press, 2004.
8. Tom Minka. Divergence measures and message passing. Technical report, Technical report, Microsoft Research, 2005.
9. Yoshua Bengio. Learning deep architectures for ai. Foundations and trends® in Machine Learning, 2(1):1–127, 2009.
10. Ilya Sutskever and Tijmen Tieleman. On the convergence properties of contrastive divergence. In International Conference on Artificial Intelligence and Statistics, pages 789–795, 2010.
11. Wim Wiegerinck, Tom Heskes, et al. Fractional belief propagation. Advances in Neural Information Processing Systems, pages 455–462, 2003.
12. Martin J Wainwright, Tommi S Jaakkola, and Alan S Willsky. Tree-reweighted belief propagation algorithms and approximate ml estimation by pseudo-moment matching. In Workshop on Artificial Intelligence and Statistics, volume 21, page 97. Society for Artificial Intelligence and Statistics Np, 2003.
13. Tijmen Tieleman and Geoffrey Hinton. Using fast weights to improve persistent contrastive divergence. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1033–1040. ACM, 2009.
14. Maris Ozols, Martin Roetteler, and Jérémie Roland. Quantum rejection sampling. ACM Transactions on Computation Theory (TOCT), 5(3):11, 2013.
15. David Poulin and Pawel Wocjan. Sampling from the thermal quantum gibbs state and evaluating partition functions with a quantum computer. Physical review letters, 103(22):220502, 2009.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters