# Generative training of quantum Boltzmann machines with hidden units

Nathan Wiebe
Microsoft Research
nawiebe@microsoft.com

Leonard Wossnig (corresponding author)
University College London, Rahko
leonard.wossnig@cs.ucl.ac.uk
###### Abstract

In this article we provide a method for fully quantum generative training of quantum Boltzmann machines with both visible and hidden units while using the quantum relative entropy as an objective. This is significant because prior methods were unable to do so, owing to mathematical challenges posed by the gradient evaluation. We present two novel methods for solving this problem. The first addresses it, for a class of restricted quantum Boltzmann machines with mutually commuting Hamiltonian terms on the hidden units, by using a variational upper bound on the quantum relative entropy. The second uses high-order divided-difference methods and linear combinations of unitaries to approximate the exact gradient of the relative entropy for a generic quantum Boltzmann machine. Both methods are efficient under the assumptions that Gibbs state preparation is efficient and that the Hamiltonians are given by sparse, row-computable matrices.

Preprint. Under review.

## 1 Introduction

The objective of quantum machine learning is to understand the ability of agents to learn in quantum mechanical settings biamonte2017quantum (); ciliberto2018quantum (); servedio2004equivalences (); arunachalam2018optimal (); lloyd2013quantum (); wiebe2019language (). One aspect that obstructs this goal is the fact that quantum state vectors lie in an exponentially large vector space mcclean2018barren (); schuld2018circuit (). Owing to the size of these vectors, generative models play a central role in quantum machine learning as they can be used to give concise descriptions of these complicated quantum states kieferova2016tomography (); schuld2019quantum (); romero2017quantum (); benedetti2019adversarial (). Various approaches have been put forward to solve this problem kieferova2016tomography (); romero2017quantum (); kappen2018learning (); amin2018quantum (); crawford2016reinforcement (), but to date all proposed solutions suffer from problems such as requiring classical input data, yielding exponentially small gradients, or an inability to learn with hidden units. Here we present a new approach to training quantum Boltzmann machines kieferova2016tomography (); benedetti2017quantum (); amin2018quantum () that resolves all these problems, and therefore addresses a major open problem in quantum machine learning.

Just as in the classical case, i.e., generative training on classical computers, the main goal in quantum generative training is to build a model from which one can sample a distribution over quantum state vectors that mimics some training distribution. The natural analogue of such a quantum training set is a density operator, which we denote ρ: a positive semi-definite, trace-1 Hermitian matrix that (roughly speaking) describes a probability distribution over quantum state vectors (a quantum state vector being just a unit-norm vector). The goal in quantum generative training is to find a process that generates a state σ such that some chosen distance measure between ρ and σ, e.g., the trace distance, is small. Such a task corresponds, in quantum information language, to partial tomography kieferova2016tomography () or approximate cloning chefles1999strategies ().

Boltzmann machines, a physics-inspired class of neural networks aarts1988simulated (); salakhutdinov2007restricted (); tieleman2008training (); le2008representational (); salakhutdinov2009deep (); salakhutdinov2010efficient (); lee2009convolutional (); hinton2012practical (), have found numerous applications over the last decade lee2009unsupervised (); lee2009convolutional (); mohamed2011acoustic (); srivastava2012multimodal (), and have recently gained popularity as a method to model complex quantum systems carleo2017solving (); torlai2016learning (); nomura2017restricted (). One of the features of Boltzmann machines is that they directly resemble the quantum physics that is inherent in a quantum computer. In particular, a Boltzmann machine provides an energy for every configuration of a system and then generates samples from a distribution with probabilities that vary exponentially with this energy. Indeed, the distribution it represents is just the canonical ensemble of statistical physics. The explicit model in this case is

 σ_v(H) = Tr_h(e^{−H}/Z) = Tr_h[e^{−H}] / Tr[e^{−H}], (1)

where Tr_h is the partial trace over an auxiliary subsystem known as the hidden subsystem, which serves to build correlations within the “visible” system. For classical Boltzmann machines H is an energy function, i.e., a diagonal matrix, but for quantum Boltzmann machines it is a Hermitian matrix known as the Hamiltonian, which has off-diagonal entries. Notably, if we want to simulate a quantum system then the dimension of H grows exponentially with the number of units n in the system, i.e., dim(H) = 2^n. Thus the goal of generative quantum Boltzmann training is to find a set of parameters θ (also called weights) that specify a Hamiltonian H(θ) such that σ_v(H(θ)) is close to ρ for an appropriate distance, or divergence, function. As an example, a quantum analogue of an all-visible Boltzmann machine with n_v units could take the form

 H(θ) = Σ_{n=1}^{n_v} (θ_{2n−1} σ_x^{(n)} + θ_{2n} σ_z^{(n)}) + Σ_{n>n′} θ_{(n,n′)} σ_z^{(n)} σ_z^{(n′)}. (2)

Here σ_x^{(n)} and σ_z^{(n)} are Pauli matrices acting on qubit (unit) n.
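As a concrete illustration, the model of eqs. (1)–(2) can be simulated classically for a handful of units. The numpy sketch below (all function names are ours, not from the paper) builds H(θ) for two units and forms the Gibbs state σ = e^{−H}/Tr e^{−H} by diagonalization:

```python
import numpy as np

# Pauli matrices and a helper to place a single-qubit operator on site k of n qubits.
I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1.0, -1.0]).astype(complex)

def op_on(p, k, n):
    """Tensor product with operator p on site k and identities elsewhere."""
    out = np.array([[1.0 + 0j]])
    for j in range(n):
        out = np.kron(out, p if j == k else I2)
    return out

def hamiltonian(theta, theta_zz, n):
    """H(θ) = Σ_n θ_{2n-1} σx^(n) + θ_{2n} σz^(n) + Σ_{n>n'} θ_{(n,n')} σz σz, as in eq. (2)."""
    H = np.zeros((2**n, 2**n), dtype=complex)
    for k in range(n):
        H += theta[2*k] * op_on(X, k, n) + theta[2*k+1] * op_on(Z, k, n)
    for a in range(n):
        for b in range(a):
            H += theta_zz[(a, b)] * op_on(Z, a, n) @ op_on(Z, b, n)
    return H

def gibbs_state(H):
    """σ = e^{-H}/Tr e^{-H}, via eigendecomposition of the Hermitian H."""
    lam, V = np.linalg.eigh(H)
    w = np.exp(-(lam - lam.min()))          # shift for numerical stability
    return (V * (w / w.sum())) @ V.conj().T

rng = np.random.default_rng(0)
n = 2
theta = rng.normal(size=2*n)
theta_zz = {(1, 0): rng.normal()}
sigma = gibbs_state(hamiltonian(theta, theta_zz, n))
print(np.trace(sigma).real)  # ≈ 1.0: a valid density matrix
```

The shift by the minimum eigenvalue in `gibbs_state` leaves σ unchanged but avoids overflow for large weights.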

For generative training on classical data the natural divergence to use between the input and output distributions is the KL divergence. In the case where the input is a quantum state the natural notion of distance changes to the quantum relative entropy:

 S(ρ‖σ_v) = Tr(ρ log ρ) − Tr(ρ log σ_v), (3)

which reduces to the KL divergence if ρ and σ_v are diagonal matrices, and is zero if and only if ρ = σ_v.
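The relative entropy of eq. (3) is straightforward to evaluate classically for small matrices by diagonalization. A minimal sketch (helper names are ours), which also checks the reduction to the KL divergence for diagonal states:

```python
import numpy as np

def matrix_log(rho, eps=1e-12):
    """Logarithm of a positive semi-definite Hermitian matrix via eigendecomposition."""
    lam, V = np.linalg.eigh(rho)
    return (V * np.log(np.clip(lam, eps, None))) @ V.conj().T

def relative_entropy(rho, sigma):
    """S(ρ‖σ) = Tr(ρ log ρ) − Tr(ρ log σ), as in eq. (3)."""
    return np.trace(rho @ (matrix_log(rho) - matrix_log(sigma))).real

# For diagonal (classical) states S reduces to the KL divergence,
# and S(ρ‖ρ) = 0.
p = np.diag([0.7, 0.3])
q = np.diag([0.5, 0.5])
kl = 0.7*np.log(0.7/0.5) + 0.3*np.log(0.3/0.5)
print(np.isclose(relative_entropy(p, q), kl))   # True
print(np.isclose(relative_entropy(p, p), 0.0))  # True
```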

While the relative entropy is generally difficult to compute, its gradient is straightforward to compute for Boltzmann machines with all visible units. This is because in such cases σ_v = e^{−H}/Tr[e^{−H}], and the identity log(e^{−H}/Z) = −H − log(Z)·I allows the matrix derivatives to be computed easily. However, no methods are known for the generative training of Boltzmann machines under the quantum relative entropy loss function if hidden units are present. This is because the partial trace in σ_v prevents us from simplifying the logarithm term when computing the gradient.

##### Our Contribution:

In this work we provide practical methods for training generic quantum Boltzmann machines that have both hidden and visible units. We provide two new approaches for achieving this. The first, and more efficient of the two, works by assuming a special form for the Hamiltonian that allows us to find a variational upper bound on the quantum relative entropy; using this upper bound, the derivatives are easy to compute. The second, and more general, method uses recent techniques from quantum simulation to approximate the exact expression for the gradient, using Fourier-series approximations and high-order divided-difference formulas in place of the analytic derivative. Both methods are efficient given that Gibbs state preparation is efficient, which we expect to hold in most practical cases of Boltzmann training, although it is worth noting that efficient Gibbs state preparation for arbitrary Hamiltonians would imply QMA ⊆ BQP, which is unlikely to hold.

## 2 Training Boltzmann Machines

We now present methods for training quantum Boltzmann machines.
The quantum relative entropy cost function for a quantum Boltzmann machine (QBM) with hidden units is given by

 O_ρ(H) = S(ρ‖Tr_h[e^{−H}]/Tr[e^{−H}]), (4)

where S is the quantum relative entropy as defined in eq. 3. Note that we can add a regularization term to the quantum relative entropy to penalize unnecessary quantum correlations in the model kieferova2016tomography (). In this work we aim to train the QBM with a gradient-based method. For this we are required to evaluate the gradient of the cost function, and hence the gradient of the quantum relative entropy.

In the case of an all-visible Boltzmann machine (which corresponds to a trivial hidden subsystem, i.e., σ = e^{−H}/Tr[e^{−H}]) a closed-form expression for the gradient of the quantum relative entropy is known:

 ∂O_ρ(H)/∂θ = −∂/∂θ Tr[ρ log σ], (5)

which can be simplified using log(e^{−H}/Z) = −H − log(Z)·I and Duhamel's formula to obtain the following expression for the gradient, denoting ∂_θ := ∂/∂θ,

 Tr[ρ ∂_θH] − Tr[e^{−H} ∂_θH]/Tr[e^{−H}]. (6)
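For the all-visible case the gradient formula (6) is exact, and it can be sanity-checked numerically against finite differences of −Tr[ρ log σ(θ)]. Below is a small numpy sketch with random Hermitian terms standing in for the Hamiltonian terms (all names are ours):

```python
import numpy as np

def gibbs(H):
    """σ = e^{-H}/Tr e^{-H} via eigendecomposition."""
    lam, V = np.linalg.eigh(H)
    w = np.exp(-(lam - lam.min()))
    return (V * (w / w.sum())) @ V.conj().T

def mlog(A, eps=1e-12):
    """Matrix logarithm of a positive definite Hermitian matrix."""
    lam, V = np.linalg.eigh(A)
    return (V * np.log(np.clip(lam, eps, None))) @ V.conj().T

rng = np.random.default_rng(1)
def rand_herm(d):
    A = rng.normal(size=(d, d)) + 1j*rng.normal(size=(d, d))
    return (A + A.conj().T) / 2

# Random two-qubit Hamiltonian terms P_k and a random full-rank data state ρ.
P = [rand_herm(4) for _ in range(3)]
M = rand_herm(4); M = M @ M.conj().T
rho = M / np.trace(M)

theta = rng.normal(size=3)
H = lambda t: sum(tk * Pk for tk, Pk in zip(t, P))

# Analytic gradient of −Tr[ρ log σ(θ)]:  Tr[ρ P_k] − Tr[σ P_k]  (eq. 6).
sigma = gibbs(H(theta))
grad = [np.trace(rho @ Pk).real - np.trace(sigma @ Pk).real for Pk in P]

# Central finite-difference check of the same derivative.
def obj(t):
    return -np.trace(rho @ mlog(gibbs(H(t)))).real

eps = 1e-6
for k in range(3):
    e = np.zeros(3); e[k] = eps
    fd = (obj(theta + e) - obj(theta - e)) / (2*eps)
    print(np.isclose(fd, grad[k], atol=1e-5))  # True for all k
```

The agreement reflects that log σ = −H − log(Z)·I is linear in θ apart from the scalar log-partition term, whose derivative is exactly −Tr[σ ∂_θH].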

However, the above gradient formula is not generally valid; in particular, it does not hold if we include hidden units. Allowing for these, we additionally need to trace out the hidden subsystem, which results in the marginal distribution σ_v from eq. 1. This also changes the cost function, which then takes the form described in eq. 4. Note that σ_v depends on the variables we alter during the training process, while ρ is the target density matrix, i.e., the input data. Therefore, if we want to estimate the gradient of the above, omitting the Tr(ρ log ρ) term since it is a constant, we obtain

 ∂O_ρ(H)/∂θ = −∂/∂θ Tr[ρ log σ_v], (7)

for the gradient of the objective function.

In the following, we discuss two different approaches for evaluating the gradient in eq. 7. While the first is less general, it gives an easily implementable algorithm and strong bounds based on optimizing a variational bound. The second approach is, on the other hand, applicable to any problem instance, and hence constitutes a general-purpose gradient-based optimization algorithm for relative entropy training. The no-free-lunch theorem suggests that no (good) bounds can be obtained without assumptions on the problem instance, and indeed the general algorithm exhibits potentially exponentially worse complexity. However, for many practical applications we expect this not to be the case, and in particular the result gives a generally applicable algorithm for training quantum Boltzmann machines on a quantum device, which is to the best of our knowledge the first known result of this kind.

### 2.1 Variational training for restricted Hamiltonians

Our first approach is based on optimizing a variational bound of the objective function, i.e., the quantum relative entropy, in a restricted but still practical setting. This approach gives a fast and easy-to-implement quantum algorithm which, however, is less generally applicable due to the requirement of certain input assumptions. These assumptions are important, as several identities of scalar calculus fail when we pass to matrix functional analysis (which applies in particular to the gradient of the quantum relative entropy), and we require them in order to obtain a feasible analytical solution.

We express the Hamiltonian in this case as

 H=Hv+Hh+Hint, (8)

i.e., a decomposition of the Hamiltonian into a part acting on the visible layers, a part acting on the hidden layers, and a third interaction Hamiltonian that creates correlations between the two. In particular, we further assume for simplicity that there are two sets of operators {v_k} and {h_k} such that

 H_v = Σ_{k=1}^{W_v} θ_k v_k ⊗ I,  H_h = Σ_{k=W_v+1}^{W_v+W_h} θ_k I ⊗ h_k,
 H_int = Σ_{k=W_v+W_h+1}^{W_v+W_h+W_int} θ_k v_k ⊗ h_k,  [h_k, h_j] = 0 ∀ j,k, (9)

which implies that the Hamiltonian can in general be expressed as

 H = Σ_{k=1}^{D} θ_k v_k ⊗ h_k. (10)

We break up the Hamiltonian into this form to emphasize the qualitative difference between the types of terms that can appear in this model. Note that we generally assume throughout this article that the v_k and h_k are unitary operators, which is typically the case.

The intention of the form of the Hamiltonian in (9) is to force the non-commuting terms, i.e., terms v_j, v_k for which the commutator [v_j, v_k] ≠ 0, to act only on the visible units of the model. In contrast, only commuting Hamiltonian terms act on the hidden register. Since the hidden units commute, the eigenvalues and eigenvectors for the Hamiltonian can be expressed as

 H |v_h⟩ ⊗ |h⟩ = λ_{v_h,h} |v_h⟩ ⊗ |h⟩, (11)

where both the conditional eigenvectors and eigenvalues for the visible subsystem are functions of the eigenvector |h⟩ obtained in the hidden register, and we hence denote them as |v_h⟩ and λ_{v_h,h} respectively. This allows the hidden units to select between eigenbases in which to interpret the input data while also penalizing portions of the accessible Hilbert space that are not supported by the training data. However, since the hidden units commute they cannot be used to construct a non-diagonal eigenbasis. This division of labor between the visible and hidden layers not only helps build intuition about the model but also opens up the possibility of more efficient training algorithms that exploit this fact.
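The division of labor described above can be made concrete numerically: with commuting (here diagonal) hidden operators, the spectrum of H is exactly the union of the spectra of the conditional visible Hamiltonians, one per hidden basis state, as in eq. (11). A toy sketch (the operator choices and names are ours):

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1.0, -1.0]).astype(complex)

rng = np.random.default_rng(2)
theta = rng.normal(size=2)
# One visible and one hidden qubit; the hidden operators are diagonal,
# hence mutually commuting, as required by eq. (9).
v_ops = [X, Z]
h_ops = [Z, I2]
H = sum(t * np.kron(v, h) for t, v, h in zip(theta, v_ops, h_ops))

# For each hidden basis state |h⟩ the conditional visible Hamiltonian is
# H^(h) = Σ_k θ_k (h_k)_{hh} v_k, and its eigenpairs give |v_h⟩ ⊗ |h⟩.
cond_spectra = []
for h in range(2):
    Hh = sum(t * hk[h, h] * vk for t, vk, hk in zip(theta, v_ops, h_ops))
    cond_spectra.extend(np.linalg.eigvalsh(Hh))

full = np.sort(np.linalg.eigvalsh(H))
print(np.allclose(full, np.sort(np.real(cond_spectra))))  # True
```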

For the first result we rely on a variational bound on the entropy in order to train the quantum Boltzmann machine weights for a Hamiltonian of the form given in (10). We can express this variational bound compactly in terms of a thermal expectation against a fictitious thermal probability distribution. We define this expectation below.

###### Definition 1.

Let H̃_h be the Hamiltonian acting, conditioned on the visible subspace, only on the hidden subsystem of the Hamiltonian H. Then we define the expectation value over the marginal distribution over the hidden variables as

 𝔼_h(⋅) = Σ_h (⋅) e^{−E_h} / Σ_{h′} e^{−E_{h′}}, (12)

where E_h denotes the energy that H̃_h assigns to the hidden configuration h.

Using this we derive an upper bound on the quantum relative entropy in section A.4.1 of the supplemental material, which leads to the following lemma.

###### Lemma 2.

Assume that the Hamiltonian H of the quantum Boltzmann machine takes the form described in eq. 10, where the θ_k are the parameters which determine the interaction strength and the v_k ⊗ h_k are unitary operators. Furthermore, let E_{h,k} be the eigenvalues of the hidden subsystem, and let 𝔼_h be as given by Definition 1, i.e., the expectation value over the effective Boltzmann distribution of the visible layer, with H̃_h = Σ_k θ_k Tr[ρ v_k] h_k. Then a variational upper bound S̃ of the objective function, meaning that S̃(ρ|H) ≥ S(ρ‖σ_v), is given by

 S̃(ρ|H) := Tr[ρ log ρ] + 𝔼_h[Σ_k θ_k E_{h,k} Tr[ρ v_k]] + log Tr[e^{−H}], (13)

where σ_v is the corresponding Gibbs distribution for the visible units.

The proof that (13) is a variational bound proceeds in two steps. First, we note that for any probability distribution {α_h}

 Tr[ρ log(Σ_{h=1}^{N} e^{−Σ_k E_{h,k}θ_k v_k})] = Tr[ρ log(Σ_{h=1}^{N} α_h e^{−Σ_k E_{h,k}θ_k v_k}/α_h)]. (14)

We then apply Jensen's inequality and minimize the result over all α_h. This not only verifies that S̃ ≥ S but also yields a variational bound. The details of the proof can be found in section A.4.1 of the supplemental material.

Using the above assumptions we can obtain the gradient of the variational upper bound of the relative entropy which is derived in the section A.4.2 of the supplemental material and summarized in lemma 3.

###### Lemma 3.

Assume that the Hamiltonian H of the quantum Boltzmann machine takes the form described in eq. 10, where the θ_k are the parameters which determine the interaction strength and the v_k ⊗ h_k are unitary operators. Furthermore, let E_{h,k} be the eigenvalues of the hidden subsystem, and let 𝔼_h be as given by Definition 1, i.e., the expectation value over the effective Boltzmann distribution of the visible layer, with H̃_h = Σ_k θ_k Tr[ρ v_k] h_k. Then the derivatives of S̃ with respect to the parameters of the Boltzmann machine are given by

 ∂S̃(ρ|H)/∂θ_p = 𝔼_h[Tr[ρ E_{h,p} v_p]] − Tr[(∂H/∂θ_p) e^{−H}/Z]. (15)

Notably, if we consider no interactions between the visible and the hidden layer, then the gradient above indeed reduces to that of the all-visible Boltzmann machine, which was treated in kieferova2016tomography (), resulting in the gradient

 Tr[ρ ∂_{θ_p}H] − Tr[(e^{−H}/Z) ∂_{θ_p}H], (16)

under our assumption on the form of H, for which ∂_{θ_p}H = v_p ⊗ h_p.

From Lemma 3 we know the form of the derivatives of the relative entropy w.r.t. any parameter via eq. 15. Note that we can easily evaluate the second term by preparing the Gibbs state σ = e^{−H}/Z and then evaluating the expectation value of the operator ∂H/∂θ_p w.r.t. this Gibbs state, using amplitude estimation in combination with the Hadamard test aharonov2009polynomial (). This is a standard procedure and we describe it in algorithm 4 in section A.3 of the supplemental material.

The computational complexity of this procedure is easy to evaluate. If T_Gibbs is the query complexity of the Gibbs state preparation, the query complexity of the whole algorithm, including the phase estimation step, is O(T_Gibbs/ε̃) for an ε̃-accurate estimate from phase estimation. Next, we derive an algorithm to evaluate the first term of eq. 15, which requires a more involved process. For this, note first that we can evaluate each term Tr[ρ E_{h,p} v_p] independently, and individually for all p, i.e., for all dimensions of the gradient. This can be done via the Hadamard test, which we recapitulate in section A.3 of the supplemental material, assuming v_p is unitary. More generally, for non-unitary v_p we could evaluate this term using a linear combination of unitary operations. Therefore, the remaining task is to evaluate the expectation 𝔼_h in (15), which reduces to sampling hidden configurations according to the distribution e^{−E_h}/Σ_{h′}e^{−E_{h′}}, recalling that H̃_h applied to the hidden subsystem has eigenvalues E_h. For this we need to be able to create a Gibbs distribution for the effective Hamiltonian H̃_h, which contains only W_h terms and can hence be evaluated efficiently as long as W_h is small, which we can generally assume to be true. In order to sample according to this distribution, we first evaluate the factors Tr[ρ v_k] in the sum over k via the Hadamard test, and then use these in order to implement the Gibbs distribution for the Hamiltonian

 H̃_h = Σ_k θ_k Tr[ρ v_k] h_k. (17)

The algorithm is summarized in Algorithm 1.
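The classical content of Definition 1 and the first term of eq. (15) can be sketched for a toy model as follows. The operator choices and helper names are ours, and on a quantum device the traces below would instead be estimated with the Hadamard test:

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1.0, -1.0]).astype(complex)

rng = np.random.default_rng(3)
# Toy model: one visible and one hidden qubit, H = Σ_k θ_k v_k ⊗ h_k (eq. 10),
# with diagonal (hence commuting) hidden operators h_k.
theta = rng.normal(size=2)
v_ops = [X, Z]
h_ops = [Z, I2]

M = rng.normal(size=(2, 2)) + 1j*rng.normal(size=(2, 2))
M = M @ M.conj().T
rho = M / np.trace(M)                # random visible data state

# Effective hidden Hamiltonian H̃_h = Σ_k θ_k Tr[ρ v_k] h_k (diagonal here).
Ht = sum(t * np.trace(rho @ v).real * h for t, v, h in zip(theta, v_ops, h_ops))
E = np.diag(Ht).real                  # hidden energies E_h
p = np.exp(-(E - E.min()))
p /= p.sum()                          # Boltzmann weights of Definition 1

# First term of the gradient (eq. 15): E_h[ Tr[ρ E_{h,p} v_p] ].
for k, (v, h) in enumerate(zip(v_ops, h_ops)):
    Ehk = np.diag(h).real             # eigenvalues E_{h,k}
    term1 = np.sum(p * Ehk) * np.trace(rho @ v).real
    print(f"theta_{k}: first gradient term = {term1:.4f}")
```

On a quantum computer the factors Tr[ρ v_k] come from Hadamard tests and the distribution p from Gibbs state preparation for H̃_h; here both are computed exactly.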

The algorithm is built on three main subroutines. The first is Gibbs state preparation, a known routine which we recapitulate in Theorem 12 in the supplemental material. The two remaining routines are the Hadamard test and amplitude estimation, both well-established quantum algorithms. The Hadamard test allows us to estimate the probability of the |0⟩ outcome, which is concretely given by

 Pr(0) = ½(1 + Re⟨ψ_Gibbs| (h_k ⊗ I) |ψ_Gibbs⟩) = ½(1 + Σ_h e^{−E_h} E_{h,k}/Z), (18)

i.e., from Pr(0) we can easily infer an estimate of 𝔼_h[E_{h,k}] up to precision ε for all W_h terms, since the last part is equivalent to 𝔼_h[E_{h,k}]. To speed up the evaluation of the probability Pr(0), we use amplitude estimation. We recapitulate this procedure in detail in the supplemental material in section A.2. In this case, we let R = (2|0⟩⟨0| − I) ⊗ I be the reflector, where 2|0⟩⟨0| − I is just the Pauli σ_z matrix up to a global phase, and let Q be the amplitude estimation iterate built from R and the reflection about |ψ⟩, for |ψ⟩ being the state after the Hadamard test prior to the measurement. The operator Q then has eigenvalues e^{±2iθ} with sin²(θ) = Pr(0), where Pr(0) is the probability of measuring the ancilla qubit in the |0⟩ state. Let now T_Gibbs be the query complexity for preparing the purified Gibbs state (c.f. eq. (51) in the supplemental material). We can then perform phase estimation with precision ε for the operator Q requiring O(T_Gibbs/ε) queries to the oracle for H.
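The Hadamard test at the heart of this step can be simulated directly on statevectors to confirm the probability formula Pr(0) = ½(1 + Re⟨ψ|U|ψ⟩); the function below is our own illustrative sketch, not the paper's algorithm:

```python
import numpy as np

def hadamard_test_prob0(U, psi):
    """Statevector simulation of the Hadamard test for Re⟨ψ|U|ψ⟩.
    Register layout: ancilla ⊗ system. Returns Pr(ancilla = 0)."""
    d = len(psi)
    H1 = np.array([[1, 1], [1, -1]]) / np.sqrt(2)        # Hadamard on the ancilla
    cU = np.block([[np.eye(d), np.zeros((d, d))],         # controlled-U
                   [np.zeros((d, d)), U]])
    state = np.kron(H1 @ np.array([1, 0]), psi)           # |+⟩ ⊗ |ψ⟩
    state = np.kron(H1, np.eye(d)) @ (cU @ state)         # controlled-U, then H
    return np.sum(np.abs(state[:d])**2)                   # ancilla-0 amplitudes

# Check against ½(1 + Re⟨ψ|U|ψ⟩) for a random unitary and state.
rng = np.random.default_rng(4)
A = rng.normal(size=(4, 4)) + 1j*rng.normal(size=(4, 4))
U, _ = np.linalg.qr(A)
psi = rng.normal(size=4) + 1j*rng.normal(size=4)
psi /= np.linalg.norm(psi)
print(np.isclose(hadamard_test_prob0(U, psi),
                 0.5*(1 + np.real(psi.conj() @ U @ psi))))  # True
```

Amplitude estimation replaces repeated sampling of this probability with phase estimation on the corresponding reflection operators, giving the quadratically better ε-dependence used in the complexity statements.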

In section A.4.3 of the supplemental material we analyse the runtime and error of the above algorithm. The result is summarized in Theorem 4.

###### Theorem 4.

Assume that the Hamiltonian H of the quantum Boltzmann machine takes the form described in eq. 10, where the θ_k are the parameters which determine the interaction strength and the v_k ⊗ h_k are unitary operators. Furthermore, let E_{h,k} be the eigenvalues of the hidden subsystem, let 𝔼_h be as given by Definition 1, i.e., the expectation value over the effective Boltzmann distribution of the visible layer with H̃_h = Σ_k θ_k Tr[ρ v_k] h_k, suppose that H has bounded spectral norm ‖H‖, and let H be d-sparse. Then an estimate G of the gradient ∇S̃ can be computed, for any ϵ > 0, such that

 ‖G − ∇S̃‖_max ≤ ϵ, (19)

with

 Õ(√ξ D ‖θ‖₁ d n² / ϵ), (20)

queries to the oracle and with probability at least 2/3, where ‖θ‖₁ is the sum of the absolute values of the parameters of the Hamiltonian, ξ := max[Z/z, Z̃/z̃], and z, z̃ are known lower bounds on the partition functions for the Gibbs states of H and H̃_h respectively.

Theorem 4 shows that the computational complexity of estimating the gradient grows the closer we get to a pure state, since for a pure state the inverse temperature diverges and therefore the norm ‖θ‖₁ → ∞, as the Hamiltonian depends on the parameters and hence on the type of state we describe. In such cases we would typically rely on alternative techniques. However, this cannot be generically improved, because otherwise we would be able to find minimum-energy configurations using a number of queries in o(√N), which would violate lower bounds for Grover's search. More precise statements of the complexity will therefore require further restrictions on the classes of problem Hamiltonians, to avoid the lower bounds imposed by Grover's search and similar algorithms.

### 2.2 Gradient based training for general Hamiltonians

Our second scheme for training a quantum Boltzmann machine is generally applicable and does not require a particular form of the Hamiltonian, as the first approach did. We use higher-order divided-difference estimates of the relative entropy gradient based on function-approximation schemes. For this we generate differentiation formulas by differentiating an interpolant. The idea is straightforward: first we construct an interpolating polynomial from the data; second, an approximation of the derivative at any point is obtained by directly differentiating this interpolant. Concretely, we perform the following steps. We first approximate the logarithm via a Fourier-like approximation, i.e., log σ_v ↦ log_{K,M} σ_v, where the subscripts indicate the level of truncation, similar to van2017quantum (). This yields a Fourier-like series in terms of σ_v, i.e., Σ_m c_m e^{imπσ_v}.
Next, we need to evaluate the gradient of this series. Taking the derivative yields many terms of the form

 ∫₀¹ ds e^{ismπσ_v} (∂σ_v/∂θ) e^{i(1−s)mπσ_v}, (21)

as a result of Duhamel's formula for the derivative of exponentials of operators (c.f. Sec. 9 of the supplemental material). Each term in this expansion can furthermore be evaluated separately via a sampling procedure, since the integral in eq. 21 can be approximated by sampling the parameter s uniformly at random and evaluating the integrand at the sampled points. Furthermore, since we only have a logarithmic number of terms, we can combine the results of the individual terms via classical postprocessing once we have evaluated the traces.
Next, we apply a divided-difference scheme to approximate the gradient term ∂σ_v/∂θ, which results in an interpolation polynomial, of degree set by the number of points at which we evaluate the function, which we can evaluate efficiently.
However, evaluating these terms is still not trivial. The final step hence consists of implementing a routine which allows us to evaluate these terms on a quantum device. In order to do so, we once again make use of the Fourier series approach. This time we take the simple idea of approximating the exponentials of the density operator, i.e., the terms e^{imπσ_v}, by a truncated series in σ_v itself, which we can implement conveniently via sample-based Hamiltonian simulation lloyd2014quantum (); kimmel2017hamiltonian ().
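The divided-difference idea, differentiating an interpolant instead of the function itself, can be illustrated on a smooth scalar stand-in for θ ↦ Tr[ρ log σ_v(θ)]; the helper below is our own sketch built on numpy's polynomial routines:

```python
import numpy as np

def interpolant_derivative(f, x0, mu=4, h=0.05):
    """Approximate f'(x0) by fitting the interpolating polynomial of f on
    2*mu + 1 equispaced nodes centered at x0 and differentiating it there
    (the divided-difference idea used for the gradient estimate)."""
    t = h * np.arange(-mu, mu + 1)            # nodes relative to x0
    coeffs = np.polyfit(t, f(x0 + t), deg=2*mu)  # exact interpolation here
    return np.polyder(coeffs)[-1]             # p'(0), i.e. the derivative at x0

# Sanity check on a smooth function with a known derivative.
f = lambda x: np.log(2 + np.sin(x))
df = lambda x: np.cos(x) / (2 + np.sin(x))
x0 = 0.7
print(abs(interpolant_derivative(f, x0) - df(x0)) < 1e-6)  # True
```

Centering the nodes at x0 keeps the fit well conditioned and makes the derivative at the center a high-order (here 8th-order) finite-difference formula.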
Following these steps we obtain the expression in Eq. 91 of the supplemental material. The real part of

 Σ_m Σ_{m′} c_m c̃_{m′} Σ_j 𝒟_j ∫₀¹ ds Tr[ρ e^{isπmσ_v/2} e^{iπm′σ_v(θ_j)/2} e^{i(1−s)πmσ_v/2}] (22)

then approximates ∂_θ Tr[ρ log σ_v] with at most ϵ error, where the 𝒟_j are the divided-difference coefficients of the derivative of the interpolation polynomial, and the c_m, c̃_{m′} are coefficients of the approximation polynomials; both can be evaluated efficiently classically. We can evaluate each term in the sum separately and then combine the results via classical post-processing, i.e., using the quantum computer only to evaluate the terms containing the trace.

The main challenge for the algorithmic evaluation is hence to compute terms of the form

 Tr[ρ e^{isπmσ_v/2} e^{iπm′σ_v(θ_j)/2} e^{i(1−s)πmσ_v/2}]. (23)

Evaluating this expression is done through Algorithm 2, which relies on two established subroutines, namely sample-based Hamiltonian simulation lloyd2014quantum (); kimmel2017hamiltonian () and the Hadamard test. Note that the sample-based Hamiltonian simulation approach introduces an additional error in trace norm, which we also need to take into account in the analysis. In section A.5 of the supplemental material we derive the following guarantees for Algorithm 2.
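For very small systems, trace terms of the form (23) can be computed exactly with dense linear algebra, which is useful for checking a quantum implementation; the sketch below uses our own helper names and a random stand-in for σ_v(θ_j):

```python
import numpy as np

def expmh(A, t):
    """e^{i t A} for Hermitian A, via eigendecomposition (exact unitary)."""
    lam, V = np.linalg.eigh(A)
    return (V * np.exp(1j * t * lam)) @ V.conj().T

rng = np.random.default_rng(5)
def rand_density(d):
    M = rng.normal(size=(d, d)) + 1j*rng.normal(size=(d, d))
    M = M @ M.conj().T
    return M / np.trace(M)

rho = rand_density(4)
sigma_v = rand_density(4)
sigma_shift = rand_density(4)   # stand-in for σ_v(θ_j) at a shifted parameter

# One term of the form Tr[ρ e^{isπmσ_v/2} e^{iπm'σ_v(θ_j)/2} e^{i(1-s)πmσ_v/2}].
s, m, mp = 0.3, 2, 1
term = np.trace(rho @ expmh(sigma_v, s*np.pi*m/2)
                    @ expmh(sigma_shift, np.pi*mp/2)
                    @ expmh(sigma_v, (1 - s)*np.pi*m/2))
print(abs(term) <= 1.0 + 1e-9)  # True: |Tr[ρ U]| ≤ 1 for any unitary U
```

On a quantum device, each unitary factor is realized approximately through sample-based Hamiltonian simulation using copies of σ_v, and the trace is estimated via the Hadamard test rather than computed exactly.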

###### Theorem 5.

Let ρ and σ_v be two density matrices, suppose we have access to an oracle O_H that computes the locations of the non-zero matrix elements in each row, and their values, for the d-sparse Hamiltonian H (as per berry2007efficient ()) and an oracle O_ρ which returns copies of the purified density matrix of the data ρ, and let ϵ > 0 be an error parameter. With probability at least 2/3 we can obtain an estimate G of the gradient w.r.t. θ of the relative entropy such that

 ‖∇_θ Tr[ρ log σ_v] − G‖_max ≤ ϵ, (24)

with

 Õ(√(Nz) D ‖H(θ)‖ d μ⁵ / (γ ϵ³)), (25)

queries to O_H and O_ρ, where N = 2^{n_v+n_h}, for n_v being the number of visible units and n_h being the number of hidden units, and

 ~O(poly(γ,nv,nh,log(1/ϵ)))

classical precomputation.

In order to obtain the bounds in Theorem 5, we decompose the total error into the errors incurred at each step of the approximation scheme:

 |∂_θ Tr[ρ log σ_v] − ∂_θ Tr[ρ logs_{K₁,M₁}(σ̃_v)]| ≤ Σ_i σ_i(ρ) · ‖∂_θ[log σ_v − logs_{K₁,M₁}(σ̃_v)]‖
  ≤ Σ_i σ_i(ρ) · (‖∂_θ[log σ_v − log_{K₁,M₁}(σ_v)]‖ + ‖∂_θ[log_{K₁,M₁}(σ_v) − log_{K₁,M₁}(σ̃_v)]‖ + ‖∂_θ[log_{K₁,M₁}(σ̃_v) − logs_{K₁,M₁}(σ̃_v)]‖). (26)

Then, bounding each term separately and adjusting the parameters to obtain an overall error of ϵ yields the above result. We are hence able to use this procedure to efficiently obtain gradient estimates for a QBM with hidden units, while making minimal assumptions on the input data.

## 3 Conclusion

Generative models play an important role in quantum computing as they yield concise models for complex quantum states that have no known a priori structure. In this article, we solve an outstanding problem in the field: the previous inability to train quantum generative models to minimize the quantum relative entropy (the analogue of the KL-divergence) between the input training set and the output quantum distribution for quantum devices with hidden units. The inability to handle hidden units, for models such as the quantum Boltzmann machine, was a substantial drawback.

Our work showed that, given an efficient subroutine for preparing Gibbs states and an efficient algorithm for computing the matrix elements of the Hamiltonian, one can efficiently train a quantum Boltzmann machine. Specifically, we provide two quantum algorithms for training such devices. The first assumes that the Hamiltonian terms acting on the hidden units are mutually commuting and relies on optimizing a variational bound on the relative entropy, whereas the second method is completely general and is based on quantum divided-difference methods and Fourier techniques. In fact, this approach is sufficiently general that similar ideas could be used to train models where the input data density operator is not a thermal state. This would allow much more general, and therefore potentially more powerful, models to be trained without necessitating Gibbs state preparation. We observe that the first training method requires polynomially fewer queries in both the number of units and the error tolerance, making it much more practical but much less general.

A number of open problems remain. First, while we show upper bounds on the query complexity of training Boltzmann machines, lower bounds have not been demonstrated. Consequently, we do not know whether the linear scaling in ‖θ‖₁ is optimal, as it is in Hamiltonian simulation. However, linear scaling in the number of parameters D is unlikely to be optimal, because of recent results on quantum gradient estimation which only require Õ(√D) queries gilyen2019optimizing ().

A related issue surrounding this work involves the complexity of performing the Gibbs state preparation. While the method we use in the text scales as Õ(√(N/Z)) chowdhury2016quantum (), other methods exist that potentially yield better scaling in certain circumstances yung2012quantum (); van2017quantum (). Continuing to find better methods for preparing Gibbs states will likely be a vital task in making the training of QBMs practical on near-term quantum devices. While they do not come with theoretical bounds, recent heuristic approaches inspired by ideas from quantum thermodynamics and other physical phenomena may be useful for making the constant factors involved in the state preparation process palatable.

By including all these optimizations it is our hope that quantum Boltzmann machines may not be just a theoretical tool that can one day be used to model quantum states but rather an experimental tool that will be useful for modeling quantum or classical data sets in near term quantum hardware.

## References

• (1) Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe, and Seth Lloyd. Quantum machine learning. Nature, 549(7671):195, 2017.
• (2) Carlo Ciliberto, Mark Herbster, Alessandro Davide Ialongo, Massimiliano Pontil, Andrea Rocchetto, Simone Severini, and Leonard Wossnig. Quantum machine learning: a classical perspective. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474(2209):20170551, 2018.
• (3) Rocco A Servedio and Steven J Gortler. Equivalences and separations between quantum and classical learnability. SIAM Journal on Computing, 33(5):1067–1092, 2004.
• (4) Srinivasan Arunachalam and Ronald De Wolf. Optimal quantum sample complexity of learning algorithms. The Journal of Machine Learning Research, 19(1):2879–2878, 2018.
• (5) Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum algorithms for supervised and unsupervised machine learning. arXiv preprint arXiv:1307.0411, 2013.
• (6) Nathan Wiebe, Alex Bocharov, Paul Smolensky, Krysta Svore, and Matthias Troyer. Quantum language processing. arXiv preprint arXiv:1902.05162, 2019.
• (7) Jarrod R McClean, Sergio Boixo, Vadim N Smelyanskiy, Ryan Babbush, and Hartmut Neven. Barren plateaus in quantum neural network training landscapes. Nature communications, 9(1):4812, 2018.
• (8) Maria Schuld, Alex Bocharov, Krysta Svore, and Nathan Wiebe. Circuit-centric quantum classifiers. arXiv preprint arXiv:1804.00633, 2018.
• (9) Mária Kieferová and Nathan Wiebe. Tomography and generative training with quantum Boltzmann machines. Phys. Rev. A, 96:062327, Dec 2017.
• (10) Maria Schuld and Nathan Killoran. Quantum machine learning in feature Hilbert spaces. Physical review letters, 122(4):040504, 2019.
• (11) Jonathan Romero, Jonathan P Olson, and Alan Aspuru-Guzik. Quantum autoencoders for efficient compression of quantum data. Quantum Science and Technology, 2(4):045001, 2017.
• (12) Marcello Benedetti, Edward Grant, Leonard Wossnig, and Simone Severini. Adversarial quantum circuit learning for pure state approximation. New Journal of Physics, 21(4):043023, 2019.
• (13) Hilbert J Kappen. Learning quantum models from quantum or classical data. arXiv preprint arXiv:1803.11278, 2018.
• (14) Mohammad H Amin, Evgeny Andriyash, Jason Rolfe, Bohdan Kulchytskyy, and Roger Melko. Quantum Boltzmann machine. Physical Review X, 8(2):021050, 2018.
• (15) Daniel Crawford, Anna Levit, Navid Ghadermarzy, Jaspreet S Oberoi, and Pooya Ronagh. Reinforcement learning using quantum Boltzmann machines. arXiv preprint arXiv:1612.05695, 2016.
• (16) Marcello Benedetti, John Realpe-Gómez, Rupak Biswas, and Alejandro Perdomo-Ortiz. Quantum-assisted learning of hardware-embedded probabilistic graphical models. Physical Review X, 7(4):041052, 2017.
• (17) Anthony Chefles and Stephen M Barnett. Strategies and networks for state-dependent quantum cloning. Physical Review A, 60(1):136, 1999.
• (18) Emile Aarts and Jan Korst. Simulated annealing and boltzmann machines. 1988.
• (19) Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. Restricted boltzmann machines for collaborative filtering. In Proceedings of the 24th international conference on Machine learning, pages 791–798. ACM, 2007.
• (20) Tijmen Tieleman. Training restricted boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th international conference on Machine learning, pages 1064–1071. ACM, 2008.
• (21) Nicolas Le Roux and Yoshua Bengio. Representational power of restricted boltzmann machines and deep belief networks. Neural computation, 20(6):1631–1649, 2008.
• (22) Ruslan Salakhutdinov and Geoffrey Hinton. Deep boltzmann machines. In Artificial intelligence and statistics, pages 448–455, 2009.
• (23) Ruslan Salakhutdinov and Hugo Larochelle. Efficient learning of deep boltzmann machines. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 693–700, 2010.
• (24) Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th annual international conference on machine learning, pages 609–616. ACM, 2009.
• (25) Geoffrey E Hinton. A practical guide to training restricted Boltzmann machines. In Neural networks: Tricks of the trade, pages 599–619. Springer, 2012.
• (26) Honglak Lee, Peter Pham, Yan Largman, and Andrew Y Ng. Unsupervised feature learning for audio classification using convolutional deep belief networks. In Advances in neural information processing systems, pages 1096–1104, 2009.
• (27) Abdel-rahman Mohamed, George E Dahl, and Geoffrey Hinton. Acoustic modeling using deep belief networks. IEEE transactions on audio, speech, and language processing, 20(1):14–22, 2011.
• (28) Nitish Srivastava and Ruslan R Salakhutdinov. Multimodal learning with deep Boltzmann machines. In Advances in neural information processing systems, pages 2222–2230, 2012.
• (29) Giuseppe Carleo and Matthias Troyer. Solving the quantum many-body problem with artificial neural networks. Science, 355(6325):602–606, 2017.
• (30) Giacomo Torlai and Roger G Melko. Learning thermodynamics with Boltzmann machines. Physical Review B, 94(16):165134, 2016.
• (31) Yusuke Nomura, Andrew S Darmawan, Youhei Yamaji, and Masatoshi Imada. Restricted Boltzmann machine learning for solving strongly correlated quantum systems. Physical Review B, 96(20):205152, 2017.
• (32) Dorit Aharonov, Vaughan Jones, and Zeph Landau. A polynomial quantum algorithm for approximating the Jones polynomial. Algorithmica, 55(3):395–421, 2009.
• (33) Joran Van Apeldoorn, András Gilyén, Sander Gribling, and Ronald de Wolf. Quantum SDP-solvers: Better upper and lower bounds. In Foundations of Computer Science (FOCS), 2017 IEEE 58th Annual Symposium on, pages 403–414. IEEE, 2017.
• (34) Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum principal component analysis. Nature Physics, 10(9):631, 2014.
• (35) Shelby Kimmel, Cedric Yen-Yu Lin, Guang Hao Low, Maris Ozols, and Theodore J Yoder. Hamiltonian simulation with optimal sample complexity. npj Quantum Information, 3(1):13, 2017.
• (36) Dominic W Berry, Graeme Ahokas, Richard Cleve, and Barry C Sanders. Efficient quantum algorithms for simulating sparse Hamiltonians. Communications in Mathematical Physics, 270(2):359–371, 2007.
• (37) András Gilyén, Srinivasan Arunachalam, and Nathan Wiebe. Optimizing quantum optimization algorithms via faster quantum gradient computation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1425–1444. SIAM, 2019.
• (38) Anirban Narayan Chowdhury and Rolando D Somma. Quantum algorithms for Gibbs sampling and hitting-time estimation. arXiv preprint arXiv:1603.02940, 2016.
• (39) Man-Hong Yung and Alán Aspuru-Guzik. A quantum–quantum Metropolis algorithm. Proceedings of the National Academy of Sciences, 109(3):754–759, 2012.
• (40) Howard E Haber. Notes on the matrix exponential and logarithm. 2018.
• (41) Nicholas J Higham. Functions of matrices: theory and computation, volume 104. SIAM, 2008.
• (42) Gilles Brassard, Peter Hoyer, Michele Mosca, and Alain Tapp. Quantum amplitude amplification and estimation. Contemporary Mathematics, 305:53–74, 2002.
• (43) LD Landau and EM Lifshitz. Statistical physics, vol. 5. Course of theoretical physics, 30, 1980.
• (44) David Poulin and Pawel Wocjan. Sampling from the thermal quantum Gibbs state and evaluating partition functions with a quantum computer. Physical review letters, 103(22):220502, 2009.

## Appendix A Supplemental material

### a.1 Mathematical preliminaries

While computing the gradient of the average log-likelihood is a straightforward task when training ordinary Boltzmann machines, finding the gradient of the quantum relative entropy is much harder. The reason for this is that, in general, the Hamiltonian $H(\theta)$ does not commute with its derivative $\partial_\theta H(\theta)$. This means that the ordinary rules commonly used in calculus for finding the derivative no longer hold. One important example that we will use repeatedly is Duhamel's formula:

 $\partial_\theta e^{H(\theta)}=\int_0^1 \mathrm{d}s\, e^{sH(\theta)}\,\partial_\theta H(\theta)\, e^{(1-s)H(\theta)}.$ (27)

This formula can be easily proven by expanding the operator exponential in a Trotter–Suzuki expansion with $r$ time-slices, differentiating the result, and then taking the limit as $r\to\infty$. However, the relative complexity of this expression compared to what would be expected from the product rule serves as an important reminder that computing the gradient is not a trivial exercise. A similar formula also exists for the logarithm, as shown below.
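Duhamel's formula is easy to verify numerically. The following sketch (our illustration, not part of the original derivation) compares a midpoint-rule discretization of the integral in (27) against a central finite difference of the matrix exponential, and also shows that the naive product-rule answer fails when the operator does not commute with its derivative:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
n = 4

# Random Hermitian H(theta) = H0 + theta * V with [H0, V] != 0 in general.
H0 = rng.standard_normal((n, n)); H0 = (H0 + H0.T) / 2
V = rng.standard_normal((n, n)); V = (V + V.T) / 2
H = lambda t: H0 + t * V  # dH/dtheta = V

theta = 0.3

# Midpoint-rule discretization of the Duhamel integral (27).
m = 1000
duhamel = sum(
    expm(s * H(theta)) @ V @ expm((1 - s) * H(theta))
    for s in (np.arange(m) + 0.5) / m
) / m

# Central finite difference as an independent reference.
eps = 1e-6
fd = (expm(H(theta + eps)) - expm(H(theta - eps))) / (2 * eps)

print(np.max(np.abs(duhamel - fd)))             # close to zero
# Naive "product rule" V @ expm(H) disagrees since V and H do not commute.
print(np.max(np.abs(V @ expm(H(theta)) - fd)))  # generally not small
```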

Similarly, because we are working with functions of matrices, we also need to work with a notion of monotonicity. For some of our approximations to hold we will further need a notion of concavity (in order to use Jensen's inequality). These notions are defined below.

###### Definition 6 (Operator monotonicity).

A function $f$ is operator monotone with respect to the semidefinite order if $0\preceq A\preceq B$, for two symmetric positive definite operators $A$ and $B$, implies $f(A)\preceq f(B)$. A function $f$ is operator concave w.r.t. the semidefinite order if $f(\lambda A+(1-\lambda)B)\succeq\lambda f(A)+(1-\lambda)f(B)$, for all positive definite $A,B$ and all $\lambda\in[0,1]$.

We now derive or review some preliminary equations which we will need in order to obtain a useful bound on the gradients in the main work.

###### Claim 7.

Let $A(\theta)$ be an invertible linear operator which depends on the parameter $\theta$ (in our applications, linearly through the density matrix $\sigma(\theta)$). Then

 $\frac{\partial}{\partial\theta}A(\theta)^{-1}=-A^{-1}\frac{\partial A}{\partial\theta}A^{-1}.$ (28)
###### Proof.

The proof follows straightforwardly by differentiating the identity $AA^{-1}=I$:

 $\frac{\partial I}{\partial\theta}=0=\frac{\partial}{\partial\theta}\left(AA^{-1}\right)=\left(\frac{\partial A}{\partial\theta}\right)A^{-1}+A\left(\frac{\partial A^{-1}}{\partial\theta}\right).$

Reordering the terms completes the proof. This can equally be proven using the Gâteaux derivative. ∎
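Claim 7 can be spot-checked numerically. In this sketch (our illustration; the linear parametrization $A(\theta)=A_0+\theta B$ is an assumption of the example) we compare the right-hand side of (28) against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

# A(theta) = A0 + theta * B, so dA/dtheta = B; the diagonal shift keeps
# A well conditioned and invertible for the illustration.
A0 = rng.standard_normal((n, n)) + n * np.eye(n)
B = rng.standard_normal((n, n))
A = lambda t: A0 + t * B

theta, eps = 0.2, 1e-6
Ainv = np.linalg.inv(A(theta))
claim = -Ainv @ B @ Ainv  # right-hand side of (28)

# Central finite difference of A(theta)^{-1} as a reference.
fd = (np.linalg.inv(A(theta + eps)) - np.linalg.inv(A(theta - eps))) / (2 * eps)
print(np.max(np.abs(claim - fd)))  # close to zero
```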

We will furthermore rely on the following well-known inequality.

###### Lemma 8 (Von Neumann Trace Inequality).

Let $A,B\in\mathbb{C}^{n\times n}$ with singular values $\sigma_1(A)\ge\cdots\ge\sigma_n(A)$ and $\sigma_1(B)\ge\cdots\ge\sigma_n(B)$, respectively. It then holds that

 $|\mathrm{Tr}[AB]|\le\sum_{i=1}^n\sigma_i(A)\,\sigma_i(B).$ (29)

Note that from this we immediately obtain

 $|\mathrm{Tr}[AB]|\le\sigma_{\max}(A)\sum_{i=1}^n\sigma_i(B),$ (30)

where $\sigma_{\max}(A)$ denotes the largest singular value of $A$. This is particularly useful if $B$ is Hermitian and PSD, since then the singular values of $B$ coincide with its eigenvalues and hence $|\mathrm{Tr}[AB]|\le\|A\|\,\mathrm{Tr}[B]$ for Hermitian $A$.
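Both the inequality and its PSD corollary are easy to check numerically; the following is an illustrative sketch (not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# np.linalg.svd returns singular values sorted in non-increasing order,
# matching the ordering convention of Lemma 8.
sA = np.linalg.svd(A, compute_uv=False)
sB = np.linalg.svd(B, compute_uv=False)
print(abs(np.trace(A @ B)), np.sum(sA * sB))  # lhs <= rhs

# Corollary: for Hermitian A and Hermitian PSD P, |Tr[AP]| <= ||A|| Tr[P],
# since the singular values of P are its eigenvalues and sum to Tr[P].
Ah = rng.standard_normal((n, n)); Ah = (Ah + Ah.T) / 2
M = rng.standard_normal((n, n)); P = M @ M.T  # PSD by construction
print(abs(np.trace(Ah @ P)), np.linalg.norm(Ah, 2) * np.trace(P))
```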

Since we are dealing with operators, the common chain rule of differentiation does not hold in general. Indeed, the chain rule applies only in the special case in which the derivative of the operator commutes with the operator itself. Since we encounter a term of the form $\mathrm{Tr}[\rho\log\sigma_v]$, we cannot assume that $\partial_\theta\log\sigma=\sigma^{-1}\partial_\theta\sigma$, where $\partial_\theta\sigma$ is the derivative of $\sigma$ w.r.t. $\theta$. For this case we need the following identity, which plays a role similar to Duhamel's formula in the derivation of the gradient for the purely-visible-units Boltzmann machine.

###### Lemma 9 (Derivative of the matrix logarithm [40]).

Let $A(\theta)$ be an invertible matrix with no eigenvalues on the negative real axis. Then

 $\frac{\partial}{\partial\theta}\log A(\theta)=\int_0^1\left[s(A-I)+I\right]^{-1}\frac{\partial A}{\partial\theta}\left[s(A-I)+I\right]^{-1}\mathrm{d}s.$

For completeness we here include a proof of this identity.

###### Proof.

We use the integral representation of the logarithm [41], valid for a complex, invertible matrix $A$ with no eigenvalues on the negative real axis:

 $\log A=(A-I)\int_0^1\left[s(A-I)+I\right]^{-1}\mathrm{d}s.$ (32)

From this we obtain the derivative

 $\frac{\partial}{\partial\theta}\log A=\frac{\partial A}{\partial\theta}\int_0^1\left[s(A-I)+I\right]^{-1}\mathrm{d}s+(A-I)\int_0^1\frac{\partial}{\partial\theta}\left[s(A-I)+I\right]^{-1}\mathrm{d}s.$

Applying (28) to the second term on the right hand side yields

 $(A-I)\int_0^1\frac{\partial}{\partial\theta}\left[s(A-I)+I\right]^{-1}\mathrm{d}s=-(A-I)\int_0^1\left[s(A-I)+I\right]^{-1}s\,\frac{\partial A}{\partial\theta}\left[s(A-I)+I\right]^{-1}\mathrm{d}s,$

which can be rewritten as

 $-\int_0^1\left[s(A-I)+I\right]^{-1}\left(\left[s(A-I)+I\right]-I\right)\frac{\partial A}{\partial\theta}\left[s(A-I)+I\right]^{-1}\mathrm{d}s$

by writing $s(A-I)=\left[s(A-I)+I\right]-I$ and reordering commuting terms (i.e., $(A-I)\left[s(A-I)+I\right]^{-1}=\left[s(A-I)+I\right]^{-1}(A-I)$). Expanding the product, the part containing $\left[s(A-I)+I\right]$ cancels the first integral above, which yields the identity of Lemma 9 as desired. ∎
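The integral representation of the derivative in Lemma 9 can likewise be checked numerically. In this sketch (our illustration; the symmetric positive definite parametrization is an assumption of the example) we compare a midpoint discretization of the integral against a finite difference of `scipy.linalg.logm`:

```python
import numpy as np
from scipy.linalg import logm

rng = np.random.default_rng(3)
n = 4
I = np.eye(n)

# Symmetric positive definite A(theta) = A0 + theta * D, which keeps all
# eigenvalues off the negative real axis so that log A is well defined.
M = rng.standard_normal((n, n))
A0 = M @ M.T + n * I
D = rng.standard_normal((n, n)); D = (D + D.T) / 2
A = lambda t: A0 + t * D  # dA/dtheta = D

theta = 0.1
At = A(theta)

# Midpoint discretization of ∫_0^1 B_s^{-1} D B_s^{-1} ds with B_s = s(A-I)+I.
m = 1000
grad = sum(
    np.linalg.inv(s * (At - I) + I) @ D @ np.linalg.inv(s * (At - I) + I)
    for s in (np.arange(m) + 0.5) / m
) / m

# Central finite difference of the matrix logarithm as a reference.
eps = 1e-6
fd = (logm(A(theta + eps)) - logm(A(theta - eps))) / (2 * eps)
print(np.max(np.abs(grad - fd)))  # close to zero
```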

### a.2 Amplitude estimation

In the following we describe the established amplitude estimation algorithm [42]:

Algorithm 3 describes the amplitude estimation algorithm. The output is an $\epsilon$-close estimate of the target amplitude. Note that in step (3), $S_0$ changes the sign of the amplitude if and only if the state is the zero state $|0\rangle$, and $S_\chi$ is the sign-flip operator for the target state, i.e., if $|x\rangle$ is the desired outcome, then $S_\chi|x\rangle=-|x\rangle$.

The algorithm can be summarized as the unitary transformation

 $\left(\mathrm{QFT}_N^\dagger\otimes I\right)\Lambda_N(Q)\left(\mathrm{QFT}_N\otimes I\right)$

applied to the state $|0\rangle\otimes|\psi\rangle$, followed by a measurement of the first register and classical post-processing, which returns an estimate $\tilde a$ of the amplitude $a$ of the desired outcome such that $|\tilde a-a|\le\epsilon$ with probability at least $8/\pi^2$. The result is summarized in the following theorem, which states a slightly more general version.

###### Theorem 10 (Amplitude Estimation [42]).

For any positive integer $k$, the Amplitude Estimation Algorithm returns an estimate $\tilde a$ ($0\le\tilde a\le1$) such that

 $|\tilde a-a|\le 2\pi k\frac{\sqrt{a(1-a)}}{N}+k^2\frac{\pi^2}{N^2}$

with probability at least $8/\pi^2$ for $k=1$, and with probability greater than $1-\frac{1}{2(k-1)}$ for $k\ge2$. If $a=0$ then $\tilde a=0$ with certainty, and if $a=1$ and $N$ is even, then $\tilde a=1$ with certainty.

Notice that the amplitude $a$ can hence be recovered via the relation $a=\sin^2(\pi\tilde\theta)$ as described above, which incurs an error of at most $2\pi\epsilon\sqrt{a(1-a)}+\pi^2\epsilon^2$ for $|\tilde\theta-\theta|\le\epsilon$ (c.f. Lemma 7, [42]).
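The $k=1$ error bound of Theorem 10 can be illustrated classically: amplitude estimation effectively rounds the phase $\theta$ (with $a=\sin^2(\pi\theta)$) to the nearest multiple of $1/N$, so $|\tilde\theta-\theta|\le1/(2N)$, and the induced amplitude error stays within the stated bound. A small sketch of this arithmetic (our illustration, not a quantum simulation):

```python
import numpy as np

N = 2 ** 7  # number of grid points used by the phase estimation register

for a in np.linspace(0.0, 1.0, 501):
    theta = np.arcsin(np.sqrt(a)) / np.pi  # a = sin^2(pi * theta), theta in [0, 1/2]
    m = round(theta * N)                   # ideal (rounded) measurement outcome
    a_tilde = np.sin(np.pi * m / N) ** 2   # recovered amplitude estimate
    # k = 1 bound from Theorem 10.
    bound = 2 * np.pi * np.sqrt(a * (1 - a)) / N + np.pi ** 2 / N ** 2
    assert abs(a_tilde - a) <= bound + 1e-12

print("k = 1 error bound of Theorem 10 holds on the grid")
```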

### a.3 Hadamard test

Here we present a simple subroutine to evaluate the trace of the product of a unitary operator with a density matrix $\rho$, i.e., terms of the form $\mathrm{Tr}[\rho U]$, which is known as the Hadamard test.

Note that this procedure can easily be adapted for $\rho$ being a Gibbs state. We then would use a Gibbs state preparation routine in step (2). For example, for the evaluation of the gradient of the variational bound, we require this subroutine to evaluate expectation values of the form $\mathrm{Tr}[\sigma U]$ for $\sigma$ being the Gibbs state corresponding to the Hamiltonian $H$.
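The basic test (for a directly prepared $\rho$) can be sketched as a small density-matrix simulation. This is our illustration of the standard circuit, assuming the usual layout: Hadamard on the ancilla, controlled-$U$, Hadamard on the ancilla, then ancilla measurement, for which $P(0)=\tfrac{1}{2}(1+\mathrm{Re}\,\mathrm{Tr}[\rho U])$:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4  # dimension of the system register

# Random density matrix rho and random unitary U (via QR decomposition).
M = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
rho = M @ M.conj().T
rho /= np.trace(rho).real
U, _ = np.linalg.qr(rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d)))

# Circuit operators on the ancilla (x) system register.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
Hfull = np.kron(H, np.eye(d))
P0 = np.diag([1.0, 0.0])  # |0><0| on the ancilla
cU = np.kron(np.diag([1.0, 0.0]), np.eye(d)) + np.kron(np.diag([0.0, 1.0]), U)

joint = np.kron(P0, rho)                # ancilla in |0><0|, system in rho
joint = Hfull @ joint @ Hfull.conj().T  # Hadamard on the ancilla
joint = cU @ joint @ cU.conj().T        # controlled-U
joint = Hfull @ joint @ Hfull.conj().T  # Hadamard on the ancilla
p0 = np.trace(np.kron(P0, np.eye(d)) @ joint).real

# Probability of measuring the ancilla in |0>: (1 + Re Tr[rho U]) / 2.
print(p0, (1 + np.trace(rho @ U).real) / 2)
```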

### a.4 Deferred proofs

First, for convenience, we formally define quantum Boltzmann machines below.

###### Definition 11.

We define a quantum Boltzmann machine to be a quantum mechanical system that acts on a tensor product of Hilbert spaces that correspond to the visible and hidden subsystems of the Boltzmann machine. It further has a Hamiltonian of the form $H=\sum_k\theta_k H_k$ such that $H=H^\dagger$. The quantum Boltzmann machine takes these parameters and then outputs a state of the form $\sigma_v=\mathrm{Tr}_h\left[e^{-H}/\mathrm{Tr}[e^{-H}]\right]$.

Given this definition, we are then able to discuss the gradient of the relative entropy between the output of a quantum Boltzmann machine and the input data that it is trained with.

#### a.4.1 Derivation of the variational bound

###### Proof of Lemma 2.

Recall that we assume that the Hamiltonian takes the form

 $H:=\sum_k\theta_k v_k\otimes h_k,$

where $v_k$ and $h_k$ are operators acting on the visible and hidden units respectively, and the $h_k$ can be assumed to be diagonal in the chosen basis. Under the assumption that $[h_k,h_{k'}]=0$ for all $k,k'$, c.f. the assumptions in (9), there exists a basis $\{|h\rangle\}$ for the hidden subspace such that $h_k|h\rangle=E_{h,k}|h\rangle$. With these assumptions we can hence reformulate the logarithm as

 $\log\mathrm{Tr}_h\left[e^{-H}\right]=\log\left(\sum_{v,v',h}\langle v,h|e^{-\sum_k\theta_k v_k\otimes h_k}|v',h\rangle\,|v\rangle\langle v'|\right)$ (35)
 $=\log\left(\sum_{v,v',h}\langle v|e^{-\sum_k E_{h,k}\theta_k v_k}|v'\rangle\,|v\rangle\langle v'|\right)$ (36)
 $=\log\left(\sum_h e^{-\sum_k E_{h,k}\theta_k v_k}\right),$ (37)

where it is important to note that the $v_k$ are operators and we hence just used their matrix representation in the last step. In order to further simplify this expression, first note that each term in the sum is a positive semi-definite operator. In particular, note that the matrix logarithm is operator concave and operator monotone, and hence by Jensen's inequality, for any sequence of non-negative numbers $\alpha_1,\ldots,\alpha_N$ and positive semi-definite operators $U_1,\ldots,U_N$ we have that

 $\log\left(\frac{\sum_{i=1}^N\alpha_i U_i}{\sum_j\alpha_j}\right)\succeq\frac{\sum_{i=1}^N\alpha_i\log(U_i)}{\sum_j\alpha_j},$

and since we are optimizing over the $\alpha_h$, we hence obtain, for an arbitrary choice of the $\alpha_h$ under the above constraints,

 $\mathrm{Tr}\left[\rho\log\left(\sum_{h=1}^N e^{-\sum_k E_{h,k}\theta_k v_k}\right)\right]=\mathrm{Tr}\left[\rho\log\left(\frac{\sum_{h=1}^N\alpha_h\,e^{-\sum_k E_{h,k}\theta_k v_k}/\alpha_h}{\sum_{h'}\alpha_{h'}}\right)\right]\ge-\mathrm{Tr}\left[\rho\,\frac{\sum_h\alpha_h\sum_k E_{h,k}\theta_k v_k+\sum_h\alpha_h\log\alpha_h}{\sum_{h'}\alpha_{h'}}\right],$ (38)

where the first equality holds for normalized weights, $\sum_{h'}\alpha_{h'}=1$.

Hence, the variational bound on the objective function for any such choice of the $\alpha_h$ is

 $O_\rho(H)=\mathrm{Tr}[\rho\log\rho]-\mathrm{Tr}[\rho\log\sigma_v]\le\mathrm{Tr}[\rho\log\rho]+\mathrm{Tr}\left[\rho\,\frac{\sum_h\alpha_h\sum_k E_{h,k}\theta_k v_k+\sum_h\alpha_h\log\alpha_h}{\sum_{h'}\alpha_{h'}}\right]+\log Z=:\tilde S.$ (39)
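The operator Jensen inequality underlying this bound can be spot-checked numerically: for positive definite $U_i$ and normalized weights $\alpha_i$, the matrix $\log(\sum_i\alpha_i U_i)-\sum_i\alpha_i\log U_i$ should be positive semi-definite. A small sketch (our illustration, not part of the proof):

```python
import numpy as np
from scipy.linalg import logm

rng = np.random.default_rng(5)
n, N = 4, 3

# Random symmetric positive definite matrices U_i and normalized weights.
Us = []
for _ in range(N):
    M = rng.standard_normal((n, n))
    Us.append(M @ M.T + 0.1 * np.eye(n))
alpha = rng.random(N)
alpha /= alpha.sum()

lhs = logm(sum(a * U for a, U in zip(alpha, Us)))  # log of the mixture
rhs = sum(a * logm(U) for a, U in zip(alpha, Us))  # mixture of the logs

# Operator concavity of the logarithm: lhs - rhs is positive semi-definite.
gap = np.linalg.eigvalsh(lhs - rhs)
print(gap.min())  # >= 0 up to numerical error
```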

For the following result we will rely on this variational bound in order to train the quantum Boltzmann machine weights for a Hamiltonian of the form given in (10). We begin by proving Lemma 3 from the main work, which gives us an upper bound on the gradient of the relative entropy.

###### Proof of Lemma 3.

We first derive the gradient of the normalization term ($\log Z$) in the relative entropy, which can be evaluated using Duhamel's formula together with the cyclicity of the trace to obtain

 $\frac{\partial}{\partial\theta_p}\log Z=-\mathrm{Tr}\left[\frac{\partial H}{\partial\theta_p}\frac{e^{-H}}{Z}\right].$

Note that we can easily evaluate this term by first preparing the Gibbs state and then evaluating the expectation value of the operator $\partial H/\partial\theta_p$ w.r.t. the Gibbs state, using amplitude estimation for the Hadamard test. If $T_{\mathrm{Gibbs}}$ denotes the query complexity of the Gibbs state preparation, the query complexity of the whole algorithm including the phase estimation step then scales as $O(T_{\mathrm{Gibbs}}/\tilde\epsilon)$ for an $\tilde\epsilon$-accurate estimate of the phase. Taking into account the desired accuracy and the error propagation then straightforwardly gives the computational complexity to evaluate this part.
We now proceed with the gradient evaluation for the model term. Using the variational bound on the objective function given in (39), valid for any admissible choice of the $\alpha_h$, we obtain the gradient

 $\frac{\partial\tilde S}{\partial\theta_p}=-\mathrm{Tr}\left[\frac{\partial H}{\partial\theta_p}\frac{e^{-H}}{Z}\right]+\mathrm{Tr}\left[\frac{\partial}{\partial\theta_p}\rho\sum_h\alpha_h\sum_k E_{h,k}\theta_k v_k\right]+\frac{\partial}{\partial\theta_p}\sum_h\alpha_h\log\alpha_h$ (40)
 $=-\mathrm{Tr}\left[\frac{\partial H}{\partial\theta_p}\frac{e^{-H}}{Z}\right]+\frac{\partial}{\partial\theta_p}\left(\sum_h\alpha_h\mathrm{Tr}\left[\rho\sum_k E_{h,k}\theta_k v_k\right]+\sum_h\alpha_h\log\alpha_h\right),$ (41)

where the first term results from the partition sum. The second term can be seen as arising from a new effective Hamiltonian, while the third term is the (negative) entropy of the weights. Together, the last two terms hence resemble the free energy $F=\langle E\rangle-TS$, where $\langle E\rangle$ is the mean energy of the effective system with energies $E_h=\mathrm{Tr}[\rho\sum_k E_{h,k}\theta_k v_k]$, $T$ the temperature, and $S$ the Shannon entropy of the distribution $\{\alpha_h\}$. We now want to choose the $\alpha_h$ to minimize this variational upper bound. It is well-established in statistical physics, see for example [43], that the distribution which minimizes the free energy is the Boltzmann (or Gibbs) distribution, i.e.,

 αh=e