# Machine learning algorithms based on generalized Gibbs ensembles

Tatjana Puškarov and Axel Cortés Cubero

Institute for Theoretical Physics, Center for Extreme Matter and Emergent Phenomena, Utrecht University, Princetonplein 5, 3584 CC Utrecht, the Netherlands

July 15, 2019

###### Abstract

Machine learning algorithms often take inspiration from established results and knowledge from statistical physics. A prototypical example is the Boltzmann machine algorithm for supervised learning, which utilizes knowledge of classical thermal partition functions and the Boltzmann distribution. Recently, a quantum version of the Boltzmann machine was introduced by Amin et al.; however, the non-commutativity of quantum operators renders the training process by minimizing a cost function inefficient. Recent advances in the study of non-equilibrium quantum integrable systems, which never thermalize, have led to the exploration of a wider class of statistical ensembles. These systems may be described by the so-called generalized Gibbs ensemble (GGE), which incorporates a number of “effective temperatures”. We propose that these GGEs can be successfully applied as the basis of a Boltzmann-machine-like learning algorithm, which operates by learning the optimal values of the effective temperatures. We show that the GGE algorithm is an optimal quantum Boltzmann machine: it is the only quantum machine that circumvents the quantum training-process problem. We apply a simplified version of the GGE algorithm, where quantum effects are suppressed, to the classification of handwritten digits in the MNIST database. While lower error rates can be found with other state-of-the-art algorithms, we find that our algorithm reaches relatively low error rates while learning a much smaller number of parameters than would be needed in a traditional Boltzmann machine, thereby reducing computational cost.

## I Introduction and classical Boltzmann machines

Supervised learning consists of approximating a function based on the input-output pairs supplied in a training data set. Besides capturing the input-output relationships in the training set, the approximated function should be general enough so that it can be used to map new input examples supplied in a test data set. Machine learning algorithms do this by introducing a model with flexible parameters which are learned from the training data set. The learning is done by adjusting the parameters to minimize a “cost function” which measures the distance of the inferred mapping from the actual one.

A common approach to classification is the use of probabilistic modeling. Given some input data $v^i$, one computes the probability that it corresponds to a particular output class $o$. Given an input, this will produce a different probability for each possible output class, and the class with the highest probability is selected as the algorithm’s prediction. The goal is then to come up with some probability distributions, $P^o_i$, which have flexible parameters that can be learned from a training data set.

The main goals of statistical mechanics are very similar to those of supervised machine learning. In statistical mechanics, one is interested in computing macroscopic observable quantities (such as energy density, pressure, etc.) corresponding to a given configuration of microscopic particles. One is not necessarily interested in the specific individual dynamics of each particle, since slightly different microscopic configurations lead to the same macroscopic output. This line of reasoning parallels the goals of supervised machine learning, where the input variables can correspond to a microscopic configuration of particles, and the output can be viewed as a macroscopic observable.

This conceptual connection to statistical mechanics gave rise to a popular probabilistic model, with many possible applications to machine learning Hinton et al. (2006); Hinton and Salakhutdinov (2006); Larochelle and Bengio (2008); Coates et al. (2011), the so-called (restricted) Boltzmann machine (RBM/BM). Boltzmann machines are generative models which learn the probability distribution of data using a set of hidden variables. In a typical Boltzmann machine the variables are distributed in two connected layers - a visible and a hidden layer. Since fully connected Boltzmann machines have proved impractical, the connections between the variables are usually constrained - in the case of a restricted Boltzmann machine the only non-vanishing weights are the interlayer ones connecting the visible and the hidden layer, while the intralayer ones are set to zero.

The idea is that the probabilities of configurations can be computed, as is done in statistical physics, from a Boltzmann distribution, given by

$$p(v^i)=\frac{\sum_h e^{-\beta E(v^i,h)}}{Z},\qquad Z=\sum_{v^i,h}e^{-\beta E(v^i,h)}, \tag{1}$$

where $v^i$ are the visible units which are fed the input data and $h$ are the hidden units. The parameter $\beta$ corresponds to the inverse temperature and is traditionally set to $\beta=1$, although there have recently been explorations of temperature-based BM Li et al. (2016); Passos and Papa (2017), and $E(v^i,h)$ can be thought of as the energy of a given configuration. One useful form of the energy function is that corresponding to an Ising-like model Hopfield (1982); Hinton and Sejnowski (1983); Lieb et al. (1961),

$$E(v^i,h)=\sum_i b^h_i h_i+\sum_i b^i_i v^i_i+\sum_{i,j}w^i_{i,j}h_i v^i_j, \tag{2}$$

where $h_i$ and $v^i_i$ are stochastic binary units which represent the hidden and the input variables, and the biases $b^h_i$, $b^i_i$ and the weights $w^i_{i,j}$ are free parameters that need to be learned. The goal of the algorithm is to learn, from the training data set, the optimal values of the biases and weights that produce the most accurate probabilities $p(v^i)$. This is done by setting the states of the visible layer to a training vector and then using those to set each node of the hidden layer to 1 with a probability given by the logistic function, $\sigma(x)=1/(1+e^{-x})$, of the total input that the node receives from its bias and from the visible layer. The training of a BM consists in “reconstructing” the visible and hidden vectors in this way from one another while adjusting the weights and biases. The weights and biases are adjusted so that the reconstructed visible vector gets closer to the input one and the energy of the total configuration is decreased, thus increasing its probability Hinton (2012).

Training of a classical Boltzmann machine is generally done by minimizing a cost function, which gives a measure of the distance between the correct probabilities provided by the training data set, $p(v^i)_{\rm data}$, and the probabilities generated by the model, $p(v^i)$. A useful and commonly used cost function is the cross-entropy, defined as

$$L=-\sum_i p(v^i)_{\rm data}\log p(v^i). \tag{3}$$

The cost function can be minimized using gradient descent. This technique consists of iteratively changing the biases and weights (which we can collectively call the set of parameters $\theta$) by a small amount in the direction of steepest descent of the cost function in the parameter space. Namely, the change in a parameter at a given step, for a small learning rate $\epsilon$, is

$$\Delta\theta=-\epsilon\,\partial_\theta L. \tag{4}$$

For the gradient descent technique to be useful, it has to be possible to compute the gradients efficiently. It can be easily shown that

$$-\partial_\theta L=\langle\partial_\theta E(v^i,h)\rangle-\sum_i p(v^i)_{\rm data}\,\langle\partial_\theta E(v^i,h)\rangle_h, \tag{5}$$

where $\langle\cdot\rangle_h$ represents statistical Boltzmann averaging over the space of hidden variables, and $\langle\cdot\rangle$ represents Boltzmann averaging over both hidden and visible variables, which can be efficiently performed. Minimizing the cross-entropy via gradient descent is then an efficient procedure for the classical Boltzmann machine.
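As a sanity check of Eq. (5), the following minimal sketch enumerates all states of a tiny Boltzmann machine with the energy of Eq. (2) and $\beta=1$, and compares the analytic gradient with respect to one weight against a finite difference of the cross-entropy (3). The network size, the parameter values, the toy data distribution and all function names are illustrative assumptions, not taken from the paper; exhaustive enumeration is only feasible for very small networks.

```python
import itertools
import math

def states(n):
    """All binary configurations of n units."""
    return list(itertools.product([0, 1], repeat=n))

def energy(v, h, bv, bh, w):
    """Ising-like energy of Eq. (2); w[i][j] couples hidden unit i to visible unit j."""
    e = sum(b * x for b, x in zip(bv, v)) + sum(b * x for b, x in zip(bh, h))
    e += sum(w[i][j] * h[i] * v[j] for i in range(len(h)) for j in range(len(v)))
    return e

def boltzmann_weights(bv, bh, w):
    """Unnormalized weights exp(-E) for every pair (v, h), with beta = 1."""
    return {(v, h): math.exp(-energy(v, h, bv, bh, w))
            for v in states(len(bv)) for h in states(len(bh))}

def cross_entropy(p_data, bv, bh, w):
    """Eq. (3), with p(v) obtained from Eq. (1) by summing over the hidden units."""
    wts = boltzmann_weights(bv, bh, w)
    z = sum(wts.values())
    hs = states(len(bh))
    return -sum(pd * math.log(sum(wts[(v, h)] for h in hs) / z)
                for v, pd in p_data.items())

def grad_weight(p_data, bv, bh, w, i, j):
    """dL/dw[i][j] from Eq. (5), using dE/dw[i][j] = h_i * v_j."""
    wts = boltzmann_weights(bv, bh, w)
    z = sum(wts.values())
    hs = states(len(bh))
    free = sum(wt * h[i] * v[j] for (v, h), wt in wts.items()) / z
    clamped = 0.0
    for v, pd in p_data.items():
        zv = sum(wts[(v, h)] for h in hs)
        clamped += pd * sum(wts[(v, h)] * h[i] * v[j] for h in hs) / zv
    return clamped - free  # Eq. (5) states -dL/dtheta = free - clamped

# Toy example (assumed values): two visible units, two hidden units.
p_data = {(1, 0): 0.7, (0, 1): 0.3}
bv, bh = [0.1, -0.2], [0.3, 0.0]
w = [[0.5, -0.4], [0.2, 0.1]]

def numeric_grad(i, j, eps=1e-6):
    """Central finite difference of the cross-entropy with respect to w[i][j]."""
    up = [row[:] for row in w]
    dn = [row[:] for row in w]
    up[i][j] += eps
    dn[i][j] -= eps
    return (cross_entropy(p_data, bv, bh, up) - cross_entropy(p_data, bv, bh, dn)) / (2 * eps)

print(grad_weight(p_data, bv, bh, w, 0, 0), numeric_grad(0, 0))
```

The two printed numbers should agree to within the finite-difference error. In practice the averages in Eq. (5) are estimated by sampling rather than by enumeration.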

The Boltzmann machine is a generative model since it learns a probability distribution which represents the training data and can, after training, generate new data with the learned distribution. It can also be used for classification of data Goodfellow et al. (2016), usually as a feature extractor. While training the RBM, the network learns a different representation of the input data, which can be fed to a classification algorithm Hinton et al. (2006). In this way, the RBM is used to extract the features $h$, which can then be fed to, e.g., a single softmax output layer to classify the data. Alternatively, an RBM with a part of its visible layer, $v^o$, fed the output data can be used to generate the joint probabilities of the inputs and the output labels. The probability of such a configuration is

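$$p(v^i,h,v^o)=\frac{e^{-\beta E(v^i,h,v^o)}}{Z},\qquad Z=\sum_{v^i,h,v^o}e^{-\beta E(v^i,h,v^o)}, \tag{6}$$

where the energy is now

$$E(v^i,h,v^o)=\sum_i b^h_i h_i+\sum_i b^i_i v^i_i+\sum_i b^o_i v^o_i+\sum_{i,j}w^i_{i,j}h_i v^i_j+\sum_{i,j}w^o_{i,j}h_i v^o_j. \tag{7}$$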

The RBM then models the joint probability distribution of the input and the output data by summing over the possible states of the hidden units

$$P^o_i=\sum_h p(v^i,h,v^o). \tag{8}$$

A simple example of a supervised learning problem is that of the classification of handwritten digits. The Modified National Institute of Standards and Technology (MNIST) database is a large database of handwritten digits used for training and testing image processing models (available at http://yann.lecun.com/exdb/mnist/). The data consists of 60000 training and 10000 testing images and their corresponding labels. The input data in this case is a set of grayscale pixels which form the image of a handwritten digit. The images are $28\times 28$ pixels with 256 grayscale values. The outputs are the corresponding “class” labels of the images, i.e. the correct digits (0-9) that the images represent. The creators of the database trained several types of algorithms and achieved test set error rates ranging from 12% for a linear classifier to 0.8% for a support vector machine LeCun et al. (1998). At the moment, an algorithm with a committee of 35 convolutional neural networks achieves the best performance on the test set with an error of 0.23% Ciregan et al. (2012).

In this paper we develop a new learning algorithm which is inspired by recent advances in the study of the non-equilibrium statistical physics of quantum many-body systems. In particular, our supervised learning model will be based on statistical ensembles that describe integrable systems, which are generalizations of the traditional Gibbs ensemble of equilibrium thermodynamics Rigol et al. (2007); Calabrese et al. (2012); Fagotti and Essler (2013). These systems are described by the so-called generalized Gibbs ensemble (GGE), which incorporates a large number of quantities known as “effective temperatures”, which are generalizations of the parameter $\beta$.

GGEs are quantum statistical ensembles, therefore we need to study the properties of quantum Boltzmann machines. In the next section we will provide a brief introduction to these quantum algorithms, following the discussion of Amin et al. (2016). We will see that it is generally not possible to efficiently train a quantum Boltzmann machine by minimizing the cross-entropy, as can be done in the classical version.

We will expose in some detail the structure of the GGE for a simple quantum Ising chain, which will be the basis of our model. We will show in Section V how these effective temperatures can be treated as the adjustable parameters to be learned from the training data set, yielding a computationally cheap and reasonably effective learning algorithm. In particular, we will show that the GGE algorithm is an optimal version of the quantum Boltzmann machine, and that it is the only quantum Boltzmann machine where it is possible to efficiently minimize the cross-entropy using the gradient descent technique.

We will then apply a very simple version of the GGE-based algorithm to the problem of classifying MNIST handwritten digits, and compare our results to previously established classification algorithms. This is done by representing the input data as momentum-space eigenstates of the quantum Ising Hamiltonian and by having no hidden variables, which essentially reduces the algorithm to a classical problem that is easy to handle. Our algorithm at this point is not competitive with the high levels of precision of other state-of-the-art approaches Ciregan et al. (2012). The main advantage of our model is instead its simplicity: it achieves reasonably low error rates with a comparatively very small number of parameters that need to be fitted.

Our results show the simplest version of the GGE algorithm is a reasonable model for a simple treatment of the MNIST database. At this point we do not explore further if introducing the full quantum effects, which can be done by introducing hidden variables and choosing a different basis of states, can improve the capabilities of the GGE algorithm.

## II Quantum Boltzmann machines

Drawing on the success of learning algorithms based on classical statistical mechanics, the natural generalization to quantum Boltzmann machines was recently proposed Amin et al. (2016). In this case the energy function is promoted to a Hamiltonian operator, and the data is represented as a particular quantum state. A quantum Ising Hamiltonian can be defined by promoting the classical spin variables to spin operators, which can be expressed in the basis of the Pauli matrices $\sigma^x_i$, $\sigma^y_i$, $\sigma^z_i$. One example considered in Amin et al. (2016) is the Hamiltonian corresponding to an inhomogeneous Ising chain in a transverse and a longitudinal magnetic field

$$H^o=-\sum_i\Gamma_i\sigma^x_i-\sum_i b_i\sigma^z_i-\sum_{i,j}w_{i,j}\sigma^z_i\sigma^z_j. \tag{9}$$

The Hilbert space corresponding to the Hamiltonian (9) can be split into spin variables corresponding to visible and hidden units, given by the tensor product $\mathcal{H}=\mathcal{H}_v\otimes\mathcal{H}_h$. An input sample is represented as a particular state $|v^{(i)}\rangle$ on the visible sector of the Hilbert space, $\mathcal{H}_v$, which corresponds to a configuration of spins in the quantum Ising chain, while the configuration of spins corresponding to hidden variables is not specified. The probabilities can now be written in terms of the quantum version of the Boltzmann distribution, also known as the canonical Gibbs ensemble, by computing the quantum average of the projection operator

$$\Lambda_i=|v^{(i)}\rangle\langle v^{(i)}|\otimes I_h, \tag{10}$$

where $I_h$ is the identity operator acting on the space of hidden variables. The probabilities then are

$$P^o_i=\langle\Lambda_i\rangle^o_{\rm GE}\equiv\frac{{\rm Tr}\{\Lambda_i e^{-\beta H^o}\}}{Z^o},\qquad Z^o={\rm Tr}\{e^{-\beta H^o}\}, \tag{11}$$

where the trace is over all the possible spin configurations, and $\langle\cdot\rangle^o_{\rm GE}$ denotes the quantum average of an operator using the Gibbs ensemble.

One can attempt to train the quantum Boltzmann machine by minimizing the cross-entropy via gradient descent, as is done for the classical Boltzmann machine. It was, however, shown in Ref. Amin et al. (2016) that this process cannot be done efficiently, since the gradient of the cross-entropy cannot be expressed in terms of the standard quantum averages, $\langle\cdot\rangle^o_{\rm GE}$. The cross-entropy is given by

$$L=-\sum_{i,o}P^{o,{\rm data}}_i\log\langle\Lambda_i\rangle^o_{\rm GE}. \tag{12}$$

We can then compute the gradient, $\partial_\theta L$, where $\theta$ stands for the parameters $\Gamma_i$, $b_i$ and $w_{i,j}$. One can easily obtain the expression

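$$\partial_\theta L=-\sum_{i,o}P^{o,{\rm data}}_i\left(\frac{{\rm Tr}[\Lambda_i\partial_\theta e^{-\beta H^o}]}{{\rm Tr}[\Lambda_i e^{-\beta H^o}]}-\frac{{\rm Tr}[\partial_\theta e^{-\beta H^o}]}{{\rm Tr}[e^{-\beta H^o}]}\right). \tag{13}$$

It can be shown Amin et al. (2016), using the cyclicity of the trace, that the second term in Eq. (13) can be simplified to

$$\frac{{\rm Tr}[\partial_\theta e^{-\beta H^o}]}{{\rm Tr}[e^{-\beta H^o}]}=-\langle\beta\,\partial_\theta H^o\rangle^o_{\rm GE}, \tag{14}$$

which is a standard quantum average that can be efficiently estimated by sampling. The problem lies in the first term of (13), which cannot be written as a GE average, as a consequence of the noncommutativity of the operators $\partial_\theta H^o$ and $H^o$. This term can be written as Amin et al. (2016)

$$\frac{{\rm Tr}[\Lambda_i\partial_\theta e^{-\beta H^o}]}{{\rm Tr}[\Lambda_i e^{-\beta H^o}]}=-\beta\int_0^1 dt\,\frac{{\rm Tr}[\Lambda_i e^{-t\beta H^o}\,\partial_\theta H^o\,e^{-(1-t)\beta H^o}]}{{\rm Tr}[\Lambda_i e^{-\beta H^o}]}, \tag{15}$$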

which cannot be efficiently estimated via sampling, and makes the quantum Boltzmann machine inefficient for large systems.
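The obstruction can be made concrete on the smallest possible example. The sketch below, which is an illustration under stated assumptions rather than anything from the paper, takes a two-spin version of the Hamiltonian (9) with one visible and one hidden spin, differentiates $e^{-\beta H^o}$ numerically with respect to the transverse field, and compares both terms of the gradient with the corresponding naive averages of $-\beta\,\partial_\theta H^o$ (the sign follows from differentiating the exponential). The coupling values, function names and finite-difference step are arbitrary choices.

```python
import numpy as np

SX = np.array([[0.0, 1.0], [1.0, 0.0]])
SZ = np.array([[1.0, 0.0], [0.0, -1.0]])
I2 = np.eye(2)

def gibbs_weight(H, beta=1.0):
    """exp(-beta H) for a real symmetric H, via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(H)
    return (vecs * np.exp(-beta * vals)) @ vecs.T

def hamiltonian(gamma, b, w):
    """Two-spin version of Eq. (9): spin 0 is visible, spin 1 is hidden."""
    return (-gamma * (np.kron(SX, I2) + np.kron(I2, SX))
            - b * (np.kron(SZ, I2) + np.kron(I2, SZ))
            - w * np.kron(SZ, SZ))

def gradient_terms(gamma, b, w, beta=1.0, eps=1e-5):
    """Return (first_exact, first_naive, second_exact, second_average) for theta = gamma."""
    proj = np.kron(np.diag([1.0, 0.0]), I2)        # projector of Eq. (10), visible spin up
    dH = -(np.kron(SX, I2) + np.kron(I2, SX))      # derivative of H with respect to gamma
    rho = gibbs_weight(hamiltonian(gamma, b, w), beta)
    up = gibbs_weight(hamiltonian(gamma + eps, b, w), beta)
    dn = gibbs_weight(hamiltonian(gamma - eps, b, w), beta)
    d_rho = (up - dn) / (2 * eps)                  # numerical derivative of exp(-beta H)
    first_exact = np.trace(proj @ d_rho) / np.trace(proj @ rho)
    first_naive = -beta * np.trace(proj @ dH @ rho) / np.trace(proj @ rho)
    second_exact = np.trace(d_rho) / np.trace(rho)
    second_average = -beta * np.trace(dH @ rho) / np.trace(rho)
    return first_exact, first_naive, second_exact, second_average

print(gradient_terms(gamma=1.0, b=0.5, w=1.0))
```

The partition-function term coincides with a Gibbs average, whereas the projected term does not reduce to the naive average of $\Lambda_i\,\partial_\theta H^o$ whenever $\partial_\theta H^o$ fails to commute with $H^o$.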

An alternative approach to training the quantum algorithm was proposed in Amin et al. (2016), consisting of placing an upper bound on the cross-entropy, instead of computing the absolute minimum. The upper bound is a quantity that can be efficiently estimated by sampling and thus minimized. It was shown that this bound-based approach worked well enough for some simple data sets. The problem of efficiently finding the true minimum of the cross-entropy of a quantum Boltzmann machine was, however, left unsolved in Amin et al. (2016).

An alternate training procedure for quantum Boltzmann machines, which can be called state-based training, was proposed in Kieferova and Wiebe (2016) for training data sets that can be expressed as a density matrix, $\rho^o_{\rm data}$, which allows the source of the training data to be of quantum nature as well. The training data set we have discussed so far, consisting of the set of operators $\Lambda_i$ and their associated probabilities $P^{o,{\rm data}}_i$, can also be expressed as a quantum density matrix as

$$\rho^o_{\rm data}=\sum_i P^{o,{\rm data}}_i\Lambda_i, \tag{16}$$

although more general density matrices can be considered as well.

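The goal in state-based training is then to learn the parameters in the Hamiltonian, $H^o$, such that the density matrix, $\rho^o=e^{-\beta H^o}/Z^o$, approximates the training density matrix, $\rho^o_{\rm data}$. In this case one needs to minimize some measure of the distance between the two density matrices. A convenient such measure, which can take the role of a cost function, is given by the relative entropy,

$$L=\sum_o\left\{{\rm Tr}\left[\rho^o_{\rm data}\log\rho^o_{\rm data}\right]-{\rm Tr}\left[\rho^o_{\rm data}\log\rho^o\right]\right\}. \tag{17}$$

One can now attempt to minimize this cost function by gradient descent, for which we need to compute its derivatives with respect to the parameters $\theta$. Using $\log\rho^o=-\beta H^o-\log Z^o$, these are

$$\partial_\theta L=\beta\sum_o\left\{{\rm Tr}[\rho^o_{\rm data}\partial_\theta H^o]-{\rm Tr}[\rho^o\partial_\theta H^o]\right\}. \tag{18}$$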

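The quantum traces in Eq. (18) are standard Gibbs-ensemble-like averages. In practice, these traces can be computed by introducing an arbitrary complete basis of orthogonal states $|n\rangle$, defined such that $\sum_n|n\rangle\langle n|=1$. We can then write in general

$${\rm Tr}[\rho^o_{\rm data}\partial_\theta H^o]=\sum_{n,m}\langle n|\rho^o_{\rm data}|m\rangle\langle m|\partial_\theta H^o|n\rangle. \tag{19}$$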

For a system consisting of spin-1/2 variables, the Hilbert space is spanned by a basis of distinct orthogonal states. It is then necessary to compute in general matrix elements, and and matrix elements . For each parameter , one needs to compute a new set of elements . A quantum Boltzmann machine in general contains at least parameters , such that the total number of matrix elements that need to be computed to evaluate the trace (19), and therefore for all parameters , is .

In the next section we will introduce our new GGE-based algorithm. We will show that this GGE algorithm is a quantum algorithm which circumvents the quantum training problem from Amin et al. (2016), and its cross-entropy can be efficiently minimized via sampling. In this sense, the GGE algorithm is the quantum machine with the optimal training process. We will also show that if one attempts to perform state-based training on the GGE algorithm, the number of matrix elements one needs to compute is , instead of , even though one still has free tunable parameters.

## III The GGE machine as the optimal quantum Boltzmann machine

The main goal of this paper is to explore the utility of applying different physics-inspired probability distributions to supervised learning, and to see if they can have any advantage over the (quantum) Boltzmann distributions.

In recent years, there has been significant progress in our understanding of the dynamics of quantum many-body systems out of thermal equilibrium, where concepts such as the quantum Boltzmann distribution are not applicable. In particular, there has been a large amount of work towards understanding the non-equilibrium dynamics of one-dimensional integrable quantum systems (for a review, see Polkovnikov et al. (2011); Eisert et al. (2015); Gogolin and Eisert (2016)).

Integrable quantum systems are characterized by having a large number of conserved charges (whose expectation values do not change under time evolution). We can write these conserved quantities as quantum operators $Q_n$, labeled by some integer $n$, with the property that they commute with the Hamiltonian, $[H,Q_n]=0$. Generally these charges also commute with each other, $[Q_n,Q_m]=0$, and thus can all be simultaneously diagonalized. This large number of dynamical constraints usually results in these systems being exactly solvable, and enables analytic computation of certain quantities (Mussardo, 2010, and references therein). One important result in the study of integrable systems out of equilibrium is the realization that even after very long times, these systems never reach a state of thermal equilibrium. It is now understood that at long times, physical observables of integrable quantum systems generally reach an equilibrium state described by a generalized Gibbs ensemble (GGE) Rigol et al. (2007); Barthel and Schollwöck (2008); Calabrese et al. (2012), where the probabilities associated with a given state $|\Psi\rangle$ are given by

$$P_\Psi=\frac{\langle\Psi|e^{-\sum_n\beta_nQ_n}|\Psi\rangle}{Z},\qquad Z={\rm Tr}\{e^{-\sum_n\beta_nQ_n}\}, \tag{20}$$

where the $\beta_n$ can be considered to be “effective temperatures” corresponding to each of the higher conserved quantities of the integrable model. It is important to point out that in quantum integrable models, the conserved quantities, $Q_n$, are generally extensive and can be expressed as the sum of local operators, which are essential properties needed for (20) to be a reasonable ensemble for statistical physics Vidmar and Rigol (2016).

One appeal of using a GGE in a machine learning algorithm is that it may be possible to store a large amount of non-trivial information in the effective temperature parameters, $\beta_n$. The effective temperature variables contain information directly related to the macroscopic quantities of a system. It can then be reasonably expected that if one learns a handful of effective temperatures corresponding to a given macroscopic output, this may carry more essential information than learning a similar number of microscopic parameters, such as the coupling between two particular input variables. We then propose that in some cases, it should be more useful to learn a set of effective temperatures, $\beta_n$, than to learn the full set of microscopic couplings, $\Gamma_i$, $b_i$, $w_{i,j}$. This hypothesis is further motivated by the results of Li et al. (2016), which show that the temperature is a very useful parameter to learn in a traditional Boltzmann machine; in our case, this idea is exploited by having a large number of temperature-like parameters.

The starting point of the GGE algorithm is the Hamiltonian of an integrable quantum spin chain, and its set of conserved charges. A simple integrable spin chain is described by a homogeneous Hamiltonian similar to (9), where we set the longitudinal field to zero and take uniform nearest-neighbor couplings on a chain of $L$ sites. This is the prototypical transverse-field Ising (TFI) model

$$H=-w\sum_{i=1}^{L}\sigma^z_i\sigma^z_{i+1}+\Gamma\sum_{i=1}^{L}\sigma^x_i. \tag{21}$$

The conserved charges of this model are Fagotti and Essler (2013)

 Q+n=−wL∑i=1(Sxxi,i+n+Syyi,i+n−2)−ΓwL∑i=1(Sxxi,i+n−1+Syyi,i+n−1),Q−n=−wL∑i=1(Sxyi,i+n−Syxi,i+n−2), (22)

with . It can be shown that the TFI describes a system of non-interacting fermionic quasiparticles, which makes it the simplest integrable chain with spin 1/2. A GGE machine could also be built, in principle, using other more general interacting spin chain models.

The Hilbert space can again be split into spin variables corresponding to visible and hidden nodes. The probabilities for a given input and output are then given by computing the GGE average of $\Lambda_i$, as

$$P^o_i=\langle\Lambda_i\rangle^o_{\rm GGE}\equiv\frac{{\rm Tr}\{\Lambda_i e^{-\sum_n\beta^o_nQ_n}\}}{Z^o},\qquad Z^o={\rm Tr}\{e^{-\sum_n\beta^o_nQ_n}\}, \tag{23}$$

where the output dependence has now been shifted only to the effective temperature parameters $\beta^o_n$. This GGE machine can also be interpreted as a standard quantum Boltzmann machine, with effective Hamiltonian $H^o=\sum_n\beta^o_nQ_n$.

The training process now consists of learning the effective temperatures that yield the most accurate probabilities. This can be done again by defining a cross-entropy as

$$L=-\sum_{i,o}P^{o,{\rm data}}_i\log\langle\Lambda_i\rangle^o_{\rm GGE} \tag{24}$$

and attempting to minimize it by gradient descent. The gradient of the cross-entropy can be computed as in the quantum Boltzmann machine as

$$\partial_\theta L=-\sum_{i,o}P^{o,{\rm data}}_i\left(\frac{{\rm Tr}[\Lambda_i\partial_\theta e^{-H^o}]}{{\rm Tr}[\Lambda_i e^{-H^o}]}-\frac{{\rm Tr}[\partial_\theta e^{-H^o}]}{{\rm Tr}[e^{-H^o}]}\right), \tag{25}$$

where $\theta$ now stands for the effective temperature parameters $\beta^o_n$.

The full advantage of the GGE machine as a quantum Boltzmann machine now becomes evident. It is easy to see that

$$[\partial_\theta H^o,H^o]=0, \tag{26}$$

which follows from the fact that all the conserved charges commute with each other. The GGE machine, built out of a large set of mutually commuting conserved charges, is then the optimal quantum Boltzmann machine, in that it is the only quantum machine that can be efficiently trained by minimizing the cross-entropy. The gradient can then be simply written as

$$\partial_\theta L=\sum_{i,o}P^{o,{\rm data}}_i\left(\frac{\langle\Lambda_i\partial_\theta H^o\rangle^o_{\rm GGE}}{\langle\Lambda_i\rangle^o_{\rm GGE}}-\langle\partial_\theta H^o\rangle^o_{\rm GGE}\right), \tag{27}$$

which can readily be estimated by sampling. The cross-entropy can thus be efficiently minimized using a simple gradient descent technique.
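Eq. (27) can be checked numerically on a toy example. The sketch below does not build the actual TFI charges; as an assumption made purely for illustration, it uses random symmetric matrices that share a common eigenbasis as stand-ins for the $Q_n$, a single output class, and a four-dimensional Hilbert space with one visible and one hidden spin. All names and numerical values are hypothetical.

```python
import numpy as np

def gibbs_weight(H):
    """exp(-H) for a real symmetric H, via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(H)
    return (vecs * np.exp(-vals)) @ vecs.T

def commuting_charges(dim, n_charges, seed=0):
    """Hypothetical stand-ins for Q_n: symmetric matrices sharing one eigenbasis."""
    rng = np.random.default_rng(seed)
    basis, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return [basis @ np.diag(rng.normal(size=dim)) @ basis.T for _ in range(n_charges)]

def gge_cross_entropy(betas, charges, projectors, p_data):
    """Eq. (24) for a single output class, with H = sum_n beta_n Q_n."""
    rho = gibbs_weight(sum(b * q for b, q in zip(betas, charges)))
    z = np.trace(rho)
    return -sum(p * np.log(np.trace(lam @ rho) / z) for p, lam in zip(p_data, projectors))

def gge_gradient(betas, charges, projectors, p_data):
    """Eq. (27): the derivative with respect to beta_n only involves GGE averages of Q_n."""
    rho = gibbs_weight(sum(b * q for b, q in zip(betas, charges)))
    z = np.trace(rho)
    grad = []
    for q in charges:
        free = np.trace(q @ rho) / z
        grad.append(sum(p * (np.trace(lam @ q @ rho) / np.trace(lam @ rho) - free)
                        for p, lam in zip(p_data, projectors)))
    return np.array(grad)

# Toy setup (assumed): projectors on the two states of the visible spin, as in Eq. (10).
charges = commuting_charges(dim=4, n_charges=3)
projectors = [np.kron(np.diag([1.0, 0.0]), np.eye(2)),
              np.kron(np.diag([0.0, 1.0]), np.eye(2))]
p_data = [0.8, 0.2]
betas = np.array([0.3, -0.2, 0.5])

def numeric_gradient(eps=1e-6):
    """Central finite differences of the cross-entropy, one beta_n at a time."""
    out = []
    for n in range(len(betas)):
        step = np.zeros(len(betas))
        step[n] = eps
        out.append((gge_cross_entropy(betas + step, charges, projectors, p_data)
                    - gge_cross_entropy(betas - step, charges, projectors, p_data)) / (2 * eps))
    return np.array(out)

print(gge_gradient(betas, charges, projectors, p_data))
print(numeric_gradient())
```

The projectors do not commute with the charges here, so the example is genuinely quantum; only the mutual commutativity of the charges is needed for the two results to agree.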

We now discuss how the GGE machine is also more efficient in the case of state-based training. We suppose the training data set is given as a density matrix $\rho^o_{\rm data}$. The goal is then to tune the effective temperature parameters such that the GGE density matrix $\rho^o$ approximates the training data density matrix. We then minimize the relative entropy (17). The gradient of this cost function is then given by

$$\partial_\theta L=\beta\sum_o\left\{{\rm Tr}[\rho^o_{\rm data}\partial_\theta H^o]-{\rm Tr}[\rho^o\partial_\theta H^o]\right\}. \tag{28}$$

Again, these quantum traces can be computed by defining the orthogonal basis of states, $|n\rangle$, such that

$${\rm Tr}[\rho^o_{\rm data}\partial_\theta H^o]=\sum_{n,m}\langle n|\rho^o_{\rm data}|m\rangle\langle m|\partial_\theta H^o|n\rangle, \tag{29}$$

which is so far identical to (19), thus, in order to compute this trace, an amount of nontrivial matrix elements needs to be computed. It is also convenient to express this trace in terms of the basis of eigenstates of the operator , which we denote as , which diagonalize this operator such that . We can then write the trace as

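$${\rm Tr}[\rho^o_{\rm data}\partial_\theta H^o]=\sum_{n,m,l}\langle n|\rho^o_{\rm data}|m\rangle\,(\partial_\theta H)_l\,(c^H_{ml})^*\,c^H_{ln}, \tag{30}$$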

where we have defined the overlaps between the states in the two different bases, . The problem of computing matrix elements is then shifted to that of computing overlaps .

The advantage of the GGE machine over the general quantum Boltzmann machine becomes apparent when one considers computing the trace for another parameter, . For the general quantum Boltzmann machine, every new parameter generates an entire new basis of eigenstates, making it necessary to compute overlaps when computing the derivative with respect to all parameters. In the case of the GGE machine, all operators are simultaneously diagonalizable, therefore, no new overlaps need to be computed when considering a new parameter . One then only needs to compute parameters to perform the full computation, even though one still has parameters (effective temperatures) available for training.

A limitation of the GGE machine is that it can only learn information about the diagonal elements of the density matrix $\rho^o_{\rm data}$ in the eigenstate basis of the conserved charges, that is, only the matrix elements taken between an eigenstate and itself. The GGE machine is able to learn about these diagonal elements much more efficiently than a generic quantum Boltzmann machine would learn about a comparable number of matrix elements. Whether these diagonal matrix elements carry sufficient information for the algorithm to work satisfactorily will depend on the particular problem. The basis of eigenstates depends on the Hamiltonian parameters, in this case $w$ and $\Gamma$. It is then also possible to optimize the GGE machine by selecting the Hamiltonian parameters for which the matrix $\rho^o_{\rm data}$ is closest to diagonal. For this reason it would be interesting in the future to implement a GGE machine based on interacting integrable quantum spin chain Hamiltonians, such as the Heisenberg chain Ilievski et al. (2015), rather than free models like the TFI. This is because the basis of eigenstates depends much more strongly on the Hamiltonian parameters in interacting systems, with changes in parameters inducing wider changes to the eigenstate basis Sotiriadis et al. (2012), making it possible to learn properties of more diverse classes of matrices $\rho^o_{\rm data}$ by choosing the optimal eigenstate basis.

In the next section we will show a simple application of the GGE machine to the problem of classifying handwritten digits of the MNIST data set. We will see that the amount of information learned by this GGE machine leads to reasonably accurate results, relative to the simplicity of the algorithm.

## IV A simple GGE machine for the MNIST dataset

We now test a very simple implementation of the GGE machine towards the problem of classifying the MNIST data set. We use the TFI Hamiltonian, (21). Because of the commutativity of the conserved charges, it is possible to write a simplified GGE algorithm, where quantum effects are suppressed, by choosing to implement the data in a convenient state basis, and eliminating hidden variables. This toy model then only shows that learning effective temperatures is a viable approach towards the MNIST classification problem, but it does not test if there are any advantages to introducing quantum effects. Quantum effects can be introduced by choosing a different basis and adding hidden variables, but such an implementation is beyond the scope of this introductory paper.

The Hamiltonian (21) describes an integrable spin chain which can be diagonalized by a Jordan-Wigner and a Bogoliubov transformation Lieb et al. (1961) to give a free fermion model

$$H=\sum_k\varepsilon_k\left(\eta^\dagger_k\eta_k-\frac{1}{2}\right), \tag{31}$$

where the single particle energies are and  Calabrese et al. (2012). The momenta are quantized as in the even and in the odd sector, and . The even or odd sector of momentum values correspond to whether periodic or antiperiodic boundary conditions are imposed on the free fermion operator. We interpret the input data, vectors of 256 grayscale values for each pixel, as the eigenstates of the Hamiltonian (21). In order to do that, we binarize the MNIST data by setting a pixel value to 0 if it is smaller than 256/2, and 1 otherwise. The states are then given in the basis of the occupation number operator of the Bogoliubov fermions - a 0 pixel means that the corresponding fermionic excitation is not occupied and a pixel 1 that it is occupied. A state belongs to the even/odd sector if it has a total even/odd number of excitations. From a practical standpoint, binarizing the data is not necessary, but we do it here to keep in line with the physical interpretation, given that fermions satisfy the Pauli exclusion principle.
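The binarization and the assignment of a state to the even or odd sector described above can be sketched in a few lines. The function names are illustrative choices; the threshold of 256/2 and the parity rule are the ones stated in the text.

```python
def binarize(pixels, threshold=256 // 2):
    """Map grayscale values to occupation numbers: 0 below the threshold, 1 otherwise."""
    return [0 if p < threshold else 1 for p in pixels]

def sector(occupations):
    """A state is in the even (odd) sector if it has an even (odd) number of excitations."""
    return "even" if sum(occupations) % 2 == 0 else "odd"

# Short example vector; a real MNIST image would be flattened to a much longer list.
occupations = binarize([0, 200, 127, 128, 255])
print(occupations, sector(occupations))  # prints [0, 1, 0, 1, 1] odd
```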

The learning part of the algorithm consists in optimizing the set of effective temperatures that reproduce the output data .

The probability that a configuration of pixels $|v^{(i)}\rangle$ represents a digit $o$ is

$$P^o_i=\frac{{\rm Tr}[\Lambda_i e^{-\sum_n\beta^o_nQ_n}]}{Z}=\frac{e^{-\sum_n\beta^o_n\langle v^{(i)}|Q_n|v^{(i)}\rangle}}{Z}, \tag{32}$$

where the second equality follows from the fact that the states $|v^{(i)}\rangle$ are orthogonal eigenstates of all the conserved charges (because in this implementation we chose to interpret the data in the convenient state basis that makes it so). From this point onward, this implementation of the GGE machine is effectively classical, the quantum averaging having been eliminated by this representation.

Here $Q_n$ are the conserved charges of the TFI chain. In terms of the Bogoliubov fermions, they have the simple form Fagotti and Essler (2013)

$$Q^+_n=\sum_k\varepsilon_k\cos[nk]\,\eta^\dagger_k\eta_k,\qquad Q^-_n=-2w\sum_k\sin[(n+1)k]\,\eta^\dagger_k\eta_k. \tag{33}$$

The even charges are defined for , whereas the odd charges are defined for . The training of the algorithm corresponds to learning sets of Lagrange multipliers for each digit in order to produce the appropriate probabilities in (32), with the further simplification that we only learn the conditional probabilities

$$P^o_i=\frac{\langle v^{(i)}|e^{-\sum_n\beta^o_nQ_n}|v^{(i)}\rangle}{\sum_{o'=0}^{9}\langle v^{(i)}|e^{-\sum_n\beta^{o'}_nQ_n}|v^{(i)}\rangle}, \tag{34}$$

thus circumventing the difficulty of calculating the full partition function. As is standard practice in neural networks, the algorithm additionally learns “biases”. These are Lagrange multipliers which correspond to a trivial charge, the identity operator.

The general expectation is that one can reach a reasonable level of accuracy in classifying the images by including a subset of the conserved charges and learning the corresponding parameters $\beta^o_n$. On the other hand, the traditional Boltzmann machine would need to learn a larger set of parameters in order to reach the same accuracy. The GGE-based algorithm can then provide a computationally much cheaper alternative. We analyze and support this in the following section.

## V The algorithm and performance on the MNIST dataset

The MNIST data is supplied as vectors, read from the images line by line starting from the top left corner and binarized. From it, we extract a total of conserved charges which are the features we later feed to a neural network. For a given , we calculate the expectation values of the same number of even and odd charges defined in eq. (33) if is even, or one extra even charge if is odd. The features we select are therefore

$$Q_{n}=\begin{cases}Q^{+}_{n/2} & \text{if } n \text{ even}\\ Q^{-}_{(n+1)/2} & \text{if } n \text{ odd}\end{cases},\qquad n=0,1,2,\dots,N-1. \tag{35}$$
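The feature selection in eqs. (33) and (35) can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes that the mode occupations $\langle\eta^\dagger_k\eta_k\rangle$ of a given input state, the allowed momenta and the dispersion $\varepsilon_k$ have already been computed elsewhere and are passed in as arrays. The function name `charge_features` and its arguments are hypothetical.

```python
import numpy as np

def charge_features(occupations, momenta, dispersion, w, n_features):
    """Return the features Q_0 .. Q_{N-1} of eq. (35) for a single state.

    occupations : array of <eta_k^dagger eta_k> for each mode (assumed given)
    momenta     : array of the momenta k (assumed given)
    dispersion  : array of epsilon_k (assumed given)
    w           : the prefactor parameter appearing in the odd charges of eq. (33)
    """
    occupations = np.asarray(occupations, dtype=float)
    momenta = np.asarray(momenta, dtype=float)
    dispersion = np.asarray(dispersion, dtype=float)

    features = np.empty(n_features)
    for n in range(n_features):
        if n % 2 == 0:
            # Even n: Q^+_{n/2} = sum_k eps_k cos(m k) <eta^dag eta>
            m = n // 2
            features[n] = np.sum(dispersion * np.cos(m * momenta) * occupations)
        else:
            # Odd n: Q^-_{(n+1)/2} = -2 w sum_k sin((m + 1) k) <eta^dag eta>
            m = (n + 1) // 2
            features[n] = -2.0 * w * np.sum(np.sin((m + 1) * momenta) * occupations)
    return features
```

The index convention for the odd charges follows eq. (35) as printed; if a different labelling of the odd charges is intended, only the line computing `m` in the odd branch changes.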

The principal component analysis (PCA) plot and histograms in fig. 1 show clustering of the data for each digit, indicating that the conserved charges for a particular digit do capture the similarities between different training instances of that digit. On the other hand, there are significant overlaps between the different clusters, which is expected considering that we discard a lot of information by using only charges. We use PCA only to illustrate the data, and not to change the basis before feeding the data to the network.

Having selected the features with , we train a fully connected single-layer neural network. The input layer consists of the selected features, which are rescaled to have zero mean and unit variance, as is standard practice in neural networks (see note (28) in the references). The input layer is connected to a softmax output layer with 10 nodes, corresponding to the ten possible digits. There is an additional bias node that connects to each output node. The weights connecting the layers are contained in the matrix $W$ and the bias weights in the vector $b$. The output is calculated as

$$y=\phi(Wx+b), \tag{36}$$

where $\phi$ is the activation function, in this case the softmax function Goodfellow et al. (2016). This corresponds to eq. (34) where, for each training sample, $x$ is a vector of the conserved charges and $y$ is a vector containing the output probabilities for all the digits. The weight matrix $W$ contains the Lagrange multipliers $\beta^{o}_{n}$, and the biases $b$ are the multipliers corresponding to the identity charge.
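The forward pass of eq. (36), together with the rescaling of the inputs described above, can be sketched as follows. This is an illustrative NumPy version under the assumption that the charges of all samples are stacked row-wise in a matrix; the function names `standardize`, `softmax` and `predict_proba` are chosen here and do not come from the paper.

```python
import numpy as np

def standardize(X):
    """Rescale each feature (column) to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std > 0, std, 1.0)  # leave constant features unscaled
    return (X - mean) / std, mean, std

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_proba(X, W, b):
    """Evaluate y = softmax(W x + b) for every row x of X.

    X : (n_samples, n_charges) matrix of rescaled charges
    W : (n_digits, n_charges) weight matrix
    b : (n_digits,) bias vector
    """
    return softmax(X @ W.T + b)
```

In this picture each row of `W` holds the multipliers for one digit, so the number of learned parameters grows linearly with the number of charges kept.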

The weights and biases are initialized at random values and they are learned by training the network to minimize the cross-entropy, including a regularization term Goodfellow et al. (2016); Pedregosa et al. (2011).
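The training objective described above can be illustrated with a short sketch. The specific form of the regularization is an assumption here: the example uses a squared-norm (L2) penalty on the weights with strength `alpha`, and the normalization of the penalty is a convention that differs between libraries. The function name is hypothetical.

```python
import numpy as np

def regularized_cross_entropy(probs, labels, W, alpha):
    """Mean cross-entropy of the predicted probabilities plus a weight penalty.

    probs  : (n_samples, n_digits) predicted probabilities
    labels : (n_samples,) integer digit labels
    W      : weight matrix; biases are left unpenalized in this sketch
    alpha  : regularization strength (a hyperparameter)
    """
    n = probs.shape[0]
    # Probability assigned to the correct digit, clipped to avoid log(0)
    p_true = np.clip(probs[np.arange(n), labels], 1e-12, 1.0)
    cross_entropy = -np.mean(np.log(p_true))
    penalty = alpha * np.sum(W ** 2)  # assumed L2 form
    return cross_entropy + penalty
```

A gradient-based optimizer then lowers this quantity with respect to the weights and biases.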

The accuracy rates depend on a number of hyperparameters (parameters of the algorithm and not the model). These include the regularization parameter, the learning rate, the stopping criterion and the maximum number of epochs Goodfellow et al. (2016); Pedregosa et al. (2011). The optimal values of the hyperparameters are chosen using randomized parameter optimization Bergstra and Bengio (2012) and stratified 10-fold cross-validation. In this procedure the full training set is split into ten parts called “folds” (with the sample distribution in each fold similar to the full dataset distribution), the model is trained on nine of the folds, and the performance is measured on the remaining, previously unseen, one. This is repeated ten times so that each fold acts as the validation set once, and the performance is averaged. The procedure is repeated for each combination of the hyperparameters. The hyperparameter values for the combinations are picked at random from a specified range (for continuous parameters) or a set (for discrete parameters).
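The procedure described above can be sketched in plain NumPy. This is not the authors' implementation: the sampling ranges, the hyperparameter names and the callback `fit_and_score` (which is assumed to train a model on the training folds and return the validation accuracy) are all hypothetical placeholders.

```python
import numpy as np

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign a fold index to every sample so each fold mirrors the class mix."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    fold_of = np.empty(labels.shape[0], dtype=int)
    for cls in np.unique(labels):
        # Shuffle the samples of this class and deal them out round-robin
        idx = rng.permutation(np.flatnonzero(labels == cls))
        fold_of[idx] = np.arange(idx.size) % n_folds
    return fold_of

def sample_hyperparameters(rng):
    """Draw one random combination (ranges are illustrative assumptions)."""
    return {
        "alpha": 10.0 ** rng.uniform(-6, -1),          # continuous, log-uniform
        "learning_rate": 10.0 ** rng.uniform(-4, -1),  # continuous, log-uniform
        "max_epochs": int(rng.choice([50, 100, 200])), # discrete set
    }

def random_search(X, y, fit_and_score, n_draws=20, n_folds=10, seed=0):
    """Return the sampled combination with the best mean validation score."""
    rng = np.random.default_rng(seed)
    fold_of = stratified_folds(y, n_folds, seed)
    best_params, best_score = None, -np.inf
    for _ in range(n_draws):
        params = sample_hyperparameters(rng)
        scores = []
        for f in range(n_folds):
            train, val = fold_of != f, fold_of == f
            scores.append(fit_and_score(X[train], y[train], X[val], y[val], params))
        mean_score = float(np.mean(scores))
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

Each fold serves as the validation set exactly once per sampled combination, so the cost of the search is the number of draws times the number of folds.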

With the selected hyperparameters the network is trained on the samples of the training set using stochastic average gradient descent and stratified 10-fold cross-validation. Typical weights and biases learned are shown in fig. 2, whereas fig. 3 shows typical learning curves. We can conclude that the model is not overfitted, but the relatively high error might indicate a high bias with respect to the data. However, since we significantly truncate the number of input features, thus losing a portion of the information, the high error is not unexpected. Furthermore, since the two curves converge, adding further training instances would most likely not improve the performance.

The algorithm performance is measured by the accuracy of the prediction on the samples of the test set. We plot the observed accuracies with respect to the number of charges kept in the truncated GGE and with respect to the Hamiltonian parameter in fig. 4. In this section, . As is expected, keeping a larger number of charges in the GGE improves performance. However, the accuracy rate saturates at about with and does not improve when further charges are added to the ensemble: the accuracy rate when using charges with is still . While this score is lower than what state-of-the-art algorithms achieve, the approach nevertheless gives a systematic way of reducing the number of parameters of the model. The simplest neural network, trained in the original MNIST paper LeCun et al. (1998), consists of a single layer of 784 input nodes which are fed the pixel data and are connected to the 10 output nodes. This model has free parameters and yields accuracy. This is similar to the GGE model with charges at and free parameters which yields accuracy, a slight improvement over using the raw pixel data. However, we can reach similar accuracies even with a much smaller number of parameters, as exemplified by fig. 4. A network with and has parameters and yields accuracy, whereas a network with and has parameters and yields .
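The parameter counts compared above follow directly from the architecture: one weight for every input-output pair plus one bias for every output node. The short sketch below makes this arithmetic explicit; the helper name and the charge counts in the loop are illustrative choices, not values taken from the paper.

```python
def softmax_layer_parameters(n_inputs, n_outputs=10):
    """Number of free parameters of a single softmax layer with biases."""
    return n_inputs * n_outputs + n_outputs

# Raw-pixel model: 784 inputs connected to 10 outputs
print(softmax_layer_parameters(784))  # 7850

# Truncated-GGE models: the count scales linearly with the charges kept
for n_charges in (20, 100, 400):  # hypothetical values of N
    print(n_charges, softmax_layer_parameters(n_charges))
```

Keeping fewer charges therefore reduces the number of learned parameters proportionally.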

For further comparison, we have trained a standard classification restricted Boltzmann machine (as defined in the introduction) Hinton et al. (2006) to classify the MNIST data. In this case, the network consists of a visible layer of 784 stochastic binary input nodes, a hidden layer with stochastic binary nodes and an output layer with 10 binary nodes. The input, raw MNIST data rescaled to the interval , is fed to the visible layer. The nodes of the hidden layer are connected to the visible nodes and are activated by a sigmoid function. The energy of this system is given in eq. (2), and the weights and biases are learned using the contrastive divergence algorithm such that the energy is minimized. Using a grid search for the hyperparameters, we find that the best configuration is a network with 100 hidden nodes, thus learning parameters. The hidden nodes are further connected to a softmax output layer, as in the GGE case, and this part of the network is trained as a supervised model. This adds an additional parameters, for a total of . The accuracy on the test set is in this case . To reiterate, in the described setup, the RBM is used to extract 100 features which are used for classification. Using 100 GGE conserved quantities, in contrast, yields a test set accuracy of . This is not surprising, since the features which the RBM learns are not restricted to any analytic form, whereas the GGE conserved quantities are physical quantities with fixed analytic forms. In order to compare the computational complexities of the two approaches, we train an RBM with and optimized hyperparameters, which achieves a similar performance to our GGE algorithm. In this example, the RBM has a test set accuracy of with the trade-off of learning a total of parameters, while the GGE algorithm achieves while learning parameters with and . While there is a computational cost associated with calculating the charges we feed to the network, this is still a decrease in the total cost.
The key difference here is the fact that the GGE algorithm assumes a simple Hamiltonian (21) with homogeneous coupling, whereas the RBM learns an inhomogeneous Hamiltonian with many different coupling constants.

## VI Conclusions

Inspired by the parallels between statistical mechanics and supervised learning, we described a machine learning algorithm based on the generalized Gibbs ensemble. This GGE algorithm turns out to be an optimal implementation of a quantum Boltzmann machine. It is the only quantum Boltzmann machine which can be efficiently trained by minimizing the cross-entropy function via gradient descent. This result follows from the fact that all the conserved charges corresponding to an integrable Hamiltonian commute with each other.

This commutativity property also allows us to write a simplified implementation of the GGE machine. This simplified algorithm assumes that the input is an eigenstate of a simple Hamiltonian, and uses this to extract conserved charges from the inputs. Interpreting the data to be in this eigenstate basis, and eliminating hidden variables, results in all quantum effects being suppressed. This simplified version of the GGE machine can then be implemented and trained as a classical algorithm. Our numerical experiment thus tests the viability of using effective temperatures as useful parameters in machine learning algorithms, but it does not yet test whether there are any computational benefits to introducing quantum effects. Quantum effects can be reintroduced by working in a different basis of states and introducing hidden variables, and shall be studied in future projects.

The effective temperatures of the GGE are directly related to macroscopic observable quantities. It is therefore expected that these parameters are more efficient quantities to learn than the typical couplings learned in a Boltzmann machine. The biggest advantage of our simplified classical GGE algorithm is then that it seems to achieve reasonably low error rates, while learning a comparatively low number of parameters. This advantage was shown explicitly by comparing with a restricted Boltzmann machine, where it was shown that the GGE algorithm outperforms an RBM with around 36 times the number of learned parameters.

While our GGE algorithm currently does not outperform state-of-the-art learning algorithms in terms of low error rates, it still proves to be a useful way to reduce the number of parameters to be learned. It is expected that the GGE algorithm can be further improved in the future, perhaps by adding more hidden layers, and especially by the introduction of quantum effects, which might make it a more competitive alternative compared to classical algorithms.

###### Acknowledgements.
The authors would like to thank Dirk Schuricht for valuable discussions and for proofreading the manuscript. ACC’s work is supported by the European Union’s Horizon 2020 under the Marie Skłodowska-Curie grant agreement 750092. This work is part of the D-ITP consortium, a program of the Netherlands Organisation for Scientific Research (NWO) that is funded by the Dutch Ministry of Education, Culture and Science (OCW).

## References

• Hinton et al. (2006) G. E. Hinton, S. Osindero,  and Y.-W. Teh, Neural computation 18, 1527 (2006).
• Hinton and Salakhutdinov (2006) G. E. Hinton and R. R. Salakhutdinov, Science 313, 504 (2006).
• Larochelle and Bengio (2008) H. Larochelle and Y. Bengio, in Proceedings of the 25th international conference on Machine learning (ACM, 2008) pp. 536–543.
• Coates et al. (2011) A. Coates, A. Ng,  and H. Lee, in Proceedings of the fourteenth international conference on artificial intelligence and statistics (2011) pp. 215–223.
• Li et al. (2016) G. Li, L. Deng, Y. Xu, C. Wen, W. Wang, J. Pei,  and L. Shi, Scientific reports 6, 19133 (2016).
• Passos and Papa (2017) L. A. Passos and J. P. Papa, Neural Processing Letters , 1 (2017).
• Hopfield (1982) J. J. Hopfield, Proceedings of the national academy of sciences 79, 2554 (1982).
• Hinton and Sejnowski (1983) G. E. Hinton and T. J. Sejnowski, in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (Citeseer, 1983) pp. 448–453.
• Lieb et al. (1961) E. Lieb, T. Schultz,  and D. Mattis, Annals of Physics 16, 407 (1961).
• Hinton (2012) G. E. Hinton, in Neural networks: Tricks of the trade (Springer, 2012) pp. 599–619.
• Goodfellow et al. (2016) I. Goodfellow, Y. Bengio, and A. Courville, Deep learning, Vol. 1 (MIT press Cambridge, 2016).
• (12) Available at http://yann.lecun.com/exdb/mnist/.
• LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio,  and P. Haffner, Proceedings of the IEEE 86, 2278 (1998).
• Ciregan et al. (2012) D. Ciregan, U. Meier,  and J. Schmidhuber, in Computer vision and pattern recognition (CVPR), 2012 IEEE conference on (IEEE, 2012) pp. 3642–3649.
• Rigol et al. (2007) M. Rigol, V. Dunjko, V. Yurovsky,  and M. Olshanii, Phys. Rev. Lett. 98, 050405 (2007).
• Calabrese et al. (2012) P. Calabrese, F. H. Essler,  and M. Fagotti, J. Stat. Mech. 2012, P07022 (2012).
• Fagotti and Essler (2013) M. Fagotti and F. H. Essler, Phys. Rev. B 87, 245107 (2013).
• Amin et al. (2016) M. H. Amin, E. Andriyash, J. Rolfe, B. Kulchytskyy,  and R. Melko, arXiv preprint arXiv:1601.02036  (2016).
• Kieferova and Wiebe (2016) M. Kieferova and N. Wiebe, arXiv preprint arXiv:1612.05204  (2016).
• Polkovnikov et al. (2011) A. Polkovnikov, K. Sengupta, A. Silva,  and M. Vengalattore, Reviews of Modern Physics 83, 863 (2011).
• Eisert et al. (2015) J. Eisert, M. Friesdorf,  and C. Gogolin, Nature Physics 11, 124 (2015).
• Gogolin and Eisert (2016) C. Gogolin and J. Eisert, Reports on Progress in Physics 79, 056001 (2016).
• Mussardo (2010) G. Mussardo, Statistical field theory: an introduction to exactly solved models in statistical physics (Oxford University Press, 2010).
• Barthel and Schollwöck (2008) T. Barthel and U. Schollwöck, Phys. Rev. Lett. 100, 100601 (2008).
• Vidmar and Rigol (2016) L. Vidmar and M. Rigol, J. Stat. Mech. 2016, 064007 (2016).
• Ilievski et al. (2015) E. Ilievski, J. De Nardis, B. Wouters, J.-S. Caux, F. Essler,  and T. Prosen, Phys. Rev. Lett. 115, 157201 (2015).
• Sotiriadis et al. (2012) S. Sotiriadis, D. Fioretto,  and G. Mussardo, J. Stat. Mech , P02017 (2012).
• (28) Not rescaling the input data, which is better suited to the physical analogy that we use, yields approximately lower validation scores as compared to the same network architecture fed with rescaled data. For example, a network with and yields accuracy with rescaled and with raw, not scaled, charges. For the comparison, each network’s hyperparameters were optimized using a random search.
• Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot,  and E. Duchesnay, Journal of Machine Learning Research 12, 2825 (2011).
• Bergstra and Bengio (2012) J. Bergstra and Y. Bengio, Journal of Machine Learning Research 13, 281 (2012).