Restricted Boltzmann Machines for galaxy morphology classification with a quantum annealer
Abstract
We present the application of Restricted Boltzmann Machines (RBMs) to the task of astronomical image classification using a quantum annealer built by D-Wave Systems. Morphological analysis of galaxies provides critical information for studying their formation and evolution across cosmic time scales. We compress the images using principal component analysis to fit a representation on the quantum hardware. Then, we train RBMs with discriminative and generative algorithms, including contrastive divergence and hybrid generative-discriminative approaches. We compare these methods to Quantum Annealing (QA), Markov Chain Monte Carlo (MCMC) Gibbs sampling, Simulated Annealing (SA), as well as machine learning algorithms like gradient boosted decision trees. We find that RBMs implemented on D-Wave hardware perform well, and that they show some classification performance advantages on small datasets, but they do not offer a broadly strategic advantage for this task. During this exploration, we analyzed the steps required for Boltzmann sampling with the D-Wave 2000Q, including a study of temperature estimation, and examined the impact of qubit noise by comparing the original D-Wave 2000Q to the lower-noise version recently made available. While these analyses ultimately had minimal impact on the performance of the RBMs, we include them for reference.
I Introduction
Machine learning techniques are being used increasingly in high energy physics (e.g., Albertsson2018) and astrophysics (e.g., Ntampaka2019) for applications such as event detection, particle identification, data analysis, and simulation of detector responses. In many cases, machine learning provides an efficient alternative to analytical models, which are intractable, or Monte Carlo-based simulations, which can be computationally expensive. Data analysis tasks in cosmology are extremely computing intensive and will become even more so as new instruments like LSST lsst come online, motivating advances in data analysis techniques. Feynman first proposed the idea of using quantum computers to simulate physical systems Feynman1982. More recently, a variety of approaches have been studied for combining quantum computing with machine learning techniques Biamonte2017. While quantum computing hardware is still in the early stages of development, initial attempts have been made to apply quantum machine learning to high energy physics, e.g., classifying Higgs decay events in Large Hadron Collider data Mott2017. In this study, we focus on the challenge of morphological classification of galaxies in astronomical images using the D-Wave 2000Q quantum annealer dwave2000q. Galaxies exhibit morphologies (structure in their shape) that tend to correlate with their evolutionary state and history. For example, spiral galaxies (typically blue) have higher rates of star formation, often visible in clumpy regions of their spiral arms; irregulars have quasi-randomly distributed clumps of star formation; while elliptical galaxies (typically red) tend to have ceased forming stars. Star formation occurs more readily in relaxed kinematic environments, where gravity has sufficient relative influence to pull together cold material that can fuse into stellar cores.
Highly energetic or dense environments, like the cores of galaxy clusters or filaments of the cosmic web, may cause galaxy mergers or other disruptive events that can slow or halt star formation. Galaxies evolve rapidly in stellar mass over a range of cosmic redshifts (measures of cosmic age) around the peak of the star formation density. The rate of star formation is one of the primary measures of cosmic energy exchange, and structural and morphological analysis of galaxies permits a critical avenue of investigation of cosmic evolution. The accurate classification of galaxies based on morphology is a critical step in this analysis. Please see the review in Conselice2003 for more details. Classical methods for morphological analysis have typically relied on (a) visual examination, such as that conducted through the Galaxy Zoo project Willett2013; (b) multi-wavelength model fitting vika2015; and (c) structural proxies, like concentration, asymmetry, and clumpiness Conselice2003. Recent advances in deep learning permit the use of convolutional neural networks for morphological classification, which have become the state of the art Dieleman2015; Tuccillo2017; Barchi2019. The conventional convolutional neural network does not yet have an efficient implementation on the D-Wave quantum annealer. In this work, we use a different type of machine learning model, the Restricted Boltzmann Machine (RBM) Smolensky1986. While there are many other types of machine learning models, the RBM has stochastic binary variables and a quadratic energy functional, which can be efficiently implemented using the relatively small number of qubits available on near-term quantum computing devices, such as the D-Wave quantum annealer Johnson2011; Pudenz2013. Training an RBM is classically hard, but there is reason to believe quantum annealers may eventually offer performance advantages Adachi2015.
While quantum annealers are generally used for solving optimization problems, they have also been used in a machine learning context, where the quantum annealer is programmed with coefficients derived from the RBM and used as a sampling engine to generate samples from the Boltzmann distribution Adachi2015; Benedetti2016. As in classical machine learning, an iterative training process is used to refine the RBM coefficients. Quantum annealers offer a way to leverage the power of quantum computers while avoiding the complexity of a gate-based programming model, making them an attractive tool for domain scientists. However, given the limitations of present-day quantum annealers, this approach presents a number of challenges. Input data must be severely compressed to fit the available qubits. Samples from the quantum annealer may not be Boltzmann distributed, in which case post-processing or temperature estimation techniques may need to be applied.
I.1 Overview of the paper
This paper contains two main results. First, we show the outcome of studies of the distribution of states coming from the D-Wave. We considered a variety of post-processing techniques to bring the output distributions closer to Boltzmann distributions, which are theoretically necessary for training RBMs. Second, we show the results of trained RBMs and other algorithms for the galaxy morphology classification problem. We also include a discussion of the data and compression methods employed. Specifically, in Section II we briefly review the galaxy morphology classification datasets used for training and testing, and the techniques used to compress the data for the D-Wave 2000Q. In Section III we discuss the training algorithms used to prepare RBMs for classification, comparing a variety of options including various combinations of generative and discriminative training. In Section IV, we discuss a variety of post-processing steps for producing a Boltzmann-distributed set of energy states using the D-Wave quantum annealer. We also compare two versions of the D-Wave 2000Q, one of which features lower levels of noise. In Section V we study the performance of RBMs on the quantum device and using classical resources, and we compare with the performance of other classical machine learning algorithms. Finally, in Section VI we conclude and offer thoughts on future directions.
II Astronomical Data and Compression
II.1 Data: Galaxy Zoo
We use data from the Galaxy Zoo 2 data release, which contains 304,122 galaxies taken from the Sloan Digital Sky Survey Willett2013. For each image of a galaxy, this dataset includes crowdsourced answers to a set of 11 questions characterizing the galaxy's morphology. There are 16 million classifications of morphological features like bulges, disks, bars, and spirals. We simplify the problem into a binary classification problem by picking spiral galaxies (those with more than 50% "yes" answers to "Is there a spiral pattern?") and rounded smooth galaxies (those with more than 50% "completely round" answers to the question "How rounded is it?"). These classes contain 10,397 and 8,434 galaxies, respectively. We select 5,000 random images from each of the two classes. Before applying any data compression algorithm, we crop the images to 200 by 200 pixels.
II.2 Compression and Manipulation
The raw cropped images are still far too large, as RGB pixel arrays, to encode in the binary variables available on the D-Wave 2000Q. There are a number of interesting compression schemes available, including, for example, discrete variational autoencoders 2016arXiv160902200R; 2018arXiv180507445V. In practice, we found no appreciable advantage among the different compression schemes when reducing data dimensionality to the level where we could encode the essential information about a given image into the available binary variables. Therefore, we relied on principal component analysis (PCA), on the basis that the method is very simple and easy to explain and understand. We used 5,000 images to train a PCA model using Scikit-learn scikitlearn, and applied it to the remaining 5,000 images to obtain the dataset we used to train and test the RBM. The ratio of explained variance added by each PCA component is shown in Fig. 1. We can see that the information contained in each additional component rapidly decays.
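The compression step just described can be sketched with a few lines of linear algebra. This is a hedged illustration using a hand-rolled SVD-based PCA rather than the Scikit-learn implementation actually used in the paper, and the toy 8-dimensional vectors stand in for flattened galaxy images:

```python
import numpy as np

def fit_pca(X, n_components):
    """Fit PCA: center the data and keep the top right-singular vectors."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centered data matrix; rows of Vt are principal directions,
    # already sorted by decreasing singular value.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    # Fraction of total variance captured by each kept component.
    explained = (S**2) / (S**2).sum()
    return mean, components, explained[:n_components]

def transform(X, mean, components):
    """Project (centered) data onto the kept principal directions."""
    return (X - mean) @ components.T

# Toy stand-in for flattened galaxy images (here 8-dimensional vectors).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
mean, comps, ev = fit_pca(X, n_components=3)
Z = transform(X, mean, comps)
```

The explained-variance ratios decay with component index, mirroring the behavior shown in Fig. 1.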
The encoded components defined by PCA are 64-bit floating point numbers. We would like to transform them into a more compact representation. We do this by linearly mapping the range of each PCA component in the training set to the interval [15, 240]. We can then round to the nearest integer and store the data as unsigned 8-bit integers, which support numbers between 0 and 255. The range we map the training set into was chosen to be safely inside the [0, 255] interval in order to accommodate outliers beyond the ranges present in the training set. See Fig. 2 for an example of the compressed data. To these compact representations of the images, we add a bit representing the class (0 for rounded smooth galaxies and 1 for spiral galaxies). This means that if our RBM has n visible units, the first n − 1 will correspond to the first n − 1 bits of the compressed images, while the last visible unit will encode the class of each image. During the analysis, we were concerned that the digitization scheme employed was putting extra weight on the most significant bits of each encoded PCA component, but this was not information we could easily share directly with the RBM algorithm. We tested several different reordering schemes for the bits, and also tested preferentially keeping only the most significant bits of higher-order components as a way of including information from those components when working with small RBMs. However, the reordering schemes generally slightly degraded performance, and attempts to include a larger number of components using only the most significant bits did not offer any performance advantages.
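A minimal sketch of the digitization step, assuming the [15, 240] mapping described above; the sample values are invented for illustration:

```python
import numpy as np

def digitize(train_vals, vals, lo=15, hi=240):
    """Linearly map the training-set range of one PCA component to [lo, hi],
    then round and clip to unsigned 8-bit integers.  The margins below 15
    and above 240 absorb test-set outliers beyond the training range."""
    tmin, tmax = train_vals.min(), train_vals.max()
    scaled = (vals - tmin) / (tmax - tmin) * (hi - lo) + lo
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)

# Toy component: training range [-2, 3]; the value 4 is an out-of-range outlier.
train = np.array([-2.0, 0.0, 3.0])
codes = digitize(train, np.array([-2.0, 3.0, 4.0]))  # -> [15, 240, 255]
```

The 8 bits of each code (e.g. via `np.unpackbits`) then become 8 visible units of the RBM.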
III Generative and Discriminative Training
III.1 RBM training
Restricted Boltzmann machines (RBMs) are, generally speaking, a generative model, in which one attempts to approximate a target distribution over a string of binary variables v ∈ {0, 1}^n as the marginal distribution of a larger system composed of the binary variables v and latent binary variables h ∈ {0, 1}^m, with an ansatz such that

\[ P(v) = \frac{1}{Z} \sum_{h} \exp\Big( \sum_i a_i v_i + \sum_j b_j h_j + \sum_{i,j} v_i W_{ij} h_j \Big) \tag{1} \]
for some bias vectors a and b, and a weight matrix W; here Z is the normalizing partition function. This corresponds to a complete bipartite graph with local biases and interactions along the edges. The binary variables v are called the visible nodes, as they compose the distribution of interest, while the h are the hidden nodes. By variationally maximizing the log-likelihood of the data under the RBM model with respect to the weights and biases, we can train the RBM to better approximate the data distribution. It can be readily shown that maximizing the log-likelihood corresponds to matching the one- and two-point correlation functions of the model between states conditioned on the data distribution and the free generative model. Defining the loss L as the negative log-likelihood, we get derivatives with respect to the variational parameters
\[ \frac{\partial L}{\partial W_{ij}} = \langle v_i h_j \rangle_{\mathrm{model}} - \langle v_i h_j \rangle_{\mathrm{data}} \tag{2} \]
\[ \frac{\partial L}{\partial a_i} = \langle v_i \rangle_{\mathrm{model}} - \langle v_i \rangle_{\mathrm{data}} \tag{3} \]
\[ \frac{\partial L}{\partial b_j} = \langle h_j \rangle_{\mathrm{model}} - \langle h_j \rangle_{\mathrm{data}} \tag{4} \]
Collectively these derivatives form the gradient, used in gradient descent to compute the adjustments to W, a, and b, respectively. Here the expectations are computed over the training set and over the model (also called the positive and negative phases). Once the RBM has been trained, we can use it to make a prediction for the class of unseen images. To do this, we calculate the free energies of the RBM with the visible units set to the compressed image representation and each of the two options for the class bit. The class corresponding to the lowest free energy is then the most likely class for that image. This type of discriminative RBM was introduced in Larochelle2008.
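The classification rule just described can be sketched as follows; the RBM sizes and parameter values are arbitrary stand-ins, and the free-energy expression assumes the energy convention of Eq. (1):

```python
import numpy as np

def free_energy(v, a, b, W):
    """RBM free energy with the hidden layer summed out:
    F(v) = -a.v - sum_j log(1 + exp(b_j + (v @ W)_j))."""
    return -v @ a - np.logaddexp(0.0, b + v @ W).sum()

def classify(x, a, b, W):
    """Append each candidate class bit to the image bits and pick the class
    whose completed visible vector has the lower free energy."""
    f = [free_energy(np.append(x, c), a, b, W) for c in (0.0, 1.0)]
    return int(np.argmin(f))

# Tiny hand-set RBM: 4 image bits + 1 class bit visible, 3 hidden (toy sizes).
rng = np.random.default_rng(1)
a = rng.normal(size=5)
b = rng.normal(size=3)
W = rng.normal(size=(5, 3))
x = np.array([1.0, 0.0, 1.0, 1.0])
label = classify(x, a, b, W)
```

Only the class unit is varied at prediction time; the image bits are clamped to the compressed representation.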
III.2 Classical training algorithms
In general, one cannot compute expectations over the model directly, as doing so takes a time that grows exponentially with the number of units, where n and m represent the number of visible and hidden units, respectively. This is generally intractable. However, we can use a variety of algorithms to perform training.
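To make the intractability concrete, here is a brute-force computation of the model expectations by explicit enumeration of all joint states, under the energy convention of Eq. (1). It is only feasible at toy sizes (3 visible, 2 hidden below), since the loop count doubles with every added unit:

```python
import itertools
import numpy as np

def exact_model_expectations(a, b, W):
    """Exact <v_i h_j>, <v_i>, <h_j> under the RBM Boltzmann distribution by
    enumerating all 2**(n+m) joint states -- exponential cost."""
    n, m = len(a), len(b)
    Z = 0.0
    vh = np.zeros((n, m))
    vmean = np.zeros(n)
    hmean = np.zeros(m)
    for vbits in itertools.product([0, 1], repeat=n):
        va = np.array(vbits, dtype=float)
        for hbits in itertools.product([0, 1], repeat=m):
            ha = np.array(hbits, dtype=float)
            # Unnormalized Boltzmann weight exp(a.v + b.h + v W h).
            w = np.exp(a @ va + b @ ha + va @ W @ ha)
            Z += w
            vh += w * np.outer(va, ha)
            vmean += w * va
            hmean += w * ha
    return vh / Z, vmean / Z, hmean / Z

# Toy 3-visible, 2-hidden RBM with hand-picked parameters.
a = np.array([0.1, -0.2, 0.3])
b = np.array([0.0, 0.5])
W = np.full((3, 2), 0.1)
vh, vm, hm = exact_model_expectations(a, b, W)
```

With all parameters zero, the distribution is uniform, so every `<v_i>` and `<h_j>` is 0.5 and every `<v_i h_j>` is 0.25, which is a useful sanity check.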
III.2.1 Contrastive divergence
We may efficiently perform block Gibbs sampling updates of v and h, as the conditional distributions reduce to single-spin probabilities that can be sampled in time linear in the number of variables. Initializing a Markov chain performing such block Gibbs sampling at each training datapoint, and taking expectations over the resulting chains, is the basis of the contrastive divergence (CD) algorithm, first put forward in Hinton2002. Using CD, one can often train RBMs of quite large size reasonably efficiently.
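A minimal CD-1 sketch (a single block-Gibbs step and a batch update); the learning rate and layer sizes are illustrative, not the paper's settings:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, a, b, W, lr=0.1, rng=None):
    """One CD-1 step on a batch V of binary visible vectors."""
    if rng is None:
        rng = np.random.default_rng()
    ph = sigmoid(b + V @ W)                       # P(h=1 | v) on the data
    h = (rng.random(ph.shape) < ph).astype(float)  # sample hidden states
    pv = sigmoid(a + h @ W.T)                     # one block-Gibbs step back to v
    v1 = (rng.random(pv.shape) < pv).astype(float)
    ph1 = sigmoid(b + v1 @ W)                     # hidden probs at the reconstruction
    k = len(V)
    # Positive phase (data) minus negative phase (one-step reconstruction).
    W += lr * (V.T @ ph - v1.T @ ph1) / k
    a += lr * (V - v1).mean(axis=0)
    b += lr * (ph - ph1).mean(axis=0)
    return a, b, W

# Toy batch of 20 random 5-bit visible vectors; 3 hidden units.
rng = np.random.default_rng(3)
V = (rng.random((20, 5)) < 0.5).astype(float)
a, b, W = np.zeros(5), np.zeros(3), np.zeros((5, 3))
a, b, W = cd1_update(V, a, b, W, rng=rng)
```

Running more Gibbs steps before taking the negative-phase statistics gives CD-k; CD-1 is the cheapest and most common variant.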
III.2.2 Discriminative training
In this work, we are interested in using RBMs not as a strictly generative model, but as a classification algorithm. In essence, we wish to be able to input an image and sample the posterior distribution for the class of that image using the RBM. Thus, rather than directly modeling the full joint distribution P(x, y) over images x and classes y, as is standard practice, we are really interested only in P(y | x). Rather than training a model to represent the entire distribution, we can instead directly train to maximize the log-likelihood of the conditional distribution, as was proposed in Larochelle2008. The training process is much the same as before, except that now the sample over the model is vastly simplified: one takes the expectation with the image portion of the visible layer clamped to the data, reducing the effective number of free variables to merely the number of bits used to represent the class. In our case, where we use a single class variable, we can thus contract the graph in linear time to get an exact gradient. In general, one can perform this contraction efficiently for a unary encoding of the classes; using a binary representation for the classes, one can do it in time linear in the number of classes. Thus, training the discriminative model can be done efficiently and exactly on a classical computer, with no Markov chains required.
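The exact class posterior for a single class bit can be computed directly from the two clamped free energies, with no sampling, as a short sketch shows; the sizes and parameters are again arbitrary stand-ins:

```python
import numpy as np

def class_posterior(x, a, b, W):
    """Exact P(class | image) for a single class bit: a softmax over the two
    free energies obtained by clamping the class unit to 0 or 1."""
    def F(v):
        return -v @ a - np.logaddexp(0.0, b + v @ W).sum()
    logits = np.array([-F(np.append(x, c)) for c in (0.0, 1.0)])
    logits -= logits.max()           # for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Arbitrary toy parameters: 4 image bits + 1 class bit visible, 3 hidden.
rng = np.random.default_rng(5)
a = rng.normal(size=5)
b = rng.normal(size=3)
W = rng.normal(size=(5, 3))
p = class_posterior(np.array([0.0, 1.0, 1.0, 0.0]), a, b, W)
```

Because this posterior is exact and differentiable in the parameters, the discriminative gradient can be computed without any Markov chain.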
III.2.3 Hybrid approaches
Finally, one may consider a hybrid approach: for instance, taking a combination of both the aforementioned gradients, generative and discriminative, so as to better approximate the class posterior while still representing the full distribution. In this case, we can set a value α that combines the gradients ∇_gen and ∇_disc of the generative and discriminative models as
\[ \nabla = \nabla_{\mathrm{disc}} + \alpha\, \nabla_{\mathrm{gen}} \tag{5} \]
This approach was also investigated in Larochelle2008 and found to be beneficial at small values of α. We additionally investigate another hybrid approach, where we use generative training as a kind of pretraining and then follow it with pure discriminative training, which we dub "annealed hybrid" training, even though, if one is considering it as annealing the parameter α, it is better thought of as a quench. This was motivated by our observations of the performance of generative and discriminative training.
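The two hybrid schemes can be sketched as gradient-combination rules; `alpha` and the pretraining length are hypothetical knobs, not values from the paper:

```python
import numpy as np

def hybrid_gradient(grad_disc, grad_gen, alpha):
    """Eq. (5)-style hybrid update: the discriminative gradient plus a small
    admixture alpha of the generative gradient."""
    return grad_disc + alpha * grad_gen

def annealed_hybrid_gradient(grad_disc, grad_gen, epoch, pretrain_epochs):
    """The 'annealed hybrid' of the text: a purely generative gradient during
    a pretraining phase, then a quench to purely discriminative training."""
    return grad_gen if epoch < pretrain_epochs else grad_disc

# Toy gradients to show the combination.
gd = np.array([1.0, 0.0])
gg = np.array([0.0, 1.0])
g = hybrid_gradient(gd, gg, alpha=0.5)   # -> [1.0, 0.5]
```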
III.3 Generative training with quantum annealing
We also compare against a quantum annealing (QA) based method for estimating the negative phase (the intractable model expectation values), as well as alternatives to QA, including a pure Gibbs sampling MCMC algorithm initialized at a random position, and simulated annealing. In essence, we seek to understand what causes the observed QA performance by testing against other annealing algorithms. In training via quantum annealing, we take our RBM energy function, which is in the form of a QUBO (quadratic unconstrained binary optimization) problem, and use D-Wave's provided embedding function to map this QUBO onto the physical architecture of the D-Wave device, called a Chimera graph; see Fig. 3. This is necessary because the largest complete bipartite subgraph of the Chimera graph is of size 4x4. By minor embedding Choi2008MinorembeddingIA the graph, we identify chains of qubits and bind them tightly together so that each chain acts approximately as a single large spin. In each programming cycle, we use 100 samples drawn from the D-Wave to form our gradient estimate, and apply a varying number (typically 2) of post-anneal Gibbs sweeps over the variables to aid additional thermalization.
IV Boltzmann Distributions on the D-Wave Quantum Annealer
In order to train a Boltzmann machine, we need to sample expectation values from a Boltzmann distribution with the inverse temperature β set to 1. We use the Kolmogorov-Smirnov (KS) test to check the statistical consistency of our sample distribution with that of a Boltzmann distribution.
IV.1 Comparisons between initializing sampling with an annealer vs. a random bitstring
The raw distribution of states coming from a D-Wave 2000Q is often not close to a Boltzmann distribution with β = 1. It is "colder", with a higher propensity for producing states at the lowest energy levels. This energy shift may be advantageous in optimization problems, but RBM training relies on being able to sample from a Boltzmann distribution, so post-processing is generally required. For us, this post-processing consists of taking a few steps of Gibbs sampling. In this section, to check how many steps are enough, we carry out the KS test after each step and keep taking Gibbs steps until the KS p-value rises above 0.05. The advantage of starting the post-processing from samples drawn from a D-Wave 2000Q is not clear for some of the RBMs shown in Fig. 4, for instance in Fig. 4(d). On the other hand, Figs. 4(b) and 4(c) show some advantages. In all cases, however, there are regions of couplings for which we need quite a few steps, as shown in Fig. 4.
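The stop-when-Boltzmann check can be sketched with a hand-rolled two-sample KS statistic and the standard asymptotic p-value approximation (accurate for moderate statistic values; for very small ones the p-value is essentially 1 anyway):

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: maximum distance between the
    empirical CDFs of samples x and y."""
    xs, ys = np.sort(x), np.sort(y)
    grid = np.concatenate([xs, ys])
    cdf_x = np.searchsorted(xs, grid, side="right") / len(xs)
    cdf_y = np.searchsorted(ys, grid, side="right") / len(ys)
    return np.abs(cdf_x - cdf_y).max()

def ks_pvalue(d, n, m):
    """Asymptotic two-sample KS p-value via the Kolmogorov series.  The series
    converges poorly for small arguments, where the p-value is ~1 anyway."""
    lam = d * np.sqrt(n * m / (n + m))
    if lam < 0.3:
        return 1.0
    k = np.arange(1, 101)
    p = 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * (k * lam) ** 2))
    return float(min(max(p, 0.0), 1.0))

# Two clearly different samples: the statistic saturates at 1.
d = ks_statistic(np.zeros(500), np.ones(500))
```

In the loop described above, one would apply Gibbs sweeps to the annealer samples and recompute the p-value against reference Boltzmann-distributed energies until it rises above 0.05.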
IV.2 Temperature estimation
It is possible that the D-Wave returns a Boltzmann distribution, but at a temperature that needs to be determined. If we know the effective inverse temperature β_eff, we can sample from a distribution with β = 1 and couplings J by setting the couplings on the D-Wave to J/β_eff. The effective temperature of the D-Wave has been shown to be problem-dependent and different from the physical temperature of the annealer Amin2015. In this work, we follow a modification of the temperature estimation recipe proposed in Benedetti2016. The algorithm proceeds as follows:

1. At each step, take the RBM couplings J. Set the couplings on the D-Wave to J/β_eff, with β_eff estimated at the previous step (on the first step, we take a guess).

2. Take one set of samples. We bin the samples according to their energy, obtaining probability density estimates ρ_E.

3. We want a second set of samples drawn at a temperature different "enough" from the first for distinguishability. Following Benedetti2016, we shift the inverse temperature used for the second draw by an amount set by the standard deviation σ of the energies in the first sample. (Benedetti2016 suggests switching the sign of this shift once the RBM couplings get large enough; we found that even at late stages, this would result in perturbed inverse temperatures close to zero.)

4. Take the second set of samples and use the same bins as in step 2 to obtain probability density estimates ρ′_E.

5. Denoting the Ising energy of each state at couplings J as E, note that

\[ \ln\frac{\rho'_E}{\rho_E} = (\beta_{\mathrm{eff}} - \beta') E + \mathrm{const}. \tag{6} \]

With this in mind, we can extract an estimate of β_eff from the slope of the linear regression between ln(ρ′_E/ρ_E) and the bin energies, as exemplified in Fig. 5. In order to reduce noise caused by bins with a small number of samples, we limit the regression to bins with at least five samples in both draws.
We can see the results of this temperature estimation procedure throughout training in Fig. 6.
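The regression at the heart of the recipe can be sketched as follows; the exponential toy energies assume a flat density of states, so the fitted slope should recover the difference of the two inverse temperatures, in the spirit of Eq. (6):

```python
import numpy as np

def log_density_ratio_slope(E1, E2, n_bins=20, min_count=10):
    """Bin two energy samples on a common grid and regress log(rho2/rho1)
    against the bin-center energies; the slope estimates the difference of
    the two inverse temperatures.  (The paper drops bins with fewer than
    five samples; min_count=10 here for a quieter toy fit.)"""
    lo = min(E1.min(), E2.min())
    hi = max(E1.max(), E2.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    c1, _ = np.histogram(E1, bins=edges)
    c2, _ = np.histogram(E2, bins=edges)
    keep = (c1 >= min_count) & (c2 >= min_count)  # drop noisy, sparse bins
    centers = 0.5 * (edges[:-1] + edges[1:])
    log_ratio = np.log(c2[keep] / c1[keep])
    slope, _intercept = np.polyfit(centers[keep], log_ratio, 1)
    return slope

# Boltzmann-like toy energies with a flat density of states: E ~ Exp(beta).
rng = np.random.default_rng(4)
E1 = rng.exponential(scale=1.0, size=20000)        # beta  = 1.0
E2 = rng.exponential(scale=1.0 / 1.5, size=20000)  # beta' = 1.5
slope = log_density_ratio_slope(E1, E2)  # expect roughly beta - beta' = -0.5
```

In the full recipe, the known shift applied for the second draw lets one solve the fitted slope for β_eff itself.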
We found some pitfalls in this procedure. Namely, as the couplings of the RBM, and therefore the magnitudes of the energies involved, grow, the distribution of states becomes more and more skewed towards the lower energy states. This is a desirable outcome of training an RBM. However, it leaves the higher-energy bins with a small number of samples, causing large variance in the estimates of β_eff. In all our training runs, this leads to a step where the estimated density in some of the higher-energy bins happens to fluctuate to a larger value than usual, causing β_eff to be underestimated at that step. The effect compounds over a few training steps, often leading to negative estimates of β_eff and a crash of the algorithm. Potential solutions include:

- applying some regularization to keep the weights from growing. This successfully kept the temperature estimation routine from crashing, but at the cost of impairing the classifier performance of our RBM. This is to be expected, as a well-trained RBM should strongly separate the energies of different states.

- only estimating β_eff during the initial stages of training. This can be a good solution, since β_eff does not seem to change by a large amount during training, as we can see in Fig. 6.
Even without a temperature estimation routine, weights growing too large is a problem for the algorithm on a QA in general. This is because if weights grow above the maximum coupling that can be implemented on the D-Wave, we must rescale all the weights so that the largest fits within the maximum programmable coupling in order to set the coupling constants on the device. However, discretization of the coupling constants means that if one weight is very large, subtle variations between much smaller weights are lost. Another possible solution would be to turn off weight rescaling, but not let couplings grow beyond what is physically implementable on the D-Wave. Conceptually, this is equivalent to allowing the RBM to learn chains of logical qubits that are strongly coupled. Either of these solutions can impair classifier performance, because sometimes an RBM might simply need very large weights, or a large ratio between some weights, to reproduce the probability distribution of the data. Finally, we tested whether temperature estimation allows us to take fewer Gibbs steps to reach a Boltzmann distribution. The results of this test are shown in Fig. 4. Once again, the results do not always show a decisive advantage in the number of post-processing steps needed when using temperature estimation.
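The two coupling-range strategies discussed above can be sketched directly; `j_max = 1.0` is a placeholder for the device's actual maximum programmable coupling:

```python
import numpy as np

def rescale_couplings(J, j_max=1.0):
    """Rescale all weights so the largest magnitude fits the device's maximum
    programmable coupling.  Discretization of the programmed values then
    erodes subtle differences between the much smaller weights."""
    scale = np.abs(J).max() / j_max
    return J / scale if scale > 1.0 else J

def clip_couplings(J, j_max=1.0):
    """Alternative from the text: no rescaling, just refuse to let couplings
    grow beyond what is physically implementable."""
    return np.clip(J, -j_max, j_max)

J = np.array([0.5, -2.0])
Jr = rescale_couplings(J)  # -> [0.25, -1.0]  (relative sizes preserved)
Jc = clip_couplings(J)     # -> [0.5, -1.0]   (small weights untouched)
```

The contrast between `Jr` and `Jc` shows the trade-off: rescaling preserves weight ratios but shrinks everything, while clipping preserves small weights but caps large ones.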
IV.3 Noise and RBMs
D-Wave has recently released a low-noise version of its 2000Q quantum computer, with claims of enhanced tunneling rates by a factor of 7.4 qubits_pres. It is claimed that this leads to a larger diversity of states returned by the machine, as well as a larger proportion of lower-energy states. In this section, we test whether these lower-noise properties also help us obtain a more Boltzmann-like distribution from the D-Wave output. To do this, we train two RBMs using the temperature estimation techniques described in Sec. IV.2, one using the original 2000Q and the other using the low-noise machine. Every 20 training steps, we compare the distribution obtained using the D-Wave machine with a Boltzmann distribution obtained from analytically calculated energies for the current RBM couplings. To compare the distributions, we use the Kolmogorov-Smirnov statistic, which should be close to zero for samples drawn from the same distribution. We compare the KS values as a function of the RBM weight distribution for each machine. In Fig. 7, we show the mean of the KS statistic binned as a function of the mean and maximum RBM coupling. We see no advantage from the lower-noise 2000Q in how Boltzmann-like the returned distributions are. For both machines, the returned samples are not far from Boltzmann distributions (with KS statistics below 0.1) for low RBM weights, but the distributions diverge from Boltzmann as the weights grow larger.
V Results and Discussion
Here we focus on comparing training accuracy, defined as the total fraction of all testing data correctly classified by the trained RBMs, across the wide variety of algorithms discussed in Section III, along with straightforward logistic regression and gradient boosted decision trees, as a function of training epoch. For results given in this section, we use two Gibbs sweeps of post-processing for quantum annealing samples, except as otherwise stated. We present a comprehensive figure of our primary results in Fig. 9. In Fig. 9(a) we see that QA requires at least two Gibbs sweeps to perform competitively. Moreover, we also see that all RBM-structured models (whose accuracy is upper-bounded, as seen in Fig. 9(b), by discriminative training) are outperformed for this problem by simple logistic regression and particularly by gradient boosted decision trees. (Note: "epoch" for gradient boosted decision trees corresponds to the number of trees in these plots.) Looking at Fig. 9(b), where we focus on RBM-structured models, we show that while QA appears to achieve higher accuracy at early stages of training than other RBM-structured models and training methods, as training progresses it either matches (for smaller RBMs) or underperforms other algorithms such as purely discriminative and hybrid training. Moreover, incorporating hybrid updates using QA, or transitioning from QA to discriminative training near the observed performance crossover point, appears to add little if anything to discriminative training. Unless one is only going to run training for a short period (on the order of 25 epochs), one observes no improvement from the incorporation of QA. We also examine the performance of MCMC Gibbs sampling (i.e., directly taking expectation values from a Markov chain of appropriate length) and simulated annealing, and observe broadly similar performance to QA, with small improvements for MCMC and SA over QA at larger RBM sizes.
We also present Fig. 10 to highlight in more detail the relative performance of the various RBM-structured algorithms, wherein we take the ratio of their accuracy at each step of training to the accuracy of the quantum annealing training. As this figure makes clear, QA achieves better early training results but fails to maintain that advantage with additional training. It also makes clear that MCMC Gibbs sampling and SA with sufficiently many sweeps outperform QA slightly at larger RBM sizes. In that figure we exclude logistic regression and gradient boosted decision trees, as they dominate all models.
Inspired by the observations in Mott2017 , and by the advantages shown very early in training by the QA approach, we also performed a study using a very small training set. As shown in Fig. 11, the RBM approach actually shows a decisive advantage under this condition. When the training set is restricted to 250 events (reduced from 5000), we find strong overfitting when using logistic regression or gradient boosted trees, but good performance by the RBM. We again see stronger performance by the QA early in the training process, but as the number of epochs of training increases, discriminatively trained models take over.
Curiously, this suggests that if there is a need to train a classifier on very small datasets, and if the number of training epochs is limited (perhaps by time constraints in a situation where we are able to operate training on the QA resource with no network latency), then a QA-trained RBM shows promise as a superior algorithm.
VI Conclusions and Future Work
In this work, we explored a classification application of importance in cosmology, and studied the distribution of energy states coming from the D-Wave 2000Q. We tested several post-processing methods aimed at bringing the output distributions closer to Boltzmann distributions, which are theoretically required for training RBMs, but we found little impact from post-processing on RBM performance. As a consequence, and for simplicity of interpretation, we subsequently minimized post-processing. We presented the results of trained RBMs, and other ML algorithms, for galaxy morphology classification. While we ultimately find that RBMs implemented on D-Wave hardware perform well, we do not find compelling evidence of an algorithmic performance advantage with this dataset for this problem over the most likely training scenarios. We do not believe this result is an indictment of the performance of the quantum resources; we found regions of phase space in the training where the quantum computer performed better than its classical counterpart. In particular, for small datasets and for limited numbers of training repetitions, QA-based RBMs performed very well, outperforming both the alternative classical algorithms studied here (logistic regression and gradient boosted trees) and classically trained RBMs. However, outside of these rather special training scenarios, RBMs (regardless of the classical or quantum nature of the training algorithm) did not outperform the gradient boosted tree algorithm. Perhaps more complex and higher dimensional data would be more challenging for algorithms like gradient boosted trees and regression, but here we find that they are able to handle the dataset well. In cases where significantly less or no compression is required, thanks to significantly larger quantum resources, we may yet see a performance advantage for this algorithm.
This line of investigation will be interesting to revisit on future versions of quantum hardware, or perhaps on a digital annealer Inagaki2016 ; Aramon2019 with substantially larger RBMs. For this data, due to the compression mechanisms involved, enlarging the RBM significantly did not lead to performance improvements, but with much larger numbers of qubits available, it would be possible to pursue different compression mechanisms or perhaps avoid compression altogether. Another option to pursue with less compressed data would be increasing the network connectivity and adding additional layers. There is evidence in classical machine learning alexnet that multilayer networks are able to construct hierarchical representations that often offer some advantage in data analysis tasks.
Acknowledgements.
This manuscript has been authored by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy, Office of Science, Office of High Energy Physics. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. This project was funded in part by the DOE HEP QuantISED program. S. Adachi and J. Job acknowledge Internal Research and Development funding from Lockheed Martin. We thank D-Wave Systems for providing access to their DW2000Q systems. We thank Travis Humble and Alex McCaskey for useful discussions about quantum machine learning, and Aristeidis Tsaris for useful comments and algorithms for the D-Wave. We thank Maxwell Henderson, Carleton Coffrin and Vaibhaw Kumar for discussion on obtaining Boltzmann distributions from quantum annealers.

References

(1) K. Albertsson, P. Altoe, D. Anderson, J. Anderson, M. Andrews, J. P. Araque Espinosa, A. Aurisano, L. Basara, A. Bevan, W. Bhimji et al., Machine Learning in High Energy Physics Community White Paper, arXiv e-print (2018). arXiv:1807.02876 [physics.comp-ph]. URL https://arxiv.org/abs/1807.02876
(2) M. Ntampaka, C. Avestruz, S. Boada, J. Caldeira, J. Cisewski-Kehe, R. Di Stefano, C. Dvorkin, A. E. Evrard, A. Farahi, D. Finkbeiner et al., The Role of Machine Learning in the Next Decade of Cosmology, arXiv e-print (2019). arXiv:1902.10159 [astro-ph.IM]. URL https://arxiv.org/abs/1902.10159
(3) Large Synoptic Survey Telescope, https://www.lsst.org/ (2019).

(4)
R. P. Feynman, Simulating physics
with computers, International Journal of Theoretical Physics 21 (6) (1982)
467–488.
doi:10.1007/BF02650179.
URL https://doi.org/10.1007/BF02650179 
(5)
J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and S. Lloyd, Quantum machine learning, Nature 549 (2017) 195–202.
URL https://doi.org/10.1038/nature23474
(6)
A. Mott, J. Job, J.-R. Vlimant, D. Lidar, and M. Spiropulu, Solving a Higgs optimization problem with quantum annealing for machine learning, Nature 550 (2017) 375–379.
URL https://doi.org/10.1038/nature24047

(7)
D-Wave Systems, The D-Wave 2000Q™ system, https://www.dwavesys.com/dwavetwosystem (2019).

(8)
C. J. Conselice, The evolution of galaxy structure over cosmic time, Annual Review of Astronomy and Astrophysics 52 (1) (2014) 291–337.
doi:10.1146/annurev-astro-081913-040037.
URL https://doi.org/10.1146/annurev-astro-081913-040037

(9)
K. W. Willett, C. J. Lintott, S. P. Bamford, K. L. Masters, B. D. Simmons, K. R. V. Casteels, E. M. Edmondson, L. F. Fortson, S. Kaviraj, W. C. Keel et al., Galaxy Zoo 2: detailed morphological classifications for 304 122 galaxies from the Sloan Digital Sky Survey, Monthly Notices of the Royal Astronomical Society 435 (4) (2013) 2835–2860.
doi:10.1093/mnras/stt1458.

(10)
Vika, Marina, Vulcani, Benedetta, Bamford, Steven P., Häußler,
Boris, and Rojas, Alex L.,
MegaMorph: classifying
galaxy morphology using multiwavelength Sérsic profile fits, A&A 577
(2015) A97.
doi:10.1051/00046361/201425174.
URL https://doi.org/10.1051/00046361/201425174  (11) S. Dieleman, K. W. Willett, and J. Dambre, Rotationinvariant convolutional neural networks for galaxy morphology prediction, Monthly Notices of the Royal Astronomical Society 450 (2) (2015) 1441–1459. arXiv:1503.07077, doi:10.1093/mnras/stv632.
(12)
D. Tuccillo, M. Huertas-Company, E. Decencière, and S. Velasco-Forero, Deep learning for studies of galaxy morphology, in: M. Brescia, S. G. Djorgovski, E. D. Feigelson, G. Longo, and S. Cavuoti (Eds.), Astroinformatics, Vol. 325 of IAU Symposium, 2017, pp. 191–196.
arXiv:1701.05917, doi:10.1017/S1743921317000552.

(13)
P. H. Barchi, R. R. de Carvalho, R. R. Rosa, R. Sautter, M. Soares-Santos, B. A. D. Marques, E. Clua, T. S. Gonçalves, C. de Sá-Freitas, and T. C. Moura, Machine and Deep Learning Applied to Galaxy Morphology – A Comparative Study, arXiv e-print (Jan 2019).
arXiv:1901.07047 [astro-ph.IM].
URL https://arxiv.org/abs/1901.07047

(14)
P. Smolensky, Information processing in dynamical systems: Foundations of harmony theory, in: D. E. Rumelhart and J. L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, MIT Press, Cambridge, MA, USA, 1986, pp. 194–281.

(15)
M. Johnson, M. Amin, S. Gildert, T. Lanting, F. Hamze, N. Dickson, R. Harris, A. Berkley, J. Johansson, P. Bunyk et al., Quantum annealing with manufactured spins, Nature 473 (2011) 194–198.
doi:10.1038/nature10012.

(16)
K. L. Pudenz and D. A. Lidar, Quantum adiabatic machine learning, Quantum Information Processing 12 (5) (2013) 2027–2070.
doi:10.1007/s11128-012-0506-4.
URL https://doi.org/10.1007/s11128-012-0506-4
(17)
S. H. Adachi and M. P. Henderson, Application of quantum annealing to training of deep neural networks, arXiv e-print (2015).
arXiv:1510.06356 [quant-ph].
URL https://arxiv.org/abs/1510.06356
(18)
M. Benedetti, J. Realpe-Gómez, R. Biswas, and A. Perdomo-Ortiz, Estimation of effective temperatures in quantum annealers for sampling applications: A case study with possible applications in deep learning, Phys. Rev. A 94 (2016) 022308.
doi:10.1103/PhysRevA.94.022308.
URL https://link.aps.org/doi/10.1103/PhysRevA.94.022308
(19)
J. T. Rolfe, Discrete Variational Autoencoders, arXiv e-print (2016).
arXiv:1609.02200 [stat.ML].
URL https://arxiv.org/abs/1609.02200
(20)
A. Vahdat, E. Andriyash, and W. G. Macready, DVAE#: Discrete Variational Autoencoders with Relaxed Boltzmann Priors, arXiv e-print (2018).
arXiv:1805.07445 [stat.ML].
URL https://arxiv.org/abs/1805.07445

(21)
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.

(22)
H. Larochelle and Y. Bengio,
Classification using
discriminative restricted Boltzmann machines, in: Proceedings of the 25th
International Conference on Machine Learning, ICML ’08, ACM, New York, NY,
USA, 2008, pp. 536–543.
doi:10.1145/1390156.1390224.
URL http://doi.acm.org/10.1145/1390156.1390224 
(23)
G. E. Hinton, Training products of experts by minimizing contrastive divergence, Neural Computation 14 (8) (2002) 1771–1800.
doi:10.1162/089976602760128018.
URL https://doi.org/10.1162/089976602760128018

(24)
V. Choi, Minor-embedding in adiabatic quantum computation: I. The parameter setting problem, Quantum Information Processing 7 (2008) 193–209.

(25)
M. H. Amin,
Searching for
quantum speedup in quasistatic quantum annealers, Phys. Rev. A 92 (2015)
052323.
doi:10.1103/PhysRevA.92.052323.
URL https://link.aps.org/doi/10.1103/PhysRevA.92.052323 
(26)
C. McGeoch, Comparison of 2000Q to Lower-Noise 2000Q, Qubits North America 2019 (2019).
URL https://www.dwavesys.com/sites/default/files/14DWMcGeogh.pdf
(27)
T. Inagaki, Y. Haribara, K. Igarashi, T. Sonobe, S. Tamate, T. Honjo, A. Marandi, P. L. McMahon, T. Umeki, K. Enbutsu et al., A coherent Ising machine for 2000-node optimization problems, Science 354 (6312) (2016) 603–606.
doi:10.1126/science.aah4243.
URL https://science.sciencemag.org/content/354/6312/603
(28)
M. Aramon, G. Rosenberg, E. Valiante, T. Miyazawa, H. Tamura, and H. G. Katzgraber, Physics-inspired optimization for quadratic unconstrained problems using a digital annealer, Frontiers in Physics 7 (2019) 48.
doi:10.3389/fphy.2019.00048.
URL https://www.frontiersin.org/article/10.3389/fphy.2019.00048

(29)
A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, in: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, Curran Associates Inc., USA, 2012, pp. 1097–1105.
URL http://dl.acm.org/citation.cfm?id=2999134.2999257