A regression algorithm for accelerated lattice QCD that exploits sparse inference on the D-Wave quantum annealer
We propose a regression algorithm that utilizes a learned dictionary optimized for sparse inference on the D-Wave quantum annealer. In this regression algorithm, we concatenate the independent and dependent variables into a combined vector, and encode the high-order correlations between them into a dictionary optimized for sparse reconstruction. On a test dataset, the dependent variable is initialized to its average value and then a sparse reconstruction of the combined vector is obtained in which the dependent variable is typically shifted closer to its true value, as in a standard inpainting or denoising task. Here, a quantum annealer, which can presumably exploit a fully entangled initial state to better explore the complex energy landscape, is used to solve the highly non-convex sparse coding optimization problem. The regression algorithm is demonstrated on lattice quantum chromodynamics simulation data using a D-Wave 2000Q quantum annealer, and good prediction performance is achieved. The regression test is performed using six different values for the number of fully connected logical qubits, between 20 and 64, the latter being the maximum that can be embedded on the D-Wave 2000Q. The scaling results indicate that a larger number of qubits gives better prediction accuracy, the best performance being comparable to the best classical regression algorithms reported so far.
Sparse coding refers to a class of unsupervised learning algorithms for finding an optimized set of basis vectors, or dictionary, for accurately reconstructing inputs drawn from a given dataset using the fewest non-zero coefficients. Sparse coding explains the self-organizing response properties of simple cells in the mammalian primary visual cortex olshausen96; olshausen_sparse_1997, and has been successfully applied in various fields including image classification yang2009; Coates:2011:IEV:3104482.3104598, image compression yijing, and compressed sensing Tao_Information; Donoho_Compressed. Optimizing a dictionary for a given dataset and inferring optimal sparse representations of input data involves finding solutions of the following minimization problem:
$$\min_{\phi,\,a}\;\sum_i \Big[\,\tfrac{1}{2}\,\lVert \mathbf{x}_i - \phi\, a_i \rVert_2^2 \;+\; \lambda\,\lVert a_i \rVert_0 \,\Big]\,, \qquad (1)$$
where $i$ is the index of the input data $\mathbf{x}_i$, $\phi$ is the dictionary, $a_i$ are the sparse coefficients, and $\lambda$ is the sparsity penalty parameter. Because of the $\ell_0$-norm, the minimization problem falls into an NP-hard complexity class with multiple local minima Natarajan:1995:SAS:207985.207987.
Recently, we developed a mapping of the $\ell_0$-optimization in Eq. (1) to the quadratic unconstrained binary optimization (QUBO) problem that can be solved on a quantum annealer and demonstrated its feasibility on the D-Wave systems Nguyen:2016; 8123653; Nguyen2018ImageCU. The quantum processing unit of the D-Wave systems realizes a quantum Ising spin system in a transverse field and finds the lowest or near-lowest energy states of the classical Ising model,
$$E(\mathbf{s}) \;=\; \sum_i h_i s_i \;+\; \sum_{i<j} J_{ij}\, s_i s_j\,, \qquad (2)$$
using quantum annealing PhysRevE.58.5355; Finnila_1994; DWAVE. Here $s_i \in \{-1,+1\}$ is the binary spin variable, $h_i$ and $J_{ij}$ are the qubit biases and coupling strengths that can be controlled by the user, and optimization of the Ising model is isomorphic to a QUBO problem with $a_i = (s_i + 1)/2$. By mapping the sparse coding problem onto a QUBO structure, the sparse coefficients are restricted to the binary variables $a_i \in \{0, 1\}$. Despite this restriction, the approach provided good sparse representations of the MNIST lecun-mnisthandwrittendigit-2010; Nguyen:2016; Nguyen2018ImageCU and CIFAR-10 cifar10; 8123653 images.
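As a concrete illustration of the mapping, the binary sparse coding energy for a single input can be expanded into QUBO coefficients. The following minimal sketch collects the linear biases and pairwise couplings; the function names and conventions are our own, not the authors' implementation:

```python
import numpy as np

def sparse_coding_qubo(phi, x, lam):
    """Map the binary sparse coding energy
        E(a) = 0.5 * ||x - phi @ a||^2 + lam * sum(a),  a_i in {0, 1},
    to an upper-triangular QUBO matrix Q with E(a) = a^T Q a + 0.5*||x||^2.
    phi: (D, n) dictionary; x: length-D input; lam: sparsity penalty."""
    G = phi.T @ phi                     # Gram matrix of dictionary atoms
    Q = np.triu(G, k=1)                 # couplings: phi_i . phi_j for i < j
    np.fill_diagonal(Q, lam - phi.T @ x + 0.5 * np.diag(G))  # linear biases
    return Q

def qubo_energy(Q, a):
    # a^T Q a; for binary a, a_i**2 == a_i covers the diagonal terms
    return a @ Q @ a
```

On the D-Wave, the diagonal of `Q` would supply the qubit biases and the off-diagonal entries the couplings (after the Ising transformation $a_i = (s_i+1)/2$).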
II Regression algorithm using sparse coding on D-Wave 2000Q
II.1 Regression model
Consider $N_{\rm train}$ sets of training data $\{(\mathbf{x}_i, y_i)\}$ and $N_{\rm test}$ sets of test data $\{\mathbf{x}_i\}$, where $\mathbf{x}$ is an input vector known as the independent variable, and $y$ is an output variable known as the dependent variable. A regression model can be built by learning the correlations between the input and output variables on the training dataset, so that it can make a prediction $y^{\rm pred} = f(\mathbf{x})$ for an unseen input $\mathbf{x}$.
Such a regression model can be built using the sparse coding learning implemented on a quantum annealer as described in the following steps.
(1) Normalize $\mathbf{x}$ and $y$ so that their standard deviations become comparable. One possible choice is rescaling the data to have zero mean and unit variance using the sample mean and sample variance of the training dataset. This procedure is an essential step for the regression algorithm, as it makes the reconstruction errors of the individual components comparable.
(2) Using the $\mathbf{x}$ of the test dataset, or those of the combined training and test datasets, perform sparse coding training and obtain a dictionary for $\mathbf{x}$ (pre-training).
(3) Concatenate the input and output variables of the training dataset into combined vectors $\mathbf{z}_i = (\mathbf{x}_i, y_i)$. Extend the dictionary matrix obtained in the pre-training by one row for the $y$-component, filling the new elements with zeros.
(4) Taking $\mathbf{z}_i$ as the input signal and the extended matrix as an initial guess of the dictionary, perform sparse coding training on the training dataset and obtain the dictionary $\phi$. Through this procedure, $\phi$ encodes the correlations between $\mathbf{x}$ and $y$.
(5) For the test dataset, for which only $\mathbf{x}$ is given, build the vector $\mathbf{z} = (\mathbf{x}, y_0)$, where $y_0$ is an initial guess of $y$. One possible choice of $y_0$ is the average value of $y$ in the training dataset.
(6) Using the dictionary $\phi$ obtained in (4), find the sparse representation $a$ of $\mathbf{z}$ and calculate the reconstruction $\mathbf{z}' = \phi a$. This replaces the outlier components of $\mathbf{z}$, including $y_0$, by values that can be described by $\phi$.
(7) After inverse normalization, the last ($y$-) component of $\mathbf{z}'$ is the prediction of $y$.
In this regression model, the dimension of $\mathbf{x}$ should be sufficiently large so that the initial guess of the dependent variable does not bias the reconstruction. The procedure can be extended to predict multiple dependent variables by increasing the dimension of $\mathbf{z}$, at the cost of prediction accuracy.
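The prediction steps above can be sketched in Python, using an exhaustive brute-force binary solver as a classical stand-in for the quantum annealer; the function names, the small problem size, and the toy dictionary are illustrative assumptions, not the authors' implementation:

```python
import itertools
import numpy as np

def solve_binary(phi, z, lam):
    """Brute-force stand-in for the annealer: the lowest-energy binary code."""
    n = phi.shape[1]
    best, best_e = None, np.inf
    for bits in itertools.product((0.0, 1.0), repeat=n):
        a = np.array(bits)
        e = 0.5 * np.sum((z - phi @ a) ** 2) + lam * a.sum()
        if e < best_e:
            best, best_e = a, e
    return best

def predict(phi, x, y_mean, y_std, x_mean, x_std, lam=0.1):
    """Steps (5)-(7): append the mean guess (0 after standardization),
    reconstruct with the learned dictionary, and read off the last
    component as the prediction."""
    z = np.append((x - x_mean) / x_std, 0.0)  # initial guess y_0 = mean
    a = solve_binary(phi, z, lam)
    z_rec = phi @ a                            # sparse reconstruction
    return z_rec[-1] * y_std + y_mean          # inverse-normalize
```

The brute-force loop is exponential in the number of coefficients; on the D-Wave this step is replaced by quantum annealing on the equivalent QUBO problem.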
II.2 Sparse coding on D-Wave quantum annealer
Here the qubit-qubit coupling shares similarity with the lateral neuron-neuron inhibition in the locally competitive algorithm rozell.06c, and the constant bias from the sparsity penalty makes the solution sparse by acting as a constant field forcing the qubits to stay in the $a_i = 0$ state. By performing quantum annealing for a given dictionary $\phi$ and input data vector with the transformations given in Eq. (4), one can obtain the optimal sparse representation $a$. On the D-Wave systems, however, the computing graph, known as the Chimera architecture, has limited qubit connectivity by design, while the sparse coding problem mapped to the QUBO form requires a fully connected qubit graph. It therefore requires an additional step, called embedding, which translates the fully connected logical qubits to the partially connected physical qubits of the Chimera structure, as described in Fig. 1.
The D-Wave 2000Q hardware consists of 2048 quantum bits (qubits) arranged into unit cells of 8 physical qubits forming the Chimera structure. After accounting for the small fraction of defects that occur during the calibration process, as the temperature is cooled down below the superconducting threshold, the machine has approximately 2000 usable qubits and their local spin-spin interactions. Using these qubits and connections, embedding an arbitrary QUBO problem onto the D-Wave 2000Q typically allows no more than 64 fully connected logical qubits.
III Application to lattice QCD
QCD is the theory of quarks and gluons, the fundamental particles composing hadrons such as pions and protons, and their interactions. It is a part of the Standard Model of particle physics, and the theory has been confirmed by a large class of experiments over the decades Patrignani:2016xqp; Greensite:2011zz. Lattice QCD is a discrete formulation of QCD on a Euclidean space-time lattice, which allows us to solve low-energy QCD problems using computer simulations by carrying out the Feynman path integral using Monte Carlo methods Wilson:1974sk; Creutz:1980zw.
In lattice QCD simulations, a large number of observables are calculated over an ensemble of Gibbs samples of the gluon fields, called the lattices, and the computational cost of calculating those observables is high in modern simulations. However, the observables' fluctuations over the statistical samples of the lattices are correlated, as they share the same background lattice. By exploiting this correlation, the boosted decision tree (BDT) regression algorithm of Ref. Yoon:2018krb was able to replace the computationally expensive direct calculation of some observables by computationally cheap machine learning predictions from other observables.
In this section, we apply the regression algorithm proposed in Section II to the lattice QCD simulation data used for the calculation of the charge-parity (CP) symmetry violating phase of the neutron Yoon:2017tag; Pospelov:2005pr. Here we consider three types of observables: (1) two-point correlation functions of neutrons calculated without CP-violating (CPV) interactions, (2) $\gamma_5$-projected two-point correlation functions of neutrons calculated without CPV interactions, and (3) $\gamma_5$-projected two-point correlation functions of neutrons calculated with CPV interactions; the CPV phase is extracted from the imaginary part of the third observable. These observables are calculated at multiple values of the nucleon source and sink separation $t$ in the Euclidean time direction.
The goal of our regression problem is to predict the imaginary part of the $\gamma_5$-projected CPV correlation function at a fixed source-sink separation from the real and imaginary parts of the two correlation functions calculated without CPV interactions at five timeslices, in units of the lattice spacing $a$. This forms a problem with a single output variable ($y$) and 20 input variables ($\mathbf{x}$): two observables, real and imaginary parts, at 5 timeslices. In this application, we use 15616 data points of these observables, measured in Refs. Bhattacharya:2016oqm; Bhattacharya:2016rrc, divided into 6976 training data and 8640 test data. Using this dataset, we follow the regression procedure proposed in Section II to predict $y$ on the test dataset.
The procedure can be summarized as follows. First, we standardize all data using the mean and variance of the training dataset. Then, we perform the pre-training and obtain the dictionary for the 20 elements of $\mathbf{x}$ using only the test dataset. After appending $y$ to $\mathbf{x}$ as the 21st element in the training dataset, we perform the sparse coding dictionary learning and update the dictionary to encode the correlations between $\mathbf{x}$ and $y$. For prediction, the input vectors of the test dataset are augmented to dimension-21 vectors by appending the average value of $y$, which is 0 after standardization. Finally, the sparse coefficients of the augmented input vectors are calculated with the fixed dictionary obtained above, and the prediction of $y$ is given by the 21st element of the reconstructed vector.
Note that a sparse coding problem solves for the sparsest representation $a$ and the dictionary $\phi$ simultaneously by minimizing Eq. (1). First, the optimization for $a$ is performed on the D-Wave 2000Q at a given $\phi$, whose initial guess is provided, in general, by random numbers or via an imprinting technique. Then, the optimization for $\phi$ is performed on classical CPUs. The latter step is an offline learning for the fixed values of $a$ obtained on the D-Wave 2000Q. In the offline learning procedure, $\phi$ is learned using a batch stochastic gradient descent (SGD) algorithm:
$$\phi \;\leftarrow\; \phi \;-\; \eta\, \frac{\partial E(\phi, a)}{\partial \phi}\,,$$
where $E(\phi, a) = \frac{1}{2}\lVert \mathbf{x} - \phi a \rVert_2^2 + \lambda \lVert a \rVert_0$ is the sparse coding energy function for a given input data, as in Eq. (1), and $\eta$ is the learning rate. Batch learning is used with a fixed batch size. We repeat the iterative updates of the quantum D-Wave inference for $a$ and the SGD learning for $\phi$ until convergence is attained. On average, we find convergence after a few iterations. In this study, we use the SAPI2 python client libraries sapi for implementing the D-Wave operations.
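A minimal sketch of one offline dictionary update, assuming the standard gradient of the quadratic reconstruction term, $\partial E / \partial \phi = -(\mathbf{x} - \phi a)\,a^T$; the function name and batch handling are our own illustrative choices:

```python
import numpy as np

def sgd_dictionary_step(phi, batch_x, batch_a, eta=0.01):
    """One batch SGD step on phi for fixed binary codes batch_a.
    The l0 term does not depend on phi, so only the reconstruction
    term contributes: dE/dphi = -(x - phi a) a^T, averaged over the batch."""
    grad = np.zeros_like(phi)
    for x, a in zip(batch_x, batch_a):
        r = x - phi @ a              # reconstruction residual
        grad -= np.outer(r, a)       # accumulate gradient
    return phi - eta * grad / len(batch_x)
```

In practice, dictionary atoms are often renormalized after each step to keep their scale fixed; that detail is omitted here for clarity.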
The sparsity of the sparse representation, controlled by the sparsity penalty parameter $\lambda$, is calculated as the ratio of nonzero elements in $a$. In this study, $\lambda$ is tuned to values that make the average sparsity about 20%, because we find that a sparsity of 20% provides optimal prediction performance, after examining a few different values of $\lambda$. Although the prediction performance could be further optimized by an extensive parameter search, such as that performed in Ref. Kenyon_phasetransition, the procedure is computationally expensive and is not pursued in this proof-of-principle study.
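The tuning of $\lambda$ toward a target average sparsity can be illustrated with a simple grid search, again with a brute-force binary solver standing in for the annealer; all names, the grid, and the problem size are our own assumptions:

```python
import itertools
import numpy as np

def avg_sparsity(phi, data, lam):
    """Mean fraction of nonzero coefficients over `data` at penalty `lam`,
    using exhaustive minimization in place of the quantum annealer."""
    n = phi.shape[1]
    codes = [np.array(b, dtype=float) for b in itertools.product((0, 1), repeat=n)]
    total = 0.0
    for x in data:
        energies = [0.5 * np.sum((x - phi @ a) ** 2) + lam * a.sum() for a in codes]
        total += codes[int(np.argmin(energies))].sum() / n
    return total / len(data)

def tune_lambda(phi, data, target=0.20, grid=np.logspace(-3, 1, 25)):
    """Pick the penalty whose average sparsity is closest to `target`."""
    return min(grid, key=lambda lam: abs(avg_sparsity(phi, data, lam) - target))
```

Larger penalties drive more coefficients to zero, so the average sparsity decreases monotonically with $\lambda$, which makes this one-dimensional search well behaved.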
Note that the definition of overcompleteness is not straightforward for the D-Wave inferred sparse coding, because the input signal may contain arbitrary real numbers, while the sparse coefficients can take only the binary values 0 or 1. Ignoring this subtlety for simplicity, the overcompleteness for an input signal of dimension 20 (or 21 for the extended vectors) can be calculated as the ratio of the number of logical qubits to the signal dimension.
Examples of the reconstruction and prediction for randomly chosen test data points are visualized in Fig. 2. In the plot, the first 20 elements are the input variables, and element 21 is the output of the prediction algorithm. As one can see, the reconstructions of the 21st element, which was 0 in the initial guess, are successfully shifted close to their ground truth values, as expected.
In order to investigate the prediction accuracy for different numbers of logical qubits $N_q$, we explore the prediction algorithm with six different values of $N_q$ between 20 and 64. Note that a larger $N_q$ implies a more difficult optimization problem, and 64 is the maximum number of logical qubits that can be embedded onto the D-Wave 2000Q. In Fig. 3, we show the distribution of the normalized original data of the dependent variable $y$ and its prediction error, defined by the difference between the ground truth and its prediction: $\Delta y = y - y^{\rm pred}$. It is clearly demonstrated that (1) the prediction error is much smaller than the fluctuation of the original data, (2) the prediction error is sharply distributed near 0, which indicates no obvious bias in the prediction, and (3) the prediction error tends to be smaller as $N_q$ becomes larger.
To quantitatively evaluate the prediction quality, i.e., the recovery of the 21st element of the extended input vector, we calculate the ratio of the standard deviation of the prediction error to that of the original data: $\sigma_{\Delta y} / \sigma_y$. The ratio converges to 0 when the prediction is precise, and 1 indicates no predictive power for statistical data. Note that this definition of the prediction quality does not account for the bias of the prediction, because the bias in the prediction of statistical data can be removed by following the procedure introduced in Ref. Yoon:2018krb, based on the variance reduction technique for lattice QCD calculations Bali:2009hu; Blum:2012uh.
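The quality measure is straightforward to compute; a small sketch (the function name is our own):

```python
import numpy as np

def prediction_quality(y_true, y_pred):
    """Ratio of the std of the prediction error to the std of the data:
    0 means a perfect prediction; 1 means no better than predicting
    the mean of the data."""
    return np.std(y_true - y_pred) / np.std(y_true)
```

For example, predicting every point by the sample mean gives a ratio of exactly 1, since the error then carries the full fluctuation of the data.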
Fig. 4 shows the prediction error as a function of the number of qubits. It clearly demonstrates a systematic decrease of the prediction error as $N_q$ is increased. Although no theory explaining the scaling is known, we find that the scaling roughly follows an exponential decay ansatz in $N_q$. By fitting the ansatz to the data points, an asymptotic value of the prediction quality is obtained, which depends on whether the smallest-$N_q$ data point is included in the fit. For comparison, the regression algorithms provided by the scikit-learn library scikit-learn on a classical computer are investigated for the same dataset, and among them the BDT regression algorithm showed the best prediction performance, comparable to the asymptotic value of our fit.
Pre-training is demonstrated to lower the prediction error of this regression algorithm significantly. When the prediction is performed without the pre-training procedure, we find a noticeably larger prediction error than with the pre-training. Furthermore, without the pre-training, the required number of iterative updates of the D-Wave inference for $a$ and the SGD learning for $\phi$ increases to about 10 iterations.
In this paper, we proposed a regression algorithm using sparse coding dictionary learning that can be implemented on a quantum annealer, based on the formulation of a regression as an inpainting problem. A pre-training technique is introduced to improved the prediction quality. The procedure is described in Section II.1. The regression algorithm was numerically demonstrated using a set of lattice QCD simulation observables and was able to predict the correlation function calculated in presence of the CPV interactions from those calculated without the CPV interaction. The regression experiment is carried out using the D-Wave 2000Q quantum annealer with minor embedding technique in order to obtain fully-connected logical qubits. The study is performed for six different values of the number of qubits between 20 and 64, and it showed systematic decrease of the prediction error as the number of qubits is increased (see Fig. 4). With larger number of qubits and elaborately tuned sparsity parameter, we expect further improved performance in future.