Supervised learning with quantum enhanced feature spaces
Machine learning and quantum computing are two technologies each with the potential for altering how computation is performed to address previously untenable problems. Kernel methods for machine learning are ubiquitous for pattern recognition, with support vector machines (SVMs) being the most well-known method for classification problems. However, there are limitations to the successful solution to such problems when the feature space becomes large, and the kernel functions become computationally expensive to estimate. A core element to computational speed-ups afforded by quantum algorithms is the exploitation of an exponentially large quantum state space through controllable entanglement and interference.
Here, we propose and use two novel methods which represent the feature space of a classification problem by a quantum state, taking advantage of the large dimensionality of quantum Hilbert space to obtain an enhanced solution. One method, the quantum variational classifier builds on mitarai2018quantum (); farhi2018classification () and operates through using a variational quantum circuit to classify a training set in direct analogy to conventional SVMs. In the second, a quantum kernel estimator, we estimate the kernel function and optimize the classifier directly. The two methods present a new class of tools for exploring the applications of noisy intermediate scale quantum computers to machine learning.
The intersection between machine learning and quantum computing has been dubbed quantum machine learning, and has attracted considerable attention in recent years Arunachalam:2017:GCS (); ciliberto2018quantum (); dunjko2018machine (). This has led to a significant increase in the number of proposed quantum algorithms biamonte2017quantum (); romero2017quantum (); wan2016quantum (); mitarai2018quantum (); farhi2018classification (). Here, we present a quantum algorithm that has the potential to run on near-term quantum devices. A natural class of algorithms for such noisy devices are short-depth circuits, which are amenable to error-mitigation techniques that reduce the effect of decoherence temme2017error (); li2017efficient (). There are convincing arguments that indicate that even very simple circuits are hard to simulate by classical computational means terhal2002adaptive (); Bremner2017achievingsupremacy (). The algorithm we propose takes on the original problem of supervised learning, the construction of a classifier. For this problem, we are given data from a training set $T$ and a test set $S$ of a subset $\Omega \subset \mathbb{R}^d$. Both are assumed to be labeled by a map $m: T \cup S \rightarrow \{+1, -1\}$ unknown to the algorithm. The training algorithm only receives the labels of the training data $T$. The goal is to infer an approximate map on the test set, $\tilde{m}: S \rightarrow \{+1, -1\}$, such that it agrees with high probability with the true map, $\tilde{m}(\vec{s}) = m(\vec{s})$, on the test data $\vec{s} \in S$. For such a learning task to be meaningful it is assumed that there is a correlation between the labels given for training and the true map. A classical approach to constructing an approximate labeling function uses so-called support vector machines (SVMs) vapnik2013nature (). The data gets mapped non-linearly to a high dimensional space, the feature space, where a hyperplane is constructed to separate the labeled samples.
A quantum version of this approach has already been proposed in rebentrost2014quantum (), where an exponential improvement can be achieved if data is provided in a coherent superposition. However, when data is provided in the conventional way, i.e. from a classical computer, then the methods of rebentrost2014quantum () cannot be applied. Here, we propose a method that handles data provided purely classically and uses the quantum state space as the feature space to still obtain a quantum advantage. This is done by mapping the data non-linearly to a quantum state $\Phi: \vec{x} \in \Omega \rightarrow |\Phi(\vec{x})\rangle\langle\Phi(\vec{x})|$, c.f. Fig 1(a). We implement two SVM type classifiers in a quantum computing experiment. In the first approach we use a variational circuit as given in kandala2017hardware (); farhi2017quantum (); farhi2018classification (); mitarai2018quantum () that generates a separating hyperplane in the quantum feature space. In the second approach we use the quantum computer to estimate the kernel function of the quantum feature space directly and implement a conventional SVM. A necessary condition to obtain a quantum advantage, in either of the two approaches, is that the kernel cannot be estimated classically. This is true, even when complex variational quantum circuits are used as classifiers. In the experiment, we want to disentangle the question of whether the classifier can be implemented in hardware, from the problem of choosing a suitable feature map for a practical data set. The data that is classified here is chosen so that it can be classified with 100% success to verify the method. This success ratio is subsequently also attained in our actual experiment.
Our experimental device consists of five coupled superconducting transmons, only two of which are used in this work, as shown in Figure 2 (a). Two co-planar waveguide (CPW) resonators, acting as quantum buses, provide the device connectivity. Each qubit has one additional CPW resonator for control and readout. The dispersive readout signals are amplified by Josephson Parametric Converters Bergeal2010 () (JPC) (c.f. supplementary material). Entanglement in our system is achieved via CNOT gates, which use cross-resonance Rigetti2010 () as well as single qubit gates as primitives. All single- and two-qubit gates are characterized using randomized benchmarking (c.f. supplementary material). Both the quantum processor and the JPC amplifiers are thermally anchored to the mixing chamber plate of a dilution refrigerator.
Quantum feature map: Before discussing the two methods of classification, we discuss the feature map. Training and classification with conventional support vector machines is efficient when inner products between feature vectors can be evaluated efficiently vapnik2013nature (); burges1998tutorial (); boser1992training (). We will see that classifiers based on quantum circuits, such as the one presented in Fig 2(c), cannot provide a quantum advantage over a conventional support vector machine if the feature vector kernel is too simple. For example, a classifier that uses a feature map that only generates product states can immediately be implemented classically. To obtain an advantage over classical approaches we need to implement a map based on circuits that are hard to simulate classically. Since quantum computers are not expected to be classically simulable, there exists a long list of (universal) circuit families one can choose from. Here, we propose to use a circuit that works well in our experiments and is not too deep. We define a feature map on $n$-qubits generated by the unitary $\mathcal{U}_{\Phi(\vec{x})} = U_{\Phi(\vec{x})} H^{\otimes n} U_{\Phi(\vec{x})} H^{\otimes n}$, where $H$ denotes the conventional Hadamard gate and
$$U_{\Phi(\vec{x})} = \exp\left( i \sum_{S \subseteq [n]} \phi_S(\vec{x}) \prod_{i \in S} Z_i \right)$$
is a diagonal gate in the Pauli-$Z$ basis, c.f. Fig 1 (b). This circuit will act on $|0\rangle^{n}$ as initial state. The coefficients $\phi_S(\vec{x}) \in \mathbb{R}$ are weights that are used to encode the data $\vec{x} \in \Omega$. In the experiments we restrict ourselves to only low-weight terms , where we furthermore imposed . In general any diagonal unitary can be considered if it can be implemented efficiently. The exact evaluation of the inner product between two states generated from a similar circuit with only a single diagonal layer $U_{\Phi(\vec{x})}$ is #P-hard goldberg2017complexity (). Nonetheless, in the experimentally relevant context of additive error approximation, simulation of a single layer preparation circuit can be achieved efficiently classically by uniform sampling demarie2018classical (). We conjecture that evaluating inner products generated from circuits with two basis changes and diagonal gates is hard up to additive error, c.f. supplementary material for a discussion.
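As an illustration of the feature map, the following is a minimal state-vector sketch for two qubits in plain Python. It applies a Hadamard layer followed by the diagonal gate, twice, to the all-zero state and evaluates the overlap between two feature states. The coefficient choice ($\phi_i = x_i$ and $\phi_{1,2} = (\pi - x_1)(\pi - x_2)$) and all function names are assumptions made for this example; it is not the code used in the experiment.

```python
import cmath
import math


def diagonal_phases(x):
    """Diagonal of the two-qubit gate U_Phi(x) in the computational basis."""
    x1, x2 = x
    phi1, phi2 = x1, x2                      # assumed single-qubit coefficients
    phi12 = (math.pi - x1) * (math.pi - x2)  # assumed two-qubit coefficient
    phases = []
    for b1 in (0, 1):
        for b2 in (0, 1):
            z1, z2 = 1 - 2 * b1, 1 - 2 * b2  # Z eigenvalues +1 / -1
            angle = phi1 * z1 + phi2 * z2 + phi12 * z1 * z2
            phases.append(cmath.exp(1j * angle))
    return phases


def hadamard_all(state):
    """Apply a Hadamard gate to both qubits of a 4-component state vector."""
    out = []
    for k in range(4):
        amp = 0j
        for j in range(4):
            sign = -1 if bin(k & j).count("1") % 2 else 1
            amp += sign * state[j]
        out.append(amp / 2)
    return out


def feature_state(x):
    """Return U_Phi H H U_Phi H H |00> (Hadamards act on both qubits)."""
    state = [1 + 0j, 0j, 0j, 0j]
    phases = diagonal_phases(x)
    for _ in range(2):
        state = hadamard_all(state)
        state = [p * a for p, a in zip(phases, state)]
    return state


def kernel(x, z):
    """Squared overlap |<Phi(x)|Phi(z)>|^2 between two feature states."""
    a, b = feature_state(x), feature_state(z)
    overlap = sum(u.conjugate() * v for u, v in zip(a, b))
    return abs(overlap) ** 2


print(kernel((0.3, 1.2), (2.0, 0.7)))
```

For two qubits this overlap is trivial to compute classically; the sketch only makes the circuit structure concrete and says nothing about the hardness conjecture for larger systems.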
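The data: To test our two methods, we generate artificial data that can be fully separated by our feature map. We use the feature map for $n = d = 2$ qubits in Fig. 1(b) with $\phi_{\{i\}}(\vec{x}) = x_i$ and $\phi_{\{1,2\}}(\vec{x}) = (\pi - x_1)(\pi - x_2)$. We generate the labels for data vectors $\vec{x} \in T \cup S \subset (0, 2\pi]^2$ by first choosing the parity function $\mathbf{f} = Z_1 Z_2$ and a random unitary $V \in SU(4)$. We assign $m(\vec{x}) = +1$ when $\langle\Phi(\vec{x})| V^\dagger \mathbf{f} V |\Phi(\vec{x})\rangle \geq \Delta$ and $m(\vec{x}) = -1$ when $\langle\Phi(\vec{x})| V^\dagger \mathbf{f} V |\Phi(\vec{x})\rangle \leq -\Delta$, c.f. Fig 3(b). The data has been separated by a gap of $\Delta = 0.3$. Both the training sets and the classification sets consist of 20 data points per label. We show one such classification set as circle symbols in Fig. 3(b).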
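Quantum variational classification: The first classification protocol follows four steps. First, the data $\vec{x} \in \Omega$ is mapped to a quantum state by applying the feature map circuit $\mathcal{U}_{\Phi(\vec{x})}$ in Fig. 1(b) to a reference state $|0\rangle^{n}$. Second, a short depth quantum circuit $W(\vec{\theta})$, described in Fig 2(c), is applied to the feature state. The circuit depends on free parameters $\vec{\theta}$ that will be optimized during training. Third, for a two label classification problem $y \in \{+1, -1\}$, a binary measurement $\{M_y\}$ is applied to the state $W(\vec{\theta})\,\mathcal{U}_{\Phi(\vec{x})} |0\rangle^{n}$. This measurement is implemented by measurements in the $Z$-basis and feeding the output bit-string $z \in \{0,1\}^n$ to a boolean function $f: \{0,1\}^n \rightarrow \{+1, -1\}$ that is chosen by the programmer. The measurement operator is given by $M_y = \frac{1}{2}\left(\mathbb{1} + y\,\mathbf{f}\right)$, where we have defined $\mathbf{f} = \sum_{z \in \{0,1\}^n} f(z)\, |z\rangle\langle z|$. The probability of obtaining outcome $y$ is $p_y(\vec{x}) = \langle\Phi(\vec{x})| W^\dagger(\vec{\theta}) M_y W(\vec{\theta}) |\Phi(\vec{x})\rangle$. Fourth, for the decision rule we perform $R$ repeated measurement shots to obtain the empirical distribution $\hat{p}_y(\vec{x})$. We assign the label $\tilde{m}(\vec{x}) = y$ whenever $\hat{p}_y(\vec{x}) > \hat{p}_{-y}(\vec{x}) - y b$, where we have introduced an additional bias parameter $b \in [-1, 1]$ that can be optimized during training.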
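The third and fourth steps can be sketched classically once the measured bit-strings are available. The example below is a minimal sketch, assuming the boolean function is the parity of the bit-string (with even parity mapped to $+1$) and that the measurement record is given as a dictionary of counts; the function names and the counts are hypothetical and serve only to illustrate the decision rule "assign $y$ when $\hat{p}_y > \hat{p}_{-y} - yb$".

```python
def parity_label(bitstring):
    """Boolean function f(z): +1 for even parity, -1 for odd parity (assumed sign convention)."""
    return 1 if bitstring.count("1") % 2 == 0 else -1


def empirical_distribution(counts):
    """Turn a dict {bitstring: number of shots} into the empirical probabilities of y = +1 and y = -1."""
    shots = sum(counts.values())
    p = {+1: 0.0, -1: 0.0}
    for bitstring, n in counts.items():
        p[parity_label(bitstring)] += n / shots
    return p


def assign_label(counts, bias=0.0):
    """Decision rule: label +1 when p_{+1} > p_{-1} - bias, otherwise label -1."""
    p = empirical_distribution(counts)
    return +1 if p[+1] > p[-1] - bias else -1


# Hypothetical measurement record for one data point (1000 shots).
counts = {"00": 700, "11": 100, "01": 150, "10": 50}
print(empirical_distribution(counts), assign_label(counts))
```

A positive bias shifts the threshold in favor of the label $+1$ and a negative bias in favor of $-1$, which is why the bias can be treated as one more trainable parameter.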
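To train the classifier and optimize the parameters $(\vec{\theta}, b)$ we need to define a cost function. We define the empirical risk $R_{\mathrm{emp}}(\vec{\theta})$ given by the error probability $\mathrm{Pr}\left(\tilde{m}(\vec{x}) \neq m(\vec{x})\right)$ of assigning the incorrect label, averaged over the samples in the training set $T$,
$$R_{\mathrm{emp}}(\vec{\theta}) = \frac{1}{|T|} \sum_{\vec{x} \in T} \mathrm{Pr}\left(\tilde{m}(\vec{x}) \neq m(\vec{x})\right).$$
For the binary problem, the error probability of assigning the wrong label is given by the binomial cumulative distribution function (CDF) of the empirical distribution $\hat{p}_y(\vec{x})$, c.f. supplementary material for a derivation. The binomial CDF can be approximated for a large number of samples (shots) $R \gg 1$ by a sigmoid function $\mathrm{sig}(x) = (1 + e^{-x})^{-1}$. The probability that the label $m(\vec{x}) = y$ is assigned incorrectly is approximated by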
The experiment itself is split into two phases. First, we train the classifier and optimize the parameters $(\vec{\theta}, b)$. We have found that Spall’s SPSA OneSPSA (); AdaptiveSPSA () stochastic gradient descent algorithm performs well in the noisy experimental setting. We can use the circuit $W(\vec{\theta})$ as a classifier after the parameters have converged. Second, in the classification phase, the classifier assigns labels to unlabeled data according to the decision rule $\tilde{m}(\vec{x})$.
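To make the training loop concrete, here is a minimal sketch of simultaneous perturbation stochastic approximation in its standard two-evaluation form. This is an assumption for illustration: the text cites Spall's one-measurement and adaptive variants, and this sketch is not claimed to match the variant or the gain settings used in the experiment. The gain exponents 0.602 and 0.101 are the values commonly recommended for SPSA; the toy cost function stands in for the empirical risk measured on hardware.

```python
import random


def spsa_minimize(cost, theta, iterations=200, a=0.2, c=0.1,
                  alpha=0.602, gamma=0.101, seed=0):
    """Minimize cost(theta) with two cost evaluations per iteration."""
    rng = random.Random(seed)
    theta = list(theta)
    for k in range(1, iterations + 1):
        a_k = a / k ** alpha   # decaying step size
        c_k = c / k ** gamma   # decaying perturbation size
        # Perturb all parameters at once along a random +/-1 direction.
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]
        plus = [t + c_k * d for t, d in zip(theta, delta)]
        minus = [t - c_k * d for t, d in zip(theta, delta)]
        diff = (cost(plus) - cost(minus)) / (2.0 * c_k)
        # Gradient estimate for component i is diff / delta_i.
        theta = [t - a_k * diff / d for t, d in zip(theta, delta)]
    return theta


def toy_cost(theta):
    """Hypothetical stand-in for the empirical risk: a quadratic bowl centred at 1."""
    return sum((t - 1.0) ** 2 for t in theta)


best = spsa_minimize(toy_cost, [0.0, 0.0])
print(best, toy_cost(best))
```

The appeal in a noisy setting is that the number of cost evaluations per step does not grow with the number of parameters, since all parameters are perturbed simultaneously.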
We implement the quantum variational classifier $W(\vec{\theta})$ over 5 different depths ($l = 0$ through $l = 4$), c.f. Fig 2(c), in our superconducting quantum processor. The binary measurement is obtained from the parity function $\mathbf{f} = Z_1 Z_2$. For each depth we train three different data sets, using training sets consisting of 20 data points per label. One of these data sets is shown in Fig. 3 (b), along with the training set used for this particular data set.
Figure 3 (a) shows the evolution of the empirical risk as a function of optimization trial step for two different training sets and depths. In all experiments throughout this work we implemented an error mitigation technique which relies on zero-noise extrapolation to first order temme2017error (); Kandala18 (). To extrapolate, a copy of the circuit was run on a time scale slowed down by a factor of , c.f. supplemental material. This technique is applied at each trial step, and it is the resulting cost function that is fed to the classical optimizer. Two main differences appear prominent between the black and red lines in Figure 3 (a): first, the empirical risk for depth 0 converges much faster than that of depth 4; and second, the final value of the empirical risk after the 250 trials for which we run our experiments is about five times higher for depth 0 than for depth 4. The fast convergence is a result of the small number of parameters to optimize over in the case of zero depth (for our case of two qubits just four parameters, since we have fixed ). The final value of the cost function is also higher for zero depth due to the limited amount of entanglement present in the optimizing circuit in that case. Whereas error mitigation does not appreciably improve the results for depth 0 (the noise in our system is not the limiting factor in that case), it does help substantially for larger depths. Although explicitly includes the number of experimental shots taken, we fixed to avoid gradient problems, even though we took 2,000 shots in the actual experiment.
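First-order zero-noise extrapolation can be written in one line. The sketch below assumes that the measured expectation value depends linearly on the noise scale, so that a run at the base noise level and a run stretched by a factor $c$ determine the zero-noise intercept. The function name and the numbers in the usage example are hypothetical and are not taken from the experiment.

```python
def zero_noise_extrapolate(e_base, e_stretched, stretch):
    """First-order (linear) extrapolation to zero noise.

    Assumes E(scale) = E0 + slope * scale, measured at scale 1 (e_base)
    and at scale `stretch` (e_stretched). Solving for E0 gives
    E0 = (stretch * e_base - e_stretched) / (stretch - 1).
    """
    if stretch == 1:
        raise ValueError("the stretched run must use a different noise scale")
    return (stretch * e_base - e_stretched) / (stretch - 1.0)


# Hypothetical expectation values: base circuit and a time-stretched copy.
print(zero_noise_extrapolate(0.6, 0.4, stretch=2.0))
```

Because the estimate is a difference of two noisy measurements, its statistical error is larger than that of either run, which is one reason the extrapolated cost function benefits from a generous shot budget.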
After each training is completed, we use the trained set of parameters to classify 20 different test sets (randomly drawn each time) per data set. We run these classification experiments at 10,000 shots, versus the 2,000 used for training. The classification of each data point is error-mitigated and repeated twice, averaging the success ratios obtained in each of the two classifications. Figure 3 (c) shows the classification results for our quantum variational approach. We clearly see an increase in classification success with increasing circuit depth, reaching values very close to 100% for depths larger than 1. Remarkably, this classification success persists up to depth 4, despite the fact that training and classification circuits contain 8 CNOTs.
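A path to quantum advantage: Such variational circuit classifiers are directly related to conventional SVMs vapnik2013nature (); burges1998tutorial (). To see why a quantum advantage can only be obtained for feature maps with a classically hard to estimate kernel, we point out the following: The decision rule can be restated as $\tilde{m}(\vec{x}) = \mathrm{sign}\left( \langle\Phi(\vec{x})| W^\dagger(\vec{\theta})\, \mathbf{f}\, W(\vec{\theta}) |\Phi(\vec{x})\rangle + b \right)$. The variational circuit $W(\vec{\theta})$ followed by a binary measurement can be understood as a separating hyperplane in quantum state space. Choose an orthogonal, hermitian, matrix basis $\{P_\alpha\}$, where $\alpha = 1, \ldots, 4^n$ with $\mathrm{tr}[P_\alpha^\dagger P_\beta] = 2^n \delta_{\alpha,\beta}$, such as the Pauli-group on $n$-qubits. Expand both the quantum state $|\Phi(\vec{x})\rangle\langle\Phi(\vec{x})|$ and the measurement $W^\dagger(\vec{\theta})\, \mathbf{f}\, W(\vec{\theta})$ in this matrix basis. Both the expectation value of the binary measurement and the decision rule can be expressed in terms of $w_\alpha(\vec{\theta}) = \mathrm{tr}[W^\dagger(\vec{\theta})\, \mathbf{f}\, W(\vec{\theta}) P_\alpha]$ and $\Phi_\alpha(\vec{x}) = \langle\Phi(\vec{x})| P_\alpha |\Phi(\vec{x})\rangle$. For any variational unitary the classification rule can be restated in the familiar SVM form $\tilde{m}(\vec{x}) = \mathrm{sign}\left( 2^{-n} \sum_\alpha w_\alpha(\vec{\theta})\, \Phi_\alpha(\vec{x}) + b \right)$. The classifier can only be improved when the constraint is lifted that the $w_\alpha$ come from a variational circuit. The optimal $w_\alpha$ can alternatively be found by employing kernel methods and considering the standard Wolfe dual of the SVM vapnik2013nature (). Moreover, this decomposition indicates that one should think of the feature space as the quantum state space with feature vectors $|\Phi(\vec{x})\rangle\langle\Phi(\vec{x})|$ and inner products $K(\vec{x}, \vec{z}) = |\langle\Phi(\vec{x})|\Phi(\vec{z})\rangle|^2$. Indeed, the direct use of the Hilbert space as a feature space would lead to a conceptual problem, since a vector is only physically defined up to a global phase.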
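The step from the decision rule to the SVM form can be written out explicitly. The following is a short worked sketch, assuming a hermitian basis normalized as $\mathrm{tr}[P_\alpha^\dagger P_\beta] = 2^n \delta_{\alpha,\beta}$, a measurement operator $M_y = \frac{1}{2}(\mathbb{1} + y\,\mathbf{f})$, and the decision rule "assign $y$ when $\hat{p}_y > \hat{p}_{-y} - yb$". The symbol $\rho(\vec{x})$ for the feature state is introduced here only for brevity.

```latex
\begin{align}
\rho(\vec{x}) &= |\Phi(\vec{x})\rangle\langle\Phi(\vec{x})|
   = 2^{-n} \sum_\alpha \Phi_\alpha(\vec{x})\, P_\alpha,
 & \Phi_\alpha(\vec{x}) &= \mathrm{tr}\left[\rho(\vec{x})\, P_\alpha\right], \\
W^\dagger(\vec{\theta})\, \mathbf{f}\, W(\vec{\theta})
  &= 2^{-n} \sum_\alpha w_\alpha(\vec{\theta})\, P_\alpha,
 & w_\alpha(\vec{\theta}) &= \mathrm{tr}\left[W^\dagger(\vec{\theta})\, \mathbf{f}\, W(\vec{\theta})\, P_\alpha\right], \\
\mathrm{tr}\left[\rho(\vec{x})\, W^\dagger \mathbf{f}\, W\right]
  &= 2^{-2n} \sum_{\alpha,\beta} \Phi_\alpha(\vec{x})\, w_\beta(\vec{\theta})\,
     \mathrm{tr}\left[P_\alpha P_\beta\right]
   = 2^{-n} \sum_\alpha w_\alpha(\vec{\theta})\, \Phi_\alpha(\vec{x}), \\
p_y(\vec{x}) - p_{-y}(\vec{x})
  &= y\, \mathrm{tr}\left[\rho(\vec{x})\, W^\dagger \mathbf{f}\, W\right]
  \;\Rightarrow\;
  \tilde{m}(\vec{x}) = \mathrm{sign}\left( 2^{-n} \sum_\alpha w_\alpha(\vec{\theta})\, \Phi_\alpha(\vec{x}) + b \right).
\end{align}
```

The last line uses that $p_y > p_{-y} - yb$ is equivalent to $y\left(\mathrm{tr}[\rho\, W^\dagger \mathbf{f}\, W] + b\right) > 0$, so the assigned label is the sign of a linear function of the feature vector $\Phi_\alpha(\vec{x})$.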
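Quantum kernel estimation: The second classification protocol uses this connection to implement the SVM directly. Rather than using a variational quantum circuit to generate the separating hyperplane, we use a classical support vector machine for classification. The quantum computer is used twice in this protocol. First, the kernel $K(\vec{x}_i, \vec{x}_j)$ is estimated on a quantum computer for all pairs of training data $\vec{x}_i, \vec{x}_j \in T$, c.f. Fig 2(b). The optimization problem for the optimal SVM can be formulated in terms of a dual quadratic program that only uses access to the kernel. With $t = |T|$ training points and labels $y_i = m(\vec{x}_i)$, we maximize
$$L_D(\alpha) = \sum_{i=1}^{t} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{t} y_i y_j \alpha_i \alpha_j K(\vec{x}_i, \vec{x}_j),$$
subject to $\sum_{i=1}^{t} \alpha_i y_i = 0$ and $\alpha_i \geq 0$. This problem is concave whenever $K(\vec{x}_i, \vec{x}_j)$ is a positive definite matrix. The solution to this problem will be given in terms of a set of nonzero $\alpha_i^*$ with index set $N_S$ that correspond to the support vectors $\vec{x}_i$, $i \in N_S$. The quantum computer is used a second time to estimate the kernel for a new datum $\vec{s} \in S$ with all the support vectors. The optimal solution $\alpha^*$ is used to construct the classifier
$$\tilde{m}(\vec{s}) = \mathrm{sign}\left( \sum_{i \in N_S} y_i\, \alpha_i^*\, K(\vec{x}_i, \vec{s}) + b \right).$$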
The bias $b$ can be calculated from the weights $\alpha_i^*$ by choosing any $i \in N_S$ and solving $\sum_{j \in N_S} y_j \alpha_j^* K(\vec{x}_j, \vec{x}_i) + b = y_i$ for $b$. Let us discuss how the quantum computer is used to estimate the kernel. The kernel entries are the fidelities between different feature vectors. Various methods buhrman2001quantum (); cincio2018learning () exist, such as the swap test, to estimate the fidelity between general quantum states. However, since the states in the feature space are not arbitrary, the overlap can be estimated directly from the transition amplitude $|\langle\Phi(\vec{x})|\Phi(\vec{z})\rangle|^2 = |\langle 0^n| \mathcal{U}^\dagger_{\Phi(\vec{x})}\, \mathcal{U}_{\Phi(\vec{z})} |0^n\rangle|^2$. First, we apply the circuit Fig 2(b), a composition of two consecutive feature map circuits, to the initial reference state $|0^n\rangle$. Second, we measure the final state in the $Z$-basis $R$ times and record the number of all-zero strings $0^n$. The frequency of this string is the estimate of the transition probability. The kernel entry is obtained to an additive sampling error of $\tilde{\epsilon}$ when $O(\tilde{\epsilon}^{-2})$ shots are used. In the training phase a total of $O(|T|^2)$ amplitudes have to be estimated. An estimator $\hat{K}$ for the kernel matrix that deviates with high probability in operator norm from the exact kernel $K$ by at most $\|K - \hat{K}\| \leq \epsilon$ can be obtained with a total of $R = O(\epsilon^{-2} |T|^4)$ shots. The sampling error can compromise the positive semi-definiteness of the kernel. Although not applied in this work, this can be remedied by employing an adaptation of the scheme presented in smolin2012efficient (). The direct connection to conventional SVMs enables us to use the conventional bounds on the VC-dimension that ensure convergence of the empirical cost function and guide the structural risk minimization.
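The sampling procedure can be mimicked classically to get a feel for the shot noise. The sketch below is a toy model: it takes the exact transition probabilities as input (on hardware these are unknown and only the measurement record is available), draws the all-zero outcome with that probability for each shot, and reports the observed frequency. Filling only the upper triangle and setting the diagonal to one, since a state has unit overlap with itself, are design choices of this example; all names are hypothetical.

```python
import random


def estimate_kernel_entry(p_zero, shots, rng):
    """Frequency of the all-zero string over `shots` simulated measurements."""
    zeros = sum(1 for _ in range(shots) if rng.random() < p_zero)
    return zeros / shots


def estimate_kernel_matrix(exact, shots, seed=0):
    """Shot-noise model of the kernel matrix built from exact transition probabilities."""
    rng = random.Random(seed)
    t = len(exact)
    k_hat = [[1.0] * t for _ in range(t)]  # K(x, x) = 1 by normalization
    for i in range(t):
        for j in range(i + 1, t):
            entry = estimate_kernel_entry(exact[i][j], shots, rng)
            k_hat[i][j] = entry
            k_hat[j][i] = entry  # the kernel is symmetric
    return k_hat


# Hypothetical exact kernel for three training points.
exact = [[1.0, 0.25, 0.6],
         [0.25, 1.0, 0.1],
         [0.6, 0.1, 1.0]]
for row in estimate_kernel_matrix(exact, shots=10000):
    print(row)
```

Each entry fluctuates with a standard deviation of at most $1/(2\sqrt{R})$ for $R$ shots, which is the $O(\tilde{\epsilon}^{-2})$ scaling quoted above; the estimated matrix is symmetric by construction but need not be positive semi-definite.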
For the experimental implementation of estimating the kernel matrix $\hat{K}$, c.f. circuit Fig 2(b), we again apply the error-mitigation protocol temme2017error (); Kandala18 () to first order. The kernel entries are obtained by running a time-stretched copy of the circuit and reporting the mitigated entry. We use 50,000 shots per matrix entry and perform the same zero-noise extrapolation as in our variational method. Using this protocol, we obtain support vectors that are very similar to the noise-free case. We run the training stage on three different data sets, which we will label as Set I, Set II and Set III. Set III is shown in Fig. 3 (b). Note that the training data used to obtain the kernel and the support vectors is the same data used in training of our variational classifier. The support vectors (green circles in Fig. 3 (b)) are then used to classify 10 different test sets randomly drawn from each entire set. Set I and Set II yield 100% success over the classification of all 10 different test sets each, whereas Set III averages a success of . These classification results are given in Figure 3 (c) as dashed blue lines to compare with the results of our variational method. All support vectors for the three sets studied are given in the supplementary material. We achieve such high classification ratios without correcting for negative eigenvalues or matrix entries that are due to experimental imperfections.
In Fig. 4 (a) we show the ideal and the experimentally obtained kernel matrices, $K$ and $\hat{K}$, for Set III. The maximum difference across the matrices between $K$ and $\hat{K}$ is found at row (or column) 8. This is shown in Fig. 4 (b). Even though all three sets show experimental kernels equally close to their respective ideal kernels, Set III does not classify as well as Sets I and II. Equivalent plots for Sets I and II are given in the supplementary material.
Conclusions: We have experimentally demonstrated a classifier that exploits a quantum feature space. The kernel of this feature space has been conjectured to be hard to estimate classically. The data was generated to be separable within the quantum feature space to test the applicability of the method. In the experiment we find that even in the presence of noise, we are capable of achieving success rates of up to 100%. An intriguing direction for future work is to find suitable feature maps for this technique that offer provable quantum advantages while providing significant improvement on real world data sets. With the ubiquity of kernel methods in machine learning, we are optimistic that our technique will find application beyond binary classification.
Supplementary Information is available in the online version of the paper.
We thank Sergey Bravyi and Abhinav Kandala for insightful discussions. A.W.H. acknowledges funding from the MIT-IBM Watson AI Lab under the project Machine Learning in Hilbert space. The research was supported by the IBM Research Frontiers Institute. We acknowledge support from IARPA under contract W911NF-10-1-0324 for device fabrication.
The work on the classifier theory was led by V.H. and K.T. The experiments were performed by A.D.C., and all authors contributed to the manuscript.
Author information The authors declare no competing financial interests. Correspondence and requests for materials should be addressed to A.D.C. and K.T.
- (1) Mitarai, K., Negoro, M., Kitagawa, M. & Fujii, K. Quantum circuit learning. arXiv preprint arXiv:1803.00745 (2018).
- (2) Farhi, E. & Neven, H. Classification with quantum neural networks on near term processors. arXiv preprint arXiv:1802.06002 (2018).
- (3) Arunachalam, S. & de Wolf, R. Guest column: a survey of quantum learning theory. SIGACT News 48, 41–67 (2017).
- (4) Ciliberto, C. et al. Quantum machine learning: a classical perspective. Proc. R. Soc. A 474, 20170551 (2018).
- (5) Dunjko, V. & Briegel, H. J. Machine learning & artificial intelligence in the quantum domain: a review of recent progress. Reports on Progress in Physics (2018).
- (6) Biamonte, J. et al. Quantum machine learning. Nature 549, 195 (2017).
- (7) Romero, J., Olson, J. P. & Aspuru-Guzik, A. Quantum autoencoders for efficient compression of quantum data. Quantum Science and Technology 2, 045001 (2017).
- (8) Wan, K. H., Dahlsten, O., Kristjánsson, H., Gardner, R. & Kim, M. Quantum generalisation of feedforward neural networks. arXiv preprint arXiv:1612.01045 (2016).
- (9) Temme, K., Bravyi, S. & Gambetta, J. M. Error mitigation for short-depth quantum circuits. Physical review letters 119, 180509 (2017).
- (10) Li, Y. & Benjamin, S. C. Efficient variational quantum simulator incorporating active error minimization. Physical Review X 7, 021050 (2017).
- (11) Terhal, B. M. & DiVincenzo, D. P. Adaptive quantum computation, constant depth quantum circuits and arthur-merlin games. Quant. Inf. Comp. 4, 134–145 (2004).
- (12) Bremner, M. J., Montanaro, A. & Shepherd, D. J. Achieving quantum supremacy with sparse and noisy commuting quantum computations. Quantum 1, 8 (2017). URL https://doi.org/10.22331/q-2017-04-25-8.
- (13) Vapnik, V. The nature of statistical learning theory (Springer science & business media, 2013).
- (14) Rebentrost, P., Mohseni, M. & Lloyd, S. Quantum support vector machine for big data classification. Physical review letters 113, 130503 (2014).
- (15) Kandala, A. et al. Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets. Nature 549, 242 (2017).
- (16) Farhi, E., Goldstone, J., Gutmann, S. & Neven, H. Quantum algorithms for fixed qubit architectures. arXiv preprint arXiv:1703.06199 (2017).
- (17) Bergeal, N. et al. Analog information processing at the quantum limit with a Josephson ring modulator. Nature Physics 6, 296 (2010). URL http://dx.doi.org/10.1038/nphys1516.
- (18) Rigetti, C. & Devoret, M. Fully microwave-tunable universal gates in superconducting qubits with linear couplings and fixed transition frequencies. Phys. Rev. B 81, 134507 (2010). URL https://link.aps.org/doi/10.1103/PhysRevB.81.134507.
- (19) Burges, C. J. A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery 2, 121–167 (1998).
- (20) Boser, B. E., Guyon, I. M. & Vapnik, V. N. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, 144–152 (ACM, 1992).
- (21) Goldberg, L. A. & Guo, H. The complexity of approximating complex-valued ising and tutte partition functions. computational complexity 26, 765–833 (2017).
- (22) Demarie, T. F., Ouyang, Y. & Fitzsimons, J. F. Classical verification of quantum circuits containing few basis changes. Physical Review A 97, 042319 (2018).
- (23) Spall, J. C. A one-measurement form of simultaneous perturbation stochastic approximation. Automatica 33, 109 (1997).
- (24) Spall, J. C. Adaptive stochastic approximation by the simultaneous perturbation method. IEEE Transaction on Automatic Control 45, 1839 (2000).
- (25) Kandala, A. et al. In preparation .
- (26) Buhrman, H., Cleve, R., Watrous, J. & De Wolf, R. Quantum fingerprinting. Physical Review Letters 87, 167902 (2001).
- (27) Cincio, L., Subaşı, Y., Sornborger, A. T. & Coles, P. J. Learning the quantum algorithm for state overlap. arXiv preprint arXiv:1803.04114 (2018).
- (28) Smolin, J. A., Gambetta, J. M. & Smith, G. Efficient method for computing the maximum-likelihood quantum state from measurements with additive gaussian noise. Physical review letters 108, 070502 (2012).
- (29) Schuld, M. & Killoran, N. Quantum machine learning in feature hilbert spaces. arXiv preprint arXiv:1803.07128 (2018).
- (30) Schuld, M., Bocharov, A., Svore, K. & Wiebe, N. Circuit-centric quantum classifiers. arXiv preprint arXiv:1804.00633 (2018).
- (31) Stoudenmire, E. & Schwab, D. J. Supervised learning with tensor networks. In Advances in Neural Information Processing Systems, 4799–4807 (2016).
- (32) Van Dam, W., Hallgren, S. & Ip, L. Quantum algorithms for some hidden shift problems. SIAM Journal on Computing 36, 763–778 (2006).
- (33) Rötteler, M. Quantum algorithms for highly non-linear boolean functions. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete algorithms, 448–457 (Society for Industrial and Applied Mathematics, 2010).
- (34) Bremner, M. J., Montanaro, A. & Shepherd, D. J. Average-case complexity versus approximate simulation of commuting quantum computations. Physical review letters 117, 080501 (2016).
- (35) d’Alessandro, D. Introduction to quantum control and dynamics (CRC press, 2007).
- (36) Tropp, J. A. et al. An introduction to matrix concentration inequalities. Foundations and Trends ® in Machine Learning 8, 1–230 (2015).
- (37) Gambetta, J. M. et al. Characterization of addressability by simultaneous randomized benchmarking. Phys. Rev. Lett. 109, 240504 (2012). URL https://link.aps.org/doi/10.1103/PhysRevLett.109.240504.
- (38) Córcoles, A. D. et al. Process verification of two-qubit quantum gates by randomized benchmarking. Phys. Rev. A 87, 030301 (2013). URL https://link.aps.org/doi/10.1103/PhysRevA.87.030301.
- (39) Sheldon, S., Magesan, E., Chow, J. M. & Gambetta, J. M. Procedure for systematically tuning up cross-talk in the cross-resonance gate. Phys. Rev. A 93, 060302 (2016).
Supervised learning with quantum enhanced feature spaces
Consider a classification task on a set of classes (labels) in a supervised learning scenario. In such settings, we are given a training set $T$ and a test set $S$. Both are assumed to be labeled by a map unknown to the programmer. Both sets $S$ and $T$ are provided to the programmer, but the programmer only receives the labels of the training set. So formally, the programmer only has access to a restriction of the indexing map :
It is the programmer's goal to use the knowledge of to infer an indexing map over the set , such that with high probability for any . The accuracy of the approximation to the map is quantified by a classification success rate, proportional to the number of collisions of and :
For such a learning task to be meaningful it is assumed that there is a correlation in output of the indexing map over the sets and . For that reason, we assume that both sets could in principle be constructed by drawing the and sample sets from a family of -dimensional distributions and labeling the outputs according to the distribution. It is assumed that the hypothetical classification function to be learned is constructed this way. The programmer, however, does not have access to these distributions of the labeling function directly. She is only provided with a large, but finite number of samples and the matching labels.
The conventional approach to this problem is to construct a family of classically computable functions , indexed by a set of parameters . These weights are then inferred from by an optimization procedure on a classical cost function. We consider a scenario where the whole, or parts of the classification protocol , are generated on a quantum computer.
Description of the Algorithm
We consider two different learning schemes. The first is referred to as “Quantum variational classification”, the second is referred to as “Quantum kernel estimation”. Both schemes construct a separating hyperplane in the state space of - qubits. The classical data is mapped to this space using a unitary circuit family starting from the reference state .
Quantum variational classification
For our first classification approach we design a variational algorithm which exploits the large dimensional Hilbert space of our quantum processor to find an optimal cutting hyperplane in a similar vein as Support Vector Machines (SVM) do. The algorithm consists of two main parts: a training stage and a classification stage. For the training stage, a set of labeled data points are provided, on which the algorithm is performed. For the classification stage, we take a different set of data points and run the optimized classifying circuit on them without any label input. Then we compare the label of each data point to the output of the classifier to obtain a success ratio for the data set. For both the training and the classification stages, the quantum circuit that implements the algorithm comprises three main parts: the encoding of the feature map, the variational optimization and the measurement, Fig S1. The training phase consists of these steps.
The classification can be applied when the training phase is complete. The optimal parameters are used to decide the correct label for new input data. Again, the same circuit is applied as in Fig S1; however, this time the parameters are fixed and the outcomes are combined to determine the label which is reported as output of the classifier.
Quantum kernel estimation
For the second classification protocol, we restrict ourselves to the binary label case, with . In this protocol we only use the quantum computer to estimate the - kernel matrix . For all pairs of points in the training data, we sample the overlap to obtain the matrix entry in the kernel. This output probability can be estimated from the circuit depicted in Fig. S5.b. by sampling the output distribution with $R$ shots and only taking the $0^n$ count. After the kernel matrix for the full training data has been constructed we use the conventional (classical) support vector machine classifier. The optimal hyperplane is constructed by solving the dual problem in eqn. (8), which is completely specified after we have been given the labels and have estimated the kernel . The solution of the optimization problem is given in terms of the support vectors, for which $\alpha_i > 0$.
In the classification phase, we want to assign a label to a new datum of the test set. For this, the inner product between all support vectors with and the new datum has to be estimated on the quantum computer. The new label for the datum is assigned according to eqn. (16). Since all support vectors are known from the training phase and we have obtained access to the kernel from the quantum hardware, the label can be directly computed.
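Once the kernel values are available, training and classification are purely classical. The following is a minimal sketch in plain Python, not the solver used in this work. To keep it short it makes two simplifying assumptions that differ from the dual problem referred to above: the bias is absorbed into the kernel by adding a constant 1 to every entry, which removes the equality constraint on the multipliers, and the multipliers are capped by a box constraint (a soft margin). The dual is then maximized by cyclic coordinate ascent. The function names and the toy data are hypothetical.

```python
def train_dual_svm(kernel, labels, box=10.0, sweeps=200):
    """Coordinate ascent on sum(alpha) - 0.5 * alpha^T Q alpha with 0 <= alpha_i <= box.

    Q_ij = y_i * y_j * (K_ij + 1); the added constant plays the role of the bias.
    """
    n = len(labels)
    q = [[labels[i] * labels[j] * (kernel[i][j] + 1.0) for j in range(n)]
         for i in range(n)]
    alpha = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            grad = 1.0 - sum(q[i][j] * alpha[j] for j in range(n))
            # Exact maximizer along coordinate i, clipped to the box.
            alpha[i] = min(box, max(0.0, alpha[i] + grad / q[i][i]))
    return alpha


def classify(alpha, labels, kernel_row):
    """Label a new datum from its kernel values with all training points."""
    score = sum(a * y * (k + 1.0) for a, y, k in zip(alpha, labels, kernel_row))
    return 1 if score >= 0 else -1


# Toy check with a linear kernel on four 1D points (stand-in for a quantum kernel).
points = [-2.0, -1.0, 1.0, 2.0]
labels = [-1, -1, 1, 1]
gram = [[a * b for b in points] for a in points]
alpha = train_dual_svm(gram, labels)
support = [i for i, a in enumerate(alpha) if a > 1e-9]
print(alpha, support, classify(alpha, labels, [p * 1.5 for p in points]))
```

Only the training points with nonzero multipliers contribute to the score, so in the quantum setting only the kernel values between the new datum and those support vectors would need to be estimated on hardware.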
The Relationship of variational quantum classifiers to support vector machines
The references vapnik2013nature (); burges1998tutorial () provide a detailed introduction to the construction of support vector machines for pattern recognition. Support vector machines are an important tool to construct classifiers for tasks in supervised learning. We will show that the variational circuit classifier bears many similarities to a classical non-linear support vector machine.
Support vector machines (SVM):
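First, let us briefly review the training task of classical, linear support vector machines for data $T = \{\vec{x}_1, \ldots, \vec{x}_t\}$ with $\vec{x}_i \in \mathbb{R}^d$ and labels $y_i \in \{+1,-1\}$ that is linearly separable. Linear separability asks that the set of points can be split into two regions by a hyperplane $(\vec{w}, b)$, parametrized by a normal vector $\vec{w} \in \mathbb{R}^d$ and a bias $b \in \mathbb{R}$. The points $\vec{x}$ that lie directly on the hyperplane satisfy the equation
$$\langle \vec{w}, \vec{x} \rangle + b = 0,$$
expressed in terms of the inner product $\langle \cdot, \cdot \rangle$ for vectors in $\mathbb{R}^d$. The perpendicular distance of the hyperplane to the origin in $\mathbb{R}^d$ is given by $|b| / \|\vec{w}\|$. The data set is linearly separable by margin $2 / \|\vec{w}\|$ in $\mathbb{R}^d$ if there exists a vector $\vec{w}$ and a $b$ such that:
$$y_i \left( \langle \vec{w}, \vec{x}_i \rangle + b \right) \geq 1. \tag{4}$$
The classification function $\tilde{m}(\vec{x})$ that is constructed from such a hyperplane for any new data point $\vec{x}$ assigns the label according to which side of the hyperplane the new data point lies on by setting
$$\tilde{m}(\vec{x}) = \mathrm{sign}\left( \langle \vec{w}, \vec{x} \rangle + b \right). \tag{5}$$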
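As a small illustration of the separability constraint in eqn. (4) and the classifier in eqn. (5), the following sketch checks a hand-picked hyperplane against a toy data set. The data points, the hyperplane and the helper names are hypothetical and serve only to make the two formulas concrete.

```python
import math

def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def classify_linear(w, b, x):
    """Label a point by the side of the hyperplane (w, b) it lies on."""
    return 1 if dot(w, x) + b >= 0 else -1

def satisfies_margin(w, b, data):
    """Check y_i * (<w, x_i> + b) >= 1 for every labeled point."""
    return all(y * (dot(w, x) + b) >= 1 for x, y in data)

# Hypothetical toy data: (point, label) pairs in the plane.
data = [((2.0, 0.0), +1), ((1.0, 1.0), +1), ((-1.0, 0.0), -1), ((-2.0, -1.0), -1)]
w, b = (1.0, 0.0), 0.0

separable = satisfies_margin(w, b, data)
margin = 2 / math.hypot(*w)            # distance between the two margin planes
new_label = classify_linear(w, b, (3.0, 5.0))
```

Rescaling `w` and `b` by a common factor larger than one keeps the constraint satisfied but shrinks the margin, which is why the SVM looks for the feasible hyperplane with the smallest norm of `w`.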
The task in constructing a linear support vector machine (SVM) in this scenario is the following. One is looking for a hyperplane that separates the data, with the largest possible distance between the two separated sets. The perpendicular distance between the plane and the closest points with different labels on either side is called the margin, and such points are referred to as ‘support vectors’. This means that we want to maximize the margin by minimizing $\|\vec{w}\|$, or equivalently $\frac{1}{2}\|\vec{w}\|^2$, subject to the constraints as given in eqn. (4), for all data points in the training set $T$. The corresponding cost function can be written as:
$$L_P = \frac{1}{2}\|\vec{w}\|^2 - \sum_{i=1}^{t} \alpha_i y_i \left( \langle \vec{w}, \vec{x}_i \rangle + b \right) + \sum_{i=1}^{t} \alpha_i, \tag{6}$$
where $\alpha_i \geq 0$ are Lagrange multipliers chosen to ensure the constraints are satisfied.
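For non-separable datasets, it is possible to introduce non-negative slack variables $\xi_i \geq 0$, which can be used to soften the constraints for linear separability of the labeled data to
$$y_i \left( \langle \vec{w}, \vec{x}_i \rangle + b \right) \geq 1 - \xi_i.$$
These slack variables are then used to modify the objective function by $\frac{1}{2}\|\vec{w}\|^2 \rightarrow \frac{1}{2}\|\vec{w}\|^2 + C \left( \sum_i \xi_i \right)^r$. When we choose $r \geq 1$, the optimization problem remains convex and a dual can be constructed. In particular, for $r = 1$, neither the $\xi_i$ nor their Lagrange multipliers appear in the dual Lagrangian.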
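It is very helpful to consider the dual of the original primal problem in eqn. (6). The primal problem is a convex, quadratic programming problem, for which the Wolfe dual cost function $L_D$ for the Lagrange multipliers can be readily derived by variation with respect to $\vec{w}$ and $b$. The dual optimization problem is to maximize
$$L_D = \sum_{i=1}^{t} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{t} y_i y_j \alpha_i \alpha_j \langle \vec{x}_i, \vec{x}_j \rangle, \tag{8}$$
subject to constraints:
$$\sum_{i=1}^{t} \alpha_i y_i = 0, \qquad \alpha_i \geq 0.$$
The variables of the primal are given in terms of the dual variables by
$$\vec{w} = \sum_{i=1}^{t} \alpha_i y_i \vec{x}_i, \tag{9}$$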
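and the bias $b$ can be computed from the Karush-Kuhn-Tucker (KKT) conditions when the corresponding Lagrange multiplier does not vanish. The optimal variables satisfy the KKT conditions, which play an important role in the understanding of the SVM. For the primal cost function $L_P$ of eqn. (6) they are given as
$$\partial_{\vec{w}} L_P = \vec{w} - \sum_i \alpha_i y_i \vec{x}_i = 0, \qquad \partial_b L_P = -\sum_i \alpha_i y_i = 0, \qquad y_i \left( \langle \vec{w}, \vec{x}_i \rangle + b \right) - 1 \geq 0, \qquad \alpha_i \geq 0,$$
$$\alpha_i \left( y_i \left( \langle \vec{w}, \vec{x}_i \rangle + b \right) - 1 \right) = 0. \tag{14}$$
Note that the condition eqn. (14) ensures that either the optimal $\alpha_i = 0$ or the corresponding constraint eqn. (4) is tight. This property is referred to as complementary slackness, and indicates that only the vectors for which the constraint is tight give rise to non-zero $\alpha_i$. These vectors are referred to as the support vectors and we will write $N_S$ for their index set. The classifier in the dual picture is given by substituting $\vec{w}$ from eqn. (9) and $b$ into the classifier eqn. (5). The bias $b$ is obtained for any $i \in N_S$ from the equality in eqn. (14).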
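To make the passage from the dual variables back to the primal ones concrete, the sketch below solves the dual for a hypothetical two-point data set by a brute-force scan, then recovers the normal vector from eqn. (9) and the bias from the tight constraint of a support vector. The grid search is a stand-in chosen for transparency; a practical implementation would use a quadratic programming solver.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy data: one point per class on the horizontal axis.
points = [(1.0, 0.0), (-1.0, 0.0)]
labels = [+1, -1]

def dual_objective(alphas):
    """L_D = sum_i alpha_i - 1/2 sum_ij y_i y_j alpha_i alpha_j <x_i, x_j>."""
    linear = sum(alphas)
    quadratic = sum(
        labels[i] * labels[j] * alphas[i] * alphas[j] * dot(points[i], points[j])
        for i in range(len(points))
        for j in range(len(points))
    )
    return linear - 0.5 * quadratic

# The constraint sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a here,
# so the dual reduces to a one-dimensional scan over a >= 0.
candidates = [k / 1000 for k in range(1001)]
best_a = max(candidates, key=lambda a: dual_objective([a, a]))
alphas = [best_a, best_a]

# Primal normal vector: w = sum_i alpha_i y_i x_i.
w = tuple(sum(a * y * x[d] for a, y, x in zip(alphas, labels, points)) for d in range(2))

# Bias from a support vector: y_i (<w, x_i> + b) = 1 implies b = y_i - <w, x_i>.
support = [i for i, a in enumerate(alphas) if a > 0]
b = labels[support[0]] - dot(w, points[support[0]])
```

For this data set both points are support vectors, and the recovered hyperplane is the vertical axis with margin $2 / \|\vec{w}\| = 2$, which is the distance between the two points.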
The method can be generalized to the case where the decision function depends non-linearly on the data by using a trick from boser1992training () and introducing a high-dimensional, non-linear feature map. The data is mapped via
$$\Phi: \mathbb{R}^d \rightarrow \mathcal{H}, \qquad \vec{x} \mapsto \Phi(\vec{x}),$$
from a low-dimensional space $\mathbb{R}^d$ non-linearly into a high-dimensional Hilbert space $\mathcal{H}$. This space is commonly referred to as the feature space. If a suitable feature map has been chosen, it is then possible to apply the SVM classifier to the mapped data in $\mathcal{H}$, rather than in $\mathbb{R}^d$.
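It is important to note that it is in fact not necessary to construct the mapped data $\Phi(\vec{x}_i)$ in $\mathcal{H}$ explicitly. Observe that both the training data and the new data to be classified enter only through inner products, in both the optimization problem for training, c.f. eqn. (8), and the classifier, eqn. (5). Hence, we can construct the SVM for arbitrarily high-dimensional feature maps $\Phi$, if we can efficiently evaluate the inner products $\langle \Phi(\vec{x}_i), \Phi(\vec{x}_j) \rangle$ and $\langle \Phi(\vec{x}_i), \Phi(\vec{s}) \rangle$, for training points $\vec{x}_i, \vec{x}_j$ and a new datum $\vec{s}$. In particular, if we can find a kernel $K(\vec{x}, \vec{y}) = \langle \Phi(\vec{x}), \Phi(\vec{y}) \rangle$ that satisfies Mercer’s condition (which ensures that the kernel is positive semi-definite and can be interpreted as a matrix of inner products) boser1992training (); vapnik2013nature (), we can construct a classifier by setting
$$\tilde{m}(\vec{s}) = \mathrm{sign}\left( \sum_{i \in N_S} \alpha_i y_i K(\vec{x}_i, \vec{s}) + b \right). \tag{16}$$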
Here we only need to sum over all support vectors $i \in N_S$, for which $\alpha_i > 0$. Moreover, one can replace the inner product in the optimization problem eqn. (8) by the kernel. Examples of such kernels that are frequently considered in the classical literature are the polynomial kernel $K(\vec{x}, \vec{y}) = (\langle \vec{x}, \vec{y} \rangle + 1)^p$ or even the infinite-dimensional Gaussian kernel $K(\vec{x}, \vec{y}) = \exp\left( -\|\vec{x} - \vec{y}\|^2 / 2\sigma^2 \right)$. If the feature map is sufficiently powerful, increasingly complex distributions can be classified. In this paper, the feature map is a classical-to-quantum mapping by a tunable quantum circuit family that maps $\mathbb{R}^d$ into the state space, or space of density matrices, of $n$ qubits, which is embedded in the $4^n$-dimensional space of $2^n \times 2^n$ complex matrices. The example of the Gaussian kernel indicates that the sheer dimension of the Hilbert space on a quantum computer does not by itself provide an advantage, since classically even infinite-dimensional spaces are available, for instance by using the Gaussian kernel. However, this hints towards a potential source of quantum advantage, as we may construct states in feature space with hard-to-estimate overlaps.
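The kernel trick can be checked numerically on a small case. The sketch below, in plain Python, compares a degree-2 polynomial kernel with the inner product of an explicitly constructed six-dimensional feature vector, and evaluates a kernel classifier of the form of eqn. (16) with a Gaussian kernel. The explicit map, the support vectors, the multipliers and the bias are hypothetical values chosen for illustration.

```python
import math

def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(x, y, degree=2):
    """K(x, y) = (<x, y> + 1)^degree."""
    return (dot(x, y) + 1) ** degree

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-|x - y|^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))

def explicit_degree2_map(x):
    """Feature vector whose inner product reproduces the degree-2 kernel in 2D."""
    x1, x2 = x
    r = math.sqrt(2)
    return (1.0, r * x1, r * x2, x1 * x1, x2 * x2, r * x1 * x2)

def kernel_classifier(alphas, labels, support, s, bias, kernel):
    """sign(sum_i alpha_i y_i K(x_i, s) + b) over the support vectors."""
    score = sum(a * y * kernel(x, s) for a, y, x in zip(alphas, labels, support)) + bias
    return 1 if score >= 0 else -1

x, y = (1.0, 2.0), (3.0, 4.0)
via_kernel = polynomial_kernel(x, y)
via_features = dot(explicit_degree2_map(x), explicit_degree2_map(y))

# Hypothetical support vectors and multipliers for a Gaussian-kernel classifier.
support = [(0.0, 0.0), (3.0, 3.0)]
label_near_first = kernel_classifier([1.0, 1.0], [+1, -1], support, (0.5, 0.5), 0.0, gaussian_kernel)
label_near_second = kernel_classifier([1.0, 1.0], [+1, -1], support, (2.8, 3.1), 0.0, gaussian_kernel)
```

The kernel evaluation needs a single inner product in the input space, whereas the explicit route first builds a vector whose length grows with the degree and the input dimension; for the Gaussian kernel no finite explicit vector exists at all.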
Variational circuit classifiers:
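Let us now turn to the case of binary classification based on variational quantum circuits. Recall that in our setting, we first take the data $\vec{x}$ and map it to a quantum state $|\Phi(\vec{x})\rangle\langle\Phi(\vec{x})|$ on $n$ qubits, c.f. eqn. (24). Then we apply a variational circuit $W(\vec{\theta})$ to the initial state that depends on some variational parameters $\vec{\theta}$, c.f. eqn. (33). Lastly, for a binary classification task, we measure the resulting state in the canonical $Z$-basis and assign the resulting bit-string $z \in \{0,1\}^n$ to a label based on a predetermined boolean function $f: \{0,1\}^n \rightarrow \{+1,-1\}$. Hence the probability of measuring either label $y \in \{+1,-1\}$ is given by:
$$p_y(\vec{x}) = \frac{1}{2} \left( 1 + y \, \langle \Phi(\vec{x}) | W^\dagger(\vec{\theta}) \, \mathbf{f} \, W(\vec{\theta}) | \Phi(\vec{x}) \rangle \right), \tag{17}$$
where we have defined the diagonal operator
$$\mathbf{f} = \sum_{z \in \{0,1\}^n} f(z) \, |z\rangle\langle z|.$$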
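In classification tasks we assign, c.f. eqn. (36), the label with the highest empirical weight of the distribution $p_y(\vec{x})$. We ask whether the outcome $+1$ is more likely than $-1$, or vice versa. That is, we ask whether $p_{+1}(\vec{x}) > p_{-1}(\vec{x})$ or whether the converse is true. This of course depends on the sign of the expectation value $\langle \Phi(\vec{x}) | W^\dagger(\vec{\theta}) \, \mathbf{f} \, W(\vec{\theta}) | \Phi(\vec{x}) \rangle$ for the data point $\vec{x}$.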
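The assignment rule can be sketched directly on measurement counts. In the example below the boolean function is taken to be the parity of the bit-string, which is an assumption made for illustration, and the counts dictionary is hypothetical data rather than an experimental result.

```python
def parity_label(bitstring):
    """Hypothetical boolean function f: even parity -> +1, odd parity -> -1."""
    return +1 if bitstring.count("1") % 2 == 0 else -1

def empirical_label_distribution(counts, f):
    """Aggregate bit-string counts into empirical label probabilities."""
    shots = sum(counts.values())
    distribution = {+1: 0.0, -1: 0.0}
    for bitstring, count in counts.items():
        distribution[f(bitstring)] += count / shots
    return distribution

def assign_label(counts, f):
    """Pick the label with the larger empirical weight."""
    distribution = empirical_label_distribution(counts, f)
    return +1 if distribution[+1] > distribution[-1] else -1

# Hypothetical counts from 1000 shots on two qubits.
counts = {"00": 620, "01": 130, "10": 150, "11": 100}
distribution = empirical_label_distribution(counts, parity_label)
# The difference of the two weights estimates the expectation value of f.
expectation_estimate = distribution[+1] - distribution[-1]
label = assign_label(counts, parity_label)
```

Because the two label probabilities sum to one, comparing them is the same as checking the sign of `expectation_estimate`.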
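To understand how this relates to the SVM in greater detail, we need to choose an orthonormal operator basis, such as for example the Pauli group
$$\mathcal{P}_n = \langle X_i, Y_i, Z_i \rangle_{i = 1, \ldots, n}.$$
Note that when fixing the phase to $+1$, every element $P_\alpha \in \mathcal{P}_n$, with $\alpha = 1, \ldots, 4^n$, of the Pauli group is an orthogonal reflection, $P_\alpha^2 = \mathbb{1}$. Furthermore, Pauli matrices are mutually orthogonal in terms of the trace inner product
$$\mathrm{tr}\left[ P_\alpha^\dagger P_\beta \right] = 2^n \delta_{\alpha, \beta}.$$
This means that both the measurement operator in the $W$-rotated frame and the state can be expanded in terms of the operator basis with only real coefficients as
$$W^\dagger(\vec{\theta}) \, \mathbf{f} \, W(\vec{\theta}) = \frac{1}{2^n} \sum_\alpha w_\alpha(\vec{\theta}) P_\alpha, \qquad |\Phi(\vec{x})\rangle\langle\Phi(\vec{x})| = \frac{1}{2^n} \sum_\alpha \Phi_\alpha(\vec{x}) P_\alpha,$$
with $w_\alpha(\vec{\theta}) = \mathrm{tr}\left[ W^\dagger(\vec{\theta}) \, \mathbf{f} \, W(\vec{\theta}) P_\alpha \right]$ and $\Phi_\alpha(\vec{x}) = \langle \Phi(\vec{x}) | P_\alpha | \Phi(\vec{x}) \rangle$.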
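Note that the values $w_\alpha(\vec{\theta})$ as well as $\Phi_\alpha(\vec{x})$ are constrained due to the fact that they originate from a rotated projector and from a pure state. Since $\mathbf{f}^2 = \mathbb{1}$, we have that $\mathrm{tr}\left[ (W^\dagger \mathbf{f} W)^2 \right] = 2^n$. Furthermore, the projector $|\Phi(\vec{x})\rangle\langle\Phi(\vec{x})|$ squares to itself, so that the trace of its square is $1$. In particular, this means that the norms of both vectors satisfy $\sum_\alpha w_\alpha^2(\vec{\theta}) = 4^n$ as well as $\sum_\alpha \Phi_\alpha^2(\vec{x}) = 2^n$. Since the expectation value of the measured observable is $\mathrm{tr}\left[ W^\dagger \mathbf{f} W \, |\Phi(\vec{x})\rangle\langle\Phi(\vec{x})| \right]$, it can be expressed in terms of the inner product:
$$\langle \Phi(\vec{x}) | W^\dagger(\vec{\theta}) \, \mathbf{f} \, W(\vec{\theta}) | \Phi(\vec{x}) \rangle = \frac{1}{2^n} \sum_\alpha w_\alpha(\vec{\theta}) \, \Phi_\alpha(\vec{x}). \tag{22}$$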
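Observe that $\mathbf{f}$ only has eigenvalues $\pm 1$, and we have that $p_{+1}(\vec{x}) + p_{-1}(\vec{x}) = 1$. Let us now consider a decision rule, where we assign the label $y$ over the label $-y$ with some fixed bias $b$. In that case we demand that $p_y(\vec{x}) > p_{-y}(\vec{x}) - y b$. If we substitute eqn. (17) and use the expansion in eqn. (22) we have that the corresponding label is given by the decision function $\tilde{m}(\vec{x})$, where
$$\tilde{m}(\vec{x}) = \mathrm{sign}\left( \frac{1}{2^n} \sum_\alpha w_\alpha(\vec{\theta}) \, \Phi_\alpha(\vec{x}) + b \right).$$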
This expression is identical to the conventional SVM classifier, c.f. eqn. (5), after the feature map has been applied. However, in the experiment we only have access to the probabilities $p_y(\vec{x})$ through estimation. Furthermore, the $w_\alpha(\vec{\theta})$ are constrained to stem from the observable measured in the rotated frame.
This means that the correct feature space, where a linearly separating hyperplane is constructed, is in fact the quantum state space of density matrices, and not the Hilbert space itself. This is reasonable, since the physical states in the Hilbert space are only defined up to a global phase. The equivalence of states up to a global phase would make it impossible to find a separating hyperplane, since both $|\psi\rangle$ and $-|\psi\rangle$ give rise to the same physical state but can lie on either side of a separating plane.
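The Pauli expansion and the resulting inner-product form of the expectation value can be verified numerically for a single qubit. The sketch below uses plain Python with 2x2 nested lists; the feature state, the choice of a y-rotation for the variational circuit and the choice of $Z$ as the diagonal operator are hypothetical and serve only as a check of the algebra for $n = 1$.

```python
import cmath
import math

# Single-qubit Pauli basis {1, X, Y, Z}.
PAULIS = [
    [[1, 0], [0, 1]],
    [[0, 1], [1, 0]],
    [[0, -1j], [1j, 0]],
    [[1, 0], [0, -1]],
]

def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def dagger(a):
    """Conjugate transpose of a 2x2 matrix."""
    return [[complex(a[j][i]).conjugate() for j in range(2)] for i in range(2)]

def trace(a):
    return a[0][0] + a[1][1]

def pauli_coefficients(op):
    """Real coefficients tr[op P_alpha] of a Hermitian 2x2 operator."""
    return [complex(trace(matmul(op, p))).real for p in PAULIS]

# Hypothetical feature state |Phi> = (cos(t/2), e^{i phi} sin(t/2)).
t, phi = 0.7, 1.3
psi = [math.cos(t / 2), cmath.exp(1j * phi) * math.sin(t / 2)]
rho = [[psi[i] * complex(psi[j]).conjugate() for j in range(2)] for i in range(2)]

# Hypothetical variational circuit W: a rotation about the y axis by gamma.
gamma = 0.4
W = [[math.cos(gamma / 2), -math.sin(gamma / 2)],
     [math.sin(gamma / 2), math.cos(gamma / 2)]]
f_op = PAULIS[3]  # f(0) = +1, f(1) = -1 gives the diagonal operator Z
observable = matmul(dagger(W), matmul(f_op, W))

w = pauli_coefficients(observable)   # coefficients of the rotated observable
a = pauli_coefficients(rho)          # coefficients of the feature state
expectation_direct = complex(trace(matmul(observable, rho))).real
expectation_inner = sum(wi * ai for wi, ai in zip(w, a)) / 2  # prefactor 1 / 2^n, n = 1

bias = 0.0
label = 1 if expectation_inner + bias >= 0 else -1
```

The squared norms of the two coefficient vectors come out as $4^n$ and $2^n$ respectively, so the separating hyperplane and the data both live on fixed spheres in the space of Pauli coefficients.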
Encoding of the data using a suitable feature map
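In the quantum setting, the feature map is an injective encoding of classical information $\vec{x} \in \mathbb{R}^d$ into a quantum state $|\Phi\rangle\langle\Phi|$ on an $n$-qubit register. Here $\mathcal{H}_2 = \mathbb{C}^2$ is a single-qubit Hilbert space, and $\mathcal{S}(\mathcal{H}_2^{\otimes n})$ denotes the cone of positive semidefinite density matrices $\rho \geq 0$ with unit trace $\mathrm{tr}[\rho] = 1$. This cone is a subset of the $4^n$-dimensional Hilbert space of $2^n \times 2^n$ complex matrices when fitted with the inner product $\mathrm{tr}\left[ A^\dagger B \right]$ for matrices $A, B$. The feature map acts as
$$\Phi: \Omega \subset \mathbb{R}^d \rightarrow \mathcal{S}(\mathcal{H}_2^{\otimes n}), \qquad \vec{x} \mapsto |\Phi(\vec{x})\rangle\langle\Phi(\vec{x})|. \tag{24}$$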
The action of the map can be understood by a unitary circuit family, denoted by $\mathcal{U}_\Phi(\vec{x})$, that is applied to some reference state, e.g. $|0\rangle^{\otimes n}$. The resulting state is given by $|\Phi(\vec{x})\rangle = \mathcal{U}_\Phi(\vec{x}) |0\rangle^{\otimes n}$. The state in the feature space should depend non-linearly on the data. Let us discuss proposals for possible feature maps.
Product state feature maps
There are many choices for the feature map $\Phi$. Let us first discuss what would happen if we were to choose a feature map that corresponds to a product input state. We assume a feature map composed of single-qubit rotations $U(\varphi)$ on every qubit of the quantum circuit. The angles for every qubit can be chosen as a non-linear function $\varphi_i(\vec{x})$ into the space of Euler angles for the individual qubits, so that the full feature map can be implemented as:
$$|\Phi(\vec{x})\rangle = \bigotimes_{i=1}^{n} |\phi_i(\vec{x})\rangle, \qquad |\phi_i(\vec{x})\rangle = U(\varphi_i(\vec{x})) |0\rangle.$$
One example of such an implementation is the unitary implementation of the feature map used in the context of the classical classifiers by Stoudenmire and Schwab stoudenmire2016supervised () based on tensor networks. There each qubit encodes a single component $x_i$ of $\vec{x}$, so that $n = d$ qubits are used. The resulting state that is prepared is then:
$$|\Phi(\vec{x})\rangle\langle\Phi(\vec{x})| = \bigotimes_{i=1}^{n} \frac{1}{2} \sum_{\alpha_i} \Phi_i^{\alpha_i}(\vec{x}) \, P_{\alpha_i},$$
when expanded in terms of the Pauli-matrix basis, where $\Phi_i^{\alpha_i}(\vec{x}) = \langle \phi_i(\vec{x}) | P_{\alpha_i} | \phi_i(\vec{x}) \rangle$ for all $i = 1, \ldots, n$ and $P_{\alpha_i} \in \{\mathbb{1}, X, Y, Z\}$. The corresponding decision function can be constructed as in eqn. (16), where the kernel is replaced by the inner product between the resulting product states. These can be evaluated with resources scaling linearly in the number of qubits, so that no quantum advantage can be expected in this setting.
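The linear scaling can be seen in a small sketch. It assumes the local encoding $x_i \mapsto (\cos(\pi x_i / 2), \sin(\pi x_i / 2))$ associated with the tensor-network classifiers cited above and takes the kernel to be the squared overlap of the two product states; both the helper names and the sample inputs are hypothetical. The fast routine multiplies $n$ single-qubit overlaps, while the brute-force routine builds the full $2^n$-component state vectors for comparison.

```python
import math

def local_state(xi):
    """Assumed single-qubit encoding (cos(pi x / 2), sin(pi x / 2))."""
    return (math.cos(math.pi * xi / 2), math.sin(math.pi * xi / 2))

def product_kernel(x, z):
    """Squared overlap of two product states, computed qubit by qubit."""
    overlap = 1.0
    for xi, zi in zip(x, z):
        a, b = local_state(xi), local_state(zi)
        overlap *= a[0] * b[0] + a[1] * b[1]
    return overlap ** 2

def full_state(x):
    """Explicit 2^n-component state vector (exponential cost, for checking only)."""
    state = [1.0]
    for xi in x:
        local = local_state(xi)
        state = [amplitude * component for amplitude in state for component in local]
    return state

def brute_force_kernel(x, z):
    """Squared overlap computed from the full state vectors."""
    overlap = sum(a * b for a, b in zip(full_state(x), full_state(z)))
    return overlap ** 2

x = (0.1, 0.5, 0.9)
z = (0.2, 0.4, 0.3)
fast = product_kernel(x, z)
slow = brute_force_kernel(x, z)
```

For this encoding each single-qubit overlap equals $\cos(\pi (x_i - z_i) / 2)$, so the kernel is a product of $n$ cosines squared and is evaluated classically at negligible cost.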
Non-trivial feature map with entanglement
There are many choices of feature maps that do not suffer from the malaise of the aforementioned product state feature maps. To obtain a quantum advantage we would like these maps to give rise to a kernel that is computationally hard to estimate up to an additive, polynomially small error by classical means. Otherwise the map is immediately amenable to classical analysis and we are guaranteed to have lost any conceivable quantum advantage.
Let us therefore turn to a family of feature maps, c.f. Fig. S2, for which we conjecture that it is hard to estimate the overlap on a classical computer. We define the family of feature map circuits as follows