Quantum classification of the MNIST dataset via Slow Feature Analysis

Quantum classification of the MNIST dataset via Slow Feature Analysis

Iordanis Kerenidis Alessandro Luongo
Abstract

Quantum machine learning carries the promise to revolutionize information and communication technologies. While a number of quantum algorithms with potential exponential speedups have been proposed already, it is quite difficult to provide convincing evidence that quantum computers with quantum memories will be in fact useful to solve real-world problems. Our work makes considerable progress towards this goal.

We design quantum techniques for Dimensionality Reduction and for Classification, and combine them to provide an efficient and high accuracy quantum classifier that we test on the MNIST dataset. More precisely, we propose a quantum version of Slow Feature Analysis (QSFA), a dimensionality reduction technique that maps the dataset in a lower dimensional space where we can apply a novel quantum classification procedure, the Quantum Frobenius Distance (QFD). We simulate the quantum classifier (including errors) and show that it can provide classification of the MNIST handwritten digit dataset, a widely used dataset for benchmarking classification algorithms, with accuracy, similar to the classical case. The running time of the quantum classifier is polylogarithmic in the dimension and number of data points. We also provide evidence that the other parameters on which the running time depends (condition number, Frobenius norm, error threshold, etc.) scale favorably in practice, thus ascertaining the efficiency of our algorithm.

1 Introduction

Quantum computing has the potential to revolutionize information and communication technologies and one of the most promising areas is quantum machine learning. The field of quantum machine learning stemmed from the well known HHL algorithm [HHL09], that takes as input a sparse and well conditioned system of equations and an input vector , and in time polylogarithmic in the input dimensions produces a quantum state that encodes the classical solution . After that, other quantum algorithms with potential exponential speedups were proposed, using and extending techniques such as phase estimation, Hamiltonian simulation, and amplitude amplification, including for recommendation systems [KP16], topological data analysis [LGZ16], support vector machines [RML14], kernel PCA for anomaly detection [LR18], and data fitting [WBL12].

Nevertheless whether quantum processing machines can be used in practice along other processing units in order to solve efficiently and accurately real-world problems and lead to quantum innovation remains the billion-dollar question. Of course one would need powerful and robust quantum hardware but even in the case of full-scale quantum computers with quantum access to large datasets, it is essential, and rather not straightforward, to find practical applications of quantum computing that would have a real economic and societal impact.

Our work provides evidence that large-scale quantum computers with quantum access to data will be indeed useful for classification, one of the most important tasks in machine learning with applications in different industrial sectors, including automation, health-care, Internet of Things etc. More precisely, we design efficient quantum algorithms for dimensionality reduction and for classification and we also simulate their behavior on the commonly-used MNIST dataset of handwritten digits and show that they would provide accurate and competitive classification.

Before providing more details on our results, we give a brief description of the notions of dimensionality reduction and classification. Dimensionality reduction (DR), a technique used in practice both in supervised and unsupervised learning, refers to the procedure by which the dimension of the input data is reduced as a first step before any further processing of the data. It is often necessary when trying to solve practical problems in machine learning and there are many classical techniques for performing it. For instance, it is used in order to decrease the variance of the learner, since it simplifies the model and reduces the noise in the data. It is also necessary when the running time of the algorithm has polynomial dependence on the number of features or dimensions which can be prohibited for large dimensional data sets. In the context of big data analysis, by removing features that carry low information (like features that are strictly proportional to other features, or features for which the data contains too little information), it is possible to optimize the storage space and allow for data visualization. Most importantly, supervised algorithms often suffer from the curse of dimensionality: by allowing large dimensionality of data, the informative power of the data points in the training set decreases, thus leading to a degradation in classification performances. One solution to improve the accuracy would be to increase the number of elements in the training set, but this is not always possible nor desirable, so the common route is to decrease the dimension of the data. Mathematically, the idea of the dimensionality reduction algorithms is to map vectors from a high dimensional space to a low dimensional space , such that the most meaningful information (according to some criteria) is preserved. Of course, understanding which criterion to use is far from trivial.

The choice of the right DR algorithm depends on the nature of the data as well as on the type of algorithm that will be applied after the dimensionality reduction. A very well known DR algorithm is the Principal Component Analysis (PCA), that projects the data points onto the subspace spanned by the eigenvectors associated to the largest eigenvalues of the covariance matrix of the data. In this way, the projection holds “most of the information” of the dataset. In fact it is possible to show that, for a subspace of dimension , this choice of eigenvectors minimize the reconstruction error, i.e. the distance between the original and the projected vectors. However, PCA is not always the best choice of dimensionality reduction. PCA projects the data into the subspace along which the data has more variance. This does not take into consideration the information that different points might belong to different classes, and there are cases in which PCA can worsen the performance of the classifier. Other methods, like Fisher Linear Discriminant (FLD) also take into account the variance of each single class of points. Indeed, FLD projects the data in a subspace trying to maximize the distance between points belonging to different clusters and minimizing the distance between points belonging to the same cluster, thus preserving or increasing the accuracy. Slow Feature Analysis (SFA) is a different dimensionality reduction technique proposed in the context of computational neurosciences that projects the data points onto the subspace spanned by the eigenvectors associated to the smallest eigenvalues of the derivative covariance matrix of the data. SFA has been shown to model a kind of neuron (called complex cell) situated in the cortical layer in the primary visual cortex (called V1) [BW05]. SFA can be used in machine learning as a DR algorithm, and it has been successfully applied to enhance the performance of classifiers [ZD12, Ber05]. It has been shown that solving the optimization problem upon which SFA is based is equivalent to other dimensionality reduction algorithms, like Laplacian Eigenmaps [Spr11] and Fisher Linear Discriminant [KM09], thus a quantum algorithm for SFA also provides algorithms for Laplacian Eigenmaps and Fisher Linear Discriminant.

The main goal of our work is to show that quantum computers will be able to provide efficient and accurate classification, a task of prominent importance in machine learning. The classification problem consists of finding the correct label for a new sample given a training set of elements, each one associated with one of possible labels. Classification algorithms can be used for solving problems like customer segmentation, medical diagnosis, image classification, spam detection, and others. Some classification algorithms could also be adapted for performing regression. Formally, we will show how to perform the following task: given a new sample and quantum access to the training set , find the correct class for by building a function . As we said, for most real-world datasets, the efficiency and accuracy of the classification can be considerably improved by performing dimensionality reduction before the actual classification.

In this work we first define quantum Slow Feature Analysis (QSFA), a quantum method for dimensionality reduction. The QSFA algorithm that we propose uses techniques based on the Singular Value Estimation and the projection procedure in [KP17, KP16], as well as recent improvements of these procedures in [CGJ18]. The idea is to approximately project the input vectors onto the space spanned by the eigenvectors corresponding to the smallest eigenvalues of the derivative covariance matrix. Our algorithm runs in time polylogarithmic in the dimension and number of points in the dataset, and - as is the case for algorithms based on HHL - depends on some characteristics of the input matrix (like the condition number), thus it can achieve an exponential speedup in certain cases. We use QSFA as the first step in an efficient and accurate procedure for solving the classification problem: given a set of labeled elements belonging to different classes, the problem is to find the correct class for a new observation.

Second, we define the quantum Frobenius Distance (QFD) classification, a novel quantum method that classifies a new data point based on the average square -distance between the new data point and each labeled cluster, i.e. the average of the square distances between the new data point and each point in the cluster. The running time is logarithmic in the number and dimension of the data points and proceeds by creating a weighted superposition of all points in a cluster and the new data point, and then estimating the distance of each point with the new data point in superposition, hence allowing to estimate the average square distance.

We started out wanting to provide concrete evidence that quantum computers will be able to solve real-world problems. To this end, we also provide the results of a simulation of the performance of the quantum classification algorithm on a standard dataset in machine learning: the MNIST dataset of handwritten digits. We simulated classically the procedures that the quantum algorithm performs including the errors in these procedures.

With respect to the accuracy, our classification using QSFA and QFD provides accuracy around 98.5%, which is comparable to classical machine learning algorithms, and better than many of the previous results registered in [LeC]. We note that the best classical classification algorithms, and our QFD classification as well, when applied directly on the raw data provide quite low accuracy, showing that the step of dimensionality reduction is indeed crucial.

With respect to the running time of our quantum procedure, it depends only polylogarithmically on the dimension and the number of data points, but it also depends on the characteristics of the input matrix, in particular its Frobenius norm (or maximum norm depending on the quantum procedure used) and its condition number (or a condition threshold which can be used instead). We show that all these parameters increase very slowly with the number and dimension of the data points in the MNIST dataset. Moreover, we provide specific numbers for all other error parameters that appear in the running time.

We see our results as evidence that our quantum classifier can provide efficient and high accuracy classification. We note that the fact that the dependence on the dimension is only polylogarithmic can be seen from two points of view. First, as a much faster algorithm to provide classification with similar accuracy as the classical algorithms. Second, and maybe more importantly, by increasing the initial input dimension (number or dimension of data points), our quantum classifier may provide much higher accuracy classification in time similar to classical algorithms. Of course we will not be able to confirm the second statement before we can actually have quantum computers, but it is well known in the ML community that increasing the input dimension (up to some point) increases the accuracy of the classification, which is very important for much more interesting data sets than the MNIST one.

Let us remark also that our goal is to assess the power of quantum computers with access to a QRAM, i.e. a quantum computer that can access classical data in superposition, in a similar manner to classical high-performance computing machines which by default possess a RAM. Of course, we do not have such quantum computers right now and there is a possibility that they may never be realized. We think of our results as motivation for actually building such quantum processing machines. We will provide more details about the QRAM model in the following sections.

Note also that the quantum dimensionality reduction and classification proposed in this work can be used also on quantum data without the need of a QRAM, in other words, data produced by a quantum process that need to be further elaborated or classified. For instance, after Hamiltonian simulation used in quantum chemistry or for testing quantum devices, and so on. This way, one may not need to preload the classical data into the quantum memory, since there is an efficient procedure of creating the quantum states directly.

1.1 Previous Results

Despite the fact that quantum machine learning is a recent field, it already encompasses a rich set of techniques and approaches. In this section we describe the main trends, with special attention on classification algorithms.

The first category of algorithms use circuits of parametrized gates to perform machine learning tasks such as classification or regression [VBB17, SBSW18, CGAG17]. In the training phase, the parameters of the circuit are learned using classical optimization techniques, where the function to optimize is a loss function calculated on the output of the quantum circuit. A work in this direction [OMA17] used the Quantum Approximative Optimization Algorithm (QAOA) to solve a clustering problem, exploiting a reduction from clustering to MaxCut, a combinatorial optimization problem which QAOA can solve. In the same work, a downscaled version of the algorithm was tested on real hardware. This model has similarities with classical neural networks, at the point that some of these works have been presented as quantum neural networks. For instance, in the work of [FN18] a quantum circuit of this kind has been trained on an extremely downsampled version of the MNIST dataset with only two digits and with a test error of 10%. A possible issue of this approach is outlined in [MBS18], where the authors showed that for a large class of random quantum circuits, the gradient of the function under optimization can be exponentially small in the number of qubits, thus the probability to find a non-zero value of the gradient is vanishingly small. If the gradient is zero almost everywhere, gradient based methods for finding a correct set of parameters over large number of qubits are deemed to fail.

In [SFP17], the authors presented a kernelized binary distance-based classifier and tested it with two experiments on the IRIS dataset, first with numerical simulations and then on quantum hardware. Tested on a severly downsampled version of two linearly separable classes, the simulation lead to 100% accuracy, as expected.

Using quantum annealing it is possible to fit a generative model, as shown in [BRGBPO17]. This work used a dataset of handwritten digits as well, specifically the OptDigits dataset [Kay95] along with the BAS dataset of artificial images [Mac02]. The two datasets were studied for image reconstruction and classification. Preprocessing on the dataset helped to cope with the limitations of the hardware. The authors trained a quantum Boltzmann machine such that the samples from the annealer are as similar as possible to the distribution of the real data. One could potentially use this method to perform classification as well.

Another trend in quantum machine learning focuses on the use of amplitude amplification and estimation techniques, gaining a quadratic factor in the running time. Prominent examples of these approach is the quantum perceptron algorithm [KWS16], quantum k-nearest-neigbour [WKS14] and a genetic algorithm [LB16].

The approach that may achieve exponential speedups uses quantum computers to speed up the linear algebraic operations used in machine learning algorithms. The work of Robentrost et al. [RML14] consists of an algorithm for a Support Vector Machines classifier with an exponential speedup on the number of samples and the dimension of the feature space. There, the well known HHL algorithm is used to solve a linear system of equations associated to the optimization problem of SVM. Other results in this paradigm exploit novel Hamiltonian simulation techniques (called quantum self-testing) developed for the first time in [LMR13] and analyzed in [KLL17]. Previous Hamiltonian simulation techniques were limited to sparse or local matrices, and this technique extends Hamiltonian simulation to dense but low-rank matrices. Based on quantum self-testing and assuming the existence of QRAM, the authors proposed a quantum version of PCA.

The work most related to ours is [CD16], where the authors described a dimensionality reduction and a classification algorithm for Linear and Nonlinear Fisher Discriminant Analysis using the model of Hamiltonian simulation. Their algorithm works for the case where the sparsity or rank of the matrix is polylogarithmic in the dimension and has complexity where is the desired number of principal eigenvectors, and is a pre-defined conditioning number which is chosen to be . At the end of the dimensionality reduction, the algorithm uses tomography to learn the corresponding eigenvectors and project clasically the data points onto this low-dimensional space. After this, the authors present an efficient QRAM-based classification algorithm using Quadratic Discriminant Analysis (QDA) with similar running time. The assumptions on the input matrix and the bad dependence of the running time on the error parameter make this algorithm hard to apply on real data.

1.2 Notation and Organisation

Let be a matrix in . It’s singular value decomposition is , where is a matrix whose columns are called the left singular vectors of and are the eigenvectors of the matrix, is a diagonal matrix whose diagonal elements are the singular values of A, and is a matrix whose columns are called the right eigenvectors of and are the eigenvectors of the matrix. We denote with the set . Unless otherwise stated, our vector are in . With we denote the projection of the matrix onto the vector subspace spanned by the singular vectors associated to the singular values that are smaller than . The conditioning number of a matrix is defined as the ratio between the biggest and the smallest singular values.

Our training set is made of pairs for : a vector and its associated label (also called class). With the notation we denote the -th sample of the training set, which we think of as a row of the matrix . We consider different labels , and for each label, we denote the set of training samples of this label, and the corresponding matrix.

With we express the time complexity of an algorithm with polylogarithmic dependencies factored out.

The paper is organized as follows. In Section 2, SFA is introduced as an dimensionality reduction problem for classification. In Section 3, we show how to perform linear algebraic operations with a quantum computer. Here we give a new algorithm to perform singular value estimation of a product of two matrices. With these tools, we describe in Section 4 a quantum version of SFA that works with classical data. In Section 5, we describe a new distance based classification algorithm called quantum Frobenius Distance (QFD) classifier, and in Section 6 we show how the combination of QSFA and QFD has performance comparable to classical versions of the algorithm on the same dataset. We conclude our work in Section 7, by outlining applications, possible enhancements, and open problems.

2 Classical Slow Feature Analysis

SFA was originally proposed as an online, nonlinear, and unsupervised algorithm [WW99]. Its task was to learn slowly varying features from generic input signals that vary rapidly over time [Ber05, WW99]. SFA has been motivated by the temporal slowness principle, that postulates that while the primary sensory receptors (like the retinal receptors in an animal’s eye) are sensitive to very small changes in the environment and thus vary on a very fast time scale, the internal representation of the environment in the brain varies on a much slower time scale. The slowness principle is a hypothesis for the functional organization of the visual cortex and possibly other sensory areas of the brain [WBF11] and it has been introduced as a way to model the transformation invariance in natural image sequences [ZD12].

SFA is an algorithm that formalizes the slowness principle as a nonlinear optimization problem. In [BW04, SZW14], SFA has been used to do nonlinear blind source separation. Although SFA has been developed in the context of computational neurosciences, there have been many applications of the algorithm to solve ML related tasks. A prominent advantage of SFA compared to other algorithms is that it is almost hyperparameter-free. The only parameters to chose are in the preprocessing of the data, e.g. the initial PCA dimension and the nonlinear expansion that consists of a choice of a polynomial of (usually low) degree . Another advantage is that it is guaranteed to find the optimal solution within the considered function space [EBW12]. For a detailed description of the algorithm, we suggest [SW08].

With appropriate preprocessing, SFA can be used in conjunction to a supervised algorithm to acquire classification capabilities. For instance it has been used for pattern recognition to classify images of digits in the famous MNIST database [Ber05]. SFA can be adapted to be used to solve complex tasks in supervised learning, like face and human action recognition [GLW13, ZD12, SJC14].

The high level idea of using SFA for classification is the following: One can think of the training set as an input series . Each belongs to one of different classes. The goal is to learn functions such that the output is very similar for the training samples of the same class and largely different for samples of different classes. Once these functions are learned, they are used to map the training set in a low dimensional vector space. When a new data point arrive, it is mapped to the same vector space, where classification can be done with higher accuracy.

Now we introduce the minimization problem in its most general form as it is commonly stated for classification [Ber05]. Let For all , minimize:

 Δ(yj)=1aK∑k=1∑s,t∈Tks

with the following constraints:

The minimization of the delta values encodes the requirement on the output signal to vary “as slow as possible”, and thus the delta values are our measure of slowness. They are the average of the square of the first order derivative (over time) of the -th component of the output signal . The first requirement states that the average over time of each component of the signal should be zero, and it is stated just for convenience, such that the other two requirements take a simple form. The second requirement asks for the variance over time of each component of the signal to be . It is equivalent to saying that each signal should carry some information and avoid the trivial solution . The third requirement is to say that we want the signals to be decorrelated with each other. This also introduces an order, such that the first signal is the slowest, the second signal is the second slowest and so on. The first and the second constraint also avoid the trivial solution . Intuitively, the decorrelation constraint forces different functions to encode different “aspects” of the input, maximizing the information carried by the output signal.

In order for the minimization problem to be feasible in practice, the ’s are restricted to be linear functions such that the output signal becomes or else , where is the matrix with rows the input samples and the matrix that maps the input matrix into a lower dimensional output .

In case it is needed to capture nonlinear relations in the dataset, one performs a standard nonlinear polynomial expansion on the input data as preprocessing. Usually, a polynomial expansion of degree or is chosen. For example we can take

 x=[x1,x2,x3]→[x21,x1x2,x1x3,x22,x2x3,x23,x1,x2,x3].

The choice of the nonlinear expansion is important for using SFA in machine learning contexts. If it is a low dimensional expansion, it might not solve the task with high accuracy, while if the dimension is too high, it might overfit the training data, and therefore not generalize properly to test data. This technique also goes under the name of polynomial kernel.

We also need to satisfy the constraint on the average of the signal being zero and have unit variance. This is not a strong assumption, since it takes linear preprocessing time with respect to the dimensions of the dataset to remove the mean and scale the components by their variance. Namely, we assume that the -th component of the -th vector in the dataset satisfies the condition:

 xj(i):=~xj(i)−Ei[~xj(i)]√Ei[(~xj(i)−Ei[~xj(i)])2],

where with we define a raw signal with arbitrary mean and variance, the expected value of a single component of the vectors.

This allows us to rewrite the minimization problem including the constraints of zero mean and unit variance. We can restate the definition of the delta function as:

 Δ(yj)=wTjAwjwTjBwj, (1)

where the matrix is called the covariance matrix and defined as:

 B:=1n∑i∈[n]x(i)x(i)T=XTX (2)

and the matrix is called the derivative covariance matrix and defined as:

 A:=1aK∑k=1∑i,i′∈Tki

Note also, that we can approximate the matrix by subsampling from all possible pairs from each class and this is indeed what happens in practice.

It is not hard to see that the weight vectors that correspond to the minima of equation (1) are the eigenvectors associated with the smallest eigenvalues of the generalized eigenvalue problem [HTF09, SW82], where is the diagonal matrix of eigenvalues, and is the matrix of generalized eigenvectors. Picking the smallest eigenvectors will allow us to create the matrix that will project our data to the slow feature space. In other words, SFA reduces to a generalized eigenvalue problem.

2.1 The SFA algorithm

The SFA algorithm basically provides a solution to the generalized eigenvalue problem and outputs the eigenvectors corresponding to the smallest eigenvalues. As we said we assume that the data has been normalized and polynomially expanded.

We are now ready for the main steps of the algorithm. The first step is to whiten the data that will reduce the problem into a normal eigenvalue problem; the second step is to perform PCA in order to find the eigenvalues and eigenvectors. We refer to [EBW12] for a more complete description.

2.1.1 Step 1: Whitening the data

Recall that . We now show how to whiten the data by right multiplying with the matrix . Then, the input matrix becomes and the covariance matrix of the whitened data is thus the identity.

Claim 1.

Let be the matrix of whitened data. Then .

Proof.

Let . Remember that .

 ZTZ = [XB−1/2]TXB−1/2=(B−1/2)TBB−1/2 = [((UΣVT)T(UΣVT))−1/2]T(UΣVT)T(UΣVT)[((UΣVT)T(UΣVT))−1/2] = [((VΣUT)(UΣVT))−1/2]T(VΣUT)(UΣVT)[((VΣUT)(UΣVT))−1/2] = = (VΣ−1VT)(VΣ2VT)(VΣ−1VT)=(VIVT)=I

2.1.2 Step 2: Projection in slow feature space

The second step of SFA consists of outputing the “slowest” eigenvectors of the derivative covariance matrix of the whitened data , where is defined similar to by using the whitened samples instead of the original ones. Note that taking the derivatives of the whitened data is equal to whitening the derivatives. In other words,

Let . Then .

Proof.
 A = ˙ZT˙Z=1aK∑k=1∑i,i′∈Tki

This will in fact allow us to whiten the data with a quantum procedure. In practice, the matrix is usually approximated with a small fraction of all the possible derivatives, roughly linear (and not quadratic) on the number of data points. In our case we take the number of rows of the derivative matrix to be just double the number of data points without compromising the accuracy.

3 Quantum algorithms for machine learning

We start by stating some known results that we will use in the following sections.

Definition 1.

The vector state for is defined as

Proposition 1.

(Phase estimation [Kit96]) Let be a unitary operator, with eigenvectors and eigenvalues for , i.e. we have for . For a precision parameter , there exists a quantum algorithm that runs in time and with probability maps a state to the state such that for all .

Proposition 2.

(Amplitude amplification and estimation [BHMT02]) If there is unitary operator such that then can be estimated to multiplicative error in time and can be generated in expected time .

We will provide algorithms for a quantum computer that has quantum access to classical data through a structure called QRAM which allows for creating quantum states efficiently from classical data stored in a data structure. We say that a dataset is efficiently loaded in the QRAM, if the size of the data structure is linear in the dimension and number of data points and the time to enter/update/delete an element is polylogarithmic in the dimension and number of data points. In Appendix A we show how to construct such data structures. We also assume that the classical data is stored already polynomially expanded and scaled to zero mean and unit variance.

We first recall some known results about the power of quantum computers to perform linear algebraic procedures efficiently. In what follows we follow the convention that the matrices have been stored in QRAM normalized, i.e. (see [KP17] for an efficient procedure for this normalization).

Theorem 1 (Singular Value Estimation [Kp17]).

Let be a matrix with singular value decomposition stored in the data structure described in Appendix A. Let the precision parameter. There is an algorithm with running time that performs the mapping , where for all with probability at least .

The function is defined as , while is defined as . The optimal value for depend on the matrix under consideration, but we have that is smaller than the maximum norm of the row vectors, which is always smaller than the sparsity. For a matrix and a vector , we define as the projection of onto the space spanned by the singular vectors of whose corresponding singular values are smaller than , and some subset of singular vectors whose corresponding singular values are in the interval .

The relevance of Theorem 1 in machine learning is the following: if we are able to estimate the singular values of a matrix, then we can perform a conditional rotation controlled by these singular values and hence perform a variety of linear algebraic operations, including matrix inversion, matrix multiplication or projection onto a subspace. We provide below theorems from [KP16, KP17] and recent improvements by [CGJ18], while in Algorithm 2, we describe quantum algorithms for linear algebra based on the SVE procedure.

Theorem 2 (Matrix algebra [Kp16, Kp17]).

Let such that , and a vector stored in QRAM as described in Appendix A. There exist quantum algorithms that with probability at least return

1. a state such that in time

2. a state such that in time

3. a state in time

One can also get estimates of the norms with multiplicative error by increasing the running time by a factor .

In recent work [CGJ18], QRAM based procedures for matrix multiplication and inversion have been discovered where the dependence on the error parameter was greatly improved to and that on to linear. There, the notion of block encodings is used which is shown to be equivalent to our notion of having an efficient data structure as the one described in Appendix A. Hence, we can restate the above theorem with the improved times as

Theorem 3 (Matrix algebra [Cgj18]).

Let such that , and a vector stored in QRAM as described in Appendix A. There exist quantum algorithms that with probability at least return

1. a state such that in time

2. a state such that in time

One can also get estimates of the norms with multiplicative error by increasing the running time by a factor .

3.1 Singular value estimation of a product of two matrices

We will also need a novel algorithm to perform Singular Value Estimation of a matrix that is the product of two matrices and for which we have QRAM access.

Theorem 4.

Let and be loaded in QRAM as described in Appendix A. Define and an error parameter. There is a quantum algorithm that with probability at least performs the mapping where is an approximation of the eigenvalues of such that , in time .

Proof.

We start by noting that for each singular value of there’s a corresponding eigenvalue of the unitary matrix . Also, we note that we know how to multiply by by applying Theorem 3 sequentially with and . This will allow us to approximately apply the unitary . The last step will consist of the application of phase estimation to estimate the eigenvalues of and hence the singular values of . Note that we need to be a symmetric matrix because of the Hamiltonian simulation part. In case is not symmetric, we redefine it as

 W=[0PQ(PQ)T0]

Note we have for the matrices stored in QRAM and defined as

 M1=[P00QT],M2=[0QPT0].

We now show how to approximately apply efficiently. Note that for a symmetric matrix we have and using the Taylor expansion of the exponential function we have

 U=e−iW=∞∑j=0(−iW)jj!=V∞∑j=0(−iΣ)jj!VT

With we denote our first approximation of , where we truncate the sum after terms.

 ˜U=ℓ∑j=0(−iW)jj!=Vℓ∑j=0(−iΣ)jj!VT

We want to chose such that . We have:

 ∥∥U−˜U∥∥ ≤ ||∞∑j=0(−iW)jj!−ℓ∑j=0(−iW)jj!||≤||∞∑j=ℓ+1(−iW)jj!||≤∞∑j=ℓ+1||(−iW)jj!||≤∞∑j=ℓ+11j! ≤ ∞∑j=ℓ+112j−1≤2−ℓ+1

where we used triangle inequality and that . Choosing makes the error less than .

In fact, we cannot apply exactly but only approximately, since we need to multiply with the matrices and we do so by using the matrix multiplication algorithm for the matrices and . For each of these matrices, we use an error of which gives an error for of and an error for of at most . The running time for multiplying with each is at most by multiplying sequentially. Hence, we will try to apply the unitary by using the Taylor expansion up to level and approximating each in the sum through our matrix multiplication procedure that gives error at most .

In order to apply on a state , let’s assume is a power of two and define . We start with the state

 1√Nll∑j=0−ijj!|j⟩|x⟩

Controlled on the first register we use our matrix multiplication procedure to multiply with the corresponding power of and get a state at most away from the state

 1√Nll∑j=0−ijj!|j⟩|Wjx⟩.

We then perform a Hadamard on the first register and get a state away from the state

 1√ℓ|0⟩(1√N′l∑j=0−ijj!|Wjx⟩)+|0⊥⟩|G⟩

where just normalizes the state in the parenthesis. Note that after the Hadamard on the first register, the amplitude corresponding to each is the same. We use this procedure inside an amplitude amplification procedure to increase the amplitude to be close to 1, by incurring a factor in the running time. The outcome will be a state away from the state

 (1√N′l∑j=0−ijj!|Wjx⟩)=|~Ux⟩

which is the application of . Since , we have that the above procedure applies a unitary such that . Note that the running time of this procedure is given by the amplitude amplification and the time to multiply with , hence we have that the running time is

 O(ℓ3/2(κ(M1)μ(M1)log(8ℓ/ϵ)+κ(M2)μ(M2)log(8ℓ/ϵ))

Now that we know how to apply , we can perform phase estimation on it with error . This provides an algorithm for estimating the singular values of with overall error . The final running time is

 O(ℓ3/2ϵ(κ(M1)μ(M1)log(8ℓ/ϵ)+κ(M2)μ(M2)log(8ℓ/ϵ))

We have and , and since the running time can be simplified to

 ~O((κ(P)+κ(Q))(μ(P)+μ(Q))ϵ).

4 Quantum Slow Feature Analysis

Our goal is to quantumly map the input matrix to the output matrix . For this, we assume that the matrices and are stored in QRAM and we provide efficient data structures for that in Appendix A. Let us define the state . We will describe how to build a circuit that approximately performs the unitary where . As the classical algorithm, QSFA is divided in two parts. In the first step we whiten the data, i.e. we map the state to the state , and in the second step we approximately project onto the subspace spanned by the smallest eigenvectors of the whitened derivative covariance matrix .

4.1 Step 1: Whitening the data

Recall that and . We now show how to whiten the data having quantum access to the matrix . We know that the data is whitened by multiplying with the matrix . Note that is a symmetric matrix with eigenvectors the column singular vectors of and eigenvalues equal to . In fact the results in [CGJ18] easily imply that the time to multiply with is the same as the time to multiply with and hence we have

Theorem 5 (Whitening algorithm).

Let stored in QRAM as described in Appendix A. Let the matrix of whitened data. There exists a quantum algorithm that produces as output a state such that in expected time .

4.2 Step 2: Projection in slow feature space

Remember that the previous step produced the state of the whitened data up to some error . Now we want to project this state onto the subspace spanned by the eigenvectors associated to the “slowest” eigenvectors of the whitened derivative covariance matrix , where is the whitened derivative matrix . Let a threshold value and a precision parameter. With we denote a projection of the matrix onto the vector subspace spanned by the union of the singular vectors associated to singular values that are smaller than and some subset of singular vectors whose corresponding singular values are in the interval .

As in the previous section, we note that the eigenvalues of are the squares of the singular values of , and the two matrices share the same column space: , and . Note also that whitening the derivatives is equal to taking the derivatives of the whitened data. In this way, the problem of performing SVE on is reduced to performing SVE of . Theorem 4 provides exactly the procedure for performing SVE on the matrix , since we know how to multiply with (it is in QRAM) and with (we showed in the whitening procedure that it reduces to performing SVE on ).

Since we showed how to perform SVE of using Theorem 4, we can then use Algorithm 2 to project the input onto the “slow” subspace of and also estimate the norm. To perform the projection, we will need a threshold for the eigenvalues that will give us the subspace of the slowest eigenvectors. A priori, we don’t know the appropriate threshold value, and thus it must be found experimentally through binary search since it depends on the distribution of singular values of the matrix representing the dataset.

We provide the full analysis in the next section that describes the entire QSFA algorithm.

4.3 Analysis of QSFA

We can now describe and analyse the entire QSFA algorithm. Our goal is to use QRAM access to the input data and and output a quantum state that corresponds to the projection of the whitened data onto the slow eigenvectors of the derivative covariance matrix, i.e. output .

We conclude this section by stating the main dimensionality reduction theorem of this paper.

Theorem 6 (QSFA algorithm).

Let and its derivative matrix stored in QRAM as described in Appendix A. Let . There exists a quantum algorithm that produces as output a state with in time