Eigen Component Analysis: A Quantum Theory Incorporated Machine Learning Technique to Find Linearly Maximum Separable Components


Abstract

For a linear system, the response to a stimulus is often the superposition of its responses to the decomposed stimuli. In quantum mechanics, a state is the superposition of multiple eigenstates. Here, we propose \glseca, an interpretable linear learning model that incorporates the principles of quantum mechanics into the design of algorithms capable of feature extraction, classification, dictionary and deep learning, adversarial generation, and more. The simulation of \glseca, which possesses a measurable class-label, on a classical computer outperforms existing classical linear models. An enhanced \glsecan, a network of concatenated \glseca models, gains the potential not only to be integrated with nonlinear models, but also to serve as an interface for deep neural networks to be implemented on a quantum computer, by treating a data set as recordings of quantum states. Therefore, \glseca and its derivatives expand the feasibility of linear learning models by adopting the strategy of quantum machine learning, replacing heavy nonlinear models with succinct linear operations in tackling complexity.


Acronyms used in this paper: RBM, restricted Boltzmann machine; ECA, eigen component analysis; AECA, approximated eigen component analysis; VECA, vanilla eigen component analysis; PE, pure eigenfeature; EFM, eigenfeature matrix; ECMM, eigenfeature-class mapping matrix; LoR, logistic regression; LiR, linear regression; SVM, support vector machine; KSVM, kernel support vector machine; PCA, principal component analysis; t-SNE, t-distributed stochastic neighbor embedding; KNN, k-nearest neighbor; KMC, k-means clustering; FNN, fragment neural network; 1D, one-dimensional; 2D, two-dimensional; 3D, three-dimensional; MSE, mean squared error; KECA, kernel eigen component analysis; CECA, continuous eigen component analysis; UECA, unsupervised eigen component analysis; NCECA, nonlinear continuous eigen component analysis; GAN, generative adversarial network; ASD, additive state decomposition; RBF, radial basis function; LDA, linear discriminative analysis; QDA, quadratic discriminative analysis; ECAN, eigen component analysis network; SR, softmax regression; SF, softmax function; ECAbGAN, eigen component analysis-based generative adversarial network; ECANbGAN, eigen component analysis network-based generative adversarial network; GECA, generative eigen component analysis; GECAN, generative eigen component analysis network; FcFNN, fully connected fragment neural network; DbFNN, degeneracy-based fragment neural network; DNN, deep neural network; CNN, convolutional neural network; ICA, independent component analysis; DictL, dictionary learning; QML, quantum machine learning; RaDO, raising dimension operator; ReDO, reducing dimension operator; EM, expectation-maximization; p.m.f., probability mass function; p.d.f., probability density function; ReLU, rectified linear unit.

Keywords: Quantum Mechanics · Machine Learning · Degeneracy · Component Analysis · Linear Separability


1 Introduction

Machine learning is widely used in areas ranging from chemistry [11, 44, 21], biology [49, 51, 34], and materials [13] to medicine [8]. It has also been used in quantum mechanics [35, 7, 32, 36] and quantum chemistry [48, 40]. Quantum mechanics has, in turn, inspired many machine learning algorithms [17, 4, 38, 9], which facilitate the growth of physics itself [29, 37]. The entanglement between machine learning and quantum mechanics has started to produce increasing cross-disciplinary breakthroughs in physics, chemistry, artificial intelligence and even the social sciences [20], and has given rise to \glsqml, an interdisciplinary field that employs the principles of quantum mechanics in machine learning. In fact, quantum mechanics shares a high similarity with machine learning in both its underlying principles and its manner of prediction [36, 37].

In machine learning, the features of a data set are usually redundant [14]. Feature extraction refers to enriching the features of interest and suppressing or discarding the features that are not of interest. A number of classical dimension reduction methods have been proposed to learn the similarity or difference among the features of a data set or across multiple data sets. \Glspca seeks an orthogonal transformation that maximizes the variance and separates the data, but it is incapable of exploiting class labels or performing inter-class differentiation. \Glslda takes advantage of class labels to differentiate inter-class data, but it is conditioned on a Gaussian distribution. \Glstsne is a nonlinear feature extraction model that finds a low-dimensional representation of high-dimensional data, but it can produce spurious clusters when the perplexity is low. \Glsica decouples a mixed signal into multiple source signals, but it is limited to non-Gaussian data distributions. With features extracted, classification can be conducted more easily. The goal of classification is to assign a class label to a given sample. If several classes are linearly separable, the classifier that separates them is termed a linear classifier. Usually, linear separability refers to the existence of one or several hyperplanes that separate the data. \Glslor finds a hyperplane and converts the distance between a new input and the hyperplane into the probability of the data belonging to a class. Taking this one step further, \glssvm maximizes the two margins on each side of the hyperplane. \Glslda finds several hyperplanes at once, each being similar to the one found by \glslor. Empirically, it is generally assumed that a linear model is less robust and powerful than a nonlinear model. However, this is not the case for \glseca. By utilizing the linearity of a quantum system that superposes eigenfeatures (i.e., eigenstates), it functions as a linear model that can couple with most machine learning tasks, including but not limited to classification, generative modeling, feature extraction, dictionary learning, \glsdnn, \glscnn, and adversarial generation. \Glseca provides a route toward image generation by learning from a few coefficient generators, such as normal and uniform distributions. Generating images from known coefficients is analogous to preparing a dish by following an established recipe.

However, a method is needed that simultaneously learns the similarities, i.e., the common features possessed by the data across classes, and the differences, i.e., the features possessed by the data belonging to a specific class. The similarities are used to denoise or remove the background, such as telling the wheat from the chaff. The differences are used to 'tell the wheat from the rye'. \Glspca and \glslda pursue orthogonal bases whose subsets can be used as undercomplete dictionaries. However, the dictionary is not required to be orthogonal in generic \glsdictl. From feature extraction to linear classification and dictionary learning, a model is required to have good performance as well as high interpretability. The principle of superposition in a quantum linear system makes it more interpretable than classical linear and nonlinear systems.

The responses of a linear system to two or more stimuli are superposed. This allows inspecting the response to each individual stimulus and obtaining the overall response by superposition. The decomposition strategy can also be extended to approximate nonlinear systems, by dividing one composite signal into multiple basis signals for analysis. A well-chosen basis is important in decomposing such a signal. For example, the Fourier transform decomposes a signal over an infinite orthogonal basis composed of sine and cosine functions. The overall response is the superposition of the responses to the individual decomposed signals. In the analysis of a basis signal, such as the sinusoid

$f(t) = A\,\sin(\omega t + \varphi),$  (1)

it can be distinguished by its amplitude $A$, frequency $\omega$, and phase shift $\varphi$. From a local observer's point of view, only the amplitude and phase difference can be sensed. Hence, we define two types of separable components, the amplitude component and the phase component. We use 'amplitude' and 'phase' here to avoid confusion with the concepts of the space, time, and frequency domains in the Fourier transform. The amplitude difference is related to differences in amplitude, magnitude, and coordinates. The phase difference is related to phase, frequency, direction, and eigenvalue. There exist many classical algorithms that distinguish coordinate or amplitude differences. \Glsknn, \glskmc and mean shift classify or cluster data by a metric of 'distance' from a centroid [16, 31, 41]. \Glspca maximizes the coordinate gap of the projection onto one dimension of the vector space of a data set, whereas \glslda maximizes the coordinate gap among classes. \Glstsne is a nonlinear transformation method that presents the coordinate difference in a lower-dimensional feature space. \Glslor finds a hyperplane to tell the coordinate difference apart, and \glssvm maximizes the margins of a hyperplane based on the distances between coordinates.
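As a small side illustration of the amplitude and phase components (not part of the original derivation), the following Python sketch decomposes a sampled signal with the discrete Fourier transform and reads off the amplitude and phase of each basis sinusoid; the signal and its parameters are invented for the example.

```python
import numpy as np

# A signal superposing two sinusoids with different amplitudes and phases.
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
signal = 1.5 * np.sin(2 * np.pi * 3 * t + 0.4) + 0.5 * np.sin(2 * np.pi * 7 * t + 1.2)

spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
amplitude = 2 * np.abs(spectrum) / t.size   # amplitude component of each basis sinusoid
phase = np.angle(spectrum)                  # (spectral) phase component of each basis sinusoid

for k in np.argsort(amplitude)[-2:]:        # the two dominant components
    print(f"freq={freqs[k]:.0f} Hz, amplitude={amplitude[k]:.2f}, phase={phase[k]:.2f}")
```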

Phase differences exist widely among data sets. Each signal can be decomposed into several phases with varied probabilities. Meanwhile, a phase may be prevalent in some classes but rare in the rest, which suggests a differential probability of one eigenfeature belonging to each class. Here, we propose \glseca, a quantum theory-based algorithm that focuses on the phase differences. \Glseca identifies the phase differences of a multi-class data set. Benefiting from \glseca, tasks such as feature extraction, classification, clustering, and dictionary learning can all be performed with linear models. In classical machine learning, the class label follows a Bernoulli or categorical distribution. \Glseca challenges this with a more rational assumption, namely that the class labels follow independent Bernoulli distributions. First, a data vector is prepared as a quantum state on a register of qubits. Second, measurements of measurables with commutative operators are taken on this state in an arbitrary order and the results are recorded. Last, optimization is performed on a classical computer over the parameters of the operators, the probabilities in the measurements, and the ground truth of each prepared state.

2 Results

2.1 Background and eigen component analysis (ECA) mechanism

Figure 1: Two artificial intersected data sets. (a) The 2D data set: data in red belong to one class and data in blue to the other; (b) the 3D data set: data in red belong to one class, and data in blue and purple belong to the other.
Figure 2: Some 324 pure eigenfeatures (PEs) randomly chosen from all 328 PEs learnt on the MNIST data set by approximated eigen component analysis (AECA). The total number of eigenfeatures is 784.
Figure 3: These images are coarsely generated as weighted sums of the pure eigenfeatures (PEs) of the MNIST data set. For each class, the weight of a PE is the mean projection of all training samples on that PE.
Figure 4: Randomly chosen images from the MNIST data set shown on the basis learnt by vanilla eigen component analysis (VECA) and on the standard basis. The images on the right half are the input images used to train VECA. The images on the left are the corresponding results of a basis transformation with the learnt basis, i.e. the eigenfeatures, applied to the right half. The darkest and brightest pixels on the left half mark dominant eigenfeatures, which remain constant regardless of the digit features on the right.

First, we clarify the notation used in this paper. A quantum algorithm is intrinsically simple and intuitive, but also abstract. Its implementation can be simulated on a classical computer. To help understand the concepts and verify the simulation, we describe the simulation algorithm in the language of classical machine learning but, meanwhile, follow the conventional notation of quantum mechanics to stay consistent with the quantum algorithm, unless specified otherwise. For example, to avoid confusion caused by using 'observe' and its derivatives, we adopt their meanings as in quantum mechanics throughout this article. Likewise, the samples in the data set are termed sample, input, state, recording, or just vector. Furthermore, we use the conventional notation of real coordinate spaces when no complex numbers are involved.

Note that, in the quantum algorithm, the eigenvalues of the class-label measurable for a qubit or a composite system are defined to represent 'false' and 'true', i.e. whether an input or eigenfeature belongs to a class. In the classical simulation, the counterparts of these eigenvalues, indicating the class label of an eigenfeature or input vector, are defined correspondingly. Moreover, the term 'class label' can mean both the original class label given by the data set and the derived class label indicating whether an input belongs to one class. For example, the class label '3' of a sample in a 10-class data set derives ten class labels, which are '+1' for being a sample from class '3' and '-1' for the input belonging to a class other than '3'.

All the sets used in this paper are 0-indexed. Three indices run over the sample data set, the elements of an input vector, and the class labels, with their respective sizes. A data set consists of its samples together with a finite set of class labels, and the corresponding target values compose the target set. A dedicated notation refers to an initially non-normalized vector; the data set is then normalized. All the vectors in \glsveca are normalized, and the magnitude information is discarded unless noted otherwise. The normalized data set is the set of recordings of states and their measured values. We also denote an indicator function and a one-hot vector function; the one-hot vector has its indexed element equal to 1 and 0 elsewhere. In addition, we denote a one-hot matrix of stacked Bernoulli one-hot vectors as

(2)

where the operator takes the one's complement of each element. In the classical simulation, for discrete \glseca, i.e. when the observed values are discrete, the ket-vectors (or kets) are the same as the corresponding column vectors. A probability without a specified superscript means a vector representation of its \glspmf. The bold font indicates a vector of stacked probabilities of independent random variables. The outline font indicates a matrix of stacked \glspmf of independent Bernoulli random variables; an element of such a matrix is a Bernoulli random variable if no superscript is given. Thus, with a superscript it is a one-hot vector, yet without a superscript it is a vector of stacked independent random variables. In some situations, usually in general discussion, the superscript of a numerical value is omitted for simplicity when it can be inferred from the context. In the rest of the paper, subscripts and superscripts are omitted when there is no risk of ambiguity.

In quantum mechanics, a vector representation of an object or state is 'measurable' as long as we know the corresponding measurable. We can also predict the measurements once we know the mathematical expression of the measurable and its state. Likewise, we can abstract a real-world state or object, such as a spin or an image, as a vector, and construct its measurable, whether it is momentum or a class-label.

As addressed above, samples in a data set have two types of variance: one is amplitude-based and the other is phase-based. Unlike off-the-shelf algorithms such as \glssvm, \glslda or \glsknn, which have proven performance in telling apart amplitude differences, our \glseca focuses on identifying the phase-based differences of a data set.

In general, a linear classifier, or even some kernel-based classifiers, treats each element of a vector as one feature. However, these 'features' may not fully represent the real properties of the data. For a vector possessing more complicated structure, all the elements of the vector are needed to constitute a reliable feature. In other words, it is a choice whether to express the data on a well-suited basis (see Figure 1 (a)) or on the standard basis (see Figure 1 (b)). Therefore, for a vector

$\mathbf{x} = \sum_{i} x_i\,\mathbf{e}_i,$  (3)

each $x_i$ is viewed as a feature of this vector upon the standard basis $\{\mathbf{e}_i\}$. For some complicated structures (e.g. the edges of an object in an image), one single element of the standard-basis representation cannot tell the whole story. If we find an orthogonal basis (i.e. eigenfeatures, see Figure 2), every vector can be written as a unique linear combination of the vectors in this basis (see Figure 3). If $\{\mathbf{v}_i\}$ is such a basis, then

$\mathbf{x} = \sum_{i} a_i\,\mathbf{v}_i,$  (4)

in which the $a_i$ are normalized coefficients (see Figure 4), with

$\sum_{i} a_i^2 = 1.$  (5)

In quantum mechanics, the state $|\mathbf{x}\rangle$ would collapse onto the eigenstate $|\mathbf{v}_i\rangle$ with probability

$P(i) = |\langle \mathbf{v}_i | \mathbf{x}\rangle|^2 = a_i^2.$  (6)
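As a concrete numerical illustration of Equations 4-6 (with an arbitrary orthonormal basis chosen only for this example), the following sketch computes the normalized coefficients and the corresponding collapse probabilities.

```python
import numpy as np

# An example orthonormal basis in 2D: the columns of a rotation matrix act as eigenfeatures.
theta = np.pi / 6
V = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

x = np.array([3.0, 4.0])
x = x / np.linalg.norm(x)   # normalize the input so it can be treated as a state
a = V.T @ x                 # coefficients of x on the eigenfeatures (Equation 4)
p = a ** 2                  # collapse probabilities (Equation 6); they sum to 1 (Equation 5)
print(a, p, p.sum())
```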

Not only the input vector but also its class label has a quantum interpretation. Classical machine learning algorithms usually assume that the class labels of input or basis vectors follow a Bernoulli or categorical distribution, which is true for the eigenvectors in Figure 5 (a)-(f). For a 2-class data set (Figure 1 (a)), the probabilities of the predictions can be described by a probability matrix

(7)

in which $p_0$ and $p_1$ are the probabilities of an input vector belonging to class '0' and class '1', respectively. Under the classical assumption, the trace of the probability matrix $P$ equals 1, i.e.

$\operatorname{tr}(P) = p_0 + p_1 = 1.$  (8)
Figure 5: The distributions of the projections of all training samples on eigenfeatures selected from the eigenfeatures learnt on the MNIST data set using approximated eigen component analysis (AECA). The superscript denotes the degree of overlapping. (a)-(f) Frequency distributions of normalized projections on pure eigenfeatures (PEs) corresponding to the class labels '0', '1', '2', '5', '7', and '8' sequentially. The PEs used in (a)-(c) correspond to the least degenerate and those in (d)-(f) to the most degenerate. (g)-(i) Frequency distributions of randomly chosen eigenfeatures. (j)-(l) Frequency distributions of the three most overlapped eigenfeatures for the eigenfeatures in (g), (h), and (i).

If a new input vector unambiguously belongs to one class, then either $p_0$ or $p_1$ equals 1, reaching

(9)

However, this assumption can fail. The data points in the center of Figure 1 (a) could belong to either class. Meanwhile, many data points in the \gls2d space do not belong to either of the two classes. A data set usually occupies a compact region or spans a subspace of the full vector space. For a multi-class data set, the number of possible non-exclusive decisions in the full space grows exponentially with the number of classes, whereas the mutually exclusive decisions form only a small subset, one per class. If we assume that each input vector, as well as its basis vectors, has a considerable possibility of belonging to one or more classes (Figure 5 (g)-(l)), we can choose to inspect each class independently. If we prepare a new input vector as a quantum state, we assume an apparatus can be constructed to measure the class label indicating whether the input vector belongs to a to-be-decided class. For multiple classes, one such apparatus per class can be built to take measurements on identical copies of the state. Second, the operator acting on the state vector has eigenvalues whose observed outcomes take two possible values, 'false' and 'true', representing whether the state belongs to a specific class; all the candidate eigenvalues of each operator should therefore be arranged to degenerate into these two values. Taking all classes into account, we should assume that the corresponding operators of these measurables share a complete basis of simultaneous eigenvectors. Hence, for a multi-class data set, the task of degenerating the data set into two distinctive states per class is converted into measurements on independent systems realized by a register of qubits. In these measurements, for each measurable, the measurement on the whole system can be the product of the observed values taken on each qubit; more qubits can be used for more concurrency. As these operators commute, they share a complete basis of simultaneous eigenvectors, but each has its own eigenvalues.

Therefore, in the classical simulation, instead of learning which class a vector belongs to, we learn which classes each eigenfeature of a vector belongs to. Furthermore, the decision-making is conducted independently for each class. We then have a mapping table between eigenfeatures and class labels. This leads us to learning an \glsefm, representing the unitary operator with a complete basis of simultaneous eigenvectors, and an \glsecmm. The \glsecmm bridges the superposition of the probabilities of the class labels of each eigenfeature and the class label of the input vector. All we need is to sum up, independently for each class, the probabilities of the eigenfeatures assigned to that class. Afterward, we obtain the combined probabilities of a vector belonging to each class. The mutually exclusive probability for class 'c' can then be calculated from these independent probabilities.

The unitarity of the \glsefm guarantees that the differences are kept under the change of basis. For the \glsefm, the variance of the projections on the eigenfeatures is maximized (see Figure 4). In the left half of Figure 4, the bright and dark pixels indicate significant signals, whereas the gray pixels are trivial, enabling dimension reduction. Also, the stable positioning of these bright and dark pixels among different inputs suggests that the transformation is appropriate as a classifier.
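The following sketch summarizes this prediction mechanism in the classical simulation: project the normalized input on the \glsefm, square the projections to obtain collapse probabilities, and sum them per class through the \glsecmm. The matrices are placeholders supplied by the caller; only the flow of operations is taken from the description above.

```python
import numpy as np

def eca_predict(x, efm, ecmm):
    """x: (D,) input vector; efm: (D, D) unitary EFM with eigenfeatures as columns;
    ecmm: (D, C) binary eigenfeature-class mapping matrix."""
    x = x / np.linalg.norm(x)      # treat the input as a normalized state
    probs = (efm.T @ x) ** 2       # probabilities of collapsing on each eigenfeature
    class_probs = ecmm.T @ probs   # independent probability that x belongs to each class
    return class_probs             # per-class decisions; the entries need not sum to 1
```

Each entry of `class_probs` is an independent Bernoulli probability; as proven below, the largest entry can be used directly for prediction without computing the mutually exclusive probabilities.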

For prediction, we prove that an independent decision can be made without calculating the mutually exclusive probabilities for a data set with two classes. The proof can easily be extended to data sets with multiple classes by mathematical induction. For a given input, the two mutually exclusive probabilities of the class labels are

and

Then the proof is given as

Proof.

Without loss of generality, suppose , such that

Further, we can build a multi-fold \glsecan by concatenating several \glseca models. The \glsecan gives \glseca the capability to integrate nonlinear models such as \glspldnn. A dimension operator that carries the nonlinearity, which can be specially designed or a classical \glsdnn, is installed between consecutive \glseca models.
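A minimal sketch of a 2-fold \glsecan forward pass, assuming each fold is the linear ECA operation sketched above and the dimension operator in between is an arbitrary nonlinear map (here an illustrative ReLU layer; the actual \glsrado/\glsredo designs are given by the equations referenced in the experiments).

```python
import numpy as np

def ecan_forward(x, efm1, dim_op_weight, efm2, ecmm):
    """Two concatenated ECA folds with a nonlinear dimension operator installed in between."""
    h = (efm1.T @ (x / np.linalg.norm(x))) ** 2   # first ECA fold: collapse probabilities
    h = np.maximum(dim_op_weight @ h, 0.0)        # dimension operator (illustrative ReLU layer)
    h = h / (np.linalg.norm(h) + 1e-12)           # re-normalize before the next fold
    probs = (efm2.T @ h) ** 2                     # second ECA fold
    return ecmm.T @ probs                         # per-class probabilities from the last fold
```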

2.2 Related work

The related work includes several classical algorithms, such as \glsica and \glsdictl, as well as \glsqml algorithms.

Independent component analysis (ICA)

\glsica shares a similar goal with \glseca. Both algorithms try to find independent components that could generate the data from some independent sources. \Glsica decouples a signal mixed by multiple recorders depending on the varied combinations of the source signals. The recorders can be regarded as another kind of label because they record the intensity of the sources variably. Thus, \glsica could be replaced by \glseca, as \glseca depends on the most intensive source. The major advantage of \glseca over \glsica is that no prior distributional assumption is necessary.

Dictionary learning (DictL)

\Glsdictl is similar to \glseca because both aim to find a sparse representation of the data set. The supervised dictionary learning method presented in [24] looks like a second cousin of \glseca: a discriminative task is added to the objective while the reconstruction term is retained. Likewise, in \glsveca, our objective is also to identify the independent eigenfeatures, based on which the data classification is conducted. In comparison, \glseca is easier to train because it comprises fewer hyperparameters. Moreover, \glseca is less prone to information loss because it takes all classes and a complete basis into account and preserves the differences.

Quantum machine learning (QML)

\Glsqml covers a wide range of machine learning algorithms, including machine learning based on quantum computation or inspired and facilitated by quantum mechanics. The method presented in [38] assumes a prior for the input feature states and obtains a number of template classes. However, the identities of these template states are ignored: their method needs to find template states that are all linear combinations of some pure quantum states. In contrast, in our \glseca, all the inputs are linear combinations of pure quantum states, whose identities are utilized further.

2.3 Preliminary performance test of eigen component analysis (ECA)

Before moving on to the algorithm, we prepared an example that illustrates two ideal cases with two artificial data sets (see Figure 1). To spark some intuition, this informal discussion is based upon guesswork and intuition.

The two data sets are shown in Figure 1. The data of the classes intersect with each other. For the \gls2d and \gls3d data sets, we could guess the \glsefm (which is a unitary operator) of a linear separator, with the eigenfeatures as column vectors, and the \glsecmm, which are

and

respectively, in which the symbol '≈' indicates a numerical estimate and the second form is an equivalent sparse representation of the \glsecmm. For the \gls3d data set, each column of the \glsefm represents a pivot axis or principal component of the data set. A zero entry means the 0th eigenfeature of this data set does not belong to class '0', and a one indicates the 1st eigenfeature belongs to class '0'. The two 1s in the third row of the \glsecmm represent that the 2nd eigenfeature could be noise or background shared between the two classes. For a new input, we only need to sum up the probabilities that the input projects on the 1st and 2nd eigenfeatures to decide the probability that this input belongs to class '0'. One should notice that the decision on the class label of an input is not mutually exclusive, because the summations for the two classes could be equal and the sum of the two probabilities could surpass 1, as they are independent decisions.

We now give a more concrete development of this informal discussion. For a vector in the aforementioned 3D data set,

$\mathbf{x} = N(\tilde{\mathbf{x}}) = \tilde{\mathbf{x}} / \|\tilde{\mathbf{x}}\|_2,$  (10)

in which $N(\cdot)$ is the normalization operator. Hence we denote the elements of

(11)

as and we have

(12)

For the aforementioned \gls3d data set, instead of treating the class label of a vector as a single categorical distribution, the \glspmf of each class label given the input is assumed to follow an independent Bernoulli distribution, such that the probability that one vector belongs to class '0' is

(13)

in which the first symbol collects all the unknown parameters, $\odot$ is the element-wise Hadamard product operator, ':' is a placeholder for taking an entire column (as the first index) or row (as the second index), and the final subscript denotes the 0th column of the \glsecmm. Thus, the combined probability of these two Bernoulli random variables for the observed event could be defined as

(14)

and the complement probabilities could be

(15)

in which the outline-font symbols are the vectors or matrices with the corresponding digits.

For simplicity, we can write these two \glspmf of the Bernoulli random variables together. With the feature matrix and the mapping matrix, the stacked or combined \glspmf of the combined Bernoulli random variables given the input can be written as

(16)

where the rows are the vectors of corresponding \glspmf.

To obtain the mutually exclusive decision on the class label, the unambiguous probability of belonging to class '0' can be calculated as

(17)
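To make the informal 3D example concrete, here is a small numerical sketch. The \glsecmm below encodes the mapping described earlier (the 0th eigenfeature does not belong to class '0', the 1st does, and the 2nd is shared background); the remaining entries of the class-'1' column and the identity \glsefm are placeholders, since the learnt matrices are not reproduced in the text.

```python
import numpy as np

ecmm = np.array([[0.0, 1.0],    # 0th eigenfeature: not class '0' (class-'1' entry assumed)
                 [1.0, 0.0],    # 1st eigenfeature: class '0' (class-'1' entry assumed)
                 [1.0, 1.0]])   # 2nd eigenfeature: shared noise/background of both classes
efm = np.eye(3)                 # placeholder EFM; the learnt one is not reproduced here

x = np.array([0.2, 0.9, 0.4])
x = x / np.linalg.norm(x)
probs = (efm.T @ x) ** 2        # collapse probabilities on the three eigenfeatures
indep = ecmm.T @ probs          # independent probabilities for class '0' and class '1'
print(indep)                    # the two entries are independent and may sum past 1
```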

2.4 Experiment results

We compared our model with \glslor, \glslda, \glsqda, \glssvm, and \glsksvm with an \glsrbf kernel.

Counting the parameters

For a data set with $D$ features and $C$ classes, the total number of \glseca parameters can be calculated as:

  • $D \times D$ (the \glsefm) plus $D \times C$ (the \glsecmm), i.e. $D^2 + DC$ (a small check against the reported counts is sketched below).

We don’t count the parameters of \glsksvm in these experiments.

Two artificial data sets (2D and 3D)

  • Metrics (see Table 1):

    Name   Accuracy   Parameters
    LoR    0.5242     3
    LDA    0.5239     5
    QDA    0.8124     11
    SVM    0.6048     3
    KSVM   0.8063     —
    ECA    0.8139     8
    Table 1: Comparison with other classifiers on the 2D data set (confusion matrices omitted).
  • One class of the \gls2d data set (Figure 1) is randomly generated from a normal distribution with a given mean and covariance matrix; the other class is generated with a different mean and covariance matrix.

    The \glsefm and \glsecmm obtained by \glseca are

    respectively, with which the model obtains an accuracy of 0.8139 on the validation data, on par with \glsqda (0.8124) and outperforming the rest (Table 1).

  • One class of the \gls3d data set (Figure 1 (b)) is randomly generated from a normal distribution with a given mean and covariance matrix; the other class is a mixture generated from two normal distributions with their own means and covariance matrices.

    The \glsefm and \glsecmm obtained by \glseca are

    with which the model obtains an accuracy of 0.9424 on the validation data. It outperforms the other linear models included in the table. Meanwhile, we would obtain an equivalent form of the \glsecmm if we used \glsaeca with Equation 55, that is

    to which the result is rounded. In the rest of the paper we round results without further mention.

    Metrics (see Table 2):

    Name   Accuracy   Total Parameters
    LoR    0.6671     4
    LDA    0.6667     7
    QDA    0.9368     19
    SVM    0.6684     4
    KSVM   0.4682     —
    ECA    0.9424     15
    Table 2: Comparison with other classifiers on the 3D data set (confusion matrices omitted).

MNIST data set (using approximated eigen component analysis (AECA), vanilla eigen component analysis (VECA) and eigen component analysis network (ECAN))

Figure 6: Confusion matrix of the approximated eigen component analysis (AECA) model on the MNIST data set (the accuracies of ECA, LDA, and QDA are 0.918, 0.873, and 0.144, respectively).
Figure 7: Degeneracy of all distinctive eigenvalues learnt on the MNIST data set using approximated eigen component analysis (AECA). The degeneracy of the corresponding eigenvalue is zero, which means there are no eigenfeatures that assume the similarities shared by the whole data set. The eigenvalue with the largest degeneracy corresponds entirely to pure eigenfeatures (PEs) mapped to the class label '8'. The largest eigenvalue and its degeneracy are also indicated.
Figure 8: (a) Degeneracy of pure eigenfeatures (PEs) of MNIST data set with approximated eigen component analysis (AECA); (b) Crowdedness of classes on MNIST data set with AECA.
Figure 9: Overlapping of classes on eigenfeatures of MNIST data set with approximated eigen component analysis (AECA).

This experiment exhibits the dimension reduction capability of \glseca. Meanwhile, it is a good illustration of the extensibility of \glseca.

  • \glsaeca

    With no more than 12 epochs of training, we obtain an accuracy of 0.918 on the MNIST data set, which outperforms \glslda (0.873). \Glsqda collapsed on this data set. The corresponding confusion matrices of \glseca, \glslda, and \glsqda are listed together with their accuracies (Figure 6).

    Part of the learnt eigenfeatures are displayed in Figure 2. The overlapped eigenfeatures (mapped to two or more classes) could be separated by an amplitude-based separator or by raising the dimension. The crowdedness of the eigenfeatures (Figure 8) shows that the digit '1' needs the fewest eigenfeatures to express itself and the digit '8' needs the most. From the overlapping histogram of classes on eigenfeatures (Figure 9), we find that more than 300 eigenfeatures are mapped to a single class.

    Parts of the obtained \glsefm and \glsecmm are

    and

    such that

  • \glsveca

    The \glsecmm learnt by \glsveca is extremely sparse, and thus our eigenfeatures are rather abstract. Using \glsveca, we achieved a validation accuracy lower than that of \glsaeca; the reason is that \glsveca is less tolerant of weak mappings between eigenfeatures and their class labels. \Glsveca intends to learn each element of the \glsecmm as unambiguously as possible. Nevertheless, the most intriguing part is that we learnt only 110 \glsplpe (LABEL:fig:mnist_ber_overlapping, LABEL:fig:mnist_ber_nonoverlap, LABEL:fig:mnist_ber_degeneracy_and_crowdedness and LABEL:fig:mnist_ber_proj_freq_dist_ef), attributable to this unambiguity. In LABEL:fig:mnist_ber_proj_freq_dist_ef (i), we find that class '7' and class '9' are both distant from the distributions of the other classes. However, this \glspe is unambiguously assigned to class '9'. This phenomenon indicates that \glsveca might be less tolerant of weak mappings between eigenfeatures and their class labels, which is consistent with our objective in developing these two models.

  • \glsecan

    We implemented several 2-fold \glsplecan in this experiment. All these experiments are trained for 12 epochs. An identity dimension operator is implemented if a \glsrado or \glsredo is not mentioned.

    Since the major goal of this demonstration is extensibility, there remains a margin of accuracy attainable by parameter tuning and a better-suited dimension operator. As limited by finding the orthogonal dictionary and by the linear operations, the prediction accuracy marginally underperforms that of standard \glspldnn. Also, in the classical simulation, the training time is at least double that of standard \glspldnn because of the extra linear operations, which would not be a problem on a quantum computer.

    The reported validation accuracy of the second fold was achieved when we implemented a specially designed \glsredo (see Equation 59) that reduced the dimension in the first fold of the \glsecan.

    Moreover, a non-quadratic \glsrado (see Equation 60) and \glsredo (see Equation 61) have been implemented as neural networks with the \glsrelu activation function, which are on par with the quadratic operators. The accuracies of the two folds are reported for the three subexperiments with the \glsrado only, the \glsredo only, and both operators.

    Instead of a \glsrado or \glsredo, we also implemented fully connected neural networks as the dimension operator; the accuracies of the two folds are reported for a fully connected network placed at the position of the \glsrado, of the \glsredo, and at both places in the first fold.

Two breast cancer data sets

This experiment used two data sets that illustrate the high interpretability of \glseca. We analyze two eigenfeatures of the first data set to explain the meaning of what has been obtained.

In these two experiments, we used two data sets downloaded from the UCI machine learning repository, originally obtained from the University of Wisconsin Hospitals, Madison, by Dr. William H. Wolberg. One data set was published in 1992 (abbreviated as Wis1992) and the other in 1995 (Wis1995). \Glseca achieved validation accuracies of 0.9004 and 0.9414, respectively, while all the other aforementioned classifiers failed on these two data sets.

  • With \glsveca, we achieved an accuracy of 0.9004 (Table 3). The eigenvalues of the learnt operator and their corresponding degeneracies are listed below.

    Eigenvalue   Binary eigenvalue   Class label of \glspe   Degeneracy
    0            00                  —                       5
    1            01                  '0'                     2
    2            10                  '1'                     2
    3            11                  —                       0
    Metrics (see Table 3):

    Name   Accuracy   Total Parameters
    LoR    0.3420     10
    LDA    0.3420     19
    QDA    0.3420     109
    SVM    0.3420     10
    KSVM   0.3420     —
    ECA    0.9004     99
    Table 3: Comparison with other classifiers on the Wis1992 data set (confusion matrices omitted).

    In Wis1992, the 9 original features are ’Clump Thickness’, ’Uniformity of Cell Size’, ’Uniformity of Cell Shape’, ’Marginal Adhesion’, ’Single Epithelial Cell Size’, ’Bare Nuclei’, ’Bland Chromatin’, ’Normal Nucleoli’ and ’Mitoses’. For this data set, the \glsecmm we obtained is

    First, we choose the 0th eigenfeature from the obtained \glsefm. With the \glsecmm, we know this eigenfeature is a \glspe mapping to class '0' (i.e. 'benign' tumor). This eigenfeature and its squared value are

    What is the meaning of the eigenfeature and its squared value? To analyze a new input vector, we use the learnt \glsefm; to analyze the eigenfeatures of the \glsefm themselves, we should use a special \glsefm, the identity matrix. First of all, this \glspe is a paradigm, or a textbook solution, of a 'benign' tumor indicator. The values of its squared elements represent the relative intensity placed on each original feature (a small numerical sketch of this squared-component reading is given after Table 4 below). In more detail, high 'Bland Chromatin', and relatively high 'Clump Thickness' and 'Uniformity of Cell Size', with low 'Uniformity of Cell Shape', 'Bare Nuclei', and 'Mitoses', tend to be symptoms of a 'benign' tumor. This 'benign' \glspe indicates how each original feature is taken into account when the decision of the tumor being 'benign' is made. Considering the special identity \glsefm, the 6th value of the squared eigenfeature means one should take 'Bland Chromatin' into account together with 'Uniformity of Cell Shape' and 'Bare Nuclei'.

    Next, we inspect the 1st eigenfeature, which is a 'malignant' \glspe. This eigenfeature and its squared value are

    No doubt 'Mitoses' is the factor to consider the most when deciding whether a tumor is 'malignant', together with 'Clump Thickness'. If a patient has less 'Mitoses' and 'Uniformity of Cell Size' and relatively high 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', and 'Bland Chromatin', a 'malignant' diagnosis might be on the way.

  • With \glsveca, we achieved an accuracy of 0.9414 (Table 4). The eigenvalues of the learnt operator and their corresponding degeneracies are listed below.

    Eigenvalue   Binary eigenvalue   Class label of \glspe   Degeneracy
    0            00                  —                       4
    1            01                  '0'                     17
    2            10                  '1'                     9
    3            11                  —                       0

    Metrics (see Table 4):

    Name   Accuracy   Total Parameters
    LoR    0.4043     31
    LDA    0.4043     61
    QDA    0.3032     991
    SVM    0.4043     31
    KSVM   0.5957     —
    ECA    0.9414     960
    Table 4: Comparison with other classifiers on the Wis1995 data set (confusion matrices omitted).
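As referenced above, here is a minimal sketch of reading a learnt \glspe as feature importances: squaring its components gives the relative weight placed on each original Wis1992 feature. The eigenfeature vector is hypothetical, since the learnt values are not reproduced in the text.

```python
import numpy as np

features = ["Clump Thickness", "Uniformity of Cell Size", "Uniformity of Cell Shape",
            "Marginal Adhesion", "Single Epithelial Cell Size", "Bare Nuclei",
            "Bland Chromatin", "Normal Nucleoli", "Mitoses"]

pe = np.random.randn(9)          # hypothetical pure eigenfeature (stand-in for a learnt one)
pe = pe / np.linalg.norm(pe)
weights = pe ** 2                # squared components: relative intensity on each original feature
for name, w in sorted(zip(features, weights), key=lambda t: -t[1]):
    print(f"{name:28s} {w:.3f}")
```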

3 Discussions

We proposed a new quantum machine learning algorithm that can be simulated on a classical computer. We used \glsveca for data classification, where it outperforms \glslor, \glslda, \glsqda, \glssvm and \glsksvm (with an \glsrbf kernel). One drawback of \glsveca is that it ignores the amplitude difference and focuses on the phase difference. The magnitude information is discarded in classification, though, in practice, we found that this magnitude information did not substantially influence the model. One solution to recover the lost information is to wrap the magnitude into the original vector. Other solutions include raising the dimension before normalization or adopting a parallel \glsfnn on the magnitude (see more about this in LABEL:app:ext_eca). With such extensions, our algorithm could also work with the amplitude difference along each eigenfeature. Thus, a combination method based upon these two components is expected to yield a more robust linear classifier. \Glsecan can further improve the performance by integrating nonlinear models such as deep networks. This method could also be used in text classification and sentiment analysis, as text usually has intricate linearity.

The advantages of \glseca can be found in several aspects. First, as a classifier, any hyperplane separating the data can be realized by \glseca with no more than one auxiliary dimension, i.e. an extra dimension of unit or constant length (details omitted due to space limitations). In addition, for classification, \glseca can process more than two classes simultaneously. Unlike \glspca or \glslda, \glseca does not need a pre-specified number of dimensions for the lower-dimensional feature space; the concrete number can be read off from the \glsecmm, neither more nor less. Not only can \glseca work as a good classifier, it can also obtain a good dictionary.

Moreover, as this method is inspired by quantum mechanics, we introduce the concept of degeneracy from quantum mechanics as redundancy in machine learning problems. With degeneracy, we can learn an undercomplete or overcomplete dictionary. With \glseca, a complete dictionary can also be nontrivial: the final dictionary (composed of \glsplpe) will be a subset of all eigenfeatures selected according to the obtained \glsecmm. Besides, the introduced redundancy can be used not only in our method but also anywhere a machine learning problem is tackled independently, to avoid overfitting on linearity. In conclusion, \glseca is an algorithm that deeply exploits the divide-and-conquer strategy.

4 Methods

We first develop a classical approximation of the algorithm that can be implemented on a classical computer. Afterward, a quantum algorithm can be presented for implementation on a quantum computer. We begin our development from the quantum intuition and fade out to the classical simulation, followed by the full quantum algorithm.

For an observable, we can develop a 'machine' or apparatus to measure it. In quantum mechanics, such an observable or measurable is represented by a linear operator $\hat{L}$. The eigenvalues of the operator are the possible results of a measurement, and the corresponding eigenvectors represent unambiguously distinguishable states. When we measure the observable on a state $|\psi\rangle$, the probability of observing the eigenvalue $\lambda_i$ (with eigenvector $|v_i\rangle$) is given by

$p(\lambda_i \mid \psi) = |\langle v_i | \psi\rangle|^2.$  (18)

Now we define a new observable or measurable, the class-label, and its corresponding linear operator. In addition to all the principles and assumptions of quantum mechanics, two assumptions need to hold in the development of \glsveca:

  • Assumption 1: Any system with a measurable class-label is a quantum system.

  • Assumption 2:

    • On a quantum computer, a commutative measurable can be built for each class label;

    • On a classical computer, each class label of each eigenfeature follows an independent Bernoulli distribution.

If we view our vector representation of a to-be-classified object as a quantum state, the measurement could be expressed as

$\hat{L}\,|v_i\rangle = \lambda_i\,|v_i\rangle,$  (19)

or

$\hat{L} = \sum_i \lambda_i\,|v_i\rangle\langle v_i|,$  (20)

in which $\lambda_i$ is an eigenvalue and $|v_i\rangle$ the corresponding eigenvector of $\hat{L}$. We conduct a series of measurements on states to obtain their corresponding observed values, and thereby acquire a data set of measurement results on states. In an ideal situation with totally unambiguous states, the measurement of the class-label would directly observe the class label (which is an integer). Hence the class label is the corresponding eigenvalue of $\hat{L}$. That is

$\hat{L}\,|x^{(i)}\rangle = y^{(i)}\,|x^{(i)}\rangle.$  (21)

For a data set whose design matrix $X$ (with the states as columns) is full rank and whose states are unambiguous, we can obtain an analytic solution of Equation 21, such that

$\hat{L} = X\,\operatorname{diag}(\mathbf{y})\,X^{-1},$  (22)

in which $\mathbf{y}$ is the vector of all the class labels and $\operatorname{diag}(\mathbf{y})$ scales the columns of $X$ element-wise. Given a new state $|x\rangle$, the prediction can be obtained as the rounded expectation value of $\hat{L}$ in the state $|x\rangle$. This analytic solution was implemented and outperformed several of the classical algorithms mentioned in the experiments section. However, one eigenstate (i.e. eigenfeature) can be overlapped by several classes, and states in one class can project on several eigenfeatures. What we can predict, in a measurement of $\hat{L}$ on a state $|x\rangle$, is the expectation of the observable. With Equation 20, we have

$\langle \hat{L} \rangle = \langle x|\hat{L}|x\rangle = \sum_i \lambda_i\,|\langle v_i|x\rangle|^2.$  (23)
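Before turning to the degenerate case, a toy numerical sketch of Equations 21-23 for unambiguous states: the states are stacked as the columns of a full-rank design matrix, the operator is built analytically, and a prediction is the rounded expectation value. The numbers are invented for illustration.

```python
import numpy as np

# Unambiguous toy states as the columns of a full-rank design matrix, with integer class labels.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
X = X / np.linalg.norm(X, axis=0)              # normalize each state (column)
y = np.array([0.0, 1.0, 2.0])                  # class labels acting as eigenvalues

L = X @ np.diag(y) @ np.linalg.inv(X)          # analytic class-label operator (Equation 22)

x_new = X[:, 1]                                # a state known to carry label 1
prediction = round(float(x_new @ L @ x_new))   # rounded expectation value (Equation 23)
print(prediction)                              # -> 1
```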

These eigenvalues are crushed together in the expectation, which gives us no information about the \glspmf describing how the input state would collapse onto the eigenstates, and hence no information about which classes it belongs to. Thus, for a multi-class data set, we need to define commutative operators

$[\hat{L}_c, \hat{L}_{c'}] = 0 \quad \text{for all } c,\, c'.$  (24)

To identify the class label, we first need to know the \glspmf. For the matrix form of the operator (we assume a matrix here; the non-matrix case is discussed in the appendix), any Hermitian matrix can be diagonalized, such that

$\hat{L}_c = V\,\Lambda_c\,V^{\dagger},$  (25)

in which $\Lambda_c$ has the eigenvalues of $\hat{L}_c$ on its diagonal and the unitary operator $V$ has the simultaneous eigenvectors of all the operators as its columns. Thus, for all these operators, we want to find a complete basis of simultaneous eigenvectors and the eigenvalues of each measurable.

Then, with Equation 18, the \glspmf of collapsing onto each eigenstate $|v_i\rangle$ given a state $|x\rangle$ is

$p(v_i \mid x) = |\langle v_i | x\rangle|^2 = \big|(V^{\dagger} x)_i\big|^2.$  (26)

To identify the unambiguous relationship between eigenfeatures and class labels, we assume that the Bernoulli random variables indicating whether an eigenfeature or vector belongs to a class follow independent Bernoulli distributions (a development based upon a categorical distribution assumption is attached in the appendix). Thus, the \glspmf of the classes to which an eigenfeature belongs can be described as

(27)

Hence, by the principle of superposition, the \glspmf of the decision whether one state belongs to one class would be

(28)

For all classes, the matrix composed of the stacked or combined \glspmf of the combined Bernoulli random variables given the state can be denoted as

(29)

in which the bold font indicates a matrix and the subscript indicates the size of the vector or matrix.

Furthermore, the mapping between eigenfeatures and class labels follows the rule of winner-take-all, i.e. the probability is rounded to 0 or 1, such that

(30)

and

(31)

in which the operator denotes the rounding operation.

Then we put the rounded distributions together to form a matrix. By substituting Equation 26 into Equation 29, the combined probabilities given the state can be written as

(32)

in which the first symbol denotes all the unknown parameters and the second is an all-ones matrix. When one eigenfeature belongs to only one class, the corresponding row of the \glsecmm is a bitwise representation of the binary digits of the class label. For an eigenfeature overlapped by several classes, we define its eigenvalue as the binarized number of the corresponding reversed row of the \glsecmm. Hence, we define

(33)

in which the subscript means reversely binarizing the row vectors of the \glsecmm and $\operatorname{diag}(\cdot)$ denotes the operation that places a vector on the diagonal of a matrix, with zeros elsewhere. With either form, in the classical simulation, the commutative operators can be combined into a single operator such that

(34)

In the examples of the \gls2d and \gls3d data sets (see Section 2.3 and Figure 1), the corresponding separators have the eigenvalues listed there, where a subscript indicates that the number, or the numbers in a set, are binary. Correspondingly, we also convert each class label into the reversed bitwise view of its one-hot vector expression. The conversion is depicted as

(35)

with which we can easily determine which classes an eigenfeature belongs to. Eigenfeatures that belong to only one class are called \glsplpe. In our terminology, an eigenfeature carries a superscript denoting its degree of overlapping, and a \glspe has degree 1. For a \glspe corresponding to eigenvalue $\lambda$, its class label can be calculated trivially as

$c = \log_2 \lambda.$  (36)
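A small sketch of this bookkeeping, under our reading of the reversed-binarization convention (class '0' ends up in the least significant bit), which reproduces the binary-eigenvalue tables of the breast-cancer experiments in Section 2.4: a \glspe has a single 1 in its \glsecmm row, its eigenvalue is therefore a power of two, and the class label is recovered as a base-2 logarithm.

```python
import numpy as np

def row_to_eigenvalue(ecmm_row):
    """Binarize an ECMM row with the class-'0' entry as the least significant bit."""
    return int(sum(int(b) << c for c, b in enumerate(ecmm_row)))

def pe_class_label(eigenvalue):
    """For a pure eigenfeature the eigenvalue is a single power of two, so the label is log2."""
    return int(np.log2(eigenvalue))

print(row_to_eigenvalue([1, 0]))   # PE of class '0' -> eigenvalue 1 (binary 01)
print(row_to_eigenvalue([0, 1]))   # PE of class '1' -> eigenvalue 2 (binary 10)
print(pe_class_label(2))           # -> class '1'
```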

Therefore, to predict measurements of the measurable class-label, instead of learning the operator directly (i.e. building neural networks that simulate a function representing it), we can learn the \glsefm and the eigenvalues (i.e. the \glsecmm) to construct the operator in the classical simulation. In the quantum algorithm, a subtle difference is that the operator could be learned directly on a quantum computer.

Given a data set, with Equation 30 we can denote the combined probabilities of the observations given the states as

(37)

and then with Equation 29 we have

(38)

As we assume that the decisions on each class label of each eigenfeature are independent, the probability of a measurement given the state is

(39)

Then the log-likelihood function is

(40)

To learn the \glsefm and the \glsecmm, our objective could be

(41)

in which the constraint is a shorthand for the constraints on the \glsefm and the \glsecmm. Actually, the optimization of this objective is NP-hard.

By substituting Equation 39 into Equation 41 and then expanding and regrouping, the objective can be simplified as

(42)

in which the overline denotes the one's complement of each element, $\operatorname{tr}(\cdot)$ is the trace of the corresponding matrix, and the remaining symbols are shorthand for the quantities defined above.

Hence our objective becomes

(43)

With Equation 18, we have

(44)

The rounding operation is not differentiable. Thus, we replace the binary \glsecmm with a relaxed version given by a sigmoid function of the parameters (Figure 10 (a)):

(45)

in which the unambiguity factor makes the probability concentrate more on 0 or 1 and $\sigma$ is the sigmoid function. The higher this factor, the more the probability concentrates on 0 or 1, and the less risky the subsequent rounding becomes. As we want the \glsecmm to be a binary matrix, to make this constraint neater, we add an auxiliary sinusoid function (Figure 10 (d)) such that

(46)
Figure 10: Several hypothesis functions for the eigenfeature-class mapping matrix (ECMM). (a) ReLU function; (b) sigmoid function; (c) sigmoid function that is more concentrated on 0 or 1; (d) sigmoid function on a sinusoid, which periodically concentrates on 0 or 1.

The sigmoid function outputs 0 or 1 as its input approaches negative or positive infinity, respectively. Nevertheless, the elements can never exactly reach 0 or 1 theoretically. In practice, with a relatively large unambiguity factor, the relaxation works as a good approximation.

Thus, we replace the binary constraint on the \glsecmm with these relaxed forms. Our objective becomes

(47)

Then we obtain our objective function using the Frobenius norm, which is

(48)

Since most samples we encounter are in a real coordinate space, for simplicity and without loss of generality, we assume all the vectors project only onto the real space (i.e. the imaginary part always equals 0). The real version of our objective is

(49)
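A minimal classical-simulation training sketch for this objective, written with PyTorch autograd. Several choices here are ours for illustration and are not fixed by the text: the \glsefm is kept orthogonal through a matrix-exponential parameterization, the relaxed \glsecmm uses the sigmoid-on-sinusoid hypothesis of Figure 10 (d), and the negative Bernoulli log-likelihood is taken as the loss.

```python
import torch

def train_veca(X, Y, num_classes, kappa=10.0, epochs=200, lr=0.05):
    """X: (N, D) row-normalized samples; Y: (N, C) independent Bernoulli labels in {0, 1}."""
    N, D = X.shape
    A = torch.zeros(D, D, requires_grad=True)            # generator of the orthogonal EFM
    W = torch.zeros(D, num_classes, requires_grad=True)  # pre-activation of the relaxed ECMM
    opt = torch.optim.Adam([A, W], lr=lr)
    for _ in range(epochs):
        V = torch.matrix_exp(A - A.T)                 # orthogonal EFM (real-valued analogue of unitary)
        M = torch.sigmoid(kappa * torch.sin(W))       # relaxed, nearly binary ECMM
        probs = (X @ V) ** 2                          # collapse probabilities on eigenfeatures
        class_p = (probs @ M).clamp(1e-6, 1 - 1e-6)   # superposed per-class probabilities
        loss = torch.nn.functional.binary_cross_entropy(class_p, Y)  # negative Bernoulli log-likelihood
        opt.zero_grad(); loss.backward(); opt.step()
    return V.detach(), M.detach().round()             # EFM and rounded (binary) ECMM
```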

4.1 Approximation of eigen component analysis (ECA)

Furthermore, for a relatively large data set, the combined probabilities of a combined random vector of Bernoulli random variables can approximate the \glspmf, such that these combined probabilities

(50)

could be used to estimate \glspmf of given

(51)

The log-likelihood function becomes

(52)

Then the objective of \glsaeca could be written as