Machine Learning by Two-Dimensional Hierarchical Tensor Networks: A Quantum Information Theoretic Perspective on Deep Architectures

Ding Liu
Department of Computer Science and Technology, School of Computer Science & Software Engineering, Tianjin Polytechnic University, Tianjin 300387, China
ICFO-Institut de Ciencies Fotoniques, The Barcelona Institute of Science and Technology, 08860 Castelldefels (Barcelona), Spain

Shi-Ju Ran (Email: shi-ju.ran@icfo.eu)
ICFO-Institut de Ciencies Fotoniques, The Barcelona Institute of Science and Technology, 08860 Castelldefels (Barcelona), Spain

Peter Wittek (Email: peter.wittek@icfo.eu)
ICFO-Institut de Ciencies Fotoniques, The Barcelona Institute of Science and Technology, 08860 Castelldefels (Barcelona), Spain

Cheng Peng
Theoretical Condensed Matter Physics and Computational Materials Physics Laboratory, School of Physical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China

Raul Blázquez García
ICFO-Institut de Ciencies Fotoniques, The Barcelona Institute of Science and Technology, 08860 Castelldefels (Barcelona), Spain

Gang Su
Theoretical Condensed Matter Physics and Computational Materials Physics Laboratory, School of Physical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
Kavli Institute for Theoretical Sciences, University of Chinese Academy of Sciences, Beijing 100190, China

Maciej Lewenstein
ICFO-Institut de Ciencies Fotoniques, The Barcelona Institute of Science and Technology, 08860 Castelldefels (Barcelona), Spain
ICREA, Passeig Lluis Companys 23, 08010 Barcelona, Spain

Abstract

The resemblance between the methods used in studying quantum many-body physics and in machine learning has drawn considerable attention. In particular, tensor networks (TNs) and deep learning architectures bear striking similarities, to the extent that TNs can be used for machine learning. Previous results used one-dimensional TNs in image recognition, showing limited scalability and a high bond dimension. In this work, we train two-dimensional hierarchical TNs to solve image recognition problems, using a training algorithm derived from the multipartite entanglement renormalization ansatz (MERA). This approach overcomes scalability issues and implies novel mathematical connections among quantum many-body physics, quantum information theory, and machine learning. While keeping the TN unitary in the training phase, TN states can be defined that optimally encode each class of images as a quantum many-body state. We study the quantum features of the TN states, including quantum entanglement and fidelity. We suggest these quantities could be novel properties that characterize the image classes, as well as the machine learning tasks. Our work could be further applied to identifying possible quantum properties of certain artificial intelligence methods.

I Introduction

Over the past years, we have witnessed rapid progress in applying quantum theories and technologies to realistic problems. Paradigmatic examples include quantum simulators Trabesinger et al. (2012) and quantum computers Steane (1998); Knill (2010); Buluta et al. (2011) aimed at tackling challenging problems that are beyond the capability of classical digital computations. The power of these methods stems from the properties of quantum many-body systems.

Tensor networks (TNs) are among the most powerful numerical tools for studying quantum many-body systems Orús (2014a, b); Ran et al. (2017a). The main challenge lies in the exponential growth of the Hilbert space with the system size, making exact descriptions of such quantum states impossible even for systems containing only a modest number of electrons. To break the “exponential wall”, TNs were suggested as an efficient ansatz that lowers the computational cost to a polynomial dependence on the system size. Astonishing achievements have been made in studying, e.g., spins, bosons, fermions, anyons, gauge fields, and so on Verstraete et al. (2008); Cirac and Verstraete (2009); Orús (2014b); Ran et al. (2017a). TNs are also exploited to predict interactions that are used to design quantum simulators Ran et al. (2017b).

Just as TNs allowed the numerical treatment of difficult physical systems by providing layers of abstraction, deep learning achieved similarly striking advances in automated feature extraction and pattern recognition LeCun et al. (2015). The resemblance between the two approaches is more than superficial. At a theoretical level, there is a mapping between deep learning and the renormalization group Mehta and Schwab (2014); Levine et al. (2017a); Koch-Janusz and Ringel (2017), which in turn connects holography and deep learning You et al. (2017); Gan and Shu (2017), and also allows studying network design from the perspective of quantum entanglement Levine et al. (2017b). Conversely, neural networks can represent quantum states Carleo and Troyer (2017); Chen et al. (2017); Huang and Moore (2017); Glasser et al. (2017).

Most recently, TNs have been applied to solve machine learning problems such as dimensionality reduction Cichocki et al. (2016, 2017), handwriting recognition Stoudenmire and Schwab (2016); Han et al. (2017), and linguistic applications Gallego and Orús (2017). Through a feature mapping, an image described as classical information is transferred into a product state defined in a Hilbert space. These states are then contracted with a TN, giving an output vector that determines the classification of the images into a predefined number of classes. Following this line of thought, it can be seen that when using a vector space for solving image recognition problems, one faces a similar “exponential wall” as in quantum many-body systems. For recognizing an object in the real world, there exist infinite possibilities since the shapes and colors change, in principle, continuously. An image or a gray-scale photo provides an approximation, where the total number of possibilities is lowered to $256^{N}$ per channel, with $N$ describing the number of pixels, which is assumed to be fixed for simplicity. Similar to the applications in quantum physics, TNs offer a promising way to reduce such an exponentially large space to a polynomial one.

This work contributes in two aspects. Firstly, we derive an efficient quantum-inspired learning algorithm based on a hierarchical representation that is known as tree TN (TTN) Fannes et al. (1992); Niggemann et al. (1997); Friedman (1997); Lepetit et al. (2000); Martín-Delgado et al. (2002); Shi et al. (2006); Nagaj et al. (2008); Tagliacozzo et al. (2009); Murg et al. (2010); Li et al. (2012); Nakatani and Chan (2013); Pižorn et al. (2013); Murg et al. (2015). Compared with Refs. Stoudenmire and Schwab (2016); Han et al. (2017), where a one-dimensional (1D) TN (called matrix product state (MPS) Östlund and Rommer (1995)) is used, the TTN better suits the two-dimensional (2D) nature of images. The algorithm is inspired by the multipartite entanglement renormalization ansatz (MERA) approach Vidal (2007, 2008); Cincio et al. (2008); Evenbly and Vidal (2009), where the tensors in the TN are kept unitary during the training. We test the algorithm on both the MNIST (handwriting recognition with binary images) and CIFAR (recognition of color images) databases and obtain accuracies comparable to the performance of convolutional neural networks. More importantly, TN states can then be defined that optimally encode each class of images as a quantum many-body state, which is akin to the study of a duality between probabilistic graphical models and TNs Robeva and Seigal (2017). We contrast the bond dimension and model complexity, with results indicating that a growing bond dimension overfits the data. We study the representation in the different layers of the hierarchical TN with t-SNE Van der Maaten and Hinton (2008), and find that the level of abstraction changes in the same way as in a deep convolutional neural network Krizhevsky et al. (2012) or a deep belief network Hinton et al. (2006), and the highest level of the hierarchy allows for a clear separation of the classes.
Finally, we show that the fidelities between any two TN states from different image classes are low, and we calculate the entanglement entropy of each TN state, which gives an indication of the difficulty of each class.

II Preliminaries of tensor network and machine learning

Figure 1: (Color online) Schematic diagram of a matrix product state.

A TN is defined as a group of tensors whose indexes are shared and contracted in a specific way. A TN can represent the partition function of a classical system, as well as a quantum many-body state. For the latter, one famous example is the MPS (Fig. 1), with the following mathematical representation:

$$\Psi_{s_1 s_2 \cdots s_N} = \sum_{\alpha_1 \cdots \alpha_{N-1}} A^{[1]}_{s_1 \alpha_1} A^{[2]}_{s_2 \alpha_1 \alpha_2} \cdots A^{[N]}_{s_N \alpha_{N-1}}. \tag{1}$$

When describing a physical state, the indexes $\{s_n\}$ are called physical bonds that represent the physical Hilbert space, and the dummy indexes $\{\alpha_n\}$ are called virtual bonds that carry the quantum entanglement. MPS is essentially a 1D state representation. When applied to 2D systems, MPS suffers severe restrictions since one has to choose a snake-like 1D path that covers the 2D manifold. This issue is related to the area law of entanglement entropy Verstraete and Cirac (2006); Hastings (2007); Schuch et al. (2008).
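
As a minimal sketch of how an MPS is evaluated, the snippet below contracts the virtual bonds of an open-boundary MPS for one fixed configuration of physical indexes. The function name `mps_amplitude`, the tensor layout (physical index first), and the toy sizes are illustrative assumptions, not taken from the authors' implementation.

```python
import numpy as np

def mps_amplitude(tensors, config):
    """Contract an open-boundary MPS for one configuration of physical indexes.

    tensors[0] and tensors[-1] have shape (d, chi); the others (d, chi, chi).
    This layout is an assumption made for the sketch.
    """
    vec = tensors[0][config[0]]                  # left boundary vector, shape (chi,)
    for A, s in zip(tensors[1:-1], config[1:-1]):
        vec = vec @ A[s]                         # absorb one site, shape stays (chi,)
    return float(vec @ tensors[-1][config[-1]])  # close with the right boundary

# Toy example: N sites, physical dimension d, virtual bond dimension chi.
rng = np.random.default_rng(0)
N, d, chi = 10, 2, 3
tensors = ([rng.normal(size=(d, chi))]
           + [rng.normal(size=(d, chi, chi)) for _ in range(N - 2)]
           + [rng.normal(size=(d, chi))])
n_entries = sum(t.size for t in tensors)
print("MPS entries:", n_entries, "vs. full state entries:", d ** N)
```

The number of stored entries grows linearly with the number of sites, whereas the full coefficient tensor grows as $d^{N}$.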

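A TTN (Fig. 2) provides a natural expression for 2D states, which we can write as a hierarchical structure of $K$ layers:

$$\hat{\Psi} = \sum_{\{s\}} \prod_{k=1}^{K} \prod_{m=1}^{M_k} T^{[k,m]}_{s_{k+1,m}\, s_{k,m_1} s_{k,m_2} s_{k,m_3} s_{k,m_4}}, \tag{2}$$

where $M_k$ is the number of tensors in the $k$-th layer, $s_{k+1,m}$ is the upward index of the $m$-th tensor in that layer, and $s_{k,m_1}, \ldots, s_{k,m_4}$ are its four downward indexes. For simplicity, we ignore the indexes by using bold capitals, and write Eq. (2) as

$$\hat{\Psi} = \prod_{k=1}^{K} \prod_{m=1}^{M_k} \mathbf{T}^{[k,m]}, \tag{3}$$

as long as no confusion is caused. The summation signs are also omitted, with the convention that all indexes shared by two tensors are contracted. Meanwhile, all vectors are assumed to be column vectors. We will follow these conventions throughout this paper.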

Figure 2: (Color online) The illustration of a TTN. The squares at the bottom represent the vectors obtained from the pixels of one image through the feature map. The sphere at the top represents the label.

In a TTN, each local tensor is chosen to have one upward index and four downward indexes. For representing a pure state, the tensor on the top only has four downward indexes. All the indexes except the downward ones of the tensors in the first layer are dummy and will be contracted. In our work, the TTN is slightly different from the pure state representation, by adding an upward index to the top tensor (Fig. 2). This added index corresponds to the labels in the supervised machine learning.

Before training, we need to prepare the data with a feature function that maps $N$ scalars ($N$ is the dimension of the images) to the space of $N$ vectors. The choice of the feature function is arbitrary: we chose the one used in Ref. [Stoudenmire and Schwab, 2016], where the dimension of each vector (denoted by $d$) can be controlled. Then, the space is transformed from that of $N$ scalars to a $d^{N}$-dimensional Hilbert space.

After “vectorizing” the $j$-th image in the dataset, the output for classification is a $\tilde{d}$-dimensional vector $\tilde{\mathbf{L}}^{[j]}$ obtained by contracting the vectors with the TTN, which reads as

$$\tilde{\mathbf{L}}^{[j]} = \hat{\Psi}^{\dagger} \prod_{n=1}^{N} \mathbf{v}^{[j,n]}, \tag{4}$$

where $\mathbf{v}^{[j,n]}$ denotes the $n$-th vector given by the $j$-th sample. One can see that $\tilde{d}$ is the dimension of the upward index of the top tensor, and should equal the number of classes. We use the convention that the position of the maximum value gives the classification of the image predicted by the TTN, akin to a softmax layer in a deep learning network.
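
The contraction of the vectorized image with the TTN can be sketched layer by layer: each tensor absorbs the four vectors of a $2 \times 2$ block and emits one vector for the next layer. The code below is an illustrative sketch only; the array layout `(rows, cols, up, down, down, down, down)`, the function names, and the toy sizes are assumptions, and the normalization trick used in training is omitted.

```python
import numpy as np

def contract_layer(tensors, vecs):
    """Contract one TTN layer.

    tensors: array (H/2, W/2, up, dn, dn, dn, dn), one tensor per 2x2 block.
    vecs:    array (H, W, dn), one vector per site of the layer below.
    Returns an array (H/2, W/2, up).
    """
    h, w = vecs.shape[0] // 2, vecs.shape[1] // 2
    out = np.empty((h, w, tensors.shape[2]))
    for i in range(h):
        for j in range(w):
            p, q = vecs[2 * i, 2 * j], vecs[2 * i, 2 * j + 1]
            r, s = vecs[2 * i + 1, 2 * j], vecs[2 * i + 1, 2 * j + 1]
            out[i, j] = np.einsum('uabcd,a,b,c,d->u', tensors[i, j], p, q, r, s)
    return out

def ttn_output(layers, vecs):
    """Apply all layers bottom-up; the top tensor leaves a single output vector."""
    for tensors in layers:
        vecs = contract_layer(tensors, vecs)
    return vecs[0, 0]

# Toy 4x4 "image" with d = 2, virtual bond 3, and 2 output classes.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(2, 2, 3, 2, 2, 2, 2)),
          rng.normal(size=(1, 1, 2, 3, 3, 3, 3))]
vecs = rng.normal(size=(4, 4, 2))
output = ttn_output(layers, vecs)
predicted_class = int(np.argmax(output))   # position of the maximum value
```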

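For training, the cost function to be minimized is the square error, which is defined as

$$f = \sum_{j=1}^{J} \left| \tilde{\mathbf{L}}^{[j]} - \mathbf{L}^{[j]} \right|^{2}, \tag{5}$$

where $J$ is the number of training samples and $\mathbf{L}^{[j]}$ is a $\tilde{d}$-dimensional vector corresponding to the $j$-th label. For example, if the $j$-th sample belongs to the $p$-th class, $\mathbf{L}^{[j]}$ is defined as

$$L^{[j]}_{q} = \begin{cases} 1, & q = p, \\ 0, & \text{otherwise}. \end{cases} \tag{6}$$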
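
A small numerical illustration of the square-error cost with one-hot labels may help. The helper names `one_hot` and `square_error` are hypothetical, and classes are indexed from 0 here purely for convenience.

```python
import numpy as np

def one_hot(p, n_classes):
    """Label vector with a 1 at position p (0-based) and 0 elsewhere."""
    label = np.zeros(n_classes)
    label[p] = 1.0
    return label

def square_error(outputs, classes):
    """Sum over samples of the squared distance between output and label.

    outputs: array (J, n_classes), one output vector per sample.
    classes: sequence of J class indexes (0-based).
    """
    outputs = np.asarray(outputs, dtype=float)
    labels = np.stack([one_hot(p, outputs.shape[1]) for p in classes])
    return float(np.sum((outputs - labels) ** 2))

# Two samples: the first is reproduced exactly, the second is undecided.
cost = square_error([[1.0, 0.0], [0.5, 0.5]], [0, 1])
```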

III MERA-inspired training algorithm

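We use MERA to derive a highly efficient training algorithm. To proceed, let us rewrite the cost function in the following form

$$f = \sum_{j=1}^{J} \left( \prod_{n=1}^{N} \mathbf{v}^{[j,n]\dagger}\, \hat{\Psi} \hat{\Psi}^{\dagger} \prod_{n=1}^{N} \mathbf{v}^{[j,n]} - 2\, \mathbf{L}^{[j]\dagger} \hat{\Psi}^{\dagger} \prod_{n=1}^{N} \mathbf{v}^{[j,n]} + 1 \right). \tag{7}$$

Here, the third term comes from the normalization of $\mathbf{L}^{[j]}$, and we assume that the second term is always real.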

The dominant cost comes from the first term. We borrow the idea from the MERA approach to reduce this cost. The central idea of MERA is the renormalization group of the entanglement Vidal (2007). The renormalization group flows are implemented by tensors that satisfy orthogonality conditions. More specifically, the indexes of one tensor are grouped into two kinds, say upward and downward indexes. By contracting all the downward indexes of a tensor with its conjugate, one gets an identity (Fig. 2), i.e., $\mathbf{T}^{\dagger}\mathbf{T} = I$. On one hand, the orthogonality keeps the state normalized, a basic requirement of quantum states. On the other hand, the renormalization group flows can be considered as compressions of the Hilbert space (from the downward to the upward indexes). The orthogonality ensures that such compressions are unbiased within the subspace. The difference of $\mathbf{T}\mathbf{T}^{\dagger}$ from the identity characterizes the errors caused by the compressions.
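
The orthogonality condition can be checked numerically on a toy tensor with one upward and four downward indexes, reshaped into a tall matrix. In this sketch the isometry is generated by a QR decomposition of a random matrix; this construction and the name `random_isometry` are assumptions made for illustration.

```python
import numpy as np

def random_isometry(b_down, b_up, seed=0):
    """Return a (b_down**4, b_up) matrix with orthonormal columns.

    Rows index the four grouped downward indexes, columns the upward index.
    """
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(b_down ** 4, b_up)))
    return q

T = random_isometry(b_down=2, b_up=3)
identity_up = T.T @ T     # contracting the downward indexes: 3x3 identity
projector = T @ T.T       # contracting the upward index: a rank-3 projector
compression_error = np.linalg.norm(projector - np.eye(2 ** 4))
```

Contracting the downward indexes gives the identity exactly, while contracting the upward index gives a projector onto the retained subspace; its deviation from the identity reflects the compression.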

In our case with the TTN, each tensor has one upward and four downward indexes, which gives a non-square orthogonal matrix by grouping the downward indexes into a large one. Such tensors are called isometries. When all the tensors are isometries, the TTN gives a unitary transformation that compresses a $d^{N}$-dimensional space to a $\tilde{d}$-dimensional one. One will approximately have $\hat{\Psi}\hat{\Psi}^{\dagger} \simeq I$ in the subspace that optimizes the classification. For this reason, the first term can be considered as a constant with the orthogonality, and the cost function becomes

$$f = -\sum_{j=1}^{J} \mathbf{L}^{[j]\dagger} \hat{\Psi}^{\dagger} \prod_{n=1}^{N} \mathbf{v}^{[j,n]}. \tag{8}$$

Each term in $f$ is simply the contraction of one TN, which can be efficiently computed. We stress that, independent of Eq. (5), Eq. (8) can be directly used as the cost function. This will lead to a more interesting picture connected to quantum information theory. The details are given in Sec. V.

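The tensors in the TTN are updated alternately to minimize Eq. (8). To update the tensor $\mathbf{T}^{[k,m]}$ for instance, we assume the other tensors are fixed and define the environment tensor $\mathbf{E}^{[k,m]}$, which is calculated by contracting everything in Eq. (8) after taking out $\mathbf{T}^{[k,m]}$ (Fig. 3) Evenbly and Vidal (2009). Then the cost function becomes $f = -\mathrm{Tr}(\mathbf{T}^{[k,m]}\mathbf{E}^{[k,m]})$. Under the constraint that $\mathbf{T}^{[k,m]}$ is an isometry, the solution of the optimal point is given by $\mathbf{T}^{[k,m]} = \mathbf{V}\mathbf{U}^{\dagger}$, where $\mathbf{V}$ and $\mathbf{U}$ are calculated from the singular value decomposition $\mathbf{E}^{[k,m]} = \mathbf{U}\Lambda\mathbf{V}^{\dagger}$. At this point, we have $f = -\sum_{a}\Lambda_{a}$.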

The update of one tensor thus reduces to the calculation of the environment tensor and its singular value decomposition. In the alternating process for updating all the tensors, some tricks are used to accelerate the computations. The idea is to save some intermediate results to avoid repetitive calculations by taking advantage of the tree structure. Another trick is to normalize each vector obtained by contracting four vectors with a tensor.
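
The single-tensor update can be sketched as follows, assuming the environment has already been reshaped into a matrix whose rows carry the upward index and whose columns carry the grouped downward indexes. The function name `update_isometry` and this matrix layout are assumptions; computing the environment itself, which requires contracting the rest of the network, is not shown.

```python
import numpy as np

def update_isometry(E):
    """Isometry T that minimizes -Tr(T E) for a fixed environment matrix E.

    E: array (b_up, b_down_total). Returns (T, f) with T of shape
    (b_down_total, b_up) and f = -sum of the singular values of E.
    """
    U, lam, Vh = np.linalg.svd(E, full_matrices=False)   # E = U diag(lam) Vh
    T = Vh.conj().T @ U.conj().T                         # T = V U^dagger
    return T, -float(lam.sum())

rng = np.random.default_rng(1)
E = rng.normal(size=(3, 16))        # toy environment: b_up = 3, 2**4 downward
T_new, f_new = update_isometry(E)
```

In a full training sweep this step would be repeated for every tensor in turn, recomputing the environment each time.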

The strategy for the training is the one-against-all classification scheme in machine learning (here dubbed as Strategy-I). For each class, we train one TTN so that it recognizes whether an image belongs to this class. The output of Eq. (4) is a two-dimensional vector. We fix the label for a yes answer as a fixed two-dimensional unit vector $\mathbf{L}^{\mathrm{yes}}$. For $P$ classes, we will accordingly have $P$ TTNs, denoted by $\{\hat{\Psi}^{[p]}\}$. Then for recognizing an image (vectorized to $\{\mathbf{v}^{[n]}\}$), we define a $P$-dimensional vector $\mathbf{F}$ as

$$F_{p} = \mathbf{L}^{\mathrm{yes}\dagger}\, \hat{\Psi}^{[p]\dagger} \prod_{n=1}^{N} \mathbf{v}^{[n]}. \tag{9}$$

The position of its maximal element gives which class the image belongs to. For comparison, we also directly train the TTN by the samples labeled in $P$ classes (dubbed as Strategy-II). In this case, the output [Eq. (4)] is $P$-dimensional, where the position of its maximal element is expected to give the correct class.
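
The decision rule of Strategy-I can be illustrated with the two-dimensional outputs of the per-class TTNs. In this sketch the yes label is assumed to be the first basis vector, which is an arbitrary choice made for the example, and `classify_one_vs_all` is a hypothetical helper name.

```python
import numpy as np

def classify_one_vs_all(outputs, yes_label=(1.0, 0.0)):
    """Pick the class whose binary TTN output overlaps most with the yes label.

    outputs: array (P, 2), row p is the two-dimensional output of the p-th TTN.
    yes_label: assumed label vector for a yes answer.
    Returns (index of the predicted class, vector of P scores).
    """
    outputs = np.asarray(outputs, dtype=float)
    scores = outputs @ np.asarray(yes_label, dtype=float)
    return int(np.argmax(scores)), scores

# Toy outputs of three binary classifiers for a single image.
predicted, scores = classify_one_vs_all([[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]])
```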

The time complexity and the space complexity scale in the same way, as a function of the dimension of the input vector, the dimension of the virtual bond, the dimension of the input bond, and the number of training inputs.

Figure 3: Illustration of the environment tensor.
Figure 4: Binary classification accuracy on CIFAR-10. (a) Number of training samples=200; (b) Number of training samples=600.
Figure 5: Embedding of data instances of CIFAR-10 by t-SNE corresponding to each layer in the TN. (a) Original data distribution; (b) 1st layer; (c) 2nd layer; (d) 3rd layer; (e) 4th layer; (f) 5th layer.

IV Experiments on image recognition

Our approach to classifying image data begins by mapping each pixel $x$ to a $d$-component vector $\mathbf{v}(x)$. This feature map was introduced by Stoudenmire and Schwab (2016) and is defined as

$$v_{s}(x) = \sqrt{\binom{d-1}{s-1}} \left( \cos\frac{\pi x}{2} \right)^{d-s} \left( \sin\frac{\pi x}{2} \right)^{s-1}, \tag{10}$$

where $s$ runs from $1$ to $d$. By using a larger $d$, the TTN has the potential to approximate a richer class of functions. Our implementation is available under an open source license.
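
A short sketch of this feature map, in the form given by Stoudenmire and Schwab (2016) with the pixel value assumed to be rescaled to the interval $[0, 1]$, is shown below. The function name `feature_map` is illustrative and not taken from the authors' code.

```python
import math

def feature_map(x, d):
    """Map a pixel value x in [0, 1] to a normalized d-component vector."""
    theta = math.pi * x / 2
    return [
        math.sqrt(math.comb(d - 1, s - 1))
        * math.cos(theta) ** (d - s)
        * math.sin(theta) ** (s - 1)
        for s in range(1, d + 1)
    ]

v = feature_map(0.3, d=4)
norm_squared = sum(c * c for c in v)   # equals 1 by the binomial theorem
```

For $d = 2$ the map reduces to $(\cos\frac{\pi x}{2}, \sin\frac{\pi x}{2})$, so a black pixel and a white pixel are sent to orthogonal vectors.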

IV.1 Benchmark on CIFAR-10

To verify the representation power of TTNs, we used the CIFAR-10 dataset Krizhevsky and Hinton (2009). The dataset consists of 60,000 RGB images in 10 classes, with 6,000 instances per class. There are 50,000 training images and 10,000 test images. Each RGB image was originally $32 \times 32$ pixels: we transformed them to grayscale. Working with gray-scale images reduced the complexity of training, with the trade-off being that less information was available for learning.

We built a TTN with five layers and used the MERA-like algorithm (Section III) to train the model. Specifically, we built a binary classification model to investigate key machine learning and quantum features, instead of constructing a complex multiclass model. We found that both the input bond (physical indexes) and the virtual bond (geometrical indexes) had a great impact on the representation power of TTNs, as shown in Fig. 4. This indicates that the limitation of representation power (learnability) of the TTNs is related to the input bond. In the same way, the virtual bond determines how accurately the TTNs approximate this limitation.

From the perspective of tensor algebra, the representation power of TTNs depends on the tensor obtained by contracting the entire TTN. The limitation of this power thus relies on the input bond. Furthermore, the TTN can be considered as a decomposition of this complete contraction, and the virtual bond determines how well the TTN approximates it. Moreover, this phenomenon can be interpreted from the perspective of quantum many-body theory: the higher the entanglement in a quantum many-body system, the more representation power this quantum system has.

The sequence of convolutional and pooling layers in the feature extraction part of a deep learning network is known to arrive at higher and higher levels of abstraction that help separate the classes in a discriminative learner LeCun et al. (2015). This is often visualized by embedding the representation in two dimensions by t-SNE Van der Maaten and Hinton (2008), and by coloring the instances according to their classes. If the classes clearly separate in this embedding, the subsequent classifier will have an easy task performing classification at a high accuracy. We plotted this embedding for each layer in the TN in Fig. 5. We observe the same pattern as in deep learning, with a clear separation at the highest level of abstraction.

Figure 6: Training and test accuracy as the function of the dimension of indexes on the MNIST dataset. The number of training samples is 1000 for each pair of classes.

IV.2 Benchmark on MNIST

To test the generalization of TTNs on a benchmark dataset, we used the MNIST collection, which is widely used in handwritten digit recognition. The training set consists of 60,000 examples, and the test set of 10,000 examples. Each gray-scale image of MNIST was originally $28 \times 28$ pixels, and we rescaled them to $16 \times 16$ pixels for building TTNs with four layers on this scale. The MERA-like algorithm was used to train the model.

Similar to the last experiment, we built a binary model to show the performance of generalization. With the increase of bond dimension (both of the input bond and virtual bond), we found an apparent rise of training accuracy, which is consistent with the results in Fig. 6. At the same time, we observed the decline of testing accuracy. The increase of bond dimension leads to a sharp increase of the number of parameters and, as a result, it will give rise to overfitting and lower the performance of generalization. Therefore, one must pay attention to finding the optimal bond dimension – we can think of this as a hyperparameter controlling model complexity.
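
The growth of the model size with the bond dimension can be made concrete by counting tensor entries. The sketch below assumes a tree in which every tensor merges a $2 \times 2$ block, so that a tree with `n_layers` layers has `4 ** (n_layers - k)` tensors in layer `k`; it counts raw entries and ignores the reduction in independent parameters implied by the isometry constraint. The function name and arguments are hypothetical.

```python
def ttn_entry_count(n_layers, input_bond, virtual_bond, top_dim):
    """Count the raw tensor entries of a 2x2-blocked tree tensor network."""
    total = 0
    for layer in range(1, n_layers + 1):
        n_tensors = 4 ** (n_layers - layer)
        down = input_bond if layer == 1 else virtual_bond    # downward index size
        up = top_dim if layer == n_layers else virtual_bond  # upward index size
        total += n_tensors * up * down ** 4
    return total

# Four layers cover a 16x16 image; compare two bond settings for a binary model.
small = ttn_entry_count(4, input_bond=2, virtual_bond=3, top_dim=2)
large = ttn_entry_count(4, input_bond=6, virtual_bond=6, top_dim=2)
print(small, large)
```

Under these assumptions, going from bonds (2, 3) to bonds (6, 6) increases the number of entries by nearly two orders of magnitude, which is in line with the overfitting tendency described above.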

For multi-class learning, we choose the one-against-all strategy to build a 10-class model, which classifies an input image by choosing the label for which the output is largest. To avoid overfitting and lower the computing cost, we apply a different feature map to each individual model. The parameter configurations and testing results are given in Table 1. We repeated the t-SNE visualization on MNIST, but since the classes separate well even in the raw data, we did not include the corresponding figures here.

model      Training       Testing        Input   Virtual
           accuracy (%)   accuracy (%)   bond    bond
0          96             97             3       3
1          97             97             3       3
2          96             95             3       4
3          94             93             4       4
4          96             95             2       3
5          94             95             6       6
6          97             96             2       3
7          94             94             6       6
8          93             93             6       6
9          94             93             4       6
10-class   92             /              /
Table 1: 10-class classification on MNIST
Figure 7: Fidelity between each pair of handwritten digits. The diagonal terms equal 1 because the quantum states are normalized.
Figure 8: Schematic diagram of fidelity and entanglement entropy calculation. (a) Fidelity; (b) Entanglement entropy.
Figure 9: Entanglement entropy corresponding to each handwritten digit.

V Encoding images in quantum states: fidelity and entanglement

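Taking a TTN trained with Strategy-II, we define a TN state for each class $p$ as

$$|\psi_{p}\rangle = \hat{\Psi}\, \mathbf{L}^{(p)}, \tag{11}$$

where $\mathbf{L}^{(p)}$ denotes the label vector of the $p$-th class [Eq. (6)]. In $|\psi_{p}\rangle$, the upward index of the top tensor is contracted with the label ($\mathbf{L}^{(p)}$), giving a TN state that represents a pure quantum state.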

The quantum state representations allow us to use quantum theories to study images and the related issues. Let us begin with the cost function. In Section III, we started from a frequently used cost function in Eq. (5), and derived a cost function in Eq. (8). In the following, we show that such a cost function can be understood by the notion of fidelity. With Eq. (11), the cost function in Eq. (8) can be rewritten as

$$f = -\sum_{j=1}^{J} \langle \psi_{p_j} | \prod_{n=1}^{N} \mathbf{v}^{[j,n]} \rangle, \tag{12}$$

where $p_j$ denotes the class of the $j$-th sample. Knowing that the fidelity between two states is defined as their inner product, each term in the summation is simply the fidelity Steane (1998); Bennett and DiVincenzo (2000) between a vectorized image and the corresponding state $|\psi_{p}\rangle$. Considering the fidelity as a measure of the closeness of two states, $\{|\psi_{p}\rangle\}$ are the states for which the distance between each $|\psi_{p}\rangle$ and the corresponding vectorized images is minimized. In other words, the cost function is in fact the negative of the total fidelity, and $|\psi_{p}\rangle$ is the quantum state that optimally encodes the $p$-th class of images.

Note that due to the orthogonality of the top tensor, such states are orthogonal to each other, i.e., $\langle \psi_{p'} | \psi_{p} \rangle = \delta_{p'p}$. This might trap us in a bad local minimum. For this reason, we propose Strategy-I. For each class, we train a TTN to give yes-or-no answers. Each TTN gives two TN states labeled yes and no, respectively. Then we will have $2P$ TN states. $\{|\psi_{p}\rangle\}$ are then defined by taking the yes-labeled TN states. The elements of $\mathbf{F}$ in Eq. (9) are defined by the fidelity between $|\psi_{p}\rangle$ and the vectorized image. In this scenario, the classification is decided by finding the $|\psi_{p}\rangle$ that gives the maximal fidelity with the input image, while the orthogonality conditions among $\{|\psi_{p}\rangle\}$ no longer exist.

Besides the algorithmic interpretation, fidelity may imply more intrinsic information. Without the orthogonality of $\{|\psi_{p}\rangle\}$, the fidelity $F_{p'p} = \langle \psi_{p'} | \psi_{p} \rangle$ (Fig. 8 (a)) describes the differences between the quantum states that encode different classes of images. As shown in Fig. 7, $F_{p'p}$ remains quite small in most cases, indicating that the orthogonality still approximately holds using Strategy-I. Still, some values are relatively large. We speculate that this is closely related to the way the data are fed and processed in the TN. In our case, two image classes that have similar shapes will result in a larger fidelity, because the TTN essentially provides a real-space renormalization flow. In other words, the input vectors are initially arranged and renormalized layer by layer according to their spatial locations in the image; each tensor renormalizes four nearest-neighboring vectors into one vector. Fidelity can potentially be applied to building a network, where the nodes are classes of images and the weights of the connections are given by $F_{p'p}$. This might provide a mathematical model of how different classes of images are associated with each other. We leave these questions for future investigations.
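
The pairwise fidelities can be illustrated on small dense vectors. In practice the class states live in an exponentially large space and their overlaps would be obtained by contracting two tensor networks, so the dense representation and the name `fidelity_matrix` below are assumptions made for the sketch.

```python
import numpy as np

def fidelity_matrix(states):
    """Matrix of inner products between normalized state vectors.

    states: array (P, D), one state per row. Entry [q, p] is <psi_q|psi_p>.
    """
    states = np.asarray(states)
    return states.conj() @ states.T

# Toy example with two normalized states in a two-dimensional space.
F = fidelity_matrix([[1.0, 0.0], [0.6, 0.8]])
```

The diagonal entries equal 1 for normalized states, and the off-diagonal entries measure how similar two class states are.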

Another important concept of quantum mechanics is (bipartite) entanglement, a quantum version of correlations Bennett and DiVincenzo (2000); Horodecki et al. (2009). It is one of the key characteristics that distinguish quantum states from classical ones. Entanglement is usually given by a normalized, non-negative vector called the entanglement spectrum (denoted as $\Lambda$), and is measured by the entanglement entropy $S = -\sum_{a} \Lambda_{a}^{2} \ln \Lambda_{a}^{2}$. For two subsystems, the entanglement entropy measures the amount of information about one subsystem that can be gained by measuring the other subsystem. In the framework of TNs, the entanglement entropy determines the minimal dimensions of the dummy indexes needed for reaching a certain precision.

In our image recognition setting, the entanglement entropy characterizes how much information about one part of the image we can gain by knowing the remaining part. In other words, if we only know a part of an image and want to predict the rest according to the trained TTN (the quantum state that encodes the corresponding class), the entanglement entropy measures how accurately this can be done. Here, an important analogy is between knowing a part of the image and measuring the corresponding subsystem of the quantum state. Thus, the trained TTN might be used for image processing, e.g., to recover an image from a damaged or compressed lower-resolution version.

Fig. 9 shows the entanglement entropy for each class in the MNIST dataset. With the TTN, the entanglement spectrum is simply given by the singular values of the matrix $\mathbf{M} = \mathbf{L}^{\dagger}\mathbf{T}$, with $\mathbf{L}$ the label and $\mathbf{T}$ the top tensor (Fig. 8 (b)). This is because all the tensors in the TTN are orthogonal. Note that $\mathbf{M}$ has four indexes, each of which represents the effective space renormalized from one quarter of the vectorized image. Thus, the bipartition of the entanglement determines how the four indexes of $\mathbf{M}$ are grouped into two bigger indexes before calculating the SVD. We compute two kinds of entanglement entropy by cutting the system in the middle along the x or y direction. Our results suggest that the images of “0” and “4” are the easiest and hardest, respectively, for predicting one part of the image from the other part.
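
The grouping of the four indexes and the resulting entropy can be sketched as follows. The index order (upper-left, upper-right, lower-left, lower-right), the cut names, and the function name are assumptions made for illustration; the entropy is computed from the normalized squared singular values.

```python
import numpy as np

def entanglement_entropy(top, cut="left_right"):
    """Entanglement entropy of a four-index tensor across a middle cut.

    top: array (b, b, b, b), assumed index order
         [upper-left, upper-right, lower-left, lower-right].
    cut: "left_right" groups (UL, LL) against (UR, LR);
         "top_bottom" groups (UL, UR) against (LL, LR).
    """
    b = top.shape[0]
    if cut == "left_right":
        mat = top.transpose(0, 2, 1, 3).reshape(b * b, b * b)
    else:
        mat = top.reshape(b * b, b * b)
    lam = np.linalg.svd(mat, compute_uv=False)
    p = lam ** 2 / np.sum(lam ** 2)     # normalized squared spectrum
    p = p[p > 1e-15]                    # drop numerical zeros before the log
    return float(-np.sum(p * np.log(p)))

# Toy state: each upper quarter is perfectly correlated with its horizontal
# neighbor, and likewise for the lower quarters.
eye = np.eye(2)
top = np.einsum('ij,kl->ijkl', eye, eye)
s_left_right = entanglement_entropy(top, "left_right")
s_top_bottom = entanglement_entropy(top, "top_bottom")
```

In this toy state the left and right halves are strongly entangled while the top and bottom halves are not, which illustrates why the two cuts can give different entropies for the same class.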

VI Conclusion and outlook

We continued the forays into using tensor networks for machine learning, focusing on hierarchical, two-dimensional tree tensor networks, which we found to be a natural fit for image recognition problems. This proved to be a scalable approach with high precision, and we draw the following conclusions:

  • The limitation of representation power (learnability) of the TTN model strongly depends on the open indexes (physical indexes), while the inner indexes (geometrical indexes) determine how well the TTN approximates this limitation.

  • A hierarchical tensor network exhibits the same increase in the level of abstraction as a deep convolutional neural network or a deep belief network.

  • Fidelity can give us insight into how difficult it is to tell two classes apart.

  • Entanglement entropy can characterize the difficulty of representing a class of problems.

In future work, we plan to use fidelity-based training in an unsupervised setting, to apply the trained TTN to recover damaged or compressed images, and to use entanglement entropy to characterize the accuracy.

Acknowledgments

SJR is grateful to Ivan Glasser and Nicola Pancotti for stimulating discussions. DL was supported by the China Scholarship Council (201609345008) and the National Natural Science Foundation in China (61771340). SJR, PW, and ML acknowledge support from the Spanish Ministry of Economy and Competitiveness (Severo Ochoa Programme for Centres of Excellence in R&D SEV-2015-0522), Fundació Privada Cellex, and Generalitat de Catalunya CERCA Programme. SJR and ML were further supported by ERC AdG OSYRIS (ERC-2013-AdG Grant No. 339106), the Spanish MINECO grants FOQUS (FIS2013-46768-P), FISICATEAMO (FIS2016-79508-P), Catalan AGAUR SGR 874, EU FETPRO QUIC, EQuaM (FP7/2007-2013 Grant No. 323714), and Fundació Catalunya - La Pedrera Ignacio Cirac Program Chair. PW acknowledges financial support from the ERC (Consolidator Grant QITBOX) and QIBEQI (FIS2016-80773-P), and a hardware donation by Nvidia Corporation.
