Information Scrambling in Quantum Neural Networks
Quantum neural networks are one of the promising applications for near-term noisy intermediate-scale quantum computers. A quantum neural network distills the information from the input wavefunction into the output qubits. In this Letter, we show that this process can also be viewed from the opposite direction: the quantum information in the output qubits is scrambled into the input. This observation motivates us to use the tripartite information, a quantity recently developed to characterize information scrambling, to diagnose the training dynamics of quantum neural networks. We empirically find a strong correlation between the dynamical behavior of the tripartite information and the loss function in the training process, from which we identify that the training process has two stages for randomly initialized networks. In the early stage, the network performance improves rapidly and the tripartite information increases linearly with a universal slope, meaning that the neural network becomes less scrambled than the random unitary. In the later stage, the network performance improves slowly while the tripartite information decreases. We present evidence that the network constructs local correlations in the early stage and learns large-scale structures in the later stage. We believe this two-stage training dynamics is universal and is applicable to a wide range of problems. Our work builds a bridge between two research subjects, quantum neural networks and information scrambling, which opens up a new perspective for understanding quantum neural networks.
The neural network lies at the heart of the recent flourishing of deep learning Goodfellow et al. (2016). Mathematically, a neural network is a trainable mapping from the input feature to the output target. The input feature is typically represented as a high-dimensional vector. The information is distilled from the input by the neural network and is encoded into a lower-dimensional output vector. Recently, quantum generalizations of neural networks have been proposed and actively studied Benedetti et al. (2017); Torrontegui and García-Ripoll (2019); Benedetti et al. (2019); Farhi and Neven ; McClean et al. (2018); Mitarai et al. (2018); Huggins et al. (2019); Schuld et al. ; Grant et al. (2018); Liu and Wang (2018); Verdon et al. ; Zeng et al. (2019); Du et al. ; Beer et al. ; Beach et al. . In quantum neural networks, both inputs and outputs are quantum wavefunctions. The classical mapping is replaced by a quantum channel composed of unitary evolutions and measurements. The quantum channel is parameterized and trained in classical optimization loops. As a result, these quantum neural networks are also called “parameterized quantum circuits”. This hybrid quantum-classical framework can process both classical and quantum data Biamonte et al. (2017). It is considered one of the most promising applications for near-term noisy intermediate-scale quantum devices Preskill (2018). Moreover, it has been suggested that these quantum neural networks have more expressive power than their classical counterparts Du et al. .
Similar to classical neural networks, quantum information in the input wavefunction is distilled and encoded into the output. This process is illustrated by the forward arrow in Fig. 1(a). Intriguingly, for quantum neural networks, this process can also be viewed from the opposite direction. By deferring measurements until the end of the quantum channel Nielsen and Chuang (2010), the information encoded in the output qubits just before the measurement is spread into the entire system by unitary transformations, as illustrated by the backward arrow in Fig. 1(a). Such a process, in which information spreads from a small system into a large one, is now known as information scrambling. Information scrambling is by now well studied in contexts such as thermalization, chaos and information dynamics in quantum many-body systems, and even black-hole physics Altman (2018); Qi (2018); Swingle (2018). In particular, the out-of-time-order correlation function has been proposed as a powerful tool to diagnose information scrambling Larkin and Ovchinnikov (1969); Kitaev ; Shenker and Stanford (2014); Maldacena et al. (2016); Fan et al. (2017).
Quantum neural networks and quantum information scrambling have so far been two separate research topics. The purpose of this Letter is to bridge this gap by establishing their connection: information encoding in a quantum neural network and information scrambling are the same process viewed from opposite directions.
There have been information-theoretic studies of classical neural networks. For example, the mutual information between the intermediate results of hidden layers and the input or the output was studied, and a universal training dynamics was found Shwartz-Ziv and Tishby ; Saxe et al. (2018); Goldfeld et al. (2019). However, in classical neural networks, the mapping at every layer is usually not invertible and the information is generally not preserved. Due to the information loss during the process, the mutual information always decreases with the network depth. In contrast, the unitarity of the quantum evolution preserves the information perfectly. The mutual information between the input and the output of any unitary transformation is always maximal. To obtain a nontrivial diagnostic in quantum neural networks, the key is to consider the mutual information between subsystems of the input and the output. This naturally leads to the tripartite information, a quantity that characterizes information scrambling Kitaev and Preskill (2006); Hosur et al. (2016).
In this Letter, we study the training dynamics of quantum neural networks using the tripartite information. We simultaneously monitor both the network performance and the tripartite information during training and observe empirical relations between them. Based on the behavior of these two quantities, the training process can be decomposed into two stages. We call the first stage the “local construction stage” and the second stage the “global relaxation stage”. In the following, we present a detailed analysis of the training dynamics and provide evidence to support our claim.
Tripartite Information of Quantum Neural Networks. Consider a unitary operator $\hat{U}$ in the $n$-qubit Hilbert space, $\hat{U} = \sum_{i,j} U_{ij} |i\rangle\langle j|$, where $\{|i\rangle\}$ denotes a complete set of bases in the Hilbert space. It can be regarded as a tensor with $n$ input and $n$ output legs. As illustrated in Fig. 1(b), we divide the output legs (qubits) into two non-overlapping subsystems $A$ and $B$ and similarly divide the input legs (qubits) into $C$ and $D$.
The operator $\hat{U}$ can be mapped to a state in the $2n$-qubit Hilbert space as $|U\rangle = 2^{-n/2} \sum_{i,j} U_{ij} |i\rangle_{\rm out} \otimes |j\rangle_{\rm in}$. Since $|U\rangle$ is a pure state, the entanglement entropy of its subsystem is well-defined, e.g. $S_{AC} = -\mathrm{tr}\left(\rho_{AC} \log_2 \rho_{AC}\right)$ with $\rho_{AC}$ being the reduced density matrix of subsystem $AC$. The mutual information between the output subsystem $A$ and the input subsystem $C$ is $I(A:C) = S_A + S_C - S_{AC}$. Similar definitions can be made for $I(A:D)$ and $I(A:CD)$. In this way, the tripartite information of the unitary $\hat{U}$ is defined as Kitaev and Preskill (2006); Hosur et al. (2016)
$$I_3(A:C:D) = I(A:C) + I(A:D) - I(A:CD). \qquad (1)$$
Because $C$ and $D$ together contain all input qubits, it can be proved that $I(A:CD) = 2 n_A$, where $n_A$ is the number of qubits in subsystem $A$. Indeed, unitarity makes the reduced state of any set of output qubits maximally mixed, so $S_A = n_A$, $S_{CD} = n$ and $S_{ACD} = S_B = n - n_A$, which gives $I(A:CD) = S_A + S_{CD} - S_{ACD} = 2 n_A$ regardless of $\hat{U}$. Therefore, it is crucial to consider the mutual information between subsystems of both input and output qubits.
The strong subadditivity of the entanglement entropy leads to $I_3(A:C:D) \le 0$ for a unitary gate. The absolute value of the tripartite information measures how much information of the subsystem $A$ is shared by $C$ and $D$ simultaneously after the unitary transformation, and thus quantifies how scrambled a unitary is. For example, for an identity unitary transformation $\hat{U} = \hat{1}$, if the qubits of $A$ are entirely contained in $C$ or in $D$, it is straightforward to show that $I_3 = 0$. In the opposite limit of a Haar random unitary, local measurements cannot extract any information. It follows that on average $I(A:C)$ and $I(A:D)$ are exponentially small and therefore $I_3 \approx -2 n_A$, which is the maximal absolute value of $I_3$ Hosur et al. (2016).
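The two limits above can be checked numerically on a few qubits. The following is a minimal sketch, assuming NumPy and the standard operator-state mapping with entropies measured in bits; the function names, the qubit-ordering convention (output qubits first, then input qubits) and the choice of a single central qubit for both subsystems are illustrative assumptions, not the implementation used for the results in this Letter.

```python
import numpy as np

def subsystem_entropy(psi, keep, total):
    """Entanglement entropy (in bits) of the qubits `keep` in a pure state psi."""
    keep = sorted(keep)
    rest = [q for q in range(total) if q not in keep]
    if not keep or not rest:
        return 0.0  # empty subsystem, or the full pure state
    t = psi.reshape([2] * total).transpose(keep + rest)
    s = np.linalg.svd(t.reshape(2 ** len(keep), -1), compute_uv=False)
    p = s ** 2
    p = p[p > 1e-12]
    return float(-np.sum(p * np.log2(p)))

def tripartite_information(U, A, C):
    """I3 for output qubits A, input qubits C and the remaining input qubits.

    Assumed convention: U[out, in], so the flattened matrix is a state whose
    first n qubits are the outputs and whose last n qubits are the inputs.
    """
    n = int(round(np.log2(U.shape[0])))
    psi = U.reshape(-1) / np.sqrt(2 ** n)
    A_q = list(A)
    C_q = [n + q for q in C]
    D_q = [n + q for q in range(n) if q not in C]

    def S(qs):
        return subsystem_entropy(psi, qs, 2 * n)

    def I(X, Y):
        return S(X) + S(Y) - S(X + Y)

    return I(A_q, C_q) + I(A_q, D_q) - I(A_q, C_q + D_q)

def haar_unitary(dim, rng):
    """Haar-random unitary from the QR decomposition of a complex Gaussian matrix."""
    z = (rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    d = np.diag(r)
    return q * (d / np.abs(d))

rng = np.random.default_rng(0)
n = 3
i3_identity = tripartite_information(np.eye(2 ** n), A=[1], C=[1])
i3_random = tripartite_information(haar_unitary(2 ** n, rng), A=[1], C=[1])
```

For the identity the result vanishes, while a random unitary gives a negative value bounded below by minus twice the number of qubits in the output subsystem; on such a small system the random-unitary value need not be close to that bound.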
Having introduced the tripartite information for a general unitary transformation, we now turn to the tripartite information of quantum neural networks. Here we only consider parameterized quantum circuits with brick-wall geometry. As shown in Fig. 1(a), each square represents an independent two-qubit unitary gate in the $SU(4)$ group, parameterized by its 15 Euler angles Dita (2003). During training, these parameters are optimized with classical optimization algorithms. All these two-qubit gates form a quantum circuit represented by a giant unitary transformation $\hat{U}$.
The datasets to be studied in this work have several important features. First, the input wavefunctions all have time reversal symmetry, and consequently can be represented as real vectors. Therefore we restrict two-qubit gates to $SO(4)$ with 6 Euler angles each. Second, since the output target is either a real number or a binary label, only one readout qubit is needed at the end of the quantum circuit. For simplicity, we always let the number of qubits $n$ be odd and fix the readout qubit to be the qubit at the center, i.e. the $(n+1)/2$-th qubit.
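The circuit structure described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: each real two-qubit gate is generated as the exponential of an antisymmetric matrix with 6 free entries (a convenient stand-in, not the Euler-angle factorization of Dita (2003)), the even/odd layer layout is one plausible reading of the brick-wall geometry, and the Pauli-X readout operator, the system size and the depth are placeholder choices.

```python
import numpy as np

def so4_from_angles(theta):
    """Real 4x4 rotation exp(A), with A antisymmetric and built from 6 angles."""
    A = np.zeros((4, 4))
    A[np.triu_indices(4, k=1)] = theta       # 6 independent entries
    A = A - A.T
    # i*A is Hermitian, so exp(A) = V exp(-i w) V^dagger with (w, V) = eigh(i*A).
    w, V = np.linalg.eigh(1j * A)
    return ((V * np.exp(-1j * w)) @ V.conj().T).real

def apply_two_qubit_gate(psi, gate, q, n):
    """Apply a 4x4 gate to neighboring qubits (q, q+1) of an n-qubit state."""
    t = psi.reshape(2 ** q, 4, 2 ** (n - q - 2))
    return np.einsum("ab,ibj->iaj", gate, t).reshape(-1)

def brick_wall(psi, angles, n):
    """angles[layer][k] holds the 6 angles of the k-th gate in that layer."""
    for layer, layer_angles in enumerate(angles):
        start = layer % 2                     # alternate even and odd bonds
        for k, q in enumerate(range(start, n - 1, 2)):
            psi = apply_two_qubit_gate(psi, so4_from_angles(layer_angles[k]), q, n)
    return psi

def readout_expectation(psi, op, n):
    """Expectation value of a single-qubit operator on the central qubit."""
    c = n // 2                                # 0-based index of the (n+1)/2-th qubit
    t = psi.reshape(2 ** c, 2, 2 ** (n - c - 1))
    return float(np.real(np.vdot(t, np.einsum("ab,ibj->iaj", op, t))))

rng = np.random.default_rng(0)
n, depth = 5, 4                               # hypothetical sizes for illustration
angles = []
for layer in range(depth):
    n_gates = len(range(layer % 2, n - 1, 2))
    angles.append([rng.uniform(0, 2 * np.pi, 6) for _ in range(n_gates)])

psi_in = rng.normal(size=2 ** n)              # random real input wavefunction
psi_in /= np.linalg.norm(psi_in)
psi_out = brick_wall(psi_in, angles, n)

PAULI_X = np.array([[0.0, 1.0], [1.0, 0.0]])  # assumed readout operator
prediction = readout_expectation(psi_out, PAULI_X, n)
gate = so4_from_angles(angles[0][0])
```

Because every gate is a real rotation, a real input wavefunction stays real and normalized throughout the circuit, which is the reason the restriction to 6 parameters per gate loses nothing for time-reversal-symmetric inputs.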
To define the tripartite information, we always fix the output subsystem $A$ to be the central readout qubit. To respect the symmetry that $A$ is located at the center, we always choose the input subsystem $C$ to be the central input qubits in the circuit, and $D$ to be the remaining input qubits. Note that under this definition, $D$ in general contains two disconnected regions. In this way, the tripartite information characterizes how much information of the output qubit is scrambled on the input side between the central region $C$ and the outer region $D$.
Magnetization Learning. The first task is supervised learning of the average magnetization of a many-body wavefunction of spin-1/2 particles. The dataset consists of input-target pairs, where each input wavefunction is the ground state wavefunction of a parent Hamiltonian with random long-ranged spin-spin interactions:
where represents the -th Pauli matrix on the -th qubit, and . , , and are all random numbers. The target is the average magnetization computed as , where the magnetization operator is . In sampling the random Hamiltonian, we ensure such that the ground state wavefunctions are either “ferromagnetic” or “paramagnetic” measured under . is a small pinning field randomly drawn from a distribution with zero mean, which is used to trigger the spontaneous symmetry breaking in the ferromagnetic phase.
The quantum neural network takes the input wavefunction and applies the unitary transformation $\hat{U}$ to it. The magnetization is read out by a single-qubit measurement on the central qubit. The task is designed to challenge the quantum neural network to learn how to summarize the average magnetization, which is defined in one basis, and present the result in the measurement basis of the readout qubit. This is essentially a regression task, and the loss function to be minimized during training is the absolute error of the magnetization:
We simulate the above hybrid quantum-classical training algorithm. The distributions of random parameters in the Hamiltonian Eq. (2) are chosen such that the target magnetization in the dataset is distributed roughly uniformly over its range. All two-qubit unitaries in the quantum neural network are initialized randomly. The parameters are optimized with the AMSGrad gradient descent algorithm Reddi et al. (2018). The gradients can be computed directly thanks to the linearity of the quantum channel and are measurable in a realistic quantum neural network SM ; Mitarai et al. (2018); Schuld et al. .
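For readers unfamiliar with AMSGrad, the update rule of Reddi et al. (2018) can be sketched as follows. This is a minimal illustration on a toy quadratic loss, not the training code used here: the hyperparameter values, the helper names and the toy target are assumptions, and the form without bias correction is one common way of writing the algorithm.

```python
import numpy as np

def init_state(theta):
    """Zero-initialized moment estimates for a parameter vector."""
    z = np.zeros_like(theta)
    return {"m": z.copy(), "v": z.copy(), "v_hat": z.copy()}

def amsgrad_step(theta, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update: Adam-style moments with a running maximum of v."""
    m = beta1 * state["m"] + (1 - beta1) * grad
    v = beta2 * state["v"] + (1 - beta2) * grad ** 2
    v_hat = np.maximum(state["v_hat"], v)     # never lets the step size grow back
    new_theta = theta - lr * m / (np.sqrt(v_hat) + eps)
    return new_theta, {"m": m, "v": v, "v_hat": v_hat}

# Toy problem standing in for the circuit parameters (purely illustrative).
target = np.array([0.3, -0.7, 1.1])

def loss(theta):
    return float(np.sum((theta - target) ** 2))

def grad(theta):
    return 2.0 * (theta - target)

theta = np.zeros(3)
state = init_state(theta)
initial_loss = loss(theta)
for _ in range(2000):
    theta, state = amsgrad_step(theta, grad(theta), state)
final_loss = loss(theta)
```

The only difference from Adam in this form is the running maximum `v_hat`, which keeps the effective per-parameter step size from increasing during training.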
Two-stage Training. In Fig. 2, we show both the training and validation loss, along with the tripartite information, as functions of the training epoch. Both training and validation losses decrease monotonically when the training proceeds, indicating that the network can learn to compute the magnetization reasonably well without overfitting.
At the early stage of the training, the rapid improvement of the quantum neural network performance, characterized by a fast decrease of both training and validation losses, is accompanied by an almost linear increase of the tripartite information. In other words, the quantum neural network becomes less scrambled compared with the initial random unitary. This training stage terminates when the tripartite information reaches its local maximum, as indicated by the vertical dotted line in Fig. 2. In the next stage, the tripartite information decreases again, meaning that the network scrambles information faster. The network performance also improves, but at a much slower rate than in the first stage.
We call the training stage before reaching the local maximum the “local construction stage”, and the later stage where the tripartite information decreases the “global relaxation stage”. The reason for these names will become clear after we study the training dynamics in detail below. The empirical observation that quantum neural network performance and information scrambling are closely correlated is the main finding of this work. This correlation has been observed in most of our numerical tests with different network initializations, training algorithms, system sizes and network depths (see note (36)).
We also train quantum neural networks for a different task of learning the winding number of a product quantum state. Compared with the magnetization task, the input wavefunction here is a product state and is essentially classical, and the target is now a binary label instead of a real number. Despite the very different nature of this task, the empirical correlation between the neural network performance and the tripartite information still holds. All details of the winding number learning task are presented in SM .
Local Construction Stage. We claim that during the first stage, when the tripartite information increases linearly, the quantum neural network learns local features of the input wavefunction. For the magnetization learning task, for example, because of the existence of ferromagnetic domains in the training wavefunctions, there is some probability that any single spin is aligned relatively well with the remaining spins in the system. Since simply outputting any single-spin magnetization of the input wavefunction is already a reasonable guess, the training loss can decrease rapidly. For such networks, where only local features are extracted, information does not need to be scrambled into the whole system. Therefore, the tripartite information increases during this stage.
To support the above claim, we compute two-point correlations between input qubits and the readout qubit:
If one views $\hat{U}$ as a time evolution operator, then this correlation is simply a two-point function between two different places and two different times. In Fig. 3(a), we plot it as a function of the input qubit and the training epoch at the early training stage. As can be seen, the correlations increase rapidly and then saturate to large values. The increasing correlation indicates that the quantum neural network is establishing the correspondence between local input features and the output qubit. During this stage, the tripartite information also increases, and the two-point correlation function saturates when the tripartite information reaches the maximum. All these observations are consistent with our claim that during the first, local construction stage, local features are extracted from the input.
Before concluding this section, we point out another interesting observation: the slope of the linear increase of the tripartite information is nearly a constant that is independent of the initialization, as shown in Fig. 3(b). Of course, this slope depends on the learning rate of the gradient descent algorithm. As shown in the inset, the slope scales linearly with the learning rate.
Global Relaxation Stage. We now turn to the second stage, where the tripartite information decreases and the training loss decreases at a much slower rate. We claim that during this stage, the quantum neural network learns global features of the wavefunction. To provide evidence for this claim, we test the quantum neural network on an artificial test dataset constructed according to the following process. First, we sample ground states from the random Hamiltonian of Eq. (2). Then we apply the following unitary transformation to flip a region of spins:
For “paramagnetic” wavefunctions, this transformation leaves them “paramagnetic”. However, for “ferromagnetic” wavefunctions, the transformation creates a ferromagnetic domain wall of a given size, as sketched in Fig. 4. In order to accurately compute the magnetization of such wavefunctions, the quantum neural network must be able to learn structures larger than the domain wall size. In SM , we present an argument for why, in this task, long string operators should exist in the relevant operator when it is expanded in the basis of products of local Pauli matrices.
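The effect of flipping a region of spins on the average magnetization can be illustrated with a toy example. The sketch below makes several assumptions that are not taken from the text: the flip is implemented as a Pauli-X on every site of a contiguous region, the magnetization is measured along the computational ($z$) axis, and a fully polarized product state stands in for a “ferromagnetic” ground state. The function names, the system size and the region are hypothetical.

```python
import numpy as np

def flip_region(psi, start, length, n):
    """Apply a Pauli-X to qubits start, ..., start+length-1 of an n-qubit state."""
    t = psi.reshape([2] * n)
    t = np.flip(t, axis=tuple(range(start, start + length)))
    return t.reshape(-1)

def average_magnetization(psi, n):
    """Average of <sigma^z_i> over all n qubits (assumed magnetization axis)."""
    probs = (np.abs(psi) ** 2).reshape([2] * n)
    total = 0.0
    for q in range(n):
        other_axes = tuple(a for a in range(n) if a != q)
        p = probs.sum(axis=other_axes)        # marginal [p(up), p(down)] of qubit q
        total += p[0] - p[1]
    return total / n

n = 7
psi_fm = np.zeros(2 ** n)
psi_fm[0] = 1.0                               # all spins up: toy "ferromagnetic" state
m_before = average_magnetization(psi_fm, n)

psi_dw = flip_region(psi_fm, start=2, length=3, n=n)
m_after = average_magnetization(psi_dw, n)    # 4 spins up and 3 spins down
```

In this toy example the magnetization drops from 1 to 1/7, so a network that only reads off a single spin near the center would be badly wrong, whereas one that aggregates over a region larger than the flipped domain would not.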
In Fig. 4(b), we show losses on test datasets with two different average domain wall sizes as functions of the training epoch. In the later stage of the training, although the training loss decreases slowly, the tripartite information can decrease rather drastically, accompanied by rapid decreases of the losses on these test datasets. Moreover, the larger the average domain wall size is, the later the test loss begins to decrease. This means that the decrease of the tripartite information is associated with the performance improvement on data with large domain structures. It naturally explains why the unitary has to become more scrambled during this stage. Since such data are rare in the training dataset, it also explains why the improvement of performance with respect to the training loss is slow.
Discussion and Outlook. In summary, we apply an information-theoretic measure of the quantum information scrambling, namely the tripartite information, to diagnose the learning process of quantum neural networks. We find strong correlation between this metric and the loss function, and identify a two-stage training dynamics of quantum neural networks. We show that the neural network establishes local correlations in the early stage and builds up global structures in the later stage. We believe this two-stage scenario is applicable to a wide range of quantum machine learning problems.
We would also like to offer some physical insight into this two-stage dynamical process. First, it is reminiscent of the annealing process. For instance, when cooling a spin system towards a ferromagnetic ground state, a fast process is to reach local equilibrium by forming ferromagnetic domains, and a slow process is to remove the domain walls and to reach global equilibrium. Second, the process can also be understood in terms of operator growth in the context of many-body quantum chaos. In fact, the formation of string operators in the magnetization learning task is already discussed above. The time scales at which the two stages end are direct analogs of the dissipation time and the scrambling time there, and the latter is believed to be longer than the former in a generic many-body system.
Finally we believe that the connection between the information scrambling and the quantum neural network is profound. The connection can find broader applications in quantum machine learning much beyond the neural network structure discussed here, including revealing the underlying mechanism of quantum machine learning and guiding quantum machine learning architecture design.
Acknowledgment. We thank Yingfei Gu for discussions. HS thanks IASTU for hosting his visit to Beijing, where key parts of this work were done. PZ acknowledges support from the Walter Burke Institute for Theoretical Physics at Caltech. This work is supported by Beijing Distinguished Young Scientist Program (HZ), MOST under Grant No. 2016YFA0301600 (HZ) and NSFC Grant No. 11734010 (HZ).
- Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning (MIT Press, 2016).
- Benedetti et al. (2017) Marcello Benedetti, John Realpe-Gómez, Rupak Biswas, and Alejandro Perdomo-Ortiz, “Quantum-Assisted Learning of Hardware-Embedded Probabilistic Graphical Models,” Phys. Rev. X 7, 041052 (2017).
- Torrontegui and García-Ripoll (2019) E. Torrontegui and J. J. García-Ripoll, “Unitary quantum perceptron as efficient universal approximator,” EPL (Europhysics Letters) 125, 30004 (2019).
- Benedetti et al. (2019) Marcello Benedetti, Delfina Garcia-Pintos, Oscar Perdomo, Vicente Leyton-Ortega, Yunseong Nam, and Alejandro Perdomo-Ortiz, “A generative modeling approach for benchmarking and training shallow quantum circuits,” npj Quantum Inf. 5, 45 (2019).
- (5) Edward Farhi and Hartmut Neven, “Classification with Quantum Neural Networks on Near Term Processors,” arXiv:1802.06002 .
- McClean et al. (2018) Jarrod R McClean, Sergio Boixo, Vadim N Smelyanskiy, Ryan Babbush, and Hartmut Neven, “Barren plateaus in quantum neural network training landscapes,” Nat. Commun. 9, 4812 (2018).
- Mitarai et al. (2018) K. Mitarai, M. Negoro, M. Kitagawa, and K. Fujii, “Quantum circuit learning,” Phys. Rev. A 98, 032309 (2018).
- Huggins et al. (2019) William Huggins, Piyush Patil, Bradley Mitchell, K Birgitta Whaley, and E Miles Stoudenmire, “Towards quantum machine learning with tensor networks,” Quantum Sci. Technol. 4, 024001 (2019).
- (9) Maria Schuld, Alex Bocharov, Krysta Svore, and Nathan Wiebe, “Circuit-centric quantum classifiers,” arXiv:1804.00633 .
- Grant et al. (2018) Edward Grant, Marcello Benedetti, Shuxiang Cao, Andrew Hallam, Joshua Lockhart, Vid Stojevic, Andrew G Green, and Simone Severini, “Hierarchical quantum classifiers,” npj Quantum Inf. 4, 65 (2018).
- Liu and Wang (2018) Jin-Guo Liu and Lei Wang, “Differentiable learning of quantum circuit Born machines,” Phys. Rev. A 98, 062324 (2018).
- (12) Guillaume Verdon, Jason Pye, and Michael Broughton, “A Universal Training Algorithm for Quantum Deep Learning,” arXiv:1806.09729 .
- Zeng et al. (2019) Jinfeng Zeng, Yufeng Wu, Jin-Guo Liu, Lei Wang, and Jiangping Hu, “Learning and inference on generative adversarial quantum circuits,” Phys. Rev. A 99, 052306 (2019).
- (14) Yuxuan Du, Min-Hsiu Hsieh, Tongliang Liu, and Dacheng Tao, “The Expressive Power of Parameterized Quantum Circuits,” arXiv:1810.11922 .
- (15) Kerstin Beer, Dmytro Bondarenko, Terry Farrelly, Tobias J Osborne, Robert Salzmann, and Ramona Wolf, “Efficient Learning for Deep Quantum Neural Networks,” arXiv:1902.10445 .
- (16) Matthew J. S. Beach, Roger G. Melko, Tarun Grover, and Timothy H. Hsieh, “Making Trotters Sprint: A Variational Imaginary Time Ansatz for Quantum Many-body Systems,” arXiv:1904.00019 .
- Biamonte et al. (2017) Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe, and Seth Lloyd, “Quantum machine learning,” Nature 549, 195–202 (2017).
- Preskill (2018) John Preskill, “Quantum Computing in the NISQ era and beyond,” Quantum 2, 79 (2018).
- Nielsen and Chuang (2010) Michael A. Nielsen and Isaac L. Chuang, Quantum Computation and Quantum Information: 10th Anniversary Edition (Cambridge University Press, 2010).
- Altman (2018) Ehud Altman, “Many-body localization and quantum thermalization,” Nat. Phys. 14, 979–983 (2018).
- Qi (2018) Xiao-Liang Qi, “Does gravity come from quantum information?” Nat. Phys. 14, 984–987 (2018).
- Swingle (2018) Brian Swingle, “Unscrambling the physics of out-of-time-order correlators,” Nat. Phys. 14, 988–990 (2018).
- Larkin and Ovchinnikov (1969) A I Larkin and Yu N Ovchinnikov, “Quasiclassical Method in the Theory of Superconductivity,” Sov. Phys. JETP 28, 1200–1205 (1969).
- (24) Alexei Kitaev, “Hidden correlations in the Hawking radiation and thermal noise,” talk given at the Fundamental Physics Prize Symposium, 2014.
- Shenker and Stanford (2014) Stephen H. Shenker and Douglas Stanford, “Black holes and the butterfly effect,” J. High Energy Phys. 2014, 67 (2014).
- Maldacena et al. (2016) Juan Maldacena, Stephen H. Shenker, and Douglas Stanford, “A bound on chaos,” J. High Energy Phys. 2016, 106 (2016).
- Fan et al. (2017) Ruihua Fan, Pengfei Zhang, Huitao Shen, and Hui Zhai, “Out-of-time-order correlation for many-body localization,” Sci. Bull. 62, 707–711 (2017).
- (28) Ravid Shwartz-Ziv and Naftali Tishby, “Opening the Black Box of Deep Neural Networks via Information,” arXiv:1703.00810 .
- Saxe et al. (2018) Andrew Michael Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan Daniel Tracey, and David Daniel Cox, “On the information bottleneck theory of deep learning,” in International Conference on Learning Representations (2018).
- Goldfeld et al. (2019) Ziv Goldfeld, Ewout Van Den Berg, Kristjan Greenewald, Igor Melnyk, Nam Nguyen, Brian Kingsbury, and Yury Polyanskiy, “Estimating information flow in deep neural networks,” in Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, edited by Kamalika Chaudhuri and Ruslan Salakhutdinov (PMLR, Long Beach, California, USA, 2019) pp. 2299–2308.
- Kitaev and Preskill (2006) Alexei Kitaev and John Preskill, “Topological Entanglement Entropy,” Phys. Rev. Lett. 96, 110404 (2006).
- Hosur et al. (2016) Pavan Hosur, Xiao-Liang Qi, Daniel A. Roberts, and Beni Yoshida, “Chaos in quantum channels,” J. High Energy Phys. 2016, 4 (2016).
- Dita (2003) P Dita, “Factorization of unitary matrices,” J. Phys. A 36, 2781–2789 (2003).
- Reddi et al. (2018) Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar, “On the Convergence of Adam and Beyond,” in International Conference on Learning Representations (2018).
- (35) See Supplemental Material, which includes Ref. PhysRevLett.120.066401, for further results on magnetization learning, results of winding number learning, and details of gradient calculation and measurement.
- (36) For network initializations, we require the initial unitaries to be scrambled enough that the initial tripartite information is sufficiently negative (the threshold is about half of the negative-most value). For training algorithms, we require the algorithms to be gradient-based. For network depths, we require the networks to be not too shallow.