Learning Unitaries by Gradient Descent
We study the hardness of learning unitary transformations in via gradient descent on time parameters of alternating operator sequences. We provide numerical evidence that, despite the non-convex nature of the loss landscape, gradient descent always converges to the target unitary when the sequence contains or more parameters. Rates of convergence indicate a “computational phase transition.” With fewer than parameters, gradient descent converges to a sub-optimal solution, whereas with more than parameters, gradient descent converges exponentially to an optimal solution.
A fundamental task in both quantum computation and quantum control is to determine the minimum amount of resources required to implement a desired unitary transformation. In this paper, we present a simple model that allows us to analyze key aspects of implementing unitaries in the context of both quantum circuits and quantum control. In particular, we implement unitaries using sequences of alternating operators of the form . Each unitary is parameterized by the times . This approach to parameterizing unitaries is the basis for the quantum approximate optimization algorithm (QAOA) farhi2014quantum1; farhi2014quantum2. The acronym QAOA is also used to refer to the phrase “Quantum Alternating Operator Ansatz.” Recently, it has been shown that quantum alternating operator unitaries can perform universal quantum computation lloyd2018quantum. In the infinitesimal time setting, QAOA also encompasses the more general problem of the application of time varying quantum controls rabitz2004quantum; khaneja2005optimal; rabitz2005landscape; rabitz2006topology; chakrabarti2007quantum; moore2008relationship; rabitz2009landscape; brif2010control; riviello2015searching; riviello2017searching; russell2016quantum. In this work, we study the quantum alternating operator formalism as a general framework for performing arbitrary unitary transformations.
We investigate the difficulty of learning Haar random unitaries in using parameterized alternating operator sequences. Here, we find that unsurprisingly, when the number of parameters in the sequence is less than , gradient descent fails to learn the random unitary. Initially, we had expected that because of the highly non-convex nature of the loss landscape, when the number of parameters in the sequence was greater than or equal to – the minimum number of parameters required to specify a unitary matrix – gradient descent would sometimes fail to learn the target unitary. However, our numerical experiments reveal the opposite. When the number of parameters is or greater, gradient descent always finds the target unitary. Moreover, we provide evidence for a “computational phase transition” at the critical point between the under-parameterized and over-parameterized cases where the number of parameters in the sequence equals .
Learning Setting. Suppose we have knowledge of the entries of a unitary and access to the Hamiltonians and . Recent work has provided a constructive approach to build a learning sequence that can perform any target unitary where lloyd2019efficient. In this work, we ask whether optimal learning sequences for performing the target unitary can be obtained by using gradient descent optimization on the parameters of . The matrices are sampled from the Gaussian Unitary Ensemble (GUE) so that the algebra generated by via commutation is with probability one complete in , i.e., the system is controllable rabitz2004quantum; khaneja2005optimal; rabitz2005landscape; rabitz2006topology; chakrabarti2007quantum; moore2008relationship; rabitz2009landscape; brif2010control; riviello2015searching; riviello2017searching; russell2016quantum. The parameters represent the times for which the generators of are applied. We assume we can apply ; equivalently, we can take to be positive or negative. Note that this problem formulation lies in the domain of quantum optimization algorithms such as the Quantum Approximate Optimization Algorithm farhi2016quantum; jiang2017qaoa; zhou2018quantum; gilyen2019optimizing, the Variational Quantum Eigensolver mcclean2016theory; peruzzo2014variational; khatri2019quantumassisted; sharma2019noise, and the Variational Quantum Unsampling carolan2020variational in which one varies the classical parameters in a quantum circuit to minimize some objective function.
In general, the control landscape for learning the unitary is highly non-convex rabitz2004quantum; khaneja2005optimal; rabitz2005landscape; rabitz2006topology; chakrabarti2007quantum; moore2008relationship; rabitz2009landscape; brif2010control; riviello2015searching; riviello2017searching; russell2016quantum. Gradient descent algorithms do not necessarily converge to a globally optimal solution in the parameters of a non-convex space zaheer2018adaptive, and they frequently converge instead to some undesired critical point of the loss function landscape. We study how hard it is to learn an arbitrary unitary with the quantum alternating operator formalism via gradient descent. We quantify the hardness of learning a unitary with the minimum number of parameters required in the sequence to perform the unitary . Since has independent parameters, in general, at least parameters in the sequence are required to learn a unitary within a desired error. Nevertheless, the non-convex loss landscape suggests that it might not be possible to learn an arbitrary with gradient descent using parameters. Our work numerically shows that exactly parameters in the sequence suffice to learn an arbitrary unitary to a desired accuracy.
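As a worked instance of this counting argument, the short sketch below computes the number of independent real parameters of a unitary matrix and the smallest sequence depth that reaches that count. The names `d` (matrix dimension) and `params_per_layer` are notation assumed here for illustration; the sketch also assumes that each layer of the alternating sequence contributes two time parameters, one per Hamiltonian. The value 32 is the dimension used in the experiments reported below.

```python
import math


def unitary_parameter_count(d: int) -> int:
    """Number of independent real parameters of a d x d unitary matrix."""
    return d * d


def minimum_depth(d: int, params_per_layer: int = 2) -> int:
    """Smallest number of layers whose parameters reach the count above.

    Assumption: every layer contributes `params_per_layer` time parameters.
    """
    return math.ceil(unitary_parameter_count(d) / params_per_layer)


d = 32
print(unitary_parameter_count(d), minimum_depth(d))  # 1024 512
```

For dimension 32, this gives 1024 parameters, or 512 layers under the two-parameters-per-layer assumption.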
We also consider the case of learning “shallow” target unitaries of the form where the number of parameters in the target unitary is . For example, the simplest such target unitary is a depth-1 sequence . Such unitaries are, by definition, attainable via a shallow depth alternating operator sequence, and we look to see if it is possible to use gradient descent to obtain a learning sequence of the same depth that approximates the target unitary . That is, we look at the alternating operator version of whether it is possible to learn the unitaries generated by shallow quantum circuits. We find that gradient descent typically requires parameters in the sequence to learn even a depth-1 unitary. This result suggests that gradient descent is not an efficient method to learn low depth unitaries.
Rabitz et al. consider the case of controllable quantum systems with time varying controls, including systems with drift, and show that when the controls are unconstrained (space of controls is essentially infinite dimensional), there are no sub-optimal local minima even though loss landscapes may be non-convex rabitz2004quantum; rabitz2005landscape; russell2016quantum. For example, it has been shown in rabitz2009landscape that non-convexity in the loss landscape of fully controllable quantum systems with infinite dimensional control fields is due to the presence of non-trapping saddle points in the loss landscape. When the sequence of controls is finite dimensional, prior studies sometimes find traps in the control landscape moore2008relationship; riviello2015searching; riviello2017searching. Here, we look at the simplest possible case where the system does not have drift and the space of controls is finite dimensional. Our numerical results show that even in spaces where the dimension of the system is the minimum it can be to attain the desired unitary and the control landscape is highly non-convex, it still contains no sub-optimal local minima and gradient descent obtains the global optimal solution.
We now provide a detailed numerical analysis of the learnability of both arbitrary and shallow depth unitaries using gradient descent optimization.
II Numerical experiments for learning an arbitrary unitary
In this section, we present numerical experiments that aim to learn an arbitrary unitary by constructing a sequence and performing gradient descent on all parameters to minimize the loss function . Here denotes the Frobenius norm. Given access to the entries of a Haar random target unitary , we fix the number of parameters and ask how many gradient descent steps are required to construct the sequence that can learn the target unitary to a given accuracy or loss.
We present numerical evidence that with at least parameters in the sequence , we can learn any selected Haar random unitary . Because of the highly non-convex nature of the loss landscape over the control parameters, we did not expect this result. The details of the numerical analysis are provided below.
We ran experiments for a Haar random target unitary of dimension 32 while varying the parameters in . At each step, we compute the gradient and perform gradient descent with fixed step size.
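The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code used for the reported experiments: the ordering and sign convention of the exponentials, the GUE normalization, the squared Frobenius loss, the finite-difference gradient, the tiny dimension `d = 2`, and all function names are choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_gue(d):
    """Random Hermitian matrix with Gaussian entries (GUE up to normalization)."""
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return (g + g.conj().T) / 2


def haar_unitary(d):
    """Haar-random unitary via QR of a complex Gaussian matrix."""
    z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    q, r = np.linalg.qr(z)
    diag = np.diag(r)
    return q * (diag / np.abs(diag))  # fix column phases so the law is Haar


def make_sequence(a, b):
    """Return params -> product over layers of exp(-i b tau_k) exp(-i a t_k)."""
    (wa, va), (wb, vb) = np.linalg.eigh(a), np.linalg.eigh(b)

    def sequence(params):
        u = np.eye(a.shape[0], dtype=complex)
        for t, tau in np.reshape(params, (-1, 2)):
            exp_a = (va * np.exp(-1j * wa * t)) @ va.conj().T
            exp_b = (vb * np.exp(-1j * wb * tau)) @ vb.conj().T
            u = exp_b @ exp_a @ u
        return u

    return sequence


def frobenius_loss(u, v):
    """Squared Frobenius distance between two matrices (assumed loss)."""
    return np.linalg.norm(u - v) ** 2


def numerical_gradient(f, params, eps=1e-6):
    """Central finite differences; a stand-in for automatic differentiation."""
    grad = np.zeros_like(params)
    for i in range(params.size):
        shift = np.zeros_like(params)
        shift[i] = eps
        grad[i] = (f(params + shift) - f(params - shift)) / (2 * eps)
    return grad


d, n_params, step_size = 2, 4, 0.01  # illustrative values only
sequence = make_sequence(sample_gue(d), sample_gue(d))
target = haar_unitary(d)


def objective(p):
    return frobenius_loss(sequence(p), target)


params = rng.normal(size=n_params)
for _ in range(200):  # gradient descent with a fixed step size
    params = params - step_size * numerical_gradient(objective, params)
print("final loss:", objective(params))
```

Diagonalizing each Hamiltonian once keeps every matrix exponential cheap to evaluate. In practice, an automatic differentiation framework would replace the finite-difference gradient.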
In Fig.(1), we plot the loss as a function of the number of gradient descent steps for learning sequences of varying depth . When the sequence is under-parameterized with parameters, we find that the loss function initially decreases but then plateaus. Thus, in the under-parameterized loss landscape, we find that as expected, with high probability, the gradient descent algorithm reaches a sub-optimal value of the loss which cannot be decreased by further increasing the number of gradient descent steps.
When the number of parameters in is equal to or more, we find that gradient descent always converges to the target unitary – there are apparently no sub-optimal local minima in the loss landscape. As noted above, this result was unexpected given the non-convex nature of the loss landscape. We also find that the rate of convergence grows with the degree of over-parameterization as shown in Fig.(1). At the critical point where the number of parameters , we note the existence of a “computational phase transition.” At this critical point, the learning process converges to the desired target unitary, but the rate of convergence becomes very slow. For each parameter manifold of dimension , we performed ten experiments and each of the experiments has been plotted in Fig.(1).
In Fig.(2), we fit the loss over the first 1000 gradient descent steps (the first 50 steps are excluded) to a power law
where and are constants, and is the number of gradient descent steps. As shown in Fig.(2), the data for the initial 1000 gradient descent steps fits closely to such a power law. However, with the exception of the critical learning sequence with parameters, the performance of gradient descent deviates from a power law fit at later steps. For the under-parameterized case, the gradient descent plateaus at a sub-optimal value of the loss. For the over-parameterized case, the power law transitions to an exponential as the gradient descent approaches the global minimum, which is consistent with the expected quadratic form of the loss function in the vicinity of the global minimum. Fig.(2) shows the exponential fit for the later stages of gradient descent in the over-parameterized setting. The exponential fit takes the form
where , , , and are constants (optimized during the fit), and is the number of gradient descent steps.
The critical case of the sequence with exactly parameters is consistent with a power law rate of convergence to the target unitary during the entire gradient descent process.
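A power law exponent of this kind can be estimated by linear regression in log-log coordinates. The sketch below assumes a generic power law in which the loss equals a constant `c` times the step number raised to the power `-alpha`; this functional form, the names `c` and `alpha`, and the synthetic data are assumptions made for illustration and may differ from the fitted form used for Fig.(2). The step range mirrors the text: the first 1000 steps with the first 50 excluded.

```python
import math


def fit_power_law(steps, losses):
    """Least-squares fit of log(loss) = log(c) - alpha * log(step)."""
    xs = [math.log(s) for s in steps]
    ys = [math.log(loss) for loss in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (c, alpha)


# Synthetic loss curve, NOT data from the experiments.
steps = range(51, 1001)
losses = [3.0 * s ** -1.5 for s in steps]
c, alpha = fit_power_law(steps, losses)
print(c, alpha)
```

On real loss curves, a systematic deviation of the late steps from the fitted line would signal the plateau or the exponential regime described above.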
The initial power law form of the gradient descent is consistent with a loss landscape that obeys the relation and . For example, the case corresponds to a power law of the form . The final exponential form of convergence corresponds to the case , and to a quadratic landscape where . The fitted value of in the initial power law regime is plotted as a function of the number of parameters in Fig.(3). Here, we observe a linear relationship between the power law exponent in Eq. (1) and the number of parameters in – i.e., the larger the degree of over-parameterization, the faster the rate of convergence, and the larger the exponent in the power law.
III Learning shallow-depth unitaries
In this section, we study the learnability of low-depth alternating operator unitaries where . Such unitaries are the alternating operator analogue of shallow depth quantum circuits. As noted above, unitaries of this form are, by definition, obtainable by a learning sequence with depth . We wish to investigate for which values of it is possible to learn the target unitary of depth . One could reasonably hope that such a shallow depth unitary could be learned by performing gradient descent over sequences of depth . We find that this is not the case. Indeed, we find that even to learn a unitary of depth , with high probability, we require a full depth learning sequence of depth or parameters in .
Depth =1 unitaries take the form . In Fig.(4), we present the landscape of the loss function, which is a two-dimensional parametric manifold. Here we attempt to learn the target unitary via a sequence also with two parameters. The loss function landscape is highly non-convex and contains many sub-optimal local traps. Learning the target unitary with far fewer than parameters using gradient descent is guaranteed only when the initial values of the parameters lie in the neighbourhood of the global minimum at and . In unbounded parametric manifolds, such an optimal initialization is generally hard to achieve.
Next, we consider a target unitary with four parameters (). In Fig.(5), we find that when the sequence has parameters, the loss function plateaus with increasing gradient descent steps. This indicates that gradient descent halts at a local minimum of the loss function landscape. The rate of learning improves when or as in the over-parameterized domain. In this setting, the loss function rapidly converges towards the global minimum of the landscape, and the rate of convergence to the target unitary is similar to the over-parameterized case shown in Fig.(2).
Surjectivity in the map from control parameters to the tangent space of the unitary manifold has been shown to be a sufficient condition for constructing loss landscapes with no poor local minima in quantum control settings russell2016quantum. This criterion implies that complete freedom of movement at any point in the unitary manifold is sufficient to guarantee convergence to a global minimum. The under-parameterized setting does not meet this criterion of surjectivity, since infinitesimal variations in the parameters are not sufficient to generate any local infinitesimal change in the unitary manifold of dimension . When the number of control parameters is or greater, the map from controls to unitaries is locally surjective at almost all points of the control space, so that at almost all points, all directions in the space of unitaries can be obtained. Our numerical results suggest that when there are a sufficient number of control parameters to render the system controllable, the control map is locally surjective along the entire path of gradient descent all the way to the global optimum.
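Local surjectivity can be probed numerically by computing the rank of the Jacobian of the control map at a given point. The sketch below is an illustration under assumptions rather than the analysis performed in this work: the ordering of the exponentials, the interleaved parameter layout (even indices for the first Hamiltonian, odd indices for the second), the rank tolerance, and all names are chosen here. The rank can never exceed the number of parameters, nor the dimension of the tangent space, which is the square of the matrix dimension.

```python
import numpy as np


def expm_h(w, v, t):
    """exp(-i H t) from the eigendecomposition H = v diag(w) v^dagger."""
    return (v * np.exp(-1j * w * t)) @ v.conj().T


def jacobian_rank(a, b, params, tol=1e-8):
    """Rank of the map from time parameters to tangent directions U^dagger dU."""
    d = a.shape[0]
    gens = [a, b]
    eig = [np.linalg.eigh(a), np.linalg.eigh(b)]
    # Factor k uses generator a for even k and generator b for odd k.
    factors = [expm_h(*eig[k % 2], p) for k, p in enumerate(params)]
    u = np.eye(d, dtype=complex)
    for f in factors:
        u = f @ u
    columns = []
    for k in range(len(params)):
        du = np.eye(d, dtype=complex)
        for j, f in enumerate(factors):
            if j == k:
                f = -1j * gens[k % 2] @ f  # derivative of factor k
            du = f @ du
        tangent = u.conj().T @ du  # anti-Hermitian tangent vector
        columns.append(np.concatenate([tangent.real.ravel(),
                                       tangent.imag.ravel()]))
    return np.linalg.matrix_rank(np.array(columns).T, tol=tol)


rng = np.random.default_rng(1)


def random_hermitian(d):
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return (g + g.conj().T) / 2


d = 2
a, b = random_hermitian(d), random_hermitian(d)
for n_params in (2, 4, 8):
    print(n_params, jacobian_rank(a, b, rng.normal(size=n_params)))
```

A rank equal to the square of the dimension at a given point indicates that every direction in the unitary manifold is reachable there. A smaller rank indicates that some directions are locally inaccessible.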
We have numerically analysed the hardness of obtaining the optimal control parameters in an alternating operator sequence for learning arbitrary unitaries using gradient descent optimization. For learning a Haar random target unitary in dimensions to a desired accuracy, we find that gradient descent requires at least parameters in an alternating operator sequence. When there are fewer than parameters in the sequence, gradient descent converges to an undesirable minimum of the loss function landscape which cannot be escaped with further gradient descent steps. This is true even for learning shallow-depth alternating operator target unitaries which are the alternating operator analogue of shallow depth quantum circuits.
Gradient descent methods generally guarantee convergence only in convex spaces. The loss function landscape for unitaries is highly non-convex, and when we began this investigation, we did not know whether gradient descent on parameters in the landscape would succeed in the search for a global minimum. Indeed, we expected that gradient descent would not always converge. However, in contrast to our initial expectations, we find that when the number of parameters in the loss function landscape , gradient descent always converges to an optimal global minimum in the landscape. At the critical value of parameters, we observe a “computational phase transition” characterized by a power law convergence to the global optimum.
We thank Milad Marvian, Giacomo de Palma, Zi-Wen Liu, Can Gokler, Dirk Englund, Herschel Rabitz and Yann LeCun for helpful discussions and suggestions. This work was supported by DOE, IARPA, NSF, and ARO.
VI Experiments using Adam optimizer
In addition to performing optimization using simple (vanilla) gradient descent, we performed optimization using the Adam optimizer kingma2014adam, a common optimization method used in deep learning. Adaptive Moment Estimation, or Adam, is an extension of the simple gradient descent algorithm in which each parameter is assigned its own learning rate, which is adaptively computed in every iteration of the algorithm. These updates are computed solely from first order gradients. In contrast, the learning rate is fixed for each parameter in simple gradient descent. For more on the Adam optimizer, the reader is referred to kingma2014adam. The final loss obtained for learning unitary matrices using the Adam optimizer was consistent with that obtained from simple gradient descent. However, the Adam optimizer appears to converge to a final outcome in far fewer steps. The results of our experiments are provided in Fig.(S1). A comparison between the performance of simple gradient descent and Adam can be observed from Fig.(S1) and Fig.(S2).
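The difference between the two update rules can be seen in a small stand-alone sketch. The Adam update below follows the standard form with bias-corrected first and second moment estimates. The toy quadratic loss, the learning rate of 0.1, and the function names are assumptions chosen for illustration, not the settings used in the experiments.

```python
import math


def gd_minimize(grad, params, lr=0.1, steps=1):
    """Simple gradient descent: one fixed learning rate for all parameters."""
    params = list(params)
    for _ in range(steps):
        params = [p - lr * g for p, g in zip(params, grad(params))]
    return params


def adam_minimize(grad, params, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1):
    """Adam: per-parameter step sizes from running gradient moments."""
    params = list(params)
    m = [0.0] * len(params)  # first moment (running mean of gradients)
    v = [0.0] * len(params)  # second moment (running mean of squares)
    for step in range(1, steps + 1):
        for i, g in enumerate(grad(params)):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** step)  # bias correction
            v_hat = v[i] / (1 - beta2 ** step)
            params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params


def toy_grad(p):
    """Gradient of the toy loss x**2 + 10 * y**2 (illustrative only)."""
    return [2 * p[0], 20 * p[1]]


print(gd_minimize(toy_grad, [1.0, 1.0]))
print(adam_minimize(toy_grad, [1.0, 1.0]))
```

After a single step, simple gradient descent moves each coordinate in proportion to its gradient, whereas Adam moves both coordinates by approximately the learning rate regardless of the gradient scale.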
VII Critical points in the under-parameterized models
When learning target unitaries using alternating operator sequences with parameters or more, gradient descent converges to a global minimum of the loss function landscape. When learning with under-parameterized models, we find that gradient descent plateaus at a non-zero loss function value. In the under-parameterized setting, we further explore how the loss function changes over the course of gradient descent by investigating the magnitude of the gradients. In the under-parameterized setting, we find that the magnitude of the gradients can both increase and decrease over the course of gradient descent, suggesting that the path of gradient descent passes in the vicinity of saddle points in the loss landscape. In the over-parameterized setting, the magnitudes of the gradients monotonically decrease with increasing gradient descent steps, suggesting that in this case, the path of gradient descent does not explore saddle points. The results of our findings are presented in Fig.(S3).
VIII Computation Details
All experiments were performed using the Python package PyTorch NEURIPS2019_9015. Experiments were run on a machine equipped with an NVIDIA 2080 Ti GPU and an Intel Core i7-9700K CPU. Calculations were performed with 64-bit floating point precision. The code used to perform the numerical experiments is available upon request.
IX A greedy algorithm
As noted in the text, we find that gradient descent algorithms require parameters in the sequence to learn a low-depth unitary where . This suggests that such low-depth unitaries are intrinsically hard to learn with fewer than parameters using gradient descent. We also considered a simple greedy algorithm for performing a low-depth target unitary . Let . The first step of the greedy algorithm begins with and uses gradient descent to optimize the parameters and . The next step of the algorithm at performs gradient descent starting from the initial values and which are the optimal values obtained in the previous step. The greedy algorithm then continues, and at each step, is incremented by 1. At the th step, the initial starting points for gradient descent are and the remaining parameters are the optimal values obtained at the end of the previous step. We present the pseudocode of the greedy algorithm below.
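The greedy procedure described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: initializing the two new time parameters of each added layer to zero (so that the new layer starts as the identity) is an assumption made here, as are the stopping tolerance, the finite-difference gradient, and all names. The `toy_loss` function is a commuting stand-in, used only to make the control flow runnable without a matrix library. It is not a controllable system.

```python
import cmath


def gradient_descent(loss, params, lr=0.1, steps=300, eps=1e-6):
    """Simple gradient descent with central finite-difference gradients."""
    params = list(params)
    for _ in range(steps):
        grad = []
        for i in range(len(params)):
            up, down = list(params), list(params)
            up[i] += eps
            down[i] -= eps
            grad.append((loss(up) - loss(down)) / (2 * eps))
        params = [p - lr * g for p, g in zip(params, grad)]
    return params


def greedy_learn(loss, max_depth, tol=1e-6, new_value=0.0):
    """Grow the sequence one layer at a time, warm-starting each stage."""
    params = []
    for depth in range(1, max_depth + 1):
        # Keep the previous optimum; append two new times (assumed zero).
        params = gradient_descent(loss, params + [new_value, new_value])
        if loss(params) < tol:
            break
    return params, depth


def toy_loss(params, target=(0.3, -0.2)):
    """Commuting stand-in: the summed times act as two independent phases."""
    phases = (sum(params[0::2]), sum(params[1::2]))
    return sum(abs(cmath.exp(-1j * p) - cmath.exp(-1j * q)) ** 2
               for p, q in zip(phases, target))


params, depth = greedy_learn(toy_loss, max_depth=3)
print(depth, toy_loss(params))
```

Warm-starting each stage from the previous optimum means a new layer can only help, since setting its times to zero reproduces the previous sequence exactly.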
We investigated the performance of the greedy algorithm for systems of up to five qubits in a restricted area of the loss function landscape. In particular, we considered the parameters in a low-depth target unitary . In this setting, we find that with a probability that decreases as a function of the number of qubits, the greedy algorithm can construct a sequence where . In contrast, gradient descent experiments require parameters in to learn . That is, the greedy algorithm does indeed sometimes find low-depth target unitaries even in cases where simple gradient descent on under-parameterized sequences fails. For a system of five qubits, the success probability of the greedy algorithm to learn a target unitary of depth (i.e., with 4 parameters) using fewer than 50 parameters in is around 0.1.