# Speeding-up the decision making of a learning agent using an ion trap quantum processor

## Abstract

We report a proof-of-principle experimental demonstration of the quantum speed-up for learning agents utilizing a small-scale quantum information processor based on radiofrequency-driven trapped ions. The decision-making process of a quantum learning agent within the projective simulation paradigm for machine learning is implemented in a system of two qubits. The latter are realized using hyperfine states of two frequency-addressed atomic ions exposed to a static magnetic field gradient. We show that the deliberation time of this quantum learning agent is quadratically improved with respect to comparable classical learning agents. The performance of this quantum-enhanced learning agent highlights the potential of scalable quantum processors taking advantage of machine learning.

wunderlich@physik.uni-siegen.de

## Introduction

The past decade has seen the parallel advance of two research areas, quantum computation[1] and artificial intelligence[2], from abstract theory to practical applications and commercial use. Quantum computers, operating on the basis of information coherently encoded in superpositions of states that could be considered classical bit values, hold the promise of exploiting quantum advantages to outperform classical algorithms, e.g., for searching databases[3], factoring numbers[4], or even for precise parameter estimation[5]. At the same time, artificial intelligence and machine learning have become integral parts of modern automated devices using classical processors[7]. Despite this seemingly simultaneous emergence and promise to shape future technological developments, the overlap between these areas still offers a number of unexplored problems[11]. It is hence of fundamental and practical interest to determine how quantum information processing and autonomously learning machines can mutually benefit from each other.

Within the area of artificial intelligence, a central component of modern applications is the learning paradigm of an agent interacting with an environment[12], illustrated in Figure 1 (a), which is usually formalized as so-called reinforcement learning. This entails receiving perceptual input and being able to react to it in different ways. The learning aspect is manifest in the reinforcement of the connections between the inputs and actions, where the correct association is (often implicitly) specified by a reward mechanism, which may be external to the agent. In this very general context, an approach to explore the intersection of quantum computing and artificial intelligence is to equip autonomous learning agents with quantum processors for their deliberation procedure^{1}.

However, a recent model[15] for learning agents based on projective simulation (PS)[13] allows for a generic speed-up in the agent’s deliberation time during each individual interaction. This quantum improvement in the reaction speed has been established within the reflecting projective simulation (RPS) variant of PS[15]. There, the desired actions of the agent are chosen according to a specific probability distribution that can be modified during the learning process. This is of particular relevance to adapt to rapidly changing environments[15], as we shall elaborate on in the next section. For this task, the deliberation time of classical RPS agents is proportional to the quantities $1/\delta$ and $1/\epsilon$. These characterize the time needed to generate the specified distribution in the agent’s internal memory and the time to sample a suitable (e.g., rewarded rather than unrewarded) action from it, respectively. A quantum RPS (Q-RPS) agent, in contrast, is able to obtain such an action quadratically faster, i.e., within a time of the order $1/\sqrt{\delta\epsilon}$ (see Methods).

Here, we report on the first proof-of-principle experimental demonstration of a quantum-enhanced reinforcement learning system, complementing recent experimental work in the context of (un)supervised learning [16]. We implement the deliberation process of an RPS learning agent in a system of two qubits that are encoded in the energy levels of one trapped atomic ion each. Within experimental uncertainties, our results confirm the agent’s action output according to the desired distributions and within deliberation times that are quadratically improved with respect to comparable classical agents. This laboratory demonstration of speeding up a learning agent’s deliberation process can be seen as the first experiment combining novel concepts from machine learning with the potential of ion trap quantum computers where complete quantum algorithms have been demonstrated [19] and feasible concepts for scaling up [23] are vigorously pursued.

## Experimental Implementation of Rank-One RPS

The proof-of-principle experiment that we report in this paper demonstrates the quantum speed-up of quantum-enhanced learning agents. That is, we are able to empirically confirm both the quadratically improved scaling $O(1/\sqrt{\epsilon})$ for the average number of calls of the diffusion operator before sampling one of the desired actions (see Methods), and the correct output according to the tail of the stationary distribution. Here, $\epsilon$ denotes the initial probability of finding a flagged action within the stationary distribution $\alpha = (a_i)$. The tail is defined as the first $n$ components of $\alpha$, which belong to the flagged actions. Correct output means that $b_i/b_j = a_i/a_j$, where $b_i$ denotes the final probability that the agent obtains the flagged action labeled $i$. Note that the Q-RPS algorithm enhances the overall probability of obtaining a flagged action such that

$$\tilde{\epsilon} = \sum_{i=1}^{n} b_i > \epsilon = \sum_{i=1}^{n} a_i,$$

whilst maintaining the relative probabilities of the flagged actions according to the tail of $\alpha$, as illustrated in Figure 1 (b).
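The two properties just stated, a boosted total probability for flagged actions and unchanged relative probabilities among them, can be checked with a small state-vector sketch. The code below is an illustrative simulation and not the experimental gate sequence: the function names, the basis ordering (00, 01, 10, 11 with 00 and 01 taken as flagged), and the use of real, positive amplitudes of a two-qubit product state are assumptions made for this example.

```python
import math


def stationary_amplitudes(eps, ratio):
    """Real amplitudes over the basis (00, 01, 10, 11).

    Assumptions: states 00 and 01 are flagged, eps is their total initial
    probability, and ratio = a00 / a01. The state is a product of two
    single-qubit states, as obtained from two single-qubit rotations.
    """
    c1, s1 = math.sqrt(eps), math.sqrt(1.0 - eps)   # qubit 1 fixes eps
    c2 = math.sqrt(ratio / (1.0 + ratio))           # qubit 2 fixes the ratio
    s2 = math.sqrt(1.0 / (1.0 + ratio))
    return [c1 * c2, c1 * s2, s1 * c2, s1 * s2]


def diffusion_step(state, alpha, flagged=(0, 1)):
    """One diffusion step: reflect over flagged actions, then over alpha."""
    # Reflection over the flagged actions: flip the sign of their amplitudes.
    state = [-x if i in flagged else x for i, x in enumerate(state)]
    # Reflection over the stationary state: 2|alpha><alpha| - identity.
    overlap = sum(a * x for a, x in zip(alpha, state))
    return [2.0 * overlap * a - x for a, x in zip(alpha, state)]


def output_probabilities(eps, ratio, k):
    """Probabilities of the four basis states after k diffusion steps."""
    alpha = stationary_amplitudes(eps, ratio)
    state = list(alpha)
    for _ in range(k):
        state = diffusion_step(state, alpha)
    return [x * x for x in state]


if __name__ == "__main__":
    probs = output_probabilities(eps=0.05, ratio=2.0, k=3)
    print("total flagged probability:", probs[0] + probs[1])
    print("output ratio b00/b01:", probs[0] / probs[1])
```

Both reflections act within the plane spanned by the flagged and non-flagged parts of the stationary state, which is why the ratio among flagged actions is unchanged in this idealized model while their total weight grows.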

For the implementation we hence need at least a three-dimensional Hilbert space, which we realize in our experiment using two qubits encoded in the energy levels of two trapped ions (see the experimental setup section): two states to represent two different flagged actions (represented in our experiment by $|00\rangle$ and $|01\rangle$), and at least one additional state for all non-flagged actions ($|10\rangle$ and $|11\rangle$ in our experiment). The preparation of the stationary state $|\alpha\rangle$ is implemented by

where $R_j(\theta, \phi)$ is a single-qubit rotation on qubit $j$ with rotation angle $\theta$ and phase $\phi$, i.e.,

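Here, $X_j$, $Y_j$, and $Z_j$ denote the usual Pauli operators of qubit $j$. The total probability $\epsilon = a_{00} + a_{01}$ for a flagged action within the stationary distribution is then determined by the rotation angle $\theta_1$ of the first qubit via

$$\epsilon = \cos^2(\theta_1/2), \tag{2}$$

whereas the rotation angle $\theta_2$ of the second qubit determines the relative probabilities of obtaining one of the flagged actions via

$$\frac{a_{00}}{a_{01}} = \frac{1}{\tan^2(\theta_2/2)}. \tag{3}$$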
The reflection over the flagged actions is here given by a rotation, defined by , with rotation angle for the first qubit,

The reflection over the stationary distribution can be performed by a combination of single-qubit rotations determined by the preparation angles $\theta_1$ and $\theta_2$ and a CNOT gate given by

which can be understood as two calls to the preparation of $|\alpha\rangle$ (once in terms of its inverse) supplemented by fixed single-qubit operations[26]. The total gate sequence for a single diffusion step (consisting of a reflection over the flagged actions followed by a reflection over the stationary distribution) can hence be decomposed into single-qubit rotations and CNOT gates and is shown in Figure 2. The speed-up of the rank-one Q-RPS algorithm with respect to a classical RPS agent manifests itself in a quadratically smaller average number of calls to the preparation of $|\alpha\rangle$ (or, equivalently, to the diffusion operator) until a flagged action is sampled. Since the final probability of obtaining a desired action is $\tilde{\epsilon}$, we require $1/\tilde{\epsilon}$ samples on average, each of which is preceded by the initial preparation of $|\alpha\rangle$ and $k$ diffusion steps. Counting one call for the initial preparation and two calls per diffusion step, the average number of uses of the preparation to sample correctly is hence $C = (2k+1)/\tilde{\epsilon}$, which we refer to as the *cost* in this paper. In the following, it is this functional relationship between the cost $C$ and the initial probability $\epsilon$ of a flagged action that we put to the test, along with the predicted ratio of occurrence of the two flagged actions.
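The cost described above can be compared with a classical sampling cost in a few lines. This is a minimal sketch under stated assumptions: it uses the standard amplitude-amplification relation, in which an initial flagged probability `eps` becomes $\sin^2[(2k+1)\arcsin\sqrt{\epsilon}]$ after `k` diffusion steps, and it takes `1/eps` as the classical average number of samples. The function names and the rule for choosing `k` are illustrative and are not taken from the experiment.

```python
import math


def amplified_probability(eps, k):
    """Flagged probability after k ideal diffusion steps (assumed model)."""
    theta = math.asin(math.sqrt(eps))
    return math.sin((2 * k + 1) * theta) ** 2


def quantum_cost(eps, k):
    """(2k+1) preparation calls per sample, divided by the success probability."""
    return (2 * k + 1) / amplified_probability(eps, k)


def classical_cost(eps):
    """Average number of classical samples until a flagged action appears."""
    return 1.0 / eps


def optimal_steps(eps):
    """Assumed rule: pick k so that (2k+1)*theta is as close as possible to pi/2."""
    theta = math.asin(math.sqrt(eps))
    return max(0, round(math.pi / (4.0 * theta) - 0.5))


if __name__ == "__main__":
    for eps in (0.2, 0.05, 0.01):
        k = optimal_steps(eps)
        print(eps, k, quantum_cost(eps, k), classical_cost(eps))
```

For small `eps`, the number of steps chosen this way grows like $1/\sqrt{\epsilon}$, so the quantum cost in this idealized model follows the same scaling, while the classical cost grows like $1/\epsilon$.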

### The experimental setup

Two $^{171}$Yb$^{+}$ ions are confined in a linear Paul trap with axial and radial trap frequencies of 2 kHz and kHz, respectively. After Doppler cooling, the two ions form a linear Coulomb crystal, which is exposed to a static magnetic field gradient of T/m, generated by a pair of permanent magnets. The ion-ion spacing in this configuration is approximately m. Magnetic gradient induced coupling (MAGIC) between ions results in an adjustable qubit interaction mediated by the common vibrational modes of the Coulomb crystal[27]. In addition, qubit resonances are individually shifted as a result of this gradient and become position dependent. This makes the qubits distinguishable and addressable by their frequency of resonant excitation. The addressing frequency separation for this two-ion system is about MHz. All coherent operations are performed using radio frequency (RF) radiation near GHz, matching the respective qubit resonances[28]. A more detailed description of the experimental setup can be found in Ref.[27].

The qubits are encoded in the hyperfine manifold of each ion’s ground state, representing an effective spin system. The qubit states $|0\rangle$ and $|1\rangle$ are represented by the energy levels and , respectively. The ions are Doppler cooled on the resonance with laser light near nm. Optical pumping into long-lived meta-stable states is prevented using laser light near nm and nm. The vibrational excitation of the Doppler-cooled ions is further reduced by employing RF sideband cooling for both the center-of-mass mode and the stretch mode. This leads to a mean vibrational quantum number of for both modes. The ions are then initialized in the qubit state $|0\rangle$ by state-selective optical pumping with a GHz red-shifted Doppler-cooling laser on the resonance.

The desired qubit states are prepared by applying an RF pulse resulting in a coherent qubit rotation with precisely defined rotation angle and phase (Equation 1). The required number of diffusion steps is then applied to both qubits, using appropriate single-qubit rotations and a two-qubit ZZ-interaction given by

which is directly realizable with MAGIC[27]. A CNOT gate can then be performed via

The required number of single-qubit gates is reduced by combining appropriate single-qubit rotations from the reflection over the flagged actions and the reflection over the stationary distribution (see Figure 2). Thus, we can simplify the algorithm to

as shown in Fig. ? of Methods.
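As a generic illustration of how a ZZ-type interaction can yield a CNOT gate, the following sketch checks a textbook identity numerically. It is not the pulse sequence used in the experiment: the interaction angle of $\pi/4$, the sign and phase conventions, the use of Hadamard gates on the target qubit, and all function names are assumptions made for this example.

```python
import cmath
import math

# Z eigenvalues of qubit 1 and qubit 2 for the basis order 00, 01, 10, 11.
Z1 = [1, 1, -1, -1]
Z2 = [1, -1, 1, -1]


def diagonal(entries):
    """Build a diagonal matrix from a list of entries."""
    n = len(entries)
    return [[entries[i] if i == j else 0j for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def zz_interaction(phi):
    """exp(i * phi * Z1 Z2), diagonal in the computational basis."""
    return diagonal([cmath.exp(1j * phi * z1 * z2) for z1, z2 in zip(Z1, Z2)])


def local_z_rotations(phi):
    """exp(-i * phi * Z1) exp(-i * phi * Z2): the same z rotation on both qubits."""
    return diagonal([cmath.exp(-1j * phi * (z1 + z2)) for z1, z2 in zip(Z1, Z2)])


def cnot_from_zz():
    """CNOT (control: qubit 1, target: qubit 2) built around a ZZ interaction."""
    s = 1.0 / math.sqrt(2.0)
    # Hadamard on qubit 2 only (identity on qubit 1).
    hadamard_2 = [[s, s, 0, 0], [s, -s, 0, 0], [0, 0, s, s], [0, 0, s, -s]]
    # Controlled-Z = global phase * local z rotations * ZZ interaction.
    global_phase = cmath.exp(1j * math.pi / 4)
    cz = matmul(local_z_rotations(math.pi / 4), zz_interaction(math.pi / 4))
    cz = [[global_phase * x for x in row] for row in cz]
    # Conjugating controlled-Z with Hadamards on the target gives CNOT.
    return matmul(hadamard_2, matmul(cz, hadamard_2))


if __name__ == "__main__":
    for row in cnot_from_zz():
        print([round(abs(x), 6) for x in row])
```

The point of the sketch is only that the entangling part of the gate is a single ZZ evolution; everything else consists of single-qubit operations.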

During the evolution time of ms for each diffusion step both qubits are protected from decoherence by applying universally robust (UR) dynamical decoupling (DD) pulses[30]. The complete pulse sequence for the experiment reported here can be found in Fig. ? of Methods.

Finally, projective measurements on both qubits are performed in the computational basis by scattering laser light near nm on the transition, and detecting spatially resolved resonance fluorescence using an electron multiplying charge coupled device (EMCCD) to determine the relative frequencies for obtaining the states $|00\rangle$, $|01\rangle$, $|10\rangle$, and $|11\rangle$, respectively.

### Results

As discussed above, our goal is to test the two characteristic features of rank-one Q-RPS: (i) the scaling of the average cost $C$ with the initial probability $\epsilon$ of a flagged action, and (ii) the sampling ratio for the different flagged actions. For the former, we expect a scaling of $O(1/\sqrt{\epsilon})$, while we expect the ratio of the number of occurrences of the two actions to be maintained with respect to the relative probabilities given by the stationary distribution. Therefore, our first set of measurements studies the behavior of the cost $C$ as a function of the total initial probability $\epsilon$. The second set of measurements studies the behavior of the output probability ratio $b_{00}/b_{01}$ as a function of the input probability ratio $a_{00}/a_{01}$.

For the former, a series of measurements is performed for different values of $\epsilon$ corresponding to different numbers $k$ of diffusion steps (up to $k = 7$) after the initial state preparation. To obtain the cost $C = (2k+1)/\tilde{\epsilon}$, where $\tilde{\epsilon} = b_{00} + b_{01}$, we measure the probabilities $b_{00}$ and $b_{01}$ after $k$ diffusion steps and repeat the experiment 1600 times for fixed $\epsilon$. The average cost is then plotted against $\epsilon$ as shown in Figure 3. The experimental data show that the cost decreases with $\epsilon$ as $O(1/\sqrt{\epsilon})$, as desired. This is in good agreement with the behavior expected for the ideal Q-RPS algorithm. In the range of chosen probabilities $\epsilon$, the experimental result of Q-RPS outperforms the classical RPS, as shown in Figure 3. Therefore, we demonstrate that the experimental efficiency is already good enough not only to obtain improved scaling, but also to outperform the classical algorithm, despite the offset in the cost function and the finite precision of the quantum algorithm. The deviation from the ideal behavior is attributed to a small detuning of the RF pulses implementing coherent operations, as we discuss in the Supplementary Materials.
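How a cost value and its statistical uncertainty follow from repeated measurements can be sketched as below. The counts in the usage example are hypothetical placeholders rather than measured data, and the binomial error model with first-order error propagation is an assumption; the analysis actually used for Figure 3 may differ.

```python
import math


def estimate_cost(n_flagged, n_shots, k):
    """Estimate the cost (2k+1)/p and its uncertainty from counts.

    n_flagged: number of runs that returned one of the flagged actions
    n_shots:   total number of repetitions (1600 in the text)
    k:         number of diffusion steps
    """
    p = n_flagged / n_shots                        # estimated success probability
    sigma_p = math.sqrt(p * (1.0 - p) / n_shots)   # binomial standard error
    cost = (2 * k + 1) / p
    sigma_cost = cost * sigma_p / p                # first-order error propagation
    return p, sigma_p, cost, sigma_cost


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(estimate_cost(n_flagged=1440, n_shots=1600, k=3))
```

With 1600 repetitions, the statistical error on a probability is below about 0.0125 in this model, so systematic effects such as pulse detuning can dominate the deviation from the ideal curve.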

For the second set of measurements, we select a few calculated probabilities $a_{00}$ and $a_{01}$ in order to obtain different values of the input ratio $a_{00}/a_{01}$ between 0 and 2, whilst keeping in a range between and . For these probabilities $a_{00}$ and $a_{01}$, the corresponding rotation angles $\theta_1$ and $\theta_2$ of the RF pulses intended for preparation are extracted using Equation 2 and Equation 3. We then perform the Q-RPS algorithm for the specific choices of $\theta_1$ and $\theta_2$ and repeat it 1600 times to estimate the probabilities $b_{00}$ and $b_{01}$. We finally obtain the output ratio $b_{00}/b_{01}$, which is plotted against the input ratio $a_{00}/a_{01}$ in Figure 4. The experimental data follow a straight line with an offset from the behavior expected for an ideal Q-RPS agent. The slopes of the two fitted linear functions agree within their respective errors, showing that the deviation of the output ratio from the ideal result is independent of the number of diffusion steps. In addition, this indicates that this deviation is not caused by the quantum algorithm itself, but by the initial state preparation and/or by the final measurement process, where such a deviation can be caused by an asymmetry in the detection fidelity. Indeed, the observed deviation is well explained by a typical asymmetry in the detection fidelity of 3% as encountered in the measurements presented here. This implies reliability of the quantum algorithm also for a larger number of diffusion steps. A detailed discussion of experimental sources of error is given in the Supplementary Materials.

## Conclusion

We have investigated a quantum-enhanced deliberation process of a learning agent implemented in an ion trap quantum processor. Our approach is centered on the projective simulation[13] model for reinforcement learning. Within this paradigm, the decision-making procedure is cast as a stochastic diffusion process, that is, a (classical or quantum) random walk in a representation of the agent’s memory.

The classical PS framework can be used to solve standard textbook problems in reinforcement learning[31], and has recently been applied in advanced robotics[34], adaptive quantum computation[35], as well as in the machine-generated design of quantum experiments[36]. We have focused on reflecting projective simulation[15], an advanced variant of the PS model based on “mixing” (see Methods), where the deliberation process allows for a quantum speed-up of Q-RPS agents with respect to their classical counterparts. In particular, we have considered the interesting special case of rank-one Q-RPS. This provides the advantage of the speed-up offered by the mixing-based approach, but is also in one-to-one correspondence with the hitting-based basic PS using two-layered networks, which has been applied in classical task environments[31].

In a proof-of-principle experimental demonstration, we verify that the deliberation process of the quantum learning agent is quadratically faster compared to that of a classical learning agent. The experimental uncertainties in the reported results, which are in excellent agreement with a detailed model, do not interfere with this genuine quantum advantage in the agent’s deliberation time. We achieve results for the cost $C$ for up to 7 diffusion steps, corresponding to an initial probability $\epsilon = 0.01$ to choose a flagged action. The systematic variation of the ratio between the input probabilities $a_{00}$ and $a_{01}$ for flagged actions, and the measurement of the ratio between the learning agent’s output probabilities $b_{00}$ and $b_{01}$ as a function of $a_{00}/a_{01}$, show that the quantum algorithm is reliable independent of the number of diffusion steps.

This experiment highlights the potential of a quantum computer in the field of quantum-enhanced learning and artificial intelligence. A practical advantage, of course, will become evident once larger percept spaces and general rank-$N$ Q-RPS are employed. Such extensions are, from the theory side, unproblematic given that the modularized nature of the algorithm makes it scalable. An experimental realization of such large-scale quantum-enhanced learning will be feasible with the implementation of scalable quantum computer architectures. Meanwhile, all essential elements of Q-RPS have been successfully demonstrated in the proof-of-principle experiment reported here.

## Acknowledgments

T. S., S. W., G. S. G. and C. W. acknowledge funding from Deutsche Forschungsgemeinschaft and from Bundesministerium für Bildung und Forschung (FK 16KIS0128). G. S. G. also acknowledges support from the European Commission’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement number 657261. H. J. B. and N. F. acknowledge support from the Austrian Science Fund (FWF) through Grants No. SFB FoQuS F4012 and the START project Y879-N27, respectively.

### Footnotes

- Other approaches, which we will not discuss further here, concern models where the environment and the agent’s interaction with it may be of quantum mechanical nature as well[14].
