# Speeding-up the decision making of a learning agent using an ion trap quantum processor

###### Abstract

We report a proof-of-principle experimental demonstration of the quantum speed-up for learning agents utilizing a small-scale quantum information processor based on radiofrequency-driven trapped ions. The decision-making process of a quantum learning agent within the projective simulation paradigm for machine learning is implemented in a system of two qubits. The latter are realized using hyperfine states of two frequency-addressed atomic ions exposed to a static magnetic field gradient. We show that the deliberation time of this quantum learning agent is quadratically improved with respect to comparable classical learning agents. The performance of this quantum-enhanced learning agent highlights the potential of scalable quantum processors taking advantage of machine learning.

## Introduction

The past decade has seen the parallel advance of two research areas — quantum computation Nielsen and Chuang (2000) and artificial intelligence Russell and Norvig (2003) — from abstract theory to practical applications and commercial use. Quantum computers, operating on the basis of information coherently encoded in superpositions of states that could be considered classical bit values, hold the promise of exploiting quantum advantages to outperform classical algorithms, e.g., for searching databases Grover (1996), factoring numbers Shor (1994), or even for precise parameter estimation Giovannetti et al. (2004); Friis et al. (2017). At the same time, artificial intelligence and machine learning have become integral parts of modern automated devices using classical processors Lim et al. (2017); Silver et al. (2016); Mnih et al. (2015); Schaeffer et al. (2007). Despite this seemingly simultaneous emergence and promise to shape future technological developments, the overlap between these areas still offers a number of unexplored problems Biamonte et al. (2016). It is hence of fundamental and practical interest to determine how quantum information processing and autonomously learning machines can mutually benefit from each other.

Within the area of artificial intelligence, a central component of modern applications is the learning paradigm of an agent interacting with an environment Sutton and Barto (1998); Russell and Norvig (2003); Briegel and De las Cuevas (2012), illustrated in Fig. 1 (a), which is usually formalized as so-called reinforcement learning. This entails receiving perceptual input and being able to react to it in different ways. The learning aspect is manifest in the reinforcement of the connections between the inputs and actions, where the correct association is (often implicitly) specified by a reward mechanism, which may be external to the agent. In this very general context, an approach to explore the intersection of quantum computing and artificial intelligence is to equip autonomous learning agents with quantum processors for their deliberation procedure. (Other approaches, which we will not discuss further here, concern models where the environment, and the agent's interaction with it, may be of quantum mechanical nature as well Dunjko et al. (2016).)
That is, an agent chooses its reactions to perceptual input by way of quantum algorithms or quantum random walks. The agent's learning speed can then be quantified in terms of the average number of interactions with the environment until targeted behavior (reactions triggering a reward) is reproduced by the agent with a desired efficiency. This learning speed cannot generically be improved by incorporating quantum technologies into the agent's design Dunjko et al. (2016).

However, a recent model Paparo et al. (2014) for learning agents based on projective simulation (PS) Briegel and De las Cuevas (2012) allows for a generic speed-up in the agent's deliberation time during each individual interaction. This quantum improvement in the reaction speed has been established within the reflecting projective simulation (RPS) variant of PS Paparo et al. (2014). There, the desired actions of the agent are chosen according to a specific probability distribution that can be modified during the learning process. This is of particular relevance for adapting to rapidly changing environments Paparo et al. (2014), as we shall elaborate in the next section. For this task, the deliberation time of classical RPS agents is proportional to the quantities $1/\delta$ and $1/\epsilon$. These characterize the time needed to generate the specified distribution in the agent's internal memory and the time to sample a suitable (e.g., a rewarded rather than an unrewarded) action from it, respectively. A quantum RPS (Q-RPS) agent, in contrast, is able to obtain such an action quadratically faster, i.e., within a time of the order $1/\sqrt{\delta\epsilon}$ (see Methods).
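As a back-of-the-envelope illustration of this scaling (with made-up values for the spectral gap $\delta$ and the sampling probability $\epsilon$, not numbers taken from the experiment):

```python
import math

# Illustrative values only: delta is the spectral gap of the Markov chain,
# eps the probability of sampling a suitable action.
delta, eps = 0.01, 0.01

classical_time = 1 / (delta * eps)          # mixing times sampling, ~ 10^4
quantum_time = 1 / math.sqrt(delta * eps)   # quadratically faster, ~ 10^2

assert math.isclose(classical_time, quantum_time ** 2)
```

The quadratic relation holds for any choice of $\delta$ and $\epsilon$; it becomes most dramatic when both are small.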

Here, we report on the first proof-of-principle experimental demonstration of a quantum-enhanced reinforcement learning system, complementing recent experimental work in the context of (un)supervised learning Ristè et al. (2017); Li et al. (2015); Cai et al. (2015). We implement the deliberation process of an RPS learning agent in a system of two qubits that are encoded in the energy levels of one trapped atomic ion each. Within experimental uncertainties, our results confirm the agent’s action output according to the desired distributions and within deliberation times that are quadratically improved with respect to comparable classical agents. This laboratory demonstration of speeding up a learning agent’s deliberation process can be seen as the first experiment combining novel concepts from machine learning with the potential of ion trap quantum computers where complete quantum algorithms have been demonstrated Hanneke et al. (2010); Monz et al. (2016); Piltz et al. (2016); Debnath et al. (2016) and feasible concepts for scaling up Kielpinski et al. (2002); Monroe et al. (2014); Lekitsch et al. (2017) are vigorously pursued.

## Experimental Implementation of Rank-One RPS

The proof-of-principle experiment that we report in this paper demonstrates the quantum speed-up of quantum-enhanced learning agents. That is, we are able to empirically confirm both the quadratically improved scaling $O(1/\sqrt{\epsilon})$ of the average number of calls of the diffusion operator before sampling one of the desired actions (see Methods), and the correct output according to the tail of the stationary distribution $\pi$. Here, $\epsilon$ denotes the initial probability of finding a flagged action within the stationary distribution $\pi$. The tail is defined as the first $k$ components $\pi_i$ (with $i = 1, \ldots, k$) of $\pi$, corresponding to the flagged actions. The latter means that $p_i/p_j = \pi_i/\pi_j$ for all $i, j \leq k$, where $p_i$ denotes the final probability that the agent obtains the flagged action labeled $i$. Note that the Q-RPS algorithm enhances the overall probability of obtaining a flagged action such that

$$\sum_{i=1}^{k} p_i \approx 1, \qquad (1)$$

whilst maintaining the relative probabilities of the flagged actions according to the tail of $\pi$, as illustrated in Fig. 1 (b).
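The content of Eq. (1) together with the tail condition can be illustrated with plain probabilities (toy numbers, not values from the experiment): the ideal algorithm renormalizes the flagged tail of $\pi$ while leaving its internal ratios untouched.

```python
# Toy stationary distribution over four actions; the first k = 2 are flagged.
pi = [0.02, 0.01, 0.47, 0.50]
k = 2
eps = sum(pi[:k])                  # initial probability of a flagged action

# Ideal Q-RPS output: the flagged tail, renormalized to unit probability
p = [x / eps for x in pi[:k]]

assert abs(sum(p) - 1.0) < 1e-9                  # Eq. (1)
assert abs(p[0] / p[1] - pi[0] / pi[1]) < 1e-9   # relative probabilities kept
```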

For the implementation we hence need an at least three-dimensional Hilbert space, which we realize in our experiment using two qubits encoded in the energy levels of two trapped ions (see the experimental setup section): two states to represent two different flagged actions ($|10\rangle$ and $|11\rangle$ in our experiment), and at least one additional state for the non-flagged actions ($|00\rangle$ and $|01\rangle$ in our experiment). The preparation of the stationary state $|s\rangle$ is implemented by

$$|s\rangle = U_s |00\rangle = U^{(1)}(\theta_1, \phi_1)\, U^{(2)}(\theta_2, \phi_2)\, |00\rangle, \qquad (2)$$

where $U^{(i)}(\theta_i, \phi_i)$ is a single-qubit rotation on qubit $i$, i.e.,

$$U^{(i)}(\theta, \phi) = \exp\!\left[-\,i\,\tfrac{\theta}{2}\left(\cos\phi\, X_i + \sin\phi\, Y_i\right)\right]. \qquad (3)$$

Here, $X_i$, $Y_i$, and $Z_i$ denote the usual Pauli operators of qubit $i$. The total probability $\epsilon$ for a flagged action within the stationary distribution is then determined by $\theta_1$ via

$$\epsilon = \pi_1 + \pi_2 = \sin^2(\theta_1/2), \qquad (4)$$

whereas $\theta_2$ determines the relative probabilities of obtaining one of the flagged actions via

$$\frac{\pi_2}{\pi_1} = \tan^2(\theta_2/2). \qquad (5)$$

The reflection over the flagged actions is here given by a $z$-rotation, defined by $R_z(\alpha) = \exp(-i \alpha Z/2)$, with rotation angle $\alpha = \pi$ for the first qubit,

$$\mathrm{ref}_{\mathcal{F}} = R_z^{(1)}(\pi). \qquad (6)$$

The reflection over the stationary distribution can be performed by a combination of single-qubit rotations determined by $\theta_i$ and $\phi_i$ and a CNOT gate, and is given by

$$\mathrm{ref}_{s} = U_s \left(2\,|00\rangle\langle 00| - \mathbb{1}\right) U_s^{\dagger}, \qquad (7)$$

which can be understood as two calls to $U_s$ (one of them in terms of $U_s^{\dagger}$) supplemented by fixed single-qubit operations Dunjko et al. (2015). The total gate sequence for a single diffusion step (consisting of a reflection over the flagged actions followed by a reflection over the stationary distribution) can hence be decomposed into single-qubit rotations and CNOT gates and is shown in Fig. 2. The speed-up of the rank-one Q-RPS algorithm w.r.t. a classical RPS agent manifests itself in a quadratically smaller average number of calls to $U_s$ (or, equivalently, to the diffusion operator $D = \mathrm{ref}_s\, \mathrm{ref}_{\mathcal{F}}$) until a flagged action is sampled. Since the final probability of obtaining a desired action is $p = p_1 + p_2$, we require $1/p$ samples on average, each of which is preceded by the initial preparation of $|s\rangle$ and $n$ diffusion steps. The average number of uses of $U_s$ needed to sample correctly is hence $C = (2n+1)/p$, which we refer to as the 'cost' in this paper. In the following, it is this functional relationship between $C$ and $\epsilon$ that we put to the test, along with the predicted ratio of occurrence of the two flagged actions.
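The rank-one deliberation can be sketched as a small statevector simulation. The sketch below assumes real $y$-rotations (phases set to zero) and assigns the flagged actions to $|10\rangle$ and $|11\rangle$; both are simplifying assumptions of this illustration, not a description of the laboratory pulse sequence.

```python
import numpy as np

def ry(t):
    """Single-qubit y-rotation (rotation phases set to zero for simplicity)."""
    return np.array([[np.cos(t / 2), -np.sin(t / 2)],
                     [np.sin(t / 2),  np.cos(t / 2)]])

def qrps_rank_one(eps, ratio, n_steps):
    """Statevector sketch of the rank-one Q-RPS deliberation.

    Basis order |00>, |01>, |10>, |11>; here |10> and |11> play the role
    of the two flagged actions (an assumption of this sketch).
    """
    theta1 = 2 * np.arcsin(np.sqrt(eps))     # eps   = sin^2(theta1 / 2)
    theta2 = 2 * np.arctan(np.sqrt(ratio))   # ratio = tan^2(theta2 / 2)
    ket0 = np.array([1.0, 0.0])
    s = np.kron(ry(theta1) @ ket0, ry(theta2) @ ket0)   # prepare |s>
    ref_flag = np.diag([1.0, 1.0, -1.0, -1.0])  # sign flip on flagged actions
    ref_s = 2.0 * np.outer(s, s) - np.eye(4)    # reflection about |s>
    psi = s.copy()
    for _ in range(n_steps):
        psi = ref_s @ (ref_flag @ psi)          # one diffusion step
    return psi ** 2                             # measurement probabilities

probs = qrps_rank_one(0.01, 2.0, 7)   # 7 diffusion steps at eps = 0.01
assert probs[2] + probs[3] > 0.99                 # flagged prob. amplified
assert abs(probs[3] / probs[2] - 2.0) < 1e-6      # tail ratio preserved
```

With $\epsilon = 0.01$, the near-optimal choice of $n = 7$ diffusion steps amplifies the flagged probability from 0.01 to above 0.99 while the ratio of the two flagged outcomes stays at its initial value.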

### The experimental setup

Two $^{171}$Yb$^+$ ions are confined in a linear Paul trap with axial and radial trap frequencies in the kHz range. After Doppler cooling, the two ions form a linear Coulomb crystal, which is exposed to a static magnetic field gradient generated by a pair of permanent magnets. The ion-ion spacing in this configuration is on the micrometer scale. Magnetic gradient induced coupling (MAGIC) between the ions results in an adjustable qubit interaction mediated by the common vibrational modes of the Coulomb crystal Khromova et al. (2012). In addition, the qubit resonances are individually shifted as a result of this gradient and become position dependent. This makes the qubits distinguishable and addressable by their frequency of resonant excitation. The addressing frequency separation for this two-ion system is of the order of MHz. All coherent operations are performed using radio frequency (RF) radiation near 12.6 GHz, matching the respective qubit resonances Piltz et al. (2014). A more detailed description of the experimental setup can be found in Refs. Khromova et al. (2012); Wölk et al. (2015); Piltz et al. (2016).

The qubits are encoded in the hyperfine manifold of each ion's $^2S_{1/2}$ ground state, representing an effective spin-$1/2$ system. The qubit states $|0\rangle$ and $|1\rangle$ are represented by the energy level $|F=0\rangle$ and a magnetic-field-sensitive state of the $|F=1\rangle$ manifold, respectively. The ions are Doppler cooled on the $^2S_{1/2} \leftrightarrow {}^2P_{1/2}$ resonance with laser light near 369 nm. Optical pumping into long-lived meta-stable states is prevented using laser light near 935 nm and 638 nm. The vibrational excitation of the Doppler cooled ions is further reduced by employing RF sideband cooling for both the center-of-mass mode and the stretch mode. This leads to a low mean vibrational quantum number for both modes. The ions are then initialized in the qubit state $|00\rangle$ by state-selective optical pumping, using a GHz red-shifted Doppler-cooling laser addressing the appropriate $^2S_{1/2} \leftrightarrow {}^2P_{1/2}$ hyperfine transition.

The desired qubit states are prepared by applying RF pulses that implement coherent qubit rotations with precisely defined rotation angles and phases (Eq. (3)). The required number $n$ of diffusion steps is then applied to both qubits, using appropriate single-qubit rotations and a two-qubit ZZ-interaction given by

$$H_{ZZ} = \frac{\hbar}{2}\, J\, Z_1 Z_2, \qquad (8)$$

which is directly realizable with MAGIC Khromova et al. (2012). A CNOT gate can then be performed by combining the ZZ evolution with single-qubit rotations; up to a global phase,

$$\mathrm{CNOT} = H^{(2)}\, R_z^{(1)}(\pi/2)\, R_z^{(2)}(\pi/2)\, e^{\,i \pi Z_1 Z_2 / 4}\, H^{(2)},$$

where $H^{(2)}$ denotes a Hadamard gate on the second qubit, itself composed of single-qubit rotations.

The required number of single-qubit gates is reduced by merging adjacent single-qubit rotations from $\mathrm{ref}_{\mathcal{F}}$ and $\mathrm{ref}_s$ (see Fig. 2). Thus, we can simplify the algorithm such that each diffusion step implements

$$D = \mathrm{ref}_s\, \mathrm{ref}_{\mathcal{F}} = U_s \left(2\,|00\rangle\langle 00| - \mathbb{1}\right) U_s^{\dagger}\, R_z^{(1)}(\pi), \qquad (9)$$

as shown in Fig. 5 of Methods.
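The construction of a CNOT gate from a ZZ interaction can be checked numerically. The following sketch (not the laboratory pulse sequence) builds a controlled-Z from two $z$-rotations and one ZZ evolution, then conjugates with Hadamards on the target qubit:

```python
import numpy as np

I2 = np.eye(2)
Z = np.diag([1.0, -1.0])
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)

def expi(a, angle):
    """exp(i * angle * a) for an involutory matrix a (a @ a = identity)."""
    return np.cos(angle) * np.eye(a.shape[0]) + 1j * np.sin(angle) * a

Z1, Z2, ZZ = np.kron(Z, I2), np.kron(I2, Z), np.kron(Z, Z)

# Controlled-Z from two z-rotations and one ZZ evolution (plus global phase)
cz = np.exp(1j * np.pi / 4) * (expi(Z1, -np.pi / 4)
                               @ expi(Z2, -np.pi / 4)
                               @ expi(ZZ, np.pi / 4))
# Hadamards on the target turn CZ into CNOT (control = qubit 1)
cnot = np.kron(I2, H) @ cz @ np.kron(I2, H)

target = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 0, 1],
                   [0, 0, 1, 0]], dtype=complex)
assert np.allclose(cnot, target)
```

Up to the explicit global phase factor, a single entangling operation suffices per CNOT; all other factors are single-qubit rotations.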

During the ms-scale evolution time of each diffusion step, both qubits are protected from decoherence by applying universally robust (UR) dynamical decoupling (DD) pulses Genov et al. (2017). The complete pulse sequence for the experiment reported here can be found in Fig. 5 of Methods.

Finally, projective measurements on both qubits are performed in the computational basis by scattering laser light near 369 nm on the $^2S_{1/2} \leftrightarrow {}^2P_{1/2}$ transition, and detecting spatially resolved resonance fluorescence using an electron-multiplying charge-coupled device (EMCCD) camera to determine the relative frequencies of obtaining the states $|00\rangle$, $|01\rangle$, $|10\rangle$, and $|11\rangle$, respectively.

### Results

As discussed above, our goal is to test the two characteristic features of rank-one Q-RPS: (i) the scaling of the average cost with $\epsilon$, and (ii) the sampling ratio for the different flagged actions. For the former, we expect a scaling of $C \propto 1/\sqrt{\epsilon}$, while for the latter we expect the ratio of the number of occurrences of the two actions to be maintained with respect to the relative probabilities given by the stationary distribution. Therefore, our first set of measurements studies the behavior of the cost $C$ as a function of the total initial probability $\epsilon$. The second set of measurements studies the behavior of the output probability ratio $p_2/p_1$ as a function of the input probability ratio $\pi_2/\pi_1$.

For the former, a series of measurements is performed for different values of $\epsilon$, corresponding to $n = 1$ to $7$ diffusion steps after the initial state preparation. To obtain the cost $C = (2n+1)/p$, where $p = p_1 + p_2$, we measure the probabilities $p_1$ and $p_2$ after $n$ diffusion steps and repeat the experiment 1600 times for each fixed $\epsilon$. The average cost is then plotted against $\epsilon$, as shown in Fig. 3. The experimental data show that the cost decreases with $\epsilon$ as $1/\sqrt{\epsilon}$, as desired. This is in good agreement with the behavior expected for the ideal Q-RPS algorithm. In the range of chosen probabilities $\epsilon$, the experimental result of Q-RPS outperforms the classical RPS, as shown in Fig. 3. We therefore demonstrate that the experimental efficiency is already good enough not only to obtain improved scaling, but also to outperform the classical algorithm, despite the offset in the cost function and the finite precision of the quantum algorithm. The deviation from the ideal behavior is attributed to a small detuning of the RF pulses implementing coherent operations, as we discuss in the Supplementary Materials.
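The cost estimator used in this analysis is a one-line function; the numbers in the example below are illustrative, chosen close to the ideal case for $\epsilon = 0.01$, not the measured values.

```python
def average_cost(n, p1, p2):
    """Average number of uses of U_s per successful sample: each attempt
    costs one preparation plus two calls to U_s per diffusion step and
    succeeds with probability p = p1 + p2."""
    return (2 * n + 1) / (p1 + p2)

# n = 7 diffusion steps with a near-ideal success probability of 0.99:
c = average_cost(7, 0.50, 0.49)
assert abs(c - 15 / 0.99) < 1e-9
```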

For the second set of measurements, we select a few calculated probabilities $\pi_1$ and $\pi_2$ in order to obtain different values of the input ratio $\pi_2/\pi_1$ between 0 and 2, whilst keeping $\epsilon$ within a fixed range. For these probabilities $\pi_1$ and $\pi_2$, the corresponding rotation angles $\theta_1$ and $\theta_2$ of the RF pulses intended for the preparation are extracted using Eq. (4) and Eq. (5). We then perform the Q-RPS algorithm for the specific choices of $n$ and repeat it 1600 times to estimate the probabilities $p_1$ and $p_2$. We finally obtain the output ratio $p_2/p_1$, which is plotted against the input ratio in Fig. 4. The experimental data follow a straight line with an offset from the behavior expected for an ideal Q-RPS agent. The slopes of the two fitted linear functions agree within their respective errors, showing that the deviation of the output ratio from the ideal result is independent of the number of diffusion steps. In addition, this indicates that the deviation is not caused by the quantum algorithm itself, but by the initial state preparation and/or by the final measurement process, where such a deviation can be caused by an asymmetry in the detection fidelity. Indeed, the observed deviation is well explained by a typical asymmetry in the detection fidelity of 3% as encountered in the measurements presented here. This implies reliability of the quantum algorithm also for a larger number of diffusion steps. A detailed discussion of experimental sources of error is given in the Supplementary Materials.

## Conclusion

We have investigated a quantum-enhanced deliberation process of a learning agent implemented in an ion trap quantum processor. Our approach is centered on the projective simulation Briegel and De las Cuevas (2012) model for reinforcement learning. Within this paradigm, the decision-making procedure is cast as a stochastic diffusion process, that is, a (classical or quantum) random walk in a representation of the agent's memory.

The classical PS framework can be used to solve standard textbook problems in reinforcement learning Mautner et al. (2015); Melnikov et al. (2014); Makmal et al. (2016), and has recently been applied in advanced robotics Hangl et al. (2016), adaptive quantum computation Tiersch et al. (2015), and the machine-generated design of quantum experiments Melnikov et al. (2017). We have focused on reflecting projective simulation Paparo et al. (2014), an advanced variant of the PS model based on "mixing" (see Methods), where the deliberation process allows for a quantum speed-up of Q-RPS agents w.r.t. their classical counterparts. In particular, we have considered the interesting special case of rank-one Q-RPS. This provides the advantage of the speed-up offered by the mixing-based approach, but is also in one-to-one correspondence with the hitting-based basic PS using two-layered networks, which has been applied in classical task environments Mautner et al. (2015); Melnikov et al. (2014); Makmal et al. (2016); Hangl et al. (2016); Tiersch et al. (2015); Melnikov et al. (2017).

In a proof-of-principle experimental demonstration, we verify that the deliberation process of the quantum learning agent is quadratically faster compared to that of a classical learning agent. The experimental uncertainties in the reported results, which are in excellent agreement with a detailed model, do not interfere with this genuine quantum advantage in the agent's deliberation time. We achieve results for the cost $C$ for up to 7 diffusion steps, corresponding to an initial probability $\epsilon = 0.01$ of choosing a flagged action. The systematic variation of the ratio between the input probabilities $\pi_1$ and $\pi_2$ for flagged actions, together with the measurement of the ratio between the learning agent's output probabilities $p_1$ and $p_2$ as a function of the input ratio, shows that the quantum algorithm is reliable independent of the number of diffusion steps.

This experiment highlights the potential of a quantum computer in the field of quantum-enhanced learning and artificial intelligence. A practical advantage will, of course, become evident once larger percept spaces and general higher-rank Q-RPS are employed. Such extensions are, from the theory side, unproblematic, given that the modularized nature of the algorithm makes it scalable. An experimental realization of such large-scale quantum-enhanced learning will be feasible with the implementation of scalable quantum computer architectures. Meanwhile, all essential elements of Q-RPS have been successfully demonstrated in the proof-of-principle experiment reported here.

## Acknowledgments

T. S., S. W., G. S. G. and C. W. acknowledge funding from Deutsche Forschungsgemeinschaft and from Bundesministerium für Bildung und Forschung (FK 16KIS0128). G. S. G. also acknowledges support from the European Commission’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement number 657261. H. J. B. and N. F. acknowledge support from the Austrian Science Fund (FWF) through Grants No. SFB FoQuS F4012 and the START project Y879-N27, respectively.

## References

- Nielsen and Chuang (2000) M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information (Cambridge University Press, Cambridge, U.K., 2000).
- Russell and Norvig (2003) S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 2nd ed. (Pearson Education, 2003).
- Grover (1996) L. K. Grover, in Proceedings of the Twenty-eighth Annual ACM Symposium on Theory of Computing, STOC ’96 (ACM, New York, NY, USA, 1996) pp. 212–219.
- Shor (1994) P. W. Shor, in Proceedings 35th Annual Symposium on Foundations of Computer Science (1994) pp. 124–134.
- Giovannetti et al. (2004) V. Giovannetti, S. Lloyd, and L. Maccone, Science 306, 1330 (2004).
- Friis et al. (2017) N. Friis, D. Orsucci, M. Skotiniotis, P. Sekatski, V. Dunjko, H. J. Briegel, and W. Dür, New J. Phys. 19, 063044 (2017).
- Lim et al. (2017) K. Lim, Y. Hong, Y. Choi, and H. Byun, PLOS ONE 12, e0173317 (2017).
- Silver et al. (2016) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, Nature 529, 484 (2016).
- Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, Nature 518, 529 (2015).
- Schaeffer et al. (2007) J. Schaeffer, N. Burch, Y. Björnsson, A. Kishimoto, M. Müller, R. Lake, P. Lu, and S. Sutphen, Science 317, 1518 (2007).
- Biamonte et al. (2016) J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and S. Lloyd, “Quantum machine learning,” (2016), arXiv:1611.09347 .
- Sutton and Barto (1998) R. Sutton and A. Barto, Reinforcement learning (The MIT Press, 1998).
- Briegel and De las Cuevas (2012) H. J. Briegel and G. De las Cuevas, Sci. Rep. 2, 400 (2012).
- Dunjko et al. (2016) V. Dunjko, J. M. Taylor, and H. J. Briegel, Phys. Rev. Lett. 117, 130501 (2016).
- Paparo et al. (2014) G. D. Paparo, V. Dunjko, A. Makmal, M. A. Martin-Delgado, and H. J. Briegel, Phys. Rev. X 4, 031002 (2014).
- Ristè et al. (2017) D. Ristè, M. P. da Silva, C. A. Ryan, A. W. Cross, A. D. Córcoles, J. A. Smolin, J. M. Gambetta, J. M. Chow, and B. R. Johnson, npj Quantum Information 3, 16 (2017).
- Li et al. (2015) Z. Li, X. Liu, N. Xu, and J. Du, Phys. Rev. Lett. 114, 140504 (2015).
- Cai et al. (2015) X.-D. Cai, D. Wu, Z.-E. Su, M.-C. Chen, X.-L. Wang, L. Li, N.-L. Liu, C.-Y. Lu, and J.-W. Pan, Phys. Rev. Lett. 114, 110504 (2015).
- Hanneke et al. (2010) D. Hanneke, J. P. Home, J. D. Jost, J. M. Amini, D. Leibfried, and D. J. Wineland, Nat. Phys. 6, 13 (2010).
- Monz et al. (2016) T. Monz, D. Nigg, E. A. Martinez, M. F. Brandl, P. Schindler, R. Rines, S. X. Wang, I. L. Chuang, and R. Blatt, Science 351, 1068 (2016).
- Piltz et al. (2016) C. Piltz, T. Sriarunothai, S. S. Ivanov, S. Wölk, and C. Wunderlich, Sci. Adv. 2, e1600093 (2016).
- Debnath et al. (2016) S. Debnath, N. M. Linke, C. Figgatt, K. A. Landsman, K. Wright, and C. Monroe, Nature 536, 63 (2016).
- Kielpinski et al. (2002) D. Kielpinski, C. Monroe, and D. J. Wineland, Nature 417, 709 (2002).
- Monroe et al. (2014) C. Monroe, R. Raussendorf, A. Ruthven, K. R. Brown, P. Maunz, L.-M. Duan, and J. Kim, Phys. Rev. A 89, 022317 (2014).
- Lekitsch et al. (2017) B. Lekitsch, S. Weidt, A. G. Fowler, K. Mølmer, S. J. Devitt, C. Wunderlich, and W. K. Hensinger, Sci. Adv. 3, e1601540 (2017).
- Dunjko et al. (2015) V. Dunjko, N. Friis, and H. J. Briegel, New J. Phys. 17, 023006 (2015).
- Khromova et al. (2012) A. Khromova, C. Piltz, B. Scharfenberger, T. F. Gloger, M. Johanning, A. F. Varón, and C. Wunderlich, Phys. Rev. Lett. 108, 220502 (2012).
- Piltz et al. (2014) C. Piltz, T. Sriarunothai, A. Varón, and C. Wunderlich, Nat. Commun. 5, 4679 (2014).
- Wölk et al. (2015) S. Wölk, C. Piltz, T. Sriarunothai, and C. Wunderlich, J. Phys. B: At. Mol. Opt. Phys. 48, 075101 (2015).
- Genov et al. (2017) G. T. Genov, D. Schraft, N. V. Vitanov, and T. Halfmann, Phys. Rev. Lett. 118, 133202 (2017).
- Mautner et al. (2015) J. Mautner, A. Makmal, D. Manzano, M. Tiersch, and H. J. Briegel, New Generation Computing 33, 69 (2015).
- Melnikov et al. (2014) A. A. Melnikov, A. Makmal, and H. J. Briegel, “Projective simulation applied to the grid-world and the mountain-car problem,” (2014), arXiv:1405.5459 .
- Makmal et al. (2016) A. Makmal, A. A. Melnikov, V. Dunjko, and H. J. Briegel, IEEE Access 4, 2110 (2016).
- Hangl et al. (2016) S. Hangl, E. Ugur, S. Szedmak, and J. Piater, in Proceedings 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE, 2016) p. 2799.
- Tiersch et al. (2015) M. Tiersch, E. J. Ganahl, and H. J. Briegel, Sci. Rep. 5, 12874 (2015).
- Melnikov et al. (2017) A. A. Melnikov, H. Poulsen Nautrup, M. Krenn, V. Dunjko, M. Tiersch, A. Zeilinger, and H. J. Briegel, “Active learning machine learns to create new quantum experiments,” (2017), arXiv: 1706.00868 .
- Szegedy (2004) M. Szegedy, in 45th Annual IEEE Symposium on Foundations of Computer Science (IEEE, 2004).
- Magniez et al. (2011) F. Magniez, A. Nayak, J. Roland, and M. Santha, SIAM Journal on Computing 40, 142 (2011).
- Friis et al. (2014) N. Friis, V. Dunjko, W. Dür, and H. J. Briegel, Phys. Rev. A 89, 030303 (2014).
- Friis et al. (2015) N. Friis, A. A. Melnikov, G. Kirchmair, and H. J. Briegel, Sci. Rep. 5, 18036 (2015).
- Vitanov et al. (2015) N. V. Vitanov, T. F. Gloger, P. Kaufmann, D. Kaufmann, T. Collath, M. Tanveer Baig, M. Johanning, and C. Wunderlich, Phys. Rev. A 91, 033406 (2015).

## Methods

### Theoretical Framework of RPS

A generic picture for modeling autonomous learning scenarios is that of repeated rounds of interaction between an agent and its environment. In each round the agent receives perceptual input ("percepts") from the environment, processes the input using an internal deliberation mechanism, and finally acts upon (or reacts to) the environment, i.e., performs an "action" Briegel and De las Cuevas (2012). Depending on the reward system in place and the given percept, such actions may be rewarded or not, which leads the agent to update its deliberation process; in other words, the agent learns.

Within the projective simulation (PS) Briegel and De las Cuevas (2012) paradigm for learning agents, the decision-making procedure is cast as a (physically motivated) stochastic diffusion process within an episodic compositional memory (ECM), i.e., a (classical or quantum) random walk in a representation of the agent's memory containing the interaction history. One may think of the ECM as a network of clips that can correspond to remembered percepts, remembered actions, or combinations thereof. Mathematically, this clip network is described by a stochastic matrix $P = (p_{ij})$ (defining a Markov chain), where the $p_{ij}$ with $i, j \in \{1, \ldots, N\}$ represent transition probabilities from the clip labeled $j$ to the clip labeled $i$, with $\sum_i p_{ij} = 1$. The learning process is implemented through an update of the matrix $P$, which, in turn, serves as a basis for the random walks in the clip network. Different types of PS agents vary in their deliberation mechanisms, update rules, and other specifications.

In particular, one may distinguish between PS agents based on "hitting" and "mixing". For the former type of PS agent, a random walk could, for instance, start from a clip $i$ called by the initially received percept. The first "step" of the random walk then corresponds to a transition to clips $j$ with probabilities $p_{ji}$. The agent then samples from the resulting distribution. If such a sample provides an action, e.g., if an action clip is "hit", this action is selected as output; otherwise the walk continues from the sampled clip. An advanced variant of the PS model based on "mixing" is reflecting projective simulation (RPS) Paparo et al. (2014). There, the Markov chain is first "mixed", i.e., an appropriate number of steps is applied until the stationary distribution is attained (approximately), before a sample is taken. (The required mixing time depends on the spectral gap $\delta$ of the Markov chain $P$, i.e., the difference between the two largest eigenvalues of $P$ Paparo et al. (2014).) This, or other implementations of random walks in the clip network, provides the basis for the PS framework for learning. The classical PS framework can be used to solve standard textbook problems in reinforcement learning Mautner et al. (2015); Melnikov et al. (2014); Makmal et al. (2016), and has recently been applied in advanced robotics Hangl et al. (2016), adaptive quantum computation Tiersch et al. (2015), and the machine-generated design of quantum experiments Melnikov et al. (2017).

Here, we focus on RPS agents, where the deliberation process based on mixing allows for a speed-up of Q-RPS agents w.r.t. their classical counterparts Paparo et al. (2014). In contrast to basic hitting-based PS agents, the clip network of RPS agents is structured into several sub-networks, one for each percept clip, and each with its own stochastic matrix $P$. In addition to being stochastic, these matrices specify Markov chains which are ergodic Paparo et al. (2014), which ensures that the Markov chain in question has a unique stationary distribution $\pi$, i.e., a unique eigenvector of $P$ with eigenvalue $1$, $P\pi = \pi$. Starting from any initial state, continued application of $P$ (or its equivalent in the quantized version) mixes the Markov chain, leaving the system in the stationary state.
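This mixing property is easy to verify numerically. The sketch below uses a hypothetical three-clip ergodic chain (column-stochastic, all entries positive); repeated application of $P$ drives any initial distribution to the same fixed point:

```python
import numpy as np

# Hypothetical ergodic clip network: P[i, j] is the probability to hop
# from clip j to clip i; every column sums to one.
P = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.6, 0.3],
              [0.2, 0.2, 0.4]])

def mix(P, p0, n_steps):
    """Apply the Markov chain n_steps times to the distribution p0."""
    p = np.asarray(p0, dtype=float)
    for _ in range(n_steps):
        p = P @ p
    return p

pi = mix(P, [1.0, 0.0, 0.0], 200)            # mixed stationary distribution
assert np.allclose(P @ pi, pi)               # P pi = pi
assert np.allclose(mix(P, [0.0, 0.0, 1.0], 200), pi)   # start-independent
```

Because all entries of this toy $P$ are positive, the chain is ergodic and the fixed point is unique.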

As part of their deliberation process, RPS agents generate stationary distributions over their clip space, as specified by $P$, which is updated as the agent learns.
These distributions have support over the whole sub-network clip space, and additional specifiers, so-called flags, are used to ensure an output from a desired subset of clips. For instance, standard agents are presumed to output actions only, in which case only the actions are "flagged" (such flags are rudimentary versions of the emoticons defined in Briegel and De las Cuevas (2012)). This ensures that an action will be output, while maintaining the relative probabilities of the actions. The same mechanism of flags, which can be thought of as short-term memory, is also used to eliminate iterated attempts of actions that did not yield rewards in recent time-steps. This leads to a more efficient exploration of correct behavior.

In the quantum version of RPS, each clip $i$ is represented by a basis vector $|i\rangle$ in a Hilbert space $\mathcal{H}$. The mixing process is then realized by a diffusion process on two copies of the original Hilbert space. On the doubled space $\mathcal{H} \otimes \mathcal{H}$, a unitary operator $W(P)$ (called the Szegedy walk operator Szegedy (2004); Magniez et al. (2011)) and a quantum state $|\pi\rangle$ take the roles of the classical objects $P$ and $\pi$. Both $W(P)$ and $|\pi\rangle$ depend on a set of unitaries $U_i$ on $\mathcal{H}$ that act as $U_i |0\rangle = \sum_j \sqrt{p_{ji}}\, |j\rangle$ for some reference state $|0\rangle$. The more intricate construction of $W(P)$ is given in detail in Dunjko et al. (2015). The feature of the quantum implementation of RPS that is crucial for us here is an amplitude amplification similar to Grover's algorithm Grover (1996), which incorporates the mixing of the Markov chain and allows outputting flagged actions after an average of $O(1/\sqrt{\epsilon})$ calls to $W(P)$, where $\epsilon$ is the probability of sampling an action from the stationary distribution. The algorithm achieving this is structured as follows. After an initialization stage where $|\pi\rangle$ is prepared, a number of diffusion steps is carried out. Each such step consists of two parts. The first part is a reflection over the states corresponding to actions in the first copy of $\mathcal{H}$, i.e., an operation

$$\mathrm{ref}_{\mathcal{A}} = \mathbb{1} - 2 \sum_{i \in \mathcal{A}} |i\rangle\langle i| \otimes \mathbb{1}, \qquad (10)$$

where $\mathcal{A}$ denotes the subspace of the clip network corresponding to actions. In the second part, an approximate reflection over the state $|\pi\rangle$, the mixing, is carried out, i.e., an operation designed to approximate $2|\pi\rangle\langle\pi| - \mathbb{1}$ Paparo et al. (2014). This second step involves $O(1/\sqrt{\delta})$ calls to $W(P)$. The two-part diffusion steps are repeated $O(1/\sqrt{\epsilon})$ times before a sample is taken from the resulting state by measuring in the basis $\{|i\rangle\}$. If an action is sampled, the algorithm concludes and that action is chosen as output. Otherwise, all steps are repeated. Since the algorithm amplifies the probability of sampling an action (almost) to unity, carrying out the deliberation procedure with the help of such a Szegedy walk hence requires an average of $O(1/\sqrt{\delta\epsilon})$ calls to $W(P)$. In comparison, a classical RPS agent requires an average of $O(1/\delta)$ applications of $P$ to mix the Markov chain and an average of $O(1/\epsilon)$ samples to find an action. Q-RPS agents can hence achieve a quadratic speed-up in their reaction time.

Here, it should be noted that, its elegance notwithstanding, the construction of the approximate reflection for general RPS networks is extremely demanding for current quantum computational architectures. Most notably, this is due to the requirement of two copies of $\mathcal{H}$, on which frequently updated coherent conditional operations need to be carried out Dunjko et al. (2015); Friis et al. (2014, 2015). However, as we shall explain now, these requirements can be circumvented for the interesting class of rank-one Markov chains. In this special case, the entire Markov chain can be represented on one copy of $\mathcal{H}$ by a single unitary $U_s$, since all columns of $P$ are identical. Conceptually, this simplification corresponds to a situation where each percept-specific clip network contains only actions and the Markov chain is mixed in a single step. In such a case one uses flags to mark desired actions. Interestingly, these minor alterations also allow us to establish a one-to-one correspondence with the hitting-based basic PS using two-layered networks, which has been applied in classical task environments Mautner et al. (2015); Melnikov et al. (2014); Makmal et al. (2016); Hangl et al. (2016); Tiersch et al. (2015); Melnikov et al. (2017).

Let us now discuss how the algorithm above is modified for the rank-one case with the flagging mechanism in place. First, we restrict $\mathcal{A}$ to be the subspace of the flagged actions only, assuming that there are $L$ of these, and we denote the corresponding probabilities within the stationary distribution by $\epsilon_1,\dots,\epsilon_L$. In the initialization stage, the state $|\alpha\rangle$ is prepared. Then, an optimal number $n$ of diffusion stepsGrover (1996) is carried out, where

(11)  $n \,=\, \left\lfloor \dfrac{\pi}{4\arcsin\sqrt{\epsilon}} \right\rfloor$

and $\epsilon = \sum_{i=1}^{L}\epsilon_i$ is the probability to sample a flagged action from the stationary distribution. Within the diffusion steps, the reflections over all actions of Eq. (10) are replaced by reflections over the flagged actions, i.e.,

(12)  $\mathrm{ref}_{\mathcal{F}} \,=\, 2\sum_{i=1}^{L}|a_i\rangle\langle a_i| - \mathbb{1}$

In the rank-one case, the reflection over the stationary distribution becomes an exact reflection over the state $|\alpha\rangle$ and can be carried out on a single copy of the clip registerDunjko et al. (2015). After the $n$ diffusion steps, a sample is taken and the agent checks whether the obtained action is marked with a flag. If this is the case, the action is chosen as output; otherwise the algorithm starts anew.

While a classical RPS agent requires an average of $1/\epsilon$ samples until obtaining a flagged action, this number reduces to $O(1/\sqrt{\epsilon})$ for Q-RPS agents. This quantum advantage is particularly pronounced when the overall number of actions is very large compared to $L$ and the environment is unfamiliar to the agent or has recently changed its rewarding pattern, in which case $\epsilon$ may be very small. Given some time, both agents learn to associate rewarded actions with a given percept, suitably add or remove flags, and adapt the stationary distribution (and by extension $\epsilon$). In the short run, however, classical agents may be slow to respond, and the advantage of a Q-RPS agent becomes apparent. Despite the remarkable simplification of the algorithm for the rank-one case with flags, the quadratic speed-up is hence preserved.
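The rank-one deliberation loop can be illustrated with a small state-vector simulation; the total number of actions $N=8$ is an assumption made for illustration, while the flagged-action probabilities are taken from the first row of Tab. 1:

```python
import math
import numpy as np

# Toy model: N = 8 actions, the first L = 2 flagged, with stationary
# probabilities eps_1 = eps_2 = 0.1371 as in the first row of Tab. 1.
N, L = 8, 2
probs = np.full(N, (1 - 2 * 0.1371) / (N - L))    # unflagged actions
probs[:L] = 0.1371                                 # flagged actions
alpha = np.sqrt(probs)                             # stationary state |alpha>
eps = probs[:L].sum()                              # prob. of a flagged action

oracle = np.eye(N)
oracle[:L, :L] = -np.eye(L)                        # phase-flip flagged actions
diffusion = 2 * np.outer(alpha, alpha) - np.eye(N) # exact reflection about |alpha>

n_opt = math.floor(math.pi / (4 * math.asin(math.sqrt(eps))))
state = alpha.copy()
for _ in range(n_opt):
    state = diffusion @ (oracle @ state)

p_flag = float(np.sum(state[:L] ** 2))
print(n_opt, round(p_flag, 4))   # -> 1 0.9932
```

For these values a single diffusion step amplifies the probability of sampling a flagged action to about 0.9932, matching the corresponding theory entry of Tab. 1.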

### Experimental Details

We discuss some details of the experimental implementation. The detailed pulse sequence for the reported experiments is shown in Fig. 5. Radio-frequency (RF) pulses tuned to the respective qubit resonance frequencies near 12.6 GHz are applied for all coherent manipulations. The RF power is carefully adjusted for each ion in order to achieve equal Rabi frequencies on both qubits. The rotation angles and phases of the single-qubit rotations during the preparation and diffusion steps are chosen according to Eqs. (3)-(5) of the main text.

During the conditional evolution time, a set of ten UR14 sequences of RF $\pi$-pulses (a total of 140 pulses) is appliedGenov et al. (2017) to protect the qubits from decoherence by dynamical decoupling (DD). Each UR14 sequence comprises 14 error-correcting $\pi$-pulses (Fig. 5) with appropriately chosen phases.

Since the phases of the $\pi$-pulses are symmetrically arranged in time, only the first seven pulses are shown in Fig. 5. The last pulse is also shown to visualize the spacing of these pulses with respect to the start and end of the evolution time, compared with the intermediate pulses. The maximum interaction time of 30 ms required to realize the deliberation algorithm presented in the main text (corresponding to 7 diffusion steps) is 60 times longer than the qubit coherence time. Such a long coherent interaction time is accomplished by the DD pulses applied to each qubit simultaneously.

Laser light near 369 nm is applied to the ions for cooling, state preparation, and state-selective read-out as described in the main text. The durations are: 30 ms for Doppler cooling, 100 ms for sideband cooling on the center-of-mass mode, 100 ms for sideband cooling on the stretch mode, 0.25 ms for initialization of the ions' internal state, and 2 ms for detection.

Projective measurements on both qubits are performed to determine the populations of the states $|00\rangle$, $|01\rangle$, $|10\rangle$, and $|11\rangle$. Two thresholds are used to distinguish between dark and bright states of the ions, thus discarding 10% of all measurements as ambiguous events with a photon count that lies in the region of the two partially overlapping Poissonian distributions representing the dark and bright states of the ionsVitanov et al. (2015); Wölk et al. (2015).
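The double-threshold read-out can be sketched as follows; the mean photon counts and threshold values are illustrative assumptions rather than the calibrated experimental settings:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Illustrative assumptions (not the experimental calibration): mean
# photon counts for a dark and a bright ion, and the two thresholds.
MEAN_DARK, MEAN_BRIGHT = 1.0, 12.0
LOW_THRESH, HIGH_THRESH = 2, 5   # counts in between are ambiguous

def classify(counts):
    # 'dark' below the lower threshold, 'bright' above the upper one,
    # 'ambiguous' (discarded) in the overlap region of the Poissonians
    return np.where(counts <= LOW_THRESH, "dark",
                    np.where(counts >= HIGH_THRESH, "bright", "ambiguous"))

means = np.where(rng.random(100_000) < 0.5, MEAN_DARK, MEAN_BRIGHT)
counts = rng.poisson(means)
labels = classify(counts)
print(f"discarded as ambiguous: {np.mean(labels == 'ambiguous'):.1%}")
```

Events falling between the two thresholds are discarded instead of being forced into one of the two classes, which trades a modest loss of data for a lower misclassification rate.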

## Supplementary Materials

In this section, we discuss deviations of the experimental data from idealized theory predictions. In particular, for the chosen values of $\epsilon$ and the corresponding optimal number of diffusion steps $n$, the probability of obtaining a flagged action is expected to be close to unity. However, the success probability in our experiment lies between 0.66(2) (for $n=7$) and 0.89(2) (for $n=1$). In what follows, we discuss several reasons for this.

#### Scaling error

Theory:

| $n$ | $\epsilon_1$ | $\epsilon_2$ | $\epsilon$ | $p_1$ | $p_2$ | $p$ |
|---|---|---|---|---|---|---|
| 1 | 0.1371 | 0.1371 | 0.2742 | 0.4966 | 0.4966 | 0.9932 |
| 2 | 0.0493 | 0.0493 | 0.0987 | 0.4996 | 0.4996 | 0.9993 |
| 3 | 0.0252 | 0.0252 | 0.0504 | 0.4999 | 0.4999 | 0.9998 |
| 4 | 0.0152 | 0.0152 | 0.0305 | 0.5000 | 0.5000 | 1.0000 |
| 5 | 0.0102 | 0.0102 | 0.0204 | 0.5000 | 0.5000 | 1.0000 |
| 6 | 0.0073 | 0.0073 | 0.0146 | 0.5000 | 0.5000 | 1.0000 |
| 7 | 0.0055 | 0.0055 | 0.0110 | 0.5000 | 0.5000 | 1.0000 |

Experiment:

| $n$ | $p_1$ | $p_2$ | $p$ |
|---|---|---|---|
| 1 | 0.449(15) | 0.440(15) | 0.89(2) |
| 2 | 0.347(15) | 0.353(15) | 0.70(2) |
| 3 | 0.438(16) | 0.334(15) | 0.77(2) |
| 4 | 0.422(15) | 0.336(15) | 0.76(2) |
| 5 | 0.407(17) | 0.331(16) | 0.74(2) |
| 6 | 0.431(17) | 0.324(16) | 0.76(2) |
| 7 | 0.365(15) | 0.299(14) | 0.66(2) |

Even in an ideal scenario without noise or experimental imperfections, the success probability $p$, as defined in Eq. (4) of the main text, after $n$ diffusion steps is usually not equal to unity and depends on the specific value of $\epsilon$. This behavior originates from the step-wise increase of the number of diffusion steps in the algorithm: the success probability equals $1$ only if $\pi/(4\arcsin\sqrt{\epsilon}) - 1/2$ is an integer without rounding. The change of the ideal success probability with deviations of $\epsilon$ from such specific values is largest for small numbers of diffusion steps (e.g., $n=1$), where the success probability can drop substantially (neglecting the cases where it is not advantageous to use a quantum algorithm at all). For larger numbers of diffusion steps, the exact value of $\epsilon$ no longer plays an important role for the ideal success probability, provided that the correct number of diffusion steps is chosen. For example, for $n \geq 3$, the ideal success probability remains high independently of the exact value of $\epsilon$. Throughout this paper, we have chosen $\epsilon$ in such a way that $\pi/(4\arcsin\sqrt{\epsilon}) - 1/2$ is always close to an integer (see Tab. 1), such that the deviation from unit success probability due to the theoretically chosen $\epsilon$ is negligible compared to other error sources.
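Assuming the standard amplitude-amplification form of the ideal success probability, $p = \sin^2[(2n+1)\arcsin\sqrt{\epsilon}]$, which reproduces the theory values of Tab. 1, the rounding effect described above can be checked numerically:

```python
import math

def n_opt(eps):
    # Eq. (11): optimal number of diffusion steps
    return math.floor(math.pi / (4 * math.asin(math.sqrt(eps))))

def success_prob(eps, n):
    # ideal success probability after n diffusion steps
    return math.sin((2 * n + 1) * math.asin(math.sqrt(eps))) ** 2

# At the eps values of Tab. 1 the success probability is close to 1:
for eps in (0.2742, 0.0110):
    print(n_opt(eps), round(success_prob(eps, n_opt(eps)), 4))

# Between such sweet spots it can dip noticeably, most strongly when
# the optimal step number is small:
worst = min(success_prob(e, n_opt(e)) for e in (i / 1000 for i in range(20, 400)))
print(round(worst, 3))
```

Sweeping $\epsilon$ over an illustrative range shows the dips near the values of $\epsilon$ at which the optimal step number changes, while the tabulated $\epsilon$ values sit close to the maxima.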

However, in a real experiment, the initial state, and therefore $\epsilon$, can only be prepared with a certain accuracy. This can lead to an inaccurate estimate of the optimal number of diffusion steps. In contrast to the ideal case, a small inaccuracy in the preparation of $\epsilon$ has only a small effect on the success probability as long as $\pi/(4\arcsin\sqrt{\epsilon}) - 1/2$ remains close to an integer. However, when $\epsilon$ does not fulfil this condition and approaches, from above, a value at which the optimal number of diffusion steps changes, then the success probability drops due to a non-optimal choice of $n$.
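This sensitivity can be sketched under the same assumed form of the ideal success probability; the 5% preparation error used below is an illustrative value:

```python
import math

def n_opt(eps):
    # Eq. (11): optimal number of diffusion steps
    return math.floor(math.pi / (4 * math.asin(math.sqrt(eps))))

def success_prob(eps, n):
    # ideal success probability after n diffusion steps
    return math.sin((2 * n + 1) * math.asin(math.sqrt(eps))) ** 2

def realized_p(eps_prepared, eps_estimated):
    # n is chosen from the *estimated* eps, but the *prepared* eps differs
    return success_prob(eps_prepared, n_opt(eps_estimated))

# Far from a rounding boundary, a 5% preparation error barely matters:
print(realized_p(0.2742 * 1.05, 0.2742))

# Near the boundary where n_opt jumps from 2 to 1, the same error
# leads to a non-optimal n and the success probability drops:
eps_boundary = math.sin(math.pi / 8) ** 2
print(realized_p(eps_boundary * 0.95, eps_boundary * 1.001))
```

In the second case the estimated $\epsilon$ sits just above the boundary while the prepared one sits below it, so the chosen $n$ is off by one and the success probability is visibly reduced.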

The preparation accuracy depends on the detuning of the RF pulses for single-qubit rotations as well as on the uncertainty in the determination of the Rabi frequency. The calibration of our experiment revealed values for both that translate into a small error in $\epsilon$ and only a minor decrease of the success probability. The detuning and the uncertainty of the Rabi frequency not only influence the state preparation at the beginning of the quantum algorithm, but also its fidelity, as is detailed in the next paragraph.

To prevent decoherence during the conditional evolution, we apply many RF $\pi$-pulses per diffusion step and ion (140 during each conditional evolution time). Therefore, even a small detuning influences the fidelity of the algorithm. Consequently, the error induced by the detuning is identified as the main error source, leading, for example, to the reduced success probabilities observed for large numbers of diffusion steps. This error is much larger than the error caused by dephasing (which is still present after DD is applied) or the detection error. To estimate the error caused by dephasing, we assume an exponential decay of the coherence during a single diffusion step, governed by the experimentally diagnosed rate of dephasing and the time of coherent evolution. The influence of the detuning on the cost of our algorithm is shown in Fig. 6 for different detunings. Here, we simulated the complete quantum algorithm including the experimentally determined dephasing and detection errors. The experimental data are consistent with a small average relative detuning. Note that the detuning not only influences the single-qubit rotations that are an integral part of the quantum algorithm, but also leads to errors during the conditional evolution when dynamical decoupling pulses are applied.
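As a rough illustration (our own toy model, not the full simulation used for Fig. 6), the exponential dephasing can be folded into the ideal success probability by damping the coherent contribution toward the stationary sampling probability $\epsilon$; the rate and step duration below are assumed values:

```python
import math

def ideal_p(eps, n):
    # ideal success probability after n diffusion steps
    return math.sin((2 * n + 1) * math.asin(math.sqrt(eps))) ** 2

def dephased_p(eps, n, gamma, t_step):
    # Toy model: each diffusion step of duration t_step damps the
    # coherent contribution by exp(-gamma * t_step); a fully dephased
    # register samples a flagged action only with probability eps.
    f = math.exp(-gamma * n * t_step)
    return f * ideal_p(eps, n) + (1 - f) * eps

# assumed example numbers: gamma in 1/ms, t_step in ms
print(dephased_p(0.0110, 7, gamma=0.01, t_step=4.3))
```

The model interpolates between the ideal result (no dephasing) and purely classical sampling from the stationary distribution (complete dephasing).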

#### Ratio error

In the ideal algorithm, the output ratio of the two flagged actions, represented by the states $|a_1\rangle$ and $|a_2\rangle$, at the end of the algorithm equals the input ratio $\epsilon_1/\epsilon_2$. However, in the experiment we have observed deviations from this ratio. During the measurements for the investigation of the scaling behavior (Fig. 3 in the main text), we fixed $\epsilon_1 = \epsilon_2$. The observed output ratios vary around unity; that is, the probability to obtain the state $|a_1\rangle$ is typically increased with respect to that of $|a_2\rangle$. Also during the measurements testing the output ratio, we observe that the output ratios are larger than the input ratios.

| $n$ | $\epsilon_1$ (theory) | $\epsilon_2$ (theory) | $\epsilon_1/\epsilon_2$ | $p_1$ (exp.) | $p_2$ (exp.) | $p_1/p_2$ (exp.) |
|---|---|---|---|---|---|---|
| 1 | 0.00271 | 0.27144 | 0.01 | 0.061(7) | 0.809(12) | 0.075(9) |
| 1 | 0.07257 | 0.20159 | 0.36 | 0.290(14) | 0.583(15) | 0.50(3) |
| 1 | 0.11383 | 0.16032 | 0.71 | 0.415(15) | 0.466(15) | 0.89(4) |
| 1 | 0.14107 | 0.13309 | 1.06 | 0.488(15) | 0.389(15) | 1.25(6) |
| 1 | 0.16040 | 0.11376 | 1.41 | 0.519(13) | 0.351(12) | 1.48(6) |
| 1 | 0.17482 | 0.09933 | 1.76 | 0.566(15) | 0.305(14) | 1.85(10) |
| 1 | 0.13708 | 0.13708 | 1.00 | 0.468(16) | 0.401(16) | 1.17(6) |
| 3 | 0.00458 | 0.04578 | 0.10 | 0.127(10) | 0.718(14) | 0.176(14) |
| 3 | 0.01633 | 0.03402 | 0.48 | 0.301(15) | 0.518(16) | 0.58(3) |
| 3 | 0.02328 | 0.02707 | 0.86 | 0.442(16) | 0.451(16) | 0.98(5) |
| 3 | 0.02788 | 0.02248 | 1.24 | 0.510(16) | 0.354(15) | 1.44(8) |
| 3 | 0.03114 | 0.01922 | 1.62 | 0.551(16) | 0.305(14) | 1.81(10) |
| 3 | 0.03357 | 0.01679 | 2.00 | 0.586(15) | 0.268(13) | 2.19(12) |
An asymmetric detection error could be the cause of this observation. Typical errors in our experiment are characterized by the probability to detect a bright ion as dark and the (different) probability to detect a dark ion as bright. In Fig. 7 we compare the measured output ratios with the calculated output ratios assuming these detection errors only.
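The effect of such an asymmetric error on the output ratio can be sketched with a single-ion confusion matrix; the error rates and the input distribution below are assumed for illustration, not the calibrated values:

```python
import numpy as np

# Assumed single-ion misclassification rates (illustrative only):
P_BRIGHT_AS_DARK = 0.02
P_DARK_AS_BRIGHT = 0.01

# Confusion matrix: columns = true state (dark, bright), rows = measured.
M1 = np.array([[1 - P_DARK_AS_BRIGHT, P_BRIGHT_AS_DARK],
               [P_DARK_AS_BRIGHT, 1 - P_BRIGHT_AS_DARK]])
M = np.kron(M1, M1)   # two ions read out independently

# Toy ideal populations of (dd, db, bd, bb); flagged actions on db and bb
p_true = np.array([0.10, 0.45, 0.00, 0.45])
p_meas = M @ p_true

ratio_true = p_true[1] / p_true[3]
ratio_meas = p_meas[1] / p_meas[3]
print(ratio_true, round(ratio_meas, 3))
```

Because a bright ion is misread as dark more often than the reverse in this example, population leaks toward the darker of the two flagged states and the measured ratio is skewed upward, mimicking the trend seen in the data.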

Fig. 7 shows that the experimental data, both for one and for three diffusion steps, are well approximated by the simulation when the experimentally determined detection error is taken into account. Thus, the deviation of the measured ratios from the ideal result can be traced back mainly to the unbalanced detection error. In addition, errors in the preparation of the input states also play a role, especially when preparing very large or very small ratios, for which either $\epsilon_1$ or $\epsilon_2$ is close to the preparation accuracy. At the same time, the detuning plays a less prominent role in these measurements because fewer dynamical decoupling pulses were required due to the small number of diffusion steps. Moreover, the detuning during these measurements could be kept smaller, leading to a higher average success probability for $n=3$ diffusion steps than the 0.77(2) observed during the measurements investigating the scaling (see Tab. 1).