Abstract
Can quantum mechanics help us build intelligent learning agents? A defining signature of intelligent behavior is the capacity to learn from experience. However, a major bottleneck for agents to learn in real-life situations is the size and complexity of the corresponding task environment. Even in a moderately realistic environment, it may simply take too long to rationally respond to a given situation. If the environment is impatient, allowing only a certain time for a response, an agent may then be unable to cope with the situation and to learn at all. Here we show that quantum physics can help and provide a quadratic speedup for active learning as a genuine problem of artificial intelligence. This result will be particularly relevant for applications involving complex task environments.
I Introduction
The levels of modern-day technology have, in many respects, surpassed the predictions made in the mid-century, as is easily witnessed, for example, by the sheer computing power of the average ‘smart’ mobile phone. Arguably, the most striking exception to this, apart from, perhaps, human space exploration, lies in the development of genuine artificial intelligence (AI), the challenge of which was initially greatly underestimated. The unceasing setbacks in the general AI problem caused research to shift emphasis to the production of useful technology, a direction now called applied AI. That is, emphasis was placed on specific algorithmic AI tasks (modules such as data clustering, pattern matching, binary classification, and similar) and shifted away from the holistic task of designing an autonomous and intelligent agent.
The discovery that the laws of quantum physics can be employed for dramatically enhanced ways of information processing 1985_Deutsch; 1992_Deutsch; 1996_Grover; 1994_Shor; 2000_NC; 2000_Bennet has already had a positive influence on specific algorithmic tasks of applied AI 2002_Sasaki; 2008_Neven; 2013_Lloyd; 2009_Brukner; 2013_Lidar; 2013_Aimeur. However, to our knowledge, it has so far not been demonstrated that quantum physics can help in the complementary task of designing autonomous and learning agents. The latter task is studied in the fields of embodied cognitive sciences and robotics 1986_Braitenberg; 1999_Brooks; 1999_Pfeifer; 2006_Pfeifer; 2008_Floreano; 2008_Barsalou, which promote a behavior-based approach to intelligence and put a strong emphasis on the physical aspects of an agent. The approach to AI we adopt in this work is along the lines of the latter perspective. We are guided by a few basic principles, inspired by biological agents, which include autonomy (implying that the agent must learn in, and adapt to, unknown dynamic environments), embodiedness (implying that the agent is situated in, and actively interacts with, a physical environment), and homogeneity (meaning that all possible separate “cognitive units” arise as possible configurations of one, or a few, homogeneous underlying systems that are, in principle, capable of growth). An example of a model that one could consider homogeneous, aside from the projective simulation model 2012_Briegel we will consider here, would be artificial neural networks. One may then envision that true AI will emerge by growth and the learning of an agent, rather than through deliberate design. In this paper, we show that in such an embodied framework of AI, provable improvements for a broad class of learning agents can be achieved when we take the full laws of quantum mechanics into account.
II Learning agents and quantum physics
How could quantum physics help design better agents? An embodied agent is always situated in an environment from which it receives a sensory input, that is, a percept (from some set of percepts $\mathcal{S}$) and, based on the percept, it produces an action from the possible set of actions $\mathcal{A}$, see Fig. 1. The capacity to learn implies that the agent is at every instant of time in some internal state, which can change based on previous sequences of percept-action events. That is, it has memory which reflects the agent's history. The typical model for such autonomous learning agents we consider here is the reinforcement learning model 2003_Russel; SuttonBarto98, where to each percept-action event a reward in $\Lambda = \{0, 1\}$ (for simplicity, we consider binary rewards, but this reward system can be easily generalized in our model) is assigned when the action was correct.
Each percept-action-reward sequence constitutes an external timestep (or cycle) of the activity of an agent. The learning process of the agent is characterized by an update rule of the internal state (based on the previous percept-action-reward sequences), and the (local) policy of the agent SuttonBarto98 is defined by what action is output given the current internal state and the received percept. Unlike in typical reinforcement learning settings, in embodied active agents the time required to evaluate the policy (decide on an action) must be taken into account, and we refer to it as internal time.
The agent’s learning process is reminiscent of computational oracle query models, in which an unknown oracle (the environment) is queried (via an action) by the agent, in an iterative quest for the best responses. It is tantalizing to consider employing the powerful quantum searching machinery 1996_Grover; 2004_Szegedy_IEEE; 2011_Magniez_SIAM, which has been proven to outperform classical algorithms in computational settings, in an attempt to improve the agent.
However, contrary to computer algorithms, an embodied agent, such as a robot, operates in a physical environment which is, for most existing applications, classical [Footnote 1: Examples of such applications include the problems of navigation, or human-robot interaction. We note that the focus on such classical environments is by no means a restriction of our scheme. For a quantum environment, the actions which the agent can perform could e.g. be quantum measurements (as components of its actuators), and percepts the measurement outcomes; such scenarios are certainly not without interest. The results we present in this work apply equally to such environments. However, we will not explore the applications of learning agents in quantum environments in this paper.]. This prohibits querying in quantum superposition, a central ingredient of all quantum search algorithms. Thus, such naïve approaches to quantizing learning agents are doomed to fail [Footnote 2: Even if we were to allow superpositions of actions, the amount of control the agent must have over the degrees of freedom of the environment, in order to apply quantum query algorithms, may be prohibitive. This constitutes one of the fundamental operative distinctions between quantum algorithms, where full control is assumed, and quantum agents, where such control is limited.].
Nonetheless, while the physical nature of the agent and the environment prohibits speedup through quantum queries of the environment, the physical processes within the agent, which lead to the performed actions, can be significantly improved by employing full quantum mechanics [Footnote 3: In embodied agents, these physical processes may e.g. realize some internal representation of the environment, which the agent itself has to develop as it interacts with the environment. For example, in the context of artificial neural networks such internal models are known as self-organizing maps and, more specifically, sensorimotor maps 1995_Kohonen; 2006_Toussaint.]. In particular, the required internal time can be polynomially reduced in the model we present next. In general learning settings, this speedup alone will constitute an overall qualitative improvement of performance, for instance when the environment changes on timescales not overwhelmingly larger than the agent’s internal ‘thinking’ time.
III Quantum agents based on projective simulation
III.1 The PS agent model
The AI model of the agents we consider in the following is the so-called Projective Simulation (PS) model 2012_Briegel, whose conceptual framework is in line with the desired guiding principles we highlighted earlier. The PS model is based on a specific memory system, which is called episodic and compositional memory (ECM). This memory provides the platform for simulating future action before real action is taken. The ECM can be described as a stochastic network of so-called clips, which constitute the elementary excitations of episodic memory and can be implemented as excitations of suitable physical systems. The percepts ($\mathcal{S}$) and actions ($\mathcal{A}$), along with sequences thereof, are represented within an agent as the aforementioned clips, and the set of these comprises the clip space $\mathcal{C}$. In this work we consider clips which are unit-length sequences, representing a memorized percept or an action, which we denote using the same symbols, so $\mathcal{C} = \mathcal{S} \cup \mathcal{A}$ [Footnote 4: In the PS framework one formally distinguishes between real percepts and actions $s, a$, and their internal representations, denoted $\mu(s), \mu(a)$, which comprise clips. For our purposes, by abuse of notation, we will omit the mapping $\mu$. For more details see the Appendix, section V.3.], but this can be easily generalized. The internal states of the agent, i.e. the total memory, comprise weighted graphs over subsets of the clip space, which are assigned to each percept $s$. This graph dictates the hopping probabilities from one clip to another, and the hopping process realizes a Markov chain (MC). Thus, the elementary internal processes of the agent, which implement the transitions from clip to clip, are discrete-time stochastic diffusion processes. These diffusion processes can be realized in a variety of physical systems, as we discuss later.
The deliberation process of the agent is based only on the diffusion processes over the clip space of a certain (and, in general, variable) size, making this model homogeneous in the sense we explained earlier. The PS agents also perceive rewards and, based on the perceived percept $s$, realized action $a$, and the resulting reward, the weights of the graph (that is, transition probabilities) are updated via simple rules [Footnote 5: In the PS, the hopping probabilities themselves are encoded in the so-called $h$-matrix, which is an unnormalized representation of the transition matrix 2012_Briegel.].
While the PS framework allows for many additional structures (see the Appendix, section V.3 for further details on the PS model), we will, for simplicity, only consider percept-specific flags (corresponding to rudimentary emoticons in 2012_Briegel), which are subsets of actions assigned to each percept, formally $\mathcal{F} = \{f(s) \mid f(s) \subseteq \mathcal{A},\ s \in \mathcal{S}\}$. Such flags may be used to represent the agent’s short-term memory, in which case they significantly improve the performance of the model 2012_Briegel. For example, one can consider a possible mechanism in which, for each percept, all actions are initially flagged. If the agent outputs some action $a$, given a percept $s$, and this action is not rewarded, $a$ is removed from $f(s)$. Once the set $f(s)$ has been depleted (indicating, for instance, that the environment changed its policy), it is reset to contain all actions. The meaning of flags may be more general, and in this work we only assume that the sets of flags are always non-empty.
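The example flag mechanism described above can be sketched in a few lines. This is only an illustrative sketch: the class name `FlagMemory` and its method names are hypothetical and do not come from the PS literature.

```python
class FlagMemory:
    """Percept-specific flags: one set of still-plausible actions per percept.

    Hypothetical helper illustrating the example mechanism in the text.
    """

    def __init__(self, percepts, actions):
        self.actions = frozenset(actions)
        # Initially, every action is flagged for every percept.
        self.flags = {s: set(self.actions) for s in percepts}

    def update(self, percept, action, rewarded):
        """Unflag an unrewarded action; reset the set once it is depleted."""
        if not rewarded:
            self.flags[percept].discard(action)
            if not self.flags[percept]:
                # Depletion may indicate that the environment changed its policy.
                self.flags[percept] = set(self.actions)
        return self.flags[percept]
```

By construction, the set of flags returned for a percept is never empty, which is the only property of flags assumed in this work.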
In the process of deliberation, the relevant Markov chain is diffused a particular number of times, depending on the particular PS model, until an action is output via so-called output couplers [Footnote 6: Each agent is equipped with input and output couplers, which translate, through sensors and actuators (see Fig. 1), real percepts to the internal representations of percepts, and internal representations of actions to real actions.]. The choice of the action, and thereby the policy of the agent, is dictated by the probability distribution over the clip space, which is realized by the diffusion process. The latter depends on the agent’s experience, manifest in the specified MC. Intuitively, this distribution represents the agent’s state of belief on what is the right action in the given situation.
A particular model of PS we introduce here, the so-called reflecting PS (rPS) agents, draws its name from the reflection process 2012_Briegel in which the diffusion processes are repeated many times. Such agents approximate the complete mixing of their MCs, simulating infinite deliberation times. Once the mixing is (approximately) complete, the reflecting agent samples from the realized stationary distribution over the clip space (and, if needed, iterates the mixing process) until a flagged action clip has been sampled. The internal states of reflecting PS agents are thus irreducible, aperiodic and reversible MCs over subsets of the clip space (which contain all action clips). Reflecting PS agents can be seen as a generalization of so-called standard PS agents, and a comparison between a well-studied class of PS agents and rPS agents is provided in the Appendix.
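In the limit of complete mixing, given a percept $s$ and the current internal state (the MC $P_s$), the rPS agents output an action $a$ distributed according to $\tilde{\pi}_s$, given as follows: let $\pi_s$ be the stationary distribution of $P_s$ and let $f(s)$ be the subset of flagged actions, then
\[ \tilde{\pi}_s(i) = \begin{cases} \dfrac{\pi_s(i)}{\sum_{j \in f(s)} \pi_s(j)}, & \text{if } i \in f(s), \\ 0, & \text{otherwise,} \end{cases} \tag{1} \]
that is, the renormalized stationary distribution modified to have support only over flagged actions. We will often refer to $\tilde{\pi}_s$ as the tailed distribution.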
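To make Eq. 1 concrete, the following sketch computes a stationary distribution and its tailed version for a small toy chain. The function names, the 3-clip transition matrix, and the choice of clips 1 and 2 as flagged actions are assumptions made only for illustration; the matrix is taken to be column-stochastic, so that applying the chain to a distribution is a matrix-vector product.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of a column-stochastic matrix P (P @ pi = pi)."""
    eigvals, eigvecs = np.linalg.eig(P)
    k = np.argmin(np.abs(eigvals - 1.0))   # eigenvalue closest to 1
    pi = np.real(eigvecs[:, k])
    return pi / pi.sum()                   # fixes both sign and normalization

def tailed_distribution(pi, flagged):
    """Renormalized restriction of pi to the flagged clips (Eq. 1)."""
    idx = list(flagged)
    tail = np.zeros_like(pi)
    tail[idx] = pi[idx]
    return tail / tail.sum()

# Toy reversible chain on three clips; column i holds the hopping
# probabilities out of clip i.
P = np.array([[0.50, 0.25, 0.00],
              [0.50, 0.50, 0.50],
              [0.00, 0.25, 0.50]])
pi = stationary_distribution(P)                # (1/4, 1/2, 1/4)
pi_tailed = tailed_distribution(pi, {1, 2})    # (0, 2/3, 1/3)
```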
In general, complete mixing is not possible or needed. To realize the (approximate) tailed distribution, as given in Eq. 1, the classical agent will have to, iteratively, prepare the approximate stationary distribution of $P_s$ (by applying $P_s$ to the initial distribution $t$ times), and sample from it, until a flagged action is hit. It is well known that, in order to mix the MC, $t$ should be chosen in $\tilde{O}(1/\delta_s)$ [Footnote 7: In this paper we do not consider logarithmically contributing terms in the complexity analysis, thus we use the $\tilde{O}$ level analysis of the limiting behavior (instead of the standard ‘big O’ $O$).] (where $\delta_s$ is the spectral gap of the MC $P_s$, defined as $\delta_s = 1 - |\lambda_2|$, and $\lambda_2$ is the second largest eigenvalue of $P_s$ in absolute value), and the expected number of iterations of mixing and checking which have to be performed is $\tilde{O}(1/\epsilon_s)$ (where $\epsilon_s$ is the probability of sampling a flagged action from the stationary distribution $\pi_s$). Here, we will use the label of the transition matrix $P_s$ as a synonym for the MC itself. Note that the internal time of the classical agent, i.e. the number of primitive processes (diffusion steps), is therefore governed by the quantities $1/\delta_s$ and $1/\epsilon_s$.
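The mix-and-check loop of the classical agent can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the function name, its parameters and the toy chain are hypothetical, and the matrix is again assumed to be column-stochastic.

```python
import numpy as np

def classical_deliberation(P, flagged, t_mix, rng, start=0, max_rounds=10_000):
    """Diffuse for t_mix steps, check the sampled clip, repeat until flagged.

    Returns the flagged clip and the number of mix-and-check rounds used.
    """
    n = P.shape[0]
    for rounds in range(1, max_rounds + 1):
        clip = start
        for _ in range(t_mix):
            # One primitive diffusion step: hop according to column `clip`.
            clip = int(rng.choice(n, p=P[:, clip]))
        if clip in flagged:
            return clip, rounds
    raise RuntimeError("no flagged action sampled within max_rounds")

# Toy chain with stationary distribution (1/4, 1/2, 1/4); clips 1 and 2
# play the role of flagged actions.
P = np.array([[0.50, 0.25, 0.00],
              [0.50, 0.50, 0.50],
              [0.00, 0.25, 0.50]])
rng = np.random.default_rng(0)
action, rounds = classical_deliberation(P, {1, 2}, t_mix=10, rng=rng)
```

The total number of diffusion steps is `t_mix` times the number of rounds, which mirrors the two quantities governing the classical internal time.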
III.2 Quantum speedup of reflecting PS agents
The procedure the rPS agent performs in each deliberation step resembles a type of random walk-based search algorithm which can be employed to find targeted items in directed weighted graphs 2011_Magniez_SIAM. In that context, the theory of quantum walks provides us with analogs of discrete-time diffusion processes, using which the search time can be quadratically reduced 2011_Magniez_SIAM; 2005_Magniez; 2004_Ambainis; 2004_Szegedy_IEEE. To design the quantum agent, inspired by these approaches to searching, here we introduce a quantum walk procedure which can be seen as a randomized Grover-like search algorithm (the quantum counterparts of classical search algorithms), most closely matching the main protocol in 2011_Magniez_SIAM.
However, there are essential differences between searching problems and the problems of designing intelligent AI agents, which we expose and resolve in this work. First, we note that, for the task of simple searching, the procedure the rPS agent follows is known not to be optimal in general 2011_Magniez_SIAM; 2012_Magniez_Algorithmica. In contrast, for the task of the rPS agent, which is to output a flagged action according to a good approximation of the tailed distribution in Eq. 1, this algorithm is, in general, optimal [Footnote 8: The optimality claim holds provided that no additional mechanisms except for diffusion and checking are available.]. This can be seen from the known lower bounds for mixing times of reversible MCs (see the Appendix, section V.2 for details).
Furthermore, while, as a direct consequence of the results in 2011_Magniez_SIAM, the quantum rPS agent produces a flagged action in a time quadratically shorter than that achieved using the procedure employed by the classical agent, prior works provide no guarantees that the output actions will be distributed according to the desired tailed distribution. Recall that, in the context of AI, all agents produce some action, and it is precisely the output distribution which differentiates one agent from another in terms of behavior (and, thus, success). In this work we prove that the output distributions of both the reflecting classical and quantum agents approximate the distribution of Eq. 1, and thus are approximately equal. That is, they belong to the same behavioral class (for a formal definition of the behavioral classes we introduce, see the Appendix, section V.4). Consequently, the quantum reflecting agent construction we give realizes, in the full sense, quantum-enhanced analogs of classical reflecting agents.
While quantum approaches to sampling problems have not until now been exploited in AI, we observe that the methodology we use is, in spirit, related to the problem of sampling from particular distributions. Such sampling tasks have been extensively studied, often in the context of Markov chain Monte Carlo methods, where quantum speedup can also be obtained 2011_Temme_Nature; 2012_Yung; 2008_Somma_P; 2008_Wocjan. There, the quantum walks were mostly used for the important purpose of sampling from Boltzmann-Gibbs distributions of (classical and quantum) Hamiltonians.
In order to define the quantum reflecting agent, we first review the standard constructions and results from the theory of classical and quantum random walks 2004_Ambainis; 2004_Szegedy_IEEE; 2011_Magniez_SIAM, and refer the reader to the Appendix, section V.2, for more details.
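The quantum rPS agent we propose uses the standard quantum discrete-time diffusion operators $U_{P_s}$ and $V_{P_s}$, which act on two quantum registers, sufficiently large to store the labels of the nodes of the MC $P_s$. The diffusion operators are defined as $U_{P_s}|i\rangle|0\rangle = |i\rangle|p_i\rangle$ and $V_{P_s}|0\rangle|j\rangle = |p_j^*\rangle|j\rangle$, where $|p_i\rangle = \sum_j \sqrt{[P_s]_{ji}}\,|j\rangle$, $|p_j^*\rangle = \sum_i \sqrt{[P_s^*]_{ij}}\,|i\rangle$. Here, $P_s^*$ is the time-reversed MC defined by $\pi_i [P_s]_{ji} = \pi_j [P_s^*]_{ij}$, where $\pi = (\pi_i)_i$ is the stationary distribution of $P_s$ [Footnote 9: In the case of reversible MCs, which will be the main focus of this paper, $P_s^* = P_s$, so $V_{P_s}$ can be constructed from $U_{P_s}$ by conjugating it with the swap operator of the two registers. Here, we present the construction for the general case of irreducible, aperiodic Markov chains.]. Using four applications of the diffusion operators above, it has been shown that one can construct the standard quantum walk operator $W(P_s)$ (sometimes referred to as the Szegedy walk operator), which is a composition of two reflections in the mentioned two-register state space. In particular, let $\Pi_1$ be the projection operator onto the space $\mathrm{span}\{|i\rangle|p_i\rangle\}_i$ and $\Pi_2$ be the projector onto $\mathrm{span}\{|p_j^*\rangle|j\rangle\}_j$. Then $W(P_s) = (2\Pi_2 - I)(2\Pi_1 - I)$.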
Using the quantum walk operator and the well-known phase detection algorithm 2000_NC; 2011_Magniez_SIAM, the agent can realize the subroutine $R(P_s)(q, k)$, which approximates the reflection operator $2|\pi_s\rangle\langle\pi_s| - I$, where $|\pi_s\rangle = \sum_i \sqrt{\pi_s(i)}\,|i\rangle$ is the coherent encoding of the stationary distribution $\pi_s$. The parameters $q$ and $k$ control the fidelity (and the time requirement) of this process, i.e. how well the reflection is approximated as a function of the number of applications of the quantum walk operator.
More precisely, in the implementation of the operator $R(P_s)(q, k)$, the quantum rPS agent utilizes an ancillary register of $q \times k$ qubits. To ensure the correct behavior, $q$ is chosen as a function of the square root of the spectral gap of the MC $P_s$. Under this condition, it has been shown that the distance between the ideal reflection operator and the realized approximate operator is upper bounded by $2^{1-k}$, under a suitable metric (see the Appendix, section V.2, Theorem 3 for details). That is, the fidelity of this reflection operator approaches unity exponentially quickly in the parameter $k$ 2011_Magniez_SIAM.
To produce a flagged action according to the desired distribution, the quantum rPS agent will first initialize its quantum register to the state $U_{P_s}|\pi_s\rangle|0\rangle$, which requires just one application of the diffusion operator $U_{P_s}$, provided the state $|\pi_s\rangle$ (the coherent encoding of the stationary distribution) is available. Here, like in the standard frameworks of algorithms based on quantum walks with non-symmetric Markov chains, we assume that the state $|\pi_s\rangle$ is available, and in the Appendix we provide an example of reflecting classical and quantum agents where this is easily achieved [Footnote 10: Here, we note that concrete applications of quantum walks specified by non-symmetric Markov chains have, to our knowledge and aside from this work, only been studied in 2012_paparo_google; 2013_paparo_complex, by two of the authors and other collaborators, in a significantly different context.].
Following this, the agent performs a randomized Grover-like sequence of reflections, reflecting over flagged actions, interlaced with reflections using the approximate reflection operator described previously. After the reflections have been completed the required number of times, the resulting state is measured, and the found flagged action is output. In the case a non-action clip is hit, the entire procedure is repeated [Footnote 11: For completeness we note as a technicality, following 2011_Magniez_SIAM, that if the probability $\epsilon_s$ of sampling a flagged action from the stationary distribution is always bounded below by a known constant (as in 1998_Boyer), the quantum agent can immediately measure the initial state and efficiently produce the desired output by iterating this process a constant number of times. However, in the scenarios we envision, $\epsilon_s$ is very small.].
Since Grover-like search engines guarantee that the overlap between the final state and a state with support just over the actions is constant, the probability of not hitting a flagged action decreases exponentially quickly in the total number of iterations, and does not contribute significantly to our analysis. In the Appendix, section V.4, we provide a detailed analysis, and propose a method for the efficient re-preparation of the required initial state (by recycling of the residual state), in the event the deliberation procedure should be repeated. The deliberation process is detailed in Fig. 2. The total number of reflections required is chosen uniformly at random from a range of size $\tilde{O}(1/\sqrt{\epsilon_s})$, where $\epsilon_s$ is the probability of sampling a flagged action from the stationary distribution (this choice is uniform at random as for the randomized Grover algorithm 1998_Boyer), and the approximate reflection operator is applied with a precision parameter that grows only logarithmically in $1/\sqrt{\epsilon_s}$ [Footnote 12: We note that some of the only logarithmically contributing terms, which appear in a more detailed analysis omitted here, can be further avoided using more complicated constructions, as has, for instance, been done in 2011_Magniez_SIAM.].
This implies that the total number of calls to the diffusion operators $U_{P_s}$ and $V_{P_s}$ is in $\tilde{O}(1/\sqrt{\delta_s \epsilon_s})$, with $\delta_s$ the spectral gap of the MC (and the total number of reflections over flagged actions, the equivalents of checks in the classical agent, is in $\tilde{O}(1/\sqrt{\epsilon_s})$), which is a quadratic improvement over the classical agent. As we have mentioned previously, the remaining key ingredient to our result is the fact that the proposed quantum agent produces actions (approximately) according to the tailed distribution in Eq. 1, and that the approximations for both agents can be made arbitrarily good within at most logarithmic overhead. We leave the proof of this claim for the Appendix, section V.4. We note that in this paper we have presented constructions for reversible MCs for simplicity, but the constructions can be extended to general irreducible chains using approaches analogous to those in 2011_Magniez_SIAM.
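As a rough illustration of what the quadratic improvement means in practice, consider hypothetical values $\epsilon_s = 10^{-4}$ (probability of sampling a flagged action from the stationary distribution) and $\delta_s = 10^{-2}$ (spectral gap). These numbers are chosen only for the sake of the example, and constants and logarithmic factors are ignored:

```latex
\begin{align*}
\text{classical agent: } & \frac{1}{\epsilon_s \delta_s}
   = \frac{1}{10^{-4} \cdot 10^{-2}} = 10^{6}
   \text{ diffusion steps},\\
\text{quantum agent: } & \frac{1}{\sqrt{\epsilon_s \delta_s}}
   = \frac{1}{\sqrt{10^{-6}}} = 10^{3}
   \text{ calls to the diffusion operators}.
\end{align*}
```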
We have thus presented a method for the generic quantization of reflecting PS agents, which maintains the behavior of the agents and provably yields a quadratic speedup in internal time, which is vital in real-environment settings.

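Initialize: Prepare $U_{P_s}|\pi_s\rangle|0\rangle = \sum_i \sqrt{\pi_s(i)}\,|i\rangle|p_i\rangle$, with $|p_i\rangle = \sum_j \sqrt{p_{ji}}\,|j\rangle$, where $p_{ji}$ is the transition probability from $i$ to $j$ as dictated by the MC $P_s$.
For $t$ timesteps (with $t$ chosen at random, as described in the main text) do
    Check: Apply the operator which flips the phase of all components of the current state of the first register which are not in $f(s)$.
    Diffuse: Apply the approximate reflection operator $R(P_s)(q, k)$, as described in the main text.
Measure the first register, and if it is a flagged action, output it, else reiterate the procedure.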
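The check-and-diffuse loop above can be simulated classically for small examples. The sketch below is an idealization and not the agent's actual construction: it replaces the approximate two-register reflection by the exact reflection $2|\pi_s\rangle\langle\pi_s| - I$ on a single register, and the function name and toy distribution are hypothetical. It illustrates the key point that, conditioned on a flagged clip being measured, the outcome follows the tailed distribution.

```python
import numpy as np

def quantum_deliberation_probs(pi, flagged, t):
    """Measurement probabilities after t idealized check/diffuse rounds."""
    pi_ket = np.sqrt(pi)                   # coherent encoding of pi
    sign = -np.ones_like(pi)
    sign[list(flagged)] = 1.0              # phase flip on non-flagged clips
    psi = pi_ket.copy()
    for _ in range(t):
        psi = sign * psi                                  # Check
        psi = 2.0 * pi_ket * (pi_ket @ psi) - psi         # Diffuse (exact)
    return psi ** 2

# Hypothetical stationary distribution over 16 clips; clips 0 and 1 flagged.
pi = np.arange(1, 17) / 136.0
flagged = [0, 1]
eps = pi[flagged].sum()                    # probability of a flagged clip
probs = quantum_deliberation_probs(pi, flagged, t=3)
conditional = probs[flagged] / probs[flagged].sum()
```

In this idealized setting the probability of measuring a flagged clip after $t$ rounds is $\sin^2((2t+1)\theta)$ with $\sin\theta = \sqrt{\epsilon_s}$, as in Grover search, while `conditional` coincides with the tailed distribution restricted to the flagged clips.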
IV Discussion
We have presented a class of quantum learning agents that use quantum memory for their internal processing of previous experience. These agents are situated in a classical task environment that rewards a certain behavior but is otherwise unknown to the agent, which corresponds to the situation of conventional learning agents.
The agent’s internal ‘program’ is realized by physical processes that correspond to quantum walks. These quantum walks are derived from classical random walks over directed weighted graphs, which represent the structure of its episodic memory. We have shown how, using quantum coherence and known results from the study of quantum walks, the agent can explore its episodic memory in superposition in a way which guarantees a provable quadratic speedup in its active learning time over its classical analogue.
Regarding potential realizations for such quantum learning agents, modern quantum physics laboratories are exploring varieties of systems which can serve as suitable candidates. Quantum random walks and related processes can naturally be implemented in linear optics setups by, for instance, arrays of polarizing beam splitters 2012_Aspuru, and highly versatile setups can also be realized using internal states of trapped ions 2012_Roos. Such advancements, all of which belong to the field of quantum simulation 2013_Schaetz, could be used as ingredients towards the implementation of quantum reflecting agents, without the need to develop a full-blown universal quantum computer.
An entirely different route towards realizing the proposed quantum (and classical) learning agents might employ condensed matter systems in which the proposed Markov chains could e.g. be realized through cooling or relaxation processes towards target distributions that then encode the state of belief of the agent. Here we envision rather nontrivial cooling/relaxation schemes in complex many-body systems, the study of which is also a prominent topic in the field of quantum simulation.
In conclusion, it seems to us that the embodied approach to artificial intelligence acquires a further fundamental perspective by combining it with concepts from the field of quantum physics. The implications of embodiment are, in the first place, described by the laws of physics, which tell us not only about the constraints but also the ultimate possibilities of physical agents. In this paper we have shown an example of how the laws of quantum physics can be fruitfully employed in the design of future intelligent agents that will outperform their classical relatives in complex task environments.
Acknowledgments:
MAMD acknowledges support by the Spanish MICINN grant FIS200910061, FIS201233152, the CAM research consortium QUITEMAD S2009ESP1594, the European Commission PICC: FP7 20072013, Grant No. 249958, and the UCMBS grant GICC910758. HJB acknowledges support by the Austrian Science Fund (FWF) through the SFB FoQuS: F 4012, and the Templeton World Charity Fund (TWCF) grant TWCF0078/AB46.
GDP and VD have contributed equally to this work.
V Appendix
V.1 Formal definitions and behavior of reinforcement learning agents
Here we formally define the model of reinforcement learning agents as employed in this work.
Definition 1.
(Reinforcement learning agent) A reinforcement learning agent is an ordered sextuplet $(\mathcal{S}, \mathcal{A}, \Lambda, \mathcal{I}, D, U)$ where:
$\mathcal{S}$, $\mathcal{A}$ are the sets of percepts and actions, respectively.
$\Lambda = \{0, 1\}$ is the set of rewards, offered by the environment.
$\mathcal{I}$ is the set of possible internal states of the agent.
$D: \mathcal{S} \times \mathcal{I} \to \mathcal{A}$ is the decision function, which outputs some action given a percept and the internal state.
$U: \mathcal{S} \times \mathcal{A} \times \Lambda \times \mathcal{I} \to \mathcal{I}$ is the update function, which updates the internal state based on the success or failure of the last percept-action sequence.
A few comments are in order. In this work, the sets of percepts, actions and internal states are defined to be finite, but, in general, this need not be the case. The set of rewards is binary, and this can again be generalized. The update function may take additional information into account, based on additional outputs of the decision function, which are only processed internally, but this does not occur in the models we consider.
The decision function is not necessarily deterministic. In the nondeterministic case it can be formally defined as
\[ D: \mathcal{S} \times \mathcal{I} \to \mathrm{distr}(\mathcal{A}), \tag{2} \]
that is, a function which takes values in the set of distributions over $\mathcal{A}$. In this case we also assume that this distribution is sampled before actual output is produced, and that the sampled action is the input to the update function.
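The structure of Definition 1 can be mirrored in code. The sketch below is only one possible rendering, with hypothetical names throughout: the decision function returns a distribution over actions (as in Eq. 2), the distribution is sampled before output, and the sampled action is passed to the update function. The finite sets of percepts, actions and internal states are left implicit in the types.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, Dict, Hashable

@dataclass
class LearningAgent:
    state: Any                                                # current internal state
    decide: Callable[[Hashable, Any], Dict[Hashable, float]]  # decision function
    update: Callable[[Hashable, Hashable, int, Any], Any]     # update function

    def step(self, percept, reward_fn, rng):
        """One external timestep: percept -> action -> reward -> update."""
        dist = self.decide(percept, self.state)
        actions = list(dist)
        # Sample the decision distribution before producing the output.
        action = rng.choices(actions, weights=[dist[a] for a in actions])[0]
        reward = reward_fn(percept, action)       # binary reward, 0 or 1
        self.state = self.update(percept, action, reward, self.state)
        return action, reward

# Toy usage: the internal state is the currently preferred action; the
# agent switches preference after an unrewarded action.
agent = LearningAgent(
    state="a",
    decide=lambda s, c: {c: 1.0},
    update=lambda s, a, r, c: c if r else ("b" if c == "a" else "a"),
)
first = agent.step("s0", lambda s, a: int(a == "b"), random.Random(0))
```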
Next, we consider equivalences between agents in socalled passive settings.
In the algorithmic tradition of machine learning, the learning pace is measured by external time (steps) alone, and the typical figure of merit is the percentage of rewarded actions of the agent, as a function of external time. From an embodied agent perspective, this setup corresponds to a special passive setting where a static environment always waits for the responses of the agent. This constraint imposes a restriction on the universality of statements which can at all be made about the performance of an agent. In particular, in that setting it is well known that no two agents can be meaningfully compared without reference to a specific learning task (or a class of tasks), a collection of results dubbed ‘no free lunch theorems’ and ‘almost no free lunch theorems’ 1996_Wolpert; 1997_Droste [Footnote 13: We acknowledge that the interpretation of these results in the sense of their practical impact on the field is not without controversy. Nonetheless, the validity of the mathematical statements is not contested. See NFLorg for more details.]. These results prove that when one agent outperforms another in a certain environment, there exists a different environment where the ranking according to performance is reversed. This, for instance, implies that every choice of environment settings, for which results of agent performance are presented, must first be well justified. More critically for our agenda, in which we wish to make no assumptions on the environment, passive settings would imply that no comparative statements relating the performances of agents are possible. In active scenarios internal time does matter, but nonetheless the passive setting plays a part. It provides a baseline for defining a passive behavioral equivalence of agents, which will be instrumental in our analysis of active scenarios.
Let us denote the elapsed sequence of triplets $(s, a, \lambda)$ which had occurred up to timestep $k$ (the history of the agent) with $H_k$, for two agents $A$ and $A'$ who can perceive and produce the same sets of percepts ($\mathcal{S}$) and actions ($\mathcal{A}$), respectively. Then we will say that $A$ and $A'$ are passively ($\epsilon$)-equal if at every external time step $k$ the probabilities $P_A(a|H_{k-1}, s)$ and $P_{A'}(a|H_{k-1}, s)$ of agents $A$ and $A'$, respectively, outputting some action $a \in \mathcal{A}$, given every percept $s \in \mathcal{S}$, and given all possible identical histories $H_{k-1}$, are ($\epsilon$)-equal, in terms of the variational distance on distributions:
\[ \frac{1}{2} \sum_{a \in \mathcal{A}} \left| P_A(a|H_{k-1}, s) - P_{A'}(a|H_{k-1}, s) \right| \leq \epsilon, \tag{3} \]
which we abbreviate with
\[ A \approx_{\epsilon} A'. \tag{4} \]
If the agents considered are equipped with an extra precision parameter, which fine-tunes the behavior of the agent, we can demand more and require that the approximate equality above converges to an equality (i.e. $\epsilon \to 0$ as the precision parameter grows). Then, in the limit, the relation above induces passive behavioral equivalence classes for fixed sets of possible percepts and actions. In the case of the classical and quantum agents we consider in the main text, such precision parameters do exist, and, as we show later in this Supplementary Information, the approximate equality converges to an equality.
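For a single percept and history, the comparison underlying passive equality is a variational-distance check between two output distributions. A minimal sketch follows, assuming the convention with the factor $1/2$ and using hypothetical function names:

```python
import numpy as np

def variational_distance(p, q):
    """Variational distance (half the L1 distance) between two distributions."""
    return 0.5 * np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float)).sum()

def outputs_eps_equal(p, q, eps):
    """True if two action distributions are within variational distance eps."""
    return variational_distance(p, q) <= eps

# Action distributions of two hypothetical agents for the same percept
# and the same history.
d = variational_distance([0.5, 0.5], [0.6, 0.4])
```

Passive equality of two agents requires this condition to hold for every percept and every possible shared history, not just for one pair of distributions.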
In passive settings, by definition, two passively equal agents perform equally well, and comparison of agents within a class is pointless. However, in the active scenario, with the classes in place, we can meaningfully compare agents within the same class, with no assumptions on the environment [Footnote 14: That is, a comparison can be made with no further assumptions on the environment beyond the trivial ones: that the percept and action sets are compatible with the environment and that the environment provides a rewarding scheme.]. Indeed, in an active learning setting, two passively equal agents $A$ and $A'$ may have vastly different success chances. To see this, suppose that the environment changes its policies on a timescale that is long compared to the internal timescale of agent $A$, but short relative to that of the internally slower agent $A'$. The best policy of both agents is to query the environment as frequently as possible, in order to learn the best possible actions. However, from the perspective of the slow agent, the environment will look fully inconsistent: once-rewarded actions are no longer the right choice, as that agent simply did not have the time to learn. Thus, in active scenarios, the internal speed of the agent is vital.
V.2 Classical and quantum walk basics
A random walk on a graph is described by a MC, specified by a transition matrix $P$ which has entries $P_{ji} = \Pr(j \mid i)$. For an irreducible MC there exists a stationary distribution $\pi$ such that $P\pi = \pi$. For an irreducible and aperiodic MC this distribution can be approximated by $P^{t}\pi_0$, i.e. by applying the MC $t$ times to any initial distribution $\pi_0$, where $t \geq t^{\mathrm{mix}}_{\epsilon'}$. This time is known as the mixing time and is defined as follows.
Definition 2.
(Mixing Time).
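The mixing time is:
$$t^{\mathrm{mix}}_{\epsilon'} = \min\left\{\, t \;:\; \forall s \geq t,\ \forall \pi_0,\ \left\| P^{s}\pi_0 - \pi \right\| \leq \epsilon' \,\right\}.$$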
The latter can be related to the spectral properties of the MC P, in the case of reversible chains, via the following theorem 1993_Sinclair:
Theorem 1.
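The mixing time satisfies the following inequalities:
$$\frac{\lambda_2}{2(1-\lambda_2)}\,\log\frac{1}{2\epsilon'} \;\leq\; t^{\mathrm{mix}}_{\epsilon'} \;\leq\; \frac{1}{\delta}\left(\max_i \log\frac{1}{\pi_i} + \log\frac{1}{\epsilon'}\right),$$
where $\lambda_2$ is the second-largest eigenvalue of $P$ in absolute value and $\delta = 1 - |\lambda_2|$ is the spectral gap.
Here, we use $\epsilon'$ instead of the standard $\epsilon$ for consistency with the rest of the Supplementary Information, as $\epsilon$ has a reserved meaning.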
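To make the notion of mixing time concrete, the following sketch computes it by brute force for a small chain. It assumes a column-stochastic transition matrix (entry `P[j][i]` is the probability of moving from `i` to `j`), the variational distance with factor 1/2, and point-mass initial distributions as the worst case; the function names and the two-state example chain are illustrative choices, not taken from the text.

```python
def step(P, dist):
    """Apply a column-stochastic matrix once: new[j] = sum_i P[j][i] * dist[i]."""
    n = len(dist)
    return [sum(P[j][i] * dist[i] for i in range(n)) for j in range(n)]


def variational_distance(p, q):
    """Half the L1 distance between two distributions given as lists."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))


def stationary(P, iters=10_000):
    """Approximate the stationary distribution by repeated application of P."""
    dist = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        dist = step(P, dist)
    return dist


def mixing_time(P, eps, t_max=10_000):
    """Smallest t at which every point-mass start is within eps of stationarity.

    By convexity, point masses are the worst-case initial distributions, and the
    distance to stationarity does not increase with further steps.
    """
    n = len(P)
    pi = stationary(P)
    dists = [[1.0 if i == k else 0.0 for i in range(n)] for k in range(n)]
    for t in range(t_max + 1):
        if all(variational_distance(d, pi) <= eps for d in dists):
            return t
        dists = [step(P, d) for d in dists]
    raise ValueError("chain did not mix within t_max steps")


# Two-state example chain (hypothetical); its eigenvalues are 1 and 0.7.
P = [[0.9, 0.2],
     [0.1, 0.8]]

print(stationary(P))
print(mixing_time(P, eps=0.01))
```

For this example the distance to stationarity shrinks by a factor of 0.7 per step, so the spectral gap of 0.3 directly sets how many classical diffusion steps are needed.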
For clarity, let us introduce some definitions and theorems, originally provided in 2004_Szegedy_IEEE; 2011_Magniez_SIAM, which fix the notation and will be used to prove the main results on the speedup of quantum agents.
The quantum analog of applying the MC $P$ is given by:
Definition 3.
(Quantum Diffusion Operators).
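The quantum diffusion operators, the analogs of the classical diffusion operators, are given by the following transformations:
$$U_P : |i\rangle|0\rangle \mapsto |i\rangle|p_i\rangle, \tag{5}$$
$$V_P : |0\rangle|j\rangle \mapsto |p^{*}_j\rangle|j\rangle, \tag{6}$$
where $|p_i\rangle = \sum_j \sqrt{P_{ji}}\,|j\rangle$, $|p^{*}_j\rangle = \sum_i \sqrt{P^{*}_{ij}}\,|i\rangle$, and $P^{*}$ is the time-reversed MC defined by $\pi_i P_{ji} = \pi_j P^{*}_{ij}$.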
We will consider the application of the MC P (for the classical agent) and the quantum diffusion operators (for the quantum agent) as the (equally time consuming) primitive processes, as is done in the theory of quantum random walks 2011_Magniez_SIAM. Next, we can define the quantum walk operator for the Markov chain P.
Definition 4.
(Walk Operator or Quantum Markov Chain).
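The walk operator or Quantum Markov Chain is given by
$$W(P) = (2\Pi_B - \mathbb{1})(2\Pi_A - \mathbb{1}), \tag{7}$$
where $\Pi_A$ is the projection operator onto the space $A = \mathrm{span}\{|i\rangle|p_i\rangle\}_i$ and $\Pi_B$ is the projection operator onto $B = \mathrm{span}\{|p^{*}_j\rangle|j\rangle\}_j$.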
The quantum walk operator can be easily realized through four applications of the quantum diffusion operators, see e.g. 2011_Magniez_SIAM for details.
Another standard operation which both the classical and the quantum agents perform is checking whether the clip found is flagged (corresponding to checking whether an item is marked). The quantum check operator is defined as follows.
Definition 5.
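(Check). The quantum check operator is the reflection, denoted $\mathrm{ref}(f(s))$, performing
$$\mathrm{ref}(f(s))\,|i\rangle|j\rangle = \begin{cases} -|i\rangle|j\rangle & \text{if } i \in f(s), \\ \phantom{-}|i\rangle|j\rangle & \text{otherwise,} \end{cases} \tag{8}$$
where $f(s)$ denotes the set of flagged actions corresponding to the percept $s$.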
In order to prove our main theorems we will use the ideas introduced in the context of quantum searching 2004_Szegedy_IEEE; 2011_Magniez_SIAM, which we now briefly review. In the quantum-walk-over-graphs approach to searching, one defines an initial state, which encodes the stationary distribution of a MC,
$$|\pi\rangle = \sum_i \sqrt{\pi_i}\,|i\rangle|p_i\rangle, \tag{9}$$
and performs a rotation onto the state containing the 'marked items',
$$|\tilde{\pi}\rangle = \frac{\Pi_{f(s)}|\pi\rangle}{\left\|\Pi_{f(s)}|\pi\rangle\right\|}, \tag{10}$$
where is the projector on the space of marked items i.e Let us point out that .
In order to achieve this rotation one makes use of two reflections. The first is the reflection over (denoted ), the state orthogonal to in . This operator can be realized using the primitive of checking. Indeed, we have the following claim (stated in 2011_Magniez_SIAM) given by:
Lemma 6.
Restricted on the subspace , the action of is identical to .
Proof.
Let be a vector in . We have that:
(11) 
where is the projector on the set . The result easily follows by noting that
(12) 
and that
since .
∎
On the other hand, the reflection over $|\pi\rangle$ is not straightforward. One can devise an approximate scheme to implement this reflection using the phase estimation algorithm. Indeed, one can build a unitary operator, using phase estimation applied to the quantum walk operators, which approximates the reflection over $|\pi\rangle$. Before we state the theorem regarding this approximate reflection operator (constructively proven in 2011_Magniez_SIAM), we first give another result regarding the spectrum of the quantum walk operator, which will be relevant to us presently.
Theorem 2.
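(Szegedy 2004_Szegedy_IEEE) Let $P$ be an irreducible, reversible MC with stationary distribution $\pi$. Then the quantum walk operator $W(P)$ is such that:
1) $W(P)|\pi\rangle = |\pi\rangle$.
2) $W(P)|\psi_{\pm}\rangle = e^{\pm 2i\theta}|\psi_{\pm}\rangle$, where $\cos\theta$ is the absolute value of an eigenvalue of $P$ and $\theta \in (0, \pi/2)$.
3) $W(P)$ has no other eigenvalue in $A + B$.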
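Note that the phase gap $\Delta$, defined as the minimum nonzero $2\theta$, is such that $\cos(\Delta/2) = |\lambda_2|$, where $\lambda_2$ is the second-largest eigenvalue of $P$ with respect to the absolute value. One can then, with some algebra, conclude that $\Delta \geq 2\sqrt{\delta}$, where $\delta = 1 - |\lambda_2|$ is the spectral gap of $P$.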
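The spectral statements above can be checked numerically on a tiny example. The sketch below is an illustration under several assumptions: it uses the Szegedy-type construction $W(P) = (2\Pi_B - \mathbb{1})(2\Pi_A - \mathbb{1})$ with $A$ spanned by $|i\rangle|p_i\rangle$ and $B$ by $|p^{*}_j\rangle|j\rangle$, a reversible chain (so that $P^{*} = P$), a column-stochastic matrix `P[j][i]`, and the reading $\Delta = 2\theta$ with $\cos\theta = |\lambda_2|$. All function names and the two-state chain are hypothetical. Amplitudes are real here, so plain lists of floats suffice.

```python
import math


def kron(u, v):
    """Tensor product of two real vectors, flattened with index i * len(v) + j."""
    return [x * y for x in u for y in v]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def basis(n, i):
    return [1.0 if k == i else 0.0 for k in range(n)]


def walk_states(P):
    """Orthonormal spanning sets of A (|i>|p_i>) and B (|p*_j>|j>), assuming P* = P."""
    n = len(P)
    # p[i][j] = sqrt(P[j][i]) is the amplitude of |j> in |p_i>.
    p = [[math.sqrt(P[j][i]) for j in range(n)] for i in range(n)]
    A = [kron(basis(n, i), p[i]) for i in range(n)]
    B = [kron(p[j], basis(n, j)) for j in range(n)]
    return A, B


def reflect(vectors, psi):
    """Apply 2*Pi - 1, where Pi projects onto the span of the orthonormal `vectors`."""
    out = [-x for x in psi]
    for v in vectors:
        c = 2.0 * dot(v, psi)
        out = [o + c * x for o, x in zip(out, v)]
    return out


def walk(P, psi):
    """One application of the walk operator: reflect about A, then about B."""
    A, B = walk_states(P)
    return reflect(B, reflect(A, psi))


# Reversible two-state chain with stationary distribution (2/3, 1/3), lambda_2 = 0.7.
P = [[0.9, 0.2],
     [0.1, 0.8]]
pi = [2.0 / 3.0, 1.0 / 3.0]
lam2 = 0.7
delta = 1.0 - abs(lam2)

A, _ = walk_states(P)
# |pi> = sum_i sqrt(pi_i) |i>|p_i>, and a unit vector in A orthogonal to |pi>.
pi_state = [math.sqrt(pi[0]) * a0 + math.sqrt(pi[1]) * a1 for a0, a1 in zip(A[0], A[1])]
psi = [math.sqrt(pi[1]) * a0 - math.sqrt(pi[0]) * a1 for a0, a1 in zip(A[0], A[1])]

invariance_error = max(abs(x - y) for x, y in zip(walk(P, pi_state), pi_state))
cos_2theta = dot(psi, walk(P, psi))   # rotation angle on the subspace orthogonal to |pi>
theta = math.acos(abs(lam2))

print(invariance_error)
print(cos_2theta, math.cos(2.0 * theta))
print(2.0 * theta, 2.0 * math.sqrt(delta))
```

In this example the walk leaves $|\pi\rangle$ fixed and rotates the orthogonal part of $A + B$ by the angle $2\theta$, which is larger than $2\sqrt{\delta}$; this quadratic relation between the phase gap and the classical spectral gap is what the quantum agent exploits.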
Let us note that any unitary able to approximately detect whether the eigenvalue of of a state in is different from one (or equivalently, its eigenphase is different from zero) and conditionally flip the state, will do. We will use such a unitary to approximate . Let us use this intuition to build such a unitary, , that takes as a parameter the precision and refer to it in the following as the approximate reflection operator 2011_Magniez_SIAM:
Theorem 3.
(Approximate Reflection Operator 2011_Magniez_SIAM).
Let P be an ergodic, irreducible Markov chain on a space of size with (unique) stationary distribution . Let be the corresponding quantum Markov Chain with phase gap . Then, if is chosen in , for every integer there exist a unitary that acts on qubits, such that:

makes at most calls to the (controlled) and .

.

If and is orthogonal to , then .
By the approximate reflection theorem above, there exists a subroutine , where from the statement of Theorem 3 is taken as , and explicitly controls the fidelity of the reflection. Note that, in the definition of the quantum reflecting agent from the main text, the parameter was chosen in and since by a Theorem of Szegedy 2004_Szegedy_IEEE, as we have commented, it holds that , we have that the fidelity of the approximation reflection approaches unity exponentially quickly in . We note that the parameter should be additionally increased by a logarithmic factor of , in order to compensate for the accumulated error stemming from the iterations of the ARO operator, which, as clarified, we omit in this analysis.
For the explicit construction of the approximate reflection operators, we refer the reader to 2011_Magniez_SIAM.
V.3 The PS model
The PS model is a reinforcement learning agent model, and thus formally fits within the specification provided in Def. 1. Here we recap the standard PS model introduced in 2012_Briegel, but note that the philosophy of projective-simulation-based agents is not confined to the formal setting we provide here, as it is more general. PS agents are defined on a more conceptual level as agents whose internal states represent episodic and compositional memory and whose deliberation comprises association-driven hops between memory sequences, the so-called clips. Nonetheless, the formal definitions we give here allow us to precisely state our main claims. Following the basic definitions, we provide a formal treatment of a slight generalization of the standard PS model which subsumes both the standard and the reflecting agent model that we refer to in the main text and formally treat later in this Supplementary Information.
The PS model comprises the percept and action spaces as given in Def. 1. The central component of PS is the so-called episodic and compositional memory (ECM), which comprises the internal states of the agent. The ECM is a directed weighted network (formally represented as a directed weighted graph), the vertices of which are called clips.
Each clip represents fragments of episodic experiences, which are formally tuples
$$c = \left(c^{(1)}, c^{(2)}, \ldots, c^{(L)}\right), \tag{13}$$
where each $c^{(l)}$ is an internal representation of a percept or an action, so
$$c^{(l)} = \mu(i), \quad i \in \mathcal{S} \cup \mathcal{A}, \tag{14}$$
where $\mu$ is a mapping from real percepts and actions to the internal representations. We will assume that each ECM always contains all the unit-length clips denoting elementary percepts and actions.
Within the ECM, each edge between two clips $c_i$ and $c_j$ is assigned a weight $h(c_i, c_j)$, and the weights are collected in the so-called $h$ matrix. The elementary process of the PS agent is a Markov chain, in which the excitations of the ECM hop from one clip to another, where the transition probabilities are defined by the $h$ matrix:
$$p(c_j \mid c_i) = \frac{h(c_i, c_j)}{\sum_{m} h(c_i, c_m)}, \tag{15}$$
thus the $h$ matrix is just the non-normalized transition matrix. In the standard PS model, the decision function is realized as follows: given a percept $s$, the corresponding clip in the ECM is excited and hopping according to the ECM network is commenced. In the simplest case, the hopping process is terminated once a unit-length action clip is encountered, and this action is coupled out and output by the actuator (see Fig. 1). The moment when an action is coupled out can be defined in a more involved way, as we explain presently.
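The hopping process just described can be sketched as follows. This is a minimal illustration, assuming the weights are stored as a nested dictionary `h[source][target]` and that hopping stops at the first action clip; the names `transition_probabilities` and `deliberate`, as well as the small example network, are hypothetical.

```python
import random


def transition_probabilities(h, clip):
    """Normalize the outgoing weights of `clip` into hopping probabilities."""
    total = sum(h[clip].values())
    return {target: weight / total for target, weight in h[clip].items()}


def deliberate(h, percept_clip, action_clips, rng):
    """Hop through the clip network from the percept clip until an action clip is hit.

    Returns the selected action clip and the full path of visited clips.
    """
    clip = percept_clip
    path = [clip]
    while clip not in action_clips:
        probs = transition_probabilities(h, clip)
        targets = list(probs)
        clip = rng.choices(targets, weights=[probs[t] for t in targets])[0]
        path.append(clip)
    return clip, path


# Small example network (hypothetical): one percept, one intermediate clip, two actions.
h = {
    "s1": {"c1": 1.0, "a1": 1.0},
    "c1": {"a1": 1.0, "a2": 3.0},
}
actions = {"a1", "a2"}

rng = random.Random(0)
action, path = deliberate(h, "s1", actions, rng)
print(transition_probabilities(h, "c1"))
print(action, path)
```

Keeping the unnormalized weights and normalizing only when hopping mirrors the role of the weight matrix as a non-normalized transition matrix.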
Finally, the update rule in the standard model necessarily involves the redefinition of the weights in the $h$ matrix. A prototypical update rule for a fully classical agent, defining an update from external time-step $k$ to $k+1$, depends on whether an action has been rewarded. If the previous action has been rewarded, and the transition between clips $c_i$ and $c_j$ actually occurred in the hopping process, then the update is as follows:
$$h^{(k+1)}(c_i, c_j) = h^{(k)}(c_i, c_j) - \gamma\left(h^{(k)}(c_i, c_j) - 1\right) + \lambda, \tag{16}$$
where $\lambda$ is a positive reward and $\gamma$ is a dissipation (forgetfulness) parameter. If the action had not been rewarded, or the clips $c_i$, $c_j$ had not played a part in the hopping process, then the weights are updated as follows:
$$h^{(k+1)}(c_i, c_j) = h^{(k)}(c_i, c_j) - \gamma\left(h^{(k)}(c_i, c_j) - 1\right). \tag{17}$$
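A possible implementation of such an update is sketched below. It assumes the form commonly used for the standard PS model, in which every weight is damped towards 1 with rate `gamma` and rewarded, traversed edges additionally gain the reward; the function name `update_h`, its signature, and the example values are hypothetical.

```python
def update_h(h, traversed_edges, rewarded, reward=1.0, gamma=0.1):
    """Return the weights after one external time step.

    h               -- nested dict, h[source][target] = weight
    traversed_edges -- set of (source, target) pairs used in the last hopping process
    rewarded        -- whether the last action was rewarded
    """
    new_h = {}
    for source, row in h.items():
        new_h[source] = {}
        for target, weight in row.items():
            # Dissipation: every weight relaxes towards its initial value 1.
            updated = weight - gamma * (weight - 1.0)
            # Reward: only edges that were actually used are strengthened.
            if rewarded and (source, target) in traversed_edges:
                updated += reward
            new_h[source][target] = updated
    return new_h


h = {"s1": {"a1": 2.0, "a2": 1.0}}
print(update_h(h, {("s1", "a1")}, rewarded=True, reward=1.0, gamma=0.5))
print(update_h(h, {("s1", "a1")}, rewarded=False, reward=1.0, gamma=0.5))
```

Returning a new dictionary rather than mutating `h` keeps the weights of consecutive external time steps separate, which is convenient when comparing them.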
The update rule can also be defined such that the update only requires the initial and terminal clip of the hopping process, which is always the case in the simple PS model, where all the clips are just actions or percepts, and hopping always involves a transition from a percept to an action. This example was used in section LABEL:subsect:comparison. For that particular example, the update function can be exactly defined by the rules above. As mentioned, aside from the basic structural and diffusion rules, the PS model allows for additional structures, which we repeat here.
1) Emoticons: the agent's short-term memory, i.e. flags which notify the agent whether the currently found action, given a percept, was previously rewarded or not. For our purposes, we shall use only the very rudimentary mode of flags, which designate that the particular action (given a particular percept) was not already unsuccessfully tried before. If it was, the agent can 'reflect on its decision' and re-evaluate its strategies by restarting the diffusion process. This reflection process is an example of the more complicated out-coupling rule we have mentioned previously.
2) Edge and clip glow: mechanisms which allow for the establishing of additional temporal correlations.
3) Clip composition: the PS model based on episodic and compositional memory allows the creation of new clips under certain variational and compositional principles. These principles allow the agent to develop new behavioral patterns under certain conditions, and allow for a dynamic reconfiguration of the agent itself.
For more details we refer the reader to 2012_Briegel; Julian12; Briegel2_2012.
As illustrated, the PS model allows for great flexibility. A straightforward generalization would allow for the ECM network to be percept-specific, which is the view we adopt in the definition of reflecting agents. However, the same notion can be formalized without introducing multiple networks (one for every percept). In particular, the ECM network allows for action and percept clips to occur multiple times. Thus the ECM network can be represented as disjoint sub-networks, each of which comprises all elementary action clips and only one elementary percept clip. This structure is clearly within the standard PS model, and it captures all the features of the reflecting PS agent model. A simple case of such a network, relative to the standard picture, is illustrated in Fig. LABEL:fig:Comparison, part b). Thus, the reflecting PS agent model is, structurally, a standard PS model as well.
V.4 Behavioral equivalence of classical and quantum reflecting agents
In this section, we show that the classical and quantum reflecting agents (rPS), denoted and are approximately equal, that is where can be made arbitrarily small without incurring a significant overhead in the internal time of the agents.
We do so by separately showing that the output distributions of both the classical and the quantum rPS are close to the previously mentioned tailed distribution, for an arbitrarily small . The main claim will then follow by the triangle inequality on the behavioral distance (which holds since the behavioral distance is the variational distance on the output distributions).
For completeness, we begin by explicitly giving the deliberation process of the classical rPS, described in the main text, and proceed with the behavioral theorem for classical rPS.
The agent's decision-making process (implementing the decision function, see section V.1 of this Supplementary Information for details), given a percept $s$, consists of the following steps.
Let ;

Sample from some fixed distribution .

Repeat:

Diffuse: (re)mix the Markov chain by

Check: Sample from . If is a flagged action, break and output .
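The steps above can be sketched as a simple loop. This is an illustration under stated assumptions, not the authors' implementation: the Markov chain is stored as a nested dictionary with `P[i][j]` the probability of hopping from clip `i` to clip `j`, diffusion is simulated by sampling a single trajectory for a fixed number of steps `t_mix` (which yields the same output statistics as evolving the distribution and then sampling from it), and all names and the example chain are hypothetical.

```python
import random


def sample(dist, rng):
    """Draw one element from a distribution given as {outcome: probability}."""
    items = list(dist)
    return rng.choices(items, weights=[dist[x] for x in items])[0]


def diffuse(P, clip, steps, rng):
    """Run the Markov chain for `steps` hops starting from `clip`."""
    for _ in range(steps):
        clip = sample(P[clip], rng)
    return clip


def reflecting_deliberation(P, init_dist, flagged, t_mix, rng, max_rounds=10_000):
    """Alternate diffusion and checking until a flagged action clip is found."""
    clip = sample(init_dist, rng)                 # sample from a fixed initial distribution
    for _ in range(max_rounds):
        clip = diffuse(P, clip, t_mix, rng)       # Diffuse: (re)mix the Markov chain
        if clip in flagged:                       # Check: is the sampled clip flagged?
            return clip
    raise RuntimeError("no flagged action found within max_rounds")


# Example chain over three action clips (hypothetical values).
P = {
    "a1": {"a1": 0.5, "a2": 0.3, "a3": 0.2},
    "a2": {"a1": 0.3, "a2": 0.5, "a3": 0.2},
    "a3": {"a1": 0.2, "a2": 0.2, "a3": 0.6},
}
init_dist = {"a1": 1.0 / 3.0, "a2": 1.0 / 3.0, "a3": 1.0 / 3.0}
flagged = {"a3"}

rng = random.Random(1)
print(reflecting_deliberation(P, init_dist, flagged, t_mix=20, rng=rng))
```

Each round costs `t_mix` applications of the chain, so the total internal time scales with the mixing time multiplied by the expected number of rounds until a flagged clip is hit.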

In the following, when $p$ and $q$ are distributions, $\|p - q\|$ denotes the standard variational distance (Kolmogorov distance) on distributions, so $\|p - q\| = \frac{1}{2}\sum_{x} |p(x) - q(x)|$.
Theorem 4.
(Behavior of classical reflecting agents)
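Let $P_s$ be the transition matrix of the Markov chain associated to percept $s$, and let $f(s)$ be the (nonempty) set of flagged action clips. Furthermore, let $\pi_s$ be the probability mass function of the stationary distribution of $P_s$ and let $\tilde{\pi}_s$ be the renormalized distribution of $\pi_s$ where the support is retained only over the flagged actions, so:
$$\tilde{\pi}_s(i) = \begin{cases} \dfrac{\pi_s(i)}{\sum_{j \in f(s)} \pi_s(j)} & \text{if } i \in f(s), \\ 0 & \text{otherwise.} \end{cases} \tag{18}$$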
Let $\kappa$ be the probability distribution over the clips as outputted by the classical reflecting agent, upon receiving $s$. Then the distance $\|\kappa - \tilde{\pi}_s\|$ is constant (ignoring logarithmic factors), and can be efficiently made arbitrarily small.
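The renormalized ("tailed") distribution used in the theorem is straightforward to compute. The sketch below assumes the stationary distribution is given as a dictionary over clips and that the flagged set carries nonzero stationary mass; the function name `tailed_distribution` and the example values are hypothetical.

```python
def tailed_distribution(pi, flagged):
    """Restrict `pi` to the flagged clips and renormalize; other clips get weight 0."""
    mass = sum(pi[c] for c in pi if c in flagged)
    if mass <= 0.0:
        raise ValueError("flagged set has zero stationary mass")
    return {c: (pi[c] / mass if c in flagged else 0.0) for c in pi}


pi_s = {"a1": 0.5, "a2": 0.25, "a3": 0.25}
print(tailed_distribution(pi_s, {"a2", "a3"}))
```

The relative weights of the flagged actions are preserved, so an agent whose output follows this distribution keeps the preferences encoded in the stationary distribution while never outputting an unflagged action.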