An empirical evaluation of active inference in multi-armed bandits
Abstract
A key feature of sequential decision making under uncertainty is the need to balance between exploiting (choosing the best action according to current knowledge) and exploring (obtaining information about the values of other actions). The multi-armed bandit problem, a classical task that captures this tradeoff, served as a vehicle in machine learning for developing bandit algorithms that proved to be useful in numerous industrial applications. The active inference framework, an approach to sequential decision making recently developed in neuroscience for understanding human and animal behaviour, is distinguished by its sophisticated strategy for resolving the exploration-exploitation tradeoff. This makes active inference an exciting alternative to already established bandit algorithms. Here we derive an efficient and scalable approximate active inference algorithm and compare it to two state-of-the-art bandit algorithms, the Bayesian upper confidence bound and optimistic Thompson sampling, on two types of bandit problems: a stationary bandit and a dynamic switching bandit. Our empirical evaluation shows that the active inference algorithm does not produce efficient long-term behaviour in stationary bandits. However, in the more challenging switching bandit problem active inference performs substantially better than the two bandit algorithms. The results open exciting avenues for further research in theoretical and applied machine learning, as well as lend additional credibility to active inference as a general framework for studying human and animal behaviour.
keywords:
Decision making, Bayesian inference, Multi-armed bandits, Active inference, Upper confidence bound, Thompson sampling
1 Introduction
When we are repeatedly deciding between alternative courses of action – about whose outcomes we are uncertain – we have to strike a tradeoff between exploration and exploitation. Do we exploit and choose an option that we currently expect to be the best, or do we sample more options with uncertain outcomes in order to learn about them, and potentially find a better option? This tradeoff is one of the fundamental problems of sequential decision making and it has been extensively studied both in the context of neuroscience [wilson2020balancing, mehlhorn2015unpacking, cohen2007should] and machine learning [lattimore2020bandit, chapelle2011empirical, auer2002finite, kaufmann2012bayesian]. Here we propose active inference – an approach to sequential decision making developed recently in neuroscience [kaplan2017navig, friston2017temporal, friston2017process, mirza2016scene, friston2016learning] – as an attractive alternative to established algorithms in machine learning. Although the exploration-exploitation tradeoff has been described and analysed within the active inference framework [fitzgerald2015active, friston2015active, schwartenbeck2013exploration], the focus was on explaining animal and human behaviour rather than on the performance of the algorithm on a given problem. What is lacking for a convincing machine learning application is an evaluation on multi-armed bandit problems [lattimore2020bandit], a set of standard problems that isolate the exploration-exploitation tradeoff, thereby enabling a focus on best possible performance and a comparison to state-of-the-art bandit algorithms from machine learning. Conversely, these analyses will also feed back into neuroscience research, giving rational foundations to active inference explanations of animal and human behaviour.
When investigating human and animal behaviour in stochastic (uncertain) environments, it has become increasingly fruitful to model and describe behaviour based on principles of Bayesian inference [friston2012history, doya2007bayesian, knill2004bayesian], both when describing perception, and decision making and planning [botvinick2012planning]. The approach of describing sequential decision making and planning as probabilistic inference is jointly integrated within active inference [kaplan2017navig, friston2017temporal, friston2017process, mirza2016scene, friston2016learning, schwartenbeck2013exploration], a mathematical framework for solving partially observable Markov decision processes, derived from a general self-organising principle for biological systems – the free energy principle [karl2012free, friston2006free]. Recent work has demonstrated that different types of exploratory behaviour – directed and random exploration – naturally emerge within active inference [schwartenbeck2019computational]. This made active inference a useful approach for modelling how animals and humans resolve the exploration-exploitation tradeoff (REFS), but it also points at its potential usefulness for bandit and reinforcement learning problems in machine learning, where the exploration-exploitation tradeoff plays a prominent role [lattimore2020bandit, sutton2018reinforcement]. Active inference in its initial form was developed for small state spaces and toy problems without consideration for applications to typical machine learning problems. This has recently changed and various scalable solutions have been proposed [ueltzhoffer2018deep, millidge2020deep], in addition to complex sequential policy optimisation that involves sophisticated (deep tree) searches [friston2020sophisticated, fountas2020deep].
Therefore, to make the active inference approach practical and scalable to the high-dimensional bandit problems typically used in machine learning, we introduce here an approximate active inference (AAI) algorithm.
Here we examine how well the exact and approximate active inference algorithms perform in multi-armed bandit problems that are traditionally used as benchmarks in research on the exploration-exploitation tradeoff [lattimore2020bandit]. Although originally formulated for improving medical trials [thompson1933likelihood], multi-armed bandits became an essential tool for studying human learning and decision making early on [bush1953stochastic], and later attracted the attention of statisticians [whittle1980multi, lai1985asymptotically] and machine learning researchers [lattimore2020bandit] for studying the nature of sequential decision making more generally. We consider two types of bandit problems in our empirical evaluation: a stationary bandit, as a classical machine learning problem [lai1985asymptotically, auer2002finite, kaufmann2012bayesian, kaufmann2012thompson], and a switching bandit commonly used in neuroscience [izquierdo2017neural, markovic2019predicting, iglesias2013hierarchical, wilson2012inferring, racey2011pigeon, behrens2007learning]. This makes the presented results directly relevant not only for the machine learning community, but also for learning and decision making studies in neuroscience, which often utilise the active inference framework for a wide range of research questions.
Using these two types of bandit problems we empirically compare the active inference algorithm to two state-of-the-art bandit algorithms from machine learning: a variant of the upper confidence bound (UCB) algorithm [auer2002finite] – the Bayesian UCB algorithm [kaufmann2012bayesian, kaufmann2012thompson] – and a variant of Thompson sampling – optimistic Thompson sampling [lu2017adaptive]. Both types of algorithms keep track of uncertainty about the values of actions, in the form of posterior beliefs about reward probabilities, and leverage these to balance between exploration and exploitation, albeit in different ways. These two algorithms reach state-of-the-art performance on various types of stationary bandit problems [auer2002finite, chapelle2011empirical, kaufmann2012bayesian, kaufmann2012thompson], achieving regret (the difference between actual and optimal performance) that is close to the best possible logarithmic regret [lai1985asymptotically]. In switching bandits, learning is more complex, but once this is properly accounted for, both optimistic Thompson sampling and Bayesian UCB exhibit state-of-the-art performance [cao2018nearly, alami2017memory, russo2018tutorial, lu2017adaptive, roijers2017interactive].
We use a Bayesian approach to the bandit problem, also known as Bayesian bandits [wang1992bayesian], for all algorithms – active inference, Bayesian UCB, and optimistic Thompson sampling. The Bayesian treatment allows us to keep the learning rules equivalent, thus facilitating the comparison of different action selection strategies. In other words, belief updating and learning of the hidden reward probabilities rest exclusively on the learning rules derived from an (approximate) inference scheme, and are independent of the specific action selection principle [lu2017adaptive]. Furthermore, learning algorithms derived from principles of Bayesian inference can be made domain-agnostic and fully adaptive to a wide range of unknown properties of the underlying bandit dynamics, such as the frequency of changes of choice-reward contingencies. Therefore, we use the same inference scheme for all algorithms – variational surprise minimisation learning (SMiLe), an algorithm inspired by recent work in the field of human and animal decision making in changing environments [liakoni2020learning, markovic2016comparative]. The variational SMiLe algorithm corresponds to online Bayesian inference modulated by surprise, which can be expressed in terms of simple delta-like learning rules operating on the sufficient statistics of the posterior beliefs.
In what follows, we will first introduce in detail the two types of bandit problems we focus on: the stationary and the dynamic bandit problem. We first describe each bandit problem formally in an abstract way and then specify the particular instantiation we use in our computational experiments. We will constrain ourselves to a well-studied version of bandits, the so-called Bernoulli bandits, in which choice outcomes are drawn from an arm-specific Bernoulli distribution. Bernoulli bandits, together with Gaussian bandits, are the most commonly studied variants of multi-armed bandits, both in theoretical and applied machine learning [chapelle2011empirical, lu2017adaptive, liu2018change, kaufmann2012thompson] and experimental cognitive neuroscience [wilson2012inferring, steyvers2009bayesian, behrens2007learning]. This is followed by an introduction of the three algorithms: we start with the derivation of the learning rules based on variational SMiLe, and then introduce the different action selection algorithms. Importantly, for active inference we will derive an approximate action selection scheme comparable in form to the well known UCB algorithm. Finally, we empirically evaluate the performance of the different algorithms, and discuss the implications of the results for the fields of machine learning and cognitive neuroscience.
2 The multi-armed bandit problem
The bandit problem is a sequential game between an agent and an environment [lattimore2020bandit]. The game is played over a fixed number of rounds (a horizon), where in each round the agent chooses an action (commonly referred to as a bandit arm). In response to the action, the environment delivers an outcome (e.g. a reward, punishment, or null). The goal of the agent is to develop a policy that allocates choices so as to maximise the cumulative reward over all rounds. Here, we will be concerned with a bandit problem where the agent chooses between multiple arms (actions), a so-called multi-armed bandit (MAB). A well-studied canonical example is the stochastic stationary bandit, where rewards are drawn from arm-specific and fixed (stationary) probability distributions [slivkins2019introduction].
Here, the exploration-exploitation tradeoff stems from the agent's uncertainty about how the environment delivers rewards, and from the fact that the agent observes outcomes only for the chosen arms, that is, it has only incomplete information about the environment. Hence, the agent not only obtains rewards from outcomes, but also learns about the environment by observing the relation between an action and its outcome. Naturally, more information can be obtained from arms that have been tried fewer times, creating a dilemma between obtaining information about the unknown reward probability of an arm, and trying to obtain a reward from a familiar arm. Importantly, in bandit problems there is no need to plan ahead, because the available choices and rewards in the next round are not affected by current choices.
Bandit problems were theoretically developed largely in statistics and machine learning, usually focusing on the canonical stationary bandit problem [lattimore2020bandit, lai1985asymptotically, slivkins2019introduction, auer2002finite, kaufmann2012bayesian, kaufmann2012thompson]. However, they have also played an important role in cognitive neuroscience and psychology, where they have been applied in a wide range of experimental paradigms investigating human learning and decision making rather than optimal performance. Here, dynamic or non-stationary variants have been used more often, as relevant changes in choice-reward contingencies are typically hidden and stochastic in the everyday environments of humans and other animals [schulz2019algorithmic, gottlieb2013information, wilson2012inferring, cohen2007should, behrens2007learning]. We focus on a switching bandit, a particularly popular variant of the dynamic bandit where contingencies change periodically and stay fixed for some time between switches [izquierdo2017neural, markovic2019predicting, iglesias2013hierarchical, wilson2012inferring, racey2011pigeon, behrens2007learning]. The canonical stationary bandit has been influential in cognitive neuroscience and psychology as well [steyvers2009bayesian, stojic2020uncertainty, wilson2014humans, reverdy2014modeling], in particular when combined with side information or context to investigate structure or function learning [acuna2008bayesian, stojic2020s, schulz2018putting, schulz2020finding].
Note that many experimental tasks, even if not explicitly referred to as bandit problems, can in fact be reformulated as equivalent bandit problems. The often used reversal learning task [izquierdo2017neural], for example, corresponds to a dynamic switching two-armed bandit [clark2004neuropsychology], and the popular go/no-go task can be expressed as a four-armed stationary bandit [guitart2011action], as another example. Furthermore, various variants of the well-established multi-stage task [daw2011model] can be mapped to a multi-armed bandit problem, where the choice of an arm corresponds to a specific sequence of choices in the task [dezfouli2012habits].
In summary, we will perform a comparative analysis on two types of bandit problems: stationary stochastic and switching bandit. In this section, we first describe each bandit problem formally in an abstract way and then specify the particular instantiations we use in our computational experiments.
2.1 Stationary stochastic bandit
A stationary stochastic bandit with finitely many arms is defined as follows: in each round $t$ the agent chooses an arm $a_t$ from a finite set of $K$ arms, and the environment then reveals an outcome $o_t$ (e.g. reward or punishment). The stochasticity of the bandit implies that outcomes are i.i.d. random variables drawn from an arm-specific probability distribution. In Bernoulli bandits, these are draws specifically from a Bernoulli distribution for which outcomes are binary, that is, $o_t \in \{0, 1\}$, where each arm $k$ has a reward probability $\theta_k$ that parametrises the Bernoulli distribution. Hence, we can express the observation likelihood as
$p(o_t \mid a_t = k) = \theta_k^{o_t} (1 - \theta_k)^{1 - o_t}$ (1)
where $a_t = k$ denotes the chosen arm on trial $t$. In stationary bandits the reward probabilities of individual arms are fixed for all trials $t$. We use $a^* = \arg\max_k \theta_k$ to denote an optimal arm associated with the maximal expected reward $\theta^* = \max_k \theta_k$.
In our computational experiments we follow a setup that has been used in previous investigations of stationary stochastic bandits [chapelle2011empirical]: We consider the variant of the problem in which all but the best arm have the same reward probability $\theta$. The reward probability of the best arm is set to $\theta^* = \theta + \varepsilon$, where $\varepsilon > 0$. The number of arms $K$ and the mean outcome difference $\varepsilon$ modulate the task difficulty: the more arms and the lower the mean outcome difference, the more difficult the problem. To understand how task difficulty influences the performance of different action selection algorithms, in the experiments we systematically vary $K$ and $\varepsilon$. Note that a large number of arms is a standard setting in machine learning benchmarks, as many industrial applications of multi-armed bandits contain a large number of options [slivkins2019introduction]. In contrast, in experimental cognitive neuroscience one typically considers only a small number of options (e.g. two or three) to reduce the task complexity and, thus, the training time and the experiment duration.
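For concreteness, the stationary setup above can be sketched as a small simulator. This is an illustration rather than code from the paper: the class and parameter names are ours, with `theta` the shared baseline reward probability and `eps` the mean outcome difference.

```python
import random

class StationaryBernoulliBandit:
    """All arms pay out with probability theta, except one best arm
    whose reward probability is theta + eps (fixed over all trials)."""

    def __init__(self, n_arms, theta, eps, rng=None):
        self.rng = rng or random.Random(0)
        self.probs = [theta] * n_arms
        self.best = self.rng.randrange(n_arms)   # position of the optimal arm
        self.probs[self.best] = theta + eps

    def pull(self, arm):
        # binary outcome drawn from the arm-specific Bernoulli distribution
        return 1 if self.rng.random() < self.probs[arm] else 0
```

Pulling `env.best` on every round would achieve the maximal expected reward; a bandit algorithm has to discover that arm from binary outcomes alone.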
2.2 Switching bandit
A switching bandit is a dynamic multi-armed bandit which, like the stationary bandit, is characterised by a set of $K$ arms, where each arm is associated with an i.i.d. random outcome $o_t$ at a given time step $t$. However, in contrast to the stationary bandit problem, outcomes are drawn from a time-dependent Bernoulli probability distribution
$p(o_t \mid a_t = k) = \theta_{t,k}^{o_t} (1 - \theta_{t,k})^{1 - o_t}$ (2)
We use $a^*_t$ to denote the optimal arm associated with the maximal expected reward at trial $t$; hence, $a^*_t = \arg\max_k \theta_{t,k}$.
In the switching bandit [cheung2019hedging, besson2019generalized] the reward probability changes suddenly at a switch but is otherwise constant. Here we use the same reward probability structure as in the stationary bandits, but change the optimal arm with probability $\rho$ as follows
$p(a^*_t \mid a^*_{t-1}, c_t) = (1 - c_t)\, \delta_{a^*_t,\, a^*_{t-1}} + c_t\, \frac{1}{K}$ (3)
where $\delta_{i,j}$ denotes the Kronecker delta, and $c_t$ denotes an auxiliary Bernoulli random variable representing the presence ($c_t = 1$) or absence ($c_t = 0$) of a switch on trial $t$. The optimal arm is always associated with the same reward probability $\theta + \varepsilon$ and the reward probability of all other arms is set to the same value $\theta$. In the experiments with the switching bandit problem we systematically vary $\varepsilon$, $K$, and $\rho$.
In addition, we will consider the possibility that the task difficulty changes over time. Specifically, we will consider a setup in which the mean outcome difference $\varepsilon$ is not fixed, but changes over time. We obtain an effectively non-stationary difficulty by introducing a time evolution of the reward probabilities $\theta_{t,k}$. At each switch point ($c_t = 1$), we generate the reward probabilities anew from a uniform distribution. Hence, the dynamics of the switching bandit with non-stationary difficulty can be expressed with the following transition probabilities
$p(\theta_{t,k} \mid \theta_{t-1,k}, c_t) = (1 - c_t)\, \delta(\theta_{t,k} - \theta_{t-1,k}) + c_t\, \mathcal{B}(\theta_{t,k}; 1, 1)$ (4)
where $\delta(\cdot)$ denotes Dirac's delta function, and $\mathcal{B}(\theta; 1, 1)$ a uniform distribution on the $[0, 1]$ interval, expressed as a special case of a Beta distribution.
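The switching dynamics can be sketched as follows, under the simplifying assumption (ours) that at a switch the optimal arm is resampled uniformly at random; class and parameter names are illustrative:

```python
import random

class SwitchingBernoulliBandit:
    """On every trial, with probability switch_prob, the position of the
    best arm (reward probability theta + eps) is resampled; all other
    arms keep the baseline reward probability theta."""

    def __init__(self, n_arms, theta, eps, switch_prob, rng=None):
        self.rng = rng or random.Random(1)
        self.n_arms, self.theta, self.eps = n_arms, theta, eps
        self.switch_prob = switch_prob
        self.best = self.rng.randrange(n_arms)

    def pull(self, arm):
        if self.rng.random() < self.switch_prob:       # c_t = 1: a switch occurs
            self.best = self.rng.randrange(self.n_arms)
        p = self.theta + (self.eps if arm == self.best else 0.0)
        return 1 if self.rng.random() < p else 0
```

The non-stationary-difficulty variant of Eq. (4) would additionally redraw `theta` for every arm at each switch instead of keeping the fixed two-level structure.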
2.3 Evaluating performance in bandit problems
A standard approach to evaluate the performance of different decision making algorithms in bandit problems is regret analysis [lattimore2020bandit, blum2007learning], and we will therefore use it here as a primary measure. Regret is typically defined as an external measure of performance which computes a cumulative expected loss of an algorithm relative to an oracle which knows the ground truth and always selects the optimal arm $a^*_t$. If we define the cumulative expected reward of an agent that chose arms $a_1, \ldots, a_t$ up to trial $t$ as $V_t = \sum_{\tau=1}^{t} \theta_{\tau, a_\tau}$, then the (external) cumulative regret is defined as
$R_t = \sum_{\tau=1}^{t} \theta_{\tau, a^*_\tau} - V_t = \sum_{\tau=1}^{t} \left( \theta_{\tau, a^*_\tau} - \theta_{\tau, a_\tau} \right)$ (5)
The cumulative regret can also be viewed as a retrospective loss, which an agent playing the bandit game can estimate after it learns which arm was optimal. This definition makes sense for stationary stochastic bandits and in the limit of $t \to \infty$. In practice, the cumulative regret of a specific agent playing the game will be a function of the sequence of observed outcomes $o_{1:t}$, the sequence of chosen arms $a_{1:t}$, and the selection strategy of the given agent.
We additionally introduce a regret rate measure, a time average of the cumulative regret
$r_t = \frac{R_t}{t}$ (6)
In the case of stationary bandits a decision making algorithm is considered consistent if $r_t \to 0$, and asymptotically efficient if its cumulative regret approaches the following lower bound as $t \to \infty$ [lai1985asymptotically]
$\liminf_{t \to \infty} \frac{R_t}{\ln t} \geq \sum_{k :\, \theta_k < \theta^*} \frac{\theta^* - \theta_k}{D_{KL}\left[ p(o \mid \theta_k) \,\|\, p(o \mid \theta^*) \right]}$ (7)
In our case of Bernoulli bandits and the specifically structured reward probabilities (see the 2.1 Stationary stochastic bandit subsection), the Kullback-Leibler divergence between the outcome likelihoods of any suboptimal arm and the arm associated with the highest reward probability becomes
$D_{KL}\left[ p(o \mid \theta) \,\|\, p(o \mid \theta + \varepsilon) \right] = \theta \ln \frac{\theta}{\theta + \varepsilon} + (1 - \theta) \ln \frac{1 - \theta}{1 - \theta - \varepsilon}$ (8)
Hence, the lower bound to the cumulative regret becomes approximately
$R_t \gtrsim \frac{(K - 1)\, \varepsilon}{D_{KL}\left[ p(o \mid \theta) \,\|\, p(o \mid \theta + \varepsilon) \right]}\, \ln t$ (9)
In addition, we can define an upper bound in terms of a random choice algorithm, which selects any arm with the same probability $\frac{1}{K}$ on every trial. In the case of random and uniform action selection the cumulative regret becomes
$R_t = \frac{K - 1}{K}\, \varepsilon\, t$ (10)
Note that the cumulative regret is an external quantity not accessible to an agent, which holds uncertain beliefs about the reward probabilities of different arms. Although in stationary bandits the cumulative regret can reveal how efficient an algorithm is in accumulating reward in the long term, it tells us little about how efficient an algorithm is in reducing regret in the short term. This short-term efficiency is particularly important for dynamic bandits, as an agent has to switch constantly between exploration and exploitation. Therefore, to investigate the short-term efficiency of the algorithms, specifically in the dynamic context, we will analyse the regret rate instead of the commonly used cumulative regret (see [raj2017taming]).
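Both measures are straightforward to compute in simulation, where the ground-truth reward probabilities of the optimal and chosen arms are known; a minimal sketch (function names are ours):

```python
def cumulative_regret(optimal_probs, chosen_probs):
    """External cumulative regret: the summed per-trial gap between the
    reward probability of the optimal arm and of the chosen arm (Eq. 5)."""
    return sum(p_star - p for p_star, p in zip(optimal_probs, chosen_probs))

def regret_rate(optimal_probs, chosen_probs):
    """Time average of the cumulative regret (Eq. 6)."""
    return cumulative_regret(optimal_probs, chosen_probs) / len(chosen_probs)
```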
3 Algorithms
Bandit algorithms can be thought of as consisting of two parts: (i) a learning rule that estimates action values, and (ii) an action-selection strategy that uses the estimates to choose actions and effectively balance between exploration and exploitation. As described in the previous section, for the canonical stationary problem a good bandit algorithm achieves a regret that scales sublinearly with the number of rounds (see Eq. 9). Intuitively, this means that the algorithm should be reducing exploration and allocating more choices over time to arms with high expected value. The relevant question is how to reduce exploration concretely. Naturally, this is a fine balancing act: reducing exploration too quickly would potentially result in false beliefs about the best arm, hence repeatedly choosing suboptimal arms and accumulating regret. In contrast, reducing exploration too slowly would result in wasting too many rounds exploring suboptimal arms and, again, accumulating regret. For comparison with the algorithm based on active inference, we focus on two popular classes of bandit algorithms that are known to hit the right balance: the (Bayesian) upper confidence bound (BUCB) [auer2002finite, kaufmann2012bayesian] and (optimistic) Thompson sampling (OTS) [chapelle2011empirical, thompson1933likelihood, kaufmann2012thompson] algorithms. Note that $\epsilon$-greedy or Softmax action-selection strategies [sutton2018reinforcement], frequently used in reinforcement learning, have fixed exploration, and consequently poor regret performance in bandit problems. There are variants of these strategies where the exploration parameters, $\epsilon$ in $\epsilon$-greedy and the temperature $\tau$ in Softmax, are reduced with specific schedules [auer2002finite]. However, choosing a schedule is based on heuristics and the parameters are difficult to tune. Hence, we did not include these types of strategies in our comparisons.
In what follows we decompose active inference and the bandit algorithms we compare to into two components, the learning rule and the action selection strategy. We derive learning rules from an approximate Bayesian inference scheme and keep the rules fixed across action selection strategies, and modify only the action selection strategy. This setup allows us to have a fair comparison between active inference and the competing bandit algorithms. Finally, we will use the same action selection strategies for both stationary and dynamic bandit problem, and parameterise the learning rules to account for the presence or absence of changes.
3.1 Shared learning rule: variational SMiLe
To derive the belief update equations we start with the hierarchical generative model described here and apply variational inference to obtain approximate learning rules. The obtained belief update equations correspond to the variational surprise minimisation learning (SMiLe) rule [liakoni2020learning, markovic2016comparative]. Importantly, we recover the learning rules for the stationary bandit (see Eq. 21) as a special case when changes are improbable.
We will express the hierarchical generative model of choice outcomes as the following joint distribution
$p(o_{1:t}, \vec{\theta}_{0:t}, c_{1:t}) = \left[ \prod_{\tau=1}^{t} p(o_\tau \mid \vec{\theta}_\tau, a_\tau)\, p(\vec{\theta}_\tau \mid \vec{\theta}_{\tau-1}, c_\tau)\, p(c_\tau) \right] p(\vec{\theta}_0)$ (11)
where the observation likelihood corresponds to the Bernoulli distribution. Hence,
$p(o_t \mid \vec{\theta}_t, a_t = k) = \theta_{t,k}^{o_t} (1 - \theta_{t,k})^{1 - o_t}$ (12)
If no change occurs on a given trial ($c_t = 0$) the reward probabilities remain fixed, $\theta_{t,k} = \theta_{t-1,k}$. Otherwise, if a change occurs ($c_t = 1$), a new value is generated for each arm from some prior distribution $p_0(\theta)$. Formally, we can express this process as
$p(\theta_{t,k} \mid \theta_{t-1,k}, c_t) = (1 - c_t)\, \delta(\theta_{t,k} - \theta_{t-1,k}) + c_t\, p_0(\theta_{t,k})$ (13)
Similarly, the probability that a change in reward probabilities occurs on a given trial is $\rho$, hence we write
$p(c_t) = \rho^{c_t} (1 - \rho)^{1 - c_t}$ (14)
The Bayesian approach requires us to specify a prior. The prior over reward probabilities associated with each arm is given as the product of conjugate priors of the Bernoulli distribution, that is, the Beta distribution
$p(\vec{\theta}_0) = \prod_{k=1}^{K} \mathcal{B}\left( \theta_{0,k};\, \alpha_{0,k}, \beta_{0,k} \right)$ (15)
where we initially set the prior to a uniform distribution, $\alpha_{0,k} = \beta_{0,k} = 1$.
Hence, given Bayes' rule at time step $t$
$p(\vec{\theta}_t, c_t \mid o_{1:t}, a_{1:t}) \propto p(o_t \mid \vec{\theta}_t, a_t)\, p(\vec{\theta}_t, c_t \mid o_{1:t-1}, a_{1:t-1})$ (16)
we can express the exact marginal posterior beliefs over reward probabilities as
(17) 
where $o_{1:t}$ denotes the sequence of observed outcomes, $a_{1:t}$ corresponds to the sequence of chosen arms, and $\bar{c}_t$ corresponds to the marginal posterior probability that a change occurred on trial $t$, which we can express as
(18) 
The exact marginal posterior in Eq. 17 will not belong to the Beta distribution family, making the exact inference analytically intractable. However, constraining the joint posterior to an approximate, fully factorised form
$Q(\vec{\theta}_t, c_t) = Q(c_t) \prod_{k=1}^{K} Q(\theta_{t,k})$ (19)
and applying variational calculus allows us to recover the following variational SMiLe rule [liakoni2020learning]
(20) 
for the parameters of the Beta-distributed factors and the categorically distributed change probability, whose update corresponds to Eq. 18.
Note that for a stationary environment changes are improbable, hence $\rho \to 0$ and consequently $\bar{c}_t \to 0$ for every $t$. This implies that for the stationary bandit we recover the following learning rules
$\alpha_{t,k} = \alpha_{t-1,k} + \delta_{a_t, k}\, o_t, \qquad \beta_{t,k} = \beta_{t-1,k} + \delta_{a_t, k}\, (1 - o_t)$ (21)
which correspond to exact Bayesian inference for the stationary Bernoulli bandit problem.
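Eq. (21) amounts to simple count updates of the Beta parameters. The sketch below implements them, together with an optional interpolation toward the prior weighted by the inferred change probability, in the spirit of the variational SMiLe rule; the exact SMiLe update in Eq. (20) may differ, so treat the `change_prob` branch as our simplification.

```python
def update_beliefs(alpha, beta_, arm, outcome, change_prob=0.0, prior=(1.0, 1.0)):
    """Return updated Beta parameters after observing `outcome` on `arm`.

    change_prob = 0 gives the exact stationary updates of Eq. (21);
    change_prob > 0 first partially resets all arms toward the prior
    (a SMiLe-like forgetting step; a simplification of ours)."""
    a0, b0 = prior
    new_a = [(1 - change_prob) * a + change_prob * a0 for a in alpha]
    new_b = [(1 - change_prob) * b + change_prob * b0 for b in beta_]
    new_a[arm] += outcome          # Kronecker delta: only the chosen arm counts
    new_b[arm] += 1 - outcome
    return new_a, new_b
```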
3.2 Action selection
Active inference
One view on the exploration-exploitation tradeoff is that it can be formulated as an uncertainty-reduction problem [schwartenbeck2019computational], where choices aim to resolve expected and unexpected uncertainty about hidden properties of the environment [soltani2019adaptive]. This leads to casting choice behaviour and planning as a probabilistic inference problem [kaplan2017navig, friston2017temporal, friston2017process, mirza2016scene, friston2016learning], as expressed by active inference. Using this approach, different types of exploitative and exploratory behaviour naturally emerge [schwartenbeck2019computational]. In active inference, decision strategies (behavioural policies) are chosen based on a single optimisation principle: minimising expected surprisal about observed and future outcomes, that is, the expected free energy [smith2021step]. Formally, we express the expected free energy of a choice as
$G(a_t = k) = D_{KL}\left[ Q(o_t \mid a_t = k) \,\middle\|\, P(o_t) \right] + E_{Q(\theta_{t,k})}\left[ H\left[ p(o_t \mid a_t = k, \theta_{t,k}) \right] \right]$ (22)
where $P(o_t)$ denotes prior preferences over outcomes, $H[\cdot]$ denotes the conditional entropy of the observation likelihood $p(o_t \mid a_t = k, \theta_{t,k})$, and $D_{KL}[\cdot \,\|\, \cdot]$ stands for the Kullback-Leibler divergence. Then, a choice is made by selecting the action with the smallest expected free energy
$a_t = \arg\min_k G(a_t = k)$ (23)
where we consider the simplest form of active inference which uses, as the other bandit algorithms do, one-step-ahead beliefs about actions.
Note that in active inference, the most likely action serves dual imperatives, implicit within the expected free energy acting as the loss function (see the different decompositions of Eq. 22): The expected free energy can, on the one hand, be decomposed into ambiguity and risk. On the other hand, it can be understood as a combination of intrinsic and extrinsic value, where the intrinsic value corresponds to the expected information gain, and the extrinsic value to the expected value. The implicit information gain or uncertainty reduction pertains to beliefs about the parameters of the likelihood mapping, which has been construed as novelty [kaplan2018planning, schwartenbeck2013exploration]. Therefore, selecting actions that minimise the expected free energy dissolves the exploration-exploitation tradeoff, by virtue of the fact that every action is valued in terms of both expected value and information gain.
To express the expected free energy $G(a_t = k)$ in terms of beliefs about arm-specific reward probabilities, we will first constrain the prior preference to the following Bernoulli distribution
$P(o_t) = \frac{e^{\lambda o_t}}{1 + e^{\lambda}}$ (24)
In active inference, prior preferences determine whether a particular outcome is attractive or rewarding. Here we assume that agents prefer outcome $o_t = 1$ over outcome $o_t = 0$. Hence, we specify payoffs or rewards with prior preferences over outcomes that have an associated precision $\lambda$, where $\lambda \geq 0$. The precision parameter determines the balance between epistemic and pragmatic imperatives. When prior preferences are very precise, corresponding to a large $\lambda$, the agent becomes risk sensitive and will tend to forgo exploration if the risk (i.e., the divergence between predicted and preferred outcomes; see Eq. 22) is high. Conversely, a low $\lambda$ corresponds to an agent which is less sensitive to risk and will engage in exploratory, epistemic behaviour until it has familiarised itself with the environment (i.e., the hidden reward probabilities in the multi-armed bandit problem).
Given the following expression for the marginal predictive likelihood, obtained as $Q(o_t \mid a_t = k) = \int p(o_t \mid a_t = k, \theta_{t,k})\, Q(\theta_{t,k})\, d\theta_{t,k}$,
$Q(o_t = 1 \mid a_t = k) = \frac{\alpha_{t,k}}{\alpha_{t,k} + \beta_{t,k}} \equiv \mu_{t,k}$ (25)
we get the following expressions for the expected free energy
(26) 
Above we have used the following relation
(27) 
for computing the expected entropy of the observation likelihood.
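The risk plus ambiguity form of the expected free energy can be evaluated numerically for a single arm with Beta beliefs. The sketch below uses generic quadrature rather than the paper's digamma-based expressions, and assumes preferences $P(o = 1) = e^{\lambda} / (1 + e^{\lambda})$; function and variable names are ours.

```python
import math

def expected_free_energy(alpha, beta_, lam, n_grid=400):
    """Risk + ambiguity for one arm under Beta(alpha, beta_) beliefs."""
    mu = alpha / (alpha + beta_)                   # predictive Q(o = 1)
    p1 = math.exp(lam) / (1.0 + math.exp(lam))     # preferred probability of o = 1
    # risk: KL divergence between predicted and preferred outcomes
    risk = mu * math.log(mu / p1) + (1 - mu) * math.log((1 - mu) / (1 - p1))
    # ambiguity: expected Bernoulli entropy, via midpoint quadrature over theta
    log_norm = math.lgamma(alpha + beta_) - math.lgamma(alpha) - math.lgamma(beta_)
    ambiguity = 0.0
    for i in range(n_grid):
        th = (i + 0.5) / n_grid
        log_dens = log_norm + (alpha - 1) * math.log(th) + (beta_ - 1) * math.log(1 - th)
        entropy = -(th * math.log(th) + (1 - th) * math.log(1 - th))
        ambiguity += math.exp(log_dens) * entropy / n_grid
    return risk + ambiguity

def active_inference_choose(alphas, betas, lam):
    # Eq. (23): pick the arm with the smallest expected free energy
    scores = [expected_free_energy(a, b, lam) for a, b in zip(alphas, betas)]
    return min(range(len(scores)), key=scores.__getitem__)
```

With precise preferences, an arm believed to pay off reliably (low risk, low ambiguity) beats an arm whose payoff is uncertain.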
If we approximate the digamma function as $\psi(x) \approx \ln x - \frac{1}{2x}$, and note that in all relevant use cases the Beta parameters remain bounded from below by their initial value of one, then by substituting the approximate digamma expression into Eq. (26) we get the following action selection algorithm
(28) 
Note that a similar exploration bonus – inversely proportional to the number of observations – was proposed in the context of Bayesian reinforcement learning [kolter2009near] when working with Dirichlet prior and posterior distributions.
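To illustrate the flavour of such a bonus (this is not the exact rule of Eq. 28; the weight `c` and the functional form are our assumptions): the score of each arm is its posterior mean plus a term that decays with the number of times the arm has been observed.

```python
def bonus_choose(alpha, beta_, c=1.0):
    """Posterior mean plus a count-based exploration bonus ~ 1/(n_k + 1)."""
    scores = []
    for a, b in zip(alpha, beta_):
        n_obs = a + b - 2.0            # observations so far under Beta(1, 1) priors
        scores.append(a / (a + b) + c / (n_obs + 1.0))
    return max(range(len(scores)), key=scores.__getitem__)
```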
Upper confidence bound
The upper confidence bound (UCB) is a classical action selection strategy for resolving the exploration-exploitation dilemma [auer2002finite], with the action selection defined as
$a_t = \arg\max_k \left[ \bar{\theta}_{t,k} + \sqrt{\frac{2 \ln t}{N_{t,k}}} \right]$ (29)
where $\bar{\theta}_{t,k}$ is the expected reward of the $k$th arm and $N_{t,k}$ the number of times the $k$th arm was selected (see [chapelle2011empirical] for more details).
However, we consider a more recent variant called Bayesian UCB [kaufmann2012bayesian], grounded in Bayesian bandits. In Bayesian UCB the best arm is selected as the one with the highest percentile of the posterior beliefs, where the percentile increases over time as $1 - \frac{1}{t}$. Hence, we can express the action selection rule as
$a_t = \arg\max_k F^{-1}\left( 1 - \frac{1}{t};\; \tilde{\alpha}_{t,k}, \tilde{\beta}_{t,k} \right)$ (30)
where $F(\cdot;\, \alpha, \beta)$ denotes the cumulative distribution function of the Beta-distributed posterior beliefs, and the parameters $(\tilde{\alpha}_{t,k}, \tilde{\beta}_{t,k})$ denote approximate sufficient statistics of the Beta-distributed prior beliefs on trial $t$. Note that the exact prior corresponds to a mixture of two Beta distributions
(31) 
As the inverse of the cumulative distribution function of the above mixture distribution is analytically intractable, we will assume the following approximation
(32) 
Thus, in the case of Beta-distributed prior beliefs, the inverse cumulative distribution function corresponds to the inverse of the regularised incomplete beta function. Hence, we can write
(33) 
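With Beta posteriors, the rule reduces to evaluating one inverse Beta CDF per arm; in practice one would call `scipy.stats.beta.ppf`. The standard-library sketch below (our code) instead obtains the quantile by bisection on a numerically integrated Beta CDF, which is slow but dependency-free.

```python
import math

def beta_cdf(x, a, b, n=2000):
    """Trapezoidal approximation of the Beta(a, b) CDF at x."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = x / n
    total = 0.0
    for i in range(n + 1):
        t = min(max(i * h, 1e-12), 1.0 - 1e-12)
        w = 0.5 if i in (0, n) else 1.0
        total += w * math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))
    return min(total * h, 1.0)

def beta_quantile(p, a, b):
    """Inverse Beta CDF by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, a, b) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bucb_choose(alpha, beta_, t):
    """Bayesian UCB: pick the arm with the largest (1 - 1/t) posterior quantile."""
    q = 1.0 - 1.0 / max(t, 2)
    scores = [beta_quantile(q, a, b) for a, b in zip(alpha, beta_)]
    return max(range(len(scores)), key=scores.__getitem__)
```

An arm with few observations has a wide posterior and hence a large upper quantile, so it keeps being explored until its upper bound drops below the leader's.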
Thompson sampling
Thompson sampling is traditionally associated with Bayesian bandits [kandasamy2018parallelised, chapelle2011empirical, thompson1933likelihood], where the action selection is derived from i.i.d. samples from the posterior beliefs about the reward probabilities. The standard algorithm corresponds to
(34) 
where denotes a single sample from the current beliefs about reward probabilities associated with the th arm.
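A minimal sketch of this sampling rule for Bernoulli bandits with Beta posteriors (variable names illustrative):

```python
import numpy as np

def thompson_choice(alpha, beta, rng):
    """Standard Thompson sampling: draw one sample from each arm's Beta
    posterior and pull the arm with the largest sample."""
    theta = rng.beta(alpha, beta)     # one posterior sample per arm
    return int(np.argmax(theta))
```

Because arms are chosen in proportion to the posterior probability that they are best, exploration arises automatically from posterior uncertainty, without an explicit bonus term.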
An extension of the standard algorithm, proposed in the context of dynamic bandits, is called optimistic Thompson sampling raj2017taming (), defined as
(35) $a_t = \arg\max_k \max\!\left(\tilde{\mu}_{k,t}, \bar{\mu}_{k,t}\right), \qquad \tilde{\mu}_{k,t} \sim p\!\left(\mu_k \,\middle|\, o_{1:t-1}\right)$
where $\bar{\mu}_{k,t}$, the expected reward probability at the current trial $t$, constrains the minimal accepted value of the sample from the prior, hence biasing the sampling towards optimistic, larger values.
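The optimistic variant only changes the sampling step: each draw is floored at its posterior mean. A sketch under the same assumptions as above:

```python
import numpy as np

def optimistic_thompson_choice(alpha, beta, rng):
    """Optimistic Thompson sampling: each posterior sample is floored at the
    posterior mean, so draws are never pessimistic relative to current beliefs."""
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    mean = alpha / (alpha + beta)            # expected reward probability per arm
    theta = rng.beta(alpha, beta)            # ordinary posterior samples
    return int(np.argmax(np.maximum(theta, mean)))
```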
4 Results
In what follows, we first examine the performance of the active inference based agents, AAI (minimising the approximated estimate of the expected free energy) and GAI (minimising the exact expected free energy), in stationary Bernoulli bandits. Using the regret rate as a performance criterion, we analyse how the agents' performance depends on the precision of the prior preferences parameter, and simultaneously verify that our approximation is good enough. After illustrating the effectiveness of AAI (Eq. 28), in comparison to GAI (Eq. 23), we empirically compare only the AAI algorithm – now in terms of the cumulative regret – with agents using the optimistic Thompson sampling (OTS; Eq. 35) and Bayesian upper confidence bound (BUCB; Eq. 30) algorithms, in the same stationary Bernoulli bandit. Finally, we provide an empirical comparison of the three action selection algorithms in the switching Bernoulli bandit.
4.1 The stationary Bernoulli bandit
The precision parameter acts as a balancing parameter between exploitation and exploration (Eq. 28). Hence, it is paramount to understand how it impacts the performance across different difficulty conditions. We expect that there is a precision value for which the active inference algorithm achieves minimal cumulative regret after a fixed number of trials, for each mean outcome difference and each number of arms. When the AI agent has very imprecise preferences, it would engage in exploration for longer, thereby reducing its free energy (i.e., uncertainty about the likelihood mappings) at the expense of accumulating reward. Conversely, an AI agent with very precise preferences would commit to a particular arm as soon as it had inferred that this was the arm with the highest likelihood of payoffs. However, the ensuing 'superstitious' behaviour would prevent it from finding the best arm. To illustrate this, in Fig. 1 we report regret rate averages over an ensemble of simulations, and compare the agents using either the approximate (AAI) or the exact (GAI) expected free energy for action selection. Using the regret rate simplifies the comparison as, unlike cumulative regret, the regret rate stays in the same range of values independent of the trial number.
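To make the two performance measures concrete, here is a generic simulation sketch (not the paper's exact setup) that tracks cumulative regret and regret rate for a stationary Bernoulli bandit with Beta-posterior agents; the function name and interface are our own.

```python
import numpy as np

def run_bandit(choice_fn, probs, n_trials, seed=0):
    """Simulate a stationary Bernoulli bandit and return cumulative regret
    and regret rate per trial. choice_fn(alpha, beta, rng) picks an arm from
    Beta(alpha, beta) posteriors, which are updated after every outcome."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    alpha = np.ones_like(probs)              # uniform Beta(1, 1) priors
    beta = np.ones_like(probs)
    regret = np.empty(n_trials)
    for t in range(n_trials):
        k = choice_fn(alpha, beta, rng)
        o = rng.random() < probs[k]          # Bernoulli outcome for chosen arm
        alpha[k] += o
        beta[k] += 1 - o
        regret[t] = probs.max() - probs[k]   # per-trial expected regret
    cum_regret = np.cumsum(regret)
    regret_rate = cum_regret / np.arange(1, n_trials + 1)
    return cum_regret, regret_rate
```

An efficient agent's cumulative regret grows roughly logarithmically with the trial number, so its regret rate decays towards zero, whereas a persistently exploring or prematurely committed agent shows a regret rate bounded away from zero.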
We find the minimal regret rate – in the asymptotic limit of a large number of trials, see solid lines in Fig. 1 – at approximately the same preference precision for a large range of problem difficulties.
Next we compare and contrast the cumulative regret, as a function of trial number, of the AAI agents with agents based on the optimistic Thompson sampling (OTS) and the Bayesian UCB (BUCB) algorithms (Fig. 2). The dotted lines mark the asymptotic limit (see Eq. 9) for the corresponding problem difficulty. The asymptotic limit scales logarithmically with trial number and defines the long-term behaviour of an asymptotically efficient algorithm. Note that the limit behaviour can be offset by an arbitrary constant to form a lower bound lai1985asymptotically (); chapelle2011empirical (). For convenience we fix the constant to zero, and show the asymptotic curve only as a reference for the long-term behaviour of cumulative regret for the different algorithms.
The comparison reveals that the AAI agent outperforms the bandit algorithms, but only up until some trial that depends on the task difficulty – in the asymptotic limit the regret grows faster than logarithmically with trial number. This is most clearly visible for the more difficult problems with a small mean outcome difference, where AAI outperforms the bandit algorithms only up to a limited number of trials. This divergence is driven by a small percentage of the agents in the ensemble that did not find the accurate solution and are overconfident in their estimate of the arm with the highest reward probability. The divergence is not visible in easier settings with a larger mean outcome difference, as one requires larger ensembles and a larger number of trials to observe suboptimal instances. It might appear surprising that the divergence is evident only for the smallest considered number of arms. However, the reason for this is that the smaller the number of arms, the more chance an agent has to explore each individual arm within a limited number of trials. Hence, the agent will commit faster to a wrong arm and stay with that choice longer. Therefore, we found that our initial expectation about the asymptotic performance of active inference algorithms is only partially correct. Although one could set the preference precision for any task difficulty in a way that active inference initially outperforms the alternative algorithms, in the asymptotic limit the high performance level will not hold. The reason for this can be seen already in Fig. 1, if one notes that maximal performance (minimal regret rate) depends both on preference precision and trial number, for every problem setting.
Although active inference based agents behave poorly in the asymptotic limit, the fact that they achieve higher performance on a short time scale suggests that in dynamic environments – if changes occur sufficiently often – one would get higher performance on average when compared to bandit alternatives.
4.2 The switching bandit problem
In the case of our switching bandit problem, the change probability acts as an additional difficulty parameter, besides the number of arms and the mean outcome difference. Therefore, for the between-algorithm comparison we will first keep the mean outcome difference fixed at its middle value and vary the number of arms in Fig. 3, and then keep the number of arms fixed and vary the expected outcome difference in Fig. 4. For the algorithm comparison in the switching bandit we fix the precision parameter based on a similar analysis as in the stationary bandits (not shown here). Note that although in stationary environments small precision values are desirable to achieve low cumulative regret over a large number of trials, in switching environments larger values are preferable. This can be inferred from Fig. 1, where the regret rate on a short time scale (dotted lines) achieves its minimum at a larger precision value for a range of difficulty levels. This behaviour of the regret on a short time scale suggests that in changing environments larger precision values achieve better performance, as the AAI agent is more biased towards exploitation and spends less time exploring. Numerical simulations in switching environments for a wide range of precision values confirm this (not shown here). For the between-algorithm comparison in switching bandits we will use the regret rate, instead of the cumulative regret, as the reference performance measure. The reason for this is that in dynamic environments cumulative regret increases linearly with trial number, and the regret rate provides a visually more accessible gauge of performance differences raj2017taming ().
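As an illustration of such an environment, the following sketch generates a switching reward schedule; the specific generative scheme (resampling the identity of the best arm with probability `rho`, payoffs 0.5 ± `eps`) is an assumption for illustration and may differ in detail from the paper's environment.

```python
import numpy as np

def switching_probs(n_trials, n_arms, rho, eps, seed=0):
    """Generate per-trial reward probabilities for a switching Bernoulli bandit.

    With probability rho on each trial, the identity of the best arm is
    resampled; the best arm pays 0.5 + eps, all other arms pay 0.5 - eps.
    """
    rng = np.random.default_rng(seed)
    probs = np.full((n_trials, n_arms), 0.5 - eps)
    best = rng.integers(n_arms)              # initial best arm
    for t in range(n_trials):
        if t > 0 and rng.random() < rho:
            best = rng.integers(n_arms)      # contingency switch
        probs[t, best] = 0.5 + eps
    return probs
```

Running an ensemble of agents, each on its own realisation of such a schedule, averages over both outcome and switch trajectories, as described below for Fig. 3.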
In Fig. 3 we illustrate the regret rate for each agent type over the course of the experiment for a range of different values of the change probability and the number of arms, and a fixed mean outcome difference. Importantly, when estimating the mean regret rate over an ensemble of agents, for each agent we simulate a distinct switching schedule with the same change probability. Hence, the average is performed not only over different choice outcome trajectories but also over different hidden trajectories of changes. This ensures that the comparison is based on environmental properties, and not on a specific realisation of the environment. We find better performance for the active inference agents compared to the other bandit algorithms in all conditions. However, we observe that the more difficult the task is (in terms of a higher change probability and a larger number of arms) the less pronounced the performance advantage of the active inference based agents becomes.
In Fig. 4 we show the regret rate for each agent type, now with a fixed number of arms but a varying mean outcome difference. Here, the picture is very similar: for increasing task difficulty the AAI agent type exhibits a diminishing performance advantage relative to the bandit algorithms. Importantly, although we present here the regret analysis only up to a limited number of trials, unlike in the stationary bandit problem, the results do not change after a further increase in the number of trials. When we simulate longer experiments we find a convergent performance for all algorithms towards a nonzero regret rate, implying a linear increase in cumulative regret with trial number.
Finally, we further illustrate the dependence of performance on the mean outcome difference, using the switching bandit with nonstationary task difficulty, where the mean outcome difference is not fixed but changes stochastically over the course of the experiment (see 2.2 Switching bandit for more details). As shown in Fig. 5, we find an increased advantage of the AAI algorithm over the BUCB and OTS algorithms when compared to the fixed difficulty scenario (Fig. 3), specifically for more difficult problems – with either a larger number of arms or a larger change probability. However, the opposite is the case for lowered task difficulty, where BUCB achieves higher performance than the AAI algorithm. Notably, we would expect that for a small number of arms and more slowly changing environments the drop in performance of the AAI agent becomes even more pronounced.
As a final remark, we find it interesting that the BUCB algorithm consistently outperforms the OTS algorithm in almost all nonstationary problems we examined, in contrast to the previous asymptotic analysis in the stationary bandit problem, which concluded that Thompson sampling exhibits better asymptotic scaling than BUCB kaufmann2012thompson (). We are not aware of previous works comparing these two algorithms in the context of the switching bandit problem.
5 Discussion
In this paper we provided an empirical comparison between active inference, a Bayesian information-theoretic framework friston2017process (), and two state-of-the-art machine learning algorithms – Bayesian UCB and optimistic Thompson sampling – in stationary and nonstationary stochastic multiarmed bandits. We introduced an approximate active inference algorithm, for which our checks on the stationary bandit problem showed that its performance closely follows that of the exact version. Hence, we derived an active inference algorithm that is efficient and easily scalable to high-dimensional problems. To our surprise, the empirical algorithm comparison in the stationary bandit problem showed that the active inference algorithm is not asymptotically efficient – the cumulative regret increased faster than logarithmically in the limit of a large number of trials. The cause for this behaviour seems to be the fixed precision of the prior over preferences, which acts as a balancing parameter between exploration and exploitation. An analysis of how the performance depends on this parameter showed that the parameter values that give the best performance decrease over time, suggesting that this parameter should be adaptive and decay over time as the need for exploration decreases. Attempts to remedy the situation with simple and widely used decay schemes (for example, logarithmic in time; not reported here) were not successful. This indicates that the relationship is not a simple one, and a proper theoretical analysis will be needed to identify whether such a scheme exists.
In the nonstationary switching bandit problem the active inference algorithm generally outperformed Bayesian UCB and optimistic Thompson sampling. This provides evidence that the active inference framework may provide a good solution for optimisation problems that require continuous adaptation. Active inference provides the most efficient way of gaining information, and this property of the algorithm pays off in the nonstationary setting. Such dynamic settings are also relevant in neuroscience, as relevant changes in choice-reward contingencies are typically hidden and stochastic in the everyday environments of humans and other animals schulz2019algorithmic (); gottlieb2013information (); wilson2012inferring (); cohen2007should (); behrens2007learning (). In contrast to previous neuroscience research which showed that active inference is a good description of human learning and decision making limanowski2020active (); smith2020imprecise (); cullen2018active (); schwartenbeck2015evidence (); schwartenbeck2015dopaminergic (), our results on the dynamic switching bandit show that active inference also performs well in an objective sense. Such explanations of cognitive mechanisms that are grounded in optimal solutions are arguably more plausible chater1999ten (). Hence, this result lends additional credibility to active inference as a generalised framework for understanding human behaviour, not only in behavioural experiments inspired by multiarmed bandits izquierdo2017neural (); markovic2019predicting (); iglesias2013hierarchical (); wilson2012inferring (); racey2011pigeon (); behrens2007learning (), but in a range of related investigations of human and animal decision making in complex dynamic environments under uncertainty markovic2020meta (); adams2013predictions (); pezzulo2012active ().
An important next step in examining active inference in the context of multiarmed bandits is to establish theoretical bounds on the cumulative regret for the stationary bandit problem. A key part of these theoretical studies will be to investigate whether it is possible to devise a sound decay scheme for the preference precision parameter (see Eq. 28) that provably works for all instances of the canonical stationary bandit. This leaves open the possibility of developing new active inference inspired algorithms which can achieve asymptotic efficiency. These theoretical bounds would allow us to more rigorously compare active inference algorithms to the already established bandit algorithms for which regret bounds are known. Moreover, we would potentially be able to generalise beyond the settings we have empirically tested here. Future work should also consider an information-theoretic analysis of active inference, which might be more appropriate than regret analysis kaufmann2016bayesian (). For example, the Bayesian exploration bonus previously considered in Bayesian reinforcement learning was analysed with respect to the sample complexity of identifying a good policy kolter2009near (). Similarly, in russo2016information () the authors introduced a new measure of regret weighted by the inverse information gain between actions and outcomes, and provided expected bounds for this measure for several Bayesian algorithms, such as Thompson sampling and Bayesian UCB.
As optimal behaviour is always defined with respect to a chosen objective function, a different objective function will lead to different behaviour, and the appropriateness of the objective function for the specific problem determines the performance of the algorithm on a given task. In other words, behaviour is determined not only by the beliefs about the hidden structure of the states of the world but also by the beliefs about the useful preferences and objectives one should take into account in that environment. Therefore, although one can consider the sensitivity of the introduced active inference algorithm to the precision of the prior over preferences as a limitation of the algorithm in comparison to the other two algorithms, we believe that it is possible to introduce various adaptations to the algorithm to improve its asymptotic behaviour. For example, one can consider learning rules for prior outcome preferences, as illustrated in sajid2019active (); markovic2020meta (). This would introduce a way to adapt the objective function to different environments, achieving high performance in a wide range of stationary and dynamic multiarmed bandit problems. Alternatively, instead of basing action selection on the expected free energy, one can define a stochastic counterpart, which is estimated based on samples from the posterior, akin to Thompson sampling. This would enable the algorithm to better leverage directed and random exploration.
In spite of the poor asymptotic performance there are some advantages of active inference over classical bandit algorithms, both for artificial intelligence and neuroscience. Unlike the Thompson sampling and UCB algorithms, active inference is easily extendable to more complex settings where actions affect future states and available actions, which is usually formalised as a (partially observable) Markov decision process, and which requires the combination of adaptive decision making with complex planning mechanisms millidge2020deep (); ueltzhoffer2018deep (); fountas2020deep (). Given our finding that the advantage of the active inference algorithm improves with task difficulty, it would be interesting to apply the framework to a range of more complex Markov decision process problems, comparing it to state-of-the-art reinforcement learning algorithms recently developed in machine learning.
The generative modelling approach integral to active inference allows several improvements to the presented algorithm, which also holds for related Bayesian approaches. For example, we have considered here only one learning algorithm, variational SMiLE liakoni2020learning (), which we have chosen based on its simplicity and efficiency. A potential drawback of variational SMiLE is that it might not be optimal (in terms of inference) for a generic dynamic bandit problem (e.g. different mechanisms for generating changes and different reward distributions). For example, for restless bandits, which follow a random walk process, recently published alternative efficient learning algorithms derived from different generative models are likely to provide better performance piray2020simple (); moens2019learning (). Employing a good learning algorithm is especially important in dynamic settings, where exact inference is not tractable and the performance of learning rules is tightly coupled to the overall performance of the algorithm. In practice, one would expect that the better the generative model and the corresponding approximate inference algorithm, the better the performance will be on a given multiarmed bandit problem. Furthermore, one can easily extend the learning algorithms with deep hierarchical variants, which can infer a wide range of unknown dynamical properties of the environment piray2020simple () and learn higher order temporal statistics markovic2019predicting (); markovic2020meta ().
6 Conclusion
We have derived an approximate active inference algorithm, based on a Bayesian information-theoretic framework recently developed in neuroscience, proposing it as a novel machine learning algorithm for bandit problems that can compete with state-of-the-art bandit algorithms. Our empirical evaluation has shown that active inference is indeed a promising bandit algorithm. This work is only a first step, however; important next steps that will provide further evidence on the viability of active inference as a bandit algorithm are the development of a decay schedule for the outcome preference precision parameter and a theoretical regret analysis in the stationary bandit. The fact that the active inference algorithm achieves excellent performance in switching bandit problems, commonly used in cognitive neuroscience, provides rational grounds for using active inference as a generalised framework for understanding human and animal learning and decision making.
7 Acknowledgements
We thank Karl Friston and Gergely Neu for valuable feedback and constructive discussions.
Footnotes
 These authors contributed equally
 These authors contributed equally
 Note that this type of dependence between current and future choice sets, or rewards, would convert the bandit problem into a reinforcement learning problem. It makes the exploration-exploitation tradeoff more complex, and optimal solutions cannot be derived beyond trivial problems.
 In usual applications of active inference for understanding human behaviour, rather than minimising the expected free energy one would sample actions from posterior beliefs about actions (cf. planning as inference botvinick2012planning (); attias2003planning ()). This becomes useful when fitting empirical choice behaviour in behavioural experiments markovic2019predicting (); schw2016 ().
 For the hardest considered setting, the minimum is sharp and occurs at a single value of the preference precision.
References
 R. C. Wilson, E. Bonawitz, V. D. Costa, R. B. Ebitz, Balancing exploration and exploitation with information and randomization, Current Opinion in Behavioral Sciences 38 (2020) 49–56.
 K. Mehlhorn, B. R. Newell, P. M. Todd, M. D. Lee, K. Morgan, V. A. Braithwaite, D. Hausmann, K. Fiedler, C. Gonzalez, Unpacking the exploration–exploitation tradeoff: A synthesis of human and animal literatures., Decision 2 (3) (2015) 191.
 J. D. Cohen, S. M. McClure, A. J. Yu, Should i stay or should i go? how the human brain manages the tradeoff between exploitation and exploration, Philosophical Transactions of the Royal Society B: Biological Sciences 362 (1481) (2007) 933–942.
 T. Lattimore, C. Szepesvári, Bandit algorithms, Cambridge University Press, 2020.
 O. Chapelle, L. Li, An empirical evaluation of thompson sampling, in: Advances in neural information processing systems, 2011, pp. 2249–2257.
 P. Auer, N. Cesa-Bianchi, P. Fischer, Finite-time analysis of the multiarmed bandit problem, Machine learning 47 (2-3) (2002) 235–256.
 E. Kaufmann, O. Cappé, A. Garivier, On bayesian upper confidence bounds for bandit problems, in: Artificial intelligence and statistics, 2012, pp. 592–600.

 R. Kaplan, K. Friston, Planning and navigation as active inference, bioRxiv (2017). doi:10.1101/230599. URL https://www.biorxiv.org/content/early/2017/12/07/230599
 K. J. Friston, R. Rosch, T. Parr, C. Price, H. Bowman, Deep temporal models and active inference, Neuroscience & Biobehavioral Reviews 77 (Supplement C) (2017) 388–402. doi:10.1016/j.neubiorev.2017.04.009. URL http://www.sciencedirect.com/science/article/pii/S0149763416307096
 K. Friston, T. FitzGerald, F. Rigoli, P. Schwartenbeck, G. Pezzulo, Active inference: A process theory, Neural Computation 29 (1) (2017) 1–49, PMID: 27870614. doi:10.1162/NECO_a_00912.
 M. B. Mirza, R. A. Adams, C. D. Mathys, K. J. Friston, Scene construction, visual foraging, and active inference, Frontiers in computational neuroscience 10 (2016).

 K. Friston, T. FitzGerald, F. Rigoli, P. Schwartenbeck, J. O’Doherty, G. Pezzulo, Active inference and learning, Neuroscience & Biobehavioral Reviews 68 (Supplement C) (2016) 862–879. doi:10.1016/j.neubiorev.2016.06.022. URL http://www.sciencedirect.com/science/article/pii/S0149763416301336
 T. H. FitzGerald, P. Schwartenbeck, M. Moutoussis, R. J. Dolan, K. Friston, Active inference, evidence accumulation, and the urn task, Neural computation 27 (2) (2015) 306–328.
 K. Friston, F. Rigoli, D. Ognibene, C. Mathys, T. Fitzgerald, G. Pezzulo, Active inference and epistemic value, Cognitive neuroscience 6 (4) (2015) 187–214.
 P. Schwartenbeck, T. FitzGerald, R. Dolan, K. Friston, Exploration, novelty, surprise, and free energy minimization, Frontiers in psychology 4 (2013) 710.
 K. Friston, The history of the future of the bayesian brain, NeuroImage 62 (2) (2012) 1230–1233.
 K. Doya, S. Ishii, A. Pouget, R. P. Rao, Bayesian brain: Probabilistic approaches to neural coding, MIT press, 2007.
 D. C. Knill, A. Pouget, The bayesian brain: the role of uncertainty in neural coding and computation, TRENDS in Neurosciences 27 (12) (2004) 712–719.
 M. Botvinick, M. Toussaint, Planning as inference, Trends in cognitive sciences 16 (10) (2012) 485–488.
 F. Karl, A free energy principle for biological systems, Entropy 14 (11) (2012) 2100–2121.
 K. Friston, J. Kilner, L. Harrison, A free energy principle for the brain, Journal of Physiology-Paris 100 (1-3) (2006) 70–87.
 P. Schwartenbeck, J. Passecker, T. U. Hauser, T. H. FitzGerald, M. Kronbichler, K. J. Friston, Computational mechanisms of curiosity and goaldirected exploration, Elife 8 (2019) e41703.
 R. S. Sutton, A. G. Barto, Reinforcement learning: An introduction, MIT press, 2018.
 K. Ueltzhöffer, Deep active inference, Biological cybernetics 112 (6) (2018) 547–573.
 B. Millidge, Deep active inference as variational policy gradients, Journal of Mathematical Psychology 96 (2020) 102348.
 K. Friston, L. Da Costa, D. Hafner, C. Hesp, T. Parr, Sophisticated inference, arXiv preprint arXiv:2006.04120 (2020).
 Z. Fountas, N. Sajid, P. A. Mediano, K. Friston, Deep active inference agents using montecarlo methods, arXiv preprint arXiv:2006.04176 (2020).
 W. R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika 25 (3/4) (1933) 285–294.
 R. R. Bush, F. Mosteller, A stochastic model with applications to learning, The Annals of Mathematical Statistics (1953) 559–585.
 P. Whittle, Multiarmed bandits and the gittins index, Journal of the Royal Statistical Society: Series B (Methodological) 42 (2) (1980) 143–149.
 T. L. Lai, H. Robbins, Asymptotically efficient adaptive allocation rules, Advances in applied mathematics 6 (1) (1985) 4–22.
 E. Kaufmann, N. Korda, R. Munos, Thompson sampling: An asymptotically optimal finitetime analysis, in: International conference on algorithmic learning theory, Springer, 2012, pp. 199–213.
 A. Izquierdo, J. L. Brigman, A. K. Radke, P. H. Rudebeck, A. Holmes, The neural basis of reversal learning: an updated perspective, Neuroscience 345 (2017) 12–26.
 D. Marković, A. M. Reiter, S. J. Kiebel, Predicting change: Approximate inference under explicit representation of temporal structure in changing environments, PLoS computational biology 15 (1) (2019) e1006707.
 S. Iglesias, C. Mathys, K. H. Brodersen, L. Kasper, M. Piccirelli, H. E. den Ouden, K. E. Stephan, Hierarchical prediction errors in midbrain and basal forebrain during sensory learning, Neuron 80 (2) (2013) 519–530.
 R. C. Wilson, Y. Niv, Inferring relevance in a changing world, Frontiers in human neuroscience 5 (2012) 189.
 D. Racey, M. E. Young, D. Garlick, J. N.M. Pham, A. P. Blaisdell, Pigeon and human performance in a multiarmed bandit task in response to changes in variable interval schedules, Learning & behavior 39 (3) (2011) 245–258.
 T. E. Behrens, M. W. Woolrich, M. E. Walton, M. F. Rushworth, Learning the value of information in an uncertain world, Nature neuroscience 10 (9) (2007) 1214–1221.
 X. Lu, N. Adams, N. Kantas, On adaptive estimation for dynamic bernoulli bandits, arXiv preprint arXiv:1712.03134 (2017).
 Y. Cao, W. Zheng, B. Kveton, Y. Xie, Nearly optimal adaptive procedure for piecewisestationary bandit: a changepoint detection approach, arXiv preprint arXiv:1802.03692 (2018).
 R. Alami, O. Maillard, R. Féraud, Memory bandits: a bayesian approach for the switching bandit problem, 2017.
 D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, Z. Wen, et al., A tutorial on thompson sampling, Foundations and Trends® in Machine Learning 11 (1) (2018) 1–96.
 D. M. Roijers, L. M. Zintgraf, A. Nowé, Interactive thompson sampling for multiobjective multiarmed bandits, in: International Conference on Algorithmic Decision Theory, Springer, 2017, pp. 18–34.
 Y. Wang, J. Gittins, Bayesian bandits in clinical trials: Clinical trials, Sequential analysis 11 (4) (1992) 313–325.
 V. Liakoni, A. Modirshanechi, W. Gerstner, J. Brea, Learning in volatile environments with the bayes factor surprise (2020). arXiv:1907.02936.
 D. Marković, S. J. Kiebel, Comparative analysis of behavioral models for adaptive learning in changing environments, Frontiers in Computational Neuroscience 10 (2016) 33.
 F. Liu, J. Lee, N. Shroff, A change-detection based framework for piecewise-stationary multiarmed bandit problem, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 M. Steyvers, M. D. Lee, E.J. Wagenmakers, A bayesian analysis of human decisionmaking on bandit problems, Journal of Mathematical Psychology 53 (3) (2009) 168–179.
 A. Slivkins, et al., Introduction to multiarmed bandits, Foundations and Trends® in Machine Learning 12 (1-2) (2019) 1–286.
 D. I. Mattos, J. Bosch, H. H. Olsson, Multiarmed bandits in the wild: pitfalls and strategies in online experiments, Information and Software Technology 113 (2019) 68–81.
 E. Schulz, S. J. Gershman, The algorithmic architecture of exploration in the human brain, Current opinion in neurobiology 55 (2019) 7–14.
 J. Gottlieb, P.Y. Oudeyer, M. Lopes, A. Baranes, Informationseeking, curiosity, and attention: computational and neural mechanisms, Trends in cognitive sciences 17 (11) (2013) 585–593.
 H. Stojić, J. L. Orquin, P. Dayan, R. J. Dolan, M. Speekenbrink, Uncertainty in learning, choice, and visual fixation, Proceedings of the National Academy of Sciences 117 (6) (2020) 3291–3300.
 R. C. Wilson, A. Geana, J. M. White, E. A. Ludvig, J. D. Cohen, Humans use directed and random exploration to solve the explore–exploit dilemma., Journal of Experimental Psychology: General 143 (6) (2014) 2074.
 P. B. Reverdy, V. Srivastava, N. E. Leonard, Modeling human decision making in generalized gaussian multiarmed bandits, Proceedings of the IEEE 102 (4) (2014) 544–571.
 D. Acuna, P. Schrater, Bayesian modeling of human sequential decisionmaking on the multiarmed bandit problem, in: Proceedings of the 30th annual conference of the cognitive science society, Vol. 100, Washington, DC: Cognitive Science Society, 2008, pp. 200–300.
 H. Stojić, E. Schulz, P. P Analytis, M. Speekenbrink, It's new, but is it good? How generalization and uncertainty guide the exploration of novel options., Journal of Experimental Psychology: General (2020).
 E. Schulz, E. Konstantinidis, M. Speekenbrink, Putting bandits into context: How function learning supports decision making., Journal of Experimental Psychology: Learning, Memory, and Cognition 44 (6) (2018) 927.
 E. Schulz, N. T. Franklin, S. J. Gershman, Finding structure in multiarmed bandits, Cognitive Psychology 119 (2020) 101261.
 L. Clark, R. Cools, T. Robbins, The neuropsychology of ventral prefrontal cortex: decisionmaking and reversal learning, Brain and cognition 55 (1) (2004) 41–53.
 M. GuitartMasip, L. Fuentemilla, D. R. Bach, Q. J. Huys, P. Dayan, R. J. Dolan, E. Duzel, Action dominates valence in anticipatory representations in the human striatum and dopaminergic midbrain, Journal of Neuroscience 31 (21) (2011) 7867–7875.
 N. D. Daw, S. J. Gershman, B. Seymour, P. Dayan, R. J. Dolan, Modelbased influences on humans’ choices and striatal prediction errors, Neuron 69 (6) (2011) 1204–1215.
 A. Dezfouli, B. W. Balleine, Habits, action sequences and reinforcement learning, European Journal of Neuroscience 35 (7) (2012) 1036–1051.
 W. C. Cheung, D. SimchiLevi, R. Zhu, Hedging the drift: Learning to optimize under nonstationarity, arXiv preprint arXiv:1903.01461 (2019).
 L. Besson, E. Kaufmann, The generalized likelihood ratio test meets klucb: an improved algorithm for piecewise nonstationary bandits, arXiv preprint arXiv:1902.01575 (2019).

 A. Blum, Y. Monsour, Learning, regret minimization, and equilibria (2007). doi:10.1184/R1/6606935.v1. URL https://kilthub.cmu.edu/articles/journal_contribution/Learning_Regret_minimization_and_Equilibria/6606935/1
 V. Raj, S. Kalyani, Taming nonstationary bandits: A bayesian approach, arXiv preprint arXiv:1707.09727 (2017).
 A. Soltani, A. Izquierdo, Adaptive learning under expected and unexpected uncertainty, Nature Reviews Neuroscience 20 (10) (2019) 635–644.
 R. Smith, K. Friston, C. Whyte, A stepbystep tutorial on active inference and its application to empirical data (2021).
 H. Attias, Planning by probabilistic inference., in: AISTATS, Citeseer, 2003.
 P. Schwartenbeck, K. Friston, Computational phenotyping in psychiatry: a worked example, ENeuro 3 (4) (2016).
 R. Kaplan, K. J. Friston, Planning and navigation as active inference, Biological cybernetics 112 (4) (2018) 323–343.
 J. Z. Kolter, A. Y. Ng, Nearbayesian exploration in polynomial time, in: Proceedings of the 26th annual international conference on machine learning, 2009, pp. 513–520.
 K. Kandasamy, A. Krishnamurthy, J. Schneider, B. Póczos, Parallelised bayesian optimisation via thompson sampling, in: International Conference on Artificial Intelligence and Statistics, 2018, pp. 133–142.
 J. Limanowski, K. Friston, Active inference under visuoproprioceptive conflict: Simulation and empirical results, Scientific reports 10 (1) (2020) 1–14.
 R. Smith, P. Schwartenbeck, J. L. Stewart, R. Kuplicki, H. Ekhtiari, M. Paulus, the Tulsa 1000 Investigators, et al., Imprecise action selection in substance use disorder: Evidence for active learning impairments when solving the explore-exploit dilemma (2020).
 M. Cullen, B. Davey, K. J. Friston, R. J. Moran, Active inference in openai gym: a paradigm for computational investigations into psychiatric illness, Biological psychiatry: cognitive neuroscience and neuroimaging 3 (9) (2018) 809–818.
 P. Schwartenbeck, T. H. FitzGerald, C. Mathys, R. Dolan, M. Kronbichler, K. Friston, Evidence for surprise minimization over value maximization in choice behavior, Scientific reports 5 (2015) 16575.
 P. Schwartenbeck, T. H. FitzGerald, C. Mathys, R. Dolan, K. Friston, The dopaminergic midbrain encodes the expected certainty about desired outcomes, Cerebral cortex 25 (10) (2015) 3434–3445.
 N. Chater, M. Oaksford, Ten years of the rational analysis of cognition, Trends in Cognitive Sciences 3 (2) (1999) 57–65.
 D. Marković, T. Goschke, S. J. Kiebel, Metacontrol of the explorationexploitation dilemma emerges from probabilistic inference over a hierarchy of time scales, Cognitive, Affective, & Behavioral Neuroscience (2020) 1–25.
 R. A. Adams, S. Shipp, K. J. Friston, Predictions not commands: active inference in the motor system, Brain Structure and Function 218 (3) (2013) 611–643.
 G. Pezzulo, An active inference view of cognitive control, Frontiers in psychology 3 (2012) 478.
 E. Kaufmann, On bayesian index policies for sequential resource allocation, arXiv preprint arXiv:1601.01190 (2016).
 D. Russo, B. Van Roy, An informationtheoretic analysis of thompson sampling, The Journal of Machine Learning Research 17 (1) (2016) 2442–2471.
 N. Sajid, P. J. Ball, K. J. Friston, Active inference: demystified and compared, arXiv preprint arXiv:1909.10863 (2019).
 P. Piray, N. D. Daw, A simple model for learning in volatile environments, PLOS Computational Biology 16 (7) (2020) 1–26. doi:10.1371/journal.pcbi.1007963. URL https://doi.org/10.1371/journal.pcbi.1007963
 V. Moens, A. Zénon, Learning and forgetting using reinforced bayesian change detection, PLoS computational biology 15 (4) (2019) e1006713.