An empirical evaluation of active inference in multi-armed bandits

Abstract

A key feature of sequential decision making under uncertainty is the need to balance exploiting (choosing the best action according to current knowledge) with exploring (obtaining information about the values of other actions). The multi-armed bandit problem, a classical task that captures this trade-off, has served in machine learning as a vehicle for developing bandit algorithms that have proved useful in numerous industrial applications. The active inference framework, an approach to sequential decision making recently developed in neuroscience for understanding human and animal behaviour, is distinguished by its sophisticated strategy for resolving the exploration-exploitation trade-off. This makes active inference an exciting alternative to already established bandit algorithms. Here we derive an efficient and scalable approximate active inference algorithm and compare it to two state-of-the-art bandit algorithms, Bayesian upper confidence bound and optimistic Thompson sampling, on two types of bandit problems: a stationary and a dynamic switching bandit. Our empirical evaluation shows that the active inference algorithm does not produce efficient long-term behaviour in stationary bandits. However, in the more challenging switching bandit problem, active inference performs substantially better than the two bandit algorithms. These results open exciting avenues for further research in theoretical and applied machine learning, and lend additional credibility to active inference as a general framework for studying human and animal behaviour.

keywords:
Decision making, Bayesian inference, Multi-armed bandits, Active Inference, Upper confidence bound, Thompson sampling

1 Introduction

When we are repeatedly deciding between alternative courses of action – about whose outcomes we are uncertain – we have to strike a trade-off between exploration and exploitation. Do we exploit and choose an option that we currently expect to be the best, or do we sample more options with uncertain outcomes in order to learn about them, and potentially find a better option? This trade-off is one of the fundamental problems of sequential decision making and it has been extensively studied both in the context of neuroscience wilson2020balancing (); mehlhorn2015unpacking (); cohen2007should () as well as machine learning lattimore2020bandit (); chapelle2011empirical (); auer2002finite (); kaufmann2012bayesian (). Here we propose active inference – an approach to sequential decision making developed recently in neuroscience kaplan2017navig (); friston2017temporal (); friston2017process (); mirza2016scene (); friston2016learning () – as an attractive alternative to established algorithms in machine learning. Although the exploration-exploitation trade-off has been described and analysed within the active inference framework fitzgerald2015active (); friston2015active (); schwartenbeck2013exploration (), the focus was on explaining animal and human behaviour rather than the algorithm performance on a given problem. What is lacking for a convincing machine learning application is the evaluation on multi-armed bandit problems lattimore2020bandit (), a set of standard problems that isolate the exploration-exploitation trade-off, thereby enabling a focus on best possible performance and the comparison to state-of-the-art bandit algorithms from machine learning. Conversely, these analyses will also feed back into neuroscience research, giving rational foundations to active inference explanations of animal and human behaviour.

When investigating human and animal behaviour in stochastic (uncertain) environments, it has become increasingly fruitful to model and describe behaviour based on principles of Bayesian inference friston2012history (); doya2007bayesian (); knill2004bayesian (), both when describing perception, and decision making and planning botvinick2012planning (). The approach to describing sequential decision making and planning as probabilistic inference is jointly integrated within active inference kaplan2017navig (); friston2017temporal (); friston2017process (); mirza2016scene (); friston2016learning (); schwartenbeck2013exploration (), a mathematical framework for solving partially observable Markov decision processes, derived from the general self-organising principle for biological systems – the free energy principle karl2012free (); friston2006free (). Recent work has demonstrated that different types of exploratory behaviour – directed and random exploration – naturally emerge within active inference schwartenbeck2019computational (). This made active inference a useful approach for modelling how animals and humans resolve the exploration-exploitation trade-off (REFS), but also points at its potential usefulness for bandit and reinforcement learning problems in machine learning where the exploration-exploitation trade-off plays a prominent role lattimore2020bandit (); sutton2018reinforcement (). Active inference in its initial form was developed for small state spaces and toy problems without consideration for applications to typical machine learning problems. This has recently changed and various scalable solutions have been proposed ueltzhoffer2018deep (); millidge2020deep (), in addition to complex sequential policy optimisation that involves sophisticated (deep tree) searches friston2020sophisticated (); fountas2020deep (). 
Therefore, to make the active inference approach practical and scalable to the high-dimensional bandit problems typically used in machine learning, we introduce here an approximate active inference (A-AI) algorithm.

Here we examine how well the exact and A-AI algorithms perform in multi-armed bandit problems that are traditionally used as benchmarks in research on the exploration-exploitation trade-off lattimore2020bandit (). Although originally formulated for improving medical trials thompson1933likelihood (), multi-armed bandits became an essential tool for studying human learning and decision making early on bush1953stochastic (), and later attracted the attention of statisticians whittle1980multi (); lai1985asymptotically () and machine learning researchers lattimore2020bandit () for studying the nature of sequential decision making more generally. We consider two types of bandit problems in our empirical evaluation: a stationary bandit, a classical machine learning problem lai1985asymptotically (); auer2002finite (); kaufmann2012bayesian (); kaufmann2012thompson (), and a switching bandit commonly used in neuroscience izquierdo2017neural (); markovic2019predicting (); iglesias2013hierarchical (); wilson2012inferring (); racey2011pigeon (); behrens2007learning (). This makes the presented results directly relevant not only for the machine learning community, but also for learning and decision making studies in neuroscience, which often utilise the active inference framework for a wide range of research questions.

Using these two types of bandit problems we empirically compare the active inference algorithm to two state-of-the-art bandit algorithms from machine learning: a variant of the upper confidence bound (UCB) auer2002finite () algorithm – the Bayesian UCB algorithm kaufmann2012bayesian (); kaufmann2012thompson () – and a variant of Thompson sampling – optimistic Thompson sampling lu2017adaptive (). Both types of algorithms keep track of uncertainty about the values of actions, in the form of posterior beliefs about reward probabilities, and leverage these to balance between exploration and exploitation, albeit in different ways. These two algorithms reach state-of-the-art performance on various types of stationary bandit problems auer2002finite (); chapelle2011empirical (); kaufmann2012bayesian (); kaufmann2012thompson (), achieving regret (the difference between actual and optimal performance) that is close to the best possible logarithmic regret lai1985asymptotically (). In switching bandits, learning is more complex, but once this is properly accounted for, both optimistic Thompson sampling and Bayesian UCB exhibit state-of-the-art performance cao2018nearly (); alami2017memory (); russo2018tutorial (); lu2017adaptive (); roijers2017interactive ().

We use a Bayesian approach to the bandit problem, also known as Bayesian bandits wang1992bayesian (), for all algorithms – active inference, Bayesian UCB and optimistic Thompson sampling. The Bayesian treatment allows us to keep the learning rules equivalent, thus facilitating the comparison of different action selection strategies. In other words, belief updating and learning of the hidden reward probabilities rest exclusively on the learning rules derived from an (approximate) inference scheme, and are independent of the specific action selection principle lu2017adaptive (). Furthermore, learning algorithms derived from principles of Bayesian inference can be made domain-agnostic and fully adaptive to a wide range of unknown properties of the underlying bandit dynamics, such as the frequency of changes of choice-reward contingencies. Therefore, we use the same inference scheme for all algorithms – variational surprise minimisation learning (SMiLe), an algorithm inspired by recent work in the field of human and animal decision making in changing environments liakoni2020learning (); markovic2016comparative (). The variational SMiLe algorithm corresponds to online Bayesian inference modulated by surprise, which can be expressed in terms of simple delta-like learning rules operating on the sufficient statistics of posterior beliefs.

In what follows, we will first introduce in detail the two types of bandit problems we focus on: the stationary and the dynamic bandit problem. We first describe each bandit problem formally in an abstract way and then specify the particular instantiation we use in our computational experiments. We will constrain ourselves to a well-studied version of bandits, the so-called Bernoulli bandits. For Bernoulli bandits, choice outcomes are drawn from an arm-specific Bernoulli distribution. Bernoulli bandits, together with Gaussian bandits, are the most commonly studied variants of multi-armed bandits, both in theoretical and applied machine learning chapelle2011empirical (); lu2017adaptive (); liu2018change (); kaufmann2012thompson () and experimental cognitive neuroscience wilson2012inferring (); steyvers2009bayesian (); behrens2007learning (). This is followed by an introduction of the three algorithms: we start with the derivation of the learning rules based on variational SMiLe, and then introduce the different action selection algorithms. Importantly, for active inference we will derive an approximate action selection scheme comparable in form to the well known UCB algorithm. Finally, we empirically evaluate the performance of the different algorithms, and discuss the implications of the results for the fields of machine learning and cognitive neuroscience.

2 The multi-armed bandit problem

The bandit problem is a sequential game between an agent and an environment lattimore2020bandit (). The game is played in a fixed number of rounds (a horizon), where in each round the agent chooses an action (commonly referred to as a bandit arm). In response to the action, the environment delivers an outcome (e.g. a reward, punishment, or null). The goal of the agent is to develop a policy that allocates choices so as to maximise cumulative reward over all rounds. Here, we will be concerned with a bandit problem where the agent chooses between multiple arms (actions), a so-called multi-armed bandit (MAB). A well-studied canonical example is the stochastic stationary bandit, where rewards are drawn from arm-specific and fixed (stationary) probability distributions slivkins2019introduction ().
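The round-based game described above can be sketched as a simple interaction loop. The class and function names below are illustrative rather than part of any established library, and the uniform random agent is only a placeholder for the algorithms introduced later:

```python
import random

class StationaryBernoulliEnv:
    """K-armed Bernoulli bandit with fixed reward probabilities."""
    def __init__(self, thetas, seed=0):
        self.thetas = thetas
        self.rng = random.Random(seed)

    def pull(self, arm):
        # binary outcome drawn from the arm-specific Bernoulli distribution
        return 1 if self.rng.random() < self.thetas[arm] else 0

class UniformRandomAgent:
    """Baseline agent: picks every arm uniformly at random."""
    def __init__(self, n_arms, seed=0):
        self.n_arms = n_arms
        self.rng = random.Random(seed)

    def choose(self):
        return self.rng.randrange(self.n_arms)

    def update(self, arm, outcome):
        pass  # a learning agent would update its beliefs here

def play_bandit(agent, env, horizon):
    """One bandit game: choose an arm, observe the outcome, update."""
    rewards = []
    for _ in range(horizon):
        arm = agent.choose()
        outcome = env.pull(arm)
        agent.update(arm, outcome)
        rewards.append(outcome)
    return rewards
```

Any of the algorithms discussed below can be plugged into this loop by implementing the same `choose`/`update` interface.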

Bandit problems were theoretically developed largely in statistics and machine learning, usually focusing on the canonical stationary bandit problem lattimore2020bandit (); lai1985asymptotically (); slivkins2019introduction (); auer2002finite (); kaufmann2012bayesian (); kaufmann2012thompson (). However, they also played an important role in cognitive neuroscience and psychology, where they have been applied in a wide range of experimental paradigms, investigating human learning and decision-making rather than optimal performance. Here dynamic or non-stationary variants have been used more often, as relevant changes in choice-reward contingencies are typically hidden and stochastic in everyday environments of humans and other animals schulz2019algorithmic (); gottlieb2013information (); wilson2012inferring (); cohen2007should (); behrens2007learning (). We focus on a switching bandit, a particularly popular variant of the dynamic bandit where contingencies change periodically and stay fixed for some time between switches izquierdo2017neural (); markovic2019predicting (); iglesias2013hierarchical (); wilson2012inferring (); racey2011pigeon (); behrens2007learning (). The canonical stationary bandit has been influential in cognitive neuroscience and psychology as well steyvers2009bayesian (); stojic2020uncertainty (); wilson2014humans (); reverdy2014modeling (), in particular when combined with side information or context to investigate structure or function learning acuna2008bayesian (); stojic2020s (); schulz2018putting (); schulz2020finding ().

Note that many experimental tasks, even if not explicitly referred to as bandit problems, can be in fact reformulated as an equivalent bandit problem. The often used reversal learning task izquierdo2017neural (), for example, corresponds to a dynamic switching two-armed bandit clark2004neuropsychology (), and the popular go/no-go task can be expressed as a four-armed stationary bandit guitart2011action (), as another example. Furthermore, various variants of the well-established multi-stage task daw2011model () can be mapped to a multi-armed bandit problem, where the choice of arm corresponds to a specific sequence of choices in the task dezfouli2012habits ().

In summary, we will perform a comparative analysis on two types of bandit problems: stationary stochastic and switching bandit. In this section, we first describe each bandit problem formally in an abstract way and then specify the particular instantiations we use in our computational experiments.

2.1 Stationary stochastic bandit

A stationary stochastic bandit with finitely many arms is defined as follows: in each round the agent chooses an arm or action from a finite set of $K$ arms, and the environment then reveals an outcome (e.g. reward or punishment). The stochasticity of the bandit implies that outcomes $o_t$ are i.i.d. random variables drawn from a probability distribution $p(o_t|a_t)$. In Bernoulli bandits, these are draws specifically from a Bernoulli distribution for which outcomes are binary, that is, $o_t \in \{0, 1\}$, where each arm $k$ has a reward probability $\theta_k$ that parametrises the Bernoulli distribution. Hence, we can express the observation likelihood as

p(o_t | \vec{\theta}, a_t = k) = \theta_k^{o_t} (1 - \theta_k)^{1 - o_t} (1)

where $a_t = k$ denotes the chosen arm on trial $t$. In stationary bandits the reward probabilities of individual arms are fixed for all trials, $\theta_{t,k} = \theta_k$. We use $k^*$ to denote an optimal arm associated with the maximal expected reward, $\theta_{k^*} = \max_k \theta_k$.

In our computational experiments we follow a setup that has been used in previous investigations of stationary stochastic bandits chapelle2011empirical (): We consider the variant of the problem in which all but the best arm have the same reward probability $\theta_k = \frac{1}{2} - \epsilon$. The probability of the best arm is set to $\theta_{k^*} = \frac{1}{2}$, where $\epsilon \in (0, \frac{1}{2})$. The number of arms $K$ and the mean outcome difference $\epsilon$ modulate the task difficulty: the more arms and the smaller the outcome difference, the more difficult the problem. To understand how task difficulty influences the performance of different action selection algorithms, in the experiments we systematically vary $K$ and $\epsilon$. Note that a large number of arms is the standard setting in machine learning benchmarks, as many industrial applications of multi-armed bandits contain a large number of options slivkins2019introduction (). In contrast, experimental cognitive neuroscience typically considers only a small number of options (e.g. two or three) to reduce the task complexity and, thus, the training time and the experiment duration.
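As a concrete illustration, this reward structure can be generated as follows (a minimal sketch; the function name is ours, and the convention that the best arm pays off with probability 1/2 while all others pay off with probability 1/2 − ε follows our reading of the setup above):

```python
def stationary_thetas(K, eps, best_arm=0):
    """Reward probabilities for a K-armed stationary Bernoulli bandit:
    the best arm pays off with probability 1/2, all remaining arms with
    probability 1/2 - eps, so eps is the mean outcome difference."""
    assert 0 < eps < 0.5 and K >= 2
    thetas = [0.5 - eps] * K
    thetas[best_arm] = 0.5
    return thetas
```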

2.2 Switching bandit

A switching bandit is a dynamic multi-armed bandit which, like the stationary bandit, is characterised by a set of $K$ arms, where each arm is associated with an i.i.d. random outcome $o_t$ at a given time step $t$. However, in contrast to the stationary bandit problem, outcomes are drawn from a time-dependent Bernoulli probability distribution

p(o_t | \vec{\theta}_t, a_t = k) = \theta_{t,k}^{o_t} (1 - \theta_{t,k})^{1 - o_t}. (2)

We use $k^*_t$ to denote the optimal arm associated with the maximal expected reward at trial $t$; hence, $k^*_t = \arg\max_k \theta_{t,k}$.

In the switching bandit cheung2019hedging (); besson2019generalized () the reward probability changes suddenly at random points in time but is otherwise constant. Here we use the same reward probability structure as in the stationary bandit, but change the optimal arm with probability $\rho$ as follows

j_t \sim p(j_t) = \rho^{j_t} (1 - \rho)^{1 - j_t}, \quad k^*_t \sim \begin{cases} \delta_{k^*_{t-1}, k^*_t}, & \text{if } j_t = 0, \\ \frac{1 - \delta_{k^*_{t-1}, k^*_t}}{K - 1}, & \text{if } j_t = 1, \end{cases} (3)

where $\delta_{i,j}$ denotes the Kronecker delta, and $j_t$ denotes an auxiliary Bernoulli random variable representing the presence or absence of a switch on trial $t$. The optimal arm is always associated with the same reward probability $\theta_{k^*_t} = \frac{1}{2}$ and the reward probability of all other arms is set to the same value $\frac{1}{2} - \epsilon$. In the experiments with the switching bandit problem we systematically vary $K$, $\epsilon$, and $\rho$.

In addition, we will consider the possibility that the task difficulty changes over time. Specifically, we will consider a setup in which the mean outcome difference $\epsilon$ is not fixed, but changes over time. We obtain an effectively non-stationary difficulty by introducing a time evolution of the reward probabilities $\vec{\theta}_t$. At each switch point ($j_t = 1$), we generate the reward probabilities anew from a uniform distribution. Hence, the dynamics of the switching bandit with non-stationary difficulty can be expressed with the following transition probabilities

p(\theta_{t,k} | \theta_{t-1,k}, j_t) = \begin{cases} \delta(\theta_{t,k} - \theta_{t-1,k}), & \text{for } j_t = 0, \\ Be(1, 1), & \text{for } j_t = 1, \end{cases} (4)

where $\delta(\cdot)$ denotes Dirac's delta function, and $Be(1, 1)$ a uniform distribution on the $[0, 1]$ interval, expressed as a special case of the Beta distribution.
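The switching dynamics of Eqs. 3 and 4 can be simulated with a single transition function. The sketch below is illustrative (the function name and `random_difficulty` flag are ours): with fixed difficulty the optimal arm jumps uniformly to one of the other K − 1 arms at a switch, and with non-stationary difficulty all reward probabilities are redrawn from Be(1, 1), after which the optimal arm is simply the argmax.

```python
import random

def step_switching_bandit(thetas, best_arm, rho, eps, rng,
                          random_difficulty=False):
    """One transition of the switching bandit. With probability rho
    (j_t = 1) the optimal arm jumps uniformly to one of the other K - 1
    arms (Eq. 3); with random_difficulty the reward probabilities are
    also redrawn from a uniform Be(1, 1) distribution (Eq. 4)."""
    K = len(thetas)
    if rng.random() < rho:  # j_t = 1: a switch occurs
        if random_difficulty:
            thetas = [rng.random() for _ in range(K)]  # theta ~ Be(1, 1)
            best_arm = max(range(K), key=thetas.__getitem__)
        else:
            best_arm = rng.choice([k for k in range(K) if k != best_arm])
            thetas = [0.5 - eps] * K
            thetas[best_arm] = 0.5
    return thetas, best_arm
```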

2.3 Evaluating performance in bandit problems

A standard approach to evaluate the performance of different decision making algorithms in bandit problems is regret analysis lattimore2020bandit (); blum2007learning (), and we will therefore use it here as a primary measure. Regret is typically defined as an external measure of performance which computes a cumulative expected loss of an algorithm relative to an oracle which knows the ground truth and always selects the optimal arm $k^*_t$. If we define the cumulative expected reward up to trial $T$ of an agent that chose arm $a_t$ on trial $t$ as $\sum_{t=1}^T \theta_{t, a_t}$, then the (external) cumulative regret is defined as

R_T = T \theta_{k^*} - \sum_{t=1}^T \theta_{t, a_t}. (5)

The cumulative regret can also be viewed as a retrospective loss, which an agent playing the bandit game can estimate after it learns which arm was optimal. This definition makes sense for stationary stochastic bandits and in the limit of $T \to \infty$. In practice, the cumulative regret of a specific agent playing the game will be a function of the sequence of observed outcomes $o_{1:T}$, the sequence of chosen arms $a_{1:T}$, and the selection strategy of the given agent.

We additionally introduce a regret rate measure, a time average of the cumulative regret

\tilde{R}_T = \frac{1}{T} R_T = \theta_{k^*} - \frac{1}{T} \sum_{t=1}^T \theta_{t, a_t}. (6)

In the case of stationary bandits a decision making algorithm is considered consistent if $\tilde{R}_T \to 0$, and asymptotically efficient if its cumulative regret approaches the following lower bound as $T \to \infty$ lai1985asymptotically ()

\underline{R}_T = \ln(T) \sum_{i \neq k^*} \frac{\theta_{k^*} - \theta_i}{D_{KL}\left( p_{\vec{\theta}}(o_t|i) \,||\, p_{\vec{\theta}}(o_t|k^*) \right)} \equiv \omega(K, \epsilon) \ln T (7)

In our case of Bernoulli bandits with the specifically structured reward probabilities (see the 2.1 Stationary stochastic bandit subsection), the Kullback-Leibler divergence between the outcome likelihood of the arm with the highest reward probability and that of any other arm becomes

D_{KL}\left( p_{\vec{\theta}}(o_t|k^*) \,||\, p_{\vec{\theta}}(o_t|i) \right) = -\frac{1}{2} \ln\left( 1 - 4\epsilon^2 \right) \approx 2\epsilon^2. (8)

Hence, the lower bound to the cumulative regret becomes approximately

\underline{R}_T = \frac{2\epsilon (K - 1)}{-\ln\left( 1 - 4\epsilon^2 \right)} \ln T \approx \frac{K - 1}{2\epsilon} \ln T. (9)

In addition, we can define an upper bound in terms of a random choice algorithm, which selects any arm with the same probability $\frac{1}{K}$ on every trial. In the case of random and uniform action selection the expected cumulative regret becomes

\bar{R}_T = T \epsilon \frac{K - 1}{K}. (10)

Note that the cumulative regret is an external quantity not accessible to an agent, which holds uncertain beliefs about the reward probabilities of the different arms. Although in stationary bandits the cumulative regret can reveal how efficient an algorithm is at accumulating reward in the long term, it tells us little about how efficient an algorithm is at reducing regret in the short term. This short-term efficiency is particularly important for dynamic bandits, as an agent has to switch constantly between exploration and exploitation. Therefore, to investigate the short-term efficiency of the algorithms, specifically in the dynamic context, we will analyse the regret rate instead of the commonly used cumulative regret (see raj2017taming ()).
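Because the ground-truth reward probabilities are known to the experimenter, both the cumulative regret (Eq. 5) and the regret rate (Eq. 6) can be computed directly from the simulation record. A minimal sketch (function name ours), written for the general dynamic case where the optimal arm may change between trials:

```python
def regret_measures(theta_seq, choices):
    """Cumulative regret R_T (Eq. 5) and regret rate R_T / T (Eq. 6).
    theta_seq[t] holds the true reward probabilities on trial t, so
    max(theta_seq[t]) is the oracle's expected reward on that trial."""
    T = len(choices)
    cumulative = sum(max(theta_seq[t]) - theta_seq[t][choices[t]]
                     for t in range(T))
    return cumulative, cumulative / T
```

For a stationary bandit, `theta_seq` simply repeats the same probability vector on every trial.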

3 Algorithms

Bandit algorithms can be thought of as consisting of two parts: (i) a learning rule that estimates action values, and (ii) an action-selection strategy that uses the estimates to choose actions and effectively balance between exploration and exploitation. As described in the previous section, for the canonical stationary problem a good bandit algorithm achieves a regret that scales sub-linearly with the number of rounds (see Eq. 9). Intuitively, this means that the algorithm should be reducing exploration and allocating more choices over time to arms with high expected value. The relevant question is how, concretely, to reduce exploration. Naturally, this is a fine balancing act: reducing exploration too quickly would potentially result in false beliefs about the best arm, hence repeatedly choosing sub-optimal arms and accumulating regret. In contrast, reducing exploration too slowly would result in wasting too many rounds exploring sub-optimal arms and again accumulating regret. For comparison with the algorithm based on active inference, we focus on two popular classes of bandit algorithms that are known to hit the right balance: the (Bayesian) upper confidence bound (B-UCB) auer2002finite (); kaufmann2012bayesian () and (optimistic) Thompson sampling (O-TS) chapelle2011empirical (); thompson1933likelihood (); kaufmann2012thompson () algorithms. Note that $\epsilon$-greedy or Softmax action-selection strategies sutton2018reinforcement (), frequently used in reinforcement learning, have fixed exploration, and consequently poor regret performance in bandit problems. There are variants of these strategies in which the exploration parameters ($\epsilon$ in $\epsilon$-greedy and the temperature in Softmax) are reduced with specific schedules auer2002finite (). However, choosing a schedule is based on heuristics and the parameters are difficult to tune. Hence, we did not include these types of strategies in our comparisons.

In what follows we decompose active inference and the bandit algorithms we compare it to into two components: the learning rule and the action selection strategy. We derive the learning rules from an approximate Bayesian inference scheme and keep them fixed across all algorithms, modifying only the action selection strategy. This setup allows us to have a fair comparison between active inference and the competing bandit algorithms. Finally, we use the same action selection strategies for both the stationary and the dynamic bandit problem, and parameterise the learning rules to account for the presence or absence of changes.

3.1 Shared learning rule - variational SMiLe

To derive the belief update equations we start with the hierarchical generative model described below and apply variational inference to obtain approximate learning rules. The obtained belief update equations correspond to the variational surprise minimisation learning (SMiLe) rule liakoni2020learning (); markovic2016comparative (). Importantly, we recover the learning rules for the stationary bandit (see Eq. 21) as a special case when changes are improbable.

We will express the hierarchical generative model of choice outcomes as the following joint distribution

p(o_{1:T}, \vec{\theta}_{1:T}, j_{1:T}, \rho | a_{1:T}) = p(\rho) \prod_{t=1}^T p(o_t | \vec{\theta}_t, a_t)\, p(\vec{\theta}_t | \vec{\theta}_{t-1}, j_t)\, p(j_t | \rho), (11)

where the observation likelihood corresponds to the Bernoulli distribution. Hence,

p(o_t | \vec{\theta}_t, a_t) = \prod_{k=1}^K \left[ \theta_{t,k}^{o_t} (1 - \theta_{t,k})^{1 - o_t} \right]^{\delta_{a_t, k}}. (12)

If no change occurs on a given trial ($j_t = 0$), the reward probabilities remain fixed, $\vec{\theta}_t = \vec{\theta}_{t-1}$. Otherwise, if a change occurs ($j_t = 1$), a new value is generated for each arm from the prior distribution $p(\vec{\theta}_0)$. Formally, we can express this process as

p(\vec{\theta}_t | \vec{\theta}_{t-1}, j_t) = \prod_{k=1}^K \left[ \delta(\theta_{t,k} - \theta_{t-1,k}) \right]^{1 - j_t} \left[ p(\theta_{0,k}) \right]^{j_t}. (13)

Similarly, the probability that a change in reward probabilities occurs on a given trial is $\rho$; hence we write

p(j_t | \rho) = \rho^{j_t} (1 - \rho)^{1 - j_t}. (14)

The Bayesian approach requires us to specify a prior. The prior over reward probabilities associated with each arm is given as the product of conjugate priors of the Bernoulli distribution, that is, the Beta distribution

p(\vec{\theta}_0) = \prod_{k=1}^K B(\theta_{0,k}; \alpha_{0,k}, \beta_{0,k}), (15)

where we initially set the prior to a uniform distribution, $\alpha_{0,k} = \beta_{0,k} = 1$.

Hence, given Bayes rule at time step $t$,

p(\vec{\theta}_t, j_t | \rho, o_{1:t}, a_{1:t}) \propto p(o_t | \vec{\theta}_t, a_t)\, p(\vec{\theta}_t, j_t | \rho, o_{1:t-1}), (16)

we can express the exact marginal posterior beliefs over reward probabilities as

p_{\rho}\left( \vec{\theta}_t | o_{1:t}, a_{1:t} \right) = (1 - \gamma_t)\, p\left( \vec{\theta}_t | j_t = 0, o_{1:t}, a_{1:t} \right) + \gamma_t\, p\left( \vec{\theta}_t | j_t = 1, o_t, a_t \right) (17)

where $a_{1:t}$ corresponds to the sequence of chosen arms, and $\gamma_t$ corresponds to the marginal posterior probability that a change occurred on trial $t$, which we can express as

\gamma_t = \gamma\left( S^{BF}_t, m \right), \quad \gamma(S, m) = \frac{m S}{1 + m S}, \quad S^{BF}_t = \frac{p\left( o_t | j_t = 1, a_t, o_{1:t-1} \right)}{p\left( o_t | j_t = 0, a_t, o_{1:t-1} \right)}, \quad m = \frac{\rho}{1 - \rho} (18)

The exact marginal posterior in Eq. 17 will not belong to the Beta distribution family, making the exact inference analytically intractable. However, constraining the joint posterior to an approximate, fully factorised form

p(\vec{\theta}_t, j_t | \rho, o_{1:t}, a_{1:t}) \approx Q(j_t) \prod_{k=1}^K Q(\theta_{t,k}) (19)

and applying variational calculus allows us to recover the following variational SMiLe rule liakoni2020learning ()

\alpha^k_t = (1 - \gamma_t) \alpha^k_{t-1} + \gamma_t \alpha_0 + \delta_{a_t, k}\, o_t, \quad \beta^k_t = (1 - \gamma_t) \beta^k_{t-1} + \gamma_t \beta_0 + \delta_{a_t, k} (1 - o_t) (20)

for the parameters of the Beta distributed factors $Q(\theta_{t,k})$ and the Bernoulli distributed change variable $Q(j_t)$, whose update corresponds to Eq. 18.

Note that in a stationary environment changes are improbable, hence $\rho \to 0$ and consequently $\gamma_t = 0$ for every $t$. This implies that for the stationary bandit we recover the following learning rules

\alpha_{t,k} = \alpha_{t-1,k} + o_t\, \delta_{a_t, k}, \quad \beta_{t,k} = \beta_{t-1,k} + (1 - o_t)\, \delta_{a_t, k}, (21)

that correspond to the exact Bayesian inference over the stationary Bernoulli bandit problem.
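The resulting learning rule is straightforward to implement. The sketch below (function name ours) assumes the Bayes factor in Eq. 18 is the ratio of the predictive probability of the outcome under a change to that under no change, with each predictive probability computed from the mean of the corresponding Beta distribution; setting `rho = 0` recovers the stationary update of Eq. 21.

```python
def smile_update(alpha, beta, arm, outcome, rho, alpha0=1.0, beta0=1.0):
    """Variational SMiLe update for one trial (Eqs. 18 and 20).
    alpha, beta: per-arm Beta parameters; returns updated copies and
    the inferred change probability gamma_t. Assumes 0 <= rho < 1."""
    # predictive probability of the outcome under "no change" (j_t = 0)
    mu = alpha[arm] / (alpha[arm] + beta[arm])
    p_no_change = mu if outcome == 1 else 1.0 - mu
    # predictive probability under "change" (j_t = 1), i.e. under the prior
    mu0 = alpha0 / (alpha0 + beta0)
    p_change = mu0 if outcome == 1 else 1.0 - mu0
    # Bayes-factor surprise and surprise-modulated adaptation rate (Eq. 18)
    m = rho / (1.0 - rho)
    s = p_change / p_no_change
    gamma = m * s / (1.0 + m * s)
    # delta-like update of the sufficient statistics (Eq. 20):
    # all arms decay towards the prior, the chosen arm counts the outcome
    new_alpha = [(1 - gamma) * a + gamma * alpha0 for a in alpha]
    new_beta = [(1 - gamma) * b + gamma * beta0 for b in beta]
    new_alpha[arm] += outcome
    new_beta[arm] += 1 - outcome
    return new_alpha, new_beta, gamma
```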

3.2 Action selection

Active inference

One view on the exploration-exploitation trade-off is that it can be formulated as an uncertainty-reduction problem schwartenbeck2019computational (), where choices aim to resolve expected and unexpected uncertainty about hidden properties of the environment soltani2019adaptive (). This leads to casting choice behaviour and planning as a probabilistic inference problem kaplan2017navig (); friston2017temporal (); friston2017process (); mirza2016scene (); friston2016learning (), as expressed by active inference. Using this approach, different types of exploitative and exploratory behaviour naturally emerge schwartenbeck2019computational (). In active inference, decision strategies (behavioural policies) are chosen based on a single optimisation principle: minimising the expected surprisal about observed and future outcomes, that is, the expected free energy smith2021step (). Formally, we express the expected free energy of a choice $a$ as

G(a) = \underbrace{D_{KL}\left( Q(o_t | a_t = a) \,||\, P(o_t) \right)}_{\text{Risk}} + \underbrace{E_{Q(\vec{\theta})}\left[ H\left[ o_t | \vec{\theta}, a_t = a \right] \right]}_{\text{Ambiguity}} = \underbrace{-E_{Q(o_t | a_t = a)}\left[ \ln P(o_t) \right]}_{\text{Extrinsic value}} - \underbrace{E_{Q(o_t | a_t = a)}\left[ D_{KL}\left( Q\left( \vec{\theta}, j_t | o_t, a_t = a \right) \,||\, Q\left( \vec{\theta}, j_t \right) \right) \right]}_{\text{Intrinsic value / Novelty}} (22)

where $Q(o_t | a_t = a)$ denotes the posterior predictive distribution over outcomes, $P(o_t)$ denotes prior preferences over outcomes, $H\left[ o_t | \vec{\theta}, a_t = a \right]$ denotes the conditional entropy of the observation likelihood, and $D_{KL}$ stands for the Kullback-Leibler divergence. Then, a choice is made by selecting the action with the smallest expected free energy

a_t = \arg\min_a G(a), (23)

where we consider the simplest form of active inference; that is, as in the other bandit algorithms, one-step-ahead beliefs about actions.

Note that in active inference, the most likely action has dual imperatives, implicit within the expected free energy acting as the loss function (see the different decomposition in Eq. 22): The expected free energy can, on one hand, be decomposed into ambiguity and risk. On the other hand, it can be understood as a combination of intrinsic and extrinsic value, where intrinsic value corresponds to the expected information gain, and the extrinsic value to the expected value. The implicit information gain or uncertainty reduction pertains to beliefs about the parameters of the likelihood mapping, which has been construed as novelty kaplan2018planning (); schwartenbeck2013exploration (). Therefore, selecting actions that minimise the expected free energy dissolves the exploration-exploitation trade-off, by virtue of the fact that every action contains both expected value and information gain.

To express the expected free energy, , in terms of beliefs about arm-specific reward probabilities, we will first constrain the prior preference to the following Bernoulli distribution

P(o_t) = \frac{1}{Z(\lambda)} e^{o_t \lambda} e^{-(1 - o_t) \lambda}. (24)

In active inference, prior preferences determine whether a particular outcome is attractive or rewarding. Here we assume that agents prefer outcome $o_t = 1$ over outcome $o_t = 0$. Hence, we specify payoffs or rewards with prior preferences over outcomes that have an associated precision $\lambda$, where $\lambda \geq 0$. The precision parameter determines the balance between epistemic and pragmatic imperatives. When prior preferences are very precise, corresponding to a large $\lambda$, the agent becomes risk sensitive and will tend to forgo exploration if the risk (i.e., the divergence between predicted and preferred outcomes; see Eq. 22) is high. Conversely, a low $\lambda$ corresponds to an agent that is less sensitive to risk and will engage in exploratory, epistemic behaviour until it has familiarised itself with the environment (i.e., the hidden reward probabilities in the multi-armed bandit problem).

Given the following expressions for the marginal predictive likelihood, obtained by marginalising the observation likelihood over beliefs about the reward probabilities and the change variable $j_t$,

Q\left( o_t | a_t \right) = \left[ \tilde{\mu}^{a_t}_t \right]^{o_t} \left[ 1 - \tilde{\mu}^{a_t}_t \right]^{1 - o_t}, \quad \tilde{\mu}^{a_t}_t = \mu^{a_t}_{t-1} + \rho \left( \frac{1}{2} - \mu^{a_t}_{t-1} \right), \quad \mu^{a_t}_{t-1} = \frac{\alpha^{a_t}_{t-1}}{\nu^{a_t}_{t-1}}, \quad \nu^{a_t}_{t-1} = \alpha^{a_t}_{t-1} + \beta^{a_t}_{t-1} (25)

we get the following expression for the expected free energy

G_t(k) = -2\lambda (1 - \rho) \mu_{t-1,k} + \tilde{\mu}_{t,k} \ln \tilde{\mu}_{t,k} + (1 - \tilde{\mu}_{t,k}) \ln (1 - \tilde{\mu}_{t,k}) - (1 - \rho) \left[ \mu_{t-1,k}\, \psi\left( \alpha_{t-1,k} \right) + \left( 1 - \mu_{t-1,k} \right) \psi\left( \beta_{t-1,k} \right) \right] + (1 - \rho) \left[ \psi\left( \nu_{t-1,k} \right) - \frac{1}{\nu_{t-1,k}} \right] + \text{const.} (26)

Above we have used the following relation

\int dx\, B(x; \alpha, \beta)\, x \ln x = \mu \left( \psi(\alpha) - \psi(\nu) \right) + \frac{1 - \mu}{\nu}, (27)

for computing the expectations appearing in Eq. (26).

If we approximate the digamma function as $\psi(x) \approx \ln x - \frac{1}{2x}$, and note that for all relevant use cases $\rho \ll 1$, then by substituting the approximate digamma expression into Eq. (26) we get the following action selection algorithm

a_t = \arg\max_k \left[ 2\lambda \mu_{t-1,k} + \frac{1}{2 \nu_{t-1,k}} \right]. (28)

Note that a similar exploration bonus – inversely proportional to the number of observations – was proposed in the context of Bayesian reinforcement learning kolter2009near () when working with Dirichlet prior and posterior distributions.

We will denote active inference agents which make choices based on the approximate expected free energy, Eq. 28, with A-AI, and agents which minimise directly the exact expected free energy, Eq. 23, with G-AI.
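The A-AI decision rule of Eq. 28 reduces to a few lines of code. A minimal sketch (the function name is ours), in which the exploration bonus $1/(2\nu)$ decays as an arm accumulates observations while the preference precision $\lambda$ weights the expected value:

```python
def a_ai_choose(alpha, beta, lam):
    """Approximate active inference (A-AI) action selection (Eq. 28):
    preference-weighted expected reward plus an exploration bonus that
    shrinks with the evidence count nu = alpha + beta."""
    scores = []
    for a, b in zip(alpha, beta):
        nu = a + b                       # total evidence for this arm
        mu = a / nu                      # posterior mean reward probability
        scores.append(2 * lam * mu + 1.0 / (2 * nu))
    return max(range(len(scores)), key=scores.__getitem__)
```

With equal posterior means, the rule favours the arm with fewer observations; with sufficient evidence everywhere, it favours the arm with the higher mean.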

Upper confidence bound

The upper confidence bound (UCB) is a classical action selection strategy for resolving the exploration-exploitation dilemma auer2002finite (), with the action selection strategy defined as

\[
a_t =
\begin{cases}
\arg\max_k \left( m_{t,k} + \dfrac{\ln t}{n_{t,k}} + \sqrt{\dfrac{m_{t,k}\ln t}{n_{t,k}}} \right) & \text{for } t > K \\
t & \text{otherwise,}
\end{cases}
\tag{29}
\]

where m_{t,k} is the empirical mean reward of the k-th arm and n_{t,k} the number of times the k-th arm was selected up to trial t (see chapelle2011empirical () for more details).

However, we consider a more recent variant called Bayesian UCB kaufmann2012bayesian (), grounded in Bayesian bandits. In Bayesian UCB the best arm is selected as the one with the highest z_t-th percentile of posterior beliefs, where the percentile increases over time as z_t = 1 − 1/t. Hence, we can express the action selection rule as

\[
a_t = \arg\max_k \mathrm{CDF}^{-1}\!\left(z_t; \bar{\alpha}^k_t, \bar{\beta}^k_t\right)
\tag{30}
\]

where CDF(·; α, β) denotes the cumulative distribution function of the Beta-distributed posterior beliefs, and the parameters (ᾱ^k_t, β̄^k_t) denote approximate sufficient statistics of the Beta-distributed prior beliefs on trial t. Note that the exact prior corresponds to a mixture of two Beta distributions

\[
p\!\left(\theta^k_t \mid o_{1:t-1}\right) = (1-\rho)\,\mathrm{Be}\!\left(\alpha^k_{t-1}, \beta^k_{t-1}\right) + \rho\,\mathrm{Be}\!\left(\alpha_0, \beta_0\right).
\tag{31}
\]

As the inverse of the cumulative distribution function of the above mixture distribution is analytically intractable, we will use the following approximation

\[
p\!\left(\theta^k_t \mid o_{1:t-1}\right) \approx \mathrm{Be}\!\left(\bar{\alpha}^k_t, \bar{\beta}^k_t\right), \qquad
\bar{\alpha}^k_t = (1-\rho)\alpha^k_{t-1} + \rho\alpha_0, \qquad
\bar{\beta}^k_t = (1-\rho)\beta^k_{t-1} + \rho\beta_0.
\tag{32}
\]

Thus, in the case of Beta-distributed prior beliefs, the inverse cumulative distribution function corresponds to the inverse of the regularised incomplete beta function. Hence, we can write

\[
\mathrm{CDF}^{-1}(z; \alpha, \beta) = I^{-1}_z(\alpha, \beta).
\tag{33}
\]
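The B-UCB rule of Eqs. (30)-(33) maps directly onto scipy, whose `beta.ppf` implements the inverse regularised incomplete beta function. A sketch under the assumption z_t = 1 − 1/t (function names are ours):

```python
import numpy as np
from scipy.stats import beta as beta_dist

def b_ucb_action(alpha, beta, t):
    """Bayesian UCB (Eq. 30): pick the arm with the largest z_t-quantile
    of its Beta posterior. Here z_t = 1 - 1/t (an assumed schedule
    consistent with the percentile growing over time)."""
    z_t = 1.0 - 1.0 / max(t, 2)
    # beta.ppf is the inverse regularised incomplete beta function (Eq. 33)
    ucb = beta_dist.ppf(z_t, alpha, beta)
    return int(np.argmax(ucb))

def moment_matched_prior(alpha, beta, a0, b0, rho):
    """Single-Beta approximation to the mixture prior (Eqs. 31-32)."""
    return (1 - rho) * alpha + rho * a0, (1 - rho) * beta + rho * b0
```

Note how a rarely sampled arm, with a wide posterior, can have a higher upper quantile than a frequently sampled arm with a higher mean, which is what drives exploration in this rule.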

Thompson sampling

Thompson sampling is traditionally associated with Bayesian bandits kandasamy2018parallelised (); chapelle2011empirical (); thompson1933likelihood (), where the action selection is derived from i.i.d. samples from the posterior beliefs about the reward probability. The standard algorithm corresponds to

\[
a_t = \arg\max_k \theta^*_{t,k}, \qquad \theta^*_{t,k} \sim p\!\left(\theta_{t,k} \mid o_{1:t-1}\right),
\tag{34}
\]

where θ*_{t,k} denotes a single sample from the current beliefs about the reward probability associated with the k-th arm.

An extension of the standard algorithm, proposed in the context of dynamic bandits, is called optimistic Thompson sampling raj2017taming (), defined as

\[
a_t = \arg\max_k \max\!\left(\theta^*_{t,k}, \tilde{\mu}_{t,k}\right), \qquad \theta^*_{t,k} \sim p\!\left(\theta_{t,k} \mid o_{1:t-1}\right),
\tag{35}
\]

where the expected reward probability at the current trial t,

\[
\tilde{\mu}_{t,k} = \mu_{t-1,k} + \rho\left(\tfrac{1}{2} - \mu_{t-1,k}\right),
\]

constrains the minimal accepted value of the sample from the prior, hence biasing the sampling towards optimistic larger values.
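A minimal sketch of the O-TS rule in Eq. (35), with the predictive mean computed as above (the function name is ours):

```python
import numpy as np

def optimistic_thompson_action(alpha, beta, rho, rng):
    """Optimistic Thompson sampling (Eq. 35): clip each posterior sample
    from below at the predictive mean mu_tilde, which biases the
    sampling towards optimistic, larger values."""
    mu = alpha / (alpha + beta)
    mu_tilde = mu + rho * (0.5 - mu)   # predictive mean under change prob. rho
    theta = rng.beta(alpha, beta)      # standard Thompson samples, one per arm
    return int(np.argmax(np.maximum(theta, mu_tilde)))
```

With ρ = 0 this reduces to standard Thompson sampling with samples floored at the posterior means; with ρ > 0 the floor is additionally pulled towards 1/2, reflecting the possibility of a change.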

4 Results

In what follows, we first examine the performance of the active inference based agents, A-AI (minimising the approximate expected free energy) and G-AI (minimising the exact expected free energy), in stationary Bernoulli bandits. Using the regret rate as the performance criterion, we analyse the dependence of the agents' performance on the precision of prior preferences (λ) and simultaneously verify that our approximation is good enough. After illustrating the effectiveness of A-AI (Eq. 28) in comparison to G-AI (Eq. 23), we empirically compare only the A-AI algorithm – now in terms of cumulative regret – with agents using the optimistic Thompson sampling (O-TS; Eq. 35) and Bayesian upper confidence bound (B-UCB; Eq. 30) algorithms, in the same stationary Bernoulli bandit. Finally, we provide an empirical comparison of the three action selection algorithms in the switching Bernoulli bandit.

4.1 The stationary Bernoulli bandit

The precision parameter λ acts as a balancing parameter between exploitation and exploration (Eq. 28). Hence, it is paramount to understand how λ impacts performance across different difficulty conditions. We expect that there is a λ for which the active inference algorithm achieves minimal cumulative regret after a fixed number of trials, for each mean outcome difference and each number of arms. When the AI agent has very imprecise preferences (small λ), it engages in exploration for longer, thereby reducing its free energy (i.e., uncertainty about the likelihood mappings) at the expense of accumulating reward. Conversely, an AI agent with very precise preferences (large λ) commits to a particular arm as soon as it has inferred that this is the arm with the highest likelihood of payoffs. However, the ensuing 'superstitious' behaviour would prevent it from finding the best arm. To illustrate this, in Fig. 1 we report regret rates averaged over an ensemble of simulations, and compare the agents using either the approximate (A-AI) or the exact (G-AI) expected free energy for action selection. Using the regret rate simplifies the comparison as, unlike cumulative regret, the regret rate stays in the same range of values independent of the trial number t.

We find the minimal regret rate – in the asymptotic limit of a large number of trials; see solid lines in Fig. 1 – at approximately the same precision value for a large range of problem difficulties5. Hence, for the following between-agent comparisons we restrict the active inference agents to a fixed precision λ, which approximately minimises the regret rate over fixed-length sessions for all considered conditions. As both G-AI (blue lines) and A-AI (red lines) achieve very similar regret rates as a function of precision λ and number of trials t, we will only consider the A-AI variant for the between-agent comparison. We anticipated that even this approximate form of active inference would outperform the bandit algorithms, most notably when considering short sessions in the stationary scenario, i.e., when exploration gives way to exploitation after the agent becomes familiar with the payoffs afforded by the multi-armed options. The reason for this expectation is the exact computation of the information gain implicit within the expected free energy (see Eq. 22).

Next we compare the cumulative regret, as a function of trial number t, of the A-AI agents with agents based on the optimistic Thompson sampling (O-TS) and the Bayesian UCB (B-UCB) algorithms (Fig. 2). The dotted lines mark the asymptotic limit (see Eq. 9) for the corresponding problem difficulty. The asymptotic limit scales as ln t and defines the long-term behaviour of an asymptotically efficient algorithm. Note that the limit behaviour can be offset by an arbitrary constant to form a lower bound lai1985asymptotically (); chapelle2011empirical (). For convenience we fix the constant to zero, and show the asymptotic curve only as a reference for the long-term behaviour of cumulative regret for the different algorithms.

The comparison reveals that the A-AI agent outperforms the bandit algorithms, but only up to some trial that depends on the task difficulty – in the asymptotic limit its regret grows faster than logarithmically with trial number. This is most clearly visible for the more difficult problems with small mean outcome differences, where A-AI outperforms the bandit algorithms only over a limited number of trials. This divergence is driven by a small percentage of the agents in the ensemble that did not find the accurate solution and are over-confident in their estimate of the arm with the highest reward probability. The divergence is not visible in easier settings with a larger mean outcome difference, as one requires larger ensembles and a larger number of trials to observe sub-optimal instances. It might appear surprising that the divergence is evident only for the smallest considered number of arms. However, the reason for this is that the smaller the number of arms, the more opportunity an agent has to explore each individual arm within a limited number of trials. Hence, the agent will commit faster to a wrong arm and stay with that choice longer. Therefore, we found that our initial expectation about the asymptotic performance of active inference algorithms is only partially correct. Although one could set λ for any task difficulty in a way that active inference initially outperforms the alternative algorithms, in the asymptotic limit the high performance level does not hold. The reason for this can already be seen in Fig. 1, if one notes that maximal performance (minimal regret rate) depends both on the preference precision λ and the trial number t, for every difficulty condition.

Although active inference based agents behave poorly in the asymptotic limit, the fact that they achieve higher performance on a short time scale suggests that in dynamic environments – if changes occur sufficiently often – one would obtain higher performance on average compared to the bandit alternatives.
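The regret measures used in this section can be reproduced with a generic simulation loop; a sketch with a hypothetical agent interface (`act`/`update` and the baseline random agent are our illustrative choices, not any of the algorithms compared above):

```python
import numpy as np

def run_bandit(act, update, probs, n_trials, rng):
    """Simulate a stationary Bernoulli bandit and return the per-trial
    expected (pseudo-)regret, from which cumulative regret and the
    regret rate are derived."""
    p = np.asarray(probs, dtype=float)
    regret = np.empty(n_trials)
    for t in range(n_trials):
        arm = act()                           # agent chooses an arm
        outcome = int(rng.random() < p[arm])  # Bernoulli reward
        update(arm, outcome)                  # agent updates its beliefs
        regret[t] = p.max() - p[arm]          # expected regret of the choice
    return regret

# Example with a uniformly random baseline agent
rng = np.random.default_rng(0)
arm_rng = np.random.default_rng(1)
regret = run_bandit(lambda: int(arm_rng.integers(2)),
                    lambda arm, outcome: None, [0.8, 0.2], 2000, rng)
cumulative = np.cumsum(regret)
rate = cumulative / np.arange(1, 2001)
```

The random baseline's regret rate hovers around the average gap to the best arm (0.3 here), whereas a learning agent's rate should decay towards zero in the stationary case.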

4.2 The switching bandit problem

In the case of our switching bandit problem, the change probability acts as an additional difficulty parameter, besides the number of arms and the mean outcome difference. Therefore, for the between-algorithm comparison we will first keep the mean outcome difference fixed at its medial value and vary the number of arms (Fig. 3), and then keep the number of arms fixed and vary the expected outcome difference (Fig. 4). For the algorithm comparison in the switching bandit we fix the precision parameter λ based on a similar analysis as in the stationary bandits (not shown here). Note that although in stationary environments small values of λ are desirable to achieve low cumulative regret for a large number of trials, in switching environments larger values of λ are preferable. This can be inferred from Fig. 1, where the regret rate on a short time scale (dotted lines) achieves its minimum at a larger λ for a range of difficulty levels. This behaviour of the regret on a short time scale suggests that in changing environments larger λ values achieve better performance, as the A-AI agent is then more biased towards exploitation and spends less time exploring. Numerical simulations in switching environments for a wide range of λ values confirm this (not shown here). For the between-algorithm comparison in switching bandits we will use the regret rate, instead of cumulative regret, as the reference performance measure. The reason for this is that in dynamic environments cumulative regret increases linearly with trial number t, and the regret rate provides a visually more accessible gauge of performance differences raj2017taming ().

In Fig. 3 we illustrate the regret rate for each agent type over the course of the experiment, for a range of different values of the change probability and the number of arms, and a fixed mean outcome difference. Importantly, when estimating the mean regret rate over an ensemble of agents, for each agent we simulate a distinct switching schedule with the same change probability. Hence, the average is performed not only over different choice-outcome trajectories but also over different hidden trajectories of changes. This ensures that the comparison is based on environmental properties, and not on a specific realisation of the environment. We find better performance for the active inference agents compared to the other bandit algorithms in all conditions. However, we observe that the more difficult the task (in terms of a higher change probability and a larger number of arms), the less pronounced the performance advantage of the active inference based agents.
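The distinct switching schedules can be generated with a simple change-point process. A sketch in which arm probabilities are redrawn uniformly at each change; the exact generative scheme used in the experiments, which additionally controls the mean outcome difference, is more structured than this illustration:

```python
import numpy as np

def switching_bandit(n_arms, n_trials, change_prob, rng):
    """Yield per-trial reward probabilities for a switching Bernoulli
    bandit: on each trial, with probability `change_prob` all arm
    probabilities are redrawn (uniformly, for illustration)."""
    probs = rng.random(n_arms)
    for _ in range(n_trials):
        if rng.random() < change_prob:
            probs = rng.random(n_arms)   # a change point: redraw all arms
        yield probs.copy()
```

Drawing a fresh schedule per simulated agent, as described above, amounts to calling this generator with an independently seeded `rng` for each ensemble member.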

In Fig. 4 we show the regret rate for each agent type, now with a fixed number of arms but a varying mean outcome difference. Here the picture is very similar: with increasing task difficulty the A-AI agent exhibits a diminishing performance advantage relative to the bandit algorithms. Importantly, although we present the regret analysis only up to a fixed number of trials, unlike in the stationary bandit problem the results do not change after a further increase in the number of trials. When we simulate longer experiments we find convergence of all algorithms towards a non-zero regret rate, implying a linear increase of cumulative regret with trial number t.

Finally, we further illustrate the dependence of performance on the mean outcome difference, using the switching bandit with non-stationary task difficulty, where the mean outcome difference is not fixed but changes stochastically over the course of the experiment (see Section 2.2, Switching bandit, for more details). As shown in Fig. 5, we find an increased advantage of the A-AI algorithm over the B-UCB and O-TS algorithms when compared to the fixed-difficulty scenario (Fig. 3), specifically for more difficult problems – with either a larger number of arms or a larger change probability. However, the opposite is the case for lowered task difficulty, e.g., for the easiest considered conditions, where B-UCB achieves higher performance than the A-AI algorithm. Notably, we would expect that for a small number of arms and slower-changing environments the drop in performance of the A-AI agent becomes even more pronounced.

As a final remark, we find it interesting that the B-UCB algorithm consistently outperforms the O-TS algorithm in almost all non-stationary problems we examined, in contrast to the previous asymptotic analysis in the stationary bandit problem, which concluded that Thompson sampling exhibits better asymptotic scaling than B-UCB kaufmann2012thompson (). We are not aware of previous work comparing these two algorithms in the context of the switching bandit problem.

5 Discussion

In this paper we provided an empirical comparison between active inference, a Bayesian information-theoretic framework friston2017process (), and two state-of-the-art machine learning algorithms – Bayesian UCB and optimistic Thompson sampling – in stationary and non-stationary stochastic multi-armed bandits. We introduced an approximate active inference algorithm, for which our checks on the stationary bandit problem showed that its performance closely follows that of the exact version. Hence, we derived an active inference algorithm that is efficient and easily scalable to high-dimensional problems. To our surprise, the empirical algorithm comparison in the stationary bandit problem showed that the active inference algorithm is not asymptotically efficient – the cumulative regret increased faster than logarithmically in the limit of a large number of trials. The cause of this behaviour seems to be the fixed precision of prior preferences, which acts as a balancing parameter between exploration and exploitation. An analysis of how performance depends on this parameter showed that the best-performing parameter values decrease over time, suggesting that this parameter should be adaptive and decay over time as the need for exploration decreases. Attempts to remedy the situation with simple and widely used decay schemes (for example, decay with the logarithm of time; not reported here) were not successful. This indicates that the relationship is not a simple one, and a proper theoretical analysis will be needed to identify whether such a scheme exists.

In the non-stationary switching bandit problem the active inference algorithm generally outperformed Bayesian UCB and optimistic Thompson sampling. This provides evidence that the active inference framework may offer a good solution for optimisation problems that require continuous adaptation. Active inference provides a highly efficient way of gaining information, and this property of the algorithm pays off in the non-stationary setting. Such dynamic settings are also relevant in neuroscience, as relevant changes in choice-reward contingencies are typically hidden and stochastic in the everyday environments of humans and other animals schulz2019algorithmic (); gottlieb2013information (); wilson2012inferring (); cohen2007should (); behrens2007learning (). In contrast to previous neuroscience research showing that active inference is a good description of human learning and decision making limanowski2020active (); smith2020imprecise (); cullen2018active (); schwartenbeck2015evidence (); schwartenbeck2015dopaminergic (), our results on the dynamic switching bandit show that active inference also performs well in an objective sense. Such explanations of cognitive mechanisms that are grounded in optimal solutions are arguably more plausible chater1999ten (). Hence, this result lends additional credibility to active inference as a generalised framework for understanding human behaviour, not only in behavioural experiments inspired by multi-armed bandits izquierdo2017neural (); markovic2019predicting (); iglesias2013hierarchical (); wilson2012inferring (); racey2011pigeon (); behrens2007learning (), but in a range of related investigations of human and animal decision making in complex dynamic environments under uncertainty markovic2020meta (); adams2013predictions (); pezzulo2012active ().

An important next step in examining active inference in the context of multi-armed bandits is to establish theoretical bounds on the cumulative regret for the stationary bandit problem. A key part of these theoretical studies will be to investigate whether it is possible to devise a sound decay scheme for the precision parameter λ (see Eq. 28) that provably works for all instances of the canonical stationary bandit. This leaves open the possibility of developing new active inference inspired algorithms that achieve asymptotic efficiency. Such theoretical bounds would allow us to compare active inference algorithms more rigorously to the already established bandit algorithms for which regret bounds are known. Moreover, we would potentially be able to generalise beyond the settings we have empirically tested here. Future work should also consider an information-theoretic analysis of active inference, which might be more appropriate than regret analysis kaufmann2016bayesian (). For example, the Bayesian exploration bonus previously considered in Bayesian reinforcement learning was analysed with respect to the sample complexity of identifying a good policy kolter2009near (). Similarly, in russo2016information () the authors introduced a measure of regret weighted by the inverse information gain between actions and outcomes, and provided expected bounds on this measure for several Bayesian algorithms, such as Thompson sampling and Bayesian UCB.

As optimal behaviour is always defined with respect to a chosen objective function, a different objective function will lead to different behaviour, and the appropriateness of the objective function for the specific problem determines the performance of the algorithm on a given task. In other words, behaviour is determined not only by beliefs about the hidden structure of the states of the world but also by beliefs about the preferences and objectives one should take into account in that environment. Therefore, although one can consider the sensitivity of the introduced active inference algorithm to the prior precision over preferences as a limitation in comparison to the other two algorithms, we believe that it is possible to introduce various adaptations to the algorithm that improve its asymptotic behaviour. For example, one can consider learning rules for prior outcome preferences, as illustrated in sajid2019active (); markovic2020meta (). This would introduce a way to adapt the objective function to different environments, achieving high performance in a wide range of stationary and dynamic multi-armed bandit problems. Alternatively, instead of basing action selection on the expected free energy, one can define a stochastic counterpart, estimated from samples from the posterior, akin to Thompson sampling. This would enable the algorithm to better combine directed and random exploration.

In spite of the poor asymptotic performance there are some advantages of active inference over classical bandit algorithms, both for artificial intelligence and neuroscience. Unlike the Thompson sampling and UCB algorithms, active inference is easily extendable to more complex settings where actions affect future states and the set of available actions – usually formalised as a (partially observable) Markov decision process – which require the combination of adaptive decision making with complex planning mechanisms millidge2020deep (); ueltzhoffer2018deep (); fountas2020deep (). Given our finding that the advantage of the active inference algorithm improves with task difficulty, it would be interesting to apply the framework to a range of more complex Markov decision process problems, comparing it to state-of-the-art reinforcement learning algorithms recently developed in machine learning.

The generative modelling approach integral to active inference allows several improvements to the presented algorithm, which also hold for related Bayesian approaches. For example, we have considered here only one learning algorithm, variational SMiLE liakoni2020learning (), which we chose based on its simplicity and efficiency. A potential drawback of variational SMiLE is that it might not be optimal (in terms of inference) for a generic dynamic bandit problem (e.g., different mechanisms for generating changes and different reward distributions). For example, for restless bandits, which follow a random walk process, recently published efficient learning algorithms derived from different generative models are likely to provide better performance piray2020simple (); moens2019learning (). Employing a good learning algorithm is especially important in dynamic settings, where exact inference is not tractable and the performance of the learning rules is tightly coupled to the overall performance of the algorithm. In practice, one would expect that the better the generative model and the corresponding approximate inference algorithm, the better the performance on a given multi-armed bandit problem. Furthermore, one can easily extend the learning algorithms with deep hierarchical variants, which can infer a wide range of unknown dynamical properties of the environment piray2020simple () and learn higher-order temporal statistics markovic2019predicting (); markovic2020meta ().

6 Conclusion

We have derived an approximate active inference algorithm, based on a Bayesian information-theoretic framework recently developed in neuroscience, and proposed it as a novel machine learning algorithm for bandit problems that can compete with state-of-the-art bandit algorithms. Our empirical evaluation has shown that active inference is indeed a promising bandit algorithm. This work is only a first step, however; important next steps that will provide further evidence on the viability of active inference as a bandit algorithm are the development of a decay schedule for the outcome preference precision parameter and a theoretical regret analysis in the stationary bandit. The fact that the active inference algorithm achieves excellent performance in switching bandit problems, commonly used in cognitive neuroscience, provides rational grounds for using active inference as a generalised framework for understanding human and animal learning and decision making.

7 Acknowledgements

We thank Karl Friston and Gergely Neu for valuable feedback and constructive discussions.

Footnotes

1. These authors contributed equally
2. These authors contributed equally
3. Note that this type of dependence between current and future choice sets, or rewards, would convert the bandit problem into a reinforcement learning problem. It makes the exploration-exploitation trade-off more complex and optimal solutions cannot be derived beyond trivial problems.
4. In usual applications of active inference for understanding human behaviour, rather than minimising the expected free energy one would sample actions from posterior beliefs about actions (cf. planning as inference botvinick2012planning (); attias2003planning ()). This becomes useful when fitting empirical choice behaviour in behavioural experiments markovic2019predicting (); schw2016 ().
5. For the hardest considered setting, the minimum is sharp.

References

1. R. C. Wilson, E. Bonawitz, V. D. Costa, R. B. Ebitz, Balancing exploration and exploitation with information and randomization, Current Opinion in Behavioral Sciences 38 (2020) 49–56.
2. K. Mehlhorn, B. R. Newell, P. M. Todd, M. D. Lee, K. Morgan, V. A. Braithwaite, D. Hausmann, K. Fiedler, C. Gonzalez, Unpacking the exploration–exploitation tradeoff: A synthesis of human and animal literatures., Decision 2 (3) (2015) 191.
3. J. D. Cohen, S. M. McClure, A. J. Yu, Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration, Philosophical Transactions of the Royal Society B: Biological Sciences 362 (1481) (2007) 933–942.
4. T. Lattimore, C. Szepesvári, Bandit algorithms, Cambridge University Press, 2020.
5. O. Chapelle, L. Li, An empirical evaluation of thompson sampling, in: Advances in neural information processing systems, 2011, pp. 2249–2257.
6. P. Auer, N. Cesa-Bianchi, P. Fischer, Finite-time analysis of the multiarmed bandit problem, Machine learning 47 (2-3) (2002) 235–256.
7. E. Kaufmann, O. Cappé, A. Garivier, On bayesian upper confidence bounds for bandit problems, in: Artificial intelligence and statistics, 2012, pp. 592–600.
8. R. Kaplan, K. Friston, Planning and navigation as active inference, bioRxiv (2017).
9. K. J. Friston, R. Rosch, T. Parr, C. Price, H. Bowman, Deep temporal models and active inference, Neuroscience & Biobehavioral Reviews 77 (Supplement C) (2017) 388 – 402.
10. K. Friston, T. FitzGerald, F. Rigoli, P. Schwartenbeck, G. Pezzulo, Active inference: A process theory, Neural Computation 29 (1) (2017) 1–49, pMID: 27870614.
11. M. B. Mirza, R. A. Adams, C. D. Mathys, K. J. Friston, Scene construction, visual foraging, and active inference, Frontiers in computational neuroscience 10 (2016).
12. K. Friston, T. FitzGerald, F. Rigoli, P. Schwartenbeck, J. O’Doherty, G. Pezzulo, Active inference and learning, Neuroscience & Biobehavioral Reviews 68 (Supplement C) (2016) 862 – 879.
13. T. H. FitzGerald, P. Schwartenbeck, M. Moutoussis, R. J. Dolan, K. Friston, Active inference, evidence accumulation, and the urn task, Neural computation 27 (2) (2015) 306–328.
14. K. Friston, F. Rigoli, D. Ognibene, C. Mathys, T. Fitzgerald, G. Pezzulo, Active inference and epistemic value, Cognitive neuroscience 6 (4) (2015) 187–214.
15. P. Schwartenbeck, T. FitzGerald, R. Dolan, K. Friston, Exploration, novelty, surprise, and free energy minimization, Frontiers in psychology 4 (2013) 710.
16. K. Friston, The history of the future of the bayesian brain, NeuroImage 62 (2) (2012) 1230–1233.
17. K. Doya, S. Ishii, A. Pouget, R. P. Rao, Bayesian brain: Probabilistic approaches to neural coding, MIT press, 2007.
18. D. C. Knill, A. Pouget, The bayesian brain: the role of uncertainty in neural coding and computation, TRENDS in Neurosciences 27 (12) (2004) 712–719.
19. M. Botvinick, M. Toussaint, Planning as inference, Trends in cognitive sciences 16 (10) (2012) 485–488.
20. F. Karl, A free energy principle for biological systems, Entropy 14 (11) (2012) 2100–2121.
21. K. Friston, J. Kilner, L. Harrison, A free energy principle for the brain, Journal of Physiology-Paris 100 (1-3) (2006) 70–87.
22. P. Schwartenbeck, J. Passecker, T. U. Hauser, T. H. FitzGerald, M. Kronbichler, K. J. Friston, Computational mechanisms of curiosity and goal-directed exploration, Elife 8 (2019) e41703.
23. R. S. Sutton, A. G. Barto, Reinforcement learning: An introduction, MIT press, 2018.
24. K. Ueltzhöffer, Deep active inference, Biological cybernetics 112 (6) (2018) 547–573.
25. B. Millidge, Deep active inference as variational policy gradients, Journal of Mathematical Psychology 96 (2020) 102348.
26. K. Friston, L. Da Costa, D. Hafner, C. Hesp, T. Parr, Sophisticated inference, arXiv preprint arXiv:2006.04120 (2020).
27. Z. Fountas, N. Sajid, P. A. Mediano, K. Friston, Deep active inference agents using monte-carlo methods, arXiv preprint arXiv:2006.04176 (2020).
28. W. R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika 25 (3/4) (1933) 285–294.
29. R. R. Bush, F. Mosteller, A stochastic model with applications to learning, The Annals of Mathematical Statistics (1953) 559–585.
30. P. Whittle, Multi-armed bandits and the gittins index, Journal of the Royal Statistical Society: Series B (Methodological) 42 (2) (1980) 143–149.
31. T. L. Lai, H. Robbins, Asymptotically efficient adaptive allocation rules, Advances in applied mathematics 6 (1) (1985) 4–22.
32. E. Kaufmann, N. Korda, R. Munos, Thompson sampling: An asymptotically optimal finite-time analysis, in: International conference on algorithmic learning theory, Springer, 2012, pp. 199–213.
33. A. Izquierdo, J. L. Brigman, A. K. Radke, P. H. Rudebeck, A. Holmes, The neural basis of reversal learning: an updated perspective, Neuroscience 345 (2017) 12–26.
34. D. Marković, A. M. Reiter, S. J. Kiebel, Predicting change: Approximate inference under explicit representation of temporal structure in changing environments, PLoS computational biology 15 (1) (2019) e1006707.
35. S. Iglesias, C. Mathys, K. H. Brodersen, L. Kasper, M. Piccirelli, H. E. den Ouden, K. E. Stephan, Hierarchical prediction errors in midbrain and basal forebrain during sensory learning, Neuron 80 (2) (2013) 519–530.
36. R. C. Wilson, Y. Niv, Inferring relevance in a changing world, Frontiers in human neuroscience 5 (2012) 189.
37. D. Racey, M. E. Young, D. Garlick, J. N.-M. Pham, A. P. Blaisdell, Pigeon and human performance in a multi-armed bandit task in response to changes in variable interval schedules, Learning & behavior 39 (3) (2011) 245–258.
38. T. E. Behrens, M. W. Woolrich, M. E. Walton, M. F. Rushworth, Learning the value of information in an uncertain world, Nature neuroscience 10 (9) (2007) 1214–1221.
39. X. Lu, N. Adams, N. Kantas, On adaptive estimation for dynamic bernoulli bandits, arXiv preprint arXiv:1712.03134 (2017).
40. Y. Cao, W. Zheng, B. Kveton, Y. Xie, Nearly optimal adaptive procedure for piecewise-stationary bandit: a change-point detection approach, arXiv preprint arXiv:1802.03692 (2018).
41. R. Alami, O. Maillard, R. Féraud, Memory bandits: a bayesian approach for the switching bandit problem, 2017.
42. D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, Z. Wen, et al., A tutorial on thompson sampling, Foundations and Trends® in Machine Learning 11 (1) (2018) 1–96.
43. D. M. Roijers, L. M. Zintgraf, A. Nowé, Interactive thompson sampling for multi-objective multi-armed bandits, in: International Conference on Algorithmic DecisionTheory, Springer, 2017, pp. 18–34.
44. Y. Wang, J. Gittins, Bayesian bandits in clinical trials: Clinical trials, Sequential analysis 11 (4) (1992) 313–325.
45. V. Liakoni, A. Modirshanechi, W. Gerstner, J. Brea, Learning in volatile environments with the bayes factor surprise (2020).
46. D. Marković, S. J. Kiebel, Comparative analysis of behavioral models for adaptive learning in changing environments, Frontiers in Computational Neuroscience 10 (2016) 33.
47. F. Liu, J. Lee, N. Shroff, A change-detection based framework for piecewise-stationary multi-armed bandit problem, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
48. M. Steyvers, M. D. Lee, E.-J. Wagenmakers, A bayesian analysis of human decision-making on bandit problems, Journal of Mathematical Psychology 53 (3) (2009) 168–179.
49. A. Slivkins, et al., Introduction to multi-armed bandits, Foundations and Trends® in Machine Learning 12 (1-2) (2019) 1–286.
50. D. I. Mattos, J. Bosch, H. H. Olsson, Multi-armed bandits in the wild: pitfalls and strategies in online experiments, Information and Software Technology 113 (2019) 68–81.
51. E. Schulz, S. J. Gershman, The algorithmic architecture of exploration in the human brain, Current opinion in neurobiology 55 (2019) 7–14.
52. J. Gottlieb, P.-Y. Oudeyer, M. Lopes, A. Baranes, Information-seeking, curiosity, and attention: computational and neural mechanisms, Trends in cognitive sciences 17 (11) (2013) 585–593.
53. H. Stojić, J. L. Orquin, P. Dayan, R. J. Dolan, M. Speekenbrink, Uncertainty in learning, choice, and visual fixation, Proceedings of the National Academy of Sciences 117 (6) (2020) 3291–3300.
54. R. C. Wilson, A. Geana, J. M. White, E. A. Ludvig, J. D. Cohen, Humans use directed and random exploration to solve the explore–exploit dilemma., Journal of Experimental Psychology: General 143 (6) (2014) 2074.
55. P. B. Reverdy, V. Srivastava, N. E. Leonard, Modeling human decision making in generalized gaussian multiarmed bandits, Proceedings of the IEEE 102 (4) (2014) 544–571.
56. D. Acuna, P. Schrater, Bayesian modeling of human sequential decision-making on the multi-armed bandit problem, in: Proceedings of the 30th annual conference of the cognitive science society, Vol. 100, Washington, DC: Cognitive Science Society, 2008, pp. 200–300.
57. H. Stojić, E. Schulz, P. P. Analytis, M. Speekenbrink, It's new, but is it good? How generalization and uncertainty guide the exploration of novel options., Journal of Experimental Psychology: General (2020).
58. E. Schulz, E. Konstantinidis, M. Speekenbrink, Putting bandits into context: How function learning supports decision making., Journal of Experimental Psychology: Learning, Memory, and Cognition 44 (6) (2018) 927.
59. E. Schulz, N. T. Franklin, S. J. Gershman, Finding structure in multi-armed bandits, Cognitive Psychology 119 (2020) 101261.
60. L. Clark, R. Cools, T. Robbins, The neuropsychology of ventral prefrontal cortex: decision-making and reversal learning, Brain and cognition 55 (1) (2004) 41–53.
61. M. Guitart-Masip, L. Fuentemilla, D. R. Bach, Q. J. Huys, P. Dayan, R. J. Dolan, E. Duzel, Action dominates valence in anticipatory representations in the human striatum and dopaminergic midbrain, Journal of Neuroscience 31 (21) (2011) 7867–7875.
62. N. D. Daw, S. J. Gershman, B. Seymour, P. Dayan, R. J. Dolan, Model-based influences on humans’ choices and striatal prediction errors, Neuron 69 (6) (2011) 1204–1215.
63. A. Dezfouli, B. W. Balleine, Habits, action sequences and reinforcement learning, European Journal of Neuroscience 35 (7) (2012) 1036–1051.
64. W. C. Cheung, D. Simchi-Levi, R. Zhu, Hedging the drift: Learning to optimize under non-stationarity, arXiv preprint arXiv:1903.01461 (2019).
65. L. Besson, E. Kaufmann, The generalized likelihood ratio test meets KL-UCB: an improved algorithm for piece-wise non-stationary bandits, arXiv preprint arXiv:1902.01575 (2019).
66. A. Blum, Y. Monsour, Learning, regret minimization, and equilibria (2007).
67. V. Raj, S. Kalyani, Taming non-stationary bandits: A Bayesian approach, arXiv preprint arXiv:1707.09727 (2017).
68. A. Soltani, A. Izquierdo, Adaptive learning under expected and unexpected uncertainty, Nature Reviews Neuroscience 20 (10) (2019) 635–644.
69. R. Smith, K. Friston, C. Whyte, A step-by-step tutorial on active inference and its application to empirical data (2021).
70. H. Attias, Planning by probabilistic inference., in: AISTATS, Citeseer, 2003.
71. P. Schwartenbeck, K. Friston, Computational phenotyping in psychiatry: a worked example, ENeuro 3 (4) (2016).
72. R. Kaplan, K. J. Friston, Planning and navigation as active inference, Biological cybernetics 112 (4) (2018) 323–343.
73. J. Z. Kolter, A. Y. Ng, Near-bayesian exploration in polynomial time, in: Proceedings of the 26th annual international conference on machine learning, 2009, pp. 513–520.
74. K. Kandasamy, A. Krishnamurthy, J. Schneider, B. Póczos, Parallelised Bayesian optimisation via Thompson sampling, in: International Conference on Artificial Intelligence and Statistics, 2018, pp. 133–142.
75. J. Limanowski, K. Friston, Active inference under visuo-proprioceptive conflict: Simulation and empirical results, Scientific reports 10 (1) (2020) 1–14.
76. R. Smith, P. Schwartenbeck, J. L. Stewart, R. Kuplicki, H. Ekhtiari, M. Paulus, T. Investigators, et al., Imprecise action selection in substance use disorder: Evidence for active learning impairments when solving the explore-exploit dilemma (2020).
77. M. Cullen, B. Davey, K. J. Friston, R. J. Moran, Active inference in OpenAI Gym: a paradigm for computational investigations into psychiatric illness, Biological Psychiatry: Cognitive Neuroscience and Neuroimaging 3 (9) (2018) 809–818.
78. P. Schwartenbeck, T. H. FitzGerald, C. Mathys, R. Dolan, M. Kronbichler, K. Friston, Evidence for surprise minimization over value maximization in choice behavior, Scientific reports 5 (2015) 16575.
79. P. Schwartenbeck, T. H. FitzGerald, C. Mathys, R. Dolan, K. Friston, The dopaminergic midbrain encodes the expected certainty about desired outcomes, Cerebral cortex 25 (10) (2015) 3434–3445.
80. N. Chater, M. Oaksford, Ten years of the rational analysis of cognition, Trends in Cognitive Sciences 3 (2) (1999) 57–65.
81. D. Marković, T. Goschke, S. J. Kiebel, Meta-control of the exploration-exploitation dilemma emerges from probabilistic inference over a hierarchy of time scales, Cognitive, Affective, & Behavioral Neuroscience (2020) 1–25.
82. R. A. Adams, S. Shipp, K. J. Friston, Predictions not commands: active inference in the motor system, Brain Structure and Function 218 (3) (2013) 611–643.
83. G. Pezzulo, An active inference view of cognitive control, Frontiers in psychology 3 (2012) 478.
84. E. Kaufmann, On Bayesian index policies for sequential resource allocation, arXiv preprint arXiv:1601.01190 (2016).
85. D. Russo, B. Van Roy, An information-theoretic analysis of thompson sampling, The Journal of Machine Learning Research 17 (1) (2016) 2442–2471.
86. N. Sajid, P. J. Ball, K. J. Friston, Active inference: demystified and compared, arXiv preprint arXiv:1909.10863 (2019).
87. P. Piray, N. D. Daw, A simple model for learning in volatile environments, PLOS Computational Biology 16 (7) (2020) 1–26.
88. V. Moens, A. Zénon, Learning and forgetting using reinforced bayesian change detection, PLoS computational biology 15 (4) (2019) e1006713.