
# Satisficing in Time-Sensitive Bandit Learning

Daniel Russo Columbia University, djr2174@columbia.edu Benjamin Van Roy Stanford University, bvr@stanford.edu
###### Abstract

Much of the recent literature on bandit learning focuses on algorithms that aim to converge on an optimal action. One shortcoming is that this orientation does not account for time sensitivity, which can play a crucial role when learning an optimal action requires much more information than near-optimal ones. Indeed, popular approaches such as upper-confidence-bound methods and Thompson sampling can fare poorly in such situations. We consider instead learning a satisficing action, which is near-optimal while requiring less information, and propose satisficing Thompson sampling, an algorithm that serves this purpose. We establish a general bound on expected discounted regret and study the application of satisficing Thompson sampling to linear and infinite-armed bandits, demonstrating arbitrarily large benefits over Thompson sampling. We also discuss the relation between the notion of satisficing and the theory of rate distortion, which offers guidance on the selection of satisficing actions.

## 1 Introduction

Much of the recent literature on bandit learning focuses on algorithms that aim to converge on an optimal action. But does that make sense when learning an optimal action requires much more information than near-optimal ones? The following example illustrates the issue.

###### Example 1.

(Many-Armed Deterministic Bandit) Consider an action set . Each action results in reward . We refer to this as a deterministic bandit because the realized reward is determined by the action and not distorted by noise. The agent begins with a prior over each that is independent and uniform over and sequentially applies actions , selected by an algorithm that adapts decisions as rewards are observed. As grows, it takes longer to identify the optimal action . Indeed, for any algorithm, . Therefore, no algorithm can expect to select within time . On the other hand, by simply selecting actions in order, with , the agent can expect to identify an -optimal action within time periods, independent of .

It is disconcerting that popular algorithms perform poorly when specialized to this simple problem. Thompson sampling (TS) , for example, is likely to sample a new action in each time period so long as . The underlying issue is most pronounced in the asymptotic regime of , for which TS never repeats any action because, at any point in time, there will be actions better than those previously selected. A surprisingly simple modification offers dramatic improvement: settle for the first action for which . This alternative can be thought of as a variation of TS that aims to learn a satisficing action . We will refer to an algorithm that samples from the posterior distribution of a satisficing action instead of the optimal action, as satisficing Thompson sampling (STS).
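The contrast is easy to see in simulation. The following sketch (illustrative Python with hypothetical parameter choices: 200 arms, horizon 300, tolerance 0.05) implements both rules for the deterministic bandit: TS plays the argmax of a posterior sample, while STS plays the first action whose sampled reward reaches the satisficing threshold.

```python
import random

def simulate(policy, k=200, horizon=300, eps=0.05, seed=0):
    """Run TS or STS on a deterministic many-armed bandit.

    Arm rewards theta[a] are i.i.d. Uniform(0,1); pulling an arm reveals its
    reward exactly, so the posterior over a pulled arm collapses to a point
    while unpulled arms keep a Uniform(0,1) posterior.
    """
    rng = random.Random(seed)
    theta = [rng.random() for _ in range(k)]
    observed = {}                       # arm -> revealed reward
    total, tried = 0.0, set()
    for _ in range(horizon):
        if policy == "TS":
            # Posterior sample: fresh Uniform(0,1) draws for unobserved arms.
            sample = [observed.get(a, rng.random()) for a in range(k)]
            a = max(range(k), key=lambda i: sample[i])
        else:
            # STS: play the first arm whose posterior sample reaches 1 - eps.
            a = next((i for i in range(k)
                      if observed.get(i, rng.random()) >= 1 - eps), k - 1)
        observed[a] = theta[a]
        tried.add(a)
        total += theta[a]
    return total, len(tried)

ts_reward, ts_arms = simulate("TS")
sts_reward, sts_arms = simulate("STS")
```

In runs of this kind STS settles on a near-optimal arm after a short search, while TS keeps opening new arms whose fresh posterior samples beat the best observed reward.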

While stylized, the above example captures the essence of a basic dilemma faced in all decision problems and not adequately addressed by popular algorithms. The underlying issue is time preference. In particular, if an agent is only concerned about performance over an asymptotically long time horizon, it is reasonable to aim at learning , while this can be a bad idea if shorter term performance matters and a satisficing action can be learned more quickly. To model time preference and formalize benefits of STS, we will assess performance in terms of expected discounted regret, which for the many-armed deterministic bandit can be written as . The constant is a discount factor that conveys time preference. It is easy to show, as is done through Theorem 10 in the appendix, that in the asymptotic regime of , TS experiences expected discounted regret of , whereas that of STS is bounded above by . For close to , we have , and therefore STS vastly outperforms TS. In fact, as approaches , the ratio between expected regret of TS and that of STS goes to infinity. This stylized example demonstrates potential advantages of STS.

This paper develops a general framework for studying satisficing in sequential learning. Satisficing algorithms aim to learn a satisficing action . Building on the work of Russo and Van Roy , we will establish a general information theoretic regret bound, showing that any algorithm’s expected discounted regret relative to a satisficing action is bounded in terms of the mutual information between the model parameters and the satisficing action and a newly defined information ratio, which measures the cost of information acquired about the satisficing action. The mutual information can be thought of as the number of bits of information about required to identify , and the fact that the bound depends on this quantity instead of the entropy of , as does the bound of , allows it to capture the reduction of discounted regret made possible by settling for the satisficing action.

A natural and deep question concerns the choice of satisficing action and the limits of performance attainable via satisficing. An exploration of this question yields novel connections between sequential learning and rate-distortion theory. In Section 5, we define a natural rate-distortion function for Bayesian decision making, which captures the minimal information about a decision-maker must acquire in order to reach an -optimal decision. Combining this rate-distortion function with our general regret bound leads to new results and insights. As an example, while previous information-theoretic regret bounds for the linear bandit problem become vacuous in contexts with infinite action spaces, our rate-distortion function leads to a strong bound on expected discounted regret.

We will also study the many-armed bandit problem with noisy rewards. Here, we will consider a satisficing action and establish a bound on expected discounted regret that demonstrates benefits of STS over TS. We will also present computational results that offer further perspective on the advantages.

Many papers [10, 12, 4] have studied bandit problems with continuous action spaces, where it is also necessary to learn only approximately optimal actions. However, because these papers focus on the asymptotic growth rate of regret they implicitly emphasize later stages of learning, where the algorithm has already identified extremely high performing actions but exploration is needed to identify even better actions. Our discounted framework instead focuses on the initial cost of learning to attain good, but not perfect, performance. Recent papers [8, 9] study several heuristics for a discounted objective, though without an orientation toward formal regret analysis. The knowledge gradient algorithm of Ryzhov et al.  also takes time horizon into account and can learn suboptimal actions when it is not worthwhile to identify the optimal action. This algorithm tries to directly approximate the optimal Bayesian policy using a one-step lookahead heuristic, but unfortunately there are no performance guarantees for this method. Deshpande and Montanari  consider a linear bandit problem with dimension that is too large relative to the desired horizon. They propose an algorithm that limits exploration and learns something useful within this short time frame. Berry et al. , Wang et al.  and Bonald and Proutiere  study an infinite-armed bandit problem in which it is impossible to identify an optimal action and propose algorithms that minimize the asymptotic growth rate of regret. While we will instantiate our general regret bound for STS on the infinite-armed bandit problem, we use this example mostly to provide a simple analytic illustration. We hope that the flexibility of STS and our analysis framework allow this work to be applied to more complicated time-sensitive learning problems.

## 2 Problem Formulation.

An agent sequentially chooses actions from the action set and observes the corresponding outcomes . The agent associates a reward with each outcome . Let denote the reward corresponding to outcome . The outcome in period depends on the chosen action , idiosyncratic randomness associated with that time step, and a random variable that is fixed over time. Formally, there is a known system function , and an iid sequence of disturbances such that

 Yt=g(At,θ,Wt).

The disturbances are independent of , and have a known distribution. This is without loss of generality, as uncertainty about and the distribution of could be included in the definition of . From this, we can define

 μ(a,θ)=E[R(g(a,θ,Wt))|θ]

to be the expected reward of an action under parameter . Ours can be thought of as a Bayesian formulation, in which the distribution of represents the agent’s prior uncertainty about the true characteristics of the system, and conditioned on , the remaining randomness in represents idiosyncratic noise in observed outcomes.
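As a minimal illustration of this formulation, the sketch below assumes a hypothetical system function g(a, θ, w) = θ[a] + w with uniform noise and identity reward function, and estimates μ(a, θ) by Monte Carlo.

```python
import random

# Hypothetical system function for illustration: the outcome is the action's
# latent mean plus zero-mean noise, Y = g(a, theta, W) = theta[a] + W, and
# the reward R is the outcome itself.
def g(a, theta, w):
    return theta[a] + w

def mu_hat(a, theta, n=200_000, seed=1):
    """Monte Carlo estimate of mu(a, theta) = E[R(g(a, theta, W)) | theta]."""
    rng = random.Random(seed)
    return sum(g(a, theta, rng.uniform(-0.5, 0.5)) for _ in range(n)) / n

theta = [0.2, 0.7, 0.5]
est = mu_hat(1, theta)   # should be close to theta[1] = 0.7
```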

The history available when selecting action is . The agent selects actions according to a policy, which is a sequence of functions , each mapping a history and an exogenous random variable to an action, with for each . Throughout the paper, we use to denote some random variable that is independent of and the disturbances .

Let denote the supremal reward, and let denote the true optimal action, when this maximum exists. As a performance metric, we consider the expected discounted regret of a policy , defined by

 Regret(α,ψ)=Eψ[∞∑t=0αt(R∗−Rt)],

which measures a discounted sum of the expected performance gap between an omniscient policy, which always chooses the optimal action, and the policy which selects the actions . This deviates from the typical notion of expected regret in its dependence on a discount factor . Regular expected regret corresponds to the case of . Smaller values of convey time preference by weighting gaps in nearer-term performance more heavily than gaps in longer-term performance.

The definition above compares regret relative to the optimal action and corresponding reward . It is useful to also consider performance loss relative to a less stringent benchmark. We define the satisficing regret at level to be

 SRegret(α,ψ,D)=Eψ[∞∑t=0αt(R∗−D−Rt)].

This measures regret relative to an action that is near-optimal, in the sense that it yields expected reward , which is within of optimal. This notation was chosen due to the connection we develop with rate-distortion theory, where typically denotes a tolerable level of “distortion” in a lossy compression scheme. Of course, for all ,

 Regret(α,ψ)=SRegret(α,ψ,D)+D1−α

and so one can easily translate between bounds on regret and bounds on satisficing regret. However, directly studying satisficing regret helps focus our attention on the design of algorithms that purposefully avoid the search for exactly optimal behavior in order to limit exploration costs.
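The identity above is easy to verify numerically on a long truncated horizon; the snippet below uses arbitrary stand-in values for the per-period gaps.

```python
# Check Regret(alpha, psi) = SRegret(alpha, psi, D) + D/(1 - alpha) on a long
# truncated horizon.  The per-period gaps R* - R_t below are arbitrary
# nonnegative stand-in values, not the output of any particular policy.
alpha, D = 0.9, 0.1
gaps = [0.5, 0.3, 0.2, 0.0, 0.1] * 200          # stand-in for R* - R_t
regret = sum(alpha**t * g for t, g in enumerate(gaps))
sregret = sum(alpha**t * (g - D) for t, g in enumerate(gaps))
```

The two quantities differ by the geometric series of the constant D, whose truncation error after 1000 terms is negligible.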

Before beginning, let us first introduce some additional notation. We denote the entropy of a random variable by , the Kullback-Leibler divergence between probability distributions and by , and the mutual information between two random variables and by . We will frequently be interested in the conditional mutual information .

We sometimes denote by the expectation operator conditioned on the history up to time and similarly define . The definitions of entropy and mutual information depend on a base measure. We use and to denote entropy and mutual information when the base measure is the posterior distribution . For example, if and are discrete random variables taking values in sets and ,

 It(X;Z)=∑x∈X∑z∈ZPt(X=x,Z=z)log(Pt(X=x,Z=z)Pt(X=x)Pt(Z=z)).

Due to its dependence on the realized history , is a random variable. The standard definition of conditional mutual information integrates over this randomness, and in particular, .
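For discrete random variables, the mutual information above can be computed directly from a joint probability table, as in this small sketch.

```python
import math

def mutual_information(joint):
    """I(X;Z) in nats from a joint pmf given as {(x, z): probability}."""
    px, pz = {}, {}
    for (x, z), p in joint.items():
        px[x] = px.get(x, 0.0) + p      # marginal of X
        pz[z] = pz.get(z, 0.0) + p      # marginal of Z
    return sum(p * math.log(p / (px[x] * pz[z]))
               for (x, z), p in joint.items() if p > 0)

# Perfectly correlated bits: I(X;Z) = H(X) = log 2.
correlated = {(0, 0): 0.5, (1, 1): 0.5}
# Independent bits: I(X;Z) = 0.
independent = {(x, z): 0.25 for x in (0, 1) for z in (0, 1)}
```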

## 3 Satisficing Actions.

We will consider learning a satisficing action instead of an optimal action . The idea is to target a satisficing action that is near-optimal yet easy to learn. The information about required to learn an action is captured by , while the performance loss is , where . For to be easy to learn relative to , we want . We will motivate this abstract notion through several examples.

Our first example addresses the many-armed deterministic bandit, as discussed in Section 1.

###### Example 2 (first satisfactory arm).

Consider the infinite-armed deterministic bandit of Section 1. For this problem, the prior of is uniformly distributed across a large number of actions, and . Consider a satisficing action , which represents the first action that attains reward within of the optimum . As , . But remains finite, as in this limit converges weakly to a geometric random variable, with .
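The contrast in this example can be made concrete: the entropy of a geometric first-hitting index stays bounded as the number of actions grows, while the entropy of the optimal action, log k, diverges. The sketch below uses the closed-form geometric entropy with a hypothetical tolerance of 0.05, so that the per-arm success probability delta equals the tolerance under the uniform prior.

```python
import math

def geometric_entropy(delta):
    """Entropy (nats) of N ~ Geometric(delta), P(N=k) = delta*(1-delta)**(k-1)."""
    return (-(1 - delta) * math.log(1 - delta) - delta * math.log(delta)) / delta

eps = 0.05            # hypothetical satisficing tolerance
delta = eps           # P(theta_a >= 1 - eps) under a Uniform(0,1) prior
h_satisficing = geometric_entropy(delta)   # H(N), finite as k grows
h_opt_1k_arms = math.log(10**3)            # H(A*) = log k for k = 10^3 arms
h_opt_1b_arms = math.log(10**9)            # ... and for k = 10^9 arms
```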

The next example involves reducing the granularity of a discretized action space.

###### Example 3 (discretization).

Consider a linear bandit with and for an unknown vector . Suppose that and consists of vectors spread out uniformly along the boundary of the -dimensional unit sphere . The optimal action is then uniformly distributed over , and therefore . As , it takes an enormous amount of information to exactly identify . The results of  become vacuous in this limit. Consider a satisficing action that represents a coarser version of . In particular, for , let consist of vectors spread out uniformly along the boundary of the -dimensional unit sphere, with chosen such that for each element of there is a close approximation in . Let . This can be viewed as a form of lossy compression, for which while remains small.

In the previous two examples, determined , and therefore . We now consider an example in which is controlled by randomly perturbing the satisficing action. Here, can be small even though is large.

###### Example 4 (random perturbation).

Consider again the linear bandit from the previous example. An alternative satisficing action results from optimizing a perturbed objective where . Since is not observable, it is not possible in this case to literally learn . Instead, we consider learning to behave in a manner indistinguishable from . The variance of is chosen such that . Moreover, it can be shown that and therefore, the information required about is bounded independently of the number of actions.

## 4 A General Regret Bound.

This section provides a general discounted regret bound and a new information-theoretic analysis technique. The first subsection introduces an alternative to the information ratio of Russo and Van Roy , which is more appropriate for time-sensitive online learning problems. The following subsection establishes a general discounted regret bound in terms of this information ratio.

### 4.1 A New Information Ratio.

First, we make a simplification to the information ratio defined by Russo and Van Roy . That expression depends on the history and hence is a random variable. In this paper, we observe that this can be avoided, and instead take as a starting point a simplified form of the information ratio that integrates out all randomness. In particular, we study

 E[R∗−Rt]2I(A∗;(At,Yt)∣Ht−1). (1)

Uniform bounds on the information ratio of the type established in past work [3, 15, 11] imply those on (1). Precisely, if is bounded by almost surely (i.e., for any history ), then (1) is bounded by since

 E[R∗−Rt]2≤E[Et[R∗−Rt]2]≤E[λIt(A∗;(At,Yt))]=λI(A∗;(At,Yt)|Ht−1).

A more important change comes from measuring information about a benchmark action , which could be defined as in the examples in the previous section, rather than with respect to the optimal action . For a benchmark action we consider the single period information ratio

 E[~R−Rt]2I(~A;(At,Yt)∣Ht−1)

where . This ratio relates the current shortfall in performance relative to the benchmark action to the amount of information acquired about the benchmark action. We study the discounted average of these single period information ratios, defined for any policy as

 Γ(~A,ψ)=(1−α2)∞∑t=0α2t(E[~R−Rt]2I(~A;(At,Yt)|Ht−1)), (2)

where the actions are chosen under . The square in the discount factor is consistent with the problem’s original discount rate, since .

### 4.2 General Regret Bound.

The following theorem bounds the expected discounted regret of any algorithm, or policy, in terms of the information ratio (2).

###### Theorem 1.

For any policy , any and any where is independent of the disturbances , if , then

 SRegret(α,ψ,D)≤ ⎷Γ(~A,ψ)I(~A;θ)1−α2.
###### Proof.

We first show that the mutual information between and bounds the expected accumulation of mutual-information between and observations . By the chain rule for mutual information, for any ,

 T∑t=0I(~A;(At,Yt)∣Ht−1) = T∑t=0I(~A;(At,Yt)∣A0,Y0,…,At−1,Yt−1) = I(~A;HT) ≤ I(~A;(θ,HT)) = I(~A;θ)+I(~A;HT|θ) = I(~A;θ)

where the final equality uses that, conditioned on , is independent of . Taking the limit as implies

 ∞∑t=0I(~A;(At,Yt)∣Ht−1)≤I(~A;θ),

where the infinite series is assured to converge by the non-negativity of mutual information. Now, let

 Γt≡E[~R−Rt]2I(~A;(At,Yt)|Ht−1)

denote the information ratio at time under the benchmark action and actions chosen according to . Then

 SRegret(α,ψ,D)=E[∞∑t=0αt(R∗−D−Rt)] ≤ ∞∑t=0αtE[~R−Rt] = ∞∑t=0√α2tΓt√I(~A;(At,Yt)|Ht−1) ≤  ⎷∞∑t=0α2tΓt ⎷∞∑t=0I(~A;(At,Yt)|Ht−1) ≤  ⎷[∞∑t=0α2tΓt]√I(~A;θ) =  ⎷Γ(~A,ψ)I(~A;θ)1−α2,

where the first inequality uses the theorem's assumption on , the second follows from the Cauchy-Schwarz inequality, and the third was established earlier in this proof. ∎

An immediate consequence of this bound on satisficing regret is the discounted regret bound

 Regret(α,ψ)≤E[R∗−~R]1−α+ ⎷Γ(~A,ψ)I(~A;θ)1−α2. (3)

This bound decomposes regret into the sum of two terms; one which captures the discounted performance shortfall of the benchmark action relative to , and one which bounds the additional regret incurred while learning to identify . Breaking things down further, the mutual information measures how much information the decision-maker must acquire in order to implement the action , and the information ratio measures the regret incurred in gathering this information. It is worth highlighting that for any given action process, this bound holds simultaneously for all possible choices of , and in particular, it holds for the minimizing the right hand side of (3).

## 5 Connections With Rate Distortion Theory.

This section considers the optimal choice of satisfactory action and develops connections with the theory of rate-distortion in information theory. We construct a natural rate-distortion function for Bayesian decision making in the next subsection. Subsection 5.2 then develops a general regret bound that depends on this rate-distortion function.

### 5.1 A Rate Distortion Function for Bayesian Decision Making.

In information theory, the entropy of a source characterizes the length of an optimal lossless encoding. The celebrated rate-distortion theory [6, Chapter 10] characterizes the number of bits required for an encoding to be close in some loss metric. This theory resolves when it is possible to derive a satisfactory lossy compression scheme while transmitting far less information than required for a lossless compression. The rate-distortion function for a random variable with domain with respect to a loss function is

 R(D)= min I(^X;X) s.t. E[ℓ(^X,X)]≤D

where the minimum is taken over the choice of random variables with domain , and denotes the mutual information between and . One can view this optimization problem as specifying a conditional distribution that minimizes the information used about among all choices incurring average loss less than .

We will explore a powerful link with sequential Bayesian decision-making, where the rate-distortion function characterizes the minimal amount of new information the decision-maker must gather in order to make a satisfactory decision. Typically (5.1) is applied in the context of representing as closely as possible by , and the loss function is taken to be something like the squared distance or total variation distance between the two. For our purposes, we replace with , and with a benchmark action . The interpretation is that is a function of the unknown parameter and some exogenous randomness that offers a similar reward to playing but hopefully can be identified using much less information about . We specify a loss function measuring the single-period regret from playing under :

 ℓ(a,θ)=maxa′∈Aμ(a′,θ)−μ(a,θ).

As a result

 E[ℓ(~A,θ)]=E[R∗−μ(~A,θ)].

We come to the rate-distortion function

 R(D):= min I(~A;θ) s.t. E[R∗−μ(~A,θ)]≤D.

As before, the minimum above is taken over the choice of random variable taking values in . That is, the minimum is taken over all conditional probability distributions specifying a distribution over actions as a function of . Since the choice is always feasible, for all

 R(D)≤I(A∗;θ)=H(A∗)

where denotes the entropy of the optimal action. Rate distortion is never larger than entropy, but it may be small even if the entropy of is infinite.
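Points on a rate-distortion curve of this kind can be computed numerically with the classical Blahut-Arimoto algorithm. The sketch below uses a toy two-parameter, three-action instance invented for illustration: a large Lagrange multiplier s recovers the lossless rate log 2 at zero distortion, while a small s settles on the low-regret "satisficing" action at rate near zero.

```python
import math

def blahut_arimoto(p_theta, distortion, s, iters=500):
    """One point on the rate-distortion curve via Blahut-Arimoto.

    p_theta[i] is the prior over parameter values, distortion[i][a] is the
    regret of action a under parameter i, and s > 0 is the Lagrange
    multiplier trading rate against distortion.  Returns the achieved
    (rate_in_nats, expected_distortion).
    """
    n, m = len(p_theta), len(distortion[0])
    q = [1.0 / m] * m                            # marginal over actions
    for _ in range(iters):
        # Update the conditional q(a | theta_i) given the current marginal.
        cond = []
        for i in range(n):
            row = [q[a] * math.exp(-s * distortion[i][a]) for a in range(m)]
            z = sum(row)
            cond.append([r / z for r in row])
        # Update the marginal q(a) given the conditional.
        q = [sum(p_theta[i] * cond[i][a] for i in range(n)) for a in range(m)]
    rate = sum(p_theta[i] * cond[i][a] * math.log(cond[i][a] / q[a])
               for i in range(n) for a in range(m) if cond[i][a] > 0)
    dist = sum(p_theta[i] * cond[i][a] * distortion[i][a]
               for i in range(n) for a in range(m))
    return rate, dist

# Toy instance: two equally likely parameters, three actions; action 2 is a
# "satisficing" action with regret 0.1 under both parameters.
mu_table = [[1.0, 0.0, 0.9],    # mu(a, theta_0)
            [0.0, 1.0, 0.9]]    # mu(a, theta_1)
dist = [[max(row) - v for v in row] for row in mu_table]
rate_tight, d_tight = blahut_arimoto([0.5, 0.5], dist, s=50.0)  # near-lossless
rate_loose, d_loose = blahut_arimoto([0.5, 0.5], dist, s=0.5)   # satisficing
```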

The following, somewhat artificial, example explicitly links communication with decision-making and may help clarify the role of the rate-distortion function .

###### Example 5.

A military command center waits to hear from an outpost before issuing orders. The outpost, stationed close to the conflict, determines its message based on a wealth of nuanced information – at the level of readouts from weather sensors and full transcripts of intercepted enemy communication. The command post could make very complicated decisions as a function of the detailed information it receives, with the possibility of specifying commands at the level of individual troops and equipment. How much must decision quality degrade if decisions are based only on coarser information? At an intuitive level, the outpost only needs to communicate surprising revelations that are important to reaching a satisfactory decision. As a result, our answer can depend in a complicated way on the extent to which the outpost’s observations are predictable a priori and the extent to which decision quality is reliant on this information. The rate-distortion function precisely quantifies these effects.

To map this problem onto our formulation of the rate-distortion function, take to consist of all information observed by the outpost, to be the order issued by the command center, and the rewards to indicate whether the orders led to a successful outcome. The mutual information captures the average amount of information the outpost must send in order for to be implemented. The goal is to develop a plan for placing orders that requires minimal communication from the outpost among all plans that degrade the chance of success by no more than .

### 5.2 Uniformly Bounded Information Ratios.

The general regret bound in Theorem 1 has a superficial relationship to the rate-distortion function through its dependence on the mutual information between the benchmark action and the true parameter . Indeed, for a benchmark action attaining the rate-distortion limit, and we attain a regret bound that depends explicitly on the rate-distortion level. However, the information ratio also depends on the choice of benchmark action, and may be infinite for a poor choice.

This second dependence on does not appear in rate-distortion theory, and reflects a fundamental distinction between communication problems and sequential learning problems. Indeed, a key feature enabling the sharp results of rate-distortion theory is that no bit of information is more costly to send and receive than others; the question is to what extent useful communication is possible while transmitting many fewer bits of information on average. By contrast, sequential learning agents must explore to uncover information, and the cost per unit of information uncovered may vary widely depending on which pieces of information are sought. This is accounted for by the information ratio , which roughly captures the expected cost, in terms of regret incurred, per bit of information acquired about the benchmark action.

Despite this, regret bounds in terms of rate-distortion apply in many important cases. Theorem 2, which is an immediate consequence of Theorem 1, provides a general bound of this form. Roughly, the uniform information ratio in the theorem reflects something about the quality of the feedback the agent receives when exploring; it means that for any choice of benchmark action there is a sequential learning strategy that learns about it with cost per bit of information less than . The next section applies this result to online linear optimization, where several uniform information ratio bounds are possible depending on the problem's precise feedback structure.

###### Theorem 2.

Suppose there is a uniform bound on the information ratio

 ΓU:=sup~AinfψΓ(~A,ψ)<∞.

Then, for any there exists a policy under which

 SRegret(α,ψ,D)≤√ΓUR(D)1−α2.

## 6 Application to Online Linear Optimization.

Consider a special case of our formulation: the problem of learning to solve a linear optimization problem. Precisely, suppose expected rewards follow the linear model where , , and almost surely. We will consider several natural forms of feedback the decision-maker may receive.

In each case, uniform bounds on the information ratio hold for satisficing Thompson sampling. More precisely, for any let denote the strategy that randomly samples an action at each time by probability matching with respect to , i.e. . Applying the same proofs as in  yields bounds of the form , where depends on the problem’s feedback structure but not the choice of benchmark action. Now, let us choose to attain the rate distortion limit (5.1), so and . We denote by satisficing Thompson sampling with respect to this choice of satisfactory action.

Full Information. Suppose for a random vector with . This is an extreme point of our formulation, where all information is revealed without active exploration. For all , the information ratio is bounded as and hence

 SRegret(α,ψSTSD,D)≤√R(D)2(1−α2).

Bandit Feedback. Suppose the agent only observes the reward of the action she chooses (). This is the so-called linear bandit problem. For all , the information ratio is bounded as . This gives the regret bound

 SRegret(α,ψSTSD,D)≤√R(D)p2(1−α2).

Semi-Bandit Feedback. Assume again that for all . Take the action set to consist of binary vectors where for all . Upon playing action , the agent observes for every component that was active in . We make the additional assumption that the components of are independent conditioned on . Then, for all , the information ratio is bounded as and hence

 SRegret(α,ψSTSD,D)≤√R(D)(p/m)2(1−α2).

By following the appendix of , each of these results can be extended gracefully to settings where noise distributions are sub-Gaussian. For example, suppose follows a multivariate Gaussian distribution, and the reward at time is where is a zero-mean Gaussian random variable. Then, if the variance of rewards is bounded by some for all , the previous bounds on the information ratio scale by a factor of .

The next theorem considers an explicit choice of satisfactory action . This yields a computationally efficient version of STS as well as explicit upper bounds on the rate-distortion function. As above, consider the case where follows a multivariate Gaussian prior and reward noise is Gaussian. We study the optimizer of a randomly perturbed objective. The small perturbation controls the mutual information between and without substantially degrading decision quality. It is easy to implement probability matching with respect to whenever linear optimization problems over are efficiently solvable. In particular, if and denote the posterior mean and covariance matrix, which are efficiently computable using Kalman filtering, then by sampling and and setting one has .
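A minimal sketch of this scheme for a finite action set, assuming the conjugate Gaussian (Kalman) posterior updates described above; the dimension, number of actions, noise level, and perturbation scale below are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_actions, noise_sd, beta = 3, 50, 0.5, 0.3   # illustrative sizes

# Finite action set on the unit sphere; true theta drawn from the N(0, I) prior.
A = rng.normal(size=(n_actions, p))
A /= np.linalg.norm(A, axis=1, keepdims=True)
theta = rng.normal(size=p)

mu, Sigma = np.zeros(p), np.eye(p)               # Gaussian posterior over theta
for t in range(200):
    # Probability matching w.r.t. the perturbed optimum: sample theta_hat
    # from the posterior, add a perturbation xi ~ N(0, beta^2 I) (the prior
    # is isotropic here), and act greedily on the perturbed sample.
    theta_hat = rng.multivariate_normal(mu, Sigma)
    xi = rng.normal(scale=beta, size=p)
    a = A[np.argmax(A @ (theta_hat + xi))]
    y = a @ theta + rng.normal(scale=noise_sd)   # noisy linear reward
    # Rank-one Kalman (conjugate Gaussian) update of the posterior.
    Sa = Sigma @ a
    gain = Sa / (a @ Sa + noise_sd**2)
    mu = mu + gain * (y - a @ mu)
    Sigma = Sigma - np.outer(gain, Sa)

best = np.max(A @ theta)
final_gap = best - A[np.argmax(A @ mu)] @ theta  # regret of the greedy action
```

The perturbation keeps exploration alive near the optimum without requiring the posterior to resolve directions that barely affect reward.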

The result in the next theorem assumes the action set is contained within an ellipsoid and the resulting bound displays a logarithmic dependence on the eigenvalues of . Precisely, note that the trace of the matrix , or sum of its eigenvalues, provides one natural measure of the size of the ellipsoid. Our result also depends on the covariance matrix through the . To understand this, consider applying similarity transforms to the parameter and action vectors so that is isotropic and the set of action vectors is . This transformed action space is contained in the ellipsoid , where . Then provides a measure of the size of this ellipsoid.

###### Theorem 3.

Suppose is a compact subset of the ellipsoid for some real symmetric matrix and suppose follows a –dimensional multivariate Gaussian distribution. Set

 ~A=argmaxa∈A⟨a,θ+ξ⟩

where is independent of and . For ,

 E[⟨θ,~A⟩]≥E[⟨θ,A∗⟩]−D

and

 R(D)≤I(~A;θ)≤p2log(1+Trace(QΣ)D2).
###### Proof.

By Jensen’s inequality

 E[⟨~A,θ+ξ⟩]=E[maxa∈A⟨a,θ+ξ⟩]≥E[maxa∈A⟨a,θ⟩]=E[⟨A∗,θ⟩].

This implies

 E[⟨A∗,θ⟩]−E[⟨~A,θ⟩]≤E[⟨~A,ξ⟩]≤E[maxa∈A⟨a,ξ⟩]≤E[maxx:∥x∥Q−1≤1⟨x,ξ⟩]=E[∥ξ∥Q]

where the final equality uses the explicit formula for the maximum of a linear function over an ellipsoid. Now,

 E[∥ξ∥Q]≤√E[ξ⊤Qξ]=√E[Trace(ξ⊤Qξ)]=√Trace(QE[ξξ⊤])=√β2Trace(QΣ)=D.

Next we derive the bound on mutual information. We have

 I(~A;θ)≤I(θ+ξ;θ) = H(θ+ξ)−H(θ+ξ|θ) = H(θ+ξ)−H(ξ) = 12log(det(Σ+β2Σ)det(β2Σ)) = 12log(det((1+β2)Σ)det(β2Σ)) = p2log(1+1β2) = p2log(1+Trace(QΣ)D2)

where denotes the determinant of a matrix. Here the first inequality uses the data processing inequality, the third equality uses the explicit form of the entropy of a multivariate Gaussian () and the penultimate equality uses that for any scalar . ∎
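The determinant step in the proof reduces to a one-line numerical check, using det(cΣ) = c^p det(Σ) for a p × p matrix (the determinant of Σ cancels).

```python
import math

# Check: (1/2) log[det((1+b2) Sigma) / det(b2 Sigma)] = (p/2) log(1 + 1/b2),
# using det(c * Sigma) = c**p * det(Sigma); det(Sigma) cancels from the ratio.
p, b2, det_sigma = 4, 0.25, 3.7          # any positive det(Sigma) works
lhs = 0.5 * math.log(((1 + b2) ** p * det_sigma) / (b2 ** p * det_sigma))
rhs = (p / 2) * math.log(1 + 1 / b2)
```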

## 7 Application to the Infinite-Armed Bandit.

This section considers a generalization of the deterministic infinite-armed bandit problem in the introduction that allows for noisy observations and non-uniform priors. The action space is . We assume almost surely and , meaning the agent only observes rewards. The mean reward of action is where the prior distribution of is independent with support . We will study this problem by specializing our general framework and results.

### 7.1 STS for the Infinite-Armed Bandit Problem.

We consider the simple satisficing action defined in the introduction: . Rather than continue to explore until identifying the optimal action , we will settle for the first action known to yield reward within of optimal. (There is nothing crucial about this ordering on actions: we can equivalently construct a randomized order in which actions are sampled; for each realization of the random variable , let be a permutation, and take .)

We study satisficing Thompson sampling where actions are selected by probability matching with respect to . Note that an algorithm for this problem must decide whether to sample a previously tested action – and if so which one to sample – or whether to try out an entirely new action. Let denote the set of previously sampled actions. STS may sample an untested action , and does so with probability

 P(At∉At|Ht−1)=P(~A∉At|Ht−1),

equal to the posterior probability that no satisfactory action has yet been sampled. The remainder of the action probabilities are allocated among previously tested actions, with

 P(At=a|Ht−1)=P(~A=a|Ht−1)∀a∈At.

There is a simple algorithmic implementation of STS that mirrors computationally efficient implementations of Thompson sampling (TS). At time , TS selects a random action via probability matching with respect to . Algorithmically, this is usually accomplished by first sampling and solving for . Similarly, we can implement STS by sampling and approximately optimizing a posterior sample. In each period, STS selects an action as follows:

1. For each , sample

2. Let

3. If is not null set . Otherwise, play an untested action .

Note that is supplied to the algorithm as a tolerance parameter. When , STS is equivalent to TS. Otherwise, STS gives preference to previously selected actions, which can yield substantial benefit in the face of time preference.
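A sketch of this loop for Bernoulli rewards with Uniform(0,1) priors (so tested arms carry Beta posteriors) follows. The horizon, tolerance, and arm distribution are illustrative choices, and opening untested arms in a fixed order relies on the exchangeability noted earlier.

```python
import random

def sts_infinite_armed(horizon=300, eps=0.1, seed=0):
    """Sketch of STS for a many-armed bandit with Bernoulli rewards.

    Arm means are i.i.d. Uniform(0,1); a tested arm with s successes and
    f failures has posterior Beta(1 + s, 1 + f).  Each period, sample from
    the posterior of each tested arm in order and play the first whose
    sample reaches 1 - eps; if none qualifies, open a brand-new arm.
    """
    rng = random.Random(seed)
    means, succ, fail = [], [], []
    total = 0.0
    for _ in range(horizon):
        choice = None
        for a in range(len(means)):
            if rng.betavariate(1 + succ[a], 1 + fail[a]) >= 1 - eps:
                choice = a
                break
        if choice is None:                       # open an untested arm
            means.append(rng.random())
            succ.append(0)
            fail.append(0)
            choice = len(means) - 1
        if rng.random() < means[choice]:         # Bernoulli reward
            succ[choice] += 1
        else:
            fail[choice] += 1
        total += means[choice]                   # track expected reward
    return total / horizon, len(means)

avg_reward, arms_opened = sts_infinite_armed()
```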

### 7.2 Information Ratio Analysis of the Infinite-Armed Bandit.

The following theorem provides a discounted regret bound for STS in the infinite-armed bandit. The result follows from bounding the problem's information ratio and the mutual information and applying the general regret bound of Theorem 1. This requires substantial additional analysis, the details of which are deferred until Subsection 7.4.

###### Theorem 4.

Consider the infinite-armed bandit with noisy observations, and let . Denote the STS policy with respect to by . Then,

 I(~A;θ)≤1+log(1/δ)andΓ(~A,ψSTSD)≤6+4/δ+(2/δ)log(11−α2)

where . Together with Theorem 1 this implies

 SRegret(α,ψSTSD,D) ≤  ⎷(6+4/δ+(2/δ)log(11−α2))(1+log(1/δ))1−α2 = ~O(√1/δ1−α2).

### 7.3 Computational Examples.

We close with a simple computational illustration of the potential benefits afforded by STS. We consider two many-armed bandit problems, and demonstrate that per-period regret of STS diminishes much more rapidly than that of TS over early time periods.

We consider problems with 250 actions, where the mean reward associated with each action is independently sampled uniformly from . We first consider the many-armed deterministic bandit, for which there is no observation noise. Figure 1(a) presents per-period regret of TS and STS over 500 time periods, averaged over 5000 simulations, each with an independently sampled problem instance. STS is applied with tolerance parameter 0.05. We next consider incorporating observation noise. In particular, instead of observing , after selecting an action , we observe a binary reward that is one with probability . Figure 1(b) displays the results of this experiment.

Figure 1: TS versus STS for the (a) many-armed deterministic bandit and (b) many-armed bandit with observation noise.

### 7.4 Proof of Theorem 4.

Our proof will use the following fact, which is stated as Fact 9 in .

###### Fact 1.

For any distributions and such that is absolutely continuous with respect to , any random variable and any such that ,

 EP[g(X)]−EQ[g(X)]≤√12D(P||Q),

where and denote the expectation operators under and .

We begin by showing the mutual information bound stated as part of Theorem 4.

###### Lemma 5 (Mutual Information Bound).

Let . Then

 I(~A;θ)≤1+log(1/δ).
###### Proof.

Since is a deterministic function of , we have where is a geometric random variable. This implies

 I(~A;θ) = H(N) = −∞∑k=1δ(1−δ)k−1log(δ(1−δ)k−1) = −∞∑k=1δ(1−δ)k−1log(δ)−∞∑k=1δ(1−δ)k−1log((1−δ)k−1) = ∞∑k=1P(N=k)log(1/δ)−log(1−δ)∞∑k=1δ(1−δ)k−1(k−1) = log(1/δ)+log(11−δ)(E[N]−1) = log(1/δ)+log(1+δ1−δ)(1−δδ) ≤ log(1/δ)+(δ1−δ)(1−δδ) = 1+log(1/δ).
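The closed form derived above can be checked against a direct summation of the entropy series (computed in log space to avoid underflow errors for large k).

```python
import math

def h_geometric_direct(delta, terms=2000):
    """Directly sum -sum_k P(N=k) log P(N=k) for N ~ Geometric(delta)."""
    h = 0.0
    for k in range(1, terms + 1):
        log_p = math.log(delta) + (k - 1) * math.log(1 - delta)
        h -= math.exp(log_p) * log_p
    return h

delta = 0.2
# Closed form from the chain above: H(N) = log(1/d) + log(1/(1-d)) * (E[N]-1),
# with E[N] - 1 = (1 - d)/d.
closed_form = (math.log(1 / delta)
               + math.log(1 / (1 - delta)) * (1 - delta) / delta)
bound = 1 + math.log(1 / delta)
```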

Throughout the remainder of the proof we use the notation

 Γt=Et[θ~A−θAt]2It(~A;(At,Yt)).

This represents the expected one-step information ratio in period under the posterior measure . The next lemma shows that the cumulative information ratio can be bounded by the expected discounted average of these one-step information ratios.

###### Lemma 6 (Relating the information ratio to the one-step-information-ratio).
 Γ(~A,ψSTSD)≤(1−α2)∞∑t=0α2tE[Γt]
###### Proof.

We have

 E[~R−Rt]=E[θ~A−θAt] = E[Et[θ~A−θAt]] =