# Randomized Exploration for Non-Stationary Stochastic Linear Bandits

Baekjin Kim and Ambuj Tewari
{baekjin,tewaria}@umich.edu

## Abstract

We investigate two perturbation approaches to overcome the conservatism that optimism-based algorithms chronically suffer from in practice. The first approach replaces optimism with a simple randomization when using confidence sets. The second one adds random perturbations to its current estimate before maximizing the expected reward. For non-stationary linear bandits, where each action is associated with a $d$-dimensional feature and the unknown parameter is time-varying with total variation $B_T$, we propose two randomized algorithms, Discounted Randomized LinUCB (D-RandLinUCB) and Discounted Linear Thompson Sampling (D-LinTS), via the two perturbation approaches. We highlight the trade-off between statistical optimality and computational efficiency: the former asymptotically achieves the optimal dynamic regret, whereas the latter is oracle-efficient but pays an extra logarithmic factor in the number of arms compared to the minimax-optimal dynamic regret. In a simulation study, both algorithms show outstanding performance in tackling the conservatism issue that Discounted LinUCB (D-LinUCB) struggles with.

## 1 Introduction

A multi-armed bandit is the simplest model of decision making that involves the exploration versus exploitation trade-off [Lai and Robbins, 1985]. Linear bandits are an extension of multi-armed bandits where the reward has a linear structure with a finite-dimensional feature associated with each arm [Abe et al., 2003, Dani et al., 2008]. Two standard exploration strategies in stochastic linear bandits are the Upper Confidence Bound algorithm (LinUCB) [Abbasi-Yadkori et al., 2011] and Linear Thompson Sampling (LinTS) [Agrawal and Goyal, 2013]. The former relies on optimism in the face of uncertainty and is a deterministic algorithm built upon the construction of a high-probability confidence ellipsoid for the unknown parameter vector. The latter is a Bayesian solution that maximizes the expected rewards according to a parameter sampled from the posterior distribution. Chapelle and Li [2011] showed that Linear Thompson Sampling empirically performs better and is more robust to corrupted or delayed feedback than LinUCB. From a theoretical perspective, it enjoys a regret bound that is a factor of $\sqrt{d}$ worse than the minimax-optimal regret bound that LinUCB enjoys. However, the minimax optimality of optimism comes at a cost: implementing UCB-type algorithms can lead to NP-hard optimization problems even for convex action sets [Agrawal, 2019].

Random perturbation methods were originally proposed in the 1950s by Hannan [1957] in the full information setting where losses of all actions are observed. Kalai and Vempala [2005] showed that Hannan’s perturbation approach leads to efficient algorithms by making repeated calls to an offline optimization oracle. They also gave a new name to this family of randomized algorithms: Follow the Perturbed Leader (FTPL). Recent work [Abernethy et al., 2014, 2015, Kim and Tewari, 2019] has studied the relationship between FTPL algorithms and Follow the Regularized Leader (FTRL) algorithms and also investigated whether FTPL algorithms achieve minimax-optimal regret in both full and partial information settings.

Abeille and Lazaric [2017] viewed Linear Thompson Sampling as a perturbation-based algorithm, characterized a family of perturbations whose regrets can be analyzed, and raised the open problem of finding a minimax-optimal perturbation. In addition to its significant role in balancing exploration with exploitation, a perturbation-based approach to linear bandits also reduces the problem to one call to the offline optimization oracle in each round. Recent works [Kveton et al., 2019a, b] have proposed randomized algorithms that use perturbation as a means to achieve oracle-efficient computation as well as a better theoretical guarantee than LinTS, but there is still a gap between their regret bounds and the lower bound of $\Omega(d\sqrt{T})$. This gap is logarithmic in the number of actions, which can introduce extra dependence on the dimension for large or infinite action spaces.

A new randomized exploration scheme is proposed in the recent work of Vaswani et al. [2019]. In contrast to Hannan’s perturbation approach that injects perturbation directly into an estimate, it replaces optimism with random perturbation when using confidence sets for action selection in optimism-based algorithms. This scheme can be broadly applied to multi-armed bandit and structured bandit problems and resulting algorithms are theoretically optimal and empirically perform well since overall conservatism of optimism-based algorithms can be tackled by randomly sampled confidence level.

Linear bandit problems were originally motivated by applications such as online ad placement with features extracted from the ads and website users. However, users’ preferences often evolve with time, which leads to interest in the non-stationary variant of linear bandits. Accordingly, adaptive algorithms that accommodate time-variation of environments have been studied in a rich line of work in both multi-armed bandits [Besbes et al., 2014] and linear bandits. With prior information on the total variation budget, SW-LinUCB [Cheung et al., 2019] and D-LinUCB [Russac et al., 2019] were constructed on the basis of the optimism in the face of uncertainty principle via sliding window and exponential discounting weights, respectively. Luo et al. [2017] and Chen et al. [2019] studied fully adaptive and oracle-efficient algorithms assuming access to an optimization oracle when the total variation is unknown to the learner. It is still an open problem to design a practically simple, oracle-efficient and statistically optimal algorithm for non-stationary linear bandits.

### 1.1 Contribution

In Section 2, we explicate, in the simpler stationary setting, the role of two perturbation approaches in overcoming conservatism that UCB-type algorithms chronically suffer from in practice. In one approach, we replace optimism with a simple randomization when using confidence sets. In the other, we add random perturbations to the current estimate before maximizing the expected reward. These two approaches result in Randomized LinUCB and Gaussian Linear Thompson Sampling for stationary linear bandits, respectively. We highlight the statistical optimality versus oracle efficiency trade-off between them.

In Section 3, we study the non-stationary environment and present two randomized algorithms with exponential discounting weights, Discounted Randomized LinUCB (D-RandLinUCB) and Discounted Linear Thompson Sampling (D-LinTS), which gracefully adjust to the time-variation in the true parameter. We explain the trade-off between statistical optimality and oracle efficiency: the former asymptotically achieves the optimal dynamic regret, whereas the latter is computationally efficient for large or infinite action sets because it relies solely on an offline optimization oracle, at the cost of an extra gap in its dynamic regret bound.

In Section 4, we run multiple simulation studies based on Criteo live traffic data [Diemert et al., 2017] to evaluate the empirical performance of D-RandLinUCB and D-LinTS. We observe that the two show outstanding performance in tackling the conservatism issue that the non-randomized D-LinUCB struggles with. In particular, when the dimension is high and the set of actions is large, D-LinTS performs as well as Linear Thompson Sampling with prior information on the change-point.

## 2 Warm-Up: Stationary Stochastic Linear Bandit

### 2.1 Preliminaries

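In the stationary stochastic linear bandit, a learner chooses an action $X_t$ from a given action set $\mathcal{X}_t \subset \mathbb{R}^d$ in every round $t$, and subsequently observes the reward $Y_t = \langle X_t, \theta^\star \rangle + \eta_t$, where $\theta^\star \in \mathbb{R}^d$ is an unknown parameter and $\eta_t$ is a conditionally 1-subGaussian random variable. For simplicity, assume that for all $x \in \mathcal{X}_t$, $\|x\|_2 \le 1$, $\|\theta^\star\|_2 \le 1$, and thus $|\langle x, \theta^\star \rangle| \le 1$.

As a measure for evaluating a learner, the regret is defined as the difference between the rewards the learner would have received had it played the best action in hindsight and the rewards actually received. Therefore, minimizing the regret is equivalent to maximizing the expected cumulative reward. Denote the best action in round $t$ as $x^\star_t = \arg\max_{x \in \mathcal{X}_t} \langle x, \theta^\star \rangle$ and the expected regret as $\mathbb{E}[R(T)] = \sum_{t=1}^{T} \mathbb{E}[\langle x^\star_t - X_t, \theta^\star \rangle]$.

To learn about the unknown parameter $\theta^\star$ from the history up to time $t$, algorithms rely on the $\ell_2$-regularized least-squares estimate of $\theta^\star$, $\hat{\theta}_t$, and a confidence ellipsoid centered at $\hat{\theta}_t$. We define $\hat{\theta}_t = V_{t,\lambda}^{-1} \sum_{l=1}^{t-1} X_l Y_l$, where $V_{t,\lambda} = \lambda I_d + \sum_{l=1}^{t-1} X_l X_l^\top$ and $\lambda$ is a positive regularization parameter.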
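As a minimal sketch of the estimator described above, the following NumPy code computes a regularized least-squares estimate and the norm $\|x\|_{V^{-1}}$ that confidence-ellipsoid algorithms use as an exploration width. The function names `ridge_estimate` and `confidence_width` are illustrative assumptions, not names from the paper, and the Gram matrix is assumed to have the usual form $\lambda I_d + \sum_l X_l X_l^\top$.

```python
import numpy as np

def ridge_estimate(X, y, lam=1.0):
    """Return (theta_hat, V) for l2-regularized least squares.

    X: (n, d) array of played feature vectors; y: (n,) array of rewards.
    Assumes V = lam * I + sum_l x_l x_l^T (standard ridge Gram matrix).
    """
    d = X.shape[1]
    V = lam * np.eye(d) + X.T @ X
    theta_hat = np.linalg.solve(V, X.T @ y)
    return theta_hat, V

def confidence_width(x, V):
    """Return ||x||_{V^{-1}} = sqrt(x^T V^{-1} x)."""
    return float(np.sqrt(x @ np.linalg.solve(V, x)))
```

The width shrinks in directions that have been played often, which is what lets an algorithm explore less where the estimate is already well determined.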

### 2.2 Randomized Exploration

The standard solutions in the stationary stochastic linear bandit are the optimism-based algorithm (LinUCB, Abbasi-Yadkori et al. [2011]) and Linear Thompson Sampling (LinTS, Agrawal and Goyal [2013]). While the former obtains the theoretically optimal regret bound matching the lower bound $\Omega(d\sqrt{T})$, the latter empirically performs better in spite of a regret bound worse than that of LinUCB [Chapelle and Li, 2011]. In the finite-arm setting, the regret bound of Gaussian Linear Thompson Sampling (Gaussian-LinTS) is improved as a special case of Follow-the-Perturbed-Leader-GLM (FPL-GLM, Kveton et al. [2019b]). Also, a series of randomized algorithms for linear bandits were proposed in recent works: Linear Perturbed History Exploration (LinPHE, Kveton et al. [2019a]) and Randomized Linear UCB (RandLinUCB, Vaswani et al. [2019]). They are categorized in terms of regret bounds, randomness, and oracle access in Table 1. We denote .

There are two families of randomized algorithms according to the way perturbations are used. The first algorithm family is designed to choose an action by maximizing the expected rewards after adding the random perturbation to estimates. Gaussian-LinTS, LinPHE, and FPL-GLM algorithms are in this family. But they are limited in that regret bounds, , depend on the number of arms, and yield regret when action set is infinite. The other family including RandLinUCB [Vaswani et al., 2019] is constructed by replacing the optimism with simple randomization when choosing a confidence level to handle the chronic issue that UCB-type algorithms are too conservative. This randomized version of LinUCB achieves theoretically optimal regret bounds of LinUCB as well as matches the empirical performance of LinTS.

Oracle point of view : We assume that the learner has access to an algorithm that returns a near-optimal solution to the offline problem, called an offline optimization oracle. It returns the optimal action that maximizes the expected reward from a given action space when a parameter is given as input.

###### Definition 1 (Offline Optimization Oracle).

There exists an algorithm, , which when given a pair of action space , and a parameter , computes .
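For a finite action set, the oracle in Definition 1 can be sketched as a plain arg-max over inner products. This is only an illustration under that finiteness assumption; the name `offline_oracle` is hypothetical, and for structured or infinite action sets the oracle would instead be a problem-specific solver (for example a linear program).

```python
import numpy as np

def offline_oracle(actions, theta):
    """Return the action (row of `actions`) maximizing <x, theta>.

    actions: (K, d) array, one feature vector per arm (finite set assumed).
    theta:   (d,) parameter vector handed to the oracle.
    """
    actions = np.asarray(actions, dtype=float)
    best = int(np.argmax(actions @ theta))
    return actions[best]
```

A perturbation-based algorithm of the first family calls such a routine once per round with a perturbed parameter in place of `theta`.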

The non-randomized LinUCB and RandLinUCB must compute spectral norms of all actions in every round, so they cannot be implemented efficiently with an infinite set of arms. In contrast, the main advantage of the algorithms in the first family, such as (Gaussian-)LinTS, LinPHE, and FPL-GLM, is that they rely only on an offline optimization oracle in every round, so the optimal action can be obtained in polynomial time even from a large or infinite action set.

Improved regret bound of Gaussian LinTS : In FPL-GLM, generating Gaussian perturbations and saving $d$-dimensional feature vectors are required to obtain the perturbed estimate in every round $t$, which imposes a computational burden and a memory cost for storage. However, once Gaussian perturbations are used in a linear model, adding univariate Gaussian perturbations to historical rewards is the same as perturbing the estimate by a multivariate Gaussian perturbation because of the linear invariance property of Gaussian distributions, and the resulting algorithm is approximately equivalent to Gaussian Linear Thompson Sampling [Agrawal and Goyal, 2013].

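$$
\begin{aligned}
\tilde{\theta}_t &= \hat{\theta}_t + V_{t,\lambda}^{-1} \sum_{l=1}^{t-1} X_l Z_l^{(t)}, \quad Z_l^{(t)} \sim N(0, a^2) \\
&\approx \hat{\theta}_t + V_{t,\lambda}^{-1/2} Z^{(t)}, \quad Z^{(t)} \sim N(0, a^2 I_d) \quad : \textbf{Gaussian LinTS}.
\end{aligned}
$$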

It naturally implies that the regret bound of Gaussian LinTS is improved by with a finite set of arms [Kveton et al., 2019b].
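The approximation between perturbing historical rewards and perturbing the estimate directly can be made concrete by comparing covariances. The short derivation below is a sketch that assumes the Gram matrix has the usual regularized form $V_{t,\lambda} = \lambda I_d + \sum_{l=1}^{t-1} X_l X_l^\top$ and that the perturbations $Z_l^{(t)}$ are independent given the history:

$$
\begin{aligned}
\mathrm{Cov}\Big(V_{t,\lambda}^{-1} \sum_{l=1}^{t-1} X_l Z_l^{(t)}\Big)
&= a^2\, V_{t,\lambda}^{-1} \Big(\sum_{l=1}^{t-1} X_l X_l^\top\Big) V_{t,\lambda}^{-1} \\
&= a^2\, V_{t,\lambda}^{-1} \big(V_{t,\lambda} - \lambda I_d\big) V_{t,\lambda}^{-1}
= a^2 \big(V_{t,\lambda}^{-1} - \lambda V_{t,\lambda}^{-2}\big), \\
\mathrm{Cov}\big(V_{t,\lambda}^{-1/2} Z^{(t)}\big) &= a^2\, V_{t,\lambda}^{-1}.
\end{aligned}
$$

Both perturbations are zero-mean Gaussians, and their covariances differ only by the term $a^2 \lambda V_{t,\lambda}^{-2}$, which is of lower order than $a^2 V_{t,\lambda}^{-1}$ once the eigenvalues of $V_{t,\lambda}$ are large compared with $\lambda$.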

Equivalence between Gaussian LinTS and RandLinUCB : Another perspective of Gaussian LinTS algorithm is that it is equivalent to RandLinUCB with decoupled perturbations across arms due to linear invariance of Gaussian random variables:

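$$
\begin{aligned}
\langle x, \tilde{\theta}_t \rangle &= \langle x, \hat{\theta}_t \rangle + x^\top V_{t,\lambda}^{-1/2} Z^{(t)}, \quad Z^{(t)} \sim N(0, a^2 I_d) \\
&= \langle x, \hat{\theta}_t \rangle + Z_{t,x} \|x\|_{V_{t,\lambda}^{-1}}, \quad Z_{t,x} \sim N(0, a^2) \quad : \textbf{Decoupled RandLinUCB}.
\end{aligned}
$$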

If perturbations are coupled, a randomly sampled confidence level is shared by all actions in each round, which amounts to replacing $Z_{t,x}$ with $Z_t$. In Decoupled RandLinUCB, where each arm has its own random confidence level, more variation is generated, so its regret bound has an extra logarithmic gap that depends on the number of decoupled actions. In other words, the standard (Coupled) RandLinUCB enjoys a minimax-optimal regret bound thanks to coupled perturbations. The price of this statistical optimality is that it cannot rely on an offline optimization oracle and thus loses computational efficiency. This is the trade-off between efficiency and optimality embodied in the two design principles of perturbation-based algorithms for stationary linear bandits.
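The difference between coupled and decoupled perturbations can be illustrated for a finite action set. The sketch below assumes Gaussian perturbations with scale `a`; the function names are hypothetical, and a Cholesky factor of $V^{-1}$ is used in place of the symmetric square root $V^{-1/2}$, which yields a perturbation with the same covariance $a^2 V^{-1}$.

```python
import numpy as np

def widths(actions, V):
    """Return ||x||_{V^{-1}} for every row x of `actions` (shape (K, d))."""
    Vinv_xT = np.linalg.solve(V, actions.T)                  # (d, K)
    return np.sqrt(np.einsum("kd,dk->k", actions, Vinv_xT))

def coupled_choice(actions, theta_hat, V, a, rng):
    """RandLinUCB-style step: one confidence level z shared by all arms."""
    z = rng.normal(0.0, a)
    return int(np.argmax(actions @ theta_hat + z * widths(actions, V)))

def decoupled_choice(actions, theta_hat, V, a, rng):
    """Gaussian-LinTS-style step: perturb the estimate, then take an arg-max."""
    L = np.linalg.cholesky(np.linalg.inv(V))                 # L L^T = V^{-1}
    theta_tilde = theta_hat + L @ rng.normal(0.0, a, size=len(theta_hat))
    return int(np.argmax(actions @ theta_tilde))
```

Note that `coupled_choice` needs the width of every arm, whereas `decoupled_choice` only needs a single linear maximization, which is where an offline optimization oracle can be substituted.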

## 3 Non-Stationary Stochastic Linear Bandit

### 3.1 Preliminaries

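In each round $t$, an action set $\mathcal{X}_t \subset \mathbb{R}^d$ is given to the learner, who has to choose an action $X_t \in \mathcal{X}_t$. Then, the reward $Y_t = \langle X_t, \theta^\star_t \rangle + \eta_t$ is revealed to the learner, where $\theta^\star_t \in \mathbb{R}^d$ is an unknown time-varying parameter and $\eta_t$ is a conditionally 1-subGaussian random variable. The non-stationary assumption allows the unknown parameter $\theta^\star_t$ to vary over time within a total variation budget $B_T$. This is a convenient way of quantifying the time-variation of $\theta^\star_t$ in that it covers both slowly-changing and abruptly-changing environments. For simplicity, assume that for all $x \in \mathcal{X}_t$, $\|x\|_2 \le 1$, $\|\theta^\star_t\|_2 \le 1$, and thus $|\langle x, \theta^\star_t \rangle| \le 1$. We denote the number of actions in the largest action space as $K = \max_{t} |\mathcal{X}_t|$.

As in the stationary setting, denote the best action in round $t$ as $x^\star_t = \arg\max_{x \in \mathcal{X}_t} \langle x, \theta^\star_t \rangle$ and the expected dynamic regret as $\mathbb{E}[R(T)] = \sum_{t=1}^{T} \mathbb{E}[\langle x^\star_t - X_t, \theta^\star_t \rangle]$, where $X_t$ is the chosen action at time $t$. The goal of the learner is to minimize the expected dynamic regret.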

In a stationary stochastic environment where the reward has a linear structure, the Linear Upper Confidence Bound algorithm (LinUCB) follows the principle of optimism in the face of uncertainty (OFU). Under this OFU principle, two recent works of Wu et al. [2018] and Russac et al. [2019] proposed Sliding Window Linear UCB (SW-LinUCB) and Discounted Linear UCB (D-LinUCB), which are non-stationary variants of LinUCB that adapt to the time-variation of $\theta^\star_t$. They rely on weighted least-squares estimators with, respectively, equal weights given only to the most recent observations inside a sliding window of fixed length, and exponentially discounting weights.

SW-LinUCB and D-LinUCB both achieve the minimax-optimal dynamic regret when $B_T$ is known to the learner, but they share the implementation inefficiency of LinUCB [Abbasi-Yadkori et al., 2011] in that the spectral norms of all actions must be computed. Furthermore, they are built upon the construction of a high-probability confidence ellipsoid for the unknown parameter, so they are deterministic and the confidence ellipsoid becomes too wide when high-dimensional features are available. In this section, two randomized exploration algorithms, Discounted Randomized LinUCB (D-RandLinUCB) and Discounted Linear Thompson Sampling (D-LinTS), are proposed to handle the computational inefficiency and conservatism that both optimism-based algorithms suffer from. The dynamic regret bounds, randomness, and oracle access of the algorithms are reported in Table 2.

### 3.2 Weighted Least-Squares Estimator

We firstly study the weighted least-squares estimator with discounting factor . In round , the weighted least-squares estimator is obtained in a closed form, where . Additionally, we define . This form is closely connected with the covariance matrix of . For simplicity, we denote .

###### Lemma 2 (Weighted Least-Squares Confidence Ellipsoid, Theorem 1 [Russac et al., 2019]).

Assume the stationary setting where $\theta^\star_t = \theta^\star$ for all $t$. For any $\delta > 0$,

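$$
\mathbb{P}\left(\forall t \ge 1,\ \big\|\hat{\theta}^{wls}_t - \theta^\star\big\|_{W_{t,\lambda} \widetilde{W}_{t,\lambda}^{-1} W_{t,\lambda}} \le \beta_t\right) \ge 1 - \delta
$$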

where .

While Lemma 2 states that the confidence ellipsoid contains true parameter with high probability in stationary setting, the true parameter is not necessarily inside the confidence ellipsoid in the non-stationary setting because of variation in the parameters. We alternatively define a surrogate parameter , which belongs to with probability at least , which is formally stated in Lemma 4.

### 3.3 Randomized Exploration

In this section, we propose two randomized algorithms for non-stationary stochastic linear bandits, Discounted Randomized LinUCB (D-RandLinUCB) and Discounted Linear Thompson Sampling (D-LinTS). To gracefully adapt to environmental variation, weighting with an exponential discount factor is applied directly to both RandLinUCB and LinTS. The random perturbations are injected into D-RandLinUCB and D-LinTS in different fashions: the former replaces optimism with simple randomization in deciding the confidence level, while the latter perturbs the estimate before maximizing the expected reward.

#### Discounted Randomized Linear UCB

Following the optimism in the face of uncertainty principle, D-LinUCB [Russac et al., 2019] chooses an action by maximizing the upper confidence bound of the expected reward based on $\hat{\theta}^{wls}_t$ and confidence level $a$. Motivated by the recent work of Vaswani et al. [2019], our first randomized algorithm in the non-stationary linear bandit setting is constructed by replacing the confidence level $a$ with a random variable $Z_t$. This non-stationary variant of the RandLinUCB algorithm is called Discounted Randomized LinUCB (D-RandLinUCB, Algorithm 1):

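$$
\begin{aligned}
\textbf{D-LinUCB:} \quad X_t &= \arg\max_{x \in \mathcal{X}_t} \langle x, \hat{\theta}^{wls}_t \rangle + a \|x\|_{V_t^{-1}}, \\
\textbf{D-RandLinUCB:} \quad X_t &= \arg\max_{x \in \mathcal{X}_t} \langle x, \hat{\theta}^{wls}_t \rangle + Z_t \|x\|_{V_t^{-1}}.
\end{aligned}
$$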
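One round of D-RandLinUCB over a finite action set might be sketched as follows. Several pieces are assumptions rather than the paper's exact construction: the discounted statistics are maintained with the recursive updates commonly used for D-LinUCB (discount `gamma`, regularization `lam`), the width uses $V_t^{-1} = W^{-1} \widetilde{W} W^{-1}$ as suggested by the norm in Lemma 2, and $Z_t$ is drawn from $N(0, a^2)$. The class and function names are hypothetical.

```python
import numpy as np

class DiscountedStats:
    """Discounted sufficient statistics (assumed recursive form)."""

    def __init__(self, d, gamma, lam):
        self.gamma, self.lam = gamma, lam
        self.W = lam * np.eye(d)         # sum_s gamma^(t-s) x x^T + lam I
        self.W_tilde = lam * np.eye(d)   # sum_s gamma^(2(t-s)) x x^T + lam I
        self.b = np.zeros(d)             # sum_s gamma^(t-s) y x

    def update(self, x, y):
        g, lam, I = self.gamma, self.lam, np.eye(len(x))
        self.W = g * self.W + np.outer(x, x) + (1 - g) * lam * I
        self.W_tilde = g**2 * self.W_tilde + np.outer(x, x) + (1 - g**2) * lam * I
        self.b = g * self.b + y * x

    def theta_hat(self):
        return np.linalg.solve(self.W, self.b)

def d_randlinucb_choice(actions, stats, a, rng):
    """Pick an arm index using one shared random confidence level Z_t."""
    Winv_xT = np.linalg.solve(stats.W, actions.T)            # (d, K)
    # ||x||^2_{V^{-1}} = x^T W^{-1} W_tilde W^{-1} x for each arm
    sq = np.einsum("dk,de,ek->k", Winv_xT, stats.W_tilde, Winv_xT)
    z = rng.normal(0.0, a)
    return int(np.argmax(actions @ stats.theta_hat() + z * np.sqrt(sq)))
```

Setting `gamma=1.0` removes the discounting, in which case `W` and `W_tilde` coincide and the step reduces to a stationary RandLinUCB-style selection.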

#### Discounted Linear Thompson Sampling

The idea of perturbing estimates via random perturbation in the LinTS algorithm can be directly applied to the non-stationary setting by replacing $\hat{\theta}_t$ and the Gram matrix $V_{t,\lambda}$ with the weighted least-squares estimator $\hat{\theta}^{wls}_t$ and its corresponding matrix. We call the result Discounted Linear Thompson Sampling (D-LinTS, Algorithm 2). The motivation of D-LinTS arises from its equivalence to D-RandLinUCB with decoupled perturbations for all $x \in \mathcal{X}_t$ in round $t$:

$$
\begin{aligned}
\tilde{f}_t(x) = \langle x, \tilde{\theta}^{wls}_t \rangle &= \langle x, \hat{\theta}^{wls}_t \rangle + x^\top W_{t,\lambda}^{-1} \widetilde{W}_{t,\lambda}^{1/2} Z^{(t)} \\
&= \langle x, \hat{\theta}^{wls}_t \rangle + Z_{t,x} \|x\|_{V_t^{-1}}
\end{aligned}
$$

where $Z_{t,x} \sim N(0, a^2)$. The perturbations above are decoupled in that arms do not share a random perturbation with each other, so they generate more variation and accordingly a larger regret bound than that of the D-RandLinUCB algorithm, which uses the coupled perturbation $Z_t$. At the cost of a logarithmic regret gap in terms of $K$, the innate perturbation of D-LinTS allows it to use an arg-max oracle (Definition 1), in contrast to D-LinUCB and D-RandLinUCB. Therefore, the D-LinTS algorithm can be computationally efficient even with an infinite action set.
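The D-LinTS selection step can be sketched as a single perturbation of the weighted estimate followed by one arg-max. This is an illustration under stated assumptions: the matrices `W` and `W_tilde` and the vector `b` are taken as given discounted statistics, the action set is finite so the oracle is a plain arg-max, and a Cholesky factor of `W_tilde` stands in for the symmetric square root $\widetilde{W}^{1/2}$ (the resulting perturbation has the same covariance $a^2 W^{-1} \widetilde{W} W^{-1}$). The function name is hypothetical.

```python
import numpy as np

def d_lints_choice(actions, W, W_tilde, b, a, rng):
    """Return (arm index, perturbed estimate) for one D-LinTS-style round."""
    theta_hat = np.linalg.solve(W, b)            # weighted least-squares estimate
    L = np.linalg.cholesky(W_tilde)              # L L^T = W_tilde
    z = rng.normal(0.0, a, size=len(b))          # Z ~ N(0, a^2 I_d)
    theta_tilde = theta_hat + np.linalg.solve(W, L @ z)
    # One oracle call: maximize <x, theta_tilde> over the action set.
    return int(np.argmax(actions @ theta_tilde)), theta_tilde
```

Unlike the coupled rule, nothing here requires evaluating a norm for each arm, so the final line can be replaced by any offline optimization routine for the action set at hand.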

### 3.4 Analysis

We construct a general regret bound for linear bandit algorithms, building on the prior work of Kveton et al. [2019a]. The difference from their work is that action sets vary over time and can contain infinitely many arms. In addition, we consider a non-stationary environment in which the true parameter changes within total variation $B_T$. Dynamic regret is decomposed into surrogate regret and a bias arising from the total variation.

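$$
\begin{aligned}
\mathbb{E}[R(T)] = \sum_{t=1}^{T} \mathbb{E}[\langle x^\star_t - X_t, \theta^\star_t \rangle]
&= \sum_{t=1}^{T} \mathbb{E}[\langle x^\star_t - X_t, \bar{\theta}_t \rangle] + \sum_{t=1}^{T} \mathbb{E}[\langle x^\star_t - X_t, \theta^\star_t - \bar{\theta}_t \rangle] \\
&\le \sum_{t=1}^{T} \mathbb{E}[\langle x^\star_t - X_t, \bar{\theta}_t \rangle] + 2 \sum_{t=1}^{T} \|\theta^\star_t - \bar{\theta}_t\|_2
\end{aligned}
$$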
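The factor 2 in the bias term of this decomposition follows from the Cauchy-Schwarz and triangle inequalities. As a short worked step, assuming every action has Euclidean norm at most one:

$$
\langle x^\star_t - X_t, \theta^\star_t - \bar{\theta}_t \rangle
\le \|x^\star_t - X_t\|_2 \, \|\theta^\star_t - \bar{\theta}_t\|_2
\le \big(\|x^\star_t\|_2 + \|X_t\|_2\big) \|\theta^\star_t - \bar{\theta}_t\|_2
\le 2 \|\theta^\star_t - \bar{\theta}_t\|_2 .
$$

Summing over $t$ gives the second term of the bound, which depends only on how far the surrogate parameter is from the true one and not on the algorithm's choices.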

#### Surrogate Instantaneous Regret

To bound the surrogate instantaneous regret $\langle x^\star_t - X_t, \bar{\theta}_t \rangle$, we define three events $E^{wls}$, $E^{conc}_t$, and $E^{anti}_t$:

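$$
\begin{aligned}
E^{wls} &= \big\{\forall (x, t) \in \bar{\mathcal{X}}_T;\ |\langle x, \hat{\theta}^{wls}_t - \bar{\theta}_t \rangle| \le c_1 \|x\|_{V_t^{-1}}\big\}, \\
E^{conc}_t &= \big\{\forall x \in \mathcal{X}_t;\ |\tilde{f}_t(x) - \langle x, \hat{\theta}^{wls}_t \rangle| \le c_2 \|x\|_{V_t^{-1}}\big\}, \\
E^{anti}_t &= \big\{\tilde{f}_t(x^\star_t) - \langle x^\star_t, \hat{\theta}^{wls}_t \rangle > c_1 \|x^\star_t\|_{V_t^{-1}}\big\}.
\end{aligned}
$$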

where . The choice of is made by algorithmic design, which decides choices on both and simultaneously. In round , we consider the general algorithm which maximizes perturbed expected reward over action space . The following theorem is a extension of Theorem 1 [Kveton et al., 2019a] to the time-evolving environment.

###### Theorem 3.

Assume we have satisfying , , and , and . Let be an algorithm that chooses arm at time . Then the expected surrogate instantaneous regret of , is bounded by

$$
p_2 + (c_1 + c_2)\left(1 + \frac{2}{p_3 - p_2}\right) \mathbb{E}_t\big[\min(1, \|X_t\|_{V_t^{-1}})\big].
$$

###### Proof.

Firstly, we newly define in round . Given history , we assume that event holds and let be the set of arms that are under-sampled and worse than given in round . Among them, let be the least uncertain under-sampled arm in round . By definition of the optimal arm, . The set of sufficiently sampled arms is defined as and let . Note that any actions with can be neglected since the regret induced by these actions are always negative so that it is upper bounded by zero. Given history , is deterministic term while is random because of innate randomness in . Thus surrogate instantaneous regret can be bounded as,

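$$
\begin{aligned}
\Delta_{X_t} &= \Delta_{U_t} + \langle U_t, \bar{\theta}_t \rangle - \langle X_t, \bar{\theta}_t \rangle \\
&\le \Delta_{U_t} + \tilde{f}_t(U_t) - \tilde{f}_t(X_t) + c \|X_t\|_{V_t^{-1}} + c \|U_t\|_{V_t^{-1}} \\
&\le c \|X_t\|_{V_t^{-1}} + 2c \|U_t\|_{V_t^{-1}}.
\end{aligned}
$$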

Thus, the expected surrogate instantaneous regret can be bounded as,

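$$
\begin{aligned}
\mathbb{E}_t[\Delta_{X_t}] &= \mathbb{E}_t[\Delta_{X_t} \mathbb{I}\{E^{conc}_t\}] + \mathbb{E}_t[\Delta_{X_t} \mathbb{I}\{\bar{E}^{conc}_t\}] \\
&\le c\, \mathbb{E}_t[\|X_t\|_{V_t^{-1}}] + 2c \|U_t\|_{V_t^{-1}} + \mathbb{P}_t(\bar{E}^{conc}_t) \\
&\le c\, \mathbb{E}_t[\|X_t\|_{V_t^{-1}}] + 2c \|U_t\|_{V_t^{-1}} + p_2 \\
&\le c\, \mathbb{E}_t[\|X_t\|_{V_t^{-1}}] + 2c\, \frac{\mathbb{E}_t[\|X_t\|_{V_t^{-1}}]}{\mathbb{P}_t(X_t \in \bar{S}_t)} + p_2 \\
&= c \left(1 + \frac{2}{\mathbb{P}_t(X_t \in \bar{S}_t)}\right) \mathbb{E}_t[\|X_t\|_{V_t^{-1}}] + p_2 \\
&\le c \left(1 + \frac{2}{p_3 - p_2}\right) \mathbb{E}_t[\|X_t\|_{V_t^{-1}}] + p_2 \\
&\le c \left(1 + \frac{2}{p_3 - p_2}\right) \mathbb{E}_t\big[\min(1, \|X_t\|_{V_t^{-1}})\big] + p_2.
\end{aligned}
$$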

The third inequality holds because $U_t$ is, by definition, the least uncertain arm in $\bar{S}_t$ and is deterministic given the history:

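$$
\begin{aligned}
\mathbb{E}_t[\|X_t\|_{V_t^{-1}}] &\ge \mathbb{E}_t\big[\|X_t\|_{V_t^{-1}} \,\big|\, X_t \in \bar{S}_t\big] \cdot \mathbb{P}_t(X_t \in \bar{S}_t) \\
&\ge \|U_t\|_{V_t^{-1}} \cdot \mathbb{P}_t(X_t \in \bar{S}_t).
\end{aligned}
$$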

The last inequality works due to the assumption on , which leads us to have .

The second last inequality holds since on event ,

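$$
\begin{aligned}
\mathbb{P}_t(X_t \in \bar{S}_t) &\ge \mathbb{P}_t\big(\exists x \in \bar{S}_t : \tilde{f}_t(x) \ge \max_{y \in S_t} \tilde{f}_t(y)\big) \\
&\ge \mathbb{P}_t\big(\tilde{f}_t(x^\star_t) \ge \max_{y \in S_t} \tilde{f}_t(y)\big) \\
&\ge \mathbb{P}_t\big(\tilde{f}_t(x^\star_t) \ge \max_{y \in S_t} \tilde{f}_t(y),\ E^{conc}_t\big) \\
&\ge \mathbb{P}_t\big(\tilde{f}_t(x^\star_t) \ge \langle x^\star_t, \bar{\theta}_t \rangle,\ E^{conc}_t\big) \\
&\ge \mathbb{P}_t\big(\tilde{f}_t(x^\star_t) \ge \langle x^\star_t, \bar{\theta}_t \rangle\big) - \mathbb{P}_t(\bar{E}^{conc}_t) \\
&\ge p_3 - p_2.
\end{aligned}
$$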

The fourth inequality holds since for any ,

In the following three lemmas, the probability of events , and can be controlled with optimal choices of and for D-RandLinUCB and D-LinTS algorithms.

###### Lemma 4 (Proposition 3, Russac et al. [2019]).

For , and , the event holds with probability at least .

###### Lemma 5 (Concentration).

Given history ,
(a) D-RandLinUCB : where , and . Then, .
(b) D-LinTS : , where , and . Then, .

###### Proof.

(a) In the D-RandLinUCB algorithm we have $\tilde{f}_t(x) = \langle x, \hat{\theta}^{wls}_t \rangle + Z_t \|x\|_{V_t^{-1}}$, and thus

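$$
\begin{aligned}
\mathbb{P}(\bar{E}^{conc}_t) = 1 - \mathbb{P}(E^{conc}_t)
&= 1 - \mathbb{P}\big(\forall x \in \mathcal{X}_t;\ |\tilde{f}_t(x) - \langle x, \hat{\theta}^{wls}_t \rangle| \le c_2 \|x\|_{V_t^{-1}}\big) \\
&= 1 - \mathbb{P}\big(\forall x \in \mathcal{X}_t;\ |Z_t| \cdot \|x\|_{V_t^{-1}} \le c_2 \|x\|_{V_t^{-1}}\big) \\
&= 1 - \mathbb{P}(|Z_t| \le c_2) \qquad (\because \text{Lemma ??}) \\
&\le 1/T, \quad \text{where } c_2 = a\sqrt{2 \log(T/2)}.
\end{aligned}
$$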

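(b) Given the history, $\tilde{f}_t(x) = \langle x, \hat{\theta}^{wls}_t \rangle + x^\top W_{t,\lambda}^{-1} \widetilde{W}_{t,\lambda}^{1/2} Z^{(t)}$ is equivalent to $\langle x, \hat{\theta}^{wls}_t \rangle + Z_{t,x} \|x\|_{V_t^{-1}}$, where $Z_{t,x} \sim N(0, a^2)$, by the linear invariance property of Gaussian distributions. Thus,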

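$$
\begin{aligned}
\mathbb{P}(\bar{E}^{conc}_t) = 1 - \mathbb{P}(E^{conc}_t)
&= 1 - \mathbb{P}\big(\forall x \in \mathcal{X}_t;\ |\tilde{f}_t(x) - \langle x, \hat{\theta}^{wls}_t \rangle| \le c_2 \|x\|_{V_t^{-1}}\big) \\
&= 1 - \mathbb{P}\big(\forall x \in \mathcal{X}_t;\ |Z_{t,x}| \cdot \|x\|_{V_t^{-1}} \le c_2 \|x\|_{V_t^{-1}}\big) \\
&= 1 - \mathbb{P}\big(\forall x \in \mathcal{X}_t;\ |Z_{t,x}| \le c_2\big) \qquad (\because \text{Lemma ??}) \\
&\le 1/T, \quad \text{where } c_2 = a\sqrt{2 \log(KT/2)}.
\end{aligned}
$$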

###### Lemma 6 (Anti-concentration).

Given ,
(a) D-RandLinUCB : , where . Then, when we have .
(b) D-LinTS : where . If we assume , then .

###### Proof.

(a) We denote the perturbed expected reward as $\tilde{f}_t(x) = \langle x, \hat{\theta}^{wls}_t \rangle + Z_t \|x\|_{V_t^{-1}}$ for D-RandLinUCB. Thus,

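$$
\begin{aligned}
\mathbb{P}(E^{anti}_t) &= \mathbb{P}\big(\tilde{f}_t(x^\star_t) - \langle x^\star_t, \hat{\theta}^{wls}_t \rangle > c_1 \|x^\star_t\|_{V_t^{-1}}\big) \\
&= \mathbb{P}(Z_t \ge c_1) \ge \frac{\exp\big(-7 c_1^2 / (2a^2)\big)}{8\sqrt{\pi}} \\
&= \frac{e^{-1/4}}{8\sqrt{\pi}} \quad \text{where } a^2 = 14 c_1^2.
\end{aligned}
$$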
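The anti-concentration bound in part (a) can be checked numerically. The sketch below assumes $Z_t \sim N(0, a^2)$ and evaluates the exact Gaussian upper tail through the complementary error function; the helper name `gaussian_upper_tail` and the value `c1 = 1.0` are illustrative choices, since only the ratio $c_1 / a$ matters.

```python
import math

def gaussian_upper_tail(c, a):
    """Exact P(Z >= c) for Z ~ N(0, a^2)."""
    return 0.5 * math.erfc(c / (a * math.sqrt(2.0)))

c1 = 1.0                          # illustrative value; only c1 / a matters
a = math.sqrt(14.0) * c1          # the choice a^2 = 14 * c1^2 from the proof

tail = gaussian_upper_tail(c1, a)
bound = math.exp(-7.0 * c1**2 / (2.0 * a**2)) / (8.0 * math.sqrt(math.pi))

print(f"exact tail  P(Z >= c1) = {tail:.4f}")
print(f"lower bound            = {bound:.4f}")
```

With this choice of $a$ the exponent equals $-1/4$, so the bound is a constant independent of $c_1$, and the exact tail probability (roughly 0.39) sits comfortably above it (roughly 0.055).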

(b) In the same way as the proof of Lemma 5 (b), is equivalent to where