Mitigating Bias in Adaptive Data Gathering via Differential Privacy
Abstract
Data that is gathered adaptively (via bandit algorithms, for example) exhibits bias. This is true both when gathering simple numeric-valued data, where the empirical means tracked by stochastic bandit algorithms are biased downwards, and when gathering more complicated data, where running hypothesis tests on complex data gathered via contextual bandit algorithms leads to false discovery. In this paper, we show that this problem is mitigated if the data collection procedure is differentially private. This lets us both bound the bias of simple numeric-valued quantities (like the empirical means of stochastic bandit algorithms) and correct the p-values of hypothesis tests run on the adaptively gathered data. Moreover, there exist differentially private bandit algorithms with near-optimal regret bounds: we apply existing theorems in the simple stochastic case, and give a new analysis for linear contextual bandits. We complement our theoretical results with experiments validating our theory.
1 Introduction
Many modern data sets consist of data that is gathered adaptively: the choice of whether to collect more data points of a given type depends on the data already collected. For example, it is common in industry to conduct “A/B” tests to make decisions about many things, including ad targeting, user interface design, and algorithmic modifications, and this A/B testing is often conducted using “bandit learning algorithms” Bubeck et al. (2012), which adaptively select treatments to show to users in an effort to find the best treatment as quickly as possible. Similarly, sequential clinical trials may halt or reallocate certain treatment groups due to preliminary results, and empirical scientists may initially try and test multiple hypotheses and multiple treatments, but then decide to gather more data in support of certain hypotheses and not others, based on the results of preliminary statistical tests.
Unfortunately, as demonstrated by Nie et al. (2017), the data that results from adaptive data gathering procedures will often exhibit substantial bias. As a result, subsequent analyses that are conducted on the data gathered by adaptive procedures will be prone to error, unless the bias is explicitly taken into account. This can be difficult. Nie et al. (2017) give a selective inference approach: in simple stochastic bandit settings, if the data was gathered by a specific stochastic algorithm that they design, they give an MCMC-based procedure to perform maximum likelihood estimation to recover debiased estimates of the underlying distribution means. In this paper, we give a related, but orthogonal approach whose simplicity allows for a substantial generalization beyond the simple stochastic bandits setting. We show that in very general settings, if the data is gathered by a differentially private procedure, then we can place strong bounds on the bias of the data gathered, without needing any additional debiasing procedure. Via elementary techniques, this connection implies the existence of simple stochastic bandit algorithms with nearly optimal worst-case regret bounds and very strong bias guarantees. The connection also allows us to derive algorithms for linear contextual bandits with nearly optimal regret guarantees and strong bias guarantees. Since our connection to differential privacy only requires that the rewards and not the contexts be kept private, we are able to obtain improved accuracy compared to past approaches to private contextual bandit problems. By leveraging existing connections between differential privacy and adaptive data analysis Dwork et al. (2015c); Bassily et al. (2016); Rogers et al. (2016), we can extend the generality of our approach to bound not just bias, but to correct for effects of adaptivity on arbitrary statistics of the gathered data.
For example, we can obtain valid p-value corrections for hypothesis tests (like t-tests) run on the adaptively collected data. Since the data being gathered will generally be useful for some as yet unspecified scientific analysis, rather than just for the narrow problem of mean estimation, our technique allows for substantially broader possibilities compared to past approaches. Experiments explore the bias incurred by conventional bandit algorithms, confirm the reduction in bias obtained by leveraging privacy, and show why correction for adaptivity is crucial to performing valid post-hoc hypothesis tests. In particular, we show that for the fundamental primitive of conducting t-tests for regression coefficients, naively conducting tests on adaptively gathered data leads to incorrect inference.
1.1 Our Results
This paper has four main contributions:

Using elementary techniques, we provide explicit bounds on the bias of empirical arm means maintained by bandit algorithms in the simple stochastic setting that make their selection decisions as a differentially private function of their observations. Together with existing differentially private algorithms for stochastic bandit problems, this yields an algorithm that obtains an essentially optimal worst-case regret bound, and guarantees minimal bias (on the order of O(1/\sqrt{K\cdot T})) for the empirical mean maintained for every arm.

We then extend our results to the linear contextual bandit problem. We show that algorithms that make their decisions in a way that is differentially private in the observed reward of each arm (but which need not be differentially private in the context) have bounded bias (as measured by the difference between the predicted reward of each arm at each time step, compared to its true reward). We also derive a differentially private algorithm for the contextual bandit problem, and prove new bounds for it. Together with our bound on bias, this algorithm also obtains strong sublinear regret bounds, while having robust guarantees on bias.

We then make a general observation, relating adaptive data gathering to an adaptive analysis of a fixed dataset (in which the choice of which query to pose to the dataset is adaptive). This lets us apply the large existing literature connecting differential privacy to adaptive data analysis Dwork et al. (2015a, c); Bassily et al. (2016). In particular, it lets us apply the max-information bounds of Dwork et al. (2015b); Rogers et al. (2016) to our adaptive data gathering setting. This allows us to give much more general guarantees about the data collected by differentially private collection procedures, which extend well beyond bias. For example, it lets us correct the p-values for arbitrary hypothesis tests run on the gathered data.

Finally, we run a set of experiments that measure the bias incurred by the standard UCB algorithm in the stochastic bandit setting, contrast it with the low bias obtained by a private UCB algorithm, and show that there are settings of the privacy parameter that simultaneously make bias statistically insignificant and keep empirical regret competitive with the non-private UCB algorithm. We also demonstrate in the linear contextual bandit setting how failing to correct for adaptivity can lead to false discovery when applying t-tests for non-zero regression coefficients on an adaptively gathered dataset.
1.2 Related Work
This paper bridges two recent lines of work. Our starting point is two recent papers. Villar et al. (2015) empirically demonstrate in the context of clinical trials that a variety of simple stochastic bandit algorithms produce biased sample mean estimates (similar results have been empirically observed in the context of contextual bandits Dimakopoulou et al. (2017)). Nie et al. (2017) prove that simple stochastic bandit algorithms that exhibit two natural properties (satisfied by most commonly used algorithms, including UCB and Thompson Sampling) result in empirical means that exhibit negative bias. They then propose a heuristic algorithm which computes a maximum likelihood estimator for the sample means from the empirical means gathered by a modified UCB algorithm which adds Gumbel noise to the decision statistics. Deshpande et al. (2017) propose a debiasing procedure for ordinary least-squares estimates computed from adaptively gathered data that trades off bias for variance, and prove a central limit theorem for their method. In contrast, the methods we propose in this paper are quite different. Rather than giving an ex post debiasing procedure, we show that if the data were gathered in a differentially private manner, no debiasing is necessary. The strength of our method is both in its simplicity and generality: rather than proving theorems specific to particular estimators, we give methods to correct the p-values for arbitrary hypothesis tests that might be run on the adaptively gathered data.
The second line of work is the recent literature on adaptive data analysis Dwork et al. (2015c, b); Hardt and Ullman (2014); Steinke and Ullman (2015); Russo and Zou (2016); Wang et al. (2016); Bassily et al. (2016); Hardt and Blum (2015); Cummings et al. (2016); Feldman and Steinke (2017a, b) which draws a connection between differential privacy Dwork et al. (2006) and generalization guarantees for adaptively chosen statistics. The adaptivity in this setting is dual to the setting we study in the present paper: In the adaptive data analysis literature, the dataset itself is fixed, and the goal is to find techniques that can mitigate bias due to the adaptive selection of analyses. In contrast, here, we study a setting in which the data gathering procedure is itself adaptive, and can lead to bias even for a fixed set of statistics of interest. However, we show that adaptive data gathering can be recast as an adaptive data analysis procedure, and so the results from the adaptive data analysis literature can be ported over.
2 Preliminaries
2.1 Simple Stochastic Bandit Problems
In a simple stochastic bandit problem, there are K unknown distributions P_{i} over the unit interval [0,1], each with (unknown) mean \mu_{i}. Over a series of rounds t\in\{1,\ldots,T\}, an algorithm \mathcal{A} chooses an arm i_{t}\in[K], and observes a reward y_{i_{t},t}\sim P_{i_{t}}. Given a sequence of choices i_{1},\ldots,i_{T}, the pseudo-regret of an algorithm is defined to be:
\mathrm{Regret}((P_{1},\ldots,P_{K}),i_{1},\ldots,i_{T})=T\cdot\max_{i}\mu_{i}-\sum_{t=1}^{T}\mu_{i_{t}}
We say that regret is bounded if we can put a bound on the quantity \mathrm{Regret}((P_{1},\ldots,P_{K}),i_{1},\ldots,i_{T}) in the worst case over the choice of distributions P_{1},\ldots,P_{K}, and with high probability or in expectation over the randomness of the algorithm and of the reward sampling.
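As a concrete illustration of the definition above, the following minimal sketch evaluates the pseudo-regret of a fixed sequence of pulls. The function name `pseudo_regret`, the zero-based arm indices, and the example means are our own illustrative assumptions.

```python
def pseudo_regret(means, choices):
    """Return T * max_i mu_i - sum_t mu_{i_t} for zero-indexed arm choices."""
    best = max(means)
    return len(choices) * best - sum(means[i] for i in choices)


# Three arms with (hypothetical) means; the algorithm explores arms 0 and 1
# once each and then commits to the best arm for the remaining three rounds.
means = [0.2, 0.5, 0.9]
choices = [0, 1, 2, 2, 2]
regret = pseudo_regret(means, choices)
```

Only the two exploratory pulls contribute to the regret here; every pull of the best arm contributes zero.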
As an algorithm \mathcal{A} interacts with a bandit problem, it generates a history \Lambda, which records the sequence of actions taken and rewards observed thus far: \Lambda_{t}=\{(i_{\ell},y_{i_{\ell},\ell})\}_{\ell=1}^{t-1}. We denote the space of histories of length T by \mathcal{H}^{T}=([K]\times\mathbb{R})^{T}.
The definition of an algorithm \mathcal{A} induces a sequence of T (possibly randomized) selection functions f_{t}:\mathcal{H}^{t-1}\rightarrow[K], which map histories onto decisions of which arm to pull at each round.
2.2 Contextual Bandit Problems
In the contextual bandit problem, decisions are endowed with observable features. Our algorithmic results in this paper focus on the linear contextual bandit problem, but our general connection between adaptive data gathering and differential privacy extends beyond the linear case. For simplicity of exposition, we specialize to the linear case here.
There are K arms i, each of which is associated with an unknown d-dimensional linear function represented by a vector of coefficients \theta_{i}\in\mathbb{R}^{d} with \|\theta_{i}\|_{2}\leq 1. In rounds t\in\{1,\ldots,T\}, the algorithm is presented with a context x_{i,t}\in\mathbb{R}^{d} for each arm i with \|x_{i,t}\|_{2}\leq 1, which may be selected by an adaptive adversary as a function of the past history of play. We write x_{t} to denote the set of all K contexts present at round t. As a function of these contexts, the algorithm then selects an arm i_{t}, and observes a reward y_{i_{t},t}. The rewards satisfy {\mathbb{E}\left[y_{i,t}\right]}=\theta_{i}\cdot x_{i,t} and are bounded to lie in [0,1]. Regret is now measured with respect to the optimal policy. Given a sequence of contexts x_{1},\ldots,x_{T}, a set of linear functions \theta_{1},\ldots,\theta_{K}, and a set of choices i_{1},\ldots,i_{T}, the pseudo-regret of an algorithm is defined to be:
\mathrm{Regret}((\theta_{1},\ldots,\theta_{K}),(x_{1},i_{1}),\ldots,(x_{T},i_{T}))=\sum_{t=1}^{T}\left(\max_{i}\theta_{i}\cdot x_{i,t}-\theta_{i_{t}}\cdot x_{i_{t},t}\right)
We say that regret is bounded if we can put a bound on the quantity \mathrm{Regret}((\theta_{1},\ldots,\theta_{K}),(x_{1},i_{1}),\ldots,(x_{T},i_{T})) in the worst case over the choice of linear functions \theta_{1},\ldots,\theta_{K} and contexts x_{1},\ldots,x_{T}, and with high probability or in expectation over the randomness of the algorithm and of the rewards.
In the contextual setting, histories incorporate observed context information as well: \Lambda_{t}=\{(i_{\ell},x_{\ell},y_{i_{\ell},\ell})\}_{\ell=1}^{t-1}.
Again, the definition of an algorithm \mathcal{A} induces a sequence of T (possibly randomized) selection functions f_{t}:\mathcal{H}^{t-1}\times\mathbb{R}^{d\times K}\rightarrow[K], which now map both a history and a set of contexts at round t to a choice of arm at round t.
2.3 Data Gathering in the Query Model
Above we have characterized a bandit algorithm \mathcal{A} as gathering data adaptively using a sequence of selection functions f_{t}, which map the observed history \Lambda_{t}\in\mathcal{H}^{t-1} to the index of the next arm pulled. In this model, a reward is drawn from the appropriate distribution only after the arm is chosen. Then the history is updated, and the process repeats.
In this section, we observe that whether the reward is drawn after the arm is “pulled,” or in advance, is a distinction without a difference. We cast this same interaction into the setting where an analyst asks an adaptively chosen sequence of queries to a fixed dataset, representing the arm rewards. The process of running a bandit algorithm \mathcal{A} up to time T can be formalized as the adaptive selection of T queries against a single database of size T  fixed in advance. The formalization consists of two steps:

By the principle of deferred randomness, we view any simple stochastic bandit algorithm as operating in a setting in which T i.i.d. samples from \prod_{i=1}^{K}P_{i} (vectors of length K representing the rewards for each of K arms on each time step t) are drawn before the interaction begins. This is the Interact algorithm below.
In the contextual setting, the contexts are also available, and the T draws are not drawn from identical distributions. Instead, the t^{th} draw is from \prod_{i=1}^{K}P_{i}^{t}, where each distribution P_{i}^{t} is determined by the context x_{i}^{t}.

The choice of arm pulled at time t by the bandit algorithm can be viewed as the answer to an adaptively selected query against this fixed dataset. This is the InteractQuery algorithm below.
Adaptive data analysis is formalized as an interaction in which a data analyst \mathcal{A} performs computations on a dataset D, observes the results, and then may choose the identity of the next computation to run as a function of previously computed results Dwork et al. (2015c, a). A sequence of recent results shows that if the queries are differentially private in the dataset D, then they will not in general overfit D, in the sense that the distribution over results induced by computing q(D) will be “similar” to the distribution over results induced if q were run on a new dataset, freshly sampled from the same underlying distribution Dwork et al. (2015c, a); Bassily et al. (2016); Dwork et al. (2015b); Rogers et al. (2016). We will be more precise about what these results say in Section 5.
Recall that histories \Lambda record the choices of the algorithm, in addition to its observations. It will be helpful to introduce notation that separates out the choices of the algorithm from its observations. In the simple stochastic setting and the contextual setting, given a history \Lambda_{t}, an action history \Lambda_{t}^{\mathcal{A}}=(i_{1},\ldots,i_{t-1})\in[K]^{t-1} denotes the portion of the history recording the actions of the algorithm.
In the simple stochastic setting, a bandit tableau is a T\times K matrix D\in\left([0,1]^{K}\right)^{T}. Each row D_{t} of D is a vector of K real numbers, intuitively representing the rewards that would be available to a bandit algorithm at round t for each of the K arms. In the contextual setting, a bandit tableau is represented by a pair of T\times K matrices: D\in\left([0,1]^{K}\right)^{T} and C\in\left((\mathbb{R}^{d})^{K}\right)^{T}. Intuitively, C represents the contexts presented to a bandit algorithm \mathcal{A} at each round: each row C_{t} corresponds to a set of K contexts, one for each arm. D again represents the rewards that would be available to the bandit algorithm at round t for each of the K arms.
We write \mathrm{Tab} to denote a bandit tableau when the setting has not been specified: implicitly, in the simple stochastic case, \mathrm{Tab}=D, and in the contextual case, \mathrm{Tab}=(D,C).
Given a bandit tableau and a bandit algorithm \mathcal{A}, we have the following interaction:
We denote the subset of the reward tableau D corresponding to rewards that would have been revealed to a bandit algorithm \mathcal{A} given action history \Lambda_{t}^{\mathcal{A}}, by \Lambda_{t}^{\mathcal{A}}(D). Concretely, if \Lambda_{t}^{\mathcal{A}}=(i_{1},\ldots,i_{t-1}) then \Lambda_{t}^{\mathcal{A}}(D)=\{(i_{\ell},y_{i_{\ell},\ell})\}_{\ell=1}^{t-1}. Given a selection function f_{t} and an action history \Lambda_{t}^{\mathcal{A}}, define the query q_{\Lambda_{t}^{\mathcal{A}}} as q_{\Lambda_{t}^{\mathcal{A}}}(D)=f_{t}({\Lambda_{t}^{\mathcal{A}}}(D)).
We now define Algorithms Bandit and InteractQuery. Bandit is a standard contextual bandit algorithm defined by selection functions f_{t}, and InteractQuery is the Interact routine that draws the rewards in advance, and at time t selects action i_{t} as the result of query q_{\Lambda_{t}^{\mathcal{A}}}. With the above definitions in hand, it is straightforward to show that the two Algorithms are equivalent, in that they induce the same joint distribution on their outputs. In both algorithms for convenience we assume we are in the linear contextual setting, and we write \eta_{i_{t}} to denote the i.i.d. error distributions of the rewards, conditional on the contexts.
Claim 1.
Let P_{1,t} be the joint distribution induced by Algorithm Bandit on \Lambda_{t} at time t, and let P_{2,t} be the joint distribution induced by Algorithm InteractQuery on \Lambda_{t}=\Lambda_{t}^{\mathcal{A}}(D). Then \forall t\;P_{1,t}=P_{2,t}.
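The equivalence in Claim 1 can be checked exactly on a tiny instance. The sketch below is an illustration under our own assumptions, not the Bandit and InteractQuery algorithms themselves: it specializes to the simple stochastic setting with Bernoulli rewards, uses a hypothetical deterministic selection rule `greedy_select`, and computes the law of the history by brute-force enumeration, once with rewards drawn after each pull and once with the whole T\times K tableau drawn in advance.

```python
from itertools import product


def greedy_select(history, K):
    """Hypothetical f_t: pull each arm once, then the arm with the best empirical mean."""
    for i in range(K):
        if all(j != i for j, _ in history):
            return i

    def mean(i):
        ys = [y for j, y in history if j == i]
        return sum(ys) / len(ys)

    return max(range(K), key=mean)


def online_distribution(select, means, T):
    """Exact law of the history when each reward is drawn after the arm is chosen."""
    dist = {(): 1.0}
    for _ in range(T):
        nxt = {}
        for hist, p in dist.items():
            i = select(hist, len(means))
            for y, py in ((1, means[i]), (0, 1.0 - means[i])):
                key = hist + ((i, y),)
                nxt[key] = nxt.get(key, 0.0) + p * py
        dist = nxt
    return dist


def interact_query(select, tableau):
    """Answer each adaptively chosen query by reading one entry of a fixed tableau."""
    history = ()
    for row in tableau:
        i = select(history, len(row))
        history += ((i, row[i]),)
    return history


def tableau_distribution(select, means, T):
    """Exact law of the history when all T x K rewards are drawn before play begins."""
    K = len(means)
    dist = {}
    for cells in product((0, 1), repeat=T * K):
        tableau = [cells[t * K:(t + 1) * K] for t in range(T)]
        p = 1.0
        for row in tableau:
            for i, y in enumerate(row):
                p *= means[i] if y == 1 else 1.0 - means[i]
        hist = interact_query(select, tableau)
        dist[hist] = dist.get(hist, 0.0) + p
    return dist


means = [0.3, 0.6]
online = online_distribution(greedy_select, means, T=3)
offline = tableau_distribution(greedy_select, means, T=3)
# Largest pointwise difference between the two laws over all histories.
max_gap = max(abs(online.get(h, 0.0) - offline.get(h, 0.0))
              for h in set(online) | set(offline))
```

The two distributions agree up to floating-point error because the entries of the tableau that the algorithm never reads are marginalized out without affecting the observed history.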
The upshot of this equivalence is that we can import existing results that hold in the setting in which the dataset is fixed, and queries are adaptively chosen. There is a large collection of results of this form for differentially private queries Dwork et al. (2015c); Bassily et al. (2016); Rogers et al. (2016), and these apply directly to our setting. In the next section we formally define differential privacy in the simple stochastic and contextual bandit setting, and leave the description of the more general transfer theorems to Section 5.
2.4 Differential Privacy
We will be interested in algorithms that are differentially private. In the simple stochastic bandit setting, we will require differential privacy with respect to the rewards. In the contextual bandit setting, we will also require differential privacy with respect to the rewards, but not necessarily with respect to the contexts.
We now define the neighboring relation we need to define bandit differential privacy:
Definition 1.
In the simple stochastic setting, two bandit tableaus D,D^{\prime} are reward neighbors if they differ in at most a single row: i.e. if there exists an index \ell such that for all t\neq\ell, D_{t}=D^{\prime}_{t}.
In the contextual setting, two bandit tableaus (D,C),(D^{\prime},C^{\prime}) are reward neighbors if C=C^{\prime} and D and D^{\prime} differ in at most a single row: i.e. if there exists an index \ell such that for all t\neq\ell, D_{t}=D^{\prime}_{t}.
Note that changing a context does not result in a neighboring tableau: this neighboring relation will correspond to privacy for the rewards, but not for the contexts.
Remark 1.
Note that we could have equivalently defined reward neighbors to be tableaus that differ in only a single entry, rather than in an entire row. The distinction is unimportant in a bandit setting, because a bandit algorithm will be able to observe only a single entry in any particular row.
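The neighboring relation of Definition 1 is easy to state in code. The following is a minimal sketch in which tableaus are plain lists of rows; the function name `are_reward_neighbors` and the optional context arguments are illustrative assumptions.

```python
def are_reward_neighbors(D, D_prime, C=None, C_prime=None):
    """True if the reward tableaus differ in at most one row and the contexts are equal.

    In the simple stochastic setting, leave C and C_prime as None.
    """
    if len(D) != len(D_prime) or C != C_prime:
        return False
    differing_rows = sum(1 for row, row_p in zip(D, D_prime) if list(row) != list(row_p))
    return differing_rows <= 1


D = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.3]]
one_row_changed = [[0.1, 0.9], [1.0, 0.0], [0.7, 0.3]]
two_rows_changed = [[0.0, 0.9], [1.0, 0.0], [0.7, 0.3]]

neighbors = are_reward_neighbors(D, one_row_changed)
not_neighbors = are_reward_neighbors(D, two_rows_changed)
```

Passing different context tableaus always returns False, reflecting that the relation protects rewards but not contexts.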
Definition 2.
A bandit algorithm \mathcal{A} is (\epsilon,\delta)-reward differentially private if for every time horizon T and every pair of bandit tableaus \mathrm{Tab},\mathrm{Tab}^{\prime} that are reward neighbors, and every subset S\subseteq[K]^{T}:
\mathbb{P}\left[\mathbf{Interact}(T,\mathcal{A},\mathrm{Tab})\in S\right]\leq e^{\epsilon}\,\mathbb{P}\left[\mathbf{Interact}(T,\mathcal{A},\mathrm{Tab}^{\prime})\in S\right]+\delta
If \delta=0, we say that \mathcal{A} is \epsilon-differentially private.
2.5 The Binary Mechanism
For many interesting stochastic bandit algorithms \mathcal{A} (UCB, Thompson sampling, \epsilon-greedy) the selection functions (f_{t})_{t\in[T]} are randomized functions of the history of sample means at each time step for each arm. It will therefore be useful to have notation to refer to these means. We write N_{i}^{T} to represent the number of times arm i is pulled through round T: N_{i}^{T}=\sum_{t^{\prime}=1}^{T}\mathbbm{1}_{\{f_{t^{\prime}}(\Lambda_{t^{\prime}})=i\}}. Note that before the history has been fixed, this is a random variable. In the simple stochastic setting, we write \hat{Y}_{i}^{T} to denote the sample mean at arm i at time T: \hat{Y}_{i}^{T}=\frac{1}{N_{i}^{T}}\sum_{j=1}^{N_{i}^{T}}y_{i,t_{j}}, where t_{j} is the time t at which arm i is pulled for the j^{th} time. Then we can write the current set of sample mean sequences for all K arms at time T as (\hat{Y}_{i}^{t})_{i\in[K],t\in[T]}. Since differential privacy is preserved under post-processing and composition, we observe that to obtain a private version \mathcal{A}_{priv} of any of these standard algorithms, an obvious method would be to estimate (\hat{Y}_{i}^{t})_{i\in[K]} privately at each round, and then to plug these private estimates into the selection functions f_{t}.
The Binary mechanism Chan et al. (2011); Dwork et al. (2010) is an online algorithm that continually releases an estimate of a running sum \sum_{i=1}^{t}y_{i} as each y_{i} arrives one at a time, while preserving \epsilon-differential privacy of the entire sequence (y_{i})_{i=1}^{T}, and guaranteeing worst-case error that scales only with \log(T). It does this by using a tree-based aggregation scheme that computes partial sums online using the Laplace mechanism, which are then combined to produce estimates for each sample mean \hat{Y}_{i}^{t}. Since the scheme operates via the Laplace mechanism, it extends immediately to the setting when each y_{i} is a vector with bounded l_{1} norm. In our private algorithms we actually use a modified version of the binary mechanism due to Chan et al. (2011) called the hybrid mechanism, which operates without a fixed time horizon T. For the rest of the paper we denote the noise added to the t^{th} partial sum by the hybrid mechanism, either in vector or scalar form, as \eta\sim\text{Hybrid}(t,\epsilon).
Theorem 1 (Corollary 4.5 in Chan et al. (2011)).
Let y_{1},\ldots y_{T}\in[0,1]. The hybrid mechanism produces sample means \tilde{Y}^{t}=\frac{1}{t}(\sum_{i=1}^{t}y_{i}+\eta_{t}), where \eta_{t}\sim\text{Hybrid}(t,\epsilon), such that the following hold:

The sequence (\tilde{Y}^{t})_{t\in[T]} is \epsilon-differentially private in (y_{1},\ldots y_{T}).

With probability 1-\delta:
\sup_{t\in[T]}\left|\tilde{Y}^{t}-\hat{Y}^{t}\right|\leq\frac{\log^{*}(\log t)}{\epsilon t}\log^{1.5}t\cdot\text{Ln}(\log\log t)\log{\frac{1}{\delta}}, (1) where \hat{Y}^{t}=\frac{1}{t}\sum_{i=1}^{t}y_{i} is the exact sample mean, \log^{*} denotes the binary iterated logarithm, and Ln is the function defined as \text{Ln}(n)=\prod_{r=0}^{\log^{*}(n)}\log^{(r)}n in Chan et al. (2011).
For the rest of the paper, we denote the RHS of (1) as \tilde{O}(\frac{1}{t\epsilon}\log^{1.5}t\log\frac{1}{\delta}), hiding the messier sublogarithmic terms.
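To make the tree-based aggregation concrete, here is a minimal sketch of the fixed-horizon binary mechanism for scalar values in [0,1]. It is not the hybrid mechanism used in our algorithms, and its constants are simplified: we assume each value enters at most L partial sums, where L is the number of bits of T, and add Laplace noise of scale L/\epsilon to every partial sum. The class name, the `noise` hook, and the seed are illustrative assumptions.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential variables."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


class BinaryMechanism:
    """Releases noisy running sums of a stream y_1, ..., y_T with each y_t in [0, 1]."""

    def __init__(self, T, epsilon, noise=None, seed=0):
        self.T = T
        self.levels = T.bit_length()          # number of dyadic levels needed for t <= T
        self.scale = self.levels / epsilon    # each y_t touches at most `levels` partial sums
        rng = random.Random(seed)
        self.noise = noise if noise is not None else (lambda s: laplace_noise(s, rng))
        self.alpha = [0.0] * self.levels      # exact partial sums, one per level
        self.noisy = [0.0] * self.levels      # their noisy counterparts
        self.t = 0

    def update(self, y):
        """Consume y_t and return a private estimate of y_1 + ... + y_t."""
        if self.t >= self.T:
            raise ValueError("time horizon exceeded")
        self.t += 1
        i = (self.t & -self.t).bit_length() - 1   # index of the lowest set bit of t
        # Merge all lower levels plus the new value into the level-i partial sum.
        self.alpha[i] = sum(self.alpha[:i]) + y
        for j in range(i):
            self.alpha[j] = 0.0
            self.noisy[j] = 0.0
        self.noisy[i] = self.alpha[i] + self.noise(self.scale)
        # The prefix [1, t] is the union of the blocks named by the set bits of t.
        return sum(self.noisy[j] for j in range(self.levels) if (self.t >> j) & 1)


rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0]
mech = BinaryMechanism(T=len(rewards), epsilon=1.0)
private_means = [mech.update(y) / t for t, y in enumerate(rewards, start=1)]
```

Each released sum combines at most L noisy partial sums, which is the source of the polylogarithmic error; dividing by t gives a private sample mean in the spirit of \tilde{Y}^{t}.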
3 Privacy Reduces Bias in Stochastic Bandit Problems
We begin by showing that differentially private algorithms that operate in the stochastic bandit setting compute empirical means for their arms that are nearly unbiased. Together with known differentially private algorithms for stochastic bandit problems, the result is an algorithm that obtains a nearly optimal (worst-case) regret guarantee while also guaranteeing that the collected data is nearly unbiased. We could (and do) obtain these results by combining the reduction to answering adaptively selected queries given by Theorem 1 with the standard generalization theorems in adaptive data analysis (e.g. Corollary 3 in its most general form), but we first prove these debiasing results from first principles to build intuition.
Theorem 2.
Let \mathcal{A} be an (\epsilon,\delta)-differentially private algorithm in the stochastic bandit setting. Then, for all i\in[K], and all t, we have:
\left|{\mathbb{E}\left[\hat{Y}_{i}^{t}-\mu_{i}\right]}\right|\leq(e^{\epsilon}-1+T\delta)\mu_{i}
Remark 2.
Note that since \mu_{i}\in[0,1], and for \epsilon\ll 1, e^{\epsilon}\approx 1+\epsilon, this theorem bounds the bias by roughly \epsilon+T\delta. Often, we will have \delta=0 and so the bias will be bounded by roughly \epsilon.
Proof.
First we fix some notation. Fix any time horizon T, and let (f_{t})_{t\in[T]} be the sequence of selection functions induced by algorithm \mathcal{A}. Let \mathbbm{1}_{\{f_{t}(\Lambda_{t})=i\}} be the indicator for the event that arm i is pulled at time t. We can write the random variable representing the sample mean of arm i at time T as
\hat{Y}_{i}^{T}=\sum_{t=1}^{T}\frac{\mathbbm{1}_{\{f_{t}(\Lambda_{t})=i\}}}{\sum_{t^{\prime}=1}^{T}\mathbbm{1}_{\{f_{t^{\prime}}(\Lambda_{t^{\prime}})=i\}}}\,y_{i,t}
where we recall that y_{i,t} is the random variable representing the reward for arm i at time t. Note that the numerator (f_{t}(\Lambda_{t})=i) is by definition independent of y_{i,t}, but the denominator (\sum_{t^{\prime}=1}^{T}\mathbbm{1}_{\{f_{t^{\prime}}(\Lambda_{t^{\prime}})=i\}}) is not, because for t^{\prime}>t, \Lambda_{t^{\prime}} depends on y_{i,t}. It is this dependence that leads to bias in adaptive data gathering procedures, and that we must argue is mitigated by differential privacy.
We recall that the random variable N_{i}^{T} represents the number of times arm i is pulled through round T: N_{i}^{T}=\sum_{t^{\prime}=1}^{T}\mathbbm{1}_{\{f_{t^{\prime}}(\Lambda_{t^{\prime}})=i\}}. Using this notation, we write the sample mean of arm i at time T as:
\hat{Y}_{i}^{T}=\sum_{t=1}^{T}\frac{\mathbbm{1}_{\{f_{t}(\Lambda_{t})=i\}}}{N_{i}^{T}}\cdot y_{i,t}
We can then calculate:
\displaystyle\mathop{\mathbb{E}}[\hat{Y}_{i}^{T}]=\sum_{t=1}^{T}\mathop{\mathbb{E}}\left[\frac{\mathbbm{1}_{\{f_{t}(\Lambda_{t})=i\}}}{N_{i}^{T}}\,y_{i,t}\right]
\displaystyle=\sum_{t=1}^{T}\mathop{\mathbb{E}}_{y_{i,t}\sim P_{i}}\left[y_{i,t}\cdot\mathop{\mathbb{E}}_{\mathcal{A}}\left[\frac{\mathbbm{1}_{\{f_{t}(\Lambda_{t})=i\}}}{N_{i}^{T}}\,\middle|\,y_{i,t}\right]\right]
where the first equality follows by the linearity of expectation, and the second follows by the law of iterated expectation.
Our goal is to show that the conditioning in the inner expectation does not substantially change the value of the expectation. Specifically, we want to show that for all t, and any value y_{i,t}, we have
\mathop{\mathbb{E}}\left[\frac{\mathbbm{1}_{\{f_{t}(\Lambda_{t})=i\}}}{N_{i}^{T}}\,\middle|\,y_{i,t}\right]\geq e^{-\epsilon}\mathop{\mathbb{E}}\left[\frac{\mathbbm{1}_{\{f_{t}(\Lambda_{t})=i\}}}{N_{i}^{T}}\right]-\delta
If we can show this, then we will have
\mathop{\mathbb{E}}[\hat{Y}_{i}^{T}]\geq\left(e^{-\epsilon}\sum_{t=1}^{T}\mathop{\mathbb{E}}\left[\frac{\mathbbm{1}_{\{f_{t}(\Lambda_{t})=i\}}}{N_{i}^{T}}\right]-T\delta\right)\cdot\mu_{i}
=\left(e^{-\epsilon}\mathop{\mathbb{E}}\left[\frac{N_{i}^{T}}{N_{i}^{T}}\right]-T\delta\right)\cdot\mu_{i}=(e^{-\epsilon}-T\delta)\cdot\mu_{i}
which is what we want (The reverse inequality is symmetric).
This is what we now show to complete the proof. Observe that for all t,i, the quantity \frac{\mathbbm{1}_{\{f_{t}(\Lambda_{t})=i\}}}{N_{i}^{T}} can be derived as a post-processing of the sequence of choices (f_{1}(\Lambda_{1}),\ldots,f_{T}(\Lambda_{T})), and is therefore differentially private in the observed reward sequence. Observe also that the quantity \frac{\mathbbm{1}_{\{f_{t}(\Lambda_{t})=i\}}}{N_{i}^{T}} is bounded in [0,1]. Hence by Lemma 2, for any pair of values y_{i,t},y^{\prime}_{i,t}, we have \mathop{\mathbb{E}}\left[\frac{\mathbbm{1}_{\{f_{t}(\Lambda_{t})=i\}}}{N_{i}^{T}}\,\middle|\,y_{i,t}\right]\geq e^{-\epsilon}\mathop{\mathbb{E}}\left[\frac{\mathbbm{1}_{\{f_{t}(\Lambda_{t})=i\}}}{N_{i}^{T}}\,\middle|\,y^{\prime}_{i,t}\right]-\delta. All that remains is to observe that there must exist some value y^{\prime}_{i,t} such that \mathop{\mathbb{E}}\left[\frac{\mathbbm{1}_{\{f_{t}(\Lambda_{t})=i\}}}{N_{i}^{T}}\,\middle|\,y^{\prime}_{i,t}\right]\geq\mathop{\mathbb{E}}\left[\frac{\mathbbm{1}_{\{f_{t}(\Lambda_{t})=i\}}}{N_{i}^{T}}\right]. (Otherwise, this would contradict \mathop{\mathbb{E}}_{y^{\prime}_{i,t}\sim P_{i}}\left[\mathop{\mathbb{E}}\left[\frac{\mathbbm{1}_{\{f_{t}(\Lambda_{t})=i\}}}{N_{i}^{T}}\,\middle|\,y^{\prime}_{i,t}\right]\right]=\mathop{\mathbb{E}}\left[\frac{\mathbbm{1}_{\{f_{t}(\Lambda_{t})=i\}}}{N_{i}^{T}}\right].) Fixing any such y^{\prime}_{i,t} implies that for all y_{i,t}
\mathop{\mathbb{E}}\left[\frac{\mathbbm{1}_{\{f_{t}(\Lambda_{t})=i\}}}{N_{i}^{T}}\,\middle|\,y_{i,t}\right]\geq e^{-\epsilon}\mathop{\mathbb{E}}\left[\frac{\mathbbm{1}_{\{f_{t}(\Lambda_{t})=i\}}}{N_{i}^{T}}\,\middle|\,y^{\prime}_{i,t}\right]-\delta
\geq e^{-\epsilon}\mathop{\mathbb{E}}\left[\frac{\mathbbm{1}_{\{f_{t}(\Lambda_{t})=i\}}}{N_{i}^{T}}\right]-\delta
as desired. The upper bound on the bias follows symmetrically from Lemma 2. ∎
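The dependence between N_{i}^{T} and the rewards that the proof controls can be seen exactly in a tiny non-private instance. The sketch below is our own illustration: it assumes two Bernoulli(1/2) arms, T=3, and a hypothetical greedy rule that pulls each arm once and then the arm with the larger empirical mean (ties go to arm 0), and computes \mathbb{E}[\hat{Y}_{i}^{T}] by enumerating all histories.

```python
def greedy_select(history, K):
    """Hypothetical non-private f_t: try each arm once, then follow the best empirical mean."""
    for i in range(K):
        if all(j != i for j, _ in history):
            return i

    def mean(i):
        ys = [y for j, y in history if j == i]
        return sum(ys) / len(ys)

    return max(range(K), key=mean)


def expected_sample_means(select, means, T):
    """Exact E[hat{Y}_i^T] for Bernoulli arms, by enumerating every history of length T."""
    K = len(means)
    dist = {(): 1.0}
    for _ in range(T):
        nxt = {}
        for hist, p in dist.items():
            i = select(hist, K)
            for y, py in ((1.0, means[i]), (0.0, 1.0 - means[i])):
                key = hist + ((i, y),)
                nxt[key] = nxt.get(key, 0.0) + p * py
        dist = nxt
    expected = [0.0] * K
    for hist, p in dist.items():
        for i in range(K):
            ys = [y for j, y in hist if j == i]
            expected[i] += p * sum(ys) / len(ys)
    return expected


means = [0.5, 0.5]
expected = expected_sample_means(greedy_select, means, T=3)
bias = [e - m for e, m in zip(expected, means)]
```

In this instance both arms have expected sample mean 7/16 rather than 1/2: an arm that looks bad after one pull is sampled less often, so its unlucky first draw is never averaged out. This is the negative bias that Theorem 2 bounds for private algorithms.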
3.1 A Private UCB Algorithm
There are existing differentially private variants of the classic UCB algorithm (Auer et al. (2002); Agrawal (1995); Lai and Robbins (1985)), which give a nearly optimal tradeoff between privacy and regret Mishra and Thakurta (2014); Tossou and Dimitrakakis (2017, 2016). For completeness, we give a simple version of a private UCB algorithm in the Appendix which we use in our experiments. Here, we simply quote the relevant theorem, which is a consequence of a theorem in Tossou and Dimitrakakis (2016):
Theorem 3.
Tossou and Dimitrakakis (2016) Let \{\mu_{i}:i\in[K]\} be the means of the K arms. Let \mu^{*}=\max_{k}\mu_{k}, let \Delta=\min_{\mu_{k}<\mu^{*}}(\mu^{*}-\mu_{k}), and for each arm k let \Delta_{k}=\mu^{*}-\mu_{k}. Then there is an \epsilon-differentially private algorithm that obtains expected regret bounded by:
\displaystyle\sum_{k\in[K]:\mu_{k}<\mu^{*}}\min\left(\max\left(B(\ln(B)+7),\frac{32}{\Delta_{k}}\log T\right)+\left(\Delta_{k}+\frac{2\pi^{2}\Delta_{k}}{3}\right),\Delta_{k}N_{k}^{T}\right) (2)
where B=\frac{\sqrt{8}}{2\epsilon}\ln(4T^{4}). Taking the worst case over instances (values \Delta_{k}) and recalling that \sum_{k}N_{k}^{T}=T, this implies expected regret bounded by:
O\left(\max\left(\frac{\ln T}{\epsilon}\cdot\left(\ln\ln(T)+\ln(1/\epsilon)\right),\sqrt{kT\log T}\right)\right)
Thus, we can take \epsilon to be as small as \epsilon=O(\frac{\ln^{1.5}T}{\sqrt{kT}}) while still having a regret bound of O(\sqrt{kT\log T}), which is nearly optimal in the worst case (over instances) Audibert and Bubeck (2009).
Combining the above bound with Theorem 2, and letting \epsilon=O(\frac{\ln^{1.5}T}{\sqrt{kT}}), we have:
Corollary 1.
There exists a simple stochastic bandit algorithm that simultaneously guarantees that the bias of the empirical average for each arm i is bounded by O(\mu_{i}\cdot\frac{\ln^{1.5}T}{\sqrt{kT}}) and guarantees expected regret bounded by O(\sqrt{kT\log T}).
Of course, other tradeoffs are possible using different values of \epsilon. For example, the algorithm of Tossou and Dimitrakakis (2016) obtains sublinear regret so long as \epsilon=\omega(\frac{\ln^{2}T}{T}). Thus, it is possible to obtain nontrivial regret while guaranteeing that the bias of the empirical means remains as low as \mathrm{polylog}(T)/T.
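To make the tradeoff in Corollary 1 concrete, the following Python sketch evaluates the privacy parameter \epsilon=\ln^{1.5}T/\sqrt{kT} and the resulting bias bound for a few horizons. It is only illustrative: the hidden constant in the O(\cdot) is taken to be 1 (the parameter c, our assumption), and the bias bound is assumed to take the form (e^{\epsilon}-1)\mu_{i} for pure \epsilon-privacy, which is O(\epsilon\mu_{i}) for small \epsilon. The function names are hypothetical.

```python
import math

def epsilon_for_regret(T, K, c=1.0):
    """epsilon = c * ln(T)^1.5 / sqrt(K*T); c stands in for the hidden O(.) constant."""
    return c * math.log(T) ** 1.5 / math.sqrt(K * T)

def bias_bound(epsilon, mu):
    """Assumed form of the bias bound under pure epsilon-privacy: (e^eps - 1) * mu."""
    return math.expm1(epsilon) * mu

if __name__ == "__main__":
    for T in (10**4, 10**6, 10**8):
        eps = epsilon_for_regret(T, K=20)
        print(T, eps, bias_bound(eps, mu=1.0))
```

Under these assumptions the bound shrinks roughly like 1/\sqrt{T}, up to logarithmic factors, as the horizon grows.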
4 Privacy Reduces Bias in Linear Contextual Bandit Problems
In this section, we extend Theorem 2 to directly show that differential privacy controls a natural measure of “bias” in linear contextual bandit problems as well. We then design and analyze a new differentially private algorithm for the linear contextual bandit problem, based on the LinUCB algorithm Li et al. (2010). This will allow us to give an algorithm which simultaneously offers bias and regret bounds.
In the linear contextual bandit setting, we first need to define what we mean by bias. Recall that rather than simply maintaining an empirical mean for each arm, in the linear contextual bandit case the algorithm maintains an estimate \hat{\theta}_{i,t} of a linear parameter vector \theta_{i} for each arm. One tempting measure of bias in this case is \|\theta_{i}-\mathbb{E}[\hat{\theta}_{it}]\|_{2}, but even in the non-adaptive setting, if the design matrix at arm i is not of full rank, the OLS estimator is not unique, and so this measure of bias is not even well defined. Instead, we note that even when the design matrix is not of full rank, the predicted values on the training set \hat{y}=x_{i,t}\hat{\theta}_{i,t} are unique. As a result, we define bias in the linear contextual bandit setting to be the bias of the predictions that the least squares estimator, trained on the gathered data, makes on the gathered data. We note that if the data were not gathered adaptively, then this quantity would be 0. We choose this measure for illustration; other natural measures of bias can be defined, and they can be bounded using the tools in Section 5.
We write \Lambda_{i,T} to denote the sequence of context/reward pairs for arm i that a contextual bandit algorithm \mathcal{A} has observed through time step T. Note that |\Lambda_{i,T}|=N_{i}^{T}. It will sometimes be convenient to separate out contexts and rewards: we will write C_{i,T} to refer to just the sequence of contexts observed through time T, and D_{i,T} to refer to just the corresponding sequence of rewards observed through time T. Note that once we fix \Lambda_{i,T}, both C_{i,T} and D_{i,T} are determined, but fixing C_{i,T} leaves D_{i,T} a random variable. The randomness in C_{i,T} is over which contexts from arm i the algorithm \mathcal{A} has selected by round T, not over the actual contexts x_{it}, which are fixed. Thus the following results hold over a worst-case set of contexts, including when the contexts are drawn from an arbitrary distribution. We will denote the sequence of arms pulled by \mathcal{A} up to time T by \Lambda_{T}^{\mathcal{A}}. We note that \Lambda_{T}^{\mathcal{A}} fixes C_{i,T} independently of the observed rewards D_{i,T}, and so if \mathcal{A} is differentially private in the observed rewards, the post-processing C_{i,T} is as well. First, we define the least squares estimator:
Definition 3.
Given a sequence of observations \Lambda_{i,T}, a least squares estimator \hat{\theta}_{i} is any vector that satisfies:
\hat{\theta}_{i}\in\arg\min_{\theta}\sum_{(x_{it},y_{i,t})\in\Lambda_{i,T}}(\theta\cdot x_{it}-y_{i,t})^{2}
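The non-uniqueness of the least squares estimator, and the uniqueness of its predictions, can be seen in a tiny example. The Python sketch below uses a hypothetical two-dimensional data set in which every context is a multiple of (1,1), so the design matrix has rank one. The data values and helper names are illustrative assumptions and are not taken from the experiments.

```python
# Rank-deficient least squares: many minimizers, one set of predictions.
# Every context is s * (1, 1), so theta . x depends only on theta[0] + theta[1].
data = [((1.0, 1.0), 0.5), ((2.0, 2.0), 1.1), ((0.5, 0.5), 0.2)]

def predict(theta, x):
    """Linear prediction theta . x for a two-dimensional context."""
    return theta[0] * x[0] + theta[1] * x[1]

def squared_error(theta):
    """Least squares objective on the illustrative data set."""
    return sum((predict(theta, x) - y) ** 2 for x, y in data)

# Optimal value of c = theta[0] + theta[1], from the one-dimensional normal equation.
c = sum(x[0] * y for x, y in data) / sum(x[0] ** 2 for x, _ in data)

# Three different least squares estimators sharing the same coordinate sum c.
estimators = [(c, 0.0), (c / 2, c / 2), (2 * c, -c)]

if __name__ == "__main__":
    for theta in estimators:
        preds = [round(predict(theta, x), 6) for x, _ in data]
        print(theta, preds, round(squared_error(theta), 6))
```

All three parameter vectors attain the same objective value and the same predictions on the training contexts, even though the vectors themselves differ, which is why the bias is defined in terms of predictions.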
Definition 4 (Bias).
Fix a time horizon T, a tableau of contexts, an arm i, and a contextual bandit algorithm \mathcal{A}. Let \hat{\theta}_{i} be the least squares estimator trained on the set of observations \Lambda_{i,T}. Then the bias of arm i is defined to be the maximum bias of the predictions made by \hat{\theta}_{i} on the contexts in C_{i,T}, over any worst case realization of C_{i,T}. The inner expectation is over D_{i,T} since \hat{\theta}_{i} depends on the rewards at arm i.
\mathrm{Bias}(i,T)=\max_{C_{i,T},\;x_{it}\in C_{i,T}}\left|\mathbb{E}_{D_{i,T}}\left[(\hat{\theta}_{i}-\theta_{i})x_{it}\right]\right|
It then follows from an elementary application of differential privacy similar to that in the proof of Theorem 2, that if the algorithm \mathcal{A} makes its arm selection decisions in a way that is differentially private in the observed sequences of rewards, the least squares estimators computed based on the observations of \mathcal{A} have bounded bias as defined above. The proof is deferred to the Appendix.
Theorem 4.
Let \mathcal{A} be any linear contextual bandit algorithm whose selections are \epsilon-differentially private in the rewards. Fix a time horizon T, and let \hat{\theta}_{i} be a least squares estimator computed on the set of observations \Lambda_{i,T}. Then for every arm i\in[K] and any round t:
\mathrm{Bias}(i,T)\leq e^{\epsilon}-1
Below we outline a reward-private variant of the LinUCB algorithm Chu et al. (2011), and state a corresponding regret bound. In combination with Theorem 4 this will give an algorithm that yields a smooth tradeoff between regret and bias. This algorithm is similar to the private linear UCB algorithm presented in Mishra and Thakurta (2014). The main difference compared to the algorithm in Mishra and Thakurta (2014) is that Theorem 4 requires only reward privacy, whereas the algorithm from Mishra and Thakurta (2014) is designed to guarantee privacy of the contexts as well. The result is that we can add less noise, which also makes the regret analysis more tractable (none is given in Mishra and Thakurta (2014)) and the regret bound better. Estimates of the linear function at each arm are based on the ridge regression estimator, which gives a lower bound on the singular values of the design matrix and hence an upper bound on the effect of the noise. As part of the regret analysis we use the self-normalized martingale inequality developed in Abbasi-Yadkori et al. (2011); for details see the proof in the Appendix.
Theorem 5.
Algorithm 1 is \epsilon-reward differentially private and has regret:
R(T)\leq\tilde{O}(d\sqrt{TK}+\sqrt{TKd\lambda}+K\frac{1}{\sqrt{\lambda}}\frac{1}{\epsilon}\log^{1.5}(T/K)\log(K/\delta)\cdot 2d\log(1+T/Kd\lambda)),
with probability 1-\delta.
The following corollary follows by setting \lambda=1 and setting \epsilon to be as small as possible, without it becoming an asymptotically dominant term in the regret bound. We then apply Theorem 4 to convert the privacy guarantee into a bias guarantee.
Corollary 2.
Setting \lambda=1 and \epsilon=O(\sqrt{\frac{K}{T}}), Algorithm 1 has regret:
R(T)=\tilde{O}(d\sqrt{TK}) 
with probability 1-\delta, and for each arm i satisfies
\mathrm{Bias}(i,T)\leq e^{\epsilon}-1=O\left(\sqrt{\frac{K}{T}}\right)
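The final step, e^{\epsilon}-1=O(\epsilon), uses only that \epsilon\leq 1, which holds once T\geq K if the constant in \epsilon=O(\sqrt{K/T}) is taken to be 1 (an assumption made here purely for illustration). A short derivation from the power series of the exponential:

```latex
e^{\epsilon}-1=\sum_{j\geq 1}\frac{\epsilon^{j}}{j!}\leq\epsilon\sum_{j\geq 1}\frac{1}{j!}=(e-1)\,\epsilon\leq 2\epsilon\qquad\text{for }0\leq\epsilon\leq 1.
```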
Remark 3.
Readers familiar with the linear contextual bandit literature will remark that the optimal non-private regret bound in the realizable setting scales like O(\sqrt{Td\log K}) Chu et al. (2011), as opposed to O(d\sqrt{TK}) above. This is an artifact of the fact that for ease of presentation we have analyzed a simpler LinUCB variant using techniques from Abbasi-Yadkori et al. (2011), rather than the more complicated SupLinUCB algorithm of Chu et al. (2011). It is not a consequence of using the binary mechanism to guarantee privacy; it is likely that the same technique would give a private variant of SupLinUCB with a tighter regret bound than the one given above.
5 Max Information & Arbitrary Hypothesis Tests
Up through this point, we have focused our attention on showing how the private collection of data mitigates the effect that adaptivity has on bias, in both the stochastic and contextual bandit problems. In this section, we draw upon more powerful results from the adaptive data analysis literature to go substantially beyond bias: to correct the p-values of hypothesis tests applied to adaptively gathered data. These p-value corrections follow from the connection between differential privacy and a quantity called max information, which controls the extent to which the dependence of the selected test on the dataset can distort the statistical validity of the test (Dwork et al., 2015b; Rogers et al., 2016). We briefly define max information, state the connection to differential privacy, and illustrate how max information bounds can be used to perform adaptive analyses in the private data gathering framework.
Definition 5 (Max-Information Dwork et al. (2015b).).
Let X,Z be jointly distributed random variables over domain (\mathcal{X},\mathcal{Z}). Let X\otimes Z denote the random variable that draws independent copies of X,Z according to their marginal distributions. The max-information between X,Z, denoted I_{\infty}(X,Z), is defined:
I_{\infty}(X,Z)=\log\sup_{\mathcal{O}\subset(\mathcal{X}\times\mathcal{Z})}\frac{\mathbb{P}\left[(X,Z)\in\mathcal{O}\right]}{\mathbb{P}\left[X\otimes Z\in\mathcal{O}\right]}
Similarly, we define the \beta-approximate max information
I_{\beta}(X,Z)=\log\sup_{\mathcal{O}\subset(\mathcal{X}\times\mathcal{Z}),\;\mathbb{P}\left[(X,Z)\in\mathcal{O}\right]>\beta}\frac{\mathbb{P}\left[(X,Z)\in\mathcal{O}\right]-\beta}{\mathbb{P}\left[X\otimes Z\in\mathcal{O}\right]}
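For intuition, when X and Z take finitely many values and \beta=0, the supremum over events is attained at a single outcome pair, because a ratio of sums never exceeds the largest ratio of corresponding terms. The Python sketch below computes I_{\infty}(X,Z) this way. It assumes the logarithm is base 2 (consistent with the factor 2^{k} that appears with max information in this section), and the function name and dictionary representation are hypothetical.

```python
import math
from collections import defaultdict

def max_information(joint):
    """Max-information (in bits) of a finite joint pmf given as {(x, z): probability}."""
    px = defaultdict(float)  # marginal distribution of X
    pz = defaultdict(float)  # marginal distribution of Z
    for (x, z), p in joint.items():
        px[x] += p
        pz[z] += p
    # Largest pointwise ratio between the joint pmf and the product of the marginals.
    worst_ratio = max(p / (px[x] * pz[z]) for (x, z), p in joint.items() if p > 0)
    return math.log2(worst_ratio)

independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
correlated = {(0, 0): 0.5, (1, 1): 0.5}

if __name__ == "__main__":
    print(max_information(independent))  # two independent fair bits
    print(max_information(correlated))   # two perfectly correlated fair bits
```

Independent variables have zero max information, while two perfectly correlated fair bits have one bit of max information.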
Following Rogers et al. (2016), define a test statistic t:\mathcal{D}\to\mathbb{R}, where \mathcal{D} is the space of all datasets. For D\in\mathcal{D}, given an output a=t(D), the p-value associated with the test t on dataset D is p(a)=\mathbb{P}_{D\sim P_{0}}\left[t(D)\geq a\right], where P_{0} is the null hypothesis distribution. Consider an algorithm \mathcal{A}, mapping a dataset to a test statistic.
Definition 6 (Valid p-value Correction Function Rogers et al. (2016).).
A function \gamma:[0,1]\to[0,1] is a valid p-value correction function for \mathcal{A} if the procedure:

1. Select a test statistic t=\mathcal{A}(D)

2. Reject the null hypothesis if p(t(D))\leq\gamma(\alpha)
has probability at most \alpha of rejection, when D\sim P_{0}.
Then the following theorem gives a valid p-value correction function when (D,\mathcal{A}(D)) have bounded \beta-approximate max information.
Theorem 6 (Rogers et al. (2016).).
Let \mathcal{A} be a data-dependent algorithm for selecting a test statistic such that I_{\beta}(X,\mathcal{A}(X))\leq k. Then the following function \gamma is a valid p-value correction function for \mathcal{A}:
\gamma(\alpha)=\max\left(\frac{\alpha-\beta}{2^{k}},0\right)
Finally, we can connect max information to differential privacy, which allows us to leverage private algorithms to perform arbitrary valid statistical tests.
Theorem 7 (Theorem 20 from Dwork et al. (2015b).).
Let \mathcal{A} be an \epsilon-differentially private algorithm, let P be an arbitrary product distribution over datasets of size n, and let D\sim P. Then for every \beta>0:
I_{\beta}(D,\mathcal{A}(D))\leq\log(e)(\epsilon^{2}n/2+\epsilon\sqrt{n\log(2/\beta)/2})
Rogers et al. (2016) extend this theorem to algorithms satisfying (\epsilon,\delta)-differential privacy.
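Theorems 6 and 7 combine into a simple recipe: bound the max information k from \epsilon, n and \beta, then shrink the rejection threshold to (\alpha-\beta)/2^{k}. The Python sketch below implements this recipe for pure \epsilon-privacy. It assumes that \log(e) denotes the base-2 logarithm of e and that the logarithm under the square root is natural; both are our reading of the notation, and the function names are hypothetical.

```python
import math

def max_info_bound(epsilon, n, beta):
    """Theorem 7 bound on the beta-approximate max information, in bits.

    Assumes log(e) is log base 2 of e and the inner logarithm is natural.
    """
    return math.log2(math.e) * (
        epsilon ** 2 * n / 2 + epsilon * math.sqrt(n * math.log(2 / beta) / 2)
    )

def corrected_threshold(alpha, epsilon, n, beta):
    """Theorem 6 correction: gamma(alpha) = max((alpha - beta) / 2^k, 0)."""
    k = max_info_bound(epsilon, n, beta)
    return max((alpha - beta) / 2 ** k, 0.0)

if __name__ == "__main__":
    n, alpha, beta = 10_000, 0.05, 0.01
    for eps in (0.0, 1 / math.sqrt(n), 0.1):
        print(eps, corrected_threshold(alpha, eps, n, beta))
```

Under these assumptions, choosing \epsilon=1/\sqrt{n} makes the bound k independent of n, so the corrected threshold is a constant fraction of \alpha-\beta, whereas a constant \epsilon drives the threshold toward zero as n grows.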
Remark 4.
We note that a hypothesis of this theorem is that the data is drawn from a product distribution. In the contextual bandit setting, this corresponds to rows in the bandit tableau being drawn from a product distribution. This will be the case if contexts are drawn from a distribution at each round, and then rewards are generated as some fixed stochastic function of the contexts. Note that contexts (and even rewards) can be correlated with one another within a round, so long as they are selected independently across rounds. In contrast, the regret bound we prove allows the contexts to be selected by an adversary, but adversarially selected contexts would violate the independence assumption needed for Theorem 7.
We now formalize the process of running a hypothesis test against an adaptively collected dataset. A bandit algorithm \mathcal{A} generates a history \Lambda_{T}\in\mathcal{H}^{T}. Let the reward portion of the gathered dataset be denoted by D_{\mathcal{A}}. We define an adaptive test statistic selector as follows.
Definition 7.
Fix the reward portion of a bandit tableau D and bandit algorithm \mathcal{A}. An adaptive test statistic selector is a function s from action histories to test statistics such that s(\Lambda_{T}^{\mathcal{A}}) is a real-valued function of the adaptively gathered dataset {D_{\mathcal{A}}}.
Importantly, the selection of the test statistic s(\Lambda_{T}^{\mathcal{A}}) can depend on the sequence of arms pulled by \mathcal{A} (and in the contextual setting, on all contexts observed), but not otherwise on the reward portion of the tableau D. For example, t_{\mathcal{A}}=s(\Lambda_{T}^{\mathcal{A}}) could be the t-statistic corresponding to the null hypothesis that the arm i^{*} which was pulled the greatest number of times has mean \mu:
t_{\mathcal{A}}(D_{\mathcal{A}})=\frac{\sum_{t=1}^{N_{i^{*}}^{T}}\left(y_{i^{*}t}-\mu\right)}{\sqrt{N_{i^{*}}^{T}}}
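A minimal Python sketch of such a selector follows. It reads the displayed statistic as the centered sum \sum_{t}(y_{i^{*}t}-\mu)/\sqrt{N_{i^{*}}^{T}}, which is our interpretation, and the function names and list-based history representation are hypothetical. Note that the arm i^{*} is chosen from the action history alone, while the rewards enter only through the statistic itself.

```python
import math
from collections import Counter

def most_pulled_arm(arms):
    """Arm index that appears most often in the action history."""
    return Counter(arms).most_common(1)[0][0]

def centered_sum_statistic(arms, rewards, mu):
    """Sum of (reward - mu) over pulls of the most-pulled arm, divided by sqrt(#pulls)."""
    i_star = most_pulled_arm(arms)  # depends on the action history only
    ys = [y for a, y in zip(arms, rewards) if a == i_star]
    return sum(y - mu for y in ys) / math.sqrt(len(ys))

if __name__ == "__main__":
    arms = [0, 1, 1, 2, 1]
    rewards = [0.2, 0.5, 0.7, 0.1, 0.9]
    print(centered_sum_statistic(arms, rewards, mu=0.5))
```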
By virtue of Theorems 6 and 7, and our view of adaptive data gathering as adaptively selected queries, we get the following corollary:
Corollary 3.
Let \mathcal{A} be an \epsilon-reward differentially private bandit algorithm, and let s be an adaptive test statistic selector. Fix \beta>0, and let \gamma(\alpha)=\frac{\alpha}{2^{\log(e)(\epsilon^{2}T/2+\epsilon\sqrt{T\log(2/\beta)/2})}}, for \alpha\in[0,1]. Then for any adaptively selected statistic t_{\mathcal{A}}=s(\Lambda_{T}^{\mathcal{A}}), and any product distribution P corresponding to the null hypothesis for t_{\mathcal{A}}:
\mathbb{P}_{D\sim P,\mathcal{A}}\left[p(t_{\mathcal{A}}(D))\leq\gamma(\alpha)\right]\leq\alpha
If we set \epsilon=O(1/\sqrt{T}) in Corollary 3, then \gamma(\alpha)=O(\alpha), i.e., a valid p-value correction that only scales \alpha by a constant. For example, in the simple stochastic setting, we can recall Corollary 1 to obtain:
Corollary 4.
Setting \epsilon=O(\frac{\ln^{1.5}T}{\sqrt{kT}}) there exists a simple stochastic bandit algorithm that guarantees expected regret bounded by O(\sqrt{kT\log T}), such that for any adaptive test statistic t evaluated on the collected data, there exists a valid p-value correction function \gamma(\alpha)=O(\alpha).
Of course, our theorems allow us to smoothly trade off the severity of the p-value correction with the regret bound.
6 Experiments
We first validate our theoretical bounds on bias in the simple stochastic bandit setting. As expected, the standard UCB algorithm underestimates the mean at each arm, while the private UCB algorithm of Mishra and Thakurta (2015) obtains very low bias. While using the \epsilon suggested by the theory in Corollary 4 effectively reduces bias and achieves near optimal asymptotic regret, the resulting private algorithm only achieves non-trivial regret for large T due to large constants and logarithmic factors in our bounds. This motivates a heuristic choice of \epsilon that provides no theoretical guarantees on bias reduction, but leads to regret that is comparable to the non-private UCB algorithm. We find empirically that even with this large choice of \epsilon we achieve an 8-fold reduction in bias relative to UCB. This is consistent with the observation that our guarantees hold in the worst case, and suggests that there is room for improvement in our theoretical bounds, both in improving constants in the worst-case bounds on bias and on regret, and in proving instance-specific bounds. Finally, we show that in the linear contextual bandit setting, collecting data adaptively with a linear UCB algorithm and then conducting t-tests for regression coefficients yields incorrect inference (absent a p-value correction). These findings confirm the necessity of our methods when drawing conclusions from adaptively gathered data.
6.1 Stochastic Multi-Armed Bandit
In our first stochastic bandit experiment we set K=20 and T=500. The K arm means are equally spaced between 0 and 1 with gap \Delta=.05, with \mu_{0}=1. We run UCB and \epsilon-private UCB for T rounds with \epsilon=.05, and after each run compute the difference between the sample mean at each arm and the true mean. We repeat this process 10,000 times, averaging to obtain high confidence estimates of the bias at each arm. The average absolute bias over all arms for private UCB was .00176, with the bias for every arm being statistically indistinguishable from 0 (see Figure 2 for confidence intervals), while the average absolute bias (over arms) for UCB was .0698, or roughly 40 times higher. The most biased arm had a measured bias of roughly 0.14, and except for the top 4 arms, the bias of each arm was statistically significant. It is worth noting that private UCB achieves bias significantly lower than the \epsilon=.05 guaranteed by the theory, indicating that the theoretical bounds on bias obtained from differential privacy are conservative. Figures 2, 2 show the bias at each arm for private UCB vs. UCB, with 95% confidence intervals around the bias at each arm. Not only is the bias for private UCB an order of magnitude smaller on average, it also does not exhibit the systematic negative bias evident in Figure 2.
Noting that the observed reduction in bias for \epsilon=.05 exceeded that guaranteed by the theory, we run a second experiment with K=5, T=100000, \Delta=.05, and \epsilon=400, averaging results over 1000 iterations. Figure 3 shows that private UCB achieves sublinear regret comparable with UCB. While \epsilon=400 provides no meaningful theoretical guarantee, the average absolute bias at each arm mean obtained by the private algorithm was .0015 (statistically indistinguishable from 0 at 95% confidence for each arm), while the non-private UCB algorithm obtained average bias .011, roughly 7.5 times larger. The bias reduction for the arm with the smallest mean (for which the bias is the worst with the non-private algorithm) was by more than a factor of 10. Figures 5,5 show the bias at each arm for the private and non-private UCB algorithms together with 95% confidence intervals; again we observe a negative skew in the bias for UCB, consistent with the theory in Nie et al. (2017).
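The bias measurement described in this subsection can be sketched in a few lines of Python. This is not the experimental code: it assumes Bernoulli rewards and the textbook UCB1 index (empirical mean plus \sqrt{2\ln t/n}), neither of which is specified in the text, and it omits the private variant entirely. Function names and parameter choices are hypothetical.

```python
import math
import random

def run_ucb(means, T, rng):
    """Run UCB1 for T >= len(means) rounds on Bernoulli arms; return final sample means."""
    K = len(means)
    counts = [0] * K
    sums = [0.0] * K
    for t in range(T):
        if t < K:
            arm = t  # initialization: pull every arm once
        else:
            arm = max(
                range(K),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t + 1) / counts[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return [sums[i] / counts[i] for i in range(K)]

def estimate_bias(means, T, runs, seed=0):
    """Average of (sample mean - true mean) per arm over independent runs."""
    rng = random.Random(seed)
    totals = [0.0] * len(means)
    for _ in range(runs):
        estimates = run_ucb(means, T, rng)
        for i, est in enumerate(estimates):
            totals[i] += est - means[i]
    return [total / runs for total in totals]

if __name__ == "__main__":
    # Arm means 1.0, 0.95, ..., 0.05 (K = 20, gap 0.05) and T = 500 as in the text;
    # far fewer repetitions than the 10,000 used there, to keep the run short.
    means = [1.0 - 0.05 * i for i in range(20)]
    for mu, b in zip(means, estimate_bias(means, T=500, runs=200)):
        print(f"mu={mu:.2f}  estimated bias={b:+.4f}")
```

With enough repetitions one would expect the estimates for the non-private algorithm to be negative on average, in line with the discussion above, though the magnitudes depend on the assumed reward distribution and index.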
6.2 Linear Contextual Bandits
Nie et al. (2017) prove and experimentally investigate the existence of negative bias at each arm in the simple stochastic bandit case. Our second experiment confirms that adaptivity leads to bias in the linear contextual bandit setting in the context of hypothesis testing, and in particular can lead to false discovery in testing for non-zero regression coefficients. The setup is as follows: for K=5 arms, we observe rewards y_{i,t}\sim\mathcal{N}(\theta_{i}^{\prime}x_{it},1), where \theta_{i},x_{it}\in\mathbb{R}^{5} and \|\theta_{i}\|=\|x_{it}\|=1. For each arm i, we set \theta_{i1}=0. Subject to these constraints, we pick the \theta parameters uniformly at random (once per run), and select the contexts x uniformly at random (at each round). We run a linear UCB algorithm (OFUL, Abbasi-Yadkori et al. (2011)) for T=500 rounds, and identify the arm i^{*} that has been selected most frequently. We then conduct a z-test for whether the first coordinate of \theta_{i^{*}} is equal to 0. By construction the null hypothesis H_{0}:\theta_{i^{*}1}=0 of the experiment is true, and absent adaptivity, the p-value should be uniformly distributed on [0,1]. In particular, for any value of \alpha the probability that the corresponding p-value is less than \alpha is exactly \alpha. We record the observed p-value, and repeat the experiment 1000 times, displaying the histogram of observed p-values in Figure 6. As expected, the adaptivity of the data gathering process leads the p-values to exhibit a strong downward skew. The dotted blue line demarcates \alpha=.05. Rather than probability .05 of falsely rejecting the null hypothesis at 95% confidence, we observe that 76% of the observed p-values fall below the .05 threshold. This shows that a careful p-value correction in the style of Section 2.3 is essential even for simple testing of regression coefficients, lest bias lead to false discovery.
References
 Abbasi-Yadkori et al. [2011] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, pages 2312–2320, USA, 2011. Curran Associates Inc. ISBN 9781618395993. URL http://dl.acm.org/citation.cfm?id=2986459.2986717.
 Agrawal [1995] Rajeev Agrawal. Sample mean based index policies with o(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054–1078, 1995. ISSN 0001-8678. URL http://www.jstor.org/stable/1427934.
 Audibert and Bubeck [2009] Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.
 Auer et al. [2002] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2–3):235–256, May 2002. ISSN 0885-6125. doi: 10.1023/A:1013689704352. URL https://doi.org/10.1023/A:1013689704352.
 Bassily et al. [2016] Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. Algorithmic stability for adaptive data analysis. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 1046–1059. ACM, 2016.
 Bubeck et al. [2012] Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
 Chan et al. [2011] T.-H. Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. ACM Trans. Inf. Syst. Secur., 14(3):26:1–26:24, November 2011. ISSN 1094-9224. doi: 10.1145/2043621.2043626. URL http://doi.acm.org/10.1145/2043621.2043626.
 Chu et al. [2011] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
 Cummings et al. [2016] Rachel Cummings, Katrina Ligett, Kobbi Nissim, Aaron Roth, and Zhiwei Steven Wu. Adaptive learning with robust generalization guarantees. In Conference on Learning Theory, pages 772–814, 2016.
 Deshpande et al. [2017] Yash Deshpande, Lester Mackey, Vasilis Syrgkanis, and Matt Taddy. Accurate inference for adaptive linear models. arXiv preprint arXiv:1712.06695, 2017.
 Dimakopoulou et al. [2017] Maria Dimakopoulou, Susan Athey, and Guido Imbens. Estimation considerations in contextual bandits. arXiv preprint arXiv:1711.07077, 2017.
 Dubhashi and Panconesi [2009] Devdatt P Dubhashi and Alessandro Panconesi. Concentration of measure for the analysis of randomized algorithms. Cambridge University Press, 2009.
 Dwork et al. [2006] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC’06, pages 265–284, Berlin, Heidelberg, 2006. Springer-Verlag. ISBN 3540327312, 9783540327318. doi: 10.1007/11681878_14. URL http://dx.doi.org/10.1007/11681878_14.
 Dwork et al. [2010] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N. Rothblum. Differential privacy under continual observation. In Proceedings of the Forty-second ACM Symposium on Theory of Computing, STOC ’10, pages 715–724, New York, NY, USA, 2010. ACM. ISBN 9781450300506. doi: 10.1145/1806689.1806787. URL http://doi.acm.org/10.1145/1806689.1806787.
 Dwork et al. [2015a] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636–638, 2015a.
 Dwork et al. [2015b] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Generalization in adaptive data analysis and holdout reuse. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pages 2350–2358, Cambridge, MA, USA, 2015b. MIT Press. URL http://dl.acm.org/citation.cfm?id=2969442.2969502.
 Dwork et al. [2015c] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-seventh Annual ACM Symposium on Theory of Computing, STOC ’15, pages 117–126, New York, NY, USA, 2015c. ACM. ISBN 9781450335362. doi: 10.1145/2746539.2746580. URL http://doi.acm.org/10.1145/2746539.2746580.
 Feldman and Steinke [2017a] Vitaly Feldman and Thomas Steinke. Generalization for adaptively-chosen estimators via stable median. In Conference on Learning Theory, pages 728–757, 2017a.
 Feldman and Steinke [2017b] Vitaly Feldman and Thomas Steinke. Calibrating noise to variance in adaptive data analysis. arXiv preprint arXiv:1712.07196, 2017b.
 Hardt and Blum [2015] Moritz Hardt and Avrim Blum. The ladder: a reliable leaderboard for machine learning competitions. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, pages 1006–1014. JMLR.org, 2015.
 Hardt and Ullman [2014] Moritz Hardt and Jonathan Ullman. Preventing false discovery in interactive data analysis is hard. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 454–463. IEEE, 2014.
 Joseph et al. [2018] Matthew Joseph, Michael J. Kearns, Jamie Morgenstern, Seth Neel, and Aaron Roth. Fair algorithms for infinite and contextual bandits. AIES’18, 2018. URL http://arxiv.org/abs/1610.09559.
 Lai and Robbins [1985] T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math., 6(1):4–22, March 1985. ISSN 0196-8858. doi: 10.1016/0196-8858(85)90002-8. URL http://dx.doi.org/10.1016/0196-8858(85)90002-8.
 Li et al. [2010] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW ’10, pages 661–670, New York, NY, USA, 2010. ACM. ISBN 9781605587998. doi: 10.1145/1772690.1772758. URL http://doi.acm.org/10.1145/1772690.1772758.
 Mishra and Thakurta [2014] Nikita Mishra and Abhradeep Thakurta. Private stochastic multi-arm bandits: From theory to practice. In ICML Workshop on Learning, Security, and Privacy, 2014.
 Mishra and Thakurta [2015] Nikita Mishra and Abhradeep Thakurta. (Nearly) optimal differentially private stochastic multi-arm bandits. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI’15, pages 592–601, Arlington, Virginia, United States, 2015. AUAI Press. ISBN 9780996643108. URL http://dl.acm.org/citation.cfm?id=3020847.3020909.
 Nie et al. [2017] X. Nie, X. Tian, J. Taylor, and J. Zou. Why adaptively collected data have negative bias and how to correct for it. ArXiv e-prints, August 2017.
 Rogers et al. [2016] Ryan M. Rogers, Aaron Roth, Adam D. Smith, and Om Thakkar. Max-information, differential privacy, and post-selection hypothesis testing. In IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016, 9–11 October 2016, Hyatt Regency, New Brunswick, New Jersey, USA, pages 487–494, 2016. doi: 10.1109/FOCS.2016.59. URL https://doi.org/10.1109/FOCS.2016.59.
 Russo and Zou [2016] Daniel Russo and James Zou. Controlling bias in adaptive data analysis using information theory. In Artificial Intelligence and Statistics, pages 1232–1240, 2016.
 Steinke and Ullman [2015] Thomas Steinke and Jonathan Ullman. Interactive fingerprinting codes and the hardness of preventing false discovery. In Conference on Learning Theory, pages 1588–1628, 2015.
 Tossou and Dimitrakakis [2016] Aristide C. Y. Tossou and Christos Dimitrakakis. Algorithms for differentially private multi-armed bandits. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pages 2087–2093. AAAI Press, 2016. URL http://dl.acm.org/citation.cfm?id=3016100.3016190.
 Tossou and Dimitrakakis [2017] Aristide C. Y. Tossou and Christos Dimitrakakis. Achieving privacy in the adversarial multi-armed bandit. CoRR, abs/1701.04222, 2017. URL http://arxiv.org/abs/1701.04222.
 Villar et al. [2015] Sofía S Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical science: a review journal of the Institute of Mathematical Statistics, 30(2):199, 2015.
 Wang et al. [2016] Yu-Xiang Wang, Jing Lei, and Stephen E Fienberg. A minimax theory for adaptive data analysis. arXiv preprint arXiv:1602.04287, 2016.
Appendix A Differential Privacy Basics
We recall the standard definition of differential privacy, which can be defined over any neighboring relationship on data sets D,D^{\prime}\in\mathcal{X}^{*}. The standard relation says that D,D^{\prime} are neighbors (written as D\sim D^{\prime}) if they differ in a single element.
Definition 8 (Differential Privacy Dwork et al. [2006]).
Fix \epsilon\geq 0. A randomized algorithm A:\mathcal{X}^{*}\rightarrow\mathcal{O} is (\epsilon,\delta)-differentially private if for every pair of neighboring data sets D\sim D^{\prime}\in\mathcal{X}^{*}, and for every event S\subseteq\mathcal{O}:
\mathbb{P}\left[A(D)\in S\right]\leq\exp(\epsilon)\mathbb{P}\left[A(D^{\prime})\in S\right]+\delta.
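As a concrete instance of the definition, the Python sketch below checks the inequality exhaustively for randomized response on a single bit, a standard textbook mechanism that is not used elsewhere in this paper. The two neighboring data sets are the one-element data sets containing bit 0 and bit 1; the function names are hypothetical.

```python
import math
from itertools import combinations

def randomized_response(bit, epsilon):
    """Output distribution when the true bit is reported with prob. e^eps / (1 + e^eps)."""
    keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return {bit: keep, 1 - bit: 1.0 - keep}

def is_private(p, q, epsilon, delta=0.0, tol=1e-12):
    """Check P[S] <= e^eps * Q[S] + delta, and the reverse, for every event S."""
    outcomes = sorted(set(p) | set(q))
    for size in range(len(outcomes) + 1):
        for event in combinations(outcomes, size):
            ps = sum(p.get(o, 0.0) for o in event)
            qs = sum(q.get(o, 0.0) for o in event)
            if ps > math.exp(epsilon) * qs + delta + tol:
                return False
            if qs > math.exp(epsilon) * ps + delta + tol:
                return False
    return True

if __name__ == "__main__":
    p, q = randomized_response(0, 1.0), randomized_response(1, 1.0)
    print(is_private(p, q, epsilon=1.0))  # mechanism tuned for epsilon = 1
    print(is_private(p, q, epsilon=0.5))  # same mechanism against a stricter epsilon
```

The check enumerates every event because the definition quantifies over all subsets of the output space; for larger output spaces one would instead compare probabilities outcome by outcome when \delta=0.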
Differentially private computations enjoy two nice properties:
Lemma 1 (Post Processing Dwork et al. [2006]).
Let A:\mathcal{X}^{*}\rightarrow\mathcal{O} be any (\epsilon,\delta)-differentially private algorithm, and let f:\mathcal{O}\rightarrow\mathcal{O^{\prime}} be any (possibly randomized) algorithm. Then the algorithm f\circ A:\mathcal{X}^{*}\rightarrow\mathcal{O}^{\prime} is also (\epsilon,\delta)-differentially private.
Post-processing implies that, for example, every decision process based on the output of a differentially private algorithm is also differentially private.
Theorem 8 (Composition Dwork et al. [2006]).
Let A_{1}:\mathcal{X}^{*}\rightarrow\mathcal{O}, A_{2}:\mathcal{X}^{*}\rightarrow\mathcal{O}^{\prime} be algorithms that are (\epsilon_{1},\delta_{1})- and (\epsilon_{2},\delta_{2})-differentially private, respectively. Then the algorithm A:\mathcal{X}^{*}\rightarrow\mathcal{O}\times\mathcal{O^{\prime}} defined as A(x)=(A_{1}(x),A_{2}(x)) is (\epsilon_{1}+\epsilon_{2},\delta_{1}+\delta_{2})-differentially private.
Definition 9.
Two random variables X,Y defined over the same domain R are (\epsilon,\delta)-close, written X\approx_{\epsilon,\delta}Y, if for all S\subseteq R:
\mathbb{P}\left[X\in S\right]\leq e^{\epsilon}\mathbb{P}\left[Y\in S\right]+\delta
Note that if A is an (\epsilon,\delta)-differentially private algorithm, and D,D^{\prime} are neighboring datasets, then A(D)\approx_{\epsilon,\delta}A(D^{\prime}). We make use of a simple lemma:
Lemma 2 (Folklore, but see e.g. Dwork et al. [2015c]).
Let X,Y be distributions such that X\approx_{\epsilon,\delta}Y and let f:\mathcal{Y}\to[0,1] be a real-valued function on the outcome space. Then {\mathbb{E}\left[f(X)\right]}\geq\exp(-\epsilon){\mathbb{E}\left[f(Y)\right]}-\delta.
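For completeness, here is a sketch of why a bound of this kind holds, written for the upper direction; it uses only the definition of (\epsilon,\delta)-closeness and the fact that f takes values in [0,1]. The lower bound \mathbb{E}[f(X)]\geq e^{-\epsilon}\mathbb{E}[f(Y)]-\delta follows by applying the same argument with the roles of X and Y exchanged and rearranging, which is legitimate whenever closeness holds in both directions (as it does for the outputs of a differentially private algorithm on neighboring data sets). This derivation is our addition and is not taken from the cited reference.

```latex
\mathbb{E}[f(X)]=\int_{0}^{1}\mathbb{P}[f(X)>z]\,dz\leq\int_{0}^{1}\left(e^{\epsilon}\,\mathbb{P}[f(Y)>z]+\delta\right)dz=e^{\epsilon}\,\mathbb{E}[f(Y)]+\delta .
```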
Appendix B Useful Concentration Inequalities
Lemma 3 (Hoeffding Bound (See e.g. Dubhashi and Panconesi [2009])).
Let X_{1},\ldots,X_{n} be independent random variables bounded by the interval [0,1]: 0\leq X_{i}\leq 1. Then for t>0, \mathbb{P}\left[\left|\bar{X}-{\mathbb{E}\left[\bar{X}\right]}\right|\geq t\right]\leq 2e^{-2nt^{2}}
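The bound is easy to evaluate numerically. The Python sketch below computes the two-sided tail bound 2e^{-2nt^{2}} and inverts it to find how many samples suffice for a target deviation t and failure probability; the helper names are hypothetical.

```python
import math

def hoeffding_tail(n, t):
    """Two-sided Hoeffding bound on P[|mean - E[mean]| >= t] for n variables in [0, 1]."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * t * t))

def samples_needed(t, failure_prob):
    """Smallest n for which the Hoeffding bound is at most failure_prob."""
    return math.ceil(math.log(2.0 / failure_prob) / (2.0 * t * t))

if __name__ == "__main__":
    n = samples_needed(t=0.05, failure_prob=0.05)
    print(n, hoeffding_tail(n, 0.05))
```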
Appendix C A Private UCB algorithm
Appendix D Missing Proofs
Proof of Theorem 4.
Fix any x_{ik}\in C_{i,T}. We write {\dagger} to denote the matrix inverse in the case it exists, or else the pseudoinverse if not. We first expand {\hat{\theta}_{i}}^{\prime}x_{ik}:
\hat{\theta}_{i}^{\prime}x_{ik}=x_{ik}^{\prime}(\sum_{x_{i,\ell}\in C_{i,T}}x_{i,\ell}x_{i,\ell}^{\prime})^{\dagger}\sum_{x_{i,\ell}\in C_{i,T}}x_{i,\ell}y_{i,\ell}=x_{ik}^{\prime}(\sum_{t=1}^{T}x_{it}x_{it}^{\prime}\mathbbm{1}_{\Lambda_{T}^{\mathcal{A}}(t)=i})^{\dagger}(\sum_{t=1}^{T}x_{it}y_{i,t}\mathbbm{1}_{\Lambda_{T}^{\mathcal{A}}(t)=i}),
where \mathbbm{1}_{\Lambda_{T}^{\mathcal{A}}(t)=i} is the indicator that arm i was pulled at round t. Then we take the conditional expectation of \hat{\theta}_{i}^{\prime}x_{ik}, conditioned on \Lambda_{T}^{\mathcal{A}}. Note that once we condition, (\sum_{t=1}^{T}x_{it}x_{it}^{\prime}\mathbbm{1}_{\Lambda_{T}^{\mathcal{A}}(t)=i})^{\dagger} is just a fixed matrix, and so linearity of expectation will allow us to propagate through to the outer term:
\mathbb{E}_{D_{i,T}}\left[\hat{\theta}_{i}^{\prime}x_{ik}\mid\Lambda_{T}^{\mathcal{A}}\right]=x_{ik}^{\prime}(\sum_{t=1}^{T}x_{it}x_{it}^{\prime}\mathbbm{1}_{\Lambda_{T}^{\mathcal{A}}(t)=i})^{\dagger}(\sum_{t=1}^{T}x_{it}\mathbbm{1}_{\Lambda_{T}^{\mathcal{A}}(t)=i}\mathbb{E}_{D_{i,T}}\left[y_{i,t}\mid\Lambda_{T}^{\mathcal{A}}\right])
Note that we condition on \Lambda_{T}^{\mathcal{A}}, which is an \epsilon-differentially private function of the rewards y_{i,t}, and that y_{i,t}\in[0,1]. Hence by Lemma 2, just as in the proof of Theorem 2, we have that \mathbb{E}_{D_{i,T}}\left[y_{i,t}\mid\Lambda_{T}^{\mathcal{A}}\right]\leq e^{\epsilon}\mathbb{E}_{D_{i,T}}\left[y_{i,t}\right]=e^{\epsilon}x_{it}\cdot\theta_{i}. Substituting into the above gives:
\mathbb{E}_{D_{i,T}}\left[\hat{\theta}_{i}^{\prime}x_{ik}\mid\Lambda_{T}^{\mathcal{A}}\right]\leq e^{\epsilon}x_{ik}^{\prime}(\sum_{t=1}^{T}x_{it}x_{it}^{\prime}\mathbbm{1}_{\Lambda_{T}^{\mathcal{A}}(t)=i})^{\dagger}(\sum_{t=1}^{T}x_{it}\mathbbm{1}_{\Lambda_{T}^{\mathcal{A}}(t)=i}x_{it}^{\prime}\theta_{i})
=e^{\epsilon}x_{ik}^{\prime}(\sum_{t=1}^{T}x_{it}x_{it}^{\prime}\mathbbm{1}_{\Lambda_{T}^{\mathcal{A}}(t)=i})^{\dagger}(\sum_{t=1}^{T}x_{it}x_{it}^{\prime}\mathbbm{1}_{\Lambda_{T}^{\mathcal{A}}(t)=i})\theta_{i}=e^{\epsilon}x_{ik}^{\prime}\theta_{i},
where the last equality follows immediately when \sum_{t=1}^{T}x_{it}x_{it}^{\prime}\mathbbm{1}_{\Lambda_{T}^{\mathcal{A}}(t)=i} is full-rank, and follows from properties of the pseudoinverse even if it is not. But then we’ve shown that \mathbb{E}_{D_{i,T}}\left[\hat{\theta}_{i}^{\prime}x_{ik}-\theta_{i}^{\prime}x_{ik}\mid\Lambda_{T}^{\mathcal{A}}\right]\leq(e^{\epsilon}-1)\theta_{i}^{\prime}x_{ik}\leq e^{\epsilon}-1, since by assumption \theta_{i}^{\prime}x_{ik}\leq 1. Since this bound holds conditionally on \Lambda_{T}^{\mathcal{A}}, we can integrate out \Lambda_{T}^{\mathcal{A}} to obtain:
\mathbb{E}_{D_{i,T}}\left[\hat{\theta}_{i}^{\prime}x_{ik}-\theta_{i}^{\prime}x_{ik}\right]\leq e^{\epsilon}-1
The lower bound \mathbb{E}_{D_{i,T}}\left[\hat{\theta}_{i}^{\prime}x_{ik}\right]-\theta_{i}^{\prime}x_{ik}\geq e^{-\epsilon}-1\geq 1-e^{\epsilon} follows from the reverse direction of Lemma 2. Since this holds for arbitrary C_{i,T} and x_{ik}\in C_{i,T} we are done. ∎
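The pseudoinverse step in the proof above can be checked on a small rank-deficient example. The sketch below is ours and only illustrative: it assumes two collinear contexts in \mathbb{R}^{2}, so that M=\sum_{t}x_{it}x_{it}^{\prime} has rank one, and uses the closed form (c\,vv^{\prime})^{\dagger}=vv^{\prime}/(c\|v\|^{4}) for a rank-one matrix. It confirms that x^{\prime}M^{\dagger}M=x^{\prime} for a context x in the span of the pulled contexts, and that the identity fails for a vector outside that span.

```python
def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def vecmat(x, A):
    """Row vector times matrix: x' A."""
    return [sum(x[k] * A[k][j] for k in range(len(x)))
            for j in range(len(A[0]))]


def rank_one_pinv(c, v):
    """Pseudoinverse of c * v v' (c != 0, v != 0): v v' / (c * ||v||^4)."""
    n2 = sum(a * a for a in v)
    return [[a * b / (c * n2 * n2) for b in v] for a in v]


v = [1.0, 2.0]
# Assumed contexts x1 = v and x2 = 2v, so M = x1 x1' + x2 x2' = 5 v v'.
M = [[5.0 * a * b for b in v] for a in v]
M_pinv = rank_one_pinv(5.0, v)
P = matmul(M_pinv, M)           # orthogonal projection onto span{v}

x_in = [1.0, 2.0]               # one of the pulled contexts
x_out = [1.0, 0.0]              # a vector outside the span
in_result = vecmat(x_in, P)     # recovers x_in up to rounding
out_result = vecmat(x_out, P)   # does not recover x_out
```

This is why the proof fixes x_{ik}\in C_{i,T}: the identity is only needed, and only guaranteed, for contexts that contributed to the sum.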
Proof of Theorem 5.
The reward-privacy claim follows immediately from the privacy of the hybrid mechanism Chan et al. [2011] and the post-processing property of differential privacy (Lemma 1). Here we prove the regret bound.
We first show that the confidence intervals given by \hat{y}_{tk}\pm(\frac{1}{\lambda}s_{it}+w_{it}) are valid \forall i,t with probability 1-\delta.
Then since we always play the action with the highest upper confidence bound, with high probability we can bound our regret at time T by the sum of the widths of the confidence intervals of the chosen actions at each time step.
We know from Abbasi-Yadkori et al. [2011] that \forall i,T, \mathbb{P}\left[\langle\theta_{it},x_{it}\rangle\in[\langle\hat{\theta}_{it},x_{it}\rangle\pm w_{it}]\right]\geq 1-\frac{\delta}{2}. By construction,
|\langle\hat{\theta}_{it}^{priv},x_{it}\rangle-\langle\hat{\theta}_{it},x_{it}\rangle|\leq|x_{it}^{\prime}\hat{V}_{it}^{-1}\eta_{it}|\leq\|x_{it}\|_{\hat{V}_{it}^{-1}}\|\eta_{it}\|_{\hat{V}_{it}^{-1}},  (3)
where the second inequality follows from applying the Cauchy-Schwarz inequality with respect to the matrix inner product \langle\cdot,\cdot\rangle_{\hat{V}_{it}^{-1}}. We also have that \|\eta_{it}\|_{\hat{V}_{it}^{-1}}\leq\frac{1}{\sqrt{\lambda}}\|\eta_{it}\|_{2}, and by the utility theorem for the Hybrid mechanism Chan et al. [2011], with probability 1-\delta/2, \forall i,t, \|\eta_{it}\|_{2}\leq s_{it}=O(\frac{1}{\epsilon}\log^{1.5}T\log(K/\delta)). Thus by the triangle inequality and a union bound, with probability 1-\delta, \forall i,t:
|\langle\theta_{it},x_{it}\rangle-\langle\hat{\theta}_{it}^{priv},x_{it}\rangle|\leq O(\frac{1}{\sqrt{\lambda}}\frac{1}{\epsilon}\log^{1.5}T\log(K/\delta)\|x_{it}\|_{\hat{V}_{it}^{-1}})+w_{it}.
Let R(T) denote the pseudo-regret at time T, and R_{i}(T) denote the sum of the widths of the confidence intervals at arm i, over all times in which arm i was pulled. Then with probability 1-\delta:
R(T)\leq\sum_{i}R_{i}(T)\leq\sum_{i=1}^{K}\bigg(\sum_{t=1}^{N_{i}^{T}}w_{i_{t}t}+\frac{1}{\sqrt{\lambda}}\frac{1}{\epsilon}\log^{1.5}N_{i}^{T}\log(K/\delta)\bigg(K\sum_{t=1}^{N_{i}^{T}/K}\|x_{it}\|_{\hat{V}_{it}^{-1}}\bigg)\bigg)
The RHS is maximized at N_{i}^{T}=\frac{T}{K} for all i, giving:
R(T)\leq K\bigg(\sum_{t=1}^{T/K}w_{i_{t}t}+\frac{1}{\sqrt{\lambda}}\frac{1}{\epsilon}\log^{1.5}(T/K)\log(K/\delta)\bigg(K\sum_{t=1}^{T/K}\|x_{it}\|_{\hat{V}_{it}^{-1}}\bigg)\bigg)
Reproducing the analysis of Abbasi-Yadkori et al. [2011], made more explicit on page 13 in the Appendix of Joseph et al. [2018], gives:
\sum_{t=1}^{T/K}w_{i_{t}t}\leq\sqrt{2d\log(1+\frac{T}{\lambda Kd})}(\sqrt{2dT/K\log(1/\delta+\frac{T}{K\lambda\delta})}+\sqrt{\frac{T}{K}\lambda})
The crux of their analysis is actually the bound \sum_{t=1}^{n}\|x_{it}\|_{\hat{V}_{it}^{-1}}\leq 2d\log(1+n/d\lambda), which holds for \lambda\geq 1. Letting n=T/K bounds the second summation, giving that with probability 1-\delta:
R(T)=\tilde{O}(d\sqrt{TK}+\sqrt{TKd\lambda}+K\frac{1}{\sqrt{\lambda}}\frac{1}{\epsilon}\log^{1.5}(T/K)\log(K/\delta)\cdot 2d\log(1+T/Kd\lambda)),
where \tilde{O} hides logarithmic terms in 1/\lambda,1/\delta,T,K,d. ∎
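The two norm inequalities behind (3) in the proof above, namely Cauchy-Schwarz in the \hat{V}^{-1} inner product and \|\eta\|_{\hat{V}^{-1}}\leq\|\eta\|_{2}/\sqrt{\lambda}, can be sanity-checked numerically. The following is a minimal sketch under assumptions of ours: dimension d=2, a regularized design matrix \hat{V}=\lambda I+\sum_{t}x_{t}x_{t}^{\prime} built from random contexts, and Gaussian vectors standing in for the noise \eta (this is not the noise distribution of the hybrid mechanism).

```python
import math
import random


def inv2(A):
    """Inverse of a 2x2 matrix given as a list of rows."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def quad(u, A, v):
    """Bilinear form u' A v for 2-dimensional vectors."""
    return sum(u[i] * A[i][j] * v[j] for i in range(2) for j in range(2))


rng = random.Random(1)
lam = 1.0

# Regularized design matrix V = lam * I + sum_t x_t x_t' (assumed contexts).
V = [[lam, 0.0], [0.0, lam]]
for _ in range(20):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    for i in range(2):
        for j in range(2):
            V[i][j] += x[i] * x[j]
V_inv = inv2(V)

checks = []
for _ in range(100):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    eta = [rng.gauss(0, 1), rng.gauss(0, 1)]   # stand-in noise vector
    lhs = abs(quad(x, V_inv, eta))             # |x' V^{-1} eta|
    norm_x = math.sqrt(quad(x, V_inv, x))      # ||x||_{V^{-1}}
    norm_eta = math.sqrt(quad(eta, V_inv, eta))
    norm_eta_2 = math.sqrt(eta[0] ** 2 + eta[1] ** 2)
    cauchy_schwarz = lhs <= norm_x * norm_eta + 1e-12
    lambda_bound = norm_eta <= norm_eta_2 / math.sqrt(lam) + 1e-12
    checks.append(cauchy_schwarz and lambda_bound)
```

The second inequality holds because \hat{V}\succeq\lambda I implies \hat{V}^{-1}\preceq\lambda^{-1}I, so a larger regularizer \lambda shrinks the contribution of the privacy noise to the confidence width.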
Proof of Claim 1.
We first remark that by the principle of deferred randomness we can view Algorithm 3 as first drawing the tableau D\in([0,1]^{K})^{T} and receiving C\in((\mathbb{R}^{d})^{K})^{T} up front, and then in step 4 publishing y_{i_{t},t} rather than drawing a fresh y_{i_{t},t}. Then, because for both Algorithm 2 and Algorithm 3 the tableau distributions are the same, it suffices to show that conditioning on D, the distributions induced on the action histories \Lambda_{t}^{\mathcal{A}} are the same. For both algorithms, at round t, there is some distribution over the next arm pulled i_{t}. We can write the joint distribution over \Lambda_{t+1}^{\mathcal{A}}=(i_{1},\ldots,i_{t}) as:
\mathbb{P}\left[i_{1},\ldots,i_{t}\right]=\prod_{k=1}^{t}\mathbb{P}\left[i_{k}\mid i_{k-1},\ldots,i_{1}\right]
For Algorithm 2, \mathbb{P}\left[i_{k}\mid i_{k-1},\ldots,i_{1}\right] is equal to \mathbb{P}\left[f_{k}(\Lambda_{k})=i_{k}\right]. For Algorithm 3 it is \mathbb{P}\left[q_{k}(D)=i_{k}\right]. But by definition \mathbb{P}\left[q_{k}(D)=i_{k}\right]=\mathbb{P}\left[q_{\Lambda_{k}^{\mathcal{A}}}(D)=i_{k}\right]=\mathbb{P}\left[f_{k}(\Lambda_{k}^{\mathcal{A}}(D))=i_{k}\right]=\mathbb{P}\left[f_{k}(\Lambda_{k})=i_{k}\right], and so the joint distributions coincide. ∎