Linear Bandits with Stochastic Delayed Feedback
Abstract
Stochastic linear bandits are a natural and well-studied model for structured exploration/exploitation problems and are widely used in applications such as online marketing and recommendation. One of the main challenges faced by practitioners hoping to apply existing algorithms is that the feedback is usually randomly delayed and the delays are only partially observable. For example, while a purchase is usually observable some time after the display, the decision not to buy is never explicitly sent to the system. In other words, the learner only observes delayed positive events. We formalize this problem as a novel stochastic delayed linear bandit and propose two computationally efficient algorithms able to integrate new information as it becomes available and to deal with the permanently censored feedback. We prove optimal bounds on the regret of the first algorithm and study the dependency on delay-dependent parameters. Our model, assumptions and results are validated by experiments on simulated and real data.
1 Introduction
Content optimization for websites and online advertising are among the main industrial applications of bandit algorithms Chapelle and Li (2011); Chapelle (2014). The dedicated services sequentially choose an option among several possibilities and display it on a web page to a particular customer. In most real world architectures, for each recommendation request, the features of the products are joined and hashed with those of the current user and provide a (finite) action set included in $\mathbb{R}^d$. For that purpose, linear bandits Chu et al. (2011); Abbasi-Yadkori et al. (2011) are among the most widely adopted models, as they make it possible to take into account the structure of the space where the action vectors lie.
A key aspect of these interactions through displays on webpages is the time needed by a customer to make a decision and provide feedback to the learning algorithm, also known as the conversion indicator Chapelle (2014); Diemert et al. (2017). For example, a mid-size e-commerce website can serve hundreds of recommendations per second, but customers need minutes, or even hours, to make a purchase. In Chapelle (2014), the authors ran multiple tests on proprietary industrial datasets, providing a good example of how delays affect the performance of click-through rate estimation. They extracted 30 days of display advertising data and found that delays are on average of the order of several hours and up to several days.
At the other extreme of the time scale, some companies optimize long-term metrics for customer engagement (e.g., accounting for returned products in the sales results) which by definition can be computed only several weeks after the bandit has played the action. Moreover, after a piece of content is displayed on a page, the user may or may not decide to react (e.g., click, buy a product). In the negative case, no signal is sent to the system and the learner cannot distinguish between actions for which the user did not click and those where they did, but the conversion is delayed. Note that in the use cases we consider, when feedback is received, the learner is able to attribute it to the past action that triggered it.
Two major requirements for bandit algorithms to be ready to run in a real online service are the ability to leverage contextual information and the ability to handle delayed feedback. Many approaches are available to deal with contextual information Abbasi-Yadkori et al. (2011); Agarwal et al. (2014); Auer et al. (2002); Neu (2015); Beygelzimer et al. (2011); Chu et al. (2011); Zhou (2015). Delays have been identified as a major problem in online applications Chapelle (2014). We give an overview of the existing literature in Section 7. However, to the best of our knowledge, no existing algorithm addresses this problem under both requirements defined above.
Contributions
Our main contribution is a novel bandit algorithm called OnTheFlyLinUCB. The algorithm is based on the approach of Abbasi-Yadkori et al. (2011), but with confidence intervals and least-squares estimators that are adapted to account for the delayed and censored rewards (Section 3). The algorithm is complemented by a regret analysis, including lower bounds (Section 5). We then provide a variant inspired by Thompson sampling and Follow the Perturbed Leader (Section 6) and evaluate the empirical performance of all algorithms in Section 8.
2 Learning Setting under Delayed Feedback
Notation
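All vectors are in $\mathbb{R}^d$ where $d$ is fixed. For any symmetric positive definite matrix $M$ and vector $x \in \mathbb{R}^d$, $\|x\|_M = \sqrt{x^\top M x}$ and $\|x\|_2 = \|x\|_I$ is the usual $\ell_2$ norm of $x$, where $I$ denotes the identity matrix in dimension $d$.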
Learning setting
Our setup involves a learner interacting with an environment over $T$ rounds. The environment depends on an unknown parameter $\theta \in \mathbb{R}^d$ with $\|\theta\|_2 \le 1$ and a delay distribution $\mathcal{D}$ supported on the natural numbers. Note, in contrast to Vernade et al. (2017), we do not assume the learner knows $\mathcal{D}$. Then, in each round $t$,

1. The learner receives from the environment a finite set of actions $\mathcal{A}_t \subset \mathbb{R}^d$ with $|\mathcal{A}_t| = K_t < \infty$ and $\langle a, \theta \rangle \in [0, 1]$ and $\|a\|_2 \le 1$ for all $a \in \mathcal{A}_t$.

2. The learner selects an action $A_t$ from $\mathcal{A}_t$ based on information observed so far.

3. The environment samples a reward $X_t \in \{0, 1\}$ and a delay $D_t \in \mathbb{N}$ that are partially revealed to the learner and where:
3.a. $X_t \sim \mathrm{Bernoulli}(\langle A_t, \theta \rangle)$.
3.b. $D_t$ is sampled independently of $A_t$ from $\mathcal{D}$.

4. Certain rewards resulting from previous actions are revealed to the learner. For $s \le t$, let $C_{s,t} = \mathbb{1}\{D_s \le t - s\}$, which is called the censoring variable and indicates whether or not the reward resulting from the decision in round $s$ is revealed by round $t$. Then let $Y_{s,t} = C_{s,t} X_s$. The learner observes the collection $\{Y_{s,t} : s \le t\}$ at the end of round $t$. If $Y_{s,t} = 1$, we say that the action $A_s$ converts.

The delays in combination with the Bernoulli noise model and censored observations introduce an interesting structure. When $X_s = 0$, then $Y_{s,t} = 0$ for all $t \ge s$, but $Y_{s,t} = 0$ is also possible when $X_s = 1$ and the reward from round $s$ has been delayed sufficiently. On the other hand, if $Y_{s,t} = 1$, the learner can immediately deduce that $X_s = 1$.
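The interaction protocol above can be sketched as a small simulation. This is only an illustrative sketch, not the authors' code: the function names, the round-robin placeholder policy and the geometric delay distribution are assumptions made for the example, and the censoring rule is taken to be "the reward of round s is revealed by round t once its delay is at most t - s".

```python
import random


def observed_feedback(rewards, delays, t):
    """Censored observations available at the end of round t (0-indexed).

    rewards[s] is the Bernoulli reward of round s and delays[s] its delay.
    The reward of round s is revealed only if delays[s] <= t - s; otherwise
    the learner sees 0, whether or not a conversion will arrive later.
    """
    observations = []
    for s in range(min(t + 1, len(rewards))):
        revealed = 1 if delays[s] <= t - s else 0   # censoring variable
        observations.append(revealed * rewards[s])  # what the learner sees
    return observations


def simulate(theta, actions, horizon, delay_success_prob, seed=0):
    """Generate rewards and delays under a placeholder round-robin policy."""
    rng = random.Random(seed)
    rewards, delays = [], []
    for t in range(horizon):
        a = actions[t % len(actions)]
        mean = sum(ai * thi for ai, thi in zip(a, theta))  # assumed in [0, 1]
        rewards.append(1 if rng.random() < mean else 0)
        # Geometric delay on {0, 1, 2, ...}: an assumption for illustration.
        d = 0
        while rng.random() > delay_success_prob:
            d += 1
        delays.append(d)
    return rewards, delays
```

Calling `observed_feedback` for increasing `t` shows the key asymmetry: an observed 1 is final, whereas an observed 0 may later turn into a 1.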
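The goal of the learner is to sequentially minimize the cumulative regret after $T$ rounds, which is
$$R(T) = \sum_{t=1}^{T} \langle \theta, A_t^* \rangle - \sum_{t=1}^{T} \langle \theta, A_t \rangle, \tag{1}$$
where $A_t^* = \arg\max_{a \in \mathcal{A}_t} \langle \theta, a \rangle$ is the action that maximises the expected reward in round $t$.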
Remark 1.
The assumption that $\langle a, \theta \rangle \in [0, 1]$ for all $a \in \mathcal{A}_t$ ensures the reward is well defined. A natural alternative is to replace the linear model with a generalized linear model. Our algorithms and analysis generalize to this setting in the natural way using the techniques of Filippi et al. (2010); Jun et al. (2017). For simplicity, however, we restrict our attention to the linear model. The assumptions that $\|\theta\|_2 \le 1$ and $\|a\|_2 \le 1$ for all actions are quite standard and the dependence of our results on alternative bounds is relatively mild.
3 Concentration for Least Squares Estimators with Delays
This section is dedicated to disentangling the delays and the reward estimation. The combination of an unknown delay distribution and censored binary rewards makes it hopeless to store all past actions and wait for every conversion. For this reason our algorithm depends on a window parameter $m$. If a reward has not converted within $m$ rounds, the algorithm assumes it will never convert and ignores any subsequent signals related to this decision. There is also a practical advantage, which is that the learner does not need to store individual actions that occurred more than $m$ rounds in the past. Define
$$\tilde Y_{s,t} = X_s \, \mathbb{1}\{D_s \le \min(m, t - s)\},$$
which is the same as $Y_{s,t}$ except rewards that convert after more than $m$ rounds are ignored. The learner then uses $\tilde Y_{s,t}$ to estimate a parameter that is proportional to $\theta$ using regularized least squares. Let $\lambda > 0$ be a regularization parameter and define
$$\hat\theta_t^W = \Big(\sum_{s=1}^{t-1} A_s A_s^\top + \lambda I\Big)^{-1} \Big(\sum_{s=1}^{t-1} \tilde Y_{s,t} A_s\Big) =: V_t(\lambda)^{-1} B_t. \tag{2}$$
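The windowed estimator described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the function and argument names are hypothetical, the windowed observation of round s is assumed to equal the reward when the delay is at most min(m, t - s) and zero otherwise, and the estimator is assumed to be plain ridge regression on those observations.

```python
import numpy as np


def windowed_ridge_estimate(actions, rewards, delays, t, m, lam):
    """Ridge estimate at round t using only rewards that converted in time.

    actions: array-like of shape (n, d); rewards, delays: length-n sequences.
    Rounds s = 0, ..., t - 1 are used. A reward counts only if its delay is
    at most min(m, t - s), i.e. it has arrived AND did so within the window.
    """
    actions = np.asarray(actions, dtype=float)
    d = actions.shape[1]
    V = lam * np.eye(d)   # regularized design matrix
    B = np.zeros(d)       # sum of windowed observations times actions
    for s in range(t):
        a = actions[s]
        V += np.outer(a, a)
        if delays[s] <= min(m, t - s):
            B += rewards[s] * a
    return np.linalg.solve(V, B)
```

Because late conversions are dropped, the estimate targets a shrunken version of the parameter (scaled by the probability of converting within the window), which is consistent with the text's remark that the estimated parameter is only proportional to the true one.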
We now state our main deviation result.
Theorem 1.
Let and . Then the following holds for all with probability at least .
(3) 
where
(4) 
Proof.
Let and be the two types of noise affecting our observations, both being centered Bernoulli and independent of the past conditionally on . By (Lattimore and Szepesvári, 2019, Theorem 20.4) it holds with probability at least that for all ,
(5) 
where we used that for the first inequality. We comment on that step further below.
The next step is to decompose with respect to the value of the censoring variables of the learner. We first rewrite by making explicit the value of the indicator function in each term of its sum:
where we added terms in the first sum and removed them in the second one. The second sum now contains the terms that will eventually convert but have not been received yet.
Now assume both events in Equation 5 hold. Using the decomposition above, we have
The last term can be naively bounded and gives the second term of Equation 3. We can bound the first term using Equation 5:
where both the first and second inequalities follow from the triangle inequality applied to and the last from the assumption that the events in Equation 5 hold. ∎
Remark 2.
The initial step of the proof in Equation 5 might seem loose but we explain why this term cannot easily be bounded more tightly. Note that the variance of is and thus, when applying (Lattimore and Szepesvári, 2019, Theorem 20.4), we could obtain a tighter bound by taking it into account, which would lead to . But had we included it there, it would have appeared in the expression of the upper bound, i.e. in the algorithm, and the learner would have needed to know it to compute the upper bound. This was the choice made by Vernade et al. (2017). By removing it, we pay the price of slightly larger confidence intervals (more exploration) for not having to give prior information to the learner. We discuss other possible approaches in the conclusion.
Practical considerations
As we mentioned already, a practical advantage of the windowing idea is that the learner need not store indefinitely the actions for which feedback has not yet been received. The cutoff time is often rather long, even as much as 30 days Chapelle and Li (2011).
Choosing the window
The windowing parameter is often a constraint of the system and the learner cannot choose it. Our results show the price of this external censoring on the regret. If the learner is able to choose the choice is somewhat delicate. The learner effectively discards proportion of the data, so ideally should be large, which corresponds to large . But there is a price for this. The learner must store actions and the confidence interval also depends (somewhat mildly) on . When the mean of the delay distribution is finite and known, then a somewhat natural choice of the windowing parameter is . By Markov’s inequality this ensures that . The result continues to hold if the learner only knows an upper bound on .
Precisely how should be chosen depends on the underlying problem. We discuss this issue in more detail in Section 5 where the regret analysis is provided.
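The Markov-inequality argument for choosing the window can be checked numerically. The sketch below is illustrative only: the helper names are hypothetical and the geometric delay distribution is an assumption. For a non-negative delay with a given mean, Markov's inequality says the probability of exceeding a window is at most the mean divided by the window, so, for instance, a window of twice the mean keeps at least half of the conversions.

```python
def markov_window_lower_bound(mean_delay, window):
    """Lower bound on P(delay <= window) for a non-negative delay.

    Markov's inequality gives P(delay > window) <= mean_delay / window,
    hence P(delay <= window) >= 1 - mean_delay / window.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return max(0.0, 1.0 - mean_delay / window)


def geometric_cdf(success_prob, window):
    """Exact P(delay <= window) for a geometric delay on {0, 1, 2, ...}."""
    return 1.0 - (1.0 - success_prob) ** (window + 1)


# Example: geometric delays with success probability 0.1 have mean 9.
p = 0.1
mean = (1.0 - p) / p
window = int(2 * mean)
bound = markov_window_lower_bound(mean, window)
exact = geometric_cdf(p, window)
```

In this example the exact conversion probability is well above the Markov bound, which illustrates that the bound is a worst-case guarantee and can be quite conservative for light-tailed delays.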
4 Algorithm
We are now equipped to present OnTheFlyLinUCB, an optimistic linear bandit algorithm that uses the concentration analysis from the previous section. The pseudocode of OnTheFlyLinUCB is given in Algorithm 1. It accepts as input a confidence level, a window parameter and a regularization parameter. In each round the algorithm computes the estimator using Equation 2 and for each arm computes an upper confidence bound on the expected reward defined by
(6) 
Then action is chosen to maximize the upper confidence bound:
where ties are broken arbitrarily.
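The optimistic selection step can be sketched as follows. This is a hedged illustration rather than the algorithm as specified in the paper: it assumes a generic ellipsoidal bound of the form "estimated reward plus radius times the norm of the action in the inverse design matrix", and the `radius` argument is a placeholder for the confidence width, whose exact expression is not reproduced here. All names are hypothetical.

```python
import numpy as np


def select_optimistic_action(action_set, theta_hat, V_inv, radius):
    """Return the index of the action with the largest upper confidence bound.

    The bound used is <a, theta_hat> + radius * sqrt(a^T V_inv a), a generic
    ellipsoidal form; `radius` stands in for the confidence width.
    """
    action_set = np.asarray(action_set, dtype=float)
    means = action_set @ theta_hat
    # sqrt(a^T V_inv a) for every row a of action_set at once.
    widths = np.sqrt(np.einsum("ij,jk,ik->i", action_set, V_inv, action_set))
    ucb = means + radius * widths
    return int(np.argmax(ucb)), ucb
```

With a zero radius the rule reduces to greedy selection on the estimated rewards; a larger radius favours directions that have been explored less.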
Implementation details
The algorithm needs to keep track of and as defined in Equation 2. These can be updated incrementally as actions are taken and information is received. The algorithm also uses , which can be updated incrementally using the Sherman–Morrison formula. In order to recompute the algorithm needs to store the most recent actions, which are also used to update .
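The incremental inverse update mentioned above can be illustrated with the Sherman–Morrison formula for a rank-one update. The helper name below is hypothetical; the formula itself is standard and assumes the matrix being updated is symmetric positive definite, as is the case for a regularized design matrix.

```python
import numpy as np


def sherman_morrison_update(V_inv, a):
    """Return the inverse of (V + a a^T) given V_inv = V^{-1}.

    Uses (V + a a^T)^{-1} = V^{-1} - (V^{-1} a)(V^{-1} a)^T / (1 + a^T V^{-1} a),
    valid for symmetric V. Costs O(d^2) instead of O(d^3) for a fresh inverse.
    """
    Va = V_inv @ a
    return V_inv - np.outer(Va, Va) / (1.0 + a @ Va)


# Usage: maintain the inverse of lam * I + sum_s a_s a_s^T round by round.
rng = np.random.default_rng(0)
d, lam = 3, 1.0
V = lam * np.eye(d)
V_inv = np.eye(d) / lam
for _ in range(20):
    a = rng.normal(size=d)
    V += np.outer(a, a)
    V_inv = sherman_morrison_update(V_inv, a)
```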
Computation complexity
The computation complexity is dominated by three operations: (1) updating and computing its inverse, which takes computation steps using a rank-one update; (2) computing the radius of the confidence ellipsoid, which requires computations, one for each of the last actions; and (3) iterating over the actions and computing the upper confidence bounds, which requires computations. Hence the total computation per round is .
Space complexity
The space complexity is dominated by: (1) Storing the matrix and (2) storing the most recent actions, which are needed to compute the least squares estimator and the upper confidence bound. Hence, the space complexity is .
Improved computation complexity
Because changes in every round, the radius of the confidence set needs to be recomputed in each round, which requires computations per round. A minor modification reduces the computation complexity to . The idea is to notice that for any and ,
Hence, one can store a buffer of scalars at the memory cost of . This slightly increases the upper confidence bounds, but not so much that the analysis is affected as we discuss in the next section.
5 Regret Analysis
Our main theorem is the following high probability upper bound on the regret of OnTheFlyLinUCB. The proof combines the ideas from Abbasi-Yadkori et al. (2011) with a novel argument to handle the confidence bound in Theorem 1, which has a more complicated form than what usually appears in the analysis of stochastic linear bandits.
Theorem 2.
With probability at least the regret of satisfies
Proof.
Let , which is chosen so that the upper confidence bound for action in round is
and . By Theorem 1, with probability at least it holds that for all . Assume for the remainder that the above event holds. Then
Therefore the regret in round is bounded by
where the second inequality follows from Cauchy–Schwarz. We now substitute the value of and bound the overall regret by
(7) 
The first sum is bounded in the same way as in the standard setting Abbasi-Yadkori et al. (2011):
where the first inequality follows from Cauchy–Schwarz and the second from the elliptical potential lemma (Lattimore and Szepesvári, 2019, Lemma 19.4). For the second sum in Equation 7 we introduce a new trick. Using the fact that ,
where in the second inequality we used the fact that for ,
Remark 3.
The choice of is left to the learner. It influences the bound in two ways. First, the lower-order term is linear in , which prevents the user from choosing very large. Second, is increasing in , which pushes the user in the opposite direction. Designing an adaptive algorithm that optimizes the choice of online remains a challenge for the future.
Lower bound
We now provide a non-asymptotic minimax lower bound for $K$-armed stochastic Bernoulli bandits showing that in the windowed setting there is an unavoidable dependence on . Note, an asymptotic problem-dependent bound for this setting was already known Vernade et al. (2017). Although our results are specialized to the finite-armed bandit model, we expect that standard analysis for other action sets should follow along the same lines as (Lattimore and Szepesvári, 2019, §24).
Theorem 3.
For any policy and and and there exists a armed Bernoulli bandit such that , where is a universal constant.
Interestingly, the dependence on appears in the square root, while in our upper bounds it does not. We speculate that the upper bound is loose. In fact, were known it would be possible to improve our upper bounds by using confidence intervals based on Bernstein’s inequality that exploit the reduced variance that is a consequence of being small. When is unknown one might imagine estimating the variance. We anticipate this should be possible, but the complexity of the algorithm and analysis would greatly increase.
6 Thompson sampling
The standard implementation of Thompson sampling for linear bandits without delays and Gaussian noise is to sample where is the usual regularized least squares estimator
The algorithm then chooses
This algorithm corresponds to Thompson sampling (or posterior sampling) when the prior is Gaussian with zero mean and covariance. No frequentist analysis exists for this algorithm, but empirically it performs very well. In the delayed setting the are not available to the learner at time . Nevertheless, it is possible to propose a randomized algorithm in the spirit of Thompson sampling. To motivate our choices, recall that the standard concentration analysis for least squares regression by Abbasi-Yadkori et al. (2011) shows that with high probability
In the delayed setting, on the other hand, Theorem 1 shows that
(8)  
(9) 
A natural guess for a sampling algorithm that uses the delayed least squares estimator is to compute and then sample
where . The choice of is rather heuristic. A more conservative choice would be the right-hand side of Equation 9. The resulting algorithm roughly corresponds to sampling from the confidence set used by our optimistic algorithm. Although this sacrifices certain empirical advantages, we expect the analysis techniques by Agrawal and Goyal (2013); Abeille and Lazaric (2017) could be applied to prove a frequentist regret bound for this algorithm.
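A sampling step in this spirit can be sketched as below. This is an assumption-laden illustration, not the paper's exact rule: it supposes the perturbed parameter is Gaussian, centred at the delayed least squares estimate, with covariance equal to a squared scale times the inverse design matrix. The names and the `width` argument (a stand-in for the heuristic scale discussed above) are hypothetical.

```python
import numpy as np


def sample_perturbed_parameter(theta_hat, V_inv, width, rng):
    """Draw theta_tilde ~ N(theta_hat, width^2 * V_inv).

    Uses a Cholesky factor L with V_inv = L L^T, so that L z has covariance
    V_inv when z is standard normal. V_inv must be positive definite.
    """
    L = np.linalg.cholesky(V_inv)
    z = rng.standard_normal(len(theta_hat))
    return theta_hat + width * (L @ z)


def select_sampled_action(action_set, theta_tilde):
    """Pick the action maximizing the linear reward under the sample."""
    scores = np.asarray(action_set, dtype=float) @ theta_tilde
    return int(np.argmax(scores))
```

Once the sample is drawn, action selection is a linear maximization over the action set, which is the computational advantage of sampling approaches over maximizing an upper confidence bound.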
Remark 4.
Algorithms based on adding noise to an empirical estimate are often referred to as ‘follow the perturbed leader’, which has been effectively applied in a variety of settings Abeille and Lazaric (2017); Kveton et al. (2018). An advantage of sampling approaches is that the optimization problem to find is a linear program, which for large structured action sets may be more efficient than finding the arm maximizing an upper confidence bound.
Remark 5.
A genuine implementation of Thompson sampling would require a prior on the space of delay distributions as well as the unknown parameter. We are not hopeful about the existence of a reasonable prior for which computing or sampling from the posterior is efficient.
7 Related Work
Delays in the environment's response are a frequent phenomenon that may take many different forms and should be properly modelled to design appropriate decision strategies. For instance, in an early work on applications of bandit algorithms to clinical trials, Eick (1988) uses ‘delays’ to model the survival time of the patients, in which case delays are the reward rather than external noise, which is a radically different problem to ours. In the same vein, in the sequential stochastic shortest path problem Talebi et al. (2017), the learner aims at minimising the routing time in a network.
Another example is parallel experimentation, where delays force the learner to make decisions under temporarily uneven information. For instance, Desautels et al. (2014); Grover et al. (2018) consider the problem of running parallel experiments that do not all end simultaneously. They propose a Bayesian way of handling uncertain outcomes to make decisions: they sample hallucinated results according to the current posterior. The related problem of gradient-based optimization with delayed stochastic gradient information is studied by Agarwal and Duchi (2011).
In online advertising, delays are due to the natural latency in users’ responses. However, in many works on bandit algorithms, delays are ignored as a first approximation. In the famous empirical study of Thompson sampling Chapelle and Li (2011), a section is dedicated to analyzing the impact of delays on either Thompson sampling or . While this shows an early interest in the problem, they only consider fixed, non-random delays of 10, 30 or 60 minutes. Similarly, in Mandel et al. (2015), the authors conclude that randomized policies are more robust to this type of latency. The general problem of online learning under known delayed feedback is addressed in Joulani et al. (2013), including full information settings and partial monitoring, and we refer the interested reader to their references on those topics. The most recent and closest work to ours is Zhou et al. (2019). The main difference with our approach is that they make strong assumptions on the distribution of the delays, while not having any censoring of the feedback. In that sense their problem is easier than ours because delays are fully observed. Nonetheless, the key idea of their algorithm is reminiscent of ours: they increase the exploration bonus by a quantity that corresponds to the amount of missing data at each round, which is observable in their case, not in ours. Recent work Li et al. (2019) addresses the case of unknown delays. The idea of ambiguous feedback, where delays are partially unknown, is introduced in Vernade et al. (2017).
Many models of delays for online advertising have been proposed to estimate conversion rates in an offline fashion: e.g. Yoshikawa and Imai (2018) (non-parametric) or Chapelle (2014); Diemert et al. (2017) (generalized linear parametric model).
An alternative, harder model relies only on anonymous feedback Cesa-Bianchi et al. (2018); Pike-Burke et al. (2017); Arya and Yang (2019): the rewards, when observed, cannot be directly linked to the action that triggered them in the past, and the learner has to deal with mixing processes. Finally, a recent full-information setting Mann et al. (2018) suggests incorporating intermediate feedback correlated with the rewards.
8 Experiments
In this section we illustrate the two realistic settings handled by this work. The first case below, with geometrically distributed delays, corresponds to the empirical study done by Chapelle (2014), already reproduced in simulations by Vernade et al. (2017). The second case, with arbitrary heavy-tailed delays, corresponds to another use case we extracted from the data released by Diemert et al. (2017).
Remark 6.
To the best of our knowledge, there is no competitor for this problem. In particular, despite the similarity of the algorithm DUCB of Zhou et al. (2019) with ours, it cannot be implemented when delays are not observed. Specifically, DUCB maintains a quantity which is equal to the exact amount of missing data (delayed feedback not converted yet). In our case, this quantity is not observable. The same comment applies to the QPMD algorithm of Joulani et al. (2013) and similar queue-based approaches. On the other hand, existing algorithms for unknown delays are not derived for linear bandits with arbitrary action sets.
Well-behaved delays
In the datasets analyzed
Heavy-tailed delays
Interestingly, the more recent dataset released by Diemert et al. (2017)
9 Discussion
We introduced the delayed stochastic linear bandit setting and proposed two algorithms. The first uses the optimism principle in combination with ellipsoidal confidence intervals while the second is inspired by Thompson sampling and follow the perturbed leader.
There are a number of directions for future research, some of which we now describe.
Improved algorithms
Our lower bound suggests the dependence on in Theorem 2 can be improved. Were this value known we believe that using a concentration analysis that makes use of the variance should improve the dependence to match the lower bound. When is not known, however, the problem becomes more delicate. One can envisage various estimation schemes, but the resulting algorithm and analysis are likely to be rather complex.
Generalized linear models
When the rewards are Bernoulli it is natural to replace the linear model with a generalized linear model. As we remarked already, this should be possible using the machinery of Filippi et al. (2010).
Thompson sampling
Another obvious question is whether our variant of Thompson sampling/follow the perturbed leader admits a regret analysis. In principle we expect the analysis by Agrawal and Goyal (2013) can be combined with our new ideas in the proof of Theorem 2 to yield a guarantee, possibly with a different tuning of the confidence width . Investigating a purer Bayesian algorithm that updates beliefs about the delay distribution as well as the unknown parameter is also a fascinating open question, though possibly rather challenging.
Refined lower bounds
Our current lower bound is proven when is the standard basis vectors for all . It would be valuable to reproduce the results where is the unit sphere or hypercube, which should be a straightforward adaptation of the results in (Lattimore and Szepesvári, 2019, §24).
Acknowledgements
This work was started when CV was at Amazon Berlin and at OvGU Magdeburg, working closely with AC, GZ, BE and MB. Major changes and improvements were made thanks to TL at DeepMind later on. CV wants to thank Csaba Szepesvári for his useful comments and discussions, and Vincent Fortuin for precisely reading and commenting.
Appendix A Proof of Theorem 3
For let be the relative entropy between Bernoulli distributions with biases and respectively. For let denote the expectation when the algorithm interacts with the Bernoulli bandit determined by . Let where is some parameter to be tuned subsequently. Then let
By the pigeonhole principle it follows that . Then define so that for all and . By the definitions of and we have
which means that
Summing the two regrets and applying the Bretagnolle–Huber inequality shows that
The next step is to calculate the relative entropy between and . Both bandits behave identically on all arms except action . When action is played the learner effectively observes a reward with bias either or . Therefore
Upper bounding the relative entropy by the squared distance shows that
where we used the assumption that . Therefore
Finally we conclude that
The result follows by tuning .
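The step of the proof that upper bounds the relative entropy by a squared distance relies on a standard inequality between Bernoulli distributions. The sketch below checks one common form of it numerically, namely that the relative entropy between Bernoulli(p) and Bernoulli(q) is at most (p - q)^2 / (q(1 - q)). The helper names are hypothetical, and the exact constant used in the proof is not reproduced here.

```python
import math


def bernoulli_kl(p, q):
    """Relative entropy between Bernoulli(p) and Bernoulli(q), for 0 < q < 1."""
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total


def chi_square_bound(p, q):
    """Chi-square divergence (p - q)^2 / (q (1 - q)), which dominates the KL."""
    return (p - q) ** 2 / (q * (1 - q))


# Check the inequality on a grid of biases in (0, 1).
grid = [i / 20 for i in range(1, 20)]
violations = [
    (p, q) for p in grid for q in grid
    if bernoulli_kl(p, q) > chi_square_bound(p, q) + 1e-12
]
```

The bound degrades as q approaches 0 or 1, which is why lower-bound constructions of this kind typically keep the biases bounded away from the extremes.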
Footnotes
 here we denote if for any ,
 The code for all data analysis and simulations is available at https://sites.google.com/view/banditsdelayedfeedback
 The datasets were not released.
 Note that in Chapelle (2014), time is rather measured in hours and days.
 https://ailab.criteo.com/criteoattributionmodelingbiddingdataset/
References
Abbasi-Yadkori et al. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320.
Abeille and Lazaric (2017). Linear Thompson sampling revisited. In AISTATS 2017 - 20th International Conference on Artificial Intelligence and Statistics.
Agarwal and Duchi (2011). Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pp. 873–881.
Agarwal et al. (2014). Taming the monster: a fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638–1646.
Agrawal and Goyal (2013). Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pp. 127–135.
Arya and Yang (2019). Randomized allocation with nonparametric estimation for contextual multi-armed bandits with delayed rewards. arXiv preprint arXiv:1902.00819.
Auer et al. (2002). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing 32 (1), pp. 48–77.
Beygelzimer et al. (2011). Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 19–26.
Cesa-Bianchi et al. (2018). Nonstochastic bandits with composite anonymous feedback. In Conference On Learning Theory, pp. 750–773.
Chapelle and Li (2011). An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pp. 2249–2257.
Chapelle (2014). Modeling delayed feedback in display advertising. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1097–1105.
Chu et al. (2011). Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208–214.
Desautels et al. (2014). Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. The Journal of Machine Learning Research 15 (1), pp. 3873–3923.
Diemert et al. (2017). Attribution modeling increases efficiency of bidding in display advertising. In Proceedings of the AdKDD and TargetAd Workshop, KDD, Halifax, NS, Canada, August 14, 2017. To appear.
Eick (1988). The two-armed bandit with delayed responses. The Annals of Statistics, pp. 254–264.
Filippi et al. (2010). Parametric bandits: the generalized linear case. In Advances in Neural Information Processing Systems, pp. 586–594.
Grover et al. (2018). Best arm identification in multi-armed bandits with delayed feedback. In International Conference on Artificial Intelligence and Statistics, pp. 833–842.
Joulani et al. (2013). Online learning under delayed feedback. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 1453–1461.
Jun et al. (2017). Scalable generalized linear bandits: online computation and hashing. In Advances in Neural Information Processing Systems, pp. 99–109.
Kveton et al. (2018). Garbage in, reward out: bootstrapping exploration in multi-armed bandits. arXiv preprint arXiv:1811.05154.
Lattimore and Szepesvári (2019). Bandit algorithms. Cambridge University Press (preprint).
Li et al. (2019). Bandit online learning with unknown delays. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 993–1002.
Mandel et al. (2015). Towards more practical reinforcement learning. In 24th International Joint Conference on Artificial Intelligence, IJCAI 2015.
Mann et al. (2018). Learning from delayed outcomes with intermediate observations. arXiv preprint arXiv:1807.09387.
Neu (2015). Explore no more: improved high-probability regret bounds for nonstochastic bandits. In Advances in Neural Information Processing Systems, pp. 3168–3176.
Pike-Burke et al. (2017). Bandits with delayed anonymous feedback. arXiv preprint arXiv:1709.06853.
Talebi et al. (2017). Stochastic online shortest path routing: the value of feedback. IEEE Transactions on Automatic Control 63 (4), pp. 915–930.
Vernade et al. (2017). Stochastic bandit models for delayed conversions. In Conference on Uncertainty in Artificial Intelligence.
Yoshikawa and Imai (2018). A nonparametric delayed feedback model for conversion rate prediction. arXiv preprint arXiv:1802.00255.
Zhou (2015). A survey on contextual multi-armed bandits. arXiv preprint arXiv:1508.03326.
Zhou et al. (2019). Learning in generalized linear contextual bandits with stochastic delays. In Advances in Neural Information Processing Systems 32, pp. 5198–5209.