Contextual Bandits under Delayed Feedback
Abstract
Delayed feedback is a ubiquitous problem in many industrial systems employing bandit algorithms. Most of these systems seek to optimize binary indicators such as clicks. In that case, when the reward is not sent immediately, the learner cannot distinguish a negative signal from a not-yet-sent positive one: she might be waiting for feedback that will never come. In this paper, we define and address the contextual bandit problem with delayed and censored feedback by providing a new UCB-based algorithm. To demonstrate its effectiveness, we provide a finite-time regret analysis and an empirical evaluation comparing it against a baseline commonly used in practice.
1 Introduction
Content optimization for websites and online advertising are among the main industrial applications of bandit algorithms. The services at stake sequentially choose an option among several possible ones and display it on a web page to a particular customer. For that purpose, contextual bandits are among the most widely adopted approaches, as they take into account the structure of the space in which the actions lie.
Moreover, a key aspect of these interactions through displays on web pages is the time a customer needs to make a decision and perform an action. The customers’ reactions are often treated as feedback and provided to the learning algorithm. For example, a mid-size e-commerce website can serve hundreds of recommendations per second, but customers need minutes, or even hours, to perform a purchase. This means that the feedback the learner requires for its internal updates is always delayed by several thousands of steps. In (Chapelle, 2014), the authors ran multiple tests on proprietary industrial datasets, providing a good example of how delays affect the performance of click-through rate estimation.
The problem becomes even more relevant when the delay is large compared to the considered time horizon. For example, several e-commerce websites provide special deals that only last for a few hours. Yet, customers will still need several minutes to make their decision, which is a significant amount of time compared to the time horizon. Furthermore, on the other extreme of the time scale, some companies optimize long-term metrics for customer engagement (e.g., accounting for returned products in the sales results), which by definition can be computed only several weeks after the bandit performed the action. In many of these applications, the ratio between the learner’s time horizon and the average delay is between 5 and 10, which makes the delayed feedback problem extremely relevant in practice.
Two major requirements for bandit algorithms to be ready to run in a real online service are the ability to leverage contextual information and to handle delayed feedback. Many approaches are available to deal with contextual information (Abbasi-Yadkori et al., 2011; Agarwal et al., 2014; Auer et al., 2002; Neu, 2015; Beygelzimer et al., 2011; Chu et al., 2011; Zhou, 2015), and delays have been identified as a major problem in online applications (Chapelle, 2014). We give an overview of the existing literature in Section 6. However, to the best of our knowledge, no algorithm addresses this problem under the requirements defined above. In the following, we consider a censored contextual bandit problem: after a piece of content is displayed on a page, the user may or may not decide to react (e.g., click, buy a product). In the negative case, no signal will ever be sent to the system, which means that the system will wait for a feedback that never comes. On top of preventing the update of the parameters, delays also imply non-negligible memory costs. Indeed, as long as the learner is awaiting a feedback, the context and action must be stored to allow the possible future update. For that reason, it is common practice to impose a cut-off time after which the stored objects are discarded, so that rewards associated with too long delays are never observed. This additional censoring of the feedback has only been addressed in a recent work on non-contextual stochastic bandits (Vernade et al., 2017), which does not generalize to the contextual case.
In this paper, we define and formalize the censored contextual bandit problem. We notice that a simple baseline algorithm can easily be implemented but has poor non-asymptotic performance. We then propose a carefully modified UCB-based algorithm that handles delays more efficiently. We provide a regret analysis for this new algorithm and an empirical evaluation against the naive baseline.
2 Setting and Notation
We introduce the contextual delayed bandit problem as a stochastic contextual bandit problem with independent stochastic delays. This setting is inspired by Abbasi-Yadkori et al. (2011); Joulani et al. (2013) and Vernade et al. (2017). Upper-case and lower-case letters are used for random variables and constants, respectively.
At round , a set of contextualized actions is available. In practice, these action vectors are constructed as , where are the user’s features, are fixed action vectors, and is some nonlinear projection on . The system constraints fix a cut-off time that corresponds to the longest waiting time allowed for each action taken.
The learning proceeds as follows:

The learner observes the contextualized actions ;

An action is chosen from ;

An acknowledgment indicator is generated independently of the past; it accounts for the event that the chosen action was actually evaluated by the customer. The parameter does not depend on the action taken and is also unknown.

Conditionally on , two random variables are generated but not immediately observed:
4.a. a reward following the linear assumption: where is an unknown vector and is an independent centered noise term discussed later;
4.b. an action-independent delay . As in Vernade et al. (2017), for all , characterizes the delay distribution in a non-parametric fashion. However, as opposed to their setting, we do not require prior knowledge of these parameters.

The observation of is postponed to if ; we then say that the action converts. Otherwise, it will never be observed and the system will have to process it accordingly. As long as the observation has not been received, the learner sees a null value which can be mistaken for a zero reward.
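To make the protocol above concrete, here is a minimal Python sketch of one round of the interaction. It is illustrative only: the function names (`play_round`, `observed`), the uniformly random policy, and the Bernoulli reward with a clipped linear mean are assumptions standing in for the paper's notation.

```python
import random

def play_round(theta, actions, p_ack, delay_sampler, rng):
    """One round of the delayed-feedback protocol (illustrative sketch)."""
    # stand-in policy: choose an action uniformly at random
    a = rng.randrange(len(actions))
    x = actions[a]
    # acknowledgment: the customer actually evaluates the display w.p. p_ack
    ack = rng.random() < p_ack
    # latent Bernoulli reward whose mean is <x, theta>, clipped to [0, 1]
    mean = max(0.0, min(1.0, sum(xi * ti for xi, ti in zip(x, theta))))
    reward = 1 if (ack and rng.random() < mean) else 0
    # unacknowledged actions behave like an infinite delay
    delay = delay_sampler(rng) if ack else float("inf")
    return a, reward, delay

def observed(reward, delay, elapsed, cutoff):
    """What the learner sees `elapsed` steps after the action: the reward is
    visible only if it converted within both the elapsed time and the cut-off
    window; otherwise a 0 is shown, indistinguishable from a true null reward."""
    return reward if delay <= min(elapsed, cutoff) else 0
```

Note how a positive reward whose delay exceeds the cut-off is permanently censored: the learner sees 0 forever, which is exactly the ambiguity discussed above.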
Remark 1.
The acknowledgment variables stand for actions that are never converted for external reasons unrelated to the action taken. This is equivalent to assuming that the delay associated with these actions is larger than . We made this choice to align with the model of Chapelle (2014), but our bandit algorithm is agnostic to .
From the moment an action is taken at round , a sequence of random variables is generated that models the available observation at round :
Indeed, as long as , the action chosen at time is still awaiting conversion and the conditional expectation of is
(1) 
The goal of the learner is to sequentially choose actions so as to minimize the expected cumulative regret after rounds,
where is such that .
A major difficulty here is that when one observes at time with , there is an ambiguity on whether , or whether is while is not . If we knew the value of , the problem would be much easier. This is illustrated as a warm-up in Section 3.1.
The setting of this paper addresses a harder problem where rewards are ambiguous. Specifically, the learner does not observe . For this case, we propose two alternative ways of constructing and controlling an estimator of that make use of the sequential observations in different manners. The first one, our baseline, simply waits for the cut-off. This avoids any delay-related bias due to the awaiting observations mentioned in Eq. 1. The second, more efficient in practice, updates the estimate on the fly as data comes in and handles the bias conveniently.
Assumption 1.
Without loss of generality and following usual practices in the literature on linear bandits (Abbasi-Yadkori et al., 2011; Lattimore and Szepesvári, 2016), we make the following assumptions:

Bounded scalar reward: , , ;

Bounded actions: we assume that each coefficient of any action is bounded by such that .

Bounded noise: for all , . Typically, we will consider the case where the are Bernoulli random variables. We comment on sub-Gaussian noise in Section 3.1 below.
3 Algorithm
In this section, we define confidence intervals in order to build UCB-like algorithms for the contextualized delayed feedback setting. We build on existing results by Lattimore and Szepesvári (2016) and make use of the exploration function therein: for some universal constant,
For any matrix , we denote by its pseudo-inverse and .
3.1 Warm-up: the non-ambiguous case.
We first consider a special case where the learner observes whether a sample is censored: she receives as an extra piece of information. Note that in the case where the reward is continuously distributed, typically when the noise is Gaussian, a null observation necessarily indicates censoring, since the reward itself is exactly zero with probability zero in that case.
This setting is much simpler than the general case where one does not observe . Here, the learner can simply update the covariance matrix and the estimator of as soon as the reward is received, i.e., when . If the reward is censored, the action is simply discarded. We define the least-squares estimator of in this non-ambiguous (NA) case by
(2) 
We introduce the notations and based on Abbasi-Yadkori et al. (2011). The following theorem provides a confidence interval for .
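A minimal sketch of the non-ambiguous estimator described above: censored samples are simply discarded before solving the least-squares problem. The ridge term `lam` is an assumption added for numerical invertibility (the theorem instead requires the covariance to be almost surely non-singular), and all names are hypothetical.

```python
import numpy as np

def na_estimator(X, y, censored, lam=1.0):
    """Least-squares estimate of theta in the non-ambiguous case: rows
    flagged as censored are discarded entirely (illustrative sketch)."""
    keep = ~np.asarray(censored)
    Xk, yk = np.asarray(X)[keep], np.asarray(y)[keep]
    # regularized covariance built from the kept (uncensored) actions only
    V = lam * np.eye(Xk.shape[1]) + Xk.T @ Xk
    return np.linalg.solve(V, Xk.T @ yk)
```

Because censored rows never enter the regression, arbitrarily corrupting them leaves the estimate unchanged, which is exactly why this strategy needs the censoring indicator to be observed.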
Theorem 1.
For any , and such that is almost surely nonsingular,
Proof.
Let be such that . We have
3.2 General case: ambiguous rewards
In many applications, the extra censoring information is not available: one does not observe . A user may decide to click or not for any reason, and the system can never know whether a null observation means that the user did not acknowledge the display (), that the delay was too long and the reward was censored (), or that the reward is truly (). This distinction would be crucial in the example of Bernoulli rewards , which is important in classical web applications where the reward models a click. The warm-up estimation strategy presented above cannot be applied here, as we do not observe ; for instance, removing all null rewards from the estimator would result in a bias that does not converge to zero.
We present two strategies to address delayed feedback in this setting. One major improvement over the existing approach of Vernade et al. (2017) is that our estimator does not require any prior knowledge of the delay distribution or of the conversion probability . We start by presenting a baseline estimator that simply waits until each action is cut off. While a good linear bandit strategy using this estimator would be asymptotically efficient, we argue that it suffers from poor non-asymptotic performance, especially when the time horizon is short with respect to the cut-off, as we effectively discard all of the most recent observations.
To overcome this pitfall, we design a better estimator that also makes use of the not-yet-converted actions. We present a linear bandit algorithm, based on a new concentration result, that has better non-asymptotic performance as it does not discard any information.
Waiting as a baseline
We first present a simple heuristic that builds on existing concentration results. It is based on the aforementioned observation that once the system’s cut-off has passed, no conversion can be observed anymore. More precisely, for any and any , .
This means that at a fixed round , one can build an unbiased estimate of by computing the least-squares solution using only the data available at :
(4) 
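The waiting baseline can be sketched as follows: at round `t`, only actions played at least `cutoff` rounds earlier enter the regression, so every kept observation is final and no delay-related bias remains. Argument names are hypothetical and the ridge term `lam` is again an assumption for numerical invertibility.

```python
import numpy as np

def waiting_estimator(X, obs, played_at, t, cutoff, lam=1.0):
    """Baseline estimate at round t: keep only actions whose waiting window
    has fully elapsed, i.e. played at round s <= t - cutoff (sketch)."""
    keep = np.asarray(played_at) <= t - cutoff
    Xk = np.asarray(X)[keep]
    yk = np.asarray(obs)[keep]
    V = lam * np.eye(np.asarray(X).shape[1]) + Xk.T @ Xk
    return np.linalg.solve(V, Xk.T @ yk)
```

The price of this simplicity is visible in the `keep` mask: all of the most recent `cutoff` actions are thrown away, which is exactly the non-asymptotic weakness discussed above.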
The following theorem provides the corresponding confidence interval.
Theorem 2.
For any , and such that is almost surely nonsingular,
Proof.
The baseline algorithm then proceeds like a standard linear UCB algorithm: actions are chosen uniformly at random until the first uncensored reward is observed, and then
(5) 
As in the warm-up case, since Theorem 2 holds, the baseline enjoys a similar regret guarantee.
Better approach: progressive updates
We now describe an algorithm that takes into account all the data at hand, including observations received within the conversion window.
We define another estimator that suffers from a small, controllable, and vanishing bias but is much more data-efficient:
(6) 
This estimator includes all the observations received up to round and has the same precision matrix as . As described in detail in Algorithm 1, this makes it possible to update the internal parameters after each action: the algorithm updates the covariance after each action is taken but only updates after rewards are received.
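The asymmetric update schedule described above can be sketched as a small class: the covariance grows with every action taken, while the response vector grows only when a reward actually arrives. This is an illustrative reading of Algorithm 1 under assumed names, not the paper's pseudocode.

```python
import numpy as np

class OnTheFlyEstimator:
    """Progressive-update estimator sketch: V tracks ALL past actions,
    b tracks only the rewards received so far (hypothetical class name)."""

    def __init__(self, d, lam=1.0):
        self.V = lam * np.eye(d)   # covariance, grown at every action
        self.b = np.zeros(d)       # response vector, grown only on conversion

    def play(self, x):
        # called once per round, immediately after the action is taken
        self.V += np.outer(x, x)

    def receive(self, x, r):
        # called later, whenever the (delayed) reward r for action x converts
        self.b += r * x

    def estimate(self):
        return np.linalg.solve(self.V, self.b)
```

Not-yet-converted actions contribute to `V` but not to `b`, so they shrink the estimate toward zero rather than discarding information; this shrinkage is precisely the bounded bias controlled in the theorem below.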
Our technique comes at the cost of a bounded bias that we carefully take into account in the confidence interval driving our algorithm, presented in the theorem below.
Theorem 3.
Let , and large enough such that is invertible. Then,
Proof.
The proof works as follows: we notice that this estimator is “close” to and we bound the bias due to the additional observations. For , we write
Decomposing according to the initial remark, we obtain
The second term is handled by Theorem 2 so it suffices to bound for any :
To give some intuition for this result, we state it in the case where the arms are the canonical basis of , which corresponds to the non-contextual bandit setting.
Corollary 1.
For all , let be the canonical basis of . After rounds, denote by the number of pulls of action and . Then,
4 Regret analysis
4.1 Problemindependent lower bound
We first present a lower bound showing that the quantity has to appear in the regret. Intuitively, the heavier the censoring, the larger the regret must be.
Theorem 4.
Consider the set of all contextual delayed bandit problems (as defined in the setting) in dimension with rewards bounded by , horizon , and censoring parameters . We have for any bandit policy
where is the expected regret of policy on the bandit problem .
The proof of this result is deferred to Appendix B.
4.2 Problemindependent Upper bound
By design, our algorithms and suffer regret bounds very similar to that of , as shown e.g. in Abbasi-Yadkori et al. (2011). We derive the bound below only for , as it is the most interesting one here; mutatis mutandis, a similar bound can be proved for .
Theorem 5.
Let sufficiently large such that is invertible. With probability , the expected regret of after rounds is bounded by
Proof.
For any round , let us denote the vector such that . A consequence of Theorem 3 is that
So, it suffices to bound the regret on the most likely event
The last term is handled by the concentration result stated in Theorem 3. The first sum is bounded using the classical analysis of that can be found for instance in AbbasiYadkori et al. (2011):
Finally, by Jensen’s and the Cauchy-Schwarz inequalities,
The last inequality comes from the so-called Elliptical Potential Lemma (see, e.g., Lemma 11 in Abbasi-Yadkori et al. (2011)):
We make two remarks on this bound. First, there is a gap with the lower bound: this is a gap commonly suffered by optimistic approaches (Abbasi-Yadkori et al., 2011). Second, there is an additional gap with the lower bound. We believe that this could be fixed with a tighter control of , but that would imply quite involved computations that we leave for future work.
5 Experiments
To validate the effectiveness of our algorithm, we run experiments in the censored setting, which is the most common scenario in real-world applications. In this section, rewards are simulated in order to show the behavior of our algorithm compared to the baseline when facing a specific feedback environment. For a given time horizon, we test the non-asymptotic behavior of both policies; we expect that an accurate handling of the delays will provide better performance. Even if in theory the improvement is a constant factor, it matters in practice, especially since this factor can be fairly large from a non-asymptotic perspective.
We fix the horizon to and choose a geometric delay distribution with mean . In a real setting, this would correspond to an experiment that lasts 3 hours, with average delays of 6 minutes. Then, we let the cut-off vary in , which corresponds to 15, 30 and 60 minutes: a reasonable set of values for a waiting time in an online system. We also fix , and we set . We only show the results in the more interesting and realistic Bernoulli case.
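Under the geometric-delay assumption used in these simulations, the fraction of acknowledged rewards that convert within the cut-off window has the closed form 1 - (1-p)^cutoff, with p the geometric success probability (the reciprocal of the mean delay). The sketch below checks this by Monte Carlo; the function name and parameters are illustrative, not the paper's experimental code.

```python
import math
import random

def frac_converted(mean_delay, cutoff, n=100_000, seed=0):
    """Monte-Carlo estimate of the probability that a geometric delay with
    the given mean falls within the cut-off window (illustrative sketch)."""
    p = 1.0 / mean_delay
    rng = random.Random(seed)

    def geom():
        # inverse-CDF sampling of a geometric delay on {1, 2, ...}
        u = 1.0 - rng.random()  # uniform on (0, 1]
        return max(1, math.ceil(math.log(u) / math.log(1.0 - p)))

    return sum(geom() <= cutoff for _ in range(n)) / n
```

With a mean delay of 100 steps, shrinking the cut-off sharply reduces the fraction of observable conversions, which is why the choice of cut-off matters so much in the regret plots.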
6 Related Work
Delays were identified early on (Chapelle and Li, 2011) as an incompatibility between the usual assumptions of the bandit framework and concrete applications. Nevertheless, in many works on bandit algorithms, delays are ignored as a first approximation.
Delayed feedback occurs in various situations. For instance, Desautels et al. (2014); Grover et al. (2018) consider the problem of running parallel experiments that do not all end simultaneously. They propose a Bayesian way of handling uncertain outcomes to make decisions: they sample hallucinated results according to the current posterior.
In online advertising, delays are due to the natural latency of users’ responses. In the famous empirical study of Thompson sampling (Chapelle and Li, 2011), a section is dedicated to analyzing the impact of delays on the compared policies. While this constitutes early interest in the problem, they only consider fixed, non-random delays of 10, 30 or 60 minutes. Similarly, in Mandel et al. (2015), the authors conclude that randomized policies are more robust to this type of latency. The general problem of online learning under delayed feedback is addressed in Joulani et al. (2013), including full-information settings and partial monitoring, and we refer the interested reader to their references on those topics. The idea of ambiguous feedback was introduced in Vernade et al. (2017).
Many models of delays for online advertising have been proposed to estimate conversion rates in an offline fashion: e.g., Yoshikawa and Imai (2018) (non-parametric) or Chapelle (2014) (generalized linear parametric model). The learning setting of the present work builds on the latter reference. Our goal is different, though, as we want to build a bandit algorithm that minimizes the regret under such assumptions.
Finally, we mention a recent work on anonymous feedback (Pike-Burke et al., 2017) that considers an even harder setting where the rewards, when observed, cannot be directly linked to the actions that triggered them in the past.
7 Discussions and conclusions
This paper frames and models a relevant and recurrent problem in several industrial systems employing contextual bandits. After noticing that the problem can be solved by a simple heuristic, we investigate a more efficient strategy, which provides a significant practical advantage and stronger theoretical guarantees. An interesting aspect not investigated in this work is the use of randomized policies, which are often preferable in practical applications for their positive impact on customer engagement.
References
 Abbasi-Yadkori et al. [2011] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
 Agarwal et al. [2014] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646, 2014.
 Auer et al. [2002] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
 Beygelzimer et al. [2011] Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19–26, 2011.
 Bubeck et al. [2012] Sébastien Bubeck, Nicolò Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
 Chapelle [2014] Olivier Chapelle. Modeling delayed feedback in display advertising. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1097–1105. ACM, 2014.
 Chapelle and Li [2011] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.
 Chu et al. [2011] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
 Desautels et al. [2014] Thomas Desautels, Andreas Krause, and Joel W. Burdick. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. The Journal of Machine Learning Research, 15(1):3873–3923, 2014.
 Grover et al. [2018] Aditya Grover, Todor Markov, Peter Attia, Norman Jin, Nicolas Perkins, Bryan Cheong, Michael Chen, Zi Yang, Stephen Harris, William Chueh, et al. Best arm identification in multi-armed bandits with delayed feedback. In International Conference on Artificial Intelligence and Statistics, pages 833–842, 2018.
 Joulani et al. [2013] Pooria Joulani, András György, and Csaba Szepesvári. Online learning under delayed feedback. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1453–1461, 2013.
 Lattimore and Szepesvári [2016] Tor Lattimore and Csaba Szepesvári. The end of optimism? An asymptotic analysis of finite-armed linear bandits. arXiv preprint arXiv:1610.04491, 2016.
 Mandel et al. [2015] Travis Mandel, Emma Brunskill, and Zoran Popović. Towards more practical reinforcement learning. In 24th International Joint Conference on Artificial Intelligence, IJCAI 2015. International Joint Conferences on Artificial Intelligence, 2015.
 Neu [2015] Gergely Neu. Explore no more: Improved highprobability regret bounds for nonstochastic bandits. In Advances in Neural Information Processing Systems, pages 3168–3176, 2015.
 Pike-Burke et al. [2017] Ciara Pike-Burke, Shipra Agrawal, Csaba Szepesvári, and Steffen Grünewälder. Bandits with delayed anonymous feedback. arXiv preprint arXiv:1709.06853, 2017.
 Vernade et al. [2017] Claire Vernade, Olivier Cappé, and Vianney Perchet. Stochastic bandit models for delayed conversions. In Conference on Uncertainty in Artificial Intelligence, 2017.
 Yoshikawa and Imai [2018] Yuya Yoshikawa and Yusaku Imai. A nonparametric delayed feedback model for conversion rate prediction. arXiv preprint arXiv:1802.00255, 2018.
 Zhou [2015] Li Zhou. A survey on contextual multi-armed bandits. arXiv preprint arXiv:1508.03326, 2015.
Appendix A Concentration results
Concentration: proof of Theorem 6.
For completeness, we report here the result of Theorem 8 from Lattimore and Szepesvári [2016]. It gives a high-probability bound on the deviations of the absolute value of for any vector and any sequence of actions .
Fix sufficiently large such that is almost surely nonsingular. Concretely, they prove that for any , and for any sub-Gaussian noise ,
(7) 
We will use this result to bound similar deviations in all the concentration results of this paper. Note that this is a refinement of the original result from AbbasiYadkori et al. [2011] that could also be used instead.
In this section we define and study a scaled estimator of :
This estimator uses the complete covariance matrix – built with all the past action vectors – but only the received observations; the unreceived ones are counted as zeros. It is not exactly the least-squares estimator, which should use either the covariance restricted to converted actions or all the rewards; the latter is impossible because some of them are unobserved. This effect tends to shrink the norm of the estimator. The next theorem controls as defined above.
Theorem 6.
For any , sufficiently large and such that is almost surely nonsingular,
where for some universal constant,
Proof.
We start by noticing that
Let such that . We have
(8) 
We thus have two noises that can be bounded individually.
The right term can be rewritten as
The last term can be bounded with high probability using the inequality in Eq. (7).
On the other hand, we bound the deviation term on the left of Eq.(8):
Taking the scalar product with some vector , we get a sum of two noises again:
The absolute value of each of the terms above can be bounded using Eq. (7).
Summing the three upper bounds derived above, we finally obtain
Appendix B Lower Bound: proof of Theorem 4
The proof follows the lines of Theorem 3.5 in Bubeck et al. [2012]. Their result is actually a special case of a more general result stated in their Lemma 3.6, which gives a problem-independent lower bound depending on a parameter that characterizes the considered changes of distributions. Here, the number of arms is fixed to , and in the hard case the arms are the vectors of the canonical basis.
The idea is to consider hard problems where all arms have a mean except one that has mean . This defines hard problems, one for each of the arms. The goal is then to lower bound the worst expected regret under each of those models, which happens to be larger than the mean :
It then suffices to bound the right-hand side. Pinsker’s inequality provides a first bound:
where is the model where no arm stands out and they all have their mean equal to .
We prove the following bound that adapts their Lemma 3.6 to our delayed feedback setting:
Proof.
The main difference between our setting and theirs is that we assume we are given a collection of i.i.d. Bernoulli random variables of parameter , independent of the generated rewards. The data generated is and we only observe it in the case where .
The main step we modify is the computation of the empirical divergence, corresponding to the expectation of the likelihood of the rewards under two alternative models. We prove that
This allows us to obtain the desired result using the concavity of the square root:
where we used the fact that
since and are independent.
Taking , we get from Lemma 3.6 that
Finally, we recall and prove the lower bound of Section 4.
Theorem 7.
Consider the set of all contextual delayed bandit problems (as defined in the setting) in dimension with rewards bounded by , horizon , and censoring parameters . We have for any bandit algorithm
where is the expected regret of algorithm on the bandit problem .
Proof.
Appendix C Additional experiments
As in the main paper, we fix the horizon and let the cut-off vary in {250, 500, 1000}. The delay still follows a geometric distribution, but in this set of experiments .
In our hypothetical 3-hour experiment, the average delay now corresponds to 15 minutes and the cut-off time is set to 15, 30 or 60 minutes.
As for the experiments in the main paper, the trend is clear: the larger the cut-off time, the more advantageous it is to properly handle the delay. The main difference with the previous experiments is the speed at which the algorithms learn: as expected, the larger the average delay, the slower the learning process and the higher the regret.