Abstract
Although shown to be useful in many areas as models for solving sequential decision problems with side observations (contexts), contextual bandits are subject to two major limitations. First, they neglect user “reneging” that occurs in real-world applications. That is, users unsatisfied with an interaction quit future interactions forever. Second, they assume that the reward distribution is homoscedastic, which is often invalidated by real-world datasets, e.g., datasets from finance. We propose a novel model of “heteroscedastic contextual bandits with reneging” to overcome the two limitations. Our model allows each user to have a distinct “acceptance level,” with any interaction falling short of that level resulting in that user reneging. It also allows the variance to be a function of context. We develop a UCB-type policy, called HRUCB, and prove that with high probability it achieves sublinear regret.
Heteroscedastic Bandits with Reneging
Ping-Chun Hsieh^{*}, Xi Liu^{*}, Anirban Bhattacharya, P. R. Kumar
pingchun.hsieh@tamu.edu, xiliu.tamu@gmail.com, anirbanb@stat.tamu.edu, prk@tamu.edu
Texas A&M University
1 Introduction
Multi-armed bandits (MAB) [5] have been extensively used to model sequential decision problems with uncertain rewards. Such problems commonly arise in a large number of real-world applications such as clinical trials, search engines, online advertising, and notification systems. While users (e.g., patients) in those applications have traditionally been modeled as homogeneous, there is a strong motivation to enhance user experience through personalization that caters to users' specific demands, and thereby to increase revenue. The model of “contextual bandits” [1] seeks to do so by proposing a MAB model for learning how to act optimally based on contexts (features) of users and arms. At the beginning of each round, the learner observes a context from the context set (e.g., medical records, treatment details) and selects an arm from the arm set (e.g., different treatments). At the end of the round, the learner receives a random reward (e.g., the result of the treatment) with the mean value of its distribution depending on the observed context. The objective of the learner is to accumulate as much reward as possible within a given number of rounds. Since the parameters involved in the dependence of the mean reward on the context are unknown, the learner has to handle a tradeoff between exploration (e.g., choosing a new treatment with possibly higher effectiveness) and exploitation (e.g., choosing the best known treatment) at each round.
While this model has been usefully applied in many areas, it is subject to two major limitations. First, it neglects the phenomenon of “reneging” that is common in real-world applications. Reneging here refers to the behavior of users cutting ties with the learner after an unsatisfactory experience, and desisting from any future interactions. This is also referred to as “churn”, “disengagement”, “abandonment”, or “unsubscribing” [13]. Since, as is well known, the acquisition cost for new users is much higher than the retention cost for existing users, handling reneging plays a critical role in business success. For instance, in clinical trials, a patient dissatisfied with the effectiveness of a treatment quits all further trials. Search services face a similar problem; users may never again use a search engine after it returns results regarded as irrelevant. Another example is online advertising, where users stop clicking on any future advertisements after the pursuit of one or more delivered advertisements leads to a loss of trust in the advertiser. Similar concerns are found in notification systems employed by content creators, where there is value in sending more email notifications, but each email also risks the user disabling the notification functionality, permanently eliminating any opportunity for the creator to interact with the user in the future.
Second, previous studies have usually assumed that rewards are generated from an underlying reward distribution that is homoscedastic, i.e., its variance is independent of contexts. Unfortunately, this assumption is violated by the “heteroscedasticity” present in many real-world datasets, so learning algorithms based on it leave room for improvement. Examples abound in financial applications such as portfolio selection for hedge funds [16]. In online advertising or notification systems, the click-through rate can vary among users due to their differing amounts of spare time. Users with more spare time tend to be more tolerant of advertisements/notifications, and may continue to click on them, while users with little spare time will in most cases ignore them.
We propose a novel model of contextual bandits that addresses the challenges arising from reneging risk and heteroscedasticity. We call the model “heteroscedastic bandits with reneging.” In our model, at a round for user , the learner observes a collection of contexts , where context is drawn from context set . After observing the context, the learner selects an action and receives a reward drawn from a reward distribution. To model heteroscedasticity, we allow for the mean and variance of the reward distribution to both depend on , i.e., and . To model the reneging risk, we suppose that user has an acceptance level . If is below level , the user quits all future interactions; otherwise, the user stays. We assume that the acceptance level for each user is fixed beforehand and does not depend on the decision of the learner. Under this model, the reneging risk associated with action of user is the probability that the observed reward is below its acceptance level, i.e., . The parameters in and are unknown and need to be learned on the fly.
Three key challenges arise in finding the optimal policy for heteroscedastic bandits with reneging. First, to estimate the unknown variance function, we have to construct a satisfactory estimator and the corresponding confidence interval. Since in statistics, there is usually no explicit way to represent the confidence interval for variance estimation, establishing regret bounds for upper confidence bound (UCB) algorithms becomes difficult. Second, the presence of reneging makes estimation of unknown functions more difficult. Each round has a nonzero probability of being the last round, and so some user-arm pair may be pulled. As a result, the conventional definition of regret needs to be modified. Moreover, since the mean and variance depend on the context, the reward distributions to be learned for one user are different from those for another user. How to transfer the knowledge accumulated on one user to another user has to be carefully handled. Third, the optimal policy needs to handle the issue of exploration vs. exploitation in terms of both rewards and risk. Intuitively, a good policy should prefer actions with high expected return and low reneging risks. This becomes difficult when there are arms that have high expected return and high risk. This work focuses on developing optimal learning algorithms that address the above challenges.
Seminal studies on contextual bandits consider linear contextual bandits [1, 10, 4], assuming that the expected reward is a linear function of contexts. Although these models have been shown to be useful in some areas, they do not address reneging and heteroscedasticity. Reneging can be handled as risk to be avoided or controlled. The risk in bandit problems has been studied for variance minimization [18] and value-at-risk maximization [21, 8, 9], and through guarantees of outperforming baselines [14, 24]. However, the risks those studies handle are different from those we are motivated by, and their models cannot be used to solve the problems of interest here. The risks they handle usually have no impact on lifetimes of bandits. Their approaches encode the consideration of risk in statistics and put them in objective functions, while in our problem, the reneging risk comes from the probability that the observed reward is below an acceptance level. Moreover, their models are restricted to homoscedastic datasets, while our model is applicable to both heteroscedastic and homoscedastic datasets. The acceptance level in our formulation has a flavor of thresholding bandits [2, 17, 12, 19]. However, the latter is based on a very different setting and assumes the distribution is context-independent and homoscedastic (a more careful review and comparison are given in Section 2).
Contributions. Our research contributions can be summarized as follows:

Reward heteroscedasticity and reneging risk are common in real-world applications but not taken into account in existing bandit models. We formulate a novel model, dubbed “heteroscedastic bandits with reneging.” To the best of our knowledge, this paper is the first to address them in a bandit model.

To solve the proposed model, we develop a UCB-type policy, called HRUCB, that is proved to achieve a regret bound with high probability. Although the proposed solution mainly applies to heteroscedastic bandits with reneging, the techniques employed here to handle heteroscedasticity can be used to solve bandits that are sensitive to variance, e.g., risk-averse bandits, thresholding bandits, etc.
2 Related Work
Contextual bandits, as an approach to solve sequential decision problems with side observations (contexts) and user heterogeneity, have attracted considerable research attention recently. The best-known studies are of linear contextual bandits [1, 10, 4], where it is assumed that the expected reward is a linear function of context, an assumption also made in this paper. Although previous studies of contextual bandits have been useful in many areas, they are subject to two major limitations. First, they neglect user reneging that is commonly found in real-world applications, e.g., search engines and online advertising. That is, a user not satisfied with one interaction just drops out forever from any future interactions. Appropriately handling it has therefore been regarded by many real-world practitioners as key to their long-term viability and success [13, 3]. Second, it is usually assumed that the reward distribution is homoscedastic in contexts, which is usually invalidated by real-world datasets, e.g., datasets from finance-related applications. Even when the reward distribution is allowed to be context-dependent, the assumption that only the mean of the distribution depends on context restricts the applicability of those models. So motivated, in this paper we propose a novel model of contextual bandits. Differing from previous works, our model allows each user to have a distinct acceptance level, with interactions falling below it resulting in the user reneging. Moreover, our model allows the variance also to be a function of context. Modeling reneging and heteroscedasticity in contextual bandits are the salient features of this paper. Compared to conventional contextual bandits, both the function for variance and for mean need to be learned in our model; in addition, reneging aborts future interactions and makes the learning task more complex. Moreover, diverse reward distributions make the avoidance of reneging more difficult.
The objective of our paper is to propose an optimal policy that attacks those challenges. As far as we are aware, our model is the first one that addresses the two issues and achieves optimal regret.
There are two main lines of research related to our work: bandits with risk and thresholding bandits.
Bandits with Risk. Reneging can be viewed as a type of risk that the learner tries to avoid or control. The risk in bandit problems has been studied in terms of variance, quantiles, and guarantees of outperforming baselines. In [18] and many follow-up works, mean-variance models to handle return (reward) and risk (variability) are studied, where the objective to be maximized is a linear combination of mean reward and variance. Subsequent studies [21, 8] propose a quantile (value at risk) to replace rewards and variance in evaluating which arm to select. In contrast to these works, [14, 24] control the risk by requiring that the accumulated rewards while learning the optimal policy be above those of baselines. Similarly, in [20], each arm is associated with some risk; safety is guaranteed by requiring the accumulated risk to be below a given budget. Although these studies investigate optimal policies under risk, the risks they handle are different from ours and their models cannot be used to solve our problem. The risks they handle usually have no impact on the lifetime of bandits. Their approaches to handling the risk are based on more straightforward statistics, while, in our problem, the reneging risk is relatively complex, i.e., it comes from the probability that the observed reward is below an acceptance level. Moreover, their models assume homoscedasticity, while we allow the variance to depend on the context.
Thresholding Bandits. The acceptance level in our model has the flavor of thresholding bandits. However, the thresholds in the existing literature differ from our perspective. In [2], the action receives a unit payoff in the event that the sampled reward exceeds a threshold. In [17], the objective is to find the set of arms whose means are above a given threshold up to a precision. In [12], a threshold is used to trigger a one-shot reward, i.e., for an arm, no rewards can be collected until the total number of successes exceeds the threshold, but once a reward is collected, the arm is removed from the interaction. Compared to the problem in this paper, the most similar one that has been studied is in [19]. However, it has a very different setting and assumes that the distribution is context-independent and homoscedastic. In that paper, each arm is represented by a real number; users may abandon the program as long as the pulled arm exceeds a threshold, which measures user tolerance capability. By comparison, we consider a contextual bandit model; we allow the reward distribution to be heteroscedastic; and we capture the reneging through a probability.
As far as we are aware, only one very recent paper discusses bandits under heteroscedasticity [15]. Compared to it, our paper has two salient differences. First, we discuss heteroscedasticity under the presence of reneging. The presence of reneging makes the learning problem more challenging as the learner has to always be prepared that plans for the future may not be carried out. Second, the solution in [15] is based on information directed sampling. In contrast, we exhibit in this paper a heteroscedastic UCB policy that is efficient, is easier to implement, and achieves sublinear regret.
3 Problem Formulation
In heteroscedastic bandits with reneging, since the interaction with one user is often aborted after a finite number of rounds with new users joining in the interactions afterwards, we index users by their order of interaction and conduct a regret analysis in terms of the total number of interacting users. Let be the number of users, who are indexed by . Let be the context set, where denotes the norm. At each round for user , the learner observes a set of contexts . After observing the contexts, the learner selects an action and receives a random reward drawn from a reward distribution that satisfies:
(1)  
(2)  
(3) 
where denotes the Gaussian distribution with zero mean and variance . For the mean of the reward distribution we operate under the linear realizability assumption: that is there is an unknown with so that
(4) 
for all and . For the variance of the reward distribution, heteroscedasticity is taken into account through a function
(5) 
where is known and is required to be nonnegative, strictly increasing, and bi-Lipschitz continuous, i.e., there exists a constant with such that , for all . For example, we can choose or . The parameter vector with is unknown and will be learned during interactions. Since is bounded over all possible and , we know that is also bounded, i.e., for some , for all and defined above. This also implies that is sub-Gaussian, for all .
The minimal expectation in an interaction of a user is characterized by its acceptance level. Denote by the acceptance level of user . We assume that acceptance levels of users, like their context, are available before interacting with them. Denote by the observed reward for user at round . When is below , reneging occurs and the user drops out from any future interaction. Suppose that at round , arm is selected for user , then the risk that reneging occurs is
(6) 
where is the cumulative distribution function (CDF) for . Without loss of generality, we also assume that is lower bounded by for some . Let be the stopping time that denotes the first time that is below the acceptance level,
(7) 
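To illustrate the reneging risk and the stopping time just defined, the following minimal sketch evaluates the probability that a Gaussian reward falls below an acceptance level, and simulates the first round at which this happens when the same arm is pulled repeatedly. It assumes, as in the description above, that the reward is Gaussian with a given mean and variance; the function names and all numerical values are hypothetical and not taken from the paper.

```python
import math
import random


def normal_cdf(z: float) -> float:
    """CDF of the standard Gaussian, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def reneging_risk(mean: float, variance: float, acceptance_level: float) -> float:
    """P(reward < acceptance_level) when reward ~ N(mean, variance)."""
    if variance <= 0.0:
        raise ValueError("variance must be positive")
    return normal_cdf((acceptance_level - mean) / math.sqrt(variance))


def simulate_stopping_time(mean, variance, acceptance_level, rng, max_rounds=100_000):
    """First round whose sampled reward falls below the acceptance level.

    Assumption: the same arm (same mean and variance) is pulled every round,
    and rewards are independent across rounds.
    """
    std = math.sqrt(variance)
    for t in range(1, max_rounds + 1):
        if rng.gauss(mean, std) < acceptance_level:
            return t
    return max_rounds  # safety cap for the simulation only
```

Under these assumptions the stopping time is geometric, so its average over many simulated users should be close to the reciprocal of the reneging risk.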
A policy is a rule for selecting an arm at each round of a user based on the preceding interactions with that user and other users, where denotes the set of all admissible policies. In fact, the stopping time also depends on the policy that is used, so we use to represent the stopping time of user operating under policy . Let denote the sequence of contexts that correspond to the actions of user under policy . Let be the expected reward of user under the action sequence . Then we have
(8) 
where is the probability of the event that the user stays for at least rounds. Then the total expected reward collected from users can be represented by
(9) 
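The structure of the expected reward of a single user can be sketched numerically. The code below is an illustration under a stated assumption, not the paper's exact expression: the user is served in a round only if no reneging occurred in any earlier round, reneging events are independent across rounds, and the reward of the round in which reneging occurs is still collected. The names `expected_user_reward` and `fixed_action_reward` are hypothetical.

```python
from typing import Sequence


def expected_user_reward(means: Sequence[float], risks: Sequence[float]) -> float:
    """Sum over rounds of (mean reward) x (probability the user is still present).

    means[t] and risks[t] are the mean reward and the reneging risk of the
    action chosen in round t+1 (assumed inputs, computed elsewhere).
    """
    total, stay_prob = 0.0, 1.0
    for mean, risk in zip(means, risks):
        total += stay_prob * mean   # reward is collected if the user is still here
        stay_prob *= 1.0 - risk     # the user survives this round w.p. 1 - risk
    return total


def fixed_action_reward(mean: float, risk: float) -> float:
    """Closed form of the infinite sum when the same arm is pulled every round."""
    if not 0.0 < risk <= 1.0:
        raise ValueError("risk must lie in (0, 1]")
    return mean / risk
```

For a fixed action the sum is a geometric series, which is why a lower bound on the reneging risk keeps the expected reward of each user finite.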
We are ready to define the pseudo-regret of the heteroscedastic bandits with reneging as
(10) 
where is the optimal policy in terms of pseudo-regret among admissible policies, i.e.,
(11) 
The objective of the learner is to learn a policy that achieves as small a regret as possible.
Illustrative examples for heteroscedasticity and reneging risk are shown in Figure 1. In Figure 1(a), the variance of the reward distribution gradually increases as the value of the one-dimensional context increases. Although the mean of the reward distribution still follows the conventional formulation of being a linear function of context, and thus the ordinary least squares estimator is still unbiased, the context-dependent variance makes the standard error estimates biased, and invalidates the method usually used to construct the confidence bounds. Each user-arm pair corresponds to a distribution with distinct mean and variance. Moreover, the presence of reneging risk makes every observation have a probability of being the last one, which makes the learning task more challenging. Intuitively, the optimal policy prefers the distribution that has large mean and low reneging risk. Unfortunately, it is nontrivial to follow that intuition in optimal policy construction. As shown in Figure 1(b), the reward distribution has mean and variance , correspondingly and variance for . The two correspond to the same user, but for different arms. Thus they have the same acceptance level . A learner may prefer pulling distribution as its mean reward is higher than . However, since the variance of is also higher than , the reneging risk (the blue shaded area) is higher than (the red shaded area) as well. When considering which arm to pull, the learner faces an additional dilemma (beyond the exploration vs. exploitation dilemma) of choosing between receiving a higher reward for one pull and staying longer to collect more future rewards. This makes the model distinct and especially difficult to solve.
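The dilemma between a higher one-pull reward and a longer lifetime can be made concrete with a small numerical sketch. The two arms and the acceptance level below are hypothetical values chosen for illustration (they are not the values behind Figure 1), and the lifetime value of an arm is taken to be its mean reward divided by its reneging risk, which assumes the same arm is pulled in every round.

```python
import math


def normal_cdf(z: float) -> float:
    """CDF of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lifetime_value(mean: float, variance: float, acceptance_level: float) -> float:
    """Expected total reward when one arm is pulled until the user reneges."""
    risk = normal_cdf((acceptance_level - mean) / math.sqrt(variance))
    return mean / risk


ACCEPTANCE_LEVEL = 0.5                      # hypothetical, shared by both arms
safe_arm = {"mean": 1.0, "variance": 0.25}  # lower mean, lower variance
risky_arm = {"mean": 1.2, "variance": 4.0}  # higher mean, higher variance

safe_value = lifetime_value(acceptance_level=ACCEPTANCE_LEVEL, **safe_arm)
risky_value = lifetime_value(acceptance_level=ACCEPTANCE_LEVEL, **risky_arm)
print(f"safe arm: {safe_value:.3f}, risky arm: {risky_value:.3f}")
```

With these particular numbers the arm with the lower mean collects more reward over the user's lifetime, because its smaller variance keeps the user engaged for more rounds.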
4 Algorithms and Results
In this section, we present a UCB-type algorithm for heteroscedastic bandits with reneging. We start by introducing general results on heteroscedastic regression.
4.1 Heteroscedastic Regression
In this section, we consider a general regression problem with heteroscedasticity.
4.1.1 Generalized Least Squares Estimators
With a slight abuse of notation, let be a collection of pairs of context and reward realization that are collected sequentially. Recall from (1)-(3) that and with unknown parameters and . Note that given the contexts , are mutually independent. Let and be the row vectors of the reward realizations and the deviations from the mean reward, respectively. Let be an matrix in which the th row is , for all . We use to denote the estimators of and based on the observations , respectively. Moreover, define the estimated deviation with respect to as
(12) 
Let . Let denote the identity matrix, and let denote the Hadamard product of any two vectors . We consider the generalized least squares estimators (GLSE) [23]
(13)  
(14) 
where is some regularization parameter and is the preimage of the vector .
Remark 1
Note that in (13), is the conventional ridge regression estimator. On the other hand, to obtain an estimator , (14) still follows the ridge regression approach, but with two additional steps: (i) derive the estimated deviation based on , and (ii) apply the map on the square of . It is known that defined in (14) has some nice asymptotic properties (e.g. Chapter 8.2 of [23]). However, it remains unknown how to obtain nonasymptotic results regarding the confidence set for . This question will be answered rigorously in Section 4.1.2.
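The two-stage procedure described in Remark 1 can be sketched as follows. This is a minimal illustration under explicit assumptions, not the authors' implementation: the variance link is taken to be the hypothetical choice f(z) = z + c (so its preimage is y - c), both stages share one regularization parameter `lam`, and the names `ridge`, `glse`, and `simulate` are invented for this example. Only the standard library is used, so a small Gaussian-elimination solver stands in for a linear algebra package.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def solve(a: List[Vector], b: Vector) -> Vector:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ridge(contexts: List[Vector], targets: Vector, lam: float) -> Vector:
    """Ridge regression: (X^T X + lam I)^{-1} X^T y."""
    d = len(contexts[0])
    gram = [[(lam if i == j else 0.0) + sum(x[i] * x[j] for x in contexts)
             for j in range(d)] for i in range(d)]
    rhs = [sum(x[i] * y for x, y in zip(contexts, targets)) for i in range(d)]
    return solve(gram, rhs)


def glse(contexts: List[Vector], rewards: Vector, lam: float,
         f_inverse: Callable[[float], float]) -> Tuple[Vector, Vector]:
    """Stage 1: ridge estimate of the mean parameter.
    Stage 2: ridge regression of f^{-1}(squared residual) on the contexts."""
    theta_hat = ridge(contexts, rewards, lam)
    residuals = [r - dot(theta_hat, x) for x, r in zip(contexts, rewards)]
    phi_hat = ridge(contexts, [f_inverse(e * e) for e in residuals], lam)
    return theta_hat, phi_hat


def simulate(n: int, theta: Vector, phi: Vector,
             f: Callable[[float], float], rng: random.Random):
    """Draw n context-reward pairs with Gaussian noise of variance f(phi^T x)."""
    contexts, rewards = [], []
    for _ in range(n):
        x = [rng.uniform(-0.7, 0.7) for _ in theta]
        std = f(dot(phi, x)) ** 0.5
        contexts.append(x)
        rewards.append(dot(theta, x) + rng.gauss(0.0, std))
    return contexts, rewards
```

A usage example would call `simulate` with known parameters and check that `glse(contexts, rewards, 1.0, lambda y: y - 1.0)` returns estimates close to them; the variance parameter typically needs more samples than the mean parameter, since squared residuals are noisier than the rewards themselves.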
4.1.2 Confidence Sets for GLSE
In this section, we discuss the confidence sets for the estimators and described above. To simplify notation, we define a matrix as
(15) 
A confidence set for was introduced in [1]. For convenience, we restate the results in the following lemma.
Lemma 1
(Theorem 2 in [1]) For all , define
(16) 
For any , with probability at least , for all , we have
(17) 
where is the induced vector norm of vector with respect to .
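For reference, the confidence radius in Theorem 2 of [1] has a well-known closed form when the contexts have norm at most L, the unknown parameter has norm at most S, and the noise is R-sub-Gaussian. The sketch below computes that radius; the symbols R, L, S and the function name are assumptions made for this illustration, and the constants used in (16) of this paper may be stated differently.

```python
import math


def ridge_confidence_radius(n: int, d: int, lam: float, delta: float,
                            noise_scale: float,
                            context_bound: float = 1.0,
                            param_bound: float = 1.0) -> float:
    """R * sqrt(d * log((1 + n L^2 / lam) / delta)) + sqrt(lam) * S.

    n: number of samples, d: dimension, lam: ridge regularization,
    delta: failure probability, noise_scale: sub-Gaussian constant R,
    context_bound: L, param_bound: S.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if lam <= 0.0:
        raise ValueError("lam must be positive")
    log_term = math.log((1.0 + n * context_bound ** 2 / lam) / delta)
    return noise_scale * math.sqrt(d * log_term) + math.sqrt(lam) * param_bound
```

The radius grows only logarithmically in the number of samples, while the matrix defining the norm grows with the data, which is what makes the confidence set shrink around the true parameter.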
Remark 2
Next, we derive the confidence set for . Define
(18)  
(19) 
where and are some universal constants that will be described in Lemma 3. The following is the main theorem on the confidence set for .
Theorem 1
For all , define
(20) 
For any , with probability at least , for all , we have
(21) 
To demonstrate the main idea behind Theorem 1, we outline the proof in Lemmas 2-5 below. First, to quantify the difference between and , we start by considering the inner product of an arbitrary vector and in the following lemma.
Lemma 2
For any , we have
(22)  
(23)  
(24)  
(25) 

The proof is provided in Appendix A.1.
Lemma 3
For any , for any , with probability at least , we have
(26) 

We highlight the main idea of the proof. Recall that . Therefore, is a distribution with a scaling of . Hence, each element in has zero mean. Moreover, we observe that is quadratic. Since the distribution is sub-exponential, we utilize a proper tail inequality for quadratic forms of sub-exponential distributions to derive an upper bound. The complete proof is provided in Appendix A.2.
Next, we derive an upper bound for (24).
Lemma 4
For any , for any , with probability at least , we have
(27) 
Next, we provide an upper bound for (25).
Lemma 5
For any , for any , with probability at least , we have
(28) 
Now we are ready to put all the above together and prove Theorem 1.

We use to denote the smallest eigenvalue of a square symmetric matrix. Recall that is positive definite for all . Then we have
(29)
By (29) and Lemmas 2-5, we know that for a given and a given , with probability at least , we have
(30)
Note that (30) holds for any . By substituting into (30), we have
(31)
Since , we know that for a given and a given , with probability at least ,
(32)
Finally, to obtain a uniform bound, we simply choose and apply the union bound to (32) over all . Note that . Therefore, we conclude that with probability at least , for all ,
(33)
The proof is complete.
4.2 Heteroscedastic UCB Policy
In this section, we formally introduce the proposed UCB policy based on the heteroscedastic regression discussed in Section 4.1.
4.2.1 An Oracle Policy
In this section, we consider a policy which has access to an oracle with full knowledge of and . Consider users that arrive sequentially. Let be the sequence of contexts that correspond to the actions for the user under an oracle policy . The oracle policy is constructed by choosing
(34) 
for each . Due to the construction in (34), we know that achieves the largest possible expected reward for each user , and is hence optimal in terms of the pseudo-regret defined in Section 3. Based on (8) and (34), by using a one-step optimality argument, it is easy to verify that is a fixed policy for each user , i.e., , for all . Let denote the total expected reward of user under . We have
(35) 
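Since the oracle policy is a fixed policy for each user, its choice can be sketched as a single maximization over that user's arms. The code below is an illustration under assumptions: the value of a fixed arm is taken to be its mean reward divided by its reneging risk, the variance link `f` is the hypothetical choice f(z) = z + 1.5, and the contexts and parameters in the usage example are invented.

```python
import math
from typing import Callable, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fixed_action_value(x: Sequence[float], theta: Sequence[float],
                       phi: Sequence[float], acceptance_level: float,
                       f: Callable[[float], float]) -> float:
    """Expected total reward of one user if the arm with context x is always pulled."""
    mean = dot(theta, x)
    variance = f(dot(phi, x))
    risk = normal_cdf((acceptance_level - mean) / math.sqrt(variance))
    return mean / risk


def oracle_arm(contexts: Sequence[Sequence[float]], theta: Sequence[float],
               phi: Sequence[float], acceptance_level: float,
               f: Callable[[float], float]) -> int:
    """Index of the arm with the largest fixed-action value (known parameters)."""
    return max(range(len(contexts)),
               key=lambda a: fixed_action_value(contexts[a], theta, phi,
                                                acceptance_level, f))


# Hypothetical example: two arms with equal mean reward but different variance.
theta, phi = [1.0, 0.0], [0.0, 1.0]
link = lambda z: z + 1.5
arms = [[0.6, -0.8], [0.6, 0.8]]
print(oracle_arm(arms, theta, phi, 0.0, link))  # acceptance level below the mean
print(oracle_arm(arms, theta, phi, 1.0, link))  # acceptance level above the mean
```

In this example the low-variance arm is preferred when the acceptance level lies below the mean reward, and the high-variance arm is preferred when it lies above, which shows that the best arm depends on the user's acceptance level and not only on the reward distribution.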
Next, we derive a useful property regarding (35). For any given , define the function as
(36) 
Note that for any given , equals the total expected reward of a single user with threshold if a fixed action with context is chosen under parameters . We show that has the following nice property.
Theorem 2
Let be a invertible matrix. For any with , , for any with , , for any , for any ,
(37)  
(38) 
where and are some finite positive constants that are independent of and .

The main idea is to apply a first-order approximation under Lipschitz continuity of and . The detailed proof is provided in Appendix A.5.
4.2.2 The HRUCB Policy
To begin with, we introduce an upper confidence bound based on the GLSE described in Section 4.1. Note that the results in Theorem 1 depend on the size of the set of context-reward pairs. Moreover, in our bandit model, the number of rounds of each user is a stopping time and can be arbitrarily large. To address this, we propose to actively maintain a regression sample set through a function . Specifically, we let the size of grow at a proper rate regulated by . One example is to choose for some constant . Since each user will play for at least one round, we know is at least after interacting with users. We use to denote the regression sample set right after the departure of user . Moreover, let be the matrix in which the rows are composed by the contexts of all the elements in . Similar to (15), we define , for all . To simplify notation, we also define
(39) 
For any , we define the upper confidence bound as follows:
(40) 
Next, we show that is indeed an upper confidence bound.

The proof is provided in Appendix A.6.
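The general shape of such an index can be sketched as a plug-in estimate of the fixed-action value plus an exploration bonus proportional to the norm of the context weighted by the inverse of the regularized Gram matrix. The code below is a sketch under assumptions and is not claimed to reproduce (40): the coefficient `bonus` stands in for the quantity defined in (39) and is passed in as a number, the variance link `f` is supplied by the caller, and all names are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def solve(a: List[List[float]], b: Sequence[float]) -> List[float]:
    """Solve a y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    y = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (m[i][n] - tail) / m[i][i]
    return y


def weighted_norm(x: Sequence[float], v: List[List[float]]) -> float:
    """sqrt(x^T V^{-1} x) for a positive definite matrix V."""
    return math.sqrt(dot(x, solve(v, x)))


def ucb_index(x: Sequence[float], theta_hat: Sequence[float],
              phi_hat: Sequence[float], v: List[List[float]],
              acceptance_level: float, f: Callable[[float], float],
              bonus: float) -> float:
    """Plug-in fixed-action value plus bonus * ||x||_{V^{-1}} (assumed form)."""
    mean = dot(theta_hat, x)
    variance = f(dot(phi_hat, x))
    risk = normal_cdf((acceptance_level - mean) / math.sqrt(variance))
    return mean / risk + bonus * weighted_norm(x, v)
```

As more samples with contexts similar to `x` enter the Gram matrix, the weighted norm shrinks and the index approaches the plug-in value, so the bonus mainly affects rarely explored directions.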
Now, we formally introduce the HRUCB algorithm. The complete algorithm is shown in Algorithm 1 and can be described in detail as follows:

After applying an action, HRUCB observes the corresponding reward and the reneging event if any. The current context-reward pair will be added to only if the size of is less than .

Based on the regression sample set , HRUCB updates the estimators and right after the departure of each user.
Remark 4
Note that under HRUCB, the estimators and are updated right after the departure of each user (Line 1). Alternatively, and can be updated whenever is updated. While this alternative may make slightly better use of the observations, it also incurs more computational overhead. For ease of exposition, we still focus on the “lazy-update” version presented in Algorithm 1.
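The bookkeeping described above (serve a user until reneging, add context-reward pairs only while the regression sample set is below its cap, and refresh the estimators once per departed user) can be sketched as a short loop. This is a structural sketch only and not Algorithm 1 itself: the index function, the reward sampler, and the cap schedule `cap_fn` are supplied by the caller, and every name here is hypothetical. The `max_rounds` argument is a safety cap for simulations and is not part of the model.

```python
from typing import Callable, List, Sequence, Tuple

Context = Sequence[float]
Sample = Tuple[Context, float]
IndexFn = Callable[[Context, float], float]


def serve_user(contexts: Sequence[Context], acceptance_level: float,
               index_fn: IndexFn, draw_reward: Callable[[Context], float],
               samples: List[Sample], cap: int, max_rounds: int) -> int:
    """Interact with one user until reneging; return the number of rounds played."""
    for t in range(1, max_rounds + 1):
        # Pull the arm with the largest index under the current estimators.
        arm = max(range(len(contexts)),
                  key=lambda a: index_fn(contexts[a], acceptance_level))
        reward = draw_reward(contexts[arm])
        # Store the pair only while the sample set is below its cap.
        if len(samples) < cap:
            samples.append((contexts[arm], reward))
        # The user reneges when the reward falls below the acceptance level.
        if reward < acceptance_level:
            return t
    return max_rounds


def run(users: Sequence[Tuple[Sequence[Context], float]],
        index_factory: Callable[[List[Sample]], IndexFn],
        draw_reward: Callable[[Context], float],
        cap_fn: Callable[[int], int], max_rounds: int = 10_000):
    """Lazy-update loop: the index is rebuilt once after each user departs."""
    samples: List[Sample] = []
    index_fn = index_factory(samples)
    lifetimes = []
    for i, (contexts, level) in enumerate(users, start=1):
        lifetimes.append(serve_user(contexts, level, index_fn, draw_reward,
                                    samples, cap_fn(i), max_rounds))
        index_fn = index_factory(samples)  # refresh estimators after departure
    return lifetimes, samples
```

Here `index_factory` would refit the estimators on the stored samples and return the resulting index function. Rebuilding it once per departed user rather than once per stored sample corresponds to the lazy-update choice discussed in Remark 4.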
4.3 Regret Analysis
In this section, we provide the regret analysis for the proposed HRUCB policy.
Theorem 3
Under HRUCB, with probability at least , the pseudo-regret is upper bounded as
(42)  
(43) 
Moreover, by choosing for some constant , we have
(44) 

The proof is provided in Appendix A.7.
Remark 5
Remark 6
We briefly discuss the difference between our regret bound and the regret bounds of other related settings. Note that if the acceptance level for all , then all the users will quit after exactly one round. This corresponds to the conventional contextual bandits setting (e.g. homoscedastic case [10] and heteroscedastic case [15]). In this degenerate case, our regret bound is , which has an additional factor resulting from the heteroscedasticity with reneging.
5 Concluding Remarks
In this paper, we have studied the challenges in bandit modeling that arise from heteroscedasticity and reneging. Most existing contextual bandit algorithms neglect them and therefore cannot be used when they are present. These complications exist in many real-world applications, and taking them into account is economically necessary for the success of the business. To attack the above challenges, we have formulated a heteroscedastic bandit model with reneging, where the user may quit from future interactions if the reward falls below its acceptance level, and the variance of the reward distribution can depend on context. We have proposed a UCB-type policy, called HRUCB, to solve this novel model, and proved that it achieves sublinear regret. The techniques we developed to estimate heteroscedastic variance and establish sublinear regret under the presence of heteroscedasticity can be extended to other variance-sensitive bandit problems, such as risk-averse bandits, thresholding bandits, etc.
References
 [1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
 [2] J. D. Abernethy, K. Amin, and R. Zhu. Threshold bandits, with and without censored feedback. In Advances in Neural Information Processing Systems, pages 4889–4897, 2016.
 [3] S. Aflaki and I. Popescu. Managing retention in service relationships. Management Science, 60(2):415–433, 2013.
 [4] S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.
 [5] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
 [6] S. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
 [7] S. Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
 [8] A. Cassel, S. Mannor, and A. Zeevi. A general approach to multi-armed bandits under risk criteria. In Annual Conference on Learning Theory, pages 1295–1306, 2018.
 [9] A. R. Chaudhuri and S. Kalyanakrishnan. Quantile-regret minimisation in infinitely many-armed bandits. In Association for Uncertainty in Artificial Intelligence, 2018.
 [10] W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
 [11] L. Erdős, H.-T. Yau, and J. Yin. Bulk universality for generalized Wigner matrices. Probability Theory and Related Fields, 154(1-2):341–407, 2012.
 [12] L. Jain and K. Jamieson. Firing bandits: Optimizing crowdfunding. In International Conference on Machine Learning, pages 2211–2219, 2018.
 [13] Y. Kanoria, I. Lobel, and J. Lu. Managing customer churn via service mode control. Columbia Business School Research, 2018.
 [14] A. Kazerouni, M. Ghavamzadeh, Y. Abbasi, and B. Van Roy. Conservative contextual linear bandits. In Advances in Neural Information Processing Systems, pages 3910–3919, 2017.
 [15] J. Kirschner and A. Krause. Information directed sampling and bandits with heteroscedastic noise. In Annual Conference on Learning Theory, pages 358–384, 2018.
 [16] O. Ledoit and M. Wolf. Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. Journal of Empirical Finance, 10(5):603–621, 2003.
 [17] A. Locatelli, M. Gutzeit, and A. Carpentier. An optimal algorithm for the thresholding bandit problem. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, pages 1690–1698. JMLR.org, 2016.
 [18] A. Sani, A. Lazaric, and R. Munos. Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 3275–3283, 2012.
 [19] S. Schmit and R. Johari. Learning with abandonment. In International Conference on Machine Learning, pages 4516–4524, 2018.
 [20] W. Sun, D. Dey, and A. Kapoor. Safety-aware algorithms for adversarial contextual bandit. In International Conference on Machine Learning, pages 3280–3288, 2017.
 [21] B. Szorenyi, R. Busa-Fekete, P. Weng, and E. Hüllermeier. Qualitative multi-armed bandits: A quantile-based approach. In 32nd International Conference on Machine Learning, pages 1660–1668, 2015.
 [22] J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
 [23] J. M. Wooldridge. Introductory econometrics: A modern approach. Nelson Education, 2015.
 [24] Y. Wu, R. Shariff, T. Lattimore, and C. Szepesvári. Conservative bandits. In International Conference on Machine Learning, pages 1254–1262, 2016.
Appendix A Appendix
A.1 Proof of Lemma 2

Recall that . Note that
(45) (46) (47) (48) (49)
Therefore, for any , we know
(50) (51) (52) (53)
Moreover, by rewriting , we have
(54) (55) (56) (57)
where (56)-(57) follow from the fact that is bi-Lipschitz continuous and hence is Lipschitz continuous as described in Section 3. Therefore, by (50)-(57) and the Cauchy-Schwarz inequality, we have
(58) (59) (60) (61)
A.2 Proof of Lemma 3
We first introduce the following useful lemmas.
Lemma 7 (Lemma B.2 in [11])
Let be independent random complex variables with zero mean and variance and having the uniform sub-exponential decay, i.e., there exists such that
(62) 
We use to denote the conjugate transpose of . Let , let denote the complex conjugate of , for all , and let be a complex matrix. Then, we have
(63)  
(64) 
where and are positive constants that depend only on . Moreover, for the standard distribution, and .
Lemma 8
(65) 

By the definition of induced matrix norm,
(66) (67) (68) (69)
where (69) follows from the singular value decomposition and .
To simplify notation, we use and as a shorthand for and , respectively. For convenience, we rewrite as the matrix of column vectors (each ).