Risk-Averse Explore-Then-Commit Algorithms for Finite-Time Bandits
Abstract
In this paper, we study multi-armed bandit problems in an explore-then-commit setting, where the goal is to identify the best arm after a pure experimentation (exploration) phase and then exploit it once or a given finite number of times. We observe that although the arm with the highest expected reward is the most desirable choice for infinitely many exploitations, it is not necessarily the arm that is most probable to yield the highest reward in a single exploitation or a finite number of exploitations. Instead, we advocate risk-aversion, where the objective is to compete against the arm with the best risk-return trade-off. We propose two algorithms whose objective is to select the arm that is most probable to reward the most. Using a new notion of finite-time exploitation regret, we derive an upper bound on the minimum number of experiments required before commitment in order to guarantee a prescribed upper bound on the regret. In contrast to existing risk-averse bandit algorithms, our algorithms do not rely on hyper-parameters, which results in more robust behavior, as verified by numerical evaluations.
I Introduction
One of the classes of decision making models is the multi-armed bandit (MAB) framework where decision makers learn the model of different arms that are unknown and actions do not change the state of arms [1]. The MAB problem was originally proposed by Robbins [2], and has a wide range of applications in finance [3, 4], health-care [5], autonomous vehicles [6, 7], communication and networks [8, 9, 10, 11, 12], and energy management [13, 14] to name but a few. In the classical MAB problem, the decision-maker sequentially selects an arm (action) with an unknown reward distribution out of independent arms. The random reward of the selected arm is revealed and the rewards of other arms remain unknown. At each step, the decision-maker encounters a dilemma between exploitation of the best identified arm versus exploration of alternative arms. The goal of the classical model of multi-armed bandit is to maximize the expected cumulative reward over a time horizon.
In this paper, we focus on a setting where a player is allowed to explore different arms in the exploration (or experimentation, used interchangeably) phase before committing to the best identified arm for exploitation once or a given finite number of times. This setting is motivated by several application domains such as personalized health-care and one-time investment. In such applications, exploitation is costly and/or it is infeasible to exploit a large number of times, but arms can be experimented with many times at negligible cost through simulation and/or historical data [15]. A major step in personalized health-care is to provide an individual patient with his/her disease risk profile based on his/her electronic medical record and personalized assessments [16, 17]. Different treatments (arms) are evaluated for a person by simulation or mice trials many times at low cost, but in the end one personalized treatment is exploited once for the patient [18, 19]. Another example of one-time exploitation is one-time investment, where an investor chooses one factory out of several. Based on experimentation on historical data, he/she selects a factory to invest in once. The common theme in both examples is to identify the best arm for one-time exploitation after an experimentation phase of pure exploration.
The above setting falls in the class of MAB problems called explore-then-commit. To the best of our knowledge, the previous works [15, 20, 21, 22, 13, 23] on explore-then-commit bandits try to identify the arm that is optimal under a risk-return criterion in an expectation sense, up to a hyper-parameter. Even though this objective is desirable in settings with infinite exploitations, it is not necessarily the best objective in the explore-then-commit setting with a single exploitation or finitely many exploitations. We elaborate on this observation with an illustrative example in Section III. We advocate an alternative approach in which the objective is to select an arm that is most probable to reward the most. It has been recognized that in many multi-armed bandit scenarios, selecting an arm by maximum expected reward is not the best strategy. In such scenarios, players not only aim to achieve the maximum cumulative reward, but also want to minimize uncertainty, such as risk, in the outcome [24]; such approaches are known as risk-averse MAB. In the literature, there are several approaches to risk-averse MAB, including mean-variance (MV) [23] and the conditional value at risk (CVaR) [13]. The performance of MV and of CVaR each depends heavily on a single scalar hyper-parameter, and selecting an inappropriate hyper-parameter might degrade the performance substantially. More details on the MV and CVaR criteria are given in Section II, and the negative impact of hyper-parameter mismatch is studied in Section V.
Contributions: We propose a class of hyper-parameter-free risk-averse algorithms (called OTE/FTE-MAB) for explore-then-commit bandits with finite-time exploitations. The goal of the algorithms is to select the arm that is most probable to give the player the highest reward. To analyze the algorithms, we define a new notion of finite-time exploitation regret for our setting of interest. We derive an upper bound on the minimum number of experiments needed to guarantee a prescribed upper bound on the regret. More specifically, our results show that with the proposed algorithms, the regret can be made arbitrarily small with a sufficient number of experiments. As a salient feature, the OTE/FTE-MAB algorithm is hyper-parameter-free, so it is not prone to errors due to hyper-parameter mismatch.
Organization of the Paper: Section II discusses related work. In Section III, the one/finite-time exploitation multi-armed bandit problem after an experimentation phase is formally described. We define a new notion of one/finite-time exploitation regret for our problem setup, and provide an example that clarifies the motivation of our work. In Section IV, we propose the OTE-MAB and FTE-MAB algorithms, and derive an upper bound on the minimum number of pure explorations needed to guarantee a prescribed upper bound on the regret. In Section V, we evaluate the OTE-MAB algorithm against risk-averse baselines and compare the minimum number of experiments needed to guarantee an upper bound on regret for the OTE-MAB and FTE-MAB algorithms. We conclude the paper with a discussion of opportunities for future work in Section VI.
II Related Work
Explore-then-commit bandits are a class of multi-armed bandit problems with two consecutive phases, called exploration (experimentation) and commitment. The decision maker can explore each arm arbitrarily in the experimentation phase; however, he/she needs to commit to one selected arm in the commitment phase. Several studies on explore-then-commit bandits exist in the literature. Bui et al. [15] studied the optimal number of explorations when cost is incurred in both phases. Liau et al. [25] designed an explore-then-commit algorithm for the case where there is limited space to record the arm reward statistics. Perchet et al. [26] studied explore-then-commit policies under the assumption that the employed policy must split explorations into a number of batches. None of these works addresses risk aversion in explore-then-commit bandits. In the following, we present an overview of risk-averse bandits.
There are several criteria to measure and model risk in a risk-averse multi-armed bandit problem. One of the common risk measures is the mean-variance paradigm [27]. The two algorithms MV-LCB and ExpExp proposed by Sani et al. [23] are based on the mean-variance concept. They define the mean-variance of an arm with mean $\mu$ and variance $\sigma^2$ as $\mathrm{MV} = \sigma^2 - \rho\mu$, where $\rho$ is the absolute risk tolerance coefficient. In an infinite-horizon multi-armed bandit problem, MV-LCB plays the arm with the minimum lower confidence bound on the estimate of MV. In a best-arm identification setting, the ExpExp algorithm explores each of the arms the same number of times and selects the arm with the minimum estimated MV. This approach is followed by numerous researchers in risk-averse multi-armed bandit problems [24, 28, 29, 30].
Another way of considering risk in multi-armed bandit problems is to use the conditional value at risk at level $\alpha$, $\mathrm{CVaR}_\alpha$, which is the expected return within a specified quantile. CVaR is utilized by Galichet et al. [13] in risk-aware multi-armed bandit problems. They presented the Multi-Armed Risk-Aware Bandit (MaRaB) algorithm, which aims to select the arm with the maximum conditional value at risk at level $\alpha$, $\mathrm{CVaR}_\alpha$. Formally, let $0 < \alpha < 1$ be the target quantile level and let $v_\alpha$, defined by $\mathbb{P}(X < v_\alpha) = \alpha$, be the associated quantile value, where $X$ is the arm reward. The conditional value at risk is then defined as $\mathrm{CVaR}_\alpha = \mathbb{E}[X \mid X < v_\alpha]$. CVaR is also followed by researchers in multi-armed bandit problems [24, 31, 32, 33, 34].
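As a rough illustration of the two risk criteria reviewed above, the following Python sketch computes empirical versions of both from a list of observed rewards. It assumes the mean-variance form of Sani et al. [23] (variance minus the risk tolerance coefficient times the mean, where smaller is better) and approximates the conditional value at risk by the average of the lowest fraction `alpha` of the observed rewards. The function names and the two reward lists are hypothetical.

```python
import math
from statistics import fmean, pvariance


def empirical_mv(rewards, rho):
    """Empirical mean-variance: variance - rho * mean (lower is better)."""
    return pvariance(rewards) - rho * fmean(rewards)


def empirical_cvar(rewards, alpha):
    """Average of the lowest ceil(alpha * n) rewards (higher is better)."""
    k = max(1, math.ceil(alpha * len(rewards)))
    return fmean(sorted(rewards)[:k])


# Hypothetical observations: a constant arm and a high-variance arm.
safe = [1.0, 1.0, 1.0, 1.0]
risky = [0.0, 0.0, 4.0, 4.0]

# The arm preferred by mean-variance flips with the hyper-parameter rho.
prefers_safe_at_rho_1 = empirical_mv(safe, 1.0) < empirical_mv(risky, 1.0)
prefers_risky_at_rho_10 = empirical_mv(risky, 10.0) < empirical_mv(safe, 10.0)
```

In this toy case the same data lead to opposite selections for `rho = 1` and `rho = 10`, which is the kind of hyper-parameter dependence discussed in the text.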
III Problem Statement
Consider arms whose rewards are random variables that have unknown distributions with unknown finite expected values , respectively. The goal is to identify the best arm at the end of an experimentation phase that is followed by an exploitation phase, where the best arm is exploited for a given number of times, . In the experimentation phase, each arm is sampled for independent times. Denote the observed reward of arm at iteration of experimentation by .
Let where are independent and identically distributed random variables and . The optimum arm for exploitations in the sense that maximizes the probability of receiving the highest reward is
(1) |
where and what being greater than or equal to a vector means is that it is greater than or equal to all elements of the vector. Let . Given the above preliminaries, the finite-time exploitation regret is defined below.
Definition 1
The finite-time exploitation regret, , is defined as a function of an input for the selected arm as
(2) |
Note that the above definition of regret is different from the regret commonly used in bandit problems. In the following, an example is presented that motivates this new notion of regret for the finite-time exploitation setting.
III-A Illustrative Example
As mentioned in the Introduction, although the arm with the highest expected reward is the optimum arm for utilization in infinite number of exploitations, it is not necessarily the one that is most probable to have the highest reward in a single or some finite number of exploitations. In the following example, two arms are considered such that , but it is more probable that a one-time exploitation of the first arm rewards us more than a one-time exploitation of the second arm. Hence, arm is not necessarily the ideal arm for one-time exploitation let alone the arm with the maximum empirical mean, i.e. .
Example 1
Consider two arms with the following independent reward distributions:
where and are constants for which each of the two distributions integrate to one and is the indicator function.
In example 1, although the second arm has a larger mean than the first one, and , the variance of reward received from the second arm is larger than that from the first one, which increases the risk of choosing the second arm for a one-time exploitation application. In fact, the first arm with lower mean is more probable to reward us more than the second arm since . In general, a larger variance for the received reward is against the principle of risk-aversion where the objective is to keep a balance in a trade-off between the expected return and risk of an action [23]. Mean-variance is an existing approach to tackle this scenario. However, it has some drawbacks that are explained in details in the following.
The mean-variance (MV) of an arm depends on the hyper-parameter $\rho$, which is the absolute risk tolerance coefficient. If $\rho$ is set to zero, the arm with the minimum variance is selected. On the other hand, if $\rho$ goes to infinity, the arm with the maximum expected reward is selected, which is the same as the classical multi-armed bandit approach. Although the behavior of the mean-variance trade-off is known for these extreme values of $\rho$, it is not obvious what value of the hyper-parameter keeps a desirable balance between return and risk. The choice of this hyper-parameter can be tricky and, as will be shown in Section V, a bad choice can increase the regret dramatically. As a simple example, consider two arms with unknown parameters , and . The mean-variance trade-off is formalized as $\hat{\sigma}^2 - \rho\hat{\mu}$, where $\hat{\sigma}^2$ and $\hat{\mu}$ are empirical estimates of the variance and mean of each arm. Note that the empirical means and variances converge to the true values, so the second arm, which performs worse with probability one, is selected if . To address this issue, we instead propose the following best-arm identification algorithm for One-Time (Finite-Time) Exploitation in a Multi-Armed Bandit problem (the OTE/FTE-MAB algorithm), which has concrete mathematical support for its action and is hyper-parameter-free.
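The observation behind example 1, that a higher mean does not imply a higher chance of winning a single draw, can be checked on a deliberately simple pair of arms. The Python sketch below computes both quantities exactly for two discrete reward distributions. These distributions and the function names are hypothetical and are not the distributions of example 1.

```python
from itertools import product

# Hypothetical discrete arms given as {reward value: probability}.
arm1 = {1.0: 1.0}              # always pays 1
arm2 = {0.0: 0.6, 10.0: 0.4}   # pays 10 with probability 0.4, else 0


def mean(arm):
    """Expected reward of a discrete arm."""
    return sum(v * p for v, p in arm.items())


def prob_at_least(a, b):
    """P(X_a >= X_b) for independent discrete rewards X_a and X_b."""
    return sum(
        pa * pb
        for (va, pa), (vb, pb) in product(a.items(), b.items())
        if va >= vb
    )


mean_gap = mean(arm2) - mean(arm1)     # positive: arm2 has the larger mean
p_arm1_wins = prob_at_least(arm1, arm2)  # yet arm1 wins a single draw more often
```

Here the second arm has the larger expected reward, while the first arm is the one more likely to pay at least as much in a single exploitation.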
IV One/Finite-Time Exploitation in Multi-Armed Bandit Problems after an Experimentation Phase
In this section, we propose the OTE-MAB and FTE-MAB algorithms. The OTE-MAB algorithm is a specific case of the FTE-MAB algorithm. Since the proof of the theorem for the FTE-MAB algorithm is notationally heavy, we first present the OTE-MAB algorithm in Subsection IV-A and postpone the FTE-MAB algorithm to Subsection IV-B.
IV-A The OTE-MAB Algorithm
The OTE-MAB algorithm aims to play the arm that is most probable to reward the most in the case of a single exploitation, as
(3) |
which is a specific case of Equation (1). For simplicity of notation, the -notation is dropped in this subsection.
Remark 1
A more general version of the OTE-MAB algorithm is to concatenate a constant to vector as .
Since the reward distributions of the independent arms are not known, the exact values of are unknown. Hence, estimates of these probabilities, , must be computed from the observations in the experimentation phase as follows:
(4) |
where and rewards of different arms are assumed to be independent.
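A minimal Python sketch of an estimator of this kind for independent arms is given below. It assumes that the estimate for an arm is the fraction of all cross-arm combinations of observations in which that arm's observed reward is at least as large as every other arm's; this reading of Equation (4), the function names, and the sample data are assumptions made for illustration. Sorting the other arms' observations avoids enumerating every combination explicitly.

```python
from bisect import bisect_right


def ote_estimates(samples):
    """samples[k] lists the rewards observed for arm k (arms independent).

    For each arm i, returns the fraction of all cross-arm combinations of
    observations in which arm i's reward is >= every other arm's reward.
    """
    sorted_obs = [sorted(s) for s in samples]
    estimates = []
    for i, obs_i in enumerate(samples):
        total = 0.0
        for r in obs_i:
            frac = 1.0
            for k, other in enumerate(sorted_obs):
                if k != i:
                    # Fraction of arm k's observations that are <= r.
                    frac *= bisect_right(other, r) / len(other)
            total += frac
        estimates.append(total / len(obs_i))
    return estimates


def ote_select(samples):
    """Index of the arm with the largest estimate (first index on ties)."""
    est = ote_estimates(samples)
    return max(range(len(est)), key=est.__getitem__)


# Hypothetical observations: arm 0 is constant, arm 1 is occasionally large.
observations = [[1, 1, 1, 1], [0, 0, 0, 10]]
chosen = ote_select(observations)
```

With these observations the rule selects arm 0, even though arm 1 has the larger empirical mean (2.5 versus 1).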
Remark 2
If rewards of different arms are dependent, instantaneous observations of all arms at the same time are needed for times and is calculated as follows:
(5) |
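For dependent arms, a corresponding sketch in Python counts only rounds in which all arms were observed together. The interpretation of Equation (5) as a per-round fraction, along with the function name and the data, is an assumption made for illustration.

```python
def ote_estimates_paired(rounds):
    """rounds[n] is the tuple of rewards of all arms observed together in round n.

    For each arm i, returns the fraction of rounds in which arm i's reward
    is >= the reward of every other arm in that same round.
    """
    n_rounds = len(rounds)
    n_arms = len(rounds[0])
    return [
        sum(1 for obs in rounds if all(obs[i] >= x for x in obs)) / n_rounds
        for i in range(n_arms)
    ]


# Hypothetical joint observations of two arms over four rounds.
joint_rounds = [(1, 0), (1, 0), (1, 10), (1, 0)]
paired_estimates = ote_estimates_paired(joint_rounds)
```

Unlike the independent case, observations from different rounds are never mixed, so any dependence between the arms within a round is preserved in the estimate.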
The OTE-MAB algorithm selects arm as the best arm in terms of rewarding the most with the highest probability in one-time exploitation. The one-time exploitation regret, , which is a specific case of Definition 1, is defined as follows, where is defined in Equation (3):
(6) |
The OTE-MAB algorithm is summarized in Algorithm 1. We next present a theorem on an upper bound of the minimum number of experiments needed to guarantee an upper bound on regret of Algorithm 1.
Theorem 1
For any if each of the arms is experimented for times in the experimentation phase, the one-time exploitation regret defined in Equation (6) is bounded by , i.e. . Note that simultaneous exploration of the arms is required in the experimentation phase if arm rewards are dependent.
Proof of Theorem 1: Consider the Bernoulli random variables and their unknown means for . Given independent observations from each of the independent or dependent arms in the pure exploration phase, the confidence interval derived from Hoeffding’s inequality for estimating based on Equation (4) or Equation (5) with confidence level has the property that
(7) |
Note that for the case of dependent arms, there is an -tuple containing the instantaneous observation of the arm rewards as for , which is used for estimation of in Equation (5). On the other hand, for the case of independent arms, any of the orderings of the observations of the arm rewards can be used for estimation of as is done in Equation (4). However, cannot be used as confidence interval with confidence level . The reason is that, although is derived from samples, not all those samples are independent, but exactly of the samples are independent. In fact, the observed independent rewards can be classified as -tuples of the arm rewards with independent elements in different ways. None of such -tuples has any priority over the other ones to estimate , so can be computed based on any of the -tuples. The estimate of derived from any of those -tuples is in with probability at least , so the average of those estimations is again in the mentioned interval with probability at least . Note that the average of estimates of derived from all of the different -tuples is equal to derived from Equation (4) due to the following reason. An element of an -tuple is repeated for times in all -tuples. Hence, averaging over the number of distinct elements of -tuples results in the same answer as the case of averaging the estimates of derived from all of different -tuples. As a result, can be used as the half width of the confidence interval for estimators obtained from Equations (4) and (5) for both independent and dependent arms.
In order to find a bound on regret, defined in Equation (6) as , note that
(8) |
where is true if . By using the union bound and Equation (7), the probability of the right-hand side of the above equation can be bounded as follows, which results in the following bound on regret:
(9) |
The above upper bound on regret is derived under the condition that , which by using and simple algebraic calculations is equivalent to .
According to Theorem 1, the selected arm by Algorithm 1, , satisfies with probability at least for any , if each of the arms is explored in the experimentation phase for times. Hence, can get arbitrarily close to by increasing the number of pure explorations in the experimentation phase.
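To give a feel for how a sample-size requirement of this type scales, the Python sketch below uses the standard two-sided Hoeffding inequality for n i.i.d. samples in [0, 1], P(|p_hat - p| >= delta) <= 2 exp(-2 n delta^2), together with a union bound over the arms. The constants are those of this generic bound and are an assumption; they are not claimed to match the exact expression in Theorem 1. The function names are hypothetical.

```python
import math


def hoeffding_half_width(n, alpha):
    """delta with P(|p_hat - p| >= delta) <= alpha for n i.i.d. samples in [0, 1]."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))


def samples_for_all_arms(n_arms, delta, failure_prob):
    """Smallest n such that, by a union bound over the arms, all n_arms
    estimates are within delta of their means with probability
    at least 1 - failure_prob."""
    per_arm = failure_prob / n_arms
    return math.ceil(math.log(2.0 / per_arm) / (2.0 * delta ** 2))


# Example: two arms, accuracy 0.05, overall failure probability 0.01.
n_needed = samples_for_all_arms(2, 0.05, 0.01)
```

Under this generic bound the required number of samples grows only logarithmically as the allowed failure probability shrinks, and quadratically as the accuracy `delta` is tightened.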
Let be the ordered list of in descending order. Note that arm is actually arm defined in Equation (3). Define the difference between the two maximum ’s as , where without loss of generality is assumed to be nonzero. Having the knowledge of or a lower bound on it, a stronger notion of regret can be defined as
(10) |
and have the following corollary.
Corollary 1
From the theoretical point of view, upon the knowledge of or a lower bound on it, for any , the regret defined in Equation (10) is bounded by i.e. , if the arms are explored for times each. If arms are dependent, instantaneous explorations of the arms are needed.
IV-B The FTE-MAB Algorithm
Consider the case where an arm is going to be exploited for finite number of times, . The best arm for -time exploitations is defined in Equation (1). Since reward distributions are unknown, ’s are needed to be estimated based on observations in pure exploration phase. In the case of independent arms, define the vector with cardinality as
(11) |
Let for be the different elements of . Let be the estimate of , where they can be computed as
(12) |
In the case of dependent arms, ’s are defined in the same way as independent arms, but note that the set corresponding to is used for generating for all . Hence, is defined as follows for dependent arms:
(13) |
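For independent arms, one possible reading of the construction above is sketched below in Python. It assumes that the vector built for each arm collects the sums over all subsets of the arm's observations whose size equals the number of exploitations, and that the estimate is the fraction of cross-arm combinations of such sums in which the arm's sum is at least as large as every other arm's. This reading, the function names, and the data are assumptions made for illustration.

```python
from bisect import bisect_right
from itertools import combinations


def subset_sums(obs, m):
    """Sorted sums over all m-element subsets of one arm's observations."""
    return sorted(sum(c) for c in combinations(obs, m))


def fte_estimates(samples, m):
    """Estimated probability, per arm, that its m-draw total is the largest."""
    sums = [subset_sums(s, m) for s in samples]
    estimates = []
    for i, sums_i in enumerate(sums):
        total = 0.0
        for s in sums_i:
            frac = 1.0
            for k, other in enumerate(sums):
                if k != i:
                    # Fraction of arm k's subset sums that are <= s.
                    frac *= bisect_right(other, s) / len(other)
            total += frac
        estimates.append(total / len(sums_i))
    return estimates


# Hypothetical observations: arm 0 is constant, arm 1 is occasionally large.
observations = [[1, 1, 1, 1], [0, 0, 0, 10]]
one_time = fte_estimates(observations, 1)
two_time = fte_estimates(observations, 2)
```

On this toy data the constant arm is favored for a single exploitation, while the two arms are tied for two exploitations, which illustrates how the preferred arm can change with the number of exploitations.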
The FTE-MAB algorithm selects arm for -time exploitations. This algorithm is summarized in Algorithm 2. We next present a theorem for an upper bound of the minimum number of experiments needed to guarantee an upper bound on regret of Algorithm 2 which is the generalization of Theorem 1.
Theorem 2
For any if each of the arms is explored for times in the experimentation phase such that , the finite-time exploitation regret defined in Definition 1 is bounded by , i.e. . If the rewards of different arms are dependent, simultaneous explorations of the arms are required for the same bound on regret.
Let be the ordered list of in descending order. Note that arm is actually arm defined in Equation (1). Define the difference between the two maximum ’s as , where without loss of generality is assumed to be nonzero. Having the knowledge of or a lower bound on it, a stronger notion of regret can be defined as
(14) |
and have the following corollary.
Corollary 2
From the theoretical point of view, upon the knowledge of or a lower bound on it, for any , the regret defined in Equation (14) is bounded by i.e. , if the arms are explored for times each, where . If arms are dependent, instantaneous explorations of the arms are needed.
Corollary 3
If converges to infinity, the problem becomes the classical multi-armed bandit problem since is the same as and due to the law of large numbers as . Hence, the FTE-MAB algorithm selects the arm with maximum expected reward if the arm is going to be exploited for infinitely many times and the cumulative reward is desired to be maximized.
V Simulation Results
In this section, we report numerical simulations validating the theoretical results presented in this paper. We compare our proposed OTE-MAB algorithm with the Upper Confidence Bound (UCB) [35], ExpExp [23], and MaRaB [13] algorithms. Consider two arms with the reward distributions given in example 1. The regret defined in Equation (10) versus the number of pure explorations for each arm, , is averaged over 100,000 runs. The result is plotted in Figure 1; as shown, OTE-MAB outperforms the state-of-the-art risk-averse algorithms in terms of the regret defined in this paper. Note that the UCB algorithm aims at selecting an arm that maximizes the expected received reward, but in example 1, the arm with the higher expected reward is less probable to have the highest reward, which is why the UCB algorithm performs poorly in this example. However, in the following example, where the arm that rewards more in expectation is also more probable to reward more, the UCB, ExpExp, and MaRaB algorithms perform as well as the OTE-MAB algorithm.
Example 2
Consider two arms with the following unknown independent reward distributions:
where and are constants so that the two probability distribution functions integrate to one.
Note that in example 2, and . For this scenario, the regret defined in Equation (10) versus the number of pure explorations for each arm, , averaged over 100,000 runs is plotted in Figure 2.
In another experiment, the multi-armed bandit is simulated for example 1 and the probability that the selected arm has the higher reward is calculated over 500,000 runs for different algorithms. The result is shown in Figure 3. This result confirms the motivation of our study on risk-averse finite-time exploitations in multi-armed bandits.
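An experiment of this flavor can be sketched in a few lines of Python. The sketch below is not the simulation reported in the figures: the two arms, the number of runs, and the two selection rules (largest estimated probability of winning a single draw versus largest empirical mean) are hypothetical stand-ins chosen to keep the code short.

```python
import random


def draw(arm, rng):
    """Hypothetical arms: arm 0 always pays 1; arm 1 pays 10 w.p. 0.4, else 0."""
    if arm == 0:
        return 1.0
    return 10.0 if rng.random() < 0.4 else 0.0


def win_fraction(obs_a, obs_b):
    """Fraction of cross-arm pairs of observations in which a's reward >= b's."""
    wins = sum(a >= b for a in obs_a for b in obs_b)
    return wins / (len(obs_a) * len(obs_b))


def run_once(n_explore, rng):
    """Explore both arms n_explore times, then report each rule's choice."""
    obs = [[draw(k, rng) for _ in range(n_explore)] for k in (0, 1)]
    by_prob = 0 if win_fraction(obs[0], obs[1]) >= win_fraction(obs[1], obs[0]) else 1
    by_mean = 0 if sum(obs[0]) >= sum(obs[1]) else 1
    return by_prob, by_mean


def simulate(n_runs=200, n_explore=50, seed=0):
    """Fraction of runs in which each rule commits to its typical arm."""
    rng = random.Random(seed)
    picks = [run_once(n_explore, rng) for _ in range(n_runs)]
    prob_rule_picks_arm0 = sum(p == 0 for p, _ in picks) / n_runs
    mean_rule_picks_arm1 = sum(m == 1 for _, m in picks) / n_runs
    return prob_rule_picks_arm0, mean_rule_picks_arm1


prob_rule_picks_arm0, mean_rule_picks_arm1 = simulate()
```

For these hypothetical arms, arm 0 pays at least as much as arm 1 in a single draw with probability 0.6, so a rule that usually commits to arm 0 is the one better aligned with one-time exploitation.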
In the above comparison of OTE-MAB with state-of-the-art algorithms, three different choices of hyper-parameters for the ExpExp and MaRaB algorithms are tested and the best performance is presented. However, note that the performance of these algorithms depends on the choice of hyper-parameter. In Figure 4, the sensitivity of the performance of the ExpExp algorithm to the choice of hyper-parameter is depicted for example 1 and a third example where the variance of the best arm is larger than the variance of the arm with lower expected reward. The two plots are the averaged regret over 100,000 runs versus the value of the hyper-parameter for the ExpExp algorithm for two different multi-armed bandit problems when . As depicted in Figure 4, a choice of hyper-parameter can be good for one multi-armed bandit problem, but not good for another one. According to our observations, the sensitivity of the MaRaB algorithm to its hyper-parameter can be even more complex. Figure 5 depicts the averaged regret over 100,000 runs versus the value of MaRaB hyper-parameter, , when . This figure is plotted for example 1 and a fourth example where the reward of the first arm has a truncated Gaussian distribution with mean three and variance two over the interval and the second arm is the same as the one in example 1.
In another experiment, we compare the minimum number of explorations needed to guarantee a bound on regret for two cases of one-time and two-time exploitations. Theorems 1 and 2 suggest that for given , and , the upper bound of minimum number of explorations needed for -time exploitations to guarantee that the regret is bounded by is times that of one-time exploitation. We design two examples of two-armed bandits such that and plot the minimum number of explorations to guarantee bounded regret by in Figure 6. The dashed line is the plot of the OTE-MAB algorithm multiplied by two which is close to the one related to the FTE-MAB algorithm for two-armed bandits. This observation supports our theoretical results.
Taking a closer look at example 1, we note that the regret defined in (10) can generally be formulated as
(15) | ||||
Evaluating the regret from the above equation yields the same regret as the one generated by simulation for the OTE-MAB algorithm, which is plotted in Figure 1.
In this paper, the experimentation is assumed to have zero cost, which is often a valid assumption. However, if experimentation is time-consuming, there is a cost to postponing the exploitation of the best identified arm. For example, with more experimentation, a patient receives medication with a delay or an investor keeps his/her money on hold with zero interest, both of which incur costs. Let such a cost be formulated by an increasing function , where is the incurred cost of experiments. Then, a trade-off emerges between more exploration for higher accuracy of best-arm identification and a lower incurred cost of experimentation. Such a trade-off can be formalized by solving
(16) |
where is the cost-regret trade-off and is calculated by Equation (15) based on an estimation of which is updated after each experiment. Figure 7 plots under example 1 for , and the real value of . The rigorous analysis of (16) is postponed for future work [36] and is beyond the scope of this paper.
VI Conclusion and Future Work
The focus of this work is on application domains, such as personalized health-care and one-time investment, where an experimentation phase of pure arm exploration is followed by a given finite number of exploitations of the best identified arm. We show through an example that the arm with the maximum expected reward does not necessarily maximize the probability of receiving the maximum reward. We present the OTE-MAB and FTE-MAB algorithms, whose goal is to select the arm that maximizes the probability of receiving the maximum reward. We define a new notion of regret for our problem setup and derive an upper bound on the minimum number of experiments needed to guarantee an upper bound on regret. The cost of experimentation is assumed to be negligible in this paper; if this assumption is violated in an application domain, studying the cost-regret trade-off in its various deterministic and stochastic versions is promising future work.
References
- [1] J. Vermorel and M. Mohri, “Multi-armed bandit algorithms and empirical evaluation,” in European conference on machine learning. Springer, 2005, pp. 437–448.
- [2] H. Robbins, “Some aspects of the sequential design of experiments,” Bulletin of the American Mathematical Society, vol. 58, no. 5, pp. 527–535, 1952.
- [3] D. Bergemann and U. Hege, “Dynamic venture capital financing, learning and moral hazard,” Journal of Banking and Finance, vol. 22, no. 6-8, pp. 703–735, 1998.
- [4] ——, “The financing of innovation: Learning and stopping,” RAND Journal of Economics, pp. 719–752, 2005.
- [5] D.-S. Zois, “Sequential decision-making in healthcare iot: Real-time health monitoring, treatments and interventions,” in 2016 IEEE 3rd World Forum on Internet of Things (WF-IoT). IEEE, 2016, pp. 24–29.
- [6] N. Musavi, D. Onural, K. Gunes, and Y. Yildiz, “Unmanned aircraft systems airspace integration: A game theoretical framework for concept evaluations,” Journal of Guidance, Control, and Dynamics, pp. 96–109, 2016.
- [7] N. Musavi, K. B. Tekelioğlu, Y. Yildiz, K. Gunes, and D. Onural, “A game theoretical modeling and simulation framework for the integration of unmanned aircraft systems in to the national airspace,” in AIAA Infotech@ Aerospace, 2016, p. 1001.
- [8] O. Avner and S. Mannor, “Multi-user lax communications: a multi-armed bandit approach,” in IEEE INFOCOM 2016-The 35th Annual IEEE International Conference on Computer Communications. IEEE, 2016, pp. 1–9.
- [9] A. Yekkehkhany and R. Nagi, “Blind gb-pandas: A blind throughput-optimal load balancing algorithm for affinity scheduling,” arXiv preprint arXiv:1901.04047, 2019.
- [10] Q. Xie, A. Yekkehkhany, and Y. Lu, “Scheduling with multi-level data locality: Throughput and heavy-traffic optimality,” in Computer Communications, IEEE INFOCOM 2016-The 35th Annual IEEE International Conference on. IEEE, 2016, pp. 1–9.
- [11] A. Yekkehkhany, “Near data scheduling for data centers with multi levels of data locality,” (Dissertation, University of Illinois at Urbana-Champaign).
- [12] A. Yekkehkhany, A. Hojjati, and M. H. Hajiesmaili, “Gb-pandas: Throughput and heavy-traffic optimality analysis for affinity scheduling,” ACM SIGMETRICS Performance Evaluation Review, vol. 45, no. 2, pp. 2–14, 2018.
- [13] N. Galichet, M. Sebag, and O. Teytaud, “Exploration vs exploitation vs safety: Risk-aware multi-armed bandits,” in Asian Conference on Machine Learning, 2013, pp. 245–260.
- [14] S. Maghsudi and E. Hossain, “Distributed user association in energy harvesting dense small cell networks: A mean-field multi-armed bandit approach,” IEEE Access, vol. 5, pp. 3513–3523, 2017.
- [15] L. X. Bui, R. Johari, and S. Mannor, “Committing bandits,” in Advances in Neural Information Processing Systems, 2011, pp. 1557–1565.
- [16] N. V. Chawla and D. A. Davis, “Bringing big data to personalized healthcare: a patient-centered framework,” Journal of general internal medicine, vol. 28, no. 3, pp. 660–665, 2013.
- [17] D. E. Pritchard, F. Moeckel, M. S. Villa, L. T. Housman, C. A. McCarty, and H. L. McLeod, “Strategies for integrating personalized medicine into healthcare practice,” Personalized medicine, vol. 14, no. 2, pp. 141–152, 2017.
- [18] K. Priyanka and N. Kulennavar, “A survey on big data analytics in health care,” International Journal of Computer Science and Information Technologies, vol. 5, no. 4, pp. 5865–5868, 2014.
- [19] E. Abrahams, G. S. Ginsburg, and M. Silver, “The personalized medicine coalition,” American Journal of Pharmacogenomics, vol. 5, no. 6, pp. 345–355, 2005.
- [20] A. Garivier, T. Lattimore, and E. Kaufmann, “On explore-then-commit strategies,” in Advances in Neural Information Processing Systems, 2016, pp. 784–792.
- [21] A. Garivier, P. Ménard, and G. Stoltz, “Explore first, exploit next: The true shape of regret in bandit problems,” Mathematics of Operations Research, 2018.
- [22] L. Prashanth, “Cs6046: Multi-armed bandits,” 2018.
- [23] A. Sani, A. Lazaric, and R. Munos, “Risk-aversion in multi-armed bandits,” in Advances in Neural Information Processing Systems, 2012, pp. 3275–3283.
- [24] S. Vakili and Q. Zhao, “Risk-averse multi-armed bandit problems under mean-variance measure,” IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 6, pp. 1093–1111, 2016.
- [25] D. Liau, E. Price, Z. Song, and G. Yang, “Stochastic multi-armed bandits in constant space,” arXiv preprint arXiv:1712.09007, 2017.
- [26] V. Perchet, P. Rigollet, S. Chassang, E. Snowberg et al., “Batched bandit problems,” The Annals of Statistics, vol. 44, no. 2, pp. 660–681, 2016.
- [27] H. M. Markowitz, “Portfolio selection,” The Journal of Finance, vol. 7, no. 1, pp. 77–91, 1952.
- [28] S. Vakili and Q. Zhao, “Mean-variance and value at risk in multi-armed bandit problems,” in 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2015, pp. 1330–1335.
- [29] J. Y. Yu and E. Nikolova, “Sample complexity of risk-averse bandit-arm selection,” in Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
- [30] S. Vakili and Q. Zhao, “Risk-averse online learning under mean-variance measures,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 1911–1915.
- [31] J. Xu, W. B. Haskell, and Z. Ye, “Index-based policy for risk-averse multi-armed bandit,” arXiv preprint arXiv:1809.05385, 2018.
- [32] N. Galichet, “Contributions to multi-armed bandits: Risk-awareness and sub-sampling for linear contextual bandits,” Ph.D. dissertation, Université Paris Sud-Paris XI, 2015.
- [33] A. Cassel, S. Mannor, and A. Zeevi, “A general approach to multi-armed bandits under risk criteria,” arXiv preprint arXiv:1806.01380, 2018.
- [34] R. K. Kolla, K. Jagannathan et al., “Risk-aware multi-armed bandits using conditional value-at-risk,” arXiv preprint arXiv:1901.00997, 2019.
- [35] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machine learning, vol. 47, no. 2-3, pp. 235–256, 2002.
- [36] A. Yekkehkhany, E. Arian, R. Nagi, and I. Shomorony, “A cost-based analysis for risk-averse explore-then-commit finite-time bandits.”
Proof of Theorem 2: Consider the Bernoulli random variables and their unknown means for . Given independent observations from each of the independent or dependent arms in pure exploration, there are exactly independent samples for estimation of . By the same reasoning as in the proof of Theorem 1, the confidence interval for estimating based on Equation (12) or (13) with confidence level has the property that
(17) |
for all .
In order to find a bound on regret, defined in Definition 1 as , note that
(18) | ||||
where is true if . By using the union bound and Equation (17), the probability of the right-hand side of the above equation can be bounded as follows, which results in the following bound on regret:
(19) |
The above upper bound on regret is derived under the condition that , which by using and simple algebraic calculations is equivalent to .