Importance Weighting Without Importance Weights:
An Efficient Algorithm for Combinatorial Semi-Bandits
Abstract
We propose a sample-efficient alternative to importance weighting for situations where one only has sample access to the probability distribution that generates the observations. Our new method, called Geometric Resampling (GR), is described and analyzed in the context of online combinatorial optimization under semi-bandit feedback, where a learner sequentially selects its actions from a combinatorial decision set so as to minimize its cumulative loss. In particular, we show that the well-known Follow-the-Perturbed-Leader (FPL) prediction method coupled with Geometric Resampling yields the first computationally efficient reduction from offline to online optimization in this setting. We provide a thorough theoretical analysis of the resulting algorithm, showing that its performance is on par with that of previous, inefficient solutions. Our main contribution is showing that, despite the relatively large variance induced by the GR procedure, our performance guarantees hold with high probability rather than only in expectation. As a side result, we also improve the best known regret bounds for FPL in online combinatorial optimization with full feedback, closing the perceived performance gap between FPL and exponential weights in this setting.

A preliminary version of this paper was published as Neu and Bartók (2013). Parts of this work were completed while Gergely Neu was with the SequeL team at INRIA Lille – Nord Europe, France and Gábor Bartók was with the Department of Computer Science at ETH Zürich.
Gergely Neu
Universitat Pompeu Fabra
Roc Boronat 138, 08018, Barcelona, Spain

Gábor Bartók bartok@google.com
Google Zürich
Brandschenkestrasse 100, 8002, Zürich, Switzerland
Editor: Manfred Warmuth
Keywords: online learning, combinatorial optimization, bandit problems, semi-bandit feedback, follow the perturbed leader, importance weighting
1 Introduction
Importance weighting is a crucially important tool used in many areas of machine learning, and specifically in online learning with partial feedback. While most work assumes that importance weights are readily available or can be computed with little effort at runtime, this is not the case in many practical settings, even when one has cheap sample access to the distribution generating the observations. Among other cases, such situations may arise when observations are generated by complex hierarchical sampling schemes, probabilistic programs, or, more generally, black-box generative models. In this paper, we propose a simple and efficient sampling scheme called Geometric Resampling (GR) to compute reliable estimates of importance weights using only sample access.
Our main motivation is studying a specific online learning algorithm whose practical applicability in partial-feedback settings had long been hindered by the problem outlined above. Specifically, we consider the well-known Follow-the-Perturbed-Leader (FPL) prediction method, which maintains implicit sampling distributions that usually cannot be expressed in closed form. In this paper, we endow FPL with our Geometric Resampling scheme to construct the first known computationally efficient reduction from offline to online combinatorial optimization under an important partial-information scheme known as semi-bandit feedback. In the rest of this section, we describe our precise setting, present related work, and outline our main results.
1.1 Online Combinatorial Optimization
We consider a special case of online linear optimization known as online combinatorial optimization (see Figure 1). In every round t=1,2,\dots,T of this sequential decision problem, the learner chooses an action \bm{V}_{t} from the finite action set \mathcal{S}\subseteq\left\{0,1\right\}^{d}, where \left\|\bm{v}\right\|_{1}\leq m holds for all \bm{v}\in\mathcal{S}. At the same time, the environment fixes a loss vector \bm{\ell}_{t}\in[0,1]^{d} and the learner suffers loss \bm{V}_{t}^{\mathsf{\scriptscriptstyle T}}\bm{\ell}_{t}. The goal of the learner is to minimize the cumulative loss \sum_{t=1}^{T}\bm{V}_{t}^{\mathsf{\scriptscriptstyle T}}\bm{\ell}_{t}. As usual in the literature on online optimization (Cesa-Bianchi and Lugosi, 2006), we measure the performance of the learner in terms of the regret, defined as
R_{T}=\max_{\bm{v}\in\mathcal{S}}\sum_{t=1}^{T}\left(\bm{V}_{t}-\bm{v}\right)^{\mathsf{\scriptscriptstyle T}}\bm{\ell}_{t}=\sum_{t=1}^{T}\bm{V}_{t}^{\mathsf{\scriptscriptstyle T}}\bm{\ell}_{t}-\min_{\bm{v}\in\mathcal{S}}\sum_{t=1}^{T}\bm{v}^{\mathsf{\scriptscriptstyle T}}\bm{\ell}_{t}\,,  (1)
that is, the gap between the total loss of the learning algorithm and the best fixed decision in hindsight. In the current paper, we focus on the case of non-oblivious (or adaptive) environments, where we allow the loss vector \bm{\ell}_{t} to depend on the previous decisions \bm{V}_{1},\dots,\bm{V}_{t-1} in an arbitrary fashion. Since it is well-known that no deterministic algorithm can achieve sublinear regret under such weak assumptions, we will consider learning algorithms that choose their decisions in a randomized way. For such learners, another performance measure that we will study is the expected regret defined as
\widehat{R}_{T}=\max_{\bm{v}\in\mathcal{S}}\sum_{t=1}^{T}\mathbb{E}\left[\left(\bm{V}_{t}-\bm{v}\right)^{\mathsf{\scriptscriptstyle T}}\bm{\ell}_{t}\right]=\mathbb{E}\left[\sum_{t=1}^{T}\bm{V}_{t}^{\mathsf{\scriptscriptstyle T}}\bm{\ell}_{t}\right]-\min_{\bm{v}\in\mathcal{S}}\mathbb{E}\left[\sum_{t=1}^{T}\bm{v}^{\mathsf{\scriptscriptstyle T}}\bm{\ell}_{t}\right].
The framework described above is general enough to accommodate a number of interesting problem instances such as path planning, ranking and matching problems, and finding minimum-weight spanning trees and cut sets. Accordingly, different versions of this general learning problem have drawn considerable attention in the past few years. These versions differ in the amount of information made available to the learner after each round t. In the simplest setting, called the full-information setting, it is assumed that the learner gets to observe the loss vector \bm{\ell}_{t} regardless of the choice of \bm{V}_{t}. As this assumption does not hold for many practical applications, it is more interesting to study the problem under partial-information constraints, meaning that the learner only gets some limited feedback based on its own decision. In the current paper, we focus on a more realistic partial-information scheme known as semi-bandit feedback (Audibert, Bubeck, and Lugosi, 2014), where the learner only observes the components \ell_{t,i} of the loss vector for which V_{t,i}=1, that is, the losses associated with the components selected by the learner.^{1}

^{1} Here, V_{t,i} and \ell_{t,i} are the i^{\mathrm{th}} components of the vectors \bm{V}_{t} and \bm{\ell}_{t}, respectively.
1.2 Related Work
The most well-known instance of our problem is the multi-armed bandit problem considered in the seminal paper of Auer, Cesa-Bianchi, Freund, and Schapire (2002): in each round of this problem, the learner has to select one of N arms and minimize regret against the best fixed arm while only observing the losses of the chosen arms. In our framework, this setting corresponds to setting d=N and m=1. Among other contributions concerning this problem, Auer et al. propose an algorithm called Exp3 (Exploration and Exploitation using Exponential weights) based on constructing loss estimates \widehat{\ell}_{t,i} for each component of the loss vector and playing arm i with probability proportional to \exp(-\eta\sum_{s=1}^{t-1}\widehat{\ell}_{s,i}) at time t, where \eta>0 is a parameter of the algorithm, usually called the learning rate.^{2} This algorithm is essentially a variant of the Exponentially Weighted Average (EWA) forecaster (a variant of the weighted majority algorithm of Littlestone and Warmuth, 1994, and the aggregating strategies of Vovk, 1990, also known as Hedge after Freund and Schapire, 1997). Besides proving that the expected regret of Exp3 is O\bigl{(}\sqrt{NT\log N}\bigr{)}, Auer et al. also provide a general lower bound of \Omega\bigl{(}\sqrt{NT}\bigr{)} on the regret of any learning algorithm on this particular problem. This lower bound was later matched by a variant of the Implicitly Normalized Forecaster (INF) of Audibert and Bubeck (2010) by using the same loss estimates in a more refined way. Audibert and Bubeck also show bounds of O\bigl{(}\sqrt{NT/\log N}\log(N/\delta)\bigr{)} on the regret that hold with probability at least 1-\delta, uniformly for any \delta>0.

^{2} In fact, Auer et al. mix the resulting distribution with a uniform distribution over the arms with probability \eta N. However, this modification is not needed when one is concerned with the total expected regret; see, e.g., Bubeck and Cesa-Bianchi (2012, Section 3.1).
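To make the exponential-weights sampling distribution described above concrete, here is a minimal sketch. The function name and the max-shift for numerical stability are our own, and the uniform mixing mentioned in the footnote is omitted:

```python
import numpy as np

def exp3_distribution(est_cum_losses, eta):
    """Sampling distribution with p_i proportional to
    exp(-eta * estimated cumulative loss of arm i).
    Subtracting the minimum before exponentiating avoids underflow
    and leaves the normalized distribution unchanged."""
    w = np.exp(-eta * (est_cum_losses - est_cum_losses.min()))
    return w / w.sum()

# Arms with smaller cumulative estimated loss receive larger probability.
p = exp3_distribution(np.array([3.0, 1.0, 1.0]), eta=0.5)
```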
The most popular example of online learning problems with actual combinatorial structure is the shortest path problem, first considered by Takimoto and Warmuth (2003) in the full-information scheme. The same problem was considered by György, Linder, Lugosi, and Ottucsák (2007), who proposed an algorithm that works with semi-bandit information. Since then, we have come a long way in understanding the “price of information” in online combinatorial optimization; see Audibert, Bubeck, and Lugosi (2014) for a complete overview of results concerning all of the information schemes considered in the current paper. The first algorithm directly targeting general online combinatorial optimization problems is due to Koolen, Warmuth, and Kivinen (2010): their method, named Component Hedge, guarantees an optimal regret of O\bigl{(}m\sqrt{T\log(d/m)}\bigr{)} in the full-information setting. As later shown by Audibert, Bubeck, and Lugosi (2014), this algorithm is an instance of a more general algorithm class known as Online Stochastic Mirror Descent (OSMD). Taking the idea one step further, Audibert, Bubeck, and Lugosi (2014) also show that OSMD-based methods can be used to prove expected regret bounds of O\bigl{(}\sqrt{mdT}\bigr{)} for the semi-bandit setting, which is also shown to coincide with the minimax regret in this setting. For completeness, we note that the EWA forecaster is known to attain an expected regret of O\bigl{(}m^{3/2}\sqrt{T\log(d/m)}\bigr{)} in the full-information case and O\bigl{(}m\sqrt{dT\log(d/m)}\bigr{)} in the semi-bandit case.
While the results outlined above might suggest that there is no work left to be done in the full-information and semi-bandit schemes, we get a different picture if we restrict our attention to computationally efficient algorithms. First, note that methods based on exponential weighting of each decision vector can only be efficiently implemented for a handful of decision sets \mathcal{S}; see Koolen et al. (2010) and Cesa-Bianchi and Lugosi (2012) for some examples. Furthermore, as noted by Audibert et al. (2014), OSMD-type methods can be efficiently implemented by convex programming if the convex hull of the decision set can be described by a polynomial number of constraints. Details of such an efficient implementation are worked out by Suehiro, Hatano, Kijima, Takimoto, and Nagano (2012), whose algorithm runs in O(d^{6}) time, which can still be prohibitive in practical applications. While Koolen et al. (2010) list some further examples where OSMD can be implemented efficiently, we conclude that there is no general efficient algorithm with near-optimal performance guarantees for learning in combinatorial semi-bandits.
The Follow-the-Perturbed-Leader (FPL) prediction method (first proposed by Hannan, 1957, and later rediscovered by Kalai and Vempala, 2005) offers a computationally efficient solution to the online combinatorial optimization problem, given that the static combinatorial optimization problem \min_{\bm{v}\in\mathcal{S}}\bm{v}^{\mathsf{\scriptscriptstyle T}}\bm{\ell} admits a computationally efficient solution for any \bm{\ell}\in\mathbb{R}^{d}. The idea underlying FPL is very simple: in every round t, the learner draws some random perturbations \bm{Z}_{t}\in\mathbb{R}^{d} and selects the action that minimizes the perturbed total losses:
\bm{V}_{t}=\mathop{\rm arg\,min}_{\bm{v}\in\mathcal{S}}\bm{v}^{\mathsf{\scriptscriptstyle T}}\left(\sum_{s=1}^{t-1}\bm{\ell}_{s}-\bm{Z}_{t}\right).
Despite its conceptual simplicity and computational efficiency, FPL has been relatively overlooked until very recently, for two main reasons:

The best known bound for FPL in the full information setting is O\bigl{(}m\sqrt{dT}\bigr{)}, which is worse than the bounds for both EWA and OSMD that scale only logarithmically with d.

Considering bandit information, no efficient FPL-style algorithm is known to achieve a regret of O\bigl{(}\sqrt{T}\bigr{)}. On one hand, it is relatively straightforward to prove O\bigl{(}T^{2/3}\bigr{)} bounds on the expected regret of an efficient FPL variant (see, e.g., Awerbuch and Kleinberg, 2004, and McMahan and Blum, 2004). Poland (2005) proved bounds of O\bigl{(}\sqrt{NT\log N}\bigr{)} in the N-armed bandit setting; however, the proposed algorithm requires O\bigl{(}T^{2}\bigr{)} numerical operations per round.
The main obstacle to constructing a computationally efficient FPL variant that works with partial information is precisely the lack of closed-form expressions for the importance weights. In the current paper, we address the above two issues and show that an efficient FPL-based algorithm using independent exponentially distributed perturbations can achieve performance guarantees as good as those of EWA in online combinatorial optimization.
Our work contributes to a new wave of positive results concerning FPL. Besides the reservations towards FPL mentioned above, the reputation of FPL has also suffered from the fact that the nature of the regularization arising from perturbations is not as well understood as the explicit regularization schemes underlying OSMD or EWA. Very recently, Abernethy et al. (2014) have shown that FPL implements a form of strongly convex regularization over the convex hull of the decision space. Furthermore, Rakhlin et al. (2012) showed that FPL run with a specific perturbation scheme can be regarded as a relaxation of the minimax algorithm. Another recently initiated line of work shows that intuitive parameter-free variants of FPL can achieve excellent performance in full-information settings (Devroye et al., 2013; Van Erven et al., 2014).
1.3 Our Results
In this paper, we propose a loss-estimation scheme called Geometric Resampling to efficiently compute importance weights for the observed components of the loss vector. Building on this technique and the FPL principle, we construct an efficient algorithm for regret minimization under semi-bandit feedback. Besides this contribution, our techniques also enable us to improve the best known regret bounds for FPL in the full-information case. We prove the following results concerning variants of our algorithm:

a bound of O\bigl{(}m\sqrt{dT\log(d/m)}\bigr{)} on the expected regret under semi-bandit feedback (Theorem 1),

a bound of O\bigl{(}m\sqrt{dT\log(d/m)}+\sqrt{mdT}\log(1/\delta)\bigr{)} on the regret that holds with probability at least 1-\delta, uniformly for all \delta\in(0,1), under semi-bandit feedback (Theorem 2),

a bound of O\bigl{(}m^{3/2}\sqrt{T\log(d/m)}\bigr{)} on the expected regret under full information (Theorem 13).
We also show that both of our semi-bandit algorithms access the optimization oracle O(dT) times over T rounds with high probability, increasing the running time only by a factor of d compared to the full-information variant. Notably, our results close the gaps between the performance bounds of FPL and EWA under both full information and semi-bandit feedback. Table 1 puts our newly proven regret bounds into context.
Table 1: Regret bounds in online combinatorial optimization; the bounds proved in this paper are shown in bold.

  | FPL | EWA | OSMD
Full info regret bound | \mathbf{m^{3/2}\sqrt{T\log\frac{d}{m}}} | m^{3/2}\sqrt{T\log\frac{d}{m}} | m\sqrt{T\log\frac{d}{m}}
Semi-bandit regret bound | \mathbf{m\sqrt{dT\log\frac{d}{m}}} | m\sqrt{dT\log\frac{d}{m}} | \sqrt{mdT}
Computationally efficient? | always | sometimes | sometimes
2 Geometric Resampling
In this section, we introduce the main idea underlying Geometric Resampling in the specific context of N-armed bandits, where d=N, m=1, and the learner has access to the basis vectors \left\{\bm{e}_{i}\right\}_{i=1}^{d} as its decision set \mathcal{S}. In this setting, components of the decision vector are referred to as arms. For ease of notation, define I_{t} as the unique arm such that V_{t,I_{t}}=1 and \mathcal{F}_{t-1} as the sigma-algebra induced by the learner’s actions and observations up to the end of round t-1. Using this notation, we define p_{t,i}=\mathbb{P}\left[\left.I_{t}=i\right|\mathcal{F}_{t-1}\right].
Most bandit algorithms rely on feeding some loss estimates to a sequential prediction algorithm. It is commonplace to consider importanceweighted loss estimates of the form
\widehat{\ell}^{*}_{t,i}=\frac{\mathbbm{1}_{\left\{I_{t}=i\right\}}}{p_{t,i}}\ell_{t,i}  (2)
for all t,i such that p_{t,i}>0. It is straightforward to show that \widehat{\ell}^{*}_{t,i} is an unbiased estimate of the loss \ell_{t,i} for all such t,i. Otherwise, when p_{t,i}=0, we set \widehat{\ell}_{t,i}^{*}=0, which gives \mathbb{E}\left[\left.\widehat{\ell}_{t,i}^{*}\right|\mathcal{F}_{t-1}\right]=0\leq\ell_{t,i}.
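In code, the importance-weighted estimator (2) is a one-liner; the sketch below (with hypothetical names) makes explicit that the estimate is nonzero only at the played arm:

```python
import numpy as np

def iw_estimate(observed_loss, I_t, p_t):
    """Estimate (2): ell_hat[i] = 1{I_t = i} * observed_loss / p_{t,i}.
    Unbiased, since arm I_t is played with probability p_{t,I_t}."""
    ell_hat = np.zeros(len(p_t))
    ell_hat[I_t] = observed_loss / p_t[I_t]
    return ell_hat

# Playing arm 1 (probability 0.5) and observing loss 0.5 yields [0., 1., 0.].
ell_hat = iw_estimate(0.5, 1, np.array([0.25, 0.5, 0.25]))
```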
To our knowledge, all existing bandit algorithms operating in the non-stochastic setting utilize some version of the importance-weighted loss estimates described above. This is a very natural choice for algorithms that operate by first computing the probabilities p_{t,i} and then sampling I_{t} from the resulting distribution. While many algorithms fall into this class (including the Exp3 algorithm of Auer et al. (2002), the Green algorithm of Allenberg et al. (2006), and the INF algorithm of Audibert and Bubeck (2010)), one can think of many other algorithms where the distribution \bm{p}_{t} is specified implicitly and thus importance weights are not readily available. Arguably, FPL is the most important online prediction algorithm that operates with implicit distributions, which are notoriously difficult to compute in closed form. To overcome this difficulty, we propose a different loss estimate that can be efficiently computed even when \bm{p}_{t} is not available to the learner.
Our estimation procedure, dubbed Geometric Resampling (GR), is based on the simple observation that, even though p_{t,I_{t}} might not be computable in closed form, one can generate a geometric random variable with expectation 1/p_{t,I_{t}} by repeated sampling from \bm{p}_{t}. Specifically, we propose the following procedure to be executed in round t:
Geometric Resampling for multi-armed bandits:
 1. The learner draws I_{t}\sim\bm{p}_{t}.
 2. For k=1,2,\dots: draw I^{\prime}_{t}(k)\sim\bm{p}_{t}; if I^{\prime}_{t}(k)=I_{t}, break.
 3. Let K_{t}=k.
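The procedure only needs black-box sample access. A minimal sketch, where the `sample_arm` callable is an assumption standing in for a draw from \bm{p}_{t}:

```python
import random

def geometric_resampling(sample_arm, I_t):
    """Redraw from the same distribution that produced I_t until I_t
    appears again; the count K_t is geometric with mean 1/p_{t,I_t}."""
    k = 1
    while sample_arm() != I_t:
        k += 1
    return k

# Empirical check: with p = (0.25, 0.75), E[K_t | I_t = 1] = 1/0.75.
random.seed(0)
sample_arm = lambda: random.choices([0, 1], weights=[0.25, 0.75])[0]
mean_K = sum(geometric_resampling(sample_arm, 1) for _ in range(20000)) / 20000
```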
Observe that K_{t} generated this way is a geometrically distributed random variable given I_{t} and \mathcal{F}_{t-1}. Consequently, we have \mathbb{E}\left[\left.K_{t}\right|\mathcal{F}_{t-1},I_{t}\right]=1/p_{t,I_{t}}. We use this property to construct the estimates
\widehat{\ell}_{t,i}=K_{t}\mathbbm{1}_{\left\{I_{t}=i\right\}}\ell_{t,i}  (3) 
for all arms i. We can easily show that the above estimate is unbiased whenever p_{t,i}>0:
\begin{split}\mathbb{E}\left[\left.\widehat{\ell}_{t,i}\right|\mathcal{F}_{t-1}\right]&=\sum_{j}p_{t,j}\mathbb{E}\left[\left.\widehat{\ell}_{t,i}\right|\mathcal{F}_{t-1},I_{t}=j\right]\\&=p_{t,i}\mathbb{E}\left[\left.\ell_{t,i}K_{t}\right|\mathcal{F}_{t-1},I_{t}=i\right]\\&=p_{t,i}\ell_{t,i}\mathbb{E}\left[\left.K_{t}\right|\mathcal{F}_{t-1},I_{t}=i\right]\\&=\ell_{t,i}.\end{split}
Notice that the above procedure produces \widehat{\ell}_{t,i}=0 almost surely whenever p_{t,i}=0, giving \mathbb{E}\left[\left.\widehat{\ell}_{t,i}\right|\mathcal{F}_{t-1}\right]=0 for such t,i.
One practical concern with the above sampling procedure is that its worst-case running time is unbounded: while the expected number of samples K_{t} needed is clearly N, the actual number might be much larger. In the next section, we offer a remedy for this problem and generalize the approach to the combinatorial semi-bandit case.
3 An Efficient Algorithm for Combinatorial SemiBandits
In this section, we present our main result: an efficient reduction from offline to online combinatorial optimization under semi-bandit feedback. The most critical element of our technique is the extension of the Geometric Resampling idea to combinatorial action sets. To define the procedure, assume that we are running a randomized algorithm mapping histories to probability distributions over the action set \mathcal{S}: letting \mathcal{F}_{t-1} denote the sigma-algebra induced by the history of interaction between the learner and the environment, the algorithm picks action \bm{v}\in\mathcal{S} with probability p_{t}(\bm{v})=\mathbb{P}\left[\left.\bm{V}_{t}=\bm{v}\right|\mathcal{F}_{t-1}\right]. Also introducing q_{t,i}=\mathbb{E}\left[\left.V_{t,i}\right|\mathcal{F}_{t-1}\right], we can define the counterpart of the standard importance-weighted loss estimates of Equation (2) as the vector \widehat{\bm{\ell}}^{*}_{t} with components
\widehat{\ell}_{t,i}^{*}=\frac{V_{t,i}}{q_{t,i}}\ell_{t,i}.  (4) 
Again, the problem with these estimates is that for many algorithms of practical interest, the importance weights q_{t,i} cannot be computed in closed form. We now extend the Geometric Resampling procedure defined in the previous section to estimate the importance weights in an efficient manner. One adjustment we make to the procedure presented in the previous section is capping the number of samples at some finite M>0. While this capping obviously introduces some bias, we will show later that for appropriate values of M, this bias does not hurt the performance of the overall learning algorithm too much. Thus, we define the Geometric Resampling procedure for combinatorial semi-bandits as follows:
Geometric Resampling for combinatorial semi-bandits:
 1. The learner draws \bm{V}_{t}\sim\bm{p}_{t}.
 2. For k=1,2,\dots,M, draw \bm{V}^{\prime}_{t}(k)\sim\bm{p}_{t}.
 3. For i=1,2,\dots,d, let K_{t,i}=\min\bigl{(}\left\{k:V_{t,i}^{\prime}(k)=1\right\}\cup\left\{M\right\}\bigr{)}.
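A minimal sketch of this capped procedure, again assuming only a black-box `sample_action` callable (a hypothetical name) that draws fresh copies from \bm{p}_{t}:

```python
import numpy as np

def gr_counts(V_t, sample_action, M):
    """Compute K_{t,i} = min({k : V'_{t,i}(k) = 1} union {M}) from at most
    M i.i.d. copies V'_t(k). Stops early once every played component has
    been matched; the remaining entries stay at the cap M, which is
    harmless because estimate (5) multiplies them by V_{t,i} = 0."""
    d = len(V_t)
    K = np.full(d, M)
    waiting = {i for i in range(d) if V_t[i] == 1}
    for k in range(1, M + 1):
        if not waiting:
            break
        V_prime = sample_action()
        for i in [i for i in waiting if V_prime[i] == 1]:
            K[i] = k
            waiting.discard(i)
    return K

# Deterministic check with a scripted sequence of sample actions:
copies = iter([[0, 1, 0], [1, 0, 0]])
K = gr_counts([1, 1, 0], lambda: next(copies), M=5)
```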
Based on the random variables output by the GR procedure, we construct our loss-estimate vector \widehat{\bm{\ell}}_{t}\in\mathbb{R}^{d} with components
\displaystyle\widehat{\ell}_{t,i}=K_{t,i}V_{t,i}\ell_{t,i}  (5) 
for all i=1,2,\dots,d. Since V_{t,i} is nonzero only for coordinates for which \ell_{t,i} is observed, these estimates are well-defined. It also follows that the sampling procedure can be terminated as soon as, for every i with V_{t,i}=1, there is a copy \bm{V}_{t}^{\prime}(k) such that V_{t,i}^{\prime}(k)=1.
Now everything is ready to define our algorithm: FPL+GR, standing for Follow-the-Perturbed-Leader with Geometric Resampling. Defining \widehat{\bm{L}}_{t}=\sum_{s=1}^{t}\widehat{\bm{\ell}}_{s}, at time step t FPL+GR draws the components of the perturbation vector \bm{Z}_{t} independently from a standard exponential distribution and selects the action^{3}

^{3} By the definition of the perturbation distribution, the minimum is unique almost surely.
\bm{V}_{t}=\mathop{\rm arg\,min}_{\bm{v}\in\mathcal{S}}\bm{v}^{\mathsf{\scriptscriptstyle T}}\left(\eta\widehat{\bm{L}}_{t-1}-\bm{Z}_{t}\right),  (6)
where \eta>0 is a parameter of the algorithm. As mentioned earlier, the distribution \bm{p}_{t}, while implicitly specified by \bm{Z}_{t} and the estimated cumulative losses \widehat{\bm{L}}_{t-1}, usually cannot be expressed in closed form for FPL.^{4} However, sampling the actions \bm{V}_{t}^{\prime}(\cdot) can be carried out by drawing additional perturbation vectors \bm{Z}_{t}^{\prime}(\cdot) independently from the same distribution as \bm{Z}_{t} and then solving a linear optimization task. We emphasize that these additional actions are never actually played by the algorithm; they are only needed to construct the loss estimates. The power of FPL+GR is that, unlike other algorithms for combinatorial semi-bandits, its implementation only requires access to a linear optimization oracle over \mathcal{S}. We point the reader to Section 3.2 for a more detailed discussion of the running time of FPL+GR. Pseudocode for FPL+GR is shown as Algorithm 1.

^{4} One notable exception is when the perturbations are drawn independently from standard Gumbel distributions and the decision set is the d-dimensional simplex: in this case, FPL is known to be equivalent to EWA; see, e.g., Abernethy et al. (2014) for further discussion.
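One round of FPL+GR can be sketched as follows, assuming access to a linear optimization oracle over \mathcal{S}; the function and argument names are ours, and the oracle shown in the usage example implements the d-armed bandit special case \mathcal{S}=\{\bm{e}_{1},\dots,\bm{e}_{d}\}:

```python
import numpy as np

def fpl_gr_round(oracle, L_hat, eta, M, loss_vec, rng):
    """One round of FPL+GR (sketch). `oracle(c)` returns
    argmin_{v in S} v.c as a 0/1 vector; only the components of
    `loss_vec` with V_{t,i} = 1 are used, as in semi-bandit feedback."""
    d = len(L_hat)
    # Rule (6): perturb the cumulative estimates and call the oracle.
    draw = lambda: oracle(eta * L_hat - rng.exponential(size=d))
    V = draw()
    K = np.full(d, M)                      # capped resampling counts
    for k in range(1, M + 1):
        unmatched = (K == M) & (V == 1)
        if not unmatched.any():            # early termination
            break
        V_prime = draw()
        K[unmatched & (V_prime == 1)] = k
    return V, L_hat + K * V * loss_vec     # add estimate (5)

rng = np.random.default_rng(0)
def simplex_oracle(c):
    v = np.zeros(len(c), dtype=int)
    v[int(np.argmin(c))] = 1
    return v

V, L_hat = fpl_gr_round(simplex_oracle, np.zeros(3), eta=0.1, M=10,
                        loss_vec=np.array([0.2, 0.5, 0.9]), rng=rng)
```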
As we will show shortly, FPL+GR as defined above comes with strong performance guarantees that hold in expectation. One can think of several possible ways to robustify FPL+GR so that it provides bounds that hold with high probability. One possible path is to follow Auer et al. (2002) and define the loss-estimate vector \widetilde{\bm{\ell}}_{t}^{*} with components
\widetilde{\ell}_{t,i}^{*}=\widehat{\ell}_{t,i}-\frac{\beta}{q_{t,i}}
for some \beta>0. The obvious problem with this definition is that it requires perfect knowledge of the importance weights q_{t,i} for all i. While it is possible to extend the Geometric Resampling technique developed in the previous sections to construct a reliable proxy for the above loss estimate, there are several downsides to this approach. First, observe that one would need to obtain estimates of 1/q_{t,i} for every single i, even for those with V_{t,i}=0. Due to this necessity, there is no hope of terminating the sampling procedure in reasonable time. Second, reliable estimation requires multiple samples of K_{t,i}, where the sample size has to depend explicitly on the desired confidence level.
Thus, we follow a different path. Motivated by the work of Audibert and Bubeck (2010), we propose to use a loss-estimate vector \widetilde{\bm{\ell}}_{t} with components of the form
\widetilde{\ell}_{t,i}=\frac{1}{\beta}\log\left(1+\beta\widehat{\ell}_{t,i}\right)  (7) 
with an appropriately chosen \beta>0. Then, defining \widetilde{\bm{L}}_{t-1}=\sum_{s=1}^{t-1}\widetilde{\bm{\ell}}_{s}, we propose a variant of FPL+GR that simply replaces \widehat{\bm{L}}_{t-1} by \widetilde{\bm{L}}_{t-1} in the rule (6) for choosing \bm{V}_{t}. We refer to this variant of FPL+GR as FPL+GR.P. In the next section, we provide performance guarantees for both algorithms.
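For intuition, the transformed estimate (7) is a smooth truncation: since \log(1+x)\leq x, it never exceeds \widehat{\ell}_{t,i}, and it recovers \widehat{\ell}_{t,i} as \beta\to 0. A minimal numerical sketch:

```python
import numpy as np

def log_smoothed_estimate(ell_hat, beta):
    """Estimate (7): log(1 + beta * ell_hat) / beta. Never exceeds
    ell_hat, and the gap shrinks as beta -> 0; the implicit damping of
    large values of ell_hat is what enables high-probability bounds."""
    return np.log1p(beta * ell_hat) / beta

# Small estimates are nearly unchanged; large ones are damped.
x = np.array([0.0, 1.0, 50.0])
y = log_smoothed_estimate(x, beta=0.1)
```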
3.1 Performance Guarantees
Now we are ready to state our main results. Proofs will be presented in Section 4. First, we present a performance guarantee for FPL+GR in terms of the expected regret:
Theorem 1
The expected regret of FPL+GR satisfies
\widehat{R}_{T}\leq\frac{m\left(\log\left(d/m\right)+1\right)}{\eta}+2\eta mdT+\frac{dT}{eM}
under semi-bandit information. In particular, with
\eta=\sqrt{\frac{\log(d/m)+1}{2dT}}\qquad\mbox{and}\qquad M=\left\lceil\frac{\sqrt{dT}}{em\sqrt{2\left(\log(d/m)+1\right)}}\right\rceil,
the expected regret of FPL+GR is bounded as
\widehat{R}_{T}\leq 3m\sqrt{2dT\left(\log\frac{d}{m}+1\right)}. 
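To make the tuning concrete, the following sketch evaluates the parameter choices and the resulting bound of Theorem 1 for given d, m, and T; this is an illustrative calculation only, not part of the algorithm:

```python
import math

def theorem1_tuning(d, m, T):
    """Tuned eta and M, and the regret bound, from Theorem 1."""
    c = math.log(d / m) + 1
    eta = math.sqrt(c / (2 * d * T))
    M = math.ceil(math.sqrt(d * T) / (math.e * m * math.sqrt(2 * c)))
    bound = 3 * m * math.sqrt(2 * d * T * c)
    return eta, M, bound

# Example instance: d = 10 components, m = 2, horizon T = 10000.
eta, M, bound = theorem1_tuning(d=10, m=2, T=10_000)
```

Note that the bound grows only as \sqrt{T}, so for any fixed d and m it is eventually far below the trivial bound of mT.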
Our second main contribution is the following bound on the regret of FPL+GR.P.
Theorem 2
Fix an arbitrary \delta>0. With probability at least 1-\delta, the regret of FPL+GR.P satisfies
\begin{split}R_{T}\leq&\,\frac{m\left(\log(d/m)+1\right)}{\eta}+\eta\left(Mm\sqrt{2T\log\frac{5}{\delta}}+2md\sqrt{T\log\frac{5}{\delta}}+2mdT\right)+\frac{dT}{eM}\\&+\beta\left(M\sqrt{2mT\log\frac{5}{\delta}}+2d\sqrt{T\log\frac{5}{\delta}}+2dT\right)+\frac{m\log(5d/\delta)}{\beta}\\&+m\sqrt{2(e-2)T}\log\frac{5}{\delta}+\sqrt{8T\log\frac{5}{\delta}}+\sqrt{2(e-2)T}.\end{split}
In particular, with
M=\left\lceil\sqrt{\frac{dT}{m}}\right\rceil,\quad\beta=\sqrt{\frac{m}{dT}},\quad\mbox{and}\quad\eta=\sqrt{\frac{\log(d/m)+1}{dT}},
the regret of FPL+GR.P is bounded as
\begin{split}R_{T}\leq&\,3m\sqrt{dT\left(\log\frac{d}{m}+1\right)}+\sqrt{mdT}\left(\log\frac{5d}{\delta}+2\right)+\sqrt{2mT\log\frac{5}{\delta}}\left(\sqrt{\log\frac{d}{m}+1}+1\right)\\&+1.2m\sqrt{T}\log\frac{5}{\delta}+\sqrt{T}\left(\sqrt{8\log\frac{5}{\delta}}+1.2\right)+2\sqrt{d\log\frac{5}{\delta}}\left(m\sqrt{\log\frac{d}{m}+1}+\sqrt{m}\right)\end{split}
with probability at least 1-\delta.
3.2 Running Time
Let us now turn our attention to computational issues. First, we note that the efficiency of FPL-type algorithms crucially depends on the availability of an efficient oracle that solves the static combinatorial optimization problem of finding \mathop{\rm arg\,min}_{\bm{v}\in\mathcal{S}}\bm{v}^{\mathsf{\scriptscriptstyle T}}\bm{\ell}. Computing the running time of the full-information variant of FPL is straightforward: assuming that the oracle computes the solution to the static problem in O(f(\mathcal{S})) time, FPL returns its prediction in O(f(\mathcal{S})+d) time (with the d overhead coming from the time necessary to generate the perturbations). Naturally, our loss-estimation scheme multiplies these computations by the number of samples taken in each round. While terminating the estimation procedure after M samples helps in controlling the running time with high probability, observe that the naïve bound of MT on the number of samples becomes way too large when setting M as suggested by Theorems 1 and 2. The next proposition shows that the amortized running time of Geometric Resampling remains as low as O(d) even for large values of M.
Proposition 3
Let S_{t} denote the number of sample actions taken by GR in round t. Then, \mathbb{E}\left[S_{t}\right]\leq d. Also, for any \delta>0,
\sum_{t=1}^{T}S_{t}\leq(e-1)dT+M\log\frac{1}{\delta}
holds with probability at least 1-\delta.
Proof For proving the first statement, let us fix a time step t and notice that
S_{t}=\max_{j:V_{t,j}=1}K_{t,j}=\max_{j=1,2,\dots,d}V_{t,j}K_{t,j}\leq\sum_{j=1}^{d}V_{t,j}K_{t,j}.
Now, observe that \mathbb{E}\left[\left.K_{t,j}\right|\mathcal{F}_{t-1},V_{t,j}\right]\leq 1/\mathbb{E}\left[\left.V_{t,j}\right|\mathcal{F}_{t-1}\right], which gives \mathbb{E}\left[S_{t}\right]\leq d, thus proving the first statement. For the second part, notice that X_{t}=S_{t}-\mathbb{E}\left[\left.S_{t}\right|\mathcal{F}_{t-1}\right] is a martingale-difference sequence with respect to \left(\mathcal{F}_{t}\right) with X_{t}\leq M and with conditional variance
\begin{split}{\rm Var}\left[\left.X_{t}\right|\mathcal{F}_{t-1}\right]&=\mathbb{E}\left[\left.\left(S_{t}-\mathbb{E}\left[\left.S_{t}\right|\mathcal{F}_{t-1}\right]\right)^{2}\right|\mathcal{F}_{t-1}\right]\leq\mathbb{E}\left[\left.S_{t}^{2}\right|\mathcal{F}_{t-1}\right]\\&=\mathbb{E}\left[\left.\max_{j}\left(V_{t,j}K_{t,j}\right)^{2}\right|\mathcal{F}_{t-1}\right]\leq\mathbb{E}\left[\left.\sum_{j=1}^{d}V_{t,j}K_{t,j}^{2}\right|\mathcal{F}_{t-1}\right]\\&\leq\sum_{j=1}^{d}\min\left\{\frac{2}{q_{t,j}},M\right\}\leq dM,\end{split}
where we used \mathbb{E}\left[\left.K_{t,i}^{2}\right|\mathcal{F}_{t-1}\right]\leq\frac{2-q_{t,i}}{q_{t,i}^{2}}.
Then, the second statement follows from applying a version of Freedman’s inequality due to Beygelzimer et al. (2011) (stated as Lemma 16 in the appendix) with B=M and \Sigma_{T}\leq dMT.
Notice that choosing M=O\bigl{(}\sqrt{dT}\bigr{)} as suggested by Theorems 1 and 2, the above result guarantees that the amortized running time of FPL+GR is O\big{(}(d+\sqrt{d/T})\cdot(f(\mathcal{S})+d)\big{)} with high probability.
4 Analysis
This section presents the proofs of Theorems 1 and 2. For clarity of exposition, we present statements concerning the loss-estimation procedure and the learning algorithm separately: Section 4.1 presents various important properties of the loss estimates produced by Geometric Resampling, while Section 4.2 presents general tools for analyzing Follow-the-Perturbed-Leader methods. Finally, Sections 4.3 and 4.4 put these results together to prove Theorems 1 and 2, respectively.
4.1 Properties of Geometric Resampling
The basic idea underlying Geometric Resampling is replacing the importance weights 1/q_{t,i} by appropriately defined random variables K_{t,i}. As we have seen earlier (Section 2), running GR with M=\infty amounts to sampling each K_{t,i} from a geometric distribution with expectation 1/q_{t,i}, yielding an unbiased loss estimate. In practice, one would want to set M to a finite value to ensure that the running time of the sampling procedure is bounded. Note, however, that early termination of GR introduces a bias in the loss estimates. This section is mainly concerned with the nature of this bias. We emphasize that the statements presented in this section remain valid no matter what randomized algorithm generates the actions \bm{V}_{t}. Our first lemma gives an explicit expression for the expectation of the loss estimates generated by GR.
Lemma 4
For all j and t such that q_{t,j}>0, the loss estimates (5) satisfy
\mathbb{E}\left[\left.\widehat{\ell}_{t,j}\right|\mathcal{F}_{t-1}\right]=\left(1-(1-q_{t,j})^{M}\right)\ell_{t,j}. 
Proof Fix any j,t satisfying the condition of the lemma. Setting q=q_{t,j} for simplicity, we write
\begin{split}\mathbb{E}\left[\left.K_{t,j}\right|\mathcal{F}_{t-1}\right]=&\sum_{k=1}^{\infty}k(1-q)^{k-1}q-\sum_{k=M}^{\infty}(k-M)(1-q)^{k-1}q\\ =&\sum_{k=1}^{\infty}k(1-q)^{k-1}q-(1-q)^{M}\sum_{k=M}^{\infty}(k-M)(1-q)^{k-M-1}q\\ =&\left(1-(1-q)^{M}\right)\sum_{k=1}^{\infty}k(1-q)^{k-1}q=\frac{1-(1-q)^{M}}{q}.\end{split} 
The proof is concluded by combining the above with \mathbb{E}\left[\left.\widehat{\ell}_{t,j}\right|\mathcal{F}_{t-1}\right]=q_{t,j}\ell_{t,j}\mathbb{E}\left[\left.K_{t,j}\right|\mathcal{F}_{t-1}\right].
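As a quick numerical sanity check of Lemma 4 (our own illustration, not part of the proof), one can compare \mathbb{E}\left[\min(G,M)\right] for a geometric random variable G, computed by direct summation, against the closed form (1-(1-q)^{M})/q:

```python
def truncated_geometric_mean(q, M, tail=10_000):
    """E[min(G, M)] for G ~ Geometric(q), by direct summation of
    sum_k min(k, M) * q * (1 - q)**(k - 1), cut off far in the tail."""
    return sum(min(k, M) * q * (1 - q) ** (k - 1) for k in range(1, tail + 1))

q, M = 0.3, 5
closed_form = (1 - (1 - q) ** M) / q  # the expectation computed in Lemma 4
assert abs(truncated_geometric_mean(q, M) - closed_form) < 1e-9
```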
The following lemma shows two important properties of the GR loss estimates (5). Roughly speaking, the first of these properties ensures that any learning algorithm relying on these estimates will be optimistic, in the sense that the loss of any fixed decision will be underestimated in expectation. The second property ensures that the learner will not be overly optimistic concerning its own performance.
Lemma 5
For all \bm{v}\in\mathcal{S} and t, the loss estimates (5) satisfy the following two properties:
\mathbb{E}\left[\left.\bm{v}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right|\mathcal{F}_{t-1}\right]\leq\bm{v}^{\mathsf{\scriptscriptstyle T}}\bm{\ell}_{t},  (8)  
\mathbb{E}\left[\left.\sum_{\bm{u}\in\mathcal{S}}p_{t}(\bm{u})\left(\bm{u}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right)\right|\mathcal{F}_{t-1}\right]\geq\sum_{\bm{u}\in\mathcal{S}}p_{t}(\bm{u})\bigl{(}\bm{u}^{\mathsf{\scriptscriptstyle T}}\bm{\ell}_{t}\bigr{)}-\frac{d}{eM}.  (9) 
Proof Fix any \bm{v}\in\mathcal{S} and t. The first property is an immediate consequence of Lemma 4: we have \mathbb{E}\left[\left.\widehat{\ell}_{t,k}\right|\mathcal{F}_{t-1}\right]\leq\ell_{t,k} for all k, and thus \mathbb{E}\left[\left.\bm{v}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right|\mathcal{F}_{t-1}\right]\leq\bm{v}^{\mathsf{\scriptscriptstyle T}}\bm{\ell}_{t}. For the second statement, observe that
\begin{split}\mathbb{E}\left[\left.\sum_{\bm{u}\in\mathcal{S}}p_{t}(\bm{u})\left(\bm{u}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right)\right|\mathcal{F}_{t-1}\right]&=\sum_{i=1}^{d}q_{t,i}\mathbb{E}\left[\left.\widehat{\ell}_{t,i}\right|\mathcal{F}_{t-1}\right]=\sum_{i=1}^{d}q_{t,i}\left(1-(1-q_{t,i})^{M}\right)\ell_{t,i}\end{split} 
also holds by Lemma 4.
To control the bias term \sum_{i}q_{t,i}(1-q_{t,i})^{M}, note that q_{t,i}(1-q_{t,i})^{M}\leq q_{t,i}e^{-Mq_{t,i}}. By elementary calculations, one can show that f(q)=qe^{-Mq} takes its maximum at q=\frac{1}{M}, and thus \sum_{i=1}^{d}q_{t,i}(1-q_{t,i})^{M}\leq\frac{d}{eM}.
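The maximization step above is easy to verify numerically (our own sketch, using a simple grid over [0,1]):

```python
import math

# f(q) = q * (1 - q)**M is dominated by q * exp(-M * q), whose maximum
# over [0, 1] is attained at q = 1/M with value 1/(e * M)
M = 20
grid = [i / 10_000 for i in range(10_001)]
max_bias = max(q * (1 - q) ** M for q in grid)
assert max_bias <= 1 / (math.e * M)
```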
Our last lemma concerning the loss estimates (5) bounds the conditional variance of the estimated
loss of the learner. This term plays a key role in the performance analysis of Exp3-style algorithms (see,
e.g., Auer et al. (2002); Uchiya et al. (2010); Audibert et al. (2014)), as well as in the analysis presented in the current paper.
Lemma 6
For all t, the loss estimates (5) satisfy
\mathbb{E}\left[\left.\sum_{\bm{u}\in\mathcal{S}}p_{t}(\bm{u})\left(\bm{u}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right)^{2}\right|\mathcal{F}_{t-1}\right]\leq 2md. 
Before proving the statement, we remark that the conditional variance can be bounded as md for the standard (although usually infeasible) loss estimates (4). That is, the above lemma shows that, somewhat surprisingly, the variance of our estimates is only twice as large as the variance of the standard estimates.
Proof Fix an arbitrary t. To simplify notation below, let us introduce \widetilde{\bm{V}} as an independent copy of \bm{V}_{t} such that \mathbb{P}\left[\left.\widetilde{\bm{V}}=\bm{v}\right|\mathcal{F}_{t-1}\right]=p_{t}(\bm{v}) holds for all \bm{v}\in\mathcal{S}. To begin, observe that for any i
\mathbb{E}\left[\left.K_{t,i}^{2}\right|\mathcal{F}_{t-1}\right]\leq\frac{2-q_{t,i}}{q_{t,i}^{2}}\leq\frac{2}{q_{t,i}^{2}}  (10) 
holds, as K_{t,i} has a truncated geometric law. The statement is proven as
\begin{split}\mathbb{E}\left[\left.\sum_{\bm{u}\in\mathcal{S}}p_{t}(\bm{u})\left(\bm{u}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right)^{2}\right|\mathcal{F}_{t-1}\right]&=\mathbb{E}\left[\left.\sum_{i=1}^{d}\sum_{j=1}^{d}\left(\widetilde{V}_{i}\widehat{\ell}_{t,i}\right)\left(\widetilde{V}_{j}\widehat{\ell}_{t,j}\right)\right|\mathcal{F}_{t-1}\right]\\ &=\mathbb{E}\left[\left.\sum_{i=1}^{d}\sum_{j=1}^{d}\left(\widetilde{V}_{i}K_{t,i}V_{t,i}\ell_{t,i}\right)\left(\widetilde{V}_{j}K_{t,j}V_{t,j}\ell_{t,j}\right)\right|\mathcal{F}_{t-1}\right]\\ &\qquad\qquad\mbox{(using the definition of $\widehat{\bm{\ell}}_{t}$)}\\ &\leq\mathbb{E}\left[\left.\sum_{i=1}^{d}\sum_{j=1}^{d}\frac{K_{t,i}^{2}+K_{t,j}^{2}}{2}\left(\widetilde{V}_{i}V_{t,i}\ell_{t,i}\right)\left(\widetilde{V}_{j}V_{t,j}\ell_{t,j}\right)\right|\mathcal{F}_{t-1}\right]\\ &\qquad\qquad\mbox{(using $2AB\leq A^{2}+B^{2}$)}\\ &\leq 2\mathbb{E}\left[\left.\sum_{i=1}^{d}\frac{1}{q_{t,i}^{2}}\left(\widetilde{V}_{i}V_{t,i}\ell_{t,i}\right)\sum_{j=1}^{d}V_{t,j}\ell_{t,j}\right|\mathcal{F}_{t-1}\right]\\ &\qquad\qquad\mbox{(using symmetry, Eq.~(10) and $\widetilde{V}_{j}\leq 1$)}\\ &\leq 2m\mathbb{E}\left[\left.\sum_{j=1}^{d}\ell_{t,j}\right|\mathcal{F}_{t-1}\right]\leq 2md,\end{split} 
where the last line follows from using \left\|\bm{V}_{t}\right\|_{1}\leq m, \left\|\bm{\ell}_{t}\right\|_{\infty}\leq 1, and \mathbb{E}\left[\left.V_{t,i}\right|\mathcal{F}_{t-1}\right]=\mathbb{E}\left[\left.\widetilde{V}_{i}\right|\mathcal{F}_{t-1}\right]=q_{t,i}.
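Inequality (10) can likewise be checked by direct summation (our own sketch): truncating the geometric variable at M only lowers its second moment, so the untruncated value (2-q)/q^{2} is an upper bound.

```python
def truncated_geometric_second_moment(q, M, tail=10_000):
    """E[min(G, M)**2] for G ~ Geometric(q), by direct summation."""
    return sum(min(k, M) ** 2 * q * (1 - q) ** (k - 1)
               for k in range(1, tail + 1))

# truncation at M only decreases the second moment, so the untruncated
# value (2 - q) / q**2 upper-bounds it, as used in Eq. (10)
```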
4.2 General Tools for Analyzing FPL
In this section, we present the key tools for analyzing the FPL component of our learning algorithm. In some respects, our analysis is a synthesis of previous work on FPL-style methods: we borrow several ideas from Poland (2005) and the proof of Corollary 4.5 in Cesa-Bianchi and Lugosi (2006). Nevertheless, our analysis is the first to directly target combinatorial settings, and yields the tightest known bounds for FPL in this domain. Indeed, the tools developed in this section also permit an improvement for FPL in the full-information setting, closing the presumed performance gap between FPL and EWA in both the full-information and the semi-bandit settings. The statements we present in this section are not specific to the loss-estimate vectors used by FPL+GR.
As in most previous work, we study the performance of the learning algorithm through a virtual algorithm that (i) uses a time-independent perturbation vector and (ii) is allowed to peek one step into the future. Specifically, letting \widetilde{\bm{Z}} be a perturbation vector drawn independently from the same distribution as \bm{Z}_{1}, the virtual algorithm picks its t^{\mathrm{th}} action as
\widetilde{\bm{V}}_{t}=\mathop{\rm arg\,min}_{\bm{v}\in\mathcal{S}}\left\{\bm{v}^{\mathsf{\scriptscriptstyle T}}\left(\eta\widehat{\bm{L}}_{t}-\widetilde{\bm{Z}}\right)\right\}.  (11) 
In what follows, we will crucially use that \widetilde{\bm{V}}_{t} and \bm{V}_{t+1} are conditionally independent and identically distributed given \mathcal{F}_{t}. In particular, introducing the notations
q_{t,i}=\mathbb{E}\left[\left.V_{t,i}\right|\mathcal{F}_{t-1}\right],\qquad\widetilde{q}_{t,i}=\mathbb{E}\left[\left.\widetilde{V}_{t,i}\right|\mathcal{F}_{t}\right],  
p_{t}(\bm{v})=\mathbb{P}\left[\left.\bm{V}_{t}=\bm{v}\right|\mathcal{F}_{t-1}\right],\qquad\widetilde{p}_{t}(\bm{v})=\mathbb{P}\left[\left.\widetilde{\bm{V}}_{t}=\bm{v}\right|\mathcal{F}_{t}\right], 
we will exploit the above property by using q_{t,i}=\widetilde{q}_{t-1,i} and p_{t}(\bm{v})=\widetilde{p}_{t-1}(\bm{v}) numerous times in the proofs below.
First, we show a regret bound on the virtual algorithm that plays the action sequence \widetilde{\bm{V}}_{1},\widetilde{\bm{V}}_{2},\dots,\widetilde{\bm{V}}_{T}.
Lemma 7
For any \bm{v}\in\mathcal{S},
\begin{split}\sum_{t=1}^{T}\sum_{\bm{u}\in\mathcal{S}}\widetilde{p}_{t}(\bm{u})\left(\left(\bm{u}-\bm{v}\right)^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right)\leq\frac{m\left(\log(d/m)+1\right)}{\eta}.\end{split}  (12) 
Although the proof of this statement is rather standard, we include it for completeness. We also note that the lemma slightly improves other known results by replacing the usual \log d term by \log(d/m).
Proof Fix any \bm{v}\in\mathcal{S}. Using Lemma 3.1 of Cesa-Bianchi and Lugosi (2006) (sometimes referred to as the “follow-the-leader/be-the-leader” lemma) for the sequence \bigl{(}\eta\widehat{\bm{\ell}}_{1}-\widetilde{\bm{Z}},\eta\widehat{\bm{\ell}}_{2},\dots,\eta\widehat{\bm{\ell}}_{T}\bigr{)}, we obtain
\eta\sum_{t=1}^{T}\widetilde{\bm{V}}_{t}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}-\widetilde{\bm{V}}_{1}^{\mathsf{\scriptscriptstyle T}}\widetilde{\bm{Z}}\leq\eta\sum_{t=1}^{T}\bm{v}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}-\bm{v}^{\mathsf{\scriptscriptstyle T}}\widetilde{\bm{Z}}. 
Reordering and integrating both sides with respect to the distribution of \widetilde{\bm{Z}} gives
\begin{split}\eta\sum_{t=1}^{T}\sum_{\bm{u}\in\mathcal{S}}\widetilde{p}_{t}(\bm{u})\left(\left(\bm{u}-\bm{v}\right)^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right)\leq\mathbb{E}\left[\left(\widetilde{\bm{V}}_{1}-\bm{v}\right)^{\mathsf{\scriptscriptstyle T}}\widetilde{\bm{Z}}\right].\end{split}  (13) 
The statement follows from using \mathbb{E}\left[\widetilde{\bm{V}}_{1}^{\mathsf{\scriptscriptstyle T}}\widetilde{\bm{Z}}\right]\leq m(\log(d/m)+1), which is proven in Appendix A as Lemma 14, noting that \widetilde{\bm{V}}_{1}^{\mathsf{\scriptscriptstyle T}}\widetilde{\bm{Z}} is upper-bounded by the sum of the m largest components of \widetilde{\bm{Z}}.
The next lemma relates the performance of the virtual algorithm to the actual performance of FPL. The lemma relies on a “sparse-loss” trick similar to the one used in the proof of Corollary 4.5 in Cesa-Bianchi and Lugosi (2006), and is also related to the “unit rule” discussed by Koolen et al. (2010).
Lemma 8
For all t=1,2,\dots,T, assume that \widehat{\bm{\ell}}_{t} is such that \widehat{\ell}_{t,k}\geq 0 for all k\in\left\{1,2,\dots,d\right\}. Then,
\sum_{\bm{u}\in\mathcal{S}}\bigl{(}p_{t}(\bm{u})-\widetilde{p}_{t}(\bm{u})\bigr{)}\left(\bm{u}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right)\leq\eta\sum_{\bm{u}\in\mathcal{S}}p_{t}(\bm{u})\left(\bm{u}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right)^{2}. 
Proof Fix an arbitrary t and \bm{u}\in\mathcal{S}, and define the “sparse loss vector” \widehat{\bm{\ell}}^{\prime}_{t}(\bm{u}) with components \widehat{\ell}^{\prime}_{t,k}(\bm{u})=u_{k}\widehat{\ell}_{t,k} and
\bm{V}^{\prime}_{t}(\bm{u})=\mathop{\rm arg\,min}_{\bm{v}\in\mathcal{S}}\left\{\bm{v}^{\mathsf{\scriptscriptstyle T}}\left(\eta\widehat{\bm{L}}_{t-1}+\eta\widehat{\bm{\ell}}^{\prime}_{t}(\bm{u})-\widetilde{\bm{Z}}\right)\right\}. 
Using the notation p^{\prime}_{t}(\bm{u})=\mathbb{P}\left[\left.\bm{V}^{\prime}_{t}(\bm{u})=\bm{u}\right|\mathcal{F}_{t}\right], we show in Lemma 15 (stated and proved in Appendix A) that p^{\prime}_{t}(\bm{u})\leq\widetilde{p}_{t}(\bm{u}). Also, define
\bm{U}(\bm{z})=\mathop{\rm arg\,min}_{\bm{v}\in\mathcal{S}}\left\{\bm{v}^{\mathsf{\scriptscriptstyle T}}\left(\eta\widehat{\bm{L}}_{t-1}-\bm{z}\right)\right\}. 
Letting f(\bm{z})=e^{-\left\|\bm{z}\right\|_{1}} (\bm{z}\in\mathbb{R}_{+}^{d}) be the density of the perturbations, we have
\begin{split}p_{t}(\bm{u})&=\int\limits_{\bm{z}\in[0,\infty]^{d}}\mathbbm{1}_{\left\{\bm{U}(\bm{z})=\bm{u}\right\}}f(\bm{z})\,d\bm{z}\\ &=e^{\eta\left\|\widehat{\bm{\ell}}^{\prime}_{t}(\bm{u})\right\|_{1}}\int\limits_{\bm{z}\in[0,\infty]^{d}}\mathbbm{1}_{\left\{\bm{U}(\bm{z})=\bm{u}\right\}}f\left(\bm{z}+\eta\widehat{\bm{\ell}}^{\prime}_{t}(\bm{u})\right)\,d\bm{z}\\ &=e^{\eta\left\|\widehat{\bm{\ell}}^{\prime}_{t}(\bm{u})\right\|_{1}}\idotsint\limits_{z_{i}\in[\eta\widehat{\ell}^{\prime}_{t,i}(\bm{u}),\infty]}\mathbbm{1}_{\left\{\bm{U}\left(\bm{z}-\eta\widehat{\bm{\ell}}^{\prime}_{t}(\bm{u})\right)=\bm{u}\right\}}f(\bm{z})\,d\bm{z}\\ &\leq e^{\eta\left\|\widehat{\bm{\ell}}^{\prime}_{t}(\bm{u})\right\|_{1}}\int\limits_{\bm{z}\in[0,\infty]^{d}}\mathbbm{1}_{\left\{\bm{U}\left(\bm{z}-\eta\widehat{\bm{\ell}}^{\prime}_{t}(\bm{u})\right)=\bm{u}\right\}}f(\bm{z})\,d\bm{z}\\ &\leq e^{\eta\left\|\widehat{\bm{\ell}}^{\prime}_{t}(\bm{u})\right\|_{1}}p^{\prime}_{t}(\bm{u})\leq e^{\eta\left\|\widehat{\bm{\ell}}^{\prime}_{t}(\bm{u})\right\|_{1}}\widetilde{p}_{t}(\bm{u}).\end{split} 
Now notice that \bigl\|\widehat{\bm{\ell}}^{\prime}_{t}(\bm{u})\bigr\|_{1}=\bm{u}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}^{\prime}_{t}(\bm{u})=\bm{u}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}, which gives
\begin{split}\widetilde{p}_{t}(\bm{u})&\geq p_{t}(\bm{u})e^{-\eta\bm{u}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}}\geq p_{t}(\bm{u})\left(1-\eta\bm{u}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right).\end{split} 
The proof is concluded by repeating the same argument for all \bm{u}\in\mathcal{S}, reordering and summing the terms as
\begin{split}\sum_{\bm{u}\in\mathcal{S}}p_{t}(\bm{u})\left(\bm{u}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right)&\leq\sum_{\bm{u}\in\mathcal{S}}\widetilde{p}_{t}(\bm{u})\left(\bm{u}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right)+\eta\sum_{\bm{u}\in\mathcal{S}}p_{t}(\bm{u})\left(\bm{u}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right)^{2}.\end{split}  (14) 
4.3 Proof of Theorem 1
Now, everything is ready to prove the bound on the expected regret of FPL+GR. Let us fix an arbitrary \bm{v}\in\mathcal{S}. By putting together Lemmas 6, 7 and 8, we immediately obtain
\mathbb{E}\left[\sum_{t=1}^{T}\sum_{\bm{u}\in\mathcal{S}}p_{t}(\bm{u})\left(\left(\bm{u}-\bm{v}\right)^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right)\right]\leq\frac{m\left(\log(d/m)+1\right)}{\eta}+2\eta mdT,  (15) 
leaving us with the problem of upper bounding the expected regret in terms of the lefthand side of the above inequality. This can be done by using the properties of the loss estimates (5) stated in Lemma 5:
\mathbb{E}\left[\sum_{t=1}^{T}\left(\bm{V}_{t}-\bm{v}\right)^{\mathsf{\scriptscriptstyle T}}\bm{\ell}_{t}\right]\leq\mathbb{E}\left[\sum_{t=1}^{T}\sum_{\bm{u}\in\mathcal{S}}p_{t}(\bm{u})\left(\left(\bm{u}-\bm{v}\right)^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right)\right]+\frac{dT}{eM}. 
Putting the two inequalities together proves the theorem.
4.4 Proof of Theorem 2
We now turn to prove a bound on the regret of FPL+GR.P that holds with high probability. We begin by noting that the conditions of Lemmas 7 and 8 continue to hold for the new loss estimates, so we can obtain the central terms in the regret:
\sum_{t=1}^{T}\sum_{\bm{u}\in\mathcal{S}}p_{t}(\bm{u})\left(\left(\bm{u}-\bm{v}\right)^{\mathsf{\scriptscriptstyle T}}\widetilde{\bm{\ell}}_{t}\right)\leq\frac{m(\log(d/m)+1)}{\eta}+\eta\sum_{t=1}^{T}\sum_{\bm{u}\in\mathcal{S}}p_{t}(\bm{u})\left(\bm{u}^{\mathsf{\scriptscriptstyle T}}\widetilde{\bm{\ell}}_{t}\right)^{2}. 
The first challenge posed by the above expression is relating the lefthand side to the true regret with high probability. Once this is done, the remaining challenge is to bound the second term on the righthand side, as well as the other terms arising from the first step. We first show that the loss estimates used by FPL+GR.P consistently underestimate the true losses with high probability.
Lemma 9
Fix any \delta^{\prime}>0. For any \bm{v}\in\mathcal{S},
\bm{v}^{\mathsf{\scriptscriptstyle T}}\left(\widetilde{\bm{L}}_{T}-\bm{L}_{T}\right)\leq\frac{m\log\left(d/\delta^{\prime}\right)}{\beta} 
holds with probability at least 1-\delta^{\prime}.
The simple proof is directly inspired by Appendix C.9 of Audibert and Bubeck (2010).
Proof Fix any t and i. Then,
\begin{split}\mathbb{E}\left[\left.\exp\left(\beta\widetilde{\ell}_{t,i}\right)\right|\mathcal{F}_{t-1}\right]=\mathbb{E}\left[\left.\exp\left(\log\left(1+\beta\widehat{\ell}_{t,i}\right)\right)\right|\mathcal{F}_{t-1}\right]\leq 1+\beta\ell_{t,i}\leq\exp(\beta\ell_{t,i}),\end{split} 
where we used Lemma 4 in the first inequality and 1+z\leq e^{z}, which holds for all z\in\mathbb{R}. As a result, the process W_{t}=\exp\Bigl{(}\beta\bigl{(}\widetilde{L}_{t,i}-L_{t,i}\bigr{)}\Bigr{)} is a supermartingale with respect to \left(\mathcal{F}_{t}\right): \mathbb{E}\left[\left.W_{t}\right|\mathcal{F}_{t-1}\right]\leq W_{t-1}. Observe that, since W_{0}=1, this implies \mathbb{E}\left[W_{t}\right]\leq\mathbb{E}\left[W_{t-1}\right]\leq\ldots\leq 1. Applying Markov’s inequality gives that
\begin{split}\mathbb{P}\left[\widetilde{L}_{T,i}>L_{T,i}+\varepsilon\right]&=\mathbb{P}\left[\widetilde{L}_{T,i}-L_{T,i}>\varepsilon\right]\\ &\leq\mathbb{E}\left[\exp\left(\beta\left(\widetilde{L}_{T,i}-L_{T,i}\right)\right)\right]\exp(-\beta\varepsilon)\leq\exp(-\beta\varepsilon)\end{split} 
holds for any \varepsilon>0. The statement of the lemma follows after using \left\|\bm{v}\right\|_{1}\leq m, applying the union bound for all i, and solving for \varepsilon.
The following lemma states another key property of the loss estimates.
Lemma 10
For any t,
\sum_{i=1}^{d}q_{t,i}\widehat{\ell}_{t,i}\leq\sum_{i=1}^{d}q_{t,i}\widetilde{\ell}_{t,i}+\frac{\beta}{2}\sum_{i=1}^{d}q_{t,i}\widehat{\ell}_{t,i}^{2}. 
Proof The statement follows trivially from the inequality \log(1+z)\geq z-\frac{z^{2}}{2}, which holds for all z\geq 0. In particular, for any fixed t and i, we have
\log\left(1+\beta\widehat{\ell}_{t,i}\right)\geq\beta\widehat{\ell}_{t,i}-\frac{\beta^{2}}{2}\widehat{\ell}_{t,i}^{2}. 
Multiplying both sides by q_{t,i}/\beta and summing for all i proves the statement.
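The elementary inequality driving this proof is easy to spot-check numerically (our own illustration):

```python
import math

# spot-check of log(1 + z) >= z - z**2 / 2 for z >= 0; equality at z = 0
for z in (0.0, 1e-3, 0.1, 1.0, 10.0, 100.0):
    assert math.log(1 + z) >= z - z * z / 2
```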
The next lemma relates the total loss of the learner to its total estimated losses.
Lemma 11
Fix any \delta^{\prime}>0. With probability at least 1-2\delta^{\prime},
\begin{split}\sum_{t=1}^{T}\bm{V}_{t}^{\mathsf{\scriptscriptstyle T}}\bm{\ell}_{t}\leq&\sum_{t=1}^{T}\sum_{\bm{u}\in\mathcal{S}}p_{t}(\bm{u})\left(\bm{u}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right)+\frac{dT}{eM}+\sqrt{2(e-2)T}\left(m\log\frac{1}{\delta^{\prime}}+1\right)+\sqrt{8T\log\frac{1}{\delta^{\prime}}}.\end{split} 
Proof We start by rewriting
\begin{split}\sum_{\bm{u}\in\mathcal{S}}p_{t}(\bm{u})\left(\bm{u}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right)&=\sum_{i=1}^{d}q_{t,i}K_{t,i}V_{t,i}\ell_{t,i}.\end{split} 
Now let k_{t,i}=\mathbb{E}\left[\left.K_{t,i}\right|\mathcal{F}_{t-1}\right] for all i and notice that
X_{t}=\sum_{i=1}^{d}q_{t,i}V_{t,i}\ell_{t,i}\left(k_{t,i}-K_{t,i}\right) 
is a martingale-difference sequence with respect to \left(\mathcal{F}_{t}\right) with elements upper-bounded by m (as Lemma 4 implies k_{t,i}q_{t,i}\leq 1 and \left\|\bm{V}_{t}\right\|_{1}\leq m). Furthermore, the conditional variance of the increments is bounded as
\begin{split}{\rm Var}\left[\left.X_{t}\right|\mathcal{F}_{t-1}\right]\leq&\mathbb{E}\left[\left.\left(\sum_{i=1}^{d}q_{t,i}V_{t,i}K_{t,i}\right)^{2}\right|\mathcal{F}_{t-1}\right]\leq\mathbb{E}\left[\left.\sum_{j=1}^{d}V_{t,j}\left(\sum_{i=1}^{d}q_{t,i}^{2}K_{t,i}^{2}\right)\right|\mathcal{F}_{t-1}\right]\leq 2m,\end{split} 
where the second inequality is Cauchy–Schwarz and the third one follows from \mathbb{E}\left[\left.K_{t,i}^{2}\right|\mathcal{F}_{t-1}\right]\leq 2/q_{t,i}^{2} and \left\|\bm{V}_{t}\right\|_{1}\leq m. Thus, applying Lemma 16 with B=m and \Sigma_{T}^{2}\leq 2mT, we get that for any S\geq m\sqrt{\log\frac{1}{\delta^{\prime}}\big{/}(e-2)},
\sum_{t=1}^{T}\sum_{i=1}^{d}q_{t,i}\ell_{t,i}V_{t,i}\left(k_{t,i}-K_{t,i}\right)\leq\sqrt{(e-2)\log\frac{1}{\delta^{\prime}}}\left(\frac{2mT}{S}+S\right) 
holds with probability at least 1-\delta^{\prime}, where we have used \left\|\bm{V}_{t}\right\|_{1}\leq m. After setting S=m\sqrt{2T\log\frac{1}{\delta^{\prime}}}, we obtain that
\sum_{t=1}^{T}\sum_{i=1}^{d}q_{t,i}\ell_{t,i}V_{t,i}\left(k_{t,i}-K_{t,i}\right)\leq\sqrt{2\left(e-2\right)T}\left(m\log\frac{1}{\delta^{\prime}}+1\right)  (16) 
holds with probability at least 1-\delta^{\prime}.
To proceed, observe that q_{t,i}k_{t,i}=1-(1-q_{t,i})^{M} holds by Lemma 4, implying
\sum_{i=1}^{d}q_{t,i}V_{t,i}\ell_{t,i}k_{t,i}\geq\bm{V}_{t}^{\mathsf{\scriptscriptstyle T}}\bm{\ell}_{t}-\sum_{i=1}^{d}V_{t,i}(1-q_{t,i})^{M}. 
Together with Eq. (16), this gives
\begin{split}\sum_{t=1}^{T}\bm{V}_{t}^{\mathsf{\scriptscriptstyle T}}\bm{\ell}_{t}\leq&\sum_{t=1}^{T}\sum_{\bm{u}\in\mathcal{S}}p_{t}(\bm{u})\left(\bm{u}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right)+\sqrt{2\left(e-2\right)T}\left(m\log\frac{1}{\delta^{\prime}}+1\right)+\sum_{t=1}^{T}\sum_{i=1}^{d}V_{t,i}(1-q_{t,i})^{M}.\end{split} 
Finally, we use that, as shown in the proof of Lemma 5, q_{t,i}(1-q_{t,i})^{M}\leq 1/(eM), and that
Y_{t}=\sum_{i=1}^{d}\left(V_{t,i}-q_{t,i}\right)(1-q_{t,i})^{M} 
is a martingale-difference sequence with respect to \left(\mathcal{F}_{t}\right) with increments bounded in [-1,1]. Then, by an application of the Hoeffding–Azuma inequality, we have
\sum_{t=1}^{T}\sum_{i=1}^{d}V_{t,i}(1-q_{t,i})^{M}\leq\frac{dT}{eM}+\sqrt{8T\log\frac{1}{\delta^{\prime}}} 
with probability at least 1-\delta^{\prime}, thus proving the lemma.
Finally, our last lemma in this section bounds the second-order terms arising from Lemmas 8 and 10.
Lemma 12
Fix any \delta^{\prime}>0. With probability at least 1-2\delta^{\prime}, the following hold simultaneously:
\begin{split}\sum_{t=1}^{T}\sum_{\bm{v}\in\mathcal{S}}p_{t}(\bm{v})\left(\bm{v}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right)^{2}&\leq Mm\sqrt{2T\log\frac{1}{\delta^{\prime}}}+2md\sqrt{T\log\frac{1}{\delta^{\prime}}}+2mdT\\ \sum_{t=1}^{T}\sum_{i=1}^{d}q_{t,i}\widehat{\ell}_{t,i}^{2}&\leq M\sqrt{2mT\log\frac{1}{\delta^{\prime}}}+2d\sqrt{T\log\frac{1}{\delta^{\prime}}}+2dT.\end{split} 
Proof First, recall that
\mathbb{E}\left[\left.\sum_{\bm{v}\in\mathcal{S}}p_{t}(\bm{v})\left(\bm{v}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right)^{2}\right|\mathcal{F}_{t-1}\right]\leq 2md 
holds by Lemma 6. Now, observe that
X_{t}=\sum_{\bm{v}\in\mathcal{S}}p_{t}(\bm{v})\left(\left(\bm{v}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right)^{2}-\mathbb{E}\left[\left.\left(\bm{v}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right)^{2}\right|\mathcal{F}_{t-1}\right]\right) 
is a martingale-difference sequence with increments in [-2md,mM]. An application of the Hoeffding–Azuma inequality gives that
\sum_{t=1}^{T}\sum_{\bm{v}\in\mathcal{S}}p_{t}(\bm{v})\left(\left(\bm{v}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right)^{2}-\mathbb{E}\left[\left.\left(\bm{v}^{\mathsf{\scriptscriptstyle T}}\widehat{\bm{\ell}}_{t}\right)^{2}\right|\mathcal{F}_{t-1}\right]\right)\leq Mm\sqrt{2T\log\frac{1}{\delta^{\prime}}}+2md\sqrt{T\log\frac{1}{\delta^{\prime}}} 
holds with probability at least 1-\delta^{\prime}. Reordering the terms completes the proof of the first statement. The second statement is proven analogously, building on the bound
\begin{split}\mathbb{E}\left[\left.\sum_{i=1}^{d}q_{t,i}\widehat{\ell}_{t,i}^{2}\right|\mathcal{F}_{t-1}\right]\leq&\mathbb{E}\left[\left.\sum_{i=1}^{d}q_{t,i}V_{t,i}K_{t,i}^{2}\right|\mathcal{F}_{t-1}\right]\leq 2d.\end{split} 
Theorem 2 follows from combining Lemmas 9 through 12 and applying
the union bound.
5 Improved Bounds for Learning With Full Information
Our proof techniques presented in Section 4.2 also enable us to tighten the guarantees for FPL in the full information setting. In particular, consider the algorithm choosing action
\bm{V}_{t}=\mathop{\rm arg\,min}_{\bm{v}\in\mathcal{S}}\bm{v}^{\mathsf{\scriptscriptstyle T}}\left(\eta\bm{L}_{t-1}-\bm{Z}_{t}\right), 
where \bm{L}_{t}=\sum_{s=1}^{t}\bm{\ell}_{s} and the components of \bm{Z}_{t} are drawn independently from a standard exponential distribution. We state our improved regret bounds concerning this algorithm in the following theorem.
Theorem 13
For any \bm{v}\in\mathcal{S}, the total expected regret of FPL satisfies
\widehat{R}_{T}\leq\frac{m\left(\log(d/m)+1\right)}{\eta}+\eta m\sum_{t=1}^{T}\mathbb{E}\left[\bm{V}_{t}^{\mathsf{\scriptscriptstyle T}}\bm{\ell}_{t}\right] 
under full information. In particular, defining L_{T}^{*}=\min_{\bm{v}\in\mathcal{S}}\bm{v}^{\mathsf{\scriptscriptstyle T}}\bm{L}_{T} and setting
\eta=\min\left\{\sqrt{\frac{\log(d/m)+1}{L_{T}^{*}}},\frac{1}{2}\right\}, 
the regret of FPL satisfies
R_{T}\leq 4m\max\left\{\sqrt{L_{T}^{*}\left(\log\left(\frac{d}{m}\right)+1% \right)},\left(m^{2}+1\right)\left(\log\frac{d}{m}+1\right)\right\}. 
In the worst case, the above bound becomes 2m^{3/2}\sqrt{T\bigl{(}\log(d/m)+1\bigr{)}}, which improves the best known bound for FPL of Kalai and Vempala (2005) by a factor of \sqrt{d/m}.
Proof The first statement follows from combining Lemmas 7 and 8, and bounding
\sum_{\bm{u}\in\mathcal{S}}p_{t}(\bm{u})\bigl{(}\bm{u}^{\mathsf{\scriptscriptstyle T}}\bm{\ell}_{t}\bigr{)}^{2}\leq m\sum_{\bm{u}\in\mathcal{S}}p_{t}(\bm{u})\bigl{(}\bm{u}^{\mathsf{\scriptscriptstyle T}}\bm{\ell}_{t}\bigr{)}, 
while the second one follows from standard algebraic manipulations.
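A minimal simulation of the full-information FPL variant analyzed above can be sketched as follows (our own toy implementation; the decision set and loss sequence are hypothetical inputs, not from the paper):

```python
import random

def fpl_full_info(decision_set, losses, eta, seed=0):
    """Follow-the-Perturbed-Leader with full information: at each round,
    perturb the cumulative loss vector with fresh exponential noise and
    play the minimizer over the (combinatorial) decision set."""
    rng = random.Random(seed)
    d = len(losses[0])
    L = [0.0] * d  # cumulative true losses, fully observed
    total = 0.0
    for loss in losses:
        Z = [rng.expovariate(1.0) for _ in range(d)]  # perturbations Z_t
        V = min(decision_set,
                key=lambda v: sum(v[i] * (eta * L[i] - Z[i]) for i in range(d)))
        total += sum(V[i] * loss[i] for i in range(d))
        L = [L[i] + loss[i] for i in range(d)]
    return total
```

On a toy instance where the first coordinate always incurs loss 1 and the second none, the perturbed leader quickly locks onto the better action and the cumulative loss stays small.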
6 Conclusions and Open Problems
In this paper, we have described the first general and efficient algorithm for online combinatorial optimization under semi-bandit feedback. We have proved that the regret of this algorithm is O\bigl{(}m\sqrt{dT\log(d/m)}\bigr{)} in this setting, and have also shown that FPL can achieve O\bigl{(}m^{3/2}\sqrt{T\log(d/m)}\bigr{)} in the full information case when tuned properly. While these bounds are off by factors of \sqrt{m\log(d/m)} and \sqrt{m} from the respective minimax results, they exactly match the best known regret bounds for the well-studied Exponentially Weighted Forecaster (EWA). Whether the remaining gaps can be closed for FPL-style algorithms (e.g., by using more intricate perturbation schemes or a more refined analysis) remains an important open question. Nevertheless, we regard our contribution as a significant step towards understanding the inherent trade-offs between computational efficiency and performance guarantees in online combinatorial optimization and, more generally, in online optimization.
The efficiency of our method rests on a novel loss-estimation method called Geometric Resampling (GR). This estimation method is not specific to the proposed learning algorithm. While GR has no immediate benefits for OSMD-type algorithms, where the ideal importance weights are readily available, it is possible to think of problem instances where EWA can be efficiently implemented while the importance weights are difficult to compute.
The most important open problem left is the case of efficient online linear optimization with full bandit feedback, where the learner only observes the inner product \bm{V}_{t}^{\mathsf{\scriptscriptstyle T}}\bm{\ell}_{t} in round t. Learning algorithms for this problem usually require that the (pseudo-)inverse of the covariance matrix P_{t}=\mathbb{E}\left[\left.\bm{V}_{t}\bm{V}_{t}^{\mathsf{\scriptscriptstyle T}}\right|\mathcal{F}_{t-1}\right] is readily available to the learner at each time step (see, e.g., McMahan and Blum (2004); Dani et al. (2008); Cesa-Bianchi and Lugosi (2012); Bubeck et al. (2012)). Computing this matrix, however, is at least as challenging as computing the individual importance weights 1/q_{t,i}. That said, our Geometric Resampling technique can be directly generalized to this setting by observing that the matrix geometric series \sum_{n=0}^{\infty}(I-P_{t})^{n} converges to P_{t}^{-1} under certain conditions. This sum can then be efficiently estimated by sampling independent copies of \bm{V}_{t}, which paves the way for constructing low-bias estimates of the loss vectors. While it seems straightforward to go ahead and use these estimates in tandem with FPL, we have to note that the analysis presented in this paper does not carry through directly in this case. The main limitation is that our techniques only apply to loss vectors with nonnegative elements (cf. Lemma 8). Nevertheless, we believe that Geometric Resampling should be a crucial component in constructing truly effective learning algorithms for this important problem.
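The convergence of the matrix geometric series can be illustrated on a toy positive definite matrix (our own example, not from the paper; for convergence the eigenvalues of the matrix must lie in (0, 1), e.g., after rescaling):

```python
def mat_mul(A, B):
    """Plain-Python product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def geometric_series_inverse(P, N):
    """Approximate P^{-1} by the truncated series sum_{n=0}^{N} (I - P)^n,
    valid when the eigenvalues of P lie in (0, 1)."""
    n = len(P)
    I = [[float(i == j) for j in range(n)] for i in range(n)]
    IP = [[I[i][j] - P[i][j] for j in range(n)] for i in range(n)]
    acc = [row[:] for row in I]   # n = 0 term of the series
    term = [row[:] for row in I]
    for _ in range(N):
        term = mat_mul(term, IP)  # (I - P)^n
        acc = [[acc[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return acc
```

Multiplying the truncated series by P should recover something close to the identity, which is exactly what makes sampled versions of the series usable as loss-estimate building blocks.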
Acknowledgments
The authors wish to thank Csaba Szepesvári for thought-provoking discussions. The research presented in this paper was supported by the UPFellows Fellowship (Marie Curie COFUND program n^{\circ} 600387), the French Ministry of Higher Education and Research, and by FUI project Hermès.
A Further Proofs and Technical Tools
Lemma 14
Let Z_{1},\dots,Z_{d} be i.i.d. exponentially distributed random variables with unit expectation and let Z_{1}^{*},\dots,Z_{d}^{*} be their permutation such that Z_{1}^{*}\geq Z_{2}^{*}\geq\dots\geq Z_{d}^{*}. Then, for any 1\leq m\leq d,
\mathbb{E}\left[\sum_{i=1}^{m}Z_{i}^{*}\right]\leq m\left(\log\left(\frac{d}{m% }\right)+1\right). 
Proof Let us define Y=\sum_{i=1}^{m}Z_{i}^{*}. Then, as Y is nonnegative, we have for any A\geq 0 that
\begin{split}\mathbb{E}\left[Y\right]=&\int_{0}^{\infty}\mathbb{P}\left[Y>y\right]dy\\ \leq&A+\int_{A}^{\infty}\mathbb{P}\left[\sum_{i=1}^{m}Z_{i}^{*}>y\right]dy\\ \leq&A+\int_{A}^{\infty}\mathbb{P}\left[Z_{1}^{*}>\frac{y}{m}\right]dy\\ \leq&A+d\int_{A}^{\infty}\mathbb{P}\left[Z_{1}>\frac{y}{m}\right]dy\\ =&A+de^{-A/m}\\ \leq&m\log\left(\frac{d}{m}\right)+m,\end{split} 
where in the last step we used that A=m\log\left(\frac{d}{m}\right) minimizes A+de^{-A/m} over the real line.
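Since the i-th largest of d i.i.d. standard exponentials has expectation \sum_{j=i}^{d}1/j (a standard order-statistics identity), the bound of Lemma 14 can also be checked against the exact expectation (our own numerical illustration):

```python
import math

def expected_top_m_sum(d, m):
    """Exact E[sum of the m largest of d i.i.d. Exp(1) variables], using
    the order-statistics identity E[Z_i^*] = sum_{j=i}^{d} 1/j."""
    return sum(sum(1.0 / j for j in range(i, d + 1)) for i in range(1, m + 1))

# e.g., the exact value stays below the bound of Lemma 14:
# expected_top_m_sum(100, 5) is about 19.52, while 5 * (log(20) + 1) is about 19.98
```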
Lemma 15
Fix any \bm{v}\in\mathcal{S} and any vectors \bm{L}\in\mathbb{R}^{d} and \bm{\ell}\in[0,\infty)^{d}. Define the vector \bm{\ell}^{\prime} with components \ell^{\prime}_{k}=v_{k}\ell_{k}. Then, for any perturbation vector \bm{Z} with independent components,
\begin{split}&\mathbb{P}\left[\bm{v}^{\mathsf{\scriptscriptstyle T}}\left(\bm{L}+\bm{\ell}^{\prime}-\bm{Z}\right)\leq\bm{u}^{\mathsf{\scriptscriptstyle T}}\left(\bm{L}+\bm{\ell}^{\prime}-\bm{Z}\right)\,\left(\forall\bm{u}\in\mathcal{S}\right)\right]\\ &\qquad\leq\mathbb{P}\left[\bm{v}^{\mathsf{\scriptscriptstyle T}}\left(\bm{L}+\bm{\ell}-\bm{Z}\right)\leq\bm{u}^{\mathsf{\scriptscriptstyle T}}\left(\bm{L}+\bm{\ell}-\bm{Z}\right)\,\left(\forall\bm{u}\in\mathcal{S}\right)\right].\end{split} 
Proof Fix any \bm{u}\in\mathcal{S}\setminus\left\{\bm{v}\right\} and define the vector \bm{\ell}^{\prime\prime}=\bm{\ell}-\bm{\ell}^{\prime}. Define the events
A^{\prime}(\bm{u})=\left\{\bm{v}^{\mathsf{\scriptscriptstyle T}}\left(\bm{L}+\bm{\ell}^{\prime}-\bm{Z}\right)\leq\bm{u}^{\mathsf{\scriptscriptstyle T}}\left(\bm{L}+\bm{\ell}^{\prime}-\bm{Z}\right)\right\} 
and
A(\bm{u})=\left\{\bm{v}^{\mathsf{\scriptscriptstyle T}}\left(\bm{L}+\bm{\ell}-\bm{Z}\right)\leq\bm{u}^{\mathsf{\scriptscriptstyle T}}\left(\bm{L}+\bm{\ell}-\bm{Z}\right)\right\}. 
We have
\begin{split}\displaystyle A^{\prime}(\bm{u})&\displaystyle=\left\{\left(\bm{v}-\bm{u}\right)^{\mathsf{\scriptscriptstyle T}}\bm{Z}\geq\left(\bm{v}-\bm{u}\right)^{\mathsf{\scriptscriptstyle T}}\left(\bm{L}+\bm{\ell}^{\prime}\right)\right\}\\ &\displaystyle\subseteq\left\{\left(\bm{v}-\bm{u}\right)^{\mathsf{\scriptscriptstyle T}}\bm{Z}\geq\left(\bm{v}-\bm{u}\right)^{\mathsf{\scriptscriptstyle T}}\left(\bm{L}+\bm{\ell}^{\prime}\right)-\bm{u}^{\mathsf{\scriptscriptstyle T}}\bm{\ell}^{\prime\prime}\right\}\\ &\displaystyle=\left\{\left(\bm{v}-\bm{u}\right)^{\mathsf{\scriptscriptstyle T}}\bm{Z}\geq\left(\bm{v}-\bm{u}\right)^{\mathsf{\scriptscriptstyle T}}\left(\bm{L}+\bm{\ell}\right)\right\}=A(\bm{u}),\end{split} 
where we used that \bm{v}^{\mathsf{\scriptscriptstyle T}}\bm{\ell}^{\prime\prime}=0 (since v_{k}\ell^{\prime\prime}_{k}=v_{k}(1-v_{k})\ell_{k}=0 for each binary component v_{k} of \bm{v}) and \bm{u}^{\mathsf{\scriptscriptstyle T}}\bm{\ell}^{\prime\prime}\geq 0. Now, since A^{\prime}(\bm{u})\subseteq A(\bm{u}) holds for every \bm{u}, we have
\bigcap_{\bm{u}\in\mathcal{S}}A^{\prime}(\bm{u})\subseteq\bigcap_{\bm{u}\in\mathcal{S}}A(\bm{u}), 
thus proving
\mathbb{P}\left[\bigcap_{\bm{u}\in\mathcal{S}}A^{\prime}(\bm{u})\right]\leq\mathbb{P}\left[\bigcap_{\bm{u}\in\mathcal{S}}A(\bm{u})\right] 
as claimed in the lemma.
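Since the inclusion of the intersected events holds pathwise, Lemma 15 can be illustrated with common random numbers: reusing the same perturbation draw Z for both loss vectors guarantees that the estimated probability for \bm{\ell}^{\prime} never exceeds that for \bm{\ell}. The toy instance below (a hypothetical decision set of pairs over d=3 components; all numeric values and the helper name v_is_leader are arbitrary choices of ours) sketches this:

```python
import random

# Hypothetical decision set: all pairs out of d = 3 components, as 0/1 vectors.
S = [(1, 1, 0), (0, 1, 1), (1, 0, 1)]
v = (1, 1, 0)                                         # the fixed v from the lemma
L = (0.5, 0.2, 0.1)                                   # arbitrary cumulative losses
ell = (1.0, 2.0, 3.0)                                 # arbitrary nonnegative losses
ell_prime = tuple(vk * lk for vk, lk in zip(v, ell))  # ell'_k = v_k * ell_k

def v_is_leader(loss, Z):
    """True iff v minimizes u^T (L + loss - Z) over all u in S."""
    def score(u):
        return sum(uk * (Lk + lk - zk) for uk, Lk, lk, zk in zip(u, L, loss, Z))
    return all(score(v) <= score(u) + 1e-12 for u in S)

# Common random numbers: with the same draw of Z, the event for ell' is
# contained (pathwise) in the event for ell, exactly as in the proof.
rng = random.Random(1)
trials = 20000
wins_prime = wins = 0
for _ in range(trials):
    Z = [rng.expovariate(1.0) for _ in range(3)]
    wins_prime += v_is_leader(ell_prime, Z)
    wins += v_is_leader(ell, Z)
p_prime, p = wins_prime / trials, wins / trials   # p_prime <= p, as the lemma claims
```

Because the same Z is used on both sides, the ordering p_prime <= p holds deterministically here, not just up to Monte Carlo error.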
Lemma 16 (cf. Theorem 1 in Beygelzimer et al. (2011))
Assume X_{1},X_{2},\dots,X_{T} is a martingale-difference sequence with respect to the filtration (\mathcal{F}_{t}), with X_{t}\leq B for all 1\leq t\leq T. Let \sigma_{t}^{2}={\rm Var}\left[\left.X_{t}\,\right|\,\mathcal{F}_{t-1}\right] and \Sigma_{t}^{2}=\sum_{s=1}^{t}\sigma_{s}^{2}. Then, for any \delta>0,
\mathbb{P}\left[\sum_{t=1}^{T}X_{t}>B\log\frac{1}{\delta}+(e-2)\frac{\Sigma_{T}^{2}}{B}\right]\leq\delta. 
Furthermore, for any S>B\sqrt{\log(1/\delta)(e-2)},
\mathbb{P}\left[\sum_{t=1}^{T}X_{t}>\sqrt{(e-2)\log\frac{1}{\delta}}\left(\frac{\Sigma_{T}^{2}}{S}+S\right)\right]\leq\delta. 
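As an illustration (not part of the analysis), the first inequality of Lemma 16 can be checked empirically for a simple bounded martingale-difference sequence. In the sketch below, the X_t are i.i.d. Rademacher steps, so B=1 and \Sigma_T^2 = T deterministically; the function name and parameters are our own choices:

```python
import math
import random

def bernstein_violation_rate(T=100, delta=0.05, runs=2000, seed=2):
    """Fraction of runs where sum_t X_t exceeds B log(1/delta) + (e-2) Sigma_T^2 / B."""
    rng = random.Random(seed)
    B = 1.0                 # X_t in {-1, +1}, so X_t <= B = 1
    var_sum = float(T)      # Var[X_t | F_{t-1}] = 1 per step, so Sigma_T^2 = T
    threshold = B * math.log(1 / delta) + (math.e - 2) * var_sum / B
    violations = 0
    for _ in range(runs):
        s = sum(rng.choice((-1.0, 1.0)) for _ in range(T))
        if s > threshold:
            violations += 1
    return violations / runs

rate = bernstein_violation_rate()   # should be at most delta = 0.05
```

For this light-tailed example the bound is very loose: the threshold is about 74.8 while the sum of 100 Rademacher steps rarely strays far from zero, so the observed violation rate is far below delta.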
References
 Abernethy et al. (2014) J. Abernethy, C. Lee, A. Sinha, and A. Tewari. Online linear optimization via smoothing. In Proceedings of The 27th Conference on Learning Theory (COLT), pages 807–823, 2014.
 Allenberg et al. (2006) C. Allenberg, P. Auer, L. Györfi, and Gy. Ottucsák. Hannan consistency in online learning in case of unbounded losses under partial monitoring. In Proceedings of the 17th International Conference on Algorithmic Learning Theory (ALT), pages 229–243, 2006.
 Audibert and Bubeck (2010) J.-Y. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11:2635–2686, 2010.
 Audibert et al. (2014) J.-Y. Audibert, S. Bubeck, and G. Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39:31–45, 2014.
 Auer et al. (2002) P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
 Awerbuch and Kleinberg (2004) B. Awerbuch and R. D. Kleinberg. Adaptive routing with end-to-end feedback: distributed learning and geometric approaches. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, pages 45–53, 2004.
 Beygelzimer et al. (2011) A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 19–26, 2011.
 Bubeck et al. (2012) S. Bubeck, N. Cesa-Bianchi, and S. M. Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Proceedings of The 25th Conference on Learning Theory (COLT), pages 1–14, 2012.
 Bubeck and Cesa-Bianchi (2012) S. Bubeck and N. Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Now Publishers Inc, 2012.
 Cesa-Bianchi and Lugosi (2006) N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.
 Cesa-Bianchi and Lugosi (2012) N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78:1404–1422, 2012.
 Dani et al. (2008) V. Dani, T. Hayes, and S. Kakade. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems (NIPS), volume 20, pages 345–352, 2008.
 Devroye et al. (2013) L. Devroye, G. Lugosi, and G. Neu. Prediction by random-walk perturbation. In Proceedings of the 26th Conference on Learning Theory (COLT), pages 460–473, 2013.
 Freund and Schapire (1997) Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.
 György et al. (2007) A. György, T. Linder, G. Lugosi, and Gy. Ottucsák. The online shortest path problem under partial monitoring. Journal of Machine Learning Research, 8:2369–2403, 2007.
 Hannan (1957) J. Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.
 Kalai and Vempala (2005) A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71:291–307, 2005.
 Koolen et al. (2010) W. Koolen, M. Warmuth, and J. Kivinen. Hedging structured concepts. In Proceedings of the 23rd Conference on Learning Theory (COLT), pages 93–105, 2010.
 Littlestone and Warmuth (1994) N. Littlestone and M. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.
 McMahan and Blum (2004) H. B. McMahan and A. Blum. Online geometric optimization in the bandit setting against an adaptive adversary. In Proceedings of the 17th Conference on Learning Theory (COLT), pages 109–123, 2004.
 Neu and Bartók (2013) G. Neu and G. Bartók. An efficient algorithm for learning with semi-bandit feedback. In Proceedings of the 24th International Conference on Algorithmic Learning Theory (ALT), pages 234–248, 2013.
 Poland (2005) J. Poland. FPL analysis for adaptive bandits. In 3rd Symposium on Stochastic Algorithms, Foundations and Applications (SAGA), pages 58–69, 2005.
 Rakhlin et al. (2012) S. Rakhlin, O. Shamir, and K. Sridharan. Relax and randomize: From value to algorithms. In Advances in Neural Information Processing Systems (NIPS), volume 25, pages 2150–2158, 2012.
 Suehiro et al. (2012) D. Suehiro, K. Hatano, S. Kijima, E. Takimoto, and K. Nagano. Online prediction under submodular constraints. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory (ALT), pages 260–274, 2012.
 Takimoto and Warmuth (2003) E. Takimoto and M. Warmuth. Path kernels and multiplicative updates. Journal of Machine Learning Research, 4:773–818, 2003.
 Uchiya et al. (2010) T. Uchiya, A. Nakamura, and M. Kudo. Algorithms for adversarial bandit problems with multiple plays. In Proceedings of the 21st International Conference on Algorithmic Learning Theory (ALT), pages 375–389, 2010.
 Van Erven et al. (2014) T. Van Erven, M. Warmuth, and W. Kotłowski. Follow the leader with dropout perturbations. In Proceedings of The 27th Conference on Learning Theory (COLT), pages 949–974, 2014.
 Vovk (1990) V. Vovk. Aggregating strategies. In Proceedings of the 3rd Annual Workshop on Computational Learning Theory (COLT), pages 371–386, 1990.