Exponential Weights on the Hypercube in Polynomial Time
Abstract
We address the online linear optimization problem when the decision set is the entire $\{0,1\}^n$ hypercube. It was previously unknown if it is possible to run the exponential weights algorithm on this decision set in polynomial time. In this paper, we show a simple polynomial time algorithm which is equivalent to running exponential weights on $\{0,1\}^n$. In the Full Information setting, we show that our algorithm is equivalent to both Exp2 and Online Mirror Descent with Entropic Regularization. This enables us to prove a tight regret bound for Exp2 on $\{0,1\}^n$. In the Bandit setting, we show that our algorithm is equivalent to both Exp2 and OMD with Entropic Regularization as long as they use the same exploration distribution. In addition, we show a reduction from the $\{-1,+1\}^n$ hypercube to the $\{0,1\}^n$ hypercube for the full information and bandit settings. This implies that we can also run exponential weights on $\{-1,+1\}^n$ in polynomial time, addressing the problem of sampling from the exponential weights distribution in polynomial time, which was left as an open question in bubeck2012towards.
Sudeep Raja Putta
Online Learning, Bandits, Exponential Weights, Hedge, Online Mirror Descent, EXP2, Online Stochastic Mirror Descent
1 Introduction
In this paper, we consider the Online Linear Optimization framework when the decision set is $\{0,1\}^n$. This framework is also referred to as Online Combinatorial Optimization. It proceeds as a repeated game between a player and an adversary. At each time instance $t = 1, 2, \dots, T$, the player chooses an action $X_t$ from the decision set $\{0,1\}^n$, possibly using some internal randomization. Simultaneously, the adversary chooses a loss vector $l_t$, without access to the internal randomization of the player. The player incurs the loss $X_t^\top l_t$. The goal of the player is to minimize the expected cumulative loss $\mathbb{E}\left[\sum_{t=1}^{T} X_t^\top l_t\right]$. Here the expectation is with respect to the internal randomization of the player (and eventually the adversary's randomization if it is an adaptive adversary). We use regret as a measure of performance of the player, which is defined as:
$$R_T = \sum_{t=1}^{T} X_t^\top l_t - \min_{X \in \{0,1\}^n} \sum_{t=1}^{T} X^\top l_t \qquad (1)$$
Since the player’s decision could be the outcome of a randomized algorithm, we consider the expected regret over the randomness in the algorithm. We consider two kinds of feedback models for the player:

Full Information setting: At the end of each round $t$, the player observes the loss vector $l_t$.

Bandit setting: At the end of each round $t$, the player only observes the scalar loss incurred, $X_t^\top l_t$.
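To make the regret definition concrete, here is a minimal Python sketch, assuming the decision set is $\{0,1\}^n$ and the losses are linear. The function names are illustrative and do not come from the paper. The sketch relies on the observation that, on the full hypercube, the best fixed action in hindsight can be found one coordinate at a time: coordinate $i$ is switched on exactly when its cumulative loss is negative.

```python
def cumulative_loss(actions, losses):
    # Sum over rounds of the inner product X_t . l_t.
    return sum(sum(a * l for a, l in zip(x, lt)) for x, lt in zip(actions, losses))


def best_fixed_action(losses):
    # On {0,1}^n the minimiser of X . L decomposes per coordinate:
    # pick X_i = 1 only if the cumulative loss of coordinate i is negative.
    n = len(losses[0])
    totals = [sum(lt[i] for lt in losses) for i in range(n)]
    return [1 if total < 0 else 0 for total in totals]


def regret(actions, losses):
    # Player's cumulative loss minus that of the best fixed vertex in hindsight.
    best = best_fixed_action(losses)
    return cumulative_loss(actions, losses) - cumulative_loss([best] * len(losses), losses)
```

For example, `regret([[1, 0], [1, 1]], [[1.0, -1.0], [0.5, -0.5]])` compares the played sequence against the fixed vertex `[0, 1]`.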
1.1 Previous Work

Full Information: freund1997decision considered the problem of learning with experts under full information and introduced the Hedge algorithm. The online combinatorial optimization problem was introduced by kalai2005efficient for the full information setting. Several works have studied this problem for specific kinds of decision sets. koolen2010hedging introduce the Component Hedge algorithm for online learning of m-sets, spanning trees, shortest paths and truncated permutations. Their algorithm is similar to Online Mirror Descent (OMD) with Entropic regularization. In this paper, our decision set consists of the entire hypercube and we consider linear losses. hazan2012online also consider the entire hypercube in the case of submodular losses, which are more general.

Bandit Information: Online Linear Optimization with bandit feedback was first studied by awerbuch2004adaptive and mcmahan2004online, who obtained suboptimal regret bounds (in terms of $T$). dani2008price were the first to achieve the optimal regret bound in $T$ using their Geometric Hedge algorithm, which is similar in spirit to the Exp2 algorithm. Several improvements to the exploration strategy of Geometric Hedge have also been proposed. In the specific case of Online Combinatorial Optimization under bandit feedback, cesa2012combinatorial propose ComBand for several combinatorial structures. bubeck2012towards propose Exp2 with John's Exploration as well as Online Stochastic Mirror Descent (OSMD) with Entropic regularization, which are shown to achieve optimal regret in terms of $n$ and $T$.
See the books by cesa2006prediction, bubeck2012regret, shalev2012online, hazan2016introduction and lectures by rakhlin2009lecture, bubeck2011introduction for a comprehensive survey of online learning.
In particular, we refer to a statement in bubeck2012towards. They consider the problem of online linear optimization on the decision set $\{-1,+1\}^n$ with Bandit feedback. They state that it is not known if it is possible to sample from the exponential weights distribution in polynomial time for this particular set of actions. Hence, they turn to using OSMD with entropic regularization for this action set. Over the course of this paper, we show a simple way to sample from and update the exponential weights distribution in polynomial time for the hypercube, under linear losses in both full information and bandit feedback.
1.2 Our Contributions
For most of this paper, we consider the $\{0,1\}^n$ hypercube as our decision set. Towards the end, we show how to transform the problem to $\{0,1\}^n$ if the decision set is the $\{-1,+1\}^n$ hypercube. Our contributions are:

In the Full information setting, we propose a polynomial time algorithm PolyExp, which is equivalent to running Exp2 (which takes exponential time).

We show that OMD with Entropic regularization is equivalent to PolyExp. We use OMD’s analysis to derive a regret bound for PolyExp. This naturally implies a regret bound for Exp2. This bound is tighter than previously known bounds for this problem.

In the Bandit setting, we show that PolyExp, Exp2 and OSMD with Entropic regularization are equivalent if they use the same exploration distribution.

Finally, we show how to reduce the problem on $\{-1,+1\}^n$ to $\{0,1\}^n$ for both the Full information and Bandit settings. This solves the open problem in bubeck2012towards about being able to run exponential weights on $\{-1,+1\}^n$ in polynomial time.
2 Full Information
2.1 Exp2
This algorithm is equivalent to Hedge on $2^n$ experts using the losses $X^\top l_t$. Since it explicitly maintains a probability distribution on $2^n$ experts, the running time is exponential. The expected regret of Exp2 is as follows:
Theorem (Exp2 regret). In the full information setting, using , Exp2 attains the regret bound:
2.2 PolyExp
The sampling step and the update step in Exp2 are both exponential time operations. This is because Exp2 explicitly maintains a probability distribution on $2^n$ objects. To get a polynomial time algorithm, we replace the sampling and update steps with polynomial time operations. The PolyExp algorithm is as follows:
PolyExp uses $n$ parameters represented by the vector $x_t$. Each element of $x_t$ corresponds to the mean of a Bernoulli distribution. It uses the product of these Bernoulli distributions to sample $X_t$ and uses the aforementioned update equation to obtain $x_{t+1}$.
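The sampling and update steps described above can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the paper's listing: it assumes each Bernoulli mean has the sigmoid form $1/(1+\exp(\eta \cdot \text{cumulative loss}))$ that a product-form exponential weights distribution implies, and the function names and the `rng` parameter are hypothetical.

```python
import math
import random


def polyexp_init(n):
    # Before any loss is seen, every coordinate has mean 1/2,
    # which is the uniform distribution over the hypercube.
    return [0.5] * n


def polyexp_sample(x, rng=random):
    # Draw each coordinate independently: X_i ~ Bernoulli(x_i).
    return [1 if rng.random() < xi else 0 for xi in x]


def polyexp_update(x, loss, eta):
    # Multiplicative update written per coordinate (assumed form):
    # x_i <- x_i / (x_i + (1 - x_i) * exp(eta * l_i)).
    return [xi / (xi + (1.0 - xi) * math.exp(eta * li)) for xi, li in zip(x, loss)]
```

Both steps cost $O(n)$ per round, in contrast to the $2^n$ weights that an explicit Exp2 implementation would maintain.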
2.3 Equivalence of Exp2 and PolyExp
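We prove that running Exp2 is equivalent to running PolyExp.
Theorem. At round $t$, the probability that PolyExp chooses $X$ is $\prod_{i=1}^{n} x_{i,t}^{X_i}(1-x_{i,t})^{1-X_i}$, where $x_{i,t} = \left(1+\exp\left(\eta\sum_{\tau=1}^{t-1} l_{i,\tau}\right)\right)^{-1}$. This is equal to the probability of Exp2 choosing $X$ at round $t$, i.e.:
$$\prod_{i=1}^{n} x_{i,t}^{X_i}(1-x_{i,t})^{1-X_i} = \frac{\exp\left(-\eta\sum_{\tau=1}^{t-1} X^\top l_\tau\right)}{\sum_{Y\in\{0,1\}^n}\exp\left(-\eta\sum_{\tau=1}^{t-1} Y^\top l_\tau\right)}$$
At every round, since the probability distribution of Exp2 and PolyExp over the decision set is the same, they have the same regret bound, with the added advantage of PolyExp being a polynomial time algorithm. The product-expansion lemma in Appendix A is crucial in proving equivalence between the two algorithms. Strictly speaking, that lemma holds only because our decision set is the entire $\{0,1\}^n$ hypercube.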
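The equivalence can be checked numerically for small $n$ by brute force. The sketch below is illustrative only: it enumerates all $2^n$ vertices to build the exponential weights distribution explicitly, then compares it with the product of Bernoulli distributions, assuming means of the form $1/(1+\exp(\eta L_i))$ where $L_i$ is the cumulative loss of coordinate $i$. All names are hypothetical.

```python
import itertools
import math


def exp2_distribution(cum_loss, eta):
    # Explicit exponential weights over all 2^n vertices (exponential time).
    vertices = list(itertools.product((0, 1), repeat=len(cum_loss)))
    weights = [math.exp(-eta * sum(v * c for v, c in zip(vert, cum_loss)))
               for vert in vertices]
    total = sum(weights)
    return {vert: w / total for vert, w in zip(vertices, weights)}


def product_distribution(cum_loss, eta):
    # Product of n Bernoulli distributions with sigmoid means (polynomial-size state).
    means = [1.0 / (1.0 + math.exp(eta * c)) for c in cum_loss]
    return {vert: math.prod(m if v else 1.0 - m for m, v in zip(means, vert))
            for vert in itertools.product((0, 1), repeat=len(cum_loss))}


def max_gap(cum_loss, eta):
    # Largest absolute difference between the two distributions over all vertices.
    p = exp2_distribution(cum_loss, eta)
    q = product_distribution(cum_loss, eta)
    return max(abs(p[v] - q[v]) for v in p)
```

Calling `max_gap([0.3, -1.2, 2.0], 0.5)` should return a value at the level of floating-point rounding error.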
2.4 Equivalence of PolyExp and OMD with Entropic Regularization
We introduce the OMD algorithm and show that OMD with the Entropic Regularizer for $[0,1]^n$ is equivalent to PolyExp. The regularizer for this domain is:
$$F(x) = \sum_{i=1}^{n} x_i\log x_i + (1-x_i)\log(1-x_i)$$
Here $D_F$ is the Bregman divergence of $F$.
Theorem. The sampling procedure of PolyExp satisfies $\mathbb{E}[X_t] = x_t$. Moreover, the update of OMD with Entropic Regularization is $x_{i,t+1} = \left(1+\exp\left(\eta\sum_{\tau=1}^{t} l_{i,\tau}\right)\right)^{-1}$, the same as PolyExp. This implies that Exp2 is also equivalent to OMD with Entropic Regularization. This statement is not true in general. The only other instance when these two are equivalent is when the decision set is the probability simplex.
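The following sketch illustrates why the two updates coincide, assuming the entropic regularizer is the coordinate-wise binary entropy $\sum_i x_i \log x_i + (1-x_i)\log(1-x_i)$, whose gradient is the logit function. Under that assumption, one unconstrained mirror step is a shift in logit space. The function names are illustrative, and the multiplicative form used for PolyExp is likewise an assumption.

```python
import math


def logit(p):
    # Gradient of the binary-entropy regularizer in one coordinate.
    return math.log(p / (1.0 - p))


def sigmoid(z):
    # Inverse of the logit map.
    return 1.0 / (1.0 + math.exp(-z))


def omd_entropic_step(x, loss, eta):
    # Mirror step: grad F(y) = grad F(x) - eta * l, solved per coordinate.
    # The result already lies in (0, 1), so no projection is applied.
    return [sigmoid(logit(xi) - eta * li) for xi, li in zip(x, loss)]


def polyexp_step(x, loss, eta):
    # Multiplicative form of the same update (assumed PolyExp update).
    return [xi / (xi + (1.0 - xi) * math.exp(eta * li)) for xi, li in zip(x, loss)]
```

Since the mirror step never leaves the open cube, the Bregman projection is the identity, which is the reason the two maps agree coordinate by coordinate.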
2.5 Regret of PolyExp via OMD analysis
Since OMD with Entropic Regularization and PolyExp are equivalent, we can use the standard analysis tools of OMD to derive a regret bound for PolyExp.
Theorem (PolyExp regret). In the full information setting, using , PolyExp attains the regret bound:
This implies a better regret bound for Exp2. This is a very surprising result as we were able to improve Exp2’s regret by its equivalence to OMD and not by directly analyzing Exp2’s regret.
3 Bandit Setting
Algorithms for full information setting can be modified for the bandit setting. The general strategy for the bandit setting is as follows:
where $P_t = \mathbb{E}_{X\sim q_t}[XX^\top]$. Here $p_t$ is the distribution used by the underlying algorithm (either Exp2, PolyExp or OMD), $\mu$ is the exploration distribution and $\gamma$ is the mixing coefficient. Playing from $\mu$ is necessary in order to make sure that $P_t$ is invertible and also to lower bound the smallest eigenvalue of $P_t$. The loss estimate $\tilde{l}_t$ is then used to update the underlying algorithm.
When using PolyExp in the bandit setting, we can sample from $q_t$ by sampling from $p_t$ with probability $1-\gamma$ and sampling from $\mu$ with probability $\gamma$. When using John's exploration, $\mu$ is supported on $O(n^2)$ points, so sampling can be done in polynomial time. We have $P_t = (1-\gamma)\Sigma_t + \gamma\,\mathbb{E}_{X\sim\mu}[XX^\top]$, where the matrix $\Sigma_t$ has elements $\Sigma_t[i,j] = x_{i,t}x_{j,t}$ and $\Sigma_t[i,i] = x_{i,t}$ for all $i \neq j$. The matrix $\mathbb{E}_{X\sim\mu}[XX^\top]$ can be precomputed before the first round. Hence, we can calculate $P_t$ in polynomial time.
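A small sketch of the polynomial-time computation of the mixed second-moment matrix and the resulting loss estimate is given below. It makes one loud simplification: the exploration distribution is taken to be uniform over $\{0,1\}^n$ rather than John's exploration, purely to keep the example short. The estimator form $P^{-1} X (X^\top l)$, the helper names and the tiny Gauss-Jordan solver are all assumptions of this sketch.

```python
import itertools
import math


def second_moment_product(x):
    # E[X X^T] for independent X_i ~ Bernoulli(x_i):
    # x_i on the diagonal (X_i^2 = X_i), x_i * x_j off the diagonal.
    n = len(x)
    return [[x[i] if i == j else x[i] * x[j] for j in range(n)] for i in range(n)]


def mixed_second_moment(x, gamma):
    # (1 - gamma) * Sigma + gamma * E_mu[X X^T], with mu uniform (assumption).
    n = len(x)
    sigma = second_moment_product(x)
    explore = second_moment_product([0.5] * n)
    return [[(1 - gamma) * sigma[i][j] + gamma * explore[i][j] for j in range(n)]
            for i in range(n)]


def solve(a, b):
    # Gauss-Jordan elimination with partial pivoting for a small dense system.
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def loss_estimate(p_matrix, action, observed_loss):
    # One-point estimate: solve P * estimate = X * (X . l).
    return solve(p_matrix, [observed_loss * a for a in action])


def expected_estimate(x, gamma, loss):
    # Brute-force expectation of the estimate under the mixed sampling
    # distribution; only feasible for tiny n, used here as a sanity check.
    n = len(x)
    p_matrix = mixed_second_moment(x, gamma)
    mean = [0.0] * n
    for action in itertools.product((0, 1), repeat=n):
        p = math.prod(xi if a else 1.0 - xi for xi, a in zip(x, action))
        q = (1 - gamma) * p + gamma / 2 ** n
        scalar = sum(a * l for a, l in zip(action, loss))
        est = loss_estimate(p_matrix, action, scalar)
        mean = [mu + q * e for mu, e in zip(mean, est)]
    return mean
```

Under these assumptions the estimate is unbiased, so `expected_estimate(x, gamma, loss)` should reproduce `loss` up to rounding error; only `mixed_second_moment` and `loss_estimate` would be used in an actual round.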
Theorem. In the bandit setting, PolyExp is equivalent to Exp2 and Online Mirror Descent with Entropic regularization when the exploration distribution used by the three algorithms is the same. Hence, PolyExp with John's exploration can be run in polynomial time on $\{0,1\}^n$. Moreover, since it is equivalent to Exp2 with John's exploration, it also has optimal regret.
4 $\{-1,+1\}^n$ Hypercube Case
Full information and bandit algorithms which work on $\{0,1\}^n$ can be modified to work on $\{-1,+1\}^n$. The general strategy is as follows:
Theorem. Playing $X_t \in \{-1,+1\}^n$ using Exp2 directly on the losses $l_t$ is equivalent to sampling $Z_t \in \{0,1\}^n$ using Exp2, PolyExp or OMD and playing $X_t = 2Z_t - \mathbf{1}$, where the sampler is run on the losses $2 l_t$ (or on an estimate of them in the bandit setting).
Hence, using the above strategy, PolyExp can be run in polynomial time on $\{-1,+1\}^n$. Because of the equivalences we have established in the previous sections, PolyExp achieves optimal regret in the full information case and PolyExp with John's exploration achieves optimal regret in the bandit case on $\{-1,+1\}^n$.
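The reduction can be sanity-checked by brute force in the full information setting. The sketch below assumes the map $X = 2Z - \mathbf{1}$ and doubled losses $2l$; under these assumptions the exponential weights distribution on $\{-1,+1\}^n$ matches the one on $\{0,1\}^n$, because the term $-\mathbf{1}^\top l$ is the same for every vertex and cancels on normalization. Function names are illustrative.

```python
import itertools
import math


def exp_weights(vertices, cum_loss, eta):
    # Exponential weights over an explicit list of vertices.
    weights = [math.exp(-eta * sum(v * c for v, c in zip(vert, cum_loss)))
               for vert in vertices]
    total = sum(weights)
    return [w / total for w in weights]


def to_pm_one(z):
    # Bijection from {0,1}^n to {-1,+1}^n.
    return tuple(2 * zi - 1 for zi in z)


def max_reduction_gap(cum_loss, eta):
    # Compare Exp2 on {-1,+1}^n with losses l against Exp2 on {0,1}^n
    # with losses 2l, matching vertices through the bijection above.
    zs = list(itertools.product((0, 1), repeat=len(cum_loss)))
    xs = [to_pm_one(z) for z in zs]
    p_pm = exp_weights(xs, cum_loss, eta)
    p_01 = exp_weights(zs, [2.0 * c for c in cum_loss], eta)
    return max(abs(a - b) for a, b in zip(p_pm, p_01))
```

The returned gap should be at the level of floating-point rounding, for example for `max_reduction_gap([0.4, -1.0, 0.7], 0.3)`.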
5 Conclusions
In this paper, we give a principled way of running the exponential weights algorithm on both the $\{0,1\}^n$ hypercube and the $\{-1,+1\}^n$ hypercube, for both full information and bandit settings, using the PolyExp algorithm. We do this by decomposing the exponential weights distribution as a product of Bernoulli distributions. We also show equivalences to the Exp2 algorithm and OMD with entropic regularization, which are known to achieve the optimal regret, hence establishing the optimality of PolyExp.
Appendix A. Proofs
Lemma (see hazan2016introduction, Theorem 1.5). The Exp2 algorithm satisfies for any :
Proof of the Exp2 regret bound (Section 2.1).
Using and applying expectation with respect to the randomness of the player to the definition of regret (equation 1), we get:
Applying the lemma above, we get . Since for all , we get . Optimizing over the choice of $\eta$, we get the desired regret bound. ∎
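Lemma. We have that
$$\prod_{i=1}^{n}\left(1+\exp\left(-\eta\sum_{\tau=1}^{t-1} l_{i,\tau}\right)\right) = \sum_{Y\in\{0,1\}^n}\exp\left(-\eta\sum_{\tau=1}^{t-1} Y^\top l_\tau\right)$$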
Proof.
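Consider $\prod_{i=1}^{n}\left(1+\exp\left(-\eta\sum_{\tau=1}^{t-1} l_{i,\tau}\right)\right)$. It is a product of $n$ terms, each consisting of $2$ terms, $1$ and $\exp\left(-\eta\sum_{\tau=1}^{t-1} l_{i,\tau}\right)$. On expanding the product, we get a sum of $2^n$ terms. Each of these terms is a product of $n$ terms, either a $1$ or $\exp\left(-\eta\sum_{\tau=1}^{t-1} l_{i,\tau}\right)$. If it is $1$, then $Y_i = 0$, and if it is $\exp\left(-\eta\sum_{\tau=1}^{t-1} l_{i,\tau}\right)$, then $Y_i = 1$. So,
$$\prod_{i=1}^{n}\left(1+\exp\left(-\eta\sum_{\tau=1}^{t-1} l_{i,\tau}\right)\right) = \sum_{Y\in\{0,1\}^n}\prod_{i=1}^{n}\exp\left(-\eta\sum_{\tau=1}^{t-1} l_{i,\tau}\right)^{Y_i} = \sum_{Y\in\{0,1\}^n}\exp\left(-\eta\sum_{\tau=1}^{t-1} Y^\top l_\tau\right)$$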
∎
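As a small worked instance of the expansion argument, take $n = 2$ and write $a_i = \exp(-\eta L_i)$, where $L_i$ stands for the cumulative loss of coordinate $i$. The symbols $a_i$ and $L_i$ are introduced here only for illustration. Each of the four terms of the expanded product corresponds to one vertex $Y$ of the square:

```latex
\begin{align*}
(1 + a_1)(1 + a_2)
  &= \underbrace{1}_{Y=(0,0)}
   + \underbrace{a_1}_{Y=(1,0)}
   + \underbrace{a_2}_{Y=(0,1)}
   + \underbrace{a_1 a_2}_{Y=(1,1)} \\
  &= \sum_{Y \in \{0,1\}^2} a_1^{Y_1} a_2^{Y_2}
   = \sum_{Y \in \{0,1\}^2} \exp\bigl(-\eta\,(Y_1 L_1 + Y_2 L_2)\bigr).
\end{align*}
```

The same bookkeeping fails for a strict subset of the hypercube, since some terms of the expanded product would then have no matching vertex.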
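Proof of the equivalence of Exp2 and PolyExp (Section 2.3).
The proof is via straightforward substitution of the expression for $x_{i,t}$. Writing $L_i = \sum_{\tau=1}^{t-1} l_{i,\tau}$, we have $x_{i,t} = \frac{\exp(-\eta L_i)}{1+\exp(-\eta L_i)}$ and $1 - x_{i,t} = \frac{1}{1+\exp(-\eta L_i)}$, so
$$\prod_{i=1}^{n} x_{i,t}^{X_i}(1-x_{i,t})^{1-X_i} = \frac{\prod_{i=1}^{n}\exp(-\eta L_i)^{X_i}}{\prod_{i=1}^{n}\left(1+\exp(-\eta L_i)\right)} = \frac{\exp\left(-\eta\sum_{\tau=1}^{t-1} X^\top l_\tau\right)}{\prod_{i=1}^{n}\left(1+\exp(-\eta L_i)\right)}$$
The proof is complete by applying the product-expansion lemma above to the denominator. ∎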
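Proof of the equivalence of PolyExp and OMD (Section 2.4).
It is easy to see that $\mathbb{E}[X_{i,t}] = \Pr(X_{i,t} = 1) = x_{i,t}$. Hence $\mathbb{E}[X_t] = x_t$.
Finding $\nabla F(x)_i = \log\frac{x_i}{1-x_i}$ and using it in the condition $\nabla F(y_{t+1}) = \nabla F(x_t) - \eta l_t$, we get
$$\log\frac{y_{i,t+1}}{1-y_{i,t+1}} = \log\frac{x_{i,t}}{1-x_{i,t}} - \eta l_{i,t}, \quad\text{that is,}\quad y_{i,t+1} = \frac{x_{i,t}}{x_{i,t} + (1-x_{i,t})\exp(\eta l_{i,t})}$$
Since $y_{t+1}$ is always in $[0,1]^n$, the Bregman projection step is not required. So we have $x_{t+1} = y_{t+1}$, which gives the same update as PolyExp. ∎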
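Bregman Divergence: Let $F(x)$ be a convex function. The Bregman divergence $D_F(x \| y)$ is:
$$D_F(x \| y) = F(x) - F(y) - \nabla F(y)^\top (x - y)$$
Fenchel Conjugate: The Fenchel conjugate of a function $F$ is:
$$F^\star(\theta) = \sup_{x}\left(x^\top\theta - F(x)\right)$$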
Lemma (see bubeck2012regret, Theorem 5.6). For any , OMD with regularizer with effective domain and is differentiable on satisfies:
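Lemma. The Fenchel Conjugate of $F(x) = \sum_{i=1}^{n} x_i\log x_i + (1-x_i)\log(1-x_i)$ is:
$$F^\star(\theta) = \sum_{i=1}^{n}\log\left(1+\exp(\theta_i)\right)$$
Proof.
Differentiating $x^\top\theta - F(x)$ with respect to $x_i$ and equating to $0$:
$$\theta_i - \log x_i + \log(1-x_i) = 0 \quad\Longrightarrow\quad x_i = \frac{1}{1+\exp(-\theta_i)}$$
Substituting this back in $x^\top\theta - F(x)$, we get $F^\star(\theta) = \sum_{i=1}^{n}\log(1+\exp(\theta_i))$. It is also straightforward to see that $\nabla F^\star(\theta)_i = \frac{1}{1+\exp(-\theta_i)}$. ∎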
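The closed form of the conjugate can be checked numerically. The sketch below assumes the regularizer is the coordinate-wise binary entropy and that the supremum is attained at the sigmoid of $\theta$; it evaluates $x^\top\theta - F(x)$ at that point and compares the result with $\sum_i \log(1+\exp(\theta_i))$. The function names are illustrative.

```python
import math


def binary_entropy_regularizer(x):
    # F(x) = sum_i x_i log x_i + (1 - x_i) log(1 - x_i), with 0 log 0 taken as 0.
    def term(p):
        if p in (0.0, 1.0):
            return 0.0
        return p * math.log(p) + (1.0 - p) * math.log(1.0 - p)
    return sum(term(p) for p in x)


def conjugate_closed_form(theta):
    # Claimed closed form: sum_i log(1 + exp(theta_i)).
    return sum(math.log1p(math.exp(t)) for t in theta)


def objective(x, theta):
    # The function x . theta - F(x) whose supremum defines the conjugate.
    return sum(xi * t for xi, t in zip(x, theta)) - binary_entropy_regularizer(x)


def conjugate_at_stationary_point(theta):
    # Evaluate the objective at x_i = sigmoid(theta_i), the stationary point.
    x = [1.0 / (1.0 + math.exp(-t)) for t in theta]
    return objective(x, theta)
```

Because the objective is concave in $x$, the stationary point is the maximizer, so perturbing it in any direction inside the cube should not increase the objective.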
For any , OMD with entropic regularizer satisfies:
Proof.
We start from the OMD regret lemma above. Using the fact that , we get . Next we bound the Bregman term using the Fenchel conjugate lemma above.
In the last term, . So the last term is . The first two terms can be simplified as:
Using the fact that :
Using the inequality:
Using the inequality:
The Bregman term can be bounded by Hence, we have:
∎
Proof of the PolyExp regret bound (Section 2.5).
Applying expectation with respect to the randomness of the player to the definition of regret (equation 1), we get:
Applying the lemma above, we get . Using the fact that , we get . Optimizing over the choice of $\eta$, we get the desired regret bound. ∎
Proof of the bandit equivalence (Section 3).
The bandit setting differs from the full information case in two ways. First, we sample from the distribution $q_t$, which is formed by mixing $p_t$ with the distribution $\mu$ using a mixing coefficient $\gamma$. Second, an estimate $\tilde{l}_t$ of the loss vector $l_t$ is formed via one-point linear regression. This vector is used to update the algorithm. In the full information case, we have already established that the updates of Exp2, PolyExp and OMD with entropic regularization are equivalent. As the same exploration distribution is used, $q_t$ will be the same for the three algorithms. Since $\tilde{l}_t$ is used in the update, $p_{t+1}$ will be the same for the three algorithms. Hence, even in the bandit setting, these algorithms are equivalent. ∎
Lemma. Exp2 on $\{-1,+1\}^n$ with losses $l_t$ is equivalent to Exp2 on $\{0,1\}^n$ with losses $2 l_t$ and using the map $X_t = 2Z_t - \mathbf{1}$ to play on $\{-1,+1\}^n$.
Proof.
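Consider the update equation for Exp2 on $\{-1,+1\}^n$:
$$p_{t+1}(X) = \frac{\exp\left(-\eta\sum_{\tau=1}^{t} X^\top l_\tau\right)}{\sum_{Y\in\{-1,+1\}^n}\exp\left(-\eta\sum_{\tau=1}^{t} Y^\top l_\tau\right)}$$
Every $X \in \{-1,+1\}^n$ can be mapped to a $Z \in \{0,1\}^n$ using the bijective map $X = 2Z - \mathbf{1}$. The factor $\exp\left(\eta\sum_{\tau=1}^{t}\mathbf{1}^\top l_\tau\right)$ is common to the numerator and the denominator and cancels. So:
$$p_{t+1}(X) = \frac{\exp\left(-\eta\sum_{\tau=1}^{t} Z^\top (2 l_\tau)\right)}{\sum_{W\in\{0,1\}^n}\exp\left(-\eta\sum_{\tau=1}^{t} W^\top (2 l_\tau)\right)}$$
This is equivalent to updating Exp2 on $\{0,1\}^n$ with the loss vector $2 l_t$. ∎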
Proof of the hypercube reduction (Section 4).
We sample from in full information and in bandit setting. Then we play . So we have that . In the full information setting, we get . But in bandit setting, we need to find , where . Since , this can be transformed to . Since is used in full information case and is used in the bandit case to update the algorithm, by the lemma above we have that . Hence the equivalence. ∎