We address the problem of learning in an online, bandit setting where the
learner must repeatedly select among $K$ actions, but only receives partial
feedback based on its choices. We establish two new facts: First, using a new
algorithm called Exp4…