Unreliable MultiArmed Bandits: A Novel Approach to Recommendation Systems
^{†}^{†}thanks: *Equal contribution, first author was assigned randomly
Abstract
We use a novel modification of MultiArmed Bandits to create a new model for recommendation systems. We model the recommendation system as a bandit seeking to maximize reward by pulling on arms with unknown rewards. The catch however is that this bandit can only access these arms through an unreliable intermediate that has some level of autonomy while choosing its arms. For example, in a streaming website the user has a lot of autonomy while choosing content they want to watch. The streaming sites can use targeted advertising as a means to bias opinions of these users. Here the streaming site is the bandit aiming to maximize reward and the user is the unreliable intermediate. We model the intermediate as accessing states via a Markov chain. The bandit is allowed to perturb this Markov chain. We prove fundamental theorems for this setting after which we show a closetooptimal ExploreCommit algorithm.
I Introduction and Problem Formulation
MultiArmed Bandits and algorithms to characterize the explorationexploitation tradeoff have been studied for a long time in various literature [1, 5]. It’s use in recommendation systems has also been studied in various literature [3, 2].
However most literature analyses recommendation systems from a dataacquisition and use perspective, usually fails to address the question on how whether the cost incurred in collecting and using this data for recommendation systems is worth the reward generated from how they bias autonomous user choices. We attempt to do that by developing the Unreliable MultiArmed Bandits framework, and prove a few fundamental bottlenecks on the cost incurred by recommendation systems
A armed bandit is defined by Random Variables (RV) . The RVs refers to reward produced by arm at the time instant. Here is independent of if and additionally identically distributed if . These rewards are governed by an unknown underlying distribution (hence with an unknown mean ). The optimal arm is the arm such that is mean reward of that arm.
Let an unreliable intermediate visit different arms by traversing an irreducible and aperiodic Markov chain with state space and a transition probability matrix .
The state space S is the set of arms and has a cardinality and models the autonomy of the intermediate in pulling the arms. This Markovian structure shows the dependence of future choices of the intermediate based on the current choice. For example a person who watches a “Horror” movie and likes it, is likely to pick the next movie they watch as a “Horror” movie as well. In our analysis we make the assumption that is known. This is justified because recommendation systems usually possess enough data about the user to make predictions on the user’s behaviour. Thus even if is initially unknown, it is easily estimated. Figure 1a) shows the structure for a simplified 2state Markov chain
Let be known as the perturbation variable. This variable is used to model the effect of recommendation systems. The streaming site can use methods like “targeted advertising” to bias the user choices (i.e. Suggesting they pick another movie, say a “Documentary” which increases the chance they pick that movie). Therefore is used to perturb the probability of transition from a particular state. The following rules regarding is followed,

This is added and the transition probabilities are suitably modified to maintain a sum to one

If this modification causes negative probabilities (probabilities greater than ) to arise, the probabilities are suitably truncated to 0 ()

We make the simplifying assumption that is a prefixed parameter, future work will deal with more realistic scenarios
Figure 1b) shows the perturbed version of the Markov chain
Let , be a perturbation policy which specifies the perturbation of . We define Regret as , where refers to the real reward obtained at time instant . Therefore the Cumulative Regret () after plays / biases is,
(1) 
Here is the optimal policy, (i.e. the one that has least cumulative regret in expectation). The goal is to minimize the over .
The structure of the paper is as follows: First we state 2 theorems on the regret bound and optimal policy of our setup. Following the proofs of these theorems, we look at a heuristic explorecommit algorithm based inspired by these theorems and show (via simulations) that it is near optimal for enough number of states (cardinality of state space is reasonably large)
Ii Main Results
Our first theorem characterizes which policy is optimal to maximize reward obtained. We first define a few concepts to make the theorem clear.
Definition 1. The bandit is said to have a “genie” available if is known to the bandit for all
Definition 2. (When genie is available) A policy is a policy that perturbs to , where is the modified transition probability matrix that maximizes the probability of transitioning to the state with maximum mean reward (from any state)
Theorem 1. The optimal policy (when genie is available) is
The proof for this theorem for the case where the recommendation system happens after the perturbed Markov chain attains it’s stationary distribution. This is easily justified since we can allow the Markov chain to evolve from and start our observations from The proof generalises to the finite case (where the Markov chain doesn’t attain it’s stationary distribution) quite easily, but we exclude it for brevity. The importance of this theorem is that it provides us a heuristic to construct the explorecommit algorithm discussed in Section IV.
Our next theorem shows that there exists a fundamental bottleneck on the rewards that can be extracted from the setup. This is intuitively due to the fact that even if the bandit knows exactly which state gives what reward, the bandit doesn’t have full control over which arm it can pull at each instant. The following theorem asserts that this bottleneck is in fact linear.
Theorem 2. For any , the can be lower bounded (in expectation) as,
(2) 
where is the sum of the means of each the underlying distribution of each arm, weighted by the stationary probability distribution of the transition probability matrix after policy is applied on the Markov Chain
Again for brevity the proof technique in the next section assumes that the perturbed Markov chain attains it’s stationary distribution. The theorem holds true in order (both are lower bounded by functions which are linear in ) even otherwise as well, though some of the constants vary. The importance of this theorem is that it gives us a metric to evaluate our algorithm against. The explorecommit algorithm described in Section IV is observed to be near optimal and robust with respect to this theorem.
Iii Proofs
Let be a vector representing the stationary probability distribution of the Markov chain. This is defined as the solution to the equation . This equation has a unique solution since the Markov chain is irreducible and aperiodic. When T is infinity this stationary distribution defines the amount of time (in expectation) spent in each state
Proof of Theorem 1:
Without loss of generality assume is a decreasing function of
We prove this by induction on the number of states of the Markov chain
Base case: Let . Then the total reward () in steps is,
(3) 
(4) 
Since , We can rewrite (4) as
(5) 
Since , maximizing is our goal
It is easy to see that policy maximizes (and hence the expression).
Inductive step: Assume the theorem is true for states. We now prove that the theorem is true for states. Note that,
(6) 
Now (6) can be rewritten as,
(7) 
Take the first term in (7). We can normalise this by dividing this term by . Let . There a Markov chain with states that has a unique stationary probability distribution .
By induction hypothesis, this new Markov Chain is maximised by a policy. Note here that B doesn’t depend when is specified. Since (7)
only depends on , the policy applied on also maximises (7). This proves the first theorem. A similar analysis also follows for a finite (though the calculations are a bit cumbersome)
Proof of Theorem 2: We note the following facts

If a genie is available, the cumulative regret of the original problem can only go down, in expectation

When genie is available, Theorem 1 states that a policy is optimal

A policy perturbs to
The above observation tells us that calculating when genie is available, for the policy will give us a lower bound on for any general policy and unknown reward distributions. Let the setting required be .
To this end lets calculate .
(8)  
(9)  
(10) 
It is easy to see that . This proves theorem 2.
Iv Algorithms
Definition 3: A policy is a policy that perturbs to , where is the modified transition probability matrix when Markov chain is biased towards state
We construct the Perturbation Explore (), Perturbation Commit algorithm based on insights from Theorem 1. The intuition behind the algorithm is as follows,

We explore “enough” so that we get a distinguishable estimate of the means of each of the arms. To do this we perturb the Markov chain towards the arm that is least explored at a given time instant (i.e. most uncertain)

Since after exploration we are pretty much the “genie” now, theorem 1 kicks in. We commit to the arm with highest empirical reward by using the policy
Note: We haven’t characterized exactly how to choose in the above algorithm. Here we just choose manually based on simulations. Future work will deal with proving a theoretically optimal choice of .
In the upcoming section we compare our algorithm two other algorithms, Upper Confidence Bound (UCB) [1] and a Greedy algorithm. But slight modifications need to be made to these algorithms to fit our problem statement.
The UCB algorithm remains same, except that in our case we cannot guarantee that we pick the arm with the most confidence in each timestep. Rather we can only bias the Markov chain so that it is more likely to pick said arm. Therefore we compare our algorithm keeping this in mind.
The vanilla Greedy algorithm [1] chooses a random arm with probability . The structure of our Markov chain allows this exploration anyway, since there is always a nonzero chance to pull some suboptimal arm.
V Simulations
The simulations in Figure 2 are done for a 10state Markov chain with randomly generated transition probability matrix and rewards such that is a decreasing function of . As required by the problem statement, the algorithms don’t have an access to the rewards Apriori.
We compare our algorithm via simulation with the modified UCB and greedy algorithm displayed in the previous section. We plot the cumulative regret plots for three different values of .
Notice that a purely greedy strategy is unreliable as compared to method as we have no guarantee the exploration is sufficient to distinguish the optimal arm from the suboptimal ones. Which could lead us to be pulling the suboptimal arms for a long time.
It is apparent that UCB is a suboptimal strategy for this setting since it is in violation of our Theorem 1 and is often stuck trying to pull suboptimal arms for too long.
is theoretically supported by Theorem 1 and is empirically very close to the Oracle “Genie”) making it close to optimal.
An interesting observation that can be made here is that as the perturbation value increases the cumulative regret decreases for our algorithm. This is consistent with the hypothesis that increasing the influence of the recommendation system will give improved rewards. This is plotted in Figure 3.
Vi Conclusion
Our work has propensity for future work. The first of this would be to generalise the notion of biasing, such that the bias is a) state dependent and b) apply to multiple transition probabilities. The second would be to make the model more realistic in terms of the intermediate’s behavior. For example, it is unrealistic to assume that in the example we use that user would watch the same movies multiple times and derive the same reward from it. We therefore need to bring in a rotting bandit [4] or restless bandit [6] assumption on the rewards. We also haven’t taken into account dynamic updations of the Markov chain (because the way users make choices evolves over time). Moreover, there is propensity to extend the probabilistic behavior of the intermediate to more than a Markovian behavior.
References
 [1] Peter Auer, Nicolò CesaBianchi, and Paul Fischer. Finitetime analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, May 2002.
 [2] Djallel Bouneffouf, Amel Bouzeghoub, and Alda Lopes Gançarski. A contextualbandit algorithm for mobile contextaware recommender system. In Tingwen Huang, Zhigang Zeng, Chuandong Li, and Chi Sing Leung, editors, Neural Information Processing, pages 324–331, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
 [3] Y. Deshpande and A. Montanari. Linear bandits in high dimension and recommendation systems. In 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1750–1754, Oct 2012.
 [4] Nir Levine, Koby Crammer, and Shie Mannor. Rotting bandits. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3074–3083. Curran Associates, Inc., 2017.
 [5] Steven L. Scott. A modern bayesian look at the multiarmed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010.
 [6] C. Tekin and M. Liu. Online learning of rested and restless bandits. IEEE Transactions on Information Theory, 58(8):5588–5611, Aug 2012.