# Contextual Bandits Evolving Over Finite Time

###### Abstract

Contextual bandits face the same exploration-exploitation trade-off as standard multi-armed bandits. When positive externalities that decay with time are added, the problem becomes much more difficult, because wrong decisions made early are hard to recover from. We explore existing policies in this setting and highlight their biases towards the inherent reward matrix. We propose a rejection-based policy that achieves low regret irrespective of the structure of the reward probability matrix.

† Joint first authors.

## I Introduction

In restaurant recommendation systems, users can generally be grouped into types with different preferences over restaurants. In addition, reviews left by past users can change the proportion of users preferring one restaurant over another as time passes.

We consider a setting in which the users (customers) are classified into customer types according to the restaurant they like the most. We propose an algorithm for a recommendation platform: whenever a new user arrives, the platform suggests one of the restaurants, and a binary reward is generated from the review the user provides. The platform knows the type of the incoming user but not the user-restaurant reward probabilities. Further, when recommending a particular restaurant yields a positive reward, the population of people preferring that restaurant increases. This can lead to self-reinforcing behavior, termed positive externalities [katz]. We model this increase in population as decaying with time. This is intuitive: the effects of recommendations generally saturate over time, leading to an equilibrium in the population distribution of customers.

We model this setting as an evolving contextual bandits problem in which the user type is the context and the restaurants are the bandit arms. The population distribution of the context changes according to the arms pulled and the rewards accrued. A trivial way to maximise the total reward in such a setting is to keep showing the arm with the maximum probability of being accepted, irrespective of the context. Although such a policy guarantees minimum regret over an infinite time horizon, it does not guarantee minimum regret over a short time horizon, which is usually the relevant case in such settings. Moreover, because the externalities decay, suboptimal decisions made at the beginning increase the regret in a way that can be difficult to compensate for within a short time horizon.

## II Previous Work

Contextual bandits have been explored in various works [auernon, langford, li2010]. Much of this work addresses a non-evolving setting, in which the incoming population of the different contexts is not affected by the arms pulled or the rewards accrued.

Evolving bandits have been explored by [Virag], who develop policies to minimize regret in a similar setting. However, they highlight that their model differs from contextual bandits. Furthermore, in contrast to their setting of externalities, our evolution decays with time, which makes the problem more difficult because wrong decisions at the start are harder to correct.

## III Setting

### III-A Context and Arm Rewards

Let be the types of context that can arrive at any time instant. Let the set of arms be (m n). At each time instant, the context is sampled from a distribution . Here, is a x1 array where denotes the population density of customer type at time . For each such context, an arm is pulled and the obtained reward is 0 or 1.

Let reward obtained at time be . The cumulative reward till time is defined as . Similarly denotes reward accrued till time by pulling arm on arrivals of type . Also, and denote the number of times arm was pulled on arrival of context and the number of such instances with 0 reward (ie, the number of times the arm was rejected for that user type), respectively. Thus, for all and .
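The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `alpha`, `P`, `sample_context`, `pull`, and `Counters` are hypothetical, the numeric values are made up, and arms are chosen uniformly at random purely to exercise the counters.

```python
import random

def sample_context(alpha, rng):
    """Draw a context index with probability proportional to alpha."""
    return rng.choices(range(len(alpha)), weights=alpha, k=1)[0]

def pull(P, context, arm, rng):
    """Return a Bernoulli reward with success probability P[context][arm]."""
    return 1 if rng.random() < P[context][arm] else 0

class Counters:
    """Per (context, arm) bookkeeping: pulls, rewards, and rejections."""

    def __init__(self, m, n):
        self.pulls = [[0] * n for _ in range(m)]
        self.rewards = [[0] * n for _ in range(m)]
        self.rejections = [[0] * n for _ in range(m)]

    def update(self, context, arm, reward):
        self.pulls[context][arm] += 1
        self.rewards[context][arm] += reward
        self.rejections[context][arm] += 1 - reward  # a 0 reward is a rejection

# Hypothetical example values (not taken from the paper).
rng = random.Random(0)
alpha = [0.5, 0.5]            # population distribution over 2 context types
P = [[0.8, 0.3], [0.2, 0.6]]  # reward probabilities, rows = context types
counts = Counters(2, 2)
for _ in range(1000):
    i = sample_context(alpha, rng)
    a = rng.randrange(2)      # arbitrary arm choice, for illustration only
    counts.update(i, a, pull(P, i, a, rng))
```

Because every pull yields either reward 1 or a rejection, the pull count of each (context, arm) pair always equals its reward count plus its rejection count.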

### III-B Reward Probabilities

The probability of getting reward 1 for context and arm is equal to . Thus, the reward probabilities can be compactly represented by the matrix:

In the reward probability matrix, the maximum element of each row is the diagonal entry of that row; we call this the “maxima along the diagonal” structure. This allows every user type to have a unique “most-preferred” arm. Thus, each diagonal entry is strictly larger than every other entry in its row. Further, without loss of generality, we arrange the rows in decreasing order of their highest elements. Thus, the diagonal entries appear in decreasing order from the first row to the last.
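As a small illustration of the structure just described, the following sketch checks both properties for a matrix stored as a list of rows. The function names and the example matrices are hypothetical, and the sketch assumes there are at least as many arms as context types so that every row has a diagonal entry.

```python
def has_diagonal_maxima(P):
    """True if, in every row, the diagonal entry strictly exceeds all others."""
    for i, row in enumerate(P):
        if any(p >= row[i] for j, p in enumerate(row) if j != i):
            return False
    return True

def rows_sorted_by_diagonal(P):
    """True if the diagonal entries are in non-increasing order."""
    diag = [P[i][i] for i in range(len(P))]
    return all(x >= y for x, y in zip(diag, diag[1:]))

# Hypothetical matrices: the first satisfies both properties, the second does not.
good = [[0.8, 0.3], [0.2, 0.6]]
bad = [[0.3, 0.8], [0.2, 0.6]]
```

The strict inequality in `has_diagonal_maxima` reflects the requirement that each user type has a unique most-preferred arm.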

### III-C Evolution of the Population Distribution

We consider a setting where the population distribution of user types changes only when the reward accrued is 1. Thus, at time instant if context arrived, arm was pulled and reward was , is updated as:

(Normalization) |

where is a constant indicative of the step-size.

We can see that any non-decreasing function of can be used instead of . We restricted ourselves to functions of the form as they form ODEs that can be solved in closed form. Further we chose to be 2 as it was high enough to have appreciable change in the distribution and low enough to guarantee alpha will saturate.
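One possible reading of this update rule is sketched below. The exact form is an assumption, not taken from the text: on a reward of 1, the population entry of the type whose preferred arm was pulled is increased by `step / t**k`, and the vector is then renormalized. The names `evolve`, `step`, and `k` are hypothetical; the defaults 0.01 and 2 follow the values quoted in the paper for the step size and the chosen constant.

```python
def evolve(alpha, arm, reward, t, step=0.01, k=2):
    """Return the updated, renormalized population distribution.

    Assumption: a reward of 1 on `arm` adds step / t**k to the entry of the
    context type whose most-preferred arm is `arm` (same index), and the
    distribution is left unchanged otherwise.
    """
    if t < 1:
        raise ValueError("time instants are assumed to start at 1")
    new = list(alpha)
    if reward != 1 or arm >= len(alpha):
        return new  # no evolution without a positive reward
    new[arm] += step / (t ** k)  # decaying increment
    total = sum(new)
    return [x / total for x in new]  # normalization step

early = evolve([0.5, 0.5], arm=0, reward=1, t=1)
late = evolve([0.5, 0.5], arm=0, reward=1, t=10)
```

Under this assumed form, a reward at `t=1` shifts the distribution far more than the same reward at `t=10`, which is the sense in which early decisions are hard to undo.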

## IV Policies

We have explored various policies for different settings. In this section, we briefly describe these policies and then introduce our own policy, Rejection-Based Arm Elimination (RBAE), at the end.

### IV-A Oracle

It is easy to see that if we know the underlying reward matrix, in an infinite time horizon, the best policy would be to pull the arm with highest reward probability (for all arms for all contexts). Thus for any context, arm is pulled where is .

### IV-B Greedy-Oracle

This policy assumes knowledge of the “maxima along the diagonal” structure of the probability matrix and always recommends the most-preferred arm of the arriving user type, i.e., the arm on the diagonal of that type's row.
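The difference between the two oracle policies can be made concrete with a short sketch. The function names and the example matrix are hypothetical; the matrix is stored as a list of rows, one per context type, and is assumed to have the diagonal-maxima structure.

```python
def oracle_arm(P):
    """Arm (column) holding the single largest entry of P.

    The Oracle pulls this arm for every context.
    """
    n_arms = len(P[0])
    return max(range(n_arms), key=lambda a: max(row[a] for row in P))

def greedy_oracle_arm(context):
    """Under the diagonal-maxima structure, type i's best arm is arm i."""
    return context

# Hypothetical reward probabilities (not taken from the paper).
P = [[0.8, 0.3], [0.2, 0.6]]
```

With this matrix, the Oracle always pulls arm 0 even for context 1, accepting a reward probability of 0.2 instead of 0.6, whereas Greedy-Oracle pulls arm 1 for context 1.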

### IV-C Random Explore then Commit (REC)

For a pre-defined exploration time, pull arms uniformly at random (exploration). After the exploration time has elapsed, pull the arm that has accrued the highest reward for the arriving user type (exploitation).
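A minimal sketch of REC follows. The class and parameter names are hypothetical, time instants are assumed to start at 1, and ties in the exploitation phase are broken in favour of the lowest arm index, which the text does not specify.

```python
import random

class RandomExploreThenCommit:
    """Explore uniformly at random, then commit per context type."""

    def __init__(self, m, n, explore_time, seed=0):
        self.n = n
        self.explore_time = explore_time
        self.rng = random.Random(seed)
        # Cumulative reward per (context, arm) pair.
        self.rewards = [[0] * n for _ in range(m)]

    def choose(self, context, t):
        if t <= self.explore_time:
            return self.rng.randrange(self.n)  # exploration
        row = self.rewards[context]
        # Exploitation: arm with the highest accrued reward for this context.
        return max(range(self.n), key=lambda a: row[a])

    def update(self, context, arm, reward):
        self.rewards[context][arm] += reward
```

A caller would invoke `choose(context, t)` on each arrival and then `update(context, arm, reward)` once the reward is observed.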

### IV-D Balanced Exploration (BE)

[Virag] proposes an algorithm that, in contrast with REC, structures exploration by balancing it across arms: every arm must accrue at least a minimum reward before the optimal arm is decided. We implement a version of this algorithm modified for our setting, in which exploration is done separately for each context type.
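The idea can be sketched as follows. This is not the algorithm of [Virag] nor the exact variant used in our simulations, only an illustration under assumptions: during exploration the arm with the least accrued reward for the arriving context is pulled, and once every arm has reached the minimum reward the policy commits to the arm with the highest empirical success rate. All names are hypothetical.

```python
class BalancedExploration:
    """Per-context balanced exploration followed by commitment."""

    def __init__(self, m, n, min_reward):
        self.n = n
        self.min_reward = min_reward
        self.pulls = [[0] * n for _ in range(m)]
        self.rewards = [[0] * n for _ in range(m)]

    def choose(self, context):
        r, p = self.rewards[context], self.pulls[context]
        if min(r) < self.min_reward:
            # Exploration: pull the arm that is furthest behind in reward.
            return min(range(self.n), key=lambda a: r[a])
        # Exploitation (assumed rule): highest empirical success rate.
        return max(range(self.n), key=lambda a: r[a] / max(p[a], 1))

    def update(self, context, arm, reward):
        self.pulls[context][arm] += 1
        self.rewards[context][arm] += reward
```

Note how an arm with a low reward probability keeps being pulled until it reaches `min_reward`, which is the source of the long exploration phase discussed next.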

### IV-E Rejection-Based Arm Elimination (RBAE)

A problem with BE is that when the reward probabilities are low, the exploration phase takes longer to complete, which can increase the regret, especially when the time horizon of interest is small. To overcome this limitation, we propose a rejection-based policy in which a sub-optimal arm for a user type is eliminated when the number of rejections for that arm crosses a threshold. We believe that such a policy discards highly sub-optimal arms early, thereby decreasing the accumulated regret.
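A minimal sketch of the elimination rule is given below. Several details are assumptions rather than part of the description above: among the arms still active for a context, the one with the fewest pulls is chosen; an arm is eliminated when its rejection count reaches the threshold; and the last remaining arm is never eliminated. All names are hypothetical.

```python
class RejectionBasedArmElimination:
    """Per-context arm elimination driven by rejection counts."""

    def __init__(self, m, n, max_rejections):
        self.max_rejections = max_rejections
        self.active = [list(range(n)) for _ in range(m)]
        self.pulls = [[0] * n for _ in range(m)]
        self.rejections = [[0] * n for _ in range(m)]

    def choose(self, context):
        # Assumed selection rule: least-pulled arm among the active ones.
        return min(self.active[context], key=lambda a: self.pulls[context][a])

    def update(self, context, arm, reward):
        self.pulls[context][arm] += 1
        if reward == 1:
            return
        self.rejections[context][arm] += 1
        arms = self.active[context]
        crossed = self.rejections[context][arm] >= self.max_rejections
        # Keep at least one arm per context so a choice always exists.
        if crossed and arm in arms and len(arms) > 1:
            arms.remove(arm)
```

Unlike a reward-based exploration rule, an arm with a low reward probability collects rejections quickly and is dropped after few pulls.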

## V Simulations and Discussion

For simulations, we choose 2 types of context and 2 bandit-arms i.e. . 500 iterations were run with each iteration lasting for time instances. The step size of the distribution update was chosen as 0.01. For REC, was used as the exploration time. For BE and RBAE, the thresholds and were both taken as .

Figures 2 and 2 show the aggregate regret and evolution of over time, respectively, with the initial distribution fixed as , for different values of .

Panel (a) uses a probability matrix with a sufficiently large difference between the reward probabilities of the optimal and sub-optimal arms. BE therefore accumulates a large regret in the exploration phase, as it keeps sampling sub-optimal arms until they reach the minimum desired reward. RBAE accrues the least regret, and REC is second best. Panel (b) uses a probability matrix with relatively high reward probabilities for all arms and all contexts. In this case, BE and RBAE perform much better than REC. This can be attributed to the small difference between the rewards of the optimal and sub-optimal arms, which leads REC to make wrong decisions more often. Panel (c) shows the average accumulated regret of the policies across 1250 iterations, with the probability matrix randomly changed after every 50 iterations. This was done to remove the biases that the policies have towards certain types (relative values) of the probability matrix. In this case, RBAE performs best, closely followed by REC and then by BE. Note that the “Oracle” always incurs a higher regret over small time horizons, as it trades off regret to increase the share of the context with the highest possible expected reward. This can be seen in the plots of Figure 2, where Oracle increases this share to a significantly higher value than all the other policies.

Figures 2(a) and 2(b) show the regret and evolution of for a different value of initial distribution , this time starting with a low value of . This can correspond to a setting where a new restaurant enters the market, with a low proportion of customers preferring the entrant initially. 2(c) shows aggregate regret for the same , but averaged over random values of . Again, RBAE outperforms BE and REC.

## VI Conclusions and Future Work

We present a new policy, “Rejection-Based Arm Elimination” (RBAE), and demonstrate its efficacy in a decaying positive externality setting compared with previously known policies. This policy eliminates arms based on the individual rejections accrued, thereby performing better in terms of accrued regret irrespective of the inherent reward probabilities. We also demonstrate that the other policies can perform well when the probability matrix satisfies certain conditions, whereas RBAE performs well in all the cases considered.

In future work, we plan to examine and exploit the correlation between the optimal arms of different customer types and use this extra information to improve expected reward.