Personalized Advertisement Recommendation: A Ranking Approach to Address the Ubiquitous Click Sparsity Problem
Abstract
We study the problem of personalized advertisement recommendation (PAR), which consists of a user visiting a system (website) and the system displaying one of $K$ ads to the user. The system uses an internal ad recommendation policy to map the user’s profile (context) to one of the ads. The user either clicks or ignores the ad and, correspondingly, the system updates its recommendation policy. The PAR problem is usually tackled by scalable contextual bandit algorithms, where the policies are generally based on classifiers. A practical problem in PAR is extreme click sparsity, due to very few users actually clicking on ads. We systematically study the drawback of using contextual bandit algorithms with classifier-based policies in the face of extreme click sparsity. We then suggest an alternate policy, based on rankers, learnt by optimizing the Area Under the Curve (AUC) ranking loss, which can significantly alleviate the problem of click sparsity. We conduct extensive experiments on public datasets, as well as three industry proprietary datasets, to illustrate the improvement in click-through rate (CTR) obtained by using the ranker-based policy over classifier-based policies.
Sougata Chaudhuri Department of Statistics University of Michigan, Ann Arbor Georgios Theocharous Adobe Big Data Experience Lab Mohammad Ghavamzadeh Adobe Big Data Experience Lab
Introduction
A personalized advertisement recommendation (PAR) system is intrinsic to many major tech companies like Google, Yahoo, Facebook and others. The particular PAR setting we study here consists of a policy that displays one of the $K$ possible ads/offers when a user visits the system. The user’s profile is represented as a context vector, consisting of relevant information like demographics, geolocation, frequency of visits, etc. Depending on whether the user clicks on the ad, the system gets a reward of value $1$ or $0$, which in practice translates to dollar revenue. The policy is (continuously) updated from historical data, which consist of tuples of the form {user context, displayed ad, reward}. We will, in this paper, concern ourselves with PAR systems that are geared towards maximizing the total number of clicks.
The plethora of papers written on the PAR problem makes it impossible to provide an exhaustive list. Interested readers may refer to a recent paper by a team of researchers at Google [McMahan and others, 2013] and references therein. While the techniques in different papers differ in their details, the majority of them can be analyzed under the umbrella framework of contextual bandits [?]. The term bandit refers to the fact that the system only gets to see the user’s feedback on the ad that was displayed, and not on any other ad. Bandit information leads to difficulty in estimating the expected reward of a new or a relatively unexplored ad (the cold start problem). Thus, contextual bandit algorithms, during prediction, usually balance between exploitation and exploration. Exploitation consists of predicting according to the current recommendation policy, which usually selects the ad with the maximum estimated reward, and exploration consists of systematically choosing some other ad to display, to gather more information about it.
Most contextual bandit algorithms aim to learn a policy that is essentially some form of multi-class classifier. For example, one important class of contextual bandit algorithms learns a classifier per ad from the batch of data [?; ?; ?] and converts it into a policy that displays the ad with the highest classifier score to the user (exploitation). Some exploration techniques, like explicit $\epsilon$-greedy [?; ?] or implicit Bayesian-type sampling from the posterior distribution maintained on classifier parameters [?], are sometimes combined with this exploitation strategy. Other, more theoretically sophisticated online bandit algorithms essentially learn a cost-sensitive multi-class classifier by updating after every round of user-system interaction [?; ?].
Despite the fact that PAR has always been mentioned as one of the main applications of contextual bandit (CB) algorithms, there has not been much investigation into the practical issues raised by using classifier-based policies for PAR. The potential difficulty in using such policies in PAR stems from the problem of click sparsity, i.e., very few users actually ever click on online ads, and this lack of positive feedback makes it difficult to learn good classifiers. Our main objective here is to study this important practical issue, and we list our contributions:

We detail the framework of contextual bandit algorithms and discuss the problem associated with click sparsity.

We suggest a simple ranker-based policy to overcome the click sparsity problem. The rankers are learnt by optimizing the Area Under the Curve (AUC) ranking loss via stochastic gradient descent (SGD) [Shamir and Zhang, 2013], leading to a highly scalable algorithm. The rankers are then combined to create a recommendation policy.

We conduct extensive experiments to illustrate the improvement provided by our suggested method over both linear and ensemble classifier-based policies for the PAR problem. Our first set of experiments compares deterministic policies on publicly available classification datasets that are converted to bandit datasets following standard techniques. Our second set of experiments compares stochastic policies on three proprietary bandit datasets, for which we employ a high confidence offline contextual bandit evaluation technique.
Contextual Bandit (CB) Approach to PAR
The main contextual bandit algorithms can be largely divided into two classes: those that make specific parametric assumptions about the reward generation process and those that simply assume that contexts and rewards are generated i.i.d. from some distribution. The two major algorithms in the first class are LinUCB [?] and Thompson sampling [?]. Both algorithms assume that the reward of each ad (arm) is a continuous linear function of some unknown parameter, which is not a suitable assumption for the click-based binary reward in PAR. Moreover, both algorithms assume that there is context information available for each ad, while we assume availability of only user context in our setting. Thus, from now on, we focus on the second class of contextual bandit algorithms. We provide a formal description of the framework of contextual bandits suited to the PAR setting, and then discuss the problem that arises due to click sparsity.
Let $\mathcal{X}$ and $[K] = \{1, \ldots, K\}$ denote the user context space and the $K$ different ads/arms. At each round, it is assumed that a pair $(x, r)$ is drawn i.i.d. from some unknown joint distribution $D$ over $\mathcal{X} \times \{0,1\}^K$. Here, $x \in \mathcal{X}$ and $r \in \{0,1\}^K$ represent the user context vector and the full reward vector, i.e., the user’s true preference for all the ads (the full reward vector is unknown to the algorithm). $\Pi$ is the space of policies such that for any $\pi \in \Pi$, $\pi : \mathcal{X} \to [K]$. Contextual bandit algorithms have the following steps:

At each round $t$, the context vector $x_t$ is revealed, i.e., a user visits the system.

The system selects ad $a_t$ according to the current policy $\pi_t \in \Pi$ (exploitation strategy). Optionally, an exploration strategy is sometimes added, creating a distribution $p_t(\cdot)$ over the ads, and $a_t$ is drawn from $p_t(\cdot)$. Policy $\pi_t$ and distribution $p_t$ are sometimes used synonymously by considering $\pi_t$ to be a stochastic policy.

Reward $r_{t,a_t}$ is revealed and the new policy $\pi_{t+1} \in \Pi$ is computed, using information $\{x_t, a_t, r_{t,a_t}\}$. We emphasize that the system does not get to know $r_{t,a'}$, $\forall a' \neq a_t$.
Assuming the user-system interaction happens over $T$ rounds, the objective of the system is to maximize its cumulative reward, i.e., $\sum_{t=1}^{T} r_{t,a_t}$. Note that since rewards are assumed to be binary, $\sum_{t=1}^{T} r_{t,a_t}$ is precisely the total number of clicks and $\frac{1}{T}\sum_{t=1}^{T} r_{t,a_t}$ is the overall CTR of the recommendation algorithm. Theoretically, performance of a bandit algorithm is analyzed via the concept of regret, i.e.,
$$\text{Regret}(T) = \sum_{t=1}^{T} r_{t,\pi^*(x_t)} - \sum_{t=1}^{T} r_{t,a_t},$$
where $\pi^* = \arg\max_{\pi \in \Pi} \sum_{t=1}^{T} r_{t,\pi(x_t)}$. The desired property of any contextual bandit algorithm is to have a sublinear (in $T$) bound on Regret(T) (in expectation or with high probability), i.e., $\text{Regret}(T) = o(T)$. This guarantees that, at least, the algorithm converges to the optimal policy asymptotically.
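The cumulative reward, CTR, and regret defined above can be illustrated on synthetic data. The following is a minimal sketch, not part of the original method: the toy reward model, the three hand-written policies, and all function names are assumptions made for illustration. The full reward vectors are visible here only because the data is simulated.

```python
import random

def cumulative_reward(policy, contexts, rewards):
    """Sum of r_{t, a_t} with a_t = policy(x_t) over all rounds."""
    return sum(r[policy(x)] for x, r in zip(contexts, rewards))

def regret(policy, policy_space, contexts, rewards):
    """Cumulative reward of the best policy in the space minus that of `policy`."""
    best = max(cumulative_reward(p, contexts, rewards) for p in policy_space)
    return best - cumulative_reward(policy, contexts, rewards)

# Hypothetical toy problem: 2 ads, scalar context in [0, 1).
# Users with x < 0.5 may click only ad 0, the others only ad 1 (5% click rate).
rng = random.Random(0)
T = 1000
contexts = [rng.random() for _ in range(T)]
rewards = []
for x in contexts:
    click = int(rng.random() < 0.05)
    rewards.append([click, 0] if x < 0.5 else [0, click])

def always_0(x):
    return 0

def always_1(x):
    return 1

def by_threshold(x):
    return 0 if x < 0.5 else 1

space = [always_0, always_1, by_threshold]
clicks = cumulative_reward(by_threshold, contexts, rewards)
print("clicks:", clicks, "CTR:", clicks / T)
print("regret of always_0:", regret(always_0, space, contexts, rewards))
```

In this toy setup `by_threshold` collects every available click, so its regret with respect to the three-policy space is zero, while each constant policy collects only the clicks of one half of the users.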
Practical Issues with CB Policy Space
The policy spaces considered for major contextual bandit algorithms are based on classifiers. They can be tuples of binary classifiers, with one classifier per ad, or a global cost-sensitive multi-class classifier, depending on the nature of the bandit algorithm. Since clicks on the ads are rare and a small improvement in click-through rate can lead to significant reward, it is vital for the policy space to contain good policies that can identify the correct ads for the rare users who are highly likely to click on them. Extreme click sparsity makes it practically very challenging to design a classifier-based policy space in which policies can identify the correct ads for rare users. Crucially, contextual bandit algorithms are only concerned with converging as fast as possible to the best policy in the policy space and do not take into account the nature of the policies. Hence, if the optimal policy in the policy space does a poor job of identifying correct ads, then the bandit algorithm will have very low cumulative reward, regardless of its sophistication. We discuss how click sparsity hinders the design of different types of classifier-based policies.
Binary Classifier Based Policies
Contextual bandit algorithms are traditionally presented as online algorithms, with continuous update of policies. In industrial PAR systems, however, it is highly impractical to update policies continuously, because thousands of users visit a system in a small time frame. Thus, a policy update happens, i.e., a new policy is learnt, after intervals of time, using the bandit data produced from the interaction between the current policy and users, collected in batch. It is convenient to learn a binary classifier per ad in such a setting. To explain the process concisely, we note that the bandit data consists of tuples of the form $\{x, a, r_a\}$. For each ad $a$, the users who had not clicked on the ad ($r_a = 0$) would be considered as negative examples and the users who had clicked on the ad ($r_a = 1$) would be considered as positive examples, creating a binary training set for ad $a$. The binary classifiers are converted into a recommendation policy using a “one-vs-all” method [Rifkin and Klautau, 2004]. Thus, each policy in policy space $\Pi$ can be considered to be a tuple of $K$ binary classifiers.
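The construction of per-ad binary training sets and the one-vs-all combination described above can be sketched as follows. This is only an illustration: the tuple layout `(x, a, r)`, the function names, and the toy scoring functions are assumptions, and any binary classifier could supply the per-ad scores.

```python
from collections import defaultdict

def split_by_ad(bandit_data):
    """Group bandit tuples (x, a, r) into one binary training set per ad.

    Returns {ad: (positives, negatives)}: positives are contexts of users
    who clicked the displayed ad (r == 1), negatives those who did not.
    """
    per_ad = defaultdict(lambda: ([], []))
    for x, a, r in bandit_data:
        positives, negatives = per_ad[a]
        (positives if r == 1 else negatives).append(x)
    return dict(per_ad)

def one_vs_all(scorers):
    """Turn per-ad scoring functions {ad: f_a} into a policy x -> argmax_a f_a(x)."""
    return lambda x: max(scorers, key=lambda a: scorers[a](x))

# Toy logged data: (context, displayed ad, click).
logged = [([1.0, 0.0], 0, 1), ([0.0, 1.0], 0, 0), ([0.5, 0.5], 1, 0)]
training_sets = split_by_ad(logged)
print(training_sets[0])  # positives and negatives for ad 0

# Hypothetical per-ad scorers standing in for trained classifiers.
policy = one_vs_all({0: lambda x: x[0], 1: lambda x: x[1]})
print(policy([0.2, 0.9]))
```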
A number of research publications consider binary linear classifiers learnt by optimizing the logistic loss [?], while ensemble classifiers, like random forests, are also becoming popular [?]. We note that the majority of the papers that learn a logistic linear classifier focus on feature selection [?], novel regularizations to tackle high-dimensional context vectors [?], or clever combinations of logistic classifiers [?].
Click sparsity makes it difficult to design accurate binary classifiers in the following way: for an ad $a$, there will be very few clicks on the ad compared to the number of users who did not click on the ad. A binary classifier learnt in such a setting will almost always predict that its corresponding ad will not be clicked by a user, failing to identify the rare, but very important, users who are likely to click on the ad. This is commonly referred to as the “class imbalance problem” in binary classification [?]. Due to the extreme nature of the imbalance problem, tricks like undersampling of negative examples or oversampling of positive examples [?] are not very useful. More sophisticated techniques like cost-sensitive SVMs require prior knowledge about the importance of each class, which is not generally available in the PAR setting.
Note: Some of the referenced papers do not explicitly mention CBs because their focus is on issues related to the classifier learning process, such as the type of regularization, overcoming the curse of dimensionality, scalability, etc. The important issue of extreme class imbalance has not received sufficient attention (Sec 6.2, [?]). When the classifiers are used to predict ads, the technique is a particular CB algorithm (the exact exploration + exploitation mix is often not revealed).
Cost-Sensitive Multi-Class Classifier Based Policies
Another type of policy space consists of cost-sensitive multi-class classifiers [?; ?; ?]. They can be cost-sensitive multi-class SVMs [?], multi-class logistic classifiers or filter trees [Beygelzimer, Langford, and Ravikumar, 2007]. Click sparsity poses a slightly different kind of problem in practically designing a policy space of such classifiers.
A cost-sensitive multi-class classifier works as follows: assume a context-reward vector pair $(x, r)$ is generated as described in the PAR setting. The classifier will try to select a class (ad) $a \in [K]$ such that the reward $r_a$ is maximum among all choices of $r_{a'}$, $a' \in [K]$ (we consider reward maximizing classifiers, instead of cost minimizing classifiers). Unlike in traditional multi-class classification, where one entry of $r$ is $1$ and all other entries are $0$, in cost-sensitive classification $r$ can have any combination of $1$ and $0$. Now consider the reward vectors $r_t$ generated over $T$ rounds. A poor quality classifier $\pi_p$, which fails to identify the correct ad for most users $x_t$, will have very low average reward, i.e., $\frac{1}{T}\sum_{t=1}^{T} r_{t,\pi_p(x_t)} \approx 0$. From the model perspective, extreme click sparsity translates to almost all reward vectors $r_t$ being $\vec{0}$. Thus, even a very good classifier $\pi_g$, which can identify the correct ad for almost all users, will have very low average reward, i.e., $\frac{1}{T}\sum_{t=1}^{T} r_{t,\pi_g(x_t)} \approx 0$. From a practical perspective, it is difficult to distinguish between the performance of a good and a poor classifier in the face of extreme sparsity, and thus cost-sensitive multi-class classifiers are not ideal policies for contextual bandits addressing the PAR problem.
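A short worked bound may help to see why the two averages are hard to tell apart. The symbol $\rho$ and the numerical value below are illustrative assumptions, not quantities taken from the experiments. Let $\pi_g$ and $\pi_p$ denote a good and a poor classifier, and let $\rho = \frac{1}{T}\sum_{t=1}^{T} \mathbb{1}(r_t \neq \vec{0})$ be the fraction of rounds in which the user would click on at least one ad. Every policy $\pi$ satisfies $r_{t,\pi(x_t)} \le \mathbb{1}(r_t \neq \vec{0})$, so

$$0 \le \frac{1}{T}\sum_{t=1}^{T} r_{t,\pi(x_t)} \le \rho \quad\Longrightarrow\quad \left| \frac{1}{T}\sum_{t=1}^{T} r_{t,\pi_g(x_t)} - \frac{1}{T}\sum_{t=1}^{T} r_{t,\pi_p(x_t)} \right| \le \rho .$$

With $\rho = 0.002$, for example, the average rewards of the best and the worst possible classifier differ by at most $0.002$, a gap that is easily masked by sampling noise unless $T$ is very large.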
AUC Optimized Ranker
We propose a ranking-based alternative to learning a classifier per ad, in the offline setting, that is capable of overcoming the click sparsity problem. We learn a ranker per ad by optimizing the Area Under the Curve (AUC) loss, and use a ranking score normalization technique to create a policy mapping context to ad. We note that AUC is a popular measure used to evaluate a classifier on an imbalanced dataset. However, our objective is to explicitly use the loss to learn a ranker that overcomes the imbalance problem and then create a context-to-ad mapping policy.
Ranker Learning Technique: For an ad $a$, let $S^+ = \{x_i^+\}_{i=1}^{P}$ and $S^- = \{x_j^-\}_{j=1}^{N}$ be the sets of positive and negative instances, respectively. Let $f_w$ be a linear ranking function parameterized by $w$, i.e., $f_w(x) = w^\top x$ (inner product). The AUC-based loss (AUCL) is a ranking loss that is minimized when positive instances get higher scores than negative instances, i.e., the positive instances are ranked higher than the negatives when instances are sorted in descending order of their scores [?]. Formally, we define the empirical AUCL for function $f_w$ as
$$\text{AUCL}(f_w) = \frac{1}{PN}\sum_{i=1}^{P}\sum_{j=1}^{N} \mathbb{1}\left(f_w(x_i^+) - f_w(x_j^-) < 0\right).$$
Direct optimization of AUCL is an NP-hard problem, since AUCL is a sum of discontinuous indicator functions. To make the objective function computationally tractable, the indicator functions are replaced by a continuous, convex surrogate $\ell(\cdot)$. Examples include hinge and logistic surrogates. Thus, the final objective function to optimize is
$$L(w) = \frac{1}{PN}\sum_{i=1}^{P}\sum_{j=1}^{N} \ell\left(f_w(x_i^+) - f_w(x_j^-)\right). \qquad (1)$$
Note: Since AUCL is a ranking loss, class imbalance ceases to be a problem. Irrespective of the number of positive and negative instances in the training set, the position of a positive instance relative to a negative instance in the final ranked list is the only matter of concern in the AUCL calculation.
Optimization Procedure
The objective function (1) is convex and can be efficiently optimized by the stochastic gradient descent (SGD) procedure [Shamir and Zhang, 2013]. One computational issue associated with AUCL is that it pairs every positive instance with every negative instance, so the number of terms grows as the product of the numbers of positives and negatives. The SGD procedure easily overcomes this computational issue. At every step of SGD, a positive and a negative instance are randomly selected from the training set, followed by a gradient descent step. This makes the training procedure memory-efficient and mimics full gradient descent optimization on the entire loss. We also note that the rankers for the ads can be trained in parallel and that any regularizer, like $\ell_1$ or $\ell_2$, can be added to (1) to introduce sparsity or avoid overfitting. Lastly, powerful non-linear kernel ranking functions can be learnt in place of linear ranking functions, but at the cost of memory efficiency, and the rankers can even be learnt online, from streaming data [Zhao et al., 2011].
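The pairwise SGD procedure just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the logistic surrogate $\log(1 + e^{-z})$ with a small $\ell_2$ penalty, a constant step size, and a synthetic, heavily imbalanced toy dataset; the function names and hyperparameter values are hypothetical.

```python
import math
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_auc_ranker(pos, neg, steps=5000, lr=0.1, l2=1e-4, seed=0):
    """Fit a linear ranker f_w(x) = <w, x> by SGD on the pairwise logistic
    surrogate log(1 + exp(-(f_w(x+) - f_w(x-)))) plus an L2 penalty."""
    rng = random.Random(seed)
    w = [0.0] * len(pos[0])
    for _ in range(steps):
        xp, xn = rng.choice(pos), rng.choice(neg)  # one random (+, -) pair
        margin = dot(w, xp) - dot(w, xn)
        # Derivative of log(1 + exp(-margin)) w.r.t. margin is -1 / (1 + exp(margin));
        # the margin is clipped only to keep exp() from overflowing.
        g = -1.0 / (1.0 + math.exp(min(margin, 50.0)))
        w = [wi - lr * (g * (p - n) + l2 * wi) for wi, p, n in zip(w, xp, xn)]
    return w

def auc(w, pos, neg):
    """Fraction of (+, -) pairs ranked correctly; ties count as half."""
    total = 0.0
    for xp in pos:
        sp = dot(w, xp)
        for xn in neg:
            sn = dot(w, xn)
            total += 1.0 if sp > sn else 0.5 if sp == sn else 0.0
    return total / (len(pos) * len(neg))

# Synthetic toy data: 5 positives versus 500 negatives (1% positives).
data_rng = random.Random(1)
neg = [[data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)] for _ in range(500)]
pos = [[data_rng.gauss(2, 0.5), data_rng.gauss(0, 0.5)] for _ in range(5)]

w = train_auc_ranker(pos, neg)
print("training AUC:", auc(w, pos, neg))
```

Each step touches a single positive-negative pair, so memory use does not depend on the number of pairs, and the rare positives take part in every update regardless of how many negatives there are.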
Constructing Policy from Rankers
Similar to learning a classifier per ad, a separate ranking function $f_{w_a}$ is learnt for each ad $a$ from the bandit batch data. Then the following technique is used to convert the $K$ separate ranking functions into a recommendation policy. First, a threshold score $s_a$ is learnt for each action $a$ separately (see the details below), and then for a new user $x$, the combined policy $\pi$ works as follows:
$$\pi(x) = \arg\max_{a \in [K]} \left( f_{w_a}(x) - s_a \right). \qquad (2)$$
Thus, $\pi$ maps $x$ to the ad $a$ with the maximum “normalized score”. This normalization negates the inherent scoring bias that might exist for each ranking function. That is, a ranking function for an action $a$ might learn to score all instances (both positive and negative) higher than a ranking function for an action $a'$. Therefore, for a new instance $x$, the ranking function for $a$ will always give a higher score than the ranking function for $a'$, leading to possibly incorrect predictions.
Learning Threshold Score $s_a$: After learning the ranking function $f_{w_a}$ from the training data, the threshold score $s_a$ is learnt by maximizing some classification measure like precision, recall, or F-score on the same training set. That is, the score of each (positive or negative) instance in the training set is calculated and the values of the classification measure corresponding to different thresholds are compared. The threshold that gives the maximum measure value is assigned to $s_a$.
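The threshold search and the normalized-score rule can be sketched as below. The choice of F1 as the classification measure, the convention that a score at or above the threshold counts as a positive prediction, and all names are assumptions made for this illustration; the text above allows other measures.

```python
def learn_threshold(scores_pos, scores_neg):
    """Return the threshold that maximizes F1 on the training scores.
    An instance is predicted positive when its score is >= the threshold."""
    best_s, best_f1 = None, -1.0
    for s in sorted(set(scores_pos) | set(scores_neg)):
        tp = sum(v >= s for v in scores_pos)
        fp = sum(v >= s for v in scores_neg)
        fn = len(scores_pos) - tp
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_s, best_f1 = s, f1
    return best_s

def make_policy(rankers, thresholds):
    """rankers: {ad: scoring function}, thresholds: {ad: threshold score}.
    Returns the policy x -> argmax over ads of (score - threshold)."""
    return lambda x: max(rankers, key=lambda a: rankers[a](x) - thresholds[a])

# Threshold search on toy ranking scores.
s = learn_threshold([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])
print("threshold:", s)

# Two hypothetical rankers with very different score scales.
rankers = {"a": lambda x: x + 10.0, "b": lambda x: 2 * x}
thresholds = {"a": 10.5, "b": 0.5}
policy = make_policy(rankers, thresholds)
print(policy(1.0))
```

In the second example ranker "a" scores every instance higher than ranker "b", so a plain argmax over raw scores would always choose "a"; subtracting the thresholds removes this offset before the comparison.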
Competing Policies and Evaluation Techniques
To support our hypothesis that ranker-based policies address the click sparsity problem better than classifier-based policies, we set up two sets of experiments. We a) compared deterministic policies (only “exploitation”) on full information (classification) datasets and b) compared stochastic policies (“exploitation + exploration”) on bandit datasets, with a specific offline evaluation technique. Both of our experiments were designed for the batch learning setting, with policies constructed from separate classifiers/rankers per ad. The classifiers considered were linear and ensemble RandomForest classifiers, and the ranker considered was the AUC optimized ranker.
Deterministic Policies: Policies from the trained classifiers were constructed using the “one-vs-all” technique, i.e., for a new user $x$, the ad with the maximum score according to the classifiers was predicted. For the policy constructed from rankers, the ad with the maximum shifted score according to the rankers was predicted, using Eq. 2. Deterministic policies are “exploit only” policies.
Stochastic Policies: Stochastic policies were constructed from deterministic policies by adding an $\epsilon$-greedy exploration technique on top. Briefly, let one of the stochastic policies be denoted by $\pi_e$ and let $\epsilon \in [0,1]$. For a context $x$ in the test set, $\pi_e(a|x) = 1 - \epsilon$, if $a$ was the offer with the maximum score according to the underlying deterministic policy, and $\pi_e(a|x) = \frac{\epsilon}{K-1}$, otherwise ($K$ is the total number of offers). Thus, $\pi_e(\cdot|x)$ is a probability distribution over the offers. Stochastic policies are “exploit + explore” policies.
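An epsilon-greedy distribution over offers can be written down directly. The sketch below assumes the common convention in which the greedy offer receives probability $1 - \epsilon$ and the remaining mass $\epsilon$ is spread uniformly over the other offers; the function names and the value of $\epsilon$ in the example are hypothetical.

```python
import random

def epsilon_greedy_probs(greedy_ad, num_ads, epsilon):
    """Distribution over ads: 1 - epsilon on the greedy ad, and epsilon
    split evenly across the other num_ads - 1 ads."""
    other = epsilon / (num_ads - 1)
    return [1.0 - epsilon if a == greedy_ad else other for a in range(num_ads)]

def sample_ad(probs, rng=random):
    """Draw one ad index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = epsilon_greedy_probs(greedy_ad=2, num_ads=5, epsilon=0.2)
print(probs)
print(sample_ad(probs, random.Random(0)))
```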
Evaluation on Full Information Classification Data
Benchmark bandit data are usually hard to obtain in public domains. So, we compared the deterministic policies on benchmark $K$-class classification data, converted to $K$-class bandit data, using the technique in [?]. Briefly, the standard conversion technique is as follows: A $K$-class dataset is randomly split into a training set and a test set (in our experiments, we used split). Only the labeled training set is converted into bandit data, as per procedure. Let $(x, a)$ be an instance and the corresponding class in the training set. A class $a' \in [K]$ is selected uniformly at random. If $a' = a$, a reward of $1$ is assigned to $x$; otherwise, a reward of $0$ is assigned. The new bandit instance is of the form $\{x, a', 1\}$ or $\{x, a', 0\}$, and the true class $a$ is hidden. The bandit data is then divided into $K$ separate binary class training sets, as detailed in the section “Binary Classifier Based Policies”.
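The conversion from labeled classification data to bandit data can be sketched as follows. The function name, the tuple layout, and the toy dataset are assumptions made for illustration.

```python
import random

def to_bandit(labeled, num_classes, seed=0):
    """Convert (x, true_class) pairs into bandit tuples (x, shown_class, reward).

    A class is drawn uniformly at random for every instance; the reward is 1
    only when the drawn class equals the (now hidden) true class.
    """
    rng = random.Random(seed)
    bandit = []
    for x, true_class in labeled:
        shown = rng.randrange(num_classes)
        bandit.append((x, shown, int(shown == true_class)))
    return bandit

# Toy 3-class dataset with one numeric feature.
data = [([float(i)], i % 3) for i in range(300)]
bandit = to_bandit(data, num_classes=3)
positives = sum(r for _, _, r in bandit)
print("fraction of positive rewards:", positives / len(bandit))
```

Because the displayed class is uniform over the classes, roughly one instance in `num_classes` receives a positive reward, so datasets with more classes yield sparser positive feedback after conversion.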
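Evaluation Technique: We compared the deterministic policies by calculating the CTR of each policy. For a policy $\pi$, CTR on a test set $\{(x_i, a_i)\}_{i=1}^{n}$ of cardinality $n$ is measured as:
$$\text{CTR}(\pi) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left(\pi(x_i) = a_i\right). \qquad (3)$$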
Note that we can calculate the true CTR of a policy because the correct class for an instance is known in the test set.
Evaluation on Bandit Information Data
Bandit datasets have both training and test sets in bandit form, and datasets we use are industry proprietary in nature.
Evaluation Technique: We compared the stochastic policies on bandit datasets. Comparison of policies on a bandit test set comes with the following unique challenge: for a stochastic policy $\pi_e$, the expected reward is $\sum_{a=1}^{K} r_a \pi_e(a|x)$, for a test context $x$ (with the true CTR of $\pi_e$ being the average of the expected reward over the entire test set). Since the bandit form of test data does not give any information about rewards for offers which were not displayed, it is not possible to calculate the expected reward directly.
We evaluated the policies using a particular offline contextual bandit policy evaluation technique. There exist various such evaluation techniques in the literature, with adequate discussion about the process [?]. We used one of the importance weighted techniques, as described in Theocharous et al. [Theocharous, Thomas, and Ghavamzadeh, 2015], because it allows us to give a high confidence lower bound on the performance of the policies. We provide the mathematical details of the technique.
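The bandit test data was logged from the interaction between users and a fully random policy $\pi_u$, over an interaction window. The random policy produced the following distribution over offers: $\pi_u(a|x) = \frac{1}{K}$, $\forall a \in [K]$. For an instance $\{x, a', r_{a'}\}$ in the test set, the importance weighted reward of evaluation policy $\pi_e$ is computed as $\hat{r}(x) = r_{a'} \frac{\pi_e(a'|x)}{\pi_u(a'|x)}$. The importance weighted reward is an unbiased estimator of the true expected reward of $\pi_e$, i.e., $\mathbb{E}_{a' \sim \pi_u(\cdot|x)}\left[\hat{r}(x)\right] = \sum_{a=1}^{K} r_a \pi_e(a|x)$.
Let the cardinality of the test set be $n$. The importance weighted CTR of $\pi_e$ is defined as
$$\widehat{\text{CTR}}(\pi_e) = \frac{1}{n}\sum_{t=1}^{n} r_{t,a'_t} \frac{\pi_e(a'_t|x_t)}{\pi_u(a'_t|x_t)}. \qquad (4)$$
Since $(x_t, r_t)$ are assumed to be generated i.i.d., the importance weighted CTR is an unbiased estimator of the true CTR of $\pi_e$. Moreover, it is possible to construct a t-test based lower confidence bound on the expected reward, using the unbiased estimator, as follows: let $X_t = r_{t,a'_t} \frac{\pi_e(a'_t|x_t)}{\pi_u(a'_t|x_t)}$, $\bar{X} = \frac{1}{n}\sum_{t=1}^{n} X_t$, and $\sigma = \sqrt{\frac{1}{n-1}\sum_{t=1}^{n} (X_t - \bar{X})^2}$. Then, $\mathbb{E}[\bar{X}]$ equals the true CTR of $\pi_e$ and
$$\bar{X} - \frac{\sigma}{\sqrt{n}}\, t_{1-\delta,\, n-1} \qquad (5)$$
is a $1-\delta$ lower confidence bound on the true CTR, where $t_{1-\delta,\, n-1}$ denotes the $(1-\delta)$ quantile of the Student t distribution with $n-1$ degrees of freedom. Thus, during evaluation, we plotted the importance weighted CTR and lower confidence bounds for the competing policies.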
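The importance weighted estimate and its lower confidence bound can be computed as in the sketch below. Two points are assumptions of this illustration rather than part of the described procedure: the logging policy is taken to be uniform over the offers, and the Student t quantile is replaced by a standard normal quantile, which is a reasonable stand-in only for large test sets. Function and variable names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def importance_weighted_ctr(test_set, eval_probs, num_ads, delta=0.05):
    """test_set: tuples (x, shown_ad, reward) logged by a uniformly random policy.
    eval_probs(x): probabilities that the evaluated policy assigns to each ad.
    Returns (importance weighted CTR, approximate 1 - delta lower bound)."""
    logging_prob = 1.0 / num_ads
    weighted = [r * eval_probs(x)[a] / logging_prob for x, a, r in test_set]
    estimate = mean(weighted)
    # Normal quantile used as a large-n approximation of the t quantile.
    z = NormalDist().inv_cdf(1.0 - delta)
    lower = estimate - z * stdev(weighted) / math.sqrt(len(weighted))
    return estimate, lower

# Toy check: 3 users, 2 ads, every (user, ad) pair logged exactly once,
# so the logged data matches the uniform logging distribution exactly.
true_rewards = {0: [1, 0], 1: [0, 1], 2: [0, 0]}
test_set = [(x, a, true_rewards[x][a]) for x in range(3) for a in range(2)]

def eval_probs(x):
    return [0.8, 0.2]  # evaluated policy: ad 0 with probability 0.8

estimate, lower = importance_weighted_ctr(test_set, eval_probs, num_ads=2)
true_ctr = mean(sum(p * r for p, r in zip(eval_probs(x), true_rewards[x]))
                for x in range(3))
print(estimate, true_ctr, lower)
```

In this balanced toy log the estimate coincides with the true CTR of the evaluated policy; on real logs the two agree only in expectation.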
Empirical Results
We detail the parameters and results of our experiments.
Linear Classifiers and Ranker: For each ad $a$, a linear classifier was learnt by optimizing the logistic surrogate, while a linear ranker was learnt by optimizing the objective function (1), with $\ell(\cdot)$ being the logistic surrogate. Since we did not have the problem of sparse high-dimensional features in our datasets, we added an $\ell_2$ regularizer instead of an $\ell_1$ regularizer. We applied SGD with 1 million iterations, varied the parameter of the regularizer over a set of values, and recorded the best result.
Ensemble Classifiers: We learnt a RandomForest classifier for each ad $a$. The RandomForests were composed of 200 trees, both for computational feasibility and for the more theoretical reason outlined in [?].
Comparison of Deterministic Policies
Datasets: The multiclass datasets are detailed in Table 1.
Table 1: Multi-class datasets.
Dataset | OptDigits | Isolet | Letter | PenDigits | Movementlibras
Size | 5620 | 7797 | 20000 | 10992 | 360
Features | 64 | 617 | 16 | 16 | 91
Classes | 10 | 26 | 26 | 10 | 15
Avg. positive (%) | 10 | 4 | 4 | 10 | 7
Evaluation: To compare the deterministic policies, we conducted two sets of experiments: one without undersampling of negative classes during training (i.e., no class balancing) and another with heavy undersampling of negative classes (artificial class balancing). Training and testing were repeated 10 times for each dataset to account for the randomness introduced during conversion of classification training data to bandit training data, and the average accuracy over the runs is reported. Figure 1 top and bottom show the performance of various policies learnt without and with undersampling during training, respectively. Undersampling was done to make the positive:negative ratio 1:2 for every class (this means that Avg. positive was 33%). The ratio of 1:2 generally gave the best results.
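Undersampling negatives to a fixed positive:negative ratio can be sketched as below. The function name and the toy data are illustrative assumptions; the default of two negatives per positive mirrors the 1:2 ratio mentioned above.

```python
import random

def undersample(positives, negatives, neg_per_pos=2, seed=0):
    """Keep all positives and a random subset of the negatives so that the
    positive:negative ratio is at most 1:neg_per_pos."""
    rng = random.Random(seed)
    k = min(len(negatives), neg_per_pos * len(positives))
    return positives, rng.sample(negatives, k)

pos = list(range(10))             # 10 positive examples
neg = list(range(100, 1100))      # 1000 negative examples
kept_pos, kept_neg = undersample(pos, neg)
share = len(kept_pos) / (len(kept_pos) + len(kept_neg))
print(len(kept_pos), len(kept_neg), round(100 * share))
```

With a 1:2 ratio the positives make up one third of the resulting training set, which corresponds to the 33% figure quoted in the text.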
Observations: a) With heavy undersampling during training, the performance of classifier-based policies improves significantly. The ranker-based policy is not affected, validating that class imbalance does not affect the ranking loss. b) The linear ranker-based policy performs uniformly better than the linear classifier-based policy, with or without undersampling. This shows that, when restricted to the same class of functions (linear), rankers handle class imbalance much better than classifiers. c) The linear ranker-based policy does better than the more complex RandomForest (RF) based policy when no undersampling is done during training, and is competitive when undersampling is done. d) Complex classifiers like RFs are relatively robust to moderate class imbalance. However, as we will see on real datasets, when class imbalance is extreme, the gain from using a ranker-based policy becomes prominent. Moreover, growing big forests may be infeasible due to memory constraints.
Comparison of Stochastic Policies
Our next set of experiments was conducted on three different datasets that are the property of a major technology company.
Datasets: Two of the datasets were collected from campaigns run by two major banks and the third from a campaign run by a major hotel chain. When a user visited the campaign website, she was either assigned to a targeted policy or a purely random policy. The targeted policy was some specific ad serving policy, particular to the campaign. The data was collected in the form $\{x, a, r_a\}$, where $x$ denotes the user context, $a$ denotes the offer displayed, and $r_a$ denotes the reward received. We trained our competing policies on data collected from the targeted policy, and testing was done on the data collected from the random policy. We focused on the top-5 offers by number of impressions in the training set. Table 2 provides information about the training sets collected from the hotel’s campaign and one of the banks’ campaigns. The second bank’s campaign has a training set similar to the first one. As can be clearly observed, each offer in the bank’s campaign suffers from extreme click sparsity.
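Table 2: Training sets from the hotel campaign and the first bank campaign.
Domain | Offer | Impressions | Avg. Positive (Clicks/Impressions, in %)
Hotel | 1 | 36164 | 8.1
Hotel | 2 | 37944 | 8.2
Hotel | 3 | 30871 | 7.8
Hotel | 4 | 32765 | 7.7
Hotel | 5 | 20719 | 5.5
Bank 1 | 1 | 37750 | 0.17
Bank 1 | 2 | 38254 | 0.40
Bank 1 | 3 | 182191 | 0.45
Bank 1 | 4 | 168789 | 0.30
Bank 1 | 5 | 17291 | 0.23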
Feature Selection: We used a feature selection strategy to select around 20 of the users’ features, as some of the features were of poor quality and led to difficulty in learning. We used the information gain criterion to select features [Cheng, Wang, and Bryant, 2012].
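For reference, the information gain of a discrete feature with respect to the click label can be computed as in the sketch below. This is a generic textbook formulation with hypothetical function names, not the specific tool used in the experiments, and it assumes that features have already been discretized.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """H(labels) - H(labels | feature) for a discrete feature."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

clicks = [0, 0, 1, 1]
print(information_gain([0, 0, 1, 1], clicks))  # feature determines the label
print(information_gain([0, 1, 0, 1], clicks))  # feature carries no information
```

Features would then be ranked by this score and the highest-scoring ones retained.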
Results: Figures 2(a), 2(b), and 2(c) show the results of our experiments. We used heavy undersampling of negative examples, at a ratio of 1:1 for positive:negative examples per offer, while training the classifiers. During evaluation, a fixed value of $\epsilon$ was used as the exploration parameter; choosing $\epsilon$ to favor more exploitation did not yield better results.
Observations: a) The ranker-based policy generally performed better than the classifier-based policies. b) For the bank campaigns, where the click sparsity problem is extremely severe, it can be stated with high confidence that the ranker-based policy performed significantly better than the classifier-based policies. This shows that the ranker-based policy can handle class imbalance better than the classifier-based policies.
References
 [Agarwal et al., 2009] Agarwal, D.; Gabrilovich, E.; Hall, R.; Josifovski, V.; and Khanna, R. 2009. Translating relevance scores to probabilities for contextual advertising. In Proceedings of the 18th ACM conference on Information and knowledge management, 1899–1902. ACM.
 [Agarwal et al., 2014] Agarwal, A.; Hsu, D.; Kale, S.; Langford, J.; Li, L.; and Schapire, R. 2014. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the 31st International Conference on Machine Learning, 1638–1646.
 [Agrawal and Goyal, 2013] Agrawal, S., and Goyal, N. 2013. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, 127–135.
 [Beygelzimer, Langford, and Ravikumar, 2007] Beygelzimer, A.; Langford, J.; and Ravikumar, P. 2007. Multiclass classification with filter trees. Preprint, June 2.
 [Calders and Jaroszewicz, 2007] Calders, T., and Jaroszewicz, S. 2007. Efficient AUC optimization for classification. In Knowledge Discovery in Databases: PKDD 2007. Springer. 42–53.
 [Cao, Zhao, and Zaiane, 2013] Cao, P.; Zhao, D.; and Zaiane, O. 2013. An optimized cost-sensitive SVM for imbalanced data learning. In Advances in Knowledge Discovery and Data Mining. Springer. 280–292.
 [Chapelle and Li, 2011] Chapelle, O., and Li, L. 2011. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, 2249–2257.
 [Chawla, Japkowicz, and Kotcz, 2004] Chawla, N.; Japkowicz, N.; and Kotcz, A. 2004. Editorial: special issue on learning from imbalanced data sets. ACM Sigkdd Explorations Newsletter 6(1):1–6.
 [Cheng, Wang, and Bryant, 2012] Cheng, T.; Wang, Y.; and Bryant, S. 2012. FSelector: a ruby gem for feature selection. Bioinformatics 28(21):2851–2852.
 [Chu et al., 2011] Chu, W.; Li, L.; Reyzin, L.; and Schapire, R. 2011. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, 208–214.
 [Cortes and Mohri, 2004] Cortes, C., and Mohri, M. 2004. AUC optimization vs. error rate minimization. Advances in neural information processing systems 16(16):313–320.
 [Dudik et al., 2011] Dudik, M.; Hsu, D.; Kale, S.; Karampatziakis, N.; Langford, J.; Reyzin, L.; and Zhang, T. 2011. Efficient optimal learning for contextual bandits. Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, 2011.
 [He and others, 2014] He, X., et al. 2014. Practical lessons from predicting clicks on ads at facebook. In Proceedings of 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 1–9. ACM.
 [Japkowicz and Stephen, 2002] Japkowicz, N., and Stephen, S. 2002. The class imbalance problem: A systematic study. Intelligent data analysis 6(5):429–449.
 [Koh and Gupta, 2014] Koh, E., and Gupta, N. 2014. An empirical evaluation of ensemble decision trees to improve personalization on advertisement. In Proceedings of KDD 14 Second Workshop on User Engagement Optimization.
 [Langford and Zhang, 2008] Langford, J., and Zhang, T. 2008. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in neural information processing systems, 817–824.
 [Langford, Li, and Dudik, 2011] Langford, J.; Li, L.; and Dudik, M. 2011. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, 1097–1104.
 [Li et al., 2011] Li, L.; Chu, W.; Langford, J.; and Wang, X. 2011. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM international conference on Web search and data mining, 297–306. ACM.
 [McMahan and others, 2013] McMahan, H., et al. 2013. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, 1222–1230. ACM.
 [Richardson, Dominowska, and Ragno, 2007] Richardson, M.; Dominowska, E.; and Ragno, R. 2007. Predicting clicks: estimating the clickthrough rate for new ads. In Proceedings of the 16th international conference on World Wide Web, 521–530. ACM.
 [Rifkin and Klautau, 2004] Rifkin, R., and Klautau, A. 2004. In defense of one-vs-all classification. The Journal of Machine Learning Research 5:101–141.
 [Shamir and Zhang, 2013] Shamir, O., and Zhang, T. 2013. Stochastic gradient descent for nonsmooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning, 2013, 71–79.
 [Theocharous, Thomas, and Ghavamzadeh, 2015] Theocharous, G.; Thomas, P.; and Ghavamzadeh, M. 2015. Ad recommendation systems for lifetime value optimization. In Proceedings of the 24th International Conference on World Wide Web Companion, 1305–1310.
 [Thomas, Theocharous, and Ghavamzadeh, 2015] Thomas, P.; Theocharous, G.; and Ghavamzadeh, M. 2015. High confidence off-policy evaluation. In Proceedings of the Twenty-Ninth Conference on Artificial Intelligence.
 [Zhao et al., 2011] Zhao, P.; Jin, R.; Yang, T.; and Hoi, S. C. 2011. Online AUC maximization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 233–240.