Abstract
Multiple machine learning and prediction models are often used for the same prediction or recommendation task. In our recent work, where we developed and deployed airline ancillary pricing models in an online setting, we found that among the multiple pricing models developed, no one model clearly dominates the others for all incoming customer requests. Thus, as algorithm designers, we face an exploration-exploitation dilemma. In this work, we introduce an adaptive meta-decision framework that uses Thompson sampling, a popular multi-armed bandit solution method, to route customer requests to various pricing models based on their online performance. We show that this adaptive approach outperforms a uniformly random selection policy by improving the expected revenue per offer by 43% and conversion score by 58% in an offline simulation.
Adaptive Model Selection Framework: An Application to Airline Pricing
Naman Shukla*, Arinbjörn Kolbeinsson*, Lavanya Marla, Kartik Yellepeddi
Copyright 2019 by the author(s).
In recent years, ancillaries such as bags, meals, wifi service and extra leg room have become a major source of revenue and profitability for airlines (IdeaWorks, 2018; Bockelie & Belobaba, 2017). Conventional pricing strategies based on static business rules do not respond to changing market conditions and do not match customer willingness to pay, leaving potential revenue sources untapped. In our recent work (Shukla et al., 2019), we demonstrate that machine learning approaches that dynamically price such ancillaries based on customer context lead to higher revenue on average for the airline compared to conventional pricing schemes. Specifically, we presented three different airline ancillary pricing models that provided dynamic price recommendations specific to each customer interaction and improved the revenue per offer for the airline. We also measured the real-world business impact of these approaches by deploying them in an A/B test on an airline’s internet booking website. As part of that research, all of the models were deployed in parallel on the airline’s booking system in order to measure their relative performance. The online business performance of our deployed models was better than that of human pricing on average, but no single model outperformed the other deployed models in all customer contexts. Moreover, we were also developing new models that performed well in the offline setting and showed promise to do better in an online setting. As a result, we face an exploration-exploitation dilemma: do we (a) exploit the single model that does best on average, (b) explore other models that do better in a different context, or (c) utilize offline metrics? We therefore present a meta-decision framework that addresses this issue.
Our contributions in this work are as follows.

We develop an approach that uses a multi-armed bandit method to actively and adaptively route pricing requests to multiple models to further improve revenue.

We test this approach in a rigorously constructed simulation environment to demonstrate that an improved routing scheme, one that improves business metrics, can be achieved.

We lay a foundation for future research to use contextual multi-armed bandit methods and perform online testing of this approach.
Ensemble methods are meta-learning algorithms that combine multiple machine learning methods to reduce error and variance (Kotsiantis et al., 2007). Several ensemble learning methods have been proposed to improve personalization systems. Meta-recommenders were pioneered by Schafer et al. (2002) in their work on combining multiple information sources and recommendation techniques for improved recommendations. More recent advancements in ensemble learning in the context of the ‘bucket of models’ framework include model selection using cross-validation techniques (Arlot et al., 2010). Most related to our work is the use of contextual multi-armed bandits for recommendation in an online setting (Zeng et al., 2016). The notable difference from our approach is that they apply the bandit directly to the problem, whereas we first generate recommendations using an assortment of diverse models. Multi-armed bandits have also been used to control the false-discovery rate in sequential A/B tests (Yang et al., 2017). However, that method does not allow for maximizing an arbitrary outcome (such as revenue or offer conversion score), which is a necessary property of our required solution.
Among a set of models that price ancillaries, our objective is to direct online traffic (customers) more effectively to the better-performing models, those that predict customer willingness to pay more accurately, in order to maximize revenue. To formalize this, we model the online traffic share directed to each pricing model as a tunable meta-parameter. Learning this parameter is therefore critical to identify the “winning” model and separate it from other models. Based on customers’ responses (to purchase an ancillary or not), we can reinforce the meta-parameters to adapt to the outcome. We model this problem from a reinforcement learning perspective as a sequence of actions (selecting a model) and rewards (offer accepted or not) that can be learned using a multi-armed bandit.
Solving the multi-armed bandit problem involves choosing an action (pulling an ‘arm’) from a given action space when the reward associated with each action is only partially known (Auer et al., 2002; Sutton & Barto, 2018). The environment reveals a reward after an action is taken, thereby giving information about both the environment and the action’s reward. The objective of the problem is to maximize the expected reward or minimize the cumulative expected regret (Bubeck et al., 2012) from a series of actions. The bandit problem involves a trade-off between exploration and exploitation since the reward is only revealed for the chosen action. In our framework, each of the deployed ancillary pricing models is viewed as an arm. This enables the decision-maker to direct online traffic based on the customer’s response to the price offered by that arm. Hence, this meta-learning approach provides a trainable mechanism to allocate customer traffic to the arms that provide the best conversion scores. We use the Thompson sampling algorithm, a popular multi-armed bandit solution approach, which allows us to exploit the current winning arm and explore seemingly inferior arms, or newly introduced arms that might outperform the current best arm (Chapelle & Li, 2011; Russo et al., 2018).
In a Bernoulli bandit problem, there are a total of $K$ valid actions, where an action $x_t \in \{1, \dots, K\}$ at time $t$ produces a reward $r_t$ of one with probability $\theta_{x_t}$ and zero with probability $1 - \theta_{x_t}$. The mean reward $\theta = (\theta_1, \dots, \theta_K)$ is assumed to be unknown, but is stationary over time. The agent begins with an independent prior belief over each $\theta_k$. As observations are gathered, the distribution is updated according to Bayes’ rule. As described in equation (1), the priors are assumed to be beta-distributed with parameters $\alpha = (\alpha_1, \dots, \alpha_K)$ and $\beta = (\beta_1, \dots, \beta_K)$. In particular, the exact prior probability density function given for an action $k$ is
$$p(\theta_k) = \frac{\Gamma(\alpha_k + \beta_k)}{\Gamma(\alpha_k)\,\Gamma(\beta_k)}\,\theta_k^{\alpha_k - 1}\,(1 - \theta_k)^{\beta_k - 1}, \tag{1}$$
where $\Gamma$ denotes the gamma function. The beta distribution is particularly suited to this computation because it is a conjugate prior to the Bernoulli distribution.
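The conjugacy mentioned here can be made explicit with a short worked step. The notation is an assumption chosen for illustration: $\theta_k$ is the unknown success probability of action $k$, $(\alpha_k, \beta_k)$ are its Beta parameters, and $r \in \{0, 1\}$ is one observed reward for that action.

```latex
\begin{align*}
p(\theta_k \mid r)
  &\propto \underbrace{\theta_k^{\,r}\,(1-\theta_k)^{1-r}}_{\text{Bernoulli likelihood}}
     \;\underbrace{\theta_k^{\alpha_k-1}\,(1-\theta_k)^{\beta_k-1}}_{\text{Beta prior}} \\
  &= \theta_k^{(\alpha_k + r) - 1}\,(1-\theta_k)^{(\beta_k + 1 - r) - 1},
\end{align*}
% so the posterior is again a Beta distribution with updated parameters
\[
(\alpha_k,\ \beta_k) \;\leftarrow\; (\alpha_k + r,\ \beta_k + 1 - r),
\qquad
\mathbb{E}[\theta_k \mid r] = \frac{\alpha_k + r}{\alpha_k + \beta_k + 1}.
\]
```

Under these assumptions, each accepted offer adds one to $\alpha_k$ and each declined offer adds one to $\beta_k$, so the posterior can be maintained with two counters per action.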
Algorithm 1 describes the method used for choosing models within our simulation environment (Figure 1, Step 2). $\mathcal{X}$ is the set of models (described below). $x_t \in \mathcal{X}$ is the model chosen to provide the pricing recommendation at step $t$, and the corresponding reward $r_t$ is 1 if the ancillary is purchased and 0 otherwise.
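The model-selection loop described here can be sketched as follows. This is a minimal illustration of standard Beta-Bernoulli Thompson sampling, not a reproduction of the authors’ Algorithm 1. The class name `BernoulliThompsonSampler`, the helper `simulate`, and the synthetic reward draw are all assumptions introduced for the example.

```python
import random


class BernoulliThompsonSampler:
    """Thompson sampling over a fixed set of arms with Beta posteriors."""

    def __init__(self, arms, prior_alpha=1.0, prior_beta=1.0):
        # One (alpha, beta) pair per arm; Beta(1, 1) is the uniform prior.
        self.alpha = {arm: prior_alpha for arm in arms}
        self.beta = {arm: prior_beta for arm in arms}

    def select(self, rng=random):
        # Draw one plausible success rate per arm, then act greedily on the draws.
        draws = {arm: rng.betavariate(self.alpha[arm], self.beta[arm])
                 for arm in self.alpha}
        return max(draws, key=draws.get)

    def update(self, arm, reward):
        # Conjugate update: reward 1 increments alpha, reward 0 increments beta.
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward

    def posterior_mean(self, arm):
        return self.alpha[arm] / (self.alpha[arm] + self.beta[arm])


def simulate(success_rates, steps, seed=0):
    """Route `steps` synthetic sessions; `success_rates` maps arm -> true rate."""
    rng = random.Random(seed)
    sampler = BernoulliThompsonSampler(success_rates)
    pulls = {arm: 0 for arm in success_rates}
    for _ in range(steps):
        arm = sampler.select(rng)
        # Synthetic stand-in for the customer's purchase decision.
        reward = 1 if rng.random() < success_rates[arm] else 0
        sampler.update(arm, reward)
        pulls[arm] += 1
    return sampler, pulls
```

Calling `simulate({"model_a": 0.45, "model_b": 0.55, "model_c": 0.60}, steps=2000)` with these hypothetical arm names returns the fitted sampler and the number of sessions routed to each arm. The exact counts depend on the random seed.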
In this section we discuss an offline simulation performed on synthetic data to show the convergence of Thompson sampling on the ancillary pricing models developed in our work. Specifically, we create customer session samples where the three experimental models have a success rate in capturing customer willingness to pay of 0.45, 0.55, and 0.60, respectively (the true probabilities listed in Table 1), and then aim to infer these probabilities through Thompson sampling from the resulting session stream.
We start the simulation with the prior $\mathrm{Beta}(1, 1)$ for every arm, which corresponds to a uniform prior between 0 and 1. The run is then simulated for 2,000 steps and target probabilities are recorded. Table 1 demonstrates that Thompson sampling converges to the latent probability distribution (the success rates of each model) within a reasonable number of iterations, indicating its potential for deployment in real-world pricing settings.
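Table 1: True and inferred success probabilities on synthetic data.

| True probability | Inferred probability | # of trials |
|---|---|---|
| 0.45 | 0.45 | 70 |
| 0.55 | 0.55 | 305 |
| 0.60 | 0.61 | 1625 |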
We now apply the Thompson sampling algorithm to a testing environment constructed from real-world data. Figure 1 shows an overview of our testing setup for evaluating the adaptive meta-decision framework. Our simulation environment models customer behavior by sampling sessions based on 6 months of historically collected data, amounting to a total of about 16,000 sessions. In our models, we make monotonicity assumptions for consistency. First, we assume that if a customer is willing to purchase a product for price $p$, they are willing to purchase the same product at a price $p' < p$. Similarly, if a customer is unwilling to purchase a product at price $p$, they will be unwilling to purchase at a price $p' > p$. We define $\hat{y}$, shown in Figure 2, as a latent variable that enforces this assumption by taking the historical ground truth (purchased by customer or not) into account. Hence, the latent variable can be used as the reward response from the simulator for the offered price, given that customer session.
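The monotonicity assumptions above suggest a simple reward rule for the simulator. The sketch below is one possible reading, not the authors’ implementation: `HistoricalSession` and `simulated_reward` are hypothetical names, and treating a new price equal to the logged price as a purchase is an assumption. Cases that the assumptions leave undetermined (for example, a lower price offered to a customer who declined historically) are conservatively scored as no purchase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HistoricalSession:
    offered_price: float  # price shown to the customer in the logged session
    purchased: bool       # historical ground truth for that session


def simulated_reward(session: HistoricalSession, new_price: float) -> int:
    """Return 1 only when monotonicity implies a purchase at `new_price`."""
    if session.purchased and new_price <= session.offered_price:
        # The customer bought at the logged price, so they are assumed to
        # buy at any price that is no higher.
        return 1
    # Either the customer declined historically, or the new price exceeds the
    # price they are known to accept, so no purchase can be inferred.
    return 0
```

A rule of this shape rewards a model only when it prices at or below a level the customer is known to have accepted.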
We consider the following three pricing models, presented in Shukla et al. (2019), in which pricing is modeled as a two-step process. First, the purchase probability of the ancillary is estimated as a function of the price; second, a price that optimizes revenue is offered to the customer. Further details on these models can be found in our previous work (Shukla et al., 2019).

Gaussian Naive Bayes (GNB): This two-stage pricing model uses a Gaussian Naive Bayes (GNB) model for ancillary purchase probability prediction and a pre-calibrated logistic price mapping function for revenue optimization.

Gaussian Naive Bayes with clustered features (GNBC): This two-stage pricing model uses a Gaussian Naive Bayes with clustered features (GNBC) model for ancillary purchase probability prediction and a pre-calibrated logistic price mapping function for revenue optimization.

Deep Neural Network (DNN): This two-stage pricing model uses a Deep Neural Network (DNN) trained using a weighted cross-entropy loss function for ancillary purchase probability estimation. For price optimization, we implement a simple discrete exhaustive search algorithm that finds the optimal price point within the permissible pricing range.
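The two-step structure shared by these models (estimate purchase probability as a function of price, then pick a revenue-optimal price) can be illustrated with a discrete exhaustive search like the one described for the DNN. This is a sketch under stated assumptions: `toy_purchase_probability` is a hypothetical logistic curve standing in for a trained model, and the candidate price grid is invented for the example.

```python
import math
from typing import Callable, Sequence


def best_price(purchase_probability: Callable[[float], float],
               candidate_prices: Sequence[float]) -> float:
    """Exhaustive search: return the price with the highest expected revenue."""
    # Expected revenue of an offer is price times probability of purchase.
    return max(candidate_prices,
               key=lambda price: price * purchase_probability(price))


def toy_purchase_probability(price: float) -> float:
    # Hypothetical stand-in for a trained classifier: the probability of
    # purchase falls logistically as the price rises.
    return 1.0 / (1.0 + math.exp(0.3 * (price - 20.0)))


prices = [10.0, 15.0, 20.0, 25.0, 30.0]  # illustrative permissible price points
chosen = best_price(toy_purchase_probability, prices)
```

Because the search only evaluates the listed price points, its cost grows linearly with the size of the grid, which is practical when the permissible pricing range is small and discrete.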
Our simulation results using Thompson sampling as a meta-learning framework are shown in Table 2. The expected offer conversion scores for each model, where each model is considered as an arm of the multi-armed bandit, are shown in Figure 3, and Figure 4 shows the evolution of the assignment probability learned using Thompson sampling. Figure 3 shows that the probability of success for GNB is higher than the others for the first 5,000 sessions, after which the DNN model dominates. Consequently, the learned assignment probability for DNN shows an increasing trend with the number of sessions (steps), whereas the learned assignment probabilities for GNB and GNBC show a decreasing trend, with both trends breaking even around 5,000 steps. We observe that DNN significantly outperforms GNB and GNBC, which is reflected in the assignment probability learned using Thompson sampling. DNN outperforms the other models because the prices it recommends are, on average, lower than those of the other two. Since the simulator operates under the willingness-to-pay assumption and gives a reward only for an affirmative response, only lower prices with ground truth equal to 1 are rewarded. Because the simulator is thus biased towards models that price conservatively rather than aggressively, the routing probability learns this trend through meta-learning based on Thompson sampling.
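Table 2: Assignment probability and offer conversion score for each model under Thompson sampling.

| Model | Assignment probability | Offer conversion score |
|---|---|---|
| GNBC | 0.177 | 0.627% |
| GNB | 0.266 | 0.943% |
| DNN | 0.557 | 1.976% |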
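
Table 3: Expected revenue per offer and conversion score at the end of the simulation.

| Models | Revenue per offer | Conversion score |
|---|---|---|
| Only GNBC | 0.57 | 0.61% |
| Only GNB | 0.84 | 0.93% |
| Only DNN | 1.41 | 1.71% |
| Random | 0.93 | 1.00% |
| MAB | 1.33 | 1.58% |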
From a business perspective, revenue per offer is another crucial metric. We measure expected revenue per offer in the following scenarios: a multi-armed bandit (MAB) as an adaptive meta-decision framework, random selection of pricing models with equal probability, and each individual model (GNBC, GNB or DNN) recommending the ancillary price on its own. Figure 5 shows the expected revenue per offer in the simulation environment for the six-month, 16,000-session period. The revenue per offer generated by GNB alone is highest until 6,000 sessions; after that, DNN generates better revenue per offer. Hence, an ideal meta-decision framework should select the model that generates the highest revenue for each offer, i.e., GNB until 6,000 sessions and DNN afterwards. Since information about the customer’s response is unavailable prior to the model selection, the model that will generate the highest revenue per offer cannot be known in advance. Hence, exploration is used in the MAB-based approach. The trend for the MAB using Thompson sampling (red line) is similar to that of the ideal meta-decision framework. Table 3 shows the expected revenue per offer and conversion scores at the end of the simulation. The MAB meta-model generates 43% more revenue per offer and 58% more conversion than random selection. However, the cost of exploration is reflected in these metrics: by the end of the simulation, the DNN alone achieves 6% higher revenue per offer and 8% higher conversion score than the MAB.
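The percentages quoted here follow directly from the end-of-simulation values in Table 3. The short check below recomputes them; the helper name `relative_lift` is introduced only for this illustration.

```python
# End-of-simulation values copied from Table 3.
revenue_per_offer = {"Random": 0.93, "MAB": 1.33, "Only DNN": 1.41}
conversion_pct = {"Random": 1.00, "MAB": 1.58, "Only DNN": 1.71}


def relative_lift(value: float, baseline: float) -> float:
    """Percentage by which `value` exceeds `baseline`."""
    return 100.0 * (value - baseline) / baseline


# MAB compared with uniformly random routing.
mab_vs_random_revenue = relative_lift(revenue_per_offer["MAB"],
                                      revenue_per_offer["Random"])
mab_vs_random_conversion = relative_lift(conversion_pct["MAB"],
                                         conversion_pct["Random"])

# Best single model (DNN) compared with the MAB: the cost of exploration.
dnn_vs_mab_revenue = relative_lift(revenue_per_offer["Only DNN"],
                                   revenue_per_offer["MAB"])
dnn_vs_mab_conversion = relative_lift(conversion_pct["Only DNN"],
                                      conversion_pct["MAB"])
```

Rounded to whole percentages, the four quantities match the 43%, 58%, 6%, and 8% figures reported in the text.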
The objective of Thompson sampling here is to maximize the conversion score. The improvement in revenue per offer is therefore implicit and not guaranteed. In the future, we plan to adjust the formulation of the multi-armed bandit method to maximize revenue per offer directly. Another important benefit of the Thompson sampling based approach is active protection against revenue loss from directing traffic to suboptimal models. This meta-decision framework allows for instant and automated adjustments to the traffic assignment ratios. For example, a new model that performs badly compared to previously deployed models will be assigned progressively fewer sessions, reducing the amount of lost revenue.
Another extension of the meta-decision framework based on the multi-armed bandit approach addresses contextual online decision problems (Russo et al., 2018). In these problems, the choice of the arm $x_t$ in a multi-armed bandit also depends on an independent random variable $z_t$ that the agent observes prior to making the decision. In such scenarios, the conditional distribution of the response $y_t$ is of the form $p_\theta(\cdot \mid x_t, z_t)$. Contextual bandit problems of this kind can be addressed through augmenting the action space and introducing time-varying constraint sets by viewing action and context together as $\tilde{x}_t = (x_t, z_t)$, with each arm (choice) represented as $\tilde{\mathcal{X}}_t = \{(x, z_t) : x \in \mathcal{X}\}$, where $\mathcal{X}$ is the set from which $x_t$ must be chosen.
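In settings where models have a particular baseline criterion to maintain, caution-based sampling can be used (Chapelle & Li, 2011; Russo et al., 2018). This can be accomplished by constraining the actions available at each time step $t$ to those with a lower bound on expected average reward, $\mathcal{X}_t = \{x \in \mathcal{X} : \mathbb{E}[r_t \mid x_t = x] \geq \underline{r}\}$, where $\mathcal{X}$ is the full set of actions and $\underline{r}$ is the required baseline. This ensures that the expected average reward obtained using such actions is at least $\underline{r}$.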
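One way to picture the caution constraint is to filter the arms before sampling. The sketch below is an illustrative assumption rather than a method from the cited works: it uses each arm’s posterior mean as the estimate of expected reward, and `fallback_arm` is a hypothetical safe default used when no arm currently meets the floor.

```python
import random


def cautious_thompson_select(alpha, beta, reward_floor, fallback_arm, rng=random):
    """Thompson sampling restricted to arms whose posterior mean meets a floor.

    `alpha` and `beta` map each arm to its Beta posterior parameters.
    """
    # Keep only arms whose estimated expected reward satisfies the baseline.
    allowed = [arm for arm in alpha
               if alpha[arm] / (alpha[arm] + beta[arm]) >= reward_floor]
    if not allowed:
        # No arm is currently trusted to meet the baseline.
        return fallback_arm
    # Ordinary Thompson sampling over the constrained action set.
    draws = {arm: rng.betavariate(alpha[arm], beta[arm]) for arm in allowed}
    return max(draws, key=draws.get)
```

A filter on the posterior mean is a blunt instrument: a newly introduced arm starting from a uniform prior has mean 0.5 and would pass or fail purely on the chosen floor, so in practice the prior and the floor would need to be set together.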
So far, we have discussed settings in which the target meta-parameters are constant over time, i.e., belong to a stationary system. In practice, dynamic pricing problems are time-dependent and thereby non-stationary, and are more appropriately modeled by time-varying parameters $\theta_t$, such that the reward $r_t$ is generated by $p_{\theta_t}(\cdot \mid x_t)$. In such contexts, the multi-armed bandit approach will never stop exploring the arms, which could be a potential drawback. A more robust method involves ignoring all historical observations made prior to a certain time period in the past (Cortes et al., 2017). Decision-makers can produce a posterior distribution after every time step based on the prior and conditioned only on the most recent actions and observations. Model parameters are sampled from this distribution, and an action is selected to optimize the associated model.
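The idea of conditioning only on recent observations can be sketched with a fixed-length history buffer. This is a minimal illustration under assumptions, not the algorithm of Cortes et al. (2017): the class name `SlidingWindowThompsonSampler` is hypothetical, and the window counts the most recent observations across all arms rather than per arm.

```python
import random
from collections import deque


class SlidingWindowThompsonSampler:
    """Thompson sampling whose posterior uses only the last `window` observations."""

    def __init__(self, arms, window, prior_alpha=1.0, prior_beta=1.0):
        self.arms = list(arms)
        self.prior_alpha = prior_alpha
        self.prior_beta = prior_beta
        # A deque with maxlen drops the oldest (arm, reward) pair once full.
        self.history = deque(maxlen=window)

    def posterior(self, arm):
        # Rebuild the Beta parameters from the prior plus in-window outcomes.
        successes = sum(r for a, r in self.history if a == arm)
        failures = sum(1 - r for a, r in self.history if a == arm)
        return self.prior_alpha + successes, self.prior_beta + failures

    def select(self, rng=random):
        draws = {arm: rng.betavariate(*self.posterior(arm)) for arm in self.arms}
        return max(draws, key=draws.get)

    def update(self, arm, reward):
        self.history.append((arm, reward))
```

An arm that has not been pulled within the window falls back to the prior, which keeps it eligible for exploration if the environment has shifted. The window length trades responsiveness to change against the variance of the estimates.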
An alternative approach is to view dynamic pricing recommendations as a weighted sum of prices from multiple arms, a concept referred to as concurrence. In this case, the decision-maker takes multiple actions (arms) concurrently. Concurrence can be predefined, with a fixed number of arms pulled every time, or it can be coupled with baseline caution (discussed above). This approach is similar to one based on an ensemble of models.
We have demonstrated, for a dynamic pricing problem in an offline setting, that a multi-armed bandit approach can adaptively learn a better routing scheme over a bucket of models, generating 43% more revenue per offer and 58% more conversion than a randomly selected model. However, this approach takes a slight toll on revenue per offer and conversion score due to the exploration phase, in comparison to the best-performing model alone. We are currently working on deploying this in an online setting, where purchasing customers reveal the ground truth in real time. In the future, we plan to extend our model to contextual multi-armed bandits with caution and concurrence, to optimize for different business metrics. We will also model customers’ elasticity towards an offered ancillary to enable the simulator environment to model ground truth more closely. Finally, although this paper focuses specifically on ancillary pricing for airlines, the framework presented here can be applied to any machine learning problem where a bucket of online models and a pipeline of new models are waiting to be deployed.
Acknowledgements We sincerely acknowledge our airline partners for their continuing support. The academic partners are also thankful to deepair (www.deepair.io) for funding this research.
References
 Arlot et al. (2010) Arlot, S., Celisse, A., et al. A survey of cross-validation procedures for model selection. Statistics surveys, 4:40–79, 2010.
 Auer et al. (2002) Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multi-armed bandit problem. Machine learning, 47(2-3):235–256, 2002.
 Bockelie & Belobaba (2017) Bockelie, A. and Belobaba, P. Incorporating ancillary services in airline passenger choice models. Journal of Revenue and Pricing Management, 16(6):553–568, 2017.
 Bubeck et al. (2012) Bubeck, S., Cesa-Bianchi, N., et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
 Chapelle & Li (2011) Chapelle, O. and Li, L. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pp. 2249–2257, 2011.
 Cortes et al. (2017) Cortes, C., DeSalvo, G., Kuznetsov, V., Mohri, M., and Yang, S. Multi-armed bandits with non-stationary rewards. CoRR, abs/1710.10657, 2017.
 IdeaWorks (2018) IdeaWorks. Airline ancillary revenue projected to be $92.9 billion worldwide in 2018. https://www.ideaworkscompany.com/wpcontent/uploads/2018/11/PressRelease133GlobalEstimate2018.pdf, 2018. IdeaWorks Article.
 Kotsiantis et al. (2007) Kotsiantis, S. B., Zaharakis, I., and Pintelas, P. Supervised machine learning: A review of classification techniques. Emerging artificial intelligence applications in computer engineering, 160:3–24, 2007.
 Russo et al. (2018) Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., Wen, Z., et al. A tutorial on thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96, 2018.
 Schafer et al. (2002) Schafer, J. B., Konstan, J. A., and Riedl, J. Meta-recommendation systems: user-controlled integration of diverse recommendations. In Proceedings of the eleventh international conference on Information and knowledge management, pp. 43–51. ACM, 2002.
 Shukla et al. (2019) Shukla, N., Kolbeinsson, A., Otwell, K., Marla, L., and Yellepeddi, K. Dynamic pricing for airline ancillaries with customer context. arXiv preprint arXiv:1902.02236, 2019.
 Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018.
 Yang et al. (2017) Yang, F., Ramdas, A., Jamieson, K. G., and Wainwright, M. J. A framework for multi-A(rmed)/B(andit) testing with online FDR control. In Advances in Neural Information Processing Systems, pp. 5957–5966, 2017.
 Zeng et al. (2016) Zeng, C., Wang, Q., Mokhtari, S., and Li, T. Online context-aware recommendation with time varying multi-armed bandit. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 2025–2034. ACM, 2016.