Multiple machine learning and prediction models are often used for the same prediction or recommendation task. In our recent work, where we develop and deploy airline ancillary pricing models in an online setting, we found that among the multiple pricing models developed, no single model clearly dominates the others for all incoming customer requests. Thus, as algorithm designers, we face an exploration-exploitation dilemma. In this work, we introduce an adaptive meta-decision framework that uses Thompson sampling, a popular multi-armed bandit solution method, to route customer requests to the various pricing models based on their online performance. We show that this adaptive approach outperforms a uniformly random selection policy, improving the expected revenue per offer by 43% and the conversion score by 58% in an offline simulation.



Adaptive Model Selection Framework: An Application to Airline Pricing


Naman Shukla*  Arinbjörn Kolbeinsson*  Lavanya Marla  Kartik Yellepeddi

footnotetext: *Equal contribution. Correspondence to: Naman Shukla, Kartik Yellepeddi.
Copyright 2019 by the author(s).

In recent years, ancillaries such as bags, meals, wifi service and extra leg room have become a major source of revenue and profitability for airlines (IdeaWorks, 2018; Bockelie & Belobaba, 2017). Conventional pricing strategies based on static business rules do not respond to changing market conditions and do not match customer willingness to pay, leaving potential revenue sources untapped. In our recent work (Shukla et al., 2019), we demonstrated that machine learning approaches that dynamically price such ancillaries based on customer context lead to higher average revenue for the airline compared to conventional pricing schemes. Specifically, we presented three airline ancillary pricing models that provided dynamic price recommendations specific to each customer interaction and improved the airline's revenue per offer. We also measured the real-world business impact of these approaches by deploying them in an A/B test on an airline's internet booking website. As part of that research, in order to measure the relative performance of these models, they were all deployed in parallel on the airline's booking system. The online business performance of our deployed models was better on average than the human-designed pricing, but no single model outperformed the other deployed models in all customer contexts. Moreover, we were also developing new models that performed well in the offline setting and showed promise of doing better in an online setting. As a result, we face an exploration-exploitation dilemma: do we (a) exploit the single model that does best on average, (b) explore other models that do better in particular contexts, or (c) rely on offline metrics? We therefore present a meta-decision framework that addresses this issue.

Our contributions in this work are as follows.

  • We develop an approach that uses a multi-armed bandit method to actively and adaptively route pricing requests among multiple models to further improve revenue.

  • We test this approach in a rigorously constructed simulation environment to demonstrate that an improved routing scheme can achieve better business metrics.

  • We lay a foundation for future research to use contextual multi-armed bandit methods and perform online testing of this approach.


Ensemble methods are meta-learning algorithms that combine multiple machine learning methods to reduce error and variance (Kotsiantis et al., 2007). Several ensemble learning methods have been proposed to improve personalization systems. Meta-recommenders were pioneered by Schafer et al. (2002) in their work on combining multiple information sources and recommendation techniques for improved recommendations. More recent advancements in ensemble learning in the context of the 'bucket of models' framework include model selection using cross-validation techniques (Arlot et al., 2010). Most closely related to our work is the use of contextual multi-armed bandits for recommendation in an online setting (Zeng et al., 2016). The notable difference from our approach in this paper is that they apply the bandit directly to the recommendation problem, whereas we first generate recommendations using an assortment of diverse models and then use the bandit to select among them. Multi-armed bandits have also been used to control the false-discovery rate in sequential A/B tests (Yang et al., 2017). However, that method does not allow for maximizing an arbitrary outcome (such as revenue or offer conversion score), which is a necessary property for our required solution.


Among a set of models that price ancillaries, our objective is to direct online traffic (customers) to the better-performing models, i.e., those that predict customer willingness to pay more accurately, in order to maximize revenue. To formalize this, we model the share of online traffic directed to each pricing model as a tunable meta-parameter. Learning this parameter is therefore critical to identifying the “winning” model and separating it from the other models. Based on customers' responses (to purchase an ancillary or not), we can reinforce the meta-parameters to adapt to the outcomes. We model this problem from a reinforcement learning perspective as a sequence of actions (selecting a model) and rewards (offer accepted or not) that can be learned using a multi-armed bandit.


Solving the multi-armed bandit problem involves choosing an action (pulling an 'arm') from a given action space when the reward associated with each action is only partially known (Auer et al., 2002; Sutton & Barto, 2018). The environment reveals a reward after an action is taken, thereby giving information about both the environment and that action's reward. The objective is to maximize the expected reward, or to minimize the cumulative expected regret (Bubeck et al., 2012), from a series of actions. The bandit problem involves a trade-off between exploration and exploitation, since the reward is revealed only for the chosen action. In our framework, each of the deployed ancillary pricing models is viewed as an arm. This enables the decision-maker to direct online traffic based on customers' responses to the prices offered by each arm. Hence, this meta-learning approach provides a trainable mechanism for allocating customer traffic to the arms that yield the best conversion scores. We use the Thompson sampling algorithm, a popular multi-armed bandit solution approach, which allows us to exploit the current winning arm while exploring seemingly inferior arms, or newly introduced arms, that might outperform the current best arm (Chapelle & Li, 2011; Russo et al., 2018).


In a Bernoulli bandit problem, there are a total of $K$ valid actions, where an action $x_t \in \{1, \dots, K\}$ at time $t$ produces a reward of one with probability $\theta_{x_t}$ and zero with probability $1 - \theta_{x_t}$. The mean reward $\theta_k$ is assumed to be unknown, but stationary over time. The agent begins with an independent prior belief over each $\theta_k$. As observations are gathered, the distribution is updated according to Bayes' rule. As described in equation (1), the priors are assumed to be beta-distributed with parameters $\alpha_k$ and $\beta_k$. In particular, the exact prior probability density function for an action $k$ is

$$p(\theta_k) = \frac{\Gamma(\alpha_k + \beta_k)}{\Gamma(\alpha_k)\,\Gamma(\beta_k)}\, \theta_k^{\alpha_k - 1} (1 - \theta_k)^{\beta_k - 1}, \qquad (1)$$

where $\Gamma$ denotes the gamma function. The beta distribution is particularly suited to this computation because it is a conjugate prior to the Bernoulli distribution.
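Because of this conjugacy, the posterior update reduces to incrementing the Beta parameters by the observed reward. A minimal illustration in Python (the helper names are ours, not from the paper):

```python
# Conjugate Beta-Bernoulli update: after observing a 0/1 reward r for an arm,
# the Beta(alpha, beta) prior becomes Beta(alpha + r, beta + 1 - r).
def update(alpha, beta, reward):
    """Posterior parameters after one Bernoulli observation."""
    return alpha + reward, beta + (1 - reward)

def posterior_mean(alpha, beta):
    """Expected success probability under Beta(alpha, beta)."""
    return alpha / (alpha + beta)

# Starting from the uniform prior Beta(1, 1), three purchases and one
# rejection yield Beta(4, 2), with posterior mean 4/6.
a, b = 1, 1
for r in [1, 1, 0, 1]:
    a, b = update(a, b, r)
```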

Algorithm 1 describes the method used for choosing models within our simulation environment (Figure 1, Step 2). $\mathcal{M}$ is the set of models (the three pricing models described below), $x_t \in \mathcal{M}$ is the model chosen to provide the pricing recommendation at step $t$, and the corresponding reward $r_t$ is 1 if the ancillary is purchased and 0 otherwise.

1:  for $t = 1, 2, \dots$ do
2:     for $k = 1, \dots, K$ do
3:        Sample $\hat{\theta}_k \sim \mathrm{Beta}(\alpha_k, \beta_k)$
4:     end for
5:     $x_t \leftarrow \arg\max_k \hat{\theta}_k$
6:     Apply $x_t$
7:     Observe reward $r_t$
8:     $(\alpha_{x_t}, \beta_{x_t}) \leftarrow (\alpha_{x_t} + r_t, \beta_{x_t} + 1 - r_t)$
9:  end for
Algorithm 1 Thompson sampling for the Bernoulli bandit
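As a runnable sketch, Algorithm 1 can be transcribed into Python as follows; here a vector of true success probabilities stands in for the live customer responses that would supply the reward in deployment:

```python
import random

def thompson_sampling(true_probs, steps, seed=0):
    """Bernoulli-bandit Thompson sampling (Algorithm 1).

    `true_probs` simulates the environment; in deployment the reward
    would come from the customer's accept/reject response instead.
    """
    rng = random.Random(seed)
    k = len(true_probs)
    alpha = [1.0] * k  # Beta(1, 1) uniform priors
    beta = [1.0] * k
    pulls = [0] * k
    for _ in range(steps):
        # Sample a success probability for each arm from its posterior.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        x = max(range(k), key=lambda i: samples[i])   # greedy on the samples
        r = 1 if rng.random() < true_probs[x] else 0  # observe the reward
        alpha[x] += r                                  # conjugate update
        beta[x] += 1 - r
        pulls[x] += 1
    return alpha, beta, pulls

# Arms with the success rates used in our synthetic simulation.
alpha, beta, pulls = thompson_sampling([0.45, 0.55, 0.60], 2000)
```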

In this section we discuss an offline simulation, performed on synthetic data, to show the convergence of Thompson sampling over the ancillary pricing models developed in our work. Specifically, we create synthetic customer session samples in which the three experimental models have success rates in capturing customer willingness to pay of 0.45, 0.55 and 0.60, respectively (Table 1); we then aim to infer these probabilities through Thompson sampling from the caller streams.

We start the simulation with the prior Beta(α = 1, β = 1) for each arm, which corresponds to a uniform prior between 0 and 1. The run is then simulated for 2,000 steps and the target probabilities are recorded. Table 1 demonstrates that Thompson sampling converges to the latent probability distribution (the success rate of each model) within a reasonable number of iterations, indicating its potential for deployment in real-world pricing settings.

True probability Inferred probability # of trials
0.45 0.45 70
0.55 0.55 305
0.60 0.61 1625
Table 1: Simulation results for Thompson sampling after 2,000 steps.
Figure 1: Offline testing simulation environment setup

We now apply the Thompson sampling algorithm to a testing environment constructed from real-world data. Figure 1 shows an overview of our testing setup for evaluating the adaptive meta-decision framework. Our simulation environment models customer behavior by sampling sessions from 6 months of historically collected data, a total of about 16,000 sessions. In our models, we make monotonicity assumptions for consistency. First, we assume that if a customer is willing to purchase a product at price $p$, they are willing to purchase the same product at any price $p' \le p$. Similarly, if a customer is unwilling to purchase a product at price $p$, they will be unwilling to purchase at any price $p' \ge p$. We define a latent variable, shown in Figure 2, that encodes this assumption by taking the historical ground truth (purchased by the customer or not) into account. Hence, this latent variable can be used as the reward response from the simulator for the offered price, given that customer session.
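One plausible reading of this latent-variable construction, under the monotonicity assumption above, is the following replay rule (the function and argument names are illustrative, not from the paper):

```python
def simulated_reward(offered_price, historical_price, purchased):
    """Reward for a replayed session under the willingness-to-pay assumption.

    If the customer historically purchased at `historical_price`, any offer at
    or below that price is assumed to also convert; if they declined, any offer
    at or above that price is assumed to also be declined. Other cases are not
    determined by the monotonicity assumption and yield no reward signal.
    """
    if purchased and offered_price <= historical_price:
        return 1
    if not purchased and offered_price >= historical_price:
        return 0
    return None  # outcome unknown: monotonicity does not apply
```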

Figure 2: Latent variable mapping from ground truth where prices are arranged in ascending order from left to right

We consider the following three pricing models, presented in (Shukla et al., 2019), in which pricing is modeled as a two-step process: first, the purchase probability of the ancillary is estimated as a function of the price; second, a price that optimizes expected revenue is offered to the customer. Further details on these models can be found in our previous work (Shukla et al., 2019).

  1. Gaussian Naive Bayes (GNB): This two-stage pricing model uses a Gaussian Naive Bayes (GNB) model for ancillary purchase probability prediction and a pre-calibrated logistic price mapping function for revenue optimization.

  2. Gaussian Naive Bayes with clustered features (GNBC): This two-stage pricing model uses a Gaussian Naive Bayes with clustered features (GNBC) model for ancillary purchase probability prediction and a pre-calibrated logistic price mapping function for revenue optimization.

  3. Deep-Neural Network (DNN): This two-stage pricing model uses a Deep-Neural Network (DNN) trained using a weighted cross-entropy loss function for ancillary purchase probability estimation. For price optimization, we implement a simple discrete exhaustive search algorithm that finds the optimal price point within the permissible pricing range.
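The discrete exhaustive search in the DNN model's second step can be sketched as follows; the demand curve used here is a hypothetical stand-in for the trained purchase-probability model:

```python
def optimal_price(purchase_prob, price_grid):
    """Discrete exhaustive search for the revenue-maximizing price.

    `purchase_prob` maps a candidate price to an estimated purchase
    probability (a stand-in for the trained model's prediction); expected
    revenue is the price times that probability.
    """
    return max(price_grid, key=lambda p: p * purchase_prob(p))

# Illustrative demand curve (hypothetical, not from the paper): purchase
# probability falls linearly from 1.0 at price 0 to 0.0 at price 100.
demand = lambda p: max(0.0, 1.0 - p / 100.0)
grid = range(10, 100, 10)
best = optimal_price(demand, grid)  # p * (1 - p/100) peaks at p = 50
```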

Our simulation results using Thompson sampling as a meta-learning framework are shown in Table 2. The expected offer conversion scores for each model, where each model is treated as an arm of the multi-armed bandit, are shown in Figure 3. Figure 4 shows the evolution of the assignment probability learned by Thompson sampling. Figure 3 shows that the probability of success for GNB is higher than the others for the first 5,000 sessions; subsequently, the DNN model dominates as the sessions progress. Consequently, the learned assignment probability for DNN shows an increasing trend with the number of sessions (steps), whereas the learned assignment probabilities for GNB and GNBC show decreasing trends, with the curves crossing around 5,000 steps. We observe that DNN significantly outperforms GNB and GNBC, and this is reflected in the assignment probability learned by Thompson sampling. The reason DNN outperforms the other models is that the prices recommended by DNN are, on average, lower than those of the other two. Since the simulator, which operates on the willingness-to-pay assumption, rewards only affirmative responses, only lower prices in sessions whose historical ground truth is a purchase are rewarded. Because the simulator is thus biased towards models that price conservatively rather than aggressively, the routing probability learns this trend through the Thompson sampling based meta-learning.

Model Assignment probability Offer conversion score
GNBC 0.177 0.627%
GNB 0.266 0.943%
DNN 0.557 1.976%
Table 2: Thompson sampling converged values for ancillary pricing using simulated environment. The offer conversion score is the percent of offers generated that were accepted by customers.
Figure 3: Conversion score of each arm using Thompson sampling
Figure 4: Learned Assignment Probability
Models Revenue per offer Conversion score
Only GNBC 0.57 0.61%
Only GNB 0.84 0.93%
Only DNN 1.41 1.71%
Random 0.93 1.00%
MAB 1.33 1.58%
Table 3: Revenue per offer and Offer conversion scores for different models

From a business perspective, revenue per offer is another crucial metric. We measure expected revenue per offer in the following scenarios: the multi-armed bandit (MAB) as an adaptive meta-decision framework, random selection of pricing models with equal probability, and each individual model (GNBC, GNB or DNN) recommending the ancillary price. Figure 5 shows the expected revenue per offer in the simulation environment over the six-month, 16,000-session period. The revenue per offer generated by GNB alone is highest until 6,000 sessions; after that, DNN generates better revenue per offer. Hence, an ideal meta-decision framework would select the model that generates the highest revenue for each offer, i.e., GNB until 6,000 sessions and DNN afterwards. Since information about the customer's response is unavailable prior to model selection, the model that will generate the highest revenue per offer cannot be identified deterministically; hence, exploration is used in the MAB-based approach. The trend for the MAB using Thompson sampling (red line in Figure 5) is similar to that of the ideal meta-decision framework. Table 3 shows the expected revenue per offer and conversion scores at the end of the simulation. The MAB meta-model generates 43% more revenue per offer and 58% higher conversion than random selection. However, the cost of exploration is reflected in the final numbers: the DNN alone achieves 6% higher revenue per offer and 8% higher conversion score than the MAB by the end of the simulation.

Figure 5: Expected revenue per offer

The objective of Thompson sampling here is to maximize the conversion score; any improvement to revenue per offer is therefore implicit and not guaranteed. In the future, we plan to adjust the formulation of the multi-armed bandit method to maximize revenue per offer directly. Another important benefit of the Thompson sampling based approach is active protection against revenue loss from directing traffic to sub-optimal models. The meta-decision framework allows instant and automated adjustment of the traffic assignment ratios: for example, a new model that performs poorly compared to previously deployed models is assigned iteratively fewer sessions, reducing the amount of lost revenue.


Another extension of the meta-decision framework based on the multi-armed bandit approach addresses contextual online decision problems (Russo et al., 2018). In these problems, the choice of the arm in a multi-armed bandit also depends on an independent random variable, the context $z_t$, that the agent observes prior to making the decision. In such scenarios, the conditional distribution of the response $y_t$ is of the form $p_\theta(y_t \mid z_t, x_t)$. Contextual bandit problems of this kind can be addressed by augmenting the action space and introducing time-varying constraint sets: context and action are viewed together as $\tilde{x}_t = (z_t, x_t)$, with each arm (choice) represented as $\tilde{x}_t \in \tilde{\mathcal{X}}_t$, where $\tilde{\mathcal{X}}_t = \{(z_t, x) : x \in \mathcal{X}\}$ is the set from which $\tilde{x}_t$ must be chosen.
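When the context takes a small number of discrete values, one simple realization of this idea is to maintain an independent Beta posterior per context-arm pair, i.e., one Bernoulli bandit per context. A sketch (the class name and discretized context are our own assumptions, not from the paper):

```python
import random
from collections import defaultdict

class ContextualBetaBandit:
    """Thompson sampling with an independent Beta posterior per (context, arm).

    With a small finite context set (e.g. route or cabin class), keeping a
    separate Beta(alpha, beta) pair for every context-arm combination is
    equivalent to running one Bernoulli bandit per context.
    """
    def __init__(self, n_arms, seed=0):
        self.n_arms = n_arms
        self.rng = random.Random(seed)
        # Each new context starts from uniform Beta(1, 1) priors on all arms.
        self.params = defaultdict(lambda: [[1.0, 1.0] for _ in range(n_arms)])

    def select(self, context):
        """Sample each arm's posterior for this context; pick the best sample."""
        post = self.params[context]
        samples = [self.rng.betavariate(a, b) for a, b in post]
        return max(range(self.n_arms), key=lambda i: samples[i])

    def update(self, context, arm, reward):
        """Conjugate update of the chosen arm's posterior for this context."""
        self.params[context][arm][0] += reward
        self.params[context][arm][1] += 1 - reward
```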

In settings where models must maintain a particular baseline criterion, caution-based sampling can be used (Chapelle & Li, 2011; Russo et al., 2018). This can be accomplished by constraining the actions at each time step to a set $\mathcal{X}_t = \{x : \mathbb{E}[r_t \mid x] \ge \underline{r}\}$, i.e., imposing a lower bound $\underline{r}$ on the expected average reward. This ensures that the expected average reward of the selected actions is at least $\underline{r}$.
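A minimal version of such a constraint simply filters the arm set by a posterior-mean floor before sampling; a production system would likely use a high-probability lower bound instead of the mean. The function below is an illustrative sketch, not taken from the cited works:

```python
def cautious_arms(posteriors, floor):
    """Restrict the action set to arms meeting a baseline reward criterion.

    `posteriors` maps arm name -> (alpha, beta); an arm survives only if its
    posterior mean alpha / (alpha + beta) is at least `floor`. An empty result
    signals that no arm currently meets the baseline.
    """
    return [arm for arm, (a, b) in posteriors.items() if a / (a + b) >= floor]
```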


So far, we have discussed settings in which the target meta-parameters are constant over time, i.e., the system is stationary. In practice, dynamic pricing problems are time-dependent and thereby non-stationary; they are more appropriately modeled by time-varying parameters $\theta_t$, such that the reward at time $t$ is generated by $p_{\theta_t}(y_t \mid x_t)$. In such contexts, the multi-armed bandit approach will never stop exploring the arms, which could be a potential drawback. A more robust method involves ignoring all historical observations made prior to a certain point in the past (Cortes et al., 2017). The decision-maker produces a posterior distribution after every time step based on the prior and only the most recent actions and observations; model parameters are sampled from this distribution, and an action is selected to optimize the associated model.
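A sliding-window variant of this idea rebuilds each arm's posterior from only its most recent observations. A sketch under these assumptions (the class name and window size are illustrative):

```python
import random
from collections import deque

class WindowedThompson:
    """Thompson sampling for a non-stationary Bernoulli bandit.

    The posterior for each arm is rebuilt from only the last `window`
    observations of that arm, discarding older history so the sampler
    keeps tracking drifting conversion rates.
    """
    def __init__(self, n_arms, window=500, seed=0):
        self.rng = random.Random(seed)
        self.history = [deque(maxlen=window) for _ in range(n_arms)]

    def select(self):
        """Form a windowed Beta posterior per arm and pick the best sample."""
        samples = []
        for obs in self.history:
            alpha = 1.0 + sum(obs)             # successes in the window
            beta = 1.0 + len(obs) - sum(obs)   # failures in the window
            samples.append(self.rng.betavariate(alpha, beta))
        return max(range(len(samples)), key=lambda i: samples[i])

    def update(self, arm, reward):
        """Append the 0/1 reward; the deque drops observations past the window."""
        self.history[arm].append(reward)
```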

An alternative approach is to view dynamic pricing recommendations as a weighted sum of prices from multiple arms, a concept referred to as concurrence. In this case, the decision-maker pulls multiple arms concurrently. Concurrence can be predefined with a fixed number of arms to be pulled each time, or it can be coupled with the baseline caution discussed above. This approach is similar to using an ensemble of models.


We have successfully demonstrated, for a dynamic pricing problem in an offline setting, that when dealing with a bucket of models, a multi-armed bandit approach can adaptively learn a routing scheme that generates 43% more revenue per offer and 58% higher conversion than a randomly selected model. However, this approach takes a slight toll on revenue per offer and conversion score compared to the best-performing model alone, due to the exploration phase. We are currently working on deploying this framework in an online setting, where purchasing customers reveal the ground truth in real time. In the future, we plan to extend our model to contextual multi-armed bandits with caution and concurrence, to optimize for different business metrics. We will also model customers' price elasticity towards an offered ancillary, to enable the simulation environment to model the ground truth more closely. Finally, although this paper focuses specifically on ancillary pricing for airlines, the framework presented here can be applied to any machine learning problem with a bucket of deployed online models and a pipeline of new models waiting to be deployed.


Acknowledgements We sincerely thank our airline partners for their continuing support. The academic partners are also grateful to deepair for funding this research.


  • Arlot et al. (2010) Arlot, S., Celisse, A., et al. A survey of cross-validation procedures for model selection. Statistics surveys, 4:40–79, 2010.
  • Auer et al. (2002) Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
  • Bockelie & Belobaba (2017) Bockelie, A. and Belobaba, P. Incorporating ancillary services in airline passenger choice models. Journal of Revenue and Pricing Management, 16(6):553–568, 2017.
  • Bubeck et al. (2012) Bubeck, S., Cesa-Bianchi, N., et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
  • Chapelle & Li (2011) Chapelle, O. and Li, L. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pp. 2249–2257, 2011.
  • Cortes et al. (2017) Cortes, C., DeSalvo, G., Kuznetsov, V., Mohri, M., and Yand, S. Multi-armed bandits with non-stationary rewards. CoRR, abs/1710.10657, 2017.
  • IdeaWorks (2018) IdeaWorks. Airline ancillary revenue projected to be $92.9 billion worldwide in 2018. IdeaWorks Article, 2018.
  • Kotsiantis et al. (2007) Kotsiantis, S. B., Zaharakis, I., and Pintelas, P. Supervised machine learning: A review of classification techniques. Emerging artificial intelligence applications in computer engineering, 160:3–24, 2007.
  • Russo et al. (2018) Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., Wen, Z., et al. A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96, 2018.
  • Schafer et al. (2002) Schafer, J. B., Konstan, J. A., and Riedl, J. Meta-recommendation systems: user-controlled integration of diverse recommendations. In Proceedings of the eleventh international conference on Information and knowledge management, pp. 43–51. ACM, 2002.
  • Shukla et al. (2019) Shukla, N., Kolbeinsson, A., Otwell, K., Marla, L., and Yellepeddi, K. Dynamic pricing for airline ancillaries with customer context. arXiv preprint arXiv:1902.02236, 2019.
  • Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018.
  • Yang et al. (2017) Yang, F., Ramdas, A., Jamieson, K. G., and Wainwright, M. J. A framework for Multi-A(rmed)/B(andit) testing with online FDR control. In Advances in Neural Information Processing Systems, pp. 5957–5966, 2017.
  • Zeng et al. (2016) Zeng, C., Wang, Q., Mokhtari, S., and Li, T. Online context-aware recommendation with time varying multi-armed bandit. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 2025–2034. ACM, 2016.