Submitted to Management Science 
manuscript 
A Nonparametric Approach to Modeling Choice with Limited Data
Vivek F. Farias
MIT Sloan, vivekf@mit.edu
Srikanth Jagabathula
EECS, MIT, jskanth@alum.mit.edu
Devavrat Shah
EECS, MIT, devavrat@mit.edu
A central push in operations models over the last decade has been the incorporation of models of customer choice. Real world implementations of many of these models face the formidable stumbling block of simply identifying the ‘right’ model of choice to use. Thus motivated, we visit the following problem: For a ‘generic’ model of consumer choice (namely, distributions over preference lists) and a limited amount of data on how consumers actually make decisions (such as marginal information about these distributions), how may one predict revenues from offering a particular assortment of choices? We present a framework to answer such questions and design a number of tractable algorithms from a data and computational standpoint for the same. This paper thus takes a significant step towards ‘automating’ the crucial task of choice model selection in the context of operational decision problems.
A problem of central interest to operations managers is the use of historical sales data in the prediction of revenues or sales from offering a particular assortment of products to customers. As one can imagine, such predictions form crucial inputs to several important business decisions, both operational and otherwise. A classical example of such a decision problem is that of assortment planning: deciding the ‘‘optimal’’ assortment of products to offer customers with a view to maximizing expected revenues (or some related objective) subject to various constraints (e.g. limited display or shelf space). A number of variants of this problem, both static and dynamic, arise in essentially every facet of revenue management. Such problems are seen as crucial revenue management tasks and needless to say, accurate revenue or sales predictions fundamentally impact how well we can perform such tasks.
Why might these crucial predictions be difficult to make? Consider the task of predicting expected sales rates from offering a particular set of products to customers. In industry jargon, this is referred to as the ‘conversion rate’, and is defined as the probability of converting an arriving customer into a purchasing customer. Predicting the conversion rate for an offer set is difficult because the probability of purchase of each product depends on all the products on offer. This is due to substitution behavior, where an arriving customer potentially substitutes an unavailable product with an available one. Due to substitution, the sales observed for a product may be viewed as a combination of its ‘primary’ demand and additional demand due to substitution. Customer choice models have been used to model this behavior with success. At an abstract level, a choice model can be thought of as a conditional probability distribution that for any offer set yields the probability that an arriving customer purchases a given product in that set.
There is a vast literature spanning marketing, economics, and psychology devoted to the construction of parametric choice models and their estimation from data. In the literature that studies the sorts of revenue management decision problems we alluded to above, such models are typically assumed given. The implicit understanding is that a complete prescription for these decision problems will require fitting the ‘‘right’’ parametric choice model to data, so as to make accurate revenue or sales predictions. This is a complex task. Apart from the fact that one can never be sure that the chosen parametric structure is a ‘‘good’’ representation of the underlying ground truth, parametric models are prone to overfitting and underfitting issues. Once a structure is fixed, one does not glean new structural information from data. This is a serious issue in practice because although a simple model (such as the multinomial logit (MNL) model) may make practically unreasonable assumptions (such as the so-called ‘‘IIA’’ assumption), fitting a more complex model can lead to worse performance due to overfitting, and one can never be sure.
In this paper, we propose a nonparametric, data-driven approach to making revenue or sales predictions that affords the revenue manager the opportunity to avoid the challenging task of fitting an appropriate parametric choice model to historical data. Our approach views choice models generically, namely as distributions over rankings (or preference lists) of products. As shall be seen subsequently, this view subsumes essentially all extant choice models. Further, this view yields a nonparametric approach to choice modeling where the revenue manager does not need to think about the appropriate parametric structure for his problem, or the trade-off between model parsimony and the risk of overfitting. Rather, through the use of a nonparametric approach, our goal is to offload as much of this burden as possible to the data itself.
As mentioned above, we consider entirely generic models of choice, specified as a distribution over all possible rankings (or preference lists) of products. Our view of data is aligned with what one typically has available in reality – namely, sales rates of products in an assortment, for some set of product assortments. This is a general view of choice modeling. Our main contribution is to make this view operational, yielding a data-driven, nonparametric approach. Specifically, we make the following contributions in the context of this general setup:

Revenue Predictions: As mentioned above, accurate revenue or sales predictions form core inputs for a number of important revenue/inventory management problems. Available sales data will typically be insufficient to fully specify a generic model of choice of the type we consider. We therefore seek to identify the set of generic choice models consistent with available sales data. Given the need to make a revenue or sales prediction on a heretofore unseen assortment, we then offer the worst-case expected revenue possible for that assortment assuming that the true model lies in the set of models found to be consistent with observed sales data. Such an approach makes no a priori structural assumptions on the choice model, and has the appealing feature that as more data becomes available, the predictions will improve, by narrowing down the set of consistent models. This simple philosophy dictates challenging computational problems; for instance, the sets we compute are computationally unwieldy and, at first glance, highly intractable. Nonetheless, we successfully develop several simple algorithms of increasing sophistication to address these problems.

Empirical Evaluation: We conducted an empirical study to gauge the practical value of our approach, both in terms of the absolute quality of the predictions produced, and also relative to using alternative parametric approaches. We describe the results of two such studies:

Simulation Study: The purpose of our simulation study is to demonstrate that the robust approach can effectively capture model structure consistent with a number of different parametric models and produce good revenue predictions. The general setup in this study was as follows: We use a parametric model to generate synthetic transaction data. We then use this data in conjunction with our revenue prediction procedure to predict expected revenues over a swathe of offer sets. Our experimental design permits us to compare these predictions to the corresponding ‘ground truth’. The parametric families we considered included the multinomial logit (MNL), nested logit (NL), and mixture of multinomial logit (MMNL) models. In order to ‘stress-test’ our approach, we conducted experiments over a wide range of parameter regimes for these generative parametric choice models, including some that were fit to DVD sales data from Amazon.com. The predictions produced are remarkably accurate.

Empirical Study with Sales Data from a Major US Automaker: The purpose of our empirical study is two-fold: (1) to demonstrate how our setup can be applied with real-world data, and (2) to pit the robust method in a ‘‘horse-race’’ against the MNL and MMNL parametric families of models. For the case study, we used sales data collected daily at the dealership level over 2009 to 2010 for a range of small SUVs offered by a major US automaker for a dealership zone in the Midwest. We used a portion of this sales data as ‘training’ data. We made this data available to our robust approach, as well as in the fitting of an MNL model and an MMNL model. We tested the quality of ‘conversion rate’ predictions (i.e. a prediction of the sales rate given the assortment of models on the lot) using the robust approach and the incumbent parametric approaches on the remainder of the data. We conducted a series of experiments by varying the amount of training data made available to the approaches. We conclude that (a) the robust method improves on the accuracy of either of the parametric methods by about (this is large) in all cases and (b) unlike the parametric models, the robust method is apparently not susceptible to underfitting and overfitting issues. In fact, we see that the performance of the MMNL model relative to the MNL model deteriorates as the amount of training data available decreases due to overfitting.


Descriptive Analysis: In making revenue predictions, we did not need to concern ourselves with the choice model implicitly assumed by our prediction procedure. This fact notwithstanding, it is natural to consider criteria for selecting choice models consistent with the observed data that are independent of any decision context. Thus motivated, we consider the natural task of finding the simplest choice model consistent with the observed data. As in much of contemporary high dimensional statistics, we employ sparsity as our measure of simplicity (by sparsity we refer to the number of rank lists or, in effect, customer types, assumed to occur with positive probability in the population). To begin, we use the sparsest fit criterion to obtain a characterization of the choice models implicitly used by the robust revenue prediction approach. Loosely speaking, we show that the choice model implicitly used by the robust approach is essentially the sparsest model (Theorem 1) and the complexity of the model (as measured by its sparsity) scales with the ‘‘amount’’ of data. This provides an explanation for the immunity of the robust approach to over/under fitting as observed in our case study. Second, we characterize the family of choice models that can be identified only from observed marginal data via the sparsest fit criterion (Theorems 2 and 3). Our characterization formalizes the notion that the complexity of the models that can be identified via the sparsest fit criterion scales with the ‘‘amount’’ of data at hand.
The study of choice models and their applications spans a vast literature across multiple fields including at least Marketing, Operations and Economics. In disciplines such as marketing, learning a choice model is an interesting goal unto itself, given that a researcher frequently wishes to uncover ‘‘why’’ a particular decision was made. Within operations, the goal is frequently more application oriented, with the choice model being explicitly used as a predictive tool within some larger decision model. Since our goals are aligned with the latter direction, our literature review focuses predominantly on OM; we briefly touch on key work in Marketing. We note that our consideration of ‘sparsity’ as an appropriate nonparametric model selection criterion is closely related to the burgeoning statistical area of compressive sensing; we discuss those connections in a later section.
The vast majority of decision models encountered in operations have traditionally ignored substitution behavior (and thereby choice modeling) altogether. Within airline RM, this is referred to as the ‘‘independent demand’’ model (see Talluri and van Ryzin (2004b)). Over the years, several studies have demonstrated the improvements that could be obtained by incorporating choice behavior into operations models. For example, within airline RM, the simulation studies conducted by Belobaba and Hopperstad (1999) on the well-known passenger origin and destination simulator (PODS) suggested the value of corrections to the independent demand model; more recently, Ratliff et al. (2008) and Vulcano et al. (2010) have demonstrated valuable average revenue improvements from using MNL choice-based RM approaches using real airline market data. Following such studies, there has been a significant amount of research in the areas of inventory management and RM attempting to incorporate choice behavior into operations models.
The bulk of the research on choice modeling in both areas has been optimization related. That is to say, most of the work has focused on devising optimal decisions given a choice model. Talluri and van Ryzin (2004a), Gallego et al. (2006), van Ryzin and Vulcano (2008), Mahajan and van Ryzin (1999), Goyal et al. (2009) are all papers in this vein. Kök et al. (2008) provides an excellent overview of the state-of-the-art in assortment optimization. Rusmevichientong et al. (2008) consider the multinomial logit (MNL) model and provide an efficient algorithm for the static assortment optimization problem and propose an efficient policy for the dynamic optimization problem. A follow-on paper, Rusmevichientong and Topaloglu (2009), considers the same optimization problem but where the mean utilities in the MNL model are allowed to lie in some arbitrary uncertainty set. Saure and Zeevi (2009) propose an alternative approach for the dynamic assortment optimization problem under a general random utility model.
The majority of the work above focuses on optimization issues given a choice model. Papers such as Talluri and van Ryzin (2004a) discuss optimization problems with general choice models, and as such our revenue estimation procedure fits in perfectly there. In most cases, however, the choice model is assumed to be given and of the MNL type. Papers such as Saure and Zeevi (2009) and Rusmevichientong and Topaloglu (2009) loosen this requirement by allowing some amount of parametric uncertainty. In particular, Saure and Zeevi (2009) assume unknown mean utilities and learn these utilities, while the optimization schemes in Rusmevichientong and Topaloglu (2009) require knowledge of mean utilities only within an interval. In both cases, the structure of the model (effectively, MNL) is fixed up front.
The MNL model is by far the most popular choice model studied and applied in OM. The origins of the MNL model date all the way back to the Plackett-Luce model, proposed independently by Luce (1959) and Plackett (1975). Before becoming popular in the area of OM, the MNL model found widespread use in the areas of transportation (see seminal works of McFadden (1980), Ben-Akiva and Lerman (1985)) and marketing (starting with the seminal work of Guadagni and Little (1983), which paved the way for choice modeling using scanner panel data). See Wierenga (2008), Chandukala et al. (2008) for a detailed overview of choice modeling in the area of Marketing. The MNL model is popular because its structure makes it tractable both in terms of estimating its parameters and solving decision problems. However, the tractability of the MNL model comes at a cost: it is incapable of capturing any heterogeneity in substitution patterns across products (see Debreu (1960)) and suffers from the Independence of Irrelevant Alternatives (IIA) property (see Ben-Akiva and Lerman (1985)), both of which limit its practical applicability.
Of course, these issues with the MNL model are well recognized, and far more sophisticated models of choice have been suggested in the literature (see, for instance, Ben-Akiva and Lerman (1985), Anderson et al. (1992)); the price one pays is that the more sophisticated models may not be easily identified from sales data and are prone to overfitting. It must be noted that an exception to the above state of affairs is the paper by Rusmevichientong et al. (2006) that considers a general nonparametric model of choice similar to the one considered here in the context of an assortment pricing problem. The caveat is that the approach considered requires access to samples of entire customer preference lists, which are unlikely to be available in many practical applications.
Our goal relative to all of the above work is to eliminate the need for structural assumptions and thereby, the associated risks as well. We provide a means of going directly from raw sales transaction data to revenue or sales estimates for a given offer set. While this does not represent the entirety of what can be done with a choice model, it represents a valuable application, at least within the operational problems discussed.
We consider a universe of $N$ products, $\mathcal{N} = \{0, 1, 2, \dots, N-1\}$. We assume that the $0$th product in $\mathcal{N}$ corresponds to the ‘outside’ or ‘no-purchase’ option. A customer is associated with a permutation (or ranking) $\sigma$ of the products in $\mathcal{N}$; the customer prefers product $i$ to product $j$ if and only if $\sigma(i) < \sigma(j)$. A customer will be presented with a set of alternatives $\mathcal{M} \subset \mathcal{N}$; any set of alternatives will, by convention, be understood to include the no-purchase alternative i.e. the $0$th product. The customer will subsequently choose to purchase her single most preferred product among those in $\mathcal{M}$. In particular, she purchases
$$\arg\min_{i \in \mathcal{M}} \sigma(i). \tag{1}$$
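The purchase rule (1) can be illustrated with a minimal sketch. The function name `purchase` and the encoding of a ranking as a tuple (entry `i` holds the rank of product `i`, with ranks starting at 0) are assumptions made for this illustration only; they do not come from the paper.

```python
def purchase(sigma, offer_set):
    """Return the product bought by a customer with ranking `sigma`.

    sigma[i] is the rank of product i (lower means more preferred).
    Product 0 is the no-purchase option and is always available.
    """
    offered = set(offer_set) | {0}
    # Rule (1): pick the offered product with the smallest rank.
    return min(offered, key=lambda i: sigma[i])


# Hypothetical customer over products 0..3: product 2 is the favourite,
# followed by no-purchase (0), then product 1, then product 3.
sigma = (1, 2, 0, 3)

bought_full = purchase(sigma, {1, 2, 3})   # product 2 is on offer
bought_small = purchase(sigma, {1, 3})     # product 2 is missing
print(bought_full, bought_small)
```

In the second call the customer leaves without purchasing, because she ranks the no-purchase option above both remaining products.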
It is quickly seen that the above structural assumption is consistent with structural assumptions made in commonly encountered choice models including the multinomial logit, nested multinomial logit, or more general random utility models. Those models make many additional structural assumptions which may or may not be reasonable for the application at hand. Viewed in a different light, basic results from the theory of social preferences dictate that the structural assumptions implicit in our model are no more restrictive than assuming that the customer in question is endowed with a utility function over alternatives and chooses an alternative that maximizes her utility from among those available. Our model of the customer is thus general. (As opposed to associating a customer with a fixed $\sigma$, one may also associate customers with distributions over permutations. This latter formalism is superfluous for our purposes.)
In order to make useful predictions on customer behavior that might, for instance, guide the selection of a set $\mathcal{M}$ to offer for sale, one must specify a choice model. A general choice model is effectively a conditional probability distribution $\mathbb{P}(\cdot \mid \cdot)$ that yields the probability of purchase of a particular product in $\mathcal{N}$ given the set of alternatives available to the customer.
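We will assume essentially the most general model for $\mathbb{P}(\cdot \mid \cdot)$. In particular, we assume that there exists a distribution $\lambda : S_N \to [0,1]$ over the set of all possible permutations $S_N$. Recall here that $S_N$ is effectively the set of all possible customer types since every customer is associated with a permutation which uniquely determines her choice behavior. The distribution $\lambda$ defines our choice model as follows: Define the set
$$S_j(\mathcal{M}) = \{\sigma \in S_N : \sigma(j) < \sigma(i) \ \text{for all } i \in \mathcal{M},\ i \neq j\}.$$
$S_j(\mathcal{M})$ is simply the set of all customer types that would purchase product $j$ when the offer set is $\mathcal{M}$. Our choice model is then given by
$$\mathbb{P}(j \mid \mathcal{M}) = \sum_{\sigma \in S_j(\mathcal{M})} \lambda(\sigma) \triangleq \lambda^j(\mathcal{M}).$$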
Not surprisingly, as mentioned above, the above model subsumes essentially any model of choice one might concoct: in particular, all we have assumed is that at a given point in time a customer possesses rational (transitive) preferences over all alternatives (see Mas-Colell et al. (1995)), and that a particular customer will purchase her most preferred product from the offered set according to these preferences; a given customer sampled at different times may well have a distinct set of preferences. (Note however that the customer need not be aware of these preferences; from (1), it is evident that the customer need only be aware of her preferences for elements of the offer set.)
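As a small illustration of how a distribution over rankings induces purchase probabilities, the following sketch enumerates all rankings for a tiny universe and sums the mass of the customer types that buy each product. The helper name `choice_probabilities`, the dictionary encoding of the distribution, and the uniform example distribution are assumptions made for this illustration; brute-force enumeration is only feasible for very small universes.

```python
from itertools import permutations


def choice_probabilities(lam, offer_set):
    """Probability of purchasing each offered product.

    lam maps a ranking sigma (tuple, sigma[i] = rank of product i) to its
    probability mass. Product 0 (no-purchase) is always on offer.
    """
    offered = set(offer_set) | {0}
    probs = {j: 0.0 for j in offered}
    for sigma, mass in lam.items():
        # The customer type sigma buys its best-ranked offered product.
        chosen = min(offered, key=lambda i: sigma[i])
        probs[chosen] += mass
    return probs


# Hypothetical example: uniform distribution over all 3! rankings of {0, 1, 2}.
rankings = list(permutations(range(3)))
lam = {sigma: 1.0 / len(rankings) for sigma in rankings}

probs_single = choice_probabilities(lam, {1})     # offer product 1 only
probs_both = choice_probabilities(lam, {1, 2})    # offer products 1 and 2
print(probs_single, probs_both)
```

Under the uniform distribution, product 1 is bought with probability one half when offered alone and one third when offered alongside product 2, which shows how the purchase probability of a product depends on the whole offer set.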
The class of choice models we will work with is quite general and imposes a minimal number of behavioral assumptions on customers a priori. That said, the data available to calibrate such a model will typically be limited in the sense that a modeler will have sales rate information for a potentially small collection of assortments. Ignoring the difficulties of such a calibration problem for now, we posit a general notion of what we will mean by observable ‘data’. The abstract notion we posit will quickly be seen as relevant to data one might obtain from sales information.
We assume that the data observed by the seller is given by an $m$-dimensional ‘partial information’ vector $y = A\lambda$, where $A \in \{0,1\}^{m \times N!}$ makes precise the relationship between the observed data and the underlying choice model. Typically we anticipate $m \ll N!$, signifying, for example, the fact that we have sales information for only a limited number of assortments. Before understanding how transactional data observed in practice relates to this formalism, we consider, for the purposes of illustration, a few simple concrete examples of data vectors $y$; we subsequently introduce a type of data relevant to our experiments and transaction data observed in the real world.

Comparison Data: This data represents the fraction of customers that prefer a given product $i$ to a product $j$. The partial information vector is indexed by $i, j$ with $0 \le i, j \le N-1$, $i \neq j$. For each $i, j$, $y_{i,j}$ denotes the fraction of customers that prefer product $i$ to $j$. The matrix $A$ is thus in $\{0,1\}^{N(N-1) \times N!}$. A column of $A$, $A(\sigma)$, will thus have $A(\sigma)_{ij} = 1$ if and only if $\sigma(i) < \sigma(j)$.

Ranking Data: This data represents the fraction of customers that rank a given product $i$ as their $r$th choice. Here the partial information vector is indexed by $i, r$ with $0 \le i \le N-1$, $1 \le r \le N$. For each $i, r$, $y_{ri}$ is thus the fraction of customers that rank product $i$ at position $r$. The matrix $A$ is then in $\{0,1\}^{N^2 \times N!}$. For a column of $A$ corresponding to the permutation $\sigma$, $A(\sigma)$, we will thus have $A(\sigma)_{ri} = 1$ iff $\sigma(i) = r$.

Top Set Data: This data refers to a concatenation of the ‘‘Comparison Data’’ above and information on the fraction of customers who have a given product $i$ as their topmost choice for each $i$. Thus $A^\top = [A_1^\top \ A_2^\top]$ where $A_1$ is simply the $A$ matrix for comparison data, and $A_2 \in \{0,1\}^{N \times N!}$ has $A_2(\sigma)_i = 1$ if and only if $\sigma(i) = 1$.

Transaction Data: More generally, in the retail context, historical sales records corresponding to displayed assortments might be used to estimate the fraction of purchasing customers who purchased a given product $i$ when the displayed assortment was $\mathcal{M}$. We might have such data for some sequence of test assortments, say $\mathcal{M}_1, \mathcal{M}_2, \dots, \mathcal{M}_M$. This type of data is consistent with our definition (i.e. it may be interpreted as a linear transformation of $\lambda$) and is, in fact, closely related to the comparison data above. In particular, denoting by $y_{im}$ the fraction of customers purchasing product $i$ when assortment $\mathcal{M}_m$ is on offer, our partial information vector, $y \in [0,1]^{N \cdot M}$, may thus be indexed by $i, m$ with $0 \le i \le N-1$, $1 \le m \le M$. The matrix $A$ is then in $\{0,1\}^{N \cdot M \times N!}$. For a column of $A$ corresponding to the permutation $\sigma$, $A(\sigma)$, we will then have $A(\sigma)_{im} = 1$ iff $i \in \mathcal{M}_m$ and $\sigma(i) < \sigma(j)$ for all products $j \neq i$ in assortment $\mathcal{M}_m$.
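To make the linear relationship between transaction data and the distribution over rankings concrete, the sketch below builds the 0/1 matrix for a tiny universe by brute force and computes the resulting data vector. The function names `transaction_matrix` and `observe`, the 0-based rank encoding, and the uniform example distribution are assumptions for illustration; enumerating all rankings is only possible for very small numbers of products.

```python
from itertools import permutations


def transaction_matrix(n_products, assortments):
    """Build the transaction-data matrix by enumerating all rankings.

    Rows are indexed by (product i, assortment index m); columns by rankings
    sigma, where sigma[i] is the rank of product i. An entry is 1 iff the
    customer type sigma buys product i when assortment m is offered.
    """
    sigmas = list(permutations(range(n_products)))
    index = [(i, m) for m in range(len(assortments)) for i in range(n_products)]
    A = []
    for i, m in index:
        offered = set(assortments[m]) | {0}  # no-purchase is always offered
        A.append([1 if min(offered, key=lambda k: s[k]) == i else 0
                  for s in sigmas])
    return index, sigmas, A


def observe(A, lam):
    """Data vector obtained by applying the matrix to the distribution."""
    return [sum(a * w for a, w in zip(row, lam)) for row in A]


# Hypothetical example: 3 products, two test assortments, uniform distribution.
index, sigmas, A = transaction_matrix(3, [{1, 2}, {1}])
lam = [1.0 / len(sigmas)] * len(sigmas)
y = dict(zip(index, observe(A, lam)))
print(y)
```

Each customer type buys exactly one product (possibly the no-purchase option) from each assortment, so within every assortment the entries of a column sum to one.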
While modeling choice is useful for a variety of reasons, we are largely motivated by decision models for OM problems that benefit from the incorporation of a choice model. In many of these models, the fundamental feature impacted by the choice model is a ‘revenue function’ that measures revenue rates corresponding to a particular assortment of products offered to customers. Concrete examples include static assortment management, network revenue management under choice and inventory management assuming substitution.
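We formalize this revenue function. We associate every product $j$ in $\mathcal{N}$ with a retail price $p_j$. Of course, $p_0 = 0$. The revenue function, $R(\mathcal{M})$, determines expected revenues to a retailer from offering a set of products $\mathcal{M}$ to his customers. Under our choice model this is given by:
$$R(\mathcal{M}) = \sum_{j \in \mathcal{M}} p_j \lambda^j(\mathcal{M}),$$
where $\lambda^j(\mathcal{M}) = \mathbb{P}(j \mid \mathcal{M})$ is the probability that product $j$ is purchased when $\mathcal{M}$ is offered.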
The function $R(\cdot)$ is a fundamental building block for all of the OM problems described above, so that we view the problem of estimating $R(\cdot)$ as our central motivating problem. The above specification is general, and we will refer to any linear functional of the type above as a revenue function. As another useful example of such a functional, consider setting $p_j = 1$ for all $j > 0$ (i.e. all products other than the no-purchase option). In this case, the revenue function, $R(\mathcal{M})$, yields the probability an arriving customer will purchase some product in $\mathcal{M}$; i.e. the ‘conversion rate’ under assortment $\mathcal{M}$.
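The revenue function and its conversion-rate special case can be sketched as follows. The function name `revenue`, the price values, and the uniform distribution over rankings are hypothetical choices made for illustration only.

```python
from itertools import permutations


def revenue(lam, prices, offer_set):
    """Expected revenue per arriving customer for a given offer set.

    lam maps a ranking sigma (sigma[i] = rank of product i) to its probability;
    prices[j] is the price of product j, with prices[0] = 0 for no-purchase.
    """
    offered = set(offer_set) | {0}
    total = 0.0
    for sigma, mass in lam.items():
        chosen = min(offered, key=lambda i: sigma[i])
        total += prices[chosen] * mass
    return total


# Hypothetical example: uniform distribution over rankings of {0, 1, 2}.
rankings = list(permutations(range(3)))
lam = {sigma: 1.0 / len(rankings) for sigma in rankings}

rev = revenue(lam, [0.0, 10.0, 4.0], {1, 2})
# Setting every price (other than no-purchase) to 1 gives the conversion rate.
conversion = revenue(lam, [0.0, 1.0, 1.0], {1, 2})
print(rev, conversion)
```

With these numbers each of the three options is chosen with probability one third, so the expected revenue is (10 + 4)/3 and the conversion rate is 2/3.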
Given a ‘black-box’ that is capable of producing estimates of $R(\cdot)$ using some limited corpus of data, one may then hope to use such a black box for making assortment decisions over time in the context of the OM problems of the type discussed in the introduction.
Imagine we have a corpus of transaction data, summarized by an appropriate data vector $y$ as described earlier. Our goal is to use just this data to make predictions about the revenue rate (i.e. the expected revenues garnered from a random customer) for some given assortment, say $\mathcal{M}$, that has never been encountered in past data. We propose accomplishing this by solving the following program:
$$\begin{array}{ll} \underset{\lambda}{\text{minimize}} & R(\mathcal{M}) \\ \text{subject to} & A\lambda = y, \\ & \mathbf{1}^\top \lambda = 1, \\ & \lambda \ge 0. \end{array} \tag{2}$$
In particular, the optimal value of this program will constitute our prediction for the revenue rate. In words, the feasible region of this program describes the set of all choice models consistent with the observed data $y$. The optimal objective value consequently corresponds to the minimum revenues possible for the assortment $\mathcal{M}$ under any choice model consistent with the observed data. Since the family of choice models we considered was generic, this prediction relies on simply the data and basic economic assumptions on the customer that are tacitly assumed in essentially any choice model.
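A small worked instance may help fix ideas. It is an illustration constructed here under stated assumptions, not an example from the paper: three products $\{0, 1, 2\}$ with product $0$ the no-purchase option, a single observed assortment $\{0,1,2\}$ in which a fraction $a$ of customers buy product 1 and a fraction $b$ buy product 2, and a prediction target $\mathcal{M} = \{0, 1\}$. Rankings are written most preferred first.

```latex
% Observed data for the assortment {0,1,2} (a, b are hypothetical sales fractions):
%   types buying product 1:  (1,0,2), (1,2,0)   with total mass a
%   types buying product 2:  (2,0,1), (2,1,0)   with total mass b
%   types buying nothing:    (0,1,2), (0,2,1)   with total mass 1 - a - b
%
% Under the new assortment M = {0,1}, product 1 is bought exactly by the
% types that rank 1 above 0, namely (1,0,2), (1,2,0) and (2,1,0). Hence
\[
  R(\mathcal{M}) \;=\; p_1 \bigl( a + \lambda(2,1,0) \bigr),
  \qquad 0 \le \lambda(2,1,0) \le b .
\]
% The data place no further restriction on how the mass b is split between
% (2,0,1) and (2,1,0), so the worst-case (robust) prediction is
\[
  \min_{\lambda} R(\mathcal{M}) \;=\; p_1\, a ,
\]
% attained when every customer whose favourite is product 2 walks away
% rather than substituting to product 1. The most optimistic consistent
% model would instead give p_1 (a + b).
```

The gap between $p_1 a$ and $p_1(a+b)$ reflects what the data leave undetermined; observing sales for additional assortments would add constraints and narrow it.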
The philosophy underlying the above program can be put to other uses. For instance, one might seek to recover a choice model itself from the available data. In a parametric world, one would consider a suitably small, fixed family of models within which a unique model would best explain (but not necessarily be consistent with) the available data. It is highly unlikely that available data will determine a unique model in the general family of models we consider here. Our nonparametric setting thus requires an appropriate selection criterion. A natural criterion is to seek the ‘simplest’ choice model that is consistent with the observed data. There are many notions of what one might consider simple. One criterion that enjoys widespread use in high-dimensional statistics is sparsity. In particular, we may consider finding a choice model $\lambda$ consistent with the observed data, that has minimal support, $\|\lambda\|_0$. In other words, we might seek to explain observed purchasing behavior by presuming as small a number of modes of customer choice behavior as possible (where we associate a ‘mode’ of choice with a ranking of products). More formally, we might seek to solve:
$$\begin{array}{ll} \underset{\lambda}{\text{minimize}} & \|\lambda\|_0 \\ \text{subject to} & A\lambda = y, \\ & \mathbf{1}^\top \lambda = 1, \\ & \lambda \ge 0. \end{array} \tag{3}$$
The next three sections are focused on providing procedures to solve the program (2), and on examining the quality of the predictions produced on simulated data and actual transaction data respectively. A later section will discuss algorithmic and interesting descriptive issues pertaining to (3).
In the previous section we formulated the task of computing revenue estimates via a nonparametric model of choice and any available data as the mathematical program (2), which we repeat below, in a slightly different form for clarity:
$$\begin{array}{ll} \underset{\lambda}{\text{minimize}} & \sum_{j \in \mathcal{M}} p_j \lambda^j(\mathcal{M}) \\ \text{subject to} & A\lambda = y, \\ & \mathbf{1}^\top \lambda = 1, \\ & \lambda \ge 0. \end{array}$$
The above mathematical program is a linear program in the variables $\lambda$. Interpreting the program in words, the constraints ensure that any $\lambda$ assumed in making a revenue estimate is consistent with the observed data. Other than this consistency requirement, writing the probability that a customer purchases $j \in \mathcal{M}$, $\mathbb{P}(j \mid \mathcal{M})$, as the quantity $\lambda^j(\mathcal{M}) \triangleq \sum_{\sigma \in S_j(\mathcal{M})} \lambda(\sigma)$ assumes that the choice model satisfies the basic structure laid out earlier. We make no other assumptions outside of these, and ask for the lowest expected revenues possible for $\mathcal{M}$ under any choice model satisfying these requirements.
Thus, while the assumptions implicit in making a revenue estimate are something that the user need not think about, the two natural questions that arise are:

How does one solve this conceptually simple program in practice given that the program involves an intractable number of variables?

Even if one did succeed in solving such a program are the revenue predictions produced useful or are they too loose to be of practical value?
This section will focus on the first question. In practical applications such a procedure would need to be integrated into a larger decision problem, and so it is useful to understand the computational details, which we present at a high level in this section. The second, ‘so what’ question will be the subject of the next two sections, where we will examine the performance of the scheme on simulated transaction data, and finally on a real world sales prediction problem using real data. Finally, in a later section, we will examine an interesting property enjoyed by the choice models implicitly assumed in making the predictions in this scheme.
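At a high level our approach to solving (2) will be to consider the dual of that program and then derive efficient exact or approximate descriptions of the feasible regions of these programs. We begin by considering the dual program to (2). In preparation for taking the dual, let us define
$$\mathcal{A}_j(\mathcal{M}) \triangleq \{A(\sigma) : \sigma \in S_j(\mathcal{M})\},$$
where recall that $S_j(\mathcal{M})$ denotes the set of all permutations that result in the purchase of $j \in \mathcal{M}$ when the offered assortment is $\mathcal{M}$. Since $S_N = \cup_{j \in \mathcal{M}} S_j(\mathcal{M})$ and $S_j(\mathcal{M}) \cap S_i(\mathcal{M}) = \emptyset$ for $i \neq j$, we have implicitly specified a partition of the columns of the matrix $A$. Armed with this notation, the dual of (2) is:
$$\begin{array}{ll} \underset{\alpha, \nu}{\text{maximize}} & \alpha^\top y + \nu \\ \text{subject to} & \max_{x^j \in \mathcal{A}_j(\mathcal{M})} \left( \alpha^\top x^j + \nu \right) \le p_j, \quad \text{for each } j \in \mathcal{M}, \end{array} \tag{4}$$
where $\alpha$ and $\nu$ are dual variables corresponding respectively to the data consistency constraints $A\lambda = y$ and the requirement that $\lambda$ is a probability distribution (i.e. $\mathbf{1}^\top \lambda = 1$). Of course, this program has a potentially intractable number of constraints. We explore two approaches to solving the dual: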

An extremely simple to implement approach that relies on sampling constraints in the dual that will, in general produce approximate solutions that are upper bounds to the optimal solution of our robust estimation problem.

An approach that relies on producing effective representations of the sets $\mathcal{A}_j(\mathcal{M})$, so that each of the constraints $\max_{x^j \in \mathcal{A}_j(\mathcal{M})} \left( \alpha^\top x^j + \nu \right) \le p_j$ can be expressed efficiently. This approach is slightly more complex to implement but in return can be used to sequentially produce tighter approximations to the robust estimation problem. In certain special cases, this approach is provably efficient and optimal.
The following is an extremely simple to implement approach to approximately solve the problem (4):

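Select a distribution over permutations, $\psi$.

Sample $n$ permutations according to the distribution. Call this set of permutations $\hat{S}$.

Solve the program:
$$\begin{array}{ll} \underset{\alpha, \nu}{\text{maximize}} & \alpha^\top y + \nu \\ \text{subject to} & \alpha^\top A(\sigma) + \nu \le p_j, \quad \text{for each } j \in \mathcal{M},\ \sigma \in \hat{S} \cap S_j(\mathcal{M}). \end{array} \tag{5}$$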
Observe that (5) is essentially a ‘sampled’ version of the problem (4), wherein constraints of that problem have been sampled according to the distribution $\psi$, and is consequently a relaxation of that problem. A solution to (5) is consequently an upper bound to the optimal solution to (4).
The question of whether the solutions thus obtained provide meaningful approximations to (4) is partially addressed by recent theory developed by Calafiore and Campi (2005). In particular, it has been shown that for a problem with $m$ variables and given a sufficiently large number of samples $n$ (depending on $m$, $\varepsilon$ and $\delta$), we must have that with probability at least $1 - \delta$ the following holds: An optimal solution to (5) violates at most an $\varepsilon$ fraction of constraints of the problem (4) under the measure $\psi$. Hence, given a number of samples that scales only with the number of variables (and is independent of the number of constraints in (4)), one can produce a solution to (4) that satisfies all but a small fraction of constraints. The theory does not provide any guarantees on how far the optimal cost of the relaxed problem is from the optimal cost of the original problem.
The heuristic nature of this approach notwithstanding, it is extremely simple to implement, and in the experiments conducted in the next section, provided close to optimal solutions.
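The constraint-sampling steps above can be sketched as follows. The sketch only assembles the sampled constraint system of (5); handing it to an LP solver is left out. The function names, the choice of comparison data for the columns, and the uniform sampling distribution are all assumptions made for this illustration and are not prescribed by the text.

```python
import random


def comparison_column(sigma):
    """Column for comparison data: entry (i, j) is 1 iff sigma ranks i above j."""
    n = len(sigma)
    return [1 if sigma[i] < sigma[j] else 0
            for i in range(n) for j in range(n) if i != j]


def sampled_constraints(n_products, prices, offer_set, n_samples,
                        column=comparison_column, seed=0):
    """Sample rankings uniformly and return the constraints of program (5).

    Each returned pair (x, rhs) encodes one constraint  alpha . x + nu <= rhs,
    where x is the column of the sampled ranking and rhs is the price of the
    product that ranking would purchase from the offer set.
    """
    rng = random.Random(seed)
    offered = set(offer_set) | {0}       # no-purchase is always offered
    constraints = []
    for _ in range(n_samples):
        sigma = list(range(n_products))  # sigma[i] = rank of product i
        rng.shuffle(sigma)               # uniform sampling (an assumption)
        j = min(offered, key=lambda i: sigma[i])
        constraints.append((column(sigma), prices[j]))
    return constraints


# Hypothetical instance: 4 products, assortment {1, 3}, 20 sampled rankings.
constraints = sampled_constraints(4, [0.0, 5.0, 3.0, 8.0], {1, 3}, 20)
print(len(constraints), constraints[0])
```

Because only a subset of the constraints of (4) is retained, maximizing the dual objective over this system gives a relaxation, consistent with the upper-bound property noted above.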
We describe here one notion of an efficient representation of the sets , and assuming we have such a representation, we describe how one may solve (4) efficiently. We will deal with the issue of actually coming up with these efficient representations in Appendix id1, where we will develop an efficient representation for ranking data and demonstrate a generic procedure to sequentially produce such representations.
Let us assume that every set can be expressed as a disjoint union of sets. We denote the th such set by and let be the corresponding set of columns of . Consider the convex hull of the set , . Recalling that , . is thus a polytope contained in the dimensional unit cube, . In other words,
(6) 
for some matrices and vectors . By a canonical representation of , we will thus understand a partition of and a polyhedral representation of the columns corresponding to every set in the partition as given by (6). If the number of partitions as well as the polyhedral description of each set of the partition given by (6) is polynomial in the input size, we will regard the canonical representation as efficient. Of course, there is no guarantee that an efficient representation of this type exists; clearly, this must rely on the nature of our partial information i.e. the structure of the matrix . Even if an efficient representation did exist, it remains unclear whether we can identify it. Ignoring these issues for now, we will in the remainder of this section demonstrate how given a representation of the type (6), one may solve (4) in time polynomial in the size of the representation.
For simplicity of notation, in what follows we assume that each polytope is in standard form,
Now since an affine function is always optimized at the vertices of a polytope, we know:
We have thus reduced (4) to a ‘robust’ LP. Now, by strong duality we have:
(7) 
We have thus established the following useful equality:
It follows that solving (2) is equivalent to the following LP whose complexity is polynomial in the description of our canonical representation:
(8) 
As discussed, our ability to solve (8) relies on our ability to produce an efficient canonical representation of of the type (6). In Appendix id1, we first consider the case of ranking data, where an efficient such representation may be produced. We then illustrate a method that produces a sequence of ‘outer-approximations’ to (6) for general types of data, and thereby allows us to produce a sequence of improving lower bounding approximations to our robust revenue estimation problem, (2). This provides a general procedure to address the task of solving (4), or equivalently, (2).
We end this section with a brief note on noisy observations. In particular, in practice, one may see a ‘noisy’ version of . Specifically, as opposed to knowing precisely, one may simply know that , where may, for instance, represent an uncertainty ellipsoid, or a ‘box’ derived from sample averages of the associated quantities and the corresponding confidence intervals. In this case, one seeks to solve the problem:
Provided is convex, this program is essentially no harder to solve than the variant of the problem we have discussed and similar methods to those developed in this section apply.
In this section, we describe the results of an extensive simulation study, the main purpose of which is to demonstrate that the robust approach can capture various underlying parametric structures and produce good revenue predictions. For this study, we pick a range of random utility parametric structures used extensively in current modeling practice.
The broad experimental procedure we followed is the following:

Pick a structural model. This may be a model derived from real-world data or a purely synthetic model.

Use this structural model to simulate sales for a set of test assortments. This simulates a data set that a practitioner likely has access to.

Use this transaction data to estimate marginal information , and use to implement the robust approach.

Use the implemented robust approach to predict revenues for a distinct set of assortments, and compare the predictions to the true revenues computed using the ‘ground-truth’ structural model chosen for benchmarking in step 1.
Notice that the above experimental procedure lets us isolate the impact of structural errors from that of finite sample errors. Specifically, our goal is to understand how well the robust approach captures the underlying choice structure. For this purpose, we ignore any estimation errors in data by using the ‘ground-truth’ parametric model to compute the exact values of any choice probabilities and revenues required for comparison. Therefore, if the robust approach has good performance across an interesting spectrum of structural models that are believed to be good fits to data observed in practice, we can conclude that the robust approach is likely to offer accurate revenue predictions with no additional information about structure across a wide range of problems encountered in practice.
The above procedure generates data sets using a variety of ‘ground-truth’ structural models. We pick the following ‘random utility’ models as benchmarks. A self-contained and compact exposition on the foundations of each of the benchmark models below may be found in the appendix.
Multinomial logit family (MNL): For this family, we have:
where the are the parameters specifying the models. See Appendix id1 for more details.
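As a concrete illustration of the MNL family, the following minimal sketch assumes the standard MNL form, in which the probability of choosing a product from an offer set is its weight divided by the sum of the weights of all available options. The convention that index 0 is an always-available no-purchase option, and the names `mnl_choice_prob`, `mnl_revenue`, `weights`, and `prices`, are assumptions made for the example.

```python
def mnl_choice_prob(j, offer_set, weights):
    """P(j | offer_set) under the standard MNL form.

    weights[i] > 0 is the MNL weight of product i; index 0 is the
    no-purchase option, which is assumed to be always available.
    """
    available = set(offer_set) | {0}
    if j not in available:
        return 0.0
    return weights[j] / sum(weights[i] for i in available)

def mnl_revenue(offer_set, weights, prices):
    """Expected revenue per arriving customer for the offer set."""
    return sum(prices[j] * mnl_choice_prob(j, offer_set, weights)
               for j in set(offer_set) - {0})

weights = [1.0, 2.0, 1.0]    # hypothetical weights; index 0 = no purchase
prices = [0.0, 10.0, 6.0]    # hypothetical prices
p1 = mnl_choice_prob(1, {1, 2}, weights)     # 2 / (1 + 2 + 1)
rev = mnl_revenue({1, 2}, weights, prices)   # 10 * 0.5 + 6 * 0.25
```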
Nested logit family (NL): This model is a first attempt at overcoming the ‘independence of irrelevant alternatives’ effect, a shortcoming of the MNL model. For this family, the universe of products is partitioned into mutually exclusive subsets, or ‘nests’, denoted by such that
This model takes the form:
(9) 
where is a certain scale parameter, and
Here is the parameter capturing the level of membership of the no-purchase option in nest and satisfies, . In cases when for all , the family is called the cross nested logit (CNL) family. For a more detailed description including the corresponding random utility function and bibliographic details, see Appendix id1.
Mixed multinomial logit family (MMNL): This model accounts specifically for customer heterogeneity. In its most common form, the model reduces to:
where is a vector of observed attributes for the th product, and is a distribution parameterized by selected by the econometrician that describes heterogeneity in taste. For a more detailed description including the corresponding random utility function and bibliographic details, see Appendix id1.
Transaction Data Generated: Having selected (and specified) a structural model from the above list, we generated sales transactions as follows:

Fix an assortment of two products, .

Compute the values of using the chosen parametric model.

Repeat the above procedure for all pairs, , and single item sets, .
The above data is succinctly summarized as an dimensional data vector , where for , . Given the above data, the precise specialization of the robust estimation problem (2) that we solve may be found in Appendix id1.
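The data-generation steps above can be sketched as follows. This is a minimal illustration, not the exact specialization used in the experiments: the function name `build_data_vector`, the dictionary layout, the labelling of products as 1..N with 0 reserved for no purchase, and the small MNL ground truth are all assumptions made for the example.

```python
import itertools

def build_data_vector(num_products, choice_prob):
    """Choice probabilities for all one- and two-product offer sets.

    choice_prob(j, offer_set) is any callable returning P(j | offer_set);
    products are labelled 1..num_products (0 is reserved for no purchase).
    Returns a dict mapping (j, offer_set) to the probability.
    """
    data = {}
    products = range(1, num_products + 1)
    for i in products:                                  # single-item sets
        data[(i, (i,))] = choice_prob(i, {i})
    for i, j in itertools.combinations(products, 2):    # all pairs
        for k in (i, j):
            data[(k, (i, j))] = choice_prob(k, {i, j})
    return data

# Hypothetical MNL ground truth with weight 1 for the no-purchase option.
w = {0: 1.0, 1: 2.0, 2: 1.0, 3: 0.5}

def mnl(j, offer_set):
    return w[j] / (w[0] + sum(w[i] for i in offer_set))

y = build_data_vector(3, mnl)
```

With N products this layout yields N entries for the single-item sets and N(N-1) entries for the pairs.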
With the above setup we conducted two broad sets of experiments. In the first set of experiments, we picked specific models from the MNL, CNL, and MMNL model classes; the MNL model was constructed using DVD shopping cart data from Amazon.com, and the CNL and MMNL models were obtained through slight ‘perturbations’ of the MNL model. In order to avoid any artifacts associated with specific models, in the second set of experiments, we conducted ‘stress tests’ by generating a number of instances of models from each of the MNL, CNL, and MMNL model classes. We next present the details of the two sets of experiments.
The Amazon Model: We considered an MNL model fit to Amazon.com DVD sales data collected between 1 July 2005 and 30 September 2005 (the specifics of this model were shared with us by the authors of Rusmevichientong et al. (2008)), where an individual customer’s utility for a given DVD, is given by:
Here is the price of the package divided by the number of physical discs it contains, is the total number of helpful votes received by product , and is a standard Gumbel random variable. The model fit to the data has , and . See Table 2 for the attribute values taken by the products we used for our experiments. We will abbreviate this model AMZN for future reference.
We also considered the following synthetic perturbations of the AMZN model:

AMZNCNL: We derived a CNL model from the original AMZN model by partitioning the products into nests with the first nest containing products to , the second nest containing products to , the third containing products to , and the last containing products and . We chose . We assigned the no-purchase option to every nest with nest membership parameter .

AMZNMMNL: We derived an MMNL model from the original AMZN model by replacing each parameter with the random quantity , for , where is a customer-specific random variable distributed as a zero-mean normal random variable with standard deviation .
Figure 1 shows the results of the generic experiment for each of the three models above. Each experiment queries the robust estimate on sixty randomly drawn assortments of sizes between one and seven and compares these estimates to those under the respective true model for each case.
Synthetic Model Experiments: The above experiments considered structurally diverse models, each for a specific set of parameters. Are the conclusions suggested by Figure 1 artifacts of the set of parameters? To assuage this concern, we performed ‘stress’ tests by considering each structural model in turn, and for each model generating a number of instances of the model by drawing the relevant parameters from a generative family. For each structural model, we considered the following generative families of parameters:

MNL Random Family: randomly generated models on products, each generated by drawing mean utilities, , uniformly between and .

CNL Random Family: We maintained the nests, selection of and as in the AMZNCNL model. We generated distinct CNL models, each generated by drawing uniformly between and .

MMNL Random Family: We preserved the basic nature of the AMZNMMNL model. We considered randomly generated MMNL models. Each model differs in the distribution of the parameter vector . The random coefficients in each case are defined as follows: where is a random variable. Each of the 20 models corresponds to a single draw of from the uniform distribution on .
For each of the 60 structural model instances described above, we randomly generated offer sets of sizes between and . For a given offer set , we queried the robust procedure and compared the revenue estimate produced to the true revenue for that offer set; we can compute the latter quantity theoretically. In particular, we measured the relative error, . The three histograms in Figure 2 below represent distributions of relative error for the three generative families described above. Each histogram consists of test points; a given test point corresponds to one of the randomly generated structural models in the relevant family, and a random assortment.
In the above ‘stress’ tests, we kept the standard deviation of the MMNL models fixed at . The standard deviation of the MMNL model can be treated as a measure of the heterogeneity or the ‘‘complexity’’ of the model. Naturally, if we keep the ‘‘amount’’ of transaction data fixed and increase the standard deviation – and hence the complexity of the underlying model – we expect the accuracy of robust estimates to deteriorate. To give a sense of the sensitivity of the accuracy of robust revenue estimates to changes in the standard deviation, we repeated the above stress tests with the MMNL model class for three values of standard deviation: , and . Figure 3 shows the comparison of the density plots of relative errors for the three cases.
We draw the following broad conclusion from the above experiments:

Given limited marginal information for distributions over permutations, , arising from a number of commonly used structural models of choice, the robust approach effectively captures diverse parametric structures and provides close revenue predictions under a range of practically relevant parametric models.

With the type of marginal information fixed, the accuracy of robust revenue predictions deteriorates (albeit mildly) as the complexity of the underlying model increases; this is evidenced by the deterioration of robust performance as we go from the MNL to the MMNL model class, and similarly as we increase the standard deviation for the MMNL model while keeping the ‘amount’ of data fixed.

The design of our experiments allows us to conclude that in the event that a given structural model among the types used in our experiments predicts revenue rates accurately, the robust approach is likely to be just as good without knowledge of the relevant structure. In the event that the structural model used is a poor fit, the robust approach will continue to provide meaningful guarantees on revenues under the mild condition that it is tested in an environment where the distribution generating sales is no different from the distribution used to collect marginal information.
In this section, we present the results of a case study conducted using sales transaction data from the dealer network of a major US automaker. Our goal in this study is to use historical transaction data to predict the sales rate or ‘conversion rate’ for any given offer set of automobiles on a dealer lot. This conversion rate is defined as the probability of converting an arriving customer into a purchasing customer. The purpose of the case study is twofold: (1) to demonstrate how the prediction methods developed in this paper can be applied in the real world, and the quality of the predictions they offer in an absolute sense, and (2) to pit the robust method for revenue predictions in a ‘horse race’ against parametric approaches based on the MNL and MMNL families of choice models. In order to test the performance of these approaches in different regimes of calibration data, we carried out cross-validations with varying ‘amounts’ of training/calibration data. The results of the experiments conducted as part of the case study provide us with the evidence to draw two main conclusions:

The robust method predicts conversion rates more accurately than either of the parametric methods. In our case study, the improvement in accuracy was about across all regimes of calibration data.

Unlike the parametric methods we study, the robust approach is apparently not susceptible to overfitting and underfitting.
The improvement in accuracy is substantial. The second conclusion has important implications as well: In practice, it is often difficult to ascertain whether the data available is ‘‘sufficient’’ to fit the model at hand. As a result, parametric structures are prone to overfitting or underfitting. The robust approach, on the other hand, automatically scales the complexity of the underlying model class with data available, so in principle one should be able to avoid these issues. This is borne out by the case study. In the remainder of this section we describe the experimental setup and then present the evidence to support the above conclusions.
Appendix 2 provides a detailed description of our setup; here we provide a higher-level discussion for ease of exposition. We collected data comprising purchase transactions of a specific range of small SUVs offered by a major US automaker over months. The data is collected at the dealership level (i.e., the finest level possible) for a network of dealers in the Midwest. Each transaction contains information about the date of sale, the identity of the SUV sold, and the identity of the other cars on the dealership lot at the time of sale. Here by ‘identity’ we mean a unique model identifier that collectively identifies a package of features, color and invoice price point. We assume that purchase behavior within a dealership zone can be described by a single choice model, where a zone is defined as the collection of dealerships within an appropriately defined geographical area with relatively homogeneous demographic features. To support the validity of this assumption, we restrict attention to a single such zone.
Our data consisted of sales information on distinct SUV identities (as described above). We observed a total of distinct assortments (or subsets) of the products in the dataset, where each assortment , , was on offer at some point at some dealership in the dealership zone. We then converted the transaction data into sales rate information for each of the assortments as follows:
Note that the information to compute the denominator in the expression for is not available because the number of arriving customers who purchase nothing is not known. Such data ‘censoring’ is common in practice and impacts both parametric methods as well as our approach. A common approximation here is based on demographic information relative to the location of the dealership. Given the data at our disposal, we are able to make a somewhat better approximation to overcome this issue. In particular, we assume a daily arrival rate of for dealership and measure the number of arrivals of assortment as
where denotes the number of days for which was on offer at dealership . The arrival rate to each dealership clearly depends on the size of the market to which the dealership caters. Therefore, we assume that , where denotes the ‘‘market size’’ for dealership and is a ‘‘fudge’’ factor. We use previous year total sales at dealership for the particular model class as the proxy for and tune the parameter using cross-validation (more details in the appendix).
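The arrival approximation can be illustrated with a short sketch. It assumes, as one plausible reading of the description above, that the daily arrival rate at a dealership is the fudge factor multiplied by the market-size proxy, and that arrivals for an assortment are summed over dealerships weighted by the number of days on offer. The function and argument names are hypothetical, and the numbers are made up for illustration.

```python
def estimated_arrivals(days_on_offer, market_size, fudge):
    """Estimated number of customers who saw an assortment.

    days_on_offer[d]: days the assortment was on offer at dealership d.
    market_size[d]:   proxy for market size (e.g. previous-year sales).
    fudge:            scalar multiplier, to be tuned by cross-validation.
    The daily arrival rate at dealership d is assumed to be
    fudge * market_size[d].
    """
    return sum(fudge * market_size[d] * days
               for d, days in days_on_offer.items())

def sales_rate(units_sold, days_on_offer, market_size, fudge):
    """Sales of a product divided by estimated arrivals (0 if no arrivals)."""
    arrivals = estimated_arrivals(days_on_offer, market_size, fudge)
    return units_sold / arrivals if arrivals > 0 else 0.0

# Made-up example: two dealerships, one assortment.
days = {"A": 10, "B": 5}
size = {"A": 100, "B": 40}
arrivals = estimated_arrivals(days, size, fudge=0.01)   # 10 + 2 customers
rate = sales_rate(3, days, size, fudge=0.01)
```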
We now describe the experiments we conducted and present the results we obtained. In order to test the predictive performance of the robust, the MNL, and the MMNL methods, we carried out fold cross-validations with . In fold cross-validation (see Mosteller and Tukey (1987)), we arbitrarily partition the collection of assortments into partitions of about equal size, except possibly the last partition. Then, using partitions as training data to calibrate the methods, we test their performance on the partition. We repeat this process times with each of the partitions used as test data exactly once. This repetition ensures that each assortment is tested at least once. Note that as decreases, the number of training assortments decreases, resulting in more limited data scenarios. Such limited data scenarios are of course of great practical interest.
We measure the prediction accuracy of the methods using the relative error metric. In particular, letting denote the conversion-rate prediction for test assortment , the incurred relative error is defined as , where
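The partitioning scheme just described can be sketched as below. The name `k_fold_splits` and the parameter `k` (the number of partitions) are labels chosen for this example, and the use of consecutive chunks is one arbitrary way to partition; the source only requires that the partition be arbitrary and roughly balanced.

```python
import math

def k_fold_splits(items, k):
    """Yield (train, test) pairs; each item is in exactly one test fold.

    Folds are consecutive chunks of size ceil(len(items) / k), so the last
    fold may be smaller than the others (and fewer than k folds result
    when len(items) is small relative to k).
    """
    size = math.ceil(len(items) / k)
    folds = [items[s:s + size] for s in range(0, len(items), size)]
    for t, test in enumerate(folds):
        train = [x for f, fold in enumerate(folds) if f != t for x in fold]
        yield train, test

assortments = list(range(10))      # stand-ins for observed assortments
splits = list(k_fold_splits(assortments, 3))
```

Smaller values of `k` leave fewer assortments in each training set, which is the limited-data regime discussed above.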
In the case of the parametric approaches, is computed using the choice model fit to the training data. In the case of the robust approach, we solve an appropriate mathematical program. A detailed description of how is determined by each method is given in the appendix.
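For concreteness, a relative error computation might look like the following sketch. It assumes the common definition, absolute difference between predicted and observed rate divided by the observed rate, expressed as a percentage; the precise definition used in the study is given by the expression above, and the function names here are hypothetical.

```python
def relative_error_pct(predicted, actual):
    """Percentage relative error of a prediction against an observed rate."""
    if actual == 0:
        raise ValueError("relative error is undefined for a zero actual rate")
    return 100.0 * abs(predicted - actual) / actual

def mean_relative_error_pct(predictions, actuals):
    """Average percentage relative error over paired test assortments."""
    errors = [relative_error_pct(p, a) for p, a in zip(predictions, actuals)]
    return sum(errors) / len(errors)

# Made-up conversion rates for two test assortments.
mre = mean_relative_error_pct([0.09, 0.22], [0.10, 0.20])
```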
We now present the results of the experiments. Figure 4 shows the comparison of the relative errors of the three methods from fold cross-validations for . Table 1 shows the mean relative error percentages of the three methods and the percent improvement in mean relative error achieved by the robust method over the MNL and MMNL methods for the three calibration data regimes of . It is clear from the definition of fold cross-validation that as decreases, the ‘‘amount’’ of calibration data decreases, or equivalently calibration data sparsity increases. Such sparse calibration data regimes are of course of great practical interest.
The immediate conclusion we draw from the results is that the prediction accuracy of the robust method is better than that of both the MNL and MMNL methods in all calibration data regimes. In particular, using the robust method results in close to improvement in prediction accuracy over the MNL and MMNL methods. We also note that while the prediction accuracy of the more complex MMNL method is marginally better than that of the MNL method in the high calibration-data regime of , it quickly becomes worse as the amount of calibration data available decreases. This behavior is a consequence of overfitting caused by the complexity of the MMNL model. The performance of the robust method, on the other hand, remains stable across the different regimes of calibration data.
In making revenue predictions, we did not need to concern ourselves with the choice model implicitly assumed by our prediction procedure. This fact notwithstanding, it is natural to consider criteria for selecting choice models consistent with the observed data that are independent of any decision context. Thus motivated, we consider the natural task of finding the simplest choice model consistent with the observed data. As in much of contemporary high dimensional statistics (see for example, Candes et al. (2006), Cormode and Muthukrishnan (2006)), we employ sparsity as our measure of simplicity. In addition to the appealing notion of explaining observed substitution behavior by as small a number of customer preference lists as possible, such a description also provides a great deal of tractability in multiple applications (see, for example van Ryzin and Vulcano (2008)). Our goal in this section is to first understand the choice models implicitly assumed by the robust procedure through the lens of the sparsity criterion, and second, to understand the discriminative power of this criterion.
Towards the above goal, we begin by characterizing choice models implicitly used by the robust approach in terms of their sparsity. Loosely speaking, we establish that the choice model implicitly used by the robust approach is indeed simple or sparse. In particular, such choice models have sparsity within at most one of the sparsity of the sparsest model consistent with the data. As such, we see that the choice model implicitly selected by our robust revenue prediction procedure is, in essence, the sparsest choice model consistent with the data. From a descriptive perspective, this establishes the appealing fact that simplicity or sparsity is a natural property possessed by all choice models used in making robust revenue predictions. We also establish that the sparsity of the choice model used by the robust approach scales with the dimension of the data vector thereby establishing that the complexity of the model used by the robust approach scales with the ‘‘amount’’ of data available. This provides a potential explanation for the immunity of the robust approach to over/under fitting issues, as evidenced in our case study.
Next, we turn to understanding the discriminative power of the sparsest fit criterion. Towards this end, we describe a family of choice models that can be uniquely identified from the given marginal data using the sparsest fit criterion. We intuitively expect the complexity of identifiable models to scale with the ‘‘amount’’ of data that is available. We formalize this intuition by presenting for various types of data, conditions on the model generating the data under which identification is possible. These conditions characterize families of choice models that can be identified in terms of their sparsity and formalize the scaling between the complexity of a model class and the ‘‘amount’’ of data needed to identify it.
We now provide a characterization of the choice models implicitly used by the robust procedure through the lens of model sparsity. As mentioned above, loosely speaking, we can establish that the choice models selected implicitly via our revenue estimation procedure are, in essence, close to the sparsest model consistent with the observed data. In other words, the robust approach implicitly uses the simplest models consistent with observed data to predict revenues.
To state our result formally, let us define the set as the set of all possible data vectors, namely the convex hull of the columns of the matrix . For some and an arbitrary offer set, , let be an optimal basic feasible solution to the program used in our revenue estimation procedure, namely, (2). Moreover, let, be the sparsest choice model consistent with the data vector ; i.e. is an optimal solution to (3). We then have that with probability one, the sparsity (i.e. the number of rank lists with positive mass) under is close to that of . In particular, we have:
Theorem 1
For any distribution over that is absolutely continuous with respect to Lebesgue measure on , we have with probability 1, that:
Theorem 1 establishes that if were the support size of the sparsest distribution consistent with , the sparsity of the choice model used by our revenue estimation procedure is either or for ‘‘almost all’’ data vectors . As such, this establishes that the choice model implicitly employed by the robust procedure is essentially also the sparsest model consistent with the observed data.
In addition, the proof of the theorem reveals that the sparsity of the robust choice model consistent with the observed data is either or for almost all data vectors of dimension (here, we assume that the matrix has full row rank). This yields yet another valuable insight into the choice models implicit in our revenue predictions: the complexity of these models, as measured by their sparsity, grows with the amount of observed data. As such, we see that the complexity of the choice model implicitly employed by the robust procedure scales automatically with the amount of available data, as one would desire from a nonparametric scheme. This provides a potential explanation for the robust procedure’s lack of susceptibility to the overfitting observed for the MMNL model in our empirical study.
We now consider the family of choice models that can be identified via the sparsest fit criterion. For that, we present two abstract conditions that, if satisfied by the choice model generating the data , guarantee that the optimal solution to (3) is unique, and in fact, equal to the choice model generating the data.
Before we describe the conditions, we introduce some notation. As before, let denote the true underlying distribution, and let denote the support size, . Let denote the permutations in the support, i.e., for , and for all . Recall that is of dimension and we index its elements by . The two conditions are:
Signature Condition: For every permutation in the support, there exists a such that and , for every and . In other words, for each permutation in the support, serves as its ‘signature’.
Linear Independence Condition: , for any and , where denotes the set of integers and is a sufficiently large number . This condition is satisfied with probability if is drawn uniformly from the dim simplex, or for that matter, any distribution on the dim simplex with a density.
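The signature condition is straightforward to check for a given support. The sketch below assumes each permutation in the support is represented by its 0/1 column of the data matrix, and looks for a coordinate that is 1 for that permutation and 0 for every other permutation in the support. The function name `signatures` and the list-of-lists representation are assumptions made for the example; the linear independence condition is not checked here.

```python
def signatures(columns):
    """For each support column, find an index that only it 'lights up'.

    columns[k] is the 0/1 column of the k-th permutation in the support.
    Returns a list of signature indices (one per permutation), or None if
    some permutation has no signature.
    """
    result = []
    for k, col in enumerate(columns):
        sig = next(
            (d for d, v in enumerate(col)
             if v == 1 and all(other[d] == 0
                               for m, other in enumerate(columns) if m != k)),
            None,
        )
        if sig is None:
            return None
        result.append(sig)
    return result

ok = signatures([[1, 0, 1], [0, 1, 1]])    # each column has a private 1
bad = signatures([[1, 1], [1, 0]])         # second column has no signature
```

When a signature exists, the data entry at that index equals the probability mass of the corresponding permutation, which is what makes a constructive recovery scheme plausible.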
When the two conditions above are satisfied by a choice model, this choice model can be recovered from observed data as the solution to problem (3). Specifically, we have:
Theorem 2
Suppose we are given and satisfies the signature and linear independence conditions. Then, is the unique solution to the program in (3).
The proof of Theorem 2 is given in Appendix id1. The proof is constructive in that it describes an efficient scheme to determine the underlying choice model. Thus, the theorem establishes that whenever the underlying choice model satisfies the signature and linear independence conditions, it can be identified using an efficient scheme as the optimal solution to the program in (3). We next characterize a family of choice models that satisfy the signature and linear independence conditions. Specifically, we show that essentially all choice models with sparsity satisfy these two conditions as long as scales as , and for comparison, top-set, and ranking data respectively. To capture this notion of ‘essentially’ all choice models, we introduce a natural generative model. It then remains to understand how restrictive these values of are, which we discuss subsequently.
A Generative Model: Given and an interval on the positive real line, we generate a choice model as follows: choose permutations, , uniformly at random with replacement (though repetitions are possible due to replacement, for large and they happen with a vanishing probability), choose numbers uniformly at random from the interval , normalize the numbers so that they sum to (we may pick any distribution on the dim simplex with a density; here we pick the uniform distribution for concreteness), and assign them to the permutations , . For all other permutations , .
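A minimal sketch of this generative model, using only the Python standard library, is given below. The function name `generate_choice_model`, the argument names, and the representation of a choice model as a list of (permutation, probability) pairs are assumptions made for the example; the weights are assumed to be normalized to sum to one.

```python
import random

def generate_choice_model(n, k, low, high, rng):
    """Draw a sparse choice model over permutations of n products.

    Picks k permutations uniformly at random with replacement, draws k
    weights uniformly from [low, high] (with 0 < low <= high), and
    normalizes the weights to sum to one. All other permutations
    implicitly receive probability zero.
    """
    products = list(range(n))
    perms = [tuple(rng.sample(products, n)) for _ in range(k)]
    raw = [rng.uniform(low, high) for _ in range(k)]
    total = sum(raw)
    return [(perm, w / total) for perm, w in zip(perms, raw)]

model = generate_choice_model(n=6, k=5, low=1.0, high=2.0,
                              rng=random.Random(1))
```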
Depending on the observed data, we characterize values of sparsity up to which distributions generated by the above generative model can be recovered with a high probability. Specifically, the following theorem is for the three examples of observed data mentioned in Section id1.
Theorem 3
Suppose is a choice model of support size drawn from the generative model. Then, satisfies the signature and linear independence conditions with probability as provided for comparison data, for top-set data, and for ranking data.
Theorem 3 above implies that essentially all choice models of sparsity (and higher) can be recovered from the types of observed data discussed in the theorem. A natural question that arises at this juncture is what a reasonable value of might be. To give a sense of this, we provide the following approximation result: a good approximation to any choice model for the purposes of revenue estimation is obtained by a sparse choice model with support scaling as . Specifically, let us restrict ourselves to offer sets that are ‘small’, i.e. bounded by a constant ; this is legitimate from an operational perspective and in line with many of the applications we have described. We now show that any customer choice model can be well approximated by a choice model with sparse support for the purpose of evaluating the revenue of any offer set of size up to . In particular, we have:
Theorem 4
Let be an arbitrary given choice model. Then, there exists a choice model with support such that
The proof is provided in Appendix id1. Along with Theorem 3, the above result establishes the potential generality of the signature and linear independence conditions.
In summary, this section visited the issues of explicitly selecting a choice model consistent with the observed data. This is in contrast to our work thus far, which has been simply making revenue predictions. We showed that the robust procedure we used in making revenue predictions may also be seen to yield what is essentially the sparsest choice model consistent with the observed data. Finally, by presenting a family of models for which the sparsest fit to the observed data was unique, and studying the properties of this unique solution, we were able to delineate a data-dependent family of choice models for which the sparsest fit criterion actually yields identification. This formalized the intuitive notion that the complexity of the choice model that can be recovered scales with the ‘‘amount’’ of data that is available.
This paper presented a new approach to the problem of using historical sales data to predict expected sales / revenues from offering a particular assortment of products. We depart from traditional parametric approaches to choice modeling in that we assume little more than a weak form of customer rationality; the family of choice models we focus on is essentially the most general family of choice models one may consider. In spite of this generality, we have presented schemes that succeed in producing accurate sales / revenue predictions. We complemented those schemes with extensive empirical studies using both simulated and real-world data, which demonstrated the power of our approach in producing accurate revenue predictions without being prone to over- and under-fitting. We believe that these schemes are particularly valuable from the standpoint of incorporating models of choice in decision models frequently encountered in operations management. Our schemes are efficient from a computational standpoint and raise the possibility of an entirely ‘data-driven’ approach to the modeling of choice for use in those applications. We also discussed some ideas on the problem of identifying sparse or simple models that are consistent with the available marginal information.
With that said, this work cannot be expected to present a panacea for choice modeling problems. In particular, one merit of a structural/parametric approach to modeling choice is the ability to extrapolate. That is to say, a nonparametric approach such as ours can start making useful predictions about the interactions of a particular product with other products only once some data related to that product is observed. With a structural model, one can hope to say useful things about products never seen before. The decision of whether a structural modeling approach is relevant to the problem at hand or whether the approach we offer is a viable alternative thus merits a careful consideration of the context. Of course, as we have discussed earlier, resorting to a parametric approach will typically require expert input on underlying product features that ‘matter’, and is thus difficult to automate on a large scale.
We believe that this paper presents a starting point for a number of research directions. These include, from an applications perspective:

The focus of this paper has been the estimation of the revenue function. The rationale here is that this forms a core subroutine in essentially any revenue optimization problem that seeks to optimize revenues in the face of customer choice. A number of generic algorithms (such as local search) can potentially be used in conjunction with the subroutine we provide to solve such optimization problems. It would be interesting to study such a procedure in the context of problems such as network revenue optimization in the presence of customer choice, and assortment optimization.

Having learned a choice model that consists of a distribution over a small number of rank lists, there are a number of qualitative insights one might hope to draw. For instance, using fairly standard statistical machinery, one might hope to ask for the product features that most influence choice from among thousands of potential features by understanding which of these features best rationalize the rank lists learned. In a different direction, one may use the distribution learned as a ‘prior’ and, given further interactions with a given customer, infer a distribution specialized to that customer via Bayes' rule. This is effectively a means of accomplishing ‘collaborative filtering’.
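As a rough illustration of the two application directions above, the following Python sketch assumes a choice model that has already been learned as a short list of (rank list, probability) pairs. Everything named here is hypothetical and introduced only for illustration: the functions choice, expected_revenue, local_search and posterior, the convention that product 0 is a zero-revenue no-purchase option that is always offered, and the toy data. The local search is a plain add/remove heuristic, not a procedure analyzed in this paper.

```python
def choice(rank_list, assortment):
    """Most preferred offered product: first entry of the rank list on offer."""
    return next((p for p in rank_list if p in assortment), None)


def expected_revenue(model, prices, assortment):
    """Expected revenue of an assortment under a distribution over rank lists.

    model: list of (rank_list, probability) pairs (assumed already learned).
    prices: dict mapping product -> revenue; unknown products count as 0.
    """
    return sum(w * prices.get(choice(r, assortment), 0.0) for r, w in model)


def local_search(model, prices, products, start=frozenset({0})):
    """Toggle one product at a time while expected revenue strictly improves."""
    current = frozenset(start)
    best = expected_revenue(model, prices, current)
    improved = True
    while improved:
        improved = False
        for p in products:
            candidate = current ^ {p}  # add p if absent, remove it if present
            value = expected_revenue(model, prices, candidate)
            if value > best + 1e-12:
                current, best, improved = candidate, value, True
    return current, best


def posterior(model, assortment, chosen):
    """Bayes update of the rank-list distribution after one observed choice.

    The likelihood of a rank list is 1 if it would have picked `chosen`
    from `assortment`, and 0 otherwise.
    """
    weights = [w if choice(r, assortment) == chosen else 0.0 for r, w in model]
    total = sum(weights)
    if total == 0:
        raise ValueError("observed choice has zero probability under the model")
    return [(r, w / total) for (r, _), w in zip(model, weights)]


# Toy data (hypothetical): three customer types over products 1, 2, 3,
# with 0 standing for "no purchase".
model = [((1, 2, 0, 3), 0.5), ((2, 0, 1, 3), 0.3), ((3, 1, 0, 2), 0.2)]
prices = {0: 0.0, 1: 10.0, 2: 8.0, 3: 4.0}

assortment, revenue = local_search(model, prices, products=[1, 2, 3])
customer_model = posterior(model, assortment={0, 1}, chosen=1)
```

Because the learned model is sparse, each revenue evaluation costs one pass over a handful of rank lists, which is what makes it cheap to call repeatedly inside a search procedure. Local search of this kind can stop at a local optimum, so the sketch should be read as a baseline rather than a solution to the assortment optimization problem.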
There are also interesting directions to pursue from a theoretical perspective: First, extending our understanding of the limits of identification. In particular, it would be useful to characterize the limits of recoverability for additional families of observable data beyond those discussed in Theorem 3. Second, Theorem 4 points to the existence of sparse approximations to generic choice models. Can we compute such approximations for any choice model but with limited data? Finally, the robust approach in Section id1 presents us with a family of difficult optimization problems for which the present work has presented a generic optimization scheme that is in the spirit of cutting plane approaches. An alternative to this is the development of strong relaxations that yield uniform approximation guarantees (in the spirit of the approximation algorithms literature).
Appendix
The result of Theorem 1 follows immediately from the following lemma, which we prove below.
Lemma 1
Let denote the column rank of matrix and denote the convex hull of the columns of . Then, it must be that belongs to a dimensional subspace, , and
where denotes the set of all data vectors such that
and denotes the dimensional volume of a set of points.
Proof of Lemma 1 We prove this lemma in two parts: (1) belongs to a dimensional subspace and for all , and (2) .
To prove the first part, note that any data vector belongs to dimensional subspace because has a dimensional range space and belongs to the intersection of the range space of and the hyperplane . Let denote the augmented matrix, which is obtained by augmenting the last row of matrix with a row of all s. Similarly, let denote the vector obtained by augmenting vector with . The equality constraints of (2) can now be written as , . Since has rank , the rank of will be at most . Therefore, for any data vector , an optimal BFS solution to (2) must be such that
(10) 
Coming to the second part of the proof, for any , let denote the set of all data vectors that can be written as a convex combination of at most columns of matrix . Let denote the number of columns of of size at most , and let denote the corresponding subsets of columns of of size at most . Then, it is easy to see that can be written as the union of disjoint subsets , where for each , denotes the set of data vectors that can be written as the convex combination of the columns in subset . For each , since is a polytope residing in dimensional space, it must follow that . Since is finite, it follows that . Therefore, we can conclude that
(11) 
Before we prove Theorem 2, we propose a simple combinatorial algorithm that recovers the model whenever satisfies the signature and linear independence conditions; we make use of this algorithm in the proof of the theorem.
The algorithm recovers when the signature and linear independence conditions are satisfied. If the conditions are not satisfied, the algorithm provides a certificate to that effect. The algorithm takes as an explicit input with the prior knowledge of the structure of as an auxiliary input. Its aim is to produce . In particular, the algorithm outputs the sparsity of , , permutations so that , and the values . Without loss of generality, assume that the values are sorted with and further that .
Before we describe the algorithm, we observe the implication of the two conditions. The Linear Independence condition says that for any two nonempty distinct subsets ,
This means that if we know all and since we know , then we can recover as the unique solution to in .
Therefore, the nontriviality lies in finding and . This issue is resolved by use of the Signature condition in conjunction with the above described properties in an appropriate recursive manner. Specifically, recall that the Signature condition implies that for each for which , there exists such that . By Linear Independence, it follows that all s are distinct and hence by our assumption
Therefore, it must be that the smallest value, equals . Moreover, and for all . Next, if then it must be that and for all . We continue in this fashion until we reach a such that but . Using similar reasoning it can be argued that and for all . Continuing in this fashion and repeating essentially the above argument with appropriate modifications leads to recovery of the sparsity , the corresponding and for . The complete procedural description of the algorithm is given below.
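To make the recursion concrete before the formal description, the following Python sketch mirrors the reasoning above. It is an illustration under stated assumptions, not the procedure itself: the function name sparsest_fit, the numerical tolerance tol, and the brute-force search over subsets of already recovered weights are all introduced here, and the sketch simply assumes that the signature and linear independence conditions hold, so it does not produce the certificate of failure mentioned earlier. It takes the data vector, visits its entries in increasing order, and treats an entry either as a sum of previously recovered weights or as a newly discovered weight.

```python
from itertools import combinations


def sparsest_fit(y, tol=1e-9):
    """Recover weights and 0/1 supports from a data vector y.

    Assumes each weight appears alone in at least one entry of y
    (signature) and that distinct subsets of weights have distinct sums
    (linear independence). Returns (weights, support), where support[i]
    is the set of weight indices contributing to y[i].
    """
    order = sorted(range(len(y)), key=lambda i: y[i])
    weights = []                     # recovered in increasing order
    support = [set() for _ in y]

    for i in order:
        if y[i] <= tol:
            continue                 # no weight contributes to this entry

        # Look for a subset of known weights that sums to y[i].
        # Brute force: exponential in the number of recovered weights.
        match = None
        for size in range(1, len(weights) + 1):
            for subset in combinations(range(len(weights)), size):
                if abs(sum(weights[k] for k in subset) - y[i]) <= tol:
                    match = subset
                    break
            if match is not None:
                break

        if match is None:
            # y[i] is not explained by known weights: it is a new weight.
            weights.append(y[i])
            support[i] = {len(weights) - 1}
        else:
            support[i] = set(match)

    return weights, support


# Hypothetical data: weights 0.1, 0.3, 0.6 observed through five entries.
weights, support = sparsest_fit([0.4, 0.1, 0.7, 0.3, 0.6])
```

Processing entries in sorted order is what makes the recursion work: under the two conditions, the smallest unexplained entry can only be a single new weight, since any sum involving an undiscovered weight would be at least as large as that weight.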
Sparsest Fit Algorithm:
Initialization: , and , .
for to
if for some