The Use of Binary Choice Forests to Model and Estimate Discrete Choices
Abstract
We show the equivalence of discrete choice models and the class of binary choice forests, which are random forests based on binary choice trees. This suggests that standard machine learning techniques based on random forests can serve to estimate discrete choice models with an interpretable output. This is confirmed by our data-driven theoretical results, which show that random forests can predict the choice probability of any discrete choice model consistently, with the splitting criterion capable of recovering preference rank lists. The framework has unique advantages: it can capture behavioral patterns such as irrationality or sequential searches; it handles non-standard formats of training data that result from aggregation; it can measure product importance based on how frequently a random customer would make decisions depending on the presence of the product; it can also incorporate price information and customer features. Our numerical results show that using random forests to estimate customer choices represented by binary choice forests can outperform the best parametric models in synthetic and real datasets.
1 Introduction
Being able to understand consumers’ choice behavior when they are offered an assortment of products provides firms with unique advantages. It is particularly important in the modern era: online retailers that predict consumers’ choice behavior more accurately can implement more effective strategies and earn higher profits. In turn, they can afford to invest in advanced technologies and infrastructure, and sharpen their prediction of consumers’ behavior. The unstoppable cycle has created a few unprecedented market juggernauts such as Amazon. Firms unwilling or incapable of getting inside the mind of their consumers are left behind. Not surprisingly, discrete choice models (DCM) have become one of the central topics in revenue management and pricing analytics.
To understand and predict consumers’ choice behavior, academics and practitioners have proposed several frameworks, some of which are widely adopted in industry. One ubiquitous framework is model-then-estimate. In this framework, a parametric DCM is proposed to explain how a customer chooses a product when offered an assortment. The parameters are then estimated using historical data. Once the model has been estimated properly, it can be used as a workhorse to predict the choice behavior of future consumers.
In the model-then-estimate framework, there is a trade-off between flexibility and accuracy. A flexible DCM incorporates a wide range of patterns of consumers’ behavior, but it may be difficult to estimate and may overfit the training data. A parsimonious model may fail to capture consumers’ behavior, and even if estimated correctly it would be misspecified. The goal is to reach a delicate balance between flexibility and predictability, even for assortments never seen before. Not surprisingly, it is not straightforward to find the “sweet spot” when selecting among the large class of parametric DCMs.
Another framework favored by data scientists is estimate-without-models. Advanced machine learning algorithms are applied to historical sales data and used to predict future choice behavior. The framework skips “modeling” entirely and does not attempt to understand the rationality (or irrationality) hidden behind the patterns observed in the training data. With engineering tweaks, the algorithms can be implemented efficiently and capture a wide range of choice behavior. For example, neural networks are known to be able to approximate any continuous function. This approach may sound appealing: if the algorithm achieves impressive accuracy when predicting the choice behavior of new consumers, why do we care about the actual rationale behind consumers when they make choices? There are two reasons to care. First, the firm may be interested not only in making accurate predictions, but also in other goals such as finding the optimal assortment that maximizes the expected revenue. Without a proper model, it is unclear if the goal can be formulated as an optimization problem. Second, when the market environment or customer preferences change systematically over time, having a reasonable model provides a certain degree of generalizability, while black-box algorithms may fail to capture an obvious pattern just because the pattern has not appeared frequently in the past.
In this paper, we introduce a data-driven framework, which we call estimate-and-model, that combines machine learning with DCMs and thus retains the strengths of both frameworks mentioned previously. The model we propose, binary choice forests, is a mixture of binary trees, each of which mimics the internal decision-making process of a customer. We show that the binary choice forest can be used to approximate any DCM, and is thus sufficiently flexible, but still identifiable with training data of reasonable size. Moreover, it can be efficiently estimated using random forests (breiman2001random), a popular machine learning technique that has stood the test of time. Random forests are easy to implement using R or Python (scikitlearn; rrandomforest) and have been shown to have extraordinary predictive power in practice. We provide theoretical guarantees: as the sample size increases, random forests can successfully recover the binary choice forest, and thus any DCM. Moreover, the splitting criterion used by the random forests is intrinsically connected to the preference rank list of customers.
As a contribution to the literature, the framework we propose has the following practical advantages:

It can capture various patterns of customer behavior that cannot be easily captured by other models, such as irregularity and sequential searches (weitzman1979optimal). See Section 5.1 for more details.

It can deal with nonstandard formats of historical data, which is a major challenge in practice. See Section 5.2 for more details.

It can return an importance index for all products, based on how frequently a random customer would make decisions depending on the presence of the product.

It can incorporate the prices of the products and reflect the information in the decision-making of consumers.

It can naturally incorporate customer features and is compatible with personalized online retailing.
1.1 Literature Review
We first review DCMs proposed in the literature following the model-then-estimate framework, in the order of increasing flexibility and difficulty in terms of estimation. The independent demand model and the MNL model (McFa73) have very few parameters (one per product), which are easy to estimate (train2009discrete). Although the MNL model is still widely used, its inherent property of independence of irrelevant alternatives (IIA) has been criticized for being unrealistic (see anderson1992discrete for more details). The mixed logit model, the nested logit model, the Markov chain DCM, and the rank-based DCM (see, e.g., williams1977formation; train2009discrete; farias2013nonparametric; blanchet2016markov) are able to capture much more complex choice behavior than the MNL model. In fact, the mixed logit model and the rank-based DCM can approximate any random utility model (RUM), encompassing a very general class of DCMs. The estimation of these models is challenging, but there has been exciting progress in recent years (farias2013nonparametric; van2014market; van2017expectation; csimcsek2018expectation; jagabathula2018conditional). However, computational feasibility and susceptibility to overfitting remain a challenge in practice. Even the general class of RUMs cannot capture certain choice behavior. A RUM possesses the so-called regularity property: the probability of choosing an alternative cannot increase if the offered set is enlarged. There are a few experimental studies showing strong evidence that regularity may be violated (simonson1992choice). Several models have been proposed to capture even more general behavior than RUM (natarajan2009persistency; flores2017assortment; berbeglia2018generalized; feng2017relation). It is unclear if the estimation for such models can be done efficiently.
The specifications of random forests used in this paper are introduced by breiman2001random, although many of the ideas were discovered even earlier. The readers may refer to hastie2009elements for a general introduction. Although random forests have been very successful in practice, little is known about their theoretical properties. To date, most studies are focused on isolated setups or simplified versions of the procedure. In a recent study, scornet2015consistency establish the consistency of random forests in regression problems, under less restrictive assumptions. biau2016random provide an excellent survey of the recent theoretical and methodological developments in the field.
A recent paper by chen2019decision proposes a similar tree-based DCM. They show that their “decision forest” can approximate any DCM with arbitrary precision; a similar result is proved with a different approach in this paper. Our studies differ substantially in the estimation step: we focus on random forests, while chen2019decision follow an optimization approach based on column generation ideas for estimation. Moreover, we establish the consistency of random forests, and show that the estimation can accommodate price information and aggregate choice data. In our numerical study, we find that random forests are quite robust and perform well even compared with the Markov chain model estimated using the expectation-maximization (EM) algorithm, which has been shown to have outstanding empirical performance compared to MNL, the nested logit, the mixed logit and the rank-based DCM (berbeglia2018comparative), especially when the training data is large. Our algorithm runs 17 times faster than the EM algorithm. In contrast, the computational study in chen2019decision is limited to the rank-based model estimated by column generation (van2014market), which is shown to be outperformed by the Markov chain model (berbeglia2018comparative).
2 Choice Models and Mixture of Binary Trees
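Consider a set $[N] \triangleq \{1, \dots, N\}$ of $N$ products and define $[N]_+ \triangleq [N] \cup \{0\}$, where $0$ represents the no-purchase option. Let $x \in \{0,1\}^N$ be a binary vector representing an assortment of products, where $x(i) = 1$ indicates that product $i$ is in the assortment and $x(i) = 0$ otherwise. A discrete choice model (DCM) is a non-negative mapping $p(i, x): [N]_+ \times \{0,1\}^N \to [0,1]$ such that
$$\sum_{i \in [N]_+} p(i, x) = 1, \qquad p(i, x) = 0 \ \text{ if } x(i) = 0.$$
It is clear that $p(i, x)$ represents the probability of a random customer choosing product $i$ when presented with the assortment $x$. We refer to a subset $S$ of $[N]$ as an assortment associated with $x \in \{0,1\}^N$, i.e., $i \in S$ if and only if $x(i) = 1$. Without ambiguity, we will write $p(i, S)$ instead of $p(i, x)$.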
A binary decision tree $t(x)$ maps $x \in \{0,1\}^N$ into $[N]_+$. More precisely, it specifies a partition of the space $\{0,1\}^N$, $\{R_i\}_{i \in [N]_+}$, and assigns label $i$ to region $R_i$, so $t(x) = \sum_{i \in [N]_+} i \cdot \mathbb{I}\{x \in R_i\}$. Some of the regions in the partition may be empty. We refer to the partition as a binary decision tree because any partition of $\{0,1\}^N$ can be obtained by sequentially splitting the space along the $N$ dimensions. For example, a decision tree representation of a partition when is demonstrated in Figure 1.
A binary decision forest is defined as a convex combination of multiple binary decision trees. More precisely, a binary decision forest can be written as
$$f(i, x) = \sum_{b=1}^{B} w_b \, \mathbb{I}\{t_b(x) = i\},$$
where the $t_b(x)$ and $w_b$ are, respectively, decision trees and non-negative weights summing up to one. Notice that a decision forest maps $[N]_+ \times \{0,1\}^N$ into $[0,1]$, just like DCMs do. Yet decision forests are not necessarily DCMs because $t_b(x)$ may be equal to $i$ even if $x(i) = 0$.
A binary decision tree $t(x)$ is a binary choice tree if $t(x) = i$ only if $x(i) = 1$. A binary decision forest is a binary choice forest (BCF) if it is a convex combination of binary choice trees. A BCF can be interpreted as decisions made by $B$ consumer types, with consumers of type $b$ having weight $w_b$ and making decisions based on binary choice tree $t_b(x)$. If $f$ is a BCF, then $f$ is also a DCM. This is because $f$ is non-negative, and $\sum_{i \in [N]_+} f(i, x) = \sum_{b=1}^{B} w_b \sum_{i \in [N]_+} \mathbb{I}\{t_b(x) = i\} = \sum_{b=1}^{B} w_b = 1$; moreover, $f(i, x) = 0$ whenever $x(i) = 0$, because no choice tree outputs a product that is not offered. To see that the converse is also true, we will first show that DCMs are closed under convex combinations and that any DCM is in the convex hull of extreme DCMs. We next argue that the extreme DCMs are the deterministic DCMs that select a particular choice with probability one for every assortment $x \in \{0,1\}^N$. The next step is to show that each extreme DCM can be represented by a binary choice tree, concluding that every DCM is a convex combination of choice trees and is thus a BCF.
Theorem 1.
Every BCF is a DCM, and every DCM can be represented as a BCF.
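One way to interpret this result is that for each DCM $p$ there exists a set of non-negative weights $\{w_k\}$ adding to one, such that for all $i \in [N]_+$ and $x \in \{0,1\}^N$, $p(i, x) = \sum_{k} w_k \, p_k(i, x)$, where the $p_k$’s are the extreme deterministic DCMs.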
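As an illustration of the easy direction of Theorem 1, the following minimal Python sketch builds a small forest and checks the two DCM conditions by enumeration. It is not taken from the paper: the helper names `rank_list_tree` and `forest_probability`, the three customer types, and their weights are all hypothetical, and the trees are restricted to rank lists, which are only a special case of binary choice trees.

```python
from itertools import product


def rank_list_tree(ranking):
    """Binary choice tree induced by a preference rank list (hypothetical helper).

    `ranking` lists products (numbered 1..N) from most to least preferred;
    any product not listed is ranked below the no-purchase option 0.
    """
    def tree(x):
        # x is a 0/1 tuple; x[i - 1] == 1 means product i is offered.
        for i in ranking:
            if x[i - 1] == 1:
                return i
        return 0  # none of the listed products is offered: no purchase
    return tree


def forest_probability(trees, weights, i, x):
    """f(i, x): total weight of the trees that map assortment x to choice i."""
    return sum(w for t, w in zip(trees, weights) if t(x) == i)


# Hypothetical mixture of three customer types over N = 3 products.
N = 3
trees = [rank_list_tree([1, 2, 3]), rank_list_tree([3, 1]), rank_list_tree([2])]
weights = [0.5, 0.3, 0.2]

for x in product([0, 1], repeat=N):
    probs = [forest_probability(trees, weights, i, x) for i in range(N + 1)]
    # DCM conditions: probabilities sum to one, unoffered products get zero.
    assert abs(sum(probs) - 1.0) < 1e-12
    assert all(probs[i] == 0 for i in range(1, N + 1) if x[i - 1] == 0)
```

For the assortment containing products 1 and 3, the three trees return 1, 3 and 0 respectively, so the forest assigns their weights to those three choices.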
Although we can represent every DCM as a BCF, it will be difficult to estimate if we have too many extreme points. The number of extreme points is $\prod_{k=0}^{N} (k+1)^{\binom{N}{k}}$ for $N$ products, which increases tremendously as $N$ increases, with more than extreme points for . In the next theorem, we will show that any DCM can be represented as a convex combination of far fewer binary choice trees.
Theorem 2.
Every DCM can be represented as a BCF containing at most $N 2^{N-1} + 1$ trees.
Proof.
Carathéodory’s theorem states that if a point $x$ of $\mathbb{R}^d$ lies in the convex hull of a set $P$, then $x$ can be written as the convex combination of at most $d + 1$ points in $P$. To apply Carathéodory’s theorem to DCMs, notice that since the choice probabilities sum to 1, each assortment with cardinality $k$ has dimension $k$. We have $\sum_{k=0}^{N} \binom{N}{k} k = N 2^{N-1}$. Therefore, every DCM can be represented as a BCF containing at most $N 2^{N-1} + 1$ trees. ∎
As an example, for , , so any DCM with can be represented by a convex combination of trees.
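To get a feel for the gap between the two counts, the short sketch below compares the number of deterministic DCMs with the Carathéodory bound. It assumes the dimension count used in the proof above (an assortment of cardinality $k$ contributes $k$ free choice probabilities) and that a deterministic DCM picks one of $k+1$ options for each assortment of size $k$; the function names are ours.

```python
from math import comb


def num_deterministic_dcms(n):
    """Number of deterministic DCMs: each of the C(n, k) assortments of
    size k independently selects one of its k products or no purchase."""
    total = 1
    for k in range(n + 1):
        total *= (k + 1) ** comb(n, k)
    return total


def caratheodory_bound(n):
    """Dimension of the DCM polytope plus one: sum_k C(n, k) * k + 1."""
    dimension = sum(comb(n, k) * k for k in range(n + 1))
    return dimension + 1


for n in (2, 3, 5, 10):
    extreme = num_deterministic_dcms(n)
    print(n, caratheodory_bound(n), f"about 10^{len(str(extreme)) - 1} extreme points")
```

The bound grows like $N 2^{N-1}$, while the number of extreme points grows doubly exponentially in $N$.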
A recent working paper by chen2019decision has independently shown, by construction, that any choice model can be represented by a decision forest where each of the trees has depth . While their proof has the virtue of being constructive, our proof is more succinct: it shows that DCMs and BCFs are equivalent, and that a representation of much lower dimension exists. Our result implies that choice forests are capable of explaining some of the pathological cases that do not exhibit regularity and are outside the RUM class, including the decoy effect (ariely2008predictably) and comparison-based choice (huber1982adding). Note also that all RUMs can be modelled as convex combinations of permutation lists, which are special cases of decision trees.
3 Data and Estimation
The main goal of this paper is to provide a practical method to estimate DCMs using random forests, which are shown to be able to approximate all BCFs. The numerical recipe for random forests is widely available and implementable. Before proceeding we remark that an alternative approach would be to use column generation starting with a collection of trees and adding additional trees to improve the fit to data. This approach has been taken, for example by van2014market; mivsic2016data; jagabathula2016nonparametric to estimate RUMs by weighted preference rank lists, and a similar approach has been pursued by chen2019decision for trees. We remark that the output of our model can be fed into a column generation algorithm to seek further improvements although we have not pursued this in our paper.
We will assume that arriving consumers make selections independently based on an unknown DCM $p(i, x)$, and that a firm collects data of the form $(i_t, x_t)$ (or equivalently $(i_t, S_t)$), $t = 1, \dots, T$, where $x_t$ was the assortment offered to the $t$th consumer and $i_t \in S_t \cup \{0\}$ is the choice made by consumer $t$. Our goal is to use the data to construct a family of binary choice trees as a means to estimate the underlying DCM represented by a BCF. We view the problem as a classification problem: given the predictor $x$, we would like to provide a classifier that maps the predictor to a class label $i \in [N]_+$, or the class probabilities.
To this end we will use a random forest as a classifier. The output of a random forest is $B$ individual binary decision trees (CART), $\{t_b(x)\}_{b=1}^{B}$, where $B$ is a tunable parameter. Although a single tree only outputs a class label in each region, the aggregation of the trees, i.e., the forest, is naturally equipped with the class probabilities. Then the choice probability of item $i$ in the assortment $x$ is estimated as
$$\frac{1}{B} \sum_{b=1}^{B} \mathbb{I}\{t_b(x) = i\}, \tag{1}$$
which is a special form of BCF. The next result shows that the random forest can still approximate any DCM.
Theorem 3.
If $B$ is sufficiently large, then a binary choice forest of the form
$$f(i, x) = \frac{1}{B} \sum_{b=1}^{B} \mathbb{I}\{t_b(x) = i\}$$
can approximate any DCM.
The implication of this result is that we do not have to worry about generating all of the extreme points, or deterministic DCMs, and then finding a set of weights $w_b$ for each such tree $t_b$. Intuitively, if $B$ is sufficiently large, then we need approximately $w_b B$ type-$b$ customers associated with tree $t_b$ with positive weight $w_b$ in the convex combination.
We explain how the random forest can be estimated from the historical data by first reviewing the basic mechanism of CART, which performs recursive binary splitting of the predictor space $[0,1]^N$. In each iteration, it selects a dimension $i$ and a split point $s$ to split the predictor space. More precisely, the split divides the observations into $\{t : x_t(i) \le s\}$ and $\{t : x_t(i) > s\}$. In our problem, because $x_t$ is at a corner of the hypercube, all split points between 0 and 1 create the same partition of the observations and thus we simply set $s = 0.5$. To select the dimension, usually an empirical criterion is optimized to favor splits that create “purer” regions. That is, the resulting region should contain data points that mostly belong to the same class. We use a common measure called the Gini index: $\sum_{R_j} \frac{t_j}{T} \sum_{k=0}^{N} \hat{p}_{jk} (1 - \hat{p}_{jk})$, where $t_j$ is the number of observations in region $R_j$ of the partition and $\hat{p}_{jk}$ is the empirical frequency of class $k$ in $R_j$. It is not hard to see that the Gini index takes smaller values when the regions contain predominantly observations from a single class. In this case, a dimension is selected that minimizes the measure and the partition is further refined by a binary split. This splitting operation is conducted recursively for the regions in the resulting partition until a stopping rule is met.
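The split selection described above can be sketched in a few lines of Python. This is an illustrative toy, not the implementation used in the paper: the function names, the tiny dataset, and the use of 0-based column indices for products are our assumptions.

```python
from collections import Counter


def weighted_gini(regions):
    """Gini index of a partition; `regions` is a list of lists of class labels."""
    total = sum(len(labels) for labels in regions)
    value = 0.0
    for labels in regions:
        if not labels:
            continue  # empty regions contribute nothing
        freqs = [count / len(labels) for count in Counter(labels).values()]
        value += len(labels) / total * sum(p * (1 - p) for p in freqs)
    return value


def best_split(X, y, candidates):
    """Pick the column whose split at 0.5 gives the smallest Gini index."""
    scores = {}
    for i in candidates:
        left = [label for x, label in zip(X, y) if x[i] <= 0.5]
        right = [label for x, label in zip(X, y) if x[i] > 0.5]
        scores[i] = weighted_gini([left, right])
    return min(scores, key=scores.get), scores


# Hypothetical data: two products (columns 0 and 1), labels in {0, 1, 2}.
X = [(1, 0), (1, 1), (1, 1), (0, 1), (0, 1), (0, 0)]
y = [1, 1, 1, 2, 2, 0]
column, scores = best_split(X, y, candidates=[0, 1])
print(column, scores)
```

In this toy dataset every customer who is offered the first product buys it, so splitting on column 0 leaves a pure right branch and is preferred over splitting on column 1.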
The main drawback of CART is its tendency to overfit the training data. If a deep decision tree is built (having a large number of splits), then it may fit the training data well but introduce large variances when applied to test data. If the tree is pruned and only has a few leaves (or regions in the predictor space), then it loses predictive accuracy. Random forests, by creating a number of decision trees and then aggregating them, significantly improve the power of single trees and move the bias-variance trade-off in a favorable direction. The basic idea behind random forests is to “shake” the original training data in various ways in order to create decision trees that are as uncorrelated as possible. Because the decision trees are deliberately “decorrelated”, they can afford to be deep, as the large variances are remedied by aggregating the “almost independent” trees.
Next we explain the details of random forests. To create $B$ randomized trees, for each $b = 1, \dots, B$, we randomly choose $z$ samples with replacement from the $T$ observations (a bootstrap sample). Only the subsample of $z$ observations is used to train the $b$th decision tree. Splits are performed only on a random subset of $[N]$ of size $m$ according to the Gini index criterion. The random subsample of training data and the random directions to split are two key ingredients in creating less correlated decision trees in the random forest. The depth of the tree is controlled by the minimal number of observations, say $l$, in a region for the tree to keep splitting.
These ideas are subsumed in Algorithm 1.
We first remark on the procedure in Algorithm 1 as it applies to a generic classification problem and then comment on the special properties in our problem. (1) Many machine learning algorithms such as neural networks have numerous parameters to tune and the performance crucially depends on a suitable choice of parameters. Random forests, on the other hand, have only a few interpretable parameters. Even so, in the numerical studies in this paper, we simply choose a set of parameters that are commonly used for classification problems, without cross-validation or tuning, in order to demonstrate the robustness of the algorithm. In particular, we mostly use , and . There are other alternative options when constructing random forests, such as using a bootstrap sample in Step 4. For ease of exposition, we stick to the canonical version presented in Algorithm 1. (2) The numerical recipe for the algorithm is implemented in many programming languages such as R and Python and is ready to use. In Section B, we provide a demonstration using scikit-learn, a popular machine learning package in Python that implements random forests, to estimate customer choice. As one can see, it takes less than 20 lines to implement the procedure.
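For orientation, a minimal scikit-learn sketch of this kind of procedure is given below. It is not the demonstration from Section B: the synthetic MNL ground truth, the sample size, and every parameter value are assumptions chosen for illustration, and mapping the minimal region size to `min_samples_split` is our reading of the text. Note also that `predict_proba` in scikit-learn averages the class frequencies in the leaves rather than drawing a single observation per leaf.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
N, T = 4, 2000  # hypothetical number of products and of observed customers

# Assumed ground truth for the simulation: an MNL model (index 0 = no purchase).
utilities = np.array([0.0, 1.0, 0.5, 0.2, -0.3])


def mnl_probs(x):
    """Choice probabilities over {0, 1, ..., N} for a 0/1 assortment vector x."""
    offered = np.concatenate(([1], x))  # no purchase is always available
    weights = offered * np.exp(utilities)
    return weights / weights.sum()


# Training data: random assortments and the simulated choice of each customer.
X = rng.integers(0, 2, size=(T, N))
y = np.array([rng.choice(N + 1, p=mnl_probs(x)) for x in X])

forest = RandomForestClassifier(
    n_estimators=200,       # number of trees (illustrative value)
    max_features="sqrt",    # size of the random subset of products per split
    min_samples_split=50,   # stop splitting small regions (illustrative value)
    bootstrap=True,         # each tree sees a bootstrap sample
    random_state=0,
)
forest.fit(X, y)

# Estimated choice probabilities when products 1 and 3 are offered.
x_new = np.array([[1, 0, 1, 0]])
proba = forest.predict_proba(x_new)[0]
estimate = dict(zip(forest.classes_, proba))
print(estimate)
```

The estimate can be compared with `mnl_probs(x_new[0])` to see how close the forest gets on this synthetic example.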
Because of the structure of the problem, there are three specific observations. (1) Because the entries of $x$ are binary, $x(i) \in \{0,1\}$, the split position of decision trees is always $0.5$. Therefore, along a branch of a decision tree, there can be at most one split on a particular dimension, and the depth of a decision tree is at most $N$. (2) The random forest is a binary decision forest instead of a BCF. In particular, the probability of class $i$, or the choice probability of product $i$ given assortment $x$, may be positive even when $x(i) = 0$, i.e., product $i$ is not included in the assortment. To fix the issue, we adjust the probability of class $i$ by conditioning on the trees that output reasonable class labels:
$$\hat{p}(i, x) = \frac{\sum_{b=1}^{B} \mathbb{I}\{t_b(x) = i\}}{\sum_{j \in S \cup \{0\}} \sum_{b=1}^{B} \mathbb{I}\{t_b(x) = j\}}, \qquad i \in S \cup \{0\},$$
where $S$ is the assortment associated with $x$.
(3) When returning the class label of a leaf node in a decision tree, we use a randomly chosen observation instead of taking a majority vote (Step 11 in Algorithm 1). While not a typical choice, it seems crucial in deriving our consistency result (Theorem 4). Intuitively, unlike other classification problems in which the predictor has a continuous support, in our problem the predictors overlap whenever an assortment is offered to multiple consumers in the data. A majority vote would favor the product that most consumers choose and ignore less attractive products. To correctly recover the choice probability from the data, we randomly choose an observation in the leaf (equivalently, randomly pick a customer in the data who has been offered the same assortment), which is at least an unbiased estimator for the choice probability.
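Observations (2) and (3) can be illustrated with a small simulation. The sketch below is ours and makes simplifying assumptions: every tree is taken to isolate the same leaf, the leaf contents and the number of trees are hypothetical, and the helper names `condition_on_offered` and `leaf_label` do not come from the paper.

```python
import random
from collections import Counter


def condition_on_offered(tree_labels, x):
    """Observation (2): keep only the trees whose label is feasible for x
    (no purchase, or a product that is offered) and renormalize."""
    feasible = [c for c in tree_labels if c == 0 or x[c - 1] == 1]
    if not feasible:
        return {}
    counts = Counter(feasible)
    return {c: counts[c] / len(feasible) for c in sorted(counts)}


def leaf_label(labels, rng, majority=False):
    """Observation (3): label a leaf by a random observation or by majority vote."""
    if majority:
        return Counter(labels).most_common(1)[0][0]
    return rng.choice(labels)


rng = random.Random(0)
x = (1, 0, 1)                           # products 1 and 3 are offered
leaf = [1] * 70 + [3] * 20 + [0] * 10   # hypothetical choices seen for this assortment
B = 5000                                # hypothetical number of trees

random_votes = [leaf_label(leaf, rng) for _ in range(B)]
majority_votes = [leaf_label(leaf, rng, majority=True) for _ in range(B)]

print(condition_on_offered(random_votes, x))    # close to the empirical frequencies
print(condition_on_offered(majority_votes, x))  # all mass on the most popular product
print(condition_on_offered([1, 1, 3, 2, 0, 1, 3, 2], x))  # votes for product 2 are dropped
```

With random leaf labels the forest average tracks the empirical choice frequencies, whereas majority voting collapses onto the single most popular product.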
4 Why Do Random Forests Work Well?
Many machine learning algorithms perform superbly in practice, while very little theory explains why this is the case. For example, for random forests, even consistency, one of the most fundamental properties a statistician would demand of any classical estimator, was only established recently for regression problems under restrictive assumptions (scornet2015consistency). The lack of theoretical understanding can worry practitioners when stakes are high and failure may have harmful consequences. In this section, we attempt to answer the “why” question for our setting from two angles. We show that random forests are consistent for any DCM, and that the way random forests split (the Gini index) naturally helps to recover the choice model when it can be represented by a tree.
4.1 Random Forests are Consistent for Any Choice Model
We now show that with enough data, random forests can recover the choice probability of any DCM. To obtain our theoretical results, we impose mild assumptions on how the data is generated.
Assumption 1.
There is an underlying ground truth DCM from which all consumers independently make selections from the offered assortments, generating data $(i_t, x_t)$, $t = 1, \dots, T$.
Note that the assumption only requires consumers to make choices independently. On the other hand, we focus on a fixed-design experiment, and the sequence of assortments offered can be arbitrary. This is different from most consistency results for random forests, in which a random design is used (see biau2016random for references), i.e., the predictors $x_t$ are i.i.d. In our setting, the assortment is unlikely to be generated randomly; rather, it is chosen by the firm, either to maximize revenue or to explore customer preferences by A/B testing. Therefore, a fixed design probably reflects reality better than a random design.
Since the consistency result requires the sample size $T \to \infty$, we use the subscript $T$ to emphasize the fact that the parameters may be chosen based on $T$. For a given assortment $x$, let $k_T \triangleq \sum_{t=1}^{T} \mathbb{I}\{x_t = x\}$ be the number of consumers who see assortment $x$. We are now ready to establish the consistency of random forests.
Theorem 4.
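Suppose Assumption 1 holds. Then for any $x$ and $i$, if $\liminf_{T \to \infty} k_T / T > 0$, $l_T$ is fixed, $z_T \to \infty$, $B_T \to \infty$, then the random forest is consistent:
$$\lim_{T \to \infty} \mathbb{P}\left( \left| \frac{1}{B_T} \sum_{b=1}^{B_T} \mathbb{I}\{t_b(x) = i\} - p(i, x) \right| > \epsilon \right) = 0$$
for all $\epsilon > 0$.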
According to Theorem 4, the random forest can accurately predict the choice probability of any DCM, provided that the firm offers the assortment many times. Practically, the result can guide the choice of parameters. In fact, we just need to generate many trees in the forest ($B_T \to \infty$), resample many observations in a decision tree ($z_T \to \infty$), and keep the terminal leaf small ($l_T$ is fixed). The requirement is easily met by the choice of parameters in the remarks following Algorithm 1, i.e., , and . Theorem 4 guarantees a good performance of the random forest when the seller has collected a large dataset. This is a typical case in online retailing, especially in the era of “big data”.
Random forests thus provide a novel data-driven approach to model customer choices. In particular, the model is first trained from data, and then used to interpret the inherent thought process of consumers when they make purchases. By Theorem 4, when the historical data has a large sample size, the model can accurately predict how consumers make decisions in reality. This reflects the universality of the model. In this section, we provide concrete examples demonstrating several practical considerations that can hardly be captured by other DCMs and are handled well by random forests.
4.2 Gini Index Recovers the Rank List
In Section 2, we have shown that any DCM can be represented by a combination of binary decision trees. Moreover, through numerous experiments, we have found that random forests perform particularly well when the data is generated by DCMs that can be represented by a few binary decision trees. In this section, we further explore this connection by studying a concrete setting where the DCM is represented by a single regular decision tree. Without loss of generality, we assume that customers always prefer product $i$ to $i + 1$, for $i = 1, \dots, N - 1$, and product $N$ to the no-purchase option. Equivalently, the DCM is a single rank list (the preferences of all customers form an ordered set). The following finite-sample result demonstrates that the rank list can be recovered from the random forest with high probability.
Theorem 5.
Suppose the actual DCM is a rank list and the assortments in the training data are sampled uniformly. The random forest algorithm with subsample size (without replacement), , terminal leaf size and accurately predicts the choices of at least a fraction of all assortments with probability at least
where .
Since the bound scales exponentially in , the predictive accuracy increases tremendously with size.
The proof of the theorem reveals an intrinsic connection between the Gini index and the recovery of the rank list. Consider the first split of the random forest in Step 8. We can show that, in expectation, if the first split is on product , then the resulting Gini index is
In other words, if the data is generated without randomness (centered at the mean), then the first split would occur on product 1 because of the ordering of the Gini index when is large. Therefore, for the data points falling into the right branch of the first split (having product 1 in the assortment), no more splits are needed as all customers would choose product 1 according to the rank list. Such a split correctly identifies roughly half of the assortments. In the proof, we control the randomness by concentration inequalities, and conduct similar computations for the second, third splits and so on. The proof reveals the following insight into why random forests may work well in practice: The Gini index criterion tends to find the products that are ranked high in the rank lists, because they create “purer” splits that lower Gini index. As a result, the topological structure of the decision trees trained in the random forest is likely to resemble that of the binary choice trees underlying the DCM generating the data.
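The ordering of the Gini index across candidate first splits can be checked numerically. The sketch below is ours, not the computation from the proof: it replaces uniformly sampled data by a noise-free stand-in in which each of the $2^N$ assortments appears exactly once, and the value of $N$ is arbitrary.

```python
from collections import Counter
from itertools import product


def rank_list_choice(x):
    """Choice under the rank list 1 > 2 > ... > N > no purchase (0)."""
    for i, offered in enumerate(x, start=1):
        if offered:
            return i
    return 0


def gini_after_split(N, i):
    """Gini index after a first split on product i, every assortment seen once."""
    branches = {0: [], 1: []}
    for x in product([0, 1], repeat=N):
        branches[x[i - 1]].append(rank_list_choice(x))
    total = 2 ** N
    value = 0.0
    for labels in branches.values():
        freqs = [count / len(labels) for count in Counter(labels).values()]
        value += len(labels) / total * sum(p * (1 - p) for p in freqs)
    return value


N = 5  # illustrative number of products
scores = [gini_after_split(N, i) for i in range(1, N + 1)]
for i, score in enumerate(scores, start=1):
    print(f"split on product {i}: Gini index {score:.4f}")
```

In this idealized setting the Gini index increases with the rank of the product used for the split, so the top-ranked product is selected first, in line with the insight stated above.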
5 Flexibility and Benefits of Random Forests
In this section we demonstrate the flexibility and benefit of using random forests to estimate the choice forest.
5.1 Behavioral Issues
Because of Theorem 3 and Theorem 4, random forests can be used to estimate any DCM. For example, there is empirical evidence showing that behavioral considerations of consumers may distort their choice and thus violate regularity, e.g., the decoy effect (ariely2008predictably) and the comparison-based DCM (huber1982adding; russo1983strategies). It is already documented in chen2019decision that the decision forest can capture the decoy effect. In this section, we use the choice forest to model consumer search.
How consumers search to obtain new information when making purchases is an important behavioral issue that is not monitored, or “unsupervised” in statistical terms, and hard to estimate by most models (for a few exceptions, see e.g. wang2017impact). Therefore, most DCMs abstract away those thought processes and only capture the aggregate effect. weitzman1979optimal proposes a sequential search model with search costs. Prior to initiating the search, consumers know only the distribution, say $F_i$, of the net utility $V_i$ of product $i$ and the cost $c_i$ to learn the realization of $V_i$. Let $z_i$ be the root of the equation $\mathbb{E}[(V_i - z)^+] = c_i$ and sort the products in descending order of $z_i$. Weitzman shows that it is optimal to walk away without making any observations if the realized value of the no-purchase alternative, say $V_0$, exceeds $z_1$. Otherwise $c_1$ is paid, $V_1$ is observed, and $\max\{V_0, V_1\}$ is computed. The process stops if $\max\{V_0, V_1\}$ exceeds $z_2$ and continues otherwise, stopping the first time, if ever, that $\max\{V_0, V_1, \dots, V_i\} > z_{i+1}$.
We next show that this search process can be represented by decision trees. Consider three products ($N = 3$). Suppose that the products are sorted so that their reservation prices satisfy $z_1 > z_2 > z_3$, and that the valuations of an arriving customer satisfy . Hence the customer always searches in the order product one, product two, product three. If in addition we suppose , then the decision tree can be illustrated in Figure 2. For example, suppose products one and three are offered. The customer first searches product one, because the reservation price of product one is the highest. The realized valuation of product one is, however, not satisfactory (). Hence the customer keeps on searching the product with the second highest reservation price in the assortment, which is product three (product two is skipped because it is not in the assortment). However, the search process results in an even lower valuation of product three . As a result, the customer recalls and chooses product one. Clearly, a customer with different realized valuations would conduct a different search process, leading to a different decision tree.
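The mapping from a search process to a decision tree can be made concrete with a short simulation. The sketch below is a simplified reading of the stopping rule described above for a single customer with fixed realized valuations; the function name and all numerical values (reservation prices, valuations, outside option) are hypothetical and are not the ones behind Figure 2.

```python
from itertools import product


def sequential_search(offered, reservation, utility, outside):
    """Return the product chosen by one customer (0 means no purchase)."""
    best_value, best_item = outside, 0
    # Inspect offered products in descending order of reservation price.
    for i in sorted(offered, key=lambda j: reservation[j], reverse=True):
        if best_value > reservation[i]:
            break  # the option in hand beats anything still worth inspecting
        if utility[i] > best_value:
            best_value, best_item = utility[i], i
    return best_item


# Hypothetical customer: reservation prices z_1 > z_2 > z_3 and realized valuations.
reservation = {1: 10.0, 2: 8.0, 3: 6.0}
utility = {1: 5.0, 2: 9.0, 3: 3.0}
outside = 2.0

# Enumerating all assortments yields this customer's binary choice tree.
choice_tree = {}
for x in product([0, 1], repeat=3):
    offered = [i for i in (1, 2, 3) if x[i - 1] == 1]
    choice_tree[x] = sequential_search(offered, reservation, utility, outside)
    print(x, "->", choice_tree[x])
```

With these numbers, when products one and three are offered the customer inspects both, finds product three worse, and recalls product one, mirroring the narrative above. Different realized valuations would produce a different table, that is, a different tree.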
5.2 Aggregated Choice Data
One of the most pressing practical challenges in data analytics is the quality of data. In Section 2, the historical data is probably the most structured and granular form of data one can hope to acquire. While most academic papers studying the estimation of DCMs assume this level of granularity, in practice it is frequent to see data in a more aggregate format. As an example, consider an airline offering three service classes E, T and Q of a flight where data is aggregated over a time window during which there may be changes to the assortment, and compiled from different sales channels. The company records information at certain time clicks as in Table 1.
Class  Closure percentage  # Booking 
E  20%  2 
T  0%  5 
Q  90%  1 
For each class, the closure percentage reflects the fraction of time that the class is not open for booking, i.e., not included in the assortment. Thus, 100% would imply that the corresponding class is not offered during the time window. In a retail setting, this helps to deal with products that sell out between review periods. The number of bookings for each class is also recorded. There may be various reasons behind the aggregation of data. The managers may not realize the value of high-quality data or may be unwilling to invest in the infrastructure and human resources needed to reform the data collection process. One of the authors has encountered this situation in practice with aggregate datasets as in Table 1.
Fortunately, random forests can deal with aggregated choice data naturally, a task that may be quite difficult for the column generation approach. Suppose the presented aggregated data has the form , where denotes the closure percentage of the products in day , denotes the number of bookings, and the data spans time windows. (Footnote 1: Again, we do not deal with demand censoring in this paper and assume that has an additional dimension to record the number of consumers who do not book any class.) We transform the data into the desired form as follows: for each time window , we create observations, . The predictor and let of be valued , for .
To explain the intuition behind the data transformation, notice that we cannot tell from the data which assortment a customer faced when she made the booking. We simply take an average assortment that the customer may have faced, represented by . In other words, if is large, then it implies that product is offered most of the time during the day, and the transformation leads to the interpretation that consumers see a larger “fraction” of product . As the closure percentage has a continuous impact on the eventual choice, it is reasonable to transform the predictors into a Euclidean space , and build a smooth transition between the two ends (the product is always offered) and (the product is never offered).
The transformation creates a training dataset for classification with continuous predictors. The random forest can accommodate the data with minimal adaptation. In particular, all the steps in Algorithm 1 can be performed. The tree may have different structures: because the predictor may not be at the corner of the unit hypercube any more, the split points may no longer be at 0.5.
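As a minimal sketch of the transformation described above, the following Python snippet expands one aggregated time window into classification observations. The function name `expand_aggregated_window` is hypothetical, and the predictor is assumed to be one minus the closure percentage (the fraction of time a product is offered), which is how we read the description in the text. No-purchase counts are omitted for brevity.

```python
import numpy as np

def expand_aggregated_window(closure, bookings):
    """Turn one aggregated time window into classification observations.

    closure  : closure percentage per product, as fractions in [0, 1]
    bookings : number of bookings per product (same order as closure)
    Returns (X, y): every row of X is the same predictor vector, and
    product k (1-based) contributes bookings[k-1] rows labeled k.
    """
    closure = np.asarray(closure, dtype=float)
    bookings = np.asarray(bookings, dtype=int)
    x = 1.0 - closure  # assumption: predictor = fraction of time offered
    y = np.repeat(np.arange(1, len(bookings) + 1), bookings)
    X = np.tile(x, (len(y), 1))
    return X, y

# Table 1: classes E, T, Q with closure 20%, 0%, 90% and 2, 5, 1 bookings
X, y = expand_aggregated_window([0.20, 0.00, 0.90], [2, 5, 1])
print(X.shape, y.tolist())
```

The resulting rows can be stacked across time windows and passed to a standard random forest classifier, since the predictors are simply continuous values in the unit hypercube.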
5.3 Product Importance
Random forests can be used to assign scores to each product and rank the importance of products. A common score, mean decrease impurity (MDI), is based on the total decrease in node impurity from splitting on the variable (product), averaged over all trees (biau2016random). The score for product is defined as
In other words, if consumers frequently make decisions based on the presence of product (many splits occur on product ), or their decisions are more consistent after observing the presence of product (the Gini index is reduced significantly after splitting on ), then the product receives a higher MDI score and is regarded as important.
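To illustrate how such a score can be obtained in practice, the sketch below uses scikit-learn, whose `feature_importances_` attribute of `RandomForestClassifier` reports the impurity-based (MDI) importance, normalized to sum to one. The synthetic choice rule and all sizes are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_products, n_obs = 4, 2000
X = rng.integers(0, 2, size=(n_obs, n_products))  # 1 = product offered

# Hypothetical ground truth: buy product 1 whenever it is offered,
# otherwise product 2 if offered, otherwise leave without purchase (0).
y = np.where(X[:, 0] == 1, 1, np.where(X[:, 1] == 1, 2, 0))

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

mdi = forest.feature_importances_   # normalized mean decrease in impurity
ranking = np.argsort(-mdi) + 1      # products sorted by importance (1-based)
print(dict(zip(range(1, n_products + 1), np.round(mdi, 3))), ranking)
```

In this toy example, products 3 and 4 never influence the choice, so they receive scores close to zero and are ranked last.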
The identification of important products provides simple yet powerful insights into the behavioral patterns of consumers. Consider the following use cases: (1) An online retailer wants to promote its “flagship” products that significantly increase the conversion rate. By computing the MDI from the historical data, important products can be identified without extensive A/B testing. (2) Due to limited capacity, a firm plans to reduce the available types of products in order to cut costs. It could simply remove the products that have low sales according to the historical data. However, some products, while not looking attractive themselves, serve as decoys or references and boost the demand for other products. Removing these products would distort the choice behavior of consumers and may lead to unfavorable consequences. The importance score provides a better criterion: if a product is ranked low based on MDI, then it does not strongly influence the decision making of consumers, and it is therefore safer to leave it out. (3) When designing a new product, a firm attempts to decode the impact of various product features on customer choices. Which product feature draws the most attention? What do attractive products have in common? To conduct successful product engineering, the firm first needs to use the historical data to identify a set of attractive products. Moreover, to quantify and separate out the contribution of various features, a numerical score of product importance is necessary. The importance score is a more reasonable criterion than sales volume, because the latter cannot capture the synergy created between the products.
5.4 Incorporating Price Information
Besides the ease of estimation, the other benefit of a parametric DCM, such as the MNL or nested logit model, is the ability to account for covariates. For example, in the MNL model, the firm can estimate the price sensitivity of each product, and extrapolate/predict the choice probability when the product is charged a new price that has never been observed in the historical data. Many nonparametric DCMs cannot easily be extended to new prices. In this section, we show that while enjoying the benefit of a nonparametric formulation, random forests can also accommodate the price information.
Consider the data of the following format: , where represents the prices of all products. For product that is not included in the assortment offered to customer , we set its price to infinity. This is because when a product is priced at infinity, no customer would be willing to purchase it, and it is equivalent to the scenario in which the product is not offered at all. Such a view of equivalence is commonly adopted in the literature. (Footnote 2: One may argue that an assortment with a product having an artificially high price is not equivalent to the one without such a product, as the product may induce reference effects. We do not consider such behaviors here.) Therefore, compared to the binary vector that only records whether a product is offered, the price vector provides more information.
However, the predictor cannot be readily used in random forests. The predictor space is unbounded, and the infinite value added to the extended real number line is not implementable in practice. To apply Algorithm 1, we introduce link functions that map the predictors into a compact set.
Definition 1.
A function $g: [0, \infty) \to (0, 1]$ is referred to as a link function if (1) $g$ is strictly decreasing, (2) $g(0) = 1$, and (3) $\lim_{x \to \infty} g(x) = 0$.
The link function can be used to transform a price into . Moreover, because of property (3), we can naturally define . Thus, if product is not included in assortment , then . If product is offered at a very low price, then . After the transformation of predictors, we introduce a continuous scale to the problem in Section 2. (Footnote 3: When is applied to a vector , it is interpreted as applied to each component of the vector.) Instead of a binary status (included or not), each product now has a spectrum of presence, depending on the price of the product. Now we can directly apply Algorithm 1 to the training data . As a result, we need to modify Step 7, because the algorithm needs to find not only the optimal dimension to split, but also the optimal split location. The slightly modified random forests are demonstrated in Algorithm 2.
Because of the nature of decision trees, the impact of prices on the choice behavior is piecewise constant. For example, Figure 3 illustrates a possible decision tree with .
It is not surprising that there are numerous link functions to choose from. We give two examples below:
In fact, the survival function of any nonnegative random variable with a positive PDF is a candidate for the link function. This extra degree of freedom may concern some academics and practitioners: How sensitive is the estimated DCM to the choice of link functions? What criteria may be used to pick a “good” link function? Our next result guarantees that the choice of link functions does not affect the estimated DCM. For any two link functions and , we can run Algorithm 2 for training data and . We use to denote the returned th tree of the algorithm for link function , .
Proposition 1.
It is worth pointing out that although the random forests using two link functions output identical class labels for in the training data, they may differ for when predicting a new price vector . This is because the splitting operation that minimizes the Gini index in Step 8 is not unique. Any split between two consecutive observations results in an identical class composition in the new leaves and thus the same Gini index. (Footnote 4: If the algorithm splits on dimension , then and are consecutive if there does not exist in the same leaf node such that .) Usually the algorithm splits at the midpoint between two consecutive observations, which may differ for different link functions if they are not locally linear. Nevertheless, these cases are rare and Algorithm 2 is not sensitive to the choice of link functions.
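The following sketch illustrates the discussion above with two candidate link functions. Both `link_exp` and `link_arctan` are illustrative choices that are strictly decreasing, equal one at price zero, and vanish as the price grows; they are assumptions and not necessarily the examples used in the paper. The snippet checks that both functions order the prices identically, and that the midpoint split, mapped back to the price scale, differs between them.

```python
import numpy as np

def link_exp(p):
    """exp(-p): equals 1 at p = 0 and tends to 0 as p grows."""
    return np.exp(-np.asarray(p, dtype=float))

def link_arctan(p):
    """1 - arctan(p) / (pi/2): also strictly decreasing from 1 toward 0."""
    return 1.0 - np.arctan(np.asarray(p, dtype=float)) / (np.pi / 2)

# Prices of one product across four observations; inf = product not offered
prices = np.array([0.0, 0.5, 3.0, np.inf])
x_exp, x_arctan = link_exp(prices), link_arctan(prices)

# Both links are strictly decreasing, so they sort the observations identically
same_order = np.array_equal(np.argsort(x_exp), np.argsort(x_arctan))

# Midpoint split between the observations priced 0.5 and 3.0, mapped back
# to the price scale by inverting each link function
mid_exp = -np.log((x_exp[1] + x_exp[2]) / 2)
mid_arctan = np.tan((np.pi / 2) * (1 - (x_arctan[1] + x_arctan[2]) / 2))
print(same_order, mid_exp, mid_arctan)
```

Because the ordering is shared, any split separates the training observations in the same way under either link; only a new price falling between the two implied thresholds could be routed differently.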
The theoretical guarantee in the pricing setting, however, is far more involved than in Section 2. The state-of-the-art theoretical guarantee of random forests is given by scornet2015consistency. The authors prove that random forests are consistent for the regression problem, under some mild assumptions. Their setup is the closest to the original algorithm proposed in breiman2001random, while other papers have proved the consistency for random forests with simplified or special implementations. Our setup differs from scornet2015consistency in that we are focusing on a classification problem. We can recast it into a regression problem by analyzing the class probability of a particular class. However, instead of the Gini index, the sum of squared errors is typically used in regression problems, and the analysis has to be modified substantially. We thus leave the theoretical guarantee for future research.
5.5 Incorporating Customer Features
A growing trend in online retailing and e-commerce is personalization. Due to the increasing access to personal information and computation power, the retailer is able to devise specific policies, including pricing or recommendation, for different customers based on their observed features. Personalization has turned out to be hugely successful. Imagine an arriving customer being labeled as a college student. For a fashion retailer, it is a strong indicator that the customer may be interested in affordable brands. Leveraging personal information can greatly increase the revenue of the firm.
To offer personalized assortments, the very first step is to incorporate the feature information into the choice model. One possible model, the mixed logit model, assumes that the customers are categorized into discrete types, and customers of the same type behave homogeneously according to an independent MNL model. One of its main drawbacks is that it is not straightforward to connect continuous customer features to discrete types, and unsupervised learning algorithms may be needed. Recently, bernstein2018dynamic propose a dynamic Bayesian algorithm to address the issue. Another typical approach builds on the MNL model and replaces the deterministic utility of a product by a linear function of the customer features. For example, see cheung2017thompson and references therein. In such personalized MNL models, the criticisms of the MNL model (such as IIA) persist.
In this section, we demonstrate that it is natural for random forests to capture customer features and return a binary choice forest that is aware of such information. Suppose the collected data of the firm have the form for customer , where in addition to , the choice made and the offered set, the normalized customer feature is also recorded. The procedure in Section 3 can be extended naturally. In particular, we may append to , so that the predictor . Algorithm 1 can be modified accordingly.
The resulting binary choice forest consists of binary choice trees. The splits of the binary choice tree now encode not only whether a product is offered, but also predictive feature information of the customer. For example, a possible binary choice tree illustrated in Figure 4 may result from the algorithm.
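A minimal sketch of this extension, assuming scikit-learn and a single hypothetical binary feature (`student`), is given below. The synthetic behavior is invented purely for illustration; the point is only that the customer feature is appended to the assortment indicator and the forest is trained as before.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n_obs, n_products = 3000, 3
offered = rng.integers(0, 2, size=(n_obs, n_products))  # assortment indicator
student = rng.integers(0, 2, size=(n_obs, 1))            # hypothetical feature

# Hypothetical behavior: students buy product 1 if it is offered,
# non-students buy product 3 if it is offered, everyone else leaves (0).
y = np.where((student[:, 0] == 1) & (offered[:, 0] == 1), 1,
             np.where((student[:, 0] == 0) & (offered[:, 2] == 1), 3, 0))

# Predictor = assortment indicator followed by the customer feature
X = np.hstack([offered, student])
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

full_assortment = [1, 1, 1]
pred_student = forest.predict([full_assortment + [1]])[0]
pred_other = forest.predict([full_assortment + [0]])[0]
print(pred_student, pred_other)
```

The same assortment leads to different predicted choices depending on the feature value, which is the interaction a split on the feature column encodes.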
Compared with the current personalized choice models, the framework introduced in this paper has the following benefits:

The estimation is straightforward (same as the algorithm without customer features) and can be implemented efficiently.

The nonparametric nature of the model makes it possible to capture complex interactions between products and customer features, and among customer features. For example, “offering a high-end handbag” may become a strong predictor when the combination of features “female” and “age” is activated. In a binary choice tree, the effect is captured by three splits (one for the product and two for the customer features) along a branch. Such an effect is almost impossible to capture in a parametric (linear) model.

The framework can be combined with aforementioned adjustments, such as pricing and product importance. For example, the measure MDI introduced in Section 5.3 can be used to identify predictive customer features.
6 Numerical Experiments
In this section, we conduct a comprehensive numerical study based on both synthetic and real datasets. We find that (1) random forests are quite robust and the performance does not vary much for underlying DCMs with different levels of complexity. In particular, random forests only underperform the correctly specified parametric models by a small margin and do not overfit; (2) the standard error of random forests is small compared to other estimation procedures; (3) random forests benefit tremendously from increasing sample size compared to other DCMs; (4) the computation time of random forests barely scales with the size of the training data; (5) random forests perform well even if the training set only includes of all available assortments; (6) random forests handle training data with nonstandard formats reasonably well, such as aggregated data and price information (see Sections 5.2 and 5.4 for more details), which cannot be handled easily by other frameworks.
We compare the estimation results of random forests with the MNL model (train2009discrete) and the Markov chain model (blanchet2016markov) for both synthetic and real data sets. (Footnote 5: The MNL model is estimated using MLE. The Markov chain model is estimated using the EM algorithm, the same as the implementation in csimcsek2018expectation. The random forest is estimated using the Python package “scikit-learn”. The implementation is slightly different in that scikit-learn outputs the empirical class probability rather than a random sample in Step 11. The difference is negligible when is large.) We choose the MNL and the Markov chain models as benchmarks because the MNL model is one of the most widely used DCMs and the Markov chain model can flexibly approximate RUM () and has been shown (berbeglia2018comparative) to have outstanding empirical performance compared to MNL, the nested logit, the mixed logit, and rank-based DCMs. Note that the actual DCM generating the training data is not necessarily one of the three models mentioned above.
When conducting numerical experiments, we set the hyperparameters of the random forest as follows: , , , . Choosing the parameters optimally using cross validation would further improve the performance of random forest.
6.1 The Random Utility Model
We first investigate the performance of random forests when the training data is generated by RUM. The RUM includes a large class of DCMs. Consider products. We generate the training set using the MNL model as the ground truth, where the expected utility of each product is generated from a standard normal distribution. Our training data consists of periods. Each period contains a single assortment and 10 transactions, so the total number of data points is . This follows the setup of berbeglia2018comparative. In each period, we draw the assortment uniformly at random among all assortments.
The performance is evaluated by root mean squared error, which is also used in berbeglia2018comparative:
(2) 
where denotes the actual choice probability and denotes the estimated choice probability. The RMSE tests all the assortments and there is no need to generate a test set. For each setting, we generate 100 independent training data sets and compute the average and standard deviation of the RMSEs. The result is shown in Table 2.
RF  MNL  Markov  

0.084 (0.014)  0.030 (0.007)  0.062 (0.009)  
0.061 (0.006)  0.019 (0.005)  0.042 (0.005)  
0.048 (0.005)  0.014 (0.003)  0.031 (0.004)  
0.041 (0.004)  0.009 (0.002)  0.023 (0.003)  
0.037 (0.002)  0.006 (0.002)  0.017 (0.002) 
Not surprisingly, the MNL model performs the best among the three because it has very few parameters and correctly specifies the ground truth. With such a simple DCM, the random forest does not overfit and only slightly underperforms the Markov chain model. As the data size increases, the RMSE of the random forest converges to zero.
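The experiment in this subsection can be sketched in a few lines of Python with scikit-learn. The sizes below (`n`, `periods`), the number of trees, and the normalization inside `rmse_all_assortments` are assumptions made for illustration; in particular, the RMSE here averages squared errors over all offered products plus the no-purchase option across all assortments, which is one natural reading of an RMSE that "tests all the assortments".

```python
import itertools
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def mnl_probs(utils, x):
    """MNL choice probabilities for assortment indicator x; index 0 = no purchase."""
    w = np.concatenate([[1.0], np.exp(utils) * x])
    return w / w.sum()

def rmse_all_assortments(p_true, p_hat, n):
    """RMSE over every (assortment, alternative) pair.

    Assumed normalization: squared errors are summed over the offered
    products and the no-purchase option, then divided by the pair count.
    """
    sq_err, count = 0.0, 0
    for bits in itertools.product([0, 1], repeat=n):
        x = np.array(bits)
        mask = np.concatenate([[True], x == 1])
        diff = p_true(x)[mask] - p_hat(x)[mask]
        sq_err += float(diff @ diff)
        count += int(mask.sum())
    return np.sqrt(sq_err / count)

rng = np.random.default_rng(0)
n, periods, per_period = 5, 300, 10      # illustrative sizes
utils = rng.standard_normal(n)           # expected utilities of the products

# One uniformly random assortment per period, 10 transactions each
A = rng.integers(0, 2, size=(periods, n))
X = np.repeat(A, per_period, axis=0)
y = np.array([rng.choice(n + 1, p=mnl_probs(utils, x)) for x in X])

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def rf_probs(x):
    """Empirical class probabilities of the forest, aligned to labels 0..n."""
    p = np.zeros(n + 1)
    p[forest.classes_] = forest.predict_proba(x.reshape(1, -1))[0]
    return p

err = rmse_all_assortments(lambda x: mnl_probs(utils, x), rf_probs, n)
print(round(err, 3))
```

Here `predict_proba` returns the empirical class probability averaged over trees, which matches the scikit-learn behavior noted in the footnote on implementation.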
Next we use the rank-based model to generate the training data, which is shown to be equivalent to RUM (block1959random). Consider products. Consumers are divided into or different types, each with a random preference permutation of all the products and the no-purchase alternative. For a given assortment of products, each type of consumer will purchase the product ranked the highest in her preference list. If the no-purchase option is ranked higher than all the products in the assortment, then the customer does not purchase anything. We also randomly generate the fractions of customer types as follows: draw uniform random variables between zero and one for , and then set to be the proportion of type , . The result is shown in Table 3.
RF  MNL  Markov  

0.115 (0.031)  0.121 (0.034)  0.078 (0.032)  
0.090 (0.021)  0.118 (0.025)  0.058 (0.024)  
0.069 (0.016)  0.114 (0.029)  0.047 (0.020)  
0.056 (0.009)  0.118 (0.018)  0.044 (0.017)  
0.045 (0.006)  0.116 (0.021)  0.040 (0.017)  
0.034 (0.004)  0.115 (0.020)  0.037 (0.017)  
RF  MNL  Markov  
0.104 (0.013)  0.097 (0.016)  0.077 (0.016)  
0.079 (0.009)  0.093 (0.012)  0.057 (0.009)  
0.065 (0.008)  0.091 (0.014)  0.048 (0.009)  
0.053 (0.005)  0.088 (0.013)  0.042 (0.008)  
0.046 (0.004)  0.088 (0.013)  0.040 (0.008)  
0.038 (0.003)  0.087 (0.014)  0.037 (0.009) 
We can see that the MNL model underperforms and does not improve significantly as the data size increases, because of the misspecification error. The Markov chain model performs the best among the three. The performance of the random forest is quite robust, judging from the low standard deviation. Moreover, the performance improves dramatically as increases; for , the RMSE is smaller than that of the Markov chain model, which is shown in berbeglia2018comparative to outperform other DCM estimators. As predicted by Theorem 4, the RMSE tends to zero when the training set is large.
We run our algorithm on an iMac with a 2.7GHz quad-core Intel Core i5 and 8GB RAM installed. The running time is shown in Table 4. In terms of computation time, both the MNL model and the random forest can be implemented efficiently, while the EM algorithm used to estimate the Markov chain model takes much longer. When , the random forest spends 1/17 of the computation time of the Markov chain model. Note that the running time of the random forest barely increases for larger training sets. This makes it useful when dealing with big data.
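For reference, the rank-based ground truth used in this experiment can be sketched as follows. The function name `rank_based_probs` and the sizes are hypothetical, and the type proportions are assumed to be the uniform draws normalized to sum to one.

```python
import numpy as np

def rank_based_probs(rankings, weights, x):
    """Choice probabilities under a rank-based model.

    rankings : (k, n + 1) array; each row is a permutation of 0..n listing
               alternatives from most to least preferred (0 = no purchase)
    weights  : (k,) proportions of the customer types, summing to one
    x        : (n,) 0/1 indicator of the offered products
    """
    n = len(x)
    available = np.concatenate([[True], np.asarray(x) == 1])
    probs = np.zeros(n + 1)
    for ranking, w in zip(rankings, weights):
        # Each type picks its top-ranked alternative that is available;
        # no purchase (0) is always available, so a choice always exists.
        choice = next(j for j in ranking if available[j])
        probs[choice] += w
    return probs

rng = np.random.default_rng(0)
n, k = 5, 4                                   # illustrative sizes
rankings = np.array([rng.permutation(n + 1) for _ in range(k)])
u = rng.uniform(size=k)
weights = u / u.sum()                         # assumed normalization
probs = rank_based_probs(rankings, weights, np.array([1, 0, 1, 0, 0]))
print(probs)
```

Sampling a transaction then amounts to drawing a label from `probs` for the offered assortment.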
RF  MNL  Markov  

72.3s  0.7s  25.7s  
72.5s  1.4s  36.1s  
72.3s  3.2s  113.7s  
74.0s  6.8s  203.0s  
74.5s  17.4s  445.2s  
81.8s  55.5s  1460.6s 
6.2 Generalizability to Unseen Assortments
One of the major challenges in the estimation of the DCM, compared to other statistical estimation problems, is the limited coverage of the training data, which strongly violates the i.i.d. assumption. In particular, the seller tends to offer a few assortments that they believe are profitable. As a result, in the training data only makes up a small fraction of the total available assortments. Any estimation procedure needs to address the following issue: can the DCM estimated from a few assortments generalize to the assortments that have never been offered in the training data?
Next we show that random forests perform this task well: theoretically, random forests adaptively choose nearest neighbors, and the choice probability of an assortment can be generalized to “neighboring” assortments (those with one more or one less product), as long as the underlying DCM possesses a certain degree of continuity in terms of the offered set . Consider products and . We randomly choose assortments to offer in the training set and thus there are transactions for each assortment. “Large” assortments refer to those with many products (). The result is shown in Table 5.
Rankbased  Rankbased  MNL  

0.193 (0.064)  0.156 (0.034)  0.133 (0.041)  
0.158 (0.034)  0.128 (0.026)  0.111 (0.035)  
(large)  0.181 (0.056)  0.124 (0.028)  0.038 (0.017) 
(large)  0.150 (0.047)  0.109 (0.027)  0.034 (0.014) 
0.087 (0.025)  0.073 (0.014)  0.054 (0.008)  
0.068 (0.014)  0.060 (0.007)  0.042 (0.004)  
0.045 (0.006)  0.046 (0.004)  0.037 (0.002) 
Note that there are possible available assortments. Therefore, for example, implies that only of the total assortments have been offered in the training data. The RMSE is only two to three times larger than the case where most assortments have been offered . Moreover, a larger assortment helps the estimation of the DCM. When the actual DCM is the MNL model, training random forests with 10 large assortments performs better than training with randomly chosen assortments.
We also remark that the generalizability of random forests depends not only on the estimator, but also on the actual DCM. Some DCMs are more amenable to generalization to unseen assortments. Formalizing this statement and theoretically quantifying the generalizability of a DCM to unseen data in the framework of random forests remains an exciting direction for future research.
6.3 Behavioral Choice Models
When the DCM is outside the scope of RUM and the regularity is violated, the Markov chain and MNL models may fail to specify the choice behavior correctly. In this section, we generate choice data using the comparison-based DCM (huber1982adding), described below. Consumers implicitly score various attributes of the products in the assortment. Then they undergo an internal round-robin tournament of all the products. When comparing two products from the assortment, the customer checks their attributes and counts the number of preferable attributes of both products. Eventually, the customer counts the total number of wins (preferable attributes) in the pairwise comparisons and chooses the product with the most wins. Here we assume that customers choose with equal probability if there is a tie.
In the experiment, we consider products. Consumers are divided into different types, whose proportions are randomly generated between 0 and 1. Each type assigns uniform random variables between 0 and 1 to the five attributes of all the products (including the no-purchase option). Again we use the RMSE in (2) to compare the predictive accuracy. As in the previous experiment, each setting is simulated 100 times. The result is shown in Table 6.
RF  MNL  Markov  

0.157 (0.031)  0.160 (0.033)  0.146 (0.038)  
0.133 (0.025)  0.156 (0.030)  0.132 (0.036)  
0.112 (0.022)  0.152 (0.030)  0.123 (0.033)  
0.094 (0.021)  0.155 (0.030)  0.120 (0.037)  
0.079 (0.018)  0.152 (0.032)  0.120 (0.036) 
Because of the irregularity, both the MNL and the Markov chain DCM are outperformed by the random forest, especially when the data size increases. Note that as , the random forest is able to achieve diminishing RMSE, while the other two models do not improve because of the misspecification error. Like the previous experiment, the random forest achieves stable performances with small standard deviations.
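The comparison-based model used in this section can be sketched as below. This is one reading of the description, under the assumptions that the no-purchase option takes part in the tournament like any product and that a "win" is a strictly larger attribute score; the function name `comparison_probs` is hypothetical.

```python
import numpy as np

def comparison_probs(scores, weights, x):
    """Choice probabilities under a comparison-based model.

    scores  : (k, n + 1, a) attribute scores per customer type, alternative
              and attribute (alternative 0 = no purchase)
    weights : (k,) proportions of the customer types, summing to one
    x       : (n,) 0/1 indicator of the offered products
    """
    n = len(x)
    avail = np.flatnonzero(np.concatenate([[1], x]))  # alternatives in play
    probs = np.zeros(n + 1)
    for s, w in zip(scores, weights):
        sub = s[avail]                                # shape (m, a)
        # wins[i] = number of (opponent, attribute) pairs where i is better
        wins = (sub[:, None, :] > sub[None, :, :]).sum(axis=(1, 2))
        winners = avail[wins == wins.max()]
        probs[winners] += w / len(winners)            # ties split equally
    return probs

# One customer type, two products, two attributes
scores = np.array([[[0.1, 0.1],    # no purchase
                    [0.9, 0.2],    # product 1
                    [0.3, 0.8]]])  # product 2
weights = np.array([1.0])
only_one = comparison_probs(scores, weights, [1, 0])
both = comparison_probs(scores, weights, [1, 1])
print(only_one, both)
```

In this small example each product beats the no-purchase option on both attributes and beats the other product on one, so the two products tie when offered together.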
6.4 Aggregated Choice Data
In this section, we investigate the performance of random forests when the training data is aggregated as in Section 5.2. To generate the aggregated training data, we first generate observations using the rank-based model for products and customer types, as in Section 6.1. The only difference is that we simulate one instead of ten transactions for each offered assortment. Then, we let be aggregation levels, i.e., we aggregate data points together. For example, is equivalent to the original data. For , Table 7 illustrates five observations in the original data set for . Upon aggregation, the five transactions are replaced by five new observations with and for .
Product 1  Product 2  Product 3  Product 4  Product 5  Choices 
1  1  1  1  1  1 
0  1  0  0  1  0 
1  0  1  1  1  4 
0  0  1  0  0  3 
1  0  1  0  0  1 
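The aggregation step can be reproduced on the five transactions of Table 7 with the short sketch below. The function name `aggregate` is hypothetical; the predictor of each aggregated observation is assumed to be the average of the assortment indicators in its block, consistent with the "fraction offered" interpretation in Section 5.2.

```python
import numpy as np

# The five transactions of Table 7: offered products and the observed choice
assortments = np.array([[1, 1, 1, 1, 1],
                        [0, 1, 0, 0, 1],
                        [1, 0, 1, 1, 1],
                        [0, 0, 1, 0, 0],
                        [1, 0, 1, 0, 0]])
choices = np.array([1, 0, 4, 3, 1])

def aggregate(assortments, choices, level):
    """Replace each block of `level` consecutive transactions by observations
    that share the block's average assortment vector as predictor."""
    rows = []
    for start in range(0, len(choices), level):
        block = assortments[start:start + level]
        rows.append(np.tile(block.mean(axis=0), (len(block), 1)))
    return np.vstack(rows), choices.copy()

X_agg, y_agg = aggregate(assortments, choices, level=5)
print(X_agg[0], y_agg)
```

With an aggregation level of one, the function returns the original binary data unchanged.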
We test the performance for different sizes of the training set and different aggregate levels . The performance is measured in RMSE. We simulate 100 instances for each setting to evaluate the average and standard deviation, shown in Table 8.
0.082 (0.009)  0.109 (0.016)  0.114 (0.016)  0.119 (0.015)  0.120 (0.015)  
0.047 (0.004)  0.085 (0.010)  0.097 (0.012)  0.111 (0.013)  0.114 (0.013)  
0.039 (0.002)  0.068 (0.009)  0.082 (0.011)  0.103 (0.013)  0.108 (0.013) 
From the results, random forests handle aggregate data relatively well. Even with aggregation level , the RMSE does not seem to deteriorate significantly. Note that no other DCMs can handle aggregate data to the best of our knowledge, so no benchmark can be provided in this case.
6.5 Incorporating Pricing Information
In this section, we test the performance of random forests when the price information is incorporated. This is a unique feature of random forests as most DCMs cannot estimate the choice probability efficiently with prices.
We use the MNL model to generate the choice data. Let denote the expected utility of the products and their prices. Therefore, for given assortment , the choice probabilities of product and the no-purchase option are:
(3) 
Consider products. We generate as uniform random variables between 0 and 1 for each product. For each observation, we first randomly generate an assortment as in Section 6.1. Then we generate a price for each product in the assortment as the absolute value of a standard normal random variable. As explained in Section 5.4, we use the link function . The customer’s choice then follows the choice probability (3).
The RMSE in (2) is no longer applicable because the assortments and prices cannot be exhausted. To evaluate the performance, we randomly generate assortments and prices according to the same distribution as the training data. Then we evaluate the RMSE as follows:
(4) 
where denotes the actual choice probability, and denotes the estimation. We investigate the performance of the random forest for different sizes of training data . The result is shown in Table 9.
RMSE  

0.067 (0.008)  
0.040 (0.002)  
0.035 (0.002) 
The result confirms that random forests can tackle price information well. Although we do not have benchmarks, the RMSE is comparable to the previous experiments, e.g., Table 8.
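The data generation and estimation pipeline of this subsection can be sketched as follows. Two ingredients are assumptions because the corresponding formulas are not reproduced here: the MNL weight of a product is taken to be the exponential of utility minus price, and the link function is taken to be the exponential of the negative price. Sizes and the number of trees are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def mnl_price_probs(utils, prices):
    """Assumed MNL form: weight exp(u_i - p_i) for product i, 1 for no purchase.
    A price of inf marks a product that is not offered (weight exp(-inf) = 0)."""
    w = np.concatenate([[1.0], np.exp(utils - prices)])
    return w / w.sum()

def sample(rng, utils, n_obs):
    """Draw random assortments, prices |N(0, 1)| for offered products, choices."""
    n = len(utils)
    offered = rng.integers(0, 2, size=(n_obs, n)).astype(bool)
    prices = np.where(offered, np.abs(rng.standard_normal((n_obs, n))), np.inf)
    y = np.array([rng.choice(n + 1, p=mnl_price_probs(utils, p)) for p in prices])
    return prices, y

rng = np.random.default_rng(0)
n = 5                                   # illustrative number of products
utils = rng.uniform(size=n)
prices, y = sample(rng, utils, 2000)

X = np.exp(-prices)                     # assumed link; maps inf to exactly 0
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Compare estimated and true probabilities at one new price vector
new_prices = np.array([0.5, np.inf, 1.0, np.inf, 0.2])
p_true = mnl_price_probs(utils, new_prices)
p_hat = forest.predict_proba(np.exp(-new_prices).reshape(1, -1))[0]
print(np.round(p_true, 3), np.round(p_hat, 3), forest.classes_)
```

The out-of-sample error can then be estimated by repeating the last comparison over a freshly sampled set of assortments and prices.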
6.6 Real Data: Hotel
In this section we apply the random forest algorithm to a public dataset based on bodea2009data. The dataset includes transient customers (mostly from business travelers) who stayed in one of five continental U.S. hotels between March 12, 2007, and April 15, 2007. The minimum booking horizon for each checkin date is four weeks. Rate and room type availability and reservation information are collected via the hotel and/or customer relationship officers (CROs), the hotel’s websites, and offline travel agencies. Since there is no direct competition among these five hotels, we will process the data separately. A product is uniquely defined by the room type (e.g. Suite 1, 2 Double Beds Room 1, etc). For each transaction, the purchased room type and the assortment offered are recorded.
When processing the dataset, we removed the products that have fewer than 10 transactions. We also removed the transactions whose offered assortments are not available due to technical reasons. For the transactions in which none of the products in the available sets are purchased by the customer, we assume customers choose the no-purchase alternative. We do not add dummy transactions with no-purchases to uncensor the data like van2014market, csimcsek2018expectation and berbeglia2018comparative.
To compare different estimation procedures, we use five-fold cross-validation to examine the out-of-sample performance. Because we no longer know the actual choice model that generates the data, after estimating the model in the training set, we follow berbeglia2018comparative and evaluate the “empirical” version of the RMSE in the validation set. That is, letting be the validation set, we define
(5) 
In Table 10 we show the scale of the five datasets after preprocessing. We also show the out-of-sample RMSE for each hotel (average and standard deviation). In addition, we show the performance of the independent demand model (ID), which does not incorporate the substitution effect and is expected to perform poorly, in order to provide a lower bound of the performance.
Consistent with the insights drawn from the synthetic data, the random forest outperforms the parametric methods for larger datasets (Hotel 1, 2 and 3). For smaller data sizes (Hotel 4 and 5), the random forest is on a par with the best parametric estimation procedure (Markov) according to berbeglia2018comparative.
# products  # in-sample  # out-of-sample  

Hotel 1  10  1271  318 
Hotel 2  6  347  87 
Hotel 3  7  1073  268 
Hotel 4  4  240  60 
Hotel 5  6  215  54 
RF  MNL  Markov  ID  

Hotel 1  0.3040 (0.0046)  0.3098 (0.0031)  0.3047 (0.0039)  0.3224 (0.0043) 
Hotel 2  0.3034 (0.0120)  0.3120 (0.0148)  0.3101 (0.0124)  0.3135 (0.0178) 
Hotel 3  0.2842 (0.0051)  0.2854 (0.0065)  0.2842 (0.0064)  0.2971 (0.0035) 
Hotel 4  0.3484 (0.0129)  0.3458 (0.0134)  0.3471 (0.0125)  0.3584 (0.0047) 
Hotel 5  0.3219 (0.0041)  0.3222 (0.0069)  0.3203 (0.0046)  0.3259 (0.0058) 
6.7 Real Data: IRI Academic Dataset
In this section we compare several algorithms on the IRI Academic Dataset (bronnenberg2008database). The IRI Academic Dataset collects weekly transaction data from 47 U.S. markets from 2001 to 2012, covering more than 30 product categories. Each transaction includes the week and store of purchase, the universal product code (UPC) of the purchased item, number of units purchased and total paid dollars.
The preprocessing steps follow those in jagabathula2018limit and chen2019decision. In particular, we conduct the analysis for 31 categories separately using the data for the first two weeks in 2007. Each product is uniquely defined by the vendor code. Each assortment is defined as the set of products that are available in the same store in that week. We only focus on the top nine purchased products from all stores during the two weeks in each category and treat all other products as the no-purchase alternative.
However, the sales data for most categories is still too large for the EM algorithm to estimate the Markov chain model. For example, carbonated beverages, milk, soup and yogurt have more than 10 million transactions. For computational efficiency, we uniformly sample 1/200 of original data size without replacement. This does not significantly increase the sampling variability as most transactions in the original data are repeated entries.
We use five-fold cross-validation and the RMSE defined in (5) to examine the out-of-sample performance. The result is shown in Table 11. Random forests outperform the other two in 24 of 31 categories, especially for larger data sizes. According to berbeglia2018comparative, the Markov chain choice model has already been shown to have superb performance in synthetic and real-world studies. Table 11 demonstrates the potential of random forests as a framework to model and estimate consumer behavior in practice.
No.  Product Category         # data  RF               MNL              Markov
  1  Beer                     10,440  0.2717 (0.0006)  0.2722 (0.0008)  0.2721 (0.0007)
  2  Blades                    1,085  0.3106 (0.0037)  0.3092 (0.0034)  0.3096 (0.0036)
  3  Carbonated Beverages     71,114  0.3279 (0.0004)  0.3299 (0.0004)  0.3295 (0.0004)
  4  Cigarettes                6,760  0.2620 (0.0028)  0.2626 (0.0030)  0.2626 (0.0030)
  5  Coffee                    8,135  0.2904 (0.0010)  0.2934 (0.0009)  0.2925 (0.0010)
  6  Cold Cereal              30,369  0.2785 (0.0003)  0.2788 (0.0003)  0.2787 (0.0003)
  7  Deodorant                 2,775  0.2827 (0.0005)  0.2826 (0.0005)  0.2826 (0.0005)
  8  Diapers                   1,528  0.3581 (0.0024)  0.3583 (0.0020)  0.3583 (0.0022)
  9  Facial Tissue             8,956  0.3334 (0.0007)  0.3379 (0.0010)  0.3375 (0.0007)
 10  Frozen Dinners/Entrees   48,349  0.2733 (0.0003)  0.2757 (0.0003)  0.2750 (0.0003)
 11  Frozen Pizza             16,263  0.3183 (0.0001)  0.3226 (0.0001)  0.3210 (0.0001)
 12  Household Cleaners        6,403  0.2799 (0.0010)  0.2798 (0.0009)  0.2798 (0.0009)
 13  Hotdogs                   7,281  0.3123 (0.0011)  0.3183 (0.0005)  0.3170 (0.0007)
 14  Laundry Detergent         7,854  0.2738 (0.0017)  0.2875 (0.0017)  0.2853 (0.0016)
 15  Margarine/Butter          9,534  0.2985 (0.0004)  0.2995 (0.0004)  0.2990 (0.0003)
 16  Mayonnaise                4,380  0.3212 (0.0024)  0.3242 (0.0010)  0.3230 (0.0006)
 17  Milk                     56,849  0.2467 (0.0007)  0.2501 (0.0005)  0.2538 (0.0012)
 18  Mustard                   5,354  0.2844 (0.0008)  0.2856 (0.0006)  0.2852 (0.0006)
 19  Paper Towels              9,520  0.2939 (0.0009)  0.2964 (0.0008)  0.2959 (0.0008)
 20  Peanut Butter             4,985  0.3113 (0.0017)  0.3160 (0.0006)  0.3146 (0.0009)
 21  Photography Supplies        189  0.3456 (0.0081)  0.3399 (0.0081)  0.3456 (0.0088)
 22  Razors                      111  0.3269 (0.0300)  0.3294 (0.0225)  0.3323 (0.0195)
 23  Salt Snacks              44,975  0.2830 (0.0006)  0.2844 (0.0007)  0.2840 (0.0007)
 24  Shampoo                   3,354  0.2859 (0.0006)  0.2855 (0.0071)  0.2856 (0.0009)
 25  Soup                     68,049  0.2709 (0.0007)  0.2738 (0.0005)  0.2729 (0.0005)
 26  Spaghetti/Italian Sauce  12,377  0.2901 (0.0003)  0.2919 (0.0006)  0.2914 (0.0006)
 27  Sugar Substitutes         1,269  0.3080 (0.0036)  0.3067 (0.0035)  0.3072 (0.0034)
 28  Toilet Tissue            11,154  0.3084 (0.0005)  0.3126 (0.0004)  0.3132 (0.0014)
 29  Toothbrushes              2,562  0.2860 (0.0009)  0.2859 (0.0004)  0.2858 (0.0006)
 30  Toothpaste                4,258  0.2704 (0.0008)  0.2708 (0.0011)  0.2708 (0.0011)
 31  Yogurt                   61,671  0.2924 (0.0011)  0.2976 (0.0008)  0.2960 (0.0008)
7 Concluding Remarks
We hope that this study will encourage more scholars to pursue BCFs as a research topic. We believe that addressing the following questions would help us decode the empirical success of random forests and understand the pitfalls:

What types of DCMs can be estimated well by random forests, with good generalization to unseen assortments?

As we use the choice forest to approximate DCMs, how can we translate the properties of a DCM to the topological structure of decision trees?

Can we provide finite-sample error bounds for the performance of random forests, with or without the price information?

What properties does the product importance index MDI have?

Given a binary choice forest, possibly estimated by random forests, can we compute the optimal assortment efficiently?
References
Appendix A Proofs
Proof of Theorem 1.
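It is easy to see that a BCF is a DCM. To show the converse, consider a finite collection of DCMs and a convex combination of them, that is, a weighted average with nonnegative weights that sum to one. The combination is clearly a DCM, so a convex combination of DCMs is a DCM and thus all DCMs form a convex set.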
Consider the extreme points, i.e., DCMs that cannot be written as a nontrivial convex combination of two or more DCMs. Any DCM is in the convex hull of the collection of all extreme DCMs. A deterministic DCM is a DCM in which every choice probability is either 0 or 1, for every alternative and every assortment. Next we show that a DCM is an extreme point if and only if it is deterministic. A deterministic DCM selects a single choice for each assortment, so that an alternative has choice probability 1 only if it is that choice. It is clear that a deterministic DCM is an extreme point. Conversely, if an extreme DCM is not deterministic, then we can always split a probability strictly between 0 and 1 and write the DCM as a convex combination of two different DCMs, a contradiction. Therefore, the extreme points are exactly the deterministic DCMs.
It is sufficient to show that all deterministic DCMs can be represented as a BCF. This follows directly because every deterministic DCM is a binary choice tree, which can be explicitly constructed from its choice for every assortment. We can now formally state the connection between DCMs and BCFs. ∎
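The convex-hull step can be made concrete with a small sketch. The code below writes an arbitrary DCM as a mixture of deterministic DCMs by weighting each deterministic rule with the product of the probabilities of its picks. This is one explicit mixture, chosen because it is easy to verify; it is not necessarily the construction the proof has in mind, and the mixture is not unique. The dictionary representation and the names `all_assortments`, `decompose`, and `mix` are assumptions for illustration, with 0 denoting no purchase.

```python
from itertools import combinations, product
from math import prod


def all_assortments(n):
    """Every subset of products {1, ..., n}, as frozensets."""
    items = range(1, n + 1)
    return [frozenset(c) for r in range(n + 1) for c in combinations(items, r)]


def decompose(dcm, n):
    """Write `dcm` as a mixture of deterministic DCMs.

    `dcm[S][i]` is the probability of choosing i from S, where i is in S or
    is 0 (no purchase). Returns (weight, rule) pairs; `rule[S]` is the single
    option a deterministic DCM picks from S.
    """
    subsets = all_assortments(n)
    options = [sorted(s | {0}) for s in subsets]
    mixture = []
    for picks in product(*options):
        # Weight of a rule: product of the probabilities of its picks.
        weight = prod(dcm[s].get(i, 0.0) for s, i in zip(subsets, picks))
        if weight > 0:
            mixture.append((weight, dict(zip(subsets, picks))))
    return mixture


def mix(mixture, assortment, item):
    """Choice probability of `item` in `assortment` under the mixture."""
    return sum(w for w, rule in mixture if rule[assortment] == item)
```

Summing the weights of all rules that pick a given alternative from a given assortment factorizes into that alternative's choice probability times a product of ones, so `mix` reproduces the original DCM. The number of deterministic rules grows very quickly with the number of products, so the sketch is only practical for tiny examples.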
Proof of Theorem 3.
Let be an arbitrary DCM. Construct trees , , with leaves associated with each of the possible subsets of . For any given , we let for . It is easy to see that
Since the error bound holds for all and , the choice forest can approximate any DCM for a sufficiently large .
∎
Proof of Theorem 4.
We first prove that for a single decision tree, there is a high probability that the number of observations chosen in Step 4 in which is offered is large. More precisely, let . It is easy to see that . Step 4 randomly selects observations out of the with replacement. Denote the bootstrap sample of by . By Hoeffding’s inequality, we have the following concentration inequality
(6) 
for any . In other words, the bootstrap sample in Step 4 does not deviate too far from the population as long as is large. As we choose , it implies that and in particular
(7) 
Next we show that given for a decision tree, the leaf node that contains only contains observations with . That is, the terminal leaf containing is a single corner of the unit hypercube. If the terminal leaf node contains an observation with predictor , then it has no fewer than observations, because all the samples used to train the tree fall on the same corner in the predictor space. If another observation with a different predictor is in the same leaf node, then it contradicts Step 6 in the algorithm, because it would imply that another split could be performed. Suppose is the final partition corresponding to the decision tree. As a result, in the region such that , we must have that is a random sample from the customer choices, according to Step 11.
Now consider the estimated choice probability from the random forest: . Note that are i.i.d. given the training set. By Hoeffding’s inequality, conditional on ,
(8) 
for all . Next we analyze the probability for a single decision tree. By the previous paragraph, conditional on , the output of a single tree is randomly chosen from the class labels of observations whose predictor is . Let be the class label of the th chosen observation in Step 4. Therefore, conditional on the event and the training data, we have
(9) 
Because is a bootstrap sample, having i.i.d. distribution
given the training data, we apply Hoeffding’s inequality again
(10) 
for all . Now applying Hoeffding’s inequality to again, and because of Assumption 1, we have that
(11) 
for all .
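For reference, the standard form of Hoeffding's inequality invoked repeatedly above is stated below for independent random variables $Z_1,\dots,Z_n$ taking values in $[0,1]$. The symbols $Z_i$, $n$, and $\epsilon$ are generic placeholders and are not the notation of this proof.

```latex
\[
\Pr\left( \left| \frac{1}{n}\sum_{i=1}^{n} Z_i
  - \mathbb{E}\left[ \frac{1}{n}\sum_{i=1}^{n} Z_i \right] \right| \ge \epsilon \right)
  \le 2\exp\left(-2 n \epsilon^{2}\right),
  \qquad \epsilon > 0 .
\]
```

Each application in the proof instantiates this bound with a different bounded sample: indicators over the bootstrap draws, tree outputs across the forest, or class labels of the observations sharing a predictor.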
With the above results, we can bound the target quantity
By (8), the first term is bounded by which converges to zero as . To bound the second term, note that