Model selection and model averaging in MACML-estimated Multinomial Probit (MNP) models
Abstract
This paper reviews model selection and model averaging methods for multinomial probit models estimated using the Maximum Approximate Composite Marginal Likelihood (MACML) approach. The surveyed approaches are partitioned into test-based methods (mostly derived from the likelihood ratio paradigm), methods based on information criteria, and model averaging methods. Many of these approaches were first derived for models estimated using maximum likelihood and later adapted to the composite marginal likelihood framework. In this paper all approaches are applied to the MACML estimation approach. The investigation lists advantages and disadvantages of the various methods in terms of asymptotic properties as well as computational aspects. We find that likelihood-ratio-type tests and information criteria have a spotty performance when applied to MACML models and instead propose the use of an empirical likelihood test.
Furthermore, we show that model averaging is easily adapted to Composite Marginal Likelihood (CML) estimation and has promising performance with respect to parameter recovery. Finally, model averaging is applied to a real-world example in order to demonstrate the feasibility of the method for real-world problem sizes.
Abbreviations
AIC: Akaike Information Criterion
APB: Average Percentage Bias
ASC: Alternative Specific Constant
BIC: Bayesian Information Criterion
bME: bivariate Mendell-Elston
CCL: Composite Conditional Likelihood
CDF: Cumulative Distribution Function
CEF: Conditional Expectation Function
CL: Composite Likelihood
CLAIC: Composite Likelihood Akaike Information Criterion
CLBIC: Composite Likelihood Bayesian Information Criterion
CML: Composite Marginal Likelihood
c.p.: ceteris paribus
DCM: Discrete Choice Models
DGP: Data Generating Process
DM: Decision Maker
GHK: Geweke-Hajivassiliou-Keane
IC: Information Criterion
iid: independent, identically distributed
KLD: Kullback-Leibler Divergence
LLN: Law of Large Numbers
MA: model averaging
MACML: Maximum Approximate Composite Marginal Likelihood
MAE: Mean Absolute Error
MC: Monte Carlo
MCMC: Markov Chain Monte Carlo
ME: Mendell-Elston
ML: Maximum Likelihood
MNP: Multinomial Probit
MOP: German Mobility Panel
MSE: Mean Squared Error
MSL: Maximum Simulated Likelihood
MVNCDF: multivariate normal cumulative distribution function
OLS: Ordinary Least Squares
PQD: Positive Quadrant Dependence
RE: Random Effect
RMSE: Root Mean Squared Error
RUM: Random Utility Model
RV: Random Variable
SEM: Structural Equation Models
SJ: Solow-Joe
SLLN: Strong Law of Large Numbers
ULLN: Uniform Law of Large Numbers
WLLN: Weak Law of Large Numbers
WTP: Willingness to Pay
1 Introduction
The two most commonly used families of discrete choice models are the multinomial logit (MNL) and the multinomial probit (MNP) model. Between these two, the MNP offers greater modeling flexibility at the expense of higher computational costs. To alleviate the computational burden of MNP estimation, the MACML estimation approach combines CML with an analytic approximation to allow for simulation-free and fast (but approximate) estimation of MNP models (see Bhat (2011)). Bhat and coworkers have shown that estimation in the MACML framework is very fast (see Cherchi et al. (2016)) and can be applied even to complicated MNP models (see e.g. Kamargianni et al. (2015)).
Even though it is sometimes possible to specify a discrete choice model solely through theoretical reasoning, more often than not the researcher has to rely on model selection in order to arrive at a useful model. Although Bhat (2011) as well as Bhat (2014) offer some limited advice, the literature on model selection for MNP models using the MACML estimation methodology is not extensive. Moreover, model averaging, which has been proposed in a number of different contexts (see e.g. Gao et al. (2016), Hjort and Claeskens (2003), Hansen (2007), Wan et al. (2014)), has not been dealt with in the context of CML estimation.
Therefore this paper aims to stimulate discussion on this important topic by surveying proposals for model selection methods for MNP models estimated using MACML. (Note that in order to save space we will sometimes blur the distinction between estimation method and model by abbreviating 'MNP models estimated using MACML' as 'MACML models'.) We study those methods under the premise of a given, finite collection of competing models, from which the researcher selects one using a data-driven procedure.
In the literature one finds two approaches to model selection. The first is based on a penalized goodness-of-fit function associated with each of the competing models, which incorporates the trade-off between model fit and model complexity. In the context of MACML this procedure is based on information criteria (IC) derived from the respective pseudo-likelihoods. The decision rule is that the model with the smallest estimated criterion value is retained. In an abstract sense this partitions the space of all observations into regions where a particular model is chosen.
The alternative approach is based on repeated hypothesis testing: the final model is chosen by sequentially testing a restricted model against an unrestricted one. Stepwise regression techniques are examples of this procedure. Again a partitioning of the observation space results.
The estimated parameter after model selection based on either approach can therefore be presented as

$\hat{\theta} = \sum_{m=1}^{M} w_m(\mathbf{y})\, \hat{\theta}_m,$

where $\mathbf{y}$ denotes the observations, $\hat{\theta}_m$ the estimate under model $m$, and the weight $w_m(\mathbf{y})$ equals one if model $m$ is selected and zero otherwise. Note that the weights are random, as they too are 'estimated' from the data: the components of the model selection procedure depend on the given sample. Furthermore, the decision rules only allow for the selection of a single model, which is equivalent to $w_m \in \{0, 1\}$ with $\sum_{m=1}^{M} w_m = 1$.
As an alternative to model selection, in this paper we introduce model averaging methods for models estimated using the MACML paradigm. The idea of model averaging (MA) is to use the information from all models and not only the 'winning' model. The first motivation for doing so is the acknowledgment that model selection based on just one sample is subject to randomness. From a theoretical point of view, MA avoids problems related to post-model-selection inference (see e.g. (Claeskens and Hjort; 2008, 206ff) or Leeb and Pötscher (2005)). MA is also appealing to the practitioner because predictions based on model averaging have been shown to be better in terms of mean squared error than those based on model selection (see e.g. Gao et al. (2016)).
The estimated parameter after model averaging can be presented as

$\hat{\theta}_{MA} = \sum_{m=1}^{M} w_m\, \hat{\theta}_m,$  (1)

where the weights fulfill $w_m \ge 0$ and $\sum_{m=1}^{M} w_m = 1$. There are several model averaging procedures, which differ in the way the weights are justified and computed.
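As a concrete illustration, the weighted combination in (1) can be sketched in a few lines of Python. The function name and the convention of embedding every model's parameters in a common space (zeros for excluded parameters) are ours, not the paper's:

```python
import numpy as np

def average_estimates(estimates, weights):
    """Model-averaged estimate: theta_hat = sum_m w_m * theta_hat_m.

    estimates : (M, p) array, one row per candidate model, with each
                model's parameters mapped into a common p-dimensional
                space (zeros for parameters a model excludes)
    weights   : (M,) array, non-negative and summing to one
    """
    estimates = np.asarray(estimates, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0) or np.any(weights < 0):
        raise ValueError("weights must be non-negative and sum to one")
    return weights @ estimates
```

Model selection is recovered as the special case of a one-hot weight vector.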
In this paper we present adaptations of concepts well known in the context of maximum likelihood estimation to the MACML case. In all cases we discuss advantages and disadvantages of the various approaches, focusing on their analytical and numerical properties. This is of particular importance as model selection and model averaging require the computation of a number of models and hence exacerbate computational issues already present in the estimation of a single model.
The organization of the paper is as follows. First, we succinctly review the MACML framework and argue that the MACML pseudo-likelihood differs from the standard CML framework. In the second section we review test- and information-criterion-based model selection and discuss computational challenges. As a new contribution this section introduces a test framed in empirical likelihood theory. This is followed by a simulation exercise comparing the various approaches on two demonstration examples.
Third, we introduce and discuss MA for MACML, followed by a simulation-based comparison of the discussed methods. In section 7 we present an empirical example based on real choice data from the German Mobility Panel in order to demonstrate the feasibility of model averaging. We conclude the article with a summary of our results and a discussion of practical implications of our findings.
2 The MACML approach
The first building block of the MACML method is CML estimation (for a general survey of this method see Varin et al. (2011)). In this paper only the special case of the so-called pairwise likelihood is considered, although many results also generalize to more general CMLs. For pairwise CML the full likelihood, representing the joint probability of the $T$ choices observed for one individual, is replaced by the product of the probabilities of all pairs of choices,

$L_{CML}(\theta) = \prod_{t=1}^{T-1} \prod_{t'=t+1}^{T} P(y_t, y_{t'} \mid \theta).$  (2)

Note that the factors $P(y_t, y_{t'} \mid \theta)$ are valid marginal likelihoods, such that key properties of the likelihood framework continue to hold. For example, it directly follows that the composite score, which we define as $s(\theta) = \nabla_\theta \log L_{CML}(\theta)$, has zero mean at the true parameter $\theta_0$.
In analogy to Maximum Likelihood (ML) estimation, the CML estimator is defined as

$\hat{\theta}_{CML} = \arg\max_\theta \sum_{n=1}^{N} \log L_{CML,n}(\theta).$

Under suitable regularity conditions this estimator is consistent for $\theta_0$ (see (Molenberghs and Verbeke; 2005, 190ff)) whenever $T$ is fixed. Furthermore, it is possible to show that under suitable assumptions the estimator is asymptotically normally distributed, where the asymptotic variance is given by the inverse of the Godambe (sandwich) information matrix $G(\theta) = H(\theta) J(\theta)^{-1} H(\theta)$. Here the so-called sensitivity matrix is defined as $H(\theta) = E[-\nabla^2_\theta \log L_{CML}(\theta)]$, and the variability matrix $J(\theta)$ is the variance of the composite score $s(\theta)$. This is a standard result for misspecified (in the sense of using a pseudo-likelihood instead of the true one) likelihood models and is due to the fact that the information identity (also called the second Bartlett identity) $H = J$ does not hold (see e.g. White (1982)). In general the CML estimator is therefore less efficient than the ML estimator.
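The sandwich form of the asymptotic variance can be sketched numerically as follows. The function name is ours, and the scaling convention (per-observation sensitivity and variability, with the final division by $N$ giving the estimator's covariance) is one of several found in the literature:

```python
import numpy as np

def godambe_covariance(scores, hessian):
    """Sandwich (Godambe) covariance of a CML estimator.

    scores  : (N, p) per-individual composite scores evaluated at the
              CML estimate (rows should average to roughly zero)
    hessian : (p, p) Hessian of the total composite log-likelihood at
              the estimate (negative definite at a maximum)

    Returns H^{-1} J H^{-1} / N with per-observation sensitivity
    H = -hessian/N and variability J = (1/N) * sum_n s_n s_n'.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    H = -np.asarray(hessian, dtype=float) / n    # sensitivity matrix
    J = scores.T @ scores / n                    # variability matrix
    H_inv = np.linalg.inv(H)
    return H_inv @ J @ H_inv / n                 # Var(theta_hat)
```

For a correctly specified full likelihood (where $H = J$) this collapses to the usual inverse-information variance.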
When CML is utilized to estimate MNP models, each pairwise probability involves the evaluation of the multivariate normal cumulative distribution function (MVNCDF). As is well known, the MNP model can be derived using an underlying random utility function assigning the utility

$U_{ntj} = X_{ntj}(b + \gamma_n) + \epsilon_{ntj}$

to the choice of alternative $j$ with characteristics $X_{ntj}$ in the $t$-th decision of individual $n$. Here $b$ denotes fixed parameters, while $\gamma_n$ — assumed to be normally distributed with zero mean and variance $\Omega$ — allows for heterogeneity between deciders.
Furthermore, $\epsilon_{ntj}$ denotes the independent, identically distributed (iid) random errors, assumed to be normally distributed with zero mean and variance $\Sigma$. Consequently, stacking the utilities into a vector $U_n$, we obtain that $U_n$ is normally distributed.
It follows that the probability that alternatives $j$ and $j'$ are the choices (that is, have maximal random utility) in the $t$-th and $t'$-th choice occasion, respectively, is given by

$P(y_{nt} = j,\, y_{nt'} = j') = \Phi_d\big(\tilde{b}(\theta);\, R(\theta)\big),$  (3)

where $\theta$ collects all parameters contained in $b$, $\Omega$ and $\Sigma$, for appropriate functions $\tilde{b}$ and $R$ defining the upper limits of integration and the correlation matrix, respectively. Furthermore, $\Phi_d$ denotes a $d$-dimensional MVNCDF corresponding to mean zero and variance-covariance matrix $R$.
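To make the MVNCDF representation concrete, the following sketch computes a single MNP choice probability by differencing utilities against the candidate alternative; the pairwise probability in (3) stacks two such difference operators. The function name is ours, and SciPy's numerical MVNCDF stands in for the analytic approximation discussed next:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def mnp_choice_prob(V, Sigma, j):
    """Probability that alternative j has maximal utility in an MNP
    model with deterministic utilities V and error covariance Sigma.

    Differencing against alternative j turns {U_j >= U_k for all k}
    into a (J-1)-dimensional normal orthant probability, i.e. a single
    MVNCDF evaluation.
    """
    V = np.asarray(V, dtype=float)
    J = len(V)
    A = np.delete(np.eye(J), j, axis=0)  # rows: e_k for k != j
    A[:, j] = -1.0                       # rows become e_k - e_j
    Z_cov = A @ np.asarray(Sigma, float) @ A.T  # cov of (eps_k - eps_j)
    upper = -A @ V                       # upper limits V_j - V_k
    return multivariate_normal(mean=np.zeros(J - 1), cov=Z_cov).cdf(upper)
```

For two alternatives this reduces to the familiar binary probit probability.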
Whenever $d$ is large, CML estimation of MNP models is still computationally demanding, and the MACML approach by Bhat (2011) therefore complements pairwise CML estimation with an analytic approximation for MVNCDFs. Bhat (2011) proposes the use of the Solow-Joe (SJ) approximation (Solow; 1990; Joe; 1995), whose general idea is to factorize the multivariate normal distribution into a product of conditional distributions, which are in turn approximated by linear projections. For the three-dimensional case of calculating the MVNCDF $P(x_1 < a_1, x_2 < a_2, x_3 < a_3)$, where the $x_i$ are standard normally distributed with correlation $\rho_{ij}$ between $x_i$ and $x_j$, we obtain (using the indicators $I_i = 1\{x_i < a_i\}$):

$P(x_1 < a_1, x_2 < a_2, x_3 < a_3) \approx \hat{E}(I_3 \mid I_1 = 1, I_2 = 1)\; P(x_1 < a_1, x_2 < a_2),$  (4)

where $\hat{E}$ denotes the linear projection of $I_3$ onto $(1, I_1, I_2)$. The entries of the projection coefficients and of the covariance matrix $\Omega_I$ of $(I_1, I_2)$ contain covariances of $I_3$ with $(I_1, I_2)$, which are functions requiring the evaluation of univariate and bivariate MVNCDF functions. The positive definite matrix $\Omega_I$ has smallest eigenvalues bounded away from zero for bounded integration limits if the same holds for $R$.
Note that the ordering of the components for the approximation is arbitrary but influences the approximation quality. It is known that averaging over all possible permutations of coordinates improves the approximation accuracy at the price of a higher computational load. Therefore, typically only a fixed number of random permutations (with one being a popular value) are averaged. Note that the approximation does not guarantee that the approximated choice probabilities over all choices sum to one. Furthermore, there is no guarantee that the approximation lies in $[0, 1]$; this is typically ensured using ad hoc interventions. For further details regarding the SJ approximation see Joe (1995).
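The three-dimensional case described above can be sketched as follows. Helper names are ours, and the final clipping mirrors the ad hoc interventions mentioned above; this is our reading of the projection logic, not a reference implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def phi2(a, b, r):
    """Bivariate standard normal CDF with correlation r."""
    return multivariate_normal(mean=[0.0, 0.0],
                               cov=[[1.0, r], [r, 1.0]]).cdf([a, b])

def sj_trivariate(a, R):
    """Solow-Joe approximation of P(x1<a1, x2<a2, x3<a3) for a standard
    trivariate normal with correlation matrix R. The conditional
    P(A3 | A1, A2) is replaced by the linear projection of the
    indicator 1{A3} on (1, 1{A1}, 1{A2}), evaluated at (1, 1)."""
    a1, a2, a3 = a
    p1, p2, p3 = norm.cdf(a1), norm.cdf(a2), norm.cdf(a3)
    p12 = phi2(a1, a2, R[0, 1])
    # covariances of the Bernoulli indicators, via bivariate CDFs
    c13 = phi2(a1, a3, R[0, 2]) - p1 * p3
    c23 = phi2(a2, a3, R[1, 2]) - p2 * p3
    Omega_I = np.array([[p1 * (1 - p1), p12 - p1 * p2],
                        [p12 - p1 * p2, p2 * (1 - p2)]])
    cond = p3 + np.array([c13, c23]) @ np.linalg.solve(
        Omega_I, np.array([1 - p1, 1 - p2]))
    return p12 * np.clip(cond, 0.0, 1.0)  # ad hoc clipping to [0, 1]
```

For the equicorrelated case with correlations 0.5 and limits at zero the approximation reproduces the known orthant probability of 1/4.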
Even though the SJ approximation is an important building block of MACML and ensures the comparatively fast estimation of complex MNP models, it is important to note that its application alters the pseudo-likelihood and that MACML estimation is, therefore, not equivalent to CML estimation. To be more precise, it is possible to show that due to the SJ approximation the MACML estimator for data generated from an MNP model is asymptotically biased and not consistent (see Batram and Bauer (2016)). (This holds for several other analytic approximations as well and is no particular problem of the SJ approximation; furthermore, this inconsistency is also present in Maximum Simulated Likelihood (MSL) estimates whenever the number of simulations is finite, see e.g. Lee (1992).) To the best of our knowledge there are no general results that quantify or bound the deviations between the MACML pseudo-likelihood and the CML pseudo-likelihood even in large samples, although the deviation has been shown to be small in all real-world case studies performed so far.
A different way to look at the stated inconsistency is to note that, if the approximated choice probabilities are normalized to sum to one, they encode a parameterized mapping of the observed characteristics onto the choice probabilities. It appears that one can show (this result is work in progress) that the MACML estimator using the normalized probabilities consistently estimates the underlying parameters of this mapping. In this sense the MACML estimator regains the usual properties of consistency and asymptotic normality for a slightly altered model.
For model selection and averaging we will need some more notation in the following sections. Assume that the parameter vector $\theta$ is of dimension $p + q$ and that we partition it as $\theta = (\mu', \gamma')'$, where $\mu$ is $p$-dimensional and $\gamma$ has dimension $q$. The vector $\mu$ contains parameters that are known to be present in the model, while we are unsure whether the parameters assigned to the vector $\gamma$ should be contained in the model.
Our interest for model selection thus lies in $\gamma$, and we want to test $H_0$: $\gamma = \gamma_0$. The elements of $\gamma_0$ will most often equal zero; restrictions to other values, such as 1, are possible, for example for variances.
With respect to the full parameter vector we denote the estimator where $\gamma$ is constrained according to the null hypothesis by $\hat{\theta}_0$; the unconstrained estimator is $\hat{\theta}$. Finally, let $G^{\gamma\gamma}$ be the trailing $q \times q$ submatrix of the inverse Godambe information matrix and $s_\gamma$ the $q$-dimensional trailing subvector of the composite score, which contains only the entries related to $\gamma$. Furthermore, $H^{\gamma\gamma}$ denotes the corresponding trailing submatrix of $H^{-1}$.
3 Model selection
As discussed above, MACML estimation is not equivalent to CML estimation. However, we start this section by surveying methods for the CML framework and then – in the following section – assess their applicability to MACML estimation by simulation.
3.1 Likelihood-ratio-type tests
In this section we focus only on likelihood-ratio-type tests. While it is easy to adapt score-type and Wald-type tests for use within the CML framework, the Wald-type tests suffer from known shortcomings (lack of invariance to reparametrization and elliptical confidence regions), and tests based on the score are numerically unreliable (see (Molenberghs and Verbeke; 2005, 193f)).
The standard likelihood ratio test is inapplicable because twice the log-ratio of two composite likelihoods does not follow an asymptotic $\chi^2$ distribution under $H_0$; instead its limit is a linear combination of independent, weighted $\chi^2_1$ distributions. If we are interested in testing $H_0$: $\gamma = \gamma_0$, then

$CLR = 2\left[\log L_{CML}(\hat{\theta}) - \log L_{CML}(\hat{\theta}_0)\right] \xrightarrow{d} \sum_{i=1}^{q} \lambda_i Z_i^2,$  (5)

where the $\lambda_i$ are eigenvalues of $(H^{\gamma\gamma})^{-1} G^{\gamma\gamma}$ (as discussed, those are matrices evaluated under the null hypothesis) and the $Z_i$ are independent standard normal random variables. Just like the standard likelihood ratio test, its CML counterpart rejects whenever $CLR$ is large.
There are several different adjustments to $CLR$ which aim to facilitate likelihood-ratio-type testing in the CML framework. Due to space constraints we focus only on the adjustments which have shown promising results in previous studies (for a more general overview and simulation results see Pace et al. (2011) or Cattelan and Sartori (2016)). (Note, however, that the simulation studies in that literature usually focus on simple models like the estimation of moments of the multivariate normal or Gaussian random fields. As those models are not a good surrogate for MACML-estimated MNP models, we provide our own simulations in section 4.) Those adjustments either alter $CLR$ so that it is approximately $\chi^2$ distributed or match certain moments of its asymptotic distribution.
The moment-matching adjustments are motivated by the observation that for $q = 1$ the single eigenvalue is a simple ratio, so that $CLR/\lambda_1$ is asymptotically $\chi^2_1$. For more than one parameter this adjustment takes the form

$CLR_1 = \frac{CLR}{\bar{\lambda}},$  (6)

where $\bar{\lambda} = \frac{1}{q}\sum_{i=1}^{q}\lambda_i$. This adjustment is equivalent to matching the first moment of the $\chi^2_q$ distribution. A more sophisticated adjustment, designed to match the first and second moments, was proposed in Varin (2008),

$CLR_2 = \frac{\sum_{i=1}^{q}\lambda_i}{\sum_{i=1}^{q}\lambda_i^2}\, CLR, \qquad \nu = \frac{\big(\sum_{i=1}^{q}\lambda_i\big)^2}{\sum_{i=1}^{q}\lambda_i^2},$  (7)

where $CLR_2$ is compared with a $\chi^2_\nu$ distribution. It is well known that this adjustment might still be inaccurate because it only matches the first two moments.
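A sketch of the two moment-matching corrections: the first divides the statistic by the mean eigenvalue and refers it to a chi-square with $q$ degrees of freedom; the second is a Satterthwaite-type two-moment match with effective degrees of freedom. The function name is ours, and this is our reading of the adjustments rather than the papers' exact presentation:

```python
import numpy as np
from scipy.stats import chi2

def adjusted_clr_pvalues(clr, eigvals):
    """Moment-matching adjustments of the composite likelihood ratio.

    clr     : unadjusted statistic, twice the composite log-likelihood
              difference between unconstrained and constrained fit
    eigvals : eigenvalues lambda_1..lambda_q governing the weighted
              chi-square limit sum_i lambda_i * Z_i^2

    Returns p-values for (i) the first-moment adjustment
    clr / mean(lambda) referred to chi2(q), and (ii) a two-moment
    (Satterthwaite-type) adjustment referred to chi2(nu) with
    nu = (sum lambda)^2 / sum lambda^2.
    """
    lam = np.asarray(eigvals, dtype=float)
    q = lam.size
    p_first = chi2.sf(clr / lam.mean(), df=q)
    nu = lam.sum() ** 2 / (lam ** 2).sum()       # effective dof
    p_second = chi2.sf(clr * lam.sum() / (lam ** 2).sum(), df=nu)
    return p_first, p_second
```

When all eigenvalues are equal the two adjustments coincide, as they must.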
The calculation of both $CLR_1$ and $CLR_2$ is computationally only slightly more expensive than CML optimization: only the eigenvalues must be calculated, based on estimates of $H$ and $J$, which demand one extra pass through the likelihood calculations (if the corresponding quantities are not stored in the last gradient descent step).
Another adjustment, denoted $CLR_{inv}$, was first proposed by Chandler and Bate (2007) and later modified by Pace et al. (2011) in order to be invariant to reparametrization. In contrast to the previous methods this adjustment does not match moments but rescales $CLR$ so that the resulting statistic is asymptotically $\chi^2_q$ distributed. The rescaling factor is a ratio of quadratic forms in the composite score evaluated at the constrained estimate; its numerator is the CML version of a score test statistic but, as explained for example in Cattelan and Sartori (2016), does not inherit its numerical instabilities. It is important to note that all these adjustments rely on estimators of the Godambe information matrix. Again, the computation requires little extra work.
As an alternative we introduce another test, which has the appeal that there is no need to compute the Godambe information matrix. This test is based on the empirical likelihood framework, which in the context of CML estimation was first used by Lunardon et al. (2013) to derive confidence regions for CML estimates. Using general results from Qin and Lawless (1994), it is straightforward to also introduce a likelihood-ratio-like test for CML. The only ingredients needed are that the composite score is an unbiased estimating equation, which it is under the standard conditions discussed in section 2, and Corollary 5 from Qin and Lawless (1994). Hence we do not provide any proofs.
In general, this approach comes with two difficulties. First, regarding theoretical properties, empirical likelihood methods are known for their slow convergence to their respective asymptotic distributions, which might pose problems for small sample sizes. Second, there is a need for additional computations. The essential building block of empirical likelihood theory is the empirical log-likelihood ratio

$\ell_E(\theta) = \sum_{n=1}^{N} \log\left(1 + t'\, g_n(\theta)\right)$

with the individual score contributions $g_n(\theta)$, where $t$ is a $d$-dimensional vector chosen to satisfy

$\frac{1}{N}\sum_{n=1}^{N} \frac{g_n(\theta)}{1 + t'\, g_n(\theta)} = 0.$  (8)

While the individual gradients $g_n$ are available from the CML estimation, we need to compute the Lagrange multiplier $t$ in order to calculate $\ell_E$. The best way to determine $t$ is to use the strategy outlined in Owen (1990) and restate the root-finding problem as a minimization problem (see (Owen; 1990, 104ff)). Note that for practical computations (8) only needs to hold up to numerical zero. In our simulation experiments, even with extremely low (gradient and step) tolerances, it was rare to find a solution which fulfilled (8) exactly. Our first attempts to use direct (multivariate) root-finding methods (e.g. NAG's c05qb) failed because the solutions provided by those algorithms were small but too far from zero and interfered with the test's performance. (How much deviation from (8) can be tolerated without harming the test performance remains a practically relevant but open research question.) We therefore opted for solving the minimization problem.
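Owen's reformulation of the root-finding problem as an (unconstrained, convex) minimization can be sketched as follows. Function and helper names are ours; `owen_log_star` is the quadratically extended logarithm from Owen (1990) that keeps the objective finite when some $1 + t'g_n$ approaches zero:

```python
import numpy as np
from scipy.optimize import minimize

def owen_log_star(z, eps):
    """log(z) for z >= eps, smooth quadratic continuation below eps."""
    out = np.empty_like(z)
    ok = z >= eps
    out[ok] = np.log(z[ok])
    out[~ok] = np.log(eps) - 1.5 + 2.0 * z[~ok] / eps \
        - z[~ok] ** 2 / (2.0 * eps ** 2)
    return out

def solve_el_multiplier(g):
    """Find t solving sum_n g_n / (1 + t'g_n) = 0 by minimizing the
    dual objective -sum_n log*(1 + t'g_n), as suggested by Owen (1990).

    g : (N, d) array of per-observation estimating functions
    """
    g = np.asarray(g, dtype=float)
    n, d = g.shape
    eps = 1.0 / n

    def objective(t):
        return -owen_log_star(1.0 + g @ t, eps).sum()

    def gradient(t):
        z = 1.0 + g @ t
        # derivative of log*: 1/z above eps, linear continuation below
        w = np.where(z >= eps, 1.0 / z, 2.0 / eps - z / eps ** 2)
        return -(g * w[:, None]).sum(axis=0)

    res = minimize(objective, np.zeros(d), jac=gradient,
                   method="BFGS", options={"gtol": 1e-10})
    return res.x
```

At an interior optimum the gradient of the dual equals the left-hand side of (8), so the stationarity condition is exactly the desired root equation.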
After $t$ is computed, the empirical likelihood ratio statistic $W_E$ for testing $H_0$: $\gamma = \gamma_0$ is obtained as the difference between the empirical log-likelihood ratios of the constrained and the unconstrained problem, which is asymptotically $\chi^2_q$ distributed under the null hypothesis (see (Qin and Lawless; 1994, 307)).
Numerically this procedure is more demanding than the adjustments of the composite likelihood ratio: two Lagrange multipliers need to be computed by searching for the appropriate roots, which is, however, in general faster than the computation of the sensitivity as well as the variability matrix. (We refrain from presenting computation times alongside our results because those depend strongly on the implementation as well as on the specification of the computer used for the computations. However, the appendix provides those numbers for our example implementation and computer system.)
The expected computation time is the reason we did not include bootstrap tests in our comparison. Even though the theoretical results in Aerts and Claeskens (1999) show that the parametric bootstrap leads to a consistent estimator for the distribution of pseudo-likelihood tests under the null hypothesis, the specific form of the parametric bootstrap (see e.g. Bhat (2014)) is set up in a way that two models need to be estimated for every bootstrap sample. (As discussed in (Molenberghs and Verbeke; 2005, 195), the parametric bootstrap is expected to break down once the assumption regarding the distribution of the error terms is wrong. It might therefore be worthwhile to assess the performance of Bhat's bootstrap, because from a theoretical standpoint the use of the SJ approximation in the MACML pseudo-likelihood is not compatible with the assumption of multivariate normal error terms in the utility function.) For a reasonable number of bootstrap samples this leads to very high computation times. It might still be worth considering the bootstrap to draw inference from the final model, but it is certainly not suited to be part of the model selection process for real-world sample sizes.
All in all we have identified four different methods to perform likelihood-ratio-type tests in the CML framework. Three methods ($CLR_1$, $CLR_2$, $CLR_{inv}$) are adjustments to the composite likelihood ratio (5), which rely on estimates of the Godambe information matrix. The fourth method ($W_E$), which is based on empirical likelihood arguments, utilizes only the individual score contributions.
3.2 Information Criteria
In this section we discuss the use of information criteria for model selection. The two classical criteria are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). The AIC was developed as an estimator related to the Kullback-Leibler Divergence (KLD) between a fitted model and the data generating process. Naturally the model with minimal Kullback-Leibler divergence would be the model of choice; because of this, the AIC is designed to select the minimum-KLD model. Note, however, that the AIC is not an estimate of the KLD itself (see Claeskens and Hjort (2008, p. 28ff)).
The BIC is designed to select the model with the highest posterior probability when equal prior probability is assigned to each model and vague priors for the model parameters are used (see Burnham and Anderson (2002, p. 286)). The BIC is an approximation to this quantity, with the appealing feature 'that the specification of the prior completely disappears in the formula of BIC' (Claeskens and Hjort (2008, p. 81)).
Two important properties of an Information Criterion (IC) are efficiency and consistency. An IC is deemed efficient if it selects the model with minimum prediction error. Consistency is defined as the property that an IC asymptotically selects the model with the minimum KLD and, if that condition is satisfied by more than one model, the model with the lowest number of parameters. This leads to the selection of the true model (if it is part of the model space) with probability going to one as $N \to \infty$. When assessing BIC and AIC with regard to efficiency, it can be shown that the AIC is efficient while the BIC is not (see Claeskens and Hjort (2008, p. 112)). On the other hand, it can also be established that the BIC is consistent while the AIC is not.
Both of these classic ICs have been adapted for use within the CML framework. The Composite Likelihood Akaike Information Criterion (CLAIC) was introduced in Varin and Vidoni (2005) and is a variant of the Takeuchi IC (TIC) (see e.g. (Claeskens and Hjort; 2008, 43f)). The major difference between AIC and TIC is that the latter is developed without the assumption that the true Data Generating Process (DGP) is amongst the candidate models. Given that within the CML framework we are willing to make a working independence assumption with respect to the DGP, it seems reasonable that the CLAIC is closer to the TIC than to the actual AIC. The CLAIC selects the model minimizing

$\mathrm{CLAIC} = -2\, c\ell(\hat{\theta}) + 2\, \mathrm{tr}\big(\hat{H}^{-1}\hat{J}\big),$  (9)

where $\hat{H}$ and $\hat{J}$ are consistent, first-order unbiased estimators of $H$ and $J$, respectively. When concerned with the CML framework it is important to note that the objective function $c\ell(\theta)$ is not necessarily a proper likelihood function. Therefore, Varin and Vidoni (2005) use a modified definition of the KLD to derive the CLAIC. This definition focuses on the marginal distributions, but also imposes that the compared CML models are composed of marginals of the same dimension. Ng and Joe (2014) address some of the differences between the CLAIC and the ordinary AIC. To do so they focus on nested models with tractable likelihoods and make use of the theory of local alternatives. Under those premises they show that, all other things equal, the probability of picking the true model is smaller for the CLAIC than for the AIC. Furthermore, they illustrate that, as expected, the CLAIC gets closer to the AIC for rising dimensions of the marginals.
The BIC has been adapted to the CML framework in analogy to the definition of the CLAIC by Gao and Song (2010). The Composite Likelihood Bayesian Information Criterion (CLBIC) selects the model minimizing

$\mathrm{CLBIC} = -2\, c\ell(\hat{\theta}) + \log(N)\, \mathrm{tr}\big(\hat{H}^{-1}\hat{J}\big),$  (10)

where $\hat{H}$ and $\hat{J}$ are consistent, first-order unbiased estimators of $H$ and $J$, respectively. Gao and Song (2010) show that the CLBIC as defined above is a consistent information criterion. Furthermore, they provide an extension for high-dimensional data, which is irrelevant in the MNP context. Finally, note that CLAIC and CLBIC rely on the sandwich information matrix, so the computational burden is similar to that of the tests discussed in the previous section.
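Both criteria share the effective-dimension penalty $\mathrm{tr}(\hat{H}^{-1}\hat{J})$ and can be sketched together (function name ours):

```python
import numpy as np

def cl_information_criteria(cloglik, H_hat, J_hat, n_obs):
    """CLAIC and CLBIC with effective-dimension penalty tr(H^{-1} J).

    cloglik : maximized composite log-likelihood
    H_hat   : estimated sensitivity matrix
    J_hat   : estimated variability matrix
    n_obs   : sample size (for the log(N) factor of the CLBIC)
    """
    dim_eff = np.trace(np.linalg.solve(np.asarray(H_hat, dtype=float),
                                       np.asarray(J_hat, dtype=float)))
    claic = -2.0 * cloglik + 2.0 * dim_eff
    clbic = -2.0 * cloglik + np.log(n_obs) * dim_eff
    return claic, clbic
```

When the information identity $H = J$ holds, the penalty reduces to the parameter count and the classical AIC/BIC are recovered.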
Considering the definitions of CLAIC and CLBIC, one notices a subtle dependence of the penalty term $\mathrm{tr}(\hat{H}^{-1}\hat{J})$ on the parameter estimate. As formulated above, the penalty term is hence not guaranteed to be a strictly monotone function of the model order. One could always evaluate the penalty term at the biggest model, which would guarantee monotonicity. However, as CLAIC and CLBIC are intended to be used also for non-nested comparisons, this definition is not suitable.
4 Model selection: Comparison by simulation
Having discussed several model selection procedures, we now assess their respective performance in a simulation exercise. The model under consideration is a mixed panel MNP model with five alternatives. The alternatives are described by five explanatory variables (drawn from independent standard normal distributions), whose coefficients are assumed to be an instance of a multivariate normal distribution with mean $b$ and covariance matrix $\Omega$. The error terms are assumed to be normally distributed with a mean of zero and a diagonal covariance matrix whose entries are fixed at 0.5.
We generated data sets by drawing values of the coefficient vector and the error terms from their respective distributions. Based on those values we calculated the utilities, and the chosen alternative is the one with the highest utility. The data sets have sample sizes of 300, 500 or 1000, and for each setup we generate 500 such data sets to assess the properties of the various selection methods. (For each sample size a small double-digit number of the Monte Carlo samples was not used in the final analysis due to obvious non-convergence of at least one of the estimators.)
Our aim is to estimate the linear ($b$) as well as the covariance parameters. The covariance parameters are estimated via the corresponding Cholesky decomposition. In order to find the optimum of the likelihood function we rely on the BFGS algorithm provided by MATLAB's fminunc function. To ensure competitive computation times we derived the respective analytic gradient but otherwise relied on the default options. Following Bhat and Sidharthan (2011) we initialize the optimizer at the true values and use one random permutation per observation for the SJ approximation, which stays the same during the optimization. Note that we always fit both the unrestricted and the restricted model and that we do not use a simple 'plug-in technique' to compute the restricted pseudo-likelihood/estimate.
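The data generating process just described can be sketched as follows. The function name, and the simplifying choice of a design array shared across persons, are ours:

```python
import numpy as np

def simulate_mixed_mnp(n_persons, n_choices, b, Omega, X, rng):
    """Simulate panel choices from a mixed MNP model: for person n a
    taste vector beta_n ~ N(b, Omega) is drawn once, utilities are
    X @ beta_n plus iid N(0, 0.5) errors, and the alternative with
    maximal utility is chosen.

    X : (n_choices, n_alts, n_vars) design array (shared across
        persons here for brevity)
    """
    n_alts = X.shape[1]
    choices = np.empty((n_persons, n_choices), dtype=int)
    L = np.linalg.cholesky(Omega)            # for correlated draws
    for n in range(n_persons):
        beta_n = b + L @ rng.standard_normal(len(b))
        eps = np.sqrt(0.5) * rng.standard_normal((n_choices, n_alts))
        utilities = X @ beta_n + eps         # (n_choices, n_alts)
        choices[n] = utilities.argmax(axis=1)
    return choices
```

In the paper's setup one would re-draw the design per person and feed the resulting choices to the MACML estimation routine.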
In order to compute the CLAIC, the CLBIC and most tests, we need estimators of the Godambe information matrix. A simple estimate of the sensitivity matrix $H$ is given by the negative Hessian of the composite log-likelihood evaluated at the CML estimate $\hat{\theta}$,

$\hat{H}_H = -\frac{1}{N}\sum_{n=1}^{N} \nabla^2_\theta \log L_{CML,n}(\hat{\theta}).$

An alternative estimator is derived by exploiting that the respective information identities hold for the marginal likelihoods (see Lindsay et al. (2011)),

$\hat{H}_S = \frac{1}{N}\sum_{n=1}^{N} \sum_{t<t'} \nabla_\theta \log P(y_{nt}, y_{nt'} \mid \hat{\theta})\; \nabla_\theta \log P(y_{nt}, y_{nt'} \mid \hat{\theta})',$

where $P(y_{nt}, y_{nt'} \mid \theta)$ is a pairwise likelihood of the $n$-th subject.
The estimator $\hat{H}_S$ relies only on gradients and can be computed on the fly, while the numerical computation of $\hat{H}_H$ sometimes takes longer than the estimation of the corresponding model. Furthermore, like (Claeskens and Hjort; 2008, 43f) we found that the estimates of $H$ and $J$ are subject to (at times severe) sampling variability, which was in general lower for the gradient-based estimator. We therefore base our analysis on the estimator $\hat{H}_S$.
For our setup, and for MNP models in general, the sample size $N$ is certainly larger than the number of repeated observations, and therefore the variability matrix can be estimated by the outer product of the individual composite scores,

$\hat{J} = \frac{1}{N}\sum_{n=1}^{N} s_n(\hat{\theta})\, s_n(\hat{\theta})'.$

Alternative methods to compute the Godambe information matrix include a jackknife-based estimator (for the general idea see (Joe; 1997, p. 302ff) and for an application see Zhao and Joe (2005)). Furthermore, it is possible to obtain estimates of $H$ and $J$ by simulation (see Cattelan and Sartori (2016)). We leave those options for further research.
4.1 Variable Selection
In this section we address the selection of variables. The setup is simple in that we simulate the addition of a fixed effect. We still assume that the other four variables are Random Effects (REs), so the covariance matrix $\Omega$ to be estimated is only $4 \times 4$. We only consider uncorrelated REs, i.e. a diagonal $\Omega$.
The decision involves only one parameter, which is either restricted to zero or estimated from the data. In order to explore the power of the tests we vary the true value of this parameter from 0.1 to 0.5 in steps of 0.1. This is a simple task, which does not even require the use of a (composite) likelihood-ratio test, but the results offer a first look into the performance of the various tests. Note that by definition the adjustments $CLR_1$, $CLR_2$ and $CLR_{inv}$ coincide in the one-parameter case.
The results are given in Table 1. First note that we have included the naive composite likelihood ratio test, which is based on using (5) with a $\chi^2_1$ distribution without any corrections. As shown in the first column, the size (rejection rate at the true value 0) of this test is severely inflated: even though the nominal size is 0.05, the empirical size is nearly six times larger. Furthermore, the power (columns 0.1 to 0.5) seems satisfactory, which is to be expected because the test has a high rejection rate anyway. The size inflation worsens for rising sample sizes.
n     test            0      0.1    0.2    0.3    0.4    0.5
300   naive CLRT      0.304  0.872  0.996  1.000  1.000  1.000
      corrected CLRT  0.082  0.600  0.972  0.998  1.000  1.000
      eL test         0.049  0.533  0.971  0.998  1.000  1.000
      CLAIC           0.971  0.997  1.000  1.000  1.000  1.000
      CLBIC           0.924  0.989  0.999  1.000  1.000  1.000
500   naive CLRT      0.346  0.944  1.000  1.000  1.000  1.000
      corrected CLRT  0.125  0.792  0.996  1.000  1.000  1.000
      eL test         0.066  0.760  0.997  0.999  1.000  1.000
      CLAIC           0.966  0.999  1.000  1.000  1.000  1.000
      CLBIC           0.900  0.996  1.000  1.000  1.000  1.000
1000  naive CLRT      0.375  0.991  1.000  1.000  1.000  1.000
      corrected CLRT  0.159  0.940  0.999  1.000  1.000  1.000
      eL test         0.064  0.946  1.000  1.000  1.000  1.000
      CLAIC           0.947  0.999  1.000  1.000  1.000  1.000
      CLBIC           0.863  0.996  1.000  1.000  1.000  1.000
Regarding the various corrected tests, our simulations reveal that the corrections work, but those tests still have an inflated size. Again, while the nominal size is 0.05, the empirical sizes we observe are in the worst case nearly 3 times larger. All tests reject more often than theory would suggest. We observe that the size inflation gets more pronounced as the sample size grows. Since those tests are constructed by premultiplying (5) with a corrective term, this hints at the correction not being able to keep up with the inflation inherited from the naive statistic. The power of each test is satisfactory and improves for larger sample sizes. However, the second column reveals problems for small deviations in small data sets, which is to be expected. Finally, note that the power results are not adjusted for the difference between nominal and empirical size.
For the empirical likelihood test, which is not based on a correction of (5), the empirical size is closer to the nominal size and not or only slightly inflated. Furthermore, the results suggest that the performance of the empirical likelihood test is in general superior to that of the correction-based tests because its power is higher or equal for all parameter values. This dominance is especially pronounced for sample size 1000, which is not surprising because, as discussed in section 3.1, empirical likelihood methods have the drawback of slow convergence.
The performance difference between the correction-based tests and the empirical likelihood test is also depicted in Figure 1, where we have plotted the empirical distribution function of the observed values of the test statistics as well as the distribution they aim to approximate. It is clearly visible that the empirical likelihood statistic closely follows the desired distribution while the other tests produce too many large values. Even though this is a one-dimensional example, where we only need to divide by the relevant eigenvalue to attain the desired distribution, the estimation of this eigenvalue seems to be imprecise.
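The size computations reported in these tables amount to comparing simulated test statistics against a chi-square critical value. A minimal sketch, with an illustrative function name:

```python
import numpy as np
from scipy.stats import chi2

def empirical_size(statistics, df, alpha=0.05):
    """Share of simulated test statistics exceeding the chi-square critical
    value; for a well-calibrated test this should be close to alpha."""
    critical = chi2.ppf(1.0 - alpha, df)
    return float(np.mean(np.asarray(statistics) > critical))
```

Feeding in statistics that are stochastically larger than the reference chi-square distribution, as the naive test produces, yields exactly the kind of inflated empirical size seen in the first column of Table 1.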
In Table 1 the last two rows for every sample size contain information regarding the performance of the information criteria. We chose to show the empirical probability that the larger model is selected, which should be close to one for all parameter values except for the first column. We see that regardless of the underlying true model both criteria are highly in favor of the larger model. An interesting fact to note from the first column is that even the CLBIC predominantly selects the 'wrong' model just like the CLAIC, despite being a consistent IC. With rising sample sizes the probability to select the true model goes up, which is a reminder that this property only holds asymptotically.
4.2 Choosing Covariance Structures
In this section we address the issue of choosing between different covariance structures for the REs. More specifically, we aim to test whether the off-diagonal entries of the RE covariance matrix are zero (diagonal covariance). In order to assess the empirical size of the test we first simulate 1000 data sets from a model with diagonal covariance. We address the power of the tests using covariance matrices that mimic a Toeplitz matrix,
In order to represent various degrees of correlation between the REs we vary the true correlation parameter from 0.1 to 0.4 in steps of 0.1. The decision involves the ten off-diagonal parameters, which in the estimation are either all restricted to zero or all estimated from the data without restrictions.
Table 2 shows that for the current setup the performance of the tests differs from that of the last section. First, the problems of the naive test are even more pronounced than for the variable selection task, as the null hypothesis is almost always rejected. However, contrary to the results of the last section, the size inflation of the corrected tests is less severe, but the difference between nominal and empirical size is, again, growing with the sample size. Furthermore, while two of the corrected tests are again almost equal in performance, the third differs: it has an empirical size which is more inflated for small samples, but its power is higher even when (for sample size 1000) its size inflation is equal to that of the other two.
For this setup the empirical likelihood test suffers from size inflation, which for the smallest sample size is more severe than for the correction-based tests. As in the previous example, the test has the empirical size closest to the nominal level for a sample size of 1000. This serves as a reminder that empirical likelihood methods might not be adequate for small sample sizes. Furthermore, for sample size 1000 the power is lower than that of the correction-based tests, at least for small correlation values.
Regarding the ICs it is clear that both CLAIC and CLBIC again favor the larger model, that is the unrestricted correlation structure, over the diagonal correlation matrix. This effect is extremely pronounced, as the smaller model is almost never selected, and illustrates that selecting the covariance structure using ICs will not work.
n     test           0      0.1    0.2    0.3    0.4
300   naive CLRT     0.926  0.938  0.975  0.983  0.996
      corrected (1)  0.077  0.178  0.415  0.726  0.924
      corrected (2)  0.071  0.171  0.409  0.723  0.923
      corrected (3)  0.108  0.198  0.434  0.742  0.926
      eL test        0.165  0.253  0.508  0.774  0.949
      CLAIC          1.000  1.000  1.000  1.000  1.000
      CLBIC          0.999  0.999  0.999  1.000  1.000
500   naive CLRT     0.905  0.947  0.984  0.992  0.999
      corrected (1)  0.091  0.231  0.594  0.907  0.986
      corrected (2)  0.090  0.226  0.587  0.906  0.984
      corrected (3)  0.109  0.254  0.622  0.908  0.987
      eL test        0.099  0.221  0.549  0.879  0.989
      CLAIC          0.999  0.999  1.000  1.000  1.000
      CLBIC          0.999  0.999  0.997  1.000  1.000
1000  naive CLRT     0.929  0.959  0.985  1.000  1.000
      corrected (1)  0.106  0.339  0.869  0.993  1.000
      corrected (2)  0.104  0.335  0.867  0.993  1.000
      corrected (3)  0.104  0.354  0.865  0.993  1.000
      eL test        0.078  0.233  0.794  0.991  1.000
      CLAIC          0.998  0.996  1.000  1.000  1.000
      CLBIC          0.994  0.994  0.998  1.000  1.000
5 Model averaging
An alternative to model selection is model averaging (MA). In the following we only refer to frequentist model averaging because we are concerned with MACML estimation, a frequentist method. However, it is worth noting that there exists a large body of literature on Bayesian MA.
In MA instead of selecting just one model, the estimates from several competing models are combined in a weighted average.
In this section we assume that the models are labeled consecutively. A model is characterized by setting some coordinates to prespecified values (often zero) and letting the other coordinates vary freely. The main case in this respect is regressor selection, where setting a coordinate to zero implies that the corresponding regressor does not influence the outcome.
The estimated $j$-th parameter after model averaging might then be written as

(11)  $\hat{\theta}_j = \sum_{m=1}^{M} w_m \hat{\theta}_{j,m}$

where the weights $w_m$ are defined such that $\sum_{m=1}^{M} w_m = 1$ and where $\hat{\theta}_{j,m}$ equals the restricted value if the $j$-th parameter is not contained in model $m$. Different versions of MA methods use different weights $w_m$.
A simple class of MA methods is based on information criteria, where the weight of every model is derived from its respective AIC or BIC value. These ad-hoc methods, which originate from Buckland et al. (1997), aim to incorporate the uncertainty involved in model selection. The weights for a given sample are defined as

(12)  $w_m = \exp(-\mathrm{IC}_m/2) \Big/ \sum_{l=1}^{M} \exp(-\mathrm{IC}_l/2)$

where in our case $\mathrm{IC}_m$ is either the CLAIC or CLBIC calculated for model $m$. Note that in order to simplify the calculations the IC values are often normalized as differences to the smallest value; in either case the weights sum up to one by definition. Furthermore, all weights are positive by definition. A theoretical justification for this model averaging strategy is provided in Burnham and Anderson (2002), where it is shown that asymptotically, even when those weights are computed from the AIC, they might be interpreted as the (Bayesian) posterior probability that the model is correct (see section 6.4.5 in (Burnham and Anderson; 2002)).
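A sketch of these IC-based weights in the form of eq. (12); the function name is illustrative. Subtracting the smallest IC value before exponentiating, as mentioned above, avoids numerical underflow and leaves the weights unchanged:

```python
import numpy as np

def ic_weights(ic_values):
    """Akaike-type model averaging weights from CLAIC or CLBIC values.

    Normalizing by the minimum IC before exponentiating avoids underflow
    for large IC values and does not change the resulting weights."""
    ic = np.asarray(ic_values, dtype=float)
    delta = ic - ic.min()          # IC differences relative to the best model
    w = np.exp(-0.5 * delta)
    return w / w.sum()             # weights sum to one and are all positive
```

By construction the best (smallest-IC) model receives the largest weight, and models far behind in IC units receive exponentially small weight.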
Alternatively, asymptotically optimal MA methods base the weight choice on the minimization of some criterion function, e.g. the asymptotic Mean Squared Error (MSE) of the parameter or of the prediction of certain focus quantities like conditional probabilities. These techniques are typically motivated within a local misspecification framework proposed in Hjort and Claeskens (2003): The parameter vector is partitioned into a set of parameters that are shared by all candidate models and some additional parameters which only appear in some models. Therefore, there is a range of models, starting with a 'narrow model', which contains only the mandatory parameters, and ending with the 'wide model', which is based on the full set of parameters. Within the local misspecification framework, geared towards difficult-to-separate cases, the data is assumed to be generated by parameters converging to the narrow model. In this setting the bias from not including a variable in the model is of the same magnitude as the squared error due to sampling variability, making the decision of whether to include the variable based on the available amount of data a hard one.
Wan et al. (2014) show how to derive such a setup for Multinomial Logit Models in connection with maximum likelihood estimation, but, to the best of our knowledge, no results are available for either MNP or CML methods in general up to now.
Following Hjort and Claeskens (2003) an optimal MA is based on the following insight:
Theorem 5.1
Let
and let .
Further let (variance under ) and .
Let the model be such that the function is three times continuously differentiable in a neighborhood of the true parameter, where all derivatives are dominated by functions with finite means. Furthermore, the variables have finite fourth moments. Then, if the data is generated as iid draws under the sequence of local alternatives, we have:
(13) 
The theorem follows from Lemma 3.1 in Hjort and Claeskens (2003), as the assumptions directly imply assumptions (C2) and (C4) there. A slight difference occurs because in the current case the second derivative of the log-likelihood does not equal the negative variance of the score; thus the sensitivity and the variability matrix both appear in the limiting distribution instead of a single information matrix.
This theorem can be applied to the current setting if the number of choices is constant over individuals (generalizations to cases with an unequal number of choices per individual are immediate).
Note in this respect that using the SJ approximation to the MVNCDF implicitly defines a model by providing a mapping from parameters and regressors onto choice probabilities. The corresponding model will be close to the MNP but not identical to it. Consequently these two models and the corresponding estimation methods need to be dealt with separately.
Theorem 5.2
Assume that the data are generated by an MNP with repeated choices for each individual, where the regressors are uniformly bounded. Furthermore, the parameter set is compact and such that the eigenvalues of the covariance matrix are bounded away from zero. In this case the conditions of Theorem 5.1 hold for the MNP model. The same conclusions hold for data generated according to the model implied by the SJ approximation.
The proof of the theorem is based on the fact that the MVNCDF is continuously differentiable in all its arguments (see Plackett (1954) for derivatives with respect to entries in the correlation matrix; derivatives with respect to upper integration limits are obviously given by the corresponding PDF). Thus also higher-order derivatives are continuously differentiable. The bounds on the regressors and on the eigenvalues of the covariance matrix imply bounds on the upper limits as well as on the eigenvalues of the correlation matrices occurring in (3). In the SJ case, likewise, only Gaussian CDFs and PDFs occur, and hence the result is immediate. Details are omitted.
Standard asymptotic expansions of the score in combination with the mean value theorem then provide the following consequence of Theorem 5.1:
Theorem 5.3
Under the conditions of Theorem 5.2, let the estimator of model $m$ be defined (over the compact parameter set, of which the true parameter is an interior point) by fixing certain entries of the parameter vector at their prespecified values while the remaining ones are estimated. Then
for some matrix .
If furthermore the focus function is continuously differentiable at the true parameter with gradient, then in the limit the focus estimate is normally distributed with the corresponding mean and variance. Consequently, $n$ times the mean square error for the model asymptotically equals
PROOF: The proof follows standard mean value expansions: As the estimator maximizes the CML function at an interior point of the parameter set, the derivative of the CML function at the estimator is zero:
Here denotes an intermediate value. Standard theory implies that where denotes the projection onto the free coordinates within . The limit for the estimation error then follows from the fact that
where consequently
The asymptotic distribution of the focus estimate then follows from applying the Delta method. The formula for the asymptotic MSE is then obvious.
Therefore the asymptotics for all models are driven by the same random variable. It follows that the estimation errors for all models jointly are asymptotically normal. This allows the calculation of the asymptotic MSE also for the averaged model using weights:
(14)  
where the next-to-last equation defines the matrix Q and the vector of weights w. The optimal weights for estimating the focus value are then provided by the solution to
(15)  $\min_{w} \; w^\top Q w \quad \text{subject to} \quad \sum_{m=1}^{M} w_m = 1$
In practice the matrix Q is not known, but it contains only quantities that can be estimated consistently, except for the local deviation parameter. In this respect it is customary to use the estimate from the wide model. This estimate is unbiased but not consistent. Note that, contrary to the weights from the IC-based approach, the model averaging weights that result from (15) may be either positive or negative. Without further restrictions the optimal weight vector can be calculated explicitly.
The matrix Q is not necessarily nonsingular; its rank is bounded. Conditions for nonsingularity of Q can be derived for particular settings, see Charkhi et al. (2016). This also implies that the estimation of a limited number of models is sufficient in order to obtain minimum-MSE weightings. This number typically is much smaller than the number of all possible combinations of regressor selections.
The main steps to compute an optimally averaged estimator may be summarized as:

- Decide on a list of candidate models.

- Estimate the wide model and obtain an estimate of the local deviation parameter.

- Estimate its sensitivity as well as its variability matrix.

- Estimate the remaining models, then compute the weights using (15) and the averaged estimate.
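Assuming the estimated matrix Q is nonsingular, the constrained minimization in the last step has a closed-form solution via a Lagrange multiplier. A minimal sketch, with an illustrative function name:

```python
import numpy as np

def mse_optimal_weights(Q):
    """Weights minimizing w' Q w subject to sum(w) = 1.

    Closed form from the Lagrangian: w = Q^{-1} 1 / (1' Q^{-1} 1).
    Q is the estimated asymptotic MSE matrix across candidate models;
    assumes Q is nonsingular. Unlike IC-based weights, entries may be
    negative."""
    Q = np.asarray(Q, dtype=float)
    ones = np.ones(Q.shape[0])
    q_inv_one = np.linalg.solve(Q, ones)   # Q^{-1} 1 without explicit inverse
    return q_inv_one / (ones @ q_inv_one)
```

For a diagonal Q this reduces to inverse-MSE weighting, i.e. models with larger asymptotic MSE receive proportionally smaller weight.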
This description leaves open the choice of the focus quantity, which in this context is usually called the focus parameter. A number of options in this respect are:

- Extracting one parameter of interest.

- A prediction such as the conditional probability of a particular choice.

- Alternatively, the sum of squared errors corresponding to all parameters can be used.
In summary, we have introduced two model averaging strategies: the first is rather ad hoc but has the benefit that the weights are easily computed from either CLAIC or CLBIC; the other involves additional computations but features a strong theoretical framing (optimality w.r.t. the asymptotic MSE of some focus quantity).
6 Model averaging: Comparison by simulation
In this section we use a rather simplistic approach to explore the performance of model averaging in that we only average over the two different covariance structures involved in the test decisions of section 4, where the models differ in 10 parameters.
The setup was described in the introduction to section 4, but here our focus is on the estimation accuracy of the previously discussed averaging estimators when compared to model selection. The results are based on the Mean Absolute Error (MAE) over all Monte Carlo samples with regard to the, arbitrarily chosen, third parameter in b. So the difference between the models lies in the covariance structure, while we are interested in the impact of this difference on the estimates of a linear parameter. We first present the MAE for the case of selecting the model using either CLAIC or CLBIC, followed by the errors for different MA methods. Note that, as discussed previously, the probability to select the restricted model is rather low and gets lower for rising correlation, such that this MAE is mainly driven by the unrestricted model.
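The MAE criterion used throughout this comparison is simply the absolute deviation from the true value, averaged over the Monte Carlo replications; a minimal sketch with an illustrative function name:

```python
import numpy as np

def mean_absolute_error(estimates, true_value):
    """MAE of a scalar parameter estimate over Monte Carlo replications."""
    return float(np.mean(np.abs(np.asarray(estimates) - true_value)))
```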
In Table 3 we see that, as expected, the MAE for all methods gets smaller for rising sample sizes because the underlying estimation is getting better. Furthermore, we see that the errors of model selection, which are presented in the first two rows, also get smaller for higher correlation values, which is due to the fact that in those cases both ICs always select the unrestricted model.
The next two rows present the MAE for averaging estimators based on an IC (see (12)). We observe that these methods perform worse than model selection for the two smaller sample sizes but have lower errors for all but the highest correlation values at sample size 1000. However, it is clearly visible that the error does not improve for rising correlation values even though the weighting decision should get easier.
The results for an averaged estimator where the weights are chosen to minimize the asymptotic MSE are given in the last row. Again we observe better performance for larger samples, but the results also show that this averaging estimator is superior to the IC-based MA estimators. We further observe that this model averaging estimator outperforms model selection for small correlation values. It is important to recapitulate the results from Table 2, which show that the restricted model is almost never selected by CLAIC/CLBIC even when it is the true model. This limited simulation exercise points out that model averaging, especially when the weights are chosen MSE-optimally, might provide a solution to this problem.
n     method         0      0.1    0.2    0.3    0.4
300   CLAIC Select.  0.201  0.203  0.193  0.177  0.160
      CLBIC Select.  0.201  0.205  0.193  0.177  0.160
      CLAIC Aver.    0.225  0.231  0.234  0.222  0.221
      CLBIC Aver.    0.225  0.230  0.234  0.222  0.221
      oMSE Aver.     0.226  0.219  0.214  0.196  0.184
500   CLAIC Select.  0.153  0.151  0.151  0.143  0.128
      CLBIC Select.  0.153  0.151  0.151  0.143  0.128
      CLAIC Aver.    0.158  0.152  0.156  0.159  0.154
      CLBIC Aver.    0.158  0.152  0.156  0.159  0.154
      oMSE Aver.     0.157  0.149  0.150  0.147  0.136
1000  CLAIC Select.  0.121  0.122  0.120  0.116  0.102
      CLBIC Select.  0.121  0.122  0.119  0.116  0.102
      CLAIC Aver.    0.117  0.110  0.113  0.116  0.116
      CLBIC Aver.    0.117  0.111  0.113  0.116  0.116
      oMSE Aver.     0.116  0.112  0.113  0.113  0.106
7 Model Averaging: Empirical Example on Mobility Motifs
In this section we illustrate the use of the most promising of the previously discussed model averaging methods (asymptotically MSE optimal weights) using data from the German Mobility Panel (MOP). The MOP is a panel mobility survey with a rotating sample, keeping responding households in the sample for three consecutive years. Each year households are asked to record trips for all members of the household for one randomly assigned week. The corresponding trip diary conforms to the KONTIV design and collects trip start and end time and location, transport means, distances covered and trip purposes (for more details on the data generation and characteristics see (Zumkeller and Chlond; 2009)). For this paper we use data from 2013.
Based on the trip diary we compile mobility motifs for each person (Schneider et al.; 2013): Here a motif is a graph containing nodes and directed edges, compare Figure 2. The nodes represent locations people visited, the edges movements between locations. The significance of the motifs lies in the fact that in different data sets from different cities it has been observed that, out of the large number of possible motifs with up to six nodes, only relatively few occur frequently (Schneider et al. (2013) list 17 motifs covering more than 90% of all day observations), with the frequencies of occurrence also being remarkably similar across studies.
Besides these empirical facts, motifs are also of interest for simulation models of mobility behaviour. Often such models only cover a single trip, while clearly trip chains exist and hence choices for one trip depend on the other trips within a day. Most often such dependencies are not or only partly taken into account (see (Cascetta; 2009, p. 219ff)). In this respect it might be postulated that people in a first step choose one of the possible motifs in order to decide on the various locations to be visited and also on the sequence of visits to these locations. Only in a second step are the actual trips planned in detail. Such activity plans also lie at the heart of some mobility simulation models such as the ones implemented in MATSim (Horni et al.; 2016).
For this case study, we limit our analysis to the workdays of the year 2013. The motifs were computed from the trip diaries of 2369 participants; in this data set the 15 most common motifs (as depicted in Figure 2) account for 92.6 percent of all choices and, therefore, we summarize the remaining 7.4 percent as 'other'.
In this case study we investigate the choice of the actual motif on a workday based on underlying sociodemographic characteristics of the persons. Cascetta (2009) argues that it is mainly the occupational status of an individual that influences individual trip-chaining. We will explore this hypothesis in this case study. To do so we consider a limited set of explanatory variables: a dummy which indicates that a person is not in full-time employment, a gender dummy, as well as dummies for four age groups (10-17, 18-25, 26-60, 61 and older). [Footnote 10: Note that in Germany the legal age to acquire a (full) driver's license is 18. Furthermore, compulsory school attendance ends around the age of 18 in most of the federal states. According to official figures the mean age of retirement in Germany was 61.9 in 2016 (see (Deutsche Rentenversicherung; 2016)).] Those variables, which are included in the model as fixed effects, are summarized in Table 4.
Variable    Value                   Frequency
occupation  full-time employed      819
            not full-time employed  1550
gender      male                    1181
            female                  1188
age         10-17                   173
            18-25                   116
            25-61                   1247
            61+                     833
N                                   2369
We explore the effect those variables have separately for each potential choice. As we observe five repeated choices for each participant, we model the Alternative Specific Constants (ASCs) as random effects, and the correlation between those random effects allows us to explore the substitution pattern between different motifs. In order to ensure identification, the coefficients for the 'other' choice are fixed to one, as is the corresponding variance. That leaves the model with 90 linear parameters and, because the covariance parameters are estimated via the corresponding Cholesky decomposition, with up to 120 covariance parameters.
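The Cholesky parameterization mentioned above guarantees a valid (positive semidefinite) covariance matrix for any parameter vector. A minimal sketch with an illustrative function name; note that k*(k+1)/2 free entries for k = 15 random effects yields exactly the up to 120 covariance parameters stated above:

```python
import numpy as np

def chol_to_cov(theta, k):
    """Map k*(k+1)/2 free Cholesky entries to a k x k covariance matrix.

    Estimating the lower-triangular factor L instead of the covariance
    itself keeps Omega = L L' positive semidefinite for any theta."""
    L = np.zeros((k, k))
    L[np.tril_indices(k)] = theta   # fill lower triangle row by row
    return L @ L.T
```

Restricted specifications such as the diagonal-RE model correspond to fixing the off-diagonal entries of L (and hence of the covariance) to zero.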
Following the steps outlined in section 5, we start by specifying the list of candidate models. For this modeling exercise the specification of the covariance is of major concern because prior research suggests that mobility patterns are quite stable during the work week and that, therefore, some substitution patterns might not be relevant (see (Schneider et al.; 2013)). Furthermore, Schneider et al. (2013) discuss that substitution in general can be explained by a rule-based system in which several substitutions do not occur at all. Instead of selecting a covariance structure, which could be done using the methods discussed previously, we estimate several possible specifications, ranging from the diagonal RE specification without any correlation to a model featuring an unstructured covariance matrix. [Footnote 11: Note that in order to ensure identification of the unstructured covariance we constrained all correlations between the 'other' option and the remaining motifs to zero.] In detail we consider four models:

- Unrestricted: correlations between the ASCs of all 15 motifs,

- Block1: substitution between the first seven (simple) motifs and the other (more complex) motifs, but no correlation within those blocks,

- Block2: like Block1 but with additional correlation within the first block,

- Diagonal RE: only uncorrelated random effects for each motif.
All those models include the 90 linear parameters and at least the 15 variances of the random ASCs. Using the terminology of section 5, those parameters form the mandatory part, and the unrestricted model is the wide model, which includes all 210 parameters (see Table 5). The estimation for all four models is set up similarly to the procedure outlined in section 4, but we initialize the optimizer at random because the true values are obviously unknown.
In Table 5 we have summarized the different IC values for the models. Comparison of CLAIC and CLBIC reveals the Block2 specification as preferred by both criteria. However, we can also observe that selection by CLAIC and CLBIC would lead to different choices for the second-best model: while the CLAIC hints towards the unrestricted model, we would select the diagonal model were we to rely on the CLBIC. Furthermore, we show the weights for MSE-optimal model averaging. We focused the MSE on the 15 coefficients which indicate whether an individual is not full-time employed. The weights might appear strange at first inspection because the model with the smallest CLBIC/CLAIC gets a negative weight, but from Table 6 it is clear that those weights lead to meaningful averaged coefficients. The averaged coefficients lie between the estimates of the candidate models, which would have been impossible by just selecting one model.
Model         no. of parameters  CLAIC   CLBIC   weight
Diagonal RE   105                193298  195941   0.7580
Block1        161                193045  196258  -0.0328
Block2        182                192235  195673  -0.2320
Unrestricted  210                192485  196189   0.5068
Even though the main focus of this section is to illustrate the feasibility of model averaging for real data, we will interpret the estimates without going into too much detail. In order to facilitate interpretation we have added a plot which contains the motifs alongside the estimated averaged coefficients. We see that the estimates are plausible, as individuals without full-time employment seem to (I) stay at home more often and (II) favor motifs with a hub structure, returning home in between visits to different locations, over roundtrips. An interesting result is that the two motifs with the highest coefficients represent distinct activity patterns: the stay-at-home motif (blue box) and the hub motif with four non-home locations (green box). The model would most likely benefit from considering interactions of the non-full-time-employment dummy with other variables, which might explain the reason for the occupational status, to further explore this paradox.
Another possibility is to use model averaging to recover information about the correlations between the ASCs. In this case we focus the MSE on all linear parameters, but for this empirical example we will only look at the estimated correlation matrices. First, note that by shifting the focus we obtain different weights,
where the weights correspond to the four specifications. We see that one model receives the smallest weight, which might be interpreted as a penalty for the high standard errors of its estimate. When compared directly, we see that the estimates of the correlation matrix of the random effects for the unrestricted model (Figure 4) and the averaged estimate (Figure 5) reveal similar but not identical patterns. The ordering of the motifs is identical to that depicted in Figure 2; therefore, motifs with lower numbers are observed more often. The first insight from those heatmaps is that the third motif is special in that its correlation with all other motifs is low in comparison to the other common motifs. This is plausible because it is the stay-at-home motif, and it is observed in the estimates from the averaged model as well as from the unrestricted model.
Model  Mot1   Mot2   Mot3   Mot4   Mot5   Mot6   Mot7   Mot8   Mot9   Mot10  Mot11  Mot12  Mot13  Mot14  Mot15
       1.165  1.205  1.612  0.994  1.050  1.375  0.934  1.050  1.353  0.873  1.709  1.136  1.041  0.865  1.224
       1.157  1.199  1.587  0.985  1.036  1.368  0.928  1.013  1.311  0.880  1.723  1.173  1.012  0.852  1.224
       1.140  1.164  1.597  0.971  1.025  1.323  0.948  1.009  1.339  0.864  1.695  1.123  0.945  0.806  1.203
       1.135  1.178  1.585  0.984  1.031  1.329  0.945  0.997  1.298  0.886  1.692  1.166  1.007  0.831  1.216
       1.145  1.187