Model selection and model averaging in MACML-estimated Multinomial Probit (MNP) models

# Model selection and model averaging in MACML-estimated Multinomial Probit (Mnp) models

Manuel Batram111Corresponding author; E-mail adress: Manuel.Batram@uni-bielefeld.de Department of Economics, Bielefeld University, Postfach 10 01 31, D-33501 Bielefeld, Germany. Dietmar Bauer Department of Economics, Bielefeld University, Postfach 10 01 31, D-33501 Bielefeld, Germany.
###### Abstract

This paper provides a review of model selection and model averaging methods for multinomial probit models estimated using the Maximum Approximate Composite Marginal Likelihood (MACML) approach. The proposed approaches are partitioned into test based methods (mostly derived from the likelihood ratio paradigm), methods based on information criteria and model averaging methods.
Many of the approaches first have been derived for models estimated using maximum likelihood and later adapted to the composite marginal likelihood framework. In this paper all approaches are applied to the MACML approach for estimation. The investigation lists advantages and disadvantages of the various methods in terms of asymptotic properties as well as computational aspects. We find that likelihood-ratio-type tests and information criteria have a spotty performance when applied to MACML models and instead propose the use of an empirical likelihood test.
Furthermore, we show that model averaging is easily adaptable to Composite Marginal Likelihood (CML) estimation and has promising performance w.r.t to parameter recovery. Finally model averaging is applied to a real world example in order to demonstrate the feasibility of the method in real world sized problems.

AIC
Akaike Information Criterion
APB
Average Percentage Bias
ASC
Alternative Specific Constant
BIC
Bayesian Information Criterion
bME
bivariate Mendell-Elston
CCL
Composite Conditional Likelihood
CDF
Cumulative Distribution Function
CEF
Conditional Expectation Function
CL
Composite Likelihood
CLAIC
Composite Likelihood Akaike Information Criterion
CLBIC
Composite Likelihood Bayesian Information Criterion
CML
Composite Marginal Likelihood
c.p.
ceteris paribus
DCM
Discrete Choice Models
DGP
Data Generating Process
DM
Decision Maker
GHK
Geweke-Hajivassiliou-Keane
IC
Information Criterion
iid
independent, identically distributed
KLD
Kullback-Leibler Divergence
LLN
Law of Large Numbers
MA
model averaging
MACML
Maximum Approximate Composite Marginal Likelihood
MAE
Mean Absolute Error
MC
Monte Carlo
MCMC
Markov Chain Monte Carlo
ME
Mendell-Elston
ML
Maximum Likelihood
MNP
Multinomial Probit
MOP
German Mobility Panel
MSE
Mean Squared Error
MSL
Maximum Simulated Likelihood
MVNCDF
multivariate normal cumulative distribution function
OLS
Ordinary Least Squares
PQD
RE
Random Effect
RMSE
Root Mean Squared Error
RUM
Random Utility Model
RV
Random Variable
SEM
Structural Equation Models
SJ
Solow-Joe
SLLN
Strong Law of Large Numbers
ULLN
Uniform Law of Large Numbers
WLLN
Weak Law of Large Numbers
WTP
Willingness to Pay

## 1 Introduction

The two most commonly used families of discrete choice models are the multinomial logit (MNL) and the multinomial probit (MNP) model. Between these two the MNP offers better modeling flexibility at the expense of higher computational costs. To alleviate the computational burden of MNP estimation, the MACML estimation approach combines CML and an analytic approximation to allow for simulation-free, and fast (but approximate) estimation of MNP models (see (Bhat; 2011)). Bhat and coworkers have shown that estimation in the MACML framework is very fast (see Cherchi et al. (2016)) and can be applied even for complicated MNP models (see e.g. Kamargianni et al. (2015)).

Even though it is sometimes possible to specify a discrete choice model solely through theoretical reasoning, more often than not the researcher has to rely on model selection in order to arrive at a useful model. Despite Bhat (2011) as well as Bhat (2014) offering some limited advice, the literature on model selection for MNP models using the MACML estimation methodology is not extensive. Moreover, model averaging which has been proposed in a number of different contexts (see e.g. Gao et al. (2016), Hjort and Claeskens (2003), Hansen (2007), Wan et al. (2014)) has not been dealt with in the context of CML estimation.

Therefore this paper aims to stimulate discussion on this important topic by surveying proposals for model selection methods for MNP models estimated using MACML.222Note that in order to safe space we will sometimes violate the distinction between estimation method and model by abbreviating ’MNP models estimated using MACML’ as MACML models. We study those methods under the premise of a given, finite collection of competing models denoted as , from which the researcher selects one using a data-driven procedure.

In the literature one finds two approaches for model selection. The first approach is based on a penalized goodness-of-fit function which is associated with each of the competing models. This function incorporates the trade-off between model fit and model complexity. In the context of MACML this procedure is based on information criteria (IC) derived from the respective pseudo-likelihoods. The decision rule is that the model with the smallest, estimated value is retained. In an abstract sense this partitions the space of all observations into regions where a particular model is chosen.

The alternative approach is based on repeated hypothesis testing wherein the final model choice is taken by repeatedly performing hypothesis tests of a restricted model versus an unrestricted model. Stepwise regression techniques are examples of this procedure. Again a partitioning of the observation space results.

The estimated parameter after model selection based on either approach therefore might be presented as

 ^θj=∑m∈M^wm(Z)^θmj,

where denotes the observations and the weight is defined as

 ^wm(Z)={1,model m is selected by the decision % rule based on observations Z0,otherwise.

Note that the weights are random as they are also ’estimated’ from the data because the components of the model selection procedure depend on the given sample. Furthermore the decision rules only allow for the selection of a single model, which is equivalent to .

As an alternative to model selection in this paper we introduce model averaging methods for models estimated using the MACML paradigm. The idea of model averaging (MA) is to use the information from all models and not only the ’winning’ model. The first motivation for doing so is the acknowledgment that model selection based on just one sample is subject to randomness. From the theoretical point of view, MA avoids problems related to post-model-selection inference (see e.g. (Claeskens and Hjort; 2008, 206ff) or Leeb and Pötscher (2005)). MA is also appealing to the practitioner because it has been shown that predictions are better in terms of mean squared error when they are based on model averaging rather than model selection (see e.g. Gao et al. (2016)).

The estimated parameter after model averaging might be presented as,

 ^θMAj=∑m∈M^wm(Z)^θmj, (1)

where the weights fulfill . There are several model averaging procedures which differ in the way the weights are justified and computed.

In this paper we present the adaptations of the concepts well known in the context of maximum likelihood estimation to the MACML case. In all cases we discuss advantages and disadvantages of the various approaches focusing on the analytical and numerical properties. This is in particular of importance as model selection and model averaging requires the computation of a number of models and hence exacerbates computational issues already occurring for the estimation of a single model estimation.

The organization of the paper is as follows. First, we succinctly review the MACML framework and argue that the MACML pseudo-likelihood differs from the standard CML framework. In the second section we review test- and information criterion-based model selection and discuss computational challenges. As a new contribution this section introduces a test which is framed in empirical likelihood theory. This is followed by a simulation exercise comparing the various approaches on two demonstration examples.

Third, we introduce and discuss MA for MACML followed by a simulation-based comparison of the discussed methods. In section 7 we present an empirical example based on real choice data from the German Mobility Panel in order to demonstrate the feasibility of model averaging. We conclude the article with a summary of our results and discuss practical implications of our findings.

## 2 The MACML approach

The first building block of the MACML method is CML estimation (for a general survey of this method see Varin et al. (2011)). In this paper only the special case of the so called pairwise likelihood is considered, although many results generalize also to more general CMLs. For pairwise CML the full-likelihood representing the joint probability of the observation of choices () by one individual is replaced by the product of the probabilities of all pairs of choices,

 lcml(θ)=N∑n=1logcmln(θ) =N∑n=1T−1∑t=1T∑t′=t+1logcmlntt′(θ) =N∑n=1T−1∑t=1T∑t′=t+1logP(Cnt=ynt,Cnt′=ynt′|Xn,θ). (2)

Note that the functions are valid marginal likelihoods such that key properties of the likelihood framework continue to hold. For example it directly follows that the composite score, which we define as , is of zero mean at the true parameter .

In analogy to Maximum Likelihood (ML) estimation the CML estimator is defined as

 ^θ=argmaxθlcml(θ).

Under suitable regularity conditions this estimator is consistent for (see (Molenberghs and Verbeke; 2005, 190ff)), whenever is fixed. Furthermore, it is possible to show that under suitable assumptions the estimator is asymptotically normally distributed, where the asymptotic variance is given by the inverse of the Godambe/sandwich information matrix, where the so called sensitivity matrix is defined as and the variability matrix is the variance of the composite score . This is a standard result for misspecified (in the sense of using a pseudo likelihood instead of the true one) likelihood models which is due to the fact that the information identity (also called the second Bartlett identity) does not hold (see e.g. White (1982)). In general the CML estimator therefore is less efficient than the ML estimator.

When CML is utilized to estimate MNP models, each involves the evaluation of the multivariate normal cumulative distribution function (MVNCDF): As is well known the MNP model can be derived using an underlying random utility function assigning the utility

 Untk=Xntkβ+~Xntkαn+entk

to the choice of alternative with characteristics in the -th decision of individual . Here denotes fixed parameters while – assumed to be normally distributed with zero mean and variance – allows for heterogeneity between deciders. Furthermore denotes the independent, identically distributed (iid) random errors assumed to be normally distributed with zero mean and variance . Consequently using the vector we obtain that is normally distributed.
It follows that the probability that and are the choices (that is have maximal random utility) in the -th and -th choice respectively is given by

 Φ2K−2(b[Xnt:,Xnt′:,θ];0,R([Xnt:,Xnt′:,θ])), (3)

where collects all parameters contained in for appropriate functions and defining the upper limits of integration and the correlation matrix respectively. Furthermore denotes a dimensional MVNCDF corresponding to mean zero and variance-covariance matrix R.

Whenever is large, CML estimation of MNP models is still computationally demanding and the MACML approach by Bhat (2011), therefore, complements pairwise CML estimation with an analytic approximation for MVNCDFs. Bhat (2011) proposes the use of the Solow-Joe (SJ) approximation (Solow; 1990; Joe; 1995), whose general idea is to factorize the multivariate normal distribution into a product of conditional distributions, which are in turn approximated by linear projections. For a three dimensional case of calculating the MVNCDF for where are standard normally distributed with correlation between and we obtain (using ):

 Φ3(b;0,R) =P(u1≤b1,u2≤b2)P(u3≤b3|u1≤b1,u2≤b2) (4) =Φ2(b1,b2;0,R12)E[I3|I1=1,I2=1] ≈Φ2(b1,b2;0,R12)^p3|12(b,R) =:^PSJ:3|12(b,R),

where denotes the linear projection

 ^p3|12(b,R):=Φ(b3)+q(b,R)′Q(b,R)−1[1−Φ(b1),1−Φ(b2)]′.

Here the entries of and contain covariances of with , which are functions requiring the evaluation of univariate and bivariate MVNCDF functions. The positive definite matrix has smallest eigenvalues bounded away from zero for bounded b if the same holds for R.
Note that the ordering of the components for the approximation is arbitrary but influences the approximation quality. It is known that averaging over all possible permutations of coordinates improves the approximation accuracy at the price of higher computational load. Therefore typically only a fixed number (with one being a popular value) of random permutations are averaged.
Note that the approximation does not guarantee that the approximated choice probabilities for all choices sums to one. Furthermore there are no guarantees that the approximation lies in . This is typically ensured using some ad hoc interventions. For further details regarding the SJ approximation see Joe (1995).

Even though the SJ approximation is an important building block of MACML and ensures the comparatively fast estimation of complex MNP models, it is important to note that its application alters the pseudo-likelihood and that MACML estimation is, therefore, not equivalent to CML estimation. To be more precise it is possible to show that due to the SJ approximation333This is also true for several other analytic approximations and no particular problem of the SJ approximation (for further details see Batram and Bauer (2016)). Furthermore, this inconsistency is also present in Maximum Simulated Likelihood (MSL)-estimates whenever the number of simulations is finite (see e.g. Lee (1992)). the MACML estimators for data generated from an MNP is asymptotically biased and not consistent (see Batram and Bauer (2016)). To the best of our knowledge there are no general results that quantify or state bounds for the deviations between the MACML pseudo-likelihood and the CML pseudo-likelihood even in large samples, although the deviation has been shown to be small in all real world case studies performed so far.

A different way to look at the stated inconsistency is to note that, if the approximated choice probabilities are normalized to sum to one, they encode a parameterized mapping of the characteristics encoded in onto the choice probabilities. It appears that one can show444This result is work in progress. that the MACML estimator using the normalized probabilities can be used to consistently estimate the corresponding underlying parameters for this mapping. In this sense the MACML estimator regains all the usual properties of consistency and asymptotic normality for a slightly altered model.

For model selection and averaging we will need some more notation in the following sections: Assume that the parameter vector is of dimension and that we partition this vector into , where is -dimensional and has dimension . The vector contains parameters that are known to be present in the model while we are unsure whether the parameters assigned to the vector should be contained in the model. Our interest for model selection thus lies on and we want to test : . The elements of will most often equal zero. Restrictions to other values such as 1 are possible for example for variances.
With respect to the full parameter vector we denote the estimator where is constrained according to the null hypothesis by and the unconstrained estimator is . Finally, let be the trailing submatrix of the information matrix and the -dimensional trailing subvector of the score which contains only entries related to . Furthermore, denotes the corresponding trailing submatrix of .

## 3 Model selection

As discussed above, the MACML estimation is not equivalent to CML estimation. However, we start this section by surveying methods for the CML framework and then – in the following section – assess their applicability to MACML estimation by simulation.

### 3.1 Likelihood-ratio-type tests

In this section we focus only on likelihood-ratio-type tests. While it is easy to adapted score-type and Wald-type tests for the use within the CML framework, the Wald-type tests suffer from the known shortcomings (lack of invariance to reparametrization and elliptical confidence region) and the tests based on the score are numerically unreliable (see (Molenberghs and Verbeke; 2005, 193f)).

The standard likelihood-ratio tests are inapplicable because the ratio of two composite likelihoods does not adhere to an asymptotic distribution under but instead is a linear combination of independent but weighted distributions. If we are interested in testing , then

 CLR(γ)=2[lcml(^θ)−lcml(^θγ0)]a∼p∑j=1λj(Kj)2, (5)

where are eigenvalues of (as discussed those are matrices which are evaluated under the null hypothesis) and are independent standard normal random variables. Just like the standard likelihood ratio test its CML counterpart rejects whenever is large.

There are several different adjustments to , which aim to facilitate likelihood-ratio-type testing in the CML framework. Due to space constraints we will only focus on the adjustments which have shown promising results in previous studies (for a more general overview and simulation results see Pace et al. (2011) or Cattelan and Sartori (2016)).555Note, however, that the simulation studies here usually focus on simple models like estimation of moments from the multivariate normal or Gaussian random fields. As those models are not a good surrogate for MACML-estimated MNP models we provide our own simulations in section 4. Those adjustments either try to alter so that it is approximately distributed or match certain moments of the distribution.

The moment matching adjustments are motivated by the observation that if the eigenvalue is a simple ratio and, therefore, is asymptotically . For more than one parameter this adjustment has the form,

 cCLR1(γ)=CLR(γ)ωa∼χ2p, (6)

where . This adjustment is equivalent to match the first moment of the distribution. A more sophisticated adjustment, which is designed to match the first and second moment, was proposed in Varin (2008),

 cCLR2(γ)=CLR(γ)κa∼χ2ν, (7)

where and . It is well known that this adjustment might be inaccurate because it only matches the first two moments.
Both the calculation of and are computationally only slightly more expensive than CML optimization: Only the eigenvalues must be calculated based on estimates of and which demand one extra pass through the likelihood calculations (if the corresponding quantities are not stored in the last gradient descent step).
Another adjustment was first proposed by Chandler and Bate (2007) and later modified by Pace et al. (2011) in order to be invariant to reparametrization. In contrast to the previous method this adjustment does not match moments but tries to ensure that the corresponding is asymptotically distributed,

 cCLR3(γ)=sN,γ(^θγ0)′Hγγ(^θγ0)[Gγγ(^θγ0)]−1Hγγ(^θγ0)sN,γ(^θγ0)sN,γ(^θγ0)′Hγγ(^θγ0)−1sN,γ(^θγ0)CLR(γ)a∼χ2p.

Note that the numerator is the CML version of a score test statistic but as for example explained in Cattelan and Sartori (2016) does not inherit its numerical instabilities. It is important to note that all those adjustments rely on estimators for the Godambe information matrix. Again the computation requires little extra work.

As an alternative we introduce another test, which has the appeal that there is no need to compute the Godambe information matrix. This test is based on the empirical likelihood framework, which in the context of CML estimation was first used by Lunardon et al. (2013) to derive confidence regions for CML estimates. Using general results from Qin and Lawless (1994) it is straightforward to also introduce a likelihood-ratio-like test for CML. The only property needed is that the composite score is an unbiased estimation equation, which it is under standard conditions as discussed in section 2, and Corollary 5 from Qin and Lawless (1994). Hence we do not provide any proofs.
In general, this approach comes with two difficulties: First, regarding theoretical properties empirical likelihood methods are known for their slow convergence to their respective asymptotic distributions, which might impose problems for small sample sizes. Second there is a need for additional computations. The essential building block of empirical likelihood theory is,

 lE(θ)=2N∑n=1log[1+ψ′sn(θ)],

with , where is a d-dimensional vector chosen to satisfy,

 1NN∑n=1sn(θ)1+ψ′sn(θ)=0. (8)

While the individual gradients are available from the CML estimation we need to compute the Lagrange multiplier in order to calculate . The best way to determine is to use the strategy outlined in Owen (1990) and restate the root finding problem as a minimization problem (see (Owen; 1990, 104ff)). Note that for practical computations (8) should be ’numerical zero’. In our simulation experiments even with extremely low (gradient and step) tolerances it was rare to find a solution which fulfilled (8) exactly. Our first attempts to use direct (multivariate) root finding methods failed (e.g. NAGs c05qb) because the solutions provided by those algorithms were small but too far from zero and interfered with the tests performance.666How much deviation from (8) can be tolerated without harming the test performance remains a practically relevant but open research question. Therefore we opted for solving the minimization problem.

After is computed the empirical likelihood ratio statistic for testing is (see (Qin and Lawless; 1994, 307)),

 EL(γ)=2lE(^θγ0)−2lE(^θ)a∼χ2p.

Numerically this procedure is more demanding than the adjustments of the composite likelihood ratios: For two Lagrange multipliers need to be computed by searching for the appropriate roots, which is in general faster than the computation of the sensitivity as well as the variability matrix.777We refrain from presenting computation times alongside our results because those depend strongly on the implementation as well as on the specification of the computer used for the computations. However, the appendix provides those numbers for our example implementation and computer system.
The expected computation time is the reason we did not include Bootstrap tests into our comparison. Even though the theoretical results in Aerts and Claeskens (1999) show that the parametric bootstrap leads to a consistent estimator for the distribution of pseudo-likelihood tests under the null-hypothesis, the specific form of the parametric bootstrap (see e.g. Bhat (2014)) is set up in a way that two models need to be estimated for every bootstrap sample.888As discussed in (Molenberghs and Verbeke; 2005, 195) the parametric bootstrap is expected to break down once the assumption regarding the distribution of the error terms is wrong. It might therefore be worthwhile to assess the performance of Bhats bootstrap because from the theoretical standpoint the use of the SJ-approximation in the MACML pseudo-likelihood is not compatible with the assumption of multivariate normal error terms in the utility function. For a reasonable number of bootstrap samples this leads to very high computation times. It might still be worth to consider the bootstrap to draw inference from the final model but it is certainly not suited to be part of the model selection process for real world sample sizes.

All in all we have identified four different methods to perform likelihood-ratio-type tests in the CML framework. Three methods (, , ) are adjustments to the composite likelihood ratio (5), which rely on estimates of the Godambe information matrix. The fourth method (), which is based on empirical likelihood arguments, utilizes only the individual score functions.

### 3.2 Information Criteria

In this section we discuss the use of information criteria for model selection. The two classical criteria are the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). The AIC was developed as an estimator for the Kullback-Leibler Divergence (KLD) between two competing models. Naturally the model with minimal Kullback-Leibler divergence would be the model of choice. Because of this the AIC is designed to select the mimimum-KLD-model. However, the AIC is not an estimate of the KLD (see Claeskens and Hjort (2008, p. 28ff)).
The BIC is designed to select the model with the highest posterior probability when equal prior probability is assigned to each model and vague priors for the model parameters are used (see Burnham and Anderson (2002, p. 286)). The BIC is an approximation to this quantity with the appealing feature ”that the specification of the prior completely disappears in the formula of BIC” (Claeskens and Hjort (2008, p. 81)).

Two important properties of Information Criterion (IC) are efficiency and consistency. An IC is deemed efficient if it selects the model with minimum prediction error. Consistency is defined as the property that an IC asymptotically selects the model with the minimum KLD and if that condition is satisfied by more than one model the model with the lowest number of parameters. This leads to the selection of the true model (if it is part of the model space) with probability going to one as . When assessing BIC and AIC with regard to efficiency, it can be shown that the AIC is efficient while the BIC is not (see Claeskens and Hjort (2008, p. 112)). On the other hand it can also be established that the BIC is consistent while AIC is not consistent.

Both of those classic IC have been adopted for use within the CML framework. The Composite Likelihood Akaike Information Criterion (CLAIC) was introduced in Varin and Vidoni (2005) and is a variant of the Takeuchi IC (TIC) (see e.g. (Claeskens and Hjort; 2008, 43f)). The major difference between AIC and TIC is that the latter is developed without the assumption that the true Data Generating Process (DGP) is amongst the candidate models. Given that within the CML framework we are willing to make a working independence assumption with respect to the DGP, it seems reasonable that the CLAIC is closer to the TIC rather than the actual AIC. The CLAIC selects the model minimizing

 −2lcml(^θ)+2tr[^J(^θ)^H(^θ)−1], (9)

where and are consistent, first-order unbiased estimators for and , respectively. When concerned with the CML framework it is important to note that the objective function () is not necessarily a proper likelihood function. Therefore, Varin and Vidoni (2005) use a modified definition of the KLD to derive the CLAIC. This definition focuses on the marginal distributions, but also imposes that the compared CML models are composed of marginals with the same dimension. Ng and Joe (2014) address some of the differences between the CLAIC and normal AIC. To do so they focus on nested models with tractable likelihoods and make use of the theory of local alternatives. Under those premises they can show that all other things equal the probability to pick the true models is smaller for the CLAIC than for the AIC. Furthermore, they illustrate that – as expected – the CLAIC gets closer to the AIC for rising dimensions of the marginals.

The BIC has been adapted in analogy to the definition of CLAIC for the CML framework by Gao and Song (2010). The Composite Likelihood Bayesian Information Criterion (CLBIC) selects the model minimizing,

 −2lcml(^θ)+log(n)tr[^J(^θ)^H(^θ)−1] (10)

where and are consistent, first-order unbiased estimators for and , respectively. Gao and Song (2010) show that the CLBIC as defined above is a consistent information criterion. Furthermore, they provide an extension for high-dimensional data which is irrelevant in the MNP context. Finally note that CLAIC and CLBIC rely on the sandwich information matrix and that the computational burden is therefore similar to the tests discussed in the previous section.
Considering the definition of CLAIC and CLBIC one notices a subtle dependence of the penalty term on the parameter . As formulated above hence the penalty term is not guaranteed to be a strictly monotonous function of the model order. One could evaluate the penalty term always at the biggest model which would guarantee monotonicity. However, as CLAIC and CLBIC is intended to be used also for non-nested comparisons this definition is not suitable.

## 4 Model selection: Comparison by simulation

After we have discussed several model selection procedures we will assess their respective performance using a simulation exercise. The model under consideration is a mixed panel MNP model with five alternatives. Those alternatives are explained by five explanatory variables (drawn from independent standard normal distributions) and their respective parameters are assumed to be an instance of a multivariate normal distribution with mean and covariance matrix . The error terms are assumed to be normally distributed with a mean of zero and a diagonal covariance matrix whose entries are fixed at 0.5.

We generated data sets by drawing values of the vector and the error term from their respective distribution. Based on those values we calculated the utility and the chosen alternative is the one with the highest utility. The data sets have sample sizes 300, 500 or 1000 and in either case we generate 500 of those data sets for each setup to assess the properties of the various selection methods.999For each sample size a small double-digit number of the Monte Carlo samples was not used in the final analysis due to obvious non-convergence of at least one of the estimators.

Our aim is to estimate the linear (b) as well as the covariance parameters. The covariance parameters are estimated using the corresponding Cholesky decomposition (). In order to find the optimum of the likelihood function we rely on the BFGS-algorithm provided by MATLABs fminunc function. To ensure competitive computation times we have derived the respective analytic gradient but other than that relied on the default options. Following (Bhat and Sidhartan; 2011) we initialize the optimizer at the true values and use one random permutation per observation for the SJ approximation, which stays the same during the optimization. Note that we always fit the unrestricted and restricted model and that we do not use a simple ’plug-in technique’ to compute the restricted pseudo-likelihood/estimate.

In order to compute the CLAIC, CLBIC and most tests we need estimators for the Godambe information matrix. A simple estimate for the sensitivity matrix () is given by the Hessian of evaluated at the CML estimate ,

 ^H(^θ)=−1NN∑n=1∂2lcmln(^θ).

An alternative estimator is derived by exploiting that the respective information identities hold for the marginal likelihoods (see Lindsay et al. (2011)),

 ^H1(^θ)=1NN∑n=1T−1∑t=1T∑t′=1∂lcmlntt′(^θ)∂lcmlntt′(^θ)′,

where is a pairwise likelihood of the n-th subject.
The estimator relies only on gradients, it can be computed on the fly, while the numerical computation of sometimes takes longer than the estimation of the corresponding model. Furthermore, like (Claeskens and Hjort; 2008, 43f) we found that the estimates for and are subject to (at times severe) sampling variability and that this variance is in general lower for compared to . We, therefore, base our analysis on the estimator .

For our setup and for MNP models in general the sample size is certainly larger than the number of repeated observations and, therefore, the variability matrix can be estimated by

 ^J(^θ)=1NN∑n=1∂lcmln(^θ)∂lcmln(^θ)′.

Alternative methods to compute the Godambe information matrix are a Jackknife-based estimator (for the general idea see (Joe; 1997, p. 302ff) and for an application see Zhao and Joe (2005)). Furthermore, it is possible to obtain estimates for and by simulation (see Cattelan and Sartori (2016)). We leave those options for further research.

### 4.1 Variable Selection

In this section we address the selection of variables. The setup is simple in that we simulate the addition of a fixed effect. We still assume that the other four variables are Random Effect (RE)s so but the covariance matrix is only . We only consider uncorrelated RE,

 Ω=⎡⎢ ⎢ ⎢⎣200001.50000100001.2⎤⎥ ⎥ ⎥⎦.

The decision involves only one parameter, which is either restricted to zero or estimated from the data - . In order to explore the power of the tests we vary the true from 0.1 to 0.5 by steps of 0.1. This is a simple task, which does not even require the use of a (composite) likelihood-ratio-test, but the results offer a first look into the performance of the various tests. Note that by definition , and are the same in the one parameter case.

The results are given in Table 1. First note that we have included the naive composite likelihood ratio test, which is based on using (5) and a distribution without any corrections. As shown in the first column the size () of this test is severely inflated. Even though the nominal size is supposed to be the empirical size is nearly 6 times larger. Furthermore, the power () seems satisfactory which is to be expected because the test has a high rejection rate anyway. The size inflation seems to get worse for rising sample sizes.

Regarding the various corrected tests, which are collectively presented as , our simulations reveal that the corrections work but that those tests still have an inflated size. Again, the nominal size is supposed to be the empirical sizes we observe are in the worst case nearly 3 times larger for , and . All tests reject more often than theory would suggest. We observe that the size inflation gets more pronounced as the sample size gets larger. By noting that those tests are constructed by premultiplying (5) with a corrective term, this hints to the fact that the correction is not able to keep up with the inflation inherited from . The power () for each test is satisfying and improves for larger sample sizes. However, the second column reveals problems for small deviations in small data sets, which is to be expected. Finally, note that the power results are not adjusted for the difference between nominal and empirical size.

For the the empirical likelihood test (), which is not based on a correction of (5), the empirical size is closer to the nominal size and not or only slightly inflated. Furthermore, the results suggest that the performance of the empirical likelihood test is in general superior compared to the correction-based tests because the power is higher or equal for all values of . This dominance is especially pronounced for the sample size 1000, which is not surprising because as discussed in section 3.1 empirical likelihoods methods have the drawback of slow convergence.

The performance difference between the correction-based tests and the test is also depicted in Figure 1 where we have plotted the empirical distribution function of the empirically observed values of the test statistics as well as the distribution they aim to approximate. It is clearly visible that the closely follows the desired distribution while the other tests have too many large values. Even though this is a one-dimensional example, where we only need to divide by the relevant eigenvalue to attain the desired distribution the estimation of this eigenvalue seems to be imprecise.

In Table 1 the last two rows for every sample size contain information regarding the performance of the information criteria. We choose the show the empirical probability that the larger model is selected which should be close to one for all values of except for the first column. We see that regardless of the underlying true model both criteria are highly in favor of the larger model. An interesting fact to note from the first column of Table 2 is that even the CLBIC is predominantly selecting the ’wrong’ model just like the CLAIC despite being a consistent IC. With rising sample sizes the probability to select the ’true model’ goes up, which is a reminder that this property is only adhered asymptotically.

### 4.2 Choosing Covariance structures

In this section we address the issue of choosing between different covariance structures for the RE. More specifically we aim to test whether the off-diagonal entries of are zero (diagonal covariance, ). For that reason is fixed as . In order to assess the empirical size of the test we first simulate 1000 data sets from a model with diagonal covariance (). We address the power of the tests using covariance matrices that mimic a Toeplitz matrices,

 Ωα=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣1αα2α30α1αα20α2α1α0α3α2α1000001⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦.

In order to represent various degrees of correlation between the REs we vary the true from 0.1 to 0.4 by steps of 0.1. The decision involves the ten off-diagonal parameters of which in the estimation are either all restricted to zero or all estimated from the data without restrictions.

Table 2 shows that for the current setup the performance of the tests differs from that of the last section. First, the problems of the naive test are even more pronounced than for the variable selection task as is almost always rejected. However, contrary to the results of the last section the size inflation of the corrected test () is less severe but the difference between nominal and empirical size is, again, growing with the sample size. Furthermore, while and are again almost equal in performance there is a performance difference in comparison to . The latter has an empirical size which is more inflated for small samples but the power is higher even when (for sample size 1000) the size inflation is equal to the one of and .

For this setup the test suffers from size inflation, which is for the smallest sample size more severe than for the correction based test. As in the previous example the test has the empirical size which is closest to the nominal level for a sample size of 1000. This serves as a reminder that empirical likelihood methods might not be adequate for small sample sizes. Furthermore, for sample size 1000 the power is lower than that of the correction-based tests at least for .

Regarding the ICs it is clear that both CLAIC and CLBIC again favor the larger model, that is the unrestricted correlation structure, over the diagonal correlation matrix. This effect is extremely pronounced as the small model is almost never selected and illustrates that selecting the covariance structure using IC will not work.

## 5 Model averaging

An alternative to model selection is model averaging (MA). In the following we only refer to frequentist model averaging because we are concerned with MACML estimation, a frequentist method. However, it is worth to note that there exists a large body of literature on Bayesian MA.
In MA instead of selecting just one model, the estimates from several competing models are combined in a weighted average. In this section we assume that the models are labeled consecutively as . A model is characterized by setting some coordinates to prespecified values (often zero) and letting the other coordinates vary freely. The main case in this respect is regressor selection where setting a coordinate to zero implies that the corresponding regressor does not influence the outcome.
The estimated -th parameter after model averaging then might be written as

 ^θMAj=M∑m=1^wm(Z)^θmj, (11)

where the weight is defined such that and where if the parameter is not contained in the model. Different versions of MA methods use different weights .

A simple class of MA methods is based on information criteria where the weight of every model is derived from its respective AIC or BIC value. Those ad-hoc methods, which originated from Buckland et al. (1997) aim to incorporate the uncertainty involved in model selection. The weights for a given sample are defined as

 ^wm(Z)=exp(0.5ICm)∑s∈Mexp(0.5ICs), (12)

where in our case is either the CLAIC or CLBIC calculated for model . Note that in order to simplify the calculations the are often normalized to , in either case the weights sum up to one by definition. Furthermore all weights are positive by definition. A theoretical justification for this model averaging strategy is provided in Burnham and Anderson (2002) where it is shown that asymptotically even when those weights are computed from the AIC they might be interpreted as (Bayesian) posterior probability that the model is correct (see section 6.4.5. in (Burnham and Anderson; 2002)).

Alternatively asymptotically optimal MA methods base the weight choice on the minimization of some criterion function, e.g. the asymptotic Mean Squared Error (MSE) in the parameter or in prediction of certain focus quantities like conditional probabilities. These techniques typically are motivated within a local misspecification framework proposed in Hjort and Claeskens (2003): The parameter vector is partitioned into a set of parameters that are shared by all candidate models () and some additional parameters which only appear in some models (). Therefore, there is a range of models starting with a ’narrow model’ where (often ), which contains only the mandatory parameters and ending with the ’wide model’, which is based on the full set of parameters. Within the local misspecification framework geared towards difficult to separate cases the data is assumed to be generated by the parameters converging to . In this setting the bias for not including a variable in the model is of the same magnitude as the squared error due to sampling variability, making the decision of whether to include the variable corresponding to based on the available amount of data a hard decision.
Wan et al. (2014)) show how to derive such a setup for Multinomial Logit Models in connection with maximum likelihood estimation, but – to the best of our knowledge – no results are available neither for MNP nor for CML methods in general up to now.
Following Hjort and Claeskens (2003) an optimal MA is based on the following insight:

###### Theorem 5.1

Let and let . Further let (variance under ) and .
Let the model be such that the function is three times continuously differentiable in a neighborhood of where all derivatives are dominated by functions with finite means under . Furthermore the variables have finite fourth moments under .
Then if the data is generated as iid draws under the sequence of local alternatives we have:

 √N¯Un\lx@stackreld→U,U∼N(Hγδ,J) (13)

The theorem follows from Lemma 3.1. in Hjort and Claeskens (2003) as the assumptions directly imply assumptions (C2) and (C4) there. A slight difference occurs as in the current case the second derivative of the log-likelihood does not equal the negative variance of the score and thus different matrices and appear in the distribution of instead of only .
This theorem can be applied to the current setting, if the number of choices is constant over individuals (generalizations to cases of unequal number of choices per individual are immediate). Note in this respect that using the SJ approximation to the MVNCDF implicitly defines a model by providing a mapping between parameters and regressors onto choice probabilities. The corresponding model will be close to the MNP but not identical. Consequently these two models and corresponding estimation methods need to be dealt with separately.

###### Theorem 5.2

Assume that the data are generated by a MNP with repeated choices for each individual, where the regressors are uniformly bounded. Furthermore the parameter set is compact and such that .
In this case the conditions of Theorem 5.1 hold for the MNP model.
The same conclusions hold for data generated according to model implied by the SJ approximation.

The proof of the theorem is based on the fact that the MVNCDF is continuously differentiable in all its arguments (see Plackett (1954) for derivatives with respect to entries in the correlation matrix; derivatives with respect to upper integration limits are obviously given by the corresponding PDF). Thus also higher order derivatives are continuously differentiable. The bounds on the regressors and the eigenvalues of the covariance matrix imply bounds on the upper limits as well as the eigenvalues of the correlation matrices occurring in (3). For the SJ case also only Gaussian CDFs and PDFs occur and hence the result is immediate. Details are omitted.
Standard asymptotic expansions of the score in combination with the mean value theorem then provide the following consequence of Theorem 5.1:

###### Theorem 5.3

Under the conditions of Theorem 5.2 let denote the estimator (over the compact parameter set of which is an interior point) of the model defined via the fact that in for the estimation certain entries are chosen according to while the remaining are estimated. Then

 √N(^θm−θ0)\lx@stackreld→ΛmU

for some matrix .
If furthermore the function is continuously differentiable at with gradient then

 √N(μ(^θm)−μ([τ′0,γ′0+δ′/√N]′))\lx@stackreld→∂θμΛmU−∂γμδ

and hence in the limit is normally distributed with mean and variance . Consequently times the mean square error for the model asymptotically equals

 NMSEm=λmδδ′λ′m+∂θμΛmJ(∂θμΛm)′.

PROOF: The proof follows standard mean value expansions: As maximizes the CML function at an interior point of the parameter set, its derivative at the estimator is zero:

 0=∂θmlcml(^θm)=∂θmlcml(θ0)+∂2θmθmlcml(¯θ)(^θm−θ0)

Here denotes an intermediate value. Standard theory implies that where denotes the projection onto the free coordinates within . The limit for the estimation error then follows from the fact that

 ∂θmlcml(θ0)=πm¯Un

where consequently

 Λm=−(πmHπ′m)−1πm.

The asymptotic distribution of then follows from applying the Delta method. The formula for the asymptotic MSE then is obvious.

Therefore the asymptotics are for all models driven by the same random variable . Thus it follows that the estimation errors for all models jointly are asymptotically normal. This allows the calculation of the asymptotic MSE also for the averaged model using weights :

 NE(μ(^θMA)−μ[[]cτ0γ0+δ/√N])2 → M∑i,j=1(λiδδ′λ′j+∂θμΛiJ(∂θμΛj)′)wiwj (14) = M∑i,j=1Fijwiwj=w′Fw

where the next to last equation defines the matrix and the vector of weights . The optimal weights for estimating the value then is provided by the solution to

 ^wmse=argminw∈RM,∑Mm=1wm=1w′Fw. (15)

In practice the matrix is not known but contains only quantities that can be estimated consistently except for . In this respect it is customary to use the estimate from the wide model. This estimate is unbiased but not consistent. Note that contrary to the weights from the IC-based approach the model averaging weights that result from (15) might be either positive or negative. Without further restrictions the optimal weight vector can be calculated explicitly.
The matrix is not necessarily nonsingular, its rank is bounded by . Conditions for nonsingularity of can be derived for particular settings, see Charkhi et al. (2016). This also implies that the estimation of models is sufficient in order to obtain minimum MSE weightings. This number typically is much smaller than the number of all possible combinations of regressor selections.
The main steps to compute an optimally averaged estimator may be summarized as:

1. Decide on a list of candidate models

2. Estimate the wide model and obtain as an estimate of .

3. Estimate its sensitivity as well as its variability matrix .

4. Estimate the remaining models, then compute the weights using (15) and the averaged estimate.

This description leaves the question of the choice of open, which in this context is usually called the focus parameter. A number of options in this respect are:

• : extracting one parameter of interest.

• being a prediction such as the conditional probability of a particular choice.

• Alternatively the sum of squared errors corresponding to all parameters can be used.

In summary we have introduced two model averaging strategies, the first is rather ad-hoc but has the benefit that the weights are easily computed from either CLAIC or CLBIC. The other strategy involves additional computations but features a strong theoretic framing (optimality w.r.t asymptotic MSE of some focus quantity).

## 6 Model averaging: Comparison by simulation

In this section we use a rather simplistic approach to explore the performance of model averaging in that we only average over the two different covariance structures involved in the test decisions of section 4, where the models differ in 10 parameters.

The setup was described in the introduction to section 4 but here our focus is on the estimation accuracy of the previously discussed averaging estimators when compared to model selection. The results are based on the Mean Absolute Error (MAE) over all Monte Carlo samples with regard to the – arbitrarily chosen – third parameter in b (). So the difference between the models is in the covariance structure while we are interested in the impact of this difference on the estimates for a linear parameter. We first present the MAE for the case of selecting the model using either CLAIC or CLBIC followed by the errors for different MA methods. Note that as discussed previously the probability to select the restricted model is rather low and gets lower for rising such that this MAE is mainly driven by the unrestricted model.

In Table 3 we see that as expected the MAE for all methods gets smaller for rising sample sizes because the underlying estimation is getting better. Furthermore we see that the errors of model selection, which are presented in the first two rows, are also getting smaller for higher values of which is due to the fact that in those cases both ICs always select the unrestricted model.
The next two rows present the MAE for an averaging estimator which is based on an IC (see (12)). We observe that this method is performing worse than model selection for the two smaller sample sizes but has lower errors for all but the highest values of for sample size 1000. However, it is clearly visible that the error is not improving for rising values of even though the weighting decision should get easier.
The results for an averaged estimator where the weights are chosen to minimize the asymptotic MSE are given in the last row. Again we observe better performance for larger samples but the results also show that this averaging estimator is superior to the IC-based MA estimators. We further observe that this model averaging estimator outperforms model selection for small values of . It is important to recapitulate the results from Table 2 which show that the restricted model is wrongly almost never selected by CLAIC/CLBIC. This limited simulation exercise points out that model averaging – especially when the weights are chosen to be MSE optimally – might provide a solution to this problem.

## 7 Model Averaging: Empirical Example on Mobility Motifs

In this section we illustrate the use of the most promising of the previously discussed model averaging methods (asymptotically MSE optimal weights) using data from the German Mobility Panel (MOP). The MOP is a panel mobility survey with a rotating sample, keeping responding households in the sample for three consecutive years. Each year households are asked to record trips for all members of the household for one randomly assigned week. The corresponding trip diary conforms to the KONTIV design and collects trip start and end time and location, transport means, distances covered and trip purposes (for more details on the data generation and characteristics see (Zumkeller and Chlond; 2009)). For this paper we use data from 2013.

Based on the trip diary for each person we compile mobility motifs (Schneider et al.; 2013): Here a motif is a graph containing nodes and directed edges, compare Figure 2. The nodes represent locations people visited, the edges movements between edges. The significance of the motifs is seen in the fact that in different data sets in different cities it has been observed that from the large number of possible motifs with up to six nodes only relatively few ((Schneider et al.; 2013) list 17 motifs covering more than 90% of all day observations) turn out to occur frequently with also the frequency of occurrence being remarkably similar across studies.

Beside these empirical facts motifs are of interest also for simulation models of mobility behaviour. Often such models only cover a single trip, while clearly trip chains exist and hence choices for one trip depend on the other trips within a day. Most often such dependencies are not or only partly taken into account (see (Cascetta; 2009, p. 219ff)). In this respect it might be postulated that people in a first step choose one of the possible motifs in order to decide on the various locations to be visited and also the sequence of visits to these locations. And only in a second step the actual trips are planned in detail. Such activity plans also lie at the heart of some mobility simulation models such as the ones implemented in MATSim (Horni et al.; 2016).

For this case study, we limit our analysis to the workdays of the year 2013. The motifs were computed from the trip diaries of 2369 participants and in this data set the 15 most common motifs (as depicted in Figure 2) account for 92.6 percent of all choices and, therefore, we summarized the remaining 7.4 percent as ’other’.

In this case study we investigate the choice of the actual motif on a workday based on underlying sociodemographic characteristics of the persons. Cascetta (2009) argues that it is mainly the occupational status of an individual that influences the individual trip-chaining. We will explore this hypothesis in this case study. To do so we consider a limited set of exploratory variables: a dummy which indicates when a person is not full-time employment, a gender dummy as well as dummies for four age groups (10-17, 18-25, 26-60, 61 and older).101010Note that in Germany the legal age to acquire a (full) drivers license is 18. Furthermore, the compulsory school attendance ends around the age of 18 in most of the federal states. According to official figures the mean age of retirement in Germany was 61.9 in 2016 (see (Deutsche Rentenversicherung; 2016)). Those variables, which are included in the model as fixed effects, are summarizes in Table 4.

We explore the effect those variables have separately for each potential choice. As we observe five repeated choices for each participant we model the Alternative Specific Constant (ASC) as random effects and the correlation between those random effects allows us to explore the substitution pattern between different motifs. In order to ensure identification the coefficients for the ”other”-choice are fixed to one as is the corresponding variance. That leaves the model with 90 linear and – because the covariance parameters are estimated using the corresponding Cholesky decomposition () – with up to 120 covariance parameters.

Following through the steps outlined in section 5 we will start by specifying the list of candidate models. For this modeling exercise the specification of the covariance is of major concern because prior research suggests that mobility patterns are pretty stable during the work week and that, therefore, some substitution patterns might not be relevant (see (Schneider et al.; 2013)). Furthermore, Schneider et al. (2013) discuss that substitution in general can be explained by a rule-based system where several substitutions do not occur at all. Instead of selecting a covariance structure, which could be done using the methods we discussed previously, we estimate several possible specifications ranging from the diagonal RE-specification without any correlation to the model featuring an unstructured covariance matrix.111111Note that in order to ensure identification of the unstructured covariance we constrained all correlations between the ’other’-option and the remaining motif to zero. In detail we consider four models,

• Unrestricted correlations between the ASCs of the 15 motifs

• Block1, substitution of the fist seven (simple) motifs to the other (more complex) motifs but no correlation within those blocks.

• Block2, like Block1 but also correlation within the first block,

• Diagonal RE, only uncorrelated random effects for each motif,

All those models include the 90 linear parameters and at least the 15 variances of the random ASCs. Using the terminology of section 5 those parameters form and the last model is the wide model which includes all 210 parameters (see Table 5). The estimation for all four models is set up similar to the procedure outlined in section 4 but we initialize the optimizer at random because the true values are obviously unknown.

In Table 5 we have summarized the different IC values for the models. Comparison of CLAIC and CLBIC reveals as the preferred specification. However, we can also observe that selection by CLAIC and CLBIC would lead to different choices for the second best model. While CLAIC hints towards the unrestricted model, we would select the model would we rely on the CLBIC. Furthermore, we show the weights for MSE optimal model averaging. We focused the MSE on the 15 coefficients which indicate whether an individual is not full-time employed. The weights might appear strange on first inspection because the model with the smallest CLBIC/CLAIC gets a negative weight but from Table 6 it is clear that those weights lead to meaningful averaged coefficients. The averaged coefficients reside either in between the estimate for and the full model or and , this would have been impossible by just selecting one model.

Even though the main focus of this section is to illustrate the feasibility of model averaging for real data we will interpret the estimates but without going into too much detail. In order to facilitate interpretation we have added a plot which contains the motifs alongside the estimated averaged coefficients. We see that the estimates are plausible as individuals without full-time employment seem to (I) stay at home more often and (II) favor motifs with a hub-structure, returning home in between visits to different locations, over round-trips. An interesting results is that the two motifs with the highest coefficients represent distinct activity patterns, the stay-at-home motif (blue box) and the hub-motif with four non-home locations (green box). The model would most likely benefit from considering interactions of the non-full-time employment dummy with other variables, which might explain the reason for the occupational status, to further explore this paradox.

Another possibility is to use model averaging to recover information about the correlations between the ASCs. In this case we focus on the MSE with respect to all linear parameters but for the case of this empirical example we will only look at the estimated correlation matrices. First, note that by shifting the focus we obtain different weights,

 ^w=[1.5546,−1.8555,1.2105,0.0904],

where the first weight is for the model, followed by the , and specification. We see that the model has the smallest weight which might be interpreted as a penalty for the high standard errors of the estimate. When compared directly we see that the estimates of the correlation matrix of the random effects for the unrestricted model (Figure 4) and the averaged estimate (Figure 5) reveal similar but not identical patterns. The ordering of the motifs is identical to that depicted in Figure 2, therefore, motifs with lower numbers are observed more often. The first insight from those heatmaps is that the third motif is special in that its correlation with all other motifs is low in comparison to the other common motifs. This is plausible because it is the stay-at-home motif and is observed for the estimates from the averaged as well as from the model.