# Information processing constraints in travel behaviour modelling: A generative learning approach

###### Abstract

Travel decisions tend to exhibit sensitivity to uncertainty and information processing constraints. These behavioural conditions can be characterized by a generative learning process. We propose a data-driven generative model version of rational inattention theory to emulate these behavioural representations. We outline the methodology of the generative model and the associated learning process as well as provide an intuitive explanation of how this process captures the value of prior information in the choice utility specification. We demonstrate the effects of information heterogeneity on a travel choice, analyze the econometric interpretation, and explore the properties of our generative model. Our findings indicate a strong correlation with rational inattention behaviour theory, which suggests that individuals may ignore certain exogenous variables and rely on prior information for evaluating decisions under uncertainty. Finally, the principles demonstrated in this study can be formulated as a generalized entropy and utility based multinomial logit model.

###### keywords:

Information theory, generative model, rational inattention, variational inference

## 1 Introduction

The classical assumption about modelling travel behaviour data is that individuals have varying unobserved heterogeneity in their choice preferences (mcfaddentrain2000, ). In recent years, the use of data-driven modelling and integration of behavioural and psychological factors in discrete choice and travel behaviour analysis have become active areas of research (lietal2016, ; vijkrueger2017, ; nikolicbierlaire2017, ). In the context of data-driven models, behavioural variations describe the correlation between observed choice attributes and unobserved socio-economic factors using a flexible and tractable model specification. These variations include: decision-protocols, choice sets, unobserved taste variations and unobserved attributes (gopinath1994, ). Under these considerations, recent studies on travel behaviour analysis have so far primarily focused on representing heterogeneity in the error correction function and incorporating it into utility based multinomial logit (MNL) models (vijkrueger2017, ). Models such as the mixed multinomial logit (MMNL) or latent class (LC) models offer flexibility in representing heterogeneity and substitution patterns. In addition, recent conceptual frameworks such as the integrated choice and latent variable (ICLV) model use individuals’ psychometric indicators to represent unobserved behavioural and perception heterogeneity (bolducalvarezdaziano2010, ). It is also possible to apply generative machine learning to identify informative latent constructs in travel decision making without subjective behaviour indicators (wong2018modelling, ; wong2018discriminative, ). However, the true underlying behavioural patterns are often unknown and are usually approximated by some pre-determined exogenous indicator variables, which often leads to model misspecification due to lack of complete information or error in data collection (cherchipolak2005, ). 
Furthermore, accurate specification of the underlying distribution assumes individuals have access to all available information regarding the travel activity (e.g. travel times of each mode, knowledge of exact traffic status, etc.). This information will not always be available to the individual and they might also choose to not consider these variables in their decision making process. Therefore, statistical variations in the observed data may not exhibit the same underlying properties as with the individuals’ behaviour.

A different perspective to explain these heterogeneity manifestations is to consider the element of information processing costs based on rational inattention theory (sims2010, ; matvejkamckay2015, ). Rational inattention theory describes individuals who choose their optimal preference while considering incomplete information about the choice attributes and relying on their prior beliefs about the choice set. A typical example would be route choice selection: individuals tend to ignore most path choices and consider only a few prioritized routes in their choice set (alizadeh2018online, ). These manifestations occur through a repeated choice process and prior experience of the travel routes. As described in matvejkamckay2015 (), information theoretic approaches do not impose any particular assumptions on what is learned or how it is learned; the structure of the model is estimated through the minimization of decision uncertainty. Under this interpretation, a rational inattention model captures the systematic utility and adjusts for prior knowledge and individuals’ internal information processing strategy using an entropy term. Individuals perceive route choices with heterogeneous prior beliefs and allocate different levels of attention to each alternative. Consequently, misspecification in classical econometric model estimation can be interpreted as the systematic error between the data observed by the analyst and the true underlying heterogeneous beliefs of the decision makers (which are hidden to the analyst).

The objective of this research is to model unobserved variations in travel behaviour data by emulating decisions under uncertainty and information processing constraints as a data-driven generative learning process. We develop a choice model estimation framework with latent constructs that capture information heterogeneity within the data. The key difference between our work and previous literature is that we show how rational inattention can be framed as a flexible and extendable generative learning model that emulates the cognitive processes in human behaviour (fosgerauetal2017, ; fosgeraujiang2019, ). We postulate that realistic behavioural patterns can be modelled using a data-driven generative learning process and we estimate a model to represent the underlying heterogeneity of the data. Lastly, we provide a quantifiable economic interpretation using latent variables by analyzing the model properties and systematic effects from the latent variable parameters. This will provide valuable insights into how modern data-driven and deep learning techniques can be exploited to improve travel behaviour modelling.

Our contributions are as follows: (i) A novel framework for capturing and extracting properties of information heterogeneity in travel behaviour models (Fig. 1). (ii) We show that generative modelling can be framed as an abstraction of rational inattention theory. Specifically, the learning and optimization process of a generative model emulates the internal information processing constraints of decision making. (iii) Demonstration of a data-driven modelling approach that exploits state-of-the-art deep learning techniques. A generative model architecture is described in the methodology. (iv) Discussion on the interpretation of generative learning on discrete choice analysis. (v) We provide new insights into sensitivity analysis of econometric parameters through a travel behaviour case study.

The remainder of the paper is organized as follows: Section 2 introduces preliminary concepts related to information theory in choice modelling and discusses existing literature on rational inattention behaviour theory. Section 3 describes the generative model framework and estimation methodology. In Section 4, a case study example on a trip-based travel behaviour analysis is shown and we demonstrate how the results explain information heterogeneity in the data. Section 5 provides a brief discussion on the results, conclusion and suggestions for future research.

## 2 Information theory in behavioural models

In this section, we introduce several preliminary concepts that relate to our work, beginning with the connection between rational inattention behaviour and information theory in the context of generative modelling.

### 2.1 Rational inattention behaviour

Rational inattention presents a behavioural scenario where individuals’ choice influences are based on Shannon’s mutual information that measures uncertainties between an exploitative and exploratory choice process. Specifically, it frames the choice problem on observations as well as information processing constraints similar to that of a communication channel with finite Shannon capacity (sims2003, ). By representing information processing constraints, it accounts for the natural deviations in econometric behaviour (sims2003, ; sims2010, ). This concept stems from the same principles of neuroscience where behaviour learning and perceptual inference can be explained through information theory and statistical physics (friston2006, ). Using modern deep learning techniques, one can construct a rational inattentive learning model using an artificial neural network to provide a principled way of analyzing travel behaviour patterns from large scale datasets.

As a simple generalization, information processing constraints across choice preferences can be represented by an unknown distribution of random utility shocks according to Ellsberg’s paradox, which showed that individuals systematically violate utility theory by being averse to ambiguity (ellsberg1961, ). Consider a case where an individual is faced with two options in a choice set and the expected utilities are identical for both options. In utility theory, both options will be chosen with equal probability, whereas in rational inattention, the individual chooses the option that maximizes entropy (attention). This decomposition accounts for the prediction error under different protocols and resembles exploratory choice behaviour (i.e. prospect theory) (kahnemantversky1979, ). For instance, when the utilities of two travel modes do not differ, travellers would try new options, in relation to increased risk.

Existing studies on rational inattention in choice modelling research stem from the finding that this behaviour can be generalized in an MNL model (matvejkamckay2015, ). However, they have mostly focused on static models, as dynamic rational inattention models are difficult to solve and may be intractable using conventional methods (steineretal2017, ). The value of adding information processing constraints has suggested well-defined similarities with macroeconomic behaviour theory (sims2003, ). Recently, rational inattention has become a particularly appealing approach to modelling choice behaviour. For instance, matvejkamckay2015 () described the implication of information availability on consumer choice selection behaviour using a rational inattention model. In a combined location and mode choice model, teyeetal2017 () used a method of entropy maximization in a non-linear mixed integer program subject to available information constraints. leard2018 () investigated the correlation of consumer inattention with willingness-to-pay for fuel consumption. Rational inattention has also been found to work well in time variability problems in travel demand forecasting (fosgeraujiang2019, ). The theory of rational inattention seeks to endogenize the imperfect awareness about the circumstances (sims2010, ). The decision maker selects pieces of information that are most relevant for his or her utility and ignores the rest, so long as the information cost can be accounted for in the model.

### 2.2 Information theory

In this section, we explore some key properties of information theory in the context of behavioural modelling. Information theory has been used to provide insights into non-rational behavioural choice, and it was shown to be equivalent to a random utility maximization MNL model (anas1983, ). An information theoretic model can also be used as a tool for generating new predictions beyond MNL restrictions, subject to available information (anas1983, ). Recent studies have shown that this is also functionally equivalent to an additive random utility maximization problem in rational inattention behaviour models, and that several well-known decision problems can be reasonably represented, e.g. Prospect Theory and Regret Theory (kahnemantversky1979, ; matvejkamckay2015, ). The measure of information heterogeneity is closely related to non-normative representation, involving Shannon entropy (fosgerauetal2017, ). Expected utility representation may not be sufficient in providing the proper specification for these decision problems as individuals may perceive choice probabilities with different levels of uncertainty. Decisions under uncertainty can be interpreted in a simple way by correcting for information processing constraints in the utility specification.

#### Energy

Assuming a bi-directional system with an observed and an unobserved (latent) state, the level of uncertainty of a state configuration of the system with observed random variable $x$ and latent random variable $h$ is a function of the energy $E(x,h)$ of the state, which is related to the joint probability by $p(x,h) \propto e^{-E(x,h)}$, or:

$$E(x,h) = -\ln p(x,h) - \ln Z \tag{1}$$

where $Z$ is the normalization function so that $\sum_{x,h} p(x,h) = 1$. Due to the logarithmic function, energy decreases monotonically as the probability increases. Imposing monotonicity allows the model estimates to be more interpretable and tractable. An event with high energy will have a lower probability of occurrence (individuals will tend to avoid this state). An event with low energy will always be within the expectation of the individual, thus having higher probability (ullah1996, ).
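As a minimal numerical sketch of this energy and probability relation (not part of the estimation code of this study), the snippet below maps a set of hypothetical state energies to Boltzmann probabilities and recovers the energies from the probabilities. The energy values and the function name `boltzmann_probabilities` are illustrative assumptions.

```python
import numpy as np


def boltzmann_probabilities(energies):
    """Map state energies to probabilities p = exp(-E) / Z."""
    energies = np.asarray(energies, dtype=float)
    # Shifting by the minimum energy improves numerical stability;
    # the shift cancels between numerator and denominator.
    weights = np.exp(-(energies - energies.min()))
    return weights / weights.sum()


# Hypothetical energies of four joint (observed, latent) configurations.
energies = np.array([0.5, 1.0, 2.0, 4.0])
probs = boltzmann_probabilities(energies)

# Recover the energies from the probabilities: E = -ln p - ln Z.
log_z = np.log(np.exp(-energies).sum())
recovered = -np.log(probs) - log_z
```

Because the energies are listed in increasing order, the resulting probabilities decrease monotonically, which is the property the text relies on.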

#### Mutual information

Mutual information allows us to identify general nonlinear dependencies by measuring the amount of information processed by the individual, i.e. how far two random variables are from being independent. Given two random variables $X$ and $S$ with joint distribution $p(x,s)$, the mutual information can be written in the form:

$$I(X;S) = H(X) - H(X|S) \tag{2}$$

It can be interpreted as the decrease in uncertainty of $X$ given $S$, where $H(X)$ and $H(X|S)$ are the entropy and the conditional entropy respectively:

$$H(X) = -\sum_{x} p(x)\ln p(x), \qquad H(X|S) = -\sum_{x,s} p(x,s)\ln p(x|s) \tag{3}$$

Mutual information is symmetric and non-negative, and it is zero if and only if $X$ and $S$ are independent (with respect to the model identification process). Hence, the mutual information shown in Eq. 2 is equivalent to finding the expected energy difference between the data generating distribution and the true distribution obtained from the data.
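The properties listed above can be illustrated with a short sketch that computes mutual information as the entropy minus the conditional entropy for two small joint probability tables. The tables are made-up examples and the helper names are assumptions introduced only for this illustration.

```python
import numpy as np


def entropy(p):
    """Shannon entropy (natural log) of a probability table."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]  # 0 * ln(0) is treated as 0
    return float(-(p * np.log(p)).sum())


def mutual_information(joint):
    """I(X;S) = H(X) - H(X|S) for a joint table indexed [x, s]."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1)
    p_s = joint.sum(axis=0)
    # Chain rule: H(X|S) = H(X,S) - H(S).
    h_x_given_s = entropy(joint) - entropy(p_s)
    return entropy(p_x) - h_x_given_s


# Hypothetical joint distributions of two binary variables.
independent = np.outer([0.3, 0.7], [0.6, 0.4])
dependent = np.array([[0.4, 0.1], [0.1, 0.4]])

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

The independent table gives a mutual information of zero up to floating-point error, while the dependent table gives a strictly positive value that is unchanged when the roles of the two variables are swapped.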

#### Kullback-Leibler divergence

The Kullback-Leibler (KL) divergence or the relative entropy measures the ‘distance’ between two distributions, $p$ and $q$ (ullah1996, ). The KL divergence of $p$ from $q$ is $D_{KL}(p\,\|\,q) = \sum_{x} p(x)\ln\frac{p(x)}{q(x)}$; when $p = q$ then $D_{KL}(p\,\|\,q) = 0$. The mutual information, using the example above, can be defined as the divergence of the joint distribution from the product of marginals:

$$I(X;S) = D_{KL}\big(p(x,s)\,\|\,p(x)\,p(s)\big) = \sum_{x,s} p(x,s)\ln\frac{p(x,s)}{p(x)\,p(s)} \tag{4}$$

Thus in practice, we can consider the hypothesis $H_0: p(x,s) = p(x)\,p(s)$ against $H_1: p(x,s) \neq p(x)\,p(s)$ as a test for independence between two random variables (ullah1996, ). To put it in a different perspective, if we can define a framework where the latent variables interact with the observed variables by a correlation matrix, then the mean and variance of the matrices indicate how much information heterogeneity is present in the data describing the population.
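A short sketch of the KL form of mutual information is given below, assuming discrete distributions stored as arrays. The joint table is a hypothetical example and `kl_divergence` is a helper defined only for this illustration.

```python
import numpy as np


def kl_divergence(p, q):
    """D_KL(p || q) = sum p ln(p / q); assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    mask = p > 0  # terms with p = 0 contribute nothing
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())


# Hypothetical joint distribution of two dependent binary variables.
joint = np.array([[0.4, 0.1], [0.1, 0.4]])

# Product of the marginals: the joint distribution under independence.
product = np.outer(joint.sum(axis=1), joint.sum(axis=0))

# Mutual information as the divergence of the joint from the product.
mi_as_kl = kl_divergence(joint, product)
```

The divergence is zero when the two arguments coincide and positive here because the joint table departs from the product of its marginals, which is the quantity an independence test would examine.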

## 3 Methodology

We propose a generative model framework that extends rational inattentive behaviour in discrete choice, interpreting it as an optimization process rather than a structural model specification.
We differentiate our work from the generalized entropy function described in fosgerauetal2017 () by framing non-normative behaviour as a learning model – allowing for random perturbations to be data-driven.
Under this framework, the estimation of a generative model is assumed to emulate information processing constraints in rational inattention behaviour and to identify observed and latent variable interactions through a neural network interface.
The correlation between random decisions and information priors is reflected through the estimated latent variable parameters.
We use a Restricted Boltzmann Machine (RBM) learning algorithm as an example to estimate the generative model parameters.
Other forms of generative model algorithms (e.g. Autoencoders, Generative Adversarial Networks (GANs), Deep Belief Nets (DBNs) (goodfellowetal2016, )) can similarly be used.
Another simpler form of generative modelling is principal component analysis (PCA).
However, PCA has severe limitations as it cannot handle complex non-linear relations in the data (hintonsalakhutdinov2006, ).
We focus on the RBM learning algorithm as we will show that it approximates rational inattention information processing, with similarities to an error components model.
The error components control for the heterogeneity in the observed utility and variances in the unobserved utility, where the unobserved utility is represented by an entropy function.

### 3.1 Proposed generative model framework

The generative model framework is a tri-partite RBM with a data layer representing the set of observed variables including a dependent variable and a hidden layer representing the set of latent variables (see Fig. 1). The generative model can be framed as a fully connected tri-partite graph where is the set of graph nodes and are the graph edges. The nodes from are connected to by edge subset , representing the choice model explanatory variable coefficients. The edges between and are the correlation matrix between the latent and observed variables. Decision level heterogeneity is represented by the edge subset . The algorithm focuses on generating synthetic data using a blocked Gibbs sampling protocol, alternating between observed and latent variable samples from the joint distribution conditioned on the previous step. A non-zero valued covariance matrix represents the level of information heterogeneity captured in the data. A zero covariance matrix indicates that the observed explanatory variables capture all the taste variations and assumes a fully homogeneous population. The observed data can be inferred by sampling from the generative model probability distribution. By minimizing the KL divergence between the observed and generated data, we learn the parameters of the correlation matrix between the observed and latent variables. When the generated data match the observations, the underlying priors are assumed to have encoded the information heterogeneity of the population and can be represented in the choice model.
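The blocked Gibbs sampling protocol can be sketched as follows for a standard two-layer binary RBM, which is a simplification of the tri-partite structure described here. The weight matrix `W`, the biases `b` and `c`, the layer sizes and the random initial data are all illustrative assumptions rather than the specification used in this study.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gibbs_step(v, W, b, c, rng):
    """One blocked Gibbs sweep: sample all latent units given the
    observed units, then all observed units given the latent units."""
    p_h = sigmoid(c + v @ W)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(b + h @ W.T)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h


rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3  # hypothetical layer sizes
W = 0.1 * rng.standard_normal((n_visible, n_hidden))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)

# Start the chain from four hypothetical binary observation vectors.
v = rng.integers(0, 2, size=(4, n_visible)).astype(float)
for _ in range(10):
    v, h = gibbs_step(v, W, b, c, rng)
```

Each sweep conditions only on the state produced by the previous step, so all units within a layer can be sampled in one vectorized block.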

### 3.2 Model specification

The RBM architecture was designed as an efficient feature descriptor that progressively trains a fully connected non-linear model structure (hintonetal2006, ).
The interactions between the two parallel components capture the information about the heterogeneity present between hypotheses.
Each latent variable represents a specific state encoded as distributed binary patterns.
Distributed binary patterns are commonly used in digital signal encoding; we make the analogy to digital encoding to refer to choice behaviour perceptions.
A latent variable model with $J$ binary elements can represent up to $2^{J}$ different behaviour perceptions. The Boltzmann architecture uses this representation with a stochastic sampling algorithm to learn the model parameters.
Other forms such as multinomial discrete vectors or multivariate normal can also be used as possible encoding patterns, but binary encodings are the most straightforward method to simplify model inference.
The different combinations of latent variables form the complex behavioural activity patterns and are inferred through sampling from the posterior.
Similar to a random utility specification, we start with a scalar energy value describing the joint configuration of observed explanatory variables, dependent choice variable and latent variables:

(5) |

The energy function is parameterized by a set of coefficients , , where are the choice model coefficients and are the constants of the observed explanatory, dependent and latent variables respectively. and are the parameters matrices representing the information heterogeneity captured by the latent variables given the observed explanatory and dependent variables. is a discrete dependent variable representing the choice alternatives, e.g. or representing a selected alternative. is a vector of observed explanatory variables either as discrete or continuous values. Multiple discrete and continuous dependent values can also be used as the output (wongfarooq2019, ). is a vector of stochastic binary variables. Given that the non-latent variable terms can be factorized out, the posterior over the latent variables is as follows:

(6) |

Using the aforementioned energy function allows the conditional to be factorized. Defining the normalizing constant as the sum of the binary configurations, we obtain the normalized probability density function for each latent variable :

(7) | ||||

(8) |

The objective is to optimize the model parameters such that a sample is generated with a distribution as close as possible to the data distribution . Computing the energy over the data layer corresponds to the expected energy of the model minus the entropy:

(9) |

which can be simplified into the form:

(10) | ||||

(11) | ||||

(12) | ||||

(13) |

Eq. 13 is a direct interpretation of the generalized entropy formulation for discrete choice (anas1983, ; fosgerauetal2017, ). The coefficients stand for the unknown parameters of the explanatory variables for each alternative and for the generative model respectively. Increasing decreases the energy over the data generating distribution conditioned on a choice alternative, while increasing decreases the energy over all data generating configurations. represents the alternative specific constants and is the flexible error component generator given a specific input configuration of observed and with a constant . If this term is near zero, the expected energy function is equivalent to a utility function in a random utility maximizing (RUM) model. By definition, the probability of being generated is the Boltzmann distribution with energy :

(14) |

The computation of the marginal , which sums over an exponential number of possible configurations of the data vector, becomes difficult as we increase the number of explanatory variables.
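To illustrate why the latent variables can be summed out in closed form while the normalization over data vectors cannot, the sketch below uses a standard two-layer binary RBM as a stand-in for the model above. The parameters `W`, `b`, `c` and the layer sizes are hypothetical. It compares the closed-form free energy with a brute-force sum over latent states, then normalizes by enumerating every data vector.

```python
import itertools

import numpy as np


def energy(v, h, W, b, c):
    """Joint energy of a standard binary RBM."""
    return -(b @ v) - (c @ h) - (v @ W @ h)


def free_energy(v, W, b, c):
    """F(v) = -b.v - sum_j softplus(c_j + (v W)_j), latent units summed out."""
    return -(b @ v) - np.logaddexp(0.0, c + v @ W).sum()


rng = np.random.default_rng(1)
n_visible, n_hidden = 4, 3  # hypothetical layer sizes
W = rng.standard_normal((n_visible, n_hidden))
b = rng.standard_normal(n_visible)
c = rng.standard_normal(n_hidden)
v = np.array([1.0, 0.0, 1.0, 1.0])

# Brute force over all 2**n_hidden latent configurations.
hidden_states = [np.array(s, dtype=float)
                 for s in itertools.product([0, 1], repeat=n_hidden)]
brute = -np.log(sum(np.exp(-energy(v, h, W, b, c)) for h in hidden_states))
closed = free_energy(v, W, b, c)

# Normalizing requires all 2**n_visible data vectors; this enumeration
# is the step that grows exponentially with the number of variables.
visible_states = [np.array(s, dtype=float)
                  for s in itertools.product([0, 1], repeat=n_visible)]
log_z = np.log(sum(np.exp(-free_energy(u, W, b, c)) for u in visible_states))
p_v = np.exp(-closed - log_z)
total = sum(np.exp(-free_energy(u, W, b, c) - log_z) for u in visible_states)
```

With four observed binary variables the enumeration covers 16 vectors; each additional variable doubles that count, which motivates the approximate inference in the next section.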

### 3.3 Objective function formulation

Our proposed framework addresses the estimation problem for a highly non-linear and non-closed form function using variational inference. We select from a family of distributions that produce an approximate posterior distribution. The specification of the posterior distributions is obtained from data accumulation during the learning phase. If we restrict the family of distributions that are tractable and can be factorized over each variable in , the problem of simulation-based estimation becomes significantly simpler. For the sake of clarity, we omit the parameter terms in the equations below. First, we consider in terms of energy and the joint probability as follows:

(15) |

We can map the energy of the observed part as a function of the total system energy in a formulation similar to Eq. 1 by defining . The posterior over the latent variables as a function of energy using Bayes rule, results in a Boltzmann probability function over the joint distribution, which reveals the similarities to an MNL model:

(16) |

If we take the expected values with respect to the posterior on (Eq. 15), the uncertainty of choice can be expressed in terms of expected energy and entropy denoted as the evidence lower bound :

(17) |

In Eq. 17, a rational inattentive based choice can be framed as the information difference between the expected energy and the entropy gain. The first term on the right of Eq. 17 denotes the individuals’ behaviour towards prior expectations about the choice. The second term is the entropy and it can be viewed as the information processing constraints in a rational inattentive model or a penalty for low energies. It ensures that the generative model produces low uncertainty values for inputs with high probability in the true data distribution and high uncertainties for all other inputs (ranzatoetal2007, ). Minimizing uncertainty implies both utility maximizing and entropy seeking behaviour. Computing the evidence is intractable, but we can use the posterior to evaluate the marginal log likelihood (kingmawelling2013, ).

In many cases, computing the posterior may be difficult when the distribution is complex, as we require an integral over all configurations of latent variables to find the marginal or denominator in Eq. 16. The primary motivation of defining the problem as variational inference is that we can approximate the posterior distribution using a tractable arbitrary distribution (bleietal2017, ). In the estimation procedure, we find the parameters that make as close as possible to the posterior by minimizing where is the approximating distribution, then we have:

(18) |

To show that the proposed distribution can be used to approximate , we compute the marginal loglikelihood over to minimize the KL divergence of from :

(19) | ||||

(20) | ||||

(21) | ||||

(22) |

Using the fact that the KL divergence cannot be negative, we get the lower bound on the model evidence and we define the variational free energy as:

(23) |

The intuition from Eq. 23 is that minimizing the variational energy has the same outcome as minimizing . The bound is exact if term is zero, which would happen if matches perfectly. Therefore, following the gradient of yields the optimal solution for . Another equivalent form of variational free energy can be derived by transforming the marginal into the conditional likelihood:

(24) |

In Eq. 24, the objective function can be optimized through assigning specific priors over the generative model then measuring how well the priors represent the observations. More generally, minimizing together with the KL divergence is a good substitute for minimizing the log-likelihood function (ranzatoetal2007, ). The first and second terms on the right-hand side are known as the fit and complexity respectively in Bayesian statistics. The first term defines the accuracy of the data generating model. If we presume that is a complex model (real-world representation, intricate correlation between behaviour and choices, etc.), then the complexity tells us how much capacity is required for the (non-trivial) approximator to match the empirical distribution. The variational energy can be used to determine the strength of non-linear interactions between components in a model. Minimization of variational energy provides consistent and reproducible models, equivalent to maximum likelihood estimation. We can establish the choice model by interpreting the data generating probabilities of a given data vector as the individuals’ information heterogeneity by minimizing the variational lower bound. The objective cost function now becomes selecting the model parameters such that:

(25) |
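The lower bound used above rests on the standard identity that the log evidence equals the evidence lower bound plus the KL divergence of the approximating distribution from the posterior. The sketch below checks this numerically on a small discrete model. The joint table and the approximating distribution `q` are made-up values, and the variational free energy is taken as the negative of the bound, which is a sign-convention assumption.

```python
import numpy as np

# Hypothetical joint distribution p(x, h): rows index two observed
# states x, columns index four latent states h. Entries sum to one.
joint = np.array([[0.10, 0.25, 0.05, 0.10],
                  [0.20, 0.05, 0.15, 0.10]])

x = 0                              # observed state
p_xh = joint[x]                    # p(x, h) for the observed x
log_evidence = np.log(p_xh.sum())  # ln p(x)
posterior = p_xh / p_xh.sum()      # p(h | x)


def elbo(q):
    """Evidence lower bound: E_q[ln p(x, h)] + H(q)."""
    return float((q * np.log(p_xh)).sum() - (q * np.log(q)).sum())


def kl(q, p):
    """D_KL(q || p) for strictly positive discrete distributions."""
    return float((q * np.log(q / p)).sum())


q = np.array([0.4, 0.3, 0.2, 0.1])  # arbitrary approximating distribution
gap = log_evidence - elbo(q)        # equals KL(q || posterior)
free_energy = -elbo(q)              # variational free energy (assumed sign)
```

The gap closes exactly when `q` equals the posterior, so lowering the free energy and lowering the KL divergence lead to the same solution.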

In the proposed generative model, we are interested in evaluating large numbers of non-linear latent variables which belong to a family of extreme valued distributions parameterized by latent variable parameters , . The primary assumption is that the approximating distribution can be factorized, such that it gives a tractable form:

(26) |

This form allows the generative model to produce distributions with sharper boundaries over conventional mixture models. Using this specification, model variance can be increased or decreased with the number of activated latent variables.

### 3.4 Parameter estimation

We formalize the model learning as minimizing KL divergence given some observed data . The key advantage of this is that we can incorporate the differences between individual’s actual behaviour and mean population behaviour effectively in the objective function. The parameter update rule for a generative model is obtained by implementing a stochastic gradient descent on the variational free energy function, updating the weights of the coefficients between latent and observed variables according to the sampling states. Consequently, the gradients with respect to the parameters are as follows:

(27) |

where the expectation is over . The learning algorithm is based on a Gibbs chain starting at an initial sample from the data distribution and converging to the RBM data generating distribution after performing alternating blocked Gibbs sampling between the latent and observed variables. A naive implementation of this learning algorithm would require simulating the Gibbs sampler to equilibrium after every model update before drawing a new set of observations from the data. Sampling from the generative model to produce with and updating the model parameters between each iteration has been suggested as an optimal trade-off that gives fast estimation without loss of generality or stability (hintonetal2006, ). The first term on the right-hand side of Eq. 27 is the derivative of the energy function w.r.t. the initial Gibbs samples and the second term corresponds to the gradient of the energy function after steps.
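A minimal sketch of this truncated Gibbs chain update (commonly called contrastive divergence) is given below for a standard two-layer binary RBM. The learning rate, layer sizes, random data and the function name `cd_k_update` are illustrative assumptions, and the hybrid choice model step of this study is not included.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cd_k_update(v0, W, b, c, rng, k=1, lr=0.05):
    """One parameter update: statistics at the data minus statistics
    after k blocked Gibbs sweeps started from the data."""
    p_h0 = sigmoid(c + v0 @ W)
    v, p_h = v0, p_h0
    for _ in range(k):
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(b + h @ W.T)
        v = (rng.random(p_v.shape) < p_v).astype(float)
        p_h = sigmoid(c + v @ W)
    n = v0.shape[0]
    W_new = W + lr * (v0.T @ p_h0 - v.T @ p_h) / n
    b_new = b + lr * (v0 - v).mean(axis=0)
    c_new = c + lr * (p_h0 - p_h).mean(axis=0)
    return W_new, b_new, c_new


rng = np.random.default_rng(0)
# Hypothetical binary observations: 32 records, 6 variables.
data = rng.integers(0, 2, size=(32, 6)).astype(float)
W = 0.01 * rng.standard_normal((6, 3))
b = np.zeros(6)
c = np.zeros(3)
for _ in range(20):
    W, b, c = cd_k_update(data, W, b, c, rng, k=1)
```

Setting `k=1` restarts the chain at the data on every update instead of running the sampler to equilibrium, which is the shortcut described in the text.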

Our proposed modification to the RBM learning algorithm uses a hybrid generative learning and maximum utility estimation. Rather than focusing solely on the optimization of the generative component, we also try to maximize the accuracy of our choice model given the data and generative samples. After each generative learning step, we update the choice model coefficients by performing maximum likelihood on the conditional using the choice alternative as the dependent variable. Next, we sample latent variables from the generative model using the observed explanatory variables as inputs. These latent variables are assumed to represent the information heterogeneity that is not captured by the explanatory variables. Our modification provides integration with discrete choice modelling methods and allows for other hybrid choice model use cases that can be explored in the future. We specify the conditional logit model using observed and latent variables as follows:

(28) |

where there are alternatives in the choice variable . In this step, only the coefficient and alternative specific constants are updated (by maximum likelihood) while keeping the parameters from the generative model unchanged. Given that parameters and are estimated from the generative model learning algorithm providing model error correction, the coefficients of the choice model are expected to converge to an unbiased, homogeneous value. This means that as we improve the precision of the data generation protocol, the choice model can be estimated without systematic errors.
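The structure of a conditional logit with an added latent variable term can be sketched as below, assuming the utility is linear in the observed variables and in the sampled binary latent variables. The names `beta`, `D` and `asc`, the dimensions and the random inputs are placeholders for illustration and are not the notation or data of this study.

```python
import numpy as np


def choice_probabilities(x, h, beta, D, asc):
    """Softmax over alternatives of utility = x.beta + h.D + ASC."""
    utility = x @ beta + h @ D + asc  # shape (n_obs, n_alternatives)
    # Subtract the row maximum before exponentiating for stability.
    utility = utility - utility.max(axis=1, keepdims=True)
    expu = np.exp(utility)
    return expu / expu.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
n_obs, n_x, n_h, n_alt = 5, 4, 3, 3  # hypothetical dimensions
x = rng.standard_normal((n_obs, n_x))
h = rng.integers(0, 2, size=(n_obs, n_h)).astype(float)
beta = rng.standard_normal((n_x, n_alt))  # choice model coefficients
D = rng.standard_normal((n_h, n_alt))     # latent-to-choice weights
asc = np.zeros(n_alt)                     # alternative specific constants

p_full = choice_probabilities(x, h, beta, D, asc)
# With all latent variables switched off, the model reduces to a plain MNL.
p_mnl = choice_probabilities(x, np.zeros_like(h), beta, D, asc)
```

In the estimation step described above, only `beta` and `asc` would be updated by maximum likelihood while the latent weights stay fixed at the values learned by the generative model.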

### 3.5 Economic interpretation

The basis for economic interpretation of a generative model is through a combination of individual utility and entropy. Suppose that an individual will be in one of latent decision states, each state has associated with it a configuration of latent variables: . These latent variables are related to choice selection strategies, complexity and influence of repeated nature of travel activity choices. Thus they are interpreted as potential decision strategies. If in a particular state contains all zero elements, then the choice strategy is a purely utility driven one (since latent variable attributes are ignored). If by contrast, the latent variables are non-zero, then one might argue that the individuals used their internal information processing constraints to develop a choice strategy. These interpretations are similar to the rational inattention model, in which they were identified as decision strategies characterized by continuously optimizing agents (sims2003, ).

We assume some distribution function to describe , an error generating density function that depends on for all alternatives. The density is the distribution of the unobserved heterogeneity on the individuals with similar utilities for each alternative. It represents the idealistic subjective perception of a particular individual on a specific choice context. We assume that are extreme value distributed across individuals and decisions:

(29) |

This specification allows a form of energy based model to be generated using entropy as a measure, without relying completely on hypothesis-driven utility specifications (train2009, ). As such, from Eq. 28, the generative model specification under a generalized extreme value function can be derived as follows:

(30) |

where , and . is non-negative, homogeneous of degree and function is , when for and if is odd and if is even. Thus, the level of uncertainty in a choice due to information heterogeneity is described using a function calculated on a set of prior weights and latent variables. The resulting approximate entropy is given as the negative log of the error generating function:

(31) |

We can expand the model from an MNL specification by substituting :

(32) |

where the arguments in are linearly separated into the observed utility and entropy . Thus the probability of choosing an alternative is a function of the observed utility, corrected by the information processing cost of the set of alternatives and its explanatory variables observed by the decision maker. An interesting consequence is that changes at every instance in the variable space i.e. individuals with similar utilities may have different choice distributions. Furthermore we can conclude that the changes in the decision making policy are influenced in two ways: first, through the direct correlation with the observed attributes and second, indirectly through the information processing capacity of the decision maker. As a result, even though it is impossible to directly measure the result of economic policy changes on the latent variables, we can obtain the mean and variance of the latent parameter distribution to evaluate the information sensitivity with respect to each explanatory variable.
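As an illustration of the separation described above, the following is a minimal sketch in which the choice probability is a softmax over the observed utility plus an additive per-alternative correction term. The function name `choice_probabilities`, the numerical values, and the sign convention of the correction are assumptions made for illustration only; they do not reproduce Eq. 32.

```python
import numpy as np

def choice_probabilities(utility, correction):
    """Softmax over observed utility plus an additive correction term.

    `correction` stands in for the entropy (information cost) term; its
    sign convention here is an assumption, not taken from the paper.
    """
    z = np.asarray(utility, dtype=float) + np.asarray(correction, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)  # stabilise the exponentials
    expz = np.exp(z)
    return expz / expz.sum(axis=-1, keepdims=True)

# Hypothetical observed utilities for three alternatives.
V = np.array([0.5, 1.0, -0.2])

# With a zero correction the expression reduces to a standard MNL.
p_mnl = choice_probabilities(V, np.zeros(3))

# Two individuals with identical utilities but different correction terms.
p_a = choice_probabilities(V, np.array([0.0, -0.8, 0.3]))
p_b = choice_probabilities(V, np.array([0.4, 0.0, -0.5]))
```

Here `p_a` and `p_b` differ even though both individuals share the same utility vector `V`, which mirrors the statement that individuals with similar utilities may have different choice distributions.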

### 3.6 Statistics for model evaluation and validation

One way to obtain statistics for model evaluation and validation is through simulation and hyperparameter search. Model evaluation can be performed on out-of-sample simulations using the adjusted squared correlation statistic, which serves as an equivalent to KL divergence for determining distribution accuracy. For evaluation, we fix some of the input data, use the generative model to produce new data, and compare the distribution of the generated data against the original.
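To make the two accuracy measures concrete, the sketch below computes the KL divergence and a squared correlation between an observed and a generated discrete distribution. The helper names and the example vectors are hypothetical, and the plain squared Pearson correlation is used here rather than the adjusted statistic.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions, normalised to sum to one."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    # eps guards against log(0) for empty categories.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def squared_correlation(p, q):
    """Squared Pearson correlation between two share vectors."""
    r = np.corrcoef(p, q)[0, 1]
    return float(r ** 2)

# Hypothetical observed and generated category shares.
observed = np.array([0.10, 0.55, 0.05, 0.20, 0.10])
generated = np.array([0.12, 0.50, 0.06, 0.22, 0.10])

kl = kl_divergence(observed, generated)
r2 = squared_correlation(observed, generated)
```

A KL divergence near zero and a squared correlation near one both indicate that the generated distribution is close to the observed one.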

There is no exact solution for the number of latent variables required to create an optimal model. The most commonly used approach is to validate the model by iteratively testing various numbers of latent variables. We note that validation is only a crude test of performance and there are generally no accepted methods to adequately determine the optimal number of variables. Several studies in the literature have provided so-called ‘rules of thumb’ regarding the number of inputs and layer sizes (alwosheeletal2018, ). However, the optimal number of latent variables can differ greatly between datasets. With too few latent variables, the model cannot capture the complex structure in the data; with too many, the model may overfit and estimation time increases.

Evaluating the sensitivity of parameters associated with the explanatory variables can be more challenging. In our experiment, we found that monitoring changes to -parameters as we increase the number of latent variables works well for sensitivity analysis. Theoretically, for variables not influenced by information processing constraints, -parameters should remain consistent. In contrast, for variables that are sensitive to information processing constraints, -parameters would vanish or shrink to a small value as we increase the number of latent variables, indicating that the choice response could not have been derived from that source (sims2003, ). From a macroeconomic perspective, decision making actions should respond smoothly to external factors, and any disturbances or randomness should be distinctive and arise only from the individual’s internal information processing constraints (sims2003, ).

### 3.7 Comparison with supervised neural networks

Although the probability distribution in Eq. 28 might seem to be equivalent to a single layer neural network (e.g. DNN) with a softmax output, we argue that this is not the case. In a DNN, model parameters are optimized to maximize a predictive output , which may result in significant overfitting if the model is misspecified or too many hidden units are used. Using multiple hidden layers may also degrade the model and result in worse performance (heetal2016, ). In our generative modelling approach, however, parameters are optimized to reduce information loss by minimizing in the mapping process between observed and latent states, allowing as much of the original data as possible to be reconstructed. A generative model provides a form of model generalization such that the parameters stay within the range of values that are realistically representative of the underlying behaviour, reducing the probability that the model overfits the choice variable.

Since latent variables are stochastic, may not always be generated by the same underlying configuration. Likewise, each sample of the observed data vector may produce many different configurations of latent variables. The advantage of using unsupervised learning over supervised likelihood learning methods in discrete choice models is that it provides a flexible, high-level distributed representation and minimizes optimization inefficiencies caused by random initialization (tehetal2003, ). Model optimization uses a greedy learning algorithm to determine the underlying structure that captures the unobserved heterogeneities without depending on aggregate choice samples. Similar to rational inattention models, entropy in the variational free energy function is the cost of information from sampling from the generative model.

## 4 Case study

### 4.1 Data preparation

We consider a dataset collected from trip trajectories recorded by respondents from the Greater Montreal Metropolitan Area (Fig. 2). The data is available as an open dataset provided by the City of Montreal (datamobile2016, ). A total of 293,330 trip observations are available in the dataset, of which 58,034 have complete travel mode, purpose and trip characteristic information. We divide the data into two partitions. The first dataset contains complete (labelled) trip data and is used for model training and validation. The second dataset contains incomplete (unlabelled) data and is used for model training, validation, model simulation and analysis.

For model evaluation, we train a generative model using and then we compute the mode choice log likelihood on for validation. The samples are randomly shuffled and split 70:30 for training and validation. We assume a multinomial extreme value distribution for categorical observed variables and a log-normal distribution for continuous variables. Log-normal is used as the approximating distribution since the continuous data types (speed, distance and duration) follow a positive, right-tailed distribution. Respective trips of individuals were recorded by self-imputation of their activity for each instance. Routes of individuals are sampled by GPS traces from their smartphones at frequent intervals. Speed, distance, activity type, trip duration and trip start location were used as explanatory variables in the estimation. The alternatives are: 1:cycling, 2:driving, 3:driving + transit, 4:transit and 5:walk. Continuous valued variables were normalized to unit standard deviation before model estimation. A one-of-j dummy variable encoding was applied to categorical variables. A sine/cosine 2D transformation was applied to cyclical continuous values, e.g. time information.
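The three preprocessing transformations can be sketched as follows. This is a minimal illustration only: the helper names are hypothetical, scaling is assumed to divide by the sample standard deviation without centering, and the cyclical variable is assumed to be an hour of day with a 24 hour period.

```python
import numpy as np

def scale_to_unit_std(x):
    """Rescale a continuous variable to unit standard deviation (no centering)."""
    x = np.asarray(x, dtype=float)
    return x / x.std()

def one_of_j(labels, n_categories):
    """One-of-j dummy encoding of integer category labels 0..n_categories-1."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(n_categories)[labels]

def cyclical_encode(values, period=24.0):
    """Map a cyclical value (e.g. hour of day) to a point on the unit circle."""
    angle = 2.0 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

# Hypothetical trip records: distance, activity type index and start hour.
distance = scale_to_unit_std([1.2, 3.4, 0.8, 12.5])
activity = one_of_j([0, 2, 1, 2], n_categories=3)
start_time = cyclical_encode([0.0, 6.0, 12.0, 23.5])
```

The sine/cosine pair keeps times that are close on the clock (for example 23:30 and 00:00) close in the encoded space, which a single linear time variable would not.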

### 4.2 Choice model validation

We present the results of our model validation by assessing the model training performance and analyzing the properties of the estimated parameters. We report the results of our training and validation on model instances with different latent variable sizes: (standard MNL), and . In our experiments, we did not notice any significant improvement beyond 50 latent variables in our model. To minimize the probability of overfitting in the generative model training, we validate the generative model by monitoring the likelihood loss on the labelled data and select the model parameters at the minimum validation loss.

We used a standard batch stochastic gradient descent (SGD) learning algorithm divided into data batches and iterate over blocked-Gibbs sampling steps.
We fixed the hyperparameters for all our experiments to be , , and a learning rate of is used and model parameters are updated in parallel every batch cycle. (The problem of identifying optimal hyperparameters is still not fully understood and it does not provide any useful information with respect to econometric interpretation. In light of this, we selected these hyperparameters as our baseline for ease of reproducibility in future work.)

We monitor validation error by computing the total negative log likelihood of the validation data over the choice model at each iteration. As observed in the learning curves (Fig. 3), the model estimation process is stable and converges gradually without overfitting. At , the model achieved the best overall performance in terms of validation log likelihood. However, the relative gain in performance decreases as we increase the number of latent variables. We hypothesize that there is an upper bound on the effective number of latent variables that can represent unobserved variations in the data. This limit can be raised if a greater variation in data is used, i.e., data from different sources or over a longer collection time frame. Note that this analysis is not a test for the ‘best’ model; our primary objective is to understand the sensitivity of econometric parameters when a generative learning model is used to account for information heterogeneity. The negative log likelihood decreases rapidly for the first 20 iterations, then plateaus as it approaches 100 iterations. Estimation time for each model instance was less than 1 hour on GPU hardware.

Model performance is evaluated by comparing the adjusted squared correlation statistical fit. Fig. 4 shows the mode share distribution of the model validation. For the baseline model we obtained a value of 0.807. We obtained a value of 0.940, representing a 15% increase in relative predictive performance. The nominal trend shows that distribution accuracy increases with the number of latent variables. At , performance drops slightly compared to indicating that the performance does not increase asymptotically with the number of latent variables. Nevertheless, the results show that the model can be estimated with high accuracy, using KL divergence over maximum likelihood as the objective function. In this example, the models do not consistently predict the probabilities of the driving+transit and walking alternatives. One explanation is the low observation counts of these two alternatives. Another possible explanation is that driving+transit and walking trips have a low correlation with the observed explanatory variables.

### 4.3 Latent variable analysis

To understand the representational value of latent variables, we analyze their sparse-overcomplete properties (ranzatoetal2007, ). A sparse-overcomplete representation describes a situation in which a large number of latent variables are estimated while only a small number of them are non-zero (ranzatoetal2007, ). It is a practical constraint that allows for more efficient use of latent variables and more flexibility in handling complex correlations, which results in a better approximation of the statistical distribution of the data. Sparse representation has two main advantages in generative modelling (ranzatoetal2008, ; glorotetal2011, ). The first advantage is that the model is able to control the dimensionality of the representation given a set of inputs, avoiding the overfitting problem. The second advantage, in the context of travel behaviour model inference, is that the resulting representation is more likely to be linearly separable, decreasing the complexity of the model even though more parameters are estimated. This means that even with a large number of latent variables, a sparse distribution of parameters constrains the model to learn the distributions that are most statistically significant in reproducing the original data.

The plots in Fig. 5 show the mean and variance of estimated latent variable parameters given the choice outputs. Since we use binary coding for latent variables, the parameters offer insights into how many latent variables are utilized at any one time. Parameter vectors with mean values close to zero and low variance indicate that the latent components are sparsely distributed. We assume that overcomplete representation () does not cause model overfitting as not all latent variables are active. The figure illustrates that our generative modelling approach is an efficient method of capturing the underlying heterogeneity across different mode choice decisions. The mean converges to zero and the standard deviation decreases as the number of latent variables increases, indicating that the generative model ‘suppresses’ the influence of less relevant latent variables on the behaviour model.

The results suggest that the RBM learning algorithm inhibits weight connections between the observed and latent variables in order to produce a sparse representation. At (), the mean parameter activation is near zero with small standard deviation () for cycling, driving, driving + transit modes with an average latent variable activation rate of 85.4%, 84.4% and 87.7% respectively. For transit and walk modes the average activation rates are 90.6% and 92.6% respectively, indicating that these modes have a higher level of information heterogeneity and are less correlated with the observed explanatory variables.

### 4.4 Generative model evaluation

To evaluate generative model performance, we measure the statistical fit of the reconstructed distribution. Simulated reproduction of population data has been used previously to analyze the efficiency of model-based fitting (farooqetal2013, ). Simulation experiments allow evaluation of the model under limited data knowledge: reproducing an accurate data distribution from partial information demonstrates flexibility in capturing decision heterogeneity due to information constraints. Therefore, the performance results of these simulation experiments can be used to calibrate large scale data-driven models where complex data correlation is present, and to account for the presumption that individuals have limited information processing capacity in choice selection. We use Gibbs sampling to obtain data from the generative model. First, we evaluate the data generating distribution accuracy using the unlabelled dataset . Fig. 6, Fig. 7 and Fig. 8 show the data generation results for activity, distance and trip duration variables respectively.

Next, for the data generating process, we draw an initial sample from the dataset and fix the observed variable to that data vector and perform Gibbs sampling, alternating between the latent and observed sample conditional probabilities. Lastly, we clamp the non-target variables to the data vector and update the simulated values of the target observed variable. For instance, we generate activity type data using the following steps:
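The clamped sampling procedure described in this paragraph can be sketched as below. This is an illustrative simplification and not the authors' algorithm listing: it assumes a binary RBM with sigmoid units (rather than the categorical and log-normal observed units used in the study), and the function name, weights and dimensions are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def clamped_gibbs(v0, W, b_vis, b_hid, target_idx, n_steps, rng):
    """Alternate latent/observed sampling, resampling only the target units.

    Observed units outside `target_idx` stay clamped to the data vector v0.
    """
    v = np.asarray(v0, dtype=float).copy()
    for _ in range(n_steps):
        # Sample latent units given the current observed vector.
        p_h = sigmoid(v @ W + b_hid)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        # Sample observed units given the latent sample.
        p_v = sigmoid(h @ W.T + b_vis)
        v_new = (rng.random(p_v.shape) < p_v).astype(float)
        # Update the target units only; non-target units keep their data values.
        v[target_idx] = v_new[target_idx]
    return v

# Hypothetical model with 6 observed and 4 latent units.
rng = np.random.default_rng(0)
n_vis, n_hid = 6, 4
W = rng.normal(scale=0.1, size=(n_vis, n_hid))
b_vis = np.zeros(n_vis)
b_hid = np.zeros(n_hid)

v0 = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0])  # initial data vector
target_idx = np.array([0, 1, 2])               # e.g. units encoding the target variable
v_sim = clamped_gibbs(v0, W, b_vis, b_hid, target_idx, n_steps=50, rng=rng)
```

Repeating the procedure over many initial data vectors and collecting the simulated target values yields an empirical distribution that can be compared with the original data.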

The simulation results show the effects of increasing latent variables on the performance of the data generating model. and achieved high similarities in recovering the original data distribution with value well above . At , there was an insufficient number of latent variables to capture the structure of the data, shown by the low value. Increasing to significantly improves the result as it increases the non-linear information capacity.

### 4.5 Sensitivity analysis of model parameters

Finally, in this section, we investigate the systematic effects of the generative framework on -parameters in the mode choice model. In practice, bias and variances are subject to independent processes; as such, each individual may have vastly different underlying error correction functions for the same utility and for each configuration of explanatory variables. Mixed Logit specifications have been used previously to account for this problem, but unfortunately, any variability or noise in the dataset (e.g. through different collection techniques, missing information etc.) will be added to the -parameter model predictors. This is less of a problem if one is only interested in the relative variance given the model parameters. To account for the systematic effects of information heterogeneity, the net utility of each alternative should remain homogeneous across the population (e.g. zero noise level), such that the degree of uncertainty can be compensated by the latent constructs.

Fig. 9 shows the estimated -parameters of the choice models with different numbers of latent variables. The -parameters identify the systematic effects of each explanatory variable on each choice alternative. The values on the left edge of each plot show the -parameters estimated with a standard MNL model. As we increase the generative model capacity (by increasing the number of latent variables), -parameters converge to a stable predictor. This is an interesting finding as it may indicate that ordinary utility based choice models do not take into account the systematic effect of information heterogeneity.

We test the identification of the -parameters by computing the maximum entropy (maxent) estimate on the observed choice probability in the dataset shown in Section 4.5. The maxent estimate quantifies the degree of uncertainty within the underlying model, accounting for its complexity, and helps determine whether the variance can be attributed to information heterogeneity. Analysis of maxent can provide information about the uncertainty of the predictors across choice probabilities (golanetal1996, ). We compute maxent of the explanatory variable parameters using the formula:

(33) |

where the population class shares for each alternative are: cycling=0.068, driving=0.613, driving + transit=0.028, transit=0.222 and walking=0.069 from the labelled dataset. The resulting may therefore be interpreted as the maxent estimate of as the proportion of the sample population in alternative . Likewise, a high maxent value indicates a high degree of stochasticity in the decision-making process. We find by computing . As the negative entropy increases, e.g. , the correlation between the -parameter and choice probability converges to the true value, e.g. .
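For orientation, the sketch below computes the Shannon entropy of the reported population class shares and compares it with the maximum attainable for five alternatives. This is only an illustration of entropy as a measure of stochasticity over the mode shares; it is an assumption-labelled example and does not reproduce the maxent estimator of Eq. 33.

```python
import math

# Population class shares reported for the labelled dataset.
shares = {
    "cycling": 0.068,
    "driving": 0.613,
    "driving + transit": 0.028,
    "transit": 0.222,
    "walking": 0.069,
}

# Shannon entropy (in nats) of the observed mode shares.
entropy = -sum(p * math.log(p) for p in shares.values())

# Upper bound, reached only when all five alternatives are equally likely.
max_entropy = math.log(len(shares))

# Normalised entropy in [0, 1]; values near 1 indicate near-uniform shares.
normalised_entropy = entropy / max_entropy
```

Because driving dominates the mode shares, the entropy falls below the uniform upper bound, meaning that the aggregate choice distribution is less uncertain than a fully random choice among the five modes.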

The maxent estimate indicates the level of correlation between the set of -parameters and the output dependent choice variable. Section 4.5 shows that the -parameters for distance (2.833) and education activity (2.234) variables in the benchmark model are less likely to influence decisions relative to the other predictors, which is an indicator of model misspecification. However, as we increase the number of latent variables in the generative model, maxent decreases and, as such, the -parameters become better predictors of the behaviour. This suggests that the mode choice decision behaviour of individuals is less sensitive to trip distance and education related activities.