Conventional methods of estimating latent behaviour generally rely on attitudinal questions, which are subjective and may not always be available in survey data. We hypothesize that latent variables can instead be estimated through undirected graphical models, for instance non-parametric artificial neural networks. In this study, we explore the use of generative non-parametric modelling methods to estimate latent variables from the prior choice distribution without the conventional use of measurement indicators. A restricted Boltzmann machine is used to represent latent behaviour factors by analyzing the relationship between the observed choices and explanatory variables. The algorithm is adapted for latent behaviour analysis in a discrete choice scenario, and we use a graphical approach to evaluate and interpret the semantic meaning of the estimated parameter vector values. We illustrate our methodology on a financial instrument choice dataset and perform statistical analysis of parameter sensitivity and stability. Our findings, supported by non-parametric statistical tests, show that machine learning methods can extract useful information on the behaviour of latent constructs, and that these constructs have a strong and significant influence on the choice process. Furthermore, our modelling framework is robust to input variability, as shown through sampling and validation.

## 1 Introduction

Complex theories of decision-making processes provide the basis of latent behaviour representation in statistical models that focus on psychometric data such as choice perception and attitudinal questions. Although these models can provide important insights into choice processes and underlying heterogeneity, studies have shown the limited flexibility and benefits of statistical latent behaviour models, i.e. Integrated Choice and Latent Variable (ICLV) models (chorus2014possibility; vij2016and). ICLV models have two known disadvantages. First, datasets are required to have attitudinal responses, for instance Likert scale questions in product choice surveys. Second, model mis-specification may occur when latent variable model equations are poorly defined, and attitudinal questions are subjective and may change over time.

The objective of this study is to use machine learning (ML) methods to analyze the underlying latent behaviour in choice models based on a set of synthetic ML considerations and hyperparameters, without explicitly using attitudinal or perception attributes. A growing body of behavioural research focuses on patterns and clusters of behaviour characteristics, including latent attitudes and choice perceptions. Yet, compared with specific advanced choice modelling strategies such as ICLV models, our knowledge of the prevalence and consequences of latent behaviour in choice models remains limited (vij2016and). Studies of hidden representations using neural network models may give us more nuanced and potentially new perspectives of latent variables on discrete choice experiments and choice behaviour theory (Rungie2012145). Given the many possible latent variable combinations, it is necessary to use advanced ML techniques to segment the population into groups with similar attitudinal profiles. For this study, we have chosen to use restricted Boltzmann machines (RBM). An RBM is a non-parametric generative modelling approach that seeks to find latent representations within a homogeneous group by hypothesizing that posterior outputs can be explained with a reduced number of hidden units (le2008representational). In addition, identifying common latent representations may enable policy makers to better understand the sensitivity and stability of latent behaviour models in surveyed and revealed preference data. We decouple the latent behaviour model underlying the data distribution by estimating it on a financial instrument choice behaviour dataset, without the need for subjective measurement indicators. The proposed method does not predefine a semantic meaning for each latent variable. Instead, we define a restricted Boltzmann machine to learn the latent relationships and approximate the posterior probability.

We show in our findings that an RBM modelling approach is able to characterize latent variables with semantic meaning without additional psychometric data. The parameters estimated through our RBM model have a strong and significant influence on the choice process. Furthermore, sensitivity analysis shows that this approach is robust to input data variance and that the use of generated latent variables improves sampling stability.

The remainder of the paper is organized as follows. Section 2 provides a background literature review on latent behaviour models. Section 3 describes the conditional RBM modelling approach and model training methodology, given only observed variables without attitudinal questions. Section 4 explains the data and the experiment procedure. Section 5 presents the results and performance tests. Section 6 analyzes the model sensitivity and stability. Finally, Section 7 discusses the conclusions and future research directions.

## 2 Background

Current practice in choice modelling is targeted at drawing conclusions about the mechanism of the stochastic model and not so much about the nature of the data itself. This leads to simple assumptions about data relevance and the statistical properties of explanatory variables (burnham2003model). A number of parametric and non-parametric modelling methods are available. Parametric models are regression based and random utility maximization structural models. Examples of non-parametric methods include latent class and variable models. Other statistical models include k-means or hierarchical clustering. These non-parametric methods are often criticized for being too descriptive and theoretical; they may result in inconsistent estimates and often do not allow generalizations to be made (ben1999discrete; atasoy2013attitudes; bhat2014new). Analyzing data through its statistical properties is generally applied to extract information about the evolution of the responses associated with stochastic input variables rather than to obtain good prediction capabilities. On the other hand, algorithmic modelling approaches such as artificial neural networks (ANN), decision trees, clustering and factor analysis are based on the ability to predict future responses accurately given future input variables within a ‘black-box’ framework (breiman2001statistical). Econometric choice models can be estimated by using both parametric and non-parametric methods that incorporate machine learning algorithms into discrete choice analysis to learn mappings from latent variables to the posterior distribution (eric2008active).

A number of different approaches which implement the use of attitudinal variables have been used in existing literature (ashok2002extending; morey2006using; hackbarth2013consumer). The first approach relies on a top-down modelling framework which makes prior assumptions that individuals are divided into multiple market segments and that each segment has its own utility function of underlying attributes. In the most generic form, these assumptions are based on multiple sources of unobserved heterogeneity influencing decisions, e.g. inter- and intra-class variance and ‘agent effect’ (yazdizadeh2017generic). Fig. 1 illustrates the Latent Class and ICLV model frameworks, showing the process of deriving latent classes or variables and how they are integrated into the structural choice model.

The Latent Class model (LCM) is one such form which assumes a discrete distribution among market segments (hess2014handbook). The LCM derives clusters using a probabilistic model that describes the distribution of the data. Based on this assumption, similarities within a heterogeneous population are identified through assignment of latent class probabilities. Individuals in the same class share a common joint probability distribution among the observed variables. Under the assumption of class independence, the utility is generated with a prior hypothesis from several sub-populations, and each sub-population is modelled separately. The resulting classes are often meaningful and easily interpretable. The unobserved heterogeneity in the population is captured by the latent classes, each of which is associated with a different utility vector in the sub-model (Fig. 1a). Another similar class of top-down models is finite mixture models, e.g. Mixed Logit, which allow the parameters to vary with a variance component and in which behaviour depends on the observable attributes and on the latent heterogeneity, which varies with the unobserved factors (hensher2003mixed).

The use of attitude and perception latent variables is also particularly interesting and popular in past work (glerum2014forecasting; atasoy2013attitudes). Choice models with measurement indicator functions group correlated indicators into multiple latent variables. This factor analysis method is similar to principal component analysis, where the latent variables are used as principal components (glerum2014forecasting). This approach involves the analysis of the relationship between indicators and the choice model. Within this domain, there are sequential and simultaneous estimation processes. The sequential approach first estimates a measurement model which derives the relationship between latent variables and indicators. Then, a choice model is estimated, integrating over the distribution of the latent variables. The main disadvantage of this approach is that the parameters may contain measurement errors from the indicator function that were not taken into account during choice model estimation.

To solve this issue, another approach uses simultaneous estimation of the structural and measurement models, which includes the latent variable in the choice model framework. This is the so-called Integrated Choice and Latent Variable (ICLV) model (Fig. 1b). The ICLV model explicitly uses information from measurement indicators and explanatory variables to derive latent constructs. This combined structural model framework has led to many interesting results, e.g. environmental attitudes in rail travel (hess2013accommodating), image, stress and safety attitudes towards cycling (maldonado2014exploring), and social attitudes towards electric cars (kim2014expanding). However, the simultaneous approach still relies on a separate measurement model (latent variable model) that describes the relationship to indicators. Despite the direct benefits of the ICLV model combining factor analysis with traditional discrete choice models, such an approach is only advantageous when attitudinal measurement indicators are expected to be available to the modeller and the observed explanatory variables are weak predictors of the choice model (vij2016and). Even when measurement indicators are available, they may not provide any further information that directly influences the choice beyond what is captured by the explanatory variables (chorus2014possibility). Consequently, mis-specification and other measurement errors may occur when the criteria are not associated with the choice model. Without measurement indicators to guide the selection of latent variables, we can alternatively use ML to derive latent variables through data mining. This can be implemented through generative modelling methods used in ML. Generative modelling in ML is a class of models which uses unlabelled data to generate latent features. Generative models learn the underlying choice distribution $p(y)$ and the latent inference $p(h \mid y)$, where $h$ is the latent variable. A Bayesian network that represents the probabilistic conditional relationships and dependencies between random variables is then used to derive the posterior distribution of $y$ given $h$ using $p(y \mid h) = p(h \mid y)\,p(y) / p(h)$. The denominator is given by $p(h) = \sum_i p(h \mid y_i)\,p(y_i)$, where $y_i$ indicates that choice $i$ is chosen. Efficient algorithms which perform ML and inference, such as RBMs, can be used in this method. The rapid advancement of machine learning research has led to the development of efficient semi-supervised training algorithms such as the conditional restricted Boltzmann machine (C-RBM) (salakhutdinov2007restricted; larochelle2008classification), a hybrid discriminative-generative model capable of simultaneously estimating a latent variable model using the a priori choice distribution together with a latent inference model (see Fig. 2).

To date, econometric and machine learning models are often studied by behavioural researchers for their contrasting purposes in decision forecasting (breiman2001statistical). Econometric models are based on the classical decision theory that an individual’s decisions can be modelled rationally based on utility maximization. These models assume that the population will adhere to the strict formulation of the choice model, but they may not always represent the true decisions. The generative modelling approach uses clustering and factor analysis developed through algorithmic modelling of the data. Associations between decision factors can be classified with this method, obtaining latent information without explicit definition of latent constructs (poucin2016pedestrian). Thus, machine learning algorithms such as ANN that decouple latent information from the ‘true’ distribution generally outperform traditional regression based models in multidimensional problems (ahmed2010empirical). Recent works on latent behaviour modelling in choice analysis agree on the potential of improving behaviour models with machine learning. Examples include combining machine learning to improve complex psychological models (rosenfeld2012combining), representing the phenomena of similarity, attraction and compromise in choice models (osogami2014restricted) and inference of priorities and attitudinal characteristics (aggarwal2016learning).

Despite the many benefits, interpretation of results is still extremely difficult due to the complexity and number of parameters in ML analysis. As a result, ML models are not often used for general purpose behaviour understanding, but are created exclusively for the specific purpose of prediction accuracy. Still, machine learning research is a rapidly growing field at the intersection of statistical analysis and information science that seeks to find patterns in complex data (donoho201550). Furthermore, with the emphasis on applications and theoretical studies in today’s massive data driven industry, improving analytical techniques with ML is very relevant, although structural modelling, statistical and probability theory will remain the cornerstone of discrete choice analysis.

### 2.1 The basis of latent class and latent variable models

The latent class model shown in Figure 1 is a simple top-down model that imparts generalization properties to the choice model by predefining a discrete number of classes, allowing the parameters to vary with a fixed distribution. Formally, the LCM choice probability can be expressed as:

$$P(i \mid x) = \sum_{s \in S} P(s)\, P(i \mid s, x) \tag{1}$$

where $S$ is the set of classes and $P(s)$ is the probability that an individual belongs to class $s$. $P(i \mid s, x)$ is the conditional probability that choice $i$ is selected given the class $s$ and input variable $x$.
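As a minimal sketch of the mixture above, the following Python snippet combines class-specific multinomial logit probabilities with class membership shares. The function names, the two-class setup and all coefficient values are illustrative assumptions, not quantities from this study.

```python
import math

def softmax(utilities):
    """Multinomial logit probabilities from a list of utilities."""
    m = max(utilities)
    exps = [math.exp(u - m) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

def lcm_choice_probabilities(x, class_shares, class_betas):
    """Mix class-specific MNL probabilities with class membership shares.

    x            : list of K explanatory values
    class_shares : list of S membership probabilities that sum to 1
    class_betas  : class_betas[s][i] is the K-vector for alternative i in class s
    """
    n_alt = len(class_betas[0])
    mixed = [0.0] * n_alt
    for share, betas in zip(class_shares, class_betas):
        utilities = [sum(b * v for b, v in zip(beta_i, x)) for beta_i in betas]
        for i, p in enumerate(softmax(utilities)):
            mixed[i] += share * p
    return mixed

# Hypothetical example: two classes, three alternatives, made-up coefficients.
x = [1.0, 0.5]
class_shares = [0.6, 0.4]
class_betas = [
    [[0.2, 1.0], [0.0, 0.0], [-0.5, 0.3]],
    [[-1.0, 0.1], [0.0, 0.0], [0.8, -0.2]],
]
probs = lcm_choice_probabilities(x, class_shares, class_betas)
print(probs)
```

Because each class-specific vector is a proper probability distribution and the shares sum to one, the mixed probabilities also sum to one.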

The ICLV model extends the choice model by describing how perceptions and attitudes affect real choices, as well as using separate indicators to estimate latent variables (ben2002hybrid). Latent variables can be classified as either attitudinal (individual characteristics) or perceived (personal beliefs towards responses) (ben1999discrete). The latent variable model (measurement model) forms a sub-part of the structural framework which captures the relationship between the latent variables and indicators, and the observed explanatory variables which influence the latent variables. This specification can be used to identify more useful parameters and predict decision outcomes more accurately when there is a lack of strong significant correlation between explanatory variables and choice outcomes. The functions of the structural and measurement models can be explained in four equations (vij2016and):

$$h = A x + \nu \tag{2}$$

$$u = B x + G h + \varepsilon \tag{3}$$

$$z = D h + \eta \tag{4}$$

$$y_i = \begin{cases} 1 & \text{if } u_i \geq u_{i'} \ \forall\, i' \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

where $u_i$ is the utility of selecting alternative $i$. $A$ represents the relationship between the input explanatory variables $x$ and the latent variables $h$, and $D$ represents the relationship between $h$ and the indicator output $z$. $B$ and $G$ represent the model parameters with respect to the observed and latent variables. $\nu$, $\varepsilon$ and $\eta$ are the stochastic error terms of the model, assumed to be mutually independent and Gumbel distributed. In a generative model, parameters are shared between $G$ and $D$, which simply defines the joint distribution of $y$ and $h$, i.e. $p(y, h)$ (Fig. 2). The re-use of a shared parameter vector differentiates the RBM model from the structural equation formulation of the ICLV model.

### 2.2 Modelling through generative machine learning methods

In generative machine learning models, hidden units are the learned features (see Fig. 2) which perform non-redundant generalization of the data to reduce high dimensional input data (hinton2006reducing). Intuitively, in terms of econometric analysis, hidden units are latent variables that depend on some observed data, for instance socio-economic attributes, weather or price information, or direct choices such as location and choice of purchase. We can construct a generative model as a function of these dependent and independent variables. In the factor analysis approach, a common process is to perform feature extraction based on statistical hypothesis testing to determine whether the values of two classes are distinct, for example using Support Vector Machines (SVMs) or Principal Component Analysis (PCA) to learn low-dimensional classes by capturing only the statistically significant variances in the data (poucin2016pedestrian; wong2016bike). The learned classes (or clusters) can then be introduced directly into the model via parameterization. In the generative modelling approach, we use the priors directly to learn the distribution of the hidden units. In this process, we extract latent information directly from the observed choice data instead of using measurement functions, which may be prone to errors.

### 2.3 Balancing model inference and accuracy

One common problem that researchers face when constructing latent behaviour models is specifying the optimal number of latent factors (vermunt2002latent). Since a hypothesis on the number of latent factors cannot be tested directly, statistical evaluation criteria such as AIC and BIC are typically used to guide class selection (vermunt2002latent) or, in the case of ICLV models, the number is fixed through predefinition of measurement functions (Rungie2012145). However, since the number of latent factors determines the ability of the model to represent the heterogeneity in the data, it is likely that as we increase this number, the choice model becomes more efficient in capturing complex behaviour effects from individual and latent attributes. On the other hand, if we increase the number of latent segments, the number of parameters will also increase at an exponential rate (vermunt2002latent). Therefore, we may gain model accuracy but lose model interpretability.

The trade-off between inference and accuracy is a challenge when dealing with complex data (breiman2001statistical). If the goal of latent behaviour modelling is to leverage data to understand underlying statistical problems, we have to incorporate implicit modelling methods in addition to describing explicit structural utility formulations.

## 3 Methodology

In this section, we provide a brief overview of restricted Boltzmann machines and how they can be used to generate a prior over the choice distributions. We refer readers to Goodfellow-et-al-2016 for background and details on generative models and deep learning.

### 3.1 Restricted Boltzmann machines

A restricted Boltzmann machine (RBM) is an energy-based undirected graphical model that extends a Markov random field by including hidden variables (salakhutdinov2007restricted). It is a single-layer artificial neural network with no connections within a layer. The model has stochastic visible variables $y \in \{0,1\}^I$ and stochastic hidden variables $h \in \{0,1\}^J$. The joint configuration $(y, h)$ of visible and hidden variables is given by the Hopfield energy (hinton1984boltzmann):

$$E(y, h) = -\sum_{i} c_i y_i - \sum_{j} d_j h_j - \sum_{i,j} y_i D_{ij} h_j \tag{6}$$

where $d$ and $c$ represent the vector biases (constants) for the hidden and visible vectors respectively. $D$ is the matrix of parameters representing an undirected connection between the hidden and visible variables. We can express the Boltzmann distribution as an energy model with energy function $E(y, h)$:

$$p(y, h) = \frac{\exp(-E(y, h))}{Z} \tag{7}$$

where the partition function $Z = \sum_{y, h} \exp(-E(y, h))$ is the normalization function over all possible vector combinations. $F(y) = -\log \sum_{h} \exp(-E(y, h))$ is defined as the free energy and further simplified to

$$F(y) = -\sum_{i} c_i y_i - \sum_{j} \log\Big(1 + \exp\Big(d_j + \sum_{i} y_i D_{ij}\Big)\Big) \tag{8}$$

The probability of assigning a visible vector $y$ is given by the sum over all possible hidden vector states:

$$p(y) = \frac{1}{Z} \sum_{h} \exp(-E(y, h)) = \frac{\exp(-F(y))}{Z} \tag{9}$$
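The relationship between the energy, the free energy and the marginal probability of a visible vector can be checked numerically on a very small binary RBM. The sketch below assumes the common notation of a weight matrix `D` (visible by hidden), a visible bias `c` and a hidden bias `d`; these names and all numeric values are assumptions made for illustration, and brute-force enumeration is only feasible for a handful of units.

```python
import math
from itertools import product

def energy(y, h, D, c, d):
    """Joint energy of a visible vector y and a hidden vector h."""
    visible = sum(ci * yi for ci, yi in zip(c, y))
    hidden = sum(dj * hj for dj, hj in zip(d, h))
    interaction = sum(y[i] * D[i][j] * h[j]
                      for i in range(len(y)) for j in range(len(h)))
    return -visible - hidden - interaction

def free_energy(y, D, c, d):
    """Free energy of y with the binary hidden units summed out analytically."""
    visible = sum(ci * yi for ci, yi in zip(c, y))
    hidden = 0.0
    for j in range(len(d)):
        activation = d[j] + sum(y[i] * D[i][j] for i in range(len(y)))
        hidden += math.log1p(math.exp(activation))
    return -visible - hidden

def marginal_probabilities(D, c, d):
    """p(y) for every binary visible vector, by brute-force normalization."""
    states = list(product([0, 1], repeat=len(c)))
    weights = [math.exp(-free_energy(y, D, c, d)) for y in states]
    Z = sum(weights)
    return {y: w / Z for y, w in zip(states, weights)}

# Made-up parameters: three visible units, two hidden units.
D = [[0.5, -0.3], [0.1, 0.8], [-0.7, 0.2]]
c = [0.1, -0.2, 0.3]
d = [0.0, 0.4]
p_y = marginal_probabilities(D, c, d)
print(sum(p_y.values()))
```

The analytic free energy matches the negative log of the explicit sum over hidden states, which is why the hidden layer never has to be enumerated in practice.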

The RBM model is used to learn aspects of an unknown probability distribution based on samples from that distribution. Given some observation, the RBM updates the model weights such that the model best represents the distribution of the observation. To generate data with this method, it is necessary to compute the log likelihood gradient for all visible and hidden units. Hinton introduced a fast greedy algorithm to learn model parameters efficiently using the Contrastive Divergence (CD) method, which starts a sampling chain (Gibbs sampling) from real data points instead of from a random initialization (hinton2010practical).

### 3.2 Model estimation and inference

The probability that the RBM network assigns to a training sample can be raised by adjusting the weights to lower the energy of that training sample and raise the energy of other non-training samples. In order to minimize the negative log likelihood of the probability distribution $p(y)$, we take the derivative of the log probability of a training vector with respect to the model parameters as follows:

$$\frac{\partial \log p(y)}{\partial D_{ij}} = \langle y_i h_j \rangle_{\text{data}} - \langle y_i h_j \rangle_{\text{model}} \tag{10}$$

where the components in the angle brackets correspond to the expectations under the specified distribution. The first and second terms are the positive and negative phases respectively. This function updates the model parameters using a simple learning rule with a learning rate $\lambda$:

$$\Delta D_{ij} = \lambda \big( \langle y_i h_j \rangle_{\text{data}} - \langle y_i h_j \rangle_{\text{model}} \big) \tag{11}$$

The updates for parameters can be performed using simple stochastic gradient descent at each iteration $t$:

$$D_{ij}^{(t+1)} = D_{ij}^{(t)} + \Delta D_{ij} \tag{12}$$

To obtain a sample of a hidden unit from $p(h \mid y)$, we take a random training sample $y$; the state of each unit in the hidden layer is then sampled with the following probability:

$$p(h_j = 1 \mid y) = \sigma\Big(d_j + \sum_{i} y_i D_{ij}\Big) \tag{13}$$

where $\sigma(z) = 1 / (1 + e^{-z})$ is the logistic function. Similarly, we can obtain a visible state, given a vector of sampled hidden units, via a logistic function:

$$p(y_i = 1 \mid h) = \sigma\Big(c_i + \sum_{j} D_{ij} h_j\Big) \tag{14}$$

Since weights are shared between $p(h \mid y)$ and $p(y \mid h)$ and they define the distributions of $p(y)$ and $p(h)$, we can express the posterior distribution as $p(y \mid h) = p(h \mid y)\,p(y) / p(h)$ (ng2002discriminative). Due to its bidirectional structure, this framework possesses good generalization capabilities. The visible layer represents the data (in the case of choice modelling, the data represent selected choices), and the hidden layer represents the capacity of the model as class distributions.

Inference is done by setting the states of the visible variables to a training sample and then computing the states of the hidden variables using Eq. 13. Once a “state” is chosen for the hidden variables, a “reconstruction” phase produces a new vector with a probability given by Eq. 14, and the gradient update rule with learning rate $\lambda$ is given by:

$$\Delta D_{ij} = \lambda \big( \langle y_i h_j \rangle_{\text{data}} - \langle y_i h_j \rangle_{\text{recon}} \big) \tag{15}$$

We approximate the gradient function by using a CD Gibbs sampler that minimizes the divergence between the expected and estimated probability distributions, known as the Kullback-Leibler (KL) divergence (hinton2002training). A divergence of 0 indicates that the estimated distribution is identical to the expected distribution. The training algorithm runs for a fixed number of chain steps; it is initialized from a fixed point of the data distribution and then averaged across all examples (carreira2005contrastive).
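The sampling and update steps described in this subsection can be sketched as a single-step Contrastive Divergence (CD-1) routine in plain Python. This is an illustration under stated assumptions, not the training code used in this study: the names `D`, `c`, `d`, the learning rate, the number of epochs and the toy data are all made up, and both layers use logistic units.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_probs(y, D, d):
    """p(h_j = 1 | y) for every hidden unit j (logistic units)."""
    return [sigmoid(d[j] + sum(y[i] * D[i][j] for i in range(len(y))))
            for j in range(len(d))]

def visible_probs(h, D, c):
    """p(y_i = 1 | h) for every visible unit i (logistic units)."""
    return [sigmoid(c[i] + sum(D[i][j] * h[j] for j in range(len(h))))
            for i in range(len(c))]

def sample(probs, rng):
    """Draw independent Bernoulli states from a list of probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]

def cd1_update(y0, D, c, d, lr, rng):
    """Apply one CD-1 update in place, starting the chain at data point y0."""
    ph0 = hidden_probs(y0, D, d)               # positive phase statistics
    h0 = sample(ph0, rng)                      # sampled hidden state
    y1 = sample(visible_probs(h0, D, c), rng)  # reconstruction
    ph1 = hidden_probs(y1, D, d)               # negative phase statistics
    for i in range(len(c)):
        for j in range(len(d)):
            D[i][j] += lr * (y0[i] * ph0[j] - y1[i] * ph1[j])
        c[i] += lr * (y0[i] - y1[i])
    for j in range(len(d)):
        d[j] += lr * (ph0[j] - ph1[j])
    return y1

# Toy one-hot "choices" over three alternatives (made-up data).
rng = random.Random(0)
data = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
D = [[0.0, 0.0] for _ in range(3)]
c = [0.0, 0.0, 0.0]
d = [0.0, 0.0]
for epoch in range(100):
    for y in data:
        cd1_update(y, D, c, d, lr=0.05, rng=rng)
```

Starting the chain at a data point rather than at a random state is what makes a single Gibbs step a usable approximation of the negative phase.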

### 3.3 Modelling approach

In this paper, the proposed method uses a conditional RBM (C-RBM) training algorithm to include input-output connections that allow for discriminative learning (mnih2012conditional). The C-RBM expands the model to include “context input variables”, i.e. $p(y, h \mid x)$. The input explanatory variables $x$ are introduced as context variables so that they can be used to influence the latent variables, even though Eq. 14 does not reconstruct these explanatory variables. This influence is represented by a weight matrix $A$. The intuition is that each latent variable acts as a function of the observed choice $y$, conditional on $x$ (see Fig. 2). In the choice prediction stage, a vector of new input samples $x$ generates latent variables $h$. Conditional on the explanatory and latent variables, a probability function describing the choice behaviour is given as:

$$p(y_i = 1 \mid x, h) = \frac{\exp\big(c_i + \sum_{k} B_{ik} x_k + \sum_{j} D_{ij} h_j\big)}{\sum_{i'} \exp\big(c_{i'} + \sum_{k} B_{i'k} x_k + \sum_{j} D_{i'j} h_j\big)} \tag{16}$$

Likewise, sampling of the hidden state is extended to incorporate $x$:

$$p(h_j = 1 \mid y, x) = \sigma\Big(d_j + \sum_{i} y_i D_{ij} + \sum_{k} A_{jk} x_k\Big) \tag{17}$$

where the update parameters are $\theta = \{A, B, D, c, d\}$. During the reconstruction phase, the conditional probability (Eq. 16) is equivalent to an MNL model with latent variables (where $h$ and $x$ represent the latent and observed variables respectively). Good latent variables best capture information along the orthogonal direction where choices and observed inputs vary the most. The training and choice estimation phases are illustrated in Figs. 3 and 4. In the positive phase, parameter vectors are adjusted, at a step size set by the learning rate, to learn the transformed latent representation of the training set. In the negative phase, the latent variables are “clamped” or realized and the parameter vectors are adjusted again by reconstructing the observed variables. Referring to Fig. 2, the multinomial logit (MNL) model estimates the conditional parameter vector $B$ and bias vector $c$, while the C-RBM model includes vectors $D$, $A$ and $d$.
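A minimal sketch of the prediction stage described above: latent activations are first computed from the explanatory variables, and the choice probabilities then follow an MNL-style softmax over the observed and latent terms. The symbols `A`, `B`, `D`, `c` and `d`, the use of expected activations instead of sampled binary states, and all numeric values are assumptions made for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(utilities):
    m = max(utilities)
    exps = [math.exp(u - m) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

def latent_given_x(x, A, d):
    """Expected latent activations from explanatory variables only (prediction stage)."""
    return [sigmoid(d[j] + sum(A[j][k] * x[k] for k in range(len(x))))
            for j in range(len(d))]

def latent_given_xy(x, y, A, D, d):
    """Latent activations conditional on the observed choice y and on x (training stage)."""
    return [sigmoid(d[j]
                    + sum(y[i] * D[i][j] for i in range(len(y)))
                    + sum(A[j][k] * x[k] for k in range(len(x))))
            for j in range(len(d))]

def choice_probs(x, h, B, D, c):
    """Softmax over alternatives with observed (B x) and latent (D h) utility terms."""
    utilities = [c[i]
                 + sum(B[i][k] * x[k] for k in range(len(x)))
                 + sum(D[i][j] * h[j] for j in range(len(h)))
                 for i in range(len(c))]
    return softmax(utilities)

# Made-up example: 3 alternatives, 2 explanatory variables, 2 latent variables.
x = [0.5, -1.0]
A = [[0.3, -0.2], [0.1, 0.4]]                 # latent x explanatory
B = [[0.2, 0.0], [-0.1, 0.5], [0.0, -0.3]]    # alternative x explanatory
D = [[0.6, -0.4], [0.0, 0.2], [-0.5, 0.1]]    # alternative x latent
c = [0.0, 0.1, -0.1]                          # alternative-specific constants
d = [0.0, -0.2]                               # latent constants
h = latent_given_x(x, A, d)
probs = choice_probs(x, h, B, D, c)
print(probs)
```

Setting every entry of `D` to zero removes the latent contribution, and the function then returns ordinary MNL probabilities.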

## 4 Data

In this section, we develop a financial product choice scenario with explanatory variables using the C-RBM model. The latent variables representing the latent attitudinal variables are estimated simultaneously with the choice model. First, we construct a structured choice subset from a financial product transaction dataset from the Kaggle database (dataset: https://www.kaggle.com/c/santander-product-recommendation/data). The data contain a monthly record of each financial product purchase by customers of Santander. The time span of the data is from January 2015 to June 2016. Next, we reduced the complexity of the dataset by removing transaction data which contain multiple product choices. To ensure consistency, inputs were scaled and normalized. Overall, the constructed dataset has a total of 13 alternatives (product choices) and 20 explanatory variables. Table 1 lists the alternatives and their distribution across the dataset. Given the above conditions, a total of 253,803 valid responses were recorded, representing the total population sample with 13 available choices. The mean and standard deviation values of the explanatory variables are shown in Table 2. The experimental question is straightforward: “Given a set of examples with explanatory variables, what product is the individual most likely to purchase in the given month?” In a typical situation, the decision maker chooses the alternative that yields the maximum utility, and the predictive model is used to make an inference about the behaviour of the decision maker.

Choice index | Name | Share of total sample |
---|---|---|
1 | Guarantees | 0.002% |
2 | Short-term deposits | 0.83% |
3 | Medium-term deposits | 0.07% |
4 | Long-term deposits | 3.79% |
5 | Funds | 0.98% |
6 | Mortgage | 0.02% |
7 | Pensions | 0.15% |
8 | Loans | 0.035% |
9 | Taxes | 2.68% |
10 | Cards | 21.93% |
11 | Securities | 1.42% |
12 | Payroll | 22.04% |
13 | Direct debit | 46.05% |

Explanatory variable | Description | Mean | Std. dev. |
---|---|---|---|
age | Customer age | 42.9 | 13.0 |
loyalty | Customer seniority (in years) | 8.03 | 6.0 |
income | Customer income (€) | 141,838 | 262,748 |
sex | Customer sex (1=male) | 0.387 | 0.487 |
employee | Employee index, 1 if employee | 0.0006 | 0.024 |
active | Active customer index | 0.95 | 0.199 |
new_cust | 1 if customer loyalty < 6 mo. | 0.045 | 0.207 |
resident | Resident index (Spain) | 0.999 | 0.007 |
foreigner | Foreign citizen index | 0.045 | 0.21 |
european | EU citizen index | 0.995 | 0.006 |
vip | VIP customer index | 0.116 | 0.32 |
savings | Savings Account type | 0.0002 | 0.012 |
current | Current Account type | 0.572 | 0.495 |
derivada | Derivada Account type | 0.0009 | 0.03 |
payroll_acc | Payroll Account type | 0.416 | 0.493 |
junior | Junior Account type | 0.0001 | 0.0098 |
masparti | Mas Particular Account type | 0.017 | 0.128 |
particular | Particular Account type | 0.168 | 0.373 |
partiplus | Particular Plus Account type | 0.113 | 0.316 |
e_acc | e-Account type | 0.255 | 0.436 |

### 4.1 Method for assessing C-RBM model performance

We estimate the weights of the latent inference model by optimizing the lower bound of the KL divergence using gradient backpropagation. Intuitively, one set of weights holds the parameters for the explanatory variables and another holds the parameters for the latent variables. We selected models with 2, 4, 8 and 16 latent variables to observe the effects of increasing model complexity. One disadvantage of this step is that it results in a large number of estimated parameters: with $I = 13$ alternatives, $K = 20$ explanatory variables and $J$ latent variables, the model has $I(K+1) + J(K+I+1)$ parameters. With $J = 4$, we ended up with 409 parameters. To counteract overfitting due to this problem, we trained on 70% of our data and validated the model on the other 30%, with a 2-fold cross-validation to verify generalization. When the validation error stops decreasing, the optimal estimation is reached (Goodfellow-et-al-2016). A baseline is set up using a standard multinomial logistic regression model with all explanatory variables and compared to the discriminative C-RBM modelling approach in terms of log-likelihood, model fit and predictive accuracy across all data models. The criteria for measuring the performance of a categorical model are model fit and prediction error. The model fit ($\rho^2$) compares the trained model with a model without covariates. In the prediction error evaluation, the sum of the diagonal cells of a confusion matrix divided by the total number of examples denotes the accuracy of the model in predicting the correct choice, and the error is

(18) |

is the actual choice and is the sum of all the error probabilities for correct assessment for each choice. We fit the model on the training set and evaluate on the validation set.
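The accuracy and error described in terms of the confusion matrix can be sketched as follows. This is a generic illustration with made-up labels; the function names are hypothetical, and the error is taken here as one minus the share of correctly predicted choices, which is an assumption about the exact form of the measure.

```python
def confusion_matrix(actual, predicted, n_classes):
    """Count matrix with actual choices on rows and predicted choices on columns."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def prediction_error(matrix):
    """One minus the share of examples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 1.0 - correct / total

# Made-up example with three alternatives and six observations.
actual = [0, 1, 2, 2, 1, 0]
predicted = [0, 2, 2, 2, 1, 1]
cm = confusion_matrix(actual, predicted, n_classes=3)
error = prediction_error(cm)
print(cm, error)
```

In this toy example four of the six choices are predicted correctly, so the error is one third.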

## 5 Results

We compare the different models based on their generalization performance on the test set. A total of 76,141 observations were used in the test. For the purpose of this study, we tested on both normalized and non-normalized data and found that both produce similar results. Model estimation and validation were performed with the Theano ML Python library (http://github.com/Theano/Theano). The models were optimized with stochastic gradient descent (SGD) on mini-batches of 64 samples for 400 epochs with input normalization. We used an adaptive momentum-based learning rate (hinton2006fast). Training time was approximately 30 minutes for each model, including validation, running on an Intel Core i5 workstation. At this scale, the computational demand is not significant enough to justify restricting the model to a small number of hidden units; however, speed could become a more important consideration as the data size increases or when very large parameter vectors with higher dimensionality are used. The statistical results of the model comparison across the same validation set are shown in Table 3. We found that additional latent information about the relationship between explanatory variables and observed decisions was useful and increased model accuracy. Bayesian Information Criterion (BIC) values indicate that 8 hidden units may be the optimal number of latent variables, and the higher BIC value beyond 8 hidden units suggests overfitting. However, when generating semantic class meanings, a smaller number of latent variables is simpler to interpret; therefore, in our example, we use only 2 latent variables for analysis.

To evaluate the models, we used a Hinton diagram (Bremner1994) to analyze the parameter strengths between independent and dependent variables. We plot the parameter values and their significance with the choices on the y-axis and the independent variables on the x-axis. A Hinton diagram is often used in model analysis when the dimensionality of the model is high, as it provides a simple visual way of analyzing each vector. Figs. 5 through 9 show the parameter estimates of the different models after training. The Hinton matrix shows the influence of each independent variable on each alternative or latent variable. Statistically significant (>95% confidence bound) parameters are highlighted in blue. The values along the x-axis are normalized to zero mean and unit variance. The 13 financial product choices are listed on the y-axis. The estimated parameters B and D and the bias of the C-RBM prediction model are projected onto the Hinton diagram in Figs. 5(a), 6(a), 7(a) and 8(a), while the parameters A and the bias of the latent variables with respect to the alternatives are shown in Figs. 5(b), 6(b), 7(b) and 8(b). The two bias terms are the constants with respect to the observed and hidden layers respectively. The sign and value of each parameter correspond to the colour and size of the patches in the matrix, with white and black representing positive and negative signs respectively. The statistical significance (t-test) of each parameter is calculated from the inverse of the Hessian of the log-likelihood with respect to the parameters, with a sample size adjustment.
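As a rough sketch of how t-ratios follow from an inverse Hessian, the snippet below takes a vector of estimates and the Hessian of the negative log-likelihood, inverts it to obtain an asymptotic covariance matrix, and flags parameters whose absolute t-ratio exceeds 1.96. The numbers are invented, the function name `t_statistics` is hypothetical, and the sample size adjustment mentioned in the text is not reproduced here.

```python
import numpy as np

def t_statistics(theta, hessian_nll):
    """t-ratios from estimates and the Hessian of the negative log-likelihood."""
    cov = np.linalg.inv(hessian_nll)   # asymptotic covariance of the estimates
    std_err = np.sqrt(np.diag(cov))    # standard errors on the diagonal
    return theta / std_err, std_err

# Illustrative values only (not estimates from the study).
theta = np.array([0.80, -0.05, 1.50])
hessian_nll = np.array([[400.0, 20.0, 0.0],
                        [20.0, 250.0, 10.0],
                        [0.0, 10.0, 900.0]])

t, se = t_statistics(theta, hessian_nll)
significant = np.abs(t) > 1.96         # two-sided test at the 95% level
for name, ti, si, sig in zip(["p1", "p2", "p3"], t, se, significant):
    print(f"{name}: t = {ti:7.2f}, std. err. = {si:.4f}, significant = {sig}")
```

Working with the negative log-likelihood keeps the Hessian positive definite at a maximum, so the diagonal of its inverse is guaranteed to be positive.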

Table 3: Comparison of the MNL baseline and the C-RBM models on the validation set.

Model | Latent variables | Validation error | Log-likelihood | Model fit | No. of params | BIC |
---|---|---|---|---|---|---|
MNL | (baseline) | 0.4454 | -206808 | 0.546 | 273 | 416915 |
C-RBM | 2 | 0.4360 | -203558 | 0.553 | 341 | 411237 |
C-RBM | 4 | 0.4338 | -202066 | 0.556 | 409 | 409075 |
C-RBM | 8 | 0.4323 | -200846 | 0.559 | 545 | 408279 |
C-RBM | 16 | 0.4318 | -200223 | 0.560 | 817 | 410321 |
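The parameter counts and BIC values reported for the model comparison can be checked with a few lines of arithmetic. The sketch below rests on several assumptions that are not stated in the text: 20 explanatory variables plus a bias per alternative, 34 additional parameters per hidden unit (20 input weights, 13 output weights and 1 bias), a training set size implied by the 70/30 split, and a null model with equal choice probabilities for the fit measure. Under these assumptions the reported figures are reproduced closely, which suggests but does not prove that this is how they were computed.

```python
import math

N_VARS, N_ALTS = 20, 13                  # explanatory variables, alternatives
N_VALID = 76_141                         # observations in the 30% hold-out
N_TRAIN = round(N_VALID * 0.7 / 0.3)     # assumption: implied 70% training size

def n_params(n_hidden):
    mnl = (N_VARS + 1) * N_ALTS          # weights plus bias per alternative
    per_hidden = N_VARS + N_ALTS + 1     # assumed cost of one hidden unit
    return mnl + n_hidden * per_hidden

def bic(log_lik, k, n):
    return k * math.log(n) - 2.0 * log_lik

def rho_squared(log_lik, n, n_alts):
    # Assumption: null model assigns probability 1/n_alts to every choice.
    return 1.0 - log_lik / (n * math.log(1.0 / n_alts))

# hidden units -> (log-likelihood, params, BIC, model fit) as reported
reported = {
    0: (-206808, 273, 416915, 0.546),    # MNL baseline
    2: (-203558, 341, 411237, 0.553),
    4: (-202066, 409, 409075, 0.556),
    8: (-200846, 545, 408279, 0.559),
    16: (-200223, 817, 410321, 0.560),
}

for h, (ll, k, bic_rep, fit_rep) in reported.items():
    print(h, n_params(h), round(bic(ll, k, N_TRAIN)),
          round(rho_squared(ll, N_TRAIN, N_ALTS), 3))

best_h = min(reported, key=lambda h: bic(reported[h][0], reported[h][1], N_TRAIN))
print("lowest BIC at", best_h, "hidden units")
```

Under this decomposition the parameter count grows by a fixed amount per hidden unit, while the log-likelihood gains shrink, which is why the BIC turns upward again for the largest model.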

Table 4: Effects of sampling on the sensitivity and stability of the observed parameters. For each model, the two rank columns give the standard error rank at the two sample sizes and the third column gives the difference in standard error.

Model | C-RBM 2 LV | | | C-RBM 4 LV | | | C-RBM 8 LV | | | C-RBM 16 LV | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
parameter | rank | rank | std. err. diff. | rank | rank | std. err. diff. | rank | rank | std. err. diff. | rank | rank | std. err. diff. |
age | 15 | 15 | 49.30 | 15 | 12 | 0.52 | 11 | 11 | 0.99 | 11 | 12 | 0.64 |
loyalty | 18 | 14 | 59.36 | 14 | 15 | 0.38 | 15 | 15 | 0.82 | 15 | 17 | 0.48 |
income | 3 | 3 | 3712.99 | 3 | 3 | 26.67 | 3 | 3 | 43.00 | 3 | 2 | 35.82 |
sex | 12 | 13 | 67.51 | 13 | 14 | 0.41 | 14 | 13 | 0.91 | 14 | 15 | 0.52 |
employee | 5 | 2 | 4267.79 | 2 | 4 | 13.74 | 5 | 5 | 21.27 | 4 | 4 | 33.74 |
active | 21 | 16 | 47.92 | 16 | 19 | 0.20 | 19 | 19 | 0.34 | 19 | 19 | 0.26 |
new_cust | 6 | 12 | 53.93 | 12 | 7 | 1.49 | 8 | 8 | 1.34 | 8 | 9 | 0.91 |
resident | 16 | 20 | 16.61 | 20 | 20 | 0.19 | 20 | 20 | 0.31 | 20 | 20 | 0.23 |
foreigner | 8 | 17 | 29.15 | 17 | 8 | 1.43 | 9 | 10 | 0.76 | 7 | 7 | 1.35 |
european | 17 | 20 | 16.62 | 20 | 20 | 0.19 | 21 | 20 | 0.31 | 21 | 20 | 0.23 |
vip | 20 | 10 | 122.66 | 10 | 16 | 0.33 | 16 | 12 | 0.99 | 16 | 13 | 0.68 |
savings | 1 | 1 | 34177.13 | 1 | 1 | 258.41 | 2 | 1 | 255.12 | 2 | 1 | 181.81 |
current | 7 | 11 | 64.19 | 11 | 13 | 0.41 | 12 | 18 | 0.38 | 12 | 16 | 0.39 |
derivada | 4 | 4 | 3112.38 | 4 | 5 | 4.70 | 4 | 4 | 19.67 | 5 | 5 | 2.82 |
payroll_acc | 9 | 18 | 24.91 | 18 | 18 | 0.29 | 18 | 17 | 0.52 | 18 | 18 | 0.41 |
junior | 2 | 5 | 1759.26 | 5 | 2 | 58.29 | 1 | 2 | 45.32 | 1 | 3 | 22.43 |
masparti | 11 | 7 | 185.94 | 7 | 9 | 1.41 | 7 | 6 | 4.99 | 9 | 6 | 2.29 |
particular | 14 | 8 | 166.56 | 8 | 11 | 0.61 | 13 | 14 | 0.83 | 13 | 14 | 0.53 |
partiplus | 10 | 6 | 189.75 | 6 | 10 | 0.65 | 10 | 9 | 1.51 | 10 | 10 | 0.86 |
e_acc | 19 | 9 | 159.38 | 9 | 17 | 0.33 | 17 | 16 | 0.82 | 17 | 11 | 0.91 |
bias | 13 | 19 | 19.07 | 19 | 6 | 3.17 | 6 | 7 | 3.35 | 6 | 8 | 0.48 |

## 6 Analysis

### 6.1 Characteristics of latent variables

We can characterize each hidden unit by the significance and strength of its weights. These weights form a parameter vector that indicates the linear contribution of each latent variable, together with a constant, such that each alternative can be described as a utility function of the latent variables.
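To make the idea of a utility that is linear in the latent variables concrete, the following sketch computes utilities and choice probabilities for 13 alternatives from 2 hidden units. The weight matrix `A`, the constants `const` and the hidden state `h` are randomly generated placeholders, not estimates from the study, and the softmax link is an assumption consistent with a multinomial logit output.

```python
import numpy as np

rng = np.random.default_rng(42)

def softmax(z):
    e = np.exp(z - z.max())              # shift for numerical stability
    return e / e.sum()

n_hidden, n_alts = 2, 13
# Hypothetical parameters: latent-to-alternative weights and constants.
A = rng.normal(scale=0.5, size=(n_hidden, n_alts))
const = rng.normal(scale=0.1, size=n_alts)

# Hidden state with only the first latent variable active.
h = np.array([1.0, 0.0])

utility = h @ A + const                  # linear contribution of each latent variable
prob = softmax(utility)                  # choice probabilities over alternatives

print("most preferred alternative:", int(np.argmax(prob)))
print("probabilities sum to", round(float(prob.sum()), 6))
```

With a single active hidden unit, the utilities reduce to that unit's row of weights plus the constants, which is what makes it possible to read a semantic profile off each row.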

For example, the C-RBM-2 latent variable hidden1 is characterized by individuals who are of working age, are non-EU foreign citizens with non-VIP status and do not own any special accounts. We can therefore infer that this latent variable indicates a ‘savings driven attitude’ (see Fig 5(b)). From the model results, the population with such characteristics has a positive preference for purchasing a payroll product and a low motivation for purchasing a (credit/debit) card product, as indicated in Fig 5(b). Likewise, latent variable hidden2 is represented by older, loyal customers who are VIP and have held various account types over their lifetime. This latent variable can be interpreted as a ‘self-reliance attitude’ and indicates a population that is less likely to purchase long term deposits, funds, securities and card products. The C-RBM with latent variables outperforms the MNL model; however, the performance gain from increasing the number of latent variables past 4 LV is small. This suggests that the upper bound of latent representative capacity is reached with just a small number of latent variables. Using 2 or 4 latent variables would be sufficient for a significant improvement over an MNL structure.

From the presented results, it is clear that the C-RBM models differ significantly from the MNL model in terms of which parameters are strong and significant. This result seems to be broad-based in the sense that it is not dictated by the number of hidden units, and it signifies that the observed distribution has some latent factors that can be explored. However, we should mention that the initialization of the training parameters may have a small random effect on the model. Note that in the parameter plots, the signs and strengths of the contributions to the choice model differ from model to model, which may indicate that model training is stuck at a local optimum. This also suggests that the hidden and observed layers have different scales (glorot2010understanding). The suggestion in he2015delving is to increase the learning rate to improve convergence, but that would result in overgeneralization and loss of expressive power in the hidden units. We posit that a middle-of-the-road solution should have adequate model accuracy and generalization over a large population.

We performed a 2-fold cross-validation analysis and determined that the residual from the model fit is not significant; therefore, the model is robust to changes in input data. This is further confirmed by the sensitivity analysis presented in the following section. In the parameter plots, we can see that the values and signs correspond to the strength of each variable. For instance, the parameters for the Guarantees choice are not significant, since its share of the choice distribution is very low (0.002%). The latent models show similar results. For the C-RBM with 2 and 4 hidden units, almost all of the parameters are significant, except for the income, employee, savings, derivada and junior variables. This can be attributed to their small mean values (and high deviation).

### 6.2 Sensitivity of parameter estimates

The versatility and effectiveness of parameter estimates are determined by a sensitivity analysis of the model output. Methods of sensitivity analysis include variance-based estimators, sampling-based analysis and differential analysis (helton1991sensitivity; saltelli2000sensitivity; saltelli2010variance). “Sensitive” parameters are those whose uncertainty contributes substantially to the test results (helton1991sensitivity; hamby1994review). A model is sensitive to an input parameter when the variability associated with that input results in a large output variability. Sensitivity ranking sorts the input parameters by the amount of influence they have on the model output, and the disagreement between rankings measures the parameter sensitivity to changes in the input (hamby1994review).

We first define a list of the parameters used in the model, ordered by their standard errors calculated over the full dataset. In sensitivity analysis of large datasets, a key concern is the computational cost needed to complete the analysis; hence, we use a sampling-based approach as a cheap estimator of the output % difference between the parameter minimum and maximum values. Random sampling (e.g. simple random sampling, Monte Carlo, etc.) generates distributions of inputs and outputs to assess model uncertainties (helton1991sensitivity). Analyzing the sampling effects can provide information on the overall model performance, since parameter sensitivity depends on all the parameters to which the model is sensitive and therefore on the importance of each parameter (hamby1994review).

Consider the C-RBM model as a function that maps the input vectors of observed and latent variables to the model output. We suppose that the model is a complex, highly non-linear function, such that we cannot completely define the way the C-RBM model responds to changes in the input variables. In addition, the latent variables depend on the observed variables through the submodel previously shown in Fig. 2. The analysis involves an independently and randomly generated sample whose size is 10% of the total number of observations. Model performance is assessed by the sampling stability of the variable parameters. Sensitivities are also assessed with respect to the number of hidden units used in generating the C-RBM models, which indicates what number of latent variables (a hyperparameter) is required for model identifiability. Since the model was applied using a multinomial logit approach instead of a conditional logit, this resulted in a very large number of parameters. Thus, the effect of relative changes to the number of distributed parameters gives the range of variance across each explanatory variable and number of hidden units used. Table 4 shows the effects of sampling on the sensitivity and stability of the observed model parameters for each number of latent variables. Notice that the relative difference in standard error between the full and sampled model decreases when the number of latent variables increases. This shows that C-RBM models with more latent variables are robust to changes in input values through sampling. Additionally, the parameter sensitivity rank across variables also becomes more consistent, and we therefore show that RBM models are efficient in obtaining good latent variables with low generalization error. The significant decrease in standard error difference from 2 LV to 4 LV may indicate that the number of latent variables used in the models sets a lower bound on the generalization error, which implies that the number of latent variables needs careful consideration in order to obtain efficient yet accurate parameter values without losing model interpretability.
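The sampling-based procedure described in this section (rank parameters by standard error on the full data, repeat on a 10% random sample, then compare ranks and standard errors) can be sketched as follows. An ordinary least squares model on synthetic data stands in for the C-RBM so that standard errors are available in closed form; the helper names, the data and the definition of the relative difference are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

def ols_std_errors(X, y):
    """Standard errors of OLS coefficients (stand-in for C-RBM estimates)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)             # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)        # coefficient covariance
    return np.sqrt(np.diag(cov))

def ranks(values):
    """Rank parameters so that rank 1 has the largest value."""
    order = np.argsort(-values)
    r = np.empty(len(values), dtype=int)
    r[order] = np.arange(1, len(values) + 1)
    return r

# Synthetic inputs with different scales, so standard errors differ.
n, k = 20_000, 6
scales = np.array([1.0, 0.5, 2.0, 0.1, 5.0, 0.25])
X = rng.normal(size=(n, k)) * scales
y = X @ rng.normal(size=k) + rng.normal(size=n)

se_full = ols_std_errors(X, y)

# 10% simple random sample drawn without replacement.
idx = rng.choice(n, size=n // 10, replace=False)
se_sample = ols_std_errors(X[idx], y[idx])

rank_full, rank_sample = ranks(se_full), ranks(se_sample)
# Assumed definition: percentage difference relative to the full-data value.
rel_diff = 100.0 * np.abs(se_sample - se_full) / se_full
disagreement = int(np.sum(rank_full != rank_sample))

for j in range(k):
    print(f"x{j}: rank {rank_full[j]} -> {rank_sample[j]}, diff = {rel_diff[j]:.1f}%")
print("parameters whose rank changed:", disagreement)
```

A small number of rank changes between the full and sampled fits indicates that the ordering of parameter influence is stable under resampling, which is the property the rank columns of the sensitivity table are meant to expose.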

## 7 Conclusion

This study analyzes alternative means of latent behaviour modelling in the absence of attitudinal indicators. In ICLV models, specialized surveys have to be constructed with attitudinal questions to model latent effects on the decisions. While this has been one of the more popular methods in discrete choice analysis, it has several disadvantages. First, attitudinal questions are subjective and the behaviours are subject to change over time. Second, existing datasets that have no attitudinal questions cannot leverage the ICLV model, so latent effects cannot be utilized. We explore generative modelling of the choice distribution to uncover latent variables using machine learning methods, without measurement indicators. We hypothesized that latent effects can be obtained not only from attitudinal questions, but also from the posterior choice distribution. In effect, we are modelling latent components that fit the real choice distribution rather than achieving good statistics on subjective models. For example, there could be some mean behaviour that dictates a more probable influence on purchases given some latent variables.

For this method to be effective, certain conditions have to be present. First, it must be difficult to obtain a good discriminative prediction result using only the provided explanatory variables. In this scenario, the C-RBM models were able to learn a good latent variable representation and improve the model fit and prediction accuracy while keeping the latent variables interpretable. Second, when the data lacks attitudinal survey responses, this method can find latent effects without the use of subjective measurement indicators.

The current limitations of this study are the absence of choice dynamics or explanatory variable dynamics, i.e. changes over time or multiple choices for the same individual were not considered, but can be brought in. The underlying RBM is capable of representing dynamics. We hypothesize that this may improve the model significantly, but we are still looking for ways to incorporate dynamics into our C-RBM model. In recent studies, we have seen dynamic frameworks such as recurrent neural networks used in modelling temporal data (taylor2007modeling; mnih2012conditional). Finally, it is worth noting that as the number of latent variables increases, the number of estimated parameters grows considerably (see Table 3). This will pose problems in large datasets, and the ability to reduce dimensionality will give a significant benefit to the efficient use of model parameters. In our observation, performing cross-validation or selecting the model with the lowest validation error is a justifiable method to prevent overfitting when using all the parameters. In the future, we would also look at the possibility of introducing deep learning architectures to choice modelling by stacking RBMs (otsuka2016deep).

While ICLV models are optimized to predict the effects of latent constructs on the choice model using measurement indicators to guide the selection of latent parameters, this method uses observed decisions as the source of influence for optimizing latent variables through machine learning. This is not to say that we disagree with using measurement indicators: although they may often be subjective and may raise mis-specification problems, ICLV models can improve latent effects on choice models when explanatory variables are poor predictors (vij2016and). However, latent effects may not only be present in attitudes and perceptions, but also in the direct observation of choices. Our current work explores the use of the posterior choice distribution for latent behaviour modelling. Generative modelling in DCA is inspired by state-of-the-art machine learning algorithms that perform unsupervised feature extraction from unlabelled data used in classification problems (hinton1984boltzmann). In circumstances where attitudinal variables are not available, we have strong reason to believe that the generation of latent factors is important and effective in building a discrete choice model.

A future study of interest would be to extend this method to datasets with attitudinal questions and surveys, for example an inter-city rail survey (sobhani2017innovative), and to perform an analysis with both RBM and ICLV methods to obtain the generalization error of attitudinal survey models. A comparative study would provide a foundation for the analysis of various latent behaviour models through graphical and algorithmic methods, and would provide guidance not only in selecting the appropriate latent variables, but also in directing research effort towards more promising directions.

## Acknowledgements

This research is funded in part by a Ryerson University PhD Fellowship and by Fonds de recherche du Québec - Nature et technologies (FRQ-NT) with team grant No. 2016-PR-189250.