Mixed-Variate Restricted Boltzmann Machines^{1}
Abstract
Modern datasets are becoming increasingly heterogeneous. To this end, we present in this paper Mixed-Variate Restricted Boltzmann Machines (MV.RBM) for simultaneously modelling variables of multiple types and modalities, including binary and continuous responses, categorical options, multicategorical choices, ordinal assessments and category-ranked preferences. Dependency among variables is modelled using latent binary variables, each of which can be interpreted as a particular hidden aspect of the data. Like the standard RBM, the proposed model allows fast evaluation of the posterior for the latent variables. Hence, it is naturally suited to many common tasks including, but not limited to, (a) a preprocessing step that converts complex input data into a more convenient vectorial representation through the latent posteriors, thereby offering a dimensionality-reduction capacity, (b) a classifier supporting binary, multiclass, multilabel and label-ranking outputs, or a regression tool for continuous outputs, and (c) a data-completion tool for multimodal and heterogeneous data. We evaluate the proposed model on a large-scale dataset of world opinion survey results on three tasks: feature extraction and visualisation, data completion and prediction.
1 Introduction
Restricted Boltzmann Machines (RBMs) [9, 5] have recently attracted increasing attention for their rich capacity in a variety of learning tasks, including multivariate distribution modelling, feature extraction, classification, and construction of deep architectures [8, 19]. An RBM is a two-layer Markov random field in which the visible layer represents observed variables and the hidden layer represents latent aspects of the data. Pairwise interactions are permitted only between units in different layers. As a result, the posterior distribution over the hidden variables and the generative model of the data are both easy to evaluate, allowing fast feature extraction and efficient sampling-based inference [7]. Nonetheless, most existing work on RBMs implicitly assumes that the visible layer contains variables of the same modality. By far the most popular input types are binary [5] and Gaussian [8]. Recent extensions include categorical [21], ordinal [25], Poisson [6] and Beta [13] data. To the best of our knowledge, none has considered multicategorical and category-ranking data, nor a mixed combination of these data types.
In this paper, we investigate a generalisation of the RBM for variables of multiple modalities and types. Take, for example, data from a typical survey, in which a person is asked a variety of questions in many styles, ranging from yes/no questions to multiple choices and preference statements. Typically, there are six question/answer types: (i) binary responses (e.g., satisfied vs. unsatisfied), (ii) categorical options (e.g., one of employed, unemployed or retired), (iii) multicategorical choices (e.g., any of family, education or income), (iv) continuous information (e.g., age), (v) ordinal assessments (e.g., one of good, neutral or bad), and (vi) category-ranked preferences (e.g., in decreasing order of importance: children, security, food and money). As the answers in a response come from the same person, they are inherently correlated. For instance, a young American is likely to own a computer, whilst a typical Chinese adult may be more concerned about their children's education. However, modelling the direct correlation among multiple types is difficult. We show, on the other hand, that a two-layer RBM is well-suited for this problem. First, its undirected graphical structure offers great flexibility to encode all six data types in the same probability distribution. Second, the binary hidden layer pools information from the visible units and redistributes it to all others, thereby introducing dependencies among variables. We term our model the Mixed-Variate Restricted Boltzmann Machine (MV.RBM).
The MV.RBM has the capacity to support a variety of machine learning tasks. Its posteriors can be used as a vectorial representation of the data, hiding away the heterogeneous nature of the observations. As a result, the MV.RBM can be used for data preprocessing, visualisation, and dimensionality reduction. Given the hidden layer, the original and missing observables can also be reconstructed through the generative data model. By splitting the observed data into an input set and an output set, predictive models can be learnt to perform classification, ranking or regression. These capacities are demonstrated in this paper on a large-scale international opinion survey involving thousands of people across many nations.
2 Mixed-Variate Restricted Boltzmann Machines
In this section we present Mixed-Variate Restricted Boltzmann Machines (MV.RBM) for jointly modelling variables of multiple modalities and types. For ease of reference, we include a description of our notation in Table 1.
Table 1: Notation.

v_i — single visible variable | G_i(v_i), H_ik(v_i) — functions of an input variable
v — the set of visible variables | a_i, a_ic — input bias parameters
h_k — single hidden variable | w_ik, w_ick — input-hidden parameters
h — the set of hidden variables | b_k — hidden bias parameter
Z — normalising function | v_i^(c) — activation indicator
≻, ⪰, ≺ — ordinal relations | S_i — set of categories
∼ — indifference | M_i — the number of categories
N — number of visible units | c ∈ S_i — category member of set S_i
K — number of hidden units | δ[·], I[·] — indicator functions
P(·) — probability distribution | o — index of a subset of variables
E(v, h) — energy function | L — data log-likelihood
2.1 Model Definition
Denote by v = (v_1, v_2, ..., v_N) the set of mixed-variate visible variables, where each v_i can be one of the following types: binary, categorical, multicategorical, continuous, ordinal or category-ranked. Let v_D be the joint set of discrete variables and v_C the set of continuous variables, so that v = (v_D, v_C). Denoting by h = (h_1, h_2, ..., h_K) ∈ {0,1}^K the hidden variables, the model distribution P(v, h) is defined as
P(v, h) = (1/Z) exp{-E(v, h)}    (1)
where E(v, h) is the model energy and Z is the normalisation constant. The model energy is further decomposed into a sum of singleton and pairwise energies:

E(v, h) = Σ_i E_i(v_i) + Σ_k E_k(h_k) + Σ_{i,k} E_ik(v_i, h_k)
where E_i(v_i) depends only on the i-th visible unit, E_k(h_k) only on the k-th hidden unit, and E_ik(v_i, h_k) on the interaction between the i-th visible and k-th hidden units. The MV.RBM is thus a two-layer mixed-variate Markov random field with pairwise connectivity across layers.
For the distribution in Eq. (1) to be properly specified, we need to keep the normalisation constant finite. In other words, the integral

Z = Σ_h Σ_{v_D} ∫_{v_C} exp{-E(v, h)} dv_C

must be bounded from above. One way is to choose appropriate continuous variable types with bounded moments, e.g., Gaussian. Another is to explicitly bound the continuous variables within some finite ball, i.e., ||v_C|| ≤ R for some radius R.
In our MV.RBM, we further assume that the energies have the following form:

E_i(v_i) = G_i(v_i),   E_k(h_k) = -b_k h_k,   E_ik(v_i, h_k) = -h_k H_ik(v_i)    (2)
where b_k is the bias parameter for the k-th hidden unit, and G_i(·) and H_ik(·) are functions to be specified for each data type. An important consequence of this energy decomposition is the factorisation of the posterior:

P(h | v) = Π_k P(h_k | v),   with   P(h_k = 1 | v) = σ(b_k + Σ_i H_ik(v_i))    (3)
where σ(x) = 1/(1 + e^{-x}), and h_k = 1 denotes the assignment of the k-th hidden unit to its active state. This posterior is efficient to evaluate, and thus the vector (P(h_1 = 1 | v), ..., P(h_K = 1 | v)) can be used as the extracted features for the mixed-variate input v.
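As an illustration, evaluating the posterior in Eq. (3) is a single sigmoid pass over aggregated input statistics. The sketch below (in NumPy; the parameter names `W` and `b` follow the notation used here and are our own, not the authors' code) computes the feature vector for the all-binary special case, where H_ik(v_i) = w_ik v_i:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_posterior_binary(v, W, b):
    """P(h_k = 1 | v) = sigmoid(b_k + sum_i w_ik v_i): Eq. (3) when
    every visible unit is binary, so that H_ik(v_i) = w_ik * v_i."""
    return sigmoid(b + W.T @ v)

v = np.array([1.0, 0.0, 1.0])   # toy observation with 3 binary units
W = np.zeros((3, 2))            # 3 visible x 2 hidden weights
b = np.zeros(2)                 # hidden biases
print(hidden_posterior_binary(v, W, b))   # all-zero parameters give [0.5 0.5]
```

With all parameters at zero, each hidden unit is maximally uncertain, which is a convenient sanity check before training.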
Similarly, the data model factorises as

P(v | h) = Π_i P_i(v_i | h),   where   P_i(v_i | h) ∝ exp{-G_i(v_i) + Σ_k h_k H_ik(v_i)}    (4)

and the normalisation of P_i(v_i | h) is a sum over the domain of v_i if v_i is discrete, and an integral if v_i is continuous, assuming that the integral exists. Note that we deliberately use the subscript index i in P_i to emphasise the heterogeneous nature of the input variables.
2.2 Type-specific Data Models
We now specify P_i(v_i | h) in Eq. (4) or, equivalently, the functionals G_i and H_ik. Denote by S_i = {1, 2, ..., M_i} the set of categories in the case of discrete variables. In this section, for continuous types, we limit ourselves to Gaussian variables as they are by far the most common. Interested readers are referred to [13] for Beta variables in the context of image modelling. The data models and related functionals for the binary, Gaussian and categorical types are well-known, and thus we only provide a summary here:
– Binary: P_i(v_i = 1 | h) = σ(a_i + Σ_k w_ik h_k), with G_i(v_i) = -a_i v_i and H_ik(v_i) = w_ik v_i.

– Gaussian: P_i(v_i | h) = N(v_i; a_i + Σ_k w_ik h_k, 1), with G_i(v_i) = v_i²/2 - a_i v_i and H_ik(v_i) = w_ik v_i.

– Categorical: P_i(v_i = c | h) = exp{a_ic + Σ_k w_ick h_k} / Σ_{c'∈S_i} exp{a_ic' + Σ_k w_ic'k h_k}, with G_i(v_i) = -Σ_c a_ic δ[v_i = c] and H_ik(v_i) = Σ_c w_ick δ[v_i = c],

where σ(x) = 1/(1 + e^{-x}); a_i, a_ic, w_ik and w_ick are model parameters; and δ[v_i = c] = 1 if v_i = c and 0 otherwise.
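A minimal sketch of these three conditionals, with each unit parameterised by its bias and weight vector (the function names, shapes and arguments are our own illustration, not the authors' code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_prob(a_i, w_i, h):
    """Binary unit: P(v_i = 1 | h) = sigmoid(a_i + w_i . h)."""
    return sigmoid(a_i + w_i @ h)

def gaussian_mean(a_i, w_i, h):
    """Gaussian unit with unit variance: mean of N(a_i + w_i . h, 1)."""
    return a_i + w_i @ h

def categorical_probs(a_i, W_i, h):
    """Categorical unit: softmax over categories,
    P(v_i = c | h) proportional to exp(a_ic + w_ic . h)."""
    x = a_i + W_i @ h
    e = np.exp(x - x.max())       # subtract max for numerical stability
    return e / e.sum()

h = np.array([1.0, 0.0])
print(binary_prob(0.0, np.zeros(2), h))                      # 0.5
print(categorical_probs(np.zeros(3), np.zeros((3, 2)), h))   # uniform over 3
```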
The cases of multicategorical, ordinal and category-ranking variables are, however, much more involved, and some further simplification may be necessary. In what follows, we describe the specifications for these three cases.
Multicategorical Variables
An assignment to a multicategorical variable has the form of a subset of a set of categories. For example, a person may be interested in games and music from the set of offers {games, sports, music, photography}. More formally, let S_i be the set of categories for the i-th variable, and P(S_i) be the power set of S_i (the set of all possible subsets of S_i). Each variable assignment is a non-empty element of P(S_i), i.e., v_i ∈ P(S_i) \ {∅}. Since there are 2^{M_i} - 1 possible ways to select a non-empty subset, directly enumerating P_i(v_i | h) proves to be highly difficult even for moderately sized sets. To handle this state explosion, we first assign each category c a binary indicator v_i^(c) ∈ {0, 1} of whether the c-th category is active, that is, v_i = (v_i^(1), ..., v_i^(M_i)). We then assume the following factorisation:
P_i(v_i | h) = Π_{c∈S_i} P(v_i^(c) | h)    (4)
Note that this does not say that the binary indicators are independent in their own right; rather, they are independent given knowledge of the hidden variables h. Since the hidden variables are never observed, the binary indicators remain interdependent. Now, the probability of activating a binary indicator is defined as
P(v_i^(c) = 1 | h) = σ(a_ic + Σ_k w_ick h_k)    (5)
Note that this specification is equivalent to the following decomposition of the functionals G_i and H_ik in Eq. (2):

G_i(v_i) = -Σ_{c∈S_i} a_ic v_i^(c),   H_ik(v_i) = Σ_{c∈S_i} w_ick v_i^(c)
Ordinal Variables
An ordinal variable v_i takes individual values from an ordinal set S_i = {1 ≺ 2 ≺ ... ≺ M_i}, where ≺ denotes an order in some sense. For example, v_i can be a numerical rating from a review, or a sentiment expression such as love, neutral and hate. There are two straightforward ways to treat an ordinal variable: (i) simply ignore the order and consider it a multinomial variable, or (ii) convert the ordinal expression to some numerical scale, e.g., 1, 2 and 3 for the triple {love, neutral, hate}, and then proceed as if it were a continuous variable. However, the first treatment loses substantial ordinal information, and the second offers no satisfactory interpretation of the chosen numbers.
In this paper, we adapt the Stereotype Ordered Regression Model (SORM) of [1]. More specifically, the SORM defines the conditional distribution as follows:

P_i(v_i = l | h) = exp{a_il + φ_il Σ_k w_ik h_k} / Σ_{m=1}^{M_i} exp{a_im + φ_im Σ_k w_ik h_k}

where a_il and φ_il are free parameters, and M_i is the dimensionality of the ordinal variable. A shortcoming of this setting is that when the parameters {a_il, φ_il} are left unconstrained, the model reduces to the standard multiclass logistic, effectively removing the ordinal property. To deal with this, we propose to make the input bias parameters order-dependent:

a_il = a_i + l η_i    (6)

where η_i is the newly introduced parameter. Here we fix the scales at the endpoints, choosing φ_i1 = 0 and φ_iM_i = 1, with the intermediate scales non-decreasing in l.
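To illustrate the stereotype construction, the model scales a single shared projection of h by monotone scores φ: with φ non-decreasing in the level, a larger projection shifts probability mass towards higher levels. A sketch under the notation used here (our own code, not the authors'):

```python
import numpy as np

def sorm_probs(a, phi, w, h):
    """Stereotype ordinal model: P(v = l | h) proportional to
    exp(a_l + phi_l * (w . h)); `phi` must be non-decreasing in l
    for the ordinal structure to be preserved."""
    x = a + phi * (w @ h)
    e = np.exp(x - x.max())       # stable softmax over levels
    return e / e.sum()

a = np.zeros(3)                      # per-level biases
phi = np.array([0.0, 0.5, 1.0])      # monotone scores, phi_1 = 0, phi_M = 1
p = sorm_probs(a, phi, np.ones(2), np.ones(2))   # projection w.h = 2 > 0
print(p)   # probabilities increase with the level
```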
Category-ranking Variables
In category ranking, a variable assignment has the form of a ranked list of a set of categories. For example, given the set of offers {games, sports, music, photography}, a person may express their preferences in a particular decreasing order: sports ≻ music ≻ games ≻ photography. Sometimes they may like, say, sports and music equally, creating a situation known as ties in ranking, or indifference in preference. When there are no ties, we say that the rank is complete.
More formally, given a set of M_i categories S_i, a variable assignment without ties is a permutation of the elements of S_i; thus there are M_i! possible complete rank assignments. When we allow ties, however, the number of possible assignments is larger still. To see why, let us group categories of the same rank into a partition: orders within a partition are not important, but orders between partitions are. The problem of rank assignment thus amounts to choosing from the set of all possible schemes for partitioning and ordering a set. The number of such schemes is known in combinatorics as the Fubini number [16, pp. 396–397], which is extremely large even for small sets: for example, F(3) = 13, F(5) = 541 and F(10) = 102,247,563. Directly modelling ranking with ties therefore proves intractable.
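The Fubini (ordered Bell) numbers satisfy the recurrence a(n) = Σ_{k=1}^{n} C(n, k) a(n-k) with a(0) = 1, obtained by choosing which k categories form the top partition. A short check of the growth claimed above:

```python
from math import comb

def fubini(n):
    """Ordered Bell number: count of rankings with ties over n categories.
    Recurrence: a(n) = sum_{k=1..n} C(n, k) * a(n - k), with a(0) = 1."""
    a = [1] + [0] * n
    for m in range(1, n + 1):
        a[m] = sum(comb(m, k) * a[m - k] for k in range(1, m + 1))
    return a[n]

print([fubini(n) for n in range(1, 6)])   # [1, 3, 13, 75, 541]
```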
We thus resort to approximate methods. One way is to model just pairwise comparisons: we treat each pair of categories separately when conditioned on the hidden layer. More formally, denote by c ≻ d the preference of category c over d, and by c ∼ d their indifference. We replace the data model P_i(v_i | h) with a product of pairwise comparisons Π_{c<d} P(v_i^(c,d) | h), where v_i^(c,d) ∈ {c ≻ d, d ≻ c, c ∼ d} denotes the preference relation. This effectively translates the original problem, with its Fubini-number complexity, into M_i(M_i - 1)/2 pairwise subproblems, each of which has only three preference choices. The drawback is that this relaxation loses the guarantee of transitivity (i.e., that a ⪰ b and b ⪰ c entail a ⪰ c, where ⪰ means better than or equal to). The hope is that the hidden layer is rich enough to absorb this property, that is, that the probability of preserving transitivity is sufficiently high.
It now remains to specify P(v_i^(c,d) | h) in detail. In particular, we adapt Davidson's model [2] of pairwise comparison:

P(c ≻ d | h) ∝ exp{u_ic},   P(d ≻ c | h) ∝ exp{u_id},   P(c ∼ d | h) ∝ θ_i exp{(u_ic + u_id)/2}    (7)

where θ_i > 0 is the tie parameter, and

u_ic = (1/(M_i - 1)) (a_ic + Σ_k w_ick h_k).

The term 1/(M_i - 1) normalises for the frequency with which a category occurs in the model energy (each category appears in M_i - 1 pairs), leading to better numerical stability.
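Eq. (7) for a single pair can be sketched as follows (our reconstruction; `u_c` and `u_d` stand for the per-category utilities u_ic and u_id computed from the hidden layer):

```python
import numpy as np

def davidson_pair(u_c, u_d, theta):
    """Davidson pairwise model: returns (P(c > d), P(d > c), P(c ~ d)).
    Ties receive mass proportional to theta * exp((u_c + u_d) / 2)."""
    z = np.array([np.exp(u_c), np.exp(u_d),
                  theta * np.exp(0.5 * (u_c + u_d))])
    return tuple(z / z.sum())

print(davidson_pair(0.0, 0.0, 1.0))   # equal utilities, theta = 1: equal thirds
```

Note that the three outcomes always sum to one per pair, which is what makes the product-of-pairs relaxation a proper distribution over each comparison.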
3 Learning and Inference
In this paper, we consider two applications of the MV.RBM: estimating the data distribution and learning predictive models. Estimating the data distribution means learning a generative model of the visible data. This is useful in many applications, including dimensionality reduction, feature extraction, and data completion. A predictive model, on the other hand, is a classification (or regression) tool that predicts an output given the input covariates.
3.1 Parameter Learning
We now present parameter estimation for the MV.RBM, which clearly depends on the specific application.
Estimating Data Distribution
The problem of estimating a distribution from data is typically solved by maximising the data log-likelihood L = Σ_v P̃(v) log P(v), where P̃ denotes the empirical distribution of the visible variables and P is the model distribution. Since the MV.RBM belongs to the exponential family, the gradient of L with respect to the parameters takes the form of a difference of expectations. For example, in the case of binary variables, the gradient reads

∂L/∂w_ik = ⟨v_i P(h_k = 1 | v)⟩_P̃ - ⟨v_i h_k⟩_P

where ⟨·⟩_P̃ is the expectation with respect to the empirical distribution and ⟨·⟩_P the expectation with respect to the model distribution. Due to space constraints, we omit the derivation details here.
The empirical expectation is easy to estimate thanks to the factorisation in Eq. (3). However, the model expectation is intractable to evaluate exactly, and thus we must resort to approximate methods. Due to the factorisations in Eqs. (3, 4), Markov chain Monte Carlo samplers are efficient to run. More specifically, the sampler alternates between ĥ ∼ P(h | v̂) and v̂ ∼ P(v | ĥ). Note that in the case of multicategorical variables, we make use of the factorisation in Eq. (4) and sample all binary indicators simultaneously. In the case of category-ranked variables, on the other hand, we do not sample directly from P_i(v_i | h) but from its pairwise relaxation, which has the form of multinomial distributions. To speed learning up, we follow the method of Contrastive Divergence (CD) [7], in which the MCMC chain is restarted from the observed data and stopped after just a few steps for every parameter update. This is known to introduce bias into the model estimate, but it is often fast and effective in many applications.
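For the all-binary special case, one CD-1 update can be sketched as follows (the learning rate and seeding are our own choices; real MV.RBM training would dispatch on the unit types when sampling from P(v | h)):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.1):
    """One CD-1 step for an all-binary RBM: start the chain at the data,
    run one Gibbs sweep, and move the parameters along the gradient estimate."""
    p_h0 = sigmoid(b + W.T @ v0)                   # posterior at the data
    h0 = (rng.random(p_h0.shape) < p_h0) * 1.0     # sample hidden layer
    p_v1 = sigmoid(a + W @ h0)                     # reconstruct visibles
    v1 = (rng.random(p_v1.shape) < p_v1) * 1.0
    p_h1 = sigmoid(b + W.T @ v1)                   # posterior at the sample
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    a += lr * (v0 - v1)
    b += lr * (p_h0 - p_h1)
    return W, a, b

v0 = np.array([1.0, 0.0, 1.0])
W, a, b = cd1_update(v0, np.zeros((3, 2)), np.zeros(3), np.zeros(2))
```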
For the data completion application, only some variables are observed in the data while others are missing. There are two ways to handle a missing variable during training: one is to treat it as hidden, and the other is to ignore it. In this paper, we follow the latter for simplicity and efficiency, especially when the data is highly sparse.
Learning Predictive Models
In our MV.RBM, a predictive task can be represented by an output variable conditioned on input variables. Denote by v_o the output variable and by v_¬o the set of input variables, that is, v = (v_o, v_¬o). The learning problem translates into estimating the conditional distribution P(v_o | v_¬o).
There are three general ways to learn a predictive model. The generative method first learns the joint distribution P(v_o, v_¬o), as in the problem of estimating the data distribution. The discriminative method, on the other hand, effectively ignores P(v_¬o) and concentrates only on P(v_o | v_¬o). In the latter, we typically maximise the conditional likelihood. This problem is inherently easier than the former because we do not have to make inference about v_¬o. The learning strategy is almost identical to that of the generative counterpart, except that we clamp the input variables to their observed values. For tasks whose output space is small (e.g., standard binary, ordinal and categorical variables) we can perform exact evaluations and use any nonlinear optimisation method for parameter estimation. The conditional distribution can be computed as in Eq. (10). We omit the likelihood gradient for space limitations.
It is often argued that the discriminative method is preferable since no effort is wasted on learning P(v_¬o), which we do not need at test time. In our setting, however, learning P(v_¬o) may yield a more faithful representation of the data. The third way is therefore a hybrid that combines the two objectives:

L = α log P(v_o | v_¬o) + (1 - α) log P(v_¬o)

where α ∈ [0, 1] is the hyperparameter controlling the relative contribution of the generative and discriminative components. Another way is to use a two-stage procedure: first we pretrain the model in an unsupervised manner, and then fine-tune the predictive model.
3.2 Prediction
Once the model has been learnt, we are ready to perform prediction. We study two predictive applications: completing missing data, and predicting output labels in predictive modelling. The former calls for the inference of P(v_H | v_O), where v_O is the set of observed variables and v_H the set of unseen variables to be predicted. Ideally, we should predict all unseen variables simultaneously, but this inference is likely to be difficult. Thus, we resort to estimating P(v_i | v_O) for each i ∈ H. The prediction application requires the estimation of P(v_o | v_¬o), which is clearly a special case of P(v_i | v_O), i.e., with H = {o}. The output is predicted as follows:
v̂_i = argmax_{v_i} P(v_i | v_O),   with   P(v_i | v_O) = Σ_h P(v_i, h | v_O)    (8)

P(v_i, h | v_O) = (1/Z(v_O)) exp{-E(v_i, v_O, h)}    (9)
where Z(v_O) is the normalising constant. Noting that Σ_h Π_k (·) = Π_k Σ_{h_k} (·), the computation of P(v_i | v_O) can be simplified to

P(v_i | v_O) ∝ exp{-G_i(v_i)} Π_k [1 - P̂_k + P̂_k exp{H_ik(v_i)}]    (10)

where P̂_k = P(h_k = 1 | v_O) is computed using Eq. (3) as

P̂_k = σ(b_k + Σ_{j∈O} H_jk(v_j)).
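For a categorical output, Eq. (10) reduces to weighting each category by a product over the hidden units. The sketch below (our own notation: `a` holds the category biases, `W[c, k]` the category-specific weights, and `p_h` the posteriors P(h_k = 1 | v_O)) works in log space for numerical stability:

```python
import numpy as np

def predict_categorical(a, W, p_h):
    """Eq. (10) for a categorical unit:
    P(v = c | v_O) proportional to
    exp(a_c) * prod_k [(1 - p_k) + p_k * exp(W[c, k])]."""
    log_score = a + np.log((1.0 - p_h) + p_h * np.exp(W)).sum(axis=1)
    e = np.exp(log_score - log_score.max())
    return e / e.sum()

p_h = np.array([0.5, 0.5])   # hidden posteriors from Eq. (3)
print(predict_categorical(np.zeros(3), np.zeros((3, 2)), p_h))   # uniform
```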
For binary, categorical and ordinal outputs, the estimation in Eq. (8) is straightforward using Eq. (10). However, for the other output types, suitable simplifications must be made:

– For multicategorical and category-ranking variables, we do not enumerate over all possible assignments of v_i, but proceed indirectly:

– For multiple categories (Section 2.2.1), we first estimate P(v_i^(c) = 1 | v_O) for each category c, and then output c if P(v_i^(c) = 1 | v_O) ≥ ρ for some threshold ρ ∈ (0, 1)^{6}.

– For category-ranking (Section 2.2.3), we first estimate P(c ≻ d | v_O) for every pair (c, d). A complete ranking over the set S_i can then be obtained by aggregating over the pairwise probabilities; for example, the score for category c can be estimated as Σ_{d≠c} P(c ≻ d | v_O), which can be used for sorting the categories^{7}.
– For continuous variables, the problem leads to a nontrivial nonlinear optimisation: even in the case of Gaussian variables, P(v_i | v_O) in Eq. (10) is no longer Gaussian. For efficiency and simplicity, we take a mean-field approximation, substituting P(h_k = 1 | v_O) for h_k. In the case of Gaussian outputs, for example, we obtain the simplified expression

P(v_i | v_O) ≈ N(v_i; a_i + Σ_k w_ik P̂_k, 1)

which is also Gaussian. Thus the optimal value is the mean itself: v̂_i = a_i + Σ_k w_ik P̂_k. Details of the mean-field approximation are presented in Appendix A.2.
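The resulting Gaussian prediction is a single affine map of the hidden posteriors (a minimal sketch; the parameter names follow the notation used here and are our own):

```python
import numpy as np

def gaussian_meanfield_predict(a_i, w_i, p_h):
    """Mean-field prediction for a Gaussian output: substitute
    P(h_k = 1 | v_O) for h_k, giving v_hat = a_i + sum_k w_ik * p_k."""
    return a_i + w_i @ p_h

v_hat = gaussian_meanfield_predict(1.0, np.array([2.0, -1.0]),
                                   np.array([0.5, 0.5]))
print(v_hat)   # 1.0 + 2.0*0.5 - 1.0*0.5 = 1.5
```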
4 A Case Study: World Attitudes
4.1 Setting
In this experiment, we run the MV.RBM on a large-scale survey of general world opinion, published by the Pew Global Attitudes Project.
We evaluate each data type separately. In particular, let u be the user index, v̂_iu be the predicted value of the i-th variable for user u, and N_t be the number of variables of type t in the test data. We compute the prediction error for each type as follows:
– Binary, categorical and ordinal: the misclassification rate, i.e., the proportion of cases with v̂_iu ≠ v_iu (as measured by the indicator I[·]), averaged over users and over the N_t variables of the type.

– Multicategorical: one minus the F-score, where F1 = 2RP/(R + P), R is the recall rate and P is the precision.

– Continuous: the normalised mean absolute error |v̂_iu - v_iu|, averaged over users and variables.

– Category-ranking: the average discrepancy between the predicted rank and the true rank r_iu^(c) of the c-th category of the i-th variable.

Here the recall R is the proportion of truly active categories v_iu^(c) that are also predicted as active, and the precision P is the proportion of predicted-active categories that are truly active. Note that the summation over variables for each type consists only of the relevant variables.
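For the multicategorical metric, a micro-averaged recall/precision/F-score over predicted category subsets can be computed as follows (an illustrative sketch of the metric described above, not the authors' evaluation code):

```python
def multicat_f1(true_sets, pred_sets):
    """Micro-averaged recall, precision and F-score, where each response
    is represented as a set of active categories."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    n_true = sum(len(t) for t in true_sets)
    n_pred = sum(len(p) for p in pred_sets)
    recall = tp / n_true if n_true else 0.0
    precision = tp / n_pred if n_pred else 0.0
    f1 = 2 * recall * precision / (recall + precision) if tp else 0.0
    return recall, precision, f1

print(multicat_f1([{1, 2}], [{1}]))   # recall 0.5, precision 1.0, F1 = 2/3
```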
To create baselines, we use the MV.RBM without the hidden layer, i.e., by assuming that the variables are independent.
4.2 Results
Feature Extraction and Visualisation
Table 2: Reconstruction errors with increasing numbers of hidden units (left to right).

Type | Baseline | increasing numbers of hidden units →
Binary | 32.9 | 23.6 | 20.1 | 16.3 | 13.2 | 9.8
Categorical | 52.3 | 29.8 | 22.0 | 17.0 | 13.2 | 7.1
Multicategorical | 49.6 | 46.6 | 42.2 | 36.9 | 29.2 | 23.8
Continuous (*) | 100.0 | 89.3 | 84.1 | 78.4 | 69.5 | 65.5
Ordinal | 25.2 | 19.5 | 16.2 | 13.5 | 10.9 | 7.7
Category-ranking | 19.3 | 11.7 | 6.0 | 5.0 | 3.2 | 2.3
Recall that our MV.RBM can be used as a feature extraction tool through the posterior projection. The projection converts a multimodal input v into a real-valued vector ĥ = (P(h_1 = 1 | v), ..., P(h_K = 1 | v)). Clearly, numerical vectors are much easier to process than the original data; indeed, the vectorial form is required by the majority of modern data handling tools (e.g., for transformation, clustering, comparison and visualisation). To evaluate the faithfulness of the new representation, we reconstruct the original data using ĥ; that is, in Eq. (4), the binary vector h is replaced by ĥ. The use of ĥ can be justified through the mean-field approximation framework presented in Appendix A.2. Table 2 presents the reconstruction results. The trends are not surprising: with more hidden units, the model becomes more flexible and more accurate in capturing the data content.
For visualisation, we first learn our MV.RBM on a randomly chosen subset of users, with the country information removed. Then we use t-SNE [27] to project the posteriors further into 2D. Figure 1 shows the distribution of people's opinions in ten countries (Angola, Argentina, Bangladesh, Bolivia, Brazil, Bulgaria, Canada, China, the Czech Republic, and Egypt). It is interesting to see how the opinions cluster geographically and culturally: Europe & North America (Bulgaria, Canada & the Czech Republic), South America (Argentina, Bolivia, Brazil), East Asia (China), South Asia (Bangladesh), North Africa (Egypt) and Southern Africa (Angola).
Data Completion
In this task, we need to fill in missing answers in each survey response. Missing answers are common in real survey data because respondents may forget to answer or simply ignore some questions. We create an evaluation test by randomly removing a portion of the answers of each person. The MV.RBM is then trained on the remaining answers in a generative fashion (Section 3.1.1). The missing answers are then predicted as in Section 3.2. The idea is that the missing answers of one person can be interpolated from the answers available from other people. This is essentially a multimodal generalisation of the so-called collaborative filtering problem. Table 3 reports the completion results for a subset of the data.
Table 3: Data completion errors with increasing numbers of hidden units (left to right).

Type | Baseline | increasing numbers of hidden units →
Binary | 32.7 | 26.0 | 24.2 | 23.3 | 22.7 | 22.3
Categorical | 52.1 | 34.3 | 30.0 | 28.2 | 27.5 | 27.1
Multicategorical | 49.5 | 48.3 | 45.7 | 43.6 | 42.4 | 42.0
Continuous (*) | 101.6 | 93.5 | 89.9 | 87.9 | 87.3 | 87.9
Ordinal | 25.1 | 20.7 | 19.3 | 18.6 | 18.2 | 17.9
Category-ranking | 19.3 | 15.4 | 14.7 | 14.2 | 14.1 | 13.9
Learning Predictive Models
We study six predictive problems, each representative of one data type. That is, six corresponding variables are reserved as outputs and the rest serve as input covariates. The predictive problems are: (i) satisfaction with the country (binary), (ii) country of origin (categorical), (iii) problems facing the country (multicategorical), (iv) age of the person (continuous), (v) ladder of life (ordinal), and (vi) rank of dangers of the world (category-ranking). All models are trained discriminatively (see Section 3.1.2). We randomly split the users into a training subset and a testing subset. The predictive results are presented in Table 4. It can be seen that learning predictive models requires far fewer hidden units than the reconstruction and completion tasks. This is because in discriminative training the hidden layer acts as an information filter that passes only the relevant bits from the input to the output. Since there is only one output per prediction task, the number of required bits, and therefore of hidden units, is relatively small. In reconstruction and completion, on the other hand, we need many bits to represent all the available information.
Table 4: Prediction errors with increasing numbers of hidden units (left to right).

Task | Baseline | increasing numbers of hidden units →
Satisfaction (bin.) | 26.3 | 18.0 | 17.7 | 17.7 | 17.8 | 18.0 | 18.0
Country (cat.) | 92.0 | 70.2 | 61.0 | 21.6 | 11.0 | 9.9 | 5.9
Problems (multicat.) | 49.6 | 47.6 | 41.9 | 39.2 | 38.8 | 39.1 | 39.2
Age (cont.*) | 99.8 | 67.3 | 67.6 | 66.3 | 66.4 | 65.8 | 66.3
Life ladder (ord.) | 16.9 | 12.2 | 12.2 | 11.9 | 11.9 | 12.2 | 11.8
Dangers (cat.-rank) | 31.2 | 27.1 | 24.6 | 24.0 | 23.2 | 23.0 | 22.5
5 Related Work
The most popular use of RBMs is in modelling individual types, for example binary variables [5], Gaussian variables [8, 18], categorical variables [21], rectified linear units [17], Poisson variables [6], counts [20] and Beta variables [13]. When RBMs are used for classification [12], categorical variables may be employed for the labels in addition to the features. Beyond that, the Dual-Wing Harmonium [28] models both continuous and binary variables. However, there have been no attempts to address all six data types in a single model, as we do in the present paper.
The literature on ordinal variables is rich in statistics, especially after the seminal work of [14]. In machine learning, on the other hand, the literature is quite sparse and recent (e.g., see [23, 29]), and it is often limited to a single ordinal output (given numerical input covariates). The RBM-based treatment of ordinal variables in [25] is similar to ours, except that our treatment is more general and principled.
Mixed-variate modelling has been previously studied in statistics, under a variety of names such as mixed outcomes, mixed data, or mixed responses [22, 4, 24, 15]. Most papers focus on the mix of ordinal, Gaussian and binary variables under the latent variable framework. More specifically, each observed variable is assumed to be generated by one or more underlying continuous latent variables. Inference becomes complicated because these correlated latent variables must be integrated out, making it difficult to handle hundreds of variables and large-scale datasets.
In machine learning, the problem of predicting a single multicategorical variable is also known as multilabel learning (e.g., see [26]). Among previous ideas, we have adapted the shared structure among labels [11] into our context: in our model, the sharing is captured by the hidden layer in a probabilistic manner, and we consider many multicategorical variables at the same time. Finally, the problem of predicting a single category-ranked variable is also known as label-ranking (e.g., see [3, 10]). The idea we adopt here is the pairwise comparison between categories. However, previous work neither considered the hidden correlation between those pairs nor attempted multiple category-ranked variables.
6 Conclusion
We have introduced the Mixed-Variate Restricted Boltzmann Machine (MV.RBM) as a generalisation of the RBM for modelling correlated variables of multiple modalities and types. The six types considered were: binary, categorical, multicategorical, continuous, ordinal, and category-ranking. We have shown that the MV.RBM is capable of handling a variety of machine learning tasks, including feature extraction, dimensionality reduction, data completion, and label prediction. We demonstrated the capacity of the model on a large-scale worldwide survey.
We plan to extend the present work in several directions. First, the model has the capacity to handle multiple related predictive tasks simultaneously by learning a shared representation through the hidden posteriors, and is thereby applicable to the setting of multitask learning. Second, there may exist strong interactions between variables that the RBM architecture cannot capture. The theoretical question is then how to model inter-type dependencies directly, without going through an intermediate hidden layer. Finally, we plan to enrich the range of applications of the proposed model.
Acknowledgment: We thank anonymous reviewers for insightful comments.
Appendix A Additional Materials
a.1 Sample Questions

– Q1 (Ordinal): How would you describe your day today—has it been a typical day, a particularly good day, or a particularly bad day?

– Q7 (Binary): Now thinking about our country, overall, are you satisfied or dissatisfied with the way things are going in our country today?

– Q5 (Multicategorical): What do you think is the most important problem facing you and your family today? {Economic problems / Housing / Health / Children and education / Work / Social relations / Transportation / Problems with government / Crime / Terrorism and war / No problems / Other / Don't know / Refused}

– Q10,11 (Category-ranking): In your opinion, which one of these poses the greatest/second greatest threat to the world: {the spread of nuclear weapons / religious and ethnic hatred / AIDS and other infectious diseases / pollution and other environmental problems / or the growing gap between the rich and poor}?

– Q74 (Continuous): How old were you at your last birthday?

– Q91 (Categorical): Are you currently married or living with a partner, widowed, divorced, separated, or have you never been married?
a.2 Meanfield Approximation
We present here a simplification of P(v_i | v_O) in Eq. (10) using the mean-field approximation. Recall that P(v_i | v_O) = Σ_h P(v_i, h | v_O), where P(v_i, h | v_O) is defined in Eq. (9). We approximate P(v_i, h | v_O) by a fully factorised distribution

Q(v_i, h) = Q(v_i) Π_k Q(h_k).

The approximate distribution is obtained by minimising the Kullback-Leibler divergence

KL(Q ‖ P) = Σ_{v_i, h} Q(v_i, h) log [Q(v_i, h) / P(v_i, h | v_O)]

with respect to Q(v_i) and Q(h_k). This results in the following recursive relations:

Q(v_i) ∝ exp{-G_i(v_i) + Σ_k μ_k H_ik(v_i)},
μ_k = Q(h_k = 1) = σ(b_k + Σ_{j∈O} H_jk(v_j) + ⟨H_ik(v_i)⟩_Q),

where ⟨H_ik(v_i)⟩_Q = Σ_{v_i} Q(v_i) H_ik(v_i). Now we make the further assumption that ⟨H_ik(v_i)⟩_Q is negligible compared with Σ_{j∈O} H_jk(v_j), e.g., when the observed set v_O is sufficiently large. This results in μ_k ≈ P(h_k = 1 | v_O) and

Q(v_i) ∝ exp{-G_i(v_i) + Σ_k P(h_k = 1 | v_O) H_ik(v_i)},

which is essentially the data model in Eq. (4) with h_k replaced by P(h_k = 1 | v_O).

The overall complexity of computing Q(v_i) is the same as that of evaluating P(v_i | v_O) in Eq. (10). However, the approximation is often numerically faster, and in the case of continuous variables it has a simpler functional form.
Footnotes
 Work done while the authors were with Curtin University, Australia.
 This should not be confused with the dimensionality N of the whole data.
 Ignoring missing data may be inadequate if the missing patterns are not at random. However, treating missing data as zero observations (e.g., in the case of binary variables) may not be accurate either since it may introduce bias to the data marginals.
 As we do not need labels to learn P(v_¬o), this is actually a form of semi-supervised learning.
 We can also avoid tuning the parameters associated with v_¬o by using the posteriors as features and learning P(v_o | ĥ), where ĥ_k = P(h_k = 1 | v_¬o).
 Raising the threshold ρ typically leads to better precision at the expense of recall. Typically we choose ρ = 0.5 when there is no preference between recall and precision.
 Note that we do not estimate the event of ties during prediction.
 http://pewglobal.org/datasets/
 It may be desirable to learn the variance structure, but we keep things simple by fixing the variance to one. For more sophisticated variance learning, we refer the reader to a recent paper [13].
 To the best of our knowledge, there has been no totally comparable work addressing the issues we study in this paper. Existing survey analysis methods are suitable for individual tasks such as measuring pairwise correlation among variables, or building individual regression models where complex covariates are coded into binary variables.
References
 J.A. Anderson. Regression and ordered categorical variables. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–30, 1984.
 R.R. Davidson. On extending the BradleyTerry model to accommodate ties in paired comparison experiments. Journal of the American Statistical Association, 65(329):317–328, 1970.
 O. Dekel, C. Manning, and Y. Singer. Loglinear models for label ranking. Advances in Neural Information Processing Systems, 16, 2003.
 D.B. Dunson. Bayesian latent variable models for clustered mixed outcomes. Journal of the Royal Statistical Society. Series B, Statistical Methodology, pages 355–366, 2000.
 Y. Freund and D. Haussler. Unsupervised learning of distributions on binary vectors using two layer networks. Advances in Neural Information Processing Systems, pages 912–919, 1993.
 P.V. Gehler, A.D. Holub, and M. Welling. The rate adapting Poisson model for information retrieval and object recognition. In Proceedings of the 23rd international conference on Machine learning, pages 337–344. ACM New York, NY, USA, 2006.
 G.E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.
 G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 G.E. Hinton and T.J. Sejnowski. Learning and relearning in Boltzmann machines. Parallel distributed processing: Explorations in the microstructure of cognition, 1:282–317, 1986.
 E. Hüllermeier, J. Fürnkranz, W. Cheng, and K. Brinker. Label ranking by learning pairwise preferences. Artificial Intelligence, 2008.
 S. Ji, L. Tang, S. Yu, and J. Ye. Extracting shared subspace for multilabel classification. In KDD. ACM New York, NY, USA, 2008.
 H. Larochelle and Y. Bengio. Classification using discriminative restricted Boltzmann machines. In Proceedings of the 25th international conference on Machine learning, pages 536–543. ACM, 2008.
 N. Le Roux, N. Heess, J. Shotton, and J. Winn. Learning a generative model of images by factoring appearance and shape. Neural Computation, 23(3):593–650, 2011.
 P. McCullagh. Regression models for ordinal data. Journal of the Royal Statistical Society. Series B (Methodological), pages 109–142, 1980.
 C. McCulloch. Joint modelling of mixed outcome types using latent variables. Statistical Methods in Medical Research, 17(1):53, 2008.
 M. Mureşan. A concrete approach to classical analysis. Springer Verlag, 2008.
 V. Nair and G.E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proc. 27th International Conference on Machine Learning, 2010.
 M.A. Ranzato and G.E. Hinton. Modeling pixel means and covariances using factorized thirdorder Boltzmann machines. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2551–2558. IEEE, 2010.
 R. Salakhutdinov and G. Hinton. Deep Boltzmann machines. In Proceedings of The Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS'09), volume 5, pages 448–455, 2009.
 R. Salakhutdinov and G. Hinton. Replicated softmax: an undirected topic model. Advances in Neural Information Processing Systems, 22, 2009.
 R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning (ICML), pages 791–798, 2007.
 M.D. Sammel, L.M. Ryan, and J.M. Legler. Latent variable models for mixed discrete and continuous outcomes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(3):667–678, 1997.
 A. Shashua and A. Levin. Ranking with large margin principle: Two approaches. Advances in Neural Information Processing Systems, 15, 2002.
 J.Q. Shi and S.Y. Lee. Latent variable models with mixed continuous and polytomous data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(1):77–87, 2000.
 T.T. Truyen, D.Q. Phung, and S. Venkatesh. Ordinal Boltzmann machines for collaborative filtering. In TwentyFifth Conference on Uncertainty in Artificial Intelligence, Montreal, Canada, June 2009.
 G. Tsoumakas and I. Katakis. Multilabel classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1–13, 2007.
 L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
 E. Xing, R. Yan, and A.G. Hauptmann. Mining associated text and images with dualwing harmoniums. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI2005). Citeseer, 2005.
 S. Yu, K. Yu, V. Tresp, and H.P. Kriegel. Collaborative ordinal regression. In Proceedings of the 23rd international conference on Machine learning, page 1096. ACM, 2006.