Thurstonian Boltzmann Machines: Learning from Multiple Inequalities

# Thurstonian Boltzmann Machines: Learning from Multiple Inequalities

Truyen Tran, Dinh Phung, Svetha Venkatesh
Pattern Recognition and Data Analytics, Deakin University, Australia
Department of Computing, Curtin University, Australia
{truyen.tran,dinh.phung,svetha.venkatesh}@deakin.edu.au
###### Abstract

We introduce Thurstonian Boltzmann Machines (), a unified architecture that can naturally incorporate a wide range of data inputs at the same time. Our motivation rests in the Thurstonian view that many discrete data types can be considered as being generated from a subset of underlying latent continuous variables, and in the observation that each realisation of a discrete type imposes certain inequalities on those variables. Thus learning and inference in reduce to making sense of a set of inequalities. Our proposed naturally supports the following types: Gaussian, intervals, censored, binary, categorical, muticategorical, ordinal, (in)-complete rank with and without ties. We demonstrate the versatility and capacity of the proposed model on three applications of very different natures; namely handwritten digit recognition, collaborative filtering and complex social survey analysis.

## 1 Introduction

Restricted Boltzmann machines (RBMs) have proved to be a versatile tool for a wide variety of machine learning tasks and as a building block for deep architectures [12, 24, 28]. The original proposals mainly handle binary visible and hidden units. Whilst binary hidden units are broadly applicable as feature detectors, non-binary visible data requires different designs. Recent extensions to other data types result in type-dependent models: the Gaussian for continuous inputs [12], Beta for bounded continuous inputs [16], Poisson for count data [9], multinomial for unordered categories [25], and ordinal models for ordered categories [37, 35].

The Boltzmann distribution permits several types to be jointly modelled, thus making the RBM a good tool for multimodal and complex social survey analysis. The work of [20, 29, 40] combines continuous (e.g., visual and audio) and discrete modalities (e.g., words). The work of [34] extends the idea further to incorporate ordinal and rank data. However, there are conceptual drawbacks: First, conditioned on the hidden layer, they are still separate type-specific models; second, handling ordered categories and ranks is not natural; and third, specifying direct correlation between these types remains difficult.

The main thesis of this paper is that many data types can be captured in one unified model. The key observations are that (i) type-specific properties can be modelled using one or several underlying continuous variables, in the spirit of Thurstonian models111Whilst Thurstonian models often refer to human’s judgment of discrete choices, we use the term “Thurstonian” more freely without the notion of human’s decision. [31], and (ii) evidences be expressed in the form of one or several inequalities of these underlying variables. For example, a binary visible unit is turned on if the underlying variable is beyond a threshold; and a category is chosen if its utility is the largest among all those of competing categories. The use of underlying variables is desirable when we want to explicitly model the generative mechanism of the data. In psychology and economics, for example, it gives much better interpretation on why a particular choice is made given the perceived utilities [2]. Further, it is natural to model the correlation among type-specific inputs using a covariance structure on the underlying variables.

The inequality observation is interesting in its own right: Instead of learning from assigned values, we learn from the inequality expression of evidences, which can be much more relaxed than the value assignments. This class of evidences indeed covers a wide range of practical situations, many of which have not been studied in the context of Boltzmann machines, as we shall see throughout the paper.

To this end, we propose a novel class of models called Thurstonian Boltzmann Machine (). The utilises the Gaussian restricted Boltzmann machine (GRBM): The top layer consists of binary hidden units as in standard RBMs; the bottom layer contains a collection of Gaussian variable groups, one per input type. The main difference is that does not require valued assignments for the bottom layer but a set of inequalities expressing the constraints imposed by the evidences. Except for a limiting case of point assignments where the inequalities are strictly equalities, the Gaussian layer is never fully observed. The supports more data types in a unified manner than ever before: For any combination of the point assignments, intervals, censored values, binary, unordered categories, multi-categories, ordered categories, (in)-complete ranks with and without ties, all we need to do is to supply relevant subset of inequalities.

We evaluate the proposed model on three applications of very different natures: handwritten digit recognitions, collaborative filtering and complex survey analysis. For the first two applications, the performance is competitive against methods designed for those data types. On the last application, we believe we are among the first to propose a scalable and generic machinery for handle those complex data types.

## 2 Gaussian RBM

Let be a vector of input variables. Let be a set of hidden factors which are designed to capture the variations in the observations. The input layer and the hidden layer form an undirected bipartite graph, i.e., only cross-layer connections are allowed. The model admits the Boltzmann distribution

 P(x,h) =1Zexp{−E(x,h)} (1)

where is the normalising constant and is the state energy. The energy is decomposed as

 E(x,h)=∑i(x2i2−(αi+Wi∙h)xi)−γ⊤h (2)

where are free parameters and denotes the -th row.

Given the input , the posterior has a simple form

 P(h∣x) = ∏kP(hk∣x) (3) P(hk=1∣x) = 11+e−γk−W′∙kx

where denotes the -th column. Similarly, the generative process given the binary factor is also factorisable

 P(x∣h) = ∏iP(xi∣h) (4) P(xi∣h) = N(αi+Wi∙h,1)

where is the normal distribution of mean and unit deviation.

## 3 Thurstonian Boltzmann Machines

We now generalise the Gaussian RBM into the Thurstonian Boltzmann Machine (). Denote by an observed evidence of . Standard evidences are the point assignment of to some specific real-valued vector, i.e., . Generalised evidences can be expressed using inequality constraints

 b≤⊙Ax≤⊙c (5)

for some transform matrix and vectors , where denotes element-wise inequalities. Thus an evidence can be completely realised by specifying the triple . For example, for the point assignment, , where is the identity matrix. In what follows, we will detail other useful popular realisations of these quantities.

### 3.1 Boxed Constraints

This refers to the case where input variables are independently constrained, i.e., , and thus we need only to specify the pair .

##### Censored observations.

This refers to situation where we only know the continuous observation beyond a certain point, i.e., and . For example, in survival analysis, the life expectancy of a person might be observed up to a certain age, and we have no further information afterward.

##### Interval observations.

When the measurements are imprecise, it may be better to specify the range of possible observations with greater confidence rather than a singe point, i.e., and for some pair . For instance, missile tracking may estimate the position of the target with certain precision.

##### Binary observations.

A binary observation can be thought as a result of clipping by a threshold , that is if and otherwise. The boundaries in Eq. (3.2) become:

 ⟨bi,ci⟩={⟨−∞,θi⟩ei=0⟨θi,+∞⟩ei=1 (6)

Thus, this model offers an alternative222To be consistent with the statistical literature, we can call it the probit RBM, which we will study in Section 6.1. to standard binary RBMs of [28, 8].

##### Ordinal observations.

Denote by the set of ordinal observations, where each is drawn from an ordered set . The common assumption is that the ordinal level is observed given for some thresholds . The boundaries thus read

 ⟨bi,ci⟩=⎧⎨⎩⟨−∞,θ1⟩l=1⟨θl−1,θl⟩l=2,3,..L−1⟨θL−1,+∞⟩otherwise (7)

This offers an alternative333This can be called ordered probit RBM. to the ordinal RBMs of [37].

### 3.2 Inequality Constraints

##### Categorical observations.

This refers to the situation where out of an unordered set of categories, we observe only one category at a time. This can be formulated as follows. Each category is associated with a “utility” variable. The category is observed (i.e., ) if it has the largest utility, that is . Thus, is the upper-threshold for all other utilities. On the other hand, is the lower-threshold for . This suggests an EM-style procedure: (i) fix (or treat it as a threshold) and learn the model under the intervals for all , and (ii) fix all categories other than , learn the model under the interval . This offers an alternative444We can call this model multinomial probit RBM. to the multinomial logit treatment in [26].

To illustrate the point, suppose there are only four variables , and is observed, then we have . This can be expressed as and . These are equivalent to

 ⟨A=⎡⎢⎣1−10010−10100−1⎤⎥⎦;b=0;c=+∞⟩
##### Imprecise categorical observations.

This generalises the categorical case: The observation is a subset of a set, where any member of the subset can be a possible observation555This is different from saying that all the members of the subset must be observed.. For example, when asked to choose the best sport team of interest, a person may pick two teams without saying which is the best. For instance, suppose the subset is , then , which can be expressed as , and . This translates to the following triple

 ⟨A=⎡⎢ ⎢ ⎢⎣10−10100−101−10010−1⎤⎥ ⎥ ⎥⎦;b=0;c=+∞⟩
##### Rank (with Ties) observations.

This generalises the imprecise categorical cases: Here we have a (partially) ranked set of categories. Assume that the rank is produced in a stagewise manner as follows: The best category subset is selected out of all categories, the second best is selected out of all categories except for the best one, and so on. Thus, at each stage we have an imprecise categorical setting, but now the utilities of middle categories are constrained from both sides – the previous utilities as the upper-bound, and the next utilities as the lower-bound.

As an illustration, suppose there are four variables and a particular rank (with ties) imposes that . This be rewritten as , which is equivalent to

 ⟨A=⎡⎢⎣10−1001−10001−1⎤⎥⎦;b=0;c=+∞⟩

## 4 Inference Under Linear Constraints

Under the , MCMC-based inference without evidences is simple: we alternate between and . This is efficient because of the factorisations in Eqs. (3,4). Inference with inequality-based evidence is, however, much more involved except for the limiting case of point assignments.

Denote by the constrained domain of defined by the evidence . Now we need to specify and sample from the constrained distribution defined on . Sampling remains unchanged, and in what follows we focus on sampling from .

### 4.1 Inference under Boxed Constraints

For boxed constraints (Section 3.1), due to the conditional independence, we still enjoy the factorisation . We further have

 P(xi∣h,e) = P(xi∣h)Φ(ci∣h)−Φ(bi∣h)

where is the normal cumulative distribution function of . Now is a truncated normal distribution, from which we can sample using the simple rejection method, or more advanced methods such as those in [23].

### 4.2 Inference under Inequality Constraints

For general inequality constraints (Section 3.2), the input variables are interdependent due to the linear transform . However, we can specify the conditional distribution (here ) by realising that

 bm−∑j≠iAmjxj≤Amixi≤cm−∑j≠iAmjxj

where for . In other words, is conditionally box-constrained given other variables.

This suggests a Gibbs procedure by looping through . With some abuse of notation, let and . The constraints can be summarised as

 xi ∈ ∩Mm=1[min{~bmi,~cmi},max{~bmi,~cmi}] = [maxmmin{~bmi,~cmi},minmmax{~bmi,~cmi}]

The and operators are needed to handle change in inequality direction with the sign of , and the join operator is due to multiple constraints.

For more sophisticated Gibbs procedures, we refer to the work in [10].

### 4.3 Estimating the Binary Posteriors

We are often interested in the posteriors , e.g., for further processing. Unlike the standard RBMs, the binary latent variables here are coupled through the unknown Gaussians and thus there are no exact solutions unless the evidences are all point assignments. The MCMC-based techniques described above offer an approximate estimation by averaging the samples . For the case of boxed constraints, mean-field offers an alternative approach which may be numerically faster. In particular, the mean-field updates are recursive:

 Qk ← 11+exp{−γk−∑iWik^μi} μi ← αi+∑kWikQk ^μi = μi+ϕ(bi−μi)−ϕ(ci−μi)Φ(ci−μi)−Φ(bi−μi)

where is the probability of the unit being activated, is the mean of the normal distribution truncated in the interval , is the probability density function, and is normal cumulative distribution function. Interested readers are referred to the Supplement666Supplement material will be available at: http://truyen.vietlabs.com for more details.

### 4.4 Estimating Probability of Evidence Generation

Given the hidden states we want to estimate the probability that hidden states generate a particular evidence

 P(e∣h)=∫Ω(e)P(x∣h)dx

For boxed constraints, analytic solution is available since the Gaussian variables are decoupled, i.e., , where . For general inequality constraints, however, these variables are coupled by the inequalities. The general strategy is to sample from and compute the portion of samples falling into the constrained domain . For certain classes of inequalities we can approximate the Gaussian by appropriate distributions from which the integration has the closed form. In particular, those inequalities imposed by the categorical and rank evidences can be dealt with by using the extreme value distributions. The integration will give the logit form on distribution of categories and Plackett-Luce distribution of ranks. For details, we refer to the Supplement.

## 5 Stochastic Gradient Learning with Persistent Markov Chains

Learning is based on maximising the evidence likelihood

 L = logP(e)=log∑h∫Ω(e)P(h,x)dx

where is defined in Eq. (1). Let , then . The gradient w.r.t. the mapping parameter reads

 ∂WikL = EP(xi,hk∣e)[xihk]−EP(xi,hk)[xihk] (8)

The derivation is left to the Supplement.

### 5.1 Estimating Data Statistics

The data-dependent statistics and the data-independent statistics
are not tractable to compute in general, and thus approximations are needed.

##### Data-dependent statistics.

Under the box constraints, the mean-field technique (Section 4.3) can be employed as follows

 EP(xi,hk∣e)[xihk]≈^μiQk

For general cases, sampling methods are applicable. In particular, we maintain one persistent Markov chain [42, 32] per data instance and estimate the statistics after a very short run. This would explore the space of the data-dependent distribution by alternating between and using techniques described in Section 4.

##### Data-independent statistics.

Mean-field distributions are not appropriate for exploring the entire state space because they tend to fit into one mode. One practical solution is based on the idea of Hinton’s Contrastive Divergence (CD), where we create another Markov chain on-the-fly starting from the latest state of the clamped chain. This chain will be discarded after each parameter update. This is particular useful when the models are instance-specific, e.g., in collaborative filtering, it is much cheaper to build one model per user, all share the same parameters. If it is not the case, then we can maintain a moderate set of parallel chains and collect the samples after a short run at every updating step [42, 32].

### 5.2 Learning the Box Boundaries

In the case of boxed constraints, sometimes it is helpful to learn the boundaries themselves. The gradient of the log-likelihood w.r.t. the lower boundaries reads

 ∂biL = 1Z(e)∑h∂bi∫Ω(e)exp{−E(x,h)}dx = ∑h∂bi∫Ω(e)P(x,h∣e)dx ≈ 1S∂bi∫cibi∫Ω(e)∖[bi,ci]P(x∣h(s),e)dx¬idxi = −1S∑hP(xi=bi∣h(s),e)

where are samples collected during the MCMC procedure running on the data-dependent distribution . Similarly we would have the gradient w.r.t. the upper boundaries:

 ∂ciL=1S∑hP(x=ci∣h(s),e).

## 6 Applications

In this section, we describe applications of the for three realistic domains, namely handwritten digit recognition, collaborative filtering and worldwide survey analysis. Before going to the details, let us first address key implementation issues (see Supplement for more details).

One observed difficulty in training the is that the hidden samples can get stuck in one of the two ends and thus learning cannot progress. The reasons might be the large mapping parameters or the unbounded nature of the underlying Gaussian variables, which can saturate the hidden units. We can control the norm of the mapping parameters, either by using the standard -norm regularisation, or by rescaling the norm of the parameter vector for each hidden unit. To deal with the non-boundedness of the Gaussian variables, then we can restrict their range, making the model bounded.

Another effective solution is to impose a constraint on the posteriors by adding the regularising term to the log-likelihood, e.g.,

 λ{∑k[ρlogP(h1k∣e)+(1−ρ)log(1−P(h1k∣e))]}

where is the expected probability that a hidden unit will turn on given the evidence and is the regularisation weight. Maximising this quantity is essentially minimising the Kullback-Leibler divergence between the expected posteriors and the true posteriors. In our experiments, we found and gave satisfying results.

The main technical issue is that does not have a simple form due to the integration over all the constrained Gaussian variables. Approximation is thus needed. The use of mean-field methods will lead to the simple sigmoid form, but it is only applicable for boxed constraints since it breaks down deterministic constraints among variables (Section 4.3). However, we can estimate the “mean” truncated Gaussian by averaging the recent samples of the Gaussian variables in the data-dependent phase.

Once these safeguards are in place, learning can greatly benefit from quite large learning rate and small batches as it appears to quickly get the samples out off the local energy traps by significantly distorting the energy landscape. Depending on the problem sizes, we vary the batch sizes in the range .

### 6.1 Probit RBM for Handwritten Digits

We use the name Probit RBM to denote the special case of where the observations are binary (i.e., boxed constraints, see Section 3.1). The threshold for each visible unit is chosen so that under the zero mean, the probability of generating a binary evidence equals the empirical probability, i.e., , and thus . Since any mismatch in thresholds can be corrected by shifting the corresponding biases, we do not need to update the thresholds further.

We report here the result of the mean-field method for computing data-dependent statistics, which are averaged over a random batch of images. For the data-independent statistics, persistent chains are run in parallel with samples collected after every Gibbs steps. The sparsity level is set to and the sparseness weight is set to . Once the model has been learned, mean-field is used to estimate the hidden posteriors. Typically this mean-field is quite fast as it converges in a few steps.

We take the data from MNIST and binarize the images using a mid-intensity threshold. The learned representation is shown in Figure 2. Most digits are well separated in 2D except for digits and . The learned representation can be used for classifications, e.g., by feeding to the multiclass logistic classifier. For hidden units, the Probit RBM achieves the error rate of %, comparable with those obtained by the RBM trained with CD- (%), and much better than the raw pixels (%). The features discovered by the Probit RBM and RBM with CD- are very different (Figure 1), and this is expected because they operate on different input representations. The energy surface learned by the Probit RBM is smooth enough to allow efficient exploration of modes, as shown in Figure 3.

### 6.2 Rank Evidences for Collaborative Filtering

In collaborative filtering, one of the main goals is to produce a personalized ranked list of items. Until very recently, the majority in the area, on the other hand, focused on predicting the ratings, which are then used for ranking items. It can be arguably more efficient to learn a rank model directly instead of going through the intermediate steps.

We build one for ranking with ties (i.e., inequality constraints, see Section 3.2) per user due to the variation in item choices but all the s share the same parameter set. The handling of ties is necessary because during training, many items share the same rating. Unseen items are simply not accounted for in each model: We only need to compare the utilities between the items seen by each person. The result is that the models are very sparse and fast to run. For the data-dependent statistics, we maintain one Markov chain per user. Since there is no single model for all data instances, the data-independent statistics cannot be estimated from a small set of Markov chains. Rather we also maintain a data-independent chain per data instance, which can be persistent on their own, or restarted from the data-dependent chains after every parameter updating step. The latter case, which is reported here, is in the spirit of the Hinton’s Contrastive Divergence, where the data-independent chain is just a few steps away from the data-dependent chain.

Once the model has been trained, the hidden posterior vector , where , is used as the new representation of the tastes of user . The rank of unseen movies is the mode of the distribution , where are the rank-based evidences (see Section 4.4). For fast computation of , we approximate the Gaussian by a Gumbel distribution, which leads to a simple way of ranking movies using the mean “utility” for user (see the Supplement for more details).

The data used in this experiment is the MovieLens, which contains M ratings by approximately K users on K movies. To encourage diversity in the rank lists, we remove the top % most popular movies. We then remove users with less than ratings on the remaining movies. The most recently rated movies per user are held out for testing, the next most recent movies are used for tuning hyper-parameters, and the rest for training.

For comparison, we implement a simple baseline using item popularity for ranking, and thus offering a naive non-personalized solution. For personalized alternatives, we implement two recent rank-based matrix factorisation methods, namely ListRank.MF [27] and PMOP [36]. Two ranking metrics from the information retrieval literature are used: the ERR [3] and the NDCG@T [13]. These metrics place more emphasis on the top-ranked items. Table 1 reports the movie ranking results on test subset (each user is presented with a ranked list of unseen movies), demonstrating that the is a clear winner in all metrics.

### 6.3 Mixed Evidences for World Attitude Analysis

Finally, we demonstrate the on mixed evidences. The data is from the survey analysis domain, which mostly consists of multiple questions of different natures such as basic facts (e.g., ages and genders) and opinions (e.g., binary choices, single choices, multiple choices, ordinal judgments, preferences and ranks). The standard approach to deal with such heterogeneity is to perform the so-called “coding”, which converts types into some numerical representations (e.g., ordinal scales into stars, ranks into multiple pairwise comparisons) so that standard processing tools can handle. However, this coding process breaks the structure in the data and thus significant information will be lost. Thus our offers a scalable and generic machinery to process the data in its native format and then convert the mixed types into a more homogeneous posterior vector.

We use the global attitude survey dataset collected by the PewResearch Centre777The datasets are publicly available from http://pewresearch.org/. The survey was conducted on people from countries during the period of March 17 – April 21, 2008 on a variety of topics concerning people’s life, opinions on issues in their countries and around the world as well as future expectations. There are binary, categorical (of variable category sizes), continuous, ordinal (of variable level sizes) question types.

Like the case of collaborative filtering, we build one per respondent due to the variation in questions and answers but all the s share the same parameter set. Unanswered/inappropriate questions are ignored. For each respondent, we maintain persistent and non-interacting Markov chains for the data-dependent statistics and the data-independent statistics, respectively.

Figure 4 shows the 2D distribution of respondents from countries obtained by feeding the posteriors to the t-SNE [38] (here no explicit information of countries is used). It is interesting to see the cultural/social clustering and gaps between countries as opposed to the geographical distribution (e.g., between Indonesia and Egypt, Australia and UK and the relative separation of the China, Pakistan, Turkey and the US from the rest). To predict the countries, we feed the posteriors into the standard multiclass logistic regression and achieve an error rate of , suggesting that the has captured the intrastate regularities and separated the interstate variations well.

## 7 Related Work

Latent multivariate Gaussian variables have been widely studied in statistical analysis, initially to model correlated binary data888This is often known as multivariate probit models. [1, 4] then now used for a variety of data types such as ordered categories [15], unordered categories [43], and the mixture of types [6]. Learning with the underlying Gaussian model is notoriously difficult for large-scale setting: independent sampling costs cubic time due to the need of inverting the covariance matrix, while MCMC techniques such as Gibbs sampling can be very slow if the graph is dense and the interactions between variables are strong. This can be partly overcome by adding one more layer of latent variables as in factor analysis [39, 14] and probabilistic principle component analysis [33]. The main difference from our is that those models are directed with continuous factors while ours is undirected with binary factors.

Gaussian RBMs have been used for modelling continuous data such as visual features [12], where the evidences are the value assignments, and thus a limiting case of our evidence system. Some restrictions to the continuous Boltzmann machines have been studied: In [5], Gaussian variables are assumed to be non-negative, and in [41], continuous variables are bounded. However, we do not make these restrictions on the model but rather placing restrictions during the training phase only. GRBMs that handle ordinal evidences have been studied in [35], which is an instance of the boxed-constraints in our .

## 8 Discussion and Conclusion

Since the underlying variables of the are Gaussian, various extensions can be made without much difficulty. For example, direct correlations among variables, regardless of their types, can be readily modelled by introducing the non-identity covariance matrix [22]. This is clearly a good choice for image modelling since nearby pixels are strongly correlated. Another situation is when the input units are associated with their own attributes. Each unit can be extended naturally by adding a linear combination of attributes to the mean structure of the Gaussian.

The additive nature of the mean-structure allows the natural extension to matrix modelling (e.g., see [37, 35]). That is, we do not distinguish the role of rows and columns, and thus each row and column can be modelled using their own hidden units (the row parameters and columns parameters are different). Conditioned on the row-based hidden units, we return to the standard for column vectors. Inversely, conditioned on the column-based hidden units, we have the for row vectors.

To sum up, we have proposed a generic class of models called Thurstonian Boltzmann machine () to unify many type-specific modelling problems and generalise them to the general problem of learning from multiple groups of inequalities. Our framework utilises the Gaussian restricted Boltzmann machines, but the Gaussian variables are never observed except for one limiting case. Rather, those variables are subject to inequality constraints whenever an evidence is observed. Under this representation, the supports a very wide range of evidences, many of which were not possible before in the Boltzmann machine literature, without the need to specify type-specific models. In particular, the supports any combination of the point assignments, intervals, censored values, binary, unordered categories, multi-categories, ordered categories, (in)-complete ranks with and without ties.

We demonstrated the on three applications of very different natures, namely handwritten digit recognition, collaborative filtering and complex survey analysis. The results are satisfying and the performance is competitive with those obtained by type-specific models.

## Appendix A Supplementary Material

### a.1 Inference

#### a.1.1 Estimating the Partition Function

For convenience, let us re-parameterise the distribution as follows

 ϕi(xi) = exp{−x2i2+αixi} (9) ψik(xi,hk) = exp{Wikxihk} ϕk(hk) = exp{γkhk}

The model potential is then the product of all local potentials

 (10)

The partition function can be rewritten as

 Z = ∑h∫xΨ(x,h)dx = ∑hΩ(h)

where . We now proceed to compute :

 Ω(h) = [∏kϕk(hk)]∫x[∏iϕi(xi)][∏ikψik(xi,hk)]dx = [∏kϕk(hk)]∏i∫xiexp{−x2i2+(αi+∑kWikhk)xi}dxi = [∏kϕk(hk)]∏iCi∫xiexp{−(xi−μi(h))22}dxi = [∏kϕk(hk)]∏iCi√2πσ2i

where

 μi(h) = αid+K∑k=1Wikhk Ci = exp⎧⎨⎩12(μi(h)σi)2⎫⎬⎭

Now we can define the distribution over the hidden layer as follows

 P(h)=1ZΩ(h)

Now we apply the Annealed Importance Sampling (AIS) [19]. The idea is to introduces the notion of inverse-temperature into the model, i.e., .

Let be the (slowly) increasing sequence of temperature, where and , that is . At , we have a uniform distribution, and at , we obtain the desired distribution. At each step , we draw a sample from the distribution (e.g. using some Metropolis-Hastings procedure). Let be the unnormalised distribution of , that is . The final weight after the annealing process is computed as

 ω =P∗(h1|τ1)P∗(h1|τ0)P∗(h2|τ2)P∗(h2|τ1)...P∗(hS|τS)P∗(hS|τS−1)

The above procedure is repeated times. Finally, the normalisation constant at is computed as where , which is the number of configurations of the hidden variables .

#### a.1.2 Estimating Posteriors using Mean-field

Recall that for evidence we want to estimate posteriors . Assume that the evidences can be expressed in term of boxed constraints, which lead to the following factorisation

 P(x∣e,h)=∏iP(xi∣e,h)

This factorisation is critical because it ensures that there are no deterministic constraints among , which are the conditions that variational methods such as mean-fields would work well. This is because mean-field solution will generally not satisfy deterministic constraints, and thus may assign non-zeros probability to improbably areas.

To be more concrete, the mean-field approximation would be

 Q(h,x) = ∏kQk(hk)∏iQi(xi) s.t.x ∈ Ω(e)

The best mean-field approximation will be the minimiser of the Kullback-Leibler divergence

 D(Q||P) = ∑h∑x∈Ω(e)Q(h,x)logQ(h,x)P(h,x∣e) (11) = −H[Q(h,x)]−∑h∑x∈Ω(e)Q(h,x)logP(h,x∣e)

where is the entropy function. Now first, exploit the fact that is factorisable, and thus its entropy is decomposable, i.e.,

 H[Q(h,x)]=∑kH[Qk(hk)]+∑iH[Qi(xi)] (12)

Second recall from Eq. (17) that

 P(h,x∣e)=1Z(e)exp{−E(x,h)}

and thus

 ∑h∑x∈Ω(e)Q(h,x)logP(h,x∣e)=−∑h∑x∈Ω(e)Q(h,x)E(x,h)−logZ(e)

Since is a constraint w.r.t. , we can safely ignore it here.

Now since is decomposable (see Eq. (2)), we have

 ∑h∑x∈Ω(e)Q(h,x)E(x,h) = ⎛⎝∑i∑xi∈Ω(ei)Qi(xi)Ei(xi)⎞⎠+⎛⎝∑k∑hkQk(hk)Ek(hk)⎞⎠+ +⎛⎝∑i∑k∑xi∈Ω(ei)∑hkQi(xi)Qk(hk)Eik(xi,hk)⎞⎠

where

 Ei(xi) = x2i2−αixi Ek(hk) = −γkhk Eik(xi,hk) = −Wikxihk

Combining this decomposition and Eq. (12), we have completely decomposed the Kullback-Leibler divergence in Eq. (11) into local terms:

 D(Q||P) = ∑i∑xi∈Ω(ei)Qi(xi)Ei(xi)+∑k∑hkQk(hk)Ek(hk)+ +⎛⎝∑i∑k∑xi∈Ω(ei)∑hkQi(xi)Qk(hk)Eik(xi,hk)⎞⎠−∑iH[Qi(xi)]−∑kH[Qk(hk)]

Now we wish to minimise the divergence with respect to the local distributions for and knowing the proper distribution constraints

 ∫xi∈Ω(ei)Qi(xi) = 1 ∑hkQk(hk) = 1

By the method of Lagrangian multiplier, we have

 L(λ)=D(Q||P)+∑iλi(∫xi∈Ω(ei)Qi(xi)−1)+∑kκk⎛⎝∑hkQk(hk)−1⎞⎠
• Let us compute the partial derivative w.r.t. :

 ∂Qi(xi)L(λ) = logQi(xi)+1+Ei(xi)+∑kQk(h1k)Eik(xi,h1k)+λi

where is a short hand for and we have made use of the fact that . Setting this gradient to zero yields

 Qi(xi) = exp{−(Ei(xi)+∑kQk(h1k)Eik(xi,h1k))−1−λi} (13) = exp⎧⎨⎩−12(xi−(αi+∑kQk(h1k)Wik))2−1−λi⎫⎬⎭

for . Normalising this distribution would lead to the truncated form of the normal distribution those the mean is

 μi=αi+∑kQk(h1k)Wikhk (14)
• In a similar way, the partial derivative w.r.t. would be

 ∂Qk(hk)L(λ)=logQk(hk)−hk⎛⎝γk+∑iWik∑xi∈Ω(ei)Qi(xi)xi⎞⎠+1+κk

Equating the gradient to zero, we have

 Qk(hk)∝exp{hk(γk+∑iWik^μi)}

where is the mean of the truncated normal distribution

 ^μi=∑xi∈Ω(ei)Qi(xi)xi

 Qk(h1k)=[1+exp{−γk−∑iWik^μi}]−1 (15)
• Finally, combining these findings in Eqs. (13,14,15), and letting be the boxed constraint, and using the fact that the mean of the truncated distribution is

 ^μi=μi+ϕ(bi−μi)−ϕ(ci−μi)Φ(ci−μi)−Φ(bi−μi)

we would arrive at the three recursive equations

 qk ← 11+exp{−γk−∑iWik^μi} μi ← αi+∑kWikqk ^μi ← μi+ϕ(bi−μi)−ϕ(ci−μi)Φ(ci−μi)−Φ(bi−μi)

where is a short hand for and is the normal probability density function, and is the cumulative distribution function

#### a.1.3 Seeking Modes and Generating Representative Samples

Once the model has been learned, samples can be generated straightforwardly by first sampling the underlying Gaussian RBM and then collect the true samples that satisfy the inequalities of interest. For example, for binary samples, if the generated Gaussian value for a visible unit is larger than the threshold, then we have an active sample. Likewise, rank samples, we only need to rank the sampled Gaussian values.

However, this may suffer from the poor mixing if we use standard Gibbs sampling, that is the Markov chain may get stuck in some energy traps. To jump out of the trap we propose to periodically raise the temperature to a certain level (e.g., ) and then slowly cool down to the original temperature (which is ). In our experiment, the cooling is scheduled as follows

 T←ηT

where is estimated so that for steps, the temperature will drop from to . That is, , leading to .

To locate a basis of attraction, we can lower the temperature further (e.g., to ) to trap the particles there. Then we collect successive samples and take the average to be the representative sample. In our experiments, .

### a.2 Learning

#### a.2.1 Gradient of the Likelihood

The log-likelihood of an evidence is

 L = logP(e)=log∑h∫Ω(e)P(h,x)dx = log∑h∫Ω(e)exp{−E(x,h)}dx−logZ = logZ(e)−logZ

 ∂WiklogZ(e) = −1Z(e)∑h∫Ω(e)exp{−E(x,h)}∂WikE(x,h)dx (16) = −∑h∫Ω(e)P(x,h∣e)∂WikE(x,h)dx

where we have moved the constant into the sum and integration and make use of the fact that

 P(x,h∣e)=P(x,h)P(e)=1Z(e)exp{−E(x,h)} (17)

where the domain of the Gaussian is constrained to .

From the definition of the energy function in Eq. (2), we know that the energy is decomposable, and thus the gradient w.r.t. only involves the pair . In particular

 ∂WikE(x,h)=−xihk

This simplifies Eq. (16)

 ∂WiklogZ(e) = −∑hi∫Ω(ei)P(xi,hk∣e)∂WikE(x,h)dxi = EP(x,h∣e)[xihk]

A similar process would lead to

 ∂WiklogZ=EP(xi,hk)[xihk]

and finally:

 ∂