Restricted Boltzmann Machines with Gaussian Visible Units Guided by Pairwise Constraints
Restricted Boltzmann machines (RBMs) and their variants are usually trained by contrastive divergence (CD) learning, but this training procedure is unsupervised and receives no guidance from background knowledge. To enhance the expressive ability of traditional RBMs, we propose the pairwise constraints restricted Boltzmann machine with Gaussian visible units (pcGRBM), in which the learning procedure is guided by pairwise constraints and encoding is conducted under this guidance. The pairwise constraints are encoded in the hidden layer features of pcGRBM. As a result, some pairs of hidden features are drawn together and others are pushed apart under this guidance. To deal with real-valued data, the binary visible units are replaced by linear units with Gaussian noise. During learning, the pairwise constraints are iterated through the transitions between visible and hidden units of the CD procedure. The proposed model is then inferred by an approximate gradient descent method, and the corresponding learning algorithm is designed. To compare the effectiveness of pcGRBM with that of traditional RBMs with Gaussian visible units, the hidden layer features of both models are used as input ‘data’ for the K-means, spectral clustering (SP) and affinity propagation (AP) algorithms. A thorough experimental evaluation is performed on sixteen image datasets from Microsoft Research Asia Multimedia (MSRA-MM). The results show that K-means, SP and AP based on the pcGRBM model cluster significantly better than the same algorithms based on traditional RBMs. In addition, pcGRBM-based clustering outperforms several semi-supervised clustering algorithms.
Hinton and Sejnowski proposed a learning algorithm for the general Boltzmann machine, which has hidden-to-hidden and visible-to-visible connections, but in practice it was too slow to be used. The restricted Boltzmann machine (RBM), proposed in 1986, has no lateral connections among the nodes in each layer, so its learning procedure is much more efficient than that of the general Boltzmann machine. There has been extensive research into the RBM since Hinton proposed fast learning algorithms based on contrastive divergence (CD). Several powerful and tractable deep networks have been proposed, including deep belief networks, deep autoencoders, deep Boltzmann machines and deep dropout neural nets. A large number of successful applications built on RBMs have appeared, e.g., classification, feature learning, facial recognition, collaborative filtering, topic modelling, speech recognition, natural language understanding, computer vision, dimensionality reduction, voice conversion, musical genre categorization, real-time key point recognition and periocular recognition.
The classic RBM has a strong ability to extract hidden features from the original data. Many researchers have proposed RBM variants and deep networks based on the classic RBM, e.g., the fuzzy restricted Boltzmann machine (FRBM), classification RBM, spike-and-slab restricted Boltzmann machine (ssRBM), Gaussian restricted Boltzmann machines (GRBMs), sparse restricted Boltzmann machine (SRBM), over-replicated softmax model, recurrent temporal restricted Boltzmann machines (RTRBMs), circle convolutional restricted Boltzmann machine (CCRBM), adaptive restricted Boltzmann machine, relevance restricted Boltzmann machine (ReRBM), theta-restricted Boltzmann machine (theta-RBM), disjunctive factored four-way conditional restricted Boltzmann machine (DFFW-CRBM), centered convolutional restricted Boltzmann machines (CCRBM), social restricted Boltzmann machine (SRBM), temperature based restricted Boltzmann machines (TRBMs) and deep feature coding architecture.
However, since the learning procedures of the classic RBM and its variants are unsupervised, their feature extraction is non-directional and conducted under no guidance. To remedy this weakness, this paper proposes a pairwise constraints restricted Boltzmann machine with Gaussian visible units (pcGRBM) and a corresponding learning algorithm, in which the learning procedure is guided by pairwise constraints derived from labels. In the pcGRBM model, the pairwise constraints, which are instance-level prior knowledge, guide the encoding process: some pairs of hidden features are drawn together and others are pushed apart under this guidance, so feature extraction is no longer non-directional. The background knowledge carried by the instance-level pairwise constraints is thereby encoded in the hidden layer features of pcGRBM. To verify the effectiveness of pcGRBM, we design three clustering structures in which the hidden layer features of the pcGRBM are used as input ‘data’ for unsupervised clustering algorithms. The experimental results show that K-means, SP and AP based on the pcGRBM model cluster significantly better than the same algorithms based on traditional RBMs. In addition, pcGRBM-based clustering outperforms several semi-supervised algorithms (Cop-Kmeans, semi-supervised spectral clustering (Semi-SP) and semi-supervised affinity propagation (Semi-AP)).
The remainder of this paper is organized as follows. Section II outlines the related work, and Section III provides the preliminaries, which include pairwise constraints, the RBM and Gaussian visible units. The proposed pcGRBM model and its learning algorithm are introduced in Section IV. The performance of the pcGRBM model is then evaluated on clustering tasks on MSRA-MM in Section V. Finally, Section VI summarizes our contributions.
2 Related Work
Owing to its outstanding performance, more and more variants of the RBM have been proposed. Common ways of extending the standard RBM include adding connection information between the visible units and the hidden units, changing the value type of the visible or hidden units, relaxing the relationships between visible and hidden units from constants to variables by means of fuzzy mathematics, and constructing deep autoencoder-based networks with pairwise constraints.
Adding connection information between the visible units is one way of extending the standard RBM. Osindero and Hinton proposed the semi-restricted Boltzmann machine (SRBM), which has lateral connections between the visible units; these lateral connections, however, are unit-level semi-supervised information. Its learning procedure has two stages: the first learns the visible-to-hidden connections in the same way as a classic RBM, and the second applies the same learning procedure to the lateral connections. To enforce hidden units to be pairwise uncorrelated and to maximize entropy, Tomczak proposed adding a penalty term to the log-likelihood function. His framework for learning informative features is unit-level pairwise and targets classification, whereas our model is instance-level pairwise and targets clustering. Zhang et al. built a deep belief network based on the SRBM for classification. Given the hidden units, the visible units of the SRBM form a Markov random field. The main weakness of the SRBM is that it has a massive number of parameters for high-dimensional data if every pair of visible units is related. Sutskever and Hinton proposed the temporal restricted Boltzmann machine (TRBM) by adding directed connections between the previous and current states of the visible and hidden units. The full TRBM has three kinds of connections: connections between the visible units, connections between the hidden and visible units, and connections between the hidden units. They further proposed the recurrent TRBM (RTRBM), for which it is easy to compute the gradient of the log-likelihood and to perform exact inference. Mnih and Hinton proposed conditional restricted Boltzmann machines (CRBMs) by adding a conditioning vector that determines increments to the biases of the visible and hidden layers of the traditional RBM.
Changing the hidden units is another way of extending the standard RBM. Courville et al. developed the spike-and-slab restricted Boltzmann machine (ssRBM), in which each hidden unit is associated with the product of a binary “spike” latent variable and a real-valued “slab” latent variable. As a model of natural images, the ssRBM encodes the conditional covariance of the visible units through the real-valued slab variables while its binary hidden units keep the simple conditional independence structure, which preserves learning efficiency.
In general, the relationships between the units of the visible layer and the hidden layer are restricted to be constants. To overcome this restriction, Chen et al. proposed a fuzzy restricted Boltzmann machine (FRBM) to enhance deep learning capability. In the FRBM, the model parameters are replaced by fuzzy numbers and the regular RBM energy function is replaced by a fuzzy free energy function. Deep networks have also been built from fuzzy RBMs to boost deep learning. Nie et al. proposed a theoretical extension of conventional RBMs that introduces another term in the energy function to explicitly model the local spatial interactions in the input data.
The conventional RBM defines the units of the visible and hidden layers to be binary, but this limitation does not meet practical needs. One common remedy is to replace the binary visible units by Gaussian linear units, which gives Gaussian-Bernoulli restricted Boltzmann machines (GBRBMs). GBRBMs are able to learn meaningful features both in modeling natural images and in a two-dimensional separation task, but they are known to be difficult to train. Cho et al. therefore proposed a method to improve their learning efficiency. It has three parts: a different parameterization of the energy function that facilitates learning, parallel tempering learning, and an adaptive learning rate. Moreover, the Gaussian-Bernoulli deep Boltzmann machine (GDBM) has been developed from the GBRBM in recent years. The GDBM adds multiple layers of hidden units and is applied to continuous data.
Furthermore, Zhang et al. proposed a mixed model named supervision guided autoencoder (SUGAR), which includes three components: a main network, an auxiliary network and a bridge. The main network is a sparsity-encouraging variant of the autoencoder and is trained without supervision. The auxiliary network is constructed from pairwise constraints and carries the supervised information. The two heterogeneous networks thus encode the unsupervised and the supervised data structure, respectively. They are connected by the bridge, which enforces the correlation of their parameters. Compared with supervised learning and supervised deep networks, SUGAR uses supervised information more flexibly and better balances numerical tractability.
Chen proposed a deep network structure based on RBMs, which is the work most related to ours. Both works aim to solve similar problems, namely how to obtain features suitable for clustering by a non-linear mapping and how to use pairwise constraints during the learning process, but the models and the solutions differ. That work uses RBMs to initialize the connection weights with CD learning, so its learning process is still unsupervised; the learned weights are then used to incorporate pairwise constraints in the feature space by maximum margin techniques. In contrast, our pcGRBM model is based on RBMs with Gaussian visible units, and its learning process is no longer unsupervised but guided by pairwise constraints.
In this section, the background of the pairwise constraints, RBM and Gaussian visible units is briefly summarized.
3.1 Pairwise Constraints
Prior knowledge in the form of pairwise constraints is widely used in supervised and semi-supervised learning. There are two types of instance-level pairwise constraints: cannot-link constraints, which specify that two instances should not be grouped together, and must-link constraints, which specify that two instances should be grouped together. Both define binary relations at the instance level, and the must-link relation is transitive. The two types of constraints may be derived from background knowledge about the data set or from labeled data. In this paper, we select labeled data from the different groups at random, ensuring that the same ratio of labeled data is selected from each group. The must-link constraints are then produced from selected instances of the same group, and the cannot-link constraints from selected instances of different groups.
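The selection procedure described above can be sketched in Python as follows. This is a minimal illustration only: the function name `sample_pairwise_constraints`, the `ratio` and `seed` parameters, and the rule of keeping at least one instance per group are assumptions made for the example and are not taken from the paper.

```python
import random
from itertools import combinations


def sample_pairwise_constraints(labels, ratio, seed=0):
    """Select the same fraction of labelled instances from every group and
    turn the selected instances into must-link / cannot-link index pairs."""
    rng = random.Random(seed)
    by_group = {}
    for idx, lab in enumerate(labels):
        by_group.setdefault(lab, []).append(idx)

    selected = []
    for members in by_group.values():
        # Assumption: keep at least one instance per group.
        k = max(1, round(ratio * len(members)))
        selected.extend(rng.sample(members, k))

    must_link, cannot_link = [], []
    for s, t in combinations(sorted(selected), 2):
        if labels[s] == labels[t]:
            must_link.append((s, t))    # same group
        else:
            cannot_link.append((s, t))  # different groups
    return must_link, cannot_link


# Toy example: three groups of ten instances, 20% of each group selected.
labels = [0] * 10 + [1] * 10 + [2] * 10
ml, cl = sample_pairwise_constraints(labels, ratio=0.2)
print(len(ml), len(cl))
```

With two instances selected from each of three groups, the six selected instances yield one must-link pair per group and twelve cannot-link pairs.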
3.2 Restricted Boltzmann Machine
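An RBM is a two-layer network in which the first layer consists of visible units and the second layer consists of hidden units. Symmetric undirected weights connect the visible and hidden layers. There are no intra-layer connections among either the visible units or the hidden units. A classic RBM model is shown in Fig. 1. The energy function of a joint configuration (v, h) of the visible layer and the hidden layer is given by:
$$E(v,h) = -\sum_{i=1}^{n} a_i v_i - \sum_{j=1}^{m} b_j h_j - \sum_{i=1}^{n}\sum_{j=1}^{m} v_i h_j w_{ij} \quad (1)$$
where $v$ and $h$ are the visible and the hidden vectors, $a_i$ and $b_j$ are their biases, $n$ and $m$ are the dimensions of the visible layer and the hidden layer, respectively, and $w_{ij}$ is an element of the connection weight matrix $W$ between the visible layer and the hidden layer. A probability distribution over vectors v and h is defined as
$$p(v,h) = \frac{1}{Z} e^{-E(v,h)} \quad (2)$$
where $Z$ is a “partition function”, defined by summing over all possible pairs of hidden and visible vectors:
$$Z = \sum_{v,h} e^{-E(v,h)} \quad (3)$$
By summing over all the units of the hidden layer, the probability that the RBM assigns to a visible vector v is given by:
$$p(v) = \frac{1}{Z} \sum_{h} e^{-E(v,h)} \quad (4)$$
The partial derivative of the log probability of Eq. (4) with respect to a weight is given by
$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model} \quad (5)$$
where the angle brackets $\langle \cdot \rangle_{data}$ and $\langle \cdot \rangle_{model}$ denote expectations under the distribution specified by the subscript. This leads to a very simple learning rule for performing stochastic steepest ascent in the log probability:
$$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model} \right) \quad (6)$$
where $\varepsilon$ is a learning rate.
It is easy to get an unbiased sample of $\langle v_i h_j \rangle_{data}$ because there are no direct connections among the hidden units. However, it is difficult to get an unbiased sample of $\langle v_i h_j \rangle_{model}$. Hinton proposed a faster learning algorithm with CD learning, in which the changes of the learning parameters are given by:
$$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{recon} \right) \quad (7)$$
$$\Delta a_i = \varepsilon \left( \langle v_i \rangle_{data} - \langle v_i \rangle_{recon} \right) \quad (8)$$
$$\Delta b_j = \varepsilon \left( \langle h_j \rangle_{data} - \langle h_j \rangle_{recon} \right) \quad (9)$$
where $\langle \cdot \rangle_{recon}$ can be computed much more efficiently than $\langle \cdot \rangle_{model}$.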
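As a rough illustration of the CD update, the following Python sketch performs one CD-1 step for a single binary training vector. The function names (`hidden_probs`, `visible_probs`, `cd1_step`), the use of probabilities rather than sampled states for the reconstruction and for the hidden statistics, and the learning rate value are assumptions made for this example, not details taken from the paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, b):
    # p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i * w_ij)
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]


def visible_probs(h, W, a):
    # p(v_i = 1 | h) = sigmoid(a_i + sum_j h_j * w_ij)
    return [sigmoid(a[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(a))]


def cd1_step(v0, W, a, b, eps, rng):
    """One CD-1 update of W, a and b (in place) for one binary vector v0."""
    ph0 = hidden_probs(v0, W, b)
    h0 = [1.0 if rng.random() < p else 0.0 for p in ph0]  # sampled hidden states
    v1 = visible_probs(h0, W, a)   # reconstruction (probabilities, an assumption)
    ph1 = hidden_probs(v1, W, b)
    for i in range(len(a)):
        for j in range(len(b)):
            # <v_i h_j>_data - <v_i h_j>_recon
            W[i][j] += eps * (v0[i] * ph0[j] - v1[i] * ph1[j])
    for i in range(len(a)):
        a[i] += eps * (v0[i] - v1[i])
    for j in range(len(b)):
        b[j] += eps * (ph0[j] - ph1[j])
    return v1


rng = random.Random(0)
W = [[0.0, 0.0], [0.0, 0.0]]   # 2 visible units x 2 hidden units
a, b = [0.0, 0.0], [0.0, 0.0]
cd1_step([1.0, 0.0], W, a, b, eps=0.1, rng=rng)
print(W, a, b)
```

Starting from zero parameters, every conditional probability equals 0.5, so the update depends only on the training vector: weights attached to the active visible unit increase and those attached to the inactive unit decrease.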
3.3 Gaussian Visible Units
Original RBMs were developed with binary stochastic units for both the hidden and the visible layers. To deal with real-valued data such as natural images, one solution is to replace the binary visible units by linear units with independent Gaussian noise while the hidden units remain binary. The negative log probability is then given by the following energy function:
$$E(v,h) = \sum_{i=1}^{n} \frac{(v_i - a_i)^2}{2\sigma_i^2} - \sum_{j=1}^{m} b_j h_j - \sum_{i=1}^{n}\sum_{j=1}^{m} \frac{v_i}{\sigma_i} h_j w_{ij}$$
where $\sigma_i$ is the standard deviation of the Gaussian noise for visible unit $i$.
It is possible to learn the variance of the noise for each visible unit, but this is difficult with CD learning because it takes a long time. Therefore, in many applications it is easier to normalise the data to have zero mean and unit variance. The reconstructed value of a Gaussian visible unit is then equal to its input from the binary hidden units plus its bias.
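The normalisation and the noise-free reconstruction just described can be sketched as follows. This is a minimal example under the unit-variance assumption stated in the text; the function names `standardize` and `reconstruct_visible` and the toy numbers are hypothetical.

```python
import math


def standardize(data):
    """Scale every visible dimension to zero mean and unit variance."""
    n, d = len(data), len(data[0])
    means = [sum(row[i] for row in data) / n for i in range(d)]
    # A constant dimension has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((row[i] - means[i]) ** 2 for row in data) / n) or 1.0
            for i in range(d)]
    return [[(row[i] - means[i]) / stds[i] for i in range(d)] for row in data]


def reconstruct_visible(h, W, a):
    """Noise-free Gaussian visible units with sigma_i = 1:
    v_i = a_i + sum_j h_j * w_ij (input from the hidden units plus bias)."""
    return [a[i] + sum(h[j] * W[i][j] for j in range(len(h)))
            for i in range(len(a))]


data = [[1.0, 10.0], [3.0, 30.0]]
normalized = standardize(data)
print(normalized)

W = [[0.5, -1.0], [2.0, 0.0]]   # 2 visible units x 2 hidden units
recon = reconstruct_visible([1.0, 0.0], W, a=[0.1, 0.2])
print(recon)
```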
4 pcGRBM Model and Its Learning Algorithm
We first propose the pairwise constraints restricted Boltzmann machine with Gaussian visible units (pcGRBM) model, in which the binary visible units are replaced by noise-free linear units and the learning procedure is guided by pairwise constraints. We then give the exact inference of the pcGRBM optimization. Finally, the corresponding learning algorithm is presented.
4.1 pcGRBM Model
Suppose that is a -dimensional original data set which has been normalized, is a dimensional hidden code. The pairwise must-link constraints set of the reconstruction data is defined by , belongs to the same class and pairwise cannot-link constraints set of the reconstruction data is given by , belongs to the different classes.
In training the parameters of the pcGRBM model, the first objective is to maximize the log probability of the RBM with Gaussian visible units, and the second objective is to maximize the distance between all pairs of vectors in the cannot-link set and to minimize the distance between all pairs of vectors in the must-link set, both measured in the reconstructed visible layer. Because the model uses noise-free reconstruction, the reconstructed value of a Gaussian visible linear unit is equal to its input from the hidden units plus its bias. The objective function is then given by
where are the model parameters, is a scale coefficient, and are the cardinality of the must-link pairwise constraints set and the cannot-link pairwise constraints set , respectively, is the average of the log-likelihood and is the square of 2-norm.
The learning problem of the pcGRBM model is to get optimal or approximate optimal parameters , which minimize the objective function , i.e.,
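For orientation, one plausible form of an objective that matches the verbal description above is written out below. It is a sketch under stated assumptions, not the paper's own equation: the symbols $\theta = \{W, a, b\}$, $\eta$, $N$, $N_M$, $N_C$, $M$, $C$ and $\hat{v}$, and in particular the weighting of the two parts by $\eta$ and $1-\eta$, are chosen here for illustration only.

```latex
% Assumed notation: v_1..v_N are the normalized inputs, \hat{v}_s is the
% noise-free reconstruction of v_s, M and C are the must-link and cannot-link
% sets with cardinalities N_M and N_C, and \eta \in [0,1] is a scale coefficient.
\begin{align*}
L(\theta) &= -\frac{\eta}{N}\sum_{n=1}^{N} \log p(v_n;\theta)
  + (1-\eta)\left[
      \frac{1}{N_M}\sum_{(s,t)\in M} \lVert \hat{v}_s - \hat{v}_t \rVert^2
    - \frac{1}{N_C}\sum_{(s,t)\in C} \lVert \hat{v}_s - \hat{v}_t \rVert^2
    \right], \\
\theta^{*} &= \arg\min_{\theta} L(\theta).
\end{align*}
```

Minimizing the first term raises the average log-likelihood, while minimizing the bracketed term shrinks must-link distances and enlarges cannot-link distances in the reconstructed visible layer.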
4.2 pcGRBM Inference
For our first objective, we can use gradient descent to solve the optimization problem; however, it is expensive to compute the gradient of the log probability. Recently, Karakida et al. demonstrated that CD learning is simpler than ML learning in RBMs with Gaussian linear units. We therefore apply the CD learning method to obtain an approximation of the log probability gradient. For our second objective, we use gradient descent to solve the optimization problem. The main remaining work is to compute the corresponding gradient.
Firstly, we assume that
Then, the gradients of the is
and the gradients of the is
In order to express concisely, we suppose that
where , and is the dimension of the hidden layer.
Then, the gradient of the takes the form
In like manner, the gradient of the takes the form
So, the gradient of the objective function is as follows.
It is obvious that , , and . So, in the pcGRBM model, we use Eq. (8) and Eq. (9) to update the biases and .
Finally, the update rules for the connection weights of the pcGRBM model take the form
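To make the constraint part of the gradient concrete, the sketch below computes the gradient of one squared pairwise distance with respect to the weights and checks it by finite differences. It assumes that the hidden codes `hs` and `ht` are held fixed during differentiation, so that the visible bias cancels and the difference of the reconstructions is $W(h_s - h_t)$; this assumption, the function names and the toy numbers are illustrative and not taken from the paper.

```python
def recon_diff(W, hs, ht):
    # v_hat_s - v_hat_t = W (h_s - h_t); the visible bias a cancels out.
    return [sum(W[i][j] * (hs[j] - ht[j]) for j in range(len(hs)))
            for i in range(len(W))]


def pair_sq_dist(W, hs, ht):
    # Squared 2-norm of the difference between the two reconstructions.
    return sum(d * d for d in recon_diff(W, hs, ht))


def pair_grad_W(W, hs, ht):
    # d/dw_ij ||W (h_s - h_t)||^2 = 2 * (v_hat_si - v_hat_ti) * (h_sj - h_tj)
    d = recon_diff(W, hs, ht)
    return [[2.0 * d[i] * (hs[j] - ht[j]) for j in range(len(hs))]
            for i in range(len(W))]


# Finite-difference check on a toy example (2 visible units, 3 hidden units).
W = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.7]]
hs, ht = [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]
grad = pair_grad_W(W, hs, ht)
step = 1e-6
max_err = 0.0
for i in range(len(W)):
    for j in range(len(W[0])):
        Wp = [row[:] for row in W]
        Wm = [row[:] for row in W]
        Wp[i][j] += step
        Wm[i][j] -= step
        numeric = (pair_sq_dist(Wp, hs, ht) - pair_sq_dist(Wm, hs, ht)) / (2 * step)
        max_err = max(max_err, abs(numeric - grad[i][j]))
print(max_err)
```

In a descent step, this gradient would be subtracted for must-link pairs (pulling the reconstructions together) and added for cannot-link pairs (pushing them apart).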
4.3 pcGRBM Learning Algorithm
According to the above inference, the learning algorithm for pcGRBM is summarized as follows.
Algorithm 1 Learning for pcGRBM
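The following Python sketch shows one way a learning loop of this kind could be organised: CD-1 statistics for the Gaussian-visible RBM plus a pairwise-constraint gradient on the weights. It is an illustration under several assumptions rather than a transcription of Algorithm 1: unit-variance noise-free visible units, hidden codes held fixed when differentiating the constraint term, a mixing coefficient `eta` that weights the two parts, full-batch updates, and all function and parameter names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, b):
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]


def reconstruct(h, W, a):
    # Noise-free Gaussian visible units with unit variance.
    return [a[i] + sum(h[j] * W[i][j] for j in range(len(h)))
            for i in range(len(a))]


def train_pcgrbm(data, must_link, cannot_link, n_hidden,
                 eta=0.5, eps=0.01, epochs=20, seed=0):
    rng = random.Random(seed)
    n_vis, n = len(data[0]), len(data)
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_vis)]
    a, b = [0.0] * n_vis, [0.0] * n_hidden
    for _ in range(epochs):
        dW = [[0.0] * n_hidden for _ in range(n_vis)]
        da, db = [0.0] * n_vis, [0.0] * n_hidden
        codes = []
        # First objective: CD-1 approximation of the log-likelihood gradient.
        for v0 in data:
            ph0 = hidden_probs(v0, W, b)
            codes.append(ph0)
            h0 = [1.0 if rng.random() < p else 0.0 for p in ph0]
            v1 = reconstruct(h0, W, a)
            ph1 = hidden_probs(v1, W, b)
            for i in range(n_vis):
                da[i] += eta * (v0[i] - v1[i]) / n
                for j in range(n_hidden):
                    dW[i][j] += eta * (v0[i] * ph0[j] - v1[i] * ph1[j]) / n
            for j in range(n_hidden):
                db[j] += eta * (ph0[j] - ph1[j]) / n
        # Second objective: shrink must-link and grow cannot-link distances
        # between reconstructions (hidden codes treated as constants).
        for pairs, sign in ((must_link, 1.0), (cannot_link, -1.0)):
            for s, t in pairs:
                dh = [codes[s][j] - codes[t][j] for j in range(n_hidden)]
                d = [sum(W[i][j] * dh[j] for j in range(n_hidden))
                     for i in range(n_vis)]
                for i in range(n_vis):
                    for j in range(n_hidden):
                        dW[i][j] -= (1.0 - eta) * sign * 2.0 * d[i] * dh[j] / len(pairs)
        for i in range(n_vis):
            a[i] += eps * da[i]
            for j in range(n_hidden):
                W[i][j] += eps * dW[i][j]
        for j in range(n_hidden):
            b[j] += eps * db[j]
    return W, a, b


# Toy usage: four normalized 2-D points, two groups.
data = [[-1.0, -1.0], [-0.9, -1.1], [1.0, 1.0], [1.1, 0.9]]
W, a, b = train_pcgrbm(data, must_link=[(0, 1), (2, 3)],
                       cannot_link=[(0, 2), (1, 3)], n_hidden=3)
features = [hidden_probs(v, W, b) for v in data]  # hidden codes for clustering
print(features)
```

The hidden-layer probabilities returned at the end play the role of the features that are passed to a clustering algorithm. Because the cannot-link term rewards ever larger distances, a practical implementation would likely need a small learning rate or some form of weight control, which this sketch omits.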
| alphabet | 0.4144 ± 0.0066 | 0.4171 ± 0.0004 | 0.4042 ± 0.0000 | 0.4047 ± 0.0385 | 0.3748 ± 0.0230 | 0.4221 ± 0.1144 | 0.4102 ± 0.0360 | 0.3940 ± 0.1413 | 0.4290 ± 0.0935 | 0.4195 ± 0.0503 | 0.4300 ± 0.0375 | 0.4303 ± 0.0415 |
5 Results and Discussion
In this section, we introduce the datasets, define the experimental setup, and discuss the experimental results.
We used the Microsoft Research Asia Multimedia (MSRA-MM) collection, which contains two sub-datasets: a video dataset and an image dataset. The image part contains 1,011,738 images and the video part contains 23,517 videos. To evaluate our pcGRBM model, we use 16 image datasets (alphabet, ambulances, bed, beret, beverage, bike, billiard, blog, blood, bonsai, book, bread, breakfast, building, vegetable and virus) from the image part. A summary of the datasets is listed in Table I.
5.2 Experimental Setup
The goal of the experiments is to study the following aspects:
Do the pairwise constraints guide the encoding procedure of the traditional RBM?
How do unsupervised clustering algorithms based on the pcGRBM model compare with their semi-supervised counterparts?
How do unsupervised clustering algorithms based on the pcGRBM model compare with the same algorithms based on the traditional RBM?
To verify whether the features of pcGRBM contain guiding information, we use the output of pcGRBM as the input of an unsupervised clustering algorithm. In our experiments, we choose the K-means, affinity propagation (AP) and SP clustering algorithms as examples. We thus present three algorithms based on the pcGRBM model for the clustering task, termed Kmeans.pcgrbm, AP.pcgrbm and SP.pcgrbm; their structures are shown in Fig. 5. Similarly, we present three algorithms based on the traditional RBM with Gaussian visible units, called Kmeans.grbm, AP.grbm and SP.grbm. Note that Kmeans.pcgrbm, AP.pcgrbm and SP.pcgrbm are in effect semi-supervised clustering algorithms guided by instance-level pairwise constraints, whereas Kmeans.grbm, AP.grbm and SP.grbm are unsupervised methods.
Firstly, we compare the clustering performance of the proposed algorithms (Kmeans.pcgrbm, SP.pcgrbm and AP.pcgrbm) with that of the original K-means, SP and AP clustering algorithms, respectively. Secondly, the proposed algorithms are compared with Cop-Kmeans, Semi-SP and Semi-AP, respectively. Finally, we compare the proposed algorithms with the unsupervised algorithms Kmeans.grbm, SP.grbm and AP.grbm.
5.3.1 The pcGRBM for Clustering VS Unsupervised Algorithms
In this section, we compare the unsupervised K-means, SP and AP clustering algorithms with Kmeans.pcgrbm, SP.pcgrbm and AP.pcgrbm, which are based on the pcGRBM, in terms of average accuracy, average rank and average purity. From Table II, the average accuracies of K-means, SP and AP are 43.78%, 39.97% and 42.31%, respectively, whereas those of Kmeans.pcgrbm, SP.pcgrbm and AP.pcgrbm rise to 47.48%, 47.13% and 47.39%. The average ranks of K-means, SP and AP, shown in Table III, are 97.5625, 154.0625 and 124.4375, respectively, whereas those of Kmeans.pcgrbm, SP.pcgrbm and AP.pcgrbm fall to 33.1875, 36.0625 and 32.8125. A smaller rank value indicates a better algorithm. From Table IV, the average purities of K-means, SP and AP are 0.7703, 0.7721 and 0.7772, respectively, whereas those of Kmeans.pcgrbm, SP.pcgrbm and AP.pcgrbm rise to 0.8010, 0.8012 and 0.8011. A greater purity indicates a better algorithm. On all three measures, clustering with the pcGRBM is better than the original unsupervised clustering.
From the last three columns of Table II, the variances of Kmeans.pcgrbm, SP.pcgrbm and AP.pcgrbm fluctuate more than those of the other algorithms because of the effect of the pairwise constraints.
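For readers unfamiliar with the purity measure used above, a small Python sketch is given below. It assumes the standard definition (each cluster is credited with its most frequent true class, and the credited counts are summed and divided by the number of instances); the function name and the toy labels are hypothetical, and the paper's exact evaluation code is not shown in the text.

```python
from collections import Counter


def purity(cluster_ids, class_labels):
    """Fraction of instances that belong to the majority class of their cluster."""
    per_cluster = {}
    for c, y in zip(cluster_ids, class_labels):
        per_cluster.setdefault(c, Counter())[y] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(class_labels)


# Two clusters of three instances each; each cluster has a 2-to-1 majority.
score = purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "a"])
print(score)
```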
5.3.2 The pcGRBM for Clustering VS Semi-supervised Algorithms
In this section, we further compare the semi-supervised clustering algorithms Cop-Kmeans, Semi-SP and Semi-AP with Kmeans.pcgrbm, SP.pcgrbm and AP.pcgrbm in terms of average accuracy, average rank and average purity. The comparison of average accuracy is also shown in Figs. 2-4. From Table II, the average accuracies of Cop-Kmeans, Semi-SP and Semi-AP are 43.85%, 40.26% and 42.53%, respectively. The pcGRBM raises the average accuracies by 3.98%, 6.87% and 5.09%, respectively. From Table III, the average ranks of Cop-Kmeans, Semi-SP and Semi-AP are 95.2500, 151.6250 and 121.6250, respectively, whereas those of Kmeans.pcgrbm, SP.pcgrbm and AP.pcgrbm are 33.1875, 36.0625 and 32.8125. A smaller rank value indicates a better algorithm. The average purities of Cop-Kmeans, Semi-SP and Semi-AP, shown in Table IV, are 0.7742, 0.7788 and 0.7753, respectively. On all three measures, the pcGRBM for clustering is better than the semi-supervised clustering algorithms.
We plot the experimental results as the percentage of pairwise constraints increases from 1% to 10% in steps of 1%, for Cop-Kmeans and Kmeans.pcgrbm in Fig. 2, Semi-SP and SP.pcgrbm in Fig. 3, and Semi-AP and AP.pcgrbm in Fig. 4. Figs. 2-4 show that the accuracies of Kmeans.pcgrbm, SP.pcgrbm and AP.pcgrbm do not increase monotonically with the percentage of pairwise constraints; nevertheless, their average accuracies are higher than those of Cop-Kmeans, Semi-SP and Semi-AP, respectively.
5.3.3 The pcGRBM VS RBM with Gaussian Visible Units for Clustering
Both the pcGRBM and the RBM with Gaussian visible units are able to extract features, but which one performs better on the clustering task? To compare the representation capability of the pcGRBM with that of the RBM without any guidance from pairwise constraints, we design a clustering structure in which the features of the RBM with Gaussian visible units are used as the input of unsupervised clustering. In our experiments, the three clustering algorithms based on this structure, termed Kmeans.grbm, SP.grbm and AP.grbm, are compared with Kmeans.pcgrbm, SP.pcgrbm and AP.pcgrbm. From Table II, the average accuracies of Kmeans.grbm, SP.grbm and AP.grbm are 43.21%, 43.87% and 43.11%, respectively, and Kmeans.pcgrbm, SP.pcgrbm and AP.pcgrbm raise them by 4.27%, 3.26% and 4.28%. The average ranks of Kmeans.grbm, SP.grbm and AP.grbm, shown in Table III, are 105.5625, 96.9375 and 108.8750, respectively, whereas those of Kmeans.pcgrbm, SP.pcgrbm and AP.pcgrbm fall to 33.1875, 36.0625 and 32.8125. Table IV shows the average purities of Kmeans.grbm, SP.grbm and AP.grbm, which are 0.7831, 0.7853 and 0.7837, respectively. On all three measures, the pcGRBM is better than the RBM for clustering.
5.3.4 The Rank
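We compare twelve algorithms on sixteen data sets by means of the Aligned Friedman test statistic, which is given by
$$T = \frac{(k-1)\left[\sum_{j=1}^{k} \hat{R}_{\cdot j}^{2} - \frac{k n^{2}}{4}(kn+1)^{2}\right]}{\frac{kn(kn+1)(2kn+1)}{6} - \frac{1}{k}\sum_{i=1}^{n} \hat{R}_{i \cdot}^{2}}$$
where $\hat{R}_{\cdot j}$ is the rank sum of the $j$th algorithm, $\hat{R}_{i \cdot}$ is the rank sum of the $i$th data set, $n$ is the number of data sets and $k$ is the number of algorithms.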
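The computation of the Aligned Friedman statistic can be sketched in Python as follows. The sketch assumes the usual procedure: each score is aligned by subtracting the mean score of its data set, all aligned scores are ranked jointly (rank 1 for the best, average ranks for exact ties), and the rank totals are inserted into the statistic. The function name `aligned_friedman` and the toy scores are hypothetical.

```python
def aligned_friedman(scores):
    """scores[i][j]: performance of algorithm j on data set i (higher is better).
    Returns the statistic T and the average aligned rank of each algorithm."""
    n, k = len(scores), len(scores[0])
    aligned = []
    for i, row in enumerate(scores):
        mean = sum(row) / k                      # location of data set i
        aligned.extend((x - mean, i, j) for j, x in enumerate(row))
    aligned.sort(key=lambda item: -item[0])      # rank 1 = largest aligned value

    ranks = [[0.0] * k for _ in range(n)]
    pos = 0
    while pos < len(aligned):
        end = pos
        while end + 1 < len(aligned) and aligned[end + 1][0] == aligned[pos][0]:
            end += 1
        avg_rank = (pos + end) / 2.0 + 1.0       # average rank over a tie group
        for _, i, j in aligned[pos:end + 1]:
            ranks[i][j] = avg_rank
        pos = end + 1

    alg_tot = [sum(ranks[i][j] for i in range(n)) for j in range(k)]
    set_tot = [sum(ranks[i]) for i in range(n)]
    num = (k - 1) * (sum(r * r for r in alg_tot)
                     - (k * n * n / 4.0) * (k * n + 1) ** 2)
    den = (k * n * (k * n + 1) * (2 * k * n + 1) / 6.0
           - sum(r * r for r in set_tot) / k)
    return num / den, [r / n for r in alg_tot]


# Toy example: 2 data sets, 2 algorithms.
T, avg_ranks = aligned_friedman([[0.9, 0.5], [0.8, 0.6]])
print(T, avg_ranks)
```

The statistic is then compared with a chi-square distribution with $k-1$ degrees of freedom.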
In our experiments, all pairwise constraints come from label information. We choose 1% to 10% pairwise constraints in steps of 1%. A smaller average rank indicates a better algorithm. As Table III shows, the average ranks of the K-means, SP, AP, Cop-Kmeans, Semi-SP, Semi-AP, Kmeans.grbm, SP.grbm, AP.grbm, Kmeans.pcgrbm, SP.pcgrbm and AP.pcgrbm algorithms are 97.5625, 154.0625, 124.4375, 95.2500, 151.6250, 121.6250, 105.5625, 96.9375, 108.8750, 33.1875, 36.0625 and 32.8125, respectively. The smallest average rank, 32.8125, belongs to the AP.pcgrbm algorithm. Table III also shows that Kmeans.pcgrbm, SP.pcgrbm and AP.pcgrbm, which are based on pcGRBM, are better than the other nine algorithms. We then check whether the measured sums of the ranks differ significantly from the average total rank expected under the null hypothesis.
T follows the chi-square distribution with 11 degrees of freedom because we use twelve algorithms and sixteen data sets. The p-value computed from this distribution is 0.00000001 for the one-tailed test and 0.000000001 for the two-tailed test. The null hypothesis is therefore rejected at a high level of significance: the results of the algorithms differ significantly, because the p-values are far less than 0.05.
In this paper, we proposed a novel pcGRBM model whose learning procedure is guided by pairwise constraints, so that encoding is conducted under guidance. As a result, some pairs of hidden features of pcGRBM are drawn together and others are pushed apart. In learning pcGRBM, CD learning is used to approximate ML learning, and the pairwise constraints are iterated through the transitions between visible and hidden units. The background knowledge of the pairwise constraints is thereby encoded in the hidden layer features of pcGRBM. To verify the effectiveness of pcGRBM, its hidden layer features are used as input ‘data’ for clustering tasks. The experimental results showed that the Kmeans.pcgrbm, SP.pcgrbm and AP.pcgrbm algorithms, which are based on pcGRBM, perform better on clustering tasks than the classic unsupervised clustering algorithms (K-means, SP, AP) and the semi-supervised clustering algorithms (Cop-Kmeans, Semi-SP, Semi-AP), and even better than Kmeans.grbm, SP.grbm and AP.grbm, which are based on the RBM with Gaussian visible units without guidance from pairwise constraints.
Several interesting questions remain for future study: how to design deep networks based on the pcGRBM, how to strengthen the pairwise constraint information as the network becomes deeper, and how many hidden layer dimensions best enhance clustering performance.
This work was partially supported by the National Science Foundation of China (No. 61573292).
-  G. E. Hinton and T. J. Sejnowski, “Optimal perceptual inference,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. Citeseer, 1983, pp. 448–453.
-  G. Hinton and T. Sejnowski, “Learning and relearning in boltzmann machines,” Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1, pp. 282–317, 1986.
-  G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural computation, vol. 14, no. 8, pp. 1771–1800, 2002.
-  M. A. Carreira-Perpinan and G. E. Hinton, “On contrastive divergence learning,” in Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics. Citeseer, 2005, pp. 33–40.
-  G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
-  Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle et al., “Greedy layer-wise training of deep networks,” Advances in Neural Information Processing Systems, vol. 19, p. 153, 2007.
-  R. Salakhutdinov and G. Hinton, “An efficient learning procedure for deep boltzmann machines,” Neural Computation, vol. 24, no. 8, pp. 1967–2006, 2012.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
-  D. Ciresan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3642–3649.
-  O. Fink, E. Zio, and U. Weidmann, “Fuzzy classification with restricted boltzman machines and echo-state networks for predicting potential railway door system failures,” Reliability, IEEE Transactions on, vol. 64, no. 3, pp. 861–868, 2015.
-  Y. Chen, X. Zhao, and X. Jia, “Spectral–spatial classification of hyperspectral data based on deep belief network,” Selected Topics in Applied Earth Observations and Remote Sensing, IEEE Journal of, vol. 8, no. 6, pp. 2381–2392, 2015.
-  S. Elfwing, E. Uchibe, and K. Doya, “Expected energy-based restricted boltzmann machine for classification,” Neural Networks, vol. 64, pp. 29–38, 2015.
-  Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 8, pp. 1798–1828, 2013.
-  Y. W. Teh and G. E. Hinton, “Rate-coded restricted boltzmann machines for face recognition,” Advances in Neural Information Processing Systems, pp. 908–914, 2001.
-  P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” The Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.
-  G. E. Hinton and R. R. Salakhutdinov, “Replicated softmax: an undirected topic model,” in Advances in Neural Information Processing Systems, 2009, pp. 1607–1614.
-  A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 6645–6649.
-  R. Sarikaya, G. E. Hinton, and A. Deoras, “Application of deep belief networks for natural language understanding,” Audio, Speech, and Language Processing, IEEE/ACM Transactions on, vol. 22, no. 4, pp. 778–784, 2014.
-  S. Nie, Z. Wang, and Q. Ji, “A generative restricted Boltzmann machine based method for high-dimensional motion data modeling,” Computer Vision and Image Understanding, vol. 136(C), pp. 14–22, 2015.
-  G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
-  T. Nakashika, T. Takiguchi, and Y. Ariki, “Voice conversion using RNN pre-trained by recurrent temporal restricted Boltzmann machines,” Audio, Speech, and Language Processing, IEEE/ACM Transactions on, vol. 23, no. 3, pp. 580–587, 2015.
-  X. Yang, Q. Chen, S. Zhou, and X. Wang, “Deep belief networks for automatic music genre classification,” in Twelfth Annual Conference of the International Speech Communication Association, vol. 8, no. 11, 2011, pp. 13–16.
-  M. Yuan, H. Tang, and H. Li, “Real-time keypoint recognition using restricted Boltzmann machine,” Neural Networks and Learning Systems, IEEE Transactions on, vol. 25, no. 11, pp. 2119–2126, 2014.
-  L. Nie, A. Kumar, and S. Zhan, “Periocular recognition using unsupervised convolutional RBM feature learning,” in Pattern Recognition (ICPR), 2014 22nd International Conference on. IEEE, 2014, pp. 399–404.
-  C. Chen, C.-Y. Zhang, L. Chen, and M. Gan, “Fuzzy restricted Boltzmann machine for the enhancement of deep learning,” Fuzzy Systems, IEEE Transactions on, vol. 23, no. 6, pp. 2163–2173, 2015.
-  Q. Yu, Y. Hou, X. Zhao, and G. Cheng, “Rényi divergence based generalization for learning of classification restricted Boltzmann machines,” in Data Mining Workshop (ICDMW), 2014 IEEE International Conference on. IEEE, 2014, pp. 692–697.
-  A. Courville, G. Desjardins, J. Bergstra, and Y. Bengio, “The spike-and-slab RBM and extensions to discrete and sparse data distributions,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 36, no. 9, pp. 1874–1887, 2014.
-  N. Wang, J. Melchior, and L. Wiskott, “Gaussian-binary restricted Boltzmann machines on modeling natural image statistics,” arXiv preprint arXiv:1401.5900, 2014.
-  H. Lee, C. Ekanadham, and A. Y. Ng, “Sparse deep belief net model for visual area V2,” Advances in Neural Information Processing Systems, vol. 20, 2007.
-  N. Srivastava, R. R. Salakhutdinov, and G. E. Hinton, “Modeling documents with deep Boltzmann machines,” arXiv preprint arXiv:1309.6865, 2013.
-  I. Sutskever, G. E. Hinton, and G. W. Taylor, “The recurrent temporal restricted Boltzmann machine,” in Advances in Neural Information Processing Systems, 2009, pp. 1601–1608.
-  Z. Han, Z. Liu, J. Han, C.-M. Vong, S. Bu, and X. Li, “Unsupervised 3D local feature learning by circle convolutional restricted Boltzmann machine,” IEEE Transactions on Image Processing, vol. 25, no. 11, pp. 5331–5344, 2016.
-  T. Nakashika, T. Takiguchi, and Y. Minami, “Non-parallel training in voice conversion using an adaptive restricted Boltzmann machine,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2032–2045, 2016.
-  F. Zhao, Y. Huang, L. Wang, T. Xiang, and T. Tan, “Learning relevance restricted Boltzmann machine for unstructured group activity and event understanding,” International Journal of Computer Vision, pp. 1–17, 2016.
-  M. V. Giuffrida and S. A. Tsaftaris, “Theta-RBM: Unfactored gated restricted Boltzmann machine for rotation-invariant representations,” arXiv preprint arXiv:1606.08805, 2016.
-  D. C. Mocanu, H. B. Ammar, L. Puig, E. Eaton, and A. Liotta, “Estimating 3D trajectories from 2D projections via disjunctive factored four-way conditional restricted Boltzmann machines,” arXiv preprint arXiv:1604.05865, 2016.
-  J. Gao, J. Yang, G. Wang, and M. Li, “A novel feature extraction method for scene recognition based on centered convolutional restricted Boltzmann machines,” Neurocomputing, vol. 11, no. 2, pp. 14–19, 2016.
-  N. Phan, D. Dou, B. Piniewski, and D. Kil, “Social restricted Boltzmann machine: Human behavior prediction in health social networks,” in 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE, 2015, pp. 424–431.
-  G. Li, L. Deng, Y. Xu, C. Wen, W. Wang, J. Pei, and L. Shi, “Temperature based restricted Boltzmann machines,” Scientific Reports, vol. 6, p. 19133, 2016.
-  H. Goh, N. Thome, M. Cord, and J.-H. Lim, “Learning deep hierarchical visual feature coding,” Neural Networks and Learning Systems, IEEE Transactions on, vol. 25, no. 12, pp. 2212–2225, 2014.
-  K. Wagstaff, C. Cardie, S. Rogers, S. Schrödl et al., “Constrained k-means clustering with background knowledge,” in ICML, vol. 1, 2001, pp. 577–584.
-  S. S. Rangapuram and M. Hein, “Constrained 1-spectral clustering,” arXiv preprint arXiv:1505.06485, 2015.
-  Y. Xiao and J. Yu, “Semi-supervised clustering based on affinity propagation algorithm,” Journal of Software, vol. 19, no. 11, pp. 2803–2813, 2008.
-  S. Osindero and G. E. Hinton, “Modeling image patches with a directed hierarchy of Markov random fields,” in Advances in Neural Information Processing Systems, 2008, pp. 1121–1128.
-  J. M. Tomczak, “Learning informative features from restricted Boltzmann machines,” Neural Processing Letters, vol. 44, no. 3, pp. 1–16, 2015.
-  J. Zhang, S. Ding, N. Zhang, and Z. Shi, “Incremental extreme learning machine based on deep feature embedded,” International Journal of Machine Learning and Cybernetics, vol. 7, no. 1, pp. 111–120, 2016.
-  I. Sutskever and G. E. Hinton, “Learning multilevel distributed representations for high-dimensional sequences,” in International Conference on Artificial Intelligence and Statistics, 2007, pp. 548–555.
-  V. Mnih, H. Larochelle, and G. E. Hinton, “Conditional restricted Boltzmann machines for structured output prediction,” arXiv preprint arXiv:1202.3748, 2012.
-  A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
-  K. Cho, A. Ilin, and T. Raiko, “Improved learning of Gaussian-Bernoulli restricted Boltzmann machines,” in Artificial Neural Networks and Machine Learning–ICANN 2011. Springer, 2011, pp. 10–17.
-  K. H. Cho, T. Raiko, and A. Ilin, “Gaussian-Bernoulli deep Boltzmann machine,” in Neural Networks (IJCNN), The 2013 International Joint Conference on. IEEE, 2013, pp. 1–7.
-  G. W. Taylor, G. E. Hinton, and S. T. Roweis, “Two distributed-state models for generating high-dimensional time series,” The Journal of Machine Learning Research, vol. 12, pp. 1025–1068, 2011.
-  J. Zhang, G. Tian, Y. Mu, and W. Fan, “Supervised deep learning with auxiliary networks,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014, pp. 353–361.
-  G. Chen, “Deep transductive semi-supervised maximum margin clustering,” arXiv preprint arXiv:1501.06237, 2015.
-  J. J. Hopfield, “Neural networks and physical systems with emergent collective computational abilities,” Proceedings of the National Academy of Sciences, vol. 79, no. 8, pp. 2554–2558, 1982.
-  Y. Freund and D. Haussler, Unsupervised learning of distributions of binary vectors using two layer networks. Computer Research Laboratory [University of California, Santa Cruz], 1994.
-  A. Krizhevsky and G. Hinton, “Convolutional deep belief networks on CIFAR-10,” Unpublished manuscript, 2010.
-  R. Salakhutdinov, “Learning deep generative models,” Ph.D. dissertation, University of Toronto, 2009.
-  G. Hinton, “A practical guide to training restricted Boltzmann machines,” Department of Computer Science, University of Toronto, Tech. Rep. UTML TR 2010-003, 2010.
-  V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
-  R. Karakida, M. Okada, and S. I. Amari, “Dynamical analysis of contrastive divergence learning: Restricted Boltzmann machines with Gaussian visible units,” Neural Networks, vol. 79, pp. 78–87, 2016.
-  H. Li, M. Wang, and X.-S. Hua, “MSRA-MM 2.0: A large-scale web multimedia dataset,” in Data Mining Workshops, 2009. ICDMW’09. IEEE International Conference on. IEEE, 2009, pp. 164–169.
-  B. J. Frey and D. Dueck, “Clustering by passing messages between data points,” Science, vol. 315, no. 5814, pp. 972–976, 2007.
-  D. Cai, X. He, and J. Han, “Document clustering using locality preserving indexing,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 12, pp. 1624–1637, 2005.
-  C. Ding, T. Li, W. Peng, and H. Park, “Orthogonal nonnegative matrix t-factorizations for clustering,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006, pp. 126–135.
-  S. García, A. Fernández, J. Luengo, and F. Herrera, “Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power,” Information Sciences, vol. 180, no. 10, pp. 2044–2064, 2010.
Jielei Chu received the B.S. degree from Southwest Jiaotong University, Chengdu, China, in 2008, and is currently working toward the Ph.D. degree at Southwest Jiaotong University. His research interests include machine learning, data mining, semi-supervised learning and ensemble learning.
Hongjun Wang received his Ph.D. degree in computer science from Sichuan University, China, in 2009. He is currently an Associate Professor at the Key Lab of Cloud Computing and Intelligent Techniques, Southwest Jiaotong University. His research interests include machine learning, data mining and ensemble learning. He has published over 30 research papers in journals and conferences and is a member of ACM and CCF. He has been a reviewer for several academic journals.
Meng Hua received his Ph.D. degree in mathematics from Sichuan University, China, in 2010. His research interests include belief revision, reasoning with uncertainty, machine learning, and general topology.
Peng Jin received his B.S., M.S. and Ph.D. degrees in Computing Science from Zhongyuan University of Technology, Nanjing University of Science and Technology, and Peking University, respectively. From October 2007 to April 2008, he was a visiting student at the Department of Informatics, University of Sussex (funded by the China Scholarship Council); from August 2014 to February 2015, he was a visiting research fellow at the Department of Informatics, University of Sussex. He is now a professor at the School of Computer Science, Leshan Normal University. His research interests include natural language processing, information retrieval and machine learning.
Tianrui Li (SM’11) received the B.S., M.S., and Ph.D. degrees in traffic information processing and control from Southwest Jiaotong University, Chengdu, China, in 1992, 1995, and 2002, respectively. He was a Post-Doctoral Researcher with Belgian Nuclear Research Centre, Mol, Belgium, from 2005 to 2006, and a Visiting Professor with Hasselt University, Hasselt, Belgium, in 2008; University of Technology, Sydney, Australia, in 2009; and University of Regina, Regina, Canada, in 2014. He is currently a Professor and the Director of the Key Laboratory of Cloud Computing and Intelligent Techniques, Southwest Jiaotong University. He has authored or co-authored over 150 research papers in refereed journals and conferences. His research interests include big data, cloud computing, data mining, granular computing, and rough sets. Dr. Li is a fellow of the International Rough Set Society.