# Learning Infinite RBMs with Frank-Wolfe

###### Abstract

In this work, we propose an infinite restricted Boltzmann machine (RBM), whose maximum likelihood estimation (MLE) corresponds to a constrained convex optimization. We consider the Frank-Wolfe algorithm to solve the program, which provides a sparse solution that can be interpreted as inserting a hidden unit at each iteration, so that the optimization process takes the form of a sequence of finite models of increasing complexity. As a side benefit, this can be used to easily and efficiently identify an appropriate number of hidden units during the optimization. The resulting model can also be used as an initialization for typical state-of-the-art RBM training algorithms such as contrastive divergence, leading to models with consistently higher test likelihood than random initialization.

## 1 Introduction

Restricted Boltzmann machines (RBMs) are two-layer latent variable models that use a layer of hidden units to model the distribution of visible units (Smolensky, 1986; Hinton, 2002). RBMs have been widely used to capture complex distributions in numerous application domains, including image modeling (Krizhevsky et al., 2010), human motion capture (Taylor et al., 2006) and collaborative filtering (Salakhutdinov et al., 2007), and are also widely used as building blocks for deep generative models, such as deep belief networks (Hinton et al., 2006) and deep Boltzmann machines (Salakhutdinov and Hinton, 2009). Due to the intractability of the likelihood function, RBMs are usually learned using the contrastive divergence (CD) algorithm (Hinton, 2002; Tieleman, 2008), which approximates the gradient of the likelihood using a Gibbs sampler.

One practical problem when using a RBM is that we need to decide the size of the hidden layer (number of hidden units) before performing learning, and it can be challenging to decide what is the optimal size. One simple heuristic is to search the ‘best” number of hidden units using cross validation or testing likelihood within a pre-defined candidate set. Unfortunately, this is extremely time consuming; it involves running a full training algorithm (e.g., CD) for each possible size, and thus we can only search over a relatively small set of sizes using this approach.

In addition, because the log-likelihood of the RBM is highly non-convex, its performance is sensitive to the initialization of the learning algorithm. Although random initializations (to relatively small values) are routinely used in practice with algorithms like CD, it would be valuable to explore more robust algorithms that are less sensitive to the initialization, as well as smarter initialization strategies to obtain better results.

In this work, we propose a fast, greedy algorithm for training RBMs by inserting one hidden unit at each iteration. Our algorithm provides an efficient way to determine the size of the hidden layer in an adaptive fashion, and can also be used as an initialization for a full CD-like learning algorithm. Our method is based on constructing a convex relaxation of the RBM that is parameterized by a distribution over the weights of the hidden units, for which the training problem can be framed as a convex functional optimization and solved using an efficient Frank-Wolfe algorithm (Frank and Wolfe, 1956; Jaggi, 2013) that effectively adds one hidden unit at each iteration by solving a relatively fast inner loop optimization.

##### Related Work

Our contributions connect to a number of different themes of existing work within machine learning and optimization. Here we give a brief discussion of prior related work.

There have been a number of works on convex relaxations of latent variable models in functional space, which are related to the gradient boosting method (Friedman, 2001). In supervised learning, Bengio et al. (2005) propose a convex neural network in which the number of hidden units is unbounded and can be learned, and Bach (2014) analyzes the appealing theoretical properties of such a model. For clustering problems, several works on convex functional relaxation have also been proposed (e.g., Nowozin and Bakir, 2008; Bradley and Bagnell, 2009). Other forms of convex relaxation have also been developed for two layer latent variable models (e.g., Aslan et al., 2013).

There has also been considerable work on extending directed/hierarchical models into “infinite” models such that the dimensionality of the latent space can be automatically inferred during learning. Most of these methods are Bayesian nonparametric models, and a brief overview can be found in Orbanz and Teh (2011). A few directions have been explored for undirected models, particularly RBMs. Welling et al. (2002) propose a boosting algorithm in the feature space of the model; a new feature is added into the RBM at each boosting iteration, instead of a new hidden unit. Nair and Hinton (2010) conceptually tie the weights of an infinite number of binary hidden units, and connect these sigmoid units with noisy rectified linear units (ReLUs). Recently, Côté and Larochelle (2015) extend an ordered RBM model with infinite number of hidden units, and Nalisnick and Ravi (2015) use the same technique for word embedding. The ordered RBM is sensitive to the ordering of its hidden units and can be viewed as an mixture of RBMs. In contrast, our model incorporates regular RBMs as a special case, and enables model selection for standard RBMs.

The Frank-Wolfe method (Frank and Wolfe, 1956) (a.k.a. conditional gradient) is a classical algorithm to solve constrained convex optimization. It has recently received much attention because it unifies a large variety of sparse greedy methods (Jaggi, 2013), including boosting algorithms (e.g., Beygelzimer et al., 2015), learning with dual structured SVM (Lacoste-Julien et al., 2013) and marginal inference using MAP in graphical models (e.g., Belanger et al., 2013; Krishnan et al., 2015).

Verbeek et al. (2003) proposed a greedy learning algorithm for Gaussian mixture models, which inserts a new component at each step and resembles our algorithm in its procedure. As one benefit, it provides a better initialization for EM than random initialization. Likas et al. (2003) investigate greedy initialization in k-means clustering.

## 2 Background

A restricted Boltzmann machine (RBM) is an undirected graphical model that defines a joint distribution over the vectors of visible units and hidden units ,

(1) |

where and are the dimensions of and respectively, and are the model parameters including the pairwise interaction term and the bias term for the visible units. Here we drop the bias term for the hidden units , since it can be achieved by introducing a dummy visible unit whose value is always one. The partition function serves to normalize the probability to sum to one, and is typically intractable to calculate exactly.

Because RBMs have a bipartite structure, the conditional distributions and are fully factorized and can be calculated in closed form,

(2) |

where is the logistic function, and and and are the -th column and -th row of respectively. Eq. (2) allows us to derive an efficient blocked Gibbs sampler that iteratively alternates between drawing and .

The marginal log-likelihood of the RBM is

(3) |

where is the -th column of and corresponds to the weights connected to the -th hidden unit. Because we assume each hidden unit takes values in , we get the softplus function when we marginalize . This form shows that the (marginal) free energy of the RBM is a sum of a linear term and a set of softplus functions with different weights ; this provides a foundation for our development.

Given a dataset , the gradient of the log-likelihood for each data point is

(4) |

where and the logistic function is applied in an element-wise manner. The positive part of the gradient can be calculated exactly, since the conditional distribution is fully factorized. The negative part arises from the derivatives of the log-partition function and is intractable. Stochastic optimization algorithms, such as CD (Hinton, 2002) and persistent CD (Tieleman, 2008), are popular methods to approximate the intractable expectation using Gibbs sampling.

## 3 RBM with Infinite Hidden Units

In this section, we first generalize the RBM model defined in Eq. (3) to a model with an infinite number of hidden units, which can also be viewed as a convex relaxation of the RBM in functional space. Then, we describe the learning algorithm.

### 3.1 Model Definition

Our general model is motivated by Eq. (3), in which the first term can be treated as an empirical average of the softplus function under an empirical distribution over the weights . To extend this, we define a general distribution over the weight , and replace the empirical averaging with the expectation under ; this gives the following generalization of an RBM with an infinite (possibly uncountable) number of hidden units,

(5) | |||

where and is a temperature parameter which controls the “effective number” of hidden units in the model, and . Note that is assumed to be properly normalized, i.e., . Intuitively, (5) defines a semi-parametric model whose log probability is a sum of a linear bias term parameterized by , and a nonlinear term parameterized by the weight distribution and that controls the magnitude of the nonlinear term. This model can be regarded as a convex relaxation of the regular RBM, as shown in the following result. {pro} The model in Eq. (5) includes the standard RBM (3) as special case by constraining and . Moreover, the log-likelihood of the model is concave w.r.t. the function , and respectively, and is jointly concave with and . We should point out that the parameter plays a special role in this model: we reduce to the standard RBM only when equals the number of particles in , and would otherwise get a fractional RBM. The fractional RBM leads to a more challenging inference problem than a standard RBM, since the standard Gibbs sampler is no longer directly applicable. We discuss this point further in Section 3.3.

Given a dataset , we learn the parameters and using a penalized maximum likelihood estimator (MLE) that involves a convex functional optimization:

(6) |

where is the set of valid distributions and we introduce a functional L2 norm regularization to penalize the likelihood for large values of . Alternatively, we could equivalently optimize the likelihood on , which restricts the probability mass to a -norm ball .

### 3.2 Learning Infinite RBMs with Frank-Wolfe

It is challenging to directly solve the optimization in Eq. (6) by standard gradient descent methods, because it involves optimizing the function with infinite dimensions. Instead, we propose to solve it using the Frank-Wolfe algorithm (Jaggi, 2013), which is projection-free and provides a sparse solution.

Assume we already have at the iteration ; then Frank-Wolfe finds the by maximizing the linearization of the objective function :

(7) |

where is a step size parameter, and the convex combination step guarantees the new remains a distribution after the update. A typical step-size is , in which case we have , that is, equals the average of all the earlier solutions obtained by the linear program.

To apply Frank-Wolfe to solve our problem, we need to calculate the functional gradient in E.q. (7). We can show that (see details in Appendix),

where is the distribution parametrized by the weight density and parameter ,

(8) |

The (functional) linear program in Eq. (7) is equivalent to an optimization over weight vector :

(9) |

The gradient of the objective (9) is,

where the expectation over can be intractable to calculate, and one may use stochastic optimization and draw samples using MCMC. Note that the second two terms in the gradient enforce an intuitive moment matching condition: the optimal introduces a set of “importance weights” that adjust the empirical data and the previous model, such that their moments match with each other. \qiangbetter explanation?

Now, suppose is the optimum of Eq. (9) at iteration , the item we added can be shown to be the delta over , that is, ; in addition, we have when the step size is taken to be . Therefore, this Frank-Wolfe update can be naturally interpreted as greedily inserting a hidden unit into the current model . In particular, if we update the temperature parameter as , according to Proposition 3.1, we can directly transform our model to a regular RBM after each Frank-Wolfe step, which enables the convenient blocked Gibbs sampling for inference.

Compared with the (regularized) MLE of the standard RBM (e.g. in Eq. (4)), the optimization in Eq. (9) has the following nice properties: (1) The current model does not depend on , which means we can draw enough samples from at each iteration , and reuse them during the optimization of . (2) The objective function in Eq. (9) can be evaluated explicitly given a set of samples, and hence efficient off-the-shelf optimization tools such as L-BFGS can be used to solve the optimization very efficiently. (3) Each iteration of our method involves much fewer parameters (only the weights for a single hidden unit, which is instead of the full weight matrix are updated), and hence defines a series of easier problems that can be less sensitive to initialization. We note that a similar greedy learning strategy has been successfully applied for learning mixture models (Verbeek et al., 2003), in which one greedily inserts a component at each step, and that this approach can provide better initialization for EM optimization than using multiple random initializations.

Once we obtain , we can update the bias parameter by gradient descent,

(10) |

One can further optimize by gradient descent,^{1}^{1}1see Appendix for the definition of
but we find simply updating is more efficient and works well in practice.
We summarize our Frank-Wolfe learning algorithm in Algorithm 1.

Adding hidden units on RBM. \qiangexplain what is the benefit for adding units on a existing RBM? Besides initializing to be a delta function at some random and learning the model from scratch, one can also adapt Algorithm 1 to incrementally add hidden units into an existing RBM in Eq. (3) (e.g. have been learned by CD). According to Proposition 3.1, one can simply initialize , , and continue the Frank-Wolfe iterations at .

Removing hidden units. Since the hidden units are added in a greedy manner, one may want to remove an old hidden unit during the Frank-Wolfe learning, provided it is bad with respect to our objective Eq. (9) after more hidden units have been added. A variant of Frank-Wolfe with away-steps (Guélat and Marcotte, 1986) fits this requirement and can be directly applied. As shown by (Clarkson, 2010), it can improve the sparsity of the final solution (i.e., fewer hidden units in the learned model).

### 3.3 MCMC Inference for Fractional RBMs

As we point out in Section 3.1, we need to take equal to the number of particles in (that is, in Algorithm 1) in order to have our model reduce to the standard RBM. If takes a more general real number, we obtain a more general fractional RBM model, for which inference is more challenging because the standard block Gibbs sampler is not directly applicable. In practice, we find that setting to correspond to a regular RBM seems to give the best performance, but for completeness, we discuss the fractional RBM in more detail in this section, and propose a Metropolis-Hastings algorithm to draw samples from the fractional RBM. We believe that this fractional RBM framework provides an avenue for further improvements in future work.

To frame the problem, let us assume , where is a general real number; the corresponding model is

(11) |

which differs from the standard RBM in (3) because each softplus function is multiplied by . Nevertheless, one may push the into the softplus function, and obtain a standard RBM that forms an approximation of (11):

(12) |

This approximation can be justified by considering the special case when the magnitude of the weights is very large, so that the softplus function essentially reduces to a ReLU function, that is, . In this case, (11) and (12) become equivalent because . More concretely, we can guarantee the following bound: {pro} For any , we have

The proof can be found in the Appendix. Note that we apply the bound when by splitting into the sum of its integer part and fractional remainder, and apply the bound to the fractional part.

Therefore, the fractional RBM (11) can be well approximated by the standard RBM (12), and this can be leveraged to design an inference algorithm for (11). As one example, we can use the Gibbs update of (12) as a proposal for a Metropolis-Hastings update for (11). To be specific, given a configuration , we perform Gibbs update in RBM to get , and accept it with probability ,

where is the Gibbs transition of RBM . Because the acceptance probability of a Gibbs sampler equals one, we have . This gives

## 4 Experiments

In this section, we test the performance of our Frank-Wolfe (FW) learning algorithm on two datasets: MNIST (LeCun et al., 1998) and Caltech101 Silhouettes (Marlin et al., 2010). The MNIST handwritten digits database contains 60,000 images in the training set and 10,000 test set images, where each image includes pixels and is associated with a digit label . We binarize the grayscale images by thresholding the pixels at , and randomly select 10,000 images from training as the validation set. The Caltech101 Silhouettes dataset (Marlin et al., 2010) has 8,671 images with binary pixels, where each image represents objects silhouette and has a class label (overall 101 classes). The dataset is divided into three subsets: 4,100 examples for training, 2,264 for validation and 2,307 for testing.

##### Training algorithms

We train RBMs with CD-10 algorithm. ^{2}^{2}2CD-k refers to using k-step Gibbs sampler to approximate the gradient of the log-partition function.
A fixed learning rate is selected from the set using the validation set,
and the mini-batch size is selected from the set .
We use epochs for training on MINIST and epochs on Caltech101.
Early stopping is applied by monitoring the difference of average log-likelihood between training and validation data, so that the intractable log-partition function is cancelled (Hinton, 2010).
We train RBMs with hidden units.
We incrementally train a RBM model by Frank-Wolfe (FW) algorithm 1.
A fixed step size is selected from the set using the validation data, and
a regularization strength is selected from the set .
We set in Algorithm 1, and use the same early stopping criterion as CD.
We randomly initialize the CD algorithm 5 times and select the best one on the validation set; meanwhile, we also initialize CD by the model learned from Frank-Wolfe.

##### Test likelihood

To evaluate the test likelihood of the learned models, we estimate the partition function using annealed importance sampling (AIS) (Salakhutdinov and Murray, 2008). The temperature parameter is selected following the standard guidance: first temperatures spaced uniformly from 0 to 0.5, and 4,000 spaced uniformly from 0.5 to 0.9, and 10,000 spaced uniformly from 0.9 to 1.0; this gives a total of 14,500 intermediate distributions. We summarize the averaged test log-likelihood of MNIST and Caltech101 Silhouettes in Figure 1, where we report the result averaged over 500 AIS runs in all experiments, with the error bars indicating the 3 standard deviations of the estimations.

We evaluate the test likelihood of the model in FW after adding every 20 hidden units. We perform early stopping when the gap of average log-likelihood between training and validation data largely increases. As shown in Figure 1, this procedure selects 460 hidden units on MNIST (as indicated by the green dashed lines), and 550 hidden units on Caltech101; purely for illustration purposes, we continue FW in the experiment until reaching hidden units. We can see that the identified number of hidden units roughly corresponds to the maximum of the test log-likelihood of all the three algorithms, suggesting that FW can identify the appropriate number of hidden units during the optimization.

We also use the model learned by FW as an initialization for CD (the blue lines in Figure 2), and find it consistently performs better than the best result of CD with 5 random initializations. In our implementation, the running time of the FW procedure is at most twice as CD for the same number of hidden units. Therefore, FW initialized CD provides a practical strategy for learning RBMs: it requires approximately three times of computation time as a single run of CD, while simultaneously identifying the proper number of hidden units and obtaining better test likelihood.

(a) MNIST | (b) Caltech101 Silhouettes |
---|

(a) MNIST | (b) Caltech101 Silhouettes |

##### Classification

The performance of our method is further evaluated using discriminant image classification tasks. We take the hidden units’ activation vectors generated by the three algorithms in Figure 1 and use it as the feature in a multi-class logistic regression on the class labels in MNIST and Caltech101. From Figure 2, we find that our basic FW tends to be worse than the fully trained CD (best in 5 random initializations) when only small numbers of hidden units are added, but outperforms CD when more hidden units (about 450 in both cases) are added. Meanwhile, the CD initialized by FW outperforms CD using the best of 5 random initializations.

## 5 Conclusion

In this work, we propose a convex relaxation of the restricted Boltzmann machine with an infinite number of hidden units, whose MLE corresponds to a constrained convex program in a function space. We solve the program using Frank-Wolfe, which provides a sparse greedy solution that can be interpreted as inserting a single hidden unit at each iteration. Our new method allows us to easily identify the appropriate number of hidden units during the progress of learning, and can provide an advanced initialization strategy for other state-of-the-art training methods such as CD to achieve higher test likelihood than random initialization.

#### Acknowledgements

This work is sponsored in part by NSF grants IIS-1254071 and CCF-1331915. It is also funded in part by the United States Air Force under Contract No. FA8750-14-C-0011 under the DARPA PPAML program.

## Appendix

##### Derivation of gradients

The functional gradient of w.r.t. the density function is

The gradient of w.r.t. the temperature parameter is

##### Proof of Proposition 4.2

###### Proof.

For any , we have following classical inequality,

Let and , and the proposition is a direct result of above two inequalities. ∎

## References

- Aslan et al. [2013] Ö. Aslan, H. Cheng, X. Zhang, and D. Schuurmans. Convex two-layer modeling. In NIPS, 2013.
- Bach [2014] F. Bach. Breaking the curse of dimensionality with convex neural networks. arXiv:1412.8690, 2014.
- Belanger et al. [2013] D. Belanger, D. Sheldon, and A. McCallum. Marginal inference in MRFs using Frank-Wolfe. In NIPS Workshop on Greedy Optimization, Frank-Wolfe and Friends, 2013.
- Bengio et al. [2005] Y. Bengio, N. L. Roux, P. Vincent, O. Delalleau, and P. Marcotte. Convex neural networks. In NIPS, 2005.
- Beygelzimer et al. [2015] A. Beygelzimer, E. Hazan, S. Kale, and H. Luo. Online gradient boosting. In NIPS, 2015.
- Bradley and Bagnell [2009] D. M. Bradley and J. A. Bagnell. Convex coding. In UAI, 2009.
- Clarkson [2010] K. L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms, 2010.
- Côté and Larochelle [2015] M.-A. Côté and H. Larochelle. An infinite restricted Boltzmann machine. Neural Computation, 2015.
- Frank and Wolfe [1956] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 1956.
- Friedman [2001] J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 2001.
- Guélat and Marcotte [1986] J. Guélat and P. Marcotte. Some comments on Wolfe’s ‘away step’. Mathematical Programming, 1986.
- Hinton [2010] G. Hinton. A practical guide to training restricted Boltzmann machines. UTML TR, 2010.
- Hinton et al. [2006] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
- Hinton [2002] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 2002.
- Jaggi [2013] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.
- Krishnan et al. [2015] R. G. Krishnan, S. Lacoste-Julien, and D. Sontag. Barrier Frank-Wolfe for marginal inference. In NIPS, 2015.
- Krizhevsky et al. [2010] A. Krizhevsky, G. E. Hinton, et al. Factored 3-way restricted Boltzmann machines for modeling natural images. In AISTATS, 2010.
- Lacoste-Julien et al. [2013] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML, 2013.
- LeCun et al. [1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
- Likas et al. [2003] A. Likas, N. Vlassis, and J. J. Verbeek. The global k-means clustering algorithm. Pattern recognition, 2003.
- Marlin et al. [2010] B. M. Marlin, K. Swersky, B. Chen, and N. D. Freitas. Inductive principles for restricted Boltzmann machine learning. In AISTATS, 2010.
- Nair and Hinton [2010] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
- Nalisnick and Ravi [2015] E. Nalisnick and S. Ravi. Infinite dimensional word embeddings. arXiv:1511.05392, 2015.
- Nowozin and Bakir [2008] S. Nowozin and G. Bakir. A decoupled approach to exemplar-based unsupervised learning. In ICML, 2008.
- Orbanz and Teh [2011] P. Orbanz and Y. W. Teh. Bayesian nonparametric models. In Encyclopedia of Machine Learning. 2011.
- Salakhutdinov and Hinton [2009] R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In AISTATS, 2009.
- Salakhutdinov and Murray [2008] R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In ICML, 2008.
- Salakhutdinov et al. [2007] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In ICML, 2007.
- Smolensky [1986] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, DTIC Document, 1986.
- Taylor et al. [2006] G. W. Taylor, G. E. Hinton, and S. Roweis. Modeling human motion using binary latent variables. In NIPS, 2006.
- Tieleman [2008] T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, 2008.
- Verbeek et al. [2003] J. J. Verbeek, N. Vlassis, and B. Kröse. Efficient greedy learning of Gaussian mixture models. Neural Computation, 2003.
- Welling et al. [2002] M. Welling, R. S. Zemel, and G. E. Hinton. Self supervised boosting. In NIPS, 2002.