Using Inherent Structures to design Lean 2-layer RBMs


Abstract

Understanding the representational power of Restricted Boltzmann Machines (RBMs) with multiple layers is an open problem and an area of active research. Motivated by the Inherent Structure formalism [21], extensively used in analysing spin glasses, we propose a novel measure called Inherent Structure Capacity (ISC), which characterizes the representation capacity of a fixed-architecture RBM by the expected number of modes of distributions emanating from the RBM with parameters drawn from a prior distribution. Though ISC is intractable, we show that for a single-layer RBM architecture the ISC approaches a finite constant as the number of hidden units is increased, and that to improve the ISC further one needs to add a second layer. Furthermore, we introduce lean RBMs, which are multi-layer RBMs where each layer has at most $O(n)$ units, $n$ being the number of visible units. We show that for every single-layer RBM with sufficiently many hidden units there exists a two-layer lean RBM with $O(n^2)$ parameters and the same ISC, establishing that two-layer RBMs can achieve the same representational power as wide single-layer RBMs using far fewer parameters. To the best of our knowledge, this is the first result which quantitatively establishes the need for layering.


1 Introduction

Deep Boltzmann Machines (DBMs) are largely tuned using empirical methods based on trial and error. Despite much effort, there is still very little theoretical understanding of why a particular neural network architecture works better than another for a given application. Furthermore, there is no well-defined metric with which to compare different network architectures.

It is known that given any input distribution on the set of binary vectors of length $n$, there exists an RBM with exponentially many hidden units that can approximate that distribution to arbitrary precision [13]. However, with this many hidden units the number of parameters grows exponentially. We call a network lean if, for each layer, the number of hidden units is $O(n)$, where $n$ is the number of visible units. The deep narrow Boltzmann Machines whose universal approximation properties were studied in [11] are a special case of lean networks. In this paper we study lean two-layer deep RBMs.

We ask the following questions: is there a measure that can relate DBM architectures to their representational power? And once we have such a measure, can we gain insights into the capabilities of different DBM architectures?

For example, given a wide single-layer RBM, i.e., an RBM with many hidden units, can we find a lean multi-layer RBM with equivalent representational power but far fewer parameters? Despite much effort, these questions have not been satisfactorily answered, and answering them may provide important insights for the area of Deep Learning.

Our main contributions are as follows:

  1. We study the Inherent Structure formalism, first introduced in Statistical Mechanics [21], to understand the configuration space of RBMs. We introduce a capacity measure, the Inherent Structure Capacity (ISC) (Definition 5), and discuss its relation with the expected number of perfectly reconstructible vectors [14], one-flip stable states, and the modes of the input distribution. We use this as a measure of the representation power of an RBM.

  2. Existing methods for computing the expected number of inherent structures are rooted in Statistical Mechanics. They use the replica approach [4], which does not extend well to DBMs since it is not straightforward to incorporate the bipartite structure and the layering into the calculations. We use a first-principles approach to devise a method that yields upper and lower bounds for single-layer and two-layer DBMs (Theorems 5.1, 6.1). We show that the bounds become tight as we increase the number of hidden units.

  3. Previous results have shown that a sufficiently large single-layer RBM can represent any distribution on the input visible vectors. However, we show that if we keep adding units to the hidden layer, the ISC tapers off to a constant of roughly 0.585 (Corollary 5.2, Table 2), as opposed to the maximum possible value of 1. This implies that although an RBM is a universal approximator, if the input distribution contains a large number of modes, multi-layering should be considered. We have empirically verified that once the number of hidden units in a single-hidden-layer RBM grows large enough, the ISC saturates (Figure 3).

  4. By analyzing the ISC for two-layer RBMs we obtain an interesting result: for any single-layer RBM with a large number of hidden units, one can construct a two-layer DBM with $O(n)$ units in each of hidden layers 1 and 2 (Corollary 1), and hence $O(n^2)$ parameters, with the same ISC, resulting in an order-of-magnitude saving in parameters. To the best of our knowledge this is the first such result which establishes the superiority of two-layer DBMs over wide single-layer RBMs in terms of representational efficiency. We conduct extensive experiments on synthetic datasets to verify our claim.

2 Model Definition and Notations

An RBM with $n$ visible and $m$ hidden units, denoted by $\mathrm{RBM}_{n,m}$, is a probability distribution on $\{0,1\}^{n+m}$ of the form

$$p(\mathbf{v}, \mathbf{h}; \theta) = \frac{1}{Z(\theta)} \exp\left(\mathbf{v}^\top W \mathbf{h} + \mathbf{b}^\top \mathbf{v} + \mathbf{c}^\top \mathbf{h}\right) \qquad (1)$$

$$Z(\theta) = \sum_{\mathbf{v}, \mathbf{h}} \exp\left(\mathbf{v}^\top W \mathbf{h} + \mathbf{b}^\top \mathbf{v} + \mathbf{c}^\top \mathbf{h}\right) \qquad (2)$$

where $\mathbf{v} \in \{0,1\}^n$ denotes the visible vector, the hidden vector is denoted by $\mathbf{h} \in \{0,1\}^m$, the parameter $\theta = (W, \mathbf{b}, \mathbf{c})$ denotes the coupling matrix and the set of biases, and $Z(\theta)$ is the normalization constant. The log-likelihood of a given visible vector $\mathbf{v}$ for an $\mathrm{RBM}_{n,m}$ is given by

$$\log p(\mathbf{v}; \theta) = \log \sum_{\mathbf{h}} \exp\left(\mathbf{v}^\top W \mathbf{h} + \mathbf{b}^\top \mathbf{v} + \mathbf{c}^\top \mathbf{h}\right) - \log Z(\theta) \qquad (3)$$

In the sequel, $\mathcal{RBM}_{n,m}$ will denote the family of distributions parameterized by $\theta$.
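To make the definitions concrete, the following is a minimal brute-force sketch of Eqns (1)-(3) (illustrative code of ours, not from the paper; the $\{0,1\}$ encoding matches the conventions above, and exact enumeration is feasible only for small $n$ and $m$):

```python
import itertools
import numpy as np

def energy(v, h, W, b, c):
    """E(v, h) such that p(v, h) is proportional to exp(-E):
    -E = v^T W h + b^T v + c^T h."""
    return -(v @ W @ h + b @ v + c @ h)

def log_partition(W, b, c):
    """Exact log Z by enumerating all (v, h) pairs; O(2^(n+m))."""
    n, m = W.shape
    logits = [-energy(np.array(v), np.array(h), W, b, c)
              for v in itertools.product([0, 1], repeat=n)
              for h in itertools.product([0, 1], repeat=m)]
    return np.logaddexp.reduce(logits)

def log_likelihood(v, W, b, c):
    """Eqn (3): log p(v) = log sum_h exp(-E(v, h)) - log Z."""
    m = W.shape[1]
    logits = [-energy(v, np.array(h), W, b, c)
              for h in itertools.product([0, 1], repeat=m)]
    return np.logaddexp.reduce(logits) - log_partition(W, b, c)
```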

Definition 1 (Modes).

Given a distribution $p$ on vectors in $\{0,1\}^n$, a vector $\mathbf{v}$ is said to be a mode of that distribution if $p(\mathbf{v}) > p(\tilde{\mathbf{v}})$ for all $\tilde{\mathbf{v}}$ such that $d_H(\mathbf{v}, \tilde{\mathbf{v}}) = 1$. Here $d_H$ is the Hamming distance.

Definition 2 (Perfectly Reconstructible Vectors).

For an $\mathrm{RBM}_{n,m}$ we define the function $g$ that takes a visible vector $\mathbf{v}$ as input and outputs the most likely hidden vector conditioned on $\mathbf{v}$, i.e., $g(\mathbf{v}) = \arg\max_{\mathbf{h}} p(\mathbf{h} \mid \mathbf{v})$. Similarly, $f(\mathbf{h}) = \arg\max_{\mathbf{v}} p(\mathbf{v} \mid \mathbf{h})$. A visible vector $\mathbf{v}$ is said to be perfectly reconstructible (PR) if $f(g(\mathbf{v})) = \mathbf{v}$.

For any set $S$, the cardinality will be denoted by $|S|$. For an $\mathrm{RBM}_{n,m}$ with parameter $\theta$, we define $\mathrm{PR}(\theta)$ to be the set of perfectly reconstructible visible vectors.
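Since the hidden units are conditionally independent given the visible units (and vice versa), the maps $g$ and $f$ reduce to coordinate-wise thresholds. The following sketch (our own illustrative code) checks perfect reconstructibility:

```python
import numpy as np

def g(v, W, c):
    """Most likely hidden vector given v: h_j = 1 iff (W^T v + c)_j > 0."""
    return (W.T @ v + c > 0).astype(int)

def f(h, W, b):
    """Most likely visible vector given h: v_i = 1 iff (W h + b)_i > 0."""
    return (W @ h + b > 0).astype(int)

def is_perfectly_reconstructible(v, W, b, c):
    """Definition 2: v is PR iff f(g(v)) == v."""
    return np.array_equal(f(g(v, W, c), W, b), v)
```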

3 Problem Statement

We consider fitting an $\mathrm{RBM}_{n,m}$ to a distribution $p^*(\mathbf{v}) = \frac{1}{k}\sum_{i=1}^{k} \delta(\mathbf{v} - \mathbf{v}_i)$, where $\delta$ denotes the Dirac delta function and where each pair of distinct vectors in the support $\{\mathbf{v}_1, \dots, \mathbf{v}_k\}$ is at Hamming distance at least 2. We need to find the smallest $m$ such that the set $\mathcal{RBM}_{n,m}$ contains an RBM that represents $p^*$. We also study the case of a DBM with 2 hidden layers. We denote a DBM with $n$ visible units and $L$ hidden layers with $m_l$ hidden units in layer $l$ by $\mathrm{DBM}_{n,m_1,\dots,m_L}$, and the respective set of DBMs by $\mathcal{DBM}_{n,m_1,\dots,m_L}$. We would like to ask the following question: are there lean two-layer architectures which can model distributions with the same number of modes as distributions generated by a one-layer architecture with $m \gg n$?

3.1 Related Work

The representational power of Restricted Boltzmann Machines (RBMs) is an ongoing area of study [9, 15, 23, 10, 5]. It is well known that an RBM with one hidden layer is a universal approximator [9, 12, 13]. [9] showed that the set $\mathcal{RBM}_{n,m}$ can approximate any input distribution with support set of size $k$ arbitrarily well if the following inequality is satisfied:

$$m \;\geq\; k + 1 \qquad (4)$$

If we knew the number of modes of our input distribution, we could design our RBM as per Eqn (4). Unfortunately, the number of modes can be large, resulting in a large RBM.

Figure 1: Number of modes attained for different choices of the number of hidden units. It can be seen that the currently known result for the number of hidden units required (red graph) is a large over-estimate. The green and purple graphs are estimates given by Theorem 5.1; these are closer to the actual number of enumerated modes, given by the blue graph.

To test the bound in Eqn (4), we conducted simulation experiments. We kept the number of visible units fixed, generated random coupling weight matrices whose entries were i.i.d. Gaussian, and enumerated all the modes of the generated distribution. We averaged our readings over 100 different weight matrices. The results, shown in Figure 1, indicate that the bound gives a highly conservative estimate of $m$. For example, on average the family has the capability to represent distributions with 170 modes, whereas the bound certifies only 49. Thus, although the number of modes is an important design criterion, a more practical metric is desirable.
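The experiment is straightforward to reproduce in outline. The sketch below (our own code, reusing the helpers from Section 2, with unit-variance weights and zero biases as assumptions) enumerates the modes of random RBMs and averages over trials:

```python
import itertools
import numpy as np

def count_pr_vectors(W):
    """Count perfectly reconstructible (equivalently, mode) vectors."""
    n, m = W.shape
    b, c = np.zeros(n), np.zeros(m)
    return sum(is_perfectly_reconstructible(np.array(v), W, b, c)
               for v in itertools.product([0, 1], repeat=n)
               if any(v))  # leave out the trivial all-zero vector

def expected_modes(n, m, trials=100, seed=0):
    """Average mode count over random weight matrices W ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    return float(np.mean([count_pr_vectors(rng.normal(size=(n, m)))
                          for _ in range(trials)]))
```

Comparing `expected_modes(n, m)` against the number of modes permitted by the bound (4) reproduces the gap visible in Figure 1.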

4 Inherent structures of RBM

To understand the complex landscape of spin glasses, the notion of Inherent Structures (IS) was introduced in [21]. The IS approach consists of partitioning the configuration space into valleys, where each valley consists of configurations in the vicinity of a local minimum. The number of such valleys is thus indicative of the complexity of the system.

In this section we recall the IS approach in a general setting to motivate a suitable capacity measure. Consider a system governed by the probability model

$$p(\mathbf{x}; \theta) = \frac{1}{Z(\theta)} e^{-E(\mathbf{x};\, \theta)} \qquad (5)$$

where $E(\cdot\,; \theta)$ is an energy function defined over $N$-dimensional binary vectors with parameter $\theta$.

Definition 3 (One-flip Stable States).

[20] For an energy function $E(\cdot\,; \theta)$, a configuration $\mathbf{x}$ is called a local minimum, also called a one-flip stable state, if $E(\mathbf{x}) \leq E(\tilde{\mathbf{x}})$ for every $\tilde{\mathbf{x}}$ with $d_H(\mathbf{x}, \tilde{\mathbf{x}}) = 1$ (equivalently, flipping any single bit does not decrease the energy).
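For small systems, one-flip stable states can be enumerated directly by checking every single-bit flip; the following is a minimal illustrative sketch:

```python
import itertools
import numpy as np

def one_flip_stable_states(E, N):
    """All x in {0,1}^N whose energy does not decrease under any 1-bit flip."""
    flips = np.eye(N, dtype=int)
    return [x for x in map(np.array, itertools.product([0, 1], repeat=N))
            if all(E(x) <= E(x ^ flips[i]) for i in range(N))]
```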

For every one-flip stable state $\mathbf{x}_\alpha$ we define the set $V_\alpha$ of configurations in its valley. Let $\{V_\alpha\}_{\alpha=1}^{K}$ form a partition of the configuration space, where each $V_\alpha$ corresponds to the local minimum $\mathbf{x}_\alpha$ and $K$ is the total number of valleys. The logarithm of the partition function is

$$\log Z = \log \sum_{\alpha=1}^{K} Z_\alpha,$$

where $Z_\alpha = \sum_{\mathbf{x} \in V_\alpha} e^{-E(\mathbf{x};\, \theta)}$. Now, for any $q$ in the $(K-1)$-dimensional probability simplex, using the non-negativity of the KL divergence, it is straightforward to show that

$$\log Z \;\geq\; \sum_{\alpha=1}^{K} q_\alpha \log Z_\alpha + H(q) \qquad (6)$$

where $H(q)$ is the entropy of $q$. Equality holds whenever $q = q^*$, which is defined as

$$q^*_\alpha = \frac{Z_\alpha}{Z}. \qquad (7)$$

One could thus construct $\log Z$ from the valley partition functions $Z_\alpha$ if one had access to $K$ and knew $q^*$. From the properties of the entropy function one can write

$$0 \;\leq\; H(q^*) \;\leq\; \log K \;\leq\; N \log 2 \qquad (8)$$

where the lower bound on $H(q^*)$ is attained when $K = 1$ and is realized when the energy surface has only one local minimum, a very uninteresting case. The upper bound $\log K$ is attained when $q^*$ is uniform, which happens only when all valleys are similar. Since the number of states can be at most $2^N$, the last upper bound holds. Thus $\log K$ can be viewed as a measure of the complexity of the energy surface. One could put a suitable prior distribution over the parameters and evaluate this complexity averaged over the prior, motivating the following definition.

Definition 4.

(Complexity) The complexity of the model described in Eqn (5) is given by

$$\mathcal{C} = \frac{1}{N} \log \mathbb{E}_{\theta \sim P}\left[K(\theta)\right],$$

where $K(\theta)$ is the number of one-flip stable states for the energy function defined with parameter $\theta$ and $P$ is a prior distribution over $\theta$.

For Ising models, complexity has been estimated in the large-$N$ limit [4, 22] by methods such as the replica technique. However, extending these methods to RBMs of finite size is not straightforward.

It has been shown (see e.g. [16]) that the IS decomposition gives a very accurate picture of the energy landscape of Ising models at zero temperature. At positive temperature, one needs to take into account both the valley structure and the energy landscape of the free energy [3]. Obtaining accurate estimates of complexity is an active area of study; for a recent review see [1].

Our goal is to apply the aforementioned IS decomposition to RBMs. We now show the equivalence between perfectly reconstructible vectors and one-flip stable states for an RBM. The IS decomposition then allows us to define the measure of capacity in terms of the modes of the input distribution.

Lemma 4. A vector $\mathbf{v}$ is perfectly reconstructible for an $\mathrm{RBM}_{n,m}$ if and only if the state $(\mathbf{v}, g(\mathbf{v}))$ is one-flip stable.

Proof.

See Supplementary material. ∎

Thus we see that there is a one-to-one correspondence between perfectly reconstructible vectors and one-flip stable states for a single-layer RBM.

Relationship between the modes of $p(\mathbf{v})$ and $p(\mathbf{v}, \mathbf{h})$. In this section we discuss the relationship between the modes of the marginal distribution $p(\mathbf{v})$ and the joint distribution $p(\mathbf{v}, \mathbf{h})$. We make a mild assumption on one-flip stable states.

  • Assumption 1: For a single-layer RBM, given a visible vector $\mathbf{v}$, the vector $g(\mathbf{v})$ is unique.

If the weights are given a small random perturbation, then Assumption 1 holds with probability one. However, it does not hold for a multi-layer DBM: denoting the hidden vectors by $\mathbf{h}^1, \mathbf{h}^2$ and the visible vector by $\mathbf{v}$, one can define the set of most likely hidden configurations for a given $\mathbf{v}$, and this set can contain more than one element. For the input distributions considered in Section 3, the modes of the joint distribution with distinct $\mathbf{v}$ are at least as many as the modes of the marginal distribution $p(\mathbf{v})$. A formal statement with proof is given in the Supplementary material.

As discussed, for DBMs the number of modes of the marginal distribution could be smaller than that of the joint distribution. However, [14, Theorem 1.6] gave precise conditions under which the numbers of modes of the marginal and joint distributions coincide for a single-layer network. We suspect that a similar argument holds for multi-layer DBMs. For the rest of the paper we assume that the modes of the joint distribution are the same as those of $p(\mathbf{v})$.

Armed with these observations, we are now ready to define a measure which relates the architecture of a DBM to the expected number of such modes under a prior distribution on the model parameters. More formally,

Definition 5 (Inherent Structure Capacity).

For an $L$-layered DBM with $m_l$ hidden units in layer $l$ and $n$ visible units, we define the Inherent Structure Capacity (ISC), denoted by $\mathcal{C}(n, m_1, \dots, m_L)$, to be the logarithm (divided by $n$) of the expected number of modes of the distributions generated over the visible units by the DBM.

We note that for the single-layer case this definition reduces to the scaled logarithm of the expected number of perfectly reconstructible vectors, $\mathcal{C}(n, m) = \frac{1}{n}\log_2 \mathbb{E}\left[\,|\mathrm{PR}(\theta)|\,\right]$. The ISC as a measure is useful for identifying DBM architectures which can model the modes of an input distribution defined over the visible units.

This measure serves as a recipe for fitting DBMs. Suppose we know that the input distribution has $k$ modes; then one could find a suitable DBM architecture by the following criterion:

$$\mathcal{C}(n, m_1, \dots, m_L) \;\geq\; \frac{\log_2 k}{n}. \qquad (9)$$

Once the architecture has been identified one can then use a standard learning algorithm to learn parameters to fit a given distribution.
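As an illustration of this recipe, the sketch below (our own code; `isc` stands in for the intractable capacity and would in practice be one of the bounds of Theorem 5.1) searches for the smallest single hidden layer satisfying Eqn (9):

```python
import math

def smallest_hidden_layer(n, k, isc, m_max=10_000):
    """Smallest m with isc(n, m) >= log2(k) / n, per criterion (9)."""
    target = math.log2(k) / n
    for m in range(1, m_max + 1):
        if isc(n, m) >= target:
            return m
    return None  # k exceeds what any single hidden layer can cover
```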

In the following sections we investigate the computation of the ISC and its applications to single- and two-layer networks, i.e., $\mathcal{RBM}_{n,m}$ and $\mathcal{DBM}_{n,m_1,m_2}$. To keep the exposition simple we assume the bias parameters to be zero. We also assume that the coupling weights are distributed as a mean-zero Gaussian, i.e., $W_{ij} \sim \mathcal{N}(0, \sigma^2)$ i.i.d.

5 Computing the capacity of $\mathcal{RBM}_{n,m}$ and the need for more layers

In this section we discuss the computation of the ISC for a single-layer RBM. In the absence of a definitive proof, we conjecture that the ISC is intractable, just like the complexity measure in spin glasses. The problem of computing complexity has been addressed in the Statistical Mechanics community using the replica method [17, 6], which yields reasonable estimates. However, the applicability of the replica trick to multi-layer DBMs is not clear. In this section we develop an alternative method for estimating the ISC.

5.1 Computing the ISC of $\mathcal{RBM}_{n,m}$

For an arbitrary vector $\mathbf{v}$ we compute $\mathbb{E}\left[\mathbb{1}(\mathbf{v} \in \mathrm{PR}(\theta))\right] = \Pr(\mathbf{v} \text{ is PR})$, where $\mathbb{1}(\cdot)$ is the indicator random variable and the expectation is over the model parameters with the prior stated in Section 4. We then sum this over all vectors, i.e., $\mathbb{E}\left[\,|\mathrm{PR}(\theta)|\,\right] = \sum_{\mathbf{v}} \Pr(\mathbf{v} \text{ is PR})$.
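This per-vector probability is easy to estimate by Monte Carlo for small sizes, which serves as a sanity check on the bounds below. A sketch (our own code, reusing `is_perfectly_reconstructible` from Section 2; by the symmetry argument used in the proofs, the probability depends on $\mathbf{v}$ only through its number of ones $r$):

```python
import math
import numpy as np

def pr_probability(n, m, r, trials=2000, seed=0):
    """Monte Carlo estimate of Pr(v is PR) for a fixed v with r ones."""
    rng = np.random.default_rng(seed)
    v = np.array([1] * r + [0] * (n - r))
    b, c = np.zeros(n), np.zeros(m)
    hits = sum(is_perfectly_reconstructible(v, rng.normal(size=(n, m)), b, c)
               for _ in range(trials))
    return hits / trials

def expected_num_pr(n, m):
    """E[#PR vectors] = sum_r C(n, r) * Pr(v with r ones is PR)."""
    return sum(math.comb(n, r) * pr_probability(n, m, r)
               for r in range(1, n + 1))
```

Before stating our main theorem, we state a few lemmas.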

Lemma 5.1 (Upper bound). For the set $\mathcal{RBM}_{n,m}$, if a given vector $\mathbf{v}$ has $r$ ones and $g(\mathbf{v})$ has $s$ ones, with $r, s \geq 1$, then the probability that $\mathbf{v}$ is perfectly reconstructible admits a closed-form upper bound, obtained by treating the reconstruction events as independent (the exact expression is given in the Supplementary material).

Proof.

See Supplementary Material. ∎

For $r$ ones in $\mathbf{v}$ and $s$ ones in $g(\mathbf{v})$, the problem of computing the probability of perfect reconstruction can be reformulated in terms of matrix row and column sums: given a matrix whose entries are all i.i.d. Gaussian, and given that all the column sums are positive, compute the probability that all the row sums are positive as well. Conditioned on the column sums being positive, the row-sum random variables are negatively correlated; neglecting this correlation gives the upper bound of Lemma 5.1. We now derive a lower bound for the estimate.

Lemma 5.2 (Lower bound). For the set $\mathcal{RBM}_{n,m}$, if $\mathbf{v}$ has $r$ ones and $g(\mathbf{v})$ has $s$ ones, then, conditioned on the column sums being positive, the moments of the posterior distribution of the row sums can be computed in closed form (the expressions are given in the Supplementary material).

Proof.

See Supplementary Material. ∎

Lemma 5.1 gives an upper bound on the expected number of PR vectors, while Lemma 5.2 gives a posterior distribution on the row sums after taking into account the conditional correlation between them; this eventually yields a lower bound. Thus, even though a closed-form expression for the ISC is difficult, we obtain bounds on it, as the following theorem states.

Theorem 5.1 (ISC of $\mathcal{RBM}_{n,m}$). There exist non-trivial functions $\mathcal{C}_{lb}(n,m)$ and $\mathcal{C}_{ub}(n,m)$ such that the ISC of the set $\mathcal{RBM}_{n,m}$ obeys the inequality $\mathcal{C}_{lb}(n,m) \leq \mathcal{C}(n,m) \leq \mathcal{C}_{ub}(n,m)$.

Proof.

See Supplementary material. ∎

5.2 Need for more hidden layers

Theorem 5.1 establishes the lower and upper bounds for the ISC. A direct corollary of the theorem establishes that $\mathcal{C}(n,m)$ approaches a limit as $m$ increases.

Corollary 5.2 (Large-$m$ limit). For the set $\mathcal{RBM}_{n,m}$, $\lim_{m \to \infty} \mathcal{C}(n,m)$ exists and equals the common limit of the bounds defined in Theorem 5.1.

Proof.

In the Supplementary material we show that $\lim_{m \to \infty} \mathcal{C}_{lb}(n,m) = \lim_{m \to \infty} \mathcal{C}_{ub}(n,m)$. The claim then follows from the squeeze theorem. ∎

Empirically we observe that this saturation limit is already achieved for moderately large $m$ (see Figure 3). Here we discuss the implications of the results derived in the previous subsection.

  1. We plotted the actual expected number of modes attained and the ISC estimates derived from Theorem 5.1 for a fixed number of visible units and a varying number of hidden units (Figure 1). We can see that even a small number of hidden units admits a large ISC, and that the currently known bound given in Equation (4) is far from necessary. This shows that for a large class of distributions we give a more practical estimate of the number of hidden units required than the current state of the art.

  2. The upper bound on the ISC estimated above seems surprising at first sight, since it appears to contradict the well-established fact that RBMs are universal approximators [7, 9]. However, one should note that the bound is in the expected sense, which means that in the family many RBMs will have a number of modes close to or below the saturation value. For the class of input distributions whose number of modes exceeds this value, training a single-layer RBM to represent them might be difficult. The need for multi-layering arises in such conditions.

  3. Corollary 5.2 shows that for large enough $m$ the bounds become tight and the expression is exact. We also show this through simulations in Section 7.

Remark.

When $m \gg n$, the ISC can be approximated by a relatively simple closed-form expression, which we use to conduct further analysis.

6 ISC of the two-layer RBM architecture

To study the effect of adding layers, we consider the family $\mathcal{DBM}_{n,m_1,m_2}$. As stated in Section 4, adapting the analysis for single-layer RBMs to multi-layer RBMs is not straightforward. In this section we discuss the computation of the ISC for two layers and study its application to the design of RBMs.

6.1 Computing the capacity of a two-layer RBM

We observe that a $\mathrm{DBM}_{n,m_1,m_2}$ shares the same bipartite structure as a single-layer RBM (Figure 2): the visible layer and the second hidden layer together form one side of the bipartition, while the first hidden layer forms the other. This enables us to extend our single-layer result to two layers. We introduce a threshold quantity whose value was obtained by simulating the asymptotics of the single-layer bound.

Figure 2: A two-layer DBM shares the same bipartite graph structure as a single-layer RBM: the visible units and the second hidden layer together form one side of the bipartition, and the first hidden layer forms the other.
Theorem 6.1 (ISC of $\mathcal{DBM}_{n,m_1,m_2}$). For a $\mathrm{DBM}_{n,m_1,m_2}$ (with $m_1, m_2 \geq 1$), the ISC admits a bound analogous to Theorem 5.1, expressed in terms of the relative sizes of the layers, whenever the corresponding threshold condition on those sizes holds (the precise statement is given in the Supplementary material).

Proof.

See Supplementary material. ∎

Theorem 6.1 gives a general formula from which different regimes can be derived by varying the relative sizes of the layers. We will use this theorem to understand the design of multi-layer RBMs.

In the previous section we saw that in a single-layer RBM, irrespective of the number of hidden units, the ISC achieves a limiting value of approximately 0.585. The theorem is useful to quantitatively show that the ISC can indeed be improved if we consider layering. For a $\mathrm{DBM}_{n,m_1,m_2}$ we consider the relative sizes of the two hidden layers. We say that a layer with $m_l$ hidden units is narrow if $m_l = O(n)$ and wide otherwise.

Regime 1: ISC determined only by $m_2$. For a single-layer RBM, a further increase in hidden units is not effective; multi-layering is recommended.
Regime 2: Given a budget of parameters, this regime gives the maximum ISC achievable with an optimal choice of the layer sizes.
Regime 3: If the total number of parameters is below a threshold, multi-layering does not help.
Table 1: ISC values for the different regimes (threshold obtained by simulating the asymptotics of the single-layer bound).
Corollary (Layer 1 Wide, Layer 2 Narrow). For a $\mathrm{DBM}_{n,m_1,m_2}$, if the first layer is wide and the second layer is narrow, then the upper bound on the ISC grows linearly in $m_2$ (the precise expression is given in the Supplementary material).

Proof.

See Supplementary material. ∎

The corollary shows that for an RBM with a wide first layer and a narrow second layer, the upper bound on the ISC increases linearly with the number of units in the second layer.

6.2 DBM design under a budget on parameters

We extend the result obtained in the previous section to a realistic scenario wherein we have a budget on the maximum number of parameters that we can use and have to design a two-layer DBM given this constraint. For a given input distribution with $k$ modes, the DBM should satisfy criterion (9).

Corollary 6.2 (Fixed budget on parameters). For a $\mathrm{DBM}_{n,m_1,m_2}$, if there is a budget $B$ on the total number of parameters, i.e., $n m_1 + m_1 m_2 \leq B$, then the maximum possible ISC is attained at an optimal split of the hidden units between the two layers (the closed-form split is given in the Supplementary material).

Proof.

See Supplementary material. ∎

Corollary 6.2 can be used to determine the optimal allocation of hidden units to the two layers when there is a budget on the number of parameters due to computational power or time constraints. It says that if the budget exceeds a threshold, then for optimality the units should be split between the two layers, and if it is below the threshold, then all hidden units should be added to layer 1. The following corollary highlights the existence of a two-layer architecture whose ISC equals the saturation limit for single-layer RBMs.

Corollary 1.

There exists a two-layer architecture with $O(n^2)$ parameters whose ISC equals the single-layer saturation limit of approximately 0.585, with both hidden layers of size $O(n)$.

Proof.

In Corollary 6.2, choosing the budget so that both hidden layers have $O(n)$ units yields an ISC equal to the saturation limit. The number of parameters for such an RBM is $O(n^2)$. ∎

The number of parameters for any single-layer RBM is $nm + n + m$, where $m$ is the number of hidden units. The above corollary gives an important insight: one can construct a two-layer RBM with $O(n^2)$ parameters that has the same ISC as a single-layer RBM with infinitely many hidden units. Of course, this holds only if the upper bound is close to the actual ISC. This suggests that lean two-layer networks with an order of magnitude fewer parameters can achieve the same ISC as a wide single-layer RBM.

Table 1 summarises the ISC values for the different regimes and their respective implications for the two-hidden-layer DBM. For example, when the first layer is very wide, the capacity is dictated only by the number of hidden units in the second layer, and increasing $m_1$ has no effect; multi-layering should be considered to handle distributions with many modes. Also, considering a practical scenario where a computational or memory constraint translates into a budget on the number of parameters, we obtain the optimal distribution of hidden units between the two layers that maximizes the capacity. In particular, if the budget is below the threshold, it is recommended to allocate all hidden units to layer 1 instead of adding more layers.

7 Experimental Results

Our main goals are to experimentally verify Theorems 5.1 and 6.1 and Corollaries 6.2 and 1. All experiments were run on a CPU machine with 2 Xeon Quad-Core processors (2.60GHz, 12MB L2 cache) and 16GB memory running Ubuntu 16.04.

7.1 Validating estimate of Number of modes

To verify the theoretical claims of Theorems 5.1 and 6.1, a number of simulation experiments with varied numbers of visible and hidden units were conducted. To enable execution of exhaustive tests in reasonable time, the values of $n$ and $m$ had to be kept small. The entries of the weight matrix were drawn i.i.d. from a mean-zero normal distribution. Each of the $2^n$ visible vectors (leaving out the trivial all-zero vector) was then tested for being perfectly reconstructible. A comparison of the theoretical predictions and experimental results is shown in Figures 3 and 8 for single-layer and two-layer RBMs respectively. It can be seen that the theoretical predictions follow the same trend as the experimental results.

Figure 3: Comparison of the upper and lower estimates with the actual simulated value of the expected number of modes as the number of hidden units grows.
Table 2: Actual ISC for $\mathcal{RBM}_{n,m}$ with large $m$: 0.585, 0.585, 0.588. Values obtained by averaging brute-force enumeration over 2000 independent instantiations of the weight matrix, i.e., $\frac{1}{n}\log_2\left(\frac{1}{2000}\sum_i K_i\right)$, where $K_i$ is the number of modes enumerated in the $i$-th instantiation.

Discussion. Figure 3 shows that the predicted bounds on the number of modes are close to the actual enumerated modes. Table 2 validates the claim that for an $\mathrm{RBM}_{n,m}$, as $m \to \infty$, the ISC approaches its limit (Corollary 5.2). To enable brute-force enumeration in reasonable time, the values of $n$ and $m$ had to be kept small. Figure 8 in the supplementary section shows the theoretical upper bound and the actual simulated ISC values for a DBM with 2 hidden layers when we fix the total number of hidden units ($m_1 + m_2$) and vary the ratio $m_1 : m_2$. It can be seen that the theoretical predictions and the actual simulation results are closely aligned.

7.2 DBM design under budget on parameters

To validate the claim made in Corollary 6.2, we considered training a DBM with two hidden layers on the MNIST dataset. For this dataset, the standard architecture for a two-hidden-layer DBM uses 500 and 1000 hidden units (784x500x1000) [18, 19, 8]. In this case $n = 784$ and the number of parameters is 894,284. Under a budget on the number of parameters, Corollary 6.2 suggests a better split of the hidden units. Accordingly, we trained a DBM with architecture 784x945x161 (Recommended), with 894,915 parameters. We note that this parameter count is similar to that of the standard architecture 784x500x1000 (Classical), with 894,284 parameters.
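The parameter counts quoted above (weights plus biases for both hidden layers) can be reproduced with a few lines:

```python
def dbm_params(n, m1, m2):
    """Weights (n*m1 + m1*m2) plus biases (n + m1 + m2) of a 2-layer DBM."""
    return n * m1 + m1 * m2 + n + m1 + m2

print(dbm_params(784, 500, 1000))  # Classical   784x500x1000 -> 894284
print(dbm_params(784, 945, 161))   # Recommended 784x945x161  -> 894915
```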

We used the standard metric, the average log-likelihood of test data [18, 19], as the measure of comparison. To estimate the model's partition function we used Annealed Importance Sampling with 20,000 inverse temperatures spaced uniformly from 0 to 1.0.

Discussion. The classical tuned architecture for training a DBM with 2 hidden layers on the original MNIST dataset gives a log-likelihood of -84.62. Using our recommended architecture, we obtained a matching log-likelihood of -84.29 without significant tuning.

7.3 Wide single layer RBM vs lean two-layered DBM

To verify our claim in Corollary 1, we chose single-layer RBMs with a fixed number of visible units and a varying number of hidden units $m$. We initialized the weights and biases of each RBM architecture randomly and then performed Gibbs sampling for 5000 steps to generate a synthetic dataset of 60,000 points. The same dataset was then used for training and evaluating the corresponding multi-layer DBM architecture suggested by our formula. The resulting test-set log-likelihoods are depicted in Figure 4.
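For completeness, here is a minimal sketch of the block Gibbs sampler used to generate the synthetic data (our own illustrative code; the conditionals are the standard RBM ones under the $\{0,1\}$ convention):

```python
import numpy as np

def gibbs_sample(W, b, c, steps=5000, seed=0):
    """Run block Gibbs sampling on an RBM and return the final visible state."""
    rng = np.random.default_rng(seed)
    n, m = W.shape
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    v = rng.integers(0, 2, size=n)
    for _ in range(steps):
        h = (rng.random(m) < sigmoid(W.T @ v + c)).astype(int)
        v = (rng.random(n) < sigmoid(W @ h + b)).astype(int)
    return v
```

Running independent chains (or thinning one long chain) 60,000 times yields the synthetic training set used in the comparison.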

Figure 4: Comparison of the test-set log-likelihood attained by a wide single-layer RBM and the corresponding two-layer DBM. It can be seen that the DBM, with far fewer parameters, gives at least as good a log-likelihood as the RBM.

Discussion. We can see that the optimal DBM architecture gives the same or better log-likelihood despite having fewer parameters than the corresponding single-layer RBM, thus justifying our claim.

8 Conclusion

We studied the IS formalism, first introduced to study spin glasses, to understand the energy landscape of one- and two-layer DBMs, and proposed the ISC, a measure of the representation power of RBMs. The ISC yields practical suggestions, such as: whenever the number of hidden units in a single layer grows large, the ISC saturates and multi-layering should be considered. The ISC also suggests alternative two-layer architectures that match or exceed the representational power of wide single-layer RBMs with far fewer parameters.

Acknowledgment

The authors would like to thank the referees for their insightful comments. CB gratefully acknowledges partial support from a generous grant from Microsoft Research India.

Supplementary material

In the following sections we provide additional material (proofs and figures) that supplements our main results. Section A outlines the preliminary facts and notation that we use in the proofs. The subsequent sections provide detailed proofs of the respective lemmas and theorems. Figure 8 compares the theoretical upper-bound estimate with the actual simulated values for the modes of two-layer DBMs.

Appendix A Preliminary Facts and Notations

In the proofs that follow we use the following facts and notations:

  1. The probability density function (pdf) of the standard normal distribution: $\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$.

  2. The cumulative distribution function (cdf) of the standard normal distribution: $\Phi(x) = \int_{-\infty}^{x} \phi(t)\, dt$.

  3. The pdf of a skew-normal distribution with skew parameter $\alpha$: $f(x) = 2\phi(x)\Phi(\alpha x)$.

  4. If $X \sim \mathcal{N}(\mu, \sigma^2)$, then conditioned on $X > 0$, $X$ follows a truncated normal distribution with moments

     $\mathbb{E}[X \mid X > 0] = \mu + \sigma h, \qquad \mathrm{Var}[X \mid X > 0] = \sigma^2\left(1 - \tfrac{\mu}{\sigma} h - h^2\right),$

     where $h = \phi(\mu/\sigma)/\Phi(\mu/\sigma)$.

  5. Squeeze theorem: let $a_N, b_N, c_N$ be sequences such that $a_N \leq b_N \leq c_N$ for all $N \geq N_0$. Further, let $\lim_{N\to\infty} a_N = \lim_{N\to\infty} c_N = L$; then $\lim_{N\to\infty} b_N = L$.
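As a quick numeric check of Fact 4, the closed-form moments can be compared against a Monte Carlo estimate (illustrative code, assuming NumPy and SciPy):

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.5, 2.0
h = norm.pdf(mu / sigma) / norm.cdf(mu / sigma)
mean_cf = mu + sigma * h                            # E[X | X > 0]
var_cf = sigma**2 * (1 - (mu / sigma) * h - h**2)   # Var[X | X > 0]

rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, size=1_000_000)
x = x[x > 0]
print(mean_cf, x.mean())  # the two columns agree to ~3 decimals
print(var_cf, x.var())
```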

Appendix B Proof of Lemma 4

Lemma 4 (restated). A vector $\mathbf{v}$ is perfectly reconstructible for an $\mathrm{RBM}_{n,m}$ if and only if the state $(\mathbf{v}, g(\mathbf{v}))$ is one-flip stable.

Proof.

Let $\mathbf{h} = g(\mathbf{v})$ (conditioning on $\theta$ is implicit). If $\mathbf{v}$ is perfectly reconstructible, then $f(\mathbf{h}) = \mathbf{v}$. Since $\mathbf{h}$ maximizes $p(\mathbf{h} \mid \mathbf{v})$ and $\mathbf{v}$ maximizes $p(\mathbf{v} \mid \mathbf{h})$, the state $(\mathbf{v}, \mathbf{h})$ is stable against any number of flips of visible units and against any number of flips of hidden units; in particular, it is at least one-flip stable.
Conversely, let $(\mathbf{v}, \mathbf{h})$ be one-flip stable. We prove by contradiction that $\mathbf{h} = g(\mathbf{v})$ and $\mathbf{v} = f(\mathbf{h})$. Assume $\mathbf{h} \neq g(\mathbf{v})$. We use the fact that for an RBM the hidden units are conditionally independent of each other given the visible units, so $p(\mathbf{h} \mid \mathbf{v}) = \prod_j p(h_j \mid \mathbf{v})$. Let $j$ be an index such that $h_j \neq g(\mathbf{v})_j$. Since $g(\mathbf{v})$ is the coordinate-wise maximizer, $p(g(\mathbf{v})_j \mid \mathbf{v}) > p(h_j \mid \mathbf{v})$. Thus, just by flipping $h_j$, we can increase the probability of the state $(\mathbf{v}, \mathbf{h})$. This contradicts the one-flip stability hypothesis. Similarly, using the conditional independence of the visible units given the hidden units, we can show that $\mathbf{v} = f(\mathbf{h})$. ∎

Appendix C Proof of Lemma 5.1

Lemma 5.1 (restated).

Proof.

We first note that, given a visible vector $\mathbf{v}$, the most likely configuration of the hidden vector is obtained coordinate-wise by thresholding $W^\top \mathbf{v}$. Likewise, given a hidden vector $\mathbf{h}$, the most likely visible vector is obtained by thresholding $W \mathbf{h}$.

Case 1:
By symmetry it can be assumed that the first $r$ entries of $\mathbf{v}$ are one. Then each coordinate of $W^\top \mathbf{v}$ is a sum of $r$ i.i.d. $\mathcal{N}(0, \sigma^2)$ entries, so each $h_j$ is a Bernoulli random variable with parameter $1/2$. Again by symmetry, it is assumed that the first $s$ hidden units are one. The most likely reconstructed visible vector is then obtained by thresholding the row sums over the corresponding columns, and each of its coordinates is again a Bernoulli random variable with parameter $1/2$. The result then follows by the mutual independence of the relevant sums.

Case 2:
For $r$ ones in $\mathbf{v}$ and $s$ ones in $g(\mathbf{v})$, the problem of computing the probability of perfect reconstruction can be reformulated in terms of matrix row and column sums: given a matrix whose entries are all i.i.d. Gaussian, and given that all the column sums are positive, compute the probability that all the row sums are positive.

Using properties of the normal distribution, it can be shown that, conditioned on a column sum being positive, the posterior distribution of each entry in that column is skew-normal with the corresponding mean and variance. Since the entries are independent, the posterior mean and variance of each row sum are the sums of the entry-wise moments, and by the Central Limit Theorem the row sums approximately follow a normal distribution. Since the row sums are negatively correlated (proof follows), by similar reasoning as in Case 1 we obtain the desired upper bound.

Negatively correlated row sums: conditioned on the column sums being positive, the row-sum random variables are not independent; they are negatively correlated, because the conditional covariance between any pair of row sums is negative.

Hence the expression given in Lemma 5.1 is an upper bound, since we have neglected the negative correlation among the row sums and in the process over-estimated the probabilities. ∎

Appendix D Proof of Lemma 5.2

Lemma 5.2 (restated).

Proof.

The conditional distribution of each matrix entry, given that its column sum is positive, is obtained from the proof of Lemma 5.1. Using similar arguments as in that proof, conditioned on the column sums being positive, the posterior distribution of each entry is skew-normal. Then, conditioned on the column sums, each row sum is distributed as per a skew-normal distribution.