Clustering of solutions in the symmetric binary perceptron


Carlo Baldassi (Artificial Intelligence Lab, Institute for Data Science and Analytics, Bocconi University, Milano, Italy; Istituto Nazionale di Fisica Nucleare, Sezione di Torino, Italy) · Riccardo Della Vecchia (Artificial Intelligence Lab, Institute for Data Science and Analytics, Bocconi University, Milano, Italy) · Carlo Lucibello (Artificial Intelligence Lab, Institute for Data Science and Analytics, Bocconi University, Milano, Italy) · Riccardo Zecchina (Artificial Intelligence Lab, Institute for Data Science and Analytics, Bocconi University, Milano, Italy; International Centre for Theoretical Physics, Trieste, Italy)
Abstract

The geometrical features of the (non-convex) loss landscape of neural network models are crucial in ensuring successful optimization and, most importantly, the capability to generalize well. While the flatness of minimizers consistently correlates with good generalization, there has been little rigorous work exploring the conditions for the existence of such minimizers, even in toy models. Here we consider a simple neural network model, the symmetric perceptron, with binary weights. Phrasing the learning problem as a constraint satisfaction problem, the analogue of a flat minimizer becomes a large and dense cluster of solutions, while the narrowest minimizers are isolated solutions. We perform the first steps toward a rigorous proof of the existence of a dense cluster in certain regimes of the parameters, by computing first and second moment bounds for the existence of pairs of arbitrarily close solutions. Moreover, we present a non-rigorous derivation of the same bounds for sets of solutions at fixed pairwise distances.

I Introduction

The problem of learning to classify a set of random patterns with a binary perceptron has been a recurrent topic since the very beginning of the statistical physics studies of neural network models (gardner1988optimal, ). The learning problem consists in finding the optimal binary assignments of the connection weights which minimize the number of misclassifications of the patterns. We shall refer to such a set of optimal assignments as the space of solutions of the perceptron. In spite of the extremely simple architecture of the model, the learning task is highly non-convex and its geometrical features are believed to play a role also in more complex neural architectures (watkin1993statistical, ; seung1992statistical, ; engel2001statistical, ).

For the case of random i.i.d. patterns, the space of solutions of the binary perceptron is known to be dominated by an exponential number of isolated solutions (krauth1989storage, ) which lie at large mutual Hamming distances (huang2013entropy, ; huang2014origin, ) (a golf-course landscape). An even larger number of local minima have been shown to exist.

The study of how the number of these isolated solutions decreases as more patterns are learned provides the correct prediction for the so-called capacity of the binary perceptron, i.e. the maximum number of random patterns which can be correctly classified. However, the same analysis does not provide the insight which is needed for understanding the behavior of learning algorithms: one would expect that finding solutions in a golf course landscape should be difficult for search algorithms, and indeed Monte Carlo based algorithms satisfying detailed balance get stuck in local minima; yet, empirical results have shown that many learning algorithms, even simple ones, are able to find solutions efficiently (braunstein2006learning, ; baldassi2007efficient, ; baldassi2009generalization, ; baldassi2015max, ).

These empirical results suggested that the solutions which were not the dominant ones in the Gibbs measure, and were as such neglected in the analysis of the capacity, could in fact play an important algorithmic role. As discussed in refs. (baldassi2015subdominant, ; baldassi2016local, ) this turned out to be the actual case: the standard Gibbs measure analysis of the dominant solutions does not take into account the existence of rare (sub-dominant) regions in the solution space, which are those found by algorithms. Revealing those rare, accessible regions required a large deviation analysis based on the notion of local entropy. The regions of maximal local entropy are extremely dense in solutions. More recently, the existence of high local entropy regions has been found also in multi-layer networks with continuous weights, and their role has been connected to the structural characteristics of deep neural networks (baldassi2019shaping, ; baldassi2019properties, ).

All the above results rely on methods of statistical mechanics of disordered systems which are extremely powerful and yet not fully rigorous. It is therefore important to corroborate them with rigorous bounds (ding2019capacity, ). In a recent paper (aubin2019storage, ), Aubin et al. have studied a simple variant of the binary perceptron model for which the rigorous bounds provided by first and second moment methods can be shown to be tight. The authors have been able to confirm the predictions of the statistical physics methods concerning the capacity of the model, and the golf course nature of the space of solutions. The model that the authors have studied has a modified activation criterion compared to the traditional perceptron, replacing the Heaviside step function by a function with an even symmetry.

The goal of the present paper is to study the existence of dense regions in the symmetrized binary perceptron model, employing the first and second moment methods where possible. As a preliminary motivating step, we first present in sec. II the results of the replica-method large deviation analysis, which predicts that the phenomenology for the symmetrized model is the same as for the traditional one, and thus that high local entropy regions exist. In sec. III we then extend the analysis of ref. (aubin2019storage, ) and show rigorously (except for a numerical optimization step) that, in addition to the isolated solutions, there exist an exponential number of pairs of solutions at arbitrary Hamming distance. In sec. IV we study more generally the existence of groups of $y$ solutions all at fixed mutual distance: for $y=3$ or $y=4$, we can derive a rigorous upper bound that coincides with the non-rigorous results for general $y$; as for the lower bound, only the case $y=2$ can be derived rigorously (and again it coincides with the non-rigorous results that we also derive). All the results are consistent with the existence of high local entropy regions, as predicted by the large deviation study.

II Dense clusters in the symmetric binary perceptron

II.1 Model definition

We investigate the rectangular-binary-perceptron (RBP) problem. The RBP has the key property of having a symmetric activation function, characterized by a parameter $K$, the half-width of the channel. This symmetry simplifies the theoretical analysis and allows us to obtain tighter bounds through the first and second moment methods. The RBP problem can be expressed as a constraint satisfaction problem (CSP). An instance of the problem is defined by a set of $M$ examples $\xi^{\mu} \in \mathbb{R}^{N}$, $\mu = 1, \dots, M$, in dimension $N$. Throughout the paper we will assume the entries $\xi^{\mu}_{i}$ to be i.i.d. Gaussian variables with zero mean and unit variance. A binary vector $w \in \{-1,+1\}^{N}$ is called a solution of the problem if it satisfies

$\left| \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \xi^{\mu}_{i} w_{i} \right| \le K \qquad \forall \mu = 1, \dots, M.$   (1)

Equivalently, a vector $w$ is a solution of the RBP problem iff the function $X_{\xi}(w)$, defined as

$X_{\xi}(w) = \prod_{\mu=1}^{M} \mathbb{1}\!\left( \left| \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \xi^{\mu}_{i} w_{i} \right| \le K \right),$   (2)

is equal to one, where we have denoted with $\mathbb{1}(\cdot)$ an indicator function that is $1$ if the statement is true and $0$ otherwise.
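To make the constraint concrete, here is a minimal Python sketch of the membership test of eqs. (1)-(2); the function name and the instance sizes are our own choices:

```python
import math
import random

def is_solution(xi, w, K):
    """RBP membership test: |sum_i xi[mu][i] * w[i]| / sqrt(N) <= K for every pattern mu."""
    N = len(w)
    return all(
        abs(sum(x * s for x, s in zip(row, w))) <= K * math.sqrt(N)
        for row in xi
    )

# Small random instance: M Gaussian patterns in dimension N, random binary weights.
random.seed(0)
N, M, K = 20, 5, 1.0
xi = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(M)]
w = [random.choice((-1, 1)) for _ in range(N)]
print(is_solution(xi, w, K))
```

Note that by the even symmetry of the constraint, $w$ is a solution if and only if $-w$ is.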

The storage capacity is then defined similarly to the satisfiability threshold in random constraint satisfaction problems: we denote the constraint density as $\alpha = M/N$ and define the storage capacity $\alpha_c(K)$ as the infimum of densities $\alpha$ such that, in the limit $N \to \infty$, with high probability (over the choice of the matrix $\xi$) there are no solutions. It is natural to conjecture that the converse also holds, i.e. that the storage capacity equals the supremum of $\alpha$ such that in the limit $N \to \infty$ solutions exist with high probability. In this case we would say the storage capacity is a sharp threshold.
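The sharp-threshold picture can be probed empirically at very small $N$ by exhaustive enumeration; the sketch below is our own toy experiment (with arbitrarily chosen $N$, $K$, trial counts and $\alpha$ values), estimating the probability that a random instance is satisfiable at a low and at a high constraint density:

```python
import math
import random
from itertools import product

def has_solution(xi, K, N):
    # Exhaustive search over all 2^N binary vectors (feasible only for tiny N).
    limit = K * math.sqrt(N)
    return any(
        all(abs(sum(x * s for x, s in zip(row, w))) <= limit for row in xi)
        for w in product((-1, 1), repeat=N)
    )

random.seed(1)
N, K, trials = 12, 1.0, 10
p_sat = {}
for alpha in (0.2, 4.0):
    M = int(alpha * N)
    hits = sum(
        has_solution(
            [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(M)], K, N
        )
        for _ in range(trials)
    )
    p_sat[alpha] = hits / trials
print(p_sat)
```

At such small $N$ the transition is of course heavily smoothed, but the satisfiability probability already drops markedly between the two densities.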

II.2 Large deviation analysis

In order to obtain a geometric characterization of the solution space, we consider the Hamming distance of any two configurations $w^1$ and $w^2$, defined by

$d_H(w^1, w^2) = \frac{1}{2} \sum_{i=1}^{N} \left( 1 - w^1_i w^2_i \right),$

and investigate the problem of finding a set of $y$ solutions of the RBP problem, where $y$ is an arbitrary natural number, with all pairwise distances constrained to some value. The existence of such a set of solutions, w.h.p. in the large $N$ limit, for some range of $\alpha$, for large values of $y$ and for small values of the distance constraints (on the scale of $N$), would point to the presence of a dense region of solutions, coexisting with an exponentially larger number of isolated ones (baldassi2015subdominant, ).

As a starting point for the analysis we introduce the partition function of the model with $y$ real replicas, $Z_y(d)$, accounting for the number of such sets (up to a symmetry factor). For any fixed (normalized) distance $d$, this is given by

$Z_y(d) = \sum_{w^1, \dots, w^y} \prod_{a=1}^{y} X_{\xi}(w^a) \prod_{a<b} \mathbb{1}\!\left( d_H(w^a, w^b) = N d \right).$   (3)

The summation here is over the spin configurations $w^a \in \{-1,+1\}^N$. The asymptotic behavior is captured by the (normalized) free local entropy defined by

$\phi_y(d) = \lim_{N \to \infty} \frac{1}{N y} \, \mathbb{E}_{\xi} \log Z_y(d).$   (4)

The computation of this quantity can be approached by rigorous techniques only for small $y$, as discussed in the next sections. In the general case, for any finite $y$ and in the $N \to \infty$ limit, it can be carried out at present only using the non-rigorous replica method of statistical physics of disordered systems. The computations for this model follow entirely those of ref. (baldassi2015subdominant, ) and are reported in Appendix C.

Figure 1: Plot of the free local entropy (logarithm of the number of solutions) as a function of the normalized Hamming distance $d$ between solutions, when we use a large deviation measure that focuses on regions of maximal local density, obtained with the replica method using the replica-symmetric ansatz (see Appendix C for the details). On the left there are curves for increasing values of $\alpha$, in uniform steps. When the distance approaches zero we see that all curves tend to coincide with the curve for $\alpha = 0$, meaning that there exist regions of solutions that are maximally dense (nearly all configurations are solutions) in their immediate surroundings. On the right we focus on the interval of values of $\alpha$ where there is a change in monotonicity, which we interpret as signaling a fragmentation of the dense clusters into separate pieces. Refining the step of $\alpha$, we locate the value at which the change happens.

The replica analysis in the $N \to \infty$ limit reveals the existence of ultra-dense regions of solutions: as shown in fig. 1, for sufficiently small $d$ the curves for $\alpha > 0$ tend to collapse onto the curve for $\alpha = 0$, implying that these regions are maximally dense in their immediate surroundings (nearly all configurations are solutions in an extensive region). Furthermore, there is a transition at a certain value of $\alpha$ after which the curves are no longer monotonic. Overall, this is the same phenomenology that was observed in the standard binary perceptron case in ref. (baldassi2015subdominant, ), and we interpret it in the same way, i.e. we speculate that ultra-dense sub-dominant regions of solutions exist, and that the breaking of monotonicity signals a transition between two regimes: one at low $\alpha$ in which the ultra-dense regions are immersed in a vast connected structure, and one at high $\alpha$ in which the structure of the dense solutions fragments into separate regions that are no longer easily accessible. (It should be noted that in the standard binary perceptron case, i.e. with sign activation, there is empirical evidence only for the first scenario of a vast connected structure with ultra-dense regions in it, while the second scenario of fragmented regions has never been directly observed, arguably due to the intrinsic algorithmic hardness of finding such regions.)

These results were obtained with the so-called replica-symmetric ansatz, and they are certainly not exact. However, as in previous studies, the corrections (which would require the use of a replica-symmetry-broken ansatz) only become numerically relevant at relatively large $\alpha$, and they do not affect the qualitative picture, the emerging phenomenology and its physical interpretation.

III Pairs of solutions ($y=2$): rigorous bounds

We are able to derive rigorous lower and upper bounds for the existence of pairs of solutions, i.e. for the $y=2$ case, without resorting to the replica method. The idea of the derivation follows very closely the strategy used in refs. (mezard2005clustering, ; daude2008pairs, ) for the random K-SAT problem.

We define a SAT-$d$-pair as a pair of binary weight vectors $(w^1, w^2)$, which are both solutions of the CSP, and whose Hamming distance is $Nd$. The number of such pairs is $Z_2(d)$, see eq. (3).
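For intuition, the pair counts can be computed exhaustively on a toy instance (a sketch with our own, arbitrarily small sizes; only feasible for tiny $N$):

```python
import math
import random
from itertools import product

random.seed(2)
N, M, K = 10, 3, 1.0
xi = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(M)]
limit = K * math.sqrt(N)

# Enumerate all solutions of the CSP.
sols = [
    w for w in product((-1, 1), repeat=N)
    if all(abs(sum(x * s for x, s in zip(row, w))) <= limit for row in xi)
]

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

# Number of ordered pairs of solutions at each Hamming distance.
Z2 = {dist: 0 for dist in range(N + 1)}
for u in sols:
    for v in sols:
        Z2[hamming(u, v)] += 1
print(len(sols), Z2)
```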

III.1 Upper bound: the first moment method

In this section we are interested in finding an upper bound (which depends on $d$) to the critical capacity of pairs of solutions. To do that we use the first moment (Markov) upper bound $\Pr(Z \ge 1) \le \mathbb{E}\, Z$, which holds when the random variable $Z$ is a non-negative integer. When we apply it to the random variable $Z_2(d)$ we get:

$\Pr\left( Z_2(d) \ge 1 \right) \le \mathbb{E}\, Z_2(d) \le 2^N \binom{N}{Nd} \Pr\left( |X| \le K, \ |Y| \le K \right)^M,$   (5)

where we have bounded the sum by the maximal term times the number of terms and we have introduced the two Gaussian random variables $X$ and $Y$, with zero mean, unit variance and covariance $\mathbb{E}[XY] = 1 - 2d$.

Let us consider the normalized logarithm of the first moment,

$\phi_1(d) \equiv \lim_{N \to \infty} \frac{1}{N} \log \mathbb{E}\, Z_2(d) = \log 2 + H(d) + \alpha \log p_2(1-2d),$

where $H(d) = -d \log d - (1-d) \log (1-d)$ is the two-state entropy function, while $p_2(c)$ is defined as follows. Denote with $\Sigma(c)$ the covariance matrix of the two-dimensional Gaussian random vector $(X, Y)$ whose components have covariance equal to $c$ and variances equal to one. We define $p_2(c)$ as the probability that this random vector takes values in the box $[-K, K]^2$:

$p_2(c) = \int_{[-K,K]^2} \frac{e^{-\frac{1}{2} x^{\top} \Sigma(c)^{-1} x}}{2 \pi \sqrt{\det \Sigma(c)}} \, \mathrm{d}x.$   (6)

From the inequality (5), $\phi_1(d) < 0$ implies that $\Pr(Z_2(d) \ge 1) \to 0$. In turn this provides the upper bound we are seeking:

Upper Bound.

For each $K$ and $d$, and for all $\alpha$ such that

$\alpha > \alpha_{\mathrm{ub}}(d, K) \equiv -\frac{\log 2 + H(d)}{\log p_2(1-2d)},$   (7)

there are no SAT-$d$-pairs w.h.p.

The upper bound on $\alpha$ that we obtained, as a function of $d$, is shown in Fig. 2.
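Assuming the exponential rate of $\mathbb{E}\,Z_2(d)$ takes the form $\log 2 + H(d) + \alpha \log p_2(1-2d)$ discussed above, the bound can be evaluated numerically; below is a sketch using a simple Monte Carlo estimate of the box probability $p_2$ (sample sizes and parameter values are our own choices):

```python
import math
import random

def p2_mc(c, K, samples=200_000, seed=3):
    """Monte Carlo estimate of P(|X| <= K, |Y| <= K) for standard Gaussians
    X, Y with correlation c, using the representation Y = c*X + sqrt(1-c^2)*Z."""
    rng = random.Random(seed)
    t = math.sqrt(1.0 - c * c)
    hits = 0
    for _ in range(samples):
        x = rng.gauss(0.0, 1.0)
        y = c * x + t * rng.gauss(0.0, 1.0)
        if abs(x) <= K and abs(y) <= K:
            hits += 1
    return hits / samples

def H(d):  # two-state entropy function (natural log)
    return -d * math.log(d) - (1.0 - d) * math.log(1.0 - d)

K, d = 1.0, 0.25
p = p2_mc(1.0 - 2.0 * d, K)
alpha_ub = -(math.log(2.0) + H(p and d)) / math.log(p) if False else -(math.log(2.0) + H(d)) / math.log(p)
print(p, alpha_ub)
```

In a production evaluation the two-dimensional Gaussian box probability would of course be computed by deterministic quadrature rather than by sampling.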

III.2 Lower bound: the second moment method

We compute the lower bound to the critical capacity using the second moment method, which is a direct consequence of the Cauchy-Schwarz inequality:

Lemma 1 (Second moment method).

If $Z$ is a non-negative random variable, then

$\Pr(Z > 0) \ge \frac{\left( \mathbb{E}\, Z \right)^2}{\mathbb{E}\, Z^2}.$   (8)
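For completeness, Lemma 1 follows in one line from the Cauchy-Schwarz inequality applied to $Z = Z \cdot \mathbb{1}(Z > 0)$:

```latex
\mathbb{E}[Z] \;=\; \mathbb{E}\!\left[ Z \,\mathbb{1}(Z>0) \right]
\;\le\; \sqrt{ \mathbb{E}\!\left[ Z^2 \right] \, \mathbb{E}\!\left[ \mathbb{1}(Z>0)^2 \right] }
\;=\; \sqrt{ \mathbb{E}\!\left[ Z^2 \right] \, \Pr(Z>0) }
\qquad\Longrightarrow\qquad
\Pr(Z>0) \;\ge\; \frac{ \left( \mathbb{E}[Z] \right)^2 }{ \mathbb{E}\!\left[ Z^2 \right] } .
```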

From the results of section III.1 we have

$\mathbb{E}\, Z_2(d) = 2^N \binom{N}{Nd} \, p_2(1-2d)^M,$

where $p_2$ is defined as in eq. (6). The second moment of the random variable $Z_2(d)$ follows from simple combinatorics and reads

$\mathbb{E}\left[ Z_2(d)^2 \right] = 2^N \sum_{n \in \mathcal{D}} \binom{N}{N n_1, \dots, N n_8} \, p(n)^M,$   (9)

where we have adopted the following conventions.

  • $n$ is an $8$-component vector giving the proportion of each type of quadruplet $(w^1_i, w^2_i, w^3_i, w^4_i)$ as described in the table below, where we have arbitrarily (but without loss of generality) fixed $w^1$ to the all-$+1$ vector. Fixing the vector $n$ entails fixing all the possible overlaps between the vectors $w^1, \dots, w^4$ and consequently the covariances of the random variables $X_a = \frac{1}{\sqrt{N}} \sum_i \xi_i w^a_i$, $a = 1, \dots, 4$, with $\xi_i$ i.i.d. standard Gaussians. These covariances as functions of $n$ are made explicit in eq. (10).

          $n_1$ $n_2$ $n_3$ $n_4$ $n_5$ $n_6$ $n_7$ $n_8$
    $w^1$:  +    +    +    +    +    +    +    +
    $w^2$:  +    +    +    +    -    -    -    -
    $w^3$:  +    +    -    -    +    +    -    -
    $w^4$:  +    -    +    -    +    -    +    -
  • $p(n)$ has the expression

    $p(n) = \Pr\left( |X_a| \le K, \ a = 1, \dots, 4 \right),$

    where $(X_1, X_2, X_3, X_4)$ is a $4$-dimensional Gaussian vector, with the following set of covariances:

    $\mathbb{E}\left[ X_a X_b \right] = \sum_{t=1}^{8} n_t \, s^a_t s^b_t,$   (10)

    where $s^a_t \in \{-1, +1\}$ denotes the sign of $w^a$ in the $t$-th column of the table above. Therefore $p(n)$ can be simply written as the following Gaussian integral:

    $p(n) = \int_{[-K,K]^4} \frac{e^{-\frac{1}{2} x^{\top} \Sigma(n)^{-1} x}}{(2\pi)^2 \sqrt{\det \Sigma(n)}} \, \mathrm{d}x,$

    where $\Sigma(n)$ is the covariance matrix of eq. (10).

  • The set $\mathcal{D}$ is a simplex specified by:

    $n_t \ge 0, \qquad \sum_{t=1}^{8} n_t = 1, \qquad d_H(w^1, w^2) = Nd, \qquad d_H(w^3, w^4) = Nd.$   (11)

    These conditions correspond to the normalization of the proportions and to the enforcement of the two distance constraints. When $d$ is fixed, $\mathcal{D}$ defines a five-dimensional simplex described by the three hyperplanes:

    $\sum_{t=1}^{8} n_t = 1, \qquad n_5 + n_6 + n_7 + n_8 = d, \qquad n_2 + n_3 + n_6 + n_7 = d.$   (12)
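As a consistency check of this bookkeeping (our own sketch, using the gauge in which $w^1$ is flipped to all $+1$), one can verify on random vectors that the eight proportions determine all pairwise overlaps:

```python
import random
from itertools import product

random.seed(4)
N = 64
ws = [[random.choice((-1, 1)) for _ in range(N)] for _ in range(4)]

# Gauge-fix w^1 to +1 by multiplying every vector componentwise by w^1.
gauged = [[w[i] * ws[0][i] for i in range(N)] for w in ws]

# Proportions n of the 8 sign patterns of (w^2_i, w^3_i, w^4_i) after gauge fixing.
patterns = list(product((-1, 1), repeat=3))
n = {s: 0.0 for s in patterns}
for i in range(N):
    n[(gauged[1][i], gauged[2][i], gauged[3][i])] += 1.0 / N

def sign(s, a):
    # Sign of w^a in pattern s; w^1 is identically +1 after gauge fixing.
    return 1 if a == 1 else s[a - 2]

# Overlaps reconstructed from n must match the directly computed ones.
max_err = 0.0
for a in range(1, 5):
    for b in range(a + 1, 5):
        direct = sum(ws[a - 1][i] * ws[b - 1][i] for i in range(N)) / N
        from_n = sum(n[s] * sign(s, a) * sign(s, b) for s in patterns)
        max_err = max(max_err, abs(direct - from_n))
print(max_err)
```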

In order to yield an asymptotic estimate of $\mathbb{E}\left[ Z_2(d)^2 \right]$ we first use the following known result, which comes from the approximation of sums by integrals (proof in Appendix A.2):

Lemma 2.

Let $f$ be a real, positive, continuous function of $n$, and let $n$, $\mathcal{D}$ be as defined above. Then for any given $\alpha$ there exists a constant $C$ such that for sufficiently large $N$:

$\sum_{n \in \mathcal{D} \cap \frac{1}{N} \mathbb{Z}^8} \binom{N}{N n_1, \dots, N n_8} f(n)^{\alpha N} \le C \, N^{5/2} \int_{\mathcal{D}} e^{N \left[ \mathcal{H}(n) + \alpha \log f(n) \right]} \, \mathrm{d}\mu(n),$   (13)

where $\mathcal{H}(n) = -\sum_{t=1}^{8} n_t \log n_t$. Here and below this integration is to be understood as being performed with a uniform measure on the five-dimensional subspace $\mathcal{D}$, i.e. $\mathrm{d}\mu(n) = \mathrm{d}^8 n \, \delta\big(\sum_t n_t - 1\big) \, \delta\big(\sum_{t=5}^{8} n_t - d\big) \, \delta\big(n_2 + n_3 + n_6 + n_7 - d\big)$, where $\delta$ is a Dirac delta, cf. eq. (12).

The bound for the second moment then reads:

$\mathbb{E}\left[ Z_2(d)^2 \right] \le C \, 2^N N^{5/2} \int_{\mathcal{D}} e^{N \left[ \mathcal{H}(n) + \alpha \log p(n) \right]} \, \mathrm{d}\mu(n),$   (14)

which is obtained from substitution of eq. (9) into Lemma 2. The number of components of the vector $n$ is eight, but we can reduce their number to five with a change of variables and rewrite the integral in a particularly simple form where $p$ just depends on four of them. This is done in Appendix A.1. Here we give just the final expression, where the new integration variables are $z$ (a scalar) and the four-component vector $q = (q_{13}, q_{14}, q_{23}, q_{24})$. The bound becomes

$\mathbb{E}\left[ Z_2(d)^2 \right] \le C \, N^{5/2} \int_{\mathcal{B}} e^{N \left[ \log 2 + \mathcal{H}(z, q) + \alpha \log p(q) \right]} \, \mathrm{d}z \, \mathrm{d}q,$   (15)

where:

  • $p(q)$ has the expression

    $p(q) = \Pr\left( |X_a| \le K, \ a = 1, \dots, 4 \right),$

    where the Gaussian vector $(X_1, \dots, X_4)$ has the covariance matrix of eq. (10) with $\mathbb{E}[X_1 X_2] = \mathbb{E}[X_3 X_4] = 1 - 2d$, and where the components of $q$ are the remaining covariances $\mathbb{E}[X_1 X_3]$, $\mathbb{E}[X_1 X_4]$, $\mathbb{E}[X_2 X_3]$, $\mathbb{E}[X_2 X_4]$, considered as independent variables.

  • $\mathcal{H}(z, q)$ is defined as the Shannon entropy of a probability mass function with masses corresponding to the components of the following vector:

    $n_s(z, q) = \frac{1}{8} \left[ 1 + s_2 (1-2d) + s_3 q_{13} + s_4 q_{14} + s_2 s_3 q_{23} + s_2 s_4 q_{24} + s_3 s_4 (1-2d) + s_2 s_3 s_4 \, z \right], \qquad s = (s_2, s_3, s_4) \in \{-1,+1\}^3.$

  • $\mathcal{B}$ is the new domain of integration specified by the following inequalities:

    $n_s(z, q) \ge 0 \qquad \forall s \in \{-1,+1\}^3.$   (16)

The Laplace method used on the integral in (15) yields the following result:

Proposition 1.

For each $(z, q) \in \mathcal{B}$, define:

$\phi(z, q) = \log 2 + \mathcal{H}(z, q) + \alpha \log p(q),$   (17)

and let $\psi$ be the global maximum of $\phi$ restricted to $\mathcal{B}$. Then there exists a constant $C'$ such that, for $N$ sufficiently large,

$\mathbb{E}\left[ Z_2(d)^2 \right] \le C' N^{5/2} e^{N \psi}.$   (18)

Given that $\left( \mathbb{E}\, Z_2(d) \right)^2 \simeq e^{2 N \phi_1(d)}$ up to polynomial factors, and that $\psi \ge 2 \phi_1(d)$ always holds, the second moment method gives a useful bound just when $\psi = 2 \phi_1(d)$. If instead $\psi > 2 \phi_1(d)$, the lower bound on the probability tends to zero and the bound is non-informative.

The only way to obtain $\psi = 2 \phi_1(d)$ is for the maximum to be attained at the particular point $(z^{\star}, q^{\star})$ defined by

$q^{\star} = (0, 0, 0, 0), \qquad z^{\star} = (1-2d)^2,$   (19)

which can be interpreted intuitively as capturing the situation where the two pairs of solutions are uncorrelated with each other. In that case, we have the following properties:

  • $p(q^{\star}) = p_2(1-2d)^2$,

  • $\phi(z^{\star}, q^{\star}) = 2 \phi_1(d)$.

Therefore, $\alpha_{\mathrm{lb}}(d, K)$ is the largest value of $\alpha$ such that $(z^{\star}, q^{\star})$ is the global maximum, i.e. such that there exists no $(z, q) \in \mathcal{B}$ with $\phi(z, q) > \phi(z^{\star}, q^{\star})$. The last condition is written in the following way:

$\mathcal{H}(z, q) + \alpha \log p(q) \le \mathcal{H}(z^{\star}, q^{\star}) + \alpha \log p(q^{\star}) \qquad \forall (z, q) \in \mathcal{B}.$   (20)

From the properties of multivariate normals we have $p(q) \ge p(q^{\star})$ (with equality holding only for $q = q^{\star}$), and eq. (20) then implies that the following condition must hold as well:

$\alpha \le \frac{\mathcal{H}(z^{\star}, q^{\star}) - \mathcal{H}(z, q)}{\log p(q) - \log p(q^{\star})} \qquad \forall (z, q) \in \mathcal{B} \ \text{with} \ q \ne q^{\star},$   (21)

where $p(q^{\star}) = p_2(1-2d)^2$. Finally we obtain the following result:

Lower Bound.

For each $K$ and $d$, and for all $\alpha$ such that

$\alpha < \alpha_{\mathrm{lb}}(d, K) \equiv \min_{(z, q) \in \mathcal{B}, \, q \ne q^{\star}} \frac{\mathcal{H}(z^{\star}, q^{\star}) - \mathcal{H}(z, q)}{\log p(q) - \log p(q^{\star})},$   (22)

we have that there is a positive probability of finding SAT-$d$-pairs of solutions, namely

$\liminf_{N \to \infty} \Pr\left( Z_2(d) > 0 \right) > 0.$   (23)

The optimization can be simplified further by slicing the set $\mathcal{B}$ in the two “directions” $z$ and $q$. We define a $q$-slice as $\mathcal{B}_q = \{ z : (z, q) \in \mathcal{B} \}$ and the natural projection of the set $\mathcal{B}$ on the $q$-subspace as $\hat{\mathcal{B}}$. With this notation, eq. (22) becomes:

$\alpha_{\mathrm{lb}}(d, K) = \min_{q \in \hat{\mathcal{B}}, \, q \ne q^{\star}} \frac{\mathcal{H}(z^{\star}, q^{\star}) - \max_{z \in \mathcal{B}_q} \mathcal{H}(z, q)}{\log p(q) - \log p(q^{\star})}.$   (24)

The optimization in $z$ is easy because the function $\mathcal{H}(z, q)$ is concave in $z$ for each $q$. This is not necessarily true for the optimization in $q$. In fact, it is crucial that we find the global optimum of the objective function, because this gives the correct value for the lower bound. To this purpose we have devised two computational strategies. First, we evaluated the objective function on a four-dimensional grid with an increasing number of points. Then, we also implemented a simple gradient descent starting from the points of the grid. The different strategies are discussed in Appendix A.3.
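The two strategies can be illustrated on a stand-in objective (this is NOT the paper's objective function, just a generic non-convex surrogate in two variables, with grid and learning-rate settings of our own choosing):

```python
import math

def objective(x, y):
    # Generic non-convex stand-in with several local minima.
    return math.sin(3.0 * x) * math.cos(2.0 * y) + 0.1 * (x * x + y * y)

def num_grad(f, x, y, h=1e-6):
    # Central finite differences.
    return ((f(x + h, y) - f(x - h, y)) / (2.0 * h),
            (f(x, y + h) - f(x, y - h)) / (2.0 * h))

def descend(f, x, y, lr=0.05, steps=500):
    for _ in range(steps):
        gx, gy = num_grad(f, x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y, f(x, y)

# Strategy 1: evaluate on a coarse grid; Strategy 2: gradient descent from each node.
grid = [(-2.0 + 0.5 * i, -2.0 + 0.5 * j) for i in range(9) for j in range(9)]
grid_best = min(objective(x, y) for x, y in grid)
refined = min((descend(objective, x, y) for x, y in grid), key=lambda t: t[2])
print(grid_best, refined[2])
```

Restarting the local descent from every grid node is what gives some confidence that the reported optimum is global rather than a local one.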

Figure 2: Lower and upper bounds for the RBP. For $d$ bigger than the critical value (blue vertical line) there is a gap between the lower bound (purple line) and the upper bound (green line). This gap closes for smaller $d$, where the symmetric solution fails. This is evident from the fact that the symmetric lower bound becomes bigger than the upper bound (gray line). In this region, instead, the true lower bound perfectly matches the upper bound, since the optimum of eq. (17) is attained at the symmetry-broken solution.

Our results indicate the existence of two phases. For values of $d$ bigger than some value $d_{\star}$ (that depends on $K$) the optimum is completely symmetric: all four entries of $q$ take the same value. We use the subscript $S$ to denote this kind of solution. Below $d_{\star}$ this symmetry is broken and the optimum is achieved on a different point that we call the Symmetry Broken ($SB$) solution. This point has the property that the two pairs of binary vectors of solutions $(w^1, w^2)$ and $(w^3, w^4)$ coincide, as can be seen from the structure of the covariance matrix. We report below the covariance structure of the two solutions, where we adopted the convention that rows and columns are ordered as $(X_1, X_2, X_3, X_4)$ and we denote $c = 1 - 2d$. The symmetric covariance matrix is the following:

$\Sigma_S = \begin{pmatrix} 1 & c & q_S & q_S \\ c & 1 & q_S & q_S \\ q_S & q_S & 1 & c \\ q_S & q_S & c & 1 \end{pmatrix},$   (25)

while the one corresponding to the $SB$ point is the following:

$\Sigma_{SB} = \begin{pmatrix} 1 & c & 1 & c \\ c & 1 & c & 1 \\ 1 & c & 1 & c \\ c & 1 & c & 1 \end{pmatrix}.$   (26)

This is a degenerate covariance matrix, and at the $SB$ solution it follows from the previous equations that the lower bound and the upper bound coincide.

The physical meaning of what happens is qualitatively different for $d$ smaller vs bigger than the critical value. Let us take $d$ below the critical value, where the bounds are tight, and let us start with low $\alpha$ and progressively increase it. In this regime the typical overlap between pairs of solutions is zero, i.e. the two pairs of solutions are independent, and there is a positive probability of finding SAT-$d$-pairs since we are below the lower bound. When we reach the (coinciding) bounds there is a transition to a regime where w.h.p. there exists no pair of solutions to the problem. When this happens the point that optimizes (17) is the $SB$ point. For $d$ above the critical value, the bounds are no longer tight and we can only identify a region between the two bounds where a SAT/UNSAT transition occurs.

IV Multiplets of solutions ($y \ge 3$)

In the previous section we were able to derive rigorous expressions for the upper bound, in eq. (7), and the lower bound, in eq. (24), obtained by first and second moment calculations, such that w.h.p. no pairs of solutions at distance $Nd$ exist above the former load and at least one pair exists below the latter. It would then be natural to try to generalize the derivation to sets of $y$ solutions at pairwise distance $Nd$ (multiplets) and in particular assess the existence of a regime where such sets can be found for any value of $y$, for small enough distances. This result would rigorously confirm the existence of a dense region of solutions as derived in sec. II, which in turn has been non-rigorously advocated as a necessary condition for the existence of efficient learning algorithms (baldassi2015subdominant, ).

Unfortunately, it is technically unfeasible to carry out the rigorous derivation for general $y$ as we have done above for the case $y = 2$. Therefore, in this section, after giving a rigorous expression for the first moment upper bound limited to the cases $y = 3$ and $y = 4$, we will derive compact expressions for the first and second moment bounds using non-rigorous field theoretical calculations and a replica symmetric ansatz. We find that the non-rigorous results match the rigorous ones when available, although we expect the prediction to break down at large values of $\alpha$ due to replica symmetry breaking effects (see the discussion in the Introduction).

iv.1 Rigorous first moment upper bounds

In the following we derive rigorous expressions for the first moment bound in two additional cases: the existence of triplets and quadruplets of solutions at fixed pairwise distance $Nd$.

IV.1.1 Triplets ($y = 3$)

Let us define the symbol $\doteq$ as equivalence up to sub-exponential terms as $N \to \infty$, that is for any two sequences $a_N$ and $b_N$ we write $a_N \doteq b_N$ iff $\lim_{N \to \infty} \frac{1}{N} \log \frac{a_N}{b_N} = 0$. The first moment of the triplets partition function has the following asymptotic expression:

$\mathbb{E}\, Z_3(d) \doteq 2^N \binom{N}{Nd} \binom{Nd}{Nd/2} \binom{N(1-d)}{Nd/2} \, p_3(1-2d)^{\alpha N}$   (27)
$\phantom{\mathbb{E}\, Z_3(d)} \doteq e^{N \left[ \mathcal{G}_3(d) + \alpha \log p_3(1-2d) \right]},$   (28)

where $\mathcal{G}_3(d) = \log 2 + H(d) + d \log 2 + (1-d) \, H\!\left( \frac{d}{2(1-d)} \right)$, and $p_3(c)$ is the probability that a zero mean Gaussian random vector $(X_1, X_2, X_3)$, whose covariance matrix $\Sigma_3(c)$ has ones on the diagonal and $c$ off-diagonal, takes values in the box $[-K, K]^3$, that is

$p_3(c) = \int_{[-K,K]^3} \frac{e^{-\frac{1}{2} x^{\top} \Sigma_3(c)^{-1} x}}{(2\pi)^{3/2} \sqrt{\det \Sigma_3(c)}} \, \mathrm{d}x.$   (29)

An equivalent argument to the $y = 2$ case gives the following upper bound for the existence of clusters of three solutions:

$\alpha_{\mathrm{ub}, 3}(d, K) = -\frac{\mathcal{G}_3(d)}{\log p_3(1-2d)}.$   (30)
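The box probability $p_3(c)$ (and its general-$y$ analogue for an equicorrelated Gaussian vector) can be estimated by Monte Carlo using the representation $X_a = \sqrt{c}\, g_0 + \sqrt{1-c}\, g_a$ with i.i.d. standard Gaussians $g_0, g_1, \dots$, valid for $0 \le c \le 1$; sample sizes and parameters below are our own choices:

```python
import math
import random

def p_box(y, c, K, samples=200_000, seed=5):
    """Monte Carlo estimate of P(|X_a| <= K for a = 1..y) for standard Gaussian
    X_a with common pairwise correlation c (construction requires 0 <= c <= 1)."""
    assert 0.0 <= c <= 1.0
    rng = random.Random(seed)
    sc, s1 = math.sqrt(c), math.sqrt(1.0 - c)
    hits = 0
    for _ in range(samples):
        g0 = rng.gauss(0.0, 1.0)
        if all(abs(sc * g0 + s1 * rng.gauss(0.0, 1.0)) <= K for _ in range(y)):
            hits += 1
    return hits / samples

K, d = 1.0, 0.25
p3 = p_box(3, 1.0 - 2.0 * d, K)
p4 = p_box(4, 1.0 - 2.0 * d, K)
print(p3, p4)
```

Conditioning on the shared component $g_0$ reduces the $y$-dimensional box probability to a one-dimensional average, which is also the route to a deterministic quadrature.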

IV.1.2 Quadruplets ($y = 4$)

For quadruplets of solutions, we have

$\mathbb{E}\, Z_4(d) \doteq 2^N \sum_{n \in \mathcal{A}} \binom{N}{N n_1, \dots, N n_8} \, p_4(1-2d)^{\alpha N},$   (31)

where:

  • In complete analogy with the previous case, $p_4(c)$ is the probability that a zero mean Gaussian random vector $(X_1, \dots, X_4)$, whose covariance matrix $\Sigma_4(c)$ has ones on the diagonal and $c$ off-diagonal, takes values in the box $[-K, K]^4$, that is

    $p_4(c) = \int_{[-K,K]^4} \frac{e^{-\frac{1}{2} x^{\top} \Sigma_4(c)^{-1} x}}{(2\pi)^2 \sqrt{\det \Sigma_4(c)}} \, \mathrm{d}x.$   (32)

  • The summation over the proportions $n$ of the eight quadruplet types is restricted to the set $\mathcal{A}$, specified by:

    $n_t \ge 0, \qquad \sum_{t=1}^{8} n_t = 1, \qquad d_H(w^a, w^b) = Nd \quad \forall \, a < b.$   (33)

In the limit $N \to \infty$, due to the 7 constraints in eq. (33), the summation over the elements $n$ in eq. (31) can be replaced by an integral over the single remaining degree of freedom $z$ (the triplet correlation once $w^1$ is gauge-fixed to $+1$), ranging over the interval

$z \in \left[ \max\left( 1 - 4d, \ 12d - 7 \right), \ 1 \right],$   (34)

while for $d > \frac{2}{3}$ the constraints admit no solutions and $Z_4(d) = 0$.

Therefore, for $d \le \frac{2}{3}$, we can write

$\mathbb{E}\, Z_4(d) \doteq 2^N p_4(1-2d)^{\alpha N} \int \mathrm{d}z \, e^{N \mathcal{H}_4(n(z))} \doteq e^{N \left[ \log 2 + \mathcal{H}_4(n(z^{\star})) + \alpha \log p_4(1-2d) \right]},$

where in the last line we estimated the integral with its saddle point contribution at $z^{\star} = \operatorname{argmax}_z \mathcal{H}_4(n(z))$. The function $\mathcal{H}_4$ is the Shannon entropy of an eight-state discrete probability distribution with masses given by the components of the vector $n(z)$: one mass equal to $(7 - 12d + z)/8$, four masses equal to $(1-z)/8$, and three masses equal to $(4d - 1 + z)/8$. It follows that the first moment upper bound to the storage capacity for quadruplets of solutions at a fixed distance is given by

$\alpha_{\mathrm{ub}, 4}(d, K) = -\frac{\log 2 + \mathcal{H}_4(n(z^{\star}))}{\log p_4(1-2d)}.$

The numerical evaluation of the two upper bounds, for $y = 3$ and $y = 4$, can be found in Fig. 3 along with the predictions for the upper bound from non-rigorous calculations for larger values of $y$.

IV.2 Upper bounds under symmetric assumption for saddle point

Since a rigorous expression for the upper bound for larger $y$ is hard to derive, due to highly non-trivial combinatorial factors, we resort to non-rigorous field theoretical techniques and a replica symmetric ansatz to obtain an expression that we believe to be exact for low values of $\alpha$ but is likely slightly incorrect for very large $\alpha$ due to replica symmetry breaking effects. The generic computation of the