Understanding the computational difficulty of binary-weight perceptron and the advantage of input sparseness
Limited precision of synaptic weights is a key feature of both biological and hardware implementations of neural networks. Assigning low-precision weights during learning is a non-trivial task, but it may benefit from representing the to-be-learned items using a sparse code. However, the computational difficulty resulting from low weight precision and the advantage of sparse coding remain incompletely understood. Here, we study a perceptron model, which associates binary (0 or 1) input patterns with outputs using binary (0 or 1) weights, modeling a single neuron receiving excitatory inputs. We studied the difficulty encountered by two efficient algorithms (SBPI and rBP) in solving the binary-weight perceptron with the help of a successive weight-fixing (also called decimation) process. In each time step of decimation, marginal probabilities of unfixed weights are computed using belief propagation or, in small-sized systems, enumerated solutions; then the most polarized weight is fixed at its preferred value. We showed that decimation is a process approaching the region of weight configuration space where solutions densely aggregate. In SBPI and rBP, the values of weights fixed early in the decimation process can be determined in the first few time steps, while most time steps are spent on determining the values of late-decimation-fixed weights. This algorithmic difficulty may result from strong cross-correlation between late-decimation-fixed weights in the solution subspace where early-decimation-fixed weights take their fixed values. By studying the geometry of clusters (defined as solutions connected by flipping a single late-decimation-fixed weight) in this subspace, we propose that this strong cross-correlation is related to a solution condensation process in the subspace during decimation.
Under the biologically plausible requirements of low energy consumption and high robustness of memory retrieval, sparse input reduces the number of time steps that SBPI and rBP use to find solutions, by reducing the time steps needed to assign values to late-decimation-fixed weights, owing to the reduction of cross-correlation between late-decimation-fixed weights in the subspace. Our work suggests that the computational difficulty of constraint satisfaction problems originates from the subspace of late-decimation-fixed variables. It also highlights the heterogeneity of the learning dynamics of weights, which may help understand axonal pruning in brain development and inspire more efficient algorithms for training artificial neural networks.
In neural networks, information is stored in synaptic weights. An important feature of synaptic weights, both in nature and in simple hardware implementations, is low precision O’Connor et al. (2005); Montgomery and Madison (2004); Misra and Saha (2010). However, assigning suitable values to low-precision weights is a non-trivial task, and is known to be intractable in the worst case Rivest and Blum (1992); Amaldi (1991). Artificial neural networks with low-precision weights require more epochs to train before performance converges Hubara et al. (2018); Ott et al. (2017). Therefore, it is of great interest to understand where this hardness comes from and how to avoid or exploit it.
The perceptron is a one-layer network that receives inputs from many neurons and gives a single output. It serves as an elementary building block of complex neural networks, and is one of the basic structures for learning and memory Engel and den Broeck (2001). It is also a powerful computational unit by itself, with applications in, for example, rule inference Lage-Castellanos et al. (2009) and data compression Hosaka et al. (2002). Here we try to understand the computational difficulty induced by low weight precision by studying a perceptron whose weights take binary values (0 or 1). We consider a classification task: given a set of random input patterns, we want to adjust the synaptic weights so that the perceptron gives a desired output in response to each input pattern. A synaptic weight configuration that successfully associates all these input-output pairs is a solution of the perceptron.
Methods of statistical physics imply that most solutions of the binary-weight perceptron are isolated and computationally hard to find Huang and Kabashima (2014); Zdeborová and Mézard (2008a), but some algorithms can efficiently find solutions in a region where solutions densely aggregate Baldassi et al. (2015, 2018). It is believed that solutions in the dense solution region have good generalization performance Baldassi et al. (2015, 2018), and efficient algorithms have been designed that drive weight configurations toward the dense solution region Baldassi et al. (2016a, b); Chaudhari et al. (2017). This implies that the computational difficulty in solving the binary-weight perceptron, if any, should emerge during the process of approaching the dense solution region. This difficulty should be related to the geometric structure of the solution space around the dense solution region, and have little to do with the isolated solutions. However, little is known about the solution space around the dense region and how it influences the way algorithms approach that region.
A possible way to reduce the computational hardness of the binary-weight perceptron is to represent the to-be-classified items using a sparse code in the input patterns. Input sparseness helps algorithms store more input-output (IO) associations in a perceptron Amit and Fusi (1994); Barrett and van Rossum (2008); Legenstein and Maass (2008). Biological neural analogues of the perceptron, such as cerebellar Purkinje cells and insect mushroom body output neurons, receive inputs from a large number of cerebellar granule cells and Kenyon cells Brunel et al. (2004); Huerta (2004), which have low firing probability. Theoretically, the advantage of input sparseness is mainly understood by calculating the theoretical capacity, which is the maximal number of IO pairs for which the perceptron has a solution, and which provides an upper limit on the number of IO pairs that a perceptron can actually associate through algorithms. The result is that input sparseness increases theoretical capacity Brunel et al. (2004); Clopath and Brunel (2013). A point seldom discussed is how input sparsity influences how easily an algorithm can find a solution when solutions exist. This point is important, first, because finding a solution in fewer time steps is meaningful both biologically and in engineering. Second, it is even more important for perceptrons whose weights take discrete, rather than continuous, values. The discrete-weight perceptron is an NP-complete problem Rivest and Blum (1992); Amaldi (1991), which implies that a solution may be very hard to find algorithmically even when solutions exist. Up to now, we do not have a good understanding of the origin of the difficulty in solving the discrete-weight perceptron, or of how input sparsity changes this difficulty.
One possible way to understand the computational difficulty of the binary-weight perceptron is through the cross-correlation of weights in solution space. Strong cross-correlation breaks the Bethe-Peierls approximation Mézard and Montanari (2009), which supposes that the probability distribution of the value of one weight in solution space is independent of the values of the other weights. This approximation lies at the heart of belief propagation Mézard and Montanari (2009), which is an ingredient of, or closely related to, a number of efficient algorithms Braunstein and Zecchina (2006); Baldassi et al. (2016a, 2007); Baldassi and Braunstein (2015). From another aspect, strong cross-correlation implies that the value preferences of a large number of weights are reconfigured in response to the flipping of a single weight. This means that in a local search algorithm, many subsequent rearrangements may be needed following the modification of a single weight, which increases the difficulty of finding solutions Semerjian (2008).
In this paper, we study a perceptron model in which both inputs and weights take binary values (0 or 1), modeling a single neuron receiving excitatory inputs. To understand the computational difficulty encountered by two efficient algorithms (SBPI and rBP), we studied a decimation process in which weights are successively fixed. At each time step, marginal probabilities of unfixed weights are computed using belief propagation or, in small-sized systems, enumerated solutions; then the most polarized weight is fixed at its preferred value. We found that decimation is a process approaching the dense solution region. In SBPI and rBP, most time steps are spent on determining the values of weights fixed late in decimation, and input sparseness is able to reduce the number of time steps needed to assign values to these late-decimation-fixed weights. We then tried to understand the difficulty in assigning late-decimation-fixed weights, and the advantage of input sparseness, from the aspect of cross-correlation of unfixed weights during decimation, and we discuss this cross-correlation from the aspect of the geometry of solution clusters.
We consider a perceptron model in which a neuron receives binary (0 or 1) inputs, each indicating whether the corresponding input neuron fires or not. The synaptic weights are also binary (0 or 1), modeling excitatory synapses. The output of the perceptron takes 1 or 0 depending on whether its total synaptic input is larger than the firing threshold, which contains an adjustable parameter. The neuron is then provided with a set of input patterns and desired outputs. The task is to adjust the synaptic weights so that the actual output equals the desired output for every pattern. The probability that an input takes 1 is the input coding level; input becomes sparser at smaller coding levels. In our model, the coding level is fixed at a nonzero value in the large-size limit, so that the number of active inputs is proportional to the network size.
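As a concrete illustration of the model, the sketch below draws random sparse binary input patterns and checks whether a binary weight configuration is a solution. All sizes, the coding level, and the threshold value here are hypothetical choices made only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only; the paper studies much larger systems).
N = 40                # number of input neurons
P = 8                 # number of input-output (IO) pairs
f = 0.2               # input coding level: probability that an input is 1
theta = 0.5 * f * N   # firing threshold (a hypothetical choice for this sketch)

# Random binary input patterns with coding level f, and random desired outputs.
X = (rng.random((P, N)) < f).astype(int)
y = rng.integers(0, 2, size=P)

def output(w, x, theta):
    """Perceptron output: 1 iff the total synaptic input exceeds the threshold."""
    return int(x @ w > theta)

def is_solution(w, X, y, theta):
    """A weight configuration is a solution iff it reproduces every desired output."""
    return all(output(w, X[mu], theta) == y[mu] for mu in range(len(y)))

# A random binary weight configuration is usually NOT a solution.
w = rng.integers(0, 2, size=N)
print(is_solution(w, X, y, theta))
```

The point of the sketch is only the solution criterion: a single weight vector must satisfy all P threshold constraints simultaneously.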
Now let us discuss the meaning of this adjustable parameter. For a well-trained perceptron, it can be shown that the total synaptic current, averaged over the patterns, is equal to the firing threshold up to subleading corrections (see Appendix E). As larger synaptic currents consume more energy, an energetically efficient perceptron should reduce the firing threshold by enlarging this parameter (Fig.1a). Of course, if the firing threshold is too low, the neuron may have high spontaneous activity due to membrane or synaptic noise Faisal et al. (2008), potentially consuming more energy. However, the advantage of enlarging this parameter should be apparent as long as such spontaneous activity is rare.
The meaning of this parameter can also be interpreted in another way by rewriting the expression for the neuron output: in the rewritten form, the parameter rescales the discrete values that the weights can take. This weight rescaling can increase the success rate of memory retrieval under noise (Fig.1b). Intuitively, because the inputs are supposed to be independent of each other, the standard deviation of the distribution of the total synaptic current grows only as the square root of the number of active inputs.
In a large body of literature (e.g. Ref. Gardner (1988); Brunel (2016)), retrieval robustness is realized by requiring that the total synaptic input be above (below) the firing threshold by a stability margin when the desired output is active (inactive). Here, such a margin is effectively set through our rescaling parameter. Introducing the margin ensures that memory retrieval is always successful when the noise strength is below it, while enlarging the margin increases the probability of successful retrieval at larger noise strengths.
Taken together, in our model the requirements of low energy cost and high robustness of memory retrieval can be unified in a single motivation: enlarging this one rescaling parameter.
Computational advantage of sparse input at large values of the rescaling parameter
We evaluated the capability of our perceptron model in the classification task by investigating two efficient algorithmic solvers: the stochastic Belief-Propagation-Inspired algorithm (SBPI) Baldassi et al. (2007) and reinforced Belief Propagation (rBP) Braunstein and Zecchina (2006). Both algorithms assign a hidden state to each weight, update the hidden states during solving, and set a weight to 0 (or 1) when its hidden state is negative (or positive). SBPI is a biologically plausible on-line algorithm, which updates the hidden states in response to an unassociated input-output (IO) pair, or when an IO pair is associated but the synaptic current is too close to the firing threshold. rBP is a message-passing algorithm, which uses belief propagation (BP) to evaluate the hidden states, and then, in the next time step, with a probability that grows with the time step, adds the evaluated hidden states as external fields to BP and evaluates new values. See Appendix A for their implementations.
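To make the flavor of these hidden-state dynamics concrete, here is a minimal SBPI-style on-line update written from the description above. It is a sketch, not the exact published SBPI rule; the margin, the secondary-update probability, and all sizes are our own illustrative choices.

```python
import numpy as np

def sbpi_like_solve(X, y, theta, margin=1.0, p_secondary=0.3, max_epochs=500, seed=0):
    """Simplified SBPI-style solver (a sketch, not the exact published rule).

    Each weight carries an integer hidden state h_i; the binary weight is
    w_i = 1 if h_i > 0 else 0.  On a misclassified pattern, the hidden states
    of active inputs move toward the desired output; when a pattern is
    correctly classified but the synaptic current is within `margin` of the
    threshold, a secondary update is applied with probability `p_secondary`.
    """
    rng = np.random.default_rng(seed)
    P, N = X.shape
    h = rng.integers(-1, 2, size=N)            # small random initial hidden states
    for epoch in range(max_epochs):
        solved = True
        for mu in rng.permutation(P):
            w = (h > 0).astype(int)
            current = X[mu] @ w
            out = int(current > theta)
            direction = 2 * y[mu] - 1          # +1 to raise the current, -1 to lower it
            if out != y[mu]:                   # error: push the active weights
                solved = False
                h += direction * X[mu]
            elif abs(current - theta) < margin and rng.random() < p_secondary:
                h += direction * X[mu]         # secondary (robustness) update
        if solved:
            return (h > 0).astype(int), epoch
    return None, max_epochs
```

On small realizable instances this sketch behaves like a clipped perceptron rule: error-driven updates dominate, and the secondary updates push correctly classified patterns away from the threshold.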
We first studied the dependence of the algorithmic capacity of SBPI and rBP on the input coding level and the rescaling parameter. The algorithmic capacity is the maximal number of IO pairs that the perceptron can associate using a given algorithm. We found that at a given coding level, the capacity increases with the rescaling parameter below an optimal value and decreases above it (Fig.2a-c). This optimal value, which decreases with the coding level (Fig.2a-c), divides the curve of capacity versus rescaling parameter into two branches: a low-rescaling and a high-rescaling branch. Because of this non-monotonicity, a given capacity usually corresponds to two values of the rescaling parameter at a given coding level (Fig.2a-c). However, a good design, biologically or in engineering, should work at the larger value, because it implies lower energy cost or higher retrieval robustness without compromising memory capacity. So only the high-rescaling branch is of biological and engineering interest, and it will be the focus of the following discussion. At a given rescaling parameter, the dependence of capacity on the coding level may not be monotonic, but on the high-rescaling branch, capacity increases with input sparseness (Fig.2a-c). This means that under the requirements of low energy consumption and robust memory retrieval, input sparseness facilitates memory capacity. From another aspect, the decrease of capacity with the rescaling parameter on the high-rescaling branch reflects a competition between capacity and energy cost or retrieval robustness; this competition can be alleviated by increasing input sparseness: decreasing the coding level enables the same capacity to be achieved with a larger rescaling parameter (Fig.2a-c).
To understand the change of the algorithmic capacity with the coding level and the rescaling parameter, we first calculated the theoretical capacity using the replica method (Appendix E); the theoretical capacity is the maximal number of IO pairs for which the perceptron has a solution, and it provides an upper bound on the algorithmic capacity. We found that the theoretical capacity follows a profile similar to the algorithmic one (Fig.2a,c): its curve versus the rescaling parameter also has two branches, and on the high-rescaling branch the theoretical capacity increases with input sparseness. We discuss the non-monotonicity of capacity with the coding level and the rescaling parameter in Supplemental Material Section 2 and Fig.S2, and we find that the low-rescaling branch exists because the weights in our model are bounded from above (they can be at most 1).
However, calculating the theoretical capacity alone is not sufficient to understand the computational advantage of input sparseness, because it is possible that solutions become more difficult to obtain even though there are more of them in the weight configuration space. How easily solutions can be obtained also matters: on the one hand, algorithms usually have an upper limit on the number of time steps, and only the cases in which solutions are found within this limit count as solving successes and contribute to the algorithmic capacity; on the other hand, solving the perceptron in fewer time steps, which implies quick learning, is itself of biological and engineering interest.
We investigated the number of time steps that SBPI and rBP use to find a solution under different input sparsities at a given number of IO pairs. Under SBPI, the change of solving time with the coding level and the rescaling parameter is largely anti-correlated with the change of capacity: the curve of solving time versus the rescaling parameter also has two branches, and on the high-rescaling branch the solving time decreases with input sparseness (Fig.2d). Under rBP, however, the change of solving time with the coding level is more subtle and depends on the reinforcement parameter of rBP, which quantifies the probability of adding the hidden states evaluated in one time step to belief propagation as external fields in the next time step (see Appendix A for details). On the high-rescaling branch, at a fixed reinforcement parameter, the solving time may not decrease with input sparseness (Fig.S3b); but if we also tune the reinforcement parameter, defining the optimal solving time as the minimal number of time steps over reinforcement parameters for which rBP solves successfully, then this optimal solving time follows a profile similar to that in SBPI (Fig.2e): it has two branches as a function of the rescaling parameter, and it decreases with input sparseness on the high-rescaling branch. This means that we could, in principle, design a new algorithm based on rBP that cleverly chooses the reinforcement parameter for quick solving based on, say, experience; such a clever-rBP algorithm would require fewer time steps under sparser input on the high-rescaling branch. See Supplemental Material Section 3 and Fig.S3 for more discussion of the rBP case.
Together, at large values of the rescaling parameter, the computational advantage of input sparseness is twofold: (1) sparser input results in higher theoretical capacity, which means that more IO pairs can potentially be associated under sparser input; (2) at a given number of IO pairs, solutions are easier to obtain under sparser input, which brings the theoretical potential of input sparseness suggested by the first point into algorithmic reality. The first point follows from the replica method, and is consistent with the findings of previous studies Brunel et al. (2004); Clopath and Brunel (2013). In the following, we discuss the difficulty encountered by SBPI and rBP during solving, and try to understand the reason for the second point.
Understanding the difficulty of perceptron solving through a successive weight-fixing process
To understand the difficulty in solving the binary-weight perceptron and the advantage of sparse input, we studied a successive weight-fixing process, in which a single weight is fixed in each time step. Specifically, at the first time step, we evaluated for each weight the marginal probability that it takes the value 1 in solutions. We then chose the weight with the strongest value polarization (i.e., marginal probability farthest from 1/2), and fixed it at its preferred value (i.e., 0 if its marginal probability is below 1/2, and 1 otherwise). At the t-th time step, when t-1 weights had already been fixed, we evaluated the marginal probabilities of the unfixed weights in the solutions whose fixed weights take the values they were fixed at in previous time steps, and fixed the most polarized unfixed weight in the same way as above. By iterating, we could hopefully fix all the weights and obtain a solution of the perceptron problem.
When the marginal probabilities in each time step are computed using belief propagation, the successive weight-fixing process above is known as belief-propagation-guided decimation (BPD), where the word “decimation” refers to removing variable nodes from the factor graph M. Mézard (2002). For theoretical interest, in small-sized systems we can obtain the exact marginal probabilities from enumerated solutions, which we call exact decimation.
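For tiny systems, the exact-decimation variant can be spelled out directly. The sketch below enumerates all binary weight configurations, then repeatedly fixes the most polarized unfixed weight using exact marginals over the surviving solutions; sizes, patterns, and function names are illustrative only.

```python
import itertools
import numpy as np

def enumerate_solutions(X, y, theta):
    """Brute-force all binary weight configurations satisfying every IO pair."""
    N = X.shape[1]
    return [np.array(w) for w in itertools.product([0, 1], repeat=N)
            if all(int(X[mu] @ np.array(w) > theta) == y[mu] for mu in range(len(y)))]

def exact_decimation(X, y, theta):
    """Exact decimation: repeatedly fix the most polarized unfixed weight.

    Marginals are computed from the enumerated solutions restricted to the
    already-fixed weights (the solution subspace), so this is feasible only
    for very small systems.
    """
    sols = enumerate_solutions(X, y, theta)
    if not sols:
        return None
    N = len(sols[0])
    fixed = {}                                  # index -> fixed value
    for _ in range(N):
        # restrict to the subspace consistent with all previously fixed weights
        sub = [s for s in sols if all(s[i] == v for i, v in fixed.items())]
        unfixed = [i for i in range(N) if i not in fixed]
        # marginal probability that weight i equals 1 within the subspace
        p1 = {i: np.mean([s[i] for s in sub]) for i in unfixed}
        # most polarized weight: marginal farthest from 1/2
        i_star = max(unfixed, key=lambda i: abs(p1[i] - 0.5))
        fixed[i_star] = int(p1[i_star] > 0.5)
    w = np.zeros(N, dtype=int)
    for i, v in fixed.items():
        w[i] = v
    return w
```

Because each weight is fixed at a value carried by at least one surviving solution, the subspace never empties, and the final configuration is always a solution whenever one exists.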
Our motivation for studying the successive weight-fixing process is that BPD stems from an idea similar to rBP and SBPI: BPD is equivalent to adding an external field of infinite intensity to the fixed weights, while rBP is a sort of smooth decimation in which each variable receives an external field with intensity proportional to its polarization Braunstein and Zecchina (2006); and SBPI is an on-line algorithm inspired by rBP Baldassi et al. (2007). So the dynamic processes of these three algorithms should be comparable in some sense. Additionally, in the successive weight-fixing process, weight values are never changed once fixed, so this process can be studied in a more controlled manner than rBP and SBPI, in which the preferences of weights change continuously.
We compared the dynamics of SBPI and rBP with the weight-fixing order in BPD. We found that for weights fixed early in BPD, the hidden states in SBPI or rBP quickly deviate from zero in the first few time steps and hardly change their signs in the subsequent solving process; for weights fixed late in BPD, however, the hidden states stay close to zero and are prone to changing their signs later in the solving process (Fig.3a). We then defined the fixing time of a weight as the last time step at which its hidden state changes sign during SBPI or rBP. We found that the fixing time is small for early-BPD-fixed weights, whereas it is large for late-BPD-fixed weights, and larger under denser input (Fig.3b,c). This means that SBPI and rBP assign values to early-BPD-fixed weights in the first few time steps, while most time steps are spent on determining the values of late-BPD-fixed weights; spending less time on these late-BPD-fixed weights is the key reason for faster solving under sparser input.
We also calculated, for each weight, the difference of its value in solutions found by SBPI or rBP from its value in the solution found by BPD:

Delta_i = [ < | w_i^alg - w_i^BPD | > ],

where w_i^BPD is the value of the i-th weight in the solution found by BPD, w_i^alg is its value in a solution found by SBPI or rBP, < . > means averaging over the solutions found by SBPI or rBP under different seeds of the random number generator, and [ . ] means quenched average (i.e., average over different sets of IO pairs). We found that Delta_i is close to zero for early-BPD-fixed weights (Fig.3d,e), indicating that w_i^alg agrees with w_i^BPD most of the time; for late-BPD-fixed weights, Delta_i is close to the value expected if w_i^alg took 0 or 1 at random (Fig.3d,e). This result indicates that these algorithms (BPD, SBPI and rBP) have hardly any discrepancy over the values that early-BPD-fixed weights should take, while the values of late-BPD-fixed weights are prone to stochasticity during solving.
Together, during SBPI and rBP, the values of early-BPD-fixed weights are very easy to determine and show little disagreement among algorithms; most solving time steps are spent on determining the values of late-BPD-fixed weights after the early-BPD-fixed weights have been fixed to their little-disputed values. Under sparser input, solutions can be found in fewer time steps mostly because fewer time steps are needed to determine the values of late-BPD-fixed weights. In other words, the difficulty in solving the binary-weight perceptron, and also the advantage of input sparseness, should originate from the subspace of the weight configuration space in which the early-BPD-fixed weights are fixed to their little-disputed values.
Approaching dense solution region through successive weight fixing
Previous studies suggest that solutions of the binary-weight perceptron are typically isolated Huang and Kabashima (2014), but that there is a spatial region in the weight configuration space where solutions densely aggregate Baldassi et al. (2015, 2016c). It is believed that solutions in the dense region have good generalization performance, and are the ones found by efficient algorithms Baldassi et al. (2015, 2016a, 2016b); Chaudhari et al. (2017). A similar scenario exists in our model. Using the replica method introduced in Ref. Huang and Kabashima (2014); Baldassi et al. (2015), we calculated the local entropy (i.e., the logarithm of the number of solutions at a given distance from a weight configuration) around a typical solution and around a configuration in the dense solution region (see Appendix E for details). Here, distance means the Hamming distance between two weight configurations, i.e., the number of weights that take different values in the two configurations. Consistent with the findings in Ref. Huang and Kabashima (2014); Baldassi et al. (2015, 2016a), we found that the local entropy around a typical solution is negative at small distances (Fig.4a), suggesting that solutions are typically isolated; however, around a configuration in the dense solution region, the local entropy at small distances tends to its upper bound, which is the local entropy in the case that all weight configurations are solutions (Fig.4b), suggesting extremely dense solution aggregation. Using belief propagation (Appendix D), we found that the local entropy around a solution found by SBPI or rBP is higher than that around a solution reached after a long random walk (Appendix F) starting from it (Fig.4c,d), suggesting that solutions found by SBPI or rBP usually lie near the dense solution region.
Where is the dense solution region located? To answer this question, we calculated, for each weight, the difference of its value in a given solution from its value in the solution found by BPD. We found that for solutions found by SBPI or rBP, this difference is close to zero for weights fixed early in BPD (Fig.4e,f). However, for solutions reached after a long random walk starting from such a solution, the difference is significantly larger for these early-BPD-fixed weights (Fig.4e,f). This result, together with the calculation of local entropy (Fig.4c,d), suggests that solutions whose early-BPD-fixed weights hardly deviate from their BPD values have high local entropy. In other words, the dense solution region lies in the subspace where the early-BPD-fixed weights take their fixed values. These early-fixed weights tend to have strong polarization (i.e., marginal probabilities far from 1/2 in solution space, see Fig.S6e). Therefore, in a simple picture, solutions in the dense region are those whose strongly polarized weights take their preferred values.
For the convenience of the following discussion, we define the solution subspace at a given step of a weight-fixing process to be the set of all solutions in which the weights fixed so far take their fixed values.
To better understand the dense-region-approaching process during BPD, we studied small-sized systems in which all solutions can be enumerated exactly. We calculated the local entropy of the enumerated solutions in the solution subspace during BPD, defined as the logarithm of the number of solutions at a given Hamming distance from a configuration (expressed using the Heaviside step function, which is 1 for nonnegative arguments and 0 otherwise), averaged over the solutions in the subspace, and finally averaged over sets of IO pairs (quenched average). We found that the local entropy at small distances increases during BPD (Fig.4g), and that the solutions found by BPD after fixing all weights have higher-than-average local entropy at small distances (Fig.4h). These results further support the view of BPD as a process approaching the dense solution region.
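In small systems this quantity is straightforward to evaluate from the enumerated solutions. The helper below uses one convention (solutions at Hamming distance exactly d); the paper's precise definition, with its averaging over the subspace and over IO-pair sets, may differ in detail.

```python
import math

def hamming(a, b):
    """Hamming distance: number of positions where two configurations differ."""
    return sum(x != y for x, y in zip(a, b))

def local_entropy(solutions, w, d):
    """Log of the number of solutions at Hamming distance exactly d from w.

    This is one convention for local entropy; returns -inf when no solution
    lies at that distance.
    """
    n = sum(1 for s in solutions if hamming(s, w) == d)
    return math.log(n) if n > 0 else float("-inf")
```

Averaging this quantity over the solutions in the subspace, and then over instances, yields the curves discussed above.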
For an intuitive understanding of how BPD relates to approaching the dense solution region, note that after fixing a weight at its preferred value at a given time step, we eliminate from the solution subspace exactly those solutions in which that weight takes the opposite value. By fixing the most polarized weight at its preferred value, we reduce the number of unfixed weights while eliminating the fewest solutions. As a result, the solution density in the subspace of unfixed weights should continually increase during this process. Consistent with this scenario, we found that the solution entropy density in the subspace of unfixed weights (the logarithm of the number of solutions in the subspace divided by the number of unfixed weights) continually increases during BPD (Fig.4i).
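This intuition can be checked numerically on a toy instance: the script below tracks the entropy per unfixed weight (log2 of the surviving-solution count divided by the number of unfixed weights) while greedily fixing the most polarized weight. The instance and all names are our own illustrative choices.

```python
import itertools
import math

# Tiny toy constraint set: solutions of a hypothetical 2-pattern perceptron,
# enumerated by brute force (N = 5, threshold 0.5).
N, theta = 5, 0.5
X = [(1, 1, 0, 0, 1), (0, 0, 1, 1, 0)]
y = (1, 0)
sols = [w for w in itertools.product((0, 1), repeat=N)
        if all(int(sum(a * b for a, b in zip(X[m], w)) > theta) == y[m]
               for m in range(2))]

fixed = {}
densities = []
for _ in range(N):
    sub = [s for s in sols if all(s[i] == v for i, v in fixed.items())]
    unfixed = [i for i in range(N) if i not in fixed]
    # entropy per unfixed weight in the current subspace
    densities.append(math.log2(len(sub)) / len(unfixed))
    p1 = {i: sum(s[i] for s in sub) / len(sub) for i in unfixed}
    i_star = max(unfixed, key=lambda i: abs(p1[i] - 0.5))
    fixed[i_star] = int(p1[i_star] > 0.5)

# The entropy density tends to grow as the most polarized weights are fixed first.
print(densities)
```

On this instance the entropy density is non-decreasing along the whole fixing trajectory, matching the scenario described above.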
We also tried exact decimation in small-sized systems, in which the marginal probabilities used for weight fixing are calculated from the enumerated solutions, and again found that the local entropy and solution entropy in the subspace increase as weight fixing progresses (Fig.S5). This means that the dense-region-approaching phenomenon observed in BPD is universal in successive weight-fixing algorithms, rather than an artifact introduced by belief propagation.
Cross-correlation of unfixed weights in the solution subspace
From the two subsections above, we see that the subspace of late-BPD-fixed weights contains the dense solution region, and that determining the values of these late-BPD-fixed weights is the main difficulty in solving the binary-weight perceptron. To better understand this difficulty, as well as the advantage of input sparseness, we computed the cross-correlation within the solution subspace during BPD:

C_ij = < w_i w_j > - < w_i > < w_j >,   (eq.4)

where w_i and w_j are two unfixed weights, and < . > means average over the solutions in the subspace.
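Given an enumerated solution set, these connected correlations can be computed in a few lines; the helper name and the vectorized formulation are ours.

```python
import itertools
import numpy as np

def cross_correlations(solutions, unfixed):
    """Connected correlations C_ij = <w_i w_j> - <w_i><w_j> over a solution set,
    returned for all pairs (i, j), i < j, of unfixed weight indices."""
    S = np.array(solutions, dtype=float)        # one solution per row
    mean = S.mean(axis=0)                       # marginal <w_i>
    C = (S.T @ S) / len(S) - np.outer(mean, mean)
    return {(i, j): C[i, j] for i, j in itertools.combinations(unfixed, 2)}
```

For instance, the two-solution set {(0,1), (1,0)} gives a strongly negative correlation, while the full set of all four 2-bit configurations gives zero.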
Before decimation, the cross-correlations distribute narrowly around zero. As BPD progresses, their distribution becomes broader (Fig.5a). Consistently, the mean square cross-correlation, averaged over pairs of unfixed weights, increases with BPD progress (Fig.5b,d, Fig.S6a,c). The cross-correlations are negative-dominated: their mean value is negative and decreases further with BPD progress (Fig.5c,e, Fig.S6b,d). This indicates strong cross-correlation between late-BPD-fixed weights. This strong cross-correlation, which breaks the Bethe-Peierls approximation, is probably the reason for the difficulty in assigning values to late-BPD-fixed weights, and for why SBPI and rBP spend so many time steps assigning values to them (Fig.3b,c). At the later stage of BPD, with sparser input, the mean square cross-correlation is smaller and the mean cross-correlation is less negative (Fig.5b-e, Fig.S6a-d). The reduction of cross-correlation between late-BPD-fixed weights may be the reason for quicker solving under sparser input.
Notably, at the early stage of BPD, both the mean square and the mean cross-correlation are small, and they do not change much with input sparsity (Fig.5b-e, Fig.S6a-d). This implies that the difficulty in solving the binary-weight perceptron, and also the computational advantage of input sparseness, come from the structure of the solution space near the dense solution region, and cannot be unveiled by an equilibrium analysis in which all solutions contribute equally to the statistics.
We also investigated the mean square and mean cross-correlations during exact decimation in small-sized systems, and found a similar profile of how they change with the weight-fixing progress and the input coding level (Fig.S5d,e).
Understanding the weight cross-correlation in the solution subspace through the geometry of solution clusters
In this subsection, we try to understand the cross-correlation between unfixed weights in the solution subspace from the geometry of solution clusters.
A solution cluster in the solution subspace is defined as a set of solutions that can be connected by flipping single unfixed weights. A cluster is used to represent a pure state, which is a set of weight configurations separated from other configurations by infinite free-energy barriers. In other words, starting from a configuration in a pure state, the system needs an infinitely long time to access a configuration outside of this pure state under natural thermodynamics. Strictly speaking, it is not known how to adapt the definition of pure states to instances of finite size. Here we adopt the suggestion in Ref. Ardelius and Zdeborová (2008); Obuchi and Kabashima (2009), and numerically identify a pure state in the subspace with a solution cluster.
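Operationally, these single-flip clusters are the connected components of the graph whose nodes are solutions and whose edges join configurations at Hamming distance 1. A minimal sketch (function name ours) finds them by breadth-first search:

```python
def solution_clusters(solutions):
    """Group solutions into clusters connected by single-weight flips
    (Hamming distance 1), via breadth-first search over the solution set."""
    sols = [tuple(s) for s in solutions]
    index = set(sols)
    seen, clusters = set(), []
    for s in sols:
        if s in seen:
            continue
        # BFS over neighbors that differ in exactly one weight
        frontier, cluster = [s], []
        seen.add(s)
        while frontier:
            cur = frontier.pop()
            cluster.append(cur)
            for i in range(len(cur)):
                nb = cur[:i] + (1 - cur[i],) + cur[i + 1:]
                if nb in index and nb not in seen:
                    seen.add(nb)
                    frontier.append(nb)
        clusters.append(cluster)
    return clusters
```

For example, {(0,0,0), (0,0,1), (1,1,1)} splits into two clusters, since (1,1,1) is at Hamming distance 2 from its nearest neighbor in the set.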
Under the 1-step replica symmetry breaking (1RSB) ansatz, the overlap between solutions in the same cluster (or in different clusters) is a delta function located at the intra-cluster overlap q1 (or the inter-cluster overlap q0). In this case, the mean square cross-correlation C2 can be related to the solution overlaps in the large-size limit as (Supplemental Material Section 6)

C2 = m (1 - m) (q1 - q0)^2,   (eq.6)

where m is the Parisi parameter, which is the probability that two solutions belong to different clusters. Numerically, m is defined as

m = 1 - sum_a (n_a / n_tot)^2,

where n_a is the number of solutions in pure state a, and n_tot is the total number of solutions. Using the replica method introduced in Ref. Obuchi and Kabashima (2009), it can be shown that in the full solution space of infinite-sized systems the Parisi parameter tends to 1 (see Supplemental Material Section 7 for discussion and Fig.S8 for numeric support), which from eq.6 implies zero cross-correlation. Possibly because of this zero cross-correlation, belief propagation can give an entropy landscape closely matching that predicted by the replica method in the solution space before weight fixing Huang et al. (2013).
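The numerical definition above, the probability that two independently drawn solutions fall in different clusters, is simple to compute from the cluster sizes; the helper name is ours.

```python
def parisi_parameter(cluster_sizes):
    """Probability that two independently drawn solutions fall in different
    clusters: m = 1 - sum_a (n_a / n_tot)^2, given the list of cluster sizes."""
    n_tot = sum(cluster_sizes)
    return 1.0 - sum((n / n_tot) ** 2 for n in cluster_sizes)
```

A single cluster gives m = 0, while many equally small clusters push m toward 1, matching the infinite-size behavior of the full solution space quoted above.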
In the following, we try to understand the cross-correlation between unfixed weights through eq.6, by investigating 1-weight-flip clusters in the solution subspace using enumerated solutions of small-sized systems. Strictly speaking, eq.6 is valid only in the infinite-size limit. Here, by investigating small-sized systems, we hope to gain some understanding of the change of cross-correlation during weight fixing observed in finite-sized systems (Fig.5), and also of the possible phase transition (the cross-correlation going from zero to nonzero) during weight fixing in infinite-sized systems.
We found that the Parisi parameter decreases as weight fixing progresses (Fig.7a,e), which implies a solution condensation process. One possible scenario is as follows. Before weight fixing, the solution space contains a sub-exponential number of large clusters with exponentially many solutions each, and exponentially many small clusters with a sub-exponential number of solutions each Obuchi and Kabashima (2009) (Supplemental Material Section 7). The dense solution region presumably lies in a region where most solutions belong to large clusters. Therefore, in the solution subspace at the later stage of the weight-fixing process, solutions may be more condensed in large clusters (Fig.6). With this solution condensation, the Parisi parameter decreases from 1, because two randomly chosen solutions become more likely to lie in the same large pure state. The decrease of the Parisi parameter from 1 increases the weight cross-correlation through eq.6.
We tested a hypothesis in the discussion above: most solutions around the dense solution region come from large clusters. To this end, we calculated an index defined as the entropy (i.e., log-size) of the cluster that a solution lies in, where the cluster is defined in the full solution space before weight fixing, averaged over the solutions in the subspace. This index therefore measures the average entropy of the clusters that the solutions in the subspace belong to. We found that this index increases as weight fixing progresses (Fig.7b,f), and that the entropy of the cluster containing the solution obtained after fixing all weights is close to the entropy of the largest cluster in the full solution space (Fig.7b,f). These results support the scenario depicted in Fig.6: solutions in the neighborhood of the dense solution region mostly belong to large clusters.
We found that tends to increase as weight fixing progresses, and to decrease with input sparseness (Fig.7c,g). This means that around the dense solution region, the difference between overlaps within and between clusters is larger than in the full solution space, and becomes smaller with sparser input. This effect contributes, through eq.6, to the increase of weight cross-correlation during weight fixing, and also to the reduction of weight cross-correlation under sparse input at the late stage of the weight-fixing process.
We compared calculated from eq.6 with that directly calculated by eq.4. We found that eq.6 cannot reproduce the value of , but it does capture some features of how changes with and (Fig.7d,h): (1) increases with ; (2) the discrepancy between values of under different grows with ; (3) increases with , especially for BPD when is large and for exact decimation. This suggests that eq.6 provides a promising perspective for understanding the strong weight cross-correlation at the late stage of the weight-fixing process, and the reduction of weight cross-correlation under sparse input.
Possible reasons why eq.6 cannot reproduce the value of include: (1) the finite size of , (2) failure of the 1RSB ansatz, and (3) an improper numerical definition of pure state. The first and second points undermine the conditions under which eq.6 is established. As for the third point, as mentioned in Ref. Ardelius and Zdeborová (2008), a shortcoming of defining pure states using 1-weight flips is that this definition cannot describe entropic barriers. For example, suppose there are two subsets of solutions. Both subsets are densely intra-connected by 1-weight flips, but there is only a single long 1-weight-flip path connecting the two subsets. According to the 1-weight-flip definition of pure state, these two subsets belong to the same pure state. However, it is very hard to reach one subset from the other through this single path using random-walk dynamics when is large. So according to the physical definition of pure state mentioned at the beginning of this subsection, these two subsets should belong to different pure states.
In this paper, we tried to understand the computational difficulty of the binary-weight perceptron and the advantage of input sparseness in a classification task. We found that under biologically plausible requirements of low energy consumption and high robustness of memory retrieval, the perceptron has higher theoretical capacity and becomes easier to solve algorithmically with sparser input. We then studied a decimation process, so that the progress of approaching the dense solution region can be investigated in a controlled manner. We found that most time steps in SBPI and rBP are spent on determining the values of the weights fixed late in decimation, and that solution finding is easier under sparser input because fewer time steps are needed to assign values to late-decimation-fixed weights. We then tried to understand the difficulty in assigning late-decimation-fixed weights and the advantage of input sparseness from the perspective of the cross-correlation of unfixed weights during decimation. We found that this cross-correlation increases with decimation progress and decreases with input sparseness. We then studied the geometric structure of clusters in the subspace of unfixed weights during decimation. We proposed that the change of cross-correlation with decimation progress and input sparseness may be related to solution condensation in the subspace of unfixed weights during decimation, and also to the fact that the difference between overlaps of solutions in the same and in different clusters increases with decimation progress and decreases with input sparseness. Our work provides new understanding of the computational difficulty encountered by efficient solvers (SBPI and rBP) and of the advantage of input sparseness.
The binary-weight perceptron belongs to the class of constraint satisfaction problems (CSPs). At present, we still do not have a full understanding of the computational difficulty of CSPs. Our work implies that this difficulty originates from the subspace of late-decimation-fixed variables. An influential viewpoint is that this computational difficulty comes from frozen variables in clusters Zdeborová and Krza̧kała (2007); Zdeborová and Mézard (2008a, b), i.e., variables that take the same value in all solutions in the same cluster. However, when solving CSPs in which all solutions are isolated, or in other words, in which all variables are frozen in every cluster, BPD can still evaluate the preferences of variables at the early stage, while hardness emerges as BPD progresses Zdeborová and Mézard (2008a). This implies that freezing in the full solution space does not hamper variable assignment (i.e., obtaining information about solutions), while difficulty comes from the subspace of unfixed variables during BPD. In other words, freezing may not be the origin of computational difficulty, although it accompanies this difficulty. From the solution condensation scenario depicted in this paper, we presume that the difficulty during BPD in Ref. Zdeborová and Mézard (2008a) may be related to a solution elimination process: the number of isolated solutions in the subspace of unfixed variables decreases during BPD, and difficulty arises when this number becomes finite so that , indicating that solutions are condensed into a small number of clusters. The isolation of solutions in Ref. Zdeborová and Mézard (2008a) results in large , which increases cross-correlation through eq.6. However, solution condensation alone cannot explain the hardness of CSPs either, because BPD can find solutions in the condensation phase of CSPs Zdeborová and Krza̧kała (2007), in which cross-correlation may be nonzero.
Therefore, it seems that both the concepts of condensation and freezing are important for understanding the hardness of CSPs. In our perceptron model, we investigated the number of frozen weights in the solution cluster that contains the fewest frozen weights, and the probability that there exist clusters with no frozen weights, during decimation. We found that with input sparseness, decreases and increases (Fig.S7), which means that clusters become less frozen with sparser input; with the progress of decimation, does not change much at the early stage but decreases at the late stage, and increases (Fig.S7), which means that the degree of freezing does not increase with decimation progress. This implies that freezing is a good perspective for understanding the advantage of input sparseness, but not for understanding the difficulty in assigning late-decimation-fixed weights observed in SBPI and rBP (Fig.3a-d).
The dense solution region is called “sub-dominant” in Ref. Baldassi et al. (2015), because solutions in the dense region are so rare that they cannot be unveiled by equilibrium analysis. We found that solutions in the dense region are those whose strongly polarized weights take their preferred values, which may explain the scarcity of solutions in the dense region: in solutions of the dense region, all strongly polarized weights take their preferred values, whereas in typical solutions, some strongly polarized weights violate their preferences (manifested by Fig.4e,f); typical solutions vastly outnumber solutions in the dense region because there are combinatorially many ways to choose the weights that violate their preferences. As a simple example, suppose a perceptron with 5 weights has six solutions: 00000, 10000, 01000, 00100, 00010 and 00001. Here, each digit indicates the value of a weight (e.g., 10000 means that the value of the first weight is 1, while the other four weights are zero). In this case, each weight has probability 0.833 of being 0, so according to our finding (i.e., the dense solution region is made up of solutions in which strongly polarized weights take their preferred values), the dense solution region should be the solution in which all weights take 0 (i.e., 00000). Consistently, it is easy to check that 00000 has 5 neighbors at Hamming distance 1, while every other solution has only one neighbor at Hamming distance 1, so 00000 has larger local entropy. However, there is only one solution (00000) in which all weights take their preferred value 0, while in each of the other five solutions, one weight violates its preference and takes 1.
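The toy example above can be checked directly. The following sketch (plain Python, using the six solutions listed in the text) verifies that 00000 has five Hamming-distance-1 neighbors, that every other solution has one, and that each weight is 0 with probability 5/6, i.e. about 0.833:

```python
# The six solutions of the toy example, as bit strings over 5 weights.
solutions = {"00000", "10000", "01000", "00100", "00010", "00001"}

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# For each solution, count its neighbors at Hamming distance 1.
neighbors = {s: sum(1 for t in solutions if hamming(s, t) == 1)
             for s in solutions}

# Marginal probability that each weight is 0 across the six solutions.
p_zero = [sum(1 for s in solutions if s[i] == "0") / len(solutions)
          for i in range(5)]
```

Here 00000 indeed has the largest number of distance-1 neighbors, consistent with its larger local entropy.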
Our result suggests that during the training of neural networks, the learning dynamics of different weights are heterogeneous: the values of some weights can be quickly determined in the first few time steps, while most time steps are spent on determining the values of the other weights (Fig.3b,c). This result has implications for both neuroscience and machine learning. During the development of the embryo and infant, synapses are massively eliminated Piochon et al. (2016). One may wonder why the brain removes synapses that could be used to store information, long before the brain becomes functionally competent. Our result suggests that some synapses may be assigned zero efficacy at a very early learning stage of the animal and are unlikely to change during subsequent learning. In this case, it is structurally and spatially economical to remove these silent synapses as soon as possible. Additionally, we presume that this heterogeneous dynamics of weights may also exist when training artificial neural networks. We can evaluate the confidence that a weight stays at a value using hidden states as in SBPI (see also Refs. Hubara et al. (2018); Ott et al. (2017)), fix the high-confidence weights at their values within a few initial time steps, and exclude them from further updating. This scheme can significantly reduce the computational overhead of training. Notably, in both the biological and the artificial context, this early-weight-fixing strategy may not compromise the generalization performance of the network after training, because the dense solution region lies in the subspace where these early-fixed weights take the values they are fixed at.
Acknowledgements. We thank Haiping Huang for constructive comments. This work was partially supported by the Hong Kong Baptist University Strategic Development Fund, RGC (Grant No. 12200217) and NSFC (Grant No. 11275027).
Appendix A Implementations of SBPI and rBP
In Fig.2, the capacity of an algorithm was estimated by adding input-output associations one by one, until the algorithm failed to find a solution within a given number of time steps. Below we briefly introduce the two efficient algorithms (SBPI and rBP) studied in this paper.
We used the SBPI01 algorithm introduced in Ref. Baldassi et al. (2007). In this algorithm, each synapse has a hidden variable that takes odd integer values between and . if , and if . When an input-output (IO) pair is presented, the hidden states are updated in the following two cases. First, when this IO pair is unassociated, is updated as . Second, when this IO pair is associated, but and , then with probability , . Solving was stopped when a solution was found or after sweeping all IO pairs in random sequential order for times. , .
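As an illustration only, the sketch below mimics the flavor of an SBPI-style hidden-state update for a single IO pair: hidden states stay odd integers, unassociated pairs push the hidden states of active inputs toward the desired output, and barely associated pairs are stochastically reinforced. The step size, the near-threshold condition, the bound K = 11, and the probability ps are hypothetical stand-ins, not the actual SBPI01 parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)

def sbpi_step(h, xi, sigma, theta, K=11, ps=0.3, rng=rng):
    """One hedged SBPI-style update for a single IO pair.

    h     : odd-integer hidden states, one per synapse
    xi    : binary input pattern; sigma : desired binary output
    theta : firing threshold.  K and ps are assumed parameters.
    """
    w = (h > 0).astype(int)                 # binary weights from hidden states
    current = w @ xi
    out = int(current >= theta)
    if out != sigma:
        # Unassociated pair: push hidden states of active inputs toward
        # the desired output (step of 2 keeps h odd).
        h = h + 2 * (2 * sigma - 1) * xi
    elif abs(current - theta) <= 1:
        # Barely associated pair: with probability ps, reinforce weights.
        if rng.random() < ps:
            h = h + 2 * (2 * w - 1) * xi
    return np.clip(h, -K, K)                # odd bounds keep h odd
```

A call with an unassociated pair raises the hidden states of active inputs; a barely associated pair is reinforced only with probability ps.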
The rBP updating rule Braunstein and Zecchina (2006) is
The values of weights are determined by the sign of single-site fields
so that if and otherwise. The message-passing equations eq.9 are iterated on a factor graph in which the th variable node (representing ) and the th factor node (representing the th IO pair) are linked if . In eqs.9 and 12, is a random number that takes with probability and 0 otherwise. We also added damping through in eq.9, which is necessary to avoid divergence of the algorithm before a solution is found. In practice, we constrained between , so that if (or ) by the first equation of eq.9, we set (or ). We found that adding this constraint improves the performance of the algorithm. Solving was stopped when a solution was found or after time steps. In Fig.2c, ; in Fig.2e, at a given pair of and values, we chose the minimal number of time steps that rBP used among when and 0.999. We found that at the parameter values we chose (, and ), when , 0.2 and 0.3, rBP can successfully solve the perceptron with high probability for any of the three values, and it takes the fewest time steps when ; when , however, rBP has a low probability of solving the problem if , and it takes fewer time steps when than when (Fig.S3). Because of this, in Fig.3 and Fig.4, we set when , 0.2 and 0.3, and when .
Appendix B Belief-propagation-guided decimation
The idea of belief-propagation-guided decimation (BPD) is to use belief propagation (BP) to evaluate the marginal probabilities of unfixed weights and then fix the most polarized weight at its preferred value; BP is then run again to fix another weight, and so on, until all weights are fixed. At a step of BPD, may become fixed at 1 or 0, regardless of the values of the unfixed weights. If , then the th IO pair is successfully associated and eliminated from the factor graph; if , however, BPD has failed to find a solution of the perceptron problem and is terminated.
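The decimation loop itself can be sketched independently of how the marginals are obtained. The toy below replaces BP with exact enumeration (feasible only for very small systems, as in our small-sized experiments) and, at each step, fixes the most polarized unfixed weight at its preferred value; the function names are ours, not from the original code:

```python
import itertools
import numpy as np

def enumerate_marginals(patterns, outputs, theta, fixed):
    """Exact marginals P(w_i = 1) over solutions consistent with `fixed`.
    `fixed` maps weight index -> 0/1.  Illustration: only feasible for small N."""
    N = patterns.shape[1]
    free = [i for i in range(N) if i not in fixed]
    counts = np.zeros(N)
    total = 0
    for bits in itertools.product([0, 1], repeat=len(free)):
        w = np.zeros(N, dtype=int)
        for i, v in fixed.items():
            w[i] = v
        for i, v in zip(free, bits):
            w[i] = v
        if all(int(p @ w >= theta) == o for p, o in zip(patterns, outputs)):
            counts += w
            total += 1
    return counts / total if total else None

def decimate(patterns, outputs, theta):
    """Decimation: repeatedly fix the most polarized unfixed weight."""
    N = patterns.shape[1]
    fixed = {}
    while len(fixed) < N:
        m = enumerate_marginals(patterns, outputs, theta, fixed)
        if m is None:
            return None                      # contradiction: decimation fails
        i = max((j for j in range(N) if j not in fixed),
                key=lambda j: abs(m[j] - 0.5))
        fixed[i] = int(m[i] > 0.5)
    return np.array([fixed[i] for i in range(N)])
```

On a tiny instance with a unique solution, this loop recovers that solution weight by weight.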
BP equations are
where as in eq.10, and . These BP equations run on a factor graph where unfixed weights (represented by ) and IO pairs (represented by ) are variable and factor nodes, respectively, and there is a link between the th variable node and the th factor node if . in eq.14 denotes the number of fixed weights with . After obtaining a fixed point of BP, the marginal probabilities of unfixed weights can be estimated as
After introducing and , and correspondingly and , BP equations can be simplified as
When the number of unfixed variable nodes around factor node is large, eq.17 can be efficiently calculated using Gaussian approximation:
where , .
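The idea behind the Gaussian approximation can be illustrated on a toy quantity: the probability that a sum of many independent Bernoulli weights reaches a threshold. The sketch below (our illustration, not the paper's exact eq.17; the continuity correction is our addition) compares the Gaussian estimate with exact enumeration:

```python
import itertools
import math

def p_fire_exact(probs, theta):
    """P(sum_i w_i >= theta) for independent Bernoulli weights w_i,
    by exact enumeration (only feasible for small numbers of weights)."""
    total = 0.0
    for bits in itertools.product([0, 1], repeat=len(probs)):
        pr = 1.0
        for b, p in zip(bits, probs):
            pr *= p if b else (1 - p)
        if sum(bits) >= theta:
            total += pr
    return total

def p_fire_gauss(probs, theta):
    """Gaussian approximation: mean mu = sum p_i, var = sum p_i (1 - p_i)."""
    mu = sum(probs)
    var = sum(p * (1 - p) for p in probs)
    # Continuity correction: P(S >= theta) ~ 1 - Phi((theta - 0.5 - mu) / sd).
    z = (theta - 0.5 - mu) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2))
```

With a dozen weights at probability 0.5 and threshold 6, the two estimates already agree to within about 0.001, which is why the Gaussian route is efficient for large-degree factor nodes.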
In practice, for factor nodes with , we used the Gaussian approximation above, while for factor nodes with , we used exact enumeration to calculate eq.17. We found that this small-degree-enumeration strategy results in better convergence of BP than a pure Gaussian strategy. We also added damping to eq.16:
with when , and when in all our results.
Numerically, we defined convergence of BP as the case when the change of the BP entropy density between adjacent iteration steps is smaller than . When BP failed to converge, we estimated the iteration step closest to the fixed point from the iteration dynamics of using a number of heuristics, and calculated the marginal probabilities from the BP messages at that time step through eq.15. See Supplemental Material Section 4 for more details.
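The stopping rule can be sketched generically: iterate a map and stop once a scalar summary (standing in for the BP entropy density) changes by less than a tolerance between adjacent steps. The helper below is a hypothetical illustration of this criterion, not the authors' code:

```python
def iterate_until_converged(step, entropy, state, tol=1e-6, max_iter=1000):
    """Iterate `step` until the change of `entropy(state)` between adjacent
    iterations falls below `tol` (hypothetical helper names)."""
    prev = entropy(state)
    for t in range(1, max_iter + 1):
        state = step(state)
        cur = entropy(state)
        if abs(cur - prev) < tol:
            return state, t, True       # converged
        prev = cur
    return state, max_iter, False       # failed to converge
```

Any contraction serves as a demonstration; the Newton map for sqrt(2) converges in a handful of steps under this criterion.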
Appendix C Evaluating cross-correlation using belief propagation
The cross-correlation between and can be calculated using
where and can be evaluated using BP through eq.15. To evaluate , we fixed , ran BP to a fixed point, and calculated the marginal probabilities of the other weights.
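On enumerated solutions, this conditional-probability route to the cross-correlation can be checked exactly. The sketch below computes the connected correlation via P(w_i = 1 | w_j = 1), mirroring the fix-w_j-and-recompute-marginals scheme (the function is our illustration, not the BP implementation):

```python
import numpy as np

def cross_correlation(sols, i, j):
    """Connected correlation <w_i w_j> - <w_i><w_j> over a set of solutions,
    written through the conditional marginal P(w_i = 1 | w_j = 1)."""
    S = np.array(sols)
    p_i, p_j = S[:, i].mean(), S[:, j].mean()
    cond = S[S[:, j] == 1]              # solution subspace with w_j fixed to 1
    p_i_given_j = cond[:, i].mean() if len(cond) else 0.0
    return p_i_given_j * p_j - p_i * p_j
```

On the six-solution toy example of the Discussion, the conditional form reproduces the direct connected correlation exactly.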
In Fig.5b-d and Fig.S6, we evaluated mean cross-correlation and mean square cross-correlation using
where represents the quenched average (i.e., the average over IO pairs ), is the set of values of , and is the number of unfixed weights during BPD. When , contains all the unfixed weights; when , contains 50 randomly chosen unfixed weights.
If BP for evaluating the unconditioned marginal probabilities (i.e., in eq.20) did not converge, we did not continue to evaluate the conditional probabilities, and excluded this case from the quenched average. Sometimes, BP for the unconditioned probabilities converged, but BP for the conditional probabilities did not. We found that this usually happened when was small, so the non-convergence of BP when we fixed may reflect the unlikelihood that takes 1, or in other words, that in most solutions. Therefore, we set for all in this case. We also tried excluding from the s whose fixation to 1 leads to non-convergence of BP, and found that the results were not significantly different.
Appendix D Evaluating local entropy using belief propagation
We followed Ref.Huang et al. (2013) to evaluate the local entropy around a given weight configuration using belief propagation. The partition function we calculated is
with being a coupling factor controlling the distance of from .
Appendix E Replica-method analysis
The replica methods used in Fig.2a-c and Fig.4a,b closely follow the standard approaches presented in Refs. Brunel (2016); Huang and Kabashima (2014); Baldassi et al. (2016c). Here we only list the free entropy densities used to calculate the theoretical capacity, the local entropy around typical solutions, and the local entropy around configurations in the dense solution region. Details of how to introduce replicas to calculate the quenched average of the free entropy density can be found in Refs. Brunel (2016); Huang and Kabashima (2014); Baldassi et al. (2016c).
When calculating theoretical capacity, we introduce partition function
counts the number of solutions given a set of IO pairs . Using the replica method, we can calculate the free entropy density
with indicating the quenched average over sets of IO pairs. The theoretical capacity plotted in Fig.2a-c is the value at which . During the replica calculation, an order parameter is defined as . This means that in a well-trained perceptron, the average total synaptic current deviates from the firing threshold by up to order. In the classification task we considered, the output has equal probability of being 0 or 1. In this case, has saddle-point value 0, so .
The local entropy around typical solutions plotted in Fig.4a is known as the Franz-Parisi potential at zero temperature. It is defined as
where denotes the number of solutions at distance from the reference configuration . is thus the mean free entropy density at distance from a solution.
Appendix F Miscellaneous
The method of enumeration is used both for listing all the solutions of small-sized systems and for calculating eq.17 when . The basic idea for speeding up enumeration is to sweep all possible weight configurations in an order such that adjacent configurations have minimal Hamming distance. In practice, configurations were swept in the order of the Gray code Krauth and Opper (1989).
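A minimal sketch of Gray-code enumeration for counting solutions: since adjacent configurations differ in exactly one weight, each pattern's total current can be updated incrementally instead of being recomputed from scratch. This is our illustration of the idea, not the original code:

```python
def count_solutions_gray(patterns, outputs, theta):
    """Enumerate all 2^N weight vectors in Gray-code order, updating each
    pattern's total current incrementally (one flipped weight per step)."""
    N = len(patterns[0])
    w = [0] * N
    currents = [0] * len(patterns)          # running value of p @ w per pattern

    def is_solution():
        return all(int(c >= theta) == o for c, o in zip(currents, outputs))

    count = int(is_solution())
    for k in range(1, 2 ** N):
        # Between consecutive Gray codes g(k-1) and g(k), the flipped bit
        # is the position of the lowest set bit of k.
        i = (k & -k).bit_length() - 1
        delta = 1 - 2 * w[i]                # +1 if 0 -> 1, -1 if 1 -> 0
        w[i] += delta
        for m, p in enumerate(patterns):
            currents[m] += delta * p[i]
        count += int(is_solution())
    return count
```

Each step costs O(number of patterns) instead of O(N times number of patterns), which is where the speed-up comes from.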
The random walk in Fig.3 and Fig.4 was performed as follows. Starting from a solution, we tried to flip a chosen weight, and accepted the flip if the configuration after flipping was still a solution. Weights were chosen in random sequential order. In Fig.3 and Fig.4, we started the random walk from a solution found by SBPI or rBP, and swept all weights 10000 times.
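This random walk can be sketched as follows (our illustration; the acceptance rule is exactly the one described above: flip one weight at a time and keep the flip only if the configuration remains a solution):

```python
import random

def random_walk(w, is_solution, sweeps=100, seed=0):
    """Random walk over the solution space via single-weight flips.
    Weights are visited in random sequential order within each sweep."""
    rng = random.Random(seed)
    w = list(w)
    for _ in range(sweeps):
        order = list(range(len(w)))
        rng.shuffle(order)
        for i in order:
            w[i] ^= 1                   # tentative single-weight flip
            if not is_solution(w):
                w[i] ^= 1               # reject: flip back
    return w
```

By construction, every visited configuration is a solution, so the walk explores the 1-weight-flip cluster containing the starting solution.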
- When deriving the latter form, we have used the facts that and . The expression means that the total synaptic current is on average equal to the firing threshold (Fig.1a, see Appendix E).
- A caveat here is that when interpreting as a threshold controller, the standard deviation of the total synaptic current over s is approximately , which decreases with . This implies that lowering the firing threshold may decrease the robustness of memory retrieval under noise. However, this is not a problem, because we can always decrease the firing threshold and scale up the synaptic weights simultaneously, enjoying both low energy cost and high robustness of memory retrieval.
- Note the choice of the parameter in our numerical experiments. In our model, does not scale when . Therefore, if we set to a large value, comparable to or larger than , it is questionable how well the numerical results represent the large case. We chose or 12 when studying systems of size or 25, so that is well below in both cases.
- D. H. O’Connor, G. M. Wittenberg, and S. S.-H. Wang, Proc. Natl. Acad. Sci. U. S. A. 102, 9679 (2005).
- J. M. Montgomery and D. V. Madison, Trends Neurosci. 27, 744 (2004).
- J. Misra and I. Saha, Neurocomputing 74, 239 (2010).
- R. L. Rivest and A. L. Blum, Neural Networks 5, 117 (1992).
- A. Amaldi, in Artificial Neural Networks, edited by O. Simula, T. Kohonen, K. Makisara, and J. Kangas (Elsevier, 1991) pp. 55–60.
- I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, J. Mach. Learn. Res. 18, 1 (2018).
- J. Ott, Z. Lin, Y. Zhang, S.-C. Liu, and Y. Bengio, arXiv:1608.06902 (2017).
- A. Engel and C. V. den Broeck, Statistical Mechanics of Learning (Cambridge University Press, Cambridge, England, 2001).
- A. Lage-Castellanos, A. Pagnani, and M. Weigt, J. Stat. Mech. 2009, P10009 (2009).
- T. Hosaka, Y. Kabashima, and H. Nishimori, Phys. Rev. E 66, 066126 (2002).
- H. Huang and Y. Kabashima, Phys. Rev. E 90, 052813 (2014).
- L. Zdeborová and M. Mézard, Phys. Rev. Lett. 101, 078702 (2008a).
- C. Baldassi, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina, Phys. Rev. Lett. 115, 128101 (2015).
- C. Baldassi, F. Gerace, H. J. Kappen, C. Lucibello, L. Saglietti, E. Tartaglione, and R. Zecchina, Phys. Rev. Lett. 120, 268103 (2018).
- C. Baldassi, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina, J. Stat. Mech. Theory Exp. 2016, 023301 (2016a).
- C. Baldassi, C. Borgs, J. T. Chayes, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina, Proc. Natl. Acad. Sci. U.S.A. 113, E7655 (2016b).
- P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina, in International Conference on Learning Representations (ICLR) (2017).
- D. J. Amit and S. Fusi, Neural Comput. 6, 957 (1994).
- A. B. Barrett and M. C. W. van Rossum, PLoS Comput. Biol. 4, e1000230 (2008).
- R. Legenstein and W. Maass, Neural Comput. 20, 288 (2008).
- N. Brunel, V. Hakim, P. Isope, J. P. Nadal, and B. Barbour, Neuron 43, 745 (2004).
- R. Huerta, Neural Comput. 16, 1601 (2004).
- C. Clopath and N. Brunel, PLoS Comput Biol 9, e1002919 (2013).
- M. Mézard and A. Montanari, Information, Physics, and Computation (Oxford University Press, Oxford, England, 2009).
- A. Braunstein and R. Zecchina, Phys. Rev. Lett. 96, 030201 (2006).
- C. Baldassi, A. Braunstein, N. Brunel, and R. Zecchina, Proc. Natl. Acad. Sci. U. S. A. 104, 11079 (2007).
- C. Baldassi and A. Braunstein, J. Stat. Mech. P08008, 052313 (2015).
- G. Semerjian, J. Stat. Phys. 130, 251 (2008).
- A. A. Faisal, L. P. J. Selen, and D. M. Wolpert, Nat. Rev. Neurosci. 9, 292 (2008).