About Learning in Recurrent Bistable Gradient Networks
Recurrent Bistable Gradient Networks [1, 2, 3] are attractor-based neural networks characterized by the bistable dynamics of each single neuron. Coupled through a linear interaction determined by the interconnection weights, these networks no longer suffer from spurious states or very limited capacity. Vladimir Chinarov and Michael Menzinger, who invented these networks, trained them using Hebb’s learning rule. We show that this way of computing the weights leads to unwanted behaviour and limits the networks’ capabilities. Furthermore, we show that using the first order of Hinton’s Contrastive Divergence algorithm leads to a quite promising recurrent neural network. These findings are tested by learning images from the MNIST database for handwritten numbers.
Hopfield networks, introduced by John Hopfield in 1982 [6] and extended to neurons with graded response in 1984 [7], are in some sense predecessors of Deep Belief Networks, which are widely used as state-of-the-art neural networks. They are recurrent neural networks inspired by the physical behaviour of spin glasses. Hopfield networks are perceptron based and have a symmetric weight matrix and no self-connecting neurons. This guarantees that the only dynamics that can take place in this type of network is attraction to a fixed point. To overcome negative effects such as the spurious states and limited capacity of Hopfield networks, Boltzmann Machines [5] were introduced. These were then restricted to have no interconnections between neurons within a layer, stacked, and trained layer by layer, e.g. with the Wake-Sleep Algorithm introduced by Geoffrey Hinton [4]. In the BioSystems journal [1] in the year 2000 and at the IWANN conference in 2001 [2], Vladimir Chinarov and Michael Menzinger presented a class of recurrent Hopfield-like networks called Bistable Gradient Networks, which eliminated the disadvantages of spurious states and of the very limited capacity. They demonstrated this by training these networks successfully with the Hebbian learning rule, storing many more patterns than a standard Hopfield network could memorize (, with neurons).
Because of their successful implementation of interconnected neurons, their paper presents Hebb’s rule as the perfect, efficient way to train Bistable Gradient Networks. Our investigation shows that this is not always true: there are pattern combinations which cannot be stored with Hebb’s learning rule. In the following section we start with a short description of the basic principles of Bistable Gradient Networks. To explain why Hebb’s learning rule is not the best choice for training them, a simple thought experiment is described afterwards. We show that using Hinton’s Contrastive Divergence leads to far better results. Furthermore, we demonstrate the network’s capabilities by storing handwritten numbers from the MNIST database in it, and we point out that noisy images are denoised nearly perfectly. We close with a conclusion.
II Bistable Gradient Networks
This section gives a short introduction to the basic concepts of Bistable Gradient Networks. In the domain of dynamical systems, a neuron is described by a differential equation. To derive this equation we start with the energy function of such a neuron, which may be defined as follows:
where the nonlinear single-neuron term leads to a bistable behaviour of the neuron and the weighted sum describes the linear coupling between the neurons.
The state variable defines the neuron’s state or output, while the derivative of the energy function with respect to this variable gives the direction in which the neuron’s state changes in time:
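For orientation, one form of the energy function and of the resulting dynamics that is consistent with the description above is sketched below. The quartic double-well term, the symbols $x_i$ (state of neuron $i$) and $w_{ij}$ (interconnection weight from neuron $j$ to neuron $i$), and the factor $1/2$ in the coupling term are assumptions based on the way Bistable Gradient Networks are usually presented in [1, 2]; they are not copied from the equations of this text.

```latex
E = \sum_i \left( -\frac{x_i^2}{2} + \frac{x_i^4}{4} \right)
    - \frac{1}{2} \sum_{i,j} w_{ij}\, x_i x_j ,
\qquad
\frac{dx_i}{dt} = -\frac{\partial E}{\partial x_i}
                = x_i - x_i^3 + \sum_j w_{ij}\, x_j .
```

Under this assumption an uncoupled neuron has two stable fixed points at $x_i = \pm 1$ and an unstable one at $x_i = 0$. The second equality holds for symmetric weights, $w_{ij} = w_{ji}$.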
This energy function, or potential, is shown in figure 1; its derivative is plotted in figure 2. The minima of the energy function correspond to the fixed points marked in the figure of its derivative. In the differential equation (2) there is a linear part, the sum of the weighted outputs, which may shift the function up or down, as shown by the dashed line in figure 2. Depending on this linear part, it can easily happen that only the left or only the right fixed point remains, which leads to a predetermined behaviour: the neural output converges either to the upper stable value (or slightly above it) or to the lower stable value (or slightly below it). Let us call the upper state active and the lower state inactive. If a number of neurons have positive interconnection weights and a large part of these neurons is active, their derivative is shifted up and the inactive neurons among them converge to the active state. Conversely, the derivative of a neuron which is active, but connected with negative weights to and from the other active neurons, is shifted down, and the neuron converges to the inactive state.
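The effect of the linear part can be illustrated with a single neuron. The following is a minimal sketch, assuming the double-well dynamics dx/dt = x - x^3 + c, where the constant `coupling` (c) stands in for the sum of the weighted outputs; the function name, step size and number of steps are illustrative choices, not values from the experiments.

```python
def simulate_neuron(x0, coupling, dt=0.01, steps=5000):
    """Integrate dx/dt = x - x**3 + coupling with the explicit Euler method."""
    x = x0
    for _ in range(steps):
        x += dt * (x - x**3 + coupling)
    return x

# Without coupling both stable fixed points exist; the start value decides.
x_right = simulate_neuron(0.1, 0.0)
x_left = simulate_neuron(-0.1, 0.0)

# A sufficiently strong positive shift removes the left fixed point:
# even a neuron started in the inactive state ends up in the active one.
x_forced = simulate_neuron(-1.0, 0.5)
```

Under these assumptions the left fixed point disappears once the shift exceeds the local maximum of x - x^3, which is 2/(3*sqrt(3)), roughly 0.385, and the remaining fixed point lies slightly above 1.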
III Thought Experiment
To understand why certain pattern combinations cannot be stored, we first write down the Hebbian learning rule:
If we store only one pattern, it is easily seen that the active neurons get strong positive interconnections among themselves, as do the inactive neurons, while the connections between active and inactive neurons are strongly negative. This implies that the inverse image is always stored in the network as strongly as the image itself, a phenomenon which is also described by Hopfield [6, 7].
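The symmetry between an image and its inverse can be checked numerically. The sketch below assumes the common form of the Hebbian rule, w_ij = (1/N) * sum over patterns of x_i * x_j with states coded as +1 (active) and -1 (inactive) and without self-connections; the normalisation by the number of neurons N and the helper name `hebbian_weights` are assumptions made for illustration.

```python
import numpy as np

def hebbian_weights(patterns):
    """Hebbian weights w_ij = (1/N) * sum_p x_i^p * x_j^p, zero diagonal."""
    patterns = np.asarray(patterns, dtype=float)
    n_neurons = patterns.shape[1]
    weights = patterns.T @ patterns / n_neurons
    np.fill_diagonal(weights, 0.0)  # no self-connections
    return weights

# One pattern: neurons 0-2 active (+1), neurons 3-4 inactive (-1).
pattern = np.array([[1, 1, 1, -1, -1]])
w_image = hebbian_weights(pattern)
w_inverse = hebbian_weights(-pattern)
```

Because every weight is a product of two states, flipping all signs leaves the weight matrix unchanged, so the rule cannot distinguish the image from its inverse.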
If we now try to store many strongly overlapping patterns in a network, for example patterns which share a large number of active neurons and differ only in a small number of neurons, a problem emerges: the large group of neurons which is always active will always inhibit the small area of mostly inactive neurons, even if a few of them are active in a stored pattern. The network would always activate the large area, while the rest would certainly always be deactivated.
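The thought experiment can be reproduced with a small, hypothetical example: 20 neurons that are active in every pattern and 5 neurons of which exactly one is active per pattern. The sizes, the +1/-1 coding and the 1/N normalisation of the Hebbian rule are assumptions chosen for illustration. The sketch computes the linear input (the sum of the weighted outputs) that each stored pattern exerts on its own distinguishing neuron.

```python
import numpy as np

n_big, n_small = 20, 5
n_neurons = n_big + n_small

# Pattern p: all "big" neurons active, exactly one "small" neuron active.
patterns = -np.ones((n_small, n_neurons))
patterns[:, :n_big] = 1.0
for p in range(n_small):
    patterns[p, n_big + p] = 1.0

# Hebbian weights (1/N normalisation assumed), no self-connections.
weights = patterns.T @ patterns / n_neurons
np.fill_diagonal(weights, 0.0)

# Linear input on the one small-area neuron that should be active in pattern p.
fields = np.array([weights[n_big + p] @ patterns[p] for p in range(n_small)])

# Linear input on a neuron of the large, always active area (first pattern).
big_field = weights[0] @ patterns[0]
```

In this example the input on every distinguishing neuron is negative (-2.56), although that neuron is active in the pattern presented, while the input on the large area is positive. With a shift of this size the active state of the distinguishing neuron no longer exists under the assumed double-well dynamics, so the stored pattern is not a fixed point.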
In the following section we describe how to change the learning rule to obtain the desired behaviour.
IV Contrastive Divergence
Although the neurons Geoffrey Hinton uses are of a completely different type (binary outputs with stochastic activation), the learning rule is of great interest to us. The learning rule for the weights may be written down as follows:
where denotes the time step. is computed as:
We start with randomly initialized weights. We initialize the network’s output with the pattern to be learned; the term for the initial time step is calculated from this initialisation. After computing the activation of the network for one time step, we obtain the term for the next time step. To adapt the weights, (5) only has to be applied to all patterns several times.
If a pattern is represented by a fixed point the difference in (5) will be zero and the weights stay unchanged. If a neuron changes its state after activation, the difference may be positive or negative. In the case of a negative difference the weights are weakened, while if it is positive the weights are strengthened. This is done until the difference for all patterns is zero, so that each pattern to be learned results in a fixed point.
After each weight change we keep the neurons’ connections symmetric and eliminate self-connections. These two conditions guarantee that our network contains only fixed point attractors, because any state change will decrease the corresponding energy function. In further experiments we dropped these constraints and saw that the behaviour of the network did not change noticeably.
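The procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the dynamics dx/dt = x - x^3 + W x, the use of a single explicit Euler step as "one time step", the outer products of the states as correlation terms, and all names and parameter values (`euler_step`, `cd1_update`, `train`, `learning_rate`, `dt`, `epochs`) are hypothetical choices.

```python
import numpy as np

def euler_step(x, weights, dt):
    """One explicit Euler step of the assumed dynamics dx/dt = x - x**3 + W x."""
    return x + dt * (x - x**3 + weights @ x)

def cd1_update(weights, pattern, learning_rate=0.01, dt=0.1):
    """First-order, Contrastive-Divergence-style update for a single pattern."""
    x0 = np.asarray(pattern, dtype=float)   # network initialised with the pattern
    x1 = euler_step(x0, weights, dt)        # state after one time step
    delta = learning_rate * (np.outer(x0, x0) - np.outer(x1, x1))
    new_weights = weights + delta
    new_weights = 0.5 * (new_weights + new_weights.T)  # keep connections symmetric
    np.fill_diagonal(new_weights, 0.0)                 # eliminate self-connections
    return new_weights

def train(patterns, epochs=200, learning_rate=0.01, dt=0.1, seed=0):
    """Apply the update to all patterns several times, from random weights."""
    patterns = np.asarray(patterns, dtype=float)
    rng = np.random.default_rng(seed)
    n_neurons = patterns.shape[1]
    weights = rng.normal(scale=0.01, size=(n_neurons, n_neurons))
    weights = 0.5 * (weights + weights.T)
    np.fill_diagonal(weights, 0.0)
    for _ in range(epochs):
        for pattern in patterns:
            weights = cd1_update(weights, pattern, learning_rate, dt)
    return weights

example_patterns = np.array([[1, -1, 1, -1], [1, 1, -1, -1]])
trained = train(example_patterns)
```

If the state does not move during the time step, the two outer products coincide and the weights stay unchanged, which mirrors the fixed point argument above.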
In the next section the algorithm is tested on patterns of the MNIST-database for handwritten numbers.
V The MNIST-database for handwritten numbers
To show that learning with the algorithm is successful, we trained a network of neurons with patterns from the MNIST-database using (5). An excerpt of these learning patterns is shown in figure 3. The large overlap in the neurons’ activation from one handwritten number to another makes it impossible to train these patterns using the Hebbian learning rule.
The handwritten numbers are trained for iterations with a learning rate . Figure 4 shows the reconstruction of the original images from images with more than of noise added. The network is activated using the Euler method with a step size of . Each -th time step an image is computed. The converged images have a mean error rate of about .
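The reconstruction test can be sketched as a small evaluation harness. Everything specific here is an assumption made for illustration: noise is modelled as flipping the sign of a fraction of +1/-1 pixels, the dynamics are dx/dt = x - x^3 + W x, the error rate is the fraction of pixels with the wrong sign, and the weights come from a single random pattern stored with a Hebbian rule as a stand-in for trained weights. The names `add_flip_noise`, `relax` and `error_rate` are hypothetical.

```python
import numpy as np

def add_flip_noise(image, fraction, rng):
    """Flip the sign of the given fraction of the +1/-1 pixels."""
    noisy = image.copy()
    n_flip = int(round(fraction * image.size))
    idx = rng.choice(image.size, size=n_flip, replace=False)
    noisy[idx] *= -1
    return noisy

def relax(x, weights, dt=0.05, steps=400):
    """Let the network converge using explicit Euler steps."""
    x = x.astype(float).copy()
    for _ in range(steps):
        x += dt * (x - x**3 + weights @ x)
    return x

def error_rate(reconstruction, original):
    """Fraction of pixels whose sign differs from the original image."""
    return float(np.mean(np.sign(reconstruction) != np.sign(original)))

rng = np.random.default_rng(0)
image = np.where(rng.random(100) < 0.5, 1.0, -1.0)

# Stand-in weights: one pattern stored with a Hebbian rule, zero diagonal.
weights = np.outer(image, image) / image.size
np.fill_diagonal(weights, 0.0)

noisy = add_flip_noise(image, 0.2, rng)
error_before = error_rate(noisy, image)
error_after = error_rate(relax(noisy, weights), image)
```

With a single stored pattern the flipped pixels are pulled back to the stored state, so this toy case only demonstrates how the error rate is measured; it says nothing about the MNIST results.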
Using Hebb’s learning rule to compute the interconnection weights of a recurrent Bistable Gradient Network leads to difficulties, especially with strongly overlapping patterns. To overcome these problems we applied Hinton’s first order Contrastive Divergence algorithm to train the weights. The results were successfully tested with patterns from the MNIST-database for handwritten numbers. Testing an image reconstruction with noisy images of more than of noise leads to a near perfect reconstruction with a mean error rate of about . In our future research we will try to improve learning by taking higher orders of the algorithm into account.
The authors would like to thank Tobias Becht for helpful comments.
- [1] V. Chinarov, M. Menzinger, Computational dynamics of gradient bistable networks, BioSystems 55, pp. 137-142, 2000
- [2] V. Chinarov, M. Menzinger, Bistable Gradient Neural Networks: Their Computational Properties, IWANN 2001 Conference, Granada, Spain, proceedings pp. 333-338, Springer, 2001
- [3] V. Chinarov, M. Menzinger, Reconstruction of noisy patterns by bistable gradient neural like networks, BioSystems 68, pp. 147-153, 2003
- [4] G. E. Hinton, P. Dayan, B. J. Frey, R. M. Neal, The wake-sleep algorithm for unsupervised neural networks, Science 268, pp. 1158-1161, 1995
- [5] G. E. Hinton, T. J. Sejnowski, Learning and Relearning in Boltzmann Machines, in: D. E. Rumelhart, J. L. McClelland, PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, MIT Press, Cambridge, pp. 282-317, 1986
- [6] J. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proc. Nat. Acad. Sci. (USA) 79, pp. 2554-2558, 1982
- [7] J. J. Hopfield, Neurons with graded response have collective computational properties like those of two-state neurons, Proc. Nat. Acad. Sci. (USA) 81, pp. 3088-3092, 1984