Contrastive Hebbian Learning with Random Feedback Weights

Abstract

Neural networks are commonly trained to make predictions through learning algorithms. Contrastive Hebbian learning, which is a powerful rule inspired by gradient backpropagation, is based on Hebb’s rule and the contrastive divergence algorithm. It operates in two phases: the forward (or free) phase, where the data are fed to the network, and the backward (or clamped) phase, where the target signals are clamped to the output layer of the network and the feedback signals are transformed through the transpose synaptic weight matrices. This implies symmetries at the synaptic level, for which there is no evidence in the brain. In this work, we propose a new variant of the algorithm, called random contrastive Hebbian learning, which does not rely on any synaptic weight symmetries. Instead, it uses random matrices to transform the feedback signals during the clamped phase, and the neural dynamics are described by first-order non-linear differential equations. The algorithm is experimentally verified by solving a Boolean logic task, classification tasks (handwritten digits and letters), and an autoencoding task. This article also shows how the parameters, especially the random matrices, affect learning. We use pseudospectra analysis to investigate further how random matrices impact the learning process. Finally, we discuss the biological plausibility of the proposed algorithm, and how it can give rise to better computational models for learning.

1 Introduction

Learning is one of the fundamental aspects of any living organism, regardless of its complexity. From less complex biological entities such as viruses [elde:2012], to highly complex primates [hebb:2005, kandel:2000], learning is of vital importance for survival and evolution. Research in neuroscience has dedicated significant effort to understanding the mechanisms and principles that govern learning in highly complex organisms, such as rodents and primates. Both Hebbian learning [hebb:2005] and spike-timing dependent plasticity (STDP) [bi:1998, markram:1997, zhang:1998] have had a high impact on modern computational neuroscience, since both the Hebb and STDP rules address the problem of adaptation in the nervous system and can account for synaptic plasticity [abbott:2000, cooper:2005]. On the other hand, the field of machine learning has made progress using gradient backpropagation (BP) [rumelhart:1988, dreyfus:1962] in deep neural networks, providing state-of-the-art solutions to a variety of classification and representation tasks [lecun:2015].

Despite this progress, most learning algorithms employed with artificial neural networks are implausible from a biological standpoint. This is especially true for methods based on BP. In particular, some of the more common implausible properties are (i) the requirement of symmetric weights, (ii) neurons that do not have temporal dynamics (i.e., neural dynamics are not described by autonomous dynamical systems or maps as in the case of recurrent neural networks), (iii) the derivatives of the non-linearities have to be computed for each layer with high precision, (iv) the flow of information in the neural network is not similar to the one in real biological systems (synchronization between the forward and backward phases is required), and (v) the backward phase requires the neural activity of the forward phase to be stored.

In this context, there have been attempts to make machine learning algorithms more biologically plausible. One such alternative to BP is the target propagation algorithm [lee:2015]. The target propagation algorithm computes local errors at each layer of the network using information about the target, which is propagated instead of the error signal as in classical BP. However, target propagation still computes derivatives (locally) and requires symmetries at the synaptic level. Another biologically plausible learning algorithm is the recirculation algorithm [hinton:1988] and its generalization, GeneRec [oreily:1996], where the neural network has some recurrent connections that propagate the error signals from the output layer to the hidden layer(s) via symmetric weights. Furthermore, the recirculation algorithm does not preserve the symmetries at the synaptic level, though GeneRec is still based on derivatives and back-propagation of error signals. Moreover, all the aforementioned algorithms require a specific pattern of information flow: every layer must wait for the previous one to reach its equilibrium before it can receive and process the incoming information. This issue can be circumvented by Contrastive Hebbian learning (CHL, deterministic Boltzmann Machine or a mean field approach) [movellan:1991, baldi:1991, xie:2003], which is similar to the contrastive divergence algorithm [hinton:2002]. CHL is based on the Hebbian learning rule and does not require knowledge of any derivatives. Moreover, due to its non-linear continuous coupled dynamics, the information flows in a continuous way. The neural activities in all the layers may be updated simultaneously without waiting for the convergence of previous or subsequent layers. However, CHL requires synaptic symmetries since it relies on the transpose of the synaptic matrix to propagate the feedback signals backwards.

Motivated to create a more biologically plausible CHL, we propose random Contrastive Hebbian learning (rCHL), which avoids the use of symmetric synaptic weights by replacing the transpose of the synaptic weights in CHL with fixed random matrices. This is done in a manner similar to that of Feedback Alignment (FDA) [lillicrap:2016, nokland:2016, neftci:2017]. CHL provides a good basis upon which to develop biologically realistic learning rules because it employs continuous nonlinear dynamics at the neuronal level, does not rely on gradients, allows information to flow in a coupled, synchronous way, and is grounded in Hebb’s learning rule. CHL uses feedback to transmit information from the output layer to the hidden layer(s), and in instances when the feedback gain is small (such as in the clamped phase), it has been demonstrated by Xie and Seung to be equivalent to BP [xie:2003].

Using this approach, the information necessary for learning propagates backwards, though it is not transmitted through the same axons (as required in the symmetric case), but instead via separate pathways or neural populations. Therefore, the randomness we introduce may account for the structure and dynamics of other cerebral areas interfering with the transmitted signals of interest, or for feedback projections such as those that occur in the visual system [markov:2014, macknik:2009, macknik:2007, shou:2010]. By this we do not necessarily imply that the brain implements random transformations, but rather that the random matrices used here in place of transpose synaptic matrices may be thought of in a more general sense as instruments for modeling unknown or hard-to-model dynamics within the brain.

The proposed learning scheme can be used in different contexts, as demonstrated on several learning tasks. We show how rCHL can cope with (i) binary operations such as the XOR, (ii) classifying handwritten digits and letters, and (iii) autoencoding. In most of these cases, the performance (in terms of the mean squared error) of rCHL was either equivalent or similar to BP, FDA, and CHL, suggesting that rCHL can be a potential candidate for general biologically plausible learning models.

2 Materials and Methods

In this section we summarize Contrastive Hebbian learning (CHL) [xie:2003] and introduce random Contrastive Hebbian learning (rCHL). We assume feed-forward networks along with their corresponding feedback connections. is the total number of layers with being the input layer and the output one. Connections from layer to are given by a matrix where and are the sizes of the and layers (number of neurons), respectively. Feedback connections are given by , which can be either a transpose matrix (in the case of CHL) or a random matrix (in the case of rCHL). In both CHL and rCHL, the feedback connections are multiplied by a constant gain, . We define the non-linearities as Lipschitz continuous functions , with Lipschitz constant (i.e. , ). The state of a neuron in the -th layer is described by the state function , and the corresponding bias is given by . The dynamics of all neurons at the -th layer are given by:

(1)

2.1 Contrastive Hebbian Learning

Both CHL and the proposed rCHL operate on the same principle as contrastive divergence [hinton:2002]. This means that learning takes place in two phases. In the positive (free) phase, the input is presented and the output is built by forward propagation of neural activities. In the negative (clamped) phase, the outputs are clamped, and the activity is propagated backwards towards the input layer.

In the free phase, the input layer is held fixed, and the signals are propagated forward through each layer (see the red arrows in figure 1). The dynamics of the neurons at each -th layer are computed through equation (1) for (the layer does not exist and thus and ). During the clamped phase, the target signal is clamped at the output layer and the activity of all the neurons in every layer is computed through equation (1) for (notice here that the input layer does not express any dynamics). The backward flow is illustrated as cyan arrows in figure 1. At the end of the two phases we update the synaptic weights and the biases based on the following equations,

(2a)
(2b)

where is the tensor product, is the learning rate, is the feedback gain, and is the weight update. represents the activity of neurons in the -th layer at the equilibrium configuration of equation (1) in the free phase and the activity of the -th layer in the clamped phase.
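For reference, the layered CHL formulation of Xie and Seung [xie:2003], which this section summarizes, is commonly written in the form sketched below. The symbol names are assumptions adopted for this sketch: $x_k$ is the state of layer $k$, $W_k$ the weights from layer $k-1$ to layer $k$, $b_k$ the bias, $f_k$ the non-linearity, $\gamma$ the feedback gain, $\eta$ the learning rate, $L$ the index of the output layer, and $\check{x}_k$, $\hat{x}_k$ the equilibrium activities in the free and clamped phases, respectively. The feedback term is absent for the output layer.

```latex
\begin{align}
\frac{dx_k}{dt} &= -x_k + f_k\!\left(W_k x_{k-1} + \gamma\, W_{k+1}^{\top} x_{k+1} + b_k\right), \\
\Delta W_k &= \eta\, \gamma^{\,k-L} \left(\hat{x}_k \otimes \hat{x}_{k-1} - \check{x}_k \otimes \check{x}_{k-1}\right), \\
\Delta b_k &= \eta\, \gamma^{\,k-L} \left(\hat{x}_k - \check{x}_k\right).
\end{align}
```

In this form the factor $\gamma^{\,k-L}$ grows for layers close to the input when $\gamma < 1$, which is the scaling the later discussion of the feedback gain relies on.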

2.2 Random Contrastive Hebbian Learning

As described by equation (1), CHL implicitly requires the synaptic weights to be symmetric () in order to use the feedback information. In this work, the main contribution is to cast aside the symmetry and replace all the transpose matrices that appear in CHL with random matrices . This idea is similar to random feedback alignment [lillicrap:2016, nokland:2016], where the error signals are propagated back through random matrices that remain constant during learning. Therefore, equation (1) is modified to the following:

(3)

where the learning increments remain the same. In order to properly apply CHL and rCHL, we follow the second strategy of training proposed by Movellan in [movellan:1991] (pg. , case ), suggesting to first let activity settle during the clamped phase. Then, without resetting activations, free the output units and allow activity to settle again. This method assures that when the minimum for the clamped phase has been reached, it remains stable.
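The modified dynamics can be integrated numerically with a Forward Euler scheme, as in the minimal sketch below. It assumes leaky dynamics of the form dx_k/dt = -x_k + f(W_k x_{k-1} + gamma * G_k x_{k+1} + b_k) with a sigmoid non-linearity; the function name `settle`, the list layout, and the default values of `dt` and `n_steps` are hypothetical choices for illustration, not values taken from this article.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def settle(x, W, G, b, gamma, dt=0.1, n_steps=300, clamp_output=False):
    """Forward-Euler integration of the layer dynamics (assumed form).

    x : list of L+1 activity vectors; x[0] is the input, x[L] the output.
    W : W[k] maps layer k-1 to layer k for k = 1..L (W[0] is unused).
    G : G[k] maps layer k+1 back to layer k for k = 1..L-1 (G[0] is unused).
        Passing G[k] = W[k + 1].T gives symmetric, CHL-style feedback.
    b : b[k] is the bias of layer k (b[0] is unused).
    """
    x = [v.copy() for v in x]
    L = len(x) - 1
    last = L - 1 if clamp_output else L  # a clamped output layer has no dynamics
    for _ in range(n_steps):
        new = [v.copy() for v in x]
        for k in range(1, last + 1):
            drive = W[k] @ x[k - 1] + b[k]
            if k < L:  # the output layer receives no feedback
                drive = drive + gamma * (G[k] @ x[k + 1])
            new[k] = x[k] + dt * (-x[k] + sigmoid(drive))
        x = new  # all layers are updated simultaneously
    return x


# Hypothetical 2-3-1 network with fixed random feedback.
rng = np.random.default_rng(0)
W = [None, rng.uniform(-0.5, 0.5, (3, 2)), rng.uniform(-0.5, 0.5, (1, 3))]
G = [None, rng.uniform(-0.5, 0.5, (3, 1))]
b = [None, np.zeros(3), np.zeros(1)]
state = [np.array([1.0, 0.0]), np.zeros(3), np.zeros(1)]

# Clamped phase first, then free phase without resetting the activations.
state[2] = np.array([1.0])
x_clamped = settle(state, W, G, b, gamma=0.05, clamp_output=True)
x_free = settle(x_clamped, W, G, b, gamma=0.05)
```

The usage lines follow the ordering described above: the clamped phase settles first, and the free phase then starts from the clamped activities.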

We summarize rCHL in Algorithm 1, where is the input dataset, is the corresponding target set, is the number of input samples, and is the number of layers.

, , , epochs, ,
Initialize and randomly for
if Bias is adaptive then
     Initialize randomly
else
      is fixed
end if
for  do
     
     
     
     for  do
         for  with step  do Forward Phase
              
         end for
     end for
     for  do
         for  with step  do Backward Phase
              
         end for
     end for
     for  do
         
         if Bias is adaptive then
              
         end if
     end for
end for
Algorithm 1 Random contrastive Hebbian learning (rCHL). is the input dataset, is the corresponding target set (labels) of the input data set, and is the number of layers of the network, is the simulation time and the Forward Euler method’s time-step.

rCHL starts by randomly initializing the synaptic weights and the random feedback matrices. If the bias is not allowed to learn, it is fixed at the very beginning. If it is permitted to adapt, it is randomly initialized. Then, in every epoch, rCHL randomly picks an input sample and assigns it to the input layer, . At the same time, it assigns the corresponding label (target) to the output layer . Then it solves all the non-linear coupled differential equations for each layer using a Forward Euler method for the backward phase. This means that it computes the and then it solves again the system of the coupled non-linear equations for the forward phase in order to compute the . Once all the activities for the forward and the backward phases have been computed, rCHL updates the synaptic weights and the biases, if they are allowed to be updated, based on equations (2a) and (2b).
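The final step of the procedure, the contrastive Hebbian update applied once both phases have settled, can be sketched as follows. The sketch assumes the layer-dependent scaling eta * gamma**(k - L) used in the standard CHL formulation of Xie and Seung [xie:2003]; the function name `rchl_update` and the list layout (index 0 unused for weights and biases) are hypothetical.

```python
import numpy as np


def rchl_update(W, b, x_free, x_clamped, eta, gamma, adapt_bias=True):
    """Apply the contrastive Hebbian increments in place and return (W, b).

    x_free, x_clamped : lists of settled activities for layers 0..L.
    W[k], b[k]        : weights and bias of layer k, for k = 1..L.
    """
    L = len(x_free) - 1
    for k in range(1, L + 1):
        scale = eta * gamma ** (k - L)  # assumed layer-dependent factor
        W[k] += scale * (np.outer(x_clamped[k], x_clamped[k - 1])
                         - np.outer(x_free[k], x_free[k - 1]))
        if adapt_bias:
            b[k] += scale * (x_clamped[k] - x_free[k])
    return W, b
```

The random feedback matrices do not appear in the update: they shape the settled activities but stay fixed during learning.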

Figure 1 illustrates the neural network architecture and the information flow of CHL (top panel) and rCHL (bottom panel). In the forward pass, the input is provided and information is propagated through matrices . In the backward phase, the output is clamped, and the information flows from the output layer to the hidden layer(s) through the transpose matrix (CHL) or random matrices (rCHL).

Figure 1: Learning scheme information flow. CHL (top panel) and rCHL (lower panel) are illustrated in this figure. Both CHL and rCHL consist of two phases. In the forward phase, the input signal is fed to the network, and the activity propagates up to the deepest layer through matrices (red arrows). In the backward phase, the output (target) signal is clamped, whilst the input signal is still present and affects the neural dynamics. The activity is propagated backwards through matrix (navy color arrows) in the CHL and in the rCHL. The feedback mechanism of rCHL does not require any symmetries and acts more like a feedback system on the dynamics of neurons, conveying information from the -th layer back to the -th layer.

3 Results

Next, we demonstrate rCHL on a variety of tasks. The algorithm successfully solves logical operations, classification tasks and pattern generation problems. In the logical operations and classification tasks we compared our results against the state-of-the-art back-propagation (BP) and feed-back alignment (FDA). In all rCHL and CHL simulations, we used the following settings, unless otherwise stated: time step , total simulation time , learning rate , and feedback gain .

3.1 Bars and Stripes Classification

We investigate how the parameters of rCHL affect the learning process. In particular, we examine how to select the random matrix , the feedback gain , and the number of layers . To this end we use the bars and stripes classification task to demonstrate the effect of the different parameters [mackay:2003]. The dataset consists of binary images (black– and white–) of size representing bars and stripes, as shown in figure 2A. We train a network of three layers () with sizes to classify the input data into bars and stripes. During each epoch we randomly pick one of the images and present it to the network, for a total of epochs. Every epochs we freeze the learning and test the performance of the network. We measure the Mean Square Error (MSE) (i.e., , where is a reference signal of components and is the estimated signal) and the accuracy of the network. All neurons have sigmoid activation functions ( for all ).
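A bars and stripes dataset and the MSE measure can be generated along the following lines. This is a minimal sketch under stated assumptions: the image side `n` is a free parameter, vertical patterns are labeled 0 ("bars") and horizontal ones 1 ("stripes"), the two uniform images (all black, all white) are skipped because they belong to both classes, and the MSE is taken as the mean of squared component differences. None of these choices is confirmed by the text above.

```python
import itertools

import numpy as np


def bars_and_stripes(n):
    """Return flattened n x n binary images and their labels."""
    images, labels = [], []
    for bits in itertools.product([0, 1], repeat=n):
        if len(set(bits)) == 1:
            continue  # uniform images are ambiguous and skipped in this sketch
        pattern = np.array(bits, dtype=float)
        # Constant columns: every row of the image equals the pattern.
        images.append(np.tile(pattern, (n, 1)).ravel())
        labels.append(0)
        # Constant rows: every column of the image equals the pattern.
        images.append(np.tile(pattern[:, None], (1, n)).ravel())
        labels.append(1)
    return np.array(images), np.array(labels)


def mse(reference, estimate):
    """Mean of squared differences between reference and estimated signals."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    return float(np.mean((reference - estimate) ** 2))


X, y = bars_and_stripes(4)  # 4 x 4 images, a hypothetical size
```

With these conventions an `n x n` grid yields `2 * (2**n - 2)` distinct images, split evenly between the two classes.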

The achieved test MSE is shown in figure 2B and the test accuracy in figure 2C. It is apparent that rCHL converges, and the binary classification task has been learned after epochs. We tested two different versions of rCHL. In the first, we used the bias terms (cyan color in figure 2) in equation (3). All have been initialized randomly from a uniform distribution and they are allowed to learn based on equation (2). In the second, we set the bias terms to zero (purple color in figure 2). The same method was also followed for CHL. Figures 2B and 2C indicate that CHL and rCHL can achieve similar results in the task. Having established the functionality of rCHL, we investigate how the parameters of the random matrix , the feedback gain , the learning rate , and the number of layers affect the learning.

Figure 2: Bars and stripes classification with rCHL. A Bars and stripes dataset used to train the neural network. B The test MSE computed every epochs. In every epoch, a stimulus (image) is presented to the network. Here, the rCHL algorithm (cyan and magenta curves) is compared against the CHL algorithm (black and red curves). After epochs the network has converged. C Test accuracy of the network on classifying the bars and stripes. We presented the entire dataset ( images) during testing.

We can therefore conclude that rCHL behaves similarly to CHL on this task and learns as well as CHL does. In the following paragraphs we investigate how four basic parameters of the model (learning rate, feedback gain, number of layers, and the random matrix) affect the learning, using the bars and stripes classification task as a toy model. We choose not to use the bias terms since they do not affect the learning in this particular task.

Feedback Gain

We start with the feedback gain by sweeping it over the interval , and drawing the initial values of the synaptic weights as well as the random matrix from a uniform distribution . The effect of the feedback gain for standard CHL has been examined in [xie:2003]. In figures 3A and B, the test MSE and accuracy are shown for different values of . Lower feedback gain (about ) works well for rCHL. Even when the feedback gain is around the learning process still works (data not shown) and the convergence is fast. This is explained by the fact that the term in equation (2a) becomes extremely large for small feedback gains for the very first layers and quite small for the deeper layers. On the other hand, when the feedback gain is high (e.g., ) rCHL does not converge (gray color in figure 3).

Figure 3: Effect of feedback gain on bars and stripes classification. This figure illustrates the test MSE A and the test accuracy B for nine different values of feedback gain . Both panels illustrate that the neural network trained using the rCHL algorithm learns to classify the bars and stripes dataset in all cases, except where . In this case, the MSE is high and the classification is unstable.

Learning Rate

Another important parameter that affects learning is the rate at which rCHL adjusts the synaptic weights, i.e., the learning rate . We therefore use the same network architecture, keep the feedback gain fixed, and vary the learning rate, i.e. . Figure 4 shows the test error and accuracy for the various values of the learning rate. When the learning rate is too low the learning diverges. On the other hand, when the learning rate is higher the learning converges faster. For the values between and the learning converges more smoothly but takes more time to reach the equilibrium in comparison to the higher values. In this case the random matrix and the synaptic weights have been initialized from a uniform distribution .

Figure 4: Effect of learning rate on bars and stripes classification. This figure illustrates the test MSE A and the test accuracy B for six different values of learning rate . It is apparent that if the learning rate is too low the learning process takes more time to converge. In this case we keep the feed-back gain constant at .

Number of Layers

Next, we investigate how the number of hidden layers affects learning. To this end, we train four different neural networks using rCHL. The configurations for the networks are (), (), (), and (). As before, we draw the synaptic weights and the random matrix from a uniform distribution . Figures 5 A and B show the test MSE and test accuracy of the networks. As we increase the number of layers, the rCHL networks fail to converge (data not shown). However, when we increase the feedback gain , convergence is achievable (gray curves in figure 5). This behavior can be explained by the fact that the feedback gain affects the synaptic weight update by a factor . This means that the more layers a network has, the higher should be.

Figure 5: Effect of number of layers on bars and stripes classification. This figure illustrates the test MSE A and the test accuracy B for four different values of , which represents the number of layers. For the largest value , rCHL diverges (data not shown) and we have to re-tune the feedback gain. Once the feedback gain has been increased, rCHL converges (gray line).

Feedback Random Matrix

The random feedback matrix plays a crucial role in the rCHL learning process. It conveys the information from the output layer back to the hidden layer(s). In doing so, the feedback random matrix alters the signals by applying the feedback from layer to the neural dynamics at layer . The learning process is therefore directly affected by the choice of the feedback random matrices within the network. The properties of random matrices arise from the distributions that generate them. Here we investigate how random matrices generated by a normal and a uniform distribution can impact the learning process. Furthermore, we define the length of the uniform distribution interval as .
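The two families of feedback matrices compared here can be sampled as in the sketch below. It assumes a zero-mean normal distribution parameterized by its standard deviation and a uniform distribution on an interval of the given length centered at zero; the centering and the function names are assumptions made for illustration.

```python
import numpy as np


def normal_feedback(shape, sigma, rng):
    """Feedback matrix with i.i.d. N(0, sigma**2) entries."""
    return rng.normal(loc=0.0, scale=sigma, size=shape)


def uniform_feedback(shape, length, rng):
    """Feedback matrix with i.i.d. entries uniform on [-length/2, length/2)."""
    half = length / 2.0
    return rng.uniform(-half, half, size=shape)


rng = np.random.default_rng(42)
G_normal = normal_feedback((200, 200), sigma=0.5, rng=rng)
G_uniform = uniform_feedback((200, 200), length=1.0, rng=rng)
```

For comparison between the two families, a uniform distribution on an interval of length l has variance l**2 / 12, so equal interval length and equal variance are different matching criteria.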

We start by varying the variance , and the length of the two different distributions. Next the network is evaluated on the bars and stripes classification task, and the MSE and accuracy are recorded. Figures 6A and B show the test MSE and accuracy for sixteen different values for the normal distribution. As shown, the higher the variance, the faster convergence is reached. Figures 6C and D show the test MSE and accuracy for sixteen different values of of the uniform distribution. In this case, we observe that the shorter the interval (smaller ), the slower the convergence. Conversely, the wider the interval, the faster and more stable the convergence is. The convergence for short intervals is slower for the uniform distribution than for corresponding small variance values for the normal distribution (see figures 6A and C).

Figure 6: Effect of matrix distribution on bars and stripes classification. This figure illustrates how the random matrix affects the learning process when it is initialized from a normal and a uniform distribution. A Test MSE for the normal distribution, B corresponding test accuracy. C Test MSE for the uniform distribution, D corresponding test accuracy. The colored lines indicate different values for the variance of the normal distribution and the interval (length ) of the uniform distribution (see the legend). The higher the variance or the length, the faster and more stable the convergence of the learning. The network with configuration performs a classification task on the bars and stripes dataset (see the text for more details).

One of the key aspects in Random Matrix Theory is the spectrum (i.e., eigenvalues) of the random matrix and especially the distribution of the eigenvalues [vu:2014]. However, most connection matrices in neural networks are non-square and non-normal (i.e., , where ). Therefore, one alternative for studying the spectrum of random matrices that are rectangular and non-normal is to study their singular values and the corresponding -pseudospectra [trefethen:1991, trefethen:2005, wright:2002] (we provide a brief description of the -pseudospectra in Appendix 5.1). The pseudospectra define a set of pseudoeigenvalues () of a matrix , if for some eigenvector with it holds . Hence pseudospectra indicate the potential sensitivity of eigenvalues under perturbations of the matrix . In this study we are interested in identifying spectral properties that can be related to the learning (i.e., convergence, speed of convergence, oscillations).
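The quantity behind a pseudospectrum plot is the smallest singular value of a shifted matrix evaluated over a grid in the complex plane. The sketch below is a brute-force illustration of that idea, not the algorithm of [wright:2002]: it assumes a tall or square matrix A of size m x n (m >= n) and uses the rectangular identity (ones on the main diagonal) in the shift z*I - A. The function name and the grid layout are hypothetical.

```python
import numpy as np


def sigma_min_grid(A, re, im):
    """Smallest singular value of z*I_rect - A for z = x + iy on a grid.

    Returns an array of shape (len(im), len(re)); the points where it is at
    most epsilon form the epsilon-pseudospectrum under the stated assumptions.
    """
    A = np.asarray(A, dtype=complex)
    m, n = A.shape
    if m < n:
        raise ValueError("this sketch assumes a tall or square matrix (m >= n)")
    I_rect = np.eye(m, n)  # rectangular identity
    out = np.empty((len(im), len(re)))
    for i, y in enumerate(im):
        for j, x in enumerate(re):
            z = complex(x, y)
            # Singular values are returned in descending order.
            s = np.linalg.svd(z * I_rect - A, compute_uv=False)
            out[i, j] = s[-1]
    return out


# Hypothetical example: a 6 x 4 matrix with uniform entries.
rng = np.random.default_rng(1)
A = rng.uniform(-0.5, 0.5, (6, 4))
grid = sigma_min_grid(A, np.linspace(-1, 1, 21), np.linspace(-1, 1, 21))
```

Contour lines of `grid` at a few levels give a picture of the kind described here. Each grid point costs one singular value decomposition, so this approach only suits small matrices and coarse grids.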

Figure 7 illustrates the -pseudospectra for the random matrix for the bars and stripes classification task. We chose three cases for the uniform and three for the normal distribution from figure 6 and then applied the algorithm given in [wright:2002] to the random matrices to compute the -pseudospectra of in each case. In every panel the contour lines indicate the minimum singular values that correspond to different values of on the complex plane. In addition, every panel shows the corresponding test MSE (inset plot). Subplots 7A, B, and C depict the pseudospectra of drawn from uniform distributions (, , , respectively), as well as the test MSE. Likewise, subplots D, E, and F illustrate the cases of normal distributions (, , , respectively).

Since the only varying parameter in this experiment is the way we generate the random matrix, the learning process is solely affected by that matrix. In cases A and D the pseudospectra of the two different distributions look identical, and the test MSE behaves similarly in both, decaying more slowly than in the other cases (blue and black lines in the insets). In both cases the minimum value for is the same (and around the origin). In the other cases B, C, E, and F the convergence of the test MSE toward zero is faster (inset) and shows weaker oscillations. The pseudospectra show higher minimum values for . For the uniform distribution all the values are arranged on concentric circles. On the other hand, the normal distribution with variances and causes a shift of the pseudospectrum towards the right-half complex plane. The minimum is higher in comparison to the uniform cases. One more remark is that for these four cases the uniform distributions have smaller values in comparison to their normal counterparts. This might lead to a less oscillatory behavior of the test MSE, as shown in the insets (blue and black curves). In all cases, we compute the pseudospectra using the implementation provided in [wright:2002].

Figure 7: -pseudospectra of for the Bars and Stripes classification. Six different feedback random matrices have been analyzed using the -pseudospectra method. The colormap indicates the different values. The top row shows the for the matrices drawn by uniform distributions from intervals (A) , (B) , and (C) . The bottom row illustrates for the random matrices drawn from normal distributions with zero mean and variances (D) , (E) , and (F) .

3.2 Exclusive Or (XOR)

The exclusive or (denoted XOR or ) problem consists of the evaluation of four possible Boolean input states (i.e., , , , ). To solve the XOR problem, we use a feed-forward network with one hidden layer and one output layer () with a configuration . All neural units are sigmoidal: , and we train the neural network on the XOR problem for epochs using CHL and rCHL. The random matrix for rCHL has been initialized from a uniform distribution . In every epoch we present samples to the network and every epochs we measure the MSE and the accuracy on the test dataset. The accuracy here is defined to be , where and the index runs over the test samples.

Figure 8: Exclusive OR (XOR). Four different learning algorithms have been used to train a feed-forward network with three layers, , (). (A) Test MSE plotted against the number of samples, (B) test accuracy. The red and purple lines indicate CHL and rCHL, respectively. The brown and gray dashed lines show the minimum test MSE and the maximum test accuracy for BP and FDA, respectively.

The results of the learning are shown in figure 8, where figure 8A shows the test error, and figure 8B the test accuracy for CHL (red line) and rCHL (purple line). The test error and the test accuracy attained are the same as for BP (brown dashed line) and FDA (gray dashed line). Compared against the error and accuracy of BP and FDA (see SI figure 11), rCHL converges faster and more smoothly. This is because rCHL and CHL are on-line learning algorithms, and the input and target signals are both embedded into the dynamics of the neurons. This leads to a rapid convergence of the Hebbian learning rule, which rapidly assimilates the proper associations between input and output signals.

3.3 Handwritten Digit and Letter Classification

For the classification tasks, we used the MNIST [deng:2012] dataset of handwritten digits and the eMNIST dataset [cohen:2017] of handwritten letters of the English alphabet. The neural network layouts for MNIST and eMNIST were and , respectively. On every unit in both the MNIST and eMNIST networks, we use a sigmoid function: . We drew the initial synaptic weights and the random matrices from uniform distributions and , respectively. We trained the network for (MNIST) and (eMNIST) epochs, and in each epoch we presented the entire MNIST and eMNIST datasets, which consist of and images, respectively. At the end of each epoch, we measured the MSE and the accuracy of the network. The accuracy is defined to be the ratio of the successfully classified images to the total presented test images ( for MNIST and for eMNIST, respectively).
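The accuracy measure described above can be computed as in the short sketch below. It assumes that the network has one output unit per class and that the predicted class is the unit with the largest activity; this decision rule and the function name are assumptions for illustration.

```python
import numpy as np


def classification_accuracy(outputs, labels):
    """Fraction of samples whose most active output unit matches the label.

    outputs : array of shape (n_samples, n_classes) with output activities.
    labels  : array of shape (n_samples,) with integer class indices.
    """
    outputs = np.asarray(outputs)
    labels = np.asarray(labels)
    predicted = np.argmax(outputs, axis=1)
    return float(np.mean(predicted == labels))


# Hypothetical activities for three test images and two classes.
acc = classification_accuracy([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]], [1, 0, 0])
```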

Figure 9: MNIST and eMNIST classification. The figure illustrates (A) the test error and (B) the accuracy of a neural network () with a sigmoid function as non-linearity in all layers. The network was trained on the entire MNIST set of digits ( images) and tested on the whole MNIST test set ( images). In addition, (C) illustrates the test error and (D) the accuracy of a neural network () with a sigmoid function as non-linearity in all layers. The network was trained on the entire eMNIST set of letters ( images) and tested on the whole eMNIST test set ( images). In both cases (MNIST and eMNIST), CHL and rCHL have similar performance (red and purple curves, respectively).

Figure 9A illustrates the test error of CHL (red line) and rCHL (purple line), respectively. The error is lower than BP and FDA (brown and gray dashed lines, respectively), and the convergence is faster (compared against SI figure 12, where BP and FDA test MSE and accuracy are illustrated). However, the classification test accuracy is the same as for BP and FDA, as figure 9B indicates. The convergence of the BP and FDA algorithms can be seen in SI figure 12, as well as the test accuracy of those algorithms performing on the same type of neural network (same neural units and network architecture).

For the eMNIST dataset, the test error is illustrated in figure 9C, where the errors of CHL (red line) and rCHL (purple line) are close to the errors attained by BP and FDA (brown and gray dashed lines). However, the classification test accuracy of rCHL is lower than that of the other three learning algorithms (CHL, BP, and FDA), as figures 9D and SI 13B show, despite the fact that its error is smaller than the errors of BP and FDA. This drawback might be due to the feedback random matrix and the lack of symmetric synaptic connections. This is also supported by the fact that CHL (which differs from rCHL only in the way the feedback signals are transmitted) achieves an accuracy as good as that of BP and FDA (minimum test error and maximum test accuracy for all four algorithms are provided in table 1, Appendix 5.2).

3.4 Autoencoder

The final test case was the implementation of an autoencoder using CHL and rCHL. An autoencoder is a neural network that learns to reconstruct the input data using a restricted latent representation. Classic autoencoders consist of three layers: the first layer is called the encoder, which is responsible for encoding the input data to the hidden layer, which describes a code. The third layer is the decoder, which generates the approximation of the input data, based on the code provided by the hidden layer [goodfellow:2016].

To this end, we use a network with three layers , with a layout , and we use the MNIST handwritten digits. This means that we encode the input images of dimension to a dimension of . For both CHL and rCHL, the learning rate is set to . The synaptic weights and the random matrix were initialized from uniform distributions and , respectively. We trained the network for epochs, and in each epoch we presented to the network samples of MNIST handwritten digits.

Figure 10: CHL and rCHL autoencoder. The MNIST dataset was used to train an autoencoder using CHL and rCHL. (A) shows some randomly chosen MNIST digits. After training the network for epochs (during each epoch we presented MNIST images to the network), it was able to reconstruct the learned digits as panel (B) shows for CHL, and panel (C) for rCHL. The learned representations are shown in panels (D) and (E) for CHL and rCHL, respectively.

After training the autoencoder, we fed digits (figure 10A) from the MNIST dataset to the autoencoders. The decoding (reconstruction) of the input data is illustrated in figures 10B and 10C for CHL and rCHL, respectively. The results indicate that rCHL and its non-random counterpart (CHL) can both learn to approximate the identity function, and thus both are good candidates for implementing autoencoders. The reconstruction was not perfect: the network learned an approximation of the identity function rather than the exact function, which indicates that it had learned appropriate representations (codes). The representations (codebooks) are shown in figures 10D and 10E for the CHL and rCHL models, respectively. As expected, the representations of rCHL were noisier than those of the CHL model. This can be partially explained by the simulation parameters chosen in this experiment.

4 Discussion

In this work, we introduced a modified version of Contrastive Hebbian Learning [xie:2003, movellan:1991, baldi:1991], based on random feedback connections. The key contribution is a biologically plausible variant of CHL, obtained by using random matrices that stay fixed during learning instead of bidirectional synapses. We have shown that this new variant, random Contrastive Hebbian Learning (rCHL), can achieve results equivalent to those of CHL, BP, or FDA. In addition, rCHL can solve logical problems such as XOR, as well as classification tasks such as handwritten digit and letter classification (MNIST and eMNIST datasets). Furthermore, rCHL supports representation learning (e.g., autoencoders).

We provide a thorough investigation of the algorithm, examining how the different parameters of the learning scheme affect the learning process. We have shown that the feedback gain affects the performance of learning: the smaller the gain, the better the performance of the algorithm. This is in accordance with previous theoretical results [xie:2003], where CHL requires low values of the feedback gain in order to be equivalent to BP. A second factor that affects learning is the number of hidden layers. As the number of hidden layers increases, the performance of learning decreases. However, this can be mitigated by increasing the feedback gain (see figure 5). Finally, we tested the initial conditions of the random feedback matrix. We found that if we draw the random matrix from a normal distribution with zero mean, the learning algorithm converges faster than if we draw it from a uniform distribution. In addition, the variance of the normal distribution and the interval length of the uniform distribution affect learning as well: the smaller the values, the faster the convergence.
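As a minimal sketch of the two initialization schemes compared above, the following code draws a feedback matrix from a zero-mean normal distribution and from a uniform distribution. The function names, the interval placed symmetrically around zero, the matrix shape, and the numeric values are illustrative assumptions, not the settings used in the experiments. The only fact relied on is that a uniform distribution on an interval of length $\ell$ has variance $\ell^2/12$, which allows the two spreads to be matched.

```python
import numpy as np


def normal_feedback(shape, sigma, rng):
    """Zero-mean normal feedback matrix with standard deviation sigma."""
    return rng.normal(0.0, sigma, size=shape)


def uniform_feedback(shape, length, rng):
    """Uniform feedback matrix on [-length/2, length/2] (symmetry is assumed)."""
    return rng.uniform(-length / 2.0, length / 2.0, size=shape)


def matched_length(sigma):
    """Interval length whose uniform variance (length**2 / 12) equals sigma**2."""
    return sigma * np.sqrt(12.0)


rng = np.random.default_rng(0)          # fixed seed, for reproducibility only
shape = (64, 10)                        # hypothetical: 64 hidden units, 10 outputs
sigma = 0.1                             # hypothetical spread
G_normal = normal_feedback(shape, sigma, rng)
G_uniform = uniform_feedback(shape, matched_length(sigma), rng)
```

With matched variances, any remaining difference in convergence between the two matrices can be attributed to the shape of the distribution rather than to its scale.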

We further investigated the random feedback matrices using tools from pseudospectra analysis [trefethen:1991, wright:2002]. The convergence of learning can be affected by the choice of the distribution that generates the random feedback matrix. We found that sub-Gaussian feedback matrices with a normal-like pseudospectrum tend to cause slower, but more robust, convergence of learning. Gaussian random matrices, on the other hand, tend to lead to faster convergence.

Both CHL and rCHL share some common properties: (i) the neurons express non-linear dynamics; (ii) the learning rule is based on Hebb’s rule [hebb:2005], which means that no biologically implausible information is necessary (e.g., knowledge of the derivatives of the non-linearities of the neural units); and (iii) the propagation of activity from one layer to the next takes place in a manner similar to that of biological systems. All the layers are coupled through a non-linear dynamical system, which implies that when an input is presented, the algorithm does not have to wait until all the units in a given layer have fired. Instead, neurons can propagate their activity through couplings to the next layer. This is not the case for artificial neural networks, where the activity is computed for all the units of a layer prior to transmission to the subsequent layer. Thus, CHL and rCHL circulate signals in a more natural fashion (similar to the processes evident in biological nervous systems).

The salient difference between CHL and rCHL is the replacement of symmetric synaptic weights (CHL) with fixed random matrices (rCHL). Therefore, the feedback connections do not require any symmetric weights (transpose matrices) of the feed-forward connections. This is in accordance with biology since there is no evidence thus far to indicate that chemical synapses (one of the two major synapse types in the nervous system, the other being electrical synapses) are bidirectional [kandel:2000, pickel:2013]. More precisely, we can interpret the random feedback as a model of afferent feedback connections from higher hierarchical layers to lower ones. For instance, primary sensory areas are connected in series with higher sensory or multi-modal areas. These higher areas project back to the primary areas through feedback pathways [markov:2012, oh:2014, glasser:2016]. Other types of feedback pathways to which our approach may be considered relevant are the inter-laminar connections within cortical layers [harris:2013, thomson:2003]. Therefore, such feedback pathways can be modeled as random feedback matrices that interfere with the neural dynamics of the lower layers (top-down signal transmission). The randomness in these feedback pathways can account for currently unknown underlying neural dynamics or random networks with interesting properties [may:1976, vu:2014].
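The difference between the two feedback schemes can be sketched in a few lines. The code below integrates layered dynamics of the form $\dot{x}_k = -x_k + f(W_k x_{k-1} + b_k + \gamma F_k x_{k+1})$ with the forward Euler method, where $F_k$ is the transpose of the next layer's forward weights for CHL and a fixed random matrix for rCHL. The form of the dynamics follows the CHL formulation in [xie:2003]; the sigmoid non-linearity, the function name `relax`, the symbol $F_k$, the network layout, and all numeric values are assumptions made for illustration and are not taken from the experiments.

```python
import numpy as np


def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))


def relax(x_in, W, b, F, gamma, target=None, dt=0.1, steps=300):
    """Forward Euler relaxation of the coupled layer dynamics.

    W[k-1] maps layer k-1 to layer k; F[k-1] carries feedback from layer k+1
    to layer k. If target is given, the output layer is clamped to it.
    """
    L = len(W)                                   # index of the output layer
    x = [x_in] + [np.zeros(w.shape[0]) for w in W]
    if target is not None:
        x[L] = np.asarray(target, dtype=float)   # clamped phase
    last = L - 1 if target is not None else L    # last layer that is integrated
    for _ in range(steps):
        new = list(x)                            # synchronous update of all layers
        for k in range(1, last + 1):
            drive = W[k - 1] @ x[k - 1] + b[k - 1]
            if k < L:                            # hidden layers receive feedback
                drive = drive + gamma * (F[k - 1] @ x[k + 1])
            new[k] = x[k] + dt * (-x[k] + sigmoid(drive))
        x = new
    return x


rng = np.random.default_rng(1)
sizes = [2, 3, 1]                                # hypothetical layout
W = [rng.uniform(-0.5, 0.5, (sizes[k + 1], sizes[k])) for k in range(2)]
b = [np.zeros(sizes[k + 1]) for k in range(2)]

F_chl = [W[1].T]                                 # CHL: symmetric feedback
F_rchl = [rng.normal(0.0, 0.2, (3, 1))]          # rCHL: fixed random feedback

x_in = np.array([1.0, 0.0])
free = relax(x_in, W, b, F_rchl, gamma=0.05)
clamped = relax(x_in, W, b, F_rchl, gamma=0.05, target=[1.0])
free_chl = relax(x_in, W, b, F_chl, gamma=0.05)
```

The only change between the two variants is the list passed as `F`: `F_chl` must be recomputed whenever the forward weights change, whereas `F_rchl` is drawn once and never updated.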

Our current implementation, rCHL, allows us to interpret the backward (clamped) phase as top-down signal propagation. For instance, when a visual stimulus such as a letter is presented and a subject is required to learn the semantics of the letter (i.e., which letter is presented), the target can be the semantics and the input signal the image of the letter. Therefore, we can assume that the forward phase simulates the bottom-up signal propagation from the primary sensory cortices to higher association cortices, and the backward phase simulates the top-down signal propagation from the higher cognitive areas back to the lower cortical areas.

Due to the nature of the learning algorithm and its two phases (positive and negative), the input and the target signals are clamped, and the dynamics of those signals are captured and embedded in the neural dynamics of the network. This can be compared to target propagation (TP) [lee:2015, le:1986], where the loss gradient of BP is replaced by a target value. When the target value is very close to the neural activation of the forward pass, TP behaves like BP. This sort of learning algorithm solves the credit assignment problem [minsky:1961] using local information. In the proposed model, the target and the input signals are both embedded into the dynamics of the neural activity and affect the learning process indirectly, through a Hebbian and an anti-Hebbian rule.
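A minimal sketch of such a Hebbian and anti-Hebbian update is given below, assuming the contrastive form used for CHL in [xie:2003]: each weight matrix changes in proportion to the difference between the outer product of clamped-phase activities and the outer product of free-phase activities, scaled by a layer-dependent power of the feedback gain. The function name `contrastive_update`, the argument names, and the toy values are hypothetical, and the sketch is not meant to reproduce the exact rule used in the experiments.

```python
import numpy as np


def contrastive_update(W, x_free, x_clamped, eta, gamma):
    """Return updated copies of the weights W[0..L-1].

    x_free and x_clamped are lists of layer activities (layers 0..L) at the
    end of the free and clamped phases. W[k-1] connects layer k-1 to layer k.
    """
    L = len(W)
    updated = []
    for k in range(1, L + 1):
        hebb = np.outer(x_clamped[k], x_clamped[k - 1])      # clamped phase
        anti_hebb = np.outer(x_free[k], x_free[k - 1])       # free phase
        scale = eta * gamma ** (k - L)                       # assumed, as in [xie:2003]
        updated.append(W[k - 1] + scale * (hebb - anti_hebb))
    return updated


# Toy example with a hypothetical 2-3-1 layout.
W = [np.zeros((3, 2)), np.zeros((1, 3))]
x_free = [np.array([1.0, 0.0]), np.array([0.5, 0.5, 0.5]), np.array([0.2])]
x_clamped = [np.array([1.0, 0.0]), np.array([0.5, 0.5, 0.5]), np.array([1.0])]
W_new = contrastive_update(W, x_free, x_clamped, eta=0.1, gamma=0.5)
```

When the free and clamped states coincide, the two terms cancel and the weights stop changing, which is how the target influences learning only through the difference between the two phases.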

The major contribution of this work is the development of a new method to implement CHL using random feedback instead of symmetric feedback. This leads to a more biologically plausible implementation of CHL without loss of performance and accuracy in most of the cases studied. In addition, the algorithm offers a shorter runtime, since it does not require the computation of the transpose of the synaptic weight matrix at every time step. Furthermore, the proposed algorithm offers a suitable learning scheme for neuromorphic devices, since its neural dynamics can be transformed into spiking neurons [gerstner:2002]. Therefore, we can build spiking neural networks with STDP-like learning rules as an equivalent of Hebb’s rule in rCHL, and use ideas from event-based Contrastive Divergence [neftci:2014] and synaptic sampling machines [neftci:2016] to implement a neuromorphic rCHL. This idea is similar to the event-based random back-propagation algorithm [neftci:2017], where the authors implemented an event-based equivalent of feedback alignment for neuromorphic devices.

A potential extension of the model might be to replace the firing rate units with spiking neurons, such as leaky integrate-and-fire units, in order to make the computations even more similar to the biological case. Another potential direction would be to remove the synchronization in the integration of the neural dynamics entirely, so that the integration can take place on demand, in a more event-based fashion [rougier:2011, taouali:2009]. Since the units are coupled with each other, and the propagation of activity takes place in a more natural way, we might be able to use asynchronous algorithms for implementing rCHL such that it scales up in a more efficient and natural way. Another future extension of the model would be to impose sparsity constraints on the learning rule in order to render data encoding processes more efficient. Such a modification would also enable the model to simulate more biological phenomena, such as the sparse compression occurring within the hippocampus [petrantonakis:2014].

In the future, we would like to conduct more analytical work to determine the optimal type of random matrix for a given problem, and how to design such a matrix, if it exists. If an optimal random matrix exists, then one can choose the eigenvalues (or singular values) and the distribution that generates that matrix based on the problem to be solved. To this end, pseudospectra analysis can play a key role. Pseudospectra can help us better understand the relations between singular values, convergence of learning, and accuracy. This can lead to the development of sophisticated methods for designing feedback matrices with particular properties, and thereby improve and accelerate the learning process.
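One simple way to experiment with this idea is to build a feedback matrix with prescribed singular values from two random orthonormal bases. This is only an illustrative sketch and not a method proposed or evaluated in this work; the function name, the shape, and the singular values below are hypothetical.

```python
import numpy as np


def matrix_with_singular_values(m, n, s, rng):
    """Random m x n matrix whose singular values are the non-negative entries of s."""
    s = np.asarray(s, dtype=float)
    r = min(m, n)
    if s.shape != (r,) or np.any(s < 0):
        raise ValueError("s must hold min(m, n) non-negative values")
    # Reduced QR of Gaussian matrices yields matrices with orthonormal columns.
    U, _ = np.linalg.qr(rng.standard_normal((m, r)))
    V, _ = np.linalg.qr(rng.standard_normal((n, r)))
    return U @ np.diag(s) @ V.T


rng = np.random.default_rng(2)
G = matrix_with_singular_values(5, 3, [2.0, 1.0, 0.5], rng)
```

Matrices built this way share the same singular values but differ in their singular vectors, which would make it possible to separate the effect of the spectrum from the effect of the generating distribution.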

Author Contributions

GD conceived the idea, implemented CHL and rCHL, and ran the CHL and rCHL experiments; TB implemented and ran the BP and FDA experiments; GD analyzed and interpreted the results; all authors wrote and reviewed the manuscript.

Acknowledgments

This work was supported in part by the Intel Corporation and by the National Science Foundation under grant 1640081. We thank NVIDIA Corporation for providing the GPU used in this work.

5 Appendix

5.1 $\epsilon$-pseudospectra

Let $A$ be a rectangular matrix, not necessarily normal (i.e., $AA^{*} \neq A^{*}A$). The $\epsilon$-pseudospectra [trefethen:2005, wright:2002] is the set of $\epsilon$-eigenvalues, a closed subset of $\mathbb{C}$.

Definition 5.1 ($\epsilon$-Pseudospectra).

Let , and (the identity matrix with ones on the main diagonal and zeros elsewhere), then the -pseudospectra .

Definition 5.2 ($\epsilon$-Pseudospectra).

Let , and (the identity matrix with ones on the main diagonal and zeros elsewhere), then the -pseudospectra .

Definitions 5.1 and 5.2 are equivalent [wright:2002], and both account for the computation of of rectangular matrices of dimension , where . The algorithm given in [wright:2002] for the numerical computation of the set can be applied on matrices of dimension , where either or .
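In practice, membership in the $\epsilon$-pseudospectra of a rectangular matrix is commonly tested through the smallest singular value: a point $z$ belongs to the set when $\sigma_{\min}(z\tilde{I} - A) \le \epsilon$, where $\tilde{I}$ is the rectangular identity. The sketch below evaluates this criterion directly on a grid, assuming this standard characterization and a matrix with at least as many rows as columns. It is a brute-force illustration, not the efficient algorithm of [wright:2002], and the function names and the example matrix are hypothetical.

```python
import numpy as np


def sigma_min(A, z):
    """Smallest singular value of z*I_rect - A, for A with rows >= columns."""
    m, n = A.shape
    I_rect = np.eye(m, n)               # ones on the main diagonal, zeros elsewhere
    return np.linalg.svd(z * I_rect - A, compute_uv=False)[-1]


def pseudospectrum_mask(A, re, im, eps):
    """Boolean grid: True where re[j] + 1j*im[i] lies in the eps-pseudospectrum."""
    return np.array([[sigma_min(A, x + 1j * y) <= eps for x in re] for y in im])


# Hypothetical 3 x 2 example; here z*I_rect - A has singular values |z-1|, |z-2|.
A = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.0, 0.0]])
re = np.linspace(0.0, 3.0, 31)
im = np.linspace(-1.0, 1.0, 21)
mask = pseudospectrum_mask(A, re, im, eps=0.25)
```

For this example the mask consists of two discs of radius 0.25 around $z = 1$ and $z = 2$. Each grid point costs one singular value decomposition, so this approach is only practical for small matrices and coarse grids.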

5.2 Error and Accuracy Table

Here we provide the minimum test error and the maximum test accuracy for each of the experiments (XOR, MNIST, and eMNIST). The following table compares our CHL and rCHL implementations with each other and with BP and FDA.

Algorithm \ Experiment XOR MNIST eMNIST
BP
FDA
CHL
rCHL
Table 1: Mean squared error (MSE) and accuracy. The test MSE and accuracy of the three experiments (XOR, MNIST, and eMNIST) are given in this table. BP: Backpropagation, FDA: Feedback Alignment, CHL: Contrastive Hebbian Learning, rCHL: random Contrastive Hebbian Learning.

5.3 Abbreviations and Notation Tables

Abbreviation Description
STDP Spike-timing Dependent Plasticity
BP Back-propagation
FDA Feedback Alignment
CHL Contrastive Hebbian Learning
rCHL random Contrastive Hebbian Learning
MSE Mean Squared Error
Table 2: Abbreviations
Symbols Description
Total number of layers
Index of layer
Neural state at layer
Neural state at layer in the free phase
Neural state at layer in the clamped phase
Synaptic matrix (connects layer with )
Random feedback matrix (connects layer with
Bias for layer
Non-linear function (transfer function)
Learning rate
Feedback gain
Tensor product
Forward Euler time-step
Forward Euler total integration time
Input data set
Target set
Set of pseudospectra
Eigenvalue
Uniform distribution
Length of uniform’s distribution interval
Normal distribution
Normal distribution’s variance
Exclusive Or (XOR)
Table 3: Notation

5.4 Simulation and Platform Details

All serial simulations were run on a Dell OptiPlex with GB physical memory, and a 6th generation Intel i7 processor (Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz) running Arch Linux (-ARCH GNU/Linux). The source code for CHL and rCHL was written in the C programming language [kernighan:2006] (gcc (GCC) ). In all C simulations, we used the random number generator provided by [oneill:2014] (see footnote 2). The backpropagation and feedback alignment algorithms were written in Python using TensorFlow [abadi:2016], and were run on an Nvidia GeForce GTX Titan X with memory. The source code is distributed under the GNU General Public License, and can be found at: [LINK]. All simulation parameters are provided in the Results section and in the Supplementary Information (SI).

5.5 Simulation Parameters

Experiment Epochs Layout
XOR
MNIST
eMNIST
Autoencoder
Table 4: Simulation Parameters. In the experiments, we integrated the dynamics for using the forward Euler method with time-step . is the learning rate and is the feedback gain.

Supplementary Information

Backpropagation and Feedback Alignment on XOR

Figure 11: Backpropagation (BP) and feedback alignment (FDA) on XOR. The test MSE (A) and the test accuracy (B) show the convergence of BP (red) and FDA (gray) in solving the XOR problem. For training both BP and FDA, we used batches, and each batch had a size of sample. The learning rate was and the momentum . Synaptic weights and biases were initialized using uniform distributions and , respectively.

Backpropagation and Feedback Alignment on MNIST

Figure 12: Backpropagation (BP) and feedback alignment (FDA) on MNIST. The test MSE (A) and the test accuracy (B) show the convergence of BP (red) and FDA (gray) in solving MNIST handwritten digits classification problem. For training both BP and FDA, we used batches, and each batch had a size of samples. The learning rate was . Synaptic weights and biases were initialized using uniform distributions and , respectively.

Backpropagation and Feedback Alignment on eMNIST

Figure 13: Backpropagation (BP) and feedback alignment (FDA) on eMNIST. The test MSE (A) and the test accuracy (B) show the convergence of BP (red) and FDA (gray) in solving the eMNIST handwritten letters classification problem ( letters of the English alphabet). For training both BP and FDA, we used batches, and each batch had size of samples. The learning rate was . Synaptic weights and biases were initialized using uniform distributions and , respectively.

References

Footnotes

  1. The source code can be found at: https://github.com/gdetor/pygpsa
  2. http://www.pcg-random.org/