Biologically Motivated Algorithms for Propagating Local Target Representations

Alexander G. Ororbia
Penn State University
ago109@psu.edu

Ankur Mali
Penn State University
aam35@ist.psu.edu
Abstract

Finding biologically plausible alternatives to back-propagation of errors is a fundamentally important challenge in artificial neural network research. In this paper, we propose a simple learning algorithm called error-driven Local Representation Alignment, which has strong connections to predictive coding, a theory that offers a mechanistic way of describing neurocomputational machinery. In addition, we propose an improved variant of Difference Target Propagation, another algorithm that comes from the same family of algorithms as Local Representation Alignment. We compare our learning procedures to several other biologically-motivated algorithms, including two feedback alignment algorithms and Equilibrium Propagation. On two benchmark datasets, we find that both of our proposed learning algorithms yield stable performance and strong generalization compared to competing back-propagation alternatives when training deeper, highly nonlinear networks, with Local Representation Alignment performing best overall.

 

Preprint. Work in progress.

1 Introduction

Behind many of the modern achievements in artificial neural network research is back-propagation of errors [37] (or “backprop”), the key training algorithm used to compute updates to the many parameters that define the computational architectures applied to problems ranging from computer vision to natural language processing and speech. However, though neural architectures are inspired by our current neuroscientific understanding of the human brain, their connections to the actual mechanisms that compose systems of natural neurons are often very loose, at best. More importantly, back-propagation of errors faces some of the strongest neuro-biological criticisms, being argued to be a highly implausible account of how learning occurs in the human brain.

Among the many problems with back-propagation of errors, some of the most prominent include: 1) the “weight transport problem”, where the feedback weights used to carry error signals must be the transposes of the feedforward weights, 2) forward propagation and backward propagation utilize different computations, and 3) the error gradients are stored separately from the activations. These problems, as originally argued in [30, 28], largely center around one critical component of backprop: the global feedback pathway needed for transporting error derivatives across the system. This pathway is necessary given the design of modern supervised learning systems: a loss function measures the error between an artificial neural system’s output units and some target (such as a class label), and the global pathway relates how the internal processing elements affect this error. When considering modern theories of the brain [9, 36, 12, 4], which posit that local computations occur at multiple levels of the somewhat hierarchical structure of natural neural systems, this global pathway should not be necessary to learn effectively. Furthermore, this pathway is the source of many practical problems that make training very deep, more complex networks difficult: as a result of the many multiplications that underlie traversing this global feedback pathway, error gradients will either explode or vanish [34]. In trying to fix this particular issue, gradients can be kept within reasonable magnitudes by requiring layers to behave sufficiently linearly (which prevents saturation of the post-activation function, where the gradient is zero). However, this remedy creates other highly undesirable side-effects, such as the well-known problem of adversarial samples [43, 31], and it prevents the use of neuro-biological mechanisms such as lateral competition and discrete-valued/stochastic activation functions (since this pathway requires precise knowledge of the activation function derivatives [3]).

If we remove this global feedback pathway, we create a new problem: what are the learning signals for the hidden processing elements? This problem is one of the main concerns of the recently introduced Discrepancy Reduction family of learning algorithms [30]. In this paper, we will develop two learning algorithms within this family: error-driven Local Representation Alignment and adaptive noise Difference Target Propagation. In experiments on two classification benchmarks, we will show that these two algorithms generalize better than a variety of other biologically motivated learning algorithms, all without employing the global feedback pathway required by back-propagation.

2 Reducing Discrepancy with Globally-Coordinated Local Learning Rules

Algorithms within the Discrepancy Reduction [30] family offer computational mechanisms to perform the following two steps when learning from a sample (or mini-batch of samples):

  1. Search for latent representations that better explain the input/output, also known as target representations. This motivates the need for local (higher-level) objectives that will help guide the current latent representations towards better ones.

  2. Reduce, as much as possible, the mismatch between the model’s currently “guessed” representations and the target representations. The sum of the internal, local losses is defined as the total discrepancy in the system, and can be thought of as a sort of pseudo-energy function.

This general process forms the basis of what we call globally-coordinated local learning rules. Computing targets with these kinds of rules should not require an actual global feedback pathway, as in back-propagation, and instead makes use of top-down and bottom-up signals to create targets; a minimal sketch of the two-step recipe is given below. This idea is particularly motivated by the theory of predictive coding [33], which claims that the brain is in a continuous process of creating and updating hypotheses that predict sensory input. This paper will explore two ways in which this hypothesis updating (in the form of local target creation) might happen: 1) through error-correction in Local Representation Alignment, and 2) through repeated encoding and decoding in Difference Target Propagation.
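The following toy sketch illustrates the two-step recipe in NumPy. The error-projected hidden target used here is a deliberately simplified stand-in, not the exact rule of either LRA-E or DTP, and all sizes and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (8, 4))          # toy 2-layer network
W2 = rng.normal(0, 0.1, (3, 8))

x = rng.normal(size=4)
y = np.array([1.0, 0.0, 0.0])            # desired output representation

z1 = np.tanh(W1 @ x)                     # currently "guessed" representations
z2 = np.tanh(W2 @ z1)
# Step 1: search for target representations that better explain the output.
y2 = y                                   # the output target is the label itself
y1 = z1 - 0.1 * (W2.T @ (z2 - y2))       # a simple error-projected hidden target
# Step 2: the summed layerwise mismatches form the total discrepancy, the
# pseudo-energy that the local weight updates would then reduce.
D = 0.5 * np.sum((y1 - z1) ** 2) + 0.5 * np.sum((y2 - z2) ** 2)
print("total discrepancy:", D)
```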

The idea of learning locally in general is slowly becoming prominent in the training of artificial neural networks, with recent proposals including decoupled neural interfaces [13] and kickback [1] (which was derived specifically for regression problems). Far earlier approaches that employed local learning included the layer-wise training procedures that were once used to build models for unsupervised learning [2], supervised learning [18], and semi-supervised learning [32, 29]. The key problem with these older algorithms is that they were greedy–a model was built from the bottom-up, freezing lower-level parameters as higher-level feature detectors were learnt.

Another important idea that comes into play in algorithms such as LRA and DTP is that learning is possible with asymmetric feedback, which directly resolves the weight-transport problem [10, 21], another strong neuro-biological criticism of backprop. Surprisingly, this is possible even if those feedback loops are random and fixed, which led to two algorithms we also compare to in this paper. Random Feedback Alignment (RFA) [23] essentially replaces the transpose of the feedforward weights in back-propagation with a non-learnable, random matrix of the same dimensions. Direct Feedback Alignment (DFA) [25] extends this idea further by directly connecting the output layer’s pre-activation derivative to each layer’s post-activation. It was shown in [30, 28] that these feedback loops would be better suited to generating target representations.
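To make the contrast concrete, here is a hedged NumPy sketch of how the three feedback pathways differ for a single hidden layer; the dimensions, initial scales, and squared-error output signal are illustrative assumptions rather than the referenced papers’ exact setups.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(0, 0.1, (64, 10))      # forward weights, input -> hidden
W2 = rng.normal(0, 0.1, (3, 64))       # forward weights, hidden -> output
B_rfa = rng.normal(0, 0.1, (64, 3))    # fixed random feedback (RFA)
B_dfa = rng.normal(0, 0.1, (64, 3))    # fixed random direct feedback (DFA)

x = rng.normal(size=10)
h1 = W1 @ x; z1 = np.tanh(h1)          # hidden pre- and post-activation
h2 = W2 @ z1                           # output pre-activation
e = h2 - np.array([1.0, 0.0, 0.0])     # output error signal

# Hidden-layer teaching signals under the three schemes:
delta_bp  = (W2.T  @ e) * (1 - z1**2)  # backprop: needs W2.T (weight transport)
delta_rfa = (B_rfa @ e) * (1 - z1**2)  # RFA: random fixed matrix replaces W2.T
delta_dfa = (B_dfa @ e) * (1 - z1**2)  # DFA: output error sent directly to layer
```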

2.1 Local Representation Alignment

To concretely describe how LRA is practically implemented, we will formulate how LRA would be applied to a 3-layer feedforward network, or multilayer perceptron (MLP). Note that LRA easily generalizes to models with an arbitrary number of layers.

Algorithm 1 (LRA-E: target computation). Input: a sample and the model parameters. The procedure runs the feedforward weights to get the layerwise activities, computes the error units, and returns the layerwise targets (Equations 1-5).

Algorithm 2 (LRA-E: update computation). Input: a sample and the calculations returned by Algorithm 1. The procedure computes and returns the updates to the forward weights and the error weights, in two variants (Equations 6-7); a runnable sketch of both procedures follows Equation 7 below.

The pre-activities of the MLP at layer $\ell$ are denoted as $\mathbf{h}^\ell$, while the post-activities, or the values output by the nonlinearity $\phi^\ell(\cdot)$, are denoted as $\mathbf{z}^\ell$. The target variable used to correct the output units ($\mathbf{z}^3$) is denoted as $\mathbf{y}$ (or $\mathbf{y} = \mathbf{x}$ if we are learning an auto-associative function). Connecting one layer of neurons $\mathbf{z}^{\ell-1}$, with pre-activities $\mathbf{h}^{\ell-1}$, to another layer $\mathbf{z}^\ell$, with pre-activities $\mathbf{h}^\ell$, is a set of synaptic weights $W^\ell$. The forward propagation equations for computing the pre-activation and post-activation values of a layer are then simply:

$\mathbf{h}^\ell = W^\ell \cdot \mathbf{z}^{\ell-1}, \quad \mathbf{z}^\ell = \phi^\ell(\mathbf{h}^\ell)$   (1)

Before computing targets or updates, we first must define the set of local losses, one per layer of neurons except for the input neurons, that constitute the measure of total discrepancy inside the MLP, $\mathcal{D}$. With losses defined, we can then explicitly formulate the error units $\mathbf{e}^\ell$ for each layer as well, since any given layer’s error units correspond to the first derivative of that layer’s loss with respect to that layer’s post-activation values. For the MLP’s output layer, we could assume a categorical distribution, which is appropriate for 1-of-$K$ classification tasks, and use the following negative log likelihood loss:

$\mathcal{L}(\mathbf{y}, \mathbf{z}^3) = -\sum_{k} \mathbf{y}[k] \log \mathbf{z}^3[k]$   (2)

where the loss is computed over all dimensions of the vector (a dimension is indexed/accessed by the integer $k$). Note that for this loss function, we assume that $\mathbf{z}^3$ is a vector of probabilities computed by using the softmax function as the output nonlinearity, $\phi^3(\mathbf{v}) = e^{\mathbf{v}} / \sum_k e^{\mathbf{v}[k]}$. For the hidden layers, we can choose between a wider variety of loss functions, and in this paper, we experimented with assuming either a Gaussian or Cauchy distribution over the hidden units. For the Gaussian distribution (or L2 norm), we have the following loss and error unit pair:

$\mathcal{L}(\mathbf{y}^\ell, \mathbf{z}^\ell) = \frac{1}{2\sigma^2} \|\mathbf{y}^\ell - \mathbf{z}^\ell\|_2^2, \quad \mathbf{e}^\ell = \frac{1}{\sigma^2}(\mathbf{z}^\ell - \mathbf{y}^\ell)$   (3)

where $\sigma^2$ is a scalar representing a fixed variance (setting this to $1$ gets rid of the multiplicative factor entirely). For the Cauchy distribution (or log-penalty), we obtain:

$\mathcal{L}(\mathbf{y}^\ell, \mathbf{z}^\ell) = \sum_{k} \log\!\left(1 + (\mathbf{z}^\ell[k] - \mathbf{y}^\ell[k])^2\right), \quad \mathbf{e}^\ell = \frac{2(\mathbf{z}^\ell - \mathbf{y}^\ell)}{1 + (\mathbf{z}^\ell - \mathbf{y}^\ell)^2}$   (4)

For the activation function used in calculating the hidden post-activities, we use the hyperbolic tangent, $\phi^\ell(\mathbf{v}) = \tanh(\mathbf{v})$. Using the Cauchy distribution proved particularly useful in our experiments, most likely because it encourages sparser representations, which aligns nicely with the biological considerations of sparse coding [26] and predictive sparse decomposition [14], as well as the lateral competition [36] that naturally occurs in groups of neural processing elements. These are relatively simple local losses that one can use to measure the agreement between a representation and a target, and future work should entail developing better metrics.
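These loss/error-unit pairs are small enough to state directly in code; the NumPy functions below follow our reconstructions of Equations 3 and 4 above, so the signs and constants reflect that reading rather than a reference implementation.

```python
import numpy as np

def gaussian_local(y, z, sigma2=1.0):
    # Eq. (3): Gaussian (L2) local loss and its error units dL/dz;
    # sigma2=1 removes the multiplicative factor.
    loss = np.sum((y - z) ** 2) / (2.0 * sigma2)
    e = (z - y) / sigma2
    return loss, e

def cauchy_local(y, z):
    # Eq. (4): Cauchy (log-penalty) local loss and its error units.
    d = z - y
    loss = np.sum(np.log(1.0 + d ** 2))
    e = 2.0 * d / (1.0 + d ** 2)
    return loss, e
```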

With local losses specified and error units implemented, all that remains is to define how targets are computed and what the parameter updates will be. At any given layer $\ell$, starting at the output units (in our example, $\mathbf{z}^3$), we calculate the target for the layer below by multiplying the error unit values $\mathbf{e}^\ell$ by a set of synaptic error weights $E^\ell$. This projected displacement, weighted by the modulation factor $\beta$ (in the experiments of this paper, a value found with only minor tuning on a subset of the training and validation data proved to be effective in general), is then subtracted from the initially found pre-activation of the layer below, $\mathbf{h}^{\ell-1}$. This updated pre-activity is then run through the appropriate nonlinearity to calculate the final target $\mathbf{y}^{\ell-1}$. This computation amounts to:

$\mathbf{y}^{\ell-1} = \phi^{\ell-1}\!\left(\mathbf{h}^{\ell-1} - \beta \, (E^\ell \cdot \mathbf{e}^\ell)\right)$   (5)

Once the targets for each layer have been found, we can then use the local loss to compute updates to the weights $W^\ell$ and their corresponding error weights $E^\ell$ (except for the very bottom set of forward weights, $W^1$, which have no corresponding error weights). The update calculation for the forward weights at any given layer $\ell$ comes in two variations:

$\Delta W^\ell = \left(\mathbf{e}^\ell \otimes \phi'^{\ell}(\mathbf{h}^\ell)\right) \cdot (\mathbf{z}^{\ell-1})^{\mathrm{T}}$   (6)
$\Delta W^\ell = \mathbf{e}^\ell \cdot (\mathbf{z}^{\ell-1})^{\mathrm{T}}$   (7)

where $\otimes$ indicates the Hadamard product. Note that the second variation of the update rule, Equation 7, does not require $\phi'^{\ell}(\mathbf{h}^\ell)$, which makes it particularly attractive: it dispenses with the first derivative of the activation function, thus permitting the use of discrete and stochastic operations. The update for each set of error weights is simply proportional to the negative transpose of the update computed for its matching forward weights, $\Delta E^\ell = -\gamma (\Delta W^\ell)^{\mathrm{T}}$, where $\gamma$ is a decay factor (a value that we found should be set to less than $1$) meant to ensure that the error weights change more slowly than the forward weights. This is a computationally fast and cheap rule we propose, inspired by [35].

Algorithms 1 and 2 show how the equations above, which constitute LRA, are applied to a 3-layer MLP, assuming Gaussian local loss functions and their respective error units. This means $\phi^\ell(\mathbf{v}) = \tanh(\mathbf{v})$ for the hidden layers, and the model is defined by the parameters $\Theta = \{W^1, W^2, W^3, E^2, E^3\}$ (biases are omitted for clarity). We will refer to this algorithm as LRA-E.
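Putting Equations 1-7 together, the following is a minimal, runnable NumPy sketch of one LRA-E step on the 3-layer MLP just described. It is a sketch under the reconstructions above, not the authors’ reference implementation: the values of beta, gamma, and the learning rate are illustrative assumptions, and the derivative-free variant of Equation 7 is used for the forward weights.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

rng = np.random.default_rng(0)
sizes = [784, 256, 256, 10]
# Forward weights W1..W3 and error weights E2, E3 (naive Gaussian init).
W = [rng.normal(0, 0.05, (sizes[i + 1], sizes[i])) for i in range(3)]
E = [rng.normal(0, 0.05, (sizes[i], sizes[i + 1])) for i in range(1, 3)]
beta, gamma, lr = 0.1, 0.9, 0.01   # illustrative values, not the paper's

def lra_e_step(x, y):
    # Feedforward pass (Eq. 1).
    h1 = W[0] @ x;  z1 = np.tanh(h1)
    h2 = W[1] @ z1; z2 = np.tanh(h2)
    h3 = W[2] @ z2; z3 = softmax(h3)
    # Error units and targets, computed top-down (Eq. 5), Gaussian losses.
    e3 = z3 - y                             # output error units
    y2 = np.tanh(h2 - beta * (E[1] @ e3))   # target for layer 2
    e2 = z2 - y2                            # layer-2 error units
    y1 = np.tanh(h1 - beta * (E[0] @ e2))   # target for layer 1
    e1 = z1 - y1                            # layer-1 error units
    # Forward-weight updates (Eq. 7 variant: no activation derivative).
    dW1, dW2, dW3 = np.outer(e1, x), np.outer(e2, z1), np.outer(e3, z2)
    for Wl, dWl in zip(W, (dW1, dW2, dW3)):
        Wl -= lr * dWl
    # Error weights: negative transpose of the matching forward update,
    # scaled by the decay factor gamma (per the rule after Eq. 7).
    E[1] -= lr * (-gamma * dW3.T)
    E[0] -= lr * (-gamma * dW2.T)
    # Total discrepancy over the hidden layers for this sample.
    return 0.5 * np.sum((y1 - z1) ** 2) + 0.5 * np.sum((y2 - z2) ** 2)

x = rng.random(784)        # a toy input in [0, 1]
y = np.eye(10)[3]          # one-hot label
print("discrepancy:", lra_e_step(x, y))
```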

Figure 1: In Figure 1(a), we compare the updates calculated by LRA-E and backprop (the angle between them). In Figure 1(b), we show how the total discrepancy $\mathcal{D}$, as measured in an LRA-trained MLP, evolves during training, alongside the validation NLL output loss.

With a local loss assigned to each hidden layer, we can measure our neural model’s total internal discrepancy for a given data point, $\mathcal{D}$, as a simple linear combination of all of the internal local losses. Figure 1(b) tracks the 3-layer MLP example developed in this section (256 units per layer), trained by stochastic gradient descent (SGD) with mini-batches of 50 image samples, over the first 20 epochs of learning, using a categorical output loss and two Gaussian local losses. While the output loss continues to decrease, the total discrepancy does not always do so, especially in the earlier part of learning. However, since each layer tries to minimize the mismatch between itself and a target value, any fluctuations (local loss values that temporarily increase rather than decrease, raising the total discrepancy) are taken care of later as the model starts generating better targets. The hope is that, so long as the angle between the updates computed by LRA and those obtained by back-propagation stays within 90 degrees, LRA will move parameters in the same general direction as back-propagation, which greedily points in the direction of steepest descent, and will still find reasonably good local optima. In Figure 1(a), this does indeed appear to be the case: for the 3-layer MLP trained for illustrative purposes in this section, we compare the updates calculated by LRA-E with those given by back-propagation after each mini-batch. The angle, while certainly non-zero, fortunately never deviates too far from the direction pointed to by back-propagation (at most 11 degrees) and remains relatively stable throughout the learning process.
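One plausible way to measure this agreement, sketched below, is to flatten each algorithm’s per-layer updates into a single vector and take the angle between the two vectors; this helper is our own illustration, not necessarily the exact measurement procedure behind Figure 1(a).

```python
import numpy as np

def update_angle_degrees(updates_a, updates_b):
    # Flatten each algorithm's per-layer update matrices into one long
    # vector and compute the angle between them; anything below 90 degrees
    # means the two algorithms agree on the general descent direction.
    a = np.concatenate([u.ravel() for u in updates_a])
    b = np.concatenate([u.ravel() for u in updates_b])
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```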

2.2 Improving Difference Target Propagation

As mentioned earlier, Difference Target Propagation (DTP) (and also, less directly, recirculation [11, 27]), like LRA-E, falls under the same family of algorithms concerned with minimizing internal discrepancy, as shown in [30, 28]. However, DTP takes a very different approach to computing alignment targets than LRA-E does: instead of transmitting messages through error units and error feedback weights as in LRA [30], DTP employs feedback weights to learn the inverse of the mapping created by the feedforward weights. However, [28] showed that DTP struggles to assign good local targets as the network gets deeper and thus more highly nonlinear, exhibiting an initially positive but brief phase in which generalization error decreases (within the first few epochs) before ultimately collapsing (unless very specific initializations are used). One potential cause of this failure could be the lack of a strong enough mechanism to globally coordinate the local learning problems created by the encoder-decoder pairs that define the system. In particular, we hypothesize that this problem might come from the noise injection scheme, which is local and fixed, offering no adaptation to each specific layer and making some of the layerwise optimization problems more difficult than necessary. Here, we aim to remove this potential cause through an adaptive layerwise corruption scheme.

Assuming we have a target $\mathbf{y}^{\ell+1}$ calculated from above, we consider the forward weights $W^{\ell+1}$ connecting layer $\ell$ to layer $\ell+1$ and the decoding weights $V^{\ell+1}$ that define the inverse mapping between the two. The first forward propagation step is the same as in Equation 1. In contrast to LRA-E’s error-driven way of computing targets, we consider each pair of neuronal layers, $(\mathbf{z}^\ell, \mathbf{z}^{\ell+1})$, as forming a particular type of encoding/decoding cycle that will be used in computing layerwise targets. To calculate the target $\mathbf{y}^\ell$, we update the original post-activation using the linear combination of two applications of the decoding weights as follows:

$\mathbf{y}^\ell = \mathbf{z}^\ell - g^\ell(\mathbf{z}^{\ell+1}) + g^\ell(\mathbf{y}^{\ell+1}), \quad \text{where } g^\ell(\mathbf{v}) = \phi^\ell(V^{\ell+1} \cdot \mathbf{v})$   (8)

where we see that we decode twice, once from the original post-activation $\mathbf{z}^{\ell+1}$ calculated from the feedforward pass of the MLP, and once from the target value $\mathbf{y}^{\ell+1}$ generated by the encoding/decoding process of the layer pair above. This will serve as the target in training the forward weights for the layer below. Throughout the experiments with our improved DTP, the top-layer target is computed using a fixed constant of 0.01 in place of the learning rate. To train the inverse-mapping weights $V^{\ell+1}$, as required by the originally proposed version of DTP, zero-mean Gaussian noise $\epsilon$, with fixed standard deviation $\sigma$, is injected into $\mathbf{z}^\ell$, followed by re-running the encoder and the decoder on this new corrupted activation vector. Formally, this is defined as:

$\mathcal{L}_{inv} = \left\| g^\ell\!\left(f^{\ell+1}(\mathbf{z}^\ell + \epsilon)\right) - (\mathbf{z}^\ell + \epsilon) \right\|_2^2, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$   (9)

where $f^{\ell+1}(\mathbf{v}) = \phi^{\ell+1}(W^{\ell+1} \cdot \mathbf{v})$ denotes the encoder between the pair.
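A minimal NumPy sketch of this encode/decode cycle follows, implementing the target rule of Equation 8 and the inverse-mapping loss of Equation 9 for a single layer pair; the weight shapes, the noise level, and the stand-in value for the target arriving from above are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(0, 0.1, (256, 256))   # encoder weights, layer l -> l+1
V = rng.normal(0, 0.1, (256, 256))   # inverse-mapping (decoder) weights

f = lambda v: np.tanh(W @ v)         # encoder between the layer pair
g = lambda v: np.tanh(V @ v)         # approximate inverse of f

z_l = np.tanh(rng.normal(size=256))  # post-activation from the forward pass
z_up = f(z_l)                        # layer above
y_up = z_up - 0.1 * rng.normal(size=256)  # stand-in for the target from above

# Eq. (8): difference-corrected target for layer l (two decoder applications).
y_l = z_l - g(z_up) + g(y_up)

# Eq. (9): corrupt z_l with Gaussian noise of std sigma, re-encode and
# decode, and penalize the reconstruction error to train V.
sigma = 0.1                          # fixed in DTP; layerwise in DTP-sigma
c = z_l + sigma * rng.normal(size=256)
L_inv = np.sum((g(f(c)) - c) ** 2)
print("inverse-mapping loss:", L_inv)
```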

We will refer to this process as DTP. In our proposed, improved variation, DTP-σ, we take an adaptive approach to the noise injection process. To develop our adaptive noise scheme, we have taken insights from studies of biological neural systems, which show there are different levels of variability at different neuronal layers [6, 45, 44, 40]. It has been argued that this noise variability enhances neurons’ overall ability to detect and transmit signals across a system [41, 16, 40] and, furthermore, that the presence of noise yields more robust representations [5, 40, 7]. There is also biological evidence demonstrating an increase in the noise level across successive groups of neurons, which is thought to aid local neural computation [40, 38, 17].

The standard deviation of the noise process should vary across layers, and an interesting way in which we implemented this was to make $\sigma^\ell$ (the standard deviation of the noise injection at layer $\ell$) a function of local loss measurements. At the top layer, we can set the standard deviation to a small fixed value that worked well in our experiments, or, rather, equal to the step-size used to compute the top-most target (when differentiating the output loss with respect to the output post-activation). The standard deviation for the layers below is then a function of where the layer sits within the network. This means that:

(10)

noting that the local loss chosen for DTP is a Gaussian loss (but with the input arguments flipped–the target value is now the corrupted initial encoding and the prediction is the clean, original encoding).

The updates to the weights are calculated by differentiating each local loss with respect to the appropriate encoder weights $W^\ell$, or with respect to the decoder synaptic weights $V^\ell$. Note that the order of the input arguments to each loss function for these two partial derivatives is important, in keeping with the original paper in which DTP was proposed [19], in order to obtain the correct sign by which to multiply the gradients.

As we will see in our experimental results, DTP-σ is a much more stable learning algorithm, especially when training deeper and wider networks. DTP-σ benefits from a stronger form of global coordination among its internal encoding/decoding sub-problems through the pair-wise comparison of local loss values that drives the hidden-layer corruption.

2.3 A Comment on the Efficiency of LRA-E and DTP

It should be noted that LRA-E is in general faster than DTP in calculating targets. Specifically, if we focus just on matrix multiplications within an MLP, which take up the bulk of the computation underlying both processes, LRA-E requires only two matrix multiplications per layer while DTP (and our proposed DTP-σ) requires five. In particular, the bulk of DTP’s expense comes from its approach to computing the targets for the hidden layers, since it requires 2 applications of the encoder parameters (1 of these comes from the initial feedforward pass through the network) and 3 applications of the decoder parameters in order to properly generate targets to train the forward weights and the inverse-mapping weights.

3 Experimental Results

In this section, we present experimental results of training MLPs using a variety of learning algorithms.

MNIST: This dataset (publicly available at http://yann.lecun.com/exdb/mnist/) contains 28x28 images of handwritten digits with gray-scale pixel feature values in the range [0, 255]. The only preprocessing applied to this data is to normalize the feature values to the range [0, 1], done by dividing the pixel values by 255.

Fashion MNIST: This database [46] contains 28x28 grey-scale images of clothing items, meant to serve as a much more difficult drop-in replacement for MNIST itself. The training split contains 60000 samples and the testing split contains 10000; each image is associated with one of 10 classes. We create a validation set of 2000 samples from the training split. Preprocessing was the same as for MNIST.
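For concreteness, the whole preprocessing step amounts to a one-line normalization; the helper below (our own, hypothetical naming) also flattens each image into the vector form consumed by the MLPs used here.

```python
import numpy as np

def preprocess(images_uint8):
    # Normalize gray-scale pixel values from [0, 255] to [0, 1] and
    # flatten each 28x28 image into a 784-dimensional vector.
    x = images_uint8.astype(np.float32) / 255.0
    return x.reshape(x.shape[0], -1)
```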

For both datasets and all models, over 100 epochs, we calculate updates over mini-batches of 50 samples. Unless internal to the algorithm itself, we do not regularize parameters any further, such as through drop-out [42] or penalties placed on the weights. All feedforward architectures in all experiments had either 3, 5, or 8 hidden layers of processing elements. The post-activation function used was simply the hyperbolic tangent, and the top layer was a maximum-entropy classifier (employing the softmax function). The output-layer objective for all algorithms was to minimize the categorical negative log likelihood.

Parameters were initialized using the scheme that gave the best performance on the validation split of each dataset on a per-algorithm basis. Though we wanted to use very simple initialization schemes for all algorithms, in preliminary experiments we found that the feedback alignment algorithms, as well as difference target propagation (including our improved version of it), worked best when using a uniform fan-in-fan-out scheme [8]. [28] confirms this result, originally showing how these algorithms often are unstable or fail to perform well using a simple initialization based on the uniform or Gaussian distributions. For Local Representation Alignment, however, we simply initialized the parameters using a zero-mean Gaussian distribution with a fixed variance.
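The two schemes are sketched below for reference; the Gaussian standard deviation shown is an illustrative assumption rather than the exact value used in our runs.

```python
import numpy as np

rng = np.random.default_rng(3)

def fanin_fanout_uniform(n_out, n_in):
    # Uniform fan-in/fan-out ("Xavier/Glorot") scheme [8], used for the
    # feedback-alignment and target-propagation baselines.
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, (n_out, n_in))

def gaussian_init(n_out, n_in, std=0.05):
    # Simple zero-mean Gaussian initialization used for LRA-E; the std
    # here is an illustrative stand-in.
    return rng.normal(0.0, std, (n_out, n_in))
```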

Table 1: MNIST supervised classification results: train and test error for 3-, 5-, and 8-layer MLPs trained with Backprop, Equil-Prop, RFA, DFA, DTP, DTP-σ (ours), and LRA-E (ours).
Table 2: Fashion MNIST supervised classification results: train and test error for 3-, 5-, and 8-layer MLPs trained with Backprop, Equil-Prop, RFA, DFA, DTP, DTP-σ (ours), and LRA-E (ours).

The choice of update rule was also somewhat dependent on the learning algorithm employed. Again, as shown in [28], it is difficult to get good, stable performance from algorithms such as the original DTP when using simple SGD. As done in [20], we therefore used the RMSprop adaptive learning rate (with a single global step size) for the DTP variants. For Backprop, RFA, DFA, and LRA-E, SGD was used.
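For readers unfamiliar with the distinction, the two update rules are sketched below in NumPy; the hyperparameter defaults are the usual textbook ones, not necessarily those used in these experiments.

```python
import numpy as np

def sgd_update(w, g, lr):
    # Plain stochastic gradient descent (used for Backprop, RFA, DFA, LRA-E).
    return w - lr * g

def rmsprop_update(w, g, cache, lr, decay=0.9, eps=1e-8):
    # RMSprop adaptive step (used for DTP and DTP-sigma): scale each
    # gradient by a running estimate of its recent squared magnitude.
    cache = decay * cache + (1.0 - decay) * g * g
    return w - lr * g / (np.sqrt(cache) + eps), cache
```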

3.1 Supervised Learning Experiments

Table 3: Effect of the update rule (SGD, Adam, RMSprop) on LRA-E: train and test error for a 3-layer MLP on MNIST and Fashion MNIST.
Figure 2: t-SNE visualization of the topmost hidden layer extracted from a 5-layer MLP trained by (a) Direct Feedback Alignment (DFA), (b) Equilibrium Propagation (Equil-Prop), (c) adaptive noise Difference Target Propagation (DTP-σ), and (d) error-driven Local Representation Alignment (LRA-E).

In this experiment, we compare all of the algorithms discussed earlier. These include back-propagation of errors (Backprop), Random Feedback Alignment (RFA) [22], Direct Feedback Alignment (DFA) [25], Equilibrium Propagation (Equil-Prop) [39], and the original Difference Target Propagation (DTP) [19]. Our algorithms include our proposed, improved version of DTP (DTP-σ) and the proposed error-driven Local Representation Alignment (LRA-E).

The results of our experiments are presented in Tables 1 and 2. Test and training scores are reported for the set of model parameters that had the lowest validation error. Observe that LRA-E is the most stable and consistently well-performing algorithm compared to the other biologically-motivated backprop alternatives, closely followed by our improved variant of DTP. More importantly, note that algorithms like Equil-Prop and DTP appear to break down when training deeper networks, such as the 8-layer MLP. While DTP was previously used to successfully train a 7-layer network of 240 units per layer (using RMSprop), we followed the same settings reported for that deeper network and uncovered in our experiments that the algorithm begins to struggle as the layers are made wider, starting as soon as the width of 256 used in this paper. This problem is rectified by our variant of DTP, leading to much more stable performance, even in cases where the algorithm completely overfits the training set (as in the case of 3 and 5 layers on MNIST). Nonetheless, LRA-E still performs the best with respect to generalization across both datasets, despite using a vastly simpler parameter update rule and a naive initialization scheme. Table 3 shows the results of using update rules other than SGD for LRA-E, namely Adam [15] and RMSprop, on a 3-layer MLP, with a single global step size for both. We see that LRA-E is not only compatible with other learning rate schemes but reaches even better generalization performance when using them, e.g., Adam.

Figure 3: Validation accuracy of DTP vs. the improved DTP-σ as a function of epoch, for (a) a 5-layer MLP and (b) an 8-layer MLP.

Figure 2 displays a t-SNE [24] visualization of the topmost hidden layer of a learned 5-layer MLP trained with DFA, Equil-Prop, DTP-σ, or LRA-E on the Fashion MNIST test set. Qualitatively, we see that all of the learning algorithms extract representations that separate the data points reasonably well, at least in the sense that the points are clustered by clothing type. However, the LRA-E representations appear to yield more strongly separated clusters, as evidenced by the somewhat wider gaps between them, especially around the pink, blue, and black colored clusters.

Finally, DTP, as also noted in [28], appears to be quite sensitive to its initialization scheme. For both MNIST and Fashion MNIST, we trained DTP and our improved variant DTP-σ with three different settings: random orthogonal, fan-in-fan-out, and simple zero-mean Gaussian initializations. Figure 3 shows the validation accuracy curves of DTP and DTP-σ as a function of epoch for the 5- and 8-layer networks (3-layer networks can be found in the appendix) with the various weight initializations, i.e., Gaussian (G), orthogonal (ortho), and Xavier/Glorot. As shown in the figure, DTP is highly unstable as the network gets deeper, while DTP-σ appears to be far less dependent on the weight initialization scheme. Thus, our experiments show promising evidence of DTP-σ’s generalization improvement over the original DTP. Moreover, as the test performance in Tables 1 and 2 indicates, the targets found by DTP-σ can be nearly as good as those found by the error units in LRA-E.

4 Conclusions

In this paper, we proposed two learning algorithms: error-driven Local Representation Alignment and adaptive noise Difference Target Propagation. On two classification benchmarks, we show strong positive results when training deep multilayer perceptrons. Future work includes investigating how these algorithms fare on much larger-scale tasks and adapting them to problems where labeled data is scarce or the data has a temporal dimension.

References

  • [1] Balduzzi, D., Vanchinathan, H., and Buhmann, J. M. Kickback cuts backprop’s red-tape: Biologically plausible credit assignment in neural networks. In AAAI (2015), pp. 485–491.
  • [2] Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H., et al. Greedy layer-wise training of deep networks. Advances in neural information processing systems 19 (2007), 153.
  • [3] Bengio, Y., Lee, D.-H., Bornschein, J., Mesnard, T., and Lin, Z. Towards biologically plausible deep learning. arXiv preprint arXiv:1502.04156 (2015).
  • [4] Clark, A. Whatever next? predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences 36, 3 (2013), 181–204.
  • [5] Cordo, P., Inglis, J. T., Verschueren, S., Collins, J. J., Merfeld, D. M., Rosenblum, S., Buckley, S., and Moss, F. Noise in human muscle spindles. Nature 383, 6603 (Oct 1996), 769–770.
  • [6] Adrian, E. D., and Zotterman, Y. The impulses produced by sensory nerve-endings. The Journal of Physiology 61, 2 (1926), 151–171.
  • [7] Faisal, A. A., Selen, L. P., and Wolpert, D. M. Noise in the nervous system. Nat. Rev. Neurosci. 9, 4 (Apr 2008), 292–303.
  • [8] Glorot, X., and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (2010), pp. 249–256.
  • [9] Grossberg, S. How does a brain build a cognitive code? In Studies of mind and brain. Springer, 1982, pp. 1–52.
  • [10] Grossberg, S. Competitive learning: From interactive activation to adaptive resonance. Cognitive Science 11, 1 (1987), 23 – 63.
  • [11] Hinton, G. E., and McClelland, J. L. Learning representations by recirculation. In Neural information processing systems (1988), pp. 358–366.
  • [12] Huang, Y., and Rao, R. P. Predictive coding. Wiley Interdisciplinary Reviews: Cognitive Science 2, 5 (2011), 580–593.
  • [13] Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., and Kavukcuoglu, K. Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343 (2016).
  • [14] Kavukcuoglu, K., Ranzato, M., and LeCun, Y. Fast inference in sparse coding algorithms with applications to object recognition. arXiv preprint arXiv:1010.3467 (2010).
  • [15] Kingma, D., and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • [16] Kruglikov, I. L., and Dertinger, H. Stochastic resonance as a possible mechanism of amplification of weak electric signals in living cells. Bioelectromagnetics 15, 6 (1994), 539–547.
  • [17] Laughlin, S. B., de Ruyter van Steveninck, R. R., and Anderson, J. C. The metabolic cost of neural information. Nat. Neurosci. 1, 1 (May 1998), 36–41.
  • [18] Lee, C.-Y., Xie, S., Gallagher, P., Zhang, Z., and Tu, Z. Deeply-Supervised Nets. arXiv:1409.5185 [cs, stat] (2014).
  • [19] Lee, D.-H., Zhang, S., Fischer, A., and Bengio, Y. Difference target propagation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (2015), Springer, pp. 498–515.
  • [20] Lee, D.-H., Zhang, S., Fischer, A., and Bengio, Y. Difference target propagation. In Proceedings of the 2015th European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I (Switzerland, 2015), ECMLPKDD’15, Springer, pp. 498–515.
  • [21] Liao, Q., Leibo, J. Z., and Poggio, T. A. How important is weight symmetry in backpropagation? In AAAI (2016), pp. 1837–1844.
  • [22] Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. Random feedback weights support learning in deep neural networks. arXiv preprint arXiv:1411.0247 (2014).
  • [23] Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. Random synaptic feedback weights support error backpropagation for deep learning. Nature communications 7 (2016), 13276.
  • [24] Maaten, L. v. d., and Hinton, G. Visualizing data using t-sne. Journal of machine learning research 9, Nov (2008), 2579–2605.
  • [25] Nøkland, A. Direct feedback alignment provides learning in deep neural networks. In Advances in Neural Information Processing Systems (2016), pp. 1037–1045.
  • [26] Olshausen, B. A., and Field, D. J. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research 37, 23 (1997), 3311–3325.
  • [27] O’Reilly, R. C. Biologically plausible error-driven learning using local activation differences: The generalized recirculation algorithm. Neural computation 8, 5 (1996), 895–938.
  • [28] Ororbia, A. G., Mali, A., Kifer, D., and Giles, C. L. Conducting credit assignment by aligning local representations. arXiv preprint arXiv:1803.01834 (2018).
  • [29] Ororbia II, A. G., Giles, C. L., and Reitter, D. Online semi-supervised learning with deep hybrid boltzmann machines and denoising autoencoders. arXiv preprint arXiv:1511.06964 (2015).
  • [30] Ororbia II, A. G., Haffner, P., Reitter, D., and Giles, C. L. Learning to adapt by minimizing discrepancy. arXiv preprint arXiv:1711.11542 (2017).
  • [31] Ororbia II, A. G., Kifer, D., and Giles, C. L. Unifying adversarial training algorithms with data gradient regularization. Neural computation 29, 4 (2017), 867–887.
  • [32] Ororbia II, A. G., Reitter, D., Wu, J., and Giles, C. L. Online learning of deep hybrid architectures for semi-supervised categorization. In Machine Learning and Knowledge Discovery in Databases (Proceedings, ECML PKDD 2015), vol. 9284 of Lecture Notes in Computer Science. Springer, Porto, Portugal, 2015, pp. 516–532.
  • [33] Panichello, M., Cheung, O., and Bar, M. Predictive feedback and conscious visual experience. Frontiers in Psychology 3 (2013), 620.
  • [34] Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning (2013), pp. 1310–1318.
  • [35] Rao, R. P., and Ballard, D. H. Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural computation 9, 4 (1997), 721–763.
  • [36] Rao, R. P., and Ballard, D. H. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience 2, 1 (1999).
  • [37] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Neurocomputing: Foundations of research. MIT Press, Cambridge, MA, USA, 1988, ch. Learning Representations by Back-propagating Errors, pp. 696–699.
  • [38] Sarpeshkar, R. Analog versus digital: extrapolating from electronics to neurobiology. Neural Comput 10, 7 (Oct 1998), 1601–1638.
  • [39] Scellier, B., and Bengio, Y. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in computational neuroscience 11 (2017).
  • [40] Shadlen, M. N., and Newsome, W. T. The variable discharge of cortical neurons: implications for connectivity, computation, and information coding. J. Neurosci. 18, 10 (May 1998), 3870–3896.
  • [41] Shu, Y., Hasenstaub, A., Badoual, M., Bal, T., and McCormick, D. A. Barrages of synaptic activity control the gain and sensitivity of cortical neurons. J. Neurosci. 23, 32 (Nov 2003), 10388–10401.
  • [42] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
  • [43] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013).
  • [44] Tolhurst, D., Movshon, J., and Dean, A. The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Research 23, 8 (1983), 775 – 785.
  • [45] Tomko, G. J., and Crapper, D. R. Neuronal variability: non-stationary responses to identical visual stimuli. Brain Research 79, 3 (1974), 405 – 418.
  • [46] Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).