Biologically Motivated Algorithms for Propagating Local Target Representations
Abstract
Finding biologically plausible alternatives to backpropagation of errors is a fundamentally important challenge in artificial neural network research. In this paper, we propose a simple learning algorithm called error-driven Local Representation Alignment, which has strong connections to predictive coding, a theory that offers a mechanistic way of describing neurocomputational machinery. In addition, we propose an improved variant of Difference Target Propagation, another algorithm that comes from the same family of algorithms as Local Representation Alignment. We compare our learning procedures to several other biologically-motivated algorithms, including two feedback alignment algorithms and Equilibrium Propagation. On two benchmark datasets, we find that both of our proposed learning algorithms yield stable performance and strong generalization in comparison to other competing backpropagation alternatives when training deeper, highly nonlinear networks, with Local Representation Alignment performing the best overall.
Alexander G. Ororbia Penn State University ago109@psu.edu Ankur Mali Penn State University aam35@ist.psu.edu
Preprint. Work in progress.
1 Introduction
Behind the many modern achievements in artificial neural network research is backpropagation of errors [37] (or “backprop”), the key training algorithm used in computing updates to the many parameters that define the computational architectures applied to problems ranging from Computer Vision to Natural Language Processing and Speech. However, though neural architectures are inspired by our current neuroscientific understanding of the human brain, the connections to the actual mechanisms that compose systems of natural neurons are often very loose, at best. More importantly, backpropagation of errors faces some of the strongest neurobiological criticisms, argued to be a highly implausible way in which learning occurs in the human brain.
Among the many problems with backpropagation of errors, some of the most prominent include: 1) the "weight transport problem", where the feedback weights used to carry error signals must be the transposes of the feedforward weights, 2) forward propagation and backward propagation utilize different computations, and 3) the error gradients are stored separately from the activations. These problems, as originally argued in [30, 28], largely center around one critical component of backprop: the global feedback pathway needed for transporting error derivatives across the system. This pathway is necessary given the design of modern supervised learning systems: a loss function measures the error between an artificial neural system's output units and some target (such as a class label), and the global pathway relates how the internal processing elements affect this error. When considering modern theories of the brain [9, 36, 12, 4], which posit that local computations occur at multiple levels of the somewhat hierarchical structure of natural neural systems, this global pathway should not be necessary to learn effectively. Furthermore, this pathway is the source of many practical problems that make training very deep, more complex networks difficult: as a result of the many multiplications that underlie traversing this global feedback pathway, error gradients will either explode or vanish [34]. In trying to fix this particular issue, gradients can be kept within reasonable magnitudes by requiring layers to behave sufficiently linearly (which prevents saturation of the post-activation function used, which would yield zero gradients). However, this remedy creates other highly undesirable side-effects, such as the well-known problem of adversarial samples [43, 31], and prevents the use of neurobiologically plausible mechanisms such as lateral competition and discrete-valued/stochastic activation functions (since this pathway requires precise knowledge of the activation function derivatives [3]).
If we remove this global feedback pathway, we create a new problem: what are the learning signals for the hidden processing elements? This problem is one of the main concerns of the recently introduced Discrepancy Reduction family of learning algorithms [30]. In this paper, we will develop two learning algorithms within this family: error-driven Local Representation Alignment and adaptive noise Difference Target Propagation. In experiments on two classification benchmarks, we will show that these two algorithms generalize better than a variety of other biologically motivated learning algorithms, all without employing the global feedback pathway required by backpropagation.
2 Reducing Discrepancy with Globally-Coordinated Local Learning Rules
Algorithms within the Discrepancy Reduction [30] family offer computational mechanisms to perform the following two steps when learning from a sample (or mini-batch of samples):

1. Search for latent representations that better explain the input/output, also known as target representations. This necessitates local (higher-level) objectives that will help guide the current latent representations towards better ones.

2. Reduce, as much as possible, the mismatch between the model's currently "guessed" representations and the target representations. The sum of the internal, local losses is also defined as the total discrepancy in the system, and can also be thought of as a sort of pseudo-energy function.
This general process forms the basis of what we call globally-coordinated local learning rules. Computing targets with these kinds of rules should not require an actual pathway, as in backpropagation, and should instead make use of top-down and bottom-up signals to create targets. This idea is particularly motivated by the theory of predictive coding [33], which claims that the brain is in a continuous process of creating and updating hypotheses to predict the sensory input. This paper will explore two ways in which this hypothesis updating (in the form of local target creation) might happen: 1) through error-correction in Local Representation Alignment, and 2) through repeated encoding and decoding in Difference Target Propagation.
The idea of learning locally is slowly becoming prominent in the training of artificial neural networks, with recent proposals including decoupled neural interfaces [13] and kickback [1] (which was derived specifically for regression problems). Far earlier approaches that employed local learning included the layer-wise training procedures that were once used to build models for unsupervised learning [2], supervised learning [18], and semi-supervised learning [32, 29]. The key problem with these older algorithms is that they were greedy: a model was built from the bottom up, freezing lower-level parameters as higher-level feature detectors were learnt.
Another important idea that comes into play in algorithms such as LRA and DTP is that learning is possible with asymmetry, which directly resolves the weight-transport problem [10, 21], another strong neurobiological criticism of backprop. Surprisingly, this is possible even if the feedback loops are random and fixed, which led to the proposal of two algorithms we also compare to in this paper. Random Feedback Alignment (RFA) [23] essentially replaces the transpose of the feedforward weights in backpropagation with a non-learnable, random matrix of the same dimensions. Direct Feedback Alignment (DFA) [25] extends this idea further by directly connecting the output layer's pre-activation derivative to each layer's post-activation. It was shown in [30, 28] that these feedback loops would be better suited to generating target representations.
2.1 Local Representation Alignment
To concretely describe how LRA is practically implemented, we will formulate how LRA would be applied to a 3-layer feedforward network, or multilayer perceptron (MLP). Note that LRA easily generalizes to models with an arbitrary number of layers.
The pre-activities of the MLP at layer $\ell$ are denoted as $\mathbf{z}^\ell$, while the post-activities, or the values output by the nonlinearity $\phi^\ell(\cdot)$, are denoted as $\mathbf{h}^\ell$. The target variable used to correct the output units ($\mathbf{h}^L$, where $L$ indexes the topmost layer) is denoted as $\mathbf{y}$ (a label vector, or $\mathbf{y} = \mathbf{x}$ if we are learning an auto-associative function). Connecting one layer of neurons $\mathbf{h}^{\ell-1}$, with pre-activities $\mathbf{z}^{\ell-1}$, to another layer $\mathbf{h}^\ell$, with pre-activities $\mathbf{z}^\ell$, is a set of synaptic weights $W^\ell$. The forward propagation equations for computing pre-activation and post-activation values for a layer $\ell$ would then simply be:

$\mathbf{z}^\ell = W^\ell \cdot \mathbf{h}^{\ell-1}, \quad \mathbf{h}^\ell = \phi^\ell(\mathbf{z}^\ell)$   (1)
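To make the forward pass of Equation 1 concrete, here is a minimal NumPy sketch; the 784-256-256-10 layer sizes, the weight scale, and the tanh nonlinearity are illustrative assumptions, not settings fixed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, weights, phi=np.tanh):
    """Feedforward pass: z^l = W^l h^(l-1), h^l = phi(z^l).

    Returns the lists of pre-activations z and post-activations h
    (h[0] is the input itself)."""
    h, zs, hs = x, [], [x]
    for W in weights:
        z = W @ h          # pre-activation z^l
        h = phi(z)         # post-activation h^l
        zs.append(z)
        hs.append(h)
    return zs, hs

# Hypothetical 3-layer MLP: 784 -> 256 -> 256 -> 10.
weights = [rng.normal(0.0, 0.05, size=(n_out, n_in))
           for n_in, n_out in [(784, 256), (256, 256), (256, 10)]]
zs, hs = forward(rng.normal(size=784), weights)
```

The same routine is reused by every algorithm discussed below, since they differ only in how learning signals are produced, not in how the forward pass is computed.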
Before computing targets or updates, we first must define the set of local losses, one per layer of neurons except for the input neurons, that constitute the measure of total discrepancy inside the MLP. With losses defined, we can then explicitly formulate the error units $\mathbf{e}^\ell$ for each layer as well, since any given layer's error units correspond to the first derivative of that layer's loss with respect to that layer's post-activation values. For the MLP's output layer, we could assume a categorical distribution, which is appropriate for 1-of-$k$ classification tasks, and use the following negative log likelihood loss:
$\mathcal{L}(\mathbf{h}^L, \mathbf{y}) = -\sum_i \mathbf{y}[i] \log \mathbf{h}^L[i]$   (2)
where the loss is computed over all dimensions of the vector (where a dimension is indexed/accessed by integer $i$). Note that for this loss function, we assume that $\mathbf{h}^L$ is a vector of probabilities computed by using the softmax function as the output nonlinearity, $\phi^L(\mathbf{v}) = \exp(\mathbf{v}) / \sum_i \exp(\mathbf{v}[i])$. For the hidden layers, we can choose between a wider variety of loss functions, and in this paper, we experimented with assuming either a Gaussian or Cauchy distribution over the hidden units. For the Gaussian distribution (or L2 norm), we have the following loss and error unit pair:
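The softmax output and the loss of Equation 2 can be sketched as follows (the vector values are arbitrary illustrations):

```python
import numpy as np

def softmax(v):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(v - np.max(v))
    return e / e.sum()

def categorical_nll(h, y):
    """Negative log likelihood -sum_i y[i] log h[i] for a
    probability vector h and a one-hot target y."""
    return -np.sum(y * np.log(h + 1e-12))

z = np.array([2.0, -1.0, 0.5])   # illustrative output pre-activations
h = softmax(z)
y = np.array([1.0, 0.0, 0.0])    # one-hot target
loss = categorical_nll(h, y)
# For a softmax output paired with this loss, the error signal at the
# output pre-activation simplifies to (h - y).
e_out = h - y
```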
$\mathcal{L}(\mathbf{y}^\ell, \mathbf{h}^\ell) = \frac{1}{2\sigma^2} \sum_i (\mathbf{y}^\ell[i] - \mathbf{h}^\ell[i])^2, \quad \mathbf{e}^\ell = \frac{1}{\sigma^2}(\mathbf{h}^\ell - \mathbf{y}^\ell)$   (3)
where $\sigma^2$ is a scalar representing a fixed variance (setting this to 1 gets rid of the multiplicative factor on the error units entirely). For the Cauchy distribution (or log-penalty), we obtain:
$\mathcal{L}(\mathbf{y}^\ell, \mathbf{h}^\ell) = \sum_i \log\left(1 + (\mathbf{y}^\ell[i] - \mathbf{h}^\ell[i])^2\right), \quad \mathbf{e}^\ell = \frac{2(\mathbf{h}^\ell - \mathbf{y}^\ell)}{1 + (\mathbf{y}^\ell - \mathbf{h}^\ell)^2}$   (4)

where the division in the error units is applied element-wise.
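The two local loss/error-unit pairs above amount to simple element-wise functions; a NumPy sketch (the sample vectors are arbitrary):

```python
import numpy as np

def gaussian_error(h, y, sigma2=1.0):
    """Error units for the Gaussian (L2) local loss:
    e = (h - y) / sigma2; sigma2 = 1 drops the factor entirely."""
    return (h - y) / sigma2

def cauchy_error(h, y):
    """Error units for the Cauchy (log-penalty) local loss:
    e = 2 (h - y) / (1 + (y - h)^2), applied element-wise."""
    d = h - y
    return 2.0 * d / (1.0 + d ** 2)

h = np.array([0.9, -0.2, 0.0])   # current post-activations
y = np.array([0.5, -0.2, 1.0])   # target representation
eg = gaussian_error(h, y)
ec = cauchy_error(h, y)
```

Note how the Cauchy error saturates for large mismatches, which is one way to read its connection to sparser, more competitive representations.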
For the activation function used in calculating the hidden post-activities, we use the hyperbolic tangent, or $\phi^\ell(\mathbf{v}) = \tanh(\mathbf{v})$. Using the Cauchy distribution proved particularly useful in our experiments, most likely because it encourages sparser representations, which aligns nicely with the biological considerations of sparse coding [26] and predictive sparse decomposition [14] as well as the lateral competition [36] that naturally occurs in groups of neural processing elements. These are relatively simple local losses that one can use to measure the agreement between a representation and its target, and future work should entail developing better metrics.
With local losses specified and error units implemented, all that remains is to define how targets are computed and what the parameter updates will be. At any given layer $\ell$, starting at the output units (in our example, $\ell = 3$), we calculate the target for the layer below by multiplying the error unit values $\mathbf{e}^\ell$ by a set of synaptic error weights $E^\ell$. This projected displacement, weighted by the modulation factor $\beta$,^{1} is then subtracted from the initially found pre-activation of the layer below, $\mathbf{z}^{\ell-1}$. This updated pre-activity is then run through the appropriate nonlinearity to calculate the final target $\mathbf{y}^{\ell-1}$. This computation amounts to:

^{1}In the experiments of this paper, a value of $\beta$ found with only minor tuning in preliminary experiments on a subset of the training and validation data proved to be effective in general.
$\mathbf{z}^{\ell-1} \leftarrow \mathbf{z}^{\ell-1} - \beta (E^\ell \cdot \mathbf{e}^\ell), \quad \mathbf{y}^{\ell-1} = \phi^{\ell-1}(\mathbf{z}^{\ell-1})$   (5)
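The target computation of Equation 5 can be sketched directly; the layer sizes, weight scale, and the value of the modulation factor here are illustrative choices, not the paper's settings:

```python
import numpy as np

def lra_target(z_below, E, e_above, beta, phi=np.tanh):
    """LRA target for the layer below: shift the pre-activation
    against the projected error signal, then apply the nonlinearity."""
    z_new = z_below - beta * (E @ e_above)
    return phi(z_new)

rng = np.random.default_rng(1)
z_below = rng.normal(size=256)                 # pre-activation of layer below
E = rng.normal(0.0, 0.05, size=(256, 10))      # error weights (illustrative shape)
e_above = rng.normal(size=10)                  # error units of the layer above
y_below = lra_target(z_below, E, e_above, beta=0.1)
```

When the error units above are zero, the target coincides with the layer's own post-activation, so a perfectly predicting layer receives no corrective pressure.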
Once the targets for each layer have been found, we can then use the local loss to compute updates to the weights $W^\ell$ and the corresponding error weights $E^\ell$.^{2} The update calculation for any given parameter at layer $\ell$ would be:

^{2}Except for the very bottom set of forward weights, $W^1$, for which there are no corresponding error weights.
$\Delta W^\ell = (\mathbf{e}^\ell \odot \phi'(\mathbf{z}^\ell)) \cdot (\mathbf{h}^{\ell-1})^{\mathsf{T}}, \quad \Delta E^\ell = -\gamma (\Delta W^\ell)^{\mathsf{T}}$   (6)

$\Delta W^\ell = \mathbf{e}^\ell \cdot (\mathbf{h}^{\ell-1})^{\mathsf{T}}, \quad \Delta E^\ell = -\gamma (\Delta W^\ell)^{\mathsf{T}}$   (7)
where $\odot$ indicates the Hadamard product and $\gamma$ is a decay factor (a value that we found should be set to less than 1) meant to ensure that the error weights change more slowly than the forward weights. Note that the second variation of the update rule does not require $\phi'(\cdot)$, which makes it particularly attractive in that it does not require the first derivative of the activation function, thus permitting the use of discrete and stochastic operations. The update for each set of error weights is simply proportional to the negative transpose of the update computed for its matching forward weights, which is a computationally fast and cheap rule we propose inspired by [35].
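Both update variants, together with the error-weight rule, can be sketched for one layer as follows; the tanh derivative, shapes, and the value of the decay factor are illustrative assumptions:

```python
import numpy as np

def dtanh(z):
    # Derivative of tanh, needed only by the first update variant.
    return 1.0 - np.tanh(z) ** 2

def lra_updates(e, z, h_below, gamma=0.9, use_derivative=True):
    """Per-layer LRA updates. With use_derivative=True this follows
    the variant that gates the error by phi'(z); with False it is the
    derivative-free variant (a plain outer product). The error-weight
    update is the negative, decayed transpose of the forward update."""
    if use_derivative:
        dW = (e * dtanh(z))[:, None] @ h_below[None, :]
    else:
        dW = e[:, None] @ h_below[None, :]
    dE = -gamma * dW.T
    return dW, dE

rng = np.random.default_rng(2)
e, z, h_below = rng.normal(size=10), rng.normal(size=10), rng.normal(size=256)
dW, dE = lra_updates(e, z, h_below)
```

The derivative-free variant is what makes discrete or stochastic activations usable, since no gradient of the nonlinearity ever has to be evaluated.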
In Algorithm 2, we show how the equations above, which constitute LRA, are applied to a 3-layer MLP, assuming Gaussian local loss functions and their respective error units. This means $\phi^\ell(\mathbf{v}) = \tanh(\mathbf{v})$ and the model is defined by the parameters $\Theta = \{W^1, W^2, W^3, E^2, E^3\}$ (biases are omitted for clarity). We will refer to this algorithm as LRA-E.
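A hedged NumPy sketch of one LRA-E training step for this 3-layer MLP (Gaussian local losses with unit variance, tanh hidden units, softmax output, derivative-free updates); the layer sizes, $\beta$, $\gamma$, and the learning rate below are illustrative choices, not the paper's settings:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def lra_e_step(x, y, W, E, beta=0.1, gamma=0.9, lr=0.01):
    """One LRA-E step. W = [W1, W2, W3] forward weights;
    E = [E2, E3] error weights (none for the bottom layer)."""
    # Forward pass (Equation 1).
    z1 = W[0] @ x;  h1 = np.tanh(z1)
    z2 = W[1] @ h1; h2 = np.tanh(z2)
    z3 = W[2] @ h2; h3 = softmax(z3)

    # Output error units, then error-driven targets (Equation 5).
    e3 = h3 - y                               # softmax + NLL error
    y2 = np.tanh(z2 - beta * (E[1] @ e3))     # target for layer 2
    e2 = h2 - y2                              # Gaussian error units
    y1 = np.tanh(z1 - beta * (E[0] @ e2))     # target for layer 1
    e1 = h1 - y1

    # Derivative-free updates (Equation 7 variant).
    dW = [np.outer(e1, x), np.outer(e2, h1), np.outer(e3, h2)]
    for i in range(3):
        W[i] -= lr * dW[i]
    for i, j in [(0, 1), (1, 2)]:             # E2 pairs with W2, E3 with W3
        E[i] -= lr * (-gamma * dW[j].T)
    return W, E

rng = np.random.default_rng(3)
sizes = [(8, 16), (16, 16), (16, 4)]          # toy 8 -> 16 -> 16 -> 4 network
W = [rng.normal(0, 0.1, size=(o, i)) for i, o in sizes]
E = [rng.normal(0, 0.1, size=(16, 16)), rng.normal(0, 0.1, size=(16, 4))]
x, y = rng.normal(size=8), np.eye(4)[0]
W, E = lra_e_step(x, y, W, E)
```

Note the structure of the loop: targets are generated top-down through the error weights, while every weight update uses only quantities local to its layer.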
With a local loss assigned to each hidden layer, we can measure our neural model's total internal discrepancy for a given data point as a simple linear combination of all of the internal local losses. Figure 1(b) shows the 3-layer MLP example developed in this section (256 units per hidden layer), trained by stochastic gradient descent (SGD) with mini-batches of 50 image samples, over the first 20 epochs of learning using a categorical output loss and two Gaussian local losses. While the output loss continues to decrease, the total discrepancy does not always appear to do so, especially in the earlier part of learning. However, since each layer will try to minimize the mismatch between itself and a target value, any fluxes, or local loss values that actually increase instead of decrease and might raise the total discrepancy, will be taken care of later as the model starts generating better targets. The hope is that, so long as the angles of the updates computed by LRA are within 90 degrees of the updates obtained by backpropagation of errors, LRA will move parameters in the same general direction as backpropagation, which greedily points in the direction of steepest descent, and still find reasonably good local optima. In Figure 1(a), this does indeed appear to be the case: for the 3-layer MLP trained for illustrative purposes in this section, we compare the updates calculated by LRA-E with those given by backpropagation after each mini-batch. The angle, while certainly nonzero, fortunately never deviates too far from the direction pointed to by backpropagation (at most 11 degrees) and remains relatively stable throughout the learning process.
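The angle comparison of Figure 1(a) reduces to flattening both sets of updates into single direction vectors and measuring the angle between them; a sketch (how the per-mini-batch updates are collected is assumed):

```python
import numpy as np

def update_angle(updates_a, updates_b):
    """Angle in degrees between two lists of update matrices,
    after flattening each list into one direction vector."""
    a = np.concatenate([u.ravel() for u in updates_a])
    b = np.concatenate([u.ravel() for u in updates_b])
    cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip to guard against floating-point values just outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Sanity checks on toy updates: identical directions give 0 degrees,
# orthogonal directions give 90, and opposite directions give 180.
same = update_angle([np.array([1.0, 0.0])], [np.array([2.0, 0.0])])
orth = update_angle([np.array([1.0, 0.0])], [np.array([0.0, 3.0])])
opp = update_angle([np.array([1.0, 0.0])], [np.array([-1.0, 0.0])])
```

An angle below 90 degrees means the two updates share a descent direction, which is the condition the text appeals to.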
2.2 Improving Difference Target Propagation
As mentioned earlier, Difference Target Propagation (DTP) (and also, less directly, recirculation [11, 27]), like LRA-E, falls under the same family of algorithms concerned with minimizing internal discrepancy, as shown in [30, 28]. However, DTP takes a very different approach to computing alignment targets than LRA-E does: instead of transmitting messages through error units and error feedback weights as in LRA [30], DTP employs feedback weights that learn the inverse of the mapping created by the feedforward weights. However, [28] showed that DTP struggles to assign good local targets as the network gets deeper and thus more highly nonlinear, facing an initially positive but brief phase in learning where generalization error decreases (within the first few epochs) before ultimately collapsing (unless very specific initializations are used). One potential cause of this failure could be the lack of a strong enough mechanism to globally coordinate the local learning problems created by the encoder-decoder pairs that define the system. In particular, we hypothesize that this problem might be coming from the noise injection scheme, which is local and fixed, offering no adaptation to each specific layer and making some of the layer-wise optimization problems more difficult than necessary. Here, we will aim to remove this potential cause through an adaptive layer-wise corruption scheme.
Assuming we have a target $\mathbf{y}^\ell$ calculated from above, we consider the forward weights $W^\ell$ connecting layer $\ell-1$ to layer $\ell$ and the decoding weights $V^\ell$ that define the inverse mapping between the two. The first forward propagation step is the same as in Equation 1. In contrast to LRA-E's error-driven way of computing targets, we consider each pair of neuronal layers, $(\mathbf{h}^{\ell-1}, \mathbf{h}^\ell)$, as forming a particular type of encoding/decoding cycle that will be used in computing layer-wise targets. To calculate the target $\mathbf{y}^{\ell-1}$, we update the original post-activation $\mathbf{h}^{\ell-1}$ using a linear combination of two applications of the decoding weights, as follows:
$\mathbf{y}^{\ell-1} = \mathbf{h}^{\ell-1} - g^\ell(\mathbf{h}^\ell) + g^\ell(\mathbf{y}^\ell), \quad \text{where } g^\ell(\mathbf{v}) = \phi(V^\ell \cdot \mathbf{v})$   (8)
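Equation 8 can be sketched directly; the shapes and the tanh decoder are illustrative:

```python
import numpy as np

def dtp_target(h_below, h_above, y_above, V, phi=np.tanh):
    """Difference target propagation target for the layer below:
    y^(l-1) = h^(l-1) - g(h^l) + g(y^l), with decoder g(v) = phi(V v)."""
    g = lambda v: phi(V @ v)
    return h_below - g(h_above) + g(y_above)

rng = np.random.default_rng(4)
h_below = rng.normal(size=16)          # post-activation of layer below
h_above = rng.normal(size=8)           # post-activation of layer above
y_above = rng.normal(size=8)           # target handed down from above
V = rng.normal(0, 0.1, size=(16, 8))   # decoding (inverse-mapping) weights
t = dtp_target(h_below, h_above, y_above, V)
```

The difference correction has a useful fixed point: if the layer above already matches its target, the two decodings cancel and the layer below is told to stay where it is.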
where we see that we decode two times, once from the original post-activation $\mathbf{h}^\ell$ calculated during the feedforward pass of the MLP, and once from the target value $\mathbf{y}^\ell$ generated by the encoding/decoding process of the layer pair above. The result $\mathbf{y}^{\ell-1}$ will serve as the target for training the forward weights of the layer below. Throughout the experiments with our improved DTP, we multiply the top-layer target displacement by the small fixed constant 0.01 rather than by the learning rate. To train the inverse-mapping weights $V^\ell$, as required by the originally proposed version of DTP, zero-mean Gaussian noise $\epsilon$, with fixed standard deviation $\sigma$, is injected into $\mathbf{h}^{\ell-1}$, followed by re-running the encoder and the decoder on this new corrupted activation vector. Formally, this is defined as:
$\tilde{\mathbf{h}}^{\ell-1} = \mathbf{h}^{\ell-1} + \epsilon, \ \epsilon \sim \mathcal{N}(0, \sigma^2), \quad \mathcal{L}\big(\tilde{\mathbf{h}}^{\ell-1}, g^\ell(f^\ell(\tilde{\mathbf{h}}^{\ell-1}))\big), \ \text{where } f^\ell(\mathbf{v}) = \phi(W^\ell \cdot \mathbf{v})$   (9)
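A sketch of the decoder-training signal in Equation 9: corrupt the lower post-activation, run it through the encoder/decoder pair, and score the reconstruction against the corrupted vector (the shapes, $\sigma$, and the tanh maps are illustrative):

```python
import numpy as np

def inverse_loss(h_below, W, V, sigma, rng, phi=np.tanh):
    """Reconstruction loss used to train the decoder V:
    inject Gaussian noise into h^(l-1), encode with f, decode with g,
    and penalize the distance back to the corrupted vector."""
    h_tilde = h_below + rng.normal(0.0, sigma, size=h_below.shape)
    f = lambda v: phi(W @ v)   # encoder (forward weights)
    g = lambda v: phi(V @ v)   # decoder (inverse-mapping weights)
    recon = g(f(h_tilde))
    return 0.5 * np.sum((recon - h_tilde) ** 2)

rng = np.random.default_rng(5)
W = rng.normal(0, 0.1, size=(8, 16))   # encoder weights
V = rng.normal(0, 0.1, size=(16, 8))   # decoder weights
loss = inverse_loss(rng.normal(size=16), W, V, sigma=0.1, rng=rng)
```

In the original DTP, $\sigma$ is a single fixed constant; the adaptive scheme proposed below replaces it with a layer-dependent value.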
We will refer to this process as DTP. In our proposed, improved variation of DTP, or DTP-$\sigma$, we take an adaptive approach to the noise injection process. To develop our "adaptive" noise scheme, we have taken some insights from studies of biological neuronal systems, which show that there are different levels of variability at different neuronal layers [6, 45, 44, 40]. It has been argued that this noise variability enhances neurons' overall ability to detect and transmit signals across a system [41, 16, 40] and, furthermore, that the presence of noise yields more robust representations [5, 40, 7]. There is also biological evidence demonstrating an increase in the noise level across successive groups of neurons, which is thought to help in local neural computation [40, 38, 17].
The standard deviation of the noise process should be a function of the noise across layers, and an interesting way in which we implemented this was to make $\sigma^\ell$ (the standard deviation of the noise injection at layer $\ell$) a function of local loss measurements. At the top layer, we can set the standard deviation to be $\beta$ (a setting that worked well in our experiments), or, rather, equal to the step-size used to compute the topmost target (when differentiating the output loss with respect to $\mathbf{h}^L$). The standard deviation for the layers below would then be a function of where the layer sits within the network. This means that:
(10) 
noting that the local loss chosen for DTP-$\sigma$ is a Gaussian loss (but with the input arguments flipped: the target value is now the corrupted initial encoding and the prediction is the clean, original encoding).
The updates to the weights are calculated by differentiating each local loss with respect to the appropriate encoder weights $W^\ell$, or with respect to the decoder synaptic weights $V^\ell$. Note that the order of the input arguments to each loss function for these two partial derivatives is important, in keeping with the original paper in which DTP was proposed [19], in order to obtain the correct sign by which to multiply the gradients.
As we will see in our experimental results, DTP-$\sigma$ is a much more stable learning algorithm, especially when training deeper and wider networks. DTP-$\sigma$ benefits from a stronger form of global coordination among its internal encoding/decoding subproblems through the pairwise comparison of local loss values that drives the hidden layer corruption.
2.3 A Comment on the Efficiency of LRA-E and DTP
It should be noted that LRA-E is, in general, faster than DTP in calculating targets. Specifically, if we just focus on the matrix multiplications within an MLP, which take up the bulk of the computation underlying both processes, LRA-E requires fewer matrix multiplications per layer than DTP (and our proposed DTP-$\sigma$). In particular, the bulk of DTP's expense comes from its approach to computing the targets for the hidden layers, since it requires 2 applications of the encoder parameters (1 of these comes from the initial feedforward pass through the network) and 3 applications of the decoder parameters in order to properly generate the targets needed to train the forward weights and the inverse-mapping weights.
3 Experimental Results
In this section, we present experimental results of training MLPs using a variety of learning algorithms.
MNIST: This dataset^{3} contains images with grayscale pixel feature values in the range [0, 255]. The only preprocessing applied to this data is to normalize the feature values to the range [0, 1], done by dividing the pixel values by 255.

^{3}Publicly available at the URL: http://yann.lecun.com/exdb/mnist/.
Fashion MNIST: This database [46] contains greyscale images of clothing items, meant to serve as a much more difficult drop-in replacement for MNIST itself. The training split contains 60,000 samples and the testing split contains 10,000; each image is associated with one of 10 classes. We create a validation set of 2,000 samples from the training split. Preprocessing was the same as for MNIST.
For both datasets and all models, over 100 epochs, we calculated updates over mini-batches of 50 samples. Unless internal to the algorithm itself, we did not regularize parameters any further, such as through dropout [42] or penalties placed on the weights. All feedforward architectures for all experiments had either 3, 5, or 8 hidden layers of processing elements. The post-activation function used was simply the hyperbolic tangent, and the top layer was chosen to be a maximum-entropy classifier (employing the softmax function). The output layer objective for all algorithms was to minimize the categorical negative log likelihood.
Parameters were initialized using the scheme that gave the best performance on the validation split of each dataset on a per-algorithm basis. Though we wanted to use very simple initialization schemes for all algorithms, in preliminary experiments we found that the feedback alignment algorithms, as well as Difference Target Propagation (including our improved version of it), worked best when using a uniform fan-in-fan-out scheme [8]. [28] confirms this result, originally showing how these algorithms are often unstable or fail to perform well when using a simple initialization based on the uniform or Gaussian distributions. For Local Representation Alignment, however, we simply initialized the parameters using a zero-mean Gaussian distribution with a fixed variance.
[Table 1: Training and test error on MNIST for Backprop, EquilProp, RFA, DFA, DTP, DTP-$\sigma$ (ours), and LRA-E (ours) across the 3-, 5-, and 8-layer MLPs.]
[Table 2: Training and test error on Fashion MNIST for Backprop, EquilProp, RFA, DFA, DTP, DTP-$\sigma$ (ours), and LRA-E (ours) across the 3-, 5-, and 8-layer MLPs.]
The choice of update rule was also somewhat dependent on the learning algorithm employed. Again, as shown in [28], it is difficult to get good, stable performance from algorithms such as the original DTP when using simple SGD. As done in [20], we used the RMSprop adaptive learning rate. For Backprop, RFA, DFA, and LRA-E, SGD was used.
3.1 Supervised Learning Experiments
[Table 3: Generalization performance of LRA-E on MNIST and Fashion MNIST under different parameter update rules.]
In this experiment, we compare all of the algorithms discussed earlier. These include backpropagation of errors (Backprop), Random Feedback Alignment (RFA) [22], Direct Feedback Alignment (DFA) [25], Equilibrium Propagation (EquilProp) [39], and the original Difference Target Propagation (DTP) [19]. Our algorithms include our proposed, improved version of DTP (DTP-$\sigma$) and the proposed error-driven Local Representation Alignment (LRA-E).
The results of our experiments are presented in Tables 1 and 2. Test and training scores are reported for the set of model parameters that had the lowest validation error. Observe that LRA-E is the most stable and consistently well-performing algorithm compared to the other biologically-motivated backprop alternatives, closely followed by our improved variant of DTP. More importantly, note that algorithms like EquilProp and DTP appear to break down when training deeper networks, such as the 8-layer MLP. While DTP was previously used to successfully train a 7-layer network of 240 units per layer (using RMSprop), we followed the same settings reported for that deeper network and in our experiments uncovered that the algorithm begins to struggle as the layers are made wider, starting even as soon as the width of 256 we experimented with in this paper. However, this problem is rectified by our variant of DTP, leading to much more stable performance; in some cases the algorithm even completely overfits the training set (as in the case of 3 and 5 layers for MNIST). Nonetheless, LRA-E still performs the best with respect to generalization across both datasets, despite using a vastly simpler parameter update rule and a naive initialization scheme. Table 3 shows the results of using an update rule other than SGD for LRA-E, such as Adam [15] or RMSprop, for the 3-layer MLP, with the same global step size for both algorithms. We see that LRA-E is not only compatible with other learning rate schemes but reaches yet better generalization performance when using them, e.g. Adam.
Figure 2 displays a t-SNE [24] visualization of the topmost hidden layer of a learned 5-layer MLP trained with DFA, EquilProp, DTP, or LRA-E on the Fashion MNIST test set. Qualitatively, we see that all of the learning algorithms extract representations that separate out the data points reasonably well, at least in the sense that the points are clustered based on clothing type. However, it appears that LRA-E yields more strongly separated clusters, as evidenced by the somewhat wider gaps between the clusters, especially the gaps around the pink, blue, and black colored clusters.
Finally, DTP, as also mentioned in [28], appears to be quite sensitive to its initialization scheme. For both MNIST and Fashion MNIST, we trained DTP and our improved variant DTP-$\sigma$ with three different settings: random orthogonal, fan-in-fan-out, and simple zero-mean Gaussian initializations. Figure 3 shows the validation accuracy curves of DTP and DTP-$\sigma$ as a function of epoch for the 5- and 8-layer networks (3-layer networks can be found in the appendix) with the various weight initializations, namely Gaussian (G), orthogonal (ortho), and Xavier/Glorot. As shown in the figure, DTP is highly unstable as the network gets deeper, while DTP-$\sigma$ appears to be far less dependent on the weight initialization scheme. Thus, our experiments show some promising evidence of DTP-$\sigma$'s generalization improvement over the original DTP. Moreover, as the test performance in Tables 1 and 2 indicates, the targets found by DTP-$\sigma$ can be nearly as good as those found by the error units in LRA-E.
4 Conclusions
In this paper, we proposed two learning algorithms: error-driven Local Representation Alignment and adaptive noise Difference Target Propagation. On two classification benchmarks, we showed strong positive results when training deep multilayer perceptrons. Future work includes investigating how these algorithms fare on much larger-scale tasks and adapting them to problems where labeled data is scarce or the data has a temporal dimension to it.
References
 [1] Balduzzi, D., Vanchinathan, H., and Buhmann, J. M. Kickback cuts backprop’s redtape: Biologically plausible credit assignment in neural networks. In AAAI (2015), pp. 485–491.
 [2] Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H., et al. Greedy layerwise training of deep networks. Advances in neural information processing systems 19 (2007), 153.
 [3] Bengio, Y., Lee, D.H., Bornschein, J., Mesnard, T., and Lin, Z. Towards biologically plausible deep learning. arXiv preprint arXiv:1502.04156 (2015).
 [4] Clark, A. Whatever next? predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences 36, 3 (2013), 181–204.
 [5] Cordo, P., Inglis, J. T., Verschueren, S., Collins, J. J., Merfeld, D. M., Rosenblum, S., Buckley, S., and Moss, F. Noise in human muscle spindles. Nature 383, 6603 (Oct 1996), 769–770.
 [6] Adrian, E. D., and Zotterman, Y. The impulses produced by sensory nerve-endings. The Journal of Physiology 61, 2, 151–171.
 [7] Faisal, A. A., Selen, L. P., and Wolpert, D. M. Noise in the nervous system. Nat. Rev. Neurosci. 9, 4 (Apr 2008), 292–303.
 [8] Glorot, X., and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (2010), pp. 249–256.
 [9] Grossberg, S. How does a brain build a cognitive code? In Studies of mind and brain. Springer, 1982, pp. 1–52.
 [10] Grossberg, S. Competitive learning: From interactive activation to adaptive resonance. Cognitive Science 11, 1 (1987), 23 – 63.
 [11] Hinton, G. E., and McClelland, J. L. Learning representations by recirculation. In Neural information processing systems (1988), pp. 358–366.
 [12] Huang, Y., and Rao, R. P. Predictive coding. Wiley Interdisciplinary Reviews: Cognitive Science 2, 5 (2011), 580–593.
 [13] Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., and Kavukcuoglu, K. Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343 (2016).
 [14] Kavukcuoglu, K., Ranzato, M., and LeCun, Y. Fast inference in sparse coding algorithms with applications to object recognition. arXiv preprint arXiv:1010.3467 (2010).
 [15] Kingma, D., and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
 [16] Kruglikov, I. L., and Dertinger, H. Stochastic resonance as a possible mechanism of amplification of weak electric signals in living cells. Bioelectromagnetics 15, 6 (1994), 539–547.
 [17] Laughlin, S. B., de Ruyter van Steveninck, R. R., and Anderson, J. C. The metabolic cost of neural information. Nat. Neurosci. 1, 1 (May 1998), 36–41.
 [18] Lee, C.Y., Xie, S., Gallagher, P., Zhang, Z., and Tu, Z. DeeplySupervised Nets. arXiv:1409.5185 [cs, stat] (2014).
 [19] Lee, D.H., Zhang, S., Fischer, A., and Bengio, Y. Difference target propagation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (2015), Springer, pp. 498–515.
 [20] Lee, D.H., Zhang, S., Fischer, A., and Bengio, Y. Difference target propagation. In Proceedings of the 2015th European Conference on Machine Learning and Knowledge Discovery in Databases  Volume Part I (Switzerland, 2015), ECMLPKDD’15, Springer, pp. 498–515.
 [21] Liao, Q., Leibo, J. Z., and Poggio, T. A. How important is weight symmetry in backpropagation? In AAAI (2016), pp. 1837–1844.
 [22] Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. Random feedback weights support learning in deep neural networks. arXiv preprint arXiv:1411.0247 (2014).
 [23] Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. Random synaptic feedback weights support error backpropagation for deep learning. Nature communications 7 (2016), 13276.
 [24] Maaten, L. v. d., and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579–2605.
 [25] Nøkland, A. Direct feedback alignment provides learning in deep neural networks. In Advances in Neural Information Processing Systems (2016), pp. 1037–1045.
 [26] Olshausen, B. A., and Field, D. J. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research 37, 23 (1997), 3311–3325.
 [27] O’Reilly, R. C. Biologically plausible errordriven learning using local activation differences: The generalized recirculation algorithm. Neural computation 8, 5 (1996), 895–938.
 [28] Ororbia, A. G., Mali, A., Kifer, D., and Giles, C. L. Conducting credit assignment by aligning local representations. arXiv preprint arXiv:1803.01834 (2018).
 [29] Ororbia II, A. G., Giles, C. L., and Reitter, D. Online semisupervised learning with deep hybrid boltzmann machines and denoising autoencoders. arXiv preprint arXiv:1511.06964 (2015).
 [30] Ororbia II, A. G., Haffner, P., Reitter, D., and Giles, C. L. Learning to adapt by minimizing discrepancy. arXiv preprint arXiv:1711.11542 (2017).
 [31] Ororbia II, A. G., Kifer, D., and Giles, C. L. Unifying adversarial training algorithms with data gradient regularization. Neural computation 29, 4 (2017), 867–887.
 [32] Ororbia II, A. G., Reitter, D., Wu, J., and Giles, C. L. Online learning of deep hybrid architectures for semisupervised categorization. In Machine Learning and Knowledge Discovery in Databases (Proceedings, ECML PKDD 2015), vol. 9284 of Lecture Notes in Computer Science. Springer, Porto, Portugal, 2015, pp. 516–532.
 [33] Panichello, M., Cheung, O., and Bar, M. Predictive feedback and conscious visual experience. Frontiers in Psychology 3 (2013), 620.
 [34] Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning (2013), pp. 1310–1318.
 [35] Rao, R. P., and Ballard, D. H. Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural computation 9, 4 (1997), 721–763.
 [36] Rao, R. P., and Ballard, D. H. Predictive coding in the visual cortex: a functional interpretation of some extraclassical receptivefield effects. Nature neuroscience 2, 1 (1999).
 [37] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Neurocomputing: Foundations of research. MIT Press, Cambridge, MA, USA, 1988, ch. Learning Representations by Backpropagating Errors, pp. 696–699.
 [38] Sarpeshkar, R. Analog versus digital: extrapolating from electronics to neurobiology. Neural Comput 10, 7 (Oct 1998), 1601–1638.
 [39] Scellier, B., and Bengio, Y. Equilibrium propagation: Bridging the gap between energybased models and backpropagation. Frontiers in computational neuroscience 11 (2017).
 [40] Shadlen, M. N., and Newsome, W. T. The variable discharge of cortical neurons: implications for connectivity, computation, and information coding. J. Neurosci. 18, 10 (May 1998), 3870–3896.
 [41] Shu, Y., Hasenstaub, A., Badoual, M., Bal, T., and McCormick, D. A. Barrages of synaptic activity control the gain and sensitivity of cortical neurons. J. Neurosci. 23, 32 (Nov 2003), 10388–10401.
 [42] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
 [43] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013).
 [44] Tolhurst, D., Movshon, J., and Dean, A. The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Research 23, 8 (1983), 775 – 785.
 [45] Tomko, G. J., and Crapper, D. R. Neuronal variability: nonstationary responses to identical visual stimuli. Brain Research 79, 3 (1974), 405 – 418.
 [46] Xiao, H., Rasul, K., and Vollgraf, R. Fashionmnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).