Learning to solve the credit assignment problem
Abstract
Backpropagation is driving today's artificial neural networks (ANNs). However, despite extensive research, it remains unclear whether the brain implements this algorithm. Among neuroscientists, reinforcement learning (RL) algorithms are often seen as a realistic alternative: neurons can randomly introduce change, and use unspecific feedback signals to observe their effect on the cost and thus approximate their gradient. However, the convergence rate of such learning scales poorly with the number of neurons involved. Here we propose a hybrid learning approach. Each neuron uses an RL-type strategy to learn how to approximate the gradients that backpropagation would provide – in this way it learns to learn. We prove that our approach converges to the true gradient for certain classes of networks. In both feedforward and recurrent networks, we empirically show that our approach learns to approximate the gradient, and can match the performance of gradient-based learning. Learning to learn provides a biologically plausible mechanism for achieving good performance, without the need for precise, pre-specified learning rules.
Learning to solve the credit assignment problem
Benjamin James Lansdell Department of Bioengineering University of Pennsylvania Pennsylvania, PA 19104 lansdell@seas.upenn.edu Prashanth Prakash Department of Bioengineering University of Pennsylvania Pennsylvania, PA 19104 Konrad Paul Kording Department of Bioengineering University of Pennsylvania Pennsylvania, PA 19104
Preprint. Under review.
1 Introduction
It is unknown how the brain solves the credit assignment problem when learning: how does each neuron know its role in a positive (or negative) outcome, and thus know how to change its activity to perform better next time? Actions are rarely immediately rewarded (or punished), so each neuron must further determine which of a potential series of its actions is responsible for ultimate reward. This is a challenge for models of learning in the brain.
Biologically plausible solutions to credit assignment include those based on reinforcement learning (RL) and reward-modulated STDP [Bouvier2016; Fiete2007; Fiete2006; Legenstein2010; Miconi2017]. In these approaches a globally distributed reward signal provides feedback to all neurons in a network. Essentially, changes in reward from a baseline, or expected, level are correlated with noise in neural activity, allowing a stochastic approximation of the gradient to be computed. However, these methods have not been demonstrated to operate at scale. For instance, the variance of the REINFORCE estimator [Williams1992] scales with the number of units in the network [Rezende2014]. This drives the hypothesis that learning in the brain must rely on additional structures beyond a global reward signal.
In artificial neural networks (ANNs), credit assignment is performed with gradient-based methods computed through backpropagation [Rumelhart1986]. This is significantly more efficient than RL-based algorithms, with ANNs now matching or surpassing human-level performance in a number of domains [Mnih2015io; Silver2017hp; LeCun2015yo; He2015oe; Haenssle2018nj; Russakovsky2015hw]. However, there are well-known problems with implementing backpropagation in biologically realistic neural networks. One problem is known as weight transport: an exact implementation of backpropagation requires a feedback structure with the same weights as the feedforward network to communicate gradients. Such a symmetric feedback structure has not been observed in neural circuits. A further problem, particularly in recurrent neural networks (RNNs), is that the temporal trace of each neuron's activity must somehow be stored by the network until the backward pass occurs (though eligibility traces may be able to address this issue to some extent [Gerstner2018; Lehmann2017]). Despite these issues, backpropagation is the only method known to solve supervised and reinforcement learning problems at scale. Thus modifications or approximations of backpropagation that are more plausible have been the focus of significant recent attention [Scellier2016; Lillicrap2016; Lee2015a; Lansdell2018a].
These efforts do show some ways forward. Synthetic gradients demonstrate that learning can be based on approximate gradients, and need not be temporally locked [Jaderberg2016; Czarnecki2017]. In small feedforward networks, somewhat surprisingly, fixed random feedback matrices in fact suffice for learning [Lillicrap2016] (a phenomenon known as feedback alignment). But issues remain: feedback alignment does not work in RNNs, in very deep networks, or in networks with tight bottleneck layers. Regardless, these results show that rough approximations of a gradient signal can be used to learn, and suggest that even relatively inefficient methods of approximating the gradient may be good enough.
On this basis, here we propose an RL algorithm to train a feedback system to enable learning. Recent work has explored similar ideas, but not with the explicit goal of approximating backpropagation [Miconi2017; Miconi2018; Song2017]. RL-based methods like REINFORCE may be inefficient when used as a base learner, but they may be sufficient when used to train a system that itself instructs a base learner. We propose to use a REINFORCE-style perturbation approach to train a feedback signal to approximate what would have been provided by backpropagation. Our system learns to learn.
Learning to learn is often framed as a two-learner system: one system that updates a network's weights, and another system that modifies the learner to update those weights more efficiently [Lansdell2018]. A two-learner system may in fact align well with cortical neuron physiology. For instance, the dendritic trees of pyramidal neurons consist of an apical and a basal component [Guergiuev2017; Kording2001]. Similarly, climbing fibers and Purkinje cells may define a learner/teacher system in the cerebellum [Marr1969]. These components allow for independent integration of two different signals. Indeed, such a setup has been shown to support supervised learning in feedforward networks [Guergiuev2017; Kording2001]. Learning to learn may thus provide a realistic solution to the credit assignment problem.
Here we implement a system that learns to use feedback signals trained with reinforcement learning via a global reward signal. This provides a plausible account of how the brain may perform deep learning. We mathematically analyze the model, and compare its capabilities to other biologically plausible accounts of learning in ANNs. We prove consistency of the estimator in particular cases, extending the few theoretical results available on synthetic gradients [Jaderberg2016; Czarnecki2017]. We demonstrate that our synthetic gradient model learns as well as regular backpropagation in small models, overcomes the limitations of feedback alignment on more complicated feedforward networks, and can be utilized in recurrent networks. Thus our method may provide an account of how the brain performs gradient descent learning.
2 Learning to learn through perturbations
We use the following notation. Let $x \in \mathbb{R}^m$ represent an input vector. Let an $N$ hidden-layer network be given by $\hat{y} = f(x)$. This is composed of a set of layer-wise summations and nonlinear activations
$$h_i = \sigma(a_i), \qquad a_i = W_i h_{i-1},$$
for hidden-layer states $h_i$, nonlinearity $\sigma$, weight matrices $W_i$, and denoting $h_0 = x$ and $h_{N+1} = \hat{y}$. Some loss function $L$ is defined in terms of the network output: $L(y, \hat{y})$. Let $\mathcal{L}$ denote the loss as a function of $(x, y)$: $\mathcal{L}(x, y) = L\big(y, f(x)\big)$. Let data $(x, y)$ be drawn from a distribution $\rho$. Then we aim to minimize:
$$\mathbb{E}_{(x,y)\sim\rho}\big[\mathcal{L}(x, y)\big].$$
Backpropagation relies on the error signal $e_i = \partial\mathcal{L}/\partial a_i$, computed in a top-down fashion:
$$e_{N+1} = \frac{\partial L}{\partial \hat{y}} \circ \sigma'(a_{N+1}), \qquad e_i = \big(W_{i+1}^T e_{i+1}\big) \circ \sigma'(a_i), \quad 1 \le i \le N.$$
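As a concrete reference point, the forward pass and the top-down error recursion can be sketched as follows. This is a minimal NumPy sketch; the layer sizes, tanh nonlinearity, and MSE loss are illustrative assumptions, not the paper's experimental settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(a):
    # tanh nonlinearity (illustrative choice)
    return np.tanh(a)

def sigma_prime(a):
    return 1.0 - np.tanh(a) ** 2

# Small network; sizes are arbitrary for illustration
sizes = [4, 5, 3, 2]
W = [rng.normal(0, 0.5, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x):
    """Return pre-activations a_i and activations h_i (with h[0] = x)."""
    h, a = [x], []
    for Wi in W:
        a.append(Wi @ h[-1])
        h.append(sigma(a[-1]))
    return a, h

def backprop_errors(a, h, y):
    """Top-down error signals e_i = dL/da_i for the MSE loss 0.5 * ||h_out - y||^2."""
    e = [None] * len(W)
    e[-1] = (h[-1] - y) * sigma_prime(a[-1])                 # output layer
    for i in reversed(range(len(W) - 1)):
        e[i] = (W[i + 1].T @ e[i + 1]) * sigma_prime(a[i])   # recursion
    return e

x, y = rng.normal(size=4), rng.normal(size=2)
a, h = forward(x)
e = backprop_errors(a, h, y)   # the gradient for layer i's weights is then e[i] h[i]^T
```

A finite-difference check on any single weight confirms that `e[i]` times the layer input reproduces the loss gradient.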
2.1 Basic setup
Let the loss gradient term be denoted as
$$e_i = \frac{\partial \mathcal{L}}{\partial a_i}.$$
In this work we replace $e_i$ with an approximation with its own parameters to be learned (known as a synthetic gradient [Jaderberg2016; Czarnecki2017], or error critic [Werbos1992]):
$$e_i \approx g\big(h_{i-1}, h_i, e_{i+1}; \theta\big),$$
for parameters $\theta$. This setup can accommodate both top-down and bottom-up information, and encompasses a number of published models [Jaderberg2016; Czarnecki2017; Lillicrap2016; Nokland2016; Liao2015; Xiao2018].
2.2 Stochastic networks and gradient descent
To learn a synthetic gradient we use stochasticity inherent to biological neural networks. A number of biologically plausible learning rules exploit random perturbations in neural activity [Xie2004; Seung2003; Fiete2006; Fiete2007; Song2017]. Here, at each time step, each unit produces a noisy response:
$$\tilde{h}_i = \sigma(\tilde{a}_i), \qquad \tilde{a}_i = W_i \tilde{h}_{i-1} + \xi_i, \qquad \xi_i \sim \mathcal{N}(0, c^2 I),$$
for independent Gaussian noise $\xi_i$ with standard deviation $c$. This generates a noisy loss $\tilde{\mathcal{L}}$ and a baseline loss $\mathcal{L}$, computed from the same network without noise. We will use the noisy response to estimate gradients that then allow us to optimize the baseline – the gradients used for weight updates are computed using the deterministic baseline.
2.3 Synthetic gradients via node perturbation
For Gaussian white noise, the well-known REINFORCE algorithm [Williams1992] coincides with the node-perturbation method [Fiete2006; Fiete2007]. Node perturbation works by linearizing the loss:
$$\tilde{\mathcal{L}} \approx \mathcal{L} + \sum_{j=1}^{N+1} e_j^T \xi_j, \qquad (1)$$
such that
$$\mathbb{E}\big[(\tilde{\mathcal{L}} - \mathcal{L})\,\xi_i \,\big|\, x, y\big] = c^2 e_i + O(c^4),$$
with expectation taken over the noise distribution. This provides an estimator of the loss gradient:
$$\hat{e}_i = \frac{(\tilde{\mathcal{L}} - \mathcal{L})\,\xi_i}{c^2}. \qquad (2)$$
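To make the node-perturbation estimator concrete, here is a minimal sketch in a toy setting where the true gradient is known in closed form (the quadratic loss and all numbers are illustrative assumptions): averaging the product of (noisy loss minus baseline) and the perturbation, divided by the noise variance, recovers the gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: a single activity vector h with loss L(h) = 0.5 * ||h||^2,
# so the true gradient is simply e = dL/dh = h (illustrative example only).
h = np.array([0.5, -0.3, 0.8])
L_baseline = 0.5 * np.sum(h ** 2)
e_true = h.copy()

c = 0.01            # noise standard deviation
n_samples = 20000   # perturbations to average over

xi = rng.normal(0.0, c, size=(n_samples, h.size))     # perturbations
L_noisy = 0.5 * np.sum((h + xi) ** 2, axis=1)         # noisy losses
e_hat = np.mean((L_noisy - L_baseline)[:, None] * xi, axis=0) / c ** 2

print(np.max(np.abs(e_hat - e_true)))  # small for small c and many samples
```

The estimate is unbiased up to higher-order terms in $c$, but its variance grows with the dimension of the perturbed activity, which is exactly the scaling problem noted in the introduction.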
2.4 Training a feedback network
There are many possible sensible choices of $g$. For example, taking $g$ to be simply a function of each layer's activations, $g(h_{i-1}, h_i; \theta)$, is in fact a sufficient parameterization to express the true gradient function [Jaderberg2016]. We may expect, however, that the gradient estimation problem is simpler if each layer is provided with some error information obtained from the loss function and propagated in a top-down fashion. Symmetric feedback weights may not be biologically plausible, and random fixed weights may only solve problems of limited size or complexity [Lillicrap2016]. However, a system that can learn appropriate feedback weights may be able to align the feedforward and feedback weights as much as is needed to learn successfully.
We investigate
$$g\big(\tilde{a}_i, \tilde{e}_{i+1}; B_{i+1}\big) = \big(B_{i+1}\,\tilde{e}_{i+1}\big) \circ \sigma'(\tilde{a}_i),$$
which describes a non-symmetric feedback network (Figure 1). Parameters $B_{i+1}$ are estimated by solving the least squares problem:
$$\hat{B}_{i+1} = \underset{B}{\arg\min}\ \mathbb{E}\,\big\|\hat{e}_i - \big(B\,\tilde{e}_{i+1}\big)\circ\sigma'(\tilde{a}_i)\big\|^2. \qquad (3)$$
Here, unless otherwise noted, this was solved by gradient descent. Refer to the supplementary material for additional experimental descriptions and parameters.
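In a linear setting the least squares problem above has a closed-form solution, and one can check directly that regressing noisy gradient estimates onto the backpropagated feedback errors recovers the transpose of the forward weights. The sketch below makes illustrative assumptions (a linear error model, Gaussian target noise standing in for the node-perturbation estimate, arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(1)

# Linear toy model: the true error satisfies e_i = W^T e_{i+1}. We observe
# noisy estimates e_hat = W^T e_{i+1} + noise and fit the feedback matrix B
# by least squares so that B e_{i+1} ~ e_hat.
n_out, n_hidden, n_samples = 3, 5, 5000
W = rng.normal(0, 1.0, (n_out, n_hidden))          # forward weights

E_next = rng.normal(0, 1.0, (n_samples, n_out))    # errors e_{i+1}, one per row
E_hat = E_next @ W + 0.1 * rng.normal(size=(n_samples, n_hidden))

# Ordinary least squares: solve E_next @ B^T = E_hat for B
B_T, *_ = np.linalg.lstsq(E_next, E_hat, rcond=None)
B = B_T.T

rel_err = np.linalg.norm(B - W.T) / np.linalg.norm(W.T)
print(rel_err)  # shrinks as n_samples grows and the target noise falls
```

This is the intuition behind the consistency results of the next section: as the perturbation noise shrinks and the number of samples grows, the least squares solution approaches $W^T$.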
3 Theoretical results
We can prove the estimator (2) is consistent as $c \to 0$ in two particular cases. To establish these results we must distinguish the true loss gradients from their synthetic estimates. Let $\tilde{e}_i$ be the loss gradients computed by backpropagating the synthetic gradients:
$$\tilde{e}_{N+1} = \frac{\partial L}{\partial \hat{y}} \circ \sigma'(\tilde{a}_{N+1}), \qquad \tilde{e}_i = \big(B_{i+1}\,\tilde{e}_{i+1}\big) \circ \sigma'(\tilde{a}_i), \quad 1 \le i \le N.$$
To prove consistency we must show the expectation of the Taylor series approximation (1) is well behaved. That is, writing $\tilde{\mathcal{L}} - \mathcal{L} = \sum_j e_j^T \xi_j + R(\xi)$, we must show that the expected remainder term of the expansion,
$$\mathbb{E}\big[R(\xi)\,\xi_i\big],$$
is finite. This requires some additional assumptions on the problem. We prove the result under the following assumptions:

A1: the noise $\xi$ is subgaussian;
A2: the loss function $\mathcal{L}$ is analytic on its domain;
A3: the error matrices $\mathbb{E}\big[\tilde{e}_{i+1}\,\tilde{e}_{i+1}^T\big]$ are full rank, for $1 \le i \le N$;
A4: the mean of the remainder and error terms is bounded: $\mathbb{E}\big|R(\xi)\,\xi_i\big| < \infty$, for $1 \le i \le N+1$.
Under these assumptions convergence follows from consistency of the least squares estimator for linear models.
Consider first convergence of the final-layer feedback matrix, $B_{N+1}$.
Theorem 1.
Assume A1–A4. Then the least squares estimator
$$\hat{B}_{N+1} = \underset{B}{\arg\min}\ \sum_{n}\big\|\hat{e}^{(n)}_N - \big(B\,\tilde{e}^{(n)}_{N+1}\big)\circ\sigma'(\tilde{a}^{(n)}_N)\big\|^2 \qquad (4)$$
solves (3) and converges to the true feedback matrix, in the sense that:
$$\underset{c \to 0,\ n \to \infty}{\operatorname{plim}}\ \hat{B}_{N+1} = W_{N+1}^T,$$
where plim indicates convergence in probability.
Theorem 1 thus establishes convergence of $\hat{B}_{N+1}$ in a shallow (one hidden layer) nonlinear network, provided the activation function and loss function are smooth.
In a deep, linear network we can also use Theorem 1 to establish convergence over the rest of the layers of the network.
Theorem 2.
Assume A1–A4. For $1 \le i \le N$, the least squares estimator
$$\hat{B}_{i+1} = \underset{B}{\arg\min}\ \sum_{n}\big\|\hat{e}^{(n)}_i - B\,\tilde{e}^{(n)}_{i+1}\big\|^2 \qquad (5)$$
solves (3) and converges to the true feedback matrix, in the sense that:
$$\underset{c \to 0,\ n \to \infty}{\operatorname{plim}}\ \hat{B}_{i+1} = W_{i+1}^T.$$
Proofs and a discussion of the assumptions are provided in the supplementary material.
Thus, either for a nonlinear shallow network or for a deep linear network, we have the result that, for sufficiently small $c$, if we fix the network weights $W$ and train the feedback matrices $B$ through node perturbation, then we converge to $B_i = W_i^T$. Validation that the method learns to approximate the true gradients, for fixed $W$, is provided in the supplementary material. In practice, we update $W$ and $B$ simultaneously. Some convergence theory is established for this case in [Jaderberg2016; Czarnecki2017].
4 Applications
4.1 Solving MNIST
To demonstrate that the method can be used to solve simple supervised learning problems, we use node perturbation with a four-layer network and an MSE loss to solve MNIST (Figure 2). Updates to $W_i$ are made using the synthetic gradients:
$$W_i \leftarrow W_i - \eta\,\tilde{e}_i\,\tilde{h}_{i-1}^T,$$
for learning rate $\eta$. The feedback network needs to co-adapt with the feedforward network in order to continue to provide a useful error signal. We observed that the system is able to adjust to provide a close correspondence between the feedforward and feedback matrices in both layers of the network (Figure 2A).
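Schematically, one training iteration interleaves three updates: a node-perturbation estimate of the hidden-layer error, a least squares step moving the feedback matrix toward that estimate, and a forward-weight update using the synthetic gradient. The sketch below, on a toy teacher-student regression task, is an illustrative reading of the procedure; the network sizes, learning rates, and noise level are assumptions, not the MNIST settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(a): return np.tanh(a)
def dsigma(a): return 1.0 - np.tanh(a) ** 2

# Two-layer student network (all sizes and rates illustrative)
n_in, n_hid, n_out = 4, 8, 2
W1 = rng.normal(0, 0.5, (n_hid, n_in))
W2 = rng.normal(0, 0.5, (n_out, n_hid))
B2 = rng.normal(0, 0.5, (n_hid, n_out))   # learned feedback matrix
c, eta_w, eta_b = 0.05, 0.05, 0.02

# Fixed "teacher" network generating the targets
W1_t = rng.normal(0, 0.5, (n_hid, n_in))
W2_t = rng.normal(0, 0.5, (n_out, n_hid))

def train_step(x, y):
    global W1, W2, B2
    # Deterministic baseline pass
    a1 = W1 @ x; h1 = sigma(a1)
    a2 = W2 @ h1; h2 = sigma(a2)
    L = 0.5 * np.sum((h2 - y) ** 2)
    # Noisy pass
    xi1 = rng.normal(0, c, n_hid)
    xi2 = rng.normal(0, c, n_out)
    th1 = sigma(W1 @ x + xi1)
    th2 = sigma(W2 @ th1 + xi2)
    Lt = 0.5 * np.sum((th2 - y) ** 2)
    # Node-perturbation estimate of the hidden-layer error
    e1_hat = (Lt - L) * xi1 / c ** 2
    # Synthetic gradients: exact error at the output, feedback B2 below
    e2 = (h2 - y) * dsigma(a2)
    e1_syn = (B2 @ e2) * dsigma(a1)
    # Move the feedback matrix toward the perturbation estimate
    # (one SGD step on the least squares objective)
    B2 -= eta_b * np.outer((e1_syn - e1_hat) * dsigma(a1), e2)
    # Update the forward weights with the synthetic gradients
    W2 -= eta_w * np.outer(e2, h1)
    W1 -= eta_w * np.outer(e1_syn, x)
    return L

losses = []
for _ in range(3000):
    x = rng.normal(size=n_in)
    y = sigma(W2_t @ sigma(W1_t @ x))   # teacher target
    losses.append(train_step(x, y))
```

On this toy task the loss should trend downward even though the hidden layer never receives the true backpropagated error, only feedback through B2 trained from perturbations.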
We observed that the relative error between the feedforward weights $W_i$ and the feedback weights $B_i^T$ is lower than what is observed for feedback alignment, suggesting that this co-adaptation of both $W$ and $B$ is indeed beneficial. The relative error depends on the amount of noise used in node perturbation – the lowest variance does not necessarily give the lowest error, suggesting there is an optimal noise level that balances bias in the estimate against the ability to co-adapt to the changing feedforward weights.
Consistent with the low relative error in both layers, we observe that the alignment (the angle between the estimated gradient and the true gradient) is low in each layer – much lower for node perturbation than for feedback alignment – again suggesting that the method is much better at communicating error signals between layers (Figure 2B). In fact, recent studies have shown that sign congruence of the feedforward and feedback matrices is all that is required to achieve good performance [Liao2015; Xiao2018]. Here the sign congruence is also higher for node perturbation, again depending somewhat on the variance. The amount of congruence is comparable between layers (Figure 2C).
Finally, the learning performance of node perturbation is comparable to backpropagation (Figure 2D), achieving close to 3% test error, and is better than feedback alignment in this case. The same learning rate was used for all experiments here, and was not optimized individually for each method. Thus this result is not indicative of superior performance of one method over another – all methods do converge, and each likely could be optimized to converge faster. These results instead highlight the qualitative differences between the methods, and suggest that node perturbation for learning to learn can be used in deep networks.
4.2 Autoencoder
The above results demonstrate that node perturbation provides error signals closely aligned with the true gradients. However, performance-wise, they do not demonstrate any clear advantage over feedback alignment or backpropagation in this small network. A known shortcoming of feedback alignment is in very deep networks and in autoencoding networks with tight bottleneck layers [Lillicrap2016]. To see if node perturbation has the same shortcoming, we test performance on a simple autoencoding network with MNIST input data (layer sizes 784-200-2-200-784). In this more challenging case we also compare the method to the 'matching' learning rule [Rombouts2015; Martinolli2018], in which updates to $B$ match the updates to $W$.
As expected, feedback alignment performs poorly, while node perturbation performs better than backpropagation and comparably to ADAM (Figure 3a). In fact ADAM begins to overfit in training, while node perturbation does not. The increased performance relative to backpropagation is surprising. It may be a similar effect to that speculated to explain feedback alignment – the method strikes the right balance between providing a useful gradient signal to learn and constraining the updates to be sufficiently aligned with the feedback, acting as a type of regularization [Lillicrap2016]. The matched learning rule performs similarly to backpropagation. In line with these results, the latent space (bottleneck layer) learnt by node perturbation shows a useful separation between the digits, as do the networks trained by backpropagation and ADAM. In contrast, feedback alignment does not learn to separate digits in the bottleneck layer (Figure 3b). This results in scrambled output (Figure 3c). These results show that node perturbation is able to successfully communicate error signals through thin layers of a network as needed.
4.3 Recurrent networks
Node perturbation can also be applied to approximate gradients in recurrent networks. We demonstrate this with a vanilla RNN:
$$\tilde{h}^t = \sigma\big(W \tilde{h}^{t-1} + U x^t + \xi^t\big), \qquad \xi^t \sim \mathcal{N}(0, c^2 I),$$
with output
$$\hat{y}^t = W^{out}\,\tilde{h}^t.$$
Node perturbation is applied as in the feedforward setting to generate estimates of the gradient for updating $W$ and $U$. Truncated BPTT of length $T$ is used to propagate the error signal from time $t$ to $t - T$. Here 50 hidden units are used. While long-term dependencies are challenging for vanilla RNNs [Bengio1994], we used a vanilla RNN as a simple demonstration of node perturbation. Other architectures, such as LSTMs, may be used to improve performance [Hochreiter1997].
We test the method on a delayed XOR task. Here, at random times a go cue is given, along with two random binary inputs. After a fixed delay the network is required to output the XOR of the two inputs. With node perturbation the network is able to learn to perform this task (Figure 4A), converging in a time comparable to backpropagation through time. In this setup feedback alignment does not converge. Averaged over all layers (times unrolled), the angles of alignment to the true gradient are both near ninety degrees, suggesting only a weak relation to the true error signals (Figure 4B). Regardless, the learned error signal is useful enough for node perturbation to solve the task.
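The structure of the delayed XOR task can be sketched as a simple generator. The timing parameters here (trial length, cue probability, delay) are illustrative assumptions, not the values used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def delayed_xor_trial(T=60, delay=5, p_cue=0.05):
    """One trial of a delayed XOR task (timing parameters illustrative).

    Inputs: 3 channels (go cue, bit 1, bit 2). Target: the XOR of the two
    bits, `delay` steps after each cue, and zero otherwise.
    """
    x = np.zeros((T, 3))
    y = np.zeros(T)
    for t in range(T - delay):
        if rng.random() < p_cue:           # a go cue arrives at random times
            b1, b2 = rng.integers(0, 2, size=2)
            x[t] = [1.0, b1, b2]
            y[t + delay] = float(b1 ^ b2)  # delayed XOR target
    return x, y
```

Because the target at time $t$ depends on inputs presented `delay` steps earlier, the task requires the truncated BPTT horizon (or the learned feedback) to carry error information across that gap.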
5 Discussion
Here we implement a perturbation-based synthetic gradient method to train neural networks. We show that this hybrid approach can solve both feedforward and recurrent tasks. By removing both the symmetric forward-backward weight requirement and the update locking imposed by backpropagation, this approach is a step towards more biologically plausible deep learning. In contrast to many perturbation-based methods, this hybrid approach can solve problems at a large scale [Xie2004; Hara2017; Hara2011; Seung2003; Fiete2007]. Moreover, recently proposed causal estimation techniques [Lansdell2018a] promise to provide lower-variance estimators than node perturbation. We thus believe this type of learning-to-learn approach may ultimately provide a powerful and biologically plausible learning algorithm.
Computationally, the method has a number of benefits. First, training of the forward and backward weights may be performed separately, and hence the forward and backward passes of backpropagation may be performed asynchronously. Thus the method may be applied in distributed systems in which synchronization is difficult or time-consuming [Czarnecki2017; Crafton2019], including some integrated circuits [Wilson2019]. Second, by relying on random perturbations to measure gradients, the method does not rely on the environment to provide gradients. It works in cases common in reinforcement learning, where gradients in the environment cannot be backpropagated through. And third, the method is mathematically quite straightforward, facilitating analysis. This allows us to provide proofs of convergence in some special cases.
These proofs extend the theory of synthetic gradients and feedback alignment. Previous results on synthetic gradients prove convergence in deep linear networks with MSE loss [Czarnecki2017], with a synthetic gradient module in a single layer. Here, by assuming smoothness and sufficiently concentrated noise, we are able to extend these results somewhat. Further, proofs of convergence for more general choices of the synthetic gradient function $g$ may be possible using the same ideas presented here.
As a REINFORCE-style estimator, it may seem that the corresponding theory would be relevant – the REINFORCE estimator provides an unbiased estimate of the loss gradient of a nonlinear, noisy system [Williams1992]. However, here we define synthetic gradients in terms of the system without noise, so REINFORCE theory cannot be used directly (though perhaps results with a noisy baseline are possible, e.g. [Hara2017]). Our results instead show that, as the noise tends to zero, the REINFORCE-style estimator can be used to estimate the true parameters of the deterministic system. As pointed out in [Czarnecki2017], synthetic gradient methods are closely related in form and motivation to actor-critic methods. Thus it is likely that further ideas from reinforcement learning could provide additional theory and insight.
While previous research has provided some insight and theory for how feedback alignment works [Lillicrap2016; Ororbia2018; Moskovitz2018; Bartunov2018; Baldi2018], the effect remains somewhat mysterious, and is not applicable in some network architectures. Recent studies have shown that some of these weaknesses can be addressed by instead imposing sign-congruent feedforward and feedback matrices [Liao2015; Xiao2018]. Yet what mechanism may produce congruence in biological networks has not been addressed. Instead, here we show that some of the shortcomings of feedback alignment can be addressed in another way: the system can adjust the feedback weights as needed to provide a useful error signal. While we have investigated just one choice of feedback function, in which the approach directly approximates backpropagation, the theory may be extended easily to other forms of $g$. Our work is closely related to a recent proposal from Akrout et al. 2019 [Wilson2019], which also uses perturbations to learn feedback weights. However, our approach does not divide learning into two phases, and training of the feedback weights does not occur in a layer-wise fashion. A future combination of the two approaches may prove fruitful.
The method does have some drawbacks. Our approach does not, by itself, reach state-of-the-art performance on common benchmarks like CIFAR or ImageNet, which would require convolutional networks. Further, as implemented here, distortion of the error signal does accumulate in a layer-wise fashion, from top to bottom, so it is unlikely to be a practical approach to learning in very deep networks. It is possible these drawbacks can be addressed to some extent, for example by using direct feedback alignment [Nokland2016] to produce a system in which convergence does not proceed layer by layer. Recent studies have shown that shallow spiking networks can competitively solve problems often tackled with deep networks [Illing2019]. Further, noise injection may be replaced with an estimate of the effect on a cost function that does not require the injection of noise [Lansdell2018a]. Thus perhaps some of these drawbacks can be mitigated.
How to solve the credit assignment problem remains a challenge, not just in biological networks. In artificial networks, training a system that can learn long-term dependencies is difficult. Synthetic gradients show how a method can be trained to solve a problem beyond its truncated BPTT horizon, yet they have not been demonstrated to solve very long-term dependencies. Recent research has thus focused on the notion of attention to bridge long time spans [Hung2018; ke2018sparse; Bengio2017]. Other recent work has shown how recurrent networks can be trained in an online fashion [Cooijmans], in an approach that can be seen as making REINFORCE-type updates with small noise perturbations. This may be related to our framework, and is the subject of future work.
Though we are interested in biologically plausible learning, our method sits at the computational and algorithmic level: it operates within constraints consistent with neurobiology, but does not specify exactly how it may be implemented. Rather, we focused on theoretical analysis and on testing the method in an idealized setting. In a similar fashion, feedback alignment was first analyzed in an artificial network setting and now forms a part of some biologically plausible models of learning in cortex [Guerguiev2017lp]. We thus believe this work is an important first step before more detailed models are considered.
Notably, however, the method is consistent with neurobiology in two important ways. First, it involves separate learning of feedforward and feedback weights. This is possible in cortical networks, where complex feedback connections exist between layers [Lacefield2019; Richards2019] and pyramidal cells have apical and basal compartments that allow for separate integration of feedback and feedforward signals [Guerguiev2017lp]. A recent finding that apical dendrites receive reward information is particularly interesting [Lacefield2019]. Models like Guerguiev et al. 2017 [Guerguiev2017lp] are thus quite compelling. We believe such models can be augmented with a perturbation-based rule like ours to provide a better learning system. The second feature is that perturbations are used to learn the feedback weights. How can a neuron measure these perturbations? There are many plausible mechanisms [Seung2003; Xie2004; Fiete2006; Fiete2007]. For instance, birdsong learning uses 'empiric synapses' from area LMAN [Fiete2007]; others have proposed that the noise is approximated [Legenstein2010; Hoerzer2014]; or neurons could use a learning rule that does not require knowing the noise [Lansdell2018a]. Further, our model involves the subtraction of a baseline loss to reduce the variance of the estimator. This does not affect the expected value of the estimator – technically the baseline could be removed or replaced with an approximation [Legenstein2010; Loewenstein2006]. Thus both separation of feedforward and feedback systems and perturbation-based estimators can be implemented by neurons.
Learning to learn is a powerful mechanism, not just for learning efficient learning rules, but also for learning rules that generalize well to new data on the basis of common structure [wang2016learning; maclaurin2015gradient; Andrychowicz2016fr]. Its potential to provide realistic accounts of efficient learning in the brain is only just beginning to be explored.
References
 [1] Mohamed Akrout, Collin Wilson, Peter C Humphreys, Timothy Lillicrap, and Douglas Tweed. Deep Learning without Weight Transport. ArXiv eprints, 2019.
 [2] Marcin Andrychowicz, Misha Denil, Sergio Gómez Colmenarejo, and Matthew W Hoffman. Learning to learn by gradient descent by gradient descent. Advances in Neural Information Processing Systems, pages 1–17, 2016.
 [3] Pierre Baldi, Peter Sadowski, and Zhiqin Lu. Learning in the Machine: Random Backpropagation and the Deep Learning Channel. Artificial Intelligence, 260:1–35, 2018.
 [4] Sergey Bartunov, Adam Santoro, Blake Richards, Geoffrey Hinton, and Timothy Lillicrap. Assessing the scalability of biologically-motivated deep learning algorithms and architectures. ArXiv eprints, 2018.
 [5] Yoshua Bengio. The Consciousness Prior. ArXiv eprints, (1):1–4, 2017.
 [6] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning Long-Term Dependencies with Gradient Descent is Difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
 [7] Guy Bouvier, Claudia Clopath, Célian Bimbard, JeanPierre Nadal, Nicolas Brunel, Vincent Hakim, and Boris Barbour. Cerebellar learning using perturbations. bioRxiv, page 053785, 2016.
 [8] Tim Cooijmans and James Martens. On the Variance of Unbiased Online Recurrent Optimization. ArXiv eprints, 2019.
 [9] Brian Crafton, Abhinav Parihar, Evan Gebhardt, and Arijit Raychowdhury. Direct Feedback Alignment with Sparse Connections for Local Learning. ArXiv eprints, pages 1–13, 2019.
 [10] Wojciech Marian Czarnecki, Grzegorz Świrszcz, Max Jaderberg, Simon Osindero, Oriol Vinyals, and Koray Kavukcuoglu. Understanding Synthetic Gradients and Decoupled Neural Interfaces. ArXiv eprints, 2017.
 [11] Ila R Fiete, Michale S Fee, and H Sebastian Seung. Model of Birdsong Learning Based on Gradient Estimation by Dynamic Perturbation of Neural Conductances. Journal of Neurophysiology, 98:2038–2057, 2007.
 [12] Ila R Fiete and H Sebastian Seung. Gradient learning in spiking neural networks by dynamic perturbation of conductances. Physical Review Letters, 97, 2006.
 [13] Wulfram Gerstner, Marco Lehmann, Vasiliki Liakoni, Dane Corneil, and Johanni Brea. Eligibility Traces and Plasticity on Behavioral Time Scales: Experimental Support of neo-Hebbian Three-Factor Learning Rules. ArXiv eprints, pages 1–23, 2018.
 [14] Jordan Guergiuev, Timothy P. Lillicrap, and Blake A. Richards. Towards deep learning with segregated dendrites. eLife, 6:1–37, 2017.
 [15] Jordan Guerguiev, Timothy P Lillicrap, and Blake A Richards. Towards deep learning with segregated dendrites. Elife, 6, December 2017.
 [16] H A Haenssle, C Fink, R Schneiderbauer, F Toberer, T Buhl, A Blum, A Kalloo, A Ben Hadj Hassen, L Thomas, A Enk, L Uhlmann, and Reader study level-I and level-II Groups. Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Annals of Oncology, 29(8):1836–1842, August 2018.
 [17] Kazuyuki Hara, Kentaro Katahira, and Masato Okada. Statistical mechanics of nodeperturbation learning with noisy baseline. Journal of the Physical Society of Japan, 86(2):1–7, 2017.
 [18] Kazuyuki Hara, Kentaro Katahira, Kazuo Okanoya, and Masato Okada. Statistical Mechanics of Online Node-perturbation Learning. Information Processing Society of Japan, 4:23–32, 2011.
 [19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing Human-Level performance on ImageNet classification. In 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
 [20] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 [21] Gregor M. Hoerzer, Robert Legenstein, and Wolfgang Maass. Emergence of complex computational structures from chaotic neural networks through reward-modulated Hebbian learning. Cerebral Cortex, 24(3):677–690, 2014.
 [22] ChiaChun Hung, Timothy Lillicrap, Josh Abramson, Yan Wu, Mehdi Mirza, Federico Carnevale, Arun Ahuja, and Greg Wayne. Optimizing Agent Behavior over Long Time Scales by Transporting Value. ArXiv eprints, 8:1–60, 2018.
 [23] Bernd Illing, Wulfram Gerstner, and Johanni Brea. Biologically plausible deep learning – but how far can we go with shallow networks? ArXiv eprints, 2019.
 [24] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, David Silver, and Koray Kavukcuoglu. Decoupled Neural Interfaces using Synthetic Gradients. ArXiv eprints, 1, 2016.
 [25] Nan Rosemary Ke, Anirudh Goyal, Olexa Bilaniuk, Jonathan Binas, Michael C Mozer, Chris Pal, and Yoshua Bengio. Sparse attentive backtracking: Temporal credit assignment through reminding. In Advances in Neural Information Processing Systems, pages 7651–7662, 2018.
 [26] Konrad Kording and Peter Konig. Supervised and Unsupervised Learning with Two Sites of Synaptic Integration. Journal of Computational Neuroscience, 11:207–215, 2001.
 [27] Clay O Lacefield, Eftychios A Pnevmatikakis, Liam Paninski, and Randy M Bruno. Reinforcement Learning Recruits Somata and Apical Dendrites across Layers of Primary Sensory Cortex. Cell Reports, 26(8):2000–2008.e2, 2019.
 [28] Benjamin James Lansdell and Konrad Paul Kording. Spiking allows neurons to estimate their causal effect. bioRxiv, pages 1–19, 2018.
 [29] Benjamin James Lansdell and Konrad Paul Kording. Towards learning-to-learn. ArXiv eprints, pages 1–8, 2018.
 [30] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
 [31] Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, and Yoshua Bengio. Difference target propagation. Lecture Notes in Computer Science, 9284:498–515, 2015.
 [32] Robert Legenstein, Steven M. Chase, Andrew B. Schwartz, and Wolfgang Maass. A Reward-Modulated Hebbian Learning Rule Can Explain Experimentally Observed Network Reorganization in a Brain Control Task. Journal of Neuroscience, 30(25):8400–8410, 2010.
 [33] Marco Lehmann, He Xu, Vasiliki Liakoni, Michael Herzog, Wulfram Gerstner, and Kerstin Preuschoff. Evidence for eligibility traces in human learning. arXiv e-prints, pages 2–7, 2017.
 [34] Qianli Liao, Joel Z. Leibo, and Tomaso Poggio. How Important is Weight Symmetry in Backpropagation? 2015.
 [35] Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random feedback weights support learning in deep neural networks. Nature Communications, 7:13276, 2016.
 [36] Y. Loewenstein and H. S. Seung. Operant matching is a generic outcome of synaptic plasticity based on the covariance between reward and neural activity. Proceedings of the National Academy of Sciences, 103(41):15224–15229, 2006.
 [37] Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradientbased hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113–2122, 2015.
 [38] David Marr. A theory of cerebellar cortex. J. Physiol, 202:437–470, 1969.
 [39] Marco Martinolli, Wulfram Gerstner, and Aditya Gilra. Multi-Timescale Memory Dynamics Extend Task Repertoire in a Reinforcement Learning Network With Attention-Gated Memory. Frontiers in Computational Neuroscience, 12:1–15, 2018.
 [40] Thomas Miconi. Biologically plausible learning in recurrent neural networks reproduces neural dynamics observed during cognitive tasks. eLife, 6:1–24, 2017.
 [41] Thomas Miconi, Jeff Clune, and Kenneth O. Stanley. Differentiable plasticity: training plastic neural networks with backpropagation. arXiv e-prints, 2018.
 [42] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.
 [43] Theodore H. Moskovitz, Ashok Litwin-Kumar, and L.F. Abbott. Feedback alignment in deep convolutional networks. arXiv e-prints, pages 1–10, 2018.
 [44] Arild Nøkland. Direct Feedback Alignment Provides Learning in Deep Neural Networks. In Advances in Neural Information Processing Systems, 2016.
 [45] Alexander G. Ororbia, Ankur Mali, Daniel Kifer, and C. Lee Giles. Conducting Credit Assignment by Aligning Local Representations. arXiv e-prints, pages 1–27, 2018.
 [46] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. Proceedings of the 31st International Conference on Machine Learning, PMLR, 32(2):1278–1286, 2014.
 [47] Blake A Richards and Timothy P Lillicrap. Dendritic solutions to the credit assignment problem. Current Opinion in Neurobiology, 54:28–36, 2019.
 [48] Jaldert O Rombouts, Sander M Bohte, and Pieter R Roelfsema. How Attention Can Create Synaptic Tags for the Learning of Working Memories in Sequential Tasks. PLoS Computational Biology, 11(3):1–34, 2015.
 [49] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by backpropagating errors. Nature, 323(9):533–536, 1986.
 [50] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [51] Benjamin Scellier and Yoshua Bengio. Equilibrium Propagation: Bridging the Gap Between Energy-Based Models and Backpropagation. arXiv e-prints, pages 1–13, 2016.
 [52] Sebastian Seung. Learning in Spiking Neural Networks by Reinforcement of Stochastic Synaptic Transmission. Neuron, 40:1063–1073, 2003.
 [53] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, October 2017.
 [54] H. Francis Song, Guangyu R. Yang, and Xiao-Jing Wang. Reward-based training of recurrent neural networks for cognitive and value-based tasks. eLife, 6:1–24, 2017.
 [55] Jane X. Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
 [56] Paul Werbos. Approximate dynamic programming for real-time control and neural modeling. In Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches, chapter 13. Multiscience Press, Inc., New York, 1992.
 [57] Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8:229–256, 1992.
 [58] Will Xiao, Honglin Chen, Qianli Liao, and Tomaso Poggio. Biologically-Plausible Learning Algorithms Can Scale to Large Datasets. arXiv e-prints, 2018.
 [59] Xiaohui Xie and H. Sebastian Seung. Learning in neural networks by reinforcement of irregular spiking. Physical Review E, 69, 2004.
Supplementary material
Appendix A Validation with fixed $W$
We demonstrate the method's convergence in a small nonlinear network solving MNIST for different noise levels $\sigma$ and layer widths (Supplementary Figure 5). As basic validation of the method, in this experiment the feedback matrices $B$ are updated while the feedforward weights $W$ are held fixed. In contrast to the main text, to best understand how closely the method can approximate $W$, we used the exact ridge regression solution to update $B$:
$$\hat{B} = \left(\sum_{n=1}^{N} \tilde{e}_n \tilde{e}_n^\top + \gamma I\right)^{-1} \sum_{n=1}^{N} \tilde{e}_n \hat{\lambda}_n^\top,$$
with identity matrix $I$ and regularization parameter $\gamma$.
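The exact ridge solution above can be sketched in NumPy (a minimal illustration, not the paper's TensorFlow implementation; the linear layer, dimensions and variable names below are our own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear setting: noisy gradient estimates lambda_hat relate to
# error signals e through a true feedforward matrix W, lambda = W^T e + noise.
n_out, n_in, N = 10, 20, 5000
W = rng.standard_normal((n_out, n_in))
E = rng.standard_normal((n_out, N))                     # error signals e_n (columns)
Lam = W.T @ E + 0.01 * rng.standard_normal((n_in, N))   # noisy gradient estimates

gamma = 1e-3  # ridge regularization parameter
# Exact ridge solution: (sum_n e_n e_n^T + gamma I)^{-1} sum_n e_n lambda_n^T
B_hat = np.linalg.solve(E @ E.T + gamma * np.eye(n_out), E @ Lam.T)

rel_err = np.linalg.norm(B_hat - W) / np.linalg.norm(W)
print(f"relative error: {rel_err:.4f}")
```

With enough samples and weak noise the ridge estimate recovers $W$ closely; the regularizer $\gamma$ only matters when the error covariance is poorly conditioned.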
We should expect the feedback matrices $B$ to converge to the feedforward matrices $W$. Here, different noise variances result in equally accurate estimators (Supplementary Figure 5A). The estimator recovers the true feedback matrix to a relative error of 0.8%. The convergence is layer dependent: the second hidden layer matrix is estimated accurately, while the first hidden layer matrix is estimated less accurately. Despite this, the angles between the estimated gradient and the true gradient are very close to zero for both layers (Supplementary Figure 5B); an angle of less than 90 degrees corresponds to a descent direction. Thus the estimated gradients strongly align with the true gradients in both layers. Recent studies have shown that sign congruence of the feedforward and feedback matrices is all that is required to achieve good performance [34, 58]. Here, significant sign congruence is achieved in both layers (Supplementary Figure 5C), despite the matrices themselves being quite different in the first layer. The number of neurons affects both the relative error in each layer and the extent of alignment between the true and synthetic gradients (Supplementary Figure 5D,E). The method thus provides useful error signals for a variety of network sizes, and can propagate useful error information through a deep network.
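The two diagnostics used here, the angle between estimated and true gradients and the sign congruence of $B$ and $W$, can be computed as in the following sketch (the matrices and vectors are hypothetical stand-ins; an angle under 90 degrees indicates a descent direction):

```python
import numpy as np

def gradient_angle(g_est, g_true):
    """Angle in degrees between estimated and true gradient vectors."""
    cos = np.dot(g_est, g_true) / (np.linalg.norm(g_est) * np.linalg.norm(g_true))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def sign_congruence(B, W):
    """Fraction of entries of B and W sharing the same sign."""
    return np.mean(np.sign(B) == np.sign(W))

rng = np.random.default_rng(1)
W = rng.standard_normal((20, 10))
B = W + 0.3 * rng.standard_normal((20, 10))   # a feedback matrix near W
g_true = rng.standard_normal(20)
g_est = g_true + 0.1 * rng.standard_normal(20)

print(f"angle: {gradient_angle(g_est, g_true):.1f} deg, "
      f"sign congruence: {sign_congruence(B, W):.2f}")
```

Note that high sign congruence is compatible with a large relative error in the matrix entries, which is why the first-layer matrices can differ substantially while still providing a useful error signal.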
With fixed $W$, only the top layer feedforward and feedback matrices were in close correspondence (compare with Figure 2, which shows that both layers converge when $W$ and $B$ co-adapt). Thus it seems that in the co-adapting case an effect similar to feedback alignment may be occurring: the feedforward matrices adapt to the feedback matrices, allowing a more useful error signal to propagate to deeper layers and producing greater correspondence between $B$ and $W$ throughout the network than occurs with fixed $W$.
Appendix B Proofs
We review the key components of the model. Data $(x, y)$ are drawn from a distribution $\mathcal{D}$. Noise $\xi \sim \nu$, with $\nu = \mathcal{N}(0, \sigma^2 I)$, is added to the activity $h$, giving a perturbed loss $\tilde{\mathcal{L}}$. The loss function is linearized:
$$\tilde{\mathcal{L}} = \mathcal{L} + \sum_i \frac{\partial \mathcal{L}}{\partial h_i}\,\xi_i + O(\|\xi\|^2), \qquad (6)$$
such that
$$\mathbb{E}\left[\frac{(\tilde{\mathcal{L}} - \mathcal{L})\,\xi_i}{\sigma^2} \,\middle|\, x, y\right] = \frac{\partial \mathcal{L}}{\partial h_i} + O(\sigma),$$
with expectation taken over the noise distribution $\nu(\xi)$. This suggests a good estimator of the loss gradient is
$$\hat{\lambda}_i = \frac{(\tilde{\mathcal{L}} - \mathcal{L})}{\sigma^2}\,\xi_i. \qquad (7)$$
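Estimator (7) is a perturbation-based (REINFORCE-style) gradient estimate. The sketch below checks, on a toy quadratic loss of our own choosing, that averaging $(\tilde{\mathcal{L}} - \mathcal{L})\,\xi_i/\sigma^2$ over noise draws approximates $\partial \mathcal{L}/\partial h_i$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy quadratic loss over activity h (our own example): L(h) = 0.5 * ||A h - b||^2
A = rng.standard_normal((5, 8))
b = rng.standard_normal(5)

def loss(h):
    return 0.5 * np.sum((A @ h - b) ** 2)

h = rng.standard_normal(8)
true_grad = A.T @ (A @ h - b)

sigma, trials = 0.01, 200_000
xi = sigma * rng.standard_normal((trials, 8))
# Perturbed losses computed in a batch: L(h + xi_n) for each noise draw
residuals = (h + xi) @ A.T - b                       # shape (trials, 5)
dL = 0.5 * np.sum(residuals ** 2, axis=1) - loss(h)
# Estimator: lambda_hat_i = mean over draws of (L_tilde - L) * xi_i / sigma^2
lam_hat = (dL[:, None] * xi).mean(axis=0) / sigma ** 2

rel_err = np.linalg.norm(lam_hat - true_grad) / np.linalg.norm(true_grad)
print(f"relative error of perturbation estimate: {rel_err:.3f}")
```

The estimate is unbiased up to $O(\sigma)$ but high-variance, which is why many noise draws (or, in the paper, a learned regression onto the error signal) are needed.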
Let $\tilde{e}^l$ be the error signal computed by backpropagating the synthetic gradients:
$$\tilde{e}^l(B) = \begin{cases} e^L, & l = L, \\ (B^{l+1})^\top\, \tilde{e}^{l+1}(B), & l < L, \end{cases}$$
where for clarity we present the case of identity activations (a nonlinearity $\psi$ introduces an additional factor $\psi'(h^l)$). Then the parameters $B$ are estimated by solving the least squares problem:
$$\hat{B}_N = \operatorname*{arg\,min}_{B} \sum_{n=1}^{N} \left\| \hat{\lambda}_n - \tilde{e}_n(B) \right\|_2^2. \qquad (8)$$
Under what conditions can we show that $\hat{B}_N \to W$ (with enough data)?
One way to find an answer is to define the synthetic gradient in terms of the system without noise added. Then $\tilde{e}$ is deterministic given $(x, y)$ and, assuming $\mathcal{L}$ has a convergent power series around $h$, we can write
$$\hat{\lambda}_i = \frac{\partial \mathcal{L}}{\partial h_i} + \epsilon_i, \qquad \mathbb{E}[\epsilon_i \mid x, y] = O(\sigma).$$
Taken together these suggest we can prove $\hat{B}_N \to W$ in the same way we prove consistency of the linear least squares estimator.
For this to work we must show the expectation of the Taylor series approximation (6) is well behaved. That is, we must show that the expected remainder term of the expansion,
$$\frac{1}{\sigma^2}\,\mathbb{E}\left[ R(\xi)\, \xi_i \mid x, y \right],$$
is finite and goes to zero as $\sigma \to 0$. This requires some additional assumptions on the problem.
We make the following assumptions:

A1: the noise $\xi$ is subgaussian;

A2: the loss function $\mathcal{L}$ is analytic on its domain;

A3: the error matrices $\mathbb{E}\left[\tilde{e}^l (\tilde{e}^l)^\top\right]$ are full rank, for $1 \le l \le L$;

A4: the mean of the remainder and error terms is bounded:
$$\mathbb{E}\left[\|\epsilon^l\| \mid x, y\right] < \infty,$$
for $1 \le l \le L$.
Consider first convergence of the final layer feedback matrix, $B^L$. In the final layer it is true that $\tilde{e}^L = e^L$.
Theorem 3.
Assume A1–4. Then the least squares estimator
$$\hat{B}^L_N = \left( \sum_{n=1}^{N} e^L_n (e^L_n)^\top \right)^{-1} \sum_{n=1}^{N} e^L_n (\hat{\lambda}^{L-1}_n)^\top \qquad (9)$$
solves (8) and converges to the true feedback matrix, in the sense that:
$$\lim_{\sigma \to 0}\ \operatorname*{plim}_{N \to \infty} \hat{B}^L_N = W^L.$$
Proof.
We first show that, under A1–2, the conditional expectation of the estimator (7) converges to the gradient as $\sigma \to 0$. For each $n$, by A2, we have the following series expanded around $h$:
$$\tilde{\mathcal{L}}_n = \mathcal{L}_n + \sum_j \frac{\partial \mathcal{L}_n}{\partial h_j}\,\xi_j + R_n(\xi),$$
where the remainder $R_n(\xi)$ collects terms of second order and higher. Taking a conditional expectation gives:
$$\mathbb{E}\left[\hat{\lambda}_i \mid x_n, y_n\right] = \frac{\partial \mathcal{L}_n}{\partial h_i} + \frac{1}{\sigma^2}\,\mathbb{E}\left[ R_n(\xi)\, \xi_i \mid x_n, y_n \right].$$
We must show the remainder term
$$\frac{1}{\sigma^2}\,\mathbb{E}\left[ R_n(\xi)\, \xi_i \mid x_n, y_n \right]$$
goes to zero as $\sigma \to 0$. This is true provided each moment is sufficiently well behaved. Using Jensen's inequality and the triangle inequality in the first step, and exchanging expectation and summation by monotone convergence, we have
$$\frac{1}{\sigma^2}\left|\mathbb{E}\left[ R_n(\xi)\, \xi_i \right]\right| \le \frac{1}{\sigma^2} \sum_{k \ge 2} \frac{1}{k!}\,\mathbb{E}\left[ \left|\mathcal{L}^{(k)}_n\right| \|\xi\|^{k}\, |\xi_i| \right] = O(\sigma), \qquad (10)$$
since, by A1, the moments of the subgaussian noise satisfy $\mathbb{E}\|\xi\|^{k+1} = O(\sigma^{k+1})$.
With this in place, we have that the problem (8) is close to a linear least squares problem, since
$$\hat{\lambda}^{L-1}_n = (W^L)^\top e^L_n + \eta_n, \qquad (11)$$
with residual $\eta_n$. The residual satisfies
$$\mathbb{E}\left[\eta_n \mid x_n, y_n\right] = O(\sigma). \qquad (12)$$
This follows since $e^L$ is defined in relation to the baseline loss, not the stochastic loss, meaning it is measurable with respect to $(x_n, y_n)$ and can be moved into the conditional expectation. Consistency of the linear least squares estimator, together with A3, then gives the result. ∎
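As a numerical sanity check of this consistency argument, the sketch below (a toy linear model with dimensions of our own choosing, not the paper's networks) verifies that the least squares estimate of $B$ approaches $W$ as $N$ grows:

```python
import numpy as np

rng = np.random.default_rng(3)

n_out, n_in = 6, 12
W = rng.standard_normal((n_out, n_in))

def estimate_B(N, noise=0.05):
    """Least squares estimate of B: regress noisy gradients on error signals e."""
    E = rng.standard_normal((n_out, N))                     # error signals e_n
    Lam = W.T @ E + noise * rng.standard_normal((n_in, N))  # lambda = W^T e + residual
    # Normal equations: B_hat = (sum_n e_n e_n^T)^{-1} sum_n e_n lambda_n^T
    return np.linalg.solve(E @ E.T, E @ Lam.T)

errs = [np.linalg.norm(estimate_B(N) - W) / np.linalg.norm(W) for N in (100, 10_000)]
print(f"relative error: N=100 -> {errs[0]:.4f}, N=10000 -> {errs[1]:.4f}")
```

The relative error shrinks with sample size, mirroring the plim statement in the theorem; the residual noise level plays the role of the $O(\sigma)$ bias.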
We can use Theorem 3 to establish convergence over the rest of the layers of the network when the activation function is the identity.
Theorem 4.
Assume A1–4. For $l < L$, the least squares estimator
$$\hat{B}^{l+1}_N = \left( \sum_{n=1}^{N} \tilde{e}^{l+1}_n(\hat{B}_N)\, \tilde{e}^{l+1}_n(\hat{B}_N)^\top \right)^{-1} \sum_{n=1}^{N} \tilde{e}^{l+1}_n(\hat{B}_N)\, (\hat{\lambda}^{l}_n)^\top \qquad (13)$$
solves (8) and converges to the true feedback matrix, in the sense that:
$$\lim_{\sigma \to 0}\ \operatorname*{plim}_{N \to \infty} \hat{B}^{l+1}_N = W^{l+1}.$$
Proof.
Define
$$B^{*}_{l} = \operatorname*{plim}_{N \to \infty} \hat{B}^{l}_N,$$
assuming this limit exists. From Theorem 3 the top layer estimate converges in probability to $B^{*}_{L}$, with $B^{*}_{L} \to W^L$ as $\sigma \to 0$.
We can then use induction to establish that $\hat{B}^{l}_N$ in the remaining layers also converges in probability to $W^{l}$. That is, assume that the estimates converge in probability to $B^{*}_{m}$ in higher layers $m > l + 1$. Then we must establish that $\hat{B}^{l+1}_N$ also converges in probability.
To proceed it is useful to also define
$$\tilde{e}^{l}(B^{*}) = (B^{*}_{l+1})^\top\, \tilde{e}^{l+1}(B^{*}), \qquad \tilde{e}^{L}(B^{*}) = e^L,$$
as the error signal backpropagated through the converged (but biased) weight matrices $B^{*}$. Again it is true that $\tilde{e}^{l}(B^{*})$ is deterministic given $(x, y)$.
As in Theorem 3, the least squares estimator has the form:
$$\hat{B}^{l+1}_N = \left( \frac{1}{N}\sum_{n=1}^{N} \tilde{e}^{l+1}_n(\hat{B}_N)\, \tilde{e}^{l+1}_n(\hat{B}_N)^\top \right)^{-1} \frac{1}{N}\sum_{n=1}^{N} \tilde{e}^{l+1}_n(\hat{B}_N)\, (\hat{\lambda}^{l}_n)^\top.$$
Thus, again by the continuous mapping theorem:
$$\operatorname*{plim}_{N \to \infty} \hat{B}^{l+1}_N = \left( \operatorname*{plim}_{N \to \infty} \frac{1}{N}\sum_{n=1}^{N} \tilde{e}^{l+1}_n(\hat{B}_N)\, \tilde{e}^{l+1}_n(\hat{B}_N)^\top \right)^{-1} \operatorname*{plim}_{N \to \infty} \frac{1}{N}\sum_{n=1}^{N} \tilde{e}^{l+1}_n(\hat{B}_N)\, (\hat{\lambda}^{l}_n)^\top.$$
In this case continuity again allows us to separate convergence of each term in the product:
$$\operatorname*{plim}_{N \to \infty} \frac{1}{N}\sum_{n=1}^{N} \tilde{e}^{l+1}_n(\hat{B}_N)\, \tilde{e}^{l+1}_n(\hat{B}_N)^\top = \mathbb{E}\left[ \tilde{e}^{l+1}(B^{*})\, \tilde{e}^{l+1}(B^{*})^\top \right], \qquad (14)$$
using the weak law of large numbers for the first term, and the induction assumption for the remaining terms. In the same way,
$$\operatorname*{plim}_{N \to \infty} \frac{1}{N}\sum_{n=1}^{N} \tilde{e}^{l+1}_n(\hat{B}_N)\, (\hat{\lambda}^{l}_n)^\top = \mathbb{E}\left[ \tilde{e}^{l+1}(B^{*})\, (\hat{\lambda}^{l})^\top \right].$$
Note that the induction assumption also implies $\tilde{e}^{l+1}(B^{*}) \to e^{l+1}$ as $\sigma \to 0$. Thus, putting it together, by A3, A4 and the same reasoning as in Theorem 3 we have the result:
$$\lim_{\sigma \to 0}\ \operatorname*{plim}_{N \to \infty} \hat{B}^{l+1}_N = \mathbb{E}\left[ e^{l+1} (e^{l+1})^\top \right]^{-1} \mathbb{E}\left[ e^{l+1} (\lambda^{l})^\top \right] = W^{l+1},$$
where the last equality uses $\lambda^{l} = (W^{l+1})^\top e^{l+1}$ for identity activations.
∎
B.1 Discussion of assumptions
It is worth making the following points on each of the assumptions:

A1. In the paper we assume $\xi$ is Gaussian. Here we prove the more general result of convergence for any subgaussian random variable.

A2. In practice this may be a fairly restrictive assumption, since it precludes using ReLU nonlinearities. Other common choices, such as hyperbolic tangent and sigmoid nonlinearities combined with an analytic cost function, do satisfy this assumption.

A3. It is hard to establish general conditions under which the error matrices will be full rank, though it may be a reasonable assumption in some cases.
Extensions of Theorem 4 to a nonlinear network may be possible. However, the method of proof used here is not immediately applicable, because the continuous mapping theorem cannot be applied in as straightforward a fashion as in Equation (14). In the nonlinear case the resulting sums over all observations are neither independent nor identically distributed, which makes applying any law of large numbers complicated.
Appendix C Experiment details
Details of each task and parameters are provided here. All code is implemented in TensorFlow.
C.1 Supplementary Figure 1
Networks are 784-50-20-10 (noise variance experiments) or 784-N-50-10 (number of neurons experiments), solving MNIST with an MSE loss function. A sigmoid nonlinearity and a batch size of 32 are used. Here $W$ is fixed, and $B$ is updated according to the online ridge regression least squares solution, with regularization parameter $\gamma$.
C.2 Figure 2
Networks are 784-50-20-10. Unless stated otherwise, assume the same parameters as in Supplementary Figure 1. Now $W$ is updated using synthetic gradient updates. The same step size is used for feedback alignment, backpropagation and node perturbation.
C.3 Figure 3
The network has dimensions 784-200-2-200-784. Activation functions are, in order: tanh, identity, tanh, ReLU. MNIST input data with an MSE reconstruction loss is used. Unless stated otherwise, assume the same parameters as in Figure 2. In this case node perturbation performance was more stable with stochastic gradient updates to $B$, instead of the exact least squares solution. Values for the step sizes, noise variance and $\gamma$ were found by random hyperparameter search over their respective ranges.
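The random hyperparameter search over step size, noise variance and $\gamma$ can be sketched as sampling log-uniformly over each range (a generic sketch: the ranges and the toy objective below are placeholders of our own, since the exact values used in the experiments are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(4)

def log_uniform(low, high):
    """Sample log-uniformly between low and high."""
    return 10 ** rng.uniform(np.log10(low), np.log10(high))

def random_search(objective, n_trials=50):
    """Draw hyperparameters log-uniformly, keep the best (lowest) scoring setting."""
    best = None
    for _ in range(n_trials):
        params = {
            "step_size": log_uniform(1e-6, 1e-2),  # placeholder range
            "noise_var": log_uniform(1e-4, 1e0),   # placeholder range
            "gamma": log_uniform(1e-6, 1e-2),      # placeholder range
        }
        score = objective(params)
        if best is None or score < best[0]:
            best = (score, params)
    return best

# Toy objective standing in for validation loss after training a network:
toy = lambda p: (np.log10(p["step_size"]) + 4) ** 2 + (np.log10(p["noise_var"]) + 2) ** 2
score, params = random_search(toy)
print(f"best score {score:.3f} at {params}")
```

Sampling in log space is the usual choice when plausible values span several orders of magnitude, as step sizes and noise variances do here.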
C.4 Figure 4
Data are generated as a long continuous input stream $x_t$ and expected output stream $y_t$. One epoch was defined as 50,000 time steps. BPTT was unrolled for 7 time steps, and a batch size of 20 was used. 50 hidden units are used, with a tanh activation function and an MSE loss function. The same step size was used for node perturbation, feedback alignment and backpropagation. In this case node perturbation performance was more stable with stochastic gradient updates to $B$, instead of the exact least squares solution. Values for the step size of updates to $B$ and the noise variance were found by random hyperparameter search.