Learning in the Machine: the Symmetries of the Deep Learning Channel

Pierre Baldi (corresponding author), Peter Sadowski, and Zhiqin Lu
Department of Computer Science and Department of Mathematics, University of California, Irvine.
Abstract

In a physical neural system, learning rules must be local both in space and time. In order for learning to occur, non-local information must be communicated to the deep synapses through a communication channel, the deep learning channel. We identify several possible architectures for this learning channel (Bidirectional, Conjoined, Twin, Distinct) and six symmetry challenges: 1) symmetry of architectures; 2) symmetry of weights; 3) symmetry of neurons; 4) symmetry of derivatives; 5) symmetry of processing; and 6) symmetry of learning rules. Random backpropagation (RBP) addresses the second and third symmetry, and some of its variations, such as skipped RBP (SRBP), address the first and the fourth symmetry. Here we address the last two desirable symmetries, showing through simulations that they can be achieved and that the learning channel is particularly robust to symmetry variations. Specifically, random backpropagation and its variations can be performed with the same non-linear neurons used in the main input-output forward channel, and the connections in the learning channel can be adapted using the same algorithm used in the forward channel, removing the need for any specialized hardware in the learning channel. Finally, we provide mathematical results in simple cases showing that the learning equations in the forward and backward channels converge to fixed points, for almost any initial conditions. In symmetric architectures, if the weights in both channels are small at initialization, adaptation in both channels leads to weights that are essentially symmetric during and after learning. Biological connections are discussed.

1 Introduction

Backpropagation implemented in digital computers has been successful at addressing a host of difficult problems ranging from computer vision [35, 63, 61, 26] to speech recognition [24] in engineering, and from high energy physics [7, 54] to biology [19, 74, 2] in the natural sciences. Furthermore, recent results have shown that backpropagation is optimal in some sense [6]. However, backpropagation implemented in digital computers is not the real thing. It is merely a digital emulation of a learning process occurring in an idealized physical neural system. Thus thinking about learning in this digital simulation can be useful but also misleading, as it often obfuscates fundamental issues. Thinking about learning in physical neural systems or learning in the machine–biological or other–is useful not only for better understanding how specific or idealized machines can learn, but also to better understand fundamental, hardware-independent, principles of learning. And, in the process, it may occasionally also be useful for deriving new approaches and algorithms to improve the effectiveness of digital simulations and current applications.

Thinking about learning in physical systems first leads to the notion of locality [6]. In a physical system, a learning rule for adjusting synaptic weights can only depend on variables that are available locally in space and time. This in turn immediately identifies a fundamental problem for backpropagation in a physical neural system and leads to the notion of a learning channel. The critical equations associated with backpropagation show that the deep weights of an architecture must depend on non-local information, such as the targets. Thus a channel must exist for communicating this information to the deep synapses–this is the learning channel [6].

Depending on the hardware embodiment, several options are possible for implementing the learning channel. A first possibility is to use the forward connections in the reverse direction. A second possibility is to use two separate channels with different characteristics and possibly different hardware substrates in the forward and backward directions. These two cases will not be further discussed here. The third case we wish to address here is when the learning channel is a separate channel but it is similar to the forward channel, in the sense that it uses the same kinds of neurons, connections, and learning rules. Such a learning channel is faced with at least six different symmetry challenges: 1) symmetry of architectures; 2) symmetry of weights; 3) symmetry of neurons; 4) symmetry of derivatives; 5) symmetry of processing; and 6) symmetry of learning rules, where in each case the corresponding symmetry is in general either desirable (5-6) or undesirable (1-4).

In the next sections, we first identify the six symmetry problems and then show how they can be addressed within the formalism of simple neural networks. While biological neural networks remain the major source of inspiration for this work, the analyses derived are more general and not tied to neural computing in any particular substrate.

2 The Learning Channel and the Symmetry Problems

2.1 Basic Notation

Throughout this paper, we consider layered feedforward neural network architectures and supervised learning tasks. We will denote such an architecture by

$\mathcal{A}[N_0, N_1, \ldots, N_L] \qquad (1)$

where $N_0$ is the size of the input layer, $N_h$ is the size of hidden layer $h$, and $N_L$ is the size of the output layer. For simplicity, we assume that the layers are fully connected and let $w_{ij}^h$ denote the weight connecting neuron $j$ in layer $h-1$ to neuron $i$ in layer $h$. The output $O_i^h$ of neuron $i$ in layer $h$ is computed by:

$O_i^h = f_i^h(S_i^h), \quad \text{where} \quad S_i^h = \sum_j w_{ij}^h O_j^{h-1} \qquad (2)$

The transfer functions $f_i^h$ are usually the same for most neurons, with typical exceptions for the output layer, and usually are monotonic increasing functions. Typical functions used in artificial neural networks are: the identity, the logistic function, the hyperbolic tangent function, the rectified linear function, and the softmax function.
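To make the notation concrete, here is a minimal sketch of the forward pass of Equations 1-2, written in NumPy; the tanh transfer function, the layer sizes, and all names are illustrative choices of ours, not prescriptions from the text (biases are omitted for brevity):

```python
import numpy as np

def forward(weights, x):
    """Return the list of layer activities O^0, O^1, ..., O^L."""
    activations = [x]
    for W in weights:                     # W is the matrix (w_ij^h) of one layer
        activations.append(np.tanh(W @ activations[-1]))
    return activations

sizes = [784, 100, 100, 10]               # the architecture A[784, 100, 100, 10]
rng = np.random.default_rng(0)
weights = [rng.normal(0.0, 1.0 / np.sqrt(n), (m, n))   # scaled initialization
           for n, m in zip(sizes[:-1], sizes[1:])]
O = forward(weights, rng.normal(size=sizes[0]))
```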

We assume that there is a training set of $M$ examples consisting of input-target pairs $(I(t), T(t))$, with $t = 1, \ldots, M$. Here $I_j(t)$ refers to the $j$-th component of the $t$-th training input, and similarly for $T_j(t)$. In addition there is an error function $\mathcal{E}$ to be minimized by the learning process. In general, we will assume standard error functions, such as the squared error in the case of regression problems with identity transfer functions in the output layer, or relative entropy in the case of classification problems with logistic (two-class) or softmax (multi-class) transfer functions in the output layer, although this is not an essential point. The error function $\mathcal{E}$ is a differentiable function of the weights and its critical points are given by the equations $\partial \mathcal{E} / \partial w_{ij}^h = 0$.

2.2 Local Learning

In a physical neural system, learning rules must be local [6], in the sense that they can only involve variables that are available locally in both space and time, although for simplicity here we will focus primarily on locality in space. Thus typically, in the present formalism, a local learning rule for a deep layer is of the form:

$\Delta w_{ij}^h = F\big(O_i^h, O_j^{h-1}, w_{ij}^h\big) \qquad (3)$

while for the top layer:

$\Delta w_{ij}^L = F\big(T_i, O_i^L, O_j^{L-1}, w_{ij}^L\big) \qquad (4)$

assuming that the targets are local variables for the top layer. Hebbian learning [27] is a form of local learning. Deep local learning corresponds to stacking local learning rules in a feedforward neural network. Deep local learning using Hebbian learning rules was proposed by Fukushima [22] to train the neocognitron architecture, essentially a feedforward convolutional neural network inspired by the earlier neurophysiological work of Hubel and Wiesel [31]. However, in deep local learning, information about the targets cannot be propagated to the deep layers; therefore, in general, deep local learning cannot find solutions of the critical equations, and thus cannot succeed at learning complex functions in any optimal way.
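As a toy illustration (a hedged sketch of ours, not an algorithm from the text), stacking a simple Hebb-like instance of Equation 3 uses only locally available pre- and postsynaptic activities, which is precisely why no target information reaches the deep weights:

```python
import numpy as np

def deep_local_update(weights, activations, eta=1e-3):
    # One deep local learning step: every layer adapts with a purely local
    # Hebb-like rule; the specific F and the rate eta are illustrative.
    for h, W in enumerate(weights):
        pre, post = activations[h], activations[h + 1]
        W += eta * np.outer(post, pre)    # Delta w_ij^h = eta * O_i^h * O_j^{h-1}
    # Note: no target information ever reaches the deep layers, which is
    # exactly the limitation discussed above.
```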

2.3 The Learning Channel

Ultimately, for optimal learning, all the information required to reach a critical point of $\mathcal{E}$ must appear in the learning rule of the deep weights. Setting the gradient (or the backpropagation equations) to zero shows immediately that, in general, at a critical point all the deep synaptic weights must depend on the target or the error information, and this information is not available locally [6]. Thus, to enable efficient learning, there must exist a communication channel to communicate information about the targets or the errors to the deep weights. This is the deep learning channel or, in short, the learning channel. Note that the learning channel is different from the typical notion of “feedback.” Although feedback and learning may share the same physical connections, these refer in general to two different processes that often operate at very different time scales, the feedback being fast compared to learning.

In a learning machine, one must think about the physical nature of the channel. A first possibility is to use the forward connections in the reverse direction. This is unlikely to be the case in biological neural systems, in spite of known examples of retrograde transmission, as discussed later in Section 6. A second possibility is to use two separate channels with different characteristics and possibly different hardware substrates in the forward and backward directions. As a thought experiment, for instance, one could imagine using electrons in one direction, and photons in the other. Biology can easily produce many different types of cells, in particular of neurons, and conceivably it could use special kinds of neurons in the learning channel, different from all the other neurons. While this scenario is discussed in Section 6, in general it does not seem to be the most elegant or economical solution as it requires different kinds of hardware in each channel. In any case, regardless of biological considerations, we are interested here in exploring the case where the learning channel is as similar as possible to the forward channel, in the sense of being made of the same hardware, and not requiring any special accommodations. However, at the same time, we also want to get rid of any undesirable symmetry properties and constraints, as discussed below. This leads to six different symmetry challenges, four undesirable and two desirable ones.

2.4 The Symmetry Problems

Symmetry of Architectures [ARC]: Symmetry of architectures refers to having the exact same architecture in the forward and in the backward channel, with the same number of neurons in each hidden layer and the same connectivity. This corresponds to the Bidirectional, Conjoined, and Twin cases defined below. In the Bidirectional and Conjoined cases the Symmetry of Architectures is even stronger, in the sense that the same neurons are used in the forward and the backward channel. ARC is very constraining in a physical system, and it would be desirable if this constraint were unnecessary.

Symmetry of Weights (Transposition) [WTS]: This is probably the best-known symmetry. In the backpropagation equations, the weights in the learning channel are identical transposed copies of the weights in the forward network. This is a special and even stronger case of architectural symmetry. Furthermore, such a constraint would have to be satisfied not only at the beginning of learning, but would also have to be maintained at all times throughout any learning process. This poses a major challenge in any physical implementation, including biological ones, and may thus be considered undesirable. If symmetry of the weights is required, then a physical mechanism must be proposed by which such symmetry could be achieved. As we shall see, approximate symmetry can arise automatically under certain conditions.

Symmetry of Neurons (Correspondence) [NEU]: For any neuron $i$ in layer $h$, backpropagation computes a backpropagated error $B_i^h$. If $B_i^h$ is computed in a separate learning channel, how does the learning channel know that this variable corresponds to neuron $i$ in layer $h$ of the forward pathway? Thus there is a correspondence problem between variables computed in the learning channel and neurons in the forward channel. A desirable solution would have to address this question in a way that does not violate the locality principle and other constraints of a learning machine.

Symmetry of Derivatives (Derivatives Transport and Correspondence) [DER]: Each time a layer is traversed, each backpropagated error must be multiplied by the derivative of the activation of the corresponding forward neuron. Again, how does the learning channel, as a separate channel, know about all these derivatives, and which derivatives correspond to which neurons? A desirable solution would have to address this question in a way that does not violate the locality principle and other constraints of a learning machine.

Symmetry of Processing (Non-Linear vs Linear) [LIN]: The backpropagation equations are linear in the sense that they involve only multiplications and additions, but no non-linear transformations. Thus a straightforward implementation of backpropagation would require non-linear neurons in the forward channel and linear neurons in the learning channel. Having different kinds of neurons, or neurons that can operate in different regimes is possible, but not particularly elegant, and it would be desirable to be able to use the same neurons in both channels. Since non-linear neurons are necessary in the forward channel to implement non-linear input-output functions, the question we address here is whether we can have similar non-linear neurons in the learning channel.

Symmetry of Adaptations and Learning Rules [ADA]: Finally, in backpropagation, a neuron in the forward network adapts its incoming weights using the learning rule $\Delta w_{ij}^h = \eta B_i^h O_j^{h-1}$, where $O_j^{h-1}$ is the activity of the presynaptic neuron and $B_i^h$ is the postsynaptic backpropagated error. All the weights in the forward network evolve in time during learning. If the learning channel is made of the same kinds of neurons, shouldn't the weights in the learning channel adapt too, and preferably using a similar rule? This is desirable, since otherwise one must postulate the existence of at least two types of neurons or connections, those that adapt and those that do not, and use each type exclusively in the forward and in the backward channel respectively.

Other Symmetries: While the symmetries above are the major symmetries to be considered here, in a physical system there exist other properties that can be investigated for symmetry or similarity between the forward and the learning channel. Some of these will be considered too, but more briefly. For instance, are there similar kinds of noise and noise levels in both channels? Can dropout [60, 5] be used in both channels? Is the precision on the weights the same in both channels? Another asymmetry between the channels, left for future work, is that the forward channel has a target whereas the learning channel does not. Finally, it must be noted that in backpropagation neurons operate in fundamentally different ways in the forward and backward directions. In particular, in backpropagation the backpropagated error is never added to the input activation in order to trigger a neuronal response. Thus the standard backpropagation model assumes that neurons can distinguish the forward messages from the backward messages and react differently to each. While one can imagine plausible mechanisms for doing that, it may also be desirable to come up with models where the two kinds of messages are treated in the same way, and the backpropagated message is included in the total neuronal activation. A small step in this direction is taken in Section 3.6.

Solutions for the first four symmetry problems are provided to some extent by the study of random backpropagation and several of its variations [38, 4], which we now briefly describe.

3 Backpropagation, Random Backpropagation, and their Variants

3.1 Backpropagation (BP)

Standard backpropagation implements gradient descent on $\mathcal{E}$, and can be applied in a stochastic fashion on-line (or in mini-batches) or in batch form, by summing or averaging over all training examples. For a single example, omitting the example index $t$ for simplicity, the standard backpropagation learning rule is given by:

$\Delta w_{ij}^h = \eta B_i^h O_j^{h-1} \qquad (5)$

where $\eta$ is the learning rate, $O_j^{h-1}$ is the presynaptic activity, and $B_i^h = -\partial \mathcal{E} / \partial S_i^h$ is the backpropagated error. Using the chain rule, it is easy to see that the backpropagated error satisfies the recurrence relation:

$B_i^h = (f_i^h)' \sum_k w_{ki}^{h+1} B_k^{h+1} \qquad (6)$

with the boundary condition:

$B_i^L = T_i - O_i^L \qquad (7)$

Thus in backpropagation the errors are propagated backwards in an essentially linear fashion, using the transpose of the forward matrices, hence the symmetry of the weights, with a multiplication by the derivative of the corresponding forward activations every time a layer is traversed.
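For concreteness, a minimal sketch of Equations 5-7 for the tanh network of the earlier sketch (the indexing conventions and names are ours; the tanh derivative is computed as $1 - O^2$, and the top error $T - O^L$ is not multiplied by a derivative, as in Equation 7):

```python
import numpy as np

def backprop_errors(weights, activations, target):
    B = [target - activations[-1]]                 # boundary condition, Eq. 7
    for W, O in zip(weights[:0:-1], activations[-2:0:-1]):
        B.append((1 - O**2) * (W.T @ B[-1]))       # Eq. 6: transposed weights
    return B[::-1]                                 # [B^1, ..., B^L]

def bp_update(weights, activations, target, eta=1e-3):
    B = backprop_errors(weights, activations, target)
    for W, O, b in zip(weights, activations, B):
        W += eta * np.outer(b, O)                  # Eq. 5
```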

3.2 Random Backpropagation (RBP)

Standard random backpropagation [38] operates exactly like backpropagation except that the weights used in the backward pass are completely random and fixed. Thus the learning rule becomes:

$\Delta w_{ij}^h = \eta R_i^h O_j^{h-1} \qquad (8)$

where the randomly backpropagated error satisfies the recurrence relation:

$R_i^h = (f_i^h)' \sum_k c_{ki}^{h+1} R_k^{h+1} \qquad (9)$

where the weights $c_{ki}^{h+1}$ are random and fixed. The boundary condition at the top remains the same:

$R_i^L = T_i - O_i^L \qquad (10)$

Note that, as described, RBP solves the second symmetry problem (WTS), but not the other five symmetry problems.

3.3 Skipped Random Backpropagation (SRBP)

Skipped random backpropagation was introduced independently in [48, 4]. In its basic form, SRBP uses connections with random weights that run directly from the top layer to each deep neuron. In this case, the signal carried by the learning channel has the form:

$R_i^h = (f_i^h)' \sum_k c_{ik}^h \big(T_k - O_k^L\big) \qquad (11)$

where the $c_{ik}^h$ are fixed random weights. SRBP has been shown, both through simulations and mathematical analyses, to work well even in very deep networks. Furthermore, another important conclusion derived from the study of SRBP is that, when updating the weight $w_{ij}^h$, the only derivative information that matters is the derivative of the activation of neuron $i$ in layer $h$, and this information is available locally. Information about all the other derivatives, which is carried by the backpropagated signal in standard backpropagation, is not local and is not necessary for successful learning. Note, however, that omitting all the derivatives does not work [4].
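The following sketch contrasts the two randomly backpropagated error signals (Equations 8-11); the fixed random matrices stand in for the learning channel, and the function names and indexing conventions are illustrative assumptions of ours:

```python
import numpy as np

# C is a list of fixed random matrices: in rbp_errors, C[h] maps errors from
# layer h+2 down to layer h+1; in srbp_errors, C_skip[h] maps the top error
# directly to layer h+1.
def rbp_errors(C, activations, target):
    R = [target - activations[-1]]                 # Eq. 10
    for Ch, O in zip(C[::-1], activations[-2:0:-1]):
        R.append((1 - O**2) * (Ch @ R[-1]))        # Eq. 9: random fixed weights
    return R[::-1]

def srbp_errors(C_skip, activations, target):
    e = target - activations[-1]
    # Eq. 11: only the derivative of the local (postsynaptic) neuron appears.
    return [(1 - O**2) * (Ch @ e)
            for Ch, O in zip(C_skip, activations[1:-1])] + [e]
```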

3.4 Other Variants: Adaptation

Several other variants are considered in [4]. The most important ones for our purposes are the adaptive variants of RBP and SRBP, called ARBP and ASRBP. In these variants, the random weights of the learning channel are adapted using the product of the corresponding forward and backward signals, so that $\Delta c_{ik} = \eta\, O_i R_k$, where $R_k$ denotes the randomly backpropagated error arriving at the presynaptic end of the connection and $O_i$ denotes the activity of the corresponding forward neuron. While ARBP and ASRBP allow both channels to learn, and use rules that are roughly similar, these rules are not identical. This is because in ARBP and ASRBP, and all the previously described algorithms, propagation in the learning channel is linear, as opposed to the non-linear propagation in the forward channel. As a result, derivatives of activations appear in the learning rules for the forward weights, but not for the weights in the learning channel. In this work we also explore the case where the learning channel is non-linear too, and we modify its learning rule accordingly by including the derivatives of the activations in the learning channel.
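A sketch of this adaptive update for the learning-channel matrices, under the same conventions as the previous snippets (the indexing is our choice; the rule shown is the product of forward and backward signals described above):

```python
import numpy as np

# Adaptive variants (ARBP/ASRBP): each learning-channel matrix is updated
# with the product of the forward activity it reaches (postsynaptic in the
# backward direction) and the randomly backpropagated error it receives
# (presynaptic in the backward direction). Indexing follows the rbp_errors
# sketch above, where C[h] maps layer h+2 to layer h+1.
def adapt_learning_channel(C, activations, R, eta=1e-3):
    for h, Ch in enumerate(C):
        Ch += eta * np.outer(activations[h + 1], R[h + 1])  # Delta c = eta * O * R
```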

Figure 1: Representations of three different physical implementations of standard backpropagation (BP). Architecture 1 corresponds to the Bidirectional Case, where information can flow in both directions along the same connections. Architecture 2 corresponds to the Conjoined Case, where the architecture and the neurons in the learning channel are identical to those in the forward channel. Architecture 3 corresponds to the Twin Case, where the architecture of the learning channel is identical to the forward channel but the neurons are different. All the symmetry problems are evident in Architecture 3: How can the architecture of the learning channel be identical to the forward architecture? How can the weights in the learning channel be exactly symmetric to the weights in the forward channel? How can a backpropagated error computed in the learning channel precisely reach the corresponding neuron in the forward channel? How can the learning channel know about the derivatives of the activation functions in the forward channel? Why is the forward channel processing data in a non-linear fashion while the learning channel processes data in a linear fashion? And why is the forward channel adaptive but the learning channel is not?
Figure 2: Representations of physical implementations of standard random backpropagation (RBP) and skipped random backpropagation (SRBP) in the Conjoined Case. Standard RBP and SRBP address the WTS problem. The ARC, NEU, and DER problems are addressed automatically by the conjoined nature of the architecture.
Figure 3: Representations of physical implementations of standard random backpropagation (RBP) and skipped random backpropagation (SRBP) using a learning channel which is distinct from the forward channel (Twin Case).

3.5 Addressing the Symmetry Problems

We now review how these algorithms address some of the symmetry problems, but only partially, in relation to the corresponding architectures described in Figures 1-6. In these figures, the symbol $A$ represents the forward matrices, $A^T$ the transposes of the forward matrices, and $C$ the random matrices of the learning channel.

Figure 4: Random backpropagation with random matrices also connecting the learning channel to the corresponding layer in the forward channel (Non-Identical Twin Case, or Distinct Case). This version addresses both the symmetry of the weights problem and the neuronal correspondence problem. In addition, insights from SRBP show that only local information about the derivative of the activation function of the neuron under consideration for learning is needed (i.e., the derivatives in the upper layers are not needed). So this version can also address the issue of the transport of the derivatives from one channel to the other: such transport is not necessary. In this version, the learning channel can be run in a linear or non-linear fashion. The only two remaining symmetry problems that this version does not address are the symmetry in architectures and the symmetry in adaptability of the two channels.
Figure 5: Random backpropagation with random matrices connecting the learning channel to the corresponding layer in the forward channel, and random matrices connecting the forward channel to the corresponding layer in the learning channel (Non-Identical Twin Case and Distinct Case). This version solves the correspondence problem in the reverse direction, allowing the forward channel to provide “targets” for the learning channel. Thus the learning channel can adapt by using the exact same learning rule as the forward channel. The only symmetry problem that is not addressed is the symmetry in architectures.
Figure 6: This configuration addresses all six symmetry problems (Distinct Case). Not only can the learning channel have a different architecture, but it is also allowed to have skip connections of various kinds.
  • Architecture 1: This represents a physical implementation of BP where information can flow bidirectionally along the same connections (Figure 1). In this Bidirectional Case, the ARC, WTS, NEU, and ADA problems are solved by definition (assuming the “weight” on a connection is the same in both directions), and so is the DER problem. However the LIN problem is not addressed (neurons would have to operate differently in the two directions), and bidirectional flow is not possible in currently known physical implementations, including biology.

  • Architecture 2: This represents a physical implementation of BP where the same neurons are used in the forward and learning channels, with a separate, identical set of connections, exactly mirroring the forward connections, in the learning channel (Figure 1). This is the Conjoined Case. This implementation uses the transposed matrices and in essence corresponds to how BP is viewed when implemented in a digital computer. Such an implementation in a physical system faces major challenges in terms of ARC realization and WTS. If these were solved, then NEU and DER could also be solved as a byproduct of being conjoined. The ADA problem is addressed only if one can postulate a corresponding mechanism to maintain the weight symmetry at all times during learning. If the WTS problem is solved only at initialization, then the ADA problem is a challenge. Finally, this architecture does not address the LIN problem.

  • Architecture 3: This represents a physical implementation of BP using a set of neurons and connections in the learning channel that is clearly distinct from the neurons and connections in the forward channel (Figure 1). This is the Twin Case, where the architecture in the learning channel is identical to the forward architecture. This implementation is faced with all six symmetry challenges: ARC, WTS, NEU, DER, ADA, and LIN. The Identical Twin subcase corresponds to having a one-to-one map between neurons in the forward and learning channel, which solves the NEU problem. In a digital computer implementation, the Conjoined and Identical Twin Cases are essentially the same.

  • Architecture 4: This represents a physical implementation of RBP in the Conjoined Case, using the same neurons in the forward and the learning channel (Figure 2). Each forward connection is mirrored by a connection in the reverse direction, but the forward and backward connections have different weights. The weights on the backward connections are random and fixed. This also corresponds to the standard implementation of RBP in a digital computer and, as such, addresses the WTS challenge. The ARC and NEU symmetries are inherent in the Conjoined architecture, and the DER challenge can be addressed as a byproduct. (Without multiplication by the derivative of the activation functions, RBP does not seem to work.) The LIN and ADA challenges are not addressed by standard RBP. Simulations carried out in [4], however with no supporting theoretical results, show that if each random weight is adapted proportionally to the product of the forward signal (postsynaptic term in the backward direction) and the randomly backpropagated error (presynaptic term in the backward direction), then learning converges.

  • Architecture 5: This represents a physical implementation of SRBP (skipped RBP) in the Conjoined Case, using the same neurons in the forward and the learning channel (Figure 2). Each top neuron (where the error is computed) is connected to each deep neuron. The weights on the backward connections are random and fixed. This also corresponds to the standard implementation of SRBP in a digital computer and, as such, addresses the WTS challenge. The ARC and NEU symmetries are inherent in the Conjoined skipped architecture, and the DER challenge can be addressed as a byproduct. (Without multiplication by the derivative of the activation functions, SRBP does not seem to work.) Importantly, this implementation shows that when updating a forward weight, only the derivative of the activation of its postsynaptic neuron matters. All other derivatives can be ignored. The LIN and ADA challenges are not addressed by standard SRBP. Simulations carried out in [4], however with no supporting theoretical results, show that if each random weight is adapted proportionally to the product of the forward signal (postsynaptic term in the backward direction) and the randomly backpropagated error (presynaptic term in the backward direction), then learning converges.

  • Architecture 6: This represents a physical implementation of RBP using a set of neurons and connections in the learning channel that is clearly distinct from the neurons and connections in the forward channel (Figure 3). This is the Twin Case if the architecture is the same in both pathways, and the Identical Twin Case if there is a one-to-one correspondence between the neurons in each pathway. The WTS challenge is addressed by the random weights. However the NEU and DER challenges remain major challenges, even if the ARC challenge is fully addressed, and so are the LIN and ADA challenges. Nevertheless, simulations carried out in [4] show that the random connections can be adapted using the same algorithm described in Architecture 4.

  • Architecture 7: This represents a physical implementation of SRBP. It is identical to Architecture 6, except with skip connections in the learning channel (Figure 3). This is the Twin Case if the architecture is the same in both pathways, and the Identical Twin Case if there is a one-to-one correspondence between the neurons in each pathway. The WTS challenge is addressed by the random weights. However the NEU and DER challenges remain major challenges, even if the ARC challenge is fully addressed, and so are the LIN and ADA challenges. Nevertheless, simulations carried out in [4] show that the random connections can be adapted using the same algorithm described in Architecture 4.

  • Architecture 8: This represents a physical implementation of RBP (or similarly of SRBP) similar to Architecture 6 (Figure 4), corresponding to the Twin Case if the architectures in both pathways are identical. However, lateral connections with random weights are used to address the NEU challenge in one direction. The LIN and ADA problems remain as above.

  • Architectures 9 and 10: These represent physical implementations of RBP (or similarly of SRBP) similar to Architecture 6; however, connections with random weights are used to address the NEU problem in both directions (Figure 5). This can be for the Twin Case, or the more general Distinct Case, where the architecture of the learning channel is distinct and different (at least in terms of layer sizes) from the forward channel. Figure 6 is simply a variation in which the learning channel has some combination of standard and skip connections. One goal of this work is to address the LIN and ADA problems in this architecture, by using the same non-linear neurons in the forward channel and the learning channel, and by using the same learning rule, including the derivative of the local activation function, in both channels.

In summary, RBP directly solves the symmetry of weights problem (WTS), immediately showing that symmetry is not needed. However the plain RBP algorithm is computed on an architecture that mirrors the forward architecture and thus by itself it does not solve the first symmetry problem (ARC). This problem is solved by SRBP and RBP when the learning channel is implemented in a separate architecture (Distinct), which could even include a combination of SRBP and RBP connections.

RBP and SRBP also provide an elegant solution to the third symmetry problem, the correspondence problem (NEU). In particular, the learning channel does not have to know which neuron is which in a given layer of the forward network. It simply connects randomly to all of them. Finally, simulation studies in [4] show that only the derivative of the activation function of the neuron whose weights are being updated is necessary, which addresses the derivative transport problem (DER). This information is local, and information about the derivatives of the activations in the layers above is not necessary. Thus we are left essentially with the last two symmetry problems (LIN and ADA). Through simulations, we are going to show that it is possible to use the same non-linear neurons in both the forward channel and the learning channel and, in addition, that it is possible to let the weights in the learning channel adapt using the same learning rule as the forward weights. We will also be able to prove convergence results when both channels are adaptive, at least in some simplified cases.

3.6 Other Learning Rules (STDP)

As previously discussed, in most of the simulations and the mathematical results we use the learning rule:

$\Delta w = \eta\; O^{\mathrm{pre}}\; (f^{\mathrm{post}})'\; R^{\mathrm{post}} \qquad (12)$

for both channels, where $w$ represents the synaptic weight of a directed connection in either channel, $(f^{\mathrm{post}})'$ is the derivative of the activity of the postsynaptic neuron, $O^{\mathrm{pre}}$ is the presynaptic activity, and $R^{\mathrm{post}}$ is the postsynaptic signal. The presynaptic terms correspond to activity in the same channel as the weight $w$, whereas the postsynaptic terms correspond to activity originated in the opposite channel. This approach requires neurons to be able to make a distinction between signals received from the forward channel and the learning channel, and to be able to remember activities across different channel activations.

Other Hebbian or anti-Hebbian learning rules have been proposed, in connection with spike time dependent synaptic plasticity (STDP), based on the temporal derivative of the activity of the postsynaptic neuron [71]. These temporal derivatives could be used to encode error derivatives. Within the present framework, which uses non-spiking neurons, these learning rules rely on the product of the presynaptic activity times the rate of change of the postsynaptic activity:

$\Delta w = \eta\; O^{\mathrm{pre}}\; \frac{dO^{\mathrm{post}}}{dt} \qquad (13)$

with a negative sign in the anti-Hebbian case. For simplicity, we denote this kind of learning rule as a STDP rule, even if we do not use spiking neurons in this work. For a deep weight $w_{ij}^h$, we can write:

$\Delta w_{ij}^h = \eta\; O_j^{h-1}\; \frac{dO_i^h}{dt} \qquad (14)$

To establish a connection to error derivatives, it is easiest to consider the SRBP framework and consider that at time $t_1$:

$O_i^h(t_1) = f_i^h\big(S_i^h\big) \qquad (15)$

Now consider that at time $t_2$ the output is fed back, by the random connections in the learning channel, giving:

$O_i^h(t_2) = f_i^h\Big(S_i^h + \sum_k c_{ik}^h O_k^L\Big) \qquad (16)$

Finally, consider that at time $t_3$ the output is clamped to the target and fed back by the random connections in the learning channel, giving:

$O_i^h(t_3) = f_i^h\Big(S_i^h + \sum_k c_{ik}^h T_k\Big) \qquad (17)$

Then, provided the weights in the learning channel are small, we have:

$O_i^h(t_3) - O_i^h(t_2) \approx (f_i^h)'\big(S_i^h\big) \sum_k c_{ik}^h \big(T_k - O_k^L\big) \qquad (18)$

Thus the resulting learning rule in the forward channel, $\Delta w_{ij}^h = \eta\, O_j^{h-1}\,[O_i^h(t_3) - O_i^h(t_2)]$, is identical or very similar to SRBP. However, in the learning channel, the same reasoning leads to a different learning rule given by:

$\Delta c_{ik}^h = \eta\; O_k^L\; \big[O_i^h(t_3) - O_i^h(t_2)\big] \approx \eta\; O_k^L\; (f_i^h)' \sum_l c_{il}^h \big(T_l - O_l^L\big) \qquad (19)$

A similarly inspired rule can be derived also in the non-skipped case. Thus, in short, for completeness we will also present simulation results for this class of STDP learning rules, and derive a proof of convergence in a simple case (see Section 5.2).
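A sketch of this two-phase STDP-style update in the skipped setting, under the assumptions of Equations 15-19 (tanh units, fixed skip matrices; all function and variable names are ours):

```python
import numpy as np

def stdp_updates(weights, C_skip, x, target, eta=1e-3):
    # Forward pass (tanh units), keeping all layer activities.
    acts = [x]
    for W in weights:
        acts.append(np.tanh(W @ acts[-1]))
    out = acts[-1]
    # Deep layers: compare each layer's activity with the output fed back
    # (time t2) versus the target fed back (time t3) through C_skip.
    for h, (W, Ch) in enumerate(zip(weights[:-1], C_skip)):
        S = W @ acts[h]                          # forward activation of layer h+1
        post_t2 = np.tanh(S + Ch @ out)          # Eq. 16
        post_t3 = np.tanh(S + Ch @ target)       # Eq. 17
        W += eta * np.outer(post_t3 - post_t2, acts[h])   # forward channel (~SRBP)
        Ch += eta * np.outer(post_t3 - post_t2, out)      # learning channel, Eq. 19
    # Top layer: presynaptic activity times the clamped change at the output.
    weights[-1] += eta * np.outer(target - out, acts[-2])
```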

4 Simulations

In this section, various implementations of the learning channel are investigated through simulations on three benchmark classification tasks: the MNIST handwritten-digit data set [36], synthetic data sets of increasing complexity as in  [9], and the HIGGS data set from high-energy physics [7]. We start with the relatively easy MNIST task, then confirm some of the main results using the more difficult synthetic and HIGGS tasks.

4.1 MNIST Experiments

On the MNIST task, the forward channel consisted of 784 inputs, four fully-connected hidden layers of 100 tanh units, and a softmax output layer with 10 units. All weights were initialized by sampling from a scaled normal distribution [23], and bias terms were initialized to zero. Training was performed for 100 epochs using mini-batches of size 100, with no momentum and a fixed learning rate unless otherwise specified. Training was performed on 60,000 training examples and tested on 10,000 examples.

4.1.1 Non-Linearity in the Learning Channel

The following simulations investigate learning channels made from non-linear processing units. The learning channel is characterized by (1) the type of non-linear transfer function, (2) the architecture (Conjoined or Distinct), and (3) the algorithm (BP, RBP, or SRBP). We focus here on models that use the tanh non-linearity in both the forward channel and the learning channel, but other non-linearities are discussed.

In a Conjoined architecture, the error signal is propagated backwards through each neuron in the forward channel, and is modulated by the derivative of the transfer function. Our first simulation examines the effect of applying a non-linearity to the backpropagated error signal summation, immediately before it is multiplied by the transfer function derivative. Figure 7 shows that the performance of the BP, RBP, and SRBP algorithms does not suffer from this minor modification. Here, BP is represented by Architecture 2 in Figure 1, RBP by Architecture 4 in Figure 2, and SRBP by Architecture 5 in Figure 2.

It should be noted that the non-linearity has little effect when the error signal being propagated through the learning channel is small. So while we verified experimentally that the error signals sometimes fall in the non-linear regime at initialization and early in training, the impact of the non-linearity is small after the network fits the data. We also note that the behavior can be very different for other non-linearities. If the non-linearities in both the forward and the learning channels have a non-negative range, such as the logistic or rectifier functions, then both the neuron activities in the forward channel and the error signals are positive, leading to monotonically increasing weights and poor learning.

In a Distinct architecture, the learning channel consists of a completely separate set of neurons. These learning channel neurons propagate the error signal to the deep layers of the network, then laterally to the corresponding forward neurons via random lateral connections, parameterized by fixed matrices of random weights at each layer (Architecture 8 in Figure 4). In the SRBP version, skip connections propagate the error signal from the output to each layer of the learning channel, rather than a sequential chain. Figure 8 demonstrates that the model can reach perfect training accuracy with a Distinct architecture. This is true whether the neurons in the learning channel are tanh or linear (not shown), and whether the learning channel consists of 100 neurons at each layer or 10. A learning channel consisting of a single neuron at each layer leads to slow learning, but still appears to converge.
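A sketch of how the error signals could be computed in such a Distinct architecture (Architecture 8), with a separate stack of tanh units and fixed random lateral matrices; the sizes, shapes, and names are illustrative assumptions:

```python
import numpy as np

def distinct_channel_errors(V, Lat, activations, target):
    """V: learning-channel matrices (top to bottom); Lat: lateral matrices."""
    u = target - activations[-1]                  # error enters the channel
    errors = []
    for Vh, Lath, O in zip(V, Lat, activations[-2:0:-1]):
        u = np.tanh(Vh @ u)                       # non-linear propagation in the channel
        errors.append((1 - O**2) * (Lath @ u))    # random lateral projection
    return errors[::-1]                           # error signals for the deep layers
```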

Figure 7: MNIST training and validation performance trajectories, as a function of training epoch, for Conjoined architectures, trained with the three algorithms (BP, RBP, SRBP). For each algorithm, we compare the original algorithm to a variant where the tanh non-linearity is applied to the error signal at each neuron (lc-tanh).
Figure 8: MNIST training and validation performance trajectories, as a function of training epoch, for Distinct architectures. The forward channel consists of four hidden layers of 100 tanh neurons, while the learning channel consists of a completely separate set of tanh neurons (with 100, 10, or 1 neuron in each layer) and additional, random, lateral connections that propagate the error signals from the learning channel neurons to the forward channel neurons.

4.1.2 Dropout in the Learning Channel

The dropout algorithm is a common regularization method for neural network models. Here we demonstrate that dropout can also be applied to both the forward channel and the learning channel, where the learning channel consists of tanh neurons organized in a Conjoined or Distinct architecture. During training, the probability of dropping out a hidden neuron in the forward channel is controlled by a parameter $p$, and the activities of neurons that are not dropped are scaled by $1/(1-p)$; dropout in the learning channel is controlled by an analogous parameter $p_{\mathrm{LC}}$. At evaluation time, no dropout is used.
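A minimal sketch of this (inverted) dropout scheme, assuming the $1/(1-p)$ rescaling happens at training time; the parameter names are ours:

```python
import numpy as np

def dropout(a, p, rng):
    # Inverted dropout: drop with probability p, rescale kept units by 1/(1-p).
    mask = (rng.random(a.shape) >= p) / (1.0 - p)
    return a * mask

rng = np.random.default_rng(0)
O = dropout(np.tanh(rng.normal(size=100)), p=0.5, rng=rng)   # forward channel (p)
R = dropout(rng.normal(size=100), p=0.2, rng=rng)            # learning channel (p_LC)
```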

Figure 9 demonstrates the use of dropout on non-linear (tanh) learning channels in both Conjoined and Distinct architectures. Dropout is independently applied to all hidden layers in both the forward channel and the learning channel. As expected, dropout in the forward channel slowed learning, especially in RBP with a Conjoined architecture. However, the effect of dropout in the learning channel was small regardless of algorithm or architecture. From these results, it cannot be said whether dropout in the learning channel contributes to regularization, but it appears to interfere with learning less than dropout in the forward channel.

Figure 9: MNIST training and validation performance trajectories, as a function of training epoch, for Conjoined and Distinct architectures with a non-linear (tanh) learning channel trained with the three algorithms (BP, RBP, SRBP) and dropout. Dropout is applied to every layer, with probability $p$ in the forward channel and probability $p_{\mathrm{LC}}$ in the learning channel. The performance of the classifier was evaluated without dropout on the training and test set after every epoch.

4.1.3 Adaptation in the Learning Channel

In the simulations so far, randomly-initialized parameters in the learning channel remain constant while the parameters in the forward channel are trained. In this section, we investigate adaptive learning channels where the parameters are randomly-initialized but are updated during training according to the local learning rules defined in Section 2.

First we examine the Hebbian adaptive rule. Mathematical analysis on simple architectures suggests that the Hebbian adaptive versions of ARBP and ASRBP could converge to a minimum error solution for deep, linear, Conjoined architectures. Figure 10 demonstrates this behavior on a MNIST classifier with four hidden layers of linear neurons and a softmax output (no non-linearity is used in the learning channel either). In the case of ARBP, the intuition for why this works is that the learning channel matrices at each layer are updated in the same direction as the forward channel matrices, so after training they are approximately transposes of one another, and thus ARBP approximates BP.

The situation becomes more complicated with non-linearities, and our experiments demonstrate that adaptation in the learning channel sometimes prevents the system from converging to the minimum error solution. Figure 11 shows the performance of the two adaptive rules (Hebbian and STDP) on a Conjoined architecture with tanh units in the forward channel. The Hebbian ARBP algorithm initially learns quickly, but the weights in the learning channel grow faster than the weights in the forward channel, causing the activities in the forward channel to saturate and leading to a poor solution. This occurs even when the tanh non-linearity is used in the learning channel, and the learning channel weight updates are modulated by the derivative of that transfer function (not shown). In Hebbian ASRBP and the STDP adaptive rule, this problem is avoided and the classifier reaches 100% training accuracy.

In these experiments, we have investigated each of the six symmetries described in Section 2. The main conclusion is that the learning channel can be implemented in a number of ways that make use of random connections to transmit the error signals. In particular, the learning channel could be physically separate from the forward channel, have a distinct architecture, and could consist of non-linear, adaptive processing units. Next, we confirm these results on a more difficult classification task.

Figure 10: MNIST training and validation performance trajectories, as a function of training epoch, for a linear architecture (Conjoined) trained with standard backpropagation (BP), and the adaptive versions of RBP and SRBP with the Hebbian rule (ARBP and ASRBP). Performance does not reach 100% training accuracy because of the limitations of a linear architecture. The results demonstrate how Hebbian adaptation in deep linear networks leads to performance that is similar to BP.
Figure 11: MNIST training and validation performance trajectories, as a function of training epoch, for a Conjoined architecture with tanh units in the forward channel and linear adaptive units in the learning channel. The learning channel weights are updated according to either the Hebbian rule or the STDP rule.

4.2 Synthetic Data Experiments

Bianchini et al. [9] suggest a class of easily-visualized target functions in which the difficulty of the learning problem is parameterized by an integer $k$ that controls the topology of the decision boundaries. Specifically, they propose a recursively defined sequence of functions $f_k$ of two variables whose complexity grows with $k$. As illustrated in the top row of Figure 12 by plotting $f_k$ for $k = 0, 1, 2, 3$, the functions can be visualized as increasingly complex patterns of white and black regions in the two-dimensional plane. The complexity of these patterns can be quantified by the first two Betti numbers of the corresponding sets, $b_0$ (the number of connected components) and $b_1$ (the number of one-dimensional or “circular” holes). Thus, functions with larger values of $k$ have more holes.

Each function $f_k$ induces a classification learning problem with two-dimensional inputs and corresponding binary targets. Training examples are generated by sampling points in the two-dimensional plane. Figure 12 shows $f_k$ for $k = 0, 1, 2, 3$, as well as predictions of neural networks trained with the different learning rules. The same Conjoined neural network architecture was trained with every algorithm, and consisted of five layers of 500 hidden units per layer, followed by a single logistic output unit. Weights were initialized by sampling from a scaled uniform distribution [23], and bias terms were initialized to zero. Training was performed on mini-batches of 100 random examples, using stochastic gradient descent without momentum. The learning rate was initialized to 0.01 and decayed by a fixed multiplicative factor after each weight update. Training was stopped after 1.5 million iterations, or when the validation error increased by more than 1% over a 5000-iteration epoch. The only hyperparameter that was not constant across algorithms was the non-linear transfer function, which was the rectified linear function (ReLU) for all but the ARBP (STDP) and ASRBP (STDP) algorithms, which had better performance with tanh neurons.

The easier data sets (k=0 and k=1) are learned by all the algorithms with a high degree of accuracy. On the more difficult data sets (k=2 and k=3), the random backpropagation algorithms perform slightly worse than standard backpropagation for the fixed architecture and hyperparameters used here. However, all the algorithms learn these functions to a high degree of accuracy with additional hyperparameter tuning (not shown).

Figure 12: Synthetic classification data sets associated with the functions $f_k$. Row 1 visualizes these functions and how their complexity increases with $k$, for $k = 0, 1, 2, 3$. Subsequent rows show probabilistic predictions of a fixed neural network architecture trained with the adaptive and non-adaptive versions of RBP and SRBP. With additional hyperparameter tuning (not shown), all the algorithms are able to learn all the functions.

4.3 HIGGS Experiments

The HIGGS data set is a two-class classification task from high-energy physics [7]. Deep learning provides a significant boost in performance over shallow neural network architectures on this task, especially when the input is restricted to 21 low-level features. In the following experiments, the forward channel consists of the 21 low-level inputs, eight fully-connected hidden layers of 300 tanh units, and a single logistic output unit. Weights were initialized by sampling from a scaled normal distribution [23], and bias terms were initialized to zero. Training was performed for 100 epochs using mini-batches of size 100, a momentum factor of 0.9, and a fixed learning rate unless otherwise specified. Classifiers were trained on 10,000,000 examples and tested on 100,000 examples.

The results on Conjoined architectures agree with the results on MNIST. First, we verified that the use of the tanh non-linearity in the learning channel has minimal effect on the performance of BP, RBP, and SRBP (not shown). Figure 13 shows the results of BP, RBP, and SRBP along with the adaptive variants in Conjoined architectures. As on MNIST, the Hebbian ARBP algorithm learns quickly then decreases in accuracy as the learning channel weights grow too large. However, the other adaptive algorithms perform similarly to their non-adaptive variants, and perform better than a benchmark shallow neural network trained with BP.

The results on Distinct architectures with tanh neurons in both channels are shown in Figure 14. When the learning channel contains the same number of neurons as the forward channel, SRBP does just as well as in the Conjoined architecture, while RBP does slightly worse. As in the MNIST experiments, changing the number of neurons in the learning channel does not have a large effect on performance.

Figure 13: HIGGS training and validation performance trajectories, as a function of training epoch, for a Conjoined architecture with tanh neurons in the forward channel and linear neurons in the learning channel. The original BP, RBP, and SRBP are shown along with the adaptive variants of RBP and SRBP (Hebbian and STDP). The STDP variants train more slowly because they were trained with a smaller learning rate. As a benchmark, a shallow network consisting of a single hidden layer and trained with BP is also shown.
Figure 14: HIGGS training and validation performance trajectories, as a function of training epoch, for a Distinct architecture with tanh neurons in both the forward channel and the learning channel. The number of neurons in each hidden layer of the forward channel is 300, while the number of neurons in the learning channel is either 300 or 100. In these experiments, the weights in the learning channel were initialized from a normal distribution with a slightly smaller standard deviation than the one used in the forward layers.

5 Mathematical Analyses

5.1 General Considerations

The general strategy to try to derive more precise mathematical results is to proceed from simple architectures to more complex architectures, and from the linear case to the non-linear case. In the case of linear networks, when there is no adaptation in the learning channel, then RBP and SRBP are equivalent when there is only one hidden layer, or when all the layers have the same size. However when there is adaptation in the learning channel, then ARBP and ASRBP are equivalent when there is a single hidden layer, but not when there are multiple hidden layers, even if they are of the same size. When the learning channel is not adaptive, the differential equations for several kinds of networks were studied in [4]. Here we consider the case of adaptive learning channels.

For each linear network, under a set of standard assumptions, one can derive a set of non-linear (in fact polynomial) autonomous (independent of time), first order, ordinary differential equations (ODEs) for the average (or batch) time evolution of the synaptic weights under the ARBP or ASRBP algorithm. As soon as there is more than one variable and the polynomial system is non-linear, there is no general theory to understand the corresponding behavior. In fact, even in two dimensions, the problem of understanding the upper bound on the number and relative positions of the limit cycles of a system of the form $dx/dt = P(x, y)$ and $dy/dt = Q(x, y)$, where $P$ and $Q$ are polynomials of degree $n$, is open; this is Hilbert's 16th problem in the field of dynamical systems.

When considering the specific systems arising from the ARBP/ASRBP learning equations, one must first prove that these systems have a long-term solution. Note that polynomial ODEs may not have long-term solutions in general (e.g. $dx/dt = x^2$, with $x(0) = x_0 > 0$, does not have long-term solutions for $t \geq 1/x_0$) but, if the trajectories are bounded, then long-term solutions exist. We are particularly interested in long-term solutions that converge to a fixed point, as opposed to limit cycles or other behaviors.

A number of interesting cases can be reduced to a first-order, autonomous, differential equation $dx/dt = f(x)$ in one dimension, for which long-term existence and convergence to fixed point theorems can be derived. In general we will assume that $f$ is locally Lipschitz over the domain of interest, that is, for every $x_0$ in the domain there is a neighborhood of $x_0$ such that for any pair of points $x_1$, $x_2$ in this neighborhood the function satisfies the Lipschitz condition:

$|f(x_1) - f(x_2)| \leq K\, |x_1 - x_2| \qquad (20)$

for some constant $K$. The local Lipschitz condition implies that $f$ is continuous at $x_0$, but not necessarily differentiable at $x_0$. On the other hand, if $f$ is continuously differentiable at $x_0$ then it is locally Lipschitz.

The fundamental theorem of ordinary differential equations states that if $f$ is locally Lipschitz around an initial condition of the form $x(t_0) = x_0$, then for some value $\epsilon > 0$ there exists a unique solution to the initial value problem on the interval $(t_0 - \epsilon, t_0 + \epsilon)$. If $x^*$ is a fixed point, i.e. $x^*$ is a root of $f$ ($f(x^*) = 0$), and $f$ is locally Lipschitz over an entire neighborhood of $x^*$, then the qualitative behavior of the trajectories of the differential equation with a starting point near $x^*$ can easily be understood simply by inspecting the sign of $f$ around $x^*$. We give two slightly different versions of a resulting theorem that will be used in the following analyses.

Theorem 1: Let $dx/dt = f(x)$ and assume $f$ is a continuously differentiable function defined over an open, closed, or semi-open interval $(a, b)$ of the real line, where $a$ or $b$ can be finite or infinite. Assume that $f$ has a finite number of roots $x_1 < x_2 < \cdots < x_k$ ($k \geq 1$). Then for any starting point in $[x_1, x_k]$ the trajectory converges to one of the roots. The result remains true for any starting point in $(a, x_1]$ provided $f > 0$ on $(a, x_1)$. Likewise the result remains true for any starting point in $[x_k, b)$ provided $f < 0$ on $(x_k, b)$. If the domain of $f$ consists of multiple disjoint intervals, then the theorem can be applied to each interval. Furthermore, for any root of $f$ (fixed point), its stability is determined immediately by inspecting the sign of $f$ to the left and right of the root. In particular, the sign pattern $(+, +)$ corresponds to an attractor on the left, unstable on the right; $(-, -)$ corresponds to unstable on the left, attractor on the right; $(+, -)$ corresponds to an attractor; and $(-, +)$ corresponds to an unstable fixed point. Finally, if $f$ is a polynomial of odd degree with negative leading coefficient, then $f$ satisfies all the conditions above and $x$ always converges to a fixed point.

Theorem 1 (alternative version): We consider the extended real line $\overline{\mathbb{R}} = \mathbb{R} \cup \{-\infty, +\infty\}$ (with the base of the topology given by the open intervals $(a, b)$, $[-\infty, a)$, and $(b, +\infty]$, where $a, b \in \mathbb{R}$). Let:

$Z = \{x \in \mathbb{R} \mid f(x) = 0\} \qquad (21)$

be the set of roots of $f$. Let $dx/dt = f(x)$ be a first order differential equation in one dimension, where $f$ is locally Lipschitz on $\mathbb{R}$. We assume that

  1. in a neighborhood of $+\infty$, $f < 0$;

  2. in a neighborhood of $-\infty$, $f > 0$.

Then for any initial value $x(0) = x_0$, the system has a long-term solution and is convergent to one of the roots of $f$. Note that the local Lipschitz condition is automatically satisfied when $f$ is continuously differentiable on the real line. Furthermore, the theorem is valid even when $f$ has infinitely many roots.

Proof: The proof of either version of this theorem is easily derived from the fundamental theorem of ODEs and can easily be visualized by plotting the function $f$.
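A quick numerical illustration of Theorem 1 for an odd-degree polynomial with negative leading coefficient (forward-Euler integration; the particular polynomial is an arbitrary example of ours):

```python
# Trajectories of dx/dt = f(x) converge to a root of f; the attractors are
# the roots where f changes sign from + to -.
f = lambda x: -(x - 1.0) * x * (x + 2.0)     # roots at -2, 0, 1
for x0 in (-3.0, -1.0, 0.5, 3.0):
    x, dt = x0, 1e-3
    for _ in range(200_000):
        x += dt * f(x)
    print(x0, "->", round(x, 4))             # lands on -2 or 1 (0 is unstable)
```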

Finally, in terms of notation, the matrices in the forward channel are denoted by $A_1, A_2, \ldots$, and the matrices in the learning channel are denoted by $C_1, C_2, \ldots$. Theorems are stated in concise form and additional important facts are contained in the proofs.

5.2 The Simplest Linear Chain: $\mathcal{A}[1, 1, 1]$

Derivation of the System (ARBP=ASRBP): The simplest case corresponds to a linear $\mathcal{A}[1, 1, 1]$ architecture (Figure 15). Let us denote by $a_1$ and $a_2$ the weights in the first and second layer, and by $c_1$ the weight of the learning channel. In this case, we have $O(t) = a_1 a_2 I(t)$ and the learning equations are given by:

$\Delta a_1 = \eta\, c_1 (T - O) I, \quad \Delta a_2 = \eta\, (T - O)\, a_1 I, \quad \Delta c_1 = \eta\, (T - O)\, a_1 I \qquad (22)$

When averaged over the training set:

$E(\Delta a_1) = \eta\, c_1 (\alpha - \beta a_1 a_2), \quad E(\Delta a_2) = E(\Delta c_1) = \eta\, a_1 (\alpha - \beta a_1 a_2) \qquad (23)$

where $\alpha = E(IT)$ and $\beta = E(I^2)$. With the proper scaling of the learning rate (letting $\eta \to 0$ while rescaling time accordingly), this leads to the non-linear system of coupled differential equations for the temporal evolution of $a_1$, $a_2$, and $c_1$ during learning:

$\frac{da_1}{dt} = c_1 (\alpha - \beta a_1 a_2), \quad \frac{da_2}{dt} = a_1 (\alpha - \beta a_1 a_2), \quad \frac{dc_1}{dt} = a_1 (\alpha - \beta a_1 a_2) \qquad (24)$

Note that the dynamic of the product $P = a_1 a_2$ is given by:

$\frac{dP}{dt} = (a_1^2 + a_2 c_1)(\alpha - \beta P) \qquad (25)$

The error is given by:

$\mathcal{E} = \frac{1}{2} E\big[(T - a_1 a_2 I)^2\big] = \frac{1}{2} E(T^2) - \alpha P + \frac{\beta}{2} P^2 \qquad (26)$

and:

$\frac{d\mathcal{E}}{dt} = (\beta P - \alpha)\frac{dP}{dt} = -(\alpha - \beta P)^2 (a_1^2 + a_2 c_1) \leq 0 \qquad (27)$

the last inequality requiring $a_1^2 + a_2 c_1 \geq 0$.

Theorem 2: Starting from any initial conditions, the system converges to a fixed point, corresponding to a global minimum of the quadratic error function. All the fixed points are located on the hyperbolas given by $a_1 a_2 = \alpha/\beta$ and are global minima of the error function. For any starting point, the system reduces to a one-dimensional differential equation $dc_1/dt = \varphi(c_1)$, where $\varphi$ satisfies the conditions of Theorem 1 and its leading term is a third degree monomial with negative coefficient $-\beta$. As a result, $c_1$ converges to a root of $\varphi$, $a_2$ converges to $c_1 - C$, and $a_1$ converges to a value satisfying $a_1^2 = c_1^2 - C'$ (the constants $C$ and $C'$ are defined in the proof). Thus if the initial conditions of $a_2$ and $c_1$ are close, they will converge to similar values after learning.

Proof: In this case, the critical points for $a_1$ and $a_2$ are given by:

$a_1 a_2 = \frac{\alpha}{\beta} \qquad (28)$

which corresponds to two hyperbolas in the two-dimensional $(a_1, a_2)$ plane, located in the first and third quadrants for $\alpha/\beta > 0$. Note that these critical points do not depend on the feedback weight $c_1$. All these critical points correspond to global minima of the error function $\mathcal{E}$. Now note that the differential equations for $a_2$ and $c_1$ are identical:

$\frac{da_2}{dt} = \frac{dc_1}{dt} = a_1 (\alpha - \beta a_1 a_2), \quad \text{so that} \quad c_1 = a_2 + C \qquad (29)$

where $C$ is a constant depending only on the initial conditions. In addition:

$\frac{d(a_1^2)}{dt} = \frac{d(c_1^2)}{dt} = 2 a_1 c_1 (\alpha - \beta a_1 a_2) \qquad (30)$

resulting in:

$c_1^2 = a_1^2 + C' \qquad (31)$

where $C'$ depends only on the initial conditions. Thus, by substituting these relations in the differential equation for $c_1$, it is easy to see that it has the form $dc_1/dt = \varphi(c_1)$, where $\varphi$ is a function that satisfies Theorem 1 and its leading term is a monomial of degree 3 in $c_1$ with a negative leading coefficient equal to $-\beta$ (we exclude the trivial case $\beta = E(I^2) = 0$, corresponding to a single input equal to 0). Thus $c_1$ is convergent to a fixed point, and so are $a_1$ and $a_2$, and they converge to the values given in the theorem.
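The behavior established in Theorem 2 can be checked numerically by forward-Euler integration of the averaged system (24); the constants $\alpha = E(IT)$, $\beta = E(I^2)$ and the initial conditions below are arbitrary choices:

```python
alpha, beta = 1.0, 2.0
a1, a2, c1 = 0.1, -0.2, 0.3
dt = 1e-3
for _ in range(500_000):
    D = alpha - beta * a1 * a2
    a1, a2, c1 = a1 + dt * c1 * D, a2 + dt * a1 * D, c1 + dt * a1 * D
print(a1 * a2, alpha / beta)   # lands on the hyperbola a1*a2 = alpha/beta
print(a2 - c1)                 # conserved along the trajectory (Equation 29)
```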

Derivation of the System (STDP rule): This learning rule corresponds to the system:

$\Delta a_1 = \eta\, I\, c_1 (T - O), \quad \Delta a_2 = \eta\, a_1 I\, (T - O), \quad \Delta c_1 = \eta\, O\, c_1 (T - O) \qquad (32)$

with $O = a_1 a_2 I$. As usual, taking expectations, this leads to the system of differential equations:

$\frac{da_1}{dt} = c_1 (\alpha - \beta a_1 a_2), \quad \frac{da_2}{dt} = a_1 (\alpha - \beta a_1 a_2), \quad \frac{dc_1}{dt} = c_1 a_1 a_2 (\alpha - \beta a_1 a_2) \qquad (33)$

where $\alpha = E(IT)$, $\beta = E(I^2)$, and $\gamma = E(T^2)$ ($\gamma$ is not needed in this version).
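A forward-Euler integration of the STDP system (33); the run below illustrates the positivity of $c_1$ when $c_1(0) > 0$ and the conservation of $\ln c_1 - a_2^2/2$, two facts used below in the proof of Theorem 2' (constants and initial conditions are arbitrary choices):

```python
import math

alpha, beta = 1.0, 2.0
a1, a2, c1 = 0.5, 0.1, 0.4
inv0 = math.log(c1) - a2**2 / 2
dt = 1e-3
for _ in range(1_000_000):
    D = alpha - beta * a1 * a2
    a1, a2, c1 = a1 + dt * c1 * D, a2 + dt * a1 * D, c1 + dt * c1 * a1 * a2 * D
print(a1 * a2, alpha / beta)            # converges to alpha/beta
print(math.log(c1) - a2**2 / 2 - inv0)  # approximately conserved (~0)
```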

Theorem 2': Starting from any initial conditions, the system has long-term existence and converges to a fixed point, corresponding to a global minimum of the quadratic error function. All the fixed points are located on the hyperbolas given by $a_1 a_2 = \alpha/\beta$ and are global minima of the error function.

Proof: We assume that $[0, T_{\max})$ is the maximal time interval of existence of the solution. Of course, we need to prove that $T_{\max} = +\infty$. For contradiction, assume that $T_{\max} < +\infty$.

If $c_1(0) = 0$, then $c_1(t) = 0$ at all times. From the first equation, $a_1$ is then a constant. The second equation becomes:

$\frac{da_2}{dt} = a_1 (\alpha - \beta a_1 a_2) \qquad (34)$

If $a_1 = 0$, then $a_2$ is a constant as well. If $a_1 \neq 0$, then:

$a_2(t) = \frac{\alpha}{\beta a_1} + \Big( a_2(0) - \frac{\alpha}{\beta a_1} \Big)\, e^{-\beta a_1^2 t} \qquad (35)$

which is also convergent.

The system is invariant under the transformation $(a_1, a_2, c_1) \to (-a_1, -a_2, -c_1)$. Therefore, in order to prove the theorem, from now on, we only need to assume that $c_1(0) > 0$.

We observe that when $c_1(0) > 0$, then the function $c_1(t)$ is always positive. This is because:

$c_1(t) = c_1(0)\, \exp\Big( \int_0^t a_1 a_2 (\alpha - \beta a_1 a_2)\, ds \Big) > 0 \qquad (36)$

We also observe that if $\alpha - \beta a_1(0) a_2(0) = 0$, then the system has a constant solution, and hence is convergent. So we only need to consider the case where $\alpha - \beta a_1(0) a_2(0) \neq 0$. By the uniqueness of the ODE solutions, $D = \alpha - \beta a_1 a_2$ will not change sign along the solutions.

We first prove that $a_1$ is bounded. From the first equation and the facts that $c_1 > 0$ and that $D$ does not change sign, we conclude that $a_1$ is always monotonic. Thus if $a_1$ is unbounded, then either $a_1 \to +\infty$ and $D > 0$, or $a_1 \to -\infty$ and $D < 0$.

We consider the equation:

$\frac{da_2}{dt} = a_1 (\alpha - \beta a_1 a_2) \qquad (37)$

If $a_1 \to +\infty$ and $D > 0$, or $a_1 \to -\infty$ and $D < 0$, then:

$\beta a_1 a_2 < \alpha \ \ \text{(first case)}, \qquad \beta a_1 a_2 > \alpha \ \ \text{(second case)} \qquad (38)$

so that, dividing by $a_1$, $a_2$ is eventually bounded from above. From Equation 37, we know that $a_2$ is monotonically increasing for sufficiently large $t$, since in both cases $a_1 D$ is eventually positive. Therefore $a_2$ is convergent; in particular $a_2$ is bounded. Thus we have:

$c_1(t) = c_1(0)\, e^{(a_2^2(t) - a_2^2(0))/2} \qquad (39)$

(the integrand in Equation 36 is $a_2\, da_2/dt$) and $c_1$ is also bounded. Integrating Equation 37, we obtain:

$a_2(t) = a_2(0) + \int_0^t a_1 (\alpha - \beta a_1 a_2)\, ds \qquad (40)$

Since $a_2$ is convergent and the integrand has constant sign, we have:

$\int_0^{+\infty} \big| a_1 (\alpha - \beta a_1 a_2) \big|\, ds < +\infty \qquad (41)$

Since $1/|a_1|$ is bounded for $t$ large enough, we have:

$\int_0^{+\infty} \big| \alpha - \beta a_1 a_2 \big|\, ds < +\infty \qquad (42)$

Using the first equation, we have:

$|a_1(t)| \leq |a_1(0)| + \sup_s c_1(s) \int_0^t \big| \alpha - \beta a_1 a_2 \big|\, ds \qquad (43)$

and $a_1$ is bounded, contradicting the assumption that $a_1$ is unbounded. Therefore $a_1$ is bounded and, being monotonic, convergent.

Next, we prove the boundedness and convergence of $a_2$ and $c_1$. We observe that:

$\ln c_1(t) - \frac{a_2^2(t)}{2} = \ln c_1(0) - \frac{a_2^2(0)}{2} \qquad (44)$

since, by the second and third equations, both sides have the same time derivative $a_1 a_2 (\alpha - \beta a_1 a_2)$ and the same initial value. Since $a_2^2 \geq 0$, Equation 44 shows that $c_1$ is bounded from below by the positive constant $c_1(0)\, e^{-a_2^2(0)/2}$. Since $a_1$ is monotonic and bounded, it must be convergent, and thus, by the first equation, the expression:

$\int_0^{+\infty} c_1 (\alpha - \beta a_1 a_2)\, ds = \lim_{t \to +\infty} a_1(t) - a_1(0) \qquad (45)$

must be convergent as well. Since the integrand has constant sign and $c_1$ is bounded from below away from zero, it follows that:

$\int_0^{+\infty} \big| \alpha - \beta a_1 a_2 \big|\, ds < +\infty \qquad (46)$

From Equation 37, this implies the boundedness and convergence of $a_2$, since:

$|a_2(t) - a_2(s)| \leq \sup_u |a_1(u)| \int_s^t \big| \alpha - \beta a_1 a_2 \big|\, du \qquad (47)$

By Equation 44, we conclude that $c_1$ is also bounded and convergent. Moreover, since $\alpha - \beta a_1 a_2$ is integrable and convergent, it must converge to 0, so that the limit lies on the hyperbolas $a_1 a_2 = \alpha/\beta$, which are global minima of the error function.

Since the bounds obtained above are independent of $T_{\max}$, the system has long-term existence (which means $T_{\max} = +\infty$) and is convergent.

Figure 15: Left: the $\mathcal{A}[1, 1, 1]$ architecture. The weights $a_1$ and $a_2$ are adjustable, and so is the weight $c_1$ in the learning channel. Right: the $\mathcal{A}[1, 1, 1, 1]$ architecture in the ARBP and ASRBP cases. The weights $a_1$, $a_2$, and $a_3$ are adjustable, and so are the weights $c_1$ and $c_2$ in the learning channel.

5.3 Adding Depth: the Linear Chain $\mathcal{A}[1, 1, 1, 1]$.

Derivation of the System (ARBP): In the case of a linear $\mathcal{A}[1, 1, 1, 1]$ architecture, for notational simplicity, let us denote by $a_1$, $a_2$, and $a_3$ the forward weights, and by $c_1$ and $c_2$ the weights of the learning channel (the index of each learning-channel weight is equal to the index of its target layer). In this case, we have $O(t) = a_1 a_2 a_3 I(t)$. The learning equations are:

$\Delta a_1 = \eta\, c_1 c_2 (T - O) I, \quad \Delta a_2 = \eta\, c_2 (T - O)\, a_1 I, \quad \Delta a_3 = \eta\, (T - O)\, a_1 a_2 I,$
$\Delta c_1 = \eta\, c_2 (T - O)\, a_1 I, \quad \Delta c_2 = \eta\, (T - O)\, a_1 a_2 I \qquad (48)$

As usual, by averaging over the training set, using a small learning rate, and letting $D = \alpha - \beta a_1 a_2 a_3$ (with $\alpha = E(IT)$ and $\beta = E(I^2)$), this gives the system of coupled ordinary differential equations:

$\frac{da_1}{dt} = c_1 c_2 D, \quad \frac{da_2}{dt} = a_1 c_2 D, \quad \frac{da_3}{dt} = a_1 a_2 D, \quad \frac{dc_1}{dt} = a_1 c_2 D, \quad \frac{dc_2}{dt} = a_1 a_2 D \qquad (49)$

The dynamic of the product $P = a_1 a_2 a_3$ is given by:

$\frac{dP}{dt} = \big(a_2 a_3 c_1 c_2 + a_1^2 a_3 c_2 + a_1^2 a_2^2\big)(\alpha - \beta P) \qquad (50)$

Theorem 3: Starting from any initial conditions, the system converges to a fixed point, corresponding to a global minimum of the quadratic error function. All the fixed points are located on the manifold given by $a_1 a_2 a_3 = \alpha/\beta$ and are global minima of the error function. Along any trajectory, $c_1 = a_2 + C_1$ and $c_2 = a_3 + C_2$, where $C_1$ and $C_2$ are constants that depend only on the initial conditions. Thus if $C_1$ and $C_2$ are small, $c_1 \approx a_2$ and $c_2 \approx a_3$ at all times during learning. Along any trajectory, $a_i^2$ is a quadratic function of $c_i$ ($a_i^2 = c_i^2 - K_i$ for $i = 1, 2$). The system can be reduced to the one-dimensional system $da_3/dt = \varphi(a_3)$, where $\varphi$ satisfies the conditions of Theorem 1 and its leading term is a monomial of degree seven with a negative coefficient. Thus $a_3$ converges to one of the roots of $\varphi$, and the other variables also converge to fixed values that can be determined from $a_3$.

Proof: The system is solved by noting that:

$\frac{da_{i+1}}{dt} = \frac{dc_i}{dt}, \quad \text{so that} \quad c_i = a_{i+1} + C_i \qquad (51)$

for $i = 1, 2$. In addition:

$\frac{d(a_i^2)}{dt} = \frac{d(c_i^2)}{dt} \qquad (52)$

for $i = 1, 2$, and thus:

$c_i^2 = a_i^2 + K_i \qquad (53)$

for $i = 1, 2$, where the constants $C_i$ and $K_i$ depend only on the initial conditions. By substituting these relations in the differential equation for $a_3$, we see that it has the form $da_3/dt = \varphi(a_3)$, where $\varphi$ satisfies the conditions of Theorem 1 and its leading term is a monomial of degree 7 with a negative leading coefficient. Thus $a_3$ must converge to a root of $\varphi$, and therefore the other variables also converge to a fixed point. Note that the weights $c_1$ and $c_2$ track the weights $a_2$ and $a_3$, and if the initial differences are small, the differences between the final values are equally small.

Derivation of the System (ASRBP): In the skipped case, the learning equations become:

$\Delta a_1 = \eta\, c_1 (T - O) I, \quad \Delta a_2 = \eta\, c_2 (T - O)\, a_1 I, \quad \Delta a_3 = \eta\, (T - O)\, a_1 a_2 I,$
$\Delta c_1 = \eta\, (T - O)\, a_1 I, \quad \Delta c_2 = \eta\, (T - O)\, a_1 a_2 I \qquad (54)$

With the usual assumptions, this leads to the system of coupled ordinary differential equations:

$\frac{da_1}{dt} = c_1 D, \quad \frac{da_2}{dt} = a_1 c_2 D, \quad \frac{da_3}{dt} = a_1 a_2 D, \quad \frac{dc_1}{dt} = a_1 D, \quad \frac{dc_2}{dt} = a_1 a_2 D \qquad (55)$

where $D = \alpha - \beta a_1 a_2 a_3$. The dynamic of the product $P = a_1 a_2 a_3$ is given by:

$\frac{dP}{dt} = \big(a_2 a_3 c_1 + a_1^2 a_3 c_2 + a_1^2 a_2^2\big)(\alpha - \beta P) \qquad (56)$

Theorem 3': Starting from almost any set of initial conditions (except for a set of measure 0), the system converges to a fixed point, corresponding to a global minimum of the quadratic error function. All the fixed points are located on the manifold given by $a_1 a_2 a_3 = \alpha/\beta$ and are global minima of the error function. Along any trajectory (