Supervised Deep Similarity Matching


Abstract

We propose a novel biologically-plausible solution to the credit assignment problem, motivated by observations in the ventral visual pathway and in trained deep neural networks. In both, representations of objects in the same category become progressively more similar, while objects belonging to different categories become less similar. We use this observation to motivate a layer-specific learning goal in a deep network: each layer aims to learn a representational similarity matrix that interpolates between those of the previous and later layers. We formulate this idea using a supervised deep similarity matching cost function and derive from it deep neural networks with feedforward, lateral and feedback connections, and neurons that exhibit biologically-plausible Hebbian and anti-Hebbian plasticity. Supervised deep similarity matching can be interpreted as an energy-based learning algorithm, but with significant differences from others in how a contrastive function is constructed.


1 Introduction

Synaptic plasticity is generally accepted to be the underlying mechanism of learning in the brain, which almost always involves a large population of neurons and synapses across many different brain regions. How the brain modifies and coordinates individual synapses in the face of the limited information available to each synapse in order to achieve a global learning task, the credit assignment problem, has puzzled scientists for decades. A major effort in this domain has been to look for a biologically-plausible implementation of the back-propagation of error algorithm (BP) Rumelhart et al. (1986), which has long been disputed due to its biological implausibility Crick (1989), although recent work has made progress in resolving some of these concerns Xie and Seung (2003); Lee et al. (2015); Lillicrap et al. (2016); Nøkland (2016); Scellier and Bengio (2017); Guerguiev et al. (2017); Whittington and Bogacz (2017); Sacramento et al. (2018); Richards and Lillicrap (2019); Whittington and Bogacz (2019).

Figure 1: Supervised learning via layer-wise similarity matching. For inputs of different categories, similarity matching progressively differentiates the representations (top), while for objects of the same category, representations become more and more similar (middle). For a given set of training data and their corresponding labels, the training process can be regarded as learning hidden representations whose similarity matrices match those of both the input and the output (bottom). The tuning of representational similarity is indicated by the springs, with the constraint that the input and output similarity matrices are fixed.

In this paper, we present a novel approach to the credit assignment problem, motivated by observations on the nature of hidden layer representations in the ventral visual pathway of the brain and in deep neural networks. In both, representations of objects belonging to different categories become less similar, while those of objects in the same category become more similar Grill-Spector and Weiner (2014). In other words, categorical clustering of representations becomes more and more explicit in later layers, Fig.1. These results suggest a new approach to the credit assignment problem: by assigning each layer a layer-local similarity matching task, whose goal is to learn a representational similarity matrix intermediate between those of the previous and later layers, we may be able to dispense with the need for backward propagation of errors, Fig.1. The similarity matching principle has recently been used to derive various biologically plausible unsupervised learning algorithms Pehlevan and Chklovskii (2019), such as principal subspace projection Pehlevan and Chklovskii (2015), blind source separation Pehlevan et al. (2017), feature learning Obeid et al. (2019), manifold learning Sengupta et al. (2018) and classification Genkin et al. (2019). Motivated by this idea and by previous observations that error signals can be implicitly propagated via changes of neural activities Hinton and McClelland (1988); Scellier and Bengio (2017), we introduce biologically plausible supervised learning algorithms based on the principle of similarity matching that use Hebbian/anti-Hebbian learning rules.

We present two different algorithms. In the first algorithm, which we call “Supervised Similarity Matching” (SSM), the supervised signal is introduced to the network by clamping the output neurons to the desired states. The hidden layers learn intermediate representations between their previous and later layers by optimizing a layer-local similarity matching objective function. Optimizing a dual minimax formulation of this problem by an alternating gradient ascent-descent Pehlevan et al. (2018) leads to an algorithm that can be mapped to the operation of a neural network with local Hebbian learning rules for the feedforward and feedback weights, and anti-Hebbian learning rules for the lateral weights of the hidden layers. To ensure that the desired output is a fixed point of the neural dynamics when outputs are not clamped, a necessity for a correct prediction, we introduce a local update rule for lateral connections between output neurons. After learning, the network predicts by running the neural dynamics until it converges to a fixed point for a given input. While this algorithm performs well for deterministic and noisy linear tasks, its performance decays when nonlinear activation functions are introduced, a consequence of the existence of spurious fixed points.

To overcome this problem, we develop a second algorithm borrowing an idea from energy-based learning algorithms such as Contrastive Hebbian Learning (CHL) Movellan (1991) and Equilibrium Propagation (EP) Scellier and Bengio (2017). In these algorithms, weight-updates rely on the difference of the neural activity between a “free” phase, and a “clamped” (CHL) or “nudged” (EP) phase to locally approximate the gradient of an error signal. The learning process can be interpreted as minimizing a contrastive function, which reshapes the energy landscape to eliminate spurious fixed points and makes the desired fixed point more stable.

Adopting this idea, we introduce a novel optimization problem named “Contrastive Similarity Matching” (CSM), and, using the same duality principles as before, derive a two-phase biologically plausible supervised algorithm. The first phase is analogous to the nudged phase of EP, and performs Hebbian feedforward and anti-Hebbian lateral updates. We emphasize that our algorithm has the opposite sign for the lateral connection updates compared to EP and CHL. This is due to the fact that our weight updates solve a minimax problem. Anti-Hebbian learning pushes each neuron within a layer to learn different representations and facilitates forward Hebbian feature learning. The second phase of our algorithm only performs anti-Hebbian updates on the feedforward weights. This is again different from EP and CHL, where all weights are updated.

Our main contributions and results are listed below:

  • We introduce a novel approach to the credit assignment problem using biologically-plausible learning rules. We introduce two algorithms: Supervised Similarity Matching and Contrastive Similarity Matching. Our work generalizes the similarity matching principle to supervised learning tasks.

  • The proposed supervised learning algorithms can be related to other energy-based algorithms, but with a distinct underlying mechanism.

  • We present a version of our neural network algorithm with structured connectivity.

  • Using numerical simulations, we show that the performance of our algorithms is on par with other energy-based algorithms. The learned representations of our Hebbian/anti-Hebbian network are sparser and more structured.

The rest of this paper is organized as follows. In Section 2, we derive the Supervised Similarity Matching algorithm for linear tasks via the similarity matching principle. In Section 3, we present the Contrastive Similarity Matching algorithm for deep neural networks with nonlinear activation functions and structured connectivity. We discuss the relation of our algorithm to other energy-based learning algorithms. In Section 4, we report the performance of our algorithm and compare it with EP, highlighting the differences between them. Finally, we discuss our results and relate them to other work in Section 5.

2 Supervised Similarity Matching for Linear Tasks

We start by deriving a neural network (NN) that solves deterministic and noisy linear tasks based on the similarity matching principle. In the next section, we will generalize it to multi-layer NNs with nonlinear activation functions and structured connectivity. Our approach is normative, in the sense that we derive an online algorithm by optimizing an objective function, and map the steps of the algorithm onto the operation of a biologically-plausible neural network. In contrast to traditional deep learning approaches, the neural architecture and dynamics are not prescribed but derived from the optimization problem.

Our goal in this section is both to introduce a new algorithm and to introduce some of the key ideas behind the development of the more involved Contrastive Similarity Matching algorithm. Therefore, we will focus on the simplest non-trivial NN architecture, a NN with a single hidden layer with linear activation functions.

2.1 Supervised Similarity Matching Objective

Let $\mathbf{x}_t \in \mathbb{R}^{n_0}$, $t = 1, \ldots, T$, be a set of data points and $\mathbf{z}^*_t \in \mathbb{R}^{n_2}$, $t = 1, \ldots, T$, be their corresponding desired outputs, or labels. Our goal is to derive a NN with one hidden layer of $n_1$ units that uses biologically plausible learning rules to learn the underlying linear function between $\mathbf{x}$ and $\mathbf{z}^*$. As we will see below, the $\mathbf{x}_t$ will be the inputs to the neural network and the $\mathbf{z}^*_t$ will be its desired outputs.

Our idea is that the representation learned by the hidden layer, $\mathbf{h}_t$, should be half-way between the input $\mathbf{x}_t$ and the desired output $\mathbf{z}^*_t$. We formulate this idea in terms of representational similarities, using the dot product as the measure of similarity within a layer. Our proposal can be formulated as an optimization problem as follows:

$$\min_{\{\mathbf{h}_t\}}\ \frac{1}{2T^2}\sum_{t=1}^{T}\sum_{t'=1}^{T}\left(\mathbf{h}_t^\top\mathbf{h}_{t'} - \frac{1}{2}\,\mathbf{x}_t^\top\mathbf{x}_{t'} - \frac{1}{2}\,\mathbf{z}_t^{*\top}\mathbf{z}^*_{t'}\right)^2 \qquad (1)$$

To get an intuition about what this cost function achieves, suppose we had only one training datum. Then the optimum satisfies $\mathbf{h}^\top\mathbf{h} = \frac{1}{2}\left(\mathbf{x}^\top\mathbf{x} + \mathbf{z}^{*\top}\mathbf{z}^*\right)$, satisfying our condition. When multiple training data are involved, interactions between different data points lead to a non-trivial solution, but the fact that the hidden layer representations lie in between those of the input and output layers stays.

The optimization problem (1) can be analytically solved, making our intuition precise. Let the representational similarity matrix of the input layer be $D^x$, with $D^x_{tt'} = \mathbf{x}_t^\top\mathbf{x}_{t'}$, that of the hidden layer be $D^h$, with $D^h_{tt'} = \mathbf{h}_t^\top\mathbf{h}_{t'}$, and that of the output layer be $D^{z^*}$, with $D^{z^*}_{tt'} = \mathbf{z}_t^{*\top}\mathbf{z}^*_{t'}$. Instead of solving for the $\mathbf{h}_t$ directly, we can reformulate and solve the supervised similarity matching problem (1) for $D^h$, and obtain the $\mathbf{h}_t$ back by a matrix factorization through an eigenvalue decomposition. The optimization problem for $D^h$ that follows from (1) by completing the square is:

$$\min_{D^h \in \mathcal{S}}\ \left\| D^h - \frac{1}{2}\left(D^x + D^{z^*}\right)\right\|_F^2 \qquad (2)$$

where $\mathcal{S}$ is the set of positive semidefinite symmetric matrices with rank at most $n_1$, and $\|\cdot\|_F$ denotes the Frobenius norm. The optimal $D^h$ is given by keeping the top $n_1$ modes in the eigenvalue decomposition of $\frac{1}{2}\left(D^x + D^{z^*}\right)$ and setting the rest to zero. If $n_1 \ge \operatorname{rank}\!\left(\frac{1}{2}\left(D^x + D^{z^*}\right)\right)$, then the optimal $D^h$ exactly equals $\frac{1}{2}\left(D^x + D^{z^*}\right)$, achieving a representational similarity matrix that is the average of those of the input and output layers.
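This closed-form solution can be sketched in a few lines of code. The snippet below is a minimal illustration, not the paper's code: it assumes the notation introduced above (data stacked column-wise in X and Z, hidden size n1) and recovers hidden representations by truncating the eigenvalue decomposition of the average similarity matrix; the example task and sizes are hypothetical.

```python
# Minimal sketch: closed-form solution of supervised similarity matching in
# Gram-matrix form, under the notation assumed above.
import numpy as np

def optimal_hidden_representation(X, Z, n1):
    """X: (n0, T) inputs, Z: (n2, T) desired outputs, n1: hidden layer size."""
    D_avg = 0.5 * (X.T @ X + Z.T @ Z)            # average of input/output similarity matrices
    evals, evecs = np.linalg.eigh(D_avg)          # eigen-decomposition (ascending eigenvalues)
    top = np.argsort(evals)[::-1][:n1]            # keep the top n1 modes
    evals_top = np.clip(evals[top], 0.0, None)    # numerical PSD projection
    H = np.diag(np.sqrt(evals_top)) @ evecs[:, top].T   # factor D_h = H^T H; H is (n1, T)
    return H                                      # hidden representations, one column per datum

# Example on a random linear teacher task (illustrative sizes)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 200))
A = rng.uniform(-1, 1, size=(10, 30))
Z = A @ X
H = optimal_hidden_representation(X, Z, n1=50)
# With n1 large enough, H^T H matches the average similarity matrix exactly
print(np.allclose(H.T @ H, 0.5 * (X.T @ X + Z.T @ Z), atol=1e-8))
```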

2.2 Derivation of a Supervised Similarity Matching Neural Network

The supervised similarity matching cost function (1) is formulated in terms of the activities of units, but a statement about the architecture and the dynamics of the network has not been made. In fact, we will derive all these from the cost function, without prescribing them. To do so, we need to introduce variables that correspond to the synaptic weights in the network. As it turns out, these variables are dual to correlations between unit activities Pehlevan et al. (2018).

To see this explicitly, following the method of Pehlevan et al. (2018), we expand the squares in Eq. (1) and introduce new dual variables $W_1$, $W_2$ and $L$ by using the following identities:

$$
\begin{aligned}
-\frac{1}{T^2}\sum_{t,t'} \mathbf{h}_t^\top\mathbf{h}_{t'}\,\mathbf{x}_t^\top\mathbf{x}_{t'} &= \min_{W_1}\left[\operatorname{Tr}\!\left(W_1^\top W_1\right) - \frac{2}{T}\sum_t \mathbf{h}_t^\top W_1\mathbf{x}_t\right],\\
-\frac{1}{T^2}\sum_{t,t'} \mathbf{h}_t^\top\mathbf{h}_{t'}\,\mathbf{z}_t^{*\top}\mathbf{z}^*_{t'} &= \min_{W_2}\left[\operatorname{Tr}\!\left(W_2^\top W_2\right) - \frac{2}{T}\sum_t \mathbf{h}_t^\top W_2\mathbf{z}^*_t\right],\\
\frac{1}{T^2}\sum_{t,t'} \left(\mathbf{h}_t^\top\mathbf{h}_{t'}\right)^2 &= \max_{L}\left[\frac{2}{T}\sum_t \mathbf{h}_t^\top L\,\mathbf{h}_t - \operatorname{Tr}\!\left(L^\top L\right)\right]. \qquad (3)
\end{aligned}
$$

Plugging these into Eq. (1), and changing the order of optimization, we arrive at the following dual, minimax formulation of supervised similarity matching:

(4)

where

(5)

A stochastic optimization of the above objective can be mapped onto a Hebbian/anti-Hebbian network following steps in previous work Pehlevan et al. (2018). For each training datum $(\mathbf{x}_t, \mathbf{z}^*_t)$, a two-step procedure is performed. First, the optimal hidden activity $\mathbf{h}_t$ that minimizes the objective is obtained by a gradient flow until convergence:

(6)

We interpret this flow as the dynamics of a neural circuit with linear activation functions, where the dual variables $W_1$, $W_2$ and $L$ are synaptic weight matrices (Fig. 2A). In the second part of the algorithm, we update the synaptic weights by a gradient descent-ascent on (5) with $\mathbf{h}_t$ fixed. This gives the following synaptic plasticity rules:

(7)

The learning rate of each matrix can be chosen differently to achieve the best performance.
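The two-step procedure can be illustrated schematically as below. This is a sketch under assumptions: the Euler-integrated linear dynamics and the specific Hebbian/anti-Hebbian forms are the standard similarity-matching rules implied by objectives of this type, with illustrative step sizes, and the lateral matrix is assumed to be initialized positive definite (e.g., the identity) so that the dynamics converge; it is not a transcription of the paper's exact Eqs. (6)-(7).

```python
# Schematic sketch of one SSM training step for the linear network, with the
# output clamped to its desired value z_star.
import numpy as np

def ssm_train_step(x, z_star, W1, W2, L, eta=0.05, dt=0.1, n_steps=200):
    # Step 1: run the hidden-unit dynamics to a fixed point (output clamped at z_star)
    h = np.zeros(W1.shape[0])
    for _ in range(n_steps):
        h += dt * (W1 @ x + W2 @ z_star - L @ h)   # linear neural dynamics
    # Step 2: local synaptic plasticity evaluated at the fixed point
    W1 += eta * (np.outer(h, x) - W1)              # Hebbian, feedforward from input
    W2 += eta * (np.outer(h, z_star) - W2)         # Hebbian, from the clamped output
    L  += eta * (np.outer(h, h) - L)               # anti-Hebbian lateral (subtracted in dynamics)
    return h, W1, W2, L

# Usage (hypothetical sizes): W1 = 0.1 * rng.standard_normal((n1, n0)),
# W2 = 0.1 * rng.standard_normal((n1, n2)), L = np.eye(n1).
```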

Figure 2: A linear NN with Hebbian/anti-Hebbian learning rules. (A) During the learning process, the output neurons (blue) are clamped at their desired states. After training, the prediction for a new input is given by the value of the output units at the fixed point of the neural dynamics. (B) Test error, defined as the mean squared error between the network’s prediction and the ground-truth value, decreases with the gradient ascent-descent steps during learning. (C) Scatter plot of the predicted value versus the desired value (element-wise). (D) The algorithm learns the correct mapping between input and output even in the presence of Gaussian noise. In these examples, the elements of the inputs and of the ground-truth linear map are drawn from a uniform distribution. In (C) and (D), 200 data points are shown.

Overall, the network dynamics (6) and the update rules (7) map onto a NN with one hidden layer, with the output layer clamped to the desired state. The feedforward weight updates are Hebbian, while the lateral weight updates are “anti-Hebbian” since the lateral connections are inhibitory (Fig. 2A).

For prediction, the network takes an input data point and runs its dynamics with the output unclamped until convergence. We take the value of the output units at the fixed point as the network’s prediction.

To make sure that the network produces correct fixed points, at least for the training data, we introduce an additional step to the training procedure. We aim to construct a neural dynamics for the output layer in the prediction phase such that its fixed point corresponds to the desired output. Since the output layer receives input from the previous layer, a decay term that depends on the output activity is required to achieve a stable fixed point at the desired output. The simplest way to introduce such a term is through lateral inhibition. The output layer then has the following neural dynamics:

(8)

where the lateral connections of the output layer are learned such that the fixed point coincides with the desired output. This is achieved by minimizing the following target function:

(9)

Taking the derivative of the above target function with respect to the output lateral weights, while keeping the other parameters and variables evaluated at the fixed point of the neural dynamics, we get the following “delta” learning rule:

(10)
  Input: initial weights $W_1$, $W_2$, $L$, the output lateral weights, and learning rates
  for $t = 1$ to $T$ do
     Run the neural dynamics (6) until convergence to get the fixed point of the hidden units
     Update $W_1$, $W_2$ and $L$ according to (7)
     Update the output lateral weights according to (10)
  end for
Algorithm 1 Supervised Similarity Matching (SSM)

Together, (6), (7) and (10) constitute our Supervised Similarity Matching learning algorithm, which is summarized in Algorithm 1.

After learning, the NN makes a prediction for a new input by running the neural dynamics (6) and (8) of the hidden and output units until they converge to a fixed point. We take the value of the output units at the fixed point as the prediction. As shown in Fig. 2B-D, the Supervised Similarity Matching algorithm solves linear tasks efficiently.
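A sketch of the prediction phase and of the output-lateral delta rule is given below. The dynamical forms, the use of the transpose of the hidden-to-output coupling as the feedforward drive to the output, and the quadratic target behind the delta rule are assumptions consistent with the description above, not the paper's exact Eqs. (8)-(10).

```python
# Sketch of prediction (unclamped dynamics) and of the delta-rule update for the
# output lateral weights Lz; forms are assumptions, see the text above.
import numpy as np

def ssm_predict(x, W1, W2, L, Lz, dt=0.1, n_steps=500):
    """Run the unclamped hidden and output dynamics to a fixed point."""
    h = np.zeros(W1.shape[0])
    z = np.zeros(W2.shape[1])
    for _ in range(n_steps):
        h += dt * (W1 @ x + W2 @ z - L @ h)    # hidden layer: feedforward + feedback - lateral
        z += dt * (W2.T @ h - Lz @ z)          # output layer: feedforward drive - lateral decay
    return z

def update_output_laterals(h, z_star, W2, Lz, eta=0.05):
    """Delta rule: make the desired output z_star a fixed point of the output dynamics."""
    err = W2.T @ h - Lz @ z_star               # residual of the output fixed-point condition
    Lz += eta * np.outer(err, z_star)          # shrinks the residual, a "delta" rule
    return Lz
```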

The above algorithm can be generalized to multi-layer and nonlinear networks. We chose not to present this generalization because we found its performance unsatisfactory due to the appearance of spurious fixed points for a given input. To overcome this problem, in the next section we introduce the Contrastive Similarity Matching algorithm, borrowing ideas from energy-based learning algorithms such as Contrastive Hebbian Learning (CHL) and Equilibrium Propagation (EP).

3 Contrastive Similarity Matching for Deep Nonlinear Networks

In energy-based learning algorithms like CHL and EP, weight updates rely on the difference of neural activity between a free phase and a clamped/nudged phase to locally approximate the gradient of an error signal. This process can be interpreted as minimizing a contrastive function, which in turn reshapes the energy landscape to eliminate the spurious fixed points and make the fixed point corresponding to the desired output more stable.

We will adopt this idea to propose a Contrastive Similarity Matching algorithm. In order to do so, we will first introduce what we call the nudged similarity matching cost function, and derive its dual formulation, which will be the energy function we will end up using in our contrastive formulation.

3.1 Nudged Deep Similarity Matching Objective and Its Dual Formulation

We now discuss a deep NN with $P$ hidden layers and nonlinear activation functions. For notational convenience, we denote the inputs to the network by $\mathbf{r}^0_t := \mathbf{x}_t$, the outputs by $\mathbf{r}^{P+1}_t$, and the activities of the $p$-th hidden layer by $\mathbf{r}^p_t$, $p = 1, \ldots, P$.

We propose the following objective function for the training phase, where the outputs are nudged towards the desired labels:

(11)

where $\beta \ge 0$ is a control parameter that specifies how strong the nudge is, $\gamma$ is a parameter that controls the influence of later layers on earlier layers, and each layer has a regularizer that is defined in terms of, and related to, its activation function, its total input and output, and the thresholds of its neurons. We also assume the activation functions to be monotonic and bounded.

The objective function (11) is almost identical to the deep similarity matching objective introduced in Obeid et al. (2019), except for the nudging term. Obeid et al. (2019) used the $\beta = 0$ version as an unsupervised algorithm. Here, we use a non-zero $\beta$ for supervised learning.

Using analogues of the duality transforms discussed in the previous section and in Pehlevan et al. (2018); Obeid et al. (2019), the supervised deep similarity matching problem (11) can be turned into a minimax problem:

(12)

where

(13)

Details of this transformation are given in the Supplementary Information Section 1.

We note that the objective of the inner optimization in (12), given by (13), defines an energy function for a NN. It can be minimized by running the following dynamics:

(14)

where $\delta$ is the Kronecker delta and $\tau$ is a time constant. This observation will be the building block of our CSM algorithm, which we present below.
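A schematic sketch of such relaxation dynamics is given below. The variable names (W, L, b, gamma, beta), the leaky-integration form, and the way the nudge enters the output layer are assumptions consistent with the verbal description of Eq. (14), not a transcription of it.

```python
# Schematic relaxation of a deep network with feedforward, feedback and lateral
# connections, and a nudging term of strength beta on the output layer.
import numpy as np

def relax(x, z_star, W, L, b, f, beta=0.0, gamma=0.5, dt=0.1, n_steps=100):
    """W: list of feedforward weights, L: list of lateral weights, b: list of biases,
    f: activation function, e.g. f = lambda u: np.clip(u, 0.0, 1.0) (bounded, non-negative)."""
    r = [np.zeros(Wp.shape[0]) for Wp in W]            # activities of layers 1 ... P+1
    for _ in range(n_steps):
        for p in range(len(r)):
            bottom_up = W[p] @ (x if p == 0 else r[p - 1])
            top_down = gamma * W[p + 1].T @ r[p + 1] if p + 1 < len(r) else 0.0
            drive = bottom_up + top_down - L[p] @ r[p] + b[p]
            if p == len(r) - 1 and beta > 0:           # nudge the output toward the label
                drive = drive + beta * (z_star - r[p])
            r[p] += dt * (f(drive) - r[p])             # relax toward the nonlinear fixed point
    return r
```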

While the objective function (12) could be optimized using a stochastic alternating optimization procedure that can be interpreted as the operation of a deep neural network with feedforward, feedback and lateral connections, as in Section 2, we again found the performance of this network unsatisfactory due to the existence of spurious minima in its energy landscape.

3.2 Contrastive Similarity Matching

We first state our contrastive function and then discuss its implications. Suppressing the dependence on the training data, we define:

(15)

and

(16)

Finally, we formulate our contrastive function as

(17)

which is to be minimized over the feedforward and feedback weights, as well as the biases. For fixed biases, minimization of the first term corresponds exactly to the optimization of the minimax dual of nudged deep similarity matching (12). The second term corresponds to a free phase, where no nudging is applied. We note that, in order to arrive at a contrastive minimization problem, we use the same optimal lateral weights from the nudged phase in the free phase. Compared to the minimax dual of nudged deep similarity matching (12), we also optimize over the biases for better performance.

Minimization of the contrastive function (17) closes the energy gap between the nudged and free phases. Because the energy functions are evaluated at the fixed points of the neural dynamics (14), this procedure enforces the nudged network output to be a fixed point of the free neural dynamics.

To optimize our contrastive function (17) in a stochastic (one training datum at a time) manner, we use the following procedure. For each training pair $(\mathbf{x}_t, \mathbf{z}^*_t)$, we run the nudged-phase ($\beta > 0$) dynamics (14) until convergence and collect the fixed point. Next, we run the free-phase ($\beta = 0$) neural dynamics (14) until convergence and collect its fixed point. The lateral weights are updated by a gradient ascent arising from the nudged phase, while the feedforward weights and biases follow a gradient descent of (17):

(18)

In practice, the learning rates can be chosen differently for different weights to achieve the best performance. The resulting CSM algorithm is summarized in Algorithm 2.

  Input: initial feedforward weights, lateral weights, and biases
  for $t = 1$ to $T$ do
     Run the nudged-phase neural dynamics (14) with $\beta > 0$ until convergence, collect the fixed point
     Run the free-phase dynamics (14) with $\beta = 0$ until convergence, collect the fixed point
     Update the feedforward weights, lateral weights and biases according to (18)
  end for
Algorithm 2 Contrastive Similarity Matching (CSM)

Relation to Other Energy Based Learning Algorithms

The CSM algorithm is similar in spirit to other contrastive algorithms, like CHL and EP. Like these algorithms, CSM performs two runs of the neural dynamics, in a free and a nudged phase. However, there are important differences. One major difference is that in CSM, the contrastive function is minimized only by the feedforward weights; the lateral weights take part in the maximization step of the nudged minimax objective (12). In CHL and EP, such minimization is done with respect to all the weights.

As a consequence of this difference, compared with the CHL and EP algorithms, the CSM algorithm uses a different update for the lateral weights. This is an anti-Hebbian update arising from the nudged phase, and it differs from EP and CHL in two ways: 1) it has the opposite sign, i.e. the EP and CHL nudged/clamped-phase updates are Hebbian; 2) no update is applied in the free phase. As we will demonstrate in numerical simulations, our lateral update imposes a competition between different units in the same layer and facilitates forward feature learning. When network activations are non-negative, such lateral interactions are inhibitory and sparsen neural activity.
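The sign structure described above can be made concrete with a short sketch of one CSM training step, reusing the relax routine sketched earlier. The learning rates, the 1/beta scaling of the contrastive updates, and the exact placement of the bias update are illustrative assumptions, not the paper's Eq. (18).

```python
# Sketch of one CSM training step: a nudged phase (beta > 0) and a free phase
# (beta = 0), with Hebbian/anti-Hebbian feedforward updates from the two phases
# and an anti-Hebbian lateral update from the nudged phase only.
import numpy as np

def csm_train_step(x, z_star, W, L, b, f, beta=1.0, eta_w=0.05, eta_l=0.05):
    r_nudged = relax(x, z_star, W, L, b, f, beta=beta)   # nudged phase
    r_free = relax(x, z_star, W, L, b, f, beta=0.0)      # free phase (label unused)
    pre_n = [x] + r_nudged[:-1]                          # presynaptic activities, nudged
    pre_f = [x] + r_free[:-1]                            # presynaptic activities, free
    for p in range(len(W)):
        # feedforward: Hebbian in the nudged phase, anti-Hebbian in the free phase
        W[p] += (eta_w / beta) * (np.outer(r_nudged[p], pre_n[p]) - np.outer(r_free[p], pre_f[p]))
        b[p] += (eta_w / beta) * (r_nudged[p] - r_free[p])
        # lateral: anti-Hebbian, from the nudged phase only (no free-phase update)
        L[p] += eta_l * (np.outer(r_nudged[p], r_nudged[p]) - L[p])
    return W, L, b
```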

3.3 Introducing structured connectivity

We can also generalize the above supervised similarity matching framework to derive a Hebbian/anti-Hebbian network with structured connectivity. Following the idea in Obeid et al. (2019), we can modify any of the cross terms in the layer-wise similarity matching objective (11) by introducing synapse-specific structure constants. For example:

(19)

where $n_p$ is the number of neurons in the $p$-th layer, and the structure constants set the structure of the feedforward weight matrix between the $(p-1)$-th and the $p$-th layers. In particular, setting a constant to zero removes the corresponding connection, without changing the energy function interpretation Obeid et al. (2019). Similarly, we can introduce constants that specify the structure of the lateral connections (Fig. 6A). Using such structure constants, one can introduce many different architectures, some of which we experiment with below. We present a detailed explanation of these points in the Supplementary Information Section 2.
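A minimal sketch of how such structure constants can be built as binary masks is shown below; grid sizes, strides and the neurons-per-site parameter are illustrative, and the helper names are hypothetical. A zeroed entry removes the corresponding synapse; the masks multiply the weight matrices (and their updates) elementwise, leaving the energy-function interpretation intact.

```python
# Sketch of structure constants (binary masks) for a radius-limited receptive
# field on a 2-d grid, and for lateral connections restricted to a shared site.
import numpy as np

def feedforward_mask(grid_pre, grid_post, nps_pre, nps_post, radius):
    """1 where a post-layer unit may connect to a pre-layer unit, 0 elsewhere."""
    pre_xy = np.array([(i, j) for i in range(grid_pre) for j in range(grid_pre)])
    post_xy = np.array([(i, j) for i in range(grid_post) for j in range(grid_post)])
    pre_xy = np.repeat(pre_xy, nps_pre, axis=0)          # one row per unit, nps per site
    post_xy = np.repeat(post_xy, nps_post, axis=0)
    dists = np.linalg.norm(post_xy[:, None, :] - pre_xy[None, :, :], axis=-1)
    return (dists <= radius).astype(float)               # shape: (n_post, n_pre)

def lateral_mask(grid, nps):
    """Lateral connections only between neurons sharing the same (x, y) site."""
    site_id = np.repeat(np.arange(grid * grid), nps)
    return (site_id[:, None] == site_id[None, :]).astype(float)
```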

Figure 3: Training (left) and validation (right) classification errors for different contrastive learning algorithms for a network with one hidden layer (784-500-10, upper panels) and three hidden layers (784-500-500-500-10, lower panels).

4 Numerical Simulations

In this section, we describe the results of simulations that tested the CSM algorithm on a supervised classification task using the MNIST dataset of handwritten digits LeCun et al. (2010). For our simulations, we used the Theano Deep Learning framework Team et al. (2016) and modified the code released by Scellier and Bengio (2017). The activation functions of the units were bounded and non-negative, as in Scellier and Bengio (2017).

4.1 Fully connected neural network

The inputs consist of gray-scale 28-by-28-pixel images with associated labels ranging from 0 to 9. We encode the labels as one-hot 10-dimensional vectors. We trained fully connected neural networks with 1 and 3 hidden layers, with lateral connections within each hidden layer. The performance of the CSM algorithm was compared with several variants of the EP algorithm: (1) a random sign of the nudging parameter as a regularization Scellier and Bengio (2017); (2) a positive constant nudging parameter; (3) a network with lateral connections trained by EP (EP+lateral). In all the simulations, the number of neurons in each hidden layer is 500. With CSM, we attained 0% training error and validation errors of 2.16% and 3.52% in the one and three hidden layer cases, respectively. This is on par with the performance of the EP algorithm, whose variants attain validation errors between 2.18% and 2.53% (one hidden layer) and between 2.4% and 2.77% (three hidden layers) (Fig.3). In the 3-hidden-layer case, a training-error-dependent adaptive learning rate scheme (CSM-Adaptive) was used, wherein the learning rate for the lateral updates is successively decreased when the training error drops below certain thresholds (see Supplementary Material Section 3 for details).

Figure 4: Representations of neurons in NNs trained by the CSM algorithm are much sparser than those trained by the EP algorithm. (A) Heatmaps of representations at layer 2; each row is the response of 500 neurons to a given digit image. Upper: CSM algorithm. Lower: EP algorithm. (B) Representation sparsity, defined as the number of neurons whose activity is larger than a threshold (0.01), across layers. Layer 0 is the input. The network has a 784-500-500-500-10 architecture.

4.2 Neuronal Representation

Despite the similar performance of the CSM and EP algorithms, the underlying mechanisms are different. Due to the non-negativity of the hidden units and the anti-Hebbian lateral updates, the lateral connections in the CSM algorithm are inhibitory. Therefore, the learned representations are sparser than those of the EP algorithm (Fig.4).
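For concreteness, the sparsity measure of Fig. 4B can be computed as in the short sketch below (assuming layer activities are stored row-wise per sample).

```python
# Sketch of the sparsity measure: the number of neurons in a layer whose
# activity exceeds a small threshold, averaged over input samples.
import numpy as np

def representation_sparsity(R, threshold=0.01):
    """R: (num_samples, num_neurons) array of layer activities."""
    return (R > threshold).sum(axis=1).mean()   # mean count of active neurons per sample
```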

The CSM algorithm solves the credit assignment problem by minimizing layer-wise, local similarity matching objectives. Indeed, representations of images in a NN with three hidden layers trained by CSM become progressively more structured. This is quantified by the representational similarity matrices (Fig.5A-B): the average representational similarity between images of the same class (diagonal blocks) is much larger than that between images of different classes (off-diagonal blocks). In contrast, this difference in representational similarity is much smaller in a NN trained by EP (Fig.5).
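The representational similarity analysis of Fig. 5 can be sketched as follows; the cosine normalization and the exclusion of the diagonal from the within-class average are assumptions about details not spelled out in the text.

```python
# Sketch of the representational similarity analysis: a normalized similarity
# (Gram) matrix of layer activities, and the average within-class versus
# between-class similarity. Sorting by label produces the block structure.
import numpy as np

def similarity_analysis(R, labels):
    """R: (num_samples, num_neurons) activities; labels: (num_samples,) class ids."""
    order = np.argsort(labels)
    R, labels = R[order], labels[order]
    norms = np.linalg.norm(R, axis=1, keepdims=True) + 1e-12
    S = (R / norms) @ (R / norms).T                      # normalized similarity matrix
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    within = S[same & off_diag].mean()                   # diagonal blocks (same class)
    between = S[~same].mean()                            # off-diagonal blocks
    return S, within, between
```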

Figure 5: Comparison of the representational similarity matrices at each layer. (A) Normalized representational similarity matrix at layer 2 for the NNs trained by the CSM and EP algorithms. 1000 randomly sampled MNIST images are used, with 100 samples per digit. (B) Elements of the similarity matrix are divided into within-class (diagonal blocks, orange) and between-class (off-diagonal blocks, blue) entries. Left panel: evolution of the similarity matrix across layers. Right panel: the difference between within-class and between-class similarity is much larger for the CSM (red) than for the EP (blue) algorithm.

4.3 Structured Networks

We examine the performance of CSM in networks with structured connectivity. Every hidden layer can be constructed by first considering sites arranged on a two-dimensional grid. Each site only receives inputs from select nearby sites; this is controlled by the ‘radius’ parameter. This setting resembles retinotopy Kandel et al. (2000) in the visual cortex. Several such two-dimensional grids are stacked so that multiple neurons are present at a single site; this is controlled by the ‘neurons per site’ (NPS) parameter. We consider lateral connections only between neurons sharing the same (x, y) coordinate. Networks with structured connectivity trained with the CSM rule achieved comparable performance, e.g. a 2.22% validation error for a single hidden layer network with a radius of 4 and an NPS of 20 (Fig. 6) (see Supplementary Material Section 3 for details).

Figure 6: (A) Sketch of structured connectivity in a deep neural network. Neurons live in a 2-d grid. Each neuron takes input from a small grid (blue shades) from previous layer and a small grid of inhibition from its nearby neurons (orange shades). (B) Training and validation curves with CSM in structured, single hidden layer networks, with a receptive field of radius 4 and neurons per site 4, 16 and 20.

5 Discussion

In this paper, we proposed a new solution to the credit assignment problem by generalizing the similarity matching principle to the supervised domain, and introduced two algorithms. In the SSM algorithm, the supervision signal spreads through the network implicitly by clamping the output nodes, and is distributed to each synapse by minimizing layer-local similarity matching objectives. In the CSM algorithm, the supervision signal is introduced by minimizing the energy difference between a free phase and a nudged phase. CSM differs significantly from other energy-based algorithms in how the contrastive function is constructed. The anti-Hebbian learning rule for the lateral connections sparsens the representations and facilitates forward Hebbian feature learning. We also derived CSM for NNs with structured connectivity.

The idea of using representational similarity for training neural networks has taken various forms in previous work. For example, similarity matching has been introduced as part of a local cost function to train a deep convolutional network Nøkland and Eidnes (2019), where, instead of layer-wise similarity matching, each hidden layer aims to learn representations similar to those of the output layer. Representational similarity matrices derived from neurobiological data have recently been used to regularize CNNs trained on image classification; the resulting networks are more robust to noise and adversarial attacks Li et al. (2019). It would be interesting to study the robustness of NNs trained by the CSM algorithm.

A practical issue for CSM, as for other energy-based algorithms such as EP and CHL, is that the free phase takes a long time to converge. Recently, a discrete-time version of EP has shown much faster training speeds Ernoult et al. (2019); applying it to CSM could be an interesting direction for future work.

Acknowledgements

We acknowledge support by NIH and the Intel Corporation through Intel Neuromorphic Research Community.

Supplementary Material

1 Supervised deep similarity matching

In this section, we follow Obeid et al. (2019) to derive the minimax dual of the deep similarity matching objective. We start by rewriting the objective function (11), expanding its first term and combining like terms from adjacent layers:

(S.1)

We then plug the following identities:

(S.2)
(S.3)

into (S.1) and exchange the optimization order of the neural activities and the weight matrices, which turns the objective (11) into the following minimax problem:

(S.4)

where we have defined an “energy” term (Eq. 13 in the main text)

(S.5)

The neural dynamics of each layer can be derived by following the gradient of the energy (S.5):

(S.6)

With an appropriate rescaling of variables, the above equation becomes Eq. (14) in the main text.

2 Supervised Similarity Matching for Neural Networks With Structured Connectivity

In this section, we derive the supervised similarity matching algorithm for neural networks with structured connectivity. Structure can be introduced to the quartic terms in (S.1):

(S.7)

where the structure constants specify the feedforward connections between layer p-1 and layer p, and the lateral connectivity within layer p. For example, setting them to 0 eliminates the corresponding connections. We now have the following deep structured similarity matching cost function for supervised learning:

(S.8)

For each layer, we can define dual variables for the feedforward and lateral interaction terms with positive structure constants; that is, we define the following variables:

(S.9)

Now we can rewrite (S.8) as:

(S.10)

where

(S.11)

The neural dynamics follows the gradient of the energy (S.11):

(S.12)

The local learning rules follow from gradient descent and ascent on (S.10):

(S.13)
(S.14)

3 Hyperparameters and Performance in Numerical Simulations

3.1 One Hidden Layer

The models were trained until the training error dropped to 0%, or as close to 0% as possible (as in the case of the Equilibrium Propagation runs with a positive constant nudging parameter); errors reported herein correspond to errors obtained for specific runs and do not reflect ensemble averages. The training and validation errors below, and in subsequent subsections, are reported at the epoch when the training error has dropped to 0%, or at the last epoch of the run (e.g., for EP with a positive constant nudging parameter). This epoch number is recorded in the last column.


Algorithm           | Learning Rates (feedforward; lateral) | Training Error (%) | Validation Error (%) | No. epochs
EP (random sign β)  | 0.1, 0.05                             | 0                  | 2.53                 | 40
EP (positive β)     | 0.5, 0.125                            | 0.034              | 2.18                 | 100
EP + lateral        | 0.5, 0.25; 0.75                       | 0                  | 2.29                 | 25
CSM                 | 0.5, 0.25; 0.75                       | 0                  | 2.16                 | 25
Table 1: Comparison of the training and validation errors of different algorithms for one hidden layer NNs

3.2 Three Hidden Layers

In Table 2, the CSM Adaptive algorithm employs a scheme wherein the learning rates for the lateral updates are divided by factors of 5, 10, 50, and 100 when the training error drops below successive thresholds.


Algorithm           | Learning Rates (feedforward; lateral)            | Training Error (%) | Validation Error (%) | No. epochs
EP (random sign β)  | 0.128, 0.032, 0.008, 0.002                       | 0                  | 2.73                 | 250
EP (positive β)     | 0.128, 0.032, 0.008, 0.002                       | 0                  | 2.77                 | 250
EP + lateral        | 0.128, 0.032, 0.008, 0.002; 0.192, 0.048, 0.012  | 0                  | 2.4                  | 250
CSM                 | 0.5, 0.375, 0.281, 0.211; 0.75, 0.562, 0.422     | 0                  | 4.82                 | 250
CSM Adaptive        | 0.5, 0.375, 0.281, 0.211; 0.75, 0.562, 0.422     | 0                  | 3.52                 | 250
Table 2: Comparison of the training and validation errors of different algorithms for three hidden layer NNs

3.3 Structured Connectivity

All results documented so far correspond to fully connected networks. In this section, results pertaining to networks with structured connectivity are discussed. Every hidden layer in these networks can be considered as multiple two-dimensional grids stacked onto each other, with each grid containing neurons/units at periodically arranged sites. Each site only receives inputs from select nearby sites. In this scheme, we consider lateral connections only between neurons sharing the same (x, y) coordinate, and the length and width of the grid are the same. In the table below, ‘Full’ refers to runs where the input is the full MNIST input image and ‘Crop’ refers to runs in which the input image is a cropped MNIST image. The first three runs, annotated by ‘Full’, correspond to the runs plotted in the paper. Errors are reported at the last epoch of the run. In these networks, additional hyperparameters are required to constrain the structure; these are enumerated below:

  • Neurons-per-site (nps): The number of neurons placed at each site in a given hidden layer, i.e. the number of two dimensional grids stacked onto each other. The nps for the input is 1.

  • Stride: Spacing between sites, relative to the input channel. The stride of the input is always 1, i.e. sites are placed at (0, 0), (0, 1), (1, 0), and so on, on the two-dimensional grid. If the stride of the l-th layer is s, the sites nearest to the site at the origin will lie at a distance s along each axis. The stride increases deeper into the network. Specifying the stride specifies the dimension of the grid. A layer with stride s and nps n will have a number of units determined by n and the grid dimension, where the underlying grid corresponds to the full 28-by-28 image for the ‘Full’ runs and to the cropped image for the ‘Crop’ runs. The nps values and stride together assign coordinates to all the units in the network.

  • Radius: The radius of the circular two-dimensional region that units in the previous layer must lie within in order to have non-zero weights to the unit in question. Any units in the previous layer lying outside this circle will not be connected to the unit.


Algorithm        | Learning Rates (feedforward; lateral) | Training Error (%) | Validation Error (%) | No. epochs
R4, NPS4, Full   | 0.5, 0.375; 0.01                      | 0.02               | 2.71                 | 50
R4, NPS16, Full  | 0.5, 0.25; 0.75                       | 0                  | 2.41                 | 49
R4, NPS20, Full  | 0.664, 0.577; 0.9                     | 0                  | 2.22                 | 50
R8, NPS80, Crop  | 0.664, 0.577; 0.9                     | 0.01               | 2.27                 | 20
R4, NPS4, Crop   | 0.099, 0.065; 0.335                   | 0.08               | 2.98                 | 100
R8, NPS4, Crop   | 0.099, 0.065; 0.335                   | 0                  | 2.73                 | 100
R8, NPS20, Crop  | 0.664, 0.577; 0.9                     | 0                  | 2.23                 | 79
Table 3: Comparison of the training and validation errors of different algorithms for one hidden layer NNs with structured connectivity

References

  1. F. Crick (1989). The recent excitement about neural networks. Nature 337 (6203), pp. 129–132.
  2. M. Ernoult, J. Grollier, D. Querlioz, Y. Bengio, and B. Scellier (2019). Updates of equilibrium prop match gradients of backprop through time in an RNN with static input. In Advances in Neural Information Processing Systems, pp. 7079–7089.
  3. A. Genkin, A. M. Sengupta, and D. Chklovskii (2019). A neural network for semi-supervised learning on manifolds. In International Conference on Artificial Neural Networks, pp. 375–386.
  4. K. Grill-Spector and K. S. Weiner (2014). The functional architecture of the ventral temporal cortex and its role in categorization. Nature Reviews Neuroscience 15 (8), pp. 536–548.
  5. J. Guerguiev, T. P. Lillicrap, and B. A. Richards (2017). Towards deep learning with segregated dendrites. eLife 6, e22901.
  6. G. E. Hinton and J. L. McClelland (1988). Learning representations by recirculation. In Neural Information Processing Systems, pp. 358–366.
  7. E. R. Kandel, J. H. Schwartz, and T. M. Jessell (2000). Principles of Neural Science. Vol. 4, McGraw-Hill, New York.
  8. Y. LeCun, C. Cortes, and C. J. C. Burges (2010). MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2.
  9. D.-H. Lee, S. Zhang, A. Fischer, and Y. Bengio (2015). Difference target propagation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 498–515.
  10. Z. Li, W. Brendel, E. Walker, E. Cobos, T. Muhammad, J. Reimer, M. Bethge, F. Sinz, X. Pitkow, and A. Tolias (2019). Learning from brains how to regularize machines. In Advances in Neural Information Processing Systems,