# Representation Learning using Event-based STDP

###### Abstract

Although representation learning methods developed within the framework of traditional neural networks are relatively mature, developing a spiking representation model remains a challenging problem. This paper proposes an event-based method to train a feedforward spiking neural network (SNN) for extracting visual features. The method introduces a novel spike-timing-dependent plasticity (STDP) learning rule and a threshold adjustment rule that are derived from a vector quantization-like objective function subject to a sparsity constraint. The STDP rule is obtained by the gradient of a vector quantization criterion that is converted to spike-based, spatio-temporally local update rules in a spiking network of leaky, integrate-and-fire (LIF) neurons. Independence and sparsity of the model are achieved by the threshold adjustment rule and by a softmax function implementing inhibition in the representation layer consisting of WTA-thresholded spiking neurons. Together, these mechanisms implement a form of spike-based, competitive learning. Two sets of experiments are performed on the MNIST and natural image datasets. The results demonstrate a sparse spiking visual representation model with low reconstruction loss comparable with state-of-the-art visual coding approaches, yet our rule is local in both time and space, thus biologically plausible and hardware friendly.


Amirhossein Tavanaei, The Center for Advanced Computer Studies, The School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette, LA 70504, USA, tavanaei@louisiana.edu

Timothee Masquelier, CERCO UMR 5549, CNRS-Université de Toulouse 3, F-31300, France, timothee.masquelier@cnrs.fr

Anthony Maida, The Center for Advanced Computer Studies, The School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette, LA 70504, USA, maida@louisiana.edu


## 1 Introduction

Unsupervised learning approaches using neural networks have frequently been used to extract uncorrelated features from visual inputs [1, 2]. Single layer networks using distributed representations or autoencoder networks [3, 4] have offered effective representation platforms. However, the robust, high level, and efficient representation that is obtained by networks in the brain is still not fully understood [5, 6, 7, 8, 9, 10, 11]. Understanding the brain’s functionality in representation learning is accomplished by studying spike activity [12] and bio-inspired spiking neural networks (SNNs) [13, 14, 15]. SNNs provide a biologically plausible architecture, high computational power, and an efficient neural implementation [16, 17, 18]. The main challenge is to develop a spiking representation learning model that codes input spike trains to uncorrelated, sparse, output spike trains using spatio-temporally local learning rules. In this study, we seek to develop representation learning in a network of spiking neurons to address this challenge. Our contribution determines novel spatio-temporally local learning rules embedded in a single layer SNN to code independent features of visual stimuli received as spike trains. Synaptic weights in the proposed model are adjusted based on a novel spike-timing-dependent plasticity (STDP) rule which achieves spatio-temporal locality.

Nonlinear Hebbian learning has played a key role in the development of a unified unsupervised learning approach to represent receptive fields [19]. Földiák [20], influenced by Barlow [21], was one of the early designers of sparse, weakly distributed representations having low redundancy. The Földiák model introduced a set of three learning rules (Hebbian, anti-Hebbian, and homeostatic) to work in concert to achieve these representations. Zylberberg et al. [22] showed that Földiák’s plasticity rules, in a spiking platform, could be derived from the constraints of reconstructive accuracy, sparsity, and decorrelation. Furthermore, the acquired receptive fields of the representation cells in their model (named SAILnet) qualitatively matched those in primate visual cortex. The representation kernels determining the synaptic weight sets have been successfully utilized by our recent study [23] for a spiking convolutional neural network to extract primary visual features of the MNIST dataset. Additionally, the learning rules only used information which was locally available at the relevant synapse. Although SAILnet utilized spiking neurons in the representation layer and the plasticity rules were spatially local, the learning rules were not temporally local. The SAILnet plasticity rules use spike counts accumulated over the duration of a stimulus presentation interval. Since the SAILnet rules do not use spike times, the question of training the spiking representation network using a spatio-temporally local, spike-based approach like spike-timing-dependent plasticity (STDP) [24], which needs neural spike times, remains unresolved. Later work, [25], extends [22] to use both excitatory and inhibitory neurons (obeying Dale’s law), but the learning rules still use temporal windows of varying duration to estimate spike rates, rather than the timing of spike events. Our work seeks to develop a learning rule which matches this performance but remains local in both time and space.

In another line of research based on cost functions, Olshausen and Field [26] and Bell and Sejnowski [27] showed that the constraints of reconstructive fidelity and sparseness, when applied to natural images, could account for many of the qualitative receptive field (RF) properties of primary visual cortex (area 17, V1). These works were agnostic about the possible learning mechanisms used in visual cortex to achieve these representations. Following [26], Rehn and Sommer [28] developed the sparse-set coding (SSC) network which minimizes the number of active neurons instead of the average activity measure. Later, Olshausen et al. [29] introduced an $L_1$-norm minimization criterion embedded in a highly overcomplete neural framework. Although these models offer great insight into what might be computed when receptive fields are acquired, they do not offer insight into the details of the learning rules used to achieve these representations.

Early works that proposed a learning mechanism to explain the emergence of orientation selectivity in visual cortex are those of von der Malsburg [30] and Bienenstock et al. [31]. A state-of-the-art model is that of Masquelier [32]. This model blends strong biological detail with signal processing analysis and simulation to establish a proof-of-concept demonstration of the original Hubel and Wiesel [33] feedforward model of orientation selectivity. A key feature of that model, relevant to the present paper, is the use of STDP to account for RF acquisition. STDP is the most popular learning rule in SNNs, in which the synaptic weights are adapted according to the relative pre- and postsynaptic spike times [24, 34]. Different variations of STDP have shown successful visual feature extraction in layer-wise training of SNNs [35, 36, 37, 38]. In a similar vein, Burbank [39] has also proposed an STDP-based autoencoder. This autoencoder uses a mirrored pair of Hebbian and anti-Hebbian STDP rules. Its goal is to account for the emergence of symmetric, but physically separate, connections for encoding weights and decoding weights. Another component playing a key role in representing uncorrelated visual features in a bio-inspired SNN pertains to the inhibition circuits embedded within a layer. For instance, Savin et al. [40] developed an independent component analysis (ICA) computation within an SNN using STDP and synaptic scaling in which independent neural activities in the representation layer were controlled by lateral inhibition. Lateral inhibition establishes a winner-take-all (WTA) neural circuit to maintain the independence and sparsity of the neural representation layer.

The present research proposes event-based, STDP-type rules embedded in a single layer spiking neural network (SNN) for spatio-temporal feature coding. Specifically, this paper proposes a novel STDP-based representation learning method in the spirit of [32, 39, 22]. Its learning rules are local in time and space to implement an approximation to clustering-based, vector quantization [41] using the SNN while controlling the sparseness and independence of visual codes. Our derivation uses a continuous-time formulation and takes the limit as the length of the stimulus presentation interval tends to one time step. This leads to STDP-type learning rules, although they differ from the classic rules found in [32] and [34]. In this sense, the rules and resulting visual coding model are novel. Independence and sparsity are also maintained by an implicit inhibition and a new threshold adjustment rule implementing a WTA circuit.

## 2 Background

Földiák [42] developed a feedforward network with anti-Hebbian interconnections for visual feature extraction. The Hebbian rule in his model, shown in Eq. 1, is inspired by Oja's learning rule [43], which extracts the largest principal component from an input sequence,

$$\Delta w_{ji} \propto y_j x_i - y_j^2 w_{ji} \tag{1}$$

$$y_j = \sum_i w_{ji} x_i \tag{2}$$

where $w_{ji}$ is the weight associated with the synapse connecting input (presynaptic) neuron $i$ and representation (postsynaptic) unit $j$. $x_i$ and $y_j$ are the input and linear output respectively. Over repeated trials, the term $y_j x_i$ increases the weight when the input and the output are correlated. The second term ($-y_j^2 w_{ji}$) maintains the learning stability [42]. A more consistent assumption for binary (or spiking) units has been made by Földiák [20]. He modified the previous feedforward network by incorporating non-linear units in the representation layer. The units are binary neurons with a threshold of 0.5 in which $y_j \in \{0, 1\}$ (Note: $y_j^2 = y_j$). Thus, the Hebbian rule in Eq. 1 is simplified to

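$$\Delta w_{ji} \propto y_j (x_i - w_{ji}) \tag{3}$$

$$y_j = \begin{cases} 1 & \text{if } \sum_i w_{ji} x_i > 0.5 \\ 0 & \text{otherwise} \end{cases} \tag{4}$$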

The weight change rules defined in Eqs. 1 and 3 are based on the input and output correlation. Another interpretation for Eq. 3 can be explained in terms of vector quantization (or clustering in a WTA circuit) [44, 45] in which the weights connected to each output neuron represent particular clusters (centroids). The weight change is also affected by the output neuron activation, $y_j$. In this paper, we utilize the vector quantization concept to define an objective function. The objective function can be adapted to develop a spiking visual representation model equipped with a temporally local learning rule while still maintaining sparsity and independence. Our motivation is to use event-based, STDP-type learning rules. This requires the learning to be temporally local, specifically using spike times between pre- and postsynaptic neurons.

## 3 Spiking Visual Representation

The proposed model adopts a constrained optimization approach to develop learning rules that are synaptically local. The spiking representation model is a single layer SNN shown in Fig. 1. The representation layer recodes a $p \times p$ image patch ($p^2$ spike trains) using spike trains generated by $D$ neurons in the representation layer.

We derive plasticity rules that operate over a stimulus presentation interval $T$ (non-local) and then take the limit as $T$ tends to one local time step to derive event-based rules. The objective function using both the vector quantization criterion and a regularizer that prefers small weight values is shown below.

$$F(x_i, w_{ji}) = \sum_j \sum_i y_j (x_i - w_{ji})^2 + \lambda \sum_j \sum_i y_j w_{ji}^2 \tag{5}$$

The parameters $x_i$, $y_j$, and $w_{ji}$ are the normalized input pixel intensities in the range [0, 1], the linear output activation, and the synaptic weight respectively. The first component is a vector quantization criterion that is scaled by the output neuron's activity, $y_j$. The factor $y_j$ scales the weight update rule according to the neuron's response to the input pattern. The second component (regularizer) is also scaled by the output neuron's activity to control the weight decay criterion (e.g. if $y_j = 0$, $w_{ji}$ does not undergo learning). We assume that the input and output values can be converted to spike counts over $T$ ms. The hyperparameter $\lambda$ controls the model's relative preference for smaller weights. As $\lambda \to 0$, the objective function emphasizes the vector quantization criterion. In contrast, as $\lambda \to \infty$ the vector quantization component is eliminated and the minimum of the objective function is obtained when the $w_{ji}$'s $\to 0$.

In response to a stimulus, a subset of neurons in the representation layer are activated to code the input. To represent the stimuli by uncorrelated codes, the neurons should be activated independently and sparsely. That is, the representation layer demands a WTA neural implementation. This criterion can be achieved by a soft constraint such that

$$\sum_j z_j = 1 \tag{6}$$

where $z_j$ shows the binary state of unit $j$ after the $T$ ms presentation interval such that $z_j = 1$ if unit $j$ fires at least once. Also, the firing status of a neuron can be controlled by its threshold, $\theta$. Therefore, this constraint can be addressed by a threshold adjustment rule.

The goal is to minimize the objective function (Eq. 5) while maintaining the constraint (Eq. 6). This can be achieved by using a Lagrangian function

$$L(w_{ji}, \alpha) = F(x_i, w_{ji}) + \alpha \Big( \sum_j z_j - 1 \Big) \tag{7}$$

where $\alpha$ is a Lagrange multiplier. Minimizing the first component of Eq. 7 results in a coding module that represents the input by a new feature vector which can cluster the data via the synaptic weights. Minimizing the second component supports the sparsity and independence of the representation to finally (as a special case) end with a winner-take-all network in which one and only one neuron fires upon stimulus presentation. This is accomplished by adapting the neuron's threshold, $\theta$. The optimum of the Lagrangian function can be obtained by gradient descent on its derivatives

$$\frac{\partial L}{\partial w_{ji}} = -2 y_j (x_i - w_{ji}) + 2 \lambda y_j w_{ji} \tag{8}$$

$$\frac{\partial L}{\partial \alpha} = \sum_j z_j - 1 \tag{9}$$

From gradient descent on Eq. 8 (reversing the sign on the derivative), we obtain

$$\Delta w_{ji} \propto y_j (x_i - w_{ji} - \lambda w_{ji}) \tag{10}$$

However, the information needed in Eq. 10 is not yet temporally local. $x_i$ denotes the rescaled pixel intensity and does not represent the input spike train. To re-encode a pixel intensity, $x_i$, to a spike train, $s_i(t)$, we use uniformly distributed spikes (however, each spike train has a different random lag) with a rate given by the normalized pixel intensity in the range [0, 1]. The maximum number of spikes (for a completely white pixel) for a $T$ ms interval is 40. Additionally, $y_j$ is a positive value (spike count) denoting the neuron's activation in response to a stimulus presentation and is not available at synapse $ji$. The value can be reexpressed as $r_j(t)$, representing the output spike train of neuron $j$. Spike trains $s_i(t)$ and $r_j(t)$ are formulated by the sum of Dirac functions as shown in Eq. 11.

$$s_i(t) = \sum_{t^f \in \mathcal{F}_i} \delta(t - t^f), \qquad r_j(t) = \sum_{t^f \in \mathcal{F}_j} \delta(t - t^f) \tag{11}$$

$\mathcal{F}_i$ and $\mathcal{F}_j$ are the sets of presynaptic and postsynaptic spike times. After coding $x_i$ and $y_j$ by spike trains $s_i(t)$ and $r_j(t)$ respectively, we seek to propose a local, STDP learning rule following Eq. 10. When $x_i$ and $y_j$ are coded by spike trains over $T$ ms, the synaptic change in continuous time is given by

(12) |

is a normalizer denoting the maximum number of presynaptic spikes in ms interval. Over a short time period (, so that ), the weight adjustment at time is calculated by

$$\Delta w_{ji}(t) \propto r_j(t) \big( s_i(t) - w_{ji} - \lambda w_{ji} \big) \tag{13}$$

shows the firing status of neuron at time (). specifies the presynaptic spike emitted from neuron at time interval . In our experiments ms. The synaptic weight is changed only when a postsynaptic spike occurs (). Finally, the learning rule is formulated (upon firing of output neuron ) as follows

$$\Delta w_{ji} \propto s_i - w_{ji}(1 + \lambda) \tag{14}$$

where $s_i \in \{0, 1\}$. This learning rule is applied when an output neuron fires. The weight change is related to the presynaptic spike times received by the output neurons. This scenario is reminiscent of spike-timing-dependent plasticity (STDP), a popular local learning rule in biologically plausible spiking neural networks. In this STDP rule (Eq. 14), the current synaptic weight controls the weight change. For instance, if $\lambda = 0$ and $w_{ji} \in [0, 1]$ (it will be proved in Eq. 19), smaller weights undergo larger LTP and smaller LTD, and vice versa for larger weights.

The second adaptation rule is referred to as the threshold learning rule. Eq. 9 is used to implement a learning rule for adjusting the threshold, $\theta$. The threshold learning rule shown in Eq. 15 provides an independent and sparse feature representation. The threshold is the same for all neurons in the representation layer.

$$\Delta \theta \propto \sum_j z_j - 1 \tag{15}$$

## 4 Network Architecture and Learning

### 4.1 Neuron Model

The network architecture is shown in Fig. 1, consisting of $p^2$ and $D$ neurons in the input and representation layers respectively (one input neuron per pixel of a $p \times p$ patch). Stimuli are presented by spike trains over $T$ ms for both layers. At a given time step, a neuron in the representation layer is allowed to fire only if its criterion is met. The firing criterion records the neuron's score in a winners-take-all competition. The WTA score at time step $t$, given the entire set of incoming weights, $W$, into the representation layer, is given by

$$\mathrm{WTAscore}_j(t) = \frac{\exp\big(\sum_i w_{ji} \zeta_i(t)\big)}{\sum_k \exp\big(\sum_i w_{ki} \zeta_i(t)\big)} \tag{16}$$

$$\zeta_i(t) = \sum_{t^f} e^{-(t - t^f)/\tau} \tag{17}$$

where $\zeta_i(t)$ is the excitatory postsynaptic potential (EPSP) generated by input neuron $i$ and the $t^f$s are the recent spike times of unit $i$ during a small interval of 4 ms. The decay time constant, $\tau$, is set to 0.5 ms. In our network, the softmax value governs the time at which STDP occurs. If the WTAscore of a neuron is greater than the adaptive threshold, $\theta$, STDP is triggered and a spike is emitted. The softmax phenomenologically implements an inhibition to develop a winners-take-all circuit [46, 47] in the representation layer. The neurons in the representation layer are purely excitatory and there is no explicit lateral inhibition between them other than that implicitly implemented by the softmax. When inhibition is imposed within the representation layer, the network implements a form of competitive learning by virtue of STDP being triggered by the firing of postsynaptic neurons. Only neurons that “win the competition” are allowed to learn.
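As a rough illustration of the firing criterion described above, the following Python sketch computes a decaying EPSP per input and a softmax score per representation neuron. The exponential kernel shape, the function names `epsp` and `wta_scores`, and the toy weights are assumptions made for illustration; only the softmax competition, the 4 ms window, the 0.5 ms decay constant, and the 0.15 initial threshold come from the text.

```python
import math

def epsp(spike_times, t, tau=0.5, window=4.0):
    """Sum of exponentially decaying kernels for spikes in the last `window` ms.

    The exponential kernel is an assumed form, chosen because the text
    specifies a decay time constant.
    """
    return sum(math.exp(-(t - tf) / tau)
               for tf in spike_times if 0.0 <= t - tf < window)

def wta_scores(weights, epsps):
    """Softmax over the weighted EPSP drive of each representation neuron.

    weights: list of D rows, each with one weight per input neuron.
    epsps:   list with one EPSP value per input neuron.
    """
    drives = [sum(w * e for w, e in zip(row, epsps)) for row in weights]
    peak = max(drives)                      # subtract the max for numerical stability
    exps = [math.exp(d - peak) for d in drives]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: two inputs, two representation neurons.
weights = [[1.0, 0.0], [0.0, 1.0]]
epsps = [epsp([9.5, 10.0], t=10.0), epsp([], t=10.0)]
scores = wta_scores(weights, epsps)
theta = 0.15                                # initial threshold reported in the text
fired = [s > theta for s in scores]         # neurons above threshold spike and learn
```

Because the scores sum to one, raising one neuron's drive necessarily lowers the others' scores, which is how the softmax stands in for lateral inhibition.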

### 4.2 Learning Rules

The synaptic weight change shown in Eq. 14 defines an STDP rule where the current synaptic weight controls the magnitude of the change. STDP events are triggered upon postsynaptic firing. Eq. 18 shows the final STDP rule derived from Eq. 14. The weights fall in the range [0, 1] and are initialized randomly between 0 and 1.

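$$\Delta w_{ji} = \begin{cases} a \big(1 - w_{ji}(1 + \lambda)\big) & \text{if } s_i = 1 \\ -a\, w_{ji}(1 + \lambda) & \text{if } s_i = 0 \end{cases} \tag{18}$$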

is the learning rate. If , the first and second adaptation cases increase and decrease the synaptic weight respectively (LTP and LTD). If , then both cases are negative and decrease the weights down to the minimum value (). The weight adjustment, at equilibrium, demonstrates a probabilistic interpretation as follows

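$$E[\Delta w_{ji}] = a \big(1 - w_{ji}(1+\lambda)\big) P(s_i = 1 \mid r_j = 1) - a\, w_{ji}(1+\lambda) P(s_i = 0 \mid r_j = 1) = 0 \tag{19}$$

$$w_{ji} = \frac{P(s_i = 1 \mid r_j = 1)}{1 + \lambda} \tag{20}$$

Therefore, the synaptic weight converges to the scaled probability of presynaptic spike occurrence given a postsynaptic spike (LTP probability). From Eq. 20, the weights fall in the range $[0, \frac{1}{1+\lambda}]$ so that the first case refers to LTP ($\Delta w_{ji} \geq 0$) and the second one refers to LTD ($\Delta w_{ji} \leq 0$) at the equilibrium point.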
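The equilibrium claim can be checked numerically. The sketch below applies a two-case, weight-dependent update at every postsynaptic spike and lets the presynaptic spike be present with a fixed probability. The function names, the fixed-probability input model, and the larger learning rate used to shorten the simulation are illustrative assumptions rather than the authors' code.

```python
import random

def stdp_update(w, pre_spiked, a=0.0005, lam=0.0):
    """One event-based update, applied when the postsynaptic neuron fires."""
    if pre_spiked:
        dw = a * (1.0 - w * (1.0 + lam))    # LTP branch
    else:
        dw = -a * w * (1.0 + lam)           # LTD branch
    return min(1.0, max(0.0, w + dw))       # keep the weight in [0, 1]

def simulate_equilibrium(p_pre, lam, steps=20000, a=0.002, seed=0):
    """Run many postsynaptic events; the presynaptic spike occurs with prob. p_pre."""
    rng = random.Random(seed)
    w = rng.random()                        # random initial weight in [0, 1)
    for _ in range(steps):
        w = stdp_update(w, rng.random() < p_pre, a=a, lam=lam)
    return w

w_final = simulate_equilibrium(p_pre=0.6, lam=0.5)
print(w_final, 0.6 / 1.5)                   # simulated weight vs. p / (1 + lambda)
```

The simulated weight fluctuates around $p / (1 + \lambda)$; a smaller learning rate narrows the fluctuation at the cost of slower convergence.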

To show that the STDP rule (Eq. 18) is consistent with the learning rule in Eq. 10, we rewrite the non-local rule with learning rate, $a$, as follows

$$\Delta w_{ji} = a\, y_j (x_i - w_{ji} - \lambda w_{ji}) \tag{21}$$

As stated earlier, this rule is temporally non-local and shows the weight change over a $T$ ms interval. In contrast, the STDP rule is temporally local, applying the weight change at one time step when the postsynaptic neuron fires. To make Eq. 21 and Eq. 18 (which is derived from Eq. 14) comparable with each other, we consider a time interval with only one postsynaptic spike where $y_j = 1$. Specifically, we break the $T$ ms interval into subintervals whose boundaries are determined by the event of a postsynaptic spike ($r_j = 1$). It is sufficient to analyze an arbitrary subinterval. Therefore, Eq. 21 at time $t$ is simplified to

$$\Delta w_{ji} = a \big(x_i - w_{ji}(1 + \lambda)\big) \tag{22}$$

Following Eq. 19 for calculating the expected weight change using the proposed STDP rule, where $r_j = 1$, we find that

$$E[\Delta w_{ji}] = a \big(1 - w_{ji}(1+\lambda)\big) P(s_i = 1) - a\, w_{ji}(1+\lambda) \big(1 - P(s_i = 1)\big) = a \big(P(s_i = 1) - w_{ji}(1+\lambda)\big) \tag{23}$$

where $P(s_i = 1)$ is the firing probability of presynaptic neuron $i$. Also, we generated the presynaptic spike trains using the normalized pixel intensities in the range [0, 1] with different random lags. Thus, this probability value is the same as the normalized pixel intensity, $x_i$, used as the firing rate. Therefore,

$$E[\Delta w_{ji}] = a \big(x_i - w_{ji}(1+\lambda)\big) \tag{24}$$

which matches the weight change shown in Eq. 22. This shows that the proposed STDP rule is consistent with the non-local rule. Additionally, the STDP weight change is an unbiased estimate of the non-local (non-spike based) learning rule. Over a short time period, the proposed learning rule is also an unbiased estimate of the Hebbian rule provided by Földiák [20] (Eq. 3).
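The consistency argument above reduces to a short algebraic identity, which the following sketch verifies over a grid of values. It assumes the two-case STDP update with a presynaptic spike probability equal to the normalized intensity; the function names are hypothetical.

```python
def expected_stdp_change(w, p, a, lam):
    """Expected event-based change: LTP branch with prob. p, LTD branch otherwise."""
    ltp = a * (1.0 - w * (1.0 + lam))
    ltd = -a * w * (1.0 + lam)
    return p * ltp + (1.0 - p) * ltd

def nonlocal_change(w, x, a, lam):
    """Rate-based change for a subinterval holding one postsynaptic spike (y = 1)."""
    return a * (x - w * (1.0 + lam))

# Compare the two expressions on a grid of weights, intensities, and regularizers.
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
max_gap = max(
    abs(expected_stdp_change(w, x, 0.0005, lam) - nonlocal_change(w, x, 0.0005, lam))
    for w in grid for x in grid for lam in (0.0, 0.5, 2.0)
)
print(max_gap)  # differs from zero only by floating-point rounding
```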

For the threshold adaptation, following Eq. 15, the threshold learning rule can be written as

$$\Delta \theta = b\,(m_z - 1) \tag{25}$$

where $b$ is the learning rate and $m_z$ is the number of neurons in the representation layer firing in $T$ ms. This rule adjusts the threshold such that only one neuron fires in response to a stimulus. This criterion provides a framework to extract independent features in a sparse representation. In the experiments, the initial threshold is set to 0.15.
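A minimal sketch of this homeostatic adjustment follows: the shared threshold rises when more than one neuron fired during a presentation and falls when none did. The function name and the `target` parameter are hypothetical; `target=1` corresponds to the single-winner case described in the text.

```python
def update_threshold(theta, n_fired, b=0.0001, target=1):
    """Move the shared threshold toward the point where `target` neurons fire."""
    return theta + b * (n_fired - target)

theta = 0.15                       # initial threshold reported in the text
for n_fired in [3, 2, 1, 0]:       # neurons that fired in successive presentations
    theta = update_threshold(theta, n_fired)
```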

## 5 Evaluation Metrics

### 5.1 Reconstructed image

The representation filter set, $W$, is a weight matrix coding a $p \times p$ image patch ($p^2$ input spike trains) to a vector of $D$ postsynaptic spike trains. To reconstruct the image patch from the coded spike trains, the reconstruction filter set, $W^T$, is used to build $p^2$ spike trains. For this purpose, neurons in the input layer receive spike trains from the neurons in the representation layer via the transposed synaptic weight matrix (like an autoencoder).
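The decoding step can be pictured with a rate-based simplification: project the spike counts of the representation neurons back through the transposed weights and rescale. This is only a sketch under that assumption; the paper's reconstruction drives spiking input neurons, and the function name and peak normalization here are illustrative choices.

```python
def reconstruct(weights, rep_counts):
    """Project representation spike counts back to input space via the transpose.

    weights:    D rows, each holding one weight per input neuron.
    rep_counts: D spike counts from the representation layer.
    Returns one value per input neuron, rescaled so the maximum is 1.
    """
    n_inputs = len(weights[0])
    raw = [sum(weights[j][i] * rep_counts[j] for j in range(len(weights)))
           for i in range(n_inputs)]
    peak = max(raw)
    return [v / peak for v in raw] if peak > 0 else raw

# Two representation neurons, two inputs; only the first neuron fired.
patch_hat = reconstruct([[1.0, 0.0], [0.0, 0.5]], [2, 0])
```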

### 5.2 Reconstruction loss

To report the reconstruction loss, we use the correlation measure (Pearson correlation) and the root mean square (RMS) between the normalized original, , and reconstructed, , patches as shown in Eqs. 26 and 27 respectively. A patch stands for spike train frequencies, .

(26) |

(27) |

Where, is the number of patches extracted from the image.
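For readers who want to compute comparable numbers, the sketch below implements a Pearson correlation and a root-mean-square error between an original and a reconstructed patch. Treating the correlation loss as one minus the Pearson coefficient is an assumption made here, not a definition taken from the paper, and the function names are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0                      # undefined for constant input; report 0
    return cov / math.sqrt(var_a * var_b)

def correlation_loss(original, reconstructed):
    """Assumed form: 1 - Pearson correlation, so lower is better."""
    return 1.0 - pearson(original, reconstructed)

def rms_error(original, reconstructed):
    """Root-mean-square difference between two normalized patches."""
    n = len(original)
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(original, reconstructed)) / n)
```

Averaging either measure over all patches extracted from an image gives a per-image loss.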

### 5.3 Sparsity

To calculate the sparsity, we use average activity and breadth tuning measures. The average activity specifies the density of spikes released from neurons in the representation layer over time steps given in Eq. 28.

(28) |

The breadth tuning measure introduced by Rolls and Tovee [48] specifies the density of neural layer activity (Eq. 29), calculated from the ratio of the mean, $\mu$, and standard deviation, $\sigma$, of spike frequencies in the representation layer upon presenting a stimulus. The breadth tuning measures the neural selectivity such that the sparse code distribution concentrates near zero with a heavy tail [49]. For a neural layer where most of the neurons fire, the activity distribution is more uniformly spread and the breadth tuning is greater than 0.5. In contrast, in a sparse code where most of the neurons do not fire, the distribution is peaked at zero and the breadth tuning is less than 0.5.

$$BT = \frac{1}{1 + C^2}, \qquad C = \frac{\sigma}{\mu} \tag{29}$$
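Both sparsity measures are easy to compute from recorded activity. The sketch below uses the Rolls and Tovee sparseness written as the squared mean over the mean of squares, which is algebraically equal to $1 / (1 + (\sigma/\mu)^2)$ when $\sigma$ is the population standard deviation. The function names and the spike-matrix layout are illustrative assumptions.

```python
def average_activity(spikes):
    """Fraction of (neuron, time step) slots that contain a spike.

    spikes: list of D rows, each a binary spike train of equal length.
    """
    total_slots = len(spikes) * len(spikes[0])
    return sum(sum(row) for row in spikes) / total_slots

def breadth_tuning(rates):
    """Rolls-Tovee sparseness: (mean rate)^2 / (mean of squared rates)."""
    n = len(rates)
    mean = sum(rates) / n
    mean_sq = sum(r * r for r in rates) / n
    return (mean * mean) / mean_sq if mean_sq > 0 else 0.0

dense = breadth_tuning([1, 1, 1, 1])    # every neuron equally active
sparse = breadth_tuning([1, 0, 0, 0])   # a single active neuron
```

A layer in which all neurons fire equally scores 1, while a layer with one active neuron out of $D$ scores $1/D$.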

## 6 Experiments and Results

We ran two experiments using the MNIST [50] and the natural image [26] datasets to evaluate the proposed local representation learning rules embedded in the single-layer SNN. The intensities of gray-scaled images were normalized to the range [0, 1], yielding the spike rates used to generate uniformly distributed spike trains for the input layer over $T$ ms. The learning rates $a$ and $b$ are set to 0.0005 and 0.0001 respectively. The regularizer hyperparameter, $\lambda$, is set to zero for most of the experiments.
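The input encoding used in both experiments (evenly spaced spikes with a random lag per train) can be sketched as follows. The presentation length of 40 one-millisecond steps is an assumption consistent with the stated maximum of 40 spikes for a white pixel, and the function name, integer lag, and circular shift are illustrative choices.

```python
import random

def encode_intensity(x, T=40, rng=None):
    """Encode a normalized intensity x in [0, 1] as a binary spike train of T steps."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    rng = rng or random.Random()
    n_spikes = round(x * T)              # spike count proportional to intensity
    lag = rng.randrange(T)               # random lag, drawn independently per train
    train = [0] * T
    for k in range(n_spikes):
        # Evenly spaced slots, circularly shifted by the lag.
        train[((k * T) // n_spikes + lag) % T] = 1
    return train

patch = [0.0, 0.25, 0.5, 1.0]
trains = [encode_intensity(x, rng=random.Random(i)) for i, x in enumerate(patch)]
print([sum(tr) for tr in trains])        # [0, 10, 20, 40]
```

Integer slot arithmetic guarantees that the spikes of one train never collide, so the spike count is exactly proportional to the intensity.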

### 6.1 Experiment 1: MNIST dataset

Experiments were run using $5 \times 5$ patches from MNIST digits. We used a random subset of the MNIST dataset divided into 15,000 training and 1000 testing images for learning and evaluating the model. The SNN consists of 25 ($5 \times 5$ image patch) neurons in the input layer and $D$ neurons in the representation layer. These variations of the network architecture (different $D$ values) determine under-complete to over-complete representations. Trained filters, after 1 through 15,000 iterations, for the network with 32 neurons in the representation layer are shown in Fig. 2a. After 1000 training iterations, the kernels start becoming selective to specific visual patterns (orientations). The filters shown in this image tend to be orientation selective and extract different visual features. Fig. 2c shows the RMS reconstruction loss and statistical characteristics of the trained weights versus the log regularizer hyperparameter ($\log \lambda$). The RMS reconstruction loss values are less than 0.18 for and the minimum RMS values belong to the models trained with . The maximum and minimum synaptic weights after training are $\frac{1}{1+\lambda}$ and 0 respectively, as predicted by Eq. 20.

The three performance measures from the previous section were used to assess the model. These were the reconstructed images, the reconstruction loss, and the sparsity. The reconstructed images of randomly selected digits 0 through 9, acquired by the SNN with neurons in the representation layer, are shown in Fig. 2b. The reconstructed maps show high quality images comparable with the original images. The reconstruction loss measures for the SNNs with through filters appear in Figs. 3a and 3b. The SNNs with and show minimum reconstruction loss after training. Sparsity measures reported by the average sparsity and the breadth tuning are shown in Figs. 3c and 3d. The sparsity measures also show better performance of the networks with filters. The average sparsity value of 0.09 shows that only 9% of neurons are active at each trial. The breadth tuning value of 0.23 indicates a sparse stimulus representation.

### 6.2 Experiment 2: Natural images

This experiment evaluates the proposed spiking representation model on natural image patches [26]. Fig. 4a shows the trained representation filters for the SNNs with 16, 32, and 64 neurons in the representation layer. For instance, where , except for the marked filters, the other filters have low correlation with each other. For visual assessment, Fig. 4b shows three natural images and their reconstructed maps. Performance of the proposed model in terms of the reconstruction loss and sparsity measures on natural images is shown in Fig. 4c. Minimum reconstruction loss belongs to the networks with neurons in the representation layer. The small number of neurons () is not able to capture visual codes. On the other hand, using many neurons (in our study ) increases reconstruction loss because a number of neurons cannot be involved in the learning process due to the WTA constraint.

### 6.3 Comparisons

The proposed spiking representation learning method shows better performance than the traditional K-means clustering [51] and the restricted Boltzmann machine (RBM) [52, 53] while introducing local learning in time and space. The K-means and RBM approaches were applied to the normalized pixel intensities of image patches (not spike trains). Table 1 shows this comparison in terms of reconstruction loss (correlation and RMS). Our model outperforms the RBM and K-means methods except for the two cases (natural images) in which the RBM shows slightly better performance. Fig. 5 shows the trained filters obtained by K-means, RBM, and our model based on the MNIST and natural image patches. K-means, similar to our model, detects different visual orientations for the MNIST and natural image patches, but the filters are highly correlated. RBM did not perform well for the MNIST dataset but it successfully learned representative visual filters for the natural image patches where . These trained filters (Fig. 5) confirm the reconstruction loss variations reported in Table 1.

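Table 1: Reconstruction loss (correlation and RMS) on MNIST and natural images for $D$ = 16, 32, and 64 neurons in the representation layer.

| Rec. Loss | MNIST Corr. ($D$=16) | MNIST Corr. ($D$=32) | MNIST Corr. ($D$=64) | MNIST RMS ($D$=16) | MNIST RMS ($D$=32) | MNIST RMS ($D$=64) | Natural Corr. ($D$=16) | Natural Corr. ($D$=32) | Natural Corr. ($D$=64) | Natural RMS ($D$=16) | Natural RMS ($D$=32) | Natural RMS ($D$=64) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K-means | 0.22 | 0.23 | 0.26 | 0.18 | 0.21 | 0.26 | 0.45 | 0.52 | 0.57 | 0.31 | 0.36 | 0.40 |
| RBM | 0.49 | 0.49 | 0.40 | 0.27 | 0.27 | 0.26 | 0.92 | 0.41 | 0.44 | 0.47 | 0.27 | 0.26 |
| Our STDP | 0.20 | 0.20 | 0.24 | 0.17 | 0.17 | 0.21 | 0.49 | 0.40 | 0.47 | 0.24 | 0.27 | 0.40 |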

Table 2 compares our results with the only (to the best of our knowledge) spike-based representation learning models. The correlation-based reconstruction loss on MNIST and natural images (0.2 and 0.4) shows improvement over the existing spiking autoencoder using mirrored STDP (0.2 and 0.65) proposed by Burbank [39]. The sparse representation introduced by King et al [25], which is a modified version of the SAILnet algorithm [22], reported an RMS reconstruction loss around 0.74 that is calculated based on the spike rates normalized to unit standard deviation (let’s say zRMS). Our model compared favorably with their model with zRMS=0.67. However, our model did not scale well to a larger number of neurons when in the representation layer. The problem appears to stem from the threshold adjustment rule (Eq. 25). If we change the rule to , where is a proportion of , the representation layer would be more active and a large number of filters can be trained to reduce the reconstruction loss.

## 7 Discussion and Conclusion

This paper proposed a novel STDP-based representation learning method embedded in an SNN and evaluated in two experiments to establish its initial viability. The learning rules were derived by constrained optimization incorporating a vector quantization-like objective function with regularization and a firing constraint. The learning rules include spatio-temporally local STDP-type weight adaptation and a threshold adjustment rule. The STDP rule at equilibrium showed a probabilistic interpretation of the synaptic weights scaled by the regularizer hyperparameter. In addition to the threshold adaptation rule, the WTA-thresholded neurons in the representation layer implemented inhibition to represent sparse and independent visual features.

The experimental results showed high performance of the proposed model in comparison with non-spiking and spiking approaches. Our model outperformed the traditional K-means and RBM models in most cases, both in representation learning and in training orientation-selective kernels. Also, our method showed better performance (in terms of reconstruction loss) than the state-of-the-art spiking representation learning approaches introduced by [39] (spiking autoencoder) and [22, 25] (sparse representation).

To obtain the spatio-temporally local learning rules embedded in the SNN, we started from a non-spiking quantization criterion inspired by [20]. Then, we developed novel rules to implement STDP-based representation learning and a threshold adjustment rule for spiking platforms. The spike-based platform and the spatio-temporally local learning rules, which follow human brain functionality, constitute the main difference between our study and well-known, traditional representation learning methods introduced in the literature. The few spiking representation learning methods in the literature suffer from limitations such as violating Dale’s law [22], synapses that can change sign [22, 25], low performance in terms of reconstruction loss [39], and non-spiking input signals [22, 25]. In this study we proposed an STDP learning rule which keeps the synaptic weights in the range [0, 1]. The SNN architecture consists of excitatory neurons and an implicit inhibition occurring in the representation layer. The implicit inhibition can be regarded as analogous to a separate inhibitory neuron balancing neural activities in the representation layer, so that Dale’s law is maintained. Furthermore, the proposed SNN implements spiking neurons in both the input and representation layers and the neurons only communicate through temporal spike trains. To the best of our knowledge, our approach is the only high performance representation learning approach implemented on spiking neural networks.

Although the proposed spiking representation learning was successful, one limitation is that the spike rate of the presynaptic neurons is higher than that of biological spiking neurons. Our future work seeks to reduce this spike rate to be more biologically plausible. Using more presynaptic neurons that represent mutually exclusive intensity bands would be a starting point.

## References

- [1] Maneesh Bhand, Ritvik Mudur, Bipin Suresh, Andrew Saxe, and Andrew Y Ng. Unsupervised learning models of primary cortical receptive fields and receptive field plasticity. In Advances in neural information processing systems, pages 1971–1979, 2011.
- [2] Honglak Lee, Chaitanya Ekanadham, and Andrew Y Ng. Sparse deep belief net model for visual area V2. In Advances in neural information processing systems, pages 873–880, 2008.
- [3] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223, 2011.
- [4] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
- [5] R Quian Quiroga, Leila Reddy, Gabriel Kreiman, Christof Koch, and Itzhak Fried. Invariant visual representation by single neurons in the human brain. Nature, 435(7045):1102–1107, 2005.
- [6] Nikos K Logothetis and David L Sheinberg. Visual object recognition. Annual review of neuroscience, 19(1):577–621, 1996.
- [7] Maximilian Riesenhuber and Tomaso Poggio. Neural mechanisms of object recognition. Current opinion in neurobiology, 12(2):162–168, 2002.
- [8] Malcolm P Young and Shigeru Yamane. Sparse population coding of faces in the inferotemporal cortex. Science, 256(5061):1327–1331, 1992.
- [9] Sofia M Landi and Winrich A Freiwald. Two areas for familiar face recognition in the primate brain. Science, 357(6351):591–595, 2017.
- [10] Brian A Wandell. Foundations of vision. Sinauer Associates, 1995.
- [11] Yves Frégnac, Julien Fournier, Florian Gérard-Mercier, Cyril Monier, Marc Pananceau, Pedro Carelli, and Xoana Troncoso. The visual brain: Computing through multiscale complexity. In Micro-, Meso-and Macro-Dynamics of the Brain, pages 43–57. Springer, 2016.
- [12] Matthew W Self, Judith C Peters, Jessy K Possel, Joel Reithler, Rainer Goebel, Peterjan Ris, Danique Jeurissen, Leila Reddy, Steven Claus, Johannes C Baayen, et al. The effects of context and attention on spiking activity in human early visual cortex. PLoS biology, 14(3):e1002420, 2016.
- [13] Wolfgang Maass. Networks of spiking neurons: the third generation of neural network models. Neural Networks, 10(9):1659–1671, 1997.
- [14] Eugene M Izhikevich. Which model to use for cortical spiking neurons? IEEE transactions on neural networks, 15(5):1063–1070, 2004.
- [15] Samanwoy Ghosh-Dastidar and Hojjat Adeli. Spiking neural networks. International Journal of Neural Systems, 19(04):295–308, 2009.
- [16] Wolfgang Maass. To spike or not to spike: that is the question. Proceedings of the IEEE, 103(12):2219–2224, 2015.
- [17] Wolfgang Maass. On the computational power of noisy spiking neurons. In Advances in neural information processing systems, pages 211–217, 1996.
- [18] Daniel Neil, Michael Pfeiffer, and Shih-Chii Liu. Learning to be efficient: Algorithms for training low-latency, low-compute deep spiking neural networks. In Proceedings of the 31st Annual ACM Symposium on Applied Computing, pages 293–298. ACM, 2016.
- [19] Carlos SN Brito and Wulfram Gerstner. Nonlinear Hebbian learning as a unifying principle in receptive field formation. PLoS computational biology, 12(9):e1005070, 2016.
- [20] Peter Földiák. Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64(2):165–170, 1990.
- [21] H B Barlow. Unsupervised learning. Neural Computation, 1(3):295–311, 1989.
- [22] Joel Zylberberg, Jason Timothy Murphy, and Michael Robert DeWeese. A sparse coding model with synaptically local plasticity and spiking neurons can account for the diverse shapes of V1 simple cell receptive fields. PLoS Comput Biol, 7(10):e1002250, 2011.
- [23] Amirhossein Tavanaei and Anthony S Maida. Multi-layer unsupervised learning in a spiking convolutional neural network. In Neural Networks (IJCNN), 2017 International Joint Conference on, pages 2023–2030. IEEE, 2017.
- [24] Henry Markram, Wulfram Gerstner, and Per Jesper Sjöström. Spike-timing-dependent plasticity: a comprehensive overview. Frontiers in synaptic neuroscience, 4, 2012.
- [25] Paul D King, Joel Zylberberg, and Michael R DeWeese. Inhibitory interneurons decorrelate excitatory cells to drive sparse code formation in a spiking model of V1. The Journal of Neuroscience, 33(13):5475–5485, 2013.
- [26] Bruno A Olshausen and David J Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.
- [27] Anthony J Bell and Terrence J Sejnowski. The “independent components” of natural scenes are edge filters. Vision Research, 37(23):3327–3338, 1997.
- [28] Martin Rehn and Friedrich T Sommer. A network that uses few active neurones to code visual input predicts the diverse shapes of cortical receptive fields. Journal of computational neuroscience, 22(2):135–146, 2007.
- [29] Bruno A Olshausen, Charles F Cadieu, and David K Warland. Learning real and complex overcomplete representations from the statistics of natural images. In Proc SPIE, volume 7446, pages 74460S–1, 2009.
- [30] Chr. von der Malsburg. Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14:85–100, 1973.
- [31] E. L. Bienenstock, L. N. Cooper, and Paul W. Munro. Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2(1):32–48, 1982.
- [32] Timothee Masquelier. Relative spike time coding and STDP-based orientation selectivity in the early visual system in natural continuous and saccadic vision: a computational model. Journal of Computational Neuroscience, 32(3):425–441, 2012.
- [33] D H Hubel and T N Wiesel. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of Physiology, 160(1):106–154, 1962.
- [34] Natalia Caporale and Yang Dan. Spike timing-dependent plasticity: a Hebbian learning rule. Annu. Rev. Neurosci., 31:25–46, 2008.
- [35] Timothée Masquelier and Simon J Thorpe. Unsupervised learning of visual features through spike timing dependent plasticity. PLoS computational biology, 3(2):e31, 2007.
- [36] Saeed Reza Kheradpisheh, Mohammad Ganjtabesh, and Timothée Masquelier. Bio-inspired unsupervised learning of visual features leads to robust invariant object recognition. Neurocomputing, 205:382–392, 2016.
- [37] Amirhossein Tavanaei, Timothée Masquelier, and Anthony S Maida. Acquisition of visual features through probabilistic spike-timing-dependent plasticity. In Neural Networks (IJCNN), 2016 International Joint Conference on, pages 307–314. IEEE, 2016.
- [38] Saeed Reza Kheradpisheh, Mohammad Ganjtabesh, Simon J Thorpe, and Timothée Masquelier. STDP-based spiking deep neural networks for object recognition. arXiv preprint arXiv:1611.01421, 2016.
- [39] Kendra S Burbank. Mirrored STDP implements autoencoder learning in a network of spiking neurons. PLoS Comput Biol, 11(12):e1004566, 2015.
- [40] Cristina Savin, Prashant Joshi, and Jochen Triesch. Independent component analysis in spiking neurons. PLoS Comput Biol, 6(4):e1000757, 2010.
- [41] Adam Coates and Andrew Y Ng. Learning feature representations with k-means. In Neural networks: Tricks of the trade, pages 561–580. Springer, 2012.
- [42] Peter Földiák. Adaptive network for optimal linear feature extraction. In Neural Networks (IJCNN), 1989 International Joint Conference on, volume 1, pages 401–405. IEEE, 1989.
- [43] Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of mathematical biology, 15(3):267–273, 1982.
- [44] Barbara Hammer and Thomas Villmann. Generalized relevance learning vector quantization. Neural Networks, 15(8):1059–1068, 2002.
- [45] Petra Schneider, Michael Biehl, and Barbara Hammer. Distance learning in discriminative vector quantization. Neural computation, 21(10):2942–2969, 2009.
- [46] Amirhossein Tavanaei and Anthony S Maida. Bio-inspired spiking convolutional neural network using layer-wise sparse coding and STDP learning. arXiv preprint arXiv:1611.03000, 2016.
- [47] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.
- [48] Edmund T Rolls and Martin J Tovee. Sparseness of the neuronal representation of stimuli in the primate temporal visual cortex. Journal of Neurophysiology, 73(2):713–726, 1995.
- [49] P. Foldiak and D. Endres. Sparse coding. Scholarpedia, 3(1):2984, 2008.
- [50] Yann LeCun, Corinna Cortes, and Christopher JC Burges. The MNIST database. URL http://yann.lecun.com/exdb/mnist, 1998.
- [51] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.
- [52] Geoffrey Hinton. A practical guide to training restricted Boltzmann machines. Momentum, 9(1):926, 2010.
- [53] Nicolas Le Roux and Yoshua Bengio. Representational power of restricted Boltzmann machines and deep belief networks. Neural computation, 20(6):1631–1649, 2008.