# Reliable counting of weakly labeled concepts by a single spiking neuron model.

###### Abstract

Making an informed, correct and quick decision can be life-saving. It is crucial for animals during escape behaviour and for autonomous cars while driving. The decision can be complex and may involve assessing the number of threats present and the nature of each threat. Thus, we should expect early sensory processing to supply classification information quickly and accurately, even before relaying the information to higher brain areas or to more complex system components downstream. Today, advanced convolutional artificial neural networks can successfully solve such tasks and are commonly used to build complex decision-making systems. However, to achieve excellent performance on these tasks they require increasingly complex, "very deep" model structures, which are costly in inference run-time, energy consumption and number of training samples, and are often trainable only on cloud-computing clusters. A single spiking neuron has been shown to solve many of these tasks Gutig:2016 () for homogeneous Poisson input statistics, a commonly used model of spiking activity in the neocortex; modeled as a leaky integrate-and-fire neuron with a gradient descent learning algorithm, it was shown to possess a wide variety of complex computational capabilities. Here we improve its learning algorithm. We also account for more natural, stimulus-generated inputs that deviate from homogeneous Poisson spiking. The improved gradient-based local learning rule allows for significantly better and more stable generalization and more efficient performance. Finally, we apply our model to a problem of multiple instance learning (MIL) with counting, where labels are only available for collections of concepts. In this counting MNIST task the neuron exploits the improved algorithm and succeeds, outperforming the previously introduced single-neuron learning algorithm as well as a conventional ConvNet architecture with a similar parameter space size and number of training epochs.


Hannes Rapp Computational Systems Neuroscience University of Cologne Cologne, Germany hannes.rapp@smail.uni-koeln.de Martin Paul Nawrot University of Cologne Cologne, Germany mnawrot@uni-koeln.de Merav Stern Department of Applied Mathematics University of Washington Seattle, WA 98195 ms4325@uw.edu

Preprint. Work in progress.

## 1 Introduction

The basic elements of our brains are neurons. Biological neurons communicate with each other through discrete events in time, the so-called spikes. However, networks of spiking neurons are difficult to model and analyze because of the discrete nature of spikes and the fast rise and reset of the neuron’s voltage that generates them. Hence, the vast majority of models ignore the discrete nature of spikes and assume that only the rate of spike occurrences matters. Rates can, by construction, be treated as continuous time-varying functions, which allows various derivative-based approaches, such as gradient descent learning, to be implemented.

Hence, it is not surprising that the majority of neural network studies and algorithms are rate based. Their implementations through deep learning (lecun2015deep (), schmidhuber2015deep ()), ConvNets (lecun1998gradient ()), echo state networks (Jaeger:2004 ()) and more are indeed highly successful. As the tasks become more complex, however, these model classes become increasingly costly and often require cloud-computing clusters and millions of samples to be trained NIPS2012_4824 (), Simonyan:2014rt (). It was recently shown OpenAI:compute () that the amount of computation needed by such artificial systems has been growing exponentially since 2012.

Therefore, it is worthwhile to examine biologically realistic spiking neurons as computational units despite the technical challenges in doing so. Thanks to their efficiency, spiking neurons and networks seem to be natural candidates for the next generation of neural network algorithms. Some recent studies managed to train spiking neural networks with gradient-based learning methods. To overcome the discontinuity problem, the currents created by the spikes in the receiving neurons (essentially a linear low-pass filtering of the spikes) were used for the training procedures in Clopath:2017 () and Huh:2017dz (). Other studies use the timing of the spikes as a continuous parameter Bohte2000 (), OConnor:2017xd (), which leads to neuronal (synaptic) learning rules that rely on the exact time intervals between the spikes of the sending and receiving (pre- and postsynaptic) neurons. These Spike Timing Dependent Plasticity (STDP) rules were first observed experimentally, and hence much attention has been given to them in neuroscientific studies Bi:2001 () Caporale:2008 () SongSTDP:2000 (). But their computational capabilities, especially for classification tasks, have not been well exploited. Whether the brain uses the timing of spikes or their rate to represent information, and whether connectivity in the brain is modified accordingly, is an interesting and still highly debated question. We leave this broad open question outside the scope of this paper.

An additional intriguing approach is to train spiking neurons as classifiers, i.e. as perceptron-like machines Gutig:2006lf (), Memmesheimer:2014 (). Here, gradient learning is based on the neuron’s membrane voltage, specifically on the maximum voltage the neuron reached compared to its spiking threshold. A full spiking network was trained in a similar fashion to generate patterns SussilloCOSYNE:2018 (). Here we concentrate on the algorithm for the recently published Multi-Spike Tempotron Gutig:2016 (), a single-neuron leaky integrate-and-fire model that solves regression problems, including learning how to recognize concepts within a collection. Specifically, the Multi-Spike Tempotron (MSP) learns to generate a certain number of spikes for a given concept (stimulus). The learning algorithm changes the input weights by gradient descent on a voltage threshold, such that the weights eventually fit the threshold at which the neuron generates exactly the required number of spikes. We outline the underlying algorithm in more detail in the method section. The training signals we use for the Multi-Spike Tempotron are given for collections (bags) of concepts, a learning strategy termed Multiple Instance Learning (MIL) with counting that has recently been proposed in the literature Foulds:2010 (), Lempitsky2010 (), Segui:2015 (). Thus, the Multi-Spike Tempotron is capable of evaluating the sum of multiple object instances present in an input stream. This is especially useful in early stages of decision making Stanford:2010 (), where a quick assessment of the number of threats present is needed, for example to help escape predators or avoid collisions.

It has been shown that in-vivo cortical spiking activity is typically more regular than Poisson Mochizuki:2016 (), Nawrot:2010 (). In general, any correlated stimulus input is expected to deviate from Poisson Farkhooi:2011 (). Moreover, input is generally non-homogeneous, i.e. time-varying. However, only homogeneous Poisson statistics of input patterns and background were considered in Gutig:2016 (). Hence, here we study the learning capabilities of the Multi-Spike Tempotron when the statistics of the input patterns deviate from Poisson and when the background statistics are non-homogeneous, i.e. time-varying, as under realistic noisy conditions.

Our main goal in this study is to present an improved learning algorithm. The Tempotron algorithm originally used Momentum POLYAK:1964 () to boost its capabilities. We show here that a connection (synapse) specific adaptive update approach with smoothing over previous updates, similar to RMSprop TielemanHinton:2012 (), yields significantly better and more stable generalization and more efficient learning of the Multi-Spike Tempotron. We review both Momentum and RMSprop in the method section. We further show that our improved learning algorithm performs better in the biological context of non-homogeneous spiking in the neocortex as well as in a counting task on MNIST digits. We finally show that it outperforms a deep network (ConvNet) with a similar parameter space and number of training epochs.

## 2 Method

### 2.1 Tempotron Model

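The Multi-Spike Tempotron is a current-based leaky integrate-and-fire neuron model. Its membrane potential, $V(t)$, follows the dynamical equation:

$$V(t) = \sum_{i=1}^{N} w_i \sum_{t_i^j < t} K\left(t - t_i^j\right) - \vartheta \sum_{t_{\mathrm{spike}}^j < t} \exp\left(-\frac{t - t_{\mathrm{spike}}^j}{\tau_m}\right) \tag{1}$$

where $t_i^j$ denotes the time of spike number $j$ from the input source (presynaptic) number $i$, and $t_{\mathrm{spike}}^j$ denotes the time of spike number $j$ of the Tempotron neuron model. Every input spike at $t_i^j$ contributes to the potential by the kernel:

$$K\left(t - t_i^j\right) = V_{\mathrm{norm}} \left[ \exp\left(-\frac{t - t_i^j}{\tau_m}\right) - \exp\left(-\frac{t - t_i^j}{\tau_s}\right) \right] \tag{2}$$

times the synaptic weight $w_i$ of that input source. These synaptic input weights are learned via the gradient descent algorithm. The kernel is normalized to have its peak value at $1$ with $V_{\mathrm{norm}} = \eta^{\eta/(\eta-1)}/(\eta-1)$ and $\eta = \tau_m/\tau_s$, where $\tau_m$ and $\tau_s$ are the membrane time constant and the synaptic decay time constant. The kernel is causal: it vanishes for $t < t_i^j$. When $V(t)$ crosses the threshold $\vartheta$ the neuron emits a spike and is reset to $V_{\mathrm{rest}} = 0$ by the second term in equation (1).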
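The model dynamics described above can be illustrated with a short discrete-time sketch. It assumes a double-exponential kernel normalized to peak value 1 and a reset term that subtracts the threshold with membrane decay; the function names (`make_kernel`, `simulate`), the time step and the time constants are illustrative choices, not values taken from the authors' MATLAB implementation.

```python
import math

def make_kernel(tau_m, tau_s):
    """Return a causal double-exponential kernel K(dt) normalized to peak 1."""
    eta = tau_m / tau_s
    v_norm = eta ** (eta / (eta - 1.0)) / (eta - 1.0)

    def kernel(dt):
        if dt < 0.0:  # causal: no contribution before the input spike
            return 0.0
        return v_norm * (math.exp(-dt / tau_m) - math.exp(-dt / tau_s))

    return kernel

def simulate(weights, input_spikes, theta=1.0, tau_m=0.020, tau_s=0.005,
             t_end=0.5, dt=0.001):
    """Evaluate V(t) on a time grid and return the output spike times.

    weights      -- one synaptic weight per input source
    input_spikes -- one list of spike times (seconds) per input source
    All default parameter values are assumptions made for illustration.
    """
    kernel = make_kernel(tau_m, tau_s)
    out_spikes = []
    for n in range(int(round(t_end / dt)) + 1):
        t = n * dt
        # weighted sum of kernels over all earlier input spikes
        v = 0.0
        for w, train in zip(weights, input_spikes):
            v += w * sum(kernel(t - t_in) for t_in in train if t_in < t)
        # every earlier output spike subtracts a decaying copy of the threshold
        v -= theta * sum(math.exp(-(t - t_out) / tau_m) for t_out in out_spikes)
        if v >= theta:
            out_spikes.append(t)
    return out_spikes
```

For example, `simulate([2.0], [[0.1]])` drives the neuron with a single strong input spike at 0.1 s, while `simulate([0.5], [[0.1]])` stays below threshold and returns an empty list.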

In order to have the neuron emit the required number of spikes in response to a specific concept (implemented as a synaptic input spike pattern), the weights are modified. Since the required spike counts are discrete and hence non-differentiable, the gradient for the weights is derived from the spiking threshold instead. We wish to change the weights such that a critical threshold $\vartheta^*$ of the neuron’s voltage coincides with its threshold $\vartheta$, so that the threshold is crossed exactly the desired number of times and generates the desired spikes. This loss function is called the Spike-Threshold Surface (STS). Hence the appropriate gradient step can be described by:

$$\Delta \mathbf{w} = \sigma \, \lambda \, \nabla_{\mathbf{w}} \vartheta^* \tag{3}$$

where $\sigma \in \{-1, +1\}$ controls whether to increase or decrease the number of output spikes towards the required number, $\lambda$ is the learning rate parameter that controls the size of the gradient step, and $\nabla_{\mathbf{w}} \vartheta^*$ is the gradient of the critical voltage threshold with respect to the synaptic weights. In practice, multiple concepts are presented and the learning signal is the sum of the spikes that the concepts in the collection are required to generate.

To evaluate the expression in (3) we use the properties of the $k$-th spike time $t_k^*$, at which the potential reaches the critical threshold, and hence $\vartheta^* = V(t_k^*)$, where $V(t_k^*)$ depends on $t_{\mathrm{spike}}^j < t_k^*$, all the previous time points when the neuron spiked. Together with the voltage (membrane potential) dynamics (1) and (2), a recursive expression that depends on all previous spike times can be found for the gradient (3). For details of its full derivation see the methods section of Gutig:2016 ().
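To make the notion of a critical threshold concrete, the following sketch estimates it numerically: it counts output spikes as a function of the threshold and uses bisection to locate the largest threshold that still yields at least `k` spikes. This is only an illustration under assumed parameters. It is not the analytic recursive gradient of Gutig:2016 (), and the function names, time constants and the bisection itself are assumptions introduced here.

```python
import math

def spike_count(theta, weights, input_spikes, tau_m=0.020, tau_s=0.005,
                t_end=0.5, dt=0.001):
    """Number of output spikes for threshold theta (discrete-time sketch)."""
    eta = tau_m / tau_s
    v_norm = eta ** (eta / (eta - 1.0)) / (eta - 1.0)
    out = []
    for n in range(int(round(t_end / dt)) + 1):
        t = n * dt
        v = 0.0
        for w, train in zip(weights, input_spikes):
            for t_in in train:
                if t_in < t:
                    s = t - t_in
                    v += w * v_norm * (math.exp(-s / tau_m) - math.exp(-s / tau_s))
        # reset term: each earlier output spike subtracts a decaying threshold
        v -= theta * sum(math.exp(-(t - t_o) / tau_m) for t_o in out)
        if v >= theta:
            out.append(t)
    return len(out)

def critical_threshold(k, weights, input_spikes, theta_hi=10.0, iters=40):
    """Bisection estimate of the threshold at which the count drops below k.

    Assumes the spike count does not increase with the threshold inside the
    search interval; returns None if even a tiny threshold gives < k spikes.
    """
    lo, hi = 1e-6, theta_hi
    if spike_count(lo, weights, input_spikes) < k:
        return None
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if spike_count(mid, weights, input_spikes) >= k:
            lo = mid   # still enough spikes: the critical value lies above mid
        else:
            hi = mid   # too few spikes: the critical value lies below mid
    return lo
```

For a single input spike with weight 2.0, `critical_threshold(1, [2.0], [[0.1]])` returns a value just below 2, the peak of the weighted kernel on the time grid. Learning then amounts to moving such a critical value towards the actual threshold by changing the weights.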

The learning rate $\lambda$ is global for all synaptic weights, meaning that the gradient descent takes an equally sized step along all directions. If this parameter is too small the training process will take very long, but if it is too big the algorithm might simply jump over optima of the error surface and never converge to a good solution. Hence, tuning this learning rate is important to achieve a decent training speed.

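A possible approach to avoid these problems, taken in Gutig:2016 (), is to update the weights according to the accumulated error using the Momentum heuristic:

$$\Delta w_i^{\mathrm{current}} = \Delta w_i + \mu \, \Delta w_i^{\mathrm{previous}} \tag{4}$$

where $\Delta w_i$ is the gradient step of equation (3) for synapse $i$ and $\mu$ is the Momentum parameter.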
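The Momentum heuristic can be sketched in a few lines. The sketch assumes that the gradient of the critical threshold with respect to each weight is already available as a list `grad`; the function name `momentum_step`, the argument `sign` (+1 or -1, for increasing or decreasing the spike count) and the default values are hypothetical.

```python
def momentum_step(weights, grad, prev_update, sign, lr=1e-3, mu=0.9):
    """One weight update with Momentum.

    weights     -- current synaptic weights
    grad        -- gradient of the critical threshold w.r.t. each weight
    prev_update -- update applied in the previous step (zeros at the start)
    sign        -- +1 or -1, direction of the desired change in spike count
    Returns the new weights and the update to carry over to the next step.
    """
    # plain gradient step plus a fraction mu of the previous update
    update = [sign * lr * g + mu * p for g, p in zip(grad, prev_update)]
    new_weights = [w + u for w, u in zip(weights, update)]
    return new_weights, update
```

Note that the same learning rate `lr` scales every synapse. Repeated steps along a consistent gradient direction grow in size because earlier updates accumulate.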

### 2.2 Adaptive input weight learning and gradient smoothing

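We propose here to use an adaptive learning approach for the weight updates. The algorithm gives each input synapse its own update scale and thereby accounts for the possibility that not all synapses are equally important to update. For example, updates should be larger for directions which provide more consistent information across examples. RMSprop (Root Mean Square (back-)propagation) TielemanHinton:2012 () is a possible approach to achieve this. It was successfully used in deep learning for training with mini-batches. It computes an adaptive learning rate per synapse weight $w_i$ as a function of its previous gradient steps $\partial \vartheta^* / \partial w_i$:

$$v_i^{(t)} = \gamma \, v_i^{(t-1)} + (1 - \gamma) \left( \frac{\partial \vartheta^*}{\partial w_i} \right)^2, \qquad \Delta w_i = \sigma \, \frac{\lambda}{\sqrt{v_i^{(t)} + \epsilon}} \, \frac{\partial \vartheta^*}{\partial w_i} \tag{5}$$

where $v_i^{(t)}$ is the running average of the squared gradient of synapse $i$, $\gamma$ is the smoothing factor of this average, $\epsilon$ is a small constant for numerical stability, and the sign $\sigma \in \{-1, +1\}$ and the learning rate $\lambda$ are as in equation (3).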
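A minimal sketch of an RMSprop-style update per synapse follows, written in the standard form of TielemanHinton:2012 (). Whether the authors' implementation uses exactly this variant (for example where the stabilizing constant enters) is an assumption; the names `rmsprop_step`, `mean_sq`, `gamma` and `eps` are illustrative.

```python
import math

def rmsprop_step(weights, grad, mean_sq, sign, lr=1e-3, gamma=0.9, eps=1e-8):
    """One RMSprop-style update with a separate step scale per synapse.

    mean_sq -- running average of squared gradients, one entry per synapse
    sign    -- +1 or -1, direction of the desired change in spike count
    Returns the new weights and the updated running averages.
    """
    # smooth the squared gradient of every synapse over past steps
    new_mean_sq = [gamma * m + (1.0 - gamma) * g * g
                   for m, g in zip(mean_sq, grad)]
    # divide each gradient by the root of its own running average
    new_weights = [w + sign * lr * g / math.sqrt(m + eps)
                   for w, g, m in zip(weights, grad, new_mean_sq)]
    return new_weights, new_mean_sq
```

In contrast to a single global learning rate, a synapse with a gradient of 100 and a synapse with a gradient of 1 take steps of nearly the same size here, because each gradient is normalized by its own recent magnitude.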

## 3 Results

We evaluate the learning performance of the Multi-Spike Tempotron with Momentum and with adaptive learning (RMSprop), first on the biologically relevant problem of task-related spiking activity as found in the neocortex and afterwards on a non-biological task of counting digits. We first consider the biological application and evaluate the model under different input statistics that deviate from homogeneous Poisson. We do this by constructing data-sets where the task-related patterns to be learned are drawn from three different distributions. We then gradually increase the complexity of the noise, from spike jittering to stationary and finally time-varying background activity. For this we construct several synthetic data-sets with varying noise levels. We then apply the MSP model to the non-biological problem of counting even digits within an image composed of several random handwritten MNIST digits and compare its performance with a conventional Convolutional Neural Network. All simulations are carried out with the same set of parameters, sec, using our discrete-time implementation of the Multi-Spike Tempotron in MATLAB. To the best of our knowledge this is also the first publicly available implementation of the Multi-Spike Tempotron model.

### 3.1 Task-related inhomogeneous activity in the neocortex

We construct three data-sets, for each of which a set of patterns is generated. Out of these 9 patterns, 5 patterns are considered to be task-related and are associated with some positive reward . The remaining 4 patterns are considered to be distractor patterns (from the same distribution) with reward . The patterns are generated as 1 sec long spike trains by drawing instantaneous firing rates from three different stationary point processes (renewal processes): representing the homogeneous Poisson process, , and with a fixed intensity (or rate) of spike events per second. Each pattern is associated with a fixed, positive integer reward . Input spike trains of sec are assembled by drawing a random number of patterns from a Poisson distribution of mean patterns (with replacement). These patterns are randomly positioned within those sec but are not allowed to overlap. The training target for each such input spike train is the sum of the individual rewards of all patterns occurring in it.

We evaluate learning under different noise levels: patterns only, patterns + spike jittering, patterns + homogeneous Poisson background activity and finally patterns + inhomogeneous Poisson background activity. The homogeneous background activity is drawn from a stationary Poisson process, while for the inhomogeneous case the instantaneous firing rates are modulated by superimposed sinusoidal functions. A sample of such an input spike train with inhomogeneous background activity is shown at the top of figure 2.

While the speed of convergence is similar, we find that the method with adaptive learning rate shows significantly less variance in training error (fig. 1). This also holds for the variance of the test error on an independent validation data-set and results in better generalization to previously unseen inputs. The adaptive, per-synapse learning rate combined with smoothing over past gradients has a regularizing effect and prevents the model from over-fitting to the training data. We further conclude that the modified algorithm is able to find better and wider optima of the spike-threshold surface loss function than learning with Momentum. More importantly, this behaviour is consistent and independent of the input spike train’s distribution and noise level (fig. 2, bottom plots).
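The construction of the synthetic input can be sketched as follows. The sketch assumes gamma renewal processes as the family of point processes (shape 1 gives a Poisson process) and a distractor reward of zero. All numeric values (rates, shapes, durations, number of inputs, mean pattern count) and all function names are placeholders chosen for illustration, not the values used in the experiments.

```python
import math
import random

def gamma_train(rate, shape, duration, rng):
    """Renewal spike train with gamma-distributed intervals (shape 1: Poisson)."""
    scale = 1.0 / (shape * rate)  # keeps the mean rate at `rate`
    spikes, t = [], rng.gammavariate(shape, scale)
    while t < duration:
        spikes.append(t)
        t += rng.gammavariate(shape, scale)
    return spikes

def poisson_sample(mean, rng):
    """Draw a Poisson-distributed integer (Knuth's multiplication method)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def make_trial(patterns, rewards, mean_count, trial_len, pattern_len, rng):
    """Place randomly chosen patterns without overlap; target = sum of rewards."""
    n = min(poisson_sample(mean_count, rng), int(trial_len // pattern_len))
    chosen = [rng.randrange(len(patterns)) for _ in range(n)]  # with replacement
    free = trial_len - n * pattern_len
    offsets = sorted(rng.uniform(0.0, free) for _ in range(n))
    n_inputs = len(patterns[0])
    trial = [[] for _ in range(n_inputs)]
    starts = []
    for slot, (idx, off) in enumerate(zip(chosen, offsets)):
        start = off + slot * pattern_len  # sorted offsets guarantee no overlap
        starts.append(start)
        for i in range(n_inputs):
            trial[i].extend(start + t for t in patterns[idx][i])
    target = sum(rewards[idx] for idx in chosen)
    return trial, target, starts

rng = random.Random(0)
# 9 patterns over 10 input sources; 5 rewarded, 4 distractors (assumed reward 0)
patterns = [[gamma_train(5.0, 2.0, 1.0, rng) for _ in range(10)] for _ in range(9)]
rewards = [1] * 5 + [0] * 4
trial, target, starts = make_trial(patterns, rewards, mean_count=3.0,
                                   trial_len=10.0, pattern_len=1.0, rng=rng)
```

Background activity and spike jitter are left out to keep the sketch short; they could be added by merging further spike trains into `trial` or by perturbing the pattern spike times.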

### 3.2 Counting MNIST

In this section we consider a non-biological application, namely the problem of multiple-instance learning using the MNIST lecun-mnisthandwrittendigit-2010 () data-set of handwritten digits. Following Segui:2015 (), Fomoro () we generate new images of size 100x100 pixels which contain a random set of 5 MNIST digits, randomly positioned within that image (fig. 3). Rejection sampling is used to ensure digits are separated by at least 28 pixels. Each such image is weakly labeled with the total number of even digits present in that image. To solve this task correctly, the model has to learn to count the even digits given only this weak label.
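The image construction can be sketched as below. To stay self-contained the sketch uses constant 28x28 arrays as stand-ins for MNIST digits, and it interprets the separation constraint as a minimum distance of 28 pixels between top-left corners along at least one axis, which rules out overlap. Both choices, as well as the function names, are assumptions.

```python
import random

def place_digits(n_digits=5, canvas=100, size=28, min_sep=28, rng=None,
                 tries_per_layout=200):
    """Rejection-sample top-left corners so that no two digits overlap."""
    rng = rng or random.Random()
    while True:  # start a fresh layout if the current one gets stuck
        positions = []
        for _ in range(tries_per_layout):
            cand = (rng.randrange(canvas - size + 1),
                    rng.randrange(canvas - size + 1))
            # accept only if far enough from every digit placed so far
            if all(max(abs(cand[0] - r), abs(cand[1] - c)) >= min_sep
                   for r, c in positions):
                positions.append(cand)
            if len(positions) == n_digits:
                return positions

def compose(digit_images, digit_labels, positions, canvas=100):
    """Paste the digits onto an empty canvas; the weak label counts even digits."""
    image = [[0.0] * canvas for _ in range(canvas)]
    for img, (r0, c0) in zip(digit_images, positions):
        for r, row in enumerate(img):
            for c, val in enumerate(row):
                image[r0 + r][c0 + c] = val
    label = sum(1 for d in digit_labels if d % 2 == 0)
    return image, label

rng = random.Random(1)
labels = [0, 3, 4, 7, 8]                                  # stand-in digit classes
digits = [[[1.0] * 28 for _ in range(28)] for _ in labels]  # stand-in images
positions = place_digits(rng=rng)
image, label = compose(digits, labels, positions)
```

The label carries no information about which digits are even or where they are located, which is what makes the task weakly supervised.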

For the Multi-Spike Tempotron the images have to be encoded as spike trains. We first consider a naive spike-encoding which encodes each individual pixel as s long spiketrain generated by a process with the rate proportional to the pixel’s intensity (grey value). This type of encoding is naive in the sense that it considers each pixel to be independent and thus does not exploit local spatial correlations of images. Next we consider a more sophisticated spike-encoding frontend, the Filter-Overlap Correction Algorithm (FoCal), a model of the fovea centralis Bhattacharya2010 (). This encoding algorithm makes use of spatial correlations to some extent in order to reduce the amount of redundant information. It is similar to, though somewhat simpler than, the convolutional filters used in current deep neural networks.

For comparison, we train a conventional ConvNet architecture that has been shown to successfully accomplish this task when trained on samples Fomoro (). The architecture uses several layers (conv1 - MaxPool - conv2 - conv3 - conv4 - fc - softmax) and includes recent advances like strided and dilated convolutions. To train the ConvNet we use the ADAM kingma2014adam () optimizer, which is commonly found to be among the most effective optimizers for training ConvNets. For the MSP model we use our adaptive learning rate method and the originally proposed Momentum method. Since we want to evaluate computational and sample efficiency, all models are trained for 30 epochs on the same training set of 800 images and are evaluated on an independent test set of 800 unseen images. In general the counting problem is closer to a regression problem, since one does not know a priori the maximum number of desired concepts present in an input. For this reason, we choose the root mean-squared error (RMSE) of wrongly counted even digits as the evaluation criterion, where a lower value means better performance. This criterion especially penalizes predictions that show a large difference between the true and the predicted value.

In this context, we want to point out that the ConvNet model is built by incorporating prior knowledge regarding the distribution of training targets. It is constrained to learn a categorical distribution over , where is the maximum possible count of even digits in an image. This has two important implications. First, the ConvNet model is unable to correctly predict images that would have more than even digits. While for this particular task the data-set is constructed such that this is not possible, for general regression problems the prediction targets are usually not bounded or constrained to a fixed set of values. Second, the maximum possible prediction error is constrained to be . In contrast, the MSP model does not have any of this prior knowledge or these constraints. It is able to solve the general, true regression problem and can also make predictions for images that contain more than 6 even digits. This further means that the learning problem to be solved is harder for the MSP. The maximum prediction error in this case is unbounded, which makes the MSP more vulnerable to a high RMSE than the ConvNet.

Results are summarized in table 1 and the best performing model, MSP with adaptive learning rate, is highlighted in bold. We find that the MSP with adaptive learning rate generally performs better than with Momentum, independent of the choice of a particular spike-encoding frontend. Interestingly, the single-neuron MSP model is also able to outperform the rate-based ConvNet. In order for the ConvNet to achieve a better RMSE () than the MSP model, the ConvNet needs to be trained for 5-10x more epochs than the MSP. If the model’s complexity in terms of free parameters is taken into account (adjusted RMSE), the MSP model is computationally more efficient.

We find that using FoCal as spike-encoding frontend works much better than our naive encoding, which is the expected behaviour. Exploiting local spatial correlations is known to be more effective than considering each pixel to be independent. This is in line with artificial neural networks, where the success of ConvNets over regular multilayer networks is mostly due to the spatial filters learned by their convolutional layers. We conclude that the type of encoding has a strong impact on the model’s performance in general and that, by applying more sophisticated and efficient encoding algorithms, the performance of the MSP model can be improved further. We leave the exploration of different types of encodings open for future research.
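The naive pixel-wise encoding and the evaluation criterion can be sketched as follows. The sketch assumes a Poisson process per pixel with intensities scaled to the range 0 to 1; the maximum rate, the spike train duration and the function names are placeholders, since the values used in the experiments are not reproduced here.

```python
import math
import random

def encode_pixel(intensity, max_rate, duration, rng):
    """Poisson spike train whose rate is proportional to the pixel intensity."""
    rate = max_rate * intensity
    spikes = []
    if rate <= 0.0:  # black pixels stay silent
        return spikes
    t = rng.expovariate(rate)
    while t < duration:
        spikes.append(t)
        t += rng.expovariate(rate)
    return spikes

def encode_image(image, max_rate=50.0, duration=1.0, rng=None):
    """Encode every pixel independently; returns one spike train per pixel."""
    rng = rng or random.Random()
    return [encode_pixel(p, max_rate, duration, rng)
            for row in image for p in row]

def rmse(true_counts, predicted_counts):
    """Root mean-squared error between true and predicted counts."""
    n = len(true_counts)
    return math.sqrt(sum((t - p) ** 2
                         for t, p in zip(true_counts, predicted_counts)) / n)

trains = encode_image([[0.0, 1.0], [0.5, 0.0]], rng=random.Random(0))
```

Each spike train then drives one synapse of the neuron, and the number of output spikes is read out as the predicted count. For instance, `rmse([2, 3], [2, 5])` evaluates to the square root of 2, showing how a single miscount by two dominates the error.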

Table 1: Counting MNIST results.

| Model | #Parameters | RMSE |
|---|---|---|
| ConvNet | | |
| (naive encoding) | | |
| (FoCal encoding) | | |
| (naive encoding) | | |
| (FoCal encoding) | | |
| always-zero | n/a | |
| random guessing | n/a | |

## 4 Discussion

Rate-based neural networks and algorithms, especially their implementation through deep learning (lecun2015deep (), schmidhuber2015deep ()), have been highly successful in building intelligent artificial systems that are able to solve remarkably complex tasks. However, along with their success, their computational complexity has been shown to grow exponentially in recent years OpenAI:compute (). Nervous systems of animals are highly efficient in solving complex classification tasks. Therefore it is worthwhile to examine alternative models for neural computing that exploit biological mechanisms of information processing.

In this work we have explored the Multi-Spike Tempotron Gutig:2016 () (MSP), a spiking neuron model of the neocortex, which can be trained by gradient descent to produce a precise number of output spikes in response to a certain stimulus. We first studied and quantified the learning and generalization performance of the model in the biological context of task-related spiking activity in the neocortex. Specifically, we studied input statistics that deviate from homogeneous Poisson, which due to its mathematical convenience is the commonly used model of spiking statistics in the neocortex and has also been used in the original work on the MSP model Gutig:2016 (). We showed that under different, biologically more realistic input statistics the MSP model exhibits large variance with regard to training error and, more importantly, with regard to generalization to unseen inputs. To overcome this issue, we proposed a modified learning rule that uses adaptive learning rates per synapse and smoothing over past gradients instead of the original Momentum-based learning. We evaluated both methods on data-sets with different input statistics that resemble task-related spiking activity in the neocortex and under different levels of background noise complexity. We were able to show that the adaptive learning rate method performs consistently better than Momentum in terms of variability of training error and generalization. The modified learning rule has a regularizing effect and prevents the model from overfitting to the training data, without modifying the model’s equation and gradient derivation.

While previous related work on gradient-based learning in spiking network models is mostly concerned with solving classical classification tasks, in this work we applied the single-neuron MSP model to a regression problem. Specifically, we used the improved learning rule and applied the MSP to the non-biological problem of multiple-instance learning with weakly labeled objects. For this we used a visual counting task of handwritten even digits from the MNIST data-set. For successful learning, the model needed to solve the binding problem using the weak label and count the number of even digits in each image. Finally, we asked whether a single spiking neuron model is able to compete against more complex, rate-based network models with approximately the same parameter space and training data. For this we compared the MSP against a conventional convolutional neural network of seven layers and assessed the performance using the root mean-squared error (RMSE) of wrongly counted digits. We found that the improved MSP model can outperform the ConvNet in this setting, which needed 5-10x more training epochs to reach the MSP performance. While in this work we specifically focused on the computational capabilities of the single-neuron model, the same model and learning rule can also be used to create more complex and layered networks. We leave the study of complex networks of multiple, interconnected Multi-Spike Tempotrons for future research. We conclude that, despite its simplicity, the single-neuron Multi-Spike Tempotron provides competitive performance not only for biologically relevant inputs like task-related activity in the neocortex, but also on tasks that are unrelated to biology. We are confident it can be considered for other classes of machine learning problems that go beyond strict classification tasks. We made our code publicly available to support further research Rapp:2018 ().

#### Acknowledgments

A significant portion of this work has been conducted during the OIST Computational Neuroscience Course and has been supported by accommodations and travel grants from the Okinawa Institute of Science and Technology (OIST).

## References

- [1] Robert Gütig. Spiking neurons can discover predictive features by aggregate-label learning. Science, 351(6277), 2016.
- [2] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
- [3] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
- [4] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- [5] Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.
- [6] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
- [7] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. 09 2014.
- [8] OpenAI. https://blog.openai.com/ai-and-compute/.
- [9] Wilten Nicola and Claudia Clopath. Supervised learning in spiking neural networks with force training. Nature Communications, 8(1):2208, 2017.
- [10] Dongsung Huh and Terrence J. Sejnowski. Gradient descent for spiking neural networks. 06 2017.
- [11] Sander M. Bohte, Joost N. Kok, and Han La Poutré. Spikeprop: backpropagation for networks of spiking neurons. In ESANN, 2000.
- [12] Peter O’Connor, Efstratios Gavves, and Max Welling. Temporally efficient deep learning with spikes. 06 2017.
- [13] Guo-qiang Bi and Mu-ming Poo. Synaptic modification by correlated activity: Hebb’s postulate revisited. Annual Review of Neuroscience, 24(1):139–166, 2001.
- [14] Natalia Caporale and Yang Dan. Spike timing–dependent plasticity: a Hebbian learning rule. Annual Review of Neuroscience, 31:25–46, 2008.
- [15] S. Song, K. D. Miller, and L. F. Abbott. Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience, 3(9):919, 2000.
- [16] Robert Gütig and Haim Sompolinsky. The tempotron: a neuron that learns spike timing-based decisions. Nature Neuroscience, 9:420, 2006.
- [17] Raoul-Martin Memmesheimer, Ran Rubin, Bence P. Ölveczky, and Haim Sompolinsky. Learning precisely timed spikes. Neuron, 82(4):925 – 938, 2014.
- [18] Matthew Johnson Danny Tarlow David Sussillo, Chris Maddison. Training continuous time spiking neural networks with back-propagation through spike times. Cosyne Abstracts 2018, 2018.
- [19] James Richard Foulds and Eibe Frank. A review of multi-instance learning assumptions. 25, 03 2010.
- [20] Victor Lempitsky and Andrew Zisserman. Learning to count objects in images. In Advances in neural information processing systems, pages 1324–1332, 2010.
- [21] Santi Seguí, Oriol Pujol, and Jordi Vitrià. Learning to count with deep object features. 05 2015.
- [22] Terrence R Stanford, Swetha Shankar, Dino P Massoglia, M Gabriela Costello, and Emilio Salinas. Perceptual decision making in less than 30 milliseconds. Nature Neuroscience, 13(3):379, 2010.
- [23] Yasuhiro Mochizuki, Tomokatsu Onaga, Hideaki Shimazaki, Takeaki Shimokawa, Yasuhiro Tsubo, Rie Kimura, Akiko Saiki, Yutaka Sakai, Yoshikazu Isomura, Shigeyoshi Fujisawa, et al. Similarity in neuronal firing regimes across mammalian species. Journal of Neuroscience, 36(21):5736–5747, 2016.
- [24] Martin Paul Nawrot. Analysis and interpretation of interval and count variability in neural spike trains. In Analysis of parallel spike trains, pages 37–58. Springer, 2010.
- [25] Farzad Farkhooi, Eilif Muller, and Martin P Nawrot. Adaptation reduces variability of the neuronal population code. Physical Review E, 83(5):050905, 2011.
- [26] B.T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1 – 17, 1964.
- [27] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
- [28] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
- [29] Fomoro. http://github.com/fomorians/counting-mnist.
- [30] B. Sen Bhattacharya and S. B. Furber. Biologically inspired means for rank-order encoding images: A quantitative analysis. IEEE Transactions on Neural Networks, 21(7):1087–1099, July 2010.
- [31] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [32] Hannes Rapp and Merav Stern. Matlab implementation of multispike tempotron with adaptive learning. (doi:10.5281/zenodo.1247442), 2018.