# CaMKII activation supports reward-based neural network optimization through Hamiltonian sampling

###### Abstract

Synaptic plasticity is implemented and controlled through over thousand different types of molecules in the postsynaptic density and presynaptic boutons that assume a staggering array of different states through phosporylation and other mechanisms. One of the most prominent molecule in the postsynaptic density is CaMKII, that is described in molecular biology as a “memory molecule” that can integrate through auto-phosporylation Ca-influx signals on a relatively large time scale of dozens of seconds. The functional impact of this memory mechanism is largely unknown. We show that the experimental data on the specific role of CaMKII activation in dopamine-gated spine consolidation suggest a general functional role in speeding up reward-guided search for network configurations that maximize reward expectation. Our theoretical analysis shows that stochastic search could in principle even attain optimal network configurations by emulating one of the most well-known nonlinear optimization methods, simulated annealing. But this optimization is usually impeded by slowness of stochastic search at a given temperature. We propose that CaMKII contributes a momentum term that substantially speeds up this search. In particular, it allows the network to overcome saddle points of the fitness function. The resulting improved stochastic policy search can be understood on a more abstract level as Hamiltonian sampling, which is known to be one of the most efficient stochastic search methods.

## 1 Introduction

Calcium-calmodulin dependent protein kinase II (CaMKII) is the most frequently occurring complex molecule in the postsynaptic density and a key molecule for the implementation of synaptic plasticity [1] (see Fig. 1A). It is described in molecular biology as a “memory molecule” that creates through its somewhat persistent autophosphorylated (active) state a short term memory or low pass filter on the time scale of dozens of seconds for calcium influx (see e.g. Ch. 15 in [2], Fig. 1c in [3], and Fig. 3F in [4]). Calcium influx is a typical feature of the induction of longterm plasticity via NMDA receptors. More specifically, incoming calcium transforms CaMKII via calmodulin into its active state, which is maintained for a while via autophosphorylation among its 12 subunits. Furthermore CaMKII triggers in its activated state changes of synaptic efficacy through the phosphorylation of AMPA receptors, the anchoring of additional AMPA receptors in the postsynaptic density, and dopamine-gated stabilization of spines (see e.g. Fig. 3, S5, S11 in [4]).

Although numerous experimental data show that CaMKII in its activated state is essential both for LTP and LTD [5, 6], its contribution to network plasticity is still unclear. We address in this article the question how the activation dynamics of CaMKII could contribute to reward-based network optimization for specific computational tasks. Since the molecular processes that involve CaMKII and give rise to LTP and LTD contain a strong stochastic component, it is natural to view this optimization not as a deterministic but a stochastic search for good network parameters. This view is also consistent with numerous experimental data that show that synaptic connections are even in the adult cortex subject to a continuous coming and going of dendritic spines that appears to be inherently stochastic and independent of pre- or postsynaptic firing in the absence of a functional synaptic connection [7, 8]. A theoretical framework for stochastic network plasticity has been introduced in [9, 10] and termed synaptic sampling. There, it was shown that a neural network with parameters subject to stochastic plasticity rules samples from a stationary distribution of network configurations through a sampling process known as Langevin sampling in the machine learning literature. This means that the network will visit – in the long run – network configurations most often that have a large probability . The index in the distribution denotes the “temperature” of the search, which depends on the amount of noise in the plasticity process. The exact shape of is determined by the plasticity rule and in the context of reward-based learning it can be chosen to prefer network configurations that lead frequently to large rewards. In other words, in this framework, synaptic plasticity can be shown to implement an ongoing stochastic policy search [10].

However, as synaptic sampling carries out Langevin sampling, convergence to the stationary distribution is rather slow for any fixed temperature, which is in general undesirable as it implies slow learning. In particular, the search for high fitness regions by gradient-based optimization techniques such as Langevin sampling is hindered by local optima and – even more severely as recently suggested by Dauphin et al. [11, 12] – by the presence of saddle points in . This slowness is a generic impediment for the implementation of a global optimization strategy such as simulated annealing as further detailed below.

In this article, we show that CaMKII activation dynamics can ease these issues. Compared to the synaptic sampling framework, the activation dynamics of CaMKII gives rise to an additional dynamic variable that basically low-pass filters parameter updates. This low-pass filtering implements a momentum term, a method that is well-known to improve gradient-based optimization in many circumstances, for example in the vicinity of saddle points. More abstractly, in our stochastic framework, we show that the resulting dynamics gives rise to a parameter sampling algorithm known as Hamiltonian sampling, that however still samples from the same stationary distribution . A well-known advantage of Hamiltonian sampling over Langevin sampling is faster convergence to the stationary distribution [13].

With such faster convergence properties, our model for CaMKII driven plasticity allows us to create a link from reward-based learning to optimization theory, which establishes conditions under which a neural circuit could attain not only functionally attractive locally optimal network configurations, but in principle even a global optimum. Simulated annealing [15, 16] is arguably one of the most powerful algorithmic approach to nonlinear optimization. Evolutionary algorithms also work well in some cases, but require a large control overhead and many competing networks in parallel for which no biological evidence exists so far. We show that reward-based network plasticity can in principle reach even globally optimal network configurations if the amount of stochasticity is sufficiently slowly decreased during learning (“cooling” or “annealing”), similar to simulated annealing in continuous time. This theoretical result provides a new gold standard for reward-based network learning.

## 2 Results

Consider a network that receives at certain times reward signals , e.g., in the form of dopamine. The dynamics of each synaptic connection in the network is modeled by a parameter , which determines the synaptic efficacy. Therefore, we assume that the behavior of the network (i.e., its response to network input; also referred to as the network policy) is determined by the parameter vector (the vector of all synaptic parameters). In biological neuronal networks, neurons are either excitatory or inhibitory, a fact that is commonly refered to as Dale’s principle. This implies that their outgoing synaptic connections are exclusively excitatory (modelled as positive synaptic weights) or inhibitory (negative synaptic weights), and that these synaptic weights cannot change the their sign through plasticity processes. We will first introduce a version of the model that allows such a sign-switch of synaptic weights for demonstration purposes. We will later introduce a slightly modified version of the model where only excitatory synapses are plastic with weights constrained to be non-negative.

Previous work [10] has analyzed under which conditions such a network can perform an ongoing stochastic policy search. That is, under which conditions local stochastic synaptic plasticity processes on can achieve that the network seeks network configurations that provide a large expected discounted reward. Mathematically, the expected discounted reward for a given parameter vector is given by

(1) |

The integral integrates all future rewards , while discounting more remote rewards exponentially with a discount rate . The expectation is an average over multiple learning episodes where in each episode one realization of the reward trajectory is encountered for the given parameters according to some distribution .

In addition, a biological network needs to satisfy structural constraints, such as sparse connectivity, that can be formulated through a prior over network configurations [10]. Hence, network learning can be regarded as a search for policies (i.e., network configurations ) that both satisfy structural constraints and provide a large expected discounted reward. This can be stated more formally as a sampling from the posterior distribution of parameters

(2) |

It was shown in [10] that if the stochastic dynamics of each parameter can be characterized through a stochastic differential equation (SDE) of the form

(3) |

then the network reaches the unique stationary distribution given by the posterior and then samples from this distribution over network configurations. The parameter denotes a learning rate that controls the speed of the parameter dynamics. The last term of Eq. (4) describes infinitesimal stochastic increments and decrements of a Wiener process – a standard model for Brownian motion in one dimension (see [17]). The amplitude of this noise term is scaled by the temperature parameter , which can be used to increase or decrease random exploration of the parameter space.

To integrate the role of CaMKII in the plasticity processes, we model the previously sketched transient role of CaMKII as a low pass filter in the induction of synaptic plasticity (see Fig. 1B). For each potential synapse , we introduce another dynamic variable that determines the change of the at time . It was found that both, LTP and LTD require the activated form of CaMKII, and that the switch between LTP and LTD is determined by other mechanisms [5, 6]. We therefore interpret the absolute value of as the local concentration of CaMKII in its activated state. The interaction of these two variables is modeled by the stochastic differential equation (SDE) of the form

(4) |

where the change of parameter directly depends on the value of the hidden CaMKII-related variable ( is a learning rate). The dynamics of in turn is determined by three terms. The first term is the gradient of the parameter posterior. In reward-based learning, this gradient can be estimated by a rule that depends only on pre- and post-synaptic spike times and a global reward signal implemented for example as a dopaminergic signal. The friction term implements the decay of CaMKII activation with a time constant . Detailed experimental studies suggest that this time constant depends on a variety of factors, e.g. the inactivation time constant of CaMKII activity and the mobility of CaMKII [18, 19, 20] (we used 10 s in Fig. 2 and 50 s in the remainder of the paper). The last term models noise on CaMKII activation, such as stochastic opening of N-methyl-d-aspartate (NMDA) receptor channels [21].

With these extended parameter dynamics, the network samples from the posterior over network configurations (see Theorem 1 in 4 Methods for details). By marginalization over the CaMKII parameters it then follows that the stationary distribution over the synaptic parameters again is given by .

In other words, the CaMKII-enriched dynamics gives rise to the same reward-optimizing distribution over network configurations as the direct dynamics considered in [10]. Importantly however, it turns out that the dynamics (4) actually posesses advantageous properties when compared to the direct dynamics (3). For the noise-less case (), the dynamics (3) corresponds to a gradient ascent on . In comparison, the dynamics (4) introduces a momentum term which is well-known to improve gradient descent in many circumstances, for example in the presence of small local optima or in the vicinity of saddle points. In the case with noise, the dynamics (3) corresponds to Langevin sampling from , and the dynamics (4) to Hamiltonian sampling with friction. It is knwon that Hamiltonian sampling typically shows much faster convergence to the stationary distribution than the rather slow Langevin sampling [22]. In fact, a similar low-pass filtering of gradient updates was already implemented in [10] to improve learning performance, but without a clean mathematical background and biological motivation.

To arrive at concrete plasticity rules, one has to determine in Eq. (4), for the concrete neuron model and prior at hand. As one example to be used in subsequent simulations, we consider a stochastic spiking neuron model (see 4.2 Spiking neuron model) and independent zero-mean Gaussian priors with variance for each parameter . We obtain for the derivative of the prior. Using this and Eq. (1) we find that the derivative in Eq. (4) at time can be approximated by (see 4 Methods)

(5) | |||

(6) |

where is the PSP activation under synapse , denotes the firing probability of the postsynaptic neuron, and is a binary variable that is one if the postsynaptic neuron spiked at time and zero else. Here the synaptic plasticity rule acts on (see Eq. (4)) which is related to CaMKII activation instead of acting directly on the synaptic parameter . This learning rule is a simple version of reward-modulated spike-timing dependent plasticity (STDP). Similar rules were derived previously in the context of reward-based learning [23, 24]. The current work extends these rules to include a prior over network configurations, stochastic parameter updates, and CaMKII-induced Hamiltonian dynamics.

### 2.1 A spiking neuron learns to emulate a sigmoidal neuron

Learning in recurrent networks of spiking neurons is notoriously hard [25], in particular with reward-based learning. For example, interesting functionality has been acquired through reward-based synaptic plasticity in [26], but only in recurrent networks of smooth non-spiking neurons. Recently, it has been proposed that functionality of a non-spiking network can be ported to a spiking network if it has previously learned to exhibit smooth dynamics [25]. We wondered whether such smoothing of network responses can be obtained through reward-based learning. We considered a very simple basic setup where the task is to reproduce with a single spiking neuron the behavior of an artificial sigmoidal neuron model (see Fig. 2A).

The target firing rate of the spiking neuron was given by the output of a sigmoidal neuron with four inputs and pre-defined weights. Fig. 2B shows the desired input-output behavior. The spiking neuron received inputs from 4 pools of 20 spiking neurons each, with firing rates proportional to the sigmoidal neurons’ inputs (and a maximum of Hz). Input patterns were presented to the spiking neuron continuously while its weights were adapted through reward-based plasticity (at each presentation, one out of patterns was chosen randomly and presented, see Fig. 2B). Each presentation of an input pattern lasted for ms. The presentation was followed by a ms phase where a reward was delivered which was given by 1 minus the absolute difference between spiking neuron and sigmoidal neuron output (see 4 Methods for details). After reward delivery, a ms delay period was introduced where input neurons were silent, followed by another pattern presentation.

Before learning, the firing rate of the spiking neuron was rather random over the whole range of inputs (Fig. 2C). After 20000 pattern presentations, the neurons’ firing rate approximated the smooth behavior of the sigmoidal neuron well (Fig. 2D). Fig. 2E shows the average reward throughout learning for Hamiltonian sampling in comparison with non-Hamiltionian dynamics (synaptic sampling). One can see that Hamiltonian dynamics speeds up learning significantly.

### 2.2 Reward-guided network plasticity

Next we investigated whether the benefit in learning performance of Hamiltonian sampling scales up to biologically more realistic network architectures, that are larger in size and less structured. To do so, we applied the Hamiltonian synaptic sampling framework outlined above to learn a blind reaching task in a simple model of motor cortex. Reward-guided changes of network activity and task-induced spine dynamics are well documented in motor cortex [27]. We used a network of 100 recurrently connected excitatory neurons and 20 inhibitory neurons to control a cursor in 2D space (see Fig. 3A,B). Connectivity parameters of this cortical network motif were taken from [28] (see 4 Methods). In addition to recurrent connections a random subset of 30 excitatory neurons received input from 200 afferent neurons. From the remaining 70 excitatory neurons we randomly selected a neural pool C of 50 neurons to control the cursor position. For controlling the cursor we adopted the population vector model [29]. Briefly, each neuron in C was assigned a randomly selected preferred direction in 2D cursor space. At each time point the cursor was moved in the direction of the population vector (accumulated preferred directions weighted by neural activities) of the 50 neurons in C.

Each trial started with the cursor centered at the start area S (blue circle in Fig. 3A). The cursor had to be held at S for 50 ms to initiate the movement phase of the trial. The movement phase was indicated through the presentation of a cue pattern (a rate pattern for all 200 afferent input neurons, see 4 Methods). Reward was given to the network if the cursor was moved to the target area G in Fig. 3A and held there for 50 ms. At success, the presentation of the cue pattern was stopped and a 400 ms reward window was initiated during which was set to (indicated by red vertical bars in Fig. 3B). If the network failed to reach the target within 5 seconds, or failed to hold the cursor at S and G, the trial was aborted and a 400 ms time window without reward was presented.

Note that this is a nontrivial reinforcement learning task, since the neurons did not “know” whether they belonged to the population C. Also, the network did not receive feedback about the cursor position, only binary information about the trial phase through the cue was provided. This is also true for the preferred directions assigned to the neurons in C, which could not be observed by the neurons. Furthermore, the neurons in C did not receive input from the cue directly, such that the routing of cue information to C had to be learned on top of the reaching task. All this information had to be discovered through random exploration from a global and sparse binary reward signal.

We used the synaptic sampling framework with and without the Hamiltonian momentum term to learn this task. Synaptic plasticity was here only active for excitatory synapses (both recurrent and feedforward), whereas inhibitory synapses were fixed. In order to guarantee that synapses didn’t change their role, i.e., become inhibitory during learning, we used here a model for synaptic plasticity that does not allow synaptic weights to become negative. This was done by applying a mapping between synaptic parameters and the synaptic efficacies . We used here the exponential mapping

(7) |

with offset parameter , such that is positive for any value of . We show in 4 Methods that by inserting equation (7) into the general Hamiltonian learning framework (4), we arrive at a slightly modified version of the eligibility trace , given by

(8) |

This dynamics differs from equation (6) by the additional term , such that weight changes are scaled by the current value of the synaptic efficacy. This feature of our model mimics the multiplicative dynamics observed in cortical synaptic spines [30], see [10] for a detailed analysis.

In Fig. 3 we show that the Hamiltonian momentum term in the rule (4) significantly enhances learning this task. Network responses before and after learning with the Hamiltonian momentum term are shown in Fig. 3B. Initially the rewarded goal is only reached occasionally (around success rate, one example unsuccessful trial is shown). After learning for 12 hours the network is able to reach the target in most of the trials (success rate was on average, see Fig. 3C). In Fig. 3C we compare the learning progress with and without Hamiltonian sampling. We found that this task is hard to learn without the Hamiltonian momentum term (success rate typically below after 12 hours of learning).

### 2.3 Hamiltonian dynamics improves network behavior at saddle points

Traditionally, it was believed that gradient-based non-convex optimization in high-dimensional spaces is hampered by the presence of local optima in the fitness landscape. Recently, Dauphin et al. [11, 12] argued that in high-dimensional spaces there are typically only few local optima and that these local optima are nearly as good as the global optimum. Importantly, it was further noted by these authors that saddle points are much more numerous in high-dimensional fitness landscapes. Hence, stochastic procedures over high dimensional spaces, like synaptic sampling, tend to be inefficient and time consuming due to the presence of saddle points, but not so much due to local optima. One generally accepted method to speed up convergence of learning or sampling in the presence of saddle points is to use Hamiltonian dynamics (or a momentum term). We therefore hypothesized that CaMKII-induced Hamiltonian parameter dynamics should provide a benefit in this respect.

To test this hypothesis, we considered a three-layer neural network with 784 input neurons, 30 hidden neurons and 10 output neurons. The task was to learn to classify images of handwritten digits from the MNIST dataset (see Fig. 4A and 4 Methods). Due to the large computational demands for this task, we did not consider a spiking network here but rather a network consisting of stochastic perceptrons (i.e., neurons with binary outputs which were set stochastically based on the weighted sum of inputs, similar to units in a Boltzmann machine). At each pattern presentation, one digit was chosen randomly from the MNIST dataset and presented as input. A binary reward was delivered depending on the activity of output neurons. If the output neuron corresponding to the target for the current example had larger firing probability than other output neurons, a reward of 1 was delivered, otherwise the reward was set to 0. Note that no eligibility trace was used as the network obtained feedback immediately (presentation of the pattern, computation of network output, and reward delivery were all performed in the same time step).

We first ran the network with non-Hamiltonian synaptic sampling (Fig. 4B, magenta curve). The behavior of the network during learning showed typical signs of saddle points. In particular, the test accuracy tended to get stuck at some plateau value with only slight increases during longer periods. Then, at some point performance increased significantly (the network escaped from the saddle point) until another plateau was reached (see step-like behaivor of the magenta curve in Fig. 4B). Similar behavior was observed with Hamiltonian dynamics, however, in this case, the network tended to escape from saddle points much faster (Fig. 4B, green curve). To test whether Hamiltonian dynamics can escape saddle points faster than non-Hamiltonian synaptic sampling, we considered a parameter setting obtained by synaptic sampling close to a putative saddle point and continued the simulation with Hamiltonian dynamics (light green curve in Fig. 4B). We observed that the network escaped from the current saddle point much faster with the Hamiltonian dynamics. Considering the average performance for 30 independent learning trials, we found that Hamiltonian sampling accelerates learning significantly and obtains better result within reasonable learning times (Fig. 4C).

### 2.4 From reward-based learning to global network optimization

Virtually all previous approaches for reward-based learning in spiking neural networks are based on the policy gradient method, that is, the parameters of the network are gradually adjusted in the direction that increases the expected reward locally. Hence, for sufficiently long learning, the parameter setting of the network converges to a local optimum and stays at this local optimum thereafter. The proposed mathematical framework of Hamiltonian sampling allows us to create a link from reinforcement learning to nonlinear optimization theory and the simulated annealing algorithm. This link implies that (spiking or artificial) neural networks can in principle attain through learning not only functionally attractive locally optimal network configurations, but in principle even a global optimum. This theoretical result hence reveals a fundamental advantage of Hamiltonian synaptic dynamics over previous approaches for reward-based network optimization.

The link to nonlinear optimization becomes apparent when one takes a closer look at the temperature parameter in our plasticity dynamics (4) that scales the amount of noise in the parameter updates. Since for a given , the network samples from , a decreased temperature concentrates parameter samples at values that lead to large rewards (for an uninformative prior) and therefore increases the expected reward of the network. In the limit , the stationary distribution converges to the uniform distribution over optimal parameter settings with other parameter settings assuming zero probability

(9) |

where we have defined as the set of optimal network parameters and denotes the measure of this set, see 4 Methods. Further, the expected reward also assumes its global optimum in this limit. One attempt to attain such an optimum is to start with a large temperature and reduce it slowly towards 0. Such an annealing procedure is used in simulated annealing, a non-linear optimization technique [16]. This cooling technique however needs convergence to the associated stationary distribution for each temperature within a reasonable time. While some data suggest that the genetic program for developmental learning has some features that are reminiscent of a cooling schedule [31], a Hamiltonian sampling dynamics is likely to improve the convergence speed for each temperature.

We studied the benefit of a cooling schedule by considering a spiking neural network that learns the exclusive-or (XOR) function through reward-based learning (Fig. 5A,B). The XOR function maps two binary variables to one binary output in the following manner: . It is a classical task for artificial neural networks. The spiking neural network that we used for this task is shown in Fig. 5A. It consisted of 2 input neurons, 10 hidden neurons and 1 output neuron. Each input neuron encoded one binary input variable. It produced a Poisson spike train at its output with a rate of Hz for the input and Hz for the input . The hidden neurons and output neuron were stochastic spiking neurons with a refractory time of ms. Each layer was fully connected to the next layer and initial synaptic weights were set randomly (see 4 Methods for details).

During learning, a pattern was chosen randomly and presented to the network for ms. During this time, the output of the network was compared to the target output and a binary reward was delivered accordingly. More specifically, every ms the reward was recomputed and delivered to the network – being if the output neuron spiked (was silent) in the past ms for a target of ( respectively) and otherwise. The pattern presentation was followed by a delay period of ms where no input or reward was delivered to the network. Then, another randomly chosen pattern was presented and so on.

The evolution of the synaptic weights during learning is shown in Fig. 5C,D. The weights of both layers change significantly throughout learning and contribute to learning the task. Synapses also remain plastic throughout the whole learning time and explore different solutions. Network responses before and after learning are shown in Fig. 5E and Fig. 5F. Before learning, the average firing rate of the output neuron for all the input patterns were 52.7, 132, 97.2, and 81.5 Hz respectively (see Fig. 5E). After learning for 6 hours, the output neuron has maximized the firing rate for the input patterns and ) and significantly reduced it for patterns and (see Fig. 5F).

This task was considered before by Seung and Xie [32, 33]. They also considered a stochastic spiking neuron model, however with zero refractory time. Further, in their model, positive or negative reward was delivered to the network every millisecond. It was noted in [33] that learning does not work reliably if positive and negative rewards are not balanced. In fact using our highly unbalanced reward schedule with rewards being either or , the network often does not achieve optimal performance if a constant temperature is chosen for learning (Fig. 5G). In this case, optimal results were obtained in only % of the learning trials (where each trial was started with a different random initialization of the parameters). When we introduced a “cooling” schedule in which the temperature was decreased during learning, this ratio increased to %. The superiority of the annealed optimization is also visible in average reward attained during learning, see Fig. 5B. This shows that parameter optimization with annealed noise can significantly improve performance of spiking neural networks. A similar observation was reported for deep artificial neural networks [34]. Our theoretical framework of Hamiltonian sampling provides an explanation for this phenomenon as an optimization through annealed sampling similar to simulated annealing and thus opens the door to apply the toolkit of stochastic optimization to gradient-based neural network learning in a principled manner.

## 3 Discussion

We have presented a new theoretical framework for reward-based neural network optimization that integrates a hidden synaptic parameter in the plasticity process. We suggest that this synaptic parameter could be implemented in the synapse through CaMKII, that is abundantly present in the postsynaptic density and acts as low pass filter in the induction of synaptic plasticity. We have shown that the CaMKII-enriched dynamics supports a special type of ongoing stochastic policy search – Hamiltonian sampling with friction – and convergences to the stationary distribution much faster than Langevin sampling (synaptic sampling).

David Marr famously proposed to treat brain computation at three distinct, complementary levels of analysis [35], which is today known as Marr’s Tri-Level Hypothesis. It is of interest to realize that biological data on the activation dynamics of the kinase CaMKII, corresponds to the implementational/physical level of Marr’s Tri-Level Hypothesis. Our proposed model for network plasticity suggests that CaMKII enables the brain to perform Hamiltonian sampling on the algorithm level (algorithmic/representational level of Marr’s Tri-Level Hypothesis). To be specific, biological networks of neurons are able to approximate Hamiltonian sampling of network configurations, rather than slower Langevin sampling or gradient descent.

We have demonstrated several advantages of Hamiltonian sampling over previously considered approaches to reward-based learning in spiking neural networks. We have shown in Fig. 2 that this Hamiltonian synaptic sampling framework can be use to learn smooth responses of spiking neurons through reward-based learning, and that it further can scale up to learn recurrent networks of spiking neurons (shown in Fig. 3). In Fig. 4 we have shown that synaptic sampling is prone to be slow near saddle points of the objective function, and that Hamiltonian synaptic sampling can significantly speed up learning in these cases. Finally, we have demonstrated that reward-based network plasticity is in principle able to acquire through down-regulation of the stochastic component in parameter updates the full power of simulated annealing for optimizing the network. This allows neural networks to attain through learning not only locally optimal network configurations, but in principle even a global optimum. This theoretical result provides a new gold standard for reward-based network learning.

CaMKII dynamics has previously been studied in [36]. While this work focused on detailed molecular dynamics and its implications for STDP on the level of pairing protocols, we treated CaMKII dynamics in the current study more abstractly as a low-pass filtering process and studied the implications for system level reward-based learning. It is interesting to note that the low-pass filtering effect was also predicted in the model of [36]. In addition they proposed a role of CaMKII for binary-state behavior of synapses in hippocampus. The underlying hypothesis that synaptic efficacies can attain only two possible states, a depressed state and a potentiated state, has been put into question by recent experimental data [37].

Our model makes a number of experimentally testable predictions. It was shown in previous work that synaptic spine dynamics can be modeled by a stochastic process (Ornstein-Uhlenbeck process) with two time-constants on the temporal scale of several days. Our model that includes the Hamiltonian momentum term suggests that also on short time scales (minutes to few hours), models of synaptic dynamics with two time constants should provide better fits. Moreover, the proposed role of CaMKII suggests that these time constant should correspond to rates of dephosphorylation.

Our result on network optimization in Fig. 5 suggests that biological networks are able to control the level of stochasticity, and that stochasticity decreases during long lasting learning processes (cooling). Experimental results revealed that learning a new behavioral task is accompanied by increased synaptic spine numbers and spine dynamics [27, 39]. In [10] we analyzed a simple model for synaptic turnover and found that the statistics of spine regrowth during task acquisition can be explained by a brief increase of the learning temperature . These findings suggest that the brain employs – in addition to deterministic synaptic updates – a mechanism to regulate the speed of random exploration in the high-dimensional space of synaptic parameters over several hours to days. This article has introduced a mathematical framework that provides a step towards understanding the complex interplay of deterministic and stochastic strategies employed by the brain, to solve complex learning problems.

## 4 Methods

### 4.1 Details to learning the stationary distribution of network configurations through synaptic plasticity rules

Here we present the general mathematical framework of synaptic parameter dynamics and derive the emerging stationary distribution of network configurations that results from this dynamics. The generalized model, that includes both Hamiltonian synaptic sampling (4) and synaptic sampling without momentum (3) as special cases, is given by the following set of SDEs:

(10) |

where is the posterior distribution over the network parameter given by equation (2) and is the distribution over the CaMKII-related hidden synaptic parameter. The notation, , denotes the derivative of evaluated at the parameter vector . In 2 Results we suppressed this time-dependences in order to simplify the notation. is the temperature parameter, are independent one-dimensional Wiener processes, and , , are positive constants.

This dynamics describes a general noisy first-order interaction between visible synaptic parameters , that determine the efficacy of the synapse, and hidden synaptic parameters , the absolute value of which model the local concentration of CaMKII in its activated state. The dynamics can thus be seen as a generalization of standard gradient-based synaptic plasticity rules (e.g. for maximum likelihood learning) that includes structural constraints, CaMKII activation and stochastic plasticity. For the general dynamics, the joint distribution over the sets of all parameters and will converge after a while to and produce samples from it. This result can be formalized in the following theorem:

###### Theorem 4.1.

Let , be strictly positive, continuous probability distributions over parameters and respectively, twice continuously differentiable with respect to and . Let be postitive constants. Then the set of stochastic differential equations (10) leaves the distribution invariant. Furthermore, this is the unique stationary distribution of the sampling dynamics.

Proof. Eq. (10) has two drift terms , and two diffusion terms , :

(11) |

(12) |

(13) |

(14) |

Hence the SDEs (10) can be translate into the following Fokker-Planck equation:

(15) |

where denotes the distribution over network parameters at time . If we plug the stationary distribution to the right side of eq. (15), we have:

(16) |

This proves that is the stationary distribution of the parameters dynamic (10). Under the assumption that and are strictly positive, this stationary distribution is also unique. If the matrix of diffusion coefficients is invertible, and the potential conditions are satisfied, the stationary distribution can be obtained (uniquely) by simple integration. Since the matrix of diffusion coefficients is diagonal in our model, the diffusion coefficient matrix is trivially invertible if all diagonal elements, i.e. all and are strictly positive. Also the potential conditions are fulfilled (by design), as can be verified by substituting eqs. (11 – 14) into Equation (5.3.22) in [17],

(17) |

(18) |

This shows that is a gradient. Thus, the potential conditions are met and the stationary distribution is unique.

In Theorem 4.1, , need to be strictly positive. Note that we can relax it to or is strictly positive (or both) – which means there exists diffusion noise – and can prove is a unique stationary distribution of stochastic differential equations (10) in the same way.

Hamiltonian synaptic sampling (4) and synaptic sampling (3) are special cases of the more general parameter dynamics (10). Hamiltonian synaptic sampling as defined in (4) is obtained by choosing and a Gaussian distribution for the hidden parameters . Synaptic sampling as defined in (3) is obtained by choosing . We remark that various types of gradient descent can also be recovered from the generalized dynamics for , e.g. gradient descent with momentum for the noiseless Hamiltonian dynamics. Equation (10) can be seen as the continuous version of Hamiltonian sampling [22], where a Metropolis update is performed after simulating Hamiltonian dynamic. Equation (10) can also be seen as an extension of stochastic gradient Hamiltonian Monte Carlo with friction [40, 41] to the case where the temperature is used to shape the static distribution .

### 4.2 Spiking neuron model

Spiking neurons were modeled by a stochastic variant of the spike response model [42]. We use to donate the synaptic efficacy of the -th synapse in the network at time . Each neuron of network is then modeled as a point neuron with membrane potential at time

(19) |

where is the index set of synapses that project to neuron , denote the index of the presynaptic neuron of synapse , denotes the bias potential of neuron . In the recurrent network in Fig. 3 we used a slowly changing bias potential to ensures that the output rate of each neuron stays within finite bounds (described in detail below). In all other experiments we used a constant bias potential. denote the trace of the (unweighted) postsynaptic potentials (PSPs) from presynaptic neuron of synapse at time . Throughout this paper, we used standard double-exponential PSP kernels with a brief finite rise and exponential decay, of the form , with time constants and (any other PSP shape may be used in principle without further adaptations of the theoretical model).

We denote the output spike train of neuron by , which is defined as a sum of Dirac delta pulses positioned at the spike times , i.e., . Neuron fires according to the link function which denotes the firing probability of neuron at time . Due to the lasting effects of PSPs, the firing probability may depend on the history of past spiking activities of all input neurons up to time which we denote by , which is defined as:

(20) |

where denotes a refractory variable that is given by the time elapsed since the last spike of neuron . In this article, we set , where is a sigmoid activation function and denotes the Heaviside step function, i.e. for and otherwise. In our simulation, we set refectory time to 5 ms.

### 4.3 Reward-modulated synaptic plasticity rule

Here we derive the reward-based learning rules for the spiking neural network model outline above. In particular, we compute here the gradient of the expected reward:

(21) |

We only consider the recurrent network in section 2.2 Reward-guided network plasticity as an example and show that the parameter dynamic (6), (5) approximate this gradient. Actually one can compute the gradient of the expected reward for the feed-forward neural network in other simulations and get similar learning rule. In order to simplify notation, we use to represent the history of past spiking activity of all neurons up to time t. Supposing that the reward signal is only decided by , we can rewrite as the expectation over all possible spike trains up to time t:

(22) | |||||

Note that we use the fact . The problem now is to estimate the gradient of the probability of observing the spike train in the time interval to . According to Eqs. (19) and (20), the logarithm of the probability distribution can be rewritten as:

(23) |

where the integration runs from time to . Using this, the gradient can be estimated

(24) |

The dependence on (the current value of the synaptic weight), is a result of applying the chain rule and using the exponential mapping function (7). If a linear mapping between and is used this term vanishes as in Eq. (6). The learning rules are similar to previous ones which were found in the context of maximum likelihood and reinforcement learning in neural networks [43, 24].

Eq. (22) defines a batch learning rule with an average taken over learning episodes where in each episode network responses and rewards are drawn according to the distribution . In order to arrive at an online learning rule for this scenario, we consider an estimator of Eq. (22) that approximates its value at each time based on the recent network activity and rewards during time for some suitable . We denote the estimator at time by where we want for all . To arrive at such an estimator, we approximate the average over episodes in Eq. (22) by an average over time where each time point is treated as the start of an episode. The average is taken over a long sequence of network activity that starts at time and ends at time . Here, one systematic difference to the batch setup is that one cannot guarantee a time-invariant distribution over initial network conditions as we did there since those will depend on the current network parameter setting. However, under the assumption that the influence of initial conditions (such as initial membrane potentials and refractory states) decays quickly compared to the time scale of the environmental dynamics, it is reasonable to assume that the induced error is negligible. We thus rewrite Eq. (22) in the form

where is the length of the sequence of network activity over which the empirical expectation is taken. Finally, we can combine the second and third integral into a single one, rearrange terms and substitute and so that integrals run into the past rather than the future, to obtain

(25) |

Supposing that tends to , we get a simple on-line learning rule to approximate :

(26) | |||

(27) |

A similar learning rule has already been proposed by Seung and Xie [33]. In fact, as the learning rule only estimate Eq. (25) based on the reward at time , it ignores outer integral in Eq. (25) and thus can’t approximate accurately. A better estimation has been given by Kappel et al. [10] to improve learning performance, but without biological plausible motivation. Actually in our Hamiltonian synaptic sampling framework, CaMKII works as a momentum term that computes the average of the gradient instead of current gradient during on-line learning, which corresponds to the outer integral and thereby supporting better estimate of the gradient of expected reward.

### 4.4 Relating Hamiltonian synaptic sampling to synaptic sampling

Here we build the relationship between Hamiltonian synaptic sampling and synaptic sampling and show that synaptic sampling is included in Hamiltonian synaptic sampling. For simplicity and brevity, here we consider a version of the parameters dynamics for discrete time. According to Eq. (3), the parameter change of synaptic sampling during a small discrete time step can be written as:

(28) |

where denotes a learning rate that controls the speed of the parameter dynamics. represents Gaussian noise with zero mean and variance 1. These noises are independent for each parameter and each update time .

To compare synaptic sampling with Hamiltonian synaptic sampling, we rewrite Eq. (4) into the discrete version with the same time step :

(29) |

Eq. (29) seems different from Eq. (28). Actually, we can build the relationship between Eq. (29) and (28) with the assumption that the momentum term has transient time constant (tends to zeros). To be specific, the parameter is very large and tends to be 1. We thus rewrite the discrete version of Hamiltonian synaptic sampling (29) as:

(30) |

Note that Eq. (30) and (28) are the same if holds. Hence we conclude that synaptic sampling is a special case of Hamiltonian synaptic sampling that the momentum term changes on transient time constant.

### 4.5 Global network optimization through stochastic synaptic plasticity

Here, we show that in principle, stochastic plasticity with a cooling schedule (i.e., with a slow decrease of the noise amplitude) can produce globally optimal network configurations.

Temperature-dependent expected reward: We first calculate the expected reward that is attained by a network with parameters that have converged to the stationary distribution at temperature . We denote by the Bernoulli random variable that indicates reward at temperature . When the network has reached the stationary distribution , the expected reward is given by

Assuming an uninformative prior , we obtain

where normalizes with respect to . In other words, sampling from the posterior with decreased temperature concentrates parameter samples at values that lead to large rewards and therefore increases the expected reward of the network.

Temperature annealing for global optimization: For small temperatures, the posterior is concentrated at the global optimum of the reward landscape. In practice, the sampling process mixes extremely slowly at low temperatures due to low probabilities for non-optimal states. Hence, an annealing schedule that decreases the temperature slowly over time has to be employed in order to give synaptic sampling enough time to settle at the global optimum, similar to the simulated annealing optimization algorithm.

Here we show that in the limit , the network achieves maximal possible performance (the derivation is similar to the one in [15] for simulated annealing). Let denote the set of optimal circuit parameters, i.e., for . For an uninformative prior we have