MorphIC: A 65-nm 738k-Synapse/mm² Quad-Core Binary-Weight Digital Neuromorphic Processor
with Stochastic Spike-Driven Online Learning
Recent trends in the field of artificial neural networks (ANNs) and convolutional neural networks (CNNs) investigate weight quantization as a means to increase the resource- and power-efficiency of hardware devices. As full on-chip weight storage is necessary to avoid the high energy cost of off-chip memory accesses, memory reduction requirements for weight storage pushed toward the use of binary weights, which were demonstrated to incur a limited accuracy reduction on many applications when quantization-aware training techniques are used. In parallel, spiking neural network (SNN) architectures are explored to further reduce power when processing sparse event-based data streams, while on-chip spike-based online learning appears as a key feature for applications constrained in power and resources during the training phase. However, designing power- and area-efficient spiking neural networks still requires the development of specific techniques in order to leverage on-chip online learning on binary weights without compromising the synapse density. In this work, we demonstrate MorphIC, a quad-core binary-weight digital neuromorphic processor embedding a stochastic version of the spike-driven synaptic plasticity (S-SDSP) learning rule and a hierarchical routing fabric for large-scale chip interconnection. The MorphIC SNN processor embeds a total of 2k leaky integrate-and-fire (LIF) neurons and more than two million plastic synapses for an active silicon area of 2.86mm² in 65nm CMOS, achieving a high density of 738k synapses/mm². MorphIC demonstrates an order-of-magnitude improvement in the area-accuracy tradeoff on the MNIST classification task compared to previously-proposed SNNs, while keeping a competitive energy-accuracy tradeoff.
The massive deployment of neural network accelerators as inference devices is currently hindered by the memory footprint and power consumption required for high-accuracy classification [Whatmough17]. Two trends are being explored in order to solve this issue. The first trend consists in optimizing current artificial neural network (ANN) and convolutional neural network (CNN) architectures. Weight quantization down to binarization is a promising approach as it simplifies the operations and minimizes the memory footprint, thus avoiding the high energy cost of off-chip memory accesses if all the weights can be stored in on-chip memory [Moons17]. The accuracy drop induced by quantization can be mitigated to acceptable levels for many applications with the use of quantization-aware training techniques that propagate binary weights during the forward pass and keep full-resolution weights for backpropagation updates [Courbariaux16]. The associated off-chip learning setup for quantization-aware training is shown in Fig. 1(a): this strategy allows binary-weight neural networks to perform inference with a favorable energy-area-accuracy tradeoff, as recently demonstrated by binary CNN chips (e.g., [Andri18, Moons18, Bankman18]).
The second trend consists in changing the neural network architecture and data representation, which is currently being explored with bio-inspired spiking neural networks (SNNs) as a power-efficient neuromorphic processing alternative for sparse event-based data streams [Poon11]. Embedded online learning is a key feature in SNNs as it enables on-the-fly adaptation to the environment [Azghadi14]. Moreover, by avoiding the use of an off-chip optimizer, on-chip online learning allows SNNs to target applications that are power- and resource-constrained during both the training and the inference phases, as shown in Fig. 1(b). Spike-based online learning is an active research area, both in the development of new rules for high-accuracy learning in multi-layer networks (e.g., [Zheng17, Mostafa17, Neftci17, Zenke18]) and in the demonstration of silicon implementations in applications such as unsupervised learning for image denoising and reconstruction [Knag15, Chen18]. However, these approaches currently rely on multi-bit weights.
These two trends mostly evolve in parallel, as only three chips have previously been proposed to leverage the density and power advantages of binary weights with SNNs. First, the TrueNorth chip proposed by IBM is the largest-scale neuromorphic chip, with 1M neurons and 256M 1-bit synapses; however, it does not embed online learning [Akopyan15]. Second, the Loihi chip recently proposed by Intel has a configurable synaptic resolution that can be reduced to 1 bit and embeds a programmable co-processor for on-chip learning, although on-chip learning has not been demonstrated with a binary synaptic resolution, to the best of our knowledge [Davies18]. Finally, Seo et al. proposed a stochastic version of the spike-timing-dependent plasticity (S-STDP) rule for online learning in binary synapses [Seo11]. However, S-STDP requires the design of a custom transpose SRAM memory with both row and column accesses, which severely degrades the density advantage of their approach.
It has been demonstrated in [Frenkel17] that the spike-driven synaptic plasticity (SDSP) learning rule proposed by Brader et al. in [Brader07] allows for a more efficient resource usage than STDP: all the information necessary for learning is available in the post-synaptic neuron at pre-synaptic spike time. SDSP requires neither an expensive local synaptic storage of spike timings nor a custom SRAM with both row and column accesses. Therefore, in this work, we propose an efficient stochastic implementation of SDSP compatible with standard high-density foundry SRAMs in order to leverage embedded online learning in binary-weight SNNs.
Beyond plasticity, a second key aspect of spiking neural networks lies in connectivity. The brain organization in small-world networks with dense local connectivity and sparse long-range wiring leads to efficient clustering of neuronal activity and hierarchical information encoding [Bassett06]. Network-on-chip (NoC) design applied to multi-core SNNs is thus an active research topic [Akopyan15, Davies18, Moradi18, Park17, Navaridas09, Benjamin14, Schemmel10]. In this work, we propose a hierarchical combination of mesh-based routing for inter-chip connectivity, star-based routing for intra-chip inter-core connectivity and crossbar-based routing for local intra-core connectivity. We store all the connectivity information locally in the neuron memory to enable memory-less routers that do not require local mapping table accesses. With only 27 connectivity bits per neuron, this low-memory hierarchical connectivity allows reaching biologically-realistic fan-in and fan-out values of 1k and 2k neurons, respectively.
This paper extends [Frenkel19b] and demonstrates this two-fold approach with MorphIC, a quad-core digital neuromorphic processor: stochastic SDSP (S-SDSP) is combined with a hierarchical routing fabric for large-scale plastic connectivity. MorphIC was prototyped in 65nm CMOS and embeds 2k leaky integrate-and-fire (LIF) neurons and more than 2M synapses in an active silicon area of 2.86mm², therefore achieving a high density of 738k 1-bit online-learning synapses per mm². This is an order-of-magnitude density improvement compared to the only previously-proposed binary-weight online-learning SNN processor from Seo et al. On the MNIST image recognition task [LeCun98], MorphIC achieves an accuracy of 97.8%. It demonstrates an order-of-magnitude improvement in the area-accuracy tradeoff compared to other SNNs, while keeping a competitive energy-accuracy tradeoff using rank order coding.
The remainder of this paper is structured as follows. The architecture and implementation of the MorphIC SNN processor are provided in Section II, together with detailed descriptions of the hierarchical event routing infrastructure and S-SDSP learning rule. The specifications, measurements and benchmarking results are presented in Section III. Finally, the presented results are discussed in Section IV.
II Architecture and Implementation
A block diagram of the MorphIC quad-core spiking neuromorphic processor is shown in Fig. 2, illustrating its hierarchical routing fabric for large-scale chip interconnection. Level-2 (L2) routers handle inter-chip connectivity, level-1 (L1) routers handle inter-core connectivity and level-0 (L0) routers handle intra-core connectivity (Section II-A). The clock can be either provided externally or generated internally using a configurable-length ring oscillator. A block diagram of the MorphIC core is shown in Fig. 3: each core embeds 512 leaky integrate-and-fire (LIF) neurons configured as a crossbar array with 256k L0 1-bit synapses and 256k L1 1-bit synapses, while 16k L2 synapses can be accessed independently. Each synapse embeds online learning with a stochastic implementation of the spike-driven synaptic plasticity (S-SDSP) learning rule (Section II-B). Each axon can be configured to multiply its associated synaptic weights by a factor of 1, 2, 4 or 8. Time multiplexing is used to increase the neuron and synapse densities by using shared update circuits and storing neuron and synapse states in local SRAM memory, based on the strategy we previously proposed for the ODIN SNN in [Frenkel19]. Fig. 4 illustrates the time-multiplexed crossbar operation of a MorphIC core when it processes a spike event from a neuron in the local core (L0 connectivity) or from a neuron in another core in the same chip (L1 connectivity). The core controller goes sequentially through all 512 local neurons, leading to 512 synaptic operations (SOPs), and handles the local SRAM memory accesses accordingly. As L2 events target a specific synapse of a neuron (Section II-A), they lead to a single SOP.
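The time-multiplexed crossbar operation described above can be sketched in Python as follows. This is a minimal behavioral sketch, not the MorphIC controller: the function name `process_event`, the firing threshold and the reset-to-zero behavior are assumptions made for illustration, and leakage is omitted. Only the 512-neuron sequential scan, the binary weights and the axon multiplier of 1, 2, 4 or 8 come from the text.

```python
N_NEURONS = 512  # neurons per core, from the text


def process_event(weights_row, membrane, multiplier=1, threshold=64):
    """Process one L0 or L1 spike event against one crossbar row.

    weights_row: 512 binary weights (0 or 1) of the source axon.
    membrane:    512 membrane potentials, updated in place.
    multiplier:  axon weight multiplier, one of 1, 2, 4 or 8.
    threshold:   firing threshold (hypothetical value for this sketch).
    Returns (indices of neurons that fired, number of SOPs performed).
    """
    if multiplier not in (1, 2, 4, 8):
        raise ValueError("multiplier must be 1, 2, 4 or 8")
    fired = []
    sops = 0
    for n in range(N_NEURONS):  # sequential scan shares one update circuit
        membrane[n] += multiplier * weights_row[n]
        sops += 1  # one synaptic operation per visited neuron
        if membrane[n] >= threshold:
            fired.append(n)
            membrane[n] = 0  # reset to zero is an assumption
    return fired, sops
```

Every L0 or L1 event costs 512 SOPs in this scheme regardless of how many weights are set, which is why the sequential scan sets the event throughput of a core.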
II-A Hierarchical event routing
Clustering groups of neurons with dense local and sparse long-range connectivity allows minimizing memory requirements while keeping flexibility and scalability [Moradi18]. This organization is found in the brain and is known as small-world networks. Hierarchy is therefore a key concept in SNN event routing infrastructures for large-scale networks [Akopyan15, Davies18, Moradi18, Park17, Navaridas09, Benjamin14, Schemmel10]. MorphIC uses a heterogeneous hierarchical routing fabric with different router types at each level, as shown in Fig. 5: the L2 router follows a unicast mesh-based dimension-ordered destination-driven operation (Section II-A1), the L1 router follows a multicast star-based source-driven operation (Section II-A2) while the L0 router handles decoding and encoding of the different packet types for local core crossbar-based processing (Section II-A3). Such a heterogeneous event routing infrastructure is deadlock-free and allows for the three connectivity patterns illustrated in Fig. 6, depending on the source neuron location:
The source neuron targets neurons in any combination of other cores in the same chip (L1 connectivity): the time-multiplexed crossbar approach of Fig. 4 is followed with the L1 synapses of the destination cores. The same L1 synapses are shared with up to three cores (e.g., orange pattern from source neurons in cores 1 and 2 to destination cores 0 and 3 in Fig. 6).
The source neuron is located in another MorphIC chip (L2 connectivity): the target is a specific L2 synapse address in any combination of cores in one destination chip (e.g., gray pattern from a source neuron retrieved from the West link toward identical L2 synapse addresses in cores 1, 2 and 3 in Fig. 6). As each neuron has 32 L2 synapses, an L2 synapse address has a width of 14 bits (9 bits for the neuron, 5 bits for the L2 synapse).
Each neuron of MorphIC can use any combination of the aforementioned three types of L0, L1 and L2 connectivities, which allows reaching a fan-in of 512 (L0) + 512 (L1) + 32 (L2) and a fan-out of 512 (L0) + 3×512 (L1) + 4 (L2).
The entire connectivity of a network of MorphIC chips is determined by only 27 connectivity bits per neuron, which are stored in the neuron 8-kB SRAM memories located inside each core (Fig. 3). Each memory consists of 512 128-bit words, one word for each of the 512 LIF neurons per core, whose structure is outlined in Fig. 7. Destination-based L2 connectivity requires 24 bits in total: the 6-bit chip field stores the 3-bit $\Delta x$ and $\Delta y$ fields encoding the destination chip (Section II-A1), the 4-bit cores field encodes the combination of target cores and the 5-bit syn and the 9-bit neur fields encode the 14-bit L2 synapse address. Source-based L1 connectivity requires only 3 bits per neuron in order to target any combination of the other cores in a MorphIC chip. Unless disabled in the core parameter bank, L0 crossbar connectivity is automatic and does not require further connectivity information. As all the connectivity information is decentralized next to the neurons and then encapsulated in the event packets, the routers do not require local or external mapping tables: they are memory-less beyond simple packet buffering. Let us now discuss the architectural details of the L2, L1 and L0 routers.
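As an illustration of the 27-bit budget, the following Python sketch packs and unpacks the per-neuron connectivity fields and recomputes the fan-in and fan-out figures. The field widths are taken from the text; the field names `l1_cores`, `l2_cores`, `dx` and `dy`, the helper names and the ordering of the fields inside the word are assumptions, since the actual layout is only given in Fig. 7.

```python
# (name, width in bits); widths from the text, ordering is an assumption
FIELDS = [
    ("l1_cores", 3),  # source-based L1: any combination of the 3 other cores
    ("l2_cores", 4),  # destination cores in the target chip
    ("dx", 3),        # chip offset, one sign bit and two data bits
    ("dy", 3),        # chip offset, one sign bit and two data bits
    ("syn", 5),       # L2 synapse index within the neuron (32 per neuron)
    ("neur", 9),      # destination neuron index (512 per core)
]
TOTAL_BITS = sum(width for _, width in FIELDS)


def pack(values):
    """Pack a dict of field values into one integer, first field in the LSBs."""
    word, shift = 0, 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word


def unpack(word):
    """Inverse of pack()."""
    values, shift = {}, 0
    for name, width in FIELDS:
        values[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return values


# Fan-in and fan-out per neuron, following the counts given in the text
FAN_IN = 512 + 512 + 32        # L0 + L1 + L2
FAN_OUT = 512 + 3 * 512 + 4    # L0 + L1 (three other cores) + L2
```

With these widths, `TOTAL_BITS` evaluates to 27, and the fan-in and fan-out evaluate to 1056 and 2052, that is, roughly 1k and 2k.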
II-A1 Level-2 (L2) router
The L2 router (Fig. 5(a)) handles high-level inter-chip connectivity with four links along the North, South, East and West directions that operate independently and in parallel. Events from/to the four chip-level links and from/to the L1 router are buffered into FIFOs before being dispatched following a standard unicast mesh-based strategy with dimension-ordered routing (i.e., $x$ direction before $y$ direction). Two fields in the chip-level packet, $\Delta x$ and $\Delta y$, contain the information necessary for destination-based routing. $\Delta x$ and $\Delta y$ have a 3-bit width each (one sign bit, two data bits), which allows routing packets to up to three MorphIC chips in any direction. At each East or West (resp. North or South) hop, the L2 router decrements the value of the $\Delta x$ (resp. $\Delta y$) data field. When both $\Delta x$ and $\Delta y$ are zero, the packet is forwarded to the L1 router. Distance information is also maintained separately in the event packet: the distance is 0 for local L0 events and 1 for events received from local L1 connectivity, and it then increases for each L2 hop up to a maximum of 7 for events received from a chip located at $\Delta x = 3$ and $\Delta y = 3$. As synapses at all routing levels of MorphIC embed online learning (Section II-B), the probability of synaptic weight update can be modulated by the distance information, following a small-world network modeling strategy. To the best of our knowledge, this is the first SNN to propose online hierarchical learning.
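The dimension-ordered, destination-driven decision taken at each L2 hop can be sketched in Python as follows. This is a minimal sketch under assumptions: the offsets are modeled as signed integers in [-3, 3] rather than as sign-magnitude bit fields, positive offsets are arbitrarily mapped to East and North, and the names `l2_route` and `trace` are hypothetical. The East/West-before-North/South order, the per-hop decrement and the distance count starting at 1 follow the text.

```python
def l2_route(dx, dy, dist):
    """One L2 routing decision.

    dx, dy: remaining chip offsets, signed integers in [-3, 3].
    dist:   distance counter carried by the packet.
    Returns (output port, new dx, new dy, new dist).
    """
    if dx != 0:  # first dimension: East/West
        step = 1 if dx > 0 else -1
        return ("E" if dx > 0 else "W", dx - step, dy, dist + 1)
    if dy != 0:  # second dimension: North/South
        step = 1 if dy > 0 else -1
        return ("N" if dy > 0 else "S", dx, dy - step, dist + 1)
    return ("L1", 0, 0, dist)  # destination chip reached


def trace(dx, dy):
    """Follow a packet from its source chip to the destination L1 router."""
    dist = 1  # event leaving its source core through the L1 router
    path = []
    while True:
        port, dx, dy, dist = l2_route(dx, dy, dist)
        path.append(port)
        if port == "L1":
            return path, dist
```

For example, `trace(3, 3)` takes three East hops followed by three North hops and ends with a distance of 7, which matches the maximum distance stated above.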
The mesh-based dispatcher is controlled by an arbiter, which can be configured either for round-robin or for priority-based operation. Round-robin operation, by cycling through each link independently of the FIFO usage, guarantees a maximum latency for packet processing, while priority-based operation is a greedy approach that allocates processing time to the most active links based on the current FIFO usage.
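The difference between the two arbitration policies can be sketched in Python as follows. This is an illustrative sketch only: the function names, the representation of FIFO usage as a list of occupancy counts and the tie-breaking rule are assumptions. The five sources correspond to the four chip-level links and the L1 router mentioned above.

```python
N_SOURCES = 5  # four chip-level links plus the L1 router


def round_robin_next(last_served):
    """Serve sources in a fixed cyclic order, independently of FIFO usage.

    Each source is visited once every N_SOURCES arbitration slots,
    which bounds the waiting time of any buffered packet.
    """
    return (last_served + 1) % N_SOURCES


def priority_next(fifo_levels):
    """Greedy policy: serve the source whose FIFO holds the most packets.

    Ties go to the lowest index (an assumption of this sketch).
    A lightly loaded source may wait for an unbounded time.
    """
    return max(range(len(fifo_levels)), key=lambda i: fifo_levels[i])
```

With FIFO occupancies `[0, 3, 1, 3, 0]`, the greedy policy selects source 1, whereas round-robin selects whichever source follows the last one served.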
Links in each direction consist of two address-event representation (AER) busses, a sender and a receiver, for a total of eight AER busses per MorphIC chip. AER is a de facto standard for spiking neural network connectivity as it allows high-speed asynchronous communication of spike events between chips using a four-phase handshake protocol [Mortara94, Boahen00]. The MorphIC design being pad-limited, the width of the AER busses has been reduced to 8 bits. Transmission and reception of 32-bit event packets are thus multiplexed into four 8-bit AER transactions, as illustrated in Fig. 8. In order to ensure an asynchronous operation of the AER busses between MorphIC chips, double-latching synchronization barriers have been placed on the receiver REQ and sender ACK handshake lines to limit metastability issues. Due to the increased latency of off-chip packet routing, L2 packet activity should be sparse compared to L1 and L0 activity. L2 events should thus represent high-level features, as illustrated in the experiments outlined in Section III.
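The multiplexing of a 32-bit event packet into four 8-bit AER transactions can be sketched in Python as follows. The byte order (most significant byte first) and the function names are assumptions; the actual ordering is defined in Fig. 8, and the four-phase handshake itself is not modeled.

```python
def split_packet(packet):
    """Split a 32-bit event packet into four 8-bit AER transactions.

    Most-significant byte first (assumed ordering).
    """
    if not 0 <= packet < (1 << 32):
        raise ValueError("packet must fit in 32 bits")
    return [(packet >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def join_transactions(transactions):
    """Rebuild the 32-bit packet on the receiver side."""
    if len(transactions) != 4:
        raise ValueError("expected four 8-bit transactions")
    packet = 0
    for byte in transactions:
        packet = (packet << 8) | (byte & 0xFF)
    return packet
```

Each packet therefore costs four handshakes on the link, which is one reason why L2 traffic should remain sparse compared to on-chip traffic.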
II-A2 Level-1 (L1) router
The L1 router (Fig. 5(b)) handles mid-level intra-chip inter-core connectivity with the four local MorphIC cores. This router is based on a star topology and relies on a simple dispatcher that multicasts events to local cores following a source-based approach. It does not contain any FIFO buffering as awaiting packets are already buffered in the L2 and L0 routers. An arbiter controls the dispatcher following a configurable round-robin or greedy priority-based operation, similarly to the L2 router.
The L1 router is at the center of the hierarchy. For neuron events from local cores (i.e. ascending-hierarchy events), it handles multicasting to any combination of the other cores toward L1 synapses and/or forwarding to the L2 router toward another MorphIC chip. For events retrieved from the L2 router (i.e. descending-hierarchy events), it handles multicasting to any combination of the MorphIC cores toward L2 synapses.
II-A3 Level-0 (L0) router
The L0 router (Fig. 5(c)) handles low-level intra-core connectivity. This router is divided into two blocks: an interface and a scheduler. The interface handles packet decoding and encoding from/to the L1 router. The packet decoder segments input packets into different types:
configuration packets are used to program the local neuron and synapse SRAMs and the core parameter bank (Fig. 3), they are handled by the controller,
monitoring request packets query one byte from the neuron or synapse SRAM, they are handled by the controller,
scheduler events are buffered by a FIFO in the core scheduler, they include L2 events targeting a single L2 synapse, L1 events targeting L1 synapses, L0 events targeting L0 synapses, virtual events that directly update a neuron without accessing any physical synapse, teacher events that control the S-SDSP supervision mechanism through the neuron Calcium variables (Section II-B) and the leak events that drive the LIF leakage time constant.
Locally-generated L0 events are buffered directly in a scheduler FIFO; they are not visible from the L1/L2 router hierarchy. Locally-generated events that need to go up the router hierarchy are handled by the packet encoder:
monitoring reply packets contain the neuron or the synapse state byte previously queried by a monitoring request packet,
L1/L2 events forward the L1 and L2 connectivity information of a source neuron to the L1 router.
II-B Stochastic spike-driven synaptic plasticity (S-SDSP)
As the spike-timing-dependent plasticity (STDP) learning rule relies on the relative timing between pre- and post-synaptic spikes, it requires a local synaptic buffering of spike timings, which leads to critical overheads as buffering circuitry has to be replicated inside each synapse [Frenkel17]. In order to avoid this problem, the stochastic binary approach proposed by Seo et al. in [Seo11] involves the design of a custom transpose SRAM with both row and column accesses to carry out STDP updates each time pre- and post-synaptic spikes occur. However, beyond increasing the design time, custom SRAMs do not benefit from DRC pushed rules for foundry bitcells and induce a strong area penalty compared to single-port high-density foundry SRAMs [Frenkel17]. Therefore, STDP cannot be implemented efficiently in silicon.
The spike-driven synaptic plasticity (SDSP) learning rule [Brader07] avoids this drawback: the synaptic weight is updated each time a pre-synaptic event occurs, according to Eq. (1). The update depends solely on the state of the post-synaptic neuron at the time of the pre-synaptic spike, i.e. the membrane potential $V_{mem}$ compared to threshold $\theta_m$ and the Calcium concentration Ca compared to thresholds $\theta_1$, $\theta_2$ and $\theta_3$. The Calcium concentration represents an image of the recent firing activity of the neuron; it disables SDSP updates for high and low post-synaptic neuron activities and helps prevent overfitting [Brader07]. A single-port high-density foundry SRAM can therefore be used for high-density time-multiplexed implementations. However, as SDSP relies on discrete positive and negative steps, it cannot be applied directly to binary weights.
Senn and Fusi proposed a bio-inspired stochastic learning rule for binary synapses in [Senn05], where the update conditions rely on the total synaptic input of the post-synaptic neuron at the time of the pre-synaptic spike. However, this information is not easily available in time-multiplexed implementations: as shown in Fig. 4, the destination neurons are processed sequentially, while obtaining the total post-synaptic input of a neuron would require sequential processing of the source neurons instead, which is incompatible with an event-driven operation. Therefore, we propose a stochastic spike-driven synaptic plasticity (S-SDSP) learning rule suitable for binary weights, as formulated in Eq. (2). It results from the fusion of the stochastic mechanism proposed in [Senn05] with the SDSP update conditions. Potentiation and depression are each gated by a binary random variable, which takes the value 1 with probability $p_+$ and $p_-$, respectively. The synaptic weight therefore goes from 0 to 1 (resp. 1 to 0) with probability $p_+$ (resp. $p_-$), depending on the update conditions. The Calcium concentration is implemented as a 4-bit variable and is stored next to all S-SDSP parameters in the neuron SRAM (Fig. 7).
The proposed S-SDSP update logic is shown in Fig. 9. The binary random variables can be generated with the required probabilities using linear feedback shift register (LFSR)-based pseudo-random number generation. In order to generate these probabilities with a resolution similar to the probabilities down to 0.01 used in [Senn05], approximately 6 bits of resolution are required. Distance-based modulation of the update probabilities from small-world network modeling requires another 3 bits of resolution as the distance information ranges from 0 to 7 (Section II-A). Therefore, we selected a 9-bit resolution for probabilities. As S-SDSP updates must be computed in a single clock cycle, it is possible to parallelize successive iterations of an LFSR by using the unfolding algorithm from [Parhi99], as suggested in [Cheng06] to avoid instantiating parallel LFSRs and save switching power. The number of parallelized successive iterations is governed by the unfolding factor, which is 9 in this case. The unfolding process and the resulting unfolded LFSR are illustrated in Fig. 10. Unfolding multiplies the combinational logic resources (here, a single XOR gate) by the unfolding factor, while the LFSR period is divided by the unfolding factor. In order to avoid inducing correlation between synapses, the period of the unfolded LFSR must be one order of magnitude higher than the number of synapses per neuron. We thus selected a 17-bit depth for the LFSR to be unfolded (Fig. 10(a-b)). The 9-unfolded LFSR is shown in Fig. 10(c). The overhead incurred by the resulting S-SDSP update logic is negligible as it is shared with time multiplexing for all the L0, L1 and L2 synapses in a MorphIC core.
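A stochastic binary update of this kind can be sketched in Python as follows. This is a hedged behavioral sketch and not a transcription of Eq. (2): the threshold conditions are assumed to follow the usual SDSP formulation of [Brader07] (potentiation window between a lower and an upper Calcium threshold when the membrane potential is above its threshold, narrower depression window otherwise), and all parameter names (`theta_m`, `theta_1`, `theta_2`, `theta_3`, `p_up`, `p_down`) are hypothetical.

```python
import random


def s_sdsp_update(w, v_mem, ca, theta_m, theta_1, theta_2, theta_3,
                  p_up, p_down, rng=random):
    """Return the new binary weight after one pre-synaptic spike.

    w:       current binary weight (0 or 1).
    v_mem:   post-synaptic membrane potential at pre-synaptic spike time.
    ca:      post-synaptic Calcium variable at pre-synaptic spike time.
    p_up:    probability of a 0 -> 1 transition when potentiation applies.
    p_down:  probability of a 1 -> 0 transition when depression applies.
    The windows on ca are assumptions (see lead-in).
    """
    if v_mem >= theta_m and theta_1 <= ca < theta_3:
        if rng.random() < p_up:  # stochastic potentiation
            return 1
    elif v_mem < theta_m and theta_1 <= ca < theta_2:
        if rng.random() < p_down:  # stochastic depression
            return 0
    return w  # no update: conditions not met or random draw failed
```

Only the state of the post-synaptic neuron and one random draw are needed per update, so the weight can be read, updated and written back during the sequential crossbar scan using a single-port SRAM.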
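The effect of unfolding can be checked with a short Python model. This is a sketch under assumptions: the feedback polynomial $x^{17} + x^{14} + 1$ is a commonly tabulated maximal-length choice for 17 bits that needs a single XOR gate, but the taps actually used in MorphIC are not stated in the text; comparing the 9 fresh bits against a 9-bit threshold is likewise an assumed way of turning the random word into a Bernoulli draw. All function names are hypothetical.

```python
MASK17 = (1 << 17) - 1


def lfsr_step(state):
    """One iteration of a 17-bit Fibonacci LFSR (assumed taps: bits 16, 13)."""
    feedback = ((state >> 16) ^ (state >> 13)) & 1
    return ((state << 1) | feedback) & MASK17


def lfsr_step_unfolded9(state):
    """Nine iterations computed at once from the current state.

    The k-th feedback bit is s[16-k] XOR s[13-k] for k = 0..8, so all nine
    bits depend only on the old state: nine XOR gates, no serial chain.
    """
    new_bits = ((state >> 8) ^ (state >> 5)) & 0x1FF
    return ((state << 9) | new_bits) & MASK17


def bernoulli_draw(state, threshold):
    """Advance the unfolded LFSR and draw one bit.

    threshold: integer in [0, 512]; the draw is 1 when the 9 fresh bits,
    read as an integer, are below the threshold (assumed comparison).
    Returns (new state, drawn bit).
    """
    state = lfsr_step_unfolded9(state)
    return state, int((state & 0x1FF) < threshold)
```

The unfolded step is bit-exact with nine serial steps, which illustrates how a single-cycle 9-bit random word is obtained at the cost of nine XOR gates instead of one.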
III Measurements and Benchmarking Results
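MorphIC was prototyped in the UMC 8-metal 65-nm low-power (LP) CMOS process. A chip microphotograph is presented in Fig. 11, while specifications and measurement results are provided in Table III. A detailed area breakdown is provided in the area breakdown table. As derived in [Frenkel19], the power consumption of time-multiplexed digital SNN architectures can be modeled by
$P_{tot} = P_{leak} + P_{idle} \times f_{clk} + E_{SOP} \times r_{SOP}$,
where $P_{leak}$ is the leakage power, $P_{idle}$ is the idle power per unit of clock frequency, $f_{clk}$ is the clock frequency, $E_{SOP}$ is the energy per SOP and $r_{SOP}$ is the SOP processing rate.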
|Number of cores||4|
|Total # neurons (type)||2k (LIF)|
|Total # synapses (hier.)||more than 2M (256k L0, 256k L1 and 16k L2 per core)|
|Max. clock frequency||55MHz||210MHz|
|Leakage power ($P_{leak}$)||45µW||190µW|
|Idle power ($P_{idle}$)||41.3µW/MHz||94.0µW/MHz|
|Energy per SOP ($E_{SOP}$)||30pJ||65pJ|
|Energy per L2 hop||9.0pJ||20.3pJ|
|Energy per L1 hop||1.7pJ||3.8pJ|
|L2 router bandwidth||2.3Mpackets/s/link||5.7Mpackets/s/link|
|L1 router bandwidth||55Mpackets/s||210Mpackets/s|
|Core bandwidth||27.5MSOP/s/core||105MSOP/s/core|
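As a worked example, the measured figures can be combined into a rough power estimate. The Python sketch below assumes the usual additive form for time-multiplexed digital SNNs (leakage, plus idle power proportional to clock frequency, plus energy per SOP times SOP rate), assumes that the leakage and idle figures are chip-level values in µW and µW/MHz, and uses the first value column of the table. The function and constant names are hypothetical.

```python
def total_power(p_leak_w, p_idle_w_per_mhz, f_clk_mhz, e_sop_j, r_sop_per_s):
    """Assumed additive power model, all results in watts."""
    return p_leak_w + p_idle_w_per_mhz * f_clk_mhz + e_sop_j * r_sop_per_s


# First value column of the table (units assumed: µW, µW/MHz, pJ)
P_LEAK = 45e-6          # W
P_IDLE = 41.3e-6        # W per MHz
F_CLK_MHZ = 55          # MHz
E_SOP = 30e-12          # J per SOP
R_SOP_CORE = 27.5e6     # SOP/s per core at maximum throughput
N_CORES = 4

# Upper bound: all four cores processing SOPs at their maximum rate
p_max = total_power(P_LEAK, P_IDLE, F_CLK_MHZ, E_SOP, N_CORES * R_SOP_CORE)

# Clock cycles spent per SOP, derived from the clock and core bandwidth
cycles_per_sop = F_CLK_MHZ * 1e6 / R_SOP_CORE
```

Under these assumptions the estimate is about 5.6 mW at full load, of which roughly 3.3 mW comes from synaptic operations, and each SOP takes two clock cycles in both table columns.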