The brain on low power architectures
Efficient brain simulation is a scientific grand challenge, a parallel/distributed coding challenge and a source of requirements and suggestions for future computing architectures. Indeed, the human brain includes about synapses and neurons activated at a mean rate of several Hz. Full brain simulation poses Exascale challenges even if simulated at the highest abstraction level. The WaveScalES experiment in the Human Brain Project (HBP) has the goal of matching experimental measures and simulations of slow waves during deep sleep and anesthesia, and of the transition to other brain states. Its focus is the development of dedicated large-scale parallel/distributed simulation technologies. The ExaNeSt project designs an ARM-based, low-power HPC architecture scalable to millions of cores, developing a dedicated scalable interconnect system, and SWA/AW simulations are included among the driving benchmarks. At the junction of the two projects is the INFN proprietary Distributed and Plastic Spiking Neural Networks (DPSNN) simulation engine. DPSNN can be configured to stress either the networking or the computation features available on the execution platforms. The simulation stresses the networking component when the neural net (composed of a relatively low number of neurons, each one projecting thousands of synapses) is distributed over a large number of hardware cores. As the number of neurons per core grows, computation becomes the dominant component for short-range connections. This paper reports preliminary performance results obtained on an ARM-based HPC prototype developed in the framework of the ExaNeSt project. Furthermore, it compares instantaneous power, total energy consumption, execution time and energetic cost per synaptic event of SWA/AW DPSNN simulations executed on either ARM- or Intel-based server platforms.
The final publication is available at IOS Press through
(2018) Advances in Parallel Computing, 32, pp. 760-769, Talk at ParCo 2017.
Efficient simulation of cortical slow waves and asynchronous states
Corresponding Author: Andrea Biagioni, INFN Sezione di Roma, Piazzale Aldo Moro 2, Roma, Italy. E-mail: email@example.com
Fabrizio Capuani, Paolo Cretaro, Giulia De Bonis, Francesca Lo Cicero, Alessandro Lonardo, Michele Martinelli, Pier Stanislao Paolucci, Elena Pastorelli, Luca Pontisso, Francesco Simula and Piero Vicini
1 Introduction
The performance scaling of modern HPC systems and applications is strongly limited by energy consumption. Electricity is the main contributor to the total cost of running an application, and energy efficiency is becoming the principal requirement for this class of computing devices. In this context, assessing the performance of processors with a high performance-per-watt ratio is necessary to understand how to build energy-efficient computing systems for scientific applications. Processors based on the ARM architecture dominate the market of low-power and battery-powered devices such as tablets and smartphones. Several scientific communities are exploring non-traditional many-core processor architectures coming from the embedded market, from the Graphics Processing Unit (GPU) to the System-on-Chip (SoC), looking for a better tradeoff between time-to-solution and energy-to-solution. A number of research projects are working to design an actual platform along this direction. The Mont-Blanc project [1, 2], coordinated by the Barcelona Supercomputing Center, has deployed two generations of HPC clusters based on ARM processors, also developing the corresponding ecosystem of HPC tools targeted to this architecture. Another example is the EU-FP7 EUROSERVER project [3], coordinated by CEA, which aims to design and prototype technology, architecture, and systems software for the next generation of datacenter “microservers”, exploiting 64-bit ARM cores.
Fast simulation of spiking neural network models plays a dual role: (i) it contributes to the solution of a scientific grand challenge, i.e. the comprehension of brain activity, and (ii) when included in embedded systems, it can enhance applications like autonomous navigation, surveillance and robotics. Therefore, these simulations assume a driving role in shaping the architecture of both specialized and general-purpose multi-core/many-core systems to come, standing at the crossroads between embedded and High Performance Computing. See, for example, [4], describing the TrueNorth low-power specialized hardware architecture dedicated to embedded applications, and [5], discussing the power consumption of the SpiNNaker hardware architecture, based on embedded multi-cores, dedicated to brain simulation. Also worthy of mention are [6, 7] as examples of approaches based on standard HPC platforms and general-purpose simulators.
The APE Research Group at INFN developed a distributed neural network simulator [8] as a mini-application and benchmark in the framework of the EURETILE FP7 project [9]. Indeed, the Distributed and Plastic Spiking Neural Network with synaptic Spike-Timing Dependent Plasticity mini-application was developed with two main purposes in mind: as a quantitative benchmarking tool for the evaluation of requirements for future embedded and HPC systems, and as an efficient simulation tool addressing specific scientific problems in computational neuroscience. As regards the former goal, the ExaNeSt project [10] includes DPSNN in the set of benchmarks used to specify and validate the requirements of future interconnects and storage systems; as an example of the latter, the distributed simulation technology is employed in the study of slow waves in large-scale cortical fields [11, 12] in the framework of the HBP project.
This paper describes the porting of DPSNN onto different ARM-based platforms and its execution on low-power CPUs, comparing the resulting computing and energy performance with that of traditional systems mainly based on x86 multicores. The characterization of DPSNN-generated data traffic is described, highlighting the limitations faced when the application is run on off-the-shelf networking components. The organization and compactness of the code give DPSNN a high degree of tunability, offering the opportunity to test different areas of the platform. The networking compartment is the most stressed when the simulated neural net (composed of a relatively low number of neurons, each one projecting thousands of synapses) is distributed over a large number of hardware cores. When the number of neurons per core grows, the impact of both computing and memory increases. For this reason, we employ DPSNN as a general benchmarking tool for HPC systems.
2 Mini-application benchmarking tool
Evaluation of HPC hardware is a key element, especially in the first stages of a project (i.e. the definition of specifications and design) and during development and implementation. Features impacting performance should be identified in the analysis and design of new architectures. In the early stages of development, full applications are too complex to run on the hardware prototype. In usual practice, hardware is tested with very simple kernels and benchmarking tools, which often prove inadequate as soon as they are compared with real applications running on the final platform, showing a huge performance gap.
In recent years, a new category of compact, self-contained proxies for real applications, called mini-apps, has appeared. Although a full application is usually composed of a huge amount of code, its overall behaviour is driven by a relatively small subset. Mini-apps consist of these core operations and provide a tool to study different subjects: (i) analysis of the computing device, i.e. the node of the system; (ii) evaluation of scaling capabilities, configuring the mini-apps to run on different numbers of nodes; and (iii) study of the memory usage and the effective throughput towards the memory.
This effort is led by the Mantevo project [13], which has provided application performance proxies since 2009. Furthermore, the main research computing centers provide sets of mini-applications, adopted when procuring systems, as in the case of the NERSC-8/Trinity Benchmarks [14], used to assess the performance of the Cray XC30 architecture, or the Fiber Miniapp Suite [15], developed by the RIKEN Advanced Institute for Computational Science (RIKEN AICS) and the Tokyo Institute of Technology.
The miniDPSNN benchmarking tool leverages the Hardware-Software Co-design approach, which starts from the collection of application requirements for the initial development of the infrastructure and then proceeds to test the adopted solution during the implementation. Thus, the application drives the research on the main components of an HPC system from its roots, by optimizing the modeling and simulation of a complex system.
The analysis is based on the behaviour of a strong scaling test. Neurons are arranged into “columns”, each one composed of about one thousand neurons; columns are then arranged into a bidimensional grid. Each excitatory neuron projects 80% of its synapses to neurons residing in its own column, while the rest reach neurons in the neighbouring columns, according to the chosen remote connectivity. Synapses of inhibitory neurons, instead, are projected only towards excitatory neurons residing in the same column. When DPSNN runs, each process can host a fraction of a column, a whole single column, or an integer number of columns.
To optimize performance, each core of the computing system hosts only one process. Thus, varying the columns-per-process ratio (i.e. the ratio of columns per core of the computing devices) pushes the application into different regimes, making it possible to stress and test several elements of the platform. Note that, in general, the hardware connection topology bears no resemblance whatsoever to the lateral connectivity of columns and neurons; the exception is when only one process runs per node, so that all outward connectivity of a column impinges upon the network system of the node.
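The three hosting layouts described above can be told apart from the columns-per-process ratio alone. The following is a minimal sketch, not DPSNN code: the function names and the 4x4 grid used in the example are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def columns_per_process(grid_side: int, n_processes: int) -> Fraction:
    """Exact ratio of cortical columns to MPI processes for a square grid."""
    if grid_side < 1 or n_processes < 1:
        raise ValueError("grid_side and n_processes must be positive")
    return Fraction(grid_side * grid_side, n_processes)


def hosting_regime(ratio: Fraction) -> str:
    """Classify what a single process hosts, given the columns-per-process ratio."""
    if ratio < 1:
        return "fraction of a column"
    if ratio == 1:
        return "one whole column"
    if ratio.denominator == 1:
        return "integer number of columns"
    # Not one of the three layouts listed in the text.
    return "non-integer ratio"


if __name__ == "__main__":
    GRID_SIDE = 4  # hypothetical grid size, for illustration only
    for n_proc in (4, 16, 32):
        ratio = columns_per_process(GRID_SIDE, n_proc)
        print(n_proc, "processes:", ratio, "->", hosting_regime(ratio))
```

In a strong scaling test the grid is fixed and the process count grows, so the ratio can only shrink: the run moves from the computation-bound layouts towards the split-column layout that stresses the interconnect.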
Here is a rundown of the application tasks that miniDPSNN performs and that make it possible to gauge the components of the architecture under test:
Computation: processing of the time step in the dynamical evolution of the neuron.
Memory Management: management of both the axonal spikes organized in time delay queues and the lists of synaptic spikes, all stored in memory.
Communication: transmission along the interconnect system of the axonal spikes to the subset of processes where target neurons exist.
Synchronization: at each time step, the processes deliver the spikes produced by the dynamics according to the internal connectivity supported by the synaptic configuration. This global exchange is currently implemented by means of synchronous MPI collectives; any offset in the time at which different processes reach these waypoints, whether caused by fluctuations in load or by network congestion, results in idling cores and diminished parallelization.
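The cost of the synchronization task can be illustrated with a toy model: if every process must reach the same collective before any can proceed, the duration of a time step is set by the slowest process and all the others idle. This is a simplified sketch with made-up per-process times, not a measurement or a description of the DPSNN internals.

```python
def step_time(compute_times):
    """Duration of one synchronous step: everyone waits for the slowest process."""
    return max(compute_times)


def idle_times(compute_times):
    """Time each process spends waiting at the synchronization point."""
    slowest = max(compute_times)
    return [slowest - t for t in compute_times]


def parallel_efficiency(compute_times):
    """Fraction of the total core time spent doing useful work in one step."""
    slowest = max(compute_times)
    return sum(compute_times) / (len(compute_times) * slowest)


if __name__ == "__main__":
    balanced = [1.0, 1.0, 1.0, 1.0]  # hypothetical per-process times (ms)
    skewed = [1.0, 1.0, 1.0, 2.0]    # one process delayed, e.g. by congestion
    for label, times in (("balanced", balanced), ("skewed", skewed)):
        print(label, step_time(times), idle_times(times), parallel_efficiency(times))
```

With one of four processes taking twice as long, the other three idle for half of the step and the efficiency of that step drops to 5/8.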
| Neurons | 0.18 M | 0.71 M | 2.86 M |
| Synapses | 0.20 G | 0.80 G | 3.20 G |
Table 1 displays results obtained running on a standard HPC cluster based on Intel Xeon processors communicating over an InfiniBand interconnect, as a function of the configuration of the testbed — i.e. grid size, simulated seconds, allocated cores. The distribution of tasks is strongly dependent on the columns-per-core ratio. As already stated, the computation task becomes more demanding when increasing the number of columns per node — which means increasing the total number of neurons. Instead, reducing the columns-per-core ratio generates relatively more communication among processes, moving the focus of the test to the interconnect.
2.1 Analysis of low-power and off-the-shelf architectures in the real-time domain
In this domain, “real-time” denotes a miniDPSNN workpoint at which the execution time (i.e. the wall-clock time of the running application) is not greater than the simulated time. This workpoint is reached through careful configuration of the parameters. Preliminary trials of DPSNN keeping pace with this real-time requirement are reported in this section. This working condition could be useful in the robotics application field.
The testbed is a standard strong scaling test of a columns grid. Figure 2 shows the results of the test obtained simulating 10 s on the Intel-based platform.
Up to cores, the architecture scales well, decreasing the execution time down to seconds. The execution time increases unexpectedly ( seconds) when distributing the problem over 32 cores, thus preventing the achievement of the target workpoint.
Breaking down the time spent in the various tasks, as reported in Figure 2, sheds some light on this behaviour. Communication quickly becomes more demanding when the problem is split over more than 16 processes, dominating the behaviour of the application. As mentioned before, the application stresses the interconnect when the columns-per-core ratio decreases; in the tested configuration, each core manages a whole column or a portion of a column. More than 80% of synapses remain within the column their projecting neuron belongs to. Communication between processes increases when the columns are split among them, clogging the network with an ever-increasing number of small packets. The miniDPSNN highlights this “latency” limitation of the IB interconnect provided by the cluster. In general, COTS interconnects offer adequate throughput when moving large amounts of data, but typically struggle when the communication is latency-dominated. This issue with communication, which manifests here with a number of computing cores that is not large by today’s standards, is similar to that encountered by the parallel cortical simulator C2 [16], targeting a scale in excess of that of the cat cortex, on the Dawn Blue Gene/P supercomputer at LLNL, with 147456 CPUs and 144 TB of main memory. The capability to replicate the behaviour of a supercomputer with a mini-app running on a limited number of 1U servers can be considered proof of its effectiveness.
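Why small packets expose latency rather than throughput can be seen with the usual first-order cost model, in which each message pays a fixed latency plus its size divided by the bandwidth. The sketch below uses that generic model; the latency and bandwidth values are hypothetical placeholders, not figures measured on the IB cluster.

```python
def transfer_time(n_messages, bytes_per_message, latency_s, bandwidth_bytes_per_s):
    """First-order cost of sending n equal messages one after the other."""
    return n_messages * (latency_s + bytes_per_message / bandwidth_bytes_per_s)


def latency_fraction(bytes_per_message, latency_s, bandwidth_bytes_per_s):
    """Share of a single message's transfer time that is pure latency."""
    wire_time = bytes_per_message / bandwidth_bytes_per_s
    return latency_s / (latency_s + wire_time)


if __name__ == "__main__":
    LATENCY_S = 1e-6   # hypothetical per-message latency
    BANDWIDTH = 1e9    # hypothetical bandwidth, bytes per second
    # Same total payload (1 MB), sent as one large message or as many small ones.
    one_large = transfer_time(1, 1_000_000, LATENCY_S, BANDWIDTH)
    many_small = transfer_time(10_000, 100, LATENCY_S, BANDWIDTH)
    print("one large message  :", one_large, "s")
    print("10000 small packets:", many_small, "s")
    print("latency share, 100 B packet:", latency_fraction(100, LATENCY_S, BANDWIDTH))
```

Under these placeholder values the same payload takes roughly ten times longer when fragmented into 100-byte packets, because almost all of each packet's cost is latency.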
Similar results are obtained performing the same test on an ARM-based platform, as shown in Figure 4 and Figure 4, although the analysis is limited by the available number of cores (16). The ARM-based prototype is composed of four nodes, each node consisting of a TEBF0808 Trenz board equipped with a Trenz TE0808 UltraSOM+ module. The Trenz UltraSOM+ consists of a Xilinx Zynq UltraScale+ xczu9eg-ffvc900-1-e-es1 MPSoC and 2 GB of DDR4 memory. The Zynq UltraScale+ MPSoC incorporates both a processing system, composed of a quad-core ARM Cortex-A53, and the programmable logic, which is not used in this test. The four nodes are connected through a 1 Gbps Ethernet-based network.
When the same problem is distributed over an increasing number of processes (cores), the number of transmitted packets increases, as shown in Figure 6, while the payload generated by each process does not vary, as shown in Figure 6; as a result, communication becomes more demanding.
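The trend described here, more packets carrying the same payload per process, follows from simple counting. The sketch below assumes a worst case in which every process sends one message to every other process at each step and splits its payload evenly among them; both are simplifying assumptions made for illustration, since in DPSNN spikes travel only to the processes hosting target neurons.

```python
def exchange_profile(n_processes, payload_per_process_bytes):
    """Message count and size for one all-to-all step with a fixed per-process payload."""
    n_peers = n_processes - 1
    if n_peers < 1:
        raise ValueError("at least two processes are needed for an exchange")
    return {
        "messages": n_processes * n_peers,
        "bytes_per_message": payload_per_process_bytes / n_peers,
    }


if __name__ == "__main__":
    PAYLOAD = 1200.0  # hypothetical bytes produced by each process per step
    for n_proc in (4, 8, 16):
        print(n_proc, exchange_profile(n_proc, PAYLOAD))
```

Under these assumptions the number of messages grows roughly with the square of the process count while each message shrinks, which is the regime where the interconnect latency dominates.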
The characterization of the traffic generated by DPSNN over several off-the-shelf interconnects makes it possible to identify the main requirement for the network device of a future exascale computing system running spiking neural network simulations: the network system should be optimized for the transmission of small packets. In particular, performance is strongly influenced by (i) the design and implementation of a low-latency interconnect architecture, and (ii) the definition of a light and reliable communication protocol guaranteeing high throughput and optimizing the transfers of data packets with payload Bytes.
Finally, a planned re-engineering of the DPSNN foresees a two-level hierarchy enforced via MPI communicators: one auxiliary process (called “broker”) is added per node, and communications are segregated to be only among processes belonging to the same node (i.e. exchanges that go only through intra-node, shared-memory channels) or among brokers (i.e. exchanges that go only through inter-node, remote interfaces). In this way, “local” exchanges among neighbouring neural columns (which, given the biologically plausible topology for the synaptic connectivity, make up the bulk of the exchange) can be confined to the fastest and possibly less congested intra-node channel, while “distal” exchanges are gathered to the broker process of the node, then scattered to the brokers of other nodes, which take care of scattering them to the appropriate recipients.
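The routing rule implied by this two-level hierarchy can be sketched in a few lines. This is an illustrative model only: it assumes that ranks are assigned to nodes in contiguous blocks and it represents each broker by its node index; neither detail comes from the planned DPSNN implementation, and no MPI calls are involved.

```python
def node_of(rank, ranks_per_node):
    """Node hosting a compute rank, assuming contiguous block placement."""
    return rank // ranks_per_node


def route(src_rank, dst_rank, ranks_per_node):
    """Hops taken by a spike message under the broker scheme."""
    src_node = node_of(src_rank, ranks_per_node)
    dst_node = node_of(dst_rank, ranks_per_node)
    if src_node == dst_node:
        # "Local" exchange: stays on the intra-node, shared-memory channel.
        return [("rank", src_rank), ("rank", dst_rank)]
    # "Distal" exchange: gathered by the local broker, sent broker to broker,
    # then scattered by the remote broker to the recipient.
    return [
        ("rank", src_rank),
        ("broker", src_node),
        ("broker", dst_node),
        ("rank", dst_rank),
    ]


if __name__ == "__main__":
    RANKS_PER_NODE = 4  # e.g. one process per core of a quad-core node
    print(route(1, 3, RANKS_PER_NODE))  # both ranks on node 0
    print(route(1, 6, RANKS_PER_NODE))  # node 0 to node 1, via the brokers
```

The point of the design is visible in the hop lists: only broker-to-broker hops cross the inter-node network, so per-node traffic can be aggregated into fewer, larger transfers.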
3 Energy-to-Solution analysis
Instantaneous power, total energy consumption, execution time and energetic cost per synaptic event of a spiking neural network simulator distributed over MPI processes are compared when executed on different generations of low-power and traditional computing architectures, to obtain a (limited) estimate of the trend.
The power and energy consumption reported were obtained simulating 3 s of activity of a network made of 18 M equivalent (internal + external) synapses: the network includes 10 K neurons (Leaky Integrate-and-Fire with Calcium-mediated spike-frequency adaptation), each one projecting an average of 1195 internal synapses and receiving an “external” stimulus, corresponding to 594 equivalent external synapses/neuron. A Poissonian spike train targets external synapses with an average rate of 3 Hz; synaptic plasticity is disabled. In response, the neurons fire trains of spikes at a mean rate of 5.1 Hz.
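The network size and the expected activity follow from the figures quoted above. The sketch below reproduces that arithmetic under one explicit assumption: every spike, internal or external, generates exactly one synaptic event on each synapse it is delivered to.

```python
N_NEURONS = 10_000
INTERNAL_SYN_PER_NEURON = 1195   # average internal synapses projected per neuron
EXTERNAL_SYN_PER_NEURON = 594    # equivalent external synapses per neuron
INTERNAL_RATE_HZ = 5.1           # mean firing rate of the simulated neurons
EXTERNAL_RATE_HZ = 3.0           # mean rate of the Poissonian external stimulus
SIMULATED_S = 3.0


def equivalent_synapses():
    """Total internal plus external synapses in the network."""
    return N_NEURONS * (INTERNAL_SYN_PER_NEURON + EXTERNAL_SYN_PER_NEURON)


def expected_synaptic_events():
    """Expected synaptic events over the simulated time (one event per spike per synapse)."""
    internal = N_NEURONS * INTERNAL_SYN_PER_NEURON * INTERNAL_RATE_HZ * SIMULATED_S
    external = N_NEURONS * EXTERNAL_SYN_PER_NEURON * EXTERNAL_RATE_HZ * SIMULATED_S
    return internal + external


if __name__ == "__main__":
    print("equivalent synapses:", equivalent_synapses())
    print("expected synaptic events:", expected_synaptic_events())
```

The synapse count comes out at about 17.9 million, matching the 18 M quoted in the text, and the estimate of roughly 236 million events is in line with the 235 M synaptic events reported in Section 3.1.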
The power measurement equipment consists of a DC power supply, a high-precision Tektronix DMM4050 digital multimeter for DC current measurements connected to National Instruments data logging software, and a high-precision AC power meter. The AC power of the high-end server node is measured by a Voltech PM300 Power Analyzer upstream of the main server power supply (measuring on the AC cable). For the SoCs, the DC current was instead sampled downstream of the power supply. This difference should not significantly affect the results, given that the factor of the server power supply is close to one.
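For the DC measurements, energy can be obtained from the logged current samples once the supply voltage is known. The following sketch assumes a constant supply voltage and uniformly spaced samples, and integrates with the trapezoidal rule; the voltage, sampling period and current values are hypothetical, and this is not the authors' data-logging procedure.

```python
def energy_joules(current_samples_a, supply_voltage_v, sample_period_s):
    """Energy from uniformly sampled DC current at a constant supply voltage."""
    total = 0.0
    # Trapezoidal rule over consecutive pairs of samples.
    for first, second in zip(current_samples_a, current_samples_a[1:]):
        total += 0.5 * (first + second) * supply_voltage_v * sample_period_s
    return total


def mean_power_watts(current_samples_a, supply_voltage_v):
    """Average power over the logged samples."""
    return supply_voltage_v * sum(current_samples_a) / len(current_samples_a)


if __name__ == "__main__":
    SUPPLY_V = 12.0                # hypothetical DC supply voltage
    PERIOD_S = 1.0                 # hypothetical sampling period
    samples = [1.5, 1.5, 1.5, 1.5]  # hypothetical current readings, in amperes
    print("energy (J):", energy_joules(samples, SUPPLY_V, PERIOD_S))
    print("mean power (W):", mean_power_watts(samples, SUPPLY_V))
```

Four samples span three sampling intervals, so a constant 18 W draw integrates to 54 J in this example.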
3.1 First Generation Benchmark
The traditional computing system (i.e. the “server platform”) is based on a SuperMicro X8DTG-D 1U dual-socket server housing 8 computing cores residing on quad-core Intel Xeon CPUs (Westmere E5620@2.4 GHz in 32 nm CMOS technology). This “server platform” is compared with the “embedded platform”: two NVIDIA Jetson TK1 boards, connected by an Ethernet 100 Mb mini-switch to emulate a dual-socket node, each board equipped with an NVIDIA Tegra K1 chip, i.e. a quad-core ARM Cortex-A15@2.3 GHz in 28 nm CMOS technology.
The “server platform” has 48 GB of DDR3 memory on-board, operating at 1333 MHz — 6 GB per core — while the “embedded platform” only has 2 GB running at 933 MHz — 0.5 GB per core. This makes for a considerable difference in terms of memory bandwidth — 14.9 GB/s for the ARM-based system against the 25.6 GB/s of the Intel-based one — which has an impact on DPSNN and its intensive memory usage, e.g. for delivering spikes to post-synaptic neuron queues.
Partitioning the neural grid onto 8 MPI processes, the simulation of 3 s of activity required 9.1 s on the “server platform” and 30 s on the “embedded platform”, as shown in Figure 10.
Observed currents were A (“server”) and mA (“embedded”), with a 5 mA measurement error. Therefore, the energies required to complete the same task on the two architectures were kJ and J (see Figure 14), while the observed instantaneous power consumptions were W and W (see Figure 12). Note that we did not subtract any “base-line” power (e.g. power consumption after bootstrap), so the estimate is “pessimistic” in the sense that it includes the load of the complete running system.
The simulation produced a total of 235 M synaptic events: the total energetic cost of the simulation can be estimated at 2.2 µJ/synaptic event on the “embedded platform” node and 9.8 µJ/synaptic event on the “server platform”. The “server platform” dual-socket node is faster, taking 3.3 times less time than the “embedded platform” node. However, the “embedded platform” node consumes a total energy 4.4 times lower to complete the simulation task, with an instantaneous power consumption 14.4 times lower than the “server platform” node.
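The ratios quoted in this paragraph can be cross-checked from the per-event costs, the event count and the execution times. The sketch below reads the per-event costs in microjoules, an assumption that is the only scale consistent with the kJ and J totals mentioned for the two platforms; the derived totals are back-of-the-envelope values computed from rounded inputs, not measured figures.

```python
SYNAPTIC_EVENTS = 235e6
COST_UJ_PER_EVENT = {"embedded": 2.2, "server": 9.8}   # read as microjoules
EXECUTION_TIME_S = {"embedded": 30.0, "server": 9.1}


def total_energy_j(platform):
    """Energy-to-solution implied by the per-event cost."""
    return COST_UJ_PER_EVENT[platform] * 1e-6 * SYNAPTIC_EVENTS


def mean_power_w(platform):
    """Average power implied by energy-to-solution and execution time."""
    return total_energy_j(platform) / EXECUTION_TIME_S[platform]


if __name__ == "__main__":
    for name in ("embedded", "server"):
        print(name, total_energy_j(name), "J,", mean_power_w(name), "W")
    print("time ratio  :", EXECUTION_TIME_S["embedded"] / EXECUTION_TIME_S["server"])
    print("energy ratio:", total_energy_j("server") / total_energy_j("embedded"))
    print("power ratio :", mean_power_w("server") / mean_power_w("embedded"))
```

With these inputs the time ratio is about 3.3 and the energy ratio about 4.45, in agreement with the text. The implied power ratio is slightly above the quoted 14.4, a difference attributable to the rounding of the inputs used here.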
The energetic cost of the optimized Compass simulator of the TrueNorth ASIC-based platform, run on an Intel Core i7 CPU (45 nm CMOS process) with 4 cores and 8 threads, is 5.7 µJ/synaptic event, but this figure excludes a significant base-line power consumption. Our measurements show that if we excluded a similar base-line, our power consumption would be reduced approximately by a factor of 4 on the “server platform” and by a factor of 2 on the “embedded platform”.
3.2 Second Generation Comparison
The measurements are extended to the new-generation NVIDIA Jetson TX1 SoC, based on the ARMv8 architecture. Its performance in executing the DPSNN code is compared with that of a coeval mainstream Intel processor architecture, using a hardware/software configuration suited to a direct comparison of time-to-solution and energy-to-solution at the level of the single core. The Jetson TX1 includes four ARM Cortex-A57 cores plus four ARM Cortex-A53 cores in big.LITTLE configuration.
The “server platform” is a Supermicro SuperServer 7048GR-TR with two hexa-core Intel Haswell E5-2620 v3 @2.40 GHz. Four MPI processes are run on either platform, simulating 3 s of the dynamics of a network made of Leaky Integrate-and-Fire with Calcium Adaptation (LIFCA) neurons connected via synapses. Results are shown in Figure 10, Figure 12 and Figure 14. Although the x86 architecture is about faster than the ARM Cortex-A57 core in executing the simulation, the energy it consumes in doing so is higher .
4 Conclusion
The characterization of the network traffic generated by a cortical simulator running on a standard computing system provides information for defining the specification of a custom interconnect. The results obtained with miniDPSNN lead to the specification of a network IP characterized by a low-latency, transfer-optimized architecture, and to the definition of a data transmission protocol providing high throughput even for small data payloads. Finally, ARM processors turned out to be an efficient solution in terms of power consumption: the energy-to-solution obtained running the DPSNN application on the ARM Cortex-A57 based platform is about three times lower than that of the x86 core architecture.
This work has received funding from the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement No. 720270 (HBP SGA1) and under Grant Agreement No. 671553 (ExaNeSt).
[1] Rajovic N et al. 2016 The Mont-Blanc prototype: An alternative approach for HPC systems SC16: International Conference for High Performance Computing, Networking, Storage and Analysis pp 444–455
[2] The Mont-Blanc project accessed: 27/Sep/2017 URL www.montblanc-project.eu
[3] Marazakis M, Goodacre J, Fuin D, Carpenter P, Thomson J, Matus E, Bruno A, Stenstrom P, Martin J, Durand Y and Dor I 2016 EUROSERVER: Share-anything scale-out micro-server design 2016 Design, Automation Test in Europe Conference Exhibition (DATE) pp 678–683
[4] Merolla P A et al. 2014 Science 345 668–673 ISSN 0036-8075
[5] Stromatias E, Galluppi F, Patterson C and Furber S 2013 Power analysis of large-scale, real-time neural networks on SpiNNaker The 2013 International Joint Conference on Neural Networks (IJCNN) pp 1–8 ISSN 2161-4393
[6] Gewaltig M O and Diesmann M 2007 Scholarpedia 2 1430
[7] Modha D S, Ananthanarayanan R, Esser S K, Ndirango A, Sherbondy A J and Singh R 2011 Commun. ACM 54 62–71 ISSN 0001-0782 URL http://doi.acm.org/10.1145/1978542.1978559
[8] Paolucci P S, Ammendola R, Biagioni A, Frezza O, Lo Cicero F, Lonardo A, Pastorelli E, Simula F, Tosoratto L and Vicini P 2013 arXiv:1310.8478 http://arxiv.org/abs/1310.8478
[9] Paolucci P S et al. 2016 Journal of Systems Architecture 69 29–53 ISSN 1383-7621 URL http://www.sciencedirect.com/science/article/pii/S1383762115001423
[10] Katevenis M et al. 2016 The ExaNeSt Project: Interconnects, Storage, and Packaging for Exascale Systems 2016 Euromicro Conference on Digital System Design (DSD) pp 60–67
[11] Ruiz-Mejias M, Ciria-Suarez L, Mattia M and Sanchez-Vives M V 2011 Journal of Neurophysiology 106 2910–2921 ISSN 0022-3077
[12] Stroh A, Adelsberger H, Groh A, Rühlmann C, Fischer S, Schierloh A, Deisseroth K and Konnerth A 2013 Neuron 77 1136–1150 ISSN 0896-6273 URL http://www.sciencedirect.com/science/article/pii/S0896627313000974
[13] Heroux M A, Doerfler D W, Crozier P S, Willenbring J M, Edwards H C, Williams A, Rajan M, Keiter E R, Thornquist H K and Numrich R W 2009 Improving Performance via Mini-applications Tech. Rep. SAND2009-5574 Sandia National Laboratories
[14] Cordery M J, Austin B, Wassermann H J, Daley C S, Wright N J, Hammond S D and Doerfler D 2014 Analysis of Cray XC30 Performance Using Trinity-NERSC-8 Benchmarks and Comparison with Cray XE6 and IBM BG/Q (Cham: Springer International Publishing) pp 52–72 ISBN 978-3-319-10214-6 URL https://doi.org/10.1007/978-3-319-10214-6-3
[15] Fiber miniapp suite accessed: 10/Oct/2017 URL http://fiber-miniapp.github.io/
[16] Ananthanarayanan R, Esser S K, Simon H D and Modha D S 2009 The cat is out of the bag: cortical simulations with $10^9$ neurons, $10^{13}$ synapses Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis pp 1–12 ISSN 2167-4329
[17] Cesini D et al. 2017 Scientific Programming 2017 14 article ID 7206595