Architectural improvements and 28 nm FPGA implementation of the APEnet+ 3D Torus network for hybrid HPC systems
Modern Graphics Processing Units (GPUs) are now considered accelerators for general purpose computation. A tight interaction between the GPU and the interconnection network is the strategy to express the full capability-computing potential of a multi-GPU system on large HPC clusters; this is why an efficient and scalable interconnect is a key technology to finally deliver GPUs for scientific HPC. In this paper we show the latest architectural and performance improvements of the APEnet+ network fabric, an FPGA-based PCIe board with 6 fully bidirectional off-board links with 34 Gbps of raw bandwidth per direction, and X8 Gen2 bandwidth towards the host PC. The board implements a Remote Direct Memory Access (RDMA) protocol that leverages the peer-to-peer (P2P) capabilities of Fermi- and Kepler-class NVIDIA GPUs to obtain real zero-copy, low-latency GPU-to-GPU transfers. Finally, we report on the development activities for 2013, focusing on the adoption of the latest generation 28 nm FPGAs and the preliminary tests performed on this new platform.
We present a status update of APEnet+, the high-performance, low-latency custom interconnect system developed at INFN targeting hybrid CPU-GPU-based HPC platforms. The APEnet+ hardware, a PCIe X8 Gen2 card described in , allows building a 3D toroidal mesh topology of computing nodes.
Moreover, we implemented NVIDIA GPUDirect V1.0 and V2.0 to directly access data on Fermi and Kepler GPUs. In this way, real zero-copy inter-node GPU-to-host, host-to-GPU and GPU-to-GPU transactions can be issued using a Remote DMA programming paradigm.
At the moment, APEnet+ is able to outperform commercial solutions (like InfiniBand) for small-to-medium message sizes when using GPU peer-to-peer. For large message sizes, host memory staging techniques still win, partly because of the higher bandwidth of the latest commercial cards, which are already Gen3-enabled and guarantee 56 Gbps on the link. Our mid-term focus is therefore on upgrading the APEnet+ hardware in order to keep pace with advances in technology standards.
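To make the 3D toroidal topology concrete, the following minimal sketch computes the six first neighbours of a node, one per bidirectional off-board link. The function name, the coordinate convention and the lattice size are illustrative assumptions and are not taken from the APEnet+ software.

```python
def torus_neighbours(coord, dims):
    """Return the six first neighbours of `coord` in a 3D torus of size `dims`.

    Hypothetical helper: coordinates wrap around on each axis, so every node
    has exactly two neighbours per axis (X+, X-, Y+, Y-, Z+, Z-).
    """
    neighbours = []
    for axis in range(3):
        for step in (+1, -1):
            n = list(coord)
            # Modulo arithmetic implements the wrap-around of the torus.
            n[axis] = (n[axis] + step) % dims[axis]
            neighbours.append(tuple(n))
    return neighbours


if __name__ == "__main__":
    # Assumed 4x4x4 lattice, chosen only for illustration.
    for neighbour in torus_neighbours((0, 0, 0), (4, 4, 4)):
        print(neighbour)
```

On a lattice where one dimension has size 2, the two neighbours along that axis coincide, which is why the sketch returns a list rather than a set.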
2 Architectural Improvements
In the short term, we focused on improving our internal architecture. Three major reworkings have been undertaken, regarding the PCIe interface, the on-board memory management and the off-board interface.
2.1 PCIe Interface
On the PCIe side, we noticed that the effective bandwidth on data transactions was quite low compared to the theoretical one (). This is due to the time elapsed between issuing a request on the PCIe bus and its completion; this time is system dependent and can be very large. In order to optimize performance (in addition to tuning parameters such as the maximum payload size), the card must be able to manage more than one outstanding request on the PCIe bus. In this way, multiple transactions can overlap and the total transaction time is shorter.
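The benefit of overlapping requests can be illustrated with a simple Little's-law style estimate: the bytes in flight divided by the request-to-completion time, capped by the link rate. This is a sketch under stated assumptions, not the model used by the authors; the payload size, the completion latency and the 4 GB/s cap (PCIe Gen2 x8 after 8b/10b encoding) are illustrative values.

```python
def effective_bandwidth(payload_bytes, latency_s, outstanding, link_bytes_per_s):
    """Estimate read bandwidth with `outstanding` overlapping requests.

    Hypothetical model: each request moves `payload_bytes` and completes
    after `latency_s`; overlapping requests multiply the bytes in flight
    until the link rate becomes the limit.
    """
    in_flight = payload_bytes * outstanding
    return min(in_flight / latency_s, link_bytes_per_s)


if __name__ == "__main__":
    LINK = 4e9        # assumed usable PCIe Gen2 x8 rate, bytes/s
    PAYLOAD = 512     # assumed request size, bytes
    LATENCY = 1e-6    # assumed request-to-completion time, seconds
    for n in (1, 2, 4, 8):
        bw = effective_bandwidth(PAYLOAD, LATENCY, n, LINK)
        print(f"{n} outstanding request(s): {bw / 1e9:.2f} GB/s")
```

With these assumed numbers a single outstanding request leaves most of the link idle, while a handful of overlapping requests is enough to reach the cap.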
2.2 Memory management
We also focused on removing a known bottleneck on the receiving path, where virtual-to-physical address translation is necessary in order to dispatch data payloads to the correct physical memory areas (whether in host memory or GPU memory). On APEnet+ this task was initially executed by the Nios II embedded processor, but its impact on the resulting execution time was higher than expected.
Thus, a novel implementation of a Translation Look-Aside Buffer (TLB) has been developed on the FPGA, to accelerate virtual-to-physical address translation at hardware level. As shown in Fig. 2, the TLB block can store a limited amount of page entries and, in case of page hit, the Nios II processor is completely bypassed. A speedup of up to 60% in bandwidth on synthetic benchmarks has been measured with this enhancement.
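The hit/miss behaviour described above can be sketched in software as a small cache placed in front of a slow translation routine. This is only an illustration: the class name, the 4 KB page size, the least-recently-used replacement policy and the dictionary standing in for the Nios II page-table walk are all assumptions, since the text does not specify how the hardware TLB is organised.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class TinyTLB:
    """Hypothetical software model of a small TLB with LRU replacement."""

    def __init__(self, capacity, slow_lookup):
        self.capacity = capacity          # number of page entries (>= 1)
        self.slow_lookup = slow_lookup    # stands in for the Nios II path
        self.entries = OrderedDict()      # virtual page -> physical page
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage in self.entries:
            # Page hit: the slow path is bypassed entirely.
            self.hits += 1
            self.entries.move_to_end(vpage)
            ppage = self.entries[vpage]
        else:
            # Page miss: fall back to the slow path and cache the result.
            self.misses += 1
            ppage = self.slow_lookup(vpage)
            self.entries[vpage] = ppage
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return ppage * PAGE_SIZE + offset


if __name__ == "__main__":
    page_table = {0: 10, 1: 11, 2: 12}    # illustrative mapping
    tlb = TinyTLB(capacity=2, slow_lookup=page_table.__getitem__)
    for addr in (5, 4097, 7, 8192, 4096):
        print(hex(addr), "->", hex(tlb.translate(addr)))
    print("hits:", tlb.hits, "misses:", tlb.misses)
```

The limited capacity matters: once more pages are in use than there are entries, some translations fall back to the slow path again.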
2.3 Off-board Interface
Signal integrity of the transmission system was analyzed in order to push the embedded transceiver operating frequency to its limits. Currently, for reliable operation and for validation of upper-level firmware and software, the Altera transceivers are set at 7.0 Gbps, yielding a raw aggregated bandwidth of 28 Gbps per APElink channel. To estimate the efficiency of the APElink Transmission Control Logic, which manages the data flow by encapsulating packets into a light, low-level, word-stuffing protocol, we devised a mathematical model; the current implementation yields a total efficiency of over a channel able to sustain 2.6 GB/s bandwidth with a memory footprint limited to 40 KB per channel.
3 Update of Performance Tests
The described architectural improvements yielded significant performance gains with respect to our previously published results on tests of latency and bandwidth.
In Fig. 2(a) we show the round trip latency. All host-bound and GPU-bound combinations are presented; the plots clearly show that involving the GPU in the transaction, either as sender or receiver, causes roughly a 30% latency increase for small message sizes.
In Fig. 2(b) we show the advantage of the APEnet+ P2P technique over InfiniBand, used with the MVAPICH software stack, for message sizes up to 128 KB; we measure in GPU-to-GPU latency when using P2P; is measured when P2P is not used; is measured with InfiniBand, on the same platform.
In Fig. 2(c) we show the results of several bandwidth tests; apart from the GPU-outbound cases (where show that GPU memory read transactions incur a bottleneck within the GPU itself), in all other transactions (CPU memory read, GPU and CPU memory write) we can reach the APEnet+ link limit, which is on current hardware.
4 Studies on Fault Awareness
Fault awareness is the first step when applying Fault Tolerance techniques in HPC (e.g. task migration, checkpoint/restart, …). On the QUonG platform, thanks to some APEnet+ hardware features, each node is able to be aware of faults and critical events occurring to its components and to components of its neighbouring nodes.
Even in case of multiple faults, no area of the mesh can be isolated and no fault can remain undetected at global level. At the core of this approach, named LO|FA|MO (LOcal FAult MOnitor), are a lightweight mutual watchdog protocol between the host node and APEnet+, and the 3D network topology. The APEnet+ core contains a LO|FA|MO hardware component and a set of LO|FA|MO watchdog registers, containing information about the host status, the APEnet+ status and the status of the first neighbouring hosts. On each host in the platform, a dedicated LO|FA|MO software component periodically updates the Host Watchdog Register and reads the APEnet Watchdog Register.
Fig. 5 depicts how Global Fault Awareness is obtained, for example, when a host node stops working: as the faulty host fails to update its watchdog register, the APEnet+ LO|FA|MO hardware on the same node becomes aware of the fault and sends diagnostic messages via the 3D network to its own neighbours. These hosts can retrieve data about faults occurring on the neighbouring nodes from the watchdog registers and can report them to a Master node (e.g. via a service network). In this way, the Master node has the global picture of the platform health status and can take decisions about proper countermeasures.
Note that the time elapsed from fault occurrence to global fault awareness is dominated by the watchdog period: for a , . In the time range of interest for HPC (watchdog period ), the addition of LO|FA|MO features has no impact on APEnet+ data transfer latency, as the diagnostic messages are hidden in the communication protocol.
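The fault-awareness flow described in this section can be mimicked with a toy simulation. It is a sketch under strong simplifications: the nodes sit on a 1D ring rather than a 3D torus, the watchdog register is a plain counter, and all class, function and attribute names are hypothetical rather than taken from the LO|FA|MO implementation.

```python
class Node:
    """Hypothetical node: a host plus its network card watchdog logic."""

    def __init__(self, name):
        self.name = name
        self.host_alive = True
        self.host_watchdog = 0         # counter updated by the host software
        self.last_seen = 0             # last value observed by the card
        self.neighbour_faults = set()  # filled by diagnostic messages
        self.neighbours = []

    def host_tick(self):
        # A healthy host refreshes its watchdog register once per period.
        if self.host_alive:
            self.host_watchdog += 1

    def card_tick(self):
        # The card notices a stalled counter and warns its neighbours.
        if self.host_watchdog == self.last_seen:
            for n in self.neighbours:
                n.neighbour_faults.add(self.name)
        self.last_seen = self.host_watchdog

    def host_report(self, master_view):
        # Healthy hosts forward what they know to the Master node.
        if self.host_alive:
            master_view.update(self.neighbour_faults)


def make_ring(names):
    nodes = [Node(n) for n in names]
    for i, node in enumerate(nodes):
        node.neighbours = [nodes[i - 1], nodes[(i + 1) % len(nodes)]]
    return nodes


def run_period(nodes, master_view):
    for n in nodes:
        n.host_tick()
    for n in nodes:
        n.card_tick()
    for n in nodes:
        n.host_report(master_view)


if __name__ == "__main__":
    nodes, master_view = make_ring(["A", "B", "C", "D"]), set()
    run_period(nodes, master_view)
    print("faults known to master:", master_view)
    nodes[1].host_alive = False        # host B stops working
    run_period(nodes, master_view)
    print("faults known to master:", master_view)
```

In this toy model the fault becomes globally known within one watchdog period after the host stops, in line with the observation that the watchdog period dominates the detection time.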
5 APEnet+ deployment: QUonG and other platforms
The largest and most significant deployment of APEnet+ cards is within the QUonG cluster, our hybrid, x86_64 dual-GPU cluster with an APEnet+ 3D-torus network topology; it is used for testing, development and production runs of scientific codes. Several on-going projects are developing applications that can fully exploit our peculiar interconnect solution, with promising results. Among these, we mention:
- the simulation of polychronous spiking neural networks;
- a Breadth-First Search algorithm implementation for graph traversal;
- a benchmark based on the 3D Heisenberg spin glass model using the over-relaxation algorithm.
6 Designing next generation board
Newer FPGA families (e.g. Altera Stratix V) are now available on the market, driving a redesign of APEnet+ in two major hardware logic areas: Gen3 migration for the PCIe interface and new transceivers for increased off-board link speed.
PCIe Gen3 migration allows an increase in bandwidth for the host interface. It is based on lanes using a 128/130 bit encoding (thus the encoding overhead is reduced to about 1.5% from 20% for previous generations). The total raw bandwidth that can be obtained with a interface is . To support this data rate, the back-end data-path must be 256 bits wide, with a reference clock of 250 MHz. The standard used is AXI4, which required a redesign of the APEnet+ internal PCIe interface, as depicted in Fig. 5. The AXI4 migration also prepares for the future use of embedded ARM hard IP processors, foreseen only on high-end FPGAs such as the future Stratix 10 devices.
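The figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below compares the line-encoding overhead of 8b/10b (used by earlier PCIe generations) with 128b/130b, and checks that a 256-bit data-path clocked at 250 MHz matches an x8 link at the PCIe Gen3 rate of 8 GT/s per lane. The function names are illustrative, and the x8 width is assumed from the form factor of the current card.

```python
def encoding_overhead(payload_bits, total_bits):
    """Fraction of transmitted bits spent on line encoding."""
    return 1 - payload_bits / total_bits


def datapath_gbps(width_bits, clock_mhz):
    """Throughput of a parallel data-path in Gb/s."""
    return width_bits * clock_mhz / 1000


def link_raw_gbps(lanes, gt_per_s):
    """Raw rate of a multi-lane serial link in Gb/s."""
    return lanes * gt_per_s


if __name__ == "__main__":
    print(f"8b/10b overhead:    {encoding_overhead(8, 10):.1%}")
    print(f"128b/130b overhead: {encoding_overhead(128, 130):.2%}")
    # Assumed x8 link at the Gen3 per-lane rate of 8 GT/s.
    print(f"x8 Gen3 raw rate:   {link_raw_gbps(8, 8.0):.0f} Gb/s")
    print(f"256 bit @ 250 MHz:  {datapath_gbps(256, 250):.0f} Gb/s")
```

The 128b/130b overhead evaluates to roughly 1.5%, and the data-path throughput equals the raw link rate, which is why a narrower or slower back-end would become the bottleneck.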
New Altera devices provide transceivers capable of 14.1 Gb/s, which can be bonded in 4 lanes to build up a 56 Gb/s off-board link. As a physical medium we can rely on the QSFP+ standard, which has been upgraded to work at these data rates (the same as InfiniBand FDR).
In order to develop PCIe Gen3 migration and 56 Gb/s class off-board links, we used an Altera development board with a Stratix V GX FPGA. We implemented the link using the single 40 Gb/s QSFP+ onboard connector and performed data transfer tests between 2 such boards. As a preliminary result we achieved a link speed of 11.3 Gbps/lane (45.2 Gbps/channel), still using 40 Gb/s-certified cables.
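The per-channel rates quoted in this paper all follow from bonding 4 transceiver lanes. The short sketch below reproduces that arithmetic for the three lane rates mentioned in the text; the function and constant names are illustrative.

```python
LANES_PER_CHANNEL = 4  # lanes bonded into one off-board channel


def channel_gbps(lane_gbps, lanes=LANES_PER_CHANNEL):
    """Raw aggregated bandwidth of one channel in Gb/s."""
    return lane_gbps * lanes


if __name__ == "__main__":
    configs = {
        "current operating point": 7.0,
        "Stratix V transceiver limit": 14.1,
        "preliminary test result": 11.3,
    }
    for label, lane_rate in configs.items():
        print(f"{label}: {lane_rate} Gb/s/lane -> "
              f"{channel_gbps(lane_rate):.1f} Gb/s/channel")
```

The preliminary 45.2 Gb/s result therefore corresponds to about 80% of the 56 Gb/s class target, obtained with cables certified for only 40 Gb/s.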
We presented a status update on the development of our custom interconnect system APEnet+. Several architectural improvements have been discussed, which led to substantial performance enhancements. We also introduced the fault-aware capability now embedded in the APEnet+ communication protocol, which can be used in advanced high-level fault tolerance techniques. Finally, we reported on the latest developments on 28 nm FPGA devices, which allow us to upgrade the PCIe interface to Gen3 and the off-board link to a 56 Gb/s data rate.
This work was partially supported by the EU Framework Programme 7 project EURETILE under grant number 247846; Roberto Ammendola was supported by MIUR (Italy) through INFN SUMA project.
- Ammendola R et al. 2012 Journal of Physics: Conference Series 396 042059 URL http://stacks.iop.org/1742-6596/396/i=4/a=042059
- NVIDIA GPUDirect technology https://developer.nvidia.com/gpudirect
- Bureddy D, Wang H, Venkatesh A, Potluri S and Panda D 2012 Recent Advances in the Message Passing Interface (Lecture Notes in Computer Science vol 7490) ed Träff J L, Benkner S and Dongarra J J (Springer Berlin Heidelberg) pp 110–120 ISBN 978-3-642-33517-4 http://dx.doi.org/10.1007/978-3-642-33518-1_16
- Ammendola R et al. 2013 Track on Interconnect architectures for reconfigurable computing systems, held at Reconfigurable Computing and FPGAs (ReConFig), 2013 International Conference on to be published
- Ammendola R et al. 2013 Field-Programmable Technology (FPT), 2013 International Conference on to be published
- Ammendola R et al. 2013 JINST, Journal of Instrumentation, Proceedings of Topical Workshop on Electronics for Particle Physics (TWEPP) 2013 (IOP Publishing) to be published
- Ammendola R et al. 2013 arXiv:1307.0433 http://arxiv.org/abs/1307.0433
- Paolucci P et al. 2013 arXiv:1310.8478 http://arxiv.org/abs/1310.8478
- Bisson M, Bernaschi M, Mastrostefano E and Rossetti D Breadth first search on APEnet+ http://cass-mt.pnnl.gov/docs/Session2-1.pdf iA3 Workshop on Irregular Applications: Architectures and Algorithms, in conjunction with Super Computing 2012
- Bernaschi M, Bisson M and Rossetti D 2013 Journal of Parallel and Distributed Computing 73 250–255 ISSN 0743-7315 http://www.sciencedirect.com/science/article/pii/S0743731512002213
- Lonardo A et al. 2013 JINST, Journal of Instrumentation, Proceedings of Topical Workshop on Electronics for Particle Physics (TWEPP) 2013 (IOP Publishing) to be published
- Amerio S et al. 2012 Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC), 2012 IEEE pp 1806–1811 ISSN 1082-3654
- Altera Qsys Interconnect, Quartus II 13.0 Handbook, Volume 1 http://www.altera.com/literature/hb/qts/qsys_interconnect.pdf
- Altera Stratix V GX Development Board http://www.altera.com/products/devkits/altera/kit-sv-gx-host.html