Architectural improvements and 28 nm FPGA implementation of the APEnet+ 3D Torus network for hybrid HPC systems

Roberto Ammendola    Andrea Biagioni    Ottorino Frezza    Francesca Lo Cicero    Pier Stanislao Paolucci    Alessandro Lonardo    Davide Rossetti    Francesco Simula    Laura Tosoratto    Piero Vicini INFN Sezione Roma Tor Vergata INFN Sezione Roma

Modern Graphics Processing Units (GPUs) are now considered accelerators for general purpose computation. A tight interaction between the GPU and the interconnection network is the strategy to exploit the full capability-computing potential of a multi-GPU system on large HPC clusters; for this reason, an efficient and scalable interconnect is a key technology to finally deliver GPUs for scientific HPC. In this paper we show the latest architectural and performance improvements of the APEnet+ network fabric, an FPGA-based PCIe board with 6 fully bidirectional off-board links with 34 Gbps of raw bandwidth per direction, and X8 Gen2 bandwidth towards the host PC. The board implements a Remote Direct Memory Access (RDMA) protocol that leverages the peer-to-peer (P2P) capabilities of Fermi- and Kepler-class NVIDIA GPUs to obtain real zero-copy, low-latency GPU-to-GPU transfers. Finally, we report on the development activities for 2013, focusing on the adoption of the latest generation 28 nm FPGAs and the preliminary tests performed on this new platform.

1 Introduction

We present a status update of APEnet+, which is the high performance, low latency custom interconnect system developed at INFN targeting hybrid CPU-GPU-based HPC platforms. The APEnet+ hardware, a PCIe X8 Gen2 card described in [1], allows building a 3D toroidal mesh topology of computing nodes.
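As an illustration of the topology, the six first neighbours of a node in a 3D torus can be found with wrap-around arithmetic on each axis. The sketch below is only a minimal example: the function name and the (x, y, z) coordinate convention are assumptions made here, not part of the APEnet+ software.

```python
def torus_neighbours(coord, dims):
    """Return the six first neighbours of `coord` in a 3D torus of size `dims`.

    Coordinates wrap around on each axis, so a node on the border of the
    mesh is still connected in the +/-X, +/-Y and +/-Z directions.
    """
    neighbours = []
    for axis in range(3):
        for step in (+1, -1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]  # wrap-around link
            neighbours.append(tuple(n))
    return neighbours


# Hypothetical 4x4x4 torus: the corner node still has six neighbours.
corner = torus_neighbours((0, 0, 0), (4, 4, 4))
```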

Moreover, we implemented NVIDIA GPUDirect V1.0 and V2.0 [2] to directly access data on Fermi and Kepler GPUs. In this way, real zero-copy inter-node GPU-to-host, host-to-GPU and GPU-to-GPU transactions can be issued using a Remote DMA programming paradigm.

At the moment, APEnet+ is able to outperform commercial solutions (like InfiniBand [3]) for small-to-medium message sizes when using GPU peer-to-peer. For large message sizes, host memory staging techniques still win, partly because of the higher bandwidth of the latest commercial cards, which are already Gen3-enabled and guarantee 56 Gbps on the link. Our mid-term focus is therefore on upgrading the APEnet+ hardware in order to keep pace with the advances of technology standards.

2 Architectural Improvements

In the short-term period, we focused on the improvement of our internal architecture. Three major reworkings have been undertaken regarding the PCIe interface, the on-board memory management and the off-board interface.

2.1 PCIe Interface

On the PCIe side, we noticed that the effective bandwidth on data transactions was quite low compared to the theoretical one (). This is due to the time elapsed between issuing a request on the PCIe bus and its completion; this time is system dependent and can be very large. In order to optimize performance (in addition to parameter tuning, such as the maximum payload size), the card must be able to manage more than one outstanding request on the PCIe bus. In this way, multiple transactions can overlap and the total transaction time is shorter.

At the hardware level, this required implementing two concurrent DMA engines fed by a prefetchable command queue. The difference between a single DMA and a double DMA implementation can be seen in Fig. 1. We estimated an efficiency gain of up to 40% in time [4].
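The benefit of overlapping requests can be illustrated with a toy timing model. This is only a sketch under stated assumptions: each request waits a fixed completion latency and then occupies a shared data bus for a fixed time, and requests are assigned to engines round-robin. The function name and the numeric values are hypothetical and are not measurements from the APEnet+ board.

```python
def total_transfer_time(n_requests, latency, data_time, outstanding):
    """Toy model: total time to complete `n_requests` read requests.

    latency     -- time between issuing a request and its data being ready
    data_time   -- time the data occupies the (shared) bus
    outstanding -- number of requests that may be in flight at once
    """
    engine_free = [0.0] * outstanding  # when each DMA engine can issue again
    bus_free = 0.0                     # when the data bus becomes idle
    for i in range(n_requests):
        engine = i % outstanding                    # round-robin assignment
        issue = engine_free[engine]
        start = max(issue + latency, bus_free)      # data waits for the bus
        bus_free = start + data_time
        engine_free[engine] = bus_free
    return bus_free


# Arbitrary time units, chosen only for illustration.
single = total_transfer_time(4, latency=2, data_time=3, outstanding=1)
double = total_transfer_time(4, latency=2, data_time=3, outstanding=2)
```

In this model the second engine hides part of the request latency behind the data phase of the first, so the total time drops; the size of the gain depends on the ratio between latency and data time.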

2.2 Memory management

We also focused on removing a known bottleneck on the receiving path, when virtual to physical address translation is necessary in order to dispatch data payloads in the correct physical memory areas (whether they be on host memory or GPU memory). On APEnet+ this task was initially executed by the Nios II embedded processor but the impact on the resulting execution time was higher than expected.

Thus, a novel implementation of a Translation Look-Aside Buffer (TLB) has been developed on the FPGA, to accelerate virtual-to-physical address translation at hardware level [5]. As shown in Fig. 2, the TLB block can store a limited amount of page entries and, in case of page hit, the Nios II processor is completely bypassed. A speedup of up to 60% in bandwidth on synthetic benchmarks has been measured with this enhancement.
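The hit/miss behaviour described above can be sketched in software as follows. This is a minimal model, not the FPGA design: the 4 KB page size, the least-recently-used replacement policy, the class name `TinyTLB` and the `slow_lookup` callback (which stands in for the Nios II firmware path) are all assumptions made for the example.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size, for illustration only


class TinyTLB:
    """Small cache of virtual-page -> physical-page translations."""

    def __init__(self, capacity, slow_lookup):
        assert capacity >= 1
        self.capacity = capacity
        self.slow_lookup = slow_lookup  # stand-in for the Nios II path
        self.entries = OrderedDict()    # ordered from least to most recent
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage in self.entries:
            # Page hit: the slow path is bypassed entirely.
            self.hits += 1
            self.entries.move_to_end(vpage)
        else:
            # Page miss: fall back to the slow path and cache the result.
            self.misses += 1
            self.entries[vpage] = self.slow_lookup(vpage)
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return self.entries[vpage] * PAGE_SIZE + offset


# Hypothetical page table held by the slow path.
page_table = {0: 10, 1: 20, 2: 30}
tlb = TinyTLB(capacity=2, slow_lookup=page_table.__getitem__)
first = tlb.translate(5)          # miss on page 0
second = tlb.translate(4096 + 1)  # miss on page 1
third = tlb.translate(7)          # hit on page 0
fourth = tlb.translate(2 * 4096)  # miss on page 2, evicts page 1
```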

Figure 1: Doubling the number of transaction requests on the PCIe bus allows an efficiency gain in multiple data transactions (40% reduction in total duration).
Figure 2: The TLB block performs virtual to physical address translation for HW cached pages.

2.3 Off-board Interface

Signal integrity of the transmission system was analyzed in order to push the embedded transceiver operating frequency to its limits. Currently, for reliable operation and for upper-level firmware and software validation, the Altera transceivers are set at 7.0 Gbps, yielding a raw aggregate bandwidth of 28 Gbps per APElink [6] channel. To estimate the efficiency of the APElink Transmission Control Logic, which manages the data flow by encapsulating packets into a light, low-level, word-stuffing protocol, we devised a mathematical model; the current implementation yields a total efficiency of over a channel able to sustain  2.6 GB/s bandwidth with a memory footprint limited to  40 KB per channel.

3 Update of Performance Tests

The described architectural improvements yielded significant performance gains with respect to our previously published results on tests of latency and bandwidth.

Figure 3: Plots for GPU-to-GPU transfer latency of APEnet+ vs. InfiniBand with MVAPICH SW stack, Roundtrip latency and Bandwidth.

In Fig. 3(a) we show the round-trip latency. All host-bound and GPU-bound combinations are presented; the plots clearly show that involving the GPU in the transaction, either as sender or as receiver, causes roughly a 30% latency increase for small message sizes.

In Fig. 3(b) we show the advantage of the APEnet+ P2P technique over InfiniBand, used with the MVAPICH SW stack, for message sizes up to 128 KB; we measure in GPU-to-GPU latency when using P2P; is measured when P2P is not used; is measured with InfiniBand, on the same platform.

In Fig. 3(c) we show the results of several bandwidth tests. The GPU-outbound cases show that GPU memory read transactions incur a bottleneck within the GPU itself; in all other transactions (CPU memory read, GPU and CPU memory write) we can reach the APEnet+ link limit, which is on current hardware.

4 Studies on Fault Awareness

Fault awareness is the first step when applying Fault Tolerance techniques in HPC (e.g. task migration, checkpoint/restart, …). On the QUonG platform, thanks to some APEnet+ hardware features, each node is able to be aware of faults and critical events occurring to its components and to components of its neighbouring nodes.

Even in case of multiple faults no area of the mesh can be isolated and no fault can remain undetected at global level. At the core of this approach, named LO|FA|MO (LOcal FAult MOnitor), there is a lightweight mutual watchdog protocol between the host node and APEnet+ and the 3D network topology [7]. The APEnet+ core contains a LO|FA|MO hardware component and a set of LO|FA|MO watchdog registers, containing information about the host status, the APEnet+ status and the status of first neighbouring hosts. On each host in the platform a dedicated LO|FA|MO software component is able to periodically update the Host Watchdog Register and read the APEnet Watchdog Register.

Fig. 4 depicts how global fault awareness is obtained, for example, when a host node stops working: as the faulty host fails to update its watchdog register, the APEnet+ LO|FA|MO hardware on the same node becomes aware of the fault and sends diagnostic messages via the 3D network towards its own neighbours. These hosts can retrieve data about faults occurring on the neighbour nodes from the watchdog registers and can report them (e.g. via a service network) to a Master node. In this way, the Master node has the global picture of the platform health status and can take decisions about proper countermeasures.

Note that time elapsed since fault occurrence to global fault awareness is dominated by the watchdog period: for a , . In the time range of interest for HPC (watchdog period ), the addition of LO|FA|MO features has no impact on APEnet+ data transfer latency, as the diagnostic messages are hidden in the communication protocol.
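The mutual watchdog mechanism described in this section can be sketched as a small software model. It is only an illustration under stated assumptions: the class and attribute names are hypothetical, time is an abstract counter, and only the host-to-card direction of the protocol is modelled (the real LO|FA|MO logic lives in the APEnet+ hardware and in a host software component).

```python
class WatchdogRegister:
    """Hypothetical register: the writer stores a timestamp on each update."""

    def __init__(self):
        self.last_update = None

    def update(self, now):
        self.last_update = now

    def is_alive(self, now, period):
        # The writer is considered faulty once it misses a full period.
        return self.last_update is not None and now - self.last_update <= period


class Node:
    def __init__(self, name):
        self.name = name
        self.host_wd = WatchdogRegister()  # written by the host, read by the card
        self.neighbours = []               # first neighbours on the 3D torus
        self.neighbour_faults = set()      # faults reported by neighbouring cards

    def card_check(self, now, period):
        """Card side: on a stale host watchdog, notify the neighbours."""
        if not self.host_wd.is_alive(now, period):
            for n in self.neighbours:
                n.neighbour_faults.add(self.name)


# Two-node example: host "a" stops updating its watchdog after t = 0.
a, b = Node("a"), Node("b")
a.neighbours, b.neighbours = [b], [a]
a.host_wd.update(0)
b.host_wd.update(0)
b.host_wd.update(2)
for node in (a, b):
    node.card_check(now=3, period=2)

# A master node would collect the per-host fault sets, e.g. over a service network.
master_view = a.neighbour_faults | b.neighbour_faults
```

In this model the detection delay is bounded by the watchdog period, which is consistent with the statement that the period dominates the time to global awareness.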

5 APEnet+ deployment: QUonG and other platforms

The largest and most significant deployment of APEnet+ cards is within the QUonG cluster, which is our hybrid, x86_64 dual GPU cluster with an APEnet+ 3D-torus network topology; it is used for testing, development and production runs of scientific codes. Several ongoing projects are developing applications that can fully exploit our peculiar interconnect solution, with promising results. Among these, we mention:

  • the simulation of polychronous spiking neural networks [8];

  • a Breadth-First Search algorithm implementation for graph traversal [9];

  • a benchmark based on the 3D Heisenberg spin glass model, using the over-relaxation algorithm [10].

Furthermore, in the context of HEP experiments we are testing designs derived from the APEnet+ for real-time GPU stream processing [11] and online track reconstruction with GPUs [12].

6 Designing next generation board

Newer FPGA families — e.g. Altera Stratix V — are now available on the market, driving redesign of APEnet+ in two major hardware logic areas: Gen3 migration for PCIe interface and new transceivers for increased off-board link speed.

PCIe Gen3 migration allows an increase in bandwidth for the host interface. It is based on lanes using a 128b/130b encoding (thus the encoding overhead is reduced to about 1.5%, from 20% for previous generations). The total raw bandwidth that can be obtained with a interface is . To support this data rate, the back-end data path must be 256 bits wide, with a clock reference of 250 MHz. The standard used is AXI4 [13], which required a redesign of the APEnet+ internal PCIe interface, as depicted in Fig. 5. The AXI4 migration also prepares for the future use of embedded ARM hard IP processors, foreseen only on high-end FPGAs such as the future Stratix 10 devices.
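The encoding and data-path figures in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the 128/130 and 8b/10b line codes and the 256-bit, 250 MHz data path; the helper names are hypothetical. As an additional assumption not stated above, a 64 Gbps data path would match eight lanes at the 8 GT/s Gen3 signalling rate.

```python
def encoding_overhead(payload_bits, line_bits):
    """Fraction of transmitted bits that carry no payload."""
    return 1 - payload_bits / line_bits


def datapath_gbps(width_bits, clock_mhz):
    """Bandwidth of a parallel data path, in Gbps."""
    return width_bits * clock_mhz / 1000


gen2_overhead = encoding_overhead(8, 10)     # 8b/10b, PCIe Gen1/Gen2
gen3_overhead = encoding_overhead(128, 130)  # 128b/130b, PCIe Gen3
backend_gbps = datapath_gbps(256, 250)       # 256-bit data path at 250 MHz

# Assumption: 8 lanes at 8 GT/s each, i.e. the Gen3 rate on an x8 slot.
assumed_link_raw_gbps = 8 * 8.0
```

Under these assumptions the back-end data path (64 Gbps) matches the raw line rate of the link, and the Gen3 encoding overhead is about 1.5%, compared with 20% for 8b/10b.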

New Altera devices are capable of 14.1 Gb/s transceivers, which can be bonded in 4 lanes to build up a 56 Gb/s off-board link. As a physical medium we can rely on QSFP+ standard that has been upgraded to work at these data rates (the same as InfiniBand FDR).

In order to develop PCIe Gen3 migration and 56 Gb/s class off-board links, we used an Altera development board with a Stratix V GX FPGA [14]. We implemented the link using the single 40 Gb/s QSFP+ onboard connector and performed data transfer tests between 2 such boards. As a preliminary result we achieved a link speed of 11.3 Gbps/lane (45.2 Gbps/channel), still using 40 Gb/s-certified cables.
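The per-channel figures quoted in this paper follow from bonding four lanes per link. A minimal check of that arithmetic is shown below; the helper name is hypothetical, and the values are raw line rates, without any encoding or protocol overhead.

```python
LANES_PER_LINK = 4  # lanes bonded into one off-board channel


def link_rate_gbps(lane_rate_gbps, lanes=LANES_PER_LINK):
    """Raw aggregate rate of a bonded link, rounded to 0.1 Gbps."""
    return round(lane_rate_gbps * lanes, 1)


current_link = link_rate_gbps(7.0)    # current APEnet+ setting
target_link = link_rate_gbps(14.1)    # Stratix V transceiver capability
measured_link = link_rate_gbps(11.3)  # preliminary result with 40 Gb/s cables
```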

Figure 4: Example of fault detection and awareness at system level with LO|FA|MO.
Figure 5: Design of PCIe interface, largely based on AXI4 protocol.

7 Conclusion

We presented a status update of the development of our custom interconnect system APEnet+. Several architectural improvements have been discussed, which led to substantial performance enhancements. We also introduced the fault-aware capability now embedded in the APEnet+ communication protocol, which can be used in advanced high-level fault tolerance techniques. Finally, we reported on the latest developments on 28 nm FPGA devices, which allow us to upgrade the PCIe interface to Gen3 and the off-board link to a 56 Gb/s data rate.


This work was partially supported by the EU Framework Programme 7 project EURETILE under grant number 247846; Roberto Ammendola was supported by MIUR (Italy) through INFN SUMA project.



  • [1] Ammendola R et al. 2012 Journal of Physics: Conference Series 396 042059
  • [2] NVIDIA GPUDirect technology
  • [3] Bureddy D, Wang H, Venkatesh A, Potluri S and Panda D 2012 Recent Advances in the Message Passing Interface (Lecture Notes in Computer Science vol 7490) ed Träff J L, Benkner S and Dongarra J J (Springer Berlin Heidelberg) pp 110–120 ISBN 978-3-642-33517-4
  • [4] Ammendola R et al. 2013 Track on Interconnect architectures for reconfigurable computing systems, held at Reconfigurable Computing and FPGAs (ReConFig), 2013 International Conference on to be published
  • [5] Ammendola R et al. 2013 Field-Programmable Technology (FPT), 2013 International Conference on to be published
  • [6] Ammendola R et al. 2013 JINST, Journal of Instrumentation, Proceedings of Topical Workshop on Electronics for Particle Physics (TWEPP) 2013 (IOP Publishing) to be published
  • [7] Ammendola R et al. 2013 arXiv:1307.0433
  • [8] Paolucci P et al. 2013 arXiv:1310.8478
  • [9] Bisson M, Bernaschi M, Mastrostefano E and Rossetti D Breadth first search on APEnet+ iA3 Workshop on Irregular Applications: Architectures and Algorithms, in conjunction with Super Computing 2012
  • [10] Bernaschi M, Bisson M and Rossetti D 2013 Journal of Parallel and Distributed Computing 73 250 – 255 ISSN 0743-7315
  • [11] Lonardo A et al. 2013 JINST, Journal of Instrumentation, Proceedings of Topical Workshop on Electronics for Particle Physics (TWEPP) 2013 (IOP Publishing) to be published
  • [12] Amerio S et al. 2012 Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC), 2012 IEEE pp 1806–1811 ISSN 1082-3654
  • [13] Altera Qsys Interconnect, Quartus II 13.0 Handbook, Volume 1
  • [14] Altera Stratix V GX Development Board