A GPU-Based Wide-Band Radio Spectrometer
The Graphics Processing Unit (GPU) has become an integral part of astronomical instrumentation, enabling high-performance online data reduction and accelerated online signal processing. In this paper, we describe a wide-band reconfigurable spectrometer built using an off-the-shelf GPU card. This spectrometer, when configured as a polyphase filter bank (PFB), supports a dual-polarization bandwidth of up to 1.1 GHz (or a single-polarization bandwidth of up to 2.2 GHz) on the latest generation of GPUs. On the other hand, when configured as a direct FFT, the spectrometer supports a dual-polarization bandwidth of up to 1.4 GHz (or a single-polarization bandwidth of up to 2.8 GHz).
Jayanth Chennamangalam, Simon Scott,
Glenn Jones, Hong Chen, John Ford, Amanda Kepley,
D. R. Lorimer, Jun Nie, Richard Prestage, D. Anish Roshi, Mark Wagner, and Dan Werthimer
Astronomical data acquisition and online reduction of data are steadily becoming more resource-intensive, not just for new and upcoming telescopes such as the Low Frequency Array (LOFAR) and the Square Kilometre Array (SKA), but also for new instruments at established facilities. Field Programmable Gate Arrays (FPGAs) have long been used at the output of Analog-to-Digital Converters (ADCs) for data reduction and/or packetization, followed by a computer that manages the recording of data to disk. FPGAs have traditionally been considered suitable for high-bandwidth applications, but the relative difficulty in programming them and the lack of support for floating-point arithmetic, coupled with the relatively inexpensive pricing of Graphics Processing Unit (GPU) cards, have popularized the use of GPUs in astronomical instrumentation. Several real-time GPU-based signal processing systems intended for pulsar astronomy have been developed in recent years [Ransom et al. 2009, Magro et al. 2011, Armour et al. 2012, Barsdell et al. 2012, Magro et al. 2013, for instance]. Many new instruments combine the high-bandwidth data acquisition capability of FPGAs with the high-performance data reduction capability of GPUs, glueing them together with high-throughput networking hardware. Such a heterogeneous architecture is expected to scale up to meet the data-handling requirements of future instruments and telescopes.
In this paper, we give an overview of a heterogeneous, wide-bandwidth,
multi-beam spectrometer that we have built for the Green Bank Telescope (GBT),
focussing on the GPU-based spectrometry code and its performance. This
spectrometer forms part of the ‘Versatile GBT Astronomical Spectrometer’
(VEGAS) [Roshi et al. 2011, Ford et al. 2013]. VEGAS has multiple modes of operation that are broadly
classified into two categories – the so-called high-bandwidth (HBW) and
low-bandwidth (LBW) modes. The HBW modes are characterized by
higher bandwidths and faster spectral dump rates. The HBW mode spectrometry
takes place exclusively on Field-Programmable Gate Array (FPGA)
2 The GPU-Programming Paradigm
In the past, computing performance was improved most commonly by using smaller
silicon features and increasing the clock rate. Since computer designers are no
longer able to increase the clock rate further due to power constraints,
parallelization is the primary method to improve performance in recent times.
The Central Processing Unit (CPU) of a typical personal computer (PC) has
traditionally contained a single instruction-processing core that can perform
only one operation at a time. Multi-tasking on a PC powered by such a CPU is
usually achieved by interleaving tasks in time. This obviously degrades the
performance of time-critical tasks such as rendering graphics for computer
games. One solution to this problem is to offload graphics processing to a
dedicated co-processor, the GPU. The GPU contains multiple processing cores
that enable it to run multiple instructions simultaneously.
The most common general purpose GPU programming platform is Compute Unified
Since GPUs are suitable computing platforms for data-parallel applications, they are increasingly used as dedicated co-processors for data analysis applications that use the high-performance hardware to accelerate their time-critical paths. This also makes GPUs ideal for data-acquisition instruments such as VEGAS.
3 Overview of VEGAS
The heterogeneous modes of operation of
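| Number of sub-bands | Sub-band bandwidth (MHz) | Number of channels | Spectral resolution (kHz) | Integration time (ms) |
|---|---|---|---|---|
| 1 | 100.0 – 187.5 | 32768 – 131072 | 0.8 – 5.7 | 10 – 30 |
| 1 | 11.72 – 23.44 | 32768 – 524288 | 0.02 – 0.7 | 5 – 75 |
| 8 | 15.625 – 23.44 | 4096 – 65536 | 0.24 – 5.7 | 5 – 100 |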
Figure 1 shows a block diagram of the software section of the VEGAS heterogeneous-mode data acquisition pipeline. In the heterogeneous modes, the FPGA board packetizes the signal sampled by an ADC and sends it over 10-Gigabit Ethernet (10GbE) to a PC with a GPU. The VEGAS software pipeline, based on the Green-Bank Ultimate Pulsar Processing Instrument (GUPPI; Ransom et al. 2009), is made up of multiple concurrent threads, each associated with a separate CPU core. The first thread, called the ‘network thread’, reads packets off the network and writes the payload to a shared memory ring buffer. The next thread, called the ‘GPU thread’, reads the data off the buffer and performs spectrometry, including accumulation of multiple spectra, if needed. Once the accumulated spectra are ready, the output is written to another ring buffer from which the third thread – the ‘CPU thread’ – reads data and performs further accumulation as needed. Once this is done, the output is sent to the ‘disk thread’, which writes it to disk. This paper describes the spectrometer implemented in the GPU thread.
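The thread structure described above can be sketched in a few lines of Python. This is only an illustration, not the VEGAS code: bounded `queue.Queue` objects stand in for the shared memory ring buffers, squaring the samples stands in for spectrometry, and all function names, the `SENTINEL` end-of-stream marker, and the `depth` parameter are assumptions made for this sketch.

```python
import queue
import threading

SENTINEL = None  # end-of-stream marker (an assumption of this sketch)


def network_thread(packets, raw_buf):
    """Stand-in for the network thread: write payloads to the first buffer."""
    for payload in packets:
        raw_buf.put(payload)  # blocks while the buffer is full
    raw_buf.put(SENTINEL)


def gpu_thread(raw_buf, spec_buf, n_acc):
    """Stand-in for the GPU thread: 'spectrometry' here is just squaring."""
    acc, count = None, 0
    while (payload := raw_buf.get()) is not SENTINEL:
        power = [s * s for s in payload]
        acc = power if acc is None else [a + p for a, p in zip(acc, power)]
        count += 1
        if count == n_acc:  # accumulated spectrum is ready
            spec_buf.put(acc)
            acc, count = None, 0
    spec_buf.put(SENTINEL)


def cpu_thread(spec_buf, out_buf):
    """Stand-in for the CPU thread; further accumulation would go here."""
    while (spec := spec_buf.get()) is not SENTINEL:
        out_buf.put(spec)
    out_buf.put(SENTINEL)


def disk_thread(out_buf, sink):
    """Stand-in for the disk thread: append to a list instead of a file."""
    while (spec := out_buf.get()) is not SENTINEL:
        sink.append(spec)


def run_pipeline(packets, n_acc=2, depth=4):
    """Run the four stages concurrently and return what reached 'disk'."""
    raw_buf, spec_buf, out_buf = (queue.Queue(maxsize=depth) for _ in range(3))
    sink = []
    threads = [
        threading.Thread(target=network_thread, args=(packets, raw_buf)),
        threading.Thread(target=gpu_thread, args=(raw_buf, spec_buf, n_acc)),
        threading.Thread(target=cpu_thread, args=(spec_buf, out_buf)),
        threading.Thread(target=disk_thread, args=(out_buf, sink)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink


result = run_pipeline([[1, 2], [3, 4], [5, 6], [7, 8]], n_acc=2)
```

The bounded queues make a slow stage apply back-pressure to the stage before it, which is the role the ring buffers play between the threads.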
VEGAS supports multi-beam receivers, in which case the signal from each beam is processed by a separate software data acquisition pipeline. The implementation utilizes dual-socket, dual-NIC, dual-GPU PCs, wherein one PC processes signals from two beams independently.
4 The GPU Spectrometer
Spectrometry is a Discrete Fourier Transform (DFT; see, for example, Bracewell 1999) operation, usually implemented as a Fast Fourier Transform (FFT) for its performance benefits. Due to the finite length of the ‘DFT window’ (the number of input time samples), the single-bin frequency domain response of the DFT is not rectangular, but is a sinc function, with side lobes spread across the entire bandwidth. This ‘spectral leakage’, and the related phenomenon of ‘scalloping loss’ – due to the non-flat nature of the main lobe of the sinc function – can be mitigated by suppressing the side lobes of the sinc function and changing the single-bin frequency response of the DFT to approximate a rectangular function. One way of achieving this is the polyphase filter bank (PFB) technique, also known as the weighted overlap-add method, in which a ‘pre-filter’ is introduced preceding the FFT stage (for details, see Crochiere & Rabiner 1983 and Harris & Haines 2011). The GPU spectrometer described in this paper implements an 8-tap polyphase filter bank.
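The effect of the pre-filter can be illustrated with a minimal pure-Python sketch. It is not the VEGAS implementation: the Hann-windowed sinc prototype filter is an assumed (though common) choice, a naive DFT stands in for the FFT, and the channel count, tone frequency, and function names are hypothetical. The sketch compares the power that a tone lying between bin centres leaks into a distant bin, for a direct DFT and for an 8-tap PFB.

```python
import cmath
import math


def pfb_coefficients(n_chan, n_taps=8):
    """Hann-windowed sinc prototype filter, n_taps * n_chan coefficients long."""
    length = n_taps * n_chan
    coeffs = []
    for i in range(length):
        x = (i - length / 2) / n_chan  # time in units of one DFT window
        sinc = 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)
        hann = 0.5 - 0.5 * math.cos(2 * math.pi * i / length)
        coeffs.append(sinc * hann)
    return coeffs


def dft(block):
    """Naive O(N^2) DFT, standing in for the FFT stage."""
    n = len(block)
    return [
        sum(block[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def pfb_spectrum(samples, coeffs, n_chan):
    """Weight n_taps consecutive windows, overlap-add them, then transform."""
    n_taps = len(coeffs) // n_chan
    summed = [
        sum(samples[p * n_chan + c] * coeffs[p * n_chan + c] for p in range(n_taps))
        for c in range(n_chan)
    ]
    return dft(summed)


def leakage(spectrum, peak_bin, far_bin):
    """Power in a distant bin relative to the power in the peak bin."""
    return abs(spectrum[far_bin]) ** 2 / abs(spectrum[peak_bin]) ** 2


n_chan, n_taps = 16, 8
# A complex tone 0.3 of a bin away from the centre of bin 4.
tone = [cmath.exp(2j * math.pi * 4.3 * n / n_chan) for n in range(n_taps * n_chan)]
direct_leak = leakage(dft(tone[:n_chan]), 4, 10)
pfb_leak = leakage(pfb_spectrum(tone, pfb_coefficients(n_chan, n_taps), n_chan), 4, 10)
```

With these assumptions, `pfb_leak` should come out well below `direct_leak`. The PFB needs `n_taps` times as many input samples per output spectrum, so the pre-filter adds work ahead of each transform.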
The input data to our PFB spectrometer is made up of dual-polarization, 8-bit, complex-valued samples, while the output contains $XX^*$, $YY^*$, Re($XY^*$), and Im($XY^*$), where $X$ is the Fourier transform of the horizontal polarization, $Y$ is the Fourier transform of the vertical polarization, and $X^*$ and $Y^*$ are the corresponding complex conjugates. Note that full-Stokes spectra can easily be derived from these values.
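As an illustration of that last point, the sketch below derives Stokes parameters from the four output products for one frequency channel, writing X and Y for the Fourier transforms of the horizontal and vertical polarizations. It assumes linear feeds and one common convention (I and Q from the auto-correlations, U and V from the cross-correlation, with V = −2 Im(XY*)); the sign of V and the overall normalization differ between conventions, and the function names are hypothetical.

```python
def polarization_products(x, y):
    """Return (XX*, YY*, Re(XY*), Im(XY*)) for one channel of X and Y."""
    cross = x * y.conjugate()
    return (abs(x) ** 2, abs(y) ** 2, cross.real, cross.imag)


def stokes_from_products(xx, yy, re_xy, im_xy):
    """Return (I, Q, U, V), assuming linear feeds and V = -2 Im(XY*)."""
    return (xx + yy, xx - yy, 2.0 * re_xy, -2.0 * im_xy)


# Equal amplitudes with Y a quarter cycle out of phase with X:
# fully circularly polarized under this convention.
circular = stokes_from_products(*polarization_products(1 + 0j, 0 + 1j))
# Signal in X only: fully linearly polarized along the horizontal axis.
linear = stokes_from_products(*polarization_products(2 + 0j, 0j))
```

Since the mapping is linear, it can be applied to the accumulated products rather than to every individual spectrum.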
The high-level algorithm of the spectrometer is as follows:
1. Initialize: load the filter coefficients from host to device memory, and create an FFT plan to perform two FFTs in parallel in the case of single-sub-band modes and 16 FFTs in parallel in the case of eight-sub-band modes.
2. Copy time series data for one set of parallel FFTs to the device.
3. Perform the parallel FFTs (preceded by the pre-filter stage in the case of PFB).
4. Accumulate spectra for the desired duration.
5. Copy the output to the host.
6. Repeat from Step 2.
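A CPU-only sketch of this loop for the direct-FFT, single-sub-band case is given below. It is an illustration under stated assumptions rather than the GPU code: a naive DFT stands in for the batched FFT, the host-to-device and device-to-host copies are omitted, the PFB pre-filter is left out, and the function and parameter names are hypothetical.

```python
import cmath
import math


def dft(block):
    """Naive O(N^2) DFT, standing in for the FFT."""
    n = len(block)
    return [
        sum(block[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def accumulate_spectra(x_pol, y_pol, n_fft, n_acc):
    """Yield per-channel [XX*, YY*, Re(XY*), Im(XY*)] summed over n_acc spectra."""
    acc = [[0.0, 0.0, 0.0, 0.0] for _ in range(n_fft)]
    done = 0
    for start in range(0, len(x_pol) - n_fft + 1, n_fft):
        # One transform per polarization (these run in parallel on the GPU).
        fx = dft(x_pol[start:start + n_fft])
        fy = dft(y_pol[start:start + n_fft])
        for k in range(n_fft):
            cross = fx[k] * fy[k].conjugate()
            acc[k][0] += abs(fx[k]) ** 2
            acc[k][1] += abs(fy[k]) ** 2
            acc[k][2] += cross.real
            acc[k][3] += cross.imag
        done += 1
        if done == n_acc:
            yield acc  # hand the accumulated spectrum to the next stage
            acc = [[0.0, 0.0, 0.0, 0.0] for _ in range(n_fft)]
            done = 0


# Two constant (DC) polarizations: all power should land in channel 0.
spectra = list(accumulate_spectra([1 + 0j] * 8, [1 + 0j] * 8, n_fft=4, n_acc=2))
```

Accumulating before copying the output back means the output data volume falls in proportion to the accumulation length.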
4.1 Test Observations
We observed the Galactic HII region W3 using the seven-beam K-Band Focal Plane Array (KFPA) receiver of the GBT during commissioning tests. Figure 2 shows a plot of antenna temperature versus velocity for multiple sub-bands corresponding to one of the KFPA beams, in which ammonia lines are visible.
5 Benchmarking and Performance Results
Benchmarking of the software spectrometer was performed on a server-class PC running a flavour of the Linux operating system, with an NVIDIA GeForce GTX TITAN commercial (gaming) GPU card. A stand-alone version of the spectrometer program was used, wherein data was read off disk files and pre-loaded in memory, to simulate reading from the shared memory ring buffers of VEGAS described in §3. Each test was repeated 100 times and we report the average values. The peak bandwidth achieved was MHz (dual-polarization), corresponding to a data rate of Gbps which is more than what a 10GbE link can support. The peak performance was achieved for an FFT length of , with long integrations (accumulation length of 1000 spectra). When direct FFT was used, the peak bandwidth achieved was MHz, corresponding to a data rate of Gbps, again, more than what is supported by 10GbE. This peak was for a -point FFT with an accumulation length of 1000 spectra. The performance of the code as a function of transform length and accumulation length is depicted in Figure 3.
The performance of the code is lower at low values of FFT length due to the following: Each FFT kernel invocation (that does either two FFTs in parallel for single-sub-band modes, or 16 FFTs in parallel for eight-sub-band modes) is preceded by a host-to-device data copy step and a pre-filter stage (in the case of PFB), and followed by a device-to-host data copy step. Given the overhead involved in launches of both the copy and compute kernels, this translates to fewer data processing operations per unit time, resulting in reduced performance. This becomes less of an issue at larger FFT lengths.
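The amortization argument can be made concrete with a toy model in which every batch pays a fixed overhead (copy and kernel launches) plus a cost proportional to the number of samples. The model and both of its parameters are assumptions chosen purely for illustration; they are not measurements of the spectrometer, and the real cost of an FFT does not scale exactly linearly with its length.

```python
def throughput_samples_per_us(n_fft, overhead_us, per_sample_ns):
    """Samples processed per microsecond under a fixed-overhead-per-batch model."""
    batch_time_us = overhead_us + n_fft * per_sample_ns * 1e-3
    return n_fft / batch_time_us


# Hypothetical values: 20 us of fixed overhead per batch, 1 ns per sample.
short = throughput_samples_per_us(2 ** 10, overhead_us=20.0, per_sample_ns=1.0)
long_ = throughput_samples_per_us(2 ** 20, overhead_us=20.0, per_sample_ns=1.0)
```

In this model the throughput approaches, but never reaches, the limit of `1000 / per_sample_ns` samples per microsecond as the transform length grows, which mirrors the trend described above.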
We have developed a GPU-based PFB spectrometer that supports a dual-polarization bandwidth of up to 1.1 GHz (or a single-polarization bandwidth of up to 2.2 GHz). Without doing PFB (i.e., direct FFT), it supports a dual-polarization bandwidth of up to 1.4 GHz (or a single-polarization bandwidth of up to 2.8 GHz). This bandwidth is sufficient for most spectral line observations, and can be traded off with spectral integration time for some pulsar observations. Future work would involve improving the performance of this software. The simplest way to speed it up would be to run it on the latest generation of GPU cards. Each new generation of GPU cards typically has, compared to its predecessors, more processing cores and larger memory bandwidth. This naturally leads to some improvement in performance. However, to significantly improve performance between two consecutive generations of GPUs, the code would need to be tuned keeping in mind the architecture of the GPU used. A better – albeit brute-force – way to speed up the code would be to implement support for scalability, by enabling the software to take advantage of dual-GPU cards, and/or to spread the load across multiple GPU cards. This has the potential to increase the bandwidth processed by up to a factor of a few, depending on the number of GPUs used. Additionally, algorithm-level and further code-level optimizations – such as pipelining kernel launches using CUDA streams – may also have the potential to yield higher performance.
We thank Mike Clark for suggestions on code optimization, and Ben Barsdell, Matthew Bailes, Jonathon Kocz, Gregory Desvignes, David MacMahon, and Terry Filiba for useful discussions. We also thank the anonymous referee for comments that served to clarify the paper. NVIDIA, GeForce, GeForce GTX TITAN, and GTX are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and/or other countries.
- The Reconfigurable Open Architecture Computing Hardware II (ROACH II) platform.
- Even though modern CPUs contain multiple processing cores (on the order of tens of cores), modern GPUs far surpass them, having cores on the order of hundreds to thousands.
- The GPU spectrometer code that we have developed is available freely for download from https://github.com/jayanthc/grating.
- Armour, W., Karastergiou, A., Giles, M., et al. 2012, ASP Conference Series, 461, 33
- Barsdell, B. R., Bailes, M., Barnes, D. G., Fluke, C. J. 2012, MNRAS, 422, 379
- Bracewell, R. N. 1999, The Fourier Transform and its Applications, McGraw-Hill
- Crochiere, R. E., Rabiner, L. R. 1983, Multirate Digital Signal Processing, Prentice-Hall
- Ford, J., Bloss, M., Brandt, P., et al. 2013, Radio Science Meeting (USNC-URSI NRSM), 2013 US National Committee of URSI National, doi:10.1109/USNC-URSI-NRSM.2013.6525022
- Harris, C., Haines, K. 2011, PASA, 28, 317
- Magro, A., Hickish, J., Adami, K. Z. 2013, JAI, 02, 1350008
- Magro, A., Karastergiou, A., Salvini, S., et al. 2011, MNRAS, 417, 2642
- Ransom, S. M., Demorest, P., Ford, J., et al. 2009, AAS Meeting, 214, 605.08
- Roshi, D. A., Bloss, M., Brandt, P., et al. 2011, General Assembly and Scientific Symposium, 2011 XXXth URSI, doi:10.1109/URSIGASS.2011.6051280