High-Performance Image Synthesis for Radio Interferometry
Abstract
A radio interferometer indirectly measures the intensity distribution of the sky over the celestial sphere. Since measurements are made over an irregularly sampled Fourier plane, synthesising an intensity image from interferometric measurements requires substantial processing. Furthermore, there are distortions that must be corrected.
In this thesis, a new high-performance image synthesis tool (imaging tool) for radio interferometry is developed. Implemented in C++ and CUDA, the imaging tool achieves unprecedented performance by means of Graphics Processing Units (GPUs). The imaging tool is divided into several components, and the backend handling numerical calculations is generalised in a new framework. A new feature termed compression further increases the performance of an already highly efficient GPU-based implementation of the w-projection algorithm. Compression takes advantage of the behaviour of oversampled convolution functions and the baseline trajectories. A multi-threaded CPU-based component prepares data for the GPU, ensuring maximum use of modern multi-core CPUs. Best performance can only be achieved if all hardware components in a system work in parallel. The imaging tool is designed such that disk I/O and work on the CPU and GPUs are done concurrently.
Test cases show that the imaging tool performs nearly 100 times faster than another general CPU-based imaging tool. Unfortunately, the tool is limited in use, since deconvolution and A-projection are not yet supported. It is also limited by GPU memory. Future work will implement deconvolution and A-projection, whilst finding ways to overcome the memory limitation.
Supervisor: Dr Kristian Zarb Adami
Department: Physics
Degree: M.Sc.
Acknowledgements
As in most ambitious and successful projects, success is only possible with the help of others.
I wish to thank my supervisor, Dr Kristian Zarb Adami, for his round-the-clock support, dedication and patience. Special thanks to Dr Oleg Smirnov and Dr Filipe Abdalla, who provided valuable insights and advice. Thanks also to Mr Alessio Magro, who gave advice and help on GPU programming. A big thanks goes to Dr John W. Romein for supplying me with the test code and data set related to his work, used in this thesis. Thanks also to Dr Anna Scaife and Dr Cyril Tasse for their help.
My family also contributed. A big thank you goes to my parents, Joseph and Maria Dolores Muscat, for their support, given during a period when they also had to face a number of health problems. My father also helped in creating some of the graphics included. Last but not least, a heartfelt thank you goes to my girlfriend, Ms Nicola Attard, who, notwithstanding the challenging environment to our relationship, stood by me and gave me her full support.
Chapter 1 Introduction to Radio Interferometry and Image Synthesis
The first interferometer used in astronomy dates back to 1921, when Albert Abraham Michelson, together with Francis G. Pease, made the first diameter measurements of the star Betelgeuse [1]. The setup used, known as the Michelson interferometer, is today the basis of modern interferometers. Michelson had been discussing the use of interferometry in astronomy for at least 30 years before the experiment was done [2, 3]. Throughout the century, the technology of radio interferometers has made enormous advancements, and up to this day extraordinarily ambitious projects have been commissioned, with more in the pipeline. An example of a recently commissioned radio interferometer is the LOw-Frequency ARray (LOFAR) [4]. The Square Kilometre Array (SKA) [5, 6, 7, 8, 9] is the most ambitious project currently underway.
Radio interferometers can be used for various purposes, even for applications outside the scope of astronomy, such as geophysics [10, 11]. The scope of this thesis is limited to the use of the interferometer as a measuring device for the intensity distribution of the sky over the celestial sphere. The measured quantity of the interferometer is known as visibility, and the thesis’ main focus is on how to recover the said intensity distribution from such measured visibility data.
An interferometer is made up of an array of $N \geq 2$ antennas, and unlike a single-dish antenna it achieves sub-arcsecond resolutions with high accuracy. The maximum angular resolution of a single dish is limited by its diffraction limit. The limit is inversely proportional to the diameter of the dish, that is, $\theta \approx \lambda / D$. On the other hand, the angular resolution of the interferometer is limited by the distance between the furthest two antennas in the array in the same inversely proportional way, that is, $\theta \approx \lambda / B_{max}$, where $B_{max}$ is the longest baseline length. Achieving sub-arcsecond resolution with single-dish antennas requires prohibitively large diameters. For interferometers, achieving sub-arcsecond resolution is just a matter of placing antennas as far away as possible from each other. If necessary, part of the array can be orbiting in space, as in the case of the Very Long Baseline Interferometry (VLBI) Space Observatory Programme (VSOP) [12].
The basic measurement device (based on the Michelson interferometer) is composed of just two elements. Each possible two-element combination of the $N$ antennas forms an independent two-element measuring device. During an observation, the antenna array tracks a position on the celestial sphere, normally within the field of view of the observation. Each basic element makes a measurement of the intensity distribution of the sky in the form of visibility values. $N(N-1)/2$ visibility readings are made simultaneously by the whole $N$-element antenna array, and each reading differs in the geographical setup of the basic device. The geographical setup is described by the baseline, which is the vector covering the distance between the two antennas. Earth rotation changes the direction of the baseline (assuming a frame of reference stationary with the celestial sphere), and this is taken advantage of by taking subsequent visibility readings [13]. As will be shown later in this chapter, a Fourier relationship exists between visibility as a function of baseline and the intensity distribution, provided that certain conditions are met. Taking advantage of this relationship is not an easy computational task and is today an active area of research, of which this thesis is part.
This introduction aims to give a brief overview of the theory of interferometry, define the measurement equation of the interferometer and discuss a commonly used image synthesis pipeline. The overview serves as a preamble to the discussion of the motivations, aims and objectives of this thesis, given in the penultimate section. The chapter concludes with an outline of the thesis.
The theory presented in this chapter is based on Thompson et al. [14] and some online sources, notably presentations given in a workshop [15] organised by the National Radio Astronomy Observatory (NRAO), and a course [16] given by the same organisation.
1.1 Analysis of a two-element interferometer
A basic two-element interferometer is depicted in Figure 1.1. It is observing an arbitrary cosmic source in the sky, in the direction of the unit vector $\mathbf{s}$. The baseline is defined by the vector $\mathbf{b}$. The interferometer is assumed to have a quasi-monochromatic response at a central frequency $\nu$. Cosmic sources are in the far field of the interferometer, so the incoming waves at the interferometer can be considered planar over the baseline. Clearly, the output of antenna 1 is the same as that of antenna 2, except that it is lagging behind by, say, $\tau_g$ seconds. $\tau_g$ is called the geometric delay. The wave needs to travel an extra distance $\mathbf{b} \cdot \mathbf{s}$ to reach antenna 1; thus, letting $c$ represent the speed of light, $\tau_g$ equates to:

$\tau_g = \frac{\mathbf{b} \cdot \mathbf{s}}{c}$  (1.1)
Since the interferometer is quasi-monochromatic, the voltage outputs of antenna 1 and antenna 2, denoted by $V_1$ and $V_2$ respectively, can be written as follows:

$V_1 = v \cos 2\pi\nu(t - \tau_g)$  (1.2a)

$V_2 = v \cos 2\pi\nu t$  (1.2b)

where $t$ is time and the amplitude $v$ is proportional to the intensity of the source.
1.1.1 The correlator
To produce a visibility reading, the two voltages get correlated by first multiplying them and then averaging over a period of time known as the integration time. As will be shown in this section, such a correlation alone is not enough, since some intensity data is lost. A complex correlator is used to make a second correlation between the outputs of the antennas, with one of the outputs phase-delayed by 90 degrees. The two outputs of the complex correlator are presented as one complex quantity, which is the visibility. Visibility is defined formally in equation 1.9.
Multiplication of $V_1$ with $V_2$ results in:

$V_1 V_2 = v^2 \cos 2\pi\nu(t - \tau_g) \cos 2\pi\nu t = \frac{v^2}{2}\left[\cos 2\pi\nu\tau_g + \cos 2\pi\nu(2t - \tau_g)\right]$  (1.3)

$V_1 V_2$ has a constant term and a sinusoidal term with respect to $t$. Provided that averaging is done over a long enough time interval, the sinusoidal term averages out to 0, implying that the average $r$ equates to:

$r = \langle V_1 V_2 \rangle = \frac{v^2}{2} \cos 2\pi\nu\tau_g$  (1.4)

The correlator response is dependent on the direction of the source, with a sinusoidal behaviour known as a fringe. The term $2\pi\nu\tau_g$ is known as the fringe phase.
Figure 1.2 shows a plot of a fringe. It clearly indicates that, for some values of $\tau_g$, the correlator gives no output. This implies loss of data. If an intensity distribution is considered, then it is easy to show that the odd part of the distribution is lost.
Let $I_E(\mathbf{s})$ and $I_O(\mathbf{s})$ be the even and odd parts of the intensity distribution $I(\mathbf{s})$. Using equation 1.4 and integrating over the sky, $r$ results to be proportional to:

$r \propto \int \left[I_E(\mathbf{s}) + I_O(\mathbf{s})\right] \cos\left(2\pi\nu \frac{\mathbf{b} \cdot \mathbf{s}}{c}\right) d\Omega$  (1.5a)

since $\int I_O(\mathbf{s}) \cos\left(2\pi\nu \frac{\mathbf{b} \cdot \mathbf{s}}{c}\right) d\Omega = 0$,

$r \propto \int I_E(\mathbf{s}) \cos\left(2\pi\nu \frac{\mathbf{b} \cdot \mathbf{s}}{c}\right) d\Omega$  (1.5b)
The second correlation made over the two antenna outputs, with one output phase-delayed by 90 degrees, generates an odd fringe and responds to the odd intensity distribution. This implies that the complex correlator responds to the full intensity distribution.
1.1.2 Effects of channel bandwidth and the delay centre
If it is assumed that the interferometer is responsive over a bandwidth $\Delta\nu$ centred on $\nu_0$, whereby intensity is constant throughout, then, based on equation 1.4, the correlator output is:

$r = \frac{v^2}{2\Delta\nu} \int_{\nu_0 - \Delta\nu/2}^{\nu_0 + \Delta\nu/2} \cos(2\pi\nu\tau_g)\, d\nu$  (1.6a)

$\phantom{r} = \frac{v^2}{2}\, \mathrm{sinc}(\Delta\nu\,\tau_g) \cos(2\pi\nu_0\tau_g)$  (1.6b)

where

$\mathrm{sinc}(x) = \frac{\sin \pi x}{\pi x}$  (1.6c)
The term $\mathrm{sinc}(\Delta\nu\,\tau_g)$ is known as the fringe washing function and degrades the response of the interferometer. The degradation depends on the channel bandwidth, which should be kept as narrow as possible, especially for wide-field imaging. An interferometer splits a wide bandwidth into many frequency channels, each of narrow bandwidth, and treats each channel independently.
The fringe washing function has no effect on the interferometer response for $\tau_g = 0$. This is because signals of different frequencies will reach the correlator in phase. The geometric delay $\tau_g$ can be compensated by inserting an extra instrumental delay $\tau_i$ in the circuit between the antenna receiving the leading signal and the complex correlator. For $\tau_i = \tau_g$, the fringe washing function is nullified. Since $\tau_g$ is a function of $\mathbf{s}$, one can choose a direction for which the fringe washing function is nullified. This direction is termed the delay centre.
1.2 The uvw coordinate system
Many variables in interferometry are expressed against a reference direction known as the phase reference centre or phase centre. The phase centre is fixed to the sky, and it is common practice to set it to point in the same direction as the delay centre.
The general coordinate system used in interferometry is the $uvw$ coordinate system, depicted in Figure 1.3. It is a right-handed Cartesian system where the $u$ and $v$ axes lie on a plane normal to the phase centre and the $w$ axis points in the direction of the phase centre. The $u$ axis is in the East-West direction, while the $v$ axis is in the North-South direction.
One of the main uses of the $uvw$ coordinate system is to measure the baseline against the phase centre. The baseline components expressed over the axes defined by $u$, $v$ and $w$ are depicted in Figure 1.3. The components are normally given in units of number of wavelengths, such that:

$(u, v, w) = \frac{\mathbf{b}}{\lambda} = \mathbf{b}_\lambda$  (1.7)

where $\lambda$ is the channel wavelength, and $\mathbf{b}_\lambda$ is the baseline expressed in number of wavelengths.
As visibility measurements are taken consecutively in time, the baseline vector rotates in uvw-space, since the interferometer resides on the Earth's surface. The UV-coverage of an observation is defined as the set of $(u, v)$ values for which the interferometer makes a visibility measurement. Figure 1.4 depicts the UV-coverage of a real LOFAR observation. Each baseline forms an elliptical trajectory while the interferometer samples (measures) visibility in time. A short baseline tends to form a trajectory near the centre, while a longer one tends to form a trajectory further out.
A clarification on the units of the $(u, v, w)$ components is now proper. Expressing $(u, v, w)$ in units of number of wavelengths is mathematically convenient and is the general approach found in the literature [14]. Nevertheless, it is not always practical to use, since values are dependent on channel frequency. The MeasurementSet [17] is a common format used to store visibility measurements. It specifies $(u, v, w)$ data to be stored in metres, rendering it valid over all frequency channels. In this thesis, the $(u, v, w)$ components are expressed in number of wavelengths unless stated otherwise.
When describing a position on the celestial sphere, the direction cosines $l$ and $m$ are used, as shown in Figure 1.3. They are defined with respect to the $u$ and $v$ axes, such that the direction vector $\mathbf{s}$ is described as:

$\mathbf{s} = \left(l, m, \sqrt{1 - l^2 - m^2}\right)$  (1.8)
1.3 The measurement equation
Consider Figure 1.5. $\mathbf{s}_0$ represents the phase reference position. The unit vector $\mathbf{s}$ retains its definition, that is, a pointer towards an arbitrary position on the celestial sphere. $d\Omega$ is an element of solid angle, and $\mathbf{b}_\lambda$ is the baseline expressed in number of wavelengths.
Not considered in the previous section is the non-uniform, polarisation-dependent reception behaviour of the antennas, known as the A-term. Let $A_N(\mathbf{s})$ represent a normalised version of the A-term. The complex visibility is defined as:

$V(\mathbf{b}_\lambda) = \int A_N(\mathbf{s})\, I(\mathbf{s})\, e^{-2\pi i\, \mathbf{b}_\lambda \cdot (\mathbf{s} - \mathbf{s}_0)}\, d\Omega$  (1.9)
This definition is in line with that given in [14], which follows the sign convention for the exponent used in [18, 19]. $I(\mathbf{s})$ is the intensity distribution.
The complex visibility, as defined in equation 1.9, has dimensions of flux density ($\mathrm{W\,m^{-2}\,Hz^{-1}}$). The dimensions of the intensity distribution are $\mathrm{W\,m^{-2}\,Hz^{-1}\,sr^{-1}}$. In astronomy, the Jansky (Jy) is a commonly used unit of flux density, where $1\,\mathrm{Jy} = 10^{-26}\,\mathrm{W\,m^{-2}\,Hz^{-1}}$.
As described in [14], $d\Omega$ equates to:

$d\Omega = \frac{dl\, dm}{\sqrt{1 - l^2 - m^2}}$  (1.10)

$\mathbf{b}_\lambda \cdot \mathbf{s} = ul + vm + w\sqrt{1 - l^2 - m^2}$  (1.11a)

$\mathbf{b}_\lambda \cdot \mathbf{s}_0 = w$  (1.11b)
Substituting in equation 1.9, the following results:

$V(u, v, w) = \iint \frac{A_N(l, m)\, I(l, m)}{\sqrt{1 - l^2 - m^2}}\, e^{-2\pi i \left[ul + vm + w\left(\sqrt{1 - l^2 - m^2} - 1\right)\right]}\, dl\, dm$  (1.12)
Equation 1.12 is the measurement equation of the interferometer. It defines the relationship between the interferometer measured quantities and the intensity distribution of the sky.
1.4 Image Synthesis
This thesis gives little importance to the effects of $A_N(l, m)$ and the quotient term $\sqrt{1 - l^2 - m^2}$ in the measurement equation 1.12. For convenience, these two terms are subsumed in the intensity distribution, which will be referred to as the measured intensity distribution $I_M(l, m)$:

$I_M(l, m) = \frac{A_N(l, m)\, I(l, m)}{\sqrt{1 - l^2 - m^2}}$  (1.13)
The measurement equation 1.12 is rewritten as:
$V(u, v, w) = \iint I_M(l, m)\, e^{-2\pi i \left[ul + vm + w\left(\sqrt{1 - l^2 - m^2} - 1\right)\right]}\, dl\, dm$  (1.14)
This section reviews how $I_M(l, m)$ can be recovered from visibility measurements made by the interferometer. The term $w\left(\sqrt{1 - l^2 - m^2} - 1\right)$ is referred to as the w-term. If it equates to 0, then a Fourier relationship between visibility expressed in terms of $u$ and $v$ ($w$ ignored) and the measured intensity distribution is obtained, that is:

$V(u, v) = \mathcal{F}\{I_M(l, m)\}$  (1.15)

where $\mathcal{F}$ is the Fourier operator.
Most image synthesis techniques exploit this Fourier relationship in order to use computationally efficient Inverse Fast Fourier Transform (IFFT) algorithms. However, various issues need to be circumvented in order to apply such algorithms, which mandates some pre- and post-processing of data. The next subsections give a brief overview of the main issues and how they are commonly handled. Based on this overview, Figure 1.6 depicts an image synthesis pipeline that is commonly used.
1.4.1 Non-uniform sampling of the visibility plane
Fast Fourier Transform algorithms require that the visibility UV-plane (also referred to as the UV-grid) is uniformly sampled. As already discussed in section 1.2, the UV-coverage of the interferometer is not uniform, implying that some processing is required to generate a uniformly sampled UV-grid.
Convolutional gridding is the technique used in interferometry to transform the non-uniform UV-coverage into a uniform one. Details of this technique are given in section 2.1. Measured data is convolved with an appropriately chosen function, and the result is sampled so as to generate a uniform UV-grid. After an IFFT takes place, the convolution is reversed by a simple element-by-element division on the intensity image. One notes that convolutional gridding is an actively researched topic, especially in its application in the medical sciences [20, 21, 22].
1.4.2 Incompleteness
The non-uniform UV-coverage implies that the measured visibility data is incomplete. The intensity distribution can never be recovered in full by relying solely on the measured visibility samples. The output intensity image generated by the convolutional gridding algorithm is called a dirty image, since it contains artefacts caused by the non-uniform, incomplete UV-coverage.
Incompleteness is handled by iterative methods commonly known as deconvolution. Based on a priori knowledge of the intensity distribution, deconvolution is applied over the dirty image in order to recover the intensity distribution. A classic example of a deconvolution algorithm is CLEAN [23, 24], which assumes that the intensity is made up of point sources. Starck et al. [25] give a review of other deconvolution methods.
1.4.3 The w-term
As pointed out by Cornwell et al. [26], if the w-term is much less than unity, it can be neglected and a two-dimensional Fourier relationship results. When synthesising a narrow field of view, the w-term can, in many cases, be neglected. Neglecting the w-term when synthesising a wide field of view causes substantial distortions.
An advanced method that corrects the w-term is the w-projection [26] algorithm. A visibility sample is projected onto the $w = 0$ plane by means of a convolution. The w-term for the $w = 0$ plane is 0, so the effects of the w-term are nullified by the projection. The algorithm is integrated with the convolutional gridding algorithm and is explained in detail in section 2.2.
Other alternatives to w-projection exploit other characteristics of the measurement equation. For example, it is possible to express the measurement equation as a 3-dimensional Fourier transform [27, 28]. Snapshots [26, 29, 28, 30] consider a coplanar array, whereby at any given time $w$ is a linear combination of $u$ and $v$ for all the measurements made by the coplanar array [31, 28]. Thus, for short periods of observation time, the w-term causes only a distortion of the lm-plane coordinates, which is corrected by a linear transform. Facets consider the wide-field intensity image as a sum of smaller images (facets), over which the w-term can be neglected. There are two types of facets: image-plane facets and uvw-space facets. These are reviewed in [26].
Hybrid algorithms combining w-projection with any of these alternative algorithms are also possible and are discussed in section 2.2.2.
1.5 Motivation, aims and objectives of the thesis
As new powerful radio telescopes are built, the computational demand of the imaging pipeline described in Figure 1.6 is increasing to exorbitant levels. It is estimated that Phase 2 of the Square Kilometre Array (SKA) will need as much as 4 exaflops of computational power [32], with 90% of computational resources taken by the gridding algorithm [33]. Phase 2 is expected to be commissioned around 2020, and it is predicted that, though technology will have advanced by then, a supercomputer delivering computational rates in exaflops would still be at the top of the TOP500 [34] list [32, 35].
Motivated by such computational requirements, this thesis aims to give a contribution to the application of high-performance computing in radio interferometry. The main objective is to develop a new high-performance imaging tool, called the malta-imager or mt-imager for short. It relies on CUDA-compatible NVIDIA® Graphics Processing Units (GPUs) to synthesise images by means of w-projection. (CUDA stands for Compute Unified Device Architecture and is a parallel programming framework for GPUs developed by NVIDIA® [36].) Deconvolution is outside the scope of this work.
The infrastructure handling all numerical calculations is generalised in a framework, called the General Array Framework (GAFW). It is designed to handle different hardware, such as GPUs and Field Programmable Gate Arrays (FPGAs), but is currently implemented for CUDA-compatible GPUs only, so as to serve the main objective. The mt-imager is built over this framework to perform all GPU-based computation.
1.6 Outline of thesis
This introductory chapter discussed the main concepts of image synthesis in radio interferometry. The next chapter, Chapter 2, gives a literature review. The focal point of the thesis is the implementation of w-projection and convolutional gridding. A mathematical treatment and a review of reported implementations of these algorithms are given. The chapter concludes with a detailed description of a gridding algorithm over GPUs proposed by Romein [37]. It is pivotal to this thesis and will often be referred to as Romein's algorithm.
The General Array Framework is the subject of Chapter 3. The framework is discussed in detail and terms are defined.
Chapter 4 is the main chapter, in which the mt-imager is discussed in detail. Note that the chapter uses the terms and definitions given in previous chapters. The imaging tool is divided into components that are independent of each other. The high-level design is first discussed, and then details of the tool are given through a discussion of each component.
Chapter 5 reports on the performance obtained by the tool and shows that the main objective of the thesis has been achieved. The chapter also reports detailed experimental analyses of the performance of Romein's algorithm as implemented in the imaging tool. These analyses are meant to enrich knowledge of the algorithm. The chapter concludes by proposing future work.
The thesis is concluded in Chapter 6.
Chapter 2 Gridding and W-Projection
This chapter reviews convolutional gridding and the w-projection algorithm. Some mathematical treatment is given, together with other background knowledge. A literature review on the performance and implementation of these algorithms is included, together with some notes on GPU programming.
2.1 Convolutional gridding
In interferometry, convolutional gridding is the method used to transform the non-uniformly sampled UV-plane (visibility plane) into a uniformly sampled one, such that Inverse Fast Fourier Transform (IFFT) algorithms can be applied (refer to subsection 1.4.1). Visibility measurements are convolved with an appropriately chosen convolution function $C(u, v)$. The resultant UV-plane is then sampled uniformly, and an IFFT is applied. The result is an image in the lm-plane (intensity distribution plane), which is corrected for the convolution by a simple element-by-element division. The output is a dirty image of the intensity distribution, which is the true intensity aliased by the non-uniform, incomplete UV-coverage.
In this section, convolutional gridding is discussed via a mathematical treatment. The concept of weighting is introduced, and some topics on the convolution function are reviewed.
2.1.1 Mathematical review
The mathematical treatment given in this section is based on, and adapted from, Jackson et al. [21].
The measured intensity distribution considered here has been defined in section 1.4 as follows:

$I_M(l, m) = \frac{A_N(l, m)\, I(l, m)}{\sqrt{1 - l^2 - m^2}}$  (2.1)
For this treatment, the w-term is ignored, such that the Fourier relationship expressed in equation 1.15 holds, that is:

$V(u, v) = \mathcal{F}\{I_M(l, m)\}$  (2.2)
Let $N_{vis}$ be the number of visibility measurements that the interferometer samples, and let $k$ be a zero-based index such that $u_k$ and $v_k$ are the $u$ and $v$ values of the $k^{th}$ measured visibility point. The sampling function of the interferometer is defined as:

$S(u, v) = \sum_{k=0}^{N_{vis}-1} \delta^2(u - u_k, v - v_k)$  (2.3)

where $\delta^2$ is the two-dimensional Dirac delta function.

As discussed further on, it is desirable to control the shape of the sampling function. This is done by introducing a weight $W_k$ for each sampled visibility, such that the weighted sampling function equates to:

$W(u, v) = \sum_{k=0}^{N_{vis}-1} W_k\, \delta^2(u - u_k, v - v_k)$  (2.4)
$\mathcal{F}^{-1}\{W(u, v)\}$ is known as the synthesised beam, dirty beam or point spread function (PSF). It is here denoted as $B(l, m)$:

$B(l, m) = \mathcal{F}^{-1}\{W(u, v)\}$  (2.5)
With interferometric sampling and the use of weights, the visibility plane is transformed into a non-uniformly sampled plane $W(u, v)\, V(u, v)$. The desire is to transform this non-uniformly sampled plane into a uniform UV-grid $V_G(u, v)$, sampled at regular intervals $\Delta u$ and $\Delta v$ on the U and V axes respectively. This is achieved by convolving with an appropriately chosen convolution function $C(u, v)$ and then uniformly sampling the output:

$V_G(u, v) = III_{\Delta u, \Delta v}(u, v)\left[C(u, v) * \left(W(u, v)\, V(u, v)\right)\right]$  (2.6)

where the operator $*$ is a two-dimensional convolution, and $III_{\Delta u, \Delta v}$ is the Shah function defined as:

$III_{\Delta u, \Delta v}(u, v) = \sum_{i=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} \delta^2(u - i\Delta u,\, v - j\Delta v)$  (2.7)
An inverse Fourier transform is applied to $V_G(u, v)$, which equates to:

$\mathcal{F}^{-1}\{V_G\}(l, m) = \mathcal{F}^{-1}\{III_{\Delta u, \Delta v}\} * \left[c(l, m)\left(B(l, m) * I_M(l, m)\right)\right]$  (2.8)

where $c(l, m) = \mathcal{F}^{-1}\{C(u, v)\}$ is referred to as the tapering function.
Convolution with the Shah function causes a replication of $c(l, m)\left(B * I_M\right)$ at intervals $1/\Delta u$ in $l$ and $1/\Delta v$ in $m$. The synthesised field of view is thus dependent on the choice of $\Delta u$ and $\Delta v$. Only one replica is truly calculated, which is mathematically equivalent to multiplying with a rectangular function. To complete the process, the image is divided by $c(l, m)$ so as to reverse the convolution and produce the non-normalised dirty image $I_D(l, m)$:

$I_D(l, m) = \frac{\mathcal{F}^{-1}\{V_G\}(l, m)}{c(l, m)} \approx \left(B * I_M\right)(l, m)$  (2.9)

where

$\left(B * I_M\right)(l, m) = \iint B(l - l', m - m')\, I_M(l', m')\, dl'\, dm'$  (2.10)
Equation 2.9 describes the output of the convolutional gridding algorithm.
2.1.2 Deconvolution and weighting schemes
The PSF distorts $I_M$ in such a way that it cannot be recovered through direct methods. This is the effect of data incompleteness. Iterative methods known as deconvolution are applied to try to recover $I_M$ to the best possible accuracy (refer to section 1.4.2). The effectiveness is dependent on the form of the PSF, which is desired to be as compact as possible. If too few visibility samples are considered, then the PSF might be too wide for a proper recovery. In the extreme case, where only one measurement is considered, the PSF is infinitely wide.
The weight $W_k$ introduced in equation 2.4 plays a pivotal role in controlling the form of the PSF. As visible in Figure 1.4, there is a higher density of visibility measurements in regions covered by shorter baselines. This tends to over-emphasise the long spatial wavelengths of the PSF, causing wide skirts. The uniform weighting scheme caters for this issue by weighting each measurement with a value inversely proportional to the neighbourhood density [38]. In another scheme, called natural weighting, the density issue is given second priority, and visibility data is weighted inversely proportionally to its variance, so as to obtain the best signal-to-noise ratio [14]. Short-baseline data will still be over-emphasised, but an in-between scheme, known as robust weighting [39], tries to compromise between the natural and uniform weighting schemes.
2.1.3 The tapering function
The form of the tapering function is crucial, since the replication caused by the convolution with the Shah function forces $c(l, m)$ to alias the image. Aliasing can be fully suppressed if the tapering function covers the whole field of view defined by $\Delta u$ and $\Delta v$ and then goes to 0 outside the region. The infinite sinc function (in the UV-plane) is the best choice for $C(u, v)$ [20], but it is computationally unattainable. Various functions have been studied [21], and the general choice in interferometry is the prolate spheroidal function [40, 41, 42].
The convolution function has to be finite, or in other words, it has to have finite support. The support of the convolution function is defined as the full width of the function in grid pixels. Note that in the literature, support might be defined differently, as explained by Humphreys and Cornwell [43].
It is a common approach that the convolution function is not calculated during gridding but is pre-generated prior to the gridding process [37, 43] and stored in memory. The function has to be sampled at a higher rate than $\Delta u$ and $\Delta v$. Such sampling is referred to as oversampling. The oversampling factor $\alpha$ is defined as:

$\alpha = \frac{\Delta u}{\Delta u_o} = \frac{\Delta v}{\Delta v_o}$  (2.11)

where $\Delta u_o$ and $\Delta v_o$ represent the sampling intervals at which the convolution function is sampled. For convenience, it is assumed that the oversampling factor is invariant over the U and V axes.
In general, an implementation of a gridder does not perform any interpolation on the numerical data of the convolution function. It merely chooses the nearest numerical values while gridding a record (in this thesis, the term record is used to describe a single- or multi-polarised visibility measurement). This changes the form of the convolution function, which can be modelled by the following equation:

$C'(u, v) = \left[III_{\Delta u_o, \Delta v_o}(u, v)\, C_o(u, v)\right] * \Pi\!\left(\frac{u}{\Delta u_o}, \frac{v}{\Delta v_o}\right)$  (2.12)

where $C_o(u, v)$ is the oversampled version of $C(u, v)$ and $\Pi$ is the two-dimensional rectangular function.
Applying the inverse Fourier transform, the following is obtained:

$c'(l, m) = \left[c(l, m) * \mathcal{F}^{-1}\{III_{\Delta u_o, \Delta v_o}\}(l, m)\right] \Delta u_o\, \Delta v_o\, \mathrm{sinc}(\Delta u_o\, l)\, \mathrm{sinc}(\Delta v_o\, m)$  (2.13)
The sinc functions change the form of the tapering function. Their effects are corrected for by dividing the image by the sinc functions after the inverse Fourier transform takes place.
2.2 W-projection
In subsection 2.1.1, the w-term was ignored while describing the convolutional gridding algorithm. When the term is much less than unity, such an assumption is acceptable [26]. Otherwise, ignoring the w-term leads to substantial distortions. This section discusses a recent algorithm known as w-projection [26], which handles the w-term through convolution. A visibility record can be projected onto the $w = 0$ visibility plane by convolving with a $w$-dependent function. W-projection builds over the convolutional gridding algorithm by applying this fact. A mathematical treatment of w-projection is the subject of the next subsection. It is followed by other w-projection-related topics, namely performance and hybrid algorithms.
2.2.1 Mathematical treatment
Frater and Docherty [44] show that a relationship exists between any constant-$w$ plane and the $w = 0$ plane. The following is a mathematical proof based on [44].

Let $V_w(u, v)$ represent visibility over a plane with constant $w$, and let $G(l, m, w)$ represent the w-term in its exponential form:

$G(l, m, w) = e^{-2\pi i w\left(\sqrt{1 - l^2 - m^2} - 1\right)}$  (2.14)
Using the measurement equation 1.14, it can be shown that:

$V_w(u, v) = \iint I_M(l, m)\, G(l, m, w)\, e^{-2\pi i (ul + vm)}\, dl\, dm$  (2.15)

which implies the following Fourier relationship:

$V_w(u, v) = \mathcal{F}\{I_M(l, m)\, G(l, m, w)\}$  (2.16)
Clearly $V_{w=0}(u, v) = \mathcal{F}\{I_M(l, m)\}$, and thus

$I_M(l, m) = \mathcal{F}^{-1}\{V_{w=0}(u, v)\}$  (2.17)
Substituting in equation 2.16 and applying the convolution theorem, it results that:

$V_w(u, v) = V_{w=0}(u, v) * \hat{G}(u, v, w)$  (2.18)

where

$\hat{G}(u, v, w) = \mathcal{F}\{G(l, m, w)\}$  (2.19)
Any visibility plane with constant $w$ is related to the $w = 0$ plane by a convolution with the known function $\hat{G}(u, v, w)$.
Clearly $G^{-1}(l, m, w) = e^{+2\pi i w\left(\sqrt{1 - l^2 - m^2} - 1\right)}$ exists, implying that $\mathcal{F}\{G^{-1}\}$ also exists, and thus the convolution in equation 2.18 can be inverted to make $V_{w=0}$ the subject of the formula:

$V_{w=0}(u, v) = V_w(u, v) * \mathcal{F}\{G^{-1}(l, m, w)\}$  (2.20)
Any visibility plane with constant $w$ can be projected onto the $w = 0$ plane via a convolution with the w-dependent function $\mathcal{F}\{G^{-1}(l, m, w)\}$.
Cornwell et al. [26] argue that, as a consequence of equation 2.20, any visibility point can be re-projected onto the $w = 0$ plane. This forms the basis of the w-projection algorithm, which builds over convolutional gridding by convolving each visibility record with a $w$-dependent convolution function $C_w(u, v)$. It is defined as:

$C_w(u, v) = C(u, v) * \mathcal{F}\{G^{-1}(l, m, w)\}$  (2.21)
The Point Spread Function (defined in section 2.1.1) is calculated using w-projection as follows:

$W_G(u, v) = III_{\Delta u, \Delta v}(u, v) \sum_{k=0}^{N_{vis}-1} W_k\, C_{w_k}(u - u_k, v - v_k)$  (2.22)

$B(l, m) = \frac{\mathcal{F}^{-1}\{W_G\}(l, m)}{c(l, m)}$  (2.23)
Equation 2.9, which is the output of the convolutional gridding algorithm, is valid for w-projection provided that equation 2.23 is used for $B(l, m)$.
$C_w(u, v)$ is not directly solvable, and thus the convolution functions (there is a different convolution function for each unique value of $w$) have to be solved numerically [26]. They need to be pre-generated and stored in an oversampled format. The gridding process will convolve with $C'_w(u, v)$, which is the oversampled version of $C_w(u, v)$:

$C'_w(u, v) = \left[III_{\Delta u_o, \Delta v_o}(u, v)\, C_{w,o}(u, v)\right] * \Pi\!\left(\frac{u}{\Delta u_o}, \frac{v}{\Delta v_o}\right)$  (2.24)
The aliasing and correction arguments given in section 2.1.3 still hold. Sampling in $w$ is also required, and this too generates some aliasing. Cornwell et al. [26] argue that aliasing can be reduced to tolerable values by scaling the $w$ planes in $\sqrt{w}$ rather than linearly. The support of $C_w$ is dependent on $w$ and increases with increasing $w$ [29]. This can lead to prohibitive memory requirements for the storage of the convolution functions [45, 29]. It is a known issue in w-projection, and research is ongoing to handle the problem, such as the work presented by Bannister and Cornwell [45].
It is to be noted that based on the same principles described here, it is possible to correct the wterm in the intensity plane. Applying an inverse Fourier transform on equation 2.16 it results that:
$I(l,m) = G^{-1}(l,m,w)\,\mathcal{F}^{-1}\{V(u,v,w)\}$ (2.25)
2.2.2 Hybrid algorithms
In section 1.4.3, alternatives to w-projection were discussed. These alternatives are not mutually exclusive with w-projection: hybrid solutions using w-projection are possible and have been successfully implemented. For example, facets and w-projection have been integrated together in CASA [46], where w-projection is applied over wide facets. The recently proposed w-snapshots algorithm [29] uses w-projection to project visibilities on a plane over which the snapshots algorithm can be applied. The plane is chosen by a least-squares fit and changed whenever the visibility records being projected deviate too much from it.
2.2.3 Performance
2.3 Notes on GPU programming using CUDA
This section summarises key aspects of NVIDIA GPU programming using CUDA. For details, reference should be made to documents [49, 50] supplied by NVIDIA.
2.3.1 Thread configuration
GPUs are parallel devices that achieve impressive computational power by handling thousands of execution threads concurrently. Mandated by the hardware design, threads are grouped in blocks, and each block is executed by one multiprocessor. Each multiprocessor concurrently handles a few of these blocks, and further blocks are scheduled to run only after earlier blocks finish execution. Threads within a block are guaranteed to all be running when the block is scheduled on a multiprocessor. For modern GPUs, the maximum number of threads per block is 1024, implying that a modern GPU can handle tens of thousands of threads concurrently.
Code executed on a GPU is referred to as a kernel, and in this thesis the term is used only for this purpose. The term thread configuration is used to describe the setup of threads and blocks for a kernel. Many of the thread configurations mentioned in this thesis are set up such that a thread processes an element without the need to interact with other threads. In such thread configurations, it is to be assumed that blocks are set with the maximum number of threads, and that enough blocks are requested to execute at full occupancy. It is also to be assumed that there can be thread reuse, in the sense that a given thread will process a set of elements one after another. This has the same effect as a thread being destroyed after an element is processed and recreated to handle a new element when a new block is scheduled.
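The thread-reuse mapping described above can be sketched on the CPU. The function below is illustrative only (not part of the imaging tool): it lists which elements a given global thread index would process, mirroring a CUDA grid-stride loop in which a thread with index $t$ handles elements $t$, $t+T$, $t+2T$, ... for $T$ threads in flight.

```cpp
#include <vector>

// Illustrative sketch (not thesis code): with totalThreads threads in
// flight, the thread with global index globalThread processes elements
// globalThread, globalThread + totalThreads, ... exactly as if it were
// destroyed and recreated for each newly scheduled block.
std::vector<int> elementsForThread(int globalThread, int totalThreads,
                                   int numElements) {
    std::vector<int> elems;
    for (int e = globalThread; e < numElements; e += totalThreads)
        elems.push_back(e);
    return elems;
}
```

Every element is processed exactly once, and no two threads ever touch the same element, which is why no inter-thread interaction is needed.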
2.3.2 Memory
Many GPU algorithms tend to be memory bound, meaning that the main limiting factor is access to memory. In the design of such algorithms, the aim is often the optimisation of memory usage.
The GPU provides different types of memory. Table 2.1 lists the relevant ones for this thesis together with some salient features.
Memory Type  Location (on/off chip)  Access  Scope  Lifetime 

Register  On  R/W  1 thread  Thread 
Shared  On  R/W  All threads in block  Block 
Global  Off  R/W  All threads and host  Host allocation 
Texture  Off  R  All threads and host  Host allocation 
Global memory is the only read/write memory that is persistent beyond the lifetime of a kernel, and it is also the largest in size (a few gigabytes). It also suffers from high latency. Shared memory is much faster than global memory but limited in scope and size (a few tens of kilobytes). Registers are even faster, as they can be accessed at zero extra clock cycles per instruction in most execution scenarios; they are much more limited in scope than shared memory and are a scarce resource.
Shared memory and registers can be used to store temporary data that needs to be accessed quickly. For example, if a set of data is required to be read by all threads in a block, then it is general practice to load the data set into shared memory.
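As a hedged illustration of this practice (the kernel and its names are invented for this sketch, not taken from the thesis code), a block can cooperatively preload a small coefficient table into shared memory before all of its threads read it repeatedly:

```cuda
// Sketch only: each block copies a small, block-wide coefficient table
// from global to shared memory once, then every thread reads it many
// times at shared-memory speed instead of global-memory speed.
__global__ void applyCoeffs(const float *coeffs, int nCoeffs,
                            float *data, int nData)
{
    extern __shared__ float s[];          // sized at launch time
    for (int i = threadIdx.x; i < nCoeffs; i += blockDim.x)
        s[i] = coeffs[i];                 // cooperative preload
    __syncthreads();                      // table complete before use

    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < nData; i += gridDim.x * blockDim.x)
        data[i] *= s[i % nCoeffs];        // many reads, all from shared
}
```

The kernel would be launched with the dynamic shared memory size as the third launch parameter, e.g. `applyCoeffs<<<blocks, threads, nCoeffs * sizeof(float)>>>(...)`.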
Textures are read-only memory structures, and in this thesis they are used for their on-chip cache, which enables efficient access to read-only data in global memory. They provide further functionality, not used in this thesis and not reviewed here.
2.4 Related work on highperformance gridding
Literature on high-performance implementations of w-projection and convolutional gridding for radio interferometry is scarce. This contrasts with medical science research, where GPU-based gridding is a common topic of research for use in medical equipment [51, 22, 52].
Edgar et al. [53] report work on a CUDA gridder for the Murchison Widefield Array (MWA) [54]. A thread is associated with a unique point on the grid. Each thread scans through a list of visibility records, calculating the contribution (if any) that each record gives to the grid point under the thread's responsibility. Most of the records do not give any contribution to a given grid point, and there is substantial waste of time in the scan. This waste is reduced by grouping records by their position on the grid, such that each gridder thread scans only some of the groups.
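The grouping idea can be sketched with a toy model (all names invented; scalar visibilities and a single grid point per record, i.e. no convolution support): records are bucketed by grid position, so a per-grid-point "thread" scans only its own bucket instead of the full record list.

```cpp
#include <map>
#include <utility>
#include <vector>

// Toy sketch (not the MWA gridder): bucketing records by position means
// the worker for a grid point scans a short list, not every record.
struct Record { int u, v; double value; };

using Buckets = std::map<std::pair<int, int>, std::vector<Record>>;

Buckets groupByPosition(const std::vector<Record>& records) {
    Buckets b;
    for (const auto& r : records) b[{r.u, r.v}].push_back(r);
    return b;
}

double gridPoint(const Buckets& b, int u, int v) {
    auto it = b.find({u, v});
    double sum = 0.0;
    if (it != b.end())
        for (const auto& r : it->second) sum += r.value;  // short scan
    return sum;
}
```

In the real gridder a record contributes to a region of grid points, so a thread scans the buckets covering its point; the principle of skipping irrelevant records is the same.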
Humphreys and Cornwell [43] discuss a gridder used for benchmarking. A thread is assigned to each point of the convolution function. To avoid race conditions, a single run of the kernel only grids a few records, whose convolved footprints do not overlap.
Another benchmark study is presented by Amesfoort et al. [55]. Race conditions are avoided by allocating a private grid for each thread block. This makes the gridder unusable for radio interferometry, since the memory requirements of such a configuration are much higher than the global memory available in modern GPUs.
Romein [37] has recently proposed an algorithm which this thesis regards as a breakthrough on the subject, and it is used in the development of the thesis' imaging tool. From here on, this algorithm will be referred to as Romein's algorithm. It exploits the trajectory behaviour of the baselines that was explained in section 1.2. Details on the algorithm are given in section 2.5.
Yatawatta [56] at the Netherlands Institute for Radio Astronomy (ASTRON) [57] developed a new imaging tool called excon. It grids visibility measurements on GPUs using w-projection and w-snapshots.
GPUs are not the only devices considered for high-performance implementations of w-projection and convolutional gridding. For example, Verbanescu et al. [58], Verbanescu [59] and Amesfoort et al. [55] consider implementations over the Cell B.E. processor, while Kestur et al. [60] report on a framework for gridding over FPGAs.
2.5 Romein’s algorithm
As already stated, Romein's algorithm refers to the algorithm presented in [37]. This thesis makes use of this algorithm, and hence it is reviewed in some detail here. It should be noted that chapter 5 presents new analyses of this algorithm that are not published in [37].
Records measured by a given baseline are sorted by time and presented consecutively to the algorithm. In this way, while records are being gridded one after the other, the region in the UV-grid that is updated by a convolved record moves along the baseline trajectory.
The grid is split up into subgrids whose size is equal to that of the convolution function used to grid. Threads are assigned to take care of one grid point in each subgrid, as shown in Figure 2.1. Thanks to this configuration, a thread updates one and only one grid point when convolving a visibility record. By virtue of the baseline trajectory behaviour, it is quite likely that subsequent records will need to update the same grid point. Updates are accumulated in registers until the grid point moves out of the region being updated. At this point, the thread commits the accumulated update to global memory and commences a new accumulation for a new grid point in a different subgrid.
Some implementation details used in the test case presented by Romein [37] for this algorithm now follow. Only the main details that are reused in this thesis are pointed out.
All grid point updates related to a record are handled by one CUDA block. This means that the block requires a number of threads equal to the convolution function size. GPUs impose a maximum on the number of threads per block (1024 threads per block for the latest architectures), which is in many cases smaller than the size of a convolution function. A given thread is thus allowed to cater for more than one grid point in each subgrid. Ideally, all the grid points under the responsibility of a thread would be handled concurrently, but due to register limitations in modern GPUs this is not always possible. Instead, a group of records is scanned several times by the thread so as to cater for all the grid points entrusted to it. All the threads in a block will need to read the same record data several times, so by GPU best practices the data is preloaded into shared memory for fast access.
Different CUDA blocks grid different groups of records. This mandates the use of atomic additions to commit the accumulated grid point updates to the grid. Single-precision atomic additions are intrinsically supported by GPUs, but double-precision atomic additions are not. Although a software implementation of double-precision atomic additions is possible, such an implementation is inefficient and heavily impairs the algorithm's performance. In conclusion, the algorithm works efficiently only in single precision.
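The accumulate-and-commit pattern at the heart of the algorithm can be illustrated with a sequential toy model (this is not the CUDA implementation; names and the scalar grid are invented for illustration). A thread owns one grid point of the moving convolution window, adds updates into a local accumulator, and writes to the grid only when the owned point changes, which is rare because consecutive records follow the baseline trajectory.

```cpp
#include <utility>
#include <vector>

// Sequential sketch of one thread's accumulate-and-commit behaviour in
// Romein's algorithm. Each update names the grid point it targets; the
// commit (grid[...] += acc) is an atomicAdd in the GPU version.
struct CommitStats { int updates = 0, commits = 0; };

CommitStats accumulateCommit(
        const std::vector<std::pair<int, double>>& updates,
        std::vector<double>& grid) {
    CommitStats s;
    int owned = -1;      // grid point currently owned by this "thread"
    double acc = 0.0;    // register accumulator
    for (const auto& [point, value] : updates) {
        if (point != owned) {
            if (owned >= 0) { grid[owned] += acc; ++s.commits; }
            owned = point;
            acc = 0.0;
        }
        acc += value;
        ++s.updates;
    }
    if (owned >= 0) { grid[owned] += acc; ++s.commits; }  // final flush
    return s;
}
```

The benefit is the ratio of updates to commits: long runs of updates to the same point cost one global (atomic) write instead of one write per record.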
The implementation makes use of textures for retrieving convolution function data stored in global memory.
Romein [37] reports a maximum of 93 giga grid point updates per second\footnote{The metric is explained in section 5.3.1.} for the test implementation. Gridding was done over a quad-polarised grid using a GeForce GTX 680 GPU. This result is revised in chapter 5, since the enhancements made in this thesis give more insight into the algorithm.
Chapter 3 The General Array FrameWork
The General Array FrameWork (GAFW) provides an infrastructure for numerical calculations. The framework has features meant to facilitate high performance and has a simple user\footnote{In this context, the "user" is the developer of an application built over the General Array FrameWork.} interface. It adapts to the underlying hardware (CPUs, GPUs, FPGAs, etc.), since all control logic is handled by the framework through an engine.
The framework's design promotes collaboration between scientists with basic knowledge of programming and developers specialised in high-performance computing. A layered approach allows for two distinct development areas, each suited to the respective collaborators mentioned above. The concept of the framework is based on the mathematical notions of arrays and operators, with minimal framework-specific jargon. This makes it easily comprehensible and manageable by scientists.
In this thesis, a C++ [61] implementation has been developed that supports multi-GPU systems and forms the basis of the imaging tool discussed in the next chapter. The use of the framework simplified the development of the tool.
3.1 Main concepts
The framework mimics the way humans tend to make a calculation. In most cases, an equation is first defined, and afterwards numerical data is entered to obtain a result. In other words, humans tend to understand mathematical relationships before doing any calculation. A similar computational scenario is presented in this framework. The application defines how a calculation is done and then makes a separate request for a calculation over a set of input data. Since calculation definitions are separate from the actual calculation, they can be reused over and over again for different input data.
A calculation is defined using a calculation tree. The calculation tree is a graph of operators that are connected together via arrays. Operators define logic that generates numerical data, while each array represents an N-dimensional ordered set of numerical data of a specific type (e.g. integers, complex numbers, etc.). Arrays are set as input and/or output to operators.
Figure 3.1 gives an example of a simple tree. It defines convolution using FFTs. Arrays are represented by arrows.
For convenience, the entry point to the generated numerical data of an array is provided by a third object called the result. Each array has one and only one result object associated with it.
Result objects also provide the entry point for calculation requests. Calculation requests are result-centric rather than tree-centric, in the sense that the request is to calculate the result. Based on the defined calculation trees, the framework decides which operations (that is, execution logic described by operators) are required to obtain the requested result.
Result objects provide mechanisms to inform the framework on what to do with the generated data, for example whether the application intends to retrieve the data or not. This has a significant impact on performance, since retrieval requires a memory copy from GPU to host.
Calculations are done asynchronously with respect to the application thread. Such behaviour is crucial to achieve high performance and maximise the usage of available resources. Once the application requests a calculation, the framework validates and proceeds in the background. The application thread can load more data and request further calculations while GPUs are executing code in the background.
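The define-then-calculate idea can be illustrated with a toy model. This is emphatically not the GAFW API (all names here are invented, and the real framework is asynchronous and GPU-based); it only shows how a tree of named operators is defined once and then reused with different input data.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Toy illustration of define-then-calculate (NOT the GAFW API).
struct Node {
    std::function<double(const std::vector<double>&)> op;  // "operator"
    std::vector<std::string> children;  // names of its input nodes
};

struct Tree {
    std::map<std::string, Node> nodes;     // the calculation definition
    std::map<std::string, double> inputs;  // data, supplied separately

    double calculate(const std::string& name) const {
        auto in = inputs.find(name);
        if (in != inputs.end()) return in->second;  // an "input array"
        const Node& n = nodes.at(name);
        std::vector<double> vals;
        for (const auto& c : n.children) vals.push_back(calculate(c));
        return n.op(vals);
    }
};
```

Because the definition (`nodes`) is separate from the data (`inputs`), the same tree can be evaluated repeatedly with fresh inputs, mirroring how the framework reuses calculation definitions.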
Two final points on the general concept of the framework are the factory and object identification. A factory is used to make things as easy as possible for the user and to ensure that the framework has full control over its own objects. The factory maintains all framework objects, including their creation and destruction. Framework objects are identified by a two-string structure referred to as an identity. A registry service provided by the framework enables the application to use the objects' identities instead of C++ pointers or references when communicating with the framework.
3.2 A layered approach
The framework can be modelled by five stacked layers as shown in Figure 3.2. Each layer is described hereunder:

The Application Layer: As its name implies, this layer represents the application built over the GAFW. The application constructs calculation trees, inputs numerical data to the system, requests calculations and retrieves results. Note that the application layer is unaware of the underlying hardware as it is up to the framework to make the necessary adaptations to execute calculations on the underlying hardware. This keeps the development of the application layer simple.

The Programming Interface Layer (PIL) defines all functionality that the framework provides to the application layer. Its design is intended to offer a remarkably simple interface to the application layer. Section 3.3 discusses the programming interface layer in greater detail.

The Calculation Execution Control Layer (CECL) contains all the logic to perform a calculation. It is well aware of the underlying hardware and adapts to it. As already pointed out, only GPUs are currently supported, and it adapts calculations to the number of GPUs available. The layer is made up of a multithreaded engine, providing all GPU control logic to execute a calculation. It requires direct access to the underlying hardware.

The Operation Execution Layer (OEL) is responsible for executing operator code on the hardware.

The Hardware Layer is the hardware itself. As pointed out earlier, CUDA-compatible GPUs are the only supported hardware. The aim is to support other hardware, such as FPGAs, DSP boards, CPUs and clusters, in the future.
3.3 The programming interface layer
This section discusses more concepts and details of the framework as seen from the point of view of the application layer. Each subsection discusses a concept, object or service given by the framework.
3.3.1 The factory
Since a priority in the design of the framework is simplicity, the handling of objects is delegated to the framework and taken away from the application. The framework delivers this service through a factory. Only one instance of the factory can exist at any moment in the application life cycle, and it is expected to exist throughout the whole lifetime of the application. Whenever the application requires the creation or destruction of framework objects, it does so by making a request to the framework's factory.
The factory simplifies the management of operators for the application layer, since operator initialisation and destruction logic can vary from one operator to another. Further details are discussed in section 3.3.4.
The use of the factory is beneficial for forward compatibility with future releases of the framework. Since all the initialisation/destruction logic is under the absolute control of the framework, any future enhancement of the framework that changes such logic can be implemented with no change to the application layer code. The application will still be able to compile against new versions of the framework.
The factory represents the framework, and its instantiation is analogous to the instantiation of the framework. Once the factory is set up, there are no other procedures to follow to initialise the framework. The application can immediately build calculation trees and subsequently request calculations.
3.3.2 Identification of objects and the registry service
Framework objects are identified using a two-string structure called an identity. User-defined objects can also have an identity, and the framework is well structured to support such user-defined objects.
An identity is defined by two strings: a name and an objectname. The name identifies the type of object such as "array" or "result" while the objectname identifies the object itself. In Figure 3.1 the objectnames "A" and "B" are used to represent two arrays in the calculation tree. Every object is expected to have a unique objectname, which is given by the application layer. Names are given by the framework and are useful to distinguish between different operators (such as "FFT" or "Dot Multiply").
Objectnames have a hierarchical naming scheme\footnote{Names do not have a naming scheme.}. A dot (.) is used to separate levels in the hierarchy. There are no rules regarding how the hierarchy is set up but, if an object is to be registered, its parent needs to be registered beforehand. The hierarchy is essential to avoid conflicts in objectnames. For example, if two unrelated trees are developed, there is the possibility that, by mistake, the same objectname is chosen for objects in the two trees. If each tree uses its own namespace, by defining its own unique branch in the hierarchy, then the issue is avoided. Note that the framework provides mechanisms that allow application-defined objects to work in their own namespace.
The use of identities makes it possible to refer to objects by objectname or name instead of by C++ pointers or references when communicating with the framework. A registry service within the framework keeps a mapping between objectnames and the respective objects. Framework objects are automatically registered on creation; other application-defined objects can be registered manually. This strategy relieves the application of maintaining C++ pointers to the objects that it needs.
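The parent-before-child registration rule can be sketched as follows (a toy model, not the GAFW registry; the function name is invented): "a.b.c" may be registered only if its parent "a.b" already is, while top-level names have no parent and always succeed.

```cpp
#include <set>
#include <string>

// Toy sketch of the hierarchical objectname rule (not the GAFW code).
// Returns false if the parent namespace is missing or the objectname
// is already taken.
bool tryRegister(std::set<std::string>& registry,
                 const std::string& objectname) {
    auto dot = objectname.rfind('.');
    if (dot != std::string::npos &&
        registry.count(objectname.substr(0, dot)) == 0)
        return false;                           // parent not registered
    return registry.insert(objectname).second;  // false on duplicate
}
```

Two unrelated trees registering under "treeA." and "treeB." can then never collide, which is the point of the namespace hierarchy.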
3.3.3 Arrays
An array represents the movement of data within a tree. It has an identity and is automatically registered by the factory on creation. An array is defined by its dimensions and the type of numbers it holds (e.g. integers, complex numbers, etc.).
An array within a tree can be in one of three mutually exclusive modes: input, output or bound.
An array is in input mode if numerical data is manually pushed in by the application layer. The array provides the entry point to input such data.
An array can be bound with a result object. Such a bind instructs the framework to use the last generated data represented by the result object as input to a subsequent calculation in which the bound array is involved.
If an array is neither bound nor in input mode, then it is in output mode and is expected to be set as an output of an operator. The application is not normally expected to define the dimensions and data type of such arrays, since operators are expected to have logic that determines such properties automatically during validation (refer to section 3.4.1).
3.3.4 Operators
A GAFW operator describes logic that generates numerical data. For example, an operator might describe the logic of a Fast Fourier Transform.
Operators have an identity (discussed in section 3.3.2). The identity's name is an essential property of an operator and is one of the main reasons why an identity includes a name. Different operators are regarded as different object types and must have different names. The application communicates its request for a new operator object using the name of the operator. Good, meaningful names are those that describe the execution code represented by the operator, such as "FFT".
In general, each operator is implemented in a different class. Thanks to identities and object-oriented polymorphism, this complexity is hidden from the application, which is aware only of the base class.
An operator has three forms of input data: input arrays, pre-validator arrays and parameters. Inputless operators are legitimate.

Input arrays are an ordered list of arrays that describe the data on which the operator operates. The numerical data that the array represents is made available on the GPU memory during the operation execution.

Pre-validator arrays are another ordered list of arrays. Unlike normal input arrays, their data is not presented to the operator during execution; they are only applicable during the validation phase (refer to section 3.4.1). The framework guarantees that they are validated before the operator. It is expected that the operator validation logic will need to obtain some information from such an array (such as dimensions or data type) in order to execute its own validation logic.

Parameters are a set of variables that are set up prior to calculations. They can be regarded as configuration data for the operator. For example, an operator doing zero padding would need to know how much zero padding is needed. This information can be given to the operator through parameters.
An operator has only one form of output, which is an ordered list of arrays in output mode.
An operator can request some additional allocation of memory on the GPU for its own private use while executing. This memory is referred to as a buffer.
Operators exist in the PIL and OEL layers (refer to Figure 3.2). As mentioned earlier, the base class is presented to the application layer. The PIL, therefore, only defines the basic interactions of the operator with the application layer. As for the OEL, the operator defines all validation and execution logic.
The procedure to develop an operator is simple. The operator is defined by a new C++ class that inherits from the operator base class. Two methods need to be overridden, one providing validation logic and another providing the kernel submission logic. In most cases, a new CUDA kernel needs to be coded. This is the only hard part of the whole procedure, since it requires expertise in CUDA programming. The final step is to register the operator in the framework by its name. The registration process requires coding for class instantiation and operator object destruction.
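The shape of this procedure can be sketched as below. The class and method names are invented for illustration; the real GAFW base class and its signatures differ, and the `submit` method would launch a CUDA kernel rather than run host code.

```cpp
#include <stdexcept>
#include <vector>

// Hedged sketch of the operator-development procedure (illustrative
// names, not the GAFW classes): one method for validation logic, one
// for kernel submission.
class Operator {
public:
    virtual ~Operator() = default;
    virtual void validate(const std::vector<int>& inputDims,
                          std::vector<int>& outputDims) = 0;
    virtual void submit() = 0;  // kernel submission (CUDA in practice)
};

class ZeroPadOperator : public Operator {
    int pad_;  // a "parameter": configuration set before calculation
public:
    explicit ZeroPadOperator(int pad) : pad_(pad) {}
    void validate(const std::vector<int>& inputDims,
                  std::vector<int>& outputDims) override {
        if (inputDims.empty())
            throw std::invalid_argument("zero-pad needs one input array");
        outputDims.clear();  // determine output dimensions automatically
        for (int d : inputDims) outputDims.push_back(d + 2 * pad_);
    }
    void submit() override { /* would launch the CUDA kernel here */ }
};
```

Note how validation both checks the inputs and derives the output array dimensions, which is what allows the application to leave output arrays undimensioned.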
3.3.5 Results
Result objects provide the entry points for requesting calculations and for controlling and retrieving calculated data. Each array has one and only one result object associated with it, which contains all the generated numerical data related to the array.
Since calculation logic is automated within the framework, the application needs to give instructions on how to handle the data. These instructions are listed and explained below:

Application requires results: The application has to inform the framework of its intention to retrieve the calculated data. This is particularly important for performance, because extra memory copies from GPU to host memory are required to allow the application to access the result. It also avoids unnecessary memory allocation on the host when the data is not required.

Data reusability: This declares the intention of the application to request subsequent calculations that will reuse the generated data. Given this instruction, the framework keeps a copy of the result, possibly in GPU memory, for reuse.

Overwrite: Setting a result for overwrite instructs the framework that, prior to executing the corresponding operation on the GPU, it should initialise the output data to the last calculated result.
These instructions can be given to any result object participating in a calculation and not only to the result object through which a calculation is requested. For example, referring to the calculation tree in Figure 3.1, if O is requested to be calculated and C is set to be a required result, then the framework will copy the result of C to the host and make it available to the application.
3.3.6 Proxy results
As their name implies, proxy results are meant to serve as proxies to result objects. Proxy result objects behave as if they were genuine result objects (the result class is inherited). The result object being proxied is configurable at any time and can be changed at will. The main use of these objects is in situations where a fixed result object needs to be presented, but needs to be changed every now and then behind the scenes.
3.3.7 Modules
GAFW modules are intended to contain a defined calculation in one object. They are application-defined and need to inherit from the module base class. Inputs and outputs of modules are in the form of result objects. The module must provide all the logic to request calculations from the framework.
3.3.8 Statistics
It is advantageous to have a statistics system in a highperformance application. For example, it is helpful to have execution timings of operations executed on the GPU. The framework generates its own statistics that are pushed to the application for further processing.
A single statistic is contained in a unique object whose class inherits a general base class. An interface is defined in such a way that statistic objects can be posted through it. The interface needs to be implemented by the application. It is also expected that posted statistics are processed in a background thread. This design allows the application to use the infrastructure for its own generated statistics.
3.4 The engine
The Calculation Execution Control Layer is made up of an engine whose function is to execute a calculation request. This section discusses the engine and how it executes a calculation over GPUs. Figure 3.3 portrays a highlevel design of the engine.
Calculations are executed in three steps: validation, snapshot taking and proper calculation. Each step is discussed in the following subsections.
3.4.1 Validation
The framework needs to verify that the calculation tree is valid. It also needs to fill in any missing tree data (array dimensions and data types) that can be predicted. If the framework does not have sufficient data, or the calculation tree is invalid, then the validation process returns an error in the form of a C++ exception.
Some of the validation logic has to be supplied by the operators used in the calculation tree. An operator has to validate itself against its unique specifications, such as the number of inputs and outputs. It also has to determine the dimensions of its output arrays when these are missing or incorrect.
The validation algorithm is depicted in the flow chart shown in Figure 3.4. The algorithm is split into two sub-algorithms, one for the validation of arrays and one for the validation of operators. These two sub-algorithms are mutually recursive: validating an array in output mode requires the validation of the respective operator, while validating an operator requires the validation of all its input arrays and pre-validators. In this way, the tree is traversed until non-output arrays or inputless operators are found. The algorithm then unwinds, validating all objects on the way back.
3.4.2 Snapshot taking
Taking a snapshot means copying all relevant data regarding a calculation such that the proper calculation can be executed asynchronously. In this way, the proper calculation takes place in the background while the application can change trees, input new data, request other calculations or execute any other logic.
To execute a calculation, the engine transforms the calculation tree into a stream of operations to run. Each element in the stream describes an operation in full, including transient data such as state, locking mechanisms and GPU memory allocations for input and output data. The engine relies exclusively on this data to perform the calculation.
3.4.3 Proper calculation
This step computes the calculation tree over GPUs. Simplistically speaking, it is a matter of memory management and kernel submission. All tasks are executed by the engine with the exception of kernel submission, which is executed by the operator on request from the engine.
The engine is implemented using a multithreaded approach. All the work done by the engine is divided into simple tasks, each handled by a separate thread. A blocking FIFO queue is used to communicate between threads. This approach enables the framework to monitor many events concurrently without the need for looping. It avoids busy waiting, where events are continuously queried for status updates, and it helps the framework act quickly on events on which a thread blocks: the waiting thread is released as soon as the event fires.
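A minimal blocking FIFO queue of the kind described above can be sketched with standard C++ primitives (this is a sketch, not the thesis implementation): `pop()` sleeps on a condition variable instead of busy-waiting, so a consumer thread wakes as soon as an item arrives.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Sketch of a blocking FIFO queue for inter-thread communication.
template <typename T>
class BlockingQueue {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lock(m_); q_.push(std::move(v)); }
        cv_.notify_one();  // release one waiting consumer immediately
    }
    T pop() {  // blocks without polling until an item is available
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
};

// Producer/consumer demonstration: sums n items pushed by another thread.
inline int drainSum(int n) {
    BlockingQueue<int> q;
    std::thread producer([&] { for (int i = 1; i <= n; ++i) q.push(i); });
    int sum = 0;
    for (int i = 0; i < n; ++i) sum += q.pop();
    producer.join();
    return sum;
}
```

The condition variable is what eliminates busy waiting: the consumer consumes no CPU while the queue is empty, yet reacts immediately to a push.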
The engine delegates the actual management of a device to a device manager. This simplifies functional support for multiple GPUs (and, in the future, other devices). A unique instance of the manager is brought up for every supported device, with its own task threads, as illustrated in Figure 3.5.
The engine schedules an operation for execution on a device by submitting it to the respective device manager. In systems with more than one GPU, a whole calculation request is submitted to one GPU, while the next calculation request is submitted to the next GPU. This simple round-robin scheduling proved good enough for the imaging tool discussed in the next chapter.
GPU memory management works as follows. The allocation thread continuously allocates memory for incoming operations until the memory is full or there are no operations in the pipeline. Once the GPU memory is full, it waits until the post-execution thread frees memory. Memory is freed once the computation related to an operator is complete and the data will not be reused by future operations. Memory that is not freed and is not set as input or output of any operation submitted to the device manager is managed by the memory management thread. Such memory can halt the whole calculation process, as it might not leave enough space for currently scheduled operations. In such a scenario, the device manager caches the data in host main memory (if not already done) and frees the memory on the GPU. Caching also takes place when data needs to be transferred from one GPU to another, or when defragmentation of memory is required.
Unfortunately, the CUDA runtime API [62] is unable to allocate memory on a GPU while a kernel is being executed. This causes the allocation thread to freeze during kernel execution, reducing thread concurrency on the CPU. This has a direct impact on overall performance, since data transfers\footnote{By GPU best practices [49], the time to transfer data from host to GPU is hidden by doing memory transfers in parallel with kernel execution.} cannot be initiated prior to the allocation of the respective memory. To ease the problem, a locking mechanism is in place that denies concurrent memory allocation and kernel submission. In this way, the allocation process works in batches while kernel submission is locked.
3.5 Collaboration
In the chapter’s introduction, it was claimed that the framework is designed for collaboration. This section elaborates on this argument.
Framework-related development of an application can be divided into two areas. The first area is the application layer. Development in this area is easy and fast, with no specialised expertise required: whoever attempts it needs general knowledge of the behaviour of the calculation, with no details of how it is deployed on the GPU. This area is remarkably well suited to the scientist who has general expertise in programming and views programming as a means to reach scientific goals.
Development of operators is the second area. Developing an operator requires specialised expertise in high-performance parallel computing over GPUs. Much less knowledge about the overall behaviour of the calculation is required, but in-depth and detailed knowledge of the operator is a must. This area is suited to a developer who has little interest in scientific goals and much more interest in writing high-performance code.
The two areas are separate, and the only commonalities are the framework itself and the functional specification of the operator. It is therefore easy to promote collaboration between the scientist and the developer, and to get the best out of both. One notes that functional specification documents are a standard in the industrial software development world: they are the main tool with which developers communicate with their clients.
3.6 Standard Operators
As part of the implementation, some standard operators have been developed. This section discusses the most noteworthy ones.
3.6.1 The Fast Fourier Transform (FFT) operator
This operator executes the Fast Fourier Transform over arrays of any dimension. It is possible to divide an array into sub-arrays of lesser dimension and perform a Fourier transform over each of them; for example, in the case of a matrix, it is possible to request a 1-dimensional Fourier transform over each column. The operator is implemented using the CUDA cuFFT library [63] provided by NVIDIA. Unfortunately, the documentation [63] does not give substantial information on the buffer space required to execute an FFT; it only states that it is between one and eight times the size of the input being transformed. To be on the safe side, the operator has been set to request a buffer eight times larger than the input.
3.6.2 The all-prefix-sums operator
Given a sequence $x = \{x_0, x_1, \ldots, x_{N-1}\}$, the all-prefix-sums operator performs an accumulated addition over the elements to return the sequence $y$ defined as follows:
(3.1)  $y_0 = 0, \qquad y_i = \sum_{j=0}^{i-1} x_j$
This operation is also known as an exclusive scan.
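The semantics of equation 3.1 can be illustrated with a small CPU-side sketch; C++17's std::exclusive_scan implements exactly this operation, while the GAFW operator performs the same computation on the GPU. The function name here is illustrative.

```cpp
#include <numeric>
#include <vector>

// CPU illustration of equation 3.1: the all-prefix-sums (exclusive scan) of x
// returns y with y[0] = 0 and y[i] = x[0] + ... + x[i-1].
std::vector<int> allPrefixSums(const std::vector<int>& x) {
    std::vector<int> y(x.size());
    std::exclusive_scan(x.begin(), x.end(), y.begin(), 0);
    return y;
}
```

For example, allPrefixSums({3, 1, 7, 0, 4}) returns {0, 3, 4, 11, 11}.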
3.6.3 The general reduction template
This is a generic operator (in C++ terms: a template). It provides general code for operators that handle reductions of the form:
(3.2)  $y_j = f(x_{j,0}) \oplus f(x_{j,1}) \oplus \cdots \oplus f(x_{j,N-1})$
where

$\{x_{j,i}\}$ is a sequence of sequences, whereby each element sequence is an input to the GAFW operator.

$\oplus$ is a general mathematical operator that has to be associative.

$f$ is another general mathematical operator that does not need to be associative.
A specialisation of the template needs to define the two mathematical operators $\oplus$ and $f$. The value of $N$ is determined from the size of the input arrays. The operator can reduce over dimensions: for example, in the case of a matrix, it is possible to reduce each column separately. The result will be a 1-dimensional array of length equal to the number of rows of the matrix.
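A serial CPU sketch of the reduction template's semantics may clarify the roles of the two mathematical operators. The function name and signature are illustrative only; the real operator performs the same fold on the GPU, as described next.

```cpp
#include <vector>

// CPU sketch of the general reduction template: each input sequence x_j is
// mapped element-wise through f and folded with the associative operator op.
// Map (f) and Op (the associative operator) are supplied by the
// specialisation, together with the operator's identity element.
template <typename T, typename Map, typename Op>
std::vector<T> generalReduce(const std::vector<std::vector<T>>& sequences,
                             Map f, Op op, T identity) {
    std::vector<T> out;
    out.reserve(sequences.size());
    for (const auto& seq : sequences) {   // one reduction per input sequence
        T acc = identity;
        for (const T& v : seq) acc = op(acc, f(v));
        out.push_back(acc);
    }
    return out;
}
```

For instance, with f(v) = v*v and the associative operator being addition, reducing {{1, 2, 3}, {4}} yields {14, 16}.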
Reductions are simple to implement over GPUs. In this implementation, a kernel is run with a configuration of maximum threads per block and a number of blocks high enough to reach nearly 100% real occupancy. The workload is split evenly over all threads, such that each thread reduces a subset of the elements in each sequence. Per-thread results are saved in shared memory, so that a reduction can be run over all results produced by the threads of a given block. Each block's result is stored in global memory in such a way that a second kernel re-applies the reduction over the values calculated by each block.
Chapter 4 The Maltaimager
The Maltaimager (mtimager) is a high-performance image synthesiser for radio interferometry. It is implemented in C++ [61] and CUDA [62], and uses the GAFW infrastructure (refer to chapter 3). It achieves unprecedented performance by means of GPUs. The multi-threaded CPU design for handling pre- and post-processing of data is also a crucial ingredient in ensuring the best performance.
The image is synthesised using w-projection (refer to section 2.2). An independent multiple-Stokes dirty image is synthesised for each channel. Measurement data is expected to be available as a MeasurementSet [17] stored using the Casacore Table System [66]. Output images are stored as a 4-dimensional data cube in a FITS [67] primary image. Most of the calculations are done in single-precision mode [68, 69, 70].
4.1 High level design and concepts
The design is based on the General Array Framework (GAFW) philosophy (refer to chapter 3). The system is made up of seven autonomous components as depicted in Figure 4.1. None of them interact directly with each other, and with a few exceptions, they are unaware of each other. Data is communicated between each component using the GAFW, by means of result objects (see section 3.3.5). The main() function integrates all components together.
Each component is assigned a unique set of tasks. The Configuration Manager takes care of producing configuration information for each component based on the local configuration and environment. The Visibility Manager loads data from a MeasurementSet, sorts, converts and inputs data in the GAFW infrastructure. The WImager performs gridding over a multipolarised grid, while the Image Finaliser converts the grid to a multipleStokes dirty image. These components are implemented as GAFW modules, and all calculations are made over GPUs. Channels are processed independently, and, for each channel, separate instances of the two components are set up. Numerical representations of the convolution functions (defined in equation 2.24) are generated by the Convolution Function Generator. The Image Manager stores output images in FITS files, while the Statistics Manager processes statistics, generated by each component, including the GAFW. It then reports them in various CSV (CommaSeparated Values) files.
Data is processed in chunks to exploit the parallel mechanisms available on the hardware. This is essential to ensure high performance: a GPU is itself a parallel device which can only achieve high performance through parallel methods, and presenting the GPU with a suitably sized chunk of data ensures the best gridding performance. In the case of a multi-GPU system, by virtue of the GAFW, the imaging tool grids independent channels over different GPUs so as to achieve concurrency across GPUs.
Fast gridding on GPUs is the imaging tool's strong suit. Nevertheless, CPU-bound pre-processing and post-processing of data is a necessity, since, by design, GPUs are limited: for example, they cannot load data from hard disk, they do not recognise C++ objects, and they cannot save images to permanent storage. If these pre- and post-processing steps are not well handled on the CPU, they can severely compromise performance.
The execution time of pre- and post-processing steps is hidden by having the CPU, GPUs and permanent storage I/O running in parallel. Most components perform their tasks asynchronously to each other. Since data is processed in chunks, the Visibility Manager prepares the chunks in the background through a multi-threaded mechanism. Once the first chunk is available for gridding, the WImager component (by virtue of the GAFW) grids the data over the GPU while the Visibility Manager continues preparing other chunks. This achieves concurrent use of the CPU and GPUs.
Channels are processed one after the other (for multi-GPU systems, channels are processed in parallel, as many at a time as there are GPUs). Once a channel's dirty image is finalised, the Image Manager saves the image to disk while subsequent channels are being processed. This ensures concurrent use of the GPU and permanent storage. The Visibility Manager, by virtue of its multi-threaded design, ensures concurrency between CPU processing and permanent storage I/O, and exploits the multi-core infrastructure of modern CPUs.
The core of gridding is based on Romein’s algorithm [37] which is enhanced, implemented and adapted for the necessities of this imaging tool. The algorithm requires that data is grouped by baseline and sorted in time. The ordering is done by the Visibility Manager.
The imaging tool supports single-, dual- or quad-polarised data (polarisation can be linear or circular; both are supported). The term multi-polarisation is used to describe any number of polarisations. It should be noted that the imaging tool handles each multi-polarised visibility record as one entity.
Output images are converted into the I, Q, U, V Stokes format, or a subset thereof, depending on the polarisations available.
Visibility data is gridded using the natural weighting scheme (see section 2.1.2). The required variance is read from the MeasurementSet.
Flagging is also supported. During pre-processing phases of the data, not handled by the imaging tool, some visibility records might be flagged for various reasons, including erroneous readings. The imaging tool does not grid any such flagged data. Note that flagging is done per polarisation, and a flag has a value of 1 when the respective polarised data is not to be gridded and 0 otherwise.
The final point in this section concerns channel frequencies. Since, in the general case, the interferometer resides on the Earth, channel frequencies are normally given in the topocentric frame of reference. However, to make corrections for Doppler effects caused by the motion of the Earth, the imaging tool grids using the Local Standard of Rest Kinematic (LSRK) frame of reference. Frequency values in this frame of reference are not given directly by the MeasurementSet, so a conversion is required. This is taken care of by the Visibility Manager using the CPU. It is the only calculation done in double-precision mode [68].
4.2 Runtime configuration and the Configuration Manager
Runtime configuration of the mtimager is done via command line arguments and configuration files. The configuration is a set of parameters defined by keyword and value. This is similar to the Common Astronomy Software Applications (CASA) [46] and the standalone lwimager tool based on the casacore libraries [71]. The keyword and its respective value are separated by the equals (=) character. For example, nx=100 defines a parameter named nx with its value set to 100. A type such as string, boolean or integer is attributed to a parameter value. Boolean parameters can be set to true by putting a minus (-) just before the parameter name. For example, -test and test=true are equivalent.
Parameters can be set within a manually specified configuration file. In principle, all parameters can be specified as command line arguments to mtimager, but this makes the command quite long and subject to errors (an issue in lwimager). The use of a configuration file solves the problem. It is a simple text file where parameters are listed one per line. Empty lines are allowed, and lines beginning with the number sign (#) are treated as comments and ignored. Parameters set through the command line take precedence over those defined in the configuration file. This enables the user to keep a default configuration stored in a file and partially override it with command line arguments.
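A minimal sketch of such a parser is given below. It is an illustration of the keyword=value scheme just described (comments, boolean minus flags, command line overriding the file), not the actual mtimager parser; all names are invented for the example.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Illustrative parser (not the mtimager's actual code) for the keyword=value
// scheme: '#' lines are comments, a bare "-name" sets a boolean to true, and
// values parsed later (the command line) override earlier ones (the file).
using Config = std::map<std::string, std::string>;

void parseLine(const std::string& line, Config& cfg) {
    if (line.empty() || line[0] == '#') return;            // blank or comment
    std::string::size_type eq = line.find('=');
    if (eq == std::string::npos) {
        if (line[0] == '-') cfg[line.substr(1)] = "true";  // -test => test=true
        return;
    }
    cfg[line.substr(0, eq)] = line.substr(eq + 1);
}

Config parseConfig(std::istream& file, int argc, const char** argv) {
    Config cfg;
    std::string line;
    while (std::getline(file, line)) parseLine(line, cfg);
    for (int i = 1; i < argc; ++i) parseLine(argv[i], cfg);  // CLI wins
    return cfg;
}
```

Parsing the file first and the command line second gives the precedence rule described above for free: a later assignment to the same keyword simply overwrites the earlier one.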
The mtimager uses a logging system based on the Apache log4cxx API [72]. It requires an extra configuration file with a format defined by the API; its location is configurable through a parameter.
The Configuration Manager is the holder of all configuration logic. It is a passive component, as all its logic runs in the main thread, and it does not interact with any other component. It produces a component-specific configuration that is passed by the main thread to the component during initialisation. Due to its nature, the Configuration Manager is code-wise dependent on the other components. This dependency is one-way, since the other components are neither aware of nor dependent on the Configuration Manager.
4.3 The Convolution Function Generator
The Convolution Function Generator is entrusted with calculating the w-projection oversampled convolution functions $\tilde{C}$ (defined in equation 2.24). It is not practical to calculate the convolution functions during gridding; instead, they are numerically evaluated and presented to the WImager component in an oversampled format.
4.3.1 General points
This component can generate convolution functions in two modes: normal mode and w-projection mode. The mode used is configurable at runtime.
In normal mode, the generator only evaluates $\tilde{C}$ at $w = 0$ and the w-term is ignored (refer to section 1.4). The term "normal" is used by other imaging tools such as CASA [46] and lwimager, where it describes the same behaviour as in the mtimager. In this mode, the support of the convolution function is runtime configurable: the system will zero-pad or trim the function so as to produce the desired support. This feature is useful for executing performance tests, and it was pivotal in many of the experiments described in chapter 5.
In w-projection mode, $\tilde{C}_w$ is evaluated over various w-planes, depending on the runtime configuration. Each convolution function is trimmed to its support size. Support is evaluated after examining the generated data. As already discussed in section 2.2, the support of $\tilde{C}_w$ is a function of $w$ that increases with increasing $w$.
In the two modes, the choice of the tapering function is runtime configurable. Tapering functions are implemented as GAFW operators, and the choice is defined by specifying the name of the operator. In this thesis, only one tapering function operator, called casagridsf, has been developed. It implements the same prolate spheroidal function used in the casacore API [71]; it is based on work presented by Schwab [40] and has been adapted to work over GPUs.
The oversampling factor is a runtime configurable variable. Since memory is limited, it is suggested to keep it at a low value of 4, which is the value proposed by Cornwell et al. [26] and hardcoded in lwimager and CASA. Zero-padding is used as an interpolation scheme.
The generator samples $\tilde{C}$ in $w$. As recommended by Cornwell et al. [26], $w$ is scaled in $\sqrt{w}$ rather than linearly. There is no need to calculate $\tilde{C}_w$ for $w < 0$, since the convolution functions are symmetrical around $w = 0$. The maximum $w$ to consider is runtime configurable.
4.3.2 Mathematical treatment and outputs
The Convolution Function Generator outputs three GAFW results that contain all convolution functions data required for WImager to do its job. Some simple mathematical treatment is given here to help describe the content of the outputs.
When a record is gridded, only the numerical data of one convolution function that falls on the pixels of the UV-grid is used. Any oversampled point that does not fall on the UV-grid is not considered. Define $P(u',v',w')$ as the function describing the operation that chooses the data, where $(u',v',w')$ are the baseline components of the record being gridded expressed in number of wavelengths. The function returns the chosen data in a matrix with dimension $N_{\mathrm{sup}} \times N_{\mathrm{sup}}$, where $N_{\mathrm{sup}}$ is the support of the convolution function $\tilde{C}_{w'}$. Define the half-width $M = (N_{\mathrm{sup}} - 1)/2$, and let $P_{jk}$ be an element of $P$, where $j$ is the 1-based index of the row, and $k$ is the 1-based index of the column. Then:
(4.1)  $P_{jk}(u',v',w') = \tilde{C}_{w'}\left(u_j - u',\, v_k - v'\right)$
where $u_j$ and $v_k$ satisfy the simultaneous equations:
(4.2a)  $u_j / \Delta u \in \mathbb{Z}$
(4.2b)  $v_k / \Delta v \in \mathbb{Z}$
(4.2c)  $\left|u_j - u' - (j - M - 1)\,\Delta u\right| \le \Delta u / 2$
(4.2d)  $\left|v_k - v' - (k - M - 1)\,\Delta v\right| \le \Delta v / 2$
Note that the image of $P$ is finite, since $\tilde{C}$ is oversampled in $u$ and $v$, and sampled in $w$. (The image of an arbitrary function $f$ is defined as the set $\{f(x) : x \in \operatorname{dom} f\}$.)
As defined in equation 4.5, the WImager also calculates a normaliser. There are substantial computational savings if the summation of the real parts of the elements of $P$ is pre-calculated. Let $S$ define such a summation, such that:
(4.3)  $S(u',v',w') = \sum_{j=1}^{N_{\mathrm{sup}}} \sum_{k=1}^{N_{\mathrm{sup}}} \Re\left[P_{jk}(u',v',w')\right]$
where $\Re$ is the real operator.
The three outputs containing all data related to $\tilde{C}$ are now explained. The first output is referred to as the Convolution Functions Numeric Data output. It contains all numerical data of the oversampled $\tilde{C}$, laid out as follows. The data is first ordered by convolution function in increasing $w$. Within each convolution function, the elements of each matrix member of the image of $P$ are coalesced in memory, sorted in row-major form. This is in accordance with GPU best practices, since when a record gets gridded, the selected convolution function data is accessed in parallel. All matrix members of the image of $P$ are ordered in increasing order of the oversampled offsets in $u$ and $v$.
The second output is referred to as the Convolution Function Data Position and Support output. It contains two elements for each convolution function. The first element is the function's support. The second is an index pointing to the first element of the first output that describes the convolution function. This index is vital for finding the right data to use for gridding.
The third output contains the image of $S$ (the Convolution Function Data Sum). The layout is similar to the layout of the first output.
4.3.3 Data Generation
A CPU/GPU hybrid algorithm is used to generate the three outputs described. CPU work is done using the imaging tool's main thread. The payload on the CPU is negligible, and the thread spends most of its time waiting for some GAFW result to become available. While this component is calculating, the main thread is not available to spawn any new work. This does not degrade performance, since any new work to be spawned after the generation of the convolution functions depends on the generated data. The Visibility Manager component is initialised before the generator, such that the convolution functions are generated by the Convolution Function Generator component at the same time as the Visibility Manager prepares data for gridding.
The three outputs are generated using Algorithm 1. All functions are normalised using a constant normaliser value. Support is calculated by detecting the pixel nearest to the edge that has a magnitude larger than a set threshold. The GPU thread configuration is set such that, for most of the steps, one thread handles one pixel.
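The support-detection step just described can be sketched serially as follows. The function name and the explicit threshold parameter are illustrative; in the tool this step runs as a GPU kernel with one thread per pixel.

```cpp
#include <algorithm>
#include <complex>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Illustrative serial sketch of support detection: scan an n x n sampled
// convolution function (row-major) for the pixel farthest from the centre
// whose magnitude exceeds the cut-off, and derive the support from that
// distance. The threshold is a plain parameter here; in the tool it is a
// fixed cut-off applied by the generation kernels.
int detectSupport(const std::vector<std::complex<float>>& cf,
                  std::size_t n, float threshold) {
    const int centre = static_cast<int>(n) / 2;
    int halfWidth = 0;
    for (std::size_t row = 0; row < n; ++row)
        for (std::size_t col = 0; col < n; ++col)
            if (std::abs(cf[row * n + col]) > threshold) {
                int du = std::abs(static_cast<int>(col) - centre);
                int dv = std::abs(static_cast<int>(row) - centre);
                halfWidth = std::max({halfWidth, du, dv});
            }
    return 2 * halfWidth + 1;   // full (odd) support in pixels
}
```

On the GPU the per-pixel comparison is trivially parallel and the maximum over all pixels becomes a reduction of the kind described in section 3.6.3.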
4.4 The WImager Component
The WImager component executes the w-projection algorithm (refer to section 2.2) over GPUs, through the General Array Framework.
An instance of the component handles one multipolarised UVgrid. Since the imaging tool treats each channel separately, an instance of the WImager component is created for every channel to grid.
The component is designed to be flexible such that it can be used in other configurations. For example, channel data could be grouped by baseline and gridded over independent instances of the WImager component, with the grids added up on finalisation. The component can also be used as it is for w-stacking (discussed in section 2.2), where each w-plane is gridded separately using independent instances of the component.
The WImager component is a GAFW module (refer to section 3.3.7). All inputs and outputs are GAFW result objects (see section 3.3.5). The whole algorithm is implemented as one calculation tree and is fully GPU based. It is free from CPU calculations to ensure the asynchronous behaviour explained in section 4.1.
The implemented algorithm is, from here onwards, referred to as the WImager algorithm. The calculation tree is depicted in Figure 4.2. Some GAFW operators are grouped into one block and then expanded in subsequent Figures 4.3, 4.4 and 4.5. Arrays are denoted with keys, tabulated in Table 4.1. The table includes the mathematical reference used to represent each sequence and a reference to the section where the sequence is discussed.
The WImager algorithm is based on Romein's algorithm, reviewed in section 2.5. It is enhanced and adapted to suit the requirements of the mtimager. One of the main enhancements made over Romein's algorithm is compression, discussed in section 4.4.3.
Records input to the WImager algorithm are expected to be grouped by baseline and sorted by time. If this requirement is not respected, the algorithm will still work, but with a heavy penalty in performance. Note that there is no restriction on the number of baselines in an input chunk of data, and the number of records per baseline can vary. Input visibility records do not need to be on the grid, and flagging is supported. The algorithm adapts to the varying support of the convolution functions: it is perfectly fine for records in one input chunk to require different-sized convolution functions to be gridded.
The WImager algorithm is conveniently split into two phases: the preparation phase and the gridding phase. The main difference between the two phases is the thread configuration. In the preparation phase, each multi-polarised visibility record is processed by one thread, or in some parts with a ratio of threads per record below unity. In the gridding phase, a record is processed by a block of threads; in this case, the threads-per-record ratio is of the order of thousands or millions, and records are thus expensive to process. The gridding phase could hold all preparation phase logic, but, for best performance, the gridder is stripped of any logic that can be implemented in the preparation phase. Note that the gridding phase is implemented in one GAFW operator; all the other operators in the calculation tree shown in Figure 4.2 are part of the preparation phase.
The rest of this section explains the WImager algorithm in detail. The subject is tackled as follows: first, the nomenclature and terminology used are defined and explained (section 4.4.1). Inputs, outputs and the mathematical equations describing the algorithm are given in section 4.4.2. In the subsequent sections, the gridding phase is discussed, whereby all logic that is delegated to the preparation phase is identified. The discussion on the WImager component is concluded by explaining the preparation phase in detail in section 4.4.5.
Key  Sequence name  Mathematical symbol

Visibility Records Inputs (defined in section 4.4.2)
I1  Baseline in meters
I2  Channel Frequency
I3  Weight
I4  Flags
I5  Visibility

Convolution Functions Input Data (defined in section 4.3.2)
C1  Convolution Functions Numeric Data  N/A
C2  Convolution Functions Data Position and Support  N/A
C3  Convolution Functions Data Sum  N/A

Outputs (defined in section 4.4.2)
O1  UV-Grid
O2  Normaliser

Binary Sequences (defined in section 4.4.5)
B1  Gridding Indicator
B2  Compression Indicator
B3  Support Change Indicator
B4  Manipulated Support Change Indicator  N/A

Scanned Sequences (defined in section 4.4.5)
SB1  Prefix Sum of Gridding Indicator
SB3  Prefix Sum of Support Change Indicator
SB4  Prefix Sum of Manipulated Support Change Indicator  N/A

Index Sequences (defined in section 4.4.5)
D1  Gridding Index
D2  Last Compressed Entry Index
Trial Plan  Trial Plan
Plan  Plan

Non-Filtered Sequences (defined in section 4.4.5)
S1  Position
S2  Convolution Data Pointer
S3  Support
S4  Normaliser
S5  Take Conjugate

Filtered Sequences (defined in section 4.4.5)
FS1  Position
FS2  Convolution Data Pointer
FS3  Support
FS5  Weighted and Compressed Visibilities
4.4.1 Nomenclature and Terminology
The various terminologies and nomenclature used in this section are defined and explained here. As pointed out in section 4.1, visibility records are input to the component in chunks. As indicated in Figure 4.2, the record data chunk is split into different arrays. It is reasonable to refer to and represent GAFW arrays related to record data as mathematical sequences.
The usual "curly brackets" format is used to define mathematical sequences (for example: $\{x_i\}$). It is to be assumed that the sequence is finite and that the first element is at index 0, that is, $x_0$. The letter $N$ implicitly defines the length of the sequence.
As record data is processed by the WImager algorithm, some of it is removed. The process of removal is referred to as filtering. Sequences that have had records removed from them are referred to as filtered sequences. Notation-wise, filtered sequences are distinguished from their non-filtered counterparts by having the letter $F$ as a superscript on the element. For example, $\{x_i^{F}\}$ is the filtered version of the sequence $\{x_i\}$.
An element in any sequence gives data about one record. Elements in different sequences, either all non-filtered or all filtered, describe the same record only if they are at the same position.
The algorithm makes use of binary sequences. The term indicator is used to refer to elements in such sequences. The letter $b$ is set as a superscript on the sequence's element to show that the sequence is a binary sequence. For example, $\{x_i^{b}\}$ is a binary sequence. Note that if the binary sequence is also filtered, the letter $F$ is retained as a superscript. For example, $\{x_i^{bF}\}$ is a filtered binary sequence.
Sometimes a prefix sum (exclusive scan) is run over a binary sequence to produce a scanned sequence. The resulting scanned sequence is denoted by the letter $s$ set as a superscript on the sequence's element. For example, $\{x_i^{sb}\}$ is the resulting scanned sequence of $\{x_i^{b}\}$. If the binary sequence is a filtered one, the superscript letter $F$ is retained, and the resultant filtered scanned sequence is denoted by $\{x_i^{sbF}\}$.
Index sequences are defined as sequences containing integer index elements pointing to a record. These are generated by using binary sequences as predicates. Index sequences are denoted by using the predicate sequence with the superscript letter $D$ to show that the sequence is an index sequence. For example, $\{x_i^{Db}\}$ is the index sequence generated using $\{x_i^{b}\}$ as predicate. If the index sequence points to records in a filtered sequence, the superscript letter $F$ is also used (for example, $\{x_i^{DbF}\}$).
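Generating an index sequence from a binary predicate sequence is the standard stream-compaction idiom; a serial sketch (with illustrative names) shows how an exclusive scan of the 0/1 indicators yields each selected record's output slot.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Serial sketch of stream compaction: scan the 0/1 predicate to obtain each
// selected record's output slot, then scatter the record indices. On the GPU
// the scatter loop becomes one thread per record.
std::vector<std::size_t> indexSequence(const std::vector<int>& predicate) {
    std::vector<std::size_t> scanned(predicate.size());
    std::exclusive_scan(predicate.begin(), predicate.end(), scanned.begin(),
                        std::size_t{0});
    std::size_t total = predicate.empty()
                            ? 0
                            : scanned.back() + (predicate.back() ? 1 : 0);
    std::vector<std::size_t> indices(total);
    for (std::size_t i = 0; i < predicate.size(); ++i)
        if (predicate[i]) indices[scanned[i]] = i;   // scatter
    return indices;
}
```

For a predicate {1, 0, 0, 1, 1, 0} the scan gives {0, 1, 1, 1, 2, 3} and the resulting index sequence is {0, 3, 4}.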
To represent sequences made up of tuples, the usual notation of a comma-separated list of sub-elements enclosed in parentheses is used. For example, $\{(a_i, b_i, c_i, d_i)\}$ represents a finite sequence of 4-tuple elements. Most sequences are directly linked with the input data that defines a visibility record in full.
Various sequences include polarisation dependent data. For such sequences, the letter $p$ distinguishes between the polarisations and is set as a subscript after the curly brackets. For example, $\{x_i\}_p$ contains polarisation dependent data. When referring to an element, the notation $x_{i,p}$ is used.
4.4.2 Inputs, outputs and mathematical equations
The WImager algorithm receives visibility record data in five inputs. Three further inputs contain data of the convolution functions in numerical form; these are supplied by the Convolution Function Generator and discussed in section 4.3. There are two outputs for this algorithm: a multi-polarised UV-grid and a normaliser. Details on the inputs and outputs of the algorithm follow.
The five visibility record related inputs are:

- baseline in meters as projected on the UVW axes: $\{(u'_i, v'_i, w'_i)\}$. The apostrophe indicates that values are in meters and not in number of wavelengths.
- channel frequency (per speed of light): $\{1/\lambda_i\}$, where $\lambda_i$ represents the channel wavelength, which is the inverse of the channel frequency divided by the speed of light.
- flags: $\{F_i\}_p$ (flagging was discussed in the penultimate paragraph of section 4.1).
- weight: $\{W_i\}$
- visibility: $\{V_i\}_p$
The first output of the algorithm is the multi-polarised UV-grid sampled at regular intervals $\Delta u$ and $\Delta v$. The pair of equations 4.4 gives a mathematical representation of how the algorithm calculates this output. It is based on the w-projection gridding mathematical treatment given in sections 2.1.1 and 2.2.1, with some adaptations and added features.
(4.4a)  $\breve{V}_p(u,v) = \sum_{i=0}^{N-1} g_i\, b_{i,p}\, \tilde{C}_{w'_i/\lambda_i}\!\left(u - \frac{u'_i}{\lambda_i},\, v - \frac{v'_i}{\lambda_i}\right) \mathrm{III}\!\left(\frac{u}{\Delta u}, \frac{v}{\Delta v}\right)$
(4.4b)  $b_{i,p} = (1 - F_{i,p})\, W_i\, V_{i,p} + z_{i+1}\, b_{i+1,p}$
$\tilde{C}$ represents the oversampled w-projection convolution functions, defined in equation 2.24.
$g_i$ is the gridding indicator. It is calculated within the algorithm and defines whether a record is to be gridded by the gridding phase. When $g_i$ is set to 1, the respective record is gridded by the gridding phase; when set to 0, it is not.
$g_i$ is set to 0 if and only if one of the following three conditions is met. The first condition is when gridding the record requires updates to pixels outside the grid. The second condition is when all polarisations of the record are flagged (that is, $F_{i,p} = 1$ for all $p$). The last condition is when the record is compressed (compression is discussed in section 4.4.3).
When a record is compressed, the compression indicator $z_i$ is set to 1; otherwise, it is set to 0. Note that, for a given record, $g_i$ and $z_i$ cannot both be 1; hence $g_i + z_i$ can have a value of 0 or 1. When $g_i + z_i$ equates to 0, the given record has no effect on the output grid. If $g_i = 0$ and $z_i = 1$, then the record will still be gridded and will affect the output, but not through the gridding phase. This is explained in detail in section 4.4.3.
It should be noted that when a flag is set to 1, the respective polarised visibility is ignored, since its contribution acquires a value of 0 over the whole grid. The gridding phase will still grid the polarised visibility, but with no effect, since it will grid zeros.
The second output of the WImager is the set of polarisation dependent normalisers $n_p$. These are required by the Image Finaliser component to output images in the correct units. The following equation gives a mathematical representation.
(4.5)  $n_p = \sum_{i=0}^{N-1} (1 - F_{i,p})\, W_i\, S\!\left(\frac{u'_i}{\lambda_i}, \frac{v'_i}{\lambda_i}, \frac{w'_i}{\lambda_i}\right)$
$S$ is defined in equation 4.3, and its image is given in the Convolution Function Data Sum input.
4.4.3 Compression
Romein's algorithm exploits the behaviour of the baseline trajectory, and the WImager algorithm exploits it further. There is a probability that, for some consecutive records sampled on the trajectory, the change in $(u,v,w)$ is so small that the convolution function values and the position of gridding do not change. In mathematical terms, there is a probability that for two consecutive records at positions $i$ and $i+1$, the equality shown below holds:
$\tilde{C}_{w'_i/\lambda_i}\!\left(u - \frac{u'_i}{\lambda_i},\, v - \frac{v'_i}{\lambda_i}\right) \cdot \mathrm{III}\!\left(\frac{u}{\Delta u}, \frac{v}{\Delta v}\right) = \tilde{C}_{w'_{i+1}/\lambda_{i+1}}\!\left(u - \frac{u'_{i+1}}{\lambda_{i+1}},\, v - \frac{v'_{i+1}}{\lambda_{i+1}}\right) \cdot \mathrm{III}\!\left(\frac{u}{\Delta u}, \frac{v}{\Delta v}\right)$
In this case, by virtue of equation 4.4, the weighted and flagged visibilities of the two entries are summed together in the preparation phase and gridded as one record. This thesis terms the process compression, in view that some records are "compressed" into one.
The effectiveness of compression is dependent on many parameters. For example, compression depends on the observation integration time, the grid configuration, and the oversampling factor of the convolution functions. A short integration time causes the interferometer to sample records at a faster rate than a longer integration time, meaning that the Earth rotates less from one sampled record to the next over a trajectory. There are therefore smaller changes in $(u,v,w)$, which imply a higher probability of compression. Short baselines tend to have a shorter trajectory than long baselines: within a given time interval, records sampled over short baselines travel shorter distances on the UV-grid than those over longer baselines. This leads to a higher probability of compression for shorter baselines.
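A serial sketch may make the compression idea concrete. The Record fields are illustrative: the real algorithm works on the sequences defined in section 4.4 and decides equality from the oversampled convolution-function selection and grid position, but the merging logic is the same.

```cpp
#include <complex>
#include <vector>

// Illustrative sketch of compression: consecutive records of a baseline whose
// convolution-function choice and grid position coincide are merged by summing
// their weighted visibilities, so the expensive gridding phase sees one record
// instead of many. Field names are invented for this example.
struct Record {
    long gridU, gridV;               // grid pixel the function is centred on
    long oversampleU, oversampleV;   // sub-pixel (oversampled) offset
    int wPlane;                      // w-plane of the convolution function
    std::complex<float> weightedVis; // weight * visibility (flags applied)
};

static bool sameKey(const Record& a, const Record& b) {
    return a.gridU == b.gridU && a.gridV == b.gridV &&
           a.oversampleU == b.oversampleU && a.oversampleV == b.oversampleV &&
           a.wPlane == b.wPlane;
}

// Records must arrive grouped by baseline and sorted in time (section 4.1).
std::vector<Record> compress(const std::vector<Record>& in) {
    std::vector<Record> out;
    for (const Record& r : in) {
        if (!out.empty() && sameKey(out.back(), r))
            out.back().weightedVis += r.weightedVis;  // fold into previous
        else
            out.push_back(r);
    }
    return out;
}
```

Because records are already sorted in time per baseline, only the immediately preceding record needs to be inspected, which is what makes the check cheap enough for the preparation phase.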
4.4.4 The gridding phase
The gridding phase does the final gridding. It is implemented by one operator named the gridder and is based on Romein’s algorithm. This section discusses implementation details, together with adaptations made to Romein’s algorithm so as to suit the imaging tool. The discussion focuses on the required inputs that the preparation phase needs to generate for the gridder to be as fast as possible.
Filtering
Referring to equation 4.4, whenever $g_i = 0$, the record is not gridded and is ignored by the gridding phase. The calculation of $g_i$ is delegated to the preparation phase. In order to avoid any selection logic based on the value of $g_i$, the preparation phase filters out any records with $g_i = 0$. All input sequences to the gridding phase related to visibility records are thus filtered sequences.
As pointed out earlier, if a polarised visibility value is flagged, the gridder will grid zeros instead of ignoring the polarised value. The gridder does not handle flags and does not have any flag data as input; the zeroing of the visibility value is done in the preparation phase. Such a strategy avoids flag-related logic in the gridding phase and reduces the shared memory usage of a record in the gridder kernels. As will be explained later on, the gridder loads chunks of records into shared memory prior to gridding them. Shared memory is a limited resource, and there is an impact on performance when few records can be loaded. Not loading flags reduces the record footprint in shared memory and thus increases the number of records that can be loaded in one go.
Convolution Position sequence
Part of the calculation required to find the position of the convolution function on the UV-grid adheres well to the thread configuration of the preparation phase and is delegated to it. Instead of the baseline and frequency sequences, the gridding phase is presented with a 4-integer tuple sequence. The tuple, which was proposed in Romein's algorithm, defines the position of the convolution function as follows. Each multi-polarised pixel on the grid is 0-based indexed by two integers representing the $u$ and $v$ value of the pixel respectively. The pixel with $u$ and $v$ values of 0 is in the middle of the grid, and the pixel indexed as $(0, 0)$ has the most negative $u$ and $v$ values (note that this indexing scheme is the same as the C++ indexing method for arrays).
If the record's position on the grid is on the pixel indexed by $(i_u, i_v)$ and the grid is $N_u$ pixels long on the u-axis, then: