High-Performance Image Synthesis for Radio Interferometry
A radio interferometer indirectly measures the intensity distribution of the sky over the celestial sphere. Since measurements are made over an irregularly sampled Fourier plane, synthesising an intensity image from interferometric measurements requires substantial processing. Furthermore there are distortions that have to be corrected.
In this thesis, a new high-performance image synthesis tool (imaging tool) for radio interferometry is developed. Implemented in C++ and CUDA, the imaging tool achieves unprecedented performance by means of Graphics Processing Units (GPUs). The imaging tool is divided into several components, and the back-end handling numerical calculations is generalised in a new framework. A new feature termed compression arbitrarily increases the performance of an already highly efficient GPU-based implementation of the w-projection algorithm. Compression takes advantage of the behaviour of oversampled convolution functions and the baseline trajectories. A CPU-based component prepares data for the GPU which is multi-threaded to ensure maximum use of modern multi-core CPUs. Best performance can only be achieved if all hardware components in a system do work in parallel. The imaging tool is designed such that disk I/O and work on CPU and GPUs is done concurrently.
Test cases show that the imaging tool performs nearly 100 faster than another general CPU-based imaging tool. Unfortunately, the tool is limited in use since deconvolution and A-projection are not yet supported. It is also limited by GPU memory. Future work will implement deconvolution and A-projection, whilst finding ways of overcoming the memory limitation.
Dr Kristian Zarb Adami
\acknowledgementsAs in most ambitious and successful projects, success is only possible with the help of others.
I wish to thank my supervisor, Dr Kristian Zarb Adami, for his 24x7 support, dedication and patience. Special thanks to Dr Oleg Smirnov and Dr Filippe Abdalla, who provided valuable insights and advice. Thanks also to Mr Alessio Magro who gave advice and help on GPU programming. A big thanks goes to Dr John W. Romein for supplying me with the test code and data set related to his work, used in this thesis. Thanks also to Dr Anna Scaife and Dr Cyril Tasse for their help.
My family also gave their contribution. A big thank you goes to my parents Joseph and Maria Dolores Muscat for their support. Their support was given during a period when they also had to face a number of health problems. My father also helped to creating some of the graphics included. A last but not least heartfelt thanks goes to my girlfriend Ms Nicola Attard, who, notwithstanding the challenging environment to our relationship, stood by me and gave me her full support.
Chapter 1 Introduction to Radio Interferometry and Image Synthesis
The first interferometer used in astronomy dates back to 1921 when Albert Abraham Michelson together with Francis G. Pease made the first diameter measurements of the star Betelgeuse . The setting used, known as the Michelson Interferometer, is today the basis of modern interferometers. Michelson had been discussing the use of interferometry in astronomy for at least 30 years before the experiment was done [2, 3]. Throughout the century, the technology of radio interferometers has made enormous advancement and up to this day extraordinarily ambitious projects have been commissioned, and more are on the pipeline. An example of a recently commissioned radio interferometer is the LOw-Frequency ARray (LOFAR) . The Square Kilometre Array (SKA) [5, 6, 7, 8, 9] is the most ambitious project currently under-way.
Radio interferometers can be used for various purposes, even for applications outside the scope of astronomy, such as geophysics [10, 11]. The scope of this thesis is limited to the use of the interferometer as a measuring device for the intensity distribution of the sky over the celestial sphere. The measured quantity of the interferometer is known as visibility, and the thesis’ main focus is on how to recover the said intensity distribution from such measured visibility data.
An interferometer is made up of an array of N 2 antennas, and differently from a single dish antenna it achieves sub-arcsecond resolutions with high accuracy. The maximum angular resolution of a single dish is limited by its diffraction limit. The limit is inversely proportional to the diameter of the dish, that is . On the other hand, the angular resolution of the interferometer is limited by the distance between the furthest two antennas in the array in the same inverse proportional way, that is . Achieving sub-arcsecond resolution in single-dish antennas requires large diameters that are prohibitory. For interferometers, achieving sub-arcsecond resolution is just a matter of having antennas as far away as possible from each other. If necessary, part of the array can be orbiting in space such as in the case of the Very Long Baseline Interferometry (VLBI) Space Observatory Programme (VSOP) .
The basic measurement device (based on the Michelson interferometer) is composed of just two elements. Each possible two element combination of N 2 antennas forms a two element independent measuring device. During an observation, the antenna array tracks a position on the celestial sphere normally within the field of view of the observation. Each basic element makes a measurement of the intensity distribution of the sky in the form of visibility values. visibility readings are done simultaneously by the whole -element antenna array and each reading differs in the geographical set-up of the basic device. The geographical set-up is described by the baseline which is the vector covering the distance between the two antennas. Earth rotation changes the directions of the baseline (assuming a frame of reference stationary with the celestial sphere), and this is taken advantage of by taking subsequent visibility readings . As it will be shown later on in this chapter, a Fourier relationship exists between visibility as a function of baseline and the intensity distribution, provided that certain conditions are met. To take advantage of such a relationship is not an easy computational task and is today an active area of research which this thesis is part of.
This introduction aims to give a brief on the theory of interferometry, defines the measurement equation of the interferometer and discusses an image synthesis pipeline commonly used. The brief serves as a preamble for the discussion on motivations, aims and objectives of this thesis, which is done in the penultimate section. The chapter is concluded by giving an outline of the thesis.
The theory presented in this chapter is based on Thompson et al.  and some online sources, notably presentations given in a workshop  organised by the National Radio Astronomy Observatory(NRAO), and a course  given by the same organisation.
1.1 Analysis of a two-element interferometer
A basic two-element interferometer is depicted in Figure 1.1. It is observing an arbitrary cosmic source in the sky, in the direction of the unit vector . The baseline is defined with the vector . The interferometer is assumed to have a quasi-monochromatic response at a central frequency . Cosmic sources are in the far field of the interferometer, and so the incoming waves at the interferometer can be considered planar over the baseline. Clearly the output of antenna 1 is the same as that of antenna 2 , except that is lagging behind by say seconds. is called the geometric delay. The wave needs to travel an extra distance to reach antenna 1 thus, letting represent the speed of light, equates to:
Since the interferometer is quasi-monochromatic then, the voltage output of antenna 1 and antenna 2 denoted by and respectively can be written as follows:
where is time and is proportional to the intensity of the source.
1.1.1 The correlator
To produce a visibility reading, the two voltages get correlated by first multiplying and then averaging over a period of time known as the integration time. As it will be shown in this section, such a correlation is not enough since some intensity data is lost. A complex correlator is used to make a second correlation between the output of the antennas with one of the outputs phase delayed by 90 degrees. The two outputs of the complex correlator are presented as one complex quantity which is the visibility. Visibility is defined formally in equation 1.9.
Multiplication of with results in:
has a constant term and a sinusoidal term with respect to . Provided that averaging is done over a long enough time interval, the sinusoidal term averages out to 0 implying that the average equates to:
The correlator response is dependent on the direction of the source with a sinusoidal behaviour known as a fringe. The term is known as the fringe phase.
Figure 1.2 shows a plot of a fringe. It clearly indicates that, for some values of , the correlator gives no output. This implies loss of data. If an intensity distribution is considered, then it is easy to show that the even part of the distribution is lost.
Let and be the even and odd part of . Using equations 1.4 and integrating over , results to be proportional to:
The second correlation made over the two antenna outputs, with one of the output phase delayed by 90 degrees, generates an even fringe and will respond to the even intensity distribution. This implies that the complex correlator responds to all the intensity distribution.
1.1.2 Effects of channel bandwidth and the delay centre
If it is assumed that the interferometer is responsive over a bandwidth , whereby intensity is constant throughout, then, based on equation 1.4, the correlator output is:
The term is known as the fringe washing function and degrades the response of the interferometer. The degradation is dependent on the frequency bandwidth and should be kept as narrow as possible, especially for wide-field imaging. An interferometer splits up a wide bandwidth into many frequency channels each of narrow bandwidth, and treats each channel independently.
The fringe washing function has no effect over the interferometer response for . This is because signals of different frequencies will reach the correlator in phase. The geometric delay () can be controlled by inserting an extra delay in the circuit between the antenna receiving the leading signal and the complex correlator. For , the fringe washing function is nullified. Since is a function of , one can choose a direction for which the fringe washing function is nullified. This direction is termed the delay centre.
1.2 The Uvw co-ordinate system
Many variables in interferometry are expressed against a reference direction known as the phase reference centre or phase centre. The phase centre is fixed to the sky and is common practice that it is set to point to the same direction of the delay centre.
The general co-ordinate system used in interferometry is the co-ordinate system and is depicted in Figure 1.3. It is a right handed Cartesian system where axes and are on a plane normal to the phase centre and the -axis in the direction of the phase centre. The -axis is in the East-West direction while the -axis is in the North-South direction.
One of the main uses of the co-ordinate system is to measure the baseline against the phase centre. The baseline components expressed over the -axes defined by are depicted in Figure 1.3. The components are normally given in units of number of wavelengths such that:
where is the channel wavelength, and is the baseline expressed in number of wavelengths.
As visibility measurements are taken consecutively in time, the baseline vector rotates in the UVW-space since the interferometer resides on the Earth surface. The UV-coverage of an observation is defined as the set of values for which the interferometer makes a visibility measurement. Figure 1.4 depicts the UV-coverage of a true LOFAR observation. Each baseline will form an elliptical trajectory, while the interferometer samples (measures) visibility in time. A short baseline tends to form a trajectory near the centre, while a longer one tends to form a trajectory further out.
A clarification on the units of the components is now proper. Expressing in units of number of wavelengths is mathematically convenient and is the general approach found in literature . Nevertheless it is not always practical to use, since values are dependent on channel frequency. The MeasurementSet  is a common format used to store visibility measurements. It specifies data to be stored in meters rendering it valid over all frequency channels. In this thesis the components are expressed in number of wavelengths unless stated otherwise.
When describing a position on the celestial sphere the direction cosines and are used as shown in Figure 1.3. They are defined with respect to the U and V axes such that the direction vector is described as:
1.3 The measurement equation
Consider Figure 1.5. represents the phase reference position. The unit vector retains its definition, that is, a pointer towards an arbitrary position in the celestial sphere. is a solid angle, and is the baseline expressed in number of wavelengths.
Not considered in the previous section, is the non-uniform polarisation dependent reception behaviour of the antennas known as the A-term. Let represent a normalised version of the A-term such that . The complex visibility is defined as:
The complex visibility, as defined in equation 1.9, has dimensions of flux density (). The dimensions of the intensity distribution are . In astronomy, the Jansky(Jy) is a commonly used unit defining the flux density where 1 Jansky=.
As described in  equates to:
Substituting in equation 1.9 the following results:
Equation 1.12 is the measurement equation of the interferometer. It defines the relationship between the interferometer measured quantities and the intensity distribution of the sky.
1.4 Image Synthesis
This thesis hardly gives importance to the effects of and the quotient term in the measurement equation 1.12. For convenience, these two terms are subsumed in the intensity distribution which will be referred to, as the measured intensity distribution .
The measurement equation 1.12 is re-written as:
This section reviews how can be recovered from visibility measurements made by the interferometer. The term , is referred to as the w-term. If it equates to 0 then a Fourier relationship between visibility expressed in terms of and ( ignored) and the measured intensity distribution is obtained, that is:
where is the Fourier operator.
Most of the image synthesis techniques, exploit this Fourier relationship in order to use the computationally efficient Inverse Fast Fourier Transforms (IFFT) algorithms. However, there are various issues that need to be circumvented as to apply such algorithms, which mandates some pre- and post-processing of data. The next subsections gives a brief on the main issues and how they are commonly handled. Based on this brief, Figure 1.6 depicts an imaging synthesis pipeline that is commonly used.
1.4.1 Non-uniform sampling of the visibility plane
Fast Fourier Transform algorithms require that the visibility UV-plane (also referred to as the UV-grid) is uniformly sampled. As already discussed in section 1.2 the UV-coverage of the interferometer is not uniform, implying that some processing is required to generate a uniformly sampled UV-grid.
Convolutional gridding is the technique used in interferometry to transform the non-uniform UV-coverage into a uniform one. Details on this technique are given in section 2.1. Measured data is convolved with an appropriately chosen function and the result is sampled as to generate a uniform UV-grid. After an IFFT takes place, the convolution is reversed by a simple element-by-element division on the intensity image. One notes that convolutional gridding is a topic actively researched upon, especially in its application in medical sciences [20, 21, 22].
The non-uniform UV-coverage implies that the measured visibility data is incomplete. The intensity distribution can never be recovered in full by relying solely on the measured visibility samples. The output intensity image that is generated by the convolution gridding algorithm is called a dirty image in view that it contains artefacts caused by the non-uniform incomplete UV-coverage.
Incompleteness is handled by iterative methods commonly known as deconvolution. Based on a priori knowledge on the intensity distribution, deconvolution is applied over the dirty image in order to recover the intensity distribution. A classic example of a deconvolution algorithm is CLEAN [23, 24] which assumes that the intensity is made up of point sources. Starck et al.  give a review of other deconvolution methods.
1.4.3 The w-term
As pointed out by Cornwell et al. , if the w-term is much less than unity it can be neglected and a two dimensional Fourier relationship results. When synthesising a narrow field of view, the w-term can in many cases, be neglected. Neglecting the w-term when synthesising a wide-field of view, causes substantial distortions.
An advanced method that corrects the w-term, is the w-projection  algorithm. A visibility sample is projected on the plane by means of a convolution. The w-term for the w=0 plane is 0, so effects of the w-term get nullified by the projection. The algorithm is integrated with the convolutional gridding algorithm and is explained in detail in section 2.2.
Other alternatives to w-projection exploit other characteristics of the measurement equation. For example, it is possible to express the measurement equation as a 3-dimensional Fourier transform [27, 28]. Snapshots [26, 29, 28, 30] consider a coplanar array whereby at any given time is a linear combination of and for all the measurements made by the coplanar array [31, 28]. Thus, for short periods of observation time, the w-term causes only a distortion in the lm-plane co-ordinates which is corrected by a linear transform. Facets consider the wide-field intensity image as a sum of smaller images (facets) over which the w-term can be neglected. There are two types of facets: the image-plane facets and the uvw-space facets. These are reviewed in .
Hybrid algorithms using w-projection with any of these alternative algorithms are also possible and are discussed in section 2.2.2.
1.5 Motivation, aims and objectives of the thesis
As new powerful radio telescopes are being built up, the computational demand of the imaging pipeline described in Figure 1.6, is increasing to exuberant levels. It is estimated that Phase 2 of the Square Kilometre Array (SKA) will need as much as 4 Exaflops of computational power , with 90% of computational resources taken by the gridding algorithm . Phase 2 is expected to be commissioned by around 2020 and is predicted that though technology would have advanced by that time, a super-computer delivering computational rates in Exaflops would be at the top of the TOP 500  list [32, 35].
Motivated by such computational requirement, this thesis aims in giving a contribution to the application of high performance computing in radio interferometry. The main objective is to develop a new high-performance imaging tool. It is being called the malta-imager or mt-imager for short. It relies on CUDA111CUDA stands for Compute Unified Device Architecture and is a parallel programming framework for GPUs developed by NVIDIA®  compatible NVIDIA®Graphical Processing Units (GPUs) to synthesise images by means of w-projection. Deconvolution is outside the scope of this work.
The infrastructure handling all numerical calculations is generalised in a framework, which is being called the General Array Framework (GAFW). It is designed to handle different hardware such as GPUs and Field Programmable Gate Arrays (FPGAs). It is being implemented for CUDA compatible GPUs only so as to serve the main objective. The mt-imager is built over this framework to perform all GPU-based computation.
1.6 Outline of thesis
This introductory chapter discussed the main concept of image synthesis in radio interferometry. The next chapter, that is Chapter 2, gives a literature review. The focal point of the thesis is the implementation of w-projection and convolutional gridding. A mathematical treatment and a review of reported implementations of these algorithms are given. The chapter is concluded by giving a detailed description of a gridding algorithm over GPUs proposed by Romein . It is a pivotal algorithm to this thesis and will be often referred to as Romein’s algorithm.
The General Array Framework is the subject of Chapter 3. The framework is discussed in detail and terms are defined.
Chapter 4 is the main chapter whereby the mt-imager is discussed in detail. Note that the chapter uses the terms and definitions given in previous chapters. The imaging tool is divided into components that are independent of each other. The high-level design is first discussed, and then details of the tool are given through a discussion on each component.
Chapter 5 reports on the performance obtained by the tool and shows that the main objective of the thesis has been achieved with success. The chapter also reports some detailed experimental analyses on the performance of Romein’s algorithm as implemented in the imaging tool. These analyses are meant to enrich the knowledge on the algorithm. The chapter is concluded by proposing future work.
The thesis is concluded in Chapter 6.
Chapter 2 Gridding and W-Projection
This chapter reviews convolutional gridding and the w-projection algorithm. Some mathematical treatment is given together with other background knowledge. A literature review on performance and implementation of these algorithms is included together with some notes on GPU programming.
2.1 Convolutional gridding
In interferometry, convolutional gridding is the method used to transform the non-uniform sampled UV-plane (visibility plane) to a uniform sampled one such that Inverse Fast Fourier Transform (IFFT) algorithms can be applied (refer to subsection 1.4.1). Visibility measurements are convolved with an appropriately chosen convolution function . The resultant UV-plane is then sampled uniformly, over which an IFFT is applied. The result is an image in the lm-plane (intensity distribution plane), which is corrected from convolution by a simple element-by-element division. The output is a dirty image of the intensity distribution, which is the true intensity aliased by the non-uniform incomplete UV-coverage.
In this section, convolutional gridding is discussed via a mathematical treatment. The concept of weighting is introduced, and some topics on the convolution function are reviewed.
2.1.1 Mathematical review
The mathematical treatment given in this section is based on, and adapted from, Jackson et al. .
The measured intensity distribution considered here, has been defined in section 1.4 as follows:
For this treatment, the w-term is ignored such that Fourier relationship expressed in equation 1.15 is true, that is:
Let be the number of visibility measurements that the interferometer samples, and let be a zero-base index such that and are the and values of a measured visibility point. The sampling function of the interferometer is defined as:
where is the two dimensional Dirac function.
As discussed further on, it is desirable to control the shape of the sampling function. This is done by introducing a weight for each sampled visibility, such that the weighted sampling function equates to:
is known as the synthesised beam, dirty beam or point spread function (PSF). It is here denoted as .
With interferometric sampling and use of weights, the visibility plane is transformed into a non-uniform sampled plane . The desire is to transform the non-uniform sampled plane to a uniform UV-grid , sampled at regular intervals and on the U and V axes respectively. This is achieved by convolving with an appropriately chosen convolution function and then uniformly sampling the output.
where the operator is a two-dimensional convolution, and is the Shah function defined as:
An inverse Fourier transform is applied on which equates to:
where is referred to as the tapering function.
Convolution with the Shah function causes a replication over at intervals in and in . The synthesised field of view is thus dependent on the choice of and . Only one replica is truly calculated which is mathematically equivalent to multiplying with a rectangular function. To complete the process, the image is divided by so as to reverse the convolution and produce the non-normalised dirty image .
Equation 2.9 describes the output of the convolutional gridding algorithm.
2.1.2 Deconvolution and weighting schemes
The PSF aliases in such a way that cannot be recovered through direct methods. This is the effect of data incompleteness. Iterative methods known as deconvolution are applied to try to recover to the best possible accuracy (refer to section 1.4.2). The effectiveness is dependent on the form of the PSF, which is desired to be as compact as possible. If too few visibility measurements samples are considered, then, the PSF might be too wide for a proper recovery. In the extreme case, where only one measurement is considered, the PSF is infinitely wide.
The weight introduced in equation 2.4 plays a pivotal role in controlling the form of the PSF. As visible in Figure 1.4, there is a higher density of visibility measures in regions covered by shorter baselines. This tends to overemphasise the long spatial wavelength of the PSF causing wide skirts. The uniform weighting scheme caters for this issue by weighting each measurement with a value inversely proportional to the neighbourhood density . In another scheme, called natural weighting, the density issue is given second priority, and visibility data is weighted inversely proportional to its variance so as to obtain the best signal to noise ratio . Short baseline data will still be overemphasised, but an in-between scheme, known as robust weighting , tries to compromise between natural and uniform weighting schemes.
2.1.3 The tapering function
The form of the tapering function is crucial, since the replication caused by the convolution with the Shah function, forces to alias the image. Aliasing can be fully suppressed if the tapering function covers the whole field of view defined by and then goes to 0 outside the region. The infinite sinc function (in the UV-plane) is the best choice for  but it is computational unattainable. Various functions have been studied  and the general choice in interferometry is the spheroidal prolate function [40, 41, 42].
The convolution function has to be finite, or in other words it has to have a finite support. The support of the convolution function is defined as the full width of the function in grid pixels. Note that in literature, support might be defined differently as explained by Humphreys and Cornwell .
It is a common approach that the convolution function is not calculated during gridding but is pre-generated prior the gridding process [37, 43] and stored in memory. The function has to be sampled at a higher rate than . Such sampling is referred to as oversampling. The oversampling factor is defined as:
where and represent sampling intervals at which the convolution function is being sampled. For convenience, it is assumed that the oversampling factor is invariant over the U and V axes.
In general an implementation of a gridder does not perform any interpolation on the numerical data of the convolution function. It merely chooses the nearest numerical values while gridding a record111In this thesis, the term record is used to describe a single- or multi-polarised visibility measurement.. This changes the form of the convolution function which can be modelled by the following equation:
where is the oversampled version of .
Applying the Inverse Fourier transform, the following is obtained:
The sinc functions change the form of the convolution function. Their affects are corrected for, by dividing the image with the sinc functions after the inverse Fourier transform takes place.
In subsection 2.1.1, the w-term was ignored while describing the convolutional gridding algorithm. When the term is much less than unity, such an assumption is acceptable . Otherwise ignoring the w-term leads to substantial distortions. This section discusses a recent algorithm known as w-projection  that handles the w-term through convolution. A visibility record can be projected to the visibility plane, by convolving with a -dependent function. W-projection builds over the convolutional gridding algorithm by applying the stated fact. A mathematical treatment of w-projection are the subject of the next subsection. It is followed by other w-projection related topics which are performance, and hybrid algorithms (with w-projection).
2.2.1 Mathematical treatment
Let represent visibility over a plane with constant and let represent the w-term in its exponential form.
Using the measurement equation 1.14 it can be shown that:
which implies the following Fourier relationship:
Clearly and thus
Substituting in equation 2.16 and applying the convolution theorem it results that:
Any visibility plane with constant is related to the plane by a convolution with a known function .
Clearly exists implying that also exists and thus the convolution in equation 2.18 can be inverted to put subject of the formula.
Any visibility plane with constant can be projected to the plane via a convolution with the w-dependent function .
Cornwell et al.  argue that as a consequence of equation 2.20 any visibility point can be re-projected on the plane. This forms the basis of the w-projection algorithm which builds over convolutional gridding by convolving each visibility record with a -dependent function . It is defined as:
The Point Spread Function (defined in section 2.1.1) is calculated using w-projection as follows:
is not directly solvable and thus the convolution functions222The plural is used in view that there is a different convolution function for each unique value of . have to be solved numerically . They need to be pre-generated and stored in an oversampled format. The gridding process will convolve with which is the oversampled version of .
The aliasing and correction arguments given in section 2.1.3 still hold. Sampling in is also required, and this will also generate some aliasing. Cornwell et al.  argues that aliasing can be reduced to tolerable values by scaling in rather than linearly. Support of is dependent on and increases with increasing . This can lead to prohibitive memory requirements for storage of [45, 29]. It is a known issue in w-projection and there is research going on in order to handle the problem, such as the work presented by Bannister and Cornwell .
It is to be noted that based on the same principles described here, it is possible to correct the w-term in the intensity plane. Applying an inverse Fourier transform on equation 2.16 it results that:
2.2.2 Hybrid algorithms
In section 1.4.3 alternatives to w-projection were discussed. These alternatives are not mutually exclusive with w-projection. Hybrid solutions using w-projection are possible and have been successfully implemented. For example, facets and w-projection have integrated together in CASA  where w-projection is applied over wide facets. The recently proposed w-snapshots algorithm  uses w-projection to project visibilities on a plane over which the snapshots algorithm can be applied. The plane is chosen by least square fit and changed whenever the visibility reading being projected over the plane deviates too much.
The basic idea of w-projection is the one currently giving the best performance. Bhatnagar  claims that w-projection is faster than facets by a factor of 10. Bannister and Cornwell  claim that w-projection is the most computationally efficient algorithm.
2.3 Notes on GPU programming using CUDA
2.3.1 Thread configuration
GPUs are parallel devices that can achieve impressive computational power by handling thousands of execution threads concurrently. Mandated by hardware design, threads are grouped in blocks, and each block is executed by one multi-processor. Each multi-processor concurrently handles a few of these blocks and blocks are scheduled to run only after other blocks finish execution. Threads within a block are guaranteed to be all running when a given block is scheduled on a multi-processor. For modern GPUs, the maximum threads per block is 1024, implying that the modern GPU can handle tens of thousand of threads concurrently.
An execution code over a GPU is referred to as a kernel, and in this thesis this term is used only for this purpose. The term thread configuration is used to describe the set-up of threads and blocks for the kernel. Many of the thread configurations mentioned in this thesis are in such a way that a thread is set to process an element without the need to interact with other threads. In such thread configurations it is to be assumed that blocks are set with the maximum number of threads, and enough number of blocks are requested to execute at full occupancy. It is also to be assumed that there can be thread re-use, in the sense that a given thread will process a set of elements one after each other. It has the same effect of a thread being destroyed after an element is processed and re-created to handle a new element when a new block is scheduled.
Many GPU algorithms tend to be memory bound meaning that the main limiting factor is access to memory. In the design of such algorithms, the strive is often the optimisation memory usage.
The GPU provides different types of memory. Table 2.1 lists the relevant ones for this thesis together with some salient features.
|Memory Type||Location (on/off chip)||Access||Scope||Lifetime|
|Shared||On||R/W||All threads in block||Block|
|Global||Off||R/W||All threads and host||Host allocation|
|Texture||Off||R||All threads and host||Host allocation|
Global memory is the only Read/Write memory that is persistent beyond the lifetime of a kernel and it is also the largest in size (some few Gigabytes). It also suffers from high latency. Shared memory is much faster than global memory but limited in scope and size (a few tens of kilobytes). Registers are even faster as they can be accessed at zero extra clock cycles per instruction in most execution scenarios. They are much more limited in scope than shared memory and are a scarce resource.
Shared memory and registers can be used to store temporary data, that need to be accessed quickly. For example, if a set of data is required to be read by all threads in a block, then it is general practice to load the data set in shared memory.
Textures are read-only memory structures, and in this thesis, they are used for their on-chip cache that enables efficient access to read-only data from global memory. They provide further functionalities, not used in this thesis, and not reviewed here.
2.4 Related work on high-performance gridding
Literature on high-performance implementation of w-projection and convolution gridding for radio interferometry is scarce. This contrasts with medical science research whereby GPUs are a common topic of research for use in medical equipment [51, 22, 52].
Edgar et al.  report work on a CUDA gridder for the Murchison Wide Field Array (MWA) . A thread is associated with a unique point on the grid. Each thread scans through a list of visibility records, calculating the contribution (if any) that the record gives to the grid point under the thread responsibility. Most of the records do not give any contribution to a grid point, and there is substantial waste of time in the scan. This waste is reduced by grouping records by their position on the grid, such that each gridder thread scans only some of the groups.
Humphreys and Cornwell  discuss a gridder used for benchmarking. A thread is assigned for each point of the convolution function. To avoid race conditions, a single run of the kernel only grids few records which do not overlap when convolved.
Another benchmark study is presented by Amesfoort et al. . Race conditions are evaded by allocating a private grid for each thread block. This study makes the gridder unusable for radio interferometry in view that memory requirement for such a configuration are much higher than the global memory available in modern GPUs.
Romein  has recently proposed as algorithm which this thesis regards as a breakthrough on the subject. It is used in the development of the thesis’ imaging tool. From here on this algorithm will be referred to, as Romein’s algorithm. It exploits the trajectory behaviour of the baseline that was explained in section 1.2. Details on the algorithm are given in section 2.5.
GPUs are not the only device considered for the high performance implementation of w-projection and convolutional gridding. For example, Verbanescu et al. , Verbanescu and Amesfoort et al.  consider an implementation over the Cell B.E processor while Kestur et al.  reports on a framework for gridding over FPGAs.
2.5 Romein’s algorithm
As already stated, Romein’s algorithm refers to the algorithm presented in . This thesis makes use of this algorithm and hence it is reviewed in some detail here. It should be noted that chapter 5 presents new analysis of this algorithm that are not published in .
Records measured by a given baseline are sorted by time and presented consecutively to the algorithm. In this way, while records are being gridded one after the other, the region in the UV-grid that is updated by a convolved record moves with the baseline trajectory.
The grid is split up in sub-grids of sizes equal to the size of the convolution function that will be used to grid. Threads are assigned to take care of one grid point in each sub-grid as shown in Figure 2.1. Thanks to this configuration a thread updates one and only one grid point in convolving a visibility record. By virtue of the baseline trajectory behaviour, it is quite likely that, subsequent records will need to update the same grid point. Updates are accumulated in the registry, until the grid point moves out of the region being updated. At this point, the thread commits the accumulated update to global memory and commences a new accumulation for a new grid point in a different sub-grid.
Explanation of some implementation details used in the test case presented by Romein  for this algorithm, now follows. Only the main details that are reused in this thesis are pointed out.
All grid point updates related to a record are handled by one CUDA block. This means that the block requires an amount of threads equal to the convolution function size. GPUs impose a maximum on the number of threads per block (maximum of 1024 threads per block for the latest architectures) which is in many cases smaller than the size of a convolution function. A given thread is thus allowed to cater for more than one grid point in each sub-grid. Ideally all the grid points under the responsibility of a thread should be handled concurrently but due to registry limitations in modern GPUs, this is not always possible. Instead, a group of records is scanned several times by the thread so as to cater for all the grid points entrusted to it. All the threads in a block will need to read the same record data for several times, and by GPU best practices, data is pre-loaded in stored memory so as to have a fast access.
Different CUDA blocks grid different groups of records. This mandates the use of atomic additions to commit the accumulated grid point updates to the grid. Single precision atomic additions are intrinsically supported by GPUs, but double precision atomic additions are not supported. Despite the fact that a software implementation of double precision atomic additions is possible, such an implementation is inefficient and heavily impairs the algorithm’s performance. In conclusion, the algorithm works efficiently only for single-precision.
The implementation makes use of textures for retrieving convolution function data stored in global memory.
Romein  reports a maximum of 93 Giga Grid point updates per second333The metric is explained in section 5.3.1. for the test implementation. Gridding was done over a quad-polarised grid using a GeForce GTX680 GPU. This result is revised in chapter 5 in view that enhancements made in this thesis, give more knowledge about the algorithm.
Chapter 3 The General Array FrameWork
The General Array FrameWork (GAFW) provides an infrastructure for numerical calculations. The framework has features meant to facilitate high-performance and has a simple user111In this context the "user" is the developer of an application built over the General Array Framework. interface. It adapts to the underlying hardware (CPUs, GPUs, FPGAs etc) since all control logic is handled by the framework through an engine.
The framework’s design promotes collaboration between scientists with basic knowledge in programming, and developers specialised in high performance computing. A layered approach allows for two distinct development areas, each suited for the respective collaborators mentioned above. The concept of this framework is based on the mathematical concepts of arrays and operators with minimal framework-specific jargon. This makes it easily comprehensible and manageable by scientists.
In this thesis, a C++  implementation has been developed that supports multi-GPU systems and forms the basis of the imaging tool that will be discussed in the next chapter. The use of the framework in the imaging tool simplified the development of the tool.
3.1 Main concepts
The framework mimics the way humans tend to make a calculation. In most cases, an equation is first defined, and afterwards numerical data is entered to obtain a result. In other words, humans tend to understand mathematical relationships before doing any calculation. A similar computational scenario is presented in this framework. The application defines how a calculation is done and then makes a separate request for a calculation over a set of input data. Since calculation definitions are separate from the actual calculation, they can be reused over and over again for different input data.
A calculation is defined using a calculation tree. The calculation tree is a graph of operators that get connected together via arrays. Operators define logic that generates numerical data, while each array represents an N-dimensional ordered set of numerical data of a specific type (ex integers, complex numbers, etc). Arrays are set as input and/or output to operators.
Figure 3.1 gives an example of a simple tree. It defines convolution using FFTs. Arrays are represented by arrows.
For convenience, the entry point for the generated numerical data for an array is provided by a third object called the result. Each array has associated to it one and only one result object.
Result objects provide also the entry point for calculation requests. Calculation requests are result-centric and not tree-centric in the sense that the request is to calculate the result. Based on the defined calculation trees, the framework decides which operations (that is, execution logic described by operators) are required to calculate and obtain the requested result.
Result objects provide mechanisms to inform the framework on what to do with the given data. An example is whether the application intends to retrieve the data or not. This has a significant impact on performance since a memory copy from GPU to host is required.
Calculations are done asynchronously with respect to the application thread. Such behaviour is crucial to achieve high performance and maximise the usage of available resources. Once the application requests a calculation, the framework validates and proceeds in the background. The application thread can load more data and request further calculations while GPUs are executing code in the background.
Two final points on the general concept of the framework are the factory and object identification. A factory is used to make things as easy as possible to the user and ensure full control of the framework over its own objects. The factory maintains all framework objects, including their creation and destruction. Framework objects are identified by a two-string structure which is referred to as an identity. A registry service provided by the framework enables the application to use the objects’ identity instead of C++ pointers or references, when communicating with the framework.
3.2 A layered approach
The framework can be modelled by five stacked layers as shown in Figure 3.2. Each layer is described hereunder:
The Application Layer: As its name implies, this layer represents the application built over the GAFW. The application constructs calculation trees, inputs numerical data to the system, requests calculations and retrieves results. Note that the application layer is unaware of the underlying hardware as it is up to the framework to make the necessary adaptations to execute calculations on the underlying hardware. This keeps the development of the application layer simple.
The Programming Interface Layer (PIL) defines all functionality that the framework provides to the application layer. Its design is intended to offer a remarkably simple interface to the application layer. Section 3.3 discusses the programming interface layer in greater detail.
The Calculation Execution Control Layer (CECL) contains all the logic to perform a calculation. It is well aware of the underlying hardware and adapts to it. As already pointed out, only GPUs are currently supported, and it adapts calculations to the number of GPUs available. The layer is made up of a multi-threaded engine, providing all GPU control logic to execute a calculation. It requires direct access to the underlying hardware.
The Operation Execution Layer(OEL) is responsible to execute operator code on the hardware.
The Hardware Layer is the hardware itself. As pointed out many times, CUDA compatible GPUs are the only supported hardware. It is aimed that in the future, there will be support for other hardware such as FPGAs, DSP boards, CPUs and clusters.
3.3 The programming interface layer
This section discusses more concepts and details of the framework as seen from the point of view of the application layer. Each subsection discusses a concept, object or service given by the framework.
3.3.1 The factory
Since a priority in the design of the framework is simplicity, the handling of objects is mandated to the framework and stripped off from the application. The framework delivers such service through a factory. Only one instance of the factory can exist at any moment in the application life cycle, and is expected to exist throughout the whole lifetime of the application. Whenever the application requires the creation and destruction of framework objects, it does so by making a request to the framework’s factory.
The factory simplifies the management of operators for the application layer, since operator initialisation and destruction logic can vary from one operator to an other. Further details are discussed in section 3.3.4.
The use of the factory is beneficial for forward compatibility with future releases of the framework. Since all the initialisation/destruction logic is in the absolute control of the framework, any future enhancements of the framework that changes such logic can be implemented with no change to the application layer code. The application will still be able to compile against new versions of the framework.
The factory represents the framework, and its instantiation is analogous to the instantiation of the framework. Once the factory is set up, there are no other procedures to follow to initialise the framework. The application can immediately build calculation trees and subsequently request calculations.
3.3.2 Identification of objects and the registry service
Framework objects are identified using a two-string structure called an identity. User-defined objects can also have an identity, and the framework is well structured to support such user-defined objects.
An identity is defined by two strings: a name and an object-name. The name identifies the type of object such as "array" or "result" while the object-name identifies the object itself. In Figure 3.1 the object-names "A" and "B" are used to represent two arrays in the calculation tree. Every object is expected to have a unique object-name, which is given by the application layer. Names are given by the framework and are useful to distinguish between different operators (such as "FFT" or "Dot Multiply").
Object-names have a hierarchical naming scheme222Names do not have a naming scheme.. A dot (.) is used to separate levels in the hierarchy. There are no rules regarding how the hierarchy is set up, but, if the object is to be registered, its parent needs to be registered beforehand. The hierarchy is essential to avoid conflicts in object-names. For example, if two unrelated trees are developed, there is the possibility that, by mistake, the same object-name is chosen for objects in the two trees. If each tree uses its own namespace, by defining its own unique branch in the hierarchy, then the issue is avoided. Note that the framework provides mechanisms to support application defined objects to work in their own namespace.
The use of identities gives the possibility of communicating with the framework on objects using object-names or names instead of C++ pointers or references. A registry service is available in the framework that keeps a mapping between object-names and respective objects. Framework objects are automatically registered on creation. Other application defined objects can be registered manually. This strategy alleviates the application from maintaining C++ memory pointers to objects that it needs.
An array represents the movement of data within a tree. It has an identity and is automatically registered by the factory on creation. An array is defined by its dimensions and the type of numbers it holds (ex integers, complex numbers, etc).
An array within a tree can be in three modes: input, output or bound. The modes are mutually exclusive, and arrays can only be in one mode.
An array is in input mode if numerical data is manually pushed in by the application layer. The array provides the entry point to input such data.
An array can be bound with a result object. Such a bind instructs the framework to use the last generated data represented by the result object as input to a subsequent calculation in which the bound array is involved.
If an array is neither bound nor in input mode then the array is in output mode and is expected to be set as an output of an operator. The application is not normally expected to define the dimensions and type of data for such arrays, since operators are expected to have logic to determine such properties automatically during validation (refer to section 3.4.1).
A GAFW operator describes logic that generates numerical data. For example, an operator might describe the logic of a Fast Fourier Transform.
Operators have an identity (discussed in section 3.3.2). The identity’s name of an operator is an essential property and is one of the main reasons why an identity includes a name. Different operators are regarded as different object types and must have a different name. The application communicates its request for a new operator object using the name of the operator. Good meaningful names are those that describe the execution code represented by the operator such as "FFT".
In the general case, different operators are implemented using a different class. Thanks to identities and object-oriented polymorphism, this complexity is hidden from the application which is aware only of the base class.
An operator has three forms of input data: input arrays, pre-validator arrays and parameters. Input-less operators are legitimate.
Input arrays are an ordered list of arrays that describe the data on which the operator operates. The numerical data that the array represents is made available on the GPU memory during the operation execution.
Pre-validator arrays are another ordered list of arrays. Different from the normal input arrays, the data is not presented to the operator during execution. They are only applicable during the validation phase (refer to section 3.4.1). The framework guarantees that they get validated before the operator. It is expected that the operator validation logic will need to obtain some information from the array (such as dimensions or data type) in order to execute its own validation logic.
Parameters are a set of variables that are set up prior to calculations. They can be regarded as configuration data for the operator. For example, an operator doing zero padding would need to know how much zero padding is needed. This information can be given to the operator through parameters.
An operator has only one form of output, which is an ordered list of arrays in output mode.
An operator can request some additional allocation of memory on the GPU for its own private use while executing. This memory is referred to as a buffer.
Operators exist in the PIL and OEL layers (refer to Figure 3.2). As mentioned earlier, the base class is presented to the application layer. The PIL, therefore, only defines the basic interactions of the operator with the application layer. As for the OEL, the operator defines all validation and execution logic.
The procedure to develop an operator is simple. The operator is defined by a new C++ class that inherits the operator basic class. Two methods need to be overloaded, one providing validation logic and another one providing the kernel submission logic. In most cases, a new CUDA kernel needs to be coded. This is the only hard part in the whole procedure since it requires skilled expertise in CUDA programming. The final step is to register it in the framework by its name. The registration process requires coding for class instantiation and operator object destruction.
Result objects provide the entry points for controlling and retrieving calculated data and requesting calculations. Each array has associated to it one and only one result object and contains all the generated numerical data related to the array.
Since calculation logic is automated within the framework, the application needs to give instructions on how to handle the data. These instructions are listed and explained below:
Application requires results: The application has to inform the framework about its intention of retrieving the calculated data. This is particularly essential for performance because extra memory copies from GPU to host memory are required to allow the application to access the result. It also removes any unnecessary memory allocation on the host.
Data re-usability: This defines the intention of the application to request subsequent calculations that will re-use data generated. On such instruction, the framework keeps a copy of the result, possibly on GPU memory, for re-use.
Overwrite: Setting a result for overwrite instructs the framework that prior to executing the corresponding operation on the GPU, it should initialise output data to the last calculated result.
These instructions can be given to any result object participating in a calculation and not only to the result object through which a calculation is requested. For example, referring to the calculation tree in Figure 3.1, if O is requested to be calculated and C is set to be a required result, then the framework will copy the result of C to the host and make it available to the application.
3.3.6 Proxy results
As their name implies, proxy results are meant to serve as proxies to result objects. Proxy result objects behave as if they are genuine result objects (the result class is inherited). The proper result object which is proxied, is configurable at any time, and can be changed at will. The main use of these objects is in situations where a fixed result object needs to be presented, but needs to be changed every now and then behind the scenes.
GAFW modules are intended to contain a defined calculation in one object. They are application defined and need to inherit the module’s base class. Inputs and outputs of modules are in the form of result objects. The module must provide all the logic to request calculations to the framework.
It is advantageous to have a statistics system in a high-performance application. For example, it is helpful to have execution timings of operations executed on the GPU. The framework generates its own statistics that are pushed to the application for further processing.
A single statistic is contained in a unique object whose class inherits a general base class. An interface is defined in such a way that statistic objects can be posted through it. The interface needs to be implemented by the application. It is also expected that posted statistics are processed in a background thread. This design allows the application to use the infrastructure for its own generated statistics.
3.4 The engine
The Calculation Execution Control Layer is made up of an engine whose function is to execute a calculation request. This section discusses the engine and how it executes a calculation over GPUs. Figure 3.3 portrays a high-level design of the engine.
Calculations are executed in three steps: Validation, snapshot taking and proper calculation. Each step is discussed in the following subsections:
The framework needs to verify that the calculation tree is valid. It also needs to handle any missing tree data (array dimensions and data type) that can be predicted. If the framework has no sufficient data, or the calculation tree is invalid, then the validation process returns an error in the form of a C++ exception.
Some of the validation logic has to be supplied by the operator used in the calculation tree. The operator has to validate itself on details regarding its unique specifications, such as, the number of inputs and outputs. It also has to determine the dimensions of the output arrays when missing or incorrect.
The validation algorithm is depicted in the flow chart shown in Figure 3.4. The algorithm is split in two sub-algorithms, one for validation of arrays and one for validation of operators. These two sub-algorithms recur on each other. Array validation in output mode requires the validation of the respective operator, while operators need the validation of all input arrays and pre-validators. In this way, the tree is traversed until non-output arrays or input-less operators are found. The algorithm then reverses back validating all objects.
3.4.2 Snapshot taking
Taking a snapshot means the copying of all relevant data regarding a calculation such that the proper calculation can be executed asynchronously. In this way, the proper calculation takes place in the background while the application can change trees, input new data, request other calculations or execute any other logic.
To execute a calculation, the engine transforms the calculation tree into a stream of operations to run. Each element in the stream describes an operation in full, including transient data, such as state, locking mechanisms and memory allocation on GPU for input and output data. The engine relies exclusively on this data to perform a calculation.
3.4.3 Proper calculation
This step, computes the calculation tree over GPUs. Simplistically speaking, it is a matter of memory management and kernel submission. All the tasks are executed by the engine with the exception of kernel submission. The latter is executed by the operator on request from the engine.
The engine is implemented using a multi-threaded approach. All the work done by the engine is divided into simple tasks, each handled by a separate thread. A blocking FIFO queue is used to communicate between threads. This approach enables the framework to monitor many events concurrently without the need of looping. It avoids busy waiting were events are continuously queried for status updates. It also helps the framework to act quickly on events over which a thread block waits. The waiting thread is immediately released as the event is fired.
The engine delegates the actual management of a device to a device manager. This simplifies functional support for multi-GPUs (and in the future other devices). A unique instance of the manager is brought up for every device supported. It has its own task threads as illustrated in Figure 3.5.
The engine schedules an operation to be executed over a device by submitting it to the respective device manager. In systems having more than one GPU, a whole calculation request is submitted to one GPU while the next calculation request is submitted to the next GPU. This simple scheduling algorithm proved to be good enough for the imaging tool discussed in the next chapter.
GPU memory management works as follows: The allocation thread continuously allocates memory for incoming operations until the memory is full or there are no operations in the pipeline. Once the GPU memory is full it waits until the post-execution thread frees memory. Memory is freed once the computation related to an operator is ready and will not be reused for future operations. Memory that is not freed and is not set as input or output for any operation submitted to the device manager is managed by the memory management thread. Such memory can halt the whole calculation process as it might not leave enough space for currently scheduled operators. In the case of such a scenario, the device manager caches the data on the host main memory (if not already done) and frees memory on the GPU. Caching also takes place when data needs to be transferred from one GPU to another, or when de-fragmentation of memory is required.
Unfortunately, the CUDA runtime API  is unable to allocate memory on a GPU while a kernel is being executed. This causes the allocation thread to freeze up during kernel execution, reducing thread concurrency on the CPU. This has a direct impact on the overall performance since data transfers333By GPU best practices  time to transfer data from host to GPU is hidden by doing memory transfers in parallel with kernel execution. cannot be initialised prior to the allocation of the respective memory. In order to ease the problem, a locking mechanism is in place that denies concurrent allocation of memory and submission of threads. In this way, the allocation process works in batch while kernel submission is locked.
In the chapter’s introduction, it was claimed that the framework is designed for collaboration. This section elaborates on this argument.
Framework related development of an application can be divided in two areas. The first area is the application layer. Development related to this area is easy and fast with no specialised expertise required. Whoever attempts development in this area, requires general knowledge on the behaviour of the calculation with no details on how it is deployed on the GPU. This area suits remarkably well to that scientist who has general expertise in programming and views programming as a means to reach scientific goals.
Development of operators is the second area. Developing an operator requires specialised expertise in high-performance parallel computing over GPUs. Much less knowledge about the overall behaviour of the calculation is required, but in-depth and detailed knowledge of the operator is a must. It is suited to a developer who has little interest on scientific goals and much more interest in writing high-performance code.
The two areas are separate and the only commonalities are the framework itself and functional specification of the operator. Therefore, it is is easy to promote collaboration between the scientist and the developer, and get the best out of the two. One notes that functional specification documents are a standard in the industrial software development world. It is the main tool with which developers communicate with their clients.
3.6 Standard Operators
As part of the implementation, some standard operators have been developed. This section discusses the most noteworthy ones.
3.6.1 The Fast Fourier Transform (FFT) operator
This operator executes Fast Fourier Transform over arrays of any dimension. It is possible to divide the array into sub arrays of lesser dimensions and perform a Fourier transform over them. For example, in the case of a matrix, it is possible to request a 1-dimensional Fourier transform over each column. It is implemented using the CUDA cuFFT library  provided by NVIDIA. Unfortunately, the documentation  does not give substantial information on the buffer space required to execute an FFT. Documentation only states that it is between one and eight times the size of the input being FFTed. To be on the safe side, the operator has been set to request a buffer eight times larger than the input.
3.6.2 The all-prefix-sums operator
Given a sequence , the all-prefix-sums operator does an accumulated addition over the elements to return the sequence defined as follows:
This operation is also known as an exclusive scan.
3.6.3 The general reduction template
This is a generic operator (in C++ terms: a template). It provides general code for operators that handle reductions of the form:
is a sequence of sequences, whereby each element sequence is an input to the GAFW operator.
is a general mathematical operator that has to be associative.
is another general mathematical operator that does not need to be associative.
Specialisation of the template needs to define the two mathematical operators and the value of . The value of is determined from the size of the input arrays. The operator can reduce over dimensions. For example, in case of a matrix, it is possible to reduce each column separately. The result will be a 1-dimensional array of length equal to the number of rows of the matrix.
Reductions are simple to implement over GPUs. In this implementation, a kernel is run with a configuration of maximum threads per block and a number of blocks high enough to reach nearly 100% real occupancy. Work load is split evenly over all threads such that each thread reduces a subset of the elements in each sequence. Results are saved in shared memory so as to run over all results produced by threads in a given block. This is stored in global memory in such a way that a second kernel re-applies over the values calculated by each block.
Chapter 4 The Malta-imager
The Malta-imager (mt-imager) is a high performance image synthesiser for radio interferometry. It is implemented in C++  and CUDA , and uses the GAFW infrastructure (refer to chapter 3). It achieves unprecedented performance by means of GPUs. The CPU multi-threaded design for handling pre- and post-data processing is also a crucial ingredient in ensuring the best performance.
The image is synthesised using w-projection (refer to section 2.2). An independent multiple-Stokes dirty image is synthesised for each channel. Measurement data is expected to be available as a MeasurementSet  stored using the Casacore Table System . Output images are stored as a 4-dimensional data cube FITS  primary image. Most of the calculations are done in single-precision mode [68, 69, 70].
4.1 High level design and concepts
The design is based on the General Array Framework (GAFW) philosophy (refer to chapter 3). The system is made up of seven autonomous components as depicted in Figure 4.1. None of them interact directly with each other, and with a few exceptions, they are unaware of each other. Data is communicated between each component using the GAFW, by means of result objects (see section 3.3.5). The main() function integrates all components together.
Each component is assigned a unique set of tasks. The Configuration Manager takes care of producing configuration information for each component based on the local configuration and environment. The Visibility Manager loads data from a MeasurementSet, sorts, converts and inputs data in the GAFW infrastructure. The WImager performs gridding over a multi-polarised grid, while the Image Finaliser converts the grid to a multiple-Stokes dirty image. These components are implemented as GAFW modules, and all calculations are made over GPUs. Channels are processed independently, and, for each channel, separate instances of the two components are set up. Numerical representations of the convolution functions (defined in equation 2.24) are generated by the Convolution Function Generator. The Image Manager stores output images in FITS files, while the Statistics Manager processes statistics, generated by each component, including the GAFW. It then reports them in various CSV (Comma-Separated Values) files.
Data is processed in chunks to exploit parallel mechanisms available on the hardware. This is essential to ensure high-performance. A GPU is by itself a parallel device which can only achieve high-performance through parallel methods. Presenting the GPU with a suitably sized chunk of data ensures best gridding performance. In case of a multi-GPU system, by virtue of the GAFW, the imaging tool grids independent channels over different GPUs so as to achieve concurrency over GPUs.
Fast gridding on GPUs is the imaging tool’s strong suit. Nevertheless CPU bound pre-processing and post-processing of data is a necessity, since, by design GPUs are limited. For example, GPUs cannot load data from hard-disk, neither do they recognise C++ objects. Also, they cannot save images to permanent storage. If these pre- and post-processing steps are not well handled on the CPU, then, they can severely compromise performance.
Execution time of pre- and post- processing steps is hidden by having the CPU, GPU and permanent storage IO running in parallel. Most components perform their tasks asynchronously to each other. Since data is processed in chunks, the Visibility Manager prepares the chunks in the background through a multi-threaded mechanism. Once the first chunk is available for gridding the WImager component (by virtue of the GAFW) grids the data over the GPU while the Visibility Manager continues its task of preparing other chunks. This achieves concurrent use of the CPU and GPUs.
Channels are processed one after each other111For multi-GPU systems, channels are processed in parallel by an amount equal to the number of GPUs.. Once a channel dirty image is finalised the Image Manager saves the image to disk while subsequent channels are being processed. This ensures concurrent use of the GPU and permanent storage. The Visibility Manager, by virtue of its multi-threaded design ensures concurrency between CPU processing and permanent storage IO and exploits the multi-core infrastructure of modern CPUs.
The core of gridding is based on Romein’s algorithm  which is enhanced, implemented and adapted for the necessities of this imaging tool. The algorithm requires that data is grouped by baseline and sorted in time. The ordering is done by the Visibility Manager.
The imaging tool supports single, dual or quad polarised222Polarisation can be linear or circular. Both are supported. data. The term multi-polarisation is used as to describe any number of polarisations. It should be noted tha the imaging tool handles each multi-polarised visibility record as one entity.
Output images are converted in the I,Q,U,V stokes format or a subset of, depending on the polarisations available.
Visibility data is gridded using the natural weighting scheme (see section 2.1.2). The required variance is read from the MeasurementSet.
Flagging is also supported. During pre-processing phases of data, not handled by the imaging tool, some visibility records might be flagged for various reasons. These include erroneous readings. The imaging tool does not grid any such flagged data. Note that flagging is done per polarisation, and a flag has value of 1 when the respective polarised data is not to be gridded and 0 otherwise.
The final point in this section is about channel frequencies. Since, in the general case, the interferometer resides on the Earth, channel frequencies are normally given in the topocentric frame of reference. However, to make corrections for Doppler effects caused by the motion of the Earth, the imaging tool grids using the Local Standard of Rest Kinematic (LSRK) frame of reference. Frequency values in this frame of reference are not given directly by the MeasurementSet, so a conversion is required. This is taken care of by the Visibility Manager using the CPU. It is the only calculation done in double-precision mode .
4.2 Runtime configuration and the Configuration Manager
Runtime configuration of the mt-imager is done via command line arguments and configuration files. The configuration is a set of parameters defined by keyword and value. This is similar to the Common Astronomy Software Applications(CASA)  and the standalone lwimager tool based on the casacore libraries . The keyword and its respective value are separated by the equals (=) character. For example, nx=100 defines a parameter named nx with its value set to . A type such as a string, boolean or integer is attributed to a parameter value. Boolean parameters can be set to true by putting, a minus (-) just before the parameter name. For example, -test and test=true are equivalent.
Parameters can be set within a manually specified configuration file. In principle, all parameters can be specified as command line arguments to mt-imager. This makes the command quite long, and subject to errors (an issue in lwimager). The use of a configuration file solves the problem. It is a simple text file where parameters are each listed on a line of their own. Empty lines are allowed, and lines beginning with the number sign (#) are assumed as comments and ignored. Parameters set through the command line have precedence over those defined in the configuration file. This enables the user to have a default configuration stored in a file and partially overridden by command line arguments.
The mt-imager uses a logging system based on Apache log4cxx API . It requires an extra configuration with a format defined by the API. Its location is configurable through a parameter.
The Configuration Manager is the holder of all configuration logic. It is a passive component as all the logic runs in the main thread. It does not interact with any components. It produces a component specific configuration that is passed by the main thread to the component during initialisation. Due to its nature, the Configuration Manager is code-wise dependent on the other components. This dependency is one-way since the other components are neither aware nor dependent on the Configuration Manager.
4.3 The Convolution Function Generator
The Convolution Function Generator is entrusted in calculating the w-projection oversampled convolution functions 333 is defined in section 2.24.. It is not practical to calculate the convolution functions during gridding, instead, they are numerically evaluated and presented to the WImager component in an oversampled format.
4.3.1 General points
This component can generate convolution functions in two modes: normal and w-projection mode. The mode used is configurable at runtime.
In normal mode, the generator only evaluates at and w-term is ignored (refer to section 1.4). The term "normal" is used by other imaging tools such as CASA  and lwimager. In these tools, it describes the same behaviour as in the mt-imager. In this mode, the support of the convolution function is runtime configurable. The system will zero pad or trim the function so as to produce the desired support. This feature is useful for executing performance tests, and it was pivotal in many of the experiments described in chapter 5.
In w-projection mode, is evaluated over various w-planes depending on the runtime configuration. Each convolution function is trimmed to its support size. Support is evaluated after examining the generated data. As already discussed in section 2.2, support of is a function of that increases with increasing .
In the two modes, the choice of the tapering function is run-time configurable. Tapering functions are implemented as GAFW operators, and the choice is defined by specifying the name of the operator. In this thesis, only one tapering function operator has been developed called casagridsf. This implements the same prolate spheroidal function used in the casacore API . It is based on work presented by Schwab  and has been adapted to work over GPUs.
The oversampling factor is a run-time configurable variable. Since memory is limited it is suggested to keep it to a low value of 4, which is the value proposed by Cornwell et al. . This value is hard-coded in lwimager and CASA. Zero-padding is used as an interpolation scheme.
The generator samples in . As recommended by Cornwell et al. , is scaled in rather than linearly. No necessity exists to calculate for since the convolution functions are symmetrical around . The maximum to consider is runtime configurable.
4.3.2 Mathematical treatment and outputs
The Convolution Function Generator outputs three GAFW results that contain all convolution functions data required for WImager to do its job. Some simple mathematical treatment is given here to help describe the content of the outputs.
When a record is gridded, only the numerical data of one convolution function that falls on the pixels of the UV-grid is used. Any oversampled point that does not fall on the UV-grid is not considered. Define as the function describing the operation that chooses data. are the baseline components of the record being gridded expressed in number of wavelengths. The function returns the chosen data in a matrix with dimension , where is the support of the convolution function . Define the half-width and let be an element of , where is the 1-based index of the row, and is the 1-based index of the column. Then:
where and satisfy the simultaneous equations:
Note that the image444The image of an arbitrary function , is defined as the set . of is finite since is oversampled in and , and sampled in .
As defined in equation 4.5, the WImager also calculates a normaliser. There are substantial computational savings if the summation of the real parts of the elements of are pre-calculated. Let define such summation such that:
where is the real operator.
The three outputs containing all data related to are now explained. The first output is referred to as the Convolution Functions Numeric Data output. It contains all numerical data of the oversampled . The data is laid down as follows: The data is first ordered by convolution function in increasing . Further ordering for each convolution function is in such a way that elements of each matrix member of the image of is coalesced in memory, sorted in the row-major form. This is in accordance with GPU best practices since when a record gets gridded, the convolution function data selected, is accessed in parallel. All matrix members of the image of are ordered in increasing order by and .
The second output is referred to as the Convolution Function Data Position and Support output. It contains two elements for each convolution function. The first element is the function’s support. The second is an index pointing to the first element of the first output that describe the convolution function. This index is vital to search for the right data to use for gridding.
The third output contains the image of . The layout is similar to the layout of the first output.
4.3.3 Data Generation
A CPU/GPU hybrid algorithm is used to generate the three outputs described. CPU work is done using the imaging tool’s main thread. The payload on the CPU is negligible, and the thread spends most of time waiting for some GAFW result to be available. The main thread is not available to spawn any new work while this component is calculating. This does not degrade performance since any new work to spawn after the generation of the convolution functions is dependent on the generated data. The Visibility Manager component is initialised before the generator in such a way that convolution functions are generated by the Convolution Function Generator component at the same time that the Visibility Manager prepares data for gridding.
The three outputs are generated using Algorithm 1. All functions are normalised using a constant normaliser value such that . Support is calculated by detecting the pixel nearest to the edge that has a magnitude larger than . The GPU thread configuration is set such that, for most of the steps, one thread handles one pixel.
4.4 The WImager Component
The WImager component executes the w-projection algorithm (refer to section 2.2) over GPUs, through the General Array Framework.
An instance of the component handles one multi-polarised UV-grid. Since the imaging tool treats each channel separately, an instance of the WImager component is created for every channel to grid.
The component is designed to be flexible such that it can be used in other configurations. For example, channel data could be grouped by baseline and gridded over independent instances of the WImager component. The grids would be added up on finalisation. The component can be used as it is for w-stacking (discussed in section 2.2) where each w-plane is gridded separately using independent instances of the component.
The WImager component is a GAFW module (refer to section 3.3.7). All inputs and outputs are GAFW result objects (see section 3.3.5). The whole algorithm is implemented as one calculation tree and is fully GPU based. It is free from CPU calculations to ensure the asynchronous behaviour explained in section 4.1.
The implemented algorithm is, from here onwards, referred to as the WImager algorithm. The calculation tree is depicted in Figure 4.2. Some GAFW operators are grouped up in one block and then expanded in subsequent Figures 4.3, 4.4 and 4.5. Arrays are denoted with keys, tabulated in Table 4.1. The table includes the mathematical reference used to represent each sequence and a reference to the section, where the sequence is discussed.
The WImager algorithm is based on Romein’s algorithm, reviewed in section 2.5. It is enhanced and adapted to suit for the requirements of the mt-imager. One of the main enhancements made over Romein’s algorithm is compression. It is discussed in section 4.4.3.
Records that are input to the WImager algorithm are expected to be grouped by baseline and sorted by time. If this requirement is not respected, the algorithm will still work, but with a heavy penalty in performance. Note that there is no restriction on the amount of baselines in an input chunk of data and records per baseline can vary. Input visibility records do not need to be on the grid and flagging is supported. The algorithm adapts to the varying support of the convolution functions. It is perfectly fine that records in one input chunk require different sized convolution functions to be gridded.
The WImager algorithm is conveniently split up in two phases: the preparation phase and the gridding phase. The main difference between the two phases is the thread configuration. In the first phase, each multi-polarised visibility record is processed by one thread or in some parts with a ratio of threads per record that goes below unity. In the gridding phase, a record is processed by a block of threads. In this case, the threads per record ratio goes in the order of thousands and millions and thus expensive. The gridding phase can hold all preparation phase logic but, for best performance, the gridder is stripped from any logic that can be implemented in the preparation phase. Note that the gridding phase is implemented in one GAFW operator and thus all the other operators in the calculation tree shown in Figure 4.2 are part of the preparation phase.
The rest of this section explains in detail the WImager algorithm. The subject is tackled as follows: First, nomenclature and terminology used is defined and explained (section 4.4.1). Inputs, outputs and mathematical equations describing the algorithm are given in section 4.4.2. In the subsequent section, the gridding phase is discussed, whereby all logic that is delegated to the preparation phase is identified. The discussion on the WImager component is concluded by explaining in detail the preparation phase in section 4.4.5.
|Key||Sequence name||Mathematical Symbol|
|Visibility Records Inputs (defined in section 4.4.2)|
|I-1||Baseline in meters|
|Convolution Functions Input Data (defined in section 4.3.2)|
|C-1||Convolution Functions Numeric Data||N/A|
|C-2||Convolution Functions Data Position and Support||N/A|
|C-3||Convolution Functions Data Sum||N/A|
|Outputs (defined in section 4.4.2)|
|Binary Sequences (defined in section 4.4.5)|
|B-3||Support Change Indicator|
|B-4||Manipulated Support Change Indicator||N/A|
|Scanned Sequences (defined in section 4.4.5)|
|SB-1||Prefix Sum of Gridding Indicator|
|SB-3||Prefix Sum of Support Change Indicator|
|SB-4||Prefix Sum of Manipulated Support Change Indicator||N/A|
|Index Sequences (defined in section 4.4.5)|
|D-2||Last Compressed Entry Index|
|Trial Plan||Trial Plan|
|Non-Filtered Sequences (defined in section 4.4.5)|
|S-2||Convolution Data Pointer|
|Filtered Sequences (defined in section 4.4.5)|
|FS-2||Convolution Data Pointer|
|FS-5||Weighted and Compressed Visibilities|
4.4.1 Nomenclature and Terminology
The various terminologies and nomenclatures used in this section are defined and explained. As pointed out in section 4.1, visibility records are input to the component in chunks. As indicated in Figure 4.2 the record data chunk is split in different arrays. It is reasonable to refer and represent GAFW arrays related to record data as mathematical sequences.
The usual "curly brackets" format is used to define mathematical sequences (for example: ). It is to be assumed that the sequence is finite, and that the first element is at index 0, that is . The letter implicitly defines the length of the sequence.
As record data is processed by the WImager algorithm, some of it is removed. The process of removal is referred to as filtering. Sequences that have records removed from are referred to as filtered sequences. Notation-wise, filtered sequences are distinguished from their non-filter counterpart by having the letter as a superscript to the element. For example, is the filtered version of the sequence .
An element in any sequence gives data about one record. Elements in different sequences that are either all non-filtered or all filtered sequences, describe the same record only if they are at the same position.
The algorithm makes use of binary sequences. The term indicator is used to refer to elements in such sequences. The letter is set as superscript of the sequence’s element to show that the sequence is a binary sequence. For example, is a binary sequence. Note that if the binary sequence is also filtered, the letter is also retained as a superscript. For example, is a filtered binary sequence.
Sometimes a prefix sum (exclusive scan) is run over a binary sequence to produce a scanned sequence. The resulting scanned sequence is denoted by the letter set as superscript of the sequence’s element. For example, is the resulting scanned sequence of . If the binary sequence is a filtered one, the superscript letter is retained and the resultant filtered scanned sequence is denoted by .
Index sequences are defined as sequences containing integer index elements pointing to a record. These are generated by using binary sequences as predicates. Index sequences are denoted by using the predicate sequence. The superscript letter is used to denote that the sequence is an index sequence. For example, is the index sequence generated using as predicate. If the index sequence points to records in a filtered sequence, the superscript letter is also used (for example ).
To represent sequences made up of tuples, the usual notation of a comma separated list of sub-elements enclosed in a parenthesis, is used. For example, represent a finite sequence of 4-tuple elements. Most sequences are directly linked with the input data that define in full a visibility record.
Various sequences include polarisation dependent data. For such sequences, the letter will distinguish between each polarisation and is set as a subscript after the twisted brackets. For example, contains polarisation dependent data. When referring to an element, the notation will be used.
4.4.2 Input, outputs and mathematical equations
The WImager algorithm receives visibility record data in five inputs. Another three extra inputs contain data of the convolution functions in numerical form. These extra inputs are supplied by the Convolution Function Generator and discussed in section 4.3. There are two outputs for this algorithm, which are: a multi-polarised UV-grid and a normaliser. Details on the inputs and outputs of the algorithm follow.
The five visibility records related inputs are:
baseline in meters as projected on the UVW axis - . The apostrophe indicates that values are in meters and not in number of wavelengths.
channel frequency (per speed of light) - . where represents the channel wavelength, and which is the inverse of the channel frequency divided by the speed of light.
flags555Flagging was discussed in the penultimate paragraph of section 4.1. -
The first output of the algorithm is the multi-polarised UV-grid sampled at regular intervals and . The pair of equations 4.4 gives a mathematical representation of how the algorithm calculates this output. It is based on the w-projection gridding mathematical treatment given in sections 2.1.1 and 2.2.1 with some adaptations and added features.
represents the oversampled w-projection convolution functions, defined in equation 2.24.
is the gridding indicator. It is calculated within the algorithm and defines if a record is to be gridded by the gridding phase. When is set to 1, the respective record is gridded by the gridding phase while if set to 0 it is not.
is set to 0 if and only if one of the following three conditions is met: The first condition is when gridding the record requires updates to pixel outside the grid. The second condition is when all polarisations of the record are flagged (that is, ). The last condition which sets to 0 is when the record is compressed666Compression is discussed in section 4.4.3..
When a record is compressed, the compression indicator is set to 1. Otherwise, the compression indicator is set to 0. Note that for a given record and cannot both be 1. Hence can have values of 0 or 1. When equates to 0, then the given record has no effect on the output grid. If and then the record will still be gridded and will affect the output, but not through the gridding phase. This is explained in detail in section 4.4.3.
It should be noted that when a flag is set to 1, and the respective polarised visibility is ignored, since will acquire a value of 0 over the whole grid. The gridding phase will still grid the polarised visibility, but with no effect since it will grid zeros.
The second output of the WImager is the polarisation dependent normalisers . These are required by the Image Finaliser component to output images in units of . The following equation gives a mathematical representation.
is defined in equation 4.3 and its image is given in the Convolution Function Data Sum input.
Romein’s algorithm exploits the behaviour of the baseline trajectory, and in the WImager algorithm the behaviour of the baseline trajectory is exploited further. There is a probability that for some consecutive records sampled on the trajectory, the change of is so small that the convolution function values and the position of gridding do not change. In mathematical terms there is a probability that for two consecutive records at position and the equality shown below holds:
_(w’_i/λ_i)(u-u’iλi,v-v’iλi )⋅III(uΔu,vΔv )=~C_(w’_i+1/λ_i+1)(u-u’i+1λi+1,v-v’i+1λi+1 )⋅III(uΔu,vΔv )
In this case by virtue of equation 4.4, the weighted and flagged visibilities of the two entries, that is, and , are summed together in the preparation phase and gridded as one record. This thesis terms the process as compression in view that some records are "compressed" into one.
The effectiveness of compression is depended on many parameters. For example, compression depends on the observation integration time, the grid configuration , and the oversampling factor of the convolution functions. A short integration time causes the interferometer to sample records at a faster rate than longer integration times. This means that the Earth would have rotated less from one sampled record to another over a trajectory. Therefore, there will be smaller changes in which imply a higher probability of compression. Short baselines tend to have a shorter trajectory than long baselines. Within a given time interval, sampled records over short baselines travel shorter distances on the UV-grid than over longer baselines. This leads to a higher probability of compression for shorter baselines.
4.4.4 The gridding phase
The gridding phase does the final gridding. It is implemented by one operator named the gridder and is based on Romein’s algorithm. This section discusses implementation details, together with adaptations made to Romein’s algorithm so as to suit the imaging tool. The discussion focuses on the required inputs that the preparation phase needs to generate for the gridder to be as fast as possible.
Referring to equation 4.4, whenever , then, the record is not gridded and is ignored by the gridding phase. The calculation for is delegated to the preparation phase. In order to evade any selection logic based on the value of , the preparation phase filters out any records with . All input sequences to the gridding phase, related to visibility records, are thus filtered sequences.
As pointed out earlier, if a polarised visibility value is flagged, then, the gridder will grid zeros instead of ignoring the polarised value. The gridder does not handle flags and does not have any flag data as input. The zeroing of the visibility value is done in the preparation phase. Such a strategy evades logic in the gridding phase related to flagging and reduces the shared memory usage of a record in the gridder kernels. As will be explained later on, the gridder loads chunks of records in shared memory prior to gridding them. Shared memory is a limited resource, and there is an impact on performance when few records are loaded. Not loading flags reduces the record footprint in shared memory and thus increases the amount of records that can be loaded at one go.
Convolution Position sequence
Part of the calculation required to find the position of the convolution function on the UV-grid adheres well to the thread configuration of the preparation phase and is delegated to it. Instead of the and sequences, the gridding phase is presented with a 4-integer tuple sequence . The tuple, which proposed in Romein’s algorithm, defines the position of the convolution function as follows: Each multi-polarised pixel on the grid is 0-based indexed by two integers representing the and value of the pixel respectively. The pixel with and values of 0 is in the middle of the grid and the pixel indexed as has the most negative and values777Note that this indexing scheme is the same as that of C++ indexing method of arrays..
If the record’s position on the grid is on the pixel indexed by and the grid is pixels long on the u-axis then: