Parallel Implementation of -projection Wide-Field Imaging
W-projection is a wide-field imaging technique that is widely used in radio synthesis arrays. Processing the wide-field big data generated by the future Square Kilometre Array (SKA) will require significant updates to current methods to significantly reduce the time consumed on data processing. Data loading and gridding are found to be two major time-consuming tasks in -projection. In this paper, we investigate two parallel methods of accelerating -projection processing on multiple nodes: they are the hybrid Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) method based on multicore Central Processing Units (CPUs) and the hybrid MPI and Compute Unified Device Architecture (CUDA) method based on Graphics Processing Units (GPUs). Both methods are successfully employed and operated in various computational environments, confirming their robustness. The experimental results show that the total runtime of both MPI+OpenMP and MPI+CUDA methods is significantly shorter than that of single-thread processing. MPI+CUDA generally shows faster performance when running on multiple nodes than MPI+OpenMP, especially on large numbers of nodes. The single-precision GPU-based processing yields faster computation than the double-precision processing; while the single- and double-precision CPU-based processing shows consistent computational performance. The gridding time remarkably increases when the support size of the convolution kernel is larger than 8 and the image size is larger than 2048 pixels. The present research offers useful guidance for developing SKA imaging pipelines.
Parallel Implementation of -projection Wide-Field Imaging
1]Baoqiang LAO1]Tao ANantao@shao.ac.cn
Junyi WANG 1]Quan GUO1]Shaoguang GUO1]Xiaocong WU
Parallel Implementation of -projection Wide-Field Imaging
|\hb@xt@1ex1Shanghai Astronomical Observatory, Key Laboratory of Radio Astronomy, Chinese Academy of Sciences, Shanghai, 200030, China;|
|\hb@xt@1ex2Guangxi Cooperative Innovation Center of cloud computing and Big Data, Guilin University of Electronic Technology, Guilin, 541004, China;|
|\hb@xt@1ex3Guangxi Colleges and Universities Key Laboratory of cloud computing and complex systems,|
|Guilin University of Electronic Technology, Guilin, 541004, China|
Received January 11, 2019; accepted April 6, 2019
Radio Synthesis Arrays, Square Kilometre Array, Wide Field Imaging, Parallelization, -projection
PACS number(s): 47.55.nb, 47.20.Ky, 47.11.Fg
The largest international collaboration project in radio astronomy, the Square Kilometre Array (SKA), will be the flagship radio telescope in the next few decades upon completion, providing great opportunities for humans to discover and understand the Universe [1, 2]. The first phase of the SKA (SKA1), consisting of 10% of the total array, already offers unprecedented performance . The wide field of view (FoV) , high survey speed, high resolution and high sensitivity of the SKA render vast amounts of observational data. Of particular note is its high survey speed, which is greater than that of the largest existing radio telescope, the Karl G. Jansky Very Large Array (JVLA), by a factor of 100 . The correlated data generated by the low-frequency SKA1 (SKA1-low) will be at a rate of 466 GigaByte per second (GB/s), resulting in 40 PetaByte (PB) of correlated data in a continuous 24-hours of operation. The correlated data will then be imported into the Science Data Processor (SDP) for data processing and calibration in real time. The amount of initially calibrated data from the SDP is dramatically reduced but still remains at the PB level. Deeper analysis (e.g., self-calibration and deep imaging) is needed to turn these SDP data into science-ready data products that can be used by scientists. Numerous challenges will be encountered in the processing, curation and storage of the SKA big data .
Wide-field imaging is one of the main challenges of SKA data processing. Due to the large FoV and non-coplanar baselines of the SKA, the conventional two-dimensional (2D) Fourier transform approximation in radio synthesis imaging no longer holds, i.e., the effect of the -term cannot be neglected . Therefore, new imaging techniques are developed to tackle the wide-field imaging problem of -term correction, e.g., three-dimensional (3D) Fourier transform , faceting , -projection , -stacking , and warped snapshots  techniques.
Among these techniques, the 3D Fourier transform requires the largest amounts of calculation and memory space, and is thus rarely used in practical applications. The remaining techniques are potentially suitable for SKA imaging, with some being practically used in SKA pathfinder data processing. The faceting algorithm divides the observed sky zone into small pieces named as ’facets’; in each facet the w-term can be neglected and the 2D Fourier transform can be applied. The produced images in the facets are then combined. There are two types of facets: in the image-domain [6, 12] and uv-domain . One shortcoming of faceting images is that the generated images often suffer from the edge effect induced by misalignment of image brightness scales in adjacent facets. The edge effect and noise interference lead to the reduction of imaging quality. A recently developed image domain gridding algorithm eliminates the need for computing the convolution kernel and thus reduces the total cost of gridding; moreover it is more accurate compared to classical gridding algorithms . The -projection algorithm processes the convolution of the visibility data with a new convolution kernel function (for more details, see Section 2) and then projects the visibility data onto a two dimensional plane, enabling 2D fast Fourier transform (FFT) deconvolution. Although this method also increases the amount of computation and occupies a large amount of computer memory, the created image has higher quality and lower noise level than that obtained by the faceting technique . The principle of the -stacking technique is to rasterize the visibility data onto different w layers so as to perform a 2D FFT on each thin w layer. The processing of -stacking also consumes a large amount of device memory . The warped snapshot technique divides the visibility data according to short time slices; in each slice, visibility data can be assumed to be projected onto a two-dimensional plane. Each data slice is imaged independently, and the images of all time slices are then combined. The warped snapshot also requires an instantaneously co-planar array and involves multiple FFT calculations. In addition, the warped snapshot technique increases the algorithm complexity and involves additional image stitching time.
Compared with the 2D FFT method, the above techniques can recover the structure of the radio sources located far from the phase center of the image. However, these techniques all involve long computational time and large memory consumption. Therefore, executing these techniques often relies on high-performance clusters or supercomputers. The rapid development of computer technology, especially the development of parallel computing and artificial intelligence technologies, renders parallel execution and acceleration of the wide-field imaging feasible. As the -projection technique produces higher quality images than the faceting technique, has open source code and is easy to transplant , we investigate the single-thread and parallel implementation of wide-field imaging by adopting the -projection technique in this paper.
2 Parallel Implementation of -projection
Parallel optimization of the -projection algorithm has already been presented in literature. Varbanescu et al. introduced a parallel acceleration method based on Cell BE for multicore CPU processors, with a focus on gridding and degridding . Although the method accelerated the gridding and degridding processing, it increased the amounts of calculations in other steps, and the overall effect on optimization was not properly evaluated. In addition, the method is implemented on a single node, with an extension to multiple nodes yet to be investigated. Amesfoort et al.  ported the work of Varbanescu et al. to a GPU. The local optimization problem was partly solved, enabling the bandwidth usage of the device to be maximized, but the limited GPU memory used in their work lowered the imaging resolution, hampering the practical applications to large-scale radio telescope arrays such as the Low Frequency Array (LOFAR) and SKA. Romein  provided a -projection rasterization acceleration method based on GPU architecture, which can reduce the number of memory access, avoid large memory waste, and accelerate the implementation. Based on Romein’s algorithm, D. Muscat developed a GPU-based w-projection method WImager which is included in their imaging tool called ’Malta-imager’ . By utilizing Compression to merge some consecutive records, WImager results in 3 times increase of gridding speed; but mention that the Compression performance depends on observation integration time, grid configuration and the oversampling factor of the convolution functions. They found the GPU-based imaging method is 100 times faster than a common CPU-based method. Moreover they found that loading data from disk is a main limiting factor of the execution time. Simulated data were used in these work, whereas the radio astronomy observational data structure is more complicated, and thus, the practicality of the acceleration algorithm needs further testing.
Though progress has been made in these previous work, more experiments are needed to achieve practical parallelization and acceleration methods of -projection imaging. In particular, parallel implementation and performance test on multiple nodes has not been investigated yet. In this section, we introduce two parallel methods of -projection based on CPU and GPU architectures on multiple nodes.
2.1 Parallelizaion Method Using Hybrid MPI and OpenMP Based on CPUs
The data loading includes the time consumed on (1) reading visibility data and auxiliary data contained in Measurement Set files from external storage media to disk on clusters or HPCs, (2) copying the visibility data from disk to memory, and then (3) copying data from memory to CPU or GPU. In the CPU-based pipeline, the data loading only contains steps (1) and (2). These processes of transferring data are found to be highly dependent on the I/O bandwidth of the storage device (e.g., ) and cost a large amount of execution time since the gridding process can only commence after all data is loaded into memory. The parallelization of GPU CUDA program requires that the visibility data are time ordered. Thus, additional re-ordering time is spent on the data loading in the GPU-based processing pipeline, compared with CPU-based pipeline. There are basically two approaches to perform parallel implementation to accelerate data loading. One is to parallelize the data I/O in the underlying layer, and the other is to write a parallel data loading program in the application layer before data processing.
The Casacore Table Data System (CTDS)  is the data I/O library of the Measurement Set. Currently, CTDS only allows data reading from and writing to a table of the Measurement Set in serial mode. This will eventually consume a large amount of time when the data size of the Measurement Set becomes very large. CTDS provides an interface for users to design and implement their own CTDS storage managers. Wang et al. developed the Adaptive I/O System (ADIOS) based on the parallel storage management called AdiosStMan 111https://github.com/SKA-ScienceDataProcessor/AdiosStMan, which is the first parallel I/O system for the Measurement Set . The performance test shows that AdiosStMan has an excellent scalability on up to 80 compute nodes for parallel writing. AdiosStMan is currently in the experimental stage and has not been widely used in practical astronomical data processing. Therefore, we choose the second parallel approach for data loading. In our method, we use MPI to implement parallel data loading and allocate parallel data processing on multiple processors and multiple compute nodes.
In gridding step, the original -projection imaging provides a coarse parallelization approach. As shown in Fig. 1, the gridding step is embedded in a loop process, thus providing a natural way to accelerate the processing by parallelizing that loop operation. The parallelization can be achieved using OpenMP, which is a parallel standard form of shared memory that can achieve parallel optimization without complex code porting . More details are presented in Appendix B.
Figure 1 shows the flow chart of the hybrid MPI+OpenMP parallel strategy running on multiple compute nodes, which can be multiple CPU nodes or GPU nodes with multiple CPU cores. First, the input Measurement Set data are split into pieces using MPI, where is the number of MPI processes. Data processing of each piece of data is executed on a number of compute nodes, each of which is identified by the MPI rank number. Second, OpenMP allocates multiple threads to parallelize gridding loops of each piece of data on the allocated compute node. Each thread is assigned iterations. Third, the accomplished gridding results on each compute node are accumulated and sent to the master node (i.e., node #0). Finally, the master node performs an IFFT on the transmitted gridding result to obtain the final image.
2.2 Parallelization Method Using Hybrid MPI and CUDA Based on GPUs
In data loading on GPU architecture, we also use MPI to parallelize data loading and processing but to split data by time. Figure 2 shows the parallel procedure based on hybrid programming of MPI and CUDA. The Measurement Set file should be split into multiple sub-Measurement Set files by observation time using the CASA SPLIT task in advance. The number of sub-Measurement Set files is equal to the number of MPI processes. More details of the data loading and splitting are presented in Appendix B.
We should note that the premise of parallelization using the CUDA program is that the observational data have been grouped by baseline and then ordered by time. However, the observation data are not stored in this order. Therefore, extra data repackaging and sorting is necessary in the data loading step before data processing starts. In addition, the number of observation time range (time periods) on all baselines are not uniform. This may happen when data in some time range on some certain baselines are deleted from the Measurement Set file, being identified as radio frequency interference (RFI) singals or other observation and/or calibration errors. Therefore, a variable-length array is used to store the data on each baseline, but this variable-length array is very difficult to transfer to the GPU for calculation. We thus use a one-dimensional array to store the first scan position (called ”timestamp” in the imaging pipeline) of each baseline. The Measurement Set does not store the baseline index; only the ID number of each antenna is stored. We provide a method of calculating the baseline index as follows. First, the data are grouped by baseline, and each group corresponds to a number, which can be calculated using the ID number of the antenna:
where represents the baseline consisting of antenna 1 (ant1) and antenna 2 (ant2), and are the ID numbers of ant1 and ant2, and is the minimum ID number of ant1 and ant2.
Then, the baseline number is calculated according to eq. 2 and count the number of timestamps under the same baseline number (i.e., the number of samples of the same baseline) and store them in the array . Finally, use the prefix sum method to calculate the starting index of each baseline; its mathematical expression is
Combining the start value of the baseline index calculated above and the number of timestamps under the same baseline index, the imported data can be reordered according to the baseline index.
A GPU, having thousands of compute cores, has great advantage compared to a CPU in parallel computing. Therefore, it is more efficient to accelerate the gridding step with a GPU. Here, we introduce the implementation method of gridding acceleration by using CUDA. Because the integration time between consecutive measurements is short enough, continuous measurements on the same baseline will be gridded onto the same grid point. The gridding results of multiple timestamp data continuously measured on the same baseline are then accumulated and buffered in a register. This feature allows gridding processing to be effectively combined with CUDA’s threading model. The data in the register are written into the global memory when the processing moves to the next baseline data. There is no competition between multiple threads in this way. Parallel gridding on multiple baselines can be performed if the write access to the global memory is complete in an atomic manner. In the gridding process, the operation of each convolution kernel is assigned as a thread, and the entire convolution process requires a total of threads, where is the full support size of the convolution kernel. The minimum number of threads for the CUDA programming is , where is the total number of baselines.
In Figure 2, the processes bounded by the red boxes indicate the processes running on the master node, those bounded by black boxes indicate the processes running on the slave nodes, and those bounded by blue boxes indicate the processes running on the both master and slave nodes.
The visibility data first are imported from sub-Measurement Set files stored on the master compute node and slave compute nodes in parallel. Then, the visibility data will be re-ordered and re-packaged (using the method discussed above) on each compute node. After re-ordering, the data are copied to GPU memory which is allocated to store those data before copying. Next, gridding will be executed on each compute node in parallel manner by using the CUDA program. The exported gridded data from each slave compute node will be sent to the master node, where all gridded data are combined into a whole data set. A final fast IFFT operation is performed to create the image.
3 Results and Discussion
In this section, we will present the results from three test experiments: validation, robustness and scalability. The implementation, computer configuration and results will be presented in detail for each experiment.
3.1 Validation Test
The purpose of this experiment is to validate the correctness of our methods as well as the programs by using observational data. This experiment was conducted on the Tianhe-2 (MilkyWay-2) supercomputer .
Tianhe-2 was the world’s fastest supercomputer in the TOP500 list 222http://www.top500.org/ from 2013 until 2016 and is now the fourth fastest supercomputer. Tianhe-2 has a total of 17920 compute nodes, among which 25 are GPU nodes and the remainder are CPU nodes. Appendix Table B1 lists the specification of a single GPU node on Tianhe-2. Each node has two CPU chips (10 cores each) and two GPU cards (NVIDIA Tesla K80).
The observational data used in this experiment were observed with the JVLA 333https://casaguides.nrao.edu/index.php/EVLA_Wide-Band_Wide-Field_Imaging:_G55.7_3.4-CASA4.4. Those observations were made at L band (covering the 1-2 GHz frequency range) in the D-array configuration from 01:00:25 to 08:25:16 on August 23, 2010, with a duration of approximately 8 hours. The data size is 15 GB. The purpose of these JVLA observations is to validate the wide-field imaging methods integrated in the CASA software. Moreover, the JVLA is an interferometer similar to SKA1-mid. Therefore, these data are suitable for verifying the SKA1-mid data processing techniques. We first run calibration tasks on the data using CASA software following the procedure given in 3) to prepare the calibrated data for the input data of our tests.
The validation experiment used 10 GPU nodes of Tianhe-2. In the test of the hybrid MPI+OpenMp method, we used the 10 GPU nodes but did not enable the GPU card. Each node contains 20 CPU cores, so actually we used 200 CPU cores in total. In the test of the hybrid MPI+CUDA method, one GPU card on each of the ten GPU nodes was enabled, so a total of 10 GPU cards were used. The primary argument settings for these tests are as follows: the final output image is 1024 pixel 1024 pixel, the cell size of the image is 8 arcsecond in the right ascension (RA) and declination (DEC) directions. The support size of the convolution kernel depends on the minimum angular scale to be sampled in the image plane . Higher spatial resolution imaging requires larger support size of the convolution kernel, though with a vast computation cost. In addition, for wide field of view and non-coplanar array, large support size of convolution kernels is needed to describe the image features. A reasonable selection of the support size is a compromise between computation cost and observation configuration. In our experiments, the number of -projection planes is 23, and the full support size of the convolution kernel is 9.
The results are shown in the bottom panel of Figure 3. As a comparison, the results from a 2D FFT and single-thread -projection are demonstrated in the top panel.2D FFT The shell-like structure of the supernova remnant G55.7+3.4, which is prominent in the image centers, does not show significant difference in the four panels. However, the inset of the 2D FFT image (top left) clearly shows a distorted morphology of a point source far away from the image center, near RA=19 h 26 m, DEC=, when compared with other panels. Other point sources close to the image edge also suffer from distortions at various levels. The images produced by MPI+OpenMP (bottom left) and MPI+CUDA (Figure 3-f and h) are quantitatively consistent with each other and also with that acquired from the single-thread CASA -projection (Figure 3-d). The noise level of each image is 2.2, 2.3 and 2.3 mJy beam, respectively; the peak brightness of a test off-center point source shown in the inset panels Figures 3-c, e and g is 45, 48, and 49 mJy beam, respectively, and they are consistent within 2.
To assess the acceleration of the parallelization methods, we ran each test ten times and recorded each runtime. The average runtime was calculated. The runtime of the single-thread processing is 492.84 seconds. The runtime of the hybrid MPI+OpenMP is 3.59 seconds on a total 200 CPU cores or with 200 threads, and the hybrid MPI+CUDA takes 2.73 seconds on 10 GPU nodes (one GPU card each node). It is obvious that MPI+OpenMP and MPI+CUDA are faster than the single thread by factors of 137.3 and 180.5, respectively, confirming that the parallel processing indeed accelerates the -projection processing. MPI+CUDA is 1.3 times faster than MPI+OpenMP method. The GPU acceleration performance is consistent with the previous experiments, e.g., . More comparison of the operation results on CPU- and GPU-architecture will be presented in Section 4.3.
3.2 Performance Test of the Parallelization Methods
The motivation of the performance test is to demonstrate the robustness of parallel acceleration programs on various compute environments and configurations and to explore the scalability from a single node to multiple nodes.
Robustness tests on single node
The experiments were performed on three clusters with diverse configurations: Tianhe-2 supercomputer (Tianhe-2, in short), the SKA China Data Center prototype at the Shanghai Astronomical Observatory (SKACDC, see Appendix Table B2), and the Galaxy supercomputer of the Australian Pawsey Supercomputing Center (Pawsey, see Appendix Table B3).
One GPU node on each platform was used for the tests. The GPU cards are of different types. Figure 4 compares the gridding runtimes on diverse GPU machines. The Tesla V100 runtimes show the best computational performance in either single or double precision case. The K80 GPU is 2 times faster than the K20X GPU in the single precision mode, but their runtimes in the double precision mode are similar. This experiment confirms that the parallel -projection program can be deployed on diverse cluster environments.
Of note, the experiment undertaken on SKACDC is among the performance tests of the prototype machine. In this test, we also used CPUs of the GPU node of the SKACDC cluster to investigate the performance of the algorithm of MPI+OpenMP on a single node; each of the two CPUs has 56 cores. The number of threads is controlled by changing the environment variable OMP_NUM_THREADS. The results are compared in Figure 4. It displays the runtimes of the most time-consuming task, gridding. The white and black bars represent the single and double precision experimental results, respectively. In all cases, the single precision costs less time than double precision. The same result is seen in other experiments in this work as well. The gridding runtime decreases as the number of threads (CPU cores) increases in both single and double precision modes, but the scaling is not linear. The gridding runtime when using a single thread is 6.9 times that when using the maximum 56 threads. The curves flatten after 8 threads (CPU cores). The change is most likely due to an imbalance in the workload that the number of rows of visibility being gridded cannot be divided evenly between threads of the MPI+OpenMP method. This remains an open question to be solved in future work. The runtime for the V100 GPU is shorter than for 56-core CPU processing, suggesting that gridding using a GPU is better than that using a multicore CPU, in particular when CPU threads used are small.
The above results demonstrate that both MPI+CUDA and MPI+OpenMP work normally on a single node under various experimental environments.
Scalability tests on multiple nodes
We then carried out multiple-node tests with the purpose of investigating how performance changes with number of nodes. This experiment was deployed on Tianhe-2 using the same setup as the experiment in section 3.1. Ten tests were conducted, with the number of nodes increasing from 1 to 10 in sequence. Figure 5 and Figure 6 present the runtimes recorded by data loading and gridding, respectively.
In Figure 5, the data loading times for both the MPI+OpenMP and MPI+CUDA methods show consistent decreasing trends with increasing numbers of compute nodes. The single and double precision processing curves do not exhibit significant difference. The runtime decreases rapidly by a factor of 3.5-5.8 as the number of compute nodes increases from =1 to =4 (see Appendix Table B5), showing a roughly linear relationship between runtime and number of compute nodes. However, the runtime - number of nodes curve flattens after .
We found a notable difference between MPI+OpenMP and MPI+CUDA curves at small ; the data loading runtime using GPU processing is much longer than that for CPU processing. Quantitatively, the data loading time for single-precision MPI+OpenMP processing is approximately 14.8 seconds, but the single-precision MPI+CUDA takes 34.6 seconds. The longer data loading time in GPU-based processing actually arises from the extra time consumed on copying data from disks of host machine to GPU global memory (see Section 3.2 and Figure 4). Muscat  also found that loading data from disk is the main limiting factor of their GPU acceleration experiments. The MPI+CUDA runtime becomes comparable to or slightly shorter than that of MPI+OpenMP until =8. Further tests on even more nodes are necessary to explore whether the GPU-based processing runtime is smaller than the CPU-based runtime on larger numbers of nodes.
Figure 6 shows the runtime - number of nodes curves of the gridding task derived from CPU-based (MPI+OpenMP) and GPU-based (MPI+CUDA) processing. As with the curves in Figure 5, the runtime gradually decreases as the number of employed nodes increases, verifying the scalability of the MPI+OpenMP and MPI+CUDA methods in the parallel gridding task. The slopes are steeper at small and then become flatter at large . The varying slopes can be interpreted employing the same reasoning used to interpret the data loading time curves above. Floating-point precision exerts little influence on runtime in the CPU-based method, MPI+OpenMP processing. However, the GPU-based MPI+CUDA shows large discrepancies for all node cases, and the single-precision runtime is approximately 1/4 of the double-precision time for , changing to approximately 1/2 for . This suggests that single-precision floating-point computation is faster than double-precision computation on GPU architecture. This is a useful implication because radio astronomical data are often kept with single precision.
In summary, the scalability of the MPI+OpenMP and MPI+CUDA methods is verified. The GPU-based MPI+CUDA processing, in general, displays a superior advantage of parallel acceleration using large supercomputers with multiple nodes. The single-precision MPI+CUDA computation is fastest in the gridding operation.
Impact of parameter selection on parallel implementation
We further analyze the impacts of two key parameters, the support size of the convolution kernel and the image size, on the parallel implementation. In these tests, we used 10 GPU nodes of the Tianhe-2 supercomputer and took gridding as an example.
Figure 7 shows the changes in gridding time with the support size of the convolution kernel. As the full support size increases from 3 to 9, the gridding time steadily increases. We can see from Figure 2 that large support sizes of the convolution kernel result in more loops in the blue-colored boxes, thus increasing the gridding runtime. A sharp increase in gridding time is seen when the support size is higher than 9, implying that a proper selection of convolution kernel size is important for saving the total runtime. MPI-CUDA processing appears to be faster than the MPI+OpenMP method. This advantage becomes even prominent with a large number of convolution kernels.
Figure 8 shows the variation of gridding time with image size. As seen from the figure, the gridding time steadily increases as the image size becomes larger until it reaches 2048 pixel 2048 pixel. A sharp jump occurs from 2048 to 4096 pixels, and the robustness becomes poor. Measurements of this sharp transition should be further investigated to extend the scalability to high-resolution and large-size images. The percentage increment in the MPI+OpenMP processing is higher than that in the MPI+CUDA processing, suggesting that the scalability of the GPU-based algorithm is more robust. Single-precision processing in either the CPU-based or GPU-based method costs less time than the corresponding double-precision processing, in agreement with the results derived from the experiments above.
4 Summary and Conclusions
In this paper, we investigated two methods for parallel acceleration of the -projection wide-field imaging technique. One is a hybrid MPI and OpenMP method based on CPU architecture, and the other is a hybrid MPI and CUDA method based on GPU architecture. The main conclusions of the research are summarized as below.
(1) Applying -projection effectively recovers the image distortion at the edge of the large field of view. The image quality derived from the MPI+OpenMP and MPI+CUDA is quantitatively consistent with that derived from single thread processing, suggesting that the parallelization is successful and does not include any artifacts. Moreover, the parallel processing is 137-181 times faster than the single thread processing.
(2) Two methods are successfully tested on various compute environments on the Tianhe-2 supercomputer (China), the Pawsey supercomputer center (Australia) and the SKA China Data Center (SKACDC) prototype, verifying the robustness of the programs. The SKACDC also provides a demonstration of its advanced performance, enabling radio interferometer data processing.
(3) Two time-consuming tasks, data loading and gridding, are chosen to test the parallel scalability on multiple nodes. In general, data loading and gridding runtime decrease with the number of nodes in MPI+OpenMP and MPI+CUDA processing, confirming the scalability of the methods. It also suggests that the parallel programs are adaptive to large-scale deployment on clusters or supercomputers. We should mention that the ”runtime - number of nodes” curves show linear relationships only at small number of nodes. The slopes become flatter with an increasing number of nodes. The flattening of the curves results from the imbalance in the workload between the multiple threads and nodes. This would be a problem that should be seriously considered in large SKA imaging operation, especially when there are multiple concurrent tasks of diverse science applications. Future experiments by integrating the image domain gridding in the pipeline would be desirable to further reduce the runtime of w-projection.
(4) MPI+CUDA spends more time on data loading than MPI+OpenMP on a small number of nodes because the data need to be reordered by time and by baseline and then copied to the global GPU memory before processing. This extra time cost becomes less remarkable on large () numbers of nodes. We should mention the apparent relationship between the running time and could be biased by the limited data and image sizes which are rather small in these benchmark tests. The parallel data loading at the SKA scale would not depend on the number of nodes after the splitting in time is done.
(5) Data precision exerts a significant influence on MPI+CUDA gridding processing. Single-precision computation is prominently faster than that for double-precision. The precision has little effect on CPU-based MPI+OpenMP processing. In all these experiments, single-precision GPU-based processing shows the best computational performance with least time consumption.
(6) The gridding runtime shows a remarkable increase when the support size of the convolution kernel exceeds 8 because more loops are included in the processing. The gridding runtime continues to steadily increase with image size from 128 pixels to 2048 pixels; after that, the runtime rapidly increases. This should be kept in mind because SKA survey image sizes are usually larger than 8192 pixels. GPU-based MPI+CUDA processing shows the most stable variation. The experimental results and experience gained from the present work are helpful for the imaging pipeline development of not only the SKA, but also other radio interferometers.
This work used China SKA Regional Center prototype system at Shanghai Astronomical Observatory, funded by the National Key R&D Programme of China (under grant number 2018YFA0404603) and Chinese Academy of Sciences (under grant number 114231KYSB20170003). This research also used resources of National Supercomputer Centre in Guangzhou. This work was supported by resources provided by the Pawsey Supercomputing Centre with funding from the Australian Government and the Government of Western Australia. This work is partly supported by National Natural Science Foundation of China (No. U1831204, 11703069), the Guangxi Cooperative Innovation Center of cloud computing and Big Data (No. 1716), and the Guangxi Colleges and Universities Key Laboratory of cloud computing and complex systems. BQL thanks Chen Wu for helpful discussion and comments on the manuscript.
All authors contributed to the manuscript preparation. Baoqiang Lao performed the experiment and data analysis, and contributed to the paper writing. Tao An is responsible for the design of the experiment, coordinated the group work and drafted the paper. Ang Yu contributed to the experiment implementation. Wenhui Zhang and Junyi Wang participated in the technical definition of the experiment. Quan Guo, Shaoguang Guo and Xiaocong Wu contributed to the preparation of the experiment.
- 1 Science with the Square Kilometre Array, C. Carilli & S. Rawlings (eds.), (Elsevier B.V., Amsterdam, 2004).
- 2 Advancing Astrophysics with the Square Kilometre Array, T. Bourke, R. Braun, R. Fender et al. (eds.), (Dolman Scott Ltd, SKAO; Published online as PoS [AASKA14], 2015).
- 3 P. Dewdney, W. Turner, R. Braun et al., SKA1 System Baseline Design, SKA-TEL-SKO-0000002, Rev 3, (2016).
- 4 R. Braun, Anticipated SKA1 Science Performance, SKA-TEL-SKO-0000818, Rev 01 (2017).
- 5 T. An. Science Opportunities and Challenges of SKA Big Data. SCIENCE CHINA Physics, Mechanics & Astronomy, 62(8): 989531 (2019).
- 6 T. J. Cornwell, R. A. Perley. Radio-interferometric imaging of very large fields - The problem of non-coplanar arrays. Astronomy and Astrophysics, 261: 353-364 (1992).
- 7 Synthesis Imaging in Radio Astronomy II, edited by G. B. Taylor, C. L. Carilli, & R. A. Perley. ASP Conference Series, Vol. 180, (1999)
- 8 L. Kogan, E. W. Greisen . Faceted imaging in AIPS. AIPS memo (2009).
- 9 T. J. Cornwell, K. Golap, S. Bhatnagar. The Noncoplanar Baselines Effect in Radio Interferometry: The W-Projection Algorithm. IEEE Journal of Selected Topics in Signal Processing, 2(5): 647-657 (2008).
- 10 A. R. Offringa, B. McKinley, N. Hurley-Walker, … and J. Rhee. WSCLEAN: an implementation of a fast, generic wide-field imager for radio astronomy. Monthly Notices of the Royal Astronomical Society, 444(1): 606-619 (2014).
- 11 T. J. Cornwell, M. A. Voronkov, B. Humphreys. Wide field imaging for the Square Kilometre Array. Image Reconstruction from Incomplete Data VII. International Society for Optics and Photonics, 8500: 85000L (2012).
- 12 C. Tasse, B. Hugo, M. Mirmont, et al. Faceting for direction-dependent spectral deconvolution. Astronomy & Astrophysics, 611, A87-101 (2018).
- 13 B. Lao, T. An, X. Chen, Y. Lu, and X. Wu. Research on Wide Field Imaging Technology of Low Frequency Radio Telescope Array. Chinese Astronomy and Astrophysics, 42(4): 626-651 (2018).
- 14 S. van der Tol, B. Veenboer, & A. R. Offringa, Astronomy & Astrophysics, 616, A27-39 (2018).
- 15 L. Pratley, M. Johnston-Hollitt, J. D. McEwen. A fast and exact w-stacking and w-projection hybrid algorithm for wide-field interferometric imaging. arXiv:1807.09239 (2018).
- 16 A. L. Varbanescu, A. S. van Amesfoort, T. Cornwell, A. Mattingly, B. G. Elmegreen, R. Van Nieuwpoort, and H. Sips. Radioastronomy image synthesis on the CELL/BE. In European Conference on Parallel Processing, 749-762 (2008).
- 17 A. S. van Amesfoort, A. L. Varbanescu, H. J. Sips, and R. V. Van Nieuwpoort. Evaluating multi-core platforms for HPC data-intensive kernels. In Proceedings of the 6th ACM conference on Computing frontiers, 207-216 (2009).
- 18 J. W. Romein. An efficient work-distribution strategy for gridding radio-telescope data on GPUs. ACM International Conference on Supercomputing, 321-330 (2012).
- 19 D. Muscat, High-Performance Image Synthesis for Radio Interferometry. PhD thesis, arXiv:1403.4209 (2014).
- 20 G. N. J. van Diepen. Casacore Table Data System and its use in the MeasurementSet. Astronomy and Computing, 12: 174-180 (2015).
- 21 R. Wang, C. Harris, A. Wicenec. AdiosStMan: Parallelizing Casacore Table Data System using Adaptive IO System. Astronomy & Computing, 16: 146-154, (2016).
- 22 L. Dagum, R. Menon.. OpenMP: an industry standard API for shared-memory programming. IEEE computational science and engineering, 5(1): 46-55 (1998).
- 23 X. Liao, L. Xiao, C. Yang, and Y. Lu. MilkyWay-2 supercomputer: system and application. Frontiers of Computer Science, 8(3): 345-356 (2014).
- 24 C. Tasse, S. van der Tol, J. van Zwieten, G. van Diepen, S. Bhatnagar. Applying full polarization A-Projection to very wide field of view instruments: An imager for LOFAR. Astronomy & Astrophysics, 553, A105-117 (2013).
Baoqiang LAO received the M.Sc. degree from Guilin University of Electronic Technology, Guilin, China, in 2015, in Information and Communication Engineering. He is currently a software engineer of Shanghai Astronomical Observatory (SHAO) of the Chinese Academy of Sciences and a number of SHAO’s Square Kilometre Array (SKA) group. His research area mainly focuses on SKA continuum survey imaging, wide-field imaging, HPC performance computing, parallel programing and data intensive astronomy technique.
Tao AN, professor at the Shanghai Astronomical Observatory of the Chinese Academy of Sciences. Research interests: high resolution imaging of compact astronomical objects including black hole, jets, pulsars. He is a member of Square Kilometre Array (SKA) Regional Centre Streering Committe, and responsible for the China SKA Data Center preparation. He also serves in the International Astronomical Society (IAU) Commission B4 (Radio Astronomy), European VLBI Programme Committee.