# OpenDDA: A Novel High-Performance Computational Framework for the Discrete Dipole Approximation

## Abstract

This work presents a highly-optimised computational framework for the Discrete Dipole Approximation, a numerical method for calculating the optical properties associated with a target of arbitrary geometry that is widely used in atmospheric, astrophysical and industrial simulations. Core optimisations include the bit-fielding of integer data and iterative methods that complement a new Discrete Fourier Transform (DFT) kernel, which efficiently calculates the matrix-vector products required by these iterative solution schemes. The new kernel performs the requisite 3D DFTs as ensembles of 1D transforms, and by doing so, is able to reduce the number of constituent 1D transforms by and the memory by over . The optimisations also facilitate the use of parallel techniques to further enhance the performance. Complete OpenMP-based shared-memory and MPI-based distributed-memory implementations have been created to take full advantage of the various architectures. Several benchmarks of the new framework indicate extremely favourable performance and scalability.

PROPOSED RUNNING HEAD: OPENDDA: HIGH-PERFORMANCE DDA

James Mc Donald

Computational Astrophysics Laboratory, Department of Information Technology,

National University of Ireland, Galway

Email: james@it.nuigalway.ie Phone: 353 91 495632 Fax: 353 91 494584

Aaron Golden

Computational Astrophysics Laboratory, Department of Information Technology,

National University of Ireland, Galway

Email: agolden@it.nuigalway.ie Phone: 353 91 493549 Fax: 353 91 494501

S. Gerard Jennings

School of Physics & Environmental Change Institute,

National University of Ireland, Galway

Email: gerard.jennings@nuigalway.ie Phone: 353 91 492704 Fax: 353 91 495515

Keywords: Discrete Dipole Approximation, optimisation, matrix-vector product, CG-FFT, parallel algorithm

## 1Introduction

The Discrete Dipole Approximation (DDA) is an intuitive and extremely flexible numerical method for calculating the optical properties associated with a target of arbitrary geometry, whose dimensions are comparable to or moderately larger than the incident wavelength. The basic concept was first introduced in 1964 to study the optical properties of molecular aggregates [5]. Subsequently, prompted by the polarisation of light by interstellar dust and the desire to calculate the approximate extinction, absorption and scattering cross sections for wavelength-sized dielectric grains of arbitrary shape, a comparable scheme was formulated [31]. Since then, the DDA has undergone major development. A recent erudite review describes this development in detail [41].

There are currently two main publicly available DDA implementations, DDSCAT and ADDA, the development of which has provided seminal contributions to the advancement of the DDA. DDSCAT [9] is written in FORTRAN 77 and is, at present, the most widely used DDA implementation. It has been developed by Bruce Draine from the Princeton University Observatory and Piotr Flatau from the Scripps Institution of Oceanography, University of California, San Diego. Its development provided several advancements including, among others, an updated expression for the dipole polarisability that incorporated a radiative reaction correction [7], the application of DFT techniques to the iterative core [21], and the introduction of the Lattice Dispersion Relations (LDR) for the prescription of the dipole polarisability [10]. A review paper by the authors concisely summarises these contributions [8]. ADDA [43] is an extremely advanced C code that has been developed (over a period of more than 10 years) by Maxim Yurkin and Alfons Hoekstra at the University of Amsterdam. It was the first parallel implementation of DDA and from its initial incarnation has been able to perform single DDA computations across clusters of autonomous computers [23]. This enables the code to overcome the restrictions on the number of dipoles that can be used due to the memory limitations of a single computer. The code also incorporates several integrated iterative solvers, several prescriptions for prescribing the dipole polarisabilities, and a host of other functionality [42]. Other DDA implementations, which are not publicly available, see [43] for further details, include SIRRI [32]; a FORTRAN 90 code that has been developed by Simulintu Ltd in Finland, and ZDD [44]; a C++ code that has been developed by Evgenij Zubko and Yuriy Shkuratov from the Institute of Astronomy, Kharkov National University in the Ukraine. For interested parties, a recent paper performs an in-depth comparison of these four codes [30].

Put succinctly, the DDA is an extremely flexible and powerful numerical scheme which replaces a continuum scatterer by an ensemble of dipoles situated on a cubic lattice. Each of the dipoles is driven by the incident electric field and by contributions from the other dipoles. Equation 1 outlines the kernel of this scheme which is a very large and fully populated set of complex linear equations.

Equation 1 must be solved for the unknown polarisations, , from which the scattering properties can be calculated. Here, is the wavenumber, is the permittivity of free space, is the polarisability of dipole , is the incident electric field at dipole , is the identity matrix, , and is the dyadic product, which is a special case of the Kronecker product for vectors of equal order.

For systems based on three-dimensional lattices, the computational requirements are extremely sensitive to the size of the lattice used. Even a small increase in the dimensions of the lattice significantly increases the computational overheads. As a result, the true potential of the host algorithm is often encumbered by the severe and often prohibitive burdens imposed by the computational scale required to produce meaningful results. In these cases, it is crucial that every possible algorithmic or computational enhancement be exploited in order to augment their applicability. OpenDDA is a novel computational framework for the DDA which has been custom-built from scratch in the C language. The new framework incorporates a variety of algorithmic and computational optimisations, enabling researchers to take full advantage of the different architectures that exist in the current generation of computational resources and beyond. Apart from the core optimisations, the new framework also incorporates an extremely intuitive and user-friendly interface for specifying the simulation parameters, including a sturdy and forgiving parser and a straightforward human-readable input control file.

The paper is organised as follows. Section 2 discusses the application of bit-fielding to streamline the storage of the target occupation and composition data. Section 3 deals with the design and development of a new highly-optimised DFT kernel to calculate the matrix-vector products within the iterative solution schemes. Section 4 discusses the custom-built iterative schemes that have been included to support the new DFT kernel. Section 5 describes the extension of this formulation to both shared-memory and distributed-memory parallel environments. Section 6 provides various various serial and parallel benchmarks to corroborate the assertions and illustrate the implications of the new optimisations. Finally, Section 7 presents concluding remarks.

## 2Bit-fielding of target occupation and composition data

In many applications, the range provided by the standard integer data types far exceeds the maximum range associated with the task at hand, and as a result, for large datasets, most of the bits are superfluous, constituting a serious waste of resources. Bit-fielding is a way to overcome this issue by storing data within the bits rather than the integers themselves. The entries can be set & tested by taking the appropriately shifted bit-mask and applying the relevant bitwise operator, LOGICAL OR for setting & LOGICAL AND for testing. The simplicity and speed of bitwise operations makes bit-fielding an extremely efficient method of storing non-negative integer based data.

The DDA’s theoretical strength, which is also its computational weakness due to imposed computational overheads, is its volumetric nature. Its ability to treat arbitrarily shaped particles with complex internal structures derives from the fact that each constituent site in the computational domain has separate integer flags describing its properties. Rather than storing the spatial extent of the target as the Cartesian coordinates of each lattice site in the computational domain and a Boolean occupation flag to indicate whether or not the site is occupied, requiring four integers, OpenDDA simply stores a single bit-fielded occupation flag per lattice site. The Cartesian coordinates need not be stored explicitly as, if required, are readily available as the loop control variables or via direct extraction from the array index itself. The optimisation of this integer data storage reduces the memory required for the spatial description of the target by a factor of 128.

The DDA also requires the specification of the material properties of the target on a dipole-by-dipole basis. This is done by associating each complex refractive index value with a distinct integer flag, and for each dipole, storing three integers to define the material properties in the , & directions. To compress this composition data, OpenDDA employs an extension of the bit-fielding theory to non-Boolean data sets. The new framework automatically calculates the requisite bit-mask, i.e., for four disparate materials, , two bits are all that is required, and the bit-mask is . For this example, the memory required for the integer data associated with the compositional description of the target is reduced by a factor of 16.

To put the optimisations in perspective, for , where , & are the , & dimensions of the computational domain, respectively, with four materials, the memory requirements for the integer data associated with the spatial and compositional description of the target are reduced from 3.26 to just over 0.1. Furthermore, the computational overhead incurred by the application of this bit-fielding optimisation within OpenDDA is negligible. Firstly, OpenDDA does not require the actual Cartesian coordinates within the iterative core. They are only needed at the initialisation stage for the calculation of the incident electric field at each of the lattice sites. Secondly, due to the nature of the iterative methods and the fact that the iterative solution vectors only include entries corresponding to occupied sites (discussed in §Section 4), the Boolean occupation flags are only tested at the initialisation stage, during the calculation of the dipole polarisabilities, and once at the start and end of the DFT matrix-vector product routine. Although the integer data constitutes only a fraction of the total memory footprint, with most deriving from the requisite complex vectors, these optimisations are not without merit and fundamentally underpin the efficiency of the current distributed-memory implementation of OpenDDA. To facilitate the elimination of zero storage within all iterative solution vectors, each autonomous node requires local knowledge of the entire spatial and compositional specification of the target and thus, unlike the complex vectors within the iterative core, the target specification is not distributed. Bit-fielding allows the full description of the target to be known globally without a prohibitive overhead. With each node having typically a few gigabytes to work with, the original memory requirements associated with the description of the target would have been comparable to the maximum possible allocation per node, negating the possibility of a viable distributed implementation.

## 3Kernel optimisation

Due to the sheer size of the system produced by Equation 1, iterative methods, where and is the number of iterations required to reach convergence, are employed, which are much faster than direct methods for large and . The cardinal computational hurdle and key to the acceleration of these iterative methods, such as the conjugate gradient method [1], are the matrix-vector products required by these schemes. A major breakthrough in this respect was the realisation that by embedding the arbitrarily shaped target in a computational domain and neglecting the diagonal terms, the matrix-vector products can be reformulated as convolutions [21]. Once the convolutions have been performed, it is then a trivial matter to include the contributions from the previously neglected terms.

The solution method is predicated upon Toeplitz and circulant matrix structure [4]. A Toeplitz matrix of order , denoted , is simply a matrix with constant diagonals, whereas a circulant matrix, , is a special case of Toeplitz structure, where the last element of each row is the same as first element of the next row. Any Toeplitz matrix, , can be converted to its circulant counterpart, , where , by taking the first column, appending the entries of the first row, excluding the initial entry, in reverse order, and completing the diagonals. The key is that any circulant matrix is diagonalised by the corresponding Fourier matrix, [38], where is the first column of .

Using the fact that the Fourier matrix is unitary, , the multiplication of an arbitrary vector with a circulant matrix can be performed as an element-wise multiplication in the Fourier Domain (FD) via the Convolution Theorem, . The symbol ’’ denotes convolution, ’’ denotes a Fourier transform pair and the symbol ’’ implies element-wise multiplication.

Not only does this reduce the computational complexity of the matrix-vector product from to , but since the circulant eigenvalues are a scalar multiple of the DFT of the first column of the circulant matrix, see Equation 3, the memory is reduced from to . To expedite the calculation of a Toeplitz matrix-vector product, , the matrix must undergo Toeplitz-to-circulant (T-to-c) conversion, , and the vector zero-padded to the dimension of the circulant matrix by appending zeroes. Equation 4 can then be applied. The first elements of the resultant are the required answer for the original Toeplitz matrix-vector product, with the remaining entries being redundant. Please note that this methodology and the following discussion is only applicable to systems exhibiting Toeplitz structure. Non-Toeplitz structure negates the transformation to circulant structure and hence, the use of DFTs to calculate the matrix-vector products efficiently. In terms of the specific geometry of the target in question, the method is indifferent. For sparse systems, from targets with a low volume filling fraction (VFF), such as dendritic or highly porous structures, the only difference is the amount of memory required by the iterative solution vectors. The solution vectors within the iterative core do not store entries corresponding to vacant lattice sites, a fact that will be discussed in §Section 4. However, the DFT kernel that efficiently calculates the matrix-vector products required by the iterative solvers must account for the entire rectangular computational domain, including the vacant sites, and thus performs the same amount of calculations, irrespective of the VFF.

For the DDA, by embedding the arbitrary target in a rectangular lattice and neglecting the diagonal terms, the requisite matrix-vector products take on the form of a complex-symmetric level-three tensor-Toeplitz matrix of order , denoted , in which, the constituent tensor elements are complex-symmetric but not necessarily Toeplitz, multiplied by an arbitrary vector of length , denoted , whose constituent elements are Cartesian vectors. Computationally, can be considered as an assemblage of nine complex-symmetric level-three single-element Toeplitz matrices, and since the tensor elements are symmetric, only six of the nine tensor components are independent and need to be considered explicitly, , . Analogously, the arbitrary vector can be split into three separate component vectors, , , . Consequently, this means that the components can be treated separately and the full problem is reduced to an ensemble of smaller single-element problems whose contributions will ultimately be manifested through FD products.

At this point, before attempting to decipher the ensuing optimisations of this DFT technique, for readers who are not familiar with the specific algorithmic details associated with the application of DFTs to the solution of such matrix-vector systems, it would be prudent to read Appendix A. This concisely summarises the conventional application of DFTs to these complex-symmetric level-three tensor-Toeplitz systems in the context of the DDA, as prescribed by [21].

The complete standard algorithm, which encapsulates the conventional implementation, can be summarised as follows:

T-to-c conversion and 3D DFT of the six independent tensor components

Zero pad and 3D DFT of the three zero-padded vector components

FD element-wise multiplication of the tensor and vector components

3D iDFT and normalisation of the components of the resultant

Addition of the previously ignored diagonal contributions

Due to the T-to-c conversions that are required for DFT applicability, the six tensor-components arrays contain a high degree of symmetry. Since a 3D DFT can be carried out as an ensemble of 1D DFTs, in the , in the , and in the direction, a new algorithm which exploits these symmetries can be constructed. Rather than expanding the six tensor-component arrays from a Toeplitz structure to a circulant structure and applying 3D DFTs, as delineated in Fig. ?, the T-to-c conversion and subsequent DFT can be amalgamated into a single computationally efficient operation. Each of the lines in the direction, , in the direction, , and in the direction, , in section ’’ in Fig. ?, can be taken in turn and copied into an external 1D array of length , , and , respectively. Each can be treated as the first column of a level-one Toeplitz matrix and the T-to-c conversion performed. The DFTs of these external vectors can be computed, and subsequently, the first , and elements of the resultants can be copied back into the 3D tensor-component array, overwriting the original data. This significantly reduces the number of 1D DFTs that have to be performed. Taking into account the zero-padding to ensure that , , and permits the use of fast DFT algorithms, the number of transforms is reduced by approximately in total:

Furthermore, since the full T-to-c conversion, as delineated in Fig. ?, is never explicitly performed, only section ‘1’ in Fig. ? need be stored, reducing the complex number storage required for the six tensor-component arrays by approximately :

For the three zero-padded vector-component arrays, the realisation that of each is by very definition, identically zero, can be used to expedite the calculations. This was originally proposed by [24] and has been exploited within ADDA [42]. Within OpenDDA, for computational efficiency, the DFTs of the three vector components, the subsequent element-wise FD multiplication of the tensor and vector components, and the iDFTs of the resultant, are amalgamated into a conjoint operation. Unlike the tensor component DFTs, which are performed in the , & directions, the DFTs of the vector components are performed in the , & directions, and the iDFTs are performed in the , & directions. The reason for this is twofold. Firstly, it allows the use of a -scratch plane for the DFTs in the & directions, the FD multiplication, and the iDFTs in the & directions, which provides significant memory reductions, as only of each of the three full arrays need be stored. Secondly, it supports the distributed FD multiplication in the MPI version of the algorithm (discussed in §Section 5). As a result of this optimisation, the complex number storage for the three vector components, ignoring the storage required for the three -scratch planes, is reduced by approximately :

Each of the three vector components is embedded into a zero-padded array, in the direction, in the direction, and in the direction. Once embedded, the DFT is performed in the direction. Then, each of the -planes is, in turn, copied into a external zero-padded scratch -plane. Subsequently, the DFTs in the & directions, the element-wise FD multiplication of the tensor & vector components, and the iDFTs in the & directions are performed ‘on-the-fly’ for this -plane. The initial section is then copied back into the 3D vector component array, overwriting the original data. Finally, once this has been done for each of the -planes, the iDFT in the direction is performed.

The reason that this is all possible is due to the fact that only non-zero DFTs, and only iDFTs that will contribute to the final answer, need to be performed. As far as the DFTs are concerned, in the direction, only the lines , see Fig. ? (d), contain any non-zero values and need be considered. Subsequently, in the direction, only the lines , see Fig. ? (e), need be transformed, and in the direction, all the lines , see Fig. ? (f), contain non-zero values and must be included. Taking into account the zero-padding to ensure that , , and permits the use of fast DFT algorithms, the number of transforms is reduced by approximately in total:

The FD element-wise multiplications can be carried out just as detailed in Eqs. Equation 6 ? in Appendix A, with one minor caveat. The six independent tensor-component arrays never underwent the full T-to-c conversion, however, since the DFT preserves the mirror symmetries, when accessing , the pertinent data can be extracted using the criteria:

For the iDFT of the resultant of the FD multiplication, i.e., of , & , see Eqs. Equation 6 ? in Appendix A, only transforms that will contribute to the final requisite answer need to be performed. In the direction, all the lines , see Fig. ? (f), contribute, through the subsequent and transforms, to the final answer. Subsequently, in the direction, only the lines , see Fig. ? (e), contribute, through the subsequent transform, to the final answer, and finally, in the direction, only the lines , see Fig. ? (d), contribute and need to be considered. Taking into account the zero-padding to ensure that , , and permits the use of fast iDFT algorithms, the number of transforms is reduced, as with the forward DFTs, by approximately in total:

The approximate complexity improvements for both computation and storage are given in Table. ?. Please note that for all three architectural variants, serial, OpenMP (shared-memory), and MPI (distributed-memory), the DFT functionality required by the iterative core is provided by the advanced complex interface of the extremely efficient, highly-portable and well known FFTW package (version 3) [19].

## 4Iterative schemes

To avoid unnecessary changes in data format to comply with the requisite structure of some external package, which would only serve to subvert any and all attempts to maximise the performance and streamline the structure of the framework, (like similar DDA implementations) OpenDDA incorporates a relatively large selection of iterative schemes that have been specifically custom-built to integrate seamlessly with the new DFT matrix-vector kernel and support the custom-built domain decomposition and parallel memory allocation schemes in the MPI implementation. Although a variety of different iterative solvers have been employed by [8], to date, no definitive “best” algorithm has been identified. To permit further investigation and to maximise the flexibility and usefulness of the end product, OpenDDA includes: Conjugate Gradients (**CG**) [22], Conjugate Gradients Squared (**CGS**) [36], BiConjugate Gradients (**BiCG**) [13], BiCG for symmetric systems () [14], Stabilised version of the BiCG (**BiCGSTAB**) [37], Restarted, stabilised version of the BiCG (**RBiCGSTAB**) [35], Quasi-minimal residual with coupled two-term recurrences (**QMR**) [16], QMR for symmetric systems () [14], Transpose-free QMR (**TFQMR**) [15] and a variant of BiCGSTAB based on multiple Lanczos starting vectors (**ML(n)BiCGSTAB**) [40].

For computational efficiency, as in ADDA [43], the new framework only stores information corresponding to occupied lattice sites, i.e., the elimination of zero storage within all iterative solution vectors. For the majority of cases, this provides a significant reduction in the memory required to store these vectors, as the target being simulated does not fully utilise the computational domain, generally having fewer occupied sites towards the exterior. As a quick example, consider a spherical target based on an computational grid. Although the grid contains lattice sites, for a spherical target, only are occupied, and dumping the unoccupied data, which is identically zero, equates to a decrease of in the memory required for the iterative vectors. To facilitate this optimisation, data extension and compression algorithms were created to extend a target vector out to allow the DFT matrix-vector product, and subsequently, to compress the resultant back into the computationally efficient form with just the entries corresponding to occupied sites.

## 5Parallel implementations

Over and above the serial improvements, due to the fact that the DFT kernel has been reduced from using 3D to 1D transforms, parallel techniques can be used to further enhance the performance. Full OpenMP-based and MPI-based implementations of the new framework have been created to take proper advantage of both shared-memory and distributed-memory architectures.

For the shared-memory implementation, for thread safety, each thread must have its own DFT scratch arrays. Furthermore, since the aforementioned data-compression algorithm at the end of the DFT kernel operates in-place within the result buffer, to prevent pertinent data loss by multiple processes working within the same array, it has a maximum scalability of three, i.e., one thread per vector component.

For the distributed-memory version, custom ‘cyclic-plane’ domain decomposition and parallel memory allocation schemes have been developed, which, apart from permitting the efficient location of, and access to, pertinent data, also provide a decrease in memory per node that is directly proportional to the number of available nodes and inherent ‘to-the-nearest-plane’ load balancing. This is an important point. For the majority of cases, targets do not have a constant number of occupied sites per plane. Since the custom-built iterative solution schemes only store information corresponding to occupied sites, ‘cyclic-plane’ decomposition provides a much improved data distribution across the available nodes. For example, take the previously discussed spherical target based on an computational grid, in which only of the lattice sites are occupied. If this computational domain is ‘block’ decomposed across three nodes, then even for this simplistic case, the imbalance of in the data distribution dwarfs the negligible imbalance for the ‘cyclic-plane’ decomposition, see Fig. ?. It is conceded that by ‘block’ decomposing the constituent planes non-uniformly among the three nodes, so that node ‘0’ is assigned the first seven planes, node ‘1’ is assigned the four middle 4 planes, and node ‘2’ is assigned the remaining seven planes, the load imbalance is reduced to a mere . However, this requires that the domain decomposition and parallel memory allocation algorithms be aware of, and be able to take account of, the user-defined and essentially arbitrary target geometry. In practice, this has not been the case.

To facilitate the distributed DFTs and iDFT required by the new kernel, the new framework incorporates several in-place local and distributed transpose algorithms, which have all been custom-designed to support the ‘cyclic-plane’ decomposition. They are all capable of treating completely arbitrary systems decomposed across an arbitrary number of nodes. Please note that, unlike the serial and OpenMP versions, which only store section ‘1’ in Fig. ?, for each of the six independent tensor components, the MPI version also stores an extra octant. Rather than storing a array for each of the six components, the MPI version stores a array, in the direction, in the direction, and in the direction. This is to facilitate the distributed element-wise FD multiplication, see Eqs. Equation 6 ? in Appendix A, that is required by the DFT kernel. This extra memory requirement is not an issue as the linear decrease in the memory per node, afforded by the parallel memory allocation, makes the extra storage inconsequential.

## 6Benchmarking

On top of the architectural specific nature of the implementations, each also exists in float, double and long double precision. For all benchmarks, double precision was used and level three and inter-procedural optimisations were used for the compilations, i.e., ‘-O3 -ipo’ for the Intel compilers and ‘-O3 -ipa’ for the Pathscale compilers. Furthermore, to standardise the results, all benchmarks were run using the same set of arbitrarily chosen physical parameters, SPHERE, incident wavelength , incident polarisation , effective radius (radius of a sphere of equal volume) , and complex refractive index . The following list details both the particular hardware and software configurations of each of the benchmarking systems. For each benchmark, the relevant architecture will be referenced simply by the associated tag, FRANKLIN, NEMO or WALTON.

**FRANKLIN:**Shared-memory, SGI Altix , Red Hat Enterprise Linux AS release , IA-, Itanium , , cache, RAM, Intel Compilers , FFTW , Intel MKL .**NEMO:**Shared-memory, HP xw9300 Workstation, openSUSE , x86_64, Dual-Core Opteron , , () cache, RAM, Intel Compilers , FFTW , MPICH2 .**WALTON:**Distributed-memory, IBM cluster 1350, SUSE LINUX Enterprise Server , x86_64, compute nodes, Opteron Single-Core Opteron , , cache, have RAM per node, have RAM per node, QLogic PathScale Compiler Suite, Version , FFTW , MPICH2 .

To affirm the benefits ascribed to the new DFT kernel, the constituent optimisations have been applied sequentially, on FRANKLIN, to show their respective contribution to the cumulative effect. Fig. ?(a) shows the time to compute the matrix-vector product for the original, various hybrid, and new algorithms in a regime that allows comparison with the explicit matrix-vector multiplication. Fig. ?(b) shows long-range timings, where the algorithms have been applied, within their aptitude, to much larger systems. For both the short-range and long-range cases, the timings for the **3D (Orig)** and **1D (Orig)** algorithms are essentially identical. Since the optimisation of the kernel is predicated on the concept of removing superfluous transforms by performing the requisite 3D DFTs as ensembles of 1D transforms, this congruency is crucial as it shows that there is no computational overhead associated with the migration.

For the short-range case, even though the explicit matrix-vector multiplication was performed with the highly optimised Intel MKL [26], it is obvious that the sheer size of the systems involved negate its use almost immediately. Furthermore, for both plots, the sequential application of the components of the new algorithm is clearly discernible. To put these improvements in perspective, in comparison to the original algorithm, the new algorithm requires approximately less transforms in order to carry out the matrix-vector product, which has a comparably profound influence on the timings.

In the long-range case, the comparison of the original and new algorithms is limited to the nethermost region of the plot. This is due to the prohibitive memory requirements associated with the original algorithm. Note that even for extremely large systems, the new algorithm is notably faster than the original algorithm is for a system that is just a fraction of the size. Both the short-range and long-range cases also include timings for the OpenMP-based kernel across 4, 8 & 16 threads, which unequivocally show the substantial performance gains that can be obtained from shared-memory parallelisation.

An ancillary computational benefit, as shown in Fig. ?(c), is that the new formulation only requires the creation of a few 1D DFT plans, rather than full 3D DFT plans. This is especially beneficial if using FFTW [18] with FFTW_MEASURE to compute the transforms, which instructs FFTW to run and measure the execution time of several DFTs in order to find the best way to compute the transform. As a result, initialisation times become negligible.

In comparison to the original algorithm, the new algorithm requires over less memory to store the tensor and vector components. This is evident from the approximate memory requirement presented in Fig. ?(d), where, for the maximum benchmarked size, corresponding to , although the original 3D DFT algorithm would require over , the new algorithm and its OpenMP counterpart only require to perform the same calculation. This figure also incorporates the memory requirements for the distributed MPI-based implementation, showing the reduction in memory per node that is only limited by the number of nodes available.

To ascertain the performance and scalability of the OpenMP-based shared-memory implementation, a large-scale benchmark was performed across threads on FRANKLIN, see Fig. ?. The system size was chosen to be , which produced a computational domain with sites in total and a representation of the spherical target with occupied sites. In terms of memory usage, this system required for 1 thread, with a linear increase of approximately per thread to approximately for 20 threads. The extra memory is for the thread private scratch spaces that are required to ensure thread safety. Although this may sound like large amount of memory, in comparison, for the same system, the original algorithm, which is limited to the serial environment, would require approximately . The time for the matrix-vector product is reduced from for 1 thread to just using 20 threads, a speedup of approximately . All components show excellent scalability, except of course, the aforementioned data compression algorithm, see §Section 5, which is not an issue as it constitutes only a negligible fraction of the DFT kernel’s runtime.

To test the performance and scalability of the MPI-based distributed-memory implementation, the same system as for the OpenMP shared-memory benchmark, was run on WALTON, see Fig. ?. Here, due to the fact that each node has only of physical memory, the benchmark was initialised across 32 nodes and then scaled up to 64 nodes, requiring only per node across 32 nodes, and just per node across 64 nodes. In terms of performance, the timings and scalability are extremely good. The calculation took across 32 nodes and across 64 nodes, a speedup of almost . To put this in perspective, for the OpenMP-based benchmark, the same system required using 1 thread and using 20 threads. In terms of these results, there are several noteworthy points. Firstly, the total time is quoted as the sum of the computation and communication times. The total communication time includes the time for the forward transpositions of both the six independent tensor components and three zero-padded vector components, and also includes the time for the reverse transposition of the resultant of the FD element-wise multiplication of the components. These transpositions are essentially block events that are controlled by quite complex custom-built transposition algorithms. Even with the prescribed optimisations, due to the enormity of the tensor and vector component arrays, ‘in-place’ transposition is an absolute necessity. This complication is compounded by the custom-built parallel-memory allocation and ‘cyclic-plane’ domain decomposition schemes that provide a decrease in memory per node that is directly proportional to the number of active nodes and also provide inherent load balancing. In order to facilitate efficient ‘in-place’ transposition of the component arrays, the transposition algorithms employ the pre-computation of local ‘in-place’ transposes in order to simplify and streamline the host algorithm and subsequent data communication. Unfortunately, at present this negates the possibility of overlapping some of the communication with the computation. Secondly, although all communication timings are marginally sub-linear, all computation timings exhibit super-linear speedups. Upon consulting with the ICHEC technical systems team, the peculiarity is believed to be an artifact of the initial benchmark. Due to the novel parallel memory allocation algorithms in the distributed-memory implementation of the OpenDDA framework, the memory footprint is decomposed across the active nodes. However, it is believed that the chosen system size and associated memory requirements were fractionally too large, causing slightly degraded performance due to swapping. Since all subsequent speedups are calculated with respect to this initial benchmark, and since the parallel memory allocation and domain decomposition schemes would have solved this allocation/swapping issue as the number of active nodes increased above 32, the subsequent timings would appear super-linear. This is not considered to be a cause for concern as this only introduces a global shift and it is to overall trend that is of importance here.

To test the performance of the full OpenDDA framework, a benchmark was run on NEMO with , see Table. ?. This produced a computational domain with sites in total and a representation of the spherical target with occupied sites. The convergence tolerance was taken as the default value of , the maximum number of iterations was set to , the initial guess was set to zero, and Point-Jacobi preconditioning was used. The performance and scalability of the full OpenDDA framework is extremely good. For this system, it is the simpler schemes that tend to provide the best performance, however, since convergence tends to be slower for larger refractive index values [7], as the number of iterations increases, the enhanced stability of the more involved iterative methods may prove beneficial, and ultimately, may provide superior convergence. Although it is duly noted that due to the necessity of distributed transposes, the performance of the MPI version will always lag behind that of the OpenMP version, the slight deficit in performance is offset by the decrease in memory that is directly proportional to the number of nodes available.

To put the new custom-built OpenDDA framework in context, several comparative benchmarks were performed between the various architecture specific implementations of OpenDDA, and ADDA; the fastest and most computationally efficient DDA implementation [30]. Since the number of iterations to reach convergence is dependent on the user-defined convergence tolerance and on the way in which the stopping criteria are defined within the specific implementation, the comparison focusses, not on the number of iterations, but on the time per iteration. For ADDA, the reason that only the total time for the initialisation of the tensor components is given is that ADDA performs the initialisation of the tensor components as a single conjoint operation, i.e., the computation of the components and the DFTs of these components are perform together. In terms of the quoted memory requirements, for the OpenDDA variants, the approximate memory requirements are obtained from the in-built estimation routines. To obtain an estimate for the approximate memory requirements for ADDA, only ‘large’ arrays pertaining to the iterative solution have been included. Furthermore, some of the ‘large’ arrays have been excluded from consideration, as in ADDA, not all relevant ‘large’ arrays coexist, and thus, some do not contribute to the cumulative memory requirements. For the shared-memory comparison, for , the memory estimations for ADDA include, *xvec*, *rvec*, *pvec*, *Einc*, *Avecbuffer*, *Dmatrix*, *Xmatrix*, *slices* & *slices_tr*, and for & , the estimation also includes *vec1*, *vec2* & *vec3*. For the distributed-memory comparison, the estimations also include the contributions from ADDA’s MPI communication buffers, *BT_buffer* & *BT_rbuffer*, and to look at the efficiency of the decomposition of the associated data across the active nodes, apart from the total memory requirements, the results also give the maximum memory requirement per node.

Table ? shows the results for the shared-memory comparative benchmark run on NEMO using , & , for a spherical target with , , and . Please note that although ADDA does employ Intel vectorisation [2] within its basic linear algebra routines, it does incorporate generic OpenMP shared-memory parallelisation. Similarly, Table ? shows the results for the distributed-memory comparative benchmark run on WALTON using , & , across , , , & nodes, where the relevant system sizes for each of the constituent benchmarks are quoted in Table ?. In all instances, the new OpenDDA performs extremely favourably. It is worth noting that the custom ‘cyclic-plane’ decomposition within the OpenDDA framework allows for an extremely even distribution of the data across the active nodes. Even though, for the comparison across 32 nodes in Table ?, OpenDDA requires marginally more memory in total, the maximum memory required by any individual node is actually less.

Finally, as a quick verification of the correct operation of the new OpenDDA framework, a comparison of the extinction, scattering & absorption efficiencies, , & , for the well proven output of the widely used DDSCAT code and the three architectural variants of OpenDDA was performed on NEMO, see Table ?. All are in good agreement.

## 7Conclusions

Within the new framework, the bit-fielding of the storage associated with the spatial and compositional specification of the target, not only provides significant reduction in the associated integer storage, but also facilitates the creation of a versatile and efficient distributed-memory implementation.

Although comparable codes, such as ADDA, have pioneered and employ similar optimisations, OpenDDA incorporates a custom-built highly-optimised DFT kernel that performs the requisite 3D DFTs of both the tensor and vector, which are instrumental in the efficient calculation of the matrix-vector products, as ensembles of 1D transforms. By doing so, a significant fraction of the constituent 1D transforms, which are superfluous, can be neglected. In comparison to the “out-of-the-box” algorithm based on 3D DFTs, OpenDDA requires approximately less transforms and over less memory in order to carry out the matrix-vector products. Furthermore, the optimisations permit parallel techniques to be employed to further enhance the performance. Both the resulting OpenMP-based shared-memory and MPI-based distributed-memory implementations possess extremely good scalability and associated performance. On top of these improvements, due to the fact that, computationally, the new framework only requires the creation of a limited number of 1D DFT plans, rather than the full 3D DFT plans required by the original algorithm, the initialisation times, which can be significant, become completely negligible. The new framework also incorporates a relatively large selection of iterative methods that have been custom-built in serial, OpenMP and MPI to integrate seamlessly with the various architecturally specific DFT kernels.

Within the distributed-memory implementation of the new framework, the novel parallel-memory allocation and domain decomposition schemes provide a decrease in memory per node that is directly proportional to the number of nodes and also provide inherent load balancing. Due to the fact that the new framework only stores information corresponding to occupied lattice sites, depending on the target being simulated, there can be quite an imbalance in the number of occupied sites per plane. However, the new framework employs custom-designed ‘cyclic-plane’ decomposition, and as a result, boasts an extremely well-balanced data distribution across the active nodes.

The framework performs extremely favourably across all architectural variants, with the novel and unique OpenMP-based implementation providing unprecedented parallel performance on shared-memory architectures.

## Acknowledgements

This work was supported by the CosmoGrid Project, funded under the Programme for Research in Third Level Institutions (PRTLI) administered by the Irish Higher Education Authority under the National Development Plan and with partial support from the European Regional Development Fund. In conclusion, the authors would also like to thank the Irish Centre for High-End Computing (ICHEC) and their dedicated Systems Team.

## AOriginal DFT algorithm

This appendix concisely summarises the conventional application of DFTs to the solution of complex-symmetric level-three tensor-Toeplitz matrix-vector products in the context of the DDA, as prescribed by [21].

The extension of the level-one formulation described at the beginning of §Section 3 to the treatment of level-three systems, , is obtained by performing the T-to-c conversion on each of the three levels of just as it was performed for the level-one system, resulting in a level-three circulant, , where .

Here, each of the is a level-two circulant, each of the is a level-one circulant, the symbol ’’ denotes the Kronecker tensor product, and is the downward-shift permutation matrix of order [38]. The arbitrary vector must also be zero-padded to negate the contribution for the circulant extensions on all three levels, , where . The complimentary Fourier structure which diagonalises this level-three system is acquired from the hierarchical diagonalisation by the Fourier matrices, , and , i.e., [4]. As a result, the level-three matrix-vector product, , can be carried out as an element-wise FD multiplication, where is the first column of .

Furthermore, using the realisation that 3D DFT allows Equation 5 to be reformulated in terms of the 3D DFT. To facilitate this, before the aforementioned T-to-c conversions and zero-padding takes place, the first column of each of the six independent tensor component matrices, , and the three vector components, , must be arranged into separate 3D structures, and , respectively. For example, the transformations and are governed by and , , respectively.

Once in 3D form, to facilitate the DFT applicability, each of the six tensor-component arrays must undergo T-to-c conversion, e.g., , and the three vector-component arrays must be zero-padded, e.g., . For a level-three system, the T-to-c conversion involves taking each tensor-component array in turn, denoted by section ’’ in Fig. ?, treating each line in the , & directions as the first column of a level-one Toeplitz matrix, and mirroring it, excluding the initial entry, across its last entry. The mirrored elements must also by multiplied by the appropriate entry from Table ?. The reason for this is that for a general T-to-c conversion, one would usually append the first row, excluding the initial entry, in reverse order. However, since the nine tensor components are from the dyadic product in Equation 2, the Toeplitz matrix is either symmetric or non-symmetric by sign only. Hence, it is possible to just mirror the column itself and, if necessary, multiply the mirrored entries by ’’. The conversion creates section ’’, the conversion creates section ’’, and the conversion creates section ’’ in Fig. ?. The zero-padding of the three vector-component arrays is much simpler. Each component is trivially embedded in the octant of a array of zeroes, i.e., it would occupy section ’’ in Fig. ?, and the rest of the array would be identically zero.

The algorithm requires the DFT of the six tensor-component arrays, , and the three vector-component arrays, , yielding and , respectively. Note that for the practical application of the theory, as with DDSCAT [9] and ADDA [42], for maximum performance, if necessary, the transformations described above are all zero-padded to ensure that the final dimensions of the component arrays permit the use of fast DFT algorithms, i.e., for FFTW [18], a dimension which satisfies , where are arbitrary and . These are illustrated in Fig. ? as , , and , and increase the dimensions to , , and . Subsequently, the contribution from each of the individual components, to the full circulant matrix-vector product, is obtained via the element-wise tensor-vector product of their transforms.

The inverse DFT (iDFT) of & provides & , which are the resultant of Equation 5, in 3D form, and the components of the original tensor-Toeplitz matrix-vector product, , are obtained by extracting the octant of , & , respectively, i.e., ’’ in Fig. ?. This reduces the complexity from operations and storage to and , respectively.

DFT | +1 | -1 | -1 | +1 | +1 | +1 |

DFT | +1 | -1 | +1 | +1 | -1 | +1 |

DFT | +1 | +1 | -1 | +1 | -1 | +1 |