A parallel non-uniform fast Fourier transform library based on an “exponential of semicircle” kernel

A parallel non-uniform fast Fourier transform library based on an “exponential of semicircle” kernel

Alex Barnett Flatiron Institute, Simons Foundation, New York, NY, USA    Jeremy Magland Flatiron Institute, Simons Foundation, New York, NY, USA    Ludvig af Klinteberg Department of Mathematics, Simon Fraser University, Burnaby, BC, Canada
July 1, 2019
Abstract

The nonuniform fast Fourier transform (NUFFT) generalizes the FFT to off-grid data. Its many applications include image reconstruction, data analysis, and the numerical solution of differential equations. We present FINUFFT, an efficient parallel library for type 1 (nonuniform to uniform), type 2 (uniform to nonuniform), or type 3 (nonuniform to nonuniform) transforms, in dimensions 1, 2, or 3. It uses minimal RAM, requires no precomputation or plan steps, and has a simple interface to several languages. We perform the expensive spreading/interpolation between nonuniform points and the fine grid via a simple new kernel—the “exponential of semicircle” in —in a cache-aware load-balanced multithreaded implementation. The deconvolution step requires the Fourier transform of the kernel, for which we propose efficient numerical quadrature. For types 1 and 2, rigorous error bounds asymptotic in the kernel width approach the fastest known exponential rate, namely that of the Kaiser–Bessel kernel. We benchmark against several popular CPU-based libraries, showing favorable speed and memory footprint, especially in three dimensions when high accuracy and/or clustered point distributions are desired.

1 Introduction

The need for fast algorithms for spectral analysis of non-uniformly sampled data arose soon after the popularization of the FFT in the 1960s. Many early methods came from signal processing [44] and astronomy [58, 39, 49] [50, Sec. 13.8], but it was not until the 1990s that Dutt–Rokhlin [12] gave the first rigorous analysis of a convergent scheme. The NUFFT has since become crucial in many areas of science and engineering. Several imaging methods, including magnetic resonance imaging (MRI) [57, 35], X-ray computed tomography (CT) [18], ultrasound diffraction tomography [9], and synthetic aperture radar [3], sample the Fourier transform at non-Cartesian points [60, 23] hence require the NUFFT or its inverse for accurate reconstruction. Real-time Fourier-domain optical coherence tomography (OCT) relies on rapid one-dimensional NUFFTs [63]. Periodic electrostatic and Stokes problems are commonly solved by fast “particle-mesh Ewald” summation, whose spectral part is equivalent to a pair of NUFFTs [37, 42]. Spectrally-accurate function interpolation may be efficiently performed with the NUFFT [30, Sec. 6] [20]. The numerical approximation of Fourier transforms using non-Cartesian or adaptive quadrature grids arises in heat solvers [35], cryo electron microscopy [64, 6], and electromagnetics [38]. Many more applications are found in the reviews [62, 33, 48, 30, 25].

Our purpose is to describe and benchmark a general-purpose software library for the NUFFT that achieves high efficiency with an open-source compiler by combining mathematical and implementation innovations. The computational task is to approximate, to a requested relative accuracy , the following exponential sums. Let or 3 be the spatial dimension. Let be the number of desired Fourier modes in dimension ; in each dimension the Fourier mode (frequency) indices are

The full set of mode indices is the Cartesian product that we denote by

containing a total number of modes . The nonuniform points have locations , , which may be taken to lie in , with corresponding strengths . Then the type 1 NUFFT (also known as the “adjoint NFFT” [48, 30]) approximates the outputs111Note that our normalization differs from that of [12, 25].

(1)

This may be interpreted as computing, for the -periodic box, the Fourier series coefficients of the distribution

(2)

Up to normalization, (1) generalizes the discrete Fourier transform (DFT), which is simply the uniform case, e.g. in 1D, with .

The type 2 transform (or “NFFT”) is the adjoint of the type 1. Unlike in the DFT case, it is not generally related to the inverse of the type 1. It evaluates the Fourier series with given coefficients at arbitrary target points . That is,

(3)

Finally, the more general type 3 transform [35] (or “NNFFT” [30]) may be interpreted as evaluating the Fourier transform of the non-periodic distribution (2) at arbitrary target frequencies in , , where is a plain integer index. That is,

(4)

All three types of transform (1), (3) and (4) consist simply of computing exponential sums that naively require work. NUFFT algorithms compute these sums, to a user-specified relative tolerance , in close to linear time in and .

Remark 1

In certain settings the above sums may be interpreted as quadrature formulae applied to evaluating a Fourier transform of a function. However, these tasks are not to be confused with the “inverse NUFFT” (see problems 4 and 5 in [12, 25]) which involves, for instance, treating (3) as a linear system to be solved for , given the right-hand side . For some nonuniform distributions this linear system can be very ill-conditioned. This inverse NUFFT is common in Fourier imaging applications; a popular solution method is to use conjugate gradients to solve the preconditioned normal equations, exploting repeated NUFFTs for the needed matrix-vector multiplications [16, 18, 23, 60] [53, Sec. 3.3]. Thus, efficiency gains reported here would also accelerate this inversion method. See [13, 31] for other approaches. We will not explicitly address the inversion problem here.

Figure 1: The proposed ES spreading kernel (8) (solid blue lines); the Kaiser–Bessel kernel (5) (dashed green); and the best truncated Gaussian (dotted pink) in . (a) shows all kernels for . The discontinuities at are highlighted by dots. (b) shows a logarithmic plot (for positive ) of the kernels for (corresponding to a spreading width of grid points). The graph for ES is a quarter-circle. (c) shows the magnitude of the kernel Fourier transforms, and the “cutoff” frequency . ES and KB have shape close to a quarter-ellipse in (see (39) and (6)). All have exponentially small values for , but the Gaussian has exponential convergence rate in terms of only around half that of ES or KB.

1.1 Prior algorithms, kernels, and implementations

There are two main approaches to the fast approximation of the sums (1) or (3), both of which build upon the FFT: 1) Interpolation between nonuniform points and an upsampled regular grid, combined with an upsampled FFT and correction in Fourier space [12, 7, 16, 25, 30]; or 2) Interpolation to/from an -point (i.e. not upsampled) regular grid, combined with the -point FFT. In the univariate (1D) case, there are several variants of the second approach: Dutt–Rokhlin [13] proposed spectral Lagrange interpolation (using the cotangent kernel applied via the fast multipole method), combined with a single FFT. Recently, Ruiz-Antolín and Townsend [53] proposed a stable Chebyshev approximation in intervals centered about each uniform point, which needs an independent -point FFT for each of the coefficients, but is embarrassingly parallelizable. This improves upon earlier work [2] using Taylor approximation that was numerically unstable without upsampling [33, Ex. 3.10] [53].

We now turn to the first, and most popular, of the two above approaches. For the type 1 and type 2 transforms one sets up a regular fine grid of points where the upsampling factor in each dimension, , is a small constant (typically ). Taking the type 1 as an example, there are three steps. Step 1 evaluates on the fine grid the convolution of (2) with a smooth kernel function , whose support has width fine grid points in each dimension (see Fig. 2(a)). This “spreading” requires kernel evaluations. Step 2 applies the FFT on the -point grid, needing work. Step 3 extracts the lowest frequencies from the result, then divides by , the Fourier transform of the kernel, evaluated at each of these frequencies; this is called deconvolution or roll-off correction. There is a class of kernels, including all those we discuss below, whose analysis gives an error decreasing exponentially (up to weak algebraic prefactors) with , hence one may choose . Thus the total effort for the NUFFT is .

The choice of spreading kernel has a fascinating history. A variety of kernels were originally used for “gridding” in the imaging community (e.g. see [26, 48, 30]). The truncated Gaussian kernel (see Fig. 1) was the first for which an exponential convergence rate with respect to was shown [12]. This rate was improved by Steidl [56, 14]: fixing , for an optimally chosen Gaussian width parameter the error is . Beylkin [7] proposed B-splines for , with the estimate . In both cases, it is clear that increasing improves the convergence rate; however, since the cost of the upsampled FFT grows at least like , a tradeoff arises. In practice, many studies have settled on for general use [26, 18, 16, 30, 25, 46]. For this choice, both the above Gaussian and B-spline rates imply that , the number of correct digits, is approximately . For instance, to achieve 12 digits, a spreading width is needed [25, Remark 2].

However, Jackson et al. [26] realized that the criteria for a good kernel are very similar to those for a good window function in digital signal processing (DSP). To summarize these criteria,

  • The numerical support of in fine grid points, , should be as small as possible, in order to reduce the spreading cost.

  • A certain norm of in the “tails” should be as small as possible, relative to values in the central range ; see Fig. 2(b).

The two criteria conflict: (a) states that should be narrow, but (b), which derives from aliasing error, implies that should be smooth. (We postpone the rigorous statement of (b) until (35).) Is has been known since the work of Slepian and coworkers in the 1960s [54] that, if one chose -norms in (b), the family of prolate spheroidal wavefunctions (PSWF) of order zero [45] would optimize the above criteria. It was also DSP folklore [28] that the “Kaiser–Bessel” (KB) kernel,

(5)

scaled here to have support , where is the regular modified Bessel function of order zero [43, (10.25.2)], well approximates the PSWF. However, unlike the PSWF, which is tricky to evaluate accurately [45], (5) needs only standard special function libraries [40]. Its Fourier transform (using the convention (9)) is known analytically222This pair appears to be a discovery of B. F. Logan, and its use pioneered in DSP by J. F. Kaiser, both at Bell Labs, in the 1960s [21]. Curiously, the pair seems absent from all standard tables [22, §6.677] [51, §2.5.25] [52, §2.5.10]. [28],

(6)

This transform pair (5)–(6) is plotted in green in Fig. 1.

Starting with imaging applications in the 1990s, the KB kernel (5) was introduced for the NUFFT [26, 48, 16]. Note that the function (6), truncated to , outside of which it is exponentially small, may instead be used as the spreading kernel [18, 30]. This latter approach—which we call “backward KB”333The distinction between forward and backward use of the KB pair is unclear in the literature.—has the computational advantage of spreading with cheaper sinh rather than evaluations. The error analyses of the two variants turn out to be equivalent. Despite being only conditionally convergent, the tail sum of (6) needed for criterion (b) may be bounded rigorously; this subtle analysis is due to Fourmont [17, p. 30–38] [18, Sec. 4]. Its optimal convergence with , summarized in [47, p. 30-31] [29, App. C], is (see (35) for the definition of the error ),

(7)

This is the fastest known exponential error rate of any kernel, equalling that of the PSWF [4]: for the choice gives over correct digits. This is nearly twice that of the Gaussian; 12 digits are reached with only . Attempts to further optimize this kernel give only marginal gains [16], unless restricted to cases with specific decay of the mode data [41], or minimal upsampling () [27].

Turning to software implementations, most are based upon the Gaussian or KB kernels (in both its variants). Greengard–Lee [25] presented “fast Gaussian gridding” which reduced the number of exponential function evaluations from to , resulting in a several-fold acceleration of the spreading step. This was implemented by those authors in a general-purpose single-threaded CMCL Fortran library [24]. The mature general-purpose NFFT code of Keiner–Kunis–Potts [29, 30] is multithreaded [61] and uses backward KB by default (although fast Gaussian gridding is available). It allows various precomputations of kernel values (requiring a “plan” stage), demanding a larger RAM footprint but accelerating repeated calls with the same points. There are also several codes specialized to MRI, including MIRT (which uses full precomputation of the KB kernel) by Fessler et al. [15], and recently BART [59] and pynufft [36]. Various specialized GPU implementations also exist (reviewed in [46]), mostly for MRI [34, 32] or OCT [63]. Unlike general-purpose codes, these specialized packages tend to have limited accuracy or dimensionality, and tend not to document precisely what they compute.

1.2 Contribution of this work

We present a general purpose documented CPU-based multithreaded C++ library (FINUFFT) [5] that is efficient without needing any precomputation stage. This means that the RAM overhead is very small and the interface simple. For medium and large problems in 2D and 3D its speed is competitive with state-of-the-art CPU-based codes. In some cases, at high accuracies, FINUFFT is faster than all known CPU-based codes by a factor of 10. The packages against which we benchmark are listed in Table 1.

We spread with a new “exponential of semicircle” (ES) kernel (see Fig. 1),

(8)

which has error convergence rate arbitrarily close to that of (7); see Theorem 7. It is simpler and faster to evaluate than either of the KB pair (5)–(6), yet has essentially identical error. We demonstrate further acceleration via piecewise polynomial approximation. (8) has no known analytic Fourier transform, yet we can use numerical quadrature to evaluate when needed, with negligible extra cost. Unlike interpolation from the fine grid (needed in type 2), spreading (needed for type 1) does not naturally parallelize over nonuniform points, because of collisions between threads writing to the output grid. However, we achieve efficiency in this case by adaptively blocking into auxiliary fine grids, after bin-sorting the points.

The rest of the paper is structured as follows. The next section outlines the software interfaces. In section 3 we describe the algorithms and parameter choices in full, including various novelties in terms of quadrature and type 3 optimization. In Section 4 we summarize a rigorous aliasing error bound for the ES kernel, and use this to justify the choice of and . We also explain the gap between this bound and empirically observed relative errors. Section 6 compares the speed and accuracy performance against other libraries, in dimensions 1, 2, and 3. We conclude in section 7.

2 Use of the FINUFFT library

The basic interfaces are very simple [5]. From C++, with x a double array of M source points, c a complex (std::complex<double>) array of M strengths, and N an integer number of desired output modes,

  status = finufft1d1(M,x,c,isign,tol,N,f,opts);

computes the 1D type 1 NUFFT with relative precision tol (see section 4.2), writing the answer into the preallocated complex array f, and returning zero if successful. Setting isign either or controls the sign of the imaginary unit in (1). opts is a struct defined by the required header finufft.h and initialized by finufft_default_options(&opts), controlling various options. For example, setting opts.debug=1 prints internal timings, whereas opts.chkbnds=1 includes an initial check whether all points lie in the valid input range . The above is one of nine routines with similar interfaces (types 1, 2, and 3, in dimensions 1, 2, and 3). The code is lightweight, at around 3300 lines of C++ (excluding interfaces to other languages). DFTs are performed by FFTW [19], which is the only dependency.

Interfaces from C and Fortran are similar to the above, and require linking with -lstdc++. From high-level languages one may call,

  [f status] = finufft1d1(x,c,isign,tol,N,opts);    % MATLAB or octave
  status = finufftpy.nufft1d1(x,c,isign,tol,N,f)    # python (numpy)

Here is inferred from the input sizes. There also exists a julia interface [1].

Remark 2

The above interface, since it does not involve any “plan” stage, incurs a penalty for repeated small problems ( and of order or less), traceable to the overhead (around 100 microseconds per thread in our tests) for calling fftw_plan() present when FFTW reuses stored wisdom. To provide maximal throughput for repeated small problems (which are yet not small enough that a ZGEMM matrix-matrix multiplication approach wins), we are adding interfaces that handle multiple inputs or allow a plan stage. At the time of writing these are available in 2D only, and will be extended in a future release.

3 Algorithms

For type 1 we use the standard three-step procedure sketched above in Section 1.1. For type 2 the steps are reversed. Type 3 involves a combination of types 1 and 2. Our Fourier transform convention is

(9)

For the default upsampling factor , given the requested relative tolerance , the kernel width and ES parameter in (8) are set via

(10)

The first formula may be summarized as: the kernel width is one more than the desired number of accurate digits. We justify these choices in section 4.2. (FINUFFT also provides a low-upsampling option , which is not tested in this paper.)

Figure 2: (a) 1D illustration of spreading from nonuniform points to the grid values , (shown as dots) needed for type 1. For clarity, only two nonuniform points and are shown; the former results in periodic wrapping of the effect of the kernel. (b) Semi-logarithmic plot of the (positive half of the) Fourier transform of the rescaled kernel , showing the usable frequency domain (and the dynamic range over this domain), and a useful approximate relationship between the aliasing error bound and the ES kernel parameter . From (39), below cutoff the curve is well approximated by a quarter ellipse.

3.1 Type 1: nonuniform to uniform

We describe the algorithm to compute , an approximation to the exact defined by (1).

3.1.1 1D case

We use to denote nonuniform source points, and to label the output modes. For FFT efficiency the DFT size is chosen to be the smallest integer of the form not less than nor , the latter condition simplifying the spreading code.

Step 1 (spreading). From now on we abbreviate the ES kernel in (8) by . We rescale the kernel so that its support becomes , with

(11)

where is the upsampled grid spacing. This rescaled kernel is denoted

(12)

and its periodization is

(13)

We then compute, at a cost of kernel evaluations, the periodic discrete convolution

(14)

as sketched in Fig. 2(a).

Step 2 (FFT). We use the FFT to evalute the -point DFT

(15)

Note that the output index set is cyclically equivalent to the usual FFT index set .

Step 3 (correction). We truncate to the central frequencies, then diagonally scale (deconvolve) the amplitudes array, to give the outputs

(16)

where a good choice of the correction factors comes from samples of the kernel Fourier transform,444It is tempting instead to set to be the DFT of the grid samples of the kernel . However, in our experience this causes around twice the error of (17), as can be justified by the discussion in section 4. Fessler and Sutton [16, Sec. V.C.3] report a similar finding.

(17)

A new feature of our approach is that we approximate by applying Gauss–Legendre quadrature to the Fourier integral, as follows. This allows kernels without an analytically known Fourier transform to be used without loss of efficiency. Let and be the nodes and weights for a -node quadrature on . Since is real and even, only the positive nodes are needed, thus,

(18)

By a convergence study, we find that (thus a maximum quadrature spacing close to ) gives errors less than , over the needed range . A rigorous quadrature bound would be difficult due to the small square-root singularities at the endpoints in (8). The cost of the evaluation of is , and naively would involve cosines. By exploiting the fact that, for each quadrature point , successive values of over the regular grid are related by a constant phase factor, these cosines can be replaced by only complex exponentials, and adds and multiplies, giving an order of magnitude acceleration. We call this standard trick “phase winding.”555In the code, see the function onedim_fseries_kernel in src/common.cpp

3.1.2 The case of higher dimension

In general, different fine grid sizes are needed in each dimension. We use the same recipe, so that , , , . The kernel is a periodized product of scaled 1D kernels,

(19)

where . Writing for the fine grid spacing in each dimension, and to index each fine grid point, the discrete convolution becomes

(20)

In evaluating (20), separability means that only kernel evaluations are needed per source point: the square or cube of values is then filled by an outer product. The DFT (15) generalizes in the standard way to multiple dimensions. Finally, the correction factor is also separable,

(21)

so that only repetitions of (18) are needed, followed by an outer product.

3.2 Type 2: uniform to nonuniform

To compute , an approximation to in (3), we reverse the steps for the type 1. Given the number of modes , and the precision , the choices of , and are as in the type 1. From now on we stick to the case of general dimension .

Step 1 (correction). The input coefficients are pre-corrected (amplified) and zero-padded out to the size of the fine grid,

(22)

with the same amplification factors as in (21).

Step 2 (FFT). This is just as in type 1. Writing the general dimension case of (15), with the index vectors and (and their ranges) swapped,

(23)

Step 3 (interpolation). The adjoint of spreading is interpolation, which outputs a weighted admixture of the grid values near to each target point. The output is then

(24)

As with the type 1, because of separability, this requires evaluations of the kernel function, and flops, per target point.

3.3 Type 3: nonuniform to nonuniform

This algorithm is more involved, but is a variant of standard ones [25, Alg. 3] [14, Alg. 2] [35] [48, Sec. 1.3]. Loosely speaking, it is “a type 1 wrapped around a type 2,” where the type 2 replaces the middle FFT step of the type 1. Given , we choose , , and as before. We will present the choice of shortly (see “step 0” below). It will involve the following bounds on source and target coordinates and :

(25)

Step 1 (dilation and spreading). For spreading onto a grid on , a dilation factor needs to be chosen for each dimension such that the rescaled sources lie in . Furthermore these rescaled coordinates must be at least grid points from the ends in order to avoid wrap-around of mode amplitudes in Step 2. This gives a condition relating and ,

(26)

We may then rewrite (4) as , , where .

We spread these rescaled sources onto a regular grid using the usual periodized kernel (19), to get

(27)

Unlike before, here we have chosen a (cyclically equivalent) output index grid centered at the origin, because we shall now interpret as a Fourier mode index.

Step 2 (Fourier series evaluation via type 2 NUFFT). Treating from (27) as Fourier series coefficients, we evaluate this series at rescaled target points using the type 2 NUFFT (see section 3.2), thus,

(28)

where the rescaled frequency targets have coordinates , . Intuitively, the factor arises because the fine grid of spacing has to be stretched to unit spacing to be interpreted as a Fourier series.

Step 3 (correction). Finally, as in type 1, in order to compensate for the spreading in step 1 (in primed coordinates) a diagonal correction is needed,

But, in contrast to types 1 and 2, the set of frequencies at which must be evaluated is nonuniform, so there is no phase winding trick. Rather, cosines must be evaluated, recalling that is the the number of positive quadrature nodes. Despite this cost, this step consumes only a small fraction of the total computation time.

Remark 3

Empirically, we find that using the same overall requested precision in the above steps 1 and 2 gives overall error still close to . It has been shown in 1D (see term in [14, p. 45]) that the type 3 error is bounded by the error in performing the above step 2 multiplied by , the dynamic range of over the usable frequency band (see Fig. 2(b)). Using , (39), and (37) with we approximate

(29)

which for gives . From (10), for any , so , which is quite small. This helps to justify the above choice of tolerances.

Choice of fine grid size (“Step 0” for type 3). Finally we are able to give the recipe for choosing the fine grid sizes (which of course in practice precedes the above three steps). This relies on aliasing error estimates [14] for steps 1 and 3 that we explain here only heuristically. In section 4.1 we will see that spreading onto a uniform grid of size induces a lattice of aliasing images separated by in frequency space, so that the correction step is only accurate to precision out to frequency magnitude . Thus, since for all and , the condition

(30)

is sufficient. Combining (26) and (30), then solving as equalities for the smallest gives the recipe for the optimal parameters (similar to [35, Rmk. 1]),

(31)
Remark 4 (FFT size for type 3)

The product of the grid sizes in each dimension sets the number of modes, hence the FFT effort required, in the type 2 transform in step 2. Crucially, this is independent of the numbers of sources and of targets . Rather, scales like the space-frequency product . This connects to the Fourier uncertainty principle: scales as the number of “Heisenberg boxes” needed to fill the centered rectangle enclosing the data. In fact, since the number of degrees of freedom [55, p. 391] (or “semiclassical basis size” [11]) needed to represent functions living in the rectangle is its area divided by , namely , we see that is asymptotically times this basis size.

Efficiently handling poorly-centered data. The above remark shows that the type 3 is helped by translating all coordinates and so that their respective bounding boxes are centered around the origin. This reduces the bounds and defined by (25), hence reduces and thus the cost of the FFT. Translations in or in are cheap to apply using the factorization

(32)

Thus the type 3 transform for translated data can be applied by pre-phasing the strengths by , doing the transform, then post-multiplying by . The extra cost is complex exponentials. In our library, if input or output points are sufficiently poorly centered, we apply (32) using as or the means of the minumum and maximum coordinates in each dimension.

Remark 5 (type 3 efficiency)

Remark 4 also shows that input data can be chosen for which the algorithm is arbitrarily inefficient. For example, with only two points () in 1D with , , , , then by choosing huge, (31) implies that the algorithm will require a huge amount of memory and time. Obviously in such cases a direct summation of (4) is preferable. However, for and large but with clustered data, a butterfly-type algorithm which hierarchically exploits (32) could be designed; we leave this for future work.

4 Error analysis and parameter choices

Here we summarize a rigorous estimate (proven in [4]) on the aliasing error of the 1D type 1 and 2 algorithms of section 3, when performed in exact arithmetic. We then use this to justify the algorithm parameter choices stated in (10). Finally we evaluate and discuss the gap between this estimate and empirical errors.

4.1 Theoretical results for the ES kernel

Let be the vector of outputs defined by (1) in 1D, and be the analogous output of the above type 1 NUFFT algorithm in exact arithmetic. We use similar notation for type 2. By linearity, and the fact that the type 2 algorithm is the adjoint of the type 1, the output aliasing error vectors must take the form

(33)

for some matrix . Standard analysis [48, (1.16)] [18, (4.1)] [16, Sec. V.B] (or see [4]) involving the Poisson summation formula shows that, with the choice (17) for , has elements

(34)

Since , error is thus controlled by a phased sum of the tails of at frequency magnitudes at least ; see Fig. 2(b).

It is usual in the literature to seek a uniform bound on elements of by discarding the information about , so that , , where

(35)

The latter inequality is close to tight because defined by (29) controls the loss due to bounding numerator and denominator separately, and is in practice small.

Remark 6

A practical heuristic for is sketched in Fig. 2(b): assuming that i) decreases monotonically with for , and ii) the worst-case sum (numerator in (35)) is dominated by the single value with smallest , then we get , whose logarithm is shown in the figure.

A common use for (35) is the simple - bounds for (33) (see [56] [16, p.12]):

(36)

These results apply to any spreading kernel; we now specialize to the ES kernel. Fix an upsampling factor . Given a kernel width in sample points, one must choose in (8) an ES kernel parameter such that defined in (12) has decayed to its exponentially small region once the smallest aliased frequency is reached; see (35) and Fig. 2(b). To this end we fix a “safety factor” , and set

(37)

so that for the exponential cutoff occurs exactly at , while for the cutoff is safely smaller than . With this set-up, the following states that the aliasing error converges almost exponentially with respect to the kernel width .

Theorem 7 ([4])

For the 1D type 1 and 2 NUFFT, fix and (hence the upsampled grid ) and the safety factor . With as in (37), then the aliasing error uniform bound (35) converges with respect to kernel width as

(38)

Its somewhat involved proof is detailed in [4]. A key ingredient is that, asymptotically as , has the same “exponential of semicircle” form (up to algebraic factors) in the below-cutoff domain that itself has in ; compare Figs. 1(b) and (c). Specifically, fixing the scaled frequency , [4] proves that,

(39)
Remark 8 (Comparison to Kaiser–Bessel bounds)

In the limit , (38) approaches the same exponential rate as (7), and with an algebraic prefactor improved by a factor . On the other hand, (7) has an explicit constant.

Figure 3: Comparison of empirical relative error () vs the requested tolerance (), for all nine FINUFFT routines. In each case , with roughly equal, with two problem sizes included ( and ). Nonuniform points are uniformly randomly distributed in , while strength data is complex Gaussian random. denotes the true output vector, either or (this is computed with tolerance ), and the computed output vector at requested tolerance .

4.2 ES kernel parameter choices and empirical error

We now justify and test the parameter choices (10). With , the factor 2.30 in (10) corresponds to a satefy factor in (37), very close to 1. Note that would give a factor ; however, we find that the factor gives a slightly lower typical error for a given than pushing closer or equal to 1 (this is likely due to the continued drop for visible in Fig. 1(c)). Fessler et al. found a similar factor 2.34 when optimizing the KB kernel [16, Fig. 11].

The width is set by solving (38) with its algebraic prefactor dropped, to give . Interpreting as the requested tolerance (see Remark 10 below), and inserting , gives

(40)

As we show below, the factor 1.065 may be replaced by unity while still giving empirical errors close to the requested tolerance. The constant term in (40) is fit empirically. Thus we have explained the parameter choices (10).

In many applications one cares about relative error in the output vector, which we will denote by

(41)

Following many [12, 48], we will use this metric for testing. Fig. 3 measures this metric for FINUFFT for all nine transform types at two different problem sizes, with random data and randomly located nonuniform points. This shows that, using the choice (10), the achieved relative error well matches the requested tolerance , apart from when round-off error dominates. The mean slope of the logarithm of the empirical error with respect to that of in Fig. 3 is slightly less than unity, due to the design choice of approximating the slope 1.065 in (40) by unity in (10).

Remark 9 (Rounding error)

Double-precision accuracy is used for all machine calculations in the library by default, and also in the studies in this work. The resulting rounding error is only apparent above aliasing error for the large 1D and 2D transforms at high accuracy. Fig. 3(a) shows that our library’s accuracy is limited to 9 relative digits in 1D for ; more generally Fig. 6 shows that rounding error is similar, and essentially the same for all tested libraries. When we find, in 1D, that the rounding contribution to is roughly times machine precision. Taking into account their choice of error norm, this concurs with the findings of [53, Fig. 2.3]. See [48, §1.4] for NUFFT rounding error analysis in 1D. We observe in 2D and 3D that it is that scales the rounding error; thus, as Fig. 3(b-c) shows, it is largely irrelevant even for large .

Finally, we discuss the gap between any theoretical aliasing error estimate deriving from (35)—this includes (7) and (38)—and the empirical relative error due to aliasing. The best possible type 1 bound from (33)–(35) is via the Frobenius norm , so

Writing the transform (1) as , where has elements , this gives

(42)

where in the last step the best bound applying to all nontrivial is used, and denotes the smallest singular value of , or zero if . Thus if , there cannot be a general type 1 bound on , simply because, unlike the DFT, the relative condition number of the type 1 task (1) may be infinite (consider , so ).666The condition number may also be huge even if . The following MATLAB code, in which nonuniform points lie randomly in only half of the periodic interval, outputs typically :  M=80; N=100; A = exp(1i*(-N/2:N/2-1)’*pi*rand(1,M)); min(svd(A))

However there are two distinct mechanisms by which (42) is pessimistic in real-world applications:

  1. For typical input data, is not smaller than ; in fact (as would be expected from randomized phases in ) typically . The growth factor is close to . Thus the problem is generally well-conditioned. See Böttcher–Potts [8, Sec. 4] for a formalization in terms of “probabilistic condition number”.

  2. The uniform bound (35) discards phase information in the elements of , which in practice induce large cancellations to give errors that are improved by a factor over bounds such as (36). Such ideas enable improved aliasing error estimates in a looser norm: e.g. interpreting (3) as point samples of a function , the error of the latter is easily bounded in , as clarified by Nestler [41, Lemma 1] (also see [27]).

Remark 10 (Empirical relative error)

Assuming both mechanisms above apply in practice, they can be combined to replace (42) with the heuristic

which justifies the interpretation of the requested tolerance as the uniform bound in (35) when setting kernel parameters. The result is that, as in Fig. 3, barring rounding error, relative error is almost always similar to the tolerance .

To summarize, rather than design the kernel parameters around the rigorous but highly pessimistic worst-case analysis (42), we (as others do) design for typical errors. Thus, before trusting the relative error, the user is recommended to check that for their input data the desired convergence with respect to has been achieved.

5 Implementation issues

Here we describe the main software aspects that accelerated the library. The chief computational costs in any NUFFT call are the spreading (types 1 and 3) or interpolation (type 2), scaling as , and the FFT, scaling as . We use the multithreaded FFTW library for the latter, thus in this analysis we focus on spreading/interpolation. In comparison, the correction steps, as explained in section 3, are cheap: 1D evaluations of are easily parallelized over the quadrature nodes (or, for type 3, the frequency points) with OpenMP, and the correction and reshuffling of coefficients is memory-bound so does not benefit from parallelization.

5.1 Bin sorting of nonuniform points for spreading/interpolation

When is large, the upsampled grid (with elements) is too large to fit in cache. Unordered reads/writes to RAM are very slow (hundreds of clock cycles), thus looping through the nonuniform points in an order which preserves locality in RAM uses cache well and speeds up spreading and interpolation, by a factor of typically 2–10, including the time to sort.777This is illustrated by running test/spreadtestnd 3 1e7 1e7 1e-12 x 0 1 where x is 0 (no sort) or 1 (sort). For interpolation (dir=2), we find a speedup factor 6 on a Xeon, or 14 on an i7. Each nonuniform point requires accessing a block extending grid points in each dimension, so there is no need to sort to the nearest grid point. Thus we set up boxes of size 16 grid points in the fast () dimension, and, in 2D or 3D, size 4 in the slower ( and ) dimensions. These sizes are a compromise between empirical speed and additional RAM needed for the sort. Then we do an “incomplete” histogram sort: we first count the number of points in each box and use this to construct the breakpoints between bins, then write point indices lying in each box into that bin, finally reading off the indices in the box ordering (without sorting inside each bin). This bin sort is multi-threaded, except for the low-density case where we find that the single-thread version is usually faster.

Figure 4: Sketch of parallel load balanced spreading scheme used for type 1 and type 3 transforms, showing the 2D case. Only two of the threads are shown.

5.2 Parallel spreading

The interpolation task (24) parallelizes well with OpenMP, even for highly nonuniform distributions. Each thread is assigned a subset of the points ; for each point it reads a block of size from the fine grid, does a weighted sum using as weights the tensor product of 1D kernels, then writes the sum to a distinct output .

In contrast, spreading (20) adds to blocks in the fine grid. Blocks being overwritten by different threads may collide, and, even if atomic operations are used to ensure a correct result, false sharing of cache lines [10, Sec. 5.5.2] makes it inefficient. A conventional solution is to assign threads equal distinct slices of the fine grid [29]; however, for nonuniform distributions this could lead to arbitrarily poor load balancing. Instead, we group sorted points into “subproblems” of size up to , which are assigned to threads; see Fig. 4. (This choice is heuristic; an optimal size would depend on L3 cache and the number of threads.) To handle a subproblem, a thread finds the cuboid bounding all in the subproblem, allocates a local cuboid of this size, spreads onto the cuboid, and finally adds the cuboid back into the fine grid. Since subproblems may overlap on the latter grid, this last step needs a #pragma omp critical block to avoid collisions between writes; however this causes minimal overhead since almost all the time is spent spreading to cuboids. The scheme is adaptive: regardless of the point distribution, all threads are kept busy almost all of the time. The scheme requires additional RAM of order the fine grid size.

A further advantage is that the periodic wrapping of grid indices, which is slow, may be avoided: cuboids are padded by in each dimension and written to without wrapping. Index wrapping is only used when adding to the fine grid.

5.3 Piecewise polynomial kernel approximation

The 1D ES kernel (8) requires one real-valued exp and sqrt per evaluation. However, we find that the throughput depends drastically on the CPU type (i7 or Xeon), compiler (GCC vs Intel ICC), and kernel width (the inner loop length that the compiler may be able to vectorize via SIMD instructions). For instance, a Xeon E5-2643 (with AVX2) with GCC version .x achieves only 40M evals/sec/thread, while the same CPU with ICC gives 50–200M evals/sec/thread. We believe this is due to compiler-provided exp instructions that exploit SIMD. Similar variations occur for the i7. Seeking a reliably efficient kernel evaluation on open-source compilers (e.g. GCC), we replaced the kernel evaluation by a polynomial look-up table. The result gives 350–600M evals/sec/thread, and accelerates 1D spreading/interpolation by a factor 2–3 (the effect in is less dramatic).

The look-up table works as follows. For each nonuniform point coordinate, the 1D kernel must be evaluated at ordinates (see Fig. 2(a)). We break the function into equal-width intervals and approximate each by a centered polynomial of degree . Ordinates within each of the intervals are then the same, allowing for SIMD vectorization. The approximation error need only be small relative to ; we find suffices. Monomial coefficients are found by solving a Vandermonde system on collocation points on the boundary of a square in the complex plane tightly enclosing the interval. For each of the two available upsampling factors ( and ), and all relevant , we automatically generate C code containing all coefficients and Horner’s evaluation scheme.

Remark 11

Piecewise polynomial approximation could confer on any kernel (e.g. KB, PSWF) this same high evaluation speed. However, AVX512 and future SIMD instruction sets may accelerate exp evaluations, making ES even faster. Since we do not know which will win with future CPUs, our library uses the ES kernel.

Figure 5: Parallel scaling of FINUFFT in 3D, for threads of a Xeon desktop with 12 physical cores. Both type 1 and type 2 tasks are tested. In all cases there are Fourier modes, and , the number of “sph quad” nodes (see section 6), is as shown. (a–b) show a high-accuracy case (12 digits). (c–d) show low accuracy (3 digits). For strong scaling the efficiency is the speed-up factor divided by ; for weak it is the speed-up factor for a problem size proportional to .

6 Performance tests

Tests were run on a desktop with two Intel Xeon 3.4 GHz E5-2643 CPUs (each with 20 MB L3 cache), giving 12 total physical cores (up to 24 threads), and 128 GB RAM, running EL7 linux. Unless specified we compile all codes with GCC v.7.3.0. We compiled FINUFFT version 1.0 with flags -fPIC -Ofast -funroll-loops -march=native. Experiments were driven using the MEX interface to MATLAB R2016b. In the codes that use FFTW (i.e. FINUFFT and NFFT), we use version 3.3.3 and set its plan method to FFTW_MEASURE (see [19]), which sometimes takes a very long time to plan during the first call. Thus, to show realistic throughput we time only subsequent calls, for which FFTW looks up its stored plan. To minimize variation we take the best of the three subsequent calls.

Tasks.  To assess the efficiency of our contributions—rather than merely measure the speed of FFTW—we choose “high density” tasks where is somewhat larger than , so the FFT is at most a small fraction of the total time. In 1D, since timings do not vary much with point distribution, we always test with iid uniform random in . For we use the following distributions (see insets in Figs. 79):

  • 2D “disc quad”: a polar grid over the disc of radius , using roughly radial Gauss–Legendre nodes and equi-spaced angular nodes.

  • 3D “rand cube”: iid uniform random in .

  • 3D “sph quad”: a spherical grid in the ball of radius , using radial Gauss–Legendre nodes and a tensor-product grid on each sphere.

Here the first and last are realistic quadrature schemes for NUFFT applications [6]. They involve a divergence in point density at the origin of the form for . We choose input strengths or coefficients as iid complex Gaussian random numbers.

Parallel scaling.  Fig. 5 shows parallel scaling tests of 3D type 1 and 2 FINUFFT. The highly-nonuniform “sph quad” distribution was used in order to test the load balancing described in section 5.2. For , where each point interacts with fine grid points, weak scaling (where grows with the number of threads) shows 90% parallel efficiency for (one thread per physical core). Above this, hyper-threading is used: as expected, although it provides a slight net speed boost, measured in threads its parallel efficiency falls far short of that for . Strong scaling (acceleration at fixed ) is a tougher test, dropping to 62–74% at .

For a low-accuracy test (), the kernel is narrower, touching only fine grid points, thus the RAM access pattern is more random relative to the number of FLOPs. We believe the resulting lower parallel efficiencies are due to memory bandwidth, rather than flops, being the bottleneck. That said, at threads (full hyper-threading), both transforms are still 5–7 times faster than for a single core.

Code name kernel langauge on-the-fly omp periodic domain notes
FINUFFT ES C++ yes yes
CMCL Gaussian Fortran yes no
NFFT bkwd. KB C yes or no yes
MIRT optim. KB MATLAB no no
BART KB, C yes yes only
Table 1: Summary of NUFFT libraries tested. “kernel” lists the default spreading function (some allow other kernels). “language” is the coding language. A “yes” in the column “on-the-fly” indicates that no precomputation/plan phase is needed, hence low RAM use per nonuniform point. “omp” shows if multithreading (e.g. via OpenMP) is available. See sections 1.1 and 6.1 for more details.
Figure 6: 1D comparisons. Execution time vs accuracy is shown for various NUFFT libraries, for random data in 1D. Precomputations (needed for codes labeled “pre”) were not included. The left shows single-threaded codes, the right multi-threaded. The top pair are type 1, the bottom pair type 2. See section 6.1.

6.1 Benchmarks against existing libraries

We now compare FINUFFT against several popular open-source CPU-based NUFFT libraries mentioned in section 1.1. Their properties (including their periodic domain conventions) are summarized in Table 1. We study speed vs accuracy for types 1 and 2 for , covering most applications. The machine, OS, and default compiler were as above. In multi-threaded tests we set (two threads per core). Each code provides a MATLAB interface, or is native MATLAB. For reproducibility, we now list their test parameters and set-up (also see https://github.com/ahbarnett/nufft-bench ):

  • FINUFFT, version 1.0. Compiler flags are as in the previous section. We tested tolerances .

  • CMCL NUFFT, version 1.3.3 [24]. This ships with MEX binaries dated 2014. It uses dfftpack for FFTs. For fairness, we recompiled the relevant *.mexa64 binaries on the test machine using gfortran with flags -fPIC -Ofast -funroll-loops -march=native, and mex. We tested tolerances .

  • NFFT, version 3.3.2 [29]. A compiler error resulted with GCC 7.x, so we used GCC 6.4.0. We used the default (backward KB) kernel. We used the “guru” interface with FFT grid size set to the smallest power of two at least , where is the number of modes in dimension , following their examples. Since they increased speed, we set the flags PRE_PHI_HUT, FFT_OUT_OF_PLACE, NFFT_OMP_BLOCKWISE_ADJOINT, and NFFT_SORT_NODES (the latter is not part of the standard MATLAB interface). We tested three variants:

    • no kernel precomputation; kernel is evaluated on the fly (labeled “NFFT”).

    • “pre”: option PRE_PSI which precomputes kernel values (with tensor products done on the fly).

    • “full pre”: option PRE_FULL_PSI which precomputes and then looks up all kernel values.

    We tested kernel parameters (kernel width is ), apart from “full pre” where RAM constraints limited us to .

  • MIRT (Michigan Image Reconstruction Toolbox), no version number; however the latest changes to the nufft directory were on Dec 13, 2016 [15]. This native MATLAB code precomputes a sparse matrix with all kernel values. Its matrix-vector multiplication appears to be single-threaded, thus we place this library in the single-threaded category. We use oversampling and the default kernel minmax:kb, which appears to be close to KB (5). RAM constraints limited us to testing width parameters (equivalent to our ).

  • BART (Berkeley Advanced Reconstruction Toolbox), version 0.4.02 [59]. This is a recent multithreaded C code for 3D only by Uecker, Lustig and Ong, used in a recent comparison by Ou [46]. We compiled with -O3 -ffast-math -funroll-loops -march=native. The MATLAB interface writes to and reads from temporary files; however for our problem sizes with the use of a local SSD drive this adds less than 10% to the run-time. BART did not ship with periodic wrapping of the spreading kernel; however, upon request888M. Uecker, private communication. we received a patched code src/noncart/grid.c. It has fixed accuracy ( is fixed). We empirically find that a prefactor gives around 5-digit accuracy (without the strange factor it gives only 3 digits).

The above notes also illustrate some of the challenges in setting up fair comparisons.

Figure 7: 2D comparisons. Execution time vs accuracy is shown for the tested libraries, for 2D polar “disc quad” nodes (see section 6) illustrated in the insets at smaller . Precomputations (needed for codes labeled “pre”) were not included. The left shows single-threaded codes, the right multi-threaded. The top pair are type 1, the bottom pair type 2. See section 6.1.

We now discuss the results (Figs. 69). In each case we chose , for reasons discussed above. To be favorable to codes that require precomputation, precomputation times were not counted (hence the label “after pre” in the figures). As in section 4.2, denotes relative error, measured against a ground truth of FINUFFT with tolerance .

1D comparisons.  Fig. 6 compares single-thread codes (left plots), and then, for a larger task, multi-threaded codes (right plots). For single-threaded, FINUFFT outperforms all libraries except MIRT, which exploits MATLAB’s apparently efficient sparse matrix-vector multiplication. However, what is not shown is that the precomputation time for MIRT is around 100 times longer than the transform time. For multithreaded type 1, FINUFFT is around faster than NFFT without precomputation, but around slower than “NFFT pre”. For type 2, FINUFFT and “NFFT pre” are similar, but of course this does not count the precomputation time (and higher RAM overhead) of “NFFT pre”. As per Remark 9, all libraries bottom out at around 9-10 digits due to rounding error.

2D comparisons.  Fig. 7 shows similar comparisons in 2D, for a point distribution concentrating at the origin. Compared to other codes not needing precomputation, FINUFFT is 2– faster than CMCL (when single-threaded), and 4– faster than NFFT. When NFFT is allowed precomputation, its type 2 multi-threaded speed is similar to FINUFFT, but for type 1 FINUFFT is faster at high accuracy.

Figure 8: 3D single-threaded comparisons. Execution time vs accuracy is shown for the tested libraries, for uniform random (left pair) and spherical quadrature (right pair) nodes (see section 6). Node patterns are shown in the insets at a smaller . Precomputations (needed for codes labeled “pre”) were not included. The top pair are type 1, the bottom pair type 2. See section 6.1.
Figure 9: 3D multithreaded comparisons. For explanation see caption of Fig. 8.
Code and parameters (rel. error) RAM overhead
FINUFFT (tol. ) 1.4e-06 N/A 14.6 s 8.8 GB
NFFT () 4.7e-06 10.4 s 238 s 20.9 GB
NFFT () PRE_PSI 4.7e-06 67.3 s 125 s 67.1 GB
Table 2: Performance of FINUFFT and NFFT for a large 3D Type 1 transform with roughly 6-digit accuracy, using 24 threads. modes are requested with “sph quad” nonuniform points. The spreading time dominates over the FFT. For NFFT, counts both planning and kernel precomputation. RAM is measured using top, relative to the baseline (around 12 GB) needed to store the input data in MATLAB. See section 6 for machine and NFFT parameters.

3D comparisons.  Fig. 8 compares single-threaded codes (now including BART). The left pair of plots shows random points: FINUFFT is at least faster than any other code, apart from MIRT at 1-digit accuracy. CMCL is a factor 4– slower than single-threaded FINUFFT, we believe in part due to its lack of sorting nonuniform points. (The evidence is that for the right pair, where points have an ordered access pattern, CMCL is only 2– slower). NFFT without precomputation is 3– slower than FINUFFT; precomputation brings this down to 2–. As for , we observe that “NFFT full pre” is no faster than “NFFT pre”, despite its longer precomputation and larger RAM overhead.

Fig. 9 shows larger multi-threaded comparisons against NFFT; now we cannot include “NFFT full pre” due to its large RAM usage. At low accuracies with random points, FINUFFT and “NFFT pre” have similar speeds. However, for type 1 “sph quad” (panel (b)), for , FINUFFT is 8– faster than NFFT even with precomputation. FINUFFT is at least faster than BART in all cases.

In Table 2 we compare FINUFFT and NFFT in terms of both speed and memory overhead, for the same large, medium-accuracy, multi-threaded task. We emphasize that, since , , is a power of two, the fine grids chosen by the two libraries are an identical . This means that the memory use, and the FFTW calls, are identical. Furthermore, the kernel widths are both so the numbers of fine grid points written to are identical. If precomputations are excluded, FINUFFT is faster than NFFT, and faster than “NFFT pre”. For a single use (i.e. including initialization and precomputation), these ratios become and , respectively.

Remark 12

We believe that the following explains the large type 1 performance gain of FINUFFT over NFFT for the 3D clustered “sph quad” nodes, shown by Fig. 9(b) and Table 2. NFFT assigns threads to equal slices of the fine grid (in the -direction), whereas FINUFFT uses subproblems which load balance regardless of the clustering of nodes. We find that only a couple of threads are active for the entire run time of NFFT; most complete their jobs quickly, giving low parallel efficiency.

Finally, Table 2 shows that, if NFFT precomputation is used, its RAM overhead is around that of FINUFFT. The “NFFT pre” RAM overhead of around 28 doubles per point is consistent with the expected stored kernel values per point.

7 Conclusion

We have presented an open-source parallel CPU-based general purpose type 1, 2 and 3 NUFFT library (FINUFFT) for dimensions 1, 2, and 3 that is competitive with existing CPU-based libraries. Using a new spreading kernel (8), all kernel evaluations are efficiently done on the fly: this avoids any precomputation phase, keeps the RAM overhead small, and allows for a simple user-friendly interface from multiple languages. Efficient parallelization balances the work of threads, adapting to any nonuniform point distribution. For all three types, we introduce numerical quadrature to evaluate a kernel Fourier transform for which there is no known analytic formula. Rigorous estimates show almost exponential convergence of the kernel aliasing error, with rate arbitrarily close to that of the best known. We explained the gap between such estimates and empirical relative errors. We benchmarked several NUFFT libraries in detail. We showed that for certain 3D problems with clustered distributions FINUFFT is an order of magnitude faster than the other libraries, even when they are allowed precomputation. In the latter case, FINUFFT has an order of magnitude less RAM overhead.

There are several directions for future work, starting with benchmarking the less-common type 3 case. An efficient interface for the case of repeated small problems (Remark 2) should be completed. There is also a need for a carefully-benchmarked general-purpose GPU NUFFT code, following Kunis–Kunis [34], Ou [46], and others. Implementation of both of the above is in progress.

Remark 13

In some particle-mesh Ewald applications [37] one needs spatial derivatives of the spreading kernel.999A.-K. Tornberg, personal communication. However, (8) has unbounded derivatives (with inverse square-root singularity) at the endpoints. Instead one may prefer the exponentially-close variant since it is smooth up to the endpoints. This kernel requires one extra reciprocal, or approximation as in section 5.3.

Acknowledgments

We are grateful for discussions with Joakim Andén, Charlie Epstein, Zydrunas Gimbutas, Leslie Greengard, Hannah Lawrence, Andras Pataki, Daniel Potts, Vladimir Rokhlin, Yu-hsuan Shih, Marina Spivak, David Stein, and Anna-Karin Tornberg. The Flatiron Institute is a division of the Simons Foundation.

References

  • [1] L. af Klinteberg. Julia interface to FINUFFT, 2018. https://github.com/ludvigak/FINUFFT.jl.
  • [2] C. Anderson and M. D. Dahleh. Rapid computation of the discrete Fourier transform. SIAM J. Sci. Comput., 17(4):913–919, 1996.
  • [3] F. Andersson, R. Moses, and F. Natterer. Fast Fourier methods for synthetic aperture radar imaging. IEEE Trans. Aerospace Elec. Sys., 48(1):215–229, 2012.
  • [4] A. H. Barnett. Asymptotic aliasing error of the kernel exp when used for non-uniform fast Fourier transforms, 2019. in preparation.
  • [5] A. H. Barnett, J. Magland, and L. af Klinterberg. Flatiron Institute nonuniform fast Fourier transform libraries (FINUFFT), 2018. https://github.com/flatironinstitute/finufft.
  • [6] A. H. Barnett, M. Spivak, A. Pataki, and L. Greengard. Rapid solution of the cryo-EM reconstruction problem by frequency marching. SIAM J. Imaging Sci., 10(3):1170–1195, 2017.
  • [7] G. Beylkin. On the fast Fourier transform of functions with singularities. Appl. Comput. Harmonic Anal., 2:363–383, 1995.
  • [8] A. Böttcher and D. Potts. Probability against condition number and sampling of multivariate trigonometric random polynomials. Electron. Trans. Numer. Anal., 26:178–189, 2007.
  • [9] M. M. Bronstein, A. M. Bronstein, M. Zibulevsky, and H. Azhari. Reconstruction in diffraction ultrasound tomography using nonuniform FFT. IEEE Trans. Medical Imaging, 21(11):1395–1401, 2002.
  • [10] B. Chapman, G. Jost, and R. van der Pas. Using OpenMP. Portable Shared Memory Parallel Programming. MIT Press, 2008.
  • [11] M. J. Davis and E. J. Heller. Semiclassical Gaussian basis set method for molecular vibrational wave functions. J. Chem. Phys., 71(8):3383–3395, 1979.
  • [12] A. Dutt and V. Rokhlin. Fast Fourier transforms for nonequispaced data. SIAM J. Sci. Comput., 14:1369–1393, 1993.
  • [13] A. Dutt and V. Rokhlin. Fast Fourier transforms for nonequispaced data, II. Appl. Comput. Harmonic Anal., 2:85–100, 1995.
  • [14] B. Elbel and G. Steidl. Fast Fourier transform for nonequispaced data. In Approximation Theory IX, pages 39–46. Vanderbilt University Press, Nashville, 1998.
  • [15] J. Fessler. Michigan image reconstruction toolbox, 2016. https://web.eecs.umich.edu/fessler/irt/fessler.tgz.
  • [16] J. Fessler and B. Sutton. Nonuniform fast fourier transforms using min-max interpolation. IEEE Trans. Signal Proc., 51(2):560–574, 2003.
  • [17] K. Fourmont. Schnelle Fourier-Transformation bei nichtäquidistanten Gittern und tomographische Anwendungen. PhD thesis, Univ. Münster, 1999.
  • [18] K. Fourmont. Non-equispaced fast Fourier transforms with applications to tomography. J. Fourier Anal. Appl., 9(5):431–450, 2003.
  • [19] M. Frigo and S. G. Johnson. FFTW. http://www.fftw.org/.
  • [20] Z. Gimbutas and S. Veerapaneni. A fast algorithm for spherical grid rotations and its application to singular quadrature. SIAM J. Sci. Comput., 5(6):A2738–A2751, 2013.
  • [21] A. Goldstein and J. Abbate. Oral history: James Kaiser. http://ethw.org/Oral-History:James_Kaiser, 1997. online; accessed 2017-04-15.
  • [22] I. Gradshteyn and I. Ryzhik. Table of Integrals, Series and Products. New York: Academic, 8th edition, 2015.
  • [23] L. Greegard, J.-Y. Lee, and S. Inati. The fast sinc transform and image reconstruction from nonuniform samples in -space. Comm. Appl. Math. and Comp. Sci., 1(1):121–131, 2006.
  • [24] L. Greengard and J.-Y. Lee. NUFFT libraries in Fortran. http://www.cims.nyu.edu/cmcl/nufft/nufft.html. online; accessed 2017-04-10.
  • [25] L. Greengard and J.-Y. Lee. Accelerating the nonuniform fast Fourier transform. SIAM Review, 46(3):443–454, 2004.
  • [26] J. I. Jackson, C. H. Meyer, D. G. Nishimura, and A. Macovski. Selection of a convolution function for Fourier inversion using gridding. IEEE Trans. Medical Imaging, 10(3):473–478, 1991.
  • [27] M. Jacob. Optimized least-square nonuniform fast Fourier transform. IEEE Trans. Signal Process., 57(6):2165–2177, 2009.
  • [28] J. Kaiser. Digital filters. In J. Kaiser and F. Kuo, editors, System analysis by digital computer, chapter 7, pages 218–285. Wiley, 1966.
  • [29] J. Keiner, S. Kunis, and D. Potts. NFFT. http://www-user.tu-chemnitz.de/potts/nfft, 2002–2016. online; accessed 2017-04-10.
  • [30] J. Keiner, S. Kunis, and D. Potts. Using NFFT 3 — a software library for various nonequispaced fast Fourier transforms. ACM Trans. Math. Software, 36(4), 2009.
  • [31] M. Kircheis and D. Potts. Direct inversion of the nonequispaced fast fourier transform, 2019. arxiv:1811.05335, submitted to Linear Algebra Appl.
  • [32] F. Knoll, A. Schwarzl, C. Diwoki, and D. K. Sodickson. gpuNUFFT—an open-source GPU library for 3D gridding with direct MATLAB interface. Proc. Intl. Soc. Mag. Reson. Med., 22, 2014. https://github.com/andyschwarzl/gpuNUFFT.
  • [33] S. Kunis. Nonequispaced FFT: Generalisation and Inversion. PhD thesis, Universität zu Lübeck, 2006.
  • [34] S. Kunis and S. Kunis. The nonequispaced FFT on graphics processing units. Proc. Appl. Math. Mech., 12:7–10, 2012.
  • [35] J.-Y. Lee and L. Greengard. The type 3 nonuniform FFT and its applications. J. Comput. Phys., 206:1–5, 2005.
  • [36] J.-M. Lin. Python non-uniform fast Fourier transform (PyNUFFT): multi-dimensional non-Cartesian image reconstruction package for heterogeneous platforms and applications to MRI, 2017. arxiv:1710.03197v1.
  • [37] D. Lindbo and A.-K. Tornberg. Spectral accuracy in fast Ewald-based methods for particle simulations. J. Comput. Phys., 230:8744–8761, 2011.
  • [38] Q. H. Liu and N. Nguyen. An accurate algorithm for nonuniform fast fourier transforms (NUFFT’s). IEEE Microwave and Guided Wave Letters, 8(1):18–20, 1998.
  • [39] D. D. Meisel. Fourier transforms of data samples at unequal observational intervals. Astronom. J., 83(5):538–545, 1978.
  • [40] S. L. Moshier. CEPHES mathematical function library, 1984–1992. http://www.netlib.org/cephes.
  • [41] F. Nestler. Automated parameter tuning based on RMS errors for nonequispaced FFTs. Adv. Comput. Math., 42:889–919, 2016.
  • [42] F. Nestler, M. Pippig, and D. Potts. Fast Ewald summation based on NFFT with mixed periodicity. J. Comput. Phys., 285:280–315, 2015.
  • [43] F. W. J. Olver, D. W. Lozier, R. F. Boisvert, and C. W. Clark, editors. NIST Handbook of Mathematical Functions. Cambridge University Press, 2010. http://dlmf.nist.gov.
  • [44] A. Oppenheim, D. Johnson, and K. Steiglitz. Computation of spectra with unequal resolution using the fast Fourier transform. Proc. IEEE, 59(2):299–301, 1971.
  • [45] A. Osipov, V. Rokhlin, and H. Xiao. Prolate Spheroidal Wave Functions of Order Zero: Mathematical Tools for Bandlimited Approximation, volume 187 of Applied Mathematical Sciences. Springer, US, 2013.
  • [46] T. Ou. gNUFFTW: Auto-tuning for high-performance GPU-accelerated non-uniform fast Fourier transforms. http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-90.html, 2017. Technical Report #UCB/EECS-2017-90, UC Berkeley.
  • [47] D. Potts. Schnelle Fourier-Transformationen für nichtäquidistante Daten und Anwendungen, 2003. Habilitationssschift, Lübeck.
  • [48] D. Potts, G. Steidl, and M. Tasche. Fast Fourier transforms for nonequispaced data: a tutorial. In Modern Sampling Theory: Mathematics and Applications, pages 247–270. Birkhäuser, Boston, 2001.
  • [49] W. H. Press and G. B. Rybicki. Fast algorithm for spectral analysis of unevenly sampled data. Astrophys. J., 338:277–280, 1989.
  • [50] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical recipes in C. Cambridge University Press, Cambridge, 2nd edition, 2002.
  • [51] A. Prudnikov, Y. A. Brychkov, and O. I. Marichev. Integrals and series, Volume 1. Elementary functions. Gordon and Breach, 1986.
  • [52] A. Prudnikov, Y. A. Brychkov, and O. I. Marichev. Integrals and series, Volume 2. Special functions. Gordon and Breach, 1986.
  • [53] D. Ruis-Antolín and A. Townsend. A nonuniform fast Fourier transform based on low rank approximation, 2018.
  • [54] D. Slepian. Some asymptotic expansions for prolate spheroidal wave functions. Journal of Mathematics and Physics, 44(1-4):99–140, 1965.
  • [55] D. Slepian. Some comments on Fourier analysis, uncertainty, and modeling. SIAM Review, 25:379–393, 1983.
  • [56] G. Steidl. A note on fast Fourier transforms for nonequispaced grids. Adv. Comput. Math., 9:337–352, 1998.
  • [57] B. P. Sutton, D. C. Noll, and J. A. Fessler. Fast, iterative image reconstruction for MRI in the presence of field inhomogeneities. IEEE Trans. Med. Imaging, 22(2):178–188, 2003.
  • [58] A. R. Thompson and R. N. Bracewell. Interpolatio