Accelerating incoherent dedispersion
Abstract
Incoherent dedispersion is a computationally intensive problem that appears frequently in pulsar and transient astronomy. For current and future transient pipelines, dedispersion can dominate the total execution time, meaning its computational speed acts as a constraint on the quality and quantity of science results. It is thus critical that the algorithm be able to take advantage of trends in commodity computing hardware. With this goal in mind, we present analysis of the ‘direct’, ‘tree’ and ‘subband’ dedispersion algorithms with respect to their potential for efficient execution on modern graphics processing units (GPUs). We find all three to be excellent candidates, and proceed to describe implementations in C for CUDA using insight gained from the analysis. Using recent CPU and GPU hardware, the transition to the GPU provides a speedup of 9 for the direct algorithm when compared to an optimised quadcore CPU code. For realistic recent survey parameters, these speeds are high enough that further optimisation is unnecessary to achieve realtime processing. Where further speedups are desirable, we find that the tree and subband algorithms are able to provide 3–7 better performance at the cost of certain smearing, memory consumption and development time tradeoffs. We finish with a discussion of the implications of these results for future transient surveys. Our GPU dedispersion code is publicly available as a C library at: http://dedisp.googlecode.com/
keywords:
methods: data analysis – pulsars: general1 Introduction
With the advent of modern telescopes and digital signal processing backends, the timeresolved radio sky has become a rich source of astrophysical information. Observations of pulsars allow us to probe the nature of neutron stars (Lattimer & Prakash 2004), stellar populations (Bhattacharya & van den Heuvel 1991), the galactic environment (Gaensler et al. 2008), plasma physics and gravitational waves (Lyne et al. 2004). Of equal significance are transient signals such as those from rotating radio transients (McLaughlin et al., 2006) and potentially rare oneoff events such as ‘Lorimer bursts’ (Lorimer et al., 2007; Keane et al., 2011), which may correspond to previously unknown phenomena. These observations all depend on the use of significant computing power to search for signals within long, frequencyresolved time series.
As radiation from sources such as pulsars propagates to Earth, it interacts with free electrons in the interstellar medium. This interaction has the effect of delaying the signal in a frequencydependent manner – signals at lower frequencies are delayed more than those at higher frequencies. Formally, the observed time delay, , between two frequencies and as a result of dispersion by the interstellar medium is given by
(1) 
where is the dispersion constant^{1}^{1}1We note that the dispersion constant is commonly approximated in the literature as . and the frequencies are in MHz. The parameter DM specifies the dispersion measure along the line of sight in pc cm, and is defined as
(2) 
where is the electron number density (cm) and is the distance to the source (pc).
Once a timevarying source has been detected, its dispersion measure can be obtained from observations of its phase as a function of frequency; this in turn allows the distance to the object to be calculated via equation (2), assuming one has a model for the Galactic electron density . When searching for new sources, however, one does not know the distance to the object. In these cases, the dispersion measure must be guessed prior to looking for a signal. To avoid excessive smearing of signals in the time series, and a consequent loss of signaltonoise, survey pipelines typically repeat the process for many trial dispersion measures. This process is referred to as a dedispersion transform. An example of the dedispersion transform is shown in Fig. 1.
Computing the dedispersion transform is a computationally expensive task: a simple approach involves a summation across a band of, e.g., frequency channels for each of (typically) dispersion measures, for each time sample. Given modern sampling intervals of , computing this in realtime is a challenging task, especially if the process must be repeated for multiple beams. The prohibitive cost of realtime dedispersion has traditionally necessitated that pulsar and transient survey projects use offline processing.
In this paper we consider three ways in which computation of the dedispersion transform may be accelerated, enabling realtime processing at low cost. First, in Section 2 we demonstrate how modern manycore computing hardware in the form of graphics processing units [GPUs; see Fluke et al. (2011) for an introduction] can provide an order of magnitude more performance over a multicore central processing unit (CPU) when dedispersing ‘directly’. The use of GPUs for incoherent dedispersion is not an entirely new idea. Dodson et al. (2010) introduced an implementation of such a system as part of the CRAFT survey. Magro et al. (2011) described a similar approach and how it may be used to construct a GPUbased realtime transient detection pipeline for modest fractional bandwidths, demonstrating that their GPU dedisperser could out perform a generic code by two orders of magnitude. In this work we provide a thorough analysis of both the direct incoherent dedispersion algorithm itself and the details of its implementation on GPU hardware.
In Section 3 we then consider the use of the ‘tree’ algorithm, a (theoretically) more efficient means of computing the dedispersion transform. To our knowledge, this technique has not previously been implemented on a GPU. We conclude our analysis of dedispersion algorithms in Section 4 with a discussion of the ‘subband’ method, a derivative of the direct method.
2 Direct Dedispersion
2.1 Introduction
The direct dedispersion algorithm operates by directly summing frequency channels along a quadratic dispersion trail for each time sample and dispersion measure. In detail, the algorithm computes an array of dedispersed time series from an input dataset according to the following equation:
(3) 
where the subscripts , and represent dispersion measure, time sample and frequency channel respectively, and is the total number of frequency channels. Note that throughout this paper we use the convention that means the sum over the range to . The function is a discretized version of equation (1) and gives the time delay relative to the start of the band in whole time samples for a given dispersion measure and frequency channel:
(4)  
(5) 
where is the time difference in seconds between two adjacent samples (the sampling interval), is the frequency in MHz at the start of the band, is the frequency difference in MHz between two adjacent channels and the function round means rounded to the nearest integer. The function or array DM is used to specify the dispersion measures to be computed. Note that the commonlyused central frequency, , and bandwidth, BW, parameters are related by and .
After dedispersion, the dedispersed time series can be searched for periodic or transient signals.
When dedispersing at large DM, the dispersion of a signal can be such that it is smeared significantly within a single frequency channel. Specifically, this occurs when the gradient of a dispersion curve on the timefrequency grid is less than unity (i.e., beyond the ‘diagonal’). Once this effect becomes significant, it becomes somewhat inefficient to continue to dedisperse at the full native time resolution. One option is to reduce the time resolution by a factor of two when the DM exceeds the diagonal by adding adjacent pairs of time samples. This process is then repeated at 2 the diagonal, 4 etc. We refer to this technique as ‘timescrunching’. The use of timescrunching will reduce the overall computational cost, but can also slightly reduce the signaltonoise ratio if the intrinsic pulse width is comparable to that of the dispersion smear.
2.2 Algorithm analysis
The direct dedispersion algorithm’s summation over frequency channels for each of time samples and dispersion measures gives it a computational complexity of
(6) 
The algorithm was analysed previously for manycore architectures in Barsdell, Barnes & Fluke (2010). The key findings were:

the algorithm is best parallelised over the “embarassingly parallel” dispersionmeasure () and time () dimensions, with the sum over frequency channels () being performed sequentially,

the algorithm has a very high theoretical arithmetic intensity, of the same magnitude as the number of dispersion measures computed [typically ], and

the memory access patterns generally exhibit reasonable locality, but their nontrivial nature may make it difficult to achieve a high arithmetic intensity.
While overall the algorithm appears wellpositioned to take advantage of massively parallel hardware, we need to perform a deeper analysis to determine the optimal implementation strategy. The pattern in which memory is accessed is often critical to performance on massivelyparallel architectures, so this is where we now turn our attention.
While the dimension involves a potentially nonlinear mapping of input indices to output indices, the dimension maintains a contiguous mapping from input to output. This makes the dimension suitable for efficient memory access operations via spatial caching, where groups of adjacent parallel threads access memory all at once. This behaviour typically allows a majority of the available memory bandwidth to be exploited.
The remaining memory access issue is the potential use of temporal caching to increase the arithmetic intensity of the algorithm. Dedispersion at similar DMs involves accessing similar regions of input data. By preloading a block of data into a shared cache, many DMs could be computed before needing to return to main memory for more data. This would increase the arithmetic intensity by a factor proportional to the size of the shared cache, potentially providing a significant performance increase, assuming the algorithm was otherwise limited by available memory bandwidth. The problem with the direct dedispersion algorithm, however, is its nonlinear memory access pattern in the dimension. This behaviour makes a caching scheme difficult to devise, as one must account for threads at different DMs needing to access data at delayed times. Whether temporal caching can truly be used effectively for the direct dedispersion algorithm will depend on details of the implementation.
2.3 Implementation Notes
When discussing GPU implementations throughout this paper, we use the terms ‘Fermi’ and ‘preFermi’ GPUs to mean GPUs of the NVIDIA Fermi architecture and those of older architectures respectively. We consider both architectures in order to study the recent evolution of GPU hardware and gain insight into the future direction of the technology.
We implemented the direct dedispersion algorithm using the C for CUDA platform^{2}^{2}2http://developer.nvidia.com/object/gpucomputing.html. As suggested by the analysis in Section 2.2, the algorithm was parallelised over the dispersionmeasure and time dimensions, with each thread summing all channels sequentially. During the analysis it was also noted that the algorithm’s memory access pattern exhibits good spatial locality in the time dimension, with contiguous output indices mapping to contiguous input indices. We therefore chose time as the fastestchanging (i.e., ) thread dimension, such that reads from global memory would always be from contiguous regions with a unit stride, maximising throughput. The DM dimension was consequently mapped to the second (i.e., ) thread dimension.
While the memory access pattern is always contiguous, it is not always aligned. This is a result of the delays, , introduced in the time dimension. At all nonzero DMs, the majority of memory accesses will begin at arbitrary offsets with respect to the internal alignment boundaries of the memory hardware. The consequence of this is that GPUs that do not have builtin caching support may need to split the memory requests into many smaller ones, significantly impacting throughput to the processors. In order to avoid this situation, we made use of the GPU’s texture memory, which does support automatic caching. On preFermi GPU hardware, the use of texture memory resulted in a speedup of around compared to using plain device memory, highlighting the importance of understanding the details of an algorithm’s memory access patterns when using these architectures. With the advent of Fermiclass GPUs, however, the situation has improved significantly. These devices contain an L1 cache that provides many of the advantages of using texture memory without having to explicitly refer to a special memory area. Using texture memory on Fermiclass GPUs was slightly slower than using plain device memory (with L1 cache enabled), as suggested in the CUDA programming guide^{3}^{3}3The current version of the CUDA programming guide is available for download at: http://www.nvidia.com/object/cuda_develop.html.
Input data with fewer bits per sample than the machine word size (currently assumed to be 32 bits) were handled using bitshifting and masking operations on the GPU. It was found that a convenient format for working with the input data was to transpose the input from timemajor order to frequencymajor order by whole words, leaving consecutive frequency channels within each word. For example, for the case of two samples per word, the data order would be: ,,, ,,, where brackets denote data within a machine word. This format means that time delays are always applied in units of whole words, avoiding the need to deal with intraword delays.
The thread decomposition was written to allow the shape of the block (i.e., number of DMs or time samples per block) to be tuned. We found that for a block size of 256 threads, optimal performance on a Fermi GPU was achieved when this was divided into 8 time samples 32 DMs. We interpreted this result as a cacherelated effect, where the block shape determines the spread of memory locations accessed by a group of neighbouring threads spread across timeDM space, and the optimum occurs when this spread is minimised. On preFermi GPUs, changing the block shape was found to have very little impact on performance.
To minimise redundant computations, the functions and were precomputed and stored in lookup tables for the given dispersion measures and frequency channels respectively. Delays were then computed simply by retrieving values from the two tables and evaluating equation (5), requiring only a single multiplication and a rounding operation. On preFermi GPUs, the table corresponding to was explicitly stored in the GPU’s constant memory space, which provides extremely efficient access when all threads read the same value (this is always the case for our implementation, where frequency channels are traversed sequentially). On Fermigeneration cards, this explicit use of the constant memory space is unnecessary – constant memory caching is used automatically when the compiler determines it to be possible.
To amortize overheads within the GPU kernel such as index calculations, loop counters and timedelay computations, we allowed each thread to store and sum multiple time samples. Processing four samples per thread was found to significantly reduce the total arithmetic cost without affecting memory throughput. Increasing this number required more registers per thread (a finite resource), and led to diminishing returns; we found four to be the optimal solution for our implementation.
Our implementation was written to support a channel “kill mask”, which
specifies which frequency channels should be included in the computation and
which should be skipped (e.g., to avoid radio frequency interference present
within them). While our initial approach was to apply this mask as a
conditional statement [e.g., if( kill_mask[channel] ) { sum += data }
],
it was found that applying the mask arithmetically (e.g.,
sum += data * kill_mask[channel]
) resulted in better performance. This
is not particularly surprising given the GPU hardware’s focus on arithmetic
throughput rather than branching operations.
Finally, we investigated the possibility of using temporal caching, as discussed in the analysis in Section 2.2. Unlike most CPUs, GPUs provide a manuallymanaged cache (known as shared memory on NVIDIA GPUs). This provides additional power and flexibility at the cost of programming effort. We used shared memory to stage a rectangular section of input data (i.e., of timefrequency space) in each thread block. Careful attention was given to the amount of data cached, with additional time samples being loaded to allow for differences in delay across a block. The cost of development was significant, and it remained unclear whether the caching mechanism could be made robust against a variety of input parameters. Further, we found that the overall performance of the code was not significantly altered by the addition of the temporal caching mechanism. We concluded that the additional overheads involved in handling the nonlinear memory access patterns (i.e., the mapping of blocks of threads in timeDM space to memory in timefrequency space) negated the performance benefit of staging data in the shared cache. We note, however, that cacheing may prove beneficial when considering only low DMs (e.g., below the diagonal), where delays vary slowly and memory access patterns remain relatively compact.
In theory it is possible that, via careful repacking of the input data, one could exploit the bitlevel parallelism available in modern computing hardware in addition to the threadlevel parallelism. For example, for 2bit data, packing each 2bit value into 8bits would allow four values to be summed in parallel with a single 32bit addition instruction. In this case, additions could be performed before one risked integer overflow. To dedisperse say 1024 channels, one could first sum blocks of 85 channels and then finish the summation by repacking the partial sums into a larger data type. This would achieve efficient use of the available processing hardware, at the cost of additional implementation complexity and overheads for repacking and data management. We did not use this technique in our GPU dedispersion codes, although our reference CPU code does exploit this extra parallelism by packing four 2bit samples into a 64bit word before dedispersion.
3 Tree Dedispersion
3.1 Introduction
The tree dedispersion algorithm, first described by Taylor (1974), attempts to reduce the complexity of the dedispersion computation from to . This significant speedup is obtained by first regularising the problem and then exploiting the regularity to allow repeated calculations to be shared between different DMs. While theoretical speedups of are possible, in practice a number of additional overheads arise when working with real data. These overheads, as well as its increased complexity, have meant that the tree algorithm is rarely used in modern search pipelines. In this work we investigate the tree algorithm in order to assess its usefulness in the age of manycore processors.
In its most basic form, the tree dedispersion algorithm is used to compute the following:
(7)  
(8) 
for in the range . The regularisation is such that the delay function is now a linear function of that ranges from to exactly across the band. The DMs computed by the tree algorithm are therefore:
(9) 
where the function is that given by equation (4).
The tree algorithm is able to evaluate equation (7) for in the range in just steps. It achieves this feat by using a divide and conquer approach in the same way as the wellknown fast Fourier transform (FFT) algorithm. The tree algorithm is visualised in Fig. 2. We define the computation at each step as follows:
(10)  
(11)  
(12)  
(13) 
The integer function gives the time delay for a given iteration and frequency channel and can be defined as
(14) 
where mod is the modulus operator, and division is taken to be truncated integer division. The integer function , which we refer to as the ‘shuffle’ function, reorders the indices according to a pattern defined as follows:
(15) 
where the parameter is known as the radix.
While the tree dedispersion algorithm succeeds in dramatically reducing the computational cost of dedispersion, it has a number of constraints not present in the direct algorithm:

the computed dispersion trails are linear in frequency, not quadratic as found in nature [see equation (1)],

the computed dispersion measures are constrained to those given by equation (9), and

the number of input frequency channels (and thus also the number of DMs) must be a power of two.
Constraint (iii) is generally not a significant concern, as it is common for the number of frequency channels to be a power of two, and blank channels can be added when this is not the case. Constraints (i) and (ii) are more problematic, as they prevent the computation of accurate and efficientlydistributed dispersion trails. Fortunately there are ways of working around these limitations.
One method is to approximate the dispersion trail with piecewise linear segments by dividing the input data into subbands (Manchester et al., 1996). Another approach is to quadratically space the input frequencies by padding with blank channels as a preprocessing step such that the second order term in the dispersion trail is removed (Manchester et al., 2001). These techniques are described in the next two sections.
3.1.1 The piecewise linear tree method
Approximation of the quadratic dispersion curve using piecewise linear segments involves two stages of computation. If the input data are divided into subbands of length
(16) 
with the subband starting at frequency channel
(17) 
then from equation (7) we see that the tree dedispersion algorithm applied to each subband results in the following:
(18) 
which we refer to as stage 1 of the piecewise linear tree method.
In each subband, we approximate the quadratic dispersion trail with a linear one. We compute the linear DM in the subband that approximates the true DM indexed by as follows:
(19)  
(20) 
Applying the constraint and noting that the greatest dispersion delay occurs at the end of the band, we obtain a limit on the DM that the basic piecewise linear tree algorithm can compute. This limit is commonly referred to as the ‘diagonal’ DM, as it corresponds to a dispersion trail in the timefrequency grid with a gradient of unity:^{4}^{4}4Note that the ‘’ in equation (21) arises from the roundtonearest operation in equation (20).
(21) 
A technique for computing larger DMs with the tree algorithm is discussed in Section 3.1.3.
The dedispersed subbands can now be combined to approximate the result of equation (3):
(22)  
(23) 
This forms stage 2 of the piecewise linear tree computation.
The use of the tree algorithm with subbanding introduces an additional source of smearing into the dedispersed time series as a result of approximating the quadratic dispersion curve with a piecewise linear one. We derive an analytic upper limit for this smearing in Appendix A.
3.1.2 The frequencypadded tree method
An alternative approach to adapting the tree algorithm to quadratic dispersion curves is to linearise the input data via a change of frequency coordinates. Formally, the aim is to ‘stretch’ [equation (4)] to a linear function . Expanding to first order around , we have:
(24) 
The change of variables is then found by equating with its linear approximation, , and solving for , which gives
(25) 
Evaluating at gives the total number of frequency channels in the linearised coordinates, which determines the additional computational overhead introduced by the procedure. Note, however, that this number must be rounded up to a power of two before the tree dedispersion algorithm can be applied. For observations with typical effective bandwidths and channel counts that are already a power of two, the frequency padding technique is unlikely to require increasing the total number of channels by more than a factor of two.
In practice, the linearisation procedure is applied by padding the frequency dimension with blank channels such that the real channels are spaced according to equation (25). Once the dispersion trails have been linearised, the tree algorithm can be applied directly.
The ‘diagonal’ DM when using the frequency padding method corresponds to
(26) 
3.1.3 Computing larger DMs
The basic tree dedispersion algorithm computes exactly the DMs specified by equation (9). In practice, however, it is often necessary to search a much larger range of dispersion measures. Fortunately, there are techniques by which this can be achieved without having to resort to using the direct method. The tree algorithm can be made to compute higher DMs by first transforming the input data and then repeating the dedispersion computation. Formally, the following sequence of operations can be used to compute an arbitrary range of DMs:

Apply the tree algorithm to obtain DMs from zero to DM.

Impose a time delay across the band.

Apply the tree algorithm to obtain DMs from DM to DM.

Increment the imposed time delay.

Repeat from step (ii) to obtain DMs up to DM.
The imposed time delay is initially a simple diagonal across the band (i.e., ), and is implemented by incrementing a memory stride value rather than actually shifting memory. While this method enables dedispersion up to arbitrarily large DMs, it does not alter the spacing of DM trials, which remains fixed as per equation (9).
The ‘timescrunching’ technique, discussed in Section 2.1 for the direct algorithm, can also be applied to the tree algorithm. The procedure is as follows:

Apply the tree algorithm to obtain DMs from zero to DM.

Impose a time delay across the band.

Apply the tree algorithm to obtain DMs from DM to DM.

Compress (‘scrunch’) time by a factor of 2 by summing adjacent samples.

Impose a time delay across the band.

Apply the tree algorithm to obtain DMs from DM to DM.

Repeat from step (iv) to obtain DMs up to DM.
As with the direct algorithm, the use of timescrunching provides a performance benefit at the cost of a minor reduction in the signaltonoise ratio for pulses of intrinsic width near the dispersion measure smearing time.
3.2 Algorithm analysis
The tree dedispersion algorithm’s computational complexity of breaks down into sequential steps, with each step involving the computation of independent new values, as seen in equations (10) to (13). Following the analysis methodology of Barsdell, Barnes & Fluke (2010), the algorithm therefore has a depth complexity of , meaning it contains this many sequentiallydependent operations. Interestingly, this result matches that of the direct algorithm, although the tree algorithm requires significantly less total work. From a theoretical perspective, this implies that the tree algorithm contains less inherent parallelism than the direct algorithm. In practice, however, the number of processors will be small relative to the size of the problem (), and this reduced inherent parallelism is unlikely to be a concern for performance except when processing very small datasets.
Branching (i.e., conditional statements) within an algorithm can have a significant effect on performance when targetting GPUlike hardware (Barsdell, Barnes & Fluke, 2010). Fortunately, the tree algorithm is inherently branchfree, with all operations involving only memory accesses and arithmetic operations. This issue is therefore of no concern in this instance.
The arithmetic intensity of the tree algorithm is determined from the ratio of arithmetic operations to memory operations. To process samples, the algorithm involves ‘delay and add’ operations, and produces samples of output. In contrast to the direct algorithm, where the theoretical arithmetic intensity was proportional to the number of DMs computed, the tree algorithm requires only operations per sample. This suggests that the tree algorithm may be unable to exploit GPUlike hardware as efficiently as the direct algorithm. However, the exact arithmetic intensity will depend on constant factors and additional arithmetic overheads, and will only become apparent once the algorithm has been implemented. We defer discussion of these results to Section 3.3.
Achieving peak arithmetic intensity requires reading input data from ‘slow memory’ into ‘fast memory’ (e.g., from disk into main memory, from main memory into cache, from host memory into GPU memory etc.) only once, before performing all computations within fast memory and writing the results, again just once, back to slow memory. In the tree dedispersion algorithm, this means performing all steps entirely within fast memory. The feasibility of this will depend on implementation details, the discussion of which we defer to Section 3.3. However, it will be useful to assume that some subset of the total computation will fit within this model. We will therefore continue the analysis of the tree algorithm under the assumption that we are computing only a (poweroftwo) subset, or block, of channels.
The memory access patterns within the tree algorithm resemble those of the direct algorithm (see Section 2.2). Time samples are always accessed contiguously, with an offset that is essentially arbitrary. In the frequency dimension, memory is accessed according to the shuffle function [equation (15)] depicted in Fig. 2, where at any given step of the algorithm the frequency channels ‘interact’ in pairs, the interaction involving their addition with different time delays.
With respect to the goal of achieving peak arithmetic intensity, the key issue for the memory access patterns within the tree algorithm is the extent to which they remain ‘compact’. This is important because it determines the ability to operate on isolated blocks of data independently, which is critical to prolonging the time between successive trips to slow memory. In the frequency dimension, the computation of some local (poweroftwo) subset of channels involves accessing only other channels within the same subset. In this sense we can say that the memory access patterns are ‘locally compact’ in channels. In the time dimension, however, we note that the algorithm applies compounding delays (equivalent to offsets in whole time samples). This means that the memory access patterns ‘leak’ forward, with any local group of time samples always requiring access to the next group. The amount by which the necessary delays ‘leak’ in time for each channel is given by the integrated delay in that channel after steps (see Fig. 2). The total integrated delay across channels is , which is the number of additional values that must be read into fast memory by the block in order to compute all steps without needing to return to global memory and apply a global synchronisation.
3.3 Implementation Notes
As with the direct algorithm, we implemented the tree algorithm on a GPU in C for CUDA. For our first attempt, we took a simple approach where each of the steps in the computation was performed by a separate call to a GPU function (or kernel). This approach is not ideal, as it is preferable to perform more computation on the device before returning to the host (as per the discussion of arithmetic intensity in Section 3.2), but was necessary in order to guarantee global synchronisation across threads on the GPU between steps. This is a result of the lack of global synchronisation mechanisms on current GPUs.
Between steps, the integer delay and shuffle functions [equations (14) and (15)] were evaluated on the host and stored in lookup tables. These were then copied to constant memory on the device prior to executing the kernel function to compute the step. The use of constant memory ensured retrieval of these values would not be a bottleneck to performance during the computation of each step of the tree algorithm.
The problem was divided between threads on the GPU by allocating one thread for every time sample and every pair of frequency channels. This meant that each thread would compute the delayed sums between two ‘interacting’ channels according to the pattern depicted in Fig. 2 for the current step.
The tree algorithm’s iterative updating behaviour requires that computations at each step be performed ‘outofplace’; i.e., output must be written to a memory space separate from that of the input to avoid modifying input values before they have been used. We achieved this effect by using a doublebuffering scheme, where input and output arrays are swapped after each step.
While the algorithms differ significantly in their details, one point of consistency between the direct and tree methods is the need to apply time delays to the input data. Therefore, just as with our implementation of the direct algorithm, the tree algorithm requires accessing memory locations that are not aligned with internal memory boundaries. As such, we took the same approach as before and mapped the input data to the GPU’s texture memory before launching the device kernel. As noted in Section 2.3, this procedure is unnecessary on Fermigeneration GPUs, as their builtin caches provide the same behaviour automatically.
After successfully implementing the tree algorithm on a GPU using a simple onestepperGPUcall approach, we investigated the possibility of computing multiple steps of the algorithm on the GPU before returning to the CPU for synchronisation. This is possible because current GPUs, while lacking support for global thread synchronisation, do support synchronisation across local thread groups (or blocks). These thread blocks typically contain threads, and provide mechanisms for synchronisation and datasharing, both of which are required for a more efficient tree dedispersion implementation.
As discussed in Section 3.2, application of the tree algorithm to a block of channels time samples requires caching additional values from the next block in time. We used blocks of threads, each loading both their corresponding data value and required additional values into shared cache. Once all values have been stored, computation of the steps proceeds entirely within the shared cache. Using larger thread blocks would allow more steps to be completed within the cache; however, the choice is constrained by the available volume of shared memory (typically around 48kB). Once the block computation is complete, subsequent steps must be computed using the onestepperGPUcall approach described earlier, due to the requirement of global synchronisations.
While theory suggests that an implementation of the tree algorithm exploiting shared memory to perform multiple steps in cache would provide a performance benefit over a simpler implementation, in practice we were unable to achieve a net gain using this approach. The limitations on block size imposed by the volume of shared memory, the need to load additional data into cache and the logarithmic scaling of steps relative to data size significantly reduce the potential speedup, and overheads from increased codecomplexity quickly erode what remains. For this reason we reverted to the straightforward implementation of the tree algorithm as our final code for testing and benchmarking.
In addition to the base tree algorithm, we also implemented the subband method so as to allow the computation of arbitrary dispersion measures. This was achieved by dividing the computation into two stages. In stage 1, the first steps of the tree algorithm are applied to the input data, which produces the desired treededispersed subbands. Stage 2 then involves applying an algorithm to combine the dedispersed time series in different subbands into approximated quadratic dispersion curves according to equation (22). Stage 2 was implemented on the GPU in much the same way as the direct algorithm, with input data mapped to texture memory (on preFermi GPUs) and delays stored in lookup tables in constant device memory.
The frequency padding approach described in Section 3.1.2 was implemented by constructing an array large enough to hold the stretched frequency coordinates, initialising its elements to zero, and then copying (or scattering) the input data into this array according to equation (25). The results of this procedure were then fed to the basic tree dedispersion code to produce the final set of dedispersed time series.
Because the tree algorithm involves sequentially updating the entire dataset, the data must remain in their final format for the duration of the computation. This means that low bitrate data, e.g., 2bit, must be unpacked (in a preprocessing step) into a format that will not overflow during accumulation. This is in contrast to the direct algorithm, where each sum is independent, and can be stored locally to each thread.
4 Subband dedispersion
4.1 Introduction
Subband dedispersion is the name given to another technique used to compute the dedispersion transform. Like the tree algorithm described in Section 3, the subband algorithm attempts to reduce the cost of the computation relative to the direct method; however, rather than exploiting a regularisation of the dedispersion algorithm, the subband method takes a simple approximation approach.
In its simplest form, the algorithm involves two processing steps. In the first, the set of trial DMs is approximated by a reduced set of ‘nominal’ DMs, each separated by trial dispersion measures. The direct dedispersion algorithm is applied to subbands of channels to compute a dedispersed time series for each nominal DM and subband. In the second step, the DM trials near each nominal value are computed by applying the direct algorithm to the ‘miniature filterbanks’ formed by the time series for the subbands at each nominal DM. These data have a reduced frequency resolution of channels across the band. The two steps thus operate at reduced dispersion measure and frequency resolution respectively, resulting in an overall reduction in the computational cost.
The subband algorithm is implemented in the presto software suite (Ransom, 2001) and was recently implemented on a GPU by Magro et al. (2011) (see Section 6.1 for a comparison with their work). Unlike the tree algorithm, the subband method is able to compute the dedispersion transform with the same flexibility as the direct method, making its application to real observational data significantly simpler.
4.2 Algorithm analysis
The computational complexity of the subband dedispersion algorithm can be computed by summing that of the two steps:
(27)  
(28)  
(29)  
(30) 
This result can be combined with knowledge of the smearing introduced by the algorithm to probe the relationship between accuracy and performance. Inserting the smearing constraint (see Section 4.1) into equation (30), we obtain a secondorder expression that is minimised at , which amounts to balancing the execution time between the two steps. This result optimises the time complexity of the algorithm, which then takes the simple form
(31) 
and represents a theoretical speedup over the direct algorithm proportional to the square root of the introduced smearing.
The subband algorithm’s dependence on the direct algorithm means that it inherits similar algorithmic behaviour. However, as with the tree method, the decrease in computational work afforded by the subband approach corresponds to a decrease in the arithmetic intensity of the algorithm. This can be expected to reduce the intrinsic performance of the two subband steps relative to the direct algorithm.
One further consideration for the subband algorithm is the additional volume of memory required to store the intermediate results produced by the first step. These data consist of time series for each subband and nominal DM, giving a space complexity of
(32) 
Assuming the time complexity is optimised as in equation (31), the space complexity becomes
(33) 
which indicates that the memory consumption increases much faster than the execution time, in direct proportion to the introduced smearing rather than to its square root. This can be expected to place a lower limit on the smearing that can be achieved in practice.
4.3 Implementation notes
A significant advantage of the subband algorithm over the tree algorithm is that it involves little more than repeated execution of the direct algorithm. With sufficient generalisation^{5}^{5}5The direct dedispersion routine was modified to support ‘batching’ (simultaneous application to several adjacent datasets) and arbitrary strides through the input and output arrays, trial DMs and channels. of our implementation of the direct algorithm, we were able to implement the subband method with just two consecutive calls to the direct dedispersion routine and the addition of a temporary data buffer.
In our implementation, the ‘intermediate’ data (i.e., the outputs of the first step) are stored in the temporary buffer using 32 bits per sample. The second call to the dedispersion routine then reads these values directly before writing the final output using a desired number of bits per sample.
Experimentation showed that optimal performance occurred at a slightly different shape and size of the thread blocks on the GPU compared to the direct algorithm (see Section 2.2). The subband kernels operated most efficiently with 128 threads per block divided into 16 time samples and 8 DMs. In addition, the optimal choice of the ratio was found to be close to unity, which matches the theoretical result derived in Section 4.2. While these parameters minimised the execution time, the subband kernels were still found to perform around 40% less efficiently than the direct kernel. This result is likely due to the reduced arithmetic intensity of the algorithm (see Section 4.2).
5 Results
5.1 Smearing
Our analytic upperbounds on the increase in smearing due to use of the piecewise linear tree algorithm [equation (39)] and the subband algorithm [equation (42)] are plotted in the upper panels of Figs. 3 and 4 respectively. The reference point [ in equations (39) and (42)] was calculated using equations for the smearing during the direct dedispersion process^{6}^{6}6Levin, L. 2011, priv. comm. assuming an intrinsic pulse width of 40s.
For the piecewise linear tree algorithm, the effective signal smearing at low dispersion measure is dominated by the intrinsic pulse width, the sampling time and the effect of finite DM sampling. As the DM is increased, however, the effects of finite channel width and the subband technique grow, and eventually become dominant. These smearing terms both scale linearly with the dispersion measure, and so the relative contribution of the subband method, , tends to a constant.
The subband algorithm exhibits virtually constant smearing as a function of DM due to its dependence on the DM step, which is itself chosen to maintain a fixed fractional smearing. While the general trend mirrors that of the tree algorithm, the subband algorithm’s smearing is typically around two orders of magnitude worse than its tree counterpart.
5.2 Performance
Our codes as implemented allowed us to directly compute the following:
A number of timing benchmarks were run to compare the performance of the CPU to the GPU and the direct algorithm to the tree algorithms. Input and dispersion parameters were chosen to reflect a typical scenario as appears in modern pulsar surveys such as the High Time Resolution Universe (HTRU) survey currently underway at the Parkes radio telescope (Keith et al., 2010). The benchmarks involved computing the dedispersion transform of one minute of input data with observing parameters of , , , , . DM trials were chosen to match those used in the HTRU survey, which were originally derived by applying an analytic constraint on the signalsmearing due to incorrect trial DM^{7}^{7}7Levin, L. 2011, priv. comm.; see Cordes & McLaughlin 2003 for a similar derivation.. The chosen set contained 1196 trial DMs in the range 0 DM 1000 pc cm with approximately exponential spacing.
For comparison purposes, we benchmarked a reference CPU direct dedispersion code in addition to our GPU codes. The CPU code (named dedisperse_all) is highly optimised, and uses multiple CPU cores to compute the dedispersion transform (parallelised over the time dimension) in addition to bitlevel parallelism as described in Section 2.3. dedisperse_all is approximately 60 more efficient than the generic dedisperse routine from sigproc^{8}^{8}8sigproc.sourceforge.net, but is only applicable to a limited subset of data formats.
At the time of writing, our dedispersion codebase did not include ‘fullcapability’ implementations of all of the discussed algorithms. However, we were able to perform a number of benchmarks that were sufficient to obtain accurate estimates of the performance of complete runs. Timing measurements for our codes were projected to produce a number of derived results representative of the complete benchmark task. The direct/subband dedispersion code was able to compute the complete list of desired DMs, but was not able to exploit timescrunching; results for these algorithms with time scrunching were calculated by assuming that the computation of DMs between 2 and 4 the diagonal would proceed twice as fast as the computation up to 2 the diagonal (as a result of there being half as many time samples), and similarly for 4 to 8 etc. up to the maximum desired DM. A simple code to perform the timescrunching operation (i.e., adding adjacent time samples to reduce the time resolution by a factor of two) was also benchmarked and factored into the projection. For the tree codes, which were unable to compute DMs beyond the diagonal, timing results were projected by scaling as appropriate for the computation of the full set of desired DMs with or without timescrunching. Individual sections of code were timed separately to allow for different scaling behaviours.
Benchmarks were run on a variety of hardware configurations. CPU benchmarks were run on an Intel i7 930 quadcore CPU (Hyperthreading enabled). GPU benchmarks were run using version 3.2 of the CUDA toolkit on the preFermi generation NVIDIA Tesla C1060 and the Fermi generation NVIDIA Tesla C2050 (errorcorrecting memory disabled) and GeForce GTX 480 GPUs. Hardware specifications of the GPUs’ host machines varied, but were not considered to significantly impact performance measurements other than the copies between host and GPU memory. Benchmarks for these copy operations were averaged across the different machines.
Our derived performance results for the direct and piecewise linear tree codes are plotted in the lower panels of Fig. 3. The performance of the frequencypadded tree code corresponded to almost exactly half that of the piecewise linear tree code at a subband size of ; these results were omitted from the plot for clarity.
Performance results for the subband dedispersion code are plotted in the lower panels of Fig. 4 along with the results of the direct code for comparison. Due to limits on memory use (see Section 4.2), benchmarks for were not possible.
Performance was measured by inserting calls to the Unix function gettimeofday() before and after relevant sections of code. Calls to the CUDA function cudaThreadSynchronize() were inserted where necessary to ensure that asynchronous GPU functions had completed their execution prior to recording the time.
Several different sections of code were timed independently. These included pre and postprocessing steps (e.g., unpacking, transposing, scaling) and copies between host and GPU memory (in both directions), as well as the dedispersion kernels themselves. Disk I/O and portions of code whose execution time does not scale with the size of the input were not timed (see Section 6 for a discussion of the impact of disk I/O). Timing results represent the total execution time of all timed sections, including memory copies between the host and the device in the case of the GPU codes.
6 Discussion
The lower panel of Fig. 3(a) shows a number of interesting performance trends. As expected, the slowest computation speeds come from the direct dedispersion code running on the CPU. Here, some scaling is achieved via the use of multiple cores, but the speedup is limited to around 2.5 when using all four. This is likely due to saturation of the available memory bandwidth.
Looking at the corresponding results on a GPU, a large performance advantage is clear. The GTX 480 achieves a 9 speedup over the quadcore CPU, and even the lastgeneration Tesla C1060 manages a factor of 5. The fact that a single GPU is able to compute the dedispersion transform in less than a third of the real observation time makes it an attractive option for realtime detection pipelines.
A further performance boost is seen in the transition to the tree algorithm. Computation speed is projected to exceed that of the direct code for almost all choices of , peaking at around 3 at . Performance is seen to scale approximately linearly for , before peaking and then decreasing very slowly for . This behaviour is explained by the relative contributions of the two stages of the computation. For small , the second, ‘subband combination’, stage dominates the total execution time [scaling as ]. At large the execution time of the second stage becomes small relative to the first, and scaling follows that of the basic tree algorithm [i.e., ].
The results of the subband algorithm in Fig. 4(a) also show a significant performance advantage over the direct algorithm. The computable benchmarks start at =16 with around the same performance as the tree code. From there, performance rapidly increases as the size of the subbands is increased, eventually tailing off around =256 with a speedup of approximately 20 over the direct code. At such high speeds, the time spent in the GPU kernel is less than the time spent transferring the data into and out of the GPU. The significance of this effect for each of the three algorithms is given in Table 1.
Code  Copy Time  Fraction of total time 

Direct  0.62 s  
Tree  1.05 s  
Subband  0.62 s  10% – 65% 
The results discussed so far have assumed the use of the timescrunching technique during the dedispersion computation. If timescrunching is not used, the projected performance results change significantly [see lower panels Figs. 3(b) and 4(b)]. Without the use of timescrunching, the direct dedispersion codes perform around 1.6 slower, and similar results are seen for the subband code. The tree codes, however, are much more severely affected, and perform 5 slower when timescrunching is not employed. This striking result can be explained by the inflexibilities of the tree algorithm discussed in Section 3.1. At large dispersion measure, the direct algorithm allows one to sample DM space very thinly. The tree algorithms, however, do not – they will always compute DM trials at a fixed spacing [see equation (9)]. This means that the tree algorithms are effectively overcomputing the problem, which leads to the erosion of their original theoretical performance advantage. The use of timescrunching emulates the thin DMspace sampling of the direct code, and allows the tree codes to maintain an advantage.
While the piecewise linear tree code and the subband code are seen to provide significant speedups over the direct code, their performance leads come at the cost of introducing additional smearing into the dedispersed signal. Our analytic results for the magnitude of the smearing due to the tree code (upper panels Fig. 3) show that for the chosen observing parameters, the total smear is expected to increase by less than 10% for all at a DM of 1000 pc cm. Given that peak performance of the tree code also corresponded to , we conclude that this is the optimal choice of subband size for such observations.
The smearing introduced by the subband code (upper panels Fig. 4) is significantly worse, increasing the signal degradation by three orders of magnitude more than the tree code. Here, the total smear is expected to increase by around 40% at =16, and at =32 the increase in smearing reaches 300%. While these results are upper limits, it is unlikely that subband sizes of more than =32 will produce acceptable results in practical scenarios.
In contrast to the piecewise linear code, the frequencypadded tree code showed only a modest speedup of around 1.5 over the direct approach due to its doubling of the number of frequency channels. Given that the subband algorithm has a minimal impact on signal quality, we conclude that the frequencypadding technique is an inferior option.
It is also important to consider the development cost of the algorithms we have discussed. While the tree code has shown both high performance and accuracy, it is also considerably more complex than the other algorithms. The tree algorithm in its base form, as discussed in Section 3.1, is much less intuitive than the direct algorithm (e.g., the memory access patterns in Fig. 2). This fact alone makes implementation more difficult. The situation gets significantly worse when one must adapt the tree algorithm to work in practical scenarios, with quadratic dispersion curves and arbitrary DM trials. Here, the algorithm’s inflexibility makes implementation a daunting task. We note that our own implementations are as yet incomplete. By comparison, implementation of the direct code is relatively straightforward, and the subband code requires only minimal changes. Development time must play a role in any decision to use one dedispersion algorithm over another.
The three algorithms we have discussed each show relative strengths and weaknesses. The direct algorithm makes for a relatively straightforward move to the GPU architecture with no concerns regarding accuracy, and offers a speedup of up to 10 over an efficient CPU code. However, its performance is convincingly beaten by the tree and subband methods. The tree method is able to provide significantly better performance with only a minimal loss of signal quality; however, it comes with a high cost of development that may outweigh its advantages. Finally, the subband method combines excellent performance with an easy implementation, but is let down by the substantial smearing it introduces into the dedispersed signal. The optimal choice of algorithm will therefore depend on which factors are most important to a particular project. While there is no clear best choice among the three different algorithms, we emphasize that between the two hardware architectures the GPU clearly outperforms the CPU.
When comparing the use of a GPU to a CPU, it is interesting to note that our final GPU implementation of the direct dedispersion algorithm on a Fermiclass device is, relatively speaking, a simple code. While it was necessary in both the preFermi GPU and multicore CPU implementations to use nontrivial optimisation techniques (e.g., texture memory, bitpacking etc.), the optimal implementation on currentgeneration, Fermi, GPU hardware was also the simplest or ‘obvious’ implementation. This demonstrates how far the (now rather misnamed) graphics processing unit has come in its ability to act as a generalpurpose processor.
In addition to the performance advantage offered by GPUs today, we expect our implementations of the dedispersion problem to scale well to future architectures with little to no code modification. The introduction of the current generation of GPU hardware brought with it both a significant performance increase and an equally significant reduction in programming complexity. We expect these trends to continue when the next generation of GPUs is released, and see a promising future for these architectures and the applications that make use of them.
While we have only discussed singleGPU implementations of dedispersion, it would in theory be a simple matter to make use of multiple GPUs, e.g., via timedivision multiplexing of the input data or allocation of a subset of beams to each GPU. As long as the total execution time is dominated by the GPU dedispersion kernel, the effects of multiple GPUs within a machine sharing resources such as CPU cycles and PCIExpress bandwidth are expected to be negligible. However, as shown in Table 1, the tree and subband codes are in some circumstances so efficient that hostdevice memory copy times become a significant fraction of the total run time. In these situations, the use of multiple GPUs within a single host machine may influence the overall performance due to reduced PCIExpress bandwidth.
Disk I/O is another factor that can contribute to the total execution time of a dedispersion process. Typical serverclass machines have disk read/write speeds of only around 100 MB/s, while our GPU dedispersion codes are capable of producing 8bit time series at well over twice this rate. If dedispersion is performed in an offline fashion, where time series are read from and written to disk before and after dedispersion, then it is likely that disk performance will become the bottleneck. The use of multiple GPUs within a machine may exacerbate this effect. However, for realtime processing pipelines where data are kept in memory between operations, the dedispersion kernel can be expected to dominate the execution time. This is particularly important for transient search pipelines, where acceleration searching is not necessary and dedispersion is typically the most timeconsuming operation.
The potential impact of limited PCIExpress bandwidth or disk I/O performance highlights the need to remember Amdahl’s Law when considering further speedups in the dedispersion codes: the achievable speedup is limited by the largest bottleneck. The tree and subband codes are already on the verge of being dominated by the hostdevice memory copies, meaning that further optimisation of their kernels will provide diminishing returns. While disk and memory bandwidths will nodoubt continue to increase in the future, we expect the ratio of arithmetic performance to memory performance to get worse rather than better.
The application of GPUs to the problem of dedispersion has produced speedups of an order of magnitude. The implications of this result for current and future surveys are significant. Current projects often execute pulsar and transient search pipelines offline due to limited computational resources. This results in event detections being made long after the time of the events themselves, limiting analysis and confirmation power to what can be gleamed from archived data alone. A realtime detection pipeline, made possible by a GPUpowered dedispersion code, could instead trigger systems to record invaluable baseband data during significant events, or alert other observatories to perform followup observations over a range of wavelengths. Realtime detection capabilities will also be crucial for nextgeneration telescopes such as the Square Kilometre Array pathfinder programs ASKAP and MeerKAT. The use of GPUs promises significant reductions in the setup and running costs of realtime pulsar and transient processing pipelines, and could be the enabling factor in the construction of everlarger systems in the future.
6.1 Comparison with other work
Magro et al. (2011) recently reported on a GPU code that could achieve very high () speedups over the dedispersion routines in sigproc and presto (Ransom, 2001) whereas our work only finds improvements of factors of 10–30 over dedisperse_all. There are two key reasons for the apparent discrepancy in speed. Firstly, the sigproc routine was never written to optimise performance but rather to produce reliable dedispersed data streams from a very large number of different backends. Inspection of the innermost loop reveals a conditional test that prohibits parallelisation, and a two dimensional array that is computationally expensive. Secondly, sigproc only produces one DM per file read, which is very inefficient. We believe that these factors explain the very large speedups reported by Magro et al. In our own benchmarks, we have found our CPU comparison code dedisperse_all to be faster than sigproc. For comparison, this puts our direct GPU code at faster than sigproc when using the same Tesla C1060 model GPU as Magro et al.
Direct comparison of our GPU results with those of Magro et al. is difficult, as the details of the CPU code, the method of counting FLOP/s and the observing parameters used in their performance plots is not clear. However, we have benchmarked our GPU code on the ‘toy observation’ presented in section 5 of their paper. The execution times are compared in Table 2. Magro et al. did not specify the number of bits per sample used in their benchmark; we chose to use 8 bits/sample, but found no significant difference when using 32 bits/sample. We found our implementation of the direct dedispersion algorithm to be faster than that reported in their work. Possible factors contributing to this difference include our use of texture memory, twodimensional thread blocks and allocation of multiple samples per thread. The performance results of our implementation of the subband dedispersion algorithm generally agree with those of Magro et al., although the impact of the additional smearing is not quantified in their work.
Stage  Magro et al. (2011)  This work  Ratio 

Corner turn  112 ms  7 ms  16 
Dedispersion  4500 ms  1959 ms  2.29 
GPUCPU copy  220 ms  144 ms  1.52 
Total  4832 ms  2110 ms  2.29 
In summary, we agree with Magro et al. that GPUs offer great promise in incoherent dedispersion. The benefit over that of CPUs is, however, closer to the ratio of their memory bandwidths () than the factor of 100 reported in their paper, which relied on comparison with a nonoptimised singlethreaded CPU code.
6.2 Code availability
We have packaged our GPU implementation of the direct incoherent dedispersion algorithm into a C library that we make available to the community^{9}^{9}9Our library and its source code are available at: http://dedisp.googlecode.com/. The application programming interface (API) was modeled on that of the FFTW library^{10}^{10}10http://www.fftw.org, which was found to be a convenient fit. The library requires the NVIDIA CUDA Toolkit, but places no requirements on the host application, allowing easy integration into existing C/C++/Fortran etc. codes. While the library currently uses the direct dedispersion algorithm, we may consider adding support for a tree or subband algorithm in future.
7 Conclusions
We have analysed the direct, tree and subband dedispersion algorithms and found all three to be good matches for massivelyparallel computing architectures such as GPUs. Implementations of the three algorithms were written for the current and previous generations of GPU hardware, with the more recent devices providing benefits in terms of both performance and ease of development. Timing results showed a 9 speedup over a multicore CPU when executing the direct dedispersion algorithm on a GPU. Using the tree algorithm with a piecewise linear approximation technique results in some additional smearing of the input signal, but was projected to provide a further 3 speedup at a very modest level of signalloss. The subband method provides a means of obtaining even greater speedups, but imposes significant additional smearing on the dedispersed signal. These results have significant implications for current and future radio pulsar and transient surveys, and promise to dramatically lower the cost barrier to the deployment of realtime detection pipelines.
Acknowledgments
We would like to thank Lina Levin and Willem Van Straten for very helpful discussions relating to pulsar searching, Mike Keith for valuable information regarding the tree dedispersion algorithm, and Paul Coster for help in testing our dedispersion code. We would also like to thank the referee Scott Ransom for his very helpful comments and suggestions for the paper.
Appendix A Error analysis for the tree dedispersion algorithm
Here we derive an expression for the maximum error introduced by the use of the piecewise linear tree dedispersion algorithm.
The deviation of a function from a linear approximation between and is bounded by
(34) 
which shows that the error is proportional to the square of the step size and the second derivative of the function. For the dedispersion problem, the second derivative of the delay function with respect to frequency is given by
(35)  
(36) 
which has greater magnitude at lower frequencies. Evaluating at the lowest frequency in the band, , and substituting into equation (34) along with the subband size , one finds the error to be bounded by:
(37) 
where is a proxy for the fractional bandwidth, a measure of the width of the antenna band.
If the smearing as a result of using the direct algorithm is quantified as the effective width, , of an observed pulse, then the piecewise linear tree algorithm is expected to produce a signal with an effective width of
(38) 
giving a relative smearing of
(39) 
In contrast to the use of a piecewise linear approximation, the use of a change of frequency coordinates (‘frequency padding’) to linearise the dispersion trails results in no additional sources of smear.
Appendix B Error analysis for the subband dedispersion algorithm
Here we derive an expression for the maximum error introduced by the use of the subband dedispersion algorithm.
The smearing introduced into a dedispersed time series due to an approximation to the dispersion curve is bounded by the maximum temporal deviation of the approximation from the exact curve. The maximum change in delay across a subband is ; the difference in this value between two nominal DMs then gives the smearing time:
(40)  
(41) 
where the second form is obtained through Taylor expansion in powers of around zero. Note that this derivation assumes the dispersion curve is approximated by aligning the ’early’ edge of each subband. An alternative approach is to centre the subbands on the curve, which reduces the smearing by but adds complexity to the implementation.
As with the tree algorithm, we can define the relative smearing of the subband algorithm with respect to the direct algorithm as
(42) 
where, as before, is the effective width of an observed pulse after direct dedispersion.
References
 Barsdell, Barnes & Fluke (2010) Barsdell B. R., Barnes D. G., Fluke C. J., 2010, M.N.R.A.S., 408, 1936
 Bhattacharya & van den Heuvel (1991) Bhattacharya D., van den Heuvel E. P. J., 1991, Physics Reports, 203, 1
 Cordes & McLaughlin (2003) Cordes J. M., McLaughlin M. A., 2003, ApJ, 596, 1142
 Dodson et al. (2010) Dodson R., Harris C., Pal S., Wayth R., 2010, in ISKAF2010 Science Meeting
 Fluke et al. (2011) Fluke C. J., Barnes D. G., Barsdell B. R., Hassan A. H., 2011, Pub. Astron. Soc. Australia, 28, 15
 Gaensler et al. (2008) Gaensler B. M., Madsen G. J., Chatterjee S., Mao S. A., 2008, Pub. Astron. Soc. Australia, 25, 184
 Keane et al. (2011) Keane E. F., Kramer M., Lyne A. G., Stappers B. W., McLaughlin M. A., 2011, M.N.R.A.S., 838
 Keith et al. (2010) Keith M. J. et al., 2010, M.N.R.A.S., 409, 619
 Lattimer & Prakash (2004) Lattimer J. M., Prakash M., 2004, Science, 304, 536
 Lorimer et al. (2007) Lorimer D. R., Bailes M., McLaughlin M. A., Narkevic D. J., Crawford F., 2007, Science, 318, 777
 Lyne et al. (2004) Lyne A. G. et al., 2004, Science, 303, 1153
 Magro et al. (2011) Magro A., Karastergiou A., Salvini S., Mort B., Dulwich F., Zarb Adami K., 2011, ArXiv eprints 1107.2516
 Manchester et al. (2001) Manchester R. et al., 2001, M.N.R.A.S., 328, 17
 Manchester et al. (1996) —, 1996, M.N.R.A.S., 279, 1235
 McLaughlin et al. (2006) McLaughlin M. A. et al., 2006, Nature, 439, 817
 Ransom (2001) Ransom S. M., 2001, PhD thesis, Harvard University
 Taylor (1974) Taylor J. H., 1974, A&AS, 15, 367