Parallel Algorithms for Summing Floating-Point Numbers
Abstract
The problem of exactly summing floating-point numbers is a fundamental one, with many applications in large-scale simulations and computational geometry. Unfortunately, due to the round-off error in standard floating-point operations, this problem becomes very challenging. Moreover, all existing solutions rely on sequential algorithms, which cannot scale to the huge datasets that need to be processed.
In this paper, we provide several efficient parallel algorithms for summing floating-point numbers, so as to produce a faithfully rounded floating-point representation of the sum. We present algorithms for the PRAM, external-memory, and MapReduce models, and we also provide an experimental analysis of our MapReduce algorithms, due to their simplicity and practical efficiency.
1 Introduction
Floating-point numbers are a common data type for representing values that can vary widely. A (base-2) floating-point representation of a number, x, is a sign bit, s, and a pair of integers, (m, e), such that
x = (−1)^s · m · 2^e,
where t is the number of bits allocated for m and h is the number of bits allocated for e. (We take the viewpoint in this paper that floating-point numbers use a base-2 representation; nevertheless, our algorithms can easily be modified to work with other standard floating-point bases, such as 10.) The value, m, is called the mantissa or significand and the value e is called the exponent. For more background information on floating-point numbers, please see, e.g., the excellent survey by Goldberg [15].
In fixed-precision representations, specific values for t and h are set according to the machine architecture or established standards. For example, in the IEEE 754 standard, single-precision floating-point numbers use t = 24 and h = 8, double-precision floating-point numbers use t = 53 and h = 11, and quad-precision floating-point numbers use t = 113 and h = 15.
In arbitrary-precision representations, the number of bits, t, for the mantissa is allowed to vary, either to be as large as a machine's memory size or to be an arbitrary value set by a user. The number, h, of bits for the exponent in such representations is typically set to be a single memory word, since a single memory word is usually sufficient to store any memory address. Examples of systems for processing arbitrary-precision floating-point representations include Apfloat for Java [35], the GNU Multiple Precision (GMP) Arithmetic Library [14], the bigfloat type in LEDA [4], and the GNU Multiple-Precision Floating-Point (MPFR) Library [19, 12].
In either fixed-precision or arbitrary-precision representations, we do not assume in this paper that floating-point numbers are normalized so that the most significant bit of the mantissa is 1.
Given a set, X = {x_1, x_2, ..., x_n}, of n floating-point numbers, we are interested in the design and analysis of parallel algorithms for computing a floating-point number, s, that best approximates the sum
S = x_1 + x_2 + ⋯ + x_n.
Moreover, we are interested in methods that are not limited to a specific fixed-precision representation, such as IEEE 754 double-precision. In particular, the specific problem we are interested in is to compute a floating-point number, s, that is a faithfully rounded representation of S, where we consider the value s to be faithfully rounded as long as s is either the largest floating-point number less than or equal to S or the smallest floating-point number greater than or equal to S. For other ways of rounding floating-point numbers, please see the IEEE 754 standard.
Although the problem of accurately summing n floating-point numbers might at first seem simple, it is surprisingly challenging, due to the round-off error inherent in floating-point arithmetic. Standard floating-point addition, which we denote as "⊕", is not exact, so that for two floating-point numbers, x and y,
x ⊕ y = (x + y)(1 + δ_{x,y}),
where δ_{x,y} is a summand-specific round-off error whose magnitude is bounded by a machine-specific worst-case error term, ε [25, 23].
As is well known, a standard way to sum n numbers, given the set X, is to place the numbers from X in the leaves of a binary summation tree, T, where each internal node, v, is associated with the sum of the values in v's two children, which in this case would be the ⊕ operation. Unfortunately, if the collection, X, of summands contains positive and negative numbers, it is NP-complete to find the summation tree, T, that utilizes only the ⊕ operation and minimizes the worst-case error for the sum [23]. Even for the case when all the numbers in X are positive, the optimal ⊕-based summation tree is a Huffman tree [23], which is not necessarily efficient to compute and evaluate in parallel. Thus, designing efficient parallel algorithms for accurately summing floating-point numbers is a challenge.
Because of these complexity issues and the uncertainty that comes from round-off errors, many recent floating-point summation algorithms are based on exactly computing the rounded sum of two floating-point numbers, x and y, and the exact value of its error, utilizing a function,
TwoSum(x, y) → (s, e_s),
where s = x ⊕ y and s + e_s = x + y, with s and e_s being floating-point numbers in the same precision as x and y. Example implementations of the TwoSum function include algorithms by Dekker [8] and Knuth [25], both of which utilize a constant number of floating-point operations. As with these recent approaches, in this paper we take the approach of summing the numbers in X exactly and then faithfully rounding this exact sum to an appropriate floating-point representation.
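For illustration, a standard branch-free version of such a function, following Knuth's method (the function name two_sum is ours), can be written and checked in a few lines of Python:

```python
from fractions import Fraction

def two_sum(x, y):
    """Knuth's TwoSum: return (s, e) with s = fl(x + y) and s + e = x + y exactly
    (barring overflow)."""
    s = x + y                  # the rounded sum
    xp = s - y                 # recover the part of s contributed by x
    yp = s - xp                # ... and the part contributed by y
    e = (x - xp) + (y - yp)    # the exact rounding error
    return s, e

s, e = two_sum(0.1, 0.2)
# verify exactness with arbitrary-precision rationals
assert Fraction(s) + Fraction(e) == Fraction(0.1) + Fraction(0.2)

s, e = two_sum(1.0, 1e-17)
assert s == 1.0 and e == 1e-17   # the tiny summand survives in the error term
```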
As mentioned above, because we desire precision-independent algorithms, which can work for either fixed-precision or arbitrary-precision representations, we do not take the perspective in this paper that floating-point numbers must be restricted to a specific floating-point representation, such as IEEE 754 double-precision. As is common in other precision-independent algorithms, however, we do assume that a floating-point representation is sufficiently precise so that the number, n, could itself be represented exactly as a sum of a constant number of floating-point numbers. (A similar assumption is common in integer-based RAM and PRAM algorithms. A PRAM is a synchronized parallel random-access machine model [21], where memory is shared between processors, so that memory accesses are exclusive-read/exclusive-write (EREW), concurrent-read/exclusive-write (CREW), or concurrent-read/concurrent-write (CRCW).) This assumption poses no restriction in practice, since it amounts to saying that a memory address can be stored in O(1) memory words. For example, even if all the computer storage on earth (estimated, as of the writing of this paper, to be less than a few zettabytes, that is, a few trillion gigabytes) were interpreted as IEEE 754 single-precision numbers and summed, we could represent the input size, n, as the (exact) sum of at most four single-precision floating-point numbers.
As a way of characterizing difficult problem instances, having the potential for significant amounts of cancellation, the condition number, C(X), for X, is defined as follows [39, 40, 34]:
C(X) = (|x_1| + |x_2| + ⋯ + |x_n|) / |S|,
which, of course, is defined only for nonzero sums. (We could alternatively define a condition number that adds a very small value to the denominator, to avoid division by zero.) Intuitively, problem instances with large condition numbers are more difficult than problem instances with condition numbers close to 1.
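As an illustration (our own sketch, with Python's math.fsum standing in for an exact summation), the condition number can be computed directly from its definition:

```python
import math

def condition_number(xs):
    """C(X) = (sum of |x_i|) / |sum of x_i|, defined only for a nonzero sum."""
    return math.fsum(abs(x) for x in xs) / abs(math.fsum(xs))

assert condition_number([1.0, 2.0, 3.0]) == 1.0      # no cancellation: an easy instance
assert condition_number([1e16, 1.0, -1e16]) == 2e16  # massive cancellation: a hard one
```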
Our approach for designing efficient parallel algorithms for summing floating-point numbers, even for difficult problem instances with large condition numbers, is to utilize an approach somewhat reminiscent of the classic Fast Fourier Transform (FFT) algorithm (e.g., see [3, 18]). Namely, to compute the sum of n floating-point numbers, we convert the numbers to an alternative representation, compute the sum of the numbers exactly in this alternative representation, and then convert the result back to a (faithfully-rounded) floating-point number. In our case, the important feature of our alternative representation is that it allows us to compute intermediate sums without propagating carry bits, which provides for superior parallelization.
1.1 Previous Related Results
The floating-point summation problem poses interesting challenges for parallel computation, because most existing published exact-summation algorithms are inherently sequential. For instance, we are not aware of any previous parallel methods for this problem that run in worst-case polylogarithmic time.
Neal [30] describes sequential algorithms that use a number representation he calls a superaccumulator to exactly sum floating-point numbers, which he then converts to a faithfully-rounded floating-point number. Unfortunately, while Neal's superaccumulator representation reduces carry-bit propagation, it does not eliminate it, as is needed for highly-parallel algorithms. A similar idea has been used in ExBLAS [6], an open-source library for exact floating-point computations. Shewchuk [33] describes an alternative representation for exactly representing intermediate results of floating-point arithmetic, but his method also does not eliminate carry-bit propagation in summations; hence, it also does not lead to efficient parallel algorithms. In addition to these methods, there are a number of inherently sequential methods for exactly summing floating-point numbers using various other data structures for representing intermediate results, including ExBLAS [6] and algorithms by Zhu and Hayes [39, 40], Demmel and Hida [9, 10], Rump et al. [34], Priest [32], and Malcolm [29]. Although the method of Demmel and Hida [10] can be carry-free for a limited number of summands, none of these sequential methods utilize a completely carry-free intermediate representation suitable for efficient parallelization. Nevertheless, Rump et al. [34] provide an interesting analysis showing that the running time of their sequential method can be estimated to depend on n, a logarithm of the condition number, and other factors. Also, Demmel and Hida [10] showed that summing the numbers in decreasing order by exponent yields a highly accurate answer; the answer does not, however, have to be faithfully rounded.
There has also been a limited amount of previous work on parallel algorithms for summing floating-point numbers. Leuprecht and Oberaigner [28] describe parallel algorithms for summing floating-point numbers, for instance, but their methods only parallelize data pipelining and do not translate into efficient parallel methods with polylogarithmic running times or efficient algorithms in the external-memory or MapReduce models. (The external-memory model [36] is a data-parallel model, where data is transferred between an internal memory of a specified size, M, and external memory in blocks of a specified size, B. A MapReduce algorithm is a coarse-grain parallel method that performs rounds that involve mapping elements to keys, assigning elements by their keys to "reducer" processors, and having those processors execute on sets of elements with the same key, possibly with a global sequential finalize phase at the end [7, 16, 24].) Indeed, their methods can involve a large number of passes over the data. Kadric et al. [22] provide a parallel pipelined method that takes a similar approach to the algorithm of Leuprecht and Oberaigner, while improving its convergence in practice, but their method nevertheless depends on inherently sequential pipelining and iterative-refinement operations. Recently, Demmel and Nguyen [11] presented a parallel floating-point summation method based on using a superaccumulator, but, like the previous sequential superaccumulator methods cited above, their method does not utilize a carry-free intermediate representation; hence, it has an inherently sequential carry-propagation step as a part of its "inner loop" computation.
To our knowledge, no previous algorithm for summing floating-point numbers utilizes a carry-free intermediate representation, although such representations are known for integer arithmetic (e.g., see [31]), the most notable being the redundant binary representation (RBR), which is a positional binary representation where each digit is from the set {−1, 0, 1}.
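To see why such redundancy helps, here is a sketch (ours, following the classic signed-digit literature rather than this paper) of RBR addition in which every carry moves exactly one position, so all positions could be processed independently in parallel:

```python
def rbr_value(d):
    """Value of an RBR digit list (least-significant first, digits in {-1, 0, 1})."""
    return sum(di << i for i, di in enumerate(d))

def rbr_add(a, b):
    """Carry-free RBR addition: every transfer digit moves exactly one position up."""
    n = max(len(a), len(b)) + 1
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    s = [a[i] + b[i] for i in range(n)]   # position sums, each in [-2, 2]
    t = [0] * (n + 1)                     # one-position transfer digits
    u = [0] * n                           # interim sums
    for i in range(n):
        # inspect the neighboring position's sign so u[i] + t[i] stays in {-1, 0, 1}
        prev_nonneg = (i == 0) or (s[i - 1] >= 0)
        if s[i] == 2:
            t[i + 1], u[i] = 1, 0
        elif s[i] == 1:
            t[i + 1], u[i] = (1, -1) if prev_nonneg else (0, 1)
        elif s[i] == -1:
            t[i + 1], u[i] = (0, -1) if prev_nonneg else (-1, 1)
        elif s[i] == -2:
            t[i + 1], u[i] = -1, 0
    return [u[i] + t[i] for i in range(n)] + [t[n]]

# exhaustive check over all short digit strings
digits = [-1, 0, 1]
lists = [[d0, d1, d2] for d0 in digits for d1 in digits for d2 in digits]
for a in lists:
    for b in lists:
        r = rbr_add(a, b)
        assert rbr_value(r) == rbr_value(a) + rbr_value(b)
        assert all(d in (-1, 0, 1) for d in r)
```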
1.2 Our Results
We show in this paper that one can compute a faithfully-rounded sum, s, of n floating-point numbers with the following performance bounds.

• s can be computed in O(log n) time using n processors in the EREW PRAM model. This is the first such PRAM algorithm and, as we show, it is worst-case optimal.

• s can be computed in polylogarithmic time in the EREW PRAM model using an amount of work that is sensitive to the condition number, C(X), of X. This is the first parallel summation algorithm whose running time is condition-number sensitive.

• s can be computed in external memory in O(sort(n)) I/Os, where "sort" is the I/O complexity of sorting. (Here, sort(n) = Θ((n/B) log_{M/B}(n/B)), where M is the size of internal memory and B is the block size.)

• s can be computed in external memory in O(scan(n)) I/Os, where scan(n) = Θ(n/B) is the I/O complexity of scanning (with B the block size), when the size of internal memory, M, is sufficiently large relative to d, the size of our intermediate representation for the sum of n floating-point numbers. This I/O performance bound is, of course, optimal for this case. Typically, d is small in practice, due to the concentration of exponent values in real-world data (e.g., see [20, 27]) or because fixed-precision floating-point numbers have relatively small values for h, the number of bits used for exponent values, compared to the mantissa length, t.

• s can be computed with a single-round MapReduce algorithm in polylogarithmic time and linear work, with high probability, assuming the intermediate-representation size, d, is suitably small.
In addition, because of the simplicity of our MapReduce algorithm, we provide an experimental analysis of several variations of this algorithm. We show that our MapReduce algorithm can achieve up to an 80× speedup over the state-of-the-art sequential algorithm, and that it scales linearly with both the input size and the number of cores in the cluster.
2 Our Number Representations
To represent the exact summation of floating-point numbers, we utilize a variety of alternative number representations. To begin, recall that a (base-2) floating-point number, x, is represented by a sign bit, s, and a pair of integers, (m, e), such that
x = (−1)^s · m · 2^e,
where t is the number of bits allocated for m and h is the number of bits allocated for e. For example, for double-precision floating-point numbers in the IEEE 754 standard, t = 53 and h = 11.
Thus, for the problem of summing the floating-point numbers in X, we could alternatively represent every floating-point number, including each partial sum, as a fixed-point binary number consisting of a sign bit and roughly 2^h + t bits, split to the left and right of the binary point according to the exponent range. Using such a representation, we would have no errors in summing the numbers in X, assuming there are no unnecessary overflows. In other words, we can consider such a fixed-point representation as a large binary integer representation, which then has its binary point shifted. Of course, such a representation would potentially waste a lot of memory, but it is nevertheless instructive to contemplate this as a possibility, as it motivates our other representations.
Although this fixed-point representation is potentially wasteful of memory, it might not actually be so bad for exactly representing partial sums of a reasonable number of summands of single-precision numbers. For example, in the IEEE 754 standard, single-precision 32-bit floating-point numbers can be represented as roughly 256-bit fixed-point numbers. Thus, the memory needed for an error-free fixed-point representation of a single-precision floating-point number would occupy about the same amount of space as 8 single-precision numbers. Nevertheless, an additional drawback of this representation is that, in the worst case, a lot of carry-bit propagation can occur during any addition, which negatively impacts parallel performance.
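As an illustration of this fixed-point idea (a sketch of ours, using Python's arbitrary-precision integers as the wide accumulator; the 1100-bit shift is our choice, wide enough for the full double-precision exponent range, including subnormals):

```python
def fixed_point_sum(xs):
    """Sum doubles exactly in a wide fixed-point accumulator, rounding once at the end."""
    SHIFT = 1100                         # fractional bits; the smallest double is 2**-1074
    acc = 0
    for x in xs:
        num, den = x.as_integer_ratio()  # x == num/den exactly; den is a power of two
        acc += (num << SHIFT) // den     # exact, since den divides num << SHIFT
    return acc / (1 << SHIFT)            # Python rounds this integer quotient correctly

assert fixed_point_sum([1e16, 1.0, -1e16]) == 1.0  # the naive left-to-right sum loses the 1.0
assert sum([1e16, 1.0, -1e16]) == 0.0
```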
In a superaccumulator representation, as used in some previous summation approaches (e.g., see [26, 30, 32, 33, 40]), we instead represent a fixed-point (or floating-point) number as a vector, A = (a_0, a_1, ..., a_{d−1}), of floating-point numbers, such that the number, a, represented by A is
a = a_0 + a_1 + ⋯ + a_{d−1},
where the numbers in A have strictly increasing exponents, so that a_0 is the least significant number (and this summation is a true addition, with no round-off errors).
As mentioned above, there are some interesting recent floating-point exact-summation papers based on the use of superaccumulators, but these are of limited value with respect to parallel algorithms. For instance, Zhu and Hayes [40] present an inherently sequential algorithm that essentially involves adding the numbers from X one at a time to an initially-zero superaccumulator and then propagating the carries to produce a faithfully-rounded sum from the most-significant entry. Neal [30] presents a similar algorithm and also discusses in passing an alternative approach based on a bottom-up parallel summation-tree computation using superaccumulators, but his parallel algorithm involves inefficiently constructing superaccumulators for every leaf, and his superaccumulators overlap by half their bits and even then are not carry-free.
In our case, we allow each a_i to be positive or negative, and we add some additional conditions so that superaccumulator summations are carry-free. We do this by extending the generalized signed digit (GSD) integer representation [31] to floating-point numbers. This is a redundant representation, so that there are multiple ways of representing the same fixed-point number.
To simplify our discussion, let us shift the binary point for A so that every a_i is an integer instead of a fixed-point number. This shift is made without loss of generality, since we can shift the binary point back to its proper place after we have computed an exact representation of the sum of the numbers in X.
Next, following the GSD approach [31], we say that a superaccumulator is regularized if
a_i = m_i · R^i,
for a given radix, R, and each mantissa, m_i, is an integer in the range [−(R−1), R−1], for i = 0, 1, ..., d−1. In particular, for our algorithms, for any fixed precision, we choose R to be a power of two, R = 2^b, so that each a_i can be represented using a floating-point number whose exponent stores a multiple of b (since we assume floating-point representations are base-2 in this paper). For arbitrary-precision floating-point numbers, we likewise choose R = 2^b, for a fixed value of b that is a reasonable length for the number of bits needed for a mantissa (proportional to our word size). In either case, we set b small enough, relative to the mantissa length, t, that each component's mantissa has room for a sign and one extra bit. This choice of the parameters, R and b, is done so that we can achieve propagation-free carrying of components of a summation, as we show next.
Our parallel algorithm for summing two superaccumulators, x and y, is as follows. First, we compute each componentwise sum of the mantissas, s_i = m_i(x) + m_i(y). This sum is then reduced to an interim mantissa sum, w_i = s_i − c_i·R, where c_i is a signed carry bit, i.e., c_i ∈ {−1, 0, 1}, that is chosen so that w_i is guaranteed to be in the range [−(R−2), R−2]. (We show below that this is always possible.) The computed mantissa sum is then performed as m_i(z) = w_i + c_{i−1}, so that the resulting collection of components is regularized and no carry-bit propagation is necessary. As the following lemma shows, taking this approach allows us to avoid propagating carries across an entire superaccumulator after each addition in a summation computation, while nevertheless representing each partial sum exactly.
Lemma 1
It is sufficient to choose c_i ∈ {−1, 0, 1}, for each i, for the resulting sum, z, of x and y to be regularized, i.e., so that each mantissa, m_i(z), is in the range [−(R−1), R−1].
Proof: Given that x and y are regularized, each componentwise mantissa sum, s_i, is in the range [−2(R−1), 2(R−1)]. We wish to show that the result, z, is regularized, that is, that each mantissa, m_i(z) = w_i + c_{i−1}, is in the range [−(R−1), R−1]. Note that if s_i is in the range [−(R−2), R−2], then we can ensure w_i is in that range by setting c_i = 0. So let us consider the cases when s_i is too negative or too positive.
Case 1: s_i < −(R−2). Note that
−2(R−1) ≤ s_i ≤ −(R−1).
Hence, we can choose c_i = −1, so that
w_i = s_i + R.
Moreover, w_i ≤ 1 ≤ R−2, in this case, since s_i ≤ −(R−1). Thus,
w_i = s_i + R ≥ −2(R−1) + R = −(R−2),
and w_i is in the range [−(R−2), R−2].
Case 2: s_i > R−2. Note that
R−1 ≤ s_i ≤ 2(R−1).
Hence, we can choose c_i = 1, so that
w_i = s_i − R.
Moreover, w_i ≥ −1 ≥ −(R−2), in this case, since s_i ≥ R−1. Thus,
w_i = s_i − R ≤ 2(R−1) − R = R−2,
and w_i is in the range [−(R−2), R−2].
In all cases, |w_i| ≤ R−2 and |c_{i−1}| ≤ 1, so each m_i(z) = w_i + c_{i−1} is in the range [−(R−1), R−1].
This implies that, by choosing each c_i in {−1, 0, 1}, we can guarantee that the sum of two regularized superaccumulators can be done in parallel in constant time, with each carry going to at most an adjacent component and no further. There is a chance, of course, that the sum of two superaccumulators indexed from 0 to d−1 could result in a superaccumulator indexed from 0 to d. But this causes no problems for us, since any mantissa can hold at least log n bits; hence, one additional superaccumulator component is sufficient for holding all the adjacent-component carries during the sum of n floating-point numbers.
An additional observation is that, since R is a power of two in our case, we can compute each w_i and c_i using simple operations involving the addition or subtraction of a power of two, given s_i. Also, since we reserve an extra bit in each superaccumulator component, computing s_i can be done without overflow in the standard floating-point representation of components.
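A sequential sketch of this carry-free addition rule (ours; the radix 2^30 and the sample values are illustrative choices) may help make the lemma concrete:

```python
B = 30
R = 1 << B            # radix: a power of two
A = R - 1             # regularized mantissas lie in [-A, A]

def acc_value(x):
    """The number represented by a dense superaccumulator, least significant first."""
    return sum(m * R**i for i, m in enumerate(x))

def acc_add(x, y):
    """Carry-free addition of two regularized superaccumulators of equal length.

    Every signed carry in {-1, 0, 1} moves exactly one component up, so each
    output component depends only on two adjacent input positions."""
    n = len(x)
    out = [0] * (n + 1)            # one extra component absorbs the topmost carry
    for i in range(n):
        s = x[i] + y[i]            # componentwise sum, in [-2A, 2A]
        if s > A - 1:              # too positive: send carry +1 upward
            out[i] += s - R
            out[i + 1] += 1
        elif s < -(A - 1):         # too negative: send carry -1 upward
            out[i] += s + R
            out[i + 1] -= 1
        else:
            out[i] += s
    return out

x = [A, -A, 5]                     # a regularized superaccumulator
y = [A, A, -5]
z = acc_add(x, y)
assert acc_value(z) == acc_value(x) + acc_value(y)
assert all(-A <= m <= A for m in z)   # the result is regularized again
```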
Intuitively, each pair of consecutive superaccumulator components, a_i and a_{i+1}, "overlap" by one bit, and each component has one additional sign bit. As an alternative to representing each a_i as a floating-point number, then, we could instead represent each mantissa, m_i, using an integer representation, with an explicit overlap of one bit between each consecutive pair of numbers, m_i and m_{i+1}, being understood. This would allow us to save some space for what amounts to redundant exponents in the floating-point representations of the a_i's. For the sake of simplicity, however, we choose in this discussion to assume that each a_i is itself a floating-point number, with the understanding that our algorithms could easily be modified to work for the case that the numbers in A are integers. In either case, the overlap between consecutive numbers in A allows us to apply a lazy strategy for accumulating carry bits, without full propagation, which overcomes a shortcoming in previous representations.
One important comment is in order here. Namely, for the analysis in this paper, we are not assuming that the number, h, of bits allocated for floating-point exponents is a fixed constant; hence, our analysis does not assume that the size of an equivalent fixed-point representation or superaccumulator for a floating-point number is a fixed constant.
Thus, for small numbers of arbitrary-precision floating-point numbers, it is possible that our regularized superaccumulators may waste space. To avoid these issues, we can represent numbers using a format we call a "sparse superaccumulator." Given a superaccumulator,
A = (a_0, a_1, ..., a_{d−1}),
the sparse superaccumulator for A is the vector,
Â = (a_{i_1}, a_{i_2}, ..., a_{i_k}),
consisting of all the active indices in A, for i_1 < i_2 < ⋯ < i_k. We say that an index, i, in a superaccumulator is active if a_i is currently nonzero or has ever been nonzero in the past (when viewing a superaccumulator as an indexed data structure).
One possible parallel algorithm for summing two sparse superaccumulators, x and y, is as follows. We first merge the active indices of x and y, and we then do a summation of corresponding terms (possibly with carries into adjacent components, as needed). Note, though, that we will not propagate carries when we use regularized superaccumulators. Thus, an index in the sum is active if it was active in x or y, or becomes nonzero as a result of the sum. This definition is somewhat related to the adaptive floating-point representation of Shewchuk [33], which introduces sparsity and adaptivity, but only for vectors of non-overlapping floating-point numbers having arbitrary exponents, rather than exponents that are powers of the radix R, as in our sparse regularized superaccumulator representation.
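A sequential stand-in for this merge-and-add (our sketch; a dictionary union plays the role of the parallel merge, and the radix is again an illustrative 2^30):

```python
B = 30
R = 1 << B
A = R - 1                  # regularized mantissas lie in [-A, A]

def sparse_add(x, y):
    """Sum two regularized sparse superaccumulators given as sorted
    (index, mantissa) pair lists; carries activate adjacent indices only."""
    s = {}
    for i, m in x + y:                       # merge of the two active-index lists
        s[i] = s.get(i, 0) + m
    out = {}
    for i in sorted(s):
        v = s[i]
        if v > A - 1:                        # carry +1 into the adjacent component
            out[i] = out.get(i, 0) + v - R
            out[i + 1] = out.get(i + 1, 0) + 1
        elif v < -(A - 1):                   # carry -1 into the adjacent component
            out[i] = out.get(i, 0) + v + R
            out[i + 1] = out.get(i + 1, 0) - 1
        else:
            out[i] = out.get(i, 0) + v
    return sorted(out.items())

def sparse_value(x):
    return sum(m * R**i for i, m in x)

x = [(0, A), (7, -3)]
y = [(0, A), (2, 1)]
z = sparse_add(x, y)
assert sparse_value(z) == sparse_value(x) + sparse_value(y)
assert all(-A <= m <= A for _, m in z)       # the result is regularized again
```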
Furthermore, given a sparse superaccumulator, Â, and a parameter, τ, we define the truncated sparse superaccumulator for Â to be the vector, Â^{(τ)}, consisting of the first τ (most-significant) entries in Â.
3 Our Fast PRAM Algorithm
The first PRAM algorithm we present is based on summing numbers represented using sparse regularized superaccumulators. Our method runs in O(log n) time using O(n log n) work in the EREW PRAM model, which is worst-case optimal. The details are as follows.

1. Build a binary summation tree, T, with depth O(log n), having each leaf associated with a distinct floating-point number, x_i, in X. This step can be done in O(1) time and O(n) work.

2. In parallel, for each x_i, convert x_i into an equivalent regularized superaccumulator, A_i. This step can be done in O(1) time and O(n) work, just by splitting each floating-point number into a constant number of components such that each has an exponent that is a multiple of b.

3. Perform a parallel merge-sort of the components, using their exponents as keys (not their mantissas). This creates a sorted list, L(v), for each node v in T, consisting of the exponents found in the subtree of T rooted at v, as well as links from each exponent in L(v) to its predecessors in the lists of v's children and its parent. This step can be done in O(log n) time using O(n log n) work [5, 17] via the cascading divide-and-conquer technique [1], because the boundary exponents are known from the beginning.

4. Perform a parallel prefix computation to remove duplicate exponents in each L(v). This step can be done in O(log n) time using O(n log n) work, since the total size of all the lists is O(n log n).

5. Using the links created in the previous step to match up corresponding components in constant time per level, possibly adding new components that represent a carry bit moving into a component that was previously not included in the sparse representation, perform a bottom-up sparse superaccumulator summation in T. This results in a sparse superaccumulator representation of the sum being stored at the root of T. This step can be done in O(log n) time using O(n log n) work.

6. Convert the sparse regularized superaccumulator for the sum at the root of T into a non-overlapping superaccumulator that is regularized. This amounts to a signed carry-bit propagation operation, which can be done by a parallel prefix computation (based on a simple lookup table keyed on whether the input carry bit is −1, 0, or 1). We leave the details to the interested reader. This step can be done in O(log n) time using O(n) work.

7. Correctly round the non-overlapping superaccumulator from the previous step into a floating-point number. This step amounts to locating the most significant nonzero component of the superaccumulator and then combining it, as needed, with its neighboring components to create a floating-point number of the appropriate size, rounding the result based on the truncated bits. This step can be done in O(log n) time using O(n) work.
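A compact sequential stand-in for this pipeline (a sketch of ours: Python's big integers play the roles of the superaccumulator components and of the final carry propagation, and the constants B and SHIFT are illustrative choices for double-precision inputs):

```python
import math

B = 30                       # component width; radix 2^B
R = 1 << B
A = R - 1
SHIFT = 1100                 # fixed-point bits below the binary point, enough for doubles

def split(x):
    """Convert one double into a regularized superaccumulator of base-R components."""
    num, den = x.as_integer_ratio()        # x == num/den exactly, den a power of two
    n = (num << SHIFT) // den              # exact fixed-point image of x
    sign, n, comps = (-1 if n < 0 else 1), abs(n), []
    while n:
        comps.append(sign * (n & A))       # one base-R digit, in [-A, A]
        n >>= B
    return comps

def add(x, y):
    """Carry-free superaccumulator addition; each signed carry moves one component up."""
    n = max(len(x), len(y)) + 1
    x, y = x + [0] * (n - len(x)), y + [0] * (n - len(y))
    out = [0] * (n + 1)
    for i in range(n):
        s = x[i] + y[i]
        if s > A - 1:   out[i] += s - R; out[i + 1] += 1
        elif s < 1 - A: out[i] += s + R; out[i + 1] -= 1
        else:           out[i] += s
    return out

def faithful_sum(xs):
    """Propagate all carries at once (exact big-integer evaluation), then round
    the exact sum back to a double a single time."""
    acc = []
    for x in xs:                            # sequential stand-in for the summation tree
        acc = add(acc, split(x))
    total = sum(m << (B * i) for i, m in enumerate(acc))
    return total / (1 << SHIFT)             # Python rounds this integer quotient correctly

assert faithful_sum([1e16, 1.0, -1e16]) == 1.0
assert faithful_sum([0.1] * 10) == math.fsum([0.1] * 10)
```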
This gives us the following.
Theorem 2
Given n floating-point numbers, one can compute a faithfully-rounded representation of their sum in O(log n) time using O(n log n) work (i.e., n processors) in the EREW PRAM model. These time and processor bounds are worst-case optimal in the algebraic computation tree model.
Proof: We have already established the upper bounds. For the lower bounds, we show that the set equality problem can be reduced to the floating-point summation problem in O(1) time using n processors. Suppose, then, that we are given two sets of n positive numbers, A = {a_1, ..., a_n} and B = {b_1, ..., b_n}, and wish to determine if A = B. Let m be the smallest power of two greater than n. For each element, a_i, in A, create the floating-point number, x_i, which represents the number m·a_i. Likewise, for each element, b_j, in B, create the floating-point number, y_j, which represents the number −m·b_j. We claim that A = B if and only if the sum of these two sets of numbers is zero. Clearly, if A = B, then there is a one-to-one matching between equal elements in A and B; hence, there is a matching of each floating-point number, x_i, to a floating-point number, y_j, such that x_i = −y_j. Therefore, the sum of all of these numbers is zero in this case. Suppose, for the "only if" case, that the sum of all these numbers is zero. Note that the exponent of any pair of floating-point numbers in our collection is either the same or differs by at least 1. Thus, if two such numbers are different, they will remain different even if we multiply one of them by m > n. Therefore, the only way that the sum of all these numbers is zero is if each floating-point number, x_i, in this collection has a distinct associated floating-point number, y_j, in this collection, such that x_i = −y_j. Therefore, if the sum of all these numbers is zero, then A = B. The lower bounds follow, then, from the fact that summation is a binary operator and the set equality problem has an Ω(n log n) lower bound in the algebraic computation tree model [2].
To our knowledge, this is the first PRAM method that achieves O(log n) time and an amount of work that is worst-case optimal. We note, however, that the above lower bound holds only for floating-point representations whose exponents are represented with Ω(log n) bits.
4 Our Condition-Number Sensitive PRAM Algorithm
In our fast PRAM algorithm, we showed that it is possible to sum two sparse superaccumulators in constant time using a linear number of processors in the EREW PRAM model, given a sorted merge of the exponents of the components of each superaccumulator. This result also gives us the following.
Lemma 3
Given two truncated regularized sparse superaccumulators, x and y, of combined size k, we can compute the sum of x and y in O(log k) time using O(k) processors in the EREW PRAM model.
Proof: The algorithm follows from the method described above, combined with an EREW PRAM method for merging two sorted lists (e.g., see [21] for details on such computations).
Given this tool, we now describe our condition-number sensitive parallel algorithm for summing the floating-point numbers in X in parallel. Our method runs in polylogarithmic time using an amount of work that depends on the condition number, C(X). (Technically, C(X) can be arbitrarily close to 1; hence, log C(X) can be arbitrarily close to 0. As a notational convenience, therefore, we assume that the logarithms in our complexity bounds are defined to always have a minimum value of 1.)
We begin with a simplified version of our algorithm from the previous section. Initialize an O(log n)-depth summation tree, T, to have an element of X associated with each of its leaves. Convert each value, x_i, stored in a leaf to an equivalent sparse regularized superaccumulator. Perform a bottom-up parallel summation computation on T, using the method of Lemma 3 to perform each pairwise summation of two sparse superaccumulators. We then complete the algorithm by a parallel prefix computation on the sparse superaccumulator for the root of T, to propagate all the signed carry bits, and we then convert this result to a floating-point number, as in the last two steps of our algorithm from the previous section. This simplified summation algorithm runs in O(log² n) time using O(n log n) work in the EREW PRAM model.
Given this template, our condition-number sensitive algorithm is as follows. Begin by setting the truncation parameter, τ, to a small constant. Then perform the above bottom-up parallel merge-and-sum algorithm, but do so using truncated sparse superaccumulators. Unlike our previous method, this one may cause lossy errors to occur, due to the restrictions to truncated sparse superaccumulators. So we utilize a test, which we call the "stopping condition," to determine whether the result is correct and we can stop. If the stopping condition is not satisfied, however, then we set τ ← τ² and repeat the computation. We continue this iteration until either the stopping condition is satisfied or τ has grown so large that the final sparse superaccumulator is no longer truncated.
Before we analyze this algorithm, let us first provide the details for some sufficient stopping conditions. Let ŝ denote the summation value computed from our method after a given iteration, where the final truncated sparse superaccumulator is
Â^{(τ)} = (a_{i_1}, a_{i_2}, ..., a_{i_τ}),
so that a_{i_1} is its least-significant component. Also, for the sake of simplicity, let us assume that ŝ is positive; the method for the case when ŝ is negative is similar. Let e denote the exponent for a_{i_1} and let
β = m_min · 2^e,
where m_min is the smallest positive mantissa that can possibly be represented in our chosen floating-point representation. Note that β is the smallest value that could be represented by a_{i_1}. A sufficient stopping condition, then, is to test whether or not
ŝ ⊕ nβ = ŝ and ŝ ⊖ nβ = ŝ;
that is, the summation value, ŝ, would be unchanged even after doing a floating-point addition or subtraction of n copies of a bound whose magnitude is larger than any value we truncated.
As a simplified alternative, which also works in our algorithms as a sufficient stopping condition, we can determine whether the exponent of the least significant bit in ŝ is at least log n greater than e.
The reason these tests are sufficient as stopping conditions is that, when we are summing the n floating-point numbers using truncated sparse superaccumulators, the largest magnitude by which we can possibly truncate any summand is strictly less than β. The reason for this is that, by the definition of a truncated sparse superaccumulator, if a_{i_1} is included in our final truncated sparse superaccumulator, then it was not truncated by any partial sum. Thus, the maximum value of the sum of all the truncated values is at most nβ.
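In code, the first stopping test amounts to two floating-point comparisons (a sketch of ours; s_hat, beta, and n are as in the discussion above):

```python
def stopping_condition(s_hat, beta, n):
    """Sufficient test: the candidate sum s_hat is faithful if it is unchanged by
    adding or subtracting n copies of a bound beta whose magnitude exceeds
    anything that could have been truncated."""
    return s_hat + n * beta == s_hat and s_hat - n * beta == s_hat

assert stopping_condition(1.0, 2.0**-80, 1000)       # the truncated tail cannot matter
assert not stopping_condition(1.0, 2.0**-53, 1000)   # the tail could still change the sum
```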
Interestingly, this algorithm gives us the following condition-number sensitive result.
Theorem 4
Given a set, $S$, of $n$ floating-point numbers, we can compute a faithfully-rounded representation of the sum of the numbers in $S$ in $O(\log n \cdot \log\log C(S))$ time using work that is $O(n \log C(S))$ in the EREW PRAM model, where $C(S)$ is the condition number for $S$.
Proof: The correctness of our algorithm follows immediately from the above discussion, since we terminate only when we are assured of a faithfully-rounded sum for the numbers in $S$. We claim that the number of iterations (each of which involves squaring the truncation parameter, $t$) is $O(\log\log C(S))$. To see that this claim is true, recall that
$$C(S) = \frac{\sum_{i=1}^{n} |x_i|}{\left|\sum_{i=1}^{n} x_i\right|}.$$
Thus, if we represented the values, $\sum_{i} |x_i|$ and $\left|\sum_{i} x_i\right|$, using exact fixed-point representations, then $\log C(S)$ is proportional to the difference between the bit position of the most-significant bit in the representation of $\sum_{i} |x_i|$ and the bit position of the most-significant bit in the representation of $\left|\sum_{i} x_i\right|$. Our algorithm must therefore perform a sufficient number of iterations so that the number of bits in our truncated sparse superaccumulator for the sum is at least proportional to $\log C(S)$. This indeed occurs in our algorithm and, furthermore, note that our algorithm will terminate once the number of bits in our truncated sparse superaccumulator for the sum is $O(\log C(S))$. That is, our algorithm terminates when $t$ is $O(\log C(S))$, since we assume that the floating-point numbers in our representation contain a fixed number of bits each. Since we square $t$ in each iteration, this implies the claimed running-time and work bounds for our parallel algorithm: we require $O(\log\log C(S))$ squarings to get $t$ to be large enough, and the total work involved is a geometric summation that adds up to the stated work bound.
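The counting argument in this proof can be checked numerically. The sketch below (our own illustration, not code from the paper) evaluates the condition number $C(S)$ exactly with rationals and counts how many squarings of a truncation parameter are needed to reach a given number of bits:

```python
from fractions import Fraction

def condition_number(xs):
    # C(S) = sum(|x_i|) / |sum(x_i)|, evaluated exactly with rationals
    total = sum(map(Fraction, xs), Fraction(0))
    return float(sum(Fraction(abs(x)) for x in xs) / abs(total))

def squarings_needed(target_bits, t0=2):
    # squaring t doubles log2(t), so reaching target_bits takes only
    # O(log log) iterations starting from a small constant t0
    t, iterations = t0, 0
    while t < target_bits:
        t, iterations = t * t, iterations + 1
    return iterations
```

For instance, summing $\{10^8, 1, -10^8\}$ has condition number $2 \cdot 10^8 + 1$, yet only three squarings push the truncation parameter past the roughly 28 bits of cancellation involved.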
Thus, for the vast majority of inputs, which have constant-bounded condition numbers, our algorithm uses a linear amount of work and runs in $O(\log n)$ parallel time. That is, implemented as a sequential algorithm, our method would match the linear running time of the inherently sequential method of adding the numbers, one at a time, to a superaccumulator.
5 External-Memory Algorithms
In this section, we describe our efficient algorithms for summing floating-point numbers in the external-memory model [36], in which the internal memory holds $M$ words and data are transferred between internal and external memory in blocks of $B$ words.
Suppose we are given a set, $S$, of $n$ floating-point numbers. Our sorting-based external-memory algorithm is as follows.
1. Convert each floating-point number to a regularized sparse superaccumulator. This can be done with a single scan over the input, using $O(n/B)$ I/Os.
2. Sort the components of all the sparse superaccumulators constructed in the previous step by their exponents. This step clearly takes $O((n/B)\log_{M/B}(n/B))$ I/Os.
3. Scan the sorted list of superaccumulator components, while maintaining a regularized sparse superaccumulator, $A$, to hold the sum. For each component, $c$, we add $c$ to $A$, using a localized version of the algorithm for summing two superaccumulators. Note that we do not need to store all of $A$ in internal memory to implement this step, since we are processing the components in order by their exponents. Instead, we just need to keep a hot-swap buffer of $A$ that includes the current exponent, swapping out blocks to external memory as they become full. Moreover, since summing two regularized superaccumulators is a carry-free operation, we do not need to worry about the extra I/Os that would otherwise be caused by carry-bit propagation. Thus, we can implement this step in $O(n/B)$ I/Os.
4. Given the computed superaccumulator, $A$, which now holds the exact sum of the $n$ floating-point numbers, perform a back-to-front scan of $A$ to propagate signed carry bits, converting $A$ into a non-overlapping regularized sparse superaccumulator. This step clearly requires $O(n/B)$ I/Os.
5. Given a non-overlapping superaccumulator for the sum, round the most-significant nonzero components to produce a correctly-rounded floating-point number for the sum of the $n$ floating-point numbers in $S$.
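To make the five steps concrete, here is a minimal in-memory Python model of the same pipeline; the function name `exact_sum`, the one-component-per-number simplification, and the use of a Python big integer in place of the blocked superaccumulator are our own sketch choices (and we ignore overflow in the final conversion):

```python
from math import frexp, ldexp

def exact_sum(xs):
    # step 1: one (mantissa, exponent) component per input, with x = m * 2**e
    comps = []
    for x in xs:
        m, e = frexp(x)                   # x = m * 2**e with 0.5 <= |m| < 1
        comps.append((int(m * 2 ** 53), e - 53))
    # step 2: sort the components by exponent
    comps.sort(key=lambda c: c[1])
    # steps 3-4: accumulate exactly; a big integer scaled to the smallest
    # exponent stands in for the carry-free superaccumulator scan
    nonzero = [(m, e) for m, e in comps if m]
    if not nonzero:
        return 0.0
    emin = nonzero[0][1]
    total = sum(m << (e - emin) for m, e in nonzero)
    # step 5: round once at the end (int -> float is correctly rounded)
    return ldexp(float(total), emin)
```

A quick check: `exact_sum([1e16, 1.0, -1e16])` returns `1.0`, whereas the naive left-to-right floating-point sum of the same list returns `0.0`.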
This gives us the following.
Theorem 5
Given $n$ floating-point numbers, we can compute a correctly-rounded floating-point representation of their sum using $O((n/B)\log_{M/B}(n/B))$ I/Os in the external-memory model, in a cache-oblivious manner.
Proof: We have already established the performance bounds. To establish that this algorithm can be performed in a cache-oblivious manner, we note that every step involves simple scans over lists of numbers, except for the sorting step, which itself can be done cache-obliviously [13].
The critical step in the above external-memory algorithm, of course, is the scan to add each floating-point component to our superaccumulator. Since these components were sorted by their exponents and our superaccumulator representation is carry-free, we need only keep a "sliding window" of blocks of our superaccumulator in memory as we perform this scan. Moreover, such a scan is cache-oblivious given any reasonable page-eviction strategy.
If the size of our superaccumulator is less than the size, $M$, of internal memory, however, then we can utilize an even simpler algorithm. Namely, we can simply keep the entire superaccumulator stored in internal memory and process the floating-point components in any order, adding each one to the superaccumulator. This observation leads to the following.
Theorem 6
Given $n$ floating-point numbers, we can compute a correctly-rounded floating-point representation of their sum using $O(n/B)$ I/Os in the external-memory model, in a cache-oblivious manner, provided the superaccumulator for the sum fits in internal memory.
6 Simple MapReduce Algorithms
In this section, we present a simple MapReduce [7, 16, 24] algorithm for summing floating-point numbers. We also report on an experimental evaluation, in which we assume that the input is already loaded in a Hadoop Distributed File System (HDFS), where it is partitioned into 128 MB blocks stored on the local disks of cluster nodes. Our algorithm is based on a single round of MapReduce, which takes as input a collection of floating-point numbers and produces a single floating-point number that represents the correctly-rounded sum of all input numbers. In this section, we first give a high-level overview of our MapReduce algorithm. Then, we describe an implementation of that algorithm using Spark [38]. Finally, we give experimental results of our implementation on large-scale data.
6.1 Overview
Our MapReduce algorithm runs as a single MapReduce job as detailed below.
Map: for each input number, $x_i$, map $x_i$ to the pair $(h(x_i), x_i)$, where the function $h$ returns an integer in the range $[1, r]$ that represents one of the $r$ available reducers. We can simply use a random function, $h$, which assigns each input record to a randomly chosen reducer. The function $h$ can also be defined based on domain knowledge of the input with the goal of balancing the load across the reducers. With a random $h$, for example, this scheme assigns roughly $n/r$ values to each reducer.
Reduce: In the reduce phase, each reducer, $j$, for $1 \le j \le r$, sums up all of its assigned numbers using the sequential algorithm described earlier in Section 3. The output of the reduce function is one sparse superaccumulator that represents the exact sum of the floating-point numbers assigned to reducer $j$. After that, each reducer writes the resulting sparse superaccumulator to the output.
Postprocess: In this final step, a single machine reads back the sparse superaccumulators produced by the reducers and performs a final round of the sparse-superaccumulator addition algorithm to add all of them into one final sparse superaccumulator. Finally, the resulting sparse superaccumulator is converted to a correctly-rounded floating-point value, which is written to the output as the final answer.
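The single round above can be simulated in plain Python. In this sketch (ours, not the paper's implementation), `fractions.Fraction` stands in for a sparse superaccumulator, since both support exact addition and a final correctly-rounded conversion, and `h` is the random partitioner:

```python
import random
from fractions import Fraction

def mapreduce_sum(xs, r=4, seed=0):
    h = random.Random(seed)
    # map: assign each number to one of r reducers at random
    buckets = [[] for _ in range(r)]
    for x in xs:
        buckets[h.randrange(r)].append(x)
    # reduce: each reducer computes an exact partial sum
    partials = [sum(map(Fraction, b), Fraction(0)) for b in buckets]
    # post-process: merge the r partial sums exactly, then round once
    return float(sum(partials, Fraction(0)))
```

Because every intermediate sum is exact, the result is independent of how the partitioner happens to distribute the input.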
6.2 Implementation
We implemented the MapReduce algorithm described above using Spark [38], a modern distributed processing framework. This implementation is open source and available at https://github.com/aseldawy/sumn. We begin by loading the input from disk into the distributed memory of the cluster nodes. In this step, each machine loads the HDFS blocks that are physically stored on its local disk. Then, each machine applies a combine function to each block, which uses the sequential algorithm described in Section 3 to sum all the numbers in that block into one sparse superaccumulator. The goal of the combine step is to reduce the size of the data that needs to be shuffled between mappers and reducers. The output of the combine function is a single key-value pair in which the key is a random integer in the range $[1, r]$ and the value is the sparse superaccumulator. Then, the shuffle phase groups key-value pairs by reducer number and assigns each group to one reducer. After that, each reducer sums up all of its assigned sparse superaccumulators into one sparse superaccumulator and writes it to the intermediate output. Finally, the post-process phase runs on the driver machine that issued the MapReduce job; it combines all the sparse superaccumulators into one and writes the correctly-rounded floating-point value as the final result.
6.3 Experimental Results
To test our MapReduce implementations, we carried out an experimental evaluation of the proposed MapReduce algorithm on large datasets of up to one billion numbers. The input datasets were randomly generated using four different distributions as described in [39]:
1. A set of randomly generated positive numbers, which results in a condition number of 1.
2. A mix of positive and negative numbers generated uniformly at random.
3. Anderson's ill-conditioned data, where a set of random numbers is generated and then their arithmetic mean is subtracted from each number.
4. A set of numbers whose real sum is exactly zero.
In all four datasets, a parameter defines an upper bound for the range of exponents of the input numbers.
We used Apache Spark 1.6.0 running on Ubuntu 14.04 and OpenJDK 1.7.0_91. For experimental repeatability, we ran all our experiments on Amazon EC2 using an instance of type 'r3.8xlarge', which has 32 cores and 244 GB of main memory on Intel Xeon E5-2670 v2 processors [37]. We used the 32 cores to run a local cluster of 32 worker nodes inside that machine. We ran the experiments using three algorithms and measured the end-to-end running time of each one:
1. Our proposed MapReduce algorithm, which uses sparse superaccumulators.
2. A variation of our MapReduce algorithm that uses small superaccumulators [30] instead of sparse superaccumulators.
3. The state-of-the-art sequential iFastSum algorithm [40].
For the latter algorithm, we used the original C++ implementation provided by Zhu and Hayes [40], compiled using gcc 4.8.4. For all the techniques, we first generated a dataset using the random generator provided in [40] and stored it to disk. Then, we processed the same generated dataset with each algorithm, one after another. We ignored the disk I/O time and focused only on the processing time. If we took the disk I/O into account, the MapReduce algorithms would compare even more favorably, due to the distributed nature of the Hadoop Distributed File System (HDFS), in which the machines load data in parallel from multiple disks.
Figure 1 shows the overall running time of the three algorithms as the input size increases from 1 million to 1 billion numbers, with the exponent-range parameter fixed at 2000. In general, iFastSum is faster for processing small datasets with fewer than 10 million records. However, as the size of the data increases, both MapReduce implementations outperform the sequential iFastSum algorithm, with up to an 80x speedup. This shows a great potential for MapReduce algorithms in the problem of summing a huge array of floating-point numbers. We observe that the implementation that uses small superaccumulators is slightly faster than the one that uses sparse superaccumulators. The reason for this is that each sparse superaccumulator runs on a single core, which does not allow it to realize its theoretical parallelism. In the future, we could investigate the use of single-instruction multiple-data (SIMD) features to achieve a higher level of parallelism. Another possibility is to use GPUs, which can achieve massive parallelization with thousands of threads on a single chip. We believe that there is great potential in these two options, as the design of the sparse superaccumulator lends itself to both types of parallelization, where the same instruction is repeated for every index in the superaccumulator.
Figure 2 shows the running time as the exponent-range parameter increases from 10 to 2000. Notice that the maximum possible value of this parameter is 2046 for double-precision floating-point numbers. With the input size fixed at 1 billion numbers, we observe that the two MapReduce algorithms consistently outperform the sequential iFastSum algorithm. We also observe that the running time of the sparse-superaccumulator algorithm increases slightly as the parameter increases. This behavior is expected, because a larger exponent range makes the superaccumulator less sparse, as the number of nonzero (active) indices increases. The only exception is dataset No. 3, as the subtraction of the mean causes the range of exponents to decrease to about 15 even if the original range is very large. Similar to the behavior reported in its original paper [39], the running time of iFastSum on dataset No. 4 increases with the exponent range. Interestingly, the small superaccumulator maintains consistent performance regardless of the parameter value.
Figure 3 shows the end-to-end running time as the cluster size increases from 1 to 32 cores. The performance of iFastSum stays constant, as it runs on a single core. The results of this experiment show near-linear scalability for our MapReduce algorithm, with performance scaling with the number of cores. The performance starts to saturate as the cluster size increases from 16 to 32 cores, because the underlying processor virtually increases the number of running threads using hyper-threading. For a small cluster with a few cores, iFastSum is faster, as it is highly tuned for sequential processing, while Spark incurs extra overhead for the other algorithms. As the number of cores increases, the parallelization of the MapReduce algorithms allows them to outperform iFastSum on all experimental datasets. However, the crossover point changes from one dataset to another. For example, since dataset No. 4 is the worst case for iFastSum, the sparse superaccumulator proves faster even on a single core.
7 Conclusion
In this paper, we have given a number of efficient parallel algorithms for computing a faithfully-rounded floating-point representation of the sum of $n$ floating-point numbers. Our algorithms are designed for a number of parallel models, including the PRAM, external-memory, and MapReduce models. The primary design paradigm of our methods is that of converting the input values to an intermediate representation, called a sparse superaccumulator, summing the values exactly in this representation, and then converting this exact sum to a faithfully-rounded floating-point representation. We are able to achieve significant parallelization by utilizing a novel intermediate floating-point superaccumulator representation that is carry-free. Our experimental evaluation shows that our MapReduce algorithm can achieve up to an 80x performance speedup as compared with the state-of-the-art sequential algorithm. The MapReduce algorithm yields linear scalability with both the input size and the number of cores in the cluster.
Acknowledgments
This research was supported in part by the National Science Foundation under grant 1228639, and an AWS in Education Grant. We would like to thank Wayne Hayes for several helpful discussions concerning the topics of this paper.
References
 [1] M. J. Atallah et al. Cascading divide-and-conquer: A technique for designing parallel algorithms. SICOMP, pages 499–532, 1989.
 [2] M. Ben-Or. Lower bounds for algebraic computation trees. In STOC, pages 80–86, 1983.
 [3] E. O. Brigham. The Fast Fourier Transform and Its Applications. Prentice-Hall, Inc., 1988.
 [4] C. Burnikel et al. Exact geometric computation in LEDA. In SoCG, pages 418–419, 1995.
 [5] R. Cole. Parallel merge sort. SICOMP, 17(4):770–785, 1988.
 [6] S. Collange et al. A reproducible accurate summation algorithm for high-performance computing. In Proceedings of the SIAM EX14 workshop, 2014.
 [7] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, Jan. 2008.
 [8] T. Dekker. A floating-point technique for extending the available precision. Numerische Mathematik, pages 224–242, 1971.
 [9] J. Demmel and Y. Hida. Accurate and efficient floating point summation. SIAM Journal on Scientific Computing, 25(4):1214–1248, 2004.
 [10] J. Demmel and Y. Hida. Fast and accurate floating point summation with application to computational geometry. Numerical Algorithms, 37(1-4):101–112, 2004.
 [11] J. Demmel and H. D. Nguyen. Parallel reproducible summation. IEEE TC, 64(7):2060–2070, July 2015.
 [12] L. Fousse et al. MPFR: A multiple-precision binary floating-point library with correct rounding. TOMS, June 2007.
 [13] M. Frigo et al. Cache-oblivious algorithms. In FOCS, pages 285–297, 1999.
 [14] GMPLib. GMP: The GNU multiple precision arithmetic library. https://gmplib.org/. Accessed 2015-12-16.
 [15] D. Goldberg. What every computer scientist should know about floating-point arithmetic. CSUR, pages 5–48, Mar. 1991.
 [16] M. T. Goodrich et al. Sorting, searching, and simulation in the MapReduce framework. In Int. Symp. on Algorithms and Computation (ISAAC), volume 7074 of LNCS, pages 374–383, 2011.
 [17] M. T. Goodrich and S. R. Kosaraju. Sorting on a parallel pointer machine with applications to set expression evaluation. J. ACM, 43(2):331–361, Mar. 1996.
 [18] M. T. Goodrich and R. Tamassia. Algorithm Design and Applications. Wiley Publishing, 2014.
 [19] G. Hanrot, V. Lefèvre, P. Pélissier, P. Théveny, and P. Zimmermann. The GNU MPFR library. http://www.mpfr.org/. Accessed 2015-12-16.
 [20] M. Isenburg, P. Lindstrom, and J. Snoeyink. Lossless compression of predicted floating-point geometry. Computer-Aided Design, pages 869–877, 2005.
 [21] J. JáJá. An Introduction to Parallel Algorithms. Addison-Wesley, 1992.
 [22] E. Kadric, P. Gurniak, and A. DeHon. Accurate parallel floating-point accumulation. In 21st IEEE Symp. on Computer Arithmetic (ARITH), pages 153–162, Apr. 2013.
 [23] M.-Y. Kao and J. Wang. Linear-time approximation algorithms for computing numerical summation with provably small errors. SISC, pages 1568–1576, 2000.
 [24] H. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for MapReduce. In 21st ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 938–948, 2010.
 [25] D. E. Knuth. The Art of Computer Programming, Volume 2 (3rd Ed.): Seminumerical Algorithms. Addison-Wesley, Boston, MA, USA, 1997.
 [26] U. W. Kulisch and W. L. Miranker. The arithmetic of the digital computer: A new approach. SIAM Review, 28(1):1–40, 1986.
 [27] A. Lacroix and F. Hartwig. Distribution densities of the mantissa and exponent of floating point numbers. In ISCAS, pages 1792–1795, May 1992.
 [28] H. Leuprecht and W. Oberaigner. Parallel algorithms for the rounding exact summation of floating point numbers. Computing, 28(2):89–104, 1982.
 [29] M. A. Malcolm. On accurate floating-point summation. Commun. ACM, 14(11):731–736, Nov. 1971.
 [30] R. M. Neal. Fast exact summation using small and large superaccumulators. arXiv ePrint, abs/1505.05571, 2015.
 [31] B. Parhami. Generalized signed-digit number systems: A unifying framework for redundant number representations. IEEE TC, 39(1):89–98, Jan. 1990.
 [32] D. Priest. Algorithms for arbitrary precision floating point arithmetic. In 10th IEEE Symp. on Computer Arithmetic (ARITH), pages 132–143, Jun. 1991.
 [33] J. R. Shewchuk. Adaptive precision floating-point arithmetic and fast robust geometric predicates. Discrete & Computational Geometry, 18(3):305–363, 1997.
 [34] S. M. Rump, T. Ogita, and S. Oishi. Accurate floating-point summation, Part I: Faithful rounding. SIAM Journal on Scientific Computing, 31(1):189–224, 2008.
 [35] M. Tommila. Apfloat for Java. http://www.apfloat.org/apfloat_java/. Accessed 2015-12-16.
 [36] J. S. Vitter. Algorithms and data structures for external memory. Foundations and Trends in Theoretical Computer Science, pages 305–474, Jan. 2008.
 [37] Intel Xeon Processor E5-2670 v2. http://ark.intel.com/products/75275.
 [38] M. Zaharia et al. Spark: Cluster computing with working sets. In HotCloud, 2010.
 [39] Y.-K. Zhu and W. B. Hayes. Correct rounding and a hybrid approach to exact floating-point summation. SIAM Journal on Scientific Computing, 31(4):2981–3001, 2009.
 [40] Y.-K. Zhu and W. B. Hayes. Algorithm 908: Online exact summation of floating-point streams. TOMS, pages 1–13, 2010.