1 Introduction

# The combinatorial algorithm for computing π(x)

## Abstract.

This paper describes recent advances in the combinatorial method for computing , the number of primes . In particular, the memory usage has been reduced by a factor of , and modifications for shared- and distributed-memory parallelism have been incorporated. The resulting method computes with complexity in time and in space. The algorithm has been implemented and used to compute for and for . The mathematics presented here is consistent with and builds on that of previous authors.

###### 2010 Mathematics Subject Classification:
Primary 11N05, Secondary 11Y16, 11-04

## 1. Introduction

Algorithms used in exact calculations of can be divided into roughly three categories. The simplest algorithms are based on identifying and counting each prime , typically using some modification of the sieve of Eratosthenes. A naïve implementation of the sieve of Eratosthenes uses arithmetic operations and bits of memory 1. Modern variants based on bucket sieving reduce the memory usage to roughly storage locations each of width bits, while leaving the time complexity unchanged [18]. Given the prime number theorem, algorithms that enumerate the primes are limited to time complexity .

The first published algorithm capable of computing substantially faster than the sieve of Eratosthenes was a combinatorial algorithm due to E. Meissel [14]. Given that Meissel’s method involved decisions based on human judgement, it is not clear what time complexity to attribute to it; despite this fact, authors usually estimate the time complexity of Meissel’s original method as for any [7]. Meissel used his method in hand calculations of and in the late 1800s [15, 16]; the method was substantially improved by multiple groups of authors, and used in record computations of for between 1956 and 2007 [11, 13, 1, 7, 3, 6, 17]. Meissel’s method and its descendants are collectively known as “the” combinatorial algorithm for computing .

Analytic algorithms for computing based on the Riemann zeta function were first presented by Lagarias and Odlyzko in the 1980s [8, 9, 5]. Despite the attractive complexity of in time and in space, for any , the implied constants were large, and no-one succeeded in developing a practical implementation of these methods until nearly 30 years later. The first record computation using an analytic method was , under the assumption of the Riemann hypothesis, by Franke, Kleinjung, Büthe, Jost in 2010 [4]. This was followed by a 2012 computation of the same value by Platt without assuming the Riemann hypothesis [20]. Büthe et al. subsequently modified their algorithm to eliminate the assumption of the Riemann hypothesis, and presented the first computation of [4].

The problem of determining the number of primes up to some limit is directly tied to the history of the primes themselves, which dates to antiquity and is beyond the scope of this paper. Thus, the paragraphs above are only a sketch of the history of the problem; we direct the interested reader to additional historical references [14, 21, 10, 2, 19].

In the current paper, we describe recent advances to the combinatorial algorithm for computing . Firstly, we show how the memory usage of the algorithm can be reduced by a factor of . We note that this is not only a reduction in the memory complexity, but a substantial reduction in the actual memory usage for relevant values of . Indeed, before the final step in the memory-complexity reduction was achieved, the author had already reduced the memory usage sufficiently to compute , so the original announcement of claimed only a constant-factor reduction in the memory usage. In addition to the reduction in memory usage, we describe mechanisms by which the algorithm can be parallelized. Multiple methods due to the author and others are presented for shared-memory parallelism. We also describe a previously unpublished algorithm for distributed-memory parallelism, loosely based on the idea presented in [6]. The algorithms described here were implemented and used to compute for and for .

## 2. Reducing space complexity

Three data structures dominate the memory usage in the combinatorial algorithm [11, 7, 3, 6, 17]: a table of for , a table of the smallest prime factor , also for , and a set of sieve counters, where a typical choice for is [3]. Each of these three data structures limits the space complexity of the algorithm to . The choice with for some is used in the most recent versions of the algorithm [3, 17] to achieve the time complexity , simultaneously setting the space complexity at . The next largest data structure is a table of primes for , which has size . Thus, to decrease the memory usage of the algorithm by a factor of , we must either reduce each of the limiting data structures by a factor of or more, or else eliminate them entirely.

We note that not all expositions of the algorithm are limited by all three of the above data structures. For example, Oliveira e Silva was aware that significantly smaller sieve counters can be used than implied by , although he does advocate storing and for [17]. This is to be contrasted with Deléglise and Rivat, who use sieve counters, and store for , but manage to eliminate from their final formulae [3].

### 2.1. Retrieving π(y) for y≤ymax in O(1) time using O(ymax/logymax) space

The values are used in many places in the algorithm [3, 17]. The authors of past studies advocate the use of a table of values for this purpose, which requires storage locations. The implied constant is in the simplest implementation, where a single storage location is used to store a single value of . This constant can be reduced somewhat using a wheel, for example only storing for those coprime to the first primes, for some . However, a wheel cannot be used to reduce the space complexity of the algorithm, as the table used to store the wheel itself grows rapidly, namely with the primorial of . From a practical point of view, even with a wheel the table becomes prohibitively large, and had to be eliminated to permit the computation of .

Given the prime number theorem, it turns out that it is possible to retrieve for any in constant expected time, using only precomputed values. The trick is to only store for values that are multiples of . We also make use of a table of all the primes for : such a table also requires storage locations, and is anyway required elsewhere in the combinatorial method [3, 17]. The method for determining for a specific value of is then as follows: firstly, we look up the value at the closest value . We then iterate through the array of primes , starting at , checking whether at each value of . If , we return ; if , then we move on to , repeating the process.

The surprising thing is the rapid speed with which this algorithm converges: from the prime number theorem, we expect on average one prime in the range , because . Thus, the most likely situation is that , i.e., the initial guess for is in fact the correct value, and the algorithm terminates after a single iteration. In practice, in the combinatorial algorithm we retrieve for many values of , such that the average performance is indeed the relevant quantity. Even in the worst case, it is impossible for this algorithm to require more than iterations, which is , because this would contradict the assumption that was the closest value in the table .

### 2.2. Iterating over the squarefree y≤ymax coprime to the first b primes

Demanding fast access to for any is equivalent to factoring any such value of on demand. is accessed sufficiently often that trivial algorithms such as trial factoring are too slow for this purpose.

The author of [17] actually advocated storing the values for , where is the Möbius function, rather than storing in isolation. However, whether and are stored separately or as a product is immaterial for the current analysis. The values require an array of storage locations, each of width at least ; the space required to store is negligible by comparison.

As was the case with the array , a wheel can be used to compress the array . Indeed, the calculation for was performed using a wheel to compress and , see Appendix A. However, even with a wheel the array eventually becomes prohibitively large, and precluded the computation of . Luckily, it turns out that the data structure can be completely eliminated, and along with it. In order to do this, we investigate the purpose of storing [17]. In fact, the only situation where this array is used is to iterate over all squarefree values having for different values of . The author of [17] does this by iterating over all , and explicitly checking the condition for the given value of . We also note that is used for exactly the same values of .

Thus, in order to eliminate the array , we require an iteration scheme over the squarefree numbers coprime to the first primes. Although somewhat cumbersome, it is straightforward to construct such an iteration scheme using a variable number of nested loops. Firstly, we loop over the primes , where is the only loop variable, and assumes the values . We then loop over the biprime numbers , where ranges from until the product exceeds , and ranges from until the product exceeds . We subsequently loop over all numbers that are the product of three distinct primes , , and , each having and , using similar break conditions as above. This process is repeated until the largest possible number of factors for has been exceeded, which occurs when , where is the number of nested loops. For example, , so if is 64 bits or smaller, then . Furthermore, each value of is squarefree by construction, so for each .

### 2.3. Reducing the size of the sieve counters

Reducing the size of the sieve counters is easy in comparison to and . Firstly, we note that one can simply reduce the number of counters, without negative effects on the runtime [17]. By definition, the width of the sieving intervals in the combinatorial algorithm for computing is equal to the number of sieve counters, which we have denoted . Given that the upper limit of the sieve is , there are a total of intervals. Supposing that the overhead per sieving interval is proportional to the number of sieving primes, , the total overhead associated with subdividing the sieving intervals is proportional to by the prime number theorem. If the overall time complexity is to be kept at , then this implies , for some constant . Choosing this minimal value of results in sieve counters a factor of smaller than needed to achieve our target space complexity of . This is consistent with numerical experiments, where we find that the optimal value of to minimize the runtime is substantially smaller than .

Despite the reduction above, there is still an incentive to further reduce the size of the sieve counters. Although it is not necessary, it is helpful in a shared-memory architecture to allocate separate sieve counters for each parallel thread. This permits parallelization at the level of sieving blocks, which is sufficiently coarse as to carry relatively little overhead, yet sufficiently fine that load balancing is relatively easy. If such an approach is taken, then the memory usage of the counters is multiplied by a factor of the number of threads , which limits to or smaller if the memory usage is to be kept at .

Two additional approaches for reducing the size of the sieve counters are apparent to the author. Firstly, it should be possible to substantially reduce the amount of overhead per interval using a variant of the bucket sieve algorithm developed by Oliveira e Silva [18]. The basic idea of bucket sieving is to not sieve every interval by every sieving prime, but rather to allocate each sieving prime to a “bucket” that indicates the next interval in which a multiple of the prime appears. Buckets are then sequentially processed, one bucket per interval, with each sieving prime encountered being moved to a later bucket. In this fashion, the only primes that are encountered in each sieving interval are the ones for which multiples actually appear in that interval. This permits significantly smaller sieving intervals to be used, effectively eliminating the width of the sieving interval as a contributor to memory usage. Such an approach may even permit the entire sieve table to be stored in the processor’s data cache, providing greatly enhanced performance as compared to main memory [18].

The other potential approach for further reducing the memory usage of the sieve counters involves more efficiently packing the values. The sieve counters suggested by Oliveira e Silva, and used by the present author, have a fractal-like structure [17]. For a complete description of the workings and necessity of the sieve counters, we direct the reader to [17]. What matters for us is that the counters are each initialized with a number , for some , and then decremented from that initial value. This implies that the largest counters need to be stored using integer data types with at least bits. Thus, if a common binary representation is used for each of the sieve counters, then the total storage requirement is bits. With the sieve counters indexed using a single variable as in [17], one can probably not avoid using a common binary representation for each of the counters. We note, however, that it is possible to pack the values much more efficiently, resulting in an average of bits per counter, such that the total requirement is bits.

## 3. Modifications for shared-memory parallelism

There are several practical approaches for parallelizing the algorithm on a shared-memory architecture. Firstly, there is one important part of the algorithm, namely the “easy leaves” [17] in the computation of the partial sieve function , which can be made embarrassingly parallel. Here denotes the count of natural numbers that are coprime to the first primes. The so-called easy leaves do not depend on the main sieve, do not need to be interleaved with other parts of the algorithm, and can be computed completely in isolation of one another.

The difficult part of the parallelism is the main sieve, where the partial sieve function is made available for each and each prime . The values of for smaller values of and are needed in order to compute for larger and , which precludes the embarrassingly parallel computation of . The approach taken by the current author is to exploit the fact that the sieving is already broken into blocks of length . Specifically, one sieves each of subsequent blocks in parallel, working not with , but with , where is the beginning of the sieving interval under consideration. Each time a value needs to be added to a running sum without knowledge of , this discrepancy is recorded in a tally. Once each thread is done sieving the interval , the values can be used to compute each , starting at the smallest value of , and the discrepancies represented by the tallies can be resolved.

An algorithm that relies on the above idea has several drawbacks. Firstly, separate sieve counters are needed for each thread, which multiplies the memory usage of the sieve counters by a factor of . Secondly, the tallies needed to keep track of the discrepancies between and require a similar amount of memory as the sieve counters. Finally, synchronization is required after each thread sieves a single block, which carries unnecessary overhead. Nonetheless, this approach was found to be efficient enough for the purposes of the author.

After completing the bulk of the current project, the author was made aware of the yet-unpublished work of Kim Walisch. Walisch employs an adaptive algorithm for shared-memory parallelism, where blocks are scheduled dynamically depending on the runtime of previous blocks. Such an approach is certainly more efficient than synchronizing each iteration, which is important if a large shared-memory machine is to be used.

Another potentially attractive approach for shared-memory parallelism, in terms of both time and space, would be to combine adaptive scheduling with the distributed-memory parallelism algorithm that will be explained in the next section. By leveraging a distributed-memory algorithm even on a shared-memory architecture, the dependence between subsequent iterations would be broken, completely eliminating the need for communication between threads. Any constant arrays, such as the table of primes for , could still be shared between the threads to save space on a single shared-memory node.

## 4. An algorithm permitting distributed-memory parallelism

Distributing the computation of between multiple compute nodes was necessary for the author to compute . The principal issue with distributing the computation is that the simplest algorithms described in Section 3 rely on rapid exchange of information between compute nodes. Although it is in principle possible to efficiently distribute such a calculation, the greatest degree of parallelism can only be achieved if internode communication can be minimized or eliminated.

Fortunately, it is possible to parallelize the combinatorial algorithm for computing in a way that requires no interprocess communication whatsoever, with the exception of summing the contribution to for each job after the fact. This is highly efficient for the machine, but requires use of a supporting algorithm to break the interdependence of the jobs.

The following algorithm for distributed-memory parallelism is loosely based on an unpublished idea of X. Gourdon [6]. Specifically, the issue is that the sums in the main part of the combinatorial algorithm depend on the partial sieve function , which represents the count of numbers up to that are coprime to the first primes. Sieving an interval only reveals the values . Thus, determining requires storing , updating it after sieving each block, and using the updated value while sieving the next block to obtain any values of interest. This approach works fine if the sieve is started at , because the recursive dependence terminates with . If the sieve is to be started somewhere in the middle because, for example, earlier blocks are being simultaneously sieved on some other computer, then we need a method to independently compute .

What is needed is an algorithm that can compute for a given value of and every . An idea for how to do this was given in [6], namely to repeatedly apply the recurrence

 (4.1) ϕ(m,b)=ϕ(m,b−1)−ϕ(m/pb,b−1).

Here is the size of the wheel being used in the sieve, so is accessible for any in time [17]. Given , the idea is to compute to obtain . This can be done using the same implementation intended for the overall computation of , which is able to compute for varying values of and . The process is then repeated, to obtain , and onwards up to .

The difficulty with the above idea is the amount of time needed to perform this process; it would not affect the overall computational complexity of computing , but a simple interpretation of this idea was too slow to be used for the computation of . The general idea, however, is sound, and modifications can be made to substantially decrease the cost.

The approach taken here is a multifaceted one, where varying methods are used to compute depending on the values of . Again, is available in time for any using the sieving wheel. The wheel can also be used to compute in time via and . The difficult cases occur for . We first check whether . If this is the case, then we directly apply (4.1), using the combinatorial algorithm to compute . If, on the other hand, , then Legendre’s formula applies, such that . We next check whether . If this is the case, then we can use the method described in Section 2.1 to retrieve in time. If then Legendre’s formula still applies, but we must compute by some other method, e.g., using a second application of Legendre’s formula or the combinatorial algorithm. For the remaining values , determining is trivial given . Specifically, if then for and for . If , then for all .

## 5. Numerical results

The combinatorial algorithm was implemented and used to compute for and for , see Tables 1 and 2. The values for and for were checked and found to be consistent with the work of previous authors [17, 4]. We note that the values for were previously computed under the assumption of the Riemann hypothesis [4], and were apparently never verified unconditionally until this study. The values and for were first reported in this study. These new values were checked in three ways. First, each new value was computed twice, using separate clusters and differing numerical parameters (, , and ). Second, the values were checked against the logarithmic integral to ensure the results were reasonable. Third, at the suggestion of Robert Gerbicz, the parities of the new values of were checked and found to be consistent with those computed by Lifchitz using a yet-unpublished algorithm [12].

## 6. Summary

Recent advances in the combinatorial algorithm for computing were presented together with numerical results. Specifically, memory usage has been reduced by a factor of , and algorithms for shared- and distributed-memory parallelism have been developed. The resulting algorithm computes using arithmetic operations and memory locations, each of width proportional to . An algorithm for shared memory parallelism appeared previously in the literature [7], but not for the most recent versions of the algorithm [3, 17]; the basic idea necessary for distributed memory parallelism appeared in an unpublished manuscript [6]. The memory reduction presented here appears to be new. Previously reported values [17, 4] of for and for were verified; the values and for were computed and checked in several ways.

We are now in the interesting situation where two different types of algorithms, combinatorial and analytic, are closely matched for practical calculations of . If nothing else, this situation gives unprecedented confidence in any numerical results computed consistently using both types of methods, which is currently the case with for and for .

## Appendix A Implementation details

The description in [17] was used as a starting point for the implementation, with the enhancements of Sections 24 gradually incorporated. The implementation was written in the C99 programming language, with significant effort devoted to ensuring the correctness of the program. Fast unit tests were run on a development machine for every committed version of the code, with more extensive unit tests frequently run on the target cluster. All code was demanded to compile without warning using the GCC 4.9.1 compiler with the default warning level, and to pass static analysis with the Clang Static Analyzer. Precisions of finite-width data types were artificially reduced to intentionally break the program and identify failure modes. Unit tests were written covering wide ranges of parameter values, including edge-cases chosen specifically with the intention of breaking the program. In general, all code was written and checked as strictly as the author was capable at the time of writing.

In Table 3 we show resources usage for computing using two different versions of the author’s implementation of the combinatorial algorithm. The first version of the software, 2014.10.19, was missing the advancement presented in Section 2.2: this is the version of the software used in the original computations of and for . In this table, time is measured in “node seconds”, i.e., it is the sum of the actual time spent on all compute nodes for that calculation. Similarly, memory usage is memory per node. Here a “compute node” was an IBM iDataplex dx360 M4, having a total of 16 CPU cores (2 Intel Xeon E5-2670 eight-core 2.60 GHz CPUs) with either 64 or 128 GB RAM (8 GB PC3-12800 ECC RDIMM modules) depending on the requirements of the calculation. Thus, node s for computing corresponds to roughly CPU core-years.

## Acknowledgements

The author thanks Karl Dilcher for support, and for suggestions regarding the underlying algorithm, these calculations, and this paper. Calculations were performed on the Guillimin, Briarée, and Colosse clusters from McGill University, Université de Montréal, and Laval Université, managed by Calcul Québec and Compute Canada. The operation of these supercomputers is funded by the Canada Foundation for Innovation (CFI), NanoQuébec, RMGA, and the Fonds de recherche du Québec - Nature et technologies (FRQ-NT).

### Footnotes

1. When discussing space complexity, one must distinguish between bits of storage and storage locations, each of which grows as the problem size increases. It is commonplace to state that an algorithm has complexity in space if it requires storage locations for some constant , each capable of storing a number with bits [7, 3, 6, 17]. We use this convention here for consistency with other authors. Similarly, in a model of time complexity, one must specify which operations are considered to be performed in constant time. In this paper, we count bitwise operations, addition, subtraction, multiplication, division, modulus, decisions (branches), and memory read and write operations of a single machine word.

### References

1. J. Bohman, On the number of primes less than a given limit, BIT Numer. Math. 12 (1972), no. 4, 576–577.
2. A. Brauer, On the exact number of primes below a given limit, Amer. Math. Monthly 53 (1946), no. 9, 521–523.
3. M. Deleglise and J. Rivat, Computing : The Meissel, Lehmer, Lagarias, Miller, Odlyzko method, Math. Comp. 65 (1996), 235–245.
4. J. Franke, T. Kleinjung, J. Büthe, and A. Jost, A practical analytic method for calculating , To appear.
5. W. F. Galway, Analytic computation of the prime-counting function, Ph.D. thesis, University of Illinois at Urbana-Champaign, 2004.
6. X. Gourdon, Computation of : Improvements to the Meissel, Lehmer, Lagarias, Miller, Odlyzko, Deléglise and Rivat method, Preprint (2001).
7. J. C. Lagarias, V. S. Miller, and A. M. Odlyzko, Computing : The Meissel-Lehmer method, Math. Comp. 44 (1985), no. 170, 537–560.
8. J. C. Lagarias and A. M. Odlyzko, New algorithms for computing , Lect. Notes Math. 1052 (1984), 176–193.
9. by same author, Computing : An analytic method, J. Algorithms 8 (1987), no. 2, 173–191.
10. D. H. Lehmer, Guide to tables in the theory of numbers, National Research Council, National Academy of Sciences, Washington, D. C., 1941.
11. by same author, On the exact number of primes less than a given limit, Illinois J. Math. 3 (1959), no. 3, 381–388.
12. H. Lifchitz, Quick computation of the parity of , Preprint (2001).
13. D. C. Mapes, Fast method for computing the number of primes less than a given limit, Math. Comp. 17 (1963), no. 82, 179–185.
14. E. Meissel, Ueber die Bestimmung der Primzahlenmenge innerhalb gegebener Grenzen, Math. Ann. 2 (1870), 636–642.
15. by same author, Berechnung der Menge von Primzahlen, welche innerhalb der ersten Hundert Millionen natürlicher Zahlen vorkommen, Math. Ann. 3 (1871), 523–525.
16. by same author, Berechnung der Menge von Primzahlen, welche innerhalb der ersten Milliarde natürlicher Zahlen vorkommen, Math. Ann. 25 (1885), 251–257.
17. T. Oliveira e Silva, Computing : The combinatorial method, Revista do DETUA 4 (2006), no. 6, 759–768.
18. T. Oliveira e Silva, S. Herzog, and S. Pardi, Empirical verification of the even Goldbach conjecture and computation of prime gaps up to , Math. Comp. 83 (2014), no. 288, 2033–2060.
19. J. Peetre, Outline of a scientific biography of Ernst Meissel (1826–1895), Hist. Math. 22 (1995), no. 2, 154–178.
20. D. J. Platt, Computing analytically, Math. Comp. 84 (2015), no. 293, 1521–1535.
21. G. G. Stokes, W. Thomson, J. Glaisher, and J. W. L. Glaisher, Report of the committee, consisting of Professor Cayley, Professor G. G. Stokes, Sir William Thomson, Mr. James Glaisher, and Mr. J. W. L. Glaisher, on mathematical tables, Report of the Fifty-Third Meeting of the British Association for the Advancement of Science, John Murray, London, 1884, pp. 118–126.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters