A revision of the subtractwithborrow random number generators
Abstract
The most popular and widely used subtractwithborrow generator, also known as RANLUX, is reimplemented as a linear congruential generator using large integer arithmetic with the modulus size of 576 bits. Modern computers, as well as the specific structure of the modulus inferred from RANLUX, allow for the development of a fast modular multiplication – the core of the procedure. This was previously believed to be slow and have too high cost in terms of computing resources. Our tests show a significant gain in generation speed which is comparable with other fast, high quality random number generators. An additional feature is the fast skipping of generator states leading to a seeding scheme which guarantees the uniqueness of random number sequences.
keywords:
Linear congruential generator; Subtractwithborrow generator; RANLUX; GMP;[author] Email address: sibid@uvic.ca
PROGRAM SUMMARY/NEW VERSION PROGRAM SUMMARY
Program Title: RANLUX++
Licensing provisions: GPLv3
Programming language: C++, C, Assembler
1 Introduction
The well known Linear Congruential Generator (LCG) is a recurrent sequence of numbers calculated as follows:
(1) 
where is the initial state or seed, – the multiplier, – the increment and – the modulus. The particular choice of the parameters , and with period – the minimal number when , can be found in the literature (1). Commonly used LCGs are limited to , and have poor statistical properties. Thus they are not used for MonteCarlo physical simulations.
This situation can be mitigated when reaches several hundreds or even thousand bits. The cost of the increased range of is to deal with arbitrary precision integer arithmetic which was believed to be prohibitively expensive for practical purposes. In the last two decades there has been tremendous progress in modern central processor units (CPU) especially for personal computers (PC) which can be employed for long arithmetic.
We have explored the possibility to use the long arithmetic in LCG to improve the quality of generated random numbers and found that, despite a substantial increase in calculations, the time to generate a single random number is not proportionally risen. In fact for some parameters, the computational time decreased compared to ordinary LCGs with machine word modulus size.
2 Subtractwithborrow generator
At this point no specific constraints on , and parameters of LCG have been applied. As a good starting point we choose the subtractwithborrow generator first introduced in (2) and the intimate connection with LCG has been shown as a part of the period calculation. The algorithm has been extensively studied in (3) to improve statistical quality of generated numbers. Based on this study the generator RANLUX (4) was developed and now it is widely used in physics simulations as well as in other fields where random numbers with high statistical quality are required. However the current method employed by RANLUX to achieve the high quality makes it one of the slowest generators on the market.
The definition of the subtractwithborrow generator is the following: let some integer greater than 1 also called the base and vector with the length , where and or the carry equals 0 or 1. Then define a recursive transformation of the vector with the rule:
(2) 
where and and also called the lags. As shown in the work (5), this recursion is equivalent to LCG with the modulus , the multiplier and with the relation:
(3) 
In the RANLUX generator the lags and with the base are chosen among other suggested parameters in (2), and thus the modulus is a prime number and the multiplier . With those parameters the period is equal to .
Due to the selected base the natural choice to keep the generator state is a vector of length 24 composed of 24bit numbers. This implementation uses the properties of the modulus to avoid long arithmetic calculations, and a single step equivalent to one modular multiplication that requires only subtraction of two 24bit numbers and carry propagation. In the original FORTRAN implementation, 24bit numbers were stored as floats to avoid at that time, a high cost integertofloat conversion.
2.1 Remainder
The simple structure of the modulus allows us to calculate the remainder using only additions, subtractions and bit shifts. The modulus and thus the generator state have size of bits and fits into 9 64bit machine words. The result of the product fits into 18 64bit machine words which can be represented as a 48 element array of 24bit numbers: . The number obtained by the procedure shown in Algorithm 1 is congruent to and . Note the product is also only bit shifting due to the simple structure of . The calculation of is a sum of carry bits of each arithmetic operation.
2.2 Skipping
Examining the result of a single step of Eq. 1 one can note that the main part of the number is preserved in its successor which is just rotated by 24 bits. This strong correlation is the reason of the poor statistical quality of the original subtractwithborrow generator (2). The bright idea developed in (3) is to apply the transformation (2) many times to break correlations between nearby states before using the state for actual physical simulation. The drawback of this method is obvious – all intermediate states have to be explicitly calculated even if they are not needed. Despite the single step being simple with small resource consumption, good statistical quality requires several hundred steps thus in total, the skipping requires a lot of time. This is a luxury to spend resources and not use the results. Thus socalled luxury levels were introduced as aliases for how many generatated numbers have to be wasted.
Using Eq. 1 we can efficienty skip numbers since all recurrent steps collapse to a single multiplication:
(4) 
where the factor is precomputed and thus the cost to calculate the next state with or without skipping is the same. Any state in the entire period can be calculated in no more than long multiplications using fast exponentiation by squaring which takes order of tens of sec on modern CPUs.
In the Table 1 the precomputed values of where the values of is taken from (4) are shown for illustrative purposes. In the initial rows, long chains of 0 or 1 in binary representation are clearly visible and this can be interepreted such that for each bit of the state only a few bits of the state contributes. Even at the highest luxury level 4 there are still some patterns observable and a demading user maybe not be completely satisfied. For such user the two last rows would be more attractive especially since it is for free! Such chaotic multipiers mean that if any single bit of the state is changed in the next step the altered state will be absolutely different from the unaltered one.
With explicit long multiplication, there is no need to keep the multiplier as a power of , it can be adjusted to get the full period, . As an example the number is a primitive root modulo and with this multiplier all numbers in the range will appear in the sequence only once with any initial from the same range.
luxury level  
0  24  \seqsplitfffffffffffffffffffffffefffffffffffffffffffffffffffffffffffffffffffffffffffffffffffe000000000000000000000001000000000000000000000000000000000000 
1  48  \seqsplit000000000000000000000002ffffffffffffffffffffffff000000000000000000000000000000000001fffffffffffffffffffffffc000000000000000000000001000000000001 
2  97  \seqsplitffffff000000000008000000000009fffffffffffefffffffffff1000000000000000000000006ffffff000004fffffffffff6ffffffffffec000000000001000000000015000001 
3  223  \seqsplit00028b000000000bba00000000026cfffffffff8e4fffffffff96000000000027b0000000007d0fffffffffe25ffffffffeef0fffffffffa0a000000000942000000000ba6000000 
4  389  \seqsplit0df0600000002ee0020000000b9242ffffffdf6604ffffffe4ab160000000d92ab0000001e93f2fffffff593cfffffffb9c8a6ffffffe525740000002c38960000002ecac9000000 
1024  \seqsplite1754cefa19deea6f58651c8ac11b437ba841c49eca3003ff0ef508f058cfdab6105ca16980e6a3ab12a823219e1cd0007281433953609f1cc9c5ca19cf7f0c6d3899b14b7c5ee90  
2048  \seqsplitb48c187cf5b22097492edfcc0cc8e753ff74e54107684ed2256c3d3c662ea36c20b2ca60cb78c5096d8a15a13bee7cb0e64dcb31c48228ec4cec2c78af55c101ed7faa90747aaad9 
The fast skipping also provides a seeding scheme which garantees noncolliding sequences of random numbers – what is needed is to just skip a large enough sequence. This is suggested in (5) using the seed number as a power of the multiplier:
(5) 
where is the seed, is the initial state which is the same for all seeds, is the starting state to deliver random numbers for the seed and is the seed skipping factor to guarantee noncolliding sequencies for any Monte Carlo simulation. Lets take which is so big that for any modern computer the required time to visit all states within the range exceeds the age of the Universe, even if it can generate a new state on every CPU clock cycle. In this case the maximum seed number is .
3 Implementation
The core procedure for efficient implemention of LCG is fast long multiplication. This provides the infrastructure to deliver random numbers in an efficient as well as convinient form to users. Programming languages such as C++, C or FORTRAN which are usually used in Monte Carlo physics simulations, have no native support of the long arithmetic despite all desktop grade CPUs providing hardware instructions suitable for it. The support of long arithmetic is provided by external libraries. We choose the GMP library (6) for proof of concept. This library is highly optimized for basic arithmetic operations with long numbers and supports a large number of CPU architectures. It was found that GMP is a great tool but some operations with long numbers still do not fit properly to the interface provided and the conversion to a proper format and back adds a significant overhead. This fact and for the sake of all inclusive package without external library dependencies, the core functions are written in the assembly language for the AMD64 CPU architecture. This should fit to virtually all computers used for physics simulations. For other architectures the long multiplication can be easily reimplemented. The GMP version of the generator was used to validate the resuts. The problem size is known in advance and this knowledge has been exploited to write the fast and efficient procedures.
Since inception of AMD64 CPU architecture the instruction set has been significantly expanded and new instructions profitable for long arithmetic were introduced. Specificaly the instruction mulx which streams the result of the multiplication to arbitrary registers and then the instructions adox and adcx the addition instructions which affect only corresponding overflow and carry flags of the status register and are suitable to break the dependency chain in long summations. To use advantages of the new instructions where it is possible, three versions of 576 by 576 bit multiplication were written and the best suitable version is selected and executed in runtime by a CPU dispatcher. In all cases the method to multiply numbers is a simple schoolbook method which requires 81 bit multiplications.
Considering the state as 576 bit entropy pool, to produce floating point numbers, 24 or 52 bits are fetched from the pool for single or double precision number correspondingly. The latter number has full precision of 53 bits but from the 576 bits, only 10 53bit numbers can be produced, wasting the remaining 46 bits of entropy which are expensive. A good compromise is to have random first 52 bits and waste only 4 bits of entropy. Conversion of bit strings to the floating point representation is done in batch into intermediate buffer from which the numbers delivered to a user. This strategy significantly improves overall performance.
4 Benchmarks
The benchmarks were conducted on three types of CPU – Core2 (Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz), Haswell (Intel(R) Core(TM) i74790K CPU @ 4.00GHz) and Skylake(Intel(R) Core(TM) i56200U CPU @ 2.30GHz). Since data for all benchmarks fits to L1 cache and there is not much dependence on main memory we present results of benchmarks in CPU clocks. It was observed that within CPU family benchmark results are stable despite different CPUs can have largely different clock frequencies. All measurements were done by the Linux perf program which provides convenient access to internal CPU performance counters.
The benchmark results of the specialized 9 by 9 64bit limbs multiplications together with the generic functions of GMP are shown in Table 2. It is seen that modern CPUs provide significant improvement in integer arithmetic especially in the straightforward implementation of the long multiplication the author made. Apriory knowledge of operand sizes gives additional performance boost compared to the GMP functions which work with arbitrary size vectors.
\backslashboxFunctionCPU type  Core2  Haswell  Skylake 

mul_basecase_core2  370.9  218.5  197.1 
mul_basecase_coreihwl  n.a.  205.6  203.7 
mul_basecase_coreibwl  n.a.  n.a.  163.0 
mul9x9  534.7  198.8  192.0 
mul9x9mulx  n.a.  162.1  154.5 
mul9x9mulxadox  n.a.  n.a.  119.4 
Algorithm 1 to calculate the remainder does not fit properly into the GMP function set and no elegant and performance wise ways were found, so only the author’s implementations are tested. The benchmarks of the modular multiplication are shown in Table 3 where the multiplication and the remainder calculation are joined into a single function to keep intermediate results in CPU registers. About 6.58.3 and 1418 clocks is needed to produce 24 and 52 bits of high quality entropy correspondingly for single and double precision numbers.
\backslashboxFunctionCPU type  Haswell  Skylake 

mulmod9x9mulx  201.2  191.6 
mulmod9x9mulxadox  n.a.  155.4 
It is interesting to compare different implementations to perceive how they can vary. Fair comparison of different implementations of random number generators requires special attention and microbenchmark results on modern CPUs have to be cautiously interpreted due to outoforder execution of instructions, memory accesses in real applications and so on. The small benchmark conducted is to sum random numbers uniformely distributed from 0 to 1. In this procedure we ensured that the explicit call instruction is emmitted to produce the random number in order to avoid inlining of a function body into the summation loop. The results of testing various random number generators are shown in Table 4.
The dummy function calculates nothing and simply immediately returns 0.5 demonstrating the overall overhead of the benchmark as well as its uncertainty. The “std” prefix shows functions from the recent C++11 standard described in the random header (7) implemented in GNU C++ compiler. For the subtractwithborrow functions implemented in the C++11 standard the double precision numbers, for the benchmarking purpose, have explicitly only 48 bits of randomness to avoid the situation when 96 bits of entropy produced and only 53 bits of them are used. The “gsl” prefix shows functions from the GNU Scientific Library (8). TRandom1 is the ROOT implementation of RANLUX (9). RANLUX is the original FORTRAN implementation (4). The functions ranlxs and ranlxd are SIMD implementations for single and double precision numbers correspondingly of M. Lüscher, the author of the skipping approach (3); (10). RANLUX++ is the suggested implementation using the long modular multiplication. The “array” comment shows another way to fetch produced random numbers – first fill a large array of numbers to avoid boundary checks for each number and then sum numbers from the array.
The result of the benchmark shows that the suggested implementation of LCGs with long modular multiplication is among the fastest random numbers generators. The other RANLUX implementations with the high statistical quality are an order of magnitude slower. In the current benchmark the suggested generator is even faster than the simpliest LCG in the C++ standard std::minstd_rand. With already theoretically proven statistical quality (3) also supported by empirical tests (11) as well as the seeding scheme guarantees noncolliding sequences this generator might be the best option for large scale parallel physical simulations.
\backslashboxGeneratorNumber type  double  float 
dummy  9.1  9.1 
std::minstd_rand  35.1  20.2 
std::mt19937_64  36.0  37.0 
std::ranlux24_base  47.2  26.0 
std::ranlux48_base  24.4  26.1 
std::ranlux24  387.0  197.7 
std::ranlux48  640.5  638.4 
gsl_ranlxs0  84.2  
gsl_ranlxs1  125.9  
gsl_ranlxs2  215.5  
gsl_ranlxd1  213.8  
gsl_ranlxd2 ()  394.0  
gsl_ranlux ()  185.0  
gsl_ranlux389 ()  315.4  
TRandom1 ()  362.7  
RANLUX ()  378.3  
ranlxs (array, SSE, )  50.4  
ranlxd (array, SSE, )  95.0  
RANLUX++  29.2  20.0 
RANLUX++ (array)  24.7  15.7 
5 Conclusion
We revised the approach that the large moduli in linear congruental generators are unpractical in MC physical simulations. We showed that LCG can be effciently implemented in case of specially selected modulus such as in the subtractwithborrow generator. The famous RANLUX random number generator was reimplemented using the long modular multiplication. The test results show an order of magnitude improvement in the generation speed on modern CPUs. The generator has a relatively small state vector of 72 bytes, it is fast and it has already proven excellent statistical properties and thus can become the best choice for large scale MC physical simulations. The source code of the generator as well as some of the benchmark tests can be obtained here (12).
Appendix A An efficient implementation of the subtractwithborrow generator
The step of the subtractwithborrow generator is very simple and the current RANLUX algorithm can be implemented in a way to take advantages of modern CPUs. At the default luxury level about 9 states are skipped so it is more efficient to skip an entire state, not numbers one by one. This immediately gives a performance boost since the array elements can be efficiently prefetched to CPU registers for processing. The carry bit can be calculated exploiting the two’s compliment format of signed integer number. The carry in this format is the most significant bit of the difference which is 1 for and 0 otherwise. Extracting the last 24 bits of the difference by bit masking makes the addition of in Eq. 2 redundant since it does not change those bits. The optimized version of the step shown in Algorithm 2 can be easily implemented in the C language. The algorithm is also well suited for a SIMD implementation similarly to what was done in (10) running several generators in parallel according to the width of SIMD registers. Several implementations of the algorithm were written and tested. The test results are shown with (the closest to and greater than or the highest luxury level 4 of the original RANLUX generator) in Table 5 for various types of optimization: “scalar” version is the straightforward implementation of Algorithm 2, “scalar(asm)” – main loop is optimized in the assembler language to access the hardware carry propagation, “SSE2” and “AVX2” are the optimizations using vector instructions. The generation speed for the latter case is so large that mainly the overhead of the benchmark itself is seen. There is an order of magnitude speedup compared to the results shown in Table 4.
The throughput of the skipping procedure itself without delivering numbers to the user corresponds to what one would expect from Algorithm 2: about 2 clock/(24 bit) for the scalar case, 0.5 clock/(24 bit) for SSE2 and 0.25 clock/(24 bit) for AVX2. The assembler implementation of the scalar case was written to access the hardware propagation of the carry bit, unfortunately the necessity to clear 8 most significant bits of each number also affects the status register with the carry bit. Thus this requires special treatment in the code and only the throughput about 1.5 clock/(24 bit) was achieved. These improvements can be easily applied in all current implementations of the subtractwithborrow generator.
scalar  scalar(asm)  SSE2  AVX2 
40.3  28.5  12.0  7.95 
Footnotes
 journal: Computer Physics Communications
References
 D. E. Knuth, The Art of Computer Programming, Volume 2 (3rd Ed.): Seminumerical Algorithms, AddisonWesley Longman Publishing Co., Inc., Boston, MA, USA, 1997.

G. Marsaglia, A. Zaman, A new class
of random number generators, The Annals of Applied Probability 1 (3) (1991)
462–480.
URL http://www.jstor.org/stable/2959748 
M. Lüscher,
A
portable highquality random number generator for lattice field theory
simulations, Computer Physics Communications 79 (1) (1994) 100 – 110.
doi:10.1016/00104655(94)902321.
URL http://www.sciencedirect.com/science/article/pii/0010465594902321 
F. James,
RANLUX:
A Fortran implementation of the highquality pseudorandom number
generator of Lüscher, Computer Physics Communications 79 (1) (1994)
111 – 114.
doi:10.1016/00104655(94)90233X.
URL http://www.sciencedirect.com/science/article/pii/001046559490233X 
S. Tezuka, P. L’Ecuyer, R. Couture,
On the lattice structure of
the addwithcarry and subtractwithborrow random number generators, ACM
Trans. Model. Comput. Simul. 3 (4) (1993) 315–331.
doi:10.1145/159737.159749.
URL http://doi.acm.org/10.1145/159737.159749 
Gnu multiple precision arithmetic library.
URL https://gmplib.org/ 
Draft:
C++ international standard.
URL http://www.openstd.org/JTC1/SC22/WG21/docs/papers/2011/n3242.pdf 
GNU scientific library.
URL https://www.gnu.org/software/gsl/ 
Data analysis framework.
URL https://root.cern.ch/ 
Ranlux: Random number
generator.
URL http://luscher.web.cern.ch/luscher/ranlux/ 
P. L’Ecuyer, R. Simard,
Testu01: A c library for
empirical testing of random number generators, ACM Trans. Math. Softw.
33 (4) (2007) 22:1–22:40.
doi:10.1145/1268776.1268777.
URL http://doi.acm.org/10.1145/1268776.1268777 
Source code repository.
URL https://github.com/sibidanov/ranluxpp/