
# Tracking Many Solution Paths of a Polynomial Homotopy on a Graphics Processing Unit

## Abstract

Polynomial systems occur in many areas of science and engineering. Unlike general nonlinear systems, the algebraic structure makes it possible to compute all solutions of a polynomial system. We describe our massively parallel predictor-corrector algorithms to track many solution paths of a polynomial homotopy. The data parallelism that provides the speedups stems from evaluating and differentiating the monomials of the same polynomial system at different data points: the points on the solution paths. Polynomial homotopies that have tens of thousands of solution paths can keep a sufficiently large number of threads occupied. Our accelerated code combines the reverse mode of algorithmic differentiation with double double and quad double precision to compute more accurate results faster.

Keywords. algorithmic differentiation, continuation methods, double double, Graphics Processing Unit (GPU), path tracking, polynomial system, polynomial homotopy, quad double.

## 1 Introduction

Many problems in computational algebraic geometry can be solved by computing solutions of polynomial systems. As the number of solutions can grow exponentially in the degrees and in the number of variables and equations, these problems are computationally hard. GPUs provide a promising technology to deliver significant speedups over traditional processors, but may require a complete overhaul of the algorithms we use in our polynomial system solvers.

This paper describes the application of numerical continuation methods to track many solution paths defined by a polynomial homotopy. A polynomial homotopy is a family of polynomial systems in which the solutions depend on one real parameter t. Starting at t = 0, there are many solution paths originating at known solutions of a start system. These paths end at t = 1, at solutions of a polynomial system we want to solve. A common homotopy links the start system g(x) = 0 to the target system f(x) = 0 linearly as

 γ(1−t)g(x)+tf(x)=0, (1)

where γ is a random complex constant. The random γ ensures that paths originating at the regular solutions of g(x) = 0 will stay regular for all t < 1. For this regularity result to hold, all arithmetic must be complex. Our problem is the tracking of solution paths in complex n-space.
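To make the construction of the homotopy (1) concrete, the sketch below builds it for a toy one-dimensional example. This is a hedged illustration in plain Python, not the paper's GPU code; the helper name `homotopy` and the example systems `g` and `f` are hypothetical choices for this sketch.

```python
import cmath
import random

def homotopy(g, f, gamma):
    """Return h(x, t) = gamma*(1-t)*g(x) + t*f(x), as in equation (1)."""
    def h(x, t):
        return [gamma * (1 - t) * gv + t * fv for gv, fv in zip(g(x), f(x))]
    return h

# a random complex constant, drawn on the unit circle
gamma = cmath.exp(2j * cmath.pi * random.random())

# hypothetical start and target systems in one variable
g = lambda x: [x[0] ** 2 - 1]   # start system with known roots +1 and -1
f = lambda x: [x[0] ** 2 - 2]   # target system to be solved

h = homotopy(g, f, gamma)
```

At t = 0 the homotopy vanishes at the known roots of the start system, and at t = 1 it reduces to the target system, whatever the value of γ.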

The difficulty with implementing the predictor-corrector algorithms to track solution paths defined by polynomial homotopies is that the traditional formulation of the algorithms does not match the data parallelism for which GPUs are designed. For instance, while each solution path can be tracked independently, the number of predictor-corrector stages may fluctuate significantly depending on the geometric shape of the path. The type of high level parallelism that is applied in distributed memory message passing or shared memory multithreaded processors does not apply to our problem. On GPUs, a relatively small number of multiprocessors can independently launch a large number of threads that perform the same synchronized sequences of instructions.

Applying Newton’s method as a corrector, in every step, we evaluate all polynomials in the system and all their derivatives. To achieve a high level of parallelism for this task, the terms in each polynomial are decomposed as the product of the variables that occur in the term with a positive exponent and the factor that is common to all derivatives of the term. The reverse mode of algorithmic differentiation [6] applied to each product of variables then reaches a high level of parallelism. Its cost matches optimal complexity bounds for the evaluation and differentiation problem. We solve the linear systems in each Newton step in the least squares sense via the modified Gram-Schmidt method. In [24] and [25] our computations were executed on randomly generated regular data sets. In [27], we integrated and improved the evaluation and differentiation codes to run Newton’s method on some selected benchmark polynomial systems. We extended this in [26] to track a single path of a polynomial homotopy on a GPU. The focus in this paper is on the tracking of many paths.
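The reverse mode on a product of variables admits a well-known linear-time sketch: one forward sweep of prefix products, one backward sweep of suffix products, yielding the value and all partial derivatives. The sketch below is a minimal sequential version of this idea; the function name is hypothetical.

```python
def eval_and_diff_product(x):
    """Evaluate p = x[0]*...*x[n-1] and all partials dp/dx[k] in O(n):
    a forward sweep of prefix products, then a backward suffix sweep."""
    n = len(x)
    fwd = [1.0] * (n + 1)            # fwd[k] = x[0]*...*x[k-1]
    for k in range(n):
        fwd[k + 1] = fwd[k] * x[k]
    value = fwd[n]
    grad = [0.0] * n
    back = 1.0                       # running suffix product x[k+1]*...*x[n-1]
    for k in range(n - 1, -1, -1):
        grad[k] = fwd[k] * back      # all factors except x[k]
        back *= x[k]
    return value, grad
```

For a product of n variables, the naive approach costs O(n²) multiplications for all derivatives; this scheme needs about 3n, which is why the cost matches the optimal complexity bounds mentioned above.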

For applications, achieving speedup is not enough: the numerical results must be accurate as well. As the degrees and the number of solution paths increase, the numerical conditioning is likely to worsen as well. To improve the quality, we calculate with double double and quad double arithmetic, using the QD library [9] on the host and its CUDA version [16] on the device. While for complex double arithmetic the evaluation and differentiation algorithms are memory bound, for complex double double and quad double arithmetic these algorithms become compute bound. The double-digit speedups compensate for the extra cost caused by the extended precision. With the accelerated versions of our code we are often able to obtain more accurate results faster than without the use of GPUs. We obtain speedup and quality up.
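Double double arithmetic builds on error-free transformations: a sum of two doubles is represented exactly as a rounded result plus a rounding error. A minimal sketch of Knuth's TwoSum, the building block used in libraries such as QD [9], is shown below; this illustrates the principle only and is not the QD implementation itself.

```python
def two_sum(a, b):
    """Error-free transformation: return (s, e) with s = fl(a + b)
    and a + b = s + e exactly in floating-point arithmetic."""
    s = a + b
    z = s - a
    e = (a - (s - z)) + (b - z)
    return s, e
```

For example, adding 1.0 and 1e-17 in double precision rounds to 1.0, but the error term recovers the lost 1e-17; chaining such transformations gives a (high, low) pair with roughly twice the precision of a double.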

Although solving polynomial systems may seem a very specific problem (we refer to [15] for a survey), many software packages have been developed for this problem, e.g.: Bertini [1], HOM4PS [5], HOM4PS-2.0 [13], HOM4PS-3 [2], PHoM [7], NAG4M2 [14], and HOMPACK [28, 29]. Many of these packages are still under active development. To the best of our knowledge, our code provides the first path tracker for polynomial systems for a GPU.

Related research in computer algebra concerns the implementation of polynomial operations on GPUs. Reports on this research are [8] and [17]. Computer algebra is geared towards exact computations, often over finite number fields. Our approach is numerical and we improve the accuracy of our results with double double and quad double arithmetic. This type of arithmetic is described in the section of error-free transformations in [19]. Interval arithmetic on CUDA GPUs [3] is an alternative approach to improve the quality of numerical computations.

The description of our path tracker consists of two parts. First we define the scheduling of the threads and then we outline the organization of the memory. Polynomial systems with tens of thousands of solutions are at the threshold at which we start to notice the beneficial effect of our accelerated codes.

## 2 SIMT Path Tracking

The Single Instruction Multiple Threads (SIMT) execution model in GPUs implies that threads are either executing the same instruction or they are idle. In our application, the threads are evaluating and differentiating the same polynomial system, at different approximations for the solutions along the path. In the SIMT model, we run the same arithmetic circuit at different data. Table 1 shows a simplified model.

We distinguish three stages in a predictor-corrector algorithm. The first stage is called predict in Table 1. The predictor consists of the application of an extrapolation algorithm, applied to each coordinate of the solution separately. Typically, a fourth order predictor uses the current and four previous points on the path to predict the coordinates of the solution. This stage is naturally parallel and has a linear cost in the dimension. In the evaluate stage, the polynomial system is evaluated and differentiated at the solution. As this stage can have a cost that is cubic (or higher) in the dimension, it is separate from the correct stage. In the correct stage, a linear system is solved to compute the update to the solution.
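The predict stage can be sketched as polynomial extrapolation applied to each coordinate independently: with the current and four previous points on the path, the Lagrange interpolating polynomial of degree four is evaluated at the new parameter value. The sketch below is a sequential stand-in for one coordinate; the function name and argument order are hypothetical.

```python
def predict(ts, xs, t_new):
    """Fourth-order predictor for one solution coordinate: extrapolate to
    t_new by evaluating the Lagrange polynomial through (ts[i], xs[i]),
    where ts holds the current and four previous parameter values."""
    pred = 0.0
    for i, (ti, xi) in enumerate(zip(ts, xs)):
        w = xi
        for j, tj in enumerate(ts):
            if j != i:
                w *= (t_new - tj) / (ti - tj)
        pred += w
    return pred
```

Because each coordinate is extrapolated separately, the cost is linear in the dimension and the stage parallelizes naturally over coordinates and over paths.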

For every path we execute a prediction step and at least one evaluate-correct step. Some paths may need two or even three such steps for Newton’s method to converge, so that the residuals and the size of the updates are sufficiently small. While the instructions are the same, it is important that all the data are distinct. Not only the coordinates of each solution, but also the value of the continuation parameter t and the step size differ. The total execution time is bounded from below by the time required for the most difficult solution path.
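A minimal sequential sketch of the corrector on a single path, in one variable, shows the convergence test on both the residual and the update; the helper `correct` and its stopping parameters are hypothetical choices for this illustration, not the tolerances used in the paper.

```python
def correct(h, dh, x, t, tol=1e-12, max_steps=4):
    """Newton corrector for one path at parameter value t: iterate until
    both the residual and the update are below tol, or give up after
    max_steps evaluate-correct steps."""
    for _ in range(max_steps):
        r = h(x, t)
        dx = -r / dh(x, t)    # one Newton update (scalar case)
        x += dx
        if abs(r) < tol and abs(dx) < tol:
            return x, True    # success: path continues with a larger step
    return x, False           # failure: the step size should be reduced
```

In the SIMT setting, every path runs this same instruction sequence on its own data; paths that converge early go idle until the whole round finishes.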

For memory considerations, paths that have been tracked to their end are relabeled and their work space in memory is swapped to the end. In the schematic of Table 2, the paths of Table 1 are reordered.

When we track one single path, as in [26], the step size control can be performed by the host. When tracking many solution paths, every solution path has its own step size and current value of the continuation parameter t. In this situation, the step size control is executed on the device.

After each evaluation and correction, a check kernel determines the status of each path, encoded as a flag: 0 to continue, 1 for success, and a separate code for failure. A parallel scan [20] counting all paths with flag 0 generates the relabeling for the new round, see Table 3.

Thus, the only number passed between CPU and GPU in each step is the total number of paths to continue, which can easily be determined from the last element of the scan.
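The relabeling by scan can be sketched sequentially as a stream compaction: an exclusive prefix sum over the "continue" flags assigns each active path its new index, and the last element of the scan is the number of paths that continue. This is a stand-in for the parallel scan of [20]; the function name and status encoding follow the description above.

```python
def compact(statuses):
    """Relabel the active paths (status 0) with an exclusive prefix scan.
    Returns (new_index, total): new_index maps old path index to new
    index, total is the number of paths that continue."""
    flags = [1 if s == 0 else 0 for s in statuses]
    scan = [0] * (len(flags) + 1)          # exclusive scan, one extra slot
    for i, fl in enumerate(flags):
        scan[i + 1] = scan[i] + fl
    new_index = {i: scan[i] for i, fl in enumerate(flags) if fl == 1}
    total = scan[-1]                       # last element: paths to continue
    return new_index, total
```

On the GPU the scan runs in parallel in O(log n) steps, and only `total` crosses the PCIe bus back to the host.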

## 3 SIMT Evaluation and Differentiation

The speedups we obtain are mainly due to the fine granularity of the arithmetic circuits to evaluate and differentiate the polynomials in the system. Table 4 outlines the major differences in the organization of the algorithms for the host (CPU) and the device (GPU).

For the evaluation of a single path, the evaluated and differentiated monomials are the operands in a long summation, executed to compute the evaluated polynomials and the evaluated Jacobian matrix of the system. The values of the evaluated and differentiated monomials are positioned in an irregular pattern in memory, so to the sum kernel the data appear to be at random positions. But for the evaluation of multiple paths, all terms at the same position of the Jacobian matrix follow the same instruction to sum the same positions of their own path. If the matrix of evaluated and differentiated monomials is transposed, the sum kernel can benefit from memory coalescing. The transposition is illustrated in Table 5.

Instead of transposing the matrix after the monomial evaluation, we redesign the monomial kernel. The reverse mode [6] is used to generate the values vertically, and the same monomial of multiple paths is computed together in one block. This directly fits our goal to transpose the monomial values. Also, all threads in each block share the same instructions to evaluate monomials, which saves instruction reading time. The organization of the evaluation of a monomial and its derivatives is displayed in Table 6.
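The transposed layout can be sketched as storing the evaluated monomials monomial-major rather than path-major: the values of one monomial at all paths sit in consecutive memory, so consecutive threads (one per path) read consecutive addresses. The sketch below is a sequential illustration of that layout only; the helper names and the exponent-tuple encoding of a monomial are assumptions of this sketch.

```python
def eval_monomial(exponents, x):
    """Evaluate one monomial, given as a tuple of exponents, at point x."""
    v = 1.0
    for xi, e in zip(x, exponents):
        v *= xi ** e
    return v

def evaluate_monomials(paths, monomials):
    """Evaluate every monomial at every path point, stored monomial-major:
    values[m][p] holds monomial m at path p, so entries for consecutive
    paths are adjacent, mimicking the coalesced transposed layout."""
    return [[eval_monomial(mon, x) for x in paths] for mon in monomials]
```

With this layout, the sum kernel for one Jacobian entry reads the same row positions for all paths, and adjacent threads touch adjacent entries.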

Compared with the tree mode [27], this consecutive mode achieves higher memory bandwidth. Although the monomial evaluation part requires twice as many memory accesses as the tree mode, the summation gains more speedup thanks to the consecutive memory layout. Also, multiple threads in a single block execute the same instruction, which avoids redundant instruction reads. Thus, the consecutive mode is better suited for the evaluation of multiple paths.

All paths join their evaluation work spaces vertically, and a relatively small transpose of the Jacobian matrix is applied before the correction. The linear systems in the corrector are solved in the least squares sense with the modified Gram-Schmidt method. Table 8 shows the organization of the memory.
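A sequential sketch of the least squares solve in the corrector follows: modified Gram-Schmidt factors A = QR column by column, then a back substitution on R x = Qᵀb gives the update. This is a plain real-arithmetic illustration (the actual corrector works in complex extended precision); the function name is hypothetical.

```python
def mgs_least_squares(A, b):
    """Solve min ||A x - b|| via modified Gram-Schmidt QR:
    orthonormalize the columns of A, then back substitute R x = Q^T b."""
    m, n = len(A), len(A[0])
    Q = [list(c) for c in zip(*A)]        # work on the columns of A
    R = [[0.0] * n for _ in range(n)]
    for k in range(n):
        R[k][k] = sum(q * q for q in Q[k]) ** 0.5
        Q[k] = [q / R[k][k] for q in Q[k]]
        for j in range(k + 1, n):         # immediately deflate later columns
            R[k][j] = sum(Q[k][i] * Q[j][i] for i in range(m))
            Q[j] = [Q[j][i] - R[k][j] * Q[k][i] for i in range(m)]
    y = [sum(Q[k][i] * b[i] for i in range(m)) for k in range(n)]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):        # back substitution on R x = y
        x[k] = (y[k] - sum(R[k][j] * x[j] for j in range(k + 1, n))) / R[k][k]
    return x
```

The modified (as opposed to classical) variant subtracts each projection immediately, which is numerically more stable when the Jacobian is ill conditioned.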

## 4 Computational Results

We developed and executed our code on a Linux workstation. Our benchmark applications were selected for the diversity of the research areas and their size. Because our application benefits the most from the accelerated evaluation and differentiation algorithms, we report first on computations done separately from the path tracking.

### 4.1 Hardware and Software

Our code is compiled with version 6.5 of the CUDA Toolkit and gcc -O2. A Red Hat Enterprise Linux workstation of Microway, with Intel Xeon E5-2670 processors at 2.6 GHz is the host for the NVIDIA Tesla K20C, which has 2496 cores with a clock speed of 706 MHz. To prepare the benchmark data we used Python, in particular phcpy [23], the Python interface to PHCpack [22].

The double double and quad double arithmetic is provided by the QD library [9]. This QD library has been ported to GPUs, we used the code available at [16].

### 4.2 Applications

We selected three examples of polynomial systems, which arose in different applications and can be formulated for any number of equations and variables. In each case we applied the homotopy (1) to solve the system. Below is a brief description of each system:

(1) cyclic10: the cyclic 10-roots problem is a 10-dimensional system with 34,940 isolated complex solutions. Except for the last equation (which has two terms), every polynomial has 10 monomials. The i-th polynomial in this system is of degree i. These roots appear in the study of complex Hadamard matrices [21].

(2) nash8: the solutions of this system give all totally mixed Nash equilibria in a game with 8 players, where each player has two pure strategies, see [4], [18]. For generic payoff matrices, this 8-dimensional system has 14,833 equilibria. Every polynomial in this system has 130 monomials of degrees ranging from one to seven.

(3) pieri44: there are 24,024 four dimensional planes that meet 16 four dimensional planes, given in general position. This system is a 16-dimensional problem and can be interpreted as a matrix completion problem [12], see also [10], [11]. Every polynomial in the system is of degree 4 and has 246 monomials.

As is typical for polynomial systems, the number of isolated solutions grows exponentially in the dimensions and the degrees. The systems we selected are large enough to notice a benefit of the accelerated code and small enough so we can still compute all isolated solutions. Table 9 summarizes their characteristics.

The number of isolated solutions equals the number of solution paths in the polynomial homotopy. In a massively parallel application we launch about 10,000 threads. In these applications, the parallelism comes from evaluating the same system at points along about 10,000 different solution paths.

### 4.3 Evaluations

In Tables 10, 11, and 13 we list times and speedups for the evaluation of the three systems. With the number of simultaneous evaluations we go as far as the memory of the device allows us. The speedups were computed by taking the time on the K20C over the time on one 2.6 GHz CPU. With the NVIDIA profiler, times on the GPU are reported for the three kernels (defined in Table 4), in the columns with headers mon, sum, and coeff.

Although the number of variables is small (10, 8, and 16), there are already sufficiently many monomials to achieve a large enough parallelism to obtain good speedups. For double arithmetic, the problem is memory bound. The speedups become really good in complex double double and quad double arithmetic, because then the problem is compute bound.

The corresponding speedups are shown in Figure 1. Observe that the more monomials the system has, the fewer evaluations (respectively, the fewer paths) are required to reach double-digit speedups.

### 4.4 Path Tracking

Results for tracking many paths for the cyclic 10-roots problem are summarized in Table 15. Observe the quality up. Tracking 10,000 paths in double double arithmetic takes 10 seconds on the GPU, while on the CPU it takes 26.562 seconds in double arithmetic. With our accelerated code we obtain solutions in a precision that is twice as large in a time that is more than twice as fast.

The data in Table 15 is visualized in Figure 2. Compared to the speedups for evaluation and differentiation, the speedups for path tracking are roughly half of those for evaluation and differentiation, for double double and quad double arithmetic.

Table 12 lists times and speedups for tracking many paths of the Nash equilibrium system. In Figure 3 we visualize these data. Notice that, as the Nash equilibrium system has more monomials than the cyclic 10-roots system, the speedups for nash8 are better than those for cyclic10. The speedups improve slightly for the Pieri problem, but with a larger number of monomials the memory allows for fewer paths to be tracked simultaneously.

### Footnotes

1. This material is based upon work supported by the National Science Foundation under Grant ACI-1440534.

### References

1. D. J. Bates, J. D. Hauenstein, A. J. Sommese, and C. W. Wampler. Bertini: Software for numerical algebraic geometry. Available at http://www.nd.edu/sommese/bertini/.
2. T. Chen, T.-L. Lee, and T.-Y. Li. Hom4PS-3: a parallel numerical solver for systems of polynomial equations based on polyhedral homotopy continuation methods. In H. Hong and C. Yap, editors, Mathematical Software – ICMS 2014, 4th International Conference, Seoul, South Korea, August 4-9, 2014, Proceedings, volume 8592 of Lecture Notes in Computer Science, pages 183–190. Springer-Verlag, 2014.
3. S. Collange, M. Daumas, and D. Defour. Interval arithmetic in CUDA. In Wen mei W. Hwu, editor, GPU Computing Gems Jade Edition, pages 99–107. Elsevier, 2012.
4. R.S. Datta. Finding all nash equilibria of a finite game using polynomial algebra. Economic Theory, 42(1):55–96, 2009.
5. T. Gao, T.-Y. Li, and X. Li. HOM4PS, 2002. Available at http://www.csulb.edu/tgao/RESEARCH/Software.htm.
6. A. Griewank and A. Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM, second edition, 2008.
7. T. Gunji, S. Kim, M. Kojima, A. Takeda, K. Fujisawa, and T. Mizutani. PHoM – a polyhedral homotopy continuation method for polynomial systems. Computing, 73(4):55–77, 2004.
8. S. A. Haque, X. Li, F. Mansouri, M. M. Maza, W. Pan, and N. Xie. Dense arithmetic over finite fields with the CUMODP library. In H. Hong and C. Yap, editors, Mathematical Software – ICMS 2014, volume 8592 of Lecture Notes in Computer Science, pages 725–732. Springer-Verlag, 2014.
9. Y. Hida, X. S. Li, and D. H. Bailey. Algorithms for quad-double precision floating point arithmetic. In 15th IEEE Symposium on Computer Arithmetic (Arith-15 2001), 11-17 June 2001, Vail, CO, USA, pages 155–162. IEEE Computer Society, 2001. Shortened version of Technical Report LBNL-46996, software at http://crd.lbl.gov/dhbailey/mpdist.
10. B. Huber, F. Sottile, and B. Sturmfels. Numerical Schubert calculus. J. Symbolic Computation, 26(6):767–788, 1998.
11. B. Huber and J. Verschelde. Pieri homotopies for problems in enumerative geometry applied to pole placement in linear systems control. SIAM J. Control Optim., 38(4):1265–1287, 2000.
12. M. Kim, J. Rosenthal, and X. Wang. Pole placement and matrix extension problems: A common point of view. SIAM J. Control. Optim., 42(6):2078–2093, 2004.
13. T. L. Lee, T.-Y. Li, and C. H. Tsai. HOM4PS-2.0: a software package for solving polynomial systems by the polyhedral homotopy continuation method. Computing, 83(2-3):109–133, 2008.
14. A. Leykin. Numerical algebraic geometry. The Journal of Software for Algebra and Geometry: Macaulay2, 3:5–10, 2011.
15. T.-Y. Li. Numerical solution of polynomial systems by homotopy continuation methods. In F. Cucker, editor, Handbook of Numerical Analysis. Volume XI. Special Volume: Foundations of Computational Mathematics, pages 209–304. North-Holland, 2003.
16. M. Lu, B. He, and Q. Luo. Supporting extended precision on graphics processors. In A. Ailamaki and P.A. Boncz, editors, Proceedings of the Sixth International Workshop on Data Management on New Hardware (DaMoN 2010), June 7, 2010, Indianapolis, Indiana, pages 19–26, 2010. Software at http://code.google.com/p/gpuprec/.
17. M. M. Maza and W. Pan. Solving bivariate polynomial systems on a GPU. ACM Communications in Computer Algebra, 45(2):127–128, 2011.
18. R.D. McKelvey and A. McLennan. The maximal number of regular totally mixed Nash equilibria. Journal of Economic Theory, 72:411–425, 1997.
19. S.M. Rump. Verification methods: Rigorous results using floating-point arithmetic. Acta Numerica, 19:287–449, 2010.
20. S. Sengupta, M. Harris, and M. Garland. Efficient parallel scan algorithms for GPUs. Technical Report NVR-2008-003, NVIDIA, 2008.
21. F. Szöllősi. Construction, classification and parametrization of complex Hadamard matrices. PhD thesis, Central European University, Budapest, 2011. arXiv:1110.5590v1.
22. J. Verschelde. Algorithm 795: PHCpack: A general-purpose solver for polynomial systems by homotopy continuation. ACM Trans. Math. Softw., 25(2):251–276, 1999.
23. J. Verschelde. Modernizing PHCpack through phcpy. In P. de Buyl and N. Varoquaux, editors, Proceedings of the 6th European Conference on Python in Science (EuroSciPy 2013), pages 71–76, 2014.
24. J. Verschelde and G. Yoffe. Evaluating polynomials in several variables and their derivatives on a GPU computing processor. In Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops (PDSEC 2012), pages 1391–1399. IEEE Computer Society, 2012.
25. J. Verschelde and G. Yoffe. Orthogonalization on a general purpose graphics processing unit with double double and quad double arithmetic. In Proceedings of the 2013 IEEE 27th International Parallel and Distributed Processing Symposium Workshops (PDSEC 2013), pages 1373–1380. IEEE Computer Society, 2013.
26. J. Verschelde and X. Yu. Accelerating polynomial homotopy continuation on a graphics processing unit with double double and quad double arithmetic. arXiv:1501.06625v2.
27. J. Verschelde and X. Yu. GPU acceleration of Newton’s method for large systems of polynomial equations in double double and quad double arithmetic. In Proceedings of the 16th IEEE International Conference on High Performance Computing and Communication (HPCC 2014), pages 161–164. IEEE Computer Society, 2014.
28. L. T. Watson, S. C. Billups, and A. P. Morgan. Algorithm 652: HOMPACK: a suite of codes for globally convergent homotopy algorithms. ACM Trans. Math. Softw., 13(3):281–310, 1987.
29. L. T. Watson, M. Sosonkina, R. C. Melville, A.P. Morgan, and H.F. Walker. HOMPACK90: A suite of Fortran 90 codes for globally convergent homotopy algorithms. ACM Trans. Math. Softw., 23(4):514–549, 1997.