Tracking Many Solution Paths of a Polynomial Homotopy
on a Graphics Processing Unit^{†}^{†}thanks: This
material is based upon work supported by the National
Science Foundation under Grant ACI1440534.
Abstract
Polynomial systems occur in many areas of science and engineering. Unlike general nonlinear systems, the algebraic structure enables to compute all solutions of a polynomial system. We describe our massive parallel predictorcorrector algorithms to track many solution paths of a polynomial homotopy. The data parallelism that provides the speedups stems from the evaluation and differentiation of the monomials in the same polynomial system at different data points, which are the points on the solution paths. Polynomial homotopies that have tens of thousands of solution paths can keep a sufficiently large amount of threads occupied. Our accelerated code combines the reverse mode of algorithmic differentiation with double double and quad double precision to compute more accurate results faster.
Keywords. algorithmic differentiation, continuation methods, double double, Graphics Processing Unit (GPU), path tracking, polynomial system, polynomial homotopy, quad double.
1 Introduction
Many problems in computational algebraic geometry can be solved by computing solutions of polynomial systems. As the number of solutions can grow exponentially in the degrees, the number of variables and equations, the computational complexity of these problems are hard. GPUs provide a promising technology to deliver significant speedups over traditional processors, but may require a complete overhaul of the algorithms we use in our polynomial system solvers.
This paper describes the application of numerical continuation methods to track many solution paths defined by a polynomial homotopy. A polynomial homotopy is a family of polynomial systems in which the solutions depend on one real parameter . Starting at , there are many solution paths originating at known solutions of a start system. These paths end at , at solutions of a polynomial system we want to solve. A common homotopy links the start system to the target system linearly as
(1) 
where is a random complex constant. The random ensures that paths originating at the regular solutions of will stay regular for all . For this regularity result to hold, all arithmetic must be complex. Our problem is the tracking of solution paths in complex space.
The difficulty with implementing the predictorcorrector algorithms to track solution paths defined by polynomial homotopies is that the traditional formulation of the algorithms does not match the data parallelism for which GPUs are designed for. For instance, while each solution path can be tracked independently, the number of predictorcorrector stages may fluctuate significantly depending on the geometric shape of the path. The type of high level parallelism that is applied in distributed memory message passing or shared memory multithreaded processors does not apply to our problem. On GPUs, a relatively small number of multiprocessors can independently launch a large number of threads that perform the same synchronized sequences of instructions.
Applying Newton’s method as a corrector, in every step, we evaluate all polynomials and all their derivatives in the system. To achieve a high level of parallelism for this task, the terms in each polynomial are decomposed as the product of the variables that occur in the term with a positive exponent and the factor that is common in all derivatives of the term. The reverse mode of algorithmic differentiation [6] applied to each product of variables then reaches a high level of parallelism. Its cost matches optimal complexity bounds for the evaluation and differentiation problem. The linear systems in each Newton step we solve in the least squares sense via the modified GramSchmidt method. In [24] and [25] our computations were executed on randomly generated regular data sets. In [27], we integrated and improved the evaluation and differentiation codes to run Newton’s method on some selected benchmark polynomial systems. We extended this in [26] to track one single path of a polynomial homotopy on a GPU. The focus in this paper is on the tracking of many paths.
For applications, achieving speedup is not enough, the numerical results must be accurate as well. As the degrees and the number of solution paths increase, the numerical conditioning is likely to worsen as well. To improve the quality, we calculate with double double and quad double arithmetic, using the QD library [9] on the host and its CUDA version [16] on the device. While for complex double arithmetic, the evaluation and differentiation algorithms are memory bound, for complex double double and quad double arithmetic, these algorithms become compute bound. The double digit speedups compensate for the extra cost overhead caused by the extended precision. With the accelerated versions of our code we are often able to obtain more accurate results faster than without the use of GPUs. We obtain speedup and quality up.
Although solving polynomial systems may seem a very specific problem (we refer to [15] for a survey), many software packages have been developed for this problem, e.g.: Bertini [1], HOM4PS [5], HOM4PS2.0 [13], HOM4PS3 [2], PHoM [7], NAG4M2 [14], and HOMPACK [28, 29]. Many of these packages are still under active development. To the best of our knowledge, our code provides the first path tracker for polynomial systems for a GPU.
Related research in computer algebra concerns the implementation of polynomial operations on GPUs. Reports on this research are [8] and [17]. Computer algebra is geared towards exact computations, often over finite number fields. Our approach is numerical and we improve the accuracy of our results with double double and quad double arithmetic. This type of arithmetic is described in the section of errorfree transformations in [19]. Interval arithmetic on CUDA GPUs [3] is an alternative approach to improve the quality of numerical computations.
The description of our path tracker consists of two parts. First we define the scheduling of the threads and then we outline the organization of the memory. Polynomial systems with ten of thousands of solutions are at the bottom of the threshold for which we start to notice the beneficial effect of our accelerated codes.
2 SIMT Path Tracking
The Single Instruction Multiple Threads (SIMT) execution model in GPUs implies that threads are either executing the same instruction or they are idle. In our application, the threads are evaluating and differentiating the same polynomial system, at different approximations for the solutions along the path. In the SIMT model, we run the same arithmetic circuit at different data. Table 1 shows a simplified model.
path0  path1  path2 
predict  predict  predict 
evaluate  evaluate  evaluate 
correct  correct  correct 
evaluate  evaluate  
correct  correct  
evaluate  
correct 
We distinguish three stages in a predictorcorrector algorithm. The first stage is called predict in Table 1. The predictor consists of the application of an extrapolation algorithm, applied to each coordinate of the solution separately. Typically, a fourth order predictor uses the current and four previous points on the path to predict the coordinates of the solution. This stage is naturally parallel and has a linear cost in the dimension. In the evaluate stage, the polynomial system is evaluated and differentiated at the solution. As this stage can have a cost that is cubic (or higher) in the dimension, it is separate from the correct stage. In the correct stage, a linear system is solved to compute the update to the solution.
For every path we execute a prediction step and at least one evaluatecorrect step. Some paths may need two or even three such steps for Newton’s method to converge so the residuals and size of the updates are sufficiently small. While the instructions are the same, it is important that all the data are distinct. Not only the coordinates of each solution, but also the value for the continuation parameter and the step size differ. The length of the total execution time is bounded from below by the time required for the most difficult solution path.
For memory considerations, paths that have been tracked to their end are relabeled and their work space in memory is swapped to the end. In the schematic of Table 2, the paths of Table 1 are reordered.
job0  job1  job2 
predict0  predict1  predict2 
evaluate0  evaluate1  evaluate2 
correct0  correct1  correct2 
evaluate0  evaluate2  
correct0  correct2  
evaluate2  
correct2 
When we track one single path, as in [26], the step size control can be performed by the host. When tracking many solution paths, every solution path has its own step size and current value of the continuation parameter . In this situation, the step size control is executed on the device.
After each evaluation and correction, there is a check kernel to determine the success status is represented as 0, and 1. 0 is to continue, is failure and 1 is success. A parallel scan [20] to count of all paths with 0’s can generate the for the new round, see Table 3.
path0  path1  path2  path3  path4  

status  
scan for  
Thus, the only number passed between CPU and GPU each step is total number of paths to continue, which can be easily determine from the the last element of scan.
3 SIMT Evaluation and Differentiation
The speedups we obtain are mainly due to the fine granularity of the arithmetic circuits to evaluate and differentiate the polynomials in the system. Table 4 outlines the major differences in the organization of the algorithms for the host (CPU) and the device (GPU).
Pseudo code on the host:  
for  each polynomial do  
for  each monomial do  
1.  compute the coefficient for this monomial;  
2. evaluate the monomial and its derivative;  
3. add the values to the polynomials and to the Jacobian matrix.  
Pseudo code on the device:  
lau  nch the following three kernels  
1.  compute the coefficient  
for all monomials in all polynomials;  
2. evaluate the monomial and its derivatives  
for all monomials in all polynomials;  
3. add to the value of the polynomial and to the Jacobian matrix  
for all monomials in all polynomials. 
For evaluation of single path, the evaluated and differentated monomials are the operands in a long summation operation, executed to calculate the evaluated polynomials and the evaluated Jacobian matrix of the system. The values of the evaluated and differentiated monomials are positioned in an irregular pattern in memory. Therefore to the sum kernel, it appears as if the data is at random positions in memory. But for evaluations of multiple paths, all terms, at the same postion of Jacobian matrix, follow the same instruction to sum the same positions of its own path. If the matrix of evaluated and differentiated monomials is transposed, the sum kernel can benefit from memory coalescing. The transposition is illustrated in Table 5.



Instead of transposing matrix after monomial evaluation, we redesign the monomial kernel. The reverse mode [6] is used to generate vertical values and the same monomial of multiple paths are computed together in blocks. This directly fit our goal to transpose the monomial values. Also, all threads in each block shared the same instructions to evaluate monomials, which save the instruction reading time. The organization of the evaluation of a monomial and its derivatives is displayed in Table 6.
and its four derivatives evaluated  
path 0  path 1  path 2  
0  
1  
2  
7  
6  
3  
4  
5 
Compared with the tree mode [27], this consecutive mode has more memory bandwidth. Although monomial evaluation part has twice memory access than tree mode, summation has more speedup due to consecutive memory. Also, multiple threads in a single block use the same instruction to avoid redundant reading. Thus, this consecutive mode is more suitable for evaluation of multiple paths.
name  double  double double  quad double  

Mon  cyclic10  190.41  124.78  25.70 
nash8  206.68  143.30  27.62  
pieri44  209.47  147.31  27.32  
Sum  cyclic10  104.91  126.63  123.13 
nash8  121.38  128.52  126.56  
pieri44  87.26  80.41  77.56 
All path join its work space of evaluation vertically, and a relatively small matrix transpose of Jacobian is used before correction. Linear systems in the corrector are solved with a decomposition. Table 8 shows the organization of the memory.
instructions  work space for multiple path  

idx  cff  evaluatejoint  correct  
cff  mon  Jac  Jac’, 
4 Computational Results
We developed and executed our code on a Linux workstation. Our benchmark applications were selected for the diversity of the research areas and their size. Because our application benefit the most from the accelerated evaluation and differentiation algorithms, we report first on computations done separately from the path tracking.
4.1 Hardware and Software
Our code is compiled with version 6.5 of the CUDA Toolkit and gcc O2. A Red Hat Enterprise Linux workstation of Microway, with Intel Xeon E52670 processors at 2.6 GHz is the host for the NVIDIA Tesla K20C, which has 2496 cores with a clock speed of 706 MHz. To prepare the benchmark data we used Python, in particular phcpy [23], the Python interface to PHCpack [22].
4.2 Applications
We selected three examples of polynomial systems, which arose in different applications. The examples can be formulated for any number of equations and variables. We selected three systems and in each case we applied the homotopy (1) to solve the systems. Below is a brief description of each system:
(1) cyclic10: the cyclic 10roots problem is a 10dimensional system with 34,940 isolated complex solutions. Except for the last equation (which has two terms), every polynomial has 10 monomials. The th polynomial in this system is of degree . These roots appear in the study of complex Hadamard matrices [21].
(2) nash8: the solutions of this system give all totally mixed Nash equilibria in a game with 8 players, where each player has two pure strategies, see [4], [18]. For generic payoff matrices, this 8dimensional system has 14,833 equilibria. Every polynomial in this system has 130 monomials of degrees ranging from one till seven.
(3) pieri44: there are 24,024 four dimensional planes that meet 16 four dimensional planes, given in general position. This system is a 16dimensional problem and can be interpreted as a matrix completion problem [12], see also [10], [11]. Every polynomial in the system is of degree 4 and has 246 monomials.
As is typical for polynomial systems, the number of isolated solutions grows exponentially in the dimensions and the degrees. The systems we selected are large enough to notice a benefit of the accelerated code and small enough so we can still compute all isolated solutions. Table 9 summarizes their characteristics.
name  dim  #paths  #monomials 

cyclic10  10  34,940  92 
nash8  8  14,833  1,040 
pieri44  16  24,024  3,936 
The number of isolated solutions equals the number of solution paths in the polynomial homotopy. In a massively parallel application we launch about 10,000 of threads. In these applications, the parallelism comes from evaluating the same system at about 10,000 different solution paths.
4.3 Evaluations
In Tables 10, 11, and 13 we list times and speedups for the evaluation of the three systems. With the number of simultaneous evaluations we go as far as the memory of the device allows us. The speedups were computed by taking the time on the K20C over the time on one 2.6 GHz CPU. With the NVIDIA profiler, times on the GPU are reported for the three kernels (defined in Table 4), in the columns with headers mon, sum, and coeff.
Although the number of variables is small (10, 8, and 16), there are already sufficiently many monomials to achieve a large enough parallelism to obtain good speedups. For double arithmetic, the problem is memory bound. The speedups become really good in complex double double and quad double arithmetic, because then the problem is compute bound.
complex double arithmetic  

CPU  GPU  
#evals  total  mon  sum  coeff  total  speedup 
10  0.062  0.017  0.008  0.004  0.028  2.19 
20  0.078  0.020  0.008  0.004  0.033  2.39 
50  0.188  0.024  0.011  0.005  0.040  4.69 
100  0.379  0.030  0.016  0.006  0.051  7.39 
150  0.553  0.038  0.021  0.007  0.066  8.41 
200  0.732  0.042  0.026  0.008  0.076  9.60 
250  0.928  0.049  0.032  0.009  0.090  10.31 
300  1.132  0.056  0.037  0.010  0.103  11.04 
500  1.824  0.087  0.056  0.015  0.157  11.61 
750  2.786  0.126  0.079  0.021  0.226  12.32 
1000  3.748  0.155  0.101  0.026  0.282  13.30 
1250  4.748  0.203  0.127  0.032  0.363  13.08 
1500  5.563  0.235  0.149  0.039  0.423  13.14 
2000  7.381  0.299  0.191  0.050  0.540  13.67 
3000  11.148  0.459  0.284  0.082  0.826  13.50 
complex double double arithmetic  
CPU  GPU  
#evals  total  mon  sum  coeff  total  speedup 
10  0.587  0.066  0.011  0.011  0.088  6.65 
20  1.135  0.066  0.012  0.011  0.089  12.79 
50  2.808  0.072  0.017  0.012  0.101  27.90 
100  5.598  0.092  0.028  0.017  0.137  40.81 
150  8.601  0.121  0.036  0.022  0.179  48.03 
200  11.225  0.145  0.043  0.025  0.213  52.64 
250  13.951  0.154  0.053  0.029  0.236  59.11 
300  16.821  0.181  0.060  0.037  0.278  60.56 
500  27.912  0.263  0.092  0.052  0.408  68.47 
750  41.877  0.379  0.137  0.074  0.590  71.01 
1000  55.871  0.472  0.175  0.096  0.743  75.24 
1250  69.835  0.587  0.220  0.117  0.924  75.54 
1500  83.920  0.691  0.257  0.139  1.087  77.20 
2000  112.040  0.917  0.338  0.183  1.438  77.92 
3000  167.568  1.383  0.502  0.278  2.163  77.47 
complex quad double arithmetic  
CPU  GPU  
#evals  total  mon  sum  coeff  total  speedup 
10  5.572  0.632  0.042  0.072  0.705  7.91 
20  11.129  0.622  0.043  0.073  0.738  15.07 
50  27.769  0.633  0.054  0.075  0.762  36.44 
100  55.566  0.931  0.080  0.130  1.141  48.70 
150  83.369  1.213  0.098  0.179  1.491  55.92 
200  111.027  1.438  0.120  0.224  1.782  62.29 
250  138.872  1.428  0.144  0.235  1.808  76.82 
300  166.546  1.641  0.161  0.277  2.079  80.11 
500  277.978  2.486  0.257  0.436  3.178  87.46 
750  416.268  3.435  0.369  0.594  4.398  94.64 
1000  554.742  4.582  0.485  0.786  5.853  94.77 
1250  694.084  5.715  0.591  0.943  7.249  95.75 
1500  833.445  6.809  0.699  1.183  8.691  95.89 
2000  1111.412  8.916  0.929  1.532  11.377  97.69 
3000  1676.977  13.244  1.375  2.245  16.864  99.44 
complex double arithmetic  

CPU  GPU  
#evals  total  mon  sum  coeff  total  speedup 
10  0.311  0.042  0.050  0.015  0.106  2.92 
20  0.586  0.057  0.069  0.015  0.072  8.10 
50  1.417  0.079  0.075  0.027  0.181  7.81 
100  2.813  0.140  0.113  0.032  0.285  9.86 
150  4.181  0.202  0.134  0.045  0.380  11.00 
200  5.586  0.244  0.169  0.057  0.470  11.89 
250  6.927  0.295  0.187  0.064  0.546  12.68 
300  8.302  0.349  0.224  0.078  0.651  12.75 
500  13.834  0.567  0.314  0.125  1.006  13.75 
750  20.650  0.863  0.470  0.193  1.527  13.53 
1000  27.509  1.111  0.608  0.254  1.973  13.94 
1250  34.433  1.435  0.772  0.319  2.526  13.63 
1500  41.253  1.682  0.892  0.390  2.964  13.92 
2000  55.157  2.209  1.179  0.523  3.910  14.11 
3000  82.710  3.303  1.742  0.877  5.922  13.97 
complex double double arithmetic  
CPU  GPU  
#evals  total  mon  sum  coeff  total  speedup 
10  4.345  0.195  0.116  0.050  0.361  12.03 
20  8.664  0.201  0.125  0.056  0.382  22.66 
50  21.587  0.226  0.141  0.062  0.429  50.26 
100  43.239  0.411  0.219  0.120  0.750  57.68 
150  65.571  0.602  0.247  0.159  1.008  65.04 
200  86.489  0.762  0.321  0.215  1.297  66.67 
250  108.585  0.839  0.351  0.255  1.445  75.14 
300  130.030  1.016  0.422  0.299  1.737  74.85 
500  216.220  1.623  0.598  0.491  2.712  79.74 
750  325.524  2.445  0.910  0.719  4.074  79.90 
1000  431.826  3.203  1.182  0.957  5.341  80.86 
1250  540.026  4.057  1.501  1.197  6.755  79.94 
1500  647.817  4.778  1.722  1.439  7.939  81.60 
2000  864.464  6.361  2.299  1.936  10.596  81.58 
3000  1301.577  9.517  3.420  2.984  15.922  81.75 
complex quad double arithmetic  
CPU  GPU  
#evals  total  mon  sum  coeff  total  speedup 
10  43.425  1.956  0.502  0.506  2.964  14.65 
20  86.566  1.977  0.522  0.534  3.033  28.55 
50  216.214  2.154  0.552  0.537  3.244  66.66 
100  433.039  4.150  0.807  1.051  6.009  72.07 
150  650.377  5.990  0.878  1.569  8.437  77.09 
200  866.149  8.051  1.171  2.077  11.299  76.66 
250  1082.852  8.478  1.239  2.142  11.859  91.31 
300  1297.308  10.046  1.474  2.537  14.057  92.29 
500  2161.734  16.866  1.938  4.182  22.986  94.05 
750  3245.388  25.008  2.874  6.189  34.071  95.25 
1000  4327.603  33.228  3.852  8.173  45.253  95.63 
1250  5407.600  41.280  4.769  10.153  56.202  96.22 
1500  6497.880  51.975  5.514  12.679  70.168  92.61 
2000  8652.404  68.903  7.380  16.727  93.010  93.03 
3000  12977.386  100.940  10.799  24.771  136.510  95.07 
complex double arithmetic  

#paths  CPU  GPU  speedup 
10  0.152  0.196  0.77 
20  0.330  0.239  1.38 
50  0.815  0.292  2.79 
100  1.512  0.341  4.43 
200  2.894  0.462  6.26 
500  7.257  0.809  8.97 
1000  14.171  1.343  10.55 
2000  28.524  2.514  11.35 
5000  72.292  6.156  11.74 
complex double double arithmetic  
#paths  CPU  GPU  speedup 
10  2.130  0.595  3.58 
20  4.496  0.641  7.01 
50  11.215  0.720  15.59 
100  20.813  0.831  25.04 
200  40.018  1.124  35.62 
500  100.446  2.057  48.82 
1000  194.243  3.462  56.11 
2000  392.615  6.345  61.87 
5000  992.708  15.504  64.03 
complex quad double arithmetic  
#paths  CPU  GPU  speedup 
10  20.745  4.593  4.52 
20  42.969  4.835  8.89 
50  106.348  5.101  20.85 
100  198.098  5.926  33.43 
200  383.885  8.846  43.40 
500  986.145  16.407  60.10 
1000  1876.226  28.365  66.15 
2000  3805.213  52.710  72.19 
5000  9618.930  128.948  74.60 
complex double arithmetic  

CPU  GPU  
#evals  total  mon  sum  coeff  total  speedup 
10  1.129  0.137  0.138  0.049  0.324  3.48 
20  2.127  0.168  0.156  0.050  0.373  5.70 
50  5.223  0.239  0.208  0.097  0.544  9.60 
100  10.226  0.447  0.306  0.113  0.866  11.80 
150  15.303  0.654  0.396  0.162  1.212  12.63 
200  20.239  0.794  0.475  0.206  1.475  13.72 
250  25.369  0.971  0.575  0.237  1.783  14.23 
300  30.387  1.157  0.671  0.289  2.116  14.36 
500  50.778  1.890  1.113  0.471  3.474  14.62 
750  75.665  2.895  1.589  0.729  5.213  14.51 
1000  102.170  3.718  2.074  0.958  6.751  15.13 
1250  127.156  4.821  2.698  1.215  8.735  14.56 
1500  154.357  5.645  3.126  1.494  10.265  15.04 
2000  201.537  7.425  4.064  2.003  13.492  14.94 
3000  302.158  11.108  6.138  3.351  20.597  14.67 
complex double double arithmetic  
CPU  GPU  
#evals  total  mon  sum  coeff  total  speedup 
10  15.116  0.559  0.266  0.170  0.995  15.19 
20  29.886  0.582  0.287  0.202  1.071  27.91 
50  75.020  0.659  0.391  0.217  1.267  59.22 
100  151.854  1.263  0.573  0.437  2.273  66.80 
150  228.841  1.887  0.732  0.598  3.217  71.14 
200  298.554  2.425  0.907  0.781  4.113  72.59 
250  373.088  2.696  1.101  0.932  4.729  78.89 
300  447.931  3.264  1.287  1.113  5.664  79.09 
500  746.392  5.299  2.129  1.862  9.289  80.35 
750  1120.713  8.042  3.104  2.726  13.873  80.79 
1000  1491.030  10.570  4.080  3.649  18.299  81.48 
1250  1867.582  13.400  5.217  4.568  23.185  80.55 
1500  2251.611  15.851  6.006  5.510  27.367  82.27 
2000  2990.387  21.057  7.908  7.429  36.394  82.17 
3000  4478.135  31.423  12.001  11.455  54.879  81.60 
complex quad double arithmetic  
CPU  GPU  
#evals  total  mon  sum  coeff  total  speedup 
10  146.920  5.329  1.132  1.867  8.328  17.64 
20  293.975  5.369  1.188  1.935  8.493  34.61 
50  734.441  6.104  1.954  1.468  9.526  77.10 
100  1470.332  12.760  2.123  3.895  18.778  78.30 
150  2206.220  18.763  2.490  5.891  27.144  81.28 
200  2942.909  25.181  3.149  7.859  36.189  81.32 
250  3676.626  29.401  8.068  3.598  41.067  89.53 
300  4415.752  30.987  4.207  9.617  44.810  98.54 
500  7346.943  58.511  6.909  15.901  81.321  90.35 
750  11049.214  86.933  9.707  23.621  120.261  91.88 
1000  14697.217  113.970  13.027  31.309  158.306  92.84 
1250  18394.761  134.100  16.487  38.865  189.452  97.09 
1500  22045.021  177.920  19.023  48.569  245.512  89.79 
complex double arithmetic  

#paths  CPU  GPU  speedup 
10  0.757  0.506  1.50 
20  1.580  0.603  2.62 
50  3.883  0.890  4.36 
100  7.800  1.229  6.35 
200  15.813  1.801  8.78 
500  39.861  3.713  10.74 
1000  80.347  6.898  11.65 
2000  161.498  13.232  12.21 
5000  401.001  33.050  12.13 
complex double double arithmetic  
#paths  CPU  GPU  speedup 
10  11.307  2.042  5.54 
20  23.558  2.231  10.56 
50  58.339  3.010  19.38 
100  113.878  3.883  29.32 
200  232.249  5.120  45.36 
500  586.282  10.141  57.81 
1000  1183.342  18.317  64.60 
2000  2376.400  34.497  68.89 
complex quad double arithmetic  
#paths  CPU  GPU  speedup 
10  111.498  19.403  5.75 
20  234.984  20.642  11.38 
50  583.908  25.590  22.82 
100  1168.055  34.496  33.86 
200  2375.275  47.696  49.80 
500  5986.772  91.191  65.65 
1000  12075.740  165.244  73.08 
The corresponding speedups are shown in Figure 1. Observe that the more monomials the system has, the fewer number of evaluations (respectively the fewer number of paths) are required to reach double digit speedups.
4.4 path tracking
Results for tracking many paths for the cyclic 10roots problem are summarized in Table 15. Observe the quality up. Tracking 10,000 paths in double double arithmetic takes 10 seconds on the GPU, while on the CPU it takes 26.562 seconds in double arithmetic. With our accelerated code we obtain solutions in a precision that is twice as large in a time that is more than twice as fast.
complex double arithmetic  

#paths  CPU  GPU  speedup 
10  0.040  0.128  0.31 
20  0.075  0.139  0.54 
50  0.158  0.147  1.07 
100  0.277  0.155  1.79 
200  0.482  0.181  2.67 
500  1.239  0.250  4.96 
1000  2.609  0.432  6.03 
2000  5.341  0.768  6.96 
5000  13.358  1.711  7.81 
10000  26.562  3.334  7.97 
complex double double arithmetic  
#paths  CPU  GPU  speedup 
10  0.563  0.344  1.63 
20  1.082  0.386  2.80 
50  2.248  0.404  5.56 
100  3.706  0.421  8.81 
200  6.480  0.458  14.15 
500  16.802  0.729  23.05 
1000  35.683  1.315  27.14 
2000  83.601  2.397  34.87 
5000  210.287  5.246  40.09 
10000  414.332  10.063  41.18 
complex quad double arithmetic  
#paths  CPU  GPU  speedup 
10  5.859  2.696  2.17 
20  11.189  2.852  3.92 
50  24.018  2.866  8.38 
100  38.782  2.966  13.08 
200  67.703  3.568  18.97 
500  174.769  6.203  28.17 
1000  368.449  11.175  32.97 
2000  851.255  21.432  39.72 
5000  2164.485  48.495  44.63 
The data in Table 15 is visualized in Figure 2. Compared to the speedups for evaluation and differentiation, the speedups for path tracking are roughly about half of those of evaluation and differentiation, for double double and quad double arithmetic.
Table 12 lists times and speedups for tracking many paths of the Nash equilibrium system. In Figure 3 we visualize these data. Notice that, as the Nash equilibrium system has more monomials than the cyclic 10roots system, the speedup for nash8 are better than those form cyclic10. The speedups improve slightly for the Pieri problem, but with a larger of number of monomials the memory allows for fewer paths to be tracked simultaneously.
References
 [1] D. J. Bates, J. D. Hauenstein, A. J. Sommese, and C. W. Wampler. Bertini: Software for numerical algebraic geometry. Available at http://www.nd.edu/sommese/bertini/.
 [2] T. Chen, T.L. Lee, and T.Y. Li. Hom4PS3: a parallel numerical solver for systems of polynomial equations based on polyhedral homotopy continuation methods. In H. Hong and C. Yap, editors, Mathematical Software – ICMS 2014, 4th International Conference, Seoul, South Korea, August 49, 2014, Proceedings, volume 8592 of Lecture Notes in Computer Science, pages 183–190. SpringerVerlag, 2014.
 [3] S. Collange, M. Daumas, and D. Defour. Interval arithmetic in CUDA. In Wen mei W. Hwu, editor, GPU Computing Gems Jade Edition, pages 99–107. Elsevier, 2012.
 [4] R.S. Datta. Finding all nash equilibria of a finite game using polynomial algebra. Economic Theory, 42(1):55–96, 2009.
 [5] T. Gao, T.Y. Li, and X. Li. HOM4PS, 2002. Available at http://www.csulb.edu/tgao/ RESEARCH/Software.htm.
 [6] A. Griewank and A. Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM, second edition, 2008.
 [7] T. Gunji, S. Kim, M. Kojima, A. Takeda, K. Fujisawa, and T. Mizutani. PHoM – a polyhedral homotopy continuation method for polynomial systems. Computing, 73(4):55–77, 2004.
 [8] S. A. Haque, X. Li, F. Mansouri, M. M. Maza, W. Pan, and N. Xie. Dense arithmetic over finite fields with the CUMODP library. In H. Hong and C. Yap, editors, Mathematical Software – ICMS 2014, volume 8592 of Lecture Notes in Computer Science, pages 725–732. SpringerVerlag, 2014.
 [9] Y. Hida, X. S. Li, and D. H. Bailey. Algorithms for quaddouble precision floating point arithmetic. In 15th IEEE Symposium on Computer Arithmetic (Arith15 2001), 1117 June 2001, Vail, CO, USA, pages 155–162. IEEE Computer Society, 2001. Shortened version of Technical Report LBNL46996, software at http://crd.lbl.gov/dhbailey/ mpdist.
 [10] B. Huber, F. Sottile, and B. Sturmfels. Numerical Schubert calculus. J. Symbolic Computation, 26(6):767–788, 1998.
 [11] B. Huber and J. Verschelde. Pieri homotopies for problems in enumerative geometry applied to pole placement in linear systems control. SIAM J. Control Optim., 38(4):1265–1287, 2000.
 [12] M. Kim, J. Rosenthal, and X. Wang. Pole placement and matrix extension problems: A common point of view. SIAM J. Control. Optim., 42(6):2078–2093, 2004.
 [13] T. L. Lee, T.Y. Li, and C. H. Tsai. HOM4PS2.0: a software package for solving polynomial systems by the polyhedral homotopy continuation method. Computing, 83(23):109–133, 2008.
 [14] A. Leykin. Numerical algebraic geometry. The Journal of Software for Algebra and Geometry: Macaulay2, 3:5–10, 2011.
 [15] T.Y. Li. Numerical solution of polynomial systems by homotopy continuation methods. In F. Cucker, editor, Handbook of Numerical Analysis. Volume XI. Special Volume: Foundations of Computational Mathematics, pages 209–304. NorthHolland, 2003.
 [16] M. Lu, B. He, and Q. Luo. Supporting extended precision on graphics processors. In A. Ailamaki and P.A. Boncz, editors, Proceedings of the Sixth International Workshop on Data Management on New Hardware (DaMoN 2010), June 7, 2010, Indianapolis, Indiana, pages 19–26, 2010. Software at http://code.google.com/p/gpuprec/.
 [17] M. M. Maza and W. Pan. Solving bivariate polynomial systems on a GPU. ACM Communications in Computer Algebra, 45(2):127–128, 2011.
 [18] R.D. McKelvey and A. McLennan. The maximal number of regular totally mixed Nash equilibria. Journal of Economic Theory, 72:411–425, 1997.
 [19] S.M. Rump. Verification methods: Rigorous results using floatingpoint arithmetic. Acta Numerica, 19:287â449, 2010.
 [20] S. Sengupta, M. Harris, and M. Garland. Efficient parallel scan algorithms for GPUs. Technical Report NVR2008003, NVIDIA, 2008.
 [21] F. Szöllősi. Construction, classification and parametrization of complex Hadamard matrices. PhD thesis, Central European University, Budapest, 2011. arXiv:1110.5590v1.
 [22] J. Verschelde. Algorithm 795: PHCpack: A generalpurpose solver for polynomial systems by homotopy continuation. ACM Trans. Math. Softw., 25(2):251–276, 1999.
 [23] J. Verschelde. Modernizing PHCpack through phcpy. In P. de Buyl and N. Varoquaux, editors, Proceedings of the 6th European Conference on Python in Science (EuroSciPy 2013), pages 71–76, 2014.
 [24] J. Verschelde and G. Yoffe. Evaluating polynomials in several variables and their derivatives on a GPU computing processor. In Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops (PDSEC 2012), pages 1391–1399. IEEE Computer Society, 2012.
 [25] J. Verschelde and G. Yoffe. Orthogonalization on a general purpose graphics processing unit with double double and quad double arithmetic. In Proceedings of the 2013 IEEE 27th International Parallel and Distributed Processing Symposium Workshops (PDSEC 2013), pages 1373–1380. IEEE Computer Society, 2013.
 [26] J. Verschelde and X. Yu. Accelerating polynomial homotopy continuation on a graphics processing unit with double double and quad double arithmetic. arXiv:1501.06625v2.
 [27] J. Verschelde and X. Yu. GPU acceleration of Newton’s method for large systems of polynomial equations in double double and quad double arithmetic. In Proceedings of the 16th IEEE International Conference on High Performance Computing and Communication (HPCC 2014), pages 161–164. IEEE Computer Society, 2014.
 [28] L. T. Watson, S. C. Billups, and A. P. Morgan. Algorithm 652: HOMPACK: a suite of codes for globally convergent homotopy algorithms. ACM Trans. Math. Softw., 13(3):281–310, 1987.
 [29] L. T. Watson, M. Sosonkina, R. C. Melville, A.P. Morgan, and H.F. Walker. HOMPACK90: A suite of Fortran 90 codes for globally convergent homotopy algorithms. ACM Trans. Math. Softw., 23(4):514–549, 1997.