EigenKernel ^{†}^{†}thanks: The present research was partially supported by JSTCREST project of ’Development of an EigenSupercomputing Engine using a PostPetascale Hierarchical Model’, Priority Issue 7 of the postK project and KAKENHI funds (16KT0016,17H02828). OakforestPACS was used through the JHPCN Project (jh170058NAHI) and through Interdisciplinary Computational Science Program in Center for Computational Sciences, University of Tsukuba. The K computer was used in the HPCI System Research Projects (hp170147, hp170274, hp180079, hp180219). Several computations were carried out also on the facilities of the Supercomputer Center, the Institute for Solid State Physics, the University of Tokyo.
Abstract
An opensource middleware named EigenKernel was developed for use with parallel generalized eigenvalue solvers or largescale electronic state calculation to attain high scalability and usability . The middleware enables the users to choose the optimal solver, among the three parallel eigenvalue libraries of ScaLAPACK, ELPA, EigenExa and hybrid solvers constructed from them, according to the problem specification and the target architecture. The benchmark was carried out on the OakforestPACS supercomputer and reveals that ELPA, EigenExa and their hybrid solvers show a strong scaling property, while the conventional ScaLAPACK solver has a severe bottleneck in scalability. A performance prediction function was also developed so as to predict the elapsed time as the function of the number of used nodes (). The prediction is based on Bayesian inference and the test calculation indicates that the method is applicable not only to performance interpolation but also to extrapolation. Such a middleware is of crucial importance for applicationalgorithmarchitecture codesign among the current , nextgeneration (exascale), and futuregeneration (postMoore era) supercomputers.
Keywords:
Middleware Generalized eigenvalue problem Bayesian inference Performance prediction Parallel processing Autotuning∎
1 Introduction
Efficient computation with the current and upcoming (both exascale and postMoore era) supercomputers can be realized by applicationalgorithmarchitecture codesign Shalf11 ; Dosanjh14 ; POSTK ; CoDEx ; EuroExa , in which various numerical algorithms should be prepared and the optimal one should be chosen according to the target application, architecture and problem. For example, an algorithm designed to minimize the floatingpoint operation count can be the fastest for some combination of application and architecture, while another algorithm designed to minimize communications (e.g. the number of communications or the amount of data moved) can be the fastest in another situation. The present paper proposes a middleware that helps the user to select an optimal solver routine according to the target application and architecture.
Figure 1 shows the concept of our middleware. There are several solver routines, , that provide equivalent mathematical function such as eigenvalue computation, but use different algorithms. Consequently, they show different performance characteristics and the optimal routine depends not only on the applications denoted as but also on the architectures denoted as . Our middleware assists the user to use the optimal routine by preparing the following functions. First, it provides a unified interface to the solver routines. In general, different solvers have different user interface, such as the matrix distribution scheme, so the user is often required to rewrite the application program to switch from one solver to another. Our middleware absorbs this difference and frees the user from this troublesome task. Second, it outputs detailed performance data such as the elapsed time of each subroutine composing the solver. Such data will be useful for detecting the performance bottleneck and finding causes of it, as will be illustrated in this paper. In addition, a performance prediction function, which predicts the elapsed time of the solver routines from existing benchmark data prior to actual computations, will be provided. Such a function is constructed with Bayesian inference in this paper. Performance prediction will be valuable for choosing an appropriate job class in the case of batch execution, or choosing an optimal number of computational nodes that can be used efficiently without performance saturation. Moreover, performance prediction can also be used to realize autotuning, in which the user need not care about the job class and detailed calculation conditions. In this way, our middleware is expected to enhance the usability of existing solver routines and allows the users to concentrate on computational science itself.
Here we focus on a middleware for the generalized eigenvalue problem (GEP) with realsymmetric coefficient matrices, since GEP forms the numerical foundation of electronic state calculations. Some of the authors developed a prototype of such middleware on the K computer in 20152016 IMACHIJIT2016 ; HOSHI2016SC16 . After that, the code appeared at GITHUB as EigenKernel ver. 2017 EIGENKERNELURL under the MIT license. It was confirmed that EigenKernel ver. 2017 works well also on OakleafFX10 and Xeonbased supercomputers IMACHIJIT2016 . In 2018, a new version of EigenKernel was developed and appeared on the developer branch at GITHUB. This version can run on the OakforestPACS supercomputer, a new supercomputer equipping Intel Xeon Phi manycore processors. A performance prediction tool using Python is also being developed to be built into a future version of EigenKernel.
In this paper, we take up this version and discuss two topics. First, we show the performance data of various GEP solvers on OakforestPACS obtained using EigenKernel. Such data will be of interest on its own since OakforestPACS is a new machine and few performance results of dense matrix solvers on it have been reported; stencilbased application Hirokawa18 and communicationavoiding iterative solver for a sparse linear system Idomura17 were evaluated on OakforestPACS, but their characteristics are totally different from those of dense matrix solvers such as GEP solvers. Furthermore, we point out that one of the solvers has a severe scalability problem and investigate the cause of it with the help of the detailed performance data output by EigenKernel. This illustrates how EigenKernel can be used effectively for performance analysis. Second, we describe the new performance prediction function implemented as a Python program. It uses Bayesian inference and predicts the execution time of a specified GEP solver as a function of the number of computational nodes. We present the details of the mathematical performance models used in it and give several examples of performance prediction results. It is to be noted that our performance prediction function can be used not only for interpolation but also for extrapolation, that is, for predicting the execution time at a larger number of nodes from the results at a smaller number of nodes. There is a strong need for such prediction among application users.
This paper is organized as follows. The algorithm for the GEP, the GEP solvers adopted in EigenKernel, and the features of EigenKernel are described in Sec. 2. Sec. 3 is devoted to the scalability analysis of various GEP solvers on OakforestPACS, which was made possible with the use of EigenKernel. Sec. 4 explains our new performance prediction function, focusing on the performance models used in it and the performance prediction results in the case of extrapolation. Sec. 5 discusses the advantages and limitations of our performance prediction method, comparing it with some existing studies. Finally Sec. 6 provides summary of this study and some future outlook.
2 EigenKernel
EigenKernel is a middleware for GEP that enables the user to use optimal solver routines according to the problem specification (matrix size, etc.) and the target architecture. In this section, we first review the algorithm for solving GEP and describe the solver routines adopted by EigenKernel. Features of EigenKernel are also discussed.
2.1 Generalized eigenvalue problem and its solution
We consider the generalized eigenvalue problem
(1) 
where the matrices and are real symmetric ones and is positive definite (). The th eigenvalue or eigenvector is denoted as or , respectively (). The algorithm to solve (1) proceeds as follows. First, the Cholesky decomposition of is computed, producing an upper triangle matrix that satisfies
(2) 
Then the problem is reduced into a standard eigenvalue problem (SEP)
(3) 
with the realsymmetric matrix of
(4) 
When the SEP of Eq. (3) is solved, the eigenvector of the GEP is obtained by
(5) 
2.2 GEP Solvers and hybrid workflows
EigenKernel builds upon three parallel libraries for GEP: ScaLAPACK SCALAPACK , ELPA ELPAWEB and EigenExa EigenExaWeb . Reflecting the structure of the GEP algorithm stated above, all of the GEP solvers from these libraries consist of two routines, namely, the SEP solver and the reducer. EigenKernel allows the user to select the SEP solver from one library and the reducer from another library, by providing appropriate data format/distribution conversion routines. We call the combination of an SEP solver and a reducer a hybrid workflow, or simply workflow. Hybrid workflows enable the user to attain maximum performance by choosing the optimal SEP solver and reducer independently.
Among the three libraries adopted by EigenKernel, ScaLAPACK is the de facto standard parallel numerical library. However, it was developed mainly in 1990’s and thus some of its routines show severe bottlenecks on current supercomputers. Novel solver libraries of ELPA and EigenExa were proposed, so as to overcome the bottlenecks in eigenvalue problems. EigenKernel v.2017 were developed mainly in 20152016, so as to realize hybrid workflows among the three libraries. The ELPA code was developed in Europe under the tightcollaboration between computer scientists and material science researchers and its main target application is FHIaims (Fritz Haber Institute ab initio molecular simulations package) FHIAIMS , a famous electronicstate calculation code. The EigenExa code, on the other hand, was developed at RIKEN in Japan. It is an important fact that the ELPA code has routines optimized for X86, IBM BlueGene and AMD architectures ELPA , while the EigenExa code was developed so as to be optimal mainly on the K computer. The above fact motivates us to develop a hybrid solver workflow so that we can achieve optimal performance for any problem on any architecture. EigenKernel supports only limited versions of ELPA and EigenExa, since the update of ELPA or EigenExa requires us, sometimes, to modify the interface routine without backward compatibility. EigenKernel v.2017 supports ELPA 2014.06.001 and EigenExa 2.3c. In 2018, EigenKernel was updated in the developer branch on GITHUB and can run on OakforestPACS. The benchmark on OakforestPACS in this paper was carried out by the code with the commit ID of 373fb83 that appeared at Feb 28, 2018 on GITHUB. The code is called the ‘current’ code hereafter and supports ELPA v. 2017.05.003 and EigenExa 2.4p1.
Figure 2 shows the possible workflows in EigenKernel. The reducer can be chosen from the ScaLAPACK routine and the ELPAstyle rourine , and the difference bewteen them is discussed later in this paper. The SEP solver for Eq. (3) can be chosen from the five routines: the ScaLAPACK routine denoted as ScaLAPCK, two ELPA routines denoted as ELPA1 and ELPA2 and two EigenExa routines denoted as Eigen_s and Eigen_sx. The ELPA1 and Eigen_s routines are based on the conventional tridiagonalization algorithm like the ScaLAPACK routine but are different in their implementations. The ELPA2 and Eigen_sx routines are based on nonconventional algorithms for modern architectures. Details of these algorithms can be found in the references (see ELPA11 ; ELPA ; ELPAWEB for ELPA and Imamura11 ; EIGENEXA ; FukayaPDSEC15 ; EigenExaWeb for EigenExa).
EigenKernel provides the seven solver workflows for GEP, which are listed as in Table 1. The names of the workflows are the same as those in our previous paper IMACHIJIT2016 .
2.3 Features of EigenKernel
As stated in Introduction, EigenKernel prepares basic functions to assist the user to use the optimal workflow for GEP. First, it provides a unified interface to the GEP solvers. When the SEP solver and the reducer are chosen from different libraries, the conversion of data format and distribution are also performed automatically by EigenKernel. Second, it outputs detailed performance data such as the elapsed times of internal routines of the SEP solver and reducer for performance analysis. The data file is written in JSON (JavaScript Object Notation) format. This data file is used by the performance prediction tool to be discussed in Sec. 4. The tool estimates the execution time of a specified GEP workflow as a function of the number of computational nodes.
In addition to these, EigenKernel has additional features so as to satisfy the needs among application researchers: (I) It is possible to build EigenKernel only with ScaLAPACK. This is because there are supercomputer systems in which ELPA or EigenExa are not installed. (II) The package contains a miniapplication called EigenKernelapp, a standalone application. This miniapplication can be used for real researches, as in Ref. HOSHI2016SC16 , if the matrix data are prepared as files in the Matrix Market format.
It is noted that there is another reducer routine called EigenExastyle reducer that appears in our previous paper IMACHIJIT2016 but is no longer supported by EigenKernel. This is mainly because the code (KMATH_EIGEN_GEV)EIGENGEV requires EigenExa but is not compatible with EigenExa 2.4p1. Since this reducer uses the eigendecomposition of the matrix , instead of the Cholesky decomposition of Eq. (2), its elapsed time is always larger than that of the SEP solver. Such a reducer routine is not efficient, at least judging from the benchmark data of the SEP solvers reported in the previous paper IMACHIJIT2016 and in this paper.
Workflow  SEP solver  Reducer 

A  ScaLAPACK  ScaLAPACK 
B  Eigen_sx  ScaLAPACK 
C  ScaLAPACK  ELPA 
D  ELPA2  ELPA 
E  ELPA1  ELPA 
F  Eigen_s  ELPA 
G  Eigen_sx  ELPA 
3 Scalability analysis on OakforestPACS
In this section, we demonstrate how EigenKernel can be used for performance analysis. We first show the benchmark data of various GEP workflows on OakforestPACS obtained using EigenKernel. Then we analyze the performance bottleneck found in one of the workflows with the help of the detailed performance data output by EigenKernel.
3.1 Benchmarks data for different workflows
The benchmark test was carried out on OakforestPACS , so as to compare the elapsed time among the workflows. OakforestPACS is a massively parallel supercomputer operated by Joint Center for Advanced High Performance Computing (JCAHPC) JCAHPC . It consists of 8,208 computational nodes connected by the Intel OmniPath network. Each node has an Intel Xeon Phi 7250 manycore processor with 3TFLOPS of peak performance. Thus, the aggregate peak performance of the system is more than 25PFLOPS. In the present benchmark test, the MPI/OpenMP hybrid parallelism was used and the number of the used nodes is set to be the number of processes in MPI. The maximum number of nodes in the benchmark is , a quater of the whole system. The test numerical problem is ‘VCNT90000’ , which appears in ELSES matrix library ELSESMATRIX . The matrix size of the problem is . The problem comes from the simulation of a vibrating carbon nanotube (VCNT) calculated by ELSES ELSES ; ELSESWEB , a quantum nanomaterial simulator. The matrices of and in Eq. (1) were generated with an ab initiobased modeled (tightbinding) electronicstate theory CERDA .
The calculations were carried out by the workflows of and in Table 1^{1}^{1}1 The results of the workflows and can be estimated from those of the other workflows, since the SEP solver and the reducer in the workflows and appear in the other workflows. It is concluded from the present benchmark that these two workflows are not efficient. . Table 2 shows the total elapsed time for the GEP solver, denoted as , and that only for the SEP solver, denoted as , as the function of the number of used nodes . The elapsed time for the reducer is defined as IMACHIJIT2016 .
The benchmark data is summarized in Fig. 3. In Fig. 3(a), all the workflows but show a strong scaling property, in the sense that the elapsed time decreases with the number of nodes . The optimal workflow seems to be or among the benchmark data. Here one finds that the workflow , the pure ScaLAPACK workflow, shows a severe difficulty in scalability, because the speedup is ‘saturated’ at . Although the same difficulty appears also in the benchmark data on the K computer IMACHIJIT2016 , there the speedup saturation does not occur until . Thus, the difficulty seems to be severer in the case of OakforestPACS. Figure 3(b) shows the elapsed time only for the SEP solver, which scales well up to for all the workflows. This indicates that the scalability problem in the workflow stems from the reducer.
3.2 Scalability bottleneck in pure ScaLAPACK workflow
More detailed performance data are shown in Table 3 and in Fig. 4 for the workflow or the pure ScaLAPACK workflow. The total elapsed time, denoted as , is decomposed into six terms; the five terms are those for the ScaLAPACK routines of pdsytrd, pdsygst, pdstedc, pdormtr and pdotrf. The elapsed times for these routines are denoted as , , , and , respectively. The elapsed time for the rest part is defined as . These timing data are output by EigenKernel automatically in JSON format.
Figure 4 indicates that the performance bottleneck is caused by the routine of pdsygst, in which the reduced matrix is generated by Eq. (4). A possible cause for the low scalability of pdsygst is the algorithm used in it, which exploits the symmetry of the resulting matrix to reduce the computational cost WILKINSON . Although this is optimal on sequential machines, it brings about some data dependency and can cause a scalability problem on massively parallel machines. On the other hand, the ELPA reducer forms the inverse matrix explicitly and computes Eq. (4) directly using matrix multiplication ELPA . While this algorithm is computationally more expensive, it has a larger degree of parallelism and can be more suited for massively parallel machines.
In principle, the performance analysis such as given here could be done manually. However, it requires the user to insert a timer into every internal routine and output the measured data in some organized format. Since EigenKernel takes care of these kinds of troublesome tasks, it makes performance analysis easier for nonexpert users. In addition, since performance data obtained in practical computation is sometimes valuable for finding performance issues that rarely appears in development process, this feature can contribute to the codesign of software.
# nodes  

16  7469(3768)  1965(1191)  2382(1634)  1757(889.5)  2054(1289) 
32  3865(2065)  1081(584.9)  1314(870.1)  1089(543.9)  1324(775.3) 
64  4550(1804)  717.7(395.3)  824.2(506.4)  572.0(294.7)  745.7(411.9) 
128  3282(1213)  585.9(283.7)  684.3(384.1)  600.3(275.0)  734.1(409.1) 
256  2422(977.3)  387.8(205.8)  442.4(305.6)  333.6(187.4)  411.5(263.4) 
512  2493(728.5)  271.6(143.9)  381.1(253.3)  335.3(190.1)  412.6(270.1) 
1024  2851(598.6)  223.07(125.1)  313.94(216.3)  253.64(153.1)  309.22(210.2) 
2048  2914(484.6)  199.1(105.6)  330.3(234.5)  281.1(176.5)  348.2(239.7) 
# nodes  total  pdsytrd  pdsygst  pdstedc  pdormtr  pdotrf  rest 

16  7469.6  2656.9  2869.5  666.13  444.81  522.44  309.84 
32  3865.2  1506.5  1539.7  344.99  213.23  63.227  197.58 
64  4549.5  1455.8  2369.9  269.49  78.451  263.72  112.14 
128  3281.9  1018.0  1918.7  153.81  41.524  45.828  104.15 
256  2421.8  817.58  1234.2  130.70  29.020  137.66  72.551 
512  2493.0  621.97  1690.3  89.606  16.938  27.057  47.145 
1024  2851.0  496.34  2138.2  84.005  18.235  73.518  40.673 
2048  2913.7  406.67  2383.0  67.194  10.774  16.379  29.605 
4 Performance prediction
4.1 The concept
This section proposes to use Bayesian inference as a tool for performance prediction, in which the elapsed time is predicted from teacher data or existing benchmark data. The importance of performance modeling and prediction has long been recognized by library developers. In fact, in a classical paper published in 1996 Dackland96 , Dackland and Kågström write, “we suggest that any library of routines for scalable high performance computer systems should also include a corresponding library of performance models”. However, there have been few parallel matrix libraries equipped with performance models so far. The performance prediction function to be described in this section will be incorporated in the future version of EigenKernel and will form one of the distinctive features of the middleware.
Performance prediction can be used in a variety of ways. Supercomputer users should prepare a computation plan that requires estimated elapsed time, but it is difficult to predict the elapsed time from hardware specifications, such as peak performance, memory and network bandwidths and communication latency. The performance prediction realizes high usability since it can predict the elapsed time without requiring huge benchmark data. Moreover, the performance prediction enables an autooptimization (autotuning) function, which selects the optimal workflow in EigenKernel automatically given the target machine and the problem size. Such high usability is crucial, for example, in electronic state calculation codes, because the codes are used not only among theorests but also among experimentalists and industrial researchers who are not familiar with HPC techniques.
The present paper focuses not only on performance interpolation but also on extrapolation, which predicts the elapsed time at a larger number of nodes from the data at a smaller number of nodes. This is shown schematically in Fig. 5. An important issue in the extrapolation is to predict the speedup ‘saturation’, or the phenomenon that the elapsed time may have a minimum, as shown in Fig. 5. The extrapolation technique is important, since we have only few opportunities to use the ultimately large computer resources, like the whole system of the K computer or OakforestPACS. A reliable extrapolation technique will encourage real researchers to use large resources.
4.2 Performance models
The performance prediction will be realized, when a reliable performance model is constructed for each routine, so as to reflect the algorithm and architecture properly. The present paper, as an earlystage research, proposes three simple models for the elapsed time of the th routine as the function of the number of nodes (the degrees of freedom in MPI parallelism) ; the first proposed model is called generic threeparameter model and is expressed as
(6)  
(7)  
(8)  
(9) 
with the three fitting parameters of . The terms of or stand for the time in ideal strong scaling or in nonparallel computations, respectively. The model of is known as Amdahl’s relation AMDAHL . The term of stand for the time of MPI communications. The logarithmic function was chosen as a reasonable one, since the main communication pattern required in dense matrix computations is categorized into collective communication (e.g. MPI_Allreduce for calculating inner product of a vector), and such communication routine is often implemented as a sequence of pointtopoint communications along with a binary tree, whose total cost is proportional to Pacheko96 .
The generic threeparameter model in Eq. (6) can give, unlike Amdahl’s relation, the minimum schematically shown in Fig. 5. It is noted that the real MPI communication time is not measured to determine the parameter , since it would require detailed modification of the source code or the use of special profilers. Rather, all the parameters are estimated simultaneously from the total elapsed time using Bayesian inference, as will be explained later.
The second proposed model is called generic fiveparameter model and is expressed as
(10)  
(11)  
(12) 
with the five fitting parameters of . The term of is responsible for the ‘superlinear’ behavior in which the time decays faster than . The superlinear behavior can be seen in several benchmarks data Ristov16 . The term of expresses the time of MPI communications for matrix computation; when performing matrix operations on a 2D scattered matrix, the size of the submatrix allocated to each node is . Thus, the communication volume to send a row or column of the submatrix is . By taking into account the binary tree based collective communication and multiplying , we obtain the term of . The term decays slower than .
The third proposed model results when the MPI communication term of in Eq. (6) is replaced by a linear function;
(13)  
(14) 
The model is called linearcommunicationterm model. We should say that this model is fictitious, because no architecture or algorithm used in real research gives a linear term in MPI communication, as far as we know. The fictitious threeparameter model of Eq. (13) was proposed so as to be compared with the other two models. Other models for MPI routines are proposed Grbovic07 ; Hoefler10 , and comparison with these models will be also informative, which is one of our future works.
4.3 Parameter estimation by Bayesian inference
Bayesian inference was carried out under the preknowledge that each term of is a fraction of the elapsed time and therefore each parameter of should be nonnegative . The procedure was realized by Python with the Bayesian inference module of PyMC ver. 2.36, which employs the Markov Chain Monte Carlo (MCMC) iterative procedure. The method is standard and the use of PyMC is not crucial. The details of the method are explained briefly here; (I) The parameters of are considered to have uncertainty and are expressed as probability distributions. The prior distribution should be chosen and the posterior distribution will be obtained by Bayesian inference. The prior distribution of the parameters of is set to the uniform distribution in the interval of , where is an input. The upper limit of should be chosen so that the the posterior probability distribution is so localized in the region of . The values of depend on problem and will appear in the next subsection with results. (II) The procedure of Bayesian inference was carried out for the logscale variables , instead of the original variable of . The prediction on the logscale variables means the uncertainty appears as relative error and is suitable to the present problem, because the data of lie across a wide range of digits. The prediction on the original variable results in a huge relative error for data with a large number of nodes. (III) Here the uncertainty is given by the normal distribution characterized by the standard deviation . The parameter is also treated as a probability distribution and its prior distribution is set to be the uniform one in the interval of with a given value of the upper bound .
In this paper, Bayesian inference is carried out by the MCMC iterative procedure. The MCMC procedure consumed only one or a couple of minute(s) by note PC with the teacher data of existing benchmarks. The histogram of Monte Carlo sample data are obtained for the parameters of and the elapsed time of and form approximate probability distributions for each quantity. The MCMC iterative procedure was carried out for each routine independently and its total iteration number is set to be . In the prediction procedure of the th routine, each iterative step gives the set of parameter values of and the elapsed time of , according to the selected model. We discarded the data in the first steps, since the Markov chain has not converged to the stationary distribution during such early steps. After that, the sampling data were picked out with an interval of steps. The number of the sampling data, therefore, is . Hereafter the index among the sample data is denoted by (). The th sample data consist of the set of the values of and and these values are denoted as and , respectively. The sampling data set of form the histogram or the probability distribution for . The probability distributions for the model parameters of are obtained in the same manner and will appear later in this section. The sample data for the total elapsed time is given by the sum of those over the routines;
(15) 
Finally, the median and the upper and lower limits of 95 % Highest Posterior Density (HPD) interval are obtained from the histogram of as the predicted values.
# nodes  total  pdsytrd  pdsygst  pdstedc  pdormtr  pdotrf  rest 

4  1872.7  1562.2  61.589  58.132  122.91  20.679  47.190 
16  240.82  129.09  37.012  21.341  24.624  8.0851  20.670 
64  103.18  44.494  24.584  9.9665  7.1271  3.3122  13.692 
256  63.029  21.325  20.509  5.8159  3.4131  2.2474  9.7189 
1024  55.592  17.524  17.242  6.1105  2.6946  3.1462  8.8753 
4096  70.459  20.479  21.169  6.7494  3.9326  7.9400  10.189 
10000  140.89  29.003  49.870  17.714  9.9534  19.817  14.536 
4.4 Results
The prediction results are demonstrated in Fig. 6. The computation was carried out on the K computer for the matrix problem of ‘VCNT22500’, which appears in ELSES matrix library ELSES . The matrix size is . The workflow , pure ScaLAPACK workflow, was used with the numbers of nodes for . The elapsed times were measured for the total elapsed time of and the six routines of , , , , and . The values are shown in Table 4. The measured data of the total elapsed time of shows the speedup ‘saturation’ or the phenomenon that the elapsed time shows a minimum at . The saturation is reasonable from HPC knowledge, since the matrix size is on the order of and efficient parallelization can not be expected for .
Figures 6(a)(c) shows the result of Bayesian inference, in which the teacher data is the measured elapsed times of the six routines of , , , , and at . In the MCMC procedure, the value of is set to among all the routines and the posterior probability distribution satisfies the locality of . One can find that the generic threeparameter model of Eq. (6) and the generic fiveparameter model of Eq. (10) commonly predict the speedup saturation successfully at , while the linearcommunicationterm model does not. Examples of the posterior probability distribution are shown in Fig. 6(d), for and in the fiveparameter model. Since the probability distribution of has a peak at , the term of in Eq. (11), the superlinear term, should be important. The contribution at , for example, is estimated to be . The above observation is consistent with the fact that the measured elapsed time shows the superlinear behavior between (=(1872.7 sec)/(240.82 sec) 7.78) and the generic fiveparameter model reproduces the teacher data, the data at =4, 16, 64, better than the generic threeparameter model.
Figure 7 shows the detailed analysis of the performance prediction in Fig. 6. Here the performance prediction of and is focused on, since the total time is dominated by these two routines, as shown in Table 4. The elapsed time of is predicted in Figs. 7(a) and (b) by the generic threeparameter model and the generic fiveparameter model, respectively. The elapsed time of is predicted in Figs. 7(c) and (d) by the generic threeparameter model and the generic fiveparameter model, respectively. The prediction by the generic fiveparameter model seems to be better than that by the threeparameter model.
5 Discussion on performance prediction
This subsection is devoted to discussions on the present performance prediction method.
5.1 Comparison with conventional least squares method
In the field of data fitting or modeling, the least squares method is by far the most frequently used approach. This applies also to the performance modeling of matrix libraries. In fact, most of the recent studies on performance modeling Peise12 ; Reisert17 ; Fukaya15 ; Fukaya18 rely on the least squares method to fit the model parameters. We therefore show the fitting result by the conventional least squares method in Fig. 8(a), so as to clarify the difference from that by the present method. The data for VCNT22500 on the K computer is used. The total elapsed time of in Table 4 was fitted within the generic threeparameter model of Eq. (6).
Fitting by the least squares method was carried out with the teacher data at and determines the parameters, uniquely, as . Here one finds that the fitted value of is negative, since the method ignores the preknowledge of the nonnegative condition , unlike the present Bayesian inference. The fitted elapsed times in Fig. 8(a) reproduce the measured values exactly at but are deviated severely from the measured values at . Note that the least squares procedure used here can be viewed as a special case of the Bayesian inference with the teacher data at , where the input parameters are determined without uncertainty and no nonnegativity constraints are imposed on them. In fact, the same result can be obtained in the framework of the MCMC procedure, if the parameters of are set to be the interval of with and that for is set to be the interval of with . The above prior distribution means that the nonnegative condition is ignored and the method is required to reproduce the teacher data exactly (). The MCMC procedure was carried out for the variables , unlike in Sec. 4.3, because the nonnegative condition is not imposed on and we cannot use as a variable. We confirmed the above statement by the MCMC procedure. As results, the median is located at and the width of the 95 % Highest Posterior Density (HPD) interval is or less for each coefficient.
The conventional fitting procedure was carried out also, when all the measured elapsed time data were used as the teacher data. The fitting procedure determines the parameters, uniquely, as . Here one find that the fitted value of is negative and the predicted time is negative at several number of nodes, such as s.
5.2 Limitation of the present models
Although the generic fiveparameter model seems to be the best among the three proposed models, the model is still limited in applicability. Figure 8(b) shows the performance prediction by the fiveparameter model for the benchmark data by the workflow in Fig. 4, a data for the problem of ‘VCNT90000’ on OakforestPACS. The data at are the teacher data. In the present case, the upper limit is set to for and and the posterior probability distribution satisfies the locality of . The predicted data fails to reproduce the local maximum at in the measured data, since the model can have only one (local) minimum and no (local) maximum. If one would like to overcome the above limitation, a candidate for a flexible and useful model may be one with case classification;
(16) 
where and are considered to be independent models and the threshold number of is also a fitting parameter. A model with case classification will be fruitful from the algorithm and architecture viewpoints. An example is the case where the target numerical routine switches the algorithms according to the number of used nodes. Another example is the case where the nodes in a rack are tightly connected and parallel computation within these nodes is quite efficient. From the application viewpoint, however, the prediction in Fig. 8(b) is still meaningful, since the extrapolation implies that an efficient parallel computation cannot be expected at .
5.3 Discussions on methodologies and future aspect
This subsection is devoted to discussion on methodologies and future aspects.
(I) The proper values of should be chosen in each problem and here we propose a way to set the values automatically. The possible maximum value of appears, when the elapsed time of the th routine is governed only by the th term of the given model (). We consider, for example, the case in which the first term is dominant () and the possible maximum value of is given by . Therefore the limit of can be chosen to be among the teacher data of . The locality of should be checked for the posterior probability distribution. We plan to use the method in a future version of our code.
(II) In parallel computations, the elapsed times may be different among multiple runs, because the geometries of the used nodes are in general not equivalent. Such uncertainty of the measured elapsed time can be treated in Bayesian inference by, for example, setting the parameter appropriately based on the sample variance of the multiple measured data or other preknowledge. In the present paper, however, the optimal choice of is not discussed and is left as a future work.
(III) It is noted that ATMathCoreLib, an autotuning tool ATMATHCORELIB1 ; ATMATHCORELIB2 , also uses Bayesian inference for performance prediction. However, it is different from our approach in two aspects. First, it uses Bayesian inference to construct a reliable performance model from noisy observations Suda10 . So, more emphasis is put on interpolation than on extrapolation. Second, it assumes normal distribution both for the prior and posterior distributions. This enables the posterior distribution to be calculated analytically without MCMC, but makes it impossible to impose nonnegative condition of .
(IV) The present method uses the same generic performance model for all the routines. A possible next step is to develop a proper model for each routine. Another possible theoretical extension is performance prediction among different problems. The elapsed time of a dense matrix solver depends on the matrix size , as well as on , so it would be desirable to develop a model to express the elapsed time as a function of both the number of nodes and the matrix size (). For example, the elapsed time of the tridiagonalization routine in EigenExa on the K computer is modeled as a function of both the matrix size and the number of nodes Fukaya18 . In this study, several approaches to performance modeling are compared depending on the information available for modeling, and some of them accurately estimate the elapsed time for a give condition (i.e. matrix size and node count). The use of such a model will given more fruitful prediction, in particular, for the extrapolation.
6 Summary and future outlook
We developed an opensource named middleware EigenKernel for the generalized eigenvalue problem that realizes high scalability and usability, responding to the solid need in largescale electronic state calculations. Benchmark results on OakforestPACS shows the middleware enables us to choose the optimal scalable solver from ScaLAPACK, ELPA and EigenExa routines. We presented an example that the detailed performance data provided by EigenKernel helps users to detect performance issues without additional effort such as code modification. For high usability, a performance prediction function was proposed based on Bayesian inference and the MarkovChain MonteCarlo procedure. We found that the function is applicable not only to performance interpolation but also to extrapolation. For a future look, we can consider a system that gathers performance data automatically every time users call a library routine. Such a system could be realized through a possible collaboration with the supercomputer administrator and will give a greater prediction power to our performance prediction tool, by providing huge set of teacher data. The present approach is general and applicable to any numerical procedure, if a proper performance model is prepared for each routine. The middleware thus developed will form a foundation of the applicationalgorithmarchitecture codesign.
Acknowledgement
The authors thank to Toshiyuki Imamura (RIKEN) for the fruitful discussion on EigenExa and Kengo Nakajima (The University of Tokyo) for the fruitful discussion on OakforestPACS.
References
 (1) Shalf, J., Quinlan, D., and Janssen, C.: Rethinking HardwareSoftware Codesign for Exascale Systems, Computer 44, 22–30 (2011)
 (2) Dosanjh, S., Barrett, R., Doerfler, D., Hammond, S., Hemmert, K., Heroux, M., Lin, P., Pedretti, K., Rodrigues, A., Trucano, T., and Luitjens, J.: Exascale design space exploration and codesign, Future Gener. Comput. Syst. 30, 46–58 (2014)
 (3) FLAGSHIP 2020 Project: PostK Supercomputer Project, http://www.rccs.riken.jp/fs2020p/en/
 (4) CoDEx: CoDesign for Exascale, http://www.codexhpc.org/?page_id=93
 (5) EuroEXA: European CoDesign for Exascale Applications, https://euroexa.eu/
 (6) Imachi, H., Hoshi, T.: Hybrid numerical solvers for massively parallel eigenvalue computation and their benchmark with electronic structure calculations, J. Inf. Process. 24, 164–172 (2016)
 (7) Hoshi, T., Imachi, H., Kumahata, K., Terai, M., Miyamoto, K., Minami, K., and Shoji F.: Extremely scalable algorithm for 10atom quantum material simulation on the full system of the K computer, Proceeding of 7th Workshop on Latest Advances in Scalable Algorithms for LargeScale Systems (ScalA), held in conjunction with SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis Salt Lake City, Utah November, 1318, 33–40 (2016)

(8)
EigenKernel:
https://github.com/eigenkernel/
 (9) Hirokawa, Y., Boku, T., Sato, S., and Yabana, K.: Performance Evaluation of Large Scale Electron Dynamics Simulation under Manycore Cluster based on Knights Landing, Proceedings of the International Conference on High Performance Computing in AsiaPacific Region (HPS Asia 2018), 183–191 (2018)
 (10) Idomura, Y., Ina, T., Mayumi, A., Yamada, S., Matsumoto, K., Asahi, Y., and Imamura, T.: Application of a communicationavoiding generalized minimal residual method to a gyrokinetic five dimensional eulerian code on many core platforms, Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for LargeScale Systems (ScalA’17), 7:1–7:8 (2017)
 (11) ScaLAPACK: http://www.netlib.org/scalapack/
 (12) ELPA: Eigenvalue SoLvers for PetaflopApplication, http://elpa.mpcdf.mpg.de/
 (13) EigenExa: High Performance EigenSolver, http://www.rccs.riken.jp/labs/lpnctrt/en/projects/eigenexa/
 (14) Blum, V., Gehrke, R., Hanke, F., Havu, P., Havu, V., Ren, X., Reuter, K. and Scheffler, M.: Ab initio molecular simulations with numeric atomcentered orbitals, Computer Physics Communications 180, 2175–2196 (2009); https://aimsclub.fhiberlin.mpg.de/
 (15) Auckenthaler, T., Blum, V., Bungartz, J., Huckle, T., Johanni, R., Kramer, L., Lang, B., Lederer, H., and Willems, P.: Parallel solution of partial symmetric eigenvalue problems from electronic structure calculations, Parallel Comput. 27, 783–794 (2011)
 (16) Marek, A., Blum, V., Johanni, R., Havu, V., Lang, B., Auckenthaler, T., Heinecke, A., Bungartz, H. J. and Lederer, H.: The ELPA Library – Scalable Parallel Eigenvalue Solutions for Electronic Structure Theory and Computational Science, J. Phys. Condens. Matter 26, 213201 (2014)
 (17) Imamura, T., Yamada, S., and Machida, M.: Development of a high performance eigensolver on the petascale next generation supercomputer system, Progress in Nuclear Science and Technology 2, 643–650 (2011)
 (18) Imamura, T.: The EigenExa Library – High Performance & Scalable Direct Eigensolver for LargeScale Computational Science, ISC 2014, Leipzig, Germany, 2014.
 (19) Fukaya, T and Imamura, T.: Performance evaluation of the EigenExa eigensolver on OakleafFX: Tridiagonalization versus pentadiagonalization, Proceedings of 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 960–969 (2015)
 (20) KMATH_EIGEN_GEV: highperformance generalized eigen solver, http://www.rccs.riken.jp/labs/lpnctrt/en/projects/kmatheigengev/
 (21) JCAHPC: Joint Center for Advanced High Performance Computing, http://jcahpc.jp/eng/index.html
 (22) ELSES matrix library, http://www.elses.jp/matrix/
 (23) Hoshi, T., Yamamoto, S., Fujiwara, T., Sogabe, T. and Zhang, S.L.: An order electronic structure theory with generalized eigenvalue equations and its application to a tenmillionatom system, J. Phys. Condens. Matter 21, 165502 (2012)
 (24) ELSES: Extra large Scale Electronic Structure calculation, http://www.elses.jp/index_e.html
 (25) Cerda, J. and Soria, F.: Accurate and transferable extended H’́uckeltype tightbinding parameters, Phys. Rev. B 61, 7965–7971 (2000)
 (26) Wilkinson, J. H. and Reinsch, C.: Handbook for Automatic Computation, Vol. II, Linear Algebra, SpringerVerlag, New York (1971)
 (27) Dackland, K. and Kågström, B.: A Hierarchical Approach for Performance Analysis of ScaLAPACKBased Routines Using the Distributed Linear Algebra Machine, PARA ’96 Proceedings of the Third International Workshop on Applied Parallel Computing, Industrial Computation and Optimization, 186–195 (1996)
 (28) Amdahl, G.: Validity of the Single Processor Approach to Achieving LargeScale Computing Capabilities, AFIPS Conference Proceedings 30, 483–485 (1967)
 (29) Pacheco, P.: Parallel Programming with MPI, Morgan Kaufmann (1996)
 (30) Ristov, S., Prodan, R., Gusev, M. and Skala, K.: Superlinear speedup in HPC systems: Why and when?, Proceedings of the Federated Conference on Computer Science and Information Systems, 889–898 (2016)
 (31) PješivacGrbović, J., Angskun, T., Bosilca, G., Fagg, E., Gabriel, E., and Dongarra, J.: Performance analysis of MPI collective operation, Cluster Computing 10, 127–143 (2007)
 (32) Hoefler, T., Gropp, W., Thakur, R., and Träff, L.,: Toward Performance Models of MPI Implementations for Understanding Application Scaling Issues, Proceeding of the 17th European MPI users’ group meeting conference on Recent advances in the message passing interface, 21–30 (2010)
 (33) Peise, E. and and Bientinesi, B.: Performance Modeling for Dense Linear Algebra, Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, 406–416 (2012)
 (34) Reisert, P., Calotoiu, A., Shudler, S. and Wolf, F.: Following the blind seer – creating better performance models using less information, Proceedings of EuroPar 2017: Parallel Processing, Lecture Notes in Computer Science 10417, 106–118, Springer (2017)
 (35) Fukaya, T., Imamura, T. and Yamamoto, Y.: Performance analysis of the Householdertype parallel tallskinny QR factorizations toward automatic algorithm selection, Proceedings of VECPAR 2014: High Performance Computing for Computational Science – VECPAR 2014, Lecture Notes in Computer Science 8969, 269–283, Springer (2015)
 (36) Fukaya, T., Imamura, T. and Yamamoto, Y.: A case study on modeling the performance of dense matrix computation: Tridiagonalization in the EigenExa eigensolver on the K computer, Proceedings of 2018 IEEE International Parallel and Distributed Processing Symposium Workshop, 1113–1122 (2018)
 (37) Suda, R.: ATMathCoreLib: Mathematical Core Library for Automatic Tuning, , IPSJ SIG Technical Report, 2011HPC129 (14), 1–12 (2011) (in Japanese)
 (38) Nagashima, S., Fukaya, T., and Yamamoto, Y.: On Constructing Cost Models for Online Automatic Tuning Using ATMathCoreLib: Case Studies through the SVD Computation on a Multicore Processor, Proceedings of the IEEE 10th International Symposium on Embedded Multicore/Manycore SysttemsonChip (MCSoC16), 345–352 (2016)
 (39) Suda, R.: A Bayesian Method of Online Automatic Tuning, Software Automatic Tuning: From Concept to StateoftheArt Results, 275–293, Springer (2010)