Rectangular Full Packed Format for Cholesky’s Algorithm: Factorization, Solution and Inversion
Abstract
We describe a new data format for storing triangular, symmetric, and Hermitian matrices called RFPF (Rectangular Full Packed Format). The standard two-dimensional arrays of Fortran and C (also known as full format) that are used to represent triangular and symmetric matrices waste nearly half of the storage space but provide high performance via the use of Level 3 BLAS. Standard packed format arrays fully utilize storage (array space) but provide low performance as there is no Level 3 packed BLAS. RFPF combines the good features of packed and full storage: it obtains high performance through Level 3 BLAS because RFPF is a standard full format representation, and it requires exactly the same minimal storage as packed format. Each LAPACK full and/or packed triangular, symmetric, and Hermitian routine becomes a single new RFPF routine based on eight possible data layouts of RFPF. This new RFPF routine usually consists of two calls to the corresponding LAPACK full format routine and two calls to Level 3 BLAS routines. This means no new software is required. As examples, we present LAPACK routines for Cholesky factorization, Cholesky solution and Cholesky inverse computation in RFPF to illustrate this new work and to describe its performance on several commonly used computer platforms. Performance of LAPACK full routines using RFPF versus LAPACK full routines using standard format for both serial and SMP parallel processing is about the same while using half the storage. Compared with vendor and/or reference packed routines, vendor LAPACK full routines with RFPF are faster by factors ranging from roughly 1 to 43 for serial execution and from roughly 1 to 97 for SMP parallel execution.
Categories and Subject Descriptors: G.1.3 [Numerical Analysis]: Numerical Linear Algebra – Linear Systems (symmetric and Hermitian); G.4 [Mathematics of Computing]: Mathematical Software
General Terms: Algorithms, BLAS, Performance, Linear Algebra Libraries
Authors’ addresses: F.G. Gustavson, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA, email: fg2@us.ibm.com; J. Waśniewski, Department of Informatics and Mathematical Modelling, Technical University of Denmark, Richard Petersens Plads, Building 321, DK-2800 Kongens Lyngby, Denmark, email: jw@imm.dtu.dk; J.J. Dongarra, Electrical Engineering and Computer Science Department, University of Tennessee, 1122 Volunteer Blvd, Knoxville, TN 37996-3450, USA, email: dongarra@eecs.utk.edu; Julien Langou, Department of Mathematical and Statistical Sciences, University of Colorado Denver, 1250, 14th Street – Room 646, Denver, Colorado 80202, USA, email: julien.langou@ucdenver.edu.
1 Introduction
A very important class of linear algebra problems deals with a coefficient matrix $A$ that is symmetric and positive definite [Dongarra et al. (1998), Demmel (1997), Golub and Van Loan (1996), Trefethen and Bau (1997)]. Because of symmetry it is only necessary to store either the upper or lower triangular part of the matrix $A$.
1.1 LAPACK full and packed storage formats
The LAPACK library [Anderson et al. (1999)] offers two different kinds of subroutines to solve the same problem: POTRF (see Footnote 1) and PPTRF both factorize symmetric, positive definite matrices by means of the Cholesky algorithm. A major difference between these two routines is the way they access the array holding the triangular matrix (see Figures 1 and 2).
Footnote 1: Four names SPOTRF, DPOTRF, CPOTRF and ZPOTRF are used in LAPACK for real symmetric and complex Hermitian matrices [Anderson et al. (1999)], where the first character indicates the precision and arithmetic version: S – single precision, D – double precision, C – complex and Z – double complex. LAPACK95 uses one name LA_POTRF for all versions [Barker et al. (2001)]. In this paper, POTRF and/or PPTRF denote any precision, any arithmetic and any language version of the PO and/or PP matrix factorization algorithms.
In the POTRF case, the matrix is stored in one of the lower left or upper right triangles of a full square matrix ([Anderson et al. (1999), pages 139 and 140] and [IBM (1997), page 64]); the other triangle is wasted (see Figure 1). (The full matrix is stored column major in Fortran and row major in C.) Because of the uniform storage scheme, blocked LAPACK and Level 3 BLAS subroutines [Dongarra et al. (1990b), Dongarra et al. (1990a)] can be employed, resulting in a fast solution.
In the PPTRF case, the matrix is stored in packed storage ([Anderson et al. (1999), pages 140 and 141], [Agarwal et al. (1994)] and [IBM (1997), pages 74 and 75]), which means that the columns of the lower or upper triangle are stored consecutively in a one dimensional array (see Figure 2). Now the triangular matrix occupies the strictly necessary storage space but the nonuniform storage scheme means that use of full storage BLAS is impossible and only the Level 2 BLAS packed subroutines [Lawson et al. (1979), Dongarra et al. (1988)] can be employed, resulting in a slow solution.
To summarize: LAPACK offers a choice between high performance and wasting half of the memory space (POTRF) versus low performance with optimal memory space (PPTRF).
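The storage trade-off can be made concrete with a small sketch. The function names below are illustrative and are not part of LAPACK; the index formula assumes 0-based indices and a lower triangle stored column by column, as in LAPACK packed storage with uplo = 'L'.

```python
def full_size(n):
    """Elements reserved by full storage for an order-n matrix (LDA = n assumed)."""
    return n * n


def packed_size(n):
    """Elements needed by packed storage: one triangle including the diagonal."""
    return n * (n + 1) // 2


def packed_index_lower(i, j, n):
    """0-based position of a(i, j), i >= j, in column-wise lower packed storage."""
    if not 0 <= j <= i < n:
        raise IndexError("need 0 <= j <= i < n")
    # Column j starts after the j previous columns of lengths n, n-1, ..., n-j+1.
    return i + j * (2 * n - j - 1) // 2


n = 1000
wasted = full_size(n) - packed_size(n)  # elements of the unused triangle
```

The non-constant column stride visible in `packed_index_lower` is the reason full-storage Level 3 BLAS cannot operate directly on a packed array.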
1.2 Packed Minimal Storage Data Formats related to RFPF
Recently many new data formats for matrices have been introduced for improving the performance of Dense Linear Algebra (DLA) algorithms. The survey article [Elmroth et al. (2004)] gives an excellent overview.
Recursive Packed Format (RPF) [Andersen et al. (2001), Andersen et al. (2002)]:
A new compact way to store a triangular, symmetric or Hermitian matrix called Recursive
Packed Format is described in [Andersen et al. (2001)] as are novel ways to transform
RPF to and from standard packed format. New algorithms, called Recursive
Packed Cholesky (RPC) [Andersen et al. (2001), Andersen et al. (2002)] that operate on the RPF
format are presented. RPF operates almost entirely by calling
Level 3 BLAS GEMM [Dongarra et al. (1990b), Dongarra et al. (1990a)] but requires variants of algorithms
TRSM and SYRK [Dongarra et al. (1990b), Dongarra et al. (1990a)] that are designed to work on RPF. The authors call
these algorithms RPTRSM and RPSYRK [Andersen et al. (2001)] and find that they do most of
their FLOPS by calling GEMM [Dongarra et al. (1990b), Dongarra et al. (1990a)]. It follows that almost all
of the execution time of the RPC algorithm is spent in calls to GEMM.
There are three advantages of this storage scheme compared to traditional packed
and full storage. First, the RPF storage format
uses the minimum amount of storage required for symmetric, triangular, or
Hermitian matrices. Second, the RPC algorithm is a Level 3 implementation of
Cholesky factorization. Finally, RPF requires no block size tuning parameter.
A disadvantage of the RPC algorithm is its high recursive calling overhead.
The paper [Gustavson and Jonsson (2000)] removed this overhead and added other novel features
to the RPC algorithm.
Square Block Packed Format (SBPF) [Gustavson (2003)]: SBPF is described in Section 4 of [Gustavson (2003)]. A strong point of SBPF is that it requires minimum block storage and all its blocks are contiguous and of equal size. If one uses SBPF with kernel routines then data copying is mostly eliminated during Cholesky factorization.
Block Packed Hybrid Format (BPHF) [Andersen et al. (2005), Gustavson et al. (2007)]: The paper [Andersen et al. (2005)] considers an efficient implementation of the Cholesky solution of symmetric positive definite full linear systems of equations using packed storage. It takes the same starting point as LINPACK [Dongarra et al. (1979)] and LAPACK [Anderson et al. (1999)], with the upper (or lower) triangular part of the matrix being stored by columns, and, following LINPACK and LAPACK, it overwrites the given matrix by its Cholesky factor. The paper uses BPHF, in which blocks of the matrix are held contiguously, and compares BPHF with conventional full format storage, packed format and RPF. BPHF is a variant of SBPF in which the diagonal blocks are stored in packed format and so its storage requirement is equal to that of packed storage.
We mention that for packed matrices SBPF and BPHF have become the format of choice for multicore processors when one stores the blocks in register block format [Gustavson et al. (2007)]. Recently, there have been many papers published on new algorithms for multicore processors. This literature is extensive. So, we only mention two projects, PLASMA [Buttari et al. (2007)] and FLAME [Chan et al. (2007)], and refer the interested reader to the literature for additional references.
In regard to other references on new data structures, the survey article [Elmroth et al. (2004)] gives an excellent overview. However, since 2005 at least two new data formats for Cholesky type factorizations have emerged, [Herrero (2006)] and the subject matter of this paper, RFPF [Gustavson and Waśniewski (2007)]. In the next subsection we highlight the main features of RFPF.
1.3 A novel way of representing triangular, symmetric, and Hermitian matrices in LAPACK
LAPACK has two types of subroutines for triangular, symmetric, and Hermitian matrices, called packed and full format routines. LAPACK has about 300 subroutines of this kind. So, in either format, a variety of problems can be solved by these LAPACK subroutines. From a user point of view, RFPF can replace both these LAPACK data formats. Furthermore, and this is important, using RFPF does not require any new LAPACK subroutines to be written. Using RFPF in LAPACK only requires the use of already existing LAPACK and BLAS routines. RFPF strongly relies on the existence of the BLAS and LAPACK routines for full storage format.
1.4 Overview of the Paper
First we introduce RFPF in general, see Section 2. Secondly we show how to use RFPF on symmetric and Hermitian positive definite matrices; e.g., for the factorization (Section 3), solution (Section 4), and inversion (Section 5) of these matrices. Section 6 describes LAPACK subroutines for the Cholesky factorization, Cholesky solution, and Cholesky inversion of symmetric and Hermitian positive definite matrices using RFPF. Section 7 indicates that stability is unaffected by the choice of RFPF, as RFPF uses existing LAPACK algorithms which are already known to be stable. Section 8 describes a variety of performance results on commonly used platforms both for serial and parallel SMP execution. These results show that performance of LAPACK full routines using RFPF versus LAPACK full routines using standard format for both serial and SMP parallel processing is about the same while using half the storage. Also, compared with vendor and/or reference packed routines, vendor LAPACK full routines with RFPF are faster by factors ranging from roughly 1 to 43 for serial execution and from roughly 1 to 97 for SMP parallel execution. Section 9 explains how some new RFPF routines have been integrated in LAPACK. LAPACK software for the Cholesky algorithm (factorization, solution and inversion) using RFPF was released with LAPACK 3.2 in November 2008. Section 10 gives a short summary and brief conclusions.
2 Description of Rectangular Full Packed Format
We describe Rectangular Full Packed Format (RFPF). It transforms a standard Packed Array AP of size $NT = N(N+1)/2$ to a full 2D array. This means that performance of LAPACK’s [Anderson et al. (1999)] packed format routines becomes equal to or better than their full array counterparts. RFPF is a variant of Hybrid Full Packed (HFP) format [Gunnels and Gustavson (2004)]. RFPF is a rearrangement of a Standard full format rectangular Array SA of size LDA*N where LDA $\geq N$. Array SA holds a triangular part of a symmetric, triangular, or Hermitian matrix $A$ of order $N$. The rearrangement of array SA is equal to compact full format Rectangular Array AR of size LDA1*N1 $= NT$ and hence array AR like array AP uses minimal storage. (The specific values of LDA1 and N1 vary from case to case and are specified later in the text.) Array AR will hold a full rectangular matrix $A_R$ obtained from a triangle of matrix $A$. Note also that the transpose of the rectangular matrix, $A_R^T$, resides in the transpose of array AR and hence also represents $A$. Therefore, Level 3 BLAS [Dongarra et al. (1990b), Dongarra et al. (1990a)] can be used on array AR or its transpose. In fact, the equivalent LAPACK algorithm that uses the array AR or its transpose performs slightly better than the standard LAPACK algorithm that uses the array SA or its transpose. Therefore, this offers the possibility to replace all packed or full LAPACK routines with equivalent LAPACK routines that work on array AR or its transpose. For examples of transformations of a matrix $A$ to a matrix $A_R$ see the figures in Section 6.
RFPF is closely related to HFP format, see [Gunnels and Gustavson (2004)], which represents $A$ as the concatenation of two standard full arrays whose total size is also $NT = N(N+1)/2$. A basic simple idea leads to both formats. Let $A$ be an order $N$ symmetric matrix. Break $A$ into a block 2–by–2 form
$$A = \begin{pmatrix} A_{11} & A_{21}^T \\ A_{21} & A_{22} \end{pmatrix} \tag{1}$$
where $A_{11}$ and $A_{22}$ are symmetric. Clearly, we need only store the lower triangles of $A_{11}$ and $A_{22}$ as well as the full matrix $A_{21}$ when we are interested in a lower triangular formulation.
When $N = 2k$ is even, the lower triangle of $A_{11}$ and the upper triangle of $A_{22}^T$ can be concatenated together along their main diagonals into a $(k+1)$–by–$k$ dense matrix (see the figures where N is even in Section 6). This last operation is the crux of the basic simple idea. The off-diagonal block $A_{21}$ is $k$–by–$k$, and so it can be appended below the $(k+1)$–by–$k$ dense matrix. Thus, the lower triangle of $A$ can be stored as a single $(N+1)$–by–$k$ dense matrix $A_R$. In effect, each block matrix $A_{11}$, $A_{21}$ and $A_{22}$ is now stored in ‘full format’. This means all entries of matrix $A_R$ in array AR of size LDA1 $= N+1$ by N1 $= k$ can be accessed with constant row and column strides. So, the full power of LAPACK’s block Level 3 codes is now available for RFPF, which uses the minimum amount of storage. Finally, matrix $A_R^T$, which has size $k$–by–$(N+1)$, is represented in the transpose of array AR and hence has the same desirable properties. There are eight representations of RFPF. The matrix $A$ can have either odd or even order $N$, it can be represented in either standard lower or upper format, and it can be represented by either matrix $A_R$ or its transpose $A_R^T$, giving $2^3 = 8$ representations in all.
All eight cases or representations are presented in Section 6. The RFPF matrices are in the upper right part of the figures. We have introduced colors and horizontal lines to visually delineate the two triangles, which for a lower triangle of $A$ represent the lower and upper triangles of the symmetric matrices $A_{11}$ and $A_{22}^T$ respectively, and the square or near-square block, which represents the matrix $A_{21}$. For an upper triangle of $A$, the two triangles represent the lower and upper triangles of the symmetric matrices $A_{11}^T$ and $A_{22}$ respectively, and the square or near-square block represents the matrix $A_{12}$. For both lower and upper triangles of $A$ we have, after each matrix element, added its position in the arrays holding matrices $A$ and $A_R$.
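As an illustration of the concatenation idea, the sketch below packs the lower triangle of an even-order matrix into an $(N+1)$-by-$k$ array ($k = N/2$) following the description above: the lower trapezoid ($A_{11}$ on top of $A_{21}$) is shifted down by one row, and the lower triangle of $A_{22}$ is stored transposed in the upper triangle that this frees. The function name is hypothetical, indices are 0-based, and only one of the eight cases (even order, lower, no transpose) is covered.

```python
def to_rfpf_lower_even(a):
    """Pack the lower triangle of an order-n matrix (n even) into an (n+1)-by-k array."""
    n = len(a)
    if n % 2:
        raise ValueError("this sketch handles even n only")
    k = n // 2
    ar = [[None] * k for _ in range(n + 1)]
    for j in range(k):
        # Lower trapezoid: column j of A11 (lower part) and of A21, one row down.
        for i in range(j, n):
            ar[i + 1][j] = a[i][j]
        # Upper triangle: lower triangle of A22, stored transposed.
        for i in range(j + 1):
            ar[i][j] = a[k + j][k + i]
    return ar


# Label each entry by its (row, column) position, e.g. 43 stands for a(4, 3).
a = [[10 * max(i, j) + min(i, j) for j in range(6)] for i in range(6)]
ar = to_rfpf_lower_even(a)
for row in ar:
    print(row)
```

Every entry of the lower triangle appears exactly once in `ar`, and each of the three blocks can be addressed with constant row and column strides.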
We now consider performance aspects of using LAPACK routines on triangular matrices stored in RFPF. Let $X$ be a Level 3 LAPACK routine that operates on full format. $X$ has a full Level 3 LAPACK block 2–by–2 algorithm, call it $FX$. We write a simple related partition algorithm (SRPA) with partition sizes $n_1$ and $n_2$ where $n_1 + n_2 = N$. Apply the new SRPA using the new RFPF. The new SRPA almost always has four major steps consisting entirely of calls to existing full format LAPACK routines in two steps and calls to Level 3 BLAS in the remaining two steps, see Figure 3.


Section 6 shows such algorithms for the factorization, solution and inversion of symmetric or Hermitian positive definite matrices.
3 Cholesky Factorization using Rectangular Full Packed Format
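The Cholesky factorization of a symmetric and positive definite matrix $A$ can be expressed as
$$A = L L^T \quad \text{or} \quad A = U^T U \tag{2}$$
where $L$ and $U$ are lower triangular and upper triangular matrices.
Break the matrices $L$ and $U$ into 2–by–2 block form in the same way as was done for the matrix $A$ in Equation (1):
$$L = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \quad \text{and} \quad U = \begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix} \tag{3}$$
We now have
$$L L^T = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & L_{22}^T \end{pmatrix} \quad \text{and} \quad U^T U = \begin{pmatrix} U_{11}^T & 0 \\ U_{12}^T & U_{22}^T \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix} \tag{4}$$
where $L_{11}$, $L_{22}$, $U_{11}$, and $U_{22}$ are lower and upper triangular submatrices, and $L_{21}$ and $U_{12}$ are square or almost square submatrices.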
Using Equation (2) and equating the blocks of Equations (1) and (4) gives us the basis of a 2–by–2 block algorithm for Cholesky factorization using RFPF. We can now express each of these block equalities by calls to existing LAPACK and Level 3 BLAS routines. An example, see Section 6, is the three block equations $L_{11} L_{11}^T = A_{11}$, $L_{21} L_{11}^T = A_{21}$ and $L_{21} L_{21}^T + L_{22} L_{22}^T = A_{22}$. The first and second of these block equations are handled by calling LAPACK’s POTRF routine, $L_{11} \leftarrow A_{11}$, and by calling Level 3 BLAS TRSM via $L_{21} \leftarrow A_{21} L_{11}^{-T}$. In both these block equations the Fortran equality of replacement ($=$) is being used, so that the lower triangle of $A_{11}$ is being replaced by $L_{11}$ and the nearly square matrix $A_{21}$ is being replaced by $L_{21}$. The third block equation breaks into two parts, $A_{22} \leftarrow A_{22} - L_{21} L_{21}^T$ and $L_{22} \leftarrow A_{22}$, which are handled by calling Level 3 BLAS SYRK or HERK and by calling LAPACK’s POTRF routine. At this point we can use the flexibility of the LAPACK library. In RFPF $A_{22}$ is in upper format (upper triangle) while in standard format $A_{22}$ is in lower format (lower triangle). Due to symmetry, both formats of $A_{22}$ contain equal values. This flexibility allows LAPACK to accommodate both formats. Hence, in the calls to SYRK or HERK and POTRF we set uplo = ’U’ even though the rectangular matrix of SYRK and HERK comes from a lower triangular formulation.
The new LAPACK-like routine PFTRF performs these four computations. PF was chosen to fit with LAPACK’s use of PO and PP. The PFTRF routine covers the Cholesky factorization algorithm for the eight cases of RFPF. Section 6 has Figure 4 with four subfigures. Here we are interested in the first and second subfigures. The first subfigure contains the layouts of matrices $A$ and $A_R$. The second subfigure has the Cholesky factorization algorithm obtained by simple algebraic manipulations of the three block equalities obtained above.
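The four computations can be sketched in plain Python for the lower triangular case. This is only an illustration of the 2-by-2 block algebra on ordinary nested lists, not the PFTRF code: the helper names are hypothetical, the unblocked helpers merely stand in for LAPACK POTRF and BLAS TRSM/SYRK, and no RFPF array layout is involved.

```python
import math


def potrf_lower(a):
    """Unblocked Cholesky (stand-in for POTRF): return lower L with a = L L^T."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = a[j][j] - sum(l[j][p] ** 2 for p in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = sum(l[i][p] * l[j][p] for p in range(j))
            l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def trsm_right_lt(b, l):
    """Stand-in for TRSM: solve X L^T = B for X, with L lower triangular."""
    n = len(l)
    x = [[0.0] * n for _ in b]
    for i, row in enumerate(b):
        for j in range(n):
            s = sum(x[i][p] * l[j][p] for p in range(j))
            x[i][j] = (row[j] - s) / l[j][j]
    return x


def block_cholesky(a, n1):
    """2-by-2 block Cholesky with partition sizes n1 and n2 = n - n1."""
    n2 = len(a) - n1
    a11 = [r[:n1] for r in a[:n1]]
    a21 = [r[:n1] for r in a[n1:]]
    a22 = [r[n1:] for r in a[n1:]]
    l11 = potrf_lower(a11)                      # step 1: L11 <- chol(A11)
    l21 = trsm_right_lt(a21, l11)               # step 2: L21 <- A21 L11^{-T}
    s22 = [[a22[i][j] - sum(l21[i][p] * l21[j][p] for p in range(n1))
            for j in range(n2)] for i in range(n2)]  # step 3: A22 <- A22 - L21 L21^T
    l22 = potrf_lower(s22)                      # step 4: L22 <- chol(A22)
    return [l11[i] + [0.0] * n2 for i in range(n1)] + \
           [l21[i] + l22[i] for i in range(n2)]


# A strictly diagonally dominant symmetric matrix, hence positive definite.
A = [[10.0, 2.0, 3.0, 1.0],
     [2.0, 9.0, 1.0, 2.0],
     [3.0, 1.0, 8.0, 2.0],
     [1.0, 2.0, 2.0, 7.0]]
L = block_cholesky(A, 2)
```

Because the Cholesky factor with positive diagonal is unique, `L` agrees with the factor computed by `potrf_lower(A)` up to rounding.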
4 Solution
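In Section 3 we obtained the 2–by–2 Cholesky factorization (3) of matrix $A$. Now, we can solve the equation $A X = B$:

If $A$ has lower triangular format then
$$L L^T X = B, \quad \text{i.e.,} \quad L Y = B \ \text{and} \ L^T X = Y \tag{5}$$
If $A$ has an upper triangular format then
$$U^T U X = B, \quad \text{i.e.,} \quad U^T Y = B \ \text{and} \ U X = Y \tag{6}$$
$B$, $X$ and $Y$ are either vectors or rectangular matrices. $B$ contains the RHS values. $X$ and $Y$ contain the solution values. $B$, $X$ and $Y$ are vectors when there is one RHS and matrices when there are many RHS. The values of $X$ and $Y$ are stored over the values of $B$.
Expanding (5) and (6) using (3) gives the forward substitution equations
$$\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} U_{11}^T & 0 \\ U_{12}^T & U_{22}^T \end{pmatrix} \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix} \tag{7}$$
and the back substitution equations
$$\begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & L_{22}^T \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} \tag{8}$$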
Equations (7) and (8) give the basis of a block algorithm for Cholesky solution using RFPF. We can now express these two sets of two block equalities by using existing Level 3 BLAS routines. An example, see Section 6, of the first set of these two block equalities is $L_{11} Y_1 = B_1$ and $L_{21} Y_1 + L_{22} Y_2 = B_2$. The first block equality is handled by Level 3 BLAS TRSM: $Y_1 \leftarrow L_{11}^{-1} B_1$. The second block equality is handled by Level 3 BLAS GEMM and TRSM: $B_2 \leftarrow B_2 - L_{21} Y_1$ and $Y_2 \leftarrow L_{22}^{-1} B_2$. The back substitution routines are similarly derived. One gets $X_2 \leftarrow L_{22}^{-T} Y_2$, $Y_1 \leftarrow Y_1 - L_{21}^T X_2$ and $X_1 \leftarrow L_{11}^{-T} Y_1$.
New LAPACK like routine PFTRS performs these two solution computations for the eight cases of RFPF. PFTRS calls a new Level 3 BLAS TFSM in the same way that POTRS calls TRSM. The third subfigure in Section 6 gives the Cholesky solution algorithm using RFPF obtained by simple algebraic manipulation of the block Equations (7) and (8).
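A minimal sketch of the block forward and back substitution for the lower triangular case with a single right-hand side follows. The function names are hypothetical; the two small helpers stand in for the TRSM calls and the list comprehensions for the GEMM updates, and the factor is held in an ordinary nested list rather than in RFPF.

```python
def forward_sub(l, b):
    """Solve L y = b for lower triangular L (stand-in for TRSM)."""
    y = []
    for i, row in enumerate(l):
        y.append((b[i] - sum(row[p] * y[p] for p in range(i))) / row[i])
    return y


def back_sub_t(l, y):
    """Solve L^T x = y using only the lower triangle of L (stand-in for TRSM)."""
    n = len(l)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(l[p][i] * x[p] for p in range(i + 1, n))
        x[i] = (y[i] - s) / l[i][i]
    return x


def block_solve(l, b, n1):
    """Solve L L^T x = b with the 2-by-2 block equations of the text."""
    n2 = len(l) - n1
    l11 = [r[:n1] for r in l[:n1]]
    l21 = [r[:n1] for r in l[n1:]]
    l22 = [r[n1:] for r in l[n1:]]
    b1, b2 = b[:n1], b[n1:]
    # Forward substitution: Y1 <- L11^{-1} B1, B2 <- B2 - L21 Y1, Y2 <- L22^{-1} B2.
    y1 = forward_sub(l11, b1)
    b2 = [b2[i] - sum(l21[i][p] * y1[p] for p in range(n1)) for i in range(n2)]
    y2 = forward_sub(l22, b2)
    # Back substitution: X2 <- L22^{-T} Y2, Y1 <- Y1 - L21^T X2, X1 <- L11^{-T} Y1.
    x2 = back_sub_t(l22, y2)
    y1 = [y1[p] - sum(l21[i][p] * x2[i] for i in range(n2)) for p in range(n1)]
    x1 = back_sub_t(l11, y1)
    return x1 + x2


L = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, 5.0, 6.0]]
x_true = [1.0, -2.0, 3.0]
t = [sum(L[p][i] * x_true[p] for p in range(3)) for i in range(3)]  # L^T x
b = [sum(L[i][p] * t[p] for p in range(3)) for i in range(3)]       # L (L^T x)
x = block_solve(L, b, 2)
```

The right-hand side `b` is built from a known solution so that the result of `block_solve` can be compared with `x_true`.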
5 Inversion
Following LAPACK we consider the following three stage procedure:

Factorize the matrix $A$ and overwrite $A$ with either $L$ or $U$ by calling PFTRF; see Section 3.

Compute the inverse of either $L$ or $U$. Call these matrices $W$ or $V$ and overwrite either $L$ or $U$ with them. This is done by calling the new LAPACK-like routine TFTRI.

Calculate either the product $W^T W$ or $V V^T$ and overwrite either $W$ or $V$ with them.
As in Sections 3 and 4 we examine 2–by–2 block algorithms for steps two and three above. In Section 3 we obtain either matrix $L$ or matrix $U$ in RFPF. As with the LAPACK inversion algorithms POTRI and PPTRI, this is the starting point for our LAPACK inversion algorithm using RFPF. The LAPACK inversion algorithms POTRI and PPTRI also follow from steps two and three above, in the full case by first calling LAPACK TRTRI and then calling LAPACK LAUUM.
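Using the 2–by–2 blocking for either $L$ or $U$ in Equation (3) we obtain the following 2–by–2 blocking for $W$ and $V$:
$$W = \begin{pmatrix} W_{11} & 0 \\ W_{21} & W_{22} \end{pmatrix} \quad \text{and} \quad V = \begin{pmatrix} V_{11} & V_{12} \\ 0 & V_{22} \end{pmatrix} \tag{10}$$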
From the identities $L W = I$ and $U V = I$ and the 2–by–2 block layouts of Equations (3) and (10), we obtain three block equations for $W$ and $V$ which can be solved using LAPACK routines for TRTRI and Level 3 BLAS TRMM. An example, see Figure 4, of these three block equations is $L_{11} W_{11} = I$, $L_{21} W_{11} + L_{22} W_{21} = 0$ and $L_{22} W_{22} = I$. The first and third of these block equations are handled by LAPACK TRTRI routines as $W_{11} \leftarrow L_{11}^{-1}$ and $W_{22} \leftarrow L_{22}^{-1}$. In the second inverse computation we use the fact that $L_{22}$ is equally represented by its transpose $L_{22}^T$, which is $U_{22}$ in RFPF. The second block equation leads to two calls to Level 3 BLAS TRMM via $L_{21} \leftarrow -L_{21} W_{11}$ and $W_{21} \leftarrow W_{22} L_{21}$. In the last two block equations the Fortran equality of replacement ($=$) is being used so that $W_{21} = -W_{22} L_{21} W_{11}$ is replacing $L_{21}$.
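The block computation of the inverse of a lower triangular factor can be sketched as follows. The helper names are hypothetical: `trtri_lower` is an unblocked stand-in for LAPACK TRTRI, the generic `matmul` stands in for the two TRMM calls, and the matrix is held in a nested list rather than in RFPF.

```python
def trtri_lower(l):
    """Inverse of a lower triangular matrix (stand-in for TRTRI)."""
    n = len(l)
    w = [[0.0] * n for _ in range(n)]
    for j in range(n):
        w[j][j] = 1.0 / l[j][j]
        for i in range(j + 1, n):
            s = sum(l[i][p] * w[p][j] for p in range(j, i))
            w[i][j] = -s / l[i][i]
    return w


def matmul(a, b):
    """Plain matrix product; stands in for TRMM in this sketch."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def block_trtri(l, n1):
    """W = L^{-1} from the three block equations of the text."""
    n2 = len(l) - n1
    l11 = [r[:n1] for r in l[:n1]]
    l21 = [r[:n1] for r in l[n1:]]
    l22 = [r[n1:] for r in l[n1:]]
    w11 = trtri_lower(l11)                               # W11 <- L11^{-1}
    w22 = trtri_lower(l22)                               # W22 <- L22^{-1}
    t = [[-v for v in row] for row in matmul(l21, w11)]  # L21 <- -L21 W11
    w21 = matmul(w22, t)                                 # W21 <- W22 L21
    return [w11[i] + [0.0] * n2 for i in range(n1)] + \
           [w21[i] + w22[i] for i in range(n2)]


L = [[2.0, 0.0, 0.0, 0.0],
     [1.0, 3.0, 0.0, 0.0],
     [4.0, 5.0, 6.0, 0.0],
     [7.0, 8.0, 9.0, 10.0]]
W = block_trtri(L, 2)
```

Multiplying `L` by `W` with `matmul` gives the identity up to rounding, which is a quick check of the block equations.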
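Now we turn to part three of the three stage LAPACK procedure above. For this we use the 2–by–2 block layouts of Equation (10) and the matrix multiplications indicated by the following block Equations (11):
$$W^T W = \begin{pmatrix} W_{11}^T & W_{21}^T \\ 0 & W_{22}^T \end{pmatrix} \begin{pmatrix} W_{11} & 0 \\ W_{21} & W_{22} \end{pmatrix} \quad \text{and} \quad V V^T = \begin{pmatrix} V_{11} & V_{12} \\ 0 & V_{22} \end{pmatrix} \begin{pmatrix} V_{11}^T & 0 \\ V_{12}^T & V_{22}^T \end{pmatrix} \tag{11}$$
where $W_{11}$, $W_{22}$, $V_{11}$, and $V_{22}$ are lower and upper triangular submatrices, and $W_{21}$ and $V_{12}$ are square or almost square submatrices. The values of the indicated block multiplications of $W^T W$ or $V V^T$ in Equation (11) are stored over the block values of $W$ or $V$.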
Performing the indicated 2–by–2 block multiplications of Equation (11) leads to three block matrix computations. An example, see Section 6, of these three block computations is $W_{11}^T W_{11} + W_{21}^T W_{21}$, $W_{22}^T W_{21}$ and $W_{22}^T W_{22}$. Additionally, we want to overwrite the values of these block multiplications on their original block operands. Block operand $W_{11}$ only occurs in the (1,1) block computation and hence can be overwritten by a call to LAPACK LAUUM, $W_{11} \leftarrow W_{11}^T W_{11}$, followed by a call to Level 3 BLAS SYRK or HERK, $W_{11} \leftarrow W_{11} + W_{21}^T W_{21}$. Block operand $W_{21}$ now only occurs in the (2,1) block computation and hence can be overwritten by a call to Level 3 BLAS TRMM, $W_{21} \leftarrow W_{22}^T W_{21}$. Finally, block operand $W_{22}$ can be overwritten by a call to LAPACK LAUUM, $W_{22} \leftarrow W_{22}^T W_{22}$.
The fourth subfigure in Section 6 has the Cholesky inversion algorithms using RFPF based on the results of this Section. New LAPACK routine, PFTRI, performs this computation for the eight cases of RFPF.
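The third stage, forming $W^T W$ block by block from a lower triangular $W$, can be sketched in the same style. The function names are hypothetical, plain products stand in for LAUUM, SYRK/HERK and TRMM, and the sketch returns the full symmetric result for easy checking, whereas the library routines overwrite a single triangle in place.

```python
def transpose(a):
    """Return the transpose of a nested-list matrix."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Plain matrix product used for all block operations in this sketch."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def block_wtw(w, n1):
    """Form W^T W from the 2-by-2 blocks of a lower triangular W."""
    w11 = [r[:n1] for r in w[:n1]]
    w21 = [r[:n1] for r in w[n1:]]
    w22 = [r[n1:] for r in w[n1:]]
    c11 = matmul(transpose(w11), w11)          # LAUUM: W11 <- W11^T W11
    u = matmul(transpose(w21), w21)            # SYRK/HERK update term W21^T W21
    c11 = [[c11[i][j] + u[i][j] for j in range(n1)] for i in range(n1)]
    c21 = matmul(transpose(w22), w21)          # TRMM: W21 <- W22^T W21
    c22 = matmul(transpose(w22), w22)          # LAUUM: W22 <- W22^T W22
    c12 = transpose(c21)                       # filled in by symmetry
    return [c11[i] + c12[i] for i in range(n1)] + \
           [c21[i] + c22[i] for i in range(len(c22))]


W = [[1.0, 0.0, 0.0, 0.0],
     [2.0, 3.0, 0.0, 0.0],
     [4.0, 5.0, 6.0, 0.0],
     [7.0, 8.0, 9.0, 10.0]]
C = block_wtw(W, 2)
```

The order of the updates matters when working in place: $W_{11}$ must be finished before $W_{21}$ is overwritten, and $W_{21}$ before $W_{22}$, which is the order used in the text.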
6 RFP Data Formats and Algorithms
This section contains three figures.

The first figure describes the RFPF (Rectangular Full Packed Format) and gives algorithms for Cholesky factorization, solution and inversion of symmetric positive definite matrices, where $N$ is odd, uplo = ’lower’, and trans = ’no transpose’. This figure has four subfigures.

The first subfigure depicts the lower triangle of a symmetric positive definite matrix $A$ in standard full format and its representation by the matrix $A_R$ in RFPF.

The second subfigure gives the RFPF Cholesky factorization algorithm and its calling sequences of the LAPACK and BLAS subroutines, see Section 3.

The third subfigure gives the RFPF Cholesky solution algorithm and its calling sequences to the LAPACK and BLAS subroutines, see Section 4.

The fourth subfigure gives the RFPF Cholesky inversion algorithm and its calling sequences to the LAPACK and BLAS subroutines, see Section 5.


The second figure shows the transformation from full to RFPF of all “no transform” cases.

The third figure depicts all eight cases in RFPF.
The data format for $A$ has LDA $= N$. Matrix $A_R$ has LDAR $= N$ if $N$ is odd and LDAR $= N + 1$ if $N$ is even, and $n_1$ columns where $n_1 = \lceil N/2 \rceil$. Hence, matrix $A_R$ always has LDAR rows and $n_1$ columns. Matrix $A_R^T$ always has $n_1$ rows and LDAR columns and its leading dimension is equal to $n_1$. Matrix $A_R$ always has LDAR $\times\, n_1 = NT = N(N+1)/2$ elements as does matrix $A_R^T$.
The order $N$ of matrix $A$ in the first figure is seven and six or seven in the remaining two figures.
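The dimensions of the RFPF array can be checked with a few lines of Python. The sketch assumes that the array has $N$ rows for odd $N$ and $N+1$ rows for even $N$, with $\lceil N/2 \rceil$ columns in both cases, which is the choice consistent with the $N(N+1)/2$ element count of packed storage; the function name is illustrative.

```python
def rfpf_dims(n):
    """Return (rows, columns) of the no-transpose RFPF array for order n."""
    ldar = n if n % 2 else n + 1   # n rows for odd n, n + 1 rows for even n
    n1 = (n + 1) // 2              # ceil(n / 2) columns
    return ldar, n1


# Orders used in the figures of this section.
dims = {n: rfpf_dims(n) for n in (6, 7)}

# The rectangle holds exactly as many elements as packed storage.
minimal = all(r * c == n * (n + 1) // 2
              for n in range(1, 200)
              for r, c in [rfpf_dims(n)])
```

For the transposed cases the two dimensions are simply exchanged, so the element count is unchanged.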
7 Stability of the RFPF Algorithm
The RFPF Cholesky factorization (Section 3), Cholesky solution (Section 4), and Cholesky inversion (Section 5) algorithms are equivalent to the traditional algorithms in the books [Dongarra et al. (1998), Demmel (1997), Golub and Van Loan (1996), Trefethen and Bau (1997)]. The whole theory of the traditional Cholesky factorization, solution, inversion and BLAS algorithms carries over to the three Cholesky and BLAS algorithms described in Sections 3, 4, and 5. The error analysis and stability of these algorithms are very well described in the book of [Higham (1996)]. The difference between the LAPACK algorithms PO, PP and RFPF (full, packed and rectangular full packed, respectively) is how inner products are accumulated. In each case a different order is used. They are all mathematically equivalent, and stability analysis shows that any summation order is stable.
8 A Performance Study using RFP Format
The LAPACK library [Anderson et al. (1999)] routines POTRF/PPTRF, POTRI/PPTRI, and POTRS/PPTRS are compared with the RFPF routines PFTRF, PFTRI, and PFTRS for Cholesky factorization (PxTRF), Cholesky inverse (PxTRI) and Cholesky solution (PxTRS) respectively. In the previous sentence, the character ’x’ can be ’O’ (full format), ’P’ (packed format), or ’F’ (RFPF). In all cases long real precision arithmetic (also called double precision) is used. Sometimes we also show results for long complex precision (also called complex*16). Results were obtained on several different computers using everywhere the vendor Level 3 and Level 2 BLAS. The sequential performance results were done on the following computers:
 Sun Fire E25K (newton):

72 UltraSPARC IV+ dual-core CPUs (1800 MHz / 2 MB shared L2-cache, 32 MB shared L3-cache), 416 GB memory (120 CPUs / 368 GB). Further information at “http://www.gbar.dtu.dk/index.php/Hardware”.
 SGI Altix 3700 (Freke):

64 CPUs, Intel Itanium2 1.5 GHz / 6 MB L3-cache. 256 GB memory. Peak performance: 384 GFlops. Further information at “http://www.cscaa.dk/freke/”.
 Intel Tigerton computer (zoot):

quad-socket quad-core Intel Tigerton 2.4 GHz (16 total cores) with 32 GB of memory. We use Intel MKL 10.0.1.014.
 DMI Itanium:

CPU Intel Itanium2: 1.3 GHz, cache: 3 MB on-chip L3 cache.
 DMI NEC SX6 computer:

8 CPUs, per CPU peak: 8 Gflops, per node peak: 64 Gflops, vector register length: 256.
The performance results are given in Figures 7 to 15. In Appendix A, we give the table data used in the figures, see Tables 2 to 27. We also give speedup numbers, see Tables 28 to 35.
Figures 7 to 10 are paired. Figure 7 (double precision) and Figure 8 (double complex precision) present results for the Sun UltraSPARC IV+ computer. Figure 9 (double precision) and Figure 10 (double complex precision) present results for the SGI Altix 3700 computer. Figure 11 (double precision) presents results for the Intel Itanium2 computer. Figure 12 (double precision) presents results for the NEC SX6 computer. Figure 13 (double precision) presents results for the quad-socket quad-core Intel Tigerton computer using reference LAPACK 3.2.0 (from netlib.org). Figure 14 (double precision) presents results for the quad-socket quad-core Intel Tigerton computer using the vendor LAPACK library (MKL 10.0.1.014).
Figure 15 shows the SMP parallelism of these subroutines on the IBM Power4 (clock rate: 1300 MHz; two CPUs per chip; L1 cache: 128 KB (64 KB per CPU) instruction, 64 KB 2-way (32 KB per CPU) data; L2 cache: 1.5 MB 8-way shared between the two CPUs; L3 cache: 32 MB 8-way shared (off-chip); TLB: 1024 entries) and SUN UltraSPARC-IV (clock rate: 1350 MHz; L1 cache: 64 KB 4-way data, 32 KB 4-way instruction, and 2 KB Write, 2 KB Prefetch; L2 cache: 8 MB; TLB: 1040 entries) computers respectively. The plots compare SMP times of PFTRF, vendor POTRF and reference PPTRF.
The RFPF results greatly outperform the packed results and more often than not are better than the full results. Note that our timings do not include the cost of converting any LAPACK data formats to RFPF data formats and vice versa. We expect that users will input their matrix data directly in RFPF; this is our rationale for not including the data transformation times.
For all our experiments, we use vendor Level 3 and Level 2 BLAS. For all our experiments except Figure 13 and Figure 15, we use the provided vendor library for LAPACK and BLAS.
We include comparisons with reference LAPACK for the quad-socket quad-core Intel Tigerton machine in Figure 13. In this case, the vendor LAPACK library packed storage routines significantly outperform the LAPACK reference implementation from netlib. Figure 14 shows the same experiments on the same machine but, this time, using the vendor library (MKL 10.0.1.014). We think that MKL is using the reference implementation for inverse Cholesky (packed and full format). For Cholesky factorization, we see that both packed and full format routines (PPTRF and POTRF) are tuned. But even in this case, our RFPF storage format results are better.
When we compare RFPF with full storage, results are mixed. However, the two codes are rarely far apart: most of the performance ratios are between 0.95 and 1.05 overall. Note, however, that the RFPF performance is more uniform over its versions (four presented; the other four are for odd $n$). For LAPACK full (two versions), the performance variation is greater. Moreover, in the case of the inversion on the quad-socket quad-core Tigerton (Figure 13 and Figure 14) RFPF clearly outperforms both variants of the full format.
9 Integration in LAPACK
As mentioned in the introduction, as of release 3.2 (November 2008), LAPACK supports a preliminary version of RFPF. Ultimately, the goal would be for RFPF to support as many functionalities as full format or standard packed format does. The 44 routines included in release 3.2 for RFPF are given in Table 1. The names for the RFPF routines follow the naming nomenclature used by LAPACK. We have added the format description letters: PF for Symmetric/Hermitian Positive Definite RFPF (PO for full, PP for packed), SF for Symmetric RFPF (SY for full, SP for packed), HF for Hermitian RFPF (HE for full, HP for packed), and TF for Triangular RFPF (TR for full, TP for packed).
Currently, for the complex case, we assume that the transpose complexconjugate part is stored whenever the transpose part is stored in the real case. This corresponds to the theory developed in this present manuscript. In the future, we will want to have the flexibility to store the transpose part (as opposed to transpose complex conjugate) whenever the transpose part is stored in the real case. In particular, this feature will be useful for complex symmetric matrices.
Table 1: RFPF routines included in LAPACK release 3.2.

| Functionality | Routine names | Calling sequence |
|---|---|---|
| Cholesky factorization | CPFTRF, DPFTRF, SPFTRF, ZPFTRF | (TRANSR,UPLO,N,A,INFO) |
| Multiple solve after PFTRF | CPFTRS, DPFTRS, SPFTRS, ZPFTRS | (TRANSR,UPLO,N,NR,A,B,LDB,INFO) |
| Inversion after PFTRF | CPFTRI, DPFTRI, SPFTRI, ZPFTRI | (TRANSR,UPLO,N,A,INFO) |
| Triangular inversion | CTFTRI, DTFTRI, STFTRI, ZTFTRI | (TRANSR,UPLO,DIAG,N,A,INFO) |
| Sym/Herm matrix norm | CLANHF, DLANSF, SLANSF, ZLANHF | (NORM,TRANSR,UPLO,N,A,WORK) |
| Triangular solve | CTFSM, DTFSM, STFSM, ZTFSM | (TRANSR,SIDE,UPLO,TRANS,DIAG,M,N,ALPHA,A,B,LDB) |
| Sym/Herm rank update | CHFRK, DSFRK, SSFRK, ZHFRK | (TRANSR,UPLO,TRANS,N,K,ALPHA,A,LDA,BETA,C) |
| Conv. from TP to TF | CTPTTF, DTPTTF, STPTTF, ZTPTTF | (TRANSR,UPLO,N,AP,ARF,INFO) |
| Conv. from TR to TF | CTRTTF, DTRTTF, STRTTF, ZTRTTF | (TRANSR,UPLO,N,A,LDA,ARF,INFO) |
| Conv. from TF to TP | CTFTTP, DTFTTP, STFTTP, ZTFTTP | (TRANSR,UPLO,N,ARF,AP,INFO) |
| Conv. from TF to TR | CTFTTR, DTFTTR, STFTTR, ZTFTTR | (TRANSR,UPLO,N,ARF,A,LDA,INFO) |
10 Summary and Conclusions
This paper describes RFPF as a standard minimal full format for representing both symmetric and triangular matrices. Hence, from a user point of view, these matrix layouts are a replacement for both the standard formats of DLA, namely full and packed storage. These new layouts possess three good features: they are efficient, they are supported by Level 3 BLAS and LAPACK full format routines, and they require minimal storage.
11 Acknowledgments
The results in this paper were obtained on seven computers: an IBM, an SGI, two SUNs, an Itanium, a NEC, and an Intel Tigerton. The IBM machine belongs to the Center for Scientific Computing at Aarhus, the SUN machines to the Danish Technical University, the Itanium and NEC machines to the Danish Meteorological Institute, and the Intel Tigerton machine to the Innovative Computing Laboratory at the University of Tennessee.
We would like to thank Bernd Dammann for consulting on the SUN systems; Niels Carl W. Hansen for consulting on the IBM and SGI systems; and Bjarne Stig Andersen for obtaining the results on the Itanium and NEC computers. We thank IBMers John Gunnels who worked earlier on the HFPF format and JP Fasano who was instrumental in getting the source code released by the IBM Open Source Committee. We thank Allan Backer for discussions about an older version of this manuscript.
Appendix A Performance results
n  RFPF  LAPACK  

NO TRANS  TRANS  POTRF  PPTRF  
U  L  U  L  U  L  U  L  
50  827  898  915  834  924  622  435  622 
100  1420  1517  1464  1434  1264  1218  592  811 
200  1734  1795  1590  1746  1707  1858  703  378 
400  2165  2242  2275  2177  2234  2182  791  257 
500  2175  2292  2358  2221  2337  2378  809  251 
800  2426  2550  2585  2455  2618  2567  795  240 
1000  2498  2617  2636  2485  2677  2650  668  217 
1600  2590  2609  2739  2626  2764  2044  614  217 
2000  2703  2758  2829  2711  2912  2753  606  216 
4000  2502  2810  2822  2517  3100  2708  485  91 
n  RFPF  LAPACK  

NO TRANS  TRANS  POTRF  PPTRF  
U  L  U  L  U  L  U  L  
50  716  699  698  714  581  549  535  554 
100  1199  1185  1183  1197  1163  1148  719  721 
200  1768  1742  1756  1774  1821  1806  840  822 
400  2277  2262  2293  2289  2179  2159  919  881 
500  2354  2334  2357  2130  2468  2479  931  891 
800  2551  2361  2593  2584  2636  2629  880  755 
1000  2599  2600  2668  2639  2717  2717  708  520 
1600  2621  2665  2702  2693  2507  2529  610  419 
2000  2717  2767  2831  2740  2818  2854  599  401 
4000  2542  2506  2757  2652  2635  2661  412  158 
rhs  n  RFPF  LAPACK  

NO TRANS  TRANS  POTRS  PPTRS  
U  L  U  L  U  L  U  L  
100  50  1829  1877  1883  1792  1698  1705  549  545 
100  100  2118  2117  2121  2123  2042  1968  713  711 
100  200  2505  2511  2515  2515  2242  2231  689  828 
100  400  2638  2598  2626  2664  2356  2456  715  888 
100  500  2386  2499  2669  2706  2479  2451  743  895 
100  800  2759  2746  2776  2781  2410  2326  626  704 
100  1000  2795  2739  2811  2817  2052  1987  525  554 
160  1600  2870  2873  2886  2875  2431  2289  447  429 
200  2000  2825  2825  2845  2838  2371  2167  416  416 
400  4000  2701  2700  2808  2667  1589  1588  175  168 
n  RFPF  LAPACK  

NO TRANS  TRANS  POTRF  PPTRF  
U  L  U  L  U  L  U  L  
50  1423  1552  1633  1423  1301  1259  872  1333 
100  2032  1986  2067  1854  1624  1905  1199  1353 
200  2329  2277  2337  2198  2117  2374  1465  542 
400  2646  2624  2698  2561  2556  2684  1725  482 
500  2760  2264  2801  2699  2695  2793  1731  476 
800  2890  2851  2897  2839  2874  2310  1315  441 
1000  2929  2899  2954  2900  2958  2958  1244  435 
1600  3002  2962  2563  2874  3204  1519  1202  379 
2000  3031  2971  3016  3011  3372  3021  1173  411 
4000  3022  2930  3011  3036  3185  2148  572  139 
n  RFPF  LAPACK  

NO TRANS  TRANS  POTRF  PPTRF  
U  L  U  L  U  L  U  L  
50  1525  1575  1515  1620  1400  1378  1230  1232 
100  1968  2001  1948  2042  2012  1959  1525  1548 
200  2388  2438  2277  2447  2428  2431  1731  1687 
400  2665  2715  2700  2715  2758  2793  1867  1698 
500  2748  2779  2777  2773  2840  2870  1885  1697 
800  2841  2898  2917  2837  2599  2985  1330  1319 
1000  2897  2943  2971  2914  3005  3040  1264  1258 
1600  2920  2925  2724  2482  2031  3015  1153  1212 
2000  2883  2948  2946  2931  2990  3079  1186  1193 
4000  2839  2939  2975  2823  2485  3007  723  706 
rhs  n  RFPF  LAPACK  

NO TRANS  TRANS  POTRS  PPTRS  
U  L  U  L  U  L  U  L  
100  50  1949  1972  1971  1978  2161  2138  1029  1028 
100  100  2552  2550  2562  2562  2501  2484  1212  1393 
100  200  2858  2859  2860  2847  2646  2620  1303  1629 
100  400  2982  2972  2972  2949  2811  2803  1398  1780 
100  500  2991  2983  2987  2994  2835  2821  1364  1700 
100  800  3083  3062  3083  2717  2819  2784  921  973 
100  1000  3112  3100  3085  2694  2626  2604  853  855 
160  1600  3141  3140  3149  3137  2820  2715  762  752 
200  2000  3172  3182  3174  3171  2714  2698  718  667 
400  4000  3193  3201  3214  3211  2656  2661  240  230 
n  RFPF  LAPACK  

NO TRANS  TRANS  POTRF  PPTRF  
U  L  U  L  U  L  U  L  
50  721  616  642  694  519  537  331  300 
100  1419  1280  1337  1386  1347  1216  612  303 
200  2764  2526  2637  2732  2621  2526  1072  300 
400  4120  3728  3943  4053  4116  3932  1041  292 
500  4430  4142  4313  4410  4495  4568  997  291 
800  4663  4009  4198  4804  5034  3873  1007  290 
1000  4764  4134  4485  5107  4789  3732  1029  289 
1600  4278  3612  3956  4178  3740  2680  153  188 
2000  4061  3611  3657  4087  3771  2335  85  107 
4000  3493  2660  3126  3185  3769  2307  53  81 
n  RFPF  LAPACK  

NO TRANS  TRANS  POTRF  PPTRF  
U  L  U  L  U  L  U  L  
50  774  797  665  818  676  675  317  416 
100  1731  1669  1681  1723  1561  1528  404  754 
200  2945  3140  3169  3195  3034  2975  461  1246 
400  4466  4383  4403  4476  4198  4176  439  1686 
500  4648  4531  4662  4685  4740  4605  429  1795 
800  4827  4815  4891  4799  4463  4833  422  2024 
1000  4992  5016  5194  5155  4699  4931  421  2121 
1600  4882  4957  4908  4874  4293  4733  267  1474 
2000  3482  3749  5031  4967  3916  3072  70  238 
4000  3080  3290  3613  3560  2725  3063  59  152 
rhs  n  RFPF  LAPACK  

NO TRANS  TRANS  POTRS  PPTRS  
U  L  U  L  U  L  U  L  
100  50  2535  2535  2552  2543  3283  2826  496  488 
100  100  3838  3831  3853  3848  4438  4301  860  844 
100  200  4898  4894  4894  4892  5045  5029  1357  1307 
100  400  5311  5298  5251  5246  5067  5185  1312  1695 
100  500  5214  5192  5259  5248  5195  5417  1319  1814 
100  800  5300  5222  5645  5634  4666  4773  1369  2095 
100  1000  4851  4712  4775  4846  4699  4098  1378  2159 
160  1600  3721  3406  3850  4127  3658  3441  180  474 
200  2000  3957  3469  3482  3998  3799  3620  97  338 
400  4000  3913  3994  3957  3555  3945  3768  68  167 
n  RFPF  LAPACK  

NO TRANS  TRANS  POTRF  PPTRF  
U  L  U  L  U  L  U  L  
50  1477  1401  1532  1431  1449  1548  1084  1510 
100  2651  2713  2765  2537  2492  2712  1628  2234 
200  3828  3889  4040  3718  3532  3837  1812  2822 
400  4374  4581  4829  4402  4343  4410  1550  3205 
500  4592  4621  4933  4570  4776  4463  1521  3294 
800  4729  4688  4897  4815  4737  4085  1277  3073 
1000  4735  4694  4928  4689  4727  3334  441  1504 
1600  4796  4701  4901  4737  3872  3801  223  693 
2000  4560  4295  4553  4560  4476  3681  180  368 
4000  3705  3341  4039  4200  4108  3487  101  186 
n  RFPF  LAPACK  

NO TRANS  TRANS  POTRF  PPTRF  
U  L  U  L  U  L  U  L  
50  1618  1633  1666  1744  1652  1529  1424  1284 
100  2750  2744  2762  2968  2661  2523  2241  2037 
200  3766  3780  3787  4085  3951  3582  2359  2764 
400  4489  4404  4509  4708  4587  4408  1671  3205 
500  4642  4594  4699  4860  4667  4618  1627  3340 
800  4854  4826  4949  5044  4522  4634  1315  3366 
1000  3246  3804  4958  5019  4001  3420  148  939 
1600  4491  4623  3420  3620  2446  2881  69  1204 
2000  2978  2912  4119  4158  3756  4040  62  325 
4000  3532  3573  3514  3365  2829  2911  70  412 
rhs  n  RFPF  LAPACK  

NO TRANS  TRANS  POTRS  PPTRS  
U  L  U  L  U  L  U  L  
100  50  3062  3009  3067  3054  3465  3500  1551  1528 
100  100  4106  4106  4114  4109  4287  4314  2146  2129 
100  200  4562  4559  4750  4748  4369  4381  2200  2605 
100  400  4662  4647  4885  4920  4920  5044  2163  2927 
100  500  4612  4612  4970  5007  4925  4717  2193  3005 
100  800  4332  4313  4729  4675  4726  4376  1951  2765 
100  1000  4487  4430  4492  4639  4542  4454  1046  1838 
160  1600  4469  4369  4450  4524  4057  4287  428  1225 
200  2000  4284  4335  4225  4385  4315  4464  290  726 
400  4000  3847  3845  4420  4434  4398  4445  110  373 
n  RFPF  LAPACK  

NO TRANS  TRANS  POTRF  PPTRF  
U  L  U  L  U  L  U  L  
50  781  771  784  771  1107  739  495  533 
100  1843  1788  1848  1812  1874  1725  879  825 
200  3178  2869  2963  3064  2967  2871  1323  1100 
400  3931  3709  3756  3823  3870  3740  1121  1236 
500  4008  3808  3883  3914  4043  3911  1032  1257 
800  4198  4097  4145  4126  3900  4009  612  1127 
1000  4115  4038  4015  3649  3769  3983  305  697 
1600  3851  3652  3967  3971  3640  3987  147  437 
2000  3899  3716  3660  3660  3865  3835  108  358 
4000  3966  3791  3927  4011  3869  4052  119  398 
n  RFPF  LAPACK  

NO TRANS  TRANS  POTRI  PPTRI  
u  l  u  l  u  l  u  l  
50  633  659  648  640  777  870  508  460 
100  1252  1323  1300  1272  1573  1760  815  810 
200  2305  2442  2431  2314  2357  2639  1118  1211 
400  3084  3199  3188  3094  3152  3445  1234  1363 
500  3204  3316  3329  3218  3400  3611  1239  1382 
800  3617  3741  3720  3640  3468  3786  1182  1268 
1000  3611  3716  3637  3590  3456  3790  767  946 
1600  3721  3802  3795  3714  3589  3713  500  609 
2000  3784  3812  3745  3704  3636  3798  473  596 
4000  3822  3762  3956  3851  3760  3750  467  614 
rhs  n  RFPF  LAPACK  

NO TRANS  TRANS  POTRS  PPTRS  
u  l  u  l  u  l  u  l  
100  50  2409  2412  2414  2422  3044  3018  725  714 
100  100  3305  3301  3303  3303  3889  3855  1126  1109 
100  200  4149  4154  4127  4146  4143  4127  1526  1512 
100  400  4398  4403  4416  4444  4469  4451  1097  1088 
100  500  4313  4155  4374  4394  4203  4093  1054  1045 
100  800  3979  3919  4040  4051  3969  4011  692  720 
100  1000  3716  3608  3498  3477  3630  3645  376  372 
160  1600  3892  3874  4020  3994  4001  4011  188  182 
200  2000  4052  4073  4040  4020  4231  4203  119  119 
400  4000  4245  4225  4275  4287  4330  4320  115  144 
n  RFPF  LAPACK  

NO TRANS  TRANS  POTRF  PPTRF  
u  l  u  l  u  l  u  l  
50  206  200  225  225  365  353  57  238 
100  721  728  789  788  1055  989  120  591 
200  2028  2025  2005  2015  1380  1639  246  1250 
400  3868  3915  3078  3073  1763  3311  479  1975 
500  4483  4470  4636  4636  4103  4241  585  2149 
800  5154  5168  4331  4261  3253  4469  870  2399 
1000  5666  5654  5725  5703  5144  5689  1035  2474 
1600  6224  6145  5644  5272  5375  5895  1441  2572 
2000  6762  6788  6642  6610  6088  6732  1654  2598 
4000  7321  7325  7236  7125  6994  7311  2339  2641 
n  RFPF  LAPACK  

NO TRANS  TRANS  POTRI  PPTRI  
u  l  u  l  u  l  u  l  
50  152  152  150  152  148  145  91  61 
100  430  432  428  432  313  310  194  126 
200  950  956  940  941  636  627  404  249 
400  1850  1852  1804  1806  1734  1624  722  470 
500  2227  2228  2174  2181  2180  2029  856  572 
800  3775  3775  3668  3686  3405  3052  1186  842 
1000  4346  4346  4254  4263  4273  3638  1342  985 
1600  5313  5294  5137  5308  5438  4511  1690  1361 
2000  6006  6006  5930  5931  5997  4832  1854  1536 
4000  6953  6953  6836  6888  7041  4814  1921  2122 
rhs  n  RFPF  LAPACK  

NO TRANS  TRANS  POTRS  PPTRS  
U  L  U  L  U  L  U  L  
100  50  873  870  889  886  1933  1941  88  88 
100  100  2173  2171  2200  2189  3216  3236  181  179 
100  200  4236  4230  4253  4245  4165  4166  352  347 
100  400  5431  5431  5410  5408  5302  5303  648  644 
100  500  5563  5562  5568  5567  5629  5632  783  779 
100  800  6407  6407  6240  6240  5569  5593  1132  1128 
100  1000  6578  6578  6559  6558  6554  6566  1325  1320 
160  1600  6781  6805  6430  6430  6799  6809  1732  1727 
200  2000  7568  7569  7519  7519  7406  7407  1920  1914 
400  4000  7858  7858  7761  7761  7626  7627  2414  2410 
n  RFPF  LAPACK  

NO TRANS  TRANS  POTRF  PPTRF  
U  L  U  L  U  L  U  L  
2000  24.8460  24.2070  26.2493  27.3279  24.9569  13.0685  0.9389  0.4790 
4000  39.0849  38.8042  41.1537  41.7441  38.4284  14.6297  0.7378  0.3879 
6000  43.2940  43.9028  45.7611  45.6911  40.1301  14.6023  0.7212  0.3800 
8000  48.2928  48.0530  50.1546  48.9082  40.1865  14.9028  –  – 
10000  50.6669  50.0472  51.5198  50.8383  41.7279  14.9236  –  – 
12000  47.9860  47.5107  50.6640  50.2138  43.1972  14.6511  –  – 
14000  50.3806  50.6969  52.7881  52.3719  43.7816  14.5463  –  – 
16000  51.2309  51.9454  53.5924  53.2322  44.0667  14.2067  –  – 
18000  52.6901  52.2244  54.2978  53.5869  46.2805  14.5523  –  – 
20000  53.6790  54.1209  54.3555  54.7896  45.8757  14.6236  –  – 
n  RFPF  LAPACK  

NO TRANS  TRANS  POTRF  PPTRF  
U  L  U  L  U  L  U  L  
2000  29.9701  31.0403  29.2714  28.4205  13.2510  18.5249  0.6338  0.9229 
4000  38.4338  39.1702  38.3199  37.7938  13.0367  18.1662  0.5080  0.7301 
6000  38.6324  39.1249  39.0177  38.9534  12.8468  18.0594  0.4972  0.7149 
8000  40.6770  40.7352  40.9032  39.8398  12.8871  17.9491  –  – 
10000  41.3971  41.5932  41.6892  41.6400  12.6654  17.5897  –  – 
12000  41.1646  40.8424  40.2776  40.4129  12.4705  17.5883  –  – 
14000  42.1946  42.1400  41.2174  41.3633  12.4050  17.4173  –  – 
16000  42.0274  42.2826  42.4457  42.3624  12.1912  17.2090  –  – 
18000  42.2909  42.4922  41.9356  42.2480  12.1616  17.3289  –  – 
rhs  n  RFPF  LAPACK  

NO TRANS  TRANS  POTRS  PPTRS  
U  L  U  L  U  L  U  L  
100  2000  0.7802  0.7759  0.7947  0.7897  0.7365  0.6691  0.8177  0.7628 
100  4000  0.6925  0.6918  0.7130  0.7113  0.6462  0.6120  0.7310  0.7261 
100  6000  0.6672  0.6639  0.6921  0.6937  0.5955  0.5773  0.7214  0.7193 
100  8000  0.6494  0.6457  0.6787  0.6791  0.5524  0.5463  –  – 
100  10000  0.6247  0.6194  0.6594  0.6579  0.5329  0.5269  –  – 
100  12000  0.6228  0.6230  0.6512  0.6506  0.5336  0.5291  –  – 
100  14000  0.5933  0.6181  0.6291  0.6309  0.5356  0.5271  –  – 
100  16000  0.6020  0.6018  0.6265  0.6295  0.5095  0.5088  –  – 
100  18000  0.6175  0.6164  0.6196  0.6184  0.5310  0.5232  –  – 
100  20000  0.6092  0.6063  0.6022  0.6024  0.5221  0.5163  –  – 
n  RFPF  LAPACK  

NO TRANS  TRANS  POTRF  PPTRF  
U  L  U  L  U  L  U  L  
2000  25.0114  24.7273  27.9415  29.2117  31.7156  22.4987  19.1706  18.1686 
4000  46.5472  46.6683  50.4646  52.1384  53.5300  39.1913  25.5211  23.0137 
6000  57.7951  59.0809  62.8870  62.2730  63.7367  45.6812  34.4061  30.1288 
8000  67.8673  70.0423  72.3038  68.6783  70.7858  48.2404  39.5816  31.3558 
10000  76.6851  78.1704  78.9962  79.0753  75.8030  52.6184  42.2241  35.7579 
12000  72.2916  74.1424  79.1635  78.9553  78.4410  57.9543  46.0673  41.0530 
14000  79.5957  81.4214  85.3673  84.0138  82.1996  59.0167  46.5374  38.7725 
16000  83.6760  84.8718  89.7696  87.4224  83.8289  58.7681  50.8717  45.3575 
18000  86.6604  86.5750  89.3476  88.8508  86.7870  62.9814  52.5077  47.1880 
20000  90.7187  92.3898  92.9467  91.9760  88.2639  64.3982  51.0705  43.1419 
n  RFPF  LAPACK  

NO TRANS  TRANS  POTRF  PPTRF  
U  L  U  L  U  L  U  L  
2000  29.7015  32.2611  29.5448  28.7837  13.8077  18.8739  0.6367  0.9238 
4000  38.6352  39.5333  38.1630  37.9292  13.1999  18.1173  0.5069  0.7288 
6000  38.7001  39.3848  38.5245  39.1682  12.8651  17.8524  –  – 
8000  40.6456  41.2400  41.0437  40.9830  12.8791  17.9160  –  – 
10000  41.5013  41.7725  42.4129  42.3191  12.7119  17.4713  –  – 
12000  41.4199  41.2636  40.6651  40.5983  12.4945  17.6937  –  – 
14000  42.0461  42.4899  41.8353  41.4583  12.4234  17.5004  –  – 
16000  42.5350  42.7828  42.9538  42.4658  12.2014  17.2031  –  – 
18000  42.0039  42.5616  42.3765  41.9941  12.2800  17.3990  –  – 
20000  42.6296  43.0443  41.9921  41.9014  12.1434  17.3876  –  – 
rhs  n  RFPF  LAPACK  

NO TRANS  TRANS  POTRS  PPTRS  
U  L  U  L  U  L  U  L  
100  2000  0.1530  0.1439  0.1396  0.1432  0.1164  0.1093  0.1856  0.1482 
100  4000  0.1516  0.1459  0.1486  0.1503  0.1077  0.1140  0.1545  0.1362 
100  6000  0.1512  0.1451  0.1471  0.1493  0.1101  0.1065  0.1397  0.1223 
100  8000  0.1490  0.1411  0.1429  0.1458  0.1085  0.1100  0.1192  0.1136 
100  10000  0.1452  0.1408  0.1471  0.1430  0.1066  0.1088  0.1027  0.1019 
100  12000  0.1407  0.1429  0.1452  0.1404  0.1079  0.1091  0.0958  0.0926 
100  14000  0.1398  0.1406  0.1404  0.1388  0.1080  0.1100  0.0837  0.0843 
100  16000  0.1374  0.1374  0.1411  0.1405  0.1075  0.1089  0.0786  0.0786 
100  18000  0.1370  0.1366  0.1402  0.1396  0.1086  0.1087  0.0748  0.0745 
100  20000  0.1362  0.1364  0.1394  0.1425  0.1065  0.1117  0.0699  0.0699 
n  nproc  Mflop/s PFTRF  Time (s) PFTRF  Time (s) in PFTRF: POTRF  Time (s) in PFTRF: TRSM  Time (s) in PFTRF: SYRK  Time (s) in PFTRF: POTRF  Time (s) LAPACK POTRF  Time (s) LAPACK PPTRF  
1  2  3  4  5  6  7  8  9  10 
1000  1  2695  0.12  0.02  0.05  0.04  0.02  0.12  0.94 
1000  5  7570  0.04  0.01  0.02  0.01  0.01  0.03  0.32 
1000  10  10699  0.03  0.01  0.01  0.01  0.00  0.02  0.16 
1000  15  18354  0.02  0.00  0.01  0.00  0.00  0.01  0.11 
2000  1  2618  1.02  0.13  0.38  0.38  0.13  0.97  8.74 
2000  5  10127  0.26  0.04  0.10  0.09  0.04  0.24  3.42 
2000  10  17579  0.15  0.02  0.06  0.05  0.03  0.12  1.65 
2000  15  23798  0.11  0.02  0.04  0.04  0.01  0.13  1.11 
3000  1  2577  3.49  0.45  1.33  1.28  0.44  3.40  30.42 
3000  5  11369  0.79  0.11  0.28  0.30  0.11  0.71  11.76 
3000  10  19706  0.46  0.06  0.19  0.16  0.05  0.38  6.16 
3000  15  29280  0.31  0.05  0.12  0.10  0.04  0.26  4.28 
4000  1  2664  8.01  1.01  2.90  3.09  1.01  7.55  75.72 
4000  5  11221  1.90  0.26  0.68  0.72  0.24  1.65  25.73 
4000  10  21275  1.00  0.13  0.39  0.36  0.12  0.86  13.95 
4000  15  31024  0.69  0.09  0.28  0.24  0.08  0.59  10.46 
5000  1  2551  16.34  2.04  6.16  6.10  2.04  15.79  154.74 
5000  5  11372  3.66  0.45  1.37  1.44  0.40  3.27  47.76 
5000  10  22326  1.87  0.25  0.78  0.62  0.22  1.73  28.13 
5000  15  32265  1.29  0.17  0.53  0.45  0.14  1.16  20.95 
n  nproc  Mflop/s PFTRF  Time (s) PFTRF  Time (s) in PFTRF: POTRF  Time (s) in PFTRF: TRSM  Time (s) in PFTRF: SYRK  Time (s) in PFTRF: POTRF  Time (s) LAPACK POTRF  Time (s) LAPACK PPTRF  
1  2  3  4  5  6  7  8  9  10 
1000  1  1587  0.21  0.03  0.09  0.07  0.03  0.19  1.06 
1000  5  4762  0.07  0.02  0.02  0.02  0.02  0.07  1.13 
1000  10  5557  0.06  0.01  0.01  0.02  0.02  0.06  1.12 
1000  15  5557  0.06  0.02  0.01  0.01  0.02  0.06  1.11 
2000  1  1668  1.58  0.22  0.63  0.52  0.22  1.45  11.20 
2000  5  6667  0.40  0.07  0.13  0.13  0.07  0.38  11.95 
2000  10  8602  0.31  0.06  0.07  0.11  0.07  0.25  11.24 
2000  15  9524  0.28  0.06  0.06  0.08  0.08  0.23  11.66 
3000  1  1819  4.95  0.62  1.98  1.72  0.63  4.86  45.48 
3000  5  6872  1.31  0.20  0.42  0.48  0.20  1.38  55.77 
3000  10  12162  0.74  0.14  0.22  0.21  0.16  0.76  46.99 
3000  15  12676  0.71  0.14  0.16  0.30  0.16  0.61  45.71 
4000  1  1823  11.70  1.52  4.62  4.01  1.55  11.86  112.52 
4000  5  7960  2.68  0.40  0.94  0.92  0.42  2.74  112.77 
4000  10  14035  1.52  0.26  0.47  0.49  0.30  1.61  112.53 
4000  15  17067  1.25  0.24  0.37  0.35  0.29  1.29  111.67 
5000  1  1843  22.61  2.92  8.76  8.00  2.93  23.60  218.94 
5000  5  8139  5.12  0.77  1.81  1.80  0.74  5.45  221.58 
5000  10  14318  2.91  0.50  0.97  0.93  0.51  3.11  214.54 
5000  15  17960  2.32  0.45  0.72  0.68  0.47  2.40  225.08 
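The Mflop/s column of the table directly above appears consistent with the usual Cholesky operation count of about n^3/3 floating point operations divided by the PFTRF time in seconds. The following minimal Python sketch makes that relation explicit. The flop count and the interpretation of the times as seconds are assumptions inferred from the numbers, and `cholesky_mflops` is a hypothetical helper name. The sample rows are copied from the table, and small differences are expected because the printed times are rounded to two decimals.

```python
def cholesky_mflops(n, seconds):
    """Approximate Mflop/s of an order-n Cholesky factorization.

    Assumes the leading-order operation count n**3 / 3.
    """
    flops = n ** 3 / 3.0
    return flops / seconds / 1.0e6


# (n, nproc, PFTRF time, reported Mflop/s) taken from the table above.
samples = [
    (1000, 1, 0.21, 1587),
    (1000, 5, 0.07, 4762),
    (3000, 1, 4.95, 1819),
]

for n, nproc, seconds, reported in samples:
    estimate = cholesky_mflops(n, seconds)
    print(f"n={n}, nproc={nproc}: estimated {estimate:.0f} Mflop/s, reported {reported}")
```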
factorization  inversion  solution  

PF/PO  PF/PP  PF/PO  PF/PP  PF/PO  PF/PP  
50  0.99  1.47  1.23  1.29  1.10  3.43 
100  1.20  1.87  1.03  1.66  1.04  2.98 
200  0.97  2.55  0.97  2.11  1.12  3.04 
400  1.02  2.88  1.05  2.50  1.08  3.00 
500  0.99  2.91  0.95  2.53  1.09  3.02 
800  0.99  3.25  0.98  2.95  1.15  3.95 
1000  0.98  3.95  0.98  3.77  1.37  5.08 
1600  0.99  4.46  1.07  4.43  1.19  6.46 
2000  0.97  4.67  0.99  4.73  1.20  6.84 
4000  0.91  5.82  1.04  6.69  1.77  16.05 
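The PF/PO and PF/PP entries in the ratio table above seem to be reproducible from the Mflop/s tables as the best RFPF rate over the four layout variants divided by the best rate of the corresponding LAPACK full (PO) or packed (PP) routine over its two UPLO variants. This reading is inferred from the numbers rather than stated here, so the sketch below is only an illustration. `best_rate_ratio` is a hypothetical helper, and the input values are the n = 50 row of the first Mflop/s table in this appendix.

```python
def best_rate_ratio(rfpf_rates, lapack_rates):
    """Ratio of the best RFPF rate to the best LAPACK rate (assumed definition)."""
    return max(rfpf_rates) / max(lapack_rates)


# n = 50 row of the first Mflop/s table in this appendix.
rfpf = [827, 898, 915, 834]   # NO TRANS U, NO TRANS L, TRANS U, TRANS L
potrf = [924, 622]            # LAPACK full format, U and L
pptrf = [435, 622]            # LAPACK packed format, U and L

pf_po = best_rate_ratio(rfpf, potrf)
pf_pp = best_rate_ratio(rfpf, pptrf)

print(f"PF/PO = {pf_po:.2f}, PF/PP = {pf_pp:.2f}")
```

With these inputs the two ratios round to the factorization entries listed for n = 50 in the ratio table above.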
factorization  inversion  solution  

PF/PO  PF/PP  PF/PO  PF/PP  PF/PO  PF/PP  
50  1.16  1.31  1.26  1.23  0.92  1.92 
100  1.01  1.32  1.09  1.53  1.02  1.84 
200  1.01  1.41  0.98  1.60  1.08  1.76 
400  0.97  1.45  1.01  1.56  1.06  1.68 
500  0.97  1.47  1.00  1.62  1.06  1.76 
800  0.98  2.19  1.01  2.20  1.09  3.17 
1000  0.98  2.35  1.00  2.37  1.19  3.64 
1600  0.97  2.41  0.94  2.50  1.12  4.13 
2000  0.96  2.47  0.90  2.58  1.17  4.43 
4000  0.99  4.11  0.95  5.31  1.21  13.39 
factorization  inversion  solution  

PF/PO  PF/PP  PF/PO  PF/PP  PF/PO  PF/PP  
50  1.34  2.18  1.21  1.97  0.78  5.15 
100  1.05  2.32  1.11  2.30  0.87  4.48 
200  1.05  2.58  1.05  2.56  0.97  3.61 
400  1.00  3.96  1.07  2.65  1.02  3.13 
500  0.97  4.44  0.99  2.61  0.97  2.90 
800  0.95  4.77  1.01  2.42  1.18  2.69 
1000  1.07  4.96  1.05  2.45  1.03  2.25 
1600  1.14  22.76  1.05  3.36  1.13  8.71 
2000  1.08  38.20  1.28  21.14  1.05  11.83 
4000  0.93  43.12  1.18  23.77  1.01  23.92 
factorization  inversion  solution  

PF/PO  PF/PP  PF/PO  PF/PP  PF/PO  PF/PP  
50  0.99  1.01  1.06  1.22  0.88  1.98 
100  1.02  1.24  1.12  1.32  0.95  1.92 
200  1.05  1.43  1.03  1.48  1.08  1.82 
400  1.10  1.51  1.03  1.47  0.98  1.68 
500  1.03  1.50  1.04  1.46  1.02  1.67 
800  1.03  1.59  1.09  1.50  1.00  1.71 
1000  1.04  3.28  1.25  5.35  1.02  2.52 
1600  1.27  7.07  1.60  3.84  1.06  3.69 
2000  1.02  12.39  1.03  12.79  0.98  6.04 
4000  1.02  22.58  1.23  8.67  1.00  11.89 
factorization  inversion  solution  

PF/PO  PF/PP  PF/PO  PF/PP  PF/PO  PF/PP  
50  0.71  1.47  0.76  1.30  0.80  3.34 
100  0.99  2.10  0.75  1.62  0.85  2.94 
200  1.07  2.40  0.93  2.02  1.00  2.72 
400  1.02  3.18  0.93  2.35  0.99  4.05 
500  0.99  3.19  0.92  2.41  1.05  4.17 
800  1.05  3.72  0.99  2.95  1.01  5.63 
1000  1.03  5.90  0.98  3.93  1.02  9.88 
1600  1.00  9.09  1.02  6.24  1.00  21.38 
2000  1.01  10.89  1.00  6.40  0.96  34.23 
4000  0.99  10.08  1.05  6.44  0.99  29.77 
factorization  inversion  solution  

PF/PO  PF/PP  PF/PO  PF/PP  PF/PO  PF/PP  
50  0.62  0.95  1.03  1.67  0.46  10.10 
100  0.75  1.34  1.38  2.23  0.68  12.15 
200  1.24  1.62  1.50  2.37  1.02  12.08 
400  1.18  1.98  1.07  2.57  1.02  8.38 
500  1.09  2.16  1.02  2.60  0.99  7.11 
800  1.16  2.15  1.11  3.18  1.15  5.66 
1000  1.01  2.31  1.02  3.24  1.00  4.96 
1600  1.06  2.42  0.98  3.14  1.00  3.93 
2000  1.01  2.61  1.00  3.24  1.02  3.94 
4000  1.00  2.77  0.99  3.28  1.03  3.26 
factorization  inversion  solution  

PF/PO  PF/PP  PF/PO  PF/PP  PF/PO  PF/PP  
2000  1.10  29.11  1.68  33.63  1.08  0.97 
4000  1.09  56.58  2.16  53.65  1.10  0.98 
6000  1.14  63.45  2.17  54.73  1.16  0.96 
8000  1.25  –  2.28  –  1.23  – 
10000  1.23  –  2.37  –  1.24  – 
12000  1.17  –  2.34  –  1.22  – 
14000  1.21  –  2.42  –  1.18  – 
16000  1.22  –  2.47  –  1.24  – 
18000  1.17  –  2.45  –  1.17  – 
20000  1.19  –  2.48  –  1.17  – 
factorization  inversion  solution  

PF/PO  PF/PP  PF/PO  PF/PP  PF/PO  PF/PP  
2000  0.92  1.52  1.71  34.92  1.31  0.82 
4000  0.97  2.04  2.18  54.24  1.33  0.98 
6000  0.99  1.83  2.21  –  1.37  1.08 
8000  1.02  1.83  2.30  –  1.35  1.25 
10000  1.04  1.87  2.43  –  1.35  1.43 
12000  1.01  1.72  2.34  –  1.33  1.52 
14000  1.04  1.83  2.43  –  1.28  1.67 
16000  1.07  1.76  2.50  –  1.30  1.80 
18000  1.03  1.70  2.45  –  1.29  1.87 
20000  1.05  1.82  2.48  –  1.28  2.04 