A nested Krylov subspace method to compute the sign function of large complex matrices
Abstract
We present an acceleration of the wellestablished KrylovRitz methods to compute the sign function of large complex matrices, as needed in lattice QCD simulations involving the overlap Dirac operator at both zero and nonzero baryon density. KrylovRitz methods approximate the sign function using a projection on a Krylov subspace. To achieve a high accuracy this subspace must be taken quite large, which makes the method too costly. The new idea is to make a further projection on an even smaller, nested Krylov subspace. If additionally an intermediate preconditioning step is applied, this projection can be performed without affecting the accuracy of the approximation, and a substantial gain in efficiency is achieved for both Hermitian and nonHermitian matrices. The numerical efficiency of the method is demonstrated on lattice configurations of sizes ranging from to , and the new results are compared with those obtained with rational approximation methods.
[t1]Supported by the DFG collaborative research center SFB/TR55 “Hadron Physics from Lattice QCD”.
1 Introduction
In quantum chromodynamics (QCD) some physical observables rely on the chiral properties of the theory. To study such observables in a lattice formulation of QCD it is important to discretize the Dirac operator such that it respects the corresponding chiral symmetry. This is most faithfully achieved using the overlap Dirac operator Narayanan:1993sk; Narayanan:1994gw. To study QCD at nonzero baryon density the overlap formulation was recently extended to include a quark chemical potential Bloch:2006cd; Bloch:2007xi. A major ingredient in the overlap operator, which makes its use very challenging, is the computation of the sign function of a complex matrix, which is Hermitian at zero baryon density, but becomes nonHermitian when a quark chemical potential is introduced.
The search for efficient numerical methods to compute the sign function for the large sparse matrices encountered in this context is an ongoing field of research. Typically, Krylov subspace methods are employed to evaluate the operation of a matrix function on an arbitrary vector. We distinguish two main variants: the KrylovRitz approximation, which evaluates the function via a projection on the Krylov subspace, and the rational approximation, where the function is first approximated by a partial fraction expansion, which is then efficiently solved using a multishift Krylov subspace inverter.
In the Hermitian case efficient rational approximation methods for the sign function have been devised Neuberger:1998my; vandenEshof:2002ms and are currently being used in large scale lattice simulations. The current method of choice uses the Zolotarev partial fraction expansion vandenEshof:2002ms; Chiu:2002eh; Kennedy:2004tk, which yields the optimal rational approximation to the sign function over a real interval zolotarev77, in conjunction with a multishift conjugate gradient inversion. For nonHermitian matrices, which occur in the presence of a quark chemical potential, Krylov subspace approximations to the sign function are relatively new and still under development. Recently, partial fraction expansion methods using the Neuberger expansion Neuberger:1998my with nonHermitian multishift inverters were proposed Bloch:2009in.
The KrylovRitz approximation, which we discuss in this paper, is based on the construction of a Krylov basis and its accompanying Ritz matrix. Depending on the algorithm used to construct the basis we distinguish between the Lanczos approximation in the Hermitian case vandenEshof:2002ms, and the Arnoldi approximation Bloch:2007aw or twosided Lanczos approximation Bloch:2008gh in the nonHermitian case. The latter clearly yields the more efficient function approximation for nonHermitian matrices Bloch:2008gh. In the KrylovRitz approximation the large complex matrix is projected on the Krylov subspace, and its sign function is approximated by lifting the sign function of its projected image (Ritz matrix) back to the original space. The latter sign function is computed to high accuracy using the spectral definition of a matrix function or using a matrixiterative method. When a large Krylov subspace is needed to reach the desired accuracy, the computation of this matrix sign function becomes a bottleneck for the algorithm.
Herein we will introduce an enhancement of the KrylovRitz approximation method which substantially reduces the cost of this internal sign computation and boosts the efficiency of the overall method, such that it competes with, and even surpasses, the rational function approximation in both the Hermitian and nonHermitian case. The dramatic reduction in computation time is achieved by projecting the Ritz matrix on an even smaller, nested Krylov subspace, after performing a suitable preconditioning step first. The desired sign function is then computed via the sign function of the inner Ritz matrix, which yields the same accuracy as the original KrylovRitz approximation.
The outline of the paper is as follows. In Sec. 2 we introduce the overlap operator and the matrix sign function. In Sec. 3 we show how the matrix function of large matrices is computed using KrylovRitz approximation methods. In Sec. 4 we introduce the nested Krylov subspace method, which substantially enhances the efficiency of the KrylovRitz approximation to the sign function. We study its convergence properties and present numerical results for various lattice sizes, including a comparison with rational approximation methods. Finally, our conclusions are given in Sec. 5. For completeness we have added some algorithms in Appendix.
2 Overlap operator and the matrix sign function
Our motivation to develop numerical algorithms to compute the matrix sign function of large, sparse, complex matrices comes from its application in lattice quantum chromodynamics (LQCD). The overlap formulation of the Dirac operator Narayanan:1993sk; Narayanan:1994gw, which ensures that chiral symmetry is preserved in LQCD, is given in terms of the matrix sign function Neuberger:1997fp, and its definition in the presence of a quark chemical potential Bloch:2006cd is given by
(1) 
where denotes the identity matrix, with the Dirac gamma matrices in Euclidean space, is the matrix sign function, and
(2) 
is the Wilson Dirac operator at nonzero chemical potential Hasenfratz:1983ba with , , and SU(3), where . The exponential factors implement the quark chemical potential on the lattice. For the argument of the sign function is Hermitian, while for it is nonHermitian. To compute the overlap operator we need to define the matrix sign function for a general complex matrix of dimension . A generic matrix function can be defined by
(3) 
where is a collection of contours in such that is analytic inside and on and such that encloses the spectrum of . If is diagonalizable, i.e., , with diagonal eigenvalue matrix and , then this general definition can be simplified to the wellknown spectral form
(4) 
with
(5) 
If cannot be diagonalized, a spectral definition of can still be derived using the Jordan decomposition Golub. For simplicity, but without loss of generality, we assume diagonalizability in the following. For Hermitian the eigenvalues are real and their sign is defined by for with , such that Eq. (4) readily defines the matrix sign function. For nonHermitian the eigenvalues are complex and require a definition of for . The sign function needs to satisfy and reproduce the usual for real . We define
(6) 
where the cut of the square root is chosen along the negative real axis. This choice, although not unique, gives the correct physical result for the overlap Dirac operator in Eq. (1) (see Ref. Bloch:2007xi).
3 KrylovRitz approximations for matrix functions
Since we aim at problems with large matrices, as is the case in LQCD, memory and computing power limitations require sophisticated methods to deal with the sign function. For a matrix of large dimension the common approach is not to compute but rather its action on a vector, i.e., , which is needed by iterative inverters to compute or by iterative eigenvalues solvers for . The KrylovRitz method approximates the resulting vector in the Krylov subspace
(7) 
of , implicitly making a polynomial approximation of degree to . The optimal approximation to in this subspace is its orthogonal projection . For , where the form an orthonormal basis of , an orthogonal projector is given by , and we have
(8) 
However, to compute this projection on the Krylov subspace we already need , which is the quantity we wanted to determine in the first place. Thus, we need to replace this exact projection by an approximation. To reduce the large dimensionality of the problem one typically projects on the Krylov subspace using . The projected matrix has dimension but rank at most . The dimensional image of the projected matrix is defined by the matrix , which is often referred to as Ritz matrix. The components of are the projection coefficients of in the basis , as and are related by (in analogy to the vector case).
The KrylovRitz approximation gallopoulos89parallel; saad:209 to consists in taking the function of the Ritz matrix and lifting it back to the full dimensional space,
(9) 
This approximation actually replaces the polynomial interpolating at the eigenvalues of by the polynomial interpolating at the eigenvalues of , also called Ritz values saad:209. Substituting the approximation (9) in yields
(10) 
where we choose collinear with , i.e., , with the first unit vector of . To evaluate the approximation (10) we do not need to perform the matrix multiplications of Eq. (9) explicitly. First, one computes the function of the dimensional Ritz matrix to high accuracy, using the spectral definition (4) or a matrixiterative method. Then, the final approximation is simply a linear combination of the basis vectors , with coefficients given by the first column of multiplied with .
The KrylovRitz approximation described above uses an orthonormal basis of . For the Hermitian case such a basis can efficiently be constructed using the Lanczos algorithm, which we listed in A for completeness. It generates an orthonormal basis and a tridiagonal symmetric using a threeterm recurrence relation. The nonHermitian case is more laborious as the construction of an orthonormal basis is typically performed using the Arnoldi algorithm, which suffers from long recurrences as each basis vector has to be orthogonalized with respect to all the previous ones. The twosided Lanczos algorithm is a suitable alternative Bloch:2008gh which uses two threeterm recurrences to construct bases and of the right, respectively left, Krylov subspaces and , which are biorthonormal, i.e., (see B for a listing of the algorithm). The lack of orthogonality of the basis prevents the construction of the orthogonal projector needed for the KrylovRitz function approximation (9). Nevertheless, the biorthonormality between and can be used to construct an oblique projector on the right Krylov subspace. The oblique projection of is and its dimensional image is defined by , which we call twosided Ritz matrix, such that . The matrix generated by the twosided Lanczos algorithm is tridiagonal. The twosided KrylovRitz approximation to then consists in taking the matrix function of and lifting it back to the original space,
(11) 
After applying this approximation of to we find an expression which is similar to Eq. (10),
(12) 
where the last step assumes that . The price paid to achieve short recurrences in the nonHermitian case is the loss of orthogonality of the projection on the Krylov subspace, which translates in a somewhat lower accuracy of the twosided Lanczos approximation compared to the Arnoldi approximation, for equal Krylov subspace sizes. Nevertheless, the large gain in speed makes it by far the more efficient method Bloch:2008gh.
In the case where is the sign function, the approximations (10) and (12) require the computation of . Although it could be computed directly with the spectral definition (4), matrixiterative methods are often cheaper for medium sized matrices. We choose to employ the RobertsHigham iteration (RHi) Rob80: Set and compute
(13) 
This iteration converges quadratically to , if the sign function for complex arguments is defined by Eq. (6). The matrix inversion scales like and so will the RHi. For the QCD application considered here, typically 7 to 10 iterations are necessary to converge within machine precision Bloch:2007aw; Bloch:2008gh.
The hope is that the KrylovRitz approximations (10) and (12) are accurate for . The method is known to work very well as long as no eigenvalues are close to a function discontinuity. However, for the sign function this method suffers from the sign discontinuity along the imaginary axis. If has eigenvalues close to this discontinuity the approximating polynomial must steeply change from to over a small interval to give an accurate approximation. This cannot be achieved with a low order polynomial, i.e., the Krylov subspace must be large, which makes the algorithm expensive. The common solution to this problem is to use deflation, where the contribution of the eigencomponents associated to these critical eigenvalues to the sign function is computed exactly.
The convergence of the KrylovRitz approximations to the matrix sign function is illustrated in Fig. 1: the Lanczos approximation for the Hermitian case on the left, and the twosided Lanczos approximation for the nonHermitian case on the right. The accuracy of the approximation cannot be determined by comparing to the exact value , as its evaluation by direct methods is too costly if A is large. To obtain an estimate for the error, we compute (by applying the KrylovRitz approximation twice in succession), which should equal if the approximation to the sign function were exact, and then take as a measure for the error. This error estimate proved to be consistent with the true error obtained by comparing the approximation to the exact solution for and lattices, and will therefore be used for all lattice sizes. Here, and in all subsequent tests, we choose the test vector . As expected, the accuracy improves with increasing Krylov subspace size , and a larger deflation gap , corresponding to a higher number of deflated eigenvectors, leads to a faster convergence. For a given accuracy and equal deflation gap, the subspace size required for nonHermitian is larger than for Hermitian .
To analyze the efficiency of the algorithm we briefly sketch the three major contributions to the total CPU time. For each matrix the deflation requires the computation of the critical eigenvalues and the corresponding eigenvectors. The time needed by the rest of the algorithm strongly depends on the eigenvalue gap, as the Krylov subspace size can be reduced if the deflation gap is increased. As mentioned at the beginning of this section, the product is usually needed for many source vectors , e.g., as part of an iterative inversion. In this case the expensive deflation of only needs to be performed once in an initialization step, while the Krylov subspace part of the algorithm will be repeated for each new vector . For this reason we assume from now on that an initial deflation has been performed and we will concentrate on the efficiency of the Krylov subspace part of the algorithm. We discern two main components in the KrylovRitz method: the construction of the Krylov basis using the Lanczos or twosided Lanczos algorithms, where the computation time grows linearly with the subspace size , and the RHi to compute , which scales as . Figure 2 illustrates these last two contributions. For high accuracy the Krylov subspace becomes large such that the cost of the RHi dominates the total CPU time of the KrylovRitz approximation and the method becomes too costly. In the following, the implementation of the KrylovRitz approximation for which is computed using Eq. (13) will be referred to as nonnested method. In the next section we will present a nested Krylov subspace method, which drastically reduces the cost to compute and vastly improves the overall efficiency of the KrylovRitz approximation.
4 Nested Krylov subspace method for the sign function
4.1 Nesting and preconditioning
We introduce a new method which speeds up the expensive computation of the vector required in the KrylovRitz approximations (10) and (12) to . The idea is to approximate this matrixvector product by a further KrylovRitz approximation, using a second, nested Krylov subspace (specified below) of size , i.e.,
(14) 
where is the matrix containing the basis vectors of the inner Krylov subspace, constructed with the Lanczos or twosided Lanczos method, and is the inner Ritz or twosided Ritz matrix. The is computed using the RHi on the inner Ritz matrix . After substituting this result in Eq. (10) and (12) we get the nested approximation
(15) 
to . By introducing an additional Krylov subspace, the number of operations necessary to compute is reduced from in the nonnested method to + . If this will very much improve the efficiency of the KrylovRitz approximation.
The obvious choice for the inner Krylov subspace is . However, it is easy to see that approximations in this Krylov subspace will not improve the efficiency of the method.
The Ritz matrix of the Krylov subspace will only contain information coming from the upper left corner of , because of the tridiagonal nature of and the sparseness of the source vector .
This will effectively cut down the size of the outer Krylov subspace from to ,
which will substantially worsen the accuracy of the approximation if is chosen much smaller than .
Nonetheless, the nested Krylov subspace method can be made to work efficiently
if we perform an initial preconditioning step on the tridiagonal Ritz matrix, replacing
(16) 
with a positive real number, and construct the approximation to in the Krylov subspace . This alternate Krylov subspace can be used to compute because the transformation leaves the sign unchanged. To show this, we note that both matrices have identical eigenvectors, as a matrix and its inverse share the same eigenvectors, and that the sign of their eigenvalues satisfies
(17) 
where we used the definition (6). Hence, according to Eq. (4).
As is tridiagonal the cost of its inversion, required in (16), is only of . Moreover, as the transformation increases the relative gap between the spectrum and the singularity along the imaginary axis (see below), we expect a clear gain in efficiency for the inner KrylovRitz approximation, characterized by .
For a Hermitian matrix the transformation induced by the preconditioning step is illustrated in Fig. 3 for real positive eigenvalues (for negative values the graph would be reflected with respect to the origin).
The factor is chosen to optimize the effect of the transformation on the relative distance to the imaginary axis, which in the Hermitian case corresponds to a minimization of the condition number.
We examine the condition number for the Hermitian case, assuming that the spectral support of is similar to that of the original matrix , after deflation.
As can be seen from Fig. 3, after transformation the smallest eigenvalue (in absolute value) is , while the largest will be given by the transform of either the smallest or largest eigenvalues of . The smallest condition number will be achieved when both values are identical, i.e., for satisfying
(18) 
where and , for in the spectrum of , and the largest transformed eigenvalue will be
(19) 
In the Hermitian case, the transformation (16) therefore reduces the condition number by a factor
(20) 
The effect of the preconditioning of the Ritz matrix for a typical spectrum of in lattice QCD is illustrated in Fig. 4 for the Hermitian case. The top and bottom graphs depict the spectra of and , respectively. The spectrum of the original Ritz matrix has only a small gap at zero, while the gap for the transformed matrix is large. In this example, the condition number is almost improved by a factor 20. In general, the value of for varies only slightly with the choice of the simulation parameters and will mainly depend on the deflation gap.
For the nonHermitian case, let us assume that the complex spectrum is contained in the circles , with real center and radius . The optimal , maximizing the relative distance from the imaginary axis for the transformed spectrum, is still given by Eq. (18) which now simplifies to . For this choice the transformed eigenvalues are contained in the circles with center and radius , where . This is illustrated in the left panel of Fig. 5, where we show the transformation of a circle for the optimal and a suboptimal value of . For suboptimal the transformation yields an inner and an outer circlelike contour, which merge into the circle when . For the relative distance from the imaginary axis will be maximal and we expect the transformation (16) to work best. The gain in efficiency will however not be as large as for the Hermitian case. This can be quantified by the relative distance to the imaginary axis, in analogy to the calculation performed above for the Hermitian case. For the original spectrum we define the relative distance as
(21)  
and for the transformed spectrum  
(22) 
The improvement factor due to the transformation is given by the ratio of these distances, yielding
(23) 
where we wrote , with the deflation gap. When we will have . For the example shown in the left plot of Fig. 5 the transformation generates an improvement by a factor , as computed with Eq. (23).
In lattice QCD at nonzero baryon density is usually weakly nonHermitian and, after deflation, the spectra are contained in ellipses , with center and major and minor axes and along the real and imaginary axes, respectively. The transformation (16) of an ellipse with aspect ratio is illustrated in the right panel of Fig. 5. For such a narrow ellipse the transformed spectrum is qualitatively similar to the Hermitian case, as all the eigenvalues are transformed to the right of , i.e., away from the imaginary axis, such that the high efficiency of the transformation is still guaranteed. The optimal value is again determined by (18) with , as it maximizes the relative distance from the imaginary axis. The transformation is illustrated for a realistic test case of lattice QCD in Fig. 6, where the eigenvalues and transformed eigenvalues of the Ritz matrix for are shown for .
As we will see below the preconditioning step significantly speeds up the KrylovRitz approximation in its application to lattice QCD at zero and nonzero chemical potential.
4.2 Convergence
In this section we investigate the convergence properties of the nested method. The method was implemented to compute the sign function of needed by the overlap operator (1), for both the Hermitian and the nonHermitian case. Whenever the matrix has eigenvalues close to the imaginary axis, these critical eigenvalues are first deflated to open up a deflation gap, necessary to keep the Krylov subspace within a reasonable size (see Sec. (3)). Our implementation uses Chroma Edwards:2004sx to compute the Wilson operator. The linear algebra is performed with BLAS and LAPACK routines. To ensure the efficiency of the nested method a judicious implementation of the preconditioning step (16), used to construct the inner Krylov subspace, is needed. Explicitly inverting the tridiagonal matrix to form the full matrix , then constructing the basis of the inner Krylov subspace by successive full matrixvector multiplications would make a rather inefficient algorithm. To construct the inner Krylov subspace we do not need to construct the full matrix explicitly, but only have to apply to vectors of (in the nonHermitian case is also needed). These products are best computed using the decomposition of , which is and thus especially efficient for tridiagonal matrices. A detailed listing of the algorithm is given in C.
The overall accuracy of the nested approximation (15) depends on the parameters and , defining the sizes of the outer and inner Krylov subspaces, respectively. For the solution of the nested method will converge to that of the nonnested method with Krylov subspace size and accuracy , so its total error will also converge to . To investigate the accuracy of the nested algorithm, our strategy is to fix the outer Krylov subspace size , corresponding to a certain desired accuracy, and vary the inner Krylov subspace size . We show the convergence results for an lattice configuration in Fig. 7, for both the Hermitian and nonHermitian case. As expected the nested method reaches the accuracy of the nonnested method when its size is large enough. Surprisingly however, this happens for , as the convergence of the inner Ritz approximation seems to be extremely fast. The smallest value of for which optimal convergence is reached will be called . The fast convergence is closely related to the large improvement in condition number discussed in the previous section. We also showed in Eq. (20) how the improvement of the condition number, due to the preconditioning of the Ritz matrix , depends on the deflation gap. A smaller gap will yield a larger improvement, and viceversa. This in turn will influence the convergence rate of the nested method. Figure 8 verifies that the result remains valid for different deflation gaps. The figure also illustrates that the somewhat larger reduction in condition number achieved for a smaller gap yields an accordingly smaller ratio (approximately proportional to the ratio of the respective improvement factors ). This is an additional advantage as the size reduction is largest when the outer subspace is large. In all cases, the inner Krylov subspace can be taken much smaller than the outer subspace, such that the efficiency of the KrylovRitz method is substantially boosted, as will be shown in the benchmarks below.
We also verified that the convergence curves are fairly insensitive to the choice of the source vector and lattice configuration. The fast convergence property of the nested method is generic, regardless of the simulation details, for both the Hermitian and nonHermitian case, even though the precise value of depends on the lattice size, the simulation parameters, the deflation gap and the desired overall accuracy (determined by ).
4.3 Benchmarks
With the fast convergence () discussed in the previous section, we can expect a substantial gain in computation time when using the nested method. The total CPU time consumed by the nested method is illustrated in Fig. 9 for the Hermitian case (left) and the nonHermitian case (right). The size of the outer Krylov subspace is kept fixed, such that its construction gives a constant contribution to the run time, depicted by the horizontal dashed line. The contribution to the CPU time which varies with mainly comes from the computation of with the RHi and is proportional to . For the total run time of the nested method is about equal to that of the nonnested method. However, as illustrated by the curve (red line) and discussed in Sec. 4.2, can be chosen much smaller while preserving the accuracy of the nonnested method. The central result, illustrated by the vertical band in Fig. 9, is that there exists an interval in for which the accuracy is still optimal, but the CPU time needed to compute with the RHi is negligible compared to the time required to construct the outer Krylov subspace. There is therefore no need to make a compromise between run time and accuracy, as both can be optimized simultaneously. The error in this range is the minimal error achievable with the given size of the outer Krylov subspace, while the run time is completely dominated by the cost for building the basis in that subspace. The nested method is able to quench the CPU time needed for the computation of without affecting the accuracy of the KrylovRitz approximation.
To evaluate the nested method further, we compare it to stateoftheart rational approximation methods. In the Hermitian case the Zolotarev rational approximation, evaluated with a multishift conjugate gradient inverter vandenEshof:2002ms, is routinely used in lattice simulations. In the nonHermitian case, i.e., simulations at nonzero baryon density, overlap fermions are not yet commonly used because of their high cost, but recently an efficient algorithm was presented, which evaluates the Neuberger rational approximation using a multishift restarted FOM inverter Bloch:2009in. In Fig. 10 we compare the results obtained with the nested Krylov subspace and rational approximation methods, and show how the CPU time varies as a function of the achieved accuracy for various lattice sizes. In all cases the Hermitian and nonHermitian versions of the nested method perform better than the rational approximation method. The volume dependence of the run time for a fixed accuracy can be extracted from Fig. 10 and is displayed for in Fig. 11. Fits to the nested method results show a volume dependence which is slightly steeper than linear, i.e., proportional to for the Hermitian case and for the nonHermitian case. The comparisons clearly demonstrate the good efficiency of the nested method.
4.4 Note on the memory usage
In the numerical tests we observed that, for a fixed deflation gap, the Krylov subspace size needed to achieve a certain accuracy is almost independent of the lattice volume in the Lanczos approximation and only grows slowly with the volume in the twosided Lanczos approximation. Therefore, the memory consumed by the Krylov basis is roughly proportional to the lattice volume. For large lattice sizes this storage requirement might become too large to run the KrylovRitz approximation on a single node.
One solution, which only requires little storage, is to implement a doublepass version of the algorithm, which is possible due to the use of short recurrences. In doublepass mode only the two most recently generated basis vectors are stored during the construction of the outer Krylov subspace basis. In the first pass the matrix is built and the product is computed with Eq. (14). In the second pass the basis vectors of the outer Krylov subspace are generated again and immediately added in a linear combination, whose coefficients were computed in the first pass. The drawback of this variant is that the Krylov basis is constructed twice, such that the corresponding CPU time will be doubled.
The more efficient solution is to parallelize the singlepass version of the algorithm, such that the memory requirement gets distributed over the available nodes. Benchmarks on larger volumes, using such a parallel implementation, are currently being performed.
4.5 Multilevel nesting
In principle, if the inner Krylov subspace in Eq. (15) is still too large for an efficient application of the RHi on the inner Ritz matrix, the nested method could be applied recursively.
5 Conclusions
In this paper we have presented a nested Krylov subspace method which boosts the KrylovRitz approximations used to compute the sign function of both Hermitian and nonHermitian matrices. The KrylovRitz approximation projects the matrix on a Krylov subspace in which it computes the sign function exactly, before lifting it back to the original space. Its standard implementation suffers from the CPU intensive computation of the sign of the Ritz matrix, which goes like the cube of the Krylov subspace size. By making an additional projection on a much smaller Krylov subspace, the nested method significantly reduces the total computation time of the KrylovRitz approximation, without affecting its accuracy. Numerical tests showed that the nested method works equally well for Hermitian and nonHermitian matrices and is more efficient than stateoftheart rational approximation methods. Moreover, it exhibits a good, close to linear, volume scaling. We are currently investigating the efficiency of the nested method for larger lattice volumes using a parallel implementation of the algorithm.
To end, we comment on the relation between the nested method and the extended Krylov subspace methods introduced in Ref. extendedkrylov. An extended Krylov space is defined as
(24) 
and an approximation in that subspace approximates by the sum . In the nested method we construct the dimensional Krylov subspace , which forms an dimensional subspace of the dimensional extended Krylov subspace . The nested method implicitly fixes the coefficients of the positive and negative powers of to be equal, , which follows from the use of the property . Hence, the nested method implicitly truncates the size of the extended Krylov subspace.
Approximations for the sign function in extended Krylov subspaces have been briefly considered recently Knizhnerman2010, however not in combination with the nesting of Krylov subspaces, i.e. the extended subspace is constructed for the original matrix , not for the Ritz matrix . Evidently this is not feasible in the application to lattice QCD as the inversion of the Wilson Dirac operator is too expensive in order to construct extended Krylov subspaces.
To conclude, we briefly consider the application of the nested method to other matrix functions. The method presented in Sec. 4 requires a transformation which leaves the matrix function invariant, similar to Eq. (16) for the sign function. If such a transformation is not known, the nested method could be adapted by using an extended Krylov subspace method at the inner level. This is also a topic of work in progress.
Acknowledgements
We would like to thank Andreas Frommer and Tilo Wettig for useful discussions.
Appendix A Lanczos algorithm
All not assigned above are zero. Consequently is tridiagonal and symmetric. The are the column vectors of the matrix .
Appendix B Twosided Lanczos algorithm
The and are the column vectors of the matrices and , respectively. All not assigned above are zero. Consequently is tridiagonal, but not symmetric as in the Hermitian case. The coefficients and are, nonuniquely, chosen to satisfy the biorthonormality condition
(25) 
There are potential problems in the twosided Lanczos process, namely serious breakdowns and near breakdowns, where , respectively , however, these were not encountered in our numerical tests.
Appendix C Nested algorithm
Given a (non)Hermitian matrix , a source vector and the critical eigenvectors (left and right eigenvectors and ), with eigenvalues , , do:

Apply LeftRight deflation (see Ref.Bloch:2007aw) to construct , where the components of the source vector along the eigenvectors have been removed:
where for Hermitian .

Perform an decomposition of , e.g., with the LAPACK routine dgttrf (zgttrf). This yields a lower triangular matrix with unit diagonal and one subdiagonal, and an upper triangular matrix with one diagonal and two superdiagonals. All other entries of and are zero.

Run the (twosided) Lanczos algorithm with and source vector to construct the Krylov basis and the Ritz matrix . To do so apply to each Krylov vector :

Compute using a sparse LU back substitution, e.g., with the LAPACK routine dgttrs (zgttrs).

Compute and add to the result of (a). This tridiagonal multiply and add can be done efficiently using the BLAS bandmatrixvector multiplication routine dsbmv (zgbmv).


Run the RHi (or any other suitable method to compute the sign function) on to obtain .

The final approximation is then given by
Note that steps (35) are done in real arithmetic in the Hermitian case.
References
Footnotes
 In practice we deflated the eigenvalues with smallest modulus instead of those with smallest absolute real part , as the former are more efficiently determined numerically, and both choices yield almost identical deflations for the operator of Eq. (1). The reason for this is that, as long as the chemical potential is not unusually large, the spectrum looks like a very narrow bowtie shaped strip along the real axis, and the sets of eigenvalues with smallest absolute real parts and smallest magnitudes will nearly coincide. In the following we therefore define the deflation gap as the largest deflated eigenvalue in magnitude, i.e., .
 The factor is chosen for convenience. For the transformation actually mimics the first step of the RHi (13).
 If is not diagonalizable, the equality can be shown by applying Eq. (17) to the integration variable in the integral representation (3).
 In practice is only known approximately, as it is computed from spectral information of A instead of . However, this has no significant impact on the performance of the nested method.
 Note that for all cases considered in the current study a single level of nesting was sufficient.
 For each level the factor for Eq. (16) is computed using Eq. (18), using appropriately approximated boundaries for the spectrum of the Ritz matrix of the previous level. Note that the factor converges to as more levels are introduced, and the preconditioning step converges to the RHi.