Fast QR decomposition of HODLR matrices
The efficient and accurate QR decomposition for matrices with hierarchical low-rank structures, such as HODLR and hierarchical matrices, has been challenging. Existing structure-exploiting algorithms are prone to numerical instability as they proceed indirectly, via Cholesky decompositions or a block Gram-Schmidt procedure. For a highly ill-conditioned matrix, such approaches either break down in finite-precision arithmetic or result in significant loss of orthogonality. Although these issues can sometimes be addressed by regularization and iterative refinement, it would be more desirable to have an algorithm that avoids these detours and is numerically robust to ill-conditioning. In this work, we propose such an algorithm for HODLR matrices. It achieves accuracy by utilizing Householder reflectors. It achieves efficiency by utilizing fast operations in the HODLR format in combination with compact WY representations and the recursive QR decomposition by Elmroth and Gustavson. Numerical experiments demonstrate that our newly proposed algorithm is robust to ill-conditioning and capable of achieving numerical orthogonality down to the level of roundoff error.
A HODLR (hierarchically off-diagonal low-rank) matrix is defined recursively via block partitions of the form
where the off-diagonal blocks have low rank and the diagonal blocks are again HODLR matrices. The recursion is stopped once the diagonal blocks are of sufficiently small size, in the range of, say, a few hundreds. Storing the off-diagonal blocks in terms of their low-rank factors significantly reduces memory requirements and, potentially, the computational cost of operating with HODLR matrices. The goal of this work is to devise an efficient and numerically accurate algorithm for computing a QR decomposition
where is an upper triangular HODLR matrix and the orthogonal matrix is represented in terms of its so called compact WY representation [Schreiber1989]: , with the identity matrix and triangular/trapezoidal HODLR matrices .
HODLR matrices constitute one of the simplest data-sparse formats among the wide range of hierarchical low-rank formats that have been discussed in the literature during the last two decades. They have proved to be effective, for example, in solving large-scale linear systems [Aminfar2016] and operating with multivariate Gaussian distributions [Ambikasaran2016]. In our own work [Kressner2017, Susnjara2018] HODLR matrices have played a central role in developing fast algorithms for solving symmetric banded eigenvalue problems. In particular, a fast variant of the so called QDWH algorithm [NakaBaiGygi2010, NakaHigh2013] for computing spectral projectors requires the QR decomposition of a HODLR matrix. It is also useful for orthonormalizing data-sparse vectors. Possibly more importantly, the QR decomposition offers a stable alternative to the LU decomposition (without pivoting) for solving linear systems with nonsymmetric HODLR matrices or to the Cholesky decomposition applied to the normal equations for solving linear least-squares problems.
For the more general class of hierarchical matrices [Hackbusch2015], a number of approaches aim at devising fast algorithms for QR decompositions [Bebendorf2008, BennerMach2010, Lintner2002]. However, as we explain in Section 2 below, all existing approaches have limitations in terms of numerical accuracy and orthogonality, especially when is ill-conditioned. In the other direction, when further conditions are imposed on a HODLR matrix, leading to formats such as HSS (hierarchically semi-separable) or quasi-separable matrices, then it can be possible to devise QR or URV decompositions that fully preserve the structure; see [Eidelman2014, VandeVanBarelMastro2008, Xia2010] and the references therein. This is clearly not possible for HODLR matrices: In general, and do not inherit from the property of having low-rank off-diagonal blocks. However, as it turns out, these factors can be very well approximated via HODLR matrices. The approach presented in this work to obtain such approximations is different from any existing approach we are aware of. It is based on the recursive QR decomposition proposed by Elmroth and Gustavson [Elmroth2000] for dense matrices. The key insight in this work is that such a recursive algorithm combines well with the use of the HODLR format for representing the involved compact WY representations. Demonstrated by the numerical experiments, the resulting algorithm is not only fast but it is also capable of yielding high accuracy, that is, a residual and orthogonality down to the level of roundoff error.
The rest of this paper is organized as follows. Section 2 provides an overview of HODLR matrices and the corresponding arithmetics, as well as the existing approaches to fast QR decompositions. In Section 3, we recall the recursive QR decomposition from [Elmroth2000] for dense matrices. The central part of this work, Section 4 combines [Elmroth2000] with the HODLR format. Various numerical experiments reported in Section 5 demonstrate the effectiveness of our approach. Finally, in Section 6, we sketch the extension of our newly proposed approach from square to rectangular HODLR matrices.
2 Overview of HODLR matrices and existing methods
In the following, we focus our description on square HODLR matrices. Although the extension of our algorithm to rectangular matrices does not require any substantially new ideas, the formal description of the algorithm would become significantly more technical. We have therefore postponed the rectangular case to Section 6.
2.1 HODLR matrices
As discussed in the introduction, a HODLR matrix is defined by performing a recursive partition of the form (1) and requiring all occuring off-diagonal blocks to be of low rank. When this recursion is performed times, we say that is a HODLR matrix of level ; see Figure 1 for an illustration.
Clearly, the definition of a HODLR matrix depends on the block sizes chosen in (1) on every level of the recursion or, equivalently, on the integer partition
defined by the sizes , of the diagonal blocks on the lowest level of the recursion. If possible, it is advisable to choose the level and the integers such that all are nearly equal to a prescribed minimal block size . In the following, when discussing the complexity of operations, we assume that such a balanced partition has been chosen.
Given an integer partition (2), we define to be the set of HODLR matrices of level and rank (at most) , that is, if every off-diagonal block of in the recursive block partition induced by (2) has rank at most .
A matrix admits a data-sparse representation by storing its off-diagonal blocks in terms of their low-rank factors. Specifically, letting denote an arbitrary off-diagonal block in the recursive partition of , we can write
Storing and instead of for every such off-diagonal block reduces the overall memory required for storing from to .
We call a factorization (3) left-orthogonal if . Provided that , an arbitrary factorization (3) can be turned into a left-orthogonal one by computing an economy-sized QR decomposition , see [GolVanL2013, Theorem 5.2.3], and replacing , . This described procedure requires operations.
2.1.1 Approximation by HODLR matrices
A general matrix can be approximated by a HODLR matrix by performing low-rank truncations of the off-diagonal blocks in the recursive block partition. Specifically, letting denote such an off-diagonal block, one computes a singular value decomposition with the diagonal matrix containing the singular values. Letting and contain the first columns of and , respectively, and setting , one obtains a rank- approximation
by setting and . We note that this approximation is optimal among all rank- matrices for any unitarily invariant norm [Horn2013, Section 7.4.9]. In particular, for the matrix -norm, we have
In passing, we note that the factorization chosen in (4) is left-orthogonal.
Performing the approximation (4) for every off-diagonal block in the recursive block partition yields a HODLR matrix .
In practice, the rank is chosen adaptively and separately for each off-diagonal block . Given a prescribed tolerance , we choose to be the smallest integer such that . In turn, (5) implies . The resulting HODLR approximation satisfies ; see, e.g., [Bini2017, Theorem 2.2].
Most manipulations involving HODLR matrices lead to an increase of off-diagonal ranks. This increase is potentially mitigated by performing recompression. Let us consider an off-diagonal block , with and choose as explained above. If , a rank- approximation reduces memory requirements while maintaining -accuracy. We use the following well-known procedure for effecting this approximation.
Compute and , economy-sized QR decompositions of and , respectively.
Compute SVD .
Update and .
This procedure, which will be denoted by , requires operations.
2.1.2 Operating with HODLR matrices
A number of operations can be performed efficiently with HODLR matrices. Table 1 lists the operations relevant in this work, together with their computational complexity; see, e.g., [Hackbusch2015, Chapter 3] for more details. It is important to note that all operations, except for matrix-vector multiplication, are combined with low-rank truncation, as discussed above, to limit rank growth in the off-diagonal blocks. The symbol signifies the inexactness due to truncations. The complexity estimates assume that all off-diagonal ranks encountered during an operation remain .
|Matrix low-rank update:|
|Solution of triangular matrix equation:|
2.2 Cholesky-based QR decomposition
This and the following sections describe three existing methods for efficiently computing the QR decomposition of a HODLR matrix. All these methods have originally been proposed for the broader class of hierarchical matrices.
The first method, proposed by Lintner [Lintner2002, Lintner2004], is based on the well-known connection between the QR and Cholesky decompositions. Specifically, letting be the QR decomposition of an invertible matrix , we have
Thus, the upper triangular factor can be obtained from the Cholesky decomposition of the symmetric positive definite matrix . The orthogonal factor is obtained from solving the triangular system . According to Table 1, these three steps (forming , computing the Cholesky decomposition, solving the triangular matrix equation) require operations in the HODLR format.
For dense matrices, the approach described above is well-known and often called CholeskyQR algorithm; see [Stewart1973, Pg. 214] for an early reference. A major disadvantage of this approach, rapidly loses orthogonality in finite precision arithmetic as the condition number of increases. As noted in [Stathopoulos2002], the numerical orthogonality is usually at the level of the squared condition number times the unit roundoff . To improve its orthogonality, one can apply the CholeskyQR algorithm again to and update accordingly. As shown in [Yamamoto2015], this so called CholeskyQR2 algorithm results in a numerically orthogonal factor, provided that is at most .
The CholeskyQR2 algorithm for HODLR and hierarchical matrices [Lintner2002] is additionally affected by low-rank truncation and may require several reorthogonalization steps to reach numerical orthogonality on the level of the truncation error, increasing the computational cost. Another approach proposed in [Lintner2002] to avoid loss of orthogonality is to first compute a polar decomposition and then apply the CholeskyQR algorithm to . Because of , this improves the accuracy of the CholeskyQR algorithm. On the other hand, the need for computing the polar decomposition via an iterative method, such as the sign-function iteration [Higham1986], also significantly increases the computational cost.
2.3 LU-based QR decomposition
An approach proposed by Bebendorf [Bebendorf2008, Sec. 2.10] can be viewed as orthogonalizing a recursive block LU decomposition. Given , let us partition
and suppose that is invertible. Setting , consider the block LU decomposition
The first factor is orthogonalized by (1) rescaling the first block column with the inverted Cholesky factor of the symmetric positive definite matrix and (2) choosing the second block column as , scaled with the inverted Cholesky factor of . Adjusting the second factor in (7) accordingly does not change its block triangular structure. More precisely, one can prove that
holds. By construction, is orthogonal. This procedure is applied recursively to the diagonal blocks of . If is a HODLR matrix corresponding to the partition (6) at every level of the recursion then all involved operations can be performed efficiently in the HODLR format. Following [Bebendorf2008], the overall computational cost is, once again, .
An obvious disadvantage of the described approach, it requires the leading diagonal block to be well conditioned for every subproblem encountered during the recursion. This rather restrictive assumption is only guaranteed for specific matrix classes, such as well-conditioned positive definite matrices.
2.4 QR decomposition based on a block Gram-Schmidt procedure
The equivalence between the QR decomposition and the Gram-Schmidt procedure for full-rank matrices is well known. In particular, applying the modified block Gram-Schmidt procedure to the columns of leads to the block recursive QR decomposition presented in [GolVanL2013, Sec. 5.2.4]. Benner and Mach [BennerMach2010] combined this idea with hierarchical matrix arithmetic. In the following, we briefly summarize their approach. Partitioning the economy-sized QR decomposition of into block columns yields the relation
This yields three steps for the computation of and :
Compute (recursively) the QR decomposition .
Compute and update .
Compute (recursively) the QR decomposition .
Step 2 can be implemented efficiently for a HODLR matrix that aligns with the block column partitioning (8). Steps 1 and 3 are executed recursively until the lowest level of the HODLR structure is reached. On this lowest level, it is suggested in [BennerMach2010, Alg. 3] to compute the QR decomposition of compressed block columns. We refrain from providing details and point out that we consider similarly compressed block columns in Section 4 below. The overall computational complexity is .
The described algorithm inherits the numerical instability of Gram-Schmidt procedures. In particular, we cannot expect to obtain a numerically orthogonal factor in finite-precision arithmetic when is ill-conditioned; see also the analysis in [BennerMach2010, Sec. 3.4].
3 Recursive WY-based QR decomposition
In this section, we recall the recursive QR decomposition by Elmroth and Gustavson [Elmroth2000] for a general, dense matrix with . The orthogonal factor is returned in terms of the compact WY representation [Schreiber1989] of the Householder reflectors involved in the decomposition:
where is an upper triangular matrix and is an matrix with the first rows in unit lower triangular form.
For , the matrix becomes a column vector and we let be the Householder reflector [GolVanL2013, Sec. 5.1.2] that maps to a scalar multiple of the unit vector. Then is trivially of the form (9).
For , we partition into two block columns of roughly equal size:
By recursion, we compute a QR decomposition of the first block column
with , taking the form explained above. The second block column is updated,
and then partitioned as
Again by recursion, we compute a QR decomposition of the bottom block:
To combine the QR decompositions of the first and the updated second block column, we embedd into the larger matrix
we obtain a QR decomposition
Algorithm 1 summarizes the described procedure. To simplify the description, the recursion is performed down to individual columns. In practice [Elmroth2000], the recursion is stopped earlier: When the number of columns does not exceed a certain block size (e.g., ), a standard Householder-based QR decomposition is used.
4 Recursive WY-based QR decomposition of HODLR matrices
By combining the recursive block QR decomposition (Algorithm 1) with HODLR arithmetic, we will show in this section how to derive an efficient algorithm for computing the QR decomposition of a level- HODLR matrix .
The matrix processed in one step of the recursion of our algorithm takes the following form:
is a HODLR matrix of level ;
is given in factorized form with and for some (small) integer ;
for some (small) integer .
To motivate this structure, it is helpful to consider the block column highlighted in Figure 1. The first block in this block column consists of a (square) level-one HODLR matrix, corresponding to the matrix in (11). The other three blocks are all of low rank, and the matrix in (11) corresponds to the first of these blocks. The bottom two blocks extend into the next block column(s). For these two blocks, it is assumed that their left factors are orthonormal. These left factors are ignored and the parts of the right factors residing in the highlighted block column are collected in the matrix in (11).
Given a matrix of the form (11), we aim at computing, recursively and approximately, a QR decomposition of the form
such that are upper triangular HODLR matrices of level and the structure of reflects the structure of , that is,
where: is a unit lower triangular HODLR matrix of level ; is in factorized form; and .
On the highest level of the recursion, when , the matrices in (11) vanish, and is the original HODLR matrix we aim at decomposing. The QR decomposition returned on the highest level has the form (12) with both triangular level- HODLR matrices.
The computation of the QR decomposition (12) proceeds in several steps, which are detailed in the following.
Using the procedure described in Section 2.1.2, we may assume that the factorization of is normalized such that has orthonormal columns. For the moment, we will discard and aim at decomposing instead of the compressed matrix
which has size .
QR decomposition of on the lowest level, .
QR decomposition of on higher levels, .
We proceed recursively as follows. First, is repartitioned as follows:
Here, is split according to its HODLR format, that is, , , with , are HODLR matrices of level , and are low-rank matrices stored in factorized form.
Note that the first block column of in (16) has precisely the form (11) with the level of the HODLR matrix reduced by one, the low-rank block given by and the dense part given by . This allows us to apply recursion and obtain a QR decomposition
with a HODLR matrix and a factorized low-rank matrix . We then update the second block column of :
It is important to note that each term of the sum in the latter expression is a low-rank matrix and, in turn, has low rank. This not only makes the computation of efficient but it also implies that the updates in (18) are of low rank and thus preserve the structure of the second block column of .
Note that , , and are triangular level- HODLR matrices, as desired.
4.1 Algorithm and complexity estimates
Algorithm 2 summarizes the recursive procedure described above. A QR decomposition of a level- HODLR matrix is obtained by applying this algorithm with and void .
In the following, we derive complexity estimates for Algorithm 2 applied to under the assumptions stated in Section 2.1.2. In particular, it is assumed that all off-diagonal ranks (which are chosen adaptively) are bounded by .
Every lower off-diagonal block of needs to be transformed once to left-orthogonal form in the course of the algorithm. For each , , there are such blocks of size and rank at most . Using the procedure for left-orthogonalization explained in Section 2.1, the overall cost is
where we used .
The QR decomposition of the compressed block column is performed for all block columns on the lowest level of recursion. Each of them is of size , because there are at most lower off-diagonal blocks intersecting with each block column. Each QR decomposition requires operations and thus the overall cost is .
The computation of involves the following operations:
three products of level- HODLR matrices with low-rank matrices;
addition of four low-rank matrices, given in terms of their low-rank factors, combined with recompression.
The first part is effected by performing at most matrix-vector multiplications with HODLR matrices. As the computation of is performed times for every , we arrive at a total cost of
In the second part, the addition is performed for matrices of size . The first three terms in Line 11 have rank at most but the rank of the last term can be up to . Letting the rank grow to would lead to an unfavourable complexity, because the cost of recompression depends quadratically on the rank of the matrix to be recompressed. To avoid this effect, we execute separate additions, each immediately followed by the application of . Assuming that each recompression truncates to rank , this requires operations. Similarly as in (21), this leads to a total cost of .
For updating the second block column of , the following operations are performed:
The computation of requires HODLR matrix-vector multiplications, followed by low-rank recompression of a matrix of rank at most . Analogously to (21), this requires a total cost of .
The computation of requires (approximate) subtraction of the product of two low-rank matrices from a level- HODLR matrix. The most expensive part of this step is the recompression of the updated HODLR matrix with off-diagonal ranks at most , amounting to a total cost of
The computation of and involves the multiplication of a matrix with at most rows with a low-rank matrix, which requires operations each time. The total cost is thus again .
The computation of the low-rank block involves:
three multiplications of level- HODLR matrices with low-rank matrices;
addition of three low-rank matrices, given in terms of their low-rank factors, combined with recompression.
Therefore, total cost is identical with the cost for computing : .
The total cost of Algorithm 2 applied to an HODLR matrix is .
5 Numerical results
In this section we demonstrate the efficiency of our method on several examples. All algorithms were implemented and executed in Matlab version R2016b on a dual Intel Core i7-5600U 2.60GHz CPU, KByte of level 2 cache and GByte of RAM, using a single core. Because all algorithms are rich in calls to BLAS and LAPACK routines, we believe that the use of Matlab (instead of a compiled language) does not severely limit the predictive value of the reported timings.
The following algorithms have been compared:
The Cholesky-based QR decomposition for HODLR matrices explained in Section 2.2.
CholQR followed by one step of the reorthogonalization procedure explained in Section 2.2.
Algorithm 2, our newly proposed algorithm.
Call to the MATLAB function qr, which in turn calls the corresponding LAPACK routine for computing the QR decomposition of a general dense matrix.
Among the methods discussed in Section 2, we have decided to focus on CholQR and CholQR2, primarily because they are relatively straightforward to implement. A comparison of CholQR and CholQR2 with the other methods from Section 2 can be found in [BennerMach2010].
If not stated otherwise, when working with HODLR matrices we have chosen the minimal block size and the truncation tolerance .
To assess accuracy, we have measured the numerical orthogonality of and the residual of the computed QR decomposition:
Example 1 (Performance for random HODLR matrices).
We first investigate the performance of our method for HODLR matrices of varying size constructed as follows. The diagonal blocks are random dense matrices and each off-diagonal block is a rank-one matrix chosen as the outer product of two random vectors. From Figure 2, one observes that the computational time of the hQR algorithm nicely matches the reference line, the complexity claimed in Section 4.1. Compared to the much simpler and as we shall see, less accurate CholQR method, our new method is approximately only two times slower, while CholQR2 is slower than hQR for . Note that for , the Cholesky factorization of fails to complete due to lack of (numerical) positive definiteness and, in turn, both CholQR and CholQR2 return with an error.
Table 2 provides insights into the observed accuracy for values of for which (22) can be evaluated conveniently. As increases, the condition number of increases. Our method is robust to this increase and produces numerical orthogonality and a residual norm on the level of the truncation error. In contrast, the accuracy of CholQR clearly deteriorates as the condition number increases and the refinement performed by the more expensive CholQR2 cannot fully make up for this.