Fast QR decomposition of HODLR matrices
Abstract
The efficient and accurate QR decomposition for matrices with hierarchical lowrank structures, such as HODLR and hierarchical matrices, has been challenging. Existing structureexploiting algorithms are prone to numerical instability as they proceed indirectly, via Cholesky decompositions or a block GramSchmidt procedure. For a highly illconditioned matrix, such approaches either break down in finiteprecision arithmetic or result in significant loss of orthogonality. Although these issues can sometimes be addressed by regularization and iterative refinement, it would be more desirable to have an algorithm that avoids these detours and is numerically robust to illconditioning. In this work, we propose such an algorithm for HODLR matrices. It achieves accuracy by utilizing Householder reflectors. It achieves efficiency by utilizing fast operations in the HODLR format in combination with compact WY representations and the recursive QR decomposition by Elmroth and Gustavson. Numerical experiments demonstrate that our newly proposed algorithm is robust to illconditioning and capable of achieving numerical orthogonality down to the level of roundoff error.
1 Introduction
A HODLR (hierarchically offdiagonal lowrank) matrix is defined recursively via block partitions of the form
(1) 
where the offdiagonal blocks have low rank and the diagonal blocks are again HODLR matrices. The recursion is stopped once the diagonal blocks are of sufficiently small size, in the range of, say, a few hundreds. Storing the offdiagonal blocks in terms of their lowrank factors significantly reduces memory requirements and, potentially, the computational cost of operating with HODLR matrices. The goal of this work is to devise an efficient and numerically accurate algorithm for computing a QR decomposition
where is an upper triangular HODLR matrix and the orthogonal matrix is represented in terms of its so called compact WY representation [Schreiber1989]: , with the identity matrix and triangular/trapezoidal HODLR matrices .
HODLR matrices constitute one of the simplest datasparse formats among the wide range of hierarchical lowrank formats that have been discussed in the literature during the last two decades. They have proved to be effective, for example, in solving largescale linear systems [Aminfar2016] and operating with multivariate Gaussian distributions [Ambikasaran2016]. In our own work [Kressner2017, Susnjara2018] HODLR matrices have played a central role in developing fast algorithms for solving symmetric banded eigenvalue problems. In particular, a fast variant of the so called QDWH algorithm [NakaBaiGygi2010, NakaHigh2013] for computing spectral projectors requires the QR decomposition of a HODLR matrix. It is also useful for orthonormalizing datasparse vectors. Possibly more importantly, the QR decomposition offers a stable alternative to the LU decomposition (without pivoting) for solving linear systems with nonsymmetric HODLR matrices or to the Cholesky decomposition applied to the normal equations for solving linear leastsquares problems.
For the more general class of hierarchical matrices [Hackbusch2015], a number of approaches aim at devising fast algorithms for QR decompositions [Bebendorf2008, BennerMach2010, Lintner2002]. However, as we explain in Section 2 below, all existing approaches have limitations in terms of numerical accuracy and orthogonality, especially when is illconditioned. In the other direction, when further conditions are imposed on a HODLR matrix, leading to formats such as HSS (hierarchically semiseparable) or quasiseparable matrices, then it can be possible to devise QR or URV decompositions that fully preserve the structure; see [Eidelman2014, VandeVanBarelMastro2008, Xia2010] and the references therein. This is clearly not possible for HODLR matrices: In general, and do not inherit from the property of having lowrank offdiagonal blocks. However, as it turns out, these factors can be very well approximated via HODLR matrices. The approach presented in this work to obtain such approximations is different from any existing approach we are aware of. It is based on the recursive QR decomposition proposed by Elmroth and Gustavson [Elmroth2000] for dense matrices. The key insight in this work is that such a recursive algorithm combines well with the use of the HODLR format for representing the involved compact WY representations. Demonstrated by the numerical experiments, the resulting algorithm is not only fast but it is also capable of yielding high accuracy, that is, a residual and orthogonality down to the level of roundoff error.
The rest of this paper is organized as follows. Section 2 provides an overview of HODLR matrices and the corresponding arithmetics, as well as the existing approaches to fast QR decompositions. In Section 3, we recall the recursive QR decomposition from [Elmroth2000] for dense matrices. The central part of this work, Section 4 combines [Elmroth2000] with the HODLR format. Various numerical experiments reported in Section 5 demonstrate the effectiveness of our approach. Finally, in Section 6, we sketch the extension of our newly proposed approach from square to rectangular HODLR matrices.
2 Overview of HODLR matrices and existing methods
In the following, we focus our description on square HODLR matrices. Although the extension of our algorithm to rectangular matrices does not require any substantially new ideas, the formal description of the algorithm would become significantly more technical. We have therefore postponed the rectangular case to Section 6.
2.1 HODLR matrices
As discussed in the introduction, a HODLR matrix is defined by performing a recursive partition of the form (1) and requiring all occuring offdiagonal blocks to be of low rank. When this recursion is performed times, we say that is a HODLR matrix of level ; see Figure 1 for an illustration.
Clearly, the definition of a HODLR matrix depends on the block sizes chosen in (1) on every level of the recursion or, equivalently, on the integer partition
(2) 
defined by the sizes , of the diagonal blocks on the lowest level of the recursion. If possible, it is advisable to choose the level and the integers such that all are nearly equal to a prescribed minimal block size . In the following, when discussing the complexity of operations, we assume that such a balanced partition has been chosen.
Given an integer partition (2), we define to be the set of HODLR matrices of level and rank (at most) , that is, if every offdiagonal block of in the recursive block partition induced by (2) has rank at most .
A matrix admits a datasparse representation by storing its offdiagonal blocks in terms of their lowrank factors. Specifically, letting denote an arbitrary offdiagonal block in the recursive partition of , we can write
(3) 
Storing and instead of for every such offdiagonal block reduces the overall memory required for storing from to .
We call a factorization (3) leftorthogonal if . Provided that , an arbitrary factorization (3) can be turned into a leftorthogonal one by computing an economysized QR decomposition , see [GolVanL2013, Theorem 5.2.3], and replacing , . This described procedure requires operations.
2.1.1 Approximation by HODLR matrices
A general matrix can be approximated by a HODLR matrix by performing lowrank truncations of the offdiagonal blocks in the recursive block partition. Specifically, letting denote such an offdiagonal block, one computes a singular value decomposition with the diagonal matrix containing the singular values. Letting and contain the first columns of and , respectively, and setting , one obtains a rank approximation
(4) 
by setting and . We note that this approximation is optimal among all rank matrices for any unitarily invariant norm [Horn2013, Section 7.4.9]. In particular, for the matrix norm, we have
(5) 
In passing, we note that the factorization chosen in (4) is leftorthogonal.
Performing the approximation (4) for every offdiagonal block in the recursive block partition yields a HODLR matrix .
In practice, the rank is chosen adaptively and separately for each offdiagonal block . Given a prescribed tolerance , we choose to be the smallest integer such that . In turn, (5) implies . The resulting HODLR approximation satisfies ; see, e.g., [Bini2017, Theorem 2.2].
Recompression.
Most manipulations involving HODLR matrices lead to an increase of offdiagonal ranks. This increase is potentially mitigated by performing recompression. Let us consider an offdiagonal block , with and choose as explained above. If , a rank approximation reduces memory requirements while maintaining accuracy. We use the following wellknown procedure for effecting this approximation.

Compute and , economysized QR decompositions of and , respectively.

Compute SVD .

Update and .
This procedure, which will be denoted by , requires operations.
2.1.2 Operating with HODLR matrices
A number of operations can be performed efficiently with HODLR matrices. Table 1 lists the operations relevant in this work, together with their computational complexity; see, e.g., [Hackbusch2015, Chapter 3] for more details. It is important to note that all operations, except for matrixvector multiplication, are combined with lowrank truncation, as discussed above, to limit rank growth in the offdiagonal blocks. The symbol signifies the inexactness due to truncations. The complexity estimates assume that all offdiagonal ranks encountered during an operation remain .
Operation  Computational complexity 

Matrixvector multiplication:  
Matrix addition:  
Matrix lowrank update:  
Matrixmatrix multiplication:  
Matrix inversion:  
Solution of triangular matrix equation:  
Cholesky decomposition: 
2.2 Choleskybased QR decomposition
This and the following sections describe three existing methods for efficiently computing the QR decomposition of a HODLR matrix. All these methods have originally been proposed for the broader class of hierarchical matrices.
The first method, proposed by Lintner [Lintner2002, Lintner2004], is based on the wellknown connection between the QR and Cholesky decompositions. Specifically, letting be the QR decomposition of an invertible matrix , we have
Thus, the upper triangular factor can be obtained from the Cholesky decomposition of the symmetric positive definite matrix . The orthogonal factor is obtained from solving the triangular system . According to Table 1, these three steps (forming , computing the Cholesky decomposition, solving the triangular matrix equation) require operations in the HODLR format.
For dense matrices, the approach described above is wellknown and often called CholeskyQR algorithm; see [Stewart1973, Pg. 214] for an early reference. A major disadvantage of this approach, rapidly loses orthogonality in finite precision arithmetic as the condition number of increases. As noted in [Stathopoulos2002], the numerical orthogonality is usually at the level of the squared condition number times the unit roundoff . To improve its orthogonality, one can apply the CholeskyQR algorithm again to and update accordingly. As shown in [Yamamoto2015], this so called CholeskyQR2 algorithm results in a numerically orthogonal factor, provided that is at most .
The CholeskyQR2 algorithm for HODLR and hierarchical matrices [Lintner2002] is additionally affected by lowrank truncation and may require several reorthogonalization steps to reach numerical orthogonality on the level of the truncation error, increasing the computational cost. Another approach proposed in [Lintner2002] to avoid loss of orthogonality is to first compute a polar decomposition and then apply the CholeskyQR algorithm to . Because of , this improves the accuracy of the CholeskyQR algorithm. On the other hand, the need for computing the polar decomposition via an iterative method, such as the signfunction iteration [Higham1986], also significantly increases the computational cost.
2.3 LUbased QR decomposition
An approach proposed by Bebendorf [Bebendorf2008, Sec. 2.10] can be viewed as orthogonalizing a recursive block LU decomposition. Given , let us partition
(6) 
and suppose that is invertible. Setting , consider the block LU decomposition
(7) 
The first factor is orthogonalized by (1) rescaling the first block column with the inverted Cholesky factor of the symmetric positive definite matrix and (2) choosing the second block column as , scaled with the inverted Cholesky factor of . Adjusting the second factor in (7) accordingly does not change its block triangular structure. More precisely, one can prove that
holds. By construction, is orthogonal. This procedure is applied recursively to the diagonal blocks of . If is a HODLR matrix corresponding to the partition (6) at every level of the recursion then all involved operations can be performed efficiently in the HODLR format. Following [Bebendorf2008], the overall computational cost is, once again, .
An obvious disadvantage of the described approach, it requires the leading diagonal block to be well conditioned for every subproblem encountered during the recursion. This rather restrictive assumption is only guaranteed for specific matrix classes, such as wellconditioned positive definite matrices.
2.4 QR decomposition based on a block GramSchmidt procedure
The equivalence between the QR decomposition and the GramSchmidt procedure for fullrank matrices is well known. In particular, applying the modified block GramSchmidt procedure to the columns of leads to the block recursive QR decomposition presented in [GolVanL2013, Sec. 5.2.4]. Benner and Mach [BennerMach2010] combined this idea with hierarchical matrix arithmetic. In the following, we briefly summarize their approach. Partitioning the economysized QR decomposition of into block columns yields the relation
(8) 
This yields three steps for the computation of and :

Compute (recursively) the QR decomposition .

Compute and update .

Compute (recursively) the QR decomposition .
Step 2 can be implemented efficiently for a HODLR matrix that aligns with the block column partitioning (8). Steps 1 and 3 are executed recursively until the lowest level of the HODLR structure is reached. On this lowest level, it is suggested in [BennerMach2010, Alg. 3] to compute the QR decomposition of compressed block columns. We refrain from providing details and point out that we consider similarly compressed block columns in Section 4 below. The overall computational complexity is .
The described algorithm inherits the numerical instability of GramSchmidt procedures. In particular, we cannot expect to obtain a numerically orthogonal factor in finiteprecision arithmetic when is illconditioned; see also the analysis in [BennerMach2010, Sec. 3.4].
3 Recursive WYbased QR decomposition
In this section, we recall the recursive QR decomposition by Elmroth and Gustavson [Elmroth2000] for a general, dense matrix with . The orthogonal factor is returned in terms of the compact WY representation [Schreiber1989] of the Householder reflectors involved in the decomposition:
(9) 
where is an upper triangular matrix and is an matrix with the first rows in unit lower triangular form.
For , the matrix becomes a column vector and we let be the Householder reflector [GolVanL2013, Sec. 5.1.2] that maps to a scalar multiple of the unit vector. Then is trivially of the form (9).
For , we partition into two block columns of roughly equal size:
By recursion, we compute a QR decomposition of the first block column
with , taking the form explained above. The second block column is updated,
and then partitioned as
Again by recursion, we compute a QR decomposition of the bottom block:
To combine the QR decompositions of the first and the updated second block column, we embedd into the larger matrix
By setting
and
(10)  
we obtain a QR decomposition
Algorithm 1 summarizes the described procedure. To simplify the description, the recursion is performed down to individual columns. In practice [Elmroth2000], the recursion is stopped earlier: When the number of columns does not exceed a certain block size (e.g., ), a standard Householderbased QR decomposition is used.
4 Recursive WYbased QR decomposition of HODLR matrices
By combining the recursive block QR decomposition (Algorithm 1) with HODLR arithmetic, we will show in this section how to derive an efficient algorithm for computing the QR decomposition of a level HODLR matrix .
The matrix processed in one step of the recursion of our algorithm takes the following form:
(11) 
where:

is a HODLR matrix of level ;

is given in factorized form with and for some (small) integer ;

for some (small) integer .
To motivate this structure, it is helpful to consider the block column highlighted in Figure 1. The first block in this block column consists of a (square) levelone HODLR matrix, corresponding to the matrix in (11). The other three blocks are all of low rank, and the matrix in (11) corresponds to the first of these blocks. The bottom two blocks extend into the next block column(s). For these two blocks, it is assumed that their left factors are orthonormal. These left factors are ignored and the parts of the right factors residing in the highlighted block column are collected in the matrix in (11).
Given a matrix of the form (11), we aim at computing, recursively and approximately, a QR decomposition of the form
(12) 
such that are upper triangular HODLR matrices of level and the structure of reflects the structure of , that is,
(13) 
where: is a unit lower triangular HODLR matrix of level ; is in factorized form; and .
On the highest level of the recursion, when , the matrices in (11) vanish, and is the original HODLR matrix we aim at decomposing. The QR decomposition returned on the highest level has the form (12) with both triangular level HODLR matrices.
The computation of the QR decomposition (12) proceeds in several steps, which are detailed in the following.
Preprocessing.
Using the procedure described in Section 2.1.2, we may assume that the factorization of is normalized such that has orthonormal columns. For the moment, we will discard and aim at decomposing instead of the compressed matrix
(14) 
which has size .
QR decomposition of on the lowest level, .
QR decomposition of on higher levels, .
We proceed recursively as follows. First, is repartitioned as follows:
(16) 
Here, is split according to its HODLR format, that is, , , with , are HODLR matrices of level , and are lowrank matrices stored in factorized form.
Note that the first block column of in (16) has precisely the form (11) with the level of the HODLR matrix reduced by one, the lowrank block given by and the dense part given by . This allows us to apply recursion and obtain a QR decomposition
(17) 
with a HODLR matrix and a factorized lowrank matrix . We then update the second block column of :
(18) 
where
It is important to note that each term of the sum in the latter expression is a lowrank matrix and, in turn, has low rank. This not only makes the computation of efficient but it also implies that the updates in (18) are of low rank and thus preserve the structure of the second block column of .
After the update (18) has been performed, the process is completed by applying recursion to the updated second block column (18), without the first block, and obtain a QR decomposition
(19) 
By the discussion in Section 3, see in particular (10), combining the QR decompositions of the first and second block columns yields a QR decomposition of :
(20) 
with
and
Note that , , and are triangular level HODLR matrices, as desired.
Postprocessing.
4.1 Algorithm and complexity estimates
Algorithm 2 summarizes the recursive procedure described above. A QR decomposition of a level HODLR matrix is obtained by applying this algorithm with and void .
In the following, we derive complexity estimates for Algorithm 2 applied to under the assumptions stated in Section 2.1.2. In particular, it is assumed that all offdiagonal ranks (which are chosen adaptively) are bounded by .
Line 3.
Every lower offdiagonal block of needs to be transformed once to leftorthogonal form in the course of the algorithm. For each , , there are such blocks of size and rank at most . Using the procedure for leftorthogonalization explained in Section 2.1, the overall cost is
where we used .
Line 7.
The QR decomposition of the compressed block column is performed for all block columns on the lowest level of recursion. Each of them is of size , because there are at most lower offdiagonal blocks intersecting with each block column. Each QR decomposition requires operations and thus the overall cost is .
Lines 11–12.
The computation of involves the following operations:

three products of level HODLR matrices with lowrank matrices;

addition of four lowrank matrices, given in terms of their lowrank factors, combined with recompression.
The first part is effected by performing at most matrixvector multiplications with HODLR matrices. As the computation of is performed times for every , we arrive at a total cost of
(21) 
In the second part, the addition is performed for matrices of size . The first three terms in Line 11 have rank at most but the rank of the last term can be up to . Letting the rank grow to would lead to an unfavourable complexity, because the cost of recompression depends quadratically on the rank of the matrix to be recompressed. To avoid this effect, we execute separate additions, each immediately followed by the application of . Assuming that each recompression truncates to rank , this requires operations. Similarly as in (21), this leads to a total cost of .
Line 13.
For updating the second block column of , the following operations are performed:

The computation of requires HODLR matrixvector multiplications, followed by lowrank recompression of a matrix of rank at most . Analogously to (21), this requires a total cost of .

The computation of requires (approximate) subtraction of the product of two lowrank matrices from a level HODLR matrix. The most expensive part of this step is the recompression of the updated HODLR matrix with offdiagonal ranks at most , amounting to a total cost of

The computation of and involves the multiplication of a matrix with at most rows with a lowrank matrix, which requires operations each time. The total cost is thus again .
Lines 15–16.
The computation of the lowrank block involves:

three multiplications of level HODLR matrices with lowrank matrices;

addition of three lowrank matrices, given in terms of their lowrank factors, combined with recompression.
Therefore, total cost is identical with the cost for computing : .
Summary.
The total cost of Algorithm 2 applied to an HODLR matrix is .
5 Numerical results
In this section we demonstrate the efficiency of our method on several examples. All algorithms were implemented and executed in Matlab version R2016b on a dual Intel Core i75600U 2.60GHz CPU, KByte of level 2 cache and GByte of RAM, using a single core. Because all algorithms are rich in calls to BLAS and LAPACK routines, we believe that the use of Matlab (instead of a compiled language) does not severely limit the predictive value of the reported timings.
The following algorithms have been compared:
 CholQR

The Choleskybased QR decomposition for HODLR matrices explained in Section 2.2.
 CholQR2

CholQR followed by one step of the reorthogonalization procedure explained in Section 2.2.
 hQR

Algorithm 2, our newly proposed algorithm.
 MATLAB

Call to the MATLAB function qr, which in turn calls the corresponding LAPACK routine for computing the QR decomposition of a general dense matrix.
Among the methods discussed in Section 2, we have decided to focus on CholQR and CholQR2, primarily because they are relatively straightforward to implement. A comparison of CholQR and CholQR2 with the other methods from Section 2 can be found in [BennerMach2010].
If not stated otherwise, when working with HODLR matrices we have chosen the minimal block size and the truncation tolerance .
To assess accuracy, we have measured the numerical orthogonality of and the residual of the computed QR decomposition:
(22) 
Example 1 (Performance for random HODLR matrices).
We first investigate the performance of our method for HODLR matrices of varying size constructed as follows. The diagonal blocks are random dense matrices and each offdiagonal block is a rankone matrix chosen as the outer product of two random vectors. From Figure 2, one observes that the computational time of the hQR algorithm nicely matches the reference line, the complexity claimed in Section 4.1. Compared to the much simpler and as we shall see, less accurate CholQR method, our new method is approximately only two times slower, while CholQR2 is slower than hQR for . Note that for , the Cholesky factorization of fails to complete due to lack of (numerical) positive definiteness and, in turn, both CholQR and CholQR2 return with an error.
Table 2 provides insights into the observed accuracy for values of for which (22) can be evaluated conveniently. As increases, the condition number of increases. Our method is robust to this increase and produces numerical orthogonality and a residual norm on the level of the truncation error. In contrast, the accuracy of CholQR clearly deteriorates as the condition number increases and the refinement performed by the more expensive CholQR2 cannot fully make up for this.