randUTV: A blocked randomized algorithm for computing a rankrevealing UTV factorization
P.G. Martinsson^{1}^{1}1Department of Applied Mathematics, University of Colorado at Boulder, 526 UCB, Boulder, CO 803090526, USA, G. QuintanaOrtí^{2}^{2}2Depto. de Ingeniería y Ciencia de Computadores, Universitat Jaume I, 12.071 Castellón, Spain, N. Heavner^{1}^{1}1Department of Applied Mathematics, University of Colorado at Boulder, 526 UCB, Boulder, CO 803090526, USA
Abstract: This manuscript describes the randomized algorithm randUTV for computing a so called UTV factorization efficiently. Given a matrix , the algorithm computes a factorization , where and have orthonormal columns, and is triangular (either upper or lower, whichever is preferred). The algorithm randUTV is developed primarily to be a fast and easily parallelized alternative to algorithms for computing the Singular Value Decomposition (SVD). randUTV provides accuracy very close to that of the SVD for problems such as lowrank approximation, solving illconditioned linear systems, determining bases for various subspaces associated with the matrix, etc. Moreover, randUTV also produces highly accurate approximations to the singular values of . Unlike the SVD, the randomized algorithm proposed builds a UTV factorization in an incremental, singlestage, and noniterative way, making it possible to halt the factorization process once a specified tolerance has been met. Numerical experiments comparing the accuracy and speed of randUTV to the SVD are presented. These experiments demonstrate that in comparison to column pivoted QR, which is another factorization that is often used as a relatively economic alternative to the SVD, randUTV compares favorably in terms of speed while providing far higher accuracy.
1. Introduction
1.1. Overview
Given an matrix , the so called “UTV decomposition” [36, p. 400] takes the form
(1) 
where and are unitary matrices, and is a triangular matrix (either lower or upper triangular). The UTV decomposition can be viewed as a generalization of other standard factorizations such as, e.g., the Singular Value Decomposition (SVD) or the Column Pivoted QR decomposition (CPQR). (To be precise, the SVD is the special case where is diagonal, and the CPQR is the special case where is a permutation matrix.) The additional flexibility inherent in the UTV decomposition enables the design of efficient updating procedures, see [36, Ch. 5, Sec. 4] and [13, 35, 29]. In this manuscript, we describe a randomized algorithm we call randUTV that exploits the additional flexibility provided by the UTV format to build a factorization algorithm that combines some of the most desirable properties of standard algorithms for computing the SVD and CPQR factorizations.
Specific advantages of the proposed algorithm include: (i) randUTV provides close to optimal lowrank approximation in the sense that for , the output factors , , and satisfy
In particular, randUTV is much better at lowrank approximation than CPQR. A related advantage of randUTV is that the diagonal elements of provide excellent approximations to the singular values of , cf. Figure 10. (ii) The algorithm randUTV builds the factorization (1) incrementally, which means that when it is applied to a matrix of numerical rank , the algorithm can be stopped early and incur an overall cost of . Observe that standard algorithms for computing the SVD do not share this property. The CPQR algorithm can be stopped in this fashion, but typically leads to substantially suboptimal lowrank approximation. (iii) The algorithm randUTV is blocked. For a given block size , randUTV processes columns of at a time. Moreover, the vast majority of flops expended by randUTV are used in matrixmatrix multiplications involving and thin matrices with columns. This leads to high computational speeds, in particular when the algorithm is executed on multicore CPUs, GPUs, and distributedmemory architectures. (iv) The algorithm randUTV is not an iterative algorithm. In this regard, it is closer to the CPQR than to standard SVD algorithms, which substantially simplifies software optimization.
Section 6 describes the results from several numerical experiments that illustrate the accuracy and speed of randUTV in comparison to standard techniques for computing the SVD and CPQR factorizations. As a preview, we show in Figure 1 that randUTV executes far faster than a highly optimized implementation of the LAPACK function dgesvd for computing the SVD (we compare to the Intel MKL implementation). Figure 1 also includes lines that indicate how much faster CPQR is over the SVD, since the CPQR is often proposed as an economical alternative to the SVD for low rank approximation. Although the comparison between CPQR and randUTV is slightly unfair since randUTV is far more accurate than CPQR, we see that CPQR is faster for small matrices, but randUTV becomes competitive as the matrix size increases. More importantly, the relative speed of randUTV increases greatly as the number of processors increases. In other words, in modern computing environments, randUTV is both faster and far better at revealing the numerical rank of than column pivoted QR.
1.2. A randomized algorithm for computing the UTV decomposition
The algorithm we propose is blocked for computational efficiency. For concreteness, let us assume that , and that an upper triangular factor is sought. With a block size, randUTV proceeds through approximately steps, where at the ’th step the ’th block of columns of is driven to upper triangular form, as illustrated in Figure 2.
In the first iteration, randUTV uses a randomized subspace iteration inspired by [31, 19] to build two sets of orthonormal vectors that approximately span the spaces spanned by the dominant left and right singular vectors of , respectively. These basis vectors form the first columns of two unitary “transformation matrices” and . We use these to build a new matrix
that has a blocked structure as follows:
In other words, the top left block is diagonal, and the bottom left block is zero. Ideally, we would want the top right block to also be zero, and this is what we would obtain if we used the exact left and right singular vectors in the transformation. The randomized sampling does not exactly identity the correct subspaces, but it does to high enough accuracy that the elements in the top right block have very small moduli. For purposes of lowrank approximation, such a format strikes an attractive balance between computational efficiency and close to optimal accuracy. We will demonstrate that , and that the diagonal entries of form accurate approximations to the first singular values of .
Once the first step has been completed, the second step applies the same procedure to the remaining block , which has size , and then continues in the same fashion to process all remaining blocks.
1.3. Relationship to earlier work
The UTV factorization was introduced and popularized by G.W. Stewart in a sequence of papers, including [33, 34, 35, 37], and the text book chapter [36, p. 400]. Among these, [37] is of particular relevance as it discusses explicitly how the UTV decomposition can be used for lowrank approximation and for finding approximations to the singular values of a matrix; in Section 6.2 we compare the accuracy of the method in [37] to the accuracy of randUTV. Of relevance here is also [28], which describes deterministic iterative methods for driving a triangular matrix towards diagonality. Another path towards better rank revelation and more accurate rank estimation is described in [12]. A key advantage of the UTV factorization is the ease with which it can be updated, as discussed in, e.g., [4, 3, 29]. Implementation aspects are discussed in [13].
The work presented here relies crucially on previous work on randomized subspace iteration for approximating the linear spaces spanned by groups of singular vectors, including [31, 19, 25, 24, 18]. This prior body of literature focused on the task of computing partial factorizations of matrices. More recently, it was observed [26, 27, 11] that analogous techniques can be utilized to accelerate methods for computing full factorizations such as the CPQR. The gain in speed is attained from blocking of the algorithms, rather than by reducing the asymptotic flop count. An alternative approach to using randomization was described in [8].
1.4. Outline of paper
Section 2 introduces notation, and lists some relevant existing results that we need. Section 3 provides a highlevel description of the proposed algorithm. Section 4 describes in detail how to drive one block of columns to upper triangular form, which forms one step of the blocked algorithm. Section 5 describes the whole multistep algorithm, discusses some implementation issues, and provides an estimate of the asymptotic flop count. Section 6 gives the results of several numerical experiments that illustrate the speed and accuracy of the proposed method. Section 7 describes availability of software.
2. Preliminaries
This section introduces our notation, and reviews some established techniques that will be needed. The material in Sections 2.1–2.4 is described in any standard text on numerical linear algebra (e.g. [14, 36, 38]). The material in Section 2.5 on randomized algorithms is described in further detail in the survey [19] and the lecture notes [23].
2.1. Basic notation
Throughout this manuscript, we measure vectors in using their Euclidean norm. The default norm for matrices will be the corresponding operator norm , although we will sometimes also use the Frobenius norm . We use the notation of Golub and Van Loan [14] to specify submatrices: If is an matrix, and and are index vectors, then denotes the corresponding submatrix.. We let denote the matrix , and define analogously. denotes the identity matrix, and is the zero matrix. The transpose of is denoted , and we say that an matrix is orthonormal if its columns are orthonormal, so that . A square orthonormal matrix is said to be unitary.
2.2. The Singular Value Decomposition (SVD)
Let be an matrix and set . Then the SVD of takes the form
(2) 
where matrices and are orthonormal, and is diagonal. We have , , and , where and are the left and right singular vectors of , respectively, and are the singular values of . The singular values are ordered so that .
A principal advantage of the SVD is that it furnishes an explicit solution to the low rank approximation problem. To be precise, let us for define the rank matrix
Then, the EckartYoung theorem asserts that
A disadvantage of the SVD is that the cost to compute the full SVD is for a square matrix, and for a rectangular matrix, with large prefactors. Moreover, standard techniques for computing the SVD are challenging to parallelize, and cannot readily be modified to compute partial factorizations.
2.3. The Column Pivoted QR (CPQR) decomposition
Let be an matrix and set . Then, the CPQR decomposition of takes the form
(3) 
where is orthonormal, is upper triangular, and is a permutation matrix. The permutation matrix is typically chosen to ensure monotonic decay in magnitude of the diagonal entries of so that . The factorization (3) is commonly computed using the Householder QR algorithm [14, Sec. 5.2], which is exceptionally stable.
An advantage of the CPQR is that it is computed via an incremental algorithm that can be halted to produce a partial factorization once any given tolerance has been met. A disadvantage is that it is not ideal for lowrank approximation. In the typical case, the error incurred is noticeably worse than what you get from the SVD but not disastrously so. However, there exist matrices for which CPQR leads to very suboptimal approximation errors [21]. Specialized pivoting strategies have been developed that can in some circumstances improve the lowrank approximation property, resulting in so called “rankrevealing QR factorizations” [7, 16].
The classical column pivoted Householder QR factorization algorithm drives the matrix to upper triangular form via a sequence of rankone updates and column pivotings. Due to the column pivoting performed after each update, it is difficult to block, making it hard to achieve high computational efficiency on modern processors [10]. It has recently been demonstrated that randomized methods can be used to resolve this difficulty [26, 27, 11]. However, we do not yet have rigorous theory backing up such randomized techniques, and they have not yet been incorporated into standard software packages.
2.4. Efficient factorizations of tall and thin matrices
The algorithms proposed in this manuscript rely crucially on the fact that factorizations of “tall and thin” matrices can be computed efficiently. To be precise, suppose that we are given a matrix of size , where , and that we seek to compute an unpivoted QR factorization
(4) 
where is unitary, and is upper triangular. When the Householder QR factorization procedure is used to compute the factorization (4), the matrix is formed as a product of so called “Householder reflectors” [14, Sec. 5.2], and can be written in the form
(5) 
for some matrices and that can be readily computed given the Householder vectors that are formed in the QR factorization procedure [6, 32, 20]. As a consequence, we need only words of storage to store , and we can apply to an matrix using flops. In this manuscript, the different typeface in is used as a reminder that this is a matrix that can be stored and applied efficiently.
Next, suppose that we seek to compute the SVD of the tall and thin matrix . This can be done efficiently in a twostep procedure, where the first step is to compute the unpivoted QR factorization (4). Next, let denote the top block of so that
(6) 
Then, compute the SVD of to obtain the factorization
(7) 
Combining (4), (6), and (7), we obtain the factorization
(8) 
which we recognize as a singular value decomposition of . Observe that the cost of computing this factorization is , and that only words of storage are required, despite the fact that the matrix of left singular vectors is ostensibly an dense matrix.
2.5. Randomized power iterations
This section summarizes key results of [31, 19, 23] on randomized algorithms for constructing sets of orthonormal vectors that approximately span the row or column spaces of a given matrix. To be precise, let be an matrix, let be an integer such that , and suppose that we seek to find an orthonormal matrix such that
Informally, the columns of should approximately span the same space as the dominant right singular vectors of . This is a task that is very well suited for subspace iteration (see [9, Sec. 4.4.3] and [5]), in particular when the starting matrix is a Gaussian random matrix [31, 19]. With a (small) integer denoting the number of steps taken in the power iteration, the following algorithm leads to close to optimal results:

Draw a Gaussian random matrix of size .

Form a “sampling matrix” of size via .

Orthonormalize the columns of to form the matrix .
Observe that in step (2), the matrix is computed via alternating application of and to a tall thin matrix with columns. In some situations, orthonormalization is required between applications to avoid loss of accuracy due to floating point arithmetic. In [31, 19] it is demonstrated that by using a Gaussian random matrix as the starting point, it is often sufficient to take just a few steps of power iteration, say or , or even .
Remark 2 (Oversampling).
In the analysis of power iteration with a Gaussian random matrix as the starting point, it is common to draw a few “extra” samples. In other words, one picks a small integer representing the amount of oversampling done, say or , and starts with a Gaussian matrix of size . This results in an orthonormal (ON) matrix of size . Then, with probability almost 1, the error is close to the minimal error in rank approximation in both spectral and Frobenius norm [19, Sec. 10]. When no oversampling is used, one risks losing some accuracy in the last couple of modes of the singular value decomposition. However, our experience shows that in the context of the present article, this loss is hardly noticeable.
Remark 3 (Rsvd).
The randomized range finder described in this section is simply the first stage in the twostage “Randomized SVD (RSVD)” procedure for computing an approximate rank SVD of a given matrix of size . To wit, suppose that we have used the randomized range finder to build an ON matrix of size such that , for some some error matrix . Then, we can compute an approximate factorization
(9) 
where and are orthonormal, and is diagonal, via the following steps (which together form the “second stage” of RSVD): (1) Set so that . (2) Compute a full SVD of the small matrix so that
(10) 
(3) Set . Observe that these last three steps are exact up to floating point arithmetic, so the error in (9) is exactly the same as the error in the range finder alone: .
3. An overview of the randomized UTV factorization algorithm
This section describes at a high level the overall template of the algorithm randUTV that given an matrix computes its UTV decomposition (1). For simplicity, we assume that , that an upper triangular middle factor is sought, and that we work with a block size that evenly divides so that the matrix can be partitioned into blocks of columns each; in other words, we assume that . The algorithm randUTV iterates over steps, where at the ’th step the ’th block of columns is driven to upper triangular form via the application of unitary transformations from the left and the right. A cartoon of the process is given in Figure 2.
To be slightly more precise, we build at the ’th step unitary matrices and of sizes and , respectively, such that
Using these matrices, we drive towards upper triangular form through a sequence of transformations
The objective at step is to transform the ’th diagonal block to diagonal form, to zero out all blocks beneath the ’th diagonal block, and to make all blocks to the right of the ’th diagonal block as small in magnitude as possible.
Each matrix and consists predominantly of a product of Householder reflectors. To be precise, each such matrix is a product of Householder reflectors, but with the ’th block of columns rightmultiplied by a small unitary matrix.
4. A randomized algorithm for finding a pair of transformation matrices for the first step
4.1. Objectives for the construction
In this section, we describe a randomized algorithm for finding unitary matrices and that execute the first step of the process outlined in Section 3, and illustrated in Figure 2. To avoid notational clutter, we simplify the notation slightly, describing what we consider a basic single step of the process. Given an matrix and a block size , we seek two orthonormal matrices and of sizes and , respectively, such that the matrix
(11) 
has a diagonal leading block, and the entries beneath this block are all zeroed out. In Section 3, we referred to the matrices and as and , respectively, and as .
To make the discussion precise, let us partition and so that
(12) 
where and each contain columns. Then, set for so that
(13) 
Our objective is now to build matrices and that accomplish the following:

is diagonal, with entries that closely approximate the leading singular values of .

.

has small magnitude.

The norm of should be close to optimally small, so that .
The purpose of condition (iv) is to minimize the error in the rank approximation to , cf. Section 5.3.
4.2. A theoretically ideal choice of transformation matrices
Suppose that we could somehow find two matrices and whose first columns exactly span the subspaces spanned by the dominant left and right singular vectors, respectively. Finding such matrices is of course computationally hard, but if we could build them, then we would get the optimal result that

.

.

.
Enforcing condition (i) is then very easy, since the dominant block is now disconnected from the rest of the matrix. Simply executing a full SVD of this small block, and then updating and will do the job.
4.3. A randomized technique for approximating the span of the dominant singular vectors
Inspired by the observation in Section 4.2 that a theoretically ideal right transformation matrix is a matrix whose first columns span the space spanned by the dominant right singular vectors of , we use the randomized power iteration described in Section 2.5 to execute this task. We let denote a small integer specifying how many steps of power iteration we take. Typically, are good choices. Then, take the following steps: (1) Draw a Gaussian random matrix of size . (2) Compute a sampling matrix of size . (3) Perform an unpivoted Householder QR factorization of so that
Observe that will be a product of Householder reflectors, and that the first columns of will form an orthonormal basis for the space spanned by the columns of . In consequence, the first columns of form an orthonormal basis for a space that approximately spans the space spanned by the dominant right singular vectors of . (The font used for is a reminder that it is a product of Householder reflectors, which is exploited when it is stored and operated upon, cf. Section 2.4.)
4.4. Construction of the left transformation matrix
The process for finding the left transformation matrix is deterministic, and will exactly transform the first columns of into a diagonal matrix. Observe that with the unitary matrix constructed using the procedure in Section 4.3, we have the identity.
(14) 
where the partitioning is such that holds the first columns of . We now perform a full SVD on the matrix , which is of size so that
(15) 
Inserting (15) into (14) we obtain the identity
(16) 
Factor out in (16) to the left to get
Finally, factor out to the right to yield the factorization
(17) 
Equation (17) is the factorization that we seek, with
Remark 4.
As we saw in (17), the right transformation matrix consists of a product of Householder reflectors, with the first columns rotated by a small unitary matrix . We will next demonstrate that the left transformation matrix can be built in such a way that it can be written in an entirely analogous form. Simply observe that the matrix in (15) is a tall thin matrix, so that the SVD in (15) can efficiently be computed by first performing an unpivoted QR factorization of to yield a factorization
where is of size , and is a product of Householder reflectors, cf. Section 2.4. Then, perform an SVD of to obtain
The factorization (15) then becomes
(18) 
We see that the expression for in (18) is exactly analogous to the expression for in (17).
4.5. Summary of the construction of transformation matrices
Even though the derivation of the transformation matrices got slightly long, the final algorithm is simple. It can be written down precisely with just a few lines of Matlab code, as shown in the right panel in Figure 3 (the single step process described in this section is the subroutine stepUTV).
function [U,T,V] = randUTV(A,b,q) T = A; U = eye(size(A,1)); V = eye(size(A,2)); for i = 1:ceil(size(A,2)/b) I1 = 1:(b*(i1)); I2 = (b*(i1)+1):size(A,1); J2 = (b*(i1)+1):size(A,2); if (length(J2) > b) [UU,TT,VV] = stepUTV(T(I2,J2),b,q); else [UU,TT,VV] = svd(T(I2,J2)); end U(:,I2) = U(:,I2)*UU; V(:,J2) = V(:,J2)*VV; T(I2,J2) = TT; T(I1,J2) = T(I1,J2)*VV; end return 
function [U,T,V] = stepUTV(A,b,q) G = randn(size(A,1),b); Y = A’*G; for i = 1:q Y = A’*(A*Y); end [V,~] = qr(Y); [U,D,W] = svd(A*V(:,1:b)); T = [D,U’*A*V(:,(b+1):end)]; V(:,1:b) = V(:,1:b)*W; return 
5. The algorithm randUTV
In this section, we describe the algorithm randUTV that given an matrix computes a UTV factorization of the form (1). For concreteness, we assume that and that we seek to build an upper triangular middle matrix . The modifications required for the other cases are straightforward. Section 5.1 describes the most basic version of the scheme, Section 5.2 describes a computationally efficient version, and Section 5.4 provides a calculation of the asymptotic flop count of the resulting algorithm.
5.1. A simplistic algorithm
The algorithm randUTV is obtained by applying the singlestep algorithm described in Section 4 repeatedly, to drive to upper triangular form one block of columns at a time. We recall that a cartoon of the process is shown in Figure 2. At the start of the process, we create three arrays that hold the output matrices , , and , and initialize them by setting
In the first step of the iteration, we use the singlestep technique described in Section 4 to create two unitary “left and right transformation matrices” and and then update , , and accordingly:
This leaves us with a matrix whose leading columns are upper triangular (like matrix in Figure 2). For the second step, we build transformation matrices and by applying the singlestep algorithm described in Section 4 to the remainder matrix , and then updating , , and accordingly.
The process then continues to drive one block of columns at a time to upper triangular form. With denoting the total number of steps, we find that after steps, all that remains to process is the bottom right block of (cf. the matrix in Figure 2). This block consists of columns if is a multiple of the block size, and is otherwise even thinner. For this last block, we obtain the final left and right transformation matrices and by computing a full singular value decomposition of the remaining matrix , and updating the matrices , , and accordingly. (In the cartoon in Figure 2, we have , and the matrices and are built by computing a full SVD of a dense matrix of size .)
We call the algorithm described in this section randUTV. It can be coded using just a few lines of Matlab code, as illustrated in Figure 3. In this simplistic version, all unitary matrices are represented as dense matrices, which makes the overall complexity for an matrix.
5.2. A computationally efficient version
In this section, we describe how the basic version of randUTV, as given in Figure 3, can be turned into a highly efficient procedure via three simple modifications. The resulting algorithm is summarized in Figure 4. We note that the two versions of randUTV described in Figures 3 and 4 are mathematically equivalent; if they were to be executed in exact arithmetic, their outputs would be identical.
The first modification is that all operations on the matrix and on the unitary matrices and should be carried out “in place” to not unnecessarily move any data. To be precise, using the notation in Section 4, we generate at each iteration four unitary matrices , , , and . As soon as such a matrix is generated, it is immediately applied to , and then used to update either or .
Second, we exploit that the two “large” unitary transforms and both consist of products of Householder reflectors. We generate them by computing unpivoted Householder QR factorizations of tall and thin matrices, using a subroutine that outputs simply the Householder vectors. Then, and can both be stored and applied efficiently, as described in Section 2.4.
The third and final modification pertains to the situation where the input matrix is nonsquare. In this case, the full SVD that is computed in the last step involves a rectangular matrix. When is tall (), we find at this step that is empty, so the matrix to be processed is . When computing the SVD of this matrix, we use the efficient version described in Remark 4, which outputs a factorization in which consists in part of a product of Householder reflectors. (In this situation is in fact not necessarily “small,” but it can be stored and applied efficiently.)
Remark 5 (The case ).
In a situation where the matrix has fewer rows than columns, but we still seek an upper triangular matrix , randUTV proceeds exactly as described for the first steps. In the final step, we now find that is empty, but is not, and so we need to compute the SVD of the “fat” matrix . We do this in a manner entirely analogous to how we handle a “tall” matrix, by first performing an unpivoted QR factorization of the rows of . In this situation it is the matrix of right singular vectors at the last step that consists in part of a product of Householder reflectors.
5.3. Connection between RSVD and randUTV
The proposed algorithm randUTV is directly inspired by the Randomized SVD (RSVD) algorithm described in Remark 3 (as originally described in [24, 22, 31] and later elaborated in [25, 19]). In this section, we explore this connection in more detail, and demonstrate that the lowrank approximation error that results from the “singlestep” UTVfactorization described in Section 4 is identical to the error produced by the RSVD (with a twist). This means that the detailed error analysis that is available for the RSVD (see, e.g., [41, 15, 19]) immediately applies to the procedure described here. To be precise:
Theorem 1.
Let be an matrix, let be an integer denoting step size such that , and let denote a nonnegative integer. Let denote the Gaussian matrix drawn in Section 4.3, and let , , and be the factors in the factorization built in Sections 4.3 and 4.4, partitioned as in (12) and (13).

Let denote a sampling matrix, and let denote an orthonormal matrix whose columns form a basis for the column space of . Then, the error precisely equals the error incurred by the RSVD with steps of power iteration, as analyzed in [19, Sec. 10]. It holds that
(19) 
Let denote a sampling matrix, and let denote an orthonormal matrix whose columns form a basis for the column space of . If the rank of is at least , then
(20)
We observe that the term that arises in part (b) can informally be said to be the error resulting from RSVD with “” steps of power iteration. This conforms with what one might have optimistically hoped for, given that the RSVD involves applications of either or to thin matrices with columns, and randUTV involves such operations at each step ( applications in building , and then the computations of and , which are in practice applications of to thin matrices due to the identity (5)).
Proof.
The proofs for the two parts rest on the fact that can be decomposed as follows:
(21) 
where and have columns each, and is of size . With the identity (21) in hand, the claims in (a) follow once we have established that the orthogonal projection equals the projection . The claims in (b) follow once we establish that .
(a) Observe that the matrix is by construction identical to the matrix built in Sections 4.3 and 4.4. Since , we find that . Since , it follows that . Then
(22) 
The first identity in (19) follows immediately from (22). The second identity holds since (21) implies that , with unitary and orthonormal.
(b) We will first prove that with probability 1, the two matrices and have the same column spaces. To this end, note that since is obtained by performing an unpivoted QR factorization of the matrix defined in (a), we know that for some upper triangular matrix . The assumption that has rank at least implies that is invertible with probability 1. Consequently, , since . Since right multiplication by an invertible matrix does not change the column space of a matrix, the claim follows.
Remark 6 (Oversampling).
We recall that the accuracy of randUTV depends on how well the space aligns with the space spanned by the dominant right singular vectors of . If these two spaces were to match exactly, then the truncated UTV factorization would achieve perfectly optimal accuracy. One way to improve the alignment is to increase the power parameter . A second way to make the two spaces align better is to use oversampling, as described in Remark 2. With an oversampling parameter (say or ), we would draw a Gaussian random matrix of size , and then compute an “extended” sampling matrix of size . The sampling matrix we would use to compute would then be formed by the dominant left singular vectors of , cf. Figure 11. Oversampling in this fashion does improve the accuracy (see Section A.2), but in our experience, the additional computational cost is not worth it. Incorporating oversampling would also introduce an additional tuning parameter , which is in many ways undesirable.
5.4. Theoretical cost of randUTV
Now we analyze the theoretical cost of the implementation of randUTV, and we compare it to those of CPQR and SVD.
The theoretical cost of the CPQR factorization and the unpivoted QR factorization of an matrix is: flops, when no orthonormal matrices are required. Although the theoretical cost of both the pivoted and the unpivoted QR is the same, other factors should be considered, being the most important one the quality of flops. In modern architectures, flops performed inside BLAS3 operations can be about 5–10 times faster than flops performed inside BLAS1 and BLAS2 operations, since BLAS3 is CPUbound whereas BLAS1 and BLAS2 are memorybound. Hence, the unpivoted QR factorization in highperformance libraries such as [2] can be much faster because most of the flops are performed inside BLAS3 operations, whereas only half of the flops in the best implementations of CPQR (dgeqp3) in [2] are performed inside BLAS3 operations. In addition, this low performance of CPQR can even be smaller because of the appearance of catastrophic cancellations during the computations. The appearance of just one catastrophic cancellation will stop the building of a block Householder reflector before all of it has been built. This sudden stop forces the algorithm to work on smaller block sizes, which are suboptimal, and hence performances are even lower.
The SVD usually comprises two steps: the reduction to bidiagonal form, and then the reduction from bidiagonal to diagonal form. The first step is a direct step, whereas the second step is an iterative one. The theoretical cost of the reduction to bidiagonal form of an matrix is: flops, when no singular vectors are needed. If , the cost can be reduced to: flops by performing first a QR factorization. The cost of the reduction to diagonal form depends on the number of iterations, which is unknown a priori, but it is usually small when no singular vectors are built. On the one hand, in the bidiagonalization a large share of the flops are performed inside the notsoefficient BLAS1 and BLAS2 operations. Therefore, no high performances are obtained in the reduction to bidiagonal form. On the other hand, the reduction to diagonal form uses just BLAS1 operations. These operations are memorybound, and in addition they cannot be efficiently parallelized within BLAS, which might reduce performances on multicore machines. In conclusion, usual implementations of the SVD will render low performances on both singlecore architectures and multicore architectures.
The theoretical cost of the randUTV factorization of an matrix is: flops, when no orthonormal matrices are required, and steps of power iteration are applied. If , the theoretical cost of randUTV is three times as high as the theoretical cost of CPQR; if , it is four times as high; and if , it is five times as high. Although randUTV seems computationally more expensive than CPQR, the quality of flops should be considered. The share of BLAS1 and BLAS2 flops in randUTV is very small: BLAS1 and BLAS2 flops are only employed inside the CPQR factorization of the sampling matrix , the QR factorization of the current column block, and the SVD of the diagonal block. As these operations only apply to blocks of dimensions , , and , respectively, the total amount of these types of flops is negligible, and therefore most of the flops performed in randUTV are BLAS3 flops. Hence, the algorithm for computing the randUTV will be much faster than what the theoretical cost predicts. In conclusion, this heavy use of BLAS3 operations will render good performances on singlecore architectures, multicore architectures, GPUs, and distributedmemory architectures.
6. Numerical results
6.1. Computational speed
In this section, we investigate the speed of the proposed algorithm randUTV, and compare it to the speeds of highly optimized methods for computing the SVD and the column pivoted QR (CPQR) factorization.
All experiments reported in this article were performed on an Intel Xeon E52695 v3 (Haswell) processor (2.3 GHz), with 14 cores. In order to be able to show scalability results, the clock speed was throttled at 2.3 GHz, turning off socalled turbo boost. Other details of interest include that the OS used was Linux (Version 2.6.32504.el6.x86_64), and the code was compiled with gcc (Version 4.4.7). Main routines for computing the SVD (dgesvd) and the CPQR (dgeqp3) were taken from Intel’s MKL library (Version 11.2.3) since this library usually delivers much higher performances than Netlib’s LAPACK codes. Our implementations were coded with libflame [40, 39] (Release 11104).
Each of the three algorithms we tested (randUTV, SVD, CPQR) was applied to doubleprecision real matrices of size . We report the following times: The time in seconds for the LAPACK function dgesvd from Intel’s MKL. The time in seconds for the LAPACK function dgeqp3 from Intel’s MKL. The time in seconds for our implementation of randUTV. For the purpose of a fair comparison, the three implementations were linked to the BLAS library from Intel’s MKL. In all cases, we used an algorithmic block size of . While likely not optimal for all problem sizes, this block size yields near best performance and, regardless, it allows us to easily compare and contrast the performance of the different implementations.
Table 1 shows the measured computational times when executed on 1, 4, and 14 cores, respectively. In these experiments, all orthonormal matrices ( and for SVD and UTV, and for CPQR) are explicitly formed. This slightly favors CPQR since only one orthonormal matrix is required. The corresponding numbers obtained when orthonormal matrices are not built are given in Appendix A.1. To better illustrate the relative performance of the various techniques, we plot in Figure 5 the computational times measured divided by . Since all techniques under consideration have asymptotic complexity when applied to an matrix, these graphs better reveal the computational efficiency. (We plot time divided by rather than the more commonly reported “normalized Gigaflops” since the algorithms we compare have different scaling factors multiplying the dominant term in the asymptotic flop count.) Figure 5 also shows the timings we measured when the orthonormal matrices were not formed.
The results in Figure 5 lead us to make several observations: (1) The algorithm randUTV is decisively faster than the SVD in almost all cases (the exceptions involve the situation where no unitary matrices are sought, and the input matrix is small). (2) Comparing the speeds of CPQR and randUTV, we see that when both methods are executed on a single core, the speeds are similar, with CPQR being slightly faster in some regimes. (3) As the matrix size grows, and as the number of cores increases, randUTV gains an edge on CPQR in terms of speed.
Computational times when executed on a single core  
500  1.21e01  2.54e02  6.13e02  6.82e02  7.46e02 
1000  7.85e01  1.72e01  3.36e01  3.82e01  4.27e01 
2000  5.52e+00  1.30e+00  2.33e+00  2.67e+00  3.01e+00 
3000  2.11e+01  6.08e+00  7.72e+00  8.93e+00  1.01e+01 
4000  5.31e+01  1.62e+01  1.80e+01  2.10e+01  2.40e+01 
5000  1.04e+02  3.22e+01  3.39e+01  3.98e+01  4.57e+01 
6000  1.82e+02  5.65e+01  5.82e+01  6.85e+01  7.88e+01 
8000  4.34e+02  1.39e+02  1.39e+02  1.63e+02  1.87e+02 
10000  8.36e+02  2.66e+02  2.60e+02  3.08e+02  3.55e+02 
Computational times when executed on 4 cores  