Faster Least Squares Approximation
Least squares approximation is a technique to find an approximate solution to a system of linear equations that has no exact solution. In a typical setting, one lets be the number of constraints and be the number of variables, with . Then, existing exact methods find a solution vector in time. We present two randomized algorithms that provide accurate relative-error approximations to the optimal value and the solution vector of a least squares approximation problem more rapidly than existing exact algorithms. Both of our algorithms preprocess the data with the Randomized Hadamard Transform. One then uniformly randomly samples constraints and solves the smaller problem on those constraints, and the other performs a sparse random projection and solves the smaller problem on those projected coordinates. In both cases, solving the smaller problem provides relative-error approximations, and, if is sufficiently larger than , the approximate solution can be computed in time.
In many applications in mathematics and statistical data analysis, it is of interest to find an approximate solution to a system of linear equations that has no exact solution. For example, let a matrix and a vector be given. If , there will not in general exist a vector such that , and yet it is often of interest to find a vector such that in some precise sense. The method of least squares, whose original formulation is often credited to Gauss and Legendre , accomplishes this by minimizing the sum of squares of the elements of the residual vector, i.e., by solving the optimization problem
It is well-known that the minimum -norm vector among those satisfying eqn. (Equation 1) is
where denotes the Moore-Penrose generalized inverse of the matrix . This solution vector has a very natural statistical interpretation as providing an optimal estimator among all linear unbiased estimators, and it has a very natural geometric interpretation as providing an orthogonal projection of the vector onto the span of the columns of the matrix .
Recall that to minimize the quantity in eqn. (Equation 1), we can set the derivative of with respect to equal to zero, from which it follows that the minimizing vector is a solution of the so-called normal equations
Geometrically, this means that the residual vector is required to be orthogonal to the column space of , i.e., . While solving the normal equations squares the condition number of the input matrix (and thus is not recommended in practice), direct methods (such as the QR decomposition ) solve the problem of eqn. (Equation 1) in time assuming that . Finally, an alternative expression for the vector of eqn. (Equation 2) emerges by leveraging the Singular Value Decomposition (SVD) of . If denotes the SVD of , then
In this paper, we describe two randomized algorithms that will provide accurate relative-error approximations to the minimal -norm solution vector of eqn. (Equation 2) faster than existing exact algorithms for a large class of overconstrained least-squares problems. In particular, we will prove the following theorem.
We will provide a precise statement of the running time for our two algorithms (including the -dependence) in Theorems ? (Section Section 4) and ? (Section Section 5), respectively. It is worth noting that the claims of Theorem ? can be made to hold with probability , for any , by repeating the algorithm times. For example, one could run ten independent copies of the algorithm and keep the vector that minimizes the residual. This clearly does not increase the running time of the algorithm by more than a constant factor, while driving the failure probability down to (approximately) . Also, we will assume that is a power of two and that the rank of the matrix equals . (We note that padding and with all-zero rows suffices to remove the first assumption.)
We now provide a brief overview of our main algorithms. Let the matrix product denote the Randomized Hadamard Transform (see also Section 2.4). Here the matrix denotes the (normalized) matrix of the Hadamard transform and the diagonal matrix is formed by setting its diagonal entries to or with equal probability in independent trials. This transform has been used as one step in the development of a “fast” version of the Johnson-Lindenstrauss lemma . Our first algorithm is a random sampling algorithm. After premultiplying and by , this algorithm samples uniformly at random constraints from the preprocessed problem. (See eqn. ( ?), as well as the remarks after Theorem ? for the precise value of .) Then, this algorithm solves the least squares problem on just those sampled constraints to obtain a vector such that Theorem ? is satisfied. Note that applying the randomized Hadamard transform to the matrix and vector only takes time. This follows since we will actually sample only of the constraints from the Hadamard-preprocessed problem . Then, exactly solving the sampled least-squares problem will require only time. Assuming that is a constant and , it follows that the running time of this algorithm is when .
In a similar manner, our second algorithm also initially premultiplies and by . This algorithm then multiplies the result by a sparse projection matrix , where . This matrix is described in detail in Section 5.2. Its construction depends on a sparsity parameter, and it is identical to the “sparse projection” matrix in Matoušek’s version of the Ailon-Chazelle result . Finally, our second algorithm solves the least squares problem on just those coordinates to obtain such that the three claims of Theorem ? are satisfied. Assuming that is a constant and , it follows that the running time of this algorithm is when .
It is worth noting that our second algorithm has a (marginally) less restrictive assumption on the connection between and . However, the first algorithm is simpler to implement and easier to describe. Clearly, an interesting open problem is to relax the above constraints on for either of the proposed algorithms.
We should note several lines of related work.
First, techniques such as the “method of averages”  preprocess the input into the form of eqn. (Equation 4) of Section 3 and can be used to obtain exact or approximate solutions to the least squares problem of eqn. (Equation 1) in time under strong statistical assumptions on and . To the best of our knowledge, however, the two algorithms we present and analyze are the first algorithms to provide nontrivial approximation guarantees for overconstrained least squares approximation problems in time, while making no assumptions at all on the input data.
Second, Ibarra, Moran, and Hui  provide a reduction of the least squares approximation problem to the matrix multiplication problem. In particular, they show that time, where is the time needed to multiply two matrices, is sufficient to solve this problem. All of the running times we report in this paper assume the use of standard matrix multiplication algorithms, since matrix multiplication algorithms are almost never used in practice. Moreover, even with the current best value for the matrix multiplication exponent, , our algorithms are still faster.
Third, motivated by our preliminary results as reported in  and , both Rokhlin and Tygert  as well as Avron, Maymounkov, and Toledo  have empirically evaluated numerical implementations of variants of one of the algorithms we introduce. We describe this in more detail below in Section 1.3.
Fourth, very recently, Clarkson and Woodruff proved space lower bounds on related problems ; and Nguyen, Do, and Tran achieved a small improvement in the sampling complexity for related problems .
1.3Empirical performance of our randomized algorithms
In prior work we have empirically evaluated randomized algorithms that rely on the ideas that we introduce in this paper in several large-scale data analysis tasks. Nevertheless, it is a fair question to ask whether our “random perspective” on linear algebra will work well in numerical implementations of interest in scientific computation. We address this question here. Although we do not provide an empirical evaluation in this paper, in the wake of the original Technical Report version of this paper in 2007 , two groups of researchers have demonstrated that numerical implementations of variants of the algorithms we introduce in this paper can perform very well in practice.
In 2008, Rokhlin and Tygert  describe a variant of our random projection algorithm, and they demonstrate that their algorithm runs in time
where is an “oversampling” parameter and is a condition number. Importantly (at least for very high-precision applications of this random sampling methodology), they reduce the dependence on from to . Moreover, by choosing , they demonstrate that . Although this bound is inferior to ours, they also consider a class of matrices for which choosing empirically produced a condition number , which means that for this class of matrices their running time is
Their numerical experiments on this class of matrices clearly indicate that their implementations of variants of our algorithms perform well for certain matrices as small as thousands of rows by hundreds of columns.
In 2009, Avron, Maymounkov, Toledo  introduced a randomized least-squares solver based directly on our algorithms. They call it Blendenpik, and by considering a much broader class of matrices, they demonstrate that their solver “beats LAPACK’s direct dense least-sqares solver by a large margin on essentially any dense tall matrix.” Beyond providing additional theoretical analysis, including backward error analysis bounds for our algorithm, they consider five (and numerically implement three) random projection strategies (i.e., Discrete Fourier Transform, Discrete Cosine Transform, Discrete Hartely Transform, Walsh-Hadamard Transform, and a Kac random walk), and they evaluate their algorithms on a wide range of matrices of various sizes and various “localization ” or “coherence” properties. Based on these results that empirically show the superior performance of randomized algorithms such as those we introduce and analyze in this paper on a wide class of matrices, they go so far as to “suggest that random-projection algorithms should be incorporated into future versions of LAPACK.”
After a brief review of relevant background in Section 2, Section 3 presents a structural result outlining conditions on preconditioner matrices that are sufficient for relative-error approximation. Then, we present our main sampling-based algorithm for approximating least squares approximation in Section 4 and in Section 5 we present a second projection-based algorithm for the same problem. Preliminary versions of parts of this paper have appeared as conference proceedings in the 17th ACM-SIAM Symposium on Discrete Algorithms  and in the 47th IEEE Symposium on Foundations of Computer Science ; and the original Technical Report version of this journal paper has appeared on the arXiv . In particular, the core of our analysis in this paper was introduced in , where an expensive-to-compute probability distribution was used to construct a relative-error approximation sampling algorithm for the least squares approximation problem. Then, after the development of the Fast Johnson-Lindenstrauss transform ,  proved that similar ideas could be used to improve the running time of randomized algorithms for the least squares approximation problem. In this paper, we have combined these ideas, treated the two algorithms in a manner to highlight their similarities and differences, and considerably simplified the analysis.
We let denote the set ; denotes the natural logarithm of and denotes the base two logarithm of . For any matrix , denotes the -th row of as a row vector and denotes the -th column of as a column vector. Also, given a random variable , we let denote its expectation and denote its variance.
We will make frequent use of matrix and vector norms. More specifically, we let
denote the square of the Frobenius norm of , and we let
denote the spectral norm of . For any vector , its -norm (or Euclidean norm) is equal to the square root of the sum of the squares of its elements, while its norm is defined as .
2.2Linear Algebra background
We now review relevant definitions and facts from linear algebra; for more details, see . Let the rank of be . The Singular Value Decomposition (SVD) of is denoted by , where is the matrix of left singular vectors, is the diagonal matrix of non-zero singular values, and is the matrix of right singular vectors. Let , denote the -th non-zero singular value of , and and denote the maximum and minimum singular value of . The condition number of is . The Moore-Penrose generalized inverse, or pseudoinverse, of may be expressed in terms of the SVD as . Finally, for any orthogonal matrix , let denote an orthogonal matrix whose columns are an orthonormal basis spanning the subspace of that is orthogonal to the column space of . In terms of , the optimal value of the least squares residual of eqn. (Equation 1) is
2.3Markov’s inequality and the union bound
We will make frequent use of the following fundamental result from probability theory, known as Markov’s inequality . Let be a random variable assuming non-negative values with expectation . Then, for all ,
with probability at least .
We will also need the so-called union bound. Given a set of random events holding with respective probabilities , the probability that all events hold (i.e., the probability of the union of those events) is upper bounded by .
2.4The Randomized Hadamard Transform
The Randomized Hadamard Transform was introduced in  as one step in the development of a fast version of the Johnson-Lindenstrauss lemma . Recall that the (non-normalized) matrix of the Hadamard transform may be defined recursively as follows:
The normalized matrix of the Hadamard transform is equal to ; hereafter, we will denote this normalized matrix by . Now consider a diagonal matrix such that the diagonal entries are set to +1 with probability and to with probability in independent trials. The product is the Randomized Hadamard Transform and has two useful properties. First, when applied to a vector, it “spreads out” its energy, in the sense of providing a bound for its infinity norm (see Section 4.2). Second, computing the product for any vector takes time. Even better, if we only need to access, say, elements in the transformed vector, then those elements can be computed in time . We will expand on the latter observation in the proofs of Theorems ? and ?.
3Our algorithms as preconditioners
Both of our algorithms may be viewed as preconditioning the input matrix and the target vector with a carefully-constructed data-independent random matrix . For our random sampling algorithm, we let , where is a matrix that represents the sampling operation and is the Randomized Hadamard Transform, while for our random projection algorithm, we let , where is a random projection matrix. Thus, we replace the least squares approximation problem of eqn. (Equation 1) with the least squares approximation problem
We explicitly compute the solution to the above problem using a traditional deterministic algorithm , e.g., by computing the vector
Alternatively, one could use standard iterative methods such as the the Conjugate Gradient Normal Residual method (CGNR, see  for details), which can produce an -approximation to the optimal solution of eqn. (Equation 4) in time, where is the condition number of and is the number of rows of .
3.1A structural result sufficient for relative-error approximation
In this subsection, we will state and prove a lemma that establishes sufficient conditions on any matrix such that the solution vector to the least squares problem of eqn. (Equation 4) will satisfy relative-error bounds of the form ( ?) and ( ?). Recall that the SVD of is . In addition, for notational simplicity, we let denote the part of the right hand side vector lying outside of the column space of .
The two conditions that we will require of the matrix are:
for some . Several things should be noted about these conditions. First, although condition ( ?) depends on the right hand side vector , Algorithms ? and ? will satisfy it without using any information from . Second, although condition (Equation 6) only states that , for all , for both of our randomized algorithms we will show that , for all . Thus, one should think of as an approximate isometry. Third, condition ( ?) simply states that remains approximately orthogonal to . Finally, note that the following lemma is a deterministic statement, since it makes no explicit reference to either of our randomized algorithms. Failure probabilities will enter later when we show that our randomized algorithms satisfy conditions (Equation 6) and ( ?).
Let us first rewrite the down-scaled regression problem induced by as
(Equation 7) follows since and ( ?) follows since the columns of the matrix span the same subspace as the columns of . Now, let be such that , and note that minimizes eqn. ( ?). The latter fact follows since
Thus, by the normal equations (Equation 3), we have that
Taking the norm of both sides and observing that under condition (Equation 6) we have , for all , it follows that
Using condition ( ?) we observe that
To establish the first claim of the lemma, let us rewrite the norm of the residual vector as
where (Equation 11) follows by Pythagoras, since , which is orthogonal to , and consequently to ; ( ?) follows by the definition of and ; and ( ?) follows by (Equation 10) and the orthogonality of . The first claim of the lemma follows since .
To establish the second claim of the lemma, recall that . If we take the norm of both sides of this expression, we have that
where (Equation 12) follows since is the smallest singular value of and since the rank of is ; and ( ?) follows by (Equation 10) and the orthogonality of . Taking the square root, the second claim of the lemma follows.
If we make no assumption on , then ( ?) from Lemma ? may provide a weak bound in terms of . If, on the other hand, we make the additional assumption that a constant fraction of the norm of lies in the subspace spanned by the columns of , then ( ?) can be strengthened. Such an assumption is reasonable, since most least-squares problems are practically interesting if at least some part of lies in the subspace spanned by the columns of .
Since , it follows that
This last inequality follows from , which implies
By combining this with eqn. ( ?) of Lemma ?, the lemma follows.
4A sampling-based randomized algorithm
In this section, we present our randomized sampling algorithm for the least squares approximation problem of eqn. (Equation 1). We also state and prove an associated quality-of-approximation theorem.
4.1The main algorithm and main theorem
Algorithm ? takes as input a matrix , a vector , and an error parameter . This algorithm starts by preprocessing the matrix and the vector with the Randomized Hadamard Transform. It then constructs a smaller problem by sampling uniformly at random a small number of constraints from the preprocessed problem. Our main quality-of-approximation theorem (Theorem ? below) states that with constant probability over the random choices made by the algorithm, the vector returned by this algorithm will satisfy the relative-error bounds of eqns. ( ?) and ( ?) and will be computed quickly.
In more detail, after preprocessing with the Randomized Hadamard Transform of Section 2.4, Algorithm ? samples exactly constraints from the preprocessed least squares problem, rescales each sampled constraint by , and solves the least squares problem induced on just those sampled and rescaled constraints. (Note that the algorithm explicitly computes only those rows of and only those elements of that need to be accessed.) More formally, we will let denote a sampling matrix specifying which of the constraints are to be sampled and how they are to be rescaled. This matrix is initially empty and is constructed as described in Algorithm ?. Then, we can consider the problem
which is just a least squares approximation problem involving the constraints sampled from the matrix after the preprocessing with the Randomized Hadamard Transform. The minimum -norm vector among those that achieve the minimum value in this problem is
which is the output of Algorithm ?.
Remark: Assuming that , and using , we get that
Thus, the running time of Algorithm ? becomes
Assuming that , the above running time reduces to
It is worth noting that improvements over the standard time could be derived with weaker assumptions on and . However, for the sake of clarity of presentation, we only focus on the above setting.
Remark: The assumptions in our theorem have a natural geometric interpretation.
The fraction is the sine of the angle between and the column space of ; see page 242 of . Thus, is a bound on the tangent between and the column space of ; see page 244 of . This means that the bound for is proportional to this tangent.
4.2The effect of the Randomized Hadamard Transform
In this subsection, we state a lemma that quantifies the manner in which approximately “uniformizes” information in the left singular subspace of the matrix . We state the lemma for a general orthogonal matrix such that , although we will be interested in the case when and consists of the top left singular vectors of the matrix .
We follow the proof of Lemma 2.1 in . In that lemma, the authors essentially prove that the Randomized Hadamard Transform “spreads out” input vectors. More specifically, since the columns of the matrix (denoted by for all ) are unit vectors, they prove that for fixed and fixed ,
(Note that we consider vectors in whereas  considered vectors in and thus the roles of and are inverted in our proof.) Let to get
From a standard union bound, this immediately implies that with probability at least ,
holds for all and . Using
for all , we conclude the proof of the lemma.
4.3Satisfying condition ()
We now establish the following lemma which states that all the singular values of are close to one. The proof of Lemma ? depends on a bound for approximating the product of a matrix times its transpose by sampling (and rescaling) a small number of columns of the matrix. This bound appears as Theorem ? in the Appendix and is an improvement over prior work of ours in .
Note that for all
In the above, we used the fact that . We now can view as an approximation to the product of two matrices and by randomly sampling and rescaling columns of . Thus, we can leverage Theorem ? from the Appendix. More specifically, consider the matrix . Obviously, since , , and are orthogonal matrices, and . Let ; since we assumed that eqn. ( ?) holds, we note that the columns of , which correspond to the rows of , satisfy
Thus, applying Theorem ? with as above, , and implies that
holds with probability at least . For the above bound to hold, we need to assume the value of eqn. ( ?). Finally, we note that since , the assumption of Theorem ? on the Frobenius norm of the input matrix is always satisfied. Combining the above with inequality (Equation 15) concludes the proof of the lemma.
4.4Satisfying condition ()
We next prove the following lemma, from which it will follow that condition ( ?) is satisfied by Algorithm ?. The proof of this lemma depends on bounds for randomized matrix multiplication algorithms that appeared in .
Recall that and that . We start by noting that since it follows that
Thus, we can view as approximating the product of two matrices and by randomly sampling columns from and rows/elements from . Note that the sampling probabilities are uniform and do not depend on the norms of the columns of or the rows of . However, we can still apply the results of Table 1 (second row) in page 150 of . More specifically, since we condition on eqn. ( ?) holding, the rows of (which of course correspond to columns of ) satisfy
for . Applying the result of Table 1 (second row) of  we get
In the above we used . Markov’s inequality now implies that with probability at least .9,
Setting and using the value of specified above concludes the proof of the lemma.
4.5Completing the proof of Theorem
We now complete the proof of Theorem ?. First, let denote the event that eqn. ( ?) holds; clearly, . Second, let denote the event that both Lemmas ? and ? hold conditioned on holding. Then,
In the above, denotes the complement of event . In the first inequality we used the union bound and in the second inequality we leveraged the bounds for the failure probabilities of Lemmas ? and ? given that eqn. ( ?) holds. We now let denote the event that both Lemmas ? and ? hold, without any a priori conditioning on event ; we will bound as follows:
In the first inequality we used the fact that all probabilities are positive. The above derivation immediately bounds the success probability of Theorem ?. Combining Lemmas ? and ? with the structural results of Lemma ? and setting as in eqn. ( ?) concludes the proof of the accuracy guarantees of Theorem ?.
We now discuss the running time of Algorithm ?. First of all, by the construction of , the number of non-zero entries in is . In Step we need to compute the products and . Recall that has columns and thus the running time of computing both products is equal to the time needed to apply on vectors. First, note that in order to apply on vectors in , operations suffice. In order to estimate how many operations are needed to apply on vectors, we use the results of Theorem (see also Section 7) of Ailon and Liberty , which state that at most operations are needed for this operation. Here denotes the number of non-zero elements in the matrix , which is at most . After this preprocessing, Algorithm ? must compute the pseudoinverse of an matrix, or, equivalently, solve a least-squares problem on constraints and variables. This operation can be performed in time since . Thus, the entire algorithm runs in time
5A projection-based randomized algorithm
In this section, we present a projection-based randomized algorithm for the least squares approximation problem of eqn. (Equation 1). We also state and prove an associated quality-of-approximation theorem.
5.1The main algorithm and main theorem
Algorithm ? takes as input a matrix , a vector , and an error parameter . This algorithm also starts by preprocessing the matrix and right hand side vector with the Randomized Hadamard Transform. It then constructs a smaller problem by performing a “sparse projection” on the preprocessed problem. Our main quality-of-approximation theorem (Theorem ? below) will state that with constant probability (over the random choices made by the algorithm) the vector returned by this algorithm will satisfy the relative-error bounds of eqns. ( ?) and ( ?) and will be computed quickly.
In more detail, Algorithm ? begins by preprocessing the matrix and right hand side vector with the Randomized Hadamard Transform of Section 2.4. This algorithm explicitly computes only those rows of and those elements of that need to be accessed to perform the sparse projection. After this initial preprocessing, Algorithm ? will perform a “sparse projection” by multiplying and by the sparse matrix (described in more detail in Section 5.2). Then, we can consider the problem
which is just a least squares approximation problem involving the matrix and the vector . The minimum -norm vector among those that achieve the minimum value in this problem is
which is the output of Algorithm ?.
Remark: Assuming that we get that
Thus, the expected running time of Algorithm ? becomes
Finally, assuming , the above running time reduces to
It is worth noting that improvements over the standard time could be derived with weaker assumptions on and .
5.2Sparse projection matrices
In this subsection, we state a lemma about the action of a sparse random matrix operating on a vector. Recall that given any set of points in Euclidean space, the Johnson-Lindenstrauss lemma states that those points can be mapped via a linear function to dimensions such that the distances between all pairs of points are preserved to within a multiplicative factor of ; see  and references therein for details.
Formally, let be an error parameter, be a failure probability, and be a “uniformity” parameter. In addition, let be a “sparsity” parameter defining the expected number of nonzero elements per row, and let be the number of rows in our matrix. Then, define the random matrix as in Algorithm ?. Matoušek proved the following lemma, as the key step in his version of the Ailon-Chazelle result .
Remark: In order to achieve sufficient concentration for all vectors , the linear mapping defining the Johnson-Lindenstrauss transform is typically “dense,” in the sense that almost all the elements in each of the rows of the matrix defining the mapping are nonzero. In this case, implementing the mapping on vectors (in, e.g., a matrix ) via a matrix multiplication requires time. This is not faster than the time required to compute an exact solution to the problem of eqn. (Equation 1) if is at least . The Ailon-Chazelle result  states that the mapping can be “sparse,” in the sense that only a few of the elements in each of the rows need to be nonzero, provided that the vector is “well-spread,” in the sense that is close to . This is exactly what the preprocessing with the Randomized Hadamard Transform guarantees.
5.3Proof of Theorem
In this subsection, we provide a proof of Theorem ?. Recall that by the results of Section 3.1, in order to prove Theorem ?, we must show that the matrix constructed by Algorithm ? satisfies conditions (Equation 6) and ( ?) with probability at least . The next two subsections focus on proving that these conditions hold; the last subsection discusses the running time of Algorithm ?.
Satisfying condition ()
In order to prove that all the singular values of are close to one, we start with the following lemma which provides a means to bound the spectral norm of a matrix. This lemma is an instantiation of lemmas that appeared in .
We next establish Lemma ?, which states that all the singular values of are close to one with constant probability. The proof of this lemma depends on the bound provided by Lemma ? and it immediately shows that condition (Equation 6) is satisfied by Algorithm ?.
Define the symmetric matrix , recall that , and note that
holds for all . Consider the grid of eqn. ( ?) and note that there are no more than pairs , since by Lemma ?. Since , in order to show that , it suffices by Lemma ? to show that , for all . To do so, first, consider a single pair. Let
and note that
By multiplying out the right hand side of the above equation and rearranging terms, it follows that
In order to use Lemma ? to bound the quantities , and , we need a bound on the uniformity ratio . To do so, note that
The above inequalities follow by and Lemma ?. This holds for both our chosen points and and in fact for all . Let and let (these choices will be explained shortly). Then, it follows from Lemma ? that by setting and our choices for and , each of the following three statements holds with probability at least :
Thus, combining the above with eqn. (Equation 18), for this single pair of vectors ,
holds with probability at least . Next, recall that there are no more than pairs of vectors , and we need eqn. (Equation 19) to hold for all of them. Since we set then it follows by a union bound that eqn. (Equation 19) holds for all pairs of vectors with probability at least .95. Additionally, let us set , which implies that thus concluding the proof of the lemma.
Finally, we discuss the values of the parameters and . Since , , and , the appropriate values for and emerge after elementary manipulations from Lemma ?.
Satisfying condition ()
In order to prove that condition ( ?) is satisfied, we start with Lemma ?. In words, this lemma states that given vectors and we can use the random sparse projection matrix to approximate by , provided that (or , but not necessarily both) is bounded. The proof of this lemma is elementary but tedious and is deferred to Section 6.2 of the Appendix.
The following lemma proves that condition ( ?) is satisfied by Algorithm ?. The proof of this lemma depends on the bound provided by Lemma ?. Recall that and thus .
We first note that since , it follows that , for all . Thus, we have that
We now bound the expectation of the left hand side of eqn. (Equation 20) by using Lemma ? to bound each term on the right hand side of eqn. (Equation 20). Using eqn. (Equation 13) of Lemma ? we get that
holds for all . By our choice of the sparsity parameter the conditions of Lemma ? are satisfied. It follows from Lemma ? that