Bilateral Random Projections

# Bilateral Random Projections

###### Abstract

Low-rank structure have been profoundly studied in data mining and machine learning. In this paper, we show a dense matrix ’s low-rank approximation can be rapidly built from its left and right random projections and , or bilateral random projection (BRP). We then show power scheme can further improve the precision. The deterministic, average and deviation bounds of the proposed method and its power scheme modification are proved theoretically. The effectiveness and the efficiency of BRP based low-rank approximation is empirically verified on both artificial and real datasets.

## 1 Introduction

Recent researches about low-rank structure concentrate on developing fast approximation and building meaningful decompositions. Two appealing representatives are the randomized approximate matrix decomposition [3] and column selection [1]. The former proves that a matrix can be well approximated by its projection to the column space of its random projections. This rank-revealing method provides a fast approximation of SVD/PCA. The latter proves that a column subset of a low-rank matrix can span its whole range.

In this paper, we consider the problem of fast low-rank approximation. Given bilateral random projections (BRP) of an dense matrix (w.l.o.g, ), i.e., and , wherein and are random matrices,

 L=Y1(AT2Y1)−1YT2 (1)

is a fast rank- approximation of . The computation of includes an inverse of an matrix and three matrix multiplications. Thus, for a dense , floating-point operations (flops) are required to obtain BRP, flops are required to obtain . The computational cost is much less than SVD based approximation. The in (1) has been proposed in [2] as a recovery of a rank- matrix from and , where and are independent Gaussian/SRFT random matrices. However, we propose that is a tight rank- approximation of a full rank matrix , when and are correlated random matrices updated from and , respectively. We then apply power scheme [6] to for improving the approximation precision, especially when the eigenvalues of decay slowly.

Theoretically, we prove the deterministic bound, average bound and deviation bound of the approximation error in BRP based low-rank approximation and its power scheme modification. The results show the error of BRP based approximation is close to the error of SVD approximation under mild conditions. Comparing with randomized SVD in [3] that extracts the column space from unilateral random projections, the BRP based method estimates both column and row spaces from bilateral random projections.

We give an empirical study of BRP on both artificial data and face image dataset. The results show its effectiveness and efficiency in low-rank approximation and recovery.

## 2 Bilateral random projections (BRP) based low-rank approximation

We first introduce the bilateral random projections (BRP) based low-rank approximation and its power scheme modification. The approximation error bounds of these two methods are discussed at the end of this section.

### 2.1 Low-rank approximation with closed form

In order to improve the approximation precision of in (1) when and are standard Gaussian matrices, we use the obtained right random projection to build a better left projection matrix , and use to build a better . In particular, after , we update and calculate the left random projection , then we update and calculate the right random projection . A better low-rank approximation will be obtained if the new and are applied to (1). This improvement requires additional flops of in BRP calculation.

### 2.2 Power scheme modification

When singular values of decay slowly, (1) may perform poorly. We design a modification for this situation based on the power scheme [6]. In the power scheme modification, we instead calculate the BRP of a matrix , whose singular values decay faster than . In particular, . Both and share the same singular vectors. The BRP of is:

 Y1=~XA1,Y2=~XTA2. (2)

According to (1), the BRP based rank approximation of is:

 ~L=Y1(AT2Y1)−1YT2. (3)

In order to obtain the approximation of with rank , we calculate the QR decomposition of and , i.e.,

 Y1=Q1R1,Y2=Q2R2. (4)

The low-rank approximation of is then given by:

 L=(~L)12q+1=Q1[R1(AT2Y1)−1RT2]12q+1QT2. (5)

The power scheme modification (5) requires an inverse of an matrix, an SVD of an matrix and five matrix multiplications. Therefore, for dense , flops are required to obtain BRP, flops are required to obtain the QR decompositions, flops are required to obtain . The power scheme modification reduces the error of (1) by increasing . When the random matrices and are built from and , additional flops are required in the BRP calculation.

## 3 Approximation error bounds

We analyze the error bounds of the BRP based low-rank approximation (1) and its power scheme modification (5).

The SVD of an (w.l.o.g, ) matrix takes the form:

 X=UΛVT=U1Λ1VT1+U2Λ2VT2, (6)

where is an diagonal matrix which diagonal elements are the first largest singular values, and are the corresponding singular vectors, , and forms the rest part of SVD. Assume that is the target rank, and have columns for oversampling. We consider the spectral norm of the approximation error for (1):

 ∥X−L∥ =∥∥X−Y1(AT2Y1)−1YT2∥∥ =∥∥[I−XA1(AT2XA1)−1AT2]X∥∥. (7)

The unitary invariance of the spectral norm leads to

 ∥X−L∥=∥∥UT[I−XA1(AT2XA1)−1AT2]X∥∥ =∥∥Λ[I−VTA1(AT2XA1)−1AT2UΛ]∥∥. (8)

In low-rank approximation, the left random projection matrix is built from the left random projection , and then the right random projection matrix is built from the left random projection . Thus and . Hence the approximation error given in (3) has the following form:

 ∥∥Λ[I−Λ2VTA1(AT1VΛ4VTA1)−1AT1VΛ2]∥∥. (9)

The following Theorem 1 gives the bound for the spectral norm of the deterministic error .

###### Theorem 1.

(Deterministic error bound) Given an real matrix with singular value decomposition , and chosen a target rank and an () standard Gaussian matrix , the BRP based low-rank approximation (1) approximates with the error upper bounded by

 ∥X−L∥2≤∥∥Λ22(VT2A1)(VT1A1)†Λ−11∥∥2+∥Λ2∥2.

See Section 4 for the proof of Theorem 1.

If the singular values of decay fast, the first term in the deterministic error bound will be very small. The last term is the rank- SVD approximation error. Therefore, the BRP based low-rank approximation (1) is nearly optimal.

###### Theorem 2.

(Deterministic error bound, power scheme) Frame the hypotheses of Theorem 1, the power scheme modification (5) approximates with the error upper bounded by

 ∥X−L∥2≤ (∥∥Λ2(2q+1)2(VT2A1)(VT1A1)†Λ−(2q+1)1∥∥2 +∥∥Λ2q+12∥∥2)1/(2q+1).

See Section 4 for the proof of Theorem 2.

If the singular values of decay slowly, the error produced by the power scheme modification (5) is less than the BRP based low-rank approximation (1) and decreasing with the increasing of .

The average error bound of BRP based low-rank approximation is obtained by analyzing the statistical properties of the random matrices that appear in the deterministic error bound in Theorem 1.

###### Theorem 3.

(Average error bound) Frame the hypotheses of Theorem 1,

 E∥X−L∥≤ ⎛⎜⎝ ⎷1p−1r∑i=1λ2r+1λ2i+1⎞⎟⎠|λr+1| +e√r+pp ⎷n∑i=r+1λ2iλ2r.

See Section 4 for the proof of Theorem 3.

The average error bound will approach to the SVD approximation error if and .

The average error bound for the power scheme modification is then obtained from the result of Theorem 3.

###### Theorem 4.

(Average error bound, power scheme) Frame the hypotheses of Theorem 1, the power scheme modification (5) approximates with the expected error upper bounded by

 E∥X−L∥≤ ⎡⎢ ⎢⎣⎛⎜ ⎜⎝  ⎷1p−1r∑i=1λ2(2q+1)r+1λ2(2q+1)i+1⎞⎟ ⎟⎠|λ2q+1r+1| +e√r+pp ⎷n∑i=r+1λ2(2q+1)iλ2(2q+1)r⎤⎥⎦1/(2q+1).

See Section 4 for the proof of Theorem 4.

Compared the average error bounds of the BRP based low-rank approximation with its power scheme modification, the latter produces less error than the former, and the error can be further decreased by increasing .

The deviation bound for the spectral norm of the approximation error can be obtained by analyzing the deviation bound of in the deterministic error bound and by applying the concentration inequality for Lipschitz functions of a Gaussian matrix.

###### Theorem 5.

(Deviation bound) Frame the hypotheses of Theorem 1. Assume that . For all , it holds that

 ∥X−L∥≤ ⎛⎜⎝1+t√12rp(r∑i=1λ−1i)12+e√r+pp+1⋅ tuλ−1r)λ2r+1+e√r+pp+1⋅tλ−1r(n∑i=r+1λ2i)12.

except with probability .

See Section 4 for the proof of Theorem 5.

## 4 Proofs of error bounds

### 4.1 Proof of Theorem 1

The following lemma and propositions from [3] will be used in the proof.

###### Lemma 1.

Suppose that . For every , the matrix . In particular,

 M⪯N  ⇒  ATMA⪯ATNA. (10)
###### Proposition 1.

Suppose . Then, for each matrix , it holds that and that .

###### Proposition 2.

Suppose that . Then

 I−(I+M)−1⪯M. (11)
###### Proposition 3.

We have for each partitioned positive semidefinite matrix

 M=[ABBTC]. (12)

The proof of Theorem 1 is given below.

###### Proof.

Since an orthogonal projector projects a given matrix to the range (column space) of a matrix is defined as , the deterministic error (9) can be written as

 ∥E∥=∥Λ(I−PM)∥, M=Λ2VTA1. (13)

By applying Proposition 1 to the error (13), because , we have

 ∥E∥=∥Λ(I−PM)∥≤∥Λ(I−PN)∥, (14)

where

 (15)

Thus can be written as

For the top-left block in (4.1), Proposition 2 leads to . For the bottom-right block in (4.1), Lemma 1 leads to . Therefore,

 I−PN⪯⎡⎣HTH−(I+HTH)−1HT−H(I+HTH)−1I⎤⎦

By applying Lemma 1, we have

 Λ(I−PN)Λ⪯ ⎡⎣ΛT1HTHΛ1−ΛT1(I+HTH)−1HTΛ2−ΛT2H(I+HTH)−1Λ1ΛT2Λ2⎤⎦

According to Proposition 3, the spectral norm of is bounded by

 ∥Λ(I−PN)∥2=∥Λ(I−PN)Λ∥ ≤∥∥Λ22(VT2A1)(VT1A1)†Λ−11∥∥2+∥Λ2∥2. (16)

By substituting (4.1) into (14), we obtain the deterministic error bound. This completes the proof. ∎

### 4.2 Proof of Theorem 2

The following proposition from [3] will be used in the proof.

###### Proposition 4.

Let be an orthogonal projector, and let be a matrix. For each nonnegative ,

 ∥PA∥≤∥∥P(AAT)qA∥∥1/(2q+1). (17)

The proof of Theorem 2 is given below.

###### Proof.

The power scheme modification (5) applies the BRP based low-rank approximation (1) to rather than . In this case, the approximation error is

 ∥~X−~L∥=∥∥Λ2q+1(I−PM)∥∥, M=Λ2(2q+1)VTA1. (18)

According to Theorem 1, the error is upper bounded by

 ∥∥~X−~L∥∥2≤ ∥∥Λ2(2q+1)2(VT2A1)(VT1A1)†Λ−(2q+1)1∥∥2+∥∥Λ2q+12∥∥2. (19)

The deterministic error bound for the power scheme modification is obtained by applying Proposition 4 to (4.2). This completes the proof. ∎

### 4.3 Proof of Theorem 3

The following propositions from [3] will be used in the proof.

###### Proposition 5.

Fix matrices , , and draw a standard Gaussian matrix . Then it holds that

 E∥∥SGTT∥∥≤∥S∥∥T∥F+∥S∥F∥T∥. (20)
###### Proposition 6.

Draw an standard Gaussian matrix with . Then it holds that

 E∥G†∥2F=rp−1,E∥G†∥≤e√r+pp. (21)

The proof of Theorem 3 is given below.

###### Proof.

The distribution of a standard Gaussian matrix is rotational invariant. Since 1) is a standard Gaussian matrix and 2) is an orthogonal matrix, is a standard Gaussian matrix, and its disjoint submatrices and are standard Gaussian matrices as well.

Theorem 1 and the Hölder’s inequality imply that

 E∥X−L∥ ≤E(∥∥Λ22(VT2A1)(VT1A1)†Λ−11∥∥2+∥Λ2∥2)1/2 ≤E∥∥Λ22(VT2A1)(VT1A1)†Λ−11∥∥+∥Λ2∥. (22)

We condition on and apply Proposition 5 to bound the expectation w.r.t. , i.e.,

 E∥∥Λ22(VT2A1)(VT1A1)†Λ−11∥∥ ≤E(∥∥Λ22∥∥∥∥(VT1A1)†Λ−11∥∥F+∥∥Λ22∥∥F∥∥(VT1A1)†Λ−11∥∥) ≤∥∥Λ22∥∥(E∥∥(VT1A1)†Λ−11∥∥2F)1/2+ ∥∥Λ22∥∥F⋅E∥∥(VT1A1)†∥∥⋅∥∥Λ−11∥∥. (23)

The Frobenius norm of can be calculated as

 ∥∥(VT1A1)†Λ−11∥∥2F =trace[Λ−11((VT1A1)†)T(VT1A1)†Λ−11] =trace[((Λ1VT1A1)(Λ1VT1A1)T)−1].

Since 1) is a standard Gaussian matrix and 2) is a diagonal matrix, each column of follows -variate Gaussian distribution . Thus the random matrix follows the inverted Wishart distribution . According to the expectation of inverted Wishart distribution [4], we have

 =E trace[((Λ1VT1A1)(Λ1VT1A1)T)−1] =trace E[((Λ1VT1A1)(Λ1VT1A1)T)−1] =1p−1r∑i=1λ−2i. (24)

We apply Proposition 6 to the standard Gaussian matrix and obtain

 E∥∥(VT1A1)†∥∥≤e√r+pp. (25)

Therefore, (4.3) can be further derived as

 E ∥∥Λ22(VT2A1)(VT1A1)†Λ−11∥∥ ≤λ2r+1⋅ ⎷1p−1r∑i=1λ−2i+ ⎷n∑i=r+1λ2i⋅e√r+pp⋅|λ−1r| =|λr+1| ⎷1p−1r∑i=1λ2r+1λ2i+e√r+pp ⎷n∑i=r+1λ2iλ2r. (26)

By substituting (4.3) into (4.3), we obtain the average error bound

 E∥X−L∥≤ ⎛⎜⎝ ⎷1p−1r∑i=1λ2r+1λ2i+1⎞⎟⎠|λr+1|+ e√r+pp ⎷n∑i=r+1λ2iλ2r. (27)

This completes the proof. ∎

### 4.4 Proof of Theorem 4

The proof of Theorem 4 is given below.

###### Proof.

By using Hölder’s inequality and Theorem 2, we have

 E∥X−L∥ ≤(E∥X−L∥2q+1)1/(2q+1) ≤(E∥∥~X−~L∥∥)1/(2q+1). (28)

We apply Theorem 3 to and and obtain the bound of , noting that .

 E∥∥~X−~L∥∥= ⎛⎜ ⎜⎝  ⎷1p−1r∑i=1λ2(2q+1)r+1λ2(2q+1)i+1⎞⎟ ⎟⎠|λ2q+1r+1|+ e√r+pp ⎷n∑i=r+1λ2(2q+1)iλ2(2q+1)r. (29)

By substituting (4.4) into (4.4), we obtain the average error bound of the power scheme modification shown in Theorem 4. This completes the proof. ∎

### 4.5 Proof of Theorem 5

The following propositions from [3] will be used in the proof.

###### Proposition 7.

Suppose that is a Lipschitz function on matrices:

 |h(X)−h(Y)|≤L∥X−F∥F  for all X,Y. (30)

Draw a standard Gaussian matrix . Then

 Pr{h(G)≥Eh(G)+Lt}≤e−t2/2. (31)
###### Proposition 8.

Let be a standard Gaussian matrix where . For all ,

 Pr{∥∥G†∥∥F≥√12rp⋅t}≤4t−p  and Pr{∥∥G†∥∥≥e√r+pp+1⋅t}≤t−(p+1). (32)

The proof of Theorem 5 is given below.

###### Proof.

According to the deterministic error bound in Theorem 1, we study the deviation of . Consider the Lipschitz function , its Lipschitz constant can be estimated by using the triangle inequality:

 |h(X)−h(Y)|≤∥∥Λ22(X−Y)(VT1A1)†Λ−11∥∥ ≤∥∥Λ22∥∥∥X−Y∥∥∥(VT1A1)†∥∥∥∥Λ−11∥∥ ≤∥∥Λ22∥∥∥∥(VT1A1)†∥∥∥∥Λ−11∥∥∥X−Y∥F. (33)

Hence the Lipschitz constant satisfies . We condition on and then Proposition 5 implies that

 E[h(VT2A1)|VT1A1]≤ ∥∥Λ22∥∥∥∥(VT1A1)†∥∥F∥∥Λ−11∥∥F+ ∥∥Λ22∥∥F∥∥(VT1A1)†∥∥∥∥Λ−11∥∥.

We define an event as

 T={∥∥(VT1A1)†∥∥F≤√12rp⋅t  and ∥∥(VT1A1)†∥∥≤e√r+pp+1⋅t}. (34)

According to Proposition 8, the event happens except with probability

 Pr{¯¯¯¯T}≤4t−p+t−(p+1). (35)

Applying Proposition 7 to the function , given the event , we have

 Pr {∥∥Λ22(VT2A1)(VT1A1)†Λ−11∥∥> ∥∥Λ22∥∥∥∥(VT1A1)†∥∥F∥∥Λ−11∥∥F+ ∥∥Λ22∥∥F∥∥(VT1A1)†∥∥∥∥Λ−11∥∥+ ∥∥Λ22∥∥∥∥(VT1A1)†∥∥∥∥Λ−11∥∥⋅u∣T}≤e−u2/2. (36)

According to the definition of the event and the probability of , we obtain

 Pr {∥∥Λ22(VT2A1)(VT1A1)†Λ−11∥∥> ∥∥Λ22∥∥∥∥Λ−11∥∥F√12rp⋅t+∥∥Λ22∥∥F∥∥Λ−11∥∥e√r+pp+1⋅t +∥∥Λ22∥∥∥∥Λ−11∥∥e√r+pp+1⋅tu}≤ e−u2/2+4t−p+t−(p+1).

Therefore,

 Pr ⎛⎝1+t√12rp(r∑i=1λ−1i)1/2+e√r+pp+1⋅tuλ−1r⎞⎠λ2r+1+ e√r+pp+1⋅tλ−1r(n∑i=r+1λ2i)1/2⎫⎬⎭≤ e−u2/2+4t−p+t−(p+1). (37)

Since Theorem 1 implies , we obtain the deviation bound in Theorem 5. This completes the proof. ∎

## 5 Empirical Study

We first evaluate the efficiency of the BRP based low-rank approximation (1) for exact recovery of low-rank matrices. We consider square matrices of dimension from to with rank from to . Each matrix is generated by , wherein and are both standard Gaussian matrices. Figure 1 shows that the recovery time is linearly increased w.r.t . This is consistent with the flops required by (1). The relative error of each recovery is less than . It also shows that a matrix with rank can be exactly recovered within CPU seconds. This suggests the advantage of (1) for large-scale applications.

We then evaluate the effectiveness of (1) and its power scheme modification (5) in low-rank approximation of full rank matrix with slowly decaying singular values. We generate a square matrix with size , whose entries are independently sampled from a standard normal distribution with mean and variance , and then apply (1) () and (5) with to obtain approximations with rank varying from to . We show the relative errors in Figure 2 and the relative error of the corresponding SVD approximation as a baseline. The results suggest that our method can obtain a nearly optimal approximation when is sufficiently large (e.g., 2).

At last, we evaluate the efficiency and effectiveness of BRP on low-rank compression of human face images from dataset FERET [5]. We randomly selected face images of individuals from FERET and built a data matrix, wherein the features are the pixels of each image. We then obtain two rank- compressions of the data matrix by using SVD and the power modification of BRP based low-rank approximation (5) with , respectively. The compressed images and the corresponding time costs are shown in Figure 3 and its caption. It indicates that our method is able to produce compression with competitive quality in considerably less time than SVD.

## 6 Conclusion

In this paper, we consider the problem of fast low-rank approximation. A closed form solution for low-rank matrix approximation from bilateral random projections (BRP) is introduced. Given an dense matrix, the approximation can be calculated from random measurements in flops. Power scheme is applied for improving the approximation precision of matrices with slowly decaying singular values. We prove the BRP based low-rank approximation is nearly optimal. The experiments on both artificial and real datasets verifies the effectiveness and efficiency of BRP in both low-rank matrix recovery and approximation tasks.

## References

• [1] A. Deshpande and S. Vempala. Adaptive sampling and fast low-rank matrix approximation. In RANDOM ’06: The 10th International Workshop on Randomization and Computation, pages 292–303, 2006.
• [2] M. Fazel, E. J. Candès, B. Recht, and P. Parrilo. Compressed sensing and robust recovery of low rank matrices. In 42nd Asilomar Conference on Signals, Systems and Computers, 2008.
• [3] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. arXiv: 0909.4061, 2009.
• [4] R. J. Muirhead. Aspects of multivariate statistical theory. John Wiley & Sons Inc., New York, 1982.
• [5] P. J. Phillips, Hyeonjoon Moon, S. A. Rizvi, and P. J. Rauss. The feret evaluation methodology for face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(10):1090–1104, 2000.
• [6] S. Roweis. Em algorithms for pca and spca. In NIPS, pages 626–632, 1998.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters