# Recovering the Optimal Solution by Dual Random Projection

Lijun Zhang zhanglij@msu.edu
Mehrdad Mahdavi mahdavim@msu.edu
Rong Jin rongjin@cse.msu.edu
Department of Computer Science and Engineering
Michigan State University, East Lansing, MI 48824, USA
Tianbao Yang tyang@ge.com
GE Global Research, San Ramon, CA 94583, USA
Shenghuo Zhu zsh@nec-labs.com
NEC Laboratories America, Cupertino, CA 95014, USA
###### Abstract

Random projection has been widely used in data classification. It maps high-dimensional data into a low-dimensional subspace in order to reduce the computational cost in solving the related optimization problem. While previous studies are focused on analyzing the classification performance of using random projection, in this work, we consider the recovery problem, i.e., how to accurately recover the optimal solution to the original optimization problem in the high-dimensional space based on the solution learned from the subspace spanned by random projections. We present a simple algorithm, termed Dual Random Projection, that uses the dual solution of the low-dimensional optimization problem to recover the optimal solution to the original problem. Our theoretical analysis shows that with a high probability, the proposed algorithm is able to accurately recover the optimal solution to the original problem, provided that the data matrix is of low rank or can be well approximated by a low rank matrix.


Keywords: Random projection, Primal solution, Dual solution, Low rank

## 1 Introduction

Random projection is a simple yet powerful dimensionality reduction technique that projects the original high-dimensional data onto a low-dimensional subspace using a random matrix (Kaski, 1998; Bingham and Mannila, 2001). It has been successfully applied to many machine learning tasks, including classification (Fradkin and Madigan, 2003; Vempala, 2004; Rahimi and Recht, 2008), regression (Maillard and Munos, 2012), clustering (Fern and Brodley, 2003; Boutsidis et al., 2010), manifold learning (Dasgupta and Freund, 2008; Freund et al., 2008), and information retrieval (Goel et al., 2005).

In this work, we focus on random projection for classification. While previous studies were devoted to analyzing the classification performance using random projection (Arriaga and Vempala, 1999; Balcan et al., 2006; Paul et al., 2012; Shi et al., 2012), we examine the effect of random projection from a very different aspect. In particular, we are interested in accurately recovering the optimal solution to the original high-dimensional optimization problem using random projection. This is particularly useful for feature selection (Guyon and Elisseeff, 2003), where important features are often selected based on their weights in the linear prediction model learned from the training data. In order to ensure that similar features are selected, the prediction model based on random projection needs to be close to the model obtained by solving the original optimization problem directly.

The proposed algorithm for recovering the optimal solution consists of two simple steps. In the first step, similar to previous studies, we apply random projection to reduce the dimensionality of the data, and then solve a low-dimensional optimization problem. In the second step, we construct the dual solution of the low-dimensional problem from its primal solution, and then use it to recover the optimal solution to the original high-dimensional problem. Our analysis reveals that with a high probability, we are able to recover the optimal solution with a small error by using $O(r\log r)$ random projections, where $r$ is the rank of the data matrix. A similar result also holds when the data matrix can be well approximated by a low rank matrix. We further show that the proposed algorithm can be applied iteratively to recover the optimal solution with a relative error $\mu$ by using $O(\log(1/\mu))$ iterations.

The rest of the paper is arranged as follows. Section 2 describes the problem of recovering the optimal solution by random projection, which is the theme of this work. Section 3 describes the dual random projection approach for recovering the optimal solution. Section 4 presents the main theoretical results for the proposed algorithm. Section 5 presents the proofs for the theorems stated in Section 4. Section 6 concludes this work.

## 2 The Problem

Let $\{(\mathbf{x}_i,y_i)\}_{i=1}^{n}$ be a set of training examples, where $\mathbf{x}_i\in\mathbb{R}^d$ is a vector of dimension $d$ and $y_i\in\{\pm1\}$ is the binary class assignment for $\mathbf{x}_i$. Let $X=[\mathbf{x}_1,\ldots,\mathbf{x}_n]\in\mathbb{R}^{d\times n}$ and $\mathbf{y}=[y_1,\ldots,y_n]^{\top}$ include the input patterns and the class assignments of all training examples. A classifier $\mathbf{w}\in\mathbb{R}^d$ is learned by solving the following optimization problem:

$$\min_{\mathbf{w}\in\mathbb{R}^d}\;\frac{\lambda}{2}\|\mathbf{w}\|^2+\sum_{i=1}^{n}\ell(y_i\mathbf{x}_i^{\top}\mathbf{w}), \tag{1}$$

where $\ell(z)$ is a convex loss function that is differentiable (for non-differentiable loss functions such as the hinge loss, one could apply the smoothing technique described in (Nesterov, 2005) to make them differentiable). By writing $\ell(z)$ in its convex conjugate form, i.e.,

$$\ell(z)=\max_{\alpha\in\Omega}\;\alpha z-\ell_*(\alpha),$$

where $\ell_*(\alpha)$ is the convex conjugate of $\ell(z)$ and $\Omega$ is the domain of the dual variable, we get the dual optimization problem:

$$\max_{\boldsymbol{\alpha}\in\Omega^n}\;-\sum_{i=1}^{n}\ell_*(\alpha_i)-\frac{1}{2\lambda}\boldsymbol{\alpha}^{\top}G\boldsymbol{\alpha}, \tag{2}$$

where $\boldsymbol{\alpha}=[\alpha_1,\ldots,\alpha_n]^{\top}$, $D(\mathbf{y})=\operatorname{diag}(y_1,\ldots,y_n)$, and $G\in\mathbb{R}^{n\times n}$ is the Gram matrix given by

$$G=D(\mathbf{y})X^{\top}XD(\mathbf{y}). \tag{3}$$

In the following, we denote by $\mathbf{w}_*$ the optimal primal solution to (1), and by $\boldsymbol{\alpha}_*$ the optimal dual solution to (2). The following proposition connects $\mathbf{w}_*$ and $\boldsymbol{\alpha}_*$.

###### Proposition 1

Let $\mathbf{w}_*$ be the optimal primal solution to (1), and $\boldsymbol{\alpha}_*$ be the optimal dual solution to (2). We have

$$\mathbf{w}_*=-\frac{1}{\lambda}XD(\mathbf{y})\boldsymbol{\alpha}_*,\quad\text{and}\quad[\boldsymbol{\alpha}_*]_i=\nabla\ell(y_i\mathbf{x}_i^{\top}\mathbf{w}_*),\ i=1,\ldots,n. \tag{4}$$

The proof of Proposition 1 and other omitted proofs are deferred to the Appendix. When the dimensionality $d$ is high and the number of training examples $n$ is large, solving either the primal problem in (1) or the dual problem in (2) can be computationally expensive. To reduce the computational cost, one common approach is to significantly reduce the dimensionality by random projection. Let $R\in\mathbb{R}^{d\times m}$ be a Gaussian random matrix, where each entry $R_{ij}$ is independently drawn from a Gaussian distribution $\mathcal{N}(0,1)$ and $m$ is significantly smaller than $d$. Using the random matrix $R$, we generate a low-dimensional representation $\hat{\mathbf{x}}_i\in\mathbb{R}^m$ for each input example by

$$\hat{\mathbf{x}}_i=\frac{1}{\sqrt{m}}R^{\top}\mathbf{x}_i, \tag{5}$$

and solve the following low-dimensional optimization problem:

$$\min_{\mathbf{z}\in\mathbb{R}^m}\;\frac{\lambda}{2}\|\mathbf{z}\|^2+\sum_{i=1}^{n}\ell(y_i\mathbf{z}^{\top}\hat{\mathbf{x}}_i). \tag{6}$$

The corresponding dual problem is written as

$$\max_{\boldsymbol{\alpha}\in\Omega^n}\;-\sum_{i=1}^{n}\ell_*(\alpha_i)-\frac{1}{2\lambda}\boldsymbol{\alpha}^{\top}\hat{G}\boldsymbol{\alpha}, \tag{7}$$

where

$$\hat{G}=D(\mathbf{y})X^{\top}\frac{RR^{\top}}{m}XD(\mathbf{y}). \tag{8}$$

Intuitively, the choice of a Gaussian random matrix is justified by the fact that the expected dot-product between any two examples in the projected space is equal to their dot-product in the original space, i.e.,

$$\mathrm{E}\big[\hat{\mathbf{x}}_i^{\top}\hat{\mathbf{x}}_j\big]=\mathbf{x}_i^{\top}\mathrm{E}\Big[\frac{1}{m}RR^{\top}\Big]\mathbf{x}_j=\mathbf{x}_i^{\top}\mathbf{x}_j,$$

where the last equality follows from $\mathrm{E}[RR^{\top}/m]=I$. Thus, $\hat{G}=G$ holds in expectation.
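The unbiasedness above is easy to check numerically. A minimal sketch (synthetic vectors; all names and sizes are illustrative), averaging the projected dot-product over independent draws of $R$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, trials = 500, 200, 200
x1 = rng.standard_normal(d)
x2 = rng.standard_normal(d)

vals = []
for _ in range(trials):
    R = rng.standard_normal((d, m))   # entries drawn independently from N(0, 1)
    xh1 = R.T @ x1 / np.sqrt(m)       # low-dimensional representation, eq. (5)
    xh2 = R.T @ x2 / np.sqrt(m)
    vals.append(xh1 @ xh2)

exact = x1 @ x2
err = abs(np.mean(vals) - exact)      # shrinks as the number of draws grows
print(exact, np.mean(vals))
```

A single draw of $R$ gives an unbiased but noisy estimate; the variance of $\hat{\mathbf{x}}_1^{\top}\hat{\mathbf{x}}_2$ decays as $1/m$.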

Let $\mathbf{z}_*$ denote the optimal primal solution to the low-dimensional problem (6), and $\hat{\boldsymbol{\alpha}}_*$ denote the optimal dual solution to (7). Similar to Proposition 1, the following proposition connects $\mathbf{z}_*$ and $\hat{\boldsymbol{\alpha}}_*$.

###### Proposition 2

We have

$$\mathbf{z}_*=-\frac{1}{\lambda}\frac{1}{\sqrt{m}}R^{\top}XD(\mathbf{y})\hat{\boldsymbol{\alpha}}_*,\quad\text{and}\quad[\hat{\boldsymbol{\alpha}}_*]_i=\nabla\ell\Big(\frac{y_i}{\sqrt{m}}\mathbf{x}_i^{\top}R\mathbf{z}_*\Big),\ i=1,\ldots,n. \tag{9}$$

Given the optimal solution $\mathbf{z}_*$, a data point $\mathbf{x}$ is classified by $\mathbf{z}_*^{\top}\hat{\mathbf{x}}=\frac{1}{\sqrt{m}}\mathbf{z}_*^{\top}R^{\top}\mathbf{x}$, which is equivalent to defining a new solution in the original space, given below, which we refer to as the naive solution:

$$\hat{\mathbf{w}}=\frac{1}{\sqrt{m}}R\mathbf{z}_*. \tag{10}$$

The classification performance of $\hat{\mathbf{w}}$ has been examined in many studies (Arriaga and Vempala, 1999; Balcan et al., 2006; Paul et al., 2012; Shi et al., 2012). The general conclusion is that when the original data is linearly separable with a large margin, the classification error for $\hat{\mathbf{w}}$ is usually small.

Although these studies show that $\hat{\mathbf{w}}$ can achieve a small classification error under appropriate assumptions, it is unclear whether $\hat{\mathbf{w}}$ is a good approximation of the optimal solution $\mathbf{w}_*$. In fact, as we will see in Section 4, the naive solution $\hat{\mathbf{w}}$ is almost guaranteed to be a BAD approximation of the optimal solution, that is, $\|\hat{\mathbf{w}}-\mathbf{w}_*\|_2$ is comparable to, or even larger than, $\|\mathbf{w}_*\|_2$. This observation leads to an interesting question: is it possible to accurately recover the optimal solution $\mathbf{w}_*$ based on $\mathbf{z}_*$, the optimal solution to the low-dimensional optimization problem?

#### Relationship to Compressive Sensing

The proposed problem is closely related to compressive sensing (Donoho, 2006; Candès and Wakin, 2008), where the goal is to recover a high-dimensional but sparse vector using a small number of random measurements. The key difference between our work and compressive sensing is that we do not have direct access to the random measurements of the target vector (which in our case is $R^{\top}\mathbf{w}_*$). Instead, $\mathbf{z}_*$ is the optimal solution to (6), the primal problem using random projection. However, the following theorem shows that $\sqrt{m}\,\mathbf{z}_*$ is a good approximation of $R^{\top}\mathbf{w}_*$, which includes $m$ random measurements of $\mathbf{w}_*$, if the data matrix is of low rank and the number of random measurements is sufficiently large.

###### Theorem 1

For any $0<\varepsilon\le 1/2$ and $0<\delta<1$, with a probability at least $1-\delta$, we have

$$\big\|\sqrt{m}\,\mathbf{z}_*-R^{\top}\mathbf{w}_*\big\|_2\le\frac{\sqrt{2\varepsilon}}{\sqrt{1-\varepsilon}}\big\|R^{\top}\mathbf{w}_*\big\|_2,$$

provided

$$m\ge\frac{(r+1)\log(2r/\delta)}{c\,\varepsilon^2},$$

where $c>0$ is a universal constant and $r$ is the rank of $X$.

Given the approximation bound in Theorem 1, it is appealing to reconstruct $\mathbf{w}_*$ using a compressive sensing algorithm, provided that $\mathbf{w}_*$ is sparse with respect to certain bases. We note that the low rank assumption for the data matrix $X$ implies that $\mathbf{w}_*$ is sparse with respect to the singular vectors of $X$. However, since $\sqrt{m}\,\mathbf{z}_*$ only provides an approximation to the random measurements of $\mathbf{w}_*$, running a compressive sensing algorithm will not be able to perfectly recover $\mathbf{w}_*$ from $\mathbf{z}_*$. In Section 3, we present an algorithm that recovers $\mathbf{w}_*$ with a small error, provided that the data matrix is of low rank. Compared to the compressive sensing approach, the main advantage of the proposed algorithm is its computational simplicity; it neither computes the singular vectors of $X$ nor solves an optimization problem that minimizes the $\ell_1$ norm.

## 3 Algorithm

To motivate our algorithm, let us revisit the optimal primal solution to (1), which is given in Proposition 1, i.e.,

$$\mathbf{w}_*=-\frac{1}{\lambda}XD(\mathbf{y})\boldsymbol{\alpha}_*, \tag{11}$$

where $\boldsymbol{\alpha}_*$ is the optimal solution to the dual problem (2). Given the projected data $\{\hat{\mathbf{x}}_i\}_{i=1}^{n}$, we have reached an approximate dual problem in (7). Comparing it with the dual problem in (2), the only difference is that the Gram matrix $G$ in (2) is replaced with $\hat{G}$ in (7). Recall that $\mathrm{E}[\hat{G}]=G$. Thus, when the number of random projections $m$ is sufficiently large, $\hat{G}$ will be close to $G$ and we would also expect $\hat{\boldsymbol{\alpha}}_*$ to be close to $\boldsymbol{\alpha}_*$. As a result, we can use $\hat{\boldsymbol{\alpha}}_*$ to approximate $\boldsymbol{\alpha}_*$ in (11), which yields the recovered prediction model given below:

$$\tilde{\mathbf{w}}=-\frac{1}{\lambda}XD(\mathbf{y})\hat{\boldsymbol{\alpha}}_*=-\sum_{i=1}^{n}\frac{1}{\lambda}y_i[\hat{\boldsymbol{\alpha}}_*]_i\mathbf{x}_i. \tag{12}$$

Note that the key difference between the recovered solution $\tilde{\mathbf{w}}$ and the naive solution $\hat{\mathbf{w}}$ is that $\hat{\mathbf{w}}$ is computed by mapping the optimal primal solution $\mathbf{z}_*$ in the projected space back to the original space via the random matrix $R$, while $\tilde{\mathbf{w}}$ is computed directly in the original space using the approximate dual solution $\hat{\boldsymbol{\alpha}}_*$. Therefore, the naive solution $\hat{\mathbf{w}}$ lies in the subspace spanned by the column vectors of the random matrix $R$ (denoted by $\mathcal{R}$), while the recovered solution $\tilde{\mathbf{w}}$ lies in the subspace that also contains the optimal solution $\mathbf{w}_*$, i.e., the subspace spanned by the columns of $X$ (denoted by $\mathcal{X}$). The mismatch between the spaces $\mathcal{R}$ and $\mathcal{X}$ leads to the large approximation error for $\hat{\mathbf{w}}$.

According to Proposition 2, we can construct the dual solution $\hat{\boldsymbol{\alpha}}_*$ from the primal solution $\mathbf{z}_*$. Thus, we do not need to solve the dual problem in (7) to obtain $\hat{\boldsymbol{\alpha}}_*$. Instead, we solve the low-dimensional optimization problem in (6) to get $\mathbf{z}_*$ and construct $\hat{\boldsymbol{\alpha}}_*$ from it. Algorithm 1 shows the details of the proposed method. We note that although dual variables have been widely used in the analysis of convex optimization (Boyd and Vandenberghe, 2004; Hazan et al., 2011) and online learning (Shalev-Shwartz and Singer, 2006), to the best of our knowledge, this is the first time that dual variables have been used in conjunction with random projection for recovering the optimal solution.
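The two steps can be sketched in code. The following is a minimal illustration of the recovery procedure for the logistic loss $\ell(z)=\log(1+e^{-z})$, a choice made here only for concreteness (the analysis covers general differentiable convex losses); all names and sizes are illustrative:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def dual_random_projection(X, y, lam, m, rng):
    """One pass of dual random projection recovery (a sketch of the
    two-step procedure for the logistic loss)."""
    d, n = X.shape                      # columns of X are the examples x_i
    R = rng.standard_normal((d, m))     # Gaussian random matrix
    Xhat = R.T @ X / np.sqrt(m)         # projected examples, eq. (5)

    def obj_grad(z):                    # low-dimensional primal, eq. (6)
        margins = y * (Xhat.T @ z)
        p = expit(-margins)             # = 1 / (1 + e^{margins})
        val = lam / 2 * z @ z + np.sum(np.logaddexp(0.0, -margins))
        return val, lam * z - Xhat @ (y * p)

    z_star = minimize(obj_grad, np.zeros(m), jac=True, method="L-BFGS-B").x
    # Dual construction (Proposition 2): [alpha]_i = grad ell(y_i xhat_i^T z*)
    alpha = -expit(-y * (Xhat.T @ z_star))
    # Recovery step, eq. (12): w_tilde = -(1/lambda) X D(y) alpha
    return -(X @ (y * alpha)) / lam
```

Here $\nabla\ell(z)=-1/(1+e^{z})$, so the dual variables constructed from $\mathbf{z}_*$ automatically lie in the logistic dual domain $\Omega=[-1,0]$.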

To further reduce the recovery error, we develop an iterative method shown in Algorithm 2. The idea comes from the fact that if $\|\tilde{\mathbf{w}}-\mathbf{w}_*\|_2\le\varepsilon\|\mathbf{w}_*\|_2$ with a small $\varepsilon<1$, we can apply the same dual random projection algorithm again to recover $\mathbf{w}_*-\tilde{\mathbf{w}}$, which will result in a recovery error of $\varepsilon^2\|\mathbf{w}_*\|_2$. If we repeat the above process for $T$ iterations, we should be able to obtain a solution with a recovery error of $\varepsilon^T\|\mathbf{w}_*\|_2$. This simple intuition leads to the iterative method shown in Algorithm 2. At the $t$-th iteration, given the recovered solution $\tilde{\mathbf{w}}_{t-1}$ obtained from the previous iteration, we solve the optimization problem in (13) that is designed to recover $\mathbf{w}_*-\tilde{\mathbf{w}}_{t-1}$. The detailed derivation of Algorithm 2 is provided in Section 5.2.

It is important to note that although Algorithm 2 consists of multiple iterations, the random projection of the data matrix is computed only once, before the start of the iterations. This feature makes the iterative algorithm computationally attractive, as calculating random projections of a large data matrix is computationally expensive and has been the subject of many studies, e.g., (Achlioptas, 2003; Liberty et al., 2008; Braverman et al., 2010). However, it is worth noting that at each iteration of Algorithm 2, we need to compute the dot-products $\tilde{\mathbf{w}}_{t-1}^{\top}\mathbf{x}_i$ for all training data in the original space. We also note that Algorithm 2 is related to the epoch gradient descent algorithm (Hazan and Kale, 2011) for stochastic optimization, in the sense that the solution obtained from the previous iteration serves as the starting point for the optimization problem at the current iteration. Unlike the algorithm in (Hazan and Kale, 2011), we do not shrink the domain size over the iterations in Algorithm 2.

#### Application to the Square Loss

In the following, we take the square loss as an example to illustrate the recovery procedure. The original optimization problem, which is referred to as ridge regression (Hastie et al., 2009), is

$$\min_{\mathbf{w}\in\mathbb{R}^d}\;\frac{\lambda}{2}\|\mathbf{w}\|^2+\frac{1}{2}\sum_{i=1}^{n}\big(1-y_i\mathbf{x}_i^{\top}\mathbf{w}\big)^2\overset{y_i\in\{\pm1\}}{=}\frac{\lambda}{2}\|\mathbf{w}\|^2+\frac{1}{2}\sum_{i=1}^{n}\big(y_i-\mathbf{x}_i^{\top}\mathbf{w}\big)^2.$$

Setting the derivative to zero, we obtain the optimal solution

$$\mathbf{w}_*=(\lambda I+XX^{\top})^{-1}X\mathbf{y}=X(\lambda I+X^{\top}X)^{-1}\mathbf{y}, \tag{14}$$

where the last equality follows from the Woodbury matrix identity (Golub and Van Loan, 1996). Thus, the computational cost is either $O(d^2n+d^3)$ (first form) or $O(n^2d+n^3)$ (second form).

Following the dual random projection algorithm, we first solve the low-dimensional problem in (6), whose solution is $\mathbf{z}_*=(\lambda I+\hat{X}\hat{X}^{\top})^{-1}\hat{X}\mathbf{y}$, where $\hat{X}=\frac{1}{\sqrt{m}}R^{\top}X$. Then, we construct the dual solution by $[\hat{\boldsymbol{\alpha}}_*]_i=y_i\hat{\mathbf{x}}_i^{\top}\mathbf{z}_*-1$. Finally, we recover the optimal solution by $\tilde{\mathbf{w}}=-\frac{1}{\lambda}XD(\mathbf{y})\hat{\boldsymbol{\alpha}}_*$. It is straightforward to check that the computational cost of our algorithm is $O(m^2n+m^3)$ for solving the low-dimensional problem plus $O(nd)$ for the recovery step (in addition to the one-time cost of computing the projection $\hat{X}$), which is significantly smaller than that of (14) when both $d$ and $n$ are large.

After some algebraic manipulation, we can show that

$$\tilde{\mathbf{w}}=X\Big(\lambda I+X^{\top}\frac{RR^{\top}}{m}X\Big)^{-1}\mathbf{y}. \tag{15}$$

Comparing (14) with (15), we can see that the difference between $\mathbf{w}_*$ and $\tilde{\mathbf{w}}$ comes from the Gram matrix: when $m$ is large enough, $X^{\top}\frac{RR^{\top}}{m}X$ is close to $X^{\top}X$, and as a result $\tilde{\mathbf{w}}$ is also close to $\mathbf{w}_*$.
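The closed forms (14) and (15) admit a quick numerical comparison on synthetic low-rank data. A minimal sketch (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, r, m, lam = 500, 100, 3, 300, 1.0
X = rng.standard_normal((d, r)) @ rng.standard_normal((r, n))  # rank-r data
y = np.sign(rng.standard_normal(n))

# Exact ridge solution, eq. (14), via the dual (n x n) form.
w_star = X @ np.linalg.solve(lam * np.eye(n) + X.T @ X, y)

# Recovered solution, eq. (15): only the Gram matrix uses the projection.
R = rng.standard_normal((d, m))
G_proj = (X.T @ R) @ (R.T @ X) / m
w_tilde = X @ np.linalg.solve(lam * np.eye(n) + G_proj, y)

rel_err = np.linalg.norm(w_tilde - w_star) / np.linalg.norm(w_star)
print(rel_err)   # small when rank(X) is much smaller than m
```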

## 4 Main Results

In this section, we bound the recovery error of dual random projection. We first assume that $X$ is of low rank, and then extend the results to the full rank case.

### 4.1 Low Rank

The low rank assumption is closely related to the sparsity assumption made in compressive sensing. This is because $\mathbf{w}_*$ lies in the subspace spanned by the column vectors of $X$, and the low rank assumption directly implies that $\mathbf{w}_*$ is sparse with respect to the singular vectors of $X$.

We denote by $r$ the rank of the matrix $X$. The following theorem shows that the recovery error of Algorithm 1 is small provided that (1) $X$ is of low rank (i.e., $r\ll\min(d,n)$), and (2) the number of random projections is sufficiently large.

###### Theorem 2

Let $\mathbf{w}_*$ be the optimal solution to (1) and let $\tilde{\mathbf{w}}$ be the solution recovered by Algorithm 1. For any $0<\varepsilon\le1/2$ and $0<\delta<1$, with a probability at least $1-\delta$, we have

$$\|\tilde{\mathbf{w}}-\mathbf{w}_*\|_2\le\frac{\varepsilon}{1-\varepsilon}\|\mathbf{w}_*\|_2,$$

provided

$$m\ge\frac{(r+1)\log(2r/\delta)}{c\,\varepsilon^2},$$

where $c>0$ is a universal constant.

According to Theorem 2, the number of required random projections is $O(r\log r)$. This is similar to a compressive sensing result if we view the rank $r$ as the sparsity measure used in compressive sensing. Following the same arguments as compressive sensing, it may be possible to argue that $\Omega(r\log r)$ projections are necessary due to the result of the coupon collector's problem (Motwani and Raghavan, 1995), although the rigorous analysis remains to be developed.

As a comparison, the following theorem shows that with a high probability, the naive solution $\hat{\mathbf{w}}$ given in (10) does not accurately recover the true optimal solution $\mathbf{w}_*$.

###### Theorem 3

For any $0<\varepsilon\le1/2$ and $0<\delta<1$, with a probability at least $1-\delta$, we have

$$\|\hat{\mathbf{w}}-\mathbf{w}_*\|_2\ge\frac{1}{2}\sqrt{\frac{d-r}{m}}\left(1-\frac{\varepsilon\sqrt{2(1+\varepsilon)}}{1-\varepsilon}\right)\|\mathbf{w}_*\|_2,$$

provided the condition on $m$ in Theorem 2 holds.

As indicated by Theorem 3, when $m$ is sufficiently larger than $r$ but significantly smaller than $d$, we have $\|\hat{\mathbf{w}}-\mathbf{w}_*\|_2\gg\|\mathbf{w}_*\|_2$, indicating that $\hat{\mathbf{w}}$ does not approximate $\mathbf{w}_*$ well.
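The contrast between Theorems 2 and 3 is easy to observe for the square loss, where both the naive and the recovered solutions have closed forms. A sketch on synthetic low-rank data (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, r, m, lam = 400, 100, 3, 150, 1.0
X = rng.standard_normal((d, r)) @ rng.standard_normal((r, n))  # rank-r data
y = np.sign(rng.standard_normal(n))
R = rng.standard_normal((d, m))
Xhat = R.T @ X / np.sqrt(m)                                    # projected data

w_star = X @ np.linalg.solve(lam * np.eye(n) + X.T @ X, y)           # eq. (14)
z_star = np.linalg.solve(lam * np.eye(m) + Xhat @ Xhat.T, Xhat @ y)  # low-dim ridge
w_naive = R @ z_star / np.sqrt(m)                                    # eq. (10)
w_rec = X @ np.linalg.solve(lam * np.eye(n) + Xhat.T @ Xhat, y)      # eq. (15)

err_naive = np.linalg.norm(w_naive - w_star) / np.linalg.norm(w_star)
err_rec = np.linalg.norm(w_rec - w_star) / np.linalg.norm(w_star)
print(err_naive, err_rec)   # the naive error stays large, the recovered error is small
```

The naive solution cannot do better than the projection of $\mathbf{w}_*$ onto the random $m$-dimensional subspace $\operatorname{span}(R)$, which explains its large error when $m\ll d$.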

It is important to note that Theorem 3 does not contradict the previous results showing that the random projection based method can result in a small classification error if the data is linearly separable with a large margin. This is because, to decide whether $\hat{\mathbf{w}}$ carries a classification performance similar to that of $\mathbf{w}_*$, we need to measure the following term:

$$\max_{\mathbf{x}\in\operatorname{span}(X),\,\|\mathbf{x}\|_2\le1}\mathbf{x}^{\top}(\hat{\mathbf{w}}-\mathbf{w}_*). \tag{16}$$

Since $\|\hat{\mathbf{w}}-\mathbf{w}_*\|_2$ can also be written as

$$\|\hat{\mathbf{w}}-\mathbf{w}_*\|_2=\max_{\|\mathbf{x}\|_2\le1}\mathbf{x}^{\top}(\hat{\mathbf{w}}-\mathbf{w}_*),$$

the quantity defined in (16) could be significantly smaller than $\|\hat{\mathbf{w}}-\mathbf{w}_*\|_2$ if the data matrix $X$ is of low rank. The following theorem quantifies this statement.

###### Theorem 4

For any $0<\varepsilon\le1/2$ and $0<\delta<1$, with a probability at least $1-\delta$, we have

$$\max_{\mathbf{x}\in\operatorname{span}(X),\,\|\mathbf{x}\|_2\le1}\mathbf{x}^{\top}(\hat{\mathbf{w}}-\mathbf{w}_*)\le\varepsilon\left(1+\frac{1}{1-\varepsilon}\right)\|\mathbf{w}_*\|_2,$$

provided the condition on $m$ in Theorem 2 holds.

We note that Theorem 4 directly implies the result on the margin classification error for random projection (Blum, 2006; Shi et al., 2012). This is because when a data point $\mathbf{x}\in\operatorname{span}(X)$ with $\|\mathbf{x}\|_2\le1$ can be separated by $\mathbf{w}_*$ with a margin $\gamma$, i.e., $y\,\mathbf{x}^{\top}\mathbf{w}_*\ge\gamma$, it will be classified by $\hat{\mathbf{w}}$ with a margin of at least $\gamma-\varepsilon\big(1+\frac{1}{1-\varepsilon}\big)\|\mathbf{w}_*\|_2$, provided $\varepsilon$ is sufficiently small.

Based on Theorem 2, we now state the recovery result for the iterative method.

###### Theorem 5

Let $\mathbf{w}_*$ be the optimal solution to (1) and let $\tilde{\mathbf{w}}_T$ be the solution recovered by Algorithm 2 after $T$ iterations. For any $0<\varepsilon\le1/2$ and $0<\delta<1$, with a probability at least $1-\delta$, we have

$$\|\tilde{\mathbf{w}}_T-\mathbf{w}_*\|_2\le\left(\frac{\varepsilon}{1-\varepsilon}\right)^{T}\|\mathbf{w}_*\|_2,$$

provided the condition on $m$ in Theorem 2 holds.

Notice that the number of random projections $m$ does not depend on the number of iterations $T$. That is because we apply random projection only once to reduce the dimensionality of the data. Theorem 5 implies that we can recover the optimal solution with a relative error $\mu$, i.e., $\|\tilde{\mathbf{w}}_T-\mathbf{w}_*\|_2\le\mu\|\mathbf{w}_*\|_2$, by using $T=O(\log(1/\mu))$ iterations.
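For the square loss, the iteration can be written entirely in terms of dual coefficients $\mathbf{c}_t$ with $\tilde{\mathbf{w}}_t=X\mathbf{c}_t$: each step recovers the increment from the current residual using the projected Gram matrix, which is formed only once. This linear-algebra form is our own specialization to ridge regression (a sketch, not code from the paper; all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, r, m, lam, T = 400, 100, 3, 150, 1.0, 5
X = rng.standard_normal((d, r)) @ rng.standard_normal((r, n))  # rank-r data
y = np.sign(rng.standard_normal(n))

G = X.T @ X                                        # exact Gram matrix
R = rng.standard_normal((d, m))                    # projected only once
K = lam * np.eye(n) + (X.T @ R) @ (R.T @ X) / m    # lam*I + projected Gram

w_star = X @ np.linalg.solve(lam * np.eye(n) + G, y)

c = np.zeros(n)                                    # w_tilde_t = X @ c
errs = []
for t in range(T):
    resid = y - lam * c - G @ c                    # residual of the exact system
    c = c + np.linalg.solve(K, resid)              # recover the increment via K
    errs.append(np.linalg.norm(X @ c - w_star) / np.linalg.norm(w_star))
print(errs)   # shrinks roughly geometrically over the iterations
```

The fixed point of this iteration satisfies $(\lambda I+G)\mathbf{c}=\mathbf{y}$, i.e., $X\mathbf{c}=\mathbf{w}_*$, while the cost per iteration involves only the $n\times n$ factor $K$ computed from the single projection.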

### 4.2 Full Rank

If $X$ is of full rank, we establish the following theorem to bound the recovery error.

###### Theorem 6

Assume $\mathbf{w}_*$ lies in the subspace spanned by the first $k$ left singular vectors of $X$, and the loss $\ell(\cdot)$ is $\gamma$-smooth. For any $0<\varepsilon\le1/2$ and $0<\delta<1$, with a probability at least $1-\delta$, we have

$$\|\tilde{\mathbf{w}}-\mathbf{w}_*\|_2\le\frac{\varepsilon}{1-\varepsilon}\left(1+\frac{\sqrt{\lambda}}{\sqrt{\gamma}\,\sigma_k}\right)\|\mathbf{w}_*\|_2,$$

provided

$$m\ge\frac{\bar{r}\,\sigma_1^2}{c\,\varepsilon^2\,(\lambda/\gamma+\sigma_1^2)}\log\frac{2d}{\delta},$$

where $\sigma_i$ is the $i$-th singular value of $X$, $\bar{r}=\sum_{i}\frac{\sigma_i^2}{\lambda/\gamma+\sigma_i^2}$ (the sum is over all singular values of $X$), and $c>0$ is a universal constant.

The above theorem implies that the number of required random projections is $O(\bar{r}\log d)$, which can be significantly smaller than the dimensionality $d$. The quantity $\bar{r}$ is closely related to the numerical $\nu$-rank of $X$ (Hansen, 1998). We say that $X$ has numerical $\nu$-rank $r_\nu$ if

$$\sigma_{r_\nu}>\nu\ge\sigma_{r_\nu+1}.$$

Using the notion of numerical rank, we have

$$\bar{r}\le r_{\sqrt{\lambda/\gamma}}+\sum_{i=r_{\sqrt{\lambda/\gamma}}+1}^{d}\frac{\sigma_i^2}{\lambda/\gamma+\sigma_i^2}.$$

Thus, when the singular values $\sigma_i$ are small for $i>r_{\sqrt{\lambda/\gamma}}$, which means that $X$ can be well approximated by a matrix of rank $r_{\sqrt{\lambda/\gamma}}$, we have $\bar{r}=O(r_{\sqrt{\lambda/\gamma}})$.

One remarkable property of our approach is that it enjoys a multiplicative bound even in the full rank case. Thus, as long as $\frac{\varepsilon}{1-\varepsilon}\big(1+\frac{\sqrt{\lambda}}{\sqrt{\gamma}\,\sigma_k}\big)<1$, we can use Algorithm 2 to reduce the recovery error exponentially over the iterations. In contrast, the random projection based algorithm for SVD (Halko et al., 2011), although able to accurately recover the eigen-space when the matrix is of low rank, will result in a significant error in recovering the subspace spanned by the top singular vectors when applied to matrices of full rank, and is therefore unable to recover the optimal solution accurately.

Finally, we note that the assumption that the optimal solution lies in the subspace spanned by the top singular vectors has been used in kernel learning (Guo and Zhou, 2012) and semi-supervised learning (Ji et al., 2012).

## 5 Analysis

Due to the limitation of space, we only provide the analysis for the low rank case. Before presenting the analysis, we first establish some notation and facts. Let the SVD of $X$ be

$$X=U\Sigma V^{\top}=\sum_{i=1}^{r}\sigma_i\mathbf{u}_i\mathbf{v}_i^{\top},$$

where $U=[\mathbf{u}_1,\ldots,\mathbf{u}_r]\in\mathbb{R}^{d\times r}$, $\Sigma=\operatorname{diag}(\sigma_1,\ldots,\sigma_r)$, $V=[\mathbf{v}_1,\ldots,\mathbf{v}_r]\in\mathbb{R}^{n\times r}$, $\sigma_i$ is the $i$-th singular value of $X$, and $\mathbf{u}_i$ and $\mathbf{v}_i$ are the corresponding left and right singular vectors of $X$. We define

$$\boldsymbol{\gamma}_*=\Sigma V^{\top}D(\mathbf{y})\boldsymbol{\alpha}_*,\quad\text{and}\quad\tilde{\boldsymbol{\gamma}}=\Sigma V^{\top}D(\mathbf{y})\hat{\boldsymbol{\alpha}}_*. \tag{17}$$

It is straightforward to show that

$$\mathbf{w}_*=-\frac{1}{\lambda}U\Sigma V^{\top}D(\mathbf{y})\boldsymbol{\alpha}_*=-\frac{1}{\lambda}U\boldsymbol{\gamma}_*,\quad\text{and}\quad\tilde{\mathbf{w}}=-\frac{1}{\lambda}U\Sigma V^{\top}D(\mathbf{y})\hat{\boldsymbol{\alpha}}_*=-\frac{1}{\lambda}U\tilde{\boldsymbol{\gamma}}.$$

Since the columns of $U$ are orthonormal, we have

$$\|\mathbf{w}_*\|_2=\frac{1}{\lambda}\|\boldsymbol{\gamma}_*\|_2,\quad\|\tilde{\mathbf{w}}\|_2=\frac{1}{\lambda}\|\tilde{\boldsymbol{\gamma}}\|_2,\quad\text{and}\quad\|\tilde{\mathbf{w}}-\mathbf{w}_*\|_2=\frac{1}{\lambda}\|\tilde{\boldsymbol{\gamma}}-\boldsymbol{\gamma}_*\|_2. \tag{18}$$

Let us define

$$A=U^{\top}R\in\mathbb{R}^{r\times m}.$$

It is easy to verify that $A$ is a standard Gaussian matrix of size $r\times m$, i.e., each entry of $A$ independently follows $\mathcal{N}(0,1)$.

### 5.1 Proof of Theorem 2

We first introduce the following concentration inequality for Gaussian random matrices, which serves as the key to our analysis.

###### Corollary 7

Let $A\in\mathbb{R}^{r\times m}$ be a standard Gaussian random matrix. For any $0<\varepsilon\le1/2$ and $0<\delta<1$, with a probability at least $1-\delta$, we have

$$\left\|\frac{1}{m}AA^{\top}-I\right\|_2\le\varepsilon,$$

provided

$$m\ge\frac{(r+1)\log(2r/\delta)}{c\,\varepsilon^2},$$

where $\|\cdot\|_2$ is the spectral norm of a matrix and $c>0$ is a universal constant.
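A quick empirical illustration of Corollary 7 (a sketch; the sizes are arbitrary): for $r\ll m$, the spectral deviation $\|AA^{\top}/m-I\|_2$ is on the order of $\sqrt{r/m}$.

```python
import numpy as np

rng = np.random.default_rng(5)
r, m = 5, 2000
A = rng.standard_normal((r, m))                       # standard Gaussian, r x m
dev = np.linalg.norm(A @ A.T / m - np.eye(r), ord=2)  # spectral-norm deviation
print(dev)
```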

Define $L(\boldsymbol{\alpha})$ and $\hat{L}(\boldsymbol{\alpha})$ as

$$L(\boldsymbol{\alpha})=-\sum_{i=1}^{n}\ell_*(\alpha_i)-\frac{1}{2\lambda}\boldsymbol{\alpha}^{\top}G\boldsymbol{\alpha},\quad\text{and}\quad\hat{L}(\boldsymbol{\alpha})=-\sum_{i=1}^{n}\ell_*(\alpha_i)-\frac{1}{2\lambda}\boldsymbol{\alpha}^{\top}\hat{G}\boldsymbol{\alpha}.$$

Since $\hat{\boldsymbol{\alpha}}_*$ maximizes $\hat{L}(\boldsymbol{\alpha})$ over the domain $\Omega^n$, we have

$$\hat{L}(\hat{\boldsymbol{\alpha}}_*)\ge\hat{L}(\boldsymbol{\alpha}_*)+\frac{1}{2\lambda}(\hat{\boldsymbol{\alpha}}_*-\boldsymbol{\alpha}_*)^{\top}\hat{G}(\hat{\boldsymbol{\alpha}}_*-\boldsymbol{\alpha}_*). \tag{19}$$

Using the concavity of $\hat{L}(\boldsymbol{\alpha})$, we have

$$\hat{L}(\hat{\boldsymbol{\alpha}}_*)+\frac{1}{2\lambda}(\hat{\boldsymbol{\alpha}}_*-\boldsymbol{\alpha}_*)^{\top}\hat{G}(\hat{\boldsymbol{\alpha}}_*-\boldsymbol{\alpha}_*)\le\hat{L}(\boldsymbol{\alpha}_*)+(\hat{\boldsymbol{\alpha}}_*-\boldsymbol{\alpha}_*)^{\top}\big(\nabla\hat{L}(\boldsymbol{\alpha}_*)-\nabla L(\boldsymbol{\alpha}_*)+\nabla L(\boldsymbol{\alpha}_*)\big)\le\hat{L}(\boldsymbol{\alpha}_*)+\frac{1}{\lambda}(\hat{\boldsymbol{\alpha}}_*-\boldsymbol{\alpha}_*)^{\top}(G-\hat{G})\boldsymbol{\alpha}_*, \tag{20}$$

where the last inequality follows from the fact that $(\hat{\boldsymbol{\alpha}}_*-\boldsymbol{\alpha}_*)^{\top}\nabla L(\boldsymbol{\alpha}_*)\le0$, since $\boldsymbol{\alpha}_*$ maximizes $L(\boldsymbol{\alpha})$ over the domain $\Omega^n$, together with $\nabla\hat{L}(\boldsymbol{\alpha}_*)-\nabla L(\boldsymbol{\alpha}_*)=\frac{1}{\lambda}(G-\hat{G})\boldsymbol{\alpha}_*$. Combining the inequalities in (19) and (20), we have

$$\frac{1}{\lambda}(\hat{\boldsymbol{\alpha}}_*-\boldsymbol{\alpha}_*)^{\top}(G-\hat{G})\boldsymbol{\alpha}_*\ge\frac{1}{\lambda}(\hat{\boldsymbol{\alpha}}_*-\boldsymbol{\alpha}_*)^{\top}\hat{G}(\hat{\boldsymbol{\alpha}}_*-\boldsymbol{\alpha}_*).$$

We rewrite $G$ and $\hat{G}$ as

$$G=D(\mathbf{y})V\Sigma U^{\top}U\Sigma V^{\top}D(\mathbf{y})=D(\mathbf{y})V\Sigma\Sigma V^{\top}D(\mathbf{y}),\quad\hat{G}=D(\mathbf{y})V\Sigma U^{\top}\frac{RR^{\top}}{m}U\Sigma V^{\top}D(\mathbf{y})=D(\mathbf{y})V\Sigma\frac{AA^{\top}}{m}\Sigma V^{\top}D(\mathbf{y}).$$

Using the definitions of $\boldsymbol{\gamma}_*$ and $\tilde{\boldsymbol{\gamma}}$ in (17), we obtain

$$(\tilde{\boldsymbol{\gamma}}-\boldsymbol{\gamma}_*)^{\top}\Big(I-\frac{AA^{\top}}{m}\Big)\boldsymbol{\gamma}_*\ge(\tilde{\boldsymbol{\gamma}}-\boldsymbol{\gamma}_*)^{\top}\frac{AA^{\top}}{m}(\tilde{\boldsymbol{\gamma}}-\boldsymbol{\gamma}_*). \tag{21}$$

From Corollary 7, with a probability at least $1-\delta$, we have $\|AA^{\top}/m-I\|_2\le\varepsilon$ under the given condition on $m$. Upper bounding the left-hand side of (21) by $\varepsilon\|\tilde{\boldsymbol{\gamma}}-\boldsymbol{\gamma}_*\|_2\|\boldsymbol{\gamma}_*\|_2$ and lower bounding the right-hand side by $(1-\varepsilon)\|\tilde{\boldsymbol{\gamma}}-\boldsymbol{\gamma}_*\|_2^2$, we obtain

$$(1-\varepsilon)\|\tilde{\boldsymbol{\gamma}}-\boldsymbol{\gamma}_*\|_2\le\varepsilon\|\boldsymbol{\gamma}_*\|_2.$$

We complete the proof by using the equalities given in (18).

### 5.2 Proof of Theorem 5

At the $t$-th iteration, we consider the following optimization problem:

$$\min_{\mathbf{w}\in\mathbb{R}^d}\;L_t(\mathbf{w};X,\mathbf{y})=\frac{\lambda}{2}\|\mathbf{w}+\tilde{\mathbf{w}}_{t-1}\|_2^2+\sum_{i=1}^{n}\ell\big(y_i(\mathbf{w}+\tilde{\mathbf{w}}_{t-1})^{\top}\mathbf{x}_i\big), \tag{22}$$

where $\tilde{\mathbf{w}}_{t-1}$ is the solution obtained from the $(t-1)$-th iteration. It is straightforward to show that $\Delta_*^t=\mathbf{w}_*-\tilde{\mathbf{w}}_{t-1}$ is the optimal solution to (22). We can then use the dual random projection approach to recover $\Delta_*^t$, and denote the recovered solution by $\tilde{\Delta}_t$. If we can similarly show that

$$\|\tilde{\Delta}_t-\Delta_*^t\|_2\le\frac{\varepsilon}{1-\varepsilon}\|\Delta_*^t\|_2,$$

then we update the recovered solution by $\tilde{\mathbf{w}}_t=\tilde{\mathbf{w}}_{t-1}+\tilde{\Delta}_t$ and have

$$\|\tilde{\mathbf{w}}_t-\mathbf{w}_*\|_2=\|\tilde{\Delta}_t-\Delta_*^t\|_2\le\frac{\varepsilon}{1-\varepsilon}\|\Delta_*^t\|_2=\frac{\varepsilon}{1-\varepsilon}\|\tilde{\mathbf{w}}_{t-1}-\mathbf{w}_*\|_2.$$

As a result, if we repeat the above process for $T$ iterations, the recovery error of the last solution is upper bounded by

$$\|\tilde{\mathbf{w}}_T-\mathbf{w}_*\|_2\le\left(\frac{\varepsilon}{1-\varepsilon}\right)^{T}\|\tilde{\mathbf{w}}_0-\mathbf{w}_*\|_2=\left(\frac{\varepsilon}{1-\varepsilon}\right)^{T}\|\mathbf{w}_*\|_2,$$

where we assume $\tilde{\mathbf{w}}_0=\mathbf{0}$.

The remaining question is how to compute $\tilde{\Delta}_t$ using the dual random projection approach. In order to make the previous analysis remain valid for the recovered solution to the problem (22), we need to write the primal optimization problem in the same form as in (1). To this end, we first note that $\tilde{\mathbf{w}}_{t-1}$ lies in the subspace spanned by $\mathbf{x}_1,\ldots,\mathbf{x}_n$; thus we write $\tilde{\mathbf{w}}_{t-1}$ as

$$\tilde{\mathbf{w}}_{t-1}=-\frac{1}{\lambda}XD(\mathbf{y})\hat{\boldsymbol{\alpha}}_*^{t-1}=-\frac{1}{\lambda}\sum_{i=1}^{n}[\hat{\boldsymbol{\alpha}}_*^{t-1}]_iy_i\mathbf{x}_i.$$

Then, $L_t(\mathbf{w};X,\mathbf{y})$ can be written as

$$L_t(\mathbf{w};X,\mathbf{y})=\frac{\lambda}{2}\|\tilde{\mathbf{w}}_{t-1}\|_2^2+\frac{\lambda}{2}\|\mathbf{w}\|_2^2+\lambda\mathbf{w}^{\top}\tilde{\mathbf{w}}_{t-1}+\sum_{i=1}^{n}\ell\big(y_i\mathbf{w}^{\top}\mathbf{x}_i+y_i\tilde{\mathbf{w}}_{t-1}^{\top}\mathbf{x}_i\big)=\frac{\lambda}{2}\|\tilde{\mathbf{w}}_{t-1}\|_2^2+\frac{\lambda}{2}\|\mathbf{w}\|_2^2+\sum_{i=1}^{n}\Big[\ell\big(y_i\mathbf{w}^{\top}\mathbf{x}_i+y_i\tilde{\mathbf{w}}_{t-1}^{\top}\mathbf{x}_i\big)-[\hat{\boldsymbol{\alpha}}_*^{t-1}]_iy_i\mathbf{w}^{\top}\mathbf{x}_i\Big]=\frac{\lambda}{2}\|\tilde{\mathbf{w}}_{t-1}\|_2^2+\frac{\lambda}{2}\|\mathbf{w}\|_2^2+\sum_{i=1}^{n}\ell_i^t(y_i\mathbf{w}^{\top}\mathbf{x}_i),$$

where the new loss function $\ell_i^t(\cdot)$ is defined as

$$\ell_i^t(z)=\ell\big(z+y_i\tilde{\mathbf{w}}_{t-1}^{\top}\mathbf{x}_i\big)-[\hat{\boldsymbol{\alpha}}_*^{t-1}]_iz. \tag{23}$$

Therefore, $\Delta_*^t$ is the solution to the following problem:

$$\min_{\mathbf{w}\in\mathbb{R}^d}\;\frac{\lambda}{2}\|\mathbf{w}\|_2^2+\sum_{i=1}^{n}\ell_i^t(y_i\mathbf{w}^{\top}\mathbf{x}_i).$$

To apply the dual random projection approach to recover $\Delta_*^t$, we solve the following low-dimensional optimization problem:

$$\min_{\mathbf{z}\in\mathbb{R}^m}\;\frac{\lambda}{2}\|\mathbf{z}\|_2^2+\sum_{i=1}^{n}\ell_i^t(y_i\mathbf{z}^{\top}\hat{\mathbf{x}}_i),$$

where $\hat{\mathbf{x}}_i=\frac{1}{\sqrt{m}}R^{\top}\mathbf{x}_i$ is the low-dimensional representation of the example $\mathbf{x}_i$. The following derivation shows that the above problem is equivalent to the problem in (13):

$$\begin{aligned}\frac{\lambda}{2}\|\mathbf{z}\|_2^2+\sum_{i=1}^{n}\ell_i^t(y_i\mathbf{z}^{\top}\hat{\mathbf{x}}_i)&=\frac{\lambda}{2}\|\mathbf{z}\|_2^2+\sum_{i=1}^{n}\Big[\ell\big(y_i\mathbf{z}^{\top}\hat{\mathbf{x}}_i+y_i\tilde{\mathbf{w}}_{t-1}^{\top}\mathbf{x}_i\big)-[\hat{\boldsymbol{\alpha}}_*^{t-1}]_iy_i\mathbf{z}^{\top}\hat{\mathbf{x}}_i\Big]\\&=\frac{\lambda}{2}\|\mathbf{z}\|_2^2+\frac{\lambda}{\sqrt{m}}\mathbf{z}^{\top}\big(R^{\top}\tilde{\mathbf{w}}_{t-1}\big)+\sum_{i=1}^{n}\ell\big(y_i\mathbf{z}^{\top}\hat{\mathbf{x}}_i+y_i\tilde{\mathbf{w}}_{t-1}^{\top}\mathbf{x}_i\big)\\&=\frac{\lambda}{2}\Big\|\mathbf{z}+\frac{1}{\sqrt{m}}R^{\top}\tilde{\mathbf{w}}_{t-1}\Big\|_2^2+\sum_{i=1}^{n}\ell\big(y_i\mathbf{z}^{\top}\hat{\mathbf{x}}_i+y_i\tilde{\mathbf{w}}_{t-1}^{\top}\mathbf{x}_i\big)-\frac{\lambda}{2}\Big\|\frac{1}{\sqrt{m}}R^{\top}\tilde{\mathbf{w}}_{t-1}\Big\|_2^2,\end{aligned}$$

where in the second equality we use the facts that $\hat{\mathbf{x}}_i=\frac{1}{\sqrt{m}}R^{\top}\mathbf{x}_i$ and $\tilde{\mathbf{w}}_{t-1}=-\frac{1}{\lambda}\sum_{i=1}^{n}[\hat{\boldsymbol{\alpha}}_*^{t-1}]_iy_i\mathbf{x}_i$. Given the optimal solution $\mathbf{z}_*^t$ to the above problem, we can recover $\Delta_*^t$ by

$$\tilde{\Delta}_t=-\frac{1}{\lambda}XD(\mathbf{y})\hat{\mathbf{a}}_*^t,$$

where $\hat{\mathbf{a}}_*^t$ is computed by

$$[\hat{\mathbf{a}}_*^t]_i=\nabla\ell_i^t(y_i\hat{\mathbf{x}}_i^{\top}\mathbf{z}_*^t)=\nabla\ell\big(y_i\hat{\mathbf{x}}_i^{\top}\mathbf{z}_*^t+y_i\tilde{\mathbf{w}}_{t-1}^{\top}\mathbf{x}_i\big)-[\hat{\boldsymbol{\alpha}}_*^{t-1}]_i,\quad i=1,\ldots,n.$$

The updated solution is then computed by

$$\tilde{\mathbf{w}}_t=\tilde{\mathbf{w}}_{t-1}+\tilde{\Delta}_t=-\frac{1}{\lambda}XD(\mathbf{y})\big(\hat{\boldsymbol{\alpha}}_*^{t-1}+\hat{\mathbf{a}}_*^t\big).$$