A New Greedy Algorithm for Multiple Sparse Regression

Ali Jalali
Yahoo! Labs
ajalali@yahoo-inc.com
   Sujay Sanghavi
University of Texas at Austin
sanghavi@mail.utexas.edu
Abstract

This paper proposes a new algorithm for multiple sparse regression in high dimensions, where the task is to estimate the support and values of several (typically related) sparse vectors from a few noisy linear measurements. Our algorithm is a “forward-backward” greedy procedure that – uniquely – operates on two distinct classes of objects. In particular, we organize our target sparse vectors as a matrix; our algorithm involves iterative addition and removal of both (a) individual elements, and (b) entire rows (corresponding to shared features), of the matrix.

Analytically, we establish that our algorithm manages to recover the supports (exactly) and values (approximately) of the sparse vectors, under assumptions similar to existing approaches based on convex optimization. However, our algorithm has a much smaller computational complexity. Perhaps most interestingly, it is seen empirically to require noticeably fewer samples. Ours represents the first attempt to extend greedy algorithms to the class of models that can only/best be represented by a combination of component structural assumptions (sparse and group-sparse, in our case).

1 Introduction

This paper provides a new algorithm for the (standard) multiple sparse linear regression problem, which we now describe. We are interested in inferring $r$ sparse vectors $\bar{\theta}^{(1)}, \ldots, \bar{\theta}^{(r)} \in \mathbb{R}^p$ from noisy linear measurements; in particular, for each task $k \in \{1, \ldots, r\}$, we observe $n$ noisy linear measurements according to the statistical model

$$y^{(k)} \;=\; X^{(k)}\,\bar{\theta}^{(k)} + w^{(k)}, \qquad k = 1, \ldots, r, \qquad (1)$$

where for each $k$, $X^{(k)} \in \mathbb{R}^{n \times p}$ is the design matrix, $y^{(k)} \in \mathbb{R}^{n}$ is the response vector and $w^{(k)} \in \mathbb{R}^{n}$ is the noise. We combine all tasks as columns of a matrix $\bar{\Theta} = [\bar{\theta}^{(1)} \cdots \bar{\theta}^{(r)}] \in \mathbb{R}^{p \times r}$. We are thus interested in inferring the matrix $\bar{\Theta}$ given $(y^{(k)}, X^{(k)})$ for $k = 1, \ldots, r$. Here inference means both recovery of the support of $\bar{\Theta}$, as well as closeness in numerical values on the non-zero elements.
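To fix notation, the following is a minimal Python sketch of this observation model; the task count, dimensions, sparsity, and noise level below are illustrative placeholders rather than quantities taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper): r tasks, n observations, p features, sparsity s.
r, n, p, s = 4, 50, 200, 10

# True parameter matrix Theta_bar (p x r): each column is a sparse vector.
Theta_bar = np.zeros((p, r))
for k in range(r):
    support = rng.choice(p, size=s, replace=False)
    Theta_bar[support, k] = rng.standard_normal(s)

# Per-task designs, noise, and responses, following y^(k) = X^(k) theta_bar^(k) + w^(k).
X = [rng.standard_normal((n, p)) for _ in range(r)]
sigma = 0.1
Y = [X[k] @ Theta_bar[:, k] + sigma * rng.standard_normal(n) for k in range(r)]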

We are interested in solving this problem in the high-dimensional setting, where the number of observations $n$ is potentially substantially smaller than the number of features $p$. High-dimensional settings arise in applications where measurements are expensive, and hence a sufficient number may be unavailable. Consistent recovery of $\bar{\Theta}$ is now not possible in general; however, as is now well-recognized, it is possible if each $\bar{\theta}^{(k)}$ is sparse, and the design matrices satisfy certain properties.

Multiple sparse linear regression comes up in applications ranging from graphical model selection [13] and kernel learning [1] to function estimation [12] and multi-task learning [6]. In several of these examples, the different vectors are related, in the sense that they share portions of their supports/features, and may even be close in value on those entries. As an example, consider the task of learning the handwritten character "A" for different writers. Since all these handwritings read "A", they should share many features, but of course there might be a few non-shared features reflecting each individual's handwriting. A natural question in this setting is: can inferring the vectors jointly (often referred to as multi-task learning [2]) result in lower sample complexity than inferring each one individually?

When the sharing of supports is partial, it turns out the answer depends on the method used. "Group LASSO" methods based on block-norm regularization can actually result in lower or higher sample complexity than, for example, running a separate LASSO for each task, depending on whether the level of sharing among tasks is high or low, respectively. The "dirty model" approach [6] develops a method, based on splitting $\bar{\Theta}$ into two matrices which are regularized differently, which shows gains in sample complexity for all levels of sharing. We review the related existing work in Section 1.1.

Our Contribution: We provide a novel forward-backward greedy algorithm, designed for the case when the target structure is a combination of a sparse and a block-sparse matrix. We provide theoretical guarantees on the performance of the algorithm in terms of both estimation error and support recovery. Our analysis is more subtle than [7], since we would like to have local assumptions on each task as opposed to global assumptions on the whole matrix $\bar{\Theta}$. Ours is the first attempt to extend greedy approaches, which are sometimes seen to be both statistically and computationally more efficient than convex programming, to high-dimensional problems where the best/only approach involves the use of more than one structural model (sparse and group-sparse in our case).

1.1 Related Work

There is now a huge literature on sparse recovery from linear measurements; we restrict ourselves here to the most directly related work on multiple sparse linear regression.

Convex optimization approaches: A popular recent approach to leveraging sharing in sparsity patterns has been the use of group norms $\ell_1/\ell_q$ (with $q > 1$) as regularizers; examples include the $\ell_1/\ell_\infty$ norm [16, 18, 9] and the $\ell_1/\ell_2$ norm [8, 10]. The sample complexity of these methods may be sensitive [9] to the level of sharing of supports, motivating the "dirty model" approach [6]; in that paper, the unknown matrix was split as the sum of two matrices, regularized to encourage group-sparsity in one and sparsity in the other. Conceptually, this is similar to our line of thinking; however, their approach was based on convex optimization. We show that our method empirically has lower sample complexity than [6] (although we do not have a theoretical characterization of the constant multiplicative factor that seems to be the difference).

Greedy methods: Several algorithms attempt to find the support (and hence values) of sparse vectors by iteratively adding, and possibly dropping, elements from the support. The earliest examples were simple “forward” algorithms like Orthogonal Matching Pursuit (OMP) [15, 19], etc.; these add elements to the support until the loss goes below a threshold. More recently, it has been shown [20, 7] that adding a backward step is more statistically efficient, requiring weaker conditions for support recovery. Another line of (forward) greedy algorithms works by looking at the gradient of the loss function, instead of the function itself; see e.g. [5]. A big difference between our work and these is that our forward-backward algorithm works with two different classes of objects simultaneously: singleton elements of the matrix of vectors that need to be recovered, and entire rows of this matrix. This adds a significant extra dimension in algorithm design, as we need a way to compare the gains provided by each class of object in a way that ensures convergence and correctness.

2 Our Algorithm

We now first briefly describe the algorithm in words, and then specify it precisely. A natural loss function for our multi-task problem is

$$L(\Theta) \;=\; \sum_{k=1}^{r} \frac{1}{2n}\,\big\| y^{(k)} - X^{(k)}\,\theta^{(k)} \big\|_2^2,$$

where $\theta^{(k)}$ denotes the $k^{\text{th}}$ column of $\Theta$. Let $\bar{\Theta}$ be the matrix which has the $k^{\text{th}}$ target vector $\bar{\theta}^{(k)}$ as its $k^{\text{th}}$ column. Our algorithm is based on iteratively building and modifying the estimated support of $\bar{\Theta}$, by adding (in the forward step) and removing (in the backward step) two kinds of objects: singleton elements, and entire rows (corresponding to shared features). The basic idea is to include in forward steps singletons/rows that give big decreases in the loss, and to remove in the backward steps those whose removal results in only a small increase in the loss. However, the two kinds of objects cannot be compared in an "apples to apples" way, which means that doing the forward and backward steps in a way that ensures convergence and correctness is not immediate; as we will see below, there are some intricacies in how the addition and removal decisions are made.
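For reference in what follows, here is a small Python sketch of this separable loss and of least-squares re-estimation restricted to a given support set; the $1/(2n)$ normalization is an assumed convention, and the helper names are ours.

import numpy as np

def task_loss(X_k, y_k, theta_k):
    """Squared loss for a single task, normalized by 1/(2n) (assumed convention)."""
    n = X_k.shape[0]
    residual = y_k - X_k @ theta_k
    return residual @ residual / (2.0 * n)

def total_loss(X, Y, Theta):
    """Loss summed over tasks; Theta has one column per task."""
    return sum(task_loss(X[k], Y[k], Theta[:, k]) for k in range(Theta.shape[1]))

def restimate_on_support(X, Y, support_mask):
    """Least-squares fit of each column of Theta restricted to its allowed entries.

    support_mask is a boolean (p x r) array marking which entries may be non-zero.
    """
    p, r = support_mask.shape
    Theta = np.zeros((p, r))
    for k in range(r):
        idx = np.flatnonzero(support_mask[:, k])
        if idx.size:
            Theta[idx, k], *_ = np.linalg.lstsq(X[k][:, idx], Y[k], rcond=None)
    return Theta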

It is easiest to understand our algorithm in terms of "reward" for forward steps, and "cost" for backward steps. Each inclusion results in a decrease in the loss; the corresponding reward is an appropriate weighting of this decrease, with the weighting tilted to favor singleton elements over entire rows. Similarly, each removal results in an increase in the loss; the corresponding cost is the same weighting of this increase. Each iteration consists of one forward step, and (potentially) several backward steps. We maintain two sets: one of singleton elements, and one of rows (we abuse notation by using the latter to also refer to all elements in these rows; the correct usage is always clear from context).

  1. In the forward step, we find the new object whose inclusion would yield the highest reward. If this reward is large enough, the object is added to its corresponding set (the set of singleton elements or the set of rows). We also record the value of the reward, and the type of object. If this reward is less than an absolute threshold, the algorithm terminates.

  2. In each backward step, we find the object with the lowest cost. If this cost is "low enough", we remove the object; else we do not. We now explain what "low enough" means; this is crucial. Say the object with the lowest cost is a singleton element, and there are currently $j$ singleton elements in the estimate. Then we remove this element if its cost is smaller, by a fixed fraction, than the reward that was recorded when the $j^{\text{th}}$ singleton element was added (note that, because each iteration has several backward steps, this addition could have happened many forward steps prior to the current iteration). Similarly, if the object is a row, its cost is compared to the corresponding recorded row reward.

Convergence: It can be seen that in any given iteration, the loss can actually increase! This is because there can be multiple backward steps in the same iteration. To see that the algorithm converges, note that the cost of each backward step is at most a fixed fraction (strictly less than one) of the reward of the corresponding forward step: the one it was compared to when we made the decision to execute the backward step. Thus, this backward and forward step, as a pair, result in a net decrease in the loss. Convergence follows from the fact that there is a one-to-one correspondence between each backward step and its corresponding forward step; there are no backward steps that are "un-accounted for".
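To make the accounting explicit (the symbols $g$, $c$, $\nu$ below are introduced here for illustration): if a forward step is recorded with weighted gain $g$ and a backward step is executed only when its weighted cost $c$ satisfies $c < \nu g$ for a backward factor $\nu < 1$, then pairing the two gives, in the same weighting,

$$g - c \;>\; (1-\nu)\,g \;>\; 0,$$

so each executed backward step, together with its paired forward step, strictly decreases the loss; forward steps that are never paired with a backward step only decrease it further.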

  Input: Data $(y^{(k)}, X^{(k)})$ for all tasks $k$, Stopping Threshold, Sharing Threshold Weight, Backward Factor (a fraction strictly between 0 and 1)
  Output Variables: set of singleton elements, set of rows
  Initialize: the set of singleton elements and the set of rows to empty, the estimate to the zero matrix, and the recorded rewards to empty
  while true do {Forward Step}
     Find the best new singleton element and its reward
     Find the best new row and its (weighted) reward
     Choose and record the bigger weighted gain
     if this gain is below the stopping threshold then {Gain too small}
        break (algorithm stops)
     end if
     If the row gain is the larger one then add the row to the set of rows, else add the singleton to the set of singleton elements
     Re-estimate the parameter matrix on the new support set
     Increment the count of objects of the type just added
     while true do {Several backward steps for each forward step}
        Find the worst current singleton element and its cost
        Find the worst current row and its cost
        if the smaller cost exceeds the backward factor times the corresponding recorded reward then {Cost too large}
           break (backward steps end)
        end if
        If the row is the cheaper object then remove it from the set of rows, else remove the singleton from the set of singleton elements
        Re-estimate the parameter matrix on the new support set
        Decrement the count of objects of the type just removed
     end while
  end while
Algorithm 1 Greedy Dirty Model
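The listing below is a simplified, brute-force Python sketch of this control flow, reusing `total_loss` and `restimate_on_support` from the earlier snippet. The stopping threshold `eps`, row weight `row_weight`, backward factor `nu`, and iteration cap are illustrative parameters; this is a sketch under our assumptions, not a tuned implementation of Algorithm 1.

import numpy as np
from itertools import product

def build_mask(p, r, rows, singles):
    """Support mask implied by a set of whole rows and a set of (feature, task) singletons."""
    mask = np.zeros((p, r), dtype=bool)
    for i in rows:
        mask[i, :] = True
    for i, k in singles:
        mask[i, k] = True
    return mask

def greedy_dirty_model(X, Y, eps=1e-3, row_weight=0.5, nu=0.5, max_iter=50):
    """Simplified sketch of the forward-backward loop; eps, row_weight, nu are illustrative."""
    r, p = len(X), X[0].shape[1]
    rows, singles = set(), set()                 # current rows and singleton elements
    row_rewards, single_rewards = [], []         # rewards recorded at each addition
    loss_of = lambda rw, sg: total_loss(X, Y, restimate_on_support(X, Y, build_mask(p, r, rw, sg)))
    current = loss_of(rows, singles)

    for _ in range(max_iter):
        # Forward step: best new singleton vs. best new row (row gains are down-weighted).
        cand_s = [((i, k), current - loss_of(rows, singles | {(i, k)}))
                  for i, k in product(range(p), range(r))
                  if i not in rows and (i, k) not in singles]
        cand_r = [(i, row_weight * (current - loss_of(rows | {i}, singles)))
                  for i in range(p) if i not in rows]
        best_s = max(cand_s, key=lambda t: t[1], default=(None, -np.inf))
        best_r = max(cand_r, key=lambda t: t[1], default=(None, -np.inf))
        if max(best_s[1], best_r[1]) < eps:
            break                                # gain too small: algorithm stops
        if best_r[1] >= best_s[1]:
            rows.add(best_r[0]); row_rewards.append(best_r[1])
        else:
            singles.add(best_s[0]); single_rewards.append(best_s[1])
        current = loss_of(rows, singles)

        # Backward steps: repeatedly drop the cheapest object while its cost is small
        # compared to the reward recorded when the corresponding addition was made.
        while rows or singles:
            cost_s = [(e, loss_of(rows, singles - {e}) - current) for e in singles]
            cost_r = [(i, row_weight * (loss_of(rows - {i}, singles) - current)) for i in rows]
            candidates = []
            if cost_s:
                obj, c = min(cost_s, key=lambda t: t[1])
                candidates.append(("single", obj, c, nu * single_rewards[-1]))
            if cost_r:
                obj, c = min(cost_r, key=lambda t: t[1])
                candidates.append(("row", obj, c, nu * row_rewards[-1]))
            kind, obj, c, bar = min(candidates, key=lambda t: t[2])
            if c >= bar:
                break                            # cost too large: backward steps end
            if kind == "single":
                singles.discard(obj); single_rewards.pop()
            else:
                rows.discard(obj); row_rewards.pop()
            current = loss_of(rows, singles)

    Theta_hat = restimate_on_support(X, Y, build_mask(p, r, rows, singles))
    return Theta_hat, rows, singles

The exhaustive candidate search above costs one restricted least-squares fit per candidate and is only meant to mirror the algorithm's logic; a practical implementation would reuse factorizations and screen candidates by the gradient.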

3 Performance Guarantees

Shared and non-shared features: Consider the true matrix $\bar{\Theta}$, and for a fixed integer $d$ define the set of "shared" features/rows as those rows whose support size is more than $d$. In this paper, we overload notation so that this set refers both to the set of such rows and to the set of all elements in these rows; the correct interpretation is always clear from context. We also define the support on the non-shared features as the remaining non-zero elements of $\bar{\Theta}$, and finally we define the overall support as the union of the two.

Recall that our method requires such a threshold as an input, and outputs two sets – a set of singleton elements, and a set of rows – and an estimated matrix supported on their union. Our main analytical result, Theorem 1 below, is a deterministic quantification of conditions (on the design matrices, the noise and the true matrix) under which our algorithm, with this threshold as input, yields sparsistency – i.e. recovery of the shared rows and of the support on the non-shared rows – and small estimation error. We start with the assumptions and then state the theorem. Corollary 1 covers two popular scenarios with randomness: the case where the design matrices are deterministic but the noise vectors are Gaussian, and the case where both the design matrices and the noise are Gaussian.

Restricted Eigenvalue Property (REP): Fix a task $k$ and a sparsity level $s$. We say the matrix $X^{(k)}$ satisfies $\mathrm{REP}(s)$ with constants $\kappa_\ell > 0$ and $\kappa_u < \infty$ if for every $s$-sparse vector $\Delta \in \mathbb{R}^p$ we have

$$\kappa_\ell\,\|\Delta\|_2 \;\le\; \frac{1}{\sqrt{n}}\,\big\|X^{(k)}\,\Delta\big\|_2 \;\le\; \kappa_u\,\|\Delta\|_2. \qquad (2)$$

In our results, we assume (by taking the maximum/minimum over all tasks) that $\kappa_\ell$ and $\kappa_u$ are the same for all tasks. Note that the level of sparsity will still be different for different tasks.

Gradient of the loss function: If there is no noise, i.e. $w^{(k)} = 0$ for all $k$, then $\bar{\Theta}$ is the optimal point of the loss function and the loss function has zero gradient there, i.e. $\nabla L(\bar{\Theta}) = 0$. However, for any $k$ with $w^{(k)} \neq 0$, the corresponding gradient will not be zero either. We define $\lambda$ to be an upper bound on the infinity norm of this gradient, i.e. $\|\nabla L^{(k)}(\bar{\theta}^{(k)})\|_\infty \le \lambda$ for all $k$, where $L^{(k)}(\theta) = \frac{1}{2n}\|y^{(k)} - X^{(k)}\theta\|_2^2$ is the per-task loss.
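As a concrete illustration under the squared loss sketched earlier (the $1/n$ factor follows from the assumed $1/(2n)$ normalization; the function names are ours), this quantity can be computed as follows.

import numpy as np

def task_gradient(X_k, y_k, theta_k):
    """Gradient of the per-task loss (1/(2n))||y - X theta||_2^2 at theta."""
    n = X_k.shape[0]
    return -X_k.T @ (y_k - X_k @ theta_k) / n

def gradient_bound(X, Y, Theta_bar):
    """Max over tasks of the infinity norm of the gradient at the true parameters.

    At the true parameters the residual equals the noise, so this is zero in the
    noiseless case and small when the noise is small.
    """
    return max(np.max(np.abs(task_gradient(X[k], Y[k], Theta_bar[:, k])))
               for k in range(Theta_bar.shape[1]))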

Minimum non-zero element: Elements of $\bar{\Theta}$ with very small magnitude are hard to distinguish from 0, so we need to specify a lower bound on the magnitude of the elements in the support of $\bar{\Theta}$ that we want to recover. Towards this end, for each shared row, consider the magnitude of its $d^{\text{th}}$ largest (in magnitude) entry, and take the minimum of this quantity over all shared rows; similarly, take the minimum magnitude over the elements of the non-shared support. Finally, let the minimum non-zero element be the smaller of these two quantities. Note that a lower bound on this quantity implies that there are at least $d$ elements whose magnitude is above that bound in every shared row, and also that every element of the non-shared support is above that bound.

Theorem 1 (Sparsistency).

Suppose the algorithm is run with and . Let be such that , and for this let shared rows , non-shared features , sparsity levels , and minimum element as above. Suppose also that , and for each we have that holds with constants and , and that . Then, if we run Algorithm 1 with stopping threshold , the output with shared support and individual support satisfies:

  • Error Bound:

  • Support Recovery: and .

Remark 1. The noiseless case corresponds to $w^{(k)} = 0$ for all $k$, in which case the algorithm can be run with an arbitrarily small stopping threshold. As can be seen, this yields exact recovery, i.e. the estimate equals $\bar{\Theta}$.

Remark 2. The smaller the value of the backward factor, the faster the algorithm is likely to converge, as there are likely to be fewer backward steps. However, a smaller backward factor results in more stringent conditions being needed for success; thus an algorithm with a smaller backward factor is likely to work on a smaller range of problems: a trade-off between statistical and computational complexity.

Remark 3. Note that every row of the estimated singleton set has fewer than $d$ elements. To see this, suppose to the contrary that there exists a row in the singleton set that has at least $d$ non-zeros. Since in the algorithm these singleton elements must compete with a weighted version of the improvement of the entire row, and the weighting is chosen in accordance with the sharing threshold, the row would be chosen for the row set before those entries are chosen individually for the singleton set. Once the row enters the row set, since we optimize each entry on the row separately, it is impossible that any other singleton element on that row enters the singleton set. Hence, rows of the singleton set have fewer than $d$ entries and can be distinguished from the rows of the estimated shared set.

Remark 4. Some recent results [7, 14] study greedy algorithms in a general "atomic" framework. While our setting could be made to fall into this general framework, the resulting algorithm would be different, and the performance guarantees would be weaker: those results require the restricted eigenvalue condition to hold at a level dictated by all tasks simultaneously, which is order-wise worse than the per-task assumption we make for each task. To obtain this improvement, we leverage the fact that our loss function is separable with respect to tasks and hence carry out the analysis on a per-task basis.

Corollary 1.

For sample complexity , with probability at least for some constants , we have

  • Under the assumptions of Theorem 1, if the noise vectors are Gaussian, then the result holds for a suitable choice of the stopping threshold.

  • If the design matrices are Gaussian and the REP assumption in Theorem 1 holds for the corresponding covariance matrices, then the result holds.

Proof of Theorem 1.

For each task, let its effective sparsity be the size of the union of the support of the estimated task with the support of the true task. Inspired by [7], our proof is based upon the following two lemmas:

Lemma 1.

If holds, then

  • .

  • .

Lemma 2.

If is chosen properly (see appendix for the exact expression), then never exceeds , and hence, .

Parts (i) and (ii) of Lemma 1 are consequences of the fact that, when the algorithm stops, the forward step and the previous backward step fail to go through, respectively. To ensure that the assumption of Lemma 1 holds, we need Lemma 2, which bounds the size of the estimated support. The proof can be completed as below.

  • The result follows directly from part (i) of Lemma 1 noting that by Lemma 2.

  • Considering Remark 3, we only need to show that . For any , we have

    where the last inequality follows from part (a) and the inequality . Now, dividing both sides by we get

    The inequality follows from the assumption on and implying . To show the converse, from part (ii) of Lemma 1, we have

    due to the setting of the stopping threshold . This implies that and concludes the proof of the theorem.

4 Experimental Results

Figure 1: Probability of success in recovering the exact sign support using the greedy algorithm, the dirty model, LASSO and group LASSO ($\ell_1/\ell_\infty$ regularizer) for a 2-task problem, plotted against the control parameter for three levels of the feature-overlap fraction: (a) little overlap, (b) moderate overlap, (c) high overlap. The entries of the parameter and design matrices are i.i.d. standard Gaussians. The greedy method outperforms all other methods in the sample complexity required for sign support recovery.

4.1 Synthetic Data

To have a common ground for comparison, we run the same experiment used for the comparison of LASSO, group LASSO and the dirty model in [9, 6]. Consider the case where we have two tasks, each with support size $s$, and suppose these two tasks share a fraction $\alpha$ of their supports. The locations of the non-zero entries are chosen uniformly at random and the non-zero values are standard Gaussian realizations. Each row of the two design matrices is drawn as an i.i.d. standard Gaussian vector, and each entry of the two noise vectors is a zero-mean Gaussian draw with fixed variance. We run the experiment for several problem sizes $p$ and for a range of support-overlap levels $\alpha$.
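A sketch of this two-task data generation, together with the sign-support check used to score a trial as a success, is given below; the dimensions, support size, overlap fraction, and noise level are illustrative placeholders rather than the values used in the experiments.

import numpy as np

rng = np.random.default_rng(1)

def make_two_task_problem(n, p=128, s=8, alpha=0.5, sigma=0.25):
    """Two sparse tasks sharing a fraction alpha of their size-s supports (illustrative sizes)."""
    k_shared = int(round(alpha * s))
    shared = rng.choice(p, size=k_shared, replace=False)
    rest = rng.choice(np.setdiff1d(np.arange(p), shared), size=2 * (s - k_shared), replace=False)
    supports = [np.concatenate([shared, rest[: s - k_shared]]),
                np.concatenate([shared, rest[s - k_shared:]])]
    Theta_bar = np.zeros((p, 2))
    for k in (0, 1):
        Theta_bar[supports[k], k] = rng.standard_normal(s)
    X = [rng.standard_normal((n, p)) for _ in range(2)]
    Y = [X[k] @ Theta_bar[:, k] + sigma * rng.standard_normal(n) for k in range(2)]
    return X, Y, Theta_bar

def exact_sign_support(Theta_hat, Theta_bar, tol=1e-6):
    """A trial counts as a success if the estimated sign pattern matches the true one entrywise."""
    return np.array_equal(np.sign(np.where(np.abs(Theta_hat) > tol, Theta_hat, 0.0)),
                          np.sign(Theta_bar))

# The probability of success at a given sample size can then be estimated by averaging
# exact_sign_support over repeated independent draws of the problem.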

We use cross-validation to find the best values of the regularization coefficients. To do so, we search the two coefficients jointly over a restricted region. Notice that this search region is motivated by the requirements of our theorem and can be substantially smaller than the region that would need to be searched if the two coefficients were treated independently. Interestingly, for a small number of samples the ratio of the two coefficients tends toward one end of this region, whereas for a large number of samples it tends toward the other. We suspect this phenomenon is due to the lack of curvature around the optimal point when we have few samples: the greedy algorithm is more stable if it picks a row as opposed to a single coordinate, even if the improvement of the entire row is only comparable to the improvement of a single coordinate.

To compare the different methods under this regime, we define a rescaled version of the sample size as a control parameter. For different values of the overlap fraction $\alpha$, we plot the probability of success, obtained by averaging over 100 problems, versus this control parameter in Fig. 1. It can be seen that the greedy method outperforms the other methods, i.e., it requires fewer samples to recover the exact sign support.

This result matches the known theoretical guarantees. It is well-known that LASSO, group LASSO ($\ell_1/\ell_\infty$ regularizer) and the dirty model each exhibit a sharp transition in this rescaled sample size, at thresholds characterized in [17], [9] and [6] respectively. Although we do not have a theoretical result, these experiments suggest the following conjecture:

Conjecture 1.

For a two-task problem with Gaussian designs, the greedy algorithm has a sharp transition at a threshold that depends on the overlap fraction $\alpha$.

Figure 2: Phase transition threshold versus the overlap parameter $\alpha$ in a 2-task problem for the greedy algorithm, the dirty model, LASSO and group LASSO ($\ell_1/\ell_\infty$ regularizer). The entries of the parameter and design matrices are i.i.d. standard Gaussians. The greedy algorithm shows substantial improvement in sample complexity over the other methods.

To investigate our conjecture, we plot the sharp transition thresholds of the different methods versus the overlap parameter $\alpha$ for several problem sizes. Fig. 2 shows that the sharp transition threshold of the greedy algorithm follows our conjecture with good precision, although a theoretical guarantee for such a tight threshold remains open.

4.2 Handwritten Digits Dataset

 n                                  Greedy   Dirty Model   Group LASSO   LASSO
 10  Average Classification Error    6.5%       8.6%          9.9%       10.8%
     Variance of Error               0.4%       0.53%         0.64%       0.51%
     Average Row Support Size        180        171           170         123
     Average Support Size            1072       1651          1700        539
 20  Average Classification Error    2.1%       3.0%          3.5%        4.1%
     Variance of Error               0.44%      0.56%         0.62%       0.68%
     Average Row Support Size        185        226           217         173
     Average Support Size            1120       2118          2165        821
 40  Average Classification Error    1.4%       2.2%          3.2%        2.8%
     Variance of Error               0.48%      0.57%         0.68%       0.85%
     Average Row Support Size        194        299           368         354
     Average Support Size            1432       2761          3669        2053
Table 1: Handwritten-digit classification results for the greedy algorithm, dirty model, group LASSO and LASSO. The greedy method provides much better classification errors with simpler models, and its model selection is more consistent as the number of samples increases.

We use the handwritten digit dataset [3] that has been used by a number of papers [11, 4, 6] as a reliable benchmark for optical handwritten digit recognition algorithms. The dataset contains features of the handwritten numerals 0–9 (ten tasks) extracted from a collection of Dutch utility maps, with multiple samples of each digit written by different people. We take $n$ samples from each digit (see Table 1) and combine them into a single design matrix shared by all tasks, i.e., we set $X^{(k)} = X$ for all $k$. We construct the response vector $y^{(k)}$ to be 1 on the rows of $X$ that are instances of digit $k$ and zero otherwise. Clearly, the $y^{(k)}$'s will have disjoint supports. We run all four algorithms on this data and report the results.
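A minimal sketch of this one-versus-all construction is shown below, assuming the digit features have already been loaded into an array `features` (one row per sample) with integer labels `labels` in {0, ..., 9}; the function and variable names are ours, not part of the dataset interface.

import numpy as np

def build_multitask_digits(features, labels, n_per_digit, rng):
    """Shared design matrix X (the same for all 10 tasks) and one-vs-all responses y^(k)."""
    keep = np.concatenate([rng.choice(np.flatnonzero(labels == d), size=n_per_digit, replace=False)
                           for d in range(10)])
    X = features[keep]                                            # X^(k) = X for all tasks k
    Y = [(labels[keep] == d).astype(float) for d in range(10)]    # y^(k) is 1 on instances of digit k
    return X, Y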

Table 1 shows the results of our analysis for different sizes $n$ of the training set. We measure the classification error for each digit to get a 10-vector of errors, and then report the average and the variance of this vector to show how the error is distributed across the tasks. Again, for all methods, parameters are chosen via cross-validation. It can be seen that the greedy method provides more consistent model selection, as the model complexity does not change much when the number of samples increases, while the classification error decreases substantially. In all cases, the greedy method improves the classification error.
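The summary statistics reported in Table 1 amount to the following small computation, where `per_digit_error` is the 10-vector of per-digit classification errors (the values below are placeholders, not results from the paper).

import numpy as np

# Placeholder per-digit error rates; in the experiment these come from held-out evaluation.
per_digit_error = np.array([0.06, 0.07, 0.05, 0.08, 0.06, 0.07, 0.06, 0.05, 0.08, 0.07])

average_error = per_digit_error.mean()   # reported as "Average Classification Error"
error_variance = per_digit_error.var()   # reported as "Variance of Error"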

References

  • Bach [2008] F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.
  • Caruana [1997] R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.
  • Duin [2002] R. P.W. Duin. Department of Applied Physics, Delft University of Technology, Delft, The Netherlands, 2002.
  • He and Niyogi [2003] X. He and P. Niyogi. Locality preserving projections. In NIPS, 2003.
  • Jain et al. [2011] P. Jain, A. Tewari, and I.S. Dhillon. Orthogonal matching pursuit with replacement. NIPS, 2011.
  • Jalali et al. [2010] A. Jalali, P. Ravikumar, S. Sanghavi, and C. Ruan. A dirty model for multi-task learning. In NIPS, 2010.
  • Jalali et al. [2011] A. Jalali, C. Johnson, and P. Ravikumar. On learning discrete graphical models using greedy methods. In NIPS, 2011.
  • Lounici et al. [2009] K. Lounici, A. B. Tsybakov, M. Pontil, and S. A. van de Geer. Taking advantage of sparsity in multi-task learning. In 22nd Conference On Learning Theory (COLT), 2009.
  • Negahban and Wainwright [2008] S. Negahban and M. J. Wainwright. Joint support recovery under high-dimensional scaling: Benefits and perils of $\ell_{1,\infty}$-regularization. In Advances in Neural Information Processing Systems (NIPS), 2008.
  • Obozinski et al. [2011] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Support union recovery in high-dimensional multivariate regression. Annals of Statistics, 39:1–17, 2011.
  • Perkins and Theiler [2003] S. Perkins and J. Theiler. Online feature selection using grafting. In ICML, 2003.
  • Ravikumar et al. [2009] P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society, Series B, 2009.
  • Ravikumar et al. [2010] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using $\ell_1$-regularized logistic regression. Annals of Statistics, 38(3):1287–1319, 2010.
  • Tewari et al. [2011] A. Tewari, P. Ravikumar, and I.S. Dhillon. Greedy algorithms for structurally constrained high dimensional problems. NIPS, 2011.
  • Tropp and Gilbert [2007] J. A. Tropp and A. C. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53:4655–4666, 2007.
  • Turlach et al. [2005] B. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47:349–363, 2005.
  • Wainwright [2009] M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using $\ell_1$-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55:2183–2202, 2009.
  • Zhang and Huang [2008] C. Zhang and J. Huang. Model selection consistency of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36:1567–1594, 2008.
  • Zhang [2011] T. Zhang. Sparse recovery with orthogonal matching pursuit under RIP. IEEE Transactions on Information Theory, 57, 2011.
  • Zhang [2008] T. Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. In Neural Information Processing Systems (NIPS) 21, 2008.

Appendix A Auxiliary Lemmas for Theorem 1

Note that when the algorithm terminates, the forward step fails to go through. This entails that

Since our loss function is separable with respect to tasks, i.e., $L(\Theta) = \sum_{k} L^{(k)}(\theta^{(k)})$, for a fixed task $k$ we can rewrite the second inequality as

The next lemma shows that this has the consequence of upper bounding the deviation in loss between the estimated parameters and the true parameters .

Lemma 3 (Stopping Forward Step).

When the algorithm stops with parameter , we have

(3)
Proof.

Let . For any , we have

(4)

Here, we use the fact that is zero on the support of . Optimizing the RHS over , we obtain

whence the lemma follows.

Lemma 4 (Stopping Error Bound).

When the algorithm stops with parameter , we have

Proof.

For , let

It can be seen that , and from the previous lemma, . Further, is sub-homogeneous (over a limited range): for by basic properties of the convex function. Thus, for a carefully chosen , if we show that for all , where, is as defined in the proof of the theorem, then, it follows that . If not, then there would exist some such that , whence we would arrive at the contradiction

Thus, it remains to show that for all . By restricted strong convexity property of , we have

We can establish

and hence,

if for

This concludes the proof of the lemma.

Next, we note that when the algorithm terminates, the backward step with the current parameters has failed to go through. This entails that

(5)

The next lemma shows the consequence of this bound.

Lemma 5 (Stopping Backward Step).

When the algorithm stops with parameter , we have

Proof.

We have

(6)