Feedback-Controlled Sequential Lasso Screening

# Feedback-Controlled Sequential Lasso Screening

## Abstract

One way to solve lasso problems when the dictionary does not fit into available memory is to first screen the dictionary to remove unneeded features. Prior research has shown that sequential screening methods offer the greatest promise in this endeavor. Most existing work on sequential screening targets the context of tuning parameter selection, where one screens and solves a sequence of lasso problems with a fixed grid of geometrically spaced regularization parameters. In contrast, we focus on the scenario where a target regularization parameter has already been chosen via cross-validated model selection, and we then need to solve many lasso instances using this fixed value. In this context, we propose and explore a feedback controlled sequential screening scheme. Feedback is used at each iteration to select the next problem to be solved. This allows the sequence of problems to be adapted to the instance presented and the number of intermediate problems to be automatically selected. We demonstrate our feedback scheme using several datasets including a dictionary of approximate size 100,000 by 300,000.

\nipsfinalcopy

## 1 Introduction

Sparse dictionary-based representation of data has proved effective in a wide range of applications in computer vision, machine learning, signal processing and statistics [10, 35]. A sparse representation uses a dictionary , with columns called features, to represent a data point as a linear combination of a relatively small number of the features. Thus for many , we require . One way to obtain such representations is by solving the lasso problem [24]:

 minw∈Rp\nicefrac12∥y−Dw∥22+λ∥w∥1, (1)

where is a regularization parameter.

In many of today’s sparse coding applications, the dictionary can be very large, both because the data dimension is large ([36, 28, 23, 14]) and because an adequate representation requires many features ([36, 22, 1, 34, 14]). For example, in music genre classification the authors of [7] employed scattering representations [1, 2] of short segments of music data, resulting in dictionaries of size and (first order scattering), and and (second order scattering). Once dictionaries reach this size, solving (1) can become a bottleneck in the overall computation. More importantly, for significantly larger dictionaries, fitting the dictionary into available memory is a concern.

Several approaches are possible for solving (1) with a large dictionary. For example, assuming the dictionary fits into memory, one could use early termination of an iterative solver (e.g., FISTA [3]) to quickly obtain an approximate solution. Alternatively, one could use a finite number of steps of a sequential greedy method (e.g., OMP [26]) to approximate a solution of (1). OMP iteratively selects a feature with the largest correlation with the current residual. Searching for this feature is the most time consuming part of the algorithm but only requires part of the dictionary to be loaded at once. To obtain a solution that is -sparse, the algorithm needs to hold at most features in the memory. However, since the true sparsity is unknown, in general, OMP and similar algorithms will only yield an approximate solution. Both of these approaches lead (in general) to an approximate solution of (1).

Several recent studies have investigated the following alternative approach. For a given and , use duality to quickly identify a subset of features in that are certified to have zero weight in a solution of (1). By removing these features from the dictionary and solving a smaller lasso problem, we can obtain an exact solution to the original problem. In [9] this approach is called SAFE screening, indicating that screening does not reject any needed feature. Just as important, screening can be executed with only a few features loaded into memory at once. Hence it has the potential to significantly reduce the size of dictionary that is provided to a lasso solver. Recent work on safe screening has further developed this idea [39, 37, 8, 29, 33, 11, 19, 20]. The survey [38] reviews these recent methods. A related but distinct form of lasso screening [25] allows false rejection of features with small probability. We do not explicitly consider such screening tests, but our results are also relevant to this approach.

The core idea of SAFE screening is to bound the solution of the dual problem of (1) within a region and compute . The value is then used to decide if feature can be removed. Since these tests are applied once prior to solving (1), we refer to them as “one-shot” tests. A parameter that is often useful in such tests is the value . Roughly, this gives a measure of how well a feature in matches . Empirical studies indicate that current one-shot screening tests perform well for normalized values of of moderate to large size, but performance quickly declines as falls below [38]. To address this problem, one can employ a more complex form of screening known as sequential screening [9, 25]. In sequential schemes, one screens and solves a sequence of lasso problems with . The key point is that for instance , one utilizes the dual solution to obtain a region bound on . This is used to apply a one-shot screening test to reduce the dictionary. And a standard solver is then used to solve the reduced problem to find .

Existing state-of-the-art sequential screening algorithms include sequential Dome [32], sequential Strong rule [25], sequential Enhanced DPP rule [29] among others [9]. They are situated in the context of model selection. In this setting, all solutions to the sequence of problems are of interest, and the sequence of regularization parameters is thus fixed a priori (typically taking a log-grid of values). Sequential screening is then conducted along this predetermined sequence to expedite this parameter tuning process.

In contrast, we focus on another distinct application context where the best regularization parameter, denoted by , has already been chosen via cross-validated model selection, and then we need to solve many lasso instances using this fixed value. Of our particular interest, is around since many applications of the sparse representation framework have found this range to be most helpful. For example, [39] found that in the range maximizes SVM classification accuracy for the handwritten digit recognition problem. In music genre classification, the authors of [7] selected in the range via cross-validation to maximize classification accuracy using the SRC classifier [36]. Once the proper to use is known, the problem of efficient sequential screening can be very different from the sequential screening that is targeted for the model selection purpose: the solution at is the only one that is of interest, and all other problem instances are merely way points. We are thus given the freedom to design the sequence of regularization parameters with that specifically target a single problem instance at . In this scenario, the open-loop sequential screening schemes using a geometrically spaced sequence , which is determined before any problem in the sequence is solved, may not be a good idea. And it would be an advantage if could be tuned individually to each instance .

This paper makes two contributions. First, we design a feedback mechanism to adaptively select the sequence and for sequential screening. The feedback mechanism automatically selects the next value as a function of the results seen in previous steps of the sequential screening process. It also determines when to stop, and hence automatically selects . We call our feedback controlled sequential screening scheme, data-adaptive sequential screening (DASS). DASS has the advantage that and are automatically adapted during the screening process to the particular instance . In addition, the feedback mechanism bounds the region diameter used for one-shot screening at each iteration by a value set by the user. Second, we examine the effects of the inevitable errors that accrue in obtaining the solutions (and dual solutions) of the intermediate lasso problems during sequential screening. At each step, sequential screening assumes exact knowledge of the previous dual solution. However, practical lasso solvers introduce a small error. In the context of classification, we show that the DASS scheme is robust to these errors.

We give required background in §2. Then we introduce DASS and examine its properties in §3. We show that DASS ensures the diameters of the regions used for the one-shot screening are bounded by a user selected value, and the number of intermediate lasso problems is also bounded. We then address the issue of inaccuracies in lasso solutions. §4 discusses the performance of DASS on a selection of datasets, and we conclude in §5. Proofs are given in the supplementary material.

## 2 Background

The one-shot region tests discussed in the introduction are based on the Lagrangian dual of (1):

 maxθ∈Rn\nicefrac12∥y∥22−\nicefracλ22∥θ−y/λ∥22s.t.|dTiθ|≤1∀i=1,2,…,p, (2)

with the solutions of (1) and of (2) related by:

 y=D^w+λ^θ,dTi^θ={sign^wi, if ^wi≠0;γ∈[−1,1] if ^wi=0. (3)

This dual problem has been extensively discussed in the literature, [38, 39]. The set of feasible points of the dual problem is the nonempty, closed, convex polyhedron formed by the intersection of the finite set of closed half spaces for . The objective in (2) is maximized by the projection of onto . This is the unique point satisfying [15, §3.1]:

 (y/λ−^θ)T(θ−^θ)≤0,for each θ∈F. (4)

Let . For , , with on the boundary of . For , moves away from and is its unique projection onto the boundary of .

To form a one-shot test, one bounds the dual solution in a compact set , and computes . By (3), if a feature satisfies and , then ; see [9, 38]. The bounding region can be selected as the intersection of a sphere , with center and radius , and half spaces, i.e., . Then is obtained by solving a convex program, with closed form solutions available for [38, 31].

Sequential screening screens and solves a sequence of problems with until . In an open-loop scheme, and are selected before any solution is obtained. For example, fix , set , and space the geometrically: with .

Note that must lie in the sphere with center and radius [38]. In addition, if is a feature satisfying , then lies in the half space . So to screen the dictionary at , we use the one-shot screening test for

 R01=S(q1,r1)∩Hmax. (5)

Then we solve the reduced lasso problem to obtain (see (3)) .

At step , we make use of the dual solution for instance . Since is dual feasible, is contained in the sphere with

 qk=y/λk,rk=∥y/λk−^θk−1∥2. (6)

In addition, since is the projection of onto , by (4) we have Since , the right hand side of this inequality is nonnegative. Hence lies in the closed half space with

 nk−1=y/λk−1−^θk−1∥y/λk−1−^θk−1∥2,ck−1=nTk−1^θk−1. (7)

In particular, , so is contained in the region

 R0k=S(qk,rk)∩Hk. (8)

This is illustrated in Fig. 1. We now use the one-shot screening test for to screen the dictionary. Then solve the reduced lasso problem to obtain .

## 3 Feedback Controlled Sequential Screening

Under the assumption of exact computation, we now develop our sequential screening feedback rule and analyze its properties. The issue of approximate lasso solution is examined later. The diameter of a set , denoted by , is the maximum distance between any two points in . The following preliminary lemma will be useful.

###### Lemma 1.

If has a nonempty interior, then

 diam(R)={2√r2−(nTq−c)2,if q∉R;2r,otherwise.

We next show that if , and are constructed via (6), (7), and (8), then .

###### Lemma 2.

For constructed via (6), (7), and (8), .

Combining Lemmas 1 and 2 yields the following result.

###### Proposition 1.

For constructed via (6), (7), and (8),

 diam(R0k)=2(1λk−1λk−1)(yT(I−nk−1nTk−1)y)1/2. (9)

The region provides an initial bound for the one-shot test applied at step . More half space constraints can be used to form a final bounding region . For example, the one-shot test THT in [33, 38, 30] does this by adding a second appropriately selected half space constraint from the dual problem. No matter how, or how many, additional half space constraints are added, (9) provides an upper bound on the diameter of the final bounding region. Thus we have the following corollary.

###### Corollary 1.

If is obtained by adding a finite number of additional half space constraints to , then is bounded above by the expression in (9).

Proposition 1 suggests a rule for selecting so that where is a user specified parameter. Using (9), this is achieved by selecting so that:

 Extra open brace or missing close brace (10)

Assuming is not aligned with (this holds generically), . Incorporating (10) into a sequential screening scheme yields the DASS algorithm in Algorithm 2. We can initialize . At each step, we first use one-shot screening based on the region , then solve the resulting reduced lasso problem (for , can be used as a warm start). Hence each problem is solved efficiently.

If we write , then (10) becomes . Thus is a first order, autoregressive process driven by a nonnegative, nonlinear function of . Since depends on the solution of the previous instance, this term is providing feedback in the update of . The feedback rule ensures that decreases at each step; this results in , and termination of the algorithm. We show in Theorem 1 that the algorithm always terminates, give an upper bound on the number of iterations required, and verify that for the one-shot tests use a region of diameter at most .

###### Theorem 1.

For , let and the sequence be selected using the DASS algorithm. The DASS sequence is guaranteed to terminate after a finite number of steps , and , . In addition, if the dual regularization path is bounded, i.e., there exists , such that for all , then

 N≤1+log(1/λt)log(1+R/2C) .

The assumption in Theorem 1 that is bounded holds if is in the range of (supplementary material). It always holds if [40]. We show the number of iterations required for the algorithm to terminate in real problem instances in §4.

In practice, it is more realistic to assume we have an approximation to , with . For simplicity we assume does not depend on , but this is not essential. We first verify that the DASS algorithm still terminates after a finite number of steps and give a bound on the number of steps required. We state this as a corollary to Theorem 1.

###### Corollary 2.

Suppose we have an approximate dual solution , with , . Under this assumption, the DASS sequence terminates with .

In many classification applications, the lasso regression is often used as a feature extraction step. For each data point to be classified, lasso regression is first used to compute the solution . Then as a feature vector, is fed into a classifier, e.g., SVM, SRC. We analyze in this classification setting, the effect of accumulated inaccuracy in our sequential screening process. Consider using an approximate solution for classification. When used in sequential screening, gives an approximate dual solution , which is used for one-shot screening at step . So the one-shot test may both fail to reject, and falsely reject features. The error in thus has two sources: from the lasso solver itself, and from false feature rejection. The latter errors can propagate, leading DASS to fail to give adequate rejection, or to make too many false rejections causing classification performance to suffer. We investigate this in §4.

## 4 Empirical Tests

We compare DASS with state-of-the-art open-loop sequential screening schemes with a fixed grid of geometrically spaced regularization parameters. These open-loop sequential screening schemes are sequential Dome [32], sequential Strong rule [25] and sequential Enhanced DPP rule [29]. Sequential Dome is the sequential version of the dome test proposed in [37]. The authors of [29] commented that “for the DOME test, it is unclear whether its sequential version exists” and did not include a comparison of their sequential Enhanced DPP rule with sequential Dome. Their claim is not true and we hereby include both algorithms for comparison. Among these four screening algorithms, only the sequential Strong rule in [25] may falsely discard features.

Performance Metrics
1) Rejection Percentage: the ratio of the number of features rejected by screening rules to the number of all features.
2) Speedup: the time to solve (1) at without screening (one problem) divided by the total time to screen plus solve the reduced lasso problem along the sequence (N problems).
Note the difference in our metrics definition compared with [29]. In [29] rejection percentage was defined as the ratio of the number of features discarded by screening rules to the actual number of zero features in the ground truth; and speedup was defined as the time to solve the sequence of lasso problems without screening (N problem) divided by the total time to screen plus solve the reduced lasso problem along the same sequence (N problem). The difference in speedup definition mainly originates from our different motivation and application, as we discussed earlier in §1.

Datasets
(a) MNIST: hand-written digit images (training set: , testing set: ) [16]. Each image is rearranged into a vector and scaled to unit norm. We sample training images (500 per digit) to form the dictionary. Test images are randomly selected from the testing set.
(b) GTZAN: 100 music clips (30 sec, sampled at 22,050 Hz) for each of ten genres of music [27]. Each clip is divided into 3-sec texture windows (TW) with 50% overlap of adjacent TW. Each TW is then represented using a first order scattering vector of length 199 [1]. We randomly select a dictionary of 12,000 scattering vectors and randomly select test vectors from the remaining 8,000.
(c) NYT: 300,000 New York Times articles represented as vectors with respect to a vocabulary of 102,660 words [12]. The -th entry in vector gives the occurrence count of word in document . After removing documents with very low word counts, 299,752 documents remain. 299,652 of these documents are randomly selected to form the dictionary and from the remaining 100 documents, 65 are selected as test vectors with .

For experiments on MNIST and GTZAN we randomly select 20 dictionaries and for each dictionary we use 60 randomly selected test vectors. Reported results are averaged across selected dictionaries and test vectors. is set using the scaling invariant ratio with higher value yielding sparser solution. Speedup results using the Feature-sign lasso solver [17] are shown but our experiment indicate consistent results for a variety of other solvers. For example, see the supplementary material for additional results using the FISTA [4] solver.

In an open-loop sequential screening scheme, once is given and is selected then the sequence is fixed. We claim a key attribute of DASS is that it adapts to each individual problem: . We test the effectiveness of this adaptation by solving 1200 problem instances with for both MNIST and GTZAN. We then plot the scatter plot of speedup versus rejection percentage for these 1200 problem instances. See the top rows of Fig. 4 and Fig. 5. For MNIST, the average used by DASS () is with standard error of . DASS successfully pushes both the rejection percentage and speedup distribution to higher ranges compared with open-loop schemes using .

DASS also exhibits robust average rejection percentage and speedup as varies (bottom rows Fig. 4 and 5). For MNIST, DASS consistently beats or rivals the open-loop schemes as ranges from to . Open-loop schemes with are close to DASS in rejection, but its average speedup is only around or less for most values of . The DASS speedup curve gives a nice outer envelope across the set of open-loop speedup curves. The average used by DASS corresponding to each value of is given on top of the DASS speedup curve (red). As decreases, DASS increases on a case-by-case basis, and this is done without user intervention. Using a properly cross-validated , DASS is able to automatically design an efficient sequence for each individual . Similar results can been seen for GTZAN. Since overhead accumulates as increases, DASS, and open-loop sequential screening schemes, both exhibit less speedup as decreases. Nevertheless, DASS pushes efficient and robust screening down to , which includes a range useful in many classification applications.

The DASS scheme is easy to implement, and its one parameter can be selected by cross-validation using one reasonable value of . The scheme then gives strong screening performance across a wide range of values of . A smaller value of the user-specified parameter , yields a tighter bound on , and hence stronger rejection, but also a larger value of . Hence speedup will hit a sweet spot as decreases. Empirical investigation on the effect of is given in Fig. 6.

To examine the effect of “noisy” lasso solutions, we add noise to our computed , , for various values of noise-to-signal ratio (the ratio of the power of to that of ). We then hard threshold to yield a “corrupted” sparse solution. The impact on rejection fraction, speedup, and classification accuracy is shown for MNIST in Fig. 7. Rejection remains around until the noise-to-signal ratio goes above . Speedup drops from around to at . So DASS remains effective for noise-to-signal ratio below . Within this bound, there is a small penalty for the added corruption, but DASS is robust to the errors. In addition, when is used for Sparse Representation Classification [36], classification accuracy is also robust to the error in .

Finally, we use the NYT dataset to explore problems with many high dimensional features. From the test vector pool of documents, we select documents, subject to . is set at . Since the dictionary cannot fit into our computer memory, we run screening by loading small segments of the dictionary into memory at a time. We test DASS with . The average resulting values of from problem instances is with standard error being . For comparison, we run sequential Dome, sequential Strong rule and sequential Enhanced DPP rule with .

We allocate the same amount of system resources to each algorithm and each problem instance. Sequential Dome and sequential EDPP are not able to complete all problem instances because for some instances they do not reject enough features in the process and run into out of memory error. We thus include the completion rate metric in Fig. 8 defined as the percentage of problem instances that are completed given the same system resources. DASS and sequential Strong rule take the lead in this metric and are able to complete all problem instances. For problem instances that are completed by each algorithm, we provide a scatter plot of total running time versus number of remaining features in Fig. 8. No speedup information is shown since we cannot solve these problems without screening. The results show that DASS is on a par with sequential Strong rule and they significantly outperform sequential Dome and sequential EDPP rules: sequential Strong rule is able to reject more but DASS is able to finish in less runtime. Note however that the Strong rule may commit false rejections.

## 5 Conclusion

We have shown analytically and empirically that feedback controlled sequential screening yields improved performance. Given a cross-validated target , the DASS rule tailors an individualized sequence on the fly which is optimized for each particular problem instance . This has demonstrated significantly greater robustness with respect to the inherent variability in the lasso problem.

Our feedback idea extends to the DPP bound used in [29]. The selection rule yields a feedback controlled version of sequential DPP [29]. This uses a looser bound than (see Fig. 1). Our results also suggest that it may be possible to use feedback to control sequential screening in more targeted ways. For example, by varying the parameter in response to previously solved instances it may be possible to regulate the size of the dictionary after screening. This would be useful in situations of a large number of features and limited memory.

## Supplementary Material

### Proof of Lemma 1

###### Proof.

We want to find the solution to subject to: and , . Form the Lagrangian

 L=−\nicefrac12∥θ1−θ2∥22+∑2k=1\nicefrac12λk(∥θk−q∥22−r2)+∑2k=1μk(nTθk−c)

The objective function and inequality constraint functions are all convex functions, and by the assumption of Lemma 1 there exist strictly feasible points. Hence the KKT conditions are necessary and sufficient for optimality [5, P. 244]. Thus for to be a solution, it is sufficient that these points satisfy the four constraints, the complementary slackness conditions and , , and that .

We now construct a candidate solution. Suppose is not contained in . Let denote the orthogonal projection of onto the hyperplane . Since , and for some . Since lies on the plane, we must have . Hence

 ψ=(nTq−c)/r.

Pick any unit norm vector with . Then set

 Invalid decimal number (11)

For these points and . So both points satisfy the inequality constraints with equality and hence also satisfy the complementary slackness conditions. The final two KKT conditions require

 Dθ1L(θ1) =−(θ1−θ2)+λ1(θ1−q)+μ1n=0 (12) Dθ2L(θ2) =−(θ2−θ1)+λ2(θ2−q)+μ2n=0 (13)

Consider equation (12). Substituting from (11) yields

 r√1−ψ2(λ1−2)u+(μ1−λ1ψr)n=0.

This can be satisfied by selecting and . A similar argument for (13) shows that it is satisfied by the selection and . Thus the candidate solution is indeed a solution. So if , we have

If , then for the candidate solution we take and . In this case, the candidate points satisfy the spherical bound with equality. The half space constraint is also satisfied: , but generally only with inequality. So the candidate solution satisfies all of the constraints and the first two complementary slackness conditions. In this case, substitution of the expressions for into (12) yields

 r(λ1−2)u+μ1n=0

This can be satisfied by selecting and . This also satisfies the corresponding complementary slackness condition. Similarly for (13). So when , . ∎

### Proof of Lemma 2

###### Proof.

. 1) For , , . Since , is outside the dual feasible set and outside the half space . Thus is outside . 2) For , by construction , with and given by equation (7). , since . Since is a separating hyperplane of from , . Then . Thus . ∎

### Proof of Proposition 1

###### Proof.

By Lemma 1, . Set . Then,

 r2k=∥sky−^θk−1∥22=∥(sk−sk−1)y+sk−1y−^θk−1∥22=∥Δky+βk−1nk−1∥22,

where and is a scalar such that Hence

 Extra open brace or missing close brace

For the second term we have

 (nTk−1qk−ck−1)2 =(nTk−1(sk−1y+sky−sk−1y)−ck−1)2 =(nTk−1(Δky+βk−1nk−1))2 =Δ2kyTnk−1nTk−1y+2βk−1ΔkyTnk−1+β2k−1.

Thus . ∎

### Proof of Theorem 1

###### Proof.

Under the DASS feedback rule,

 sk=s1+\nicefrac12R∑kj=2(yT(I−nj−1nTj−1)y)−\nicefrac12≥s1+(k−1)R/(2∥y∥2) . (14)

Hence is sufficient for termination, from which we deduce . This bound is very loose. Proposition 1 and the feedback rule (10) give .

The stronger bound on requires the follow more detailed argument. Let the decreasing sequence be determined by (10) and . Then

 1λk−1λk−1=(1αk−1)1λk−1=\nicefrac12R(yT(I−nk−1nTk−1)y)−\nicefrac12

So

 αk=11+\nicefrac12Rλk−1(yT(I−nk−1nTk−1)y)−\nicefrac12

Using the definition of , we have

 yTnk−1=∥y∥22/λk−1−yT^θk−1∥y/λk−1−^θk−1∥2=∥y∥22−λk−1yT^θk−1∥y−λk−1^θk−1∥2.

Hence

 yT(I−nk−1nTk−1)y =∥y∥22−(yTnk−1)2 (15) =∥y∥22−(∥y∥22−λk−1yT^θk−1)2∥y−λk−1^θk−1∥22 (16) =λ2k−1(∥y∥22∥^θk−1∥22−(yT^θk−1)2)∥y−λk−1^θk−1∥22 (17)

Let . We first consider the possibility that for some scalar . So and are aligned. Since , in this case we must have . Then for all . So this case is easily handled. Hence we will assume below that is not aligned with .

We now make the assumption that there exists such that for . Under this assumption we have:

 λk−1(yT(I−nk−1nTk−1)y)−\nicefrac12 =(∥y−λk−1^θk−1∥22∥y∥22∥^θk−1∥22−(yT^θk−1)2)\nicefrac12 (18) ≥1∥^θk−1∥2(∥y−uk−1uTk−1y∥22∥y∥22−(yTuk−1)2)\nicefrac12 (19) =1∥^θk−1∥2 (20) ≥1C. (21)

Equation (19) follows from (18) by optimizing over . This gives an upper bound on

 αk=11+\nicefrac12Rλk−1(yT(I−nk−1nTk−1)y)−\nicefrac12≤CC+\nicefrac12R<1.

Set . Then . Hence . Noting that and , this implies that

 N≤1+log(λ1/λt)log(1/α)≤1+log(1/λt)log(1+\nicefrac12R/C)

To derive this tighter bound we needed the existence of with . To show this, we assumed the dual regularization path is bounded. ∎

###### Lemma 3.

If is in the range of , then is bounded.

###### Proof.

By the optimality of , must be in the subdifferential of