Maximal Sparsity with Deep Networks?
Abstract
The iterations of many sparse estimation algorithms are comprised of a fixed linear filter cascaded with a thresholding nonlinearity, which collectively resemble a typical neural network layer. Consequently, a lengthy sequence of algorithm iterations can be viewed as a deep network with shared, handcrafted layer weights. It is therefore quite natural to examine the degree to which a learned network model might act as a viable surrogate for traditional sparse estimation in domains where ample training data is available. While the possibility of a reduced computational budget is readily apparent when a ceiling is imposed on the number of layers, our work primarily focuses on estimation accuracy. In particular, it is wellknown that when a signal dictionary has coherent columns, as quantified by a large RIP constant, then most tractable iterative algorithms are unable to find maximally sparse representations. In contrast, we demonstrate both theoretically and empirically the potential for a trained deep network to recover minimal norm representations in regimes where existing methods fail. The resulting system is deployed on a practical photometric stereo estimation problem, where the goal is to remove sparse outliers that can disrupt the estimation of surface normals from a 3D scene.
School of Electronics Engineering and Computer Science
Peking University
Beijing, China
Email: {boxin,yizhou.wang,wgao}@pku.edu.cn
David Wipf
Microsoft Research
Beijing, China
Email: davidwipf@gmail.com
Keywords: Sparse estimation, compressive sensing, deep unfolding, deep networks, restricted isometry property (RIP)
1 Introduction
Our launching point is the optimization problem
(1) 
where is an observed vector, is some known, overcomplete dictionary of feature/basis vectors with , and denotes the norm of a vector, or a count of the number of nonzero elements. Consequently, (1) can be viewed as the search for a maximally sparse vector such that can be represented using the fewest number of features in the feasible region.^{1}^{1}1In practice, it is common to relax the feasible region to , or replace the constraint altogether with a sensible data fit term balanced with a tradeoff parameter. Unfortunately however, direct assault on (1) involves an intractable, combinatorial optimization process, and therefore efficient alternatives that return a maximally sparse with high probability in restricted regimes are sought. Popular examples with varying degrees of computational overhead include convex relaxations such as norm minimization (Donoho and Elad, 2003; Tibshirani, 1996), greedy approaches like orthogonal matching pursuit (OMP) (Pati et al., 1993; Tropp, 2004), and many flavors of iterative thresholding (Beck and Teboulle, 2009; Blumensath and Davies, 2008).
Variants of these algorithms find practical relevance in numerous disparate application domains, including feature selection (Cotter and Rao, 2002; Figueiredo, 2002), outlier removal (Candès and Tao, 2005; Ikehata et al., 2012), compressive sensing (Donoho, 2006), and source localization (Baillet et al., 2001; Malioutov et al., 2005) among many others. However, a fundamental weakness underlies them all: If the Gram matrix has significant offdiagonal energy, indicative of strong coherence between columns of , then estimation of may be extremely poor. Indeed both the cardinality of the solution, and often more importantly, the locations of nonzero elements, can be completely suboptimal. Loosely speaking this occurs because, as higher correlation levels are present, the nullspace of is more likely to include large numbers of approximately sparse vectors that tend to distract existing algorithms in the feasible region. The degree to which this risk is present can be quantified by a socalled restricted isometry constant to be described in detail later. Compounding the problem is that, in all but the most ideal settings where we are free to choose randomly from certain favorable distributions, there is no way of knowing in advance the true degree in which this correlation structure will be disruptive (e.g., restricted isometry constants are actually not feasible to compute in practice).
In this paper we consider recent developments in the field of deep learning as an entry point for improving the performance of sparse recovery algorithms. Although seemingly unrelated at first glance, the layers of a deep neural network (DNN) can be viewed as iterations of some algorithm that have been unfolded into a network structure (Gregor and LeCun, 2010; Hershey et al., 2014). In particular, iterative thresholding approaches such as those mentioned above typically involve an update rule comprised of a fixed, linear filter followed by a nonlinear activation function that promotes sparsity. Consequently, algorithm execution can be interpreted as passing an input through an extremely deep network with constant layer weights (dependent on ) at every layer.
This ‘unfolding’ viewpoint immediately suggests that we consider substituting discriminatively learned weights in place of those inspired by the original sparse recovery algorithm. For example, it has been argued that, given access to a sufficient number of pairs, a trained network may be capable of producing quality sparse estimates with a modest number of layers. This in turn can lead to a dramatically reduced computational burden relative to purely optimizationbased approaches, which can require hundreds or even thousands of iterations to sufficiently converge (Gregor and LeCun, 2010; Sprechmann et al., 2015).
Existing work on sparse estimation through deep network training borrows basic network components directly from the underlying iterative algorithm. Different networks are primarily differentiated by the types of activation functions employed, which performed as sparsitypromoting nonlinearities during their former life in service to iterative optimization. For example, (Gregor and LeCun, 2010) promotes a softthreshold function inspired by and iterative shrinkagethresholding algorithm (ISTA) for minimizing the norm, a wellknown convex approximation to the canonical norm sparsity penalty from (1). In contrast, (Sprechmann et al., 2015) advocates a wider class of functions derived from proximal operators (Parikh and Boyd, 2014). Finally, it has also been suggested that replacing typically continuous activation functions with hardthreshold operators may lead to sparser representations (Wang et al., 2015). At a high level though, one common ingredient of all of these approaches is the adoption of shared weights across layers.
While existing empirical results are promising, especially in terms of the reduction in computational footprint, there is as of yet no empirical demonstration of a learned deep network that can unequivocally recover maximally sparse vectors with greater accuracy than conventional, stateoftheart optimizationbased algorithms. Nor is there supporting theoretical evidence elucidating the exact mechanism whereby learning may be expected to improve the estimation accuracy, especially in the presence of coherent dictionaries . Additionally, minimal insights exist that might be transferrable to assessing the behavior of broader learning objectives and systems.
1.1 Paper Overview
This paper attempts to fill in these gaps described above, at least to the extent possible, via the following organizational structure. In Section 2 we begin by reviewing the iterative hardthresholding (IHT) algorithm for estimating a sparse vector. IHT was chosen because it can be directly unfolded for learning purposes, is representative of many sparse estimation paradigms, and is amenable to theoretical analysis. Next we discuss the limitations of IHT, including its high sensitivity to correlated designs, and motivate a DNNlike, unfolded alternative. Later, Section 3 considers this unfolded IHT network with shared layerwise weights and activations, which is the standard template for existing methods. We explicitly quantify the degree to which such networks can compensate for correlations in , but also expose the breaking point whereby any possible sharedweight construction is likely to fail.
This naturally segues to richer deep networks with layerwise independent weights and activations (meaning different layers need not share the same, fixed weights and activations), which we scrutinize in Section 4. Here we describe a multiresolution dictionary, with highly correlated clusters of columns, that explicitly requires the richer class of layer parameterizations to guarantee successful sparse recovery. Section 5 then further elucidates the essential multiresolution nature of the sparse estimation problem, and how we may deviate from strict adherence to any particular unfolded algorithmic script in designing a practical DNN. In particular, we motivate a multilabel classification network to focus on learning correct support patterns. In Section 6 we describe what we believe to be an essential ingredient for constructing an effective training set. Corroborating simulation results and a realworld computer vision application example are presented in Sections 7 and 8 respectively, followed by exploration of alternative recurrent longshorttermmemory (LSTM) structures in Section 9. Final discussions are in Section 10, while all proofs are deferred to the Appendix.
1.2 Summary of Contributions
Our technical and empirical contributions can be distilled to the following points:

We rigorously dissect the benefits of unfolding conventional sparse estimation algorithms to produce trainable deep networks. This includes a precise characterization of exactly how different architecture choices can affect the ability to improve effective restrictive isometry constants, which measure of the degree of disruptive correlation present in a dictionary. This helps to quantify the limits of shared layer weights and motivates more flexible network constructions that account for multiresolution structure in in a previously unexplored fashion. Importantly, we envision that our analyses are emblematic of important factors present in other DNNrelated domains.

Based on these theoretical insights, and a better understanding of the essential factors governing performance, we establish the degree to which it is favorable to diverge from strict conformity to any particular unfolded algorithmic script. In particular, we argue that the equivalent of layerwise independent weights and/or activations are essential, while retainment of original hardthresholding nonlinearities and squarederror loss implicit to IHT and related algorithms is not. We also recast the the core problem as deep multilabel classification given that optimal support pattern is our primary concern. This allows us to adopt a novel training paradigm that is less sensitive to the specific distribution encountered during testing. Ultimately, we development the first, ultrafast sparse estimation algorithm that can effectively deal with coherent dictionaries and adversarial restricted isometry constants.

We apply the proposed system to a practical photometric stereo computer vision problem, where the goal is to estimate the 3D geometry of an object using only 2D photos taken from a single camera under different lighting conditions. In this context, shadows and specularities represent sparse outliers that must be simultaneously removed from surface points. We achieve stateoftheart performance despite a minuscule computational budget appropriate for realtime mobile environments.

Finally, we explore the connection between unfolded sparse estimation algorithms and unfolded recurrent LSTM networks, revealing that the gating functions intrinsic to the latter can improve performance in the former by allowing coarseresolution sparsity patterns to prorogate to deeper layers.
2 From Iterative Hard Thesholding (IHT) to Deep Neural Networks
This section first introduces IHT before detailing its unfolded DNN analogue.
2.1 Introduction to IHT
With knowledge of an upper bound on the true cardinality, solving (1) can be replaced by the equivalent problem
(2) 
Iterative hardthresholding (IHT) attempts to minimize (2) using what can be viewed as computationallyefficient projected gradient iterations (Blumensath and Davies, 2009). Let denote the estimate of some maximally sparse after iterations. IHT first computes the gradient of the quadratic objective evaluated at given by
(3) 
We then take the unconstrained gradient step
(4) 
where is a stepsize parameter. Finally, we project onto the constraint set by zeroing out all but the largest values (in magnitude) of using a hardthresholding operator denoted . The combined iteration becomes
(5) 
which only requires matrixvector multiples and is computationally cheap to implement. For the vanilla version of IHT, the stepsize leads to a number of recovery guarantees whereby iterating (5), starting from is guaranteed to reduce (2) at each step before eventually converging to the globally optimal solution.^{2}^{2}2Other values of or even a positive definite matrix, adaptively chosen, can lead to a faster convergence rate (Blumensath and Davies, 2010). These results hinge on properties of which relate to the coherence structure of dictionary columns as encapsulated by the following definition.
Definition 1 (Restricted Isometry Property)
A dictionary satisfies the Restricted Isometry Property (RIP) with constant if
(6) 
holds for all .
In brief, the smaller the value of the restricted isometry constant , the closer any submatrix of with columns is to being orthogonal (i.e., it has less correlation structure).
It is now wellestablished that dictionaries with smaller values of lead to sparse recovery problems that are inherently easier to solve. In the context of IHT, it has been shown (Blumensath and Davies, 2009) that if , with and , then at iteration of (5)
(7) 
It follows that as , , meaning that we recovery the true, generating . Moreover, it can be shown that this is also the unique, optimal solution to (1) (Candès et al., 2006).
2.2 Unfolding IHT Iterations
The success of IHT in recovering maximally sparse solutions crucially depends on the RIPbased condition that , which heavily constrains the degree of correlation structure in that can be tolerated. While dictionaries with columns drawn independently and uniformly from the surface of a unit hypersphere^{3}^{3}3If elements of are drawn iid from and rescaled to have unit norm, then the resulting columns will be iid distributed uniformly on the unit sphere. Moreover, as , each column norm converges to one such that normalization is not even necessary. will satisfy this condition with high probability provided is small enough (Candès and Tao, 2005), for many/most practical problems of interest we cannot rely on this type of IHT recovery guarantee. In fact, except for randomized dictionaries in high dimensions where tight bounds exist, we cannot even compute the value of , which requires calculating the spectral norm of subsets of dictionary columns.
There are many ways nature might structure a dictionary such that IHT (or most any other existing sparse estimation algorithm) will fail. Here we consider one of the most straightforward forms of dictionary coherence that can easily disrupt performance. Consider the situation where , where columns of and are drawn iid from the surface of a unit hypersphere, while is arbitrary. Additionally, is a scalar and is a diagonal normalization matrix that scales each column of to have unit norm. It then follows that if is sufficiently small, the rankone component begins to dominate, and there is no value of such that .
It is here we hypothesize that DNNs provide a potential avenue for improvement to the extent that they might be able to compensate for disruptive correlation structure in . To see this, note that from a qualitative standpoint it is quite clear that the iterations of sparsitypromoting algorithms like IHT resemble the layers of neural networks (Gregor and LeCun, 2010). Therefore we can view a long sequence of such iterations as a DNN with fixed, parameterized weights at every layer. However, what if we are able to learn alternative weights that somehow overcome the limitations of a poor RIP constant?
For example, at the most basic level we might consider general networks with the layer defined by
(8) 
where is a nonlinear activation function, and and are arbitrary. Moreover, given access to training pairs , where is a sparse vector such that , we can optimize and using traditional stochastic gradient descent just like any other DNN structure. In the next section we will precisely characterize the extent to which this modification affords any benefit over IHT using . Later in Section 4 we will consider adaptive nonlinearities and layerspecific parameters .
3 Analysis using Shared LayerWise Weights and Activations
For simplicity in this section we restrict ourselves to the fixed hardthreshold operator across all layers; however, many of the conclusions borne out of our analysis nonetheless carry over to a much wider range of activation functions . In general it is difficult to analyze how arbitrary and may improve upon the fixed parameterization from (5) where and (assuming ). Fortunately though, we can significantly collapse the space of potential weight matrices by including the natural requirement that if represents the true, maximally sparse solution, then it must be a fixedpoint of . Indeed, without this stipulation the iterations could diverge away from the globally optimal value of , something IHT itself will never do. These considerations lead to the following:
Proposition 2
Consider a generalized IHTbased network layer given by
(9) 
and let denote any unique, maximally sparse feasible solution to with . Then to ensure that any such is a fixed point of (9) it must be that .
Although remains unconstrained, this result has restricted to be a rank factor, parameterized by , subtracted from an identity matrix. Certainly this represents a significant contraction of the space of ‘reasonable’ parameterizations for a general IHT layer. In light of Proposition 2, we may then further consider whether the added generality of (as opposed to the original fixed assignment ) affords any further benefit to the revised IHT update
(10) 
For this purpose we note that (10) can be interpreted as a projected gradient descent step for solving
(11) 
However, if is not positive semidefinite, then this objective is no longer even convex, and combined with the nonconvex constraint is likely to produce an even wider constellation of troublesome local minima with no clear affiliation with the global optimum of our original problem from (2). Consequently it does not immediately appear that is likely to provide any tangible benefit. However, there do exist important exceptions.
The first indication of how learning a general might help comes from the following result:
Proposition 3
Suppose that , where is an arbitrary matrix of appropriate dimension and is a fullrank diagonal that jointly solve
(12) 
Moreover, assume that is substituted with in (10), meaning we have simply replaced with a new dictionary that has scaled columns. Given these qualifications, if , with and , then at iteration of (10)
(13) 
As before, it follows that as , , meaning that we recovery the true, generating . Additionally, it can be guaranteed that after a finite number of iterations, the correct support pattern will be discovered. And it should be emphasized that rescaling by some known diagonal is a common prescription for sparse estimation (e.g., column normalization) that does not alter the optimal norm support pattern.^{4}^{4}4Inclusion of this diagonal factor can be equivalently viewed as relaxing Proposition 2 to hold under some fixed rescaling of , i.e., the optimal support pattern is preserved.
But the real advantage over regular IHT comes from the fact that , and in many practical cases, , which implies success can be guaranteed across a much wider range of RIP conditions. For example, if we revisit the dictionary , an immediate benefit can be observed. More concretely, for sufficiently small we argued that for all , and consequently convergence to the optimal solution may fail. In contrast, it can be shown that will remain quite small, satisfying , implying that performance will nearly match that of an equivalent recovery problem using (and as we discussed above, is likely to be relatively small per its unique, randomized design). The following result generalizes a sufficient regime whereby this is possible:
Corollary 4
Suppose , where elements of are drawn iid from , is any arbitrary matrix with , and is a diagonal matrix that enforces unit column norms. Then
(14) 
where denotes the matrix with any rows removed.
Additionally, as the size of grows proportionally larger, it can be shown that with overwhelming probability . Overall, these results suggest that we can essentially annihilate any potentially disruptive rank component at the cost of implicitly losing measurements (linearly independent rows of , and implicitly the corresponding elements of ). Therefore, at least provided that is sufficiently small such that , we can indeed be confident that a modified form of IHT can perform much like a system with an ideal RIP constant.^{5}^{5}5Of course at some point we will experience diminishing marginal returns using this prescription. For example, in the extreme case if , then columns of will be reduced to a dimensional subspace, and no RIP conditions can possibly hold (see (Bah and Tanner, 2010) for details of how the RIP constant scales with the dimensions of Gaussian iid matrices). Regardless, we can still choose some alternative and such that (12) is optimal, but the optimal solution will no longer involve complete eradication of . And of course in practice we may not ever be aware exactly how the dictionary decomposes as some ; however, to the extent that this approximation can possibly hold, the effective RIP constant can be improved nonetheless.
It should be noted that globally solving (12) is nondifferentiable and intractable, but this is the whole point of incorporating a DNN network to begin with. If we have access to a large number of training pairs generated using the true , then during the course of the learning process a useful and can be implicitly learned such that a maximal number of sparse vectors can be successfully recovered.
Moving forward beyond RIPrelated issues, there exists one additional way that learning could afford some value. Suppose now that , where and are as before (preserving their attendant benefits) and is an arbitrary invertible matrix. Given that multiplying a gradient by a positivedefinite, symmetric matrix guarantees a descent direction is preserved, the inclusion of could be viewed as as a form of natural gradient direction to be learned during training (Amari, 1998). However, given that such a direction must be universal across all layers and possible sparsity patterns, unlike the universal benefit of a lowered RIP constant, it is unclear the extent to which this improves performance. It would be interesting to isolate this effect at least empirically, but we do not pursue this issue further herein.
To summarize then, learning layerwise fixed weights and can indeed provide an important benefit by implicitly reducing the RIP constant of . We believe this to be a practicallyrealizable way of affecting what is otherwise an NPhard constant to even compute, let alone optimize. Learning layerwise fixed weights can also produce an alternative ‘natural gradient’ direction; however, given that this direction must be the same for all layers and for all sparse vectors , it remains unclear whether or not this latter capability provides any tangible welfare.
4 Analysis using LayerWise Independent Weights and Activations
In the previous section we observed how jointly adjusting and could implicitly remove the effects of lowrank components that inflate dictionary coherence and RIP constant values. However, we also qualified the advantages of this strategy, with diminishing marginal returns as more nonideal components enter the picture. In fact, it is not difficult to describe a slightly more sophisticated scenario such that use of layerwise constant weights and activations are no longer capable of lowering at all, portending failure when it comes to accurate sparse recovery. In contrast, this section will reveal that independent weights and adaptive activations can nonetheless still succeed.
To illustrate this effect, we will now analyze dictionaries with columns that are tightly grouped into clusters such that the withingroup correlation is high while the betweengroup correlation is modest. As further technical results require a bit more precision, we first present the following formal definition.
Definition 5 (Clustered Dictionary Model)
Let denote a partitioned matrix, with each partition sized such that . Also define and for all . Moreover, assume that both and are constructed with columns of unit norm. Then a dictionary matrix is said to arise from the clustered dictionary model if , with , as a scalar weighting factor, and a diagonal matrix that applies final column normalization. We also define the cluster support as the set of cluster indices whereby some has at least one nonzero corresponding element.
Therefore it becomes readily apparent that, provided is chosen sufficiently small, each partition will be a tight cluster of basis vectors centered around an axis formed by the corresponding . In some sense this model represents the simplest partitioning of correlation structure into two scales: the inter and intracluster structures. It thus represents an accessible model for evaluating further manual or learned modifications of IHT. In particular, we note that assuming is large, possibly even larger than , we can no longer rely on and to reduce as we did in Section 3, as annihilating every rankone term is clearly impossible.
We now turn to an adaptation of IHT that includes two important modifications that are reflective of many generic DNN structures:

The hardthresholding operator is generalized to account for prior information about learned support patterns from previous iterations, and

We allow the dictionary or weight matrix to change from iteration to iteration sequencing through a fixed set akin to layers of a DNN.
Regarding the former, we define an IHT iteration with partially known support as
(15) 
where denotes a support set of that is immune from hardthresholding, and denotes a second support set that is automatically forced to zero. The remaining elements of not in then face the standard hardthresholding operator, with all but the largest values set to zero. In spirit, (15) can be viewed as something like a highway network element (Srivastava et al., 2015) or a LSTM cell (Hochreiter and Schmidhuber, 1997), where elements can be turned on and off via a gating mechanism separate from the activation function.
When combined with layerwise weights that change from iteration to iteration via a prescribed sequence (just like a DNN), we arrive at what we term adaptive IHT:
Definition 6 (Adaptive Iterative Hard Thresholding (AIHT))
Let and assume we have access to some predefined sequence of weights as well as a predefined schedule for computing , , and . Then we refer to the iterations
(16) 
as adaptive iterative hardthresholding.
We will now examine how AIHT can directly handle the recovery of maximally sparse signals arising from the clustered dictionary model. We first introduce some additional notation. Let denote any subset of such that ; in other words represents the matrix formed by concatenating all partitions of from the set .
Proposition 7
Suppose is generated from the clustered dictionary model and that the concatenated matrix has RIP constant for all possible with . Then there exists an AIHT algorithm that is guaranteed to produce the correct support pattern of any in a finite number of iterations provided that , , , and where is suitably small.
The technical nature of Proposition 7 belies the simplicity of the actual underlying core idea. We unpack this result via a few important intuitions:

itself is also trivially obtained once the support is correctly estimated. So practically speaking this result guarantees we can recover in a finite number of iterations.

The integrated dictionary can have an arbitrarily large RIP constant as grows small such that IHT (or minimization, etc.) will likely fail to ever find the correct support. In fact, it can be proven that IHT will fail with minimal assumptions.^{6}^{6}6Technically speaking, the RIP conditions are sufficient but not necessary conditions for success. Therefore, just because the constant is too high alone does not always guarantee failure.

In contrast, the sufficient condition for AIHT to work only depends on coherence involving and , the components of the clustered dictionary model, not the integrated dictionary . In particular, we require that at both the intracluster and betweencluster scales, groups of dictionary columns must be reasonably incoherent.

We can simplify the stated conditions by noting that RIP constants can only go up whenever we increase the number of nonzeros or pad a dictionary with extra columns. Therefore, since and is a superset of the columns from any , Proposition 7 will also hold under the more stringent but easier to digest constraint . So we pay a small price in the sparsity level multiplier (from for regular IHT to for AIHT), but this is offset by the huge gain in working with as opposed to as the argument. And of course in reality we only need this to hold for all of the much smaller dictionary subsets as stipulated in the proposition, a significantly lower bar.

See the proof for details of how the layer weights and , and support sets and can be constructed. But the core principle is that earlier layers must be tasked with exposing the correct support at the cluster level, without concern for accuracy within each cluster. Once the correct cluster support has been obtained, later layers can then be charged with estimating the finegrain details of withincluster support. We believe this type of multiresolution sparse estimation is essential when dealing with highly coherent dictionaries (more on this in the next section).

The support sets and allow the network to ‘remember’ previously learned clusterlevel sparsity patterns, in much the same that LSTM gates allow long term dependencies to propagate (Hochreiter and Schmidhuber, 1997) or highway networks (Srivastava et al., 2015) facilitate information flow unfettered to deeper layers. Moreover, practically speaking these sets can be computed by passing the prior layer’s activations through linear filters followed by indicator functions, again reminiscent of how DNN gating functions are typically implemented.

Even if we exclude the column normalization multiplier from the clustered dictionary model, IHT will still fail since the top elements obtained via hardthresholding will be dominated by columns of with large norms at the mercy of scale factors.
We next turn to the more practically relevant situation where the dictionary cannot be so neatly partitioned into two levels of detail, such that manual construction of a priori layerdependent weights is much more difficult.
5 Discriminative MultiResolution Sparse Estimation
As implied previously, guaranteed success for most existing sparse estimation strategies hinges on the dictionary having columns drawn (approximately) from a uniform distribution on the surface of a unit hypersphere, or some similar condition to ensure that subsets of columns behave approximately like an orthogonal basis. Essentially this confines the structure of the dictionary to operate on a single universal scale. The clustered dictionary model described in the previous section considers a dictionary built on two different scales, with a clusterlevel distribution (coarse) and tightlypacked withincluster details (fine). But in reality practical dictionaries may display structure operating across a variety of scales that interleave with one another, forming a continuum among multiple levels.
When the scales are clearly demarcated, we have seen that it is possible to manually define a multiresolution AIHT algorithm that guarantees success in recovering the optimal support pattern; and indeed, AIHT could be extended to handle a clustered dictionary model with nested structures across more than two scales. However, without clearly partitioned scales it is much less obvious how one would devise an optimal IHT modification. It is in this context that learning with DNNs is likely to be most advantageous. In fact, the situation is not at all unlike many computer vision scenarios whereby handcrafted features such as SIFT may work optimally in confined, idealized domains, while learned CNNbased features are often more effective otherwise.
Given a sufficient corpus of pairs linked via some fixed , we can replace manual filter construction with a learningbased approach. On this point, although we view our results from Section 4 as a convincing proof of concept, it is unlikely that there is anything intrinsically special about the specific hardthreshold operator and layerwise construction we employed per se, as long as we alow for deep, adaptable layers that can account for structure at multiple scales. In practice, we expect that it is more important to establish a robust training pipeline that avoids stalling at the hand of vanishing gradients with deep network structure. It is here that we propose three significant deviations from the original IHT template.

We exploit the fact that in producing a maximally sparse vector , the main challenge is estimating . Once the support is obtained, computing the actual nonzero coefficients just boils down to solving something like a least squares problem. But any learning system will be unaware of this and could easily expend undue effort in attempting to match coefficient magnitudes at the expense of support recovery. Certainly the use of a data fit penalty of the form , as is adopted by nearly all sparse recovery algorithms, will expose us to this issue. Therefore we instead formulate sparse recovery as a multilabel classification problem. More specifically, instead of directly estimating , we attempt to learn , where
(17) For this purpose we may then incorporate a traditional multilabel classification loss function via a final softmax output layer, which forces the network to only concern itself with learning support patterns. This substitution is further justified by the fact that even with traditional IHT, the support pattern will be accurately recovered before the iterations converge exactly to . Therefore we may expect that fewer layers (as well as training data) are required if all we seek is a support estimate.

Given that IHT can take many iterations to converge on challenging problems, we may expect that a relatively deep network structure will be needed to obtain exact support recovery. We must therefore take care to avoid premature convergence to areas with vanishing gradient by incorporating several recent countermeasures proposed in the DNN community. For example, as mentioned previously, the adaptive threshold operator AIHT employs is reminiscent of highway networks or LSTM cells, which have been proposed to allow longer range flow of gradient information to improve convergence through the use of gating functions. An even simpler version of this concept involves direct, ungated connections that allow much deeper ‘residual’ networks to be trained (He et al., 2015a) (which is also reminiscent of the residual factor embedded in the original IHT iterations). We deploy this tool, along with batchnormalization (Ioffe and Szegedy, 2015) to aid convergence, for our basic feedforward pipeline. Later we also consider an alternative structure based on recurrent LSTM cells. Note that unfolded LSTM networks frequently receive a novel input for every time step, whereas here is applied unaltered at every layer (more on this in Section 9).

We replace the nonintegrable hardthreshold operator with simple rectilinear (ReLu) units (Nair and Hinton, 2010), which are functionally equivalent to onesided softthresholding.
Taken together, these changes deviate from the original IHT script; however, we believe they nonetheless preserve the foundational principles of learningbased sparse estimation. Certainly our empirical evidence below supports this claim.
6 Construction of Training Sets
If the ultimate goal is to learn an accurate model for computing the minimum norm, maximally sparse solution, then care must be taken in how we construct the training data. This issue is especially acute in the operational regime considered herein, namely sparse linear inverse problems where dictionary coherence is high.
As an illustrative example of this point, suppose we have a dictionary that is known to facilitate highly compact representations of some signal class of interest . Even if we have access to a large set of observations , our work is still ahead of us to construct a viable training set. This is because in general, computing the maximally sparse that corresponds with each represents an NPhard problem, and if the dictionary is coherent fast approximate schemes like OMP and IHT will fail; similarly  norm equivalency breaks down corrupting convex solutions. Consequently, if we use any of these methods to generate training values for , we will merely end up learning a network that approximates a suboptimal strategy, not the true norm solution that represents our goal to begin with. To the best of our knowledge though, this is the route that all previous learningbased sparse estimation pipelines proceed, e.g., see (Gregor and LeCun, 2010; Sprechmann et al., 2015; Wang et al., 2015). Hence these systems do not actually produce maximally sparse estimates as is our focus here, although they do represent quite useful methods for reducing the computational burden of approximate schemes like minimization.
In this paper we advocate an entirely different strategy. We first generate sparse training vectors with random support patterns. We then compute synthetic observations using (possibly with additional additive noise for robustness). Provided the dictionary satisfies minimal assumptions related to a quantity called matrix spark (Donoho and Elad, 2003), will provably represent the maximally sparse feasible solution. In this way we have an inexpensive means of producing whatever volume of training data we desire. Moreover, we have observed modest sensitivity at test time to the actual magnitude distribution of nonzero coefficients in used during training. In other words, even if we test using a different magnitude distribution than was used to generate training data, performance is relatively stable (see Section 7). This is a likely consequence of using a softmax final multilabel classification layer as opposed to trying to directly estimate . In practice this stability is paramount since we may not have a good estimate of this distribution anyway.
7 Feedforward Network Experiments
With existing sparse optimization algorithms, the goal is to estimate when presented with and . In contrast, with a large corpus of training pairs we intend to learn a mapping from to using a deep architecture. To isolate the various factors affecting the performance of feedforward networks in particular, this section describes experiments using a variety of different data generation procedures. Later, Section 9 will consider a competing recurrent LSTM architecture.
7.1 Network Design
We confine our design here to the feedforward structure motivated in Section 5. In brief, we build a 20layer network with residual connections (He et al., 2015a) and batch normalization (Ioffe and Szegedy, 2015). Moreover, because our sparse data will ultimately have no indication of local smoothness, we use fully connected layers rather than convolutions. For the nonlinearities we apply rectilinear units (ReLU). We also include a final softmax layer which outputs a vector , with providing an estimate of the probability that . The detailed network structure, which was implemented using MXNet (Chen et al., 2015), can be found in Figure 1.
7.2 Basic Sparse Estimation Experimental Setup
We generate a dictionary matrix using
(18) 
where and have iid elements drawn from . We also rescale each column of to have unit norm. generated in this way has superlinear decaying singular values (indicating correlation between the columns) but is not constrained to any specific structure. Many dictionaries in real applications have such a property. As a basic experiment, we generate ground truth samples by randomly selecting nonzero entries, with nonzero amplitudes drawn iid from the uniform distribution , excluding the interval to avoid small, relatively inconsequential contributions to the support pattern. We then create via . As we increase , the sparse estimation problem becomes intrinsically more difficult. We set and , while , noting that if we have only twice as many measurements as nonzeros in , which is a challenging regime, especially when has strong correlations.
We generated total samples and used the first for training and the remaining for testing. Network optimization is achieved via stochastic gradient descent (SGD) with a momentum of 0.9 and weight decay of 0.0001. The initialization follows (He et al., 2015b) and we did not apply any dropout. The batch size was set to 250. The initial learning rate was 0.01 and was reduced by 90% every 50 epoches. We stopped training after 150 epoches, at which point empirical convergence was always observed. Unless otherwise specified, throughout this paper we convert to the binary label vector from (17) for training purposes using the stated softmax output layer.
To evaluate the performance, we introduce two metrics referred to as strict accuracy (sacc) and loose accuracy (lacc), respectively. Both depend on two sets for each sample/trial, the ground truth labels and the predicted top labels given by
(19) 
Strict accuracy evaluates whether the ground truth nonzeros are exactly aligned with the predicted top values produced by our network, and when averaged across test trails, can be computed via
(20) 
where is an indicator function and here the superscript denotes the sample number. This allornothing metric is commonly adopted in the compressive literature, and effectively measures the percentage of trials where we can perfectly recover .
In contrast, loose accuracy considers the degree to which the correct support is included in the largest values of . Note that given our experimental design, it can be shown that with probability one there exists only a single feasible solution to such that (this will define the optimal support set by design). Moreover, any support pattern with exactly nonzeros is sufficient to produce a unique feasible solution (referred to as a basic feasible solution in the linear programming literature (Luenberger, 1984)). Hence we define loose accuracy as the degree to which the true support indeces are contained within the top 20 largest values of , or
(21) 
Both sacc and lacc metrics are computed for all test points and compared against a battery of existing algorithms, both learning and optimizationbased.^{7}^{7}7For competing algorithms, we compute using the largest (in magnitude) elements of any estimate . These include standard minimization via ISTA iterations (Candès et al., 2006), IHT (Blumensath and Davies, 2009), an ISTAbased network (Gregor and LeCun, 2010), and an IHTinspired network (Wang et al., 2015). For minimization we used publiclyavailable ISTA code,^{8}^{8}8http://www.eecs.berkeley.edu/ yang/software/l1benchmark/index.html while for IHT we applied our own implementation (it only requires a few lines in Matlab) and supplied the hardthresholding operator with the ground truth number of nonzeros, meaning for all experiments. For the learningbased methods, we estimated parameters for both the ISTA and IHTbased networks using MXNet and the exact same training data described above.
7.2.1 Accuracy Results
Figure 2 illustrates how different methods perform under both evaluation metrics. Given the correlated matrix, the recovery performance of IHT, and to a lesser degree minimization using ISTA, is rather modest as expected given that the associated RIP constant will be quite large by construction. In contrast our method achieves uniformly higher accuracy under both metrics, including over existing learningbased methods trained with the same data. This improvement is likely the result of three significant factors: (i) Existing learning methods initialize using weights derived from the original sparse estimation algorithms, but such an initialization will be associated with locally optimal solutions in most cases with correlated dictionaries. (ii) As described in Section 3, constant weights across layers have limited capacity to unravel multiresolution dictionary structure, especially one that is not confined to only possess some low rank correlating component. (iii) The quadratic loss function used by existing methods does not adequately focus resources on the crux of the problem, which is accurate support recovery. In contrast our approach adopts an initialization motivated directly by DNNbased training considerations, unique layer weights to handle a multiresolution dictionary, and a multilabel classification output layer to focus learning on support recovery.
7.2.2 Computational Efficiency
Table 1 displays the average persample runtime required to produce each sparse estimate. Not surprisingly, the learningbased methods display a dramatic advantage over ISTAbased minimization and IHT, both of which require a high number of iterations to converge. In contrast, ISTANet, IHTNet, and our method only involve passing activations through a handful of layers. Although these learningbased approaches all require a potentially expensive training phase, for any task of interest with a fixed matrix, we need only fit the network model once up front and then subsequent testing/deployment will always be much more efficient.
ISTA  IHT  ISTANet  IHTNet  Ours  
runtime  0.7393  0.288  8.17  8.68  9.71 
7.3 Variants of the Basic Experiment
Here we vary a number of different factors from the basic experiment, in each case holding all others fixed.
7.3.1 Varying training set size
In Figure 3, we see that adding more training data to our method can further boost accuracy. This is not a surprise given that it is fundamentally based on learning, and therefore, as long as the capacity of the network allows and optimization ends up in a good basin, we may expect some improvement with additional data.
7.3.2 Alternative network structures
As discussed in Section 5, our DNN design choices were largely motivated by practical issues related to the information and gradient flows to which DNN training can be highly sensitive. In this section we examine different network architectures to quantify essential factors affecting performance. In particular, we consider the following changes:

We remove the residualnet connections.

We replace ReLU with hardthreshold activations. In particular, we utilize the socalled HELU function introduced in (Wang et al., 2015), which is a continuous and piecewise linear approximation of the scalar hardthreshold operator given by

We use a quadratic penalty layer instead of a multilabel classification loss layer, i.e., the loss function is changed to (where is the output of the last fullyconnected layer) during training.
Figure 4 displays the associated recovery percentages, where we observe that in each case performance degrades. Without the residual design, and also with the inclusion of a rigid, nonconvex hardthreshold operator, local minima during training appear to be a likely culprit, consistent with observations from (He et al., 2015a). Likewise, use of a leastsquares loss function is likely to emphasize the estimation of coefficient amplitudes rather than focusing on support recovery.
7.3.3 Different distributions for
From a practical standpoint, we would like to estimate the support pattern of all the significant elements of ; however, these elements need not all have the same amplitudes. Moreover, in practice we may expect that the true amplitude distribution may deviate at times from the original training set. To explore robustness to such mismatch, as well as different amplitude distributions, we consider two sets of candidate data: the original from Section 7.2, and similarlygenerated data but with the uniform distribution of nonzero elements replaced with the Gaussians , where the mean is selected with equal probability as either or , thus avoiding tiny magnitudes with high probability.
Figure 5 reports accuracies under different distributions for both training and testing, including mismatched cases. The label ‘U2U’ refers to training and testing with the uniformly distributed amplitudes described in Section 7.2, while ‘U2N’ uses uniform training set and a Gaussian test set. Analogous definitions apply for ‘N2N’ and ‘N2U’. In all cases we note that the performance is quite stable across training and testing conditions. We would argue that our recasting of the problem as multilabel classification contributes, at least in part, to this robustness. The application example described next demonstrates further tolerance of trainingtesting set mismatches.
8 Practical Application: Photometric Stereo
Photometric stereo represents a powerful technique for recovering highresolution surface normals from a 3D scene using appearance variations in 2D images under different lightings. For example, when images of an ideal Lambertian surface are obtained under illumination from three known directions, the surface orientation can be uniquely determined using a simple leastsquares fit (Woodham, 1980).
In practice however, the estimation process is often disrupted by nonLambertian effects such as specular highlights, shadows, or image noise. To account for such outlying factors, robust estimation methods have been proposed that decompose an observation matrix of stacked images under different lighting conditions into an ideal Lambertian component and a sparse error term (Wu et al., 2010; Ikehata et al., 2012). While principled in theory, this approach requires solving on the order of distinct sparse regression problems, one for each point for which we would like to obtain a surface normal estimate. We will now map this application domain into our sparse DNN framework, which can readily handle the required outlier removal problem potentially ordersofmagnitude faster than existing practical systems, facilitating realtime deployment in mobile environments.
8.1 Problem Details
Suppose we have observations of a given surface point from a Lambertian scene under different lighting directions. Then the resulting measurements, denoted , can be expressed as
(22) 
where denotes the true surface normal, each row of defines a lighting direction, and is the diffuse albedo, acting here as a scalar multiplier (Woodham, 1980). If specular highlights, shadows, or other gross outliers are present, then the observations are more realistically modeled as
(23) 
where is an an unknown sparse vector (Wu et al., 2010; Ikehata et al., 2012). In this revised scenario, we might consider estimating using
(24) 
where is merely the surface normal rescaled with . From this expression, it is apparent that, since is unconstrained, need not compensate for any component of in the range of . Given that is the orthogonal complement to , we may transform (24) to the equivalent problem
(25) 
The constraint is of course equivalent to with and , and so (24) ultimately collapses to our canonical sparse estimation problem from (1). Additionally, given that rows of will form a basis for which is lightinghardware dependent, there are likely to be unavoidable correlations in the dictionary columns.
While existing sparse estimation algorithms can be adopted to solve (25), this is impractical for many realworld applications since the number of surface points can be extremely large (possibly even greater than for highresolution reconstructions). Fortunately though, given that is fixed across all possible scenes and surface points for a given lighting geometry, to apply our method we only need learn a single DNN model, and once trained, testing on novel scenes will be extremely efficient. This allows fast computation of outlier positions via the support of , after which the remaining inlier points can be used to compute surface normals using a traditional least squares fit.
8.2 Results
Following (Ikehata et al., 2012), we use 32bit HDR grayscale images of the object Bunny (256256) with foreground masks under different lighting conditions whose directions, or rows of , are randomly selected from a hemisphere with the object placed at the center. To apply our method, we first compute using the appropriate projection operator derived from the lighting matrix . As realworld training data is expensive to acquire, we instead synthetically generate a training set as follows. First, we draw a support pattern for uniformly at random with cardinality sampled uniformly from the range . The values of and can be tuned in practice. Nonzero values of are assigned iid random values from a Gaussian distribution whose mean and variance are also tunable. Beyond this, no attempt was made to match the true outlier distributions encountered in applications of photometric stereo. Finally, for each we can naturally compute observations , which serve as candidate network inputs.
Average angular error (degrees)  
LS  SBL  Rnd4  Ours  
40  9.33  1.24  0.50  16.64  1.20 
20  10.80  3.96  1.86  18.61  1.95 
10  12.13  7.10  4.02  16.56  1.48 
Runtime (sec.)  

LS  SBL  Rnd4  Ours  
40  7.84  34.9  75.2  0.68  1.25 
20  5.47  34.3  42.5  0.66  1.21 
10  4.10  33.7  59.1  0.64  1.17 
Given synthetic training data acquired in this way, we learn a network with the exact same structure and optimization parameters as in Section 7; no applicationspecific tuning was introduced. We then deploy the resulting network on the grayscale Bunny images.^{9}^{9}9Note that for each different number of images we must train a separate model, since the lighting geometry effectively changes. However, in practice this is not an issue since we only ever need train a single model per hardware configuration. For each surface point, we use our DNN model to approximate (25). Since the network output will be a probability map for the outlier support set instead of the actual values of , we choose the 4 indices with the least probability as inliers and use them to compute via least squares.
We compare our method against the baseline least squares estimate from (Woodham, 1980), norm minimization, and a sparse Bayesian learning (SBL) approach specifically developed in (Ikehata et al., 2012) for surface normal estimation. We also consider a second baseline estimator, denoted ‘Rnd4’, which computes a surface normal estimate using 4 randomly selected indices as putative inliers. As the number of images is varied, we compute the angular error between the recovered normal map by each algorithm and the ground truth.
Results are reported in Table 2 for , which also includes runtime comparisons. In the hardest case, where only images are present, our method significantly outperforms the others. For images our method is still quite competitive, with only SBL offering superior results. However, it must be noted that SBL represents a computationallyexpensive Bayesian approach especially designed for this problem, with runtimes nearly two orders of magnitude higher than our DNN.^{10}^{10}10In fact, the runtime of our method is only twice that of Rnd4, which uses just a single, lowdimensional least squares fit. Additionally, a key reason that our approach does not improve substantially as increases is that we fixed the assumed number of inliers to be 4 in all cases; however, allowing a flexible number that grows with the number of images (as implicitly permitted by SBL) will likely improve performance.
As a complementary perspective, recovered surface normal error maps are displayed in Figure 6 when . Here we observe that our DNN estimates lead to far fewer regions of significant error. Overall though, this application example illustrates that mismatched synthetic training data can, at least for some problem domains, be sufficient to learn a quite useful sparse estimation DNN.
9 Alternative LSTM Networks
Thus far our experimentation has focused on feedforward networks with a residual design. In this section we turn to a recurrent LSTM structure and execute some preliminary evaluations. As highlevel motivation, there are many similarities between unfolded sparse estimation algorithms like the adaptive IHT discussed in Section 4 and an unfolded LSTM network. Both supply an input to every unfolded layer, and both implicitly utilize gating functions to switch activations on or off as learning proceeds, allowing partial support patterns or other information to be remembered in deeper layers. Although we will defer a much more detailed, selfcontained exploration to a future paper, we nonetheless here present an initial empirical proofofconcept using a vanilla form of LSTM network that has not been explicitly tailored for sparse estimation problems beyond the final multilabel classification layer described previously.
Using a cell design from (Hochreiter and Schmidhuber, 1997), we adopt a twolayer LSTM network with a fixed size of 11 steps. Figure 7 presents the specific structure. We compare this network against both our original residual net implementation and SBL. The later was chosen because it represents an algorithm explicitly designed to handle dictionary correlations (Wipf, 2012). For training and testing we use the protocol from Section 7.2, but with nonzero elements of having unit magnitudes. This modification was introduced because it represents a challenging scenario for many traditional sparse optimization algorithms for technical reasons related to local minima detailed in (Wipf et al., 2011). Results are shown in Figure 8, where we observe that the LSTM net is even able to outperform SBL. Further experiments and connecting analyses will be presented in future work for space considerations.


10 Conclusions
There is a clear relationship between iterative optimization rules promoting sparsity (such as those from IHT) and the layers of deep networks. Building from this perspective, in this paper we have shown that deep networks with handcrafted, multiresolution structure can provably solve certain specific classes of sparse recovery problems where existing algorithms fail. However, much like CNNbased features can often outperform SIFT on many computer vision tasks, we argue that a discriminative approach can outperform manual structuring of layers/iterations and compensate for dictionary coherence under more general conditions. We also believe that many of the underlying principles explored herein also transfer to other applications operating at the boundary between optimization and learningbased systems.
There is of course one important caveat with the pursuit of maximal sparsity using a deep network. Sparse estimation problems can be partitioned into two different categories, centered around whether or not the dictionary is reusable. In brief, learningbased methods are only feasible when a fixed (or similar) dictionary can be repeatedly used to represent a signal class of interest. Examples include outlier removal (Candès and Tao, 2005; Ikehata et al., 2012), compressive sensing (Donoho, 2006), and source localization (Baillet et al., 2001; Malioutov et al., 2005) applications where, once learned, a deep model can produce maximally sparse representations as new input signals arrive. In contrast, other influential domains such as subspace clustering (Elhamifar and Vidal, 2013) effectively require solving sparse recovery problems with a novel dictionary at each instance such that any attempt to construct a viable training set would be infeasible. Hence optimizationbased sparse estimation nonetheless remains an important tool regardless of how effective learned models can sometimes be.
Appendix
Here we include technical proofs of our main results. In places we rely on three standard asymptotic notations describing the order of an arbitrary function :
(26)  
10.1 Proof of Proposition 2
Consider some where and for , with . In this restricted setting, it follows that
(27) 
and therefore
(28) 
To ensure that , the largest (in magnitude) elements of must align with , and for all . Together these conditions are necessary to ensure that the operator will produce ; however, they also imply that , where is a zero vector with a ‘1’ in the first position, since no element in the complement of can be larger than . Therefore we require that
(29) 
Of course since this must hold for any , it must be that . Repeating this procedure with for all then leads to the requirement that .
10.2 Proof of Proposition 3
With and the stipulated rescaled dictionary, (10) becomes
(30) 
For the moment, assume that is invertible. Then the iteration (30) is consistent with the modified objective
(31) 
Since with , it follows that and , where , , and . We may then apply Theorem 5 from (Blumensath and Davies, 2009) using this revised system and conclude that
(32) 
which leads to the stated result.
10.3 Proof of Corollary 4
Consider some projection operator onto , using formed with orthonormal rows, which equivalently projects onto the orthogonal complement of . Therefore
(33) 
when . Given that elements of are drawn iid from and has orthonormal rows, elements of will also have iid elements from the same distribution. Hence the stated selections for and allow us to obtain the worstcase upper bound (in expectation) found in the corollary.
10.4 Proof of Proposition 7
Let denote the set of column indeces of associated with and the vectorized concatenation of all , i.e., . We then introduce the following intermediate result.
Lemma 8
By construction of the clustered dictionary model
(34) 
where is a sparse vector such that (which implies that ), and .
Proof: Without loss of generality, we note that assuming , then . To see this, we note that
(35) 
via the triangle inequality and the fact that by assumption. Therefore the normalization constant satisfies
(36) 
We then have