Frank-Wolfe Network: An Interpretable Deep Structure for Non-Sparse Coding
Abstract
The problem of ℓp-norm constrained coding is to convert a signal into a code that lies inside an ℓp-norm ball and most faithfully reconstructs the signal. Previous works under the name of sparse coding considered the cases of the ℓ0 and ℓ1 norms. The cases with p > 1, i.e. the non-sparse coding studied in this paper, remain a difficulty. We propose an interpretable deep structure, namely the Frank-Wolfe Network (FW Net), whose architecture is inspired by unrolling and truncating the Frank-Wolfe algorithm for solving an ℓp-norm constrained problem with p > 1. We show that the Frank-Wolfe solver for the ℓp-norm constraint leads to a novel closed-form nonlinear unit, which is parameterized by p and termed pool_p. The pool_p unit links the conventional pooling, activation, and normalization operations, making FW Net distinct from existing deep networks that are either heuristically designed or converted from projected gradient descent algorithms. We further show that the hyperparameter p can be made learnable instead of pre-chosen in FW Net, which gracefully solves the non-sparse coding problem even with unknown p. We evaluate the performance of FW Net on an extensive range of simulations as well as the task of handwritten digit recognition, where FW Net exhibits strong learning capability. We then propose a convolutional version of FW Net, and apply it to image denoising and super-resolution tasks, where FW Net demonstrates impressive effectiveness, flexibility, and robustness.
I Introduction
I-A ℓp-Norm Constrained Coding
Given a set of m-dimensional vectors {y_i, i = 1, ..., N}, we aim to encode each vector y_i into an n-dimensional code x_i, such that the code reconstructs the vector faithfully. When n > m is the situation of interest, the most faithful coding is not unique. We hence consider the code to bear some structure, or in the Bayesian language, we want to impose some prior on the code. One possible structure/prior is reflected by the ℓp norm, i.e. by solving the following problem:
(1)   min_{D, {x_i}}  Σ_i ||y_i − D x_i||_2^2,   s.t. ||x_i||_p ≤ c,  i = 1, ..., N
where D ∈ R^{m×n} is a linear decoding matrix, and both p and c are constants. In real-world applications, we have difficulty in pre-determining the p value, and the flexibility to learn the prior from the data becomes more important, which can be formulated as the following problem:
(2)   min_{D, {x_i}, p}  Σ_i ||y_i − D x_i||_2^2,   s.t. ||x_i||_p ≤ c,  i = 1, ..., N,  p ∈ P
where P is a domain of p values of interest. For example, P = [1, ∞) defines a domain that ensures the constraint to be convex with regard to x_i. In this paper, (1) and (2) are termed ℓp-norm constrained coding, and for distinguishing purposes we refer to (1) and (2) as known p and unknown p, respectively.
I-B Motivation
For known p, if the decoding matrix is given a priori as D, then it is sufficient to encode each y_i individually, by means of solving:
(3)   min_x  ||y − D x||_2^2,   s.t. ||x||_p ≤ c
or its equivalent, unconstrained form (given a properly chosen Lagrange multiplier λ):
(4)   min_x  ||y − D x||_2^2 + λ ||x||_p^p
Eq. (4) is well known as an ℓp-norm regularized least squares problem, which arises in many disciplines of statistics, signal processing, and machine learning. Several special p values, such as 0, 1, 2, and ∞, have been well studied. For example, when p = 0, ||x||_0 measures the number of nonzero entries in x, thus minimizing the ℓ0 term induces a code that is as sparse as possible. While sparsity is indeed a goal in several situations, ℓ0-norm minimization is NP-hard [1]. Researchers then proposed to adopt the ℓ1 norm, which gives rise to a convex and easier problem, to approximate the ℓ0 norm. It was shown that, under some mild conditions, the resulting code of using the ℓ1 norm coincides with that of using the ℓ0 norm [2]. ℓ1-norm regularization was previously known in lasso regression [3] and basis pursuit [4]. Due to its enforced sparsity, it led to the great success of compressive sensing [5]. The cases of p > 1 lead to non-sparse codes that were also investigated. ℓ2-norm regularization was extensively adopted under the name of weight decay in machine learning [6]. The ℓ∞ norm was claimed to help spread information evenly among the entries of the resultant code, which is known as democratic [7] or spread representations [8], and benefits vector quantization [9], hashing [10], and other applications.
Besides the above-mentioned special values, other general p values were much less studied due to mathematical difficulty. Nonetheless, it was observed that a general p indeed could help in specific application domains, where using the special values might be overly simplistic. For sparse coding, ℓp norms with p ≤ 1 all enforce sparsity to some extent, yet with different effects. In compressive sensing, it was known that using the ℓp norm with p < 1 needs fewer measurements to reconstruct a sparse signal than using the ℓ1 norm; regarding computational complexity, solving ℓp-norm (p < 1) regularization is more difficult than solving ℓ1-norm regularization but still easier than solving ℓ0-norm regularization [11]. Further studies found that the choice of p crucially affected the quality of results and noise robustness [12]. Xu et al. endorsed the adoption of ℓ1/2-norm regularization and proposed an iterative half thresholding algorithm [13]. In image deconvolution, Krishnan and Fergus investigated the ℓ1/2 and ℓ2/3 norms and claimed their advantages over the ℓ1 norm [14]. For non-sparse coding, ℓp norms with p > 1 and different p's have distinct impacts on the solution. For example, Kloft et al. argued for the ℓp norm with p > 1 in the task of multiple kernel learning [15].
Now let us return to the problem (1) where the decoding matrix D is unknown. As discussed above, the ℓ1 norm induces sparse representations, thus the special case of ℓp-norm constrained coding with p = 1 (similarly p < 1) is to pursue a sparse coding of the given data. While sparse coding has great interpretability and a possible relation to visual cognition [16, 17], and was widely adopted in many applications [18, 19, 20, 21], we are inspired by the studies showing that a general p performs better than the special values, and ask: is a general ℓp norm able to outperform the ℓ0/ℓ1/ℓ2/ℓ∞ norms on a specific dataset for a specific task? If yes, then which p will be the best (i.e. the case of unknown p)? These questions seem not to have been investigated before, to the best of our knowledge. In particular, non-sparse coding, i.e. p > 1, is clearly different from sparse coding and is the major theme of our study.
I-C Outline of Solution
Analytically solving the ℓp-norm constrained problems is very difficult as they are not convex optimization problems. When designing numerical solutions, note that there are two (resp. three) groups of variables in (1) (resp. (2)), and it is natural to perform alternate optimization over them iteratively. Previous methods for sparse coding mostly follow this approach, for example in [16] and in the well-known K-SVD method [22]. One clear drawback of this approach is the high computational complexity due to its iterative nature. A variant that uses early-stopped iterative algorithms to provide fast approximate solutions was discussed in [23], built on the overly strong assumption of close-to-orthogonal bases [24]. Besides the high complexity, it is not straightforward to extend the sparse coding methods to the cases of non-sparse coding, since they usually leverage the premise of a sparse code heuristically, but for general p it is hard to design such heuristics.
A different approach for sparse coding was proposed by Gregor and LeCun [25], where an iterative algorithm known as iterative shrinkage and thresholding (ISTA), previously used for ℓ1-norm regularized least squares, is unrolled and truncated to construct a multi-layer feed-forward network. The network can be trained end-to-end to act as a regressor from a vector to its corresponding sparse code. Note that truncation makes the network computationally cheaper than the original algorithm, while training helps compensate for the error due to truncation. The trained network, known as learned ISTA (LISTA), is then a fast alternative to the original ISTA algorithm. Following the approach of LISTA, several recent works consider the cases of the ℓ0 norm [26] and ℓ∞ norm [27], as well as extend other iterative algorithms to their network versions [28, 29, 30]. However, LISTA and its follow-up works are not applicable for solving the cases of general p, because they all refer to the same category of first-order iterative algorithms, i.e. the projected gradient descent (PGD) algorithms. For general p, the projection step in PGD is not analytically solvable. In addition, such works have more difficulty in solving the unknown p problem.
Beyond the family of PGD, another first-order iterative algorithm is the Frank-Wolfe algorithm [31], which is free of projection. This characteristic inspires us to adapt the Frank-Wolfe algorithm to solve the general p problem. Similar to LISTA, we unroll and truncate the Frank-Wolfe algorithm to construct a network, termed the Frank-Wolfe network (FW Net), which can be trained end-to-end. Although the convergence of the original Frank-Wolfe algorithm is slow, we will show that FW Net can converge faster than not only the original Frank-Wolfe and ISTA algorithms, but also LISTA. Moreover, FW Net has a novel, closed-form, nonlinear computation unit that is parameterized by p; as p varies, that unit displays the behaviors of several classic pooling operators, and can be naturally viewed as a cascade of a generalized normalization and a parametric activation. Since p becomes a (hyper)parameter in FW Net, we can either set p a priori or make it learnable when training FW Net. Thus, FW Net has higher learning flexibility than LISTA, and training FW Net gracefully solves the unknown p problem.
I-D Our Contributions
To the best of our knowledge, we are the first to extend the sparse coding problem to ℓp-norm constrained coding with general and unknown p; we are also the first to unroll and truncate the Frank-Wolfe algorithm to construct a trainable network. Technically, we make the following contributions:

We propose the Frank-Wolfe network (FW Net), whose architecture is inspired by unrolling and truncating the Frank-Wolfe algorithm for solving the ℓp-norm constrained least squares problem. FW Net features a novel nonlinear unit that is parameterized by p and termed pool_p. The pool_p unit links the conventional pooling, activation, and normalization operations in deep networks. FW Net is verified to solve non-sparse coding with general known p better than the existing iterative algorithms. More importantly, FW Net solves non-sparse coding with unknown p at low computational cost.

We propose a convolutional version of FW Net, which extends the basic FW Net by adding convolutions and applying the pool_p unit point by point across different channels. The convolutional (Conv) FW Net can be readily applied to image-related tasks.

We evaluate the performance of FW Net on an extensive range of simulations as well as a handwritten digit recognition task, where FW Net exhibits strong learning capability. We further apply Conv FW Net to image denoising and super-resolution, where it also demonstrates impressive effectiveness, flexibility, and robustness.
I-E Paper Organization
The remainder of this paper is organized as follows. Section II reviews the literature in four areas: ℓp-norm constrained coding and its applications, deep structured networks, the original Frank-Wolfe algorithm, and nonlinear units in networks, from which we can see that FW Net connects and integrates these separate fields. Section III formulates FW Net, with detailed discussions of its motivation, structure, interpretation, and implementation issues. Section IV validates the performance of FW Net on synthetic data under an extensive range of settings, as well as on a toy problem, handwritten digit recognition with the MNIST dataset. Section V discusses the proposed convolutional version of FW Net. Section VI then provides experimental results of using Conv FW Net for image denoising and super-resolution, with comparison to some other CNN models. Section VII concludes this paper. For reproducible research, our code and pre-trained models have been published online.
II Related Work
II-A ℓp-Norm Constrained Coding and Its Applications
Sparse coding, as a representative linear representation method, has been widely used in signal processing and computer vision, for tasks such as image denoising, deblurring, image restoration, super-resolution, and image classification [18, 32, 19, 21]. The sparse representation aims to preserve the principal components and reduce the redundancy in the original signal. From the viewpoint of the norm minimization used in the sparsity constraint, these methods can be roughly categorized into the following groups: 1) ℓ0-norm minimization; 2) ℓp-norm (0 < p < 1) minimization; 3) ℓ1-norm minimization; and 4) ℓ2,1-norm minimization. In addition, ℓ2-norm minimization is extensively used, but it does not lead to sparse solutions.
A variety of dictionary learning methods have been proposed and implemented based on sparse representation. K-SVD [22] seeks an overcomplete dictionary from given training samples under an ℓ0-norm constraint, and achieves good performance in image denoising. Wright et al. [32] proposed a general classification algorithm for face recognition based on ℓ1-norm minimization, which shows that if sparsity is introduced into the recognition problem properly, the choice of features is not crucial. Krogh et al. [6] showed that limiting the growth of weights through an ℓ2-norm penalty can improve generalization in a feed-forward neural network. This simple weight decay has been widely adopted in machine learning.
II-B Deep Structured Networks
Deep networks are typically stacked with off-the-shelf building blocks that are jointly trained with simple loss functions. Since many real-world problems involve predicting statistically dependent variables or related tasks, deep structured networks [33, 34] were proposed to model complex patterns by taking such dependencies into account. Among many efforts, a noticeable portion has been devoted to unrolling traditional optimization and inference algorithms into deep, end-to-end trainable forms.
Gregor and LeCun [25] first leveraged this idea to construct feed-forward networks as fast trainable regressors to approximate sparse code solutions, and their idea was expanded by many successors, e.g. [35, 26, 27, 36, 37, 38, 39]. Those works show the benefits of incorporating the problem structure into the design of deep architectures, in terms of both performance and interpretability [40]. A series of works [41, 42, 43, 44] established the theoretical background of this methodology. Note that many previous works [25, 35, 26, 41] built their deep architectures on projected gradient descent type algorithms, e.g. the iterative shrinkage and thresholding algorithm (ISTA); the projection step turned into the nonlinear activation function. Wang et al. [29] converted proximal methods to deep networks with continuous output variables. More examples include the message-passing inference machine [28], shrinkage fields [45], CRF-RNN [46], and ADMM-Net [30].
II-C Frank-Wolfe Algorithm
The Frank-Wolfe algorithm [31], also known as conditional gradient descent, is one of the simplest and earliest known iterative solvers for the generic constrained convex problem:
(5)   min_{x ∈ C}  f(x)
where f is a convex and continuously differentiable objective function, and C is a convex and compact subset of a Hilbert space. At each step, the Frank-Wolfe algorithm first considers the linear approximation of f, and then moves towards the minimizer of this linear approximation taken over C. Section III presents a concrete example of applying the Frank-Wolfe algorithm.
The Frank-Wolfe algorithm has lately regained popularity due to its promising applicability in handling structural constraints, such as sparsity or low rank. The Frank-Wolfe algorithm is projection-free: while competing methods such as projected gradient descent and proximal algorithms need to take a projection step back to the feasible set per iteration, the Frank-Wolfe algorithm only solves a linear problem over the same set in each iteration, and automatically stays in the feasible set. For example, sparsity regularized problems are commonly relaxed into convex optimization over convex hulls of atomic sets, especially ℓ1-norm constrained domains [47], which makes the Frank-Wolfe algorithm easily applicable. We refer the readers to the comprehensive reviews in [48, 49] for more details about the algorithm.
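As a concrete illustration of the projection-free property (a minimal sketch of our own, not code from the paper), consider least squares over an ℓ1 ball. The linear minimization oracle simply picks the gradient coordinate of largest magnitude, so no projection is ever needed:

```python
import numpy as np

def frank_wolfe_l1(D, y, c=1.0, n_iter=50):
    """Minimal Frank-Wolfe sketch for: min_x 0.5*||y - D x||^2  s.t.  ||x||_1 <= c.

    The linear minimization oracle (LMO) over an l1 ball returns a signed
    vertex of the ball, found by scanning the gradient -- no projection step.
    """
    x = np.zeros(D.shape[1])
    for k in range(n_iter):
        grad = D.T @ (D @ x - y)            # gradient of the least-squares objective
        i = np.argmax(np.abs(grad))         # LMO: largest-magnitude gradient entry
        s = np.zeros_like(x)
        s[i] = -c * np.sign(grad[i])
        gamma = 2.0 / (k + 2.0)             # classic diminishing step size
        x = (1.0 - gamma) * x + gamma * s   # convex combination stays feasible
    return x
```

Because every iterate is a convex combination of points inside the ball, feasibility is maintained automatically throughout.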
II-D Nonlinear Units in Networks
There have been blooming interests in designing novel nonlinear activation functions [50], a few of which have parametric and learnable forms, such as the parametric ReLU [51]. Among existing deep structured networks, e.g. [25, 35, 26], the activation functions usually took fixed forms (e.g. some variants of ReLU) that reflected pre-chosen structural priors. A data-driven scheme is presented in [52] to learn optimal thresholding functions for ISTA. The adopted parametric representations led to spline curve-type activation functions, which reduced the estimation error compared to using the common (fixed) piecewise linear functions.
As another major type of nonlinearity in deep networks, pooling was originally introduced as a dimension-reduction tool to aggregate a collection of inputs into low-dimensional outputs [53]. Other than the input-output dimensions, the difference between activation and pooling also lies in that activation is typically applied element-wise, while pooling operates on groups of hidden units, usually within a spatial neighborhood. It was proposed in [54] to learn task-dependent pooling and to adaptively reshape the pooling regions. More learnable pooling strategies are investigated in [55], via either mixing two different pooling types or a tree-structured fusion. Gulcehre et al. [56] introduced a unit that computes a normalized ℓp norm over a set of outputs, with the value of p learnable.
In addition, we find our proposed nonlinear unit inherently related to normalization techniques. Jarrett et al. [53] demonstrated that a combination of nonlinear activation, pooling, and normalization improved object recognition. Batch normalization (BN) [57] rescales the summed inputs of neurons over training batches, and substantially accelerates training. Layer normalization (LN) [58] normalizes the activations across all activities within a layer. Ren et al. [59] re-exploited the idea of divisive normalization (DN) [60, 61], a well-grounded transformation in real neural systems. The authors viewed both BN and LN as special cases of DN, and observed improvements by applying DN to a variety of tasks.
III Frank-Wolfe Network
III-A Frank-Wolfe Solver for ℓp-Norm Constrained Least Squares
We investigate the ℓp-norm constrained least squares (NCLS) problem as a concrete example to illustrate the construction of FW Net. The proposed methodology can certainly be extended to more generic problems (5). Let y ∈ R^m denote the input signal, and x ∈ R^n denote the code (a.k.a. representation, feature vector). D ∈ R^{m×n} is the decoding matrix (a.k.a. dictionary, bases). The problem considers C to be an ℓp-norm ball of radius c, with p ≥ 1 to ensure the convexity of C, and f to be a least squares function:
(6)   min_x  f(x) = (1/2) ||y − D x||_2^2,   s.t. x ∈ C = {x : ||x||_p ≤ c}
We initialize x^(0) = 0. At iteration k, the Frank-Wolfe algorithm iteratively updates two variables to solve (6):
(7)   s^(k) = argmin_{s ∈ C} ⟨s, ∇f(x^(k))⟩,    x^(k+1) = (1 − γ_k) x^(k) + γ_k s^(k)
where ∇f(x^(k)) = D^T (D x^(k) − y). γ_k ∈ [0, 1] is the step size, which is typically set as γ_k = 2/(k + 2) or chosen via line search. By Hölder's inequality, the solution of s^(k) is:
(8)   s_i^(k) = −c · sign(g_i) · |g_i|^(q−1) / ||g||_q^(q−1),   where g = ∇f(x^(k)) and q = p/(p − 1)
where s_i^(k) and g_i denote the i-th entries of s^(k) and g, respectively. It is interesting to examine a few special p values in (8) (ignoring the negative sign for simplicity):

p = 1: s^(k) selects the largest-magnitude entry in g, while setting all other entries to zero;
p = 2: s^(k) rescales g by its root mean square;

p = ∞: all entries of s^(k) have the equal magnitude c.
The three special cases are easily reminiscent of the behaviors of max pooling, root-mean-square pooling, and average pooling [53, 21], although the input and output are both n-dimensional and no dimensionality reduction is performed. For a general p, s^(k) is expected to exhibit more varied behaviors that remain interpretable from a pooling perspective. We thus denote by pool_p the function R^n → R^n, pool_p(g) = s^(k) as given by (8), to represent the operation associated with a specific p.
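The closed-form solution (8) is straightforward to implement. The sketch below is our own minimal version of the unit the text terms pool_p (with the radius c kept as an argument); it reproduces the p = 2 special case exactly:

```python
import numpy as np

def pool_p(g, p, c=1.0):
    """Closed-form LMO over the lp ball (Eq. (8)), for 1 < p < inf.

    With q = p/(p-1) the Holder conjugate exponent:
        s_i = -c * sign(g_i) * |g_i|**(q-1) / ||g||_q**(q-1)
    """
    q = p / (p - 1.0)
    norm_q = np.sum(np.abs(g) ** q) ** (1.0 / q)    # ||g||_q
    return -c * np.sign(g) * np.abs(g) ** (q - 1.0) / norm_q ** (q - 1.0)
```

For p = 2 this reduces to −c · g / ||g||_2, and in general the output always lands exactly on the ℓp sphere of radius c, so a Frank-Wolfe iterate built from it never leaves the feasible set.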
III-B Constructing the Network
Following the well-received unrolling-then-truncating methodology, e.g. [25], we represent the Frank-Wolfe solver for (6) that runs a finite number of iterations (K) as a multi-layer feed-forward neural network. Fig. 1 depicts the resulting architecture, named the Frank-Wolfe Network (FW Net). As a custom, we set x^(0) = 0, which is an interior point of C for any p. The layer-wise weights can be analytically constructed. Specifically, note that ∇f(x^(k)) = D^T D x^(k) − D^T y; if we define
(9)   W = D^T,   S = D^T D,
then we have s^(k) = pool_p(S x^(k) − W y), as depicted in Fig. 1. Given the pre-chosen hyperparameters, and without any further tuning, FW Net outputs a K-iteration approximation of the exact solution to (6). As marked out, the intermediate outputs of FW Net in Fig. 1 are aligned with the iterative variables in the original Frank-Wolfe solver (7). Note that the two-variable update scheme in (7) naturally leads to two groups of "skip connections," which might be reminiscent of ResNet [63].
As many existing works did [25, 35, 26], the unrolled and truncated network in Fig. 1 can alternatively be treated as a trainable regressor to predict the exact x from y. We further view FW Net as composed of two subnetworks: (1) a single-layer initialization subnetwork, consisting of a linear layer W and a subsequent pool_p operator. It provides a rough estimate of the solution, x^(1) = γ_0 · pool_p(−W y), which appears similar to typical sparse coding solvers that often initialize x with a thresholded D^T y [64]. Note that x^(1) has a direct shortcut path to the output, with multiplicative weights given by the products of (1 − γ_k); (2) a recurrent refinement subnetwork, which is unrolled to repeat the layer-wise transform K − 1 times to gradually refine the solution.
FW Net is designed for high learning flexibility, by fitting almost all of its parameters and hyperparameters from training data:

Layer-wise weights. Eq. (9) can be used to initialize the weights, but during training, W and S are untied from D and viewed as conventional fully-connected layers (without biases). All weights are learned with back-propagation from end to end. In some cases (e.g. Table III), we find that sharing the S's across layers is a choice worthy of consideration: it effectively reduces the actual number of parameters and makes the training converge fast. But not sharing the weights is always helpful to the performance, as observed in our experiments. Thus, the weights are not shared unless otherwise noted. In addition, the relation between W and S in (9) is not enforced during training; S is treated as a simple fully-connected layer without additional constraints.

Hyperparameters. p and γ_k were given or pre-chosen in (7). In FW Net, they can be either pre-chosen or made learnable. For learnable hyperparameters, we also compute the gradients w.r.t. p and γ_k, and update them via back-propagation. We adopt the same p throughout FW Net, as p implies the structural prior, while γ_k is set to be independent per layer. Learning p and γ_k also adaptively compensates for the truncation effect [52] of iterative algorithms. In addition, learning p gracefully solves the unknown p problem (2).
FW Net can further be jointly tuned with a task-specific loss function, e.g. the softmax loss for classification, or the mean-squared-error loss and/or a semantic guidance loss [65] for denoising, in the form of an end-to-end network.
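Putting the pieces together, one forward pass of the unrolled network can be sketched as follows (our own notation: pool_p is the unit from Eq. (8); in the trained network W, S, γ_k and p would be learnable rather than fixed, and the weights may be untied per layer):

```python
import numpy as np

def fw_net_forward(y, W, S, gammas, p, c=1.0):
    """Sketch of a K-layer FW Net forward pass (K = len(gammas)).

    At initialization W = D^T and S = D^T D as in Eq. (9); after training
    they are untied, conventional fully-connected weights in general.
    """
    q = p / (p - 1.0)

    def pool(g):  # closed-form LMO over the lp ball, Eq. (8)
        norm_q = np.sum(np.abs(g) ** q) ** (1.0 / q)
        return -c * np.sign(g) * np.abs(g) ** (q - 1.0) / norm_q ** (q - 1.0)

    x = np.zeros(S.shape[0])                  # x^(0) = 0, interior point for any p
    for gamma in gammas:
        g = S @ x - W @ y                     # (surrogate of) the gradient D^T(Dx - y)
        x = (1.0 - gamma) * x + gamma * pool(g)
    return x
```

Note that the first layer (gamma_0 = 1 under the 2/(k+2) schedule) reduces to the initialization subnetwork x^(1) = pool_p(−W y) described above.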
III-C Implementing the Network
Reformulating pool_p as normalization plus neuron
A closer inspection of the structure of the pool_p function (8) leads to a two-step decomposition (let c = 1 for simplicity):
(10)   Step 1: u = g / ||g||_q;   Step 2: pool_p(g)_i = −sign(u_i) · |u_i|^(q−1)
Step 1 performs a normalization under the ℓq norm, where q = p/(p − 1) happens to be the Hölder conjugate of p: we thus call this step conjugate normalization. It coincides with a simplified form of DN [59], obtained by setting all summation and suppression weights of DN to 1.
Step 2 takes the form of an exponential-type, non-saturating, element-wise activation function [66], and is a learnable activation parameterized by p [50]. As the output range of Step 1 is [−1, 1], the exponent q − 1 displays a suppression effect when q > 2 (i.e. p < 2), and amplifies entries when q < 2 (i.e. p > 2).
While the decomposition of (8) is not unique, we carefully choose (10) due to its effectiveness and numerical stability. As a counterexample, suppose we adopt another, more straightforward decomposition of (8):
(11)   Step 1: v_i = sign(g_i) · |g_i|^(q−1);   Step 2: pool_p(g) = −v / ||g||_q^(q−1)
Then, large entries (|g_i| > 1) will be boosted by the power of q − 1 in Step 1 when q is large (i.e. p is close to 1), and the second step may run the risk of numerical explosion, in both the feed-forward and back-propagation passes. In contrast, (10) first squashes each entry into [−1, 1] (Step 1) before feeding it into the exponential neuron (Step 2), resolving the numerical issue well in practice.
The observation (10), called "pool_p = normalization + neuron" for brevity, provides a new and interesting insight into the connection between neuron, pooling, and normalization, the three major types of nonlinear units in deep networks that were once considered separately. By splitting pool_p into two modules sharing the parameter q, the back-propagation computation is also greatly simplified, as directly computing the gradient of pool_p w.r.t. p can be quite involved, with more potential numerical problems.
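The stable decomposition (10) can be sketched as follows (function names are ours; c = 1 as in the text). Because Step 1 maps every entry into [−1, 1], the exponent in Step 2 never sees large inputs, unlike the naive ordering (11), which raises raw entries to the power q − 1 first:

```python
import numpy as np

def conjugate_normalize(g, q):
    """Step 1: normalize by the lq norm; every output entry lies in [-1, 1]."""
    return g / np.sum(np.abs(g) ** q) ** (1.0 / q)

def exp_neuron(u, q):
    """Step 2: element-wise exponential-type activation |u|^(q-1), sign kept."""
    return np.sign(u) * np.abs(u) ** (q - 1.0)

def pool_p_stable(g, p):
    """pool_p computed as 'normalization + neuron' (Eq. (10)), with c = 1."""
    q = p / (p - 1.0)
    return -exp_neuron(conjugate_normalize(g, q), q)
```

The two modules share the single parameter q, which is also what makes the gradient w.r.t. p tractable to compute module by module.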
Network initialization and training
The training of FW Net evidently benefits from high-quality initializations. Although W and S are disentangled from D during training, they can be well initialized from the given or estimated D.
In the original NCLS, c = 1 is assumed to avoid the scale ambiguity, as is commonly enforced in dictionary learning [22]. Similarly, to restrain the magnitude growth of the weights, we adopt the ℓ2-norm weight decay regularizer, with a fixed default coefficient in experiments.
III-D Interpretation of the Frank-Wolfe Network as an LSTM
The ISTA algorithm [40] has been interpreted as a stack of plain recurrent neural networks (RNNs). We hereby show that FW Net can be viewed as a special case of the Long Short-Term Memory (LSTM) architecture [67], which incorporates a gating mechanism [68] and may thus capture long-term dependencies better than plain RNNs when K is large. Although we include no such experiment in this paper, we are interested in applying FW Net as an LSTM to model sequential data in future work.
Let us think of W y as a constant input, and of x^(k) as the previous hidden state at the k-th step. The current candidate hidden state is computed by s^(k) = pool_p(S x^(k) − W y), where the sophisticated pool_p function replaces the common tanh as the activation function. γ_k and 1 − γ_k each take the role of the input gate and forget gate, controlling how much of the newly computed s^(k) and of the previous state x^(k) will go through, respectively. Their combination constitutes the internal memory of the current unit. Eventually, with a simple output gate equal to 1, the new hidden state is updated to x^(k+1) = (1 − γ_k) x^(k) + γ_k s^(k).
IV Evaluation of the Frank-Wolfe Network
IV-A Simulations
We generate synthetic data in the following steps. First we generate a random vector v and project it onto the ℓp ball of radius c. The projection is achieved by solving the problem min_x ||x − v||_2^2 s.t. ||x||_p ≤ c, using the original FW algorithm until convergence. This p will be termed the real p in the following. We then create a random matrix D, and obtain y = D x + e, where e is additive white Gaussian noise. With default values for the dimensions, radius, and noise variance, we generate 15,000 samples for training. A testing set of 1,000 samples is generated separately in the identical manner. Our goal is to train an FW Net to predict/regress x from the given observation y. The performance is measured by the mean-squared error (MSE) between the predicted and ground-truth x. All network models are implemented with Caffe [69] and trained using the MSE between ground-truth and network-output vectors as the loss.
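The pipeline above can be sketched as follows. This is our own illustration: the function names, dimensions, and noise level are illustrative placeholders, not the paper's (elided) default values:

```python
import numpy as np

def project_lp_ball(v, p, c=1.0, n_iter=200):
    """Project v onto {x : ||x||_p <= c} by running Frank-Wolfe on
    min_x 0.5*||x - v||^2 until (approximate) convergence. A sketch."""
    q = p / (p - 1.0)
    x = np.zeros_like(v)
    for k in range(n_iter):
        g = x - v                            # gradient of 0.5*||x - v||^2
        denom = np.sum(np.abs(g) ** q)
        if denom < 1e-12:                    # v itself is (numerically) reached
            break
        s = -c * np.sign(g) * np.abs(g) ** (q - 1.0) / denom ** ((q - 1.0) / q)
        gamma = 2.0 / (k + 2.0)
        x = (1.0 - gamma) * x + gamma * s
    return x

def make_synthetic(n_samples, dim_x, dim_y, p, c=1.0, sigma=0.1, seed=0):
    """Sketch of the simulation pipeline: draw x in the lp ball, then
    observe y = D x + Gaussian noise. Sizes here are illustrative."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((dim_y, dim_x))
    X = np.stack([project_lp_ball(rng.standard_normal(dim_x), p, c)
                  for _ in range(n_samples)])
    Y = X @ D.T + sigma * rng.standard_normal((n_samples, dim_y))
    return X, Y, D
```

Since every Frank-Wolfe iterate is a convex combination of feasible points, each generated code satisfies ||x||_p ≤ c exactly, by construction rather than by post-hoc clipping.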
We first choose the real p, vary the number of layers/iterations K, and compare the following methods:

CVX: solving the problem using the ground-truth D and the real p. The CVX package [70] is employed to solve this convex problem. No training process is involved here.

Original FW: running the original Frank-Wolfe algorithm for K iterations, using the ground-truth D and the real p, and fixing γ_k = 2/(k + 2). No training process is involved here.

MLP: replacing the pool_p in FW Net with ReLU, yielding a feed-forward network of fully-connected layers, which has the same number of parameters as FW Net (except p).

FW Net: the proposed network that jointly learns the weights, p, and γ_k.

FW fixed {p, γ}: fixing p to the real p and γ_k = 2/(k + 2), learning the weights in FW Net.

FW fixed γ: fixing γ_k = 2/(k + 2), learning the weights and p in FW Net.

FW fixed {wrong p, γ}: fixing p to an incorrect value (i.e. an incorrect structural prior) and γ_k = 2/(k + 2), learning the weights in FW Net.
Fig. 2 (a) depicts the training curve of FW Net at K = 8. We also run the Original FW (for 8 iterations) for comparison. After a few epochs, FW Net is capable of achieving significantly smaller error than the Original FW, and converges stably. This shows the advantage of training a network to compensate for the truncation error caused by limited iterations. It is noteworthy that FW Net has the ability to adjust the dictionary-related weights, the prior p, and the step sizes γ_k at the same time, all of which are learned from training data. Comparing the test errors of FW Net and the Original FW at different K's (Fig. 2 (b)), FW Net shows superior flexibility, especially when K is small or even when p is wrong. During training, FW Net learns to find a group of parameters (weights, p, and γ_k) that coordinate with each other to yield better performance.
Fig. 2 (b) compares the testing performance of the different methods at different K's. FW Net with learnable parameters achieves the best performance. The Original FW performs poorly at K = 2, and its error barely decreases as K increases to 4 and 8, since the original FW algorithm is known to converge slowly in practice. CVX does not perform well either, as this problem is quite difficult. Though having the same number of parameters, the MLP does not perform well, which indicates that this synthetic task is not trivial. Notably, the original Frank-Wolfe algorithm also significantly outperforms the MLP after 4 iterations.
Furthermore, we reveal the effect of each component (the weights, p, and γ_k). First, we fix p and γ_k, i.e. we learn only the weights, in contrast to the original FW algorithm. Observing the two curves of testing performance, FW Net improves the performance with learnable weights, which coordinate with the fixed p and γ_k, and achieves a better approximation of the target in a few steps. Then, we let p be learnable and fix only γ_k, which is slightly superior to FW fixed {p, γ}. FW Net also maintains a small but consistent margin over FW fixed γ. These three comparisons confirm that, in addition to the weights, learning p and γ_k are both useful. Finally, we give a wrong p to measure its influence on FW Net. It is noteworthy that FW fixed {wrong p, γ} suffers from much larger error than the other FW networks. That demonstrates the huge damage an incorrect or inaccurate structural prior can cause to the learning process. But, as mentioned before, FW Net has the flexibility to adjust the other parameters: even under the wrong prior, FW Net still outperforms the original FW algorithm within a few steps.
Fig. 3 inspects the learning of p and γ_k. As seen from Fig. 3 (a), the p value fluctuates in the middle of training, but ends up converging stably. In Fig. 3 (c), as K goes up, the p learned by FW Net gradually approaches the real p. This phenomenon can be interpreted as follows: the original FW algorithm cannot solve the problem well in only a few steps, so FW Net adaptively controls each component, learned from the training data, to approximate the distribution of x as closely as possible. To understand why the learned p is usually larger than the real p, we may intuitively note that pool_p "compresses" the input more heavily as p gets closer to 1. While the original Frank-Wolfe algorithm may run many iterations to predict x, FW Net has to achieve this within a much smaller, fixed number of iterations. Thus, each pool_p has to let more energy "pass through," and the learned p is larger. Fig. 3 (b) observes the change of γ_k before and after training, at K = 8. While γ_0 remains almost unchanged, the later γ_k (k ≥ 1) all decrease. As a possible reason, FW Net might try to compensate for the truncation error by raising the weight of the initialization x^(1) in the final output x^(K).
TABLE I: MSE and learned p at K = 2, 4, 8 (real p = 1)

Method | CVX       | FW Net (K=2/4/8)         | FW Net with fixed p (K=2/4/8) | LISTA (K=2/4/8)
MSE    | 0.0036    | 0.5641 / 0.3480 / 0.1961 | 1.2076 / 0.8604 / 0.7358      | 0.4157 / 0.3481 / 0.3053
p      | 1 (fixed) | 1.360 / 1.254 / 1.222    | 1 (fixed)                     | 1 (fixed)
We then look into the case of real p = 1, and regenerate the synthetic data. We compare three methods as defined before: CVX, FW Net, and FW Net with fixed p. In addition, we add LISTA [25] into the comparison, because LISTA is designed for p = 1. The depth, layer-wise dimensions, and weight-sharing of LISTA are configured identically to FW Net. Note that LISTA is dedicated to the ℓ1 case and cannot be easily extended to general p. We re-tune all the parameters to get the best performance with LISTA to ensure a fair comparison. Table I compares the results at K = 2, 4, 8. CVX is able to solve the ℓ1 problem to a much better accuracy than the previous case, and FW Net still outperforms FW Net with fixed p. More interesting is the comparison between FW Net and LISTA: FW Net is outperformed by LISTA at K = 2, then reaches a draw at K = 4, and eventually outperforms LISTA by a large margin at K = 8. Increasing K yields a more substantial boost to the performance of FW Net than to that of LISTA, which can be interpreted, as discussed in Section III-D, by FW Net being a special LSTM while LISTA is a vanilla RNN [40]. Note that the real p is 1 (corresponding to a sparse x), but the p learned by FW Net is larger than 1 (corresponding to non-sparse). The success of FW Net does not imply that the problem itself is a non-sparse coding one. This is also true for all the following experiments.
We also simulate with other real $p$ values. Fig. 3 (d) shows the learned $p$ values with respect to different real $p$'s. When the real $p$ approaches 1 or 2, FW Net is able to estimate the value of $p$ more accurately, probably because the convex problems with $\ell_1$ and $\ell_2$ norm minimization can be solved more easily.
IV-B Handwritten Digit Recognition
Similar to what has been done in [25], we adopt FW Net as a feature extractor and then use logistic regression to classify the features for the task of handwritten digit recognition. We experiment on the MNIST dataset [71]. We design the following procedure to pre-train the FW Net as a feature extractor. The original images are dimensionality-reduced to 128-dim by PCA for input to FW Net. Then we further perform PCA on each digit class separately to reduce the dimension to 15. We construct a 150-dim sparse code for each image, whose 150 dimensions are divided into 10 groups corresponding to the 10 digits; only the 15 dimensions of the corresponding group are filled by that digit's PCA result, whereas all the other dimensions are zero. This sparse code is regarded as the "ground-truth" for FW Net in training. Accordingly, the transformation matrices of the second-step PCA are concatenated to serve as the dictionary $\mathbf{D}$, which is used to initialize the fully-connected layers in FW Net according to (9). In this experiment, we use stochastic gradient descent (SGD) with a momentum of 0.9 and a mini-batch size of 100. The FW Net is pre-trained for 200 epochs, and the initial learning rate is 0.1, decayed to 0.01 at 100 epochs. Then the pre-trained FW Net is augmented with a randomly initialized fully-connected layer with softmax, resulting in a network that can be trained end-to-end for classification. We observe that the performance benefits from joint training marginally but consistently.
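The ground-truth code construction described above can be sketched as follows (a minimal sketch assuming the input is already PCA-reduced to 128-dim; the function names and the synthetic sanity data are our own, and mean subtraction is omitted in the projection for brevity):

```python
import numpy as np

def pca_basis(X, k):
    """Return the top-k principal directions of X (rows are samples)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are right singular vectors, i.e. principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T                      # shape (dim, k)

def make_groundtruth_codes(X128, labels, k=15, n_classes=10):
    """Build the 150-dim group-sparse targets: only the 15 slots of a
    digit's own group are filled by its per-class PCA coefficients,
    while all other slots stay zero."""
    bases = [pca_basis(X128[labels == c], k) for c in range(n_classes)]
    D = np.concatenate(bases, axis=1)    # (128, 150): initializes the FC layers
    Z = np.zeros((X128.shape[0], k * n_classes))
    for i, (x, c) in enumerate(zip(X128, labels)):
        Z[i, c * k:(c + 1) * k] = x @ bases[c]
    return Z, D

# Tiny sanity run on synthetic data (each "digit" has 20 samples).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 128))
y = np.arange(200) % 10
Z, D = make_groundtruth_codes(X, y)
```

The concatenated matrix `D` plays the role of the dictionary $\mathbf{D}$ used to initialize the fully-connected layers.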
If we formulate this feature extraction problem as $\ell_p$ norm constrained, then the $p$ is unknown, and most previous methods adopt the $\ell_0$ norm or $\ell_1$ norm. Different from them, FW Net attains a $p$ from the training data, which suits the real data better. The results are shown in Table II, comparing FW Net with simple feed-forward fully-connected networks (MLP) [71] and LCoD [25]. FW Net achieves a lower error rate than the others. In particular, it takes 50 iterations for LCoD to achieve an error rate of 1.39%, but only 3 layers for FW Net to achieve 1.34%, where the numbers of parameters are similar. Moreover, with an increasing number of layers, FW Net makes a continuous improvement, which is consistent with the observation in the simulation.
Table III provides the results of different initializations of the hyperparameter $p$, showing that FW Net is insensitive to the initialization of $p$ and converges to a similar learned $p$ (around 1.6). However, fixing the $p$ value hurts FW Net, even when fixing it to the finally learned value. This is interesting, as it seems the learnable $p$ provides advantages not only for the final trained model but also for the training itself. An adjustable $p$ value may suit the evolving parameters during the training of FW Net, which we plan to study further.
Method  # Params  Error rate (%)

3-layer MLP: 300+100 [71]  266,200  3.05
3-layer MLP: 500+150  468,500  2.95
3-layer MLP: 500+300  545,000  1.53
1-iter LCoD [25]  65,536  1.60
5-iter LCoD  65,536  1.47
50-iter LCoD  65,536  1.39
2-layer FW Net  43,350  2.20
3-layer FW Net  65,850  1.34
4-layer FW Net  88,200  1.25
Initial $p$  Learnable $p$: Accuracy (%)  Learned $p$  Fixed $p$: Accuracy (%)  Fixed $p$
1.1  98.51  1.601  97.35  1.1 
1.3  98.45  1.600  97.70  1.3 
1.5  98.43  1.595  98.30  1.5 
1.6  98.66  1.614  98.26  1.6 
1.8  98.38  1.601  97.90  1.8 
2.0  98.50  1.585  97.78  2.0 
V Convolutional Frank-Wolfe Network
Previous similar works [25, 35, 26] mostly result in fully-connected networks, as they are unrolled and truncated from linear sparse coding models. Nonetheless, fully-connected networks are less effective than convolutional neural networks (CNNs) when tackling structured multi-dimensional signals such as images. A natural idea to extend this line of work to the convolutional case seems to be convolutional sparse coding [72], which also admits iterative solvers. However, such a reformulation is inefficient both memory-wise and computation-wise, as discussed in [73].
We therefore seek a simpler procedure to build the convolutional FW Net: in Fig. 1, we replace the fully-connected layers with convolutional layers, and operate pool$_p$ across all the output feature maps at each spatial location individually. The latter is inspired by the well-known conversion between convolutional and fully-connected layers by reshaping inputs.
The convolutional FW Net then bears similar flexibility to jointly learn the weights and the hyperparameters ($p$ and $\gamma_t$). Yet, different from the original version of FW Net, the pool$_p$ in convolutional FW Net reflects a diversified treatment of the different convolutional filters at the same location, and should be differentiated from pooling over multiple spatial points:

When $p \to 1$, only the channel containing the strongest response is preserved at each location, which reduces pool$_p$ to maxout [74].

When $p = 2$, pool$_p$ rescales the feature maps across the channels by their root-mean-square.

When $p \to \infty$, pool$_p$ denotes the equal importance of all channels and leads to the cross-channel average.
By involving $p$ into the optimization, pool$_p$ essentially learns to rescale local responses, as advocated by the neural lateral inhibition mechanism. The learned $p$ will indicate the relative importance of different channels, and can potentially be useful for network compression [77].
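The channel-wise pool$_p$ operation can be sketched as follows (a hedged reconstruction, since the inline formulas were lost in extraction: this version is the Hölder-optimal $\ell_p$-ball direction with conjugate exponent $q = p/(p-1)$, which is consistent with the three special cases listed above, though the paper's exact normalization may differ):

```python
import numpy as np

def pool_p(u, p, axis=1, eps=1e-12):
    """Sketch of the pool_p unit for 1 < p < inf.  With q = p/(p-1),
    the output sign(u) * |u|^(q-1) / ||u||_q^(q-1) has unit l_p norm
    along `axis`.  For a conv feature map of shape (N, C, H, W),
    axis=1 applies it across channels at each location individually."""
    q = p / (p - 1.0)
    a = np.abs(u) ** (q - 1.0)
    norm = (np.abs(u) ** q).sum(axis=axis, keepdims=True) ** ((q - 1.0) / q)
    return np.sign(u) * a / (norm + eps)

# Sanity checks on a random "feature map".
rng = np.random.default_rng(0)
u = rng.normal(size=(2, 8, 4, 4))
for p in (1.3, 1.6, 2.0):
    s = pool_p(u, p)
    lp = (np.abs(s) ** p).sum(axis=1) ** (1.0 / p)
    assert np.allclose(lp, 1.0, atol=1e-6)   # unit l_p norm per location
# p = 2 reduces to l_2 (root-mean-square style) channel normalization.
assert np.allclose(pool_p(u, 2.0), u / np.linalg.norm(u, axis=1, keepdims=True))
```

As $p \to 1$ the exponent $q - 1$ grows, so the strongest channel dominates (maxout-like); as $p \to \infty$ the exponent tends to 0 and all channels receive equal magnitude.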
The convolutional kernels in the convolutional FW Net are not directly tied to any dictionary, so no initialization strategy is straightforwardly available. In this paper, we use random initialization for the convolutional filters, but we initialize $p$ and $\gamma_t$ in the same way as mentioned before. As a result, convolutional FW Nets often converge more slowly than fully-connected ones. We notice some recent works that construct convolutional filters [78, 79] from data in an explicit, unsupervised manner, and plan to exploit them as future options to initialize convolutional FW Nets.
To demonstrate its capability, we apply the convolutional FW Net to low-level image processing tasks, namely image denoising and single image super-resolution. Our focus lies on building lightweight, compact models and verifying the benefit of introducing the learnable $p$.
VI Experiments of Convolutional Frank-Wolfe Network
VI-A Image Denoising
We investigate two versions of FW Net for image denoising. The first version is the fully-connected (FC) FW Net as used in the previous simulation. Here, the basic FC FW Net (Fig. 1) is augmented with one extra fully-connected layer, whose parameters, denoted by $\mathbf{D}_r$, reconstruct the image from the code; $\mathbf{D}_r$ is naturally initialized by the dictionary $\mathbf{D}$. The network output is compared against the original/clean image to calculate the MSE loss. To train this FC FW Net, note that one strategy for image denoising is to split a noisy image into small overlapping patches, process the patches individually, and compose the patches back into a complete image. Accordingly, we reshape $8 \times 8$ blocks into 64-dim samples (i.e. $n = 64$), and adopt the same code dimension and depth as before. The network is trained with a number of noisy blocks as well as their noise-free counterparts. The second version is the proposed convolutional (Conv) FW Net as discussed in Section V, adopting 64 feature maps per layer, with the remaining configurations following our defaults.
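The patch-based strategy for the FC FW Net can be sketched as follows (the `denoise_fn` stand-in, the stride, and the averaging of overlaps are illustrative assumptions, not the paper's exact protocol):

```python
import numpy as np

def denoise_patchwise(noisy, denoise_fn, patch=8, stride=4):
    """Split the image into overlapping patch x patch blocks, run each
    flattened 64-dim vector through the network (`denoise_fn` is a
    stand-in for the trained FC FW Net), then average the overlapping
    reconstructions back into a full image."""
    H, W = noisy.shape
    out = np.zeros((H, W), dtype=np.float64)
    weight = np.zeros((H, W), dtype=np.float64)
    for i in range(0, H - patch + 1, stride):
        for j in range(0, W - patch + 1, stride):
            block = noisy[i:i + patch, j:j + patch].reshape(-1)  # 64-dim sample
            rec = denoise_fn(block).reshape(patch, patch)
            out[i:i + patch, j:j + patch] += rec
            weight[i:i + patch, j:j + patch] += 1.0
    return out / np.maximum(weight, 1.0)

# Sanity run: an identity "denoiser" must reproduce the covered image.
noisy = np.random.default_rng(1).normal(size=(16, 16))
recon = denoise_patchwise(noisy, lambda v: v)
```

Averaging the overlapping reconstructions is a common way to suppress blocking artifacts at patch boundaries.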
We use the BSD500 dataset [80] for the experiments. The BSD68 dataset is used as testing data, and the remaining 432 images are converted to grayscale and corrupted with additive white Gaussian noise for training. Several competitive baselines are chosen: an FC LISTA [25] network configured identically to our FC FW Net; a Conv LISTA network configured identically to our Conv FW Net; K-SVD + OMP, which represents $\ell_0$ norm optimization, using a dictionary size of 512; BM3D [81]; and DnCNN [82], a recently developed CNN-based method, where our retrained DnCNN includes 4 convolutional layers followed by BN and ReLU. We consider three noise levels: $\sigma = 15$, $\sigma = 25$, and $\sigma = 50$.
Table IV provides the results of the different image denoising methods on the BSD68 dataset. FC FW Net is better than FC LISTA in all cases. We also study the effect of $p$, as shown in Table V. FW Net with learnable $p$ outperforms FW Net with fixed $p = 1$ by a large margin, and is slightly better than FW Net with $p$ fixed to the finally learned value. Here, the learnable $p$ seems to benefit not only the final model but also the training itself, as has been observed in IV-B. BM3D is a strong baseline, which the deep networks cannot beat easily. As seen from Table IV, BM3D outperforms DnCNN (4 layers), but Conv FW Net (4 layers) is better than BM3D.
It is worth noting that the original DnCNN in [82] has 20 layers and many more parameters, and outperforms BM3D. We conduct an experiment to compare DnCNN with Conv FW Net at different numbers of layers. The results are shown in Fig. 4 (a). We observe that for shallow networks, Conv FW Net outperforms DnCNN significantly. As the network depth increases, Conv FW Net and DnCNN perform similarly. This may be attributed to the learnable parameter $p$ in Conv FW Net, which helps most when the network has fewer parameters. Thus, the learnable $p$ may be favorable if we want to build a compact network. The learned $p$ values are shown in Fig. 4 (b). We observe that the learned $p$ value is stable across networks with different depths, and is also similar to the value learned by the FC FW Net. Thus, we consider that the $p$ value is determined by the data, and FW Net can effectively identify it regardless of FC or Conv structures.
Method  $\sigma = 15$  $\sigma = 25$  $\sigma = 50$

FC LISTA  29.20  27.63  24.33
FC FW Net, learned $p$  29.35  27.71  24.52
K-SVD + OMP  30.82  27.97  23.97
BM3D  31.07  28.57  25.62
DnCNN (4 layers)  30.89  28.42  25.42
DnCNN (4 layers, w/o BN)  30.85  28.43  25.36
Conv LISTA (4 layers)  30.90  28.50  25.55
Conv FW Net (4 layers)  31.27  28.71  25.66
Method  $\sigma = 15$  $\sigma = 25$  $\sigma = 50$

FW Net, fixed $p = 1$  28.35  26.41  23.68
FW Net, fixed $p$ (finally learned value)  29.25  27.62  24.45
FW Net, learned $p$  29.35  27.71  24.52
VI-B Image Super-Resolution
For the experiments on single image super-resolution (SR), we compare Conv FW Net with the baselines SRCNN [83] and VDSR [84]; all methods are trained on the standard 91-image set and evaluated on Set5/Set14 with a scaling factor of 3. We train a 4-layer Conv FW Net and a 4-layer VDSR (4 convolutional layers equipped with ReLU), both of which have the same number of convolutional kernels. We adopt the same training configurations as presented in [84]. For SRCNN, we directly use the model released by the authors.
Table VI compares the results of these methods. Our Conv FW Net achieves the best performance among the three. To further understand the influence of each component in FW Net, we experimentally compare the following settings:

FW Net (No Skip): removing the top skip connections in Fig. 1;

FW Net (Fixed $\gamma$): setting $\gamma_t = 1$, which is equivalent to removing the bottom skip connections in Fig. 1;

FW Net (ReLU): replacing all the pool$_p$ units with ReLU.
Both FW Net (No Skip) and FW Net (Fixed $\gamma$) break the original structure introduced by the Frank-Wolfe algorithm, and incur a severe performance drop. FW Net (ReLU) performs similarly to the 4-layer VDSR and is worse than FW Net. These results further demonstrate that each component in Conv FW Net contributes to the final performance.
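To make the two skip paths concrete, here is a one-layer sketch of the unrolled Frank-Wolfe update under an $\ell_p$-ball of radius $\lambda$ (the symbols, the least-squares objective, and $\lambda = 1$ are assumptions consistent with the coding problem in Section I; the input $\mathbf{x}$ re-enters every layer through the gradient, i.e. the top skip, while the $(1-\gamma_t)\mathbf{z}_{t-1}$ term is the bottom skip that vanishes when $\gamma_t = 1$):

```python
import numpy as np

def fw_layer(z_prev, x, D, p, gamma, lam=1.0):
    """One unrolled Frank-Wolfe layer (sketch).  The FW direction s is
    the minimizer of <s, g> over the l_p ball of radius lam, obtained
    in closed form via Hoelder's inequality with q = p/(p-1)."""
    q = p / (p - 1.0)
    g = -D.T @ (x - D @ z_prev)                 # gradient of 0.5*||x - Dz||^2
    s = -lam * np.sign(g) * np.abs(g) ** (q - 1.0) / (
        (np.abs(g) ** q).sum() ** ((q - 1.0) / q) + 1e-12
    )
    # Bottom skip: (1 - gamma) * z_prev; setting gamma = 1 removes it.
    return (1.0 - gamma) * z_prev + gamma * s

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 150))
x = rng.normal(size=64)
z = np.zeros(150)
for t in range(1, 4):                           # a 3-layer unrolling
    z = fw_layer(z, x, D, p=1.6, gamma=2.0 / (t + 2))
# Each iterate is a convex combination of points in the l_p ball,
# so z stays inside the ball of radius lam = 1.
lp_norm = (np.abs(z) ** 1.6).sum() ** (1 / 1.6)
```

Because every $\mathbf{s}_t$ lies on the $\ell_p$ sphere and the update is a convex combination, the constraint is satisfied by construction, without any projection step.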
We also measure the effect of the hyperparameter $p$. The results are shown in Table VII. Different $p$ values indeed influence the final performance significantly, and FW Net with learnable $p$ achieves the best performance, which again verifies the advantage of attaining the prior from the data.
Method  Set5  Set14

3-layer SRCNN  32.34  28.64
4-layer VDSR  32.51  28.71
4-layer VDSR (+BN)  32.55  28.69
4-layer Conv FW Net  32.85  28.76
Method  Set5  Set14

FW Net (No Skip)  31.52  —
FW Net (Fixed $\gamma$)  32.21  —
FW Net (ReLU)  32.57  28.66
FW Net  32.85  28.76
FW Net (Fixed $p$)  32.49  28.42
FW Net (Fixed $p$)  32.81  28.63
FW Net (Fixed $p$)  32.73  28.69
FW Net (Fixed $p$)  32.59  28.58
FW Net (Fixed $p$)  32.65  28.67
FW Net (Learned $p$)  32.85  28.76
VII Conclusion
We have studied the general non-sparse coding problem, i.e. the $\ell_p$ norm constrained coding problem with general $p$. We have proposed the Frank-Wolfe network, whose architecture is carefully designed by referring to the Frank-Wolfe algorithm. Many aspects of FW Net are inherently connected to the existing success of deep learning. FW Net has demonstrated impressive effectiveness, flexibility, and robustness in our simulations and experiments. The results show that learning the hyperparameter $p$ is beneficial, especially in real-data experiments, which highlights the necessity of introducing the general $\ell_p$ norm and the advantage of FW Net in learning $p$ during end-to-end training.
Since the original Frank-Wolfe algorithm deals with convex optimization only, the proposed FW Net can handle the $p \geq 1$ cases, but not the $p < 1$ cases. Thus, FW Net is good at solving non-sparse coding problems. The case $p = 1$ is quite special, as it usually leads to a sparse solution [2]; thus, FW Net with fixed $p = 1$ can solve sparse coding, too, but its efficiency then seems inferior to LISTA, as observed in our experiments. For a real-world problem, is sparse or non-sparse coding better? This is an open problem and calls for future research. In addition, a number of promising directions have emerged as our future work, including handling constraints other than the $\ell_p$ norm, and the customization of FW Net for more real-world applications.
Footnotes
 https://github.com/sunke123/FWNet
 This reminds us of the well-known matching pursuit algorithm. We note that a recent work [62] has revealed a unified view of the Frank-Wolfe algorithm and the matching pursuit algorithm.
 For example, when $\mathbf{D}$ is not given, we can estimate it using K-SVD [22] for initialization.
 We are aware of the option to reparameterize $p$ to ensure $p > 1$ [56]. We have not implemented it in our experiments, since we never encountered $p \leq 1$ during learning. The reparameterization trick can be applied if necessary.
 http://cs231n.github.io/convolutionalnetworks//#convert
References
 B. K. Natarajan, “Sparse approximate solutions to linear systems,” SIAM Journal on Computing, vol. 24, no. 2, pp. 227–234, 1995.
 D. L. Donoho, “For most large underdetermined systems of linear equations the minimal $\ell_1$-norm solution is also the sparsest solution,” Communications on Pure and Applied Mathematics, vol. 59, no. 6, pp. 797–829, 2006.
 R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
 S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM Review, vol. 43, no. 1, pp. 129–159, 2001.
 D. L. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
 A. Krogh and J. A. Hertz, “A simple weight decay can improve generalization,” in NIPS, 1992, pp. 950–957.
 C. Studer, T. Goldstein, W. Yin, and R. G. Baraniuk, “Democratic representations,” arXiv preprint arXiv:1401.3420, 2014.
 J.J. Fuchs, “Spread representations,” in ASILOMAR, 2011, pp. 814–817.
 Y. Lyubarskii and R. Vershynin, “Uncertainty principles and vector quantization,” IEEE Transactions on Information Theory, vol. 56, no. 7, pp. 3491–3501, 2010.
 Z. Wang, J. Liu, S. Huang, X. Wang, and S. Chang, “Transformed antisparse learning for unsupervised hashing,” in BMVC, 2017, pp. 1–12.
 R. Chartrand, “Exact reconstruction of sparse signals via nonconvex minimization,” IEEE Signal Processing Letters, vol. 14, no. 10, pp. 707–710, 2007.
 R. Chartrand and V. Staneva, “Restricted isometry properties and nonconvex compressive sensing,” Inverse Problems, vol. 24, no. 3, p. 035020, 2008.
 Z. Xu, X. Chang, F. Xu, and H. Zhang, “$\ell_{1/2}$ regularization: A thresholding representation theory and a fast solver,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 7, pp. 1013–1027, 2012.
 D. Krishnan and R. Fergus, “Fast image deconvolution using hyperLaplacian priors,” in NIPS, 2009, pp. 1033–1041.
 M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien, “$\ell_p$-norm multiple kernel learning,” Journal of Machine Learning Research, vol. 12, pp. 953–997, 2011.
 B. A. Olshausen and D. J. Field, “Sparse coding with an overcomplete basis set: A strategy employed by V1?” Vision Research, vol. 37, no. 23, pp. 3311–3325, 1997.
 H. Lee, C. Ekanadham, and A. Y. Ng, “Sparse deep belief net model for visual area V2,” in NIPS, 2008, pp. 873–880.
 M. Elad and M. Aharon, “Image denoising via learned dictionaries and sparse representation,” in CVPR, vol. 1, 2006, pp. 895–900.
 J. Mairal, M. Elad, and G. Sapiro, “Sparse representation for color image restoration,” IEEE Transactions on Image Processing, vol. 17, no. 1, pp. 53–69, 2008.
 M. Ranzato, F.J. Huang, Y.L. Boureau, and Y. LeCun, “Unsupervised learning of invariant feature hierarchies with applications to object recognition,” in CVPR, 2007, pp. 1–8.
 J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in CVPR, 2009, pp. 1794–1801.
 M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.
 H. Xu, Z. Wang, H. Yang, D. Liu, and J. Liu, “Learning simple thresholded features with sparse support recovery,” IEEE Transactions on Circuits and Systems for Video Technology, DOI: 10.1109/TCSVT.2019.2901713, 2019.
 N. Bansal, X. Chen, and Z. Wang, “Can we gain more from orthogonality regularizations in training deep networks?” in NIPS, 2018, pp. 4261–4271.
 K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in ICML, 2010, pp. 399–406.
 Z. Wang, Q. Ling, and T. S. Huang, “Learning deep encoders,” in AAAI, 2016, pp. 2194–2200.
 Z. Wang, Y. Yang, S. Chang, Q. Ling, and T. S. Huang, “Learning a deep encoder for hashing,” in IJCAI, 2016, pp. 2174–2180.
 S. Ross, D. Munoz, M. Hebert, and J. A. Bagnell, “Learning messagepassing inference machines for structured prediction,” in CVPR, 2011, pp. 2737–2744.
 S. Wang, S. Fidler, and R. Urtasun, “Proximal deep structured models,” in NIPS, 2016, pp. 865–873.
 J. Sun, H. Li, and Z. Xu, “Deep ADMMnet for compressive sensing MRI,” in NIPS, 2016, pp. 10–18.
 M. Frank and P. Wolfe, “An algorithm for quadratic programming,” Naval Research Logistics, vol. 3, no. 12, pp. 95–110, 1956.
 J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.
 L.C. Chen, A. Schwing, A. Yuille, and R. Urtasun, “Learning deep structured models,” in ICML, 2015, pp. 1785–1794.
 A. G. Schwing and R. Urtasun, “Fully connected deep structured networks,” arXiv preprint arXiv:1503.02351, 2015.
 P. Sprechmann, A. M. Bronstein, and G. Sapiro, “Learning efficient sparse and low rank models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1821–1833, 2015.
 Z. Wang, S. Chang, J. Zhou, M. Wang, and T. S. Huang, “Learning a taskspecific deep architecture for clustering,” in SDM. SIAM, 2016, pp. 369–377.
 Z. Wang, D. Liu, S. Chang, Q. Ling, Y. Yang, and T. S. Huang, “D3: Deep dual-domain based fast restoration of JPEG-compressed images,” in CVPR, 2016, pp. 2764–2772.
 Z. Wang, S. Huang, J. Zhou, and T. S. Huang, “Doubly sparsifying network,” in IJCAI. AAAI Press, 2017, pp. 3020–3026.
 J. Zhang and B. Ghanem, “ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing,” in CVPR, 2018, pp. 1828–1837.
 S. Wisdom, T. Powers, J. Pitton, and L. Atlas, “Interpretable recurrent neural networks using sequential sparse recovery,” arXiv preprint arXiv:1611.07252, 2016.
 B. Xin, Y. Wang, W. Gao, D. Wipf, and B. Wang, “Maximal sparsity with deep networks?” in NIPS, 2016, pp. 4340–4348.
 T. Moreau and J. Bruna, “Understanding trainable sparse coding via matrix factorization,” arXiv preprint arXiv:1609.00285, 2016.
 X. Chen, J. Liu, Z. Wang, and W. Yin, “Theoretical linear convergence of unfolded ISTA and its practical weights and thresholds,” in NIPS, 2018, pp. 9061–9071.
 J. Liu, X. Chen, Z. Wang, and W. Yin, “ALISTA: Analytic weights are as good as learned weights in LISTA,” ICLR 2019, https://openreview.net/forum?id=B1lnzn0ctQ, 2019.
 U. Schmidt and S. Roth, “Shrinkage fields for effective image restoration,” in CVPR, 2014, pp. 2774–2781.
 S. Zheng, S. Jayasumana, B. RomeraParedes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr, “Conditional random fields as recurrent neural networks,” in ICCV, 2015, pp. 1529–1537.
 V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, “The convex geometry of linear inverse problems,” Foundations of Computational Mathematics, vol. 12, no. 6, pp. 805–849, 2012.
 M. Jaggi, “Revisiting Frank-Wolfe: Projection-free sparse convex optimization,” in ICML, 2013, pp. 427–435.
 L. Zhang, G. Wang, D. Romero, and G. B. Giannakis, “Randomized block Frank-Wolfe for convergent large-scale learning,” IEEE Transactions on Signal Processing, vol. 65, no. 24, pp. 6448–6461, 2017.
 F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi, “Learning activation functions to improve deep neural networks,” arXiv preprint arXiv:1412.6830, 2014.
 K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification,” in ICCV, 2015, pp. 1026–1034.
 U. S. Kamilov and H. Mansour, “Learning optimal nonlinearities for iterative thresholding algorithms,” IEEE Signal Processing Letters, vol. 23, no. 5, pp. 747–751, 2016.
 K. Jarrett, K. Kavukcuoglu, and Y. LeCun, “What is the best multistage architecture for object recognition?” in ICCV. IEEE, 2009, pp. 2146–2153.
 M. Malinowski and M. Fritz, “Learnable pooling regions for image classification,” arXiv preprint arXiv:1301.3516, 2013.
 C.Y. Lee, P. W. Gallagher, and Z. Tu, “Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree,” in Artificial Intelligence and Statistics, 2016, pp. 464–472.
 C. Gulcehre, K. Cho, R. Pascanu, and Y. Bengio, “Learned-norm pooling for deep feedforward and recurrent neural networks,” in ECML-PKDD. Springer, 2014, pp. 530–546.
 S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015, pp. 448–456.
 J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
 M. Ren, R. Liao, R. Urtasun, F. H. Sinz, and R. S. Zemel, “Normalizing the normalizers: Comparing and extending network normalization schemes,” arXiv preprint arXiv:1611.04520, 2016.
 S. Lyu, “Divisive normalization: Justification and effectiveness as efficient coding transform,” in NIPS, 2010, pp. 1522–1530.
 J. Ballé, V. Laparra, and E. P. Simoncelli, “Density modeling of images using a generalized normalization transformation,” arXiv preprint arXiv:1511.06281, 2015.
 F. Locatello, R. Khanna, M. Tschannen, and M. Jaggi, “A unified optimization view on generalized matching pursuit and Frank-Wolfe,” arXiv preprint arXiv:1702.06457, 2017.
 K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
 Z. Wang, J. Yang, H. Zhang, Z. Wang, Y. Yang, D. Liu, and T. S. Huang, Sparse coding and its applications in computer vision. World Scientific, 2016.
 D. Liu, B. Wen, X. Liu, Z. Wang, and T. S. Huang, “When image denoising meets highlevel vision tasks: A deep learning approach,” in IJCAI. AAAI Press, 2018, pp. 842–848.
 D.A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” arXiv preprint arXiv:1511.07289, 2015.
 S. Hochreiter and J. Schmidhuber, “Long shortterm memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
 Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ACM Multimedia. ACM, 2014, pp. 675–678.
 M. Grant and S. Boyd, “CVX: Matlab software for disciplined convex programming,” http://cvxr.com/cvx, Mar. 2014.
 Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradientbased learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 H. Bristow, A. Eriksson, and S. Lucey, “Fast convolutional sparse coding,” in CVPR. IEEE, 2013, pp. 391–398.
 H. Sreter and R. Giryes, “Learned convolutional sparse coding,” in ICASSP, 2018, pp. 2191–2195.
 I. J. Goodfellow, D. WardeFarley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” arXiv preprint arXiv:1302.4389, 2013.
 M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
 T. Pfister, J. Charles, and A. Zisserman, “Flowing convnets for human pose estimation in videos,” in ICCV, 2015, pp. 1913–1921.
 S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
 T.H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma, “PCANet: A simple deep learning baseline for image classification?” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5017–5032, 2015.
 J. Bruna and S. Mallat, “Invariant scattering convolution networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1872–1886, 2013.
 P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, “Contour detection and hierarchical image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 898–916, 2011.
 H. C. Burger, C. J. Schuler, and S. Harmeling, “Image denoising: Can plain neural networks compete with BM3D?” in CVPR. IEEE, 2012, pp. 2392–2399.
 K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
 C. Dong, C. C. Loy, K. He, and X. Tang, “Image superresolution using deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2016.
 J. Kim, J. K. Lee, and K. M. Lee, “Accurate image superresolution using very deep convolutional networks,” in CVPR, 2016, pp. 1646–1654.