Deep Transform and Metric Learning Network: Wedding Deep Dictionary Learning and Neural Networks
Abstract
On account of its many successes in inference tasks and denoising applications, Dictionary Learning (DL) and its related sparse optimization problems have garnered a lot of research interest. While most solutions have focused on single-layer dictionaries, the recently proposed Deep DL (DDL) methods, although an improvement, still fall short on a number of issues. We propose herein a novel DDL approach where each DL layer can be formulated as a combination of one linear layer and a Recurrent Neural Network (RNN). The RNN is shown to flexibly account for the layer-associated and learned metric. Our proposed work unveils new insights into Neural Networks and DDL and provides a new, efficient and competitive approach to jointly learn a deep transform and a metric for inference applications. Extensive experiments are carried out to demonstrate that the proposed method can outperform not only existing DDL but also state-of-the-art generic CNNs.
Keywords:
Deep Dictionary Learning, Deep Neural Network, Metric Learning, Transform Learning, Proximal operator, Differentiable Programming
1 Introduction
Dictionary Learning/Sparse Coding has demonstrated its high potential in exploring the semantic information embedded in high-dimensional noisy data. It has been successfully applied to different inference tasks, such as image denoising [8], image restoration [36], image super-resolution [40, 25], audio processing [10] and image classification [39].
While Synthesis Dictionary Learning (SDL) has been extensively investigated and widely used, Analysis Dictionary Learning (ADL)/Transform Learning, as a dual problem, has been receiving growing attention for, among others, its robustness properties [24, 3, 27]. DL-based methods have primarily focused on learning a one-layer dictionary and its associated sparse representation. Other variations on the classification theme have also appeared with the goal of addressing some recognized limitations, such as task-driven dictionary learning [21], first introduced to jointly learn the dictionary, its sparse representation, and its classification objective. In [1], a label-consistent term is additionally considered. Class-specific dictionary learning has recently been shown to improve discrimination in [23, 37, 35] at the expense of a higher complexity. On the ADL side, increasingly efficient classifiers [11, 32, 33, 28, 29] have resulted from numerous research efforts, and have been shown to outperform SDL in both training and testing phases [30].
DL methods with their associated sparse representation present significant computational challenges, addressed by different techniques including K-SVD [1, 24], SNSADL [3] and the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [2]. Although meant to provide a practically faster solution, the alternating minimization of FISTA still exhibits limitations and a relatively high computational cost.
To address these computational and scaling difficulties, differentiable programming solutions have also been developed to take advantage of the efficiency of neural networks. LISTA [9] was first proposed to unfold the iterative shrinkage-thresholding algorithm (ISTA) into an RNN format, thus speeding up SDL. Unlike conventional solutions for solving optimization problems, LISTA uses the forward and backward passes to simultaneously update the sparse representation and dictionary in an efficient manner. In the same spirit, sparse LSTM (SLSTM) [41] adapts LISTA to a Long Short-Term Memory structure to automatically learn the dimension of the sparse representation.
Although the aforementioned differentiable programming methods are efficient at solving a single-layer DL problem, the latter formulation still does not yield the best performance in image classification tasks. With the fast development of deep learning, Deep Dictionary Learning (DDL) methods [31, 19] have thus come into play. In [14], a deep model for ADL followed by an SDL is developed for image super-resolution. In [20], SDLs are deeply stacked to classify images, achieving promising and robust results. Unsupervised DDL approaches have also been proposed, with promising results [18, 12].
However, to the best of our knowledge, no DDL model which can provide both a fast and reliable solution has been proposed.
The work proposed herein aims at ensuring the discriminative ability of single-layer DL while providing the efficiency of end-to-end models. To this end, we propose a novel differentiable programming method to jointly learn a deep metric together with an associated transform. Cascading these canonical structures will exploit and strengthen the structure learning capacity of a deep network, yielding what we refer to as a Deep Transform and Metric Learning Network (DeTraMeNet). This newly proposed approach not only increases the discrimination capabilities of DL, but also affords the flexibility of constructing different DDL or Deep Neural Network (DNN) architectures. As will be shown later, this approach also resolves the initialization and gradient propagation issues that commonly arise in DDL.
As shown in Figure 1, in each layer of DeTraMeNet, the DL problem is decomposed into a transform learning one, i.e. a linear layer part, cascaded with a nonlinear component using a learned metric. The latter, referred to as QMetric Learning, is realized by an RNN. One of the contributions of our work is to show how DDL can theoretically be reformulated as such a combination of linear layers and RNNs. Decoupling the metric and the dual frame operator (pseudo-inverse of the dictionary) into two independent variables is also shown to introduce additional flexibility and to improve the power of DL. On the practical side, and to achieve a faster and simpler implementation, we impose a block-diagonal structure for QMetric Learning, leading to parallel processing of independent channels. Moreover, a convolutional operator is also introduced to decrease the number of parameters, thus leading to a Convolutional-RNN. Additionally, the QMetric Learning part may be viewed as a non-separable activation function that can be flexibly included into any architecture. As a result, different new DeTraMe networks may be obtained by integrating QMetric Learning into various CNN architectures such as Plain CNN [26] and ResNets [13]. The resulting DeTraMeNet-based architectures are demonstrated to be more discriminative than generic CNN models.
Although the authors of [34] and [17] also used a CNN followed by an RNN for respectively solving super-resolution and scene recognition tasks, they directly used LISTA in their model. In turn, our method actually solves the same problem as LISTA. In addition, in [34] and [17], a sparse representation was jointly learned, while a more discriminative DDL approach is achieved in our work. We also formally derive the linear and RNN-based layer structure from DDL, thus providing a theoretical justification and a rationale for such approaches. This may also open an avenue to new, more creative and better-performing alternatives.
Our main contributions are summarized below:

We theoretically transform one-layer dictionary learning into transform learning and QMetric learning, and deduce how to convert DDL into DeTraMeNet.

Such joint transform learning and QMetric learning are successfully and easily implemented as a tandem of a linear layer and an RNN. A convolutional layer can be chosen for the linear part, and the RNN can also be simplified into a Convolutional-RNN. To the best of our knowledge, this is the first work which makes an insightful bridge between DDL methods and the combination of linear layers and RNNs, with the associated performance gains.

The transform and QMetric learning uses two independent variables, one for the dictionary and the other for the dual frame operator of the dictionary. This bridges the current work to conventional SDL while introducing more discriminative power, and allowing the use of faster learning procedures than the original DL.

The QMetric can also be viewed as a parametric non-separable nonlinear activation function, while in current neural network architectures, very few non-separable nonlinear operators are used (softmax, max pooling, average pooling). As a component of a neural network, it can be flexibly inserted into any network architecture to easily construct a DL layer.

The proposed DeTraMeNet is demonstrated not only to improve the discrimination power of DDL, but also to achieve a better performance than state-of-the-art CNNs.
The paper is organized as follows: In Section 2, we introduce the required background material. We derive the theoretical basis for our novel approach in Section 3. Its algorithmic solution is investigated in Section 4. Substantiating experimental results and evaluations are presented in Section 5. Finally, we provide some concluding remarks in Section 6.
1.1 Notation
Symbols  Descriptions 

A Matrix  
The transpose and inverse of matrices  
The Identity Matrix  
The row and column element of a matrix  
,  A Vector and its element 
An Operator 
2 Preliminaries
2.1 Dictionary Learning for Classification
In task-driven dictionary learning [21], the common method for a one-layer dictionary learning classifier is to jointly learn the dictionary matrix , the sparse representation of a given vector , and the classifier parameter . Let be the data and the associated labels. Task-driven DL can be expressed as finding
(1) 
In SDL, we learn the composition of a dictionary and a sparse reconstruction in order to reconstruct or synthesize the data, hence yielding the standard formulation,
(2) 
Alternatively, in ADL, we directly operate on the data using a dictionary, leading to,
(3) 
The term may correspond to various kinds of loss functions, such as least-squares, cross-entropy, or hinge loss.
2.2 Deep Dictionary Learning for Classification
An efficient DDL approach [20] consists of computing
(4) 
where denotes the estimated label, is the classifier matrix, is a nonlinear function, and
(5) 
where denotes the composition of operators. For every layer is a reshaping operator, which is a tall matrix. Moreover, is a nonlinear operator computing a sparse representation within a synthesis dictionary matrix . More precisely, for a given matrix ,
(6) 
with
(7) 
where , , and is a function in , the class of proper lower semicontinuous convex functions from to . A simple choice consists in setting to zero, while adopting the following specific form for ;
(8) 
where denotes the indicator function of a set (equal to zero in and otherwise). Note that Eq. (6) corresponds to the minimization of a strongly convex function, which thus admits a unique minimizer, so making the operator properly defined.
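As an illustration, the sparse coding step of a single layer can be sketched numerically. The sketch below assumes that the layer objective has the elastic-net form 0.5*||x - D a||^2 + (alpha/2)*||a||^2 + lam*||a||_1 with a nonnegativity constraint on a and a zero linear term; the function name `sparse_code`, the parameter values and the projected-gradient solver are assumptions made for illustration only, not the solver used in this work.

```python
import numpy as np

def sparse_code(D, x, lam=0.1, alpha=0.1, n_iter=3000):
    """Nonnegative elastic-net sparse coding by projected gradient (illustrative).

    Minimizes 0.5*||x - D a||^2 + (alpha/2)*||a||^2 + lam*sum(a) subject to a >= 0.
    On the nonnegative orthant the l1 norm reduces to the linear term lam*sum(a).
    """
    k = D.shape[1]
    Q = D.T @ D + alpha * np.eye(k)      # Hessian of the smooth part (positive definite)
    step = 1.0 / np.linalg.norm(Q, 2)    # 1 / Lipschitz constant of the gradient
    b = D.T @ x - lam
    a = np.zeros(k)
    for _ in range(n_iter):
        # gradient step on the quadratic, then projection onto a >= 0
        a = np.maximum(a - step * (Q @ a - b), 0.0)
    return a

rng = np.random.default_rng(0)
D = rng.standard_normal((8, 12))
D /= np.linalg.norm(D, axis=0)           # unit-norm atoms
x = rng.standard_normal(8)
a = sparse_code(D, x)
print("nonzero coefficients:", int((a > 0).sum()), "of", a.size)
```

Because alpha > 0 makes the objective strongly convex, the iteration converges to the unique minimizer, in line with the remark above on the operator being properly defined.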
3 Joint Deep Metric and Transform Learning
3.1 Proximal interpretation
Our goal here is to establish an equivalent but more insightful solution for in each layer.
Claim 1: can be solved by a proximal operator of a transform learning with a metric :
(9) 
To simplify notation, we omit the superscript which denotes the layer in Eq. (6) which, in turn, aims at finding the sparse representation . For every , and , Eq. (7) can thus be reexpressed as follows:
(10) 
where
(11) 
with
(12) 
and denotes the weighted Euclidean norm induced by . Determining the optimal sparse representation of is therefore, equivalent to computing the proximity operator in Eq. (11), that is Eq. (9):
(13) 
This thus establishes a re-expression of the solution of the representation procedure as the proximity operator of within the metric induced by the symmetric positive definite matrix [6, 5]. Furthermore, it shows that the SDL can be equivalently viewed as an ADL formulation involving the dictionary matrix , provided that a proper metric is chosen.
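The argument of this subsection can be retraced by completing the square. The derivation below is a sketch under assumed notation, since the symbols are introduced here for illustration: $D$ is the synthesis dictionary, $x$ the layer input, $a$ the code, $\alpha > 0$ the quadratic regularization weight, $d$ an optional linear term, and $\|u\|_Q^2 = u^\top Q u$.

```latex
\begin{align*}
&\tfrac12\|x - Da\|^2 + \tfrac{\alpha}{2}\|a\|^2 + d^\top a + \lambda\psi(a) \\
&\quad= \tfrac12\, a^\top Q a - a^\top (D^\top x - d) + \lambda\psi(a) + \tfrac12\|x\|^2,
   \qquad Q = D^\top D + \alpha I, \\
&\quad= \tfrac12\, \|a - (Fx - c)\|_Q^2 + \lambda\psi(a) + \text{const},
   \qquad F = Q^{-1} D^\top,\quad c = Q^{-1} d .
\end{align*}
```

Under these assumptions the minimizer over $a$ is $\operatorname{prox}^{Q}_{\lambda\psi}(Fx - c)$: an analysis-type transform $F$ followed by a proximity operator computed in the metric $Q$, which is symmetric positive definite as soon as $\alpha > 0$.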
3.2 Multilayer representation
Consequently, by substituting Eq. (13) in Eqs. (4) and (5), the DDL model can be reexpressed in a more concise and comprehensive form as
(14) 
where, for , the affine operator maps to by means of an analysis transform and a shift term , that is,
(15) 
with and
(16) 
Eq. (15) shows that, for each layer , we obtain a structure similar to a linear layer by treating as the weight operator and as the bias parameter, which are referred to as the Transform learning part of the DeTraMe method. In standard Forward Neural Networks (FNNs), the activation functions can be interpreted as proximity operators of convex functions [7]. Eq. (14) attests that our model is more general, in the sense that different metrics are introduced for these operators. In the next section, we propose an efficient method to learn these metrics in a supervised manner.
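To make the link with standard activations concrete, consider the special case of an identity metric. The worked example below assumes $\psi = \|\cdot\|_1 + \iota_{[0,+\infty)^k}$ and writes $k$ for the code dimension (a symbol assumed here); the problem then separates over the coordinates.

```latex
\begin{align*}
\operatorname{prox}_{\lambda\psi}(z)
  &= \operatorname*{argmin}_{u \in [0,+\infty)^k}\;
     \tfrac12\|u - z\|^2 + \lambda \sum_{i=1}^{k} u_i \\
  &= \big(\max(z_i - \lambda,\, 0)\big)_{1 \le i \le k}
   = \operatorname{ReLU}(z - \lambda \mathbf{1}).
\end{align*}
```

A ReLU with a bias shift is thus recovered when the metric is the identity; a non-diagonal metric couples the coordinates, so the resulting activation is no longer separable.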
4 QMetric Learning
4.1 Prox computation
Reformulation (14) has the great advantage of allowing us to benefit from algorithmic frameworks developed for FNNs, provided that we are able to efficiently compute
(17) 
where is the weighted Frobenius norm. Here, is a matrix where the samples associated with the training set have been stacked column-wise. A similar convention is used to construct and from and . An elastic-net-like regularization is chosen by setting with . We have, in particular, observed that the last quadratic term has a positive influence in increasing stability and avoiding overfitting. As in Eq. (12), Eq. (17) is actually equivalent to solving the following optimization problem:
(18) 
Claim 2: We show next that the solution of Eq. (18) is obtained as an iteration of the form:
(19) 
Various iterative splitting methods could be used to find the unique minimizer of the above optimized convex function [4, 15]. Our purpose is to develop an algorithmic solution for which classical NN learning techniques can be applied in a fast and convenient manner. By subdifferential calculus, the solution to the problem (18) satisfies the following optimality condition:
(20) 
where . Element-wise rewriting of Eq. (20) yields, for every , and ,
(21) 
Let us adopt a block-coordinate approach and update the th row of by fixing all the other ones. As is a positive definite matrix, and Eq. (21) implies that
(22) 
where . And let
(23) 
where is the Kronecker sequence (equal to 1 when and 0 otherwise). Then, Eq. (22) suggests that the elements of can be globally updated, at iteration , as shown in Eq. (19):
with denoting the Hadamard (element-wise) product. Note that a similar expression can be derived by applying a preconditioned forward-backward algorithm [5] to Eq. (18), where the preconditioning matrix is , as detailed in the supplementary material. The implementation of the method allowing us to compute the proximity operator in (17) is summarized in Alg. 1.
4.2 RNN implementation
Given , , and , Alg. (1) can be viewed as an RNN structure for which is the hidden variable and is a constant input over time. By taking advantage of existing gradient backpropagation techniques for RNNs, can thus be directly computed in order to minimize the global loss . This shows that, thanks to the reparameterization in Eq. (23), QMetric Learning has been recast as the training of a specific RNN.
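The recurrent structure described above can be sketched in a few lines. The code below is a minimal illustration, not the exact algorithm of this work: it assumes the problem min over U >= 0 of 0.5*||U - Z||^2 in the Q-weighted Frobenius norm, plus lam*sum(U) + (beta/2)*||U||_F^2, and iterates the row-wise fixed-point update suggested by the optimality condition. The names `q_metric_rnn`, `W`, `h`, `b`, `lam` and `beta` are assumptions, and the metric Q is fixed here, whereas the RNN parameters are learned in DeTraMeNet.

```python
import numpy as np

def q_metric_rnn(Z, Q, lam=0.1, beta=0.1, n_iter=50):
    """Unrolled fixed-point iteration for a prox in the metric Q (illustrative).

    Z: (k, n) constant input over time, Q: (k, k) symmetric positive definite.
    Each iteration is one RNN step: U <- ReLU(h * Z + W @ (U - Z) - b).
    """
    q = np.diag(Q)[:, None]                        # diagonal of Q, shape (k, 1)
    W = -(Q - np.diag(np.diag(Q))) / (q + beta)    # recurrent weights, zero diagonal
    h = q / (q + beta)                             # per-row input scaling
    b = lam / (q + beta)                           # per-row threshold
    U = np.maximum(h * Z - b, 0.0)                 # initial hidden state
    for _ in range(n_iter):
        U = np.maximum(h * Z + W @ (U - Z) - b, 0.0)
    return U

rng = np.random.default_rng(0)
k, n = 6, 4
B = rng.standard_normal((k, k))
Q = 0.05 * (B + B.T) + 2.0 * np.eye(k)   # diagonally dominant, hence positive definite
Z = rng.standard_normal((k, n))
U = q_metric_rnn(Z, Q, n_iter=200)
print(U.round(3))
```

This plain iteration is only guaranteed to converge when the off-diagonal part of Q is small enough (for instance, when Q is strictly diagonally dominant, as in the example); in the learned setting a small fixed number of iterations is used instead.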
Note that is a symmetric matrix. In order to reduce the number of parameters and to ease their optimization, we choose a block-diagonal structure for . In addition, for each of the blocks, either an arbitrary or a convolutive structure can be adopted. Since the structure of is reflected by the structure of , this leads in Eq. (19) to fully connected or convolutional layers where the channel outputs are linked to non-overlapping blocks of the inputs. In our experiments on images, Convolutional-RNNs have been preferred for practical efficiency.
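The computational benefit of the block-diagonal structure can be illustrated as follows. In this sketch the helper names `blockdiag_apply` and `blockdiag_dense` are hypothetical, and the blocks are dense; it only checks that applying a block-diagonal operator amounts to processing non-overlapping groups of rows independently, which is what allows the channels to be handled in parallel.

```python
import numpy as np

def blockdiag_apply(blocks, U):
    """Apply a block-diagonal operator block by block.

    blocks: (g, s, s) array of g square blocks, U: (g*s, n) input.
    Each block only sees its own group of s rows of U.
    """
    g, s, _ = blocks.shape
    n = U.shape[1]
    out = np.einsum("gij,gjn->gin", blocks, U.reshape(g, s, n))
    return out.reshape(g * s, n)

def blockdiag_dense(blocks):
    """Assemble the equivalent dense (g*s, g*s) matrix, for comparison only."""
    g, s, _ = blocks.shape
    W = np.zeros((g * s, g * s))
    for i, B in enumerate(blocks):
        W[i * s:(i + 1) * s, i * s:(i + 1) * s] = B
    return W

rng = np.random.default_rng(1)
blocks = rng.standard_normal((3, 4, 4))   # 3 groups of 4 channels each
U = rng.standard_normal((12, 5))
same = np.allclose(blockdiag_dense(blocks) @ U, blockdiag_apply(blocks, U))
print("block-wise result matches dense product:", same)
```

With g blocks of size s, the number of parameters drops from (g*s)^2 to g*s^2.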
4.3 Training procedure
We have finally transformed our DDL approach into an alternation of linear layers and specific RNNs. This not only simplifies the implementation of the resulting DeTraMeNet by making use of standard NN tools, but also allows us to employ well-established stochastic gradient-based learning strategies. Let be the learning rate at iteration . A simplified form of the training method for DeTraMeNets is provided in Alg. 2.
The constraints on the parameters of the RNNs have been imposed by projections. In Alg. 2, denotes the projection onto a nonempty closed convex set and is the vector space of matrices with diagonal terms equal to 0.
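A projected gradient update of this kind can be sketched as follows. Only the zero-diagonal constraint stated above is illustrated; the function names `project_zero_diagonal` and `projected_sgd_step` are assumptions, and the gradient is a random placeholder rather than a backpropagated one.

```python
import numpy as np

def project_zero_diagonal(W):
    """Orthogonal projection (Frobenius norm) onto matrices with zero diagonal."""
    P = W.copy()
    np.fill_diagonal(P, 0.0)
    return P

def projected_sgd_step(W, grad_W, lr):
    """One gradient step followed by projection onto the constraint set."""
    return project_zero_diagonal(W - lr * grad_W)

rng = np.random.default_rng(2)
W = project_zero_diagonal(rng.standard_normal((4, 4)))
grad = rng.standard_normal((4, 4))        # placeholder for a backpropagated gradient
W_next = projected_sgd_step(W, grad, lr=0.1)
print(np.diag(W_next))
```

Since the set of zero-diagonal matrices is a linear subspace, the projection simply zeroes the diagonal and leaves the off-diagonal entries of the gradient step untouched.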
5 Experiments and Results
In this section, our DeTraMeNet method is evaluated on three popular datasets, namely CIFAR10 [16], CIFAR100 [16] and Street View House Numbers (SVHN) [22]. Since the common NN architectures are plain networks, such as ALLCNN [26], and residual ones, such as ResNet [13] and WideResNet [38], we compare DeTraMeNet with these three respective state-of-the-art architectures.
5.1 Architectures
Since we break SDL into two independent parts, a linear layer and an RNN, the RNN can be flexibly inserted at any nonlinear layer of a deep neural network. After choosing convolutional linear layers, we can construct two different architectures by inserting the RNN into plain networks and residual blocks. One is to replace all the ReLU activation layers in PlainNet with QMetric ReLU, leading to DeTraMePlainNet. Another is to replace the ReLU layer inside the block in ResNet by QMetric ReLU, giving rise to DeTraMeResNet. When replacing all the ReLU layers, DeTraMePlainNet becomes equivalent to DDL as explained in Section 4. When only replacing a single ReLU layer in the ResNet architecture, a new DeTraMeResNet structure is built. The detailed architectures are illustrated in the supplementary materials.
DeTraMePlainNet 3layer  PlainNet 3layer  PlainNet 6layer  PlainNet 9layer  PlainNet 12layer 
Input 32 x 32 RGB Image with dropout(0.2)  
conv 96  conv 96 RELU  conv 96 RELU  conv 96 RELU  conv 96 RELU 
+ QMetric: conv 96  conv 96 RELU  conv 96 RELU  conv 96 RELU  
conv 96 RELU  conv 96 RELU  
with stride=2, dropout(0.5)  with stride=2, dropout(0.5)  
conv 192 RELU  
conv 96  conv 96 RELU  conv 96 RELU  conv 192 RELU  conv 192 RELU 
with stride=2  with stride=2  with stride=2, dropout(0.5)  conv 192 RELU  conv 192 RELU 
+ QMetric: conv 96  conv 192 RELU  conv 192 RELU  with stride=2, dropout(0.5)  
with stride=2, dropout(0.5)  conv 192 RELU  
conv 10  conv 10 RELU  conv 192 RELU  conv 192 RELU  conv 192 RELU 
with stride=2  with stride=2  conv 10 RELU  conv 192 RELU  with stride=2 
+QMetric: conv 10  with stride=2  conv 10 RELU  conv 192 RELU  
conv 192 RELU  
conv 10 RELU  
Global Average Pooling  
Softmax 
For the PlainNet, we use a 9-layer architecture similar to ALLCNN [26] with dropouts, as listed in Table 1. For the ResNet architecture, we follow the setting in [13]: the first layer is a convolutional layer with 16 filters. Three residual blocks with output map sizes of 32, 16, and 8 are then used, with 16, 32 and 64 filters for each block, respectively. The network ends with a global average pooling and a fully-connected layer. The parameters listed in Table 2 are respectively chosen equal to for ResNet 8, 20, 56, 110 and 164-layer networks, and we respectively use and for WideResNet 16-4 and WideResNet 16-8 networks as suggested in [38].
output map size  

# layers  
#filters  
WideResNet #filters 
For DeTraMeNet, we use convolutional RNNs having the same filter size (resp. number of channels) as those in the convolutional layer before. The number of parameters of each model as well as the number of iterations performed in RNNs, are indicated in Table 4.
5.2 Datasets and Training Settings
CIFAR10 [16] contains 60,000 color images divided into 10 classes. 50,000 images are used for training and 10,000 images for testing. CIFAR100 [16] also consists of color images, but includes 100 classes, with 50,000 images for training and 10,000 images for testing. SVHN [22] contains 630,420 color images with size . 604,388 images are used for training and 26,032 images are used for testing.
For the CIFAR datasets, the normalized input image is randomly cropped after padding on each side of the image and random flipping, similarly to [13, 38]. No other data augmentation is used. For SVHN, we normalize the range of the images between 0 and 1. All the models are trained on an Nvidia V100 32 GB GPU with a mini-batch size of 128. The models of both PlainNet and ResNet architectures are trained with the SGD optimizer, with momentum equal to 0.9 and a weight decay of . On the CIFAR datasets, the algorithm starts with a learning rate of 0.1. 300 epochs are used to train the models, and the learning rate is reduced at the 150th and 225th epochs. On the SVHN dataset, a learning rate of 0.01 is used at the beginning and is then divided by 10 at the 80th and 120th epochs within a total of 160 epochs. The same settings are used as in [38].
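The step-wise learning rate schedule described above can be written compactly. The sketch below uses the SVHN values given in the text (initial rate 0.01, division by 10 at epochs 80 and 120, 160 epochs in total); the function name `step_lr`, the 1-indexed epoch convention and the choice that a drop takes effect at the milestone epoch itself are assumptions. For the CIFAR datasets the milestones would be 150 and 225 with an initial rate of 0.1; the reduction factor is not specified in the text above, so it is left as a parameter.

```python
def step_lr(epoch, base_lr, milestones, factor=0.1):
    """Piecewise-constant learning rate.

    The rate is multiplied by `factor` once for every milestone that the
    (1-indexed) epoch has reached.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops

# SVHN setting from the text: 0.01, divided by 10 at epochs 80 and 120.
for epoch in (1, 79, 80, 119, 120, 160):
    print(epoch, step_lr(epoch, base_lr=0.01, milestones=(80, 120)))
```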
5.3 Results
DeTraMeNet vs. DDL
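Table 3: Number of parameters and accuracy of DDL [20] and DeTraMeNet (9-layer models).
Model | # Parameters | CIFAR10 | CIFAR100 |
DDL 9 [20] | 1.4M | 0.9304 | 0.6876 |
DeTraMeNet 9 | 3.0M | 0.9340 | 0.7034 |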
First, we compare our results with those achieved by the DDL approach in [20]. As we break the dictionary and its pseudo inverse into two independent variables, a higher number of parameters is involved in DeTraMeNet than in [20]. However, DeTraMeNet presents two main advantages: The first one is a better capability to discriminate: in Table 3, compared to DDL, DeTraMeNet respectively achieves and improvements on CIFAR10 and CIFAR100 datasets. The second advantage is that DeTraMeNet is implemented in a network framework, with no need for extra functions to compute gradients at each layer. Moreover, by taking advantage of the developed techniques in neural networks, DeTraMeNet does not meet the difficulties of sensitivity to initialization and gradient propagation that the original DDL approach faces.
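The gains discussed above can be recomputed directly from the accuracies listed in Table 3. The short script below is only an arithmetic check on those table entries, expressed in percentage points; the variable names are illustrative.

```python
# Accuracies copied from Table 3 (9-layer models).
ddl = {"CIFAR10": 0.9304, "CIFAR100": 0.6876}
detrame = {"CIFAR10": 0.9340, "CIFAR100": 0.7034}

# Absolute accuracy gain of DeTraMeNet over DDL, in percentage points.
gains = {name: 100.0 * (detrame[name] - ddl[name]) for name in ddl}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} percentage points")
```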
DeTraMeNet vs. Generic CNNs
Table 4: Number of parameters and accuracy of the original networks and of their DeTraMeNet counterparts (number of RNN iterations in parentheses).
Network architecture | # Parameters (Original) | # Parameters (DeTraMeNet) | CIFAR10 + (Original) | CIFAR10 + (DeTraMeNet) | CIFAR100 + (Original) | CIFAR100 + (DeTraMeNet) | SVHN (Original) | SVHN (DeTraMeNet) |
PlainNet 3layer | 0.094M | 0.261M | 0.4248 | 0.8867 (5) | 0.2209 | 0.6475 (3) | 0.4564 | 0.9721 (8) |
PlainNet 6layer | 1.016M | 1.929M | 0.8634 | 0.9241 (2) | 0.6275 | 0.7014 (2) | 0.9755 | 0.9817 (5) |
PlainNet 9layer | 1.370M | 2.984M | 0.9036 | 0.9340 (2) | 0.6591 | 0.7034 (2) | 0.9798 | 0.9826 (5) |
PlainNet 12layer | 2.366M | 3.980M | 0.9108 | 0.9361 (2) | 0.6901 | 0.7117 (2) | 0.9814 | 0.9827 (3) |
ResNet 8 | 0.074M | 0.123M | 0.8782 | 0.8941 (3) | 0.5997 | 0.6527 (2) | 0.9670 | 0.9750 (3) |
ResNet 20 | 0.268M | 0.413M | 0.9214 | 0.9253 (3) | 0.6833 | 0.6890 (2) | 0.9770 | 0.9782 (2) |
ResNet 56 | 0.848M | 0.994M | 0.9365 | 0.9375 (3) | 0.7113 | 0.7166 (2) | 0.9796 | 0.9804 (2) |
ResNet 110 | 1.719M | 1.867M | 0.9374 | 0.9377 (2) | 0.7273 | 0.7364 (2) | - | - |
ResNet 164 | 2.590M | 2.738M | 0.9359 | 0.9439 (2) | 0.7357 | 0.7441 (2) | - | - |
WideResNet 16-4 | 3.585M | 5.136M | 0.9525 | 0.9531 (2) | 0.7679 | 0.7761 (3) | 0.9806 | 0.9816 (3) |
WideResNet 16-8 | 10.783M | 16.983M | 0.9572 | 0.9579 (2) | 0.7945 | 0.8048 (3) | 0.9817 | 0.9823 (3) |
We next compare DeTraMeNet with generic CNNs with respect to three different aspects: Accuracy, Parameter number and Capacity.
Accuracy. As shown in Table 4, with the same architecture, DeTraMeNet structures achieve an overall better performance than the corresponding generic CNN models. For the PlainNet architecture, DeTraMeNet increases the accuracy with a median of on CIFAR10, on CIFAR100 and on SVHN, and by at least on these three datasets, respectively. For the ResNet architecture, DeTraMeNet also consistently increases the accuracy with a median of on CIFAR10, on CIFAR100 and on SVHN, and by at least on all datasets.
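Summary statistics of this kind can be recomputed from Table 4. The sketch below does so for the four PlainNet rows only, reporting the median and minimum absolute gain in percentage points; it is an arithmetic illustration on the table entries, and the helper name `gain_summary` is an assumption.

```python
from statistics import median

# (original, DeTraMeNet) accuracies for PlainNet 3/6/9/12-layer, copied from Table 4.
plainnet = {
    "CIFAR10": [(0.4248, 0.8867), (0.8634, 0.9241), (0.9036, 0.9340), (0.9108, 0.9361)],
    "CIFAR100": [(0.2209, 0.6475), (0.6275, 0.7014), (0.6591, 0.7034), (0.6901, 0.7117)],
    "SVHN": [(0.4564, 0.9721), (0.9755, 0.9817), (0.9798, 0.9826), (0.9814, 0.9827)],
}

def gain_summary(pairs):
    """Median and minimum accuracy gain, in percentage points."""
    gains = [100.0 * (new - old) for old, new in pairs]
    return median(gains), min(gains)

summary = {name: gain_summary(pairs) for name, pairs in plainnet.items()}
for name, (med, low) in summary.items():
    print(f"{name}: median gain {med:.2f}, minimum gain {low:.2f} (percentage points)")
```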
Parameter number. Although, for a given architecture, DeTraMeNet improves the accuracy, it involves more parameters. However, as demonstrated in Figure 2, for a given number of parameters, DeTraMeNet outperforms the original CNNs over all three datasets. Plots corresponding to DeTraMeNet for both PlainNet and ResNet architectures are indeed above those associated with standard CNNs.
Capacity. In terms of depth, comparing the improvements obtained with PlainNet and ResNet shows that the shallower the network, the larger the accuracy gain. It is remarkable that DeTraMeNet leads to more than accuracy increase for PlainNet 3layer on CIFAR10, CIFAR100 and SVHN datasets. When the networks become deeper, they better capture discriminative features of the classes, and albeit with smaller gains, DeTraMeNet still achieves a better accuracy than a generic deep CNN, e.g. around higher than ResNet 164 on CIFAR10 and CIFAR100. In terms of width, we use WideResNet 16-4 and WideResNet 16-8 as two reference models, since both of them include 16 layers but have different widths. Table 4 shows that increasing width is beneficial to DeTraMeNet. Since the original models have already achieved excellent performance for CIFAR10 and SVHN, DeTraMeNets with various widths show similar, slightly improved accuracies. However, for CIFAR100, enlarging the width for DeTraMeNet leads to an increase in the accuracy gain from to .
6 Conclusion
Starting from a DDL formulation, we have shown that it is possible to reformulate the problem as a standard optimization problem with the introduction of metrics within standard activation operators. This yields a novel Deep Transform and Metric Learning problem. This has allowed us to show that the original DDL can be performed by a network mixing linear layer and RNN algorithmic structures, thus leading to a fast and flexible network framework for building efficient DDL-based classifiers with a higher discriminative ability. Our experiments show that the resulting DeTraMeNet performs better than the original DDL approach and state-of-the-art generic CNNs. We think that the bridge we established between DDL and DNN will help in further understanding and controlling these powerful tools so as to attain better performance and properties.
References
[1] Aharon, M., Elad, M., Bruckstein, A.: K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing 54(11), 4311–4322 (2006)
[2] Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1), 183–202 (2009)
[3] Bian, X., Krim, H., Bronstein, A., Dai, L.: Sparsity and nullity: Paradigms for analysis dictionary learning. SIAM Journal on Imaging Sciences 9(3), 1107–1126 (2016)
[4] Boyd, S., Vandenberghe, L.: Convex optimization. Cambridge University Press (2004)
[5] Chouzenoux, E., Pesquet, J.C., Repetti, A.: Variable metric forward-backward algorithm for minimizing the sum of a differentiable function and a convex function. Journal of Optimization Theory and Applications 162(1), 107–132 (Jul 2014)
[6] Combettes, P.L., Pesquet, J.C.: Proximal splitting methods in signal processing. In: Bauschke, H.H., Burachik, R., Combettes, P.L., Elser, V., Luke, D.R., Wolkowicz, H. (eds.) Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer-Verlag, New York (2010)
[7] Combettes, P.L., Pesquet, J.C.: Deep neural network structures solving variational inequalities. Set-Valued and Variational Analysis (2018), https://arxiv.org/abs/1808.07526
[8] Elad, M., Aharon, M.: Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing 15(12), 3736–3745 (Dec 2006). https://doi.org/10.1109/TIP.2006.881969
[9] Gregor, K., LeCun, Y.: Learning fast approximations of sparse coding. In: Proceedings of the 27th International Conference on International Conference on Machine Learning. pp. 399–406. Omnipress (2010)
[10] Grosse, R., Raina, R., Kwong, H., Ng, A.Y.: Shift-invariant sparse coding for audio classification. In: Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence. pp. 149–158. UAI'07, AUAI Press, Arlington, Virginia, USA (2007)
[11] Guo, J., Guo, Y., Kong, X., Zhang, M., He, R.: Discriminative analysis dictionary learning. In: Thirtieth AAAI Conference on Artificial Intelligence (2016)
[12] Gupta, P., Maggu, J., Majumdar, A., Chouzenoux, E., Chierchia, G.: Deconfuse: A deep convolutional transform based unsupervised fusion framework. Tech. rep. (2020), https://hal.archives-ouvertes.fr/hal-02461768
[13] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
[14] Huang, J.J., Dragotti, P.L.: A deep dictionary model for image super-resolution. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'18). IEEE, Calgary, Canada (March 2018)
[15] Komodakis, N., Pesquet, J.C.: Playing with duality: An overview of recent primal-dual approaches for solving large-scale optimization problems. IEEE Signal Processing Magazine 32(6), 31–54 (Nov 2015)
[16] Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images. Tech. rep., Citeseer (2009)
[17] Liu, Y., Chen, Q., Chen, W., Wassell, I.: Dictionary learning inspired deep network for scene recognition. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
[18] Maggu, J., Majumdar, A.: Unsupervised deep transform learning. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 6782–6786. Calgary, Canada (15–20 April 2018)
[19] Mahdizadehaghdam, S., Dai, L., Krim, H., Skau, E., Wang, H.: Image classification: A hierarchical dictionary learning approach. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 2597–2601. IEEE (2017)
[20] Mahdizadehaghdam, S., Panahi, A., Krim, H., Dai, L.: Deep dictionary learning: A parametric network approach. IEEE Transactions on Image Processing (2019)
[21] Mairal, J., Ponce, J., Sapiro, G., Zisserman, A., Bach, F.R.: Supervised dictionary learning. In: Advances in Neural Information Processing Systems. pp. 1033–1040 (2009)
[22] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2011)
[23] Ramirez, I., Sprechmann, P., Sapiro, G.: Classification and clustering via dictionary learning with structured incoherence and shared features. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. pp. 3501–3508. IEEE (2010)
[24] Rubinstein, R., Peleg, T., Elad, M.: Analysis K-SVD: A dictionary-learning algorithm for the analysis sparse model. IEEE Transactions on Signal Processing 61(3), 661–677 (2013)
[25] Skau, E., Wohlberg, B., Krim, H., Dai, L.: Pansharpening via coupled triple factorization dictionary learning. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1234–1237. IEEE (2016)
[26] Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806 (2014)
[27] Tang, W., Otero, I.R., Krim, H., Dai, L.: Analysis dictionary learning for scene classification. In: Statistical Signal Processing Workshop (SSP), 2016 IEEE. pp. 1–5. IEEE (2016)
[28] Tang, W., Panahi, A., Krim, H., Dai, L.: Structured analysis dictionary learning for image classification. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 2181–2185. IEEE (2018)
[29] Tang, W., Panahi, A., Krim, H., Dai, L.: Analysis dictionary learning: an efficient and discriminative solution. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 3682–3686. IEEE (2019)
[30] Tang, W., Panahi, A., Krim, H., Dai, L.: Analysis dictionary learning based classification: Structure for robustness. IEEE Transactions on Image Processing 28(12), 6035–6046 (2019)
[31] Tariyal, S., Majumdar, A., Singh, R., Vatsa, M.: Deep dictionary learning. IEEE Access 4, 10096–10109 (2016)
[32] Wang, J., Guo, Y., Guo, J., Luo, X., Kong, X.: Class-aware analysis dictionary learning for pattern classification. IEEE Signal Processing Letters 24(12), 1822–1826 (2017)
[33] Wang, Q., Guo, Y., Guo, J., Kong, X.: Synthesis K-SVD based analysis dictionary learning for pattern classification. Multimedia Tools and Applications 77(13), 17023–17041 (2018)
[34] Wang, Z., Liu, D., Yang, J., Han, W., Huang, T.: Deep networks for image super-resolution with sparse prior. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 370–378 (2015)
[35] Wang, Z., Yang, J., Nasrabadi, N., Huang, T.: A max-margin perspective on sparse representation-based classification. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1217–1224 (2013)
[36] Xu, M., Jia, X., Pickering, M., Plaza, A.J.: Cloud removal based on sparse representation via multitemporal dictionary learning. IEEE Transactions on Geoscience and Remote Sensing 54(5), 2998–3006 (2016). https://doi.org/10.1109/tgrs.2015.2509860, https://app.dimensions.ai/details/publication/pub.1061614193
[37] Yang, M., Zhang, L., Feng, X., Zhang, D.: Fisher discrimination dictionary learning for sparse representation. In: 2011 International Conference on Computer Vision. pp. 543–550. IEEE (2011)
[38] Zagoruyko, S., Komodakis, N.: Wide residual networks. arXiv preprint arXiv:1605.07146 (2016)
[39] Zhang, D., Liu, P., Zhang, K., Zhang, H., Wang, Q., Jing, X.: Class relatedness oriented-discriminative dictionary learning for multiclass image classification. Pattern Recognition 59(C), 168–175 (Nov 2016). https://doi.org/10.1016/j.patcog.2015.12.005
[40] Zhong, W.: Robust object tracking via sparsity-based collaborative model. In: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1838–1845. CVPR '12, IEEE Computer Society, USA (2012)
[41] Zhou, J.T., Di, K., Du, J., Peng, X., Yang, H., Pan, S.J., Tsang, I.W., Liu, Y., Qin, Z., Goh, R.S.M.: Sc2net: Sparse lstms for sparse coding. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)