# Learned Convolutional Sparse Coding

###### Abstract

We propose a convolutional recurrent sparse auto-encoder model. The model consists of a sparse encoder, which is a convolutional extension of the learned ISTA (LISTA) method, and a linear convolutional decoder. Our strategy offers a simple method for learning a task-driven sparse convolutional dictionary (CD) and producing an approximate convolutional sparse code (CSC) over the learned dictionary. We trained the model to minimize reconstruction loss via gradient descent with back-propagation, and achieved results competitive with KSVD image denoising and with leading CSC methods in image inpainting, while requiring only a small fraction of their run-time.


Hillel Sreter, Raja Giryes

This research is supported by Wipro Ltd. One of the GPUs used for this research was donated by the NVIDIA Corporation.

School of Electrical Engineering, Tel Aviv University, Tel Aviv, Israel, {hillelsr@mail., raja@}tau.ac.il

Index Terms— Sparse Coding, ISTA, LISTA, Convolutional Sparse Coding, Neural networks.

## 1 Introduction

Sparse coding (SC) is a powerful and popular tool used in a variety of applications, from classification and feature extraction to signal processing tasks such as enhancement, super-resolution, etc. A classical approach to using SC with a signal is to split it into segments (or patches) $x_i$ and solve for each

$$\hat{\alpha}_i = \arg\min_{\alpha_i} \left\|x_i - D\alpha_i\right\|_2^2 \;\; \text{s.t.} \;\; \left\|\alpha_i\right\|_0 \le k, \tag{1}$$

where $\alpha_i$ is the sparse representation of the (column stacked) patch $x_i$ in the dictionary $D$ and $k$ is the target sparsity level. There are two pain points in this approach: (i) performing SC over all the patches tends to be a slow process; and (ii) learning a dictionary over each segment independently loses spatial information outside it, such as shift-invariance in images.

One prominent approach for addressing the first point is using approximate sparse coding models such as the learned iterative shrinkage and thresholding algorithm (LISTA) [1]. LISTA is a recurrent neural network architecture designed to mimic ISTA [2], which is an iterative algorithm for approximating the solution of (1). As to the second point, one may impose an additional prior on the learned dictionary, such as being convolutional, i.e., a concatenation of Toeplitz matrices. In this case each element (also known as an atom) of the dictionary is learned based on the whole signal. Moreover, the resulting dictionary is shift-invariant due to its convolutional structure.

In this paper, we introduce a learning method for task-driven CD learning. We design a convolutional neural network that learns both a CD for a family of signals and the corresponding CSCs. We demonstrate the efficiency of our new approach in the tasks of image denoising and inpainting.

## 2 The Sparse Coding Problem

### 2.1 Sparse coding and ISTA

Solving (1) directly is a combinatorial problem; thus, its complexity grows exponentially with the dictionary size. To resolve this deficiency, many approximation strategies have been developed for solving (1). A popular one relaxes the $\ell_0$ pseudo-norm using the $\ell_1$ norm, yielding (the unconstrained form):

$$\hat{\alpha} = \arg\min_{\alpha} \frac{1}{2}\left\|x - D\alpha\right\|_2^2 + \lambda\left\|\alpha\right\|_1. \tag{2}$$

A popular technique to minimize (2) is the ISTA [2] iteration:

$$\alpha_{t+1} = S_{\lambda/L}\left(\alpha_t + \frac{1}{L} D^T \left(x - D\alpha_t\right)\right), \tag{3}$$

where $\alpha_0 = 0$, $L$ is the largest eigenvalue of $D^T D$ and $S_\theta$ is the soft thresholding operator:

$$S_\theta(x) = \operatorname{sign}(x)\max\left(|x| - \theta,\, 0\right). \tag{4}$$

ISTA iterations are stopped when a defined convergence criterion is satisfied.
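As a concrete illustration (ours, not from the paper), the ISTA iteration (3) with the soft thresholding operator (4) fits in a few lines of NumPy; all variable names here are our own:

```python
import numpy as np

def soft_threshold(v, theta):
    # Soft thresholding operator S_theta of Eq. (4)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def ista(x, D, lam, n_iter=200):
    # ISTA iterations of Eq. (3); L is the largest eigenvalue of D^T D
    L = np.linalg.eigvalsh(D.T @ D).max()
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        alpha = soft_threshold(alpha + D.T @ (x - D @ alpha) / L, lam / L)
    return alpha
```

With the $1/L$ step size, each iteration is guaranteed not to increase the objective (2), so in practice the loop is stopped when the decrease falls below a tolerance.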

### 2.2 Learned ISTA (LISTA)

In approximate SC, one may build a non-linear differentiable encoder that can be trained to produce a quick approximate SC of a given signal. In [1], an ISTA-like recurrent network, denoted LISTA, is introduced. By rewriting (3) as

$$\alpha_{t+1} = S_{\lambda/L}\left(\left(I - \frac{1}{L}D^T D\right)\alpha_t + \frac{1}{L}D^T x\right), \tag{5}$$

LISTA can be derived by substituting in (5) $S$ for $I - \frac{1}{L}D^T D$, $W_e$ for $\frac{1}{L}D^T$ and $\theta$ for $\lambda/L$ (using a separate threshold value for each entry instead of a single threshold for all values as in [3]). Thus, the LISTA iteration is defined as:

$$\alpha_{t+1} = S_{\theta}\left(S\alpha_t + W_e x\right), \quad t = 0, \ldots, T-1, \tag{6}$$

where $T$ is the number of LISTA iterations and the parameters $S$, $W_e$ and $\theta$ are learned by minimizing:

$$\min_{S, W_e, \theta} \left\|\alpha^* - \alpha_T\right\|_2^2, \tag{7}$$

where $\alpha^*$ is either the optimal SC of the input signal (if attainable) or the final ISTA solution after its convergence.
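As a sketch (with hypothetical variable names of our own), the LISTA forward pass of (6) is just a short loop. Initializing $S$, $W_e$ and $\theta$ from a given dictionary as suggested by (5) reproduces plain ISTA exactly; training then departs from this starting point:

```python
import numpy as np

def soft_threshold(v, theta):
    # Soft thresholding operator S_theta of Eq. (4)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def lista_forward(x, S, We, theta, n_iter=3):
    # LISTA iterations of Eq. (6): alpha <- S_theta(S alpha + We x)
    alpha = np.zeros(S.shape[0])
    for _ in range(n_iter):
        alpha = soft_threshold(S @ alpha + We @ x, theta)
    return alpha

def ista_init(D, lam):
    # ISTA-derived initialization suggested by Eq. (5)
    L = np.linalg.eigvalsh(D.T @ D).max()
    m = D.shape[1]
    return np.eye(m) - D.T @ D / L, D.T / L, lam / L
```

In LISTA the three returned quantities become free parameters, so a few trained iterations can match many ISTA iterations.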

There has been much work on LISTA-like architectures. In [3], a LISTA auto-encoder is introduced: rather than training to approximate the optimal SC as in (7), LISTA is trained directly to minimize the optimization objective (2). A discriminative non-negative version of LISTA is introduced in [4]. It contains a decoder that reconstructs the approximate SC back to the input signal, and a classification unit that receives the approximate SC as an input. Thus, the model described in [4] is named a discriminative recurrent sparse auto-encoder, and the quality of the produced approximate SC is quantified by the success of the decoder and classifier. The work in [5] used a cascaded sparse coding network with LISTA building blocks to fully exploit the sparsity of an image for the super-resolution problem. In [6] and [7], a theoretical explanation is provided for why LISTA-like models are able to accelerate iterative SC solvers.

## 3 Convolutional Sparse Coding

The CSC model [8], [9], [10], [11], [12] can be derived from the classical SC model by substituting matrix multiplication with the convolution operator:

$$x = \sum_{i=1}^{m} d_i * z_i, \tag{8}$$

where $x$ is the input signal, $d_i$ a local convolution filter and $z_i$ the sparse feature map of the convolutional atom $d_i$. The minimization problem in (2) for CSC may be formulated as:

$$\min_{\{z_i\}} \frac{1}{2}\Big\|x - \sum_{i=1}^{m} d_i * z_i\Big\|_2^2 + \lambda \sum_{i=1}^{m} \left\|z_i\right\|_1. \tag{9}$$

It is important to note that unlike traditional SC, $x$ is not split into patches (or segments); rather, the CSC is of the full input signal. The CSC model is inherently spatially invariant; thus, a learned atom of a specific edge orientation can globally represent all edges of that orientation in the whole image. Unlike CSC, in classical SC multiple atoms tend to learn the same oriented edge at different spatial offsets.
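The synthesis model (8) is easy to visualize in one dimension: a signal is the sum of a few filters placed at the nonzero locations of their sparse feature maps. A toy NumPy sketch (our own, not from the paper):

```python
import numpy as np

def csc_synthesize(filters, maps):
    # x = sum_i d_i * z_i  (Eq. (8)); 'full' convolution of each sparse map
    return sum(np.convolve(z, d) for d, z in zip(filters, maps))

rng = np.random.default_rng(0)
n, N = 5, 50                          # filter length, signal length
filters = [rng.standard_normal(n) for _ in range(3)]
maps = []
for _ in filters:
    z = np.zeros(N - n + 1)           # 'valid'-sized sparse feature map
    z[rng.choice(z.size, 2, replace=False)] = rng.standard_normal(2)
    maps.append(z)
x = csc_synthesize(filters, maps)     # length-N signal
```

Each feature map here has only two active coefficients, yet together they describe the whole signal without any patch splitting.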

Various methods have been proposed for solving (9). The strategies in [8] and [9] involve transformation to the frequency domain and optimization using the Alternating Direction Method of Multipliers (ADMM). These methods tend to optimize over the whole train-set at once. Thus, the whole train-set must be held in memory while learning the CD, which of course limits the train-set size. Moreover, inferring the CSC of a new signal is an iterative process that may require a large number of iterations, making it less suitable for real-time tasks. Work on speeding up the ADMM method for CD learning is done in [13]. In [14], a consensus-based optimization framework has been proposed that makes CSC tractable on large-scale datasets. In [15], a thorough performance comparison among different CD learning methods is done, as well as proposing a new learning approach. More work on CD learning has been done in [10] and [16]. In [17], the effect of solving in the frequency domain on boundary artifacts is studied, offering different types of solutions. The work in [18] shows the potential of CSC for image super-resolution.

## 4 Learned Convolutional Sparse Coding

In order to have a computationally efficient CSC model, we extend the approximate SC model LISTA [1] to the CSC model. We perform training in an end-to-end task-driven fashion. Our proposed approach shows performance competitive with classical SC and CSC methods in different tasks, but with an order of magnitude fewer computations. Our model is trained via stochastic gradient descent; thus, it can naturally learn the CD over very large datasets without any special adaptations. This of course helps in learning a CD that can better represent the space from which the input signals are sampled.

### 4.1 Learning approximate CSC

Due to the linear properties of convolutions and the fact that the CSC model can be thought of as a classical SC model whose dictionary is a concatenation of Toeplitz matrices, the CSC model can be viewed as a specific case of classical SC. Thus, the objective in (9) can be formatted to be like (2) by substituting the general dictionary $D$ with this Toeplitz concatenation. Obviously, naively reformulating the CSC model as matrix multiplication is very inefficient, both memory-wise, since for a length-$N$ signal and $m$ filters the resulting $D$ has $N \times Nm$ entries, and computation-wise, as each element of $D\alpha$ is computed with $Nm$ multiply and accumulate operations (MACs) versus the convolution formulation, where only $nm$ MACs are needed per output element (realistically assuming the filter length $n \ll N$).
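To make the equivalence concrete, the sketch below (our own illustration) builds the tall Toeplitz matrix $T$ whose product with $z$ equals a full 1-D convolution; materializing it stores $(n + \mathrm{len}(d) - 1) \times n$ entries to represent only $\mathrm{len}(d)$ distinct numbers, which is exactly the memory waste described above:

```python
import numpy as np

def conv_matrix(d, n):
    # Toeplitz matrix T such that T @ z == np.convolve(z, d) for z of length n.
    # Column j holds a copy of the filter d shifted down by j rows.
    T = np.zeros((len(d) + n - 1, n))
    for j in range(n):
        T[j:j + len(d), j] = d
    return T
```

The matrix-vector product costs one MAC per stored entry, whereas `np.convolve` touches each filter tap only once per output sample.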

Thus, instead of using standard LISTA directly on such a dictionary, we reformulate ISTA for the convolutional case and then propose its LISTA version. The ISTA iteration for CSC reads as:

$$z_{t+1} = S_{\lambda/L}\left(z_t + \frac{1}{L}\,\operatorname{flip}(d) * \left(x - d * z_t\right)\right), \tag{10}$$

where $d$ is an array of $m$ filters, $d * z \triangleq \sum_{i=1}^{m} d_i * z_i$, and $\operatorname{flip}(d)$ applies the filters channel-wise to produce the $m$ feature-map updates. The operation $\operatorname{flip}(\cdot)$ reverses the order of entries in each filter in both dimensions. Modeling (10) in a similar way to (6) (with some differences) leads to the convolutional LISTA structure:

$$z_{t+1} = S_{\theta}\left(z_t + W_e * \left(x - W_d * z_t\right)\right), \tag{11}$$

where $W_d$, $W_e$ and $\theta$ are fully trainable and independent variables. Note that the filters here also take into account having multiple channels in the original signal (e.g., color channels).
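A single iteration of (11) can be sketched in 1-D NumPy as below (our own illustration with hypothetical names; the 'valid'-mode convolution with pre-flipped encoder filters plays the role of the adjoint operator):

```python
import numpy as np

def soft_threshold(v, theta):
    # Soft thresholding operator S_theta of Eq. (4)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def conv_lista_step(z, x, Wd, We, theta):
    # One iteration of Eq. (11): z <- S_theta(z + We * (x - Wd * z)),
    # with Wd the decoder-side and We the encoder-side filter arrays.
    residual = x - sum(np.convolve(zi, wi) for zi, wi in zip(z, Wd))
    return [soft_threshold(zi + np.convolve(residual, wi, mode='valid'), theta)
            for zi, wi in zip(z, We)]
```

Because the filters in `Wd` and `We` are separate variables, training is free to depart from the ISTA-prescribed relation between them.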

### 4.2 Learning the CD

When learning a CD, we expect the reconstruction $\hat{x}$ to be as close as possible to $x$ given the approximate CSC (ACSC) $z_T$ produced by the model described in (11). Thus, we learn the CD by adding a linear decoder, consisting of a filter array $d$, at the end of the convolutional LISTA iterations. This leads to the following network that integrates both the calculation of the CSC and the application of the dictionary:

$$\hat{x} = d * z_T = \sum_{i=1}^{m} d_i * (z_T)_i. \tag{12}$$
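Putting the pieces together, a minimal 1-D forward pass of the full model (12) — a few encoder iterations of (11) followed by the linear convolutional decoder — can be sketched as follows (our own names; in practice all filter arrays and the threshold would be trained):

```python
import numpy as np

def soft_threshold(v, theta):
    # Soft thresholding operator S_theta of Eq. (4)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def acsc_autoencoder(x, Wd, We, d, theta, n_iter=3):
    # Eq. (12): run the convolutional LISTA encoder of Eq. (11) for n_iter
    # steps, then reconstruct with the linear convolutional decoder d.
    z = [np.zeros(x.size - wi.size + 1) for wi in Wd]
    for _ in range(n_iter):
        r = x - sum(np.convolve(zi, wi) for zi, wi in zip(z, Wd))
        z = [soft_threshold(zi + np.convolve(r, wi, mode='valid'), theta)
             for zi, wi in zip(z, We)]
    return sum(np.convolve(zi, di) for zi, di in zip(z, d))
```

The output has the same length as the input, so a reconstruction loss between the two can drive end-to-end training.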

### 4.3 Task driven convolutional sparse coding

This formulation makes it possible to train a sparsity-inducing auto-encoder, where the encoder learns to produce an ACSC and the decoder learns the correct filters to reconstruct the signal from the generated ACSC. The whole model is trained via stochastic gradient descent aiming at minimizing:

$$\min_{W_d, W_e, \theta, d} \; \mathcal{L}\left(x^*, \hat{x}\right), \tag{13}$$

where $x^*$ is the target signal and $\hat{x}$ is the one calculated in (12). We tested different types of distance functions $\mathcal{L}$ and found (15) to yield the best results. Unlike the sparse auto-encoder proposed in [19], where the encoder is a feed-forward network and sparsity is induced by adding a sparsity promoting regularization to the loss function, our model is inherently biased to produce a sparse encoding of its input due to the special design of its architecture. From a probabilistic point of view, as shown in [20], a sparse auto-encoder can be thought of as a generative model that has latent variables with a sparsity prior. Thus, the joint distribution of the model output and the CSC is

$$p(x, z) = p(x \mid z)\, p(z). \tag{14}$$

The soft thresholding operation used in the network encourages $p(z)$ to be large when $z$ is sparse. Thus, when training our model we found it sufficient to minimize a reconstruction term representing $p(x \mid z)$ without the need to add a sparsity inducing term to the loss.

## 5 Experiments

### 5.1 Learned CSC network parameters

We used our model as specified in (12). We found 3 recurrent steps to be sufficient for our tasks.

As initialization is important for convergence, $W_d$ and the decoder filters $d$ are initialized with the same random values, and $W_e$ is initialized with the transposed and flipped filters of $W_d$, scaled by $1/L$; this factor takes into consideration the step size $1/L$ from (3). We initialized the threshold $\theta$ to a small constant, thus implicitly initializing $\lambda$ to $L\theta$.
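The described initialization can be sketched as follows (our own names; the bound `L` on the operator norm and the initial threshold value are assumed example numbers, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 7                            # number and length of filters
Wd = [rng.standard_normal(n) for _ in range(m)]
d = [w.copy() for w in Wd]             # decoder starts from the same random values
L = 10.0                               # assumed bound on the largest eigenvalue
We = [w[::-1] / L for w in Wd]         # transposed/flipped filters, scaled by 1/L
theta = 1e-2                           # assumed small initial threshold (lambda = L * theta)
```

Starting from this ISTA-consistent point, training then adapts all four quantities independently.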
We tested different types of reconstruction loss, including the standard $\ell_1$ and $\ell_2$ losses. We found the loss function proposed in [21] to retrieve the best image quality,

$$\mathcal{L}\left(x^*, \hat{x}\right) = \alpha\, \mathcal{L}^{\text{MS-SSIM}}\left(x^*, \hat{x}\right) + (1-\alpha)\, \mathcal{L}^{\ell_1}\left(x^*, \hat{x}\right), \tag{15}$$

where $\mathcal{L}^{\text{MS-SSIM}}$ is the multiscale SSIM loss. We trained the model using the Adam optimizer [22] with this reconstruction loss. We use the PASCAL VOC [23] dataset due to its large amount of high quality images. All images are normalized by 255 before being fed to the model.

### 5.2 Image denoising

To test our model for image denoising, we added random Gaussian noise to a given original image $x$, producing $y = x + e$, where $e \sim \mathcal{N}(0, \sigma^2 I)$.
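The denoising setup and its evaluation metric can be reproduced in a few lines (our own sketch; the noise level `sigma` is an assumed example value for images normalized to $[0, 1]$):

```python
import numpy as np

def psnr(ref, est, peak=1.0):
    # Peak signal-to-noise ratio in dB
    mse = np.mean((ref - est) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
x = rng.random((64, 64))          # stand-in for a normalized clean image
sigma = 25.0 / 255.0              # assumed noise level on [0, 1] images
y = x + rng.normal(0.0, sigma, x.shape)
```

With this `sigma` the noisy input sits around 20 dB PSNR, so any useful denoiser must push the reconstruction well above that.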

As can be seen in Fig. 1 and Table 2, the learned CD generalizes well and the reconstruction is competitive both qualitatively and quantitatively to the results of KSVD denoising.

We show the atoms of the learned CD in Fig. 2. The learned CD does not have any image specific atoms but rather a mixture of high pass DCT like and Gabor like filters. Table 1 presents the average run time per image. We used KSVD’s publicly available code [24]. Our model is faster by an order of magnitude on a CPU and by two orders on a GPU.

| | ours CPU | ours GPU | KSVD CPU |
|---|---|---|---|
| runtime [sec] | 0.81 | 0.03 | 4.21 |

| Image | Proposed | KSVD |
|---|---|---|
| Lena | | |
| House | | |
| Pepper | | |
| Couple | | |
| Fgpr | | |
| Boat | | |
| Hill | | |
| Man | | |
| Barbara | | |

### 5.3 Image inpainting

We further test our model on the inpainting problem, in which $y = M \odot x$, where $\odot$ is an element-wise multiplication operator and $M$ is a binary mask, such that $y$ contains only part of the pixels in $x$. Rewriting (8) while taking $M$ into consideration, we have

$$M \odot x = M \odot \left(\sum_{i=1}^{m} d_i * z_i\right). \tag{16}$$

Thus, (10) becomes

$$z_{t+1} = S_{\lambda/L}\left(z_t + \frac{1}{L}\,\operatorname{flip}(d) * \left(M \odot x - M \odot \left(d * z_t\right)\right)\right). \tag{17}$$

The ACSC version of (17) is given by

$$z_{t+1} = S_{\theta}\left(z_t + W_e * \left(M \odot x - M \odot \left(W_d * z_t\right)\right)\right). \tag{18}$$
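The masked iteration (18) differs from (11) only in applying the mask inside the residual, so the observed pixels alone drive the code update. A 1-D NumPy sketch (our own names, not from the paper):

```python
import numpy as np

def soft_threshold(v, theta):
    # Soft thresholding operator S_theta of Eq. (4)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def masked_step(z, y, mask, Wd, We, theta):
    # One iteration of Eq. (18): the residual is computed only on the
    # observed pixels, y = mask * x being the masked input.
    r = y - mask * sum(np.convolve(zi, wi) for zi, wi in zip(z, Wd))
    return [soft_threshold(zi + np.convolve(r, wi, mode='valid'), theta)
            for zi, wi in zip(z, We)]
```

The reconstruction of the missing pixels is then read off the unmasked decoder output $d * z$.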

In our experiment, each pixel of $M$ is kept with probability 0.5; thus, we randomly sample half of the input pixels. The objective is to reconstruct the original $x$ from the masked $y$. We took an ACSC network pre-trained on the denoising task, plugged $M$ into it in the form of (18), and optimized it for the inpainting task over the PASCAL VOC [23] dataset. We compare our inpainting results to [8] and [10] over the same test images used in [8]. The image numbering convention is consistent with [8]. All test images are preprocessed with local contrast normalization as in [8]. Table 4 shows that our model produces results competitive with those of [8] and [10]. The main advantage of the proposed approach, as can be seen in Table 3, is its significant speed-up in running time (by more than three orders of magnitude).

| | ours CPU | ours GPU | [8] CPU | [10] CPU |
|---|---|---|---|---|
| runtime [sec] | 0.6 | 0.023 | 163 | 65.49 |

| Image | Heide et al. | Papyan et al. | Proposed |
|---|---|---|---|
| 1 | 28.76 | 28.84 | |
| 2 | 31.54 | 31.63 | |
| 3 | 30.59 | 30.48 | |
| 4 | 27.41 | 26.8 | |
| 5 | 33.65 | 33.59 | |
| 6 | 33.03 | 27.7 | |
| 7 | 28.69 | 28.33 | |
| 8 | 30.39 | 30.23 | |
| 9 | 28.07 | 28.10 | |
| 10 | 31.59 | 31.65 | |
| 11 | 29.77 | 29.60 | |
| 12 | 26.40 | 26.45 | |
| 13 | 29.01 | 28.86 | |
| 14 | 29.48 | 29.5 | |
| 15 | 28.95 | 28.28 | |
| 16 | 29.59 | 29.48 | |
| 17 | 27.69 | 28.27 | |
| 18 | 31.61 | 31.36 | |
| 19 | 26.79 | 26.88 | |
| 20 | 31.93 | 32.01 | |
| 21 | 28.81 | | |
| 22 | 26.27 | 26.47 | |

## 6 Conclusion

Approximate convolutional sparse coding, as proposed in this work, is a powerful tool. It combines the computational efficiency and approximation power of convolutional neural networks with the strong theory of sparse coding. We demonstrated the efficiency of this strategy in both image denoising and inpainting.

## References

- [1] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 399–406.
- [2] I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Communications on pure and applied mathematics, vol. 57, no. 11, pp. 1413–1457, 2004.
- [3] P. Sprechmann, A.M. Bronstein, and G. Sapiro, “Learning efficient sparse and low rank models,” IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 9, pp. 1821–1833, 2015.
- [4] J. Rolfe and Y. LeCun, “Discriminative recurrent sparse auto-encoders,” ICLR, 2013.
- [5] Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang, “Deep networks for image super-resolution with sparse prior,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 370–378.
- [6] R. Giryes, Y. Eldar, A.M. Bronstein, and G. Sapiro, “Tradeoffs between convergence speed and reconstruction accuracy in inverse problems,” arXiv preprint arXiv:1605.09232, 2016.
- [7] T. Moreau and J. Bruna, “Adaptive acceleration of sparse coding via matrix factorization,” ICLR, 2017.
- [8] F. Heide, W. Heidrich, and G. Wetzstein, “Fast and flexible convolutional sparse coding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5135–5143.
- [9] H. Bristow, A. Eriksson, and S. Lucey, “Fast convolutional sparse coding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 391–398.
- [10] V. Papyan, Y. Romano, J. Sulam, and M. Elad, “Convolutional dictionary learning via local processing,” ICCV, 2017.
- [11] H. Bristow and S. Lucey, “Optimization methods for convolutional sparse coding,” arXiv preprint arXiv:1406.2407, 2014.
- [12] M. Zeiler, D. Krishnan, G. Taylor, and R. Fergus, “Deconvolutional networks,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 2528–2535.
- [13] Y. Wang, Q. Yao, J. T. Kwok, and L. Ni, “Online convolutional sparse coding,” 2017.
- [14] B. Choudhury, R. Swanson, F. Heide, G. Wetzstein, and W. Heidrich, “Consensus convolutional sparse coding,” 2017.
- [15] C. Garcia-Cardona and B. Wohlberg, “Convolutional dictionary learning,” arXiv preprint arXiv:1709.02893, 2017.
- [16] B. Wohlberg, “Efficient convolutional sparse coding,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 7173–7177.
- [17] B. Wohlberg, “Boundary handling for convolutional sparse representations,” in Image Processing (ICIP), 2016 IEEE International Conference on. IEEE, 2016, pp. 1833–1837.
- [18] S. Gu, W. Zuo, Q. Xie, D. Meng, X. Feng, and L. Zhang, “Convolutional sparse coding for image super-resolution,” in The IEEE International Conference on Computer Vision (ICCV), December 2015.
- [19] A. Ng, “Sparse autoencoder,” CS294A Lecture notes, vol. 72, no. 2011, pp. 1–19, 2011.
- [20] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org.
- [21] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for image restoration with neural networks,” IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 47–57, 2017.
- [22] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [23] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results,” http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
- [24] R. Rubinstein, M. Zibulevsky, and M. Elad, “Efficient implementation of the k-svd algorithm using batch orthogonal matching pursuit,” Cs Technion, vol. 40, no. 8, pp. 1–15, 2008.