Convolutional Dictionary Learning via Local Processing
Convolutional Sparse Coding (CSC) is an increasingly popular model in the signal and image processing communities, tackling some of the limitations of traditional patch-based sparse representations. Although several works have addressed the dictionary learning problem under this model, these relied on an ADMM formulation in the Fourier domain, losing the sense of locality and the relation to the traditional patch-based sparse pursuit. A recent work suggested a novel theoretical analysis of this global model, providing guarantees that rely on a localized sparsity measure. Herein, we extend this local-global relation by showing how one can efficiently solve the convolutional sparse pursuit problem and train the filters involved, while operating locally on image patches. Our approach provides an intuitive algorithm that can leverage standard techniques from the sparse representations field. The proposed method is fast to train, simple to implement, and flexible enough that it can be easily deployed in a variety of applications. We demonstrate the proposed training scheme for image inpainting and image separation, while achieving state-of-the-art results.
1 Introduction
The celebrated sparse representation model has led to impressive results in various applications over the last decade [10, 1, 29, 30, 8]. In this context one typically assumes that a signal $X \in \mathbb{R}^N$ is a linear combination of a few columns, also called atoms, taken from a matrix $D \in \mathbb{R}^{N \times M}$ termed a dictionary; i.e. $X = D\Gamma$, where $\Gamma \in \mathbb{R}^M$ is a sparse vector. Given $X$, finding its sparsest representation, called sparse pursuit, amounts to solving the following problem
$$\min_{\Gamma} \|\Gamma\|_0 \quad \text{s.t.} \quad \|X - D\Gamma\|_2 \le \epsilon,$$
where $\epsilon$ stands for the model mismatch or an additive noise strength. The solution of the above can be approximated using greedy algorithms such as Orthogonal Matching Pursuit (OMP) [6] or convex formulations such as Basis Pursuit (BP) [7]. The task of learning the model, i.e. identifying the dictionary $D$ that best represents a set of training signals, is called dictionary learning, and several methods have been proposed for tackling it, including K-SVD [1], MOD [13], online dictionary learning [20], trainlets [26], and more.
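To make the pursuit problem concrete, the following is a minimal sketch of a greedy OMP-style solver. The function name `omp`, the stopping rule, and the toy dictionary are illustrative assumptions rather than the implementation used in the paper.

```python
import numpy as np

def omp(D, x, eps, max_atoms=None):
    """Greedy approximation of min ||g||_0 s.t. ||x - D g||_2 <= eps (illustrative)."""
    n_atoms = D.shape[1]
    max_atoms = n_atoms if max_atoms is None else max_atoms
    support = []
    gamma = np.zeros(n_atoms)
    residual = x.copy()
    while np.linalg.norm(residual) > eps and len(support) < max_atoms:
        # Pick the atom most correlated with the current residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:  # no further progress is possible
            break
        support.append(k)
        # Re-fit all selected coefficients by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        gamma = np.zeros(n_atoms)
        gamma[support] = coef
        residual = x - D @ gamma
    return gamma

# Toy example: a signal built from two atoms of a random unit-norm dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 40))
D /= np.linalg.norm(D, axis=0)
gamma_true = np.zeros(40)
gamma_true[[3, 17]] = [1.5, -2.0]
x = D @ gamma_true
gamma_hat = omp(D, x, eps=1e-8)
print(np.count_nonzero(gamma_hat), np.linalg.norm(x - D @ gamma_hat))
```

The least-squares re-fit at every step is what distinguishes OMP from plain matching pursuit; it keeps the residual orthogonal to all atoms selected so far.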
When dealing with high-dimensional signals, addressing the dictionary learning problem becomes computationally infeasible, and learning the model suffers from the curse of dimensionality. Traditionally, this problem was circumvented by training a local model for patches extracted from $X$ and processing these independently. This approach gained much popularity and success due to its simplicity and high performance [10, 21, 30, 8, 19]. A different approach is the Convolutional Sparse Coding (CSC) model, which aims to amend the problem by imposing a specific structure on the global dictionary involved [15, 4, 18, 27, 17, 16]. In particular, this model assumes that $D$ is a banded convolutional dictionary, so that the signal is modeled as a superposition of a few local atoms, or filters, shifted to different positions. Several works have presented algorithms for training convolutional dictionaries [4, 17, 27], circumventing some of the computational burdens of this problem by relying on ADMM solvers that operate in the Fourier domain. In doing so, these methods lost the connection to the patch-based processing paradigm, as widely practiced in many signal and image processing applications.
In this work, we propose a novel approach for training the CSC model, called slice-based dictionary learning. Unlike current methods, we leverage a localized strategy enabling the solution of the global problem in terms of only local computations in the original domain. The main advantages of our method over existing ones are:
- It operates locally on patches, while faithfully solving the global CSC problem;
- It reveals how one should modify current (and any) dictionary learning algorithms to solve the CSC problem in a variety of applications;
- It is easy to implement and intuitive to understand;
- It can leverage standard techniques from the sparse representations field, such as OMP, LARS, K-SVD, MOD, online dictionary learning and trainlets;
- It converges faster than current state-of-the-art methods, while providing a better model; and
- It can naturally allow for a different number of non-zeros in each spatial location, according to the local signal complexity.
The rest of this paper is organized as follows: Section 2 reviews the CSC model. The proposed method is presented in Section 3 and contrasted with conventional approaches in Section 4. Section 5 shows how our method can be employed to tackle the tasks of image inpainting and separation, and in Section 6 we demonstrate our algorithms empirically. We conclude this work in Section 7.
2 Convolutional Sparse Coding
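The CSC model assumes that a global signal $X \in \mathbb{R}^N$ can be decomposed as $X = \sum_{i=1}^{m} d_i * \Gamma_i$, where $d_i \in \mathbb{R}^n$ are local filters that are convolved with their corresponding feature maps (or sparse representations) $\Gamma_i \in \mathbb{R}^N$. Alternatively, following Figure 2, the above can be written in matrix form as $X = D\Gamma$, where $D \in \mathbb{R}^{N \times Nm}$ is a banded convolutional dictionary built from shifted versions of a local matrix $D_L \in \mathbb{R}^{n \times m}$, containing the atoms $\{d_i\}_{i=1}^{m}$ as its columns, and $\Gamma \in \mathbb{R}^{Nm}$ is a global sparse representation obtained by interlacing the $\{\Gamma_i\}_{i=1}^{m}$. In this setting, a patch $R_i X$ taken from the global signal equals $\Omega\gamma_i$, where $\Omega \in \mathbb{R}^{n \times (2n-1)m}$ is a stripe dictionary and $\gamma_i \in \mathbb{R}^{(2n-1)m}$ is a stripe vector. Here we defined $R_i \in \mathbb{R}^{n \times N}$ to be the operator that extracts the $i$-th $n$-dimensional patch from $X$.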
The work in [23] suggested a theoretical analysis of this global model, driven by a localized sparsity measure. Therein, it was shown that if all the stripes are sparse, the solution to the convolutional sparse pursuit problem is unique and can be recovered by greedy algorithms, such as the OMP [6], or convex formulations such as the Basis Pursuit (BP) [7]. This analysis was then extended in [24] to a noisy regime, showing that, under similar sparsity assumptions, the global problem formulation and the pursuit algorithms are also stable. Herein, we leverage this local-global relation from an algorithmic perspective, showing how one can efficiently solve the convolutional sparse pursuit problem and train the dictionary (i.e., the filters) involved, while only operating locally.
Note that the global sparse vector $\Gamma$ can be broken into a set of non-overlapping $m$-dimensional sparse vectors $\{\alpha_i\}_{i=1}^{N}$, which we call needles. The essence of the presented algorithm is in the observation that one can express the global signal as $X = \sum_{i=1}^{N} R_i^T D_L \alpha_i$, where $R_i^T$ is the operator that puts $D_L\alpha_i$ in the $i$-th position and pads the rest of the entries with zeros. Denoting by $s_i$ the $i$-th slice $D_L\alpha_i$, we can write the above as $X = \sum_{i=1}^{N} R_i^T s_i$. It is important to stress that the slices do not correspond to patches extracted from the signal, $R_i X$, but rather to much simpler entities. They represent only a fraction of the $i$-th patch, since $R_i X = R_i \sum_{j} R_j^T s_j$, i.e. a patch is constructed from several overlapping slices. Unlike current works in signal and image processing, which train a local dictionary on the patches $\{R_i X\}$, in what follows we define the learning problem with respect to the slices, $\{s_i\}$, instead. In other words, we aim to train $D_L$ instead of $\Omega$. As a motivation, we present in Figure 1 a set of patches $R_i X$ extracted from natural images and their corresponding slices $s_i$, obtained from the proposed algorithm, which will be presented in Section 3. Indeed, one can observe that the slices are simpler than the patches, as they contain less information.
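The relation between needles, slices and patches can be checked numerically. The sketch below assumes a one-dimensional signal with circulant boundaries; the helper names `place` and `extract` (standing in for the patch placement and patch extraction operators) and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, m = 32, 5, 3                      # signal length, filter length, number of filters
D_L = rng.standard_normal((n, m))       # local dictionary: filters as columns
# Needles: one m-dimensional sparse vector per location (roughly 10% non-zeros).
alpha = rng.standard_normal((N, m)) * (rng.random((N, m)) < 0.1)

def place(v, i, N):
    """Patch placement: put v at position i (circularly), zeros elsewhere."""
    out = np.zeros(N)
    out[(i + np.arange(len(v))) % N] = v
    return out

def extract(x, i, n):
    """Patch extraction: take the i-th length-n patch (circularly)."""
    return x[(i + np.arange(n)) % len(x)]

slices = alpha @ D_L.T                  # row i is the slice D_L @ alpha_i
X = sum(place(slices[i], i, N) for i in range(N))

# The same signal as a superposition of shifted filters, one filter at a time.
X_conv = np.zeros(N)
for k in range(m):
    for i in range(N):
        X_conv += alpha[i, k] * place(D_L[:, k], i, N)

# A patch is the sum of the parts of all slices that overlap it.
i0 = 7
patch = extract(X, i0, n)
patch_from_slices = sum(extract(place(slices[j], j, N), i0, n)
                        for j in range(i0 - n + 1, i0 + n))
print(np.allclose(X, X_conv), np.allclose(patch, patch_from_slices))
```

Each patch receives contributions from the 2n-1 slices that overlap it, which is why a single slice is a simpler object than the patch at the same location.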
3 Proposed Method: Slice-Based Dictionary Learning
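The convolutional dictionary learning problem refers to the following optimization objective (hereafter, we assume that the atoms in the dictionary are normalized to a unit norm),
$$\min_{D,\Gamma} \; \frac{1}{2}\|X - D\Gamma\|_2^2 + \lambda\|\Gamma\|_1, \tag{2}$$
for a convolutional dictionary $D$ as in Figure 2 and a Lagrangian parameter $\lambda$ that controls the sparsity level. Employing the decomposition of $X$ in terms of its slices, and the separability of the $\ell_1$ norm, the above can be written as the following constrained minimization problem,
$$\min_{D_L,\{\alpha_i\},\{s_i\}} \; \frac{1}{2}\Big\|X - \sum_{i=1}^{N} R_i^T s_i\Big\|_2^2 + \lambda\sum_{i=1}^{N}\|\alpha_i\|_1 \quad \text{s.t.} \quad s_i = D_L\alpha_i, \;\; i = 1,\dots,N.$$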
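One could tackle this problem using half-quadratic splitting [14] by introducing a penalty term over the violation of the constraint and gradually increasing its importance. Alternatively, we can employ the ADMM algorithm [3] and solve the augmented Lagrangian formulation (in its scaled form),
$$\min_{D_L,\{\alpha_i\},\{s_i\},\{u_i\}} \; \frac{1}{2}\Big\|X - \sum_{i=1}^{N} R_i^T s_i\Big\|_2^2 + \sum_{i=1}^{N}\Big(\lambda\|\alpha_i\|_1 + \frac{\rho}{2}\|s_i - D_L\alpha_i + u_i\|_2^2\Big), \tag{5}$$
where $\{u_i\}$ are the dual variables that enable the constraints to be met and $\rho$ is the ADMM penalty parameter.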
3.1 Local Sparse Coding and Dictionary Update
The minimization of Equation (5) with respect to all the needles $\{\alpha_i\}$ is separable, and can be addressed independently for every $\alpha_i$ by leveraging standard tools such as LARS. This also allows for having a different number of non-zeros per slice, depending on the local complexity. Similarly, the minimization with respect to $D_L$ can be done using any patch-based dictionary learning algorithm such as the K-SVD, MOD, online dictionary learning or trainlets. Note that in the dictionary update stage, while minimizing for $D_L$ and $\{\alpha_i\}$, one could refrain from iterating these updates until convergence, and instead perform only a few iterations before proceeding with the remaining variables.
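As an illustration of these two local steps, here is a minimal sketch in which ISTA (iterative soft thresholding) is swapped in for LARS and a MOD-style regularized least-squares fit is swapped in for the K-SVD. Both substitutions, the function names, and the assumption that each needle solves an $\ell_1$-penalized fit to a local target (the slice plus its scaled dual) with penalty weight `rho` are illustrative choices.

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(D_L, A, target, lam, rho):
    """lam * sum_i ||a_i||_1 + (rho/2) * sum_i ||target_i - D_L a_i||_2^2."""
    return lam * np.abs(A).sum() + 0.5 * rho * np.sum((target - D_L @ A) ** 2)

def needle_update(D_L, target, lam, rho, n_iter=200):
    """ISTA on every column of `target` at once; each column is an independent problem."""
    step = 1.0 / (rho * np.linalg.norm(D_L, 2) ** 2)   # 1 / Lipschitz constant
    A = np.zeros((D_L.shape[1], target.shape[1]))
    for _ in range(n_iter):
        grad = rho * D_L.T @ (D_L @ A - target)
        A = soft(A - step * grad, lam * step)
    return A

def dictionary_update(target, A, eps=1e-8):
    """MOD-style least-squares fit of the local dictionary, then atom normalization."""
    G = A @ A.T + eps * np.eye(A.shape[0])
    D = np.linalg.solve(G, A @ target.T).T
    return D / np.maximum(np.linalg.norm(D, axis=0), 1e-12)

rng = np.random.default_rng(0)
n, m, N = 5, 8, 200
D_L = rng.standard_normal((n, m))
D_L /= np.linalg.norm(D_L, axis=0)
target = rng.standard_normal((n, N))     # stands in for the columns s_i + u_i
lam, rho = 0.1, 1.0
A = needle_update(D_L, target, lam, rho)
D_new = dictionary_update(target, A)
print(objective(D_L, A, target, lam, rho), np.count_nonzero(A, axis=0)[:10])
```

Because every column is solved independently, the number of non-zeros is free to vary from one location to the next, which is the local adaptivity mentioned above.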
3.2 Slice Update via Local Laplacian
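The minimization of Equation (5) with respect to all the slices $\{s_i\}$ amounts to solving the following quadratic problem
$$\min_{\{s_i\}} \; \frac{1}{2}\Big\|X - \sum_{i=1}^{N} R_i^T s_i\Big\|_2^2 + \frac{\rho}{2}\sum_{i=1}^{N}\|s_i - D_L\alpha_i + u_i\|_2^2.$$
Taking the derivative with respect to the variables $s_i$ and nulling them, we obtain the following system of linear equations
$$R_i\Big(\sum_{j=1}^{N} R_j^T s_j - X\Big) + \rho\,(s_i - D_L\alpha_i + u_i) = 0, \quad i = 1,\dots,N.$$
Defining
$$\tilde{R} = \begin{bmatrix} R_1 \\ \vdots \\ R_N \end{bmatrix}, \quad \bar{S} = \begin{bmatrix} s_1 \\ \vdots \\ s_N \end{bmatrix}, \quad \bar{Z} = \begin{bmatrix} D_L\alpha_1 - u_1 \\ \vdots \\ D_L\alpha_N - u_N \end{bmatrix},$$
the above can be written as
$$\bar{S} = \big(\rho I + \tilde{R}\tilde{R}^T\big)^{-1}\big(\tilde{R}X + \rho\bar{Z}\big).$$
Using the Woodbury matrix identity and the fact that $\tilde{R}^T\tilde{R} = \sum_{i=1}^{N} R_i^T R_i = nI$, where $I$ is the identity matrix, the above is equal to
$$\bar{S} = \Big(I - \frac{1}{\rho + n}\tilde{R}\tilde{R}^T\Big)\Big(\frac{1}{\rho}\tilde{R}X + \bar{Z}\Big).$$
Plugging the definitions of $\tilde{R}$, $\bar{S}$ and $\bar{Z}$, we obtain
$$s_i = \Big(\frac{1}{\rho}R_iX + D_L\alpha_i - u_i\Big) - \frac{1}{\rho + n}\,R_i\sum_{j=1}^{N} R_j^T\Big(\frac{1}{\rho}R_jX + D_L\alpha_j - u_j\Big).$$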
Although seemingly complicated at first glance, the above is simple to interpret and implement in practice. This expression indicates that one should (i) compute the estimated slices $p_i = \frac{1}{\rho}R_iX + D_L\alpha_i - u_i$, then (ii) aggregate them to obtain the global estimate $\hat{X} = \sum_j R_j^T p_j$, and finally (iii) subtract from $p_i$ the corresponding patch from the aggregated signal, scaled by $\frac{1}{\rho+n}$, i.e. $s_i = p_i - \frac{1}{\rho+n}R_i\hat{X}$. As a remark, since this update essentially subtracts from $p_i$ an averaged version of it, it can be seen as some sort of a patch-based local Laplacian operation.
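The three steps can be sketched in a few lines and checked against the optimality condition of the quadratic slice problem. The sketch assumes a one-dimensional signal with circulant boundaries, slices stored as rows of an array, and a scaled dual variable per slice; the helper names `extract_all`, `aggregate` and `slice_update` are hypothetical.

```python
import numpy as np

def extract_all(x, n):
    """Stack all circular length-n patches of x as rows of an (N, n) array."""
    N = len(x)
    idx = (np.arange(N)[:, None] + np.arange(n)[None, :]) % N
    return x[idx]

def aggregate(P):
    """Place row i of P at position i (circularly) and sum everything up."""
    N, n = P.shape
    idx = (np.arange(N)[:, None] + np.arange(n)[None, :]) % N
    out = np.zeros(N)
    np.add.at(out, idx, P)
    return out

def slice_update(X, DA, U, rho):
    """Closed-form slice update; rows of DA are D_L @ alpha_i, rows of U the scaled duals."""
    n = DA.shape[1]
    P = extract_all(X, n) / rho + DA - U           # (i)   estimated slices p_i
    X_hat = aggregate(P)                           # (ii)  global estimate
    return P - extract_all(X_hat, n) / (rho + n)   # (iii) subtract the scaled patch

rng = np.random.default_rng(0)
N, n, rho = 40, 6, 0.5
X = rng.standard_normal(N)
DA = rng.standard_normal((N, n))
U = rng.standard_normal((N, n))
S = slice_update(X, DA, U, rho)

# Optimality check: the gradient of the quadratic objective should vanish at S.
grad = extract_all(aggregate(S) - X, n) + rho * (S - DA + U)
print(np.abs(grad).max())
```

The check relies on the fact that, with circulant boundaries, every sample is covered by exactly n patches, so aggregating all extracted patches returns n times the signal.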
3.3 Boundary Conditions
In the description of the CSC model (see Figure 2), we assumed for simplicity circulant boundary conditions. In practice, however, natural signals such as images are in general not circulant and special treatment is needed for the boundaries. One way of handling this issue is by assuming that $X = MD\Gamma$, where $M$ is a matrix that crops the first and last $n-1$ rows of the dictionary $D$ (see Figure 2). The change needed in Algorithm 1 to incorporate $M$ is minor. Indeed, one simply has to replace the patch extraction operator $R_i$ with $R_iM^T$, where the operator $M^T$ pads a global signal with $n-1$ zeros on the boundary and $R_i$ extracts a patch from the result. In addition, one has to replace the patch placement operator $R_i^T$ with $MR_i^T$, which simply puts the input in the location of the $i$-th patch and then crops the result.
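A minimal one-dimensional sketch of the two modified operators follows. It assumes that the padding and the crop are both n-1 samples per side for filters of length n; the function names `extract_padded` and `place_cropped` are hypothetical.

```python
import numpy as np

def extract_padded(x, i, n):
    """Zero-pad x by n-1 samples on each side, then extract the i-th length-n patch."""
    padded = np.pad(x, n - 1)
    return padded[i:i + n]

def place_cropped(s, i, N):
    """Place s at position i of the padded signal, then crop n-1 samples per side."""
    n = len(s)
    padded = np.zeros(N + 2 * (n - 1))
    padded[i:i + n] = s
    return padded[n - 1:n - 1 + N]

rng = np.random.default_rng(0)
N, n = 20, 4
x = rng.standard_normal(N)
s = rng.standard_normal(n)

# The two operators are adjoint to each other.
lhs = extract_padded(x, 5, n) @ s
rhs = x @ place_cropped(s, 5, N)

# There are N + n - 1 patch locations, and every sample of x is covered n times.
total = sum(place_cropped(extract_padded(x, i, n), i, N) for i in range(N + n - 1))
print(np.isclose(lhs, rhs), np.allclose(total, n * x))
```

Under these assumptions each signal sample is still covered by exactly n patches, so the aggregation step keeps the same normalization as in the circulant case.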
3.4 From Patches to Slices
The ADMM variant of the proposed algorithm, named slice-based dictionary learning, is summarized in Algorithm 1. While we have assumed the data corresponds to one signal $X$, this can be easily extended to consider several signals.
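Putting the pieces together, the following is a compact sketch of one possible ADMM loop for a one-dimensional signal with circulant boundaries. It is not a transcription of Algorithm 1: ISTA replaces LARS, a regularized least-squares fit replaces the K-SVD, and the initialization (patches divided by n, zero duals), the fixed penalty `rho`, the iteration counts and all function names are assumptions made for illustration.

```python
import numpy as np

def patches(x, n):
    N = len(x)
    return x[(np.arange(N)[:, None] + np.arange(n)) % N]   # rows are circular patches

def aggregate(P):
    N, n = P.shape
    out = np.zeros(N)
    np.add.at(out, (np.arange(N)[:, None] + np.arange(n)) % N, P)
    return out                                             # sum of placed rows

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def slice_dictionary_learning(X, n, m, lam, rho=1.0, n_outer=30, n_ista=20, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    D = rng.standard_normal((n, m))
    D /= np.linalg.norm(D, axis=0)
    S = patches(X, n) / n            # assumed initialization: slices sum back to X
    U = np.zeros((N, n))             # scaled dual variables
    A = np.zeros((N, m))             # needles, one per row
    for _ in range(n_outer):
        # Local sparse pursuit (warm-started ISTA) on the targets s_i + u_i.
        T = S + U
        step = 1.0 / (rho * np.linalg.norm(D, 2) ** 2)
        for _ in range(n_ista):
            A = soft(A - step * rho * (A @ D.T - T) @ D, lam * step)
        # Dictionary update by regularized least squares, then renormalization.
        G = A.T @ A + 1e-8 * np.eye(m)
        D = np.linalg.solve(G, A.T @ T).T
        norms = np.linalg.norm(D, axis=0)
        dead = norms < 1e-10                               # unused atoms
        D[:, dead] = rng.standard_normal((n, int(dead.sum())))
        norms[dead] = np.linalg.norm(D[:, dead], axis=0)
        D /= norms
        A *= norms                   # keeps D @ alpha_i unchanged for live atoms
        A[:, dead] = 0.0
        # Slice update (local Laplacian) and dual update.
        DA = A @ D.T
        P = patches(X, n) / rho + DA - U
        S = P - patches(aggregate(P), n) / (rho + n)
        U += S - DA
    return D, A, S

rng = np.random.default_rng(1)
t = np.arange(128)
X = np.sin(2 * np.pi * t / 16) + 0.05 * rng.standard_normal(128)
D_hat, A_hat, S_hat = slice_dictionary_learning(X, n=8, m=6, lam=0.05)
X_rec = aggregate(A_hat @ D_hat.T)   # global reconstruction from the needles
print(np.linalg.norm(X - X_rec) / np.linalg.norm(X))
```

Every step touches only length-n vectors or a single aggregation pass, so the loop never forms the global convolutional dictionary.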
At this point, a discussion regarding the relation between this algorithm and standard (patch-based) dictionary learning techniques is in order. Indeed, at a quick glance the two approaches seem very similar: both perform local sparse pursuit on local patches extracted from the signal, then update the dictionary to represent these patches better, and finally apply patch-averaging to obtain a global estimate of the reconstructed signal. Moreover, both iterate this process in a block-coordinate descent manner in order to minimize the overall objective. So, what is the difference between this algorithm and previous approaches?
The answer lies in the migration from patches to slices. While originally dictionary learning algorithms aimed to represent patches taken from the signal, our scheme suggests training the dictionary to construct slices, which do not necessarily reconstruct the patch fully. Instead, only the summation of these slices results in the reconstructed patches. To illustrate this relation, we show in Figure 3 the decomposition of several patches in terms of their constituent slices. One can observe that although the slices are simple in nature, they manage to construct the rather complex patches. The difference between this illustration and that of Figure 1 is that the latter shows patches and only the slices that are fully contained in them.
Note that the slices are not mere auxiliary variables, but rather emerge naturally from the convolutional formulation. After initializing these with patches from the signal, , each iteration progressively “carves” portions from the patch via the local Laplacian, resulting in simpler constructions. Eventually, these variables are guaranteed to converge to $D_L\alpha_i$, the slices we have defined.
Having established the similarities and differences between the traditional patch-based approach and the slice alternative, one might wonder what is the advantage of working with slices over patches. In the conventional approach, the patches are processed independently, ignoring their overlap. In the slice-based case, however, the local Laplacian forces the slices to communicate and reach a consensus on the reconstructed signal. Put differently, the CSC offers a global model, while earlier patch-based methods used local models without any holistic fusion of them.
4 Comparison to Other Methods
In this section we explain further the advantages of our method, and compare it to standard algorithms for training the CSC model such as [17, 28]. Arguably the main difference resides in our localized treatment, as opposed to the global Fourier domain processing. Our approach enables the following benefits:
- The sparse pursuit step can be done separately for each slice and is therefore trivial to parallelize.
- The algorithm can work in a complete online regime where in each iteration it samples a random subset of slices, solves a pursuit for these and then updates the dictionary accordingly. Adopting a similar strategy in the competing algorithms [17, 28] might be problematic, since these are deployed in the Fourier domain on global signals and it is therefore unclear how to operate on a subset of local patches.
- Our algorithm can be easily modified to allow a different number of non-zeros in each location of the global signal. Such local adaptation to the complexity of the image cannot be offered by the Fourier-oriented algorithms.
We now turn to comparing the proposed algorithm to alternative methods in terms of computational complexity. Denote by the number of signals on which the dictionary is trained, and by the maximal number of non-zeros in a needle (although we solve the Lagrangian formulation of LARS, we also limit the maximal number of non-zeros per needle to be at most ). At each iteration of our algorithm we employ LARS that has a complexity of per slice , resulting in computations for all slices in all the images. The last term, , corresponds to the precomputation of the Gram of the dictionary (which is in general negligible). Then, given the obtained needles, we reconstruct the slices, requiring , aggregate the results to form the global estimate, incurring , and update the slices, which requires an additional . These steps are negligible compared to the sparse pursuits and are thus omitted in the final expression. Finally, we update the dictionary using the K-SVD, which is . We summarize the above in Table 1. In addition, we present in the same table the complexity of each iteration of the (Fourier-based) algorithm in [17]. In this case, corresponds to the number of inner iterations in their ADMM solver of the sparse pursuit and dictionary update.
The most computationally demanding step in our algorithm is the local sparse pursuit, which is . Assuming that the needles are very sparse, which indeed happens in all of our experiments, this reduces to . On the other hand, the complexity in the algorithm of [17] is dominated by the computation of the FFT, which is . We conclude that our algorithm scales linearly with the global dimension $N$, while theirs grows as $N \log N$. Note that this also holds for other related methods, such as that of , which also depend on the global FFT. Moreover, one should remember the fact that in our scheme one might run the pursuits on a small percentage of the total number of slices, meaning that in practice our algorithm can scale as , where is a constant smaller than one.
5 Image Processing via CSC
In this section, we demonstrate our proposed algorithm on several image processing tasks. Note that the discussion thus far has focused on one-dimensional signals; however, it can be easily generalized to images by replacing the convolutional structure in the CSC model with block-circulant matrices with circulant blocks (BCCB).
5.1 Image Inpainting
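Assume an original image $X$ is multiplied by a diagonal binary matrix $A$, which masks the entries $i$ for which $A(i,i) = 0$. In the task of image inpainting, given the corrupted image $Y = AX$, the goal is to restore the original unknown $X$. One can tackle this problem by solving the following CSC problem
$$\min_{\Gamma} \; \frac{1}{2}\|Y - AD\Gamma\|_2^2 + \lambda\|\Gamma\|_1,$$
where we assume the dictionary $D$ was pretrained. Using similar steps to those leading to Equation (5), the above can be written as
$$\min_{\{\alpha_i\},\{s_i\},\{u_i\}} \; \frac{1}{2}\Big\|Y - A\sum_{i=1}^{N} R_i^T s_i\Big\|_2^2 + \sum_{i=1}^{N}\Big(\lambda\|\alpha_i\|_1 + \frac{\rho}{2}\|s_i - D_L\alpha_i + u_i\|_2^2\Big).$$
This objective can be minimized via the algorithm described in the previous section. Moreover, the minimization with respect to the local sparse codes $\{\alpha_i\}$ remains the same. The only difference regards the update of the slices $\{s_i\}$, in which case one obtains the following expression
$$s_i = \Big(\frac{1}{\rho}R_iY + D_L\alpha_i - u_i\Big) - \frac{1}{\rho + n}\,R_iA\sum_{j=1}^{N} R_j^T\Big(\frac{1}{\rho}R_jY + D_L\alpha_j - u_j\Big).$$
The steps leading to the above equation are almost identical to those in subsection 3.2, and they only differ in the incorporation of the mask $A$.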
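The masked slice update can be sketched and verified numerically in the same way as the unmasked one. The sketch assumes a one-dimensional signal with circulant boundaries, a 0/1 mask stored as a vector, a corrupted signal that is zero on the masked samples, and a scaled dual variable per slice; the closed form used here follows from those assumptions and all names are hypothetical.

```python
import numpy as np

def extract_all(x, n):
    """Stack all circular length-n patches of x as rows of an (N, n) array."""
    N = len(x)
    return x[(np.arange(N)[:, None] + np.arange(n)) % N]

def aggregate(P):
    """Place row i of P at position i (circularly) and sum everything up."""
    N, n = P.shape
    out = np.zeros(N)
    np.add.at(out, (np.arange(N)[:, None] + np.arange(n)) % N, P)
    return out

def masked_slice_update(Y, mask, DA, U, rho):
    """Slice update for inpainting; only observed samples enter the correction term."""
    n = DA.shape[1]
    P = extract_all(Y, n) / rho + DA - U
    X_hat = mask * aggregate(P)
    return P - extract_all(X_hat, n) / (rho + n)

rng = np.random.default_rng(0)
N, n, rho = 40, 6, 0.5
mask = (rng.random(N) < 0.5).astype(float)   # 1 = observed, 0 = missing
Y = mask * rng.standard_normal(N)            # corrupted signal
DA = rng.standard_normal((N, n))             # rows stand in for D_L @ alpha_i
U = rng.standard_normal((N, n))              # rows stand in for the scaled duals
S = masked_slice_update(Y, mask, DA, U, rho)

# Optimality check for the masked quadratic objective.
grad = extract_all(mask * (mask * aggregate(S) - Y), n) + rho * (S - DA + U)
print(np.abs(grad).max())
```

Setting the mask to all ones recovers the plain slice update of subsection 3.2, which is a quick sanity check for any implementation.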
5.2 Texture and Cartoon Separation
In this task the goal is to decompose an image into its texture component, which contains highly oscillating or pseudo-random patterns, and a cartoon part, which is a piece-wise smooth image. Many image separation algorithms tackle this problem by imposing a prior on both components. For the cartoon, one usually employs the isotropic (or anisotropic) Total Variation norm, denoted by $\|\cdot\|_{TV}$. The modeling of texture, on the other hand, is more difficult and several approaches have been considered over the years [12, 2, 22, 32].
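In this work, we propose to model the texture component using the CSC model. As such, the task of separation amounts to solving the following problem
$$\min_{D_T,\Gamma_T,X_C} \; \frac{1}{2}\|Y - D_T\Gamma_T - X_C\|_2^2 + \lambda\|\Gamma_T\|_1 + \xi\|X_C\|_{TV},$$
where $D_T$ is a convolutional (texture) dictionary, and $\Gamma_T$ is its corresponding sparse vector. Using similar derivations to those presented in Section 3.2, the above is equivalent to
$$\min_{D_L,\{\alpha_i\},\{s_i\},X_C,Z_C} \; \frac{1}{2}\Big\|Y - \sum_{i=1}^{N} R_i^T s_i - X_C\Big\|_2^2 + \lambda\sum_{i=1}^{N}\|\alpha_i\|_1 + \xi\|Z_C\|_{TV} \quad \text{s.t.} \quad s_i = D_L\alpha_i, \;\; X_C = Z_C,$$
where we split the variable $X_C$ into $X_C = Z_C$ in order to facilitate the minimization over the TV norm. Its corresponding ADMM formulation (disregarding the training of the dictionary, this is a standard two-function ADMM problem; the first set of variables are $\{s_i\}$ and $X_C$, and the second are $\{\alpha_i\}$ and $Z_C$) is given by
$$\min \; \frac{1}{2}\Big\|Y - \sum_{i=1}^{N} R_i^T s_i - X_C\Big\|_2^2 + \sum_{i=1}^{N}\Big(\lambda\|\alpha_i\|_1 + \frac{\rho}{2}\|s_i - D_L\alpha_i + u_i\|_2^2\Big) + \xi\|Z_C\|_{TV} + \frac{\eta}{2}\|X_C - Z_C + V_C\|_2^2,$$
where $\{s_i\}$, $\{\alpha_i\}$ and $\{u_i\}$ are the texture slices, needles and dual variables, respectively, and $V_C$ is the dual variable of the global cartoon $X_C$. The above optimization problem can be minimized by slightly modifying Algorithm 1. The update for $\{\alpha_i\}$ is a sparse pursuit and the update for the $Z_C$ variable is a TV denoising problem. Then, one can update the $\{s_i\}$ and $X_C$ jointly by
$$s_i = p_i - \frac{1}{\rho}R_iW, \qquad X_C = Q_C - \frac{1}{\eta}W, \qquad W = \Big(1 + \frac{n}{\rho} + \frac{1}{\eta}\Big)^{-1}\Big(\sum_{j=1}^{N} R_j^T p_j + Q_C\Big),$$
where $p_i = \frac{1}{\rho}R_iY + D_L\alpha_i - u_i$ and $Q_C = \frac{1}{\eta}Y + Z_C - V_C$. The final step of the algorithm is updating the texture dictionary $D_L$ via any dictionary learning method.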
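The joint update of the texture slices and the cartoon can be sketched as the minimizer of a quadratic objective and then checked numerically. The sketch rests on several assumptions that should be read as illustrative: a one-dimensional signal with circulant boundaries, a data term measuring the signal minus aggregated slices minus cartoon, a penalty weight `rho` tying each slice to its needle reconstruction, and a separate penalty weight `eta` tying the cartoon to its TV-split copy `Zc` with scaled dual `Vc`. All names are hypothetical.

```python
import numpy as np

def extract_all(x, n):
    """Stack all circular length-n patches of x as rows of an (N, n) array."""
    N = len(x)
    return x[(np.arange(N)[:, None] + np.arange(n)) % N]

def aggregate(P):
    """Place row i of P at position i (circularly) and sum everything up."""
    N, n = P.shape
    out = np.zeros(N)
    np.add.at(out, (np.arange(N)[:, None] + np.arange(n)) % N, P)
    return out

def joint_update(Y, DA, U, Zc, Vc, rho, eta):
    """Jointly minimize the quadratic terms over the slices S and the cartoon Xc."""
    n = DA.shape[1]
    P = extract_all(Y, n) / rho + DA - U
    Q = Y / eta + Zc - Vc
    W = (aggregate(P) + Q) / (1.0 + n / rho + 1.0 / eta)   # texture + cartoon estimate
    S = P - extract_all(W, n) / rho
    Xc = Q - W / eta
    return S, Xc

rng = np.random.default_rng(0)
N, n, rho, eta = 40, 6, 0.5, 2.0
Y = rng.standard_normal(N)
DA, U = rng.standard_normal((N, n)), rng.standard_normal((N, n))
Zc, Vc = rng.standard_normal(N), rng.standard_normal(N)
S, Xc = joint_update(Y, DA, U, Zc, Vc, rho, eta)

# Optimality check: both partial gradients of the quadratic objective vanish.
E = Y - aggregate(S) - Xc
grad_S = -extract_all(E, n) + rho * (S - DA + U)
grad_X = -E + eta * (Xc - Zc + Vc)
print(np.abs(grad_S).max(), np.abs(grad_X).max())
```

The auxiliary signal `W` is the current estimate of texture plus cartoon; both variables are corrected by the same global quantity, which is what couples the two components.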
6 Experiments
We turn to demonstrate our proposed slice-based dictionary learning. Throughout the experiments we use the LARS algorithm [9] to solve the LASSO problem and the K-SVD [1] for the dictionary learning. The reader should keep in mind, nevertheless, that one could use any other pursuit or dictionary learning algorithm for the respective updates. In all experiments, the number of filters trained are and they are of size .
6.1 Slice-Based Dictionary Learning
Following the test setting presented in [17], we run our proposed algorithm to solve Equation (2) with on the Fruit dataset [31], which contains ten images. As in [17], the images were mean subtracted and contrast normalized. We present in Figure 4 the dictionary obtained after several iterations using our proposed slice-based dictionary learning, and compare it to the result in [17] and also to the method AVA-AMS in [28]. Note that all three methods handle the boundary conditions, which were discussed in Section 3.3. We compare in Figure 5 the objective of the three algorithms as a function of time, showing that our algorithm is more stable and also converges faster. In addition, to demonstrate one of the advantages of our scheme, we train the dictionary on a small subset () of all slices and present the obtained result in the same figure.
Table 2: Inpainting results in terms of PSNR on the standard test images.
| Heide et al. | 11.00 | 10.29 | 10.18 | 11.77 | 9.41 | 9.74 | 11.99 | 15.55 | 10.37 | 11.60 | 15.11 |
6.2 Image Inpainting
We turn to test our proposed algorithm on the task of image inpainting, as described in Section 5.1. We follow the experimental setting presented in [17] and compare to their state-of-the-art method using their publicly available code. The dictionaries employed in both approaches are trained on the Fruit dataset, as described in the previous subsection (see Figure 4). For a fair comparison, in the inference stage, we tuned the parameter $\lambda$ for both approaches. Table 2 presents the results in terms of peak signal-to-noise ratio (PSNR) on a set of publicly available standard test images, showing our method leads to quantitatively better results (the PSNR is computed as , where and are the original and restored images; since the images are normalized, the range of the PSNR values is non-standard). Figure 6 compares the two visually, showing our method also leads to better qualitative results.
A common strategy in image restoration is to train the dictionary on the corrupted image itself, as shown in , as opposed to employing a dictionary trained on a separate collection of images. The algorithm presented in Section 5.1 can be easily adapted to this framework by updating the local dictionary on the slices obtained at every iteration. To exemplify the benefits of this, we include the results obtained by using this approach in Table 2 and Figure 6 (a comparison with the method of [17] was not possible in this case, as their implementation cannot handle training a dictionary on standard-sized images).
6.3 Texture and Cartoon Separation
We conclude by applying our proposed slice-based dictionary learning algorithm to the task of texture and cartoon separation. The TV denoiser used in the following experiments is the publicly available software of . We run our method on the synthetic image Sakura and a portion extracted from Barbara, both taken from , and on the image Cat, originally from . For each of these, we compare with the corresponding methods. We present the results of all three experiments in Figure 7, together with the trained dictionaries. Lastly, as an application for our texture separation algorithm, we enhance the image Flower by multiplying its texture component by a scalar factor (greater than one) and combining the result with the original image. We treat the colored image by transforming it to the Lab color space, manipulating the L channel, and finally transforming the result back to the original domain. The original image and the obtained result are depicted in Figure 8. One can observe that our approach does not suffer from halos, gradient reversals or other common enhancement artifacts.
7 Conclusion
In this work we proposed the slice-based dictionary learning algorithm. Our method employs standard patch-based tools from the realm of sparsity to solve the global CSC problem. We have shown the relation between our method and the patch-averaging paradigm, clarifying the main differences between the two: (i) the migration from patches to the simpler entities called slices, and (ii) the application of a local Laplacian that results in a global consensus. Finally, we illustrated the advantages of the proposed algorithm in a series of applications and compared it to related state-of-the-art methods.
References
- [1] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006.
- [2] J.-F. Aujol, G. Gilboa, T. Chan, and S. Osher. Structure-texture image decomposition: modeling, algorithms, and parameter selection. International Journal of Computer Vision, 67(1):111–136, 2006.
- [3] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
- [4] H. Bristow, A. Eriksson, and S. Lucey. Fast convolutional sparse coding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 391–398, 2013.
- [5] S. H. Chan, R. Khoshabeh, K. B. Gibson, P. E. Gill, and T. Q. Nguyen. An augmented Lagrangian method for total variation video restoration. IEEE Transactions on Image Processing, 20(11):3097–3111, 2011.
- [6] S. Chen, S. A. Billings, and W. Luo. Orthogonal least squares methods and their application to non-linear system identification. International Journal of Control, 50(5):1873–1896, 1989.
- [7] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.
- [8] W. Dong, L. Zhang, G. Shi, and X. Wu. Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization. IEEE Transactions on Image Processing, 20(7):1838–1857, 2011.
- [9] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, et al. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
- [10] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.
- [11] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, Dec. 2006.
- [12] M. Elad, J.-L. Starck, P. Querre, and D. L. Donoho. Simultaneous cartoon and texture image inpainting using morphological component analysis (MCA). Applied and Computational Harmonic Analysis, 19(3):340–358, 2005.
- [13] K. Engan, S. O. Aase, and J. H. Husoy. Method of optimal directions for frame design. In Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on, volume 5, pages 2443–2446. IEEE, 1999.
- [14] D. Geman and C. Yang. Nonlinear image recovery with half-quadratic regularization. IEEE Transactions on Image Processing, 4(7):932–946, 1995.
- [15] R. Grosse, R. Raina, H. Kwong, and A. Y. Ng. Shift-invariant sparse coding for audio classification. In Uncertainty in Artificial Intelligence, 2007.
- [16] S. Gu, W. Zuo, Q. Xie, D. Meng, X. Feng, and L. Zhang. Convolutional sparse coding for image super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pages 1823–1831, 2015.
- [17] F. Heide, W. Heidrich, and G. Wetzstein. Fast and flexible convolutional sparse coding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5135–5143. IEEE, 2015.
- [18] B. Kong and C. C. Fowlkes. Fast convolutional sparse coding (FCSC). Department of Computer Science, University of California, Irvine, Tech. Rep, 2014.
- [19] J. Mairal, F. Bach, J. Ponce, et al. Sparse modeling for image and vision processing. Foundations and Trends® in Computer Graphics and Vision, 8(2-3):85–283, 2014.
- [20] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 689–696. ACM, 2009.
- [21] J. Mairal, M. Elad, and G. Sapiro. Sparse representation for color image restoration. IEEE Transactions on Image Processing, 17(1):53–69, 2008.
- [22] S. Ono, T. Miyata, and I. Yamada. Cartoon-texture image decomposition using blockwise low-rank texture characterization. IEEE Transactions on Image Processing, 23(3):1128–1142, 2014.
- [23] V. Papyan, J. Sulam, and M. Elad. Working locally thinking globally-part I: Theoretical guarantees for convolutional sparse coding. arXiv preprint arXiv:1607.02005, 2016.
- [24] V. Papyan, J. Sulam, and M. Elad. Working locally thinking globally-part II: Stability and algorithms for convolutional sparse coding. arXiv preprint arXiv:1607.02009, 2016.
- [25] R. Rubinstein, M. Zibulevsky, and M. Elad. Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit. Cs Technion, 40(8):1–15, 2008.
- [26] J. Sulam, B. Ophir, M. Zibulevsky, and M. Elad. Trainlets: Dictionary learning in high dimensions. IEEE Transactions on Signal Processing, 64(12):3180–3193, 2016.
- [27] B. Wohlberg. Efficient convolutional sparse coding. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7173–7177. IEEE, 2014.
- [28] B. Wohlberg. Boundary handling for convolutional sparse representations. In Image Processing (ICIP), 2016 IEEE International Conference on, pages 1833–1837. IEEE, 2016.
- [29] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):210–227, 2009.
- [30] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010.
- [31] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2528–2535. IEEE, 2010.
- [32] H. Zhang and V. M. Patel. Convolutional sparse coding-based image decomposition.