Deep Network Classification by Scattering and Homotopy Dictionary Learning
We introduce a sparse scattering deep convolutional neural network, which provides a simple model to analyze properties of deep representation learning for classification. Learning a single dictionary matrix with a classifier yields a higher classification accuracy than AlexNet over the ImageNet ILSVRC2012 dataset. The network first applies a scattering transform which linearizes variabilities due to geometric transformations such as translations and small deformations. A sparse dictionary coding reduces intra-class variability while preserving class separation through projections over unions of linear spaces. It is implemented in a deep convolutional network with a homotopy algorithm having an exponential convergence. A convergence proof is given in a general framework including ALISTA. Classification results are analyzed over ImageNet.
Deep convolutional networks have spectacular applications to classification and regression (LeCun et al., 2015), but they are black boxes that are hard to analyze mathematically because of their architectural complexity. We introduce a simplified convolutional neural network illustrated in Figure 1, whose learning can be reduced to a single dictionary matrix and a classifier. Despite its simplicity, it applies to complex image classification and reaches a higher accuracy than AlexNet (Krizhevsky et al., 2012) over ImageNet ILSVRC2012. It is a cascade of well understood mathematical operators, and thus provides a simplified mathematical framework to analyze classification performance.
Intra-class variabilities due to geometric image transformations such as translations or small deformations are linearized by a scattering transform (Bruna and Mallat, 2013) which is invertible. Scattering transforms include no learning. They are effective representations to classify relatively simple images such as digits in MNIST, textures (Bruna and Mallat, 2013) or small CIFAR images (Oyallon and Mallat, 2014). Learning deep convolutional networks however gives a much higher accuracy over complex databases such as ImageNet. A fundamental issue is to understand the source of this improvement. This paper shows that it can be captured by a sparse code in a dictionary optimized by supervised learning. It is implemented with a deep convolutional network architecture. The sparse code eliminates non-informative image components and projects each class in unions of linear spaces. The classification accuracy is considerably improved and goes beyond AlexNet over ImageNet 2012.
Dictionary learning for classification was introduced in Mairal et al. (2009) and implemented with deep convolutional neural network architectures by several authors (Sulam et al., 2018; Mahdizadehaghdam et al., 2019; Sun et al., 2018). These algorithms have been applied to simpler image classification problems such as MNIST or CIFAR but no results were published on large datasets such as ImageNet on which they do not seem to scale. This is due to their complexity and the need to cascade several sparse codes, which leads to complex structures. We show that a single dictionary learning is sufficient if applied to scattering coefficients as opposed to raw data. A major issue is to compute the sparse code with a small network. We introduce a new architecture based on homotopy continuation, which leads to exponential convergence. It is thus implemented in a small convolutional network. The ALISTA (Liu et al., 2019) sparse code is incorporated in this framework. The main contributions of the paper are summarized below:
A Sparse Scattering network architecture, illustrated in Figure 1, where the classification is performed over a sparse code in a learned dictionary of scattering coefficients. It outperforms AlexNet over ImageNet 2012.
A new dictionary learning algorithm with homotopy sparse coding, optimized by gradient descent in a deep convolutional network.
A proof of exponential convergence of ALISTA (Liu et al., 2019) in presence of noise.
We explain the implementation and mathematical properties of each element of the sparse scattering network. Section 2 briefly reviews multiscale scattering transforms. Section 3 introduces homotopy dictionary learning for classification, with a proof of exponential convergence under appropriate assumptions. Section 4 analyzes image classification results of sparse scattering networks on ImageNet 2012.
2 Scattering Transform
A scattering transform is a cascade of wavelet transforms and ReLU or modulus non-linearities. It can be interpreted as a deep convolutional network with predefined wavelet filters (Mallat, 2016). For images, wavelet filters are calculated from a mother complex wavelet whose average is zero. It is rotated by , dilated by and its phase is shifted by :
We choose a Morlet wavelet as in Bruna and Mallat (2013) to produce a sparse set of non-negligible wavelet coefficients. A ReLU is written .
Scattering coefficients of order are computed by averaging rectified wavelet coefficients with a subsampling stride of :
where is a Gaussian dilated by (Bruna and Mallat, 2013).
The averaging by eliminates the variations of at scales smaller than . This information is recovered by computing their variations at all scales , with a second wavelet transform. Scattering coefficients of order two are:
To reduce the dimension of scattering vectors, we define phase invariant second order scattering coefficients with a complex modulus instead of a phase sensitive ReLU:
The scattering representation includes order coefficients and order phase invariant coefficients. In this paper, we choose and hence 4 scales , angles and 4 phases on . Scattering coefficients are computed with the software package Kymatio (Andreux et al., 2018). They preserve the image information and can be recovered from (Oyallon et al., 2019). For computational efficiency, the dimension of scattering vectors can be reduced by a factor with a linear operator which preserves the ability to recover a close approximation of from . The dimension reduction operator of Figure 1 is computed by preserving the principal directions of a PCA calculated on the training image database, or is optimized by gradient descent together with the other network parameters.
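As one possible reading of the PCA option described above, the sketch below builds a linear dimension-reduction operator from the leading principal directions of a set of training vectors. It is only an illustration under assumptions: the function name `pca_operator`, the random stand-in data and all sizes are hypothetical, and the rows of `X` stand for scattering channel vectors rather than actual Kymatio outputs.

```python
import numpy as np

def pca_operator(X, k):
    """Return the mean and the k leading principal directions of the rows of X."""
    mean = X.mean(axis=0)
    # Rows of Vt are orthonormal principal directions, sorted by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 40))   # hypothetical stand-in for training scattering vectors
mean, L = pca_operator(X, k=10)      # L has shape (10, 40)
reduced = (X - mean) @ L.T           # reduced vectors, shape (500, 10)
```

Because the rows of `L` are orthonormal, `reduced @ L + mean` gives the best rank-10 linear approximation of the centered data, which is the sense in which a close approximation can still be recovered after the reduction.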
The scattering transform is Lipschitz continuous to translations and deformations (Mallat, 2012). Intra-class variabilities due to translations and deformations smaller than are linearized. Good classification accuracies are obtained with a linear classifier over scattering coefficients in image databases where intra-class variabilities are dominated by translations and deformations. This is the case for digits in MNIST or texture images (Bruna and Mallat, 2013). However, it does not take into account variabilities of pattern structures and clutter which dominate complex image databases. Removing this clutter while preserving class separation requires some form of supervised learning as in deep convolutional networks. When applied to raw image data, dictionary learning often computes wavelet-like filters as in the first layer of deep neural networks (Krizhevsky et al., 2012). This is not sufficient to obtain high classification accuracy over complex image databases. The sparse scattering network of Figure 1 computes a sparse code of scattering representation , in a dictionary optimized by minimizing the classification loss. For this purpose, the next section introduces a homotopy dictionary learning algorithm, implemented in a small convolutional network.
3 Homotopy Dictionary Learning for Classification
Task-driven dictionary learning for classification with sparse coding was proposed in Mairal et al. (2011). We introduce a small convolutional network architecture to implement a sparse code and learn the dictionary with a homotopy continuation on thresholds. ALISTA (Liu et al., 2019) is also shown to be a homotopy sparse coding whose exponential convergence is proved under more general conditions. Next section reviews dictionary learning for classification. Homotopy sparse coding algorithms are studied in Section 3.2.
3.1 Dictionary Learning
Unless specified, all norms are Euclidean norms. A sparse code approximates a vector with a linear combination of a minimum number of columns of a dictionary matrix , which are normalized . A sparse code is a vector of minimum support which has a bounded error . Such sparse codes have been used to optimize signal compression and to remove noise, to solve inverse problems in compressive sensing (Candes et al., 2006), and for classification (Mairal et al., 2011).
Minimizing the support of a code amounts to minimizing its “norm” which is not convex. This non-convex optimization is convexified by replacing the norm by an norm . It is solved by minimizing a convex Lagrangian with a multiplier which depends on the error bound :
The sparse code also depends upon the dictionary and , we omit these two last variables in the equation above for readability. One can prove (Donoho and Elad, 2006) that has the same support as the minimum support sparse code if the support size and the dictionary coherence satisfy:
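The coherence that enters this support condition can be computed directly from the dictionary. The sketch below uses the standard definition (the largest absolute inner product between two distinct unit-norm columns), which is assumed here to be the one intended; the helper names and the random dictionary are hypothetical.

```python
import numpy as np

def normalize_columns(D):
    """Scale each column of D to unit Euclidean norm."""
    return D / np.linalg.norm(D, axis=0, keepdims=True)

def coherence(D):
    """Largest absolute inner product between two distinct columns of D."""
    G = np.abs(D.T @ D)          # Gram matrix of the (unit-norm) columns
    np.fill_diagonal(G, 0.0)     # ignore each column's product with itself
    return G.max()

rng = np.random.default_rng(0)
D = normalize_columns(rng.standard_normal((64, 128)))  # hypothetical redundant dictionary
mu = coherence(D)
```

An orthonormal basis has coherence 0, while a redundant dictionary such as the one above necessarily has a strictly positive coherence, which limits the support sizes covered by the guarantee.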
Sparse approximation versus sparse code
Sparse coding was first introduced for denoising (Donoho and Elad, 2006). The sparse approximation is a non-linear filtering which preserves the “signal” components of represented by few large amplitude coefficients. It eliminates the “noise” corresponding to incoherent components of whose correlations with all dictionary vectors are below . It can also be interpreted as a projection in a union of linear spaces, each of which corresponding to a sparse code support.
For classification, we need to reduce intra-class variabilities and preserve or increase class separability. Intra-class variabilities may be interpreted as “noise” for the classification whereas image transformations from one class to another correspond to the “signal” we want to preserve. By defining sparse representations of training vectors with different supports for different classes, the sparse code projects each class in different unions of linear spaces, which reduces intra-class variabilities while preserving separation. The dictionary learning optimizes the choice of to obtain sparse codes with discriminative supports.
The classification is usually performed from the sparse code . We will see that a classification applied on the reconstructed sparse approximation has nearly the same accuracy. Indeed, the linear operator can preserve separated linear spaces.
Dictionary learning by gradient descent
Given a set of inputs and labels , task-driven dictionary learning minimizes a classification loss that takes as input the sparse code of the input , the label and the classification parameters . Thus, the loss depends upon the dictionary , the Lagrange multiplier which adjusts the sparsity level, and the classification parameters . All these parameters can be jointly optimized by stochastic gradient descent to minimize the loss. This requires computing the sparse code and its derivatives with respect to and , which can be done by implementing the sparse coding in a deep convolutional network where the sparse code is computed in the forward pass and the derivatives of with respect to and are computed in the backward pass. For this purpose, the next section introduces a homotopy iterated soft thresholding network architecture.
3.2 Homotopy Iterated Soft Thresholding Network
This section introduces an efficient convolutional network architecture to compute sparse codes and learn dictionaries. Iterative Soft-Thresholding Algorithms (ISTA) (Daubechies et al., 2004), and FISTA (Beck and Teboulle, 2009) can be implemented with deep neural networks but they require many layers because of their slow convergence. The LISTA algorithm (Gregor and LeCun, 2010) and its more recent version ALISTA (Liu et al., 2019) accelerate this convergence by introducing an auxiliary matrix which is adapted to the statistics of the input and to the properties of the dictionary. For ALISTA, it leads to exponential convergence under appropriate hypotheses. However, we shall see that this auxiliary matrix prevents using this approach to learn a dictionary which minimizes a classification loss with a sparse code. We introduce a dictionary learning based on a homotopy Iterated Soft Thresholding Continuation (Jiao et al., 2017), which has the same exponential convergence without an auxiliary matrix. We shall see that ALISTA can also be considered as a homotopy continuation algorithm. We give a proof of exponential convergence for non-zero Lagrange multipliers in this general framework.
Iterated Soft Thresholding
ISTA alternates a gradient step on the quadratic term of the Lagrangian (1) and a soft-thresholding :
where is the spectral norm and . The first iteration computes a non-sparse code which is progressively sparsified through iterated thresholdings. After iterations, the sparse code has an error in . FISTA (Beck and Teboulle, 2009) accelerates the error decay to , which remains slow. Each iteration of ISTA and FISTA is computed with linear operators and a soft thresholding and can thus be implemented with one layer in a deep network (Papyan et al., 2017). However, the total number of layers must be large to achieve a small error, and it requires computing spectral norms during training, which is slow.
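The ISTA iteration described above can be sketched in a few lines. This is a generic textbook form rather than the authors' implementation: the step size is set to the inverse of the squared spectral norm of the dictionary, the code is initialized at zero, and the names `ista`, `lagrangian`, `D`, `beta` and `lam` are assumptions introduced for the example.

```python
import numpy as np

def soft_threshold(a, t):
    """Shrink each entry of a towards zero by t."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def lagrangian(D, beta, alpha, lam):
    """Quadratic reconstruction error plus an l1 penalty weighted by lam."""
    return 0.5 * np.sum((D @ alpha - beta) ** 2) + lam * np.sum(np.abs(alpha))

def ista(D, beta, lam, n_iter):
    eps = 1.0 / np.linalg.norm(D, 2) ** 2      # step size from the spectral norm of D
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        # Gradient step on the quadratic term, then soft thresholding.
        alpha = soft_threshold(alpha + eps * D.T @ (beta - D @ alpha), eps * lam)
    return alpha

rng = np.random.default_rng(0)
D = rng.standard_normal((30, 60))
D /= np.linalg.norm(D, axis=0, keepdims=True)  # unit-norm columns
beta = rng.standard_normal(30)
lam = 0.3
values = [lagrangian(D, beta, ista(D, beta, lam, n), lam) for n in (0, 10, 200)]
```

With this step size the Lagrangian never increases from one iteration to the next, so `values` is non-increasing; the slow decay of this sequence is what motivates the faster schemes discussed next.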
Homotopy Iterated Thresholding and ALISTA
Homotopy continuation algorithms introduced in Osborne et al. (2000), minimize the Lagrangian (1) by progressively decreasing the Lagrange multiplier. This optimization path is opposite to ISTA and FISTA since it goes from a very sparse initial solution towards a less sparse but optimal one, similarly to matching pursuit algorithms (Davis et al., 1997; Donoho and Tsaig, 2008). Homotopy algorithms are particularly efficient if the final Lagrange multiplier is large so that the optimal solution is very sparse. We shall see that it is the case for classification.
The homotopy Iterative Soft-Thresholding Continuation (ISTC) algorithm of Jiao, Jin and Lu (Jiao et al., 2017) adjusts the decay rate of an exponentially decreasing sequence of Lagrange multipliers for :
After iterations, they prove that has the same support as the optimal sparse code , if , if the dictionary coherence condition (2) is satisfied, and if is sufficiently close to . Figure 2 illustrates the implementation of this sparse coding algorithm in a deep network of depth , with side connections. For image classification we use a convolutional translation invariant dictionary, which defines a deep convolutional network. This convolutional network is used to compute the sparse code of scattering coefficients in Figure 1.
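A minimal sketch of such a homotopy iteration is given below. Several choices are assumptions made for the illustration and are not taken from the text: the initial threshold is set to the largest correlation between the input and the dictionary columns, the decay rate `gamma` is chosen so that the last threshold equals the target multiplier `lam_star`, the gradient step has unit size, and the names `istc`, `D` and `beta` are hypothetical.

```python
import numpy as np

def soft_threshold(a, t):
    """Shrink each entry of a towards zero by t."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def istc(D, beta, lam_star, n_iter):
    alpha = np.zeros(D.shape[1])
    lam_max = np.max(np.abs(D.T @ beta))             # assumed initial threshold
    gamma = (lam_max / lam_star) ** (1.0 / n_iter)   # assumed decay rate, gives lam_N = lam_star
    for n in range(1, n_iter + 1):
        lam_n = max(lam_max * gamma ** (-n), lam_star)   # exponentially decreasing threshold
        alpha = soft_threshold(alpha + D.T @ (beta - D @ alpha), lam_n)
    return alpha

# Sanity check on an orthonormal dictionary, where the optimal sparse code
# is known in closed form: a soft thresholding of D^T beta at lam_star.
rng = np.random.default_rng(0)
D, _ = np.linalg.qr(rng.standard_normal((20, 20)))
beta = rng.standard_normal(20)
lam_star = 0.5
alpha = istc(D, beta, lam_star, n_iter=12)
exact = soft_threshold(D.T @ beta, lam_star)
```

Each pass through the loop corresponds to one layer of the network, and the early, large thresholds keep the intermediate codes very sparse. For a redundant dictionary the unit step is only safe when the dictionary is sufficiently incoherent, which is the role of the coherence condition.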
ALISTA can be considered as a generalization of the homotopy ISTC algorithm, which replaces by an auxiliary matrix . We shall also study whether this flexibility can improve results. Each column of is normalized by . The iteration (4) is thus rewritten
The following theorem extends the convergence result of homotopy ISTC algorithm, by replacing the coherence of by the mutual coherence of and
This theorem also extends the ALISTA exponential convergence result in the general setting where the sparse code introduces a reconstruction error, which may be interpreted as a noise removal. We will see that this error can be large for image classification applications because it corresponds to non-informative clutter removal.
Let be the sparse code of with error . If its support satisfies
then soft-thresholding iterations (5) with thresholds
define a sparse code , whose support is included in the support of and
The proof is in Appendix A of the supplementary material. It adapts the convergence proof of ISTC to the more general ALISTA framework. When , we recover the convergence result of the homotopy ISTC, and when we recover the ALISTA exponential convergence result. However, one should not be too impressed by this exponential convergence rate because the condition only applies to very sparse codes in highly incoherent dictionaries. ALISTA optimizes in order to minimize the mutual coherence , but it is usually not possible to reach . It thus restricts the set of possible signals and dictionaries, as opposed to ISTA and FISTA algorithms whose convergence is guaranteed for any signal and dictionary. However, the condition is based on a crude upper bound calculation in the proof, and it is not necessary for convergence. The next section shows that for image classification over ImageNet, by setting we learn a dictionary where the homotopy ISTC algorithm converges exponentially although the theorem hypothesis is not satisfied. By learning simultaneously and , we shall see that we can reduce the classification loss but the resulting algorithm does not converge to a sparse code anymore.
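The generalized iteration with an auxiliary matrix, and the mutual coherence that controls its convergence, can be sketched as follows. The form used here is an assumption consistent with the description above: each column of the auxiliary matrix `W` is rescaled so that its inner product with the matching dictionary column equals 1, and the mutual coherence is taken as the largest absolute inner product between a column of `W` and a different column of `D`. All names, sizes and the threshold sequence are hypothetical.

```python
import numpy as np

def soft_threshold(a, t):
    """Shrink each entry of a towards zero by t."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def normalize_against(W, D):
    """Rescale each column of W so that its inner product with the same column of D is 1."""
    return W / np.sum(W * D, axis=0, keepdims=True)

def mutual_coherence(W, D):
    """Largest absolute inner product between a column of W and a different column of D."""
    G = np.abs(W.T @ D)
    np.fill_diagonal(G, 0.0)
    return G.max()

def generalized_istc(D, W, beta, thresholds):
    """Soft-thresholding iterations where W replaces D in the gradient step."""
    alpha = np.zeros(D.shape[1])
    for lam in thresholds:
        alpha = soft_threshold(alpha + W.T @ (beta - D @ alpha), lam)
    return alpha

rng = np.random.default_rng(0)
D = rng.standard_normal((400, 20))
D /= np.linalg.norm(D, axis=0, keepdims=True)              # unit-norm dictionary columns
W = normalize_against(D + 0.01 * rng.standard_normal(D.shape), D)
beta = D[:, :3] @ np.array([1.0, -0.8, 0.6])               # input built from 3 columns
thresholds = 0.8 ** np.arange(1, 13)                       # 12 decreasing thresholds
alpha = generalized_istc(D, W, beta, thresholds)
mu_tilde = mutual_coherence(W, D)
```

Setting `W = D` turns this into the homotopy ISTC iteration, and `mutual_coherence(D, D)` then reduces to the usual coherence of the dictionary.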
4 Image Classification
The goal of this work is to construct a deep neural network model which is sufficiently simple to be interpreted mathematically, while reaching a level of accuracy of more complex deep convolutional networks on complex classification problems. This is why we concentrate on ImageNet as opposed to MNIST or CIFAR. Next section compares its performance to state of the art deep networks, and analyzes the influence of different architecture components. Section 4.2 studies the exponential convergence of the homotopy ISTC sparse coding network in comparison with ISTA, FISTA and a flexible ALISTA.
4.1 Image Classification on ImageNet
We show that a sparse dictionary learning on scattering coefficients considerably improves classification performance and can outperform AlexNet accuracy.
ImageNet ILSVRC2012 is a challenging color image dataset of 1.2 million training images and 50,000 validation images, divided into 1000 classes. Prior to convolutional networks, SIFT representations combined with Fisher vector encoding reached a Top 5 classification accuracy of 74.3% with multiple model averaging (Sánchez and Perronnin, 2011). In their PyTorch implementation, the Top 5 accuracy of AlexNet and ResNet-152 is 79.1% and 94.1% respectively (accuracies from https://pytorch.org/docs/master/torchvision/models.html).
The scattering transform at a scale of an ImageNet color image is a spatial array of of channels. Applying to an MLP classifier with 2 hidden layers of size 4096, ReLU and dropout like in AlexNet gives a 60.7% Top 5 accuracy. Applying to a 3-layer SLE network of 1x1 convolutions with ReLU with the same MLP reaches AlexNet performance (Oyallon et al., 2017). However, there is no mathematical understanding of the operations performed by these three layers, and the origin of the improvements.
The sparse scattering architecture is described in Figure 3. The convolutional operator is applied on a standardized scattering transform and reduces the number of scattering channels from to . The sparse code is calculated with a convolutional dictionary having vectors. It takes as input an array of which has been normalized and outputs a code of size or a sparse approximation of size . Either is provided as input to the MLP classifier. The ISTC network illustrated in Figure 2 has layers with softshrink non-linearities and no batch normalization. Before the classifier, there is a batch normalization and a average pooling. The MLP classifier has 2 hidden layers of size 4096, ReLU and dropout rate of 0.3. The supervised learning jointly optimizes , the dictionary with the Lagrange multiplier and the MLP classifier. It is done with a stochastic gradient descent over 120 epochs using an initial learning rate of 0.01 with a decay of 0.1 at epochs 50 and 100. With a sparse code as input to the MLP, it has a Top 5 accuracy of 80.9%, outperforming AlexNet. If we replace the ISTC network by an ALISTA network, the accuracy improves to . However, the next section shows that, contrary to ISTC, an ALISTA network optimized for classification does not compute a sparse code and is therefore not mathematically interpretable. In the following we thus concentrate on the homotopy ISTC network.
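The step-wise learning rate schedule stated above can be written as a small function. This is a sketch only: the function name is hypothetical, and it assumes that epochs are counted from 0 and that each decay takes effect at the start of epochs 50 and 100.

```python
import math

def learning_rate(epoch, base_lr=0.01, milestones=(50, 100), decay=0.1):
    """Learning rate at a given epoch: base_lr multiplied by decay at each milestone passed."""
    n_decays = sum(epoch >= m for m in milestones)
    return base_lr * decay ** n_decays

schedule = [learning_rate(e) for e in range(120)]   # one value per training epoch
```

In PyTorch, the same behavior is usually obtained with `torch.optim.lr_scheduler.MultiStepLR`, using `milestones=[50, 100]` and `gamma=0.1`.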
The dimension reduction operator has a marginal effect in terms of performance. If we eliminate it or if we replace it by an unsupervised PCA dimension reduction, the performance drops by less than , whereas the accuracy drops by if we eliminate the sparse coding. The considerable improvement brought by the sparse code is further amplified if the MLP classifier is replaced by a linear classifier. A linear classifier on a scattering vector has a (Top 1, Top 5) accuracy of . With an ISTC sparse code in a learned dictionary the accuracy jumps to and hence improves by more than .
If the MLP classification is applied to the sparse approximation as opposed to the sparse code then the accuracy drops only by . The sparse approximation of has a small dimension similar to AlexNet last convolutional layer output and is not sparse. This indicates that it is not the individual sparse outputs of the sparse code which are important but the linear space defined by their support, which are mapped to other linear spaces by .
The optimization learns a large factor which yields a large approximation error . The resulting code is very sparse with about non-zero coefficients. The sparse approximation thus eliminates nearly half of the energy of which can be interpreted as non-informative "clutter" removal. The sparse code is a projection of over a linear space defined by the support of . If a column is interpreted as a "scattering space feature" then this linear space is a conjunction of a particular set of such features. The high classification accuracy indicates that different linear spaces correspond mostly to different classes. These linear spaces are mapped by into lower dimensional linear spaces which remain separated. It thus indicates that is optimized to preserve discriminative directions which transform a vector of one class into a vector of another one.
4.2 Convergence of Homotopy Algorithms
To guarantee that the network is mathematically interpretable we verify numerically that the homotopy ISTC algorithm computes an accurate approximation of the optimal sparse code in (1), with a small number of iterations (typically 12).
Theorem 3.1 guarantees an exponential convergence if . In our classification setting, the theorem hypothesis is clearly not satisfied: , which is well above . However, this condition is not necessary and is based on a relatively crude upper bound.
Figure 4 (left) shows numerically that the ISTC algorithm minimizes the Lagrangian , with an exponential convergence which is faster than ISTA and FISTA over the dictionary that it learns. On the contrary, Figure 4 (right) shows that ALISTA does not minimize the Lagrangian at all. This comes from the fact that, contrary to standard ALISTA (Liu et al., 2019), we do not impose that the auxiliary matrix has a minimum joint coherence with the dictionary . It would require too much computation, and the matrix is instead optimized to minimize the classification loss. This is why it improves the classification accuracy but does not compute a sparse code.
To further compare the convergence speed of ISTC versus ISTA and FISTA, we compute the relative mean square error between the optimal sparse code and the sparse code output of 12 iterations of each of these three algorithms. This error is 0.02 for ISTC, 0.25 for FISTA and 0.46 for ISTA. After 12 iterations, the ISTC error is thus about 12 times smaller than that of FISTA (0.25/0.02) and 23 times smaller than that of ISTA (0.46/0.02).
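A relative mean square error of this kind can be computed as below. The normalization is an assumption: the squared error is summed over all samples and divided by the total energy of the optimal codes. The function name and the small arrays are hypothetical and only illustrate the computation.

```python
import numpy as np

def relative_mse(alpha_opt, alpha_hat):
    """Squared error between two sets of codes, relative to the energy of alpha_opt."""
    num = np.sum((alpha_opt - alpha_hat) ** 2)
    den = np.sum(alpha_opt ** 2)
    return num / den

# Hypothetical codes: one row per sample, one column per dictionary atom.
alpha_opt = np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]])
alpha_hat = np.array([[1.0, 0.0, 1.0], [0.0, 3.0, 0.0]])
err = relative_mse(alpha_opt, alpha_hat)   # 1 / 14
```

A value of 0 means that the truncated algorithm reproduces the optimal code exactly, while a value of 1 is the error made by an all-zero code.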
The first goal of this work is to define a deep neural network having a good accuracy for complex image classification and which can be analyzed mathematically. This sparse scattering network learns the representation by optimizing a sparse code computed with a dictionary learned over scattering coefficients. The dictionary learning is implemented with a new homotopy ISTC network having an exponential convergence. The sparse dictionary learning improves accuracy by more than 20% over a scattering representation alone, and has a higher accuracy than AlexNet. The dictionary seems to be optimized in order to build separated sparse codes for each class, which belong to unions of linear spaces. Because the network operators are mathematically well specified, the analysis of its properties is simpler than for standard deep convolutional networks. However, more work is needed to understand the dictionary optimization and how it relates to image and class properties.
This work was supported by the ERC InvariantClass 320959 and grants from Région Ile-de-France. We thank the Scientific Computing Core at the Flatiron Institute for the use of their computing resources. We would like to thank Eugene Belilovsky for helpful discussions and comments.
- Kymatio: scattering transforms in python. CoRR.
- A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences 2 (1), pp. 183–202.
- Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell. 35 (8), pp. 1872–1886.
- Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory 52 (2), pp. 489–509.
- An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics 57 (11), pp. 1413–1457.
- Adaptive greedy approximations. Constr. Approx. 13 (1), pp. 57–98.
- On the stability of the basis pursuit in the presence of noise. Signal Processing 86 (3), pp. 511–532.
- Fast solution of $\ell_1$-norm minimization problems when the solution may be sparse. IEEE Trans. Information Theory 54 (11), pp. 4789–4812.
- Learning fast approximations of sparse coding. In ICML, pp. 399–406.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
- Iterative soft/hard thresholding with homotopy continuation for sparse recovery. IEEE Signal Processing Letters 24 (6), pp. 784–788.
- ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1106–1114.
- Deep learning. Nature 521 (7553), pp. 436–444.
- ALISTA: analytic weights are as good as learned weights in LISTA. In International Conference on Learning Representations.
- Deep dictionary learning: a parametric network approach. IEEE Transactions on Image Processing 28 (10), pp. 4790–4802.
- Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (4), pp. 791–804.
- Supervised dictionary learning. In Advances in Neural Information Processing Systems, pp. 1033–1040.
- Group invariant scattering. Comm. Pure Appl. Math. 65 (10), pp. 1331–1398.
- Understanding deep convolutional networks. Phil. Trans. of Royal Society A 374 (2065).
- A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis 20 (3), pp. 389.
- Scaling the scattering transform: deep hybrid networks. In Proceedings of the IEEE international conference on computer vision, pp. 5618–5627.
- Deep roto-translation scattering for object classification. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2865–2873.
- Scattering networks for hybrid representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (9), pp. 2208–2221.
- Convolutional neural networks analyzed via convolutional sparse coding. Journal of Machine Learning Research 18, pp. 83:1–83:52.
- Fisher vectors meet neural networks: a hybrid classification architecture. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3743–3752.
- High-dimensional signature compression for large-scale image classification. In CVPR, pp. 1665–1672.
- Multilayer convolutional sparse modeling: pursuit and dictionary learning. IEEE Transactions on Signal Processing 66 (15), pp. 4090–4104.
- Supervised deep sparse coding networks. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 346–350.
Appendix A Appendix
A.1 Proof of Theorem 3.1
Let be the optimal sparse code. We denote by the support of any . We are going to prove by induction on that for any we have and if .
For , so is indeed included in the support of and . To verify the induction hypothesis for , we shall prove that
Let us write the error . For all
Since the support of is smaller than , and
so taking the max on gives:
But given the inequalities
Let us now suppose that the property is valid for and let us prove it for . We denote by the restriction of to vectors indexed by . We begin by showing that . For any , since and we have
For any not in , let us prove that . The induction hypothesis assumes that and with so:
Since we assume that , we have
Because of the thresholding , it proves that and hence that .
Let us now evaluate . For any , a soft thresholding satisfies
Taking a max over proves the induction hypothesis.