Deep Network Classification by Scattering and Homotopy Dictionary Learning

Abstract

We introduce a sparse scattering deep convolutional neural network, which provides a simple model to analyze properties of deep representation learning for classification. Learning a single dictionary matrix with a classifier yields a higher classification accuracy than AlexNet over the ImageNet 2012 dataset. The network first applies a scattering transform that linearizes variabilities due to geometric transformations such as translations and small deformations. A sparse dictionary coding reduces intra-class variability while preserving class separation through projections over unions of linear spaces. It is implemented in a deep convolutional network with a homotopy algorithm having an exponential convergence. A convergence proof is given in a general framework that includes ALISTA. Classification results are analyzed on ImageNet.

1 Introduction

Deep convolutional networks have spectacular applications to classification and regression (LeCun et al., 2015), but they are black boxes that are hard to analyze mathematically because of their architecture complexity. Scattering transforms are simplified convolutional neural networks with wavelet filters which are not learned (Bruna and Mallat, 2013). They provide state-of-the-art classification results among predefined or unsupervised representations, and are nearly as efficient as learned deep networks on relatively simple image datasets, such as digits in MNIST, textures (Bruna and Mallat, 2013) or small CIFAR images (Oyallon and Mallat, 2014; Mallat, 2016). However, over complex datasets such as ImageNet, the classification accuracy of a learned deep convolutional network is much higher than a scattering transform or any other predefined representation (Oyallon et al., 2019). A fundamental issue is to understand the source of this improvement. This paper addresses this question by showing that one can reduce the learning to a single dictionary matrix, which is used to compute a positive sparse code.

The resulting algorithm is implemented with a simplified convolutional neural network architecture illustrated in Figure 1. The classifier input is a positive sparse code of scattering coefficients calculated in a dictionary $D$. The matrix $D$ is learned together with the classifier by minimizing a classification loss over a training set. We show that learning $D$ improves the performance of a scattering representation considerably and is sufficient to reach a higher accuracy than AlexNet (Krizhevsky et al., 2012) over ImageNet 2012. This cascade of well understood mathematical operators provides a simplified mathematical model to analyze optimization and classification performances of deep neural networks.

Dictionary learning for classification was introduced in Mairal et al. (2009) and implemented with deep convolutional neural network architectures by several authors (Sulam et al., 2018; Mahdizadehaghdam et al., 2019; Sun et al., 2018). To reach good classification accuracies, these networks cascade several dictionary learning blocks. As a result, there is no indication that these operators compute optimal sparse codes. These architectures are thus difficult to analyze mathematically and involve heavy calculations. They have only been applied to small image classification problems such as MNIST or CIFAR, as opposed to ImageNet. Our architecture reaches a high classification performance on ImageNet with only one dictionary $D$, because it is applied to scattering coefficients as opposed to raw images. Intra-class variabilities due to geometric image transformations such as translations or small deformations are linearized by a scattering transform (Bruna and Mallat, 2013), which avoids unnecessary learning.

Learning a dictionary in a deep neural network requires implementing a sparse code. We show that homotopy iterative thresholding algorithms lead to more efficient sparse coding implementations with fewer layers. We prove their exponential convergence in a general framework that includes the ALISTA (Liu et al., 2019) algorithm. The main contributions of the paper are summarized below:

• A sparse scattering network architecture, illustrated in Figure 1, where the classification is performed over a sparse code computed with a single learned dictionary of scattering coefficients. It outperforms AlexNet over ImageNet 2012.

• A new dictionary learning algorithm with homotopy sparse coding, optimized by gradient descent in a deep convolutional network. If the dictionary is sufficiently incoherent, the homotopy sparse coding error is proved to converge exponentially.

We explain the implementation and mathematical properties of each element of the sparse scattering network. Section 2 briefly reviews multiscale scattering transforms. Section 3 introduces homotopy dictionary learning for classification, with a proof of exponential convergence under appropriate assumptions. Section 4 analyzes image classification results of sparse scattering networks on ImageNet 2012.

2 Scattering Transform

A scattering transform is a cascade of wavelet transforms and ReLU or modulus non-linearities. It can be interpreted as a deep convolutional network with predefined wavelet filters (Mallat, 2016). For images, wavelet filters are calculated from a mother complex wavelet $\psi$ whose average is zero. It is rotated by $r_{-\theta}$, dilated by $2^j$ and its phase is shifted by $\alpha$:

 $$\psi_{j,\theta}(u) = 2^{-2j}\,\psi(2^{-j} r_{-\theta} u) \quad\text{and}\quad \psi_{j,\theta,\alpha} = \mathrm{Real}(e^{-i\alpha}\,\psi_{j,\theta})$$

We choose a Morlet wavelet as in Bruna and Mallat (2013) to produce a sparse set of non-negligible wavelet coefficients. A ReLU is written $\rho(a) = \max(a, 0)$.
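As an illustration of the filter definitions above, the following NumPy sketch samples a rotated and dilated Morlet-like wavelet and takes its phase-shifted real part. The center frequency `xi`, the envelope width `sigma`, and the function names are assumptions chosen for the example; they are not the parameters used by the authors or by Kymatio.

```python
import numpy as np

def morlet_2d(size, j, theta, xi=3 * np.pi / 4, sigma=0.8):
    """Sample a zero-mean Morlet-like wavelet psi_{j,theta} on a size x size grid.

    xi (center frequency) and sigma (envelope width) are illustrative values.
    """
    half = size // 2
    v, u = np.mgrid[-half:size - half, -half:size - half].astype(float)
    # Rotate by -theta and dilate by 2^j: coordinates 2^{-j} r_{-theta} u.
    c, s = np.cos(theta), np.sin(theta)
    ur = (c * u + s * v) / 2 ** j
    vr = (-s * u + c * v) / 2 ** j
    envelope = np.exp(-(ur ** 2 + vr ** 2) / (2 * sigma ** 2))
    gabor = envelope * np.exp(1j * xi * ur)
    # Subtract a multiple of the envelope so that the sampled wavelet sums to zero.
    psi = gabor - (gabor.sum() / envelope.sum()) * envelope
    return psi / 2 ** (2 * j)

def phase_shifted(psi, alpha):
    """Real filter psi_{j,theta,alpha} = Real(exp(-i alpha) psi_{j,theta})."""
    return np.real(np.exp(-1j * alpha) * psi)

def relu(a):
    """Rectifier rho(a) = max(a, 0)."""
    return np.maximum(a, 0.0)
```

With `alpha = 0` the filter is the real part of the wavelet, and with `alpha = pi / 2` it is its imaginary part, so a small set of phases spans both.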

Scattering coefficients of order 1 are computed by averaging rectified wavelet coefficients with a subsampling stride of $2^J$:

 $$Sx(u,k,\alpha) = \rho(x \star \psi_{j,\theta,\alpha}) \star \phi_J(2^J u) \quad\text{with}\quad k = (j,\theta)$$

where $\phi_J$ is a Gaussian dilated by $2^J$ (Bruna and Mallat, 2013). The averaging by $\phi_J$ eliminates the variations of $\rho(x \star \psi_{j,\theta,\alpha})$ at scales smaller than $2^J$. This information is recovered by computing their variations at all scales $2^{j'} < 2^J$, with a second wavelet transform. Scattering coefficients of order two are:

 $$Sx(u,k,k',\alpha,\alpha') = \rho(\rho(x \star \psi_{j,\theta,\alpha}) \star \psi_{j',\theta',\alpha'}) \star \phi_J(2^J u) \quad\text{with}\quad k,k' = (j,\theta),(j',\theta')$$

To reduce the dimension of scattering vectors, we define phase invariant second order scattering coefficients with a complex modulus instead of a phase sensitive ReLU:

 $$Sx(u,k,k') = ||x \star \psi_{j,\theta}| \star \psi_{j',\theta'}| \star \phi_J(2^J u) \quad\text{for}\quad j' > j$$
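The scattering formulas above can be sketched with FFT-based circular convolutions. This is a minimal illustration, assuming periodic boundary conditions, filters stored at full image size with their center at index 0, and an arbitrary Gaussian width of `0.8 * 2**J`; the function names are hypothetical and this is not the Kymatio implementation.

```python
import numpy as np

def conv2_fft(x, h):
    """Circular 2D convolution of x with a same-shape filter h (centered at index 0)."""
    return np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(h))

def gaussian_lowpass(size, J):
    """Gaussian phi_J dilated by 2^J, normalized to unit sum, centered at index 0."""
    k = np.fft.fftfreq(size, d=1.0 / size)  # integer offsets 0, 1, ..., -1
    v, u = np.meshgrid(k, k, indexing="ij")
    phi = np.exp(-(u ** 2 + v ** 2) / (2 * (0.8 * 2 ** J) ** 2))
    return phi / phi.sum()

def scattering_order1(x, psi_real, phi, J):
    """rho(x * psi_{j,theta,alpha}) * phi_J, subsampled with stride 2^J."""
    rectified = np.maximum(conv2_fft(x, psi_real).real, 0.0)
    return conv2_fft(rectified, phi).real[::2 ** J, ::2 ** J]

def scattering_order2_modulus(x, psi1, psi2, phi, J):
    """| |x * psi_{j,theta}| * psi_{j',theta'} | * phi_J, subsampled with stride 2^J."""
    u1 = np.abs(conv2_fft(x, psi1))
    u2 = np.abs(conv2_fft(u1, psi2))
    return conv2_fft(u2, phi).real[::2 ** J, ::2 ** J]
```

Both outputs are non-negative up to rounding error, since a rectified or modulus signal is averaged by a positive kernel.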

The scattering representation includes order 1 coefficients and order 2 phase invariant coefficients. In this paper, we choose $J = 4$ and hence 4 scales $1 \le j \le J$, angles $\theta$ and 4 phases $\alpha$ on $[0, 2\pi]$. Scattering coefficients are computed with the software package Kymatio (Andreux et al., 2018). They preserve the image information, and $x$ can be recovered from $Sx$ (Oyallon et al., 2019). For computational efficiency, the dimension of scattering vectors can be reduced by a factor with a linear operator that preserves the ability to recover a close approximation of $x$ from the reduced vector. The dimension reduction operator of Figure 1 may be an orthogonal projection over the principal directions of a PCA calculated on the training set, or it can be optimized by gradient descent together with the other network parameters.

The scattering transform is Lipschitz continuous to translations and deformations (Mallat, 2012). Intra-class variabilities due to translations smaller than $2^J$ and small deformations are linearized. Good classification accuracies are obtained with a linear classifier over scattering coefficients in image datasets where translations and deformations dominate intra-class variabilities. This is the case for digits in MNIST or texture images (Bruna and Mallat, 2013). However, it does not take into account variabilities of pattern structures and clutter which dominate complex image datasets. Removing this clutter while preserving class separation requires some form of supervised learning. The sparse scattering network of Figure 1 computes a sparse code of the scattering representation in a learned dictionary $D$ of scattering features, which minimizes the classification loss. For this purpose, the next section introduces a homotopy dictionary learning algorithm, implemented in a small convolutional network.

3 Homotopy Dictionary Learning for Classification

Task-driven dictionary learning for classification with sparse coding was proposed in Mairal et al. (2011). We introduce a small convolutional network architecture to implement a sparse code and learn the dictionary with a homotopy continuation on thresholds. The next section reviews dictionary learning for classification. Homotopy sparse coding algorithms are studied in Section 3.2.

3.1 Sparse Coding and Dictionary Learning

Unless specified, all norms are Euclidean norms. A sparse code approximates a vector $\beta$ with a linear combination of a minimum number of columns $D_m$ of a dictionary matrix $D$, which are normalized $\|D_m\| = 1$. It is a vector $\alpha^0$ of minimum support with a bounded approximation error $\|D\alpha^0 - \beta\|$. Such sparse codes have been used to optimize signal compression (Mallat and Zhang, 1993) and to remove noise, to solve inverse problems in compressed sensing (Candes et al., 2006), and for classification (Mairal et al., 2011). In this case, the dictionary learning optimizes the matrix $D$ in order to minimize the classification loss. The resulting columns $D_m$ can be interpreted as classification features selected by the sparse code $\alpha$. To enforce this interpretation, we impose that sparse code coefficients are positive, $\alpha \ge 0$.

Positive sparse coding. Minimizing the support of a code $\alpha$ amounts to minimizing its $\ell^0$ “norm”, which is not convex. This non-convex optimization is convexified by replacing the $\ell^0$ norm by an $\ell^1$ norm. Since $\alpha \ge 0$, we have $\|\alpha\|_1 = \sum_m \alpha(m)$. The minimization of $\|\alpha\|_1$ with a bounded approximation error $\|D\alpha - \beta\|$ is solved by minimizing a convex Lagrangian with a multiplier $\lambda_*$ which depends on the error bound:

 $$\alpha^1 = \arg\min_{\alpha \ge 0} \frac{1}{2}\|D\alpha - \beta\|^2 + \lambda_* \|\alpha\|_1 \qquad (1)$$

One can prove (Donoho and Elad, 2006) that $\alpha^1$ has the same support as the minimum support sparse code $\alpha^0$ of $\beta$ if the support size $s$ and the dictionary coherence $\mu(D)$ satisfy:

 $$s\,\mu(D) < 1/2 \quad\text{where}\quad \mu(D) = \max_{m \ne m'} |D_m^t D_{m'}| \qquad (2)$$

The sparse approximation $D\alpha^1$ is a non-linear filtering which preserves the components of $\beta$ which are “coherent” in the dictionary $D$, represented by few large amplitude coefficients. It eliminates the “noise” corresponding to incoherent components of $\beta$ whose correlations with all dictionary vectors $D_m$ are typically below $\lambda_*$, which can be interpreted as a threshold.
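The coherence condition (2) is straightforward to evaluate for a given dictionary. The sketch below assumes the dictionary is stored as a NumPy array with unit-norm columns; the function names are hypothetical.

```python
import numpy as np

def mutual_coherence(D):
    """mu(D) = max over m != m' of |D_m^t D_m'| for unit-norm columns D_m."""
    G = np.abs(D.T @ D)
    np.fill_diagonal(G, 0.0)  # ignore the diagonal terms m = m'
    return G.max()

def satisfies_support_condition(D, s):
    """Check the sufficient condition s * mu(D) < 1/2 of equation (2)."""
    return s * mutual_coherence(D) < 0.5
```

For an orthonormal dictionary the coherence is zero and the condition holds for any support size, whereas redundant dictionaries only satisfy it for very small supports.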

Supervised dictionary learning with a deep neural network. Dictionary learning for classification amounts to optimizing the matrix $D$ and the threshold $\lambda_*$ to minimize the classification loss on a training set. This is a much more difficult non-convex optimization problem than the convex sparse coding problem (1). The sparse code $\alpha^1$ of each scattering representation $\beta$ depends upon $D$ and $\lambda_*$. It is used as an input to a classifier with its own parameters. The classification loss thus depends upon the dictionary $D$ and $\lambda_*$ (through $\alpha^1$), and on the classifier parameters. The dictionary $D$ is learned by minimizing the classification loss. This task-driven dictionary learning strategy was introduced in Mairal et al. (2011).

An implementation of the task-driven dictionary learning strategy with deep neural networks has been proposed in (Papyan et al., 2017; Sulam et al., 2018; Mahdizadehaghdam et al., 2019; Sun et al., 2018). The deep network is designed to approximate the sparse code $\alpha^1$ by unrolling a fixed number $N$ of iterations of an iterative soft thresholding algorithm. The network takes $\beta$ as input and is parametrized by the dictionary $D$ and the Lagrange multiplier $\lambda_*$, as shown in Figure 2. The classification loss is then minimized with stochastic gradient descent on the classifier parameters and on $D$ and $\lambda_*$. The number of layers in the network is equal to the number $N$ of iterations used to approximate the sparse code. During training, the forward pass approximates the sparse code with respect to the current dictionary, and the backward pass updates the dictionary through a stochastic gradient descent step.

For computational efficiency, the main issue is to approximate $\alpha^1$ with as few layers as possible and hence find an iterative algorithm which converges quickly. The next section shows that this can be done with homotopy algorithms, which can have an exponential convergence.

3.2 Homotopy Iterated Soft Thresholding Algorithms

Sparse codes are efficiently computed with iterative proximal gradient algorithms (Combettes and Pesquet, 2011). For a positive sparse code, these algorithms iteratively apply a linear operator and a rectifier which acts as a positive thresholding. They can thus be implemented in a deep neural network. We show that homotopy algorithms can converge exponentially and thus lead to precise calculations with fewer layers.

Iterated Positive Soft Thresholding with ReLU. Proximal gradient algorithms compute sparse codes with a gradient step on the regression term $\|D\alpha - \beta\|^2$ followed by a proximal projection which enforces the sparse penalization (Combettes and Pesquet, 2011). For a positive sparse code, the proximal projection is defined by:

 $$\mathrm{prox}_\lambda(\beta) = \arg\min_{\alpha \ge 0} \frac{1}{2}\|\alpha - \beta\|^2 + \lambda \|\alpha\|_1 \qquad (3)$$

Since $\|\alpha\|_1 = \sum_m \alpha(m)$ for $\alpha(m) \ge 0$, we verify that $\mathrm{prox}_\lambda(\beta) = \rho(\beta - \lambda)$ where $\rho(a) = \max(a, 0)$ is a rectifier, with a bias $\lambda$. The rectifier acts as a positive soft thresholding, where $\lambda$ is the threshold. Without the positivity condition $\alpha \ge 0$, the proximal operator in (3) is a soft thresholding which preserves the sign.
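The closed form of the positive proximal projection can be checked numerically against the objective of (3). The following sketch uses hypothetical function names and a small hand-picked vector.

```python
import numpy as np

def prox_positive(beta, lam):
    """Positive soft thresholding rho(beta - lam), the closed form of equation (3)."""
    return np.maximum(beta - lam, 0.0)

def prox_objective(alpha, beta, lam):
    """Objective of (3) for alpha >= 0: 0.5 * ||alpha - beta||^2 + lam * ||alpha||_1."""
    return 0.5 * np.sum((alpha - beta) ** 2) + lam * np.sum(np.abs(alpha))
```

For `beta = [1.5, 0.2, -1.0]` and `lam = 0.5`, the projection keeps only the first coefficient, shifted down by the threshold, and any non-negative perturbation of that point has an objective value at least as large.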

An Iterated Soft Thresholding Algorithm (ISTA) (Daubechies et al., 2004) computes an $\ell^1$ sparse code by alternating a gradient step on $\|D\alpha - \beta\|^2$ and a proximal projection. For positive codes, it is initialized with $\alpha_0 = 0$, and:

 $$\alpha_{n+1} = \rho(\alpha_n + \epsilon D^t(\beta - D\alpha_n) - \epsilon\lambda_*) \quad\text{with}\quad \epsilon < \frac{1}{\|D^t D\|_{2,2}} \qquad (4)$$

where $\|\cdot\|_{2,2}$ is the spectral norm. The first iteration computes a non-sparse code which is progressively sparsified by iterated thresholdings. The convergence is slow, with an error that decays as $O(n^{-1})$. The Fast Iterated Soft Thresholding Algorithm (FISTA) (Beck and Teboulle, 2009) accelerates the error decay to $O(n^{-2})$, but it remains slow.
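A minimal NumPy sketch of the positive ISTA iteration (4) is given below, together with the Lagrangian (1) it minimizes. The step size `0.99 / ||D^t D||` is an assumed choice that satisfies the stated bound; the function names are hypothetical.

```python
import numpy as np

def lagrangian(D, alpha, beta, lam):
    """Lagrangian of equation (1): 0.5 * ||D alpha - beta||^2 + lam * ||alpha||_1."""
    return 0.5 * np.sum((D @ alpha - beta) ** 2) + lam * np.sum(np.abs(alpha))

def ista_positive(D, beta, lam, n_iter):
    """Positive ISTA, equation (4), initialized with alpha_0 = 0."""
    # Step size strictly below 1 / ||D^t D||_{2,2} (spectral norm).
    eps = 0.99 / np.linalg.norm(D.T @ D, 2)
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        # Gradient step on the regression term, then ReLU with bias eps * lam.
        alpha = np.maximum(alpha + eps * D.T @ (beta - D @ alpha) - eps * lam, 0.0)
    return alpha
```

Each pass through the loop corresponds to one layer of an unrolled network: a linear operator followed by a ReLU with a bias.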

Each iteration of ISTA and FISTA is computed with linear operators and a thresholding and can be implemented with one layer (Papyan et al., 2017). Because these algorithms converge slowly, a large number of layers is needed to compute an accurate sparse code. We show that the number of layers can be reduced considerably with homotopy algorithms.

Homotopy continuation. Homotopy continuation algorithms, introduced in Osborne et al. (2000), minimize the Lagrangian (1) by progressively decreasing the Lagrange multiplier. This optimization path is opposite to ISTA and FISTA since it begins with a very sparse initial solution whose sparsity is progressively reduced, similarly to matching pursuit algorithms (Davis et al., 1997; Donoho and Tsaig, 2008). Homotopy algorithms are particularly efficient if the final Lagrange multiplier $\lambda_*$ is large and thus produces a very sparse optimal solution. We shall see that it is the case for classification.

Homotopy proximal gradient descents (Xiao and Zhang, 2013) are implemented with an exponentially decreasing sequence of Lagrange multipliers $\lambda_n$ for $n \le N$. Jiao, Jin and Lu (Jiao et al., 2017) have introduced an Iterative Soft Thresholding Continuation (ISTC) algorithm with a fixed number of iterations per threshold. To compute a positive sparse code, we replace the soft thresholding by a ReLU proximal projector, with one iteration per threshold, over $N$ iterations:

 $$\alpha_n = \rho(\alpha_{n-1} + D^t(\beta - D\alpha_{n-1}) - \lambda_n) \quad\text{with}\quad \lambda_n = \lambda_{\max}\left(\frac{\lambda_{\max}}{\lambda_*}\right)^{-n/N} \qquad (5)$$

By adapting the proof of (Jiao et al., 2017) to positive codes, the next theorem proves in a more general framework that if $N$ is sufficiently large and $\lambda_{\max} \ge \|D^t\beta\|_\infty$ then $\alpha_n$ converges exponentially to the optimal positive sparse code.
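The ISTC iteration (5) can be sketched as below, assuming the code is initialized at zero and the thresholds are indexed from 1 to N so that the last threshold equals the target multiplier. The function names are hypothetical.

```python
import numpy as np

def thresholds(lam_star, lam_max, N):
    """Thresholds lambda_n = lam_max * (lam_max / lam_star)^(-n/N) for n = 1..N."""
    return [lam_max * (lam_max / lam_star) ** (-n / N) for n in range(1, N + 1)]

def istc_positive(D, beta, lam_star, lam_max, N):
    """Positive ISTC, equation (5): one ReLU iteration per decreasing threshold."""
    alpha = np.zeros(D.shape[1])
    for lam_n in thresholds(lam_star, lam_max, N):
        alpha = np.maximum(alpha + D.T @ (beta - D @ alpha) - lam_n, 0.0)
    return alpha
```

For an orthonormal dictionary the iteration reduces to a ReLU of the correlations minus the current threshold, so after N steps the output is exactly the solution of (1). Unlike ISTA, the first iterate is already very sparse, because the first threshold is close to its maximum value.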

The LISTA algorithm (Gregor and LeCun, 2010) and its more recent version ALISTA (Liu et al., 2019) accelerate the convergence of proximal algorithms by introducing an auxiliary matrix $W$, which is adapted to the statistics of the input and to the properties of the dictionary. Such an auxiliary matrix may also improve classification accuracy. We study its influence by replacing $D^t$ by an arbitrary matrix $W^t$ in (5). Each column $W_m$ of $W$ is normalized by $W_m^t D_m = 1$. A generalized ISTC is defined for any dictionary $D$ and any auxiliary $W$ by:

 $$\alpha_n = \rho(\alpha_{n-1} + W^t(\beta - D\alpha_{n-1}) - \lambda_n) \quad\text{with}\quad \lambda_n = \lambda_{\max}\left(\frac{\lambda_{\max}}{\lambda_*}\right)^{-n/N} \qquad (6)$$

If $W = D$ then we recover the original ISTC algorithm (5) (Jiao et al., 2017). Figure 2 illustrates a neural network implementation of this generalized ISTC algorithm over $N$ layers, with side connections. Let us introduce the mutual coherence of $W$ and $D$

 $$\tilde\mu = \max_{m \ne m'} |W_{m'}^t D_m|$$

The following theorem gives a sufficient condition on this mutual coherence and on the thresholds so that $\alpha_n$ converges exponentially to the optimal sparse code. ALISTA (Liu et al., 2019) is a particular case of generalized ISTC where $W$ is optimized in order to minimize the mutual coherence $\tilde\mu$. In Section 4.1 we shall optimize $W$ jointly with $D$ without any analytic mutual coherence minimization like in ALISTA.
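The generalized iteration (6) and the joint coherence can be sketched as follows. The column normalization enforces the unit diagonal correlation between each auxiliary column and the matching dictionary column, as used in the proof of Appendix A.1; the function names are hypothetical and both matrices are assumed to have the same shape.

```python
import numpy as np

def joint_coherence(W, D):
    """Mutual coherence of W and D: max over m != m' of |W_m'^t D_m|."""
    G = np.abs(W.T @ D)
    np.fill_diagonal(G, 0.0)
    return G.max()

def normalize_auxiliary(W, D):
    """Rescale each column W_m so that W_m^t D_m = 1."""
    return W / np.sum(W * D, axis=0)

def generalized_istc(D, W, beta, lam_star, lam_max, N):
    """Generalized ISTC, equation (6), with auxiliary matrix W in place of D."""
    alpha = np.zeros(D.shape[1])
    for n in range(1, N + 1):
        lam_n = lam_max * (lam_max / lam_star) ** (-n / N)
        alpha = np.maximum(alpha + W.T @ (beta - D @ alpha) - lam_n, 0.0)
    return alpha
```

Passing the dictionary itself as the auxiliary matrix recovers the original ISTC iteration, and the joint coherence then equals the dictionary coherence of equation (2).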

Theorem 3.1

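Let $\alpha^0$ be the $\ell^0$ sparse code of $\beta$ with error $\beta - D\alpha^0$. If its support size $s$ satisfies

 $$s\,\tilde\mu < 1/2 \qquad (7)$$

then thresholding iterations (6) with

 $$\lambda_n = \lambda_{\max}\,\gamma^{-n} \ge \lambda_* = \frac{\|W^t(\beta - D\alpha^0)\|_\infty}{1 - 2\gamma\tilde\mu s} \qquad (8)$$

define an $\alpha_n$, whose support is included in the support of $\alpha^0$ if $1 < \gamma < (2\tilde\mu s)^{-1}$ and $\lambda_{\max} \ge \|W^t\beta\|_\infty$. The error then decreases exponentially:

 $$\|\alpha_n - \alpha^0\|_\infty \le 2\,\lambda_{\max}\,\gamma^{-n} \qquad (9)$$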

The proof is in Appendix A of the supplementary material. It adapts the convergence proof of Jiao et al. (2017) to arbitrary auxiliary matrices $W$ and positive sparse codes. If we set $W$ to minimize the mutual coherence $\tilde\mu$ then this theorem extends the ALISTA exponential convergence result to the noisy case. It proves exponential convergence by specifying thresholds for a non-zero approximation error.
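The exponential bound of Theorem 3.1 can be observed numerically on a small incoherent dictionary. The sketch below is an illustration under assumptions that are not taken from the paper: the dictionary is the identity concatenated with a normalized Hadamard matrix (coherence 1/8 in dimension 64), the auxiliary matrix is the dictionary itself, the code has two positive coefficients, and the input is noiseless so that the lower bound on the thresholds is zero.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
    return H

def check_theorem_bound(d=64, gamma=1.5, n_iter=20):
    """Return (error, bound, support_ok) for each ISTC iteration with W = D."""
    # Dictionary [I, H / sqrt(d)]: unit-norm columns, coherence 1 / sqrt(d).
    D = np.hstack([np.eye(d), hadamard(d) / np.sqrt(d)])
    G = np.abs(D.T @ D)
    np.fill_diagonal(G, 0.0)
    mu, s = G.max(), 2
    # Hypotheses of the theorem: s * mu < 1/2 and 1 < gamma < 1 / (2 * mu * s).
    assert s * mu < 0.5 and 1 < gamma < 1 / (2 * mu * s)
    alpha0 = np.zeros(2 * d)
    alpha0[[3, d + 5]] = [1.0, 0.7]   # positive code with support size s = 2
    beta = D @ alpha0                 # noiseless input, so lambda_* = 0
    lam_max = np.max(np.abs(D.T @ beta))
    alpha = np.zeros(2 * d)
    results = []
    for n in range(1, n_iter + 1):
        lam_n = lam_max * gamma ** (-n)
        alpha = np.maximum(alpha + D.T @ (beta - D @ alpha) - lam_n, 0.0)
        err = np.max(np.abs(alpha - alpha0))
        support_ok = np.all(alpha[alpha0 == 0] == 0)
        results.append((err, 2 * lam_n, support_ok))
    return results
```

At every iteration the sup-norm error stays below twice the current threshold and no coefficient outside the true support is activated, which is the behavior stated in (9).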

However, one should not be too impressed by this exponential convergence rate because the condition $s\tilde\mu < 1/2$ only applies to very sparse codes in highly incoherent dictionaries. Given a dictionary $D$, it is usually not possible to find $W$ which satisfies this hypothesis. However, this sufficient condition is based on a crude upper bound calculation in the proof. It is not necessary to get an exponential convergence. The next section studies learned dictionaries for classification on ImageNet and shows that when $W = D$, the ISTC algorithm converges exponentially although the incoherence condition is not satisfied. When $W$ is learned independently from $D$, with no mutual coherence condition, we shall see that the algorithm may not converge.

4 Image Classification

The goal of this work is to construct a deep neural network model which is sufficiently simple to be analyzed mathematically, while reaching the accuracy of more complex deep convolutional networks on large classification problems. This is why we concentrate on ImageNet as opposed to MNIST or CIFAR. The next section shows that a single sparse code in a learned dictionary considerably improves the classification performance of a scattering representation, and outperforms AlexNet on ImageNet¹. We analyze the influence of different architecture components. Section 4.2 compares the convergence of homotopy iterated thresholdings with ISTA and FISTA.

4.1 Image Classification on ImageNet

ImageNet 2012 (Russakovsky et al., 2015) is a challenging color image dataset of 1.2 million training images and 50,000 validation images, divided into 1000 classes. Prior to convolutional networks, SIFT representations combined with Fisher vector encoding reached a Top 5 classification accuracy of 74.3% with multiple model averaging (Sánchez and Perronnin, 2011). In their PyTorch implementation, the Top 5 accuracy of AlexNet and ResNet-152 is 79.1% and 94.1% respectively².

The scattering transform at a scale of an ImageNet color image is a spatial array of of channels. If we apply to the same MLP classifier as in AlexNet, with 2 hidden layers of size 4096, ReLU and dropout rate of 0.3, the Top 5 accuracy is 65.3%. We shall use the same AlexNet type MLP classifier in all other experiments, or a linear classifier when specified. If we first apply to a 3-layer SLE network of 1x1 convolutions with ReLU and then the same MLP then the accuracy is improved by and it reaches AlexNet performance (Oyallon et al., 2017). However, there is no mathematical understanding of the operations performed by these three layers, and the origin of the improvements, which partly motivates this work.

The sparse scattering architecture is described in Figure 3. A convolutional operator is applied on a standardized scattering transform to reduce the number of scattering channels from to . It includes learned parameters. The ISTC network illustrated in Figure 2 has layers with ReLU and no batch normalization. A smaller network with has nearly the same classification accuracy but the ISTC sparse coding does not converge as well, as explained in Section 4.2. Increasing to or has little impact on accuracy and on the code precision.

The sparse code is first calculated with a convolutional dictionary having vectors. Dictionary columns have a spatial support of size and thus do not overlap when translated. It preserves a small dictionary coherence so that the iterative thresholding algorithm converges exponentially. This ISTC network takes as input an array of size which has been normalized and outputs a code of size or a reconstruction of size . The total number of learned parameters in is about . The output or of the ISTC network is transformed by a batch normalization, and a average pooling and then provided as input to the MLP classifier. The representation is computed with parameters in and , which is above the parameters of AlexNet. Our goal here is not to reduce the number of parameters but to structure the network into well defined mathematical operators.

If we set $W = D$ in the ISTC network, the supervised learning jointly optimizes the dimension reduction operator, the dictionary $D$ with the Lagrange multiplier $\lambda_*$ and the MLP classifier parameters. It is done with a stochastic gradient descent during 160 epochs using an initial learning rate of 0.01 with a decay of 0.1 at epochs 60 and 120. With a sparse code as input to the MLP, it has a Top 5 accuracy of 81.0%, which outperforms AlexNet.
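The learning rate schedule described above can be written as a small step function. This sketch assumes that "a decay of 0.1" means multiplying the rate by 0.1 at each of the two milestone epochs; the function name and argument names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, milestones=(60, 120), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each milestone epoch."""
    n_decays = sum(epoch >= m for m in milestones)
    return base_lr * factor ** n_decays
```

Under this reading, the rate is 0.01 for epochs 0 to 59, 0.001 for epochs 60 to 119, and 0.0001 for the remaining epochs up to 160.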

If we also jointly optimize $W$ to minimize the classification loss, then the accuracy improves to . However, the next section shows that in this case, the ISTC network does not compute a sparse code and is therefore not mathematically understood. In the following we thus impose that $W = D$.

The dimension reduction operator has a marginal effect in terms of performance. If we eliminate it or if we replace it by an unsupervised PCA dimension reduction, the performance drops by less than , whereas the accuracy drops by almost if we eliminate the sparse coding. The number of learned parameters to compute then drops from to . The considerable improvement brought by the sparse code is further amplified if the MLP classifier is replaced by a much smaller linear classifier. A linear classifier on a scattering vector has a (Top 1, Top 5) accuracy of . With a ISTC sparse code with in a learned dictionary the accuracy jumps to and hence improves by nearly .

The optimization learns a relatively large factor which yields a large approximation error , and a very sparse code with about non-zero coefficients. The sparse approximation thus eliminates nearly half of the energy of which can be interpreted as non-informative “clutter” removal. The sparse approximation of has a small dimension similar to AlexNet last convolutional layer output. If the MLP classifier is applied to as opposed to then the accuracy drops by less than and it remains slightly above AlexNet. Replacing by thus improves the accuracy by . The sparse coding projection eliminates “noise”, which seems to mostly correspond to intra-class variabilities while carrying little discriminative information between classes. Since is a sparse combination of dictionary columns , each can be interpreted as “discriminative features” in the space of scattering coefficients. They are optimized to preserve discriminative directions between classes.

4.2 Convergence of Homotopy Algorithms

To guarantee that the network can be analyzed mathematically, we verify numerically that the homotopy ISTC algorithm computes an accurate approximation of the optimal sparse code in (1), with a small number of iterations.

When $W = D$, Theorem 3.1 guarantees an exponential convergence by imposing a strong incoherence condition $s\,\mu(D) < 1/2$. In our classification setting, the theorem hypothesis is clearly not satisfied. However, this incoherence condition is not necessary. It is derived from a relatively crude upper bound in the proof of Appendix A.1. Figure 4 left shows numerically that the ISTC algorithm for $W = D$ minimizes the Lagrangian (1) over $\alpha \ge 0$, with an exponential convergence which is faster than ISTA and FISTA. This is tested with a dictionary learned by minimizing the classification loss over ImageNet.

If we jointly optimize $W$ and $D$ to minimize the classification loss then the ImageNet classification accuracy improves from to . However, Figure 4 right shows that the generalized ISTC network outputs a sparse code which does not minimize the Lagrangian at all. Indeed, the learned matrix $W$ does not have a minimum joint coherence with the dictionary $D$, as in ALISTA (Liu et al., 2019). The joint coherence $\tilde\mu$ then becomes very large with , which prevents the convergence. Computing $W$ by minimizing the joint coherence would require too many computations.

To further compare the convergence speed of ISTC for $W = D$ versus ISTA and FISTA, we compute the relative mean square error (MSE) between the optimal sparse code $\alpha^1$ and the sparse code output of 12 iterations of each of these three algorithms. The MSE is 0.23 for FISTA and 0.45 for ISTA but only 0.02 for ISTC. In this case, after 12 iterations, ISTC reduces the error by a factor of more than 10 compared to ISTA and FISTA.
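A relative mean square error of this kind can be computed as below. The normalization by the squared norm of the reference code is an assumption, since the exact definition is not given in the text; the function name is hypothetical.

```python
import numpy as np

def relative_mse(alpha_ref, alpha_hat):
    """Squared error between two codes, divided by the squared norm of the reference.

    The choice of normalization is an assumption made for this illustration.
    """
    return np.sum((alpha_ref - alpha_hat) ** 2) / np.sum(alpha_ref ** 2)
```

In practice the reference code would be obtained by running a convergent solver of (1) for many iterations, and the estimate by truncating ISTA, FISTA or ISTC at 12 iterations.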

5 Conclusion

This work shows that learning a single dictionary is sufficient to improve the performance of a predefined scattering representation beyond the accuracy of AlexNet on ImageNet. The resulting deep convolutional network is a scattering transform followed by a positive sparse code, which are well defined mathematical operators. Dictionary vectors capture discriminative directions in the scattering space. The dictionary approximations act as a non-linear projector which removes non-informative intra-class variations.

The dictionary learning is implemented with an ISTC network with ReLUs. We prove exponential convergence in a general framework that includes ALISTA. A sparse scattering network reduces the convolutional network learning to a single dictionary learning problem. It opens the possibility to study the network properties by analyzing the resulting dictionary. It also offers a simpler mathematical framework to analyze optimization issues.

Acknowledgments

This work was supported by the ERC InvariantClass 320959, grants from Région Ile-de-France and the PRAIRIE 3IA Institute of the French ANR-19-P3IA-0001 program. We thank the Scientific Computing Core at the Flatiron Institute for the use of their computing resources. We would like to thank Eugene Belilovsky for helpful discussions and comments.

Appendix A

A.1 Proof of Theorem 3.1

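Let $\alpha^0$ be the optimal $\ell^0$ sparse code. We denote by $S(\alpha)$ the support of any $\alpha$. We also write $\rho_\lambda(a) = \rho(a - \lambda)$. We are going to prove by induction on $n$ that for any $n \ge 0$ we have $S(\alpha_n) \subset S(\alpha^0)$ and $\|\alpha_n - \alpha^0\|_\infty \le 2\lambda_n$ if $\lambda_n \ge \lambda_*$.

For $n = 0$, $\alpha_0 = 0$ so $S(\alpha_0) = \emptyset$ is indeed included in the support of $\alpha^0$ and $\|\alpha_0 - \alpha^0\|_\infty = \|\alpha^0\|_\infty$. To verify the induction hypothesis for $n = 0$, we shall prove that $\|\alpha^0\|_\infty \le 2\lambda_0$.

Let us write the error $w = \beta - D\alpha^0$. For all $m$

 $$\alpha^0(m)\, W_m^t D_m = W_m^t \beta - W_m^t w - \sum_{m' \ne m} \alpha^0(m')\, W_m^t D_{m'}.$$

Since the support of $\alpha^0$ is smaller than $s$, $W_m^t D_m = 1$ and $|W_m^t D_{m'}| \le \tilde\mu$ for $m \ne m'$,

 $$|\alpha^0(m)| \le |W_m^t \beta| + |W_m^t w| + s\tilde\mu\, \|\alpha^0\|_\infty$$

so taking the max on $m$ gives:

 $$\|\alpha^0\|_\infty (1 - \tilde\mu s) \le \|W^t\beta\|_\infty + \|W^t w\|_\infty$$

But given the inequalities

 $$\|W^t\beta\|_\infty \le \lambda_{\max}, \qquad \|W^t w\|_\infty \le \lambda_{\max}(1 - 2\gamma\tilde\mu s), \qquad \frac{1 - \gamma\tilde\mu s}{1 - \tilde\mu s} \le 1 \ \text{ since } \gamma \ge 1 \text{ and } (1 - \tilde\mu s) > 0$$

we get

 $$\|\alpha^0\|_\infty \le 2\lambda_{\max} = 2\lambda_0$$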
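For readers who want the intermediate step, the base case follows by inserting the first two inequalities into the bound obtained after taking the max, and then using the third. Here the sparse code is written with a superscript to distinguish it from the zero initialization of the iteration:

```latex
\begin{aligned}
\|\alpha^0\|_\infty (1 - \tilde\mu s)
  &\le \|W^t\beta\|_\infty + \|W^t w\|_\infty \\
  &\le \lambda_{\max} + \lambda_{\max}(1 - 2\gamma\tilde\mu s)
   = 2\lambda_{\max}(1 - \gamma\tilde\mu s), \\
\|\alpha^0\|_\infty
  &\le 2\lambda_{\max}\,\frac{1 - \gamma\tilde\mu s}{1 - \tilde\mu s}
   \le 2\lambda_{\max}.
\end{aligned}
```

The second inequality on the error term is the threshold condition (8) applied with the largest threshold.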

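Let us now suppose that the property is valid for $n$ and let us prove it for $n+1$. We denote by $D_A$ the restriction of $D$ to vectors indexed by $A$. We begin by showing that $S(\alpha_{n+1}) \subset S(\alpha^0)$. For any $m$, since $\beta = D\alpha^0 + w$ and $W_m^t D_m = 1$ we have

 $$\begin{aligned} \alpha_{n+1}(m) &= \rho_{\lambda_{n+1}}\big(\alpha_n(m) + W_m^t(\beta - D\alpha_n)\big) \\ &= \rho_{\lambda_{n+1}}\big(\alpha^0(m) + W_m^t(D_{S(\alpha^0)\cup S(\alpha_n) - \{m\}}(\alpha^0 - \alpha_n)_{S(\alpha^0)\cup S(\alpha_n) - \{m\}} + w)\big) \end{aligned}$$

For any $m$ not in $S(\alpha^0)$, let us prove that $\alpha_{n+1}(m) = 0$. The induction hypothesis assumes that $S(\alpha_n) \subset S(\alpha^0)$ and $\|\alpha^0 - \alpha_n\|_\infty \le 2\lambda_n$ with $\lambda_n \ge \lambda_*$ so:

 $$\begin{aligned} I &= \big|\alpha^0(m) + W_m^t(D_{S(\alpha^0)\cup S(\alpha_n) - \{m\}}(\alpha^0 - \alpha_n)_{S(\alpha^0)\cup S(\alpha_n) - \{m\}} + w)\big| \\ &\le |W_m^t(D_{S(\alpha^0)}(\alpha^0 - \alpha_n)_{S(\alpha^0)})| + |W_m^t w| \quad \text{since } S(\alpha_n) \subset S(\alpha^0) \text{ and } \alpha^0(m) = 0 \text{ by assumption} \\ &\le \tilde\mu s\, \|\alpha^0 - \alpha_n\|_\infty + \|W^t w\|_\infty \end{aligned}$$

Since we assume that $\lambda_{n+1} \ge \lambda_*$, we have

 $$\|W^t w\|_\infty \le (1 - 2\gamma\tilde\mu s)\,\lambda_{n+1}$$

and thus

 $$I \le \tilde\mu s\, \|\alpha^0 - \alpha_n\|_\infty + \|W^t w\|_\infty \le \tilde\mu s\, 2\lambda_n + \lambda_{n+1}(1 - 2\gamma\tilde\mu s) \le \lambda_{n+1}$$

since $\lambda_n = \gamma\lambda_{n+1}$.

Because of the thresholding $\rho_{\lambda_{n+1}}$, it proves that $\alpha_{n+1}(m) = 0$ and hence that $S(\alpha_{n+1}) \subset S(\alpha^0)$.

Let us now evaluate $\|\alpha^0 - \alpha_{n+1}\|_\infty$. For any , a soft thresholding satisfies

 $$|\rho_\lambda(\alpha_1 + \alpha_2) - \alpha_1| \le \lambda + |\alpha_2|$$

so:

 $$\begin{aligned} |\alpha_{n+1}(m) - \alpha^0(m)| &\le \lambda_{n+1} + |W_m^t(D_{S(\alpha^0)\cup S(\alpha_n) - \{m\}}(\alpha^0 - \alpha_n)_{S(\alpha^0)\cup S(\alpha_n) - \{m\}})| + |W_m^t w| \\ &\le \lambda_{n+1} + \tilde\mu s\, \|\alpha^0 - \alpha_n\|_\infty + \|W^t w\|_\infty \\ &\le \lambda_{n+1} + \tilde\mu s\, 2\lambda_n + \lambda_{n+1}(1 - 2\gamma\tilde\mu s) = 2\lambda_{n+1} \end{aligned}$$

Taking a max over $m$ proves the induction hypothesis.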
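The thresholding inequality used in the last step of the proof can be checked numerically on a grid. The sketch below uses the positive thresholding of the paper, for which the inequality is checked with a non-negative first argument and a non-negative threshold; this restriction is an assumption that matches the use in the proof, where the first argument is a coefficient of a positive sparse code. The function names are hypothetical.

```python
import numpy as np

def rho(a, lam):
    """Positive soft thresholding rho_lambda(a) = max(a - lambda, 0)."""
    return np.maximum(a - lam, 0.0)

def max_violation(grid):
    """Largest value of |rho_lam(a1 + a2) - a1| - (lam + |a2|) over the grid.

    a1 and lam range over the non-negative grid points, a2 over all of them.
    """
    worst = -np.inf
    for lam in grid[grid >= 0]:
        for a1 in grid[grid >= 0]:
            for a2 in grid:
                gap = abs(rho(a1 + a2, lam) - a1) - (lam + abs(a2))
                worst = max(worst, gap)
    return worst
```

A non-positive result (up to rounding) means that the inequality holds at every grid point.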

Footnotes

1. Code to reproduce experiments is available at https://github.com/j-zarka/SparseScatNet
2. Accuracies from https://pytorch.org/docs/master/torchvision/models.html

References

1. Andreux et al. (2018). Kymatio: scattering transforms in python. CoRR.
2. Beck and Teboulle (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences 2 (1), pp. 183–202.
3. Bruna and Mallat (2013). Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell. 35 (8), pp. 1872–1886.
4. Candes et al. (2006). Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory 52 (2), pp. 489–509.
5. Combettes and Pesquet (2011). Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering.
6. Daubechies et al. (2004). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics 57 (11), pp. 1413–1457.
7. Davis et al. (1997). Adaptive greedy approximations. Constr. Approx. 13 (1), pp. 57–98.
8. Donoho and Elad (2006). On the stability of the basis pursuit in the presence of noise. Signal Processing 86 (3), pp. 511–532.
9. Donoho and Tsaig (2008). Fast solution of ℓ1-norm minimization problems when the solution may be sparse. IEEE Trans. Information Theory 54 (11), pp. 4789–4812.
10. Gregor and LeCun (2010). Learning fast approximations of sparse coding. In ICML, pp. 399–406.
11. Jiao et al. (2017). Iterative soft/hard thresholding with homotopy continuation for sparse recovery. IEEE Signal Processing Letters 24 (6), pp. 784–788.
12. Krizhevsky et al. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1097–1105.
13. LeCun et al. (2015). Deep learning. Nature 521 (7553), pp. 436–444.
14. Liu et al. (2019). ALISTA: analytic weights are as good as learned weights in LISTA. In International Conference on Learning Representations.
15. Mahdizadehaghdam et al. (2019). Deep dictionary learning: a parametric network approach. IEEE Transactions on Image Processing 28 (10), pp. 4790–4802.
16. Mairal et al. (2011). Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (4), pp. 791–804.
17. Mairal et al. (2009). Supervised dictionary learning. In Advances in Neural Information Processing Systems, pp. 1033–1040.
18. Mallat and Zhang (1993). Matching pursuits with time-frequency dictionaries. Trans. Sig. Proc. 41 (12), pp. 3397–3415.
19. Mallat (2012). Group invariant scattering. Comm. Pure Appl. Math. 65 (10), pp. 1331–1398.
20. Mallat (2016). Understanding deep convolutional networks. Phil. Trans. of Royal Society A 374 (2065).
21. Osborne et al. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis 20 (3), pp. 389.
22. Oyallon et al. (2017). Scaling the scattering transform: deep hybrid networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5618–5627.
23. Oyallon and Mallat (2014). Deep roto-translation scattering for object classification. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2865–2873.
24. Oyallon et al. (2019). Scattering networks for hybrid representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (9), pp. 2208–2221.
25. Papyan et al. (2017). Convolutional neural networks analyzed via convolutional sparse coding. Journal of Machine Learning Research 18, pp. 83:1–83:52.
26. Perronnin and Larlus (2015). Fisher vectors meet neural networks: a hybrid classification architecture. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3743–3752.
27. Russakovsky et al. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
28. Sánchez and Perronnin (2011). High-dimensional signature compression for large-scale image classification. In CVPR, pp. 1665–1672.
29. Sulam et al. (2018). Multilayer convolutional sparse modeling: pursuit and dictionary learning. IEEE Transactions on Signal Processing 66 (15), pp. 4090–4104.
30. Sun et al. (2018). Supervised deep sparse coding networks. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 346–350.
31. Xiao and Zhang (2013). A proximal-gradient homotopy method for the sparse least-square problem. SIAM Journal on Optimization 23 (2), pp. 1062–1091.