Visualizing and Improving Scattering Networks
Abstract
Scattering Transforms (or ScatterNets), introduced by Mallat in [1], are a promising start towards a well-defined feature extractor for pattern recognition and image classification tasks. They are of particular interest due to their architectural similarity to Convolutional Neural Networks (CNNs), while requiring no parameter learning and still performing very well (particularly in constrained classification tasks).
In this paper we visualize what the deeper layers of a ScatterNet are sensitive to using a ‘DeScatterNet’. We show that the higher orders of ScatterNets are sensitive to complex, edge-like patterns (checkerboards and rippled edges). These complex patterns may be useful for texture classification, but are quite dissimilar from the patterns visualized in the second and third layers of Convolutional Neural Networks (CNNs), the current state-of-the-art image classifiers. We propose that this may be the source of the current gap in performance between ScatterNets and CNNs (83% vs 93% on CIFAR10 for ScatterNet+SVM vs ResNet). We then use these visualization tools to propose possible enhancements to the ScatterNet design, which we show have the power to extract features more closely resembling those of CNNs, while still being well-defined and having the invariance properties fundamental to ScatterNets.
To appear in the 27th IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2017.
Fergal Cotter and Nick Kingsbury
Signal Processing Group, Department of Engineering, University of Cambridge, U.K.
ScatterNets, DeScatterNets, Scattering Network, Convolutional Neural Network, Visualization
1 Introduction
Scattering transforms, or ScatterNets, have recently gained much attention and use due to their ability to extract generic and descriptive features in a well-defined way. They can be used as unsupervised feature extractors for image classification [2, 3, 4, 5] and texture classification [6], or in combination with supervised methods such as Convolutional Neural Networks (CNNs) to make the latter learn more quickly and more stably [7].
ScatterNets have been shown to perform very well as image classifiers. In particular, they can outperform CNNs for classification tasks with reduced training set sizes, e.g. in CIFAR10 and CIFAR100 (Table 6 from [7] and Table 4 from [4]). They are also near state-of-the-art for texture discrimination tasks (Tables 1–3 from [6]). Despite this, there still exists a considerable gap between them and CNNs on challenges like CIFAR10 with the full training set (83% vs. 93%). Even considering the benefits of ScatterNets, this gap must be addressed.
We first review the operations that form a ScatterNet in Section 2. We then introduce our DeScatterNet (Section 3), and show how we can use it to examine the layers of ScatterNets (using a similar technique to the CNN visualization in [8]). We use this analysis tool to highlight what patterns a ScatterNet is sensitive to (Section 4), showing that they are very different from what their CNN counterparts are sensitive to, and possibly less useful for discriminative tasks.
We use these observations to propose an architectural change to ScatterNets, which have not changed much since their inception in [1]. Two changes of note, however, are the work of Sifre and Mallat in [6], and the work of Singh and Kingsbury in [4]. Sifre and Mallat introduced Rotationally Invariant ScatterNets, which took ScatterNets in a new direction, as the architecture now included filtering across the wavelet orientations (albeit with heavy restrictions on the filters used). Singh and Kingsbury achieved improvements in performance in a Scattering system using the spatially implementable dual-tree complex wavelet transform (DTCWT) [9] wavelets instead of the Fast Fourier Transform (FFT) based Morlet wavelets previously used.
We build on these two systems, showing that with carefully designed complex filters applied across the complex spatial coefficients of a 2D DTCWT, we can build filters that are sensitive to more recognizable shapes like those commonly seen in CNNs, such as corners and curves (Section 5).
2 The Scattering Transform
The Scattering Transform, or ScatterNet, is a cascade of complex wavelet transforms and modulus nonlinearities (throwing away the phase of the complex wavelet coefficients). At a chosen scale, averaging filters provide invariance to nuisance variations such as shift and deformation (and potentially rotations). Due to the nonexpansive nature of the wavelet transform and the modulus operation, this transform is stable to deformations.
Typical implementations of the ScatterNet are limited to two ‘orders’ (equivalent to layers in a CNN) [3, 4, 7]. In addition to scattering order, we also have the scale of invariance, $J$. This is the number of bandpass coefficients output from a wavelet filter bank (FB), and defines the cutoff frequency for the final lowpass output: $2^{-J} f_s / 2$ ($f_s$ is the sampling frequency of the signal). Finally, we call the number of oriented wavelet coefficients used $L$. These are the three main hyperparameters of the scattering transform and must be set ahead of time. We describe a system with scale parameter $J$, order $m$ and with $L$ orientations ($L$ is fixed to 6 for the DTCWT but is flexible for the FFT based Morlet wavelets).
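Consider an input signal $x(u)$, $u \in \mathbb{R}^2$. The zeroth order scatter coefficient is the lowpass output of a $J$ level FB:
$$S_0 x(u) = (x * \phi_J)(u) \tag{1}$$
This is invariant to translations of up to $2^J$ pixels. Gaining this invariance discards the information outside the lowpass band, which is instead captured by the first order wavelet coefficients:
$$W_1 x(u, j, \theta) = (x * \psi_{j,\theta})(u) \tag{2}$$
for $j \in \{1, 2, \ldots, J\}$, $\theta \in \{1, 2, \ldots, L\}$. We will want to retain this information in these coefficients to build a useful classifier.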
Let us call the set of available scales and orientations $\Lambda_1$ and use $\lambda_1$ to index it. For both Morlet and DTCWT implementations, $\psi$ is complex-valued, i.e., $\psi = \psi^r + j\psi^i$, with $\psi^r$ and $\psi^i$ forming a Hilbert pair, resulting in an analytic $\psi$. This analyticity provides a source of invariance: small input shifts in $x$ result in a phase rotation (but little magnitude change) of the complex wavelet coefficients.
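The effect of analyticity can be illustrated numerically. The following is a minimal 1-D sketch, not the DTCWT or Morlet filters used in this paper: it assumes a hypothetical complex Gabor wavelet (the values of `sigma` and `omega` are arbitrary choices) and compares the coefficients of a unit impulse before and after a one-sample shift.

```python
import numpy as np

# Hypothetical 1-D complex Gabor wavelet: Gaussian envelope times a complex exponential.
n = np.arange(-32, 33)
sigma, omega = 8.0, np.pi / 2          # assumed envelope width and centre frequency
psi = np.exp(-n**2 / (2 * sigma**2)) * np.exp(1j * omega * n)

def response(shift):
    """Complex wavelet coefficients of a unit impulse placed at sample 64 + shift."""
    x = np.zeros(128)
    x[64 + shift] = 1.0
    return np.convolve(x, psi, mode="same")

w0, w1 = response(0), response(1)

# Largest change, over all output positions, caused by a one-sample input shift.
real_change = np.max(np.abs(w1.real - w0.real))
mag_change = np.max(np.abs(np.abs(w1) - np.abs(w0)))
print(f"real part changes by up to {real_change:.2f}, magnitude by up to {mag_change:.2f}")
```

Under these assumptions the real part (what a purely real filter would output) changes by roughly the full height of the response, while the magnitude changes by less than a tenth of it.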
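Taking the magnitude of the wavelet coefficients $W_1$ at each scale and orientation $\lambda_1$ gives us the first order propagated signals:
$$U_1 x(\lambda_1, u) = |x * \psi_{\lambda_1}| = \sqrt{(x * \psi^r_{\lambda_1})^2 + (x * \psi^i_{\lambda_1})^2} \tag{3}$$
The first order scattering coefficient makes $U_1$ invariant up to our scale $J$ by averaging it:
$$S_1 x(\lambda_1, u) = |x * \psi_{\lambda_1}| * \phi_J \tag{4}$$
If we define $U_0 \triangleq x$, then we can iteratively define:
$$W_m = U_{m-1} * \psi_{\lambda_m} \tag{5}$$
$$U_m = |W_m| \tag{6}$$
$$S_m = U_m * \phi_J \tag{7}$$
We repeat this for higher orders, although previous work shows that, for natural images, we get diminishing returns after $m = 2$. The output of our ScatterNet is then:
$$S x = \{S_0 x, S_1 x, S_2 x\} \tag{8}$$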
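The cascade of wavelet filtering, modulus and averaging described in this section can be sketched in a few lines of NumPy. This is an illustrative toy, not the implementation used in this paper: it assumes Gabor-like bandpass filters defined directly in the frequency domain (rather than the DTCWT), circular convolution via the FFT, a Gaussian averaging filter, and the common convention that each order only filters at scales coarser than the previous one. All function names and filter constants are hypothetical.

```python
import numpy as np

def gaussian_lowpass(n, sigma):
    """Frequency response of a real Gaussian averaging filter (spatial std `sigma`)."""
    f = np.fft.fftfreq(n)                                  # cycles per sample
    fy, fx = np.meshgrid(f, f, indexing="ij")
    return np.exp(-2 * (np.pi * sigma) ** 2 * (fx ** 2 + fy ** 2))

def gabor_bandpass(n, j, theta, xi=3 * np.pi / 4):
    """Frequency response of a complex, approximately one-sided wavelet at scale j, angle theta."""
    w = 2 * np.pi * np.fft.fftfreq(n)                      # radians per sample
    wy, wx = np.meshgrid(w, w, indexing="ij")
    w0, sigma = xi / 2 ** j, 0.8 * 2 ** j                  # assumed centre frequency and width
    return np.exp(-0.5 * sigma ** 2 * ((wx - w0 * np.cos(theta)) ** 2
                                       + (wy - w0 * np.sin(theta)) ** 2))

def conv(a, freq_response):
    """Circular convolution via the FFT."""
    return np.fft.ifft2(np.fft.fft2(a) * freq_response)

def scatter(x, J=2, L=6, order=2):
    """Toy scattering transform of a square grayscale image: [channels, n / 2^J, n / 2^J]."""
    n = x.shape[0]
    phi = gaussian_lowpass(n, sigma=0.8 * 2 ** J)
    thetas = [(2 * l + 1) * np.pi / (2 * L) for l in range(L)]   # 15, 45, ..., 165 degrees for L = 6

    def average(u):                                        # S_m = U_m * phi_J, then decimate
        return conv(u, phi).real[:: 2 ** J, :: 2 ** J]

    S = [average(x)]                                       # S_0
    U_prev = [(0, x)]                                      # (scale of last wavelet, U_{m-1})
    for m in range(1, order + 1):
        U_next = []
        for j_prev, u in U_prev:
            for j in range(j_prev + 1, J + 1):             # only coarser scales (assumption)
                for theta in thetas:
                    W = conv(u, gabor_bandpass(n, j, theta))   # W_m = U_{m-1} * psi
                    U_next.append((j, np.abs(W)))              # U_m = |W_m|
        S.extend(average(u) for _, u in U_next)
        U_prev = U_next
    return np.stack(S)

x = np.random.default_rng(0).standard_normal((32, 32))
Sx = scatter(x)
print(Sx.shape)   # (49, 8, 8): 1 + L*J + L^2 * J*(J-1)/2 channels
```

Recomputing each filter inside the loop keeps the sketch short; a practical implementation would precompute the filter bank once.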
2.1 Scattering Color Images
A wavelet transform like the DTCWT accepts single channel input, while we often work on RGB images. This leaves us with a choice. We can either:

1. Apply the wavelet transform (and the subsequent scattering operations) on each channel independently. This would triple the output size.

2. Define a frequency threshold below which we keep color information, and above which we combine the three channels into a single luminance channel.
The second option uses the well known fact that the human eye is far less sensitive to higher spatial frequencies in color channels than in luminance channels. This also fits in with the first layer filters seen in the well known Convolutional Neural Network, AlexNet. Roughly one half of the filters were low frequency color ‘blobs’, while the other half were higher frequency, grayscale, oriented wavelets.
For this reason, we choose the second option for the architecture described in this paper. We keep the 3 color channels in our $S_0$ coefficients, but work only on grayscale for higher orders (the $S_0$ coefficients are the lowpass bands of a $J$-scale wavelet transform, so we have effectively chosen a color cutoff frequency of $2^{-J} f_s / 2$).
For example, consider an RGB input image of size . The scattering transform we have described with parameters and would then have the following coefficients:
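As an illustration of how such coefficient counts arise, the sketch below computes the output shape of a second-order ScatterNet under stated assumptions: the image size, $J$ and $L$ passed in at the bottom are values chosen here purely for illustration, the lowpass output keeps the color channels while higher orders are grayscale, and second-order paths are only computed where the second scale is coarser than the first. The function name is hypothetical.

```python
def scatter_output_shape(n, J, L, color_channels=3):
    """Output shape of a second-order ScatterNet on an n x n image (assumptions above)."""
    side = n // 2**J                      # spatial size after averaging at scale J
    s0 = color_channels                   # lowpass output keeps the color channels
    s1 = L * J                            # one grayscale channel per (scale, orientation)
    s2 = L**2 * J * (J - 1) // 2          # paths whose second scale is coarser than the first
    return side, side, s0 + s1 + s2

shape = scatter_output_shape(64, J=4, L=6)
print(shape)   # (4, 4, 243): 3 + 24 + 216 channels
```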
3 The Inverse Network
We now introduce our inverse scattering network. This allows us to back project scattering coefficients to the image plane; it is inspired by the DeconvNet used by Zeiler and Fergus in [8] to look into the deeper layers of CNNs.
We emphasize that instead of thinking about perfectly reconstructing $x$ from $Sx$, we want to see what signal/pattern in the input image caused a large activation in each channel. This gives us a good idea of what each output channel is sensitive to, or what it extracts from the input. Note that we do not use any of the log normalization layers described in [3, 4].
3.1 Inverting the LowPass Filtering
Going from the $U$ coefficients to the $S$ coefficients involved convolving by a lowpass filter $\phi_J$, followed by decimation to reduce the spatial size of the output. $\phi_J$ is a purely real filter, and we can ‘invert’ this operation by interpolating $S$ to the same spatial size as $U$ and convolving with the mirror image of $\phi_J$, $\tilde{\phi}_J$ (this is equivalent to the transpose convolution described in [8]).
$$\hat{U}_m = S_m * \tilde{\phi}_J \tag{9}$$
This will not recover $U_m$ as it was on the forward pass, but will recover all the information in $U_m$ that caused a strong response in $S_m$.
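The equivalence between ‘interpolate, then convolve with the mirror image’ and the transpose of ‘convolve, then decimate’ can be checked with a small 1-D sketch. The filter taps, decimation factor and function names below are assumptions made for illustration, not the filters used in this paper.

```python
import numpy as np

def lowpass_decimate(u, phi, d):
    """Forward pass: convolve with the real lowpass filter phi, then keep every d-th sample."""
    return np.convolve(u, phi, mode="full")[::d]

def lowpass_decimate_transpose(s, phi, d, n):
    """Backward pass: zero-insert back to the undecimated length, then
    convolve with the mirror image of phi. n is the length of the original input."""
    z = np.zeros(n + len(phi) - 1)
    z[::d] = s
    return np.convolve(z, phi[::-1], mode="valid")

rng = np.random.default_rng(0)
phi = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16   # hypothetical lowpass taps
u = rng.standard_normal(32)
s = lowpass_decimate(u, phi, d=4)
u_hat = lowpass_decimate_transpose(s, phi, d=4, n=len(u))

# Adjoint check: <A u, y> equals <u, A^T y> for any y, so the backward pass is the transpose.
y = rng.standard_normal(len(s))
lhs = np.dot(s, y)
rhs = np.dot(u, lowpass_decimate_transpose(y, phi, d=4, n=len(u)))
print(np.isclose(lhs, rhs))
```

Note that `u_hat` has the same length as `u` but is not equal to it; the transpose projects the averaged signal back rather than undoing the averaging.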
3.2 Inverting the Magnitude Operation
In the same vein as [8], we face a difficult task in inverting the nonlinearity in our system. We take inspiration from the switches introduced in the DeconvNet; the switches in a DeconvNet save the location of maximal activations so that on the backwards pass activation layers can be unpooled trivially. We do an equivalent operation by saving the phase of the complex activations. On the backwards pass we reinsert the phase to give our recovered $\hat{W}$.
$$\hat{W}_m = \hat{U}_m e^{j \angle W_m} \tag{10}$$
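A minimal sketch of this phase ‘switch’ is given below. The random array stands in for one subband of complex wavelet coefficients, and masking all but the largest magnitude is an assumption made to mimic a single strong activation being projected back.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for one subband of complex wavelet coefficients.
w = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))

# Forward pass: propagate the magnitude and save the phase as the 'switch'.
u = np.abs(w)
phase = np.angle(w)

# Backward pass: keep only the largest magnitude, then reinsert the saved phase.
u_hat = np.zeros_like(u)
idx = np.unravel_index(np.argmax(u), u.shape)
u_hat[idx] = u[idx]
w_hat = u_hat * np.exp(1j * phase)
```

With no masking, `u * np.exp(1j * phase)` reproduces `w` exactly (up to rounding), which is why saving the phase makes the modulus trivially invertible on the backward pass.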
3.3 Inverting the Wavelet Decomposition
Using the DTCWT makes inverting the wavelet transform simple, as we can simply feed the coefficients through the synthesis filter banks to regenerate the signal. For complex $\psi$, this is convolving with the conjugate transpose $\tilde{\psi}$:
$$\hat{U}_{m-1} = \hat{W}_m * \tilde{\psi}_{\lambda_m} \tag{11}$$
4 Visualization with Inverse Scattering
To examine our ScatterNet, we scatter all of the images from ImageNet’s validation set and record the top 9 images which most highly activate each of the channels in the ScatterNet. This is the identification phase (in which no inverse scattering is performed).
Then, in the reconstruction phase, we load in the images, and scatter them one by one. We take the resulting output vector and mask all but a single value in the channel we are currently examining.
This 1-sparse tensor is then presented to the inverse scattering network from Figure 1 and projected back to the image space. Some results of this are shown in Figure 2. This figure shows reconstructed features from the layers of a ScatterNet. For a given output channel, we show the top 9 activations projected independently to pixel space. For the first and second order coefficients, we also show the patch of pixels in the input image which causes this large output. We display activations from various scales (increasing from first row to last row), and random orientations in these scales.
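The identification and masking steps can be sketched as follows. This is an illustrative outline only: the array layout, the function names, ranking images by their peak activation in each channel, and keeping the largest value in the chosen channel are all assumptions.

```python
import numpy as np

def top_k_images(acts, k=9):
    """acts: [n_images, channels, H, W] scattering outputs.
    Returns [channels, k] image indices, strongest first."""
    peak = acts.reshape(acts.shape[0], acts.shape[1], -1).max(axis=2)   # [n_images, channels]
    order = np.argsort(-peak, axis=0)                                   # descending per channel
    return order[:k].T

def one_sparse(s, channel):
    """s: [channels, H, W]. Keep only the largest value in `channel`; zero everything else."""
    out = np.zeros_like(s)
    r, c = np.unravel_index(np.argmax(s[channel]), s[channel].shape)
    out[channel, r, c] = s[channel, r, c]
    return out

rng = np.random.default_rng(0)
acts = rng.random((20, 5, 4, 4))          # stand-in for the outputs of 20 scattered images
top = top_k_images(acts)                  # identification phase
masked = one_sparse(acts[top[2, 0]], 2)   # reconstruction input for channel 2's strongest image
```

The `masked` tensor is what would be handed to the inverse network; everything except one coefficient is zero.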
The order 1 scattering (labelled with ‘Order 1’ in Figure 2) coefficients look quite similar to the first layer filters from the well known AlexNet CNN [11]. This is not too surprising, as the first order scattering coefficients are simply a wavelet transform followed by average pooling. They are responding to images with strong edges aligned with the wavelet orientation.
The second order coefficients (labelled with ‘Order 2’ in Figure 2) appear very similar to the order 1 coefficients at first glance. They too are sensitive to edge-like features, and some of them (e.g. third row, third column and fourth row, second column) are mostly just that. These are features that have the same oriented wavelet applied at both the first and second order. Others, such as the 9 in the first row, first column, and first row, fourth column are more sensitive to checkerboard-like patterns. Indeed, these are activations where the orientations of the wavelets for the first and second order scattering were far from each other (15° and 105° for the first row, first column and 105° and 45° for the first row, fourth column).
For comparison, we include reconstructions from the second layer of the well-known VGG CNN (labelled with ‘VGG conv2_2’ in Figure 2). These were made with a DeconvNet, following the same method as [8]. Note that while some of the features are edge-like, we also see higher order shapes like corners, crosses and curves.
5 Corners, Crosses and Curves
These reconstructions show that the features extracted from ScatterNets vary significantly from those learned in CNNs after the first order. In many respects, the features extracted from a CNN like VGGNet look preferable for use as part of a classification system.
[6] and [3] introduced the idea of a ‘Roto-Translation’ ScatterNet. Invariance to rotation could be achieved by applying averaging (and bandpass) filters across the orientations from the wavelet transform before applying the complex modulus. Momentarily ignoring the form of the filters they apply, referring to them as $F$, we can think of this stage as stacking the outputs of a complex wavelet transform on top of each other, and convolving these filters over all spatial locations of the wavelet coefficients (this is equivalent to how filters in a CNN are fully connected in depth):
$$V_m = \sum_{\theta} W_m(\cdot, \theta) * F(\cdot, \theta) \tag{12}$$
We then take the modulus of these complex outputs to make a second propagated signal:
$$U'_m = |V_m| \tag{13}$$
We present a variation on this idea, by filtering with a more general filter. We use filters of length 12 rather than 6, as we use the 6 orientations and their complex conjugates; each wavelet is a 30° rotation of the previous, so with 12 rotations, we can cover the full 360°.
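A filter that spans the orientations but has no spatial extent reduces to a weighted sum over the stacked subbands, followed by a modulus. The sketch below shows this for the 12-channel stack of orientations and their conjugates; the random coefficients, the channel ordering and the filter values are all made up for illustration and are not the kernels used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
L, H, W = 6, 8, 8
# Stand-in for the six complex oriented subbands at one scale.
w = rng.standard_normal((L, H, W)) + 1j * rng.standard_normal((L, H, W))

# Stack the orientations and their complex conjugates: 12 channels (ordering is an assumption).
stack = np.concatenate([w, np.conj(w)], axis=0)          # [12, H, W]

# Hypothetical sparse complex filter across the 12 channels.
f = np.zeros(12, dtype=complex)
f[[0, 1]] = [1.0, 1j]

v = np.tensordot(f, stack, axes=1)                       # weighted sum over channels: [H, W]
u2 = np.abs(v)                                           # modulus gives the propagated signal
```

Multiplying `f` by any unit-magnitude complex number leaves `u2` unchanged, since only the modulus of the filtered output is kept.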
Figure 3 shows some reconstructions from these coefficients. Each of the four quadrants shows reconstructions from a different class of ScatterNet layer. All shapes are shown in real and imaginary Hilbert-like pairs; the top images in each quadrant are reconstructed from a purely real coefficient, while the bottom images are reconstructed from a purely imaginary coefficient. This shows one level of invariance of these filters, as after taking the complex magnitude, both the top and the bottom shape will activate the filter with the same strength. In comparison, for the purely real filters of a CNN, the top shape would cause a large output, and the bottom shape would cause near 0 activity (they are nearly orthogonal to each other).
In the top left, we display the 6 wavelet filters for reference (these were reconstructed from , not ). In the top right of the figure we see some of the shapes made by using the filters from the Roto-Translation ScatterNet [6, 3]. The bottom left is where we present some of our novel kernels. These are simple corner-like shapes made by filtering with
(14) 
The six orientations are made by rolling the coefficients in the filter along by one sample at a time. Coefficients roll back around (like circular convolution) when they reach the end.
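Rolling a length-12 filter with wrap-around can be sketched with `np.roll`. The filter values below are hypothetical placeholders and are not the coefficients of equation (14).

```python
import numpy as np

# Hypothetical length-12 filter across orientations (placeholder values only).
f = np.zeros(12, dtype=complex)
f[:3] = [1.0, 1j, 1.0]

# Each row is the previous one shifted along by one sample, i.e. one 30 degree step.
bank = np.stack([np.roll(f, k) for k in range(6)])       # [6, 12]

# Coefficients that fall off the end wrap around to the start.
wrapped = np.roll(f, 11)
```

Here `wrapped` has its first coefficient in the last position and the remaining nonzero coefficients at the start, which is the circular behaviour described above.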
Finally, in the bottom right we see shapes made by . Note that with the exception of the ring-like shape which has 12 nonzero coefficients, all of these shapes were reconstructed with filters that have 4 to 8 nonzero coefficients of a possible 64. These shapes are now beginning to more closely resemble the more complex shapes seen in the middle stages of CNNs.
6 Discussion
This paper presents a way to investigate what the higher orders of a ScatterNet are responding to: the DeScatterNet described in Section 3. Using this, we have shown that the second ‘layer’ of a ScatterNet responds strongly to patterns that are very dissimilar to those that highly activate the second layer of a CNN. As well as being dissimilar to CNNs, visual inspection of the ScatterNet’s patterns reveals that they may be less useful for discriminative tasks, and we believe this may be causing the current gap in state-of-the-art performance between the two.
We have presented an architectural change to ScatterNets that can make them sensitive to more recognizable shapes. We believe that using this new layer is how we can start to close the gap, making more generic and descriptive ScatterNets while keeping control of their desirable properties.
A future paper will include classifier results for these new filters.
Footnotes
 From here on, we drop the $u$ notation when indexing $x$, for clarity.
 In comparison to a system with purely real filters such as a CNN, which would have rapidly varying coefficients for small input shifts [9].
References
[1] Stéphane Mallat, “Group Invariant Scattering,” Communications on Pure and Applied Mathematics, vol. 65, no. 10, pp. 1331–1398, Oct. 2012.
[2] J. Bruna and S. Mallat, “Invariant Scattering Convolution Networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1872–1886, Aug. 2013.
[3] Edouard Oyallon and Stéphane Mallat, “Deep Roto-Translation Scattering for Object Classification,” 2015, pp. 2865–2873.
[4] Amarjot Singh and Nick Kingsbury, “Dual-Tree Wavelet Scattering Network with Parametric Log Transformation for Object Classification,” arXiv:1702.03267 [cs], Feb. 2017.
[5] Amarjot Singh and Nick Kingsbury, “Multi-Resolution Dual-Tree Wavelet Scattering Network for Signal Classification,” arXiv:1702.03345 [cs], Feb. 2017.
[6] L. Sifre and S. Mallat, “Rotation, Scaling and Deformation Invariant Scattering for Texture Discrimination,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013, pp. 1233–1240.
[7] Edouard Oyallon, Eugene Belilovsky, and Sergey Zagoruyko, “Scaling the Scattering Transform: Deep Hybrid Networks,” arXiv:1703.08961 [cs], Mar. 2017.
[8] Matthew D. Zeiler and Rob Fergus, “Visualizing and Understanding Convolutional Networks,” in Computer Vision – ECCV 2014, Sept. 2014, pp. 818–833.
[9] N. Kingsbury, “Complex wavelets for shift invariant analysis and filtering of signals,” Applied and Computational Harmonic Analysis, vol. 10, no. 3, pp. 234–253, May 2001.
[10] Karen Simonyan and Andrew Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv:1409.1556 [cs], Sept. 2014.
[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in NIPS, 2012, pp. 1097–1105, Curran Associates, Inc.