Scale-Equivariant Neural Networks with Decomposed Convolutional Filters
Abstract
Encoding the input scale information explicitly into the representation learned by a convolutional neural network (CNN) is beneficial for many vision tasks, especially when dealing with multiscale input signals. We study, in this paper, a scale-equivariant CNN architecture with joint convolutions across the space and the scaling group, which is shown to be both sufficient and necessary to achieve scale-equivariant representations. To reduce the model complexity and computational burden, we decompose the convolutional filters under two prefixed separable bases and truncate the expansion to low-frequency components. A further benefit of the truncated filter expansion is the improved deformation robustness of the equivariant representation. Numerical experiments demonstrate that the proposed scale-equivariant neural network with decomposed convolutional filters (ScDCFNet) achieves significantly improved performance in multiscale image classification and better interpretability than regular CNNs at a reduced model size.
1 Introduction
Convolutional neural networks (CNNs) have achieved great success in machine learning problems such as image classification [8], object detection [17], and semantic segmentation [10, 18]. Compared to fully-connected networks, CNNs, through spatial weight sharing, have the benefit of being translation-equivariant, i.e., translating the input leads to a translated version of the output. This property is crucial for many vision tasks such as image recognition and segmentation. However, regular CNNs are not equivariant to other important group transformations such as rescaling and rotation, and it is beneficial in some applications to also encode such group information explicitly into the network representation.
Several network architectures have been designed to achieve (2D) rotation-equivariance [3, 14, 25, 26, 31], and the feature maps of such networks typically include an extra index for the rotation group. Building on the idea of group convolutions proposed in [5] for discrete symmetry groups, the authors in [3] and [25] constructed rotation-equivariant CNNs by conducting group convolutions jointly across the space and the rotation group using steerable filters [6]. Scale-equivariant CNNs, on the other hand, have only been studied in a less general setting in the existing literature [7, 13, 28]. In particular, to the best of our knowledge, a joint convolution across the space and the scaling group has yet to be proposed to achieve scale-equivariance in the most general form. This is possibly because of two difficulties one encounters when dealing with the scaling group: first, unlike the rotation group, it is an acyclic and unbounded group; second, an extra index for the scaling group incurs a significant increase in model parameters and computational burden, which is further exacerbated by the lack of a counterpart of "steerable filters" for the scaling group. Moreover, since the scaling transformation is rarely perfect in practice (due to changing view angle or numerical discretization), one needs to quantify and promote the deformation robustness of the equivariant representation (i.e., is the model still "approximately" equivariant if the scaling transformation is "contaminated" by a nuisance input deformation), which, to the best of our knowledge, has yet to be studied in prior works.
The purpose of this paper is to address the aforementioned theoretical and practical issues in the construction of scale-equivariant CNN models. Specifically, our contribution is threefold:


We propose a general scale-equivariant CNN architecture with a joint convolution over the space and the scaling group, which is proved in Section 4 to be both sufficient and necessary to achieve scale-equivariance.

A truncated decomposition of the convolutional filters under a prefixed separable basis on the two geometric domains (space and scale) is used to reduce the model size and computational cost.

We prove the representation stability of the proposed architecture up to an equivariant scaling action on the input signal.
Our contribution to the family of group-equivariant CNNs is nontrivial; in particular, the scaling group, unlike the rotation group, is acyclic and non-compact. This poses challenges both in theory and in practice, so that many previous works on group-equivariant CNNs cannot be directly extended. We introduce new algorithm designs and mathematical techniques to obtain the first general scale-equivariant CNN in the literature with both computational efficiency and proved representation stability.
2 Related Work
Mixed-scale and scale-equivariant CNNs. Incorporating multiscale information into a CNN representation has been studied in many existing works. The Inception net [21] and its variants [20, 22] stack filters of different sizes in a single layer to address multiscale salient features. Building on this idea, the selective kernel network [9] utilizes a nonlinear learnable mechanism to aggregate information from multiple scales. Dilated convolutions [15, 23, 29, 30] have also been used to combine multiscale information without increasing the model complexity. Although the effectiveness of such models has been empirically demonstrated in various vision tasks, how they encode the input scale information remains difficult to interpret. Group-equivariant CNNs, on the other hand, explicitly encode the group information into the network representation. Cohen and Welling [5] proposed CNNs with group convolutions that are equivariant to several finite discrete symmetry groups. This idea is later generalized in [4] and applied mainly to the rotation groups [3, 24, 25]. Although scale-equivariant CNNs have also been proposed in the literature [7, 13, 28], they are typically studied in a less general setting. In particular, none of the previous works proposed to conduct joint convolutions over the space and the scaling group as a necessary and sufficient condition to impose scale-equivariance, and they are thus variants of a special case of our proposed architecture where the convolutional filters in the scale domain are Dirac delta functions (c.f. Remark 1.)
Representation stability to input deformations. Input deformations typically induce noticeable variabilities within object classes, some of which are uninformative for the vision tasks. Models that are stable to input deformations are thus favorable in many applications. The scattering transform [2, 11, 12] computes translation-invariant representations that are Lipschitz continuous to deformations by cascading predefined wavelet transforms and modulus poolings. A joint convolution over space and rotation is later adopted in [19] to build roto-translation scattering with stable rotation/translation-invariant representations. These models, however, use prefixed wavelet transforms in the networks, and are thus non-adaptive to the data. DCFNet [16] combines a prefixed filter basis and learnable expansion coefficients in a CNN architecture, achieving both data adaptivity and representation stability inherited from the filter regularity. This idea is later extended in [3] to produce rotation-equivariant representations that are Lipschitz continuous to input deformations modulo a global rotation, i.e., the model stays approximately equivariant even if the input rotation is imperfect. To the best of our knowledge, a theoretical analysis of the deformation robustness of a scale-equivariant CNN has yet to be conducted, and a direct generalization of the result in [3] is futile because the feature maps of a scale-equivariant CNN do not belong to the function space required by that analysis (c.f. Remark 2.)
3 Scale-Equivariant CNN and Filter Decomposition
Group-equivariance is the property of a mapping $f: X \to Y$ to commute with the group actions on the domain $X$ and codomain $Y$. More specifically, let $G$ be a group, and $T_g$, $T'_g$, respectively, be group actions on $X$ and $Y$. A function $f: X \to Y$ is said to be $G$-equivariant if $f(T_g x) = T'_g f(x)$ for all $g \in G$ and $x \in X$.
Invariance is thus a special case of equivariance where the group acts trivially on the codomain. For learning tasks where the feature is known a priori to change equivariantly to a group action on the input, e.g., image segmentation should be equivariant to translation, it is beneficial to reduce the hypothesis space to include only equivariant models. In this paper, we consider mainly the scaling-translation group. Given a scale-translation pair $(\beta, v)$ and an input image $x^{(0)}(u, \lambda)$ ($u$ is the spatial position, and $\lambda$ is the unstructured channel index, e.g., the RGB channels of a color image), the scaling-translation group action $D_{\beta, v}$ on the input is defined as
(1) $D_{\beta, v}\, x^{(0)}(u, \lambda) := x^{(0)}\big(2^{-\beta}(u - v),\, \lambda\big).$
Constructing scaleequivariant CNNs thus amounts to finding an architecture such that each trained network commutes with the group action on the input and a similarly defined group action (to be explained in Section 3.1) on the output.
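As a concrete illustration, the group action (1) can be sketched numerically. The convention of rescaling by $2^{-\beta}$ with nearest-neighbor sampling, and the function name below, are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def scale_translate(x, beta, v):
    """Apply a scaling-translation action to an image x(u):
    (D x)(u) = x(2**(-beta) * (u - v)), via inverse-map sampling.
    Nearest-neighbor sampling; pixels mapped outside the grid are zero."""
    h, w = x.shape
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            # inverse map: source coordinate u' = 2^{-beta} (u - v)
            si = int(round((i - v[0]) * 2.0 ** (-beta)))
            sj = int(round((j - v[1]) * 2.0 ** (-beta)))
            if 0 <= si < h and 0 <= sj < w:
                out[i, j] = x[si, sj]
    return out
```

With $\beta > 0$ the content is enlarged by a factor $2^{\beta}$; with $\beta = 0$ the action reduces to a pure translation by $v$.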
3.1 ScaleEquivariant CNNs
Inspired by [3] and [25], we consider scale-equivariant CNNs with an extra index for the scaling group: for each layer $l$, the $l$-th layer output is denoted as $x^{(l)}(u, \alpha, \lambda)$, where $u$ is the spatial position, $\alpha$ is the scale index, and $\lambda$ corresponds to the unstructured channels. We use the continuous model for formal derivation, i.e., the images and feature maps have continuous spatial and scale indices. In practice, the images are discretized on a Cartesian grid, and the scales are computed only on a discretized finite interval. Similar to [3], the group action $D_{\beta, v}$ on the $l$-th layer output is defined as a scaling-translation in space as well as a shift in the scale channel:
(2) $D_{\beta, v}\, x^{(l)}(u, \alpha, \lambda) := x^{(l)}\big(2^{-\beta}(u - v),\, \alpha - \beta,\, \lambda\big).$
A feedforward neural network is said to be scale-equivariant, i.e., equivariant to scaling-translations, if
(3) $x^{(l)}\big[D_{\beta, v}\, x^{(0)}\big] = D_{\beta, v}\, x^{(l)}\big[x^{(0)}\big], \quad \forall (\beta, v),$
where we slightly abuse the notation $x^{(l)}[x^{(0)}]$ to denote the $l$-th layer output given the input $x^{(0)}$. The following theorem shows that scale-equivariance is achieved if and only if joint convolutions are conducted over the space and the scaling group as in (4) and (5).
Theorem 1.
We defer the proof of Theorem 1, as well as those of other theorems, to the appendix.
Remark 1.
When the (joint) convolutional filter takes the special form of a Dirac delta function in the scale variable, the joint convolution (5) over the space and the scaling group reduces to only a (multiscale) spatial convolution,
i.e., the feature maps at different scales do not transfer information among each other (see Figure 1(a)). The previous works [7, 13, 28] on scale-equivariant CNNs are all based on this special case of Theorem 1.
Although the joint convolutions (5) over the space and the scaling group provide the most general way of imposing scale-equivariance, they unfortunately also incur a significant increase in the model size and computational burden. Following the idea of [3] and [16], we address this issue by taking a truncated decomposition of the convolutional filters under a prefixed separable basis, which will be discussed in detail in the next section.
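To make the joint convolution concrete, here is a minimal discretized sketch: feature maps carry a scale axis, and the filter has taps along both space and scale. The tensor shapes, the truncation of scale taps at the boundary, and the omission of the per-scale spatial rescaling of the filters (which the full construction requires for exact equivariance) are simplifying assumptions:

```python
import numpy as np

def joint_conv(x, w):
    """Sketch of a joint space-scale convolution.
    x: (S, C_in, H, W) feature map with a scale axis of length S.
    w: (A, C_out, C_in, k, k) filters with A taps along the scale axis.
    Returns: (S, C_out, H, W). Spatial convolution uses zero padding;
    scale taps falling off the end of the scale axis are dropped."""
    S, C_in, H, W = x.shape
    A, C_out, _, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)))
    out = np.zeros((S, C_out, H, W))
    for a in range(S):            # output scale channel
        for da in range(A):       # shift along the scale axis
            if a + da >= S:
                continue
            for co in range(C_out):
                for ci in range(C_in):
                    for i in range(H):
                        for j in range(W):
                            patch = xp[a + da, ci, i:i + k, j:j + k]
                            out[a, co, i, j] += np.sum(patch * w[da, co, ci])
    return out
```

The nested loops are written for readability, not speed; the special case A = 1 recovers the scale-decoupled architecture of Remark 1, in which no information flows between scale channels.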
3.2 Separable Basis Decomposition
We consider decomposing the convolutional filters under the product of two function bases, one in space and one in scale, which are the eigenfunctions of the Dirichlet Laplacian on, respectively, the unit disk and an interval in the scale domain, i.e.,
(6) 
In particular, the spatial basis satisfying (6) is the Fourier-Bessel (FB) basis [1]. In the continuous formulation, the spatial "pooling" operation is equivalent to rescaling the convolutional filters in space. We thus assume, without loss of generality, that the convolutional filters are compactly supported as follows
The filters can then be decomposed under the joint spatial and scale bases as
(7) 
where the expansion coefficients encode the filters under the joint bases. During training, the basis functions are fixed, and only the expansion coefficients are updated. In practice, we truncate the expansion to only the low-frequency coefficients (i.e., the coefficients are nonzero only for the leading spatial and scale basis functions), which are kept as the trainable parameters. This directly leads to a reduction of network parameters and computational burden. More specifically, let us compare the $l$-th convolutional layer (5) of a scale-equivariant CNN with and without truncated basis decomposition:
Number of trainable parameters: Suppose the filters are discretized on a spatial grid of size $L \times L$ with $L_s$ taps in scale. The number of trainable parameters at the $l$-th layer of a scale-equivariant CNN without basis decomposition is then proportional to $L^2 L_s$ per input-output channel pair. On the other hand, in an ScDCFNet with truncated basis expansion up to $K$ leading coefficients in space and $K_s$ coefficients in scale, the number of parameters per channel pair is instead $K K_s$. Hence a reduction to a factor of $K K_s / (L^2 L_s)$ in trainable parameters is achieved for ScDCFNet via truncated basis decomposition.
Computational cost: Suppose the input and output at the $l$-th layer share the same spatial dimension and number of scale channels, and have, respectively, $M_{l-1}$ and $M_l$ unstructured channels. Let the filters be discretized on a spatial grid of size $L \times L$ with $L_s$ taps in scale. The following theorem shows that, compared to a regular scale-equivariant CNN, the computational cost in a forward pass of ScDCFNet is reduced again to a factor of $K K_s / (L^2 L_s)$.
Theorem 2.
Assume that the number of output channels is much larger than the size of the convolutional filters in space and scale; then the computational cost of an ScDCFNet is reduced to a factor of $K K_s / (L^2 L_s)$ when compared to a scale-equivariant CNN without basis decomposition.
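The parameter bookkeeping above reduces to one line of arithmetic. Writing $L \times L$ for the spatial filter grid, $L_s$ for the number of scale taps, and $K$, $K_s$ for the kept spatial and scale coefficients (symbol names and the concrete numbers below are purely illustrative), a quick sanity check:

```python
def param_reduction(L, L_s, K, K_s, c_in, c_out):
    """Trainable parameters per layer, with and without truncation."""
    full = c_in * c_out * L * L * L_s     # raw joint filter tensor
    truncated = c_in * c_out * K * K_s    # expansion coefficients only
    return full, truncated, truncated / full

# e.g. 5x5 spatial filters, 3 scale taps, keeping 8 spatial and 2 scale modes
full, trunc, ratio = param_reduction(L=5, L_s=3, K=8, K_s=2, c_in=32, c_out=64)
```

The same ratio governs the forward-pass cost in the regime of Theorem 2, since the per-coefficient convolutions replace the per-filter-entry ones.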
Apart from reducing the model size and computational burden, similar to [3], truncating the filter decomposition has the further benefit of improving the deformation robustness of the equivariant representation, i.e., the equivariance relation (3) still approximately holds true if the spatial scaling of the input is contaminated by a local deformation (e.g., due to changing view angle or numerical discretization.) This will be addressed in detail in the next section.
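To make the decomposition (7) concrete before moving on: each filter is synthesized as a short sum of fixed separable basis functions weighted by the learned coefficients. The array shapes below are assumptions, and the basis tensors stand in for the precomputed Fourier-Bessel spatial basis and the scale basis:

```python
import numpy as np

def synthesize_filter(coeffs, spatial_basis, scale_basis):
    """Rebuild a joint filter from truncated expansion coefficients.
    coeffs:        (K, K_s)   learned expansion coefficients
    spatial_basis: (K, L, L)  fixed spatial basis functions psi_k
    scale_basis:   (K_s, L_s) fixed scale basis functions phi_m
    returns:       (L, L, L_s) filter  W = sum_{k,m} c[k,m] psi_k phi_m
    """
    # einsum contracts the basis indices k and m against the coefficients
    return np.einsum('km,kxy,ms->xys', coeffs, spatial_basis, scale_basis)
```

Only `coeffs` carries gradients during training; the two basis tensors are generated once and shared across layers of the same filter size.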
4 Representation Stability of ScDCFNet to Input Deformation
We study, in this section, the representation stability of ScDCFNet to input deformations modulo a global scale change, i.e., the input undergoes not only a scale change but also a small spatial distortion. To quantify the distance between different feature maps at each layer, we define the norm of the $l$-th layer output as
(8) 
Remark 2.
We next quantify the representation stability of ScDCFNet under three mild assumptions on the convolutional layers and input deformations. First,
(A1) The pointwise nonlinear activation is nonexpansive.
Next, we need a bound on the convolutional filters under certain norms. For each layer $l$, define the quantity $A_l$ as
(9) 
where the Fourier-Bessel (FB) norm of a coefficient sequence $(a_k)_k$ is the weighted $\ell^2$ norm $\|a\|_{\mathrm{FB}} = \big(\sum_k \mu_k a_k^2\big)^{1/2}$, where $\mu_k$ is the $k$-th eigenvalue of the Dirichlet Laplacian on the unit disk defined in (6). We next assume that each $A_l$ is bounded:
(A2) For all $l$, $A_l \le 1$.
The boundedness of $A_l$ is facilitated by truncating the basis decomposition to only low-frequency components (small $k$), which is one of the key ideas of ScDCFNet explained in Section 3.2. After a proper initialization of the trainable coefficients, (A2) can generally be satisfied. The assumption (A2) implies several bounds on the convolutional filters at each layer (c.f. Lemma 2 in the appendix), which, combined with (A1), guarantees that an ScDCFNet is layerwise non-expansive:
Proposition 1.
Under the assumptions (A1) and (A2), an ScDCFNet satisfies the following.


Let be the th layer output given a zero bottom-layer input, then depends only on .

Let be the centered version of after removing , i.e.,
then . As a result, .
Finally, we make an assumption on the input deformation modulo a global scale change. Given a displacement field $\tau: \mathbb{R}^2 \to \mathbb{R}^2$, the spatial deformation $D_\tau$ on the feature maps is defined as
(10) 
where $\rho(u) := u - \tau(u)$. We assume a small local deformation on the input:
(A3) $\|\nabla \tau\|_\infty$ is sufficiently small, where $\|\cdot\|$ denotes the operator norm.
The following theorem demonstrates the representation stability of an ScDCFNet to input deformation modulo a global scale change.
Theorem 3.
Theorem 3 gauges how approximately equivariant ScDCFNet is if the input undergoes not only a scale change but also a nonlinear spatial deformation, which is important both in theory and in practice because the scaling of an object is rarely perfect in reality.
5 Numerical Experiments
In this section, we conduct several numerical experiments for the following three purposes.

To verify that ScDCFNet indeed achieves scale equivariance (3).

To illustrate that ScDCFNet significantly outperforms regular CNNs at a much reduced model size in multiscale image classification.

To show that a trained ScDCFNet autoencoder is able to reconstruct rescaled versions of the input by simply applying group actions on the image codes, demonstrating that ScDCFNet indeed explicitly encodes the input scale information into the representation.
The experiments are tested on the Scaled MNIST (SMNIST) and Scaled Fashion-MNIST (SFashion) datasets, which are built by rescaling the original MNIST and Fashion-MNIST [27] images by a factor randomly sampled from a uniform distribution on a fixed interval. The rescaled images are then zero-padded to a common size. Where mentioned explicitly, the images in some experiments are resized to a larger resolution for better visualization.
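The dataset construction described above can be sketched as follows; the scale range and the padded canvas size used here are illustrative stand-ins for the actual dataset configuration:

```python
import numpy as np

def make_scaled_digit(img, rng, scale_range=(0.3, 1.0), out_size=40):
    """Rescale a square image by a random factor (nearest-neighbor
    resampling), then zero-pad it at the center of a fixed-size canvas."""
    s = rng.uniform(*scale_range)
    h = img.shape[0]
    new = max(1, int(round(h * s)))
    # nearest-neighbor resize via index sampling
    idx = np.clip((np.arange(new) / s).astype(int), 0, h - 1)
    small = img[np.ix_(idx, idx)]
    canvas = np.zeros((out_size, out_size), dtype=img.dtype)
    o = (out_size - new) // 2
    canvas[o:o + new, o:o + new] = small
    return canvas
```

Applying this once per training image with a shared random generator yields a dataset whose only class-irrelevant variability is the random scale, which is exactly the nuisance ScDCFNet is designed to absorb.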
Before going into the details of the numerical results, we need to clarify the implementation of the spatial pooling module of ScDCFNet. Given a feature map with a scale index, traditional average-pooling with the same spatial kernel size across all scale channels destroys scale-equivariance (3). To remedy this, we first convolve the feature with a scale-specific low-pass filter, e.g., a Gaussian kernel whose width is proportional to the scale of the channel, before downsampling the convolved signal on a coarser spatial grid determined by the pooling factor. We will refer to this as scale-equivariant average-pooling in what follows.
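A minimal sketch of this scale-equivariant average-pooling, assuming the low-pass filter is a Gaussian whose width grows like $2^\alpha$ with the scale index (the base width and truncation radius are illustrative choices):

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    ax = np.arange(-radius, radius + 1)
    g = np.exp(-ax**2 / (2 * sigma**2))
    return g / g.sum()

def scale_equivariant_avg_pool(x, stride=2, base_sigma=1.0):
    """x: (S, H, W) feature with a scale axis. Blur each scale channel with
    a Gaussian whose width grows like 2**alpha, then subsample spatially."""
    S, H, W = x.shape
    out = []
    for a in range(S):
        sigma = base_sigma * 2.0 ** a
        r = int(np.ceil(3 * sigma))
        g = gaussian_kernel(sigma, r)
        # separable blur with zero padding: rows first, then columns
        xp = np.pad(x[a], ((r, r), (r, r)))
        rows = np.array([np.convolve(row, g, mode='valid') for row in xp])
        cols = np.array([np.convolve(col, g, mode='valid') for col in rows.T]).T
        out.append(cols[::stride, ::stride])
    return np.stack(out)
```

Because each scale channel is smoothed at its own scale before subsampling, a shift along the scale axis commutes (up to discretization) with the pooling, which is what a uniform-kernel average-pooling fails to do.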
5.1 Verification of Scale Equivariance
We first verify that ScDCFNet indeed achieves scale-equivariance (3). Specifically, we compare the feature maps of a two-layer ScDCFNet with randomly generated truncated filter expansion coefficients and those of a regular CNN. The exact architectures are detailed in Appendix B.1. Figure 2 displays the first- and second-layer feature maps of an original image and its rescaled version using the two comparing architectures. Feature maps at different layers are rescaled to the same spatial dimension for visualization. The four images enclosed in each dashed rectangle correspond to: the $l$-th layer feature of the original input; the $l$-th layer feature of the rescaled input; the rescaled $l$-th layer feature of the original input (where the group action is understood as a purely spatial rescaling for a regular CNN due to the lack of a scale index); and the difference between the last two. It is clear that even with numerical discretization, which can be modeled as a form of input deformation, ScDCFNet is still approximately scale-equivariant, i.e., the feature of the rescaled input closely matches the rescaled feature of the original input, whereas a regular CNN has no such property.
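The comparison in this experiment boils down to a relative equivariance error $\|f(g(x)) - g(f(x))\| / \|g(f(x))\|$. The toy check below uses a pointwise nonlinearity, which commutes exactly with an idealized (integer-factor, nearest-neighbor) rescaling, so the error vanishes; for a trained network and fractional scale factors the error is only approximately zero:

```python
import numpy as np

def upsample2(x):
    """Nearest-neighbor 2x spatial upsampling (an exact, idealized rescaling)."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def equivariance_error(f, x, g):
    """Relative error || f(g(x)) - g(f(x)) || / || g(f(x)) ||."""
    a, b = f(g(x)), g(f(x))
    return np.linalg.norm(a - b) / max(np.linalg.norm(b), 1e-12)

# a pointwise map commutes exactly with the rescaling, so the error is zero
x = np.random.default_rng(0).random((8, 8))
err = equivariance_error(lambda z: z ** 2, x, upsample2)
```

Replacing `f` with a network forward pass and `g` with the action (2) (spatial rescaling plus a shift of the scale index) gives precisely the quantity visualized as the difference image in Figure 2.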
5.2 Multiscale Image Classification
Table 1: SMNIST and SFashion test accuracies (%) of regular CNNs and ScDCFNets of comparable architectures, with and without batch-normalization, at varying numbers of training samples, together with the ratio of trainable parameters of each model.
We next demonstrate the improved performance of ScDCFNet in multiscale image classification. The experiments are conducted on the SMNIST and SFashion datasets, and a regular CNN is used as a performance benchmark. Both networks are comprised of three convolutional layers, with the exact architectures (Table 2) detailed in Appendix B.2. Unlike the rotation group, the scaling group is unbounded, and we thus compute only the feature maps with the scale index restricted to a finite scale interval, which is discretized uniformly into a fixed number of scale channels. The performance of the comparing architectures with and without batch-normalization is shown in Table 1. It is clear that, by limiting the hypothesis space to scale-equivariant models and taking a truncated basis decomposition to reduce the model size, ScDCFNet achieves a significant improvement in classification accuracy with a reduced number of trainable parameters. The advantage of ScDCFNet is more pronounced when the number of training samples is small, suggesting that, by hardwiring the input scale information directly into its representation, ScDCFNet is less susceptible to overfitting the limited multiscale training data.
We also observe that even when a regular CNN is trained with data augmentation (random cropping and rescaling), its performance is still inferior to that of an ScDCFNet trained without any manipulation of the training data. In particular, the regular CNNs trained on 2000 SMNIST and SFashion images with data augmentation still underperform the ScDCFNets trained without data augmentation, which use only a fraction of the trainable parameters. Moreover, if ScDCFNet is itself trained with data augmentation, the accuracies improve further. This suggests that ScDCFNet can be combined with data augmentation for optimal performance in multiscale image classification.
5.3 Image Reconstruction
In the last experiment, we illustrate the ability of ScDCFNet to explicitly encode the input scale information into its representation. To achieve this, we train an ScDCFNet autoencoder on the SMNIST dataset with images resized to a larger resolution for better visualization. The encoder stacks two scale-equivariant convolutional blocks with average-pooling, and the decoder contains a succession of two transposed convolutional blocks with upsampling. A regular CNN autoencoder is also trained for comparison (see Table 3 in Appendix B.3 for the detailed architecture.)
Our goal is to demonstrate that the image code produced by the ScDCFNet autoencoder contains the scale information of the input, i.e., by applying the group action (2) to the code of a test image before feeding it to the decoder, we can reconstruct rescaled versions of the original input. This property can be visually verified in Figure 3. In contrast, a regular CNN autoencoder fails to do so.
6 Conclusion
We propose, in this paper, a scale-equivariant CNN with joint convolutions across the space and the scaling group, which we show to be both sufficient and necessary to impose a scale-equivariant network representation. To reduce the computational cost and model complexity incurred by the joint convolutions, the convolutional filters, supported jointly in space and scale, are decomposed under a separable basis across the two domains and truncated to only low-frequency components. Moreover, the truncated filter expansion also leads to improved deformation robustness of the equivariant representation, i.e., the model is still approximately equivariant even if the scaling transformation is imperfect. Experimental results suggest that ScDCFNet achieves improved performance in multiscale image classification with greater interpretability and a reduced model size compared to regular CNN models.
For future work, we will study the application of ScDCFNet in other more complicated vision tasks, such as object detection/localization and pose estimation, where it is beneficial to directly encode the input scale information into the deep representation. Moreover, the memory usage of our current implementation of ScDCFNet scales linearly with the number of truncated basis functions in order to realize the reduced computational burden explained in Theorem 2. We will explore other efficient implementations of the model, e.g., filter-bank techniques to compute convolutions with multiscale spatial filters, to significantly reduce both the computational cost and memory usage.
References
 [1] (1965) Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables. Vol. 55, Courier Corporation.
 [2] (2013) Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1872–1886.
 [3] (2019) RotDCF: decomposition of convolutional filters for rotation-equivariant deep networks. In International Conference on Learning Representations.
 [4] (2018) A general theory of equivariant CNNs on homogeneous spaces. arXiv preprint arXiv:1811.02017.
 [5] (2016) Group equivariant convolutional networks. In International Conference on Machine Learning, pp. 2990–2999.
 [6] (1991) The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 13 (9), pp. 891–906.
 [7] (2014) Locally scale-invariant convolutional neural networks. arXiv preprint arXiv:1412.5104.
 [8] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
 [9] (2019) Selective kernel networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
 [10] (2015) Fully convolutional networks for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [11] (2010) Recursive interferometric representation. In Proc. of EUSIPCO Conference, Denmark.
 [12] (2012) Group invariant scattering. Communications on Pure and Applied Mathematics 65 (10), pp. 1331–1398.
 [13] (2018) Scale equivariance in CNNs with vector fields. arXiv preprint arXiv:1807.11783.
 [14] (2017) Rotation equivariant vector field networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5048–5057.
 [15] (2018) A mixed-scale dense convolutional neural network for image analysis. Proceedings of the National Academy of Sciences 115 (2), pp. 254–259.
 [16] (2018) DCFNet: deep neural network with decomposed convolutional filters. In Proceedings of the 35th International Conference on Machine Learning, PMLR 80, pp. 4198–4207.
 [17] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28, pp. 91–99.
 [18] (2015) U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pp. 234–241.
 [19] (2013) Rotation, scaling and deformation invariant scattering for texture discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1233–1240.
 [20] (2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence.
 [21] (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
 [22] (2016) Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
 [23] (2018) Understanding convolution for semantic segmentation. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1451–1460.
 [24] (2018) 3D steerable CNNs: learning rotationally equivariant features in volumetric data. In Advances in Neural Information Processing Systems, pp. 10381–10392.
 [25] (2018) Learning steerable filters for rotation equivariant CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 849–858.
 [26] (2017) Harmonic networks: deep translation and rotation equivariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5028–5037.
 [27] (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
 [28] (2014) Scale-invariant convolutional neural networks. arXiv preprint arXiv:1411.6369.
 [29] (2017) Dilated residual networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [30] (2016) Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations.
 [31] (2017) Oriented response networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 519–528.
Appendix A Proofs
A.1 Proof of Theorem 1
Proof of Theorem 1.
We note first that (3) holds true if and only if the following is valid for all $l$:
(12) 
where $x^{(0)}[x^{(0)}]$ is understood as $x^{(0)}$. We also note that the layerwise operations of a general feedforward neural network with an extra scale index can be written as
(13) 
and, for ,
(14) 
When , we have
and
Therefore .
To prove the necessary part: when , we have
and
Hence for (12) to hold when , we need
(15) 
Keeping fixed while changing in (15), we obtain that does not depend on the third variable . Thus . Define as
Then, for any given , setting in (15) leads to
For , a similar argument leads to