Scale-Equivariant Neural Networks with Decomposed Convolutional Filters
Encoding the input scale information explicitly into the representation learned by a convolutional neural network (CNN) is beneficial for many vision tasks especially when dealing with multiscale input signals. We study, in this paper, a scale-equivariant CNN architecture with joint convolutions across the space and the scaling group, which is shown to be both sufficient and necessary to achieve scale-equivariant representations. To reduce the model complexity and computational burden, we decompose the convolutional filters under two pre-fixed separable bases and truncate the expansion to low-frequency components. A further benefit of the truncated filter expansion is the improved deformation robustness of the equivariant representation. Numerical experiments demonstrate that the proposed scale-equivariant neural network with decomposed convolutional filters (ScDCFNet) achieves significantly improved performance in multiscale image classification and better interpretability than regular CNNs at a reduced model size.
1 Introduction
Convolutional neural networks (CNNs) have achieved great success in machine learning problems such as image classification [8], object detection [17], and semantic segmentation [10, 18]. Compared to fully-connected networks, CNNs have the benefit, through spatial weight sharing, of being translation-equivariant, i.e., translating the input leads to a translated version of the output. This property is crucial for many vision tasks such as image recognition and segmentation. However, regular CNNs are not equivariant to other important group transformations such as rescaling and rotation, and it is beneficial in some applications to also encode such group information explicitly into the network representation.
Several network architectures have been designed to achieve (2D) rotation-equivariance [3, 14, 25, 26, 31], and the feature maps of such networks typically include an extra index for the rotation group $SO(2)$. Building on the idea of group convolutions proposed in [5] for discrete symmetry groups, the authors in [3] and [25] constructed rotation-equivariant CNNs by conducting group convolutions jointly across the space and $SO(2)$ using steerable filters [6]. Scale-equivariant CNNs, on the other hand, have only been studied in a less general setting in the existing literature [7, 13, 28]. In particular, to the best of our knowledge, a joint convolution across the space and the scaling group has yet to be proposed to achieve scale-equivariance in the most general form. This is possibly because of two difficulties one encounters when dealing with the scaling group: first, unlike $SO(2)$, it is an acyclic and unbounded group; second, an extra index in the scaling group incurs a significant increase in model parameters and computational burden, which is further exacerbated by the lack of a counterpart of “steerable filters” for the scaling group. Moreover, since the scaling transformation is rarely perfect in practice (due to changing view angle or numerical discretization), one needs to quantify and promote the deformation robustness of the equivariant representation (i.e., is the model still “approximately” equivariant if the scaling transformation is “contaminated” by a nuisance input deformation), which, to the best of our knowledge, has yet to be studied in prior works.
The purpose of this paper is to address the aforementioned theoretical and practical issues in the construction of scale-equivariant CNN models. Specifically, our contribution is three-fold:
We propose a general scale-equivariant CNN architecture with a joint convolution over the space and the scaling group, which is proved in Section 3.1 to be both sufficient and necessary to achieve scale-equivariance.
A truncated decomposition of the convolutional filters under a pre-fixed separable basis on the two geometric domains (the space and the scaling group) is used to reduce the model size and computational cost.
We prove the representation stability of the proposed architecture up to equivariant scaling action of the input signal.
Our contribution to the family of group-equivariant CNNs is non-trivial; in particular, the scaling group, unlike the rotation group, is acyclic and non-compact. This poses challenges both in theory and in practice, so that many previous works on group-equivariant CNNs cannot be directly extended. We introduce new algorithmic designs and mathematical techniques to obtain the first general scale-equivariant CNN in the literature with both computational efficiency and proven representation stability.
2 Related Work
Mixed-scale and scale-equivariant CNNs. Incorporating multiscale information into a CNN representation has been studied in many existing works. The Inception net [21] and its variants [20, 22] stack filters of different sizes in a single layer to address the multiscale salient features. Building on this idea, the selective kernel network [9] utilizes a nonlinear learnable mechanism to aggregate information from multiple scales. Dilated convolutions [15, 23, 29, 30] have also been used to combine multiscale information without increasing the model complexity. Although the effectiveness of such models has been empirically demonstrated in various vision tasks, their ability to encode the input scale information still lacks interpretability. Group-equivariant CNNs, on the other hand, explicitly encode the group information into the network representation. Cohen and Welling [5] proposed CNNs with group convolutions that are equivariant to several finite discrete symmetry groups. This idea was later generalized in [4] and applied mainly to the rotation groups $SO(2)$ and $SO(3)$ [3, 24, 25]. Although scale-equivariant CNNs have also been proposed in the literature [7, 13, 28], they are typically studied in a less general setting. In particular, none of the previous works proposed to conduct joint convolutions over the space and the scaling group as a necessary and sufficient condition to impose scale-equivariance; they are thus variants of a special case of our proposed architecture in which the convolutional filters are Dirac delta functions in the scale variable (cf. Remark 1).
Representation stability to input deformations. Input deformations typically induce noticeable variabilities within object classes, some of which are uninformative for the vision tasks. Models that are stable to input deformations are thus favorable in many applications. The scattering transform [2, 11, 12] computes translation-invariant representations that are Lipschitz continuous to deformations by cascading predefined wavelet transforms and modulus poolings. A joint convolution over the space and the rotation group was later adopted in [19] to build roto-translation scattering with stable rotation/translation-invariant representations. These models, however, use pre-fixed wavelet transforms in the networks, and are thus nonadaptive to the data. DCFNet [16] combines a pre-fixed filter basis and learnable expansion coefficients in a CNN architecture, achieving both data adaptivity and representation stability inherited from the filter regularity. This idea was later extended in [3] to produce rotation-equivariant representations that are Lipschitz continuous in $L^2$ norm to input deformations modulo a global rotation, i.e., the model stays approximately equivariant even if the input rotation is imperfect. To the best of our knowledge, a theoretical analysis of the deformation robustness of a scale-equivariant CNN has yet to be conducted, and a direct generalization of the result in [3] is futile because the feature maps of a scale-equivariant CNN are typically not in $L^2$ (cf. Remark 2).
3 Scale-Equivariant CNN and Filter Decomposition
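Group-equivariance is the property of a mapping $f: X \to Y$ to commute with the group actions on the domain $X$ and codomain $Y$. More specifically, let $G$ be a group, and $D_g$, $T_g$, respectively, be group actions on $X$ and $Y$. A function $f: X \to Y$ is said to be $G$-equivariant if
$$f(D_g x) = T_g(f(x)), \quad \forall\, g \in G,\ x \in X. \tag{1}$$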
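$G$-invariance is thus a special case of $G$-equivariance where $T_g$ is the identity map on $Y$. For learning tasks where the feature $y \in Y$ is known a priori to change equivariantly to a group action $g \in G$ on the input $x \in X$, e.g., image segmentation should be equivariant to translation, it would be beneficial to reduce the hypothesis space to include only $G$-equivariant models. In this paper, we consider mainly the scaling-translation group $\mathcal{ST} \cong \mathcal{S} \times \mathbb{R}^2$, where $\mathcal{S}$ denotes the scaling group. Given $g = (\beta, v) \in \mathcal{ST}$ and an input image $x^{(0)}(u, \lambda)$ ($u \in \mathbb{R}^2$ is the spatial position, and $\lambda$ is the unstructured channel index, e.g., RGB channels of a color image), the scaling-translation group action $D_g = D_{\beta, v}$ on $x^{(0)}$ is defined as
$$D_{\beta, v}\, x^{(0)}(u, \lambda) = x^{(0)}\left(2^{-\beta}(u - v), \lambda\right). \tag{2}$$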
Constructing scale-equivariant CNNs thus amounts to finding an architecture such that each trained network commutes with the group action on the input and a similarly defined group action (to be explained in Section 3.1) on the output.
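As a concrete illustration of a scaling-translation action on a discretized image, the following is a minimal sketch rather than the implementation used in this paper. It assumes the convention that the transformed image at position $u$ equals the original image at $2^{-\beta}(u - v)$, with coordinates centered at the middle of the grid, nearest-neighbor resampling, and zero fill outside the original support. The function name `scale_translate` is hypothetical.

```python
import numpy as np

def scale_translate(image, beta, v):
    """Resample a 2D image so that out(u) = image(2**(-beta) * (u - v)).

    Assumptions of this sketch: coordinates are centered at the grid center,
    nearest-neighbor interpolation is used, and samples that fall outside
    the original image are set to zero.
    """
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Source coordinates in the original image for every output pixel.
    src_y = 2.0 ** (-beta) * (ys - cy - v[0]) + cy
    src_x = 2.0 ** (-beta) * (xs - cx - v[1]) + cx
    iy = np.rint(src_y).astype(int)
    ix = np.rint(src_x).astype(int)
    inside = (iy >= 0) & (iy < h) & (ix >= 0) & (ix < w)
    out = np.zeros_like(image)
    out[inside] = image[iy[inside], ix[inside]]
    return out

img = np.arange(25, dtype=float).reshape(5, 5)
shifted = scale_translate(img, beta=0.0, v=(1, 0))   # pure translation by one row
enlarged = scale_translate(img, beta=1.0, v=(0, 0))  # enlargement by a factor of 2
```

With `beta = 0` the action reduces to a translation, and with `v = (0, 0)` a positive `beta` enlarges the image about its center.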
3.1 Scale-Equivariant CNNs
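Inspired by [3] and [25], we consider scale-equivariant CNNs with an extra index $\alpha$ for the scaling group $\mathcal{S}$: for each $l \geq 1$, the $l$-th layer output is denoted as $x^{(l)}(u, \alpha, \lambda)$, where $u \in \mathbb{R}^2$ is the spatial position, $\alpha \in \mathcal{S}$ is the scale index, and $\lambda$ corresponds to the unstructured channels. We use the continuous model for formal derivation, i.e., the images and feature maps have continuous spatial and scale indices. In practice, the images are discretized on a Cartesian grid, and the scales are computed only on a discretized finite interval. Similar to [3], for a scale change $\beta$ and a translation $v$, the group action $T_{\beta, v}$ on the $l$-th layer output $x^{(l)}$ is defined as a scaling-translation in space as well as a shift in the scale channel:
$$T_{\beta, v}\, x^{(l)}(u, \alpha, \lambda) = x^{(l)}\left(2^{-\beta}(u - v), \alpha - \beta, \lambda\right), \quad l \geq 1.$$
A feedforward neural network is said to be scale-equivariant, i.e., equivariant to $\mathcal{ST}$, if
$$x^{(l)}[D_{\beta, v}\, x^{(0)}] = T_{\beta, v}\, x^{(l)}[x^{(0)}], \quad \forall\, l \geq 1, \tag{3}$$
where $D_{\beta, v}$ is the scaling-translation action (2) on the input, and we slightly abuse the notation $x^{(l)}[x^{(0)}]$ to denote the $l$-th layer output given the input $x^{(0)}$. The following theorem shows that scale-equivariance is achieved if and only if joint convolutions are conducted over the space and the scaling group as in (4) and (5).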
We defer the proof of Theorem 1, as well as those of other theorems, to the appendix.
When the (joint) convolutional filter takes the special form of a spatial filter multiplied by a Dirac delta function in the scale variable, the joint convolution (5) over the space and the scaling group reduces to only a (multiscale) spatial convolution
i.e., the feature maps at different scales do not transfer information among each other (see Figure 1(a)). The previous works [7, 13, 28] on scale-equivariant CNNs are all based on this special case of Theorem 1.
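The equivariance of a multiscale spatial convolution can be checked numerically. The sketch below is a one-dimensional analogue under stated assumptions, not the paper's construction: the first layer is taken to be $x^{(1)}(u, \alpha) = \int x^{(0)}(u + u')\, W(2^{-\alpha} u')\, 2^{-\alpha}\, du'$ without bias or nonlinearity, the input is rescaled as $x^{(0)}(2^{-\beta} u)$, and the signal `x0` and filter `W` are hypothetical Gaussian-type functions chosen so that a plain Riemann sum is accurate.

```python
import numpy as np

def x0(u):
    """Hypothetical smooth input signal."""
    return np.exp(-u ** 2)

def W(u):
    """Hypothetical smooth filter (odd, rapidly decaying)."""
    return u * np.exp(-u ** 2)

# Quadrature grid for the integration variable u'.
grid = np.linspace(-30.0, 30.0, 60001)
du = grid[1] - grid[0]

def first_layer(signal, u, alpha):
    """1D multiscale convolution: integral of signal(u+u') W(2^-alpha u') 2^-alpha du'."""
    integrand = signal(u + grid) * W(2.0 ** (-alpha) * grid) * 2.0 ** (-alpha)
    return np.sum(integrand) * du

beta = 0.7

def scaled_x0(u):
    """Rescaled input: x0 evaluated at 2^-beta * u."""
    return x0(2.0 ** (-beta) * u)

u, alpha = 0.4, 1.2
lhs = first_layer(scaled_x0, u, alpha)                       # feature of the rescaled input
rhs = first_layer(x0, 2.0 ** (-beta) * u, alpha - beta)      # rescaled and scale-shifted feature
print(abs(lhs - rhs))
```

The two quantities agree up to quadrature error, which follows from the change of variables $u'' = 2^{-\beta} u'$ in the integral.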
Although the joint convolutions (5) on the space and the scaling group provide the most general way of imposing scale-equivariance, they unfortunately also incur a significant increase in the model size and computational burden. Following the idea of [3] and [16], we address this issue by taking a truncated decomposition of the convolutional filters under a pre-fixed separable basis, which will be discussed in detail in the next section.
3.2 Separable Basis Decomposition
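We consider decomposing the convolutional filters under the product of two function bases, $\{\psi_k(u)\}_k$ and $\{\varphi_m(\alpha)\}_m$, which are the eigenfunctions of the Dirichlet Laplacian on, respectively, the unit disk $D \subset \mathbb{R}^2$ and the interval $[-1, 1]$, i.e.,
$$\Delta \psi_k = -\mu_k \psi_k \ \text{in } D, \quad \psi_k = 0 \ \text{on } \partial D; \qquad \varphi_m'' = -\nu_m \varphi_m \ \text{in } [-1, 1], \quad \varphi_m(\pm 1) = 0. \tag{6}$$
In particular, the spatial basis $\{\psi_k\}_k$ satisfying (6) is the Fourier-Bessel (FB) basis [1]. In the continuous formulation, the spatial “pooling” operation is equivalent to rescaling the convolutional filters in space. We thus assume, without loss of generality, that the convolutional filters are compactly supported as follows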
Let , then can be decomposed under and as
where the weights in the sums are the expansion coefficients of the filters under the joint bases. During training, the basis functions are fixed, and only the expansion coefficients are updated. In practice, we truncate the expansion to only low-frequency coefficients (i.e., the coefficients are non-zero only for the leading basis functions in space and in scale), which are kept as the trainable parameters. This directly leads to a reduction in the number of network parameters and in the computational burden. More specifically, let us compare the $l$-th convolutional layer (5) of a scale-equivariant CNN with and without truncated basis decomposition:
Number of trainable parameters: Suppose the filters are discretized on a Cartesian grid of size . The number of trainable parameters at the -th layer of a scale-equivariant CNN without basis decomposition is . On the other hand, in an ScDCFNet with truncated basis expansion up to leading coefficients for and coefficients for , the number of parameters is instead . Hence a reduction to a factor of in trainable parameters is achieved for ScDCFNet via truncated basis decomposition. In particular, if , and , then the number of parameters is reduced to .
Computational cost: Suppose the size of the input and output at the -th layer are, respectively, and , where is the spatial dimension, is the number of scale channels, and () is the number of the unstructured input (output) channels. Let the filters be discretized on a Cartesian grid of size . The following theorem shows that, compared to a regular scale-equivariant CNN, the computational cost in a forward pass of ScDCFNet is reduced again to a factor of .
Assume , i.e., the number of the output channels is much larger than the size of the convolutional filters in and , then the computational cost of an ScDCFNet is reduced to a factor of when compared to a scale-equivariant CNN without basis decomposition.
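The parameter comparison above can be made concrete with a small calculation. This is a sketch under assumptions: a layer without decomposition is counted as (filter grid size) × (input channels) × (output channels), a decomposed layer as (number of retained coefficients) × (input channels) × (output channels), and biases are ignored. All sizes below (`L`, `L_alpha`, `K`, `K_alpha`, `m_in`, `m_out`) are hypothetical values chosen only for illustration.

```python
def parameter_counts(L, L_alpha, K, K_alpha, m_in, m_out):
    """Count trainable filter parameters of one joint convolutional layer.

    L, L_alpha : filter grid size in space (L x L) and in scale (L_alpha)
    K, K_alpha : number of retained basis coefficients in space and in scale
    m_in, m_out: number of unstructured input and output channels
    """
    full = L * L * L_alpha * m_in * m_out      # every filter tap is trainable
    decomposed = K * K_alpha * m_in * m_out    # only expansion coefficients are trainable
    return full, decomposed, decomposed / full

# Hypothetical layer sizes, for illustration only.
full, decomposed, ratio = parameter_counts(L=5, L_alpha=5, K=8, K_alpha=3, m_in=16, m_out=32)
print(full, decomposed, ratio)
```

Under this counting the ratio depends only on the filter grid size and the truncation orders, not on the channel widths.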
Apart from reducing the model size and computational burden, truncating the filter decomposition has, similar to [3], the further benefit of improving the deformation robustness of the equivariant representation, i.e., the equivariance relation (3) still approximately holds true if the spatial scaling of the input is contaminated by a local deformation (e.g., due to changing view angle or numerical discretization). This will be addressed in detail in the next section.
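The truncated separable decomposition described in this section can be sketched as follows. To keep the example dependency-free, the spatial basis here is a stand-in: products of sines on the square $[-1, 1]^2$ (Dirichlet eigenfunctions of the square) replace the Fourier-Bessel basis on the unit disk. The scale basis uses sines on $[-1, 1]$. The grid sizes, truncation orders, and names (`dirichlet_sines`, `synthesize_filter`) are assumptions of this sketch.

```python
import numpy as np

def dirichlet_sines(num_modes, grid):
    """First num_modes Dirichlet eigenfunctions sin(m*pi*(t+1)/2) on [-1, 1]."""
    m = np.arange(1, num_modes + 1)[:, None]
    return np.sin(m * np.pi * (grid[None, :] + 1.0) / 2.0)

def synthesize_filter(coeffs, psi, phi):
    """Joint filter W(u, alpha) = sum_{k,m} coeffs[k, m] * psi[k](u) * phi[m](alpha)."""
    return np.einsum("km,kij,mt->ijt", coeffs, psi, phi)

L, L_alpha = 5, 5        # hypothetical filter grid sizes in space and scale
K, K_alpha = 8, 3        # hypothetical truncation orders
u = np.linspace(-1.0, 1.0, L + 2)[1:-1]         # interior grid points in space
a = np.linspace(-1.0, 1.0, L_alpha + 2)[1:-1]   # interior grid points in scale

# Stand-in spatial basis: sine products ordered by eigenvalue p^2 + q^2.
s = dirichlet_sines(3, u)
pairs = sorted(((p, q) for p in range(1, 4) for q in range(1, 4)),
               key=lambda pq: (pq[0] ** 2 + pq[1] ** 2, pq))[:K]
psi = np.stack([np.outer(s[p - 1], s[q - 1]) for p, q in pairs])  # shape (K, L, L)
phi = dirichlet_sines(K_alpha, a)                                 # shape (K_alpha, L_alpha)

rng = np.random.default_rng(0)
coeffs = rng.standard_normal((K, K_alpha))   # the only trainable quantities
W = synthesize_filter(coeffs, psi, phi)      # shape (L, L, L_alpha)
```

Only `coeffs` would be updated during training; `psi` and `phi` stay fixed, so every synthesized filter is a combination of low-frequency modes.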
4 Representation Stability of ScDCFNet to Input Deformation
We study, in this section, the representation stability of ScDCFNet to input deformations modulo a global scale change, i.e., the input undergoes not only a scale change but also a small spatial distortion. To quantify the distance between different feature maps at each layer, we define the norm of as
We next quantify the representation stability of ScDCFNet under three mild assumptions on the convolutional layers and input deformations. First,
(A1) The pointwise nonlinear activation is non-expansive.
Next, we need a bound on the convolutional filters under certain norms. For each , define as
where the Fourier-Bessel (FB) norm of a sequence is a weighted norm defined as , where is the -th eigenvalue of the Dirichlet Laplacian on the unit disk defined in (6). We next assume that each is bounded:
(A2) For all , .
The boundedness in (A2) is facilitated by truncating the basis decomposition to only low-frequency components (small truncation orders), which is one of the key ideas of ScDCFNet explained in Section 3.2. After a proper initialization of the trainable coefficients, (A2) can generally be satisfied. The assumption (A2) implies several bounds on the convolutional filters at each layer (cf. Lemma 2 in the appendix), which, combined with (A1), guarantee that an ScDCFNet is layerwise non-expansive:
Under the assumptions (A1) and (A2), an ScDCFNet satisfies the following.
Let be the -th layer output given a zero bottom-layer input, then depends only on .
Let be the centered version of after removing , i.e.,
then . As a result, .
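To make the weighted norm behind assumption (A2) more tangible, the sketch below evaluates an eigenvalue-weighted norm of a coefficient sequence. It assumes the form $\|a\|_{FB} = (\sum_k \mu_k a_k^2)^{1/2}$, where $\mu_k$ are Dirichlet eigenvalues of the unit disk, i.e., squared zeros of Bessel functions. The tabulated zeros are standard values; the function name `fb_norm` and the ordering of the modes are assumptions of this sketch.

```python
import numpy as np

# Leading zeros of the Bessel functions J_0, J_1, J_2 (standard tabulated values).
# The zeros of J_1 and J_2 appear twice because each has a cosine and a sine mode.
BESSEL_ZEROS = np.array([2.404826, 3.831706, 3.831706, 5.135622, 5.135622, 5.520078])
MU = BESSEL_ZEROS ** 2   # Dirichlet Laplacian eigenvalues on the unit disk

def fb_norm(coeffs, mu=MU):
    """Eigenvalue-weighted l2 norm: sqrt(sum_k mu_k * a_k^2) (assumed form)."""
    coeffs = np.asarray(coeffs, dtype=float)
    return float(np.sqrt(np.sum(mu[: coeffs.size] * coeffs ** 2)))

low_freq = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]    # unit energy in the lowest mode
high_freq = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]   # unit energy in a higher mode
print(fb_norm(low_freq), fb_norm(high_freq))
```

The same coefficient energy is penalized more when it sits in a higher-frequency mode, which is why truncating to low-frequency components helps keep such a norm bounded.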
Finally, we make an assumption on the input deformation modulo a global scale change. Given a function , the spatial deformation on the feature maps is defined as
where . We assume a small local deformation on the input:
(A3) , where is the operator norm.
The following theorem demonstrates the representation stability of an ScDCFNet to input deformation modulo a global scale change.
Theorem 3 quantifies how close ScDCFNet remains to being equivariant when the input undergoes not only a scale change but also a nonlinear spatial deformation, which is important both in theory and in practice because the scaling of an object is rarely perfect in reality.
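The size of a deformation in the sense of (A3) can be estimated numerically. The sketch below assumes the deformation is given as a displacement field sampled on a regular grid and computes the largest operator norm of its Jacobian over the grid with finite differences. The function name `max_jacobian_norm` and the example field are hypothetical, and the threshold in (A3) is not reproduced here.

```python
import numpy as np

def max_jacobian_norm(tau, spacing):
    """Largest spectral norm of the Jacobian of a 2D displacement field.

    tau     : array of shape (2, H, W), the two components of the displacement
    spacing : grid spacing used for the finite-difference derivatives
    """
    d0_dy, d0_dx = np.gradient(tau[0], spacing)
    d1_dy, d1_dx = np.gradient(tau[1], spacing)
    # Jacobian at every grid point, shape (H, W, 2, 2).
    jac = np.stack([np.stack([d0_dy, d0_dx], axis=-1),
                    np.stack([d1_dy, d1_dx], axis=-1)], axis=-2)
    # Singular values are returned in descending order; take the largest.
    return float(np.linalg.svd(jac, compute_uv=False)[..., 0].max())

ys, xs = np.meshgrid(np.linspace(0.0, 1.0, 11), np.linspace(0.0, 1.0, 11), indexing="ij")
tau = np.stack([0.1 * ys, np.zeros_like(xs)])   # hypothetical mild stretch along one axis
print(max_jacobian_norm(tau, spacing=0.1))
```

For this linear example the Jacobian is constant, so the returned value equals the stretch coefficient.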
5 Numerical Experiments
In this section, we conduct several numerical experiments for the following three purposes.
To verify that ScDCFNet indeed achieves scale equivariance (3).
To illustrate that ScDCFNet significantly outperforms regular CNNs at a much reduced model size in multiscale image classification.
To show that a trained ScDCFNet auto-encoder is able to reconstruct rescaled versions of the input by simply applying group actions on the image codes, demonstrating that ScDCFNet indeed explicitly encodes the input scale information into the representation.
The experiments are conducted on the Scaled MNIST (SMNIST) and Scaled Fashion-MNIST (SFashion) datasets, which are built by rescaling the original MNIST and Fashion-MNIST [27] images by a factor randomly sampled from a uniform distribution. The rescaled images are then zero-padded to a fixed size. Where explicitly mentioned, the images are further resized for better visualization in some experiments.
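The dataset construction described above can be sketched as follows. This is an illustration under assumptions, not the exact preprocessing used for SMNIST and SFashion: nearest-neighbor downscaling is used for brevity, the image is centered on the zero canvas, and the sampling range `(lo, hi)` and `out_size` are hypothetical parameters.

```python
import numpy as np

def rescale_and_pad(image, factor, out_size):
    """Downscale a 2D image by `factor` (nearest neighbor) and zero-pad to out_size."""
    h, w = image.shape
    new_h, new_w = max(1, round(h * factor)), max(1, round(w * factor))
    if new_h > out_size or new_w > out_size:
        raise ValueError("rescaled image does not fit in the output canvas")
    # Nearest-neighbor source indices for each pixel of the rescaled image.
    rows = np.minimum((np.arange(new_h) / factor).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / factor).astype(int), w - 1)
    small = image[np.ix_(rows, cols)]
    out = np.zeros((out_size, out_size), dtype=image.dtype)
    top, left = (out_size - new_h) // 2, (out_size - new_w) // 2
    out[top:top + new_h, left:left + new_w] = small
    return out

rng = np.random.default_rng(0)
lo, hi = 0.5, 1.0                       # hypothetical range of the scaling factor
digit = np.ones((28, 28))               # placeholder for an input image
sample = rescale_and_pad(digit, rng.uniform(lo, hi), out_size=28)
half = rescale_and_pad(digit, 0.5, out_size=28)
```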
Before going into the details of the numerical results, we need to clarify the implementation of the spatial pooling module of ScDCFNet. Given a feature map $x^{(l)}(u, \alpha, \lambda)$, the traditional average-pooling in space with the same spatial kernel size across all scales $\alpha$ destroys scale equivariance (3). To remedy this, we first convolve $x^{(l)}$ with a scale-specific low-pass filter before downsampling the convolved signal on a coarser spatial grid. Specifically, we have $\tilde{x}^{(l)}(u, \alpha, \lambda) = \int_{\mathbb{R}^2} x^{(l)}(Nu + u', \alpha, \lambda)\, \eta(2^{-\alpha} u')\, 2^{-2\alpha}\, du'$, where $\tilde{x}^{(l)}$ is the feature after pooling, $\eta$ is a low-pass filter, e.g., a Gaussian kernel, and $N$ is the pooling factor. We will refer to this as scale-equivariant average-pooling in what follows.
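The pooling scheme just described can be sketched in one spatial dimension. The code assumes a Gaussian low-pass filter whose width grows proportionally to $2^{\alpha}$ in each scale channel, followed by subsampling with a common factor. The names `gaussian_kernel` and `scale_equivariant_avg_pool`, the base width, and the test signal are hypothetical.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized discrete Gaussian kernel truncated at three standard deviations."""
    radius = int(np.ceil(3 * sigma))
    t = np.arange(-radius, radius + 1)
    k = np.exp(-t ** 2 / (2.0 * sigma ** 2))
    return k / k.sum()

def scale_equivariant_avg_pool(x, alphas, factor, base_sigma=1.0):
    """Low-pass filter each scale channel with a scale-specific width, then subsample.

    x      : array of shape (num_scales, length), one row per scale channel
    alphas : scale index of each row; the filter width is base_sigma * 2**alpha
    factor : spatial subsampling (pooling) factor
    """
    pooled = []
    for row, alpha in zip(x, alphas):
        smooth = np.convolve(row, gaussian_kernel(base_sigma * 2.0 ** alpha), mode="same")
        pooled.append(smooth[::factor])
    return np.stack(pooled)

x = np.ones((3, 64))                       # constant test signal in three scale channels
out = scale_equivariant_avg_pool(x, alphas=[0.0, 0.5, 1.0], factor=2)
```

Using one fixed kernel for all rows would correspond to the traditional pooling that breaks the equivariance relation; here the kernel width follows the scale index instead.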
5.1 Verification of Scale Equivariance
We first verify that ScDCFNet indeed achieves scale-equivariance (3). Specifically, we compare the feature maps of a two-layer ScDCFNet with randomly generated truncated filter expansion coefficients and those of a regular CNN. The exact architectures are detailed in Appendix B.1. Figure 2 displays the first- and second-layer feature maps of an original image and its rescaled version using the two architectures being compared. Feature maps at different layers are rescaled to the same spatial dimension for visualization. The four images enclosed in each of the dashed rectangles correspond to: $x^{(l)}[x^{(0)}]$ ($l$-th layer feature of the original input), $x^{(l)}[D_{\beta, v} x^{(0)}]$ ($l$-th layer feature of the rescaled input), $T_{\beta, v} x^{(l)}[x^{(0)}]$ (rescaled $l$-th layer feature of the original input, where $T_{\beta, v}$ is understood as $D_{\beta, v}$ for a regular CNN due to the lack of a scale index $\alpha$), and the difference $x^{(l)}[D_{\beta, v} x^{(0)}] - T_{\beta, v} x^{(l)}[x^{(0)}]$. It is clear that even with numerical discretization, which can be modeled as a form of input deformation, ScDCFNet is still approximately scale-equivariant, i.e., $x^{(l)}[D_{\beta, v} x^{(0)}] \approx T_{\beta, v} x^{(l)}[x^{(0)}]$, whereas a regular CNN does not have such a property.
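The visual comparison above can be complemented by a scalar measure. The helper below is a hypothetical sketch that computes the relative discrepancy between the feature of a rescaled input and the rescaled feature of the original input. It is not a metric reported in this paper, and both arguments are assumed to be arrays of the same shape.

```python
import numpy as np

def relative_equivariance_error(feature_of_rescaled, rescaled_feature):
    """Relative l2 discrepancy between two feature arrays of identical shape."""
    diff = np.linalg.norm(feature_of_rescaled - rescaled_feature)
    return float(diff / np.linalg.norm(rescaled_feature))

# Toy arrays standing in for the two feature maps being compared.
reference = np.ones((4, 4))
perturbed = 1.1 * reference
print(relative_equivariance_error(perturbed, reference))
```

A value near zero indicates that the two feature maps nearly coincide, which is the behavior expected of an approximately equivariant network.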
5.2 Multiscale Image Classification
Table 1: SMNIST and SFashion test accuracy (%) of the compared architectures, without and with batch-normalization.
We next demonstrate the improved performance of ScDCFNet in multiscale image classification. The experiments are conducted on the SMNIST and SFashion datasets, and a regular CNN is used as a performance benchmark. Both networks consist of three convolutional layers, with the exact architectures (Table 2) detailed in Appendix B.2. Unlike $SO(2)$, the scaling group is unbounded, and we thus compute only the feature maps with the scale index restricted to a truncated scale interval, which is discretized uniformly into a finite number of channels. The performance of the compared architectures with and without batch-normalization is shown in Table 1. It is clear that, by limiting the hypothesis space to scale-equivariant models and taking truncated basis decomposition to reduce the model size, ScDCFNet achieves a significant improvement in classification accuracy with a reduced number of trainable parameters. The advantage of ScDCFNet is more pronounced when the number of training samples is small, suggesting that, by hardwiring the input scale information directly into its representation, ScDCFNet is less susceptible to overfitting the limited multiscale training data.
We also observe that even when a regular CNN is trained with data augmentation (random cropping and rescaling), its performance is still inferior to that of an ScDCFNet without manipulation of the training data. In particular, although the accuracies of the regular CNNs trained on 2000 SMNIST and SFashion images after data augmentation are improved to, respectively, and , they still underperform the ScDCFNets without data augmentation ( and ) using only a fraction of trainable parameters. Moreover, if ScDCFNet is trained with data augmentation, the accuracies can be further improved to and respectively. This suggests that ScDCFNet can be combined with data augmentation for optimal performance in multiscale image classification.
5.3 Image Reconstruction
In the last experiment, we illustrate the ability of ScDCFNet to explicitly encode the input scale information into the representation. To achieve this, we train an ScDCFNet auto-encoder on the SMNIST dataset, with the images resized for better visualization. The encoder stacks two scale-equivariant convolutional blocks with average-pooling, and the decoder contains a succession of two transposed convolutional blocks with upsampling. A regular CNN auto-encoder is also trained for comparison (see Table 3 in Appendix B.3 for the detailed architecture).
Our goal is to demonstrate that the image code produced by the ScDCFNet auto-encoder contains the scale information of the input, i.e., by applying the group action (2) to the code of a test image before feeding it to the decoder, we can reconstruct rescaled versions of the original input. This property can be visually verified in Figure 3. In contrast, a regular CNN auto-encoder fails to do so.
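Part of applying a group action to an image code is a shift along the scale axis. The sketch below shows only that part, under assumptions: the scale axis is the first array axis and is discretized uniformly, the scale change is an integer number `k` of scale steps, and the convention is that the shifted code at channel `i` equals the original code at channel `i - k`. Because the scaling group is acyclic, channels shifted outside the computed interval are filled with zeros rather than wrapped around. The accompanying spatial rescaling of each channel is omitted, and the function name is hypothetical.

```python
import numpy as np

def shift_scale_channels(features, k):
    """Shift features along the scale axis (axis 0) by k channels with zero fill."""
    out = np.zeros_like(features)
    n = features.shape[0]
    if abs(k) >= n:
        return out            # everything is shifted out of the computed interval
    if k >= 0:
        out[k:] = features[: n - k]
    else:
        out[: n + k] = features[-k:]
    return out

# Toy code with 4 scale channels and 3 spatial samples per channel.
code = np.arange(4.0)[:, None] * np.ones((1, 3))
up = shift_scale_channels(code, 1)
down = shift_scale_channels(code, -1)
```

A cyclic shift such as `np.roll` would be the natural choice for a rotation index, but here it would wrongly feed the coarsest channel back into the finest one.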
6 Conclusion
We propose, in this paper, a scale-equivariant CNN with joint convolutions across the space and the scaling group, which we show to be both sufficient and necessary to impose a scale-equivariant network representation. To reduce the computational cost and model complexity incurred by the joint convolutions, the convolutional filters supported on the joint domain are decomposed under a separable basis across the two domains and truncated to only low-frequency components. Moreover, the truncated filter expansion also leads to improved deformation robustness of the equivariant representation, i.e., the model is still approximately equivariant even if the scaling transformation is imperfect. Experimental results suggest that ScDCFNet achieves improved performance in multiscale image classification with greater interpretability and reduced model size compared to regular CNN models.
For future work, we will study the application of ScDCFNet in more complicated vision tasks, such as object detection/localization and pose estimation, where it is beneficial to directly encode the input scale information into the deep representation. Moreover, the memory usage of our current implementation of ScDCFNet scales linearly with the number of the truncated basis functions in order to realize the reduced computational burden explained in Theorem 2. We will explore other efficient implementations of the model, e.g., using filter-bank type techniques to compute convolutions with multiscale spatial filters, to significantly reduce both the computational cost and memory usage.
References
- [1] (1965) Handbook of mathematical functions: with formulas, graphs, and mathematical tables. Vol. 55, Courier Corporation.
- [2] (2013) Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1872–1886.
- [3] (2019) RotDCF: decomposition of convolutional filters for rotation-equivariant deep networks. In International Conference on Learning Representations.
- [4] (2018) A general theory of equivariant CNNs on homogeneous spaces. arXiv preprint arXiv:1811.02017.
- [5] (2016) Group equivariant convolutional networks. In International Conference on Machine Learning, pp. 2990–2999.
- [6] (1991) The design and use of steerable filters. IEEE Transactions on Pattern Analysis & Machine Intelligence (9), pp. 891–906.
- [7] (2014) Locally scale-invariant convolutional neural networks. arXiv preprint arXiv:1412.5104.
- [8] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- [9] (2019) Selective kernel networks.
- [10] (2015-06) Fully convolutional networks for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [11] (2010) Recursive interferometric representation. In Proc. of EUSIPCO conference, Denmark.
- [12] (2012) Group invariant scattering. Communications on Pure and Applied Mathematics 65 (10), pp. 1331–1398.
- [13] (2018) Scale equivariance in CNNs with vector fields. arXiv preprint arXiv:1807.11783.
- [14] (2017) Rotation equivariant vector field networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5048–5057.
- [15] (2018) A mixed-scale dense convolutional neural network for image analysis. Proceedings of the National Academy of Sciences 115 (2), pp. 254–259.
- [16] (2018-07) DCFNet: deep neural network with decomposed convolutional filters. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm, Sweden, pp. 4198–4207.
- [17] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 91–99.
- [18] (2015) U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi (Eds.), Cham, pp. 234–241.
- [19] (2013) Rotation, scaling and deformation invariant scattering for texture discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1233–1240.
- [20] (2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence.
- [21] (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
- [22] (2016) Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
- [23] (2018) Understanding convolution for semantic segmentation. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1451–1460.
- [24] (2018) 3D steerable CNNs: learning rotationally equivariant features in volumetric data. In Advances in Neural Information Processing Systems, pp. 10381–10392.
- [25] (2018) Learning steerable filters for rotation equivariant CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 849–858.
- [26] (2017) Harmonic networks: deep translation and rotation equivariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5028–5037.
- [27] (2017-08-28) Fashion-MNIST (website).
- [28] (2014) Scale-invariant convolutional neural networks. arXiv preprint arXiv:1411.6369.
- [29] (2017-07) Dilated residual networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [30] (2016) Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations.
- [31] (2017) Oriented response networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 519–528.
Appendix A Proofs
A.1 Proof of Theorem 1
Proof of Theorem 1.
We note first that (3) holds true if and only if the following is valid for all ,
where is understood as . We also note that the layer-wise operations of a general feedforward neural network with an extra index can be written as
and, for ,
When , we have
To prove the necessary part: when , we have
Hence for (12) to hold when , we need
Keeping fixed while changing in (15), we obtain that does not depend on the third variable . Thus . Define as
Then, for any given , setting in (15) leads to
For , a similar argument leads to