Affine Self Convolution
Abstract
Attention mechanisms, and most prominently self-attention, are a powerful building block for processing not only text but also images. They provide a parameter-efficient method for aggregating inputs. We focus on self-attention in vision models and combine it with convolution, which, as far as we know, we are the first to do. What emerges is a convolution with data-dependent filters. We call this an Affine Self Convolution. Although it is applied differently at each spatial location, we show that it is translation equivariant. We also modify the Squeeze and Excitation variant of attention, extending both variants of attention to the rototranslation group. We evaluate these new models on CIFAR10 and CIFAR100 and show a reduction in the number of parameters, while reaching comparable or higher test accuracy than self-trained baselines.
1 Introduction
Computer vision has seen great success thanks to the use of the convolution operation in Convolutional Neural Networks (CNNs) (Lecun89; Krizhevsky12; He17MaskRCNN; Mnih13). This operation takes advantage of the translational symmetry in visual perception tasks such as image classification. Meanwhile, in tasks that require sequence processing, attention (chorowski2015attention; Bahdanau14) and self-attention (Vaswani17) have emerged as powerful techniques.
One of the peculiarities of CNNs is that filters are defined independently of the data. Self-attention, in contrast, is data dependent, but it does not provide a template matching scheme as the convolution operation does, since it merely reweights neighborhoods. While there is work towards using attention in CNNs, current models use the two operations independently, sequentially, or in parallel.
We unify convolution and self-attention, taking the best of both worlds. Our method provides a translationally equivariant convolution whose filters also depend on the input. These data-dependent filters describe the relations present in the input more efficiently. Moreover, because we formulate it as a special convolution, it can be extended to be equivariant to other groups of transformations. We therefore draw on the rich literature on group equivariant neural networks (CohenW16) and develop the rototranslation equivariant counterpart. This module can be used as a replacement for standard convolutional layers and we call it an Affine Self Convolution.
Another variant of attention in computer vision is Squeeze and Excitation (SE). This provides global attention and we extend it to the rototranslation variant in order to compare it to our module. We plan to release code for all the experiments soon.
The contributions of this work are:

We introduce the Affine Self Convolution (ASC), merging convolution and self-attention.

We prove ASC is translation equivariant.

We extend ASC to rototranslation equivariant ASC.

We develop group Squeeze and Excitation.

We evaluate these modules on CIFAR10 and CIFAR100.
2 Background
In order to combine convolution and self-attention, we first look at group convolutions (cohen2018general) and then at the self-attention mechanism (Vaswani17; parmar2018imagetransformer).
2.1 Group convolution
Group equivariant convolutional neural networks extend the operation of convolution. CohenW16 show that we can consider the convolution operation to be defined on a group and that this allows for a natural generalization to other objects that have a group structure.
Translation equivariant convolution The set of points in with the operation of vector addition forms a group. For each value in this group, the translation operator translates the domain of a function: . Therefore, given two functions , the convolution between the image and the filter can be written in terms of translations by elements of the group (we depict the convolution in Figure 2):
(1) 
Where the convolution operation is indexed by the domain of the input and the filter. Most importantly, the convolution is equivariant to translations: . This property connects a transformation of the input image with a precise transformation of the activations . This is desirable because a model that implements such an operation can also be invariant to translations by taking a max over spatial positions at the end.
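The equivariance property can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a single-channel image on a periodic integer grid, and the helper names `translate` and `correlate` are hypothetical. It computes the sum of image values weighted by a translated filter and checks that translating the input translates the output by the same amount.

```python
def translate(f, t):
    """Translation operator on a periodic grid: [L_t f](x) = f(x - t)."""
    h, w = len(f), len(f[0])
    return [[f[(i - t[0]) % h][(j - t[1]) % w] for j in range(w)] for i in range(h)]


def correlate(f, psi):
    """out(x) = sum_y f(y) * psi(y - x), with psi supported on offsets 0..k-1.

    Periodic boundaries are an assumption made here so that translations
    are exact symmetries of the finite grid.
    """
    h, w = len(f), len(f[0])
    k = len(psi)
    return [[sum(f[(i + a) % h][(j + b) % w] * psi[a][b]
                 for a in range(k) for b in range(k))
             for j in range(w)] for i in range(h)]


# Integer-valued toy image and filter, so the comparison below is exact.
f = [[(3 * i + 5 * j * j) % 7 for j in range(5)] for i in range(4)]
psi = [[1, -2], [0, 3]]
t = (1, 2)

lhs = correlate(translate(f, t), psi)   # translate, then convolve
rhs = translate(correlate(f, psi), t)   # convolve, then translate
print(lhs == rhs)
```

Taking a maximum over all positions of `lhs` and of `rhs` gives the same value, which is the invariance obtained by pooling over spatial positions.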
Rototranslation equivariant convolution A natural extension of the group of translations is the group of planar rotations and translations (weiler2018learning; diaconu2019learning). The elements of this group have 2 components, a (proper) rotation and a translation :
(2) 
Where is the identity matrix of order 2 and is the identity element of . Throughout this work we will use to denote rotation elements and translation elements of the group . From here, we denote the . Similarly to diaconu2019learning, the operator inversely transforms the coordinates of a function with domain by . As a result, we can evaluate the planar convolution at points on , . This operation is called a lifting layer and outputs activations with domain . The operator transforms such functions by . We can preserve equivariance of these functions by replacing the planar convolution with convolutions between :
(3) 
In practice we are limited by the sampling grid, as real world images are functions on . This has as symmetries translations by integer values, which form the group and rotations by multiples of , which form the group . By replacing the groups and in the definition of with and we get the group . This is a subgroup of , and all the relations in this work hold for both.
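To make the discrete rototranslation group concrete, here is a small sketch under stated assumptions: an element is stored as a pair `(r, t)` with `r` in {0, 1, 2, 3} counting quarter turns and `t` an integer translation, and composition follows the usual semidirect product rule (rotate the second translation by the first rotation, then add). The function names are hypothetical and the notation may differ from the one used in this paper.

```python
def rot90(v, r):
    """Rotate the integer vector v by r quarter turns counter-clockwise."""
    x, y = v
    for _ in range(r % 4):
        x, y = -y, x
    return (x, y)


def compose(g, h):
    """(r1, t1) * (r2, t2) = (r1 + r2 mod 4, R^r1 t2 + t1)."""
    (r1, t1), (r2, t2) = g, h
    rt2 = rot90(t2, r1)
    return ((r1 + r2) % 4, (rt2[0] + t1[0], rt2[1] + t1[1]))


def inverse(g):
    """(r, t)^-1 = (-r mod 4, -R^-r t)."""
    r, t = g
    r_inv = (-r) % 4
    rt = rot90(t, r_inv)
    return (r_inv, (-rt[0], -rt[1]))


IDENTITY = (0, (0, 0))

g = (1, (2, -3))
h = (3, (0, 5))
print(compose(g, inverse(g)) == IDENTITY)
print(compose(g, h))
```

Pure translations `(0, t)` compose by vector addition, which is how the translation group sits inside the larger group as a subgroup.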
2.2 Self-attention in computer vision
Simplified form We start with a simplified form of local self-attention to highlight each part and to make the notation clearer. For an input image , self-attention is defined by two parts: 1) a score function between a center pixel and a neighbor pixel . This is usually the dot product, . So that the score values in a neighborhood sum to 1, the scores are normalized with softmax, , and 2) an aggregation of the neighbors based on the score function:
(4) 
This operation defines a neighborhood dependent weighting. We depict this in Figure 2. The dot product score function, and therefore Equation 4, does not take into account the relative position between and . Put simply, it encodes no spatial information. To take advantage of the regular grid in images, recent methods use positional embeddings . These can be added to the neighbors at , prior to computing the score, and are parametrized by the relative position from the center , as done in ramachandran2019SASA, . They can also be added to the neighbors' scores, after computing the score function, as done in bello2019AACNN; hu2019LRNet,
Self-attention In practice, self-attention uses three sets of parameters (parmar2018imagetransformer; wang2018NLNet), , , , where stands for Query, Key, Value. We turn our attention to an input function (image) , with input channels and parameters . , , , which are implemented as convolutions, have the purpose of mapping to three separate embeddings, , , and not to process spatial information. These have or channels. The spatial information is weighted and mixed by the attention mechanism. In this more general setting, where we denote an arbitrary channel with , self-attention is defined with: 1) three linear mappings of the input (defined analogously for and ): 2) a normalized score function, which can use positional embeddings: and 3) an aggregation of the embeddings based on the score function:
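The two parts of simplified local self-attention can be sketched as follows. This is an illustrative sketch only: it assumes a single-channel image (so the dot product reduces to a product of two scalars), periodic boundaries, and no positional embeddings; the names `softmax` and `local_self_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Normalize scores so that they are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def local_self_attention(f, radius=1):
    """For each center, reweight its neighbors by the softmax of center * neighbor."""
    h, w = len(f), len(f[0])
    offsets = [(a, b) for a in range(-radius, radius + 1)
               for b in range(-radius, radius + 1)]
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            neighbors = [f[(i + a) % h][(j + b) % w] for a, b in offsets]
            weights = softmax([f[i][j] * n for n in neighbors])      # part 1: score
            out[i][j] = sum(wt * n for wt, n in zip(weights, neighbors))  # part 2
    return out


f = [[((2 * i + j) % 4) / 2 for j in range(6)] for i in range(5)]
out = local_self_attention(f)
flipped = [row[::-1] for row in f]
out_of_flipped = local_self_attention(flipped)
```

Because the score depends only on pixel values, mirroring the image simply mirrors the output (`out_of_flipped` matches `out` with each row reversed, up to rounding). The operation cannot tell a left neighbor from a right one, which is the lack of spatial information that positional embeddings address.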
(5) 
It is also common to use the multi-head mechanism from the Transformer (Vaswani17) alongside attention. This means that are split along the channel dimension and the self-attention mechanism is evaluated independently for each element of the partition (each head). The results are then concatenated and passed through a linear layer.
3 Method
In the following section we look at the local, relative type of attention, self-attention, and formulate it similarly to convolution. This allows us to merge the two and then develop the rototranslational variant. We also derive the rototranslational variant of the Squeeze and Excite module, which defines global attention.
3.1 Affine Self Convolution (ASC)
Affine map We have seen how simplified self-attention in Equation 4 is applied to an input image and that it was further developed using relative positional embeddings in order to include spatial information. We note that these are additive terms and extend them by a multiplicative term that is also spatially dependent. This results in an affine map , which depends on . We define the map relative to a center : and we apply it using the translation operator as . Furthermore, we can describe the affine map as acting on an image :
(6) 
Simplified form We define the simplified ASC by first applying the affine map , then the simplified self-attention from Equation 4, to : . We denote this with . This is applied to two functions, an input and an affine filter, and we index it by the domain of the two functions, similarly to the convolution operation:
(7) 
This uses the self-attention score from Equation 4, the convolutional filter from Equation 1 and also adds positional embeddings. Moreover, we can distribute and view and as the parameters of a normalized affine map. Intuitively, this not only performs template matching through , which is independent of the information in the image, but also scales the template relative to what is in the image through . By unifying convolution and self-attention, these data-dependent filters can describe the relations in the image more efficiently because they are applied differently at each location. We call this an Affine Self Convolution and we depict it in Figure 3. It is possible to recover the usual convolution by setting the scaling coefficients to and to . By setting to 1, we recover a simplified self-attention, where positional embeddings are used for computing the score and for the aggregated term.
Translation equivariance We prove that this operation is translation equivariant in Appendix F (Equation 25) and therefore: . This means that this model can detect objects regardless of their position in the image, as does the standard convolution.
Affine Self Convolution Similarly to the general form of self-attention in Equation 5, we use three sets of parameters, , , for an input . This is depicted in Appendix Figure 6. By contrast to the simple variant, we now also have the affine maps. These process spatial information and are outside the extent, which is controlled by a hyperparameter: kernel size. Moreover, each affine map can be implemented with , mapping from all the channels in the input to all the channels in the output or , mapping from one channel in the input to the same channel in the output. The same is true for , if we replace with . Our experiments use the latter, due to computational constraints. The additive term of the map is defined as , where for , is replaced with . We denote an arbitrary channel with and we define ASC as:
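The simplified operation and its equivariance can be illustrated numerically. The sketch below is not the authors' implementation and rests on assumptions that should be read as such: a single-channel image on a periodic grid, an affine map that scales each neighbor by `alpha` and shifts it by `beta` (both indexed by the offset from the center), and a score equal to the softmax of the center value times the transformed neighbor. All names are hypothetical.

```python
import math


def translate(f, t):
    """[L_t f](x) = f(x - t) on a periodic grid."""
    h, w = len(f), len(f[0])
    return [[f[(i - t[0]) % h][(j - t[1]) % w] for j in range(w)] for i in range(h)]


def simple_asc(f, alpha, beta, attend=True):
    """Assumed simplified ASC; alpha and beta are k x k lists (k odd).

    attend=False replaces every score by 1, i.e. no softmax reweighting.
    """
    h, w = len(f), len(f[0])
    k = len(alpha)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Affine map of each neighbor, parametrized by its offset from the center.
            vals = [alpha[a][b] * f[(i + a - r) % h][(j + b - r) % w] + beta[a][b]
                    for a in range(k) for b in range(k)]
            if attend:
                # Assumed score: center value times transformed neighbor, then softmax.
                scores = [f[i][j] * v for v in vals]
                m = max(scores)
                exps = [math.exp(s - m) for s in scores]
                z = sum(exps)
                weights = [e / z for e in exps]
            else:
                weights = [1.0] * len(vals)
            out[i][j] = sum(wt * v for wt, v in zip(weights, vals))
    return out


f = [[((2 * i + 3 * j) % 5) / 5 for j in range(6)] for i in range(5)]
alpha = [[0.5, -1.0, 0.25], [1.0, 2.0, -0.5], [0.0, 0.75, 1.5]]
beta = [[0.1, 0.0, -0.2], [0.3, 0.0, 0.1], [-0.1, 0.2, 0.0]]
zeros = [[0.0] * 3 for _ in range(3)]

shifted_then_asc = simple_asc(translate(f, (2, 1)), alpha, beta)
asc_then_shifted = translate(simple_asc(f, alpha, beta), (2, 1))
plain_filter = simple_asc(f, alpha, zeros, attend=False)
```

Under these assumptions `shifted_then_asc` and `asc_then_shifted` agree, even though the effective filter differs at every location. With the additive term set to zero and the scores fixed to 1, `plain_filter` is an ordinary filtering of `f` with `alpha`, which mirrors the remark about recovering the usual convolution.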
1) three linear mappings of the input (defined analogously for and ):  
(9)  
2) three affine maps for the terms (defined analogously for and ):  
(10)  
3) a score function between center and neighbor , which is then normalized with softmax:  
(12)  
4) an aggregation of the embeddings based on the score function:  
(13) 
We also use the multi-head mechanism. We note that now we have a separate set of parameters for , and that is only evaluated at . Therefore, we can learn for only one spatial index .
By comparison, the positional embeddings in ramachandran2019SASA are represented by the term . The positional embeddings in hu2019LRNet are equivalent to learning the product of the embeddings , which arises when multiplying . A difference between this work and their work is that we directly learn these parameters, while ramachandran2019SASA; hu2019LRNet use a separate network to learn .
3.2 Rototranslation Affine Self Convolution
We now use the machinery of this new operation and the group theoretic background to develop the rototranslation ASC. Similarly to the standard ASC, we first define a simplified form based on an affine map, then we prove rototranslation equivariance, and finally, we present the general form.
Affine map We now turn to functions on . We will denote . Similarly to the relative affine map for functions defined on in equation 6, we define a relative affine map for functions on . This map uses affine parameters and is defined as . This acts on a function with domain as:
(14) 
This affine map transforms a function relative to a center .
Simplified form To define the score function, we notice that we can split the sum over the group in Equation 3 into two sums, . In this form, for each the convolution can be seen as a weighted sum of the neighbors , relative to a center . We can scale the affine group convolution based on this intuition. Precisely, we add a score function that is evaluated between each center and each neighbor :
(15) 
We note that it might also be possible to define a score function and we depict the difference in Figure 4. Nonetheless, this could be computationally prohibitive and is not required in order to preserve rototranslation equivariance. Using the score in Equation 15 we define the Simple ASC on :
(16) 
In Appendix F (Equation 33) we show that we can replace with . Therefore, we can learn as a function with domain . We depict this mechanism in Appendix Figure 9.
Rototranslation equivariance We verify that ASC on groups is equivariant to actions of the group by checking the equivariance relation in Appendix F (Equation 34).
Rototranslation ASC In practice, the input is discrete and we turn to functions on , . For the general rototranslation ASC all quantities are defined analogously to the ASC on in Equation 13, using the score function in Equation 15, the aggregation in Equation 16 and replacing the domain with . The details can be found in Appendix E.
3.3 Group Squeeze and Excite
Interactions between filters that are spatially far apart are also tackled in Squeeze and Excitation (SE) (SE). The SE module rescales each feature map based on a global aggregation over the spatial dimensions. An intuition for this is that, by allowing for channel interactions at all the spatial locations, it effectively enlarges the receptive field maximally. This is also a parameter-efficient method for increasing the receptive field.
For a general group (with ), the squeeze term takes an average of a function over the group. This is invariant to transformations by actions of the group , which we show in Appendix F (Equation 42). The average is then passed through a one-hidden-layer MLP (, where is the number of channels of the input and is the reduction ratio). A sigmoid unit () is then applied to the activation.
(17) 
These are then broadcast across the domain with an element-wise multiplication, . Therefore, multiplying a function by preserves the group structure of :
(18) 
This is added as the last operation in each bottleneck block of the ResNet. In our experiments, the group is either the group of integer translations (which is the original operation in SE and for which is the height × width of the image) or the discrete rototranslation group (for which is 4 × height × width, since there are 4 rotations in ).
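The squeeze, excitation and rescaling steps can be sketched in a few lines. This is an illustrative sketch under assumptions: each channel is stored as a flat list of values indexed by the group elements, the hidden layer uses a ReLU as in the original SE module, and a cyclic shift of the domain stands in for the action of a group element (any group action permutes the domain in the same way). The function names and weights are hypothetical.

```python
import math


def squeeze(f):
    """Average each channel over the whole (flattened) group domain."""
    return [sum(channel) / len(channel) for channel in f]


def excite(s, w1, w2):
    """One-hidden-layer MLP (C -> C/r -> C) followed by a sigmoid gate."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, s))) for row in w1]  # ReLU (assumed)
    return [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(row, hidden))))
            for row in w2]


def group_se(f, w1, w2):
    """Rescale every channel by its gate, broadcast over the domain."""
    gates = excite(squeeze(f), w1, w2)
    return [[g * v for v in channel] for g, channel in zip(gates, f)]


def act(f, shift):
    """Stand-in for a group action: permutes the domain of every channel."""
    return [channel[shift:] + channel[:shift] for channel in f]


# 4 channels, 6 domain elements, reduction ratio 2 (toy values).
f = [[((c + 1) * (g + 2)) % 5 / 4 for g in range(6)] for c in range(4)]
w1 = [[0.5, -0.25, 0.1, 0.3], [-0.4, 0.2, 0.6, -0.1]]
w2 = [[1.0, -1.0], [0.5, 0.5], [-0.3, 0.8], [0.2, -0.6]]

same_squeeze = squeeze(act(f, 2))          # invariant: equals squeeze(f) up to rounding
se_of_moved = group_se(act(f, 2), w1, w2)  # equals act(group_se(f, w1, w2), 2)
```

Since the gates depend only on the invariant averages, transforming the input and then applying the module gives the same result as applying the module and then transforming the output.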
4 Experiments
In this section we describe the datasets used and motivate our baseline architecture. The results are then divided into models that use convolution only and models that use self-attention, including ASC. We then present the overall trends. We specify the hyperparameters in Appendix C.
Dataset In our experiments we test the models' performance on the CIFAR10 and CIFAR100 datasets (CIFAR10). These consist of 60,000 images each, 50,000 for training and 10,000 for testing. We further split the training set into images for training and we leave for validation. CIFAR10 consists of 10 classes and CIFAR100 consists of 100 classes.
Backbone ResNets (HeZRS15) are a family of CNNs. They are composed of building blocks which are either basic blocks, which have 2n layers per feature map size, or bottleneck blocks, which have 3n layers per feature map size. For CIFAR the feature map sizes are 32, 16 and 8. In HeZRS15 the choice of n defines the depth of the network. On top of this, they also count the initial and the final layer. As a result we can choose basic block ResNets with 6n+2 layers or bottleneck block ResNets with 9n+2 layers. In our experiments we use a variant of the ResNet that is appropriate for CIFAR and also uses bottleneck residual blocks. We take this approach because the models using self-attention in the literature use it as a replacement for the layer inside the bottleneck residual block and we do the same. As a result, from the standard ResNet20, which is an example of a ResNet with basic residual blocks (6n+2 layers, with n=3), we arrive at ResNet29, which is an example of a ResNet with bottleneck residual blocks (9n+2 layers, with n=3).
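The depth arithmetic can be written out as a short check. It assumes the usual CIFAR ResNet layout with three feature map sizes, n blocks per size, 2 layers per basic block and 3 layers per bottleneck block, plus the initial and final layers; the function name `resnet_depth` is hypothetical.

```python
def resnet_depth(n, block):
    """Depth of a CIFAR-style ResNet with n blocks per feature map size."""
    layers_per_block = {"basic": 2, "bottleneck": 3}[block]
    feature_map_sizes = 3  # assumed CIFAR layout
    return feature_map_sizes * layers_per_block * n + 2  # + initial and final layer


print(resnet_depth(3, "basic"))       # 20, i.e. ResNet20
print(resnet_depth(3, "bottleneck"))  # 29, i.e. ResNet29
print(resnet_depth(9, "bottleneck"))  # 83, i.e. ResNet83
```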
Models We present the results on CIFAR10 in Figure 5. The models are divided into convolution-only models and self-attention models, including ASC. Convolution models include squeeze and excite models (SE), since these do not change the convolution operation, but add the SE module at the end of each bottleneck block. We show the baseline ResNet29, the rototranslational baseline ResNet29, together with these models with SE. We now turn to models that use self-attention, for which we replace the convolutions in ResNet29's bottleneck blocks with self-attention. This means that one convolution remains, the convolution in the stem (this is the first layer in all ResNets). We experiment with several models. We replicate the strategy for positional embeddings in ramachandran2019SASA and we only learn a factorized variant of , instead of learning an affine map for each of the terms. These parameters are factorized over the spatial dimensions. This means that for each head, half of the positional embeddings are invariant to horizontal translations, while the other half are invariant to vertical translations. Models using this parametrization of self-attention are denoted with +SASA. Self-attention models include ASC models. We use Simple_ASC based on Equation 7. This does not use three separate pairs of parameters, just one. With +ASC we denote the general form of ASC as described in Equation 13. We also include ResNet29+ASC+SE as one of the models. This is because ASC and SE are not mutually exclusive and we can add the SE module as for a standard ResNet. We also train a rototranslation ASC, which we denote with ResNet29+ASC. This uses the general form of rototranslation ASC as described in Equation 24.
Results The overall trade-off between accuracy and parameter count in Figure 5 shows that all the self-attention models use fewer parameters than the convolution models, while reaching an accuracy in the same range. This supports the intuition that data-dependent filters are a more flexible and powerful modeling choice. We see that +SE adds a small improvement to convolutional models, but no improvement to +ASC models. This suggests that local self-attention provides enough spatial context. Among the self-attention models, +SASA is the only one that does not manage to compete with the baseline. We conclude that a relative affine map is required; additive positional embeddings are not enough. Most notably, +Simple_ASC drastically decreases the number of parameters () and reaches accuracy within one standard deviation of the baseline. The three models with the highest accuracy are the rototranslational counterparts to standard models. They show a clear, consistent increase in performance with little to no extra parameters.
Table 1: Results on CIFAR100.
Equivariance     Model                        Accuracy (%)    #Parameters (k)
translation      ResNet29                     68.34 ± 0.38    336
translation      ResNet29+Simple_ASC (ours)   68.40 ± 0.78    240
translation      ResNet29+ASC (ours)          68.68 ± 0.77    291
rototranslation  ResNet29                     72.03 ± 0.46    321
rototranslation  ResNet29+SE                  72.03 ± 0.45    354
rototranslation  ResNet29+ASC (ours)          72.71 ± 0.51    283
This confirms that the theory was applied consistently and that the group theoretic approach benefits attention mechanisms. We ran a subset of the models on CIFAR100; the results are shown in Table 1. The table shows that both ASC and rototranslation equivariance are more beneficial on this dataset. This suggests that these models generalize better when there is less data available.
The experiments show that self-attention is competitive when including the affine map as done in ASC and that rototranslational equivariance is a robust improvement in all models.
5 Related work
Group equivariant CNNs The theoretically founded approach of group equivariant neural networks has motivated several advances. These works are presented under a unifying framework of group equivariant convolutional networks in cohen2018general. Closely related to our work are the developments in planar Euclidean groups in worrall2017harmonic; weiler2018learning; diaconu2019learning and the discrete variants of these in Lecun89; CohenW16; dieleman2016exploiting; hoogeboom2018hexaconv and 3D in kondor2018n; cohen2018spherical; esteves2018learning; worrall2018cubenet; weiler20183d; winkels20183d; thomas2018tensor. We note that this work would benefit from extensions to semigroups (worrall2019deep) or curved manifolds (cohen2019gauge). Other relevant works include kondor2018generalization; esteves2019equivariant; esteves2018cross; marcos2017rotation; zhou2017oriented; bekkers2018roto; jacobsen2017dynamic.
Self-attention Various forms of self-attention for CNNs have been introduced based on nonlocal means (buades2005non) or on the Transformer (Vaswani17). These models are generally trained for classification/segmentation (wang2018NLNet; hu2019LRNet), but have also been used for generative tasks (parmar2018imagetransformer; zhang2018SAGAN). Some works (hu2019LRNet) show that locality of the self-attention mechanism together with softmax helps and that relative positional embeddings are essential. In parallel, ramachandran2019SASA also show improvements using local self-attention with local relative positional embeddings over bello2019AACNN, which use global self-attention with global positional embeddings. We describe in more detail how our model is related to ramachandran2019SASA; hu2019LRNet at the end of Section 3.1. It is also worth mentioning that several of these works compare convolutional models with attention-based models and show that convolutional models require more floating point operations (FLOPs). Regardless, convolutional models are faster than the attention-based models. Self-attention has also been applied to graphs (velivckovic2017graph).
Data dependent filters Other works which approach the problem of learning to apply filters differently at each spatial positions are stanley2019designing; jia2016dynamic; ha2016hypernetworks; jaderberg2015spatial; sabour2017dynamic; kosiorek2019stacked; dai2017deformable; su2019pixel.
6 Conclusion
In this work we show that there is a mixture of convolution and self-attention that can be used to replace spatial convolutions in CNNs. This module can be described as a convolution with data-dependent filters. By retaining the benefits of both self-attention and convolution, what emerges are filters that are translationally equivariant, while being applied differently at each location in the input. The results show that this method achieves comparable if not better performance than the convolutional models, while using fewer parameters and a bigger receptive field. Under simplifying assumptions, we can recover both self-attention and convolution, which allows us to incorporate the group theoretic approach of group equivariant CNNs. We therefore prove the translational equivariance of ASC and also develop the rototranslation equivariant ASC. The latter is more robust to transformations of the input while surpassing the other models in accuracy. We expect the most fruitful directions for future work to be: an efficient implementation (because self-attention is slower), an efficient parametrization (order and shape of s and s), merging self-attention and convolution for NLP/graphs, and equivariant ASC for manifolds (equivariant self-attention could score transport methods without assuming a predefined geometry).
References
Appendix A QKV ASC Figure
Appendix B Results
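Type            Equivariance     Model                        Accuracy (%)    #Parameters
convolution     translation      ResNet29                     91.18 ± 0.34    313k
convolution     translation      ResNet29+SE                  91.39 ± 0.15    347k
convolution     rototranslation  ResNet29                     93.02 ± 0.16    310k
convolution     rototranslation  ResNet29+SE                  93.24 ± 0.17    342k
self-attention  translation      ResNet29+SASA                89.43 ± 0.40    235k
self-attention  translation      ResNet29+Simple_ASC (ours)   91.01 ± 0.24    217k
self-attention  translation      ResNet29+ASC (ours)          91.24 ± 0.32    268k
self-attention  translation      ResNet29+ASC+SE (ours)       91.21 ± 0.11    301k
self-attention  rototranslation  ResNet29+ASC (ours)          93.03 ± 0.25    272k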
We also run preliminary experiments with ResNet83 (bottleneck block , with ), which are presented in Figure 8. These show a similar trend to the ResNet29 models, but they also indicate that attention, either as SE or ASC might be more rewarding in bigger models.
Appendix C Hyperparameters, initialization and training schedule
We normalize the images and use the standard data augmentation technique of random horizontal flips and random crops of 32×32 from the zero-padded 36×36 images.
We initialize convolutional layers using He initialization (HeInit) (for the variant, the number of channels is multiplied by 4 in the He initialization) and we initialize each batchnorm's scaling coefficient to 1 and shifting coefficient to 0. The reduction ratio in SE is , while in models, the reduction ratio in SE is .
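The augmentation described above can be sketched in plain Python. The sketch assumes a single-channel image stored as a list of rows, 2 pixels of zero padding on each side (which turns 32×32 into 36×36), and a flip probability of 0.5; the names `pad_zero` and `augment` are hypothetical, and normalization is left out.

```python
import random


def pad_zero(img, pad=2):
    """Surround the image with `pad` rows and columns of zeros."""
    h, w = len(img), len(img[0])
    out = [[0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for i in range(h):
        for j in range(w):
            out[i + pad][j + pad] = img[i][j]
    return out


def augment(img, rng, pad=2, size=32):
    """Random horizontal flip, then a random size x size crop of the padded image."""
    if rng.random() < 0.5:  # flip probability is an assumption
        img = [row[::-1] for row in img]
    padded = pad_zero(img, pad)
    top = rng.randint(0, 2 * pad)   # inclusive bounds: offsets 0..4
    left = rng.randint(0, 2 * pad)
    return [row[left:left + size] for row in padded[top:top + size]]


rng = random.Random(0)
img = [[i * 32 + j + 1 for j in range(32)] for i in range(32)]
crop = augment(img, rng)
print(len(crop), len(crop[0]))
```

In a PyTorch pipeline the same effect is usually obtained with library transforms rather than hand-written loops; the loop version is shown only to make the padding and crop offsets explicit.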
In the self-attention models we replace the spatial convolutions in the bottleneck layers with self-attention layers. Where the baseline ResNet uses a stride of 2, the self-attention models apply self-attention followed by an average pooling layer with kernel size and stride 2. The self-attention layers use the multi-head mechanism with 8 heads and a kernel size of . We set . In all examples we initialize . We use the dot product score normalized by . These models are more unstable in the first couple of epochs. Therefore, we initialize the scaling coefficient of the last batchnorm in each residual block to 0, as done in ZeroInit. Moreover, we warm up the learning rate (per epoch) for the first 10 epochs, up to the learning rate of 0.1. The models are trained using Nesterov accelerated gradient with momentum 0.9 and weight decay of 0.0001. The ResNet29 models and their variants are trained for 100 epochs, with the learning rate divided by 10 at epochs 50 and 75. We also include examples of ResNet83. These bigger models were trained for 200 epochs, with the learning rate divided by 10 at epochs 100 and 150. We do this for all the models, for a fair comparison. Throughout our experiments we used PyTorch (paszke2017pytorch). We have also taken inspiration from the GitHub repositories: imgclsmob; propercifar10resnet.
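The learning rate schedule for the ResNet29 models can be written as a small function. This is a sketch under assumptions: the warmup is taken to be linear in the epoch index (the text only states that it is applied per epoch), epochs are counted from 0, and the name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=0.1, warmup_epochs=10, milestones=(50, 75)):
    """Per-epoch learning rate: warmup to base_lr, then divide by 10 at each milestone."""
    if epoch < warmup_epochs:
        # Linear warmup (assumed): reaches base_lr in the last warmup epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    drops = sum(epoch >= m for m in milestones)
    return base_lr / (10 ** drops)


schedule = [learning_rate(e) for e in range(100)]
```

For the ResNet83 runs the same function would be called with `milestones=(100, 150)` over 200 epochs.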
When we use rototranslation layers instead of standard layers, we divide the number of channels in each stage of the ResNet by . This preserves a similar number of parameters between the rototranslation models and the standard models.
Appendix D Rototranslation Simple ASC Figure
Appendix E Rototranslation ASC
In practice, the input is discrete and we turn to functions on , . For the general rototranslation ASC all quantities are defined analogously to the ASC on in Equation 13, using the score function in Equation 15 and replacing the domain with . Therefore, we define rototranslation ASC with:
1) three linear mappings of the input (defined analogously for and ):  
(20)  
2) three affine maps for the terms (defined analogously for and ):  
(21)  
3) a score function between center and neighbor , which is then normalized with softmax:  
(23)  
4) an aggregation of the embeddings based on the score function: The final aggregation step completely describes the rototranslation ASC operation:  
(24) 
Similarly to ASC, for rototranslation ASC we learn for only one spatial index .
Appendix F Proofs
ASC translation equivariance proof
Claim:
(25) 
Proof:
Proof
Claim:  
(30)  
Proof:  
Proof
Claim:  
(32)  
Proof:  
Proof is a function on
Claim:
(33) 
Proof:
For the ASC on groups, the map and therefore, and , are always used inside . This leads to a more parameter efficient parametrization for :
This is actually a function on , not the whole group . Therefore, we replace with .
Rototranslation ASC equivariance proof
Claim:
(34) 
Proof:
Expanding on the left hand side:  
Using the substitution: and  
Using: , which we prove below in this appendix.
Using: , which we prove below in this appendix.
By arriving at the right hand side of Equation 34, we conclude the proof that rototranslation ASC is equivariant to actions of the group .
Proof
Claim:  
Proof:  
Proof
Claim:  
Proof:  
Proof: Group Squeeze is invariant to transformations of the group
Claim:
(42) 
Proof:
Using the substitution:  