Building Deep Equivariant Capsule Networks
Abstract
Capsule networks are constrained by the parameter-expensive nature of their layers, and the general lack of provable equivariance guarantees. We present a variation of capsule networks that aims to remedy this. We identify that learning all pairwise part-whole relationships between capsules of successive layers is inefficient. Further, we also realise that the choice of prediction networks and the routing mechanism are both key to equivariance. Based on these, we propose an alternative framework for capsule networks that learns to projectively encode the manifold of pose-variations, termed the space-of-variation (SOV), for every capsule-type of each layer. This is done using a trainable, equivariant function defined over a grid of group-transformations. Thus, the prediction-phase of routing involves projection into the SOV of a deeper capsule using the corresponding function. As a specific instantiation of this idea, and also in order to reap the benefits of increased parameter-sharing, we use type-homogeneous group-equivariant convolutions of shallower capsules in this phase. We also introduce an equivariant routing mechanism based on degree-centrality. We show that this particular instance of our general model is equivariant, and hence preserves the compositional representation of an input under transformations. We conduct several experiments on standard object-classification datasets that showcase the increased transformation-robustness, as well as general performance, of our model compared to several capsule baselines.
Capsule networks, Equivariance
1 Introduction
The hierarchical component-structure of visual objects motivates their description as instances of class-dependent spatial grammars. The production-rules of such grammars specify this structure by laying out valid type-combinations for components of an object, their inter-geometry, as well as the behaviour of these with respect to transformations on the input. A system that aims to truly understand a visual scene must accurately learn such grammars for all constituent objects, in effect learning their aggregational structures. One means of doing so is to have the internal representation of a model serve as a component-parsing of an input across several semantic resolutions. Further, in order to mimic latent compositionalities in objects, such a representation must be reflective of detected strengths of possible spatial relationships. A natural structure for such a representation is a parse-tree whose nodes denote components, and whose weighted parent-child edges denote the strengths of detected aggregational relationships.
Capsule networks (Hinton et al., 2011; Sabour et al., 2017) are a family of deep neural networks that aim to build such distributed, spatially-aware representations in a multi-class setting. Each layer of a capsule network represents and detects instances of a set of components (of a visual scene) at a particular semantic resolution. It does this by using vector-valued activations, termed 'capsules'. Each capsule is meant to be interpreted as being representative of a set of generalised pose-coordinates for a visual object. Each layer consists of capsules of several types that may be instantiated at all spatial locations depending on the nature of the image. Thus, given an image, a capsule network provides a description of its components at various 'levels' of semantics. In order that this distributed representation across layers be an accurate component-parsing of a visual scene, and capture meaningful and inherent spatial relationships, deeper capsules are constructed from shallower capsules using a mechanism that combines backpropagation-based learning and consensus-based heuristics.
Briefly, the mechanism of creating deeper capsules from a set of shallower capsules is as follows. Each deeper capsule of a particular type receives a set of predictions for its pose from a local pool of shallower capsules. These predictions are produced by a set of trainable neural networks that take the shallower capsules as input. These networks can be interpreted as aiming to capture possible part-whole relationships between the corresponding deeper and shallower capsules. The predictions thus obtained are then combined in a manner that ensures that the result reflects agreement among them. This is so that a capsule is activated only when its component-capsules are in the right spatial relationship to form an instance of the object-type it represents. This agreement-based aggregation is termed 'routing'. Multiple routing algorithms exist, for example dynamic routing (Sabour et al., 2017), EM-routing (Hinton et al., 2018), SVD-based routing (Bahadori, 2018), and routing based on a clustering-like objective function (Wang and Liu, 2018). These are based on differing notions of consensus, and consequently affect the capsule-decomposition of an input differently.
Based on their explicit learning of compositional structures, capsule networks can be seen as an alternative (to CNNs) for better learning of compositional representations. Indeed, CNN-based models do not have an inherent mechanism to explicitly learn or use spatial relationships in a visual scene. Further, the common use of layers that enforce local transformation-invariance, such as pooling, limits their ability to accurately detect compositional structures by allowing for relaxations in otherwise strict spatial relations (Hinton et al., 2011). Thus, despite some manner of hierarchical learning, as seen in their layers capturing simpler to more complex features as a function of depth, CNNs do not form the ideal representational model we seek. It is our belief that capsule-based models may serve us better in this regard.
This much said, research in capsule networks is still in its infancy, and several issues have to be overcome before capsule networks can become universally applicable like CNNs. We focus on two of these that we consider fundamental to building better capsule network models. First, most capsule-network models, in their current form, do not scale well to deep architectures. A significant factor is that all pairwise relationships between capsules of two layers (up to a local pool) are explicitly modelled by a unique neural network. Thus, for a 'convolutional capsule' layer, the number of trainable neural networks depends on the product of the spatial extent of the windowing and the numbers of capsule-types of each of the two layers. We argue that this design is not only expensive, but also inefficient. Given two successive capsule-layers, not all pairs of capsule-types have significant relationships. This is due to them either representing object-components that are part of different classes, or simply being incompatible in compositional structure. The consequences of this inefficiency go beyond poor scalability. For example, due to the large number of prediction-networks in this design, only simple functions (often just matrices) are used to model part-whole relationships. While building deep capsule networks, such a linear inductive bias can be inaccurate in layers where complex objects are represented. Thus, for the purpose of building deeper architectures, as well as more expressive layers, this inefficiency in the prediction phase must be handled.
The second issue with capsule networks is more theoretical, but nonetheless has implications in practice. This is the lack, in general, of theoretical guarantees on equivariance. Most capsule networks only use intuitive heuristics to learn transformation-robust spatial relations among components. This is acceptable, but not ideal. A capsule network model that can detect compositionalities in a provably invariant manner is more useful, and more in line with the basic motivations for capsules.
Both of the above issues are remedied in the following description of our model. First, instead of learning pairwise relationships among capsules, we learn to projectively encode a description of each capsule-type for every layer. This we do by associating each capsule-type with a vector-valued function, given by a trainable neural network. This network assumes the role of the prediction mechanism in capsule networks. We interpret the role of this network as a means of encoding the manifold of legal pose-variations for its associated capsule-type. It is expected that, given proper training, shallower capsules that have no relationship with a particular capsule-type will project themselves to a vector of low activation (for example, in two-norm) when input to the corresponding network. As an aside, it is this mechanism that gives the name to our model. We term this manifold the 'space-of-variation' of a capsule-type. Since we attempt to learn such spaces at each layer, we name our model 'space-of-variation' networks (SOVNET). In this design, the number of trainable networks for a given layer depends on the number of capsule-types of that layer.
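The difference in the number of prediction networks can be made concrete with a small count. The following is a minimal sketch; the window size and the numbers of capsule-types are hypothetical values chosen only for illustration, and the function names are not part of any released code.

```python
def pairwise_prediction_networks(window_h, window_w, n_shallow_types, n_deep_types):
    """One network per (window position, shallower type, deeper type) triple."""
    return window_h * window_w * n_shallow_types * n_deep_types


def sov_prediction_networks(n_deep_types):
    """One network per deeper capsule-type, shared over positions and shallower types."""
    return n_deep_types


# Hypothetical layer sizes (assumption): a 3x3 window and 32 capsule-types per layer.
pairwise = pairwise_prediction_networks(3, 3, 32, 32)
sov = sov_prediction_networks(32)
print(pairwise, sov)  # 9216 32
```

The count says nothing about the size of each network; it only illustrates why a per-type design leaves room for more expressive prediction mechanisms than a single matrix.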
As mentioned earlier, the choice of prediction networks and routing algorithm is important to having guarantees on learning transformation-invariant compositional relationships. Thus, in order to ensure equivariance, which we show is sufficient for the above, we use group-equivariant convolutions (GCNN) (Cohen and Welling, 2016) in the prediction phase. Thus, shallower capsules of a fixed type are input to a GCNN associated with a deeper capsule-type to obtain predictions for it. Apart from ensuring equivariance to transformations, GCNNs also allow for greater parameter-sharing (across a set of transformations), resulting in greater awareness of local object-structures. We argue that this could potentially improve the quality of predictions when compared to isolated predictions made by convolutional capsule layers, such as those of (Hinton et al., 2018).
The last contribution of this paper is an equivariant degree-centrality based routing algorithm. The main idea of this method is to treat each prediction for a capsule as a vertex of a graph, whose weighted edges are given by a similarity measure on the predictions themselves. Our method uses the softmaxed values of the degree scores of the affinity matrix of this graph as a set of weights for aggregating predictions. The key idea is that predictions that agree with a majority of other predictions for the same capsule get a larger weight, following the principle of routing-by-agreement. While this method is only heuristic in the sense of optimality, it is provably equivariant and preserves the capsule-decomposition of an input. We summarise the contributions of this paper in the following:

A general framework for a scalable capsule-network model.

A particular instantiation of this model that uses equivariant convolutions, and an equivariant, degree-centrality-based routing algorithm.

A graph-based framework for studying the representation of a capsule network, and the proof of the sufficiency of equivariance for the (qualified) preservation of this representation under transformations of the input.

A set of proof-of-concept, evaluative experiments on affinely transformed variations of MNIST, FashionMNIST, and CIFAR10, as well as separate experiments on KMNIST and SVHN that showcase the superior adaptability of SOVNET architectures to train- and test-time geometric perturbations of the data, as well as their general performance.
2 SOVNET, equivariance, and compositionality
We begin with essential definitions for a layer of SOVNET, and the properties we wish to guarantee. Given a group $G$, we formally describe the $l^{th}$ layer of a SOVNET architecture as the set of function-tuples $\{(f_i^l, a_i^l) \mid 0 \le i \le N_l - 1\}$. Here, $N_l$ denotes the number of capsule-types at the $l^{th}$ layer, $f_i^l : G \to \mathbb{R}^{d^l}$ is a functional description of the $d^l$-dimensional pose-vectors of instances of the $i^{th}$ capsule-type, and $a_i^l : G \to [0, 1]$ is a functional description of the corresponding activations.
We model each capsule-type as a function over a group of transformations so as to allow for formal guarantees on transformation-equivariance. Thus, we also model images as functions from a group to a representation-space. The main assumption is that the translation-group is a subgroup of the group in question. This is similar in approach to (Cohen and Welling, 2016). We wish for each capsule-type, both pose-wise and activation-wise, to display equivariance. We present a formal definition of this notion.
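Consider a group $G$ and vector spaces $V$, $W$. Let $T$ and $T'$ be two group-representations for elements of $G$ over $V$ and $W$, respectively. $\phi : V \to W$ is said to be equivariant with respect to $T$ and $T'$ if $\forall g \in G$, $\forall x \in V$, $\phi(T_g x) = T'_g \phi(x)$.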
This definition translates to a preservation of transformations in the input-space to the output-space, something that allows no loss of information in compositional structures. As in (Cohen and Welling, 2016), we restrict the notion of equivariance in our model by using the operator $L_g$ in place of the group-representation. $L_g$ is given by $[L_g f](h) = f(g^{-1} h)$. Thus, if $\star$ denotes an operation between two functions, we require $([L_g f] \star \Psi) = L_g (f \star \Psi)$. The operator $\star$ describes the change in representation space, and is dependent on the nature of the deep learning model. In our case, $\star$ is given by routing among capsules.
2.1 SOVNET layer
We define the capsule-types of a particular layer as an output of an agreement-based aggregation of predictions made by the preceding layer. A recursive application of this definition is enough to define a SOVNET architecture, given an initial set of capsules. A means of obtaining this initial set is given in section 3. We provide a general framework for the summation-based family of routing procedures in Algorithm 1.
The steps in Algorithm 1 are exactly the same as in any other capsule network: shallower capsules make predictions for deeper capsules, the importance of these predictions is scored, and finally these are used to build the deeper capsules. The extent of agreement among the predictions for a particular capsule of a type defines its activation. We must note that several other families of routing algorithms exist, especially those that do not use weighted summation. An example of such an algorithm is spectral routing (Bahadori, 2018). We do not, however, pursue a description of such methods as our aims are more limited in scope.
We instantiate the above algorithm to a specific model, as given in Algorithm 2. In this model, the prediction functions are group-equivariant convolutional filters, and the prediction operator is the corresponding group-equivariant correlation operator. The weights are, in this routing method, the softmaxed degree-scores of the affinities among predictions for the same deeper capsule. Further, as in dynamic routing (Sabour et al., 2017), we also assume that the activation of a capsule is given by its two-norm. To ensure that this value is in $[0, 1]$, we use the 'squash' function of dynamic routing. Thus, we do not mention it explicitly. Note that we also use subscript notation to denote that a variable is part of a vector, that is, a subscripted symbol denotes a single element of the corresponding pose-vector. This new routing algorithm is meant to serve as an alternative to existing iterative routing strategies such as dynamic routing. An important strength of our method is that it has no routing hyperparameter, such as the number of iterations in dynamic routing or EM routing.
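The degree-centrality routing step described above can be sketched for a single deeper capsule as follows. This is an illustrative sketch under stated assumptions rather than the released implementation: the similarity measure is assumed to be cosine similarity, the predictions are taken as a plain array instead of the output of a group-equivariant convolution, and the names `squash` and `degree_route` are hypothetical.

```python
import numpy as np


def squash(s, eps=1e-9):
    """Squash of dynamic routing: keeps the direction, maps the two-norm into [0, 1)."""
    sq_norm = np.sum(s * s, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)


def degree_route(predictions):
    """Aggregate predictions of shape (n, d) made for one deeper capsule.

    Assumption: affinities between predictions are cosine similarities.
    """
    norms = np.linalg.norm(predictions, axis=1, keepdims=True)
    unit = predictions / (norms + 1e-9)
    affinity = unit @ unit.T                 # (n, n) pairwise similarities
    degree = affinity.sum(axis=1)            # degree score of each prediction
    weights = np.exp(degree - degree.max())  # numerically stable softmax
    weights /= weights.sum()
    pose = squash(weights @ predictions)     # weighted sum, then squash
    return pose, weights


# Three predictions that roughly agree and one outlier (toy values).
preds = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [-1.0, 0.0]])
pose, weights = degree_route(preds)
print(weights)               # the outlier receives the smallest weight
print(np.linalg.norm(pose))  # activation of the deeper capsule, below 1
```

Since the weights are computed in a single pass from the affinity matrix, there is no iteration count to tune, which matches the absence of a routing hyperparameter noted above.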
2.2 Equivariance, compositionality and SOVNET
The SOVNET layer we introduced in Algorithm 2 is groupequivariant with respect to the group action , where  the set of transformations over which the groupconvolution is defined. For notational convenience, we define to be an operator that encapsulates the degreerouting procedure for with prediction networks . Thus, the capsuletype of the layer is functionally depicted as = , where = . The formal statement of this result is given below; the proof is presented in the appendix.
Theorem 2.1.
The SOVNET layer defined in Algorithm 2, and denoted by the operator as given above, satisfies = , where belongs to the underlying group of the equivariant convolution.
Proof.
The proof is given in the appendix. ∎
Equivariance is widely considered a desirable inductive bias for a variety of reasons. First, equivariance mirrors natural label-invariance under transformations. Second, it lends predictability to the output of a network under (fixed) transformations of the input. These, of course, lead to a greater robustness in handling transformations of the data. We aim at adding to this list by showing that equivariance guarantees the preservation of detected compositionalities in a SOVNET architecture. This is of course quite unsurprising, and has been a significant undercurrent of the capsule-network idea. Our work completes this intuition with a formal result.
We begin by first defining the notion of a capsule-decomposition graph. This graph is formed from the activations and the routing weights of a SOVNET. Specifically, given an input to a SOVNET model, each capsule of every type is a vertex in this graph. We construct an edge between capsules that are connected by routing, with the direction from the shallower capsule to the deeper capsule. Each of these edges is weighted by the corresponding routing coefficient. Capsules not related to each other by routing are not connected by an edge. This graph is a direct formalisation of the various detected compositionalities with their strengths.
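As an illustration of this construction, the sketch below builds a small capsule-decomposition graph from a table of routing coefficients and checks whether a given relabelling of capsules is a weight-preserving isomorphism. The vertex encoding as (layer, capsule-type, position) triples, the toy coefficients, and the function names are assumptions made for the example.

```python
def build_graph(routing):
    """Build (vertices, weighted edges) from routing coefficients.

    routing maps (shallower_vertex, deeper_vertex) -> routing coefficient, where
    each vertex is a hypothetical (layer, capsule_type, position) triple.
    Capsules that take part in no routing would be isolated vertices and are
    omitted in this toy version.
    """
    vertices = set()
    for shallow, deep in routing:
        vertices.update((shallow, deep))
    return vertices, dict(routing)


def is_isomorphism(graph_a, graph_b, mapping):
    """Check that `mapping` is a weight-preserving bijection from graph_a to graph_b."""
    vertices_a, edges_a = graph_a
    vertices_b, edges_b = graph_b
    image = {mapping[v] for v in vertices_a}
    if len(image) != len(vertices_a) or image != vertices_b:
        return False
    mapped = {(mapping[u], mapping[v]): w for (u, v), w in edges_a.items()}
    return mapped == edges_b


# Toy routing for an input, and for the same input shifted by one grid position.
original = {((0, 0, 0), (1, 0, 0)): 0.7, ((0, 0, 1), (1, 0, 0)): 0.3}
shifted = {((0, 0, 1), (1, 0, 1)): 0.7, ((0, 0, 2), (1, 0, 1)): 0.3}

graph_a, graph_b = build_graph(original), build_graph(shifted)
mapping = {v: (v[0], v[1], v[2] + 1) for v in graph_a[0]}
print(is_isomorphism(graph_a, graph_b, mapping))  # True
```

Here the poses (positions) of the capsules change under the shift, while the routing coefficients, and hence the detected part-whole structure, are unchanged.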
What should the ideal behaviour of this graph be under the change-of-viewpoint of an input? The answer to this lies in the expected behaviour of natural compositionalities. Thus, while the pose of objects, and their components, is changed under transformations of the input, the relative geometry is constant. Thus, it is desirable that the capsule-decomposition graphs of a particular input (and its transformed variations) be isomorphic to each other. We show that a SOVNET model that is equivariant with respect to a set of transformations satisfies the above property for that set. A more formal description of the capsule-decomposition graph, and the statement of the above theorem, are given below.
Consider a layer SOVNET model, whose routing procedure belongs to the family of methods given by Algorithm 1. Let us consider a fixed input : . We define the capsuledecomposition graph of such a model, for this input , as = . Here, and denote the vertexset and the edgeset, respectively. , where = , , . . denotes the pool of gridpositions that route to . This definition uses the positional, as well as type information of each capsule to uniquely identify it. Moreover, we also use the notation to denote .
Theorem 2.2.
Consider an layer SOVNET whose activations are routed according to a procedure belonging to the family given by Algorithm 1. Further, assume that this routing procedure is equivariant with respect to the group . Then, given an input and , and are isomorphic.
Proof.
The proof is given in the appendix. ∎
Based on the above theorem, and the fact that degree-centrality based routing is equivariant, the above result applies to SOVNET models that use Algorithm 2.
3 Experiments and results
This section presents a description of the experiments we performed. We conducted two sets of experiments: the first to compare SOVNET architectures to other capsule network baselines with respect to transformation robustness on classification, and the second to compare SOVNET to certain capsule as well as convolutional baselines based on classification performance. Before we present the details of these experiments, we briefly describe some details of the SOVNET architecture we used. We only present an outline; the code will be released pending publication.
The first detail of the architecture pertains to the construction of the first layer of capsules. While many approaches are possible, we used the following methodology that is similar in spirit to other capsule network models. The first layer of the SOVNET architectures we constructed uses a modified residual block that uses the SELU activation, along with group-equivariant convolutions. This is so as to allow a meaningful set of equivariant feature maps to be used for the creation of the first set of capsules. Intuition and some literature, for example (Rosario et al., 2019), suggest that the construction of primary capsules plays a significant role in the performance of the capsule network. Thus, it is necessary to build a sufficiently expressive layer that yields the first set of meaningful capsule-activations. To this end, each capsule-type in the primary capsule layer is associated with a group-convolution layer followed by a modified residual block. The convolutional feature-maps from the preceding layer pass through each of these sub-networks to yield the primary capsules. No routing is performed in this layer.
We now describe the SOVNET blocks. Since the design of SOVNET significantly reduces the number of prediction networks, and thereby the number of trainable parameters, we are able to build architectures in which each layer uses a more expressive prediction mechanism than a simple matrix. Specifically, each hidden layer of the SOVNET architectures we consider uses a (group-equivariant) modified residual block as the prediction mechanism. We use a SOVNET architecture that uses 5 hidden layers for MNIST, FashionMNIST, KMNIST, and SVHN, and a model that uses 6 hidden layers for CIFAR10. Unlike DeepCaps, another capsule network whose predictions use (regular) convolutions, each of the hidden layers of our SOVNET models uses degree-routing. The hidden layers of DeepCaps (excepting the last), in contrast, are not strictly capsule-based, being just convolutions whose outputs are reshaped to a capsule-form.
The output capsule-layer of SOVNET is designed similarly to the hidden capsule-layers, with the difference that the prediction-mechanism is a group-convolutional implementation of a fully-connected layer. In order to make a prediction for the class of an input, the maximum across the rotational (and reflectional) positions of the two-norm of the capsule-activations of this layer is taken for each class-type. This is an equivariant operation, as it corresponds to the subgroup-pooling of (Cohen and Welling, 2016). The prediction that this layer yields is the type of the capsule with the maximum two-norm.
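The class-prediction step can be sketched as below. The array layout (classes, group positions, capsule dimension) and the name `predict_class` are assumptions for the example; in a p4 model there would be four rotational positions, and eight for p4m.

```python
import numpy as np


def predict_class(class_capsules):
    """class_capsules: array of shape (n_classes, n_group_positions, capsule_dim)."""
    norms = np.linalg.norm(class_capsules, axis=-1)  # two-norm per class and position
    pooled = norms.max(axis=1)                       # max over rotational positions
    return int(np.argmax(pooled)), pooled


rng = np.random.default_rng(0)
capsules = rng.normal(size=(10, 4, 16))  # toy values: 10 classes, 4 rotations

label, pooled = predict_class(capsules)
# Cyclically shifting the rotational positions leaves the prediction unchanged.
shifted_label, shifted_pooled = predict_class(np.roll(capsules, 1, axis=1))
print(label == shifted_label)
```

Because the maximum is taken over all positions of the rotation axis, any permutation of that axis, such as the cyclic shift induced by rotating the input by 90°, yields the same pooled norms.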
In order to guarantee robustness to translations and rotations, we used p4-convolutions (Cohen and Welling, 2016) for the prediction mechanism in all the networks used in the first set of experiments. For the second set, we used the p4m-convolution (Cohen and Welling, 2016), which is equivariant to rotations, translations and reflections, for a greater ability to learn from augmentations. The architectures, however, are identical but for this difference.
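The equivariance that p4-convolutions provide can be checked numerically on a toy example. The sketch below implements a first-layer (lifting) p4 correlation with periodic boundaries in plain NumPy, which is an assumption made to keep the check exact; it is not the implementation used in our models, and the function names are hypothetical. Rotating the input by 90° should rotate each output map and cyclically shift the rotation axis.

```python
import numpy as np


def periodic_correlate(image, filt):
    """Correlate a square image with an odd-sized filter, using periodic boundaries."""
    half = filt.shape[0] // 2
    out = np.zeros(image.shape, dtype=float)
    for a in range(filt.shape[0]):
        for b in range(filt.shape[1]):
            # Rolling by the negated offset brings image[x + offset] to position x.
            shifted = np.roll(image, shift=(half - a, half - b), axis=(0, 1))
            out += filt[a, b] * shifted
    return out


def p4_lift(image, filt):
    """Stack the responses to the four 90-degree rotations of the filter."""
    return np.stack([periodic_correlate(image, np.rot90(filt, r)) for r in range(4)])


rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
filt = rng.normal(size=(3, 3))

# Left-hand side: lift the rotated image.
lhs = p4_lift(np.rot90(image), filt)
# Right-hand side: lift first, then rotate each map and shift the rotation axis.
rhs = np.rot90(np.roll(p4_lift(image, filt), 1, axis=0), axes=(1, 2))
print(np.allclose(lhs, rhs))
```

The same check with a rotation that is not a multiple of 90° would require interpolation and would not hold exactly, which is consistent with p4-convolutions being equivariant only to the four right-angle rotations (and translations).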
As in (Sabour et al., 2017), we used a margin loss and a regularising reconstruction loss to train the networks. For the first half of the training epochs, the positive and negative margins were set to 0.9 and 0.1, respectively, and the negative margin-loss was weighted by 0.5, as in (Sabour et al., 2017). In order to facilitate better predictions, these values were changed to 0.95, 0.05, and 0.8, respectively, for the second half of the training. We adopt this from (Rajasegaran et al., 2019). The reconstruction loss was computed by masking the incorrect classes, and by feeding the 'true' class-capsule to a series of transposed convolutions to reconstruct the image. The mean squared error between the reconstruction and the original image was then computed. The main idea is that this loss guides the capsule network to build meaningful capsules. This loss was weighted by 0.0005 as in (Sabour et al., 2017). We used the Adam optimiser and an exponential learning rate scheduler that multiplied the learning rate by a factor of 0.9 each epoch.
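The margin loss and the two-phase schedule described above can be sketched as follows. The loss follows the form given in (Sabour et al., 2017); the function names, the per-example formulation, and the initial learning rate used in the usage lines are assumptions, since the text does not specify them.

```python
import numpy as np


def margin_loss(norms, target, m_pos, m_neg, neg_weight):
    """Margin loss of Sabour et al. (2017) for a single example.

    norms: two-norms of the class capsules, shape (n_classes,)
    target: index of the true class
    """
    one_hot = np.zeros_like(norms)
    one_hot[target] = 1.0
    positive = one_hot * np.maximum(0.0, m_pos - norms) ** 2
    negative = neg_weight * (1.0 - one_hot) * np.maximum(0.0, norms - m_neg) ** 2
    return float(np.sum(positive + negative))


def margin_settings(epoch, total_epochs):
    """(positive margin, negative margin, negative weight) for each half of training."""
    return (0.9, 0.1, 0.5) if epoch < total_epochs / 2 else (0.95, 0.05, 0.8)


def learning_rate(initial_lr, epoch, decay=0.9):
    """Exponential schedule: the rate is multiplied by `decay` after every epoch."""
    return initial_lr * decay ** epoch


norms = np.array([0.5, 0.3])  # toy capsule norms for a two-class problem
loss = margin_loss(norms, 0, *margin_settings(epoch=0, total_epochs=10))
print(loss)                      # 0.4**2 + 0.5 * 0.2**2, approximately 0.18
print(learning_rate(1e-3, 2))    # hypothetical initial rate of 1e-3
```

The total training objective would add the reconstruction mean squared error, weighted by 0.0005, to this margin loss.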
With this outline of the architecture and details of the training, we now describe the first set of experiments we conducted on SOVNET. The preservation of detected compositionalities under transformations in SOVNET leads us to the expectation that SOVNET models, when properly trained, will display greater robustness to changes in viewpoint of the input. Apart from handling test-time transformations, as is the commonly held notion of transformation robustness, a robust model must also effectively learn from train-time perturbations of the data. Based on these ideas, we designed a set of experiments that compare SOVNET architectures to other capsule networks on their ability to handle train- and test-time affine transformations of the data.
Specifically, we perform experiments on MNIST (LeCun and Cortes, 2010), FashionMNIST (Xiao et al., 2017), and CIFAR10 (Krizhevsky and Hinton, 2009). For each of these datasets, we created 5 variations of the train- and test-splits by randomly transforming data according to the extents of the transformations given in Table 1. We train a given model on each transformed version of the training-split, and test each model on each of the versions of the test-split. Thus we obtain, for a single model, 25 accuracies per dataset, each corresponding to a pair of train- and test-splits. There is a single modification to these transformations for the case of CIFAR10. In order to compare SOVNET against the closest competitor, DeepCaps, we use their strategy of first resizing CIFAR10 images to 64×64, followed by translations and rotations.
We tested SOVNET against four capsule network baselines, namely Capsnet (Sabour et al., 2017), EMcaps (Hinton et al., 2018), DeepCaps (Rajasegaran et al., 2019), and GCaps (Lenssen et al., 2018). The results of these experiments are given in Tables 2 to 4. In the majority of the cases, SOVNET obtains the highest accuracy, showing that it is more robust to transformations of the data. Note that we had to conduct these experiments as such a robustness study was not done in the original papers for the baselines. We used, and modified, code from the following GitHub sources for the implementation of the baselines: (Li, 2019) for Capsnet; (Yang, 2019) for EMcaps; (Rajasegaran, 2019) and (HopefulRational, 2019) for DeepCaps; and (Lenssen, 2019) for GCaps.
The second set of experiments we conducted tested SOVNET against several capsule as well as convolutional baselines. We trained and tested SOVNET on KMNIST (Clanuwat et al., 2018) and SVHN (Netzer et al., 2011). With fairly standard augmentation, namely mild translations (and resizing to 64×64 for SVHN), the SOVNET architecture with p4m-convolutions achieved performance on par with, or better than, the baselines. The results of this experiment are in Table 5. In order to compare the performance of SOVNET architectures against more sophisticated CNN-baselines, we also trained ResNet18 and ResNet34 on the most extreme of the listed transformations: translation by 2 pixels, and rotation by 180°. The results of these experiments are presented in the appendix.
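Table 1: Extents of the random transformations used to create the five variations of the train- and test-splits.
S.no.  Translational extent  Rotational extent
1  0 pixels  0°
2  2 pixels  30°
3  2 pixels  60°
4  2 pixels  90°
5  2 pixels  180°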
Results on Training on Untransformed MNIST  
Method  (0,0°)  (2,30°)  (2,60°)  (2,90°)  (2,180°) 
Capsnet  99.35%  91.57%  72.10%  55.27%  42.58% 
EMcaps  99.09%  92.23%  72.83%  56.66%  42.95% 
GCaps  97.83%  82.59%  66.27%  56.63%  54.52% 
DeepCaps  99.56%  94.61%  74.44%  57.24%  45.43% 
SOVNET  99.68%  96.15%  80.53%  64.55%  51.02% 
Results on Training on MNIST Transformed by (2,30°)  
Method  (0,0°)  (2,30°)  (2,60°)  (2,90°)  (2,180°) 
Capsnet  99.60%  99.39%  95.65%  79.53%  59.58% 
EMcaps  99.36%  99.03%  94.91%  79.12%  59.03% 
GCaps  98.12%  96.17%  90.87%  81.34%  77.13% 
DeepCaps  99.62%  99.57%  97.50%  84.16%  62.75% 
SOVNET  99.77%  99.70%  98.86%  90.63%  69.26% 
Results on Training on MNIST Transformed by (2,60°)  
Method  (0,0°)  (2,30°)  (2,60°)  (2,90°)  (2,180°) 
Capsnet  99.39%  99.12%  98.99%  95.53%  72.06% 
EMcaps  98.84%  98.79%  98.55%  94.03%  70.03% 
GCaps  97.44%  96.31%  96.01%  93.18%  81.70% 
DeepCaps  99.54%  99.49%  99.42%  97.27%  73.61% 
SOVNET  99.70%  99.65%  99.63%  98.56%  79.59% 
Results on Training on MNIST Transformed by (2,90°)  
Method  (0,0°)  (2,30°)  (2,60°)  (2,90°)  (2,180°) 
Capsnet  99.17%  98.77%  98.73%  98.29%  79.18% 
EMcaps  98.83%  98.38%  98.42%  97.86%  77.47% 
GCaps  97.67%  96.53%  96.33%  95.52%  83.76% 
DeepCaps  99.44%  99.16%  99.03%  98.64%  77.54% 
SOVNET  99.68%  99.60%  99.59%  99.50%  87.76%
Results on Training on MNIST Transformed by (2,180°)  
Method  (0,0°)  (2,30°)  (2,60°)  (2,90°)  (2,180°) 
Capsnet  97.52%  96.65%  96.64%  96.50%  96.09% 
EMcaps  95.89%  95.22%  95.42%  95.42%  95.09% 
GCaps  95.24%  93.67%  93.83%  93.79%  93.76% 
DeepCaps  98.17%  97.84%  97.89%  98.11%  98.01% 
SOVNET  98.34%  98.10%  98.11%  98.08%  98.06% 
Results on Training on Untransformed FashionMNIST  
Method  (0,0°)  (2,30°)  (2,60°)  (2,90°)  (2,180°) 
Capsnet  91.23%  57.15%  37.98%  28.33%  22.38% 
EMcaps  90.05%  59.75%  40.26%  30.17%  23.82% 
GCaps  86.56%  50.05%  35.05%  29.93%  27.10% 
DeepCaps  93.27%  57.85%  37.06%  27.63%  21.86% 
SOVNET  94.72%  61.58%  41.01%  34.07%  27.63% 
Results on Training on FashionMNIST Transformed by (2,30°)  
Method  (0,0°)  (2,30°)  (2,60°)  (2,90°)  (2,180°) 
Capsnet  91.22%  89.57%  69.58%  50.17%  35.16% 
EMcaps  90.17%  89.47%  68.39%  49.23%  37.02% 
GCaps  83.28%  80.12%  64.86%  53.71%  52.54% 
DeepCaps  93.71%  93.40%  75.32%  53.35%  36.30% 
SOVNET  94.99%  94.36%  77.19%  58.59%  43.84% 
Results on Training on FashionMNIST Transformed by (2,60°)  
Method  (0,0°)  (2,30°)  (2,60°)  (2,90°)  (2,180°) 
Capsnet  89.98%  88.55%  88.15%  72.81%  46.89% 
EMcaps  88.24%  87.30%  87.04%  71.72%  48.14% 
GCaps  82.04%  80.12%  78.94%  68.05%  59.25% 
DeepCaps  93.36%  93.06%  92.84%  80.76%  49.90% 
SOVNET  94.49%  94.08%  94.20%  90.23%  73.48% 
Results on Training on FashionMNIST Transformed by (2,90°)  
Method  (0,0°)  (2,30°)  (2,60°)  (2,90°)  (2,180°) 
Capsnet  88.78%  87.18%  87.13%  86.19%  59.59% 
EMcaps  86.43%  85.85%  85.82%  85.63%  61.15% 
GCaps  80.71%  79.55%  79.17%  79.21%  72.11% 
DeepCaps  93.07%  92.93%  92.75%  92.51%  62.50% 
SOVNET  94.41%  94.03%  93.93%  93.98%  91.42% 
Results on Training on FashionMNIST Transformed by (2,180°)  
Method  (0,0°)  (2,30°)  (2,60°)  (2,90°)  (2,180°) 
Capsnet  86.90%  84.94%  84.93%  84.75%  84.72% 
EMcaps  82.99%  82.67%  82.18%  82.32%  82.18% 
GCaps  80.65%  79.66%  79.46%  79.47%  79.37% 
DeepCaps  92.07%  91.71%  91.70%  91.76%  91.66% 
SOVNET  94.11%  93.77%  93.56%  93.57%  93.60% 
Results on Training on Untransformed Cifar10  
Method  (0,0°)  (2,30°)  (2,60°)  (2,90°)  (2,180°) 
Capsnet  68.28%  55.57%  43.55%  37.48%  30.89% 
EMcaps  62.85%  49.28%  41.37%  34.73%  29.90% 
GCaps  49.54%  38.45%  31.89%  30.88%  27.70% 
DeepCaps  76.76%  67.97%  53.56%  45.22%  35.67% 
SOVNET  88.34%  47.57%  42.24%  43.75%  43.52% 
Results on Training on Cifar10 Transformed by (2,30°)  
Method  (0,0°)  (2,30°)  (2,60°)  (2,90°)  (2,180°) 
Capsnet  73.45%  69.87%  61.17%  52.29%  42.58%
EMcaps  70.24%  66.63%  59.10%  50.93%  42.26% 
GCaps  49.50%  48.88%  45.78%  42.93%  38.74% 
DeepCaps  84.24%  82.54%  74.63%  63.54%  48.63% 
SOVNET  86.58%  85.35%  82.51%  79.14%  69.64% 
Results on Training on Cifar10 Transformed by (2,60°)  
Method  (0,0°)  (2,30°)  (2,60°)  (2,90°)  (2,180°) 
Capsnet  70.26%  67.69%  66.62%  60.04%  47.99% 
EMcaps  66.53%  65.09%  63.21%  58.04%  47.61% 
GCaps  49.63%  50.31%  48.84%  47.43%  43.11% 
DeepCaps  83.92%  83.63%  82.79%  78.09%  60.02% 
SOVNET  83.86%  83.63%  83.57%  83.06%  80.89% 
Results on Training on Cifar10 Transformed by (2,90°)  
Method  (0,0°)  (2,30°)  (2,60°)  (2,90°)  (2,180°) 
Capsnet  67.81%  65.64%  65.46%  64.35%  52.79% 
EMcaps  64.33%  63.00%  62.70%  61.42%  52.08% 
GCaps  49.98%  51.24%  50.63%  49.95%  46.59% 
DeepCaps  82.91%  82.78%  82.66%  82.62%  68.34% 
SOVNET  83.33%  82.76%  82.58%  82.79%  82.22% 
Results on Training on Cifar10 Transformed by (2,180°)  
Method  (0,0°)  (2,30°)  (2,60°)  (2,90°)  (2,180°) 
Capsnet  61.08%  59.53%  60.04%  59.85%  59.90% 
EMcaps  57.57%  55.89%  56.85%  56.35%  55.20% 
GCaps  39.09%  41.03%  41.43%  41.25%  41.08% 
DeepCaps  81.12%  80.81%  80.64%  81.05%  80.92% 
SOVNET  82.50%  81.80%  81.78%  81.95%  81.82% 
4 Discussion and related work
A number of insights can be drawn from the accuracies obtained in the experiments. The first, and most obvious, is that SOVNET is significantly more robust to train- and test-time geometric transformations of the input. Indeed, SOVNET learns to use even extreme transformations of the training data, and generalises better to test-time transformations in a majority of the cases. However, in certain splits, some baselines perform better than SOVNET. These cases are briefly discussed below.
On the CIFAR10 experiments, DeepCaps performs significantly better than SOVNET in the untransformed training case, generalising better to most test-time transformations. However, SOVNET learns from train-time transformations better than DeepCaps, outperforming it in a large majority of the other cases. We hypothesize that the first observation is due to the larger (almost double) number of parameters of DeepCaps, which allows it to learn features that generalise better to transformations. Further, as p4-convolutions (the prediction-mechanisms used in SOVNET) are equivariant only to rotations in multiples of 90°, the performance of SOVNET is significantly lower for test-time transformations of 30° and 60° in the untransformed case. However, the equivariance of SOVNET allows it to learn better from train-time geometric transforms than DeepCaps, explaining the second observation.
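The restriction of p4-convolutions to rotations in multiples of 90° can be checked numerically. The following is a minimal NumPy sketch, not the SOVNET implementation: the function names are hypothetical, and a single-channel lifting correlation stands in for the full prediction-mechanism. It verifies that rotating the input by 90° rotates each response map and cyclically shifts the rotation axis. No such exact identity holds on the pixel grid for 30° or 60°.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def correlate_valid(image, kernel):
    """Plain 2D 'valid' cross-correlation of a square image with a square kernel."""
    windows = sliding_window_view(image, kernel.shape)  # shape (m, m, k, k)
    return np.einsum("ijab,ab->ij", windows, kernel)


def p4_lifting_correlation(image, kernel):
    """Correlate the image with the four 90-degree rotations of one kernel.

    Returns an array of shape (4, m, m): one response map per rotation.
    """
    return np.stack([correlate_valid(image, np.rot90(kernel, r)) for r in range(4)])


def rotate_p4_feature(feature):
    """Action of a 90-degree rotation on a p4 feature map:
    rotate every spatial map and cyclically shift the rotation axis."""
    return np.stack([np.rot90(feature[(r - 1) % 4]) for r in range(4)])


rng = np.random.default_rng(0)
image = rng.normal(size=(9, 9))    # illustrative input, not a dataset sample
kernel = rng.normal(size=(3, 3))   # illustrative filter

# Equivariance: transforming the input first, or the output afterwards, agrees.
lhs = p4_lifting_correlation(np.rot90(image), kernel)
rhs = rotate_p4_feature(p4_lifting_correlation(image, kernel))
assert np.allclose(lhs, rhs)
```

The check uses 'valid' correlation on a square image so that no padding artefacts break the identity.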
The second case is that GCaps outperforms SOVNET on generalising to extreme transformations on (mainly) MNIST, and once on FashionMNIST, under mild train-time conditions. However, it is unable to sustain this under more extreme train-time perturbations. We infer that this is caused largely by the explicit geometric parameterisation of capsules in GCaps. While this approach could yield better results under mild-to-moderate train-time conditions and on simple datasets, the parameterisation, especially with very simple prediction-mechanisms, can otherwise prove detrimental. Thus, the convolutional nature of the prediction-mechanisms, which can capture more complex features, and the greater depth of SOVNET allow it to learn better from more complex training scenarios. This makes the case for deeper models with more expressive and equivariant prediction-mechanisms.
A related point of interest is that GCaps performs very poorly on the CIFAR10 dataset, achieving the lowest accuracy in most cases on this dataset, despite provable guarantees on equivariance. We argue that this is largely due to the nature of the capsules of this model. In GCaps, each capsule is explicitly modelled as an element of a Lie group. Thus, capsules capture exclusively geometric information, and use only this information for routing. In contrast, other capsule models have no such parameterisation. In the case of CIFAR10, where non-geometric features such as texture are important, we see that purely spatio-geometric routing is not effective. This observation allows us to make a more general hypothesis concerning the fundamentals of capsule networks. We propose a trade-off in capsule networks, based on the notion of equivariance. To appreciate this, some background is necessary on both equivariance and capsule networks.
As the body of literature concerning equivariance is quite vast, we only mention a relevant selection of papers. Equivariance can be seen as a desirable, if not fundamental, inductive bias for neural networks used in computer vision. Indeed, the fact that AlexNet (Krizhevsky et al., 2012) automatically learns representations that are equivariant to flips, rotation and scaling (Lenc and Vedaldi, 2015) shows the importance of equivariance as well as its natural necessity. Thus, a neural network model that can formally guarantee this property is essential. An early work in this regard is the work on group-equivariant convolutions proposed in (Cohen and Welling, 2016). There, the authors proposed a generalisation of the 2D spatial convolution operation to act on a general group of symmetry transforms, increasing the parameter-sharing and, thereby, improving performance. Since then, several other models exhibiting equivariance to certain groups of transformations have been proposed, for example (Cohen et al., 2018), where a spherical correlation operator that exhibits rotation-equivariance was introduced; (Carlos Esteves and Daniilidis, 2017), where a network equivariant to rotation and scale, but invariant to translations, was presented; and (Worrall and Brostow, 2018), where a model equivariant to translations and 3D right-angled rotations was developed.
A fundamental issue with a general group-equivariant convolutional model is the fact that the grid the convolution works with grows exponentially with the number of types of transformations considered. This was pointed out in (Sabour et al., 2017); capsules were proposed as an efficient alternative. In a general capsule network model, each capsule is supposed to represent the pose-coordinates of an object-component. Thus, to increase the scope of equivariance, only a linear increase in the dimension of each capsule is necessary. This was however not formalised in most capsule architectures, which focused on other aspects such as routing (Hinton et al., 2018; Bahadori, 2018; Wang and Liu, 2018); general architecture (Rajasegaran et al., 2019; Deliège et al., 2018; Rawlinson et al., 2018; Jeong et al., 2019; Phaye et al., 2018; Rosario et al., 2019); or application (Afshar et al., 2018).
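The contrast between exponential and linear growth can be made concrete with a small arithmetic sketch. The numbers and function names below are purely illustrative assumptions: each transformation type is taken to be discretised into the same number of samples, and each type is taken to add a fixed number of coordinates to a capsule's pose.

```python
def gcnn_grid_points(samples_per_type, num_types):
    """Grid size when every transformation type is discretised independently."""
    return samples_per_type ** num_types


def capsule_pose_dimension(num_types, dims_per_type=1):
    """Pose-vector length when each transformation type adds a few coordinates."""
    return num_types * dims_per_type


# Hypothetical setting: 8 samples per transformation type.
for num_types in range(1, 5):
    print(num_types, gcnn_grid_points(8, num_types), capsule_pose_dimension(num_types))
```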
It was only in group-equivariant capsules (Lenssen et al., 2018) that this idea of efficient equivariance was formalised. Indeed, in that paper the notion of equivariance changed from preserving the action of a group on a vector space to preserving the group-transformation on an element. While such models scale well to larger transformation groups in the sense of preserving equivariance guarantees, we argue that they cannot efficiently handle compositionalities that involve more than spatial geometry. The direct use of capsules as geometric pose-coordinates could lead to exponential representational inefficiencies in the number of capsules. This is the trade-off we referred to. We do not attempt a formalisation of this, and instead make the following observation. While SOVNET (using GCNNs) lacks transformational efficiency, the use of convolutions allows it to capture non-geometric structures well. Further, SOVNET still retains the advantage of learning compositional structures better than CNN models due to the use of routing, placing it in a favourable position between two extremes.
5 Conclusion
We presented a general model for capsule networks that is scalable to deep architectures. We introduced a prediction mechanism based on group-equivariant convolutions, and a routing procedure based on degree-centrality. We showed that this model is equivariant, depending on the nature of the equivariant convolutions. We proved that this results in a preservation of detected compositionalities under transformations. We presented the results of experiments on affine variations of various classification datasets, and showed that our model performs better than several capsule network baselines. A second set of experiments showed that our model performs comparably to convolutional baselines on two other datasets. We also discussed a possible trade-off between efficiency in the transformational sense and efficiency in the representation of non-geometric compositional relations. As future work, we aim at understanding the role of the routing algorithm in the optimality of the capsule-decomposition graph, and various other properties of interest based on it. We also note that SOVNET allows other equivariant prediction mechanisms, each of which could result in a wider application of SOVNET to different domains.
References
Afshar et al. (2018) P. Afshar, A. Mohammadi, and K. N. Plataniotis. Brain tumor type classification via capsule networks. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 3129–3133. IEEE, 2018.
Bahadori (2018) M. T. Bahadori. Spectral capsule networks. 2018.
Carlos Esteves and Daniilidis (2017) X. Z. Carlos Esteves, Christine Allen-Blanchette and K. Daniilidis. Polar transformer networks. CoRR, abs/1709.01889, 2017. URL http://arxiv.org/abs/1709.01889.
Clanuwat et al. (2018) T. Clanuwat, M. Bober-Irizar, A. Kitamoto, A. Lamb, K. Yamamoto, and D. Ha. Deep learning for classical japanese literature. arXiv preprint arXiv:1812.01718, 2018.
Cohen and Welling (2016) T. Cohen and M. Welling. Group equivariant convolutional networks. In International conference on machine learning, pages 2990–2999, 2016.
Cohen et al. (2018) T. S. Cohen, M. Geiger, J. Köhler, and M. Welling. Spherical cnns. arXiv preprint arXiv:1801.10130, 2018.
Deliège et al. (2018) A. Deliège, A. Cioppa, and M. Van Droogenbroeck. Hitnet: a neural network with capsules embedded in a hit-or-miss layer, extended with hybrid data augmentation and ghost capsules. arXiv preprint arXiv:1806.06519, 2018.
Hinton et al. (2011) G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.
Hinton et al. (2018) G. E. Hinton, S. Sabour, and N. Frosst. Matrix capsules with em routing. 2018.
HopefulRational (2019) HopefulRational. PyTorch Implementation of "DeepCaps: Going Deeper with Capsule Networks" by Jathushan Rajasegaran et al.: HopefulRational/DeepCapsPyTorch, Sept. 2019. URL https://github.com/HopefulRational/DeepCapsPyTorch. original-date: 2019-07-31T16:16:24Z.
Jeong et al. (2019) T. Jeong, Y. Lee, and H. Kim. Ladder capsule network. In International Conference on Machine Learning, pages 3071–3079, 2019.
Krizhevsky and Hinton (2009) A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
LeCun and Cortes (2010) Y. LeCun and C. Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
Lenc and Vedaldi (2015) K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 991–999, 2015.
Lenssen (2019) J. E. Lenssen. Pytorch implementation of Group Equivariant Capsule Networks: mrjel/group_equivariant_capsules_pytorch, July 2019. URL https://github.com/mrjel/group_equivariant_capsules_pytorch. original-date: 2018-10-05T13:23:40Z.
Lenssen et al. (2018) J. E. Lenssen, M. Fey, and P. Libuschewski. Group equivariant capsule networks. arXiv preprint arXiv:1806.05086, 2018.
Li (2019) E. Li. Empirical studies on Capsule Network representation and improvements implemented with PyTorch.: ethanleet/CapsNet, Sept. 2019. URL https://github.com/ethanleet/CapsNet. original-date: 2018-04-30T23:44:26Z.
Netzer et al. (2011) Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.
Phaye et al. (2018) S. S. R. Phaye, A. Sikka, A. Dhall, and D. Bathula. Dense and diverse capsule networks: Making the capsules learn better. arXiv preprint arXiv:1805.04001, 2018.
Rajasegaran (2019) J. Rajasegaran. Official Implementation of "DeepCaps: Going Deeper with Capsule Networks" paper (CVPR 2019).: brjathu/deepcaps, Sept. 2019. URL https://github.com/brjathu/deepcaps. original-date: 2019-03-15T07:42:55Z.
Rajasegaran et al. (2019) J. Rajasegaran, V. Jayasundara, S. Jayasekara, H. Jayasekara, S. Seneviratne, and R. Rodrigo. Deepcaps: Going deeper with capsule networks. arXiv preprint arXiv:1904.09546, 2019.
Rawlinson et al. (2018) D. Rawlinson, A. Ahmed, and G. Kowadlo. Sparse unsupervised capsules generalize better. arXiv preprint arXiv:1804.06094, 2018.
Rosario et al. (2019) V. M. d. Rosario, E. Borin, and M. Breternitz Jr. The multi-lane capsule network (mlcn). arXiv preprint arXiv:1902.08431, 2019.
Sabour et al. (2017) S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856–3866, 2017.
Tissera et al. (2019) D. Tissera, K. Kahatapitiya, R. Wijesinghe, S. Fernando, and R. Rodrigo. Context-aware multipath networks. arXiv preprint arXiv:1907.11519, 2019.
Wang and Liu (2018) D. Wang and Q. Liu. An optimization view on dynamic routing between capsules. 2018.
Worrall and Brostow (2018) D. Worrall and G. Brostow. Cubenet: Equivariance to 3d rotation and translation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 567–584, 2018.
Xiao et al. (2017) H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
Yang (2019) L. Yang. A PyTorch Implementation of Matrix Capsules with EM Routing: yl1993/MatrixCapsulesEMPyTorch, Sept. 2019. URL https://github.com/yl1993/MatrixCapsulesEMPyTorch. original-date: 2018-04-21T07:33:47Z.
Appendix A Appendix
We present proofs for the theorems mentioned in the main body.
Theorem A.1.
The SOVNET layer defined in Algorithm 2, and denoted by the operator as given above, satisfies = , where belongs to the underlying group of the equivariant convolution.
Proof.
For the theorem to be true, we must show that each step of Algorithm 2 is equivariant. We do this stepwise.
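The predictions made in the first step are group-equivariant. This follows from the fact that each prediction is a group-equivariant correlation of the input capsules with a trainable filter, and that such correlations commute with the action of the group, as proved in (Cohen and Welling, 2016).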
We now show that the procedure is equivariant. We see that = ; . Each = . From the equivariance of , = = = = .
Moreover, the two-norm of an equivariant map is also equivariant, which follows from the equivariance of the post-composition of nonlinearities over equivariant maps (Cohen and Welling, 2016). Also, the division of two (nonzero) equivariant maps is equivariant. Thus, obtaining the degree-scores is equivariant. Again, the softmax function preserves the equivariance as it is a pointwise nonlinearity.
The proof is concluded by pointing out that the product and sum of equivariant maps are also equivariant. ∎
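The routing part of this argument can be illustrated numerically. The sketch below is a hedged stand-in for the routing step of Algorithm 2, not a reproduction of it: it assumes that the degree-score of a prediction is the sum of its cosine similarities with all predictions for the same deeper capsule, and it models the action of a group element as a permutation of the flattened grid positions. Function and variable names are hypothetical. Because every operation (norm, division, softmax over input capsules, weighted sum) acts separately at each grid position, routing commutes with the permutation.

```python
import numpy as np


def degree_centrality_routing(predictions, eps=1e-12):
    """Route predictions of shape (n_in, n_positions, dim) for one deeper capsule-type.

    Returns (outputs, coefficients) with shapes (n_positions, dim) and
    (n_positions, n_in).
    """
    norms = np.linalg.norm(predictions, axis=-1, keepdims=True)
    unit = predictions / (norms + eps)
    # Cosine similarity between every pair of predictions at each position.
    agreement = np.einsum("ipd,jpd->pij", unit, unit)
    # Assumed degree-score: row sums of the agreement matrix.
    degree = agreement.sum(axis=-1)
    degree = degree - degree.max(axis=-1, keepdims=True)  # stabilise the softmax
    weights = np.exp(degree)
    coefficients = weights / weights.sum(axis=-1, keepdims=True)
    outputs = np.einsum("pi,ipd->pd", coefficients, predictions)
    return outputs, coefficients


rng = np.random.default_rng(0)
predictions = rng.normal(size=(5, 16, 8))  # 5 input capsules, 16 positions, dim 8
permutation = rng.permutation(16)          # stand-in for a group action on the grid

outputs, coefficients = degree_centrality_routing(predictions)
moved_outputs, moved_coefficients = degree_centrality_routing(predictions[:, permutation])

# Transforming the predictions first, or the routed outputs afterwards, agrees.
assert np.allclose(moved_outputs, outputs[permutation])
assert np.allclose(moved_coefficients, coefficients[permutation])
```

Combined with equivariant predictions from the first step, this is the property the theorem requires of the whole layer.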
Theorem A.2.
Consider an layer SOVNET whose activations are routed according to a procedure belonging to the family given by Algorithm 1. Further, assume that this routing procedure is equivariant with respect to the group . Then, given an input and , and are isomorphic.
Proof.
Consider a fixed layer SOVNET that is equivariant to transformations from a group , and an input . Let be the capsule-decomposition graph corresponding to . Then denotes the capsule-decomposition graph of the transformed input .
We show that the map is an isomorphism from to . First, we note that is a bijection from to . This is from the definition of the vertex set of a capsuledecomposition graph and the fact that the map is a bijection.
We now show that if and only if .
First, let us assume . Thus, is routed to with routingcoefficient . However, due to the assumed equivariance of the model, is routed to with routingcoefficient . This, of course, implies .
The converse of this result is proved in the same way by considering , noting that = , and applying the above result to and . ∎
In order to compare SOVNET with more sophisticated CNN models, we performed a limited set of experiments on MNIST and FashionMNIST. We trained ResNet18 and ResNet34 on the train splits of these datasets, transformed by random translations of up to 2 pixels and random rotations of up to 180°. The models were tested on various transformed versions of the test-splits. The results of these experiments are given in Table 6. As can be seen in the table, SOVNET performs comparably to the two much deeper CNN models. More testing on more complex datasets, as well as with deeper SOVNET models, must be done, however, to obtain a better understanding of the relative performance of these two kinds of models. We also performed this experiment on FashionMNIST for a simple 8-layer deep GCNN based on p4-convolutions.
Results on Training on MNIST Transformed by (2,180°)  
Method  (0,0°)  (2,30°)  (2,60°)  (2,90°)  (2,180°) 
ResNet18  98.60%  98.30%  98.21%  98.15%  98.02% 
ResNet34  98.53%  98.26%  98.21%  98.12%  98.01% 
SOVNET  98.34%  98.10%  98.11%  98.08%  98.06% 
Results on Training on FashionMNIST Transformed by (2,180°)  
Method  (0,0°)  (2,30°)  (2,60°)  (2,90°)  (2,180°) 
ResNet18  94.21%  93.55%  93.24%  93.30%  93.45% 
ResNet34  94.38%  93.75%  93.78%  93.78%  93.73% 
SimpleP4  80.00%  79.15%  78.98%  79.00%  78.97% 
SOVNET  94.11%  93.77%  93.56%  93.57%  93.60% 
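The transformation pairs in the table headers, such as (2,30°), correspond to random translations of up to 2 pixels and random rotations of up to 30°, following the description of the training transformations above. A minimal NumPy sketch of how one such transformed image might be generated is given below. The function names, the nearest-neighbour resampling and the zero padding are illustrative assumptions, not the augmentation code used for the reported results.

```python
import numpy as np


def shift_rotate(image, dy, dx, angle_deg):
    """Rotate a 2D image about its centre by angle_deg, then shift it by (dy, dx) pixels.

    Uses inverse mapping with nearest-neighbour sampling; output pixels whose
    source falls outside the image are set to zero.
    """
    height, width = image.shape
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    ys, xs = np.mgrid[0:height, 0:width]
    # Undo the shift, then undo the rotation, to find each source pixel.
    y0, x0 = ys - cy - dy, xs - cx - dx
    theta = np.deg2rad(angle_deg)
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    src_y = np.rint(cos_t * y0 + sin_t * x0 + cy).astype(int)
    src_x = np.rint(-sin_t * y0 + cos_t * x0 + cx).astype(int)
    inside = (src_y >= 0) & (src_y < height) & (src_x >= 0) & (src_x < width)
    out = np.zeros_like(image)
    out[inside] = image[src_y[inside], src_x[inside]]
    return out


def random_shift_rotate(image, max_shift, max_angle_deg, rng):
    """Sample a translation of up to max_shift pixels and a rotation of up to max_angle_deg."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    angle = rng.uniform(-max_angle_deg, max_angle_deg)
    return shift_rotate(image, dy, dx, angle)


rng = np.random.default_rng(0)
image = rng.random((28, 28))  # placeholder for a 28x28 test image
transformed = random_shift_rotate(image, max_shift=2, max_angle_deg=30.0, rng=rng)
```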