Learning to compute inner consensus
A novel approach to modeling agreement between Capsules (Supported by Calouste Gulbenkian Foundation)
This project considers Capsule Networks, a recently introduced machine learning model that has shown promising results regarding generalization and the preservation of spatial information with few parameters. The routing procedures proposed for Capsule Networks so far establish, a priori, how the routing relations are modeled, which limits the expressiveness of the underlying model. In this project, we propose two distinct ways in which the routing procedure can be learned like any other network parameter.
Starting with the developments made by Frank Rosenblatt on the Perceptron algorithm , innovative techniques have marked the beginning of a new, biologically inspired approach in Artificial Intelligence  that is, surprisingly, better suited to dealing with naturally unstructured data.
The field now called Deep Learning has expanded these ideas by creating models that stack multiple layers of Perceptrons. These Multilayer Perceptrons, commonly known as Neural Networks , achieve greater representational capacity than their precursor, owing to the layered manner in which computational complexity is added. Because of this compositional approach, they are especially well suited to learning a nested hierarchy of concepts .
As an approach to soft computing, Neural Networks stand in opposition to the precisely stated view of analytical algorithms, which, unlike the human mind, are not tolerant of imprecision, uncertainty, partial truth, or approximation . In conjunction with other Deep Learning models, they stand at the vanguard of Artificial Intelligence research, employed in tasks previously found to be computationally intractable.
Aided by the increase in computational power, as well as by efforts to collect high-quality labeled data, in the last decade a particular subset of Neural Networks, called Convolutional Neural Networks (CNNs) , has accomplished remarkable results . Sharing the common trait of dealing with high-dimensional unstructured data, computer vision , speech recognition , natural language processing , machine translation  and medical image analysis [14, 3] are the fields in which these models have shown the greatest applicability.
Colloquially, a CNN, as presented by Yann LeCun and others, is a model that uses multiple layers of feature detectors with local receptive fields and shared parameters, interleaved with sub-sampling layers [15, 22, 23]. To attain translation invariance, these sub-sampling layers discard spatial information by design , which, in classification tasks, helps amplify the aspects of the input that are useful for discrimination and suppress irrelevant variations.
Translation invariance, however helpful for obtaining a model that produces the same classification for entities seen from different viewpoints, inevitably requires training on large amounts of redundant data. This redundancy is artificially introduced to force the optimization process to find solutions that cannot distinguish between different viewpoints of the same entity. Additionally, disregarding spatial information produces models incapable of dealing with recognition tasks, such as facial identity recognition, that require knowledge of the precise spatial relationships between high-level parts, such as a nose or a mouth.
1.1 Capsule Networks
To address these drawbacks, adaptations of CNNs have been proposed. This project focuses on improving an existing equivariant approach, introduced by Sara Sabour, Geoffrey E. Hinton and Nicholas Frosst, called Capsule Networks [8, 9]. By analogy with a neural network, a Capsule Network [8, 9] is, in essence, a stack of capsule layers, each composed of multiple capsules. A capsule is a group of artificial neurons that learns to recognise an implicitly defined entity in the network's input, and outputs both a tensor representation, the pose, which captures the properties of that entity relative to an implicitly defined canonical version, and an activation probability. The activation probability is designed to express the presence, in the network's input, of the entity the capsule represents. In the high-level capsules, this activation probability corresponds to the inter-class invariant discriminator used for classification. Moreover, every capsule's pose is equivariant, meaning that as the entity moves over the appearance manifold, the pose moves by a corresponding amount.
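As a rough illustration of this pair of outputs, a capsule's output can be sketched as a container holding an equivariant pose and an invariant activation probability. The class below is hypothetical (the 4x4 pose size follows the experimental setup later in the paper):

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Capsule:
    """Illustrative capsule output: an equivariant pose plus an activation."""
    pose: np.ndarray   # 4x4 matrix of instantiation parameters (equivariant)
    activation: float  # presence probability in [0, 1] (invariant)


# A hypothetical high-level capsule detecting a "nose" entity.
nose = Capsule(pose=np.eye(4), activation=0.93)
```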
Every capsule layer is calculated either from multiple feature maps, in the case of the primary capsule layer, or by a series of transformations, followed by a mechanism called routing, applied to the outputs of the previous capsule layer.
Disentangling the internal representations into a viewpoint-invariant presence probability and viewpoint-equivariant instantiation parameters may prove to be a more generalizable approach to representing knowledge than the conventional CNN view, which strives only for representational invariance. Early evidence of this has been revealed by experiments on the effects of varying individual components of individual capsules. The MNIST-trained network presented in , without being explicitly designed to, learned to capture properties such as stroke, width and skew. In addition, as mentioned by Hinton and others , the same network, despite never being trained on digits subjected to affine transformations, was able to classify them accurately at test time, which suggests greater generalization capacity in comparison with conventional CNNs.
The Capsule Network model proposed by Sabour et al.  uses, layer-wise, dynamic routing. Since dynamic routing is a sequential and extremely computationally expensive procedure, scaling the network to suit more challenging datasets would incur overly expensive costs in training and inference time. Furthermore, when applied to many capsules, the gradient flow through the dynamic routing computations is dampened. This inhibits learning, regardless of the computational resources used. Additionally, dynamic routing, like the other routing procedures proposed, establishes a priori the way in which the routing relations are modeled, which limits the expressiveness of the underlying model.
In this work, we propose two distinct ways in which the routing procedure can be discriminatively learned. In line with , we apply routing to local receptive fields with parameter sharing, in order to reduce vanishing gradients, take advantage of the fact that correlated capsules tend to concentrate in local regions, and reduce the number of model parameters.
2 Related Work
Capsules were first introduced in , where the logic of encoding instantiation parameters was established in a transforming autoencoder.
More recently, further work on capsules  garnered attention by achieving state-of-the-art performance on MNIST with a shallow Capsule Network using an algorithm named Dynamic Routing.
Shortly after, a new Expectation-Maximisation routing algorithm was proposed in , which replaced capsule vectors with matrices to reduce the number of parameters and also introduced convolutional capsules. State-of-the-art performance was achieved on the smallNORB dataset using a relatively small Capsule Network.
An analysis of the Dynamic Routing algorithm as an optimization problem was presented in , along with a discussion of possible ways to improve Capsule Networks.
3 Routing Procedure
Routing consists of a dot-product self-attention procedure that assigns, for each output capsule, a distribution of compatibility probabilities over the transformed previous layer's capsules, the capsule votes. These compatibility probabilities, after being multiplied by the corresponding capsule votes, are combined to produce the output capsule's pose. Furthermore, the routing procedure also assigns an activation probability to the respective output capsule, usually based on the amount of agreement among the votes with higher compatibility probabilities.
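The procedure above can be sketched for a single output capsule. This is an illustrative toy (flattened vote vectors, a plain dot-product agreement update, and a sigmoid-of-agreement activation are assumptions for the sketch), not the paper's exact algorithm:

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def route(votes, activations, iterations=3):
    """Toy dot-product routing for one output capsule.

    votes: (n, d) array of flattened capsule votes.
    activations: (n,) input capsule activations.
    Returns the output pose and an agreement-based activation.
    """
    logits = np.zeros(len(votes))  # routing intensities, start uniform
    for _ in range(iterations):
        c = softmax(logits)                                  # compatibilities
        w = c * activations
        pose = (w[:, None] * votes).sum(axis=0) / w.sum()    # weighted mean of votes
        logits = votes @ pose                                # dot-product agreement
    agreement = (c * (votes @ pose)).sum()
    return pose, 1.0 / (1.0 + np.exp(-agreement))            # sigmoid -> activation


# Two agreeing votes with high activations dominate a dissenting one.
votes = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])
pose, act = route(votes, np.array([0.9, 0.8, 0.1]))
```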
This procedure provides the framework for a consensus mechanism in the multi-layer capsule hierarchy, which, for the higher levels of the input's domain, will solve the problem of assigning parts to wholes. Additionally, it can be applied to local receptive fields with shared transformation matrices. Figure 1 contains a diagram representing how routing is applied convolutionally in one dimension. The 2D and 3D convolutional routing is extrapolated in the same manner as the usual 2D and 3D convolutions would be.
The vote transformations are traditionally linear transformations. However, due to stability issues that occur during training, equation 1, proposed by , will be used instead. The capsule vote from the input capsule to the output capsule is obtained by multiplying the transformation matrix, divided by its Frobenius norm, by the pose of the input capsule.
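The symbols of equation 1 were lost in extraction; a plausible reconstruction, in notation assumed here ($W_{ij}$ the transformation matrix from input capsule $i$ to output capsule $j$, $M_i$ the input pose, $V_{j|i}$ the resulting vote), would be:

```latex
V_{j|i} \;=\; \frac{W_{ij}}{\lVert W_{ij} \rVert_F}\, M_i
```

Normalizing the transformation matrix by its Frobenius norm keeps the scale of the votes bounded, which is consistent with the stability motivation stated above.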
In Algorithm 1, we present a generalization of the routing procedure that encompasses most of the routing algorithms proposed. The definition of the compatibility and activation functions is what characterizes each particular routing procedure.
4 Learning the Routing Procedure
In hopes of improving the performance of Capsule Networks and of incorporating routing into the whole training process, instead of designing an alternative routing procedure, which necessarily constrains the parts-to-whole relations that can be modeled, we present methods for parameterizing the procedure so that, either per layer or per network, it can be discriminatively learned like any other model parameter.
In the following subsections we present two distinct alternatives. The first frames routing as a classic clustering algorithm based on the application of parametric kernel functions. The second takes a less structured approach that defines the activation and compatibility functions simply as neural networks.
4.1 Kernel Learning Approach
Since, for each input, the routing computations in every output capsule are reminiscent of an agglomerative fuzzy clustering algorithm, following the avenue taken in , in this subsection routing is analyzed as the optimization of a clustering-like objective function. The resulting cluster is interpreted as the agreement over the capsule votes and is used as the routing procedure's output capsule's pose. Similarly to , we propose objective 2, inspired by the algorithm presented in .
In objective 2, a kernel function measures the similarity between each input capsule's vote and the output capsule's pose; each vote carries the corresponding input capsule's activation; the compatibility probabilities form a vector with n components, one per input capsule's vote, each representing the degree of similarity between that vote and the output capsule's pose; and a penalty factor weights the negative entropy term applied to the compatibility probabilities.
The first term in the objective function is the additive inverse of the weighted average of the similarity between the output pose and the input votes. The weights are the product of each input capsule's activation and the corresponding compatibility probability. In this way, the more activated and compatible an input capsule is, the more significant its vote's similarity is in the overall optimization process. Furthermore, the second term was chosen to penalize the polarisation of the compatibility probabilities. In essence, it was introduced so that the output capsule's pose is not simply equal to one of the input votes but is instead an intricate mixture.
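The symbols of the objective were lost in extraction; a plausible reconstruction consistent with the description above ($k$ a kernel function, $\mu$ the output capsule's pose, $v_i$ and $a_i$ the $i$-th input capsule's vote and activation, $r_i$ the compatibility probabilities, $\lambda$ the entropy penalty factor) would be:

```latex
\min_{\mu,\, r}\;
-\sum_{i=1}^{n} a_i \, r_i \, k(v_i, \mu)
\;+\; \lambda \sum_{i=1}^{n} r_i \ln r_i
\qquad \text{s.t.} \quad \sum_{i=1}^{n} r_i = 1,\; r_i \ge 0 .
```

The first sum is the negated weighted similarity described above, and the second is the negative-entropy penalty that discourages polarised compatibilities.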
What ties this approach into the learned-routing framework is that the parameters of the kernel function and the penalty factor are learned discriminatively through back-propagation during the training process.
The minimization problem is solved by alternating partial optimization over the output pose and the compatibility probabilities.
For fixed compatibility probabilities, the pose is updated as
For a fixed pose, the compatibility probabilities are updated as follows. Using the Lagrangian multiplier technique, we obtain the unconstrained minimization problem 5,
with its Lagrangian multiplier. If a point is a minimizer of the Lagrangian, then its gradient must be zero. Thus,
From 6, we obtain
which, when solved for the compatibility probabilities, results in
and the compatibility probabilities can be updated by 11.
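The update equations themselves were lost in extraction; under the notation assumed for the reconstructed objective ($k$ a kernel, $\mu$ the output pose, $v_i$, $a_i$, $r_i$, $\lambda$ as before), the alternating updates would take the form:

```latex
\mu \;\leftarrow\; \arg\max_{\mu} \sum_{i} a_i \, r_i \, k(v_i, \mu),
\qquad
r_i \;\leftarrow\;
\frac{\exp\!\big(a_i \, k(v_i, \mu)/\lambda\big)}
     {\sum_{j} \exp\!\big(a_j \, k(v_j, \mu)/\lambda\big)} .
```

The second update follows from setting the gradient of the Lagrangian $-\sum_i a_i r_i k(v_i,\mu) + \lambda \sum_i r_i \ln r_i + \alpha\left(\sum_i r_i - 1\right)$ with respect to $r_i$ to zero, which yields $r_i \propto \exp(a_i k(v_i,\mu)/\lambda)$, and normalizing.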
In this way, we obtain a compatibility function that is parameterized by the kernel parameters and the penalty factor, and that can be learned according to the specific machine learning problem at hand. The remaining function needed to define the routing procedure is the activation, presented in 13, which we defined to be a sigmoid applied to a linear transformation of the dot-product between the compatibility probabilities and the similarities between the corresponding votes and the final output capsule's pose.
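The alternating updates can be sketched in code. This is a minimal illustration under assumptions: a single Gaussian (RBF) kernel stands in for the learned kernel mixture, the pose update is a weighted mean (exact for this kernel family only near convergence), and the kernel parameter `gamma` and penalty `lam` are fixed floats here, whereas in the model they would be learned by back-propagation:

```python
import numpy as np


def rbf(v, mu, gamma):
    """Gaussian (RBF) kernel -- a stand-in for the paper's kernel mixture."""
    return np.exp(-gamma * np.sum((v - mu) ** 2, axis=-1))


def kernel_route(votes, activations, gamma=1.0, lam=0.5, iterations=3):
    """Sketch of kernel-learning routing: alternate a weighted-mean pose
    update (fixed r) with the closed-form softmax update for the
    compatibility probabilities r (fixed pose)."""
    n = len(votes)
    r = np.full(n, 1.0 / n)   # uniform initial compatibilities
    mu = votes.mean(axis=0)
    for _ in range(iterations):
        w = activations * r
        mu = (w[:, None] * votes).sum(axis=0) / w.sum()   # pose update
        s = activations * rbf(votes, mu, gamma) / lam     # scaled similarities
        r = np.exp(s - s.max())                           # softmax update for r
        r /= r.sum()
    return mu, r


votes = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2]])
mu, r = kernel_route(votes, np.array([0.9, 0.8, 0.1]))
```

Note how a larger `lam` flattens the softmax, which is exactly the anti-polarisation role the entropy penalty plays in the objective.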
4.2 Connectionist Approach
An alternative to the more formal approach of the previous subsection for modeling the routing procedure is to allow the compatibility and activation functions in Algorithm 1 to be Neural Networks. More precisely, we employed an LSTM cell , which is designed to keep track of arbitrarily long-term dependencies, in conjunction with two distinct neural networks, to obtain a learnable routing mechanism that plays an active role throughout the iterations.
Figure 2 presents a diagram of the LSTM cell employed. The input of the cell is a concatenation of the previous iteration's intermediate output capsule pose, the capsule vote from the input capsule, the corresponding routing intensity, and the activation. The routing intensities are the unnormalized equivalent of the compatibility probabilities, and each is initialized to one. After every iteration step, the routing intensities go through a softmax function to obtain the corresponding compatibility probabilities. After being initialized with zeros, the cell state and hidden state are updated throughout the iterations to obtain representations that are subsequently fed to two distinct neural networks.
One neural network is applied to the outputs of the hidden state to produce the routing intensities at the end of every iteration. Finally, at the end of the iterative process, the cell states corresponding to every input capsule are combined and fed to the second neural network, resulting in the output capsule's final activation, as indicated by 14.
The parameters of the LSTM and of both neural networks are shared across every input capsule. Additionally, the dimension of the LSTM's cell state and hidden state, as well as the architectures of the neural networks used, become hyperparameters.
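The mechanism can be sketched for one output capsule. This is an illustrative toy with several assumptions: weights are random rather than learned, the routing-intensity network is a single dense layer, and summing the mean cell state into a sigmoid stands in for the second network of equation 14:

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_cell(x, h, c, W, b):
    """Minimal LSTM cell: W maps [x; h] to the four stacked gates."""
    z = W @ np.concatenate([x, h]) + b
    i, f, g, o = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c


def connectionist_route(votes, activations, hidden=16, iterations=3):
    """Sketch of connectionist routing: per input capsule, an LSTM consumes
    [pose; vote; routing intensity; activation]; a dense layer on the hidden
    state emits the new routing intensity. Parameters are shared across
    input capsules (random here; learned in the real model)."""
    n, d = votes.shape
    x_dim = 2 * d + 2
    W = rng.normal(0, 0.1, (4 * hidden, x_dim + hidden))
    b = np.zeros(4 * hidden)
    w_out = rng.normal(0, 0.1, hidden)     # stand-in routing-intensity network
    h, c = np.zeros((n, hidden)), np.zeros((n, hidden))
    intensity = np.ones(n)                 # routing intensities start at one
    pose = votes.mean(axis=0)
    for _ in range(iterations):
        compat = np.exp(intensity - intensity.max())
        compat /= compat.sum()             # softmax -> compatibilities
        for i in range(n):
            x = np.concatenate([pose, votes[i], [intensity[i]], [activations[i]]])
            h[i], c[i] = lstm_cell(x, h[i], c[i], W, b)
            intensity[i] = w_out @ h[i]
        w = compat * activations
        pose = (w[:, None] * votes).sum(axis=0) / w.sum()
    activation = sigmoid(c.mean(axis=0).sum())  # stand-in for the second network
    return pose, compat


votes = rng.normal(size=(4, 3))
pose, compat = connectionist_route(votes, np.array([0.9, 0.8, 0.7, 0.2]))
```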
5 Experiments
In order to evaluate the effectiveness of the proposed routing algorithms in comparison with Expectation-Maximization routing, introduced in , experiments were conducted on both the MNIST  and smallNORB  datasets. They consist of training the same Capsule Network architecture for 100 epochs, with every routing procedure running for three iterations, for both of the proposed algorithms and the one presented in , and comparing the respective test set results. MNIST was selected in the early stages as a proof of concept. smallNORB was chosen because it is much closer to natural images while remaining devoid of context and color.
All the experiments were performed using a TensorFlow  implementation of the underlying models (https://github.com/Goncalo-Faria/learning-inner-consensus). The optimizer used was Adam  with a learning rate of 3e-3 scheduled by exponential decay. Additionally, the models were trained for approximately 40 hours each using Kaggle notebooks, a free research tool for machine learning that provides a Tesla P100 Nvidia GPU as part of an accelerated computing environment.
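The schedule can be written down explicitly. Only the 3e-3 base rate comes from the text; `decay_rate` and `decay_steps` below are illustrative assumptions:

```python
def exp_decay_lr(step, base_lr=3e-3, decay_rate=0.96, decay_steps=1000):
    """Exponentially decayed learning rate, as used with Adam in the
    experiments. decay_rate and decay_steps are assumed values."""
    return base_lr * decay_rate ** (step / decay_steps)
```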
Apart from the implementation of the routing procedure, all of the models in the experiments instantiate the Capsule Network architecture presented in Table 1 and use 4x4 matrices as poses. The model is a slight modification of the smaller Capsule Network used in . More precisely, the compatibility probabilities are not shared across the convolutional channel, batch normalization  is applied after the initial convolutions, and the transformation used is the one described in equation 1. Applied to smallNORB, the model has 86K parameters, excluding those pertaining to the different routing procedures.
| Layer | Parameters |
| Convolutional layer + ReLU + BatchNorm | K=5, S=2, Ch=64 |
| Primary Capsules | K=1, S=1, Ch=8 |
| Convolutional Capsule Layer 1 | K=3, S=2, Ch=16 |
| Convolutional Capsule Layer 2 | K=3, S=1, Ch=16 |
| Capsule Class Layer | flatten, O=5 |
Table 1. Specification of the Capsule Network model used in the experiment with the smallNORB dataset. K denotes convolutional kernel size, S stride, Ch number of output channels and O number of classes.
The model presented in Table 1 contains three layers to which routing is applied. In the described experiments, the routing algorithm presented in section 4.2 (Connectionist) used the hyperparameters described in Table 2. Moreover, the routing algorithm presented in section 4.1 (Kernel Learning) used the kernel defined in equation 15, a mixture of Gaussian kernels. The first two layers in the Capsule Network have Q = 4 and the last one has Q = 10.
| Layer | Hidden layers (first network) | Hidden layers (second network) | Units in the LSTM's hidden and cell states |
| Convolutional Capsule Layer 1 | [] | [] | 16 |
| Convolutional Capsule Layer 2 | [32, 32] | [64, 64] | 16 |
| Capsule Class Layer | [64, 64] | [124, 124] | 16 |
Table 2. Detailed specification of the hyperparameters used in the experiments with the Connectionist approach's routing mechanism. The hidden layers are represented as a list in which each element corresponds to the number of neurons in that hidden layer.
The choice of routing hyperparameters was based on giving the routing procedure in the deeper layers more parameters. This design choice was motivated by the intuition that the more complex the features, the more complex the routing needed to be.
6 Results
The results of all of the experiments are contained in Table 3, which lists the percentage error rate achieved when evaluating the models on the test set of the corresponding datasets.
Compared with Expectation-Maximization routing and the kernel learning approach, the routing procedure derived with the connectionist approach achieves greater accuracy on the MNIST dataset. However, the more sophisticated smallNORB dataset appears to be a more challenging task for the proposed routing procedures, since the experimental results suggest that, for this Capsule Network architecture, Expectation-Maximization is the superior algorithm.
ACKNOWLEDGMENTS A special thank you to the Gulbenkian Foundation, the scientific committee of the Artificial Intelligence Program, and this project's tutor, Professor Cesar Analide.
- (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
- (2003) A Neural Probabilistic Language Model. Journal of Machine Learning Research.
- (2016) Automatic Detection of Cerebral Microbleeds from MR Images via 3D Convolutional Neural Networks. IEEE Transactions on Medical Imaging.
- (2006) Neural networks in a softcomputing framework.
- (1994) Neural networks: a comprehensive foundation.
- (2011) Transforming auto-encoders. In Lecture Notes in Computer Science.
- (2017) Dynamic routing between capsules. In Advances in Neural Information Processing Systems.
- (2018) Matrix capsules with EM routing. In International Conference on Learning Representations.
- (2012) Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine.
- (1997) Long short-term memory. Neural Computation 9, pp. 1735–1780.
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR abs/1502.03167.
- (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980.
- (2016) Deep MRI brain extraction: a 3D convolutional neural network for skull stripping. NeuroImage.
- (2012) ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems.
- (1989) Handwritten Digit Recognition with a Back-Propagation Network. Advances in Neural Information Processing Systems.
- (2005) The MNIST database of handwritten digits.
- (2004) Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), vol. 2, pp. II-104.
- (2008) Agglomerative fuzzy K-Means clustering algorithm with selection of number of clusters. IEEE Transactions on Knowledge and Data Engineering.
- (2012) A proposal for the Dartmouth summer research project on artificial intelligence.
- (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review.
- (2011) Traffic sign recognition with multi-scale convolutional networks. In Proceedings of the International Joint Conference on Neural Networks.
- (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Understanding LSTM networks. http://colah.github.io/posts/2015-08-Understanding-LSTMs/ Accessed: 2019-08-15.
- (2018) An Optimization View on Dynamic Routing Between Capsules. In International Conference on Learning Representations (Workshop).
- (2016) Deep Learning. MIT Press.