PCapsNets, a General Form of Convolutional Neural Networks
Abstract
We propose Pure CapsNets (PCapsNets), which are structurally a generalization of normal CNNs. Specifically, we make three modifications to current CapsNets. First, we remove routing procedures from CapsNets based on the observation that the coupling coefficients can be learned implicitly. Second, we replace the convolutional layers in CapsNets with capsule layers to improve efficiency. Third, we package the capsules into rank-3 tensors to further improve efficiency. The experiments show that PCapsNets achieve better performance than CapsNets with various routing procedures while using significantly fewer parameters on MNIST and CIFAR10. The efficiency of PCapsNets is even comparable to that of some deep compressing models. For example, we achieve more than 99% accuracy on MNIST by using only 3,888 parameters. We visualize the capsules as well as the corresponding correlation matrix to show a possible way of initializing CapsNets in the future. We also explore the adversarial robustness of PCapsNets compared to CNNs.
1 Introduction
Capsule Networks, or CapsNets, have been found to be more efficient than normal CNNs at encoding the intrinsic spatial relationships among features (parts or a whole). For example, the CapsNet with dynamic routing ([19]) can separate overlapping digits accurately, while the CapsNet with EM routing ([9]) achieves a lower error rate on smallNORB ([12]). However, the routing procedures of CapsNets (including dynamic routing ([19]) and EM routing ([9])) are computationally expensive. Several modified routing procedures have been proposed to improve the efficiency ([24, 3, 13]), but they sometimes do not “behave as expected and often produce results that are worse than simple baseline algorithms that assign the connection strengths uniformly or randomly” ([16]). Further evidence comes from Hinton’s recent work ([11]), which removes explicit routing procedures from capsule autoencoders.
Even if we can afford the computational cost of the routing procedures, we still do not know whether the routing numbers we set for each layer serve our optimization target. For example, in the work of [19], the CapsNet models achieve the best performance when the routing number is set to 1 or 3, while other numbers cause performance degradation. For a 10-layer CapsNet, assuming we have to try three routing numbers for each layer, $3^{10} = 59{,}049$ combinations have to be tested to find the best routing number assignment. This problem could significantly limit the scalability and efficiency of CapsNets.
Here we propose PCapsNets, which resolve this issue by removing the routing procedures and instead learning the coupling coefficients implicitly during capsule transformation (see Section 3 for details). Another issue with current CapsNets is that it is common to use several convolutional layers before feeding the resulting features into a capsule layer. We find that using convolutional layers in CapsNets is not efficient, so we replace them with capsule layers. Inspired by [9], we also explore how to package the input of a CapsNet into rank-3 tensors to make PCapsNets more representative. The capsule convolution in PCapsNets can be considered as a more general version of 3D convolution. At each step, 3D convolution uses a 3D kernel to map a 3D tensor into a scalar (as Figure 2 shows) while the capsule convolution in Figure 2 adopts a 5D kernel to map a 5D tensor into a 5D tensor.
2 Related Work
CapsNets ([19]) organize neurons as capsules to mimic biological neural systems. One key design of CapsNets is the routing procedure, which combines lower-level features into higher-level features to better model hierarchical relationships. There have been many papers on improving the expensive routing procedures since the idea of CapsNets was proposed. For example, [24] improves the routing efficiency by 40% by using weighted kernel density estimation. [3] proposes an attention-based routing procedure that can accelerate the dynamic routing procedure. However, [16] finds that these routing procedures are heuristic and sometimes perform even worse than random routing assignment.
Incorporating routing procedures into the optimization process could be a solution. [20] treats the routing procedure as a regularizer to minimize the clustering loss between adjacent capsule layers. [13] approximates the routing procedure with master and aide interaction to ease the computation burden. [1] incorporates the routing procedure into the training process to avoid the computational complexity of dynamic routing.
Here we argue that from the viewpoint of optimization, the routing procedure, which is designed to acquire coupling coefficients between adjacent layers, can be learned and optimized implicitly, and may thus be unnecessary. This approach is different from the above CapsNets which instead focus on improving the efficiency of the routing procedures, not attempting to replace them altogether.
3 How PCapsNets work
We now describe our proposed PCapsNet model in detail. We describe the three key ideas in the next three sections: (1) that the routing procedures may not be needed, (2) that packaging capsules into higher-rank tensors is beneficial, and (3) that we do not need convolutional layers.
3.1 Routing procedures are not necessary
The primary idea of routing procedures in CapsNets is to use the parts and the learned part-whole relationships to vote for objects. Intuitively, identifying an object by counting the votes makes perfect sense. Mathematically, routing procedures can also be considered as linear combinations of tensors. This is similar to CNNs, in which the basic operation of a convolutional layer is a linear combination (scaling and addition),
$$v_j = \sum_i W_{ij}\, u_i, \tag{1}$$
where $v_j$ is the output scalar, $u_i$ is the input scalar, and $W_{ij}$ is the weight.
The case in CapsNets is a bit more complex since the dimensionalities of the input and output tensors between adjacent capsule layers are different, so we cannot combine them directly. Thus we adopt a step to transform the input tensors ($u_i$) into intermediate tensors ($\hat{u}_{j|i}$) by multiplying by a matrix ($W_{ij}$). Then we assign each intermediate tensor ($\hat{u}_{j|i}$) a weight $c_{ij}$, and now we can combine them together,
$$v_j = \sum_i c_{ij}\, \hat{u}_{j|i} = \sum_i c_{ij}\, W_{ij}\, u_i, \tag{2}$$
where the $c_{ij}$ are called coupling coefficients, which are usually acquired by a heuristic routing procedure ([19, 9]).
In conclusion, CNNs do linear combinations on scalars while CapsNets do linear combinations on tensors. Using a routing procedure to acquire linear coefficients makes sense. However, if Equation 2 is rewritten as,
$$v_j = \sum_i \left(c_{ij}\, W_{ij}\right) u_i = \sum_i W'_{ij}\, u_i, \tag{3}$$
then from the viewpoint of optimization, it is not necessary to learn or calculate $c_{ij}$ and $W_{ij}$ separately since we can learn $W'_{ij}$ instead. In other words, we can learn the $c_{ij}$ implicitly by learning $W'_{ij}$. Equation 3 is the basic operation of PCapsNets, except that we extend it to the 3D case; please see Section 3.2 for details.
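The equivalence between Equation 2 and Equation 3 can be checked numerically. The following is a minimal NumPy sketch, not the implementation used in our experiments: the sizes are hypothetical, the coupling coefficients are random stand-ins for the output of a routing procedure, and the variable names follow the notation of [19] as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, d_in, d_out = 6, 8, 16               # hypothetical sizes, for illustration only

u = rng.normal(size=(n_in, d_in))          # input capsules u_i
W = rng.normal(size=(n_in, d_out, d_in))   # transformation matrices W_ij for one output capsule j
c = rng.random(n_in)                       # stand-in coupling coefficients c_ij
c /= c.sum()

# Equation 2: transform each input capsule, then combine with the coupling coefficients.
u_hat = np.einsum("iod,id->io", W, u)      # u_hat_{j|i} = W_ij u_i
v_routed = (c[:, None] * u_hat).sum(axis=0)

# Equation 3: fold c_ij into the weights and apply a single tensor W'_ij.
W_prime = c[:, None, None] * W
v_folded = np.einsum("iod,id->o", W_prime, u)

print(np.allclose(v_routed, v_folded))
```

Because the two outputs agree for any choice of the coefficients, a network that learns the folded weights directly can represent whatever a routed layer computes for a fixed set of coefficients.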
By removing routing procedures, we no longer need an expensive step for computing coupling coefficients. At the same time, we can guarantee that the learned weights are optimized to serve the training target, while the good properties of CapsNets can still be preserved (see Section 4 for details). We conjecture that the strong modeling ability of CapsNets comes from this tensor-to-tensor mapping between adjacent capsule layers.
From the viewpoint of optimization, routing procedures do not contribute a lot either. Taking the CapsNets in ([19]) as an example, the “routing parameters” represent only 7.25% of the total parameters and are thus negligible compared to the “transformation parameters.” In other words, the benefit from routing procedures may be limited, even though they are the computational bottleneck.
Equation 1 and Equation 3 have a similar form. We argue that the “dimension transformation” step of CapsNets can be considered as a more general version of convolution. For example, if each 3D tensor in PCapsNets becomes a scalar, then PCapsNets degrade to normal CNNs. As Figure 5 shows, the basic operation of 3D convolution maps a tensor to a scalar, while the basic operation of PCapsNets maps a tensor to a tensor.
3.2 Packaging capsules into higher rank tensors is helpful to save parameters
The capsules in ([19]) and ([9]) are vectors and matrices. For example, the capsules in [19] have dimensionality $1152 \times 10 \times 8 \times 16$, which converts each 8-dimensional tensor in the lower layer into a 16-dimensional tensor in the higher layer ($32 \times 6 \times 6 = 1152$ is the input number and 10 is the output number). We need a total of $1152 \times 10 \times 8 \times 16 = 1{,}474{,}560$ parameters. If we package each input/output vector into $4 \times 2$ and $4 \times 4$ matrices, we need only $1152 \times 10 \times 2 \times 4 = 92{,}160$ parameters. This is the policy adopted by [9], in which 16-dimensional tensors are converted into new 16-dimensional tensors by using $4 \times 4$ tensors. In this way, the total number of parameters is reduced by a factor of 16.
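The parameter counts in this comparison are simple products and can be verified directly. The sketch below assumes the layer sizes of the CapsNet in [19] (1152 input capsules of dimension 8, 10 output capsules of dimension 16) and assumes that the folded input and output are $4 \times 2$ and $4 \times 4$ matrices, so that a $2 \times 4$ matrix per capsule pair performs the transformation.

```python
n_in, n_out = 32 * 6 * 6, 10                 # 1152 input capsules, 10 output capsules

# Vector capsules: one 8 x 16 transformation matrix per (input, output) pair.
vector_params = n_in * n_out * 8 * 16

# Folded capsules (assumed shapes): a 4 x 2 input times a 2 x 4 matrix gives a 4 x 4 output.
matrix_params = n_in * n_out * 2 * 4

ratio = vector_params // matrix_params
saved_fraction = 1 - matrix_params / vector_params

print(vector_params, matrix_params, ratio, saved_fraction)
```

Under these assumptions the folded layer keeps one sixteenth of the parameters; equivalently, fifteen sixteenths of them are saved.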
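In this paper, the basic units of the input ($\mathcal{U} \in \mathbb{R}^{g \times m \times n}$), the output ($\mathcal{V} \in \mathbb{R}^{g \times m \times p}$), and the capsules ($\mathcal{W} \in \mathbb{R}^{g \times n \times p}$) are all rank-3 tensors. Assume the kernel size is ($kh \times kw$) and the input capsule number (equivalent to the number of input feature maps in CNNs) is $l$. If we extend Equation 3 to the 3D case, and incorporate the convolution operation, then we obtain,
$$\mathcal{V} = \sum_{q=1}^{l} \sum_{r=1}^{kh} \sum_{s=1}^{kw} \mathcal{U}_{qrs} \cdot \mathcal{W}_{qrs}, \tag{4}$$
which shows how to obtain an output tensor from the input tensors in the previous layer in PCapsNets. Here $\cdot$ multiplies the matching $m \times n$ and $n \times p$ slices along the first dimension.
Assume a PCapsNet model is supposed to fit a function $f$, the ground-truth label is $y$, and the loss function is $J$. Then in backpropagation, we calculate the gradients with respect to the input $\mathcal{U}_{qrs}$ and with respect to the capsules $\mathcal{W}_{qrs}$,
$$\frac{\partial J}{\partial \mathcal{U}_{qrs}} = \frac{\partial J}{\partial \mathcal{V}} \cdot \mathcal{W}_{qrs}^{\top}, \tag{5}$$
$$\frac{\partial J}{\partial \mathcal{W}_{qrs}} = \mathcal{U}_{qrs}^{\top} \cdot \frac{\partial J}{\partial \mathcal{V}}, \tag{6}$$
where $\top$ transposes the last two dimensions.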
The advantage of folding capsules into high-rank tensors is to reduce the computational cost of the dimension transformation between adjacent capsule layers. For example, converting one vector capsule to another with a dense matrix requires as many parameters as the product of the two vector lengths. In contrast, if we fold both the input and output vectors into three-dimensional tensors, the capsule is itself a small rank-3 tensor; in our example it has only 16 parameters. For the same number of parameters, folded capsules might be more representative than unfolded ones. Figure 2 shows what happens in one capsule layer of PCapsNets in detail.
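One step of the capsule convolution described in this section can be sketched with `numpy.einsum`. This is an illustrative sketch rather than our CUDA implementation. It assumes that the product of a rank-3 input capsule and a rank-3 capsule filter is a slice-wise matrix product along the first dimension, and all shapes and names (`U`, `W`, `capsule_conv_step`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
l, kh, kw = 2, 3, 3          # input capsule maps and kernel size (hypothetical)
g, m, n, p = 2, 2, 2, 2      # capsule dimensions (hypothetical)

U = rng.normal(size=(l, kh, kw, g, m, n))   # input capsules under one kernel window
W = rng.normal(size=(l, kh, kw, g, n, p))   # capsule filters for one output map

def capsule_conv_step(U, W):
    """Sum slice-wise matrix products over input maps and kernel positions."""
    return np.einsum("qrsgmn,qrsgnp->gmp", U, W)

V = capsule_conv_step(U, W)

# Reference computation with explicit loops over maps and kernel positions.
V_ref = np.zeros((g, m, p))
for q in range(l):
    for r in range(kh):
        for s in range(kw):
            V_ref += np.matmul(U[q, r, s], W[q, r, s])   # batched over the first capsule dimension

# Scalar special case: 1 x 1 x 1 capsules reduce the step to an ordinary convolution dot product.
U1 = rng.normal(size=(l, kh, kw, 1, 1, 1))
W1 = rng.normal(size=(l, kh, kw, 1, 1, 1))
scalar_out = capsule_conv_step(U1, W1)

print(V.shape, np.allclose(V, V_ref), np.isclose(scalar_out.item(), (U1 * W1).sum()))
```

The last check illustrates the sense in which a capsule layer generalizes a convolutional layer: when every capsule is a single number, the tensor-to-tensor step collapses to the scalar weighted sum of a normal CNN.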
3.3 We can build a pure CapsNet without using any convolutional layers
It is common practice to embed convolutional layers in CapsNets, which makes these CapsNets hybrid networks with both convolutional and capsule layers ([19, 9, 1]). One argument for using several convolutional layers is to extract low-level, multi-dimensional features. We argue that this claim is not so persuasive, based on two observations: 1) the level of multi-dimensional entities that a model needs cannot be known in advance, and it does not matter, either, as long as the level serves our target; 2) even if a model needs a low level of multi-dimensional entities, the capsule layer can still be used since it is a more general version of a convolutional layer.
Based on the above observations, we build a “pure” CapsNet by using only capsule layers. One issue of PCapsNets is how to process the input if it is not a high-rank tensor. Our solution is simply to add new dimensions, so that the first layer of a PCapsNet can take either a color image or a grayscale image as input.
In conclusion, PCapsNets make three modifications over CapsNets ([19]). First, we remove the routing procedures from all the capsule layers. Second, we replace all the convolutional layers with capsule layers. Third, we package all the capsules and input/output as rank-3 tensors to save parameters. We keep the loss and activation functions the same as in previous work. Specifically, for each capsule layer, we use the squash function in ([4]) as the activation function. We also use the same margin loss function as in ([19]) for classification tasks,
$$L_k = T_k \max\left(0,\, m^+ - \lVert v_k \rVert\right)^2 + \lambda\, (1 - T_k) \max\left(0,\, \lVert v_k \rVert - m^-\right)^2, \tag{7}$$
where $T_k = 1$ iff class $k$ is present, and $m^+$, $m^-$ are metaparameters that represent the thresholds for positive and negative samples, respectively. $\lambda$ is a weight that adjusts the loss contribution of negative samples.
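The margin loss of [19] is short enough to state as code. The sketch below is a plain NumPy version for illustration; the function name and argument names are our own, and the default thresholds 0.9/0.1 with weight 0.5 are the values reported in [19], not necessarily those used for every experiment here.

```python
import numpy as np

def margin_loss(v_norm, labels, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss of [19].

    v_norm: (batch, classes) lengths of the output capsules.
    labels: (batch,) integer class indices.
    """
    T = np.eye(v_norm.shape[1])[labels]                       # one-hot targets T_k
    pos = T * np.maximum(0.0, m_pos - v_norm) ** 2            # penalize short capsules of the true class
    neg = lam * (1.0 - T) * np.maximum(0.0, v_norm - m_neg) ** 2   # penalize long capsules of other classes
    return (pos + neg).sum(axis=1).mean()

confident = np.array([[0.95, 0.05, 0.05]])   # correct class long, others short
wrong = np.array([[0.05, 0.95, 0.05]])       # a wrong class is long
print(margin_loss(confident, np.array([0])), margin_loss(wrong, np.array([0])))
```

A prediction whose capsule lengths already satisfy both thresholds contributes zero loss, which is the property that distinguishes this loss from cross-entropy.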
4 Experiments
We test our PCapsNets model on MNIST and CIFAR10. PCapsNets show higher efficiency than CapsNets [19] with various routing procedures as well as several deep compressing neural network models [21, 23, 7].
For MNIST, PCapsNets#0 achieve better performance than CapsNets [19] by using 40 times fewer parameters, as Table 1 shows. At the same time, PCapsNets#3 achieve better performance than Matrix CapsNets [9] by using 87% fewer parameters. [17] is the only model that outperforms PCapsNets, but uses 80 times more parameters.
Since PCapsNets show high efficiency, it is interesting to compare PCapsNets with some deep compressing models on MNIST. We choose five models that come from three algorithms as our baselines. As Table 2 shows, for a similar number of parameters, PCapsNets always achieve a lower error rate. For example, PCapsNets#2 achieves 99.15% accuracy by using only 3,888 parameters while the model ([21]) achieves 98.44% by using 3,554 parameters. For the PCapsNet structures in Table 1 and Table 2, please check our supplementary materials for details.
Table 1: Comparison between PCapsNets and CapsNets with different routing procedures on MNIST.
Models            Routing        Error rate (%)  Param #
DCNet++ ([17])    Dynamic ()     0.29            13.4M
DCNet ([17])      Dynamic ()     0.25            11.8M
CapsNets ([19])   Dynamic (1)                    6.8M
CapsNets ([19])   Dynamic (3)                    6.8M
AttenCaps ([3])   Attention ()                   5.3M
CapsNets ([9])    EM (3)                         320K
PCapsNets#0                                      171K
PCapsNets#3                                      22.2K
Table 2: Comparison between PCapsNets and deep compressing models on MNIST.
Algorithm                       Error rate (%)  Param #
KFCCombined ([7])               0.57            52.5K
Adaptive Fastfood 2048 ([23])                   52.1K
Adaptive Fastfood 1024 ([23])                   38.8K
KFCII ([7])                     0.76            27.7K
PCapsNets#3                                     22.2K
PCapsNets#2                                     3.8K
ProdSumNet ([21])               1.55            3.6K
PCapsNets#1                                     2.9K
For CIFAR10, we also adopt a five-layer PCapsNet (please see the supplementary materials), which has about 365,000 parameters. We follow the work of [19, 9] to crop $24 \times 24$ patches from each image during training, and use only the center $24 \times 24$ patch during testing. We also use the same data augmentation trick as in [6] (please see our supplementary materials for details). As Table 3 shows, PCapsNet achieves better performance than several routing-based CapsNets by using fewer parameters. The only exception is CapsuleVAE ([18]), which uses fewer parameters than PCapsNets but has lower accuracy. The structure of PCapsNets#4 can be found in our supplementary materials.
In spite of the parameter-wise efficiency of PCapsNets, one limitation is that we cannot find an appropriate acceleration solution like cuDNN ([2]), since all current acceleration packages are convolution-based. To accelerate our training, we developed a customized acceleration solution based on CUDA ([15]) and CAFFE ([10]). The primary idea is to reduce the number of communication steps between CPUs and GPUs, and to maximize the number of operations that can be parallelized. Please check our supplementary materials for details; the code will be released soon.
Table 3: Comparison between PCapsNets and other CapsNets variants on CIFAR10.
Models              Routing           Ensembled  Error rate (%)  Param #
DCNet++ ([17])      Dynamic ()        1          10.29           13.4M
DCNet ([17])        Dynamic ()        1          18.37           11.8M
MSCaps ([22])       Dynamic ()        1          24.3            11.2M
CapsNets ([19])     Dynamic (3)       7                          6.8M
AttenCaps ([3])     Attention ()      1                          5.6M
FRMS ([24])         Fast Dynamic (2)  1                          1.2M
FREM ([24])         Fast Dynamic (2)  1                          1.2M
CapsNets ([9])      EM (3)            1                          458K
PCapsNets#4                           1                          365K
CapsuleVAE ([18])   VBRouting         1          11.2            323K

5 Visualization of PCapsNets
We visualize the capsules (filters) of PCapsNets trained on MNIST (the model used is the same as in Figure 8). The capsules in each layer are 7D tensors. We flatten each layer into a matrix to make it easier to visualize: for example, the 7D tensor of the first capsule layer is reshaped into a matrix. We do a similar reshaping for the following three layers, and the result is shown in Figure 3.
We observe that the capsules within each layer appear correlated with each other. To check if this is true, we print out the correlation matrices of the first two layers for both the PCapsNet model and a CNN model (which comes from [19], also trained on MNIST) for comparison. We compute the Pearson product-moment correlation coefficients (the covariance divided by the product of the standard deviations) of the filter elements in each of the two convolution layers. In our case, we obtain two $25 \times 25$ correlation matrices from the reshaped conv1 ($25 \times 256$) and conv2 ($25 \times 65536$). Similarly, we generate two $9 \times 9$ correlation matrices of PCapsNets from the reshaped conv1 ($9 \times 16$) and conv2 ($9 \times 32$). As Figure 5 shows, the filters of the convolutional layers have lower correlations within kernels than those of PCapsNets. The result makes sense since the capsules in PCapsNets are supposed to extract the same type of features while the filters in standard CNNs are supposed to extract different ones.
The difference shown here suggests that we might rethink the initialization of CapsNets. Currently, our PCapsNets, as well as other types of CapsNets, all adopt initialization methods designed for CNNs, which might not be ideal.
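The correlation matrices described here can be computed with `numpy.corrcoef`. The sketch below uses a randomly initialized filter bank as a stand-in for the trained conv1 weights of the CNN baseline, so the resulting values are not those of our figures; only the reshaping and the computation are illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)
kernel = rng.normal(size=(5, 5, 1, 256))   # random stand-in for a 5 x 5 conv1 filter bank

flat = kernel.reshape(25, -1)              # 25 kernel positions x 256 filters
corr = np.corrcoef(flat)                   # each row is one variable, giving a 25 x 25 matrix

# The same quantity written out: covariance divided by the product of standard deviations.
cov = np.cov(flat)
std = flat.std(axis=1, ddof=1)
corr_manual = cov / np.outer(std, std)

print(corr.shape, np.allclose(corr, corr_manual))
```

For the capsule layers, the same call applies after reshaping a capsule tensor so that its first axis enumerates the kernel positions.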
Figure: correlation matrices of (a) conv1 and (b) conv2 of the CNN baseline, and (c) capconv1 and (d) capconv2 of PCapsNets.
6 Generalization Gap
The generalization gap is the difference between a model’s performance on training data and its performance on unseen data from the same distribution. We compare the generalization gap of PCapsNets with that of the CNN baseline [19] by marking out the area between the training loss curve and the testing loss curve, as Figure 6 shows. For visual comparison, we draw the curves every 20 iterations for the baseline [19] and every 80 iterations for PCapsNet. We can see that at the end of training, the gap between the training and testing loss of PCapsNets is smaller than that of the CNN model. We conjecture that PCapsNets have a better generalization ability.
7 Adversarial Robustness
For black-box adversarial attacks, [9] claims that CapsNets are as vulnerable as CNNs. We find that PCapsNets also suffer from this issue, even more seriously than CNN models. Specifically, we adopt FGSM [5] as the attacking method and use LeNet as the substitute model to generate one thousand adversarial test images. As Table 4 shows, when epsilon increases from 0.05 to 0.3, the accuracies of the baseline and the PCapsNet model fall to 54.51% and 25.11%, respectively.
Table 4: Accuracy of the CNN baseline and PCapsNets under the black-box FGSM attack.
Epsilon  Baseline  PCapsNets
0.05     99.09%    98.66%
0.1      98.01%    94.4%
0.15     95.52%    81.35%
0.2      89.84%    59.52%
0.25     78.31%    39.58%
0.3      54.51%    25.11%
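For readers unfamiliar with FGSM [5], the attack perturbs each input by epsilon in the direction of the sign of the loss gradient. The sketch below is a NumPy illustration only: it swaps the LeNet substitute model for a linear softmax classifier so that the gradient can be written in closed form, and the data, weights, and names are random placeholders.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)     # subtract the row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fgsm(x, y, W, b, eps):
    """FGSM against a linear softmax classifier with logits x @ W + b."""
    p = softmax(x @ W + b)
    p[np.arange(len(y)), y] -= 1.0           # gradient of the cross-entropy w.r.t. the logits
    grad_x = p @ W.T                         # chain rule back to the input pixels
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)

rng = np.random.default_rng(0)
x = rng.random((4, 784))                     # four placeholder images with pixels in [0, 1]
y = rng.integers(0, 10, size=4)              # placeholder labels
W = rng.normal(scale=0.1, size=(784, 10))    # placeholder substitute-model weights
b = np.zeros(10)
eps = 0.1

x_adv = fgsm(x, y, W, b, eps)
print(np.abs(x_adv - x).max())
```

In the black-box setting, the adversarial images are generated from the substitute model in this way and then evaluated on the target models.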
[9] claims that CapsNets show far more resistance to white-box attacks; we find the opposite result for PCapsNets. Specifically, we use UAP ([14]) as our attacking method, and train a generative network (see the supplementary materials for details) to generate universal perturbations to attack the CNN model ([19]) as well as the PCapsNet model (shown in Figure 8). The universal perturbations are supposed to fool a model into predicting a targeted wrong label ((the ground truth label + 1) % 10). As Figure 7 shows, when attacked, the accuracy of the PCapsNet model decreases more sharply than that of the baseline.
It thus appears that PCapsNets are more vulnerable to both white-box and black-box adversarial attacks compared to CNNs. One possible reason is that the PCapsNets model we use here is significantly smaller than the CNN baseline (3,888 versus 35.4M parameters). It would be a fairer comparison if the two models had a similar number of parameters.
8 Conclusion
We propose PCapsNets by making three modifications to CapsNets [19]: 1) we replace all the convolutional layers with capsule layers, 2) we remove routing procedures from the whole network, and 3) we package capsules into rank-3 tensors to further improve efficiency. In this way, PCapsNets become a general version of CNNs structurally. The experiments show that PCapsNets can achieve better performance than multiple other CapsNets variants with different routing procedures, as well as deep compressing models, by using fewer parameters. We visualize the capsules in PCapsNets and point out that the initialization methods of CNNs might not be appropriate for CapsNets. We conclude that the capsule layers in PCapsNets can be considered as a general version of 3D convolutional layers. We conjecture that the ability of CapsNets to efficiently encode the intrinsic spatial relationship between a part and a whole comes from the tensor-to-tensor mapping between adjacent capsule layers. This mapping is presumably also the reason for PCapsNets’ good performance.
9 Future work
Apart from high efficiency, another advantage of CapsNets is extracting good spatial features. PCapsNets have shown high efficiency in classification tasks, and should also be able to generalize well to segmentation and detection tasks. This will be our future work.
Appendix A Network Structures
A.1 MNIST & CIFAR10
For MNIST and CIFAR10, we designed five versions of PCapsNets (PCapsNets#0, PCapsNets#1, PCapsNets#2, PCapsNets#3, and PCapsNets#4); they are all five-layer CapsNets. Taking PCapsNets#2 as an example, the input is a grayscale image with a shape of $28 \times 28$, which we reshape into a 6D tensor to fit our PCapsNets. The first capsule layer (CapsConv#1, as Figure 8 shows) is a 7D tensor. The dimensions of the 7D tensor represent, in order, the kernel height, the kernel width, the number of input capsule feature maps, the number of output capsule feature maps, and the capsule’s first, second, and third dimensions. All the following feature maps and filters can be interpreted in a similar way.
Similarly, the five capsule layers of PCapsNets#0 are , , , , respectively. The strides for each layer are .
The five capsule layers of PCapsNets#1 are , , , , respectively. The strides for each layer are .
The five capsule layers of PCapsNets#3 are , , , , respectively. The strides for each layer are .
The five capsule layers of PCapsNets#4 are , , , , respectively. The strides for each layer are .
A.2 The Generative Network for Adversarial Attack
The input of the generative network is a 100-dimensional vector filled with random numbers ranging from -1 to 1. The vector is fed to a fully connected layer with 3456 outputs (the output is reshaped as $3 \times 3 \times 384$). On top of the fully connected layer, there are three deconvolutional layers: one with 192 outputs (kernel size 5, stride 1, no padding), one with 96 outputs (kernel size 4, stride 2, padding 1), and one with 1 output (kernel size 4, stride 2, padding 1). The final output of the three deconvolutional layers, which is the perturbation, has the same shape as the input image ($28 \times 28$).
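The spatial sizes in this generator can be checked with the standard output-size formula for a transposed convolution. The sketch below assumes no output padding and no dilation, and assumes that the 3456 outputs of the fully connected layer are arranged as a $3 \times 3$ map; the helper name `deconv_out` is our own.

```python
def deconv_out(size, kernel, stride, padding):
    """Output size of a transposed convolution (no output padding, no dilation)."""
    return (size - 1) * stride - 2 * padding + kernel

size = 3                                   # assumed spatial size after the reshape
channels = 3456 // (size * size)           # channels implied by a 3 x 3 map

sizes = [size]
for kernel, stride, padding in [(5, 1, 0), (4, 2, 1), (4, 2, 1)]:
    size = deconv_out(size, kernel, stride, padding)
    sizes.append(size)

print(channels, sizes)
```

Working backwards from the $28 \times 28$ output through the three layers gives the same $3 \times 3$ starting size, which is why that arrangement is assumed here.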
Appendix B Metaparameters & Data Augmentation
For all the PCapsNet models in the paper, we add a Leaky ReLU function (with a negative slope of 0.1) and a squash function after each capsule layer. All the parameters are initialized by MSRA ([8]).
For MNIST, we start from a learning rate of 0.002 and decrease it by a factor of 0.5 every 4000 steps during training. The batch size is 128, and we train our model for 30 thousand iterations. The upper/lower bound of the margin loss is 0.5/0.1. The $\lambda$ is 0.5. We adopt the same data augmentation as in ([19]), namely, shifting each image by up to 2 pixels in each direction with zero padding.
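The step schedule described for MNIST can be written as a one-line function. This is an illustrative sketch; the function and parameter names are hypothetical and do not come from our training code.

```python
def learning_rate(step, base_lr=0.002, decay=0.5, every=4000):
    """Step decay: multiply the learning rate by `decay` once every `every` steps."""
    return base_lr * decay ** (step // every)

schedule = [learning_rate(s) for s in (0, 3999, 4000, 29999)]
print(schedule)
```

With 30 thousand iterations the rate is halved seven times, so training ends at 1/128 of the initial rate. The CIFAR10 schedule corresponds to `base_lr=0.001` and `every=10000`.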
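For CIFAR10, we use a batch size of 256. The learning rate is 0.001, and we decrease it by a factor of 0.5 every 10 thousand iterations. We train our model for 50 thousand iterations. The upper/lower bound of the margin loss is 0.6/0.1. The $\lambda$ is 0.5. Before training, we first process each image by using Global Contrast Normalization (GCN), as Equation 8 shows,
$$X'_{i,j,k} = s\, \frac{X_{i,j,k} - \bar{X}}{\max\left\{\epsilon,\ \sqrt{\lambda + \frac{1}{3rc} \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{3} \left(X_{i,j,k} - \bar{X}\right)^2}\right\}}, \tag{8}$$
where $X$ and $X'$ are the raw image and the normalized image, $\bar{X}$ is the mean intensity of the image, and $r \times c$ is the image size. $s$, $\epsilon$, and $\lambda$ are metaparameters; $s$ and $\lambda$ are set to 1 and 10, and $\epsilon$ is a small constant. Then we apply Zero Component Analysis (ZCA) to the whole dataset. Specifically, we choose $N = 10000$ images randomly from the GCN-processed training set and calculate the mean image $\bar{x}$. Then we calculate the covariance matrix as well as its singular values and vectors, as Equation 9 shows,
$$C = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^{\top}, \qquad U, S, V = \mathrm{svd}(C), \tag{9}$$
where $x_n$ is the $n$-th image flattened into a vector. Finally, we can use Equation 10 to process each image $x$ in the dataset,
$$x_{\mathrm{ZCA}} = U\, \mathrm{diag}\!\left(\frac{1}{\sqrt{S + \epsilon}}\right) U^{\top} (x - \bar{x}). \tag{10}$$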
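The preprocessing pipeline above (GCN followed by ZCA whitening) can be sketched in NumPy as follows. This is a sketch of the standard forms of both operations rather than our exact code: the small constants `eps` are assumed values, images are assumed to be flattened into rows, and the function names are hypothetical.

```python
import numpy as np

def global_contrast_normalize(x, s=1.0, lam=10.0, eps=1e-8):
    """GCN on flattened images x of shape (num_images, num_features); eps is an assumed value."""
    x = x - x.mean(axis=1, keepdims=True)                      # remove the per-image mean
    contrast = np.sqrt(lam + (x ** 2).mean(axis=1, keepdims=True))
    return s * x / np.maximum(contrast, eps)

def fit_zca(x, eps=1e-5):
    """Estimate the mean image and the ZCA whitening matrix from the rows of x."""
    mean = x.mean(axis=0)
    xc = x - mean
    cov = xc.T @ xc / len(x)                                   # covariance matrix
    U, S, _ = np.linalg.svd(cov)                               # singular vectors and values
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return mean, W

def apply_zca(x, mean, W):
    """Whiten flattened images with a previously fitted mean and matrix."""
    return (x - mean) @ W

rng = np.random.default_rng(0)
images = rng.random((200, 12))             # placeholder data standing in for flattened images
normalized = global_contrast_normalize(images)
mean, W = fit_zca(normalized)
whitened = apply_zca(normalized, mean, W)
print(whitened.shape)
```

Fitting the whitening matrix on a random subset and then applying it to every image keeps the covariance computation cheap while still decorrelating the whole dataset.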
Table 5: Training time of PCapsNets#3 in CPU-only mode and with our CUDA kernel.
Batch size  CPU (s/100 iterations)  CUDA kernel (s/100 iterations)
50          106.22                  19.67
100         213.60                  46.57
150         319.37                  61.63
200         425.15                  91.59
Appendix C Acceleration Solution for PCapsNets
Different from convolution operations in CNNs, which can be interpreted as a few large matrix multiplications during training, the capsule convolutions in PCapsNets have to be interpreted as a large number of small matrix multiplications. If we use a current acceleration library like cuDNN ([2]) or the customized convolution solution in CAFFE ([10]), too many communication steps between the CPU and the GPU are incurred, which slows the whole training process considerably. The communication overhead is so large that training is slower than in CPU-only mode. To overcome this issue, we parallelize the operations within each kernel to minimize the number of communication steps. We build two PCapsNets#3 models: one is CPU-only, and the other is based on our parallel solution. The GPU is one TITAN Xp card, and the CPU is an Intel Xeon. As Table 5 shows, our solution is at least 4× faster than the CPU mode for all the batch sizes tested.
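The speedups implied by Table 5 can be computed directly from the reported timings. The sketch below only divides the CPU time by the CUDA-kernel time for each batch size; the dictionary names are our own.

```python
# Seconds per 100 iterations, copied from Table 5.
cpu_time = {50: 106.22, 100: 213.60, 150: 319.37, 200: 425.15}
cuda_time = {50: 19.67, 100: 46.57, 150: 61.63, 200: 91.59}

speedup = {batch: cpu_time[batch] / cuda_time[batch] for batch in cpu_time}
for batch, ratio in speedup.items():
    print(batch, round(ratio, 2))
```

The ratios stay within a narrow band across batch sizes, which suggests that the gain comes from the kernel itself rather than from a particular batch configuration.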
References
[1] Zhenhua Chen, Chuhua Wang, Tiancong Zhao, and David Crandall. Generalized capsule networks with trainable routing procedure. ICML Workshop: Understanding and Improving Generalization in Deep Learning, 2019.
[2] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014.
[3] Jaewoong Choi, Hyun Seo, Suii Im, and Myungju Kang. Attention routing between capsules. CoRR, abs/1907.01750, 2019.
[4] Xi Edgar, Bing Selina, and Jin Yang. Capsule network performance on complex data. CoRR, 2017.
[5] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and Harnessing Adversarial Examples. ArXiv e-prints, Dec. 2014.
[6] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout Networks. arXiv e-prints, page arXiv:1302.4389, Feb 2013.
[7] Roger Baker Grosse and James Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning, 2016.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. CoRR, abs/1502.01852, 2015.
[9] Geoffrey E Hinton, Sara Sabour, and Nicholas Frosst. Matrix capsules with EM routing. In International Conference on Learning Representations, 2018.
[10] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[11] Adam R. Kosiorek, Sara Sabour, Yee Whye Teh, and Geoffrey E. Hinton. Stacked Capsule Autoencoders. arXiv e-prints, page arXiv:1906.06818, Jun 2019.
[12] Yann LeCun, Fu Jie Huang, and Léon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In IEEE Conference on Computer Vision and Pattern Recognition, 2004.
[13] Hongyang Li, Xiaoyang Guo, Bo Dai, Wanli Ouyang, and Xiaogang Wang. Neural network encapsulation. ECCV, 2018.
[14] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. CoRR, abs/1610.08401, 2016.
[15] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with cuda. Queue, 6(2):40–53, Mar. 2008.
[16] Inyoung Paik, Taeyeong Kwak, and Injung Kim. Capsule Networks Need an Improved Routing Algorithm. arXiv e-prints, page arXiv:1907.13327, Jul 2019.
[17] Sai Samarth R. Phaye, Apoorva Sikka, Abhinav Dhall, and Deepti R. Bathula. Dense and diverse capsule networks: Making the capsules learn better. CoRR, abs/1805.04001, 2018.
[18] Fabio De Sousa Ribeiro, Georgios Leontidis, and Stefanos D. Kollias. Capsule routing via variational bayes. CoRR, abs/1905.11455, 2019.
[19] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. CoRR, abs/1710.09829, 2017.
[20] Dilin Wang and Qiang Liu. An optimization view on dynamic routing between capsules, 2018.
[21] Chai Wah Wu. Prodsumnet: reducing model parameters in deep neural networks via product-of-sums matrix decompositions. CoRR, abs/1809.02209, 2018.
[22] Canqun Xiang, Lu Zhang, Yi Tang, Wenbin Zou, and Chen Xu. Ms-capsnet: A novel multi-scale capsule network. IEEE Signal Processing Letters, 25:1850–1854, 2018.
[23] Z. Yang, M. Moczulski, M. Denil, N. d. Freitas, A. Smola, L. Song, and Z. Wang. Deep fried convnets. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1476–1483, Dec 2015.
[24] Suofei Zhang, Wei Zhao, Xiaofu Wu, and Quan Zhou. Fast dynamic routing based on weighted kernel density estimation. CoRR, abs/1805.10807, 2018.