# An Algorithm for Routing Capsules in All Domains with Sample Applications in Vision and Language

## Abstract

Building on recent work on capsule networks, we propose a new, general-purpose form of “routing by agreement” that activates output capsules in a layer as a function of their net benefit to use and net cost to ignore input capsules from earlier layers. To illustrate the usefulness of our routing algorithm, we present two capsule networks that apply it in different domains: vision and language.

## 1 Introduction

Capsule networks with routing by agreement can be more effective than convolutional neural networks for segmenting highly overlapping images Sabour et al. (2017) and for generalizing to different poses of objects embedded in images and resisting white-box adversarial image attacks Hinton et al. (2018), typically requiring fewer parameters but more training and computation.

A capsule is a group of neurons whose outputs represent different properties of the same entity in different contexts. Routing by agreement is an iterative form of clustering in which a capsule detects an entity by looking for agreement among votes from input capsules that have already detected parts of the entity in a previous layer.

### Proposed Routing Algorithm

Here, we propose a new, general-purpose form of “EM routing” Hinton et al. (2018) that uses the expectation-maximization (EM) algorithm to cluster similar votes from input capsules to output capsules. Each output capsule iteratively maximizes the probability of input votes assigned to it, given its probabilistic model.

Our EM routing algorithm has multiple differences compared to previous ones. The most significant difference is that we compute each output capsule’s activation by applying a logistic function to the difference between a net benefit to use and a net cost to ignore (i.e., not use) each input capsule. We compute the share of each input capsule used or ignored by each output capsule in a new procedure we call the D-Step, executed between the E-Step and M-Step of each EM iteration. We are motivated by the intuitive notion that

“output capsules should benefit from the input data they use, and lose benefits from any input data they ignore,”

as they maximize the probability of votes from those input capsules they use in each EM iteration.

We simultaneously (a) optimize the entire layer for a training objective, (b) iteratively maximize the probability of input capsule votes each output capsule uses, and (c) iteratively maximize the probability of net input capsule benefits less costs in service of (a) and (b). We like to think of this mechanism as finding “the combination of net benefits and costs that produces greater profit,” or, more colorfully, maximizing “bang per bit.”

Another significant difference of our routing algorithm, compared to previous ones, is that it accepts variable-size inputs, such as sequences of contextualized token embeddings in natural language applications. A contextualized token embedding is a special case of a capsule, one whose neuron outputs represent different properties of the same token id in different contexts.

### Sample Applications in Two Domains

For comparison with prior work on capsule networks, we evaluate our routing algorithm on the smallNORB visual recognition task LeCun et al. (2004), in which objects must be recognized from stereo images with varying azimuths, elevations, and lighting conditions. We construct a capsule network that routes pixel convolutions as capsules to achieve new state-of-the-art accuracy of 99.1% on this task. Compared to the previous state of the art Hinton et al. (2018), our smallNORB model has approximately 272,000 instead of 310,000 parameters, trains in 50 instead of 300 epochs, and accepts as input non-downsampled pairs of 96×96 images, instead of 32×32 downsampled crops that are nine times smaller. Also, we do not average over multiple crops per image to compute test accuracy. Fig. 1 shows how our model compares to prior state-of-the-art models on two criteria.

We find evidence that our smallNORB model learns to encode pose solely from pixel data and classification labels, i.e., it learns its own form of “reverse graphics” without us explicitly having to optimize for it. See the visualization in Fig. 4 and the 24 plots and corresponding captions in Supp. Fig. 6 and Supp. Fig. 7.

For illustration of the general-purpose nature of our routing algorithm, we also evaluate it on a natural language task: classifying the root sentences of the Stanford Sentiment Treebank Socher et al. (2013) into fine-grained (SST-5/R) and binary (SST-2/R) labels. (We add the “/R” designation to distinguish these root-sentence tasks from classification of phrases in the parse trees, because we have seen research that does not.)

For evaluation of our algorithm on SST-5/R and SST-2/R, we construct a capsule network with only approximately 140,000 parameters that routes frozen token embeddings from a pretrained transformer Vaswani et al. (2017) as capsules. In our implementation, we use a GPT-2 Radford et al. (2018) as the pretrained transformer, the largest such model publicly available at the time of writing. Our SST model achieves new state-of-the-art test set accuracy of 58.5% on SST-5/R, and new state-of-the-art test set accuracy for single-task models of 95.6% on SST-2/R. Fig. 2 shows how our SST model compares to prior state-of-the-art models on SST-5/R.

### Motivation

As we show here, we can achieve state-of-the-art results in more than one domain by stacking one or more layers of our routing algorithm atop, or between, blocks of non-iterative layers (e.g., convolutional, self-attention, LSTM). Our motivation is to develop universal, composable learning algorithms that can adapt to any application.

## 2 Notation

We use tensor notation, with all operations performed element-wise (Hadamard), implicitly assuming conventional broadcasting for any missing indexes. This notation is both more succinct than linear algebra’s and more natural to implement with frameworks like PyTorch and TensorFlow. For extra clarity, we use neither implicit (Einstein) summations nor raised (contravariant) indexes. See Tab. 1 for examples of our notation and their implementation in PyTorch.

Example | Implementation in PyTorch |
---|---|
$A_{ij} + B_{ijk}$ | `A.unsqueeze(-1) + B` |
$A_{ij} B_{ijk}$ | `A.unsqueeze(-1) * B` or `einsum("ij,ijk->ijk", A, B)` |
$\sum_i A_{ij} B_{ijk}$ | `einsum("ij,ijk->jk", A, B)` |
$\sum_i \sum_j A_{ij} B_{ijk}$ | `einsum("ij,ijk->k", A, B)` |
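To make the index conventions concrete, the following sketch spells out in plain Python loops what two of the einsum expressions in Tab. 1 compute. The helper names and the small nested-list tensors are illustrative assumptions, not part of the original implementation.

```python
# Plain-Python reference for two of the tensor expressions in Tab. 1.
# A has indexes (i, j); B has indexes (i, j, k). All names are illustrative.

def broadcast_multiply(A, B):
    """Element-wise product over (i, j, k), like einsum("ij,ijk->ijk", A, B)."""
    return [[[A[i][j] * B[i][j][k] for k in range(len(B[i][j]))]
             for j in range(len(A[i]))]
            for i in range(len(A))]


def contract_over_i(A, B):
    """Product summed over index i, like einsum("ij,ijk->jk", A, B)."""
    n_i, n_j, n_k = len(B), len(B[0]), len(B[0][0])
    return [[sum(A[i][j] * B[i][j][k] for i in range(n_i))
             for k in range(n_k)]
            for j in range(n_j)]


A = [[1.0, 2.0],
     [3.0, 4.0]]                           # shape (2, 2)
B = [[[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]],
     [[2.0, 2.0, 0.0], [1.0, 0.0, 3.0]]]   # shape (2, 2, 3)

product = broadcast_multiply(A, B)         # shape (2, 2, 3)
summed = contract_over_i(A, B)             # shape (2, 3)
```

Broadcasting here means that `A` is reused for every value of the missing index `k`; a summation removes the summed index from the result.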

## 3 Our Routing Algorithm

For brevity, we assume familiarity with both capsule networks and the matrix EM routing algorithm proposed by Hinton et al. (2018), so we focus our discussion here mainly on those aspects of our work that are new. Also, while our algorithm generalizes to any probabilistic model that can be used in an expectation-maximization loop, we restrict our discussion here only to one case: a multidimensional Gaussian model.

As shown in Alg. 1, for a given sample, our algorithm dynamically routes $n^{\text{inp}}$ input capsules to $n^{\text{out}}$ output capsules, where $n^{\text{inp}}$ is specified in advance if samples are of fixed size or left unspecified if samples are of variable size. Input capsules are tensors of size $d^{\text{cov}} \times d^{\text{inp}}$, and output capsules are tensors of size $d^{\text{cov}} \times d^{\text{out}}$.


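Per sample, we accept as input two tensors: scores $a^{\text{inp}}_i$ and capsules $\mu^{\text{inp}}_{icd}$. We return as output three tensors: scores $a^{\text{out}}_j$, capsules $\mu^{\text{out}}_{jch}$, and variances $(\sigma^{\text{out}}_{jch})^2$. The indexes are $i = 1, \dots, n^{\text{inp}}$; $j = 1, \dots, n^{\text{out}}$; $c = 1, \dots, d^{\text{cov}}$; $d = 1, \dots, d^{\text{inp}}$; and $h = 1, \dots, d^{\text{out}}$.

Intuitively, we can think of $n^{\text{inp}}$ as the number of detectable parts, $n^{\text{out}}$ as the number of detectable entities, each consisting of or associated with one or more parts, $d^{\text{cov}}$ as the dimension of the covector space, or dual, of the spaces in which parts and entities have properties, and $d^{\text{inp}}$ and $d^{\text{out}}$ as the dimensions of part and entity properties, respectively, in those spaces.

For example, if we wanted to detect, say, dogs and cats embedded in images, from 64 detectable animal parts, the values of $n^{\text{inp}}$, $n^{\text{out}}$, $d^{\text{cov}}$, $d^{\text{inp}}$, and $d^{\text{out}}$ would be 64, 2, 4, 4, and 4, respectively. Each of the 64 input capsules and 2 output capsules would be a 4×4 matrix capable of representing the spatial relationship between the viewer of an image and objects embedded in the image.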

In other domains, the dimensions of the dual space, of part properties, and of entity properties may be very different.

When the number of input capsules $n^{\text{inp}}$ is unspecified, the number of detectable parts is variable. In that case, it might be desirable for input capsules themselves to have properties that represent their type and/or position. This is commonly done, for example, in language models, which add relative or absolute position information to token embeddings.

### 3.1 Votes

Before starting the routing loop, we compute votes from each input to each output capsule,

$$V_{ijch} = \sum_d \left( \mu^{\text{inp}}_{icd} \, W_{ijdh} \right) + B_{ijch} \tag{1}$$

where $V_{ijch}$ is a tensor of votes computed from the $i$th input capsule to the $j$th output capsule for each component $(c, h)$ of the output capsule, and $W_{ijdh}$ and $B_{ijch}$ (or $W_{jdh}$ and $B_{jch}$, if the number of input capsules $n^{\text{inp}}$ is unspecified) are parameters. We perform tensor contraction on index $d$ and compute element-wise operations, as indicated, along indexes $i$, $j$, $c$, and $h$, with conventional broadcasting implicitly assumed for any missing dimensions. For each input capsule $i$, we obtain a different slice of votes for the $j$th output capsule, breaking symmetry.

Intuitively, the computation of $V_{ijch}$ in (1) can be understood as $n^{\text{inp}} n^{\text{out}}$ simultaneous matrix-matrix multiplications, each applying a linear transformation to a transposed input capsule, followed by addition of biases and then another transposition, to obtain votes of size $d^{\text{cov}} \times d^{\text{out}}$.

When $n^{\text{inp}}$ is left unspecified, we remove index $i$ from the parameters used to compute $V_{ijch}$, so we reduce their size by a factor of $n^{\text{inp}}$. In this case, the computation of $V_{ijch}$ applies $n^{\text{out}}$ simultaneous matrix-matrix multiplications to each input capsule, followed by addition of corresponding biases.
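As a rough illustration of the variable-size case, the sketch below computes votes with one weight matrix and one bias matrix per output capsule, shared by every input capsule. It assumes the contraction runs over the part-property dimension of each input capsule; the function name, the nested-list layout, and the toy values are hypothetical and chosen only for readability.

```python
# Minimal sketch of vote computation when the number of input capsules is
# left unspecified. Assumption: each output capsule j has one weight matrix
# W[j] (part-property dim x entity-property dim) and one bias matrix B[j]
# (dual dim x entity-property dim), shared by all input capsules.

def compute_votes(capsules, W, B):
    """Return votes[i][j][c][h] from capsules[i][c][d], W[j][d][h], B[j][c][h]."""
    votes = []
    for mu in capsules:                      # any number of input capsules
        per_output = []
        for W_j, B_j in zip(W, B):
            n_d, n_h = len(W_j), len(W_j[0])
            per_output.append(
                [[sum(mu[c][d] * W_j[d][h] for d in range(n_d)) + B_j[c][h]
                  for h in range(n_h)]
                 for c in range(len(mu))])
        votes.append(per_output)
    return votes


# Three 2x2 input capsules voting for two output capsules (toy values).
capsules = [[[1.0, 2.0], [3.0, 4.0]],
            [[0.0, 1.0], [1.0, 0.0]],
            [[5.0, 5.0], [5.0, 5.0]]]
W = [[[1.0, 0.0], [0.0, 1.0]],               # identity transform
     [[2.0, 0.0], [0.0, 2.0]]]               # scaling by two
B = [[[0.0, 0.0], [0.0, 0.0]],
     [[1.0, 1.0], [1.0, 1.0]]]

votes = compute_votes(capsules, W, B)
```

Because the parameters carry no input-capsule index, the same function accepts a list of capsules of any length.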

#### Adapting to Variable-Size Outputs

A trivial adaptation of our algorithm, which we do not show, is to allow both $n^{\text{inp}}$ and $n^{\text{out}}$ to be unspecified, resulting in a variable number of input capsules voting for an equal number of output capsules. We can do this by removing indexes $i$ and $j$ from the vote-computation weights $W_{ijdh}$, reducing their size by a factor of $n^{\text{inp}} n^{\text{out}}$. In that case, the computation of votes would apply the same linear transformation to all input capsules, and we would have to break symmetry via other means (e.g., by adding different biases).

### 3.2 Routing Loop

Unlike previous versions of EM routing, which on each iteration compute first an M-Step and then an E-Step procedure, our algorithm computes these procedures in the conventional order: first the E-Step and then the M-Step. This ordering removes the final, superfluous E-Step from the loop, and also, we believe, simplifies exposition.

We also introduce a new procedure, which we call the D-Step, between the E-Step and M-Step. The D-Step computes the share of each input capsule’s data used or ignored by each output capsule, for subsequent use in the computation of output scores $a^{\text{out}}_j$. These computations, described in 3.4 and 3.5, represent our most significant departure from previous forms of EM routing.

Another difference is that in our algorithm, the input scores $a^{\text{inp}}_i$ and output scores $a^{\text{out}}_j$ are “pre-activation” scores in the interval $(-\infty, \infty)$, to which we apply logistic functions as needed, at the last minute as it were, to compute activations. This trivial modification facilitates more flexible use of the “raw” values of $a^{\text{inp}}_i$ and $a^{\text{out}}_j$ by subsequent layers and/or objective functions, with more numerical stability. For example, a subsequent layer can apply a Softmax function to the output scores to induce a distribution over output capsule activations.
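The difference between the two uses of raw scores can be seen in a small sketch. The score values below are hypothetical; the point is only that a logistic function gates each capsule independently, while a Softmax over the same raw scores yields a distribution across capsules.

```python
import math


def logistic(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softmax(scores):
    """Softmax with the usual max-subtraction for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


raw_scores = [2.0, -1.0, 0.5]                     # hypothetical output scores
activations = [logistic(s) for s in raw_scores]   # independent gates in (0, 1)
distribution = softmax(raw_scores)                # sums to 1 across capsules
```

Keeping the scores raw until they are needed lets a later layer pick whichever of the two it requires without first inverting an activation.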

In the following subsections, we describe the computations performed by the E-Step, D-Step, and M-Step on each iteration, in order of execution, emphasizing those computations which are new or different with respect to previous work.

### 3.3 E-Step

At the start of each iteration, for each sample, our E-Step computes an $n^{\text{inp}} \times n^{\text{out}}$ tensor of routing probabilities $R_{ij}$ for assigning each input capsule $i$ to each output capsule $j$,

$$R_{ij} = \begin{cases} 1 / n^{\text{out}}, & \text{on the first iteration} \\[4pt] \dfrac{f(a^{\text{out}}_j) \, P_{ij}}{\sum_{j'} f(a^{\text{out}}_{j'}) \, P_{ij'}}, & \text{otherwise} \end{cases} \tag{2}$$

where $f$ is the logistic function, $a^{\text{out}}_j$ is the $j$th output score computed in the previous iteration’s M-Step, and $P_{ij}$ stands for

$$P_{ij} = \prod_{c,h} \mathcal{N}\!\left( V_{ijch} \,\middle|\, \mu^{\text{out}}_{jch}, (\sigma^{\text{out}}_{jch})^2 \right) \tag{3}$$

the products of the probability densities of input capsule $i$’s votes for output capsule $j$’s components, given output capsule $j$’s Gaussian model, updated in the previous iteration’s M-Step, as in other forms of EM routing, except that in our case votes have two indexes, $(c, h)$, instead of one. See Alg. 1 for computation. Informally, we can think of $P_{ij}$ as “the probability of votes from input capsule $i$ given capsule $j$’s probabilistic model.”

In our implementation, for numerical stability, we compute $R_{ij}$ after the first iteration by applying a Softmax function to simplified log-sums, instead of using the second equation in (2).
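One way to picture the log-space computation is sketched below. It assumes that, after the first iteration, routing probabilities are proportional to the logistic activation of each output score times the vote probability under that output capsule's model, normalized across output capsules. The function names and inputs are hypothetical, and the log-probabilities are taken as given rather than computed from a Gaussian model.

```python
import math


def log_logistic(x):
    """log(1 / (1 + exp(-x))), computed without overflow."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def e_step(a_out, log_p):
    """Routing probabilities R[i][j] via a Softmax over log-space sums.

    a_out[j]    : raw score of output capsule j
    log_p[i][j] : log-probability of input capsule i's votes under j's model
    """
    R = []
    for row in log_p:
        logits = [log_logistic(a) + lp for a, lp in zip(a_out, row)]
        m = max(logits)                       # subtract max before exp
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        R.append([e / total for e in exps])
    return R


a_out = [0.3, -1.2]                           # hypothetical output scores
log_p = [[-1050.0, -1052.0],                  # would underflow outside log space
         [-3.0, -0.5],
         [-20.0, -20.0]]
R = e_step(a_out, log_p)
```

Multiplying very small densities directly would underflow to zero for the first row above; adding logarithms and normalizing with a Softmax avoids that.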

### 3.4 D-Step

At the beginning of each D-Step, we multiply the assigned routing probabilities by logistic function activations of input scores, which act as gates, to obtain $D^{\text{use}}_{ij}$, the share of data used from each input capsule $i$ to update each output capsule $j$’s Gaussian model,

$$D^{\text{use}}_{ij} = f(a^{\text{inp}}_i) \, R_{ij} \tag{4}$$

where $f$ is the logistic function and $a^{\text{inp}}_i$ is the input score associated with input capsule $i$. Except for the “last-minute” application of the logistic function, (4) is the same as its corresponding equation in previous forms of EM routing. However, our notation explicitly differentiates between $R_{ij}$, the routing probabilities, and $D^{\text{use}}_{ij}$, the share of capsule $i$’s data used by capsule $j$.

Each row of routing probabilities $R_{ij}$ adds up to 1, and $f$ maps to $[0, 1]$; therefore,

$$0 \le D^{\text{use}}_{ij} \le f(a^{\text{inp}}_i) \le 1 \tag{5}$$

that is, $D^{\text{use}}_{ij}$ has values that range from 0 (“completely ignore input capsule $i$ in output capsule $j$’s model”) to 1 (“use the whole of input capsule $i$ in output capsule $j$’s model”), but never exceeds $f(a^{\text{inp}}_i)$ (“how much of input capsule $i$ can we use among all output capsules?”).

Given these relationships, we can compute the share of data ignored (i.e., not used) from each input capsule $i$ by each output capsule $j$,

$$D^{\text{ign}}_{ij} = f(a^{\text{inp}}_i) - D^{\text{use}}_{ij} \tag{6}$$

such that for each input and output capsule pair the two shares, $D^{\text{use}}_{ij}$ and $D^{\text{ign}}_{ij}$, plus the data that is “gated off” by logistic activation of the corresponding input score, $1 - f(a^{\text{inp}}_i)$, add up to 1, or the whole input capsule,

$$D^{\text{use}}_{ij} + D^{\text{ign}}_{ij} + \left( 1 - f(a^{\text{inp}}_i) \right) = 1 \tag{7}$$

accounting for all input data.
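The accounting described in this section can be checked numerically with a short sketch: gate each input capsule with the logistic function of its score, split the gated amount into used and ignored shares for every output capsule, and confirm that used, ignored, and gated-off portions add up to the whole input capsule. The function name and the toy scores and routing probabilities are illustrative assumptions.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def d_step(a_inp, R):
    """Return gates and the used/ignored data shares.

    a_inp[i] : raw score of input capsule i
    R[i][j]  : routing probabilities, each row summing to 1
    """
    gates = [logistic(a) for a in a_inp]
    D_use = [[g * r for r in row] for g, row in zip(gates, R)]
    D_ign = [[g - u for u in row] for g, row in zip(gates, D_use)]
    return gates, D_use, D_ign


a_inp = [2.0, -0.5, 0.0]                      # hypothetical input scores
R = [[0.7, 0.3],
     [0.5, 0.5],
     [0.1, 0.9]]                              # hypothetical routing probabilities
gates, D_use, D_ign = d_step(a_inp, R)

# For every (i, j) pair: used + ignored + gated-off equals the whole capsule.
totals = [[D_use[i][j] + D_ign[i][j] + (1.0 - gates[i])
           for j in range(len(R[0]))]
          for i in range(len(R))]
```

Since each row of routing probabilities sums to 1, the used shares of an input capsule across all output capsules sum to its gate value, which is why one output capsule can only gain share at the expense of the others.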

Output capsules are thus in a competition with each other to use “more valuable bits,” and ignore “less valuable bits,” of input data. Each output capsule can improve its use-ignore shares only at the expense of other output capsules.

### 3.5 M-Step

The M-Step computes updated output scores $a^{\text{out}}_j$ and weighted means and variances $\mu^{\text{out}}_{jch}$ and $(\sigma^{\text{out}}_{jch})^2$, respectively, to maximize the probability that each output capsule $j$’s Gaussian model would generate the votes computed from each input capsule used by $j$. We discuss these computations in the subsections that follow.

#### Output Scores

Previous forms of EM routing use the minimum description length principle to derive approximations of the cost to activate and the cost not to activate each output capsule $j$, and compute output activations by applying a logistic function to the difference between those approximations. Such costs must be approximated because the only known method for accurately computing them would require inverting vote-computation matrices, which is impractical. See Hinton et al. (2018) for details.

We use a different approach, motivated by the intuitive notion that

“output capsules should benefit from the input data they use, and lose benefits from any input data they ignore,”

as they maximize the probability of votes from those input capsules they use in each iteration.

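For each output capsule $j$, we compute output score $a^{\text{out}}_j$ as the difference of a net benefit to use and a net cost to ignore input capsule data,

$$a^{\text{out}}_j = \sum_i \beta^{\text{use}}_{ij} D^{\text{use}}_{ij} - \sum_i \beta^{\text{ign}}_{ij} D^{\text{ign}}_{ij} \tag{8}$$

where $D^{\text{use}}_{ij}$ and $D^{\text{ign}}_{ij}$ are the shares of input capsule $i$’s data used and ignored by output capsule $j$, computed in (4) and (6), respectively, and $\beta^{\text{use}}_{ij}$ and $\beta^{\text{ign}}_{ij}$ (or $\beta^{\text{use}}_{j}$ and $\beta^{\text{ign}}_{j}$, if the number of input capsules is unspecified) are parameters representing each output capsule’s net benefit to use and net cost to ignore input capsule data, respectively. The adjective “net” denotes that $\beta^{\text{use}}_{ij}$ and $\beta^{\text{ign}}_{ij}$ can have positive (“credits”) or negative (“debits”) values.

For certain tasks, we may want the net cost of ignoring each input capsule to be equal to the net benefit we lose from not using it. We can accomplish this trivially by setting $\beta^{\text{ign}}_{ij} = \beta^{\text{use}}_{ij}$ for all $i, j$, making them one and the same parameter.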

#### Output Capsule Probabilistic Models

At the end of each M-Step, we compute updated means and variances of every output capsule $j$’s Gaussian model, weighted by $D^{\text{use}}_{ij}$, the amount of data the output capsule uses from each input capsule $i$. See Alg. 1 for details.
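The weighted update can be sketched as follows. Vote components are flattened into a single list per input-output pair for brevity, and the small constant `eps` guarding the division is an assumption of this sketch rather than a documented detail.

```python
def weighted_gaussian_update(D_use, V, eps=1e-8):
    """Weighted mean and variance of votes for each output capsule.

    D_use[i][j] : share of input capsule i's data used by output capsule j
    V[i][j]     : flat list of vote components from input i to output j
    """
    n_inp, n_out, n_comp = len(V), len(V[0]), len(V[0][0])
    means, variances = [], []
    for j in range(n_out):
        total = sum(D_use[i][j] for i in range(n_inp)) + eps
        mu = [sum(D_use[i][j] * V[i][j][k] for i in range(n_inp)) / total
              for k in range(n_comp)]
        var = [sum(D_use[i][j] * (V[i][j][k] - mu[k]) ** 2
                   for i in range(n_inp)) / total
               for k in range(n_comp)]
        means.append(mu)
        variances.append(var)
    return means, variances


# Three input capsules, one output capsule, two vote components (toy values).
V = [[[0.0, 2.0]],
     [[2.0, 2.0]],
     [[100.0, -100.0]]]
D_use = [[1.0], [1.0], [0.0]]     # the third input capsule is fully ignored
means, variances = weighted_gaussian_update(D_use, V)
```

An input capsule whose used share is zero has no influence on the model, which is how the outlier in the third row is excluded here.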

#### Motivation and Intuition

In the next iteration’s E-Step, when we activate the output score $a^{\text{out}}_j$ by applying to it a logistic function in (2), we induce in each output capsule a distribution over a quantity equal to (a) the output capsule’s net benefits from those input capsules it uses, less (b) its net costs (or lost benefits) from those input capsules it ignores.

When we recompute routing probabilities in (2), we weight $P_{ij}$, the probability of votes from input capsules each output capsule uses, by $f(a^{\text{out}}_j)$, the probability of net benefits less costs from using those input capsules, jointly maximizing both probabilities in the EM loop for optimization of a training objective specified elsewhere.

Informally, we can think of this multi-faceted mechanism as finding, for each output capsule, “the combination of net benefits and costs that produces greater profit,” where “greater profit” stands for maximizing input vote probabilities at each output capsule and optimizing the whole layer for another objective. We prefer to think of it as maximizing “bang per bit.”

#### Relationship to Description Length

There is an interesting connection between our activation mechanism and that used by previous forms of EM routing: All else remaining equal, at each output capsule, using more data from an input capsule is associated with greater description length, and using less data from the input capsule is associated with the opposite—by definition.

## 4 Sample Application: smallNORB

The smallNORB dataset LeCun et al. (2004) has grayscale stereo 96×96 images of five classes of toys: airplanes, cars, trucks, humans, and animals. There are 10 toys in each class, five selected for the training set and five for the test set. Each toy is photographed stereographically at 18 different azimuths (0-340 degrees), 9 different elevations, and 6 lighting conditions, such that the training and test sets each contain 24,300 pairs of images. Supp. Fig. 9 shows samples from each class.

### 4.1 Architecture

The architecture of our smallNORB model is described in detail in Fig. 3. At a high level of abstraction, we can think of the model as doing two things: First, it applies a sequence of standard convolutions to detect 64 toy parts and their poses (spatial relationships to the viewer of the image) in multiple possible locations in the image (steps (a) through (d) in Fig. 3). Then, the model applies two layers of our routing algorithm, one to detect 64 larger toy parts and their poses, and another to detect five categories of toys and their poses (Fig. 3(e)). The routing layers are meant to induce the standard convolutions to learn to recognize toy parts and their poses.

#### Use of Variable-Size Inputs

To keep the number of parameters small, we leave the number of input capsules $n^{\text{inp}}$ unspecified in the first routing layer, so it accepts a variable number of input capsules without regard for their location in the image. To counteract this loss of location information, we stack input images with two tensors of coordinate values evenly spaced from -1.0 to 1.0, one horizontally and one vertically, as shown in Fig. 3(b).
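The coordinate stacking can be illustrated with a small sketch that builds two extra channels, one varying along the width and one along the height, and appends them to the image channels. The function names and the tiny image size are hypothetical, and the sketch assumes each spatial dimension has at least two pixels.

```python
def coordinate_channels(height, width):
    """Two channels of values evenly spaced from -1.0 to 1.0.

    Assumes height > 1 and width > 1.
    """
    xs = [-1.0 + 2.0 * x / (width - 1) for x in range(width)]
    ys = [-1.0 + 2.0 * y / (height - 1) for y in range(height)]
    horizontal = [list(xs) for _ in range(height)]   # varies left to right
    vertical = [[y] * width for y in ys]             # varies top to bottom
    return horizontal, vertical


def stack_with_coordinates(image_channels):
    """Append the two coordinate channels to a list of 2D image channels."""
    height, width = len(image_channels[0]), len(image_channels[0][0])
    horizontal, vertical = coordinate_channels(height, width)
    return image_channels + [horizontal, vertical]


# A toy stereo pair: two 4x5 channels of zeros.
stereo_pair = [[[0.0] * 5 for _ in range(4)] for _ in range(2)]
stacked = stack_with_coordinates(stereo_pair)        # four channels in total
```

Every pixel then carries its own normalized location, so features computed from it can retain position even though the routing layer itself ignores where a capsule came from.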

Besides reducing the number of parameters in the first routing layer (by a factor of $n^{\text{inp}}$, the number of input capsules, as shown in Fig. 3), our decision to accept a variable number of capsules in that first routing layer makes our model capable of accepting images of variable size, limited only by memory, though we do not make use of this capability here.

### 4.2 Initialization and Training

We initialize all convolutions with Kaiming normal initialization He et al. (2015) and the two routing layers as follows: Normal initialization scaled by for the multilinear weights that compute votes, zeros for the bias parameters, and zeros for the net benefit and cost parameters.

We train the model for 50 epochs, with a batch size of 20, using RAdam Liu et al. (2019) for optimization via stochastic gradient descent. We use a single-cycle hyperparameter scheduling policy in which learning rate and first momentum start at (, ), each change linearly to (, ) over the first 10% of training iterations, and then return to their respective starting values with a cosine shape over the remaining iterations.
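The shape of such a single-cycle schedule can be sketched as below: a linear change over the first 10% of steps, followed by a cosine-shaped return to the starting value. The function name and all numeric values (learning rates, momenta, number of steps) are placeholders chosen only for illustration.

```python
import math


def one_cycle(step, total_steps, start, turn, warmup_fraction=0.1):
    """Value of a hyperparameter at a given step of a single-cycle schedule.

    Moves linearly from `start` to `turn` during warmup, then returns to
    `start` along a half cosine over the remaining steps.
    """
    warmup_steps = max(1, round(warmup_fraction * total_steps))
    if step < warmup_steps:
        t = step / warmup_steps
        return start + t * (turn - start)
    t = (step - warmup_steps) / max(1, total_steps - 1 - warmup_steps)
    return start + 0.5 * (1.0 + math.cos(math.pi * t)) * (turn - start)


total_steps = 1000                                    # placeholder
lr_schedule = [one_cycle(s, total_steps, 1e-4, 1e-3)  # placeholder values
               for s in range(total_steps)]
momentum_schedule = [one_cycle(s, total_steps, 0.95, 0.85)
                     for s in range(total_steps)]
```

The same function serves both hyperparameters: the learning rate rises and then falls, while momentum falls and then rises, depending only on which of `start` and `turn` is larger.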

During training, we add 16 pixels of padding on each side to each pair of images and randomly crop them to 96×96 size. We do not use any other image processing, nor any metadata, nor any additional data in training.
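A pad-and-crop augmentation of this kind might look like the following sketch. It assumes zero-valued padding and applies the same random offset to both images of a stereo pair; the function name, the padding value, and the toy sizes are assumptions of the sketch.

```python
import random


def pad_and_random_crop(channels, pad, crop_h, crop_w, rng=random):
    """Zero-pad every channel on all sides, then crop all channels at the
    same randomly chosen offset."""
    height, width = len(channels[0]), len(channels[0][0])
    padded_w = width + 2 * pad
    top = rng.randint(0, height + 2 * pad - crop_h)   # inclusive bounds
    left = rng.randint(0, padded_w - crop_w)
    cropped = []
    for channel in channels:
        padded = ([[0.0] * padded_w for _ in range(pad)]
                  + [[0.0] * pad + list(row) + [0.0] * pad for row in channel]
                  + [[0.0] * padded_w for _ in range(pad)])
        cropped.append([row[left:left + crop_w]
                        for row in padded[top:top + crop_h]])
    return cropped


# Toy stereo pair of 4x4 images, padded by 2 pixels and cropped back to 4x4.
pair = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]
augmented = pad_and_random_crop(pair, pad=2, crop_h=4, crop_w=4,
                                rng=random.Random(0))
```

Cropping back to the original size after padding amounts to a random translation of the image by at most the padding width in each direction.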

For regularization, we use mixup Zhang et al. (2017) with Beta distribution parameters (0.2, 0.2), inducing the iterative EM clustering algorithms in our routing layers to learn to distinguish samples that have been mixed together.
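For readers unfamiliar with mixup, the sketch below mixes two samples and their one-hot labels with a coefficient drawn from a Beta(0.2, 0.2) distribution. Flat lists stand in for images, and the function name and toy data are hypothetical.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Convex combination of two samples and of their label vectors."""
    lam = rng.betavariate(alpha, alpha)       # mixing coefficient in [0, 1]
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


rng = random.Random(0)
x_a, y_a = [0.0, 0.0, 0.0], [1.0, 0.0]        # toy sample of class 0
x_b, y_b = [1.0, 1.0, 1.0], [0.0, 1.0]        # toy sample of class 1
x_mix, y_mix, lam = mixup(x_a, y_a, x_b, y_b, rng=rng)
```

With parameters as small as 0.2, the Beta distribution concentrates near 0 and 1, so most mixed samples are dominated by one of the two originals.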

The objective function is a Cross Entropy loss, computed on Softmax activations of the output scores of the final routing layer’s five capsules. Supp. Fig. 8 shows validation loss and accuracy after each epoch of training.

### 4.3 Results

Tab. 2 shows test set accuracy, number of parameters, and number of epochs to train our smallNORB model, and how it compares to prior models that have achieved state-of-the-art results without using any metadata or additional data.

Model | Accuracy (%) | No. of Params | Train Epochs |
---|---|---|---|
Sabour et al. (2017) | 97.3 | 8200K | N/A |
Cireşan et al. (2011)\* | 97.4 | 2700K | N/A |
Hinton et al. (2018) | 98.2 | 310K | 300 |
Hinton et al. (2018)\*\* | 98.6 | 310K | 300 |
Our Model | 99.1 | 272K | 50 |

\* Extensive image processing in training.

\*\* Reported accuracy is mean of multiple random crops per image.

Compared to the previous state of the art Hinton et al. (2018), our model has fewer parameters, trains in an order of magnitude fewer epochs, and accepts full-size unprocessed images instead of downsampled, cropped ones. We do not use multiple crops per image to compute test accuracy.

Compared to the best-performing conventional CNN on this benchmark Cireşan et al. (2011), our model has an order of magnitude fewer parameters and is trained with minimal data augmentation (only cropping), whereas the CNN is trained with additional stereo pairs of images created using different filters and affine distortions.

### 4.4 Analysis

Besides superior test set performance, we find evidence that our smallNORB model learns to perform its own form of “reverse graphics” without explicitly optimizing for it, solely from pixel data and classification labels. The model learns to use all four pose vectors jointly to represent poses, albeit in a way that feels quite alien compared to the typical human approach (e.g., a rotation matrix inside a matrix with translation data).

The visualization in Fig. 4 shows multidimensional scaling (MDS) representations of the trajectories of an activated class capsule’s pose vectors as we feed test images of an object in the class with varying elevations to our model. We can see that the four pose vectors that constitute the class capsule jointly move and eventually seem to “flip” as we change viewpoint elevation. The same visualizations for other objects in the dataset, and for varying azimuth, look qualitatively similar.

We also analyze quantitatively the behavior of pose vectors as we vary azimuth and elevation for every category and instance of toy in the dataset, and find that pose vectors behave in ways that are consistent with variation in azimuth and elevation. See the 24 plots and their captions in Supp. Fig. 6 and Supp. Fig. 7 for details.

Much more work remains to be done to understand and quantify our routing algorithm’s ability to learn “reverse graphics.” However, we think such work falls outside the scope of this paper, given that we also evaluate our routing algorithm in another domain, natural language.

## 5 Sample Application: SST

The Stanford Sentiment Treebank (SST) Socher et al. (2013) consists of 11,855 sentences extracted from movie reviews, parsed into trees with 215,154 unique phrases. The root sentences are split into training (8,544), validation (1,101) and test (2,210) sets, each with its own subset of the parse trees. The fine-grained root sentence classification task (SST-5/R) involves selecting one of five labels (very negative, negative, neutral, positive, very positive) for each root sentence in the test set. The binary root sentence classification task (SST-2/R) involves selecting one of two labels (negative, positive) after removing all neutral samples from the dataset, leaving 9,613 root sentences, split into training (6,920), validation (872), and test (1,821) sets, each with their own subset of the parse trees.

We chose SST, and SST-5/R in particular, for three reasons: First, its size is small enough to sidestep certain challenges to scaling EM routing (e.g., see Barham and Isard (2019)). Second, since this dataset’s release in 2013, no model has come close to human performance on SST-5/R, as measured by accuracy on its labels, which were assigned by human beings. Finally, we suspect SST-5/R has remained challenging because it is less susceptible than other benchmarks to the “Clever Hans” effect, in which seemingly impressive performance is explained by exploitation of spurious statistical cues in the data.

The Clever Hans effect has been documented in multiple natural language datasets, for example, by McCoy et al. (2019) and Niven and Kao (2019). Models have become so good at recognizing patterns in natural language that human beings are now finding it difficult to design benchmarks that are free of spurious statistical cues.

Several qualities, we think, make it challenging for machines to find and exploit spurious statistical cues in SST-5/R: First, its labels map to sentiments that transition into each other in complicated ways (e.g., the boundary between neutral and positive sentences). Second, the dataset is small (e.g., only 2,210 test sentences). Third, the samples exhibit a variety of complex syntactic phenomena (e.g., nested negations). Finally, the samples exhibit diverse linguistic constructions (e.g., idiosyncratic movie fan idioms).

### 5.1 Architecture

The architecture of our SST model resembles that of our smallNORB model. We show and describe it in detail in Fig. 5. At a high level of abstraction, we can think of our SST model as doing two things: First, it applies a nonlinear transformation to every embedding from a pretrained transformer, mapping each one to a vector with 64 elements indicating presence or absence of “sentiment features” (steps (a) through (d) in Fig. 5). Then, the model applies two layers of our routing algorithm, one to detect 64 composite parts, and another to detect classification labels, five for SST-5/R and two for SST-2/R (Fig. 5(e)). The routing layers are meant to induce the nonlinear transformation to learn to recognize useful “sentiment parts.”

#### Use of Variable-Size Inputs

The number of tokens in sentences is variable, so we leave the number of input capsules $n^{\text{inp}}$ unspecified in the first routing layer of our SST model. This first routing layer accepts any number of input capsules without regard for their position in the sequence or the depth of the transformer layer from which they originate.

Transformer embeddings incorporate information about their position in a sequence, but not about layer depth. To counteract the loss of depth information, we add a “depth-of-layer” parameter to the input tensor, as shown in Fig. 5(b). In this parameter, each transformer layer has a corresponding vector slice, which we add element-wise to every embedding in the input tensor originating from that transformer layer.
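The depth-of-layer parameter can be pictured with the sketch below, in which one vector per transformer layer is added element-wise to every token embedding that comes from that layer. The function name and the tiny dimensions and values are hypothetical stand-ins for the real model sizes.

```python
def add_depth_embedding(embeddings, depth_param):
    """Add each layer's depth vector to all token embeddings of that layer.

    embeddings[layer][token] : embedding vector (list of floats)
    depth_param[layer]       : vector of the same size for that layer
    """
    return [[[e + d for e, d in zip(token_vec, depth_param[layer])]
             for token_vec in layer_embeddings]
            for layer, layer_embeddings in enumerate(embeddings)]


# Two layers, three tokens, embedding size two (toy values).
embeddings = [[[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
              [[4.0, 4.0], [5.0, 5.0], [6.0, 6.0]]]
depth_param = [[0.1, 0.2],
               [10.0, 20.0]]
with_depth = add_depth_embedding(embeddings, depth_param)
```

The number of tokens never appears in the parameter's shape, so the same vectors apply to sentences of any length.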

#### Use of GPT-2

In our implementation, we use a GPT-2 Radford et al. (2018) as the pretrained transformer. We chose this model mainly because it was the largest one publicly available at the time of writing, and also because we like the simplicity of its training objective: it is trained only to predict the next subword token in approximately 40GB of text.

GPT-2 has approximately 774 million parameters and outputs 37 layers (36 hidden plus one visible) of dimension 1280. If a sentence contains, say, 10 tokens, this GPT-2 model transforms it into 37 sequences, each with 10 embeddings of size 1280. We concatenate them into an input tensor of shape $37 \times 10 \times 1280$ for our SST model.

### 5.2 Initialization and Training

We initialize the routing layers in the same manner as for the smallNORB model, the depth embeddings with zeros, and the linear transformations with Kaiming normal.

The training regime is the same as for the smallNORB model, except that we train the SST model for only 3 epochs. We use as training data all unique token sequences in the parse trees of the training split. Supp. Fig. 8 shows loss and accuracy on the root-sentence validation set after each epoch of training for both SST-5/R and SST-2/R.

### 5.3 Results on SST-5/R

As Tab. 3 shows, we achieve new state-of-the-art test set accuracy of 58.5% on SST-5/R, a significant improvement (2.3 percentage points) over previous state-of-the-art test set accuracy.

Model/Pretraining | SST-5/R (%) |
---|---|
RNTN/none (Socher et al., 2013) | 45.7 |
CNN/word2vec (Kim et al., 2014) | 48.0 |
Para-Vec/on dataset (Le and Mikolov, 2014) | 48.7 |
LSTM/on PP2B (Wieting et al., 2015) | 49.1 |
DMN/GloVe (Kumar et al., 2015) | 51.1 |
NSE/GloVe (Munkhdalai and Yu, 2016) | 52.8 |
ByteLSTM/82M reviews (Radford et al., 2017) | 52.9 |
CT-LSTM/word2vec (Looks et al., 2017) | 53.6 |
BCN+Char/CoVe (McCann et al., 2017) | 53.7 |
BCN/ELMo (Peters et al., 2018) | 54.7 |
SuBiLSTM+Char/CoVe (Brahma, 2018) | 56.2 |
Our Model/GPT-2 (non-finetuned) | 58.5 |

### 5.4 Results on SST-2/R

Changing only the final number of capsules to two, the same model achieves test set accuracy of 95.6% on SST-2/R, a new state-of-the-art performance for single-task models. The previous state of the art was achieved by BERT Devlin et al. (2018). Current state-of-the-art test set accuracy for multi-task models or ensembles is 96.8%, by an ensemble of XLNet models trained on multiple GLUE tasks Yang et al. (2019). See Tab. 4.

Model | SST-2/R (%) |
---|---|
Multi-task models or ensembles: | |
Snorkel (Ratner et al., 2018) (ensemble) | 96.2 |
MT-DNN (Liu et al., 2019) (single model) | 95.6 |
MT-DNN (Liu et al., 2019) (ensemble) | 96.5 |
XLNet (Yang et al., 2019) (ensemble) | 96.8 |
Single-task models: | |
BCN+Char/CoVe (McCann et al., 2017) | 90.3 |
Block-sparse LSTM (Radford et al., 2017) | 93.2 |
BERT (Devlin et al., 2018) | 94.9 |
Our Model/GPT-2 (non-finetuned) | 95.6 |

### 5.5 Analysis

We are surprised to see such a large improvement over the state of the art on SST-5/R and also new state-of-the-art performance on SST-2/R compared to previous single-task models, considering that (a) we do not finetune the transformer, (b) our SST model resembles the model we use for a visual task, and (c) we train the SST model with the same regime, changing only the number of epochs. (By implication, greater accuracy might be possible with transformer finetuning and more careful tweaking of model and training regime.)

We also note that progress on SST-5/R has been remarkably steady, year after year, since its publication in 2013, as AI researchers have devised new architectures that exploit new pretraining mechanisms (Tab. 3). Our results represent a continuation of this long-term trend. As of yet, no model has come close to accurately modeling the labeling decisions of human beings on SST-5/R. Performance on the binary task, SST-2/R, whose labels lack the subtlety of the fine-grained ones, has been close to human baseline for several years now.

Finally, our successful use of a capsule network to route embeddings from a pretrained transformer links two areas of AI research that have been largely independent from each other: capsule networks with routing by agreement, used mainly for visual tasks, and transformers with self-attention, used mainly for sequence tasks.

## 6 Related Work

Hinton et al. (2018) proposed the first form of EM routing and showed that capsule networks using it to route matrix capsules can generalize to different poses of objects in images and resist white-box adversarial attacks better than conventional CNNs. Their “related work” section compares capsule networks to other efforts for improving the ability of visual recognition models to deal effectively with viewpoint variations.

Sabour et al. (2017) showed that capsule networks with an earlier form of routing by agreement, operating on vector capsules, can be more effective than conventional CNNs for segmenting highly overlapping images of digits.

Barham and Isard (2019) showed that scaling capsule networks to large datasets and output spaces can currently be challenging in some circumstances. They attribute this to current software (e.g., PyTorch, TensorFlow) and hardware (e.g., GPUs, TPUs) systems, which are highly optimized for a fairly small set of computational kernels in a way that is tightly coupled with memory hardware, leading to poor performance on non-standard workloads, including basic operations on capsules.

Coenen et al. (2019) found evidence that BERT, and possibly other transformer architectures, learn to embed sequences of natural language as trees. Their work inspired us to wonder if capsule networks might be able to recognize such “language trees” in different “poses,” analogously to the way in which capsule networks can recognize different poses of objects embedded in images.

Vaswani et al. (2017) proposed transformer models using query-key-value dot-product attention, and showed that such models can be more effective than prior methods for modeling sequences. Our routing algorithm can be seen as a new kind of attention mechanism in which output capsules “compete with each other for the attention of input capsules,” with each output capsule seeing a different set of input capsule votes.
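To make this comparison concrete, the following sketch contrasts the two directions of normalization on a toy score matrix. It is an illustration under simplifying assumptions, not a reproduction of either method: `scores[i][j]` is a hypothetical compatibility score between input `i` and output `j`, and the function names are ours for this example only. In attention-style normalization, each output distributes its attention over the inputs; in routing-style normalization, each input distributes its share over the outputs, so the outputs compete for it.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def normalize_over_inputs(scores):
    """Attention-style: for each output j, weights over inputs i sum to 1."""
    n_inp, n_out = len(scores), len(scores[0])
    cols = [softmax([scores[i][j] for i in range(n_inp)]) for j in range(n_out)]
    return [[cols[j][i] for j in range(n_out)] for i in range(n_inp)]

def normalize_over_outputs(scores):
    """Routing-style: for each input i, shares over outputs j sum to 1."""
    return [softmax(row) for row in scores]

# Hypothetical scores for 3 inputs (rows) and 2 outputs (columns).
scores = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]
attn = normalize_over_inputs(scores)    # each column sums to 1
route = normalize_over_outputs(scores)  # each row sums to 1
```

In the routing-style result, an input that scores equally against both outputs (the second row) splits its share evenly between them, while the other inputs commit mostly to a single output.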

## 7 Future Work

Capsule networks are a recent innovation, and our routing algorithm is still more recent. Its behavior and properties are not yet widely or fully understood. As current challenges to scaling, such as those studied by Barham and Isard (2019), are overcome, we think it would make sense to conduct more comprehensive evaluations and ablation studies of our algorithm in multiple domains.

We are also intrigued by the possibility of using our routing algorithm for natural language modeling. At present this seems impractical, due in part to the computational complexity of the algorithm.^{2}

Another possible avenue for future research involves experimenting with probabilistic models other than a multidimensional Gaussian in output capsules. While our limited experiments show that a multidimensional Gaussian works remarkably well, we harbor some doubts about its effectiveness with capsules of much greater size.

Finally, we naturally wonder about using non-probabilistic clustering in our routing algorithm, k-means being the most obvious choice, given its relationship to EM and its proven effectiveness at dealing with large-scale data in other settings.
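The relationship between the two can be sketched in a few lines. The example below is a toy illustration on one-dimensional points, assuming equal-weight Gaussian clusters with a shared, fixed variance; it is not part of our routing algorithm, and all names in it are hypothetical. The only difference between the two procedures here is the assignment step: EM spreads each point across clusters in proportion to likelihood, whereas k-means puts all of a point's weight on the nearest mean. As the shared variance shrinks, the soft assignment approaches the hard one.

```python
import math

def soft_assign(x, means, var):
    """EM-style E-step: responsibilities under equal-weight Gaussians, shared variance."""
    logp = [-(x - m) ** 2 / (2.0 * var) for m in means]
    top = max(logp)
    w = [math.exp(lp - top) for lp in logp]
    total = sum(w)
    return [wi / total for wi in w]

def hard_assign(x, means):
    """k-means assignment: all weight goes to the nearest mean."""
    nearest = min(range(len(means)), key=lambda k: (x - means[k]) ** 2)
    return [1.0 if k == nearest else 0.0 for k in range(len(means))]

def update_means(points, weights, old_means):
    """Shared M-step: weighted average of the points assigned to each cluster."""
    new_means = []
    for k, old in enumerate(old_means):
        mass = sum(w[k] for w in weights)
        weighted = sum(w[k] * x for w, x in zip(weights, points))
        new_means.append(weighted / mass if mass > 0 else old)
    return new_means

points = [0.0, 0.5, 4.0, 4.5]
means = [1.0, 3.0]
soft = [soft_assign(x, means, var=1.0) for x in points]
hard = [hard_assign(x, means) for x in points]
em_means = update_means(points, soft, means)  # one EM-style update
km_means = update_means(points, hard, means)  # one k-means update
```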

## 8 Conclusion

Building on recent work on capsule networks, we propose a new, general-purpose form of routing by agreement that computes output capsule activations as a logistic function of net benefits to use less net costs to ignore input capsules. To make the computation of net benefits and costs possible, we introduce a new step in the EM loop, the D-Step, that computes the share of data used and ignored from each input capsule by each output capsule, accounting for all input capsule data. We construct our routing algorithm to accept variable-size inputs, such as sequences, which also proves useful for keeping the number of model parameters small in applications that do not otherwise require variable-size inputs. We also explain how to adapt the algorithm for variable-size outputs. Finally, our algorithm uses “pre-activation” scores to which we apply logistic functions as needed, facilitating more flexible use by subsequent layers and/or objective functions, with more numerical stability.
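As a rough recap of the D-Step and the activation computation, consider the following minimal sketch. It is written in plain Python for readability and is not the released routing code; the function names, the per-pair weights `beta_use` and `beta_ign`, and their shapes are assumptions made for this illustration. Each input capsule's data (the logistic of its pre-activation score) is split across output capsules according to routing probabilities `R`; whatever an output capsule does not use from an input capsule counts as ignored by it.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_step(a_inp, R):
    """Split each input capsule's data into shares used and ignored per output.

    a_inp[i]: pre-activation score of input capsule i.
    R[i][j]:  routing probability of input i to output j (each row sums to 1).
    """
    D_use, D_ign = [], []
    for score, row in zip(a_inp, R):
        data = logistic(score)                  # data available from this input
        used = [data * r for r in row]          # share used by each output
        D_use.append(used)
        D_ign.append([data - u for u in used])  # remainder ignored by each output
    return D_use, D_ign

def output_scores(D_use, D_ign, beta_use, beta_ign):
    """Pre-activation score per output: net benefit to use less net cost to ignore."""
    n_inp, n_out = len(D_use), len(D_use[0])
    scores = []
    for j in range(n_out):
        benefit = sum(beta_use[i][j] * D_use[i][j] for i in range(n_inp))
        cost = sum(beta_ign[i][j] * D_ign[i][j] for i in range(n_inp))
        scores.append(benefit - cost)
    return scores

# Toy example: 2 input capsules, 2 output capsules, all weights set to 1 (assumed).
a_inp = [0.0, 2.0]
R = [[0.5, 0.5], [0.9, 0.1]]
ones = [[1.0, 1.0], [1.0, 1.0]]
D_use, D_ign = d_step(a_inp, R)
a_out = output_scores(D_use, D_ign, ones, ones)
activations = [logistic(s) for s in a_out]  # logistic applied only when needed
```

In this toy case, the first output capsule uses most of the data from the strongly activated second input capsule, so its score is higher than that of the second output capsule.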

We illustrate the usefulness of our routing algorithm with two capsule networks that apply it in different domains, vision and language. Both networks achieve state-of-the-art performance in their respective domains after training with the same regime, thereby showing that adding one or more layers of our routing algorithm can produce state-of-the-art results in more than one domain, without requiring tuning. Our motivation is to develop universal, composable learning algorithms. Our work is but a small step in this direction.

## Acknowledgments

We thank Russell T. Klophaus for his feedback.

### Footnotes

1. In both domains, we use the same routing code, available at https://github.com/glassroom/heinsen_routing along with pretrained models and replication instructions.
2. Consider: if we wanted to use our routing algorithm to predict the next capsule in a natural language sequence, over a dictionary of typical size, say, subword ids, the final layer of the model alone would have to compute and hold in memory the equivalent of simultaneous EM loops, each on a different set of input votes per output capsule.

### References

- Machine learning systems are stuck in a rut. In Proceedings of the Workshop on Hot Topics in Operating Systems, HotOS ’19, New York, NY, USA, pp. 177–183. ISBN 978-1-4503-6727-1.
- High-performance neural networks for visual object classification. CoRR abs/1102.0183.
- Visualizing and measuring the geometry of BERT. CoRR abs/1906.02715.
- BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
- Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. CoRR abs/1502.01852.
- Matrix capsules with EM routing.
- Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR’04, Washington, DC, USA, pp. 97–104.
- On the variance of the adaptive learning rate and beyond. arXiv:1908.03265.
- Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3428–3448.
- Probing neural network comprehension of natural language arguments. CoRR abs/1907.07355.
- Language models are unsupervised multitask learners.
- Searching for activation functions. CoRR abs/1710.05941.
- Dynamic routing between capsules. CoRR abs/1710.09829.
- Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1631–1642.
- Attention is all you need. CoRR abs/1706.03762.
- XLNet: generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237.
- Mixup: beyond empirical risk minimization. CoRR abs/1710.09412.