Text Classification using Capsules

Jaeyoung Kim, Sion Jang and Sungchul Choi
TEAMLAB, Gachon University
teamlab.gachon@gmail.com

Eunjeong Park
NAVER
lucy.park@navercorp.com

This paper presents an empirical exploration of the use of capsule networks for text classification. While it has been shown that capsule networks are effective for image classification, their validity in the domain of text has not been explored. In this paper, we show that capsule networks indeed have potential for text classification, and that they have several advantages over convolutional neural networks. We further suggest a simple routing method that effectively reduces the computational complexity of dynamic routing. We utilized seven benchmark datasets to demonstrate that capsule networks, along with the proposed routing method, provide results comparable to those of previous methods.

1 Introduction

Text classification is one of the most basic and important tasks in the field of machine learning. Traditionally, documents have been represented with term frequency-inverse document frequency (tf-idf) and classified with general statistical classifiers such as support vector machines (SVMs) or logistic regression.

Recently, however, continuous development of deep learning methods has made it possible to find distributed representations of words and documents in an efficient manner (Mikolov et al., 2013; Le and Mikolov, 2014), which further led to higher accuracies for text classification. The major deep learning models utilized in text classification are largely based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs).

Meanwhile, in the image classification domain, capsule networks (Hinton et al., 2011; Sabour et al., 2017) proved to be effective at understanding spatial relationships in high levels of data by employing a whole vector of instantiation parameters. We have applied this network structure to the classification of text, and argue that it also has advantages in this field.

The main contributions of this work are three-fold. First, we apply capsule networks with dynamic routing to text classification and achieve comparable results to previous methods. Second, we propose an alternative routing method that achieves higher accuracy compared to dynamic routing. Third, we propose the use of an ELU-gate (Dauphin et al., 2016) to propagate relevant information.

2 Related Work

2.1 Text classification

As deep learning architectures have become more popular, they have also been applied to text classification. CNN models were popularized for text classification by Kim (2014), who applied convolutions directly to sentences. CNNs were further explored at the character level by Zhang et al. (2015). Dynamic convolutional neural networks (DCNNs) (Kalchbrenner et al., 2014) introduce a unique method of pooling by dynamically incorporating the length of a sentence when determining the pooling parameter.

While it is straightforward to utilize RNNs for text classification because of the sequential nature of text, naive RNNs have not been as successful as anticipated. However, with long short-term memory (LSTM) and initializations based on sequence autoencoders (Dai and Le, 2015) or small perturbations added to LSTM word embeddings (Miyato et al., 2017), RNNs have also achieved strong results.

Additionally, self-attention networks - models without any convolutions or recurrence - have also been successfully applied to text classification (Shen et al., 2018).

Figure 1: Capsule networks for text. Each document passes a gate layer, convolutional capsule layer, and a text capsule layer.

2.2 Capsule networks

Because the convolution operator in a CNN is represented by a weighted sum of lower layers, it is difficult to express the features of a complex object as it moves into the upper layers. This has the disadvantage of not considering the hierarchical relationships between local features. CNNs utilize pooling to overcome these shortcomings. Pooling can reduce the computational complexity of convolution operations and capture the invariance of local features. However, pooling operations lose information regarding spatial relationships and are likely to misclassify objects based on their orientation or proportion.

The capsule network is a structured model that solves many of the problems inherent to CNNs. Capsules in capsule networks are locally invariant groups that learn to recognize the existence of visual entities and encode their properties into vectors. While neurons operate independently in a CNN, capsule networks utilize a nonlinear function called squashing because capsules (groups of neurons) are represented as a vector.
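As an illustration of the squashing nonlinearity mentioned above, the following is a minimal sketch in plain Python of the function given by Sabour et al. (2017), v = (||s||^2 / (1 + ||s||^2)) (s / ||s||). The names `squash` and `length` and the small constant `eps` are assumptions made for this example only.

```python
import math

def squash(s, eps=1e-9):
    """Shrink a capsule vector to a length in [0, 1) without changing its direction."""
    sq_norm = sum(x * x for x in s)
    norm = math.sqrt(sq_norm)
    # ||s||^2 / (1 + ||s||^2) sets the new length; s / ||s|| keeps the direction.
    # eps (an assumption of this sketch) avoids division by zero for a zero vector.
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * x for x in s]

def length(v):
    """Euclidean length of a capsule vector."""
    return math.sqrt(sum(x * x for x in v))

short_capsule = squash([0.1, 0.0])    # weak input: length stays close to 0
long_capsule = squash([30.0, 40.0])   # strong input: length approaches 1
print(length(short_capsule), length(long_capsule))
```

Because the whole vector is rescaled by a single factor, the relative values of its components, and hence the properties it encodes, are preserved.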

Capsules consider the spatial relationships between entities and learn these relationships via dynamic routing (Sabour et al., 2017). Dynamic routing determines the connection strength between lower-level and upper-level capsules through repetitive routing based on a coupling coefficient. This coupling coefficient is utilized to measure the similarity between the vectors that predict the upper capsule and lower capsule, and learns which lower-level capsule must be directed to which upper-level capsule. Through this process, capsules learn to represent the properties of a given entity.

3 Model

3.1 Architecture

Our goal is to apply capsule networks to text classification and to modify them for this purpose. Capsules can represent attributes of partial entities and express semantic meaning in a wider space by representing entities as vectors rather than scalars. In this regard, capsules are suitable for expressing a sentence or document as a vector. Figure 1 depicts the general structure of the proposed model. The input of the network is a document $x \in \mathbb{R}^{L \times d}$, where $L$ is the length of the document and $d$ is the embedding size.

The second layer is a feature map utilizing convolutions, where the kernel size and the number of filters are hyperparameters and the stride is fixed to 1. While CNNs utilize max-pooling of feature maps to extract meaningful contexts, we utilize a trick similar to the gated linear unit (Dauphin et al., 2016), defined as

$$A = (x * W_1 + b_1) \otimes \mathrm{ELU}(x * W_2 + b_2) \qquad (1)$$

where $W_1, W_2$ are weights, $b_1, b_2$ are bias terms, $*$ is the convolution operator, and $\otimes$ is the element-wise multiplication operator. This ELU-gate unit acts as a control tower by selecting which features are activated. Unlike pooling, the ELU-gate unit does not lose spatial information.
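The gating idea can be sketched in plain Python. This is only an illustration under a stated assumption: the gate is taken to multiply a linear convolution output element-wise with an ELU-activated convolution output, by analogy with the gated linear unit of Dauphin et al. (2016). The helper names (`conv1d`, `elu_gate`) and the toy weights are hypothetical, and a single filter is used for brevity.

```python
import math

def elu(z, alpha=1.0):
    """Exponential linear unit."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def conv1d(x, kernel, bias):
    """Valid 1-D convolution with stride 1 for a single filter.

    x is a list of L embedding vectors; kernel is a list of K weight vectors.
    """
    k = len(kernel)
    out = []
    for t in range(len(x) - k + 1):
        acc = bias
        for j in range(k):
            acc += sum(w * e for w, e in zip(kernel[j], x[t + j]))
        out.append(acc)
    return out

def elu_gate(x, w_lin, b_lin, w_gate, b_gate):
    """Assumed gate form: linear branch times ELU-activated branch, per position."""
    linear = conv1d(x, w_lin, b_lin)
    gate = conv1d(x, w_gate, b_gate)
    return [a * elu(g) for a, g in zip(linear, gate)]

# Toy document with L = 4 words and embedding size 2; kernel size 2.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
w_lin = [[1.0, 0.0], [0.0, 1.0]]
w_gate = [[0.5, 0.5], [-1.0, 0.0]]
gated = elu_gate(x, w_lin, 0.0, w_gate, 0.0)
print(gated)
```

The output keeps one value per window position, so positional information is retained instead of being reduced to a single maximum as in max-pooling.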

The next layer is a convolutional capsule layer consisting of channels of convolutional capsules with a configurable capsule dimension and kernel size. Because the classifier is connected locally to the feature map, it is difficult for the classifier to handle variations in transformation. Some studies have shown that utilizing a large kernel size in a network tends to gather information from a much larger region in the receptive field (Peng et al., 2017). Because we do not utilize pooling, we instead increased the size of the kernel to enlarge our viewpoint. We then applied nonlinear squashing (Sabour et al., 2017) in the convolutional capsule layer $u$.

The final layer is the text capsule layer $v$. We utilized two different routing methods from the convolutional capsule layer to the text capsule layer, as described in the subsections below.

3.1.1 Capsule network with dynamic routing

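In Sabour et al. (2017), the capsule network updated the coupling coefficients through an iterative routing process and determined the degree to which lower capsules were directed to upper capsules. The coupling coefficient is determined by the degree of similarity between the output of an upper capsule and the prediction made for that capsule by a lower capsule.

$$s_j = \sum_i c_{ij} \hat{u}_{j|i} \qquad (2)$$

where $j \in \{1, \dots, k\}$, and $k$ is the number of classes. $c_{ij}$ is the coupling coefficient and the softmax output of $b_{ij}$, which is updated in every routing iteration. $b_{ij}$ is determined by the degree of similarity between the lower and upper capsules, and $\hat{u}_{j|i}$ predicts the entities of the upper capsules. The predicted vector $\hat{u}_{j|i}$ is expressed by a matrix operation between the weight matrix $W_{ij}$ and $u_i$.

$$\hat{u}_{j|i} = W_{ij} u_i \qquad (3)$$

The routing procedure is defined as follows:

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_{m} \exp(b_{im})}, \qquad v_j = \mathrm{squash}(s_j) = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|}, \qquad b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j \qquad (4)$$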
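For readers who prefer code, the following is a minimal sketch in plain Python of routing-by-agreement as described by Sabour et al. (2017). It is not the authors' implementation. The input layout `u_hat[i][j]` (prediction vector from lower capsule i for upper capsule j), the function names, and the choice of three iterations are assumptions made for this example.

```python
import math

def squash(s, eps=1e-9):
    sq = sum(x * x for x in s)
    scale = sq / (1.0 + sq) / (math.sqrt(sq) + eps)
    return [scale * x for x in s]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def capsule_length(v):
    return math.sqrt(sum(x * x for x in v))

def dynamic_routing(u_hat, num_iters=3):
    """Return upper capsules v and the coupling coefficients of the last iteration."""
    n_lower, n_upper, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_upper for _ in range(n_lower)]       # routing logits
    for _ in range(num_iters):
        c = [softmax(row) for row in b]                 # couplings over upper capsules
        v = []
        for j in range(n_upper):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_lower))
                   for d in range(dim)]
            v.append(squash(s_j))
        for i in range(n_lower):
            for j in range(n_upper):
                # Agreement: dot product of the prediction and the upper capsule output.
                b[i][j] += sum(p * q for p, q in zip(u_hat[i][j], v[j]))
    return v, c

# Two lower capsules agree on upper capsule 0 and disagree on upper capsule 1.
u_hat = [
    [[1.0, 0.0], [0.1, 0.0]],
    [[1.0, 0.0], [-0.1, 0.0]],
]
v, c = dynamic_routing(u_hat)
print([capsule_length(vj) for vj in v], c)
```

In this toy input the coupling coefficients shift toward upper capsule 0, where the predictions agree, which is the behavior that the coupling coefficient is meant to capture.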


3.1.2 Capsule network with static routing

For the image domain, it is reasonable to consider the spatial hierarchies of lower-level entities, and routing can recognize objects similarly to the manner in which we recognize objects. However, in the language domain, there is a great deal of freedom in the way that documents and emotions can be expressed. For example, in the original capsule network, learning to correctly represent the positional characteristics of the eyes, nose, and mouth when categorizing faces in images was a major challenge. However, in the case of documents, it is difficult to say that two documents are absolutely different because the order of the sentences in the two documents is different. From this perspective, it becomes natural to suggest a static routing scheme as follows:

$$s_j = \sum_{i=1}^{N} W_{ij} u_i, \qquad v_j = \mathrm{squash}(s_j) \qquad (5)$$

where $W_{ij}$ is a weight matrix and $N$ is the number of capsules in the convolutional capsule layer $u$. $u_i$ is multiplied by $W_{ij}$ to express the upper entity as a capsule of $d_v$-dimensional vectors. $v$ is the result of applying the squashing function to $s$ and represents the text capsule layer. This differs from fully connected scalar operations and has the advantage of representing documents as vectors.
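A minimal sketch of this scheme in plain Python follows. It assumes that static routing sums the transformed lower capsules without any coupling coefficients and then applies squashing; the function names, the weight layout `w[i][j]`, and the toy values are hypothetical.

```python
import math

def squash(s, eps=1e-9):
    sq = sum(x * x for x in s)
    scale = sq / (1.0 + sq) / (math.sqrt(sq) + eps)
    return [scale * x for x in s]

def matvec(w, u):
    """Multiply matrix w (list of rows) by vector u."""
    return [sum(w_rc * u_c for w_rc, u_c in zip(row, u)) for row in w]

def static_routing(u, w):
    """u: list of N lower capsules; w[i][j]: weight matrix from capsule i to class j."""
    n_lower, n_upper = len(w), len(w[0])
    v = []
    for j in range(n_upper):
        preds = [matvec(w[i][j], u[i]) for i in range(n_lower)]
        # No coupling coefficients: every lower capsule contributes equally.
        s_j = [sum(p[d] for p in preds) for d in range(len(preds[0]))]
        v.append(squash(s_j))
    return v

identity = [[1.0, 0.0], [0.0, 1.0]]
small = [[0.1, 0.0], [0.0, 0.1]]
u = [[1.0, 0.0], [0.0, 1.0]]                   # N = 2 lower capsules
w = [[identity, small], [identity, small]]     # two upper (class) capsules
v = static_routing(u, w)
lengths = [math.sqrt(sum(x * x for x in vj)) for vj in v]
predicted_class = lengths.index(max(lengths))
print(lengths, predicted_class)
```

Compared with dynamic routing, there is no iterative loop, so the cost is a single pass of matrix products followed by one squashing step per class capsule.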

4 Experimental Settings

4.1 Datasets

Dataset    Classes  Train  Val.  Test   Vocabulary size  Words in pre-trained vectors  Avg. sequence length
20news     20       10182  1132  7532   177925           50021                         315
Reuters10  10       6472   720   2787   28482            17508                         168
MR (2004)  2        1620   180   200    40693            31764                         779
MR (2005)  2        8635   960   1067   18764            16448                         22
TREC-QA    6        4843   539   500    8689             7461                          9
MPQA       2        8587   955   1067   6246             6083                          3
IMDb       2        22500  2500  25000  112540           58843                         231

Table 1: Summary statistics for datasets after tokenization. The last three columns give the vocabulary size, the number of words present in the set of pre-trained word vectors, and the average sequence length.

We tested our model on seven different benchmark datasets, as shown in Table 1. The details for each dataset are as follows:


20news

This dataset is a collection of approximately 20,000 news documents partitioned between 20 different newsgroups.

Reuters10

We utilize the Reuters corpus provided by the Python natural language toolkit NLTK, where documents are initially tagged with 90 categories. In order to limit the number of classes, we selected the 10 most common categories (earn, acq, money-fx, grain, crude, trade, interest, wheat, ship, corn) and selected the corresponding documents.

MR (2004)

(Pang and Lee, 2004) A corpus containing 1,000 positive and 1,000 negative preprocessed movie reviews.

MR (2005)

(Pang and Lee, 2005; http://www.cs.cornell.edu/people/pabo/movie-review-data/) A larger movie review dataset, which contains 5,331 positive sentences and 5,331 negative sentences.

TREC-QA

(Li and Roth, 2002; http://cogcomp.org/Data/QA/QC/) A TREC question dataset for classifying questions into six different question types (person, location, numeric information, etc.).

MPQA

(Wiebe et al., 2005; http://mpqa.cs.pitt.edu) Opinion polarity detection subtask of the MPQA dataset.

IMDb

(Maas et al., 2011; http://ai.stanford.edu/~amaas/data/sentiment/) Reviews from the Internet Movie Database, labeled based on positive or negative sentiments.
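The category selection described for the Reuters corpus above can be sketched as follows. This is an illustration only: the mapping `doc_categories` stands in for the corpus metadata, the function name is hypothetical, and the handling of documents with several categories (keeping only their top-k labels) is an assumption, since it is not specified here.

```python
from collections import Counter

def select_top_categories(doc_categories, k=10):
    """Keep the k most frequent categories and the documents that carry them."""
    counts = Counter(cat for cats in doc_categories.values() for cat in cats)
    top = [cat for cat, _ in counts.most_common(k)]
    kept = {}
    for doc, cats in doc_categories.items():
        remaining = [c for c in cats if c in top]
        if remaining:                      # drop documents with no top-k category
            kept[doc] = remaining
    return top, kept

# Toy metadata: document id -> list of category labels.
doc_categories = {
    "d1": ["earn"],
    "d2": ["earn", "acq"],
    "d3": ["acq"],
    "d4": ["cocoa"],
    "d5": ["earn"],
}
top, kept = select_top_categories(doc_categories, k=2)
print(top, sorted(kept))
```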

4.2 Hyperparameters and training

Dataset    Batch size  Reg. constant  Filters  Filter size  Initial learning rate  Capsules  Capsule dim.
20news     40          0.001          256      5            0.001                  6         10/16
Reuters10  40          0.001          256      3            0.0001                 6         10/16
MR (2004)  50          0.001          256      3            0.001                  6         16/16
MR (2005)  50          0.02           256      1            0.0001                 16        16/24
TREC-QA    50          0.0085         256      5            0.001                  16        32/16
MPQA       40          0.01           256      1            0.00008                16        8/16
IMDb       50          0.01           256      6            0.001                  6         8/16

Table 2: Hyperparameters used for the capsule network experiments. Reg. constant: regularization constant in a single layer (other layers have a regularization constant of 0.01). Filters: number of filters. Capsules: number of capsules. Capsule dim.: dimension of capsules.

For training, we utilized pre-trained word vectors from GloVe (https://nlp.stanford.edu/projects/glove/), trained on 840 billion tokens. We utilized the Adam optimizer (Kingma and Ba, 2014) with an exponentially decaying learning rate, multiplying the learning rate by a factor of 0.99 in every epoch. We utilized a dropout rate of 0.5 and an embedding size of 300.
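The per-epoch decay amounts to multiplying the initial learning rate by 0.99 raised to the number of completed epochs. A minimal sketch, with a hypothetical function name and an initial rate of 0.001 taken from Table 2:

```python
def decayed_learning_rate(initial_lr, epoch, decay=0.99):
    """Learning rate after `epoch` completed epochs of per-epoch exponential decay."""
    return initial_lr * decay ** epoch

# First three epochs for an initial learning rate of 0.001.
schedule = [decayed_learning_rate(0.001, e) for e in range(3)]
print(schedule)
```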

Particularly, the number of capsules is set to 6, according to experiments based on a held out dataset. This is a very low number compared to Sabour et al. (2017), which employed 1,152 capsules for image classification. Our conjecture for this big difference is that the complexity of the generated feature map is lower in our benchmark tasks. If the complexity of a generated feature map is low, the capsule is expected to provide an appropriate representation of the entity, even without dynamic routing.

Our model was trained on a GPU utilizing TensorFlow (Abadi et al., 2016), with the hyperparameter settings as shown in Table 2.

The CNN classification model from Kim (2014) was utilized as a baseline model for experimental comparisons. We tuned its hyperparameters for each dataset; the resulting values are listed in Table 3.

Dataset    Batch size  Reg. constant  Filters  Filter sizes  Initial learning rate
20news     64          0.01           256      [4,5,6]       0.001
Reuters10  40          0.001          100      [2,3,4]       0.0001
MR (2004)  50          0.0001         256      [4,5,6]       0.001
MR (2005)  64          0.01           100      [2,3,4]       0.0001
TREC-QA    50          0.01           256      [3,4,5]       0.0001
MPQA       64          0.01           100      [2,3,4]       0.0001
IMDb       40          0.001          256      [3,4,5]       0.0001

Table 3: Hyperparameters used for the baseline CNN experiments.

5 Results and analysis

5.1 Classification accuracies

Model                                           20news  Reuters10  MR (2004)  MR (2005)  TREC-QA  MPQA   IMDb
CapsNet-dynamic-routing                         86.45   86.72      88.1       81.00      93.80    89.60  89.80
CapsNet-static-routing                          87.17   87.52      89.6       80.98      94.84    90.57  89.72
CNN-non-static*                                 86.6    87.4       88.0       81.3       92.7     89.9   90.36
CNN-non-static (Kim, 2014)                      -       -          -          81.4       92.7     89.4   -
DCNN (Kalchbrenner et al., 2014)                -       -          -          -          93.0     -      -
SA-LSTM (Dai and Le, 2015)                      84.4    -          -          80.7       -        -      92.76
Virtual adversarial LSTM (Miyato et al., 2017)  -       -          -          83.4       -        -      94.1
Bi-BloSAN (Shen et al., 2018)                   -       -          -          -          94.8     90.4   -
Table 4: Text classification accuracies for seven benchmark datasets. Results for CapsNet-* and CNN-non-static are the average accuracies of five consecutive runs. CNN-non-static, marked with an asterisk, are results from a replication code of Kim (2014). Other results are from the corresponding references.

Our experimental results indicate that the accuracy of the static-routing model is higher than that of the dynamic-routing model, as shown in Table 4. We believe this is due to the higher complexity of the second layer, which is a feature map utilizing convolutions.

5.2 Capsule networks over CNNs

Figure 2: Loss and test accuracy as the capsule dimension changes, using the 20news dataset.

Static routing does not use every theoretical idea behind the capsule network. However, learning in vector units still differs from learning in an existing CNN. We therefore examined how vector-based learning affects the performance of the model. Figure 2 shows the performance as the capsule dimension varies while the number of trainable parameters is kept fixed. The experimental results show higher accuracy when the dimension is increased. Thus, when training with vectors, the capacity to represent information about entities increases, and it becomes possible to express various attributes of those entities. We also examined whether capsules retain the ability to represent entity properties under static routing. We used MNIST because there are limitations to visualizing minute changes in words. We performed a perturbation test after adding an ELU-gate to the original capsule network structure and replacing dynamic routing with static routing. The experimental method is the same as that of Sabour et al. (2017).

Figure 3: Dimension perturbations for static routing model with MNIST.

Figure 3 shows that each row captures a different property, such as rotation, thickness, or scale. Therefore, the use of static routing does not lose the essential characteristics of the capsule. This differs from a CNN, in which neurons are computed independently.

Model                                  “good”        “bad”
Capsule network with dynamic routing   recommended   waste
                                       entertaining  worst
                                       pieces        lame
                                       gripping      poorly
                                       truly         disappointing
Capsule network with static routing    delightful    terrible
                                       refreshing    worst
                                       pleased       supposed
                                       fantastic     waste
                                       terrifying    lame
CNN-baseline                           capta         sinny
                                       029           shiksa
                                       popped        blockbuter
                                       americanime   u
                                       waqt          animal

Table 5: Top five neighboring words for “good” and “bad” in the IMDb dataset, where word vectors were randomly initialized.

We measured word similarities to see how our model differs from the basic CNN; Table 5 lists the results. When pre-trained word vectors were utilized, both the CNN and our model were fine-tuned to the dataset. However, a difference can be seen when utilizing the static-routing method. In a CNN, max-pooling cannot update all words because only the context with the highest activation is updated during backpropagation. Because our model does not utilize max-pooling, the static-routing model learns the syntactic representations of words without losing positional context.
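A neighbor list such as the one in Table 5 can be produced by ranking the vocabulary by similarity to a query word. The sketch below assumes cosine similarity as the measure, which is not stated for this table, and uses hypothetical two-dimensional embeddings in place of learned word vectors.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_words(query, embeddings, k=5):
    """Return the k words whose vectors are most similar to the query word's vector."""
    q = embeddings[query]
    scored = [(cosine(q, vec), word) for word, vec in embeddings.items() if word != query]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]

# Hypothetical toy embeddings; real vectors would come from the trained model.
embeddings = {
    "good": [1.0, 0.2],
    "delightful": [0.9, 0.3],
    "fantastic": [0.8, 0.1],
    "bad": [-1.0, 0.1],
    "terrible": [-0.9, 0.2],
}
print(nearest_words("good", embeddings, k=2))
print(nearest_words("bad", embeddings, k=1))
```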

5.3 Static-routing over dynamic-routing

Sentence                                                                Dynamic-routing  Static-routing  Actual class
what is the name of the tallest man of korea                            4                3               3
who is the smallest woman in usa                                        2                3               3
What is the nickname of shakespeare                                     2                3               3
what is the name of the voices of the simpsons                          2                3               3
what is the nickname of soccer team of usa                              4                3               3
can i give a question , who is the prime minister of norway in europe   4                3               3
can you tell me which president was unmarried in unite states america   2                3               3
what is shakespeare's nickname                                          3                3               3
what is the color of crickets                                           1                2               2
what are the types of twins                                             4                2               2
i want to ask you what is the public currency of brazil                 1                1               2
can you tell me what are the cigarettes composed of                     1                2               2
how we can call female walrus                                           1                2               2
Pretrained word accuracy                                                -
Randomly initialized word accuracy                                      -

Table 6: Sentences from TREC-QA test data where phrases in red have changed word orders.

It is a general practice to utilize max-pooling in order to extract data features when using a CNN. However, max-pooling often produces poor results in text classification due to loss of information. More specifically, max pooling only maintains the feature with the highest activation, which means it discards all other features even though they may seemingly be useful.

To remedy this issue, capsule networks with dynamic routing choose to preserve not only one, but all features that are useful, as long as they are “agreed” among layers. However, we assert that this strategy is not necessarily optimal for document classification as opposed to image classification, due to the high variability in text. Specifically, the model should be flexible and robust enough to handle slight modifications in the text, such as word order shuffling or the insertion of an untrained word vector. We conjecture that removing the coupling coefficient would smooth out the underlying signals and therefore make the model more robust in this regard. We perform experiments involving word order shuffling and noise injections below to support this claim.

To test the above hypothesis and argue for the effectiveness of static routing, we evaluated the classification results after changing the order of words in a sentence. For this, we utilized 50 samples from each of class 2 (ENTITY) and class 3 (HUMAN) from the TREC-QA test dataset. As can be seen in Table 6, static routing achieved a much higher accuracy than dynamic routing, given that the word vectors are pretrained.

We further identified the effects of word changes on the predictions of the model utilizing LIME (Ribeiro et al., 2016). LIME generates new samples in the vicinity of a given instance and examines how the predictions of the model differ based on the input values. In the results presented in Figure 4, both routing models tend to produce incorrect decisions because of changed words.

(a) Dynamic routing
(b) Static routing
Figure 4: Words are highlighted according to their importance for prediction for the TREC-QA dataset, where the ground truth class is 2 (ENTITY). Green and red labels are positive and negative effects, respectively. Higher intensity indicates greater importance of the word.

When the original example is “what is the state flower of michigan” (the third example in Figure 4), the reconstructed data is “what is the color of michigan’s state flower”. With dynamic routing, the newly added word “color” has a negative effect on the prediction. A negative effect also appears when a combination that does not occur in the existing TREC-QA data is added, such as “can you tell me” (second-to-last example in Figure 4).

For this reason, we did not utilize the coupling coefficient. As a result, the computational complexity is reduced and generalization is improved compared to when dynamic routing is used.

5.4 Justifying the ELU-gate

Structure                              TREC-QA  MPQA   20news  MR (2004)
Multiple filters without max-pooling   93.47    88.68  85.76   85.89
Multiple filters with max-pooling      93.84    89.28  86.08   85.69
Convolutional layer                    93.99    90.07  85.40   85.29
ELU-gate                               94.80    90.57  87.17   89.60
(a) Static-routing

Structure                              TREC-QA  MPQA   20news  MR (2004)
Multiple filters without max-pooling   91.95    88.82  85.56   84.59
Multiple filters with max-pooling      92.63    89.43  85.69   87.29
Convolutional layer                    93.07    90.26  85.60   85.29
ELU-gate                               93.80    89.60  86.45   88.10
(b) Dynamic-routing

Table 7: Results of the ablation test. Each number is the mean accuracy of five consecutive runs.

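Gating mechanisms have mainly been studied in recurrent models such as LSTM and GRU, but Dauphin et al. (2016) showed that gating is also effective with convolutional layers. The gradient of the LSTM-style gate is as follows.

$$\nabla[\tanh(X) \otimes \sigma(X)] = \tanh'(X) \nabla X \otimes \sigma(X) + \sigma'(X) \nabla X \otimes \tanh(X) \qquad (6)$$

In the case of LSTM, the effect of the gradient is reduced because downscaling occurs in $\tanh'(X)$ and $\sigma'(X)$.

$$\nabla[X \otimes \mathrm{ELU}(X)] = \nabla X \otimes \mathrm{ELU}(X) + X \otimes \mathrm{ELU}'(X) \nabla X \qquad (7)$$

Since the gradient of the ELU-gate can be expressed as shown in Equation 7, the effect of downscaling is small. Unlike max-pooling, fine-tuning works well because input words are updated globally.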
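The downscaling argument can be checked numerically in the scalar case. The sketch below assumes the simplified gate forms tanh(x)·σ(x) for the LSTM-style gate and x·ELU(x) for the ELU-gate, following the style of analysis in Dauphin et al. (2016); the function names are hypothetical, and the analytic gradients are compared against central finite differences.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_prime(x, alpha=1.0):
    return 1.0 if x > 0 else alpha * math.exp(x)

def lstm_style_gate_grad(x):
    """d/dx [tanh(x) * sigmoid(x)]: both terms carry a derivative factor below 1."""
    t, s = math.tanh(x), sigmoid(x)
    return (1.0 - t * t) * s + s * (1.0 - s) * t

def elu_gate_grad(x):
    """d/dx [x * elu(x)]: the first term, elu(x), has no downscaling derivative factor."""
    return elu(x) + x * elu_prime(x)

def numeric_grad(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2.0 * h)

x0 = 2.0
print(lstm_style_gate_grad(x0), elu_gate_grad(x0))
```

At x0 = 2 the LSTM-style gate passes back a much smaller gradient than the assumed ELU-gate, which is consistent with the claim that the effect of downscaling is small for the latter.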

Table 7 compares the accuracy obtained with the ELU-gate against that of other structures. The multiple filter layer is a convolution layer with multiple filter sizes, as in the CNN structure of Kim (2014). The number of filters in the multiple filter layers was 100 per filter size, and the kernel size of pooling was . The convolutional layer is the same layer with the ELU-gate removed.

5.5 Text transformation

In image classification, capsules represent the various properties of a particular entity that is present in an image. These properties include tilt, orientation, hue, etc. In order to apply this analogy to text, we experimented with documents to see how capsules can learn the innate characteristics of the document being converted. To test this reconstructive phenomenon, we added three fully connected layers to the capsule network with static routing.

Figure 5: Reconstruction layer consisting of 3 fully connected layers, parameterized by the sequence length and the embedding size.

We added the MSE loss between the input and the output of the reconstruction layer to the training loss, scaled down by a factor of 0.03. Pretrained word vectors were not utilized. We examined the decoder results after adding random noise between -0.3 and 0.3 to each dimension of the activated capsule in the text capsule layer. For each row of the decoder output, we selected the vocabulary word with the highest cosine similarity.
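The procedure above can be sketched in plain Python. The decoder itself (three fully connected layers) is not reproduced; the sketch only illustrates the scaled reconstruction loss, the perturbation of one capsule dimension, and the cosine-similarity lookup. All function names, the toy vocabulary, and the toy decoder rows are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mse(x, recon):
    """Mean squared error between the input matrix and its reconstruction."""
    diffs = [(a - b) ** 2 for row_x, row_r in zip(x, recon) for a, b in zip(row_x, row_r)]
    return sum(diffs) / len(diffs)

def total_loss(classification_loss, x, recon, scale=0.03):
    """Classification loss plus the reconstruction loss scaled down by 0.03."""
    return classification_loss + scale * mse(x, recon)

def perturb(capsule, index, noise):
    """Return a copy of the capsule with noise added to one dimension."""
    out = list(capsule)
    out[index] += noise
    return out

def decode(rows, vocab):
    """Map each decoder output row to the vocabulary word with the highest cosine similarity."""
    return [max(vocab, key=lambda w: cosine(row, vocab[w])) for row in rows]

capsule = [0.0] * 16                      # a 16-dimensional text capsule
noisy = perturb(capsule, 1, 0.3)          # noise of 0.3 on one element

vocab = {"what": [1.0, 0.0], "is": [0.0, 1.0], "name": [1.0, 1.0]}
decoder_rows = [[0.9, 0.1], [0.1, 0.8], [0.7, 0.6]]   # stand-in for decoder output
print(decode(decoder_rows, vocab))
print(total_loss(1.0, [[1.0, 2.0]], [[1.0, 4.0]]))
```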

Sentence                                               Noise  i
what is the name of neil armstrong’s wife              0      -
what is the name of john davis’s one father            0.3    1
what is the name of john george’s one father           -0.3   2
what is the name of richard davis’s one                0.3    2
what is the name of richard davis’s one father         -0.2   4
what was the name of john davis’s one father           -0.3   5
what is the name that richard davis’s one sons         0.3    7
what was the name of john davis’s one                  -0.3   8
what is the name of richard davis’s one daughters      0.3    10
what was the name of richard robinson’s one daughters  0.3    12
what is the that of one american there                 -0.2   12
what was the name of john davis’s one daughters        0.2    15

Table 8: Random noise is added to the i-th element of the 16-dimensional vector.

The first row in Table 8 is the original TREC-QA sentence with no added noise. When noise is added, the meaning of the question does not change, but some words do. The changed sentences are also newly generated and do not appear in the dataset. In the case of words, we could not visualize detailed changes as we could with images because we measured similarity only against the words included in the vocabulary.

6 Conclusion

In this paper, we proposed the application of capsule networks to the text classification domain and suggested the utilization of a static routing variant. We compared the proposed model to CNNs, and demonstrated that capsule networks are indeed useful for text classification based on seven popular benchmark datasets. We additionally proposed static routing, an alternative to dynamic routing, that results in higher classification accuracies with less computation.


References

  • Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation (OSDI), volume 16, pages 265–283.
  • Dai and Le (2015) Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems (NIPS), pages 3079–3087.
  • Dauphin et al. (2016) Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. 2016. Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083.
  • Hinton et al. (2011) Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. 2011. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer.
  • Kalchbrenner et al. (2014) Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics.
  • Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Le and Mikolov (2014) Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of International Conference on Machine Learning (ICML), pages 1188–1196.
  • Li and Roth (2002) Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the 19th international conference on Computational linguistics-Volume 1, pages 1–7. Association for Computational Linguistics.
  • Maas et al. (2011) Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics (ACL), pages 142–150. Association for Computational Linguistics.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (NIPS), pages 3111–3119.
  • Miyato et al. (2017) Takeru Miyato, Andrew M Dai, and Ian Goodfellow. 2017. Adversarial training methods for semi-supervised text classification. In Proceedings of the International Conference on Learning Representations (ICLR).
  • Pang and Lee (2004) Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd annual meeting on Association for Computational Linguistics (ACL), page 271. Association for Computational Linguistics.
  • Pang and Lee (2005) Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd annual meeting on association for computational linguistics (ACL), pages 115–124. Association for Computational Linguistics.
  • Peng et al. (2017) Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. 2017. Large kernel matters–improve semantic segmentation by global convolutional network. arXiv preprint arXiv:1703.02719.
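  • Ribeiro et al. (2016) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 1135–1144.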
  • Sabour et al. (2017) Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. 2017. Dynamic routing between capsules. In Advances in Neural Information Processing Systems (NIPS), pages 3859–3869.
  • Shen et al. (2018) Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, and Chengqi Zhang. 2018. Bi-directional block self-attention for fast and memory-efficient sequence modeling. In Proceedings of the International Conference on Learning Representations (ICLR).
  • Wiebe et al. (2005) Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language resources and evaluation, 39(2-3):165–210.
  • Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems (NIPS), pages 649–657.