Neural Language Priors
The choice of sentence encoder architecture reflects assumptions about how a sentence’s meaning is composed from its constituent words. We examine the contribution of these architectures by holding them fixed at their random initialisation, effectively treating them as hand-crafted language priors, and evaluating the resulting sentence encoders on downstream language tasks. We find that even when encoders are presented with additional information that can be used to solve tasks, the corresponding priors do not leverage this information, except in an isolated case. We also find that apparently uninformative priors are just as good as seemingly informative priors on almost all tasks, indicating that learning is a necessary component to leverage the information provided by the choice of architecture.
1 Introduction

Sentence representations are fixed-length vectors that encode sentence properties and allow models to learn across many \glsnlp tasks. These representations enable learning procedures to focus on the training signal from specific “downstream” \glsnlp tasks [Conneau and Kiela(2018)], circumventing the often limited amount of labelled data. Naturally, sentence representations that can effectively encode semantic and syntactic properties into a representation are highly sought after, and are a cornerstone of modern \glsnlp systems.
In practice, sentence representations are formed by applying an encoding function (or encoder), provided by a \glsnn architecture, to the word vectors of the corresponding sentence. Encoders have been successfully trained to predict the context of a sentence [Kiros et al.(2015)Kiros, Zhu, Salakhutdinov, Zemel, Torralba, Urtasun, and Fidler, Ba et al.(2016)Ba, Kiros, and Hinton], or to leverage supervised multi-task objectives [Conneau et al.(2017)Conneau, Kiela, Schwenk, Barrault, and Bordes, Dehghani et al.(2018)Dehghani, Gouws, Vinyals, Uszkoreit, and Kaiser].
The choice of encoder architecture asserts an inductive bias [Battaglia et al.(2018)Battaglia, Hamrick, Bapst, Sanchez-Gonzalez, Zambaldi, Malinowski, Tacchetti, Raposo, Santoro, Faulkner, Gulcehre, Song, Ballard, Gilmer, Dahl, Vaswani, Allen, Nash, Langston, Dyer, Heess, Wierstra, Kohli, Botvinick, Vinyals, Li, and Pascanu], and reflects assumptions about the data-generating process. Different encoders naturally prioritise one solution over another [Mitchell(1991)], independent of the observed data, trading sample complexity for flexibility [Geman et al.(2008)Geman, Bienenstock, and Doursat]. Given that \glsplnn, which are able to generalise well, can also overfit when presented with random labels [Zhang et al.(2016)Zhang, Bengio, Hardt, Recht, and Vinyals], we expect that architecture plays a dominant role in generalisation capability [Lempitsky et al.(2018)Lempitsky, Vedaldi, and Ulyanov].
The inductive biases of encoder architectures reflect assumptions about how a sentence’s meaning is composed from its constituent words. A plethora of architectures have been investigated, each designed with a specific set of inductive biases in mind. \glsboe architectures disregard word order [Harris(1954), Salton et al.(1975)Salton, Wong, and Yang, Manning et al.(2008)Manning, Raghavan, and Schutze], \glsrnn architectures can leverage word positional information [Kiros et al.(2015)Kiros, Zhu, Salakhutdinov, Zemel, Torralba, Urtasun, and Fidler, Ba et al.(2016)Ba, Kiros, and Hinton], \glscnn architectures compose information at the $n$-gram level [Collobert et al.(2011)Collobert, Weston, Bottou, Karlen, Kavukcuoglu, and Kuksa, Vieira and Moura(2017), Gan et al.(2016)Gan, Pu, Henao, Li, He, and Carin], self-attention models leverage explicit positional information with long-range context [Vaswani et al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin, Ahmed et al.(2017)Ahmed, Keskar, and Socher, Shaw et al.(2018)Shaw, Uszkoreit, and Vaswani, Dehghani et al.(2018)Dehghani, Gouws, Vinyals, Uszkoreit, and Kaiser, Radford et al.(2019)Radford, Wu, Child, Luan, Amodei, and Sutskever, Devlin et al.(2018)Devlin, Chang, Lee, and Toutanova, Cer et al.(2018)Cer, Yang, Kong, Hua, Limtiaco, John, Constant, Guajardo-Cespedes, Yuan, Tar, Sung, Strope, and Kurzweil], and graph-based models can exploit linguistic structures extracted by traditional \glsnlp methods [Tai et al.(2015)Tai, Socher, and Manning, Li et al.(2018)Li, Richtarik, Ding, and Gao, Zhang et al.(2019)Zhang, Bengio, Hardt, and Singer, Teng and Zhang(2016), Kim et al.(2018a)Kim, Lee, Song, and Yoon, Ahmed et al.(2019)Ahmed, Samee, and Mercer, Bastings et al.(2017)Bastings, Titov, Aziz, Marcheggiani, and Sima’an, Marcheggiani and Titov(2017), Marcheggiani et al.(2018)Marcheggiani, Bastings, and Titov, Marcheggiani and Perez-Beltrachini(2018)]. This list is far from exhaustive.
Given the critical role of encoder architectures in \glsnlp, we set out to examine their contribution to downstream task performance independent of biases induced by learning processes. We find that even architectures expected to have extremely strong language priors yield almost no gains when compared to architectures that are equipped with apparently uninformative priors, consistent with the results found in \citetWieting2019. This suggests that for \glsnlp tasks, relying on the prior is insufficient, and the learning process is necessary, in contrast to what was found in the vision field [Lempitsky et al.(2018)Lempitsky, Vedaldi, and Ulyanov]. In short, although there are known strong inductive biases for language, there is no best language prior, and in practice there is surprisingly little correspondence between the two.
To show this, given a set of pre-trained word embeddings, we evaluate the classification accuracy of a variety of architectures on a set of \glsnlp tasks, updating only the parameters specific to the task while holding the parameters of the architecture fixed at their random initialisation.
2.1 Priors from Random Sentence Encoders
The line of investigation we take follows \citetWieting2019 closely. We treat randomly initialised \glsplnn as hand-crafted priors for how the meaning of a sentence is composed from its constituent words. Concretely, let each word $w$ have a pre-trained and fixed $d$-dimensional word representation $\mathbf{e}_w \in \mathbb{R}^d$. Consider a sentence $s$ consisting of words $w_1, \dots, w_T$. Using an encoding function $f_{\theta}$, the meaning of the sentence is distilled into a sentence representation $\mathbf{h}_s$:
$$\mathbf{h}_s = f_{\theta}(\mathbf{e}_{w_1}, \dots, \mathbf{e}_{w_T}),$$
where $\theta$ are the parameters of the encoding function. For \glsnn architectures that output a matrix $H \in \mathbb{R}^{d_{\text{out}} \times T'}$, where $d_{\text{out}}$ is an output dimensionality and $T'$ is a temporal dimensionality\footnote{In practice $T'$ may not directly correspond to the length of the input sentence due to e.g. finite kernel sizes in convolution operations.}, we pool along the temporal dimension using a pooling function $g$, so that $\mathbf{h}_s = g(H)$. For our main results we use max pooling throughout, as it has been successful in InferSent [Conneau et al.(2017)Conneau, Kiela, Schwenk, Barrault, and Bordes].
The parameters $\theta$ are typically learned using e.g. \glsmle on sentence context, resulting in $f_{\theta}$ representing a sample from the encoder’s posterior over functions, applied to $s$, given a corpus. Instead of learning $\theta$, we simply sample it from its own prior. $f_{\theta}$ then represents a sample from the encoder’s prior over functions applied to $s$.
For each encoding function, we take multiple samples of $\theta$. For each sample, the resulting encoder is used to produce sentence embeddings for a set of downstream tasks. These downstream tasks are the supervised transfer tasks of the SentEval [Conneau and Kiela(2018)] framework, where the transfer model is a simple logistic regression model or an MLP\footnote{For emphasis: the parameters of the logistic regression model and the MLP are updated by the task.}. Combining the results from multiple samples then gives a performance estimate of each encoder’s prior.
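To make the setup concrete, the following is a minimal PyTorch sketch (not the authors’ code) of this pipeline: a randomly initialised, frozen encoder maps pre-trained word vectors to a pooled sentence representation, and only a task-specific classifier head is trained. The name `RandomProjectionEncoder` and all shapes are illustrative.

```python
# Minimal sketch: a fixed, randomly initialised encoder produces sentence
# embeddings; only a logistic-regression head is trained on the task.
import torch
import torch.nn as nn

torch.manual_seed(0)  # one "sample" of theta from the prior

d_word, d_out, n_classes = 300, 4096, 2

class RandomProjectionEncoder(nn.Module):
    """BOREP-style encoder: random projection of word vectors, then max pooling."""
    def __init__(self, d_word, d_out):
        super().__init__()
        self.proj = nn.Linear(d_word, d_out)
        for p in self.parameters():
            p.requires_grad = False  # theta stays at its random initialisation

    def forward(self, emb):              # emb: (batch, T, d_word)
        h = self.proj(emb)               # (batch, T, d_out)
        return h.max(dim=1).values       # pool over the temporal dimension

encoder = RandomProjectionEncoder(d_word, d_out)
classifier = nn.Linear(d_out, n_classes)              # only these weights are updated
optimiser = torch.optim.Adam(classifier.parameters(), lr=1e-3)

emb_batch = torch.randn(8, 20, d_word)                # stand-in for pre-trained word vectors
labels = torch.randint(0, n_classes, (8,))
with torch.no_grad():
    sent_repr = encoder(emb_batch)                    # h_s = g(f_theta(e_w1 .. e_wT))
loss = nn.functional.cross_entropy(classifier(sent_repr), labels)
loss.backward()
optimiser.step()
```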
2.2 BOREPs, Random LSTMs and ESNs
We take the architectures investigated in [Wieting and Kiela(2019)] as a starting point: \glsborep, Random \glslstm Networks and \glsplesn. \glsborep is simply a random projection of word embeddings to a higher dimension, Rand\glslstm is a randomly initialised bi-directional \glslstm [Hochreiter and Schmidhuber(1997)], and \glsesn is a randomly initialised bi-directional \glsesn [Jaeger(2001)] with tuned hyperparameters. For further details see [Wieting and Kiela(2019)].
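A minimal sketch of the Rand\glslstm variant under the same assumptions as above; it uses PyTorch’s default \glslstm initialisation rather than the exact scheme of [Wieting and Kiela(2019)], and its parameters are never updated.

```python
# Sketch of a randomly initialised bi-directional LSTM encoder (RandLSTM),
# taking pre-trained word embeddings as input; parameters are never trained.
import torch
import torch.nn as nn

class RandomBiLSTMEncoder(nn.Module):
    def __init__(self, d_word=300, d_out=4096):
        super().__init__()
        # forward + backward halves concatenate to d_out
        self.lstm = nn.LSTM(d_word, d_out // 2, bidirectional=True, batch_first=True)
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, emb):                  # (batch, T, d_word)
        h, _ = self.lstm(emb)                # (batch, T, d_out)
        return h.max(dim=1).values           # max pool over time

encoder = RandomBiLSTMEncoder()
print(encoder(torch.randn(2, 15, 300)).shape)   # torch.Size([2, 4096])
```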
2.3 Random \glsplcnn
(Results-table rows for the random \glscnn encoders at dimension 4096; per-task scores, mean (std) over seeds:
CNN, window = 3: 74.9 (.3), 76.9 (.7), 85.4 (.2), 88.6 (.1), 75.6 (.5), 88.7 (1.2), 79.1 (.2), 69.4 (.5)
CNN, window = 4: 74.3 (.3), 74.8 (.8), 84.2 (.3), 86.8 (.3), 75.5 (.5), 85.2 (1.1), 78.0 (.2), 69.2 (.3))
Although \glsplcnn are more famously used in the image domain [Simonyan and Zisserman(2014), He et al.(2015)He, Zhang, Ren, and Sun], they have also enjoyed much success as sentence encoders [Collobert et al.(2011)Collobert, Weston, Bottou, Karlen, Kavukcuoglu, and Kuksa, Vieira and Moura(2017), Gan et al.(2016)Gan, Pu, Henao, Li, He, and Carin]. A temporal one-dimensional convolution is performed by applying a $d_{\text{out}}$-channel filter $W$ to each window of $n$ words and adding a bias $b$, giving outputs $\mathbf{c}_t = W[\mathbf{e}_{w_t}; \dots; \mathbf{e}_{w_{t+n-1}}] + b$. The weights are initialised uniformly at random from a symmetric interval whose scale depends on the word embedding dimension $d$. The sentence representation is then obtained by pooling, $\mathbf{h}_s = g([\mathbf{c}_1, \dots, \mathbf{c}_{T'}])$.
Note that using a window size of 1 corresponds to \glsborep.
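A sketch of such a random \glscnn encoder, again with PyTorch defaults rather than the exact initialisation described above; setting the window to 1 recovers a per-word projection analogous to \glsborep.

```python
# Sketch of the random CNN encoder: a single 1-D convolution over word
# embeddings followed by max pooling; weights stay at their random init.
import torch
import torch.nn as nn

class RandomCNNEncoder(nn.Module):
    def __init__(self, d_word=300, d_out=4096, window=3):
        super().__init__()
        # Conv1d expects (batch, channels, T); padding keeps short sentences usable
        self.conv = nn.Conv1d(d_word, d_out, kernel_size=window, padding=window - 1)
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, emb):                       # (batch, T, d_word)
        c = self.conv(emb.transpose(1, 2))        # (batch, d_out, T')
        return c.max(dim=2).values                # pool over the temporal dimension

enc1 = RandomCNNEncoder(window=1)                 # window = 1 behaves like BOREP
enc3 = RandomCNNEncoder(window=3)                 # composes 3-gram windows
print(enc3(torch.randn(2, 12, 300)).shape)        # torch.Size([2, 4096])
```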
2.4 Random Self-Attention
Attention mechanisms have been employed on many \glsnlp tasks with tremendous success [Vaswani et al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin, Ahmed et al.(2017)Ahmed, Keskar, and Socher, Shaw et al.(2018)Shaw, Uszkoreit, and Vaswani, Dehghani et al.(2018)Dehghani, Gouws, Vinyals, Uszkoreit, and Kaiser, Radford et al.(2019)Radford, Wu, Child, Luan, Amodei, and Sutskever, Devlin et al.(2018)Devlin, Chang, Lee, and Toutanova, Cer et al.(2018)Cer, Yang, Kong, Hua, Limtiaco, John, Constant, Guajardo-Cespedes, Yuan, Tar, Sung, Strope, and Kurzweil]. Self-attention in particular has enabled the incorporation of very long-range context, as well as hierarchical contextualisation of word embeddings, within a highly parallel setting.
In our random setting, the word embeddings are first projected up to a $d_{\text{out}}$-dimensional space. We then optionally add sinusoidal positional encodings [Vaswani et al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin]. We then apply two layers of random self-attention with residual connections, each followed by layer normalisation. A single head of a self-attention layer produces new embeddings for the query representations $Q$ from the value representations $V$, weighted according to the key representations $K$:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V.$$
The $d_k$-dimensional key and query representations are given by independent random projections acting upon the self-attention layer input. We use eight heads of attention in each layer. The pooling function $g$ is applied to the output of the final layer to produce the sentence representation $\mathbf{h}_s$.
We keep the default initialisation of the FairSeq implementation, which is Xavier uniform [Glorot and Bengio(2010)] for the weights of the self-attention layer.
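The sketch below illustrates this construction with PyTorch’s `nn.MultiheadAttention` (whose input projections are also Xavier-uniform by default); the exact layer layout of the FairSeq-based model we use may differ, so this is indicative only.

```python
# Sketch of the random self-attention encoder: project up, optionally add
# sinusoidal positions, apply two fixed random self-attention layers with
# residual connections and layer norm, then max pool.
import math
import torch
import torch.nn as nn

def sinusoidal_positions(T, d):
    pos = torch.arange(T, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d, 2, dtype=torch.float) * (-math.log(10000.0) / d))
    pe = torch.zeros(T, d)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class RandomSelfAttentionEncoder(nn.Module):
    def __init__(self, d_word=300, d_out=4096, n_heads=8, use_positions=True):
        super().__init__()
        self.proj = nn.Linear(d_word, d_out)
        self.attn = nn.ModuleList([nn.MultiheadAttention(d_out, n_heads, batch_first=True)
                                   for _ in range(2)])
        self.norm = nn.ModuleList([nn.LayerNorm(d_out) for _ in range(2)])
        self.use_positions = use_positions
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, emb):                              # (batch, T, d_word)
        x = self.proj(emb)
        if self.use_positions:
            x = x + sinusoidal_positions(x.size(1), x.size(2))
        for attn, norm in zip(self.attn, self.norm):
            a, _ = attn(x, x, x)                         # queries = keys = values = x
            x = norm(x + a)                              # residual + layer norm
        return x.max(dim=1).values                       # pool to sentence vector

print(RandomSelfAttentionEncoder()(torch.randn(2, 10, 300)).shape)  # torch.Size([2, 4096])
```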
2.5 Random Tree\glspllstm
The final architecture we consider is the Tree\glslstm. This architecture is particularly interesting as it can potentially incorporate syntactic information into the sentence representations [Tai et al.(2015)Tai, Socher, and Manning, Li et al.(2018)Li, Richtarik, Ding, and Gao, Zhang et al.(2019)Zhang, Bengio, Hardt, and Singer, Teng and Zhang(2016), Kim et al.(2018a)Kim, Lee, Song, and Yoon, Ahmed et al.(2019)Ahmed, Samee, and Mercer].
We specifically consider the Binary Constituency Tree\glslstm [Tai et al.(2015)Tai, Socher, and Manning]. This differs from a regular \glslstm by having two forget gates, one for each child node given by the structure of the parsed sentence.
Word representations are first presented to a random bi-directional \glslstm, whose forward and backward outputs are concatenated to a combined dimensionality of $d_{\text{out}}$, to provide contextualised representations.
The contextualised representations are then presented to a random Tree\glslstm, whose outputs are pooled to produce the sentence representation $\mathbf{h}_s$.
The weights of both the bi-directional \glslstm and the Tree\glslstm are initialised uniformly at random. We used the Stanford parser [Manning et al.(2014)Manning, Surdeanu, Bauer, Finkel, Bethard, and McClosky] to parse each sentence. Punctuation and special characters were removed, and numbers were kept only if they formed an independent word and were not part of a mixed word of letters and numbers. If this reduced the length of a word to zero, the word was replaced with a placeholder * character. After parsing, the preprocessing described in [Kim et al.(2018a)Kim, Lee, Song, and Yoon] was used to compute the parse tree for the Tree\glslstm.
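For illustration, a minimal sketch of a binary-constituency Tree\glslstm node with two forget gates, composed recursively over a parse; the leaf inputs stand in for the contextualised bi-directional \glslstm outputs, and the weights stay at their random initialisation.

```python
# Sketch of a binary-constituency Tree-LSTM node (Tai et al., 2015): unlike a
# chain LSTM cell, it takes two children and has a separate forget gate for each.
import torch
import torch.nn as nn

class BinaryTreeLSTMCell(nn.Module):
    def __init__(self, d):
        super().__init__()
        # gates: input, output, cell candidate, and one forget gate per child
        self.W = nn.Linear(2 * d, 5 * d)
        for p in self.parameters():
            p.requires_grad = False      # fixed at random initialisation

    def forward(self, left, right):
        # left / right are (h, c) pairs from the two children
        (h_l, c_l), (h_r, c_r) = left, right
        i, o, u, f_l, f_r = self.W(torch.cat([h_l, h_r], dim=-1)).chunk(5, dim=-1)
        c = torch.sigmoid(i) * torch.tanh(u) \
            + torch.sigmoid(f_l) * c_l + torch.sigmoid(f_r) * c_r
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

d = 8
cell = BinaryTreeLSTMCell(d)
leaf = lambda: (torch.randn(1, d), torch.zeros(1, d))   # leaves come from the bi-LSTM
root_h, root_c = cell(cell(leaf(), leaf()), leaf())     # compose ((w1 w2) w3)
print(root_h.shape)                                     # torch.Size([1, 8])
```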
2.6 Evaluation

The SentEval tasks we evaluate on are sentiment analysis (MR, SST), question-type classification (TREC), product reviews (CR), subjectivity (SUBJ), opinion polarity (MPQA), paraphrase detection (MRPC), and entailment (SICK-E). We use the default SentEval settings defined in [Conneau and Kiela(2018)]. We evaluate five samples (seeds) per architecture per task.
We follow the FairSeq implementation [Ott et al.(2019)Ott, Edunov, Baevski, Fan, Gross, Ng, Grangier, and Auli] to build our \glscnn and self-attention networks. We also follow the implementation of [Kim et al.(2018b)Kim, Choi, Edmiston, Bae, and Lee] without the structure-aware tag representations to build our Tree\glspllstm.
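The evaluation loop then follows the pattern of SentEval’s example scripts: a `batcher` runs the fixed random encoder over pre-trained word vectors and returns sentence embeddings, and SentEval trains only the task classifier. In this sketch, `lookup_embeddings` and `encoder` are placeholders for components defined elsewhere, and the classifier settings shown only approximate the framework’s defaults.

```python
# Sketch of the SentEval evaluation loop with a fixed random encoder.
import numpy as np
import torch
import senteval

def prepare(params, samples):
    return  # nothing to fit: the encoder stays at its random initialisation

def batcher(params, batch):
    reps = []
    for sent in batch:
        emb = lookup_embeddings(sent)                 # (1, T, d_word), pre-trained vectors
        with torch.no_grad():
            reps.append(encoder(emb).squeeze(0).numpy())
    return np.vstack(reps)

params = {'task_path': 'SentEval/data', 'usepytorch': True, 'kfold': 10,
          'classifier': {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
                         'tenacity': 5, 'epoch_size': 4}}   # nhid=0: logistic regression
se = senteval.engine.SE(params, batcher, prepare)
results = se.eval(['MR', 'CR', 'SUBJ', 'MPQA', 'SST2', 'TREC', 'MRPC', 'SICKEntailment'])
```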
3 Results

Our investigation is concerned with the priors of encoder architectures, rather than the posteriors they may learn from data; we therefore only compare untrained encoders acting upon word embeddings.
\Creftable:results_max contains the performance of the architectures discussed in \Crefsec:method at dimensionality 4096 on the selected SentEval tasks, together with the results from \citetWieting2019. \Creffigure:results_max contains the performance of these architectures across a range of dimensionalities.
As a sanity check, we evaluated \glsborep and \glscnn with a window size of 1 and found the performance indistinguishable.
In general, we find that even if encoders have inductive biases that present additional information that can be used to solve a task, the corresponding priors do not leverage this information, except in an isolated case. This strongly indicates that learning is an essential component of building encoder architectures if any gains are to be made beyond apparently uninformative priors.
4 Conclusion

We have evaluated randomly initialised architectures to measure the contribution of priors in distilling sentence meaning. We find that apparently uninformative priors are just as good as seemingly informative priors on almost all tasks, indicating that learning is a necessary component to leverage the information provided by architecture choice.
Acknowledgements

We thank Jeremie Vallee for assistance with the experimental setup, and the wider machine learning group at Babylon for useful comments and support throughout this project.
- [Ahmed et al.(2017)Ahmed, Keskar, and Socher] Karim Ahmed, Nitish Shirish Keskar, and Richard Socher. 2017. Weighted Transformer Network for Machine Translation. pages 1–10.
- [Ahmed et al.(2019)Ahmed, Samee, and Mercer] Mahtab Ahmed, Muhammad Rifayat Samee, and Robert E. Mercer. 2019. Improving Tree-LSTM with Tree Attention. In 2019 IEEE 13th International Conference on Semantic Computing (ICSC), pages 247–254. IEEE.
- [Ba et al.(2016)Ba, Kiros, and Hinton] Jimmy Lei Ba, Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization.
- [Bastings et al.(2017)Bastings, Titov, Aziz, Marcheggiani, and Sima’an] Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Sima’an. 2017. Graph Convolutional Encoders for Syntax-aware Neural Machine Translation. pages 1957–1967.
- [Battaglia et al.(2018)Battaglia, Hamrick, Bapst, Sanchez-Gonzalez, Zambaldi, Malinowski, Tacchetti, Raposo, Santoro, Faulkner, Gulcehre, Song, Ballard, Gilmer, Dahl, Vaswani, Allen, Nash, Langston, Dyer, Heess, Wierstra, Kohli, Botvinick, Vinyals, Li, and Pascanu] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Caglar Gulcehre, Francis Song, Andrew Ballard, Justin Gilmer, George Dahl, Ashish Vaswani, Kelsey Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matt Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. 2018. Relational inductive biases, deep learning, and graph networks. Under Review, pages 1–37.
- [Cer et al.(2018)Cer, Yang, Kong, Hua, Limtiaco, John, Constant, Guajardo-Cespedes, Yuan, Tar, Sung, Strope, and Kurzweil] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal Sentence Encoder.
- [Collobert et al.(2011)Collobert, Weston, Bottou, Karlen, Kavukcuoglu, and Kuksa] Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural Language Processing (almost) from Scratch.
- [Conneau and Kiela(2018)] Alexis Conneau and Douwe Kiela. 2018. SentEval: An Evaluation Toolkit for Universal Sentence Representations.
- [Conneau et al.(2017)Conneau, Kiela, Schwenk, Barrault, and Bordes] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data.
- [Conneau et al.(2018)Conneau, Kruszewski, Lample, Barrault, and Baroni] Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single vector: Probing sentence embeddings for linguistic properties.
- [Dehghani et al.(2018)Dehghani, Gouws, Vinyals, Uszkoreit, and Kaiser] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. 2018. Universal Transformers. pages 1–23.
- [Devlin et al.(2018)Devlin, Chang, Lee, and Toutanova] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
- [Gan et al.(2016)Gan, Pu, Henao, Li, He, and Carin] Zhe Gan, Yunchen Pu, Ricardo Henao, Chunyuan Li, Xiaodong He, and Lawrence Carin. 2016. Learning Generic Sentence Representations Using Convolutional Neural Networks. Emnlp, pages 2380–2390.
- [Geman et al.(2008)Geman, Bienenstock, and Doursat] Stuart Geman, Elie Bienenstock, and René Doursat. 2008. Neural Networks and the Bias/Variance Dilemma. Neural Computation, 4(1):1–58.
- [Glorot and Bengio(2010)] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS-10), 9:249–256.
- [Harris(1954)] Zellig S. Harris. 1954. Distributional Structure. WORD, 10(2-3):146–162.
- [He et al.(2015)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition.
- [Hochreiter and Schmidhuber(1997)] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780.
- [Jaeger(2001)] Herbert Jaeger. 2001. The “echo state” approach to analysing and training recurrent neural networks - with an Erratum note. Technical Report 148.
- [Kim et al.(2018a)Kim, Lee, Song, and Yoon] Sungwon Kim, Sang-gil Lee, Jongyoon Song, and Sungroh Yoon. 2018a. FloWaveNet : A Generative Flow for Raw Audio.
- [Kim et al.(2018b)Kim, Choi, Edmiston, Bae, and Lee] Taeuk Kim, Jihun Choi, Daniel Edmiston, Sanghwan Bae, and Sang-goo Lee. 2018b. Dynamic Compositionality in Recursive Neural Networks with Structure-aware Tag Representations.
- [Kiros et al.(2015)Kiros, Zhu, Salakhutdinov, Zemel, Torralba, Urtasun, and Fidler] Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-Thought Vectors.
- [Lempitsky et al.(2018)Lempitsky, Vedaldi, and Ulyanov] Victor Lempitsky, Andrea Vedaldi, and Dmitry Ulyanov. 2018. Deep Image Prior. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9446–9454. IEEE.
- [Li and Roth(2002)] Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the 19th international conference on Computational linguistics -, volume 1, pages 1–7, Morristown, NJ, USA. Association for Computational Linguistics.
- [Li et al.(2018)Li, Richtarik, Ding, and Gao] Yu Li, Peter Richtarik, Lizhong Ding, and Xin Gao. 2018. On the Decision Boundary of Deep Neural Networks.
- [Manning et al.(2014)Manning, Surdeanu, Bauer, Finkel, Bethard, and McClosky] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Stroudsburg, PA, USA. Association for Computational Linguistics.
- [Manning et al.(2008)Manning, Raghavan, and Schutze] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze. 2008. Introduction to Information Retrieval.
- [Marcheggiani et al.(2018)Marcheggiani, Bastings, and Titov] Diego Marcheggiani, Joost Bastings, and Ivan Titov. 2018. Exploiting Semantics in Neural Machine Translation with Graph Convolutional Networks.
- [Marcheggiani and Perez-Beltrachini(2018)] Diego Marcheggiani and Laura Perez-Beltrachini. 2018. Deep Graph Convolutional Encoders for Structured Data to Text Generation. pages 1–9.
- [Marcheggiani and Titov(2017)] Diego Marcheggiani and Ivan Titov. 2017. Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling. pages 1506–1515.
- [Mitchell(1991)] Tom Mitchell. 1991. The need for biases in learning generalisations. Readings in Machine Learning, (May).
- [Ott et al.(2019)Ott, Edunov, Baevski, Fan, Gross, Ng, Grangier, and Auli] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling.
- [Radford et al.(2019)Radford, Wu, Child, Luan, Amodei, and Sutskever] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. Open AI.
- [Salton et al.(1975)Salton, Wong, and Yang] G. Salton, A. Wong, and C. S. Yang. 1975. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620.
- [Shaw et al.(2018)Shaw, Uszkoreit, and Vaswani] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-Attention with Relative Position Representations.
- [Simonyan and Zisserman(2014)] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. pages 1–14.
- [Tai et al.(2015)Tai, Socher, and Manning] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks.
- [Teng and Zhang(2016)] Zhiyang Teng and Yue Zhang. 2016. Bidirectional Tree-Structured LSTM with Head Lexicalization.
- [Vaswani et al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need.
- [Vieira and Moura(2017)] Joao Paulo Albuquerque Vieira and Raimundo Santos Moura. 2017. An analysis of convolutional neural networks for sentence classification. In 2017 XLIII Latin American Computer Conference (CLEI), volume 2017-Janua, pages 1–5. IEEE.
- [Wieting and Kiela(2019)] John Wieting and Douwe Kiela. 2019. No Training Required: Exploring Random Encoders for Sentence Classification. pages 1–16.
- [Zhang et al.(2016)Zhang, Bengio, Hardt, Recht, and Vinyals] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2016. Understanding deep learning requires rethinking generalization.
- [Zhang et al.(2019)Zhang, Bengio, Hardt, and Singer] Chiyuan Zhang, Samy Bengio, Moritz Hardt, and Yoram Singer. 2019. Identity Crisis: Memorization and Generalization under Extreme Overparameterization. pages 1–28.