Neural Language Priors

Joseph Enguehard, Dan Busbridge, Vitalii Zhelezniak, Nils Hammerla
firstname.lastname@babylonhealth.com
Babylon Health, 60 Sloane Ave, Chelsea, London SW3 3DD
Abstract

The choice of sentence encoder architecture reflects assumptions about how a sentence’s meaning is composed from its constituent words. We examine the contribution of these architectures by holding them randomly initialised and fixed, effectively treating them as hand-crafted language priors, and evaluating the resulting sentence encoders on downstream language tasks. We find that even when encoders are presented with additional information that can be used to solve tasks, the corresponding priors do not leverage this information, except in an isolated case. We also find that apparently uninformative priors are just as good as seemingly informative priors on almost all tasks, indicating that learning is a necessary component to leverage information provided by architecture choice.


1 Introduction

Sentence representations are fixed-length vectors that encode sentence properties and allow models to learn across many natural language processing (NLP) tasks. These representations enable learning procedures to focus on the training signal from specific “downstream” NLP tasks [Conneau and Kiela(2018)], circumventing the often limited amount of labelled data. Naturally, sentence representations that can effectively encode semantic and syntactic properties are highly sought after, and are a cornerstone of modern NLP systems.

In practice, sentence representations are formed by applying an encoding function (or encoder), provided by a neural network (NN) architecture, to the word vectors of the corresponding sentence. Encoders have been successfully trained to predict the context of a sentence [Kiros et al.(2015), Ba et al.(2016)], or to leverage supervised multi-task objectives [Conneau et al.(2017), Dehghani et al.(2018)].

The choice of encoder architecture asserts an inductive bias [Battaglia et al.(2018)], and reflects assumptions about the data-generating process. Different encoders naturally prioritise one solution over another [Mitchell(1991)], independent of the observed data, trading sample complexity for flexibility [Geman et al.(2008)]. Given that NNs, which are able to generalise well, can also overfit when presented with random labels [Zhang et al.(2016)], we expect that architecture plays a dominant role in generalisation capability [Lempitsky et al.(2018)].

The inductive biases of encoder architectures reflect assumptions about how a sentence’s meaning is composed from its constituent words. A plethora of architectures have been investigated, each designed with a specific set of inductive biases in mind. Bag-of-embeddings (BOE) architectures disregard word order [Harris(1954), Salton et al.(1975), Manning et al.(2008)], RNN architectures can leverage word positional information [Kiros et al.(2015), Ba et al.(2016)], CNN architectures compose information at the n-gram level [Collobert et al.(2011), Vieira and Moura(2017), Gan et al.(2016)], self-attention models leverage explicit positional information with long-range context [Vaswani et al.(2017), Ahmed et al.(2017), Shaw et al.(2018), Dehghani et al.(2018), Radford et al.(2019), Devlin et al.(2018), Cer et al.(2018)], and graph-based models can exploit linguistic structures extracted by traditional NLP methods [Tai et al.(2015), Li et al.(2018), Zhang et al.(2019), Teng and Zhang(2016), Kim et al.(2018a), Ahmed et al.(2019), Bastings et al.(2017), Marcheggiani and Titov(2017), Marcheggiani et al.(2018), Marcheggiani and Perez-Beltrachini(2018)]. This list is far from exhaustive.

Given the critical role of encoder architectures in NLP, we set out to examine their contribution to downstream task performance independent of biases induced by learning processes. We find that even architectures expected to have extremely strong language priors yield almost no gains when compared to architectures equipped with apparently uninformative priors, consistent with the results of Wieting and Kiela (2019). This suggests that for NLP tasks, relying on the prior is insufficient and the learning process is necessary, in contrast to findings in the vision field [Lempitsky et al.(2018)]. In short, although there are known strong inductive biases for language, there is no best language prior, and in practice there is surprisingly little correspondence between the two.

To show this, given a set of pre-trained word embeddings, we evaluate the classification accuracy of a variety of architectures on a set of NLP tasks, updating only the parameters specific to the task and holding the parameters of the architecture fixed at their random initialisation.

2 Method


2.1 Priors from Random Sentence Encoders


The line of investigation we take follows Wieting and Kiela (2019) closely. We treat randomly initialised NNs as hand-crafted priors for how the meaning of a sentence is composed from its constituent words. Concretely, let each word $w$ have a pre-trained and fixed $d$-dimensional word representation $\mathbf{w} \in \mathbb{R}^d$. Consider a sentence $s$ consisting of words $w_1, \ldots, w_n$. Using an encoding function $f_\theta$, the meaning of the sentence is distilled into a sentence representation $\mathbf{s} \in \mathbb{R}^{d_s}$:

$\mathbf{s} = f_\theta(\mathbf{w}_1, \ldots, \mathbf{w}_n)$    (1)

where $\theta$ are the parameters of the encoding function. For NN architectures that output a matrix $\mathbf{H} \in \mathbb{R}^{d_s \times T}$, where $d_s$ is an output dimensionality and $T$ is a temporal dimensionality (in practice $T$ may not directly correspond to the length of the input sentence due to, e.g., finite kernel sizes in convolution operations), we pool along the temporal dimension using a pooling function $g$. For our main results we use max pooling throughout, as it has been successful in InferSent [Conneau et al.(2017)].

The parameters $\theta$ are typically learned, e.g. using maximum likelihood estimation (MLE) on sentence context, so that $f_\theta$ represents a sample from the encoder’s posterior over functions applied to $\mathbf{w}_1, \ldots, \mathbf{w}_n$ given a corpus. Instead of learning $\theta$, we simply sample it from its own prior; $f_\theta$ then represents a sample from the encoder’s prior over functions applied to $\mathbf{w}_1, \ldots, \mathbf{w}_n$.

For each encoding function, we take multiple samples of $\theta$. For each sample, the resulting encoder $f_\theta$ is used to produce sentence embeddings for a set of downstream tasks. These downstream tasks are the supervised transfer tasks of the SentEval framework [Conneau and Kiela(2018)], where the transfer model is a simple logistic regression model or an MLP (for emphasis: the parameters of the logistic regression model and the MLP are updated by the task). Combining the results from multiple samples then gives a performance estimate of each encoder’s prior.
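To make this protocol concrete, the following sketch (PyTorch and scikit-learn) samples a single $\theta$, freezes it, max-pools the encoder output, and trains only a logistic regression transfer model. The linear encoder and the toy data splits are hypothetical stand-ins for the architectures and SentEval tasks described below.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

torch.manual_seed(0)                        # one sample of theta from the prior
encoder = torch.nn.Linear(300, 4096)        # stand-in for any encoder in Sections 2.2-2.5
for p in encoder.parameters():
    p.requires_grad_(False)                 # theta stays at its random initialisation

def embed(sentence: torch.Tensor) -> np.ndarray:
    """sentence: (num_words, 300) matrix of fixed, pre-trained word vectors."""
    with torch.no_grad():
        h = encoder(sentence)               # (num_words, 4096)
        return h.max(dim=0).values.numpy()  # max pool over the temporal dimension

# Toy stand-ins for a SentEval task split; in practice these come from the task.
train_sents = [torch.randn(12, 300) for _ in range(64)]
test_sents = [torch.randn(12, 300) for _ in range(16)]
train_labels = np.random.randint(0, 2, size=64)
test_labels = np.random.randint(0, 2, size=16)

# Only the transfer model is trained on the downstream task.
clf = LogisticRegression(max_iter=1000).fit(
    np.stack([embed(s) for s in train_sents]), train_labels)
print("accuracy:", clf.score(np.stack([embed(s) for s in test_sents]), test_labels))
```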

2.2 BOREPs, Random LSTMs and ESNs

We take the architectures investigated by Wieting and Kiela (2019) as a starting point: BOREP (bag of random embedding projections), Random LSTM Networks, and echo state networks (ESNs). BOREP is simply a random projection of word embeddings to a higher dimension, RandLSTM is a randomly initialised bi-directional LSTM [Hochreiter and Schmidhuber(1997)], and the ESN is a hyperparameter-tuned, randomly initialised bi-directional echo state network [Jaeger(2001)]. For more details, see Wieting and Kiela (2019).
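As a rough illustration (not the exact implementation of Wieting and Kiela (2019)), the sketch below builds a BOREP-style random projection and a RandLSTM in PyTorch and max-pools their outputs. The uniform $[-1/\sqrt{d}, 1/\sqrt{d}]$ projection range and the toy word vectors are assumptions made for the sketch.

```python
import torch

torch.manual_seed(0)
d, d_s = 300, 2048                           # word dim; per-direction output dim

# BOREP: a single random projection of each word vector to a higher dimension.
# The init range here is assumed to match the CNN init described below.
W = torch.empty(2 * d_s, d).uniform_(-d ** -0.5, d ** -0.5)

# RandLSTM: a randomly initialised bi-directional LSTM that is never trained.
rand_lstm = torch.nn.LSTM(input_size=d, hidden_size=d_s,
                          bidirectional=True, batch_first=True)
for p in rand_lstm.parameters():
    p.requires_grad_(False)

words = torch.randn(1, 12, d)                # stand-in for 12 pre-trained word vectors
with torch.no_grad():
    borep_emb = (words.squeeze(0) @ W.t()).max(dim=0).values   # (4096,)
    lstm_out, _ = rand_lstm(words)                             # (1, 12, 4096)
    rand_lstm_emb = lstm_out.squeeze(0).max(dim=0).values      # (4096,)
```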

2.3 Random CNNs


Model             Dim   MR        CR        MPQA      SUBJ      SST2      TREC      SICK-E    MRPC
BOE*              300   77.3(.2)  78.6(.3)  87.6(.1)  91.3(.1)  80.0(.5)  81.5(.8)  78.7(.1)  72.9(.3)
BOREP*            4096  77.4(.4)  79.5(.2)  88.3(.2)  91.9(.2)  81.8(.4)  88.8(.3)  82.7(.7)  73.9(.4)
BOREP (ours)      4096  75.3(.2)  78.2(.5)  88.5(.2)  90.3(.4)  79.3(1.1) 88.5(1.3) 82.1(.2)  71.8(.7)
RandLSTM*         4096  77.2(.3)  78.7(.5)  87.9(.1)  91.9(.2)  81.5(.3)  86.5(1.1) 81.8(.5)  74.1(.5)
RandLSTM (ours)   4096  76.9(.2)  80.9(.3)  88.7(.1)  91.7(.1)  81.3(.5)  89.2(.4)  81.7(.5)  71.8(.6)
ESN*              4096  78.1(.3)  80.0(.6)  88.5(.2)  92.6(.1)  83.0(.5)  87.9(1.0) 83.1(.4)  73.4(.4)
ESN (ours)        4096  70.4(.1)  76.9(.8)  86.3(.1)  88.7(.4)  76.4(.5)  88.9(1.2) 78.4(.3)  67.4(.7)
CNN Window = 3    4096  74.9(.3)  76.9(.7)  85.4(.2)  88.6(.1)  75.6(.5)  88.7(1.2) 79.1(.2)  69.4(.5)
CNN Window = 4    4096  74.3(.3)  74.8(.8)  84.2(.3)  86.8(.3)  75.5(.5)  85.2(1.1) 78.0(.2)  69.2(.3)
Self-Attention    4096  68.0(.3)  77.1(.5)  82.0(.5)  90.1(.3)  78.8(1.2) 84.9(1.3) 73.7(.7)  67.1(1.1)
TreeLSTM          4096  75.6(.2)  78.5(.3)  87.7(.1)  91.4(.0)  79.9(.5)  90.3(.7)  80.7(.9)  71.1(.5)
Table 1: Performance (accuracy) on all eight tasks. Results marked with * are taken from Wieting and Kiela (2019); the remaining results are ours. Mean (standard deviation) for each model is reported across five seeds. Our ESN was evaluated using a spectral radius of 1.0, a maximum kernel deviation from 0.0 of 0.1, and a sparsity of 0.5, whereas the result from Wieting and Kiela (2019) is the best-performing model from a hyperparameter search.

Although CNNs are more famously used in the image domain [Simonyan and Zisserman(2014), He et al.(2015)], they have also enjoyed much success as sentence encoders [Collobert et al.(2011), Vieira and Moura(2017), Gan et al.(2016)]. A temporal one-dimensional convolution is performed by applying a $d_s$-channel filter to a window of words, and a bias is added. The weights are initialised uniformly at random from $\left[-\tfrac{1}{\sqrt{d}}, \tfrac{1}{\sqrt{d}}\right]$, where $d$ is the word embedding dimension. The sentence representation is then obtained by pooling:

$\mathbf{s} = g\big(\mathrm{Conv1D}_\theta(\mathbf{w}_1, \ldots, \mathbf{w}_n)\big)$    (2)

Note that using a window size of 1 corresponds to BOREP.
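A minimal sketch of such a random convolutional encoder is given below (PyTorch); the sentence tensor stands in for pre-trained word vectors, and the dimensions are illustrative.

```python
import torch

torch.manual_seed(0)
d, d_s, window = 300, 4096, 3               # word dim, output channels, kernel width

conv = torch.nn.Conv1d(in_channels=d, out_channels=d_s, kernel_size=window)
bound = 1.0 / d ** 0.5
with torch.no_grad():                       # re-draw weights uniformly in [-1/sqrt(d), 1/sqrt(d)]
    conv.weight.uniform_(-bound, bound)
    conv.bias.uniform_(-bound, bound)
for p in conv.parameters():
    p.requires_grad_(False)                 # the prior is never updated

words = torch.randn(1, 12, d)               # stand-in for 12 pre-trained word vectors
with torch.no_grad():
    h = conv(words.transpose(1, 2))         # (1, d_s, 12 - window + 1): note T shrinks
    sentence_rep = h.max(dim=2).values.squeeze(0)   # max pool over time -> (4096,)
```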

[Figure: body/results/max-pool-chart.pdf]

Figure 1: Performance (accuracy) on all eight tasks for a range of encoder dimensionalities, across five seeds. We observe: 1) Almost every encoder architecture performs, at best, similarly to the relatively uninformative BOREP, and at worst, much worse. 2) Taking BOREP as a CNN with a window size of 1, we note that increasing the CNN window size impairs performance. This indicates that any gains to be made from employing n-grams over word representations as a basis for distilling meaning need to be learned. 3) The performance of the Self-Attention Network with and without positional encoding is fairly similar. This indicates that although the encoder architecture has positional information available, the transfer model cannot learn to use it. It would be interesting to look at the BShift task to probe this directly [Conneau et al.(2018)]. 4) Random Self-Attention networks perform poorly even though they form a cornerstone of modern state-of-the-art NLP systems. Considering Equation (3), we see that the random contextualisation can be any linear combination of the input, with none selected by an inductive bias. There is no reason to expect this random combination to outperform BOREP. 5) The TreeLSTM performs noticeably better than other encoder architectures on TREC, a question-type task which relies heavily on sentence syntax [Li and Roth(2002)]. It appears that in this instance the encoder may be using the syntactic information available; however, its performance on all other tasks is comparable to BOREP.

2.4 Random Self-Attention

In our random setting, the word embeddings are first projected up to a $d_s$-dimensional space. We then optionally add sinusoidal positional encodings [Vaswani et al.(2017)], and apply two layers of random self-attention with residual connections, each followed by layer normalisation. A single head of a self-attention layer produces new embeddings for the query representations $\mathbf{Q}$ out of the value representations $\mathbf{V}$, controlled by the key representations $\mathbf{K}$:

$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\big(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_k}\big)\,\mathbf{V}$    (3)

The $d_k$-dimensional key and query representations are given by independent random projections acting upon the self-attention layer input. We use eight heads of attention in each layer. The pooling function $g$ is applied to this output to produce the sentence representation $\mathbf{s}$.

We keep the default initialisation of the FairSeq implementation, which is Xavier uniform [Glorot and Bengio(2010)] for the weights of the self-attention layer.
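For illustration, a single-head sketch of Equation (3) with random, fixed projections and sinusoidal positions follows. The full encoder uses two layers, eight heads, residual connections and layer normalisation, and the dimensions here are illustrative rather than the 4096 used in our experiments.

```python
import math
import torch

torch.manual_seed(0)
d_model, d_k, n = 512, 64, 12               # illustrative dims; sentence length

# Random, fixed projections for one attention head (Xavier uniform, as in FairSeq's default).
W_q = torch.nn.init.xavier_uniform_(torch.empty(d_model, d_k))
W_k = torch.nn.init.xavier_uniform_(torch.empty(d_model, d_k))
W_v = torch.nn.init.xavier_uniform_(torch.empty(d_model, d_model))

def sinusoidal_positions(length: int, dim: int) -> torch.Tensor:
    """Sinusoidal positional encodings as in Vaswani et al. (2017)."""
    pos = torch.arange(length, dtype=torch.float).unsqueeze(1)
    idx = torch.arange(0, dim, 2, dtype=torch.float)
    angles = pos / (10000.0 ** (idx / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

x = torch.randn(n, d_model) + sinusoidal_positions(n, d_model)  # projected words + positions
q, k, v = x @ W_q, x @ W_k, x @ W_v
attn = torch.softmax(q @ k.t() / math.sqrt(d_k), dim=-1)        # Equation (3)
contextualised = attn @ v                                       # (n, d_model)
sentence_rep = contextualised.max(dim=0).values                 # max pool -> (d_model,)
```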

2.5 Random TreeLSTMs

The final architecture we consider is the TreeLSTM. This architecture is particularly interesting as it can potentially incorporate syntactic information into the sentence representations [Tai et al.(2015), Li et al.(2018), Zhang et al.(2019), Teng and Zhang(2016), Kim et al.(2018a), Ahmed et al.(2019)].

We specifically consider the Binary Constituency TreeLSTM [Tai et al.(2015)]. This differs from a regular LSTM by having two forget gates, one for each child node given by the structure of the parsed sentence.

Word representations are first presented to a random bi-directional LSTM of combined dimensionality $d_s$ to provide contextualised representations:

$(\mathbf{h}_1, \ldots, \mathbf{h}_n) = \mathrm{BiLSTM}_\theta(\mathbf{w}_1, \ldots, \mathbf{w}_n)$    (4)

The contextualised representations are then presented to a random TreeLSTM, whose outputs are pooled to produce the sentence representation:

$\mathbf{s} = g\big(\mathrm{TreeLSTM}_\theta(\mathbf{h}_1, \ldots, \mathbf{h}_n)\big)$    (5)

The weights of both the bi-directional LSTM and the TreeLSTM are initialised uniformly at random from $\left[-\tfrac{1}{\sqrt{d}}, \tfrac{1}{\sqrt{d}}\right]$. We used the Stanford parser [Manning et al.(2014)] to parse each sentence. Punctuation and special characters were removed, and numbers were only kept if they formed an independent word and were not part of a mixed word of letters and numbers. If the length of a word was reduced to zero, the word was replaced with a placeholder * character. After parsing, the preprocessing described in [Kim et al.(2018a)] was used to compute the parse tree for the TreeLSTM.
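For illustration, a minimal sketch of the binary node composition with its two forget gates is given below (PyTorch). The hidden size, the per-gate weight layout and the leaf inputs are placeholder assumptions rather than the exact implementation we used.

```python
import torch

torch.manual_seed(0)
d = 512                                     # hidden size per node (assumed)
bound = 1.0 / d ** 0.5

def rand(*shape):
    return torch.empty(*shape).uniform_(-bound, bound)

# One weight matrix per gate, acting on the concatenated [left; right] children.
W = {g: rand(d, 2 * d) for g in ("i", "f_l", "f_r", "o", "u")}
b = {g: rand(d) for g in W}

def tree_lstm_node(h_l, c_l, h_r, c_r):
    """Binary constituency TreeLSTM composition with two forget gates,
    one per child (a sketch following Tai et al., 2015)."""
    x = torch.cat([h_l, h_r])
    i = torch.sigmoid(W["i"] @ x + b["i"])
    f_l = torch.sigmoid(W["f_l"] @ x + b["f_l"])
    f_r = torch.sigmoid(W["f_r"] @ x + b["f_r"])
    o = torch.sigmoid(W["o"] @ x + b["o"])
    u = torch.tanh(W["u"] @ x + b["u"])
    c = i * u + f_l * c_l + f_r * c_r        # each child contributes via its own forget gate
    h = o * torch.tanh(c)
    return h, c

# Leaves are the BiLSTM-contextualised word vectors from Equation (4);
# internal nodes combine their children bottom-up following the parse tree.
h_l, c_l = torch.randn(d), torch.zeros(d)
h_r, c_r = torch.randn(d), torch.zeros(d)
h_parent, c_parent = tree_lstm_node(h_l, c_l, h_r, c_r)
```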

2.6 Evaluation

The SentEval tasks we evaluate on are sentiment analysis (MR, SST), question-type classification (TREC), product reviews (CR), subjectivity (SUBJ), opinion polarity (MPQA), paraphrase detection (MRPC), and entailment (SICK-E). We use the default SentEval settings defined in [Conneau and Kiela(2018)]. We evaluate five samples (seeds) per architecture per task.

We follow the FairSeq implementation [Ott et al.(2019)] to build our CNN and self-attention networks. We also follow the implementation of [Kim et al.(2018b)], without the structure-aware tag representations, to build our TreeLSTMs.
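A sketch of how such an encoder can be plugged into SentEval is shown below; `random_encoder` and `word_vectors` are hypothetical stand-ins for one of the frozen encoders above and a pre-trained embedding lookup, and the classifier settings are indicative SentEval-style defaults rather than our exact configuration.

```python
import numpy as np
import senteval  # https://github.com/facebookresearch/SentEval

# `random_encoder` (frozen encoder + max pooling, returning a fixed-length vector)
# and `word_vectors` (a dict from token to 300-d pre-trained vector) are
# hypothetical stand-ins defined elsewhere.

def prepare(params, samples):
    return

def batcher(params, batch):
    # `batch` is a list of tokenised sentences; return one embedding per sentence.
    embeddings = []
    for sent in batch:
        if sent:
            vecs = np.stack([word_vectors.get(w, np.zeros(300)) for w in sent])
        else:
            vecs = np.zeros((1, 300))
        embeddings.append(random_encoder(vecs))
    return np.stack(embeddings)

params = {"task_path": "data/senteval", "usepytorch": True, "kfold": 10,
          "classifier": {"nhid": 0, "optim": "adam", "batch_size": 64,
                         "tenacity": 5, "epoch_size": 4}}  # nhid=0 -> logistic regression head
se = senteval.engine.SE(params, batcher, prepare)
results = se.eval(["MR", "CR", "MPQA", "SUBJ", "SST2", "TREC",
                   "SICKEntailment", "MRPC"])
```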

3 Results

Our investigation is concerned with the priors of encoder architectures, rather than the posteriors they may learn from data; we only compare untrained encoders acting upon word embeddings.

Table 1 contains the performance of the architectures discussed in Section 2 at dimensionality 4096 on the selected SentEval tasks, together with the results from Wieting and Kiela (2019). Figure 1 contains the performance of these architectures across a range of dimensionalities.

As a sanity check, we evaluated BOREP and a CNN with a window size of 1 and found their performance indistinguishable.

In general, we find that even if encoders have inductive biases that present additional information that can be used to solve a task, the corresponding priors do not leverage this information, except in an isolated case. This strongly indicates that learning is an essential component of building encoder architectures if any gains are to be made beyond apparently uninformative priors.

4 Conclusion

We have evaluated randomly initialised architectures to measure the contribution of priors in distilling sentence meaning. We find that apparently uninformative priors are just as good as seemingly informative priors on almost all tasks, indicating that learning is a necessary component to leverage information provided by architecture choice.

5 Acknowledgements

We thank Jeremie Vallee for assistance with the experimental setup, and the wider machine learning group at Babylon for useful comments and support throughout this project.

References
