GiBERT: Introducing Linguistic Knowledge into BERT through a Lightweight Gated Injection Method
Large pre-trained language models such as BERT have been the driving force behind recent improvements across many NLP tasks. However, BERT is only trained to predict missing words - either behind masks or in the next sentence - and has no knowledge of lexical, syntactic or semantic information beyond what it picks up through unsupervised pre-training. We propose a novel method to explicitly inject linguistic knowledge in the form of word embeddings into any layer of a pre-trained BERT. Our performance improvements on multiple semantic similarity datasets when injecting dependency-based and counter-fitted embeddings indicate that such information is beneficial and currently missing from the original model. Our qualitative analysis shows that counter-fitted embedding injection particularly helps with cases involving synonym pairs.
With the recent success of pre-trained language models such as ELMo (peters_deep_2018-1) and BERT (devlin_bert_2019) across many areas of NLP, there is increased interest in exploring how these architectures can be further improved. One line of work aims at model compression, making BERT smaller and accessible while mostly preserving its performance (xu_bert--theseus_2020; goyal_power-bert_2020; sanh_distilbert_2019; aguilar_knowledge_2020; lan_albert_2020; chen_adabert_2020). Other studies seek to further enhance model performance by duplicating existing layers (kao_further_2020) or introducing external information into BERT, such as information from knowledge bases (peters_knowledge_2019; wang_k-adapter_2020) or multi-modal information (lu_vilbert_2019; lin_interbert_2020).
Before the rise of contextualised models, transfer of pre-trained information between datasets and tasks in NLP was based on word embeddings. Over multiple years, substantial effort was placed into the creation of such embeddings. While originally capturing mainly collocation patterns (mikolov_efficient_2013; pennington_glove:_2014), subsequent work enriched these embeddings with additional information, such as dependencies (levy_dependency-based_2014), subword information (bojanowski_enriching_2017; luong_better_2013), word prototypes (huang_improving_2012) and semantic lexicons (faruqui_retrofitting_2015). As a result, there exists a wealth of pre-trained embedding resources for many languages in a unified format which could provide complementary information for contemporary pre-trained contextual models.
In this work, we propose a new method for injecting pre-trained embeddings into any layer of BERT’s internal representation. Our approach differs from previous work by introducing linguistically-enriched embeddings directly into BERT through a novel injection method. We apply our method to multiple semantic similarity detection benchmark datasets and show that injecting pre-trained dependency-based and counter-fitted embeddings can further enhance BERT’s performance. More specifically, we make the following contributions:
We propose GiBERT - a lightweight gated method for injecting externally pre-trained embeddings into BERT (section 4.1).
We provide ablation studies and detailed analysis for core model components (section 4.3).
We demonstrate that our model improves BERT’s performance on multiple semantic similarity detection datasets. Moreover, when compared to multi-head attention injection, our gated injection method uses fewer parameters while achieving comparable performance for dependency embeddings and improved results for counter-fitted embeddings (section 6).
Our qualitative analysis provides insights into GiBERT’s improved performance, such as in cases of sentence pairs involving synonyms (section 6).
2 Related work
Due to BERT’s widespread success in NLP, many recent studies have focused on further improving BERT by introducing external information. Studies differ regarding the type of external information provided, the application area and their technical approach. We broadly categorise existing approaches based on their modification method into input-related, output-related and internal. Input modifications (zhao_bert_2020; singh_constructing_2020; lai_simple_2020; ruan_fine-tuning_2020) adapt the information that is fed to BERT - e.g. feeding text triples separated by [SEP] tokens instead of sentence pairs as in lai_simple_2020 - while leaving the architecture unchanged. Output modifications (xuan_fgn_2020; zhang_semantics-aware_2020) build on BERT’s pre-trained representation by adding external information after the encoding step - e.g. combining it with additional semantic information as in zhang_semantics-aware_2020 - without changing BERT itself. By contrast, internal modifications introduce new information directly into BERT by adapting its internal architecture. Relatively few studies have taken this approach, as it is technically more difficult and might increase the risk of so-called catastrophic forgetting - completely forgetting previous knowledge when learning new tasks (french_catastrophic_1999; wen_few-shot_2018). However, such modifications also offer the opportunity to directly harness BERT’s powerful architecture to process the external information alongside the pre-trained one. Most existing work on internal modifications has attempted to combine BERT’s internal representation with visual and knowledge base information: lu_vilbert_2019 modified BERT’s transformer block with co-attention to integrate visual and textual information, while lin_interbert_2020 introduced a multimodal model which uses multi-head attention to integrate encoded image and text information between each transformer block.
peters_knowledge_2019 suggested a word-to-entity attention mechanism to incorporate external knowledge into BERT and wang_k-adapter_2020 proposed to inject factual and linguistic knowledge through separate adapter modules. Our approach differs from previous research as we propose to introduce external information with an addition-based mechanism which uses fewer parameters than existing attention-based techniques (lu_vilbert_2019; lin_interbert_2020; peters_knowledge_2019). We further incorporate a gating mechanism to scale injected information in an attempt to reduce the risk of catastrophic forgetting. Moreover, our work focuses on injecting pretrained word embeddings, rather than multimodal or knowledge base information as in previous studies.
Semantic similarity detection
Detecting paraphrases and semantically related posts in Community Question Answering requires modelling the semantic relationship between a text pair. This is a fundamental and well known NLP problem for which many methods have been proposed. Early work has focused on feature-engineering techniques, exploring various syntactic (filice_kelp_2017), semantic (balchev_pmi-cool_2016) and lexical features (tran_jaist_2015; almarwani_gw_qa_2017). Subsequent work attempted to model text pair relationships solely based on increasingly complex neural architectures (deriu_swissalps_2017; wang_bilateral_2017; tan_multiway_2018) or by combining both approaches through hybrid techniques (wu_ecnu_2017; feng_beihang-msra_2017; koreeda_bunji_2017). Most recently, contextual models such as ELMo (peters_deep_2018-1) and BERT (devlin_bert_2019) have reached state-of-the-art performance through pretraining large context-aware language models on vast amounts of textual data. Our study joins up earlier lines of work with current state-of-the-art contextual representations by combining BERT with dependency-based and counter-fitted embeddings, previously shown to be useful for semantic similarity detection.
3 Datasets and Tasks
We focus on the task of semantic similarity detection which is a fundamental problem in NLP and involves modelling the semantic relationship between two sentences in a binary classification setup. We work with the following five widely used datasets which cover a range of related tasks and sizes (see Appendix A).
The Microsoft Research Paraphrase dataset (MSRP) contains 5K pairs of sentences from news websites which were obtained based on heuristics and an SVM classifier. Gold labels are based on human binary annotations for sentential paraphrase detection (dolan_automatically_2005).
The SemEval 2017 CQA dataset (SemEval-2017:task3) consists of three subtasks involving posts from the online forum Qatar Living.
The Quora duplicate questions dataset is the largest of the selected datasets, consisting of more than 400k question pairs with binary labels.
All of the above datasets provide two short texts, each usually a single sentence but sometimes consisting of multiple sentences. For simplicity, we refer to each short text as a ‘sentence’. We frame the task as semantic similarity detection between two sentences in a binary classification setup.
We propose GiBERT - a Gated Injection Method for BERT. GiBERT’s architecture is illustrated with a toy example in Figure 1 and comprises the following phases: obtaining BERT’s intermediate representation from a Transformer block (steps 1-2 in Figure 1), obtaining an alternative input representation based on linguistically-enriched word embeddings (steps 3-4), combining both representations (steps 5-7) and passing on the injected information to subsequent BERT layers to make a final prediction (steps 8-9).
We encode a sentence pair with a pre-trained BERT model (devlin_bert_2019) and obtain BERT’s internal representation at different layers (see section 4.3 for injection layer choices).
where d is the internal hidden size of BERT (d = 768 for BERT-base). For injecting information at later layers, we obtain BERT’s internal representation after transformer block i (step 2):
where L is the number of Transformer blocks (L = 12 for BERT-base) and MultiheadAtt denotes multihead attention.
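Conceptually, obtaining the representation after block i just means running the first i blocks in sequence. A minimal sketch, treating each Transformer block as an opaque callable (which glosses over the attention and feed-forward sub-layers):

```python
def intermediate_representation(h0, blocks, i):
    """Return the hidden states after transformer block i, given the
    embedding-layer output h0 and the list of L blocks (1-based block index)."""
    h = h0
    for block in blocks[:i]:
        h = block(h)
    return h
```

In practice this corresponds to reading out one of the per-layer hidden states that a BERT implementation already computes during its forward pass.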
External embedding representation
To enrich this representation, we obtain alternative representations for the two input sentences by looking up word embeddings in a V × d_e matrix of pre-trained embeddings, where V indicates the vocabulary size and d_e is the dimensionality of the pre-trained embeddings (step 3; refer to section 4.2 for details on our choice of pre-trained embeddings). To ensure alignment between BERT’s representation at word-piece level and the word embedding representation at token level, an alignment function copies embeddings of tokens that were separated into multiple word pieces and adds BERT’s special ‘[CLS]’ and ‘[SEP]’ tokens, resulting in an injection sequence (step 4). For example, we copy the pre-trained embedding of the word ‘prompt’ to match the two corresponding word pieces ‘pro’ and ‘##mpt’ (see Figure 1).
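The alignment step can be sketched as follows. Here `embed` is a hypothetical word-to-vector lookup, and mapping the special ‘[CLS]’/‘[SEP]’ tokens to zero vectors is our assumption rather than a detail specified above:

```python
def align_to_wordpieces(words, pieces, embed, dim):
    """Copy each word's pretrained embedding to all of its word pieces."""
    out, w = [], -1
    for p in pieces:
        if p in ("[CLS]", "[SEP]"):
            out.append([0.0] * dim)      # special tokens: zero vector (assumption)
        elif p.startswith("##"):
            out.append(embed[words[w]])  # continuation piece: reuse current word
        else:
            w += 1
            out.append(embed[words[w]])  # first piece of the next word
    return out
```

For the ‘prompt’ example, the same pre-trained vector is emitted twice, once for ‘pro’ and once for ‘##mpt’.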
Multihead Attention Injection
Multihead attention was proposed by vaswani_attention_2017:
and is employed in Transformer networks in the form of self-attention (where queries, keys and values come from the previous layer) or encoder-decoder attention (where queries come from the decoder; keys and values from the encoder). Previous work has successfully employed multihead attention to combine BERT with external information (see section 2). For example, in their multimodal VilBERT model, lu_vilbert_2019 combined textual and visual representations by passing the keys and values from each modality as input to the other modality’s multi-head attention block. Similarly, peters_knowledge_2019 used multihead attention to combine projected BERT representations (as queries) with entity-span representations (as keys and values) in their knowledge-enrichment method for BERT. For our case of combining BERT with the injection sequence, it is therefore intuitive to use the following multihead attention injection method:
where queries are provided by BERT’s internal representation, while keys and values come from the injected embeddings. The output of the attention mechanism is then combined with the previous layer through addition.
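A single-head NumPy sketch of this injection (the actual mechanism uses several heads, and the weight matrices here are assumed inputs rather than the model's real parameters):

```python
import numpy as np

def attention_injection(h, e, Wq, Wk, Wv, Wo):
    """h: BERT hidden states (n, d); e: injected embeddings (m, d_e).
    Queries come from h, keys and values from e; residual add at the end."""
    q, k, v = h @ Wq, e @ Wk, e @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    a = np.exp(scores - scores.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)   # softmax over the injected positions
    return h + (a @ v) @ Wo              # combine with the previous layer by addition
```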
The above multihead attention injection mechanism is rather complex and requires many parameters. We therefore propose an alternative way of combining external embeddings with BERT which requires only 14% of the parameters used in multihead attention (see Appendix B). First, we add a feed-forward layer – consisting of a linear layer followed by a tanh activation function – to project the aligned embedding sequence to BERT’s internal dimensions and squash the output values to a range between -1 and 1 (step 5):
Then, we use a residual connection to inject the projected external information into BERT’s representation from Transformer block i (see section 4.3 for injection at different locations) and obtain a new enriched representation:
However, as injection values can be rather large (between -1 and 1) in comparison to BERT’s internal representation (which, based on our observations, usually ranges between about -0.1 and 0.1), a downside of directly injecting external information in this way is that BERT’s pre-trained information can easily be overwritten by the injection, resulting in catastrophic forgetting. To address this potential pitfall, we further propose a gating mechanism which uses a gating vector to scale the injected information before combining it with BERT’s internal representation as follows:
where ⊙ denotes element-wise multiplication using broadcasting (steps 6 & 7). The gating parameters are initialised with zeros and updated during training. This has the benefit of starting finetuning from representations which are equivalent to vanilla BERT and gradually introducing the injected information along certain dimensions as finetuning progresses. In case the external representations are not beneficial for the task, it is easy for the model to ignore them by keeping the gating parameters at zero.
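Steps 5-7 condense into a few lines. A NumPy sketch with hypothetical parameter names (W, b for the projection, g for the gating vector):

```python
import numpy as np

def gated_injection(h, e_aligned, W, b, g):
    """h: BERT hidden states (n, d); e_aligned: aligned embeddings (n, d_e);
    W, b: projection to d dimensions; g: gating vector (d,), initialised to zeros."""
    proj = np.tanh(e_aligned @ W + b)   # project and squash to (-1, 1)
    return h + g * proj                 # elementwise gate, broadcast over positions
```

With g at its zero initialisation the output equals h, so finetuning starts from representations identical to vanilla BERT.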
The combined representation is then fed as input to BERT’s next Transformer block (step 8). At the final Transformer block L, we use the vector which corresponds to the ‘[CLS]’ token in the input and is typically used as the sentence pair representation (step 9). As proposed by devlin_bert_2019, this is followed by a softmax classification layer (with weights and bias) to calculate class probabilities over the output classes. During finetuning, we train the entire model for 3 epochs with early stopping and cross-entropy loss. Learning rates are tuned for each seed and dataset based on development set performance (reported in Appendix C).
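The classification head in step 9 is the standard BERT setup; a minimal sketch:

```python
import numpy as np

def cls_softmax(h_final, Wc, bc):
    """Predict class probabilities from the final layer's '[CLS]' vector,
    which sits at position 0 of the sequence."""
    logits = h_final[0] @ Wc + bc
    p = np.exp(logits - logits.max())   # numerically stable softmax
    return p / p.sum()
```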
4.2 Injected Embeddings
While any kind of information could be injected, we focus on two types of pretrained embeddings: dependency-based (levy_dependency-based_2014) and counter-fitted embeddings (mrksic_counter-fitting_2016). Our choice is motivated by previous research which found syntactic features useful for semantic similarity detection (filice_kelp_2017; feng_beihang-msra_2017) and counter-fitted embeddings helpful in several other tasks (alzantot_generating_2018; jin_is_2020).
The dependency-based embeddings by levy_dependency-based_2014 extend the SkipGram embedding algorithm proposed by mikolov_efficient_2013 by replacing linear bag-of-word contexts with dependency-based contexts which are extracted from parsed English Wikipedia sentences. As BERT has not been exposed to dependencies during pretraining and previous studies have found that BERT’s knowledge of syntax is only partial (rogers_primer_2020), we reason that these embeddings could provide helpful complementary information.
The counter-fitted embeddings by mrksic_counter-fitting_2016 integrate antonymy and synonymy relations into word embeddings based on an objective function which combines three principles: repelling antonyms, attracting synonyms and preserving the vector space. For training, they obtain synonymy and antonymy pairs from the Paraphrase Database and WordNet, demonstrating increased performance on SimLex-999. We use their highest-scoring vectors, which they obtained by applying their counter-fitting method to the Paragram vectors from wieting_paraphrase_2015. We reason that the antonym and synonym relations contained in the word embeddings could be especially useful for paraphrase detection by explicitly capturing these semantic relations.
4.3 Injection Settings
|GiBERT with dependency embeddings|
|- no gating||.906||.905||.732||.751||.424|
|- with gating||.913||.908||.755||.778||.433|
|GiBERT with counter-fitted embeddings|
|- no gating||.907||.906||.733||.763||.435|
|- with gating||.907||.908||.751||.767||.451|
Catastrophic forgetting is a potential problem when introducing external information into a pre-trained model, as the injected information could disturb or completely overwrite existing knowledge (wang_k-adapter_2020). In our proposed model, a gating mechanism is used to scale injected embeddings before adding them to the pre-trained internal BERT representation (see section 4.1). To understand the importance of this mechanism, we contrast development set performance for injecting information after the embedding layer with gating (as defined in equation 7) and without (as in equation 6) in Table 1. For dependency embedding injection without gating, performance only improves on 2 out of 5 datasets over the baseline and in some cases even drops below BERT’s performance, while with the gating mechanism it outperforms the baseline on all datasets. Counter-fitted embedding injection without gating improves on 4 out of 5 datasets, with further improvements when adding gating, outperforming the vanilla BERT model across all datasets. In addition, gating makes model training more stable and reduces failed runs (where the model predicted only the majority class) from 30% to 0% on the particularly imbalanced SemEval C dataset. This highlights the importance of the gating mechanism in our proposed architecture.
In our proposed model, information can be injected between any of BERT’s pre-trained transformer blocks. We reason that different locations may be more appropriate for certain kinds of embeddings, as previous research has found that different types of information tend to be encoded and processed at specific BERT layers (rogers_primer_2020). We experiment with injecting embeddings at three possible locations: after the embedding layer, after the middle layer (layer 6 in BERT-base) and after the penultimate layer (layer 11 in BERT-base). Table 2 shows that mid-layer injection is ideal for counter-fitted embeddings, while late injection appears to work best for dependency embeddings. This is in line with previous work which found that BERT tends to process syntactic information at later layers than linear word-level information (rogers_primer_2020). We consequently use these injection locations in our final model.
|GiBERT with dependency embeddings|
|- embd layer||.913||.908||.755||.778||.433|
|- layer 6||.911||.908||.755||.776||.438|
|- layer 11||.914||.910||.760||.773||.444|
|GiBERT with counter-fitted embeddings|
|- embd layer||.907||.908||.751||.767||.451|
|- layer 6||.917||.909||.760||.771||.464|
|- layer 11||.910||.907||.755||.771||.450|
Our main evaluation metric is F1 score as this is more meaningful for datasets with imbalanced label distributions (such as SemEval C) than accuracy. We also report performance on difficult cases using the non-obvious F1 score (peinelt_aiming_2019). This metric distinguishes non-obvious instances in a dataset from obvious ones based on lexical overlap and gold labels, calculating a separate F1 score for challenging cases. It therefore tends to be lower than the normal F1 score.
dodge_fine-tuning_2020 recently showed that early stopping and random seeds can have considerable impact on the performance of finetuned BERT models. In this work, we finetune all models for 3 epochs with early stopping. Our reported scores average model performance across two different seeds for BERT-based models.
Following standard practice, we encode the sentence pair with BERT’s vector from the final layer, followed by a softmax layer. We finetune all layers for 3 epochs with early stopping. Following devlin_bert_2019, we tune learning rates on the dev set of each dataset.
We further provide an alternative Attention-based embedding Injection method for BERT (AiBERT) based on the multihead attention injection mechanism described in equations 3 to 4. For direct comparison, we inject embeddings at the same layers as GiBERT (layer 6 for counter-fitted embeddings and layer 11 for dependency-based embeddings). We follow the same finetuning procedure as for GiBERT and the BERT baseline.
For SemEval, we compare against the best participating SemEval 2017 system for each subtask based on F1 score. For MSRP, we show a neural matching architecture (pang_text_2016). For Quora, we compare against the Interactive Inference Network (gong_natural_2018) using accuracy, as no F1 has been reported. We also provide a semantics-aware BERT model (zhang_semantics-aware_2020) which leverages a semantic role labeler.
|Sentence 1||Sentence 2||Gold label||BERT prediction||GiBERT prediction|
|(1)||it took me more than 10 people; over the course of the whole day to convince my point at qatar airways… as to how my points needs to be redeemed… at long last my point was made… dont seem know what they are doing??? appalling to say the least||this isn’t the first time. so many rants by irate customers on so many diverse situations signals a very serious problem. so called first class airlines and no basic customer care. over confidence much?||is related||not related||is related|
|(2)||hi; my wife was on a visit visa; today; her residency visa was issued; so i went to immigration and paid 500 so there is no need to leave the country and enter again on the residency visa . she has done her medical before for the visit visa extension; do we need to do the medical again for the residency visa? thanks||dear all; please let me know how many days taking for approve family visa nw; am last wednesday (12/09/2012) apply family visa for my husband and daughter; but still now showing in moi website itz under review; itz usual reply? why delayed like this? please help me regards divya||is related||is related||not related|
Comparison with previous systems
GiBERT with counter-fitted embeddings outperforms the F1 score of BERT and other previous systems across all datasets (except on SemEval C).
When contrasting the gated injection method (GiBERT) with the alternative attention-based injection method (AiBERT), we find that both generally improve over the performance of the BERT baseline. In a direct comparison, the lightweight gated method achieves results comparable to the more complex multihead attention injection when introducing dependency embeddings, while for the injection of counter-fitted embeddings, GiBERT even outperforms AiBERT.
|Instances with antonym pairs|
|Instances with synonym pairs|
|Instances without synonym/antonym pairs|
Counter-fitted embeddings are designed to explicitly encode synonym and antonym relationships between words. To better understand how the injection of counter-fitted embeddings affects the ability of our model to deal with instances involving such semantic relations, we use synonym and antonym pairs from the PPDB and WordNet (provided by mrksic_counter-fitting_2016) and search the development partition of the datasets for sentence pairs where the first sentence contains one word of the synonym/antonym pair and the second sentence the other word. Table 5 reports F1 performance of our model on cases with synonym pairs, antonym pairs and neither. We find that our model’s F1 performance particularly improves over BERT on instances containing synonym pairs, as illustrated in example (1) in Table 4. By contrast, the performance on cases with antonym pairs stays roughly the same, even slightly decreasing on Quora. This can be understood with the help of example (2) in Table 4: word pairs can be antonyms in isolation (e.g. husband - wife), but not in the specific context of a given example (e.g. it is not important whether the visa is for the wife or the husband). In rare cases, the injection of distant antonym pair embeddings could deter the model from detecting related sentence pairs. We also observe a slight performance boost for cases that contain neither synonym nor antonym pairs. This could be due to improved representations for words which occurred in examples without their synonym or antonym counterpart.
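The search procedure can be sketched as follows. This is a simplification that assumes whitespace tokenisation and exact string matching; a practical implementation may need proper tokenisation and lemmatisation:

```python
def find_relation_instances(sent_pairs, relation_pairs):
    """Indices of sentence pairs where one sentence contains one word of a
    synonym/antonym pair and the other sentence contains the other word."""
    hits = []
    for i, (s1, s2) in enumerate(sent_pairs):
        t1, t2 = set(s1.lower().split()), set(s2.lower().split())
        if any((a in t1 and b in t2) or (b in t1 and a in t2)
               for a, b in relation_pairs):
            hits.append(i)
    return hits
```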
In this paper, we introduced a new approach for injecting external information into BERT. Our proposed method adds linguistically enriched embeddings to BERT’s internal representation through a lightweight gating mechanism which requires significantly fewer parameters than a multihead attention injection method. Evaluating our injection method on multiple semantic similarity detection datasets, we demonstrated that injecting counter-fitted embeddings clearly improved performance over vanilla BERT, while dependency embeddings achieved slightly smaller gains for these tasks. In comparison to the multihead attention injection mechanism, we found the gated method at least as effective, with comparable performance for dependency embeddings and improved results for counter-fitted embeddings. Our qualitative analysis highlighted that counter-fitted injection was particularly helpful for instances involving synonym pairs. Future work could explore combining multiple embedding sources or injecting other types of information. Another direction is to investigate the usefulness of embedding injection for other tasks or compressed BERT models.
Appendix A Datasets
|Sentence 1||Sentence 2||Label|
|Quora||Paraphrase detection||There are only 2,000 Roman Catholics living in Banja Luka now.||There are just a handful of Catholics left in Banja Luka.||is_paraphrase|
|MSRP||Paraphrase detection||Which is the best way to learn coding?||How do you learn to program?||is_paraphrase|
|SemEval||(A) Internal answer detection||Anybody recommend a good dentist in Doha?||Dr Sarah Dental Clinic||is_related|
|(B) Paraphrase detection||Where I can buy good oil for massage?||Blackheads - Any suggestions on how to get rid of them??||not_related|
|(C) External answer detection||Can anybody tell me where is Doha clinic?||Dr. Rizwi - Al Ahli Hospital||not_related|
Appendix B Required Injection Parameters
This section compares the number of required parameters in the two alternative injection methods discussed in section 4.1: a multihead attention injection mechanism which is based on previous methods for combining external knowledge with BERT and a novel lightweight gated injection mechanism.
Multihead attention injection
In multihead attention injection (equations 3 to 4), the queries are provided by BERT’s representation from the injection layer, while the keys and values come from the injected information. Multihead attention requires weight matrices and biases to transform the queries, keys and values, as well as to transform the attention output:
where d indicates BERT’s hidden dimension and d_e indicates the dimensionality of the injected embeddings. When injecting embeddings with d_e = 300 (see section 4.2) into BERT-base with d = 768, this amounts to approximately 1.64M new parameters.
By contrast, the gated mechanism only requires the projection layer and the gating vector. Therefore, injecting embeddings with d_e = 300 into BERT-base requires approximately 0.23M new parameters, i.e. our proposed gated injection mechanism only requires 14% of the parameters used in a multihead attention injection mechanism. Using fewer parameters results in a smaller model, which is especially beneficial for injecting information during finetuning, where small learning rates and few epochs make it difficult to learn large amounts of new parameters.
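These counts can be reproduced with a few lines of arithmetic, assuming d_e = 300 for both embedding types and one bias vector per transformation:

```python
d, d_e = 768, 300   # BERT-base hidden size; injected embedding dimensionality

# Multihead attention injection: query and output transforms are d x d,
# key and value transforms are d_e x d, each with a bias of size d.
attention_params = 2 * (d * d + d) + 2 * (d_e * d + d)

# Gated injection: one d_e x d projection with bias, plus the gating vector.
gated_params = (d_e * d + d) + d

print(attention_params, gated_params, round(gated_params / attention_params, 2))
```

This yields roughly 1.64M versus 0.23M parameters, a ratio of about 14%, matching the figure quoted above.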
Appendix C Best Hyper-Parameters
Hyper-parameters were chosen based on development set F1 scores.
|Learning rate (1st seed)||5e-5||2e-5||3e-5||2e-5||2e-5|
|Learning rate (2nd seed)||5e-5||2e-5||2e-5||2e-5||3e-5|
|AiBERT with dependency-based embeddings|
|Learning rate (1st seed)||3e-5||3e-5||2e-5||3e-5||2e-5|
|Learning rate (2nd seed)||5e-5||2e-5||2e-5||5e-5||2e-5|
|AiBERT with counter-fitted embeddings|
|Learning rate (1st seed)||5e-5||2e-5||2e-5||3e-5||2e-5|
|Learning rate (2nd seed)||5e-5||3e-5||5e-5||3e-5||2e-5|
|GiBERT with dependency-based embeddings|
|Learning rate (1st seed)||2e-5||3e-5||2e-5||3e-5||2e-5|
|Learning rate (2nd seed)||3e-5||2e-5||2e-5||5e-5||3e-5|
|GiBERT with counter-fitted embeddings|
|Learning rate (1st seed)||5e-5||2e-5||2e-5||5e-5||2e-5|
|Learning rate (2nd seed)||5e-5||3e-5||3e-5||5e-5||3e-5|
Appendix D Gating Parameter Analysis
As described in section 4, the gating parameters in our proposed model are initialised as a vector of zeros. During training, the model can learn to gradually inject external information by adjusting gating parameters to positive values for adding, or negative values for subtracting, injected information along certain dimensions. Alternatively, injection stays turned off if all parameters remain at zero. Figure 2 shows a histogram of learned gating vectors for our best GiBERT models with counter-fitted (left) and dependency embedding injection (right).
On most datasets, the majority of parameters have been updated to small non-zero values, letting through controlled amounts of injected information without completely overwriting BERT’s internal representation.
Only on SemEval B (with 4K instances the smallest of the datasets, compare section 3) do more than 500 of the 768 dimensions of the injected information stay blocked out for both model variants. The gating parameters also filter out many dimensions of the dependency-based embeddings on MSRP (the second smallest dataset).
This suggests that models trained on smaller datasets may benefit from slightly longer finetuning or a different gating parameter initialisation to make full use of the injected information.
- We use the uncased version of BERT available through Tensorflow Hub.
- As reflected by the lower scores compared to other datasets, SemEval C is particularly difficult due to the external question-answering scenario and its highly imbalanced label distribution which varies between train, dev and test set.
- Note that we train models for the same number of epochs, but one epoch uses all training examples contained in the dataset. This gives models trained on larger datasets more opportunity to update their parameters.