GiBERT: Introducing Linguistic Knowledge into BERT through a Lightweight Gated Injection Method


Large pre-trained language models such as BERT have been the driving force behind recent improvements across many NLP tasks. However, BERT is only trained to predict missing words - either behind masks or in the next sentence - and has no knowledge of lexical, syntactic or semantic information beyond what it picks up through unsupervised pre-training. We propose a novel method to explicitly inject linguistic knowledge in the form of word embeddings into any layer of a pre-trained BERT. Our performance improvements on multiple semantic similarity datasets when injecting dependency-based and counter-fitted embeddings indicate that such information is beneficial and currently missing from the original model. Our qualitative analysis shows that counter-fitted embedding injection particularly helps with cases involving synonym pairs.


1 Introduction

With the recent success of pre-trained language models such as ELMo (peters_deep_2018-1) and BERT (devlin_bert_2019) across many areas of NLP, there is increased interest in exploring how these architectures can be further improved. One line of work aims at model compression, making BERT smaller and accessible while mostly preserving its performance (xu_bert--theseus_2020; goyal_power-bert_2020; sanh_distilbert_2019; aguilar_knowledge_2020; lan_albert_2020; chen_adabert_2020). Other studies seek to further enhance model performance by duplicating existing layers (kao_further_2020) or introducing external information into BERT, such as information from knowledge bases (peters_knowledge_2019; wang_k-adapter_2020) or multi-modal information (lu_vilbert_2019; lin_interbert_2020).

Before the rise of contextualised models, transfer of pre-trained information between datasets and tasks in NLP was based on word embeddings. Over multiple years, substantial effort was placed into the creation of such embeddings. While originally capturing mainly collocation patterns (mikolov_efficient_2013; pennington_glove:_2014), subsequent work enriched these embeddings with additional information, such as dependencies (levy_dependency-based_2014), subword information (bojanowski_enriching_2017; luong_better_2013), word prototypes (huang_improving_2012) and semantic lexicons (faruqui_retrofitting_2015). As a result, there exists a wealth of pre-trained embedding resources for many languages in a unified format which could provide complementary information for contemporary pre-trained contextual models.

In this work, we propose a new method for injecting pre-trained embeddings into any layer of BERT’s internal representation. Our approach differs from previous work by introducing linguistically-enriched embeddings directly into BERT through a novel injection method. We apply our method to multiple semantic similarity detection benchmark datasets and show that injecting pre-trained dependency-based and counter-fitted embeddings can further enhance BERT’s performance. More specifically, we make the following contributions:

  1. We propose GiBERT - a lightweight gated method for injecting externally pre-trained embeddings into BERT (section 4.1).

  2. We provide ablation studies and detailed analysis for core model components (section 4.3).

  3. We demonstrate that our model improves BERT’s performance on multiple semantic similarity detection datasets. Moreover, when compared to multi-head attention injection, our gated injection method uses fewer parameters while achieving comparable performance for dependency embeddings and improved results for counter-fitted embeddings (section 6).

  4. Our qualitative analysis provides insights into GiBERT’s improved performance, such as in cases of sentence pairs involving synonyms (section 6).

2 Related work

BERT modifications

Due to BERT’s widespread success in NLP, many recent studies have focused on further improving BERT by introducing external information. Studies differ regarding the type of external information provided, the application area and their technical approach. We broadly categorise existing approaches based on their modification method into input-related, output-related and internal. Input modifications (zhao_bert_2020; singh_constructing_2020; lai_simple_2020; ruan_fine-tuning_2020) adapt the information that is fed to BERT - e.g. feeding text triples separated by [SEP] tokens instead of sentence pairs as in lai_simple_2020 - while leaving the architecture unchanged. Output modifications (xuan_fgn_2020; zhang_semantics-aware_2020) build on BERT’s pre-trained representation by adding external information after the encoding step - e.g. combining it with additional semantic information as in zhang_semantics-aware_2020 - without changing BERT itself. By contrast, internal modifications introduce new information directly into BERT by adapting its internal architecture. Relatively few studies have taken this approach as this is technically more difficult and might increase the risk of so-called catastrophic forgetting - completely forgetting previous knowledge when learning new tasks (french_catastrophic_1999; wen_few-shot_2018). However, such modifications also offer the opportunity to directly harness BERT’s powerful architecture to process the external information alongside the pretrained one. Most existing work on internal modifications has attempted to combine BERT’s internal representation with visual and knowledge base information: lu_vilbert_2019 modified BERT’s transformer block with co-attention to integrate visual and textual information, while lin_interbert_2020 introduced a multimodal model which uses multi-head attention to integrate encoded image and text information between each transformer block.
peters_knowledge_2019 suggested a word-to-entity attention mechanism to incorporate external knowledge into BERT and wang_k-adapter_2020 proposed to inject factual and linguistic knowledge through separate adapter modules. Our approach differs from previous research as we propose to introduce external information with an addition-based mechanism which uses fewer parameters than existing attention-based techniques (lu_vilbert_2019; lin_interbert_2020; peters_knowledge_2019). We further incorporate a gating mechanism to scale injected information in an attempt to reduce the risk of catastrophic forgetting. Moreover, our work focuses on injecting pretrained word embeddings, rather than multimodal or knowledge base information as in previous studies.

Semantic similarity detection

Detecting paraphrases and semantically related posts in Community Question Answering requires modelling the semantic relationship between a text pair. This is a fundamental and well known NLP problem for which many methods have been proposed. Early work has focused on feature-engineering techniques, exploring various syntactic (filice_kelp_2017), semantic (balchev_pmi-cool_2016) and lexical features (tran_jaist_2015; almarwani_gw_qa_2017). Subsequent work attempted to model text pair relationships solely based on increasingly complex neural architectures (deriu_swissalps_2017; wang_bilateral_2017; tan_multiway_2018) or by combining both approaches through hybrid techniques (wu_ecnu_2017; feng_beihang-msra_2017; koreeda_bunji_2017). Most recently, contextual models such as ELMo (peters_deep_2018-1) and BERT (devlin_bert_2019) have reached state-of-the-art performance through pretraining large context-aware language models on vast amounts of textual data. Our study joins up earlier lines of work with current state-of-the-art contextual representations by combining BERT with dependency-based and counter-fitted embeddings, previously shown to be useful for semantic similarity detection.

3 Datasets and Tasks

We focus on the task of semantic similarity detection which is a fundamental problem in NLP and involves modelling the semantic relationship between two sentences in a binary classification setup. We work with the following five widely used datasets which cover a range of related tasks and sizes (see Appendix A).


The Microsoft Research Paraphrase dataset (MSRP) contains 5K pairs of sentences from news websites which were obtained based on heuristics and an SVM classifier. Gold labels are based on human binary annotations for sentential paraphrase detection (dolan_automatically_2005).


The SemEval 2017 CQA dataset (SemEval-2017:task3) consists of three subtasks involving posts from the online forum Qatar Living. Each subtask provides an initial post as well as 10 posts which were retrieved by a search engine and annotated with binary labels by humans. The task requires the distinction between relevant and non-relevant posts. The original problem is a ranking setting, but since the gold labels are binary, we focus on a classification setup. In subtask A, the posts are questions and comments from the same thread, in an answer relevancy detection scenario (26K instances). Subtask B is question paraphrase detection (4K instances). Subtask C is similar to A but comments were retrieved from an external thread (47K instances). We use the 2016 test set as the dev set and the 2017 test set as the test set.


The Quora duplicate questions dataset is the largest of the selected datasets, consisting of more than 400k question pairs with binary labels. The task is to predict whether two questions are paraphrases, similar to SemEval subtask B. We use wang_bilateral_2017’s train/dev/test set partition.

All of the above datasets provide two short texts, each usually a single sentence but sometimes consisting of multiple sentences. For simplicity, we refer to each short text as ‘sentence’. We frame the task as semantic similarity detection between two sentences in a binary classification task.


4.1 Architecture

Figure 1: Our proposed GiBERT architecture illustrated with a toy example (with small values for the internal BERT dimension $D$ and the embedding dimension $E$). The model input consists of a sentence pair which is processed with a WordPiece tokenizer (step 1) and encoded with BERT up to layer $i$ (step 2). We obtain an alternative representation for the sentence pair based on pretrained word embeddings (step 3), while ensuring that external word embeddings are aligned with BERT’s WordPieces by repeating embeddings for tokens which have been broken down into several WordPieces (step 4). The aligned word embedding sequence is passed through a linear and tanh layer to match BERT’s embedding dimension (step 5). We apply a gating mechanism (step 6) before adding the injected information to BERT’s representation from layer $i$ (step 7). The combined representation is passed on to the next layer (step 8). At the final layer, the $C$ vector is used as the sentence pair representation, followed by a classification layer (step 9).

We propose GiBERT - a Gated Injection Method for BERT. GiBERT’s architecture is illustrated with a toy example in Figure 1 and comprises the following phases: obtaining BERT’s intermediate representation from Transformer block $i$ (steps 1-2 in Figure 1), obtaining an alternative input representation based on linguistically-enriched word embeddings (steps 3-4), combining both representations (steps 5-7) and passing on the injected information to subsequent BERT layers to make a final prediction (steps 8-9).

BERT representation

We encode a sentence pair with a pre-trained BERT model (devlin_bert_2019) and obtain BERT’s internal representation at different layers (see section 4.3 for injection layer choices). Following standard practice, we process the two input sentences $S_1$ and $S_2$ with a word piece tokenizer (wu_googles_2016) and combine them using ‘[CLS]’ and ‘[SEP]’ tokens which indicate sentence boundaries. Then, the word pieces are mapped to ids, resulting in a sequence of word piece ids $T = (t_1, \ldots, t_N)$ where $N$ indicates the number of word pieces in the sequence (step 1 in Figure 1). In the case of embedding layer injection, we use BERT’s embedding layer output denoted with $H_0$ which results from summing the word piece embeddings $\mathrm{Emb}_{wp}(T)$, positional embeddings $\mathrm{Emb}_{pos}(T)$ and segment embeddings $\mathrm{Emb}_{seg}(T)$ (step 2):

$$H_0 = \mathrm{Emb}_{wp}(T) + \mathrm{Emb}_{pos}(T) + \mathrm{Emb}_{seg}(T), \qquad H_0 \in \mathbb{R}^{N \times D} \tag{1}$$

where $D$ is the internal hidden size of BERT ($D = 768$ for BERT$_{BASE}$). For injecting information at later layers, we obtain BERT’s internal representation $H_i$ after transformer block $i$ with $1 \le i \le L$ (step 2):

$$H_i = \mathrm{FFN}(\mathrm{MultiheadAtt}(H_{i-1}, H_{i-1}, H_{i-1})) \tag{2}$$

where $L$ is the number of Transformer blocks ($L = 12$ for BERT$_{BASE}$), MultiheadAtt denotes multihead attention and FFN denotes the position-wise feed-forward layer (residual connections and layer normalisation are omitted for brevity).

External embedding representation

To enrich this representation, we obtain alternative representations for $S_1$ and $S_2$ by looking up word embeddings in a matrix of pre-trained embeddings $M \in \mathbb{R}^{|V| \times E}$ where $|V|$ indicates the vocabulary size and $E$ is the dimensionality of the pre-trained embeddings (step 3, refer to section 4.2 for details on our choice of pre-trained embeddings). To ensure alignment between BERT’s representation at word piece level and the word embedding representation at token level, an alignment function copies embeddings of tokens that were separated into multiple word pieces and adds BERT’s special ‘[CLS]’ and ‘[SEP]’ tokens, resulting in an injection sequence $I \in \mathbb{R}^{N \times E}$ (step 4). For example, we copy the pre-trained embedding of the word ‘prompt’ to match the two corresponding word pieces ‘pro’ and ‘##mpt’ (see Figure 1).
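The alignment step can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the two-dimensional vectors are toy values, and the choice of zero vectors for special tokens and out-of-vocabulary words is an assumption that the text does not specify.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
# Each token is paired with the word pieces the tokenizer produced for it.
Sentence = Sequence[Tuple[str, List[str]]]


def align_to_word_pieces(sentence: Sentence,
                         embeddings: Dict[str, Vector],
                         dim: int) -> List[Vector]:
    """Repeat each token's embedding once per word piece of that token."""
    zero = [0.0] * dim  # assumption: unknown words map to a zero vector
    aligned: List[Vector] = []
    for token, pieces in sentence:
        vector = embeddings.get(token, zero)
        aligned.extend(list(vector) for _ in pieces)
    return aligned


def build_injection_sequence(sent1: Sentence, sent2: Sentence,
                             embeddings: Dict[str, Vector],
                             dim: int) -> List[Vector]:
    """Lay out the vectors as [CLS] sent1 [SEP] sent2 [SEP]."""
    special = [0.0] * dim  # assumption: special tokens get zero vectors
    return (
        [list(special)]
        + align_to_word_pieces(sent1, embeddings, dim)
        + [list(special)]
        + align_to_word_pieces(sent2, embeddings, dim)
        + [list(special)]
    )


# Toy lookup table; 'prompt' is split into 'pro' and '##mpt' as in Figure 1.
toy_embeddings = {"prompt": [0.5, -0.2], "reply": [0.1, 0.3]}
sent1 = [("prompt", ["pro", "##mpt"])]
sent2 = [("reply", ["reply"])]
injection = build_injection_sequence(sent1, sent2, toy_embeddings, dim=2)
```

The resulting sequence has one vector per word piece, so it can be combined position by position with BERT's representation of the same input.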

Multihead Attention Injection

Multihead attention was proposed by vaswani_attention_2017:

$$\mathrm{MultiheadAtt}(Q, K, V) = [\mathrm{head}_1; \ldots; \mathrm{head}_h] W^O, \qquad \mathrm{head}_j = \mathrm{Attention}(Q W_j^Q, K W_j^K, V W_j^V) \tag{3}$$

and is employed in Transformer networks in the form of self-attention (where queries $Q$, keys $K$ and values $V$ come from the previous layer) or encoder-decoder attention (where queries come from the decoder; keys and values from the encoder). Previous work has successfully employed multihead attention to combine BERT with external information (see section 2). For example, in their multimodal ViLBERT model, lu_vilbert_2019 combined textual and visual representations by passing the keys and values from each modality as input to the other modality’s multi-head attention block. Similarly, peters_knowledge_2019 used multihead attention to combine projected BERT representations (as queries) with entity-span representations (as keys and values) in their knowledge-enrichment method for BERT. For our case of combining BERT with the injection sequence $I$, it is therefore intuitive to try to use the following multi-head attention injection method:

$$H'_i = \mathrm{MultiheadAtt}(H_i, I, I) + H_i \tag{4}$$

where queries are provided by BERT’s internal representation $H_i$, while keys and values come from the injected embeddings $I$. The output of the attention mechanism is then combined with the previous layer through addition.
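To make the attention-based injection concrete, the following pure-Python sketch computes scaled dot-product multihead attention with queries taken from BERT's representation (called `h` here) and keys and values taken from the injected sequence (called `inj`), and then adds the result back onto `h`. It is an illustration under assumptions: biases are left out for brevity, all names and toy values are hypothetical, and the per-head projections are realised by slicing the columns of one full projection matrix.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention for a single head."""
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(q_row, k_row)) / scale
               for k_row in k] for q_row in q]
    return matmul([softmax(row) for row in scores], v)


def multihead_injection(h: Matrix, inj: Matrix, w_q: Matrix, w_k: Matrix,
                        w_v: Matrix, w_o: Matrix, heads: int) -> Matrix:
    """Queries come from h; keys and values come from inj; result is added to h."""
    q, k, v = matmul(h, w_q), matmul(inj, w_k), matmul(inj, w_v)
    d_head = len(q[0]) // heads
    head_outputs = []
    for n in range(heads):
        cols = slice(n * d_head, (n + 1) * d_head)
        head_outputs.append(attention([r[cols] for r in q],
                                      [r[cols] for r in k],
                                      [r[cols] for r in v]))
    # Concatenate the heads position by position, then apply the output projection.
    concat = [sum((head[t] for head in head_outputs), [])
              for t in range(len(h))]
    attended = matmul(concat, w_o)
    return [[a + b for a, b in zip(h_row, a_row)]
            for h_row, a_row in zip(h, attended)]


# Toy sizes (assumed): N = 2 positions, D = 4, E = 2, two heads.
h = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.2, 0.1]]
inj = [[0.5, -0.2], [0.1, 0.3]]
eye = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
w_k = [[0.5, 0.1, 0.0, 0.2], [0.3, 0.0, 0.1, 0.4]]
w_v = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]]
enriched = multihead_injection(h, inj, eye, w_k, w_v, eye, heads=2)
```

Note that the query and output projections are both D x D, which is where most of the additional parameters of this variant come from.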

Gated Injection

The above multihead attention injection mechanism is rather complex and requires many parameters. We therefore propose an alternative way of combining external embeddings with BERT which requires only 14% of the parameters used in multi-head attention (see Appendix B). First, we add a feed-forward layer, consisting of a linear layer with weights $W_P \in \mathbb{R}^{E \times D}$ and bias $b_P \in \mathbb{R}^{D}$ followed by a tanh activation function, to project the aligned embedding sequence $I$ to BERT’s internal dimensions and squash the output values to a range between -1 and 1 (step 5):

$$P = \tanh(I W_P + b_P), \qquad P \in \mathbb{R}^{N \times D} \tag{5}$$

Then, we use a residual connection to inject the projected external information $P$ into BERT’s representation $H_i$ from Transformer block $i$ (see section 4.3 for injection at different locations) and obtain a new enriched representation $H'_i$:

$$H'_i = H_i + P \tag{6}$$

However, as injection values can get rather large (between -1 and 1) in comparison to BERT’s internal representation (based on our observation usually ranging around -0.1 to 0.1), a downside of directly injecting external information in this way is that BERT’s pre-trained information can be easily overwritten by the injection, resulting in catastrophic forgetting. To address this potential pitfall, we further propose a gating mechanism which uses a gating vector $g \in \mathbb{R}^{D}$ to scale the injected information before combining it with BERT’s internal representation as follows:

$$H'_i = H_i + g \odot P \tag{7}$$

where $\odot$ denotes element-wise multiplication using broadcasting (steps 6 and 7). The gating parameters are initialised with zeros and updated during training. This has the benefit of starting finetuning from representations which are equivalent to vanilla BERT and gradually introducing the injected information during finetuning along certain dimensions. In case the external representations are not beneficial for the task, it is easy for the model to ignore them by keeping the gating parameters at zero.
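The projection and gating steps can be illustrated with a small pure-Python sketch. It is only a sketch under assumptions: the names `project` and `inject`, the symbols in the comments and all toy values are chosen here for illustration, and training of the gate is not shown.

```python
import math
from typing import List

Matrix = List[List[float]]


def project(inj: Matrix, w_p: Matrix, b_p: List[float]) -> Matrix:
    """Linear layer plus tanh: maps N x E injected embeddings to N x D."""
    return [
        [math.tanh(sum(x * w for x, w in zip(row, col)) + b)
         for col, b in zip(zip(*w_p), b_p)]
        for row in inj
    ]


def inject(h: Matrix, p: Matrix, gate: List[float]) -> Matrix:
    """Add the projected embeddings to h, scaled per dimension by the gate.

    The same gate vector is applied at every position (broadcasting).
    """
    return [
        [h_v + g * p_v for h_v, p_v, g in zip(h_row, p_row, gate)]
        for h_row, p_row in zip(h, p)
    ]


# Toy sizes (assumed): N = 2 word pieces, E = 2, D = 3.
h = [[0.05, -0.10, 0.08], [-0.02, 0.07, 0.01]]
inj = [[0.5, -0.2], [0.1, 0.3]]
w_p = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]]
b_p = [0.0, 0.1, -0.1]

p = project(inj, w_p, b_p)
at_init = inject(h, p, gate=[0.0, 0.0, 0.0])  # zero gate: identical to h
ungated = inject(h, p, gate=[1.0, 1.0, 1.0])  # plain residual addition
```

With the gate at zero the enriched representation equals BERT's own, which is the starting point of finetuning described above; a gate of all ones reduces to the ungated residual variant.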

Output layer

The combined representation $H'_i$ is then fed as input to BERT’s next Transformer block (step 8). At the final Transformer block $L$, we use the $C$ vector which corresponds to the ‘[CLS]’ token in the input and is typically used as the sentence pair representation (step 9). As proposed by devlin_bert_2019, this is followed by a softmax classification layer (with weights $W_C \in \mathbb{R}^{D \times K}$ and $b_C \in \mathbb{R}^{K}$) to calculate class probabilities where $K$ indicates the number of classes. During finetuning, we train the entire model for 3 epochs with early stopping and cross-entropy loss. Learning rates are tuned for each seed and dataset based on development set performance (reported in Appendix C).

4.2 Injected Embeddings

While any kind of information could be injected, we focus on two types of pretrained embeddings: dependency-based (levy_dependency-based_2014) and counter-fitted embeddings (mrksic_counter-fitting_2016). Our choice is motivated by previous research which found syntactic features useful for semantic similarity detection (filice_kelp_2017; feng_beihang-msra_2017) and counter-fitted embeddings helpful in several other tasks (alzantot_generating_2018; jin_is_2020).

The dependency-based embeddings by levy_dependency-based_2014 extend the SkipGram embedding algorithm proposed by mikolov_efficient_2013 by replacing linear bag-of-word contexts with dependency-based contexts which are extracted from parsed English Wikipedia sentences. As BERT has not been exposed to dependencies during pretraining and previous studies have found that BERT’s knowledge of syntax is only partial (rogers_primer_2020), we reason that these embeddings could provide helpful complementary information.

The counter-fitted embeddings by mrksic_counter-fitting_2016 integrate antonymy and synonymy relations into word embeddings based on an objective function which combines three principles: repelling antonyms, attracting synonyms and preserving the vector space. For training, they obtain synonymy and antonymy pairs from the Paraphrase Database and WordNet, demonstrating an increased performance on SimLex-999. We use their highest-scoring vectors which they obtained by applying their counter-fitting method to Paragram vectors from wieting_paraphrase_2015. We reason that the antonym and synonym relations contained in the word embeddings could be especially useful for paraphrase detection by explicitly capturing these semantic relations.

4.3 Injection Settings

                                        MSRP    Quora   SemEval A  SemEval B  SemEval C
BERT                                    .906    .906    .714       .754       .414
GiBERT with dependency embeddings
  - no gating                           .906    .905    .732       .751       .424
  - with gating                         .913    .908    .755       .778       .433
GiBERT with counter-fitted embeddings
  - no gating                           .907    .906    .733       .763       .435
  - with gating                         .907    .908    .751       .767       .451
Table 1: F1 development scores of models injecting pretrained embeddings after the embedding layer with vs. without gating mechanism.

Gating Mechanism

Catastrophic forgetting is a potential problem when introducing external information into a pre-trained model as the injected information could disturb or completely overwrite existing knowledge (wang_k-adapter_2020). In our proposed model, a gating mechanism is used to scale injected embeddings before adding them to the pre-trained internal BERT representation (see section 4.1). To understand the importance of this mechanism, we contrast development set performance for injecting information after the embedding layer with gating - as defined in equation 7 - and without - as in equation 6 - (Table 1). For dependency embedding injection without gating, performance only improves on 2 out of 5 datasets over the baseline and in some cases even drops below BERT’s performance, while it outperforms the baseline on all datasets when using the gating mechanism. Counter-fitted embedding injection without gating improves on 4 out of 5 datasets, with further improvements when adding gating, outperforming the vanilla BERT model across all datasets. In addition, gating makes model training more stable and reduces failed runs (where the model predicted only the majority class) from 30% to 0% on the particularly imbalanced SemEval C dataset. This highlights the importance of the gating mechanism in our proposed architecture.

Injection Location

In our proposed model, information can be injected between any of BERT’s pre-trained transformer blocks. We reason that different locations may be more appropriate for certain kinds of embeddings as previous research has found that different types of information tend to be encoded and processed at specific BERT layers (rogers_primer_2020). We experiment with injecting embeddings at three possible locations: after the embedding layer (using $H_0$), after the middle layer (using $H_6$ in BERT$_{BASE}$) and after the penultimate layer (using $H_{11}$ in BERT$_{BASE}$). Table 2 shows that midlayer injection is ideal for counter-fitted embeddings, while late injection appears to work best for dependency embeddings. This is in line with previous work which found that BERT tends to process syntactic information at later layers than linear word-level information (rogers_primer_2020). We consequently use these injection locations in our final model.
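The three injection locations differ only in where the injection function is called during the forward pass. The sketch below shows this control flow with stand-in blocks that merely record their order of execution; the function and variable names are hypothetical, and real Transformer blocks would replace the stubs.

```python
from typing import Callable, List, Sequence

State = List[str]


def encode_with_injection(h0: State,
                          blocks: Sequence[Callable[[State], State]],
                          inject: Callable[[State], State],
                          layer: int) -> State:
    """Run all blocks in order and inject once after block `layer`.

    layer = 0 injects directly after the embedding layer.
    """
    if not 0 <= layer <= len(blocks):
        raise ValueError("layer must lie between 0 and the number of blocks")
    h = inject(h0) if layer == 0 else h0
    for index, block in enumerate(blocks, start=1):
        h = block(h)
        if index == layer:
            h = inject(h)
    return h


def make_block(index: int) -> Callable[[State], State]:
    """Stub block that only records that it was executed."""
    def block(h: State) -> State:
        return h + [f"block{index}"]
    return block


blocks = [make_block(i) for i in range(1, 13)]  # 12 blocks, as in a 12-layer BERT


def record_injection(h: State) -> State:
    return h + ["inject"]


trace_mid = encode_with_injection(["embeddings"], blocks, record_injection, layer=6)
trace_late = encode_with_injection(["embeddings"], blocks, record_injection, layer=11)
trace_embd = encode_with_injection(["embeddings"], blocks, record_injection, layer=0)
```

In each trace the injection marker appears exactly once, directly after the chosen block, and every later block operates on the enriched representation.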

                                        MSRP    Quora   SemEval A  SemEval B  SemEval C
BERT                                    .906    .906    .714       .754       .414
GiBERT with dependency embeddings
  - embd layer                          .913    .908    .755       .778       .433
  - layer 6                             .911    .908    .755       .776       .438
  - layer 11                            .914    .910    .760       .773       .444
GiBERT with counter-fitted embeddings
  - embd layer                          .907    .908    .751       .767       .451
  - layer 6                             .917    .909    .760       .771       .464
  - layer 11                            .910    .907    .755       .771       .450
Table 2: F1 scores of embedding injection at different layers on the development set.

                                        F1                                                 non-obvious F1
                                        MSRP    Quora   SemEval A  SemEval B  SemEval C    MSRP    Quora   SemEval A  SemEval B  SemEval C
Previous systems
filice_kelp_2017                        -       -       -          .506       -            -       -       -          .199       -
wu_ecnu_2017                            -       -       .777       -          -            -       -       .707       -          -
koreeda_bunji_2017                      -       -       -          -          .197         -       -       -          -          .028
pang_text_2016                          .829    -       -          -          -            -       -       -          -          -
gong_natural_2018 (accuracy)            -       (.891)  -          -          -            -       -       -          -          -
zhang_semantics-aware_2020*             .882    .718    -          -          -            -       -       -          -          -
Our implementation
BERT                                    .876    .902    .704       .473       .268         .827    .860    .656       .243       .085
AiBERT with dependency embeddings       .871    .903    .745       .495       .272         .827    .866    .680       .248       .092
GiBERT with dependency embeddings       .883    .904    .768       .474       .238         .849    .864    .704       .231       .087
AiBERT with counter-fitted embeddings   .877    .904    .724       .496       .263         .835    .867    .662       .264       .076
GiBERT with counter-fitted embeddings   .884    .907    .780       .511       .256         .858    .862    .719       .248       .090
Table 3: Model performance on test set. All BERT-based methods use BERT$_{BASE}$. The first 6 rows are taken from the cited papers, the rest are our implementations. Bold font highlights the best system. * indicates that the system reported performance on a slightly different dataset version.

5 Evaluation


Our main evaluation metric is F1 score as this is more meaningful for datasets with imbalanced label distributions (such as SemEval C) than accuracy. We also report performance on difficult cases using the non-obvious F1 score (peinelt_aiming_2019). This metric distinguishes non-obvious instances in a dataset from obvious ones based on lexical overlap and gold labels, calculating a separate F1 score for challenging cases. It therefore tends to be lower than the normal F1 score.
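The difference between accuracy and F1 on imbalanced data can be shown with a small worked example. The labels below are invented toy data, not taken from any of the datasets, and the helper names are illustrative; the non-obvious F1 variant is not reproduced here because its exact definition is given in peinelt_aiming_2019.

```python
from typing import Sequence


def f1_score(gold: Sequence[int], pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(gold: Sequence[int], pred: Sequence[int]) -> float:
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy imbalanced data: one positive instance among ten.
gold = [1] + [0] * 9
majority_only = [0] * 10           # a model that always predicts the majority class
mixed = [1, 1] + [0] * 8           # finds the positive, with one false alarm

acc_majority = accuracy(gold, majority_only)
f1_majority = f1_score(gold, majority_only)
f1_mixed = f1_score(gold, mixed)
```

A model that only ever predicts the majority class reaches high accuracy on such data while its F1 is zero, which is why F1 is the more informative metric here.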


dodge_fine-tuning_2020 recently showed that early stopping and random seeds can have considerable impact on the performance of finetuned BERT models. In this work, we finetune all models for 3 epochs with early stopping. Our reported scores average model performance across two different seeds for BERT-based models.

5.1 Baselines


For the BERT baseline, we follow standard practice and encode the sentence pair with BERT’s $C$ vector from the final layer, followed by a softmax layer. We finetune all layers for 3 epochs with early stopping. Following devlin_bert_2019, we tune learning rates on the dev set of each dataset.


We further provide an alternative Attention-based embedding Injection method for BERT (AiBERT) based on the multihead attention injection mechanism described in equations 3 to 4. For direct comparison, we inject embeddings at the same layers as GiBERT (layer 6 for counter-fitted embeddings and layer 11 for dependency-based embeddings). We follow the same finetuning procedure as GiBERT and the BERT baseline.

Previous systems

For SemEval, we compare against the best participating SemEval 2017 system for each subtask based on F1 score. For MSRP, we show a neural matching architecture (pang_text_2016). For Quora, we compare against the Interactive Inference Network (gong_natural_2018) using accuracy, as no F1 has been reported. We also provide a semantics-aware BERT model (zhang_semantics-aware_2020) which leverages a semantic role labeler.

Sentence 1 Sentence 2 Gold label BERT prediction GiBERT prediction
(1) it took me more than 10 people; over the course of the whole day to convince my point at qatar airways… as to how my points needs to be redeemed… at long last my point was made… dont seem know what they are doing??? appalling to say the least this isn’t the first time. so many rants by irate customers on so many diverse situations signals a very serious problem. so called first class airlines and no basic customer care. over confidence much? is related not related is related
(2) hi; my wife was on a visit visa; today; her residency visa was issued; so i went to immigration and paid 500 so there is no need to leave the country and enter again on the residency visa . she has done her medical before for the visit visa extension; do we need to do the medical again for the residency visa? thanks dear all; please let me know how many days taking for approve family visa nw; am last wednesday (12/09/2012) apply family visa for my husband and daughter; but still now showing in moi website itz under review; itz usual reply? why delayed like this? please help me regards divya is related is related not related
Table 4: Examples from the SemEval development set. Synonym and antonym pairs are highlighted in bold.

6 Results

Comparison with previous systems

GiBERT with counter-fitted embeddings outperforms BERT and previous systems in F1 score across all datasets except SemEval C (see Table 3). It also improves the performance on non-obvious cases in comparison to previous systems. The largest improvement of GiBERT is observed with counter-fitted embeddings, especially on the internal CQA datasets SemEval A and B (the datasets with the highest proportion of examples involving synonym pairs, see section 6). GiBERT with dependency embeddings still generally improves over vanilla BERT, but performance gains tend to be smaller than with counter-fitted embeddings, possibly because semantic information tends to be more important for the tasks at hand.

Injection method

When contrasting the gated injection method (GiBERT) with an alternative attention-based injection method (AiBERT), we find that both injection methods generally improve over the performance of the BERT baseline. In direct comparison between both methods, we find that injecting embeddings with the lightweight gated method achieves comparable results to the complex multihead attention injection method for introducing dependency embeddings, while for the injection of counter-fitted embeddings, GiBERT even outperforms AiBERT.

                                        MSRP    Quora   SemEval A  SemEval B  SemEval C
Instances with antonym pairs
  share of instances                    (4%)    (4%)    (21%)      (28%)      (20%)
  BERT                                  .81     .87     .77        .75        .46
  GiBERT                                .81     .86     .77        .75        .46
Instances with synonym pairs
  share of instances                    (11%)   (9%)    (22%)      (31%)      (17%)
  BERT                                  .87     .90     .81        .78        .54
  GiBERT                                .90     .91     .82        .83        .54
Instances without synonym/antonym pairs
  share of instances                    (85%)   (87%)   (64%)      (51%)      (68%)
  BERT                                  .91     .91     .71        .72        .36
  GiBERT                                .92     .91     .73        .73        .41
Table 5: F1 score on instances containing synonymy pairs, antonymy pairs or no pairs across datasets. (The added percentage of the three groups can exceed 100 as an instance can contain synonym and antonym pairs.)

Error Analysis

Counter-fitted embeddings are designed to explicitly encode synonym and antonym relationships between words. To better understand how the injection of counter-fitted embeddings affects the ability of our model to deal with instances involving such semantic relations, we use synonym and antonym pairs from PPDB and WordNet (provided by mrksic_counter-fitting_2016) and search the development partition of the datasets for sentence pairs where the first sentence contains one word of the synonym/antonym pair and the second sentence the other word. Table 5 reports F1 performance of our model on cases with synonym pairs, antonym pairs and neither. We find that our model’s F1 performance particularly improves over BERT on instances containing synonym pairs, as illustrated in example (1) in Table 4. By contrast, the performance on cases with antonym pairs stays roughly the same, even slightly decreasing on Quora. This can be understood with the help of example (2) in Table 4, as word pairs can be antonyms in isolation (e.g. husband - wife), but not in the specific context of a given example (e.g. it is not important whether the visa is for the wife or the husband). In rare cases, the injection of distant antonym pair embeddings could deter the model from detecting related sentence pairs. We also observe a slight performance boost for cases that do not contain synonym or antonym pairs. This could be because of improved representations for words which occurred in examples without their synonym or antonym counterpart.
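The search for instances with synonym or antonym pairs can be sketched as follows. This is an illustrative reading of the procedure described above, not the authors' script: the function names are hypothetical, the lowercased regular-expression tokenisation is an assumption, and the single pair list stands in for the much larger PPDB and WordNet resources.

```python
import re
from typing import Iterable, Set, Tuple


def words(sentence: str) -> Set[str]:
    """Lowercased alphabetic tokens (assumed tokenisation for this sketch)."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def has_cross_sentence_pair(sentence1: str, sentence2: str,
                            pairs: Iterable[Tuple[str, str]]) -> bool:
    """True if one word of a pair is in sentence1 and the other in sentence2."""
    first, second = words(sentence1), words(sentence2)
    return any(
        (a in first and b in second) or (b in first and a in second)
        for a, b in pairs
    )


# Excerpts from example (2) in Table 4 with the antonym pair discussed above.
antonym_pairs = [("husband", "wife")]
s1 = "my wife was on a visit visa"
s2 = "apply family visa for my husband and daughter"

found = has_cross_sentence_pair(s1, s2, antonym_pairs)
same_sentence_only = has_cross_sentence_pair("my husband and wife", "visa", antonym_pairs)
```

A pair counts only when its two words are split across the two sentences, so a sentence that contains both words on its own does not qualify.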

7 Conclusion

In this paper, we introduced a new approach for injecting external information into BERT. Our proposed method adds linguistically enriched embeddings to BERT’s internal representation through a lightweight gating mechanism which requires significantly fewer parameters than a multihead attention injection method. Evaluating our injection method on multiple semantic similarity detection datasets, we demonstrated that injecting counter-fitted embeddings clearly improved performance over vanilla BERT, while dependency embeddings achieved slightly smaller gains for these tasks. In comparison to the multihead attention injection mechanism, we found the gated method at least as effective, with comparable performance for dependency embedding and improved results for counter-fitted embeddings. Our qualitative analysis highlighted that counter-fitted injection was particularly helpful for instances involving synonym pairs. Future work could explore combining multiple embedding sources or injecting other types of information. Another direction is to investigate the usefulness of embedding injection for other tasks or compressed BERT models.



Appendix A Datasets

Dataset | Task | Sentence 1 | Sentence 2 | Label
MSRP | Paraphrase detection | There are only 2,000 Roman Catholics living in Banja Luka now. | There are just a handful of Catholics left in Banja Luka. | is_paraphrase
Quora | Paraphrase detection | Which is the best way to learn coding? | How do you learn to program? | is_paraphrase
SemEval (A) | Internal answer detection | Anybody recommend a good dentist in Doha? | Dr Sarah Dental Clinic | is_related
SemEval (B) | Paraphrase detection | Where I can buy good oil for massage? | Blackheads - Any suggestions on how 0 to get rid of them?? | not_related
SemEval (C) | External answer detection | Can anybody tell me where is Doha clinic? | Dr. Rizwi - Al Ahli Hospital | not_related
Table 6: Text pair similarity data sets with examples.

Appendix B Required Injection Parameters

This section compares the number of required parameters in the two alternative injection methods discussed in section 4.1: a multihead attention injection mechanism which is based on previous methods for combining external knowledge with BERT and a novel lightweight gated injection mechanism.

Multihead attention injection

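In multihead attention injection (equations 3 to 4), the queries are provided by BERT’s representation $H_i$ from the injection layer and the keys and values are the injected information $I$. Multihead attention requires the following weight matrices and biases to transform queries, keys and values (indicated by $Q$, $K$ and $V$) and transform the attention output (indicated by $O$):

$$W^Q \in \mathbb{R}^{D \times D},\ b^Q \in \mathbb{R}^{D},\quad W^K \in \mathbb{R}^{E \times D},\ b^K \in \mathbb{R}^{D},\quad W^V \in \mathbb{R}^{E \times D},\ b^V \in \mathbb{R}^{D},\quad W^O \in \mathbb{R}^{D \times D},\ b^O \in \mathbb{R}^{D}$$

where $D$ indicates BERT’s hidden dimension and $E$ indicates the dimensionality of the injected embeddings. When injecting embeddings of dimensionality $E$ (see section 4.2) into BERT$_{BASE}$ with $D = 768$, this amounts to $2D^2 + 2ED + 4D$ new parameters.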

Gated injection

The proposed gated injection method (equations 5 to 7) only introduces the weights and biases from the projection layer, as well as the gating vector:

$$W^P \in \mathbb{R}^{d_E \times d_H}, \quad b^P \in \mathbb{R}^{d_H}, \quad g \in \mathbb{R}^{d_H}$$

Therefore, injecting embeddings with dimensionality $d_E = 300$ into BERT with hidden dimension $d_H = 768$ requires $d_E d_H + 2 d_H = 231{,}936$ new parameters. Our proposed gated injection mechanism only requires 14% of the parameters used in a multihead attention injection mechanism. Using fewer parameters results in a smaller model, which is especially beneficial for injecting information during finetuning, where small learning rates and few epochs make it difficult to learn large amounts of new parameters.
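
The comparison between the two mechanisms can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the function names are hypothetical, the hidden size of 768 matches the 768-dimensional gating vector reported in Appendix D, and the embedding dimensionality of 300 is an assumption that is consistent with the stated 14% ratio.

```python
def multihead_injection_params(d_h: int, d_e: int) -> int:
    """Count weights and biases of the query, key, value and output transforms.

    Assumes queries come from BERT (d_h -> d_h) and keys and values come
    from the injected embeddings (d_e -> d_h).
    """
    query = d_h * d_h + d_h
    key = d_e * d_h + d_h
    value = d_e * d_h + d_h
    output = d_h * d_h + d_h
    return query + key + value + output


def gated_injection_params(d_h: int, d_e: int) -> int:
    """Count the projection layer (weights and biases) plus the gating vector."""
    projection = d_e * d_h + d_h
    gate = d_h
    return projection + gate


D_H = 768  # BERT hidden dimension (assumed: BERT-base)
D_E = 300  # injected embedding dimension (assumption, see lead-in)

multihead = multihead_injection_params(D_H, D_E)
gated = gated_injection_params(D_H, D_E)
ratio = gated / multihead

print(f"multihead attention injection: {multihead:,} parameters")
print(f"gated injection:               {gated:,} parameters")
print(f"gated / multihead:             {ratio:.1%}")
```

Under these assumptions the gated mechanism needs roughly one seventh of the parameters, because it avoids the two $d_H \times d_H$ matrices that dominate the multihead attention count.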

Appendix C Best Hyper-Parameters

Hyper-parameters were chosen based on development set F1 scores.

Setting | MSRP | Quora | SemEval A | SemEval B | SemEval C
Batch size | 32 | 32 | 16 | 32 | 16
Learning rate (1st seed) | 5e-5 | 2e-5 | 3e-5 | 2e-5 | 2e-5
Learning rate (2nd seed) | 5e-5 | 2e-5 | 2e-5 | 2e-5 | 3e-5
AiBERT with dependency-based embeddings
Learning rate (1st seed) | 3e-5 | 3e-5 | 2e-5 | 3e-5 | 2e-5
Learning rate (2nd seed) | 5e-5 | 2e-5 | 2e-5 | 5e-5 | 2e-5
AiBERT with counter-fitted embeddings
Learning rate (1st seed) | 5e-5 | 2e-5 | 2e-5 | 3e-5 | 2e-5
Learning rate (2nd seed) | 5e-5 | 3e-5 | 5e-5 | 3e-5 | 2e-5
GiBERT with dependency-based embeddings
Learning rate (1st seed) | 2e-5 | 3e-5 | 2e-5 | 3e-5 | 2e-5
Learning rate (2nd seed) | 3e-5 | 2e-5 | 2e-5 | 5e-5 | 3e-5
GiBERT with counter-fitted embeddings
Learning rate (1st seed) | 5e-5 | 2e-5 | 2e-5 | 5e-5 | 2e-5
Learning rate (2nd seed) | 5e-5 | 3e-5 | 3e-5 | 5e-5 | 3e-5
Table 7: Tuned hyper-parameters for BERT-based models.

Appendix D Gating Parameter Analysis

Figure 2: Histogram of the 768-dimensional gating vector across datasets for GiBERT with counter-fitted embeddings (left) and GiBERT with dependency embeddings (right).

As described in section 4, the gating parameters in our proposed model are initialised as a vector of zeros. During training, the model can learn to gradually inject external information by adjusting gating parameters to positive values for adding, or to negative values for subtracting, injected information along certain dimensions. Alternatively, injection stays turned off if all parameters remain at zero. Figure 2 shows a histogram of learned gating vectors for our best GiBERT models with counter-fitted (left) and dependency embedding injection (right). On most datasets, the majority of parameters have been updated to small non-zero values, letting through controlled amounts of injected information without completely overwriting BERT’s internal representation. Only on SemEval B (with 4K instances the smallest of the datasets, compare section 3) do more than 500 of the 768 dimensions of the injected information stay blocked out for both model variants. The gating parameters also filter out many dimensions of the dependency-based embeddings on MSRP (the second smallest dataset). This suggests that models trained on smaller datasets may benefit from slightly longer finetuning or a different gating parameter initialisation to make full use of the injected information (see footnote 5).
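
The role of the gating vector can be illustrated with a toy example. This is a minimal sketch under stated assumptions rather than the model's implementation: the function names, the tanh applied to the projection, the omission of any later normalisation, and the tolerance used to call a dimension "blocked" are all hypothetical choices made for illustration, and the tiny vectors stand in for the 768-dimensional ones.

```python
import math


def gated_injection(h, e, w_p, b_p, g):
    """Add a gated projection of the injected embedding e to the hidden vector h.

    h: hidden vector (length d_h); e: injected embedding (length d_e);
    w_p: d_e x d_h projection weights; b_p: projection bias (length d_h);
    g: gating vector (length d_h). The tanh is an assumption of this sketch.
    """
    projected = [
        math.tanh(sum(e_i * row[j] for e_i, row in zip(e, w_p)) + b_p[j])
        for j in range(len(h))
    ]
    # A positive gate adds the projected value, a negative gate subtracts it,
    # and a zero gate leaves that dimension of h untouched.
    return [h_j + g_j * p_j for h_j, g_j, p_j in zip(h, g, projected)]


def blocked_dimensions(g, tol=1e-3):
    """Count gating parameters that are still (close to) zero."""
    return sum(1 for g_j in g if abs(g_j) < tol)


# Toy values (hypothetical), with d_h = 4 and d_e = 2.
h = [0.5, -1.0, 2.0, 0.1]
e = [1.0, -0.5]
w_p = [[0.2, -0.1, 0.4, 0.3], [0.1, 0.5, -0.3, 0.2]]
b_p = [0.0, 0.0, 0.0, 0.0]

g_init = [0.0] * 4                    # zero initialisation: injection is off
g_trained = [0.05, 0.0, -0.02, 0.0]   # two dimensions opened during training

unchanged = gated_injection(h, e, w_p, b_p, g_init)
injected = gated_injection(h, e, w_p, b_p, g_trained)

print(unchanged == h)                 # the zero gate returns h unchanged
print(blocked_dimensions(g_trained))  # number of dimensions still blocked
```

Counting near-zero entries in this way is one simple reading of the histograms in Figure 2: a gating vector with many such entries lets little of the injected information through.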


  3. We use the uncased version of BERT available through TensorFlow Hub.
  4. As reflected by the lower scores compared to other datasets, SemEval C is particularly difficult due to the external question-answering scenario and its highly imbalanced label distribution, which varies between the train, dev and test sets.
  5. Note that we train models for the same number of epochs, but one epoch uses all training examples contained in the dataset. This gives models trained on larger datasets more opportunity to update their parameters.