BERTScore: Evaluating Text Generation with BERT
We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTScore is more robust to challenging examples when compared to existing metrics.
Automatic evaluation of natural language generation, for example in machine translation and caption generation, requires comparing candidate sentences to annotated references. The goal is to evaluate semantic equivalence. However, commonly used methods rely on surface-form similarity only. For example, Bleu (Papineni et al., 2002), the most common machine translation metric, simply counts $n$-gram overlap between the candidate and the reference. While this provides a simple and general measure, it fails to account for meaning-preserving lexical and compositional diversity.
In this paper, we introduce BERTScore, a language generation evaluation metric based on pre-trained BERT contextual embeddings (Devlin et al., 2019). BERTScore computes the similarity of two sentences as a sum of cosine similarities between their tokens’ embeddings.
BERTScore addresses two common pitfalls in $n$-gram-based metrics (Banerjee and Lavie, 2005). First, such methods often fail to robustly match paraphrases. For example, given the reference people like foreign cars, Bleu and Meteor (Banerjee and Lavie, 2005) incorrectly give a higher score to people like visiting places abroad compared to consumers prefer imported cars. This leads to performance underestimation when semantically-correct phrases are penalized because they differ from the surface form of the reference. In contrast to string matching (e.g., in Bleu) or matching heuristics (e.g., in Meteor), we compute similarity using contextualized token embeddings, which have been shown to be effective for paraphrase detection (Devlin et al., 2019). Second, $n$-gram models fail to capture distant dependencies and penalize semantically-critical ordering changes (Isozaki et al., 2010). For example, given a small window of size two, Bleu will only mildly penalize swapping of cause and effect clauses (e.g. A because B instead of B because A), especially when the arguments A and B are long phrases. In contrast, contextualized embeddings are trained to effectively capture distant dependencies and ordering.
We experiment with BERTScore on machine translation and image captioning tasks using the outputs of 363 systems by correlating BERTScore and related metrics to available human judgments. Our experiments demonstrate that BERTScore correlates highly with human evaluations. In machine translation, BERTScore shows stronger system-level and segment-level correlations with human judgments than existing metrics on multiple common benchmarks and demonstrates strong model selection performance compared to Bleu. We also show that BERTScore is well-correlated with human annotators for image captioning, surpassing Spice, a popular task-specific metric (Anderson et al., 2016). Finally, we test the robustness of BERTScore on the adversarial paraphrase dataset PAWS (Zhang et al., 2019), and show that it is more robust to adversarial examples than other metrics. The code for BERTScore is available at https://github.com/Tiiiger/bert_score.
2 Problem Statement and Prior Metrics
Natural language text generation is commonly evaluated using annotated reference sentences. Given a reference sentence $x$ tokenized to $k$ tokens $\langle x_1, \ldots, x_k \rangle$ and a candidate $\hat{x}$ tokenized to $l$ tokens $\langle \hat{x}_1, \ldots, \hat{x}_l \rangle$, a generation evaluation metric is a function $f(x, \hat{x}) \in \mathbb{R}$. Better metrics have a higher correlation with human judgments. Existing metrics can be broadly categorized into using $n$-gram matching, edit distance, embedding matching, or learned functions.
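2.1 $n$-gram Matching Approaches
The most commonly used metrics for generation count the number of $n$-grams that occur in the reference $x$ and candidate $\hat{x}$. The higher the $n$ is, the more the metric is able to capture word order, but it also becomes more restrictive and constrained to the exact form of the reference.
Formally, let $S^n_x$ and $S^n_{\hat{x}}$ be the lists of token $n$-grams ($n \in \mathbb{Z}_+$) in the reference $x$ and candidate $\hat{x}$ sentences. The number of matched $n$-grams is $\sum_{w \in S^n_{\hat{x}}} \mathbb{I}[w \in S^n_x]$, where $\mathbb{I}[\cdot]$ is an indicator function. The exact match precision (Exact-$P_n$) and recall (Exact-$R_n$) scores are:
$$\text{Exact-}P_n = \frac{\sum_{w \in S^n_{\hat{x}}} \mathbb{I}[w \in S^n_x]}{|S^n_{\hat{x}}|} \quad \text{and} \quad \text{Exact-}R_n = \frac{\sum_{w \in S^n_x} \mathbb{I}[w \in S^n_{\hat{x}}]}{|S^n_x|}.$$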
Several popular metrics build upon one or both of these exact matching scores.
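As a concrete illustration of these exact matching scores, the following minimal Python sketch computes the fraction of candidate $n$-grams that also occur in the reference (precision) and the fraction of reference $n$-grams that also occur in the candidate (recall). The names `ngrams` and `exact_match_scores` are illustrative and do not come from any released implementation; tokenization is assumed to be whitespace splitting, and repeated $n$-grams are not clipped.

```python
def ngrams(tokens, n):
    """Return the list of token n-grams, in order of occurrence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def exact_match_scores(reference, candidate, n=1):
    """Exact-match precision and recall over n-grams (no clipping of repeats)."""
    ref_ngrams = ngrams(reference, n)
    cand_ngrams = ngrams(candidate, n)
    ref_set, cand_set = set(ref_ngrams), set(cand_ngrams)
    # Precision: share of candidate n-grams that appear in the reference.
    precision = (sum(w in ref_set for w in cand_ngrams) / len(cand_ngrams)
                 if cand_ngrams else 0.0)
    # Recall: share of reference n-grams that appear in the candidate.
    recall = (sum(w in cand_set for w in ref_ngrams) / len(ref_ngrams)
              if ref_ngrams else 0.0)
    return precision, recall


reference = "people like foreign cars".split()
paraphrase = "consumers prefer imported cars".split()
distractor = "people like visiting places abroad".split()

p_para, r_para = exact_match_scores(reference, paraphrase, n=1)
p_dist, r_dist = exact_match_scores(reference, distractor, n=1)
```

On the paraphrase example from the introduction, unigram matching gives the unrelated candidate higher scores (precision 0.4, recall 0.5) than the true paraphrase (0.25 for both), which is the failure mode that motivates soft matching.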
The most widely used metric in machine translation is Bleu (Papineni et al., 2002), which includes three modifications to Exact-$P_n$. First, each $n$-gram in the reference can be matched at most once. Second, the number of exact matches is accumulated for all reference-candidate pairs in the corpus and divided by the total number of $n$-grams in all candidate sentences. Finally, very short candidates are discouraged using a brevity penalty. Typically, Bleu is computed for multiple values of $n$ (e.g. $n = 1, 2, 3, 4$) and the scores are averaged geometrically. A smoothed variant, SentBleu (Koehn et al., 2007), is computed at the sentence level. In contrast to Bleu, BERTScore is not restricted to a maximum $n$-gram length, but instead relies on contextualized embeddings that are able to capture dependencies of potentially unbounded length.
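The three modifications can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions (one reference per candidate, pre-tokenized input, no smoothing), not a replacement for a standard implementation; `ngram_counts` and `corpus_bleu` are hypothetical names.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count each n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, candidates, max_n=4):
    """Simplified corpus-level Bleu with a single reference per candidate."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = cand_len = 0
    for ref, cand in zip(references, candidates):
        ref_len += len(ref)
        cand_len += len(cand)
        for n in range(1, max_n + 1):
            ref_c, cand_c = ngram_counts(ref, n), ngram_counts(cand, n)
            # Clipping: a reference n-gram can be matched at most once per occurrence.
            matches[n - 1] += sum(min(c, ref_c[g]) for g, c in cand_c.items())
            # Counts are accumulated over the whole corpus before dividing.
            totals[n - 1] += sum(cand_c.values())
    if cand_len == 0 or min(matches) == 0:
        return 0.0  # unsmoothed: any zero n-gram precision gives a zero score
    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty discourages candidates shorter than the references.
    brevity = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity * math.exp(log_precision)


ref = [["a", "b", "c", "d", "e"]]
identical = corpus_bleu(ref, [["a", "b", "c", "d", "e"]])
truncated = corpus_bleu(ref, [["a", "b", "c", "d"]])
```

In the toy example the truncated candidate has perfect $n$-gram precision, so its score is determined entirely by the brevity penalty.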
Meteor (Banerjee and Lavie, 2005) computes Exact-$P_1$ and Exact-$R_1$ while allowing backing-off from exact unigram matching to matching word stems, synonyms, and paraphrases. For example, running may match run if no exact match is possible. Non-exact matching uses an external stemmer, a synonym lexicon, and a paraphrase table. Meteor 1.5 (Denkowski and Lavie, 2014) weighs content and function words differently, and also applies importance weighting to different matching types. The more recent Meteor++ 2.0 (Guo and Hu, 2019) further incorporates a learned external paraphrase resource. Because Meteor requires external resources, only five languages are supported with the full feature set, and eleven are partially supported. Similar to Meteor, BERTScore allows relaxed matches, but relies on BERT embeddings that are trained on large amounts of raw text and are currently available for 104 languages. BERTScore also supports importance weighting, which we estimate with simple corpus statistics.
Other Related Metrics
NIST (Doddington, 2002) is a revised version of Bleu that weighs each $n$-gram differently and uses an alternative brevity penalty. $\Delta$Bleu (Galley et al., 2015) modifies multi-reference Bleu by including human annotated negative reference sentences. chrF (Popović, 2015) compares character $n$-grams in the reference and candidate sentences. chrF++ (Popović, 2017) extends chrF to include word bigram matching. Rouge (Lin, 2004) is a commonly used metric for summarization evaluation. Rouge-$n$ (Lin, 2004) computes Exact-$R_n$ (usually $n = 1, 2$), while Rouge-$L$ is a variant of Exact-$R_1$ with the numerator replaced by the length of the longest common subsequence. CIDEr (Vedantam et al., 2015) is an image captioning metric that computes cosine similarity between tf-idf weighted $n$-grams. We adopt a similar approach to weigh tokens differently. Finally, Chaganty et al. (2018) and Hashimoto et al. (2019) combine automatic metrics with human judgments for text generation evaluation.
2.2 Edit-distance-based Metrics
Several methods use word edit distance or word error rate (Levenshtein, 1966), which quantify similarity using the number of edit operations required to get from the candidate to the reference. TER (Snover et al., 2006) normalizes edit distance by the number of reference words, and ITER (Panja and Naskar, 2018) adds stem matching and better normalization. PER (Tillmann et al., 1997) computes position independent error rate, and CDER (Leusch et al., 2006) models block reordering as an edit operation. CharacTer (Wang et al., 2016) and EED (Stanchev et al., 2019) operate on the character level and achieve higher correlation with human judgments on some languages.
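To make the basic quantity concrete, the sketch below computes word-level Levenshtein distance and normalizes it by the reference length. This is plain word error rate, shown only for illustration; it is not TER, which additionally allows block shifts, and the function names are hypothetical.

```python
def word_edit_distance(candidate, reference):
    """Minimum number of word insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed candidate prefix
    # and the first j reference words.
    prev = list(range(len(reference) + 1))
    for i, cand_word in enumerate(candidate, start=1):
        curr = [i]
        for j, ref_word in enumerate(reference, start=1):
            cost = 0 if cand_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete a candidate word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def word_error_rate(candidate, reference):
    """Edit distance normalized by the number of reference words."""
    return word_edit_distance(candidate, reference) / len(reference)


reference = "His car was still running in the driveway".split()
candidate = "His car was still in the driveway".split()
wer = word_error_rate(candidate, reference)
```

Here a single dropped word yields a small error rate of 1/8, even though the omission changes the meaning of the sentence.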
2.3 Embedding-based Metrics
Word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Grave et al., 2018; Nguyen et al., 2017; Athiwaratkun et al., 2018) are learned dense token representations.
Meant 2.0 (Lo, 2017) uses word embeddings and shallow semantic parses to compute lexical and structural similarity.
Yisi-1 (Lo et al., 2018) is similar to Meant 2.0, but makes the use of semantic parses optional.
Both methods use a relatively simple similarity computation, which inspires our approach, including using greedy matching (Corley and Mihalcea, 2005) and experimenting with a similar importance weighting to Yisi-1.
However, we use contextual embeddings, which capture the specific use of a token in a sentence, and potentially capture sequence information. We do not use external tools to generate linguistic structures, which makes our approach relatively simple and portable to new languages.
Instead of greedy matching, WMD (Kusner et al., 2015), WMDO (Chow et al., 2019), and SMS (Clark et al., 2019) propose to use optimal matching based on earth mover’s distance (Rubner et al., 1998).
2.4 Learned Metrics
Various metrics are trained to optimize correlation with human judgments. Beer (Stanojević and Sima’an, 2014) uses a regression model based on character $n$-grams and word bigrams. Blend (Ma et al., 2017) uses regression to combine 29 existing metrics. Ruse (Shimanaka et al., 2018) combines three pre-trained sentence embedding models. All these methods require costly human judgments as supervision for each dataset, and risk poor generalization to new domains, even within a known language and task (Chaganty et al., 2018). Cui et al. (2018) and Lowe et al. (2017) train a neural model to predict if the input text is human-generated. This approach also has the risk of being optimized to existing data and generalizing poorly to new data. In contrast, the model underlying BERTScore is not optimized for any specific evaluation task.
Given a reference sentence $x = \langle x_1, \ldots, x_k \rangle$ and a candidate sentence $\hat{x} = \langle \hat{x}_1, \ldots, \hat{x}_l \rangle$, we use contextual embeddings to represent the tokens, and compute matching using cosine similarity, optionally weighted with inverse document frequency scores. Figure 1 illustrates the computation.
We use contextual embeddings to represent the tokens in the input sentences $x$ and $\hat{x}$. In contrast to prior word embeddings (Mikolov et al., 2013; Pennington et al., 2014), contextual embeddings, such as BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018), can generate different vector representations for the same word in different sentences depending on the surrounding words, which form the context of the target word. The models used to generate these embeddings are most commonly trained using various language modeling objectives, such as masked word prediction (Devlin et al., 2019).
We experiment with different models (Section 4), using the tokenizer provided with each model. Given a tokenized reference sentence $x = \langle x_1, \ldots, x_k \rangle$, the embedding model generates a sequence of vectors $\langle \mathbf{x}_1, \ldots, \mathbf{x}_k \rangle$. Similarly, the tokenized candidate $\hat{x} = \langle \hat{x}_1, \ldots, \hat{x}_l \rangle$ is mapped to $\langle \hat{\mathbf{x}}_1, \ldots, \hat{\mathbf{x}}_l \rangle$. The main model we use is BERT, which tokenizes the input text into a sequence of word pieces (Wu et al., 2016), where unknown words are split into several commonly observed sequences of characters. The representation for each word piece is computed with a Transformer encoder (Vaswani et al., 2017) by repeatedly applying self-attention and nonlinear transformations in an alternating fashion. BERT embeddings have been shown to benefit various NLP tasks (Devlin et al., 2019; Liu, 2019; Huang et al., 2019; Yang et al., 2019a).
The vector representation allows for a soft measure of similarity instead of exact-string (Papineni et al., 2002) or heuristic (Banerjee and Lavie, 2005) matching. The cosine similarity of a reference token $x_i$ and a candidate token $\hat{x}_j$ is $\frac{\mathbf{x}_i^\top \hat{\mathbf{x}}_j}{\|\mathbf{x}_i\| \|\hat{\mathbf{x}}_j\|}$. We use pre-normalized vectors, which reduces this calculation to the inner product $\mathbf{x}_i^\top \hat{\mathbf{x}}_j$. While this measure considers tokens in isolation, the contextual embeddings contain information from the rest of the sentence.
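The complete score matches each token in $x$ to a token in $\hat{x}$ to compute recall, and each token in $\hat{x}$ to a token in $x$ to compute precision. We use greedy matching to maximize the matching similarity score, where each token is matched to the most similar token in the other sentence. We combine precision and recall to compute an F1 measure. For a reference $x$ and candidate $\hat{x}$, the recall, precision, and F1 scores are:
$$R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^\top \hat{\mathbf{x}}_j, \quad P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \mathbf{x}_i^\top \hat{\mathbf{x}}_j, \quad F_{\mathrm{BERT}} = 2 \frac{P_{\mathrm{BERT}} \cdot R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}.$$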
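The greedy matching step can be sketched in pure Python as follows. The sketch assumes the token vectors are already available; the toy two-dimensional vectors below merely stand in for contextual embeddings, which in practice would come from a model such as BERT. Each reference token is matched to its most similar candidate token for recall, each candidate token to its most similar reference token for precision, and the two are combined into F1. The names `normalize` and `greedy_bertscore` are illustrative.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so inner products equal cosine similarity."""
    norm = math.sqrt(sum(a * a for a in vec))
    return [a / norm for a in vec]


def greedy_bertscore(ref_vecs, cand_vecs):
    """Return (precision, recall, f1) from greedy token matching."""
    ref = [normalize(v) for v in ref_vecs]
    cand = [normalize(v) for v in cand_vecs]
    # sim[i][j] is the cosine similarity of reference token i and candidate token j.
    sim = [[sum(a * b for a, b in zip(r, c)) for c in cand] for r in ref]
    # Recall: every reference token picks its best candidate token.
    recall = sum(max(row) for row in sim) / len(ref)
    # Precision: every candidate token picks its best reference token.
    precision = sum(max(sim[i][j] for i in range(len(ref)))
                    for j in range(len(cand))) / len(cand)
    total = precision + recall
    f1 = 2 * precision * recall / total if total != 0 else 0.0
    return precision, recall, f1


# Toy vectors (not real embeddings): two reference tokens, one candidate token.
ref_vecs = [[1.0, 0.0], [0.0, 1.0]]
cand_vecs = [[2.0, 0.0]]
precision, recall, f1 = greedy_bertscore(ref_vecs, cand_vecs)
```

In this toy case the single candidate token matches the first reference token perfectly, giving precision 1, while the second reference token has no counterpart, giving recall 0.5.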
Previous work on similarity measures demonstrated that rare words can be more indicative for sentence similarity than common words (Banerjee and Lavie, 2005; Vedantam et al., 2015). BERTScore enables us to easily incorporate importance weighting. We experiment with inverse document frequency (idf) scores computed from the test corpus. Given $M$ reference sentences $\{x^{(i)}\}_{i=1}^{M}$, the idf score of a word-piece token $w$ is
$$\mathrm{idf}(w) = -\log \frac{1}{M} \sum_{i=1}^{M} \mathbb{I}[w \in x^{(i)}],$$
where $\mathbb{I}[\cdot]$ is an indicator function. We do not use the full tf-idf measure because we process single sentences, where the term frequency (tf) is likely 1. For example, recall with idf weighting is
$$R_{\mathrm{BERT}} = \frac{\sum_{x_i \in x} \mathrm{idf}(x_i) \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^\top \hat{\mathbf{x}}_j}{\sum_{x_i \in x} \mathrm{idf}(x_i)}.$$
Because we use reference sentences to compute idf, the idf scores remain the same for all systems evaluated on a specific test set. We apply plus-one smoothing to handle unknown word pieces.
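A minimal sketch of the importance weighting follows. It assumes that idf is the negative log of the fraction of reference sentences containing a token, and that plus-one smoothing adds one to both the document count and the number of sentences; the exact smoothing form is an assumption made for this illustration, as are the names `build_idf` and `idf_weighted_recall`. The similarity rows are supplied directly rather than computed from embeddings.

```python
import math
from collections import Counter


def build_idf(reference_corpus):
    """Return an idf function estimated from tokenized reference sentences."""
    num_sentences = len(reference_corpus)
    doc_freq = Counter()
    for sentence in reference_corpus:
        doc_freq.update(set(sentence))  # count each token once per sentence

    def idf(token):
        # Plus-one smoothing (assumed form) keeps unseen tokens finite.
        return -math.log((doc_freq[token] + 1) / (num_sentences + 1))

    return idf


def idf_weighted_recall(ref_tokens, sim, idf):
    """Weight each reference token's best match by its idf score.

    sim[i][j] is the similarity of reference token i and candidate token j.
    """
    weights = [idf(token) for token in ref_tokens]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * max(row) for w, row in zip(weights, sim)) / total


corpus = [["the", "cat"], ["the", "dog"], ["the", "bird"]]
idf = build_idf(corpus)
# Toy similarities: "the" matches poorly, "cat" matches well.
weighted = idf_weighted_recall(["the", "cat"], [[0.2], [0.9]], idf)
```

Because "the" occurs in every reference sentence, its weight is zero and the weighted recall is determined by the rarer token "cat".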
Because we use pre-normalized vectors, our computed scores have the same numerical range as cosine similarity (between $-1$ and $1$).
However, in practice we observe scores in a more limited range, potentially because of the learned geometry of contextual embeddings.
While this characteristic does not impact BERTScore’s capability to rank text generation systems, it makes the actual score less readable.
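We address this by rescaling BERTScore with respect to its empirical lower bound $b$ as a baseline.
We compute $b$ using Common Crawl monolingual datasets.
We then rescale BERTScore linearly; for example, the rescaled recall is $\hat{R}_{\mathrm{BERT}} = (R_{\mathrm{BERT}} - b) / (1 - b)$. After this operation $\hat{R}_{\mathrm{BERT}}$ is typically between $0$ and $1$. We apply the same rescaling procedure for $P_{\mathrm{BERT}}$ and $F_{\mathrm{BERT}}$. This method does not affect the ranking ability and human correlation of BERTScore, and is intended solely to increase the score readability.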
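The rescaling can be illustrated with a short sketch, assuming it is a linear map that sends the baseline to 0 and a perfect score to 1. The baseline value used below is made up for the example and is not a measured quantity.

```python
def rescale(score, baseline):
    """Linearly rescale a score so that `baseline` maps to 0 and 1 maps to 1."""
    return (score - baseline) / (1.0 - baseline)


baseline = 0.84                  # hypothetical empirical lower bound
raw_scores = [0.86, 0.92, 0.97]  # hypothetical raw scores for three systems
rescaled = [rescale(s, baseline) for s in raw_scores]

# The map is strictly increasing, so the system ranking is unchanged.
same_ranking = (sorted(range(3), key=lambda i: raw_scores[i])
                == sorted(range(3), key=lambda i: rescaled[i]))
```

Since the transformation is monotonic, it spreads the scores over a more readable range without changing which system is ranked higher.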
4 Experimental Setup
We evaluate our approach on machine translation and image captioning.
Contextual Embedding Models
We evaluate twelve pre-trained contextual embedding models, including variants of BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b), XLNet (Yang et al., 2019b), and XLM (Lample and Conneau, 2019).
We present the best-performing models in Section 5.
We use the 24-layer model
Our main evaluation corpus is the WMT18 metric evaluation dataset (Ma et al., 2018), which contains predictions of 149 translation systems across 14 language pairs, gold references, and two types of human judgment scores. Segment-level human judgments assign a score to each reference-candidate pair. System-level human judgments associate each system with a single score based on all pairs in the test set. WMT18 includes translations from English to Czech, German, Estonian, Finnish, Russian, and Turkish, and from the same set of languages to English. We follow the WMT18 standard practice and use absolute Pearson correlation $|\rho|$ and Kendall rank correlation $\tau$ to evaluate metric quality, and compute significance with the Williams test (Williams, 1959) for $|\rho|$ and bootstrap re-sampling for $\tau$ as suggested by Graham and Baldwin (2014). We compute system-level scores by averaging BERTScore for every reference-candidate pair. We also experiment with hybrid systems by randomly sampling one candidate sentence from one of the available systems for each reference sentence (Graham and Liu, 2016). This enables system-level experiments with a higher number of systems. Human judgments of each hybrid system are created by averaging the WMT18 segment-level human judgments for the corresponding sentences in the sampled data. We compare BERTScores to one canonical metric for each category introduced in Section 2, and include the comparison with all other participating metrics from WMT18 in Appendix F.
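The two correlation measures can be sketched as follows. This is a plain textbook version for illustration: the Kendall statistic here ignores tied pairs and is not necessarily the exact variant used in the WMT18 evaluation scripts, and the data are toy values invented for the example.

```python
import math
from itertools import combinations


def system_score(segment_scores):
    """System-level score: the mean of the segment-level metric scores."""
    return sum(segment_scores) / len(segment_scores)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def kendall_tau(xs, ys):
    """(concordant - discordant) / (concordant + discordant), ignoring ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Toy data: four systems with three segment-level metric scores each.
metric_segments = [[0.1, 0.2, 0.0], [0.5, 0.3, 0.4],
                   [0.3, 0.4, 0.35], [0.9, 0.7, 0.8]]
human_scores = [1.0, 2.0, 3.0, 4.0]
metric_scores = [system_score(s) for s in metric_segments]
abs_rho = abs(pearson(metric_scores, human_scores))
tau = kendall_tau(metric_scores, human_scores)
```

In the toy data one of the six system pairs is ordered differently by the metric and by the human scores, which is what lowers the rank correlation below 1.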
In addition to the standard evaluation, we design model selection experiments. We use 10K hybrid systems super-sampled from WMT18. We randomly select 100 out of 10K hybrid systems, and rank them using the automatic metrics. We repeat this process 100K times. We report the percentage of the metric ranking agreeing with the human ranking on the best system (Hits@1). In Tables 23-28, we include two additional measures to the model selection study: (a) the mean reciprocal rank of the top metric-rated system according to the human ranking, and (b) the difference between the human score of the top human-rated system and that of the top metric-rated system.
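The Hits@1 measure can be sketched as a small simulation: repeatedly draw a random subset of systems and check whether the metric and the human scores agree on the best one. The function name, the seed, and the toy scores are assumptions for illustration; the subset size and number of trials are parameters rather than the values used in the experiments.

```python
import random


def hits_at_1(human, metric, subset_size, trials, seed=0):
    """Fraction of random subsets where metric and human pick the same best system."""
    rng = random.Random(seed)
    indices = range(len(human))
    hits = 0
    for _ in range(trials):
        subset = rng.sample(indices, subset_size)
        best_by_metric = max(subset, key=lambda i: metric[i])
        best_by_human = max(subset, key=lambda i: human[i])
        hits += best_by_metric == best_by_human
    return hits / trials


# Toy scores for 20 hypothetical hybrid systems.
human = [0.1 * i for i in range(20)]
perfect_metric = list(human)            # ranks systems exactly like humans
inverted_metric = [-h for h in human]   # ranks systems in the opposite order

agree = hits_at_1(human, perfect_metric, subset_size=5, trials=200)
disagree = hits_at_1(human, inverted_metric, subset_size=5, trials=200)
```

A metric that orders systems exactly as humans do reaches Hits@1 of 1, while one that inverts the order never selects the human-preferred system.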
We use the human judgments of twelve submission entries from the COCO 2015 Captioning Challenge. Each participating system generates a caption for each image in the COCO validation set (Lin et al., 2014), and each image has approximately five reference captions. Following Cui et al. (2018), we compute the Pearson correlation with two system-level metrics: the percentage of captions that are evaluated as better or equal to human captions (M1) and the percentage of captions that are indistinguishable from human captions (M2). We compute BERTScore with multiple references by scoring the candidate with each available reference and returning the highest score. We compare with eight task-agnostic metrics: Bleu (Papineni et al., 2002), Meteor (Banerjee and Lavie, 2005), Rouge-L (Lin, 2004), CIDEr (Vedantam et al., 2015), BEER (Stanojević and Sima’an, 2014), EED (Stanchev et al., 2019), chrF++ (Popović, 2017), and CharacTER (Wang et al., 2016). We also compare with two task-specific metrics: Spice (Anderson et al., 2016) and Leic (Cui et al., 2018). Spice is computed using the similarity of scene graphs parsed from the reference and candidate captions. Leic is trained to predict if a caption is written by a human given the image.
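Scoring against multiple references reduces to taking a maximum over single-reference scores, as the following sketch shows. The `unigram_f1` function is only a toy stand-in for a sentence-level metric such as BERTScore, and the captions are invented for the example.

```python
def unigram_f1(reference, candidate):
    """Toy sentence-level metric: F1 over the sets of unigrams."""
    ref, cand = set(reference), set(candidate)
    overlap = len(ref & cand)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(candidate, references, score_fn):
    """Score the candidate against every reference and keep the highest score."""
    return max(score_fn(reference, candidate) for reference in references)


references = ["a man rides a horse".split(), "a person on a horse".split()]
candidate = "a person on a brown horse".split()
best = multi_reference_score(candidate, references, unigram_f1)
```

Taking the maximum means a candidate is not penalized for resembling one valid reference caption more than the others.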
Tables 1–3 show system-level correlation to human judgments, correlations on hybrid systems, and model selection performance. We observe that BERTScore is consistently a top performer. In to-English results, RUSE (Shimanaka et al., 2018) shows competitive performance. However, RUSE is a supervised method trained on WMT16 and WMT15 human judgment data. In cases where RUSE models were not made available, such as for our from-English experiments, it is not possible to use RUSE without additional data and training. Table 4 shows segment-level correlations. We see that BERTScore exhibits significantly higher performance compared to the other metrics. The large improvement over Bleu stands out, making BERTScore particularly suitable to analyze specific examples, where SentBleu is less reliable. In Appendix A, we provide qualitative examples to illustrate the segment-level performance difference between SentBleu and BERTScore. At the segment-level, BERTScore even significantly outperforms RUSE. Overall, we find that applying importance weighting using idf at times provides small benefit, but in other cases does not help. Understanding better when such importance weighting is likely to help is an important direction for future work, and likely depends on the domain of the text and the available test data. We continue without idf weighting for the rest of our experiments. While recall $R_{\mathrm{BERT}}$, precision $P_{\mathrm{BERT}}$, and F1 $F_{\mathrm{BERT}}$ alternate as the best measure in different settings, F1 $F_{\mathrm{BERT}}$ performs reliably well across all the different settings. Our overall recommendation is therefore to use F1. We present additional results using the full set of 351 systems and evaluation metrics in Tables 12–28 in the appendix, including for experiments with idf importance weighting, different contextual embedding models, and model selection.
Table 5 shows correlation results for the COCO Captioning Challenge. BERTScore outperforms all task-agnostic baselines by large margins. Image captioning presents a challenging evaluation scenario, and metrics based on strict $n$-gram matching, including Bleu and Rouge, show weak correlations with human judgments. Importance weighting with idf shows significant benefit for this task, suggesting people attribute higher importance to content words. Finally, Leic (Cui et al., 2018), a trained metric that takes images as additional inputs and is optimized specifically for the COCO data and this set of systems, outperforms all other methods.
Despite the use of a large pre-trained model, computing BERTScore is relatively fast. We are able to process 192.5 candidate-reference pairs/second using a GTX-1080Ti GPU. The complete WMT18 en-de test set, which includes 2,998 sentences, takes 15.6sec to process, compared to 5.4sec with SacreBLEU (Post, 2018), a common Bleu implementation. Given the sizes of commonly used test and validation sets, the increase in processing time is relatively marginal, and BERTScore is a good fit for use during validation (e.g., for stopping) and testing, especially when compared to the time costs of other development stages.
6 Robustness Analysis
We test the robustness of BERTScore using adversarial paraphrase classification. We use the Quora Question Pair corpus (QQP; Iyer et al., 2017) and the adversarial paraphrases from the Paraphrase Adversaries from Word Scrambling dataset (PAWS; Zhang et al., 2019). Both datasets contain pairs of sentences labeled to indicate whether they are paraphrases or not. Positive examples in QQP are real duplicate questions, while negative examples are related, but different questions. Sentence pairs in PAWS are generated through word swapping. For example, in PAWS, Flights from New York to Florida may be changed to Flights from Florida to New York and a good classifier should identify that these two sentences are not paraphrases. PAWS includes two parts: PAWS$_{\mathrm{QQP}}$, which is based on the QQP data, and PAWS$_{\mathrm{Wiki}}$. We use the PAWS$_{\mathrm{QQP}}$ development set which contains 667 sentences. For the automatic metrics, we use no paraphrase detection training data. We expect that pairs with higher scores are more likely to be paraphrases. To evaluate the automatic metrics on QQP, we use the first 5,000 sentences in the training set instead of the test set because the test labels are not available. We treat the first sentence as the reference and the second sentence as the candidate.
Table 6 reports the area under the ROC curve (AUC) for existing models and automatic metrics. We observe that supervised classifiers trained on QQP perform worse than random guess on PAWS$_{\mathrm{QQP}}$, which shows these models predict the adversarial examples are more likely to be paraphrases. When adversarial examples are provided in training, state-of-the-art models like DIIN (Gong et al., 2018) and fine-tuned BERT are able to identify the adversarial examples but their performance still decreases significantly from their performance on QQP. Most metrics have decent performance on QQP, but show a significant performance drop on PAWS$_{\mathrm{QQP}}$, almost down to chance performance. This suggests these metrics fail to distinguish the harder adversarial examples. In contrast, the performance of BERTScore drops only slightly, showing more robustness than the other metrics.
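For a metric used as an untrained paraphrase scorer, AUC can be computed directly from the scores and the binary labels. The sketch below uses the rank-based formulation, in which AUC is the probability that a randomly chosen paraphrase pair receives a higher score than a randomly chosen non-paraphrase pair, counting ties as one half. The scores and labels are toy values, and `roc_auc` is an illustrative name.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    # A positive scored above a negative counts 1; a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy metric scores for four sentence pairs; label 1 marks a paraphrase.
scores = [0.9, 0.8, 0.3, 0.2]
labels = [1, 0, 1, 0]
auc = roc_auc(scores, labels)
```

A value of 0.5 corresponds to chance, and values below 0.5 mean the scorer systematically ranks non-paraphrases above paraphrases.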
We propose BERTScore, a new metric for evaluating generated text against gold standard references. BERTScore is purposely designed to be simple, task agnostic, and easy to use. Our analysis illustrates how BERTScore resolves some of the limitations of commonly used metrics, especially on challenging adversarial examples. We conduct extensive experiments with various configuration choices for BERTScore, including the contextual embedding model used and the use of importance weighting. Overall, our extensive experiments, including the ones in the appendix, show that BERTScore achieves better correlation than common metrics, and is effective for model selection. However, there is no one configuration of BERTScore that clearly outperforms all others. While the differences between the top configurations are often small, it is important for the user to be aware of the different trade-offs, and consider the domain and languages when selecting the exact configuration to use. In general, for machine translation evaluation, we suggest using $F_{\mathrm{BERT}}$, which we find the most reliable. For evaluating text generation in English, we recommend using the 24-layer model to compute BERTScore. For non-English languages, the multilingual model is a suitable choice, although BERTScore computed with this model has less stable performance on low-resource languages. We report the optimal hyperparameter for all models we experimented with in Appendix B.
Shortly after our initial preprint publication, Zhao et al. (2019) published a concurrently developed method related to ours, but with a focus on integrating contextual word embeddings with earth mover’s distance (EMD; Rubner et al., 1998) rather than our simple matching process. They also propose various improvements compared to our use of contextualized embeddings. We study these improvements in Appendix C and show that integrating them into BERTScore makes it equivalent to or better than the EMD-based approach. Overall, though, the effect of the different improvements on BERTScore is more modest than on their method. Shortly after our initial publication, YiSi-1 was updated to use BERT embeddings, showing improved performance (Lo, 2019). This further corroborates our findings. Other recent related work includes training a model on top of BERT to maximize the correlation with human judgments (Mathur et al., 2019) and evaluating generation with a BERT model fine-tuned on paraphrasing (Yoshimura et al., 2019). More recent work shows the potential of using BERTScore for training a summarization system (Li et al., 2019) and for domain-specific evaluation using SciBERT (Beltagy et al., 2019) to evaluate abstractive text summarization (Gabriel et al., 2019).
In future work, we look forward to designing new task-specific metrics that use BERTScore as a subroutine and accommodate task-specific needs, similar to how Wieting et al. (2019) suggest using semantic similarity for machine translation training. Because BERTScore is fully differentiable, it also can be incorporated into a training procedure to compute a learning loss that reduces the mismatch between optimization and evaluation objectives.
This research is supported in part by grants from the National Science Foundation (III-1618134, III-1526012, IIS-1149882, IIS-1724282, TRIPODS-1740822, CAREER-1750499), the Office of Naval Research DOD (N00014-17-1-2175), and the Bill and Melinda Gates Foundation, SAP, Zillow, Workday, and Facebook Research. We thank Graham Neubig and David Grangier for their insightful comments. We thank the Cornell NLP community including but not limited to Claire Cardie, Tianze Shi, Alexandra Schofield, Gregory Yauney, and Rishi Bommasani. We thank Yin Cui and Guandao Yang for their help with the COCO 2015 dataset.
Appendix A Qualitative Analysis
Table 7: Similarity ranks (1 = most similar of 560 pairs) assigned by human judgment, BERTScore, and SentBleu to reference ($x$) and candidate ($\hat{x}$) pairs. In the Case column, "A > B" means that A ranks the pair as more similar than B does.
| Case | No. | Reference $x$ | Candidate $\hat{x}$ | Human | BERTScore | Bleu |
|---|---|---|---|---|---|---|
| BERTScore > Bleu | 1 | At the same time Kingfisher is closing 60 B&Q outlets across the country | At the same time, Kingfisher will close 60 B & Q stores nationwide | 38 | 125 | 530 |
| BERTScore > Bleu | 2 | Hewlett-Packard to cut up to 30,000 jobs | Hewlett-Packard will reduce jobs up to 30.000 | 119 | 39 | 441 |
| BERTScore > Bleu | 3 | According to opinion in Hungary, Serbia is “a safe third country”. | According to Hungarian view, Serbia is a “safe third country.” | 23 | 96 | 465 |
| BERTScore > Bleu | 4 | Experts believe November’s Black Friday could be holding back spending. | Experts believe that the Black Friday in November has put the brakes on spending | 73 | 147 | 492 |
| BERTScore > Bleu | 5 | And it’s from this perspective that I will watch him die. | And from this perspective, I will see him die. | 37 | 111 | 414 |
| Bleu > BERTScore | 6 | In their view the human dignity of the man had been violated. | Look at the human dignity of the man injured. | 500 | 470 | 115 |
| Bleu > BERTScore | 7 | A good prank is funny, but takes moments to reverse. | A good prank is funny, but it takes only moments before he becomes a boomerang. | 495 | 424 | 152 |
| Bleu > BERTScore | 8 | For example when he steered a shot from Ideye over the crossbar in the 56th minute. | So, for example, when he steered a shot of Ideye over the latte (56th). | 516 | 524 | 185 |
| Bleu > BERTScore | 9 | I will put the pressure on them and onus on them to make a decision. | I will exert the pressure on it and her urge to make a decision. | 507 | 471 | 220 |
| Bleu > BERTScore | 10 | Transport for London is not amused by this flyposting "vandalism." | Transport for London is the Plaka animal "vandalism" is not funny. | 527 | 527 | 246 |
| BERTScore > Human | 11 | One big obstacle to access to the jobs market is the lack of knowledge of the German language. | A major hurdle for access to the labour market are a lack of knowledge of English. | 558 | 131 | 313 |
| BERTScore > Human | 12 | On Monday night Hungary closed its 175 km long border with Serbia. | Hungary had in the night of Tuesday closed its 175 km long border with Serbia. | 413 | 135 | 55 |
| BERTScore > Human | 13 | They got nothing, but they were allowed to keep the clothes. | You got nothing, but could keep the clothes. | 428 | 174 | 318 |
| BERTScore > Human | 14 | A majority of Republicans don’t see Trump’s temperament as a problem. | A majority of Republicans see Trump’s temperament is not a problem. | 290 | 34 | 134 |
| BERTScore > Human | 15 | His car was still running in the driveway. | His car was still in the driveway. | 299 | 49 | 71 |
| Human > BERTScore | 16 | Currently the majority of staff are men. | At the moment the men predominate among the staff. | 77 | 525 | 553 |
| Human > BERTScore | 17 | There are, indeed, multiple variables at play. | In fact, several variables play a role. | 30 | 446 | 552 |
| Human > BERTScore | 18 | One was a man of about 5ft 11in tall. | One of the men was about 1,80 metres in size. | 124 | 551 | 528 |
| Human > BERTScore | 19 | All that stuff sure does take a toll. | All of this certainly exacts its toll. | 90 | 454 | 547 |
| Human > BERTScore | 20 | Wage gains have shown signs of picking up. | Increases of wages showed signs of a recovery. | 140 | 464 | 514 |
We study BERTScore and SentBleu using WMT16 German-to-English (Bojar et al., 2016). We rank all 560 candidate-reference pairs by human score, BERTScore, or SentBleu from most similar to least similar. Ideally, the ranking assigned by BERTScore and SentBleu should be similar to the ranking assigned by the human score.
Table 7 first shows examples where BERTScore and SentBleu scores disagree about the ranking for a candidate-reference pair by a large margin. We observe that BERTScore is effectively able to capture synonyms and changes in word order. For example, the reference and candidate sentences in pair 3 are almost identical except that the candidate replaces opinion in Hungary with Hungarian view and switches the order of the quotation mark (“) and a. While BERTScore ranks the pair relatively high, SentBleu judges the pair as dissimilar, because it cannot match synonyms and is sensitive to the small word order changes. Pair 2 shows a set of changes that preserve the semantic meaning: replacing to cut with will reduce and swapping the order of 30,000 and jobs. BERTScore ranks the candidate translation similar to the human judgment, whereas SentBleu ranks it much lower. We also see that SentBleu potentially over-rewards $n$-gram overlap, even when phrases are used very differently. In pair 6, both the candidate and the reference contain the human dignity of the man. Yet the two sentences convey very different meaning. BERTScore agrees with the human judgment and ranks the pair low. In contrast, SentBleu considers the pair as relatively similar because of the significant word overlap.
The bottom half of Table 7 shows examples where BERTScore and human judgments disagree about the ranking. We observe that BERTScore finds it difficult to detect factual errors. For example, BERTScore assigns high similarity to pair 11 when the translation replaces German language with English and pair 12 where the translation incorrectly outputs Tuesday when it is supposed to generate Monday. BERTScore also fails to identify that 5ft 11in is equivalent to 1.80 metres in pair 18. As a result, BERTScore assigns low similarity to the eighteenth pair in Table 7. SentBleu also suffers from these limitations.
Figure 2 visualizes the BERTScore matching of two pairs of candidate and reference sentences. The figure illustrates how $F_{\mathrm{BERT}}$ matches synonymous phrases, such as imported cars and foreign cars. We also see that $F_{\mathrm{BERT}}$ effectively matches words even given a high ordering distortion, for example the token people in the figure.
Appendix B Representation Choice
As suggested by previous works (Peters et al., 2018; Reimers and Gurevych, 2019), selecting a good layer or a good combination of layers from the BERT model is important.
In designing BERTScore, we use WMT16 segment-level human judgment data as a development set to facilitate our representation choice.
For Chinese models, we tune with the WMT17 ‘‘en-zh’’ data because the language pair ‘‘en-zh’’ is not available in WMT16.
In Figure 3, we plot the change in human correlation of BERTScore over different layers of BERT, RoBERTa, XLNet, and XLM models.
Based on results from different models, we identify a common trend: BERTScore computed with the intermediate representations tends to work better.
We tune the number of layers to use for a range of publicly available models.
|Model||Total Number of Layers||Best Layer|
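The layer tuning described above can be sketched as follows. The sketch assumes that Pearson correlation with human judgments on the development set is the selection criterion. The function names `pearson` and `best_layer` and the toy scores are hypothetical and only illustrate the procedure.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

def best_layer(scores_by_layer, human_scores):
    """Pick the layer whose metric scores correlate best with human scores."""
    correlations = {
        layer: pearson(scores, human_scores)
        for layer, scores in scores_by_layer.items()
    }
    return max(correlations, key=correlations.get), correlations

# Made-up development data: one human score per segment, and the metric
# score of each segment when computed from a given layer.
human = [0.1, 0.4, 0.5, 0.9]
scores_by_layer = {
    1: [0.5, 0.2, 0.9, 0.4],
    2: [0.2, 0.5, 0.6, 1.0],
    3: [0.3, 0.3, 0.6, 0.7],
}

layer, correlations = best_layer(scores_by_layer, human)
print(layer, {k: round(v, 3) for k, v in correlations.items()})
```

In practice the per-layer scores would come from running the metric once per layer on the development segments; the selection step itself is just an argmax over correlations.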
Appendix C Ablation Study of MoverScore
Word Mover’s Distance (WMD; Kusner et al., 2015) is a semantic similarity metric that relies on word embeddings and optimal transport. MoverScore (Zhao et al., 2019) combines contextual embeddings and WMD for text generation evaluation. In contrast, BERTScore adopts a greedy approach to aggregate token-level information. In addition to using WMD for generation evaluation, Zhao et al. (2019) also introduce various other improvements. We do a detailed ablation study to understand the benefit of each improvement, and to investigate whether it can be applied to BERTScore. We use a 12-layer uncased BERT model on the WMT17 to-English segment-level data, the same setting as Zhao et al. (2019).
We identify several differences between MoverScore and BERTScore by analyzing the released source code. We isolate each difference, and mark it with a bracketed tag for our ablation study:
[MNLI] Use a BERT model fine-tuned on MNLI (Williams et al., 2018).
[PMEANS] Apply power means (Rücklé et al., 2018) to aggregate the information of different layers.
[IDF-L] For reference sentences, instead of computing the idf scores on the 560 sentences in the segment-level data ([IDF-S]), compute the idf scores on the 3,005 sentences in the system-level data.
[SEP] For candidate sentences, recompute the idf scores on the candidate sentences. The weighting of reference tokens is kept the same as in [IDF-S].
[RM] Exclude punctuation marks and sub-word tokens except the first sub-word in each word from the matching.
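The [IDF-S] and [IDF-L] settings differ only in the corpus over which idf weights are estimated. The sketch below illustrates this with a hypothetical helper `idf_weights`. The plus-one smoothing used here is an assumption made for illustration, and the two tiny corpora stand in for the 560-sentence and 3,005-sentence sets.

```python
import math

def idf_weights(corpus):
    """Compute smoothed idf weights from a list of tokenized sentences.

    Assumed form: idf(w) = log((M + 1) / (df(w) + 1)), where M is the number
    of sentences and df(w) is the number of sentences containing w.
    """
    num_sentences = len(corpus)
    doc_freq = {}
    for sentence in corpus:
        for token in set(sentence):  # count each token once per sentence
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {
        token: math.log((num_sentences + 1) / (count + 1))
        for token, count in doc_freq.items()
    }

# Toy stand-ins for the smaller and the larger reference set.
small_corpus = [["the", "cat"], ["the", "dog"]]
large_corpus = small_corpus + [["a", "bird"], ["a", "fish"]]

idf_small = idf_weights(small_corpus)
idf_large = idf_weights(large_corpus)
print(idf_small["the"], round(idf_large["the"], 4))
```

A token that appears in every sentence of the small set receives zero weight there, but a positive weight once the estimate is taken over a larger set in which it is less common. This is one way the choice of corpus can change the resulting scores.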
We follow the setup of Zhao et al. (2019) and use their released fine-tuned BERT model to conduct the experiments.
Table 9 shows the results of our ablation study.
We report correlations for the two variants of WMD that Zhao et al. (2019) study: unigrams (WMD1) and bigrams (WMD2).
Our $F_{\text{BERT}}$ corresponds to the vanilla setting, and the importance-weighted variant corresponds to the [IDF-S] setting.
The complete MoverScore metric corresponds to [IDF-S]+[SEP]+[PMEANS]+[MNLI]+[RM].
We make several observations.
First, for all language pairs except fi-en and lv-en, we can replicate the reported performance.
For these two language pairs, Zhao et al. (2019) did not release their implementations at the time of publication.
|IDF-L + SEP||WMD1||0.651||0.660||0.819||0.703||0.714||0.724||0.715|
|IDF-L + SEP + RM||WMD1||0.651||0.686||0.803||0.681||0.730||0.730||0.720|
|IDF-L + SEP + PMEANS||WMD1||0.658||0.663||0.820||0.707||0.717||0.725||0.712|
|IDF-L + SEP + MNLI||WMD1||0.659||0.679||0.822||0.732||0.718||0.746||0.725|
|IDF-L + SEP + PMEANS + MNLI||WMD1||0.672||0.686||0.831||0.738||0.725||0.753||0.737|
|IDF-L + SEP + PMEANS + MNLI + RM||WMD1||0.670||0.708||0.821||0.717||0.738||0.762||0.744|
Appendix D Additional Experiments on Abstractive Text Compression
We use the human judgments provided with the MSR Abstractive Text Compression Dataset (Toutanova et al., 2016) to illustrate the applicability of BERTScore to abstractive text compression evaluation. The data includes three types of human scores: (a) meaning: how well a compressed text preserves the meaning of the original text; (b) grammar: how grammatically correct a compressed text is; and (c) combined: the average of the meaning and the grammar scores. We follow the experimental setup of Toutanova et al. (2016) and report Pearson correlation between BERTScore and the three types of human scores. Table 10 shows that $R_{\text{BERT}}$ has the highest correlation with human meaning judgments, and $P_{\text{BERT}}$ correlates highly with human grammar judgments. $F_{\text{BERT}}$ provides a balance between the two aspects.
|Best metrics according to Toutanova et al. (2016)||SKIP-2+Recall+MULT-PROB||0.59||N/A||0.51|
Appendix E BERTScore of Recent MT Models
|WMT14 En-De||ConvS2S (Auli et al., 2017)||0.266||0.6099||0.6055||0.6075||0.8499||0.8482||0.8488|
|WMT14 En-De||Transformer-big (Ott et al., 2018)||0.298||0.6587||0.6528||0.6558||0.8687||0.8664||0.8674|
|WMT14 En-De||DynamicConv (Wu et al., 2019)||0.297||0.6526||0.6464||0.6495||0.8664||0.8640||0.8650|
|WMT14 En-Fr||ConvS2S (Auli et al., 2017)||0.408||0.6998||0.6821||0.6908||0.8876||0.8810||0.8841|
|WMT14 En-Fr||Transformer-big (Ott et al., 2018)||0.432||0.7148||0.6978||0.7061||0.8932||0.8869||0.8899|
|WMT14 En-Fr||DynamicConv (Wu et al., 2019)||0.432||0.7156||0.6989||0.7071||0.8936||0.8873||0.8902|
|IWSLT14 De-En||Transformer-iwslt (Ott et al., 2019)||0.350||0.6749||0.6590||0.6672||0.9452||0.9425||0.9438|
|IWSLT14 De-En||LightConv (Wu et al., 2019)||0.348||0.6737||0.6542||0.6642||0.9450||0.9417||0.9433|
|IWSLT14 De-En||DynamicConv (Wu et al., 2019)||0.352||0.6770||0.6586||0.6681||0.9456||0.9425||0.9440|
Table 11 shows the Bleu scores and the BERTScores of pre-trained machine translation models on the WMT14 English-to-German, WMT14 English-to-French, and IWSLT14 German-to-English tasks. We used publicly available pre-trained models from fairseq (Ott et al., 2019).
Appendix F Additional Results
In this section, we present additional experimental results:
Segment-level and system-level correlation studies on three years of WMT metric evaluation tasks (WMT16--18)
Model selection study on WMT18 10K hybrid systems
System-level correlation study on the 2015 COCO captioning challenge
Robustness study on PAWS-QQP
Following BERT (Devlin et al., 2019), a variety of Transformer-based (Vaswani et al., 2017) pre-trained contextual embeddings have been proposed and released. We conduct additional experiments with four types of pre-trained embeddings: BERT, XLM (Lample and Conneau, 2019), XLNet (Yang et al., 2019b), and RoBERTa (Liu et al., 2019b). XLM (Cross-lingual Language Model) is a Transformer pre-trained on multilingual data with two objectives: masked language modeling, and translation language modeling, which predicts masked tokens from a pair of sentences in two different languages. Yang et al. (2019b) modify the Transformer architecture and pre-train it on a permutation language modeling task, resulting in some improvement over the original BERT when fine-tuned on several downstream tasks. Liu et al. (2019b) introduce RoBERTa (Robustly optimized BERT approach) and demonstrate that an optimized BERT model is comparable to, and sometimes outperforms, XLNet on downstream tasks.
We perform a comprehensive study with the following pre-trained contextual embedding models:
BERT models: bert-base-uncased, bert-large-uncased, bert-base-chinese, bert-base-multilingual-cased, and bert-base-cased-mrpc
RoBERTa models: roberta-base, roberta-large, and roberta-large-mnli
XLNet models: xlnet-base-cased and xlnet-large-cased
XLM models: xlm-mlm-en-2048 and xlm-mlm-100-1280
F.1 WMT Correlation Study
Because of missing data in the released WMT16 dataset (Bojar et al., 2016), we are only able to experiment with to-English segment-level data, which contains the outputs of 50 different systems on 6 language pairs. We use this data as the validation set for hyperparameter tuning (Appendix B). Table 12 shows the Pearson correlations of all participating metrics and BERTScores computed with different pre-trained models. Significance testing for this dataset does not include the baseline metrics because the released dataset does not contain the original outputs from the baseline metrics. We conduct significance testing between BERTScore results only.
The WMT17 dataset (Bojar et al., 2017) contains outputs of 152 different translations on 14 language pairs. We experiment on the segment-level and system-level data on both to-English and from-English language pairs. We exclude fi-en data from the segment-level experiment due to an error in the released data. We compare our results to all participating metrics and perform standard significance testing as done by Bojar et al. (2017). Tables 13--16 show the results.
Tables 12--22 collectively showcase the effectiveness of BERTScore in correlating with human judgments. The improvement of BERTScore is more pronounced at the segment level than at the system level. We also see that more optimized or larger BERT models can produce better contextual representations (e.g., comparing $F_{\text{RoBERTa-Large}}$ and $F_{\text{BERT-Large}}$). In contrast, the smaller XLNet performs better than the large one. Based on the evidence in Figure 8 and Tables 12--22, we hypothesize that the permutation language modeling task, though leading to a good set of model weights for fine-tuning on downstream tasks, does not necessarily produce informative pre-trained embeddings for generation evaluation. We also observe that fine-tuning pre-trained models on a related task, such as natural language inference (Williams et al., 2018), can lead to better human correlation in evaluating text generation. Therefore, for evaluating English sentences, we recommend computing BERTScore with a 24-layer RoBERTa model fine-tuned on the MNLI dataset. For evaluating non-English sentences, both the multilingual BERT model and the XLM model trained on 100 languages are suitable candidates. We also recommend using domain- or language-specific contextual embeddings when possible, such as using BERT Chinese models for evaluating Chinese tasks. In general, we advise users to consider the target domain and languages when selecting the exact configuration to use.
F.2 Model Selection Study
Similar to Section 4, we use the 10K hybrid systems super-sampled from WMT18. We randomly select 100 out of 10K hybrid systems, rank them using automatic metrics, and repeat this process 100K times. We extend the results in the main paper (Table 3) with the performance of all participating metrics in WMT18 and with results from using different contextual embedding models for BERTScore. We reuse the hybrid configuration and metric outputs released in WMT18. In addition to the Hits@1 measure, we evaluate the metrics using (a) mean reciprocal rank (MRR) of the top metric-rated system in human rankings, and (b) the absolute human score difference (Diff) between the top metric- and human-rated systems. Hits@1 captures a metric’s ability to select the best system. The other two measures quantify the amount of error a metric makes in the selection process. Tables 23--28 show the results from these experiments.
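The three model selection measures can be sketched as a small simulation. This is an illustrative reading of the protocol above, not the evaluation code used for the reported tables. The function name `selection_measures`, the fixed random seed, and the toy scores are assumptions.

```python
import random

def selection_measures(metric_scores, human_scores, sample_size, trials, seed=0):
    """Estimate Hits@1, MRR, and Diff by repeated random sampling of systems."""
    rng = random.Random(seed)
    num_systems = len(metric_scores)
    hits = 0.0
    reciprocal_rank_total = 0.0
    diff_total = 0.0
    for _ in range(trials):
        sample = rng.sample(range(num_systems), sample_size)
        # System that the automatic metric would select from this sample.
        top_metric = max(sample, key=lambda i: metric_scores[i])
        # Human ranking of the same sample, best system first.
        by_human = sorted(sample, key=lambda i: human_scores[i], reverse=True)
        rank = by_human.index(top_metric) + 1
        hits += 1.0 if rank == 1 else 0.0
        reciprocal_rank_total += 1.0 / rank
        diff_total += abs(human_scores[by_human[0]] - human_scores[top_metric])
    return hits / trials, reciprocal_rank_total / trials, diff_total / trials

# Toy data: 20 systems with distinct human scores.
human = [0.1 * i for i in range(20)]
perfect_metric = list(human)          # agrees with humans everywhere
inverted_metric = [-h for h in human]  # always prefers the worst system

print(selection_measures(perfect_metric, human, sample_size=5, trials=100))
print(selection_measures(inverted_metric, human, sample_size=5, trials=100))
```

A metric that agrees with the human ranking reaches Hits@1 and MRR of 1 with zero Diff, while a metric that always selects the worst sampled system has Hits@1 of 0 and an MRR equal to the reciprocal of the sample size.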
The additional results further support our conclusion from Table 3: BERTScore demonstrates better model selection performance. We also observe that the supervised metric RUSE displays strong model selection ability.
F.3 Image Captioning on COCO
We follow the experimental setup described in Section 4. Table 29 shows the correlations of several pre-trained contextual embeddings. We observe that precision-based methods such as Bleu and $P_{\text{BERT}}$ are weakly correlated with human judgments on image captioning tasks. We hypothesize that this is because human judges prefer captions that capture the main objects in a picture. In general, $R_{\text{BERT}}$ has a high correlation, even surpassing the task-specific metric Spice (Anderson et al., 2016). While the fine-tuned RoBERTa-Large model does not result in the highest correlation, it is one of the best metrics.
F.4 Robustness Analysis on PAWS-QQP
We present the full results of the robustness study described in Section 6 in Table 30. In general, we observe that BERTScore is more robust than other commonly used metrics. BERTScore computed with the 24-layer RoBERTa model performs the best. Fine-tuning RoBERTa-Large on MNLI (Williams et al., 2018) can significantly improve the robustness against adversarial sentences. However, BERT fine-tuned on MRPC (Microsoft Research Paraphrase Corpus; Dolan and Brockett, 2005) performs worse than its counterpart.