BERTScore: Evaluating Text Generation with BERT

Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi
Department of Computer Science and Cornell Tech, Cornell University; ASAPP Inc.
{vk352, fw245, kilian}     {yoav}
* Equal contribution.

We propose BERTScore, an automatic evaluation metric for text generation. Analogous to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference. However, instead of looking for exact matches, we compute similarity using contextualized BERT embeddings. We evaluate on several machine translation and image captioning benchmarks, and show that BERTScore correlates better with human judgments than existing metrics, often significantly outperforming even task-specific supervised metrics.

1 Introduction

Automatic evaluation of natural language generation, for example in machine translation and caption generation, requires comparing candidate sentences to annotated references. The goal is to evaluate the semantic equivalence of the candidates and references. However, common methods rely on surface-form similarity only. For example, Bleu (bleu), the most common machine translation metric, simply counts n-gram overlap between the candidate and the annotated reference. While this provides a simple and general measure, it fails to capture much of the lexical and compositional diversity of natural language.

In this paper, we focus on sentence-level generation evaluation, and introduce BERTScore, an evaluation metric based on pre-trained BERT contextual embeddings (bert). BERTScore computes the similarity between two sentences as a weighted aggregation of cosine similarities between their tokens.

BERTScore addresses three common pitfalls in n-gram based methods (meteor). First, n-gram based methods use exact string matching (e.g. in Bleu) or define a cascade of matching heuristics (e.g. in Meteor (meteor)), and fail to robustly match paraphrases. For example, given the reference people like foreign cars, metrics like Bleu or Meteor incorrectly give a higher score to people like visiting places abroad than to consumers prefer imported cars, because they fail to identify paraphrased words. This leads to performance underestimation when semantically-correct phrases are penalized for differing from the surface form of the reference sentence. In contrast to string matching, we compute cosine similarity using contextualized token embeddings, which have been shown to be effective for paraphrase detection (bert). A second problem is the lack of distinction between tokens that are important or unimportant to the sentence meaning. For example, given the reference a child is playing, both the child is playing and a child is singing will get the same Bleu score. This often leads to performance overestimation, especially for models with strong language models that correctly generate function words. Instead of treating all tokens equally, we introduce a simple importance weighting scheme to emphasize words of higher significance to sentence meaning. Finally, n-gram models fail to capture distant dependencies and penalize semantically-critical ordering changes (Isozaki10:autoeval). For example, given a small window, Bleu will only mildly penalize swapping of cause and effect (e.g. A because B instead of B because A), especially when the arguments A and B are long phrases. In contrast, contextualized embeddings are trained to effectively capture distant dependencies and ordering in all the involved token embeddings.

We experiment with BERTScore on machine translation and image captioning tasks using multiple systems by correlating BERTScore and related metrics to available human judgments. Our experiments demonstrate that BERTScore correlates highly with human evaluations of the quality of machine translation and image captioning systems. In machine translation, BERTScore correlates better with segment-level human judgment than existing metrics on the common WMT17 benchmark (wmt17em), including outperforming metrics learned specifically for this dataset. We also show that BERTScore is well correlated with human annotators for image captioning, surpassing Spice, a popular task-specific metric (spice), on the twelve 2015 COCO Captioning Challenge participating systems (coco). Finally, we test the robustness of BERTScore on the adversarial paraphrase dataset PAWS (paws), and show that it is more robust to adversarial examples than other metrics. BERTScore is available at

2 Existing Text Generation Evaluation Metrics

Natural language text generation is commonly evaluated against annotated reference sentences. Given a reference sentence x and a candidate sentence x̂, a generation evaluation metric is a function that maps x and x̂ to a real number. The goal is to assign relatively higher scores to sentences preferred by human judges. Existing metrics can be broadly categorized into n-gram matching metrics, embedding-based metrics, and learned metrics.

2.1 n-gram Matching Approaches

The most commonly used metrics for text generation count the number of n-grams (sequences of n consecutive tokens) that occur in both the reference x and the candidate x̂. In general, the higher the n-gram order, the better the metric captures word order, but the more restrictive and constrained to the exact surface form of the reference it becomes.

Formally, let S_x^n and S_x̂^n be the lists of token n-grams in the reference and candidate sentences. The number of n-gram matches is

    Match_n = Σ_{w ∈ S_x̂^n} 𝕀[w ∈ S_x^n],

where 𝕀[·] is an indicator function. The precision and recall are

    Exact-P_n = Match_n / |S_x̂^n|   and   Exact-R_n = Match_n / |S_x^n|.

Several popular metrics build upon one or both of these exact matching scores.
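These definitions can be made concrete in a few lines. This is a minimal sketch, assuming whitespace tokenization and non-empty inputs, with no clipping or smoothing:

```python
def ngrams(tokens, n):
    """All n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def exact_pr(reference, candidate, n):
    """Exact n-gram match precision and recall.

    Match_n counts candidate n-grams that also occur in the reference;
    precision divides by the candidate n-gram count, recall by the
    reference n-gram count."""
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    matches = sum(1 for w in cand if w in ref)
    return matches / len(cand), matches / len(ref)
```

For example, with reference people like foreign cars and candidate people like cars, unigram precision is 1.0 while unigram recall is 0.75, since foreign is never matched.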


The most widely used metric in machine translation is Bleu (bleu), which includes three modifications to these exact-match scores. First, each n-gram in the reference can be matched at most once. For example, if the reference x is the sooner the better and the candidate x̂ is the the the, only two words in x̂ are matched for n = 1 instead of all three, because the occurs only twice in the reference. Second, Bleu is designed as a corpus-level metric, where a set of reference-candidate pairs is evaluated as a group. The number of exact matches is accumulated for all pairs and divided by the total number of n-grams in all candidate sentences. Finally, Bleu introduces a brevity penalty to penalize when this total number of n-grams across all candidate sentences is low. Typically, Bleu is computed for several values of n (e.g. n = 1, 2, 3, 4) that are averaged geometrically. A smoothed variant, SentBleu (moses), is computed at the sentence level. In contrast to Bleu, BERTScore is not restricted to a maximum n-gram length, but instead relies on contextualized embeddings that are able to capture dependencies of unbounded length.
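The first modification (clipped matching) can be illustrated on the example above. `clipped_matches` is a hypothetical helper that shows only the clipping step, not full Bleu (no brevity penalty or geometric averaging):

```python
from collections import Counter

def clipped_matches(reference, candidate, n=1):
    """Bleu-style clipped n-gram matching: each reference n-gram can be
    credited at most as many times as it occurs in the reference."""
    ref_toks, cand_toks = reference.split(), candidate.split()
    ref = Counter(tuple(ref_toks[i:i + n])
                  for i in range(len(ref_toks) - n + 1))
    cand = Counter(tuple(cand_toks[i:i + n])
                   for i in range(len(cand_toks) - n + 1))
    # min(count, ref[g]) caps the credit for each candidate n-gram
    return sum(min(count, ref[g]) for g, count in cand.items())

# The reference "the sooner the better" contains "the" only twice, so the
# candidate "the the the" gets credit for two of its three unigrams.
```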


Meteor (meteor) computes Exact-P_1 and Exact-R_1 while allowing backing-off from exact unigram matching to matching word stems, synonyms, and paraphrases. For example, running may match run if no exact match is possible. These non-exact matches rely on an external stemmer, a synonym lexicon, and a paraphrase table. The computation uses beam search to minimize the number of matched chunks (runs of consecutive unigram matches). Meteor 1.5 (meteor1.5) distinguishes between content and function words and weighs them differently according to their importance. It also applies importance weighting to the different matching types: exact unigrams, stems, synonyms, and paraphrases. These parameters are tuned to maximize correlation with human judgments. Because Meteor requires external resources, only five languages are supported with the full feature set, including either synonym or paraphrase matching, and eleven are partially supported. Similar to Meteor, BERTScore is designed to allow relaxed matches. But instead of relying on external resources, BERTScore takes advantage of BERT embeddings that are trained on large amounts of raw text and can easily be created for new languages. BERTScore also incorporates importance weighting, which is estimated from simple corpus statistics. In contrast to Meteor 1.5, BERTScore does not require any tuning to maximize correlation with human judgments.

Other Related Metrics

NIST (nist) is a revised version of Bleu that weighs each n-gram differently and also introduces an alternative brevity penalty. chrF (chrF) compares character n-grams in the reference and candidate sentences. chrF++ (chrF++) extends chrF to include word bigram matching. PER (per), WER (wer), CDER (cder), TER (ter), and TERp (TER-Plus) are metrics based on edit distance. CIDEr (cider) is an image captioning metric that computes cosine similarity between tf–idf weighted n-grams. Finally, Rouge (rouge) is a commonly used metric for summarization evaluation. Rouge-n (rouge) computes Exact-R_n (usually n = 1, 2), while Rouge-L is a variant of Exact-R_1 with the numerator replaced by the length of the longest common subsequence.

Figure 1: Illustration of the computation of the recall metric R_BERT. Given the reference x and candidate x̂, we compute BERT embeddings and pairwise cosine similarity. We weigh the similarity measures using idf weights.

2.2 Embedding-based Metrics

Word embeddings (word2vec; glove; fasttext; dai2017mixture; athiwaratkun2018probabilistic) are dense representations of tokens that are learned by optimizing an objective that follows the distributional hypothesis, where similar words are encouraged to be closer to one another in the learned space. This property has been studied for generation evaluation. MEANT 2.0 (meant2) uses pre-trained word embeddings to compute lexical similarity and exploits shallow semantic parses to evaluate structural similarity. task-dialog-eval explore using average-pooling and max-pooling on word embeddings to construct sentence-level representations, which are used to compute cosine similarity between the reference and candidate sentences. rus2012comparison also study greedy word embedding matching. In contrast to these methods, we use contextualized embeddings, which capture the specific use of a token in the sentence and, potentially, sequence information. We do not use external tools to generate linguistic structures, and therefore our approach is relatively easy to apply to new languages. Our token-level computation allows us to visualize the matching and to weigh tokens differently according to their importance.

2.3 Learning-Based Metrics

Learning-based metrics are usually trained to optimize correlation with human judgments. BEER (beer) uses a regression model based on character n-grams and word bigrams. BLEND (blend) employs SVM regression to combine 29 existing metrics for English. RUSE (ruse) uses a multi-layer perceptron regressor on three pre-trained sentence embedding models. All these methods require human judgments as supervision, which must be collected for each dataset and are costly to obtain. These models also face the risk of poor generalization to new domains, even within a known language and task. Instead of regressing on human judgment scores, leic train a neural model that takes an image and a caption as inputs and predicts whether the caption is human-generated. One potential risk with this approach is that it is optimized towards existing models and may generalize poorly to new ones. In contrast, the parameters of the BERT model underlying BERTScore are not optimized for any specific task. We also do not require access to images, and provide an approach that applies to both text-only and multi-modal tasks.

3 BERTScore

Given a reference sentence x and a candidate sentence x̂, we use contextual embeddings to represent the tokens, and compute a weighted matching using cosine similarity and inverse document frequency (idf) scores.

Token Representation

We use BERT contextual embeddings to represent the tokens in the input sentences x and x̂. In contrast to word embeddings (word2vec; glove), contextual embeddings (elmo; bert) can generate different vector representations for the same word in different sentences, depending on the surrounding context. BERT uses a Transformer encoder (transformer) trained on masked language modeling and next-sentence prediction tasks. Pre-trained BERT embeddings were shown to benefit various NLP tasks, including natural language inference, sentiment analysis, paraphrase detection, question answering, named entity recognition (bert), text summarization (bert-summarization), contextual emotion detection (bert-emotion), citation recommendation (bert-citation), and document retrieval (bert-retrieval). The BERT model tokenizes the input text into a sequence of word pieces (google16), where unknown words are split into several commonly observed sequences of characters. The contextualized embedding for each word piece is computed by repeatedly applying self-attention and nonlinear transformations in an alternating fashion. This process generates multiple layers of embedded representations. Following initial experiments, we use the ninth layer of the model. This is consistent with recent findings showing that the intermediate BERT layers may yield more semantically meaningful contextual embeddings than the final layer (alternate; linguistic-context). Appendix A studies the effect of the layer choice. Given a reference x tokenized into word pieces ⟨x₁, …, x_k⟩, BERT generates one contextual embedding vector per word piece; the tokenized candidate x̂ = ⟨x̂₁, …, x̂_l⟩ is embedded in the same way.

Similarity Measure

The vector representations enable a soft measure of similarity instead of exact string matching (bleu) or heuristic matching (meteor). We measure the quality of matching a reference word piece x_i and a candidate word piece x̂_j using the cosine similarity of their embedding vectors:

    sim(x_i, x̂_j) = x_iᵀ x̂_j / (‖x_i‖ ‖x̂_j‖).

We use pre-normalized vectors, which reduces this calculation to the inner product x_iᵀ x̂_j.

While this measures the similarity of word pieces in isolation from the rest of the two sentences, it inherits the dependence on context from the BERT model.
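The pre-normalization shortcut can be checked numerically; random vectors here stand in for BERT embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
x_i = rng.normal(size=768)  # stand-in for a reference word-piece embedding
x_j = rng.normal(size=768)  # stand-in for a candidate word-piece embedding

# Cosine similarity computed directly...
cos = x_i @ x_j / (np.linalg.norm(x_i) * np.linalg.norm(x_j))

# ...equals a plain inner product once the vectors are pre-normalized.
u = x_i / np.linalg.norm(x_i)
v = x_j / np.linalg.norm(x_j)
assert abs(cos - u @ v) < 1e-12
```

Pre-normalizing once per sentence turns the full pairwise similarity computation into a single matrix multiplication.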

Importance Weighting

Previous work on such similarity measures demonstrated that rare words can be more indicative of sentence similarity than common words (meteor; cider). We incorporate importance weighting using inverse document frequency (idf) scores computed from the reference sentences in the test corpus. Given M reference sentences {x⁽¹⁾, …, x⁽ᴹ⁾}, the idf score of a word piece w is

    idf(w) = −log (1/M) Σᵢ₌₁ᴹ 𝕀[w ∈ x⁽ⁱ⁾].

We do not use the full tf-idf measure because we process single sentences, where the term frequency (tf) is likely to be 1. Because we use the reference sentences to compute idf, the idf scores remain the same for all systems evaluated on a specific test set. For word pieces that do not appear in the reference sentences, we apply plus-one smoothing.
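A minimal sketch of the idf computation, assuming whitespace tokenization in place of BERT word pieces; the smoothing form below is one reading of plus-one smoothing, and the released implementation may differ in detail:

```python
import math
from collections import Counter

def idf_weights(references):
    """idf(w) = -log((df(w) + 1) / (M + 1)) over M reference sentences,
    where df(w) counts the sentences containing w. The +1 terms are the
    plus-one smoothing, so unseen word pieces also get a finite weight."""
    m = len(references)
    df = Counter()
    for ref in references:
        df.update(set(ref.split()))  # each sentence counts a word once
    return {w: -math.log((df[w] + 1) / (m + 1)) for w in df}
```

Note that a word piece occurring in every reference sentence (typically a function word) receives weight 0 under this smoothing, while rarer words receive strictly positive weights.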


BERTScore

The complete score matches each word piece in x to a word piece in x̂ to compute recall, and each word piece in x̂ to a word piece in x to compute precision. The two combine into an F1 measure. We use greedy matching to maximize the matching similarity score: each word piece is matched to its most similar word piece in the other sentence. In contrast to Bleu (Section 2), we identify matches using idf-weighted cosine similarity, which allows for approximate matching. For a reference x and candidate x̂, the recall, precision, and F1 scores are:

    R_BERT = Σ_{x_i ∈ x} idf(x_i) max_{x̂_j ∈ x̂} x_iᵀ x̂_j / Σ_{x_i ∈ x} idf(x_i)
    P_BERT = Σ_{x̂_j ∈ x̂} idf(x̂_j) max_{x_i ∈ x} x_iᵀ x̂_j / Σ_{x̂_j ∈ x̂} idf(x̂_j)
    F_BERT = 2 · P_BERT · R_BERT / (P_BERT + R_BERT)
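The greedy matching and weighted aggregation can be sketched over pre-computed embeddings. This is a minimal sketch: the toy unit vectors in the usage below stand in for BERT outputs, the idf arguments are optional, and applying idf in both directions is one reading of the formulas:

```python
import numpy as np

def bert_score(ref_emb, cand_emb, ref_idf=None, cand_idf=None):
    """Greedy-matching BERTScore sketch over pre-computed embeddings.

    ref_emb: (k, d) reference word-piece embeddings.
    cand_emb: (l, d) candidate word-piece embeddings.
    ref_idf / cand_idf: optional idf weights (uniform if None)."""
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sim = r @ c.T  # (k, l) pairwise cosine similarities
    ref_idf = np.ones(len(r)) if ref_idf is None else np.asarray(ref_idf)
    cand_idf = np.ones(len(c)) if cand_idf is None else np.asarray(cand_idf)
    # Greedy matching: each piece takes its best match in the other sentence.
    recall = (ref_idf * sim.max(axis=1)).sum() / ref_idf.sum()
    precision = (cand_idf * sim.max(axis=0)).sum() / cand_idf.sum()
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With identical reference and candidate embeddings, all three scores are 1; dropping a reference piece from the candidate lowers recall while leaving precision at 1.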

4 Experimental Setup

We evaluate our approach on machine translation and image captioning. We focus on correlations with human judgments.

BERT Models

We use the uncased English BERT model for English tasks, the Chinese BERT model for Chinese tasks, and the cased multilingual BERT model for all other languages. Appendix A shows the effect of the choice of BERT model.

Machine Translation

We use the WMT17 metric evaluation dataset (wmt17em), which contains translation system outputs, gold reference translations, and two types of human judgment scores. Segment-level human judgments assign a score to each pair of output and reference. System-level human judgments associate each system with a single score based on all output-reference pairs in the test set. Metric quality is evaluated with absolute Pearson correlation with human judgments. We compute system-level scores by averaging the BERTScores for all system outputs. WMT17 includes translations from English to Czech, German, Finnish, Latvian, Russian, and Turkish, and from the same set of languages to English. We compare the performance of several popular metrics: Bleu (bleu), CDER (cder), and TER (ter). We also compare our correlations with state-of-the-art metrics, including METEOR++ (meteor++), chrF++ (chrF++), BEER (beer), BLEND (blend), and RUSE (ruse).
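Metric quality in this setup reduces to an absolute Pearson correlation between metric scores and human scores; a minimal sketch, with numpy used only for the arithmetic:

```python
import numpy as np

def metric_quality(metric_scores, human_scores):
    """Absolute Pearson correlation between per-system (or per-segment)
    metric scores and the corresponding human judgment scores."""
    m = np.asarray(metric_scores, dtype=float)
    h = np.asarray(human_scores, dtype=float)
    m = m - m.mean()  # center both series
    h = h - h.mean()
    return abs((m * h).sum() / np.sqrt((m * m).sum() * (h * h).sum()))
```

Taking the absolute value follows the benchmark convention: a metric that is perfectly anti-correlated with human scores still ranks systems perfectly, just in reverse.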

Image Captioning

We use the human judgments of twelve submission entries from the COCO 2015 Captioning Challenge. Each participating system generates a caption for each image in the COCO validation set (coco), and each image has approximately five reference captions. Following leic, we compute the Pearson correlation with two system-level metrics: M1, the percentage of captions that are evaluated as better than or equal to human captions, and M2, the percentage of captions that are indistinguishable from human captions. We compute BERTScore with multiple references by scoring the candidate with each available reference and returning the highest score. We compare BERTScore to four task-agnostic metrics: BLEU (bleu), METEOR (meteor), ROUGE-L (rouge), and CIDEr (cider). We also compare with two task-specific metrics: SPICE (spice) and LEIC (leic). SPICE is computed using the similarity of the scene graphs parsed from the reference and candidate captions. LEIC uses a critique network that takes an image and a caption and outputs a proxy score, predicting whether the caption was written by a human.
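The multiple-reference handling described above (score against each reference, keep the highest) is a thin wrapper around any single-reference metric; a minimal sketch, where the `overlap` metric in the usage below is a toy stand-in:

```python
def multi_ref_score(candidate, references, score_fn):
    """Score a candidate against several references and keep the best.

    score_fn is any single-reference metric (e.g. a BERTScore F1 function);
    COCO provides roughly five reference captions per image."""
    return max(score_fn(ref, candidate) for ref in references)

# Toy single-reference metric: fraction of reference words covered.
def overlap(ref, cand):
    ref_words = set(ref.split())
    return len(ref_words & set(cand.split())) / len(ref_words)
```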

5 Results

Machine Translation

Setting Metrics cs-en de-en fi-en lv-en ru-en tr-en zh-en Average
Unsupervised SentBleu 0.435 0.432 0.571 0.393 0.484 0.538 0.512 0.481
chrF++ 0.523 0.534 0.678 0.52 0.588 0.614 0.593 0.579
METEOR++ 0.552 0.538 0.720 0.563 0.627 0.626 0.646 0.610
Supervised BLEND 0.594 0.571 0.733 0.594 0.622 0.671 0.66 0.635
RUSE 0.624 0.644 0.750 0.697 0.673 0.716 0.691 0.685
Pre-trained P_BERT 0.644 0.675 0.816 0.708 0.724 0.698 0.664 0.704
R_BERT 0.664 0.667 0.788 0.681 0.704 0.702 0.706 0.702
F_BERT 0.670 0.686 0.820 0.710 0.729 0.714 0.704 0.719
F_BERT (no idf) 0.680 0.672 0.815 0.701 0.708 0.735 0.702 0.716
Table 1: Absolute Pearson correlations with segment-level human judgments on WMT17 to-English translations.
Setting Metric cs-en de-en fi-en lv-en ru-en tr-en zh-en Average
Unsupervised Bleu 0.971 0.923 0.903 0.979 0.912 0.976 0.864 0.933
CDER 0.989 0.930 0.927 0.985 0.922 0.973 0.904 0.947
chrF++ 0.940 0.965 0.927 0.973 0.945 0.960 0.880 0.941
MEANT 2.0 0.926 0.950 0.941 0.970 0.962 0.932 0.838 0.931
TER 0.989 0.906 0.952 0.971 0.912 0.954 0.847 0.933
Supervised BEER 0.972 0.960 0.955 0.979 0.936 0.972 0.902 0.954
BLEND 0.968 0.976 0.958 0.979 0.964 0.984 0.894 0.960
RUSE 0.996 0.964 0.983 0.988 0.951 0.993 0.930 0.972
Pre-trained P_BERT 0.981 0.941 0.998 0.993 0.939 0.984 0.888 0.961
R_BERT 0.998 0.979 0.960 0.964 0.975 0.981 0.948 0.972
F_BERT 0.990 0.966 0.993 0.989 0.959 0.995 0.951 0.978
F_BERT (no idf) 0.987 0.955 0.982 0.993 0.934 0.994 0.947 0.970
Table 2: Absolute Pearson correlations with system-level human judgments on WMT17 to-English translations.
Setting Metric en-cs en-de en-lv en-ru en-tr en-zh Average
Unsupervised Bleu 0.956 0.804 0.920 0.866 0.898 0.924 0.895
CDER 0.968 0.813 0.930 0.924 0.957 0.983 0.929
chrF++ 0.974 0.852 0.956 0.945 0.986 0.976 0.948
MEANT 2.0 0.976 0.770 0.959 0.957 0.991 0.943 0.933
TER 0.955 0.796 0.909 0.933 0.967 0.970 0.922
Supervised BEER 0.970 0.842 0.930 0.944 0.980 0.914 0.930
Pre-trained P_BERT 0.950 0.760 0.949 0.945 0.992 0.974 0.928
R_BERT 0.983 0.894 0.952 0.981 0.994 0.989 0.966
F_BERT 0.974 0.832 0.952 0.967 0.994 0.991 0.952
F_BERT (no idf) 0.968 0.838 0.958 0.965 0.992 0.984 0.951
Table 3: Absolute Pearson correlations with system-level human judgments on WMT17 from-English translations.

Tables 1 and 2 show segment-level and system-level correlations on to-English translations, and Table 3 shows system-level correlations on from-English translations. Across most language pairs, BERTScore shows the highest correlation with human judgments at both the segment level and the system level. While the recall and precision measures alternate as the best measure across languages, their combination into the F1 measure performs reliably across the different settings. BERTScore shows better correlation than RUSE, a supervised metric trained on WMT16 and WMT15 human judgment data. We also observe that idf weighting generally leads to better correlation.

Image Captioning

Table 4 shows correlation results for the COCO Captioning Challenge. BERTScore outperforms all task-agnostic baselines by large margins. Image captioning presents a challenging evaluation scenario, and metrics based on strict n-gram matching, including Bleu and Rouge, have weak correlations with human judgments. idf importance weighting shows significant benefits for this task, which suggests that people attribute higher importance to content words. Finally, LEIC (leic) remains highly competitive and outperforms all other methods. LEIC stands out from the other metrics in two ways. First, it is trained on the COCO data and is optimized for the task of distinguishing between human and generated captions. Second, it has access to the images, while all other methods observe the text only.

Setting Metric M1 M2
Task-agnostic Bleu-1 0.124 0.135
Bleu-2 0.037 0.048
Bleu-3 0.004 0.016
Bleu-4 -0.019 -0.005
Meteor 0.606 0.594
Rouge-L 0.090 0.096
CIDEr 0.438 0.440
P_BERT 0.159 0.200
R_BERT 0.809 0.749
F_BERT 0.518 0.513
R_BERT (no idf) 0.689 0.621
Task-specific Spice 0.759 0.750
LEIC 0.939 0.949
Table 4: Absolute Pearson correlation of system-level metric scores with human judgments on the 2015 COCO Captioning Challenge. Besides BERTScore, all correlations are cited from leic. M1 is the percentage of captions that are evaluated as better than or equal to human captions, and M2 is the percentage of captions that are indistinguishable from human captions.

6 Qualitative Analysis

Examples where F_BERT ranks the pair as more similar than SentBleu does:
1. x: According to opinion in Hungary, Serbia is “a safe third country”. (Human: 23)
x̂: According to Hungarian view, Serbia is a “safe third country.” (F_BERT: 100, SentBleu: 465)
2. x: At same time Kingfisher is closing 60 B&Q outlets across the country (Human: 38)
x̂: At the same time, Kingfisher will close 60 B & Q stores nationwide (F_BERT: 201, SentBleu: 530)
3. x: Construction took six months. (Human: 243)
x̂: Has taken six months of construction. (F_BERT: 230, SentBleu: 549)
4. x: Authorities are quickly repairing the fence. (Human: 205)
x̂: Authorities are about to repair the fence fast. (F_BERT: 193, SentBleu: 491)
5. x: Hewlett-Packard to cut up to 30,000 jobs (Human: 119)
x̂: Hewlett-Packard will reduce jobs up to 30.000 (F_BERT: 168, SentBleu: 441)
Examples where SentBleu ranks the pair as more similar than F_BERT does:
6. x: In their view the human dignity of the man had been violated. (Human: 500)
x̂: Look at the human dignity of the man injured. (F_BERT: 523, SentBleu: 115)
7. x: A good prank is funny, but takes moments to reverse. (Human: 495)
x̂: A good prank is funny, but it takes only moments before he becomes a boomerang. (F_BERT: 487, SentBleu: 152)
8. x: For example when he steered a shot from Ideye over the crossbar in the 56th minute. (Human: 516)
x̂: So, for example, when he steered a shot of Ideye over the latte (56th). (F_BERT: 498, SentBleu: 185)
9. x: I will put the pressure on them and onus on them to make a decision. (Human: 507)
x̂: I will exert the pressure on it and her urge to make a decision. (F_BERT: 460, SentBleu: 220)
10. x: Contrary to initial fears, however, the wound was not serious. (Human: 462)
x̂: Contrary to initial fears, he remained without a serious Blessur. (F_BERT: 481, SentBleu: 256)
Table 5: Example sentences where the ranks of similarity assigned by SentBleu and F_BERT differ significantly, on WMT16 German-to-English reference-candidate pairs. x: gold reference, x̂: candidate output of an MT system. Pairs are ranked by each score from most to least similar, i.e. rank 1 (out of 560) is the pair a score considers most similar. An ideal metric should rank similarly to humans.
Figure 2: Word matches for each word in the candidate sentence x̂ for Example 4 (Table 5). We color-code the contribution of each word in x̂.

We study BERTScore and SentBleu using failure cases of reference and candidate pairs from WMT16 German-to-English (wmt16em). We rank all 560 pairs by the human score, by BERTScore, and by SentBleu, from most similar to least similar. Ideally, the ranks assigned by BERTScore and SentBleu should be similar to the ranks assigned by the human score.

Table 5 shows examples where BERTScore and SentBleu disagree about the ranking of an example by a large margin. We observe that BERTScore is effectively able to capture synonyms and changes in word order. For example, in the first pair, the reference and candidate sentences are almost identical except that the candidate replaces opinion in Hungary with Hungarian view and switches the order of the opening quotation mark and the article a. While BERTScore ranks the pair relatively high, SentBleu ranks the pair as dissimilar, possibly because it cannot match the synonyms and is sensitive to the small changes in word order. The fifth pair shows a set of changes that preserve the semantic meaning: replacing to cut with will reduce and swapping the order of 30,000 and jobs. BERTScore ranks the candidate translation similarly to the human judgment, whereas SentBleu ranks it much lower. We also see that SentBleu potentially over-rewards n-gram overlap, even when phrases are used very differently. In the sixth pair, both the candidate and the reference contain the human dignity of the man. Yet the two sentences convey very different meanings. BERTScore agrees with the human judgment and assigns a low rank to the pair. In contrast, SentBleu considers the pair relatively similar because the reference and the candidate sentences have significant word overlap.

Because BERTScore relies on explicit alignments, it is easy to visualize the word matching to better understand the resulting score. Figure 2 visualizes the BERTScore matching of two pairs from Table 5. The coloring in the figure visualizes the amount of contribution of each token to the overall score, including both the idf score and the cosine similarity. In both examples, function words such as are, to, and of contribute less to the overall similarity score.

Trained on QQP (supervised) DecAtt 0.939* 0.263
DIIN 0.952* 0.324
BERT 0.963* 0.351
Trained on QQP + PAWSQQP (supervised) DecAtt - 0.511
DIIN - 0.778
BERT - 0.831
Metric (unsupervised) Bleu-1 0.737 0.402
Bleu-2 0.720 0.548
Bleu-3 0.712 0.527
Bleu-4 0.707 0.527
Meteor 0.755 0.532
Rouge-L 0.740 0.536
chrF++ 0.577 0.608
Metric (pre-trained) P_BERT 0.752 0.664
R_BERT 0.765 0.666
F_BERT 0.770 0.664
Table 6: Area under the ROC curve (AUC) on the QQP and PAWSQQP datasets. BERTScore is more robust to the adversarial paraphrase examples. The scores of trained DecAtt (decatt), DIIN (diin), and fine-tuned BERT are reported by paws. *: score on the held-out test set of QQP.

7 Robustness Analysis

We test the robustness of BERTScore using adversarial paraphrase classification. We use the Quora Question Pair corpus (QQP; QQP) and the Paraphrase Adversaries from Word Scrambling dataset (PAWS; paws). Both datasets contain pairs of sentences labeled to indicate whether or not they are paraphrases. Positive examples in QQP are real duplicate questions, while negative examples are generated from related, but different, questions. Sentence pairs in PAWS are generated through word swapping. For example, in PAWS, Flights from New York to Florida may be changed to Flights from Florida to New York, and a good classifier should identify that these two sentences are not paraphrases. PAWS includes two parts: PAWSQQP, which is based on the QQP data, and PAWSWiki. Table 6 shows the area under the ROC curve for existing models and automatic metrics.

We observe that supervised classifiers trained on QQP perform even worse than random guessing on PAWSQQP, i.e. these models consider the adversarial examples more likely to be paraphrases than not. When adversarial examples are provided during training, state-of-the-art models like DIIN (diin) and fine-tuned BERT are able to identify the adversarial examples, but their performance still decreases significantly relative to their performance on QQP.

We study the effectiveness of automatic metrics for paraphrase detection without any training data. We use the PAWSQQP development set, which contains 667 sentence pairs. For QQP, we use the first 5,000 sentence pairs in the training set instead, because the test labels are not available. We treat the first sentence as the reference and the second sentence as the candidate, and expect pairs with higher scores to be more likely paraphrases. Most metrics have decent performance on QQP, but show a significant performance drop on PAWSQQP, almost down to chance performance. This suggests these metrics fail to distinguish the harder adversarial examples. In contrast, the performance of BERTScore drops only slightly, demonstrating that it is more robust than the other metrics.
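Treating a metric's score as a paraphrase-detection signal, the AUC reported in Table 6 equals the probability that a randomly chosen positive (paraphrase) pair outscores a randomly chosen negative pair, with ties counting half. A minimal, quadratic-time sketch:

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via its rank-statistic interpretation:
    the probability that a positive example outscores a negative one
    (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))
```

A metric at chance level on the adversarial pairs yields an AUC near 0.5, which is roughly where most baseline metrics land on PAWSQQP.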

8 Conclusion

We propose BERTScore, a new metric for evaluating generated text against gold standard references. Our experiments on common benchmarks demonstrate that BERTScore achieves better correlation than common metrics, such as Bleu or Meteor. Our analysis illustrates the potential of BERTScore to resolve some of the limitations of these commonly used metrics, especially on challenging adversarial examples. BERTScore is purposely designed to be simple, interpretable, task agnostic, and easy to use. The code for BERTScore is available at


Acknowledgments

This research is supported in part by grants from the National Science Foundation (III-1618134, III-1526012, IIS-1149882, IIS-1724282, and TRIPODS-1740822), the Office of Naval Research DOD (N00014-17-1-2175), the Bill and Melinda Gates Foundation, SAP, Zillow, and Facebook Research. We thank Graham Neubig, Tianze Shi, Yin Cui, and Guandao Yang for their insightful comments.


Appendix A Model and Representation Choice Analysis

In Section 5, we report the human correlations of BERTScores computed using the uncased BERT_BASE model. Here we investigate the potential improvement from using different BERT models, on the WMT16 German-to-English data (wmt16em). In Table 7, we report the average segment-level human correlation of F_BERT computed using BERT_BASE and BERT_LARGE. As expected, BERTScores computed with the BERT_LARGE model correlate better with human judgment on this dataset. However, the improvement is marginal and appears less appealing once we consider the computational overhead of BERT_LARGE. Therefore, in our opinion, using BERT_BASE suffices.

Since pre-trained BERT models are available for different domains and languages, we hypothesize that using a more domain-specific model would improve correlation with human judgment. On the WMT17 English-to-Chinese translation data, we compute F_BERT with BERT_MULTI, a general-domain multilingual BERT model trained on 104 languages, and with BERT_CHINESE, which is trained solely on Chinese data. The results are presented in Table 8. We observe that, as hypothesized, F_BERT computed with BERT_CHINESE shows a significant performance increase. We therefore expect further improvement from more domain-specific BERT models, and advise practitioners to use domain-specific models when available.

BERT_BASE-uncased 0.640
BERT_BASE-cased 0.623
BERT_LARGE-uncased 0.659
BERT_LARGE-cased 0.641
Table 7: Pearson correlations of F_BERT with segment-level human judgments on the WMT16 German-to-English machine translation task, using different BERT models.
Model Precision Recall F1
BERT Multilingual 0.717 0.761 0.758
BERT Chinese 0.736 0.792 0.789
Table 8: Absolute Pearson correlation of segment-level metric scores with human judgments on the WMT17 English-to-Chinese translation task. The domain-specific pre-trained model (BERT_CHINESE) outperforms the domain-general pre-trained model (BERT_MULTI).
Figure 3: Absolute Pearson correlation of F_BERT computed with three BERT models, across different layers, with segment-level human judgments on the WMT16 German-to-English machine translation task. Consistently, correlation drops significantly in the final layers.

As suggested by previous studies (elmo; alternate), selecting a good layer, or a good combination of layers, of hidden representations is important. In designing BERTScore, we use the WMT16 segment-level human judgment data as a development set to guide our representation choice. In Figure 3, we plot the change in human correlation of BERTScores over different layers of the BERT models on the WMT16 German-to-English translation task. Based on results from three different BERT models, we identify a common trend: F_BERT computed with the representations of intermediate layers tends to work better. In practice, we use the 9th layer of the BERT_BASE model.

Task Model Bleu P_BERT R_BERT F_BERT
WMT14 En-De ConvS2S (gehring2017convs2s) 0.266 0.8323 0.8311 0.8312
Transformer-base (tensor2tensor) 0.285 0.8443 0.8438 0.8439
Transformer-big (snmt) 0.298 0.8506 0.8474 0.8487
DynamicConv (wu2018pay) 0.297 0.8482 0.8452 0.8464
WMT14 En-Fr ConvS2S (gehring2017convs2s) 0.408 0.8749 0.8693 0.8718
Transformer-big (snmt) 0.432 0.8819 0.8751 0.8782
DynamicConv (wu2018pay) 0.432 0.8810 0.8754 0.8779
IWSLT14 De-En Transformer-iwslt (ott2019fairseq) 0.347 0.7903 0.7764 0.7820
LightConv  (wu2018pay) 0.348 0.7924 0.7751 0.7823
DynamicConv (wu2018pay) 0.352 0.7941 0.7782 0.7848
Table 9: Bleu scores and BERTScores of publicly available pre-trained MT models in Tensor2Tensor (tensor2tensor) and fairseq (ott2019fairseq). : trained on unconfirmed WMT data version, : trained on WMT16 +  paracrawl, : trained on WMT16, : trained by us using fairseq.

A.1 BERTScore of Recent MT Models

Table 9 shows the Bleu scores and the BERTScores of pre-trained machine translation models on the WMT14 English-to-German, WMT14 English-to-French, and IWSLT14 German-to-English tasks. We used publicly available pre-trained models from Tensor2Tensor (tensor2tensor; pre-trained model available at gs://tensor2tensor-checkpoints/transformer_ende_test) and fairseq (ott2019fairseq). Since a pre-trained Transformer model on IWSLT is not released, we trained our own using the fairseq library. We use the multilingual cased BERT model for English-to-German and English-to-French pairs, and the English uncased model (hash code: bert-base-uncased_L9_version=0.1.0) for German-to-English pairs. Interestingly, the gap between DynamicConv (wu2018pay), trained only on WMT16, and a Transformer (snmt) trained on WMT16 plus paracrawl (about 30× more training data) becomes larger when evaluated with BERTScores rather than Bleu.
