Unsupervised Evaluation Metrics and Learning Criteria
for Non-Parallel Textual Transfer
We consider the problem of automatically generating textual paraphrases with modified attributes or properties, focusing on the setting without parallel data (hu-1; shen-1). This setting poses challenges for evaluation. We show that the metric of post-transfer classification accuracy is insufficient on its own, and propose additional metrics based on semantic preservation and fluency as well as a way to combine them into a single overall score. We contribute new loss functions and training strategies to address the different metrics. Semantic preservation is addressed by adding a cyclic consistency loss and a loss based on paraphrase pairs, while fluency is improved by integrating losses based on style-specific language models. We experiment with a Yelp sentiment dataset and a new literature dataset that we propose, using multiple models that extend prior work (shen-1). We demonstrate that our metrics correlate well with human judgments, at both the sentence-level and system-level. Automatic and manual evaluation also show large improvements over the baseline method of shen-1. We hope that our proposed metrics can speed up system development for new textual transfer tasks while also encouraging the community to address our three complementary aspects of transfer quality.
We consider textual transfer, which we define as the capability of generating textual paraphrases with modified attributes or stylistic properties, such as politeness (politeness), sentiment (hu-1; shen-1), and formality (formality). An effective transfer system could benefit a range of user-facing text generation applications such as dialogue (ritter2011data) and writing assistance (heidorn2000intelligent). It can also improve NLP systems via data augmentation and domain adaptation.
However, one factor that makes textual transfer difficult is the lack of parallel corpora. Advances have been made in developing transfer methods that do not require parallel corpora (see Section 2), but issues remain with automatic evaluation metrics. simple-transfer used crowdsourcing to obtain manually-written references and used BLEU (papineni2002bleu) to evaluate sentiment transfer. However, this approach is costly and difficult to scale for arbitrary textual transfer tasks.
Researchers have thus turned to unsupervised evaluation metrics that do not require references. The most widely-used unsupervised evaluation uses a pretrained style classifier and computes the fraction of transferred outputs that the classifier assigns to the target style (shen-1). However, relying solely on this metric leads to models that completely distort the semantic content of the input sentence. Table 1 illustrates this tendency.
We address this deficiency by identifying two competing goals: preserving semantic content and producing fluent output. We contribute two corresponding metrics. Since the metrics are unsupervised, they can be used directly for tuning and model selection, even on test data. The three metric categories are complementary and help us avoid degenerate behavior in model selection. For particular applications, practitioners can choose the appropriate combination of our metrics to achieve the desired balance among transfer, semantic preservation, and fluency. It is often useful to summarize the three metrics into one number, which we discuss in Section 3.3.
We also add learning criteria to the framework of shen-1 to accord with our new metrics. We encourage semantic preservation by adding a “cyclic consistency” loss (to ensure that transfer is reversible) and a loss based on paraphrase pairs (to show the model examples of content-preserving transformations). To encourage fluent outputs, we add losses based on pretrained corpus-specific language models. We also experiment with multiple, complementary discriminators and find that they improve the trade-off between post-transfer accuracy and semantic preservation.
To demonstrate the effectiveness of our metrics, we experiment with textual transfer models discussed above, using both their Yelp polarity dataset and a new literature dataset that we propose. Across model variants, our metrics correlate well with human judgments, at both the sentence-level and system-level.
2 Related Work
Textual Transfer Evaluation
Recent work has included human evaluation of the three categories (post-transfer style accuracy, semantic preservation, fluency), but does not propose automatic evaluation metrics for all three (simple-transfer; back-translation; chen2018adversarial; zhang2018learning). There have been recent proposals for supervised evaluation metrics (simple-transfer), but these require annotation and are therefore unavailable for new textual transfer tasks. There is a great deal of recent work in textual transfer (yang2018unsupervised; santos2018fighting; zhang2018learning; logeswaran2018content; nikolov2018large), but these works either lack certain categories of unsupervised metrics or lack human validation of them, which we contribute. Moreover, the textual transfer community lacks discussion of early stopping criteria and methods of holistic model comparison. We propose a one-number summary of transfer quality, which can be used to select and compare models.
In contemporaneous work, mir2019evaluating similarly proposed three types of metrics for style transfer tasks. There are two main differences compared to our work: (1) They use a style-keyword masking procedure before evaluating semantic similarity, which works on the Yelp dataset (the only dataset mir2019evaluating test on) but does not work on our Literature dataset or similarly complicated tasks, because the masking procedure goes against preserving content-specific non-style-related words. (2) They do not provide a way of aggregating three metrics for the purpose of model selection and overall comparison. We address these two problems, and we also propose metrics that are simple in addition to being effective, which is beneficial for ease of use and widespread adoption.
Textual Transfer Models
In terms of generating the transferred sentences, to address the lack of parallel data, hu-1 used variational autoencoders to generate content representations devoid of style, which can be converted to sentences with a specific style. goldberg-1 used conditional language models to generate sentences where the desired content and style are conditioning contexts. simple-transfer used a feature-based approach that deletes characteristic words from the original sentence, retrieves similar sentences in the target corpus, and generates based on the original sentence and the characteristic words from the retrieved sentences. cycle-reinforce integrated reinforcement learning into the textual transfer problem. Another way to address the lack of parallel data is to use learning frameworks based on adversarial objectives (gan); several have done so for textual transfer (yu-1; li-1; yang-1; shen-1; fu-1). Recent work uses target-domain language models as discriminators to provide more stable feedback in learning (yang2018unsupervised).
To preserve semantics more explicitly, fu-1 use a multi-decoder model to learn content representations that do not reflect styles. authorship use a cycle constraint that penalizes distance between input and round-trip transfer reconstruction. Our cycle consistency loss is inspired by authorship, together with the idea of back translation in unsupervised neural machine translation (back-translation-nmt; back-translation-nmt-2), and the idea of cycle constraints in image generation by zhu-1.
3.1 Issues with Most Existing Methods
| Epochs | Acc | Sim | Sentence |
| — | — | — | original input: the host that walked us to the table and left without a word . |
| 0.5 | 0.87 | 0.65 | the food is the best and the food is the . |
| 3.3 | 0.72 | 0.75 | the owner that went to to the table and made a smile . |
| 7.5 | 0.58 | 0.81 | the host that walked through to the table and are quite perfect ! |
Prior work in automatic evaluation of textual transfer has focused on post-transfer classification accuracy (“Acc”), computed by using a pretrained classifier to measure classification accuracy of transferred texts (hu-1; shen-1). However, there is a problem with relying solely on this metric. Table 1 shows examples of transferred sentences at several points in training the model of shen-1. Acc is highest very early in training and decreases over time as the outputs become a stronger semantic match to the input, a trend we show in more detail in Section 6. Thus post-transfer accuracy is inversely related to semantic similarity to the input sentence, meaning that these metrics are complementary and difficult to optimize simultaneously.
We also identify a third category of metric, namely fluency of the transferred sentence, and similarly find it to be complementary to the first two. These three metrics can be used to evaluate textual transfer systems and to perform hyperparameter tuning and early stopping. In our experiments, we found that training typically converges to a point that gives poor Acc. Intermediate models are much better under a combination of all three unsupervised metrics. Stopping criteria are rarely discussed in prior work on textual transfer.
3.2 Unsupervised Evaluation Metrics
We now describe our proposals. We validate the metrics with human judgments in Section 6.3.
Post-transfer classification accuracy (“Acc”):
This metric was mentioned above. We use a CNN (kim-1) trained to classify a sentence as being from X_1 or X_2 (two corpora corresponding to different styles or attributes). Acc is then the percentage of transferred sentences that are classified as belonging to the target class.
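Computing post-transfer classification accuracy is straightforward once a style classifier is available. The sketch below is illustrative: the `classify` argument and the toy keyword classifier are stand-ins for the pretrained CNN classifier, not the paper's actual model.

```python
def post_transfer_accuracy(transferred, target_styles, classify):
    """Fraction of transferred sentences whose predicted style matches
    the intended target style (the Acc metric)."""
    matches = sum(1 for sent, style in zip(transferred, target_styles)
                  if classify(sent) == style)
    return matches / len(transferred)

# Toy stand-in for the pretrained CNN classifier: label a sentence
# "positive" (1) iff it contains an upbeat keyword (illustration only).
def toy_classify(sentence):
    return 1 if any(w in sentence.split() for w in ("good", "great", "amazing")) else 0

acc = post_transfer_accuracy(
    ["the food was great .", "the host was rude ."],
    [1, 1],  # both sentences were meant to be transferred to positive
    toy_classify)  # → 0.5: only the first output convinced the classifier
```

In practice the classifier is trained once on the two style corpora and then frozen for evaluation.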
Semantic Similarity (“Sim”):
We compute semantic similarity between the input and transferred sentences. We embed each sentence by averaging its word embeddings weighted by their idf scores, computed over the corpus. We use pretrained GloVe word embeddings (glove). Sim is then the average of the cosine similarities over all original/transferred sentence pairs. Though this metric is quite simple, we show empirically that it is effective in capturing semantic similarity. Simplicity in evaluation metrics is beneficial for computational efficiency and widespread adoption, and even such a simple metric substantially improves the quality of transfer evaluation. We also experimented with METEOR (denkowski-1); however, since we found it to be strongly correlated with Sim (shown in the supplementary material), we adopt Sim due to its computational efficiency and simplicity.
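The semantic similarity computation can be sketched in a few lines. GloVe vectors are replaced here by an arbitrary embedding table, and the idf weights come from whatever background corpus is available; both are stand-ins for the paper's actual resources.

```python
import math
import numpy as np

def idf_weights(corpus):
    """idf(w) = log(N / df(w)) over a tokenized background corpus."""
    n = len(corpus)
    df = {}
    for sent in corpus:
        for w in set(sent):
            df[w] = df.get(w, 0) + 1
    return {w: math.log(n / c) for w, c in df.items()}

def embed(sent, vectors, idf, dim):
    """idf-weighted average of word vectors (GloVe in the paper)."""
    vs = [idf.get(w, 0.0) * vectors[w] for w in sent if w in vectors]
    return np.sum(vs, axis=0) if vs else np.zeros(dim)

def sim_metric(pairs, vectors, idf, dim):
    """Average cosine similarity over (original, transferred) pairs."""
    scores = []
    for orig, trans in pairs:
        a, b = embed(orig, vectors, idf, dim), embed(trans, vectors, idf, dim)
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        scores.append(float(a @ b / denom) if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

An identical original/transferred pair scores 1.0, and unrelated pairs drift toward 0, which is the behavior the metric relies on.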
Different textual transfer tasks may require different degrees of semantic preservation. Our summary metric, described in Section 3.3, can be tailored by practitioners for various datasets and tasks which may require more or less weight on semantic preservation.
Transferred sentences can exhibit high Acc and Sim while still being ungrammatical, so we add a third unsupervised metric to target fluency. We compute the perplexity (“PP”) of the transferred corpus using a language model pretrained on the concatenation of X_1 and X_2. We note that perplexity is distinct from fluency. However, certain measures based on perplexity have been shown to correlate with sentence-level human fluency judgments (gamon2005sentence; DBLP:conf/conll/KannRF18). Furthermore, as discussed in Section 3.3, we punish abnormally small perplexities, as transferred texts with such perplexities typically consist entirely of common words and phrases that do not form meaningful sentences. Our summary metric, described in Section 3.3, can be tailored by practitioners for datasets and tasks that require more or less weight on fluency.
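Corpus-level perplexity is the exponentiated average negative token log-likelihood under the pretrained language model. A minimal sketch, where `logprob` is a stand-in for the pretrained GRU language model:

```python
import math

def perplexity(corpus, logprob):
    """PP of a transferred corpus: exp of the mean negative per-token
    log-probability. `logprob(sent)` returns per-token log p(w_t | w_<t)
    and stands in for the pretrained language model."""
    total_lp, total_tokens = 0.0, 0
    for sent in corpus:
        lps = logprob(sent)
        total_lp += sum(lps)
        total_tokens += len(lps)
    return math.exp(-total_lp / total_tokens)

# Sanity check: a uniform LM over a 10-word vocabulary yields PP = 10.
uniform_lm = lambda sent: [math.log(1 / 10)] * len(sent)
pp = perplexity([["a", "b", "c"], ["d", "e"]], uniform_lm)  # → 10.0
```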
3.3 Summarizing Metrics into One Score
It is often useful to summarize multiple metrics into one number for ease of tuning and model selection. To do so, we propose an adjusted geometric mean (“GM”) of a generated sentence x:
where the parameters trade off the contributions of Acc, Sim, and PP. Note that, as discussed above, we punish abnormally small perplexities by clipping PP from below.
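To illustrate the shape of such a summary, the sketch below combines the three metrics into a clipped geometric mean. The thresholds `t_low` and `t_high` are placeholder choices, not the paper's fitted parameters, and the unweighted cube root is likewise an assumption made for illustration.

```python
def adjusted_gm(acc, sim, pp, t_low=5.0, t_high=100.0):
    """Illustrative adjusted geometric mean (placeholder thresholds, not
    the paper's fitted parameters). Abnormally small perplexities are
    clipped up to t_low, since they signal degenerate strings of very
    common words rather than fluent sentences."""
    pp = max(pp, t_low)                    # punish abnormally small PP
    fluency = max(1.0 - pp / t_high, 0.0)  # map PP to a [0, 1) fluency score
    return (acc * sim * fluency) ** (1.0 / 3.0)
```

With this clipping, a degenerate output with PP of 1.0 scores no better than one with PP equal to `t_low`.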
When choosing models, different practitioners may prefer different trade-offs of Acc, Sim, and PP. As one example, we provide a set of parameters fit from our experiments. We sampled 300 pairs of transferred sentences from a range of models from our two tasks (Yelp and literature) and asked annotators which of the two sentences is better. We denote a pair of sentences by (s, s′), where s is preferred. We train the parameters using the following loss:
In future work, a richer function could be learned from additional annotated data, and more diverse textual transfer tasks can be integrated into the parameter training.
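The parameter-fitting step above can be sketched with a standard pairwise ranking objective. This is an illustrative assumption, not necessarily the paper's exact loss: it simply penalizes pairs where the dispreferred sentence's summary score exceeds the preferred one's.

```python
import math

def pairwise_ranking_loss(gm_preferred, gm_other):
    """Logistic loss on a human preference pair: small when the preferred
    sentence's summary score exceeds the other's. Illustrative stand-in
    only; not necessarily the exact objective used to fit GM's parameters."""
    return math.log(1.0 + math.exp(gm_other - gm_preferred))
```

Summing this loss over annotated pairs and minimizing with respect to the summary's parameters pushes GM to agree with human preferences.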
4 Textual Transfer Models
The textual transfer systems introduced below are designed to target the metrics. These system variants are also used for metric evaluation. Note that each variant of the textual transfer system uses different components described below.
Our model is based on shen-1. We define y and z to be latent style and content variables, respectively. X_1 and X_2 are two corpora containing sentences x_1 and x_2 respectively, with word embeddings in a shared space. We transfer using an encoder-decoder framework. The encoder E : X × Y → Z (where X, Y, and Z are the sentence domain, style space, and content space, respectively) is defined using an RNN with gated recurrent unit (GRU; chung2014empirical) cells. The decoder/generator G : Y × Z → X is also defined using a GRU RNN. We use x̃_1 to denote the style-transferred version of x_1, and we want x̃_1 to exhibit style y_2 (and, symmetrically, x̃_2 to exhibit style y_1).
4.1 Reconstruction and Adversarial Losses
shen-1 used two families of losses for training: reconstruction and adversarial losses. The reconstruction loss solely helps the encoder and decoder work well at encoding and generating natural language, without any attempt at transfer:
The loss seeks to ensure that when a sentence x is encoded to its content vector and then decoded to generate a sentence, the generated sentence matches x. For their adversarial loss, shen-1 used a pair of discriminators: D_1 tries to distinguish between real sentences x_1 and transferred sentences x̃_2, and D_2 between x_2 and x̃_1. In particular, the decoder G’s hidden states are aligned instead of its output words.
where m is the size of a mini-batch. D_i outputs the probability that its input is from style y_i; the classifiers are based on the convolutional neural network from kim-1. The CNNs use n-gram filter sizes of 3, 4, and 5, with 128 filters each. We obtain one sequence of hidden states by unfolding the decoder from the initial state and feeding in the ground-truth words, and the other by unfolding from the same initial state while feeding in the previous output probability distributions.
4.2 Cyclic Consistency Loss
We use a “cyclic consistency” loss (zhu-1) to encourage already-transferred sentences to be recoverable by transferring back again. This loss is similar to the reconstruction loss except that we now transfer style twice within the loss. Recall that we seek to transfer x_1 (with style y_1) to x̃_1 (with style y_2). After successful transfer, we expect x̃_1 to have style y_2, and the sentence transferred back from x̃_1 to have style y_1 and to be very close to the original untransferred x_1. The loss is defined as
where x̃_1 = G(y_2, E(x_1, y_1)), i.e., the loss scores x_1 under the round-trip reconstruction G(y_1, E(x̃_1, y_2)).
To use this loss, the first step is to transfer sentences from style y_1 to y_2 to get x̃_1. The second step is to transfer x̃_1 of style y_2 back to style y_1 so that we can compute the log loss of the words in x_1 using probability distributions computed by the decoder. Backpropagation to the embedding, encoder, and decoder parameters is based only on the second step, because the first step involves argmax operations that prevent backpropagation. Still, we find that the cyclic loss greatly improves semantic preservation during transfer.
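The value of the cyclic loss is simply the token-level negative log-likelihood of the original sentence under the round-trip decoder distributions. A sketch, where the probability table is a toy input standing in for the decoder's actual output:

```python
import numpy as np

def cyclic_consistency_loss(x1_ids, roundtrip_probs):
    """Mean negative log-likelihood of the original token ids x1 under
    the decoder distributions from the two-step transfer
    x1 -> style 2 -> style 1.

    roundtrip_probs: array of shape (len(x1), vocab); row t is the
    decoder's distribution over the t-th reconstructed word. In training,
    gradients reach the model only through the second transfer step,
    since the first step decodes with argmax."""
    rows = np.arange(len(x1_ids))
    return float(-np.log(roundtrip_probs[rows, x1_ids]).mean())

# A round trip that reproduces x1 with near-certainty gives a loss near 0.
probs = np.array([[0.98, 0.01, 0.01],
                  [0.01, 0.98, 0.01]])
loss = cyclic_consistency_loss(np.array([0, 1]), probs)
```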
4.3 Paraphrase Loss
While the reconstruction loss provides the model with one way to preserve style (i.e., simply reproduce the input), the model does not see any examples of style-preserving paraphrases. To address this, we add a paraphrase loss very similar to losses used in neural machine translation. We define the loss on a sentential paraphrase pair (s, s′) and assume that s and s′ have the same style and content. The loss is the sum of token-level log losses for generating each word in s′ conditioned on the encoding of s:
For paraphrase pairs, we use the ParaNMT-50M dataset (wieting-1).[1]

[1] We first filter out sentence pairs where one sentence is a substring of the other, and then randomly select 90K pairs.
4.4 Language Modeling Loss
We attempt to improve fluency (our third metric) and assist transfer with a loss based on matching a pretrained language model for the target style. The loss is the cross entropy (CE) between the probability distribution from this language model and the distribution from the decoder:
where p_LM and p_G are distributions over the vocabulary defined as follows:
where the vocabulary is built from the two corpora. When transferring from style y_1 to y_2, p_LM is the distribution under the language model pretrained on sentences from style y_2, and p_G is the distribution under the decoder G. Both distributions are over words at position t given the words already predicted by the decoder. The two style-specific language models are pretrained on the corpora corresponding to the two styles; they are GRU RNNs with a dropout probability of 0.5 and are kept fixed during training of the transfer network.
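Per position, the loss compares the decoder's distribution against the fixed target-style language model's distribution via cross entropy. A sketch over toy distributions; the direction of the cross entropy shown here is one plausible reading of the description above:

```python
import numpy as np

def lm_matching_loss(decoder_probs, lm_probs, eps=1e-12):
    """Mean per-position cross entropy -sum_w p_G(w) log p_LM(w) between
    the decoder's distributions p_G and the pretrained target-style
    language model's distributions p_LM (both of shape (T, vocab)).
    The LM is kept fixed; only the decoder receives gradients."""
    return float(-(decoder_probs * np.log(lm_probs + eps)).sum(axis=1).mean())

# The loss is smaller when the decoder puts mass where the LM expects it.
lm = np.array([[0.9, 0.1]])
close = lm_matching_loss(np.array([[1.0, 0.0]]), lm)
far = lm_matching_loss(np.array([[0.0, 1.0]]), lm)
```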
4.5 Multiple Discriminators
Note that each of the textual transfer system variants uses different losses or components described in this section. To create more variants, we add a second pair of discriminators, D_1′ and D_2′, to the adversarial loss to address the possible mode collapse problem (dual-d). In particular, we use CNNs with n-gram filter sizes of 3, 4, and 5 for D_1 and D_2, and CNNs with n-gram sizes of 1, 2, and 3 for D_1′ and D_2′. Also, for D_1′ and D_2′, we use the Wasserstein GAN (WGAN) framework (wgan). The adversarial loss takes the following form:
where the interpolated input used in the gradient penalty is sampled for each training instance. The adversarial loss is based on wgan,[2] with the exception that we use the hidden states of the decoder instead of word distributions as inputs to the discriminators, similar to Eq. (3).

[2] We use a default value for the penalty coefficient.
We choose WGAN in the hope that its differentiability properties can help avoid vanishing gradient and mode collapse problems. We expect the generator to receive helpful gradients even if the discriminators perform well. This approach leads to much better outputs, as shown below.
5 Experimental Setup
5.1 Datasets

We use the same Yelp dataset as shen-1, consisting of corpora of positive and negative Yelp reviews. The goal of the transfer task is to generate rewritten sentences with similar content but inverted sentiment. We use the same train/development/test split as shen-1. The dataset has 268K/38K/76K positive training, development, and test sentences, respectively, and 179K/25K/51K negative sentences. Like shen-1, we only use sentences with 15 or fewer words.
We also consider two corpora of literature. The first contains works of Charles Dickens collected from Project Gutenberg. The second consists of modern literature from the Toronto Books Corpus (toronto-book). Unlike the Yelp dataset, the two corpora have very different vocabularies, which poses challenges for the textual transfer task and provides diverse data for assessing the quality of our evaluation metrics. Given the different and sizable vocabularies, we preprocess by using the named entity recognizer in Stanford CoreNLP (manning2014stanford) to replace names and locations with -person- and -location- tags, respectively, and we apply byte-pair encoding (BPE), commonly used in generation tasks (P16-1162). We only use sentences with lengths between 6 and 25 words. The resulting dataset has 156K/5K/5K Dickens training, development, and test sentences, respectively, and 165K/5K/5K modern literature sentences.
5.2 Hyperparameter Settings
Section 4.6 requires setting a weight λ for each loss component. Depending on which model is being trained (see Table 2), the λ's for the unused losses are zero. Otherwise, we set the λ's to fixed values, with one weight scheduled according to the number of training epochs. For optimization we use Adam (adam) with a fixed learning rate. We implement our models using TensorFlow (tf-2015).[3]

[3] Our implementation is based on code from shen-1. Code will be available via the first author's webpage yzpang.me.
5.3 Pretrained Evaluation Models
For the pretrained classifiers, the accuracies on the Yelp and Literature development sets are 0.974 and 0.933, respectively. For language models, the perplexities on the Yelp and Literature development sets are 27.4 and 40.8, respectively.
6 Results and Analysis
[Table 2 header: Models | Transfer quality | Semantic preservation | Fluency]
6.1 Analyzing Metric Relationships
Table 2 shows results for the Yelp dataset and Figure 1 plots learning trajectories of those models. Table 3 shows results for the Literature dataset, whose models show similar trends. The figures show trajectories of statistics on corpora transferred/generated from the dev set during learning. Each two consecutive markers are separated by half an epoch of training, and lower-left markers generally precede upper-right ones. In Figure 1(a), the plots of Sim by error rate (1 − Acc) exhibit positive slopes, meaning that error rate is positively correlated with Sim; curves toward the upper-left corner represent a better trade-off between error rate and Sim. In the plots of PP by Sim in Figure 1(b), the M0 curve exhibits a large positive slope while the curves for other models do not, which indicates that M0 sacrifices PP for Sim; the other models maintain consistent PP as Sim increases during training.
6.2 System-Level Validation
Annotators were shown the untransferred sentence, as well as sentences produced by two models (which we refer to as A and B). They were asked to judge which better reflects the target style (A, B, or tie), which has better semantic preservation of the original (A, B, or tie), and which is more fluent (A, B, or tie). Results are shown in Table 4.
Overall, the results show the same trends as our automatic metrics. For example, on Yelp, the pairs with large differences in human judgments of semantic preservation (M2 vs. M0, M7 vs. M0, M7 vs. M2) also show the largest differences in Sim, while M6 and M7 have very similar human judgments and very similar Sim scores.
6.3 Sentence-Level Validation of Metrics
| Metric | Method of validation | Yelp | Lit. |
| Acc | % of machine and human judgments that match | 94 | 84 |
| Sim | Spearman's ρ between Sim and human ratings of semantic preservation | 0.79 | 0.75 |
| PP | Spearman's ρ between negative PP and human ratings of fluency | 0.81 | 0.67 |
We describe a human sentence-level validation of our metrics in Table 5.
To validate Acc, human annotators were asked to judge the style of 100 transferred sentences (sampled equally from M0, M2, M6, M7). Note that this is a binary choice (style 0 or style 1, with no “tie” option), so annotators had to make a choice. We then compute the percentage of machine and human judgments that match.
We validate Sim and PP by computing sentence-level Spearman's ρ between each metric and human judgments (an integer score from 1 to 4) on 150 generated sentences (sampled equally from M0, M2, M6, M7). We presented pairs of original and transferred sentences to human annotators, who rated the level of semantic similarity (and, separately, fluency), where 1 means “extremely bad”, 2 means “bad/ok/needs improvement”, 3 means “good”, and 4 means “very good”. They were also given 5 examples for each rating (a total of 20 across the four levels) before annotating. From Table 5, all validations show strong correlations on the Yelp dataset and reasonable correlations on Literature.
We validate GM by obtaining human pairwise preferences (without a “tie” option) of overall transfer quality and measuring the fraction of pairs in which the GM score agrees with the human preference. Out of 300 pairs (150 from each dataset), 258 (86%) match.
The transferred sentences used in the evaluation are sampled from the development sets produced by models M0, M2, M6, and M7, at the accuracy levels used in Table 2. In the data preparation for the manual annotation, there is sufficient randomization regarding model and textual transfer direction.
6.4 Comparing Losses
Cyclic Consistency Loss.
We compare the trajectories of the baseline model (M0) and the +cyc model (M2). Table 2 and Figure 1 show that at similar Acc, M2 has much better semantic similarity for both Yelp and Literature. In fact, the cyclic consistency loss proves to be the strongest driver of semantic preservation across all of our model configurations. Since the other losses do not constrain the semantic relationship across style transfer, we include the cyclic loss in M3 through M7.
Paraphrase Loss.
Table 2 shows that the model with the paraphrase loss (M1) slightly improves over M0 on both datasets at similar Acc. For Yelp, M1 has better Acc and PP than M0 at comparable semantic similarity. So, when used alone, the paraphrase loss helps. However, when combined with other losses (e.g., compare M2 to M4), its benefits are mixed: for Yelp, M4 is slightly better at preserving semantics and producing fluent output, but for Literature, M4 is slightly worse. A challenge in introducing an additional paraphrase dataset is that its notion of similarity may clash with that of content preservation in the transfer task. For Yelp, both corpora share a great deal of semantic content, but Literature shows systematic semantic differences even after preprocessing.
Language Modeling Loss.
Comparing M2 with M3, M4 with M5, and M6 with M7, we find that adding the language modeling loss reduces PP, sometimes at a slight cost in semantic preservation.
6.5 Results based on Supervised Evaluation
[Table 6 excerpt: LM + classifier | 22.3 | 0.900]
If we want to compare models using a single number, GM is our unsupervised approach. We can also compute BLEU scores between our generated outputs and human-written gold-standard outputs using the 1000 Yelp references from simple-transfer. For the BLEU scores of the methods of simple-transfer, we use the values reported by yang2018unsupervised, and we use the same BLEU implementation as yang2018unsupervised, i.e., multi-bleu.perl. We compare three models selected during training from each of our M6 and M7 settings. We also report post-transfer accuracies reported by prior work, as well as our own computed scores for M0, M6, M7, and the untransferred sentences. Though the classifiers differ across models, their accuracy tends to be very high, making it possible to make rough comparisons of Acc across them.
BLEU scores and post-transfer accuracies are shown in Table 6. The most striking result is that untransferred sentences have the highest BLEU score by a large margin, suggesting that prior work on this task has not yet eclipsed the trivial baseline of returning the input sentence. However, at similar levels of Acc, our models have higher BLEU scores than prior work. We additionally find that supervised BLEU shows a trade-off with Acc: for a single model type, higher Acc generally corresponds to lower BLEU.
7 Conclusion

We proposed three kinds of metrics for non-parallel textual transfer, studied their relationships, and developed learning criteria to address them. We emphasize that all three metrics are needed to make meaningful comparisons among models. We expect our components to be applicable to a broad range of generation tasks.
Acknowledgments

We thank Karl Stratos and Zewei Chu for helpful discussions, the annotators for performing manual evaluations, and the anonymous reviewers for useful comments. We also thank Google for a faculty research award to K. Gimpel that partially supported this research.
Appendix A Supplementary Material
a.1 Textual Transfer Model
We iteratively update (1) the embedding, encoder, and decoder parameters by gradient descent on the non-adversarial losses (reconstruction, cyclic consistency, paraphrase, and language modeling), and (2) the discriminators by gradient descent on the adversarial losses.
a.1.2 Full Algorithm
Please refer to Algorithm 1.
a.2 Tables and Plots in Results
Figures 1(a) and 1(b) show the learning trajectories for the Literature dataset, which exhibit similar trends to those for Yelp. While the plots for the two datasets appear different at first glance, comparing similarities at fixed error rates and comparing perplexities at fixed similarities reveals that the results largely resemble those for the Yelp dataset. The baseline M0 struggles on the Literature dataset; its particularly low perplexity does not indicate fluent sentences, but rather the piecing together of extremely common words and phrases.
In our analysis, we used Sim as the primary metric for semantic preservation. However, if we were instead to use a METEOR-based similarity (METEOR scores between original and transferred sentences, averaged over sentence pairs), the plots and our conclusions would be largely unchanged. Using the Literature dataset as an example, Figure 2 shows that the correlation between Sim and the METEOR-based metric is very large. Specifically, we randomly sampled 200 transferred corpora generated using different models and at different points during training, and obtained both metrics for each corpus using the techniques discussed in the main text, giving the 200 data points shown in Figure 2.
Table 8 provides examples of textual transfer.
| Model | Acc | Sim | PP | GM | Sentence | Style |
| Original | — | — | — | — | i got my car back and was extremely unhappy . | Negative |
| M0 | 0.818 | 0.719 | 37.3 | 10.0 | i got my favorite loves and was delicious . | Positive |
| M7 | 0.818 | 0.805 | 29.0 | 22.8 | i got my car back and was very happy . | Positive |
| Original | — | — | — | — | the mozzarella sub is absolutely amazing . | Positive |
| M0 | 0.818 | 0.719 | 37.3 | 10.0 | the front came is not much better . | Negative |
| M7 | 0.818 | 0.805 | 29.0 | 22.8 | the cheese sandwich is absolutely awful . | Negative |
| Original | — | — | — | — | they are completely unprofessional and have no experience . | Negative |
| M0 | 0.818 | 0.719 | 37.3 | 10.0 | they are super fresh and well ! | Positive |
| M7 | 0.818 | 0.805 | 29.0 | 22.8 | they are very professional and have great service . | Positive |
| Original | — | — | — | — | i would honestly give this place zero stars if i could . | Negative |
| M0 | 0.818 | 0.719 | 37.3 | 10.0 | i would recommend give this place from everyone again . | Positive |
| M7 | 0.818 | 0.805 | 29.0 | 22.8 | i would definitely recommend this place all stars if i could . | Positive |
| Original | — | — | — | — | for all those reasons , we wo n’t go back . | Negative |
| M0 | 0.818 | 0.719 | 37.3 | 10.0 | for all of pizza , you do you go . | Positive |
| M7 | 0.818 | 0.805 | 29.0 | 22.8 | for all those reviews , i highly recommend to go back . | Positive |
| Original | — | — | — | — | the owner was super nice and welcoming . | Positive |
| M0 | 0.818 | 0.719 | 37.3 | 10.0 | the server was extremely bland with all . | Negative |
| M7 | 0.818 | 0.805 | 29.0 | 22.8 | the owner was very rude and unfriendly . | Negative |
| Original | — | — | — | — | this is one of the best hidden gems in phoenix . | Positive |
| M0 | 0.818 | 0.719 | 37.3 | 10.0 | this is one of the worst _num_ restaurants in my life . | Negative |
| M7 | 0.818 | 0.805 | 29.0 | 22.8 | this is one of the worst restaurants in phoenix . | Negative |
| Original | — | — | — | — | i declined on their offer , but appreciated the gesture ! | Positive |
| M0 | 0.818 | 0.719 | 37.3 | 10.0 | i asked on their reviews , they are the same time ! | Negative |
| M7 | 0.818 | 0.805 | 29.0 | 22.8 | i paid for the refund , and explained the frustration ! | Negative |
| Original | — | — | — | — | it was a most extraordinary circumstance . | Dickens |
| M0 | 0.694 | 0.728 | 22.3 | 8.81 | it was a little deal of the world . | Modern |
| M2 | 0.692 | 0.781 | 49.9 | 12.8 | it was a huge thing on the place . | Modern |
| M6 | 0.704 | 0.794 | 63.2 | 12.8 | it was a most important effort over the relationship . | Modern |
| Original | — | — | — | — | i conjure you , tell me what is the matter . | Dickens |
| M0 | 0.694 | 0.728 | 22.3 | 8.81 | i ’m sorry , i ’m sure i ’m going to be , but i was a little man . | Modern |
| M2 | 0.692 | 0.781 | 49.9 | 12.8 | i ’m telling you , tell me what ’s the time . | Modern |
| M6 | 0.704 | 0.794 | 63.2 | 12.8 | i am telling you , tell me what ’s the matter . | Modern |
| Original | — | — | — | — | a public table is laid in a very handsome hall for breakfast , and for dinner , and for supper . | Dickens |
| M0 | 0.694 | 0.728 | 22.3 | 8.81 | the other of the man was a little , and then , and -person- ’s eyes , and then -person- . | Modern |
| M2 | 0.692 | 0.781 | 49.9 | 12.8 | a little table is standing there for all , and for me , and for you . | Modern |
| M6 | 0.704 | 0.794 | 63.2 | 12.8 | a small table is placed in a very blue room for breakfast , and for dinner , and for dinner . | Modern |
| Original | — | — | — | — | does n’t she know it ’s dangerous for a young woman to go off by herself ? | Modern |
| M0 | 0.694 | 0.728 | 22.3 | 8.81 | do n’t have been a little of a man of your own ? | Dickens |
| M2 | 0.692 | 0.781 | 49.9 | 12.8 | it n’t she know it ’s dangerous for a little woman to go out from us ? | Dickens |
| M6 | 0.704 | 0.794 | 63.2 | 12.8 | does n’t she know it ’s a dangerous act for a young lady to go off by herself ? | Dickens |
| Original | — | — | — | — | it whispered to me about my new strength and abilities . | Modern |
| M0 | 0.694 | 0.728 | 22.3 | 8.81 | it is not a little man . | Dickens |
| M2 | 0.692 | 0.781 | 49.9 | 12.8 | it appears to me about my new strength and desire . | Dickens |
| M6 | 0.704 | 0.794 | 63.2 | 12.8 | it appears to me my new strength and desire . | Dickens |