# Contrastive Entropy: A new evaluation metric for unnormalized language models

###### Abstract

Perplexity (per word) is the most widely used metric for evaluating language models. Despite this, there has been no dearth of criticism for this metric. Most of these criticisms center around lack of correlation with extrinsic metrics like word error rate (WER), dependence upon shared vocabulary for model comparison and unsuitability for unnormalized language model evaluation. In this paper, we address the last problem and propose a new discriminative entropy based intrinsic metric that works for both traditional word level models and unnormalized language models like sentence level models. We also propose a discriminatively trained sentence level interpretation of recurrent neural network based language model (RNN) as an example of unnormalized sentence level model. We demonstrate that for word level models, contrastive entropy shows a strong correlation with perplexity. We also observe that when trained at lower distortion levels, sentence level RNN considerably outperforms traditional RNNs on this new metric.


Kushal Arora, Anand Rangarajan

Dept. of Computer and Information Science and Engineering,

University of Florida, Gainesville, FL, USA

{karora, anand}@cise.ufl.edu

## 1 Introduction

There are two standard evaluation metrics for language models: perplexity and word error rate (WER). The simpler of these two, WER, is the percentage of erroneously recognized words $E$ (deletions, insertions, substitutions) relative to the total number of words $N$ in a speech recognition task, i.e.

$$\mathrm{WER} = \frac{E}{N} \times 100\% \tag{1}$$
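As an illustration of equation (1), the following minimal sketch counts the erroneous words as the word-level Levenshtein distance between a reference transcript and a recognizer hypothesis. The function name `word_error_rate` and the toy sentences are assumptions made for this example; a real ASR evaluation would normally rely on an established scoring tool.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent: 100 * (S + D + I) / N for two lists of words."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j]: minimum number of edits that turn the first i-1 reference
    # words into the first j hypothesis words (row i-1 of the DP table).
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word present in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(reference)


# Toy example (hypothetical data): one substitution and one deletion over 6 words.
reference = "the cat sat on the mat".split()
hypothesis = "the cat sit on mat".split()
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.2f}%")
```

Because insertions are counted against the reference length, the value can exceed 100% when the hypothesis is much longer than the reference.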

The second metric, perplexity (per word), is an information theoretic measure that evaluates the similarity between the proposed probability distribution $m$ and the original distribution $p$. It can be computed as the inverse of the (geometric) average probability of the test set $T$:

$$\mathrm{PPL}(T) = \frac{1}{\sqrt[N]{m(T)}} \tag{2}$$

where $N$ is the number of words in the test set $T$.

In many ways, WER is a better metric. Any improvement on language modeling benchmarks is meaningful only if it translates to improvements in Automatic Speech Recognition (ASR) or Machine Translation. The problem with WER is that it needs a complete ASR pipeline to evaluate. Also, almost all benchmarking datasets are behind a paywall and hence not readily available for evaluation.

Perplexity, on the other hand, is a theoretically elegant and easy to compute metric which correlates well with WER for simpler n-gram models. This makes perplexity (PPL) a good substitute for WER when evaluating n-gram models, but for more complex language models the correlation is not so strong [1]. In addition, due to its reliance on exact probabilities, perplexity is unsuitable for evaluating unnormalized models for which the partition function is intractable. Also, two models can be compared using perplexity only if they share the same vocabulary.

Most of the previous work done to improve upon perplexity has been focused on achieving better correlation with WER. Iyer et al. [1] proposed a decision tree based metric that uses additional features like word length, POS tags and phonetic length of words to improve the WER correlation. Chen et al. [2] proposed a new metric, M-ref, which attempts to learn the likelihood curve between WER and perplexity. Clarkson et al. [3] attempted to use entropy in conjunction with perplexity, empirically learning the mixing coefficients.

In this paper we focus on a different problem: extending perplexity to the evaluation of unnormalized language models. We do so by introducing a discriminative approach to language model evaluation. Our approach is inspired by Contrastive Estimation [4] and stems from the philosophical starting point that a superior language model should be able to distinguish better between a sentence from the test set and its deformed version. While we use an unnormalized sentence level model as an example in this paper, this technique should work for all models where the partition function is intractable, for example the unnormalized Model M and feed forward neural network language model (NNLM) from [5], or sentence level models like [6], [7] and [8].

In the next section, we give a sketch derivation of perplexity that highlights its word level model assumption. As we will be using a sentence level language model for evaluation, we then move the probability space to sentences and derive an expression for the cross entropy rate of sentence level models. In Section 3, we introduce our new discriminative metric, Contrastive Entropy, which removes the normalization requirement associated with perplexity. In Section 4, we formulate recurrent neural networks as sentence level language models that we use for validation, and in Section 5 we analyze this new metric across various models on the Penn Treebank section of the WSJ dataset. We conclude this paper by hypothesizing a better correlation between WER and contrastive entropy, based on the fact that they share the same goal of minimizing errors in prediction.

## 2 Sentence level cross entropy rate

The perplexity defined in equation (2) can also be seen as the exponentiated cross entropy rate, $2^{H(T; m)}$, with the cross entropy approximated as

$$H(T; m) = -\frac{1}{N} \log_2 m(T) \tag{3}$$
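The relation between equations (2) and (3) can be checked numerically. The sketch below assumes a word level model whose test set probability factorizes into per-word conditional probabilities; the function names and the toy probabilities are illustrative assumptions, not values from the paper.

```python
import math


def cross_entropy_rate(word_log2_probs):
    """H(T; m) = -(1/N) * log2 m(T), where log2 m(T) is the sum of per-word terms."""
    n = len(word_log2_probs)
    if n == 0:
        raise ValueError("test set must contain at least one word")
    return -sum(word_log2_probs) / n


def perplexity(word_log2_probs):
    """Perplexity as the exponentiated cross entropy rate, 2 ** H(T; m)."""
    return 2.0 ** cross_entropy_rate(word_log2_probs)


# Hypothetical per-word probabilities assigned by some model to a 4-word test set.
probs = [0.25, 0.5, 0.125, 0.5]
log2_probs = [math.log2(p) for p in probs]

h = cross_entropy_rate(log2_probs)
ppl = perplexity(log2_probs)

# Direct evaluation of equation (2): inverse of the N-th root of m(T).
direct = 1.0 / math.prod(probs) ** (1.0 / len(probs))
print(h, ppl, direct)
```

Working with summed log probabilities rather than the product in equation (2) avoids numerical underflow on test sets of realistic size.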

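This approximation can be derived by viewing language as one continuous, infinite stream of words, leading to the following expression for the cross entropy rate:

$$H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{w_1^n \in \mathcal{W}^n} p(w_1^n) \log_2 m(w_1^n) \tag{4}$$

where $\mathcal{W}^n$ is the set of all sentences of length $n$.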

Now, assuming the language to be ergodic and stationary, the Shannon-McMillan-Breiman Theorem [9] states that (4) can be approximated using a single sequence that is long enough, hence

$$H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \log_2 m(w_1^n) \approx -\frac{1}{N} \log_2 m(T) \tag{5}$$

Here, $T$ is the test set and $N$ is the number of words in this test set.

In this derivation, language was seen as an infinite stream of words. If we instead build the sample space on sentences, we can define the cross entropy of language as an infinite stream of sentences as

$$H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{s_1^n \in \mathcal{D}^n} p(s_1^n) \log_2 m(s_1^n)$$

where $\mathcal{D}^n$ is the set of all documents containing $n$ sentences.

Now, applying the Shannon-McMillan-Breiman Theorem as we did in (5) and assuming that the sentences are independent and identically distributed, we can approximate the cross entropy rate of the sentence level model as

$$H(p, m) \approx -\frac{1}{N} \log_2 m(T) = -\frac{1}{N} \sum_{s \in T} \log_2 m(s) \tag{6}$$

where $N$ is the number of sentences in the test set $T$.

As the cross entropy still depends upon the exact probability, equation (6) is still intractable for unnormalized models. In the next section, we overcome this problem by defining a discriminative evaluation metric which, instead of trying to minimize the distance between the original distribution $p$ and the proposed distribution $m$, tries to maximize the ability of the model to discriminate the test set from its distorted version.

## 3 Contrastive Entropy and Contrastive Entropy Ratio

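Let $T$ be the test set. We pass this test set through a noisy channel and let the distorted version of this test set be $\hat{T}$. We now define the contrastive entropy rate as

$$
\begin{aligned}
H_C(T; d) &= H(\hat{T}; m) - H(T; m) \\
&= -\frac{1}{N} \log_2 \frac{\tilde{m}(\hat{T})}{\tilde{m}(T)}
\end{aligned}
\tag{7}
$$

Here, $d$ is a measure of the distortion introduced in the test set, $\tilde{m}$ is the unnormalized probability and $N$ is the size of the test set, which is the number of words for word level models and the number of sentences for sentence level models. Because $T$ and $\hat{T}$ have the same size, the normalization terms cancel in the ratio, so the partition function never needs to be computed.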
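The following minimal sketch computes the contrastive entropy of equation (7) from per-item unnormalized base-2 log scores (one score per word or per sentence, depending on the model). The function name and the toy scores are assumptions for illustration; the second call shows that subtracting a constant log partition term from every score leaves the value unchanged.

```python
def contrastive_entropy(log2_scores_test, log2_scores_distorted):
    """H_C = -(1/N) * log2( m~(T_hat) / m~(T) ), from per-item unnormalized log2 scores."""
    n = len(log2_scores_test)
    if n == 0 or n != len(log2_scores_distorted):
        raise ValueError("both sets must be non-empty and of equal size")
    # log2 m~(T) and log2 m~(T_hat) are sums of the per-item log scores.
    return (sum(log2_scores_test) - sum(log2_scores_distorted)) / n


# Hypothetical unnormalized log2 scores for three test items and their distorted versions.
test_scores = [-2.0, -3.0, -1.0]
distorted_scores = [-5.0, -4.0, -6.0]
h_c = contrastive_entropy(test_scores, distorted_scores)

# Normalizing every item by the same (assumed) log2 partition term does not change H_C.
log2_z = 7.5
h_c_normalized = contrastive_entropy(
    [x - log2_z for x in test_scores],
    [x - log2_z for x in distorted_scores],
)
print(h_c, h_c_normalized)
```

A positive value means the model assigns, on average, a higher score to the original test items than to their distorted counterparts.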

The intuition behind our evaluation technique is that the distorted test set can be seen as out-of-domain text, and that a superior language model should be better at discriminating in-domain text from malformed text that is less likely to have been generated by the same language source.

The metric proposed above still has a major drawback: it is not scale invariant. Let's say a model $M$ generates a probability distribution $m$ for test set $T$. We can simply cheat on this metric by proposing a model that exponentiates the probability by a factor of $k$, i.e. uses $m^k$, which multiplies the entropy by a factor of $k$. This limits the usefulness of contrastive entropy to intra-model comparison for hyper-parameter optimization.

We overcome this issue by reporting an additional value for each model, which we term the contrastive entropy ratio. The idea here is to choose a distortion level as the baseline, say 10%, and report the gain for a higher distortion level $d$, for example 30%, over this baseline distortion $d_b$:

$$H_{CR}(T; d_b, d) = \frac{H_C(T; d)}{H_C(T; d_b)} \tag{8}$$
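To make the scaling argument concrete, the sketch below raises a model's unnormalized probabilities to a power $k$ (which multiplies every log score by $k$) and compares contrastive entropy and contrastive entropy ratio before and after. All names and scores are hypothetical and chosen only for illustration.

```python
def contrastive_entropy(log2_scores_test, log2_scores_distorted):
    """H_C from per-item unnormalized log2 scores, as in equation (7)."""
    n = len(log2_scores_test)
    if n == 0 or n != len(log2_scores_distorted):
        raise ValueError("both sets must be non-empty and of equal size")
    return (sum(log2_scores_test) - sum(log2_scores_distorted)) / n


def contrastive_entropy_ratio(h_c_at_d, h_c_at_baseline):
    """H_CR = H_C(T; d) / H_C(T; d_b), as in equation (8)."""
    if h_c_at_baseline == 0:
        raise ValueError("baseline contrastive entropy must be non-zero")
    return h_c_at_d / h_c_at_baseline


def exponentiate(log2_scores, k):
    """Raising probabilities to the power k multiplies each log score by k."""
    return [k * s for s in log2_scores]


# Hypothetical log2 scores: test set, and its 10% and 30% distorted versions.
test = [-2.0, -3.0, -1.0]
dist10 = [-3.0, -3.5, -2.5]
dist30 = [-5.0, -4.0, -6.0]
k = 3.0

h10, h30 = contrastive_entropy(test, dist10), contrastive_entropy(test, dist30)
h10_k = contrastive_entropy(exponentiate(test, k), exponentiate(dist10, k))
h30_k = contrastive_entropy(exponentiate(test, k), exponentiate(dist30, k))

ratio = contrastive_entropy_ratio(h30, h10)
ratio_k = contrastive_entropy_ratio(h30_k, h10_k)
print(h30, h30_k)      # contrastive entropy grows by the factor k
print(ratio, ratio_k)  # the ratio is unaffected by the rescaling
```

The rescaled model looks three times better on contrastive entropy alone, while its ratio is identical, which is why the two numbers are reported together.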

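Table 1: Interpretation of contrastive entropy (rows) together with contrastive entropy ratio (columns) when comparing one model against another.

Contrastive entropy / Contrastive entropy ratio | Higher or similar | Lower
---|---|---
Higher | Superior | Scaling issues
Lower | Indeterminate | Inferior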

Neither of the two numbers can provide a complete picture in isolation. Contrastive entropy can be cheated upon by scaling the entropy. On the other hand, there is no guarantee that the contrastive entropy ratio would rise faster for a better discriminative model. Together, however, they balance each other out. Table 1 shows how to interpret these values. A model with higher contrastive entropy and a higher or similar contrastive entropy ratio performs better at discriminating the good examples from the bad ones. A higher contrastive entropy with a lower ratio means that the models use different scales, and a higher ratio with a lower contrastive entropy does not tell us much when comparing the two models.
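The decision rule of Table 1 can be written as a small helper. This is only a sketch: the paper does not quantify what counts as a "similar" ratio, so the relative tolerance `rel_tol` below is an assumed parameter, and the function name is hypothetical. The example calls use values from Tables 2 and 3 of this paper.

```python
def compare_models(h_c_a, ratio_a, h_c_b, ratio_b, rel_tol=0.02):
    """Verdict on model A relative to model B, following Table 1.

    rel_tol is an assumption: ratio_a counts as "higher or similar" when it is
    at least (1 - rel_tol) times ratio_b.
    """
    higher_entropy = h_c_a > h_c_b
    ratio_higher_or_similar = ratio_a >= ratio_b * (1.0 - rel_tol)
    if higher_entropy:
        return "superior" if ratio_higher_or_similar else "scaling issues"
    return "indeterminate" if ratio_higher_or_similar else "inferior"


# RNN vs. 3-gram KN at 30% test distortion (values from Tables 2 and 3).
print(compare_models(5.339, 2.097, 4.179, 2.096))
# sRNN-75(50) vs. RNN at 30% test distortion.
print(compare_models(3.961, 2.002, 5.339, 2.097))
# sRNN-75(50) vs. RNN at 50% test distortion.
print(compare_models(6.477, 3.275, 6.609, 2.596))
```

With the assumed 2% tolerance, the three calls reproduce the qualitative reading given in Section 5; a different tolerance could move borderline cases between "inferior" and "indeterminate".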

## 4 Sentence-level RNNLM

As the metric we proposed here benchmarks unnormalized models, in this section we propose a simple sentence level language model that we can use to show the efficacy of our metric. This new model is simply an unfolded Recurrent Neural Network Language Model [10] built at the sentence level and trained to maximize the margin between a valid sentence and its distorted version.

The Recurrent Neural Network based Language Model can be defined recursively using the following equations

$$x(t) = \left[ w(t)^\top, \; s(t-1)^\top \right]^\top \tag{9}$$

$$s(t) = f\left( W x(t) \right) \tag{10}$$

$$y(t) = g\left( U s(t) \right) \tag{11}$$

where $w(t)$ is the vector encoding of the word at step $t$, $s(t)$ is the hidden state, and $W$ and $U$ are weight matrices.

Equations (9) and (10) can be seen as building latent space representations of phrases using words and history, and (11) can be seen as predicting the probability of the word given the context. The phrasal representation built in (9) and (10) is then treated as the history for the next step. A standard sigmoidal nonlinearity is used for $f$ and the probability distribution function $g$ is a standard softmax.

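If we limit the context to the sentence level and move the probability space to sequences of words or n-grams, equations (9) and (10) can be seen as a composition function building the phrase $w_1^l$, of length $l$, from the sub-phrase $w_1^{l-1}$, of length $l-1$, and the $l$th word $w_l$. Equation (11) can be seen as building the unnormalized probability over the phrase $w_1^l$. We can rephrase equations (9), (10) and (11) as

$$
\begin{aligned}
x(w_1^l) &= \left[ w_l^\top, \; s(w_1^{l-1})^\top \right]^\top \\
s(w_1^l) &= f\left( W x(w_1^l) \right)
\end{aligned}
\tag{12}
$$

$$y(w_1^l) = g\left( U s(w_1^l) \right) \tag{13}$$

Here we use the standard sigmoidal nonlinearity for the function $f$ and the identity function for $g$, so that $y(w_1^l)$ is a scalar score for the phrase.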

We now define the score of a length $n$ sentence $s = w_1^n$ as

$$S(s) = \sum_{l=1}^{n} y(w_1^l) \tag{14}$$
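The sentence score can be sketched in NumPy as a recurrence that accumulates one scalar output per prefix of the sentence. This is not the authors' Theano implementation: the one-hot word encoding, the zero initial state, the absence of bias terms, the use of a weight vector `u` in place of the output matrix, and all names are assumptions made for this illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sentence_score(word_ids, W, u, vocab_size):
    """S(s) = sum over prefixes w_1^l of y(w_1^l), with y = u . s(w_1^l)."""
    hidden_size = W.shape[0]
    state = np.zeros(hidden_size)  # assumed representation of the empty prefix
    score = 0.0
    for word_id in word_ids:
        one_hot = np.zeros(vocab_size)
        one_hot[word_id] = 1.0
        x = np.concatenate([one_hot, state])  # [current word; previous phrase]
        state = sigmoid(W @ x)                # phrase representation s(w_1^l)
        score += float(u @ state)             # identity output: scalar y(w_1^l)
    return score


# Hypothetical sizes and randomly initialised parameters.
rng = np.random.default_rng(0)
vocab_size, hidden_size = 10, 4
W = rng.normal(scale=0.1, size=(hidden_size, vocab_size + hidden_size))
u = rng.normal(scale=0.1, size=hidden_size)

score = sentence_score([3, 1, 4], W, u, vocab_size)
print(score)
```

The score is unnormalized by construction: no softmax over the vocabulary is computed at any step, which is what makes a normalization-free metric necessary for this model.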

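The probability of the sentence can now be modeled as an exponential distribution

$$p(s) = \frac{1}{Z} \exp\left( -S(s) \right) \tag{15}$$

where $Z$ is the partition function, and the contrastive entropy from (7) can be calculated as

$$H_C(T; d) = -\frac{1}{N} \sum_{s \in T} \log_2 \frac{\exp\left( -S(\hat{s}_d) \right)}{\exp\left( -S(s) \right)} = \frac{\log_2 e}{N} \sum_{s \in T} \left( S(\hat{s}_d) - S(s) \right) \tag{16}$$

where $\hat{s}_d$ is the distorted version of $s$ with distortion percentage $d$.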

Training is done using a contrastive criterion where we try to maximize the distance between the in-domain sentence and its distorted version. This formulation is similar to one followed by Collobert et al. [8] and Okanohara et al. [7] for language modeling and by Smith and Eisner [4] for POS tagging. Mathematically, we can define this pseudo discriminative training objective as

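$$\theta^* = \arg\min_{\theta} \sum_{s \in T_{\mathrm{train}}} \max\left\{ 0, \; 1 - S(\hat{s}) + S(s) \right\} \tag{17}$$

where $T_{\mathrm{train}}$ is the training set, $\hat{s}$ is the distorted version of sentence $s$ and $\theta$ is the parameter of the model.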
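A margin-based contrastive criterion of this kind can be sketched as a hinge loss over pairs of scores. The sketch assumes an energy-style convention in which a lower score means a more probable sentence, and a unit margin; both are assumptions of this example, as are the function name and the toy scores.

```python
def margin_loss(scores_clean, scores_distorted, margin=1.0):
    """Sum over sentence pairs of max(0, margin - S(s_hat) + S(s)).

    Assumes lower scores mean more probable sentences, so a pair contributes
    no loss once the distorted sentence scores at least `margin` higher than
    the original one.
    """
    if len(scores_clean) != len(scores_distorted):
        raise ValueError("each sentence needs exactly one distorted counterpart")
    return sum(
        max(0.0, margin - s_hat + s)
        for s, s_hat in zip(scores_clean, scores_distorted)
    )


# Hypothetical scores: the first pair already satisfies the margin,
# the second pair violates it by 0.5.
clean = [1.0, 2.0]
distorted = [3.0, 2.5]
loss = margin_loss(clean, distorted)
print(loss)
```

Only pairs that violate the margin contribute gradient, so training effort concentrates on sentences the model cannot yet separate from their distorted versions.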

This simplistic sentence level recurrent neural network model is implemented in Python using Theano [11] and is available at https://github.com/kushalarora/sentenceRNN.

## 5 Experiments

We use the Penn Treebank dataset with the following splits and preprocessing: sections 0-20 were used as training data, sections 21-22 for validation and sections 23-24 for testing. The training, validation and testing token sizes are 930k, 74k and 82k respectively. The vocabulary is limited to 10k words, with all words outside this set mapped to a special token `<unk>`.

We start by examining the distortion generation mechanism. As the evaluation includes word level models, we need to preserve the word count. To do this, we restrict distortions to two types: substitutions and transpositions. For a substitution, we replace the current word in the sentence with a random word from the vocabulary. For a transposition, we randomly select a word from the same sentence and swap it with the current one. For each word in a sentence, there are three possible outcomes: no distortion with probability $x_N$, substitution with probability $x_S$ and transposition with probability $x_T$, with $x_N + x_S + x_T = 1$.
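The distortion mechanism described above can be sketched as follows. The function name, the parameter names and the toy vocabulary are assumptions for illustration; note that, as written, a substitution may happen to draw the same word and a transposition may select the current position, in which case the word is left unchanged.

```python
import random


def distort_sentence(words, vocabulary, p_substitute, p_transpose, rng=None):
    """Return a distorted copy of `words` with the same number of words.

    Each position is independently left alone, substituted with a random
    vocabulary word, or swapped with a random position of the same sentence.
    """
    if p_substitute < 0 or p_transpose < 0 or p_substitute + p_transpose > 1:
        raise ValueError("probabilities must be non-negative and sum to at most 1")
    rng = rng or random.Random()
    distorted = list(words)
    for i in range(len(distorted)):
        r = rng.random()
        if r < p_substitute:
            # Substitution: replace the current word with a random vocabulary word.
            distorted[i] = rng.choice(vocabulary)
        elif r < p_substitute + p_transpose:
            # Transposition: swap the current word with a randomly chosen one.
            j = rng.randrange(len(distorted))
            distorted[i], distorted[j] = distorted[j], distorted[i]
        # Otherwise: no distortion, with probability 1 - p_substitute - p_transpose.
    return distorted


# Hypothetical vocabulary and sentence.
vocabulary = ["the", "cat", "dog", "sat", "ran", "on", "mat", "<unk>"]
sentence = "the cat sat on the mat".split()
print(distort_sentence(sentence, vocabulary, 0.15, 0.15, random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the distorted test set reproducible, which matters when the same distorted set is scored by several models.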

Now, let's start by considering the sentence level RNN model proposed in Section 4. For contrastive entropy to be a good measure for sentence level models, the following assertions should be true: i) contrastive entropy should monotonically increase with distortion, ii) contrastive entropy of the training set should increase with each epoch, and iii) contrastive entropy should increase with an increase in the training distortion margin. Figures 1, 2 and 3 show that the assertions made above hold empirically. We see a monotonic increase in contrastive entropy with distortion and training distortion margin in Figures 1 and 3 respectively. Figure 2 shows the increase in contrastive entropy for the training data with epochs. All sentence level RNN models referred to above and elsewhere in this paper were trained using gradient descent with learning rate of 0.1 and regularization coefficient of .

Finally, we would like to compare the standard word level baseline models and our sentence level language model on this new metric. The objective here is to verify the hypothesis that between two language models, the superior one should be able to better distinguish the test sentences from their distorted versions. This is akin to saying that a better language model should have a higher contrastive entropy value with a similar or higher contrastive entropy ratio. Tables 2 and 3 show the results of our experiments. The results were generated using the open source SRILM language modeling toolkit [12] for the n-gram models and the RNNLM toolkit [13] for the RNN language model. The RNN model used had a hidden layer of size 200, with class size of 50. The sRNN-75(10) row in Tables 2 and 3 indicates that the sentence level RNN model was trained with a latent space size of 75 and a training distortion level of 10%. All the results here were averaged over 10 runs.

Table 2: Perplexity and contrastive entropy at test distortion levels of 10%, 30% and 50%.

Model | Perplexity | 10% | 30% | 50%
---|---|---|---|---
3-gram KN | 148.28 | 1.993 | 4.179 | 5.279
5-gram KN | 141.46 | 2.021 | 4.198 | 5.308
RNN | 141.31 | 2.546 | 5.339 | 6.609
sRNN-75(50) | - | 1.978 | 3.961 | 6.477
sRNN-75(10) | - | 2.339 | 6.759 | 11.01
sRNN-150(10) | - | 2.547 | 7.581 | 12.925

As hypothesized, the contrastive entropy in each distortion column of Table 2 rises as perplexity falls across the word level models, i.e. the models expected to do better on perplexity do better on contrastive entropy as well. Rows 4 to 6 compare sentence level RNN models. Here too, as expected, the sRNN trained with a distortion level of 10% outperforms the sRNN trained with a distortion margin of 50%. Now, let's compare the word level models to our sentence level models. We can see that sRNN-75(50) performs worse than the RNN at all distortion levels and worse than the 3-gram and 5-gram models at 10% and 30%. This can be attributed to the training distortion margin of 50%, which encourages the sRNN to treat anything with less than 50% distortion as an in-domain sentence. On the other hand, the sRNNs trained with a distortion level of 10% outperform all other models at test distortion levels of 30% and 50%, as they have been tuned to label slightly ungrammatical sentences, or ones with slightly unnatural structure, as out of domain.

Table 3 shows that scaling is not an issue for word level models, as their ratios are more or less the same. Sentence level models trained at 10% distortion do better than all the word level models on both metrics, which demonstrates their superior performance. sRNN-75(50) is an interesting case. At a test distortion level of 30% it is clearly inferior to all word level models, as it was trained with a distortion margin of 50%. With 50% test distortion the result is unclear, as it does worse on contrastive entropy but better on contrastive entropy ratio.

Table 3: Contrastive entropy ratio at test distortion levels of 30% and 50% over the 10% baseline.

Model | 30%/10% | 50%/10%
---|---|---
3-gram KN | 2.096 | 2.649
5-gram KN | 2.077 | 2.626
RNN | 2.097 | 2.596
sRNN-75(50) | 2.002 | 3.275
sRNN-75(10) | 2.890 | 5.257
sRNN-150(10) | 2.976 | 5.074

## 6 Conclusion

In this paper we proposed a new evaluation criterion that can be used to evaluate unnormalized language models and showed, using examples, its efficacy in comparing sentence level models among themselves and to word level models. As both WER and contrastive entropy are discriminative measures, we hypothesize that contrastive entropy should have a better correlation with WER than perplexity does.

We also proposed a discriminatively trained sentence level formulation of recurrent neural networks which outperformed the current state of the art RNN models on our new metric. We hypothesize that this formulation of RNN does a better job at discriminative tasks like lattice re-scoring than the standard RNN and other traditional language modeling techniques. We conclude by restating that a metric is meaningful only if it can measure improvements in real world applications. Further experiments evaluating contrastive entropy's correlation with the WER and BLEU metrics over a wide range of datasets are required to unquestionably demonstrate the usefulness of this metric. Similarly, to establish the superior discriminative ability of sentence level RNNs over standard RNNs, we must compare their performance on real-world discriminative tasks like n-best list re-scoring.

## References

- [1] R. Iyer, M. Ostendorf, and M. Meteer, “Analyzing and predicting language model improvements,” in Automatic Speech Recognition and Understanding, 1997. Proceedings., 1997 IEEE Workshop on. IEEE, 1997, pp. 254–261.
- [2] S. F. Chen, D. Beeferman, and R. Rosenfeld, “Evaluation metrics for language models,” 1998.
- [3] P. Clarkson, T. Robinson et al., “Towards improved language model evaluation measures.” in EUROSPEECH, 1999.
- [4] N. A. Smith and J. Eisner, “Contrastive estimation: Training log-linear models on unlabeled data,” in Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2005, pp. 354–362.
- [5] A. Sethy, S. Chen, E. Arisoy, and B. Ramabhadran, “Unnormalized exponential and neural network language models,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5416–5420.
- [6] R. Rosenfeld, S. F. Chen, and X. Zhu, “Whole-sentence exponential language models: a vehicle for linguistic-statistical integration,” Computer Speech & Language, vol. 15, no. 1, pp. 55–73, 2001.
- [7] D. Okanohara and J. Tsujii, “A discriminative language model with pseudo-negative samples,” in Annual Meeting-Association for Computational Linguistics, vol. 45, no. 1, 2007, p. 73.
- [8] R. Collobert and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in Proceedings of the 25th international conference on Machine learning. ACM, 2008, pp. 160–167.
- [9] P. H. Algoet and T. M. Cover, “A sandwich proof of the Shannon-McMillan-Breiman theorem,” The annals of probability, pp. 899–909, 1988.
- [10] T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur, “Recurrent neural network based language model.” in INTERSPEECH, 2010, pp. 1045–1048.
- [11] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, “Theano: a CPU and GPU math expression compiler,” in Proceedings of the Python for Scientific Computing Conference (SciPy), Jun. 2010, oral Presentation.
- [12] A. Stolcke, “SRILM - an extensible language modeling toolkit,” in INTERSPEECH, 2002.
- [13] T. Mikolov, S. Kombrink, A. Deoras, L. Burget, and J. Cernocky, “RNNLM-recurrent neural network language modeling toolkit,” in Proc. of the 2011 ASRU Workshop, 2011, pp. 196–201.