Posterior calibration and exploratory analysis for natural language processing models
Supplementary information for “Posterior calibration and exploratory analysis for natural language processing models” (EMNLP 2015)
Abstract
Many models in natural language processing define probabilistic distributions over linguistic structures. We argue that (1) the quality of a model’s posterior distribution can and should be directly evaluated, as to whether probabilities correspond to empirical frequencies; and (2) NLP uncertainty can be projected not only to pipeline components, but also to exploratory data analysis, telling a user when to trust and not trust the NLP analysis. We present a method to analyze calibration, and apply it to compare the miscalibration of several commonly used models. We also contribute a coreference sampling algorithm that can create confidence intervals for a political event extraction task. [1]
[1] This is the extended version of a paper published in Proceedings of EMNLP 2015. This version includes acknowledgments and an appendix. For all materials, see: http://brenocon.com/nlpcalib/
Khanh Nguyen Department of Computer Science University of Maryland, College Park College Park, MD 20742 kxnguyen@cs.umd.edu Brendan O’Connor College of Information and Computer Sciences University of Massachusetts, Amherst Amherst, MA, 01003 brenocon@cs.umass.edu
1 Introduction
Natural language processing systems are imperfect. Decades of research have yielded analyzers that misidentify named entities, misattach syntactic relations, and misrecognize noun phrase coreference anywhere from 10-40% of the time. But these systems are accurate enough that their outputs can be used as soft, if noisy, indicators of language meaning for use in downstream analysis, such as systems that perform question answering, machine translation, event extraction, and narrative analysis (McCord et al., 2012; Gimpel and Smith, 2008; Miwa et al., 2010; Bamman et al., 2013).
To understand the performance of an analyzer, researchers and practitioners typically measure the accuracy of individual labels or edges among a single predicted output structure $y$, such as a most-probable tagging or entity clustering (conditional on text data $x$).
But a probabilistic model gives a probability distribution $P(y \mid x)$ over many other output structures that have smaller predicted probabilities; a line of work has sought to control cascading pipeline errors by passing on multiple structures from earlier stages of analysis, by propagating prediction uncertainty through multiple samples (Finkel et al., 2006), $K$-best lists (Venugopal et al., 2008; Toutanova et al., 2008), or explicitly diverse lists (Gimpel et al., 2013); often the goal is to marginalize over structures to calculate and minimize an expected loss function, as in minimum Bayes risk decoding (Goodman, 1996; Kumar and Byrne, 2004), or to perform joint inference between early and later stages of NLP analysis (e.g. Singh et al., 2013; Durrett and Klein, 2014).
These approaches should work better when the posterior probabilities of the predicted linguistic structures reflect actual probabilities of the structures or aspects of the structures. For example, say a model is overconfident: it places too much probability mass in the top prediction, and not enough in the rest. Then there will be little benefit to using the lower probability structures, since in the training or inference objectives they will be incorrectly outweighed by the top prediction (or in a sampling approach, they will be systematically undersampled and thus have too-low frequencies). If we only evaluate models based on their top predictions or on downstream tasks, it is difficult to diagnose this issue.
Instead, we propose to directly evaluate the calibration of a model’s posterior prediction distribution. A perfectly calibrated model knows how often it’s right or wrong; when it predicts an event with 80% confidence, the event empirically turns out to be true 80% of the time. While perfect accuracy for NLP models remains an unsolved challenge, perfect calibration is a more achievable goal, since a model that has imperfect accuracy could, in principle, be perfectly calibrated. In this paper, we develop a method to empirically analyze calibration that is appropriate for NLP models (§3) and use it to analyze common generative and discriminative models for tagging and classification (§4).
Furthermore, if a model’s probabilities are meaningful, that would justify using its probability distributions for any downstream purpose, including exploratory analysis on unlabeled data. In §6 we introduce a representative corpus exploration problem, identifying temporal event trends in international politics, with a method that is dependent on coreference resolution. We develop a coreference sampling algorithm (§5.2) which projects uncertainty into the event extraction, inducing a posterior distribution over event frequencies. Sometimes the event trends have very high posterior variance (large confidence intervals), [2] reflecting when the NLP system genuinely does not know the correct semantic extraction. This highlights an important use of a calibrated model: being able to tell a user when the model’s predictions are likely to be incorrect, or at least, not giving a user a false sense of certainty from an erroneous NLP analysis.
[2] We use the terms confidence interval and credible interval interchangeably in this work; the latter term is debatably more correct, though less widely familiar.
2 Definition of calibration
Consider a binary probabilistic prediction problem, which consists of binary labels and probabilistic predictions for them. Each instance has a ground-truth label $y \in \{0, 1\}$, which is used for evaluation. The prediction problem is to generate a predicted probability or prediction strength $q \in [0, 1]$. Typically, we use some form of a probabilistic model to accomplish this task, where $q$ represents the model’s posterior probability [3] of the instance having a positive label ($y = 1$).
[3] Whether $q$ comes from a Bayesian posterior or not is irrelevant to the analysis in this section. All that matters is that predictions are numbers $q \in [0, 1]$.
Let $\{(q_1, y_1), (q_2, y_2), \ldots, (q_N, y_N)\}$ be the set of prediction-label pairs produced by the model. Many metrics assess the overall quality of how well the predicted probabilities match the data, such as the familiar cross entropy (negative average log-likelihood),
$$L_\ell(\vec{y}, \vec{q}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log q_i + (1 - y_i) \log(1 - q_i) \right]$$
or mean squared error, also known as the Brier score when $y$ is binary (Brier, 1950),
$$L_2(\vec{y}, \vec{q}) = \frac{1}{N} \sum_{i=1}^{N} (y_i - q_i)^2$$
Both tend to attain better (lower) values when $q$ is near 1 when $y = 1$, and near 0 when $y = 0$; and they achieve a perfect value of 0 when all $q_i = y_i$. [4]
[4] These two loss functions are instances of proper scoring rules (Gneiting and Raftery, 2007; Bröcker, 2009).
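As a small illustration of the two scores just described, the following sketch computes both on a handful of prediction-label pairs. The function names and the toy data are assumptions made for this example, not part of the original experiments.

```python
import math

def cross_entropy(pairs):
    """Negative average log-likelihood of the labels under the predictions.

    `pairs` is a list of (q, y) with q in [0, 1] and y in {0, 1}.
    Note: math.log raises ValueError if a label is assigned probability 0.
    """
    total = 0.0
    for q, y in pairs:
        prob_of_label = q if y == 1 else 1.0 - q
        total += -math.log(prob_of_label)
    return total / len(pairs)

def brier_score(pairs):
    """Mean squared error between predictions q and binary labels y."""
    return sum((y - q) ** 2 for q, y in pairs) / len(pairs)

# Hypothetical toy data: (prediction strength, gold label).
pairs = [(0.9, 1), (0.2, 0), (0.6, 1), (0.7, 0)]
print(cross_entropy(pairs), brier_score(pairs))
```

Both functions return 0 when every prediction equals its label, matching the perfect value mentioned in the text.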
Let $P(y, q)$ be the joint empirical distribution over labels and predictions. Under this notation, $L_2 = E_{q,y}[(y - q)^2]$. Consider the factorization
$$P(y, q) = P(y \mid q)\, P(q)$$
where $P(y \mid q)$ denotes the label empirical frequency, conditional on a prediction strength (Murphy and Winkler, 1987). [5] Applying this factorization to the Brier score leads to the calibration-refinement decomposition (DeGroot and Fienberg, 1983), in terms of expectations with respect to the prediction strength distribution $P(q)$:
$$L_2 = \underbrace{E_q\left[(q - p_q)^2\right]}_{\text{calibration MSE}} + \underbrace{E_q\left[p_q (1 - p_q)\right]}_{\text{refinement}} \qquad (1)$$
where we denote $p_q \equiv P(y = 1 \mid q)$ for brevity.
[5] We alternatively refer to this as label frequency or empirical frequency. The $P$ probabilities can be thought of as frequencies from the hypothetical population the data and predictions are drawn from. $P$ probabilities are, definitionally speaking, completely separate from a probabilistic model that might be used to generate predictions.
Here, calibration measures to what extent a model’s probabilistic predictions match their corresponding empirical frequencies. Perfect calibration is achieved when $p_q = q$ for all $q$; intuitively, if you aggregate all instances where a model predicted $q$, they should have $y = 1$ a fraction $q$ of the time. We define the magnitude of miscalibration using root mean squared error:
Definition 1 (RMS calibration error).
$$\mathrm{CalibErr} = \sqrt{E_q\left[(q - p_q)^2\right]}$$
The second term of Eq. 1 refers to refinement, which reflects to what extent the model is able to separate different labels (in terms of the conditional Gini entropy $p_q (1 - p_q)$). If the prediction strengths tend to cluster around 0 or 1, the refinement score tends to be lower. The calibration-refinement breakdown offers a useful perspective on the accuracy of a model posterior. This paper focuses on calibration.
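The decomposition can be checked numerically when predictions take only a few distinct values, since the label frequency for each value is then just a group average. The sketch below assumes that setting; `decompose_brier` and the toy pairs are hypothetical names and data for illustration only.

```python
from collections import defaultdict

def decompose_brier(pairs):
    """Split the Brier score into calibration and refinement terms.

    Groups the (q, y) pairs by distinct prediction value q and uses the
    within-group label frequency as p_q. Returns (calibration, refinement).
    """
    groups = defaultdict(list)
    for q, y in pairs:
        groups[q].append(y)
    n = len(pairs)
    calibration = 0.0
    refinement = 0.0
    for q, labels in groups.items():
        weight = len(labels) / n          # empirical P(q)
        p_q = sum(labels) / len(labels)   # empirical P(y = 1 | q)
        calibration += weight * (q - p_q) ** 2
        refinement += weight * p_q * (1.0 - p_q)
    return calibration, refinement

# Toy data: the model predicts 0.8 or 0.3; the true frequencies are 0.75 and 0.25.
pairs = [(0.8, 1), (0.8, 1), (0.8, 0), (0.8, 1),
         (0.3, 0), (0.3, 1), (0.3, 0), (0.3, 0)]
calibration, refinement = decompose_brier(pairs)
brier = sum((y - q) ** 2 for q, y in pairs) / len(pairs)
print(calibration, refinement, brier)  # calibration + refinement equals brier
print(calibration ** 0.5)              # RMS calibration error of Definition 1
```

In this toy case every prediction is off by 0.05 from its label frequency, so the RMS calibration error is 0.05 even though most of the Brier score comes from the refinement term.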
There are several other ways to break down squared error, log-likelihood, and other probabilistic scoring rules. [6] We use the Brier-based calibration error in this work, since unlike cross-entropy it does not tend toward infinity when near probability 0; we hypothesize this could be an issue since both $p$ and $q$ are subject to estimation error.
[6] They all include a notion of calibration corresponding to a Bregman divergence (Bröcker, 2009); for example, cross-entropy can be broken down such that KL divergence is the measure of miscalibration.
3 Empirical calibration analysis
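Algorithm 1: Calibration error using adaptive binning.
Input: A set of $N$ prediction-label pairs $\{(q_1, y_1), (q_2, y_2), \ldots, (q_N, y_N)\}$.
Output: Calibration error.
Parameter: Target bin size $\beta$.
Step 1: Sort pairs by prediction values $q_k$ in ascending order.
Step 2: For each $k$, assign bin label $b_k = \lfloor (k - 1) / \beta \rfloor + 1$.
Step 3: Define each bin $B_i$ as the set of indices of pairs that have the same bin label. If the last bin has size less than $\beta$, merge it with the second-to-last bin (if one exists). Let $\{B_1, B_2, \ldots, B_T\}$ be the set of bins.
Step 4: Calculate empirical and predicted probabilities per bin:
$$\hat{p}_i = \frac{1}{|B_i|} \sum_{k \in B_i} y_k \qquad \hat{q}_i = \frac{1}{|B_i|} \sum_{k \in B_i} q_k$$
Step 5: Calculate the calibration error as the root mean squared error per bin, weighted by bin size in case they are not uniformly sized:
$$\mathrm{CalibErr} = \sqrt{\frac{1}{N} \sum_{i=1}^{T} |B_i| \, (\hat{q}_i - \hat{p}_i)^2}$$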
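Algorithm 2: Confidence interval for the calibration error by sampling.
Input: A set of $N$ prediction-label pairs $\{(q_1, y_1), (q_2, y_2), \ldots, (q_N, y_N)\}$.
Output: Calibration error with a 95% confidence interval.
Parameter: Number of samples, $S$.
Step 1: Calculate $\{\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_T\}$ from Step 4 of Algorithm 1.
Step 2: Draw $S$ samples. For each $s = 1, \ldots, S$:
(a) For each bin $i = 1, \ldots, T$, draw $\hat{p}_i^{(s)} \sim \mathcal{N}(\hat{p}_i, \hat{\sigma}_i^2)$, where $\hat{\sigma}_i^2 = \hat{p}_i (1 - \hat{p}_i) / |B_i|$. If necessary, clip to $[0, 1]$: $\hat{p}_i^{(s)} := \min(1, \max(0, \hat{p}_i^{(s)}))$.
(b) Calculate the sample’s $\mathrm{CalibErr}^{(s)}$ using the pairs $(\hat{q}_i, \hat{p}_i^{(s)})$ as per Step 5 of Algorithm 1.
Step 3: Calculate the 95% confidence interval for the calibration error as:
$$\mathrm{CalibErr}_{\mathrm{avg}} \pm 1.96\, \hat{s}_{\mathrm{error}}$$
where $\mathrm{CalibErr}_{\mathrm{avg}}$ and $\hat{s}_{\mathrm{error}}$ are the mean and the standard deviation, respectively, of the $\mathrm{CalibErr}^{(s)}$ values calculated from the samples.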
From a test set of labeled data, we can analyze model calibration both in terms of the calibration error and by visualizing the calibration curve of label frequency versus predicted strength. However, computing the label frequencies $P(y = 1 \mid q)$ exactly requires an infinite amount of data. Thus approximation methods are required to perform calibration analysis.
3.1 Adaptive binning procedure
Previous studies that assess calibration in supervised machine learning models (Niculescu-Mizil and Caruana, 2005; Bennett, 2000) calculate label frequencies by dividing the prediction space into deciles or other evenly spaced bins (e.g. $q \in [0, 0.1)$, $q \in [0.1, 0.2)$, etc.) and then calculating the empirical label frequency in each bin. This procedure may be thought of as using a form of nonparametric regression (specifically, a regressogram; Tukey 1961) to estimate the function $f(q) = P(y = 1 \mid q)$ from observed data points. But models in natural language processing give very skewed distributions of confidence scores (many are near 0 or 1), so this procedure performs poorly, having much more variable estimates near the middle of the $q$ distribution (Figure 1).
We propose adaptive binning as an alternative. Instead of dividing the interval $[0, 1]$ into fixed-width bins, adaptive binning defines the bins such that there are an equal number of points in each, after which the same averaging procedure is used. This method naturally gives wider bins to areas with fewer data points (areas that require more smoothing), and ensures that these areas have roughly similar standard errors as those near the boundaries, since for a bin with $\beta$ points and empirical frequency $\hat{p}$, the standard error is estimated by $\sqrt{\hat{p}(1 - \hat{p}) / \beta}$, which is bounded above by $0.5 / \sqrt{\beta}$. Algorithm 1 describes the procedure for estimating calibration error using adaptive binning, which can be applied to any probabilistic model that predicts posterior probabilities.
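The adaptive binning procedure of Algorithm 1 can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation; the function names and the `(q_hat, p_hat, size)` tuple layout are assumptions chosen for this example, and the input is assumed to be non-empty.

```python
import math

def adaptive_bins(pairs, beta):
    """Steps 1-3: sort (q, y) pairs by q and cut them into bins of beta points.

    A final bin smaller than beta is merged into the previous bin, if any.
    """
    ordered = sorted(pairs, key=lambda pair: pair[0])
    bins = [ordered[start:start + beta] for start in range(0, len(ordered), beta)]
    if len(bins) > 1 and len(bins[-1]) < beta:
        leftover = bins.pop()
        bins[-1].extend(leftover)
    return bins

def bin_stats(bins):
    """Step 4: per bin, (mean prediction q_hat, label frequency p_hat, size)."""
    return [
        (sum(q for q, _ in b) / len(b), sum(y for _, y in b) / len(b), len(b))
        for b in bins
    ]

def calib_err(stats):
    """Step 5: root mean squared error over bins, weighted by bin size."""
    n = sum(size for _, _, size in stats)
    weighted = sum(size * (q_hat - p_hat) ** 2 for q_hat, p_hat, size in stats)
    return math.sqrt(weighted / n)

# Hypothetical toy data with a deliberately tiny bin size.
pairs = [(0.9, 1), (0.1, 0), (0.8, 1), (0.2, 0), (0.3, 1)]
stats = bin_stats(adaptive_bins(pairs, beta=2))
print(stats, calib_err(stats))
```

With five points and a target bin size of 2, the last single-point bin is merged into its neighbor, giving bins of sizes 2 and 3. The per-bin tuples are also the points one would plot for a calibration curve.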
3.2 Confidence interval estimation
Especially when the test set is small, estimating calibration error may be subject to error, due to uncertainty in the label frequency estimates. Since how to estimate confidence bands for nonparametric regression is an unsolved problem (Wasserman, 2006), we resort to a simple method based on the binning. We construct a binomial normal approximation for the label frequency estimate in each bin, and simulate from it; every simulation across all bins is used to construct a calibration error; these simulated calibration errors are collected to construct a normal approximation for the calibration error estimate. Since we use large bin sizes in our experiments, the central limit theorem justifies these approximations. We report all calibration errors along with their 95% confidence intervals calculated by Algorithm 2. [7]
[7] A major unsolved issue is how to fairly select the bin size. If it is too large, the curve is oversmoothed and calibration looks better than it should be; if it is too small, calibration looks worse than it should be. Bandwidth selection and cross-validation techniques may better address this problem in future work. In the meantime, visualizations of calibration curves help inform the reader of the resolution of a particular analysis: if the bins are far apart, the data is sparse, and the specific details of the curve are not known in those regions.
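The simulation described here can be sketched as follows, taking per-bin summaries `(q_hat, p_hat, size)` as input. The function names, the tuple layout, and the fixed random seed are assumptions made for this example; at least two samples are needed for the standard deviation.

```python
import math
import random
import statistics

def calib_err(stats):
    """Size-weighted RMS gap between per-bin q_hat and p_hat."""
    n = sum(size for _, _, size in stats)
    return math.sqrt(sum(size * (q - p) ** 2 for q, p, size in stats) / n)

def calib_err_interval(stats, num_samples, seed=0):
    """Normal-approximation 95% interval for the calibration error."""
    rng = random.Random(seed)
    errors = []
    for _ in range(num_samples):
        resampled = []
        for q_hat, p_hat, size in stats:
            # Binomial normal approximation for the bin's label frequency.
            sigma = math.sqrt(p_hat * (1.0 - p_hat) / size)
            p_draw = rng.gauss(p_hat, sigma)
            p_draw = min(1.0, max(0.0, p_draw))  # clip to [0, 1]
            resampled.append((q_hat, p_draw, size))
        errors.append(calib_err(resampled))
    mean = statistics.mean(errors)
    sd = statistics.stdev(errors)
    return mean - 1.96 * sd, mean + 1.96 * sd

# Hypothetical per-bin summaries: (mean prediction, label frequency, bin size).
stats = [(0.2, 0.25, 5000), (0.8, 0.70, 5000)]
print(calib_err(stats), calib_err_interval(stats, num_samples=1000))
```

Because the simulated label frequencies add noise to each bin, the mean of the simulated errors sits slightly above the point estimate; with large bins the difference is small.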
3.3 Visualizing calibration
In order to better understand a model’s calibration properties, we plot the pairs $(\hat{q}_1, \hat{p}_1), (\hat{q}_2, \hat{p}_2), \ldots, (\hat{q}_T, \hat{p}_T)$ obtained from the adaptive binning procedure to visualize the calibration curve of the model; this visualization is known as a calibration or reliability plot. It provides finer-grained insight into the calibration behavior in different prediction ranges. A perfectly calibrated curve would coincide with the diagonal line. When the curve lies above the diagonal, the model is underconfident ($q < p_q$); and when it is below the diagonal, the model is overconfident ($q > p_q$).
An advantage of plotting a curve estimated from fixed-size bins, instead of fixed-width bins, is that the distribution of the points hints at the refinement aspect of the model’s performance. If the points’ positions tend to cluster in the bottom-left and top-right corners, that implies the model is making more refined predictions.
4 Calibration for classification and tagging models
Using the method described in §3, we assess the quality of posterior predictions of several classification and tagging models. In all of our experiments, we set the target bin size in Algorithm 1 to be 5,000 and the number of samples in Algorithm 2 to be 10,000.
4.1 Naive Bayes and logistic regression
4.1.1 Introduction
Previous work on Naive Bayes has found its probabilities to have calibration issues, in part due to its incorrect conditional independence assumptions (Niculescu-Mizil and Caruana, 2005; Bennett, 2000; Domingos and Pazzani, 1997). Since logistic regression has the same log-linear representational capacity (Ng and Jordan, 2002) but does not suffer from the independence assumptions, we select it for comparison, hypothesizing it may have better calibration.
We analyze a binary classification task of Twitter sentiment analysis from emoticons. We collect a dataset consisting of tweets identified by the Twitter API as English, collected from 2014 to 2015, with the “emoticon trick” (Read, 2005; Lin and Kolcz, 2012) to label tweets that contain at least one occurrence of the smiley emoticon “:)” as “happy” ($y = 1$) and others as $y = 0$. The smiley emoticons are deleted in positive examples. We sampled three sets of tweets (subsampled from the Decahose/Gardenhose stream of public tweets) with Jan-Apr 2014 for training, May-Dec 2014 for development, and Jan-Apr 2015 for testing. Each set is split evenly between positive and negative instances. We use binary features based on unigrams extracted from the twokenize.py [8] tokenization. We use the scikit-learn (Pedregosa et al., 2011) implementations of Bernoulli Naive Bayes and L2-regularized logistic regression. The models’ hyperparameters (Naive Bayes’ smoothing parameter and logistic regression’s regularization strength) are chosen to maximize the F1 score on the development set.
[8] https://github.com/myleott/ark-twokenize-py
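The labeling and feature extraction steps can be illustrated with a short sketch. It is only an approximation of the setup above: whitespace splitting and lowercasing stand in for the twokenize.py tokenizer, and the function names are hypothetical.

```python
def label_tweet(text, emoticon=":)"):
    """Apply the emoticon trick: label by emoticon presence, then delete it.

    Returns (cleaned_text, y) where y = 1 if the emoticon occurred.
    """
    y = 1 if emoticon in text else 0
    # Remove every occurrence and normalize the leftover whitespace.
    cleaned = " ".join(text.replace(emoticon, " ").split())
    return cleaned, y

def binary_unigram_features(text):
    """Binary unigram features: the set of distinct lowercased tokens.

    Assumption: str.split is a stand-in for a Twitter-aware tokenizer.
    """
    return set(text.lower().split())

for tweet in ["great day :)", "stuck in traffic again"]:
    cleaned, y = label_tweet(tweet)
    print(y, sorted(binary_unigram_features(cleaned)))
```

Deleting the emoticon from positive examples matters because otherwise the label would be trivially recoverable from the features.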
4.1.2 Results
Naive Bayes attains a slightly higher F1 score (NB 73.8% vs. LR 72.9%), but logistic regression has much lower calibration error: less than half as much RMSE (NB 0.105 vs. LR 0.041; Figure 2). Both models have a tendency to be underconfident in the lower prediction range and overconfident in the higher range, but the tendency is more pronounced for Naive Bayes.
4.2 Hidden Markov models and conditional random fields
4.2.1 Introduction
Hidden Markov models (HMM) and linear chain conditional random fields (CRF) are another commonly used pair of analogous generative and discriminative models. They both define a posterior over tag sequences $P(y \mid x)$, which we apply to part-of-speech tagging.
We can analyze these models in the binary calibration framework (§2-3) by looking at the marginal distributions of binary-valued outcomes of parts of the predicted structures. Specifically, we examine calibration of predicted probabilities of individual tokens’ tags (§4.2.2), and of pairs of consecutive tags (§4.2.3). These quantities are calculated with the forward-backward algorithm.
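For readers unfamiliar with how such marginals are obtained, here is a minimal forward-backward sketch for an HMM that returns the posterior probability of each tag at each position. The parameter layout (`init`, `trans`, `emit`) and the toy numbers are assumptions for illustration; the sketch does no rescaling, so it would underflow on long sequences.

```python
def forward_backward(init, trans, emit, obs):
    """Per-position tag marginals P(tag_t = k | obs) for an HMM.

    init[k]      : probability of starting in tag k
    trans[j][k]  : probability of tag k following tag j
    emit[k][w]   : probability of tag k emitting word w
    """
    num_tags, length = len(init), len(obs)
    alpha = [[0.0] * num_tags for _ in range(length)]
    beta = [[1.0] * num_tags for _ in range(length)]
    for k in range(num_tags):
        alpha[0][k] = init[k] * emit[k][obs[0]]
    for t in range(1, length):
        for k in range(num_tags):
            incoming = sum(alpha[t - 1][j] * trans[j][k] for j in range(num_tags))
            alpha[t][k] = emit[k][obs[t]] * incoming
    for t in range(length - 2, -1, -1):
        for j in range(num_tags):
            beta[t][j] = sum(
                trans[j][k] * emit[k][obs[t + 1]] * beta[t + 1][k]
                for k in range(num_tags)
            )
    evidence = sum(alpha[length - 1])  # P(obs)
    return [
        [alpha[t][k] * beta[t][k] / evidence for k in range(num_tags)]
        for t in range(length)
    ]

# Hypothetical two-tag model over a two-word vocabulary.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]
print(forward_backward(init, trans, emit, ["a", "b"]))
```

Each row of the result sums to 1, and each entry is one prediction strength for the binary question of whether a given token has a given tag.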
To prepare a POS tagging dataset, we extract Wall Street Journal articles from the English CoNLL-2011 coreference shared task dataset from OntoNotes (Pradhan et al., 2011), using the CoNLL-2011 splits for training, development and testing. This results in 11,772 sentences for training, 1,632 for development, and 1,382 for testing, over a set of 47 possible tags.
We train an HMM with Dirichlet MAP using one pseudocount for every transition and word emission. For the CRF, we use the regularized L-BFGS algorithm implemented in CRFsuite (Okazaki, 2007). We compare an HMM to a CRF that only uses basic transition (tag-tag) and emission (tag-word) features, so that it does not have an advantage due to more features. In order to compare models with similar task performance, we train the CRF with only 3000 sentences from the training set, which yields the same accuracy as the HMM (about 88.7% on the test set). In each case, the model’s hyperparameters (the CRF’s regularizer, the HMM’s pseudocount) are selected by maximizing accuracy on the development set.
4.2.2 Predicting singleword tags
In this experiment, we measure miscalibration of the two models on predicting tags of single words. First, for each tag type, we produce a set of 33,306 prediction-label pairs (for every token); we then concatenate them across the tags for calibration analysis. Figure 3 shows that the two models exhibit distinct calibration patterns. The HMM tends to be very underconfident whereas the CRF is overconfident, and the CRF has a lower (better) overall calibration error.
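The construction of prediction-label pairs from token-level tag marginals can be sketched as below. The data layout (one dictionary of tag probabilities per token) and the function name are assumptions for this example.

```python
def tag_prediction_label_pairs(marginals, gold_tags, tagset):
    """Build binary (q, y) pairs for every tag type and every token.

    marginals[t] maps each tag to its posterior probability at token t;
    gold_tags[t] is the annotated tag. The pairs for each tag type are
    concatenated, so the result has len(tagset) * len(gold_tags) entries.
    """
    pairs = []
    for tag in tagset:
        for token_marginal, gold in zip(marginals, gold_tags):
            q = token_marginal.get(tag, 0.0)
            y = 1 if gold == tag else 0
            pairs.append((q, y))
    return pairs

# Hypothetical two-token sentence with a two-tag tagset.
marginals = [{"NN": 0.7, "VB": 0.3}, {"NN": 0.1, "VB": 0.9}]
gold_tags = ["NN", "VB"]
print(tag_prediction_label_pairs(marginals, gold_tags, ["NN", "VB"]))
```

Restricting `tagset` to a single tag gives the per-tag pairs used when calibration is examined one POS tag at a time.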
We also examine the calibration errors of the individual POS tags (Figure 4(a)). We find that the CRF is significantly better calibrated than the HMM in most but not all categories (39 out of 47). For example, they are about equally calibrated on predicting the NN tag. The calibration gap between the two models also differs among the tags.
4.2.3 Predicting two-consecutive-word tags
There is no reason to restrict ourselves to model predictions of single words; these models define marginal distributions over larger textual units. Next we examine the calibration of posterior predictions of tag pairs on two consecutive words in the test set. The same analysis may be important for, say, phrase extraction or other chunking/parsing tasks.
We report results for the top 5 and 100 most frequent tag pairs (Figure 4(b)). We observe a similar pattern as seen from the experiment on single tags: the CRF is generally better calibrated than the HMM, but the HMM does achieve better calibration errors in 29 out of 100 categories.
These tagging experiments illustrate that, depending on the application, different models can exhibit different levels of calibration.
5 Coreference resolution
We examine a third model, a probabilistic model for within-document noun phrase coreference, which has an efficient sampling-based inference procedure. In this section we introduce it and analyze its calibration, in preparation for the next section where we use it for exploratory data analysis.
5.1 Antecedent selection model
We use the Berkeley coreference resolution system (Durrett and Klein, 2013), which was originally presented as a CRF; we give it an equivalent formulation as a series of independent logistic regressions (see appendix for details). The primary component of this model is a locally-normalized log-linear distribution over clusterings of noun phrases, each cluster denoting an entity. The model takes a fixed input of $N$ mentions (noun phrases), indexed by $i$ in their positional order in the document. It posits that every mention $i$ has a latent antecedent selection decision, $a_i \in \{1, \ldots, i - 1, \text{new}\}$, denoting which previous mention it attaches to, or new if it is starting a new entity that has not yet been seen at a previous position in the text. Such a mention-mention attachment indicates coreference, while the final entity clustering includes more links implied through transitivity. The model’s generative process is:
Definition 2 (Antecedent coreference model and sampling algorithm).
(a) For $i = 1, \ldots, N$, sample $a_i \sim \frac{1}{Z_i} \exp\left(w^\top f(i, a_i, x)\right)$.
(b) Calculate the entity clusters as $e := \mathrm{CC}(a)$, the connected components of the antecedent graph having edges $(i, a_i)$ for $i$ where $a_i \neq \text{new}$.
Here $x$ denotes all information in the document that is conditioned on for log-linear features $f$. $e = \{e_1, \ldots, e_M\}$ denotes the entity clusters, where each element is a set of mentions. There are $M$ entity clusters corresponding to the number of connected components in $a$. The model defines a joint distribution over antecedent decisions $P(a \mid x) = \prod_i P(a_i \mid x)$; it also defines a joint distribution over entity clusterings $P(e \mid x)$, where the probability of an $e$ is the sum of the probabilities of all $a$ vectors that could give rise to it. In a manner similar to a distance-dependent Chinese restaurant process (Blei and Frazier, 2011), it is nonparametric in the sense that the number of clusters $M$ is not fixed in advance.
5.2 Sampling-based inference
For both calibration analysis and exploratory applications, we need to analyze the posterior distribution over entity clusterings. This distribution is a complex mathematical object; an attractive approach to analyze it is to draw samples from this distribution, then analyze the samples.
This antecedent-based model admits a very straightforward procedure to draw independent $e$ samples, by stepping through Def. 2: independently sample each $a_i$, then calculate the connected components of the resulting antecedent graph. By construction, this procedure samples from the joint distribution of $e$ (even though we never compute the probability of any single clustering $e$).
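The sampling procedure can be sketched with a union-find structure for the connected components. This is an illustration, not the modified Berkeley system: the antecedent probabilities are taken as given inputs (in the real model they come from the log-linear scores), mentions are 0-indexed, and all names are hypothetical.

```python
import random

NEW = "new"  # marker for "starts a new entity"

def sample_clustering(antecedent_probs, rng):
    """Draw one entity clustering by sampling each antecedent independently.

    antecedent_probs[i] maps each candidate for mention i (an earlier
    mention index, or NEW) to its probability. Returns a list of sets of
    mention indices, ordered by each cluster's first mention.
    """
    parent = list(range(len(antecedent_probs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, probs in enumerate(antecedent_probs):
        candidates = list(probs)
        weights = [probs[c] for c in candidates]
        choice = rng.choices(candidates, weights=weights)[0]
        if choice != NEW:
            parent[find(i)] = find(choice)  # add edge (i, a_i)

    clusters = {}
    for i in range(len(parent)):
        clusters.setdefault(find(i), set()).add(i)
    return sorted(clusters.values(), key=min)

# Hypothetical three-mention document.
probs = [{NEW: 1.0}, {0: 0.5, NEW: 0.5}, {0: 0.2, 1: 0.3, NEW: 0.5}]
rng = random.Random(0)
print([sample_clustering(probs, rng) for _ in range(3)])
```

Transitivity comes for free from the connected components: if mention 2 attaches to mention 1 and mention 1 attaches to mention 0, all three land in one cluster.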
Unlike approximate sampling approaches, such as Markov chain Monte Carlo methods used in other coreference work to sample $e$ (Haghighi and Klein, 2007), here there are no questions about burn-in or autocorrelation (Kass et al., 1998). Every sample is independent and very fast to compute, only slightly slower than calculating the MAP assignment (due to the $\exp$ and normalization for each $a_i$). We implement this algorithm by modifying the publicly available implementation from Durrett and Klein. [9]
[9] Berkeley Coreference Resolution System, version 1.1: http://nlp.cs.berkeley.edu/projects/coref.shtml
5.3 Calibration analysis
We consider the following inference query: for a randomly chosen pair of mentions, are they coreferent? Even if the model’s accuracy is comparatively low, it may be the case that it is correctly calibrated—if it thinks there should be great variability in entity clusterings, it may be uncertain whether a pair of mentions should belong together.
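Let $\ell_{ij}$ be 1 if the mentions $i$ and $j$ are predicted to be coreferent, and 0 otherwise. Annotated data defines a gold-standard $\ell_{ij}^{(g)}$ value for every pair $i, j$. Any probability distribution over $e$ defines a marginal Bernoulli distribution for every proposition $\ell_{ij}$, marginalizing out $e$:
$$P(\ell_{ij} = 1 \mid x) = \sum_{e} \mathbb{1}\{(i, j) \in e\}\, P(e \mid x) \qquad (2)$$
where $(i, j) \in e$ is true iff there is an entity in $e$ that contains both $i$ and $j$.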
In a traditional coreference evaluation of the best-prediction entity clustering, the model assigns 1 or 0 to every $\ell_{ij}$, and the pairwise precision and recall can be computed by comparing them to the corresponding $\ell_{ij}^{(g)}$. Here, we instead compare the $q_{ij} \equiv P(\ell_{ij} = 1 \mid x)$ prediction strengths against $\ell_{ij}^{(g)}$ empirical frequencies to assess pairwise calibration, with the same binary calibration analysis tools developed in §3 by aggregating pairs with similar $q_{ij}$ values. Each $q_{ij}$ is computed by averaging over 1,000 samples, simply taking the fraction of samples where the pair $(i, j)$ is coreferent.
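Estimating the pairwise prediction strengths from samples amounts to counting how often each pair shares a cluster. The sketch below assumes each sampled clustering is a list of sets of 0-indexed mention ids; the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations

def pairwise_coref_probs(sampled_clusterings, num_mentions):
    """Monte Carlo estimate of the coreference probability for every pair.

    Returns a dict mapping (i, j) with i < j to the fraction of sampled
    clusterings in which mentions i and j fall in the same entity.
    """
    counts = Counter()
    for clusters in sampled_clusterings:
        for cluster in clusters:
            for i, j in combinations(sorted(cluster), 2):
                counts[(i, j)] += 1
    total = len(sampled_clusterings)
    return {
        (i, j): counts[(i, j)] / total
        for i, j in combinations(range(num_mentions), 2)
    }

# Four hypothetical sampled clusterings of a three-mention document.
samples = [[{0, 1}, {2}], [{0}, {1, 2}], [{0, 1, 2}], [{0, 1}, {2}]]
print(pairwise_coref_probs(samples, num_mentions=3))
```

Pairing each estimated probability with the gold coreference indicator for the same mention pair yields prediction-label pairs that can be fed to the binning analysis of §3.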
We perform this analysis on the development section of the English CoNLL-2011 data (404 documents). Using the sampling inference method discussed in §5.2, we compute 4.3 million prediction-label pairs and measure their calibration error. Our result shows that the model produces very well-calibrated predictions (Figure 5), though slightly overconfident on middle to high-valued predictions. The calibration error indicates that it is the best-calibrated model we examine within this paper. This result suggests we might be able to trust its level of uncertainty.
6 Uncertainty in Entity-based Exploratory Analysis
6.1 Entity-syntactic event aggregation
We demonstrate one important use of calibration analysis: to ensure the usefulness of propagating uncertainty from coreference resolution into a system for exploring unannotated text. Accuracy cannot be calculated since there are no labels; but if the system is calibrated, we postulate that uncertainty information can help users understand the underlying reliability of aggregated extractions and isolate predictions that are more likely to contain errors.
We illustrate with an event analysis application to count the number of “country attack events”: for a particular country of the world, how many news articles describe an entity affiliated with that country as the agent of an attack, and how does this number change over time? This is a simplified version of a problem where such systems have been built and used for political science analysis (Schrodt et al., 1994; Schrodt, 2012; Leetaru and Schrodt, 2013; Boschee et al., 2013; O’Connor et al., 2013). A coreference component can improve extraction coverage in cases such as “Russian troops were sighted …and they attacked …”
We use the coreference system examined in §5 for this analysis. To propagate coreference uncertainty, we re-run event extraction on multiple coreference samples generated from the algorithm described in §5.2, inducing a posterior distribution over the event counts. To isolate the effects of coreference, we use a very simple syntactic dependency system to identify affiliations and events. Assume the availability of dependency parses for a document $d$, a coreference resolution $e_d$, and a lexicon of country names, which contains a small set of words $w(c)$ for each country $c$. The binary function $f(c, e; x_d)$ assesses whether an entity $e$ is affiliated with country $c$ and is described as the agent of an attack, based on document text and parses $x_d$; it returns true iff both: [10]
(a) There exists a mention described as country $c$: either its head word is in $w(c)$ (e.g. “Americans”), or its head word has an nmod or amod modifier in $w(c)$ (e.g. “American forces”, “president of the U.S.”); and there is only one unique country among the mentions in the entity.
(b) There exists a mention which is the nsubj or agent argument to the verb “attack” (e.g. “they attacked”, “the forces attacked”, “attacked by them”).
[10] Syntactic relations are Universal Dependencies (de Marneffe et al., 2014); more details for the extraction rules are in the appendix.
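For a given $c$, we first calculate a binary variable for whether there is at least one entity fulfilling $f$ in a particular document,
$$a(d, c, e_d) = \bigvee_{e \in e_d} f(c, e; x_d) \qquad (3)$$
and second, the number of such documents in $D(t)$, the set of New York Times articles published in a given time period $t$,
$$n(t, c, e) = \sum_{d \in D(t)} a(d, c, e_d) \qquad (4)$$
These quantities are both random variables, since they depend on $e$; thus we are interested in the posterior distribution of $n$, marginalizing out $e$,
$$P(n(t, c) = k \mid x) = \sum_{e} \mathbb{1}\{n(t, c, e) = k\}\, P(e \mid x) \qquad (5)$$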
If our coreference model were highly certain (only one structure, or a small number of similar structures, had most of the probability mass in the space of all possible structures), each document would have an $a(d, c, e_d)$ posterior near either 0 or 1, and their sum in Eq. 5 would have a narrow distribution. But if the model is uncertain, the distribution will be wider. Because of the transitive closure, the probability of $a(d, c, e_d)$ is potentially more complex than the single antecedent linking probability between two mentions: the affiliation and attack information can propagate through a long coreference chain.
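How coreference samples turn into a distribution over document counts can be illustrated with a toy sketch. Here the syntactic rules are assumed to have been applied already, so each mention is a dictionary with a hypothetical `country` field (a country code or `None`) and an `attack_agent` flag; all names are illustrative.

```python
def make_predicate(country):
    """Build the entity test: exactly one country, and some attacking mention."""
    def is_attacking_entity(entity):
        countries = {m["country"] for m in entity if m["country"] is not None}
        return countries == {country} and any(m["attack_agent"] for m in entity)
    return is_attacking_entity

def doc_has_event(entities, predicate):
    """True iff at least one entity in the clustering satisfies the predicate."""
    return any(predicate(entity) for entity in entities)

def sampled_counts(doc_samples, predicate):
    """One document count per coreference sample.

    doc_samples[d][s] is the s-th sampled clustering of document d, given
    as a list of entities, each a list of mention dictionaries.
    """
    num_samples = len(doc_samples[0])
    return [
        sum(doc_has_event(doc[s], predicate) for doc in doc_samples)
        for s in range(num_samples)
    ]

leader = {"country": "IRQ", "attack_agent": False}   # e.g. a titled mention
pronoun = {"country": None, "attack_agent": True}    # e.g. "he ... attack"
army = {"country": "IRQ", "attack_agent": True}
doc1 = [[[leader, pronoun]], [[leader], [pronoun]]]  # two samples, split once
doc2 = [[[army]], [[army]]]                          # same in both samples
print(sampled_counts([doc1, doc2], make_predicate("IRQ")))
```

In the first document the event is extracted only when the two mentions are clustered together, so the count varies across samples; the spread of these counts is what the posterior intervals summarize.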
6.2 Results
We tag and parse a 193,403-article subset of the Annotated New York Times LDC corpus (Sandhaus, 2008), which includes articles about world news from the years 1987 to 2007 (details in appendix). For each article, we run the coreference system to predict 100 samples, and evaluate $f$ on every entity in every sample. [11] The quantity of interest is the number of articles mentioning attacks in a 3-month period (quarter), for a given country. Figure 6 illustrates the mean and 95% posterior credible intervals for each quarter. The posterior mean is calculated as the mean of the samples, and the interval is the normal approximation $\mu \pm 1.96\, \sigma$, where $\sigma$ is the standard deviation among samples for that country and time period.
[11] We obtained similar results using only 10 samples. We also obtained similar results with a different query function, the total number of entities, across documents, that fulfill $f$.
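The interval computation is simple enough to show directly. The sketch below uses the sample standard deviation, which is an assumption since the text does not say which estimator was used, and adds a small overlap check of the kind used when comparing two countries; names and data are hypothetical.

```python
import statistics

def normal_interval(counts, z=1.96):
    """Posterior mean and normal-approximation interval from sampled counts.

    Assumption: uses the sample standard deviation (n - 1 denominator).
    Returns (mean, lower, upper).
    """
    mean = statistics.mean(counts)
    sd = statistics.stdev(counts)
    return mean, mean - z * sd, mean + z * sd

def intervals_overlap(first, second):
    """True iff two (lower, upper) intervals share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]

# Hypothetical per-sample document counts for two countries in one quarter.
counts_a = [2, 4, 4, 4, 5, 5, 7, 9]
counts_b = [6, 7, 7, 8, 8, 9, 9, 10]
mean_a, lo_a, hi_a = normal_interval(counts_a)
mean_b, lo_b, hi_b = normal_interval(counts_b)
print((mean_a, lo_a, hi_a), (mean_b, lo_b, hi_b))
print(intervals_overlap((lo_a, hi_a), (lo_b, hi_b)))
```

Overlapping intervals are only an informal signal that a difference in means may be due to coreference noise, not a formal test.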
Uncertainty information helps us understand whether a difference between data points is real. In the plots of Figure 6, if we had used a 1-best coreference resolution, only a single line would be shown, with no assessment of uncertainty. This is problematic in cases when the model genuinely does not know the correct answer. For example, the 1993-1996 period of the USA plot (Figure 6, top) shows the posterior mean fluctuating from 1 to 5 documents; but when credible intervals are taken into consideration, we see that the model does not know whether the differences are real, or were caused by coreference noise.
A similar case is highlighted at the bottom plot of Figure 6. Here we compare the event counts for Yugoslavia and NATO, which were engaged in a conflict in 1999. Did the New York Times devote more attention to the attacks by one particular side? To a 1-best system, the answer would be yes. But the posterior intervals for the two countries’ event counts in mid-1999 heavily overlap, indicating that the coreference system introduces too much uncertainty to obtain a conclusive answer for this question. Note that calibration of the coreference model is important for the credible intervals to be useful; for example, if the model were badly calibrated by being overconfident (too much probability over a small set of similar structures), these intervals would be too narrow, leading to incorrect interpretations of the event dynamics.
Visualizing this uncertainty gives richer information for a potential user of an NLP-based system, compared to simply drawing a line based on a single 1-best prediction. It preserves the genuine uncertainty due to ambiguities the system was unable to resolve. This highlights an alternative use of Finkel et al. (2006)’s approach of sampling multiple NLP pipeline components, which in that work was used to perform joint inference. Instead of focusing on improving an NLP pipeline, we can pass uncertainty on to exploratory purposes, and try to highlight to a user where the NLP system may be wrong, or where it can only imprecisely specify a quantity of interest.
Finally, calibration can help error analysis. For a calibrated model, the more uncertain a prediction is, the more likely it is to be erroneous. While coreference errors comprise only one part of event extraction errors (alongside issues in parse quality, factivity, semantic roles, etc.), we can look at highly uncertain event predictions to understand the nature of coreference errors relative to our task. We manually analyzed documents with a 50% probability of containing an “attack”ing country-affiliated entity, and found difficult coreference cases.
In one article from late 1990, an “attack” event for IRQ is extracted from the sentence “But some political leaders said that they feared that Mr. Hussein might attack Saudi Arabia”. The mention “Mr. Hussein” is classified as IRQ only when it is coreferent with a previous mention “President Saddam Hussein of Iraq”; this occurs only 50% of the time, since in some posterior samples the coreference system split apart these two “Hussein” mentions. This particular document is additionally difficult, since it includes the names of more than 10 countries (e.g. United States, Saudi Arabia, Egypt), and some of the Hussein mentions are even clustered with presidents of other countries (such as “President Bush”), presumably because they share the “president” title. These types of errors are a major issue for a political analysis task; further analysis could assess their prevalence and how to address them in future work.
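The selection of highly uncertain documents can be sketched as follows. This is a minimal illustration under stated assumptions: each document is represented by the set of events extracted in each posterior sample, the event probability is estimated as the fraction of samples containing the event, and the data, the function names, and the 0.4 to 0.6 band are all hypothetical.

```python
def event_probability(sample_events, event):
    """Fraction of posterior samples in which `event` was extracted."""
    hits = sum(1 for events in sample_events if event in events)
    return hits / len(sample_events)

def most_uncertain(doc_samples, event, low=0.4, high=0.6):
    """Return document ids whose estimated event probability lies in [low, high]."""
    return [
        doc_id
        for doc_id, sample_events in doc_samples.items()
        if low <= event_probability(sample_events, event) <= high
    ]

# Hypothetical data: for each document, the set of extracted events per sample.
doc_samples = {
    "doc1": [{("IRQ", "attack")}, set(), {("IRQ", "attack")}, set()],
    "doc2": [{("IRQ", "attack")}] * 4,
    "doc3": [set()] * 4,
}
uncertain = most_uncertain(doc_samples, ("IRQ", "attack"))
print(uncertain)
```

Here only the first toy document, where the event appears in half of the samples, is flagged for manual inspection.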
7 Conclusion
In this work, we argue that the calibration of posterior predictions is a desirable property of probabilistic NLP models, and that it can be directly evaluated. We also demonstrate a use case of having calibrated uncertainty: its propagation into downstream exploratory analysis.
Our posterior simulation approach for exploratory and error analysis relates to posterior predictive checking (Gelman et al., 2013), which analyzes a posterior to test model assumptions; Mimno and Blei (2011) apply it to a topic model.
One avenue of future work is to investigate more effective nonparametric regression methods to better estimate and visualize calibration error, such as Gaussian processes or bootstrapped kernel density estimation.
Another important question is: what types of inferences are facilitated by correct calibration? Intuitively, we think that overconfidence will lead to overly narrow confidence intervals; but in what sense are confidence intervals “good” when calibration is perfect? Also, does calibration help joint inference in NLP pipelines? It may also assist calculations that rely on expectations, such as inference methods like minimum Bayes risk decoding, or learning methods like EM, since calibrated predictions imply that calculated expectations are statistically unbiased (though the implications of this fact may be subtle). Finally, it may be interesting to pursue recalibration methods, which readjust a non-calibrated model’s predictions to be calibrated; recalibration methods have been developed for binary (Platt, 1999; Niculescu-Mizil and Caruana, 2005) and multiclass (Zadrozny and Elkan, 2002) classification settings, but we are unaware of methods appropriate for the highly structured outputs typical in linguistic analysis. Another approach might be to directly constrain the calibration error during training, or try to reduce it as a training-time risk minimization or cost objective (Smith and Eisner, 2006; Gimpel and Smith, 2010; Stoyanov et al., 2011; Brümmer and Doddington, 2013).
Calibration is an interesting and important property of NLP models. Further work is necessary to address these and many other questions.
Acknowledgments
Thanks to Erik Learned-Miller, Benjamin Marlin, Craig Greenberg, Phan-Minh Nguyen, Caitlin Cellier and the CMU ARK Lab for discussion and comments, and to the anonymous reviewers (especially R3) for helpful suggestions.
References
 Bamman et al. (2013) David Bamman, Brendan O’Connor, and Noah A. Smith. Learning latent personas of film characters. In Proceedings of ACL, 2013.
 Bennett (2000) Paul N. Bennett. Assessing the calibration of naive Bayes’ posterior estimates. Technical report, Carnegie Mellon University, 2000.
 Blei and Frazier (2011) David M. Blei and Peter I. Frazier. Distance dependent Chinese restaurant processes. The Journal of Machine Learning Research, 12:2461–2488, 2011.
 Boschee et al. (2013) Elizabeth Boschee, Premkumar Natarajan, and Ralph Weischedel. Automatic extraction of events from open source text for predictive forecasting. Handbook of Computational Approaches to Counterterrorism, page 51, 2013.
 Brier (1950) Glenn W. Brier. Verification of forecasts expressed in terms of probability. Monthly weather review, 78(1):1–3, 1950.
 Bröcker (2009) Jochen Bröcker. Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society, 135(643):1512–1519, 2009.
 Brümmer and Doddington (2013) Niko Brümmer and George Doddington. Likelihood-ratio calibration using prior-weighted proper scoring rules. arXiv preprint arXiv:1307.7981, 2013. Interspeech 2013.
 de Marneffe et al. (2014) Marie-Catherine de Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D. Manning. Universal Stanford dependencies: A cross-linguistic typology. In Proceedings of LREC, 2014.
 DeGroot and Fienberg (1983) Morris H. DeGroot and Stephen E. Fienberg. The comparison and evaluation of forecasters. The statistician, pages 12–22, 1983.
 Domingos and Pazzani (1997) Pedro Domingos and Michael Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine learning, 29(2–3):103–130, 1997.
 Durrett and Klein (2013) Greg Durrett and Dan Klein. Easy victories and uphill battles in coreference resolution. In EMNLP, pages 1971–1982, 2013.
 Durrett and Klein (2014) Greg Durrett and Dan Klein. A joint model for entity analysis: Coreference, typing, and linking. Transactions of the Association for Computational Linguistics, 2:477–490, 2014.
 Finkel et al. (2006) Jenny Rose Finkel, Christopher D. Manning, and Andrew Y. Ng. Solving the problem of cascading errors: Approximate Bayesian inference for linguistic annotation pipelines. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 618–626. Association for Computational Linguistics, 2006.
 Gelman et al. (2013) Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian data analysis. Chapman and Hall/CRC, 3rd edition, 2013.
 Gimpel and Smith (2008) Kevin Gimpel and Noah A. Smith. Rich source-side context for statistical machine translation. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 9–17, 2008.
 Gimpel and Smith (2010) Kevin Gimpel and Noah A. Smith. Softmax-margin CRFs: Training log-linear models with cost functions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 733–736. Association for Computational Linguistics, 2010.
 Gimpel et al. (2013) Kevin Gimpel, Dhruv Batra, Chris Dyer, and Gregory Shakhnarovich. A systematic exploration of diversity in machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1100–1111, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D13-1111.
 Gneiting and Raftery (2007) Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
 Goodman (1996) Joshua Goodman. Parsing algorithms and metrics. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 177–183, Santa Cruz, California, USA, June 1996. Association for Computational Linguistics. doi: 10.3115/981863.981887. URL http://www.aclweb.org/anthology/P96-1024.
 Haghighi and Klein (2007) Aria Haghighi and Dan Klein. Unsupervised coreference resolution in a nonparametric Bayesian model. In Annual Meeting, Association for Computational Linguistics, volume 45, page 848, 2007.
 Kass et al. (1998) Robert E. Kass, Bradley P. Carlin, Andrew Gelman, and Radford M. Neal. Markov chain Monte Carlo in practice: a roundtable discussion. The American Statistician, 52(2):93–100, 1998.
 Kumar and Byrne (2004) Shankar Kumar and William Byrne. Minimum Bayes-risk decoding for statistical machine translation. In Daniel Marcu, Susan Dumais, and Salim Roukos, editors, HLT-NAACL 2004: Main Proceedings, pages 169–176, Boston, Massachusetts, USA, May 2 – May 7 2004. Association for Computational Linguistics.
 Leetaru and Schrodt (2013) Kalev Leetaru and Philip A. Schrodt. GDELT: Global data on events, location, and tone, 1979–2012. In ISA Annual Convention, volume 2, page 4, 2013.
 Lin and Kolcz (2012) Jimmy Lin and Alek Kolcz. Large-scale machine learning at Twitter. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 793–804. ACM, 2012.
 McCord et al. (2012) Michael C. McCord, J. William Murdock, and Branimir K. Boguraev. Deep parsing in Watson. IBM Journal of Research and Development, 56(3.4):3–1, 2012.
 Mimno and Blei (2011) David Mimno and David Blei. Bayesian checking for topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 227–237, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1021.
 Miwa et al. (2010) Makoto Miwa, Sampo Pyysalo, Tadayoshi Hara, and Jun’ichi Tsujii. Evaluating dependency representations for event extraction. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 779–787, Beijing, China, August 2010. Coling 2010 Organizing Committee. URL http://www.aclweb.org/anthology/C10-1088.
 Murphy and Winkler (1987) Allan H. Murphy and Robert L. Winkler. A general framework for forecast verification. Monthly Weather Review, 115(7):1330–1338, 1987.
 Ng and Jordan (2002) Andrew Ng and Michael Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in neural information processing systems, 14:841, 2002.
 Niculescu-Mizil and Caruana (2005) Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625–632, 2005.
 O’Connor et al. (2013) Brendan O’Connor, Brandon Stewart, and Noah A. Smith. Learning to extract international relations from political context. In Proceedings of ACL, 2013.
 Okazaki (2007) Naoaki Okazaki. CRFsuite: a fast implementation of conditional random fields (CRFs), 2007. URL http://www.chokkan.org/software/crfsuite/.
 Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 Platt (1999) John Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in large margin classifiers. MIT Press (2000), 1999. URL http://research.microsoft.com/pubs/69187/svmprob.ps.gz.
 Pradhan et al. (2011) Sameer Pradhan, Lance Ramshaw, Mitchell Marcus, Martha Palmer, Ralph Weischedel, and Nianwen Xue. CoNLL-2011 shared task: Modeling unrestricted coreference in OntoNotes. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–27. Association for Computational Linguistics, 2011.
 Read (2005) Jonathon Read. Using emoticons to reduce dependency in machine learning techniques for sentiment classification. In Proceedings of the ACL Student Research Workshop, pages 43–48. Association for Computational Linguistics, 2005.
 Sandhaus (2008) Evan Sandhaus. The New York Times Annotated Corpus. Linguistic Data Consortium, LDC2008T19, 2008.
 Schrodt (2012) Philip A. Schrodt. Precedents, progress, and prospects in political event data. International Interactions, 38(4):546–569, 2012.
 Schrodt et al. (1994) Philip A. Schrodt, Shannon G. Davis, and Judith L. Weddle. KEDS – a program for the machine coding of event data. Social Science Computer Review, 12(4):561–587, December 1994. doi: 10.1177/089443939401200408. URL http://ssc.sagepub.com/content/12/4/561.abstract.
 Singh et al. (2013) Sameer Singh, Sebastian Riedel, Brian Martin, Jiaping Zheng, and Andrew McCallum. Joint inference of entities, relations, and coreference. In Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, pages 1–6. ACM, 2013.
 Smith and Eisner (2006) David A. Smith and Jason Eisner. Minimum risk annealing for training log-linear models. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 787–794, Sydney, Australia, July 2006. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P06-2101.
 Stoyanov et al. (2011) Veselin Stoyanov, Alexander Ropson, and Jason Eisner. Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In International Conference on Artificial Intelligence and Statistics, pages 725–733, 2011.
 Toutanova et al. (2008) Kristina Toutanova, Aria Haghighi, and Christopher D. Manning. A global joint model for semantic role labeling. Computational Linguistics, 34(2):161–191, 2008.
 Tukey (1961) John W. Tukey. Curves as parameters, and touch estimation. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pages 681–694, Berkeley, Calif., 1961. University of California Press. URL http://projecteuclid.org/euclid.bsmsp/1200512189.
 Venugopal et al. (2008) Ashish Venugopal, Andreas Zollmann, Noah A. Smith, and Stephan Vogel. Wider pipelines: N-best alignments and parses in MT training. In Proceedings of AMTA, 2008.
 Wasserman (2006) Larry Wasserman. All of nonparametric statistics. Springer Science & Business Media, 2006.
 Zadrozny and Elkan (2002) Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of KDD, pages 694–699. ACM, 2002.
Appendix
1 Sampling a deterministic function of a random variable
In several places in this paper, we define probability distributions over deterministic functions of a random variable, and sample from them by applying the deterministic function to samples of the random variable. This should be valid by construction, but we supply the following argument for further justification.
$X$ is a random variable and $g$ is a deterministic function which takes a value $x$ of $X$ as its input. Since $g(X)$ depends on a random variable, $g(X)$ is a random variable as well. The distribution for $g(X)$, or aspects of it (such as a PMF or independent samples from it), can be calculated by marginalizing out $X$ with a Monte Carlo approximation. Assuming $g$ has discrete outputs (as is the case for the event counting function and the connected components function), we examine the probability mass function:
\begin{align}
P(g(X) = h) &= \sum_{x} P(g(X) = h,\, X = x) \tag{6} \\
&= \sum_{x} P(g(X) = h \mid X = x)\, P(X = x) \tag{7} \\
&= \sum_{x} \mathbb{1}\{g(x) = h\}\, P(X = x) \tag{8} \\
&\approx \frac{1}{S} \sum_{s=1}^{S} \mathbb{1}\{g(x^{(s)}) = h\}, \qquad x^{(s)} \sim P(X) \tag{9}
\end{align}
Eq. 8 holds because $g$ is a deterministic function, and Eq. 9 is a Monte Carlo approximation that uses $S$ samples from $P(X)$.
This implies that a set of $g$ values calculated on $X$ samples, $g(x^{(1)}), \ldots, g(x^{(S)})$, should constitute a sample from the distribution $P(g(X))$; in our event analysis section we usually call this the “posterior” distribution of $g(X)$ (the event counting function there). In our setting, we do not directly use the PMF calculation above; instead, we construct normal approximations to the probability distribution $P(g(X))$.
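As a minimal sketch of the Monte Carlo argument above, the code below estimates the PMF of a deterministic function of a random variable by applying the function to independent samples. The toy setting is an assumption chosen for illustration (the random variable is a roll of two fair dice and the function is their sum), and the names `monte_carlo_pmf` and `sample_two_dice` are hypothetical.

```python
import random
from collections import Counter

def monte_carlo_pmf(sample_x, g, num_samples, rng):
    """Estimate P(g(X) = h) for every observed h by applying g to samples of X."""
    values = [g(sample_x(rng)) for _ in range(num_samples)]
    counts = Counter(values)
    return {h: c / num_samples for h, c in counts.items()}

# Toy example (assumption): X is a pair of fair dice, g is their sum.
def sample_two_dice(rng):
    return (rng.randint(1, 6), rng.randint(1, 6))

rng = random.Random(0)
pmf = monte_carlo_pmf(sample_two_dice, sum, num_samples=20000, rng=rng)
print(round(pmf[7], 3))  # the exact value is 6/36, about 0.167
```

The list `values` inside the function is itself a sample from the distribution of the function's output, which is the property the rest of this appendix relies on.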
We use this technique in several places. For the calibration error confidence interval, the calibration error is a deterministic function of the uncertain empirical label frequencies; there, we propagate posterior uncertainty from a normal approximation to the Bernoulli parameter’s posterior (the normal distribution implied by the central limit theorem) through simulation. In the coreference model, the connected components function is a deterministic function of the antecedent vector; thus repeatedly calculating it on sampled antecedent vectors yields samples of entity clusterings from their posterior. For the event analysis, the counting function is a function of the entity samples, and thus can be recalculated on each; this is a multiple-step deterministic pipeline, which postprocesses simulated random variables.
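The connected components step can be sketched as below. The encoding is an assumption made for illustration: each mention either points to the index of an earlier mention or is marked `None` to start a new entity. Because every link points backward, a single left-to-right pass suffices; the function name is hypothetical and this is not the implementation used in our experiments.

```python
def connected_components(antecedents):
    """Map an antecedent vector to entity clusters.

    antecedents[i] is the index of an earlier mention that mention i links to,
    or None if mention i starts a new entity (this encoding is an assumption).
    """
    cluster_of = []  # cluster index assigned to each mention so far
    clusters = []    # list of clusters, each a list of mention indices
    for i, a in enumerate(antecedents):
        if a is None:
            cluster_of.append(len(clusters))
            clusters.append([i])
        else:
            if not 0 <= a < i:
                raise ValueError("antecedent must be an earlier mention")
            cluster_of.append(cluster_of[a])
            clusters[cluster_of[a]].append(i)
    return clusters

# Two hypothetical antecedent samples for a four-mention document.
sample_1 = [None, 0, None, 1]
sample_2 = [None, None, None, 1]
print(connected_components(sample_1))
print(connected_components(sample_2))
```

Applying the function to each sampled antecedent vector gives one entity clustering per sample; the two toy samples differ in whether mentions 0 and 1 are merged.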
As in other Monte Carlo-based inference techniques (as applied to both Bayesian and frequentist (e.g. bootstrapping) inference), the mean and standard deviation of samples drawn from the distribution constitute the mean and standard deviation of the desired posterior distribution, subject to Monte Carlo error due to the finite number of samples $S$, which by the central limit theorem shrinks at a rate of $1/\sqrt{S}$. The Monte Carlo standard error for estimating the mean is $\sigma/\sqrt{S}$, where $\sigma$ is the standard deviation. So with 100 samples, the Monte Carlo standard error for the mean is $\sqrt{100} = 10$ times smaller than the standard deviation. Thus in the time series graphs, which are based on $S = 100$ samples, the posterior mean (dark line) has Monte Carlo uncertainty that is 10 times smaller than the vertical gray area (95% CI) around it.
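The relationship between the standard deviation of the samples and the Monte Carlo standard error of their mean can be checked numerically. The sketch below uses synthetic Gaussian draws as stand-ins for posterior samples of an event count; the numbers and the function name `monte_carlo_summary` are illustrative assumptions.

```python
import math
import random
import statistics

def monte_carlo_summary(samples):
    """Return (mean, standard deviation, Monte Carlo standard error of the mean)."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)
    mcse = sd / math.sqrt(len(samples))
    return mean, sd, mcse

rng = random.Random(0)
# Hypothetical posterior samples of an event count (illustrative only).
samples = [rng.gauss(20.0, 4.0) for _ in range(100)]
mean, sd, mcse = monte_carlo_summary(samples)
print(sd / mcse)  # the ratio is sqrt(100) = 10 for 100 samples
```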
2 Normalization in the coreference model
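Durrett and Klein (2013) present their model as a globally normalized, but fully factorized, CRF:
$$P(a \mid x) = \frac{1}{Z} \prod_{i} \exp\left(w^\top f(i, a_i, x)\right)$$
where $a_i$ is the antecedent chosen for mention $i$ and $Z$ is the global normalizer. Since the factor function decomposes independently for each random variable $a_i$, their probabilities are actually independent, and can be rewritten with local normalization,
$$P(a \mid x) = \prod_{i} \frac{\exp\left(w^\top f(i, a_i, x)\right)}{\sum_{j} \exp\left(w^\top f(i, j, x)\right)}$$
where the sum in each denominator ranges over the candidate antecedents $j$ of mention $i$.
This interpretation justifies the use of independent sampling to draw samples of the joint posterior.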
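Independent sampling under local normalization can be sketched as follows: each mention's scores are normalized on their own with a softmax, and one antecedent is drawn per mention. The scores and the candidate encoding (row i holds the scores of mention i's candidates) are hypothetical, and the function names are illustrative rather than taken from any released system.

```python
import math
import random

def local_softmax(scores):
    """Normalize one mention's antecedent scores into probabilities."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_antecedents(score_rows, rng):
    """Draw one antecedent per mention, each independently from its own distribution.

    score_rows[i][j] is a hypothetical log-linear score for mention i choosing
    candidate j; the candidate encoding is an assumption of this sketch.
    """
    sample = []
    for scores in score_rows:
        probs = local_softmax(scores)
        choice = rng.choices(range(len(scores)), weights=probs, k=1)[0]
        sample.append(choice)
    return sample

rng = random.Random(0)
score_rows = [[0.0], [1.2, 0.3], [0.1, 2.0, -0.5]]
samples = [sample_antecedents(score_rows, rng) for _ in range(5)]
print(samples)
```

Because each draw uses only its own row of scores, the joint sample is a product of independent per-mention draws, which matches the factorized form of the model.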
3 Event analysis: Corpus selection, country affiliation, and parsing
Articles are filtered to yield a dataset about world news. In the New York Times Annotated Corpus, every article is tagged with a large set of labels. We include articles that contain a category whose label starts with the string Top/News/World and whose text body contains a mention of at least one country name, and exclude articles with any category matching the regex /(Sports|Opinion).
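The filter can be sketched as a single predicate over an article's category labels and body text. This is a minimal sketch under stated assumptions: the exclusion pattern is taken to be an alternation over Sports and Opinion, country mentions are detected with a crude case-insensitive substring test, and the function name and example inputs are hypothetical.

```python
import re

INCLUDE_PREFIX = "Top/News/World"
# Assumed form of the exclusion pattern: any category containing /Sports or /Opinion.
EXCLUDE_RE = re.compile(r"/(Sports|Opinion)")

def keep_article(categories, body, country_names):
    """Apply the three filters: world-news category, no excluded category, a country mention."""
    has_world = any(c.startswith(INCLUDE_PREFIX) for c in categories)
    has_excluded = any(EXCLUDE_RE.search(c) for c in categories)
    body_lower = body.lower()
    # Simplification: substring search instead of tokenized lexicon matching.
    mentions_country = any(name.lower() in body_lower for name in country_names)
    return has_world and not has_excluded and mentions_country

countries = ["Iraq", "Egypt"]
kept = keep_article(["Top/News/World/Middle East"], "Iraq announced new talks.", countries)
dropped_opinion = keep_article(["Top/News/World", "Top/Opinion/Editorials"], "Iraq announced new talks.", countries)
dropped_no_country = keep_article(["Top/News/World"], "The summit ended on Friday.", countries)
print(kept, dropped_opinion, dropped_no_country)
```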
Country names are taken from the dictionary country_igos.txt based on previous work (http://brenocon.com/irevents/). Country name matching is case insensitive and uses light stemming: when trying to match a word against the lexicon, if a match is not found, it backs off to stripping the last character, and then the last two characters. (This is usually unnecessary since the dictionary contains modifier forms.)
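The backoff matching can be sketched as below. The miniature lexicon is hypothetical (the real dictionary file has its own format, which is not reproduced here), and the order of the backoff (full word, then one character stripped, then two) is an assumption of this sketch.

```python
def match_country(word, lexicon):
    """Case-insensitive lookup with backoff: the word, then minus 1, then minus 2 final characters."""
    w = word.lower()
    for strip in (0, 1, 2):
        candidate = w[: len(w) - strip]
        if candidate and candidate in lexicon:
            return lexicon[candidate]
    return None

# Hypothetical miniature lexicon mapping lowercased names to country codes.
lexicon = {"iraq": "IRQ", "iraqi": "IRQ", "egypt": "EGY"}

print(match_country("Iraqis", lexicon))
print(match_country("Egypt's", lexicon))
print(match_country("France", lexicon))
```

In this toy lexicon, "Iraqis" matches after stripping one character and "Egypt's" after stripping two, while "France" finds no match.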
POS, NER, and constituent and dependency parses are produced with Stanford CoreNLP 3.5.2 with default settings except for one change, to use its shift-reduce constituent parser (for convenience of processing speed). We treat tags and parses as fixed and leave their uncertainty propagation for future work.
When formulating the extraction rules, we examined frequencies of all syntactic dependencies within country-affiliated entities, in order to help find reasonably high-coverage syntactic relations for the “attack” rule.
4 Event time series graphs
The following pages contain posterior time series graphs for 20 countries, as described in the section on coreference-based event aggregation, in order of decreasing total event frequency. As in the main paper, the blue line indicates the posterior mean, and the gray region indicates 95% posterior credibility intervals, with count aggregation at the monthly level. The titles are ISO3 country codes.