Scaling Text with the Class Affinity Model
Abstract
Probabilistic methods for classifying text form a rich tradition in machine learning and natural language processing. For many important problems, however, class prediction is uninteresting because the class is known, and instead the focus shifts to estimating latent quantities related to the text, such as affect or ideology. We focus on one such problem of interest, estimating the ideological positions of 55 Irish legislators in the 1991 Dáil confidence vote. To solve the Dáil scaling problem and others like it, we develop a text modeling framework that allows actors to take latent positions on a “gray” spectrum between “black” and “white” polar opposites. We are able to validate results from this model by measuring the influences exhibited by individual words, and we are able to quantify the uncertainty in the scaling estimates by using a sentence-level block bootstrap. Applying our method to the Dáil debate, we are able to scale the legislators between extreme pro-government and pro-opposition in a way that reveals nuances in their speeches not captured by their votes or party affiliations.
This research was supported by European Research Council grant ERC-2011-StG 283794-QUANTESS.
1 Introduction
Text classification, where the goal is to infer a discrete class label from observed text, is a core activity in statistical machine learning and natural language processing. Instances of this problem include inferring authorship (Mosteller and Wallace, 1963) or genre (Kessler et al., 1997), detecting deception (Newman, Pennebaker and Berry, 2003), classifying email as “spam” (Heckerman et al., 1998), or detecting sentiment (Pang, Lee and Vaithyanathan, 2002). The huge appeal of the methods developed for these applications is that, from a small training set, it is possible to classify a large number of unlabelled documents to reasonable accuracy without costly human intervention.
In many applications, however, classification is an uninteresting goal, since the correct identification of the class is obvious and costless. It is fundamentally uninteresting, for example, to attempt to predict the political party of a speaker or the identity of a Supreme Court justice. Furthermore, in many social and political settings with observed discrete outcomes, institutions may cause predicted and observed class membership to diverge in significant ways. In parliamentary democracies where party discipline is enforced, for instance, voting may follow party lines even if the best predictions from observable features indicate more heterogeneous outcomes. In such cases, it is trivial to predict class (a legislator’s vote) from observable covariates (political party). In the presence of these covariates, the text of a speech is ancillary to the goal of class label prediction.
Even when observing text does not improve prediction performance, it is not the case that text is uninformative. In legislative debates, the text that legislators generate through floor speeches may provide a direct opportunity for them to express their contrary and divergent preferences (see for instance Benoit and Herzog, 2012). With legal briefs, to take another example, it is trivial to classify opinions as majority or dissenting, but using the observed text and other information it is possible to place the briefs on a spectrum between the two extremes (Clark and Lauderdale, 2010). Simply attempting to predict the category of opinion, for instance classifying amicus curiae briefs as pro-petitioner or pro-respondent (e.g. Evans et al., 2007), is of less direct interest since these categories are already known. The text of a document can reveal nuances that are not captured by, and sometimes in disagreement with, its class label.
Table 1: Composition of the 1991 Dáil and summary of the debate speech texts.

  Government party members          Opposition party members
  Fianna Fáil (FF)           24     Democratic Left (DL)       3
  Progressive Dems. (PD)      1     Fine Gael (FG)            22
                                    Green                      1
                                    Labour (Lab)               7

  Speech text
  Median length (leaders)     6,348 tokens
  Median length (others)      2,210 tokens
  Vocabulary size             9,731 word types
Here, we focus on an application that is ill-suited to text classification but where text is nonetheless informative. We analyze the 1991 Irish Dáil confidence debate, previously studied by Laver and Benoit (2002), who used the debate speeches to demonstrate their “Wordscores” scaling method. The context is that in 1991, as the country was coming out of a recession, a series of corruption scandals surfaced involving improper property deals made between the government and certain private companies. The public backlash precipitated a confidence vote in the government, on which the legislators (each called a Teachta Dála, or TD) debated and then voted to decide whether the current government would remain or be forced constitutionally to resign. Table 1 summarizes the composition of the Dáil in 1991 and provides some descriptive statistics about the speech texts. We can use the debate as a chance to learn the legislators’ ideological positions.
Because the Irish parliamentary context is characterized by strict party discipline, the move was largely symbolic and each legislator voted strictly with his or her party: all members of the governing parties (Fianna Fáil and the Progressive Democrats) voted to support the government, and all members of the opposition parties (the Democratic Left, Fine Gael, Green, and Labour) voted against.
Despite the votes being entirely predictable, the floor speeches from the debate before the official tally reveal nuances to legislators’ positions. Take, for example, the following excerpt from Noel Davern, a moderate from the Fianna Fáil party:
It is not that the financial scandals have not occurred. They have occurred and the Government have taken quick action on them. In fact, we are not fully qualified to speak on them until we see the results of the full and independent inquiry.
Davern supports the government, but at the same time does not excuse them from all culpability. Contrast this with a typical opposition speech, calling for a vote against the confidence motion, from Labour TD Michael Ferris:
Our decision to oppose this motion of confidence is a positive assertion of the disapproval of the ordinary people of the actions of this discredited Government. The people have watched with amazement the unfolding of scandals which have tainted this Government. The Government cannot now be said to deserve the confidence of the people.
Both legislators express views that place them somewhere between the two extremes of absolute government support and absolute opposition support.
Where do Davern, Ferris, and the other 56 TDs that participated in the debate lie on this ideological spectrum? This is the essential question that we attack in this manuscript. In answering the question, we have at our disposal the speech texts, along with some additional information. We know that the leader of the government (Haughey, the Fianna Fáil Taoiseach) will give a speech at the pro-government extreme of the spectrum, and we know that the heads of the two major opposition parties (Spring and De Rossa, the Labour and Democratic Left leaders) will be at the extreme of the other end. We will use these three texts as reference points by which to scale the other 55 ambiguous texts whose positions are unknown and must be estimated.
To solve our particular problem, we develop a new text scaling method that is broadly applicable to situations where most documents are unlabelled but we have a few examples of documents at the extremes of a hypothesized ideological or stylistic spectrum. Instead of predicting class membership, our objective in such problems is to scale a continuous characteristic, through measuring the fit of a text to a set of known classes based on its degree of similarity to typical texts from these classes.
In what follows, we develop the class affinity model and demonstrate its use in scaling the degree of support or opposition expressed in the speeches made during the confidence debate. We start by outlining the foundations of our scaling model, contrasting it first to similar approaches designed for classification (Section 2), and then to lexicographical association methods in the form of sentiment dictionaries (Section 3). Section 4 then sets out the model, comparing it to related approaches and highlighting the differences on statistical principles but also using our application. Sections 5 and 6 detail how this model and its reference distributions are estimated, while Section 7 relates the affinity model to related methods. In Section 8, we show how to measure the influence of individual words, and provide recommendations for removing common terms that might skew the results. We apply this procedure to choose a tailored vocabulary for our application in Section 9. Section 10 demonstrates how to estimate uncertainty for the class affinity scaled estimates. Finally, we summarize the results of fitting the class affinity model to our application (Section 11), and offer some concluding remarks.
2 Scaling with a classification method
We have stated repeatedly that classification is not our objective in this problem, but nonetheless there is a long tradition of fitting classification methods to text, and we might try applying one of those methods here. We have a “training set” of the three leadership speeches, one of which we can label as Government and two as Opposition. We can fit a supervised classification method to this training set and then use it to make predictions for the other 55 legislators.
Using the Naive Bayes text classification method popularized by Sahami et al. (1998), we would model the tokens in each speech text as independent draws from a label-dependent distribution estimated from the reference texts. Letting $k = 1$ denote the Government label and $k = 2$ denote the Opposition label, for each label $k$ and word type $v$ in our vocabulary $\mathcal{V}$, we would estimate $p_{kv}$, the probability that a random token drawn from a text with label $k$ is equal to $v$. Typically we use the empirical word occurrence frequencies in the reference documents or some smoothed version thereof. Here and throughout the text, unless otherwise noted we will take our vocabulary to be the set of word types that appear at least twice in the leadership speeches, excluding common function words from the modified Snowball stop word list distributed with the quanteda software package (Porter, 2006; Benoit, 2017); we ignore words outside this set.
Under the “naive” assumption that tokens in a text are independent draws from the same distribution, assuming equal prior odds for each label, the log-odds that the label is Government given the word counts is

$$\lambda(y) = \sum_{v \in \mathcal{V}} y_v \log \frac{p_{1v}}{p_{2v}},$$

where $y_v$ denotes the number of times that word type $v$ appears in the text. The expression for $\lambda(y)$ arises as the log ratio of two multinomial likelihoods with probability vectors $p_1$ and $p_2$. Using Naive Bayes classification for this two-class prediction problem, we would predict the label as Government when $\lambda(y) > 0$, and we would predict the label as Opposition when $\lambda(y) < 0$.
The quantity $\lambda(y)$ measures the strength of the evidence that the label of a text is Government or Opposition, and we can use this quantity to scale the 55 virgin texts. Unfortunately, the Naive Bayes scaling method has serious drawbacks. First, the estimated log odds tend to be absurdly high: in our example, the median absolute log odds corresponds to an unrealistically extreme probability of class membership. Second, because $\lambda(y)$ measures the strength of the evidence, longer texts will tend to have higher absolute log odds. We illustrate both of these defects in Fig. 1, where we plot the absolute log odds of class membership as a function of text length.
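The length dependence of the Naive Bayes log-odds is easy to see numerically. The sketch below uses made-up reference distributions over a four-word vocabulary (the function name and all values are illustrative, not estimates from the debate): because the log-odds is linear in the counts, doubling every count doubles the evidence, pushing the implied class probability toward certainty.

```python
import numpy as np

def nb_log_odds(counts, p_gov, p_opp):
    """Naive Bayes log-odds that a text's label is Government:
    sum over word types of y_v * log(p_gov_v / p_opp_v)."""
    return counts @ (np.log(p_gov) - np.log(p_opp))

# Toy reference distributions over a 4-word vocabulary (invented for illustration).
p_gov = np.array([0.4, 0.3, 0.2, 0.1])
p_opp = np.array([0.1, 0.2, 0.3, 0.4])

y = np.array([5, 3, 2, 1])           # counts for a short text
print(nb_log_odds(y, p_gov, p_opp))
# Doubling the text doubles the log-odds, even though the word *rates*
# (and hence the apparent position of the speaker) are unchanged.
print(nb_log_odds(2 * y, p_gov, p_opp))
```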
Related methods suffer from versions of this same problem. Multinomial inverse regression (Taddy, 2013) regularizes the probability vector estimates and adds a calibration step to the log-odds, but it still suffers from the same drawbacks as Naive Bayes. Discriminative methods, like those used by Joachims (1998) and Jia et al. (2014), are affected to a degree depending on their choice of features. With logistic regression, for example, when the features are linear functions of the counts $y_v$, it will still be the case that longer documents have more extreme counts and hence more extreme predictions. Other choices of features can give rise to predictions that are less sensitive to variations in document length.
Even if these classification methods did not suffer the defects noted above, there is still a fundamental disconnect between the classification philosophy and the goals of scaling. In the classification world, a document is either “black” or “white;” for an unlabelled document, the method will tell you the probability that the label is black. In reality, though, a text is “gray,” a mixture of black and white. This is a fundamental difference in perspective that precludes using a classification method for our task. We expand on this metaphor below.
3 Scaling with dictionaries
Not all text scaling methods take the black-and-white classification view of the world. One of the most successful alternatives is dictionary-based scaling (Stone, Dunphy and Smith, 1966; Pennebaker, Francis and Booth, 2001; Hu and Liu, 2004). In their simplest forms, dictionary methods conceive of each text as a mixture of two contrasting poles, such as positive and negative. Neutral words get discarded from the vocabulary. The scaling of a text is determined by the average orientation of its tokens.
There are many variations of dictionary-based scaling, but for concreteness we will focus on Grimmer and Stewart’s (2013) formulation. To apply that scaling to the problem at hand, scaling debate speeches, we would need two non-overlapping lists: one of words associated with Government and one of words associated with Opposition. Given these lists, we would assign a score of $+1$ to each word type in the Government list, and a score of $-1$ to each word type in the Opposition list. The dictionary-based scaling of a text with token count vector $y$ would be

$$s(y) = \frac{1}{n} \Big( \sum_{v \in \mathcal{G}} y_v - \sum_{v \in \mathcal{O}} y_v \Big),$$

where $\mathcal{G}$ and $\mathcal{O}$ denote the Government and Opposition lists and $n = \sum_{v \in \mathcal{V}} y_v$; this quantity is equal to the difference in word type occurrence rates between the Government and Opposition lists.
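In code, dictionary scaling is just a difference of occurrence rates. The word lists below are invented for illustration; a real analysis would use curated dictionaries.

```python
# Hypothetical Government and Opposition word lists for a toy example.
gov_words = {"growth", "progress"}
opp_words = {"scandal", "failed"}

def dictionary_score(tokens):
    """Difference in occurrence rates between the two lists; lies in [-1, 1]."""
    n = len(tokens)
    n_gov = sum(t in gov_words for t in tokens)
    n_opp = sum(t in opp_words for t in tokens)
    return (n_gov - n_opp) / n

tokens = "the scandal hurt growth and progress failed to resume".split()
print(dictionary_score(tokens))  # (2 - 2) / 9 = 0.0: a perfectly "mixed" text
```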
It is labor-intensive and error-prone to build a custom dictionary for each application, so often when practitioners apply dictionary scaling methods, they use off-the-shelf dictionaries instead of building their own. For our application, the Lexicoder sentiment dictionary (LSD, 2015 version), “a broad lexicon scored for positive and negative tone and tailored primarily to political texts,” would be a natural choice (Young and Soroka, 2012, 211). However, as those authors note, applying an off-the-shelf dictionary to a new domain often leads to undesirable results. Table 2 illustrates this point in the context of our application by comparing the word orientations as determined by the LSD with their empirical associations with Government and Opposition as observed in the leadership speeches. The rows indicate the LSD-assigned orientations of the words; the columns are the significant differences in usage rates between the two classes as measured by the “keyness” likelihood ratio score at a fixed significance level, taking negations into account as recommended by Young and Soroka (2012). We display the number of word types in each cell, along with the most common words.
Table 2: Word types cross-classified by LSD sentiment (rows) and empirical Government/Opposition association in the leadership speeches (columns), with the count of word types and the most common examples in each cell.

Positive sentiment
  Government (11): partners, progress, balance, achieved, legitimate, best, forward, better, improvement, improvements
  Neutral (377): confidence, like, great, well, ensure, hope, good, opportunity, normal, responsible
  Opposition (2): wealth, creation

Neutral sentiment
  Government (66): public, now, economic, per, economy, cent, growth, new, way, community
  Neutral (2,329): government, country, business, irish, made, many, us, can, years, must
  Opposition (54): people, political, house, mr, one, taoiseach, minister, deputy, time, questions

Negative sentiment
  Government (8): problems, ireland’s, debt, difficulties, deficit, deterioration, opposite, implications
  Neutral (346): scandals, ireland, difficult, allegations, failed, concern, scandal, unfortunately, innuendo, loss
  Opposition (0): —
If the dictionary were appropriate for our application, we should observe positive words associated with government usage, and negative words associated with opposition usage. The patterns in Table 2, however, show a very different result. Only 11 “positive” words have high usage in the government leadership speech, and no “negative” words have high usage in the opposition leadership speeches. Most “positive” and “negative” words do not have a clear association with either Government or Opposition. Furthermore, there are some worrying cases where the dictionary orientation is counter to the association between the classes. For example, while the LSD declares the word deficit to be negative, in the context of the debate deficit refers simply to a fiscal outcome; likewise, confidence is related to the question of the debate, and not intended to convey positive valence. Despite being designed to detect political valence, the dictionary fails here because it has not been tailored for this particular debate. Terms that are generally associated with one type of affect are used differently in the context of the no-confidence debate.
Beyond the problem of domain adaptation, the more fundamental issue with dictionary methods is that their basic premise, that each word has a clear orientation, is inappropriate in our domain. Most words in our application do not clearly belong to one category or the other. We can see this in Table 2, where over 95% of the word types do not have statistically significantly different usage rates between the government and opposition leadership speeches. The vast majority of words get used by both government and opposition, and thus have mixed associations with both classes. Some dictionaries try to adjust for this by giving non-binary scores to the words (Bradley and Lang, 1999), but these adjustments are often ad hoc, and they suffer from the same domain adaptation problems. In the sequel, we present an alternative method that allows for mixed word associations while simultaneously adapting to the domain.
4 The affinity model
Classification methods assume that each text is a member of a well-defined category. Dictionary methods do not make this strong assumption, but they too take an unrealistic view of the world by supposing that each word has a well-defined orientation. Table 3 highlights this difference, and makes clear that there is room for a third worldview allowing both texts and words to be gray. We will formalize this intuition in a statistical model that we refer to as the “affinity model.”
Table 3: Worldviews of text analysis methods, by whether documents and words are treated as gray mixtures or as black/white (B/W).

                        Documents
                    Gray              B/W
  Words   Gray      Affinity model    Classification
          B/W       Dictionaries      —
Our basic conceptual model is that over the course of a speech, a speaker’s orientation switches back and forth between Government mode and Opposition mode. When she is in Government mode, she chooses words in the same manner as the government leadership. Likewise, when she is in Opposition mode, she chooses words in the same manner as the opposition leadership. We should place the speaker on the spectrum between the two extremes of pro-government and pro-opposition according to the proportion of time she spends in each mode.
Formally, let $\mathcal{V}$ denote the vocabulary of word types, a set with cardinality $V$. Encode the text of a speech as a sequence of tokens $x = (x_1, \dots, x_n)$, with each token belonging to $\mathcal{V}$. In our model, the speaker’s underlying orientation evolves in parallel to the text and can be represented as $u = (u_1, \dots, u_n)$, where for $i = 1, \dots, n$ the value $u_i$ denotes the speaker’s underlying orientation while uttering token $x_i$. We will in general suppose that there are $K$ possible orientations, identified with the labels $1, \dots, K$.

In our conceptual framework, a speech $x$ and the corresponding underlying orientation sequence $u$ are realizations of some speaker-specific random process. For $k = 1, \dots, K$, we define a speaker’s affinity toward orientation $k$ as $\theta_k$, the expected proportion of time that her underlying orientation is $k$:

$$\theta_k = \mathbb{E}\Big[ \frac{1}{n} \sum_{i=1}^{n} 1\{u_i = k\} \Big].$$

Each speaker has an underlying affinity vector $\theta = (\theta_1, \dots, \theta_K)$.

In our specific application, there are $K = 2$ orientations. Each debate speaker has a separate affinity vector $\theta = (\theta_1, \theta_2)$. We will scale each speaker by estimating his or her affinities for Government ($\theta_1$) and Opposition ($\theta_2$).

We will impose two simplifying assumptions to make inference under our model tractable. First, we will suppose that $u_1, \dots, u_n$ are independent and identically distributed. This forces that for every label $k$ and position $i$, the underlying orientation is randomly distributed with $\Pr(u_i = k) = \theta_k$. Second, we will suppose that $x_1, \dots, x_n$ are independent conditional on $u_1, \dots, u_n$, and that the distribution of $x_i$ depends only on $u_i$ and is the same for all positions $i$. This positional invariance allows us to define for each label $k$ and word type $v$ the probability

$$p_{kv} = \Pr(x_i = v \mid u_i = k),$$

and it allows us to define the reference distribution $p_k = (p_{k1}, \dots, p_{kV})$. Our two simplifying assumptions result in a generative model: for each position $i$, the speaker picks an underlying orientation $u_i = k$ with probabilities determined by $\theta$; given that the underlying orientation is $k$, the speaker picks token $x_i$ according to distribution $p_k$. Fig. 2(a) summarizes this generative process.
For each position $i$, the chance that word $v$ appears in position $i$ is

$$\Pr(x_i = v) = \sum_{k=1}^{K} \theta_k p_{kv}.$$

Further, $x_1, \dots, x_n$ are independent, so that the probability of observing the token sequence is

$$\Pr(x_1, \dots, x_n) = \prod_{v \in \mathcal{V}} \Big( \sum_{k=1}^{K} \theta_k p_{kv} \Big)^{y_v},  (1)$$

where $y_v$ is the number of times word $v$ appears in the text. At a high level, this is the same generative model as that used for a topic model (Blei, Ng and Jordan, 2003). The main difference between these models is that topic models are typically unsupervised, but the affinity model uses supervision to estimate $p_1, \dots, p_K$. We elaborate more on the connection to topic models in Section 7.4.
We note also that the affinity model can be seen as a generalization of the Naive Bayes model depicted in Fig. 2(b). In the Naive Bayes model, each document has a single underlying orientation, $u$, shared by all words in the document. The parameter $\theta$ can be seen as the prior distribution for $u$. In Naive Bayes, we do not estimate $\theta$; instead we estimate $p_k$ for each class $k$. The power of the affinity model is that it allows the underlying orientation to vary with the word position.
5 Estimating affinities
The affinity model described in Section 4 lends itself naturally to likelihood-based estimation. We first consider the problem of estimating the affinity vector $\theta$ for a particular text, when we are given the reference distributions $p_1, \dots, p_K$.

The parameter space for the affinity vector is the simplex $\Theta$ consisting of all vectors $\theta$ with nonnegative components satisfying the equality constraint $\sum_{k=1}^{K} \theta_k = 1$. One implication of the equality constraint is that the model is overparametrized, which makes estimating $\theta$ directly awkward. To handle this constraint, we will reparametrize the model in terms of a $(K-1)$-dimensional contrast vector $\eta$.

In the $K = 2$ case, we set $\eta = \theta_2 - \theta_1$, so that $\theta = \big( \tfrac{1-\eta}{2}, \tfrac{1+\eta}{2} \big)$ and the parameter space for $\eta$ is $[-1, 1]$. In the general case we let $\eta$ be defined by the relation

$$\theta = \theta_0 + C \eta,  (2)$$

where $\theta_0$ is any point in the interior of the parameter space and the contrast matrix $C \in \mathbb{R}^{K \times (K-1)}$ has full rank and satisfies $\mathbf{1}^\top C = 0$. In principle $\theta_0$ and $C$ can be arbitrary, but for concreteness we will take $\theta_0 = (1/K, \dots, 1/K)$, the center of the parameter space $\Theta$, and we will take $C$ to be a scaled Helmert contrast matrix. The parameter space for the contrast vector, then, is $H = \{ \eta : \theta_0 + C\eta \succeq 0 \}$, where $\succeq$ denotes componentwise partial order. With this particular choice of $\theta_0$ and $C$, the general case agrees with the special case when $K = 2$.
Following equation (1), the log likelihood function for the contrast vector is

$$\ell(\eta) = \sum_{v \in \mathcal{V}} y_v \log \mu_v(\eta),  (3)$$

where $\theta(\eta) = \theta_0 + C\eta$ and $\mu_v(\eta) = \sum_{k=1}^{K} \theta_k(\eta)\, p_{kv}$. We will estimate $\eta$ by maximizing $\ell$ or a penalized version thereof.
In the special case when $K = 2$, the score and observed information functions gotten from differentiating the log likelihood are

$$\ell'(\eta) = \sum_{v \in \mathcal{V}} y_v\, \frac{(p_{2v} - p_{1v})/2}{\mu_v(\eta)}, \qquad -\ell''(\eta) = \sum_{v \in \mathcal{V}} y_v\, \frac{\{(p_{2v} - p_{1v})/2\}^2}{\mu_v(\eta)^2}.$$

The expected information, obtained by replacing $y_v$ with its expectation $n\,\mu_v(\eta)$, is

$$\mathcal{I}(\eta) = n \sum_{v \in \mathcal{V}} \frac{\{(p_{2v} - p_{1v})/2\}^2}{\mu_v(\eta)}.$$

To define the analogous functions in the general case, let $P$ denote the $V \times K$ matrix with $v$th row equal to $(p_{1v}, \dots, p_{Kv})$ for $v \in \mathcal{V}$, so that the vector $\mu(\eta)$ of word probabilities is $P\theta(\eta)$. The score and observed information functions are

$$\nabla \ell(\eta) = C^\top P^\top D(\eta)^{-1} y,  (4)$$

$$-\nabla^2 \ell(\eta) = C^\top P^\top D(\eta)^{-1} \operatorname{diag}(y)\, D(\eta)^{-1} P C,  (5)$$

where $D(\eta)$ is the diagonal matrix with $D_{vv}(\eta) = \mu_v(\eta)$ for $v \in \mathcal{V}$. The expected information is

$$\mathcal{I}(\eta) = n\, C^\top P^\top D(\eta)^{-1} P C.$$
The observed information function is positive semidefinite, indicating that the log likelihood function is concave. We can estimate $\eta$ by maximizing the log likelihood using the Newton–Raphson iterative method. The expensive part of this maximization procedure is computing the observed information (5), whose cost scales with the size of the vocabulary, or less when the count vector $y$ is sparse. In our experience on the Dáil speeches, the method typically converges after about five iterations. The difficult part of the optimization is that we must restrict the search to the parameter space $H$; we accomplish this using an interior-point barrier method (Boyd and Vandenberghe, 2004, Ch. 11).
In exchange for adding a small bias to the estimates, we can reduce the variance and remove the explicit inequality constraints on the parameter space. In particular, Firth (1993) shows that in the asymptotic regime where $n$ tends to infinity, adding a penalty of order $O(1)$ to a log likelihood perturbs the bias of the estimator at order $O(n^{-1})$ (sometimes reducing the estimator’s bias, but not necessarily doing so in our setting). In our case, we choose a positive scalar $\alpha$ and define the penalty function

$$g(\eta) = \alpha \sum_{k=1}^{K} \log \theta_k(\eta).$$

Then, we estimate the affinities by maximizing the penalized log likelihood $\tilde\ell(\eta) = \ell(\eta) + g(\eta)$. The penalty ensures that $\tilde\ell$ is strictly concave, and further that the maximizer is unique and belongs to the interior of the parameter space. For the analyses in this manuscript, we use the penalty value $\alpha = 1/2$. Section 6 provides some theoretical justification for this penalty value in a related context.
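For the two-class case, the penalized estimate can be computed with a few Newton steps, since the penalized log likelihood is strictly concave in the scalar contrast. The sketch below is a simplified illustration (plain clipped Newton iterations rather than a barrier method), with toy reference distributions and counts; here $\eta = \theta_2 - \theta_1$, so $\theta = ((1-\eta)/2, (1+\eta)/2)$.

```python
import numpy as np

def estimate_affinity(y, p1, p2, alpha=0.5, iters=50):
    """Penalized ML estimate of eta = theta_2 - theta_1 for the K = 2 model.

    Maximizes sum_v y_v log mu_v(eta) + alpha*(log theta_1 + log theta_2),
    where mu_v(eta) = theta_1 * p1_v + theta_2 * p2_v.
    """
    eta = 0.0
    d = (p2 - p1) / 2.0                     # d mu_v / d eta
    for _ in range(iters):
        t1, t2 = (1 - eta) / 2, (1 + eta) / 2
        mu = t1 * p1 + t2 * p2
        # score and observed information of the penalized log likelihood
        score = np.sum(y * d / mu) + alpha * (-0.5 / t1 + 0.5 / t2)
        info = np.sum(y * d**2 / mu**2) + alpha * (0.25 / t1**2 + 0.25 / t2**2)
        eta = np.clip(eta + score / info, -0.999, 0.999)  # stay in the interior
    return (1 - eta) / 2, (1 + eta) / 2      # (theta_1, theta_2)

p1 = np.array([0.4, 0.3, 0.2, 0.1])
p2 = np.array([0.1, 0.2, 0.3, 0.4])
y = np.array([310, 270, 230, 190])  # counts matching a 70/30 mixture of p1, p2
theta1, theta2 = estimate_affinity(y, p1, p2)
print(theta1, theta2)               # close to (0.7, 0.3), slightly shrunk by the penalty
```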
6 Estimating reference distributions
The reference distributions $p_1, \dots, p_K$ themselves need to be estimated from data. In our framework, this learning step requires not large volumes of training data, but rather texts that are clearly polar examples of each reference class, to form benchmarks for estimating the other texts’ affinities to these classes. In the context of our specific application, the 1991 Irish Dáil confidence debate, recall that the contrasting classes represent Government ($k = 1$) and Opposition ($k = 2$). We will use the leaders of the government and opposition respectively to represent the archetype texts for each class. Taoiseach (Prime Minister) Charles Haughey’s speech forms the government reference text for estimating $p_1$, and the speeches from the two opposition party leaders (Spring and De Rossa) form the reference texts for estimating $p_2$.
To estimate a particular reference distribution $p_k$, we will suppose in general that we have at our disposal $M$ texts drawn from this distribution, of lengths $n_1, \dots, n_M$. We denote the vectors of word counts for these texts by $y^{(1)}, \dots, y^{(M)}$. In our application, $M = 1$ for estimating the Government reference, and $M = 2$ for estimating the Opposition reference. We will use smoothed empirical frequencies to estimate $p_k$, as advocated by Lidstone (1920). We choose a nonnegative smoothing constant $\alpha$ and estimate the probability of word type $v$ as

$$\hat p_{kv} = \frac{\alpha + \sum_{m=1}^{M} y^{(m)}_v}{V \alpha + \sum_{m=1}^{M} n_m}.$$

Specifically, we will set $\alpha = 1/2$. It is not essential to smooth the estimates of $p_k$, but doing so reduces estimation variability.

There are many reasonable choices for the smoothing constant $\alpha$, including choosing $\alpha$ adaptively (Fienberg and Holland, 1972). In natural language processing, it is common to take $\alpha = 1$, so that $\hat p_k$ is the maximum a posteriori estimator under a uniform prior (Jurafsky and Martin, 2009, Sec. 4.5.1). From a frequentist standpoint, the value $\alpha = 1/2$, which corresponds to using a Jeffreys prior for $p_k$, is slightly more defensible. In the regime where $V$ is fixed and the total reference length tends to infinity, using the results from Firth (1993) one can show that taking $\alpha = 1/2$ reduces the order of the expected Kullback–Leibler divergence from $p_k$ to $\hat p_k$ relative to other choices of $\alpha$.
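The Lidstone-smoothed reference estimate pools counts across the reference texts and adds the smoothing constant to every word type. The sketch below uses invented counts and $\alpha = 1/2$ as the default.

```python
import numpy as np

def estimate_reference(count_vectors, alpha=0.5):
    """Lidstone-smoothed estimate of a reference distribution p_k.

    count_vectors : list of word-count arrays, one per reference text
    alpha         : smoothing constant (0.5 corresponds to a Jeffreys prior)
    """
    totals = np.sum(count_vectors, axis=0).astype(float)
    return (totals + alpha) / (totals.sum() + alpha * len(totals))

# Two toy opposition reference texts over a 4-word vocabulary.
y1 = np.array([3, 0, 1, 6])
y2 = np.array([2, 1, 0, 7])
p_hat = estimate_reference([y1, y2])
print(p_hat)   # sums to 1; zero-count words receive small positive mass
```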
Once we have estimates of the reference distributions, to get an estimate of the class affinity vector for a text, we use the methods from Section 5, using the estimated class distributions in place of their true values.
7 Connections to other methods
7.1 Dictionary methods
In the special case that the reference distributions have disjoint supports, that is, when no two classes $j \neq k$ are such that both $p_{jv} > 0$ and $p_{kv} > 0$ for some word type $v$, affinity scaling is exactly equivalent to dictionary scaling.

To make this equivalence clear, suppose that for each word type $v$, at most one of the reference probabilities $p_{1v}, \dots, p_{Kv}$ is nonzero. When this is the case, we can partition the vocabulary as a union of disjoint sets, $\mathcal{V} = \mathcal{V}_0 \cup \mathcal{V}_1 \cup \dots \cup \mathcal{V}_K$, where

$$\mathcal{V}_k = \{ v \in \mathcal{V} : p_{kv} > 0 \} \quad \text{for } k = 1, \dots, K,$$

and $\mathcal{V}_0$ collects the word types outside the support of every reference distribution. Here, $\mathcal{V}_k$ is the set of word types associated with label $k$. The disjoint support condition ensures that each word type is associated with at most one label.

Under the disjoint support condition, when we observe the $i$th token $x_i$, we can immediately infer the underlying orientation $u_i$ to be the only class with this word in its support. The log likelihood simplifies to

$$\ell(\theta) = \sum_{k=1}^{K} \Big( \sum_{v \in \mathcal{V}_k} y_v \Big) \log \theta_k + \text{const},$$

where the constant does not depend on $\theta$. In this case, the maximum likelihood estimate of the class affinity vector is

$$\hat\theta_k = \frac{\sum_{v \in \mathcal{V}_k} y_v}{\sum_{j=1}^{K} \sum_{v \in \mathcal{V}_j} y_v}, \qquad k = 1, \dots, K.$$

That is, the estimated class affinities are the token occurrence rates in the support sets $\mathcal{V}_1, \dots, \mathcal{V}_K$.
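Under disjoint supports the estimator needs no optimization at all, as the sketch below shows; the support sets and counts are hypothetical.

```python
import numpy as np

# Vocabulary indices 0-1 appear only in class 1's reference distribution,
# indices 2-3 only in class 2's (a toy disjoint-support configuration).
support = {1: [0, 1], 2: [2, 3]}

def affinity_disjoint(y):
    """MLE of theta when the reference distributions have disjoint supports:
    the affinity for each class is its share of the classified tokens."""
    counts = {k: sum(y[v] for v in vs) for k, vs in support.items()}
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

y = np.array([6, 2, 1, 1])     # 8 class-1 tokens, 2 class-2 tokens
print(affinity_disjoint(y))    # {1: 0.8, 2: 0.2}
```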
7.2 Wordscores
The “Wordscores” scaling method developed by Laver, Benoit and Garry (2003) turns out to be closely related to class affinity scaling. That method, which is primarily used to scale documents between reference classes, works well in practice but has been criticized for having ad hoc theoretical foundations (Lowe, 2008). We can show, however, that Wordscores gives results highly correlated with affinity scaling for texts that are not close to the extremes (represented by the reference text positions). We elaborate on this connection below.
In its simplest form, Wordscores takes as given reference distributions for each class, denoted $p_1$ and $p_2$. The method defines the wordscore of a word type $v$ as

$$s_v = \frac{p_{2v} - p_{1v}}{p_{1v} + p_{2v}}.  (6)$$

Word types that only appear in class 2 have scores of $+1$, while types that only appear in class 1 have scores of $-1$. Other types have intermediate values indicating the relative degrees of association with the two classes. The unnormalized “text score” of a length-$n$ text with token count vector $y$ is then the average wordscore of its tokens:

$$s(y) = \frac{1}{n} \sum_{v \in \mathcal{V}} y_v s_v.  (7)$$

Texts with positive values tend to be more like class 2, while texts with negative values tend to be more like class 1.
The magnitude of the unnormalized score is not directly interpretable. To fix this, Martin and Vanberg (2007) advocate rescaling the score to ensure that average reference texts from the two classes have scores of $-1$ and $+1$. To realize the Martin–Vanberg scaling, for $k = 1, 2$ define

$$\bar s_k = \sum_{v \in \mathcal{V}} p_{kv} s_v.$$

An average text of length $n$ from class $k$ has token counts satisfying $y_v = n\, p_{kv}$, so that its score is $s(y) = \bar s_k$. Using the relation $s_v (p_{1v} + p_{2v}) = p_{2v} - p_{1v}$ termwise in the sum, one can verify that $\bar s_1 + \bar s_2 = 0$. The Martin–Vanberg wordscore scaling is

$$s^*(y) = \frac{s(y)}{\bar s_2}.$$

An average text from class 1 satisfies $s^*(y) = -1$; an average text from class 2 satisfies $s^*(y) = +1$.
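The wordscore computation and the Martin–Vanberg rescaling can be sketched in a few lines; the reference distributions below are toy values. By construction, a text with counts proportional to the class-2 reference rescales to $+1$, and one proportional to the class-1 reference rescales to $-1$.

```python
import numpy as np

def wordscores(p1, p2):
    """Wordscore of each type: s_v = (p2_v - p1_v) / (p1_v + p2_v)."""
    return (p2 - p1) / (p1 + p2)

def text_score(y, s):
    """Unnormalized text score: average wordscore of the tokens."""
    return (y @ s) / y.sum()

def mv_rescale(score, p2, s):
    """Martin-Vanberg rescaling: divide by s_bar_2 = sum_v p2_v s_v,
    so that average class texts score -1 and +1."""
    return score / (p2 @ s)

p1 = np.array([0.4, 0.3, 0.2, 0.1])   # toy class-1 reference
p2 = np.array([0.1, 0.2, 0.3, 0.4])   # toy class-2 reference
s = wordscores(p1, p2)

# An "average" class-2 text has counts proportional to p2.
print(mv_rescale(text_score(1000 * p2, s), p2, s))   # rescales to +1
print(mv_rescale(text_score(1000 * p1, s), p2, s))   # rescales to -1
```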
The wordscore scaling turns out to be deeply connected to affinity scaling. To see this connection, note that using the parameterization from Section 5, the score and expected information functions for the affinity model evaluated at $\eta = 0$ are

$$\ell'(0) = \sum_{v \in \mathcal{V}} y_v \frac{p_{2v} - p_{1v}}{p_{1v} + p_{2v}} = n\, s(y), \qquad \mathcal{I}(0) = \frac{n}{2} \sum_{v \in \mathcal{V}} s_v (p_{2v} - p_{1v}) = n\, \bar s_2.$$

There is a striking relationship between the scaled text score and the derivatives of the mixture model log likelihood:

$$s^*(y) = \frac{\ell'(0)}{\mathcal{I}(0)}.$$

The right hand side of this expression is equal to the first Fisher scoring iterate computed while maximizing $\ell$ starting from the initial value $\eta = 0$. When the maximizer $\hat\eta$ is close to $0$, it will be approximately equal to this first iterate. Thus, when a text is roughly balanced between the two reference classes ($\hat\theta_1 \approx \hat\theta_2$), it will also be the case that

$$s^*(y) \approx \hat\eta = \hat\theta_2 - \hat\theta_1.$$

For moderate documents, the wordscore scaling is approximately a linear transformation of the estimated class affinities.

We demonstrate the quality of this approximation in Fig. 2(d), where we plot the wordscore scaling versus the estimated government affinity for the moderate debate speeches. We can see that there is very good agreement between the two scalings, and that for texts near the middle of the scale the two scalings are almost identical.
7.3 Support vector machines and logistic regression
We have just shown analytically that affinity scaling gives similar results to Wordscores. It turns out that, when the number of reference documents is small, both methods are, up to scaling, approximately equivalent to classifying with a support vector machine or logistic regression.
Suppose that we are in the two-class case, and that there is one reference document for each class. Imagine fitting a linear classifier that tries to predict the class using a document’s word frequencies as features. With a vocabulary size greater than the number of training documents, the two classes can be perfectly separated as long as the two reference distributions corresponding to the training documents are not identical. In this case, the support vector machine fit and the logistic regression fit are identical, up to differences that arise from regularizing the coefficients.
Given a document with length and word count vector , its feature vector is its vector of word frequencies, . The feature vectors for the two training documents are and . Up to a constant of proportionality, the maximum margin predictor, expressed as a function of is
(8) 
Since the classes are perfectly separated, any multiple of this predictor gives the same classification performance on the training set; the precise scaling chosen by the fitting procedure will depend on the regularization parameters.
Comparing the support vector machine scaling (8) with the unnormalized wordscores scaling (7), we can see that the only substantive difference is the denominator in the coefficient on . Thus, up to a constant shift and scale, if is roughly constant relative to , then the two methods will give similar results. In light of the connection between Wordscores and affinity scaling developed in Sec. 7.2, this implies that in these situations, the support vector machine results will be highly correlated with the affinity scaling results.
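As a toy check on this equivalence, the snippet below builds one frequency vector per class and verifies that the hard-margin linear predictor, whose direction is proportional to the difference of the two training feature vectors, separates them. The counts are invented, and the max-margin solution is computed analytically for the two-point case rather than with an SVM solver.

```python
# With exactly one separable training point per class, the maximum-margin
# direction is proportional to the difference of the two frequency vectors,
# so the linear predictor is (q2 - q1) . f up to shift and scale.

def freq(counts):
    n = sum(counts)
    return [c / n for c in counts]

q1 = freq([8, 6, 4, 2])   # class 1 reference word counts (toy numbers)
q2 = freq([2, 4, 6, 8])   # class 2 reference word counts (toy numbers)

w = [b - a for a, b in zip(q1, q2)]                           # margin direction
b = -0.5 * sum(wi * (a + c) for wi, a, c in zip(w, q1, q2))   # midpoint offset

def predict(f):
    """Signed score of a frequency vector f under the max-margin hyperplane."""
    return sum(wi * fi for wi, fi in zip(w, f)) + b

print(predict(q1) < 0, predict(q2) > 0)  # the two training points separate
```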
We verified the connection between the two methods empirically, using the SVMlight software with the default tuning parameters (Joachims, 1999). Fig. 2(b) shows the support vector machine estimated log odds plotted against the affinity scaling results. Both scalings give similar results (correlation ). The main distinction is that the numerical value of the support vector machine log odds is determined completely by the regularization parameter and is thus uninterpretable. The affinity scaling of a document, by contrast, can be interpreted directly.
7.4 Topic models
Topic models share a similar perspective with the affinity model in that both represent texts as mixtures of topics, with each topic having an associated word distribution. In our framework, the topics correspond to the reference classes, and the text-specific topic weights correspond to class affinities. We learn the class distributions from a set of labeled reference texts. This approach differs from that taken by unsupervised topic models (Blei, Ng and Jordan, 2003; Grimmer, 2010), where estimated topics may or may not correspond to scaling quantities of interest.
Supervised variants of topic models allow for associations between labels and topics (McAuliffe and Blei, 2008; Ramage et al., 2009; Roberts, Stewart and Airoldi, 2016). These models force clear associations between the topics and the scaling quantities of interest, but they assume that the texts have discrete labels indicating class membership. This fundamental assumption places these methods in the same category as other classification methods like Naive Bayes: they estimate the probability of class membership, not class affinity.
Despite their philosophical differences, in practice supervised topic models can give scalings that are highly correlated with the affinity model scaling. The connection is easiest to understand in the case of McAuliffe and Blei’s (2008) Supervised Latent Dirichlet Allocation (sLDA), which models a text-specific label as a random quantity linked to a linear function of the text-specific topic weights. Roughly speaking, the method works in two stages. In the first stage, sLDA fits a topic model to the reference texts. In the second stage, sLDA fits a logistic regression model using the fitted topic weights as predictors and the class label as response. In practice, sLDA fits the topics and the logistic regression simultaneously, but when the number of topics is larger than the number of reference texts, any differences between sequential and simultaneous fitting are determined by the regularization parameters and the random initialization.
The connection between sLDA and affinity model scaling is closest with two topics and two reference texts. In this case, since the number of topics equals the number of reference texts, sLDA can achieve a perfect fit by allocating one topic to each reference text, and can separate the two classes perfectly given the topic weights by using a linear predictor for the odds of class membership whose coefficient is determined by the regularization parameters. When the sLDA fit is used for prediction on the unlabelled texts, the fitted topic weights will be the same as the values from a fitted affinity model (again, ignoring the effects of regularization and initialization). The sLDA score will be highly correlated with the difference in estimated affinities.
In the case when there are more topics and more reference texts, the relationship between affinity scaling and sLDA is not as simple, but the same general intuition still holds and the two methods give highly correlated results. Fig. 2(e) illustrates this with a model using 10 topics, where the correlation between the non-reference text scalings from the two methods is . Here, the sLDA method gives unreasonable results for the extremes. Furthermore, the interpretation of the scaling value is different: odds of class membership for sLDA, versus degree of membership for the affinity model.
7.5 Unsupervised methods
Some approaches to scaling texts, including Latent Semantic Indexing (Deerwester et al., 1990) and Slapin and Proksch (2008)’s “Wordfish” Poisson scaling method, estimate latent text-specific traits using unsupervised methods. Often, the estimated traits are correlated with recognizable attributes, and so they can be used to scale ideology. Letting $y_{ij}$ denote the count of word type $j$ in text $i$, the Slapin and Proksch (2008) Wordfish model specifies that $y_{ij}$ is a Poisson random variable with mean $\lambda_{ij} = \exp(\alpha_i + \psi_j + \beta_j \theta_i)$ for some unknown text-specific parameters ($\alpha_i$ and $\theta_i$) and word-specific parameters ($\psi_j$ and $\beta_j$). Estimates of $\theta_i$ have been shown to provide valid estimates of latent positions expressed in speeches (Lowe and Benoit, 2013).
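As a small sketch, assuming the standard Wordfish parameterization with mean $\lambda_{ij} = \exp(\alpha_i + \psi_j + \beta_j \theta_i)$, the model’s expected counts and Poisson log likelihood for one text can be written as follows; all numeric values are illustrative, not estimates from the paper.

```python
import math

def wordfish_mean(alpha_i, theta_i, psi, beta):
    """Expected count of each word type in text i under the Wordfish mean."""
    return [math.exp(alpha_i + psi_j + beta_j * theta_i)
            for psi_j, beta_j in zip(psi, beta)]

def poisson_loglik(counts, means):
    """Poisson log likelihood of the observed counts, including log y! terms."""
    return sum(y * math.log(m) - m - math.lgamma(y + 1)
               for y, m in zip(counts, means))

# toy word-specific parameters over a 3-word vocabulary
psi = [0.0, 0.5, -0.5]    # word fixed effects
beta = [1.0, -1.0, 0.2]   # word discrimination weights

mu = wordfish_mean(alpha_i=1.0, theta_i=0.3, psi=psi, beta=beta)
print(poisson_loglik([3, 2, 1], mu))
```

Unsupervised fitting would alternate between updating the text parameters and the word parameters to maximize this likelihood over the whole corpus.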
The drawback to unsupervised scaling of this sort, however, is that it provides no guarantee that the estimated latent trait corresponds to the quantity of interest. We demonstrate this behavior in Fig. 2(f), where we plot the Wordfish scaling estimates of the debate speeches versus the affinity scaling estimates. The two methods give similar results (correlation ), but there are also some notable differences. The government and opposition leaders are not the most extreme examples as determined by Wordfish, indicating that even in this focused context (a debate over a confidence motion), the primary dimension of difference is something other than the government–opposition divide.
8 Diagnostics
In the previous section, we used the simple analytic form of the affinity scaling model to get an understanding of its connections with other text scaling methods. Beyond this, we will now see another advantage of the model’s form: its simplicity facilitates computationally efficient diagnostic checking for the model fit.
Ideally, our fit should exhibit two characteristics. First, it should not be driven by a small number of word types, but instead it should be determined by an accumulation of information from many different word types. Second, the word types that show the most influence in determining the fit should be ones that make sense from a subject matter perspective. To check whether our scaling results satisfy these properties, and to better understand them generally, we will develop an influence measure to characterize the impact of each word type in determining the overall fit.
Our strategy for assessing influence stems from Cook (1977), who, in the context of linear regression, assesses the influence of each observation by measuring the change that results from deleting the observation. Proceeding analogously, we will measure the influence of a word type by setting the corresponding token count to zero and observing the change in the class affinity estimate . Ideally, we would do this by computing the maximizer of the log likelihood (or, when regularizing, the penalized log likelihood) gotten after setting to zero, but the large number of word types makes this impractical. We will settle for finding a computationally simple closedform approximation to .
Suppose that is a vector of token counts for the particular text of interest, and that is the affinity vector estimate gotten from , the maximizer of the corresponding log likelihood defined in (3). Making the dependence on explicit, the score and observed information functions are
where is a diagonal matrix with for and is as defined in Section 5.
For an arbitrary word type , consider the effect of setting . This gives a new vector of token counts with and for all . Let denote the th standard basis vector in and define , where . Note that , so that
Since , this implies that evaluating the score function with the new data at the old estimate gives
(9) 
The maximizer of the new log likelihood is roughly equal to the first Newton step from . We can compute this step explicitly by first computing the inverse of the observed information matrix:
(10) 
where .
Approximating the maximizer by the first Newton step from gives
where we have used (9) and (10) to simplify the expression. Using this approximation for gives us an approximation for the change in the estimated affinities:
Motivated by this approximation, we define our influence measure as
(11) 
where $\lVert \cdot \rVert_1$ denotes the $\ell_1$ norm. When we are regularizing the estimates, using a penalized log likelihood in place of , we define the influence similarly, using the negative Hessian in place of .
Using an $\ell_1$ norm instead of a Euclidean norm in the definition of allows us to interpret as the total amount of positive change to the components of . Given that the estimated affinities sum to one, this is also equal to the total amount of negative change.
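To make the deletion idea concrete, here is a small sketch of influence in the two-class case, computed by exact refitting on a toy vocabulary; the closed-form one-step approximation above exists precisely to avoid this refit at scale, but for a handful of word types the exact version is cheap. The reference distributions and counts are invented, and the fit uses a ternary search, which is valid because the log likelihood is concave in the affinity.

```python
import math

def loglik(theta, x, p1, p2):
    """Log likelihood of counts x when each token comes from the mixture
    theta * p1 + (1 - theta) * p2 (terms with zero counts are skipped)."""
    return sum(xj * math.log(theta * aj + (1 - theta) * bj)
               for xj, aj, bj in zip(x, p1, p2) if xj > 0)

def fit_affinity(x, p1, p2, tol=1e-10):
    """Maximize the concave log likelihood over theta in [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if loglik(m1, x, p1, p2) < loglik(m2, x, p1, p2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

def influence(j, x, p1, p2):
    """Change in the fitted affinity when word j's count is set to zero."""
    x_del = list(x)
    x_del[j] = 0
    return abs(fit_affinity(x, p1, p2) - fit_affinity(x_del, p1, p2))

p1 = [0.4, 0.3, 0.2, 0.1]   # toy class-1 reference distribution
p2 = [0.1, 0.2, 0.3, 0.4]   # toy class-2 reference distribution
x = [5, 3, 3, 5]            # a roughly balanced toy text
print([round(influence(j, x, p1, p2), 3) for j in range(4)])
```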
9 Vocabulary selection
As previously mentioned, the results presented in Fig. 3 and elsewhere in the preceding sections use as vocabulary the set of word types appearing in the leadership speeches, excluding words appearing only once and words on the English Snowball “stop” word list. Why did we exclude these words?
Initially, we did not exclude any words from the vocabulary. We fit the affinity model to the complete vocabulary and used it to scale the 55 nonleadership speeches. Then, to help understand our results, we computed the influence measures as defined in (11) for each speech word count vector and word type . We also recorded the direction of the influence (whether the appearance of the word pushes the fit towards Government or Opposition). This gave us a matrix of (speech, word) influence measures. Most of the entries of this matrix are zero since most count vectors are sparse and words that do not appear in a speech have no influence on its affinity estimate. For each word type, we recorded the count of nonzero speech influence entries, along with the median and maximum of the nonzero entries. We report these values in Table 4, grouped by the direction of influence.
Government  Opposition  

Word  Count  Median  Max  Word  Count  Median  Max 
and  55  1.3  2.5  the  55  2.5  4.7 
our  49  0.9  2.7  that  55  1.3  3.5 
graduate  3  0.8  0.9  to  55  1.2  2.6 
deasy  3  0.7  1.6  they  55  1.0  2.6 
attribute  1  0.7  0.7  a  55  0.9  1.7 
social  30  0.6  8.0  is  55  0.9  1.7 
per cent  26  0.6  3.2  not  55  0.7  1.6 
corresponding  1  0.6  0.6  people  54  0.7  3.0 
nation  12  0.6  1.4  it  55  0.7  1.7 
proof  2  0.6  1.0  he  42  0.6  2.0 
1987  20  0.5  2.7  at  54  0.5  1.3 
economic  33  0.5  2.1  his  43  0.5  1.4 
will  55  0.5  1.5  taoiseach  43  0.5  1.3 
international  18  0.5  1.1  by  55  0.4  0.7 
union  9  0.5  0.9  as  55  0.4  1.2 
We can see, for example, that the word type social exhibited influence on 30 speeches. For one of these speeches, deleting the word social has the effect of shifting the speech’s affinity estimate away from Government by 8.0; the median shift for the 30 speeches is 0.6. Deleting social shifts the fit away from Government; equivalently, the appearances of social push the fit towards Government.
The influence of a word is determined by its usage rate and the degree to which its usage is imbalanced across the reference classes. The word types that show up as influential in Table 4 either appear frequently and exhibit a small imbalance between Government and Opposition, or appear moderately often and exhibit a large imbalance between the two classes. This holds generally: influential words tend either to be highly imbalanced, or to be moderately imbalanced with high usage rates.
Many of the words in Table 4 make sense: for example, social, nation, and economic influence the affinity fit towards Government, and people and taoiseach influence the affinity fit towards Opposition. However, we can clearly see that certain function words like and and the exert a large influence on the fit. These function words have slightly imbalanced usage rates in the reference texts, which, compounded with a high usage rate, results in a large net influence. This sensitivity to stylistic differences is a manifestation of a common critique of the related Wordscores scaling method (Beauchamp, 2012; Grimmer and Stewart, 2013). To reduce sensitivity to stylistic differences, we eliminated function words (the Snowball English “stop” words) from our analysis.
We can also see in Table 4 that a few rare words like attribute and proof have large influence. These words are not meaningful discriminators on substantive grounds, but they show up as influential because they appear only once in the reference speeches. The estimated probabilities for these words are unreliable, and their influence is determined purely by estimation variability. To get around this, in our final analysis we chose to exclude these hapax legomena (words that appear only once in the reference speeches).
Government  Opposition  

Word  Count  Median  Max  Word  Count  Median  Max 
deasy  3  0.9  1.9  people  54  1.3  5.0 
per cent  26  0.8  3.7  taoiseach  43  0.8  3.1 
nation  12  0.8  1.8  democrats  23  0.7  1.9 
social  30  0.8  10.7  minister  44  0.6  2.5 
corresponding  1  0.7  0.7  system  37  0.6  2.7 
1990  17  0.7  2.0  house  54  0.5  1.9 
union  9  0.7  1.0  o’kennedy  5  0.5  0.9 
belief  3  0.7  1.0  progressive  24  0.5  1.4 
economic  33  0.7  2.8  say  39  0.5  1.3 
reform  19  0.7  2.4  issue  27  0.5  1.4 
1987  20  0.6  4.0  million  26  0.5  1.6 
policy  27  0.6  2.0  printed  2  0.5  0.7 
roads  6  0.6  2.6  wealth  6  0.5  1.4 
new  38  0.6  1.6  headings  2  0.4  0.4 
international  18  0.6  1.5  said  41  0.4  1.6 
After excluding stop words and hapax legomena, we were left with a reduced vocabulary of 1321 word types. We refit the model and rescaled the speeches, computing the influences of the word types in the reduced-vocabulary model. Table 5 shows the most influential Government and Opposition words, computed as before. It is possible that the Snowball word list missed some influential function words, but after inspecting the words in Table 5 and the other words further down the ranking, we found that this was not the case for our application. The only suspicious words are say and said, but in the context of the debate, it makes sense that these words are pro-Opposition. When the word said gets used, it is typically to quote the government (“they said” or “they continue to say”), usually by an opposition member criticizing the government. Likewise, at first glance it may seem suspicious that per cent is at the top of the Government list, but in fact it is often used to cite national statistics about the economy and GDP, using the state of the economy to explain the unrest.
10 Uncertainty quantification
In principle, it is possible to get standard errors for the affinity estimates directly from the expected or observed information function (5). However, these likelihood-based standard errors are likely too narrow, because they ignore uncertainty in the estimates of the reference distributions (), and they rely on the independence assumptions in the model. Ignoring uncertainty in the reference distribution estimates is inappropriate when the reference set is small, as it is here (three leadership speeches). Similarly, the independence assumption—that word tokens in different positions of a text are independent of each other—simplifies the analysis, but it is likely violated in real-world data. To accurately assess the uncertainty in our estimates, we need a method that accounts for the uncertainty in the reference distribution estimates and the dependence between nearby words in text.
To estimate the sampling distribution of the scaling estimates under dependence between word tokens, we use a block bootstrap that respects the natural linguistic structure of the text. Following Lowe and Benoit (2013), we resample texts at the sentence level, which simulates sampling variation while preserving meaningful dependencies among words within natural syntactic units. To properly account for uncertainty in the reference distribution estimates, we also construct sentence-level bootstrapped reference speeches. The full procedure is as follows:

For bootstrap replicates :

For each reference text construct bootstrapped reference text , where has sentences drawn with replacement from , with the same total number of sentences.

Use the bootstrapped reference texts to estimate the reference distributions as described in Sec. 6.

Construct a bootstrap version of the scaled text by resampling sentences from , with replacement.

Treating the reference distribution estimates as fixed, construct an affinityscaling estimate from .


Use the sample standard deviation of as the bootstrapped estimate of the standard error of the affinity scaling estimate for .
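The steps above can be sketched as follows. The helpers `estimate_reference_distributions` and `fit_affinity` are hypothetical stand-ins for the estimation steps of Sections 5 and 6; the toy definitions at the bottom replace them with deliberately crude versions (token sets and a vocabulary-overlap score) purely to make the sketch runnable.

```python
import random

def resample_sentences(text, rng):
    """Draw sentences with replacement, keeping the sentence count fixed."""
    return rng.choices(text, k=len(text))

def bootstrap_se(text, reference_texts, fit_affinity,
                 estimate_reference_distributions, B=200, seed=1):
    """Sentence-level block bootstrap standard error of an affinity estimate.
    Texts are lists of sentences; each sentence is a list of tokens."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(B):
        boot_refs = [resample_sentences(r, rng) for r in reference_texts]
        p_hat = estimate_reference_distributions(boot_refs)
        boot_text = resample_sentences(text, rng)
        estimates.append(fit_affinity(boot_text, p_hat))
    mean = sum(estimates) / B
    return (sum((e - mean) ** 2 for e in estimates) / (B - 1)) ** 0.5

# toy stand-ins: reference "distributions" are just token sets, and the
# affinity estimate is the share of the text's tokens found in class 1's set
def estimate_reference_distributions(refs):
    return [set(tok for sent in r for tok in sent) for r in refs]

def fit_affinity(text, p_hat):
    toks = [tok for sent in text for tok in sent]
    return sum(1 for t in toks if t in p_hat[0]) / len(toks)

refs = [[["tax", "growth"], ["economy", "jobs"]],
        [["scandal", "resign"], ["failure", "corrupt"]]]
speech = [["growth", "scandal"], ["jobs", "failure"], ["economy", "tax"]]
print(round(bootstrap_se(speech, refs, fit_affinity,
                         estimate_reference_distributions), 3))
```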
We performed this procedure for all 55 non-leadership speeches, obtaining a separate bootstrap standard error for each. For comparison, we computed likelihood-based (Wald) standard errors for the estimates from the Fisher information, conditional on the reference estimates. Unsurprisingly, the bootstrap standard errors are generally wider than the likelihood-based ones. The two uncertainty estimates are on the same order of magnitude, with the bootstrap standard error less than 1.5 times as large as the likelihood-based standard error for most of the speeches (87%); the median ratio of the two standard errors is 1.3. In what follows, we use bootstrap standard errors to quantify the uncertainty in the affinity estimates.
Fig. 4 displays the estimated government affinities for all 55 speeches after performing feature selection. The figure includes 95% confidence intervals, computed using the sentencelevel bootstrap. We discuss these results in detail in the next section.
11 Results
At both the government-versus-opposition level and the interparty level, the results are entirely in line with expectations: not only are the parties arrayed in the expected order, with the opposition parties on the Opposition side and the governing parties on the other, but the speeches from the different parties also align with the extremity of their positions with regard to the establishment. The speeches of the most centrist opposition party, Fine Gael, express a more moderate anti-Government position than those of either the left party Labour or the far-left Democratic Left. This median difference emerges clearly even though we treated the speeches of the Labour and Democratic Left leaders as equivalent for the purposes of training the Opposition class.
The more interesting distinctions emerge when we examine intraparty differences in expressed position. Among the government ministers, it is not surprising to see that John Wilson, the FF Deputy Prime Minister (Tánaiste, or “FF Tan” in the plot), and Gerard Collins, the Foreign Minister and a senior Fianna Fáil minister, had extreme Government-oriented estimated positions, exceeded only by the Taoiseach Charles Haughey himself. What is more interesting is that the next minister in the estimated ranking, Albert Reynolds, would later become the next Taoiseach. At the other extreme, among the most Opposition-oriented government ministers we see notable examples in Raphael (Ray) Burke, who was removed from his ministerial position the following year, and Mary O’Rourke, who months later would challenge Albert Reynolds for the party leadership.
The “backbench” FF members voted with the government but generally gave speeches that were far more lukewarm than those of the FF ministers. Correspondingly, the estimated Government affinities for the backbenchers are generally lower than those of the ministers. There were three exceptions, members with extreme estimated Government-oriented affinities: Nolan, Cullimore, and Cowen. One of these members, Brian Cowen, became Minister for Labour the following year, and went on to occupy senior positions, including Prime Minister, over the next two decades.
On the opposition side, we see a similarly heterogeneous set of estimated affinities. Two salient examples of extreme estimated Government-oriented affinities are Fine Gael TD Garret FitzGerald, a former Prime Minister, and TD Peter Barry, who had fought FitzGerald in 1987 for the party leadership. Both emphasized fairly standard economic concerns, attacking the government’s poor economic performance rather than its corrupt behavior. Notably, the member with the highest estimated pro-Opposition affinity is DL member Pat Rabbitte, who would later become leader of the Labour Party; in his speech, he engaged in a set of personal attacks on the Taoiseach, specifically targeting his character and judgment.
The results of applying the class affinity scaling model to the confidence debate speeches are consistent with expectations and with previous scholarly investigations of this episode (Laver and Benoit, 2002). Using only the texts of the speeches, we have succeeded in revealing differences between the speakers that were not apparent from their party affiliations.
12 Discussion
In our application and in others like it, the correct prediction of a class is no longer a relevant benchmark because the process of producing political text is expected to produce heterogeneous text within each class. For us, the class—here, voting for or against the confidence motion, which was perfectly correlated with government or opposition status—is observed and uninteresting, while the heterogeneity is the primary interest. Despite what would seem obvious from a measurement model or scaling perspective, however, a standard approach in evaluating machine learning applications in political science has been predictive accuracy benchmarked against known classes (e.g. Evans et al., 2007; Yu, Kaufmann and Diermeier, 2008). This focus on estimating correct classes not only wrongly shifts attention away from the substantively interesting variation in latent traits, but also may ultimately impair classification generality by encouraging overfitting to reduce predictive error.
Our proposed alternative, class affinity scaling, is based on a probability model similar to those underlying class predictive methods, but allows for mixed class membership. We have shifted focus from class prediction, something typically uninteresting in the social sciences, to a form of latent parameter estimation, while retaining the advantages of supervised learning approaches where the analyst controls the inputs that anchor the model. While there is a strong tradition in some disciplines, such as political science, of adapting machine learning to produce continuous scales, practitioners are often unaware of the differences in modeling assumptions between classification and scaling methods (e.g. Laver, Benoit and Garry, 2003), or they have not fully explored the implications of these assumptions (e.g. Beauchamp, 2012). We have highlighted the differences and similarities in a form that encourages future development.
The relative simplicity of our method makes it amenable to direct mathematical analysis. This simplicity allowed us to draw connections between Naive Bayes classification, dictionarybased scaling, and a host of other methods. We were further able to exploit the analytic simplicity of the affinity scaling model to develop an influence measure assessing the sensitivity of the fit, which we then used to guide our vocabulary selection and to validate our fits to the Dáil debate.
Using our method to explore the nuances of the speeches in the 1991 Dáil confidence motion, we produced estimates for each speaker that accord with both a qualitative reading of the speech transcripts and an expert understanding of Irish politics. Our application is a hard domain problem, where no known lexicographical map exists to differentiate government versus opposition speech and dictionarybased scaling, even with a dictionary derived from political text, gives unsatisfactory results. With limited training from the leadership speeches, class affinity scaling is able to adapt to the context of the debate and give a meaningful scaling. The method has applications far beyond political text, however, and could be used to score more standard sentiment problems on a continuous scale, or applied to any other problem for which contrasting reference texts can be identified.
References
 Beauchamp (2012) {bunpublished}[author] \bauthor\bsnmBeauchamp, \bfnmNick\binitsN. (\byear2012). \btitleUsing text to scale legislatures with uninformative voting. \bnotehttp://nickbeauchamp.com/work/Beauchamp_scaling_current.pdf. \endbibitem
 Benoit (2017) {bmanual}[author] \bauthor\bsnmBenoit, \bfnmKenneth\binitsK. (\byear2017). \btitlequanteda: Quantitative Analysis of Textual Data, \beditionR package version 0.99.9 ed. \bpublisherLondon School of Economics, \baddressLondon. \endbibitem
 Benoit and Herzog (2012) {bunpublished}[author] \bauthor\bsnmBenoit, \bfnmKenneth\binitsK. \AND\bauthor\bsnmHerzog, \bfnmAlexander\binitsA. (\byear2012). \btitleIntraParty Conflict Over Fiscal Austerity. \bnoteLSE manuscript, October 29. \endbibitem
 Blei, Ng and Jordan (2003) {barticle}[author] \bauthor\bsnmBlei, \bfnmD. M.\binitsD. M., \bauthor\bsnmNg, \bfnmA. Y.\binitsA. Y. \AND\bauthor\bsnmJordan, \bfnmM. I.\binitsM. I. (\byear2003). \btitleLatent Dirichlet allocation. \bjournalThe Journal of Machine Learning Research \bvolume3 \bpages993–1022. \endbibitem
 Boyd and Vandenberghe (2004) {bbook}[author] \bauthor\bsnmBoyd, \bfnmS.\binitsS. \AND\bauthor\bsnmVandenberghe, \bfnmL.\binitsL. (\byear2004). \btitleConvex Optimization. \bpublisherCambridge University Press. \endbibitem
 Bradley and Lang (1999) {barticle}[author] \bauthor\bsnmBradley, \bfnmM M\binitsM. M. \AND\bauthor\bsnmLang, \bfnmP J\binitsP. J. (\byear1999). \btitleAffective norms for English words (ANEW): Instruction manual and affective ratings. \endbibitem
 Clark and Lauderdale (2010) {barticle}[author] \bauthor\bsnmClark, \bfnmTom S.\binitsT. S. \AND\bauthor\bsnmLauderdale, \bfnmBenjamin\binitsB. (\byear2010). \btitleLocating Supreme Court Opinions in Doctrine Space. \bjournalAmerican Journal of Political Science \bvolume54 \bpages871–890. \endbibitem
 Cook (1977) {barticle}[author] \bauthor\bsnmCook, \bfnmR Dennis\binitsR. D. (\byear1977). \btitleDetection of influential observation in linear regression. \bjournalTechnometrics \bvolume19 \bpages15–18. \endbibitem
 Deerwester et al. (1990) {barticle}[author] \bauthor\bsnmDeerwester, \bfnmScott\binitsS., \bauthor\bsnmDumais, \bfnmSusan T\binitsS. T., \bauthor\bsnmFurnas, \bfnmGeorge W\binitsG. W., \bauthor\bsnmLandauer, \bfnmThomas K\binitsT. K. \AND\bauthor\bsnmHarshman, \bfnmRichard\binitsR. (\byear1990). \btitleIndexing by latent semantic analysis. \bjournalJournal of the American society for information science \bvolume41 \bpages391–407. \endbibitem
 Evans et al. (2007) {barticle}[author] \bauthor\bsnmEvans, \bfnmMichael\binitsM., \bauthor\bsnmMcIntosh, \bfnmWayne\binitsW., \bauthor\bsnmLin, \bfnmJimmy\binitsJ. \AND\bauthor\bsnmCates, \bfnmCynthia\binitsC. (\byear2007). \btitleRecounting the Courts? Applying Automated Content Analysis to Enhance Empirical Legal Research. \bjournalJournal of Empirical Legal Studies \bvolume4 \bpages10071039. \endbibitem
 Fienberg and Holland (1972) {barticle}[author] \bauthor\bsnmFienberg, \bfnmStephen E.\binitsS. E. \AND\bauthor\bsnmHolland, \bfnmPaul W.\binitsP. W. (\byear1972). \btitleOn the Choice of Flattening Constants for Estimating Multinomial Probabilities. \bjournalJournal of Multivariate Analysis \bvolume2 \bpages127–134. \endbibitem
 Firth (1993) {barticle}[author] \bauthor\bsnmFirth, \bfnmD.\binitsD. (\byear1993). \btitleBias reduction of maximum likelihood estimates. \bjournalBiometrika \bvolume80 \bpages27–38. \endbibitem
 Grimmer (2010) {barticle}[author] \bauthor\bsnmGrimmer, \bfnmJ\binitsJ. (\byear2010). \btitleA Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases. \bjournalPolitical Analysis \bvolume18 \bpages1–35. \endbibitem
 Grimmer and Stewart (2013) {barticle}[author] \bauthor\bsnmGrimmer, \bfnmJustin\binitsJ. \AND\bauthor\bsnmStewart, \bfnmBrandon M.\binitsB. M. (\byear2013). \btitleText as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. \bjournalPolitical Analysis \bvolume21 \bpages267297. \endbibitem
 Heckerman et al. (1998) {barticle}[author] \bauthor\bsnmHeckerman, \bfnmD.\binitsD., \bauthor\bsnmHorvitz, \bfnmE.\binitsE., \bauthor\bsnmSahami, \bfnmM.\binitsM. \AND\bauthor\bsnmS. Dumais, \bfnmS.\binitsS. (\byear1998). \btitleA Bayesian approach to filtering junk email. \bjournalProceedings of the AAAI98 Workshop on Learning for Text Categorization \bpages55–62. \endbibitem
 Hu and Liu (2004) {binproceedings}[author] \bauthor\bsnmHu, \bfnmMinqing\binitsM. \AND\bauthor\bsnmLiu, \bfnmBing\binitsB. (\byear2004). \btitleMining and summarizing customer reviews. In \bbooktitleProceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining \bpages168–177. \bpublisherACM. \endbibitem
 Jia et al. (2014) {barticle}[author] \bauthor\bsnmJia, \bfnmJinzhu\binitsJ., \bauthor\bsnmMiratrix, \bfnmLuke\binitsL., \bauthor\bsnmYu, \bfnmBin\binitsB., \bauthor\bsnmGawalt, \bfnmBrian\binitsB., \bauthor\bsnmEl Ghaoui, \bfnmLaurent\binitsL., \bauthor\bsnmBarnesmoore, \bfnmLuke\binitsL., \bauthor\bsnmClavier, \bfnmSophie\binitsS. \betalet al. (\byear2014). \btitleConcise comparative summaries (CCS) of large text corpora with a human experiment. \bjournalThe Annals of Applied Statistics \bvolume8 \bpages499–529. \endbibitem
 Joachims (1998) {binproceedings}[author] \bauthor\bsnmJoachims, \bfnmThorsten\binitsT. (\byear1998). \btitleText categorization with support vector machines: Learning with many relevant features. In \bbooktitleEuropean conference on machine learning \bpages137–142. \bpublisherSpringer. \endbibitem
 Joachims (1999) {bincollection}[author] \bauthor\bsnmJoachims, \bfnmT.\binitsT. (\byear1999). \btitleMaking largeScale SVM Learning Practical. In \bbooktitleAdvances in Kernel Methods  Support Vector Learning (\beditor\bfnmB.\binitsB. \bsnmSchölkopf, \beditor\bfnmC.\binitsC. \bsnmBurges \AND\beditor\bfnmA.\binitsA. \bsnmSmola, eds.) \bchapter11, \bpages169–184. \bpublisherMIT Press, \baddressCambridge, MA. \endbibitem
 Jurafsky and Martin (2009) {bbook}[author] \bauthor\bsnmJurafsky, \bfnmDaniel\binitsD. \AND\bauthor\bsnmMartin, \bfnmJames H.\binitsJ. H. (\byear2009). \btitleSpeech and Language Processing, \bedition2 ed. \bpublisherPearson, \baddressUpper Saddle River, NJ. \endbibitem
 Kessler et al. (1997) {binproceedings}[author] \bauthor\bsnmKessler, \bfnmBrett\binitsB., \bauthor\bsnmNunberg, \bfnmGeoffrey\binitsG. \AND\bauthor\bsnmSchütze, \bfnmHinrich\binitsH. (\byear1997). \btitleAutomatic detection of text genre. In \bbooktitleProceedings of the Eighth Conference of the European Chapter of the Association for Computational Linguistics \bpages32–38. \endbibitem
 Laver and Benoit (2002) {barticle}[author] \bauthor\bsnmLaver, \bfnmMichael\binitsM. \AND\bauthor\bsnmBenoit, \bfnmKenneth\binitsK. (\byear2002). \btitleLocating TDs in policy spaces: the computational text analysis of Dáil speeches. \bjournalIrish Political Studies \bvolume17 \bpages59–73. \endbibitem
 Laver, Benoit and Garry (2003) {barticle}[author] \bauthor\bsnmLaver, \bfnmMichael\binitsM., \bauthor\bsnmBenoit, \bfnmKenneth\binitsK. \AND\bauthor\bsnmGarry, \bfnmJohn\binitsJ. (\byear2003). \btitleEstimating the policy positions of political actors using words as data. \bjournalAmerican Political Science Review \bvolume97 \bpages311–331. \endbibitem
 Lidstone (1920) {barticle}[author] \bauthor\bsnmLidstone, \bfnmG. J.\binitsG. J. (\byear1920). \btitleNote on the general case of the Bayes-Laplace formula for inductive or a posteriori probabilities. \bjournalTransactions of the Faculty of Actuaries \bvolume8 \bpages182–192. \endbibitem
 Lowe (2008) {barticle}[author] \bauthor\bsnmLowe, \bfnmWill\binitsW. (\byear2008). \btitleUnderstanding wordscores. \bjournalPolitical Analysis \bvolume16 \bpages356–371. \endbibitem
 Lowe and Benoit (2013) {barticle}[author] \bauthor\bsnmLowe, \bfnmWilliam\binitsW. \AND\bauthor\bsnmBenoit, \bfnmKenneth\binitsK. (\byear2013). \btitleValidating Estimates of Latent Traits From Textual Data Using Human Judgment as a Benchmark. \bjournalPolitical Analysis \bvolume21 \bpages298–313. \endbibitem
 Martin and Vanberg (2007) {barticle}[author] \bauthor\bsnmMartin, \bfnmL. W.\binitsL. W. \AND\bauthor\bsnmVanberg, \bfnmG.\binitsG. (\byear2007). \btitleA robust transformation procedure for interpreting political text. \bjournalPolitical Analysis \bvolume16 \bpages93–100. \bdoi10.1093/pan/mpm010 \endbibitem
 McAuliffe and Blei (2008) {binproceedings}[author] \bauthor\bsnmMcAuliffe, \bfnmJon D\binitsJ. D. \AND\bauthor\bsnmBlei, \bfnmDavid M\binitsD. M. (\byear2008). \btitleSupervised topic models. In \bbooktitleAdvances in neural information processing systems \bpages121–128. \endbibitem
 Mosteller and Wallace (1963) {barticle}[author] \bauthor\bsnmMosteller, \bfnmF.\binitsF. \AND\bauthor\bsnmWallace, \bfnmD. L.\binitsD. L. (\byear1963). \btitleInference in an Authorship Problem. \bjournalJ. Am. Stat. Assoc. \bvolume58 \bpages275–309. \endbibitem
 Newman, Pennebaker and Berry (2003) {barticle}[author] \bauthor\bsnmNewman, \bfnmM L\binitsM. L., \bauthor\bsnmPennebaker, \bfnmJames W\binitsJ. W. \AND\bauthor\bsnmBerry, \bfnmD S\binitsD. S. (\byear2003). \btitleLying words: Predicting deception from linguistic styles. \bjournalPersonality and social …. \endbibitem
 Pang, Lee and Vaithyanathan (2002) {barticle}[author] \bauthor\bsnmPang, \bfnmB.\binitsB., \bauthor\bsnmLee, \bfnmL.\binitsL. \AND\bauthor\bsnmVaithyanathan, \bfnmS.\binitsS. (\byear2002). \btitleThumbs up? Sentiment classification using machine learning techniques. \bjournalProceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) \bpages79–86. \endbibitem
 Pennebaker, Francis and Booth (2001) {bbook}[author] \bauthor\bsnmPennebaker, \bfnmJames W\binitsJ. W., \bauthor\bsnmFrancis, \bfnmMartha E\binitsM. E. \AND\bauthor\bsnmBooth, \bfnmRoger J\binitsR. J. (\byear2001). \btitleLinguistic inquiry and word count: LIWC 2001. \bpublisherErlbaum Publishers, \baddressMahwah, NJ. \endbibitem
 Porter (2006) {bunpublished}[author] \bauthor\bsnmPorter, \bfnmMartin\binitsM. (\byear2006). \btitleSnowball English stop word list. \bnotehttp://snowball.tartarus.org/algorithms/english/stop.txt. \endbibitem
 Ramage et al. (2009) {binproceedings}[author] \bauthor\bsnmRamage, \bfnmDaniel\binitsD., \bauthor\bsnmHall, \bfnmDavid\binitsD., \bauthor\bsnmNallapati, \bfnmRamesh\binitsR. \AND\bauthor\bsnmManning, \bfnmChristopher D\binitsC. D. (\byear2009). \btitleLabeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In \bbooktitleProceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 \bpages248–256. \bpublisherAssociation for Computational Linguistics. \endbibitem
 Roberts, Stewart and Airoldi (2016) {barticle}[author] \bauthor\bsnmRoberts, \bfnmMargaret E\binitsM. E., \bauthor\bsnmStewart, \bfnmBrandon M\binitsB. M. \AND\bauthor\bsnmAiroldi, \bfnmEdoardo M\binitsE. M. (\byear2016). \btitleA model of text for experimentation in the social sciences. \bjournalJ. Am. Stat. Assoc. \bvolume111 \bpages988–1003. \endbibitem
 Sahami et al. (1998) {barticle}[author] \bauthor\bsnmSahami, \bfnmMehran\binitsM., \bauthor\bsnmDumais, \bfnmSusan\binitsS., \bauthor\bsnmHeckerman, \bfnmDavid\binitsD. \AND\bauthor\bsnmHorvitz, \bfnmEric\binitsE. (\byear1998). \btitleA Bayesian approach to filtering junk email. \bjournalProceedings of the AAAI-98 Workshop on Learning for Text Categorization \bpages98–105. \endbibitem
 Slapin and Proksch (2008) {barticle}[author] \bauthor\bsnmSlapin, \bfnmJonathan B.\binitsJ. B. \AND\bauthor\bsnmProksch, \bfnmSven-Oliver\binitsS.-O. (\byear2008). \btitleA Scaling Model for Estimating Time-Series Party Positions from Texts. \bjournalAmerican Journal of Political Science \bvolume52 \bpages705–722. \endbibitem
 Stone, Dunphy and Smith (1966) {bbook}[author] \bauthor\bsnmStone, \bfnmPhilip J\binitsP. J., \bauthor\bsnmDunphy, \bfnmDexter C\binitsD. C. \AND\bauthor\bsnmSmith, \bfnmMarshall S\binitsM. S. (\byear1966). \btitleThe General Inquirer: A Computer Approach to Content Analysis. \bpublisherMIT Press, \baddressCambridge, MA. \endbibitem
 Taddy (2013) {barticle}[author] \bauthor\bsnmTaddy, \bfnmMatt\binitsM. (\byear2013). \btitleMultinomial inverse regression for text analysis. \bjournalJ. Am. Stat. Assoc. \bvolume108 \bpages755–770. \endbibitem
 Young and Soroka (2012) {barticle}[author] \bauthor\bsnmYoung, \bfnmLori\binitsL. \AND\bauthor\bsnmSoroka, \bfnmStuart\binitsS. (\byear2012). \btitleAffective News: The Automated Coding of Sentiment in Political Texts. \bjournalPolitical Communication \bvolume29 \bpages205–231. \endbibitem
 Yu, Kaufmann and Diermeier (2008) {barticle}[author] \bauthor\bsnmYu, \bfnmBei\binitsB., \bauthor\bsnmKaufmann, \bfnmStefan\binitsS. \AND\bauthor\bsnmDiermeier, \bfnmDaniel\binitsD. (\byear2008). \btitleClassifying Party Affiliation from Political Speech. \bjournalJournal of Information Technology & Politics \bvolume5 \bpages33–48. \endbibitem