FASTSUBS: An Efficient and Exact Procedure for Finding the Most Likely Lexical Substitutes Based on an Ngram Language Model
Abstract
Lexical substitutes have found use in areas such as paraphrasing, text simplification, machine translation, word sense disambiguation, and part of speech induction. However, the computational complexity of accurately identifying the most likely substitutes for a word has made large scale experiments difficult. In this paper I introduce a new search algorithm, fastsubs, that is guaranteed to find the $K$ most likely lexical substitutes for a given word in a sentence based on an ngram language model. The computation is sublinear in both $K$ and the vocabulary size $V$. An implementation of the algorithm and a dataset with the top 100 substitutes of each token in the WSJ section of the Penn Treebank are available at http://goo.gl/jzKH0.
EDICS Category: SPE-LANG
I Introduction
Lexical substitutes have proven useful in applications such as paraphrasing [1], text simplification [2], and machine translation [3]. The best published results in unsupervised word sense disambiguation [4] and part of speech induction [5] represent word context as a vector of substitute probabilities. Using a statistical language model to find the most likely substitutes of a word in a given context is a successful approach [6, 7]. However, the computational cost of an exhaustive algorithm, which computes the probability of every word before deciding the top $K$, makes large scale experiments difficult. On the other hand, heuristic methods run the risk of missing important substitutes.
This paper presents the fastsubs algorithm, which can efficiently and correctly identify the most likely lexical substitutes for a given context based on an ngram language model without going through most of the vocabulary. Even though the worst-case performance of fastsubs is still proportional to vocabulary size, experiments demonstrate that the average cost is sublinear in both the number of substitutes $K$ and the vocabulary size $V$. To my knowledge, this is the first sublinear algorithm that exactly identifies the top $K$ most likely lexical substitutes.
The efficiency of fastsubs makes large scale experiments based on lexical substitutes feasible. For example, it is possible to compute the top 100 substitutes for each one of the 1,173,766 tokens in the WSJ section of the Penn Treebank [8] in under 5 hours on a typical workstation. The same task would take about 6 days with the exhaustive algorithm. The Penn Treebank substitute data and an implementation of the algorithm are available from the author’s website at http://goo.gl/jzKH0.
Section II derives substitute probabilities as defined by an ngram language model with an arbitrary order and smoothing. Section III describes the fastsubs algorithm. Section IV proves the correctness of the algorithm and Section V presents experimental results on its time complexity. Section VI summarizes the contributions of this paper.
II Substitute Probabilities
This section presents the derivation of lexical substitute probabilities based on an ngram language model. Details of this derivation are important in finding an admissible algorithm that identifies the most likely substitutes efficiently, without trying out most of the vocabulary.
Ngram language models assign probabilities to arbitrary sequences of words (or other tokens like punctuation etc.) based on their occurrence statistics in large training corpora. They approximate the probability of a sequence of words by assuming each word is conditionally independent of the rest given the previous $N-1$ words. For example a trigram model would approximate the probability of a sequence $abcde$ as:
$$p(abcde) \approx p(a)\,p(b|a)\,p(c|ab)\,p(d|bc)\,p(e|cd) \tag{1}$$
where lowercase letters like $a$, $b$, $c$ represent words and strings of letters like $abcde$ represent word sequences. The computation is typically performed using log probabilities, which turns the product into a summation:
$$\ell(abcde) \approx \ell(a) + \ell(b|a) + \ell(c|ab) + \ell(d|bc) + \ell(e|cd) \tag{2}$$
where $\ell(x) \equiv \log p(x)$. The individual conditional probability terms are typically expressed in backoff form (even interpolated models can be represented in the backoff form, and in fact that is the way SRILM stores them in ARPA (Doug Paul) format model files):
$$\ell(c|ab) = \begin{cases} \alpha(c|ab) & \text{if } f(abc) > 0 \\ \beta(ab) + \ell(c|b) & \text{otherwise} \end{cases} \tag{3}$$
where $\alpha(c|ab)$ is the discounted log probability estimate for $abc$ (typically slightly less than the log frequency in the training corpus), $f(abc)$ is the number of times $abc$ has been observed in the training corpus, and $\beta(ab)$ is the backoff weight that makes the probabilities add up to 1. The formula can be generalized to arbitrary ngram orders if we let $b$ stand for zero or more words. The recursion bottoms out at unigrams (single words) where $\ell(c) = \alpha(c)$. If there are any out-of-vocabulary words we assume they are mapped to a special <unk> token, so $\alpha(c)$ is never undefined.
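The backoff recursion can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the dictionaries `ALPHA` and `BETA`, the toy values in them, and the function name `logprob` are all assumptions made for the example. An n-gram being present in `ALPHA` stands in for "observed in the training corpus", and a missing backoff weight is treated as 0.

```python
# Hypothetical toy model (values are made up for illustration).
# ALPHA maps an n-gram (tuple of words) to its discounted log probability;
# BETA maps a context (tuple of words) to its backoff weight.
ALPHA = {
    ("in",): -1.0, ("san",): -2.0, ("francisco",): -2.5, ("<unk>",): -4.0,
    ("in", "san"): -1.2, ("san", "francisco"): -0.1,
    ("in", "san", "francisco"): -0.05,
}
BETA = {("in",): -0.3, ("san",): -0.2, ("in", "san"): -0.1}


def logprob(word, context):
    """Backoff log probability of `word` given a tuple of preceding words."""
    ngram = tuple(context) + (word,)
    if ngram in ALPHA:            # n-gram was observed: use the discounted estimate
        return ALPHA[ngram]
    if not context:               # unigram level: unknown words map to <unk>
        return ALPHA[("<unk>",)]
    # Otherwise back off: backoff weight of the context (0 if unseen)
    # plus the log probability under the context shortened by one word.
    return BETA.get(tuple(context), 0.0) + logprob(word, context[1:])


print(logprob("francisco", ("in", "san")))   # observed trigram
print(logprob("in", ("san",)))               # backs off to the unigram
```

Because the context is shortened from the left one word at a time, the same function covers any ngram order.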
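It is best to use both left and right context when estimating the probabilities for potential lexical substitutes. For example, in “He lived in San Francisco suburbs.”, the token San would be difficult to guess from the left context but it is almost certain looking at the right context. The log probability of a substitute word $x$ given both left and right contexts can be estimated as:
$$\begin{aligned} \ell(x \mid ab\_de) &= \ell(abxde) - \ell(ab\_de) \\ &\approx \ell(x|ab) + \ell(d|bx) + \ell(e|xd) + \text{const} \end{aligned} \tag{4}$$
Here the “_” symbol represents the position the candidate substitute is going to occupy. The first line follows from the definition of conditional probability and the second line comes from Equation 2 (the log form of Equation 1), except that the terms that do not include the candidate $x$ have been collected into a constant, which is dropped when ranking candidates.
The expression for the unnormalized log probability of a lexical substitute according to Equation 4 and the decomposition of its terms according to Equation 3 can be combined to give us Equation 5. For arbitrary order ngram models we would end up with a sum of $N$ terms and each term would come from one of $N$ alternatives.
$$\begin{aligned}
\ell(x \mid ab\_de) \approx{} & \begin{cases} \alpha(x|ab) & \text{if } f(abx) > 0 \\ \beta(ab) + \begin{cases} \alpha(x|b) & \text{if } f(bx) > 0 \\ \beta(b) + \alpha(x) & \text{otherwise} \end{cases} & \text{otherwise} \end{cases} \\
{}+{} & \begin{cases} \alpha(d|bx) & \text{if } f(bxd) > 0 \\ \beta(bx) + \begin{cases} \alpha(d|x) & \text{if } f(xd) > 0 \\ \beta(x) + \alpha(d) & \text{otherwise} \end{cases} & \text{otherwise} \end{cases} \\
{}+{} & \begin{cases} \alpha(e|xd) & \text{if } f(xde) > 0 \\ \beta(xd) + \begin{cases} \alpha(e|d) & \text{if } f(de) > 0 \\ \beta(d) + \alpha(e) & \text{otherwise} \end{cases} & \text{otherwise} \end{cases} \\
{}+{} & \text{const}
\end{aligned} \tag{5}$$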
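As an illustration of how the substitute score is assembled, the following Python sketch sums the backoff log probabilities of every ngram that contains the candidate. The function names, the `alpha`/`beta` dictionaries, and the toy values are assumptions made for this example only; the constant terms that do not involve the candidate are simply left out, so the result is an unnormalized score.

```python
def logprob(word, context, alpha, beta):
    """Backoff log probability; `alpha` and `beta` are dicts keyed by word tuples."""
    ngram = tuple(context) + (word,)
    if ngram in alpha:
        return alpha[ngram]
    if not context:
        return alpha[("<unk>",)]
    return beta.get(tuple(context), 0.0) + logprob(word, context[1:], alpha, beta)


def substitute_score(x, left, right, alpha, beta, order=3):
    """Unnormalized log probability of candidate x between left and right context."""
    n = order - 1
    left = list(left)[max(0, len(left) - n):]    # keep at most n words on each side
    right = list(right)[:n]
    words = left + [x] + right
    total = 0.0
    # Positions len(left) .. end are exactly the ngram terms that contain x.
    for i in range(len(left), len(words)):
        context = tuple(words[max(0, i - n):i])
        total += logprob(words[i], context, alpha, beta)
    return total


# Hypothetical toy model for the context "in _ francisco".
alpha = {
    ("in",): -1.0, ("san",): -2.0, ("los",): -2.0,
    ("francisco",): -2.5, ("<unk>",): -4.0,
    ("in", "san"): -1.0, ("in", "los"): -1.0, ("san", "francisco"): -0.1,
}
beta = {("in",): -0.5, ("san",): -0.25, ("los",): -0.25}

for cand in ("san", "los"):
    print(cand, substitute_score(cand, ["in"], ["francisco"], alpha, beta))
```

In this toy model "san" scores higher than "los" only because of the right context, which mirrors the San Francisco example in the text.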
III Algorithm
The task of fastsubs is to pick the top $K$ substitutes ($x$) from a vocabulary of size $V$ that maximize Equation 5 for a given context $ab\_de$. Equation 5 forms a tree where leaf nodes are primitive terms such as $\alpha(x|ab)$ and $\beta(ab)$, and parent nodes are compound terms, i.e. sums or conditional expressions. The basic strategy is to construct a priority queue of candidate substitutes for Equation 5 by composing substitute queues for each of its subexpressions. The structure of these queues and how they can be composed is described next, followed by the construction of the individual queues for each of the subexpressions.
III-A Upper bound queues
A sum such as $g(x) + h(x)$ is not necessarily maximized by the $x$'s that maximize either of its terms. What we can say for sure is that the sum for any $x$ cannot exceed the upper bound $g(x_g) + h(x_h)$ where $x_g$ maximizes $g$ and $x_h$ maximizes $h$. We can find the $x$ that maximizes the sum by repeatedly evaluating candidates until we find one whose value is (i) at least as large as all the candidates that have been evaluated, and (ii) at least as large as the upper bound for the remaining candidates.
Based on this intuition, we define an abstract data type called an upper bound queue that maintains an upper bound on the actual values of its elements. Each successive pop from an upper bound queue is not guaranteed to retrieve the element with the largest value, but the remaining elements are guaranteed to have values smaller than or equal to a nonincreasing upper bound. An upper bound queue $q$ supports three operations:
$\mathrm{sup}(q)$: returns an upper bound on the value of the elements in the queue.
$\mathrm{top}(q)$: returns the top element in the queue. Note that this element is not guaranteed to have the highest value.
$\mathrm{pop}(q)$: extracts and returns the top element in the queue and updates the upper bound if possible.
Upper bound queues can be composed easily. Going back to our sum example, let us assume that we have valid upper bound queues $q_g$ for $g(x)$ and $q_h$ for $h(x)$. The queue for the sum, $q_{g+h}$, has $\mathrm{sup}(q_{g+h}) = \mathrm{sup}(q_g) + \mathrm{sup}(q_h)$ because the upper bound for a sum clearly cannot exceed the total of the upper bounds for its constituent terms. $\mathrm{top}(q_{g+h})$ can return any element from the queue without violating the contract. However, in order to find the true maximum, we eventually need an element whose value is at least the upper bound for the remaining elements. Thus we can bias our choice for $\mathrm{top}(q_{g+h})$ to prefer elements that (i) have high values, and (ii) reduce the upper bound quickly. In practice nondeterministically picking $\mathrm{top}(q_{g+h})$ to be one of $\mathrm{top}(q_g)$ or $\mathrm{top}(q_h)$ works well. $\mathrm{pop}(q_{g+h})$ can extract and return the same element from the corresponding child queue. If the upper bound of a child queue drops as a result, so does the upper bound of the compound queue $q_{g+h}$.
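The composition of two upper bound queues for a sum can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the operations are named `sup`, `top` and `pop`, the class names `SortedQueue` and `SumQueue` and the helper `argmax_sum` are invented for the example, and the nondeterministic choice is modeled with a seeded random generator.

```python
import random


class SortedQueue:
    """Exact priority queue over a word -> value dict (a trivial upper bound queue)."""
    def __init__(self, values):
        self.items = sorted(values.items(), key=lambda kv: -kv[1])
        self.i = 0

    def sup(self):
        return self.items[self.i][1] if self.i < len(self.items) else float("-inf")

    def top(self):
        return self.items[self.i][0] if self.i < len(self.items) else None

    def pop(self):
        word = self.top()
        if word is not None:
            self.i += 1
        return word


class SumQueue:
    """Upper bound queue for a sum of terms, built from the children's queues."""
    def __init__(self, children, rng=None):
        self.children = children
        self.rng = rng or random.Random(0)
        self._pick()

    def _pick(self):
        # Randomly choose which non-empty child supplies the next element.
        live = [c for c in self.children if c.top() is not None]
        self.chosen = self.rng.choice(live) if live else None

    def sup(self):
        return sum(c.sup() for c in self.children)

    def top(self):
        return self.chosen.top() if self.chosen else None

    def pop(self):
        word = self.chosen.pop() if self.chosen else None
        self._pick()
        return word


def argmax_sum(g, h):
    """Find the x maximizing g[x] + h[x]; also report how many x were evaluated."""
    q = SumQueue([SortedQueue(g), SortedQueue(h)])
    best, best_val, seen = None, float("-inf"), set()
    while best_val < q.sup():          # stop once the best value reaches the bound
        x = q.pop()
        if x in seen:                  # the same word may come from both children
            continue
        seen.add(x)
        if g[x] + h[x] > best_val:
            best, best_val = x, g[x] + h[x]
    return best, len(seen)


g = {"a": 5, "b": 4, "c": 1, "d": 0}
h = {"a": 0, "b": 4, "c": 5, "d": 1}
print(argmax_sum(g, h))
```

Neither "a" (the maximizer of `g`) nor "c" (the maximizer of `h`) maximizes the sum; the loop keeps popping until the best evaluated value reaches the bound on the unevaluated candidates, which yields "b" whichever child is chosen at each step.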
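III-B Top level queue
The top level sum in Equation 5 is a sum of $N$ conditional expressions for an order $N$ language model. We can construct an upper bound queue for the sum using the upper bound queues for its constituent terms as described in the previous section. Let $q_0$ represent the queue for the top level sum, $c_1, \ldots, c_N$ represent the constituent conditional expressions and $q_{c_1}, \ldots, q_{c_N}$ represent their associated queues.
$$\begin{aligned} \mathrm{sup}(q_0) &= \sum_{i=1}^{N} \mathrm{sup}(q_{c_i}) \\ \mathrm{top}(q_0) &= \mathrm{top}(q_{c_i}) \quad \text{for a nondeterministically chosen } i \\ \mathrm{pop}(q_0) &= \mathrm{pop}(q_{c_i}) \quad \text{for the same } i \end{aligned} \tag{6}$$
For $\mathrm{top}(q_0)$ we nondeterministically pick the top element from one of the children, and $\mathrm{pop}(q_0)$ extracts and returns that same element, adjusting the upper bound if necessary.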
As mentioned before, $\mathrm{pop}(q_0)$ does not necessarily return the element with the maximum value. In order to find the top $K$ elements, fastsubs keeps popping elements from $q_0$ and computes their true values according to Equation 5 until at least $K$ of them have values at or above the upper bound for the remaining elements in the queue. Table I gives the pseudocode for fastsubs.
This procedure will return the correct result as long as $\mathrm{pop}(q_0)$ cycles through all the words in the vocabulary and the upper bound for the remaining elements, $\mathrm{sup}(q_0)$, is accurate. The loop can in fact cycle through all the words in the vocabulary because at least one of the subexpressions, $\alpha(x)$, is well defined for every word. The accuracy of $\mathrm{sup}(q_0)$ depends on the accuracy of the upper bounds for constituent terms, which are described next.
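The popping loop described above can be sketched in Python as follows. This is not the paper's Table I but a minimal sketch under stated assumptions: the queue object exposes `sup()` and `pop()`, `score` computes the true value of a candidate, and `ListQueue` is a hypothetical stand-in queue that serves words in order of a per-word upper bound. A min-heap of size $K$ tracks the current $K$-th best value, so the stopping test is "the $K$-th best true value is at least the bound on what remains".

```python
import heapq


class ListQueue:
    """Toy upper bound queue: each word has a bound >= its true value."""
    def __init__(self, bounds):
        self.items = sorted(bounds.items(), key=lambda kv: -kv[1])
        self.popped = 0

    def sup(self):
        if self.popped < len(self.items):
            return self.items[self.popped][1]
        return float("-inf")

    def pop(self):
        if self.popped >= len(self.items):
            return None
        self.popped += 1
        return self.items[self.popped - 1][0]


def fastsubs(queue, score, k):
    """Return the k best (value, word) pairs, popping as few words as possible."""
    if k <= 0:
        return []
    best = []        # min-heap holding the k best (true value, word) pairs so far
    seen = set()
    while len(best) < k or best[0][0] < queue.sup():
        x = queue.pop()
        if x is None:            # queue exhausted: fewer than k candidates exist
            break
        if x in seen:            # compound queues may return a word twice
            continue
        seen.add(x)
        item = (score(x), x)
        if len(best) < k:
            heapq.heappush(best, item)
        elif item > best[0]:
            heapq.heapreplace(best, item)
    return sorted(best, reverse=True)


true_value = {"a": -1.0, "b": -3.0, "c": -2.0, "d": -6.0, "e": -7.0}
bounds = {"a": -0.5, "b": -1.0, "c": -1.5, "d": -5.0, "e": -6.5}
queue = ListQueue(bounds)
print(fastsubs(queue, true_value.get, 2), "after", queue.popped, "pops")
```

In this toy run the loop stops after three of the five words, because the bound on the two remaining words is already below the second best true value.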
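III-C Queues for conditional expressions
Conditional expressions indicated by “{” in Equation 5 pick their topmost child whose argument has been observed in the training corpus. Let $q_c$ be the queue for such a conditional expression and $a_1, \ldots, a_N$ be its children terms. Let $a_{\max}$ be the child whose queue has the maximum upper bound. The upper bound for $q_c$ cannot exceed the upper bound for $q_{a_{\max}}$ because the value of the conditional expression for any given $x$ is equal to the value of one of its children. Thus we define the queue operations for conditional expressions based on $q_{a_{\max}}$:
$$\begin{aligned} a_{\max} &= \arg\max_{a_i} \mathrm{sup}(q_{a_i}) \\ \mathrm{sup}(q_c) &= \mathrm{sup}(q_{a_{\max}}) \\ \mathrm{top}(q_c) &= \mathrm{top}(q_{a_{\max}}) \\ \mathrm{pop}(q_c) &= \mathrm{pop}(q_{a_{\max}}) \end{aligned} \tag{7}$$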
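A queue for a conditional expression can be sketched by delegating every operation to whichever child currently has the largest upper bound. The class names and the toy values below are assumptions for illustration; `ArrayQueue` stands in for the child queues, whose values are taken to already include any constant terms.

```python
NEG_INF = float("-inf")


class ArrayQueue:
    """Toy child queue: (word, value) pairs served in descending value order."""
    def __init__(self, pairs):
        self.items = sorted(pairs, key=lambda p: -p[1])

    def sup(self):
        return self.items[0][1] if self.items else NEG_INF

    def top(self):
        return self.items[0][0] if self.items else None

    def pop(self):
        return self.items.pop(0)[0] if self.items else None


class ConditionalQueue:
    """Upper bound queue for a conditional (backoff) expression."""
    def __init__(self, children):
        self.children = children

    def _a_max(self):
        # The child whose queue currently has the maximum upper bound.
        return max(self.children, key=lambda q: q.sup())

    def sup(self):
        return self._a_max().sup()

    def top(self):
        return self._a_max().top()

    def pop(self):
        return self._a_max().pop()


# Hypothetical children for a trigram, bigram and unigram alternative.
q = ConditionalQueue([
    ArrayQueue([("san", -0.5)]),
    ArrayQueue([("san", -1.2), ("los", -1.3)]),
    ArrayQueue([("the", -1.5), ("san", -2.5), ("los", -2.7)]),
])
while q.top() is not None:
    bound = q.sup()
    print(bound, q.pop())
```

The same word can be returned more than once (here "san" appears in all three children). That is harmless as long as the caller evaluates the true value of each distinct word once, and the printed bounds never increase.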
III-D Queues for sums of primitive terms
As described in Section III-A, the upper bound of a queue for a sum like $\beta(ab) + \beta(b) + \alpha(x)$ is equal to the sum of the upper bounds of the constituent queues. It turns out that for sums of primitive terms, only the $\alpha$ term that has the candidate word as an argument has a nonconstant upper bound. The language model defines $\beta$ to be 0 for any word sequence that does not appear in the training set. Therefore the $\beta$ terms that have the candidate word as an argument always have the upper bound 0. Finally, the $\alpha$ and $\beta$ terms without the candidate word act as constants.
For notational consistency we define upper bounds for the constant terms as well. Let $a$ and $b$ represent sequences of zero or more words that do not include the candidate $x$, and write $\alpha(\cdot)$ and $\beta(\cdot)$ with the whole word sequence of the term as the argument. We have:
$$\begin{aligned} \mathrm{sup}(q_{\alpha(a)}) &= \alpha(a) \\ \mathrm{sup}(q_{\beta(b)}) &= \beta(b) \end{aligned} \tag{8}$$
For $\beta$ terms with $x$ in their argument, many words from the vocabulary would be unobserved in the argument sequence and share the maximum value of 0. In the rare case that all vocabulary words have been observed in the argument sequence, they would each have negative values and 0 would still be a valid upper bound. Thus fastsubs uses the constant 0 as an upper bound for $\beta$ terms with $x$.
$$\mathrm{sup}(q_{\beta(axb)}) = 0 \tag{9}$$
Only the $\alpha$ term with an $x$ argument, written $\alpha_x$ below, has an upper bound queue as described in the next section. fastsubs picks the top element for a sum of primitive terms only from its $\alpha_x$ constituent. (Remember that the top value in an upper bound queue is not guaranteed to have the largest value. Thus ignoring the $\beta$ terms with $x$ does not affect the correctness of the algorithm.) Let $q_s$ be the queue for a sum of primitive terms and let $a_i$ indicate its constituents ($\alpha$, $\beta$, constant or otherwise). We have:
$$\begin{aligned} \mathrm{sup}(q_s) &= \sum_i \mathrm{sup}(q_{a_i}) \\ \mathrm{top}(q_s) &= \mathrm{top}(q_{\alpha_x}) \\ \mathrm{pop}(q_s) &= \mathrm{pop}(q_{\alpha_x}) \end{aligned} \tag{10}$$
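The rules for a sum of primitive terms reduce to "a constant offset plus one real queue", which the following Python sketch tries to make concrete. It assumes the sum contains exactly one $\alpha$ term with the candidate in its argument; the class names and values are hypothetical. Terms without the candidate are passed in as plain numbers, and $\beta$ terms that contain the candidate are left out because their upper bound is 0.

```python
NEG_INF = float("-inf")


class ArrayQueue:
    """Toy stand-in for the queue of an alpha term that contains the candidate."""
    def __init__(self, pairs):
        self.items = sorted(pairs, key=lambda p: -p[1])

    def sup(self):
        return self.items[0][1] if self.items else NEG_INF

    def top(self):
        return self.items[0][0] if self.items else None

    def pop(self):
        return self.items.pop(0)[0] if self.items else None


class PrimitiveSumQueue:
    """Queue for a sum of primitive terms, e.g. beta(ab) + beta(b) + alpha(x)."""
    def __init__(self, constants, alpha_x_queue):
        # constants: values of the alpha/beta terms that do not contain x.
        # beta terms containing x are bounded by 0 and therefore add nothing.
        self.const = sum(constants)
        self.alpha_x = alpha_x_queue

    def sup(self):
        return self.const + self.alpha_x.sup()

    def top(self):
        return self.alpha_x.top()      # elements come only from the alpha_x term

    def pop(self):
        return self.alpha_x.pop()


# Hypothetical values for beta(ab) = -0.25, beta(b) = -0.25 and a unigram queue.
q = PrimitiveSumQueue([-0.25, -0.25], ArrayQueue([("the", -1.0), ("san", -2.0)]))
print(q.sup(), q.pop(), q.sup(), q.pop(), q.sup())
```

Once the $\alpha_x$ queue is empty the bound of the whole sum drops to negative infinity, since no remaining word can take its value from this alternative.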
III-E Queues for primitive terms
fastsubs precomputes actual priority queues (which satisfy the upper bound queue contract) for $\alpha$ terms that include $x$ in their argument:
$$\begin{aligned} \mathrm{sup}(q_{\alpha(axb)}) &= \max_x \alpha(axb) \\ \mathrm{top}(q_{\alpha(axb)}) &= \arg\max_x \alpha(axb) \end{aligned} \tag{11}$$
Here $a$ and $b$ stand for zero or more words and $x$ is a candidate lexical substitute word. $\mathrm{sup}(q_{\alpha(axb)})$ gives the real maximum, and thus provides a tight upper bound. $\mathrm{top}(q_{\alpha(axb)})$ is guaranteed to return the element with the highest value.
The $\alpha_x$ queues are constructed once in the beginning of the program as sorted arrays and reused in queries for different contexts. The construction can be performed in one pass through the language model and the memory requirement is of the same order as the size of the language model. Candidates that have not been observed in the argument context will be at the bottom of this queue because $\alpha(axb) = -\infty$ if $f(axb) = 0$. To save memory such $x$ are not placed in the queue. Thus after we run out of elements in the queue, the operations return:
$$\begin{aligned} \mathrm{sup}(q_{\alpha(axb)}) &= -\infty \\ \mathrm{top}(q_{\alpha(axb)}) &= \text{undef} \end{aligned} \tag{12}$$
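One way to precompute such sorted arrays in a single pass over the model is sketched below. The data layout is an assumption made for the example and not necessarily what the released implementation does: the model is a dict from ngram tuples to log probabilities, and each queue is keyed by the ngram with the candidate position replaced by `None`.

```python
from collections import defaultdict

NEG_INF = float("-inf")


def build_alpha_queues(alpha):
    """File every ngram under each pattern obtained by blanking one of its words."""
    queues = defaultdict(list)
    for ngram, value in alpha.items():
        for i, word in enumerate(ngram):
            pattern = ngram[:i] + (None,) + ngram[i + 1:]
            queues[pattern].append((value, word))
    for entries in queues.values():
        entries.sort(reverse=True)       # highest value first
    return queues


class AlphaQueue:
    """Cursor over one precomputed sorted array; the array itself is shared."""
    def __init__(self, queues, pattern):
        self.entries = queues.get(pattern, [])
        self.i = 0

    def sup(self):
        # Exact maximum of the remaining entries, or -inf once they run out.
        return self.entries[self.i][0] if self.i < len(self.entries) else NEG_INF

    def top(self):
        return self.entries[self.i][1] if self.i < len(self.entries) else None

    def pop(self):
        word = self.top()
        if word is not None:
            self.i += 1
        return word


# Hypothetical toy model.
alpha = {
    ("san",): -2.0,
    ("in", "the"): -0.5, ("in", "san"): -1.0, ("in", "los"): -1.5,
    ("san", "francisco"): -0.1,
}
queues = build_alpha_queues(alpha)
q = AlphaQueue(queues, ("in", None))     # candidates x for the bigram "in x"
print([(q.sup(), q.pop()) for _ in range(3)], q.sup(), q.top())
```

Because a query only needs a cursor into a shared array, the same precomputed arrays can serve every context. Words never observed in a pattern are simply absent from its array, which gives the behavior of an empty queue once the observed words are used up.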
IV Correctness
As mentioned in Section III, the correctness of the algorithm depends on two factors: (i) the $\mathrm{sup}(q)$ function should return an upper bound on the remaining values in the queue $q$, and (ii) the $\mathrm{pop}(q_0)$ function should cycle through the whole vocabulary for the top level queue.
The correctness of the $\mathrm{sup}$ function can be proved recursively. For primitive terms $\mathrm{sup}$ is equal to the actual maximum (e.g. for $\alpha$ terms with $x$), or is an obvious upper bound (e.g. 0 for $\beta$ terms with $x$). For sums, $\mathrm{sup}$ is equal to the sum of the upper bounds for the children and, for conditional expressions, $\mathrm{sup}$ is equal to the maximum of the upper bounds for the children.
To prove that $\mathrm{pop}(q_0)$ will cycle through the entire vocabulary it suffices to show that the queue for at least one child of $q_0$ will cycle through the entire vocabulary. This is in fact the case because one of the children will always include the $\alpha(x)$ term whose queue contains the entire vocabulary.
V Complexity
An exhaustive algorithm to find the most likely substitutes in a given context could try each word in the vocabulary as a potential substitute and compute the value of the expression given in Equation 5. The computation of Equation 5 requires $O(N^2)$ operations for an order $N$ language model, which we will assume to be a constant. If we have $V$ words in our vocabulary the cost of the exhaustive algorithm to find a single most likely substitute would be $O(V)$.
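For comparison, the exhaustive baseline amounts to scoring every vocabulary word once and keeping the best ones. The sketch below is illustrative only; `exhaustive_subs` and the toy scores are hypothetical names and values, and `score` stands for any function that evaluates the substitute score of a word.

```python
import heapq


def exhaustive_subs(vocab, score, k):
    """Score every word in the vocabulary and return the k best (value, word) pairs."""
    return heapq.nlargest(k, ((score(x), x) for x in vocab))


# Hypothetical precomputed scores standing in for the substitute score.
toy_scores = {"san": -1.1, "los": -3.75, "the": -2.0}
print(exhaustive_subs(toy_scores, toy_scores.get, 2))
```

Every call makes exactly one score evaluation per vocabulary word regardless of how few substitutes are requested, which is the linear cost the upper bound queues are meant to avoid.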
In order to quantify the efficiency of fastsubs on a real world dataset, I used a corpus of 126 million words of WSJ data as the training set and the WSJ section of the Penn Treebank [8] as the test set. Several 4-gram language models were built from the training set using Kneser-Ney smoothing in SRILM [9] with vocabulary sizes ranging from 16K to 512K words. The average number of $\mathrm{pop}(q_0)$ operations for the top level upper bound queue was measured for number of substitutes $K$ ranging from 1 to 16K. Figure 1 shows the results.
The time cost of fastsubs depends on the number of iterations of the while loop in Table I, which in turn depends on the quality of words returned by $\mathrm{pop}(q_0)$ and the tightness of the upper bound given by $\mathrm{sup}(q_0)$. The worst case is no better than the exhaustive algorithm’s $O(V)$. However, Figure 1 shows that the average performance of fastsubs on real data is significantly better when $K \ll V$. The number of $\mathrm{pop}(q_0)$ operations in the while loop to get the top $K$ substitutes is sublinear in $K$ (the slope of the log-log curves is around 0.5878) and approaches the vocabulary size as $K \to V$. The effect of vocabulary size is practically insignificant: increasing vocabulary size from 16K to 512K less than doubles the average number of steps for a given $K$.
As a practical example, it is possible to compute the top 100 substitutes for each one of the 1,173,766 tokens in Penn Treebank with a vocabulary size of 64K in under 5 hours on a typical 2012 workstation (running a single thread on an Intel Xeon E7-4850 2GHz processor). The same task would take about 6 days for the exhaustive algorithm.
VI Contributions
Finding likely lexical substitutes has a range of applications in natural language processing. In this paper I introduced an exact and efficient algorithm, fastsubs, that is guaranteed to find the $K$ most likely substitutes for a given word context from a $V$ word vocabulary. Its average runtime is sublinear in both $K$ and $V$, giving a significant improvement over an exhaustive algorithm when $K \ll V$. An implementation of the algorithm and a dataset with the top 100 substitutes of each token in the WSJ section of the Penn Treebank are available at http://goo.gl/jzKH0.
Acknowledgments
I would like to thank the members of the Natural Language Group at USC/ISI for their hospitality and for convincing me that a fastsubs algorithm is possible.
References
 [1] D. McCarthy and R. Navigli, “SemEval-2007 task 10: English lexical substitution task,” in Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), 2007, pp. 48–53.
 [2] L. Specia, S. Jauhar, and R. Mihalcea, “SemEval-2012 task 1: English lexical simplification,” in Proceedings of the International Workshop on Semantic Evaluation, 2012, forthcoming.
 [3] R. Mihalcea, R. Sinha, and D. McCarthy, “SemEval-2010 task 2: Cross-lingual lexical substitution,” in Proceedings of the 5th International Workshop on Semantic Evaluation. Association for Computational Linguistics, 2010, pp. 9–14.
 [4] D. Yuret and M. A. Yatbaz, “The noisy channel model for unsupervised word sense disambiguation,” Computational Linguistics, vol. 36, no. 1, pp. 111–127, March 2010. [Online]. Available: http://www.aclweb.org/anthologynew/J/J10/J101004.pdf
 [5] M. A. Yatbaz, E. Sert, and D. Yuret, “Learning syntactic categories using paradigmatic representations of word context,” in Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, July 2012.
 [6] T. Hawker, “USYD: WSD and lexical substitution using the Web1T corpus,” in SemEval-2007: 4th International Workshop on Semantic Evaluations, 2007. [Online]. Available: /ref/hawker/98.pdf
 [7] D. Yuret, “KU: Word sense disambiguation by substitution,” in Proceedings of the 4th International Workshop on Semantic Evaluations. Association for Computational Linguistics, 2007, pp. 207–213.
 [8] M. P. Marcus, B. Santorini, M. A. Marcinkiewicz, and A. Taylor, Treebank-3. Philadelphia: Linguistic Data Consortium, 1999.
 [9] A. Stolcke, “SRILM – an extensible language modeling toolkit,” in Seventh International Conference on Spoken Language Processing, 2002.