Lexical Co-occurrence, Statistical Significance, and Word Association


Dipak Chaudhari
Computer Science and Engg.
IIT Bombay
dipakc@cse.iitb.ac.in

Om P. Damani
Computer Science and Engg.
IIT Bombay
damani@cse.iitb.ac.in

Srivatsan Laxman
Microsoft Research India
Bangalore
slaxman@microsoft.com

Lexical co-occurrence is an important cue for detecting word associations. We present a theoretical framework for discovering statistically significant lexical co-occurrences from a given corpus. In contrast with the prevalent practice of giving weightage to unigram frequencies, we focus only on the documents containing both the terms (of a candidate bigram). We detect biases in the span distributions of associated words, while being agnostic to variations in global unigram frequencies. Our framework has the fidelity to distinguish different classes of lexical co-occurrences, based on the strengths of the document- and corpus-level cues of co-occurrence in the data. We perform extensive experiments on benchmark data sets to study the performance of various co-occurrence measures that are currently known in the literature. We find that a relatively obscure measure called Ochiai, and a newly introduced measure CSA, capture the notion of lexical co-occurrence best, followed by LLR, Dice, and TTest, while another popular measure, PMI, surprisingly, performs poorly in the context of lexical co-occurrence.


1 Introduction

The notion of word association is important for numerous NLP applications, such as word sense disambiguation, optical character recognition, speech recognition, parsing, lexicography, natural language generation, and machine translation. Lexical co-occurrence is an important indicator of word association, and this has motivated several frequency-based measures for word association [Church and Hanks, 1989, Dunning, 1993, Dice, 1945, Washtell and Markert, 2009]. In this paper, we present a theoretical basis for the detection and classification of lexical co-occurrences. (Note that we are interested in co-occurrence, not collocation, i.e., in pairs of words that co-occur in a document with an arbitrary number of intervening words. Also, we use the term bigram to mean bigram-at-a-distance or spanned-bigram; again, other words can occur in between the constituents of a bigram.) In general, a lexical co-occurrence could refer to a pair of words that occur in a large number of documents; or it could refer to a pair of words that, although they appear only in a small number of documents, occur frequently very close to each other within each document. We formalize these ideas and construct a significance test for co-occurrences that allows us to detect different kinds of co-occurrences within a single unified framework (a feature which is absent in current measures for co-occurrence). As a by-product, our framework also leads to a better understanding of existing measures for word co-occurrence.

As pointed out in [Kilgarriff, 2005], language is never random, which brings us to the question of what model of random chance can give us a good statistical test for lexical co-occurrences. We need a null hypothesis that can account for an observed co-occurrence as a pure chance event, and this in turn requires a corpus generation model. It is often reasonable to assume that documents in the corpus are generated independently of each other. Existing frequency-based association measures like PMI [Church and Hanks, 1989], LLR [Dunning, 1993], etc. further assume that each document is drawn from a multinomial distribution based on global unigram frequencies. The main concern with such a null model is the overbearing influence of unigram frequencies on the detection of word associations. For example, the association between anomochilidae (dwarf pipe snakes) and snake would go undetected in our Wikipedia corpus, since only a tiny fraction of the pages containing snake also contained anomochilidae. Similarly, under current models, the expected span (inter-word distance) of a bigram is also very sensitive to the associated unigram frequencies: the expected span of a bigram composed of low-frequency unigrams is much larger than that of one composed of high-frequency unigrams. This is contrary to how word associations appear in language, where semantic relationships manifest with small inter-word distances irrespective of the underlying unigram distributions.

These considerations motivate our search for a more direct relationship between words, one that can potentially be detected using careful statistical characterization of inter-word distances, while minimizing the influence of the associated unigram frequencies. We focus only on the documents containing both the terms (of a candidate bigram), since in NLP applications we often have to choose from a set of alternatives for a given word. Hence, rather than ask the abstract question of whether two words are related, our approach is to ask: given that one word is a candidate for pairing with another, how likely is it that the two words are lexically correlated? For example, the probability that anomochilidae is found in the vicinity of snake is higher if we know that anomochilidae and snake appear in the same context.

We consider a null model that represents each document as a bag of words. (There can be many ways to associate a bag of words with a document. Details of this association are not important for us, except that the bag of words provides some kind of quantitative summary of the words within the document.) Then, a random permutation of the associated bag of words gives a linear representation for the document. An arbitrary relation between a pair of words will result in the locations of these words being randomly distributed in the documents in which they co-occur. If the observed span distribution of a bigram resembles that under the (random permutation) null model, then the relation between the words is not strong enough for one word to influence the placement of the other. However, if the words are found to occur closer together than is explainable by our null model, then we hypothesize the existence of a more direct association between these words.

In this paper, we formalize the notion of statistically significant lexical co-occurrences by introducing a null model that can detect biases in the span distributions of word associations, while being agnostic to variations in global unigram frequencies. Our framework has the fidelity to distinguish different classes of lexical co-occurrences, based on the strengths of the document- and corpus-level cues of co-occurrence in the data. We perform extensive experiments on benchmark data sets to study the performance of various co-occurrence measures that are currently known in the literature. We find that a relatively obscure measure called Ochiai, and a newly introduced measure CSA, capture the notion of lexical co-occurrence best, followed by LLR, Dice, and TTest, while another popular measure, PMI, surprisingly, performs poorly in the context of lexical co-occurrence.

2 Lexically significant co-occurrences

Consider a bigram α. Let D = {d_1, d_2, …, d_K} denote the set of K documents (out of the entire corpus) that contain at least one occurrence of α. The frequency of α in document d, denoted f(d), is the maximum number of non-overlapped occurrences of α in d. A set of occurrences of a bigram is called non-overlapping if the words corresponding to one occurrence from the set do not appear in between the words corresponding to any other occurrence from the set.

The span of an occurrence of α is the ‘unsigned distance’ between the first and last textual units of interest associated with that occurrence. We mostly use words as the unit of distance, but in general, distance can be measured in words, sentences, or even paragraphs (e.g. an occurrence comprising two adjacent words in a sentence has a word-span of one and a sentence-span of zero). Likewise, the size of a document d, denoted ℓ(d), is correspondingly measured in units of words, sentences or paragraphs. Finally, let f̂_x(d) denote the maximum number of non-overlapped occurrences of α in d with span less than a given threshold x. We refer to f̂_x(d) as the span-constrained frequency of α in d. Note that f̂_x(d) cannot exceed f(d).
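
These quantities can be sketched in a few lines of Python (our own illustration, not the paper's code; the greedy scan matches each word with the nearest unmatched occurrence of its partner, which keeps occurrences non-overlapped and spans small):

```python
# Greedy sketch: compute the spans of non-overlapped occurrences of a word
# pair (w1, w2) in a tokenized document, and the span-constrained frequency.
def occurrence_spans(tokens, w1, w2):
    spans, pending = [], None            # pending = (position, word) awaiting a partner
    for pos, w in enumerate(tokens):
        if w not in (w1, w2):
            continue
        if pending is None or pending[1] == w:
            pending = (pos, w)           # keep the latest copy: smallest span
        else:
            spans.append(pos - pending[0])   # one non-overlapped occurrence completed
            pending = None
    return spans

def span_constrained_freq(tokens, w1, w2, x):
    # number of non-overlapped occurrences with span strictly less than x
    return sum(1 for s in occurrence_spans(tokens, w1, w2) if s < x)
```

For example, on the token sequence "x q y q q x y", the two non-overlapped occurrences of (x, y) have spans 2 and 1, so the span-constrained frequency is 1 at threshold x = 2 and 2 at threshold x = 3.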

To assess the statistical significance of the bigram α, we ask if the span-constrained frequency f̂_x(d) (of α) is more than what we would expect for it in a document of size ℓ(d) containing f(d) ‘random’ occurrences of α. Our intuition is that if two words are semantically related, they will often appear close to each other in the document, and so the distribution of the spans will typically exhibit a prominent bias toward values less than a small x.

Consider the null hypothesis that a document is generated as a random permutation of the bag of words associated with the document. Let π_x(f̂, f, ℓ) denote the probability of observing a span-constrained frequency (for α) of at least f̂ in a document of length ℓ that contains a maximum of f non-overlapped occurrences of α. Observe that π_x(0, f, ℓ) = 1 for any x; also, for x ≥ ℓ we have π_x(f̂, f, ℓ) = 1 (i.e. all occurrences will always have span less than x for x ≥ ℓ). However, for typical values of x (i.e. for x < ℓ) the probability decreases as f̂ increases. For example, consider a document of length 400 with 4 non-overlapped occurrences of α. The probabilities of observing at least 4, 3, 2, 1 and 0 occurrences of α within a span of 20 words are 0.007, 0.09, 0.41, 0.83, and 1.0 respectively. Since π_20(3, 4, 400) = 0.09, even if 3 of the 4 occurrences of α (in the example document) have span less than 20 words, there is a 9% chance that the occurrences were a consequence of a random event (under our null model). As a result, if we desired a confidence level of at least 95%, we would have to declare α insignificant.
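
A quick Monte Carlo sketch makes this example concrete (a simplification we introduce for illustration: the 4 occurrences are modeled as 8 uniformly random word positions, with consecutive positions paired into occurrences):

```python
import random

# Estimate P(at least m occurrences within the span threshold), m = 0..n_occ,
# under the simplified random-placement null described above.
def mc_span_counts(doc_len=400, n_occ=4, span=20, trials=20000, seed=0):
    rng = random.Random(seed)
    at_least = [0] * (n_occ + 1)
    for _ in range(trials):
        pos = sorted(rng.sample(range(doc_len), 2 * n_occ))
        # pair consecutive positions into occurrences; count the small spans
        k = sum(1 for a, b in zip(pos[::2], pos[1::2]) if b - a < span)
        for m in range(k + 1):
            at_least[m] += 1
    return [c / trials for c in at_least]

probs = mc_span_counts()
```

With the defaults, the estimates decrease from 1.0 for m = 0 down to a few parts in a thousand for m = 4, qualitatively mirroring the 1.0, 0.83, 0.41, 0.09, 0.007 sequence quoted above.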

Given an ε (0 ≤ ε ≤ 1) and a span upper-bound x, the document d is said to support the hypothesis “α is an ε-significant bigram” if π_x(f̂_x(d), f(d), ℓ(d)) ≤ ε. We refer to π_x(f̂_x(d), f(d), ℓ(d)) as the document-level lexical co-occurrence of α. Define indicator variables Z_i, i = 1, …, K, as:

    Z_i = 1 if π_x(f̂_x(d_i), f(d_i), ℓ(d_i)) ≤ ε, and Z_i = 0 otherwise    (1)

Let Z = Σ_{i=1}^{K} Z_i; Z models the number of documents (out of K) that support the hypothesis “α is an ε-significant bigram.” The expected value of Z is given by

    E[Z] = Σ_{i=1}^{K} E[Z_i] = Σ_{i=1}^{K} π_x(f̂_ε(d_i), f(d_i), ℓ(d_i))    (2)

where f̂_ε(d) denotes the smallest f̂ for which we can get π_x(f̂, f(d), ℓ(d)) ≤ ε (this quantity is well-defined since π_x(f̂, f, ℓ) is non-increasing with respect to f̂). For the example given earlier (taking ε = 0.05), f̂_ε(d) = 4 and E[Z_i] = π_20(4, 4, 400) = 0.007.

Using Hoeffding’s Inequality, for t > 0,

    P[ Z ≥ E[Z] + Kt ] ≤ exp(−2Kt²)    (3)

Therefore, we can bound the deviation of the observed value of Z from its expectation by choosing t appropriately. For example, in our corpus, the bigram (canyon, landscape) occurs in K = 416 documents. For one choice of x and ε, we find that Z = 33 documents (out of 416) have ε-significant occurrences, while E[Z] is 14.34. Let δ = exp(−2Kt²). By setting δ = 0.01, we get E[Z] + Kt ≈ 45.3, which is greater than the observed value of Z (= 33). Thus, we cannot be 99% sure that the occurrences of (canyon, landscape) in the 33 documents were a consequence of non-random phenomena. Hence, our test declares (canyon, landscape) insignificant at δ = 0.01. We formally state the significance test for lexical co-occurrences next:

Definition 1 (Significant lexical co-occurrence)

Consider a bigram α and a set D = {d_1, …, d_K} of K documents containing at least one occurrence of α. Let Z denote the number of documents (out of K) that support the hypothesis “α is an ε-significant bigram” (for a given ε, x). The occurrences of the bigram are regarded (ε, δ)-significant with confidence (1 − δ) (for some user-defined δ, 0 < δ < 1) if we have Z > E[Z] + Kt, where t = sqrt(ln(1/δ) / 2K) (cf. Eq. (3)) and E[Z] is given by Eq. (2). The ratio Z / (E[Z] + Kt) is called the Co-occurrence Significance Ratio (CSR) for α.
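
The arithmetic of the (canyon, landscape) example and the CSR can be checked with a few lines (the values are the ones quoted above; the function name is ours):

```python
from math import log, sqrt

# Significance test of Definition 1: Z = observed count of supporting
# documents, EZ = E[Z], K = number of documents, delta = confidence parameter.
def csr_test(Z, EZ, K, delta):
    t = sqrt(log(1.0 / delta) / (2.0 * K))   # from exp(-2*K*t^2) = delta
    bound = EZ + K * t
    return Z / bound, Z > bound              # (CSR, significant?)

csr, significant = csr_test(Z=33, EZ=14.34, K=416, delta=0.01)
# bound = 14.34 + 416 * t comes to about 45.3 > 33, so the test
# declares (canyon, landscape) insignificant at delta = 0.01
```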

We now describe how to compute π_x(f̂, f, ℓ). Let N(f, ℓ) denote the number of ways of embedding f non-overlapped occurrences of α in a document of length ℓ. Similarly, let N_x(f̂, f, ℓ) denote the number of ways of embedding f non-overlapped occurrences of α in a document of length ℓ, in such a way that at least f̂ of the occurrences have span less than x. Recall that π_x(f̂, f, ℓ) denotes the probability of observing a span-constrained frequency (for α) of at least f̂ in a document of length ℓ that contains a maximum of f non-overlapped occurrences of α. Thus, we can assign the probability π_x in terms of N_x and N as follows:

    π_x(f̂, f, ℓ) = N_x(f̂, f, ℓ) / N(f, ℓ)    (4)

To compute N and N_x, we essentially need the histogram of span-constrained counts for given f and ℓ. Let H_x(k; f, ℓ) denote the number of ways to embed f non-overlapped occurrences of a bigram in a document of length ℓ in such a way that exactly k of the occurrences satisfy the span constraint x. We can obtain N and N_x from H_x using

    N(f, ℓ) = Σ_{k=0}^{f} H_x(k; f, ℓ)  and  N_x(f̂, f, ℓ) = Σ_{k=f̂}^{f} H_x(k; f, ℓ)    (5)

Require: ℓ - length of the document; f - number of non-overlapped occurrences to be embedded; x - span constraint for occurrences
Ensure: H - histogram of k when f occurrences are embedded in a document of length ℓ
1:  Initialize H[k] ← 0 for k = 0, …, f
2:  if ℓ < 2f then
3:     return H
4:  if f = 0 then
5:     H[0] ← 1
6:     return H
7:  for i ← 1 to ℓ − 2f + 1 do
8:     for j ← i + 1 to ℓ − 2(f − 1) do
9:        H′ ← ComputeHistogram(ℓ − j, f − 1, x)
10:       for k ← 0 to f − 1 do
11:          if j − i < x then
12:             H[k + 1] ← H[k + 1] + H′[k]
13:          else
14:             H[k] ← H[k] + H′[k]
15:  return H
Algorithm 1 ComputeHistogram(ℓ, f, x)

Algorithm 1 lists the pseudocode for computing the histogram H_x(·; f, ℓ). It enumerates all possible ways of embedding f non-overlapped occurrences of a bigram in a document of length ℓ. The main steps in the algorithm involve selecting a start and an end position for embedding the very first occurrence (lines 7-8) and then recursively calling ComputeHistogram(ℓ − j, f − 1, x) (line 9). The i-loop selects a start position for the first occurrence of the bigram, and the j-loop selects the end position. The task in the recursion step is to compute the number of ways to embed the remaining (f − 1) non-overlapped occurrences in the remaining (ℓ − j) positions. Once we have H′, we need to check whether the occurrence introduced at positions (i, j) will contribute to the span-constrained count. If (j − i) < x, whenever there are k span-constrained occurrences in positions (j + 1) to ℓ, there will be (k + 1) span-constrained occurrences in positions 1 to ℓ. Thus, we increment H[k + 1] by the quantity H′[k] (lines 10-12). However, if (j − i) ≥ x, there is no contribution to the span-constrained frequency from the first occurrence, and so we increment H[k] by the quantity H′[k] (lines 10-11, 13-14).

This algorithm is exponential in f, but it does not depend explicitly on the data. This allows us to populate the histogram off-line, and publish the tables for various ℓ, f, and x. (If the paper is accepted, we will make an interface to this table publicly available.)
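
A memoized Python sketch of this computation (our reconstruction: the function names and loop bounds are ours, and we collapse the i-loop of Algorithm 1 into a simple count of valid start positions, which makes the recursion fast enough to run directly):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def compute_histogram(l, f, x):
    """H[k] = number of ways to embed f non-overlapped occurrences in a
    document of length l such that exactly k of them have span < x."""
    if l < 2 * f:                      # not enough room for f occurrences
        return [0] * (f + 1)
    if f == 0:                         # empty embedding: one way, no spans
        return [1]
    H = [0] * (f + 1)
    # j = end position of the leftmost occurrence; count its start positions
    # i < j directly instead of looping over each i:
    for j in range(2, l - 2 * (f - 1) + 1):
        n_lt = min(j - 1, x - 1)       # starts i with span j - i < x
        n_ge = (j - 1) - n_lt          # starts i with span j - i >= x
        Hp = compute_histogram(l - j, f - 1, x)   # remaining occurrences
        for k in range(f):
            H[k + 1] += n_lt * Hp[k]
            H[k] += n_ge * Hp[k]
    return H

def pi(fhat, f, l, x):
    """pi_x(fhat, f, l) of Eq. (4): P(span-constrained frequency >= fhat)."""
    return sum(compute_histogram(l, f, x)[fhat:]) / comb(l, 2 * f)
```

The histogram rows always sum to C(ℓ, 2f), the total number of embeddings, and for the earlier example (ℓ = 400, f = 4, x = 20) the values of pi decrease sharply as fhat grows, matching the trend quoted in Section 2.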

3 Utility of CSR test

Evidence for significant lexical co-occurrences can be gathered at two levels in the data: the document level and the corpus level. First, at the document level, we may find that a surprisingly high proportion of the occurrences (of a pair of words) within a document have smaller spans than they would by random chance. Second, at the corpus level, we may find a pair of words appearing closer-than-random in an unusually high number of documents in the corpus. The significance test of Definition 1 is capable of gathering both kinds of evidence from data in carefully calibrated amounts. Prescribing ε essentially fixes the strength of the document-level hypothesis in our test. A small ε corresponds to a strong document-level hypothesis and vice-versa. The second parameter in our test, δ, controls the confidence of our decision given all the documents in the data corpus. A small δ represents a high-confidence test (in the sense that there are a surprisingly large number of documents in the corpus, each of which, individually, has some evidence of relatedness for the pair of words). By running the significance test with different values of ε and δ, we can detect different types of lexically significant co-occurrences. We illustrate the utility of our test of significance by considering 4 types of lexically significant co-occurrences.

Type A: These correspond to the strongest lexical co-occurrences in the data, with strong document-level hypotheses (low ε) as well as high corpus-level confidence (low δ). Intuitively, if a pair of words appears close together several times within a document, and if this pattern is observed in a large number of documents, then the co-occurrence is of Type A.

Type B: These are co-occurrences based on weak document-level hypotheses (high ε), but because of repeated observation in a substantial number of documents in the corpus, we can still detect them with high confidence (low δ). We expect many interesting lexical co-occurrences in text corpora to be of Type B: pairs of words that appear close to each other only a small number of times within a document, but appear together in a large number of documents.

Type C: Sometimes we may be interested in words that are strongly correlated within a document, even if we observe the strong correlation only in a relatively small number of documents in the corpus. These correspond to Type C co-occurrences. Although they are statistically weaker inferences than those of Type A and Type B (since the confidence is lower), Type C co-occurrences represent an important class of relationships between words. If the document corpus contains a very small number of documents on some topic, then strong co-occurrences (i.e. those found with low ε) which are unique to that topic may not be detected at low values of δ. By relaxing the confidence parameter δ, we may be able to detect such occurrences (possibly at the cost of some extra false positives).

Type D: These co-occurrences represent the weakest correlations found in the data, since they neither employ a strong document-level hypothesis nor enforce a high corpus-level confidence. In most applications, we expect Type D co-occurrences to be of little use, with their best case utility being to provide a baseline for disambiguating Type C co-occurrences.

Table 1: 4 types of lexical co-occurrences.

In the experiments we describe later, we fix ε and δ for the different types as per Table 1. Finally, we note that Types B and C subsume Type A; similarly, Type D subsumes all three other types. Thus, to detect co-occurrences that are exclusively of (say) Type B, we would have to run the test with a high ε and low δ and then remove from the output those co-occurrences that are also part of Type A.

4 Experimental Results

4.1 Datasets and Text Corpus

Since similarity and relatedness are different kinds of word associations [Budanitsky and Hirst, 2006], two different data sets are derived in [Agirre et al., 2009] from wordsim [Finkelstein et al., 2002]: the 203-pair sim dataset (the union of similar and unrelated pairs) and the 252-pair rel dataset (the union of related and unrelated pairs). We use these two data sets in our experiments. These datasets are symmetric in that the order of words in a pair is not expected to matter. As some of our chosen co-occurrence measures are asymmetric, we also report results on the asymmetric 272-pair esslli dataset for the ‘free association’ task at [ESSLLI, 2008].

We use the Wikipedia [Wikipedia, April 2008] corpus in our experiments. It contains 2.7 million articles, for a total size of 1.24 gigawords. We did not pre-process the corpus: no lemmatization, no function-word removal. When counting document size in words, punctuation symbols were ignored. Documents larger than 1500 words were partitioned, keeping the size of each part no greater than 1500 words.

In Table 2, we present some examples of different types of co-occurrences observed in the data.

Dataset Type A bigrams Type B bigrams Type C bigrams Type D bigrams
sim announcement-news forest-graveyard lobster-wine stock-egg
bread-butter tiger-carnivore lad-brother cup-object
rel baby-mother alcohol-chemistry victim-emergency money-withdrawal
country-citizen physics-proton territory-kilometer minority-peace
esslli arrest-police pamphlet-read meditate-think fairground-roundabout
arson-fire spindly-thin ramble-walk
Table 2: Examples of Type A, B, C and D co-occurrences under a span constraint of 20 words.

4.2 Performance of different co-occurrence measures

We now compare the performance of various frequency-based measures in the context of lexical significance. Given the large number of measures proposed in the literature [Pecina and Schlesinger, 2006], we need to identify a subset of measures to compare. Inspired by [Janson and Vegelius, 1981] and [Tan et al., 2006], we identify three properties of co-occurrence measures which may be useful for language processing applications. First is Symmetry: does the measure yield the same association score for (x,y) and (y,x)? Second is Null Addition: does addition of data containing neither x nor y affect the association score for (x,y)? And, finally, Homogeneity: if we replicate the corpus several times and merge the copies to construct a larger corpus, does the association score for (x,y) remain unchanged? Note that the concept of homogeneity conflicts with the notion of statistical support, as support increases in direct proportion with the absolute amount of evidence. Different applications may need co-occurrence measures having different combinations of these properties.
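
As a toy illustration of these properties (our own, with made-up counts), consider PMI and Dice computed from raw frequencies:

```python
from math import log

# W: corpus size in tokens; fx, fy: unigram frequencies; fxy: bigram frequency
def pmi(fxy, fx, fy, W):
    return log((fxy * W) / (fx * fy))

def dice(fxy, fx, fy):
    return 2.0 * fxy / (fx + fy)

fxy, fx, fy, W = 12, 40, 90, 10_000
# Symmetry: swapping x and y leaves both scores unchanged
assert pmi(fxy, fx, fy, W) == pmi(fxy, fy, fx, W)
assert dice(fxy, fx, fy) == dice(fxy, fy, fx)
# Homogeneity: replicating the corpus c times leaves both scores unchanged
c = 7
assert abs(pmi(c * fxy, c * fx, c * fy, c * W) - pmi(fxy, fx, fy, W)) < 1e-12
assert abs(dice(c * fxy, c * fx, c * fy) - dice(fxy, fx, fy)) < 1e-12
# Null addition: adding documents with neither word grows W and changes PMI
assert pmi(fxy, fx, fy, 2 * W) != pmi(fxy, fx, fy, W)
```

This matches the corresponding rows of Table 3: PMI is symmetric and homogeneous but fails null addition, since its score depends on the total corpus size W.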

Method Symmetry Null Add. Homogeneity
CSR (this work) Y Y N
CSA (this work) Y N Y
LLR [Dunning, 1993] Y Y Y
PMI [Church and Hanks, 1989] Y N Y
SCI [Washtell and Markert, 2009] N N Y
CWCD [Washtell and Markert, 2009] N N Y
Pearson’s Chi-Square test Y Y Y
T-test Y N Y
Dice [Dice, 1945] Y N Y
Ochiai [Janson and Vegelius, 1981] Y N Y
Jaccard [Jaccard, 1912] Y N Y

Terminology (for a candidate bigram (x, y)):
W: total number of tokens in the corpus. f(x), f(y): unigram frequencies of x and y in the corpus. f(x,y): span-constrained bigram frequency. Harmonic mean of the spans of the occurrences (used by CWCD). E[f(x,y)]: expected value of f(x,y).

Table 3: Properties of selected co-occurrence measures

Table 3 shows the characteristics of our chosen co-occurrence measures, which were selected from several domains like ecology, psychology, medicine, and language processing. Except for Ochiai [Ochiai, 1957, Janson and Vegelius, 1981] and the recently introduced measure CWCD [Washtell and Markert, 2009], all of the selected measures are well-known in the NLP community [Pecina and Schlesinger, 2006]. (From the various so-called windowless measures introduced in [Washtell and Markert, 2009], we chose the best-performing variant, Cue-Weighted Co-Dispersion (CWCD), and implemented a window-based version of it with the harmonic mean. We note that any windowless (or spanless) measure can easily be thought of as a special case of a window-based measure, where the windowless formulation corresponds to a very large window (or span, in our terminology).) Based on our extensive study of the theoretical and empirical properties of CSR, we also introduce a new bigram-frequency-based measure called CSA (Co-occurrence Significance Approximated), which approximates the behaviour of CSR over a wide range of parameter settings.

In our experiments, we found that Ochiai and Chi-Square have almost identical performance, differing only in the third decimal digit. This can be explained easily. In our context, for any word x, f(x) is much smaller than W (as defined in Table 3), and therefore W − f(x) ≈ W. With this, Chi-Square reduces to W times the square of Ochiai, a monotone transform. Similarly, Jaccard and Dice coincide, since Jaccard = Dice / (2 − Dice) is a monotone function of Dice. Hence we do not report further results for Chi-Square and Jaccard.
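
Both reductions are easy to verify numerically (toy counts, our own):

```python
from math import sqrt

def ochiai(fxy, fx, fy):
    return fxy / sqrt(fx * fy)

def dice(fxy, fx, fy):
    return 2.0 * fxy / (fx + fy)

def jaccard(fxy, fx, fy):
    return fxy / (fx + fy - fxy)

# Jaccard = Dice / (2 - Dice) holds exactly, so the two rank identically:
for fxy, fx, fy in [(5, 40, 60), (12, 100, 90), (3, 500, 20)]:
    d = dice(fxy, fx, fy)
    assert abs(jaccard(fxy, fx, fy) - d / (2.0 - d)) < 1e-12

# When f(x), f(y) << W, the 2x2 Chi-Square statistic is close to W * Ochiai^2:
W, fxy, fx, fy = 10**7, 5, 40, 60
a, b, c = fxy, fx - fxy, fy - fxy
d_ = W - fx - fy + fxy
chi2 = W * (a * d_ - b * c) ** 2 / (fx * fy * (W - fx) * (W - fy))
assert abs(chi2 - W * ochiai(fxy, fx, fy) ** 2) / chi2 < 1e-3
```

Since both relations are monotone, rankings (and hence rank correlations) under Chi-Square and Jaccard coincide with those under Ochiai and Dice respectively.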

In our first set of experiments, we compared the performance of various frequency-based measures in terms of their suitability for detecting lexically significant co-occurrences (cf. Definition 1). A high Spearman correlation coefficient between the ranked list produced by a given measure and the list produced by CSR (with respect to some choice of ε and δ) would imply that the measure is effective in detecting the corresponding type of lexically significant co-occurrences.
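
For reference, the Spearman coefficient between two score lists (assuming no ties, as a minimal sketch) can be computed as:

```python
# Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1)),
# where d is the per-item difference between the two rank vectors.
def spearman(xs, ys):
    n = len(xs)
    def ranks(vs):
        order = sorted(range(n), key=lambda i: vs[i])
        r = [0] * n
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

Identical rankings give a coefficient of 1.0 and fully reversed rankings give -1.0; in practice a library routine that handles ties (e.g. scipy.stats.spearmanr) would be used instead.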

Span Threshold
Measure Data 5w 25w 50w
PMI sim C - -
rel - - -
essli - - -
CWCD sim - - -
rel - - -
essli - - -
CSA sim A, B, C, D A, B, C A, B, C
rel A, B, C, D A, B, C A, C
essli A, B, C, D A, B, C A, C
Dice sim A, B, C, D A, B, C A, B
rel A, B, C, D - -
essli - - -
Ochiai sim A, B, C, D A, B, C, D A, B, C
rel A, B, C, D A, B, C A, B, C
essli A, B, C, D A, B A
LLR sim A, B, C, D A, B A
rel A, B, C, D A A
essli A, B, C A A
TTest sim A, B, C A -
rel A, B, C - -
essli - - -
SCI sim - - -
rel - - -
essli - - -
Table 4: Types of lexical co-occurrences detected by different measures
Figure 1: Maximum correlation of various measures with various types of CSR for sim dataset

Table 4 lists, for each measure and for each data set, the different types of lexically significant co-occurrences that the measure is able to detect effectively: if the corresponding Spearman correlation coefficient exceeds 0.90, we consider the measure effective for the given type. Results are shown for three different span constraints: a small span of 5 words (5w), a medium span of 25 words (25w) and a large span of 50 words (50w). For example, the CSA and Ochiai measures are effective in detecting all 4 types of lexically significant co-occurrences (A, B, C and D) in all three data sets when the span constraint is set to 5 words. Figure 1 presents a detailed quantitative comparison of the best performance of each measure with respect to each type of co-occurrence for a range of different span constraints on the sim data set (similar results were obtained on the other data sets). The inferences we can draw are consistent with the results of Table 4.

Parameters for best correlation
Measure Span ε δ Type Correlation
PMI 5w 0.05 1 C 91.3
25w 0.40 1 D 85.3
50w 0.50 1 D 82.0
CWCD 5w 0.99 0.9 D 83.6
25w 0.50 0.9 D 76.0
50w 0.50 0.9 D 74.4
CSA 5w 0.1 0.0005 A 98.9
25w 0.05 0.0005 A 96.7
50w 0.1 0.0005 A 94.9
Dice 5w 0.1 0.005 A 96.1
25w 0.05 0.005 A 93.0
50w 0.1 0.0005 A 91.3
Ochiai 5w 0.1 0.1 A 97.4
25w 0.1 0.01 A 95.5
50w 0.1 0.005 A 94.5
LLR 5w 0.05 0.0005 A 97.3
25w 0.05 0.0005 A 94.8
50w 0.1 0.0005 A 92.6
TTest 5w 0.05 0.0005 A 94.2
25w 0.05 0.0005 A 90.9
50w 0.1 0.0005 A 88.8
SCI 5w 0.05 0.0005 A 82.7
25w 0.05 0.0005 A 75.9
50w 0.1 0.0005 A 73.1
Table 5: Best performing (ε, δ)-pairs for different measures on sim data

In our next experiment, we examine which of the four types of co-occurrences are best captured by each measure. Results for the sim data set are listed in Table 5 (similar results were obtained on the other data sets). For each measure and for each span constraint, the table describes the best performing parameters (ε and δ), the corresponding co-occurrence type, and the associated ‘best’ correlation achieved with respect to the test of Definition 1. The results show that, irrespective of the span constraint, most measures perform best on Type A co-occurrences. This is reasonable because Type A essentially represents the strongest correlations in the data, and one would expect the measures to capture the strong correlations better than the weaker ones. There are, however, two exceptions to this rule, namely PMI and CWCD, which instead peak at Types C or D. The best correlations for these two measures are also typically lower than those of the other measures. We now summarize the main findings from our study:

  • The relatively obscure Ochiai and the newly introduced CSA are the best performing measures, in terms of detecting all types of lexical co-occurrences, in all data sets and for a wide range of span constraints.

  • Dice, LLR and TTest are the other measures that effectively track lexically significant co-occurrences (although, all three are less effective as the span constraints become larger).

  • SCI, CWCD, and the popular PMI measure are ineffective at capturing any notion of lexically significant co-occurrences, even for small span constraints. In fact, the best result for PMI is the detection of Type C co-occurrences in the sim data set. The low-ε, high-δ setting of Type C suggests that PMI does a poor job of detecting the strongest co-occurrences in the data, overlooking co-occurrences backed by both strong document-level and strong corpus-level cues for lexical significance.

sim rel esslli
PMI top 10 R Ochiai top 10 R PMI top 10 R Ochiai top 10 R PMI top 10 R Ochiai top 10 R
vodka-gin 42 football-soccer 3 money-laundering 2 soap-opera 1 nook-cranny 91 floyd-pink 4
seafood-lobster 59 street-avenue 5 soap-opera 1 money-laundering 2 hither-thither 104 either-or 1
bread-butter 13 physics-chemistry 2 opec-oil 8 computer-software 18 sprain-ankle 60 election-general 7
vodka-brandy 99 television-radio 6 weather-forecast 5 television-film 7 blimey-cor 147 nook-cranny 91
midday-noon 79 championship-tournament 10 psychology-cognition 77 jerusalem-israel 16 margarine-butter 77 twentieth-century 2
murder-manslaughter 19 man-woman 16 decoration-valor 73 weather-forecast 5 tinker-tailor 65 bride-groom 16
cucumber-potato 130 vodka-gin 42 gender-equality 11 drug-abuse 4 ding-dong 26 you-me 14
dividend-payment 61 king-queen 9 tennis-racket 20 credit-card 3 bride-groom 16 north-south 19
physics-chemistry 2 car-automobile 43 liability-insurance 25 game-series 12 jigsaw-puzzle 30 question-answer 11
psychology-psychiatry 27 harvard-yale 11 fbi-investigation 10 stock-market 9 bidder-auction 76 atlantic-ocean 10
Table 6: Top 10 bigrams according to PMI and Ochiai rankings on the sim, rel, and esslli datasets. ‘R’ denotes the bigram’s ranking according to the Type-A CSR measure. A span of 25 words is used for all three measures.

Note that our results do not contradict the utility of PMI, SCI, or CWCD as word-association measures; we only observe their poor performance in the context of detecting lexical co-occurrences. Also, our notion of lexical co-occurrence is symmetric. It is possible that the asymmetric SCI has competitive performance on certain asymmetric tasks compared to the better performing symmetric measures. Finally, to give a qualitative feel for the differences in the correlations preferred by different methods, in Table 6 we show the top 10 bigrams picked by PMI and Ochiai for all three datasets.

5 Relation between lexical co-occurrence and human judgements

Method 1 2 3 4 5 6 7 8 9 10
Human environment maradona opec computer money jerusalem law weather network fbi
Judgement ecology (84) football (53) oil (8) software (18) bank (28) israel (16) lawyer (42) forecast (5) hardware (107) investigation (10)
CSR soap money credit drug weather cup television opec stock fbi
opera (24) laundering (129) card (20) abuse (69) forecast (8) coffee (82) film (31) oil (3) market (19) investigation (10)
Table 7: Top 10 word associations picked in the rel dataset. The numbers in brackets are the cross rankings: CSR rankings in the human row and human rankings in the CSR row. CSR parameters are the same as those for Table 6.

While the focus of our work is on characterizing statistically significant lexical co-occurrence, as illustrated in Table 7, human judgement of word association is governed by many factors in addition to lexical co-occurrence considerations, and many non-co-occurrence-based measures have been designed to capture semantic word association. Notable among them are distributional similarity based measures [Agirre et al., 2009, Bollegala et al., 2007, Chen et al., 2006] and knowledge-based measures [Milne and Witten, 2008, Hughes and Ramage, 2007, Gabrilovich and Markovitch, 2007, Yeh et al., 2009, Strube and Ponzetto, 2006, Finkelstein et al., 2002, Wandmacher et al., 2008]. Since our focus is on frequency based measures alone, we do not discuss these other measures.

The lexical co-occurrence phenomenon and the human judgement of semantic association are related but different dimensions of relationships between words, and different applications may prefer one over the other. For example, suppose, given one word (say dirt), the task is to choose from among a number of alternatives for the second (say grime and filth). Human judgment scores for (dirt, grime) and (dirt, filth) are 5.4 and 6.1 respectively. However, their lexical co-occurrence scores (CSR) are 1.49 and 0.84 respectively. This is because filth is often used in a moral context as well, while grime is usually used only in a physical sense. Dirt is used mostly in a physical sense, but is a bit more generic and may be used in a moral sense occasionally. Hence (dirt, grime) is more correlated in the corpus than (dirt, filth). This shows that human judgement is fallible and annotators may ignore subtleties of meaning that can be picked up by statistical techniques like ours.

In general, for association with a given word, all synonyms of a second word will be given similar semantic relatedness scores by human judges, but they may have very different lexical association scores.

For applications where the notion of statistical lexical co-occurrence is potentially more relevant than semantic relatedness, our method can be used to generate a gold standard of lexical association (against which other association measures can be evaluated). In this context, it is interesting to note that, contrary to the human judgement, every co-occurrence measure studied by us finds (dirt, grime) more associated than (dirt, filth).

Having explained that significant lexical co-occurrence is a fundamentally different notion from human judgement of word association, we also want to emphasize that the two are not completely different notions either, and they correlate reasonably well with each other. For the sim, rel, and essli datasets, CSR's best correlations with human judgement are 0.74, 0.65, and 0.46 respectively. Note that CSR is a symmetric notion and hence correlates far better with human judgement for the symmetric sim and rel datasets than for the asymmetric essli dataset. Also, at first glance, it is a little counter-intuitive that the notion of lexical co-occurrence yields better correlations with the sim (based on similarity) dataset than with the rel (based on relatedness) dataset. This can essentially be explained by our observation that similar words tend to co-occur by chance less frequently than related words.
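The correlation figures above are Spearman rank correlations between a measure's scores and the human judgement scores over a common set of word pairs. The following sketch shows the computation for a tie-free sample; the four word pairs and their scores are illustrative placeholders, not values drawn from the sim, rel, or essli datasets.

```python
# Sketch: Spearman rank correlation between an association measure's
# scores and human judgement scores over the same word pairs.
# The scores below are hypothetical, not from the actual datasets.

def ranks(xs):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman rho for tie-free samples: 1 - 6*sum(d^2)/(n*(n^2-1))."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical scores for four word pairs, e.g.
# (car, automobile), (dirt, filth), (dirt, grime), (noon, string).
human   = [8.9, 6.1, 5.4, 0.5]
measure = [2.10, 0.84, 1.49, 0.02]

print(spearman(human, measure))  # -> 0.8
```

With real data, ties are common and a tie-corrected implementation (e.g. `scipy.stats.spearmanr`) would normally be used instead of this closed form.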

6 Conclusions

In this paper, we introduced the notion of statistically significant lexical co-occurrences. We detected skews in span distributions of bigrams to assess significance and showed how our method allows classification of co-occurrences into different types. We performed experiments to assess the performance of various frequency-based measures for detecting lexically significant co-occurrences. We believe lexical co-occurrence can play a critical role in several applications, including word sense disambiguation, multi-word spotting, etc. We will address some of these in our future work.


  • [Agirre et al., 2009] Agirre, Eneko, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and wordnet-based approaches. In NAACL-HLT.
  • [Bollegala et al., 2007] Bollegala, Danushka, Yutaka Matsuo, and Mitsuru Ishizuka. 2007. Measuring semantic similarity between words using web search engines. In WWW, pages 757–766.
  • [Budanitsky and Hirst, 2006] Budanitsky, Alexander and Graeme Hirst. 2006. Evaluating wordnet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13–47.
  • [Chen et al., 2006] Chen, Hsin-Hsi, Ming-Shun Lin, and Yu-Chuan Wei. 2006. Novel association measures using web search with double checking. In ACL.
  • [Church and Hanks, 1989] Church, Kenneth Ward and Patrick Hanks. 1989. Word association norms, mutual information and lexicography. In ACL, pages 76–83.
  • [Dice, 1945] Dice, L. R. 1945. Measures of the amount of ecological association between species. Ecology, 26:297–302.
  • [Dunning, 1993] Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.
  • [ESSLLI, 2008] ESSLLI. 2008. Free association task at the Lexical Semantics Workshop, ESSLLI 2008. http://wordspace.collocations.de/doku.php/workshop:esslli:task.
  • [Finkelstein et al., 2002] Finkelstein, Lev, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: the concept revisited. ACM Trans. Inf. Syst., 20(1):116–131.
  • [Gabrilovich and Markovitch, 2007] Gabrilovich, Evgeniy and Shaul Markovitch. 2007. Computing semantic relatedness using wikipedia-based explicit semantic analysis. In IJCAI.
  • [Hughes and Ramage, 2007] Hughes, T. and D. Ramage. 2007. Lexical semantic relatedness with random graph walks. In EMNLP.
  • [Jaccard, 1912] Jaccard, P. 1912. The distribution of the flora of the alpine zone. New Phytologist, 11:37–50.
  • [Janson and Vegelius, 1981] Janson, Svante and Jan Vegelius. 1981. Measures of ecological association. Oecologia, 49:371–376.
  • [Kilgarriff, 2005] Kilgarriff, Adam. 2005. Language is never ever ever random. Corpus Linguistics and Linguistic Theory, 1(2):263–276.
  • [Milne and Witten, 2008] Milne, David and Ian H. Witten. 2008. An effective, low-cost measure of semantic relatedness obtained from wikipedia links. In ACL.
  • [Ochiai, 1957] Ochiai, A. 1957. Zoogeografical studies on the soleoid fishes found in Japan and its neighbouring regions-II. Bulletin of the Japanese Society of Scientific Fisheries, 22.
  • [Pecina and Schlesinger, 2006] Pecina, Pavel and Pavel Schlesinger. 2006. Combining association measures for collocation extraction. In ACL.
  • [Strube and Ponzetto, 2006] Strube, Michael and Simone Paolo Ponzetto. 2006. Wikirelate! computing semantic relatedness using wikipedia. In AAAI, pages 1419–1424.
  • [Tan et al., 2006] Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. 2006. Chapter 6.7: Evaluation of association patterns. In Introduction to Data Mining, pages 379–382. Pearson Education, Inc.
  • [Wandmacher et al., 2008] Wandmacher, T., E. Ovchinnikova, and T. Alexandrov. 2008. Does latent semantic analysis reflect human associations? In European Summer School in Logic, Language and Information (ESSLLI’08).
  • [Washtell and Markert, 2009] Washtell, Justin and Katja Markert. 2009. A comparison of windowless and window-based computational association measures as predictors of syntagmatic human associations. In EMNLP, pages 628–637.
  • [Wikipedia, April 2008] Wikipedia. April 2008. http://www.wikipedia.org.
  • [Yeh et al., 2009] Yeh, Eric, Daniel Ramage, Chris Manning, Eneko Agirre, and Aitor Soroa. 2009. Wikiwalk: Random walks on wikipedia for semantic relatedness. In ACL Workshop “TextGraphs-4: Graph-based Methods for Natural Language Processing”.