# Natural Language Feature Selection via Cooccurrence

###### Abstract

Specificity is important for extracting collocations, keyphrases, multi-word and index terms [Newman et al. 2012]. It is also useful for tagging, ontology construction [Ryu and Choi 2006], and automatic summarization of documents [Louis and Nenkova 2011; Chali and Hassan 2012]. Term frequency and inverse-document frequency (TF-IDF) are typically used to estimate specificity, but fail to take advantage of the semantic relationships between terms [Church and Gale 1995]. The result is that general idiomatic terms are mistaken for specific terms. We demonstrate the use of relational data for estimating term specificity. The specificity of a term can be learned from its distribution of relations with other terms. This technique is useful for identifying relevant words or terms for other natural language processing tasks.

## Motivation

A deeper understanding of the semantics in natural language can help overcome limitations of basic statistical methods that lack it. One fundamental property of natural language tokens is specificity. Specificity was defined for a term as the number of documents to which the term pertains [Jones 1972], but has since become more abstract in order to apply to multi-word terms and other natural language tasks [Frantzi et al. 1998]. The common definition is that a “specific” term has meaning within a relative subdomain, while a “general” term may apply to entire domains of study.

Specificity itself is usually estimated with frequency statistics such as term frequency and inverse-document frequency (TF-IDF). We will try to do better than TF-IDF by taking advantage of the underlying semantic relationships between terms. Our primary assumption is that two terms which are strongly related will tend to occur together. This is known as the latent relation hypothesis [Akbik et al. 2012]. We will use this to infer relations through term collocations. We connect concept relations to a notion of specificity using the theory of a “semantic hierarchy” [Chodorow et al. 1985]. In a semantic hierarchy, “high level” terms are general and sit above “low level” terms, and each high level term is connected to the specific terms which are related to it.

We will create a new method of inferring specificity by using a simple cooccurrence model of relations together with the idea of a semantic hierarchy. Because this method is based on the semantics of the terms involved, it is robust against functional words, which TF-IDF tends to handle poorly. The result is high-precision selection of terms appropriate for tasks such as tagging, where a relatively small number of terms is desired.

## Prior Work

Prior work on term specificity has used frequency statistics such as TF-IDF [Church and Gale 1995], context measures such as C/NC-value [Frantzi et al. 1998; Caraballo and Charniak 1999; Ryu and Choi 2006], and latent-space-analysis techniques based on term informativeness [Hogan 2007; Kireyev 2009].

Relationship extraction between terms follows two strategies: the use of statistical measures [He 1999; Ryu and Choi 2006; Hogan 2007] and pattern-based information extraction techniques [Navigli and Velardi 2004; Akbik et al. 2012].

Statistical measures attempt to capture a conditional probability such as “given term A has occurred, what is the probability of term B occurring?” [He 1999] and more primitive frequencies such as cooccurrence and individual term frequency [Ryu and Choi 2006]. They include bag-of-words frequencies as well as contextual measures. Context methods examine the distribution of modifier terms which immediately precede or follow the term in question. Ryu and Choi (2006) describe the semantic model behind context methods: “Distribution of adjective-term relation refers to the idea that specific nouns are rarely modified, while general nouns are frequently modified in text.” These unsupervised techniques are more popular for large-scale use on the web because they require less training and fine-tuning than the pattern-based methods [Rosenfeld and Feldman 2007; Akbik et al. 2012].

Pattern-based information extraction relies on specialized parsers to extract ternary relations: two operand terms and a third relation operator. For example, given the sentence “SVMs are a kind of binary classification.” a parser might identify the two terms “SVM” and “binary classification” as well as the “is a” relation between them. This parsing technique relies heavily on prior syntactic knowledge and has problems identifying relations that aren’t represented in a single sentence [Navigli and Velardi 2004; Akbik et al. 2012]. Performance is dependent on pre-processing methods and the parsing language pattern used [Etzioni et al. 2011; Gupta and Manning 2011].

Our contribution is the combination of a relatively fast relation extraction technique with the semantic model for specificity inference.

## Models

### Relation Extraction

Since this relational model is essentially a knowledge representation used by later techniques, we choose a crude and computationally fast method for inference of term relations. Via the latent relation hypothesis mentioned earlier, we assume that a semantic relationship exists between two terms based on cooccurrence frequency. Relations between concepts will only be modeled implicitly.

The corpus is discretized into observation units. Units could be defined by sentence breaks, paragraph breaks, document breaks, etc. To collect occurrence statistics, each unit is treated as an independent observation of terms. For example, if we used sentence breaks to define observation units, the sentence “SVMs are a kind of binary classification.” would count as one observation in which “SVM” and “binary classification” cooccur, implying a stronger relation between them.
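The counting step described above can be sketched as follows. This is a minimal illustration, not the implementation used for the experiments: the function name `cooccurrence_counts`, the toy observation units, and the choice to count a repeated term once per unit are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(units):
    """Count, for each unordered term pair, the units containing both terms."""
    counts = Counter()
    for unit in units:
        # Each unit is one independent observation; a repeated term counts once
        # (an assumption of this sketch).
        terms = sorted(set(unit))
        for a, b in combinations(terms, 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical, already-segmented observation units (one per sentence).
units = [
    ["svm", "binary classification"],
    ["svm", "kernel method", "binary classification"],
    ["data set", "kernel method"],
]
counts = cooccurrence_counts(units)
print(counts[("binary classification", "svm")])  # 2
```

Pairs are keyed in sorted order so that each unordered pair is stored once; a pair that never cooccurs simply has a count of zero.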

Compared to parsing methods for relation extraction, this method is not constrained to sentence-level analysis and can be done more easily. However, it doesn’t learn the type or quality of the relation. Where parsing techniques treat relations as either present or not-present, cooccurrence will produce relations between every pair of terms. Fortunately, these relations are identified by statistics which can be interpreted as the confidence or strength of the relation. A strong relation is one that a parser might identify, e.g. the “is a” relation between two terms in a sentence. A weak relation is one that is semantically trivial and unlikely for a parser to find, e.g. “they are different measures used in different techniques but occur in the same corpus” is a relation which will probably not be explicit in natural text, but would result in a weak cooccurrence relation.

The assumption that cooccurrence implies a relationship is not always correct. The result of our method is a full graph which requires some sort of pruning to avoid false relations. For our applications, we deliberately leave the graph unpruned in order to test how useful it is in this state.

Identifying cooccurrence statistics requires $O(N^2)$ work for a vocabulary of size $N$.

### Specificity

Once we have an idea of relations between terms, we use a semantic model to infer specificity. We assume a semantic hierarchy exists between all terms. It is not a strict tree hierarchy in that lower terms may be connected to multiple parents. However, it is still tree-like in that a more specific term will have fewer relations with other terms because it is semantically unrelated to terms outside its domain. A more general term will have a larger number of relations with specific terms within its domain as well as relations with peers and parents. At the top of the hierarchy are very general terms which serve functional purposes in writing or are not domain-specific. We do not have to explicitly model the hierarchy, but will infer the specificity of a term by its distribution of relations with other terms.

Figure 2 shows an example of the very general word “data set” which has higher TF-IDF than the specific word “kernel method”. TF-IDF would consider “data set” a more relevant term than “kernel method”. A specific word like “kernel method” will have a restricted domain of relations with other terms, so its relations are distributed in a more predictable manner. The result is that its relation distribution has a lower Shannon entropy than the general word “data set”.

We will use the normalized version of this cooccurrence distribution to approximate the probability that term $i$ has a relation with term $j$. The cooccurrence approximation for a relation between term $i$ and term $j$ will be:

$$P_{ij} = \frac{C_{ij}}{\sum_{k=1}^{N} C_{ik}}$$

where $N$ is the vocabulary size and $C$ is an $N \times N$ matrix of nonnegative cooccurrence counts; $C_{ij}$ is the total number of observation units which contain both term $i$ and term $j$. The Shannon entropy formula, $H_i = -\sum_{j=1}^{N} P_{ij} \log P_{ij}$, is useful because it captures the idea of predictability and we expect specific terms to have more predictable cooccurrence distributions.

Terms that have specific relation patterns receive a lower score. The results are ranked from low-to-high to produce the most specific terms first.
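As a rough illustration of this scoring and ranking step, the sketch below normalizes each term's row of cooccurrence counts, takes its Shannon entropy, and sorts terms from low to high. The counts are invented for the example, and the function name `relation_entropy`, the use of base-2 logarithms, and the convention of scoring a term with no relations as zero are assumptions rather than details taken from our experiments.

```python
import math


def relation_entropy(row):
    """Shannon entropy (in bits) of one term's normalized cooccurrence row."""
    total = sum(row)
    if total == 0:
        return 0.0  # assumed convention for a term with no relations
    probs = [c / total for c in row if c > 0]
    return -sum(p * math.log2(p) for p in probs)


# Hypothetical cooccurrence counts of each term with four other terms.
rows = {
    "kernel method": [8, 0, 0, 0],  # relations concentrated on one term
    "data set": [2, 2, 2, 2],       # relations spread evenly
}
scores = {term: relation_entropy(row) for term, row in rows.items()}
ranking = sorted(scores, key=scores.get)  # low-to-high: most specific first
print(ranking)  # ['kernel method', 'data set']
```

In this toy case the concentrated row has zero entropy and the evenly spread row has two bits, so the term with the concentrated row is ranked first.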

Altogether, $O(N^2)$ work is required to compute this value for all terms in a vocabulary of size $N$. Since TF-IDF is $O(N)$, our method is more costly than computing TF-IDF from the same data.

## Methods and Results

### Application in Keyphrase Extraction

Index term and keyphrase extraction is a task which requires extraction of “main terms and concepts in a document” [Newman et al. 2012]. This is analogous to extracting both low-level terms and terms which are general but still within the domain of a particular document. That is, we no longer want to identify the most specific terms, but a particular range of terms. With this aim, we use alternate versions of the probability estimate which are not based directly on cooccurrence. The two alternate estimators are based on covariance and on mutual information.

Our primary data set was a set of machine learning and neurobiology articles from NIPS. It has 132262 unigrams, or 86529 terms after using a segmenter [Newman et al. 2012]. The segmented term results are shown below, but similar results were found with unigrams and alternate data sets including a set of ACL research article texts and PubMed search results.

The covariance between terms was calculated from observation data. Positive covariances are assumed to represent the strength of a relationship between terms, so non-positive covariances are discarded. A specific term will have most of its relationship mass concentrated in a smaller number of other terms, so an entropy calculation is taken over the positive-covariance distribution for each term. This method treats negative correlations as zero and so ignores the negative relationships between terms of different domains.
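A minimal sketch of the covariance-based measure is given below. It assumes population covariance over per-document counts and base-2 entropy, neither of which is specified above; the term counts and the function names `covariance` and `positive_covariance_entropy` are hypothetical.

```python
import math


def covariance(x, y):
    """Population covariance of two equal-length count vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


def positive_covariance_entropy(term, counts):
    """Entropy (bits) of a term's positive covariances with all other terms."""
    weights = [covariance(counts[term], counts[other])
               for other in counts if other != term]
    weights = [w for w in weights if w > 0]  # discard non-positive covariances
    total = sum(weights)
    if total == 0:
        return 0.0
    return -sum((w / total) * math.log2(w / total) for w in weights)


# Hypothetical per-document counts (four documents) for four terms.
counts = {
    "support vector": [3, 2, 0, 0],
    "kernel method":  [2, 3, 0, 0],
    "spike train":    [0, 0, 3, 2],
    "model":          [4, 1, 4, 1],
}
for term in counts:
    print(term, round(positive_covariance_entropy(term, counts), 3))
```

Here “model” covaries positively and equally with terms from both domains, so its entropy is one bit, while “support vector” has most of its positive covariance concentrated on “kernel method” and scores lower.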

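A second measure uses mutual information in an attempt to leverage the negative relations ignored before. Data is binarized to represent “present” or “not present” for each term in each observation. Mutual information is calculated by

$$I(X; Y) = \sum_{x \in \{0,1\}} \sum_{y \in \{0,1\}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$

where $X$ and $Y$ indicate the presence of the two terms in an observation.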

In a binarized frequency format, there are four cases which contribute to the mutual information calculation: two where a single term is present, one where both are present, and one where neither is present. A specific term will tend to have high mutual information with many others. Because of the many terms outside its domain, a specific term tends toward an XOR pattern with others: where one is present, the other is not. This results in higher mutual information between these terms. Additionally, the specific term will have high mutual information with terms in its domain. On the other hand, general terms which are still domain-specific gain mutual information from their descendants in the semantic hierarchy. It is the functional terms, which we want to filter out, that will have lower mutual information in more cases. When an entropy estimate is calculated over this distribution with other terms, the results are ranked high-to-low to produce domain-specific terms first.
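The four-case calculation and the XOR effect can be illustrated with a short sketch. It uses the standard mutual information of two binary variables with base-2 logarithms; the presence vectors and the function name `mutual_information` are invented for the example and do not come from our data.

```python
import math


def mutual_information(x, y):
    """Mutual information (bits) between two binary presence vectors."""
    n = len(x)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            # Joint and marginal probabilities for this one of the four cases.
            p_ab = sum(1 for u, v in zip(x, y) if u == a and v == b) / n
            p_a = sum(1 for u in x if u == a) / n
            p_b = sum(1 for v in y if v == b) / n
            if p_ab > 0:
                mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical presence/absence of three terms across four documents.
support_vector = [1, 1, 0, 0]
spike_train = [0, 0, 1, 1]  # XOR pattern with "support vector"
other_hand = [1, 1, 1, 1]   # functional term present everywhere

print(mutual_information(support_vector, spike_train))  # 1.0
print(mutual_information(support_vector, other_hand))   # 0.0
```

The two terms from different domains share a full bit of information because the presence of one predicts the absence of the other, while the functional term that appears in every document carries no information about anything else.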

We used document breaks to define observation units so that each method used the same input data: bag-of-words frequency counts for each document. When used to rank terms directly, the cooccurrence methods appear comparable to TF-IDF at this task, with very similar precision scores for the top-N ranked terms.

For each of these top 4 measures, we performed 30-fold cross validation over 500 terms using single-feature support-vector machine classifiers. Classifiers did best when trained on entropy of mutual information. Performance increases with the amount of work required to compute the feature.

| Term | Increase Under TF-IDF | Golden |
|---|---|---|
| data point | 482 | 1 |
| section NUM | 481 | 0 |
| et al | 481 | 0 |
| equation NUM | 476 | 0 |
| figure NUM | 468 | 0 |
| training data | 449 | 1 |
| NUM figure NUM | 437 | 0 |
| other hand | 435 | 0 |
| table NUM | 432 | 0 |
| show how | 421 | 0 |

| Term | Increase Under Cooccurrence | Golden |
|---|---|---|
| model predict | 405 | 0 |
| generative process | 397 | 1 |
| follow lemma | 369 | 0 |
| sufficient condition | 359 | 1 |
| model prediction | 354 | 1 |
| kernel method | 342 | 1 |
| online learning | 338 | 1 |
| convex function | 331 | 1 |
| dash line | 331 | 0 |
| proof theorem NUM | 331 | 0 |
| convex optimization problem | 330 | 1 |

## Future Work

### Evaluation of Specificity Model

The specificity model could be evaluated directly by comparison with existing human-curated ontologies such as the Medical Subject Headings (MeSH) provided with NCBI publications.

### Application in Automatic Summarization

### Structure Learning for Semantic Hierarchy Models

The semantic hierarchy which was only used implicitly before could be modeled explicitly.

When terms are plotted by their covariance and mutual information entropies, the results from above emerge. General terms tend to occur in the upper regions of the plot, and very general stop-word terms tend to occur on the left side.

The specific term “support vector” is highlighted in green. The terms most related to it are highlighted in red. These are the terms which had the highest mutual information with “support vector”. Other terms highly related to “support vector” which we did not plot were “training point”, “training set”, and “training data”. We omitted them because they overlap and are difficult to see. These “training X” terms tend to be associated with most supervised learning techniques, and are considered less specific than “support vector”. Because “support vector” is a low-level specific term, our model does not expect it to be related to many other low-level terms.

## References

- Akbik et al. [2012] Alan Akbik, Larysa Visengeriyeva, Priska Herger, Holmer Hemsen, and Alexander Löser. Unsupervised discovery of relations and discriminative extraction patterns. In Proceedings of COLING 2012, pages 17–32, Mumbai, India, December 2012. The COLING 2012 Organizing Committee. URL http://www.aclweb.org/anthology/C12-1002.
- Caraballo and Charniak [1999] Sharon A. Caraballo and Eugene Charniak. Determining the specificity of nouns from text. In Proceedings SIGDAT-99, pages 63–70, 1999. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.9574.
- Chali and Hassan [2012] Yllias Chali and Sadid A. Hassan. On the effectiveness of using sentence compression models for query-focused multi-document summarization. In Proceedings of COLING 2012: Technical Papers, 2012. URL http://aclweb.org/anthology//C/C12/C12-1029.pdf.
- Chodorow et al. [1985] Martin S. Chodorow, Roy J. Byrd, and George E. Heidorn. Extracting semantic hierarchies from a large on-line dictionary. In Proceedings of the 23rd annual meeting on Association for Computational Linguistics, ACL ’85, pages 299–304, Stroudsburg, PA, USA, 1985. Association for Computational Linguistics. doi: 10.3115/981210.981247. URL http://dx.doi.org/10.3115/981210.981247.
- Church and Gale [1995] Kenneth Church and William A. Gale. Inverse document frequency (idf): A measure of deviations from poisson. In Proceedings of the Third Workshop on Very Large Corpora, pages 121–130, 1995. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.50.3452.
- Etzioni et al. [2011] Oren Etzioni, Anthony Fader, Janara Christensen, Stephen Soderland, and Mausam Mausam. Open information extraction: the second generation. In Proceedings of the Twenty-Second international joint conference on Artificial Intelligence, IJCAI’11, pages 3–10. AAAI Press, 2011. ISBN 978-1-57735-513-7. doi: 10.5591/978-1-57735-516-8/IJCAI11-012. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-012.
- Frantzi et al. [1998] Katerina T. Frantzi, Sophia Ananiadou, and Junichi Tsujii. The c-value/nc-value method of automatic recognition for multi-word terms. In Research and Advanced Technology for Digital Libraries, volume 1513 of Lecture Notes in Computer Science, pages 585–604. Springer Berlin Heidelberg, 1998. ISBN 978-3-540-65101-7. doi: 10.1007/3-540-49653-X_35. URL http://dx.doi.org/10.1007/3-540-49653-X_35.
- Gupta and Manning [2011] Sonal Gupta and Christopher D. Manning. Analyzing the dynamics of research by extracting key aspects of scientific papers. In Proceedings of IJCNLP, 2011. URL http://130.203.133.150/viewdoc/summary;jsessionid=39F64999229EFFDBB89E2F9963C91068?doi=10.1.1.224.7833.
- He [1999] Qin He. Knowledge discovery through co-word analysis. Library Trends, 48:133–159, 1999. URL https://www.ideals.illinois.edu/bitstream/handle/2142/8267/?sequence=1.
- Hogan [2007] Deirdre Hogan. Empirical measurements of lexical similarity in noun phrase conjuncts. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL ’07, pages 149–152, Stroudsburg, PA, USA, 2007. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1557769.1557813.
- Jones [1972] Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11–21, 1972. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.115.8343.
- Kireyev [2009] Kirill Kireyev. Semantic-based estimation of term informativeness. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL ’09, pages 530–538, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics. ISBN 978-1-932432-41-1. URL http://dl.acm.org/citation.cfm?id=1620754.1620831.
- Louis and Nenkova [2011] Annie Louis and Ani Nenkova. Text specificity and impact on quality of news summaries. In Proceedings of the Workshop on Monolingual Text-To-Text Generation, MTTG ’11, pages 34–42, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics. ISBN 9781937284053. URL http://dl.acm.org/citation.cfm?id=2107679.2107684.
- Navigli and Velardi [2004] Roberto Navigli and Paola Velardi. Learning domain ontologies from document warehouses and dedicated web sites. Comput. Linguist., 30(2):151–179, June 2004. ISSN 0891-2017. doi: 10.1162/089120104323093276. URL http://dx.doi.org/10.1162/089120104323093276.
- Newman et al. [2012] David Newman, Nagendra Koilada, Jey Han Lau, and Timothy Baldwin. Bayesian text segmentation for index term identification and keyphrase extraction. In Proceedings of COLING 2012, pages 2077–2092, Mumbai, India, December 2012. The COLING 2012 Organizing Committee. URL http://www.aclweb.org/anthology/C12-1127.
- Rosenfeld and Feldman [2007] Benjamin Rosenfeld and Ronen Feldman. Clustering for unsupervised relation identification. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, CIKM ’07, pages 411–418, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-803-9. doi: 10.1145/1321440.1321499. URL http://doi.acm.org/10.1145/1321440.1321499.
- Ryu and Choi [2006] Pum-Mo Ryu and Key-Sun Choi. Taxonomy learning using term specificity and similarity. In Proceedings of the 2nd Workshop on Ontology Learning and Population: Bridging the Gap between Text and Knowledge, pages 41–48, Sydney, Australia, July 2006. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W/W06/W06-0506.