Improving Text Normalization by Optimizing Nearest Neighbor Matching


Salman Ahmad Ansari    Usman Zafar    Asim Karim
{15030013, 14030017, akarim}

Text normalization is an essential task in the processing and analysis of social media, which is dominated by informal writing. It aims to map informal words to their intended standard forms. In this paper, we present an automatic optimization-based nearest neighbor matching approach for text normalization. This approach is motivated by the observation that text normalization is essentially a matching problem, and nearest neighbor matching with an adaptive similarity function is the most direct procedure for it. Our similarity function incorporates weighted contributions of contextual, string, and phonetic similarity, and the nearest neighbor matching involves a minimum similarity threshold. These four parameters are tuned efficiently using grid search. Our approach matches each out-of-vocabulary (OOV) word to its intended in-vocabulary (IV) word, rather than following the usual approach of candidate generation and filtering of OOV words for each IV word. We evaluate the performance of our approach on two benchmark datasets. The results demonstrate that parameter tuning on small labeled datasets produces state-of-the-art text normalization performance.


1 Introduction

Social media has proliferated the use of informal writing. In particular, such writing contains numerous spelling variations that do not appear in standard lexicons. Analysis of such informal content produces poor results if the text is not normalized first.

Text normalization is the task of mapping an out-of-vocabulary (OOV, not in lexicon) word to an in-vocabulary (IV, in lexicon) word that best captures its intent in the writing. For example, sum1 and r are typical informal variants of the lexicon entries ‘someone’ and ‘are’, respectively. Two key elements are required in text normalization: (1) an assessment of the relatedness or similarity between OOV and IV words, and (2) a procedure for mapping each OOV word to its intended IV word(s). The similarity between an OOV and an IV word can be quantified by relating the contexts in which these words occur in a corpus, by considering their phonetic similarity, and by determining their string similarity. While these ideas have been utilized in previous work on text normalization, the common practice is to either adopt a single cue or to combine all cues in equal proportions to define the similarity between words (e.g., Sonmez and Ozgur (2014)). Additionally, procedures for finding OOV words similar to an IV word that follow a hierarchical candidate generation and filtering approach (e.g., Han et al. (2013)) involve a number of user-selectable parameters that are difficult to tune for improved performance and often result in poor recall.

In this work, we present a nearest neighbor matching approach for text normalization in which the weighted contributions of contextual, phonetic, and string similarity to the overall similarity, together with the matching threshold, are optimized. The nearest IV neighbors of an OOV word with similarity greater than or equal to the threshold define the mappings. Another contribution is the observation that mapping OOV words to IV words, instead of the traditional method of generating and filtering OOV candidates for IV words, results in a notable increase in precision. This simple procedure produces the best matches for any number of nearest neighbors k. The weights and the threshold are optimized on a sample labeled dataset to improve normalization accuracy. We evaluate our approach on two benchmark text normalization datasets. The results show significant improvement in precision and F-measure even when small labeled datasets are used for optimization.

2 Related Work

Typical works on text normalization have relied primarily upon string and phonetic similarity in a hierarchical candidate generation and filtering procedure for identifying lexical variations of IV words (Elmagarmid et al. (2007); Contractor et al. (2010); Gouws et al. (2011); Han et al. (2013); Ahmed (2015)). For example, Han et al. (2013) present a technique that generates a ‘confusion set’ for IV words by filtering OOV words using edit distance and phonetic measures, followed by ranking based on a tri-gram language model.

Recently, more emphasis has been placed on sophisticated contextual information for text normalization. Levy and Goldberg (2014) evaluate the performance of Word2Vec embeddings on different NLP tasks including word similarity and word analogy. Sridhar (2015) propose a candidate generation and filtering approach that uses word-embeddings to create the confusion set for IV words. Hassan and Menezes (2013) construct a bi-partite graph with context nodes on one side and word nodes on the other. Edges between words and contexts are weighted with contextual information, and Markov random walks are used to discover OOV-IV pairs. Sonmez and Ozgur (2014) present a word association graph that encodes the position of each word with respect to other words. Edges indicate contextual association and their weights are assigned based on co-occurrences of the words. A number of context focused learning procedures are also presented as part of a text normalization challenge (Baldwin et al. (2015)). Our work explores a direct matching approach with an adaptive similarity function combining different notions of similarity for improved text normalization performance.

3 Optimized Nearest Neighbor Matching

Our proposed optimized nearest neighbor approach for text normalization is motivated by two intuitions. First, the relative contributions of different notions of similarity towards the overall similarity between words can differ across languages and contexts (e.g., geographical regions, topics, etc.). Second, nearest neighbor matching of an OOV word to IV word(s) is a direct and optimal matching strategy with few user-tunable parameters. We start the discussion of our approach by formally defining the text normalization problem.

Let I and O be the sets of in-vocabulary (IV) and out-of-vocabulary (OOV) words, respectively. Then, text normalization is the task of matching each OOV word o ∈ O to one or more IV words v ∈ I such that o is an informally spelled variant of v in the context of interest. Text normalization is evaluated using precision, recall, and F-measure computed over a labeled dataset (i.e., data containing the correct mappings).

3.1 Weighted Similarity Function

In order to match OOV words to their most relevant IV words, we need to define the relatedness or similarity between OOV and IV words. In previous work on text normalization, similarity has been defined in terms of contextual, phonetic (sound-based), and string similarity. In this work, we propose a weighted average of all three notions of similarity. Mathematically, the (overall) similarity between o and v is defined as

sim(o, v) = [w_c · sim_c(o, v) + w_p · sim_p(o, v) + w_s · sim_s(o, v)] / (w_c + w_p + w_s)

where sim_c, sim_p, and sim_s respectively give the contextual, phonetic, and string similarity between the words, and w_c, w_p, and w_s are the corresponding weights. Without loss of generality, we assume all similarities and weights lie in the interval [0, 1]. This ensures that the overall similarity also lies in the interval [0, 1], with higher values signifying greater similarity between the words. If a similarity is not defined (e.g., contextual similarity is not known), then the corresponding weight is set to zero in the computation.
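As a minimal sketch (our own illustration, not the authors' code), the weighted combination can be computed as follows, with the three component similarities passed in as pre-computed values:

```python
def overall_similarity(sim_c, sim_p, sim_s, w_c, w_p, w_s):
    """Weighted average of contextual, phonetic, and string similarity.

    A component whose similarity is undefined should be passed with a
    zero weight, so the average is taken over the defined components only.
    """
    total = w_c + w_p + w_s
    if total == 0.0:
        return 0.0
    return (w_c * sim_c + w_p * sim_p + w_s * sim_s) / total
```

Dividing by the sum of the weights keeps the result in [0, 1] whenever the component similarities are.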

The contextual similarity between words o and v quantifies the relatedness of these words based on their contextual usage. A commonly-used representation of words based on their contextual usage in a corpus is provided by Word2Vec (Mikolov et al. (2013)). Let u_o and u_v be the learned vector representations of words o and v, respectively. Then, their contextual similarity is defined as

sim_c(o, v) = (u_o · u_v) / (‖u_o‖ ‖u_v‖)

This is the cosine similarity between the two vectors, and it lies in the interval [−1, 1].
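The cosine computation itself is straightforward; the sketch below is our own stdlib-only illustration, independent of any particular embedding library:

```python
from math import sqrt

def contextual_similarity(u_o, u_v):
    """Cosine similarity between two word-embedding vectors,
    given as equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(u_o, u_v))
    norm_o = sqrt(sum(a * a for a in u_o))
    norm_v = sqrt(sum(b * b for b in u_v))
    return dot / (norm_o * norm_v)
```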

The phonetic similarity between words o and v measures the degree of similarity in the pronunciation of the words. Different sound-based encoding schemes are available for different languages. For the English language, Double Metaphone has been shown to be accurate (Philips (2000)). However, instead of requiring the encodings of the two words to match exactly, we calculate the string similarity (see below) between the encodings. Hence, based on this encoding scheme, the phonetic similarity varies between 1, when an encoding of one word matches an encoding of the other exactly, and 0, when no match is found.
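This step can be sketched as follows. The phonetic encoder is left as a caller-supplied function (no specific Double Metaphone library is assumed here), and `difflib`'s sequence ratio serves as a stand-in for the paper's string-similarity measure:

```python
from difflib import SequenceMatcher

def phonetic_similarity(word_a, word_b, encode):
    """Similarity between the phonetic encodings of two words.

    `encode` maps a word to its phonetic code (e.g., a Double Metaphone
    implementation). The encodings are compared with a string-similarity
    ratio rather than tested for exact equality.
    """
    return SequenceMatcher(None, encode(word_a), encode(word_b)).ratio()
```

For example, with a toy stand-in encoder that strips vowels, `phonetic_similarity("are", "ARE", enc)` yields 1.0 because both words encode identically.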

The string similarity between words o and v quantifies how similar the words are as character sequences. A number of string similarity measures are available. In this work, we adopt a normalized longest common subsequence measure defined as (Yujian and Bo (2007))

sim_s(o, v) = LCS(o, v) / [max(|o|, |v|) · ED(o, v)]

Here, LCS(o, v) denotes the length of the longest common subsequence of the words, |·| gives the character-length of a word, and ED(o, v) gives the Levenshtein distance between the words. Since ED(o, v) ≥ 1 for distinct words, this similarity also lies in the interval [0, 1].
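Both quantities can be computed with standard dynamic programming. The sketch below follows our reading of the measure (the exact normalization in the original formula may differ); identical strings are treated as maximally similar:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def levenshtein(a, b):
    """Levenshtein (edit) distance, via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[len(b)]

def string_similarity(a, b):
    """LCS length normalized by word length and edit distance."""
    ed = levenshtein(a, b)
    if ed == 0:  # identical strings
        return 1.0
    return lcs_length(a, b) / (max(len(a), len(b)) * ed)
```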

Figure 1: Illustration of nearest neighbor matching as bi-partite graph matching

3.2 Nearest Neighbor Matching

Text normalization is essentially a matching problem. For one-to-one or one-to-many matching, the nearest neighbors approach is the most direct and appropriate. In the standard procedure, each OOV word o is matched with its most similar IV word, i.e., the v ∈ I for which sim(o, v) is a maximum. In addition to this standard procedure, we introduce a minimum similarity threshold t such that a match only occurs when the maximum similarity is greater than or equal to t. In general, each OOV word can be matched with k IV words, representing the top-k most similar IV words of the OOV word.

Figure 1 illustrates nearest neighbor matching as a bi-partite graph matching problem. One set of nodes in the graph represents OOV words and the other set represents IV words. An edge exists between an OOV word and an IV word if their similarity is at least t and it is among the top-k highest similarities of the OOV word. The figure shows edges that satisfy these conditions. It is possible that an OOV word does not have any edges, indicating that its similarity to all IV words is below the threshold t; such words will not be matched.

It is easy to see that given a similarity function, nearest neighbor matching (or bi-partite graph matching) will yield an optimal matching in the sense that the sum of similarities of matched OOV words is a maximum.
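The matching step can be sketched as follows (our illustration; `similarity` stands for the overall similarity function, and the parameter names mirror the threshold t and neighbor count k above):

```python
def nearest_neighbor_match(oov_words, iv_words, similarity, t, k=1):
    """Match each OOV word to its top-k most similar IV words,
    keeping only matches whose similarity is at least the threshold t.
    OOV words with no qualifying IV word are left unmatched."""
    matches = {}
    for o in oov_words:
        scored = sorted(((similarity(o, v), v) for v in iv_words), reverse=True)
        kept = [v for s, v in scored[:k] if s >= t]
        if kept:
            matches[o] = kept
    return matches
```

Since each OOV word independently takes its highest-similarity IV words, the sum of matched similarities is maximized, as noted above.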

3.3 Optimizing the Parameters

The weights in the overall similarity function control the relative contribution of each notion of similarity towards the overall similarity between words. Since contextual, phonetic, and string similarity are computed independently of the text normalization problem, it is expected that tuning these weights will improve text normalization performance. Similarly, the threshold t controls the matching of OOV and IV words.

We propose an optimization approach for automatically tuning these parameters (w_c, w_p, w_s, t). Assuming a labeled dataset is available for training, the optimal parameter values can be found by grid search in parameter space (Bergstra and Bengio (2012)). This is computationally efficient since only four grid dimensions over [0, 1] need to be searched. For example, if steps of 0.1 are taken, then on the order of 10,000 evaluation iterations are required. Note that each evaluation simply involves nearest neighbor matching and computation of text normalization performance. The grid search, which can be repeated hierarchically for refined estimates, yields optimal parameter values (w_c*, w_p*, w_s*, t*) to be used for future text normalization.
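The grid search can be sketched as follows (a minimal illustration; `evaluate` stands for the normalization performance measured on the labeled training sample):

```python
import itertools

def grid_search(evaluate, step=0.1):
    """Exhaustive grid search over (w_c, w_p, w_s, t) in [0, 1].

    `evaluate` returns the normalization performance (e.g., F-measure)
    for a parameter tuple; the best-scoring tuple is returned.
    """
    # Grid points 0.0, step, 2*step, ..., 1.0 (rounded to avoid float drift).
    grid = [round(i * step, 10) for i in range(int(1 / step) + 1)]
    best_score, best_params = float("-inf"), None
    for w_c, w_p, w_s, t in itertools.product(grid, repeat=4):
        score = evaluate(w_c, w_p, w_s, t)
        if score > best_score:
            best_score, best_params = score, (w_c, w_p, w_s, t)
    return best_params, best_score
```

With step 0.1 this is 11^4 ≈ 1.5 × 10^4 evaluations, each a cheap matching-and-scoring pass, which matches the order of magnitude discussed above.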

4 Experimental Evaluation

4.1 Datasets and Evaluation Settings

We use a Twitter dataset with over 4.5 million tweets for learning the contextual representations of words. Based on Python’s Enchant English dictionary, this dataset has 36,071 distinct IV and 46,480 distinct OOV words. For evaluating our approach, we use two publicly available labeled datasets for text normalization: the Han and Baldwin (2011) (Han) and Liu et al. (2011) (Yang) datasets. The Han dataset contains 807 IV and 975 OOV words, of which 268 IV and 386 OOV words overlap with our Twitter dataset. The Yang dataset contains 1,975 IV and 3,937 OOV words, of which 1,030 IV and 2,141 OOV words overlap with our Twitter dataset. Our evaluation, however, is done over all the words in the datasets, since we ignore contextual similarity (i.e., set w_c = 0) in the overall similarity of word pairs whose contextual similarity is not defined. These datasets provide only one mapping per OOV word; hence, we perform top-1 nearest neighbor matching.

We adopt standard Precision (Pre), Recall (Rec), and F-measure (Fme) to evaluate the performance of text normalization (Sonmez and Ozgur (2014)).
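Under these standard definitions (with one gold mapping per OOV word), the metrics can be computed as in this sketch of our own:

```python
def evaluate_normalization(predicted, gold):
    """Precision, recall, and F-measure for OOV -> IV mappings.

    `predicted` and `gold` each map an OOV word to an IV word.
    Precision is computed over the words the system matched;
    recall over all words that have a gold mapping.
    """
    correct = sum(1 for o, v in predicted.items() if gold.get(o) == v)
    pre = correct / len(predicted) if predicted else 0.0
    rec = correct / len(gold) if gold else 0.0
    fme = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return pre, rec, fme
```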

Figure 2: Performance with varying threshold t (weights fixed at the best un-optimized combination) on Han dataset
Exp    Pre     Rec     Fme
Han dataset
1      90.2    89.2    89.7
2      87.0    87.0    87.0
3      89.3    89.0    89.2
Yang dataset
1      76.1    76.0    76.0
2      74.2    76.2    75.2
3      76.6    74.6    75.6
Table 1: Performance of our approach on the Han and Yang datasets (see text for description of the experiments)

4.2 Results

We start by presenting the performance trend of our approach with varying values of the threshold t while the weights are not optimized. Figure 2 shows this trend for the Han dataset with the weights fixed at the best un-optimized combination. It is seen that precision and recall remain high at low threshold values, but recall starts decreasing rapidly at higher threshold values. The highest F-measure of 87.4% is obtained at the best threshold value.

Subsequently, we conduct three different experiments to evaluate our parameter-optimized approach. Experiment 1 reports average parameter values and performance over the entire dataset via 2-fold cross-validation. Experiment 2 reports average parameter values and average performance over 5 randomized runs, testing on 80% of the dataset after optimizing on the remaining 20%. Experiment 3 reports parameter values and performance on one dataset when the parameters are optimized on the other dataset (e.g., performance on the Han dataset after learning on the Yang dataset). In Experiments 1 and 2, there is no overlap between words in the training and test samples.

Table 1 gives the results of these experiments. The highest F-measures of 89.7% and 76.0% on the Han and Yang datasets, respectively, are obtained in Experiment 1. Even when a small sample of 20% of the data is used for tuning the parameters (Exp 2), our approach produces F-measures of 87.0% and 75.2% on the Han and Yang datasets, respectively. These results compare favorably with the previous best F-measures of 82.3% on the Han dataset (Sonmez and Ozgur (2014)) and 73% on the Yang dataset (Yang and Eisenstein (2013)). A review of the optimal parameters reveals that contextual and string similarity play dominant roles in text normalization, while phonetic similarity is less significant. These results confirm that our approach is not only simpler but also more accurate when contextual information of words and a small labeled dataset are available.

5 Concluding Remarks

We have presented and evaluated an efficient optimized nearest neighbor (NN) matching approach for improved text normalization. Results on two benchmark datasets demonstrate that the weights in the similarity function and the threshold in NN matching can be tuned on small labeled samples to yield state-of-the-art performance.


  • Ahmed (2015) B. Ahmed. 2015. Lexical normalisation of twitter data. In Science and Information Conference (SAI), 2015. IEEE, pages 326–328.
  • Baldwin et al. (2015) T. Baldwin, M. De Marneffe, B. Han, Y. Kim, A. Ritter, and W. Xu. 2015. Shared tasks of the 2015 workshop on noisy user-generated text: Twitter lexical normalization and named entity recognition. In Proceedings of the Workshop on Noisy User-generated Text.
  • Bergstra and Bengio (2012) J. Bergstra and Y. Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research 13(Feb):281–305.
  • Contractor et al. (2010) D. Contractor, T. Faruquie, and L. Subramaniam. 2010. Unsupervised cleansing of noisy text. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters. ACL, pages 189–196.
  • Elmagarmid et al. (2007) A. Elmagarmid, P. Ipeirotis, and V. Verykios. 2007. Duplicate record detection: A survey. Knowledge and Data Engineering, IEEE Transactions on 19(1):1–16.
  • Gouws et al. (2011) S. Gouws, D. Hovy, and D. Metzler. 2011. Unsupervised mining of lexical variants from noisy text. In Proceedings of the First workshop on Unsupervised Learning in NLP. ACL, pages 82–90.
  • Han and Baldwin (2011) B. Han and T. Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the ACL. ACL, pages 368–378.
  • Han et al. (2013) B. Han, P. Cook, and T. Baldwin. 2013. Lexical normalization for social media text. ACM Transactions on Intelligent Systems and Technology (TIST) 4(1):5.
  • Hassan and Menezes (2013) H. Hassan and A. Menezes. 2013. Social text normalization using contextual graph random walks. In Proceedings of ACL. ACL, pages 1577–1586.
  • Levy and Goldberg (2014) O. Levy and Y. Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems. pages 2177–2185.
  • Liu et al. (2011) F. Liu, F. Weng, B. Wang, and Y. Liu. 2011. Insertion, deletion, or substitution?: Normalizing text messages without pre-categorization nor supervision. In Proceedings of ACL. ACL, pages 71–76.
  • Mikolov et al. (2013) T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 .
  • Philips (2000) L. Philips. 2000. The double metaphone search algorithm. C/C++ Users Journal 18(6):38–43.
  • Sonmez and Ozgur (2014) C. Sonmez and A. Ozgur. 2014. A graph-based approach for contextual text normalization. In Proceedings of EMNLP. ACL, pages 313–324.
  • Sridhar (2015) V. Sridhar. 2015. Unsupervised text normalization using distributed representations of words and phrases. In Proceedings of NAACL-HLT. ACL, pages 8–16.
  • Yang and Eisenstein (2013) Y. Yang and J. Eisenstein. 2013. A log-linear model for unsupervised text normalization. In Proceedings of EMNLP. ACL, pages 61–72.
  • Yujian and Bo (2007) L. Yujian and L. Bo. 2007. A normalized Levenshtein distance metric. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(6):1091–1095.