# How to Evaluate Word Representations of Informal Domain?

Yekun Chai
University of Edinburgh
chaiyekun@gmail.com
Naomi Saphra
University of Edinburgh
n.saphra@ed.ac.uk
University of Edinburgh
alopez@inf.ed.ac.uk
###### Abstract

Diverse word representations are now used in most state-of-the-art natural language processing (NLP) applications. Nevertheless, how to efficiently evaluate such word embeddings in the informal domain, such as Twitter or forums, remains an ongoing challenge due to the lack of sufficient evaluation datasets. We derived a large list of variant spelling pairs from UrbanDictionary using two automatic approaches: weakly-supervised pattern-based bootstrapping and a self-training linear-chain conditional random field (CRF). With these extracted relation pairs, we make the case for skipping the text normalization step of traditional NLP pipelines and directly adopting representations of non-standard words in the informal domain. Our code is available.

## 1 Introduction

Distributional word representations denote natural language words with low-dimensional real-valued vectors that implicitly encode the syntactic or semantic statistics of the corresponding words. They are used extensively in current NLP systems, e.g. text classification (Kim, 2014; Yang et al., 2016; Joulin et al., 2016; Xiang et al., 2019), named entity recognition (Turian et al., 2010; Huang et al., 2015; Akbik et al., 2018, 2019), etc. Among these, CBOW (Mikolov et al., 2013a), Skip-gram with Negative Sampling (SGNS) (Mikolov et al., 2013b), GloVe (Pennington et al., 2014) and FastText (Bojanowski et al., 2017) are the most well-known, and they are the ones we experiment on.

Despite the prevalence of word representations, only a minority of them can be directly applied to raw informal text without preprocessing. Some word representations have been proposed specifically for this domain (Tang et al., 2014; Benton et al., 2016; Dhingra et al., 2016; Vosoughi et al., 2016). Conventional NLP systems usually apply text normalization to Twitter data at the preprocessing stage (Jiang et al., 2011; Singh and Kumari, 2016; Arora and Kansal, 2019), converting non-standard words to their standard forms. However, normalization may shift the original meaning of the raw sentences to some extent (Eisenstein, 2013). Eliminating spelling variation may also discard the information about the writer’s persona, behavioral characteristics and demographics that the original spelling conveys (Saphra and Lopez, 2016; Benton et al., 2016).

Therefore, we collected a variant spelling dataset for use in NLP research in the informal domain. Our key contributions are:

• to collect around 25k variant spelling pairs from UrbanDictionary;

• to employ weakly-supervised pattern-based bootstrapping and linear-chain CRF with self-training methods for extraction. Our results outperform the lexico-syntactic surface rule-based baseline method;

• to pretrain the word embeddings of CBOW, SGNS, GloVe and FastText on our cleaned English tweets and intrinsically evaluate them by measuring the cosine similarity using our variant spelling pairs;

• to evaluate a Twitter hashtag prediction downstream task with the above four kinds of word embeddings and analyze its performance and its correlation with this intrinsic metric;

• to develop an online tool for searching informal word variant spelling.

## 2 Related Work

The word analogy task and the word similarity task are the staple benchmarks for evaluating the quality of distributed word representations.

To evaluate the quality of word representations, the word analogy task can serve as an intrinsic metric that measures the distance between analogical word pairs in the multi-dimensional word representation space (Mikolov et al., 2013a, b; Levy et al., 2015). It requires an analogical word pair dataset that contains different semantic or syntactic relations, such as (“UK”, “London”) and (“King”, “Queen”). The MSR analogy dataset (Mikolov et al., 2013c) was proposed for this purpose. Four examples are shown in Table 1. The word analogy task measures the relation “a is to ã as b is to b̃”. Taking the capital city relation for instance, the word vectors are expected to satisfy: vec(“London”) - vec(“UK”) + vec(“France”) ≈ vec(“Paris”).

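Let V denote the vocabulary of the word representations. Given the word pair (a, ã) and the relation (b, b̃), the word analogy task counts a prediction as correct only if the closest candidate word â is identical to the word ã:

$$\hat{a} = \operatorname*{arg\,max}_{a' \in V \setminus \{\tilde{b},\, b,\, a\}} \cos(a',\, \tilde{b} - b + a) = \operatorname*{arg\,max}_{a' \in V \setminus \{\tilde{b},\, b,\, a\}} \bigl(\cos(a', \tilde{b}) - \cos(a', b) + \cos(a', a)\bigr)$$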
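The analogy lookup can be sketched in a few lines of plain Python. This is a minimal illustration rather than the evaluation code used in the paper: the function names `cosine` and `solve_analogy` and the three-dimensional toy vectors are assumptions made for the example, and the sketch implements the first form of the formula (cosine against the offset vector).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def solve_analogy(emb, a, b, b_tilde):
    """Return the word whose vector is closest to b_tilde - b + a."""
    target = [bt - bb + aa for bt, bb, aa in zip(emb[b_tilde], emb[b], emb[a])]
    # The three query words are excluded from the candidates, as in the formula.
    candidates = [w for w in emb if w not in (a, b, b_tilde)]
    return max(candidates, key=lambda w: cosine(emb[w], target))


# Hand-made toy vectors, for illustration only (not trained embeddings).
toy = {
    "UK": [1.0, 0.0, 0.0],
    "London": [1.0, 1.0, 0.0],
    "France": [0.0, 0.0, 1.0],
    "Paris": [0.0, 1.0, 1.0],
    "banana": [-1.0, 0.0, 0.5],
}
predicted = solve_analogy(toy, a="France", b="UK", b_tilde="London")
print(predicted)
```

The prediction is counted as correct when it equals the held-out word ã, here “Paris”.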

In addition, word similarity tasks provide another intrinsic test metric for word embeddings (Pennington et al., 2014; Levy et al., 2015). Such datasets include WordSim353 (Finkelstein et al., 2002), MC (Miller and Charles, 1991), RG (Rubenstein and Goodenough, 1965), MEN (Bruni et al., 2012), Mechanical Turk (Radinsky et al., 2011), SCWS (Huang et al., 2012), Rare Words (Luong et al., 2013) and SimLex-999 (Hill et al., 2015). Similarly, given the synonym pair (a, ã), we can get the nearest candidate word as:

$$\hat{a} = \operatorname*{arg\,max}_{a' \in V \setminus \{a\}} \cos(a', a)$$

The word representation is considered to capture the relationship between a and ã only when the word vector vec(ã) is the closest to vec(a) in the whole vocabulary, i.e. â = ã.

The aforementioned word pair evaluation datasets all target formal domains, and few works provide word pair datasets for informal domains. Saphra and Lopez (2015) extracted a small variant spelling dataset from UrbanDictionary using a lexico-syntactic surface rule-based method with regular expressions. Such manually summarized rules cover only a limited variety of surface expressions, which might bias the collected pairs despite the high precision.

Building on this, we employ semi-supervised approaches for variant spelling dataset extraction and empirically illustrate their efficacy.

## 3 Preliminaries

We disentangle the informal spelling variant extraction task and clarify the definitions involved in this task.

##### UrbanDictionary

UrbanDictionary is a large-scale crowd-sourced online slang dictionary, which records newly emerging slang words and phrases and their meanings. It contains not only informal words or phrases and their definitions but also definition tags, term editors and update times. Table 2 presents the items contained in an example entry of UrbanDictionary. UrbanDictionary holds promise as a collaborative NLP resource for the informal domain, such as Twitter and social media forums (Nguyen et al., 2018). It has also been used for the Twitter text normalization task (Beckley, 2015) and for generating explanations of unseen non-standard English expressions (Ni and Wang, 2017). Meanwhile, it is largely independent of word representations trained on an informal text corpus such as tweets, which supports the fairness of the collected dataset.

##### Spelling variant detection

As shown in Table 1, the word pair (“m8”,“mate”) can be regarded as spelling variants. For each spelling variant pair, the first instance belongs to informal words and the second one is a formal word. The spelling variant detection task is to extract the word pairs that possess the relations of spelling variants.

This task can be defined as follows. We are given a corpus of dictionary terms, which consists of word entries, their corresponding definitions, and the expected relationship (here, spelling variant). Each definition is a sequence of tokens. The purpose is to find the variant spelling tuple formed by a word entry and its variant. Finally, we obtain a large set of variant spelling tuples.

It is assumed that words with similar meanings lie closer together in the distributed word vector space. The aforementioned word analogy and word similarity tasks measure the cosine similarity between the given relation pairs. Likewise, taking the cosine distance between spelling variant pairs allows us to evaluate word representations pretrained on informal texts (Saphra and Lopez, 2016). Taking the pair (“m8”, “mate”) for instance, the pair score is 1 if vec(“mate”) ranks within the top $k$ closest word embeddings to vec(“m8”), and 0 otherwise. The overall performance is then the mean average precision (MAP) over all pairs. We term this the spelling variant similarity task.
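The pair scoring just described can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's evaluation script: the helper names `top_k_hit` and `variant_similarity_score`, the two-dimensional toy vectors, and the choice to skip pairs with out-of-vocabulary words are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def top_k_hit(emb, informal, formal, k):
    """Return 1 if `formal` is among the k nearest neighbours of `informal`, else 0."""
    query = emb[informal]
    ranked = sorted(
        (w for w in emb if w != informal),
        key=lambda w: cosine(emb[w], query),
        reverse=True,
    )
    return int(formal in ranked[:k])


def variant_similarity_score(emb, pairs, k):
    """Mean of the 0/1 pair scores; out-of-vocabulary pairs are skipped (an assumption)."""
    scores = [top_k_hit(emb, i, f, k) for i, f in pairs if i in emb and f in emb]
    return sum(scores) / len(scores) if scores else 0.0


# Hand-made toy vectors, for illustration only.
toy = {
    "m8": [1.0, 0.1], "mate": [0.9, 0.2],
    "u": [0.0, 1.0], "you": [0.1, 0.9],
    "gr8": [-1.0, 0.0], "great": [0.5, 0.5],
}
pairs = [("m8", "mate"), ("u", "you"), ("gr8", "great")]
score_at_1 = variant_similarity_score(toy, pairs, k=1)
score_at_3 = variant_similarity_score(toy, pairs, k=3)
```

In this toy setting the first two pairs are hits at k = 1, while the third only becomes a hit once k is raised to 3, which is the kind of sensitivity to k that the task is meant to expose.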

## 4 Approaches

### 4.1 Baseline

Target spelling variants in UrbanDictionary are typically accompanied by certain definition expression patterns, which can be used as extraction rules. The most direct approach to spelling variant detection is to summarize lexico-syntactic surface rules and employ regular expression (RE) techniques for extraction.

The matched definition expression patterns indicate the variant word, which we denote as [Y]. For instance,

• the variation of [Y]

• another word for [Y]

• another way of saying [Y]

• the incorrect spelling of [Y]

Accordingly we can manually create a large number of RE rules, as in Table LABEL:tb:RE. Taking the pattern way of saying [Y] for example, the RE pattern can be written as:

 way of saying \"(?P[\w’]+)\"
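As a hedged illustration of how such a rule can be applied, the sketch below compiles the pattern with Python's `re` module. The name of the capture group did not survive in the text above, so `variant` is a hypothetical name chosen for this example, and `extract_variant` is likewise an illustrative helper.

```python
import re

# "variant" is a hypothetical group name; the original name is not recoverable here.
WAY_OF_SAYING = re.compile(r"way of saying \"(?P<variant>[\w'’]+)\"")


def extract_variant(definition):
    """Return the quoted word that follows 'way of saying', or None if no match."""
    match = WAY_OF_SAYING.search(definition)
    return match.group("variant") if match else None


hit = extract_variant('Scottish way of saying "yes"')
miss = extract_variant("a friendly greeting")
```

A word entry paired with a non-empty result gives one candidate tuple, for instance (“aye”, “yes”) for the first definition.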

### 4.2 Weakly-supervised pattern-based bootstrapping

Figure 1 presents the semi-supervised pattern-based bootstrapping method, which iteratively detects the target spelling variant pairs from an unannotated corpus given a small set of seeds. A tuple pool is used to label the unlabeled data and to support pattern generation; it stores the initial seeds and the candidate pairs extracted with high confidence in each iteration. Similarly, a pattern pool incrementally accumulates the matched relation patterns with sufficient certainty, against which candidate tuples are then matched.

#### 4.2.1 Bootstrapping process

Algorithm 1 illustrates the procedure of our bootstrapping algorithm. First, we initialize the tuple pool with a small set of seeds and label the occurrences of the tuple-pool instances in the unlabeled data. Next, we generate candidate patterns from the context of these occurrences and pass them to a pattern scorer. The patterns that pass the filter are appended to the pattern pool and used in turn to search for new tuples. These candidate tuples are merged into the tuple pool after passing through a candidate tuple scorer. Finally, the process returns to the labeling step and repeats until the maximum number of iterations is reached.

##### Pattern generation

When generating candidate patterns, we use the contextual words within a fixed context window on both sides of the target. Given the definition “Scottish way of saying yes” and the spelling variant tuple (“aye”, “yes”), for example, the generated candidate pattern is way of saying<\w>, which keeps the three words preceding the target.
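A minimal sketch of this step, assuming whitespace-tokenized definitions and using the [Y] placeholder from Section 4.1 in place of the target slot; the function name `generate_pattern` and its signature are assumptions made for the example.

```python
def generate_pattern(definition_tokens, target, window):
    """Build a surface pattern from up to `window` tokens on each side of `target`."""
    if target not in definition_tokens:
        return None
    idx = definition_tokens.index(target)
    left = definition_tokens[max(0, idx - window):idx]
    right = definition_tokens[idx + 1:idx + 1 + window]
    # The target itself is replaced by the placeholder slot.
    return " ".join(left + ["[Y]"] + right)


tokens = "Scottish way of saying yes".split()
pattern = generate_pattern(tokens, target="yes", window=3)
print(pattern)
```

With a window of three tokens, “Scottish” falls outside the left context and the right context is empty, which reproduces the pattern in the example above.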

##### Pattern scorer

A well-defined pattern scorer plays a crucial role in bootstrapping methods, because the candidate patterns generated in the previous step can suffer from two main problems (Gupta and Manning, 2014a):

• a badly behaved pattern is over-confidently assigned a high score;

• a well-performing pattern is conservatively assigned low confidence.

The ideal patterns are expected to strike a balance between accuracy and coverage. The candidate patterns of one iteration can greatly affect the following steps: harmful patterns can match bad tuples, and such tuples may be regarded as reliable seeds from which new patterns are generated. This effect can propagate from iteration to iteration and finally ruin the system performance.

To avoid these problems and better evaluate the confidence of patterns, the RlogF scoring metric (Riloff, 1996) is employed in our system. The score of the $i$-th pattern is defined as:

$$\mathrm{score}(\mathrm{pattern}_i) = \frac{F_i}{N_i} \log_2(F_i)$$

where $F_i$ denotes the count of unique positive variant entities in the tuple pool that the $i$-th pattern matches, and $N_i$ represents the total count of entities that the $i$-th pattern can extract. This scoring metric attends to both accuracy and coverage. The factor $F_i / N_i$ is high when the variant pairs extracted by the $i$-th pattern have good coverage of, and correlation with, the reliable tuples in the tuple pool (Carlson et al., 2010). Meanwhile, the factor $\log_2(F_i)$ indicates the coverage that the $i$-th candidate pattern can achieve. To be conservative, we set a high predefined threshold on the pattern score, so that only the most reliable patterns are retained at each iteration.
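The RlogF score is simple enough to state directly in code. The sketch below is illustrative: the function name `rlogf` is an assumption, and returning 0.0 when a pattern matches no positive entity (where the logarithm is undefined) is a convention chosen for this example rather than something stated in the text.

```python
import math


def rlogf(num_positive, num_extracted):
    """RlogF pattern score: (F / N) * log2(F).

    num_positive  -- F, unique positive entities in the tuple pool matched by the pattern
    num_extracted -- N, total number of entities the pattern extracts
    """
    if num_positive <= 0 or num_extracted <= 0:
        return 0.0  # assumption: log2(0) is undefined, so such patterns score zero
    return (num_positive / num_extracted) * math.log2(num_positive)


precise_but_rare = rlogf(1, 1)      # perfect precision, but only one positive match
broad_and_precise = rlogf(8, 10)    # 0.8 * log2(8)
```

A pattern that matches a single positive entity scores zero no matter how precise it is, which shows how the logarithmic factor rewards coverage.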

##### Tuple scorer

The quality of the extracted tuples also has a large influence on the final results. A small set of wrong tuples may lead to semantic drift, a vicious cycle in the bootstrapping iterations. In other words, wrongly extracted tuples can produce problematic patterns, and in turn these noisy patterns can match a large number of harmful candidate pairs.

Hence we calculate the confidence for the $i$-th candidate in the tuple pool as:

$$\mathrm{score}(\mathrm{tuple}_i) = \frac{\sum_{j=1}^{P_i} \log_2(F_j + 1)}{P_i}$$

where $P_i$ is the number of patterns that are able to extract tuple $i$, and $F_j$ is the number of unique tuples in the tuple pool that can be extracted by pattern $j$.

Intuitively, candidate tuples that are matched by multiple patterns are regarded as more reliable than those matched by only one pattern (Carlson et al., 2010). If tuple $i$ can be extracted by a large number of patterns, it is comparatively more trustworthy than others.

The aforementioned tuple scorer only considers the relevance between positive tuples and reliable patterns, and ignores the occurrence count of candidate pairs. We assume that if a candidate pair occurs multiple times, it is more likely to be a target tuple. Hence we define a variant called RlogF with tuple count:

$$\mathrm{score}(\mathrm{tuple}_i) = \frac{\sum_{j=1}^{P_i} \log_2(F_j + 1)}{P_i} \, \log_2 |\mathrm{tuple}_i|$$

where $|\mathrm{tuple}_i|$ is the occurrence count of tuple $i$ in the training data.
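Both tuple scores can be sketched as below. The function names and the list-based input (one pattern count per pattern that extracts the tuple) are assumptions made for illustration; the arithmetic follows the two formulas above.

```python
import math


def tuple_score(pattern_counts):
    """Average of log2(F_j + 1) over the patterns that extract the tuple.

    pattern_counts -- for each extracting pattern j, the number F_j of unique
                      tuples in the tuple pool that pattern j can extract
    """
    if not pattern_counts:
        return 0.0  # assumption: a tuple matched by no pattern gets zero confidence
    return sum(math.log2(f + 1) for f in pattern_counts) / len(pattern_counts)


def tuple_score_with_count(pattern_counts, occurrences):
    """Variant that multiplies the score by log2 of the tuple's occurrence count."""
    return tuple_score(pattern_counts) * math.log2(occurrences)


base = tuple_score([3, 7])                       # (log2(4) + log2(8)) / 2
weighted = tuple_score_with_count([3, 7], 4)     # base * log2(4)
seen_once = tuple_score_with_count([3, 7], 1)    # log2(1) = 0
```

One consequence worth noting is that a tuple seen only once receives a score of zero under the count variant, regardless of how many reliable patterns extract it.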

Finally, we keep only the top-ranked, most confident tuples and remove the less reliable ones.

##### Target word constraint

Some extraction tasks, such as named entity recognition (NER), can use an existing toolkit for part-of-speech (POS) tagging ahead of time, so that the bootstrapping system can easily find the chunk boundary of each candidate. For data that has not been tagged in advance, however, the correct chunk boundaries are hard to detect. Taking the definition “it’s a way of saying someone is really loud” for example, the surface pattern a way of saying matches the wrongly extracted word [someone] instead of the clause [someone is really loud].

Inspired by SPIED (Gupta and Manning, 2014b), which adds POS tag constraints on the extracted target entities, we adopt a stopword constraint to filter out wrongly extracted entities: a candidate is discarded if the extracted entity is a stopword such as “the” or “something”. In addition, the normalized Levenshtein distance between the two instances of an extracted candidate tuple is used to measure the similarity of the formal and informal words. The normalized Levenshtein distance is the Levenshtein distance divided by the overall length of the word pair. Only word pairs whose distance is lower than a threshold are retained.
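The two filters can be sketched as follows. Several details here are assumptions made for illustration: “overall length of the word pair” is read as the sum of the two word lengths, the stopword list is a tiny hand-picked set, and the default threshold of 0.5 is a placeholder rather than the value used in the experiments.

```python
def levenshtein(s, t):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_levenshtein(s, t):
    """Edit distance divided by the summed length of both words (an assumption)."""
    total = len(s) + len(t)
    return levenshtein(s, t) / total if total else 0.0


# Illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "it", "something", "someone"}


def keep_pair(informal, formal, threshold=0.5):
    """Apply the stopword constraint, then the distance threshold."""
    if formal in STOPWORDS:
        return False
    return normalized_levenshtein(informal, formal) < threshold


kept = keep_pair("nite", "night")
dropped = keep_pair("loud", "someone")
```

The second pair mirrors the “someone is really loud” example: the extracted entity is a stopword, so the candidate is rejected before any distance is computed.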

### 4.3 Self-training based linear chain CRF tagging

The linear-chain CRF is often applied to sequence tagging in a supervised way. With a self-training strategy, we iteratively retrain the supervised CRF system using the unlabeled data.

As illustrated in Algorithm 2, we first use the linear-chain CRF model trained on a small set of labeled data to predict labels for a large amount of unlabeled data, and add the most confident samples to the initial gold-labeled data. The next iteration is then run on the updated labeled data, and this repeats until the stopping criterion is satisfied.
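The self-training loop itself is independent of the underlying model, so it can be sketched with the model passed in as two callables. Everything model-specific below is an assumption for illustration: a one-dimensional nearest-centroid classifier stands in for the linear-chain CRF, and the names `self_train`, `train_centroids` and `predict_nearest` are hypothetical.

```python
def self_train(labeled, unlabeled, train, predict, threshold, max_iter):
    """Generic self-training: move confidently predicted samples into the labeled set."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    model = train(labeled)
    for _ in range(max_iter):
        confident, remaining = [], []
        for x in unlabeled:
            label, confidence = predict(model, x)
            if confidence >= threshold:
                confident.append((x, label))
            else:
                remaining.append(x)
        if not confident:  # stop early when nothing new can be added
            break
        labeled.extend(confident)
        unlabeled = remaining
        model = train(labeled)
    return model, labeled


def train_centroids(labeled):
    """Toy stand-in model: the mean value of each class."""
    sums, counts = {}, {}
    for x, label in labeled:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict_nearest(model, x):
    """Predict the nearest centroid, with a confidence between 0.5 and 1."""
    ranked = sorted(model, key=lambda label: abs(x - model[label]))
    near, far = abs(x - model[ranked[0]]), abs(x - model[ranked[-1]])
    confidence = 1.0 if near + far == 0 else far / (near + far)
    return ranked[0], confidence


seed = [(0.0, "O"), (10.0, "I")]
model, final = self_train(seed, [1.0, 9.0, 5.2], train_centroids,
                          predict_nearest, threshold=0.8, max_iter=5)
```

In the toy run the two clear-cut samples are absorbed in the first iteration, while the ambiguous one never reaches the confidence threshold and stays unlabeled.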

##### Feature engineering

Feature engineering is extremely important for most statistical machine learning systems. Unlike the aforementioned methods, we use both shallow features and deep-parsed text features, such as dependency parsing and POS tag information.

For each target word, the features include:

• word lowercase

• word lemma

• whether it is a digit

• whether it is title-cased

• POS tag (both the shallow POS tag, e.g. VERB, and the detailed POS tag, e.g. VBG)

• syntactic dependency

• the word lemma and POS tag of its head word in the dependency path tree

For the context of the target words, we include:

• word lowercase

• word lemma

• whether it is a digit

• whether it is title-cased

• POS tag

• the word lemma and POS tag of its head word in the dependency path tree

• the end-of-sentence tag EOS or the beginning-of-sentence tag BOS

Table LABEL:tb:feat_eg illustrates the features extracted for the definition in Figure 2 with context window size 1.
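A hedged sketch of how the shallow part of this feature set could be built as a dictionary per token. The lemma, POS and dependency features are deliberately omitted because they require an external parser; the feature key names and the function `token_features` are assumptions made for the example.

```python
def token_features(tokens, i, window=1):
    """Shallow features for tokens[i] and its neighbours within `window` positions."""
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),
        "word.isdigit": word.isdigit(),
        "word.istitle": word.istitle(),
    }
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        prefix = f"{offset:+d}:"  # e.g. "-1:" or "+1:"
        if j < 0:
            feats[prefix + "BOS"] = True   # context runs past the sentence start
        elif j >= len(tokens):
            feats[prefix + "EOS"] = True   # context runs past the sentence end
        else:
            feats[prefix + "word.lower"] = tokens[j].lower()
            feats[prefix + "word.isdigit"] = tokens[j].isdigit()
            feats[prefix + "word.istitle"] = tokens[j].istitle()
    return feats


tokens = "Scottish way of saying yes".split()
last = token_features(tokens, 4, window=1)
first = token_features(tokens, 0, window=1)
```

A sentence is then represented as the list of such dictionaries, one per token, which is the input format that common CRF toolkits accept.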

## 5 Experiments

We detect spelling variant pairs with each of the following methods and obtain a large list of spelling variant pairs as a result.

### 5.1 Baseline

In the baseline, the RE patterns are employed directly for variant spelling detection. The matched variant entity counts vary from 9 to 951 across our summarized surface patterns, as shown in Table LABEL:tb:RE. We randomly shuffled the extracted pairs and sampled 200 of them; the manually calculated accuracy is approximately 80%. The baseline approach has high accuracy but low coverage. The extracted patterns are simple and monotonous because the expensive hand-crafted rules are limited in number. It is laborious and impractical to write rules that cover an entire corpus, because rules usually require specific domain knowledge and do not scale flexibly.

### 5.2 Bootstrapping methods

We set the initial seed size to 5, the iteration count to 8 and the selected top-ranked size of the pattern scorer to 10; the remaining hyperparameters are the threshold of the normalized Levenshtein distance, the pattern context size and the selected ranked tuple size. We manually estimate the accuracy from 100 sampled extracted tuples for each experiment.

As shown in Figure 3, the bootstrapping systems select the top-ranked tuples after tuple matching. One tuple size setting outperforms the other in the first two iterations, and the trend reverses thereafter. This may be because the number of detected tuples is at first smaller than the size limit, whereas later on systems with a larger filter size can include less confident tuples. Such noisy candidates can lead to a vicious cycle in the bootstrapping process and drift away from the correct results. Systems with tuple sizes of 10 and 20 extract 28,397 and 29,643 spelling variant pairs, respectively, by the end of the 8th iteration.

Table 4 shows that the bootstrapping systems degrade when the RlogF tuple scorer takes the matched tuple count into account. This may be due to semantic drift in the iterative process. We find that adding the tuple count factor results in noisier patterns such as a typo of, of the verb, to type out. This might be because the count factor is weighted too heavily and should be scaled or regularized.

We can observe in Figure 5 that, after adding the stopword constraint, the accuracy increases and then remains stable as the iteration count grows. The average and maximum accuracy of the detection task are 72% and 82%, respectively. This constraint makes the bootstrapping systems more conservative and helps maintain high accuracy.

### 5.3 Self-training

To meet the training requirement of self-training, we first manually labeled 2,000 samples that contain positive variant entities and 1,000 samples without variant words. We employ the IO tagging scheme, where I represents “inside” the variant entity boundary and O denotes “outside” the target variants.
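As a small illustration of the IO scheme, the sketch below tags a tokenized definition given the tokens of its variant entity. The helper name `io_tags` is an assumption; a negative sample is represented here by an empty variant, which yields all-O tags.

```python
def io_tags(tokens, variant_tokens):
    """Tag every token inside an occurrence of the variant as I, all others as O."""
    tags = ["O"] * len(tokens)
    n = len(variant_tokens)
    for start in range(len(tokens) - n + 1):
        if n and tokens[start:start + n] == variant_tokens:
            tags[start:start + n] = ["I"] * n
    return tags


positive = io_tags("Scottish way of saying yes".split(), ["yes"])
negative = io_tags("a friendly greeting".split(), [])
```

Because IO tagging has no separate “begin” label, adjacent entities cannot be told apart, which is acceptable here because each definition is expected to contain at most one target variant span of interest in this illustration.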

We experiment with several context sizes and confidence thresholds, and set the maximum iteration number to 5. As before, we manually check the correct tuples among 100 sampled instances.

##### Random search

We train the CRF model using the L-BFGS algorithm with Elastic Net regularization, and tune the hyperparameters with random search. We randomly sample the hyperparameters 50 times and use 3-fold cross-validation to tune the initial CRF model on the small amount of hand-labeled gold data. The CRF performs best with an L1 value of 2.35 and an L2 value of 0.08.

##### Feature contribution

We generated features with a fixed context size of 3, similar to the examples shown in Table LABEL:tb:feat_eg. We plot the contribution of selected features in Figure 6. We can see that when a word shortly before the current position is “spelling”, the current word is more likely to be the target entity. This supports the surface rule “another spelling of”. Besides, when the POS tag of the current word is punctuation, or the lowercased word is “it”, the current word is less likely to be the target. This ranking matches the intuition behind our domain-dependent rules well.

Afterwards, 5 iterations are run for each model setting, as in Figure 7 and LABEL:fig:crf_num. The model with context size 3 and confidence threshold 0.9 achieves the best accuracy, above 90%. In contrast, the model with context size 4 and confidence threshold 0.8 identifies the most tuples but shows low and decreasing accuracy across the iterations. It is thus necessary to balance the number of extracted tuples (coverage) against precision. Finally, we choose context size 3 and confidence threshold 0.9 as the optimal setting, with which the accuracy is 0.97 and the extracted count is 26,698 at iteration 5. Self-training thus outperforms the previous two methods in terms of accuracy.

We also developed an online tool for searching for spelling variants, shown in Figures LABEL:fig:web-search and LABEL:fig:web-result.

### 5.4 Analysis

We train the aforementioned four kinds of embeddings, namely CBOW, SGNS, GloVe and FastText, on 2.35GB of English tweets that we cleaned ourselves from 260GB of scraped multilingual tweets. The preprocessing consists of tokenization, lowercasing and removal of non-English words, but no text normalization.

These embeddings are trained with minimum vocabulary occurrence of 200, the context window size of 5/10/15 and embedding size of 100/200.

##### Spelling variant similarity

We evaluate the pre-trained word representations on the spelling variant similarity task described in Section 3. Here we use Simple English wiki and wiki data to generate the formal word vocabulary and filter out the word tuples whose second instances are not formal words. The MAP values of cosine similarity for the top 1/20/50/100 closest words are shown in Tables LABEL:tb:res-simp and LABEL:tb:res-en.

We also experimented on a Twitter hashtag prediction task with conventional TextCNN classification models, in order to measure the correlation between the spelling variant similarity task and the performance on the hashtag prediction task. The results are in Table LABEL:tb:pred_res.

##### Comparison

The Pearson correlations between the preceding intrinsic and extrinsic metrics are shown in Figure 8 and LABEL:fig:pearson-en. The two show a certain correlation in terms of the best loss. When measuring the similarity of the top $k$ closest words, the correlation increases as $k$ decreases; for the top-1 similarity, the correlation is relatively high.
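For reference, the Pearson coefficient between an intrinsic score and an extrinsic score across embedding configurations can be computed as sketched below. The function name `pearson` is an assumption and the input numbers are made up purely to exercise the code; they are not results from our experiments.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Made-up placeholder values, one entry per hypothetical embedding configuration.
intrinsic = [0.10, 0.20, 0.30, 0.40]
extrinsic = [0.50, 0.55, 0.65, 0.70]
r = pearson(intrinsic, extrinsic)
```

A coefficient near 1 would indicate that configurations ranking higher on the intrinsic task also tend to rank higher on the downstream task; with loss as the extrinsic metric, the sign has to be interpreted accordingly.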

The scatter plots of best loss and accuracy on the dev set are in Figures LABEL:fig:scatter-simp-loss, LABEL:fig:scatter-simp-acc, LABEL:fig:scatter-en-loss and LABEL:fig:scatter-en-acc. Figures LABEL:fig:scatter-simp-loss and LABEL:fig:scatter-en-loss both reflect a positive correlation between the performance on the two tasks.

We conclude that performance on the spelling variant similarity task is correlated to some degree with performance on the Twitter hashtag prediction downstream task. This evidence supports removing text normalization from NLP pipelines and directly using embeddings trained on informal-domain text.

## 6 Conclusion

We have extracted a variant spelling dataset of approximately 25K tuples with an accuracy of roughly above 90%. Our experiments provide empirical evidence that text normalization may be removed when handling NLP tasks in the informal domain. Such a spelling variant dataset can also be used in a wide range of NLP systems for the informal domain.

## References

• Akbik et al. (2019) Alan Akbik, Tanja Bergmann, and Roland Vollgraf. 2019. Pooled contextualized embeddings for named entity recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 724–728.
• Akbik et al. (2018) Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649.
• Arora and Kansal (2019) Monika Arora and Vineet Kansal. 2019. Character level embedding with deep convolutional neural network for text normalization of unstructured data for twitter sentiment analysis. Social Network Analysis and Mining, 9(1):12.
• Beckley (2015) Russell Beckley. 2015. Bekli: A simple approach to twitter text normalization. In Proceedings of the Workshop on Noisy User-generated Text, pages 82–86.
• Benton et al. (2016) Adrian Benton, Raman Arora, and Mark Dredze. 2016. Learning multiview embeddings of twitter users. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 14–19.
• Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
• Bruni et al. (2012) Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 136–145. Association for Computational Linguistics.
• Carlson et al. (2010) Andrew Carlson, Sue Ann Hong, Kevin Killourhy, and Sophie Wang. 2010. Active learning for information extraction via bootstrapping.
• Dhingra et al. (2016) Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, and William W Cohen. 2016. Tweet2vec: Character-based distributed representations for social media. arXiv preprint arXiv:1605.03481.
• Eisenstein (2013) Jacob Eisenstein. 2013. What to do about bad language on the internet. In Proceedings of the 2013 conference of the North American Chapter of the association for computational linguistics: Human language technologies, pages 359–369.
• Finkelstein et al. (2002) Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited. ACM Transactions on information systems, 20(1):116–131.
• Gupta and Manning (2014a) Sonal Gupta and Christopher Manning. 2014a. Improved pattern learning for bootstrapped entity extraction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 98–108.
• Gupta and Manning (2014b) Sonal Gupta and Christopher Manning. 2014b. Spied: Stanford pattern based information extraction and diagnostics. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, pages 38–44.
• Hill et al. (2015) Felix Hill, Roi Reichart, and Anna Korhonen. 2015. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.
• Huang et al. (2012) Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 873–882. Association for Computational Linguistics.
• Huang et al. (2015) Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991.
• Jiang et al. (2011) Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. 2011. Target-dependent twitter sentiment classification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 151–160. Association for Computational Linguistics.
• Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
• Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
• Levy et al. (2015) Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.
• Luong et al. (2013) Thang Luong, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104–113.
• Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
• Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
• Mikolov et al. (2013c) Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751.
• Miller and Charles (1991) George A Miller and Walter G Charles. 1991. Contextual correlates of semantic similarity. Language and cognitive processes, 6(1):1–28.
• Nguyen et al. (2018) Dong Nguyen, Barbara McGillivray, and Taha Yasseri. 2018. Emo, love and god: making sense of urban dictionary, a crowd-sourced online dictionary. Royal Society open science, 5(5):172320.
• Ni and Wang (2017) Ke Ni and William Yang Wang. 2017. Learning to explain non-standard english words and phrases. arXiv preprint arXiv:1709.09254.
• Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
• Radinsky et al. (2011) Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. 2011. A word at a time: computing word relatedness using temporal semantic analysis. In Proceedings of the 20th international conference on World wide web, pages 337–346. ACM.
• Riloff (1996) Ellen Riloff. 1996. Automatically generating extraction patterns from untagged text. In Proceedings of the national conference on artificial intelligence, pages 1044–1049.
• Rubenstein and Goodenough (1965) Herbert Rubenstein and John B Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.
• Saphra and Lopez (2015) Naomi Saphra and Adam Lopez. 2015. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 36–40, Denver, Colorado. Association for Computational Linguistics.
• Saphra and Lopez (2016) Naomi Saphra and Adam Lopez. 2016. Evaluating informal-domain word representations with urbandictionary. arXiv preprint arXiv:1606.08270.
• Singh and Kumari (2016) Tajinder Singh and Madhu Kumari. 2016. Role of text pre-processing in twitter sentiment analysis. Procedia Computer Science, 89:549–554.
• Tang et al. (2014) Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. 2014. Learning sentiment-specific word embedding for twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1555–1565.
• Turian et al. (2010) Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the association for computational linguistics, pages 384–394. Association for Computational Linguistics.
• Vosoughi et al. (2016) Soroush Vosoughi, Prashanth Vijayaraghavan, and Deb Roy. 2016. Tweet2vec: Learning tweet embeddings using character-level cnn-lstm encoder-decoder. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 1041–1044. ACM.
• Xiang et al. (2019) Liuyu Xiang, Xiaoming Jin, Lan Yi, and Guiguang Ding. 2019. Adaptive region embedding for text classification. arXiv preprint arXiv:1906.01514.
• Yang et al. (2016) Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pages 1480–1489.

## Appendix A Appendices
