A Comprehensive Survey on Word Representation Models: From Classical to State-Of-The-Art Word Representation Language Models

Abstract.

Word representation has always been an important research area in the history of natural language processing (NLP). Understanding such complex text data is imperative, given that it is rich in information and can be used widely across various applications. In this survey, we explore different word representation models and their expressive power, from classical models to modern-day state-of-the-art word representation language models (LMs). We describe a variety of text representation methods and model designs that have blossomed in the context of NLP, including SOTA LMs. These models can transform large volumes of text into effective vector representations that capture the underlying semantic information. Further, such representations can be utilized by various machine learning (ML) algorithms for a variety of NLP-related tasks. In the end, this survey briefly discusses the commonly used ML- and DL-based classifiers, evaluation metrics and the applications of these word embeddings in different NLP tasks.

Text Mining, Natural Language Processing, Word representation, Language Models

1. Introduction

Text-based data is increasing at a rapid rate, with low-quality unstructured text growing more rapidly than structured text. Textual data is very common across many different domains, whether it is social media posts, online forums, published articles, clinical notes for patients or online reviews in which people express their opinions and sentiments about products or businesses (Hu and Liu, 2012).

Text data is a rich source of information and offers the opportunity to explore valuable insights that cannot be obtained from quantitative data (hwee Tan, 1999). The main aim of different NLP methods is to reach a human-like understanding of the text. They help to examine the vast amount of unstructured and low-quality text and discover appropriate insights. Coupled with ML, they can be used to formulate models for the classification of low-quality text that assign labels or extract information based on prior training. For instance, researchers in the past focused on mining the opinions and sentiments of users about products, restaurants and movies to predict user sentiment. Over the years, text has been used in various applications such as email filtering (Carreras and Màrquez, 2001), irony and sarcasm detection (Naseem et al., 2020d), document organization (Hammouda and Kamel, 2004), sentiment and opinion mining (Naseem et al., 2019c, 2020e), hate speech detection (Naseem et al., 2019b, d; Naseem and Musial, 2019a; Naseem et al., 2019a), question answering (Gupta and Lehal, 2009), content mining (Aggarwal and Reddy, 2013), biomedical text mining (Naseem et al., 2020b; Naseem et al., 2020c) and many more.

Figure 1. Text Classification Pipeline

However, being unstructured, text content adds complexity to any model that must decipher it automatically or use it in conjunction with traditional features in an ML framework [57]. Moreover, even though large volumes of text information are widely available and can be leveraged for interesting applications, such text is rife with problems. Like most data, it suffers from traditional problems such as class imbalance and a lack of class labels, but in addition there are some issues inherent to text. Apart from the lack of structure, text mining and representation learning become more challenging due to the factors discussed below.

The language on social media is unstructured and informal. Social media users express their emotions and write in different ways, using abbreviations, punctuation, emoticons, slang and, often, URLs. These language imperfections introduce noise and are challenging to handle, requiring appropriate pre-processing techniques. Besides, understanding semantics, syntactic information and context is important for text analysis (Naseem and Musial, 2019b; Naseem, 2020).

Much research has been dedicated to addressing each of these concerns individually. In this survey, however, we focus on how text can be represented as continuous numeric vectors for easier representation, understanding and applicability to traditional machine-learning frameworks. Text may be seen as a collection of entities such as documents, sentences, words or characters, and most algorithms leverage the implicit relationships between these entities to infer vectors for them.

Over the years, many methods and algorithms have been used to infer vectors from text, be it at the character, word, sentence or document level. All of these methods aim to better quantify the richness of the information and to make it more suitable for machine learning tasks such as clustering, dimensionality reduction or text classification. In this survey, we study how text representation methods have evolved from manually selected features (feature engineering) to SOTA representation learning methods which leverage neural networks to discover relevant embeddings.

In any NLP task, we first need the data we are interested in analyzing. The next step is to represent the raw unstructured data in a form that ML classification algorithms can understand. Text representation is divided into two main parts: i) text pre-processing and ii) feature extraction; the learned representations are then classified using an appropriate classifier (Kowsari et al., 2019; Aggarwal and Zhai, 2012).

Contribution and Organization In this paper, we present a comprehensive study of various text representation methods, starting from the bag-of-words approach and moving to more SOTA representation learning methods. We describe various commonly used text representation methods and their variations and discuss the text mining applications they have been used in. We conclude with a discussion about the future of text representation based on our findings. We note that this paper strictly focuses on the representation of text for low-quality text classification and therefore uses content, data and text interchangeably.

Below, we first briefly discuss the different steps of the text classification pipeline illustrated in Fig. 1, followed by the details of each step in the next sections.

  1. Unstructured (Low Quality) Text: Unstructured (low quality) text is a form of written text which requires additional metadata to be listed or classified easily. Usually, it is the information generated by users in social media postings, documents, emails or messages. Raw text is scattered and sparse, with few features, and does not provide sufficient word co-occurrence information. It is an important source of information for businesses, research institutes and monitoring agencies. Companies often mine it to improve their marketing strategies and achieve an edge in the marketplace. It plays a big part in predictive analytics and in analysing the sentiments of users to find out the overall opinion of customers. It helps to discover unique insights by revealing hidden information, discovering trends and recognising relationships between seemingly unrelated bits of data (Haddi et al., 2013; Uysal and Günal, 2014).

  2. Text Representation: For text classification, the text should be converted into a form which the computer can understand. First, we need to improve the quality of the raw, unstructured text and then extract features from it before classification. Both of these steps are briefly discussed below.

    • Text pre-processing: Pre-processing is a crucial step, especially in the classification of short text. Pre-processing techniques are valuable for reducing data sparsity and help to improve the low quality of text, especially in the case of short text where everyone writes in their own style, uses emoticons, abbreviations and URLs, and makes spelling mistakes. A proper combination of common and advanced pre-processing techniques can help to learn good text representations (Bao et al., 2014; Singh and Kumari, 2016). The pre-processing techniques analysed in our study are briefly discussed in Section 2.

    • Features Extraction: Feature extraction is a critical step for machines to classify and understand data like humans. It is the process of transforming raw data into numeric data which machines can understand; the output of this transformation is usually called a feature vector. Extracting robust word representations is not easy without a considerable corpus, due to the diversity of ways of expressing sentiments, emotions and intentions in the English language. Thanks to social media platforms, researchers now have access to an enormous amount of data. However, assigning labels to this massive amount of data collected from social media platforms is not an easy job. To make the annotation process easier, researchers initially worked on finding signs of sentiment and emotion within the content of the text, such as emoticons and hashtags (Suttles and Ide, 2013; Wang et al., 2012a; Kowsari et al., 2019). Some of the well-known classical and current feature extraction algorithms are briefly discussed in Section 3.

  3. Classification: Selecting the best classifier is an essential part of the text classification pipeline. It is hard to identify the most effective and adequate classifier for a text classification task without understanding each algorithm theoretically and conceptually. Since the scope of this paper is restricted to presenting different text representation methods, we do not discuss text classification algorithms in detail. These classifiers include well-known traditional ML algorithms for text classification such as Logistic Regression, which is used in many data mining areas (Patriche et al., 2016; Chen et al., 2017), Naive Bayes, which is computationally inexpensive and works well with a small amount of memory (Larson, 2010), K-Nearest Neighbour, which is a non-parametric method, and the Support Vector Machine, a famous classifier which has been widely used in many different areas. Tree-based algorithms like random forests and decision trees follow, along with deep learning (DL)-based classifiers, which are a collection of methods and approaches motivated by the working mechanism of the human brain. These methods utilise extensive amounts of training data to obtain high-quality, semantically rich text representations which can be given as input to different ML methods to make better predictions (Korde and Mahender, 2012; Kowsari et al., 2019).

2. Text Pre-processing

Text datasets contain many unwanted tokens such as stop-words, punctuation, incorrect spellings and slang. This unwanted noise may have a negative effect on the performance of the classification task. Below, we first present the preliminaries, where we discuss different methods and techniques related to text pre-processing and cleaning, followed by a review of studies in which researchers analyzed the effects of text pre-processing techniques.

2.1. Preliminaries related to Text Pre-processing

  • Tokenization A process of transforming a text (sentence) into tokens or words is known as tokenization. Documents can be tokenized into sentences, whereas sentences can be converted into tokens. In tokenization, a sequence of text is divided into words, symbols, phrases or tokens (Balazs and Velásquez, 2016). The prime objective of tokenization is to find the words in a sentence. Usually, tokenization is applied as a first and standard pre-processing step in any NLP task (Giachanou et al., 2017).

  • Removal of Noise, URLs, Hashtag and User-mentions Unwanted strings and Unicode characters are left over from the crawling process; they are not useful to machines and create noise in the data. Also, almost all tweets posted by users contain URLs to provide extra information and user-mentions/tags, and use the hashtag symbol to associate the tweet with some particular topic; users can also express their sentiments through hashtags. These elements give extra information which is useful for human beings, but they do not provide much information to machines and are considered noise which needs to be handled. Researchers have presented different techniques to handle this extra information provided by users: URLs, for example, have been replaced with tags (Agarwal et al., 2011), whereas user-mentions have been removed (Bermingham and Smeaton, 2011; Khan et al., 2014).

  • Word Segmentation Word segmentation is the process of separating the phrases, content and keywords used in a hashtag. This step can help machines understand and classify the content of tweets more easily without any human intervention. As mentioned earlier, Twitter users use # (hashtags) in almost all tweets to associate their tweets with some particular topic; the phrase or keyword starting with # is known as a hashtag. Various techniques for word segmentation are presented in the literature (Reuter et al., 2016; Celebi and Ozgur, 2016).

  • Replacing Emoticons and Emojis Twitter users use many different emoticons and emojis, such as :) and :(, to express their sentiments and opinions. It is therefore important to capture this useful information to classify tweets correctly. There are a few tokenizers available which can capture such expressions and emotions and replace them with their associated meanings (Gimpel et al., [n.d.]).

  • Replacement of abbreviation and slang The character limitations of Twitter force online users to use abbreviations, short words and slang in their posts. An abbreviation is a shortened form or acronym of a word or phrase, such as MIA, which stands for missing in action. In contrast, slang is an informal way of expressing thoughts or meanings, sometimes restricted to a particular group of people or context. It is therefore crucial to handle such informal text by replacing these terms with their actual meanings to get better performance without losing information. Researchers have proposed different methods to handle this kind of issue in text, but the most useful technique is to convert such terms to actual words which are easy for a machine to understand (Kouloumpis et al., 2011; Mullen and Malouf, 2006).

  • Replacing elongated characters Social media users sometimes intentionally use elongated words in which they purposely repeat characters, such as loooovvveee or greeeeat. Thus, it is important to deal with these words and change them to their base word so that the classifier does not treat them as different words. In our experiments, we replaced elongated words with their original base words. The detection and replacement of elongated words have been studied by (Mohammad et al., 2013) and (Balahur, 2013).

  • Correction of Spelling mistakes Incorrect spellings and grammatical mistakes are very common in text, especially on social media platforms such as Twitter and Facebook. Correcting spelling and grammatical mistakes helps to reduce the number of identical words written in different forms. TextBlob is one library which can be used for this purpose. Norvig's spell-correction method is also widely used to correct spelling mistakes.

  • Expanding Contractions A contraction is a shortened form of a word or phrase which is widely used by online users; an apostrophe is used in the place of the missing letter(s). Because we want to standardize the text so machines can process it easily, when removing contractions, shortened words are expanded to their original root/base words. For example, words like how's, I'm, can't and don't are contractions of how is, I am, cannot and do not, respectively. In the study conducted by (Boia et al., 2013), contractions were replaced with their original words or by the relevant word. If contractions are not replaced, the tokenization step will split the word "can't" into the tokens "can" and "t".

  • Removing Punctuations Social media users use different punctuation marks to express their sentiments and emotions, which may be useful for humans but not as useful for machines in the classification of short texts. The removal of punctuation is therefore common practice in classification tasks such as sentiment analysis (Lin and He, 2009). However, some punctuation symbols, like "!" and "?", denote sentiment; replacing the question mark or exclamation mark with tags has also been studied by (Balahur, 2013).

  • Removing Numbers A text corpus usually contains numbers which are useful for human beings to understand but of little use to machines, and which can lower the results of the classification task. The simple and standard method is to remove them (He et al., 2011; Jianqiang, 2015). However, we could lose some useful information if we remove them before transforming slang and abbreviations into their actual words. For example, words like "2maro", "4 u", "gr8", etc. should first be converted to actual words, and only then can we proceed with this pre-processing step.

  • Lower-casing all words A sentence in a corpus contains many words with varied capitalization. This pre-processing step helps to avoid different copies of the same word. Diversity of capitalization within the corpus can cause problems during the classification task and lower performance. Changing every capital letter to lower case is the most common method to handle this issue in text data. Although this technique projects all tokens in a corpus into one feature space, it also causes problems in the interpretation of some words, such as "US" in the raw corpus: "US" could be the pronoun "us" as well as a country name, so converting it to lower case in all cases can be problematic. The study conducted by (dos Santos and de C. Gatti, 2014) lower-cased the words in the corpus to obtain clean words.

  • Removing Stop-words In text classification tasks, there are many words which do not have critical significance but appear with high frequency in the text, words like a, the, is, and, am, are, etc. These words do not help to improve performance because they carry little information for the sentiment classification task, so it is recommended to remove stop-words before the feature selection step. A popular and straightforward method to handle such words is simply to remove them. Different stop-word lists are available in libraries such as NLTK, scikit-learn and spaCy.

  • Stemming One word can turn up in many different forms while the semantic meaning of those forms is still the same. Stemming is the technique of removing suffixes and affixes to obtain the root, base or stem word. The importance of stemming was studied by (Mejova and Srinivasan, 2011). There are several stemming algorithms which help to consolidate different forms of words into the same feature space, such as the Porter, Lancaster and Snowball stemmers. Feature reduction can also be achieved by utilizing stemming.

  • Lemmatization The purpose of lemmatization is the same as stemming, which is to cut words down to their base or root forms. However, in lemmatization the inflections of words are not just chopped off; lexical knowledge is used to transform words into their base forms. There are many libraries available which support lemmatization; a few of the well-known ones are NLTK (WordNet lemmatizer), gensim, Stanford CoreNLP, spaCy and TextBlob.

  • Part of Speech (POS) Tagging The purpose of part-of-speech (POS) tagging is to assign a part of speech to each word in a text. It groups together words which play the same grammatical role.

  • Handling Negations For humans, it is simple to grasp the context when a negation is present in a sentence, but machines sometimes fail to capture and classify it accurately, so handling negation can be a challenging task in word-level text analysis. Marking negated words with the prefix 'NEG_' has been studied by (Narayanan et al., 2013). Similarly, handling negations with antonyms has been studied by (Perkins, 2010). A combined code sketch of several of the pre-processing steps above is given below.
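The following is a minimal sketch, in Python with NLTK, that chains several of the steps above (URL and user-mention removal, lower-casing, elongation reduction, punctuation and number removal, tokenization, stop-word removal and lemmatization). The regular expressions and the example tweet are illustrative assumptions rather than the exact rules used in the cited studies.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time resource downloads (uncomment on first run):
# nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(tweet):
    """Apply a subset of the pre-processing steps described above."""
    text = tweet.lower()                                  # lower-casing
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # remove URLs
    text = re.sub(r"@\w+", " ", text)                     # remove user-mentions
    text = text.replace("#", "")                          # keep hashtag words, drop the symbol
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)            # reduce elongated characters
    text = re.sub(r"[^a-z\s]", " ", text)                 # drop punctuation and numbers
    tokens = nltk.word_tokenize(text)                     # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]    # stop-word removal
    return [LEMMATIZER.lemmatize(t) for t in tokens]      # lemmatization

print(preprocess("Loooove this movie!!! http://example.com @user #great"))
# -> ['loove', 'movie', 'great'] (elongation is only reduced, not spell-corrected)
```

A real pipeline would also expand contractions and slang and correct spelling before dropping punctuation, as discussed above.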

2.2. Related work on text pre-processing methods

Text pre-processing plays a significant role in text classification. Many researchers in the past have made efforts to understand the effectiveness of different pre-processing techniques and their contribution to text classification tasks. Below we present some studies conducted on the effects of pre-processing techniques on text classification tasks.

The study by Bao et al. (Bao et al., 2014) showed the effect of pre-processing techniques on a Twitter analysis task. Unigram and bigram features were fed to a Liblinear classifier for classification. They showed that the retention of URL features, the transformation of negated words and the normalization of repeated tokens have a positive effect on classification results, whereas lemmatization and stemming have a negative effect. Singh and Kumari (Singh and Kumari, 2016) showed the impact of pre-processing on a Twitter dataset full of abbreviations, slang and acronyms for the sentiment classification task. In their study, they showed the importance and significance of slang handling and spelling correction and used a Support Vector Machine (SVM) classifier to study the role of pre-processing for sentiment classification. Haddi et al. (Haddi et al., 2013) also explored the effect of text pre-processing on a movie review dataset. Their experiments show that pre-processing methods such as transforming the text by changing abbreviations into actual words, removing stop-words and special characters, handling negation with the prefix 'NOT' and stemming can significantly improve classification performance; an SVM classifier was used in their experiments. A study conducted by Uysal and Gunal (Uysal and Günal, 2014) analyzed the role of pre-processing on two different languages for sentiment classification. They employed an SVM classifier and showed that performance is improved by selecting an appropriate combination of different techniques such as the removal of stop-words, lower-casing of text, tokenization and stemming. They concluded that researchers should choose combinations carefully because an inappropriate combination may degrade performance. Similarly, Jianqiang and Xiaolin (Jianqiang and Xiaolin, 2017) studied the role of six different pre-processing techniques on five datasets, using four different classifiers. Their experimental results show that replacing acronyms (abbreviations) with actual words and handling negations improved sentiment classification, whereas removing stop-words, special characters and URLs had an adverse influence on the results. The role of text pre-processing in reducing the sparsity issue in Twitter sentiment classification was studied by Saif et al. (Saif et al., 2013); their experimental results demonstrate that choosing a combination of appropriate pre-processing methods can decrease sparsity and enhance classification results. Agarwal et al. (Agarwal et al., 2011) proposed novel tweet pre-processing approaches: they replaced URLs, user-mentions, repeated characters and negated words with different tags and removed hashtags, and classification results were improved by their proposed methods. Other studies were presented by Saloot et al. (Saloot et al., 2015) and Takeda and Takefuji (Yamada et al., 2015) in the natural language workshop that focuses on noisy user-generated text, where the noisy nature of Twitter messages is reduced by normalizing tweets using a maximum entropy model and entity linking. Recently, Symeonidis et al. (Symeonidis et al., 2018) presented a comparative analysis of different techniques on two datasets for Twitter sentiment analysis.
They studied the effect of each technique on four traditional ML-based classifiers and one neural network-based classifier, with only TF-IDF (unigram) as the word representation method. Their study showed that pre-processing techniques such as removing numbers, lemmatization and expanding contractions to base words perform well, whereas removing punctuation does not perform well in the classification task. Their study also examined the interactions of a limited number of techniques with each other and identified the techniques which perform well when combined with others. However, no work has been done on recommending pre-processing techniques to improve the quality of the text.

3. Feature Extraction methods

In this section, we discuss various popular feature extraction models. Different researchers have proposed different feature extraction models to address the problem of losing syntactic and semantic relationships between words. These methods are described below, along with a literature review of how they have been adopted for different NLP-related tasks. First, we present some classical models, followed by some well-known representation learning models.

3.1. Classical Models

This section presents some of the classical models which were commonly used in earlier days for the text classification task. Word frequency is the basis of these word representation methods: a text is transformed into a vector which contains counts of the words appearing in a document. First, we give a short description of categorical word representation methods and then of weighted word representation methods; a short code sketch of these count-based representations is given at the end of this subsection.

  1. Categorical word representation: is the simplest way to represent text. In these methods, words are represented symbolically, by either "1" or "0". One-hot encoding and bag-of-words (BoW) are the two models which come under categorical word representation methods. Both are briefly discussed below.

    • One hot encoding: The most straightforward method of text representation is one-hot encoding. In one-hot encoding, the dimensionality equals the number of terms present in the vocabulary. Every term in the vocabulary is represented as a binary variable, 0 or 1, which means each word vector is made up of zeros and ones. The index of the corresponding word is marked with 1, whereas all others are marked with zero (0): each unique word has a unique dimension and is represented by a 1 in that dimension with 0s everywhere else.

    • Bag-of-Words (BoW): BoW is simply an extension of one-hot encoding: it adds up the one-hot representations of the words in the sentence. The BoW method is used in many different areas such as NLP, computer vision (CV) and information retrieval (IR). The matrix of word counts built using BoW ignores the semantic relationships between words; word order is also ignored, along with grammar.

      Figure 2. An illustration of one-hot encoding and BoW models

      As stated, BoW is an extension of one-hot encoding: it encodes each token in the vocabulary as a one-hot vector and sums these vectors. As the vocabulary may grow to huge numbers of terms, the length of the vectors grows too. Besides, the large number of "0s" results in a sparse matrix, which contains no information about the order of the text or the grammar used in the sentences.

      An example of "Hello" and "World" as one-hot encodings and "Hello World" as BoW is given in Fig. 2; a minimal code sketch of both encodings follows.
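      The following pure-Python sketch reproduces the toy "Hello World" example of Fig. 2; the two-word vocabulary is an illustrative assumption.

```python
# Minimal one-hot and bag-of-words encoding over a toy two-word vocabulary.
vocabulary = ["hello", "world"]
word_index = {w: i for i, w in enumerate(vocabulary)}

def one_hot(word):
    vec = [0] * len(vocabulary)
    vec[word_index[word]] = 1          # 1 in the word's own dimension, 0 elsewhere
    return vec

def bag_of_words(sentence):
    vec = [0] * len(vocabulary)
    for token in sentence.lower().split():
        if token in word_index:
            vec[word_index[token]] += 1    # BoW sums the one-hot vectors of the tokens
    return vec

print(one_hot("hello"))             # [1, 0]
print(one_hot("world"))             # [0, 1]
print(bag_of_words("Hello World"))  # [1, 1]
```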

  2. Weighted Word representation: Here, we present the common methods for weighted word representation, namely Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF). These are related to categorical word representation methods, but rather than marking only presence, weighted models produce numerical representations based on word frequency. Both of them are briefly discussed below.

    • Term Frequency (TF): Term frequency (TF) is the most straightforward method of text feature extraction. TF counts how often a word occurs in a document. A word is likely to appear more times in large documents than in small ones; hence, the count is normalized by the length of the document. In other words, the TF of a word is computed by dividing its count by the total number of words in the document.

    • Term Frequency-Inverse Document Frequency (TF-IDF): To cut down the impact of common words such as 'the', 'and', etc. in the corpus, TF-IDF was presented by (Sparck Jones, 1988) for text representation. TF stands for term frequency, defined in the section above, and IDF denotes inverse document frequency, a weighting used together with TF to reduce the effect of common words: IDF assigns a higher weight to terms that appear in few documents and a lower weight to terms that appear in many. The combination of the TF and IDF methods is known as TF-IDF and is represented mathematically by the equation below:

      \mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \log \frac{|D|}{|\{d \in D : t \in d\}|}

      where t denotes a term, d denotes a document, D represents the collection of documents and |\{d \in D : t \in d\}| denotes the number of documents containing term t. TF-IDF is built on the concept of the BoW model; therefore, it cannot capture the order of words in a document or the semantic and syntactic information of words. Hence, TF-IDF is best used as a lexical-level feature. A short scikit-learn sketch of these count-based representations follows.
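      Below is a minimal sketch, assuming a recent scikit-learn, of the bag-of-words and TF-IDF representations described in this subsection; the three-document corpus is a toy assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the movie was good",
    "the movie was bad",
    "good acting and a good story",
]

# Bag-of-words: raw term counts per document.
bow = CountVectorizer()
counts = bow.fit_transform(corpus)
print(bow.get_feature_names_out())   # vocabulary learned from the corpus
print(counts.toarray())

# TF-IDF: counts re-weighted so that terms occurring in many documents
# (e.g. "the") contribute less than more distinctive terms.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus)
print(weights.toarray().round(2))
```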

3.2. Representation Learning

Categorical word representation models fail to capture the syntactic and semantic meaning of words, and they suffer from the curse of high dimensionality. The shortcomings of these models led researchers to learn distributed word representations in a low-dimensional space (Bolukbasi et al., 2016). The limitations of classical feature extraction methods make them of limited use for building suitable ML models. Because of this, different models have been presented which discover the representations automatically for downstream tasks such as classification. Such methods, which discover features themselves, are called feature learning or representation learning.

This is very important because the performance of ML models heavily depends on the representation of the input (Bengio et al., 2013). DL-based models, which are good at learning important features themselves, are replacing traditional feature learning methods. Proper representations can be learned by utilizing either supervised or unsupervised learning methods.

In the area of NLP, unsupervised text representation methods like word embeddings have replaced categorical text representation methods. These word embeddings have become very efficient representation methods which improve the performance of various downstream tasks by providing prior knowledge to different ML models. Classical feature learning methods have been replaced by these neural network-based methods due to their good representation learning capacity. Word embedding is a feature learning method in which a word from the vocabulary is mapped to an N-dimensional vector. Many different word embedding algorithms have been presented; famous ones such as Word2Vec, GloVe and FastText are discussed in this study.

First, we briefly present different pre-training methods for learning the word representation of the document. These pre-training methods are classified into three different groups: (i) Supervised learning (SL), (ii) Unsupervised learning (UL), and (iii) Self-supervised learning (SSL). Below we discuss each of these briefly:

  1. Supervised learning (SL) learns a function that maps an input to an output on the basis of input-output pair training data.

  2. Unsupervised learning (UL) discovers intrinsic information, such as clusters, densities or latent representations, from unlabeled data.

  3. Self-Supervised learning (SSL) is a hybrid of SL and UL. SSL's learning setup is mostly the same as SL, except that the training labels are generated automatically. SSL's main concept is to predict some part of the input from the other parts. The Masked Language Model (MLM), for instance, is a self-supervised task that tries to predict the masked words in a sentence given the remaining words.

Distributed Representations

As previously mentioned, hand-crafted features were primarily used to model natural language tasks before approaches based on neural networks came around and addressed some of the challenges faced by conventional ML algorithms, such as the curse of dimensionality.

  1. Continuous Words Representation (Non-Contextual Embeddings):

    Word embedding is an NLP technique in which the text of a corpus is mapped to vectors. In other words, it is a type of learned representation which allows words with the same meaning to have similar representations. It is a distributed representation of text (words and documents) and has been a significant breakthrough for better performance on NLP-related problems. The most significant benefit of word embeddings is that they provide a more efficient and expressive representation by preserving the contextual similarity of words in low-dimensional vectors. Nowadays, word embeddings are used in many different fields such as semantic analysis, philology, psychiatry, cognitive science, social science and psychology (Elekes et al., 2018).

    An automatic feature learning technique in which every token in a vocabulary is mapped to an N-dimensional vector is known as distributed vectors or word embedding. It follows the distributional hypothesis, according to which words that are used and appear in similar contexts tend to have similar meanings. These vectors therefore capture the attributes of a word's neighbours and the similarity between words. During the 1990s, several researchers made attempts to lay down the foundations of distributional semantic learning.

    Bengio et al. (Bengio et al., 2003) presented a model which learned word representations using distributed representation. The authors presented the NNLM model, which obtains word representations as a by-product of training a language model (LM). Just like a traditional LM, the NNLM uses the previous words/tokens to predict the next word/token. Different word embedding models have since been proposed, which make unigrams useful and understandable to ML algorithms; usually, these embeddings are used in the first layer of a deep neural network-based model, as sketched below. These word embeddings are pre-trained by predicting a word based on its context without losing semantic and syntactical information. Thus, using these embedding techniques has been demonstrated to be helpful in many NLP tasks because they do not lose the order of words and capture the meaning of words (their syntactic and semantic information). The popularity of word representation methods, however, is due to two famous models, Word2Vec (Mikolov et al., 2013) and GloVe (Manning et al., 2014). These, along with others, are briefly discussed below.
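    As noted above, pre-trained embeddings are typically consumed as the first layer of a deep model. The following is a minimal PyTorch sketch; the randomly initialized matrix stands in for vectors that would in practice come from Word2Vec, GloVe or FastText.

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 10_000, 300

# Stand-in for a pre-trained embedding matrix (one row per vocabulary index).
pretrained = torch.randn(vocab_size, embedding_dim)

# freeze=False lets the embeddings be fine-tuned together with the downstream model.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)

token_ids = torch.tensor([[12, 7, 256, 3]])   # a toy batch containing one 4-token sentence
vectors = embedding(token_ids)                # shape: (1, 4, 300)
print(vectors.shape)
```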

    • Word2Vec

      Word2vec is a word representation model developed by (Mikolov et al., 2013). The model uses a shallow neural network with two layers to create a vector for each word. The word vectors captured by the Continuous Bag-of-Words (CBOW) and Skip-gram variants of word2vec are expected to carry semantic and syntactic information about words. To obtain better word representations, it is recommended to train the model on a large corpus. Word2Vec has proved useful in many NLP-related tasks (Collobert et al., 2011). Word2Vec was developed to make the training of embeddings more efficient, and since then it has been used as a standard for developing pre-trained word representations. Based on the context, Word2Vec predicts words using one of two neural network models, CBOW or Skip-gram. In both models, a window of predefined length is moved along the corpus, and training is done with the words inside the window at each step (Altszyler et al., 2016). This feature representation algorithm provides a robust tool for uncovering relationships in the corpus and the similarity between tokens. For instance, this method would place two words such as "small" and "smaller" near each other in the vector space. Fig. 3 shows the working principle of both Word2Vec algorithms, CBOW and Skip-Gram.

      Figure 3. Working principle of Word2Vec
      (Image taken from (Mikolov et al., 2013))
      • Continuous Bag of words (CBOW): The Continuous Bag-of-Words (CBOW) model predicts the current word based on its context, i.e., the neighbouring words in the window. Three layers are used in the CBOW process: the context forms the input layer; the hidden layer projects every word from the input through the weight matrix; and this projection is then mapped to the output, which forms the third layer. The last phase of this method compares the output with the target word itself to improve the representation via backpropagation of the error gradient. As shown in Fig. 3, the CBOW method predicts the middle word based on its context, whereas skip-gram predicts the context words based on the centre word (Naili et al., 2017).

      • Skip-Gram:

        Skip-Gram is the reverse of the CBOW model: the prediction is made from the central word after training on its context. The input layer corresponds to the target word, and the output layer corresponds to the context. Unlike CBOW, this model estimates the context given the word. The last phase of this model compares the output with every word of the context to adjust the representation via back-propagation (Naili et al., 2017; Elekes et al., 2018).

        Skip-gram is effective when we have little training data, and infrequent words are well represented. In comparison, CBOW is quicker to train and performs better for frequent words. To make learning the final vectors tractable, two algorithms have been proposed. The first is negative sampling, in which we restrict the number of output vectors that need to be updated, so only a sample of the vectors is updated, drawn from a noise distribution (a probability distribution used in the sampling step). The other method is hierarchical softmax, which is built on a Huffman tree, a binary tree that organizes all words according to their counts; normalization is then performed along the path from the root to the target word. Negative sampling is efficient when the dimensionality of the vectors is low and works well with frequent words. In comparison, hierarchical softmax works well for less frequent words (Naili et al., 2017). A minimal gensim sketch of these training choices is given below.
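        The following is a minimal sketch using the gensim library (4.x-style API); the tokenized toy corpus is an illustrative assumption. The `sg` flag switches between CBOW and skip-gram, while `hs` and `negative` switch between hierarchical softmax and negative sampling, as discussed above.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; in practice this would be a large pre-processed corpus.
sentences = [
    ["the", "movie", "was", "good"],
    ["the", "film", "was", "great"],
    ["the", "movie", "was", "bad"],
]

# sg=1 -> skip-gram, sg=0 -> CBOW.
# hs=0 with negative=5 -> negative sampling; hs=1 with negative=0 -> hierarchical softmax.
model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the word vectors
    window=2,          # context window moved along the corpus
    min_count=1,
    sg=1,
    hs=0,
    negative=5,
    epochs=50,
)

print(model.wv["movie"].shape)                 # (100,)
print(model.wv.most_similar("movie", topn=2))  # nearest tokens in the learned space
```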

    • Global Vectors (GloVe):

      Word2vec-trained word embeddings capture the semantics of words and the relationships between them well. However, Word2vec mainly exploits local context-window information, and global statistical information is not used well. GloVe (Manning et al., 2014) was therefore presented: it is a well-known algorithm based on a global co-occurrence matrix, in which each element records how frequently a pair of words co-occurs within an appropriate context window, and it is widely used for text classification tasks.

      GloVe is an extension of word2Vec for learning word vectors efficiently, where the prediction of a word is made based on the surrounding words. GloVe is based on the occurrences of a word in the corpus and works in two steps: the first step creates the co-occurrence matrix from the corpus, and the second factorizes it to obtain the vectors. Like word2Vec, GloVe also provides pre-trained embeddings in different dimensions (100, 200 and 300) trained over a vast corpus. The objective function of GloVe is given below:

      J = \sum_{j,k=1}^{V} f(X_{jk}) \left( w_j^{\top} \tilde{w}_k + b_j + \tilde{b}_k - \log X_{jk} \right)^2

      where:

      V is the size of the vocabulary,

      X is the co-occurrence matrix,

      X_{jk} is the frequency of word k co-occurring with word j,

      X_j = \sum_k X_{jk} is the total number of occurrences of word j in the corpus,

      P_{jk} = X_{jk} / X_j is the probability of word k occurring within the context of word j,

      w_j is a word embedding of dimension d and \tilde{w}_k is the context word embedding of dimension d,

      b_j and \tilde{b}_k are bias terms, and f is a weighting function that limits the influence of very frequent co-occurrences.
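      Pre-trained GloVe vectors are usually consumed directly rather than re-trained. A minimal sketch using gensim's downloader is shown below; the catalogue name "glove-wiki-gigaword-100" is assumed to be available in the gensim-data repository.

```python
import gensim.downloader as api

# Downloads 100-dimensional GloVe vectors trained on Wikipedia + Gigaword on first use.
glove = api.load("glove-wiki-gigaword-100")

print(glove["king"].shape)                 # (100,)
print(glove.similarity("small", "smaller"))
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```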

      Word representation methods such as Word2vec and GloVe are simple and accurate and, on large datasets, they can learn semantic representations of words. They do not, however, learn embeddings for out-of-vocabulary (OOV) words. Such words can be defined in two ways: words that are not included in the current vocabulary and words that do not appear in the training corpus. Various models have been proposed to address this challenge; we briefly describe one of the most famous below.

    • FastText

      Bojanowski et al. (Bojanowski et al., 2016a) proposed FastText, which is based on CBOW. Compared with other algorithms, FastText decreases the training time while maintaining performance. The previously mentioned algorithms assign a distinct representation to every word, which introduces a limitation, especially for languages rich in sub-word information and for OOV words.

      The FastText model addresses the issues mentioned above. FastText breaks a word into character n-grams, instead of feeding the full word into the neural network, which allows it to capture relationships between characters and pick up the semantics of words. FastText gives better results by producing better word representations, primarily in the case of rare words. Facebook has released pre-trained word embeddings for 294 different languages, trained on Wikipedia using FastText with 300 dimensions and the word2Vec skip-gram model with its default parameters (Joulin et al., 2016). A minimal training sketch is given below.
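      The following gensim sketch (4.x-style API, toy corpus assumed) illustrates the key property discussed above: because vectors are built from character n-grams, a word never seen during training still receives a representation.

```python
from gensim.models import FastText

sentences = [
    ["the", "movie", "was", "good"],
    ["the", "film", "was", "great"],
]

# min_n / max_n set the character n-gram range used for sub-word information.
model = FastText(sentences, vector_size=100, window=2, min_count=1,
                 min_n=3, max_n=5, epochs=50)

print("goood" in model.wv.key_to_index)   # False: the word never occurred in training
print(model.wv["goood"].shape)            # (100,) -- composed from its character n-grams
```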

    Although these models retain syntactic and semantic information about a document, the issue remains of how to keep a fully context-specific representation of the document. Understanding the actual context is required for most downstream tasks in NLP. Recent work has tried to combine word embeddings with LMs to solve this problem of meaning. Below, some of the common context-based models are briefly presented.

    Figure 4. Working principle of Context2Vec
    (Image taken from(Melamud et al., 2016a))
  2. Contextual word representations:

    • Generic Context word representation (Context2Vec): Generic Context word representation (Context2Vec) was proposed by Melamud et al. (Melamud et al., 2016b) in 2016 to generate context-dependent word representations. Their model is based on word2Vec's CBOW model but replaces the averaging of word representations within a fixed window with a more powerful bidirectional LSTM neural network. A large text corpus was used to learn a neural model which embeds the context of a sentence and the target words in the same low-dimensional space, which is then optimized to reflect the inter-dependencies between targets and their entire sentential context as a whole, as shown in Fig. 4.

    • Contextualized word representations Vectors (CoVe):

      McCann et al. (McCann et al., 2017a) presented contextualized word representation vectors (CoVe), which build on Context2Vec. They used machine translation to build CoVe instead of the approaches used in Word2Vec (skip-gram or CBOW) or GloVe (matrix factorization). Their basic approach was to pre-train a two-layer BiLSTM encoder for attentional sequence-to-sequence translation, starting from GloVe word vectors; they then took the output of the sequence encoder, called it CoVe, combined it with the GloVe vectors and used it in a downstream task-specific model via transfer learning.

    • Embedding from language Models (ELMo)

      Peters et al. (Peters et al., 2018a) proposed Embeddings from Language Models (ELMo), which provide deep contextual word representations. The authors argue that a successful word representation model should take two issues into account: the complex nature of word use in semantics and grammar, and the fact that these uses change as the linguistic context changes. They therefore introduce a deep contextualised word representation method to address these two issues, as seen in Fig. 5.

      Figure 5. Working principle of ELMo
      (Image taken from(Devlin et al., 2018))

      The final word vectors are learned from a bidirectional language model (forward and backward LMs). ELMo uses a linear combination of the representations learned at all layers of the bidirectional language model, instead of only the final-layer representations used by other contextual word representation models. In different sentences, ELMo therefore provides different representations for the same word. The word representations learned by ELMo are based on the representations learned by a bidirectional language model (BiLM). During training, the BiLM jointly maximizes the log-likelihood of the sentences in the forward and backward directions:

      \sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)

      where the token representation parameters \Theta_x and the softmax parameters \Theta_s are shared between the forward and backward directions, and \overrightarrow{\Theta}_{LSTM} and \overleftarrow{\Theta}_{LSTM} are the forward and backward LSTM parameters, respectively. In a downstream task, ELMo extracts the representations learned by the BiLM at its intermediate layers and computes a linear combination for each token. For a token t_k, the BiLM produces a set R_k of 2L+1 representations:

      R_k = \{ x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \ldots, L \} = \{ h_{k,j}^{LM} \mid j = 0, \ldots, L \}

      where h_{k,0}^{LM} is the token layer and h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}] for each biLSTM layer. ELMo then collapses all layers in R_k into a single task-specific vector:

      \mathrm{ELMo}_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}   (1)

      where s^{task} are softmax-normalized weights for combining the representations from the different layers and \gamma^{task} is a hyper-parameter for optimization and scaling of the representations. A small numerical sketch of this combination is given below.
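      Equation (1) is just a softmax-weighted sum of the BiLM layers followed by a scalar scaling. The numpy sketch below makes this concrete; the random arrays stand in for the actual BiLM hidden states and learned task weights.

```python
import numpy as np

L, dim = 2, 1024                      # number of biLSTM layers and their (concatenated) size
h = np.random.randn(L + 1, dim)       # stand-in for h_{k,0..L}: token layer + L biLSTM layers

s_task = np.random.randn(L + 1)       # task-specific layer scores (learned in practice)
gamma_task = 1.0                      # task-specific scaling hyper-parameter

weights = np.exp(s_task) / np.exp(s_task).sum()             # softmax-normalized layer weights
elmo_k = gamma_task * (weights[:, None] * h).sum(axis=0)    # Eq. (1)

print(elmo_k.shape)                   # (1024,)
```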

      Table 1 presents a comparison of classical, non-contextual and contextual (Context2Vec, CoVe and ELMo) LMs with their pros and cons.

      | Model | Architecture | Type | Pros | Cons |
      |---|---|---|---|---|
      | One-Hot Encoding and BoW | - | Count based | i) Easy to compute; ii) Works with unknown words; iii) Fundamental metric to extract terms | i) Does not capture semantic and syntactic info; ii) Common words affect the results; iii) Cannot capture the sentiment of words |
      | TF and TF-IDF | - | Count based | i) Easy to compute; ii) Fundamental metric to extract descriptive terms; iii) Because of IDF, common terms do not impact results | i) Does not capture semantic and syntactic info; ii) Cannot capture the sentiment of words |
      | Word2Vec | Log Bilinear | Prediction based | i) Captures semantic and syntactic information of text; ii) Trained on a huge corpus (pre-trained) | i) Fails to capture contextual information; ii) Fails to capture OOV words; iii) Needs a huge corpus to learn |
      | GloVe | Log Bilinear | Count based | i) Enforces vectors in the vector space to identify sub-linear relationships; ii) Smaller weights prevent common word pairs such as stop words from dominating training | i) Fails to capture contextual information; ii) Memory utilization for storage; iii) Fails to capture OOV words; iv) Needs a huge corpus to learn (pre-trained) |
      | FastText | Log Bilinear | Prediction based | i) Works for rare words; ii) Addresses the OOV word issue | i) Fails to capture contextual information; ii) Memory consumption for storage; iii) Computationally more costly than GloVe and Word2Vec |
      | Context2Vec, CoVe, ELMo | BiLSTM | Prediction based | i) Solve the contextual information issue; ii) Improve performance | i) Computationally more expensive; ii) Require another word embedding plus LSTM and feed-forward layers |
      Table 1. Comparison of Classical, Non-contextual and Contextual (Context2Vec, CoVe, ELMo) Word Representation Models

      Summary: Text representation embeds textual data into a vector space, which significantly affects the performance of downstream learning tasks. Better representation of text is more likely to facilitate better performance if it can efficiently capture intrinsic data attributes. Below we briefly highlight the limitations of categorical and continuous word representation models.

      Classical word representation methods like categorical and weighted word representations are the most naive and straightforward representations of textual data. These legacy word representation models were widely used in the early days for different classification tasks in areas such as document classification, natural language processing (NLP), information retrieval and computer vision (CV). The categorical word representation models are simple and not difficult to implement, but they have clear limitations: they do not capture semantic and syntactic information because they consider neither the order of words nor any relationship between words. Further, the size of the input vector is proportional to the vocabulary size, which makes them computationally expensive and results in poor performance.

      Hand-crafted features helped the research community to build powerful models; however, their drawback is that the features need to be selected manually, so to address this shortcoming there was a need for methods which can discover and learn representations automatically for any downstream task. This automatic extraction of features without human intervention is known as representation learning, and it has improved results drastically over the past few years in many areas such as image detection, speech recognition and NLP (LeCun et al., 2015). Continuous word representation models like Word2Vec, GloVe and FastText (Mikolov et al., 2013; Manning et al., 2014; Joulin et al., 2016; Bojanowski et al., 2016b) have drastically improved classification results and overcome the shortcomings of categorical representations. These continuous word representations have been found to be more effective than traditional linguistic features because of their ability to capture more of the semantic and syntactic information of the textual data without losing much information. Despite their success, there are still some limitations which they are not capable of addressing: they are unable to handle polysemy because they assign the same vector to a word regardless of its context. Also, models like Word2Vec and GloVe assign a random vector to a word they did not encounter during training, which means they are unable to handle out-of-vocabulary (OOV) words; this was addressed by FastText, which breaks words into n-grams. All of these limitations degrade the performance of text classification. Moreover, none of the current SOTA methods perform well in the case of low-quality text.

      | Language Models | Semantics | Syntactical | Context | Out of Vocabulary |
      |---|---|---|---|---|
      | 1-Hot encoding | ✗ | ✗ | ✗ | ✗ |
      | BoW | ✗ | ✗ | ✗ | ✗ |
      | TF | ✗ | ✗ | ✗ | ✗ |
      | TF-IDF | ✗ | ✗ | ✗ | ✗ |
      | Word2Vec | ✓ | ✓ | ✗ | ✗ |
      | GloVe | ✓ | ✓ | ✗ | ✗ |
      | FastText | ✓ | ✓ | ✗ | ✓ |
      | Context2Vec | ✓ | ✓ | ✓ | ✓ |
      | CoVe | ✓ | ✓ | ✓ | ✗ |
      | ELMo | ✓ | ✓ | ✓ | ✓ |
      Table 2. Gap Analysis of Classic, Non-contextual and Contextual (Context2Vec, CoVe and ELMo) LMs
    • Universal Language Model Fine-Tuning (ULMFiT)

      Presented by Jeremy Howard of fast.ai and Sebastian Ruder of the NUI Galway Insight Center, Universal Language Model Fine-tuning (ULMFiT) (Howard and Ruder, 2018) is essentially a method to enable transfer learning and achieve excellent performance on any NLP task without training models from scratch. ULMFiT proposed two new methods within the network layers, Discriminative Fine-tuning (Discr) and Slanted Triangular Learning Rates (STLR), to enhance the transfer learning process. The approach consists of fine-tuning a pre-trained LM, trained on the Wikitext-103 dataset, on a new dataset in such a way that it does not forget what it has learned before. ULMFiT was based on the SOTA LM at the time, an LSTM-based model. In its architecture and training method, ULMFiT builds on approaches similar to CoVe and ELMo. In CoVe and ELMo, the encoder layers are frozen; ULMFiT instead describes a way to train all layers, and does so without over-fitting or running into "catastrophic forgetting", which has been more of a problem for NLP transfer learning (vs. computer vision) in part because NLP models tend to be relatively shallow. Table 2 presents the gap analysis of classic, non-contextual and contextual (Context2Vec, CoVe and ELMo) LMs.

      Figure 6. Working principle of ULMFiT
      (Image taken from(Howard and Ruder, 2018))

      ULMFiT follows three steps to obtain good results on downstream tasks: (i) general LM pre-training, (ii) target-task LM fine-tuning, and (iii) target-task classifier fine-tuning. The three training stages of ULMFiT are shown in Fig. 6.

      The LM pre-training is unsupervised and, since unlabeled text datasets are plentiful, the pre-training can be scaled up as much as possible. It still relies, however, on task-customized models. Therefore, the improvement was only gradual, as finding a better model architecture for each task remained non-trivial until the transformer-based models discussed below came into being. A schematic sketch of the ULMFiT stages is given below.
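      The following is a schematic sketch of stages (ii) and (iii) using the fastai library (v2-style API; exact signatures may differ between versions, and `df`, a dataframe with "text" and "label" columns, is an assumed input). Stage (i), general pre-training on Wikitext-103, is already done: loading the AWD_LSTM architecture fetches its pre-trained weights.

```python
from fastai.text.all import *

# Stage (ii): fine-tune the general (Wikitext-103 pre-trained) LM on the target corpus.
dls_lm = TextDataLoaders.from_df(df, text_col="text", is_lm=True, valid_pct=0.1)
lm_learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3)
lm_learn.fit_one_cycle(1, 2e-2)          # one-cycle / slanted triangular schedule
lm_learn.save_encoder("ft_encoder")

# Stage (iii): fine-tune a classifier on top of the adapted encoder.
dls_clf = TextDataLoaders.from_df(df, text_col="text", label_col="label",
                                  text_vocab=dls_lm.vocab)
clf_learn = text_classifier_learner(dls_clf, AWD_LSTM, drop_mult=0.5)
clf_learn.load_encoder("ft_encoder")
clf_learn.fit_one_cycle(1, 2e-2)         # train only the classifier head first
clf_learn.freeze_to(-2)                  # then gradually unfreeze deeper layers
clf_learn.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))   # discriminative learning rates
```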

    • Transformer-based Pre-trained Language Models

      The Transformer (Vaswani et al., 2017) has been proven to be more efficient and faster than LSTMs or CNNs for language modelling, and thus the subsequent advances in this domain rely on this architecture.

    • GPT (OpenAI Transformer): Generative Pre-Training (GPT) (Radford et al., 2018) is the first Transformer-based pre-trained LM that can effectively capture the semantics of words in context. By learning on a massive set of free-text corpora, GPT extends unsupervised language modelling to a much larger scale. Unlike ELMo, GPT uses the decoder of the transformer to model the language, as it is an auto-regressive model in which the model predicts the next word according to its previous context. GPT has shown good performance on many downstream tasks. One drawback of GPT is that it is uni-directional, i.e., the model is only trained to predict the next word from the left-to-right context. The overall model of GPT is shown in Fig. 7.

      Figure 7. Working principle of GPT
      (Image taken from(Radford et al., 2018))
    • Bidirectional Encoder Representations from Transformers (BERT)

      As seen in Fig. 8, Bidirectional Encoder Representations from Transformers (BERT) is a direct descendant of GPT: train a huge LM on free text and then fine-tune it on individual tasks without custom network architectures. BERT (Devlin et al., 2018) is another contextualised word representation LM, in which the transformer NN uses parallel attention layers rather than sequential recurrence (Vaswani et al., 2017).

      Figure 8. Working principle of BERT
      (Image taken from(Devlin et al., 2018))

      Instead of the basic language modelling task, BERT is trained with two tasks to encourage bi-directional prediction and sentence-level understanding: (1) a "masked language model" (MLM) task, where 15% of the tokens are randomly masked (i.e., replaced with the "[MASK]" token) and the model is trained to predict the masked tokens, and (2) a "next sentence prediction" (NSP) task, where a pair of sentences is provided to the model, which is trained to identify whether the second sentence follows the first. This second task is intended to capture additional information that is long-range or pragmatic.

      BERT is trained on the BooksCorpus dataset (Zhu et al., 2015) and English Wikipedia text passages. There are two pre-trained BERT models available: BERT-Base and BERT-Large. BERT can be used on un-annotated data or fine-tuned on one's task-specific data straight from the pre-trained model. The publicly accessible pre-trained models and fine-tuning code are available online. A minimal usage sketch is given below.
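      The sketch below uses the Hugging Face transformers library to (a) extract contextual BERT features, so the same word receives different vectors in different sentences, and (b) run the masked-language-modelling head described above; "bert-base-uncased" is the publicly released BERT-Base checkpoint.

```python
import torch
from transformers import AutoTokenizer, AutoModel, pipeline

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Contextual embeddings: "bank" gets a different vector in each sentence.
inputs = tokenizer(["The bank raised interest rates.",
                    "She sat on the river bank."],
                   padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)    # (batch, sequence_length, 768)

# Masked language modelling, the first of BERT's two pre-training tasks.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The man went to the [MASK].")[0])
```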

    • BERT Variants:

      Recent research also explores and strengthens the objectives and architecture of BERT. Some of these variants are briefly discussed below:

    • GPT2: The OpenAI team released a scaled-up variant of GPT in 2019 with GPT2 (Radford et al., 2018). It incorporates some slight improvements over its predecessor concerning the position of layer normalisation and the residual connections. Overall, there are four distinct GPT2 variants, with the smallest being identical to GPT, the medium one being similar in size to BERT-Large, and the xlarge one being released with 1.5B parameters as the actual GPT2 standard.

    • XLNet: XLNet, also known as Generalized Auto-regressive Pre-training for Language Understanding (Yang et al., 2019), proposes a new task to capture bidirectional context instead of the masked language task in BERT: permutation language modelling, in which permutations of each sentence are generated so that both left and right contexts are taken into consideration. In order to maintain the position information of the token to be predicted, the authors employed two-stream self-attention. XLNet was presented to overcome the pre-training/fine-tuning discrepancy and to include bidirectional contexts simultaneously.

    • RoBERTa: RoBERTa: A Robustly Optimized BERT Pre-training Approach was released in July 2019 (Liu et al., 2019). It keeps the architecture of BERT but achieves better performance by revising the training procedure, including removing the sentence-level classification task. RoBERTa made the following changes to the BERT recipe: (1) longer training of the model with larger batches and more data; (2) eliminating the NSP objective; (3) training on longer sequences; (4) changing the masking pattern dynamically during pre-training. All these changes boost the model's performance and make it competitive with XLNet's previous SOTA results.

    • ALBERT: Despite this success, BERT has some limitations: it has a huge number of parameters, which causes problems such as long pre-training times, memory management issues and model degradation (Lan et al., 2019). These issues are addressed in ALBERT, which modifies the architecture of BERT and was proposed by Lan et al. (Lan et al., 2019). ALBERT implements two parameter-reduction methods that lift the essential barriers in scaling pre-trained models: (i) factorized embedding parameterization, which decomposes the big vocabulary embedding matrix into two small matrices, and (ii) cross-layer parameter sharing, which prevents the number of parameters from growing with the network depth; in addition, it replaces the NSP loss with a sentence-order prediction (SOP) loss. These methods significantly lower the number of parameters compared with BERT without significantly affecting the performance of the model, thus increasing parameter-efficiency. An ALBERT configuration comparable to BERT-Large has 18 times fewer parameters and can be trained about 1.7 times faster. ALBERT establishes new SOTA results while having fewer parameters than BERT. A back-of-the-envelope illustration of the factorized embedding parameterization follows.
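
      As a rough, purely illustrative calculation of the first parameter-reduction method (the vocabulary size V, hidden size H and embedding size E below are typical orders of magnitude, not the exact ALBERT configuration):

```python
# Back-of-the-envelope comparison of a full V x H embedding table versus
# an ALBERT-style factorization into V x E and E x H matrices.
V, H, E = 30_000, 1_024, 128          # illustrative vocab, hidden and embedding sizes

full_table = V * H                    # single large embedding matrix
factorized = V * E + E * H            # two smaller matrices

print(f"V*H       = {full_table:,}")      # 30,720,000 parameters
print(f"V*E + E*H = {factorized:,}")      # 3,971,072 parameters
print(f"reduction ~ {full_table / factorized:.1f}x")
```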

    • Other Models: Some other recently proposed LMs include the cross-lingual LM pre-training model (XLM) (Lample and Conneau, 2019) from Facebook, which enhanced BERT for cross-lingual language modelling. The XLM authors introduced two unsupervised training objectives that only require monolingual corpora, Causal Language Modeling (CLM) and Masked Language Modeling (MLM), and demonstrated that both the CLM and MLM approaches provide powerful cross-lingual features that can be used for pre-training models. Similarly, StructBERT (Wang et al., 2019) introduced a word structural objective that randomly permutes the order of 3-grams for reconstruction and a sentence structural objective that predicts the ordering of two consecutive segments.

      DistilBERT (Sanh et al., 2019), a distilled version of BERT, reduces the size of a BERT LM by 40% while retaining 97% of its language understanding proficiency and being 60% faster. MegatronLM (Shoeybi et al., 2019), a scaled-up transformer-based model 24 times larger than BERT, trains multi-billion-parameter LMs using model parallelism. CTRL (Keskar et al., 2019), a Conditional Transformer Language Model for Controllable Generation, is a 1.63-billion-parameter conditional generative transformer LM. Another recently proposed model, ERNIE (Sun et al., 2019), Enhanced Representation through Knowledge Integration, uses knowledge masking techniques, including entity-level and phrase-level masking, instead of randomly masking tokens. The authors of ERNIE extended their work and presented ERNIE 2.0 (Sun et al., 2020), which incorporates further pre-training tasks such as semantic closeness and discourse relations. SpanBERT (Joshi et al., 2020) generalized ERNIE to mask random spans, without referring to external knowledge.

      Other prominent LMs include UniLM (Dong et al., 2019), which uses three objective functions for pre-training a transformer model: (i) language modelling (LM), (ii) masked language modelling (MLM), and (iii) sequence-to-sequence language modelling (seq2seq LM). To control which context each prediction conditions on, UniLM uses special self-attention masks. ELECTRA (Clark et al., 2020) proposed a more effective pre-training technique than BERT: rather than corrupting some positions of the input with [MASK], the authors replace some input tokens with plausible substitutes sampled from a small generator network. ELECTRA trains a discriminator to determine whether each token in the corrupted input has been substituted by the generator, and the discriminator can then be fine-tuned on downstream tasks. MASS (Song et al., 2019) is another recently proposed LM. To pre-train sequence-to-sequence models, MASS uses masked sequences, adopts an encoder-decoder framework and extends the MLM objective. Compared with baselines without pre-training or with other pre-training approaches, MASS achieves substantial improvements on a range of zero/low-resource language generation tasks, including neural machine translation (MT), text summarization and conversational response generation.

    • Text-to-Text Transfer Transformer (T5) (Raffel et al., 2019): T5 unified natural language understanding and generation by transforming all data into a text-to-text format and applying an encoder-decoder framework. T5 introduced a novel pre-training corpus and systematically compares previously proposed methods in terms of pre-training objectives, architectures, pre-training datasets and transfer techniques. T5 adopts a text-infilling objective, longer training and multi-task pre-training. For fine-tuning, T5 uses the token vocabulary of the decoder as the prediction labels.
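
      The sketch below illustrates the text-to-text framing: every task is expressed as a mapping from an input string (often with a task prefix) to a target string. The prefixes and examples are indicative of the idea rather than an exact reproduction of T5's training data.

```python
# Illustrative text-to-text task framing: (input string, target string) pairs.
examples = [
    ("translate English to German: The house is wonderful.", "Das Haus ist wunderbar."),
    ("summarize: <article text>", "<short summary>"),
    ("sentiment: the movie was painfully slow", "negative"),
]
for source, target in examples:
    print(f"input : {source}")
    print(f"target: {target}\n")
```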

    • BART (Lewis et al., 2019): For pre-training sequence-to-sequence models, BART adds noise functions beyond MLM. First, the input sequence is corrupted using an arbitrary noise function; then a transformer network reconstructs the corrupted input. BART explores a broad range of noise functions, including token masking, token deletion, text infilling, document rotation and sentence shuffling. The best performance is attained by using both sentence shuffling and text infilling. BART matches RoBERTa's performance on GLUE and SQuAD and attains SOTA results on a number of text generation tasks.
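
      A toy version of two of these noise functions, sentence shuffling and text infilling, is sketched below; the exact span-sampling distribution used by BART is not reproduced.

```python
# Toy BART-style noising: shuffle sentences and replace a random span with a mask.
import random

def shuffle_sentences(text):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(sentences) + "."

def infill(text, mask_token="<mask>"):
    tokens = text.split()
    start = random.randrange(len(tokens))
    span = random.randint(1, 3)                    # small random span length
    return " ".join(tokens[:start] + [mask_token] + tokens[start + span:])

doc = "BART corrupts the input text. A transformer decoder then reconstructs the original."
print(shuffle_sentences(doc))
print(infill(doc))
```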

      LMs      | Release Date | Architecture                        | Pre-Training Task | Corpus Used
      Word2Vec | Jan-13       | FCNN                                | -                 | Google News
      GloVe    | Oct-14       | FCNN                                | -                 | Common Crawl corpus
      FastText | Jul-16       | FCNN                                | -                 | Wikipedia
      ELMo     | Feb-18       | BiLSTM                              | BiLM              | Wiki-Text-103
      GPT      | Jun-18       | Transformer Decoder                 | LM                | Book-Corpus
      GPT-2    | Feb-19       | Transformer Decoder                 | LM                | Web-Text
      BERT     | Oct-18       | Transformer Encoder                 | MLM & NSP         | WikiEn + Book-Corpus
      RoBERTa  | Jul-19       | Transformer Encoder                 | MLM (dynamic)     | Book-Corpus + CC-News + Open-Web-Text + STORIES
      ALBERT   | Sep-19       | Transformer Encoder                 | MLM + SOP         | same as BERT
      XLNet    | Jun-19       | Auto-regressive Transformer Encoder | PLM               | WikiEn + Book-Corpus + Giga5 + Clue-Web + Common Crawl
      ELECTRA  | 2020         | Transformer Encoder                 | RTD + MLM         | same as XLNet
      UniLM    | 2020         | Transformer Encoder                 | MLM + NSP         | WikiEn + Book-Corpus
      MASS     | 2020         | Transformer                         | Seq2Seq MLM       | *Task-dependent
      BART     | 2020         | Transformer                         | DAE               | same as RoBERTa
      T5       | 2020         | Transformer                         | Seq2Seq MLM       | Colossal Clean Crawled Corpus (C4)
      Table 3. A comparison of popular Language models.

Although these models are able to solve context issues, they are trained on general-domain corpora such as Wikipedia, which limits their application to specific domains or tasks. To enhance performance in sub-domains, domain-specific transformer-based models have been proposed. Some of the most famous in the biomedical domain are SciBERT (Beltagy et al., 2019), BioBERT (Lee et al., 2019) and BioALBERT (Naseem et al., 2020b). Recently, other domain-specific models such as BERTweet (Nguyen et al., 2020) and COVID-Twitter-BERT (CT-BERT) (Müller et al., 2020) have been trained on datasets from Twitter. Domain-specific models have been shown to be useful replacements for LMs trained on general corpora for various downstream tasks. In Table 3, we present the architecture, objective function and dataset used for training these LMs.

3.3. Related work on Word representation methods

Below we present some relevant studies where different word representation models have been employed for various text classification tasks.

Pang et al. (Pang et al., 2002a) performed a binary classification task on the IMDb dataset and employed unigram, bigram and POS-tag features. For classification, they used SVM, maximum entropy and NB classifiers and found that the best results were achieved with unigrams as features and SVM as the classifier. Kwok and Wang (Kwok and Wang, 2013) used n-gram features along with an NB classifier for tweet classification. Such legacy word representation methods, including n-grams, BoW, TF and TF-IDF, have been widely used in different studies for various text classification tasks (Greevy, 2004; Davidson et al., 2017; Liu and Forss, 2014; Kouloumpis et al., 2011). These traditional methods are simple and computationally economical; however, limitations such as ignoring word order, an inability to capture semantic information, and high dimensionality restrict their use for efficient text classification.

Later, representation learning methods that learn text representations directly with neural networks (Collobert and Weston, 2008) were adopted, which improved classification results. Word embeddings from continuous word representation models such as Word2Vec and GloVe are the most famous and widely used among these methods because of the low dimensionality of their vectors and their ability to capture semantic relationships. Word representation models have also been used for sentence-level classification by averaging word vectors into a feature representation, which is then used as input for a sentence-level classifier (Castellucci et al., 2015).
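
A minimal sketch of this averaging strategy is given below; the tiny embedding dictionary stands in for Word2Vec/GloVe lookups, and its dimensionality and values are invented for illustration.

```python
# Sentence-level features obtained by averaging (toy) pre-trained word vectors.
import numpy as np

embeddings = {                                   # stand-in for Word2Vec/GloVe vectors
    "the":   np.array([0.1, 0.0, 0.2, 0.1]),
    "movie": np.array([0.7, 0.3, 0.1, 0.5]),
    "was":   np.array([0.0, 0.1, 0.0, 0.2]),
    "great": np.array([0.9, 0.8, 0.2, 0.4]),
}

def sentence_vector(tokens, emb, dim=4):
    vecs = [emb[t] for t in tokens if t in emb]  # skip out-of-vocabulary tokens
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(sentence_vector("the movie was great".split(), embeddings))
```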

Word embeddings created from unigrams and averaged embeddings cannot capture syntactic dependencies: words such as ”but” and negations can change the complete meaning of a sentence, as can long-range dependencies within a sentence (Castellucci et al., 2015). Socher et al. (Socher et al., 2013) proposed a recursive neural network which can capture and model long semantic and sentiment dependencies of words and sentences at different stages. The disadvantage of this method is that it depends on parsing, which makes it challenging to use on Twitter-related text (Foster et al., 2011). A paragraph representation model solved this issue: it learns word vectors and does not rely on parsing (Le and Mikolov, 2014). Both the recursive neural network and the paragraph representation model have been assessed on the IMDb dataset used by Pang et al. (Pang et al., 2002b), and both improved the classification results obtained with BoW features.

Deep neural network-based methods have also been used for text classification tasks. Tang et al. (Tang et al., 2016a) proposed a sentiment-specific word representation model, learned with a neural network from emoticon-labelled tweet messages. Severyn and Moschitti (Severyn and Moschitti, 2015) presented another neural network-based model in which they used Word2Vec to learn embeddings. Tweets are represented as a matrix whose columns correspond to words, thus retaining the positions in which they appear in a tweet. Emoticon-annotated data was utilized to pre-train the weights, which were then trained on hand-annotated data from the SemEval competition. The experimental results show that the pre-training step enables a better initialization of the network's weights and therefore has a positive effect on classification results. In another study, conducted by Fu et al. (Fu et al., 2017), Word2Vec was employed to obtain word representations that were forwarded to a recursive autoencoder as input for text classification. Ren et al. (Ren et al., 2016) also used Word2Vec to generate word representations and proposed a new model for the Twitter classification task. Lauren et al. (Lauren et al., 2018) presented a different document representation model in which they used the skip-gram algorithm to generate word representations for the classification of clinical texts.

Due to the limitations of small corpora, pre-trained word embeddings are preferred by researchers as input to ML models. Qin et al. (Qin et al., 2016) used pre-trained Word2Vec embeddings and forwarded these word embeddings to a CNN. Similarly, Kim (Kim, 2014) utilized pre-trained Word2Vec embeddings and forwarded them to a CNN, which improved the classification results. Camacho-Collados et al. (Camacho-Collados et al., 2016) used pre-trained embeddings for concept representation in their work. Jianqiang and Xiaolin (Jianqiang and Xiaolin, 2018) initialized word embeddings using pre-trained GloVe embeddings in their DCNN model. Similarly, Wallace (Wallace, 2017) applied GloVe and Word2Vec pre-trained word embeddings in deep neural network-based algorithms and enhanced the classification results. A study conducted by Wang et al. (Wang et al., 2016) used pre-trained GloVe embeddings as input to an LSTM with an attention model for aspect-based classification, and Liu et al. (Liu et al., 2018) employed pre-trained Word2Vec embeddings for recommending idioms in essay writing. Recently, Ilic et al. (Ilic et al., 2018) used contextual word embeddings (ELMo) for word representation for the detection of sarcasm and irony and showed that ELMo word representations improve classification results. The research community has made limited efforts to solve the above-mentioned limitations of continuous word representation models by proposing different models. For example, OOV words that are not seen during training are assigned a UNK token and thus share the same vector, which degrades results if the number of OOV words is large. Different methods to handle OOV words have been proposed (Dhingra et al., 2017; Herbelot and Baroni, 2017; Pinter et al., 2017), but these models still do not address polysemy. The issue of words with different meanings (polysemy) is addressed in models presented by (Neelakantan et al., 2015; Iacobacci et al., 2015). More recently, researchers have presented more robust models to handle OOV words and polysemy issues (Liu et al., 2015; Melamud et al., 2016a; McCann et al., 2017b; Peters et al., 2018b).
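
The UNK strategy mentioned above can be illustrated with a small lookup table: every out-of-vocabulary word is mapped to the same index and therefore receives an identical vector, which is why results degrade when many OOV words occur (the vocabulary and embedding values below are invented).

```python
# Illustration of the shared-<unk> vector problem for out-of-vocabulary words.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"<unk>": 0, "good": 1, "bad": 2, "film": 3}
embedding_matrix = rng.random((len(vocab), 50))        # toy 50-dimensional embeddings

def lookup(word):
    return embedding_matrix[vocab.get(word, vocab["<unk>"])]

# Two unseen words receive exactly the same <unk> vector.
print(np.allclose(lookup("covfefe"), lookup("yeet")))  # True
```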

To handle domain-specific problems, different studies have injected existing knowledge encoded in semantic lexicons into word embeddings, to mitigate the downsides of using pre-trained embeddings that are trained on news data, which usually differs from the data used in the target task. Models that inject external knowledge into existing word embeddings and improve results are proposed in the following studies (Faruqui et al., 2014; Speer et al., 2016; Mrksic et al., 2017; Seungil et al., 2017; Niebler et al., 2017). Word embeddings are also beneficial in areas beyond NLP, such as link prediction, information retrieval and recommendation systems. Ledell et al. (Ledell et al., 2017) proposed a model which is suitable for many of the applications mentioned above and acts as a baseline. However, none of the above-mentioned models is robust enough: they fail to integrate the sentiment of words into the representations and do not work well in domain-specific tasks such as sentiment analysis.

Studies show that adding sentiment information to conventional word representation models improves performance. To integrate sentiment information into word embeddings, researchers have proposed different hybrid word representations by modifying the existing skip-gram model (Tang et al., 2014). Tang et al. (Tang et al., 2016b) proposed several hybrid ranking models (HyRank) and developed sentiment embeddings based on C&W, which consider the context and sentiment polarity of tweets. Similarly, several other models have been presented which consider the context and sentiment polarity of words for sentiment analysis (Tang et al., 2015; Liang-Chih et al., 2018; Rezaeinia et al., 2017a). Yu et al. (Liang-Chih et al., 2018) proposed sentiment embeddings by refining pre-trained embeddings using the intensity scores of an external knowledge resource. Rezaeinia et al. (Rezaeinia et al., 2017b) proposed improved word vectors (IWV) by combining word embeddings, part-of-speech (POS) tags and a combination of lexicons for sentiment analysis. Recently, Cambria et al. (Cambria et al., 2018) proposed context embeddings for sentiment analysis by discovering conceptual primitives from text and linking them with commonsense concepts and named entities.

Recent studies have used these contextual and transformer-based LMs in their models for various NLP tasks. Furthermore, various studies have employed domain-specific LMs for different NLP tasks. These hybrid and domain-specific LMs have improved performance and take complex word attributes, such as semantics, OOV handling, context and syntax, into account in various NLP tasks.
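
As a concrete, hedged example of how such pre-trained transformer LMs are typically plugged into downstream models, the sketch below extracts contextual sentence features with the Hugging Face transformers library; the "bert-base-uncased" checkpoint and the example sentences are illustrative choices only.

```python
# Using a pre-trained transformer LM as a contextual feature extractor.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

texts = ["the service was excellent", "the battery died within an hour"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state   # (batch, seq_len, hidden_size)

features = hidden[:, 0, :]   # [CLS] vectors, usable as inputs to a downstream classifier
print(features.shape)        # e.g. torch.Size([2, 768])
```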

4. Classification techniques

Choosing an appropriate classifier is one of the main steps in a text classification task. Without comprehensive knowledge of each algorithm, we cannot identify the most effective model for the task. Out of the many ML algorithms used in text classification, we present some famous and commonly used ones, which are also widely applied to sentiment classification: Naïve Bayes (NB), support vector machine (SVM), logistic regression (LR), tree-based classifiers such as decision tree (DT) and random forest (RF), and neural network-based (DL) algorithms. Table 4 presents the pros and cons of these classification algorithms.

4.1. ML based classifiers

  • Naive Bayes (NB) classifiers: The Naive Bayes (NB) classifiers are a group of classification algorithms based on Bayes' theorem, presented by Thomas Bayes (Hill, 1968). All Naive Bayes algorithms share the same assumption, i.e., each pair of features being classified is independent of the others. NB classification algorithms are widely used for information retrieval (Qu et al., 2018) and many text classification tasks (Pak and Paroubek, 2010; Melville et al., 2009). Naive Bayes classifiers are called ”naive” because they assume that every feature is independent of the other features in the input, whereas in reality the words and phrases in a sentence are highly interrelated: the meaning and sentiment depend on the positions of the words and can change if those positions change.

    NB classifiers are derived from Bayes' theorem: given $n$ documents to be classified into $z$ classes, the predicted label for a document $d$ is the class $\hat{c} = \arg\max_{c} P(c \mid d)$. Bayes' theorem is given as follows:

    $$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)},$$

    where $d$ denotes a document and $c$ refers to a class. In simple words, the NB algorithm takes each word in the training data and calculates the probability of that word belonging to each class. Once the probabilities of every word are calculated, the classifier is ready to classify new data by utilizing the probabilities computed during the training phase. The advantages of NB classifiers are that they are scalable, suitable when the dimension of the input is high, simple to implement, computationally inexpensive, able to work well when little training data is available, and can often outperform other classification algorithms. The disadvantages are that NB classifiers make a strong assumption about the shape of the data distribution, i.e. that any two features are independent given the output class, which yields poor results when this assumption does not hold (Soheily-Khah et al., 2017; Wang et al., 2012b). Another limitation of NB classifiers arises from data scarcity: for any value of a feature, the likelihood has to be approximated by a frequentist estimate, so a feature value that never co-occurs with a class in the training data receives a zero likelihood, which drives the whole class probability to zero (the zero-frequency problem). A minimal scikit-learn sketch of an NB text classifier follows this item.
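
    The sketch below shows this train-then-predict flow with scikit-learn's multinomial NB over simple count features; the four-document corpus and its labels are invented for the example.

```python
# Minimal Naive Bayes text classifier (bag-of-words counts + MultinomialNB).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["great film, loved it", "terrible plot and acting",
         "what a wonderful movie", "worst film of the year"]
labels = ["pos", "neg", "pos", "neg"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)                              # estimates word likelihoods and class priors
print(clf.predict(["a wonderful, great story"]))    # expected: ['pos']
```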

  • Support vector machine (SVM): The support vector machine (SVM) classifiers are among the most famous and commonly used algorithms for text classification because of their good performance. SVM is a non-probabilistic binary linear classification algorithm which works by plotting the training data in a multi-dimensional space. SVM then separates the classes with a hyper-plane. The algorithm adds a new dimension if the classes cannot be separated linearly in the current space, and this process continues until the training data can be separated into two different classes.

    The advantage of SVM classifiers is that the results they obtain are usually better. The disadvantages of SVM algorithms are that choosing a suitable kernel is not easy, training time is long for extensive data, and more computational resources are required. A corresponding scikit-learn sketch follows below.
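
    For comparison with the NB sketch above, the example below trains a linear SVM on TF-IDF features, a common configuration for text classification; the toy data is the same invented corpus.

```python
# Minimal linear SVM text classifier (TF-IDF features + LinearSVC).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["great film, loved it", "terrible plot and acting",
         "what a wonderful movie", "worst film of the year"]
labels = ["pos", "neg", "pos", "neg"]

svm_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
svm_clf.fit(texts, labels)
print(svm_clf.predict(["terrible acting, worst movie"]))   # expected: ['neg']
```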

    Classifier | Pros | Cons
    NB  | i) Less computational time. ii) Easy to understand and implement. iii) Can easily be trained with less data. | i) Relies strongly on the independence of the class features and does not perform well if this condition is not met. ii) Zero conditional probability for zero-frequency features, which makes the total probability zero.
    SVM | i) Effective in higher dimensions. ii) Can model a non-linear decision boundary. iii) Robust to the issue of over-fitting. | i) More computational time for large datasets. ii) Kernel selection is difficult. iii) Does not perform well in the case of overlapping classes.
    LR  | i) Easy and simple to implement. ii) Less computationally expensive. iii) Does not need tuning or features to be uniformly distributed. | i) Fails in the case of non-linear problems. ii) Needs large datasets. iii) Predicts results on the basis of independent variables.
    DT  | i) Interpretable and easy to understand. ii) Less pre-processing required. iii) Fast, with almost zero hyper-parameters to tune. | i) High chance of over-fitting. ii) Lower prediction accuracy compared to others. iii) Complex calculations for a large number of classes.
    RF  | i) Fast to train, flexible and gives good results. ii) Less variance than a single DT. iii) Less pre-processing required. | i) Not easy or simple to interpret. ii) Requires more computational resources. iii) Requires more time to predict compared to others.
    DL  | i) Fast predictions once training is complete. ii) Works well in the case of huge data. iii) Flexible architecture that can be utilized for classification and regression tasks. | i) Requires a large amount of data. ii) Computationally expensive and time-consuming. iii) DL-based classifiers are black boxes (the issue of model interpretability exists).
    Table 4. Comparison of Classification Algorithms
  • Logistic Regression (LR) classifier: Logistic regression (LR) is a statistical model and one of the earliest techniques used for classification. LR predicts probabilities rather than classes (Fan et al., 2008; Genkin et al., 2007), for example the occurrence of an event such as win/lose or healthy/sick. This can be extended to model many classes of events, such as deciding whether an image contains a cat, duck, cow, etc.; every object identified in the image would be given a probability between 0 and 1, with the probabilities summing to one. LR predicts results on the basis of a set of independent variables; if the wrong independent variables are included, the model will not predict good results. It works well for categorical outcomes but fails for continuous ones. LR also requires every data point to be independent of all others; if the observations are interrelated, the classifier will not predict good results.

  • Decision Tree (DT) classifier: The decision tree (DT) was presented by Magerman (1995) and developed by Quinlan (1986). It is one of the earliest classification models for text and data mining and has been employed successfully in different areas for classification tasks (Morgan and Sonquist, 1963). The main intuition behind this idea is to build a tree over the attributes of the data points, the major question being which feature should be a parent and which should be at the child level. A DT classifier contains root, decision and leaf nodes, which denote the dataset, carry out the computation and perform the classification, respectively. During the training phase, the classifier learns the decisions that need to be executed to separate the labelled categories. To classify an unknown instance, the data is passed through the tree. A particular feature from the input text is compared with a threshold learned during the training stage: the computation at each decision node checks whether the chosen feature is greater than or less than this threshold, which creates a two-way split in the tree. The text passes through these decision nodes until it reaches a leaf node that determines its assigned class.

    The advantages of the DT classifier are that the number of hyper-parameters requiring tuning is nearly zero, it is easy to describe, and it can be understood easily through its visualizations. The significant disadvantages are that it is sensitive to minor changes in the data (Giovanelli et al., 2017), has a tendency to overfit (Quinlan, 1987), involves complex computations when there is a large number of class labels, and has difficulties with out-of-sample prediction.

  • Random Forest (RF) Classifier: Random forest (RF) is an ensemble learning technique for text classification which combines the results of several trained models to give a better classifier and performance than a single model. Ho (1998) proposed the RF classifier, which is simple to understand and also gives good classification results. An RF classifier is composed of a number of DT classifiers, where every tree is trained on a bootstrapped subset of the training text. An arbitrary subset of the features is selected at every decision node, and the model only examines this part of the features. The primary issue with using a single tree is that it has high variance, so the arrangement of the training data and features can impact its results.

    This classifier is quick to train on textual data but slow in giving predictions once trained (Bansal et al., 2018). It performs well with both categorical and continuous variables, can automatically handle missing values, is robust to outliers and is less affected by noise. On the other hand, training a vast number of trees can be computationally expensive, requires more training time and uses a lot of memory.

4.2. Deep learning based classifiers

DL based models are motivated by the working of the human brain. They have attained SOTA results in many different areas (Rehman et al., 2019; Naseem et al., 2020a), including NLP. They require a large amount of training data to achieve a semantically good representation of textual data. DL models have attained excellent results compared to classical ML models on different classification tasks. The main DL architectures commonly used in text classification tasks are briefly discussed below.

  • Recurrent Neural Network (RNN): RNN is one of the popular neural network-based models widely used for different text classification tasks (Sutskever et al., 2011; Mandic and Chambers, 2001). Previous data points of a sequence are given weight in an RNN model, which makes it useful for classifying text, strings or other sequential data. RNN models propagate information from previous layers/nodes in a way that makes them well suited to the semantic analysis of a corpus. The gated recurrent unit (GRU) and long short-term memory (LSTM) are the most common types of RNNs used for text classification.

    One of the drawbacks of RNNs is that they are sensitive to the vanishing and exploding gradient problems when the error is back-propagated (Bengio et al., 1994).

  • Long Short-Term Memory (LSTM): LSTM was presented by Hochreiter and Schmidhuber (1997) to address the gradient problems of RNNs by preserving long-term dependencies better than standard RNNs; it is more effective at overcoming the vanishing gradient issue (Pascanu et al., 2013). Even though LSTMs have a chain-like architecture similar to RNNs, they use several gates that carefully regulate the amount of information allowed through each node state. The role of each gate and node in a basic LSTM cell is described by the following equations.


    $$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i),$$
    $$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c),$$
    $$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f),$$
    $$c_t = i_t \odot \tilde{c}_t + f_t \odot c_{t-1},$$
    $$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o),$$
    $$h_t = o_t \odot \tanh(c_t),$$

    where $i_t$, $\tilde{c}_t$ and $f_t$ denote the input gate, candidate memory cell and forget gate activations respectively, $c_t$ computes the new memory cell value, $o_t$ and $h_t$ represent the final output gate and hidden output, $b$ denotes a bias vector, $W$ and $U$ denote weight matrices, and $x_t$ denotes the input to the memory cell at time $t$.

  • Gated Recurrent Unit (GRU): GRU is another type of RNN, presented by Chung et al. (2014) and Cho et al. (2014). GRU is a simplified form of the LSTM architecture; however, it includes only two gates and does not contain an internal memory cell, which distinguishes it from LSTM. Also, in GRU, no second non-linearity (tanh) is applied. The working of a GRU cell is given below, followed by a PyTorch sketch of such recurrent classifiers:


    $$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z),$$
    $$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r),$$
    $$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big),$$

    where $z_t$ denotes the update gate at time $t$, $x_t$ represents the input vector, $W$, $U$ and $b$ denote the parameter matrices and bias vectors, $\sigma$ is an activation function (sigmoid or ReLU), $r_t$ represents the reset gate, $h_t$ is the output vector, and $\tanh$ denotes the hyperbolic tangent function.
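
    A compact PyTorch sketch of such recurrent text classifiers is given below; all sizes are illustrative, and swapping nn.LSTM for nn.GRU (which returns only a hidden state, without a cell state) gives the GRU variant.

```python
# Minimal LSTM-based text classifier in PyTorch (sizes are illustrative).
import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, emb_dim=100, hidden=128, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        emb = self.embed(token_ids)
        _, (h_n, _) = self.rnn(emb)               # final hidden state: (1, batch, hidden)
        return self.fc(h_n[-1])                   # class logits: (batch, n_classes)

model = RNNClassifier()
dummy_batch = torch.randint(0, 10_000, (4, 20))   # 4 toy sequences of 20 token ids
print(model(dummy_batch).shape)                   # torch.Size([4, 2])
```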

  • Convolutional Neural Networks (CNN): Another famous DL architecture is the CNN, which is often used for hierarchical classification (Jaderberg et al., 2014). CNNs were originally built and used for image classification, but over time they have shown excellent results for text classification as well (Lecun et al., 1998). In image classification, an image tensor is convolved with a set of kernels of a given size. The outputs of the convolution layers in a CNN are known as feature maps, and multiple filters can be stacked. To overcome the computational cost of high dimensionality, CNNs use pooling layers to reduce the size from one layer to the next. Different pooling methods have been proposed by researchers to decrease the output size without losing features (Scherer et al., 2010).

    Max pooling is the most common pooling technique, where the maximum element in the pooling window is selected. To feed the pooled output from the stacked feature maps to the next layer, the features are flattened into one column. Usually, the last layer of a CNN is fully connected. Weights and feature filters are adjusted during the backpropagation step. The number of channels is the major issue for CNNs in text classification: it is very small for images (three RGB channels) but can be very large for text, which makes the dimensionality very high (Johnson and Zhang, 2014). A minimal text-CNN sketch follows below.
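
    The sketch below follows this convolution-plus-max-pooling recipe for text with a single filter width; filter sizes and dimensions are illustrative.

```python
# Minimal 1-D convolutional text classifier (embedding -> Conv1d -> max pooling).
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=10_000, emb_dim=100, n_filters=64,
                 kernel_size=3, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size)
        self.fc = nn.Linear(n_filters, n_classes)

    def forward(self, token_ids):                    # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)    # (batch, emb_dim, seq_len)
        x = torch.relu(self.conv(x))                 # feature maps
        x = torch.max(x, dim=2).values               # max pooling over time
        return self.fc(x)                            # class logits

model = TextCNN()
dummy_batch = torch.randint(0, 10_000, (4, 20))
print(model(dummy_batch).shape)                      # torch.Size([4, 2])
```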

5. Evaluation Metrics

In terms of evaluating text classification models, accuracy, precision, recall and F1 score are the most commonly used metrics for assessing text classification methods. Below we briefly discuss each of these.

Confusion matrix: A confusion matrix is a table used to summarize the performance of a classification algorithm. We present the confusion matrix in Table 5; details are given below:

                   | Actual Positive     | Actual Negative
Predicted Positive | True Positive (TP)  | False Positive (FP)
Predicted Negative | False Negative (FN) | True Negative (TN)
Table 5. Confusion Matrix
  • True Positives (TP): TP are the accurately predicted positive instances.

  • True Negatives (TN): TN are the accurately predicted negative instances.

  • False Positives (FP): FP are negative instances that are wrongly predicted as positive.

  • False Negatives (FN): FN are positive instances that are wrongly predicted as negative.

Once we understand the confusion matrix and its parameters, then we can define and understand evaluation metrics easily, briefly explained below:

  • Accuracy: Accuracy is the simple ratio of correctly predicted observations to the total observations and is given by
    $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$

  • Precision: Precision is the ratio of true positive (TP) observations to all positive predictions (TP + FP) and is given by
    $$\text{Precision} = \frac{TP}{TP + FP}.$$

  • Recall: Recall is the ratio of true positive (TP) observations to all actual positive observations (TP + FN) and is given by
    $$\text{Recall} = \frac{TP}{TP + FN}.$$

  • F1 score: The harmonic mean of recall and precision is known as the F1 score, which means the F1 score takes both FPs and FNs into account. It is given by
    $$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$
    A short scikit-learn example computing these metrics follows this list.
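
The metrics above can be computed directly from predicted and true labels, for example with scikit-learn (the label vectors below are invented for illustration):

```python
# Computing accuracy, precision, recall and F1 for a toy set of predictions.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # one false negative and one false positive

print("accuracy :", accuracy_score(y_true, y_pred))    # 0.75
print("precision:", precision_score(y_true, y_pred))   # 0.75
print("recall   :", recall_score(y_true, y_pred))      # 0.75
print("F1 score :", f1_score(y_true, y_pred))          # 0.75
```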

6. Applications

Since the early days of ML and AI, such representation methods have been widely used to extract features for text classification tasks, especially in information retrieval systems. As technology has advanced, these LMs have come to be used in many domains such as medicine, social sciences, healthcare, psychology, law and engineering. They have been applied across many text-related tasks, including information retrieval, sentiment analysis, recommender systems, summarization, question answering, machine translation, named entity recognition, and adversarial attacks and defenses.

7. Conclusion

In this survey, we have introduced various algorithms that enable us to capture the rich information in text data and represent it as vectors for traditional frameworks. We first discussed classical methods of text representation, which mostly involve feature engineering, followed by DL-based models. DL techniques have attracted much attention in recent years and are especially well known for their capability to address problems in computer vision and speech recognition. The great success of DL stems from its use of multiple layers of nonlinear processing units to learn multiple layers of feature representations of the data, where different layers correspond to different levels of abstraction. DL methods not only show a powerful capability for semantic analysis of text data but can also be used successfully in a number of text classification and natural language processing tasks. We discussed different word embedding methods such as Word2Vec, GloVe and FastText, and contextual word vectors such as Context2Vec, CoVe and ELMo. Finally, we presented the current SOTA transformer-based models trained on general corpora as well as domain-specific transformer-based LMs. These LMs are still in their developing phase, but we expect deep learning-based NLP research to be driven in the direction of making better use of unlabeled data. We expect such a trend to continue with more and better model designs. We expect to see more NLP applications that employ reinforcement learning methods, e.g., dialogue systems. We also expect to see more research on multi-modal learning, as, in the real world, language is often grounded in (or correlated with) other signals.

Footnotes

  8. http://norvig.com/spell-correct.html
  9. http://noisy-text.github.io/
  10. https://github.com/google-research/bert

References

  1. Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca J. Passonneau. 2011. Sentiment Analysis of Twitter Data.
  2. Charu C Aggarwal and Chandan K Reddy. 2013. Data clustering: algorithms and applications. CRC Press.
  3. Charu C. Aggarwal and ChengXiang Zhai. 2012. A Survey of Text Classification Algorithms. In Mining Text Data.
  4. Edgar Altszyler, Mariano Sigman, and Diego Fernández Slezak. 2016. Comparative study of LSA vs Word2vec embeddings in small corpora: a case study in dreams database. ArXiv abs/1610.01520 (2016).
  5. Alexandra Balahur. 2013. Sentiment Analysis in Social Media Texts. In WASSA@NAACL-HLT.
  6. Jorge A. Balazs and Juan D. Velásquez. 2016. Opinion Mining and Information Fusion: A survey. Information Fusion 27 (2016), 95–110.
  7. Himani Bansal, Gulshan Shrivastava, Nguyen Nhu, and Loredana STANCIU. 2018. Social Network Analytics for Contemporary Business Organizations. https://doi.org/10.4018/978-1-5225-5097-6
  8. Yanwei Bao, Changqin Quan, Lijuan Wang, and Fuji Ren. 2014. The Role of Pre-processing in Twitter Sentiment Analysis. In Intelligent Computing Methodologies, De-Shuang Huang, Kang-Hyun Jo, and Ling Wang (Eds.). Springer International Publishing, Cham, 615–624.
  9. Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. arXiv:1903.10676 [cs.CL]
  10. Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35, 8 (2013), 1798–1828.
  11. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A Neural Probabilistic Language Model. J. Mach. Learn. Res. 3 (March 2003), 1137–1155. http://dl.acm.org/citation.cfm?id=944919.944966
  12. Y. Bengio, P. Simard, and P. Frasconi. 1994. Learning Long-term Dependencies with Gradient Descent is Difficult. Trans. Neur. Netw. 5, 2 (March 1994), 157–166. https://doi.org/10.1109/72.279181
  13. Adam Bermingham and Alan Smeaton. 2011. On Using Twitter to Monitor Political Sentiment and Predict Election Results. In Proceedings of the Workshop on Sentiment Analysis where AI meets Psychology (SAAIP 2011). Asian Federation of Natural Language Processing, Chiang Mai, Thailand, 2–10. https://www.aclweb.org/anthology/W11-3702
  14. Marina Boia, Boi Faltings, Claudiu Cristian Musat, and Pearl Pu. 2013. A :) Is Worth a Thousand Words: How People Attach Sentiment to Emoticons and Words in Tweets. 2013 International Conference on Social Computing (2013), 345–350.
  15. Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016a. Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606 (2016).
  16. Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016b. Enriching Word Vectors with Subword Information. CoRR abs/1607.04606 (2016). arXiv:1607.04606 http://arxiv.org/abs/1607.04606
  17. Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., 4349–4357. http://papers.nips.cc/paper/6228-man-is-to-computer-programmer-as-woman-is-to-homemaker-debiasing-word-embeddings.pdf
  18. José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. Artificial Intelligence 240 (2016), 36 – 64. https://doi.org/10.1016/j.artint.2016.07.005
  19. Erik Cambria, Soujanya Poria, Devamanyu Hazarika, and Kenneth Kwok. 2018. SenticNet 5: Discovering Conceptual Primitives for Sentiment Analysis by Means of Context Embeddings. In AAAI.
  20. Xavier Carreras and Lluís Màrquez. 2001. Boosting Trees for Anti-Spam Email Filtering. CoRR cs.CL/0109015 (2001). http://arxiv.org/abs/cs.CL/0109015
  21. Giuseppe Castellucci, Danilo Croce, and Roberto Basili. 2015. Acquiring a Large Scale Polarity Lexicon Through Unsupervised Distributional Methods. In Natural Language Processing and Information Systems, Chris Biemann, Siegfried Handschuh, André Freitas, Farid Meziane, and Elisabeth Métais (Eds.). Springer International Publishing, Cham, 73–86.
  22. Arda Celebi and Arzucan Ozgur. 2016. Segmenting Hashtags using Automatically Created Training Data.
  23. Wei James Chen, Xiaoshen Xie, Jiale Wang, Biswajeet Pradhan, Haoyuan Hong, Dieu Tien Bui, Zhao Duan, and Jianquan Ma. 2017. A comparative study of logistic model tree, random forest, and classification and regression tree models for spatial prediction of landslide susceptibility.
  24. Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 1724–1734. https://doi.org/10.3115/v1/D14-1179
  25. Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. CoRR abs/1412.3555 (2014). arXiv:1412.3555 http://arxiv.org/abs/1412.3555
  26. Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555 (2020).
  27. Ronan Collobert and Jason Weston. 2008. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In Proceedings of the 25th International Conference on Machine Learning (Helsinki, Finland) (ICML ’08). ACM, New York, NY, USA, 160–167. https://doi.org/10.1145/1390156.1390177
  28. Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural Language Processing (almost) from Scratch. CoRR abs/1103.0398 (2011). arXiv:1103.0398 http://arxiv.org/abs/1103.0398
  29. Thomas Davidson, Dana Warmsley, Michael W. Macy, and Ingmar Weber. 2017. Automated Hate Speech Detection and the Problem of Offensive Language. CoRR abs/1703.04009 (2017). arXiv:1703.04009 http://arxiv.org/abs/1703.04009
  30. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs/1810.04805 (2018). arXiv:1810.04805 http://arxiv.org/abs/1810.04805
  31. Bhuwan Dhingra, Hanxiao Liu, Ruslan Salakhutdinov, and William W. Cohen. 2017. A Comparative Study of Word Embeddings for Reading Comprehension. CoRR abs/1703.00993 (2017). arXiv:1703.00993 http://arxiv.org/abs/1703.00993
  32. Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems. 13063–13075.
  33. Cícero Nogueira dos Santos and Maíra A. de C. Gatti. 2014. Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts. In COLING.
  34. Ábel Elekes, Adrian Englhardt, Martin Schäler, and Klemens Böhm. 2018. Toward meaningful notions of similarity in NLP embedding models. International Journal on Digital Libraries (04 2018). https://doi.org/10.1007/s00799-018-0237-y
  35. Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. J. Mach. Learn. Res. 9 (June 2008), 1871–1874. http://dl.acm.org/citation.cfm?id=1390681.1442794
  36. Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard H. Hovy, and Noah A. Smith. 2014. Retrofitting Word Vectors to Semantic Lexicons. CoRR abs/1411.4166 (2014). arXiv:1411.4166 http://arxiv.org/abs/1411.4166
  37. Jennifer Foster, Özlem Çetinoğlu, Joachim Wagner, Joseph Le Roux, Joakim Nivre, Deirdre Hogan, and Josef van Genabith. 2011. From News to Comment: Resources and Benchmarks for Parsing the Language of Web 2.0. In Proceedings of 5th International Joint Conference on Natural Language Processing. Asian Federation of Natural Language Processing, Chiang Mai, Thailand, 893–901. https://www.aclweb.org/anthology/I11-1100
  38. Xianghua Fu, Wangwang Liu, Yingying Xu, and Laizhong Cui. 2017. Combine HowNet lexicon to train phrase recursive autoencoder for sentence-level sentiment analysis. Neurocomputing 241 (2017), 18–27.
  39. Alexander Genkin, David D Lewis, and David Madigan. 2007. Large-Scale Bayesian Logistic Regression for Text Categorization. Technometrics 49, 3 (2007), 291–304. https://doi.org/10.1198/004017007000000245 arXiv:https://doi.org/10.1198/004017007000000245
  40. Anastasia Giachanou, Julio Gonzalo, Ida Mele, and Fabio Crestani. 2017. Sentiment Propagation for Predicting Reputation Polarity. https://doi.org/10.1007/978-3-319-56608-5_18
  41. Kevin Gimpel, Nathan Schneider, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. [n.d.]. Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments.
  42. Christian Giovanelli, Xin U. Liu, Seppo Antero Sierla, Valeriy Vyatkin, and Ryutaro Ichise. 2017. Towards an aggregator that exploits big data to bid on frequency containment reserve market. IECON 2017 - 43rd Annual Conference of the IEEE Industrial Electronics Society (2017), 7514–7519.
  43. Edel Greevy. 2004. Automatic text categorisation of racist webpages.
  44. Vishal Gupta and Gurpreet Lehal. 2009. A Survey of Text Mining Techniques and Applications. Journal of Emerging Technologies in Web Intelligence 1 (08 2009). https://doi.org/10.4304/jetwi.1.1.60-76
  45. Emma Haddi, Xiaohui Liu, and Yong Shi. 2013. The Role of Text Pre-processing in Sentiment Analysis. In ITQM.
  46. Khaled M. Hammouda and Mohamed S. Kamel. 2004. Efficient Phrase-Based Document Indexing for Web Document Clustering. IEEE Trans. on Knowl. and Data Eng. 16, 10 (Oct. 2004), 1279–1296. https://doi.org/10.1109/TKDE.2004.58
  47. Yulan He, Chenghua Lin, and Harith Alani. 2011. Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Portland, Oregon, USA, 123–131. https://www.aclweb.org/anthology/P11-1013
  48. Aurélie Herbelot and Marco Baroni. 2017. High-risk learning: acquiring new word vectors from tiny data. CoRR abs/1707.06556 (2017). arXiv:1707.06556 http://arxiv.org/abs/1707.06556
  49. Bruce M. Hill. 1968. Posterior Distribution of Percentiles: Bayes’ Theorem for Sampling from a Population. J. Amer. Statist. Assoc. 63, 322 (1968), 677–691. http://www.jstor.org/stable/2284038
  50. Tin Kam Ho. 1998. The Random Subspace Method for Constructing Decision Forests. IEEE Trans. Pattern Anal. Mach. Intell. 20, 8 (Aug. 1998), 832–844. https://doi.org/10.1109/34.709601
  51. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (Nov. 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
  52. Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, 328–339. https://doi.org/10.18653/v1/P18-1031
  53. Xia Hu and Huan Liu. 2012. Text Analytics in Social Media. Springer US, Boston, MA, 385–414. https://doi.org/10.1007/978-1-4614-3223-4_12
  54. Ah hwee Tan. 1999. Text Mining: The state of the art and the challenges. In In Proceedings of the PAKDD 1999 Workshop on Knowledge Disocovery from Advanced Databases. 65–70.
  55. Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. SensEmbed: Learning Sense Embeddings for Word and Relational Similarity. In ACL.
  56. Suzana Ilic, Edison Marrese-Taylor, Jorge A. Balazs, and Yutaka Matsuo. 2018. Deep contextualized word representations for detecting sarcasm and irony. CoRR abs/1809.09795 (2018). arXiv:1809.09795 http://arxiv.org/abs/1809.09795
  57. Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Reading Text in the Wild with Convolutional Neural Networks. CoRR abs/1412.1842 (2014). arXiv:1412.1842 http://arxiv.org/abs/1412.1842
  58. Zhao Jianqiang. 2015. Pre-processing Boosting Twitter Sentiment Analysis? 748–753. https://doi.org/10.1109/SmartCity.2015.158
  59. Zhao Jianqiang and Gui Xiaolin. 2017. Comparison Research on Text Pre-processing Methods on Twitter Sentiment Analysis. IEEE Access 5 (2017), 2870–2879.
  60. Zhao Jianqiang and Gui Xiaolin. 2018. Deep Convolution Neural Networks for Twitter Sentiment Analysis. IEEE Access PP (01 2018), 1–1. https://doi.org/10.1109/ACCESS.2017.2776930
  61. Rie Johnson and Tong Zhang. 2014. Effective Use of Word Order for Text Categorization with Convolutional Neural Networks. CoRR abs/1412.1058 (2014). arXiv:1412.1058 http://arxiv.org/abs/1412.1058
  62. Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. 2020. Spanbert: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics 8 (2020), 64–77.
  63. Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of Tricks for Efficient Text Classification. CoRR abs/1607.01759 (2016). arXiv:1607.01759 http://arxiv.org/abs/1607.01759
  64. Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A Conditional Transformer Language Model for Controllable Generation. arXiv:1909.05858 [cs.CL]
  65. Farhan Hassan Khan, Saba Bashir, and Usman Qamar. 2014. TOM: Twitter Opinion Mining Framework Using Hybrid Classification Scheme. Decis. Support Syst. 57 (Jan. 2014), 245–257. https://doi.org/10.1016/j.dss.2013.09.004
  66. Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. arXiv preprint arXiv:1408.5882 (2014).
  67. Vandana Korde and C. Namrata Mahender. 2012. Text Classification and Classifiers: A Survey.
  68. Efthymios Kouloumpis, Theresa Wilson, and Johanna D. Moore. 2011. Twitter Sentiment Analysis: The Good the Bad and the OMG!. In ICWSM.
  69. Kamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura E. Barnes, and Donald E. Brown. 2019. Text Classification Algorithms: A Survey. CoRR abs/1904.08067 (2019). arXiv:1904.08067 http://arxiv.org/abs/1904.08067
  70. Irene Kwok and Yuzhou Wang. 2013. Locate the Hate: Detecting Tweets Against Blacks. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (Bellevue, Washington) (AAAI’13). AAAI Press, 1621–1622. http://dl.acm.org/citation.cfm?id=2891460.2891697
  71. Guillaume Lample and Alexis Conneau. 2019. Cross-lingual Language Model Pretraining. CoRR abs/1901.07291 (2019). arXiv:1901.07291 http://arxiv.org/abs/1901.07291
  72. Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv:1909.11942 [cs.CL]
  73. Ray R. Larson. 2010. Introduction to Information Retrieval. J. Am. Soc. Inf. Sci. Technol. 61, 4 (April 2010), 852–853. https://doi.org/10.1002/asi.v61:4
  74. Paula Lauren, Guangzhi Qu, Feng Zhang, and Amaury Lendasse. 2018. Discriminant document embeddings with an extreme learning machine for classifying clinical narratives. Neurocomputing 277 (2018), 129–138. https://doi.org/10.1016/j.neucom.2017.01.117
  75. Quoc V. Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. CoRR abs/1405.4053 (2014). arXiv:1405.4053 http://arxiv.org/abs/1405.4053
  76. Yann LeCun, Y Bengio, and Geoffrey Hinton. 2015. Deep Learning. Nature 521 (05 2015), 436–44. https://doi.org/10.1038/nature14539
  77. Yann Lecun, Leon Bottou, Y Bengio, and Patrick Haffner. 1998. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 86 (12 1998), 2278 – 2324. https://doi.org/10.1109/5.726791
  78. Ledell, Adam Fisch, Sumit Chopra, Keith Adams, Antoine Bordes, and Jason Weston. 2017. StarSpace: Embed All The Things! arXiv e-prints, Article arXiv:1709.03856 (Sep 2017), arXiv:1709.03856 pages. arXiv:1709.03856 [cs.CL]
  79. Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. arXiv:1901.08746 [cs.CL]
  80. Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019).
  81. Liang-Chih, Jin Wang, K. Robert Lai, and Xuejie Zhang. 2018. Refining Word Embeddings Using Intensity Scores for Sentiment Analysis. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26, 3 (March 2018), 671–681. https://doi.org/10.1109/TASLP.2017.2788182
  82. Chenghua Lin and Yulan He. 2009. Joint Sentiment/Topic Model for Sentiment Analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (Hong Kong, China) (CIKM ’09). ACM, New York, NY, USA, 375–384. https://doi.org/10.1145/1645953.1646003
  83. Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2015. Learning Context-sensitive Word Embeddings with Neural Tensor Skip-gram Model. In Proceedings of the 24th International Conference on Artificial Intelligence (Buenos Aires, Argentina) (IJCAI’15). AAAI Press, 1284–1290. http://dl.acm.org/citation.cfm?id=2832415.2832428
  84. Shuhua Liu and Thomas Forss. 2014. Combining N-gram Based Similarity Analysis with Sentiment Analysis in Web Content Classification. In Proceedings of the International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1 (Rome, Italy) (IC3K 2014). SCITEPRESS - Science and Technology Publications, Lda, Portugal, 530–537. https://doi.org/10.5220/0005170305300537
  85. Yuanchao Liu, Bingquan Liu, Lili Shan, and Xin Wang. 2018. Modelling context with neural networks for recommending idioms in essay writing. Neurocomputing 275 (2018), 2287 – 2293. https://doi.org/10.1016/j.neucom.2017.11.005
  86. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019). arXiv:1907.11692 http://arxiv.org/abs/1907.11692
  87. David M. Magerman. 1995. Statistical Decision-tree Models for Parsing. In Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics (Cambridge, Massachusetts) (ACL ’95). Association for Computational Linguistics, Stroudsburg, PA, USA, 276–283. https://doi.org/10.3115/981658.981695
  88. Danilo P. Mandic and Jonathon Chambers. 2001. Recurrent Neural Networks for Prediction: Learning Algorithms,Architectures and Stability. John Wiley & Sons, Inc., New York, NY, USA.
  89. Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford Corenlp Natural Language Processing Toolkit.. In ACL (System Demonstrations). 55–60.
  90. Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017a. Learned in Translation: Contextualized Word Vectors. CoRR abs/1708.00107 (2017). arXiv:1708.00107 http://arxiv.org/abs/1708.00107
  91. Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017b. Learned in Translation: Contextualized Word Vectors. In NIPS.
  92. Yelena Mejova and Padmini Srinivasan. 2011. Exploring Feature Definition and Selection for Sentiment Classifiers.
  93. Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016a. context2vec: Learning Generic Context Embedding with Bidirectional LSTM. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. Association for Computational Linguistics, Berlin, Germany, 51–61. https://doi.org/10.18653/v1/K16-1006
  94. Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016b. context2vec: Learning Generic Context Embedding with Bidirectional LSTM. In CoNLL.
  95. Prem Melville, Wojciech Gryc, and Richard D. Lawrence. 2009. Sentiment Analysis of Blogs by Combining Lexical Knowledge with Text Classification. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Paris, France) (KDD ’09). ACM, New York, NY, USA, 1275–1284. https://doi.org/10.1145/1557019.1557156
  96. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and Their Compositionality. In Advances in neural information processing systems. 3111–3119.
  97. Saif Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu. 2013. NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013) (Atlanta, Georgia, USA). Association for Computational Linguistics, 321–327. http://aclweb.org/anthology/S13-2053
  98. James N. Morgan and John A. Sonquist. 1963. Problems in the Analysis of Survey Data, and a Proposal. J. Amer. Statist. Assoc. 58, 302 (1963), 415–434. http://www.jstor.org/stable/2283276
  99. Nikola Mrksic, Ivan Vulic, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gasic, Anna Korhonen, and Steve J. Young. 2017. Semantic Specialisation of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints. CoRR abs/1706.00374 (2017). arXiv:1706.00374 http://arxiv.org/abs/1706.00374
  100. T. Mullen and R. Malouf. 2006. A preliminary investigation into sentiment analysis of informal political discourse. AAAI Spring Symposium - Technical Report SS-06-03 (2006), 159–162. https://www.scopus.com/inward/record.uri?eid=2-s2.0-33747172751&partnerID=40&md5=6b12793b70eae006102989ed6d398fcb
  101. Martin Müller, Marcel Salathé, and Per E Kummervold. 2020. COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter. arXiv preprint arXiv:2005.07503 (2020).
  102. Marwa Naili, Anja Habacha Chaibi, and Henda Hajjami Ben Ghezala. 2017. Comparative study of word embedding methods in topic segmentation. Procedia Computer Science 112 (2017), 340–349.
  103. Vivek Narayanan, Ishan Arora, and Arjun Bhatia. 2013. Fast and accurate sentiment classification using an enhanced Naive Bayes model. CoRR abs/1305.6143 (2013). arXiv:1305.6143 http://arxiv.org/abs/1305.6143
  104. Usman Naseem. 2020. Hybrid Words Representation for the classification of low quality text. Ph.D. Dissertation.
  105. U Naseem, SK Khan, M Farasat, and F Ali. 2019a. Abusive Language Detection: A Comprehensive Review. Indian Journal of Science and Technology 12, 45 (2019), 1–13.
  106. Usman Naseem, Shah Khalid Khan, Madiha Farasat, and Farasat Ali. 2019b. Abusive Language Detection: A Comprehensive Review. Indian Journal of Science and Technology 12 (2019). http://www.indjst.org/index.php/indjst/article/view/146538
  107. Usman Naseem, Shah Khalid Khan, Imran Razzak, and Ibrahim A. Hameed. 2019c. Hybrid Words Representation for Airlines Sentiment Analysis. In AI 2019: Advances in Artificial Intelligence, Jixue Liu and James Bailey (Eds.). Springer International Publishing, Cham, 381–392.
  108. Usman Naseem, Matloob Khushi, Shah Khalid Khan, Nazar Waheed, Adnan Mir, Atika Qazi, Bandar Alshammari, and Simon K. Poon. 2020a. Diabetic Retinopathy Detection Using Multi-layer Neural Networks and Split Attention with Focal Loss. In International Conference on Neural Information Processing. Springer, 1–12.
  109. Usman Naseem, Matloob Khushi, Vinay Reddy, Sakthivel Rajendran, Imran Razzak, and Jinman Kim. 2020b. BioALBERT: A Simple and Effective Pre-trained Language Model for Biomedical Named Entity Recognition. arXiv preprint arXiv:2009.09223 (2020).
  110. Usman Naseem and Katarzyna Musial. 2019a. Dice: Deep intelligent contextual embedding for twitter sentiment analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 953–958.
  111. Usman Naseem and Katarzyna Musial. 2019b. DICE: Deep Intelligent Contextual Embedding for Twitter Sentiment Analysis. 2019 15th International Conference on Document Analysis and Recognition (ICDAR) (2019), 1–5.
  112. Usman Naseem, Katarzyna Musial, Peter Eklund, and Mukesh Prasad. 2020c. Biomedical Named-Entity Recognition by Hierarchically Fusing BioBERT Representations and Deep Contextual-Level Word-Embedding. In 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8.
  113. Usman Naseem, Imran Razzak, Peter Eklund, and Katarzyna Musial. 2020d. Towards Improved Deep Contextual Embedding for the identification of Irony and Sarcasm. In 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–7.
  114. Usman Naseem, Imran Razzak, and Ibrahim A Hameed. 2019d. Deep Context-Aware Embedding for Abusive and Hate Speech detection on Twitter. Aust. J. Intell. Inf. Process. Syst. 15, 3 (2019), 69–76.
  115. Usman Naseem, Imran Razzak, Katarzyna Musial, and Muhammad Imran. 2020e. Transformer based deep intelligent contextual embedding for twitter sentiment analysis. Future Generation Computer Systems 113 (2020), 58–69.
  116. Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2015. Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space. CoRR abs/1504.06654 (2015). arXiv:1504.06654 http://arxiv.org/abs/1504.06654
  117. Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen. 2020. BERTweet: A pre-trained language model for English Tweets. arXiv preprint arXiv:2005.10200 (2020).
  118. Thomas Niebler, Martin Becker, Christian Pölitz, and Andreas Hotho. 2017. Learning Semantic Relatedness From Human Feedback Using Metric Learning. CoRR abs/1705.07425 (2017). arXiv:1705.07425 http://arxiv.org/abs/1705.07425
  119. Alexander Pak and Patrick Paroubek. 2010. Twitter as a Corpus for Sentiment Analysis and Opinion Mining. In LREC.
  120. Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002a. Thumbs Up?: Sentiment Classification Using Machine Learning Techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10 (EMNLP ’02). Association for Computational Linguistics, Stroudsburg, PA, USA, 79–86. https://doi.org/10.3115/1118693.1118704
  121. Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002b. Thumbs up?: Sentiment Classification Using Machine Learning Techniques. In The Conference on Empirical Methods on Natural Language Processing. Association for Computational Linguistics, 79–86.
  122. Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the Difficulty of Training Recurrent Neural Networks. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28 (Atlanta, GA, USA) (ICML’13). JMLR.org, III–1310–III–1318. http://dl.acm.org/citation.cfm?id=3042817.3043083
  123. Cristian Patriche, Gabriel Pîrnău, Adrian Grozavu, and Bogdan Roşca. 2016. A Comparative Analysis of Binary Logistic Regression and Analytical Hierarchy Process for Landslide Susceptibility Assessment in the Dobrov River Basin, Romania. Pedosphere 26 (06 2016), 335–350. https://doi.org/10.1016/S1002-0160(15)60047-9
  124. Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt Publishing.
  125. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (New Orleans, Louisiana). Association for Computational Linguistics, 2227–2237. https://doi.org/10.18653/v1/N18-1202
  126. Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018b. Deep contextualized word representations. CoRR abs/1802.05365 (2018). arXiv:1802.05365 http://arxiv.org/abs/1802.05365
  127. Yuval Pinter, Robert Guthrie, and Jacob Eisenstein. 2017. Mimicking Word Embeddings using Subword RNNs. CoRR abs/1707.06961 (2017). arXiv:1707.06961 http://arxiv.org/abs/1707.06961
  128. Pengda Qin, Weiran Xu, and Jun Guo. 2016. An Empirical Convolutional Neural Network Approach for Semantic Relation Classification. Neurocomput. 190, C (May 2016), 1–9. https://doi.org/10.1016/j.neucom.2015.12.091
  129. Zhaowei Qu, Xiaomin Song, Shuqiang Zheng, Xiaoru Wang, Xiaohui Song, and Zuquan Li. 2018. Improved Bayes Method Based on TF-IDF Feature and Grade Factor Feature for Chinese Information Classification. 2018 IEEE International Conference on Big Data and Smart Computing (BigComp) (2018), 677–680.
  130. J.R. Quinlan. 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27, 3 (1987), 221 – 234. https://doi.org/10.1016/S0020-7373(87)80053-6
  131. J. R. Quinlan. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (March 1986), 81–106. https://doi.org/10.1023/A:1022643204877
  132. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2018. Language Models are Unsupervised Multitask Learners. (2018). https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf
  133. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019).
  134. Arshia Rehman, Saeeda Naz, Usman Naseem, Imran Razzak, and Ibrahim A Hameed. 2019. Deep AutoEncoder-Decoder Framework for Semantic Segmentation of Brain Tumor. Aust. J. Intell. Inf. Process. Syst. 15, 3 (2019), 53–60.
  135. Yafeng Ren, Yue Zhang, Meishan Zhang, and Donghong Ji. 2016. Context-sensitive Twitter Sentiment Classification Using Neural Network. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (Phoenix, Arizona) (AAAI’16). AAAI Press, 215–221. http://dl.acm.org/citation.cfm?id=3015812.3015844
  136. Jack Reuter, Jhonata Pereira-Martins, and Jugal Kalita. 2016. Segmenting Twitter Hashtags. International Journal on Natural Language Computing 5 (08 2016), 23–36. https://doi.org/10.5121/ijnlc.2016.5402
  137. Seyed Mahdi Rezaeinia, Ali Ghodsi, and Rouhollah Rahmani. 2017a. Improving the Accuracy of Pre-trained Word Embeddings for Sentiment Analysis. CoRR abs/1711.08609 (2017). arXiv:1711.08609 http://arxiv.org/abs/1711.08609
  138. Seyed Mahdi Rezaeinia, Ali Ghodsi, and Rouhollah Rahmani. 2017b. Improving the Accuracy of Pre-trained Word Embeddings for Sentiment Analysis. CoRR abs/1711.08609 (2017). arXiv:1711.08609 http://arxiv.org/abs/1711.08609
  139. Hassan Saif, Marta Fernandez Andres, Yulan He, and Harith Alani. 2013. Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold. In ESSEM@AI*IA.
  140. Mohammad Arshi Saloot, Norisma Idris, Nor Liyana Mohd Shuib, Ram Gopal Raj, and AiTi Aw. 2015. Toward Tweets Normalization Using Maximum Entropy. In NUT@IJCNLP.
  141. Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019).
  142. Dominik Scherer, Andreas Müller, and Sven Behnke. 2010. Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition. In Artificial Neural Networks – ICANN 2010, Konstantinos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 92–101.
  143. Seungil You, David Ding, Kevin Canini, Jan Pfeifer, and Maya Gupta. 2017. Deep Lattice Networks and Partial Monotonic Functions. arXiv preprint arXiv:1709.06680 (2017). arXiv:1709.06680 [stat.ML]
  144. Aliaksei Severyn and Alessandro Moschitti. 2015. Twitter Sentiment Analysis with Deep Convolutional Neural Networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (Santiago, Chile) (SIGIR ’15). ACM, New York, NY, USA, 959–962. https://doi.org/10.1145/2766462.2767830
  145. Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053 [cs.CL]
  146. Tajinder Singh and Madhu Kumari. 2016. Role of Text Pre-processing in Twitter Sentiment Analysis.
  147. Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1631–1642.
  148. Saeid Soheily-Khah, Pierre-François Marteau, and Nicolas Béchet. 2017. Intrusion detection in network systems through hybrid supervised and unsupervised mining process: a detailed case study on the ISCX benchmark dataset.
  149. Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. Mass: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450 (2019).
  150. Karen Sparck Jones. 1988. Document Retrieval Systems. Taylor Graham Publishing, London, UK, Chapter A Statistical Interpretation of Term Specificity and Its Application in Retrieval, 132–142. http://dl.acm.org/citation.cfm?id=106765.106782
  151. Robyn Speer, Joshua Chin, and Catherine Havasi. 2016. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. CoRR abs/1612.03975 (2016). arXiv:1612.03975 http://arxiv.org/abs/1612.03975
  152. Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. Ernie: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223 (2019).
  153. Yu Sun, Shuohuan Wang, Yu-Kun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2020. ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding. In AAAI. 8968–8975.
  154. Ilya Sutskever, James Martens, and Geoffrey Hinton. 2011. Generating Text with Recurrent Neural Networks. In Proceedings of the 28th International Conference on International Conference on Machine Learning (Bellevue, Washington, USA) (ICML’11). Omnipress, USA, 1017–1024. http://dl.acm.org/citation.cfm?id=3104482.3104610
  155. Jared Suttles and Nancy Ide. 2013. Distant Supervision for Emotion Classification with Discrete Binary Values. In Proceedings of the 14th International Conference on Computational Linguistics and Intelligent Text Processing - Volume 2 (Samos, Greece) (CICLing’13). Springer-Verlag, Berlin, Heidelberg, 121–136. https://doi.org/10.1007/978-3-642-37256-8_11
  156. Symeon Symeonidis, Dimitrios Effrosynidis, and Avi Arampatzis. 2018. A comparative evaluation of pre-processing techniques and their interactions for twitter sentiment analysis. Expert Systems with Applications 110 (2018), 298 – 310. https://doi.org/10.1016/j.eswa.2018.06.022
  157. Duyu Tang, Bing Qin, Furu Wei, Li Dong, Ting Liu, and Ming Zhou. 2015. A Joint Segmentation and Classification Framework for Sentence Level Sentiment Classification. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23, 11 (2015), 1750–1761.
  158. Duyu Tang, Furu Wei, Bing Qin, Nan Yang, Ting Liu, and Ming Zhou. 2016a. Sentiment Embeddings with Applications to Sentiment Analysis. IEEE Transactions on Knowledge and Data Engineering 28, 2 (2016), 496–509.
  159. Duyu Tang, Furu Wei, Bing Qin, Nan Yang, Ting Liu, and Ming Zhou. 2016b. Sentiment Embeddings with Applications to Sentiment Analysis. IEEE Trans. on Knowl. and Data Eng. 28, 2 (Feb. 2016), 496–509. https://doi.org/10.1109/TKDE.2015.2489653
  160. Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. 2014. Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Baltimore, Maryland, 1555–1565. https://doi.org/10.3115/v1/P14-1146
  161. Alper Kursat Uysal and Serkan Günal. 2014. The impact of preprocessing on text classification. Inf. Process. Manage. 50 (2014), 104–112.
  162. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach, California, USA) (NIPS’17). Curran Associates Inc., USA, 6000–6010. http://dl.acm.org/citation.cfm?id=3295222.3295349
  163. Ye Zhang and Byron Wallace. 2017. A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Asian Federation of Natural Language Processing, Taipei, Taiwan, 253–263.
  164. Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Jiangnan Xia, Liwei Peng, and Luo Si. 2019. StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding. arXiv:1908.04577 [cs.CL]
  165. Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, and Amit P. Sheth. 2012a. Harnessing Twitter “Big Data” for Automatic Emotion Identification. 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing (2012), 587–592.
  166. Yequan Wang, Minlie Huang, Xiaoyan Zhu, and Li Zhao. 2016. Attention-based LSTM for Aspect-level Sentiment Classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, 606–615. https://doi.org/10.18653/v1/D16-1058
  167. Yuyang Wang, Roni Khardon, and Pavlos Protopapas. 2012b. Nonparametric Bayesian Estimation of Periodic Light Curves. The Astrophysical Journal 756, 1 (Aug. 2012), 67. https://doi.org/10.1088/0004-637x/756/1/67
  168. Ikuya Yamada, Hideaki Takeda, and Yoshiyasu Takefuji. 2015. Enhancing Named Entity Recognition in Twitter Messages Using Entity Linking. In NUT@IJCNLP.
  169. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. CoRR abs/1906.08237 (2019). arXiv:1906.08237 http://arxiv.org/abs/1906.08237
  170. Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books. CoRR abs/1506.06724 (2015). arXiv:1506.06724 http://arxiv.org/abs/1506.06724