On the Role of Text Preprocessing in Neural Network Architectures:
An Evaluation Study on Text Categorization and Sentiment Analysis
In this paper we investigate the impact of simple text preprocessing decisions (particularly tokenizing, lemmatizing, lowercasing and multiword grouping) on the performance of a state-of-the-art text classifier based on convolutional neural networks. Although it can affect the final performance of any given model, this aspect has not received substantial interest in the deep learning literature. We perform an extensive evaluation on standard benchmarks from text categorization and sentiment analysis. Our results show that a simple tokenization of the input text is often enough, but also highlight the importance of being consistent in the preprocessing of the evaluation set and the corpus used for training word embeddings.
Jose Camacho-Collados Department of Computer Science Sapienza University of Rome email@example.com Mohammad Taher Pilehvar Department of Theoretical and Applied Linguistics University of Cambridge firstname.lastname@example.org
Words are often considered as the basic constituents of texts. [Footnote 1: Recent work has also considered other linguistic units such as word senses (Li and Jurafsky, 2015; Flekova and Gurevych, 2016; Pilehvar et al., 2017), characters within tokens (Ballesteros et al., 2015; Kim et al., 2016; Luong and Manning, 2016; Xiao and Cho, 2016) or, more recently, character n-grams (Schütze, 2017). These techniques require a different kind of preprocessing and, while they have been shown to be effective in various settings, in this work we focus only on the mainstream word-based models.] The first module in an NLP pipeline is often a tokenizer, which transforms texts into sequences of words. However, different preprocessing techniques can be (and are) further used in practice. These include lemmatization, lowercasing and multiword grouping, which are the techniques explored in this paper. Although these preprocessing decisions have already been studied in the text classification literature (Leopold and Kindermann, 2002; Uysal and Gunal, 2014), little attention has been paid to them in recent neural network models. Additionally, word embeddings have been shown to play an important role in boosting the generalization properties of neural systems (Zou et al., 2013; Bordes et al., 2014; Kim, 2014; Weiss et al., 2015). However, word embedding techniques often do not focus much on the preprocessing of their underlying training corpora. As a result, the impact of text preprocessing on their performance has remained understudied. [Footnote 2: Not only the preprocessing of the corpus but also its nature, domain, etc. may play an important role. Levy et al. (2015) also showed how small hyperparameter variations may have an impact on the performance of word embeddings. However, these considerations remain out of the scope of this paper.]
In this work we specifically study the role of word constituents in Convolutional Neural Networks (CNNs). Soon after their successful application to computer vision (LeCun et al., 2010; Krizhevsky et al., 2012), CNNs were ported to NLP for their desirable sensitivity to spatial structure of data which makes them effective in capturing semantic or syntactic patterns of word sequences (Goldberg, 2016). CNNs have proven to be effective in a wide range of NLP applications, including text classification tasks such as topic categorization (Johnson and Zhang, 2015; Tang et al., 2015; Xiao and Cho, 2016; Conneau et al., 2017) and sentiment analysis (Kalchbrenner et al., 2014; Kim, 2014; Dos Santos and Gatti, 2014; Yin et al., 2017), which are the tasks considered in this work.
In this paper we aim to find answers to the following two questions:
1. Are neural network architectures (in particular CNNs) affected by seemingly small preprocessing decisions in the input text?
2. Does the preprocessing of the embedding’s training corpus have an impact on the final performance of a state-of-the-art neural network text classifier?
According to our experiments in text categorization and sentiment analysis, these decisions are important in certain cases. Moreover, we shed some light on the motivations of each preprocessing decision and provide some hints on how to normalize the input corpus to better suit each setting.
2 Text Preprocessing
Given an input text, words are gathered as input units of classification models through tokenization. We refer to the corpus which is only tokenized as vanilla. For example, given the sentence “Apple is asking its manufacturers to move MacBook Air production to the United States.” (running example), the vanilla tokenized text would be as follows (white spaces delimiting different word units):
Apple is asking its manufacturers to move MacBook Air production to the United States .
2.1 Lowercasing
This is the simplest preprocessing technique, which consists of lowercasing every token of the input text:
apple is asking its manufacturers to move macbook air production to the united states .
Due to its simplicity, lowercasing has been a popular practice in modules of deep learning libraries and word embedding packages (Pennington et al., 2014; Faruqui et al., 2015). Despite its desirable property of reducing sparsity and vocabulary size, lowercasing may negatively impact a system’s performance by increasing ambiguity. For instance, the Apple company in our example and the apple fruit would be considered as identical entities.
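The trade-off between vocabulary size and ambiguity can be illustrated with a minimal sketch. The second sentence of the toy corpus below is a made-up example and not part of our data; the function name `lowercase` is likewise only illustrative.

```python
def lowercase(tokens):
    """Lowercase every token of an already tokenized text."""
    return [tok.lower() for tok in tokens]

# Running example, already tokenized (vanilla setting).
vanilla = ("Apple is asking its manufacturers to move MacBook Air "
           "production to the United States .").split()
lowered = lowercase(vanilla)

# Hypothetical second sentence, added only to show the effect on the vocabulary.
corpus = [vanilla, "I ate an apple in the united states .".split()]
vocab_vanilla = {tok for sent in corpus for tok in sent}
vocab_lower = {tok for sent in corpus for tok in lowercase(sent)}

# Fewer distinct types after lowercasing, but "Apple" (company) and
# "apple" (fruit) now share a single vocabulary entry.
print(len(vocab_vanilla), len(vocab_lower))
```

In this toy corpus the vocabulary shrinks after lowercasing, at the price of merging the company and the fruit into one entry.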
2.2 Lemmatizing
The process of lemmatizing consists of replacing a given token with its corresponding lemma:
Apple be ask its manufacturer to move MacBook Air production to the United States .
Lemmatization has traditionally been a recurring preprocessing technique for linear text classification systems (Mullen and Collier, 2004; Toman et al., 2006; Hassan et al., 2007) but is rarely used with neural network architectures. The main idea behind lemmatization is to reduce sparsity, as inflected forms coming from the same lemma may occur only a few times (or not at all) during training. However, this may come at the cost of neglecting important syntactic nuances.
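As a rough illustration of the transformation, the sketch below lemmatizes the running example with a hand-written lookup table. The table `TOY_LEMMAS` is hypothetical and covers only this sentence; an actual lemmatizer additionally relies on part-of-speech information and a full morphological lexicon.

```python
# Hypothetical lemma lookup covering only the running example.
TOY_LEMMAS = {"is": "be", "asking": "ask", "manufacturers": "manufacturer"}

def lemmatize(tokens, lemmas=TOY_LEMMAS):
    """Replace each token with its lemma; unknown tokens are left unchanged."""
    return [lemmas.get(tok, tok) for tok in tokens]

vanilla = ("Apple is asking its manufacturers to move MacBook Air "
           "production to the United States .").split()
lemmatized = lemmatize(vanilla)
print(" ".join(lemmatized))
```

Note how the three inflected forms collapse onto their lemmas while the remaining tokens are untouched, which is the source of both the sparsity reduction and the loss of syntactic detail.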
2.3 Multiword grouping
This last preprocessing technique consists of grouping consecutive tokens together into a single token if found in a given inventory:
Apple is asking its manufacturers to move MacBook_Air production to the United_States .
The motivation behind this step lies in the idiosyncratic nature of multiword expressions (Sag et al., 2002), e.g. United States in the example. The meaning of these multiword expressions can hardly be inferred from the individual tokens and a treatment of these instances as single units may lead to a better learning of a given model. Because of this, word embedding toolkits such as Word2vec propose statistical approaches and pre-trained models for representing these multiwords in the vector space (Mikolov et al., 2013b).
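One simple way to implement this grouping is a greedy left-to-right pass that always prefers the longest span found in the inventory. The sketch below follows that strategy; the greedy longest-match policy, the function name `group_multiwords` and the `max_len` bound are assumptions made for illustration, not a description of any particular toolkit.

```python
def group_multiwords(tokens, inventory, max_len=3):
    """Greedily merge the longest token span found in `inventory`.

    `inventory` is a set of multiword expressions joined by underscores,
    e.g. {"United_States"}; `max_len` bounds the span length that is tried.
    """
    grouped, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = "_".join(tokens[i:i + n])
            if candidate in inventory:
                grouped.append(candidate)
                i += n
                break
        else:
            # No multiword expression starts here: keep the single token.
            grouped.append(tokens[i])
            i += 1
    return grouped

vanilla = ("Apple is asking its manufacturers to move MacBook Air "
           "production to the United States .").split()
grouped = group_multiwords(vanilla, {"MacBook_Air", "United_States"})
print(" ".join(grouped))
```

Preferring the longest match avoids splitting nested expressions, so that an inventory containing both New_York and New_York_City yields the longer unit when all three tokens are present.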
3 Evaluation
We performed experiments in text topic categorization and polarity detection. The task of topic categorization consists of assigning a topic to a given document from a pre-defined set of topics, while polarity detection is a binary classification task which consists of detecting the sentiment of a given sentence as being either positive or negative (Dong et al., 2015). In our experiments we evaluate two different settings: (1) the training corpus of the word embeddings was preprocessed in the same way as the evaluation datasets (Section 3.1) and (2) the two were preprocessed differently (Section 3.2).
As classification model we used a standard CNN classifier based on the work of Kim (2014), using ReLU (Nair and Hinton, 2010) as non-linear activation function. [Footnote 3: We used Keras (Chollet, 2015) and Theano (Team, 2016) for our model implementation.] However, instead of passing the pooled features directly to a fully connected softmax layer, we added a recurrent layer (specifically an LSTM (Hochreiter and Schmidhuber, 1997)), which has been shown to effectively replace multiple layers of convolution and to be particularly beneficial for large inputs (Xiao and Cho, 2016). This model was used for both topic categorization and polarity detection tasks, with slight hyperparameter variations given the different natures (mainly in their text size) of the tasks. Given that in topic categorization the input text size is expected to be larger than that in polarity detection (usually phrases, snippets or sentences), we followed past work (Kim, 2014; Xiao and Cho, 2016; Pilehvar et al., 2017) and used more epochs, convolution filters and LSTM units for topic categorization, keeping the configuration fixed across all datasets and experiments of each task. The embedding layer was initialized using pre-trained word embeddings. We trained Word2vec (Mikolov et al., 2013a) [Footnote 4: We trained CBOW with standard hyperparameters: 300 dimensions, context window of 5 words and hierarchical softmax for normalization.] on the 3B-word UMBC WebBase corpus (Han et al., 2013), which is a corpus composed of paragraphs extracted from the web as part of the Stanford WebBase Project (Hirai et al., 2000).
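The initialization of the embedding layer can be sketched as building one row per vocabulary word from the pre-trained vectors. The sketch below is library-independent; the random fallback for words missing from the pre-trained vectors (uniform in [-0.25, 0.25]) is an assumption, as is the function name `build_embedding_matrix`, and the 3-dimensional vectors are made-up toy values standing in for the 300-dimensional ones.

```python
import random

def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Return one row per vocabulary word, copying pre-trained vectors.

    Words missing from `pretrained` receive small random vectors
    (assumed fallback, not taken from the experimental setup).
    """
    rng = random.Random(seed)
    matrix = []
    for word in vocab:
        vector = pretrained.get(word)
        if vector is None:
            vector = [rng.uniform(-0.25, 0.25) for _ in range(dim)]
        matrix.append(list(vector))
    return matrix

# Toy vectors trained on a vanilla (tokenized-only) corpus.
pretrained = {"Apple": [0.1, 0.2, 0.3],
              "United": [0.0, 0.1, 0.0],
              "States": [0.4, 0.5, 0.6]}
# Vocabulary of a multiword-grouped dataset: "United_States" has no vector.
vocab = ["Apple", "United_States", "States"]
matrix = build_embedding_matrix(vocab, pretrained, dim=3)
covered = sum(word in pretrained for word in vocab) / len(vocab)
```

The `covered` ratio makes the consistency issue visible: a unit produced by one preprocessing choice on the dataset side has no pre-trained vector if the embedding corpus was preprocessed differently.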
Table 1: Characteristics and statistics of each dataset (columns: Dataset | Type | # of | # of | Eval.).
For the topic categorization task we used the BBC news dataset (Greene and Cunningham, 2006) [Footnote 5: http://mlg.ucd.ie/datasets/bbc.html], 20News (Lang, 1995), Reuters (Lewis et al., 2004) [Footnote 6: Due to the large number of labels in the original Reuters (i.e. 91) and to be consistent with the other datasets, we reduce the dataset to its 8 most frequent labels, a reduction already performed in previous works (Sebastiani, 2002).] and Ohsumed [Footnote 7: ftp://medir.ohsu.edu/pub/ohsumed]. For the polarity detection task we used PL04 (Pang and Lee, 2004), PL05 (Pang and Lee, 2005) [Footnote 8: Both PL04 and PL05 were downloaded from this website: goo.gl/rbvoT7], RTC [Footnote 9: http://www.rottentomatoes.com], IMDB (Maas et al., 2011) and the Stanford Sentiment dataset (Socher et al., 2013, SF) [Footnote 10: We mapped the numerical value of phrases to either negative (from 0 to 0.4) or positive (from 0.6 to 1), removing the neutral phrases according to the scale (from 0.4 to 0.6).]. Details of the characteristics and statistics of each dataset are displayed in Table 1. For both tasks the evaluation was carried out either by 10-fold cross-validation or, where available, using the train-test splits of the datasets.
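The mapping of the Stanford Sentiment scores to binary labels described in Footnote 10 can be sketched as follows. The treatment of the exact boundary values (0.4 counted as negative, 0.6 as positive) is an assumption, since only the ranges are given, and the three phrases are made-up examples.

```python
def polarity_label(score):
    """Map a sentiment score in [0, 1] to a label, or None if neutral.

    Boundary handling (0.4 -> negative, 0.6 -> positive) is an assumption.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score <= 0.4:
        return "negative"
    if score >= 0.6:
        return "positive"
    return None  # neutral phrases are removed

# Hypothetical (phrase, score) pairs.
phrases = [("a gripping movie", 0.9),
           ("it is a film", 0.5),
           ("dull and lifeless", 0.1)]
labelled = [(text, polarity_label(score))
            for text, score in phrases
            if polarity_label(score) is not None]
print(labelled)
```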
The datasets and the UMBC corpus used to train the embeddings were preprocessed in four different ways (see Section 2). For tokenization and lemmatization we relied on Stanford CoreNLP (Manning et al., 2014). As for multiwords, we used the phrases from the pre-trained Google News Word2vec vectors, which were obtained using a simple statistical approach on the underlying corpus (Mikolov et al., 2013b). [Footnote 11: For future work it could be interesting to explore more complex methods to learn embeddings for multiword expressions (Yin and Schütze, 2014; Poliak et al., 2017).]
Table 2: Accuracy of the classification model using the four preprocessing techniques (columns: Dataset+Vector | Topic categorization | Polarity detection).
3.1 Experiment 1: Preprocessing effect
Table 2 shows the accuracy of the classification model using our four preprocessing techniques. [Footnote 12: Computed by averaging the accuracy of two different runs.] Despite their simplicity, both the vanilla setting (tokenization only) and lowercasing prove to be consistent across datasets and tasks. Both settings perform in the same ballpark as the best result in 8 of the 9 datasets (with no noticeable differences between topic categorization and polarity detection). The only dataset in which tokenization does not seem enough is Ohsumed, which, unlike the more general nature of the other topic categorization datasets (i.e. news), belongs to a more specialized domain (i.e. medical) for which more fine-grained distinctions are required to classify cardiovascular diseases. This result suggests that embeddings trained on a general corpus may not be entirely accurate for specialized domains. Therefore, the network needs a sufficient number of examples for each word to properly learn its representation. Hence, the sparsity issue is particularly highlighted in this domain-specific dataset. In fact, lowercasing and lemmatizing, which are mainly aimed at reducing sparsity, outperform the vanilla setting by over six points on this dataset.
Nevertheless, the use of more complex preprocessing techniques such as lemmatization and multiword grouping does not seem to help in general. Even though lemmatization has proved useful in conventional linear models, mainly due to its sparsity reduction effect, neural network architectures are more capable of overcoming sparsity thanks to the generalization power of word embeddings. In turn, multiword grouping does not seem beneficial either (in four of the datasets the results are significantly worse than the best system), suggesting that the introduction of these units produces an undesirable increase in unseen (or rarely seen) single and multiword tokens in the training and test data.
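The suggested increase in unseen tokens can be quantified with a simple measure: the fraction of test tokens whose type never occurs in the training data. The sketch below computes it on a toy corpus constructed by hand to show the effect; it is not an analysis carried out in our experiments, and the function name `unseen_rate` is illustrative.

```python
def unseen_rate(train_sents, test_sents):
    """Fraction of test tokens whose type never occurs in the training data."""
    seen = {tok for sent in train_sents for tok in sent}
    test_tokens = [tok for sent in test_sents for tok in sent]
    if not test_tokens:
        return 0.0
    return sum(tok not in seen for tok in test_tokens) / len(test_tokens)

# Hypothetical toy data in the vanilla setting.
train_vanilla = [["the", "United", "States", "economy"],
                 ["a", "United", "Nations", "report"]]
test_vanilla = [["the", "United", "Kingdom", "economy"]]

# The same data after multiword grouping.
train_grouped = [["the", "United_States", "economy"],
                 ["a", "United_Nations", "report"]]
test_grouped = [["the", "United_Kingdom", "economy"]]

vanilla_rate = unseen_rate(train_vanilla, test_vanilla)
grouped_rate = unseen_rate(train_grouped, test_grouped)
print(vanilla_rate, grouped_rate)
```

In the vanilla version the token United is shared across all three expressions, whereas after grouping each expression becomes a separate and rarer unit, so a larger share of the test tokens is unseen.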
Table 3: Results of the cross-preprocessing experiment (columns: Vector | Topic categorization | Polarity detection).
3.2 Experiment 2: Cross-preprocessing
This experiment aims at studying the impact of using different word embeddings (with differently preprocessed training corpora) on tokenized datasets (vanilla setting). In addition to the word embeddings trained on the UMBC corpus described in the experimental setting, we compared with the popular pre-trained 300-dimensional word embeddings of Word2vec trained on the 100B-token Google News corpus which make use of the same multiword grouping considered in our experiments.
Table 3 shows the results on this cross-preprocessing experiment. In this experiment we observe a different trend, with multiword-enhanced vectors exhibiting a better performance, both for UMBC (best overall performance in four datasets and in the same ballpark of the best result in four of the remaining five datasets) and Google News. In this case the same set of words is learnt but single tokens inside multiword expressions are not trained. Instead, these single tokens are considered in isolation only, without the added noise when considered inside the multiword expression as well. For instance, the word Apple has a clearly different meaning in isolation from the one inside the multiword expression Big_Apple, hence it can be seen as beneficial not to train the word Apple when part of this multiword expression. Interestingly, using multiword-wise embeddings on the vanilla setting leads to consistently better results than using them on the same multiword-grouped preprocessed dataset in eight of the nine datasets.
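The effect on the training signal of a single token can be made concrete by counting its occurrences in a vanilla and in a multiword-grouped version of the same corpus. The two sentences below are made-up examples and the helper `token_counts` is illustrative only.

```python
from collections import Counter

def token_counts(sents):
    """Count how often each token occurs in a tokenized corpus."""
    return Counter(tok for sent in sents for tok in sent)

# Hypothetical embedding training corpus, vanilla and multiword-grouped.
vanilla_corpus = [["Apple", "released", "a", "laptop"],
                  ["the", "Big", "Apple", "never", "sleeps"]]
grouped_corpus = [["Apple", "released", "a", "laptop"],
                  ["the", "Big_Apple", "never", "sleeps"]]

vanilla_counts = token_counts(vanilla_corpus)
grouped_counts = token_counts(grouped_corpus)

# In the grouped corpus, "Apple" is only trained on its company context.
print(vanilla_counts["Apple"], grouped_counts["Apple"], grouped_counts["Big_Apple"])
```

In the grouped corpus the vector of Apple no longer receives updates from the contexts of Big_Apple, which is the noise reduction referred to above.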
Apart from this somewhat surprising finding, the use of the embeddings trained on a simple tokenized corpus (i.e. vanilla) proved again competitive, as different preprocessing techniques such as lowercasing and especially lemmatizing do not seem to help. The results of the pre-trained Word2vec embeddings using multiword grouping, although not directly comparable because they were trained on a much larger corpus, are in line (in the same ballpark) with those trained on the UMBC corpus using the same preprocessing.
In this paper we analyzed the impact that simple preprocessing decisions in the input text may have on the performance of standard word-based neural text classification models. Our evaluation highlights the importance of being consistent in the preprocessing strategy employed across training and evaluation data. In general, a simple tokenized corpus works as well as or better than more complex preprocessing techniques such as lemmatization or multiword grouping, except for a dataset corresponding to a specialized domain, like health, in which tokenization alone performs poorly. Additionally, word embeddings trained on multiword-grouped corpora perform surprisingly well when applied to simple tokenized datasets.
Further analysis and experimentation would be required to fully understand the significance of these results but we hope this work can be viewed as a starting point for studying the impact of text preprocessing in deep learning models. As future work, we plan to extend our analysis to other tasks (e.g. question answering), languages (particularly morphologically rich languages for which these results may vary substantially) and preprocessing techniques (e.g. stopword removal or disambiguation).
Jose Camacho-Collados is supported by a Google PhD Fellowship in Natural Language Processing.
- Ballesteros et al. (2015) Miguel Ballesteros, Chris Dyer, and Noah A Smith. 2015. Improved transition-based parsing by modeling characters instead of words with lstms. In Proceedings of EMNLP.
- Bordes et al. (2014) Antoine Bordes, Sumit Chopra, and Jason Weston. 2014. Question answering with subgraph embeddings. In Proceedings of EMNLP.
- Chollet (2015) François Chollet. 2015. Keras. https://github.com/fchollet/keras.
- Conneau et al. (2017) Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2017. Very deep convolutional networks for text classification. In Proceedings of EACL. Valencia, Spain, pages 1107–1116.
- Dong et al. (2015) Li Dong, Furu Wei, Shujie Liu, Ming Zhou, and Ke Xu. 2015. A statistical parsing framework for sentiment classification. Computational Linguistics.
- Dos Santos and Gatti (2014) Cícero Nogueira Dos Santos and Maira Gatti. 2014. Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of COLING. pages 69–78.
- Faruqui et al. (2015) Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL. pages 1606–1615.
- Flekova and Gurevych (2016) Lucie Flekova and Iryna Gurevych. 2016. Supersense embeddings: A unified model for supersense interpretation, prediction, and utilization. In Proceedings of ACL. Berlin, Germany.
- Goldberg (2016) Yoav Goldberg. 2016. A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research 57:345–420.
- Greene and Cunningham (2006) Derek Greene and Pádraig Cunningham. 2006. Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proceedings of the 23rd International conference on Machine learning. ACM, pages 377–384.
- Han et al. (2013) Lushan Han, Abhay Kashyap, Tim Finin, James Mayfield, and Jonathan Weese. 2013. UMBC ebiquity-core: Semantic textual similarity systems. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics. volume 1, pages 44–52.
- Hassan et al. (2007) Samer Hassan, Rada Mihalcea, and Carmen Banea. 2007. Random walk term weighting for improved text classification. International Journal of Semantic Computing 1(04):421–439.
- Hirai et al. (2000) Jun Hirai, Sriram Raghavan, Hector Garcia-Molina, and Andreas Paepcke. 2000. Webbase: A repository of web pages. Computer Networks 33(1):277–293.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
- Johnson and Zhang (2015) Rie Johnson and Tong Zhang. 2015. Effective use of word order for text categorization with convolutional neural networks. In Proceedings of NAACL.
- Kalchbrenner et al. (2014) Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of ACL. pages 655–665.
- Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of EMNLP.
- Kim et al. (2016) Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. 2016. Character-aware neural language models. In Proceedings of AAAI.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. pages 1097–1105.
- Lang (1995) Ken Lang. 1995. Newsweeder: Learning to filter netnews. In Proceedings of the 12th international conference on machine learning. pages 331–339.
- LeCun et al. (2010) Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. 2010. Convolutional networks and applications in vision. In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on. IEEE, pages 253–256.
- Leopold and Kindermann (2002) Edda Leopold and Jörg Kindermann. 2002. Text categorization with support vector machines. how to represent texts in input space? Machine Learning 46(1-3):423–444.
- Levy et al. (2015) Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3:211–225.
- Lewis et al. (2004) David D. Lewis, Yiming Yang, Tony G Rose, and Fan Li. 2004. Rcv1: A new benchmark collection for text categorization research. Journal of machine learning research 5(Apr):361–397.
- Li and Jurafsky (2015) Jiwei Li and Dan Jurafsky. 2015. Do multi-sense embeddings improve natural language understanding? In Proceedings of EMNLP. Lisbon, Portugal.
- Luong and Manning (2016) Minh-Thang Luong and Christopher D. Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. In Association for Computational Linguistics (ACL). Berlin, Germany.
- Maas et al. (2011) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of ACL-HLT. Portland, Oregon, USA, pages 142–150.
- Manning et al. (2014) Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations. pages 55–60.
- Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. CoRR abs/1301.3781.
- Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. pages 3111–3119.
- Mullen and Collier (2004) Tony Mullen and Nigel Collier. 2004. Sentiment analysis using support vector machines with diverse information sources. In EMNLP. volume 4, pages 412–418.
- Nair and Hinton (2010) Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). Omnipress, pages 807–814.
- Pang and Lee (2004) Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the ACL.
- Pang and Lee (2005) Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the ACL.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP. pages 1532–1543.
- Pilehvar et al. (2017) Mohammad Taher Pilehvar, Jose Camacho-Collados, Roberto Navigli, and Nigel Collier. 2017. Towards a Seamless Integration of Word Senses into Downstream NLP Applications. In Proceedings of ACL. Vancouver, Canada.
- Poliak et al. (2017) Adam Poliak, Pushpendre Rastogi, M. Patrick Martin, and Benjamin Van Durme. 2017. Efficient, compositional, order-sensitive n-gram embeddings. In Proceedings of EACL.
- Sag et al. (2002) Ivan A Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for nlp. In International Conference on Intelligent Text Processing and Computational Linguistics. Springer, pages 1–15.
- Schütze (2017) Hinrich Schütze. 2017. Nonsymbolic text representation. In Proceedings of EACL. Valencia, Spain, pages 785–796.
- Sebastiani (2002) Fabrizio Sebastiani. 2002. Machine learning in automated text categorization. ACM computing surveys (CSUR) 34(1):1–47.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP.
- Tang et al. (2015) Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In EMNLP. pages 1422–1432.
- Team (2016) Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints abs/1605.02688.
- Toman et al. (2006) Michal Toman, Roman Tesar, and Karel Jezek. 2006. Influence of word normalization on text classification. Proceedings of InSciT 4:354–358.
- Uysal and Gunal (2014) Alper Kursat Uysal and Serkan Gunal. 2014. The impact of preprocessing on text classification. Information Processing & Management 50(1):104–112.
- Weiss et al. (2015) David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network transition-based parsing. In Proceedings of ACL. Beijing, China.
- Xiao and Cho (2016) Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers. CoRR abs/1602.00367.
- Yin et al. (2017) Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. 2017. Comparative study of CNN and RNN for natural language processing. arXiv preprint arXiv:1702.01923.
- Yin and Schütze (2014) Wenpeng Yin and Hinrich Schütze. 2014. An exploration of embeddings for generalized phrases. In ACL (Student Research Workshop). pages 41–47.
- Zou et al. (2013) Will Y. Zou, Richard Socher, Daniel M. Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of EMNLP. pages 1393–1398.