Which Encoding is the Best for Text Classification in Chinese, English, Japanese and Korean?
This article offers an empirical study on the different ways of encoding Chinese, Japanese, Korean (CJK) and English languages for text classification. Different encoding levels are studied, including UTF-8 bytes, characters, words, romanized characters and romanized words. For all encoding levels, whenever applicable, we provide comparisons with linear models, fastText (Joulin et al., 2016) and convolutional networks. For convolutional networks, we compare encoding mechanisms using character glyph images, one-hot (or one-of-n) encoding, and embedding. In total there are 473 models, using 14 large-scale text classification datasets in 4 languages including Chinese, English, Japanese and Korean. Some conclusions from these results include that byte-level one-hot encoding based on UTF-8 consistently produces competitive results for convolutional networks, that word-level n-gram linear models are competitive even without perfect word segmentation, and that fastText provides the best result using character-level n-gram encoding but can overfit when the features are overly rich.
Keywords: text classification, text encoding, text representation, multilingual language processing, convolutional network
Being able to process different kinds of languages in a unified and consistent fashion is of great interest to the natural language processing (NLP) community, especially with the recent advancements in deep learning methods. Among these languages, Chinese, Japanese and Korean (CJK) pose unique challenges for both linguistic and computational reasons. Unlike some alphabetic languages such as English, some CJK texts have no clear word boundaries. This makes it difficult to apply many language processing methods that assume the word as the basic construct.
Recently, many authors have proposed to use character-level encoding for language processing with convolutional networks (ConvNets) (Kim et al., 2016) (Zhang et al., 2015), casting away the word segmentation problem. Unfortunately, working with characters for CJK languages is not straightforward, because the number of characters can be huge. For example, the one-hot (or one-of-n) encoding used by Zhang et al. (2015) is not practical because each one-hot vector would be prohibitively large.
This drives us to search for alternative ways of encoding CJK texts. The encoding mechanisms considered in this article include character glyph images, one-hot encoding and embedding. For one-hot encoding, we considered feasible encoding levels including UTF-8 bytes and characters after romanization. For embedding, we performed experiments on encoding levels including character, UTF-8 bytes, romanized characters, segmented word with a prebuilt word segmenter, and romanized word. A brief search in the literature seems to confirm that this article is the first to study all of these encoding mechanisms in a systematic fashion.
Historically, linear models such as (multinomial) logistic regression (Cox, 1958) and support vector machines (Cortes and Vapnik, 1995) have been the default choice for text classification, with bag-of-words features and variants such as n-grams and TF-IDF (Sparck Jones, 1972). Therefore, in this article we provide extensive comparisons using multinomial logistic regression, with bag-of-characters, bag-of-words and their n-gram and TF-IDF (Sparck Jones, 1972) variants. Furthermore, experiments using the recently proposed fastText (Joulin et al., 2016) are also presented with all these different feature variants.
Large-scale multi-lingual datasets are required to make sure that our comparisons are meaningful. Therefore, we set out to crawl the Internet for several large-scale text classification datasets. Eventually, we were able to obtain 14 large-scale datasets in 4 languages including Chinese, English, Japanese and Korean, for 2 different tasks including sentiment analysis and topic categorization. We plan to release all the code used in this article under an open source license, including crawling, preprocessing, and training on all datasets.
The conclusions of this article include that the one-hot encoding model at UTF-8 byte level consistently offers competitive results for convolutional networks, that linear models remain strong for the text classification task, and that fastText provides the best results with character n-grams but tends to overfit when the features are overly rich. We hope that these results can offer useful guidance for the community to select appropriate encoding mechanisms that can handle different languages in a unified and consistent fashion.
2 Encoding Mechanisms for Convolutional Networks
For the purpose of fair comparisons, all of our convolutional networks share the same design except for the first few layers. We call this common part the classifier, and the differing first layers the encoder. In the benchmarks we have 2 classifier designs, one large and the other small. The large classifier consists of 12 layers, and the small one 8. Tables 1 and 2 detail the designs. All parameterized layers use ReLU (Nair and Hinton, 2010) as the non-linearity.
2.1 Character Glyph
Glyph is a typography term indicating a readable character for the purposes of writing. CJK languages consist of characters that are rich in their topological forms, where strokes and parts could represent semantic meaning. This makes glyph a potentially feasible encoding solution.
In the context of this article, we refer to glyphs as images of characters rendered by some font. In the experiments we use the freely available GNU Unifont (http://unifoundry.com/unifont.html, version 8.0.01), where each character is converted to a 16-by-16 pixel image. We consider all characters that belong to the Unicode basic multilingual plane (BMP), which have code points less than or equal to the hex value FFFF. Figure 1 shows some glyph examples in this font.
For the large classifier, the glyph encoder contains 8 parameterized layers with 6 spatial convolutional layers and 2 linear layers. The small model consists of a 6-layer glyph encoder with 4 spatial convolutional layers and 2 linear layers. Tables 3 and 5 present the design choices.
In the benchmarks we will refer to these 2 models as large GlyphNet and small GlyphNet respectively. During training, each sample consists of at most 512 characters for the large GlyphNet and 486 for the small one. Zero is padded if the length of the sample string is shorter, and characters beyond these limits are ignored. Note that each character must pass through the spatial glyph encoder and each sample could contain hundreds of characters. As a result, the training time of GlyphNet is significantly longer than any other model considered in this article.
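As an illustration of the input preparation described above, the following is a minimal sketch of how a string could be turned into a fixed-length sequence of BMP code points before glyph lookup. The names `to_code_points`, `BMP_MAX` and `PAD` are hypothetical, and mapping characters outside the BMP to the padding value is an assumption of this sketch rather than something stated in this article. Each resulting code would then select a 16-by-16 glyph image, with the padding value standing for an all-zero image.

```python
BMP_MAX = 0xFFFF  # highest code point in the basic multilingual plane
PAD = 0           # assumed padding value, standing for an all-zero glyph


def to_code_points(text, max_len=512):
    """Map text to a fixed-length list of BMP code points.

    Characters beyond max_len are ignored and shorter texts are padded.
    Characters outside the BMP are replaced by PAD, which is an
    assumption of this sketch.
    """
    codes = [ord(ch) if ord(ch) <= BMP_MAX else PAD for ch in text[:max_len]]
    return codes + [PAD] * (max_len - len(codes))


sample = to_code_points("汉字 kanji 한글", max_len=16)
```

With `max_len=512` this matches the limit of the large GlyphNet, and `max_len=486` that of the small one.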
It is worth noting that recent research has shown that CJK characters can help to improve the results of various tasks including text classification (Shimada et al., 2016) (Liu et al., 2017) and translation (Costa-jussà et al., 2017), further justifying the potential of encoding CJK characters via glyphs.
2.2 One-hot Encoding
In the simplest version of one-hot (or one-of-n) encoding, each entity must be converted into a vector whose size equals the cardinality of the set of all possible entities, and all values in this vector are zero except for the position that corresponds to the index of the entity in the set. For example, in the paper by Zhang et al. (2015), each entity is a character and the size of the vector equals the size of the alphabet containing all characters. Unfortunately, this naive way of using one-hot encoding is only computationally feasible if the entity set is relatively small. Texts in CJK languages can easily span tens of thousands of characters.
In this article, we consider 2 simple solutions to this problem. The first one is to treat the text (in UTF-8) as a sequence of bytes and encode at byte-level. The second one, already presented in Zhang et al. (2015), is to romanize the text so that encoding using the English alphabet is feasible. Note that the second solution is equivalent to encoding at byte-level with romanized text, because every letter of the English alphabet is encoded as a single byte in UTF-8.
In the following we will call these 2 models byte-level OnehotNet and romanization OnehotNet. Similar to GlyphNet, each OnehotNet also has a large variant and a small variant depending on the classifier used. Both variants use the same encoder design that consists of 4 convolutional layers, in which the large variant admits input length 2048 and the small 1944. Table 6 provides the configuration. Compared to GlyphNet, OnehotNet is significantly faster because the encoder handles all symbols in the input at once.
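To make the byte-level variant concrete, here is a minimal sketch of one-hot encoding over UTF-8 bytes. The function name `byte_onehot`, the vector size of 256 and the use of all-zero rows for padding are assumptions of this sketch, since the exact layout is given only in the configuration table. Note that truncating at byte level may cut a multi-byte character in the middle.

```python
def byte_onehot(text, max_len=2048):
    """Encode text as max_len one-hot rows over the 256 possible byte values."""
    data = text.encode("utf-8")[:max_len]  # bytes beyond the limit are ignored
    rows = []
    for value in data:
        row = [0] * 256
        row[value] = 1
        rows.append(row)
    # Shorter inputs are padded with all-zero rows (an assumption here).
    rows.extend([0] * 256 for _ in range(max_len - len(data)))
    return rows


# Each of the two Chinese characters occupies 3 bytes; ASCII letters occupy 1.
encoded = byte_onehot("中文 text", max_len=16)
```

The same function applies unchanged to romanized text, where every symbol fits in a single byte.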
The idea of language processing at byte level has been explored by Gillick et al. (2016), where they apply an LSTM-based (Hochreiter and Schmidhuber, 1997) sequence-to-sequence (Cho et al., 2014b) (Sutskever et al., 2014) model at byte-level for a variety of tasks including part-of-speech tagging and named entity recognition, for 4 languages including English, German, Spanish and Dutch. The advantage of byte-level processing is that it can be immediately applied to any language regardless of whether there are too many entities at character or word levels. The same advantage applies to CJK, and perhaps any language that can be digitized as well.
We use the term “embedding” to refer to the idea of associating each entity with a fixed-size vector, as in most papers in the machine learning literature. These vectors are randomly initialized, and then learnt either with an unsupervised criterion or jointly with the task at hand. The advantage of embedding models is that there is no need to explicitly construct one-hot vectors, so the memory footprint of embedding models is significantly smaller than that of OnehotNet. As a result, embedding can be applied to almost any encoding level.
In this article, we use embedding at a variety of different levels, including byte, character, word, romanized character, and romanized word. All of our embedding vectors are of size 256, and they are learnt jointly with the text classification task at hand. The vocabulary size is 257 for byte-level and romanized character-level encoding, 65,537 for character-level encoding, and 200,002 for word-level and romanized word-level encoding.
The character-level encoding considers all code points in the basic multilingual plane (BMP) of Unicode. The word and romanized-word vocabularies are built by selecting the 200,000 most frequent entities appearing in the training data for each dataset, plus one additional entry to represent an out-of-vocabulary symbol. One additional entry is also added to each vocabulary to include a padding symbol for shorter texts. There are 2 embedding models, since we have designed the classifier with 2 different sizes. We will refer to them as large EmbedNet and small EmbedNet respectively. The large EmbedNet admits input length of 512, and the small one 486.
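The vocabulary construction above can be sketched as follows, which also shows where the size 200,002 comes from: 200,000 frequent entities plus one out-of-vocabulary entry and one padding entry. The helper names and the choice of ids 0 and 1 for the two reserved symbols are assumptions of this sketch.

```python
from collections import Counter

PAD_ID = 0  # assumed index for the padding symbol
OOV_ID = 1  # assumed index for the out-of-vocabulary symbol


def build_vocab(tokenized_texts, max_size=200_000):
    """Index the max_size most frequent tokens, after the two reserved ids."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(max_size))}


def encode(tokens, vocab, max_len=512):
    """Map tokens to ids, truncating or padding to exactly max_len entries."""
    ids = [vocab.get(tok, OOV_ID) for tok in tokens[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


# Toy training data; only the 2 most frequent tokens are kept.
train = [["good", "phone"], ["good", "price"], ["good", "phone"]]
vocab = build_vocab(train, max_size=2)
ids = encode(["good", "camera", "phone"], vocab, max_len=5)
```

In this toy run the unseen token "camera" maps to the out-of-vocabulary id and the last two positions are padding.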
When the input text is represented by explicit one-hot vectors, embedding is equivalent to using a linear first layer. Therefore, the difference between OnehotNet and EmbedNet in this article is whether the first layer is linear or convolutional. The idea of embedding was applied to ConvNet-based text processing early on, with representative work for tasks like named entity recognition and part-of-speech tagging (Collobert et al., 2011b), text classification at word level (Kim, 2014) and language modeling at character level (Kim et al., 2016).
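The equivalence between an embedding lookup and a linear layer applied to a one-hot vector can be checked numerically. The sketch below uses toy sizes and hypothetical function names; it assumes a linear layer without bias whose weight matrix is the embedding table.

```python
import random

random.seed(0)
vocab_size, dim = 6, 4  # toy sizes; the article uses e.g. 257 entries of size 256

# Embedding table with one row per entity, randomly initialized.
table = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def embed(index):
    """Embedding lookup: select one row of the table."""
    return table[index]


def linear_on_onehot(index):
    """Bias-free linear layer applied to an explicit one-hot input vector."""
    onehot = [1.0 if i == index else 0.0 for i in range(vocab_size)]
    return [sum(onehot[i] * table[i][j] for i in range(vocab_size))
            for j in range(dim)]


# True when both computations agree for every entity in the toy vocabulary.
same = all(embed(k) == linear_on_onehot(k) for k in range(vocab_size))
```

The lookup avoids materializing the one-hot vector, which is why embedding scales to vocabularies of hundreds of thousands of entries.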
3 Linear Models and fastText
Besides ConvNets, we also offer benchmarks in linear models using multinomial logistic regression, and the fastText program by Joulin et al. (2016).
3.1 Linear Models
The linear multinomial logistic regression models are all bag-of-entity models, where the entity is a character, a word, or a romanized word. The 1-gram bag-of-entity model admits a feature of size 200,000 by selecting the most frequent ones from the training dataset. The 5-gram model admits grams of length up to 5, using the 1,000,000 most frequent features in the training dataset.
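A minimal sketch of this kind of feature extraction is given below: grams of length 1 up to a maximum are counted over the training data, the most frequent ones are kept, and each text becomes a sparse count vector. The function names are hypothetical, dropping unselected grams is an assumption, and the TF-IDF variant is not shown.

```python
from collections import Counter


def ngrams(tokens, max_n=5):
    """Yield every contiguous gram of length 1..max_n as a tuple."""
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            yield tuple(tokens[start:start + n])


def select_features(tokenized_texts, max_n=5, max_features=1_000_000):
    """Index the max_features most frequent grams in the training data."""
    counts = Counter(g for text in tokenized_texts for g in ngrams(text, max_n))
    return {g: i for i, (g, _) in enumerate(counts.most_common(max_features))}


def bag_of_grams(tokens, features, max_n=5):
    """Sparse count vector {feature index: count}; unselected grams are dropped."""
    return dict(Counter(features[g] for g in ngrams(tokens, max_n) if g in features))


# Characters serve as entities here, so no word segmenter is needed.
train = [list("好评"), list("好用")]
features = select_features(train, max_n=2, max_features=3)
vector = bag_of_grams(list("好好"), features, max_n=2)
```

Replacing `list(text)` with the output of a word segmenter gives the word-level variant.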
Note that word segmentation is not a simple problem for some CJK texts, because they sometimes do not contain clear word boundaries like the space character in most alphabetic languages. Section 4.2 introduces how word segmentation is done for each language.
The idea of bag-of-character and its n-gram version has been explored by Peng et al. (2003) for text classification in Asian languages, where they observed comparable results with word-level models. This is probably because of the large character vocabularies in these languages, in which each character has a similar sparsity in representing meaning compared to each word in an alphabetic language.
fastText (Joulin et al., 2016) is a recent tool for fast text classification that incorporates several tricks such as hierarchical softmax (Goodman, 2001) (Mikolov et al., 2013) and feature hashing (Weinberger et al., 2009). Combined with an efficient implementation and a highly optimized learning rate schedule, fastText is able to process input text several orders of magnitude faster than ConvNets. This gives it a particular advantage, and we include its results as a reference for our community.
The fastText model is essentially a 2-layer fully connected neural network without non-linearity. The number of hidden units is 10 across all of our experiments. During training, we use an initial learning rate of 0.1 and a hashing bucket size of 10,000,000. We used 10% of the training dataset for validation and the remainder for training to choose the best number of epochs, from the choices 2, 5 and 10. This validation process is necessary because fastText does not have weight decay (Joulin et al., 2016) and it relies on early stopping to prevent overfitting. It is also the only model fast enough for such hyper-parameter tuning in this article. For each dataset, we explored features at character, word and romanized word levels, with variants of 1-gram, 2-gram and 5-gram features.
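The structure of such a model can be sketched in a few lines: grams are hashed into a fixed number of buckets, the embeddings of the active buckets are averaged into the hidden layer, and a second linear layer produces class scores. This is only an illustration under stated assumptions. `zlib.crc32` is a stand-in hash rather than the function used by the fastText program, a plain softmax replaces the hierarchical softmax, and the function names and sparse storage of embedding rows are hypothetical.

```python
import math
import random
import zlib

BUCKETS = 10_000_000  # hashing bucket size reported above
HIDDEN = 10           # number of hidden units reported above


def hashed_features(tokens, max_n=2):
    """Hash each gram of length 1..max_n into a bucket index.

    zlib.crc32 is a stand-in hash for this sketch, not fastText's own function.
    """
    grams = (" ".join(tokens[i:i + n])
             for n in range(1, max_n + 1)
             for i in range(len(tokens) - n + 1))
    return [zlib.crc32(g.encode("utf-8")) % BUCKETS for g in grams]


def predict(indices, embeddings, output_weights):
    """Average the feature embeddings, apply a linear layer, then softmax."""
    hidden = [0.0] * HIDDEN
    for idx in indices:
        row = embeddings.get(idx, [0.0] * HIDDEN)  # rows are stored sparsely here
        hidden = [h + r / len(indices) for h, r in zip(hidden, row)]
    scores = [sum(w * h for w, h in zip(ws, hidden)) for ws in output_weights]
    peak = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    return [e / sum(exps) for e in exps]


random.seed(0)
indices = hashed_features("this phone is great".split(), max_n=2)
embeddings = {i: [random.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for i in indices}
output_weights = [[random.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(2)]
probs = predict(indices, embeddings, output_weights)
```

Because both layers are linear, the composition has no more representation capacity than a single linear model over the hashed features.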
4 Datasets and Preprocessing
To ensure that our results are significant enough to demonstrate the differences between encoding methods, we need to acquire large-scale datasets. To do that, we set out to crawl the Internet for text classification datasets in 4 languages including Chinese, English, Japanese and Korean. Eventually, we were able to obtain 14 datasets, most of which are at the scale of millions of samples. We performed experiments using all aforementioned models on all of these datasets.
In total, we have obtained 8 sentiment classification datasets from online shopping reviews in Chinese, English, Japanese and Korean, 1 sentiment classification dataset from online restaurant reviews in Chinese, and 3 news topic classification datasets in English and Chinese. Additionally, we were able to combine the online shopping review datasets in different languages to construct 2 joint datasets, which can be used to test each model’s ability to handle different languages in a unified fashion. Table 4 summarizes the statistics of all these datasets.
Dianping. The Dianping dataset consists of user reviews crawled from the Chinese online restaurant review website dianping.com. This dataset was developed and used by Zhang et al. for research in collaborative filtering (Zhang et al., 2013a) (Zhang et al., 2013b) and sentiment analysis (Zhang et al., 2014a) (Zhang et al., 2014b). After removing duplicated texts, we preprocessed the dataset such that stars 1, 2 and 3 belong to the negative class, and stars 4 and 5 belong to the positive class. Then we randomly selected 2,000,000 samples for training and 500,000 samples for testing, with an equal number of samples in each sentiment.
JD. The JD dataset consists of user reviews crawled from the Chinese online shopping website jd.com. After duplication removal, we were able to obtain 2 sentiment classification datasets in which one is to predict the full 5 stars and the other is binary. The binary dataset was built such that stars 1 and 2 belong to the negative sentiment, and stars 4 and 5 belong to the positive sentiment. Star 3 is ignored in the JD binary dataset. There are 3,000,000 training samples and 250,000 testing samples in the JD full dataset, and 4,000,000 training samples and 360,000 testing samples in the JD binary dataset. In each case, the samples are evenly distributed across classes.
Rakuten. The Rakuten dataset consists of user reviews crawled from the Japanese online shopping website rakuten.co.jp. After duplication removal, we were able to obtain 2 sentiment classification datasets in which one is to predict the full 5 stars and the other is binary. The binary dataset was built such that stars 1 and 2 belong to the negative sentiment, and stars 4 and 5 belong to the positive sentiment. Star 3 is ignored in the Rakuten binary dataset. There are 4,000,000 training samples and 500,000 testing samples in the Rakuten full dataset, and 3,400,000 training samples and 400,000 testing samples in the Rakuten binary dataset. In each case, the samples are evenly distributed across classes.
11st. The 11st dataset consists of user reviews crawled from the Korean online shopping website 11st.co.kr. After duplication removal, we were able to obtain 2 sentiment classification datasets in which one is to predict the full 5 stars and the other is binary. The binary dataset was built such that stars 1, 2 and 3 belong to the negative sentiment, and stars 4 and 5 belong to the positive sentiment. There are 750,000 training samples and 100,000 testing samples in the 11st full dataset, and 4,000,000 training samples and 400,000 testing samples in the 11st binary dataset. In each case, the samples are evenly distributed across classes.
Amazon. The Amazon dataset consists of user reviews crawled from the English online shopping website amazon.com. We use the same datasets constructed by Zhang et al. (2015), which came from the Stanford Network Analysis Project (SNAP, http://snap.stanford.edu/) and were developed by McAuley and Leskovec (2013) for sentiment analysis. There are 2 sentiment classification datasets in which one is to predict the full 5 stars and the other is binary. The binary dataset was built such that stars 1 and 2 belong to the negative sentiment, and stars 4 and 5 belong to the positive sentiment. Star 3 is ignored in the Amazon binary dataset. There are 3,000,000 training samples and 650,000 testing samples in the Amazon full dataset, and 3,600,000 training samples and 400,000 testing samples in the Amazon binary dataset. In each case, the samples are evenly distributed across classes.
Ifeng. The Ifeng dataset consists of first paragraphs of news articles from the Chinese news website ifeng.com. We crawled all news from the year 2006 to the year 2016 and selected 5 different news channels as 5 topic classes. These classes are mainland China politics, International news, Taiwan - Hong Kong - Macau politics, military news, and society news. After duplication removal, the dataset consists of 800,000 training samples and 50,000 testing samples. These samples are evenly distributed across classes.
Chinanews. The Chinanews dataset consists of first paragraphs of news articles from the Chinese news website chinanews.com. We crawled all news from the year 2008 to the year 2016 and selected 7 different news channels as 7 topic classes. These classes are mainland China politics, Hong Kong - Macau politics, Taiwan politics, International news, financial news, culture, entertainment, sports, and health. After duplication removal, the dataset consists of 1,400,000 training samples and 112,000 testing samples. These samples are evenly distributed across classes.
NYTimes. The NYTimes dataset consists of first paragraphs of news articles from the English news website nytimes.com. We crawled all news from the year 1981 to the year 2015 and combined several channels to construct 7 topic classes. These classes are business news, New York regional news, sports, U.S. politics, world news and opinions, arts and fashion, and entertainment and science. After duplication removal, the dataset consists of 1,400,000 training samples and 105,000 testing samples. These samples are evenly distributed across classes.
Joint. The four dataset sources JD, Rakuten, 11st and Amazon are all sentiment classification tasks from online shopping websites, with both full 5-star prediction and binary prediction. Therefore, we could combine them in each case to form two new joint datasets of 5 classes or 2 classes. These datasets are particularly useful since they span 4 languages and can be used to test a model’s ability to handle different languages in a unified fashion. In total, there are 10,750,000 training samples and 1,500,000 testing samples in the joint full dataset, and 15,000,000 training samples and 1,560,000 testing samples in the joint binary dataset. All samples are evenly distributed across classes.
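The labeling rules and the joint dataset sizes above can be summarized in a short sketch. The function names are hypothetical. The sample counts are copied from the dataset descriptions, and the flag `three_is_negative` distinguishes the datasets that treat 3 stars as negative (Dianping and 11st) from those that ignore them.

```python
def binary_label(stars, three_is_negative=False):
    """Map a 1-5 star rating to 'negative'/'positive', or None when ignored."""
    if stars >= 4:
        return "positive"
    if stars <= 2 or three_is_negative:
        return "negative"
    return None  # 3-star reviews are dropped (JD, Rakuten, Amazon binary)


# (training, testing) sample counts copied from the dataset descriptions above.
full = {"JD": (3_000_000, 250_000), "Rakuten": (4_000_000, 500_000),
        "11st": (750_000, 100_000), "Amazon": (3_000_000, 650_000)}
binary = {"JD": (4_000_000, 360_000), "Rakuten": (3_400_000, 400_000),
          "11st": (4_000_000, 400_000), "Amazon": (3_600_000, 400_000)}


def totals(sizes):
    """Sum training and testing counts over all sources."""
    return tuple(sum(column) for column in zip(*sizes.values()))


joint_full = totals(full)      # (10750000, 1500000)
joint_binary = totals(binary)  # (15000000, 1560000)
```

The sums reproduce the joint dataset sizes stated in the text.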
4.2 Word Segmentation and Romanization
Since there is no clear word boundary in some of the CJK texts, word segmentation is necessary before applying any of the word-level models. Romanization for some of the CJK texts also depends on word segmentation to produce the correct transliteration in the English alphabet. In this section, we present both the word segmentation and romanization processes used for producing the results, for each of Chinese, Japanese and Korean. All the tools we used are relatively popular and standard for CJK language processing.
Chinese. For Chinese, we use the freely available word segmentation package called jieba (https://github.com/fxsjy/jieba, version 0.38). The romanization standard we used is Pinyin, using the pypinyin package (https://github.com/mozillazg/python-pinyin, version 0.12), which in turn calls jieba to disambiguate between characters with multiple pronunciations.
Japanese. For Japanese, we use the freely available word segmentation and tagging package MeCab (http://taku910.github.io/mecab, version 0.996) with the default model for Japanese. The romanization form used is Hepburn, which is done by converting the segmented words using python-romkan (https://www.soimort.org/python-romkan, version 0.2.1).
Korean. Word segmentation is done for Korean using MeCab as well, but with a model for the Korean language (https://bitbucket.org/eunjeon/mecab-ko-dic). Instead of calling MeCab and parsing the results as for Japanese, we used the MeCab wrapper in KoNLPy (http://konlpy.org), which offers rich information for Korean text. The romanization standard used is the Revised Romanization of Korean (RR), which is done in 2 steps. The first step is to convert any Hanja in the text to Hangul via the Python package hanja (https://github.com/suminb/hanja, version 0.11), and the second step is to transliterate the generated Hangul using the Python package hangul-romanize (https://github.com/youknowone/hangul-romanize).
After introducing the optimization parameters used for all of our models, this section then presents the results for these models. Most of our experiments are implemented using Torch 7 (Collobert et al., 2011a), with NVIDIA CUDNN (https://developer.nvidia.com/cudnn) as the GPU backend.
The optimization process used for all convolutional network models is stochastic gradient descent (SGD) with momentum (Polyak, 1964) (Sutskever et al., 2013). The training process operates on random minibatches of size 16, with different numbers of minibatches per epoch. The sixth column in Table 4 shows the number of minibatches for one epoch for each dataset. The model parameters are initialized in the same way as in He et al. (2015): for each layer the bias is set to 0, and weights are randomly sampled from a Gaussian distribution of mean 0 and standard deviation $\sqrt{2/n}$, where $n$ is the number of output units each input unit connects to. All the models have an initial learning rate of 0.00001, which is halved every 8 epochs. The training stops at the 100th epoch. A small weight decay of 0.00001 is applied to the model to stabilize training. Each model is trained using one NVIDIA Tesla K40 GPU.
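The optimization settings above can be illustrated with a minimal sketch. The function names are hypothetical, epochs are assumed to be counted from 0, the momentum coefficient 0.9 is an assumed value since it is not reported in this section, and the initialization follows the He et al. (2015) scheme with standard deviation sqrt(2 / n), where the argument `fan` plays the role of n.

```python
import math
import random


def learning_rate(epoch, initial=1e-5, halve_every=8):
    """Step schedule: the rate is halved once every halve_every epochs."""
    return initial * 0.5 ** (epoch // halve_every)


def he_init(fan, count, rng=random):
    """Sample count weights from a Gaussian with mean 0 and std sqrt(2 / fan)."""
    std = math.sqrt(2.0 / fan)
    return [rng.gauss(0.0, std) for _ in range(count)]


def momentum_step(weights, grads, velocity, lr, momentum=0.9, weight_decay=1e-5):
    """One SGD-with-momentum update, with weight decay added to the gradient.

    The momentum value 0.9 is an assumption of this sketch.
    """
    for i, (w, g) in enumerate(zip(weights, grads)):
        velocity[i] = momentum * velocity[i] - lr * (g + weight_decay * w)
        weights[i] = w + velocity[i]


# Learning rates at a few epochs of the 100-epoch schedule.
schedule = [learning_rate(e) for e in (0, 7, 8, 16, 99)]
```

Under these assumptions the rate is halved 12 times over 100 epochs, so the final rate is the initial rate divided by 4096.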
The optimization algorithm used for all linear models is parallelized SGD. Each model is trained with a sparse representation via HOGWILD! (Niu et al., 2011) parallelization using 10 CPU cores. An extra core is used for continuously testing on both training and testing datasets. The learning rate used for the algorithm is 0.001. A small weight decay of 0.00001 is applied to each model to stabilize the training process. The training stops after 1000 continuous testing steps are done. All of our models are run with a batch of INTEL XEON E5-2630 v2 CPUs.
The optimization parameters for fastText (Joulin et al., 2016) are controlled by the original authors’ program (https://github.com/facebookresearch/fastText). We set the embedding dimension to 10 with a bucket size of 10,000,000, and go through each dataset for 2, 5 or 10 epochs depending on the validation result from 10% of the training dataset. The optimization algorithm is SGD with a decaying learning rate, where the initial learning rate is set to 0.1 and the decay change rate is set to 100. The number of CPU cores used is 10, with a batch of INTEL XEON E5-2630 v2 CPUs. All other parameters used are the program’s defaults.
The results for all the models are split into several tables. Table 7 lists the results for GlyphNet, where the numbers are testing errors in percentages. Similarly, Tables 8, 9, 10 and 11 list the testing errors for OnehotNet, EmbedNet, linear models and fastText. Where applicable in each table, the best result for each dataset is marked blue and the worst red. The epoch numbers for fastText models are presented in Appendix B.
We have 37 models for each Chinese, Japanese and Korean dataset, and 22 for each English dataset. In total, there are 473 models benchmarked in this article. Due to space limitations, the training errors are not presented in the main text of this article, but readers can refer to Appendix A for them.
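The total of 473 can be reproduced from the per-dataset model counts, as the following sketch shows. It assumes that the two joint datasets, which contain CJK text, are counted with 37 models each; this grouping is our reading of the numbers rather than something stated explicitly.

```python
# Datasets grouped by language, as listed in the dataset section.
datasets = {
    "Chinese": ["Dianping", "JD full", "JD binary", "Ifeng", "Chinanews"],
    "Japanese": ["Rakuten full", "Rakuten binary"],
    "Korean": ["11st full", "11st binary"],
    "Joint": ["Joint full", "Joint binary"],  # assumed to use all 37 models
    "English": ["Amazon full", "Amazon binary", "NYTimes"],
}
models_per_dataset = {lang: 22 if lang == "English" else 37 for lang in datasets}

total_datasets = sum(len(names) for names in datasets.values())
total_models = sum(len(names) * models_per_dataset[lang]
                   for lang, names in datasets.items())
```

Under this assumption, 11 datasets with 37 models and 3 datasets with 22 models give 407 + 66 = 473 models over 14 datasets.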
In this section, we provide some analysis on the testing results presented in the previous section. These analyses include average ranks between models, generalization ability of each model under different encoding mechanisms, and estimations of training time.
6.1 Rank the Models
To compare between different encoding mechanisms, this section presents the ranking of testing errors of all models. For English datasets, there are some missing values in various models in Tables 8, 9, 10 and 11. These values are missing because the corresponding models operate on romanized texts, and there is no romanization because the texts are already in the English alphabet. However, in order to make the model rank between different datasets comparable, we need to make sure that every dataset has the same number of models. To do this, we simply fill the missing values in romanized models with their corresponding ones for English. As a result, all datasets have 37 models to rank.
For each dataset, we rank all of the models in ascending order of their testing errors. The rank is the index of the model in this ordering. As a result, the smaller the rank, the better the model performs. Then, we compute the minimum, first quartile, median, third quartile, and maximum rank across different datasets for each model and put these numbers as a box plot in Figure 2. The numbers in Figure 2 indicate both how each model performs on average, and how stable these models are across different datasets and languages.
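The ranking procedure can be sketched as follows. The error values and model names are made up purely for illustration and are not results from this article. The function names are hypothetical, ties are broken by input order, and the quartile convention (`method="inclusive"`) is an assumption, since the convention behind the box plot is not specified.

```python
import statistics


def ranks(errors):
    """Rank models by ascending testing error; rank 1 is the best model."""
    ordered = sorted(errors, key=errors.get)
    return {model: position for position, model in enumerate(ordered, start=1)}


def five_number_summary(values):
    """Minimum, first quartile, median, third quartile and maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Made-up testing errors (in percent) for three hypothetical models.
errors_by_dataset = {
    "dataset A": {"model x": 8.1, "model y": 9.4, "model z": 7.9},
    "dataset B": {"model x": 22.0, "model y": 21.5, "model z": 23.3},
    "dataset C": {"model x": 5.0, "model y": 6.2, "model z": 6.0},
}
per_dataset = [ranks(errors) for errors in errors_by_dataset.values()]
summary = {model: five_number_summary([r[model] for r in per_dataset])
           for model in errors_by_dataset["dataset A"]}
```

A model with a low median and a narrow range between its minimum and maximum rank is both accurate and stable across datasets.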
From the results, the model that achieved the best consistent performance is the character-level 5-gram fastText (Joulin et al., 2016) model. The result is more apparent in Table 11, where for almost all Chinese, Japanese and Korean datasets the best encoding for fastText is character-level 5-gram. For English, the best encoding is often word n-grams, although character-level 5-gram models are quite competitive as well. Character-level encodings with fewer than 5 grams are significantly worse, with the worst being the bag-of-character linear model with TF-IDF features. Word-level n-gram features are competitive for both linear models and fastText, although our data processing pipeline did not guarantee perfect word segmentation for CJK languages because of the segmenters used.
Convolutional networks consistently have the best stability across different datasets and languages, with the best being byte-level large OnehotNet. This suggests that handling different languages at byte-level, regardless of whether characters span multiple bytes, is quite a feasible solution for handling different languages in a unified fashion. Better still, byte-level language processing requires the least amount of pre-processing: one simply presents UTF-8 encoded strings to the model. Therefore, we believe byte-level models are a promising approach towards applying deep learning to natural language processing.
Finally, many different models have hit rank 1 as their minimum, suggesting that there is no single best model across different datasets and languages. However, this is limited to the model hyperparameters we chose. It is worth noting that hyperparameters are more thoroughly explored for fastText than other models in this article.
In this section, we look at the generalization gap – the expected difference between training and testing errors – of different models. The generalization gap in this article is approximated by the subtraction of the training error from the testing error. The approximation to the underlying sample distribution should be pretty accurate because all our datasets are very large.
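The approximation amounts to a simple subtraction, sketched below with made-up error values for two hypothetical models (these numbers are illustrative only and do not come from our tables).

```python
def generalization_gap(train_error, test_error):
    """Approximate gap: testing error minus training error (same units)."""
    return test_error - train_error


# Made-up error pairs (train %, test %) for two hypothetical models.
errors = {"model x": (1.0, 9.5), "model y": (7.5, 8.0)}
gaps = {name: generalization_gap(tr, te) for name, (tr, te) in errors.items()}
most_overfit = max(gaps, key=gaps.get)
```

A model can thus have a lower training error and still a larger gap, which is the pattern associated with overfitting.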
As an example, Figure 3 visualizes the generalization gap for the Joint binary dataset. This figure exemplifies typical generalization properties of different models for all of our datasets. Additionally, Figure 4 offers a box plot for the rankings on generalization error, computed in the same way as Figure 2 for testing errors.
From these figures, one could easily observe that fastText (Joulin et al., 2016) tends to overfit much more aggressively than either convolutional networks or our own implementation of linear models, in spite of our effort in hyper-parameter tuning. Also, it overfits more using richer features as the number of grams goes from 1 to 5. Given the theoretical fact that fastText could not have more representation capacity than a linear model, this could be a result of the lack of regularization and the aggressive optimization strategy in fastText.
However, the fact that models with simpler representation capacity can overfit so aggressively indicates that generalization does not only depend on the complexity of the model or the number of parameters in the model, but also on its capacity to represent the data for the task at hand. This aspect may be the reason why on average models like convolutional networks can achieve much better results than what can be characterized by the upper bounds of traditional learning theory. This requires further study beyond the current generalization bounds based on statistical concentration inequalities and complexity measurements, and it may require a better characterization of the relationship between representation and generalization.
6.3 Training Time
The training times of different models vary greatly in our experiments. Table 12 offers an estimate of the time it took for each model to go over 1,000,000 samples with the hardware mentioned in the previous section. In general, fastText (Joulin et al., 2016) offers the best training time and only requires CPUs, whereas convolutional networks take the longest time and require GPUs. Depending on the method of encoding, the training time among convolutional networks also differs drastically, with EmbedNet tens of times faster than GlyphNet. Figure 5 visualizes the estimates as a bar chart.
These results show that fastText (Joulin et al., 2016) offers the fastest training and evaluation while achieving competitive results. On the other hand, models using convolutional networks consume the most amount of computation time. As a result, in this article we could afford to do hyper-parameter tuning for fastText but not on convolutional networks.
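An estimate of the time per 1,000,000 samples can be obtained from a shorter timed run by linear scaling. The sketch below assumes that throughput stays constant over the run; the function name and the numbers in the usage line are hypothetical and do not correspond to any entry in Table 12.

```python
def seconds_per_million(elapsed_seconds: float, samples_processed: int) -> float:
    """Extrapolate a timed run to 1,000,000 samples, assuming constant throughput."""
    if samples_processed <= 0:
        raise ValueError("samples_processed must be positive")
    return elapsed_seconds * 1_000_000 / samples_processed


# Hypothetical measurement: 50,000 samples processed in 30 seconds.
estimate = seconds_per_million(30.0, 50_000)
print(estimate)
```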
The convolutional network models in this article are designed not to achieve the best performance, but to allow a fair comparison between different encoding mechanisms within the computational budget we possess. Given that different designs of convolutional networks could offer drastically different performance, we believe there is a great deal of potential for improvement from different design choices on convolutional networks.
It is also worth noting that the task in question, text classification, is quite simple. Convolutional networks may not show an advantage in this specific task, but may become more useful for more complicated reasoning tasks concerning text inputs and outputs. The comparison between different encoding mechanisms presented in this article offers valuable knowledge towards the choice for convolutional networks in general language processing.
7 Other Models
In spite of the 473 models we have benchmarked, this article is in no way a complete essay on every possible model for text classification. Some of the interesting models we did not benchmark include recurrent networks, the use of sparse convolutions for text, and different variations of convolutional architectures.
By focusing on different encoding mechanisms for deep learning models, this article performs experiments only on one kind – convolutional networks. Another often-used kind for processing texts is recurrent networks, constructed using different types of cells like long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) and gated recurrent units (GRU) (Cho et al., 2014a). Some authors have found that recurrent networks applied to different levels of encoding can offer good results for text classification as well (for example, Dai and Le (2015) and Liu et al. (2016)). Combinations of convolutional networks and recurrent networks are also explored for text classification (for example, Xiao and Cho (2016)).
This article explores one-hot encoding for convolutional networks using byte-level encoding and romanization. Another alternative is to implement a convolutional module that can take sequences of indices instead of explicit vectors to represent one-hot encoding. This would avoid the memory overflow problem when applying one-hot encoding to large vocabularies. However, so far no deep learning toolkit has an implementation of such a sparse convolutional module. Furthermore, it may require special numerical optimization that would merit its own essay. Therefore, it is not presented in this article.
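To illustrate why such a module is possible in principle, the following pure-Python sketch compares a temporal convolution over explicit one-hot vectors with one that reads the kernel weights directly by index. It is only an illustration of the equivalence under simple assumptions (no bias, stride 1, no padding); the function names and the kernel layout are hypothetical and do not come from this article or from any toolkit.

```python
from typing import List

# Assumed kernel layout: kernel[out_channel][kernel_offset][vocab_index]
Kernel = List[List[List[float]]]


def one_hot(indices: List[int], vocab_size: int) -> List[List[float]]:
    """Expand a sequence of indices into explicit one-hot vectors."""
    return [[1.0 if v == i else 0.0 for v in range(vocab_size)] for i in indices]


def conv1d_dense(x: List[List[float]], kernel: Kernel) -> List[List[float]]:
    """Temporal convolution over explicit vectors (stride 1, no padding, no bias)."""
    width = len(kernel[0])
    steps = len(x) - width + 1
    return [
        [
            sum(w[k][v] * x[t + k][v] for k in range(width) for v in range(len(w[k])))
            for w in kernel
        ]
        for t in range(steps)
    ]


def conv1d_indexed(indices: List[int], kernel: Kernel) -> List[List[float]]:
    """Same convolution, but each one-hot product is replaced by a weight lookup."""
    width = len(kernel[0])
    steps = len(indices) - width + 1
    return [
        [sum(w[k][indices[t + k]] for k in range(width)) for w in kernel]
        for t in range(steps)
    ]


# Small deterministic example: 2 output channels, kernel width 2, vocabulary of 4.
vocab_size = 4
kernel = [
    [[float(o * 100 + k * 10 + v) for v in range(vocab_size)] for k in range(2)]
    for o in range(2)
]
indices = [2, 0, 3, 1]
dense_out = conv1d_dense(one_hot(indices, vocab_size), kernel)
indexed_out = conv1d_indexed(indices, kernel)
print(dense_out == indexed_out)
```

The indexed version never materializes a vector of vocabulary size per position, so its input memory grows with the sequence length only. The kernel itself still has one weight per vocabulary entry, which is one reason an efficient implementation may need the special numerical treatment mentioned above.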
Finally, the results on convolutional networks in this article are limited to the purpose of offering fair comparisons between different encoding mechanisms. Another dimension of exploration is the design variants of convolutional networks for text processing, such as very deep networks (Conneau et al., 2017), residual (He et al., 2016) and dense (Huang et al., 2016) connections, and advanced pooling schemes for handling the variable length problem (Kalchbrenner et al., 2014) (Johnson and Zhang, 2017). We are optimistic that exploration of all these different architecture designs could improve the results further for convolutional networks.
This article explores the use of different encoding mechanisms for both deep learning and linear models for text classification in Chinese, English, Japanese and Korean. These encoding mechanisms include one-hot encoding, embedding and images of character glyphs. Different levels of encoding are applied to each mechanism whenever applicable, including UTF-8 encoded bytes, characters, words, romanized characters and romanized words. There are in total 473 models benchmarked in this article, including convolutional networks, linear models and fastText (Joulin et al., 2016).
A total of 14 large-scale datasets were built in this article for benchmarking these models in 4 languages including Chinese, English, Japanese and Korean. Most of these datasets have millions of samples for training, and 2 of these datasets contain samples mixed from all 4 languages to test different models' ability to handle different languages in a consistent and unified fashion.
Some conclusions from these results are:
fastText (Joulin et al., 2016) has the best result with character-level n-gram encoding for Chinese, Japanese and Korean texts. For English, the best encoding for fastText is word-level n-grams.
Word-level encoding for CJK languages is competitive even without perfect segmentation, for both fastText and linear models.
The best encoding mechanism for convolutional networks is byte-level one-hot encoding. This indicates that convolutional networks have the ability to understand text from a low-level representation, and offers great simplicity for handling multiple languages in a consistent and unified fashion.
fastText tends to overfit more than convolutional networks, in spite of the fact that it does not have more representation capacity than a linear model.
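As an illustration of the byte-level one-hot encoding referred to in the conclusions above, the sketch below turns a string into a sequence of one-hot vectors over its UTF-8 bytes. It assumes one 256-dimensional vector per byte; the exact vector layout, padding and sequence length used in the experiments may differ, and the function name is hypothetical.

```python
from typing import List


def utf8_one_hot(text: str) -> List[List[int]]:
    """Encode text as one 256-dimensional one-hot vector per UTF-8 byte."""
    vectors = []
    for byte in text.encode("utf-8"):
        vec = [0] * 256  # one slot for each possible byte value
        vec[byte] = 1
        vectors.append(vec)
    return vectors


# One ASCII character (1 byte) followed by one Chinese character (3 bytes).
encoded = utf8_one_hot("a中")
active = [vec.index(1) for vec in encoded]
print(len(encoded), active)
```

Because every language shares the same 256 possible byte values, the same input layer can be used for Chinese, English, Japanese and Korean text without any language-specific vocabulary.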
In the future, we hope to extend the results to recurrent networks, and explore how different designs of convolutional networks would affect the results. We plan to release all the source code used for all the benchmarks, and hope that these results help the community choose which encoding mechanism to use when faced with multilingual text processing.
We want to express our thanks to Junyoung Chung, Yoon Kim, Sainbayar Sukhbaatar and Kentaro Hanaki for offering their knowledge on Korean and Japanese languages. Armand Joulin offered valuable suggestions on hyper-parameter tuning for fastText.
A Training Errors of All Models
B Epochs for fastText Models
The validated epochs for running all fastText models are detailed in Table 17.
- Cho et al. (2014a) Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder–decoder approaches. Syntax, Semantics and Structure in Statistical Translation, page 103, 2014a.
- Cho et al. (2014b) Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October 2014b. Association for Computational Linguistics.
- Collobert et al. (2011a) Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011a.
- Collobert et al. (2011b) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011b.
- Conneau et al. (2017) Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. Very deep convolutional networks for text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1107–1116, Valencia, Spain, April 2017. Association for Computational Linguistics.
- Cortes and Vapnik (1995) Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
- Costa-jussà et al. (2017) Marta R. Costa-jussà, David Aldón, and José A. R. Fonollosa. Chinese–spanish neural machine translation enhanced with character and word bitmap fonts. Machine Translation, pages 1–13, 2017. ISSN 1573-0573. doi: 10.1007/s10590-017-9196-0.
- Cox (1958) David R Cox. The regression analysis of binary sequences. Journal of the Royal Statistical Society. Series B (Methodological), pages 215–242, 1958.
- Dai and Le (2015) Andrew M. Dai and Quoc V. Le. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, NIPS, 2015.
- Gillick et al. (2016) Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. Multilingual language processing from bytes. In Proceedings of NAACL-HLT, pages 1296–1306, 2016.
- Goodman (2001) Joshua Goodman. Classes for fast maximum entropy training. In ICASSP, pages 561–564. IEEE, 2001. ISBN 0-7803-7041-4.
- He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Huang et al. (2016) Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
- Johnson and Zhang (2017) Rie Johnson and Tong Zhang. Deep pyramid convolutional neural network for text classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2017.
- Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016.
- Kalchbrenner et al. (2014) Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 212–217. Association for Computational Linguistics, 2014.
- Kim (2014) Yoon Kim. Convolutional neural networks for sentence classification. EMNLP 2014, 2014.
- Kim et al. (2016) Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. Character-aware neural language models. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
- Liu et al. (2017) Frederick Liu, Han Lu, Chieh Lo, and Graham Neubig. Learning character-level compositionality with visual features. In The 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, Canada, July 2017.
- Liu et al. (2016) Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Recurrent neural network for text classification with multi-task learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2873–2879. AAAI Press, 2016.
- McAuley and Leskovec (2013) Julian McAuley and Jure Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM conference on Recommender systems, pages 165–172. ACM, 2013.
- Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
- Nair and Hinton (2010) Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
- Niu et al. (2011) Feng Niu, Benjamin Recht, Christopher Re, and Stephen Wright. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 693–701. Curran Associates, Inc., 2011.
- Peng et al. (2003) Fuchun Peng, Xiangji Huang, Dale Schuurmans, and Shaojun Wang. Text classification in asian languages without word segmentation. In Proceedings of the sixth international workshop on Information retrieval with Asian languages-Volume 11, pages 41–48. Association for Computational Linguistics, 2003.
- Polyak (1964) B.T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1 – 17, 1964. ISSN 0041-5553.
- Shimada et al. (2016) D. Shimada, R. Kotani, and H. Iyatomi. Document classification through image-based character embedding and wildcard training. In 2016 IEEE International Conference on Big Data (Big Data), pages 3922–3927, Dec 2016. doi: 10.1109/BigData.2016.7841067.
- Sparck Jones (1972) Karen Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of documentation, 28(1):11–21, 1972.
- Sutskever et al. (2013) Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
- Weinberger et al. (2009) Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1113–1120. ACM, 2009.
- Xiao and Cho (2016) Yijun Xiao and Kyunghyun Cho. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv:1602.00367, 2016.
- Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657, 2015.
- Zhang et al. (2013a) Yongfeng Zhang, Min Zhang, Yiqun Liu, and Shaoping Ma. Improve collaborative filtering through bordered block diagonal form matrices. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, pages 313–322. ACM, 2013a.
- Zhang et al. (2013b) Yongfeng Zhang, Min Zhang, Yiqun Liu, Shaoping Ma, and Shi Feng. Localized matrix factorization for recommendation based on matrix block diagonal forms. In Proceedings of the 22nd international conference on World Wide Web, pages 1511–1520. ACM, 2013b.
- Zhang et al. (2014a) Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu, and Shaoping Ma. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pages 83–92. ACM, 2014a.
- Zhang et al. (2014b) Yongfeng Zhang, Haochen Zhang, Min Zhang, Yiqun Liu, and Shaoping Ma. Do users rate or review?: Boost phrase-level sentiment labeling with review-level sentiment classification. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pages 1027–1030. ACM, 2014b.