Text classification with pixel embedding
We propose a novel framework for text understanding that converts sentences or articles into video-like 3-dimensional tensors. Each frame, corresponding to a slice of the tensor, is a word image rendered from the word’s shape. The length of the tensor equals the number of words in the sentence or article. This transformation from text to a 3-dimensional tensor makes it convenient to implement an $n$-gram model with convolutional neural networks. Concretely, we apply a 3-dimensional convolutional kernel to the 3-dimensional text tensor. The first two dimensions of the kernel equal the size of the word image, and the last dimension is $n$. That is, each time the 3-dimensional kernel slides over the word sequence, the convolution covers $n$ word images and outputs a scalar. Repeating this process for each $n$-gram along the sentence or article with multiple kernels yields a 2-dimensional feature map. A 1-dimensional max-over-time pooling is then applied to this feature map, and three fully connected layers perform the final text classification. Experiments on several text classification datasets show that the proposed model outperforms existing methods.
Word representation is the foundation of natural language processing (NLP) tasks such as text classification [kim2014convolutional], machine translation [sutskever2014sequence], and question answering [zhou2015simple]. The most straightforward approach to word representation is one-hot encoding, which projects words into sparse 1-of-$V$ vectors, with $V$ being the size of the vocabulary. Another popular framework constructs word vectors in an unsupervised way, as in word2vec [mikolov2013distributed; pennington2014glove]. Both the 1-of-$V$ and word2vec encodings have limitations: the one-hot embedding suffers from the curse of dimensionality, and word2vec requires a corpus for pre-training. Both are word-level embeddings. In addition to word-level encoding, zhang2015character propose a character-level convolutional neural network (char-CNN), which quantizes the characters of alphabetic scripts. However, character-level embedding is impractical for ideographic languages such as Chinese and Japanese, because the number of characters in such languages can be huge.
Intuitively, when we read an article on a screen, our eyes capture the text as a series of images, which are then passed to the brain for recognition and understanding. Hence, a natural way to represent words is to use the visual shapes of words or characters as features [shimada2016document; sun2019vcwe; su2017learning; liu2017learning]. For example, su2017learning and shimada2016document treat Chinese and Japanese characters as images and feed them to a convolutional autoencoder, which outputs low-dimensional character embeddings. With such character embeddings, a char-CNN [shimada2016document] or traditional recurrent neural networks [su2017learning; liu2017learning] can be adapted to Chinese and Japanese text analysis tasks.
However, these visual embedding models have clear limitations: (1) characters are processed separately with a traditional local convolutional kernel, which ignores character statistics and word co-occurrence ($n$-gram characteristics); (2) the models compress the visual representation of a word or character into a low-dimensional vector, which reduces interpretability; and (3) existing visual embedding models are all designed for ideographic languages. To address these problems, we propose a novel framework that adapts a word’s pixel (visual) embedding to English and can be easily extended to other languages. Concretely, we render each word in a document or sentence as an image and then stack those images sequentially into a 3-dimensional tensor. That is, the document or sentence is converted into a video-like 3-dimensional tensor: each frame of the video corresponds to a word, and the length of the video equals the number of words in the document. To capture the $n$-gram characteristics of the text, we apply 3-dimensional convolutions to the “text video”. In contrast to the small local kernels traditionally used in image convolution, we use a large 3-dimensional kernel of size $h \times w \times n$ to extract statistical information from the text, where $w$ and $h$ are the width and height of the word images, respectively, and $n$ is the number of words covered by the kernel. This can be interpreted as an $n$-gram model, as shown in Figure 1. With multiple 3-dimensional kernels, the convolutional layer outputs a feature matrix whose columns are features of $n$-grams and whose rows correspond to the channels of different kernels. Following [kim2014convolutional], a 1-dimensional max-over-time pooling is then applied to this 2-dimensional feature map. Finally, three fully connected (FC) layers perform text classification.
The contributions of our work are three-fold:
We propose to represent a sentence or an article with a video-like 3-dimensional tensor, and each frame of this tensor represents one word in the sentence or article;
We use a 3-dimensional convolutional kernel to learn $n$-gram features from the tensor representation of the text;
We evaluate our model on several text classification tasks in terms of both performance and interpretability.
2 Related Works
Recently, deep learning has achieved impressive performance on NLP tasks [kim2014convolutional; wang2015semantic; iyyer2015deep; goldberg2016primer; jiang2018text; jacovi2018understanding; tang2019entity]. In NLP with deep neural networks, one typically needs to embed the raw text into features that computers can “recognize and understand”. Existing approaches to text embedding can be categorized into three levels, from coarse to fine. The first is the document-level or sentence-level approach, which embeds documents or sentences into vectors [le2014distributed; lin2017a]. The second is word-level embedding [mikolov2013distributed; pennington2014glove; joulin2017bag], and the third is character-level [zhang2015character] or radical-level embedding [ke2017radical].
The simplest implementation of word-level embedding encodes words as one-hot vectors. The dimension of the one-hot vector equals the size of the vocabulary $V$. Typically, $V$ ranges from thousands to tens of thousands, which can lead to the curse of dimensionality. Another approach constructs a corpus-level matrix that contains statistical information about the corpus and then computes word representations by factorizing this matrix [deerwester1990indexing]. However, the constructed matrix is usually large, which makes the decomposition very time-consuming. The most classical word-level embedding is based on the word2vec framework [mikolov2013distributed], which originates from the neural language model. Word2vec encodes the semantic features of a word into a low-dimensional dense vector using the word’s local context; the result is also called a distributed word vector. However, the quality of the word vectors depends heavily on the quality and quantity of the corpus. As an improvement, pennington2014glove propose to combine global matrix factorization [deerwester1990indexing] with local context [mikolov2013distributed], which strikes a balance between performance and cost. For an NLP task, both the one-hot and the distributed representations have limitations: the one-hot embedding suffers from the curse of dimensionality, and word2vec requires a corpus as well as pre-training before the specific NLP task.
Another popular framework is document-to-vector or sentence-to-vector embedding [le2014distributed; lin2017a], which aims to represent sentences, paragraphs, and documents as vectors. le2014distributed propose a “Paragraph Vector” model that learns fixed-length feature representations for sentences or documents in an unsupervised way. The “Paragraph Vector” is based on word2vec [mikolov2013distributed].
zhang2015character propose a character-level encoding model that quantizes the characters of English words sequentially. Combining this text embedding with convolutional neural networks (CNN), their method achieves excellent results on text classification. Unfortunately, the character-level encoding is only applicable to phonographic scripts such as English and cannot be extended to logographic languages such as Chinese or Japanese. Following the char-CNN, ke2017radical propose to bridge this gap by encoding Chinese and Japanese characters with their semantic radical components. In their model, each Chinese or Japanese character is divided into a sequence of radical-level embeddings. However, the radical-level method ignores the spatial structure of Chinese characters, which is a major difference between Chinese and alphabetic scripts. The pixel embedding proposed in this paper differs from all of these encoding methods. It is directly motivated by the way humans read: the eyes receive visual signals of the text and send them to the brain for further analysis. Therefore, we use the pixel image of the text as its representation, which mimics how humans read.
In contrast to the one-hot, word2vec, and character-level embeddings, humans read and understand text from a completely different perspective, one based on the visual shapes of the words. Intuitively, when we read a web article on a screen or a book, our eyes capture the text as a series of images rather than embedding it into vectors. In other words, humans understand text through the visual information of the words, i.e., we recognize characters or words from the images captured by our eyes. Therefore, we believe that the pixel image, i.e., the character’s morphological shape, provides a natural way to represent characters and words. Motivated by this idea, several visual embedding methods [shimada2016document; su2017learning; sun2019vcwe] have been developed for Chinese and Japanese text understanding. However, it is difficult to visually embed alphabetic languages such as English, because English words vary in length and cannot be rendered as same-sized images the way Chinese or Japanese characters can.
|Slangs and abbreviations|
|Remove stop words|
|Remove low-frequency words|
|Stem and lemmatization||–||–|
|Maintain a vocabulary|
|Sparsity of vector|
|Dimension of vector||70|
The proposed model for text classification is shown in Figure 1. Given a document or sentence $S$, we first render each word $w_i$ in it as a matrix $M_i \in \mathbb{R}^{h \times w}$. The sequence of word matrices is then stacked into a 3-dimensional tensor $\mathcal{T} \in \mathbb{R}^{h \times w \times l}$, where $l$ is the length of the sentence. In other words, the document or sentence is treated as a “video”, and each frame of the video corresponds to a word of $S$ with size $h \times w$. Rather than extracting a word’s representation from the visual pixel map with a convolutional autoencoder [shimada2016document; su2017learning], our model applies a 3-dimensional convolutional layer directly to the “text video”. The size of the convolutional kernel is $h \times w \times n$, where $n$ is the number of words that the kernel covers at a time. Hence, the 3-dimensional kernel acts as an $n$-gram detector. A single convolution with multiple kernels produces a new feature map, as shown in Figure 1. After the convolutional layer, we apply max-over-time pooling [collobert2011natural] for down-sampling. This pooling differs from the pooling that is popular in computer vision: we conduct 1-dimensional max pooling along the time axis for each channel. The pooled 2-dimensional feature map is then passed through a nonlinear activation function (e.g., ReLU). Finally, we flatten the feature map, and the FC layers take the flattened vector as input to make the final classification.
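The pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: random arrays stand in for rendered word images, and the function names `ngram_conv` and `max_over_time`, the toy sizes, and the choice of n = 2 are all made up for the example.

```python
import numpy as np

def ngram_conv(text_tensor, kernels):
    """Slide full-frame 3-D kernels along the word axis.

    text_tensor: (h, w, l) array, one h-by-w word image per word.
    kernels:     (k, h, w, n) array, k kernels that each cover n words.
    Returns a (k, l - n + 1) feature map: one scalar per kernel per n-gram.
    """
    h, w, l = text_tensor.shape
    k, kh, kw, n = kernels.shape
    assert (kh, kw) == (h, w), "kernel must cover the whole word image"
    feature_map = np.empty((k, l - n + 1))
    for j in range(l - n + 1):
        window = text_tensor[:, :, j:j + n]  # n consecutive word images
        # Weighted sum over height, width and the n words, for every kernel.
        feature_map[:, j] = np.tensordot(kernels, window, axes=3)
    return feature_map

def max_over_time(feature_map, size=3, stride=3):
    """1-D max pooling along the n-gram (time) axis, separately per channel."""
    _, t = feature_map.shape
    starts = range(0, t - size + 1, stride)
    return np.stack([feature_map[:, s:s + size].max(axis=1) for s in starts], axis=1)

rng = np.random.default_rng(0)
h, w, l, n, k = 20, 131, 10, 2, 4           # illustrative sizes only
text_tensor = rng.random((h, w, l))         # stand-in for rendered word images
kernels = rng.standard_normal((k, h, w, n))

features = ngram_conv(text_tensor, kernels)         # shape (4, 9)
pooled = np.maximum(max_over_time(features), 0.0)   # pooling, then ReLU
print(features.shape, pooled.shape)                 # (4, 9) (4, 3)
```

Because each kernel spans the full word image, the convolution reduces every window of n words to a single scalar, which is what allows each column of the feature map to be read as an n-gram score.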
Table 1 compares the word’s visual representation with other existing word embedding schemes in terms of data preprocessing steps. From this summary, it is clear that the char-CNN and our method require far fewer preprocessing steps than the one-hot vector and the distributed representation. The last two rows of Table 1 summarize the sparsity and dimension of the word vectors for each method.
3.2 Network Implementation
The network architecture can be described as follows:
Conv3d layer: kernel size = (20, 131, 3), stride = (1, 1, 1), number of kernels = 50, padding = 0;
MaxPool1d layer (the max-over-time pooling): kernel size = 3, stride = 3, dilation = 3, padding = 0;
FC layer 1: input = 1250, output = 512;
FC layer 2: input = 512, output = 100;
FC layer 3: input = 100, output = number of classes.
The specification stride = 3 for the MaxPool1d results in no overlaps in max-over-time pooling.
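As a sanity check on the layer sizes listed above, the output lengths can be worked out with the standard output-size formulas for an unpadded convolution and for a dilated 1-dimensional pooling. The sketch below is only arithmetic. The sentence length of 83 words is an assumption (the text does not state it), chosen because it makes the flattened size equal the 1250 inputs of FC layer 1; the helper names are hypothetical.

```python
def conv_out_len(length, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis."""
    return (length + 2 * padding - kernel) // stride + 1

def pool_out_len(length, kernel, stride, dilation=1, padding=0):
    """Output length of a (possibly dilated) 1-D max pooling."""
    return (length + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

num_words = 83      # ASSUMED sentence length, not given in the text
num_kernels = 50    # number of Conv3d kernels, as listed

# Conv3d with kernel (20, 131, 3) on 20 x 131 word images: the two image
# axes collapse to length 1, and the word axis shrinks by n - 1 = 2.
height_out = conv_out_len(20, 20)
width_out = conv_out_len(131, 131)
time_out = conv_out_len(num_words, 3)

# MaxPool1d with kernel size 3, stride 3, dilation 3, padding 0.
pooled = pool_out_len(time_out, kernel=3, stride=3, dilation=3)

flattened = num_kernels * pooled
print(height_out, width_out, time_out, pooled, flattened)  # 1 1 81 25 1250
```

Under these formulas, any sentence length from 81 to 83 words gives the same flattened size of 1250.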
3.3 Model Interpretation
The proposed model has a concise structure with one convolutional layer, one max-pooling layer, and three subsequent FC layers. In image processing, a convolution between a kernel and the local pixels of an image (usually covering a small area) can blur, sharpen, or emboss the image, or detect its edges. This operation is applied repeatedly across neighbouring local areas.
Unlike the traditional approach, in which the kernel focuses on local pixels of an image, we compute the weighted sum between the convolutional kernel and the whole word image, as shown in Figure 1. This means that the size of the convolutional kernel must match the size of the word images. Because the text is represented as a video-like tensor, the 3-dimensional kernel computes this weighted sum over several word images at once as it slides over the text tensor. We prefer this kind of global convolution to local convolution for two reasons: first, the information in text images is centralized; second, this design makes it convenient to interpret the convolutional operation as an $n$-gram detector.
As shown in Figure 1, each time the convolutional kernel slides over the text it operates on two neighbouring word images, so in this case it is a 2-gram detector. During training, we input a sentence or an article that contains $l$ words, and the first layer of the proposed model outputs $n$-gram features sequentially. High-frequency $n$-grams of the corpus are repeatedly detected by the 3-dimensional kernel, so the components of the feature vector that correspond to them are larger than the others. In contrast, the components that correspond to low-frequency $n$-grams are small. By applying $k$ different kernels, we obtain a feature map $F$ as the output of the first layer, as shown in Figure 1. The columns of $F$ are the $n$-gram features and the rows correspond to the channels.
At test time, a test sentence is fed to the trained model and its first layer produces the corresponding feature map $F$. As stated earlier, a larger value of $F_{ij}$ indicates that the $j$-th 2-gram of the test sentence is more strongly detected by the $i$-th filter, where $i$ is the index of the kernel and $j$ is the index of the 2-gram.
|Dataset||Training||Validation||Testing||Classes||Ave length||Content|
|DBPedia||448,000||112,000||70,000||14||52||Title + abstract of article|
“Classes” represents the number of classes, and “Ave length” refers to the average number of words in the content.
For comparisons, we consider four baseline methods as follows:
The character-level convolutional neural networks (char-CNN) [zhang2015character];
CNN for text classification on top of the one-hot word vectors, denoted as CNN one-hot;
CNN for text classification on top of the distributed word vectors obtained via word2vec [kim2014convolutional], denoted as CNN word2vec.
We experiment with two variants of the proposed model:
Our model with the max-over-time pooling, as shown in Figure 1;
Our model by substituting the 1-dimensional max-over-time pooling with a 2-dimensional max pooling. The kernel size is .
Five datasets used in our experiments are described as follows:
The AG’s news corpus is a collection of more than 1 million news articles (https://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html). In [zhang2015character], the four largest classes, namely World, Sports, Business, and Science/Technology, are selected from this corpus. Each sample is constructed by joining the title and description fields.
The DBPedia ontology dataset (https://wiki.dbpedia.org/services-resources/dbpedia-data-set-2014). The DBpedia dataset uses a large multi-domain ontology derived from Wikipedia [lehmann2015dbpedia]. The DBpedia ontology dataset is constructed by picking 14 non-overlapping classes: Company, Educational Institution, Artist, Athlete, Office Holder, Mean Of Transportation, Building, Natural Place, Village, Animal, Plant, Album, Film, Written Work. For each of these 14 ontology classes, each sample joins the title and abstract of a Wikipedia article.
Yelp reviews. The Yelp reviews dataset is obtained from the Yelp Dataset Challenge in 2015. Each review of this dataset has one user’s review score ranging from 1 star to 5 stars. Predicting the number of users’ review stars corresponds to a 5-class classification task.
Yahoo Answers corpus. This corpus is extracted from the Yahoo! Answers Comprehensive Questions and Answers version 1.0. We follow [zhang2015character] to construct a topic classification dataset from this corpus by selecting 10 main categories: Society & Culture, Science & Mathematics, Health, Education & Reference, Computers & Internet, Sports, Business & Finance, Entertainment & Music, Family & Relationships, and Politics & Government.
The Amazon reviews dataset. We obtain the Amazon review dataset from the Stanford Network Analysis Project (SNAP); it spans 18 years and contains 34,686,770 reviews from 6,643,669 users on 2,441,053 products [mcauley2013hidden]. Unlike the Yelp review dataset, we predict a binary sentiment label for each review in the Amazon dataset. Reviews with 1 or 2 stars are labelled as negative, and those with 3 or 4 stars are labelled as positive. The training, validation, and testing splits for all five datasets are shown in Table 2.
4.3 Training setting
To implement our model, we need to render words as word images. Unfortunately, the lengths of English words vary dramatically and can sometimes be very large, as shown in Figure 2. In particular, some words in web text have more than 50 characters. Since all word images must have the same size, the image width is determined by the longest word. If the maximum length is too large, short words are rendered with a lot of blank space, which introduces redundancy.
For English, most words are shorter than 17 characters, as shown in Figure 2. To balance performance and redundancy, words longer than 17 characters are removed from the corpus. That is, we set the maximum word length in our corpus to 17. With this threshold, we render each word in our corpus as a word image of $20 \times 131$ pixels (height $\times$ width). Here, 131 pixels is the minimum width that can hold 17 English characters at a font size of 20. The font we use for English characters is “New Times”.
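The length threshold above amounts to a simple filter before rendering. The following sketch assumes whitespace tokenization, which the text does not specify; the function name and the sample sentence are hypothetical, and only the threshold of 17 characters and the 20 x 131 image size come from the text.

```python
MAX_WORD_LEN = 17          # threshold stated in the text
IMAGE_SIZE = (20, 131)     # word image height and width in pixels

def filter_long_words(text, max_len=MAX_WORD_LEN):
    """Split on whitespace (an assumed tokenizer) and drop over-long words."""
    return [word for word in text.split() if len(word) <= max_len]

sample = "visit http://example.com/a/very/long/url for internationalization tips"
kept = filter_long_words(sample)
print(kept)                        # ['visit', 'for', 'tips']

# Shape of the resulting text tensor: one word image per remaining word.
tensor_shape = IMAGE_SIZE + (len(kept),)
print(tensor_shape)                # (20, 131, 3)
```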
We use the Adam [kingma2014adam] algorithm as the network optimizer with the learning rate equal to 0.0001. The dropout rate of the FC network is 0.5.
4.4 Prediction for Topics and Sentiments
The first experiment concerns the overall performance of text classification on both document topics and sentiments. Table 3 shows the testing accuracy of all models on the five datasets. The first four datasets have multiple categories, and the last dataset, “Amazon”, has binary labels for review polarity.
The proposed methods outperform the existing ones on all five datasets. It is worth emphasizing that both of our models (with or without max-over-time pooling) accept text images as inputs without any of the preprocessing steps required by other approaches, such as correcting misspellings, removing low-frequency words and stop words, stemming and lemmatization, or maintaining a vocabulary of words or characters. Our methods also do not need pre-trained word vectors, unlike the word2vec-based methods. The results of the two variants of our model demonstrate that max-over-time pooling is an efficient and necessary operation for text feature extraction. Furthermore, compared with the existing methods, the proposed model is much more interpretable, as detailed in the next section.
|New w/o max||0.83||0.93||0.72||0.56||0.91|
4.5 Interpreting the 3-dimensional convolution
The 3-dimensional convolutional kernel acts as an $n$-gram detector in our model. As shown in Figure 1, the convolutional kernel operates on two frames (i.e., two words) at a time, and thus corresponds to a bi-gram detector. For a sentence $S$ of length $l$, a single kernel generates a feature vector $f \in \mathbb{R}^{l-1}$, which is a continuous bi-gram feature of $S$. By applying $k$ different kernels, we obtain a feature map $F \in \mathbb{R}^{k \times (l-1)}$ after the 3-dimensional convolution.
During testing, we render a test sentence $S$ into a text video $\mathcal{T}$. We then input $\mathcal{T}$ to the trained model, and the first layer outputs the corresponding feature map $F$. Its element $F_{ij}$ is the convolution result between the $i$-th 3-dimensional convolutional kernel and the $j$-th bi-gram (word pair $(w_j, w_{j+1})$). A larger $F_{ij}$ indicates that the $j$-th bi-gram of the input sentence is more relevant to the classification task (as selected by kernel $i$). By identifying the maximum element of $F$, we can find the bi-gram in $S$ that is most related to the classification task, where $1 \le j \le l-1$ and $l$ is the number of words in $S$. For example, in sentiment classification, the bi-gram “not great” is likely to indicate a negative review and to yield a large value of $F_{ij}$ after convolution. We can also identify the most significant $n$-grams of other orders, such as tri-grams and four-grams, by setting different values of $n$. For convenience of visualization, we study bi-grams in the following experiments.
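Locating the most relevant bi-gram reduces to an arg-max over the feature map. The sketch below uses NumPy with a made-up feature map for two kernels; the function name `top_bigram` and all numeric values are illustrative assumptions rather than outputs of the trained model.

```python
import numpy as np

def top_bigram(words, feature_map):
    """Return (kernel index, bi-gram, score) for the largest entry of the
    k x (l - 1) feature map produced by the bi-gram convolution."""
    i, j = np.unravel_index(np.argmax(feature_map), feature_map.shape)
    return int(i), (words[j], words[j + 1]), float(feature_map[i, j])

words = ["the", "food", "was", "not", "great"]
# Toy feature map for k = 2 kernels and l - 1 = 4 bi-grams (made-up values).
feature_map = np.array([[0.1, 0.3, 0.2, 0.9],
                        [0.4, 0.0, 0.5, 0.6]])

kernel, bigram, score = top_bigram(words, feature_map)
print(kernel, bigram, score)  # 0 ('not', 'great') 0.9
```

Column $j$ of the feature map always pairs word $j$ with word $j+1$, so the column index alone is enough to recover the detected phrase.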
We visualize the weighted 2-grams according to the first layer of the network trained to classify the AG’s news dataset. The dataset has four classes: “World”, “Sports”, “Business”, and “Science & Technology”. There are 7600 test samples in total, with 1900 samples per class. Because all test 2-grams are weighted by the feature map for each class, we can visualize them as separate two-word phrase (2-gram) clouds, as shown in Figure 4 (a)–(d). A larger font size in a word cloud indicates that the two-word phrase was detected more frequently by the 3-dimensional 2-gram detectors. The two words of each 2-gram are joined by an underscore “_” for convenience of visualization.
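The frequencies behind such a phrase cloud can be collected with a simple counter. This is a sketch under assumptions: the input list of detected bi-grams is made-up toy data, and the helper name is hypothetical; only the underscore-joining convention comes from the text.

```python
from collections import Counter

def bigram_counts(detected):
    """Count how often each bi-gram was detected, joining its words by '_'."""
    return Counter("_".join(pair) for pair in detected)

# Toy detections for one class (made-up data for illustration).
detected_world = [("in", "Iraq"), ("President", "Bush"), ("in", "Iraq")]

counts = bigram_counts(detected_world)
print(counts.most_common(1))  # [('in_Iraq', 2)]
```

The resulting counts could then be passed to any word-cloud renderer, with font size proportional to the count.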
In Figure 4 (a), we observe that the 2-gram “in Iraq” is the most frequently detected phrase for the category “World”. Other phrases associated with “World” news, such as “Canadian Press”, “NEW YORK”, “President Bush”, and “UNITED NATIONS”, are also highlighted by our 2-gram detectors. In Figure 4 (b)–(d), the 2-grams “World Cup”, “the Olympic”, and “Formula One” are detected for the category “Sports”; “the company”, “target= stocks”, “Oil prices”, and so on are detected for the category “Business”; and “the company”, “Apple Company”, and so on are detected for the category “Science & Technology”.
Comparing Figure 4 (a) with Figure 4 (d), we observe that “NEW YORK” appears among the most heavily weighted 2-grams for both “World” and “Science & Technology”. This suggests that the phrase “NEW YORK” may be ambiguous for our 2-gram detector when distinguishing “World” news from “Science & Technology” news. The same situation arises between “Business” and “Science & Technology”: comparing Figure 4 (c) with Figure 4 (d), we find that the most highlighted 2-grams for these two categories share “the company”, which clearly belongs to both. In contrast, the highlighted 2-grams in Figure 4 (b) have no overlap with those of the other three categories, which makes the category “Sports” the most distinctive. The confusion matrix of the four-class classification in Figure 3 supports this argument.
We propose a novel framework for text understanding that converts English sentences or articles into video-like 3-dimensional tensors, which can be viewed as “text videos”. Each frame, or slice, of the tensor is a word image rendered from the word’s shape. This transformation makes it convenient to implement an $n$-gram model based on convolutional neural networks. We achieve this by applying a 3-dimensional convolutional kernel to the text tensors. The first two dimensions of the kernel are the same as the size of the word image, and the last dimension is $n$. That is, the 3-dimensional kernel covers $n$ words and outputs a scalar each time. A 1-dimensional max-over-time pooling is then applied to the resulting feature map, followed by three FC layers for text classification. Experiments on topic and sentiment classification show that the proposed model outperforms existing methods. Our model can be readily applied to other languages as well as to other NLP tasks such as machine translation.