The Power of Communities: A Text Classification Model with Automated Labeling Process Using Network Community Detection


Minjun Kim
Department of Systems Science and Industrial Engineering, and Center for Collective Dynamics of Complex Systems, Binghamton University, State University of New York, Binghamton, NY 13902, USA
Pypestream, Inc., New York, USA
Email: mkim151@binghamton.edu

Hiroki Sayama
Department of Systems Science and Industrial Engineering, and Center for Collective Dynamics of Complex Systems, Binghamton University, State University of New York, Binghamton, NY 13902, USA
Waseda Innovation Lab, Waseda University, Tokyo, Japan
Email: sayama@binghamton.edu
Abstract

Text classification is one of the most critical areas of research in machine learning and artificial intelligence. It has been actively adopted in many business applications such as conversational intelligence systems, news article categorization, sentiment analysis [1], emotion detection systems [2], and many other recommendation systems in our daily life. One of the problems in supervised text classification models is that the models’ performance depends heavily on the quality of data labeling, which is typically done by humans. In this study, we propose a new network community detection-based approach to automatically label and classify text data into multiclass value spaces. Specifically, we build a network with sentences as the network nodes and pairwise cosine similarities between TFIDF vector representations of the sentences as the network link weights. We use the Louvain method [16] to detect the communities in the sentence network. We train and test Support vector machine and Random forest models on both the human-labeled data and the data labeled by network community detection. The models trained on the data labeled by network community detection outperformed the models trained on the human-labeled data by 2.68-3.75% in classification accuracy. Our method may help the development of more accurate conversational intelligence systems and other text classification systems.

Network science · Network community detection · Sentence networks · Document networks · Machine learning · Natural language processing · Text classification · TF-IDF · Cosine similarity
1 Introduction

Text data is a great source of knowledge for building many useful recommendation systems and search engines as well as conversational intelligence systems. However, structuring unstructured text data is often a difficult and time-consuming task, especially when it comes to labeling the data for training text classification models. Data labeling, typically done by humans, is prone to produce mislabeled data entries, and it is hard to track whether the data is correctly labeled or not. This human labeling practice indeed impacts the quality of the trained models in solving classification problems.

Some previous studies attempted to solve this problem by utilizing unsupervised [3, 5] and semi-supervised [4] machine learning models. However, those studies either used predefined keyword lists for each category in the document, which provide the models with extra reference material to consult when making classification predictions, or included already-labeled data as part of the data set from which the models learn. When clustering algorithms such as K-means [5] are used, since the features selected for each class depend on the frequency of particular words in the sentences, words that appear frequently in multiple sentences may end up serving as features for multiple classes, rendering the model more ambiguous and resulting in poor performance in classifying documents.

Although there are many studies of text classification problems using machine learning techniques, only a limited number of studies have applied network science to text classification. Network science is actively being adopted in studying biological networks, social networks, financial market prediction [6] and more, in many fields of study, to mine insights from collectively interconnected components by analyzing their relationships and structural characteristics. Only a few studies have adopted network science theories to study text classification, and they showed preliminary results of text clustering performed by network analysis, especially with network community detection algorithms [7, 8]. However, those studies did not clearly show the quality of the community detection algorithms or other potentially useful features. Network community detection [9] is a family of graph clustering methods actively used in complex network analysis, from large social network analysis [10] to RNA-sequencing analysis [11], as a tool to partition graph data into multiple parts based on the network’s structural properties such as betweenness, modularity, etc.

In this paper, we study further the usefulness of network community detection for labeling unlabeled text data, which automates and improves human labeling tasks, and for training machine learning classification models for a particular text classification problem. We show that machine learning models trained on the data labeled by the network community detection model outperform the same models trained on the human-labeled data.

2 Method

We propose a new approach to building text classification models using a network community detection algorithm on unlabeled text data, and show that network community detection is indeed useful both for labeling text data, by clustering the data into multiple distinctive groups, and for improving classification accuracy. This study follows the steps below (see Figure 1), and uses Python packages such as NLTK, NetworkX and scikit-learn.

  • Gathering a set of text data that was used to develop a particular conversational intelligence (chatbot) system from an artificial intelligence company, Pypestream. The data contains over 2,000 sentences of user expressions on that particular chatbot service, such as [”is there any parking space?”, ”what movies are playing?”, ”how can I get there if I’m taking a subway?”]

  • Tokenizing and cleaning the sentences by removing punctuation, special characters and English stopwords, which appear frequently without carrying much meaning. For example, [”how can I get there if I’m taking a subway?”] becomes [’get’, ’taking’, ’subway’]

  • Stemming the words, and adding synonyms and bigrams of the sequence of words left in each sentence, to enable the model to learn more kinds of similar expressions and word sequences. For example, [’get’, ’taking’, ’subway’] becomes [’get’, ’take’, ’subway’, ’tube’, ’underground’, ’metro’, ’take metro’, ’get take’, ’take subway’, ’take underground’, …]

  • Transforming the preprocessed text data into vector form by computing the TFIDF of each preprocessed sentence with regard to the entire data set, and computing the pairwise cosine similarity of the TFIDF vectors to form the adjacency matrix of the sentence network

  • Constructing the sentence network using the adjacency matrix, with each preprocessed sentence as a network node and the cosine similarity of TFIDF representations between every node pair as the link weight.

  • Applying a network community detection algorithm on the sentence network to detect the community each preprocessed sentence belongs to, and building a labeled data set from the detected communities for training and testing machine learning classification models.

Figure 1: Analysis process. a. preprocess the text data by removing punctuation, stopwords and special characters, and add synonyms and bigrams, b. transform each preprocessed sentence into a TFIDF vector, and compute the pairwise cosine similarity between every sentence pair, c. construct the sentence networks, and apply the Louvain method to detect the community of every sentence, d. label each sentence with its detected community, e. train and test Support vector machine and Random forest models on the labeled data.

2.1 Data, Preprocessing and Representation

The data set obtained from Pypestream is permitted to be used for research purposes only, and for security reasons we are not allowed to share it. It was originally used for creating a conversational intelligence system (chatbot) to support customer inquiries about a particular service. The data set is in a two-column comma-separated-value format, with one column of ”sentence” and the other of ”class”. It contains 2,212 unique sentences of user expressions asking questions and answering the questions the chatbot asked the users (see Table 1). The sentences are all in English without any misspelled words, and are labeled with 19 distinct classes that were identified and designed by humans. An additional data set containing only the sentences was made for the purpose of this study by removing the ”class” column from the original data set.

From each sentence, we removed punctuation, special characters and English stopwords to keep only those meaningful words that serve the main purpose of the sentence, and to avoid redundant computing. We then tokenized each sentence into words to process the data further at the word level. For the words in each sentence, we added synonyms to handle more variations of the sentence, a typical method of increasing the resulting classification models’ capability of understanding unseen expressions with different words that describe similar meanings. Although we used the predefined synonyms from the Python NLTK package, one might develop one’s own synonym data in accordance with the context of the particular data to achieve better accuracy. We also added bigrams of the words to deal with cases where tokenization breaks the meaning of a term that consists of two words. For example, if we tokenized the sentence ”go to binghamton university” and processed the further steps without adding bigrams, the model would be likely to yield a lower confidence when classifying unseen sentences containing ”binghamton university”, or would not understand ”binghamton university” at all, since the meaning of ”binghamton university” is lost in the data set [12].
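To make the preprocessing concrete, below is a minimal sketch of such a pipeline using NLTK. The exact tokenizer, stemmer and synonym handling used in the study are not fully specified, so this sketch assumes the Punkt tokenizer, the Porter stemmer and WordNet as the synonym source.

```python
import string
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer
from nltk.util import bigrams

# One-time downloads: nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(sentence):
    # Tokenize, lowercase, and drop punctuation and English stopwords
    tokens = [t for t in nltk.word_tokenize(sentence.lower())
              if t not in stop_words and t not in string.punctuation]
    # Stem the surviving tokens
    stems = [stemmer.stem(t) for t in tokens]
    # Add WordNet synonyms of the surviving tokens (multiword lemmas use underscores)
    synonyms = set()
    for t in tokens:
        for syn in wordnet.synsets(t):
            for lemma in syn.lemmas():
                synonyms.add(lemma.name().replace('_', ' ').lower())
    # Add bigrams of the stemmed word sequence
    grams = [' '.join(pair) for pair in bigrams(stems)]
    return stems + sorted(synonyms - set(stems)) + grams

print(preprocess("How can I get there if I'm taking a subway?"))
```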

With the preprocessed text data, we built vector representations of the sentences by performing weighted document representation using the TFIDF weighting scheme [13, 14]. TFIDF, also known as term frequency-inverse document frequency, is a document representation that captures the importance of each word through its frequency in the whole set of documents and its frequency in particular subsets of documents. Specifically, let $D = \{d_1, \dots, d_n\}$ be the set of documents and $T = \{t_1, \dots, t_m\}$ the set of unique terms across all documents, where $n$ is the number of documents in the data set and $m$ the number of unique words in the documents. In this study, the documents are the preprocessed sentences and the terms are the unique words in the preprocessed sentences. The importance of a word is captured by its frequency, with $\mathrm{tf}(d, t)$ denoting the frequency of term $t$ in document $d$. A document $d$ is then represented as an $m$-dimensional vector $\vec{t_d} = (\mathrm{tf}(d, t_1), \dots, \mathrm{tf}(d, t_m))$. However, in order to compute a more concise and meaningful importance of a word, TFIDF not only takes the frequency of a particular word in a particular document into account, but also considers the number of documents in which the word appears in the entire data set. The underlying idea is that a word that appears frequently in some group of documents but rarely in the others is more important and relevant to that group of documents. Applying this concept, $\mathrm{tf}(d, t)$ is weighted by the document frequency of the word and becomes $\mathrm{tfidf}(d, t) = \mathrm{tf}(d, t) \times \log(n / \mathrm{df}(t))$, where $\mathrm{df}(t)$ is the number of documents in which the word $t$ appears, and thus the document is represented as $\vec{t_d} = (\mathrm{tfidf}(d, t_1), \dots, \mathrm{tfidf}(d, t_m))$.
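As a concrete illustration, scikit-learn's TfidfVectorizer implements this weighting scheme (with smoothing and normalization options that differ slightly from the plain formula above). A minimal sketch on hypothetical preprocessed sentences:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical preprocessed sentences, with tokens, synonyms and bigrams re-joined by spaces
docs = ["get take subway tube metro take subway",
        "park space car park",
        "movie play show movie play"]

# TfidfVectorizer computes tf(d, t) weighted by the inverse document frequency of t
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)      # sparse matrix of shape (n documents, m terms)
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```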

Class Number of sentences
CHAT_AGENT 92
RENT_HALL 80
GETINFO_BOX_OFFICE_HOUR 3
GETINFO_PARKING 34
GETINFO_SEATING_CHART 34
GETINFO_STUDENT_DISCOUNT 31
GETINFO_NEARBY_RESTAURANT 39
GETINFO_ISSUE 31
GETINFO_TOUR 110
GETINFO_FREE_PERFORMANCE 34
RESERVE_PARKING 28
GETINFO_DIRECTION 977
STARTOVER 3
ORDER_EVENTS 444
GETINFO_JOB 28
GETINFO 26
GETINFO_EXACT_ADDRESS 40
GETINFO_DRESSCODE 35
GETHELP_LOST_FOUND 143
Table 1: The original text data contains 2,212 unique sentences labeled with 19 distinct classes assigned by humans. It is not a balanced data set; each class contains a different number of sentences.

2.2 Sentence Network Construction

With the TFIDF vector representations, we formed sentence networks to investigate the usefulness of network community detection. In total, 10 sentence networks (see Figure 2 and Figure 4) were constructed, each with 2,212 nodes representing sentences and edge weights representing the pairwise similarities between sentences, using 10 different network connectivity threshold values. The networks were all undirected and weighted graphs. For the network edge weights, the cosine similarity [14, 15] was used to compute the similarities between sentences. The cosine similarity is a similarity measure that takes a value between 0 and 1, computed from the angle between two vectors. A cosine similarity of 0 means that the two vectors are perpendicular to each other, implying no similarity; a cosine similarity of 1 means that the two vectors point in an identical direction. It is popularly used in text mining and information retrieval. In our study, the cosine similarity between two sentences $d_i$ and $d_j$ is defined by the equation below.

$$\cos(\vec{t_{d_i}}, \vec{t_{d_j}}) = \frac{\vec{t_{d_i}} \cdot \vec{t_{d_j}}}{\|\vec{t_{d_i}}\| \, \|\vec{t_{d_j}}\|} \qquad (1)$$

where $\vec{t_{d_i}}$ and $\vec{t_{d_j}}$ are the TFIDF vector representations of sentences $d_i$ and $d_j$, $\vec{t_{d_i}} \cdot \vec{t_{d_j}}$ is their dot product, and $\|\cdot\|$ denotes the Euclidean norm.
To build our sentence networks, we formed a network adjacency matrix $A$ for the 2,212 sentences, whose entries are the pairwise cosine similarities of the TFIDF vector representations computed in the step above.
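A minimal sketch of this construction, reusing the TFIDF matrix from the previous sketch; the edge-pruning threshold is a parameter here, with 0 reproducing the no-threshold network:

```python
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarities of the TFIDF vectors form the adjacency matrix A
A = cosine_similarity(tfidf)                 # dense (n x n) array, entries in [0, 1]

def build_sentence_network(A, threshold=0.0):
    # Keep an edge only if the similarity is positive and reaches the threshold
    G = nx.Graph()
    n = A.shape[0]
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if A[i, j] > 0 and A[i, j] >= threshold:
                G.add_edge(i, j, weight=A[i, j])
    return G

G = build_sentence_network(A)                # the no-threshold network
```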

2.3 Network Community Detection and Classification Models

The particular network community detection algorithm used in this study is the Louvain method [16], which initially partitions a network into as many communities as there are nodes (every node is its own community) and, from there, clusters the nodes in a way that maximizes each cluster’s modularity, which indicates how strong the connectivity between the nodes in the community is. This means that, based on the cosine similarity scores (the network edge weights), the algorithm clusters similar sentences together in the same community while maximizing the connectivity strength among the nodes in each community. The network constructed with no threshold in place was detected to have 18 distinct communities, including three single-node communities. Based on the visualized network (see Figure 2), the network community detection method seemed to cluster the sentence network about as well as the human-labeled classes in the original data set, although the communities do not look very distinct. However, given that it had three single-node communities and that the number of distinct communities is less than the number of classes in the human-labeled data set, we suspected possible problems that would degrade the quality of the community detection for the purpose of training text classification models.

Figure 2: A sentence network and its communities. The sentence network with no threshold on node connectivity has 18 distinct communities, including three single-node communities.
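A sketch of this community detection step using the python-louvain package (imported as `community`), whose `best_partition` function implements the Louvain method; the community ids then serve as the automatic labels:

```python
import community as community_louvain   # pip install python-louvain

# Louvain maximizes modularity on the weighted sentence network
partition = community_louvain.best_partition(G, weight='weight')

# partition maps node index -> community id; these ids become the automatic labels
labels = [partition[i] for i in sorted(G.nodes())]
print(len(set(labels)), "communities detected")
```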

2.3.1 Quality of Network Community Detection-Based Labeling

We checked the community detection results against the original human-labeled data by comparing the sentences in each community with the sentences in each human-labeled class to confirm how well the algorithm worked. To facilitate this process, we built class maps (see Figure 3) that show the mapping between communities in the sentence networks and classes in the original data set. Using the class maps, we found two notable cases: 1. the sentences of one human-labeled class are spread across multiple communities, meaning the original class is split into multiple communities, and 2. the sentences of one community come from multiple human-labeled classes, meaning multiple classes in the original data are merged into one community. For example, in the former case (see blue lines in Figure 3), which we call Class-split, the sentences in COMMUNITY_1, COMMUNITY_2, COMMUNITY_5, COMMUNITY_8, COMMUNITY_10, COMMUNITY_14 and COMMUNITY_17 are the same as the sentences in the CHAT_AGENT class. In the latter case (see red lines in Figure 3), which we call Class-merge, the sentences in COMMUNITY_7 are the same as the sentences in GETINFO_PARKING, GETINFO_NEARBY_RESTAURANT, GETINFO_TOUR, GETINFO_EXACT_ADDRESS, STARTOVER, ORDER_EVENTS, GETINFO_JOB, GETINFO, GETINFO_DRESSCODE and GETHELP_LOST_FOUND as well as GETINFO_FREE_PERFORMANCE.

Figure 3: A class map between detected communities and human-labeled classes. The class map shows the mapping (all lines) between communities detected by the Louvain method and their corresponding human-labeled classes for the sentence network with no threshold. Some human-labeled classes contain sentences that appear in multiple communities (blue lines; we call this Class-split), and some communities contain sentences that appear in multiple human-labeled classes (red lines; we call this Class-merge).
Figure 4: Nine sentence networks with different connectivity thresholds. Each node represents a sentence, and an edge weight between two nodes represents the similarity between the two sentences. In this study, we removed edges whose weight is below the threshold. a. the network with a threshold of 0.1 has 29 distinct communities with 11 single-node communities, b. threshold 0.2: 45 distinct communities with 20 single-node communities, c. threshold 0.3: 100 distinct communities with 58 single-node communities, d. threshold 0.4: 187 distinct communities with 120 single-node communities, e. threshold 0.5: 320 distinct communities with 204 single-node communities, f. threshold 0.6: 500 distinct communities with 335 single-node communities, g. threshold 0.7: 719 distinct communities with 499 single-node communities, h. threshold 0.8: 915 distinct communities with 658 single-node communities, i. threshold 0.9: 1,140 distinct communities with 839 single-node communities. As the threshold gets larger, each network has more communities, the detected communities look more distinct from each other, and there are more single-node communities.

Class-split happens when a human-labeled class is divided into multiple communities as the sentence network is clustered based on semantic similarity. This can actually help text-classification-based systems work in a more sophisticated way, since the data set gains more detailed subclasses to design the systems with. Although it is indeed a helpful phenomenon, we would like to minimize the number of subclasses created by the community detection algorithm, simply because too many subclasses add complexity to designing applications that use the community data. Class-merge, on the other hand, happens when multiple human-labeled classes are merged into one giant community. This phenomenon also helps improve the original data set by exposing mislabeled or ambiguous data entries; we discuss the details in the following subsection. Nonetheless, we also want to minimize the number of classes merged into one giant community, because when too many classes are merged into one class, it implies that the sentence network is not correctly clustered. For example, as shown by the red lines in Figure 3, 12 different human-labeled classes that do not share any similar intents are merged into COMMUNITY_7. If we trained a text classification model on this data, we would lose the specifically designed purposes of the 12 different classes, expecting COMMUNITY_7 to deal with all 12 different types of sentences. This would dramatically degrade the performance of the text classification models.

Figure 5: Optimal connectivity threshold based on the Class-split and Class-merge metrics. The normalized Class-split score (blue line) increases as the threshold gets larger, while the normalized Class-merge score (red line) decreases. The optimal connectivity threshold is the point where both scores are minimized, which is 0.5477.

In order to quantify the degree of Class-split and Class-merge of a network, and to find the optimal connectivity threshold that would yield the sentence network with the best community detection quality, we built two metrics using the class map. We quantified Class-split by counting the number of communities split off from each human-labeled class, and Class-merge by counting the number of human-labeled classes merged into each community. We then averaged the Class-split counts across all human-labeled classes and the Class-merge counts across all communities. For example, using the class map of the sentence network with no threshold, we obtain Class-split and Class-merge scores of 2.7368 and 2.8333, respectively.
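The paper does not spell out the exact computation, but one plausible implementation of the two scores, assuming they are the per-class and per-community averages described above, is:

```python
from collections import defaultdict

def class_split_merge(human_labels, community_labels):
    """Average number of communities per human class (Class-split) and
    average number of human classes per community (Class-merge)."""
    class_to_comms, comm_to_classes = defaultdict(set), defaultdict(set)
    for cls, comm in zip(human_labels, community_labels):
        class_to_comms[cls].add(comm)
        comm_to_classes[comm].add(cls)
    split = sum(map(len, class_to_comms.values())) / len(class_to_comms)
    merge = sum(map(len, comm_to_classes.values())) / len(comm_to_classes)
    return split, merge
```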

We computed the normalized Class-split and Class-merge scores for all 10 sentence networks (see Figure 5). Figure 5 shows the normalized Class-split and Class-merge scores of the 10 sentence networks with connectivity thresholds ranging from 0 (no threshold) to 0.9. From these series of Class-split and Class-merge scores, we found that a connectivity threshold of 0.5477 gives us the sentence network with the best quality of community detection for our purpose of training text classification models.
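The procedure for locating 0.5477 is not spelled out in the paper; one plausible reading is that it is the point where the two normalized curves cross, which could be found by linear interpolation between the sampled thresholds. The curves below are synthetic placeholders for illustration only (the measured scores are in Figure 5):

```python
import numpy as np

# Synthetic placeholder curves for illustration only; the measured scores are in Figure 5.
thresholds = np.linspace(0.0, 0.9, 10)
split_scores = thresholds ** 2              # normalized Class-split rises with the threshold
merge_scores = (1.0 - thresholds) ** 2      # normalized Class-merge falls with the threshold

# Locate the crossing of the two curves by linear interpolation between samples
diff = split_scores - merge_scores
k = int(np.where(np.diff(np.sign(diff)) != 0)[0][0])   # index where the sign flips
t_opt = float(np.interp(0.0, diff[k:k + 2], thresholds[k:k + 2]))
print(t_opt)
```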

2.3.2 Detecting Mislabeled or Ambiguous Sentences in the Human-made Data Set

Using the Class-merge information from the class map, we were able to spot sentences that are either mislabeled or ambiguous between classes in the original data set. This is an extremely helpful and convenient feature for fixing and improving text data for classification problems, because fixing data is normally a tedious and time-consuming task that takes a great amount of human labor. For example, looking at the class map of our sentence network with no threshold, COMMUNITY_5 contains sentences that appear in the GETINFO_EXACT_ADDRESS and CHAT_AGENT classes. We investigated the sentences in COMMUNITY_5 and spotted one sentence, [’I need to address a human being!’], which is very ambiguous for machines to classify between the two classes. This sentence was originally designed for the CHAT_AGENT class, but because of its ambiguous use of the word ’address’, it sits together with the GETINFO_EXACT_ADDRESS sentences in COMMUNITY_5. After fixing the ambiguity by correcting the sentence to [’I need to talk to a human being!’], we easily improved the original data set.

2.3.3 Classification Models

Once we had the optimal connectivity threshold from the Class-split and Class-merge scores as shown in the sections above, we built the sentence network with the optimal threshold of 0.5477. We then applied the Louvain method to detect communities in the network and to automatically label the data set. The network with a threshold of 0.5477 has 399 communities and 20,856 edges; its Class-split and Class-merge scores were 22.3158 and 1.0627, respectively. We finally trained and tested machine learning based text classification models on the data set labeled by the community detection outcome to see how well our approach worked. Following general machine learning practice, we split the data set into a train set (80% of the data) and a test set (20% of the data). The particular models we trained and tested were standard Support vector machine [17] and Random forest [18] models, which are popularly used in natural language processing tasks such as spam e-mail and news article categorization. More details about these two well-known machine learning models are discussed in the cited papers.
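A minimal sketch of this train-and-test step with scikit-learn, assuming default hyperparameters (the paper does not report the models' settings) and reusing the TFIDF matrix and community labels from the earlier sketches:

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 'tfidf' is the TFIDF matrix and 'labels' the community ids from the Louvain step
X_train, X_test, y_train, y_test = train_test_split(
    tfidf, labels, test_size=0.2, random_state=42)

for model in (SVC(), RandomForestClassifier()):
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(type(model).__name__, round(accuracy, 4))
```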

Figure 6: Accuracies of text classification models. The red bars represent the accuracies of the Support vector machine (0.9572) and Random forest (0.9504) models trained on the original human-labeled data, while the blue bars represent the accuracies of the same models trained on the data set labeled by the network community detection algorithm (0.9931 and 0.9759, respectively). The models trained on the community-labeled data achieved higher accuracy in classifying the sentences in the test data.

3 Results

Figure 6 shows the accuracies of the four models: Support vector machine and Random forest models trained on the original human-labeled data and on the data labeled by our method. The accuracies are hit ratios computed as the number of correctly classified sentences over the number of all sentences in the test data. For example, if a model classified 85 sentences correctly out of 100 test sentences, the accuracy would be 0.85. In order to compute the ground-truth hit ratio accurately, we used the ground-truth messages in the chatbot: the sentences that are shown to the chatbot users in response to the classification of a particular user query, as described below.

For example, for the question ”how do I get there by subway?”, the chatbot has a designed message, ”You can take line M or B to 35th street.”, to respond to that particular query. Using these output messages, we were able to compute the ground-truth accuracy of our classification models by matching the input sentences in the test sets, the classes detected by the models, and the linked messages. In our test, the Support vector machine trained on the human-labeled data achieved 0.9572, while the same model trained on the data labeled by our method achieved 0.9931. Likewise, the Random forest model trained on the human-labeled data achieved 0.9504, while the same model trained on the data labeled by our method achieved 0.9759.
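A sketch of how such a message-level hit ratio could be computed. The class-to-message mapping below is hypothetical apart from the subway example quoted above, and the exact matching procedure used in the study is an assumption:

```python
# Hypothetical mapping from class label to the chatbot's designed response message;
# only the GETINFO_DIRECTION message is quoted from the paper.
responses = {
    "GETINFO_DIRECTION": "You can take line M or B to 35th street.",
    "GETINFO_PARKING": "placeholder parking message",
}

def ground_truth_hit_ratio(true_classes, predicted_classes):
    # A prediction counts as a hit if it triggers the same response message
    # as the class the test sentence was designed for.
    hits = sum(responses.get(t) is not None and responses.get(t) == responses.get(p)
               for t, p in zip(true_classes, predicted_classes))
    return hits / len(true_classes)
```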

4 Discussion and Conclusion

In this study, we demonstrated a new approach to training text classification models using network community detection, and showed how network community detection can help improve the models by automatically labeling text data and detecting mislabeled or ambiguous data points. As shown in this paper, we were able to achieve better accuracies with Support vector machine and Random forest models compared to the same models trained on the original human-labeled data for this particular text classification problem. Our approach is useful not only for producing better classification models, but also for testing the quality of human-made text data. One might obtain even better results with this method by utilizing more sophisticated custom-designed synonyms and stopwords, using more advanced natural language processing methods such as word embeddings, utilizing higher n-grams such as trigrams, and using more balanced data sets. In the future, we would like to expand this study to use the network itself to classify unseen sentences without training machine learning models.

References

  • [1] Melville, P., Gryc, W. and Lawrence, R. D.: Sentiment analysis of blogs by combining lexical knowledge with text classification. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1275–1284. ACM (2009)
  • [2] Danisman, T. and Alpkocak, A.: Feeler: Emotion classification of text using vector space model. In AISB 2008 Convention Communication, Interaction and Social Intelligence, Vol. 1, p. 53. (2008)
  • [3] Ko, Y. and Seo, J.: Automatic text categorization by unsupervised learning. In Proceedings of the 18th Conference on Computational Linguistics, Vol. 1, pp. 453–459 (2000)
  • [4] Dorado, R. and Ratté, S.: Semisupervised text classification using unsupervised topic information. In Proceedings of the Twenty-Ninth International Florida Artificial Intelligence Research Society Conference (FLAIRS), pp. 210–213 (2016)
  • [5] Zhou, X., Hu, Y. and Guo, L.: Text categorization based on clustering feature selection. Procedia Computer Science, Vol. 31, pp. 398–405 (2014)
  • [6] Kim, M. and Sayama, H.: Predicting stock market movements using network science: an information theoretic approach. Applied Network Science, Vol. 2 (2017)
  • [7] dos Santos, C.K., Evsukoff, A.G. and de Lima, B.: Cluster analysis in document networks. WIT Transactions on Information and Communication Technologies, Vol. 40, pp. 95–104. (2008)
  • [8] Mikhina, E.K. and Trifalenkov, V.I.: Text clustering as graph community detection. Procedia Computer Science, Vol. 123, pp. 271–277. (2018)
  • [9] Fortunato, S. and Hric, D.: Community detection in networks: A user guide. Physics Reports, Vol. 659, pp. 1–44. (2016)
  • [10] Steinhaeuser, K. and Chawla, N.V.: Community detection in a large real-world social network. Social Computing, Behavioral Modeling, and Prediction, pp. 168–175. Springer, Boston, MA (2008)
  • [11] Kanter, I., Yaari, G. and Kalisky, T.: Applications of community detection algorithms to large biological datasets. bioRxiv 547570. doi: https://doi.org/10.1101/547570 (2019)
  • [12] Bekkerman, R. and Allan, J.: Using bigrams in text categorization. Technical Report IR-408, Center for Intelligent Information Retrieval, UMass Amherst, pp. 161–175. (2004)
  • [13] Trstenjak, B., Mikac, S. and Donko, D.: KNN with TF-IDF based framework for text categorization. Procedia Engineering, Vol. 69, pp. 1356–1364. (2014)
  • [14] Huang, A.: Similarity measures for text document clustering. In Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC 2008), Vol. 4, pp. 9–56. (2008)
  • [15] Li, B. and Han, L.: Distance weighted cosine similarity measure for text classification. In International Conference on Intelligent Data Engineering and Automated Learning, pp. 611–618. Springer, Berlin, Heidelberg (2013)
  • [16] Blondel, V. D., Guillaume, J. L., Lambiotte, R. and Lefebvre, E.: Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, P10008 (2008)
  • [17] Drucker, H., Wu, D. and Vapnik, V. N.: Support vector machines for spam categorization. IEEE Transactions on Neural Networks, Vol. 10, pp. 1048–1054. (1999)
  • [18] Wu, Q., Ye, Y., Zhang, H., Ng, M. K. and Ho, S. S.: ForesTexter: An efficient random forest algorithm for imbalanced text categorization. Knowledge-Based Systems, Vol. 67, pp. 105–116. (2014)