The Power of Communities: A Text Classification Model with Automated Labeling Process Using Network Community Detection
Text classification is one of the most critical areas in machine learning and artificial intelligence research. It has been actively adopted in many business applications such as conversational intelligence systems, news articles categorizations, sentiment analysis , emotion detection systems , and many other recommendation systems in our daily life. One of the problems in supervised text classification models is that the models’ performance depends heavily on the quality of data labeling that is typically done by humans. In this study, we propose a new network community detection-based approach to automatically label and classify text data into multiclass value spaces. Specifically, we build networks with sentences as the network nodes and pairwise cosine similarities between TFIDF vector representations of the sentences as the network link weights. We use the Louvain method  to detect the communities in the sentence networks. We train and test the Support Vector Machine and the Random Forest models on both the human-labeled data and network community detection labeled data. Results showed that models with the data labeled by the network community detection outperformed the models with the human-labeled data by 2.68-3.75% of classification accuracy. Our method may help developments of more accurate conversational intelligence and other text classification systems.
Minjun Kim, Hiroki Sayama
Text data is a great source of knowledge for building many useful recommendation systems, search engines as well as conversational intelligence systems. However, it is often found to be a difficult and time-consuming task to structure the unstructured text data especially when it comes to labeling the text data for training text classification models. Data labeling, typically done by humans, is prone to make mislabeled data entries, and hard to track whether the data is correctly labeled or not. This human labeling practice indeed impacts on the quality of the trained models in solving classification problems.
Some previous studies attempted to solve this problem by utilizing unsupervised [3, 5, 6] and semisupervised  machine learning models. However, those studies used a pre-defined keywords list for each category in the document, which provides the models with extra referential materials to look at when making the classification predictions, or included already labeled data as a part of the entire data set from which the models learn. In case of using unsupervised algorithms such as Kmeans and LDA [5, 6], it is very much possible that frequently appearing words in multiple sentences can be used as features for multiple classes. This leads the models to render more ambiguity and to result in a poor performance in classifying documents. Also, the number of distinct classes () to be made is not determined systematically using the data, but heuristically by trying out many different values of which is not a reliable optimization.
Although there are many studies in text classification problems using machine learning techniques, there have been a limited number of studies conducted in text classifications utilizing network science. Network science is actively being adopted in studying biological networks, social networks, financial market prediction  and more in many fields of study to mine insights from the collectively inter-connected components by analyzing their relationships and structural characteristics. Only a few studies adopted network science theories to study text classifications and showed preliminary results of the text clustering performed by network analysis specially with the network community detection algorithms [8, 9, 10]. However, those studies did not clearly show the quality of community detection algorithms or other possible useful features. Network community detection  is graph clustering methods used in complex networks analysis from large social networks analysis  to RNA-sequencing analysis  as a tool to partition a graph data into multiple parts based on the network’s structural properties such as betweenness, modularity, etc.
In this paper, we study further to show the usefulness of the network community detection on labeling unlabeled text data that will automate and improve human labeling tasks, and on training machine learning classification models for a particular text classification problem. We finally show that the machine learning models trained on the data labeled by the network community detection model outperform the models trained on the human-labeled data.
We propose a new approach of building text classification models using a network community detection algorithm with unlabeled text data, and show that the network community detection is indeed useful in labeling text data by clustering the text data into multiple distinctive groups and also in improving the classification accuracy. This study follows the below steps (see Figure 1), and uses Python packages such as NLTK, NetworkX and SKlearn.
Gathered a set of text data that was used to develop a particular conversational intelligence system from an artificial intelligence company, Pypestream. The data contains over 2,000 sentences of user expressions on that particular chatbot service such as [“is there any parking space?”, “what movies are playing?”, “how can I get there if I’m taking a subway?”]
Tokenizing and cleaning the sentences by removing punctuations, special characters and English stopwords that appear frequently without holding much important meaning. For example, [“how can I get there if I’m taking a subway?”] becomes [‘get’, ‘taking’, ‘subway’]
Stemmizing the words following a suffix stripping algorithm , and adding synonyms and bigrams of the sequence of the words left in each sentence to enable the model to learn more kinds of similar expressions and the sequences of the words. For example, [‘get’, ‘taking’, ‘subway’] becomes [‘get’, ‘take’, ‘subway’, ‘tube’, ‘underground’, ‘metro’, ‘take metro’, ‘get take’, ‘take subway’, ‘take underground’, …]
Transforming the preprocessed text data into a vector form by computing TFIDF of each preprocessed sentence with regard to the entire data set, and computing pair-wise cosine similiarity of the TFIDF vectors to form the adjacency matrix of the sentence network
Constructing the sentence network using the adjacency matrix with each preprocessed sentence as a network node and the cosine similarity of TFIDF representations between every node pair as the link weight.
Applying a network community detection algorithm on the sentence network to detect the communities where each preprocessed sentence belong, and build a labeled data set with detected communities for training and testing machine learning classification models.
2.1 Data, Preprocessing and Representation
The data set obtained from Pypestream is permitted to be used for the research purpose only, and for a security reason, we are not allowed to share the data set. It was once originally used for creating a conversational intelligence system (chatbot) to support customer inquiries about a particular service. The data set is a two-column comma-separated value format data with one column of “sentence” and the other column of “class”. It contains 2,212 unique sentences of user expressions asking questions and answering to the questions the chatbot asked the users. The sentences are all in English without having any misspelled words, and labeled with 19 distinct classes that are identified and designed by humans. Additional data set that only contains the sentences was made for this study by taking out the “class” column from the original data set.
From each sentence, we removed punctuations, special characters and English stopwords to keep only those meaningful words that serve the main purpose of the sentence and to avoid any redundant computing. We then tokenized each sentence into words to process the data further in word level. For words in each sentence, we added synonyms of the words to handle more variations of the sentence as a typical method of increasing the resulting classification models’ capability of understanding more unseen expressions with different words that describe similar meanings. The synonyms we added to the data are not context-specific synonyms, but all predefined synonyms of particular words. Although we used the predefined synonyms from the Python NLTK package, one might develop it’s own synonym data to use under the context of the particular data to achieve better accuracy. We also added bigrams of the words to deal with those cases where the tokenization breaks the meaning of the word that consists of two words. For example, if we tokenized the sentence “go to Binghamton University” and process the further steps without adding bigrams of them, the model is likely to yield lower confidence on classifying unseen sentences with “Binghamton University” since the meaning of “Binghamton University” is lost in the data set .
With the preprocessed text data, we built vector representations of the sentences by performing weighted document representation using the TFIDF weighting scheme [16, 17, 18]. The TFIDF, as known as Term frequency inversed document frequency, is a document representation that takes account of the importance of each word by its frequency in the whole set of documents and its frequency in particular sets of documents. Specifically, let be a set of documents and the set of unique terms in the entire documents where is the number of documents in the data set and the number of unique words in the documents. In this study, the documents are the preprocessed sentences and the terms are the unique words in the preprocessed sentences. The importance of a word is captured with its frequency as denoting the frequency of the word in the document . Then a document is represented as an -dimensional vector . However, In order to compute more concise and meaningful importance of a word, TFIDF not only takes the frequency of a particular word in a particular document into account, but also considers the number of documents that the word appears in the entire data set. The underlying thought of this is that a word appeared frequently in some groups of documents but rarely in the other documents is more important and relevant to the groups of documents. Applying this concept, is weighted by the document frequency of a word, and becomes where is the number of documents the word appears, and thus the document is represented as .
2.2 Sentence Network Construction
With the TFIDF vector representations, we formed sentence networks. In total, 10 sentence networks (see Figure 2 and Figure 4) were constructed with 2,212 nodes representing sentences and edge weights representing the pairwise similarities between sentences with 10 different network connectivity threshold values. The networks we formed were all undirected and weighted graphs. Particularly, as for the network edge weights, the cosine similarity [18, 19] is used to compute the similarities between sentences. The cosine similarity is a similarity measure that is in a floating number between 0 and 1, and computed as the angle difference between two vectors. A cosine similarity of means that the two vectors are perpendicular to each other implying no similarity, on the other hand, a cosine similarity of means that the two vectors are identical. It is popularly used in text mining and information retrieval techniques. In our study, the cosine similarity between two sentences and is defined as below equation.
To build our sentence networks, we formed a network adjacency matrix for 2,212 sentences, , with the pairwise cosine similarities of TFIDF vector representations computed in the above step.
2.3 Network Community Detection and Classification Models
The particular algorithm of network community detection used in this study is the Louvain method  which partitions a network into the number of nodes - every node is its own community, and from there, clusters the nodes in a way to maximize each cluster’s modularity which indicates how strong is the connectivity between the nodes in the community. This means that, based on the cosine similarity scores - the networks edge weights, the algorithm clusters similar sentences together in the same community while the algorithm proceeds maximizing the connectivity strength amongst the nodes in each community. The network constructed with no threshold in place was detected to have 18 distinct communities with three single node communities. Based on the visualized network (see Figure 2), it seemed that the network community detection method clustered the sentence network as good as the original data set with human-labeled classes although the communities do not look quite distinct. However, based on the fact that it had three single node communities and the number of distinct communities is less than the number of classes in the human-labeled data set, we suspected possible problems that would degrade the quality of the community detection for the use of training text classification models.
2.3.1 Quality of Network Community Detection Based Labeling
We checked the community detection results with the original human-labeled data by comparing the sentences in each community with the sentences in each human-labeled class to confirm how well the algorithm worked. We built class maps to facilitate this process (see Figure 3) that show mapping between communities in the sentence networks and classes in the original data set. Using the class maps, we found two notable cases where; 1. the sentences from multiple communities are consist of the sentences of one class of the human-labeled data, meaning the original class is splitted into multiple communities and 2. the sentences from one community consist of the sentences of multiple classes in human-labeled data, meaning multiple classes in the original data are merged into one community. For example, in the earlier case (see blue lines in Figure 3) which we call Class-split, the sentences in COMMUNITY_1, COMMUNITY_2, COMMUNITY_5, COMMUNITY_8, COMMUNITY_10, COMMUNITY_14 and COMMUNITY_17 are the same as the sentences in CHAT_AGENT class. Also, in the later case (see red lines in Figure 3) which we call Class-merge, the sentences in COMMUNITY_7 are the same as the sentences in GETINFO_PARKING, GETINFO_NEARBY_RESTAURANT, GETINFO_TOUR, GETINFO_EXACT_ADDRESS, STARTOVER, ORDER_EVENTS, GETINFO_JOB, GETINFO, GETINFO_DRESSCODE, GETINFO_LOST_FOUND as well as GETINFO_FREE_PERFORMANCE.
The Class-split happens when a human-labeled class is divided into multiple communities as the sentence network is clustered based on the semantic similarity. This actually can help improve the text classification based systems to work more sophisticatedly as the data set gets more detailed subclasses to design the systems with. Although it is indeed a helpful phenomenon, we would like to minimize the number of subclasses created by the community detection algorithm simply because we want to avoid having too many subclasses that would add more complexity in designing any applications using the community data. On the other hand, the Class-merge happens when multiple human-labeled classes are merged into one giant community. This Class-merge phenomenon also helps improve the original data set by detecting either mislabeled or ambiguous data entries. We will discuss more details in the following subsection. Nonetheless, we also want to minimize the number of classes merged into the one giant community, because when too many classes are merged into one class, it simply implies that the sentence network is not correctly clustered. For example, as shown in Figure 3 red lines, 12 different human-labeled classes that do not share any similar intents are merged into COMMUNITY_7. If we trained a text classification model on this data, we would have lost the specifically designed purposes of the 12 different classes, expecting COMMUNITY_7 to deal with all the 12 different types of sentences. This would dramatically degrade the performance of the text classification models.
In order to quantify the degree of Class-split and Class-merge of a network, and to find out optimal connectivity threshold that would yield the sentence network with the best community detection quality, we built two metrics using the class map. We quantified the Class-split by counting the number of communities splitted out from each and every human-labeled class, and the Class-merge by counting the number of human-labeled classes that are merged into each and every community. We then averaged the Class-splits across all the human-labeled classes and Class-merges across all the communities. For example, using the class map of the sentence network with no threshold, we can easily get the number of Class-split and Class-merge as below. By averaging them, we get the Class_split and Class_merge scores of the sentence network, which is 2.7368 and 2.8333 respectively.
We computed the normalized Class_split and Class_merge scores for all 10 sentence networks (see Figure 5). Figure 5 shows the normalized Class-split and Class-merge scores of the 10 sentence networks with different connectivity thresholds ranging from to . With these series of Class_split and Class_merge scores, we found out that at 0.5477 of connectivity threshold we can get the sentence network that would give us the best quality of community detection result particularly for our purpose of training text classification models.
2.3.2 Detecting Mislabeled or Ambiguous Sentences in Human-made Data Set
Using the Class_merge information we got from the class map, we were able to spot out those sentences that are either mislabeled or ambiguous between classes in the original data set. This is extreamly helpful and convenient feature in fixing and improving text data for classification problems, because data fixing is normally a tedious and time consuming tasks which takes a great amount of human labor. For example, by looking at the class map, in our sentence network with no threshold, COMMUNITY_5 contains sentences appeared in GETINFO_EXACT_ADDRESS and CHAT_AGENT classes. We investigated the sentences in COMMUNITY_5, and were able to spot out one sentence [“I need to address a human being!”] which is very ambiguous for machines to classify between the two classes. This sentence is originally designed for CHAT_AGENT class, but because of its ambiguous expression with the word ‘address’, it is together with sentences in GETINFO_EXACT_ADDRESS in COMMUNITY_5. After fixing the ambiguity of that sentence by correcting it to [“I need to talk to a human being!”], we easily improved the original data set.
2.3.3 Classification Models
Once we got the optimal connectivity threshold using the Class_split and Class_merge scores as shown above sections, we built the sentence network with the optimal threshold of 0.5477. We then applied the Louvain method to detect communities in the network, and to automatically label the data set. The network with the threshold of 0.5477 has 399 communities with 20,856 edges. Class_split and Class_merge scores of the network were 22.3158 and 1.0627 respectively. We finally trained and tested machine learning based text classification models on the data set labeled by the community detection outcome to see how well our approach worked. Following a general machine learning train and test practice, we split the data set into a train set (80% of the data) and a test set (20% of the data). The particular models we trained and tested were the Support Vector Machine  and the Random Forest  models that are popularly used in natural language processing such as spam e-mail and news article categorizations. More details about the two famous machine learning models are well discussed in the cited papers.
|Data Labeling||SVM||Random forest|
|Network Community Labeled||0.9931||0.9759|
Table.1 shows the accuracy of the four Support Vector Machine and the Random Forest models trained on the original human-labeled data and the data labeled by our method. The accuracies are hit ratios that compute the number of correctly classified sentences over the number of all sentences in the test data. For example, if a model classified 85 sentences correctly out of 100 test sentences, then the accuracy is 0.85. In order to accurately compute the ground truth hit ratio, we used the ground truth messages in the chatbot. The messages are the sentences that are to be shown to the chatbot users in response to the classification for a particular user query as below.
For example, for a question of “how do I get there by subway?”, in the chatbot, there is a designed message of “You can take line M or B to 35th street” to respond to that particular query. Using these output messages in the chatbot, we were able to compute the ground truth accuracy of our classification models by comprehending the input sentences in the test sets, the detected classes from the models and linked messages. In our test, the Support Vector Machine trained on human-labeled data performed 0.9572 while the same model trained on the data labeled by our method resulted in 0.9931. Also, the Random Forest model trained on human-labeled data performed 0.9504 while the same model trained on the data labeled by our method did 0.9759.
4 Discussions and Conclusion
In this study, we demonstrated a new approach of training text classification models using the network community detection, and showed how the network community detection can help improve the models by automatically labeling text data and detecting mislabeled or ambiguous data points. As seen in this paper, we were able to yield better results in the accuracy of the Support Vector Machine and the Random Forest models compared to the same models that were trained on the original human-labeled data for the particular text classification problem. Our approach is not only useful in producing better classification models, but also in testing the quality of human-made text data. One might be able to get even better results using this method by utilizing more sophisticatedly custom-designed synonyms and stopwords, using more advanced natural language processing methods such as word-embeddings, utilizing higher n-grams such as trigrams, and using more balanced data sets. In the future, we would like to expand this study further to use the network itself to parse out classifications of unseen sentences without training machine learning models.
-  Melville, P., Gryc, W. and Lawrence, R. D.: Sentiment analysis of blogs by combining lexical knowledge with text classification. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1275–1284. ACM (2009)
-  Danisman, T. and Adil, A. Feeler.: Emotion classification of text using vector space model. In AISB 2008 Convention Communication, Interaction and Social Intelligence, Vol. 1, pp. 53. (2008)
-  Ko, Y., Seo, J.: Automatic text categorization by unsupervised learning. In Proceedings of the 18th conference on Computational linguistics, Vol. 1, pp. 453–459 (2000)
-  Dorado, R., Sylvie, R.: Semisupervised text classification using unsupervised topic information. In Proceedings of the Twenty-Ninth International Florida Artificial Intelligence Research Society Conference (FLAIRS), pp. 210–213 (2016)
-  Zhou, X., Hu, Y. and Guo, L.: Text categorization based on clustering feature selection. Procedia Computer Science, Vol. 31, pp. 398–405 (2014)
-  Andrea, L., M, I.S., Jane, X.W., Daniel, A., Konrad, K., and Luis, A.N.A.: High-Reproducibility and High-Accuracy method for automated topic classification. Physical Review X, Vol. 5. (2015)
-  Kim, M. and Sayama, H.: Predicting stock market movements using network science: an information theoretic approach. Applied Network Science, Vol. 2 (2017)
-  dos Santos, C.K., Evsukoff, A.G., and de Lima, B.: Cluster analysis in document networks. WIT Transactions on Information and Communication Technologies, Vol. 40, pp. 95–104. (2008)
-  Mikhina, E.K. and Trifalenkov, V.I.: Text clustering as graph community detection. Procedia Computer Science, Vol. 123, pp. 271–277. (2018)
-  Martin, G., Tiago, P. and Eduardo, G.A.: A network approach to topic models. Science Advances, Vol. 4, eaaq. 1360. (2018)
-  Fortunato, S. and Hric, D.: Community detection in networks: A user guide. Physics Reports, Vol. 659, pp. 1–44. (2016)
-  Steinhaeuser, K. and Chawla, N.V.: Community detection in a large real-world social network. Social computing, behavioral modeling, and prediction, pp. 168–175. Springer, Boston, MA (2008)
-  Kanter, I., Yarri, G. and Kalisky, T.: Applications of community detection algorithms to large biological datasets. bioRxiv 547570. doi: https://doi.org/10.1101/547570 (2019)
-  Porter, M.: An algorithm for suffix stripping. Program: electronic library and information systems , Vol. 40, pp. 211–218. (2006)
-  Bekkerman, R. and Allan, J.: Using bigrams in text categorization. Technical Report IR-408, Center of Intelligence Information Retrieval, pp. 161–175. UMass Amberst (2004)
-  Karen, J.: A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, Vol. 28, pp. 11–21. (1972)
-  Trstenjak, B., Mikac, S. and Donko, D.: KNN with TF-IDF based framework for text categorization. Procedia Engineering, Vol. 69, pp. 1356–1364. (2014)
-  Huang, A.: Similarity measures for text document clustering. Proceedings of the sixth new zealand computer science research student conference (NZCSRSC2008), Vol. 4, pp. 9–56. (2008)
-  Li, B. and Han, L.: Distance weighted cosine similarity measure for text classification. International Conference on Intelligent Data Engineering and Automated Learning, pp. 611–618. Springer, Berlin, Heidelberg (2013)
-  Blondel, V. D., Guillaume, J. L., Lambiotte, R. and Lefebvre, E.: Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment (2008)
-  Drucker, H., Wu, D. and Vapnik, V. N.: Support vector machines for spam categorization. IEEE Transactions on Neural Networks, Vol. 10, pp. 1048–1054. (1999)
-  Wu, Q., Ye, Y., Zhang, H.,Ng, M. K. and Ho, S. S.: ForesTexter: An efficient random forest algorithm for imbalance text categorization. Knowledge-Based Systems, Vol. 67, pp. 105–116. (2014)