Characterization of citizens using word2vec and latent topic analysis in a large set of tweets


Vladimir Vargas-Calderón Jorge E. Camargo Physics Department, Universidad Nacional de Colombia, Bogotá, Colombia Systems Engineering Department, Fundación Universitaria Konrad Lorenz, Bogotá, Colombia

With the increasing use of the Internet and mobile devices, social networks are becoming the most used medium for communicating citizens’ ideas and thoughts. This information is very useful for identifying communities with common ideas based on what they publish on the network. This paper presents a method to automatically detect city communities based on machine learning techniques applied to a set of tweets from Bogotá’s citizens. An analysis was performed on a collection of 2,634,176 tweets gathered from Twitter over a period of six months. Results show that the proposed method is an interesting tool for characterizing a city population based on machine learning methods and text analytics.

Keywords: natural language processing, word embedding, t-SNE, social network analysis
Journal: Cities

1 Introduction

Internet usage in Colombia, and in particular in Bogotá, has been increasing in recent years, not only because the Internet constitutes the largest source of information of every kind, but also because of governmental efforts to turn people from different social backgrounds into regular Internet users tic (), with the aim of reducing the digital divide in Colombia. For instance, from 2011 to 2014, several areas of the city experienced an increase of up to 148.3% in homes with an Internet connection dig_divide ().

By 2011, Facebook and Twitter were the most popular social networks in Bogotá tic (). Moreover, 65.5% of Bogotá’s Internet users used social networks by 2014 dig_divide (). This study concentrates on Twitter because it allows its users to write posts with a length ranging from 1 to 140 characters (data was collected before the maximum tweet length was raised to 280 characters), making Twitter a tool for microblogging, a form of communication in which users express their opinion about several topics in short posts (tweets) tw_usage ().

By only taking the users’ posts and not the explicit relations between the users, our large data set is an interesting basis for testing unsupervised community detection methodologies. In fact, the only common factor that relates the users in our data set is their geographical location. Such unsupervised community detection methodologies help to understand social phenomena that take place in that geographical region and in a particular period of time, using data from any social network with no a priori knowledge of people’s relations.

With the objective of identifying the main topics treated by Bogotá’s population on Twitter, as well as detecting possible communities, we collected tweets emitted from Bogotá. Then, a Word2Vec model d2v () was built in order to represent the set of tweets corresponding to each user as a vector in a vector space. The gap statistic gap () was used to estimate the number of clusters that could be formed using these vectors. Finally, a frequency distribution of words was built for each cluster, so that each cluster could be identified by its most frequent words, which ultimately characterizes a topic.

Our work acknowledges that online social networks constitute one of the main scenarios where people express their opinions, which makes them an outstanding source of information for characterizing the topics that matter to the citizenry. Therefore, we provide a robust method for studying cultural and political aspects of a society, which unveils groups of citizens that are actively concerned about particular topics that are relevant for the dynamics of a city. The main contribution of our work is the fully unsupervised and city-dependent method that we propose for discovering such groups. In turn, the identification of the main topics discussed by the citizens simplifies the task of targeting groups of people for promoting cultural, political or educational campaigns about very specific issues.

This paper is organized as follows. In Section 2 we present related work on methods for topic detection and models for text representation, as well as a short review of community detection algorithms. Section 3 details material and methods. In Section 4 we present the theoretical aspects of the proposed method. In Section 5 the principal results are shown. We discuss these results in Section 6 with the purpose of shedding light on the way our method detected topics treated by Bogotá’s population. Finally, Section 7 presents the main conclusions of the paper and future work.

2 Related Work

In this section we review important milestones in the areas of topic detection, opinion-mining, text representation and community detection models, all of which are pertinent to our work.

Numerous methods have been created for topic detection in texts. The most successful and acknowledged ones are Latent Dirichlet Allocation (LDA) Blei2003 (), Latent Semantic Analysis (LSA) (or Latent Semantic Indexing (LSI)) Deerwester1990 (), and Correlated Topic Models (CTM) Blei2006 (). However, practitioners usually tune and combine these methods in larger pipelines tailored to the particular application in which they are interested. In particular, analyzing posts from Twitter is a challenging task because these are short texts that lack context. Moreover, many of the words in these short texts do not belong to any official language, because of both misspelling and the use of Internet slang. Nevertheless, several studies have tried to adapt to these inconveniences. A model Zhou2016 () based on TwitterLDA Zhao2011 () (which is an adaptation of LDA to short texts such as tweets) was constructed to take advantage of Twitter as a source of real world event information. This model aimed to detect emerging events that could affect a geographical region in a city, and also quantitatively estimated the impact of such an event on the population near the event location. Another study Benny2015 () proposed a model to detect related topics based on clustering performed over a TF-IDF representation of tweets. The clustering was done by including weights between words called Associative Gravity Force Klahold2014 () (along with other measures of similarity between words, as well as ranking of words), which accounted for the frequency of pairs of words occurring together. Also, graph-based approaches have recently been proposed Hachaj2017 (), in which hash-tags are used to identify communities and topics by finding frequently co-occurring hash-tags.

Other less common methods have also been used to study texts from Twitter, as in Cigarrán2016 (), where Formal Concept Analysis (FCA) Wille1992 (); BernhardGanter1999 (), a mathematical application of lattices and ordered sets to the process of concept formation, was used as an alternative approach to topic detection. The authors use FCA because it deals with several problems that traditional methods suffer from, such as the unknown number of topics and the difficulty of adapting to new topics, among others.

It is worth noting that the majority of work has been done using English text corpora with valuable results Dashtipour2016 (). However, Spanish is a significantly more inflected language than English, and this difference could pose problems. For example, supervised machine learning methods for the topic classification of annotated Spanish tweets modeled with $n$-grams have been shown to be insufficient FernandezAnta2013 (). Better attempts to deal with the several research branches of topic detection and opinion mining in Spanish have taken place. For instance, in the field of opinion mining, studies like DoloresMolina-Gonzalez2015 () proposed a lexicon-based model that adapts to specific domains in Spanish for polarity classification of film reviews. Also, polarity classification in Spanish tweets has been treated in Vilares:2013:SPC:2494266.2494300 (), where hybrid systems that bring together knowledge from lexical, syntactic and semantic structures in the Spanish language, as well as machine learning techniques used with the bag-of-words representation, have shown improvements over the bag-of-words approach alone. Besides, the clever creation of a corpus called MeSiento (Spanish for “I feel”) Montejo-Raez:2013:SKB:2487788.2487996 () allowed a robust unsupervised method for polarity classification of Spanish tweets that reached accuracy levels close to the ones obtained with supervised algorithms.

Nonetheless, these methods applied to sentiment analysis tasks depend heavily on annotated dictionaries, which do not contain Bogotá’s jargon. As a matter of fact, few attempts have been made to build small topic-specific dictionaries, such as AlvaradoValencia2016 (), where the political sentiment towards Bogotá mayoral candidates for 2015 was analyzed using Twitter and a political sentiment dictionary defined in the Colombian political context. A second study briefly examined the sentiment of tweets from Bogotá containing words related to health symptoms Salcedo2015 (). A third study examined the results of the 2015 Colombian regional elections and compared them with the political ideology and Twitter activity of the candidates Correa2016 ().

In the end, it is clear that an unsupervised model for text representation is needed to give robust topic-independent text representations. One of the most successful and widely used text representation models is Word2Vec, which has proven to give good results regardless of the language in opinion mining and topic detection tasks. For instance, Enríquez2016 () combined Word2Vec and a bag-of-words document classifier, and showed that Word2Vec provided word embeddings that produced more stable results in cross-domain classification experiments. Also, since Word2Vec was first introduced, there has been some research trying to improve and fine-tune word embeddings. Such is the case of Li2016 (), which proposes a hybrid between the skip-gram model and the continuous bag of words (CBOW) model called mixed word embedding (MWE). All in all, we choose word embedding models such as Word2Vec because they embed semantic similarities between words in a similarity metric defined over a Euclidean vector space.

Advances in sentiment analysis have been reported in recent works such as Dashtipour2016 (), where state-of-the-art methods were surveyed and compared. Deep learning methods are being used in works such as Dashtipour2018ExploitingDL () to analyze sentiments in Persian texts. Deep convolutional neural networks have also been investigated to analyze sentiments on Twitter Jianqiang2018 (). Deep learning based methods have been used to detect malicious accounts in location-based social networks Gong2018 (). One recent work used a Bayesian network and fuzzy recurrent neural networks for detecting subjectivity Chaturvedi2018 ().

With regard to one of the particular objectives of our work, detecting communities, several methods have been developed in the last couple of decades to solve the so-called planted $\ell$-partition model, where the structure of graphs is studied to find densely connected groups of nodes (see refs andrea2009 (); Yang2016 () for excellent reviews). More modern methods based on embedding communities in low-dimensional vector spaces try to solve problems such as node clustering, node classification, low-dimensional visualization and edge prediction, among others, with great success Cavallari:2017 (); yeli2018 (). However, we point out that this is a very active area of research with many facets, and as argued in rosvall2017 (), community detection should not be considered a well-defined problem, but should instead be motivated by particular reasons. In this sense, our motivation for detecting communities is to find groups of people with a clear topic of interest, regardless of whether such groups of people follow each other on Twitter. This means that we do not know from the beginning any connection between the nodes (users), and we aim to detect communities solely based on the data that characterizes each node, i.e. the text representation of each user’s tweets.

The contribution of this paper is twofold: first, we propose a method to automatically identify digital communities of a city grouped by topic of interest, and second, we collect a set of tweets from Bogotá’s citizens to illustrate the proposed method.

3 Material and methods

An overview of the proposed method is depicted in Figure 1, which is inspired by the ideas found in Silva:2017 (). We first crawl a set of tweets, which are stored in a document database. Then, we generate vector representations of texts using Word2Vec. We selected this model because it has been the seed for all word embedding models, and it is the most widely used model, despite the existence of newer and very successful word embedding models such as fastText fasttextzip2016 (); joulin2017 (); bojanowski2017 (), BERT bert2018 (), Swivel swivel2016 () and ELMo elmo2018 (). Afterwards, clustering analysis is performed in the embedded space to find latent topics. Each user is projected in a 2D visualization in which the obtained latent topics are colored. It is of the utmost importance to notice that we do not create a graph with explicit edges between the users, but rather let the latent topics found in each user’s tweets create implicit edges. The following sections describe each of these components in detail.

Figure 1: Overview of the proposed method: (1) A component crawls a set of tweets, which are stored in a document database; (2) The Word2Vec model is applied to this data set to represent all the tweets in an embedding space; (3) A clustering analysis is performed in the embedded space to find latent topics; (4) Each user is projected in a 2D visualization in which the obtained latent topics are colored.

3.1 Crawler

A crawler component was implemented in Java using the Twitter Streaming API, which allowed us to collect a set of 2,476,426 tweets from Bogotá over a period of 111 days (from August 2015 to December 2015). The query only searched for tweets that matched the string “Bogotá” in the field “place” of the tweet meta-data.
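The place-based filter described above can be illustrated offline. The following is a minimal Python sketch, not the Java crawler used in this work. It assumes that each tweet arrives as a JSON object whose `place` field is either null or an object with a `full_name` string; the sample tweets and the helper name `is_from_place` are hypothetical.

```python
import json

def is_from_place(tweet, needle="Bogotá"):
    """Return True if the tweet's place metadata mentions `needle`."""
    place = tweet.get("place") or {}          # "place" may be missing or null
    full_name = place.get("full_name") or ""
    return needle.lower() in full_name.lower()

# Hypothetical raw lines, one JSON object per line, as a stream might deliver them.
raw_lines = [
    '{"id": 1, "text": "Trancón en la 26", "place": {"full_name": "Bogotá, D.C., Colombia"}}',
    '{"id": 2, "text": "Hola desde Cali", "place": {"full_name": "Cali, Colombia"}}',
    '{"id": 3, "text": "Sin ubicación", "place": null}',
]

tweets = [json.loads(line) for line in raw_lines]
kept_ids = [t["id"] for t in tweets if is_from_place(t)]
print(kept_ids)
```

Tweets without place metadata are simply dropped, which matches a query that only matches on the “place” field.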

3.2 Tweets data set

Tweets corresponding to the same user form a document. All the documents form the corpus. The set of words composing the corpus is the vocabulary. The distribution of tweet and document length of the corpus is shown in Figure 2. It is worth noting that most of the tweets have 20-60 characters. The average length in characters of the tweets is 55 and the average length in tokens of the documents is 639.

Figure 2: Distribution of tweet and document length in the corpus.

3.3 Pre-processing

We decided to discard tweets with fewer than 20 characters because very short texts usually lack significant information. From the tweets with 20 or more characters, a total of 58,644 documents were created. The documents were tokenized using NLTK’s Tweet Tokenizer bird2009natural (), which allowed us to preserve emoticons (and smileys) in tweets. The tokens were not stemmed because Word2Vec deals with different conjugations of words.
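The pre-processing step can be sketched as follows. This is only an illustration under stated assumptions: the regular-expression tokenizer is a simplified stand-in for NLTK’s Tweet Tokenizer (it is not the tokenizer used in this work), and the sample tweets and function names are hypothetical. Only the 20-character threshold comes from the text.

```python
import re
from collections import defaultdict

MIN_TWEET_CHARS = 20  # threshold stated in the text

# Stand-in tokenizer (an assumption, not NLTK's TweetTokenizer): it keeps a few
# common emoticons, @mentions, #hashtags and words as separate tokens.
TOKEN_RE = re.compile(r"[:;=][-']?[)(DPp]|[@#]?\w+")

def tokenize(text):
    """Split a tweet into tokens, lowercasing words but not emoticons."""
    return [tok if tok[0] in ":;=" else tok.lower() for tok in TOKEN_RE.findall(text)]

def build_documents(tweets):
    """Concatenate the tokens of sufficiently long tweets: one document per user."""
    documents = defaultdict(list)
    for user, text in tweets:
        if len(text) >= MIN_TWEET_CHARS:
            documents[user].extend(tokenize(text))
    return dict(documents)

sample = [
    ("ana", "Qué partidazo el de Santa Fe hoy :)"),
    ("ana", "jajaja"),                                # too short, discarded
    ("luis", "El tránsito en la 26 está imposible"),
]
docs = build_documents(sample)
print(docs)
```

No stemming is applied, in line with the choice described above.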

3.4 Word2Vec embedding

With these documents, a Word2Vec model with a context distance of 6 words was trained. Words that appeared at least 10 times in the corpus were selected from the vocabulary to train the model. This subset of the corpus vocabulary (the model’s vocabulary) was composed of 55,168 words. Each document is then represented as the average of the embedded vectors of the words composing it, in a vector space whose dimensionality was set to 150.

3.5 Documents database

Having the vector representation of the documents in the 150-dimensional space, we discarded the documents that contained fewer than 40 word occurrences from the model’s vocabulary. This was done with the purpose of examining documents that represented active Twitter users. A total of 30,746 documents satisfied this condition.
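The document representation and the activity filter can be sketched in a few lines. The example below is a toy illustration, assuming the trained embeddings are available as a plain dictionary from word to vector; the 3-dimensional vectors, the sample documents and the function names are hypothetical, and the threshold is lowered from 40 to 2 so that the toy data exercises it.

```python
def document_vector(tokens, embeddings):
    """Average the embeddings of the tokens that belong to the model's vocabulary.

    Returns (vector, n_known), where n_known counts in-vocabulary occurrences.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None, 0
    dim = len(known[0])
    mean = [sum(vec[d] for vec in known) / len(known) for d in range(dim)]
    return mean, len(known)

def embed_active_users(documents, embeddings, min_known=40):
    """Keep only documents with at least `min_known` in-vocabulary word occurrences."""
    vectors = {}
    for user, tokens in documents.items():
        vec, n_known = document_vector(tokens, embeddings)
        if n_known >= min_known:
            vectors[user] = vec
    return vectors

# Toy 3-dimensional embeddings (the paper uses 150 dimensions).
toy_embeddings = {"gol": [1.0, 0.0, 0.0], "partido": [0.0, 1.0, 0.0], "paz": [0.0, 0.0, 1.0]}
toy_documents = {
    "ana": ["gol", "partido", "gol", "xyz"],   # 3 known occurrences
    "luis": ["paz", "qwerty"],                 # 1 known occurrence
}
user_vectors = embed_active_users(toy_documents, toy_embeddings, min_known=2)
print(user_vectors)
```

Out-of-vocabulary tokens are ignored both in the average and in the occurrence count.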

3.6 Clustering of documents (users)

In order to identify the main topics treated by Bogotá’s Twitter users we applied the k-means algorithm (as implemented in Python’s scikit-learn module scikit-learn ()) to the vector representation of the documents. To determine the number of clusters in the cluster analysis, a gap statistic study was performed.

3.7 Frequency distribution of words within clusters

Once the number of clusters $k$ was estimated, we took the 15 most representative documents of each cluster to build a frequency distribution of words that allowed us to easily identify the topics represented in each cluster. Moreover, a tweet length distribution of the 15 documents of each cluster was also built in order to recognize which topics led users to write longer or shorter tweets.
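A minimal sketch of this step is given below. It assumes that the most representative documents are the ones closest to the cluster centroid in Euclidean distance, and that each document is available as a list of tokens. The toy vectors, documents and function names are hypothetical.

```python
import math
from collections import Counter

def most_representative(doc_vectors, centroid, n=15):
    """Return the ids of the n documents closest (Euclidean) to the centroid."""
    ranked = sorted(doc_vectors, key=lambda doc_id: math.dist(doc_vectors[doc_id], centroid))
    return ranked[:n]

def top_words(doc_ids, documents, k=10):
    """Word frequency distribution over the selected documents."""
    counts = Counter(tok for doc_id in doc_ids for tok in documents[doc_id])
    return counts.most_common(k)

# Toy cluster with three documents in a 2-dimensional space.
cluster_vectors = {"u1": [0.0, 0.0], "u2": [1.0, 0.0], "u3": [5.0, 5.0]}
cluster_centroid = [0.2, 0.0]
cluster_documents = {"u1": ["gol", "partido", "gol"], "u2": ["gol", "hoy"], "u3": ["paz"]}

closest = most_representative(cluster_vectors, cluster_centroid, n=2)
frequent = top_words(closest, cluster_documents, k=2)
print(closest, frequent)
```

The same selection of documents can be reused to build the tweet length distribution of each cluster.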

4 Theoretical framework

This section presents the theoretical framework used in each component of the proposed method.

4.1 Word2Vec Embedding

Word embeddings constitute a solution to the task of numerically representing words as semantics and syntax carriers within sentences. Let $\mathcal{V} = \{w_1, \ldots, w_N\}$ be the ordered set of words, or vocabulary, contained in the corpus, where $w_i$ is the $i$-th word, and $N$ is the size of the vocabulary. Generally, words are represented as hot-vectors, which are the $N$-dimensional canonical basis vectors,

$$\hat{w}_i = (0, \ldots, 0, 1, 0, \ldots, 0)^{\mathsf{T}}, \tag{1}$$

where $\hat{w}_i$ is the hot-vector representation of the word $w_i$, with the 1 in the $i$-th position. Clearly $(\hat{w}_i)_j = \delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta and $(\hat{w}_i)_j$ is the $j$-th component of the $i$-th word hot-vector, which shows that in this canonical basis of $\mathbb{R}^N$ there cannot be a relation between words, since they are pairwise orthogonal. Therefore, Mikolov et al. mikolov () proposed a model to build a projector $P: \mathbb{R}^N \to \mathbb{R}^M$, where $M \ll N$, that maps hot-vectors into embedded $M$-dimensional vectors. That is, $P\hat{w}_i = v_i$, where $v_i$ is the embedded $M$-dimensional representation of the word $w_i$. The way of constructing $P$ is via a multi-layered neural network that can be arranged in two different ways. The first way results in the Continuous Bag of Words (CBOW) model, and the second way results in the skip-gram model. This work centers on the CBOW model, in which the neural network has the job of predicting a target word given a set of words called context words. The context words of a word $w_i$ are defined as the set of words that are at a distance less than or equal to $c$ from each occurrence of the word $w_i$ in the corpus, where $c$ is some integer that one defines. For instance, if the corpus contains food reviews, one would expect the neural network to predict “food” when the context words are “delicious”, “yummy”, “exquisite”, and not to predict “cat”. The neural network can be depicted as in Figure 3, where the input layer consists of $C \times N$ neurons ($C$ is the number of words in the set of context words), the hidden layer of $M$ neurons, and the output layer of $N$ neurons.

Figure 3: Three-layered neural network illustrating the CBOW method from Word2Vec.

To see how $P$ is constructed, consider the simplified problem of a one word context ($C = 1$). In this case, the input layer receives a word hot-vector $\hat{w}_i$, and acts linearly on it with an $M \times N$ matrix $\mathbf{W}$, whose components $W_{ji}$ are the weights between the $i$-th neuron of the input layer and the $j$-th neuron of the hidden layer. The hidden layer can be represented by an $M$-dimensional vector $h$ whose components are the sum of the weights received by each neuron, i.e. $h = \mathbf{W}\hat{w}_i$. Here, $\mathbf{W}$ can be identified as the matrix form of the projector $P$, where the $i$-th column of $\mathbf{W}$ is $v_i$. In a similar fashion, $\mathbf{W}'$ is an $N \times M$ weight matrix between the hidden and the output layer. Note that both the $i$-th column of $\mathbf{W}$ and the $i$-th row of $\mathbf{W}'$, denoted $v'_i$, are $M$-dimensional vector representations of the word $w_i$. Moreover, since $h = v_i$, then $u_j \equiv (\mathbf{W}' h)_j = v'_j \cdot v_i$. This allows $u_j$ to be a score of similarity between the different vector representations of the words $w_i$ and $w_j$. These scores allow the definition of a softmax multinomial distribution,

$$p(w_j \mid w_i) = y_j = \frac{\exp(u_j)}{\sum_{l=1}^{N} \exp(u_l)}, \tag{2}$$

where $y_j$ is the output of the $j$-th neuron in the output layer. Eq. (2) is the estimated probability of $w_i$ being a context word of $w_j$. To learn the correct probabilities, a loss function is defined as

$$E = -\log p(w_j \mid w_i). \tag{3}$$

Since the goal is to minimize $E$, then $u_j = v'_j \cdot v_i$ must become larger in the learning process, meaning that words that share a context must have similar vector representations, and words that do not share contexts must have dissimilar vector representations. Notice that computing $p(w_j \mid w_i)$ is time-consuming, since $N$ is normally a big number.

Furthermore, when $C > 1$, $h$ is defined as the average vector of the context words’ vector representations, $h = \frac{1}{C} \sum_{w_i \in \mathcal{C}} v_i$. Here the loss function is $E = -\log p(w_j \mid \mathcal{C})$, where $\mathcal{C}$ is a set of $C$ words chosen from $\mathcal{V}$. This probability can be expressed as in Eq. (2) by replacing $v_i$ with $h$. Notice that there are $\binom{N}{C}$ ways of picking contexts, so the number of calculations increases dramatically. To reduce this number, negative sampling mikolov () is used. Let $U(w)$ be the term frequency distribution of words. Let $P_n(w) \propto U(w)^{3/4}$ be a noise distribution built by taking the probability of each word, raising it to the power of 3/4 (this allows less frequent words to have more probability of being drawn goldberg2014word2vec ()), and then renormalizing. The problem of minimizing $E$ is converted into a binary classification problem as follows. Given a context $\mathcal{C}$, we pick a word $w_O$ from $\mathcal{V}$ such that the words in $\mathcal{C}$ are context words for $w_O$. This pair $(w_O, \mathcal{C})$ is called a true sample and is labeled by $D = 1$. From the noise distribution $P_n$, $K$ words $\tilde{w}_1, \ldots, \tilde{w}_K$ are drawn. The pairs $(\tilde{w}_k, \mathcal{C})$ are negative samples and are labeled by $D = 0$. Therefore, the binary classification problem can be stated as maximizing the joint probability of $D$ and $w$, where $w$ can refer to a word either from the true sample or from the negative samples dyer2014notes ():

$$p(D, w \mid \mathcal{C}) = \begin{cases} \dfrac{K}{1+K}\, P_n(w), & D = 0, \\[2mm] \dfrac{1}{1+K}\, \tilde{p}(w \mid \mathcal{C}), & D = 1. \end{cases} \tag{4}$$

Since the actual empirical distribution $\tilde{p}(w \mid \mathcal{C})$ is not known, the one defined in Eq. (2) is used. Now, to reduce the calculations, under some approximations dyer2014notes (), it is possible to write Eq. (4) as two simple equations:

$$p(D = 0 \mid \mathcal{C}, w) = \frac{1}{1 + \exp(v'_w \cdot h)}, \tag{5}$$

$$p(D = 1 \mid \mathcal{C}, w) = \frac{1}{1 + \exp(-v'_w \cdot h)}. \tag{6}$$

Note that these are sigmoid functions $\sigma(x) = 1/(1 + e^{-x})$ of the same argument, except for a sign. From Eq. (5) it can be seen that if $v'_w$ and $h$ are dissimilar (similar) then the mentioned probability will be close to 1 (0). Similarly, in Eq. (6), if $v'_w$ and $h$ are similar (dissimilar), the probability will also be close to 1 (0). In this case, it can be shown that the error function takes the form

$$E = -\log \sigma(v'_O \cdot h) - \sum_{k=1}^{K} \log \sigma(-\tilde{v}'_k \cdot h), \tag{7}$$

for a target word $w_O$, where $v'_O$ and $\tilde{v}'_k$ are the output vector representations of $w_O$ and of the negative word $\tilde{w}_k$. To see negative sampling in the context of the skip-gram model (which can be used instead of CBOW), see goldberg2014word2vec ().
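The quantities discussed in this subsection can be made concrete with a small numerical sketch. The code below is a toy illustration in pure Python, not the training procedure of any Word2Vec library. It computes the CBOW hidden vector as the average of the context vectors, the full softmax over the vocabulary, the noise distribution obtained by raising term frequencies to the power 3/4, and the negative sampling loss for one target word. The 4-word vocabulary, its vectors, the term frequencies and all function names are hypothetical; word vectors are stored as rows for convenience.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_vector(context_ids, input_vectors):
    """CBOW hidden layer: average of the input vectors of the context words."""
    dim = len(input_vectors[0])
    return [sum(input_vectors[i][d] for i in context_ids) / len(context_ids)
            for d in range(dim)]

def softmax_probabilities(h, output_vectors):
    """Full softmax over the vocabulary (one score per word, hence costly)."""
    scores = [dot(v_out, h) for v_out in output_vectors]
    top = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def noise_distribution(counts, power=0.75):
    """Term frequencies raised to the 3/4 power and renormalized."""
    weights = [c ** power for c in counts]
    total = sum(weights)
    return [w / total for w in weights]

def negative_sampling_loss(h, output_vectors, target_id, negative_ids):
    """-log sigmoid(v'_target . h) - sum_k log sigmoid(-v'_k . h)."""
    loss = -math.log(sigmoid(dot(output_vectors[target_id], h)))
    for k in negative_ids:
        loss -= math.log(sigmoid(-dot(output_vectors[k], h)))
    return loss

# Toy model: 4 words, 2 dimensions; row i holds the vector of word i.
input_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
output_vectors = [[0.5, 0.5], [2.0, 2.0], [-2.0, -2.0], [0.0, 1.0]]
counts = [100, 10, 10, 1]  # hypothetical term frequencies

h = hidden_vector([0, 1], input_vectors)          # context = words 0 and 1
probs = softmax_probabilities(h, output_vectors)
noise = noise_distribution(counts)

rng = random.Random(0)
target = 1
# Draw K = 2 negatives from the noise distribution, skipping the target itself.
negatives = []
while len(negatives) < 2:
    candidate = rng.choices(range(len(noise)), weights=noise, k=1)[0]
    if candidate != target:
        negatives.append(candidate)

loss_sampled = negative_sampling_loss(h, output_vectors, target, negatives)
loss_good = negative_sampling_loss(h, output_vectors, 1, [2])
loss_bad = negative_sampling_loss(h, output_vectors, 2, [1])
print(probs, noise, loss_sampled, loss_good, loss_bad)
```

In this toy setting the loss is small when the target’s output vector is aligned with the hidden vector and the negative word’s is not (`loss_good`), and large in the opposite case (`loss_bad`). The cost of the negative sampling loss depends on the number of negatives rather than on the vocabulary size.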

4.2 Gap statistic

The gap statistic allows the estimation of the number of clusters in a data set gap (). Consider a data set of $n$ samples and $p$ features. Let $C_r$ be the $r$-th cluster found with K-Means out of $k$ clusters (other clustering algorithms can be used), containing $n_r$ samples. The within-cluster dispersion is defined as

$$D_r = \sum_{x_i \in C_r} \sum_{x_j \in C_r} \lVert x_i - x_j \rVert^2, \tag{8}$$

which is similar to the variance except for a factor of $2 n_r^2$. The total within-cluster dispersion for $k$ clusters is

$$W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r. \tag{9}$$

$W_k$ is normally used to estimate the “correct” number of clusters via the elbow method. On the other hand, the gap statistic uses null reference distributions. These are each constructed by finding the range in the $p$-dimensional space of the samples and generating $n$ data points with a uniform distribution. Next, $k$ clusters are computed for each reference distribution. If there are $B$ such distributions, then the gap statistic is defined as

$$\mathrm{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log W^{*}_{kb} - \log W_k, \tag{10}$$

where $W^{*}_{kb}$ is the total within-cluster dispersion of the $b$-th reference distribution for $k$ clusters. It can be argued gap () that $\mathrm{Gap}(k)$ reaches a maximum at the estimated number of clusters $k = \hat{k}$ when the cluster centroids are aligned in an equally spaced fashion. Also, the uncertainty of $\mathrm{Gap}(k)$ can be estimated to be

$$s_k = \mathrm{sd}_k \sqrt{1 + \frac{1}{B}}, \tag{11}$$

where $\mathrm{sd}_k$ is the standard deviation of the logarithm of the reference distributions’ total within-cluster dispersion. Thus, with a 1-sigma certainty, $\hat{k}$ is the value of $k$ for which $\mathrm{Gap}(k)$ reaches its maximum.
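The procedure can be sketched end to end on synthetic data. The code below is an illustrative pure-Python implementation under stated assumptions, not the code used in this work: it uses a small Lloyd-style k-means with k-means++-like seeding instead of scikit-learn, measures the dispersion as the sum of squared distances to the nearest center (which equals the sum of pairwise squared distances divided by twice the cluster size once the centers are the cluster centroids), and draws the reference data uniformly within the bounding box of the samples. All function names and the three-blob toy data are hypothetical.

```python
import math
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def init_centers(points, k, rng):
    """k-means++ style seeding: favor points far from the centers chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

def kmeans_dispersion(points, k, rng, n_init=3, n_iter=20):
    """Run k-means n_init times and return the lowest total dispersion found."""
    best = math.inf
    for _ in range(n_init):
        centers = init_centers(points, k, rng)
        for _ in range(n_iter):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda r: sq_dist(p, centers[r]))
                clusters[nearest].append(p)
            # Keep the previous center if a cluster happens to become empty.
            centers = [centroid(c) if c else centers[r] for r, c in enumerate(clusters)]
        dispersion = sum(min(sq_dist(p, c) for c in centers) for p in points)
        best = min(best, dispersion)
    return best

def gap_statistic(points, k, rng, n_refs=10):
    """Return (Gap(k), s_k) using uniform reference data in the bounding box."""
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    log_w = math.log(kmeans_dispersion(points, k, rng))
    ref_logs = []
    for _ in range(n_refs):
        ref = [[rng.uniform(lo, hi) for lo, hi in zip(lows, highs)] for _ in points]
        ref_logs.append(math.log(kmeans_dispersion(ref, k, rng)))
    mean_ref = sum(ref_logs) / n_refs
    sd = math.sqrt(sum((v - mean_ref) ** 2 for v in ref_logs) / n_refs)
    return mean_ref - log_w, sd * math.sqrt(1.0 + 1.0 / n_refs)

# Toy data: three well-separated Gaussian blobs in two dimensions.
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
data = [[rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)] for cx, cy in blob_centers for _ in range(30)]

gaps = {k: gap_statistic(data, k, rng, n_refs=5)[0] for k in range(1, 6)}
print(gaps)
```

On data like this one would expect the gap to be clearly larger at three clusters than at one or two. For high-dimensional document vectors the curve can be much flatter, so the uncertainty estimate is useful when reading it.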

5 Results

Once the Word2Vec model was trained, each document was represented by the average of the Word2Vec representations of its words. The gap statistic was computed for several numbers of clusters $k$, resulting in Figure 4. The curve shown in this figure does not exhibit a sharp maximum. This happens because the Word2Vec model uses all the dimensions to represent the similarity relations between word vectors. Since texts are normally rich in words, it is expected that no clear clusters are formed, and therefore the number of clusters is hard to distinguish. This is also supported by the fact that Twitter users might tweet about different topics, causing their document vector to be assigned to clusters that group two or more topics. Despite these inconveniences, the gap statistic allows us to estimate the number of clusters. From the figure, it can be seen that the curve begins to flatten, or to form an elbow, around $k = 40$.

Figure 4: Gap statistic for several numbers of clusters.

The result of K-means clustering with 40 clusters is shown in Figure 7, where the 15 most representative documents of each cluster are plotted using PCA (Figure 7(a)) and t-SNE (Figure 7(b)), with a different color for each cluster.

(a) Clusters found by k-means using Principal Component Analysis (PCA).
(b) Clusters found by k-means using t-distributed Stochastic Neighbor Embedding (t-SNE).
Figure 7: Annotated two-dimensional visualizations of the 15 most representative documents for each of the 40 clusters computed with K-Means. The axes of both visualizations have arbitrary units.

We established the 15 most representative documents of each cluster by sorting in ascending order the Euclidean distance between each document of the cluster and its respective cluster centroid. With each set of 15 documents, a word frequency distribution was built in order to get the most frequent words of each cluster. A tweet length distribution was also built with each of these sets. The most frequent words in each cluster tell us whether the topics can be easily defined. From both visualizations it is seen that there are two types of clusters: the ones that belong to a mega-cluster, and the ones that are clearly different from the rest of the corpus, called one-topic clusters because their topic can be easily defined. The one-topic clusters are tagged in the figures with their respective topic (except for “Love” and “Religion”, which can be found within the mega-cluster). It can be seen that PCA produces a sparse visualization of the documents corresponding to one-topic clusters, while t-SNE groups them and plots them apart from the mega-cluster. It should be noted that the one-topic clusters can be used to treat documents as nodes in a graph, whose edges connect nodes from the same one-topic cluster. This graph could then be used by a semi-supervised community detection method such as mirabelli2018 () in order to label the remaining nodes from the mega-cluster.

The most frequent words for each cluster are presented in Table 1. It is noteworthy that even though clusters 25 and 30 are related to the same topic, documents in cluster 30 are written in a more news-like style, while cluster 25 contains documents full of personal opinions about Country Politics. The same relation occurs between clusters 34 and 39. Cluster 34 is full of documents with well-written tweets that report on the situation of a particular football team or match, whereas cluster 39 contains documents that express feelings about specific teams or football events. It is important to point out that cluster 2 was filled with tweets written in a dialogue-like style, i.e. tweets that have conversations in them. In the case of clusters 22 and 24, corresponding to news, the combination of words used in these reporting tweets is peculiar to those clusters. This is clear because even though news covers plenty of topics, the way the tweets are written is captured by Word2Vec.

| #22 (News) | #30 (Country Politics) | #34 (Sports) | #24 (News) | #25 (Country Politics) |
| 104 | 87 | 87 | 86 | 85 |
| hoy (today) | Santos (Colombia’s president) | gol (goal) | via (way) | Santos (Colombia’s president) |
| Tunja (a city) | FARC (Colombian guerrilla) | partido (match) | departamento (state) | FARC (Colombian guerrilla) |
| Bogotá | Colombia | city (Manchester City) | sector | paz (peace) |
| Boyacá (Colombian state) | paz (peace) | madrid (Real Madrid) | tránsito (transit) | Maduro (Venezuela’s president) |
| Santa Marta (a city) | Maduro (Venezuela’s president) | Santa Fé (football team) | accidente (accident) | país (country) |
| Colombia | Colombia | junior (football team) | Bogotá | país (country) |
| gobierno (government) | gobierno (government) | gran (great) | Cundinamarca (Colombian State) | terroristas (terrorists) |
| nacional (national) | justicia (justice) | mejor (better) | cierre (closure) | colombianos (Colombians) |
| vehículos (vehicles) | colombianos (Colombians) | américa (football team) | total | pueblo (people) |
| personas (people) | Uribe (Colombia’s ex-president) | nacional (football team) | carril (lane) | justicia (justice) |

| #10 (Local Politics) | #8 | #11 (Religion) | #2 (Dialogues) | #16 (Love) |
| 84 | 76 | 73 | 65 | 60 |
| Bogotá | más (more) | Dios (God) | más (more) | vida (life) |
| Peñalosa (Bogotá’s mayor) | vida (life) | Señor (Lord) | ser (to be) | día (day) |
| Colombia | siempre (always) | amor (love) | vida (life) | amor (love) |
| Santos (Colombia’s president) | solo (only) | padre (father) | hijueputa (sonofabitch) | quiero (I want) |
| Petro (Bogotá’s ex-mayor) | mejor (better) | vida (life) | amor (love) | Dios (God) |
| FARC (Colombian guerrilla) | día (day) | gracias (thanks) | bien (good) | siempre (always) |
| paz (peace) | nunca (never) | corazón (heart) | solo (only) | corazón (heart) |
| Uribe (Colombia’s ex-president) | cosas (things) | Jesús (Jesus) | alguien (someone) | mejor (better) |
| gobierno (government) | gente (people) | misericordia (mercy) | mamá (mom) | mierda (shit) |
| alcalde (mayor) | tiempo (time) | fuerza (strength) | hoy (today) | gracias (thanks) |

| #7 (Love) | #26 | #23 (Love) | #15 (Love) | #9 (Portuguese and French) |
| 57 | 51 | 50 | 49 | 49 |
| más (more) | vida (life) | Dios (God) | quiero (I want) | não (no) |
| vida (life) | mejor (better) | vida (life) | día (day) | est (is) |
| quiero (I want) | solo (only) | amor (love) | jajaja (laughter) | pas (not) |
| mierda (shit) | día (day) | siempre (always) | vida (life) | mais (more) |
| mejor (better) | años (years) | feliz (happy) | mejor (better) | quero (I want) |
| gente (people) | Colombia | corazón (heart) | hoy (today) | elle (he) |
| bien (good) | nunca (never) | gracias (thanks) | alguien (someone) | vou (you) |
| siempre (always) | siempre (always) | tiempo (time) | novio (boyfriend) | minha (my) |
| amor (love) | mundo (world) | nunca (never) | amo (I love) | hoje (today) |
| necesito (I need) | Bogotá | personas (people) | bien (good) | melhor (better) |

| #39 (Sports Football) | #5 (Love) | #31 | #1 | #29 |
| 48 | 44 | 43 | 42 | 42 |
| partido (match) | amor (love) | días (days) | quiero (I want) | usted (formal you) |
| Santa fé (football team) | quiero (I want) | hoy (today) | jajaja (laughter) | quiero (I want) |
| gol (goal) | solo (only) | bueno (good) | usted (formal you) | vida (life) |
| hoy (today) | vida (life) | gente (people) | solo (only) | solo (only) |
| vamos (come on) | amo (I love) | vida (life) | mejor (better) | cosas (things) |
| bien (good) | dia (day) | ahora (now) | vida (life) | mejor (better) |

| #13 | #38 | #28 | #27 | #20 |
| 42 | 41 | 41 | 40 | 40 |
| más (more) | más (more) | más (more) | jajaja (laughter) | amo (I love) |
| tan (so) | mejor (better) | vida (life) | más (more) | más (more) |
| vida (life) | quiero (I want) | tan (so) | fav | voy (I will) |
| hoy (today) | tan (so) | alguien (someone) | tan (so) | tan (so) |
| quiero (I want) | hoy (today) | quiero (I want) | quiero (I want) | quiero (I want) |
| solo (only) | mamá (mom) | solo (only) | vida (life) | vida (life) |

| #4 | #35 | #37 | #14 | #32 |
| 40 | 40 | 39 | 39 | 39 |
| más (more) | más (more) | más (more) | más (more) | más (more) |
| vida (life) | tan (so) | usted (formal you) | quiero (I want) | ser (to be) |
| alguien (someone) | vida (life) | vida (life) | vida (life) | tan (so) |
| solo (only) | mierda (shit) | mejor (better) | tan (so) | voy (I will) |
| persona (person) | solo (only) | amor (love) | mejor (better) | solo (only) |
| ser (to be) | asi (like this/that) | Dios (God) | Dios (God) | vida (life) |

| #6 | #0 | #17 | #18 | #33 |
| 38 | 37 | 36 | 36 | 36 |
| más (more) | más (more) | más (more) | más (more) | más (more) |
| vida (life) | quiero (I want) | quiero (I want) | tan (so) | amor (love) |
| quiero (I want) | tan (so) | tan (so) | mejor (better) | vida (life) |
| amor (love) | vida (life) | vida (life) | vida (life) | quiero (I want) |
| solo (only) | hoy (today) | mejor (better) | quiero (I want) | solo (only) |
| siempre (always) | día (day) | hoy (today) | jajaja (laughter) | tan (so) |

| #19 | #36 (English) | #12 | #3 | #21 |
| 35 | 34 | 34 | 33 | 33 |
| más (more) | like | más (more) | más (more) | más (more) |
| mejor (better) | love | hoy (today) | quiero (I want) | amor (love) |
| quiero (I want) | people | día (day) | siempre (always) | vida (life) |
| amor (love) | want | Dios (God) | vida (life) | solo (only) |
| vida (life) | get | vida (life) | voy (I will) | quiero (I want) |
| tan (so) | don’t | solo (only) | amor (love) | ser (to be) |
Table 1: The most frequent words of each cluster. Each cluster is represented by three cells: the first gives the cluster number (#) and, in parentheses, its topic when the topic is easily identified; the second gives the average number of characters per tweet in that cluster; and the third lists the 10 most common words of the cluster (with their English translations in parentheses) for the 15 clusters with the largest average tweet length, and the 6 most common words for the remaining clusters.
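As an illustration of how word lists like those in Table 1 could be obtained, the following minimal sketch counts word frequencies per cluster using only Python's standard library. The function name `top_words_per_cluster`, the whitespace tokenization, and the toy documents and labels are assumptions made for this example; they do not reproduce the preprocessing used in the study.

```python
from collections import Counter, defaultdict


def top_words_per_cluster(documents, labels, n_words=6):
    """Return the n_words most frequent words of each cluster.

    documents: list of strings (hypothetical, already cleaned texts).
    labels: cluster identifier assigned to each document.
    """
    counts = defaultdict(Counter)
    for text, label in zip(documents, labels):
        # Naive whitespace tokenization; an assumption for this sketch.
        counts[label].update(text.lower().split())
    return {
        label: [word for word, _ in counter.most_common(n_words)]
        for label, counter in counts.items()
    }


# Toy data: two documents per cluster, labels chosen arbitrarily.
documents = [
    "gol gol partido hoy",
    "partido gol vamos",
    "amor vida amor",
    "amor quiero vida amor",
]
labels = [39, 39, 5, 5]

top = top_words_per_cluster(documents, labels, n_words=2)
for label, words in sorted(top.items()):
    print(label, words)
```

In practice, stop words would probably need to be removed before counting; otherwise very common function words dominate every cluster.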

6 Discussion

Remarkably, many clusters contain documents that constantly refer to love. The similarity between these clusters is apparent from the many words they share. Cluster 9 is a particular case because it contains documents whose tweets are written in French and Portuguese.

The average tweet length of each cluster shows that longer tweets generally belong to documents in clusters that represent a specific topic. In contrast, clusters with a short average tweet length consist of documents whose tweets express personal experiences. Table 1 contains many clusters of this kind, indicating that people tend to express their personal experiences with the same set of words and with very similar semantic expressions.
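The per-cluster average tweet length used in this comparison can be sketched as follows. This is a minimal example under stated assumptions: each document is taken to be a list of tweets, each document carries a single cluster label, and the name `average_tweet_length` and the toy data are hypothetical.

```python
from collections import defaultdict


def average_tweet_length(documents, labels):
    """Average number of characters per tweet for each cluster.

    documents: list of documents, each a list of tweet strings (assumed shape).
    labels: cluster identifier assigned to each document.
    """
    total_chars = defaultdict(int)
    total_tweets = defaultdict(int)
    for tweets, label in zip(documents, labels):
        total_chars[label] += sum(len(tweet) for tweet in tweets)
        total_tweets[label] += len(tweets)
    # Clusters without any tweet are skipped to avoid dividing by zero.
    return {
        label: total_chars[label] / total_tweets[label]
        for label in total_tweets
        if total_tweets[label] > 0
    }


# Toy data: one topical document and two short personal ones.
documents = [
    ["gol de Santa Fe hoy", "vamos"],
    ["te amo"],
    ["hoy", "bien"],
]
labels = [39, 5, 5]

averages = average_tweet_length(documents, labels)
for label, value in sorted(averages.items(), key=lambda item: -item[1]):
    print(label, round(value, 2))
```

Sorting the clusters by this average, as in the last loop, gives the ordering used to present the clusters in Table 1.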

To complement Table 1, Figure 8 shows a PCA visualization of the cluster centroids. In this figure, the clusters with a longer average tweet length appear away from the mega-cluster formed by all the clusters with a short average tweet length. This observation suggests that people tend to share their personal experiences in shorter tweets, while they give opinions on topics important to the community in longer texts.

Figure 8: PCA visualization of the clusters’ centroids. The axes have arbitrary units. Each cluster is represented by its numeric identifier, as in Table 1. Green numbers mark the clusters whose topic is defined, purple numbers mark the clusters whose topic is love, and the remaining numbers mark the clusters whose topic is difficult to define.
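A two-dimensional PCA projection of cluster centroids can be sketched as below. To keep the example free of dependencies, it uses a pure-Python power iteration with deflation, which is a substitute chosen for illustration and not the implementation used in the study; in practice a library routine would normally be used. The centroid values are invented, with the last one placed far from the others to mimic a cluster that lies away from the mega-cluster.

```python
import math


def center(rows):
    """Subtract the per-dimension mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already centered rows."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def leading_eigenpair(matrix, iterations=500):
    """Approximate the dominant eigenvalue and eigenvector by power iteration."""
    d = len(matrix)
    vector = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vector = [x / norm for x in product]
    eigenvalue = sum(
        vector[i] * matrix[i][j] * vector[j] for i in range(d) for j in range(d)
    )
    return eigenvalue, vector


def pca_2d(rows):
    """Project rows onto their first two principal components."""
    centered = center(rows)
    matrix = covariance(centered)
    d = len(matrix)
    components = []
    for _ in range(2):
        eigenvalue, vector = leading_eigenpair(matrix)
        components.append(vector)
        # Deflation: remove the component just found from the matrix.
        matrix = [
            [matrix[i][j] - eigenvalue * vector[i] * vector[j] for j in range(d)]
            for i in range(d)
        ]
    return [[sum(r[j] * c[j] for j in range(d)) for c in components] for r in centered]


# Invented centroids: four close together and one far away.
centroids = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.0, 0.1],
    [0.1, 0.1, 0.0],
    [2.0, 1.0, 0.5],
]
coords = pca_2d(centroids)
for cluster_id, (x, y) in enumerate(coords):
    print(cluster_id, round(x, 3), round(y, 3))
```

Because principal components are defined only up to sign, the plot may appear mirrored between runs of different implementations; distances between centroids in the projection are unaffected.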

The overlap of document vectors between two topics means that the corresponding users belong to both communities, which is expected because a user is typically interested in more than one topic. In machine learning this is known as soft clustering, in which an object can belong to more than one cluster. This analysis was not conducted in our approach, but it would be interesting to address it in future work.
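Since soft clustering was not part of this study, the following is only a sketch of what such an analysis might look like. It computes membership degrees of one document vector with respect to fixed centroids using the membership formula of fuzzy c-means; the function name `soft_memberships`, the fuzziness value, and the two-dimensional toy centroids are assumptions made for the example.

```python
import math


def soft_memberships(vector, centroids, fuzziness=2.0):
    """Degree of membership of a vector in each cluster (values sum to 1).

    Uses the fuzzy c-means membership rule: memberships are proportional
    to distance ** (-2 / (fuzziness - 1)), with fuzziness > 1.
    """
    distances = [math.dist(vector, centroid) for centroid in centroids]
    # A vector lying exactly on a centroid belongs fully to that cluster.
    if any(distance == 0.0 for distance in distances):
        hits = [1.0 if distance == 0.0 else 0.0 for distance in distances]
        return [hit / sum(hits) for hit in hits]
    exponent = 2.0 / (fuzziness - 1.0)
    weights = [distance ** -exponent for distance in distances]
    total = sum(weights)
    return [weight / total for weight in weights]


# Toy centroids for two hypothetical communities.
centroids = [(0.0, 0.0), (4.0, 0.0)]

between = soft_memberships((2.0, 0.0), centroids)  # equally close to both
closer = soft_memberships((1.0, 0.0), centroids)   # closer to the first
print(between, closer)
```

A user whose document vector receives comparable memberships in two clusters, as in the first call, would be counted as belonging to both communities.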

6.1 Threats to validity

It is worth noting that our approach does not use other information available on Twitter, such as user account metadata, retweets, likes, URLs, images, and location or geo-referenced data (when available). We concentrate our analysis only on the text content. However, this complementary information could add value by better representing a user and could, in turn, modify the obtained communities. Another issue in our approach is the number of tweets used in the experiments. We gathered data from Bogotá users over a period of six months, but it is not clear how much data is enough. Moreover, gathering more data is time-consuming and demands considerable disk space. We consider that this issue can be addressed in future work by analyzing how the detected communities behave over a longer period of time.

6.2 Policy implications

Our work can be used by government and private entities to develop cultural, political, or educational campaigns, which address the most valuable intangible fields in a society. We believe that the method proposed in this paper is especially useful in developing countries, where a gap or disconnection exists between politicians in the government and the citizenry, because it identifies the main concerns of a society. For example, while we gathered the tweets, every city in Colombia was going through one of the deepest changes in the country because of the peace treaty between the Colombian government and the FARC guerrilla, which our method detected as one of the main topics discussed in Bogotá. This early information could have been used by the government to explain the treaty specifically to those who were confused or uninformed.

7 Conclusion and future work

This paper presented a method to automatically identify communities using citizens’ data from the social network Twitter. To the best of our knowledge, this is the first study that analyzes Twitter data from Bogotá to automatically detect communities. The proposed method is based on machine learning techniques such as neural networks and dimensionality reduction algorithms. Results show that this method can find groups of citizens that share common topics such as politics, news, religion, sports, and languages, among others. This tool could be used by the local government to support decision-making processes in which what communities express can provide valuable information.

As future work, we want to compare our approach with the communities obtained when the data is modeled as a graph, which will be challenging because the resulting communities will not necessarily be comparable. We also want to explore the use of other data available on Twitter, such as user account metadata, retweets, likes, and URLs, since complementary information about users may add value to the concept of community. In addition, we want to gather more tweets to increase the sample and analyze how the communities change over time. It would be interesting to analyze how communities are influenced by social phenomena, events, and the period of the year.



  • (1) J. A. Chaparro Gaitán, Bogotá d.c. ciudad de estadísticas, Research Note 52, Secretaría Distrital de Planeación (July 2013).
  • (2) N. Durango Padilla, Tic y brecha digital, Research Note 73, Secretaría Distrital de Planeación (2014).
  • (3) A. Java, X. Song, T. Finin, B. Tseng, Why we twitter: Understanding microblogging usage and communities, in: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis, WebKDD/SNA-KDD ’07, ACM, New York, NY, USA, 2007, pp. 56–65.
  • (4) R. Řehůřek, P. Sojka, Software framework for topic modelling with large corpora, in: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Valletta, Malta, 2010, pp. 45–50.
  • (5) R. Tibshirani, G. Walther, T. Hastie, Estimating the number of clusters in a data set via the gap statistic, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63 (2) (2001) 411–423.
  • (6) D. M. Blei, A. Y. Ng, M. I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research 3 (2003) 993–1022.
  • (7) S. Deerwester, S. T. Dumais, R. Harshman, Indexing by latent semantic analysis, Journal of the American society for information science 41 (6) (1990) 391–407.
  • (8) D. M. Blei, J. D. Lafferty, Correlated Topic Models, Advances in Neural Information Processing Systems 18 (2006) 147–154.
  • (9) Y. Zhou, S. De, K. Moessner, Real World City Event Extraction from Twitter Data Streams, Procedia Computer Science 58 (DaMIS) (2016) 443–448.
  • (10) W. X. Zhao, J. Jiang, J. Weng, J. He, E.-P. Lim, H. Yan, X. Li, Comparing twitter and traditional media using topic models, 33rd European Conference on IR Research, ECIR 2011 (2011) 338–349.
  • (11) A. Benny, M. Philip, Keyword based tweet extraction and detection of related topics, Procedia Computer Science 46 (Icict 2014) (2015) 364–371.
  • (12) A. Klahold, P. Uhr, F. Ansari, M. Fathi, Using word association to detect multitopic structures in text documents, IEEE Intelligent Systems 29 (5) (2014) 40–46.
  • (13) T. Hachaj, M. R. Ogiela, Clustering of trending topics in microblogging posts: A graph-based approach, Future Generation Computer Systems 67 (2017) 297–304.
  • (14) J. Cigarran, A. Castellanos, A. Garcia-Serrano, A step forward for Topic Detection in Twitter: An FCA-based approach, Expert Systems with Applications 57 (2016) 21–36.
  • (15) R. Wille, Concept lattices and conceptual knowledge systems, Computers & Mathematics with Applications 23 (6-9) (1992) 493–515.
  • (16) B. Ganter, R. Wille, Formal Concept Analysis: Mathematical Foundations, 1999.
  • (17) K. Dashtipour, S. Poria, A. Hussain, E. Cambria, A. Y. A. Hawalah, A. Gelbukh, Q. Zhou, Multilingual sentiment analysis: State of the art and independent comparison of techniques, Cognitive Computation 8 (4) (2016) 757–771. doi:10.1007/s12559-016-9415-7.
  • (18) A. Fernández Anta, L. Núñez Chiroque, P. Morere, A. Santos, Sentiment analysis and topic detection of Spanish tweets: A comparative study of NLP techniques, Procesamiento del Lenguaje Natural 50 (2013) 45–52.
  • (19) M. Dolores Molina-González, E. Martínez-Cámara, M. Teresa Martín-Valdivia, L. Alfonso Ureña-López, A Spanish semantic orientation approach to domain adaptation for polarity classification, Information Processing and Management 51 (4) (2015) 520–531.
  • (20) D. Vilares, M. A. Alonso, C. Gómez-Rodríguez, Supervised polarity classification of spanish tweets based on linguistic knowledge, in: Proceedings of the 2013 ACM Symposium on Document Engineering, DocEng ’13, ACM, New York, NY, USA, 2013, pp. 169–172.
  • (21) A. Montejo-Ráez, M. C. Díaz-Galiano, J. M. Perea-Ortega, L. A. Ureña López, Spanish knowledge base generation for polarity classification from masses, in: Proceedings of the 22nd International Conference on World Wide Web, WWW ’13 Companion, ACM, New York, NY, USA, 2013, pp. 571–578.
  • (22) J. A. Alvarado Valencia, A. Carrillo, J. Forero, L. Caicedo, J. C. Ureña, XXVI Simposio Internacional de Estadística 2016 Sincelejo, Sucre, Colombia, 8 al 12 de Agosto de 2016, in: XXVI Simposio Internacional de Estadística 2016, Sincelejo, 2016, pp. 1–4.
  • (23) D. Salcedo, A. León, Behavior of Symptoms on Twitter, in: J. A. Lossio-Ventura, H. Alatrista-Salas (Eds.), Proceedings of the 2nd Annual International Symposium on Information Management and Big Data - SIMBig 2015, CEUR-WS, Cusco, 2015, pp. 83–84.
  • (24) J. C. Correa, J. Camargo, Ideological Consumerism in Colombian Elections, 2015: Links between Political Ideology, Twitter Activity and Electoral Results, Cyberpsychology, Behavior, and Social Networking 20 (2017) 37–43.
  • (25) F. Enríquez, J. A. Troyano, T. López-Solaz, An approach to the use of word embeddings in an opinion classification task, Expert Systems with Applications 66 (2016) 1–6.
  • (26) J. Li, J. Li, X. Fu, M. A. Masud, J. Z. Huang, Learning distributed word representation with multi-contextual mixed embedding, Knowledge-Based Systems 106 (2016) 220–230.
  • (27) K. Dashtipour, M. Gogate, A. Adeel, C. Ieracitano, H. Larijani, A. Hussain, Exploiting deep learning for persian sentiment analysis, in: BICS, 2018.
  • (28) Z. Jianqiang, G. Xiaolin, Z. Xuejun, Deep convolution neural networks for twitter sentiment analysis, IEEE Access 6 (2018) 23253–23260. doi:10.1109/ACCESS.2017.2776930.
  • (29) Q. Gong, Y. Chen, X. He, Z. Zhuang, T. Wang, H. Huang, X. Wang, X. Fu, Deepscan: Exploiting deep learning for malicious account detection in location-based social networks, IEEE Communications Magazine 56 (11) (2018) 21–27. doi:10.1109/MCOM.2018.1700575.
  • (30) I. Chaturvedi, E. Ragusa, P. Gastaldo, R. Zunino, E. Cambria, Bayesian network based extreme learning machine for subjectivity detection, Journal of the Franklin Institute 355 (4) (2018) 1780–1797, special issue on Recent advances in machine learning for signal analysis and processing.
  • (31) A. Lancichinetti, S. Fortunato, Community detection algorithms: A comparative analysis, Phys. Rev. E 80 (2009) 056117. doi:10.1103/PhysRevE.80.056117.
  • (32) Z. Yang, R. Algesheimer, C. J. Tessone, A comparative analysis of community detection algorithms on artificial networks, Scientific Reports 6 (2016) 30750.
  • (33) S. Cavallari, V. W. Zheng, H. Cai, K. C.-C. Chang, E. Cambria, Learning community embedding with community detection and node embedding on graphs, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM ’17, ACM, New York, NY, USA, 2017, pp. 377–386. doi:10.1145/3132847.3132925.
  • (34) Y. Li, C. Sha, X. Huang, Y. Zhang, Community detection in attributed graphs: An embedding approach (2018).
  • (35) M. Rosvall, J.-C. Delvenne, M. T. Schaub, R. Lambiotte, Different approaches to community detection, arXiv e-prints (2017) arXiv:1712.06468.
  • (36) W. Silva, A. Santana, F. Lobato, M. Pinheiro, A methodology for community detection in twitter, in: Proceedings of the International Conference on Web Intelligence, WI ’17, ACM, New York, NY, USA, 2017, pp. 1006–1009. doi:10.1145/3106426.3117760.
  • (37) A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, Compressing text classification models, arXiv e-prints (2016) arXiv:1612.03651.
  • (38) A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of tricks for efficient text classification, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Association for Computational Linguistics, 2017, pp. 427–431.
  • (39) P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017) 135–146. doi:10.1162/tacl_a_00051.
  • (40) J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, arXiv e-prints (2018) arXiv:1810.04805.
  • (41) N. Shazeer, R. Doherty, C. Evans, C. Waterson, Swivel: Improving embeddings by noticing what’s missing, CoRR abs/1602.02215.
  • (42) M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, arXiv e-prints (2018) arXiv:1802.05365.
  • (43) S. Bird, E. Klein, E. Loper, Natural language processing with Python: analyzing text with the natural language toolkit, O’Reilly Media, Inc., 2009.
  • (44) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
  • (45) T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in neural information processing systems, 2013, pp. 3111–3119.
  • (46) Y. Goldberg, O. Levy, word2vec explained: Deriving Mikolov et al.’s negative-sampling word-embedding method.
  • (47) C. Dyer, Notes on noise contrastive estimation and negative sampling, arXiv preprint arXiv:1410.8251.
  • (48) B. Mirabelli, D. Kushnir, Active Community Detection: A Maximum Likelihood Approach, arXiv e-prints (2018) arXiv:1801.05856.