Discover Your Social Identity from What You Tweet: a Content Based Approach
An identity denotes the role an individual or a group plays in highly differentiated contemporary societies. In this paper, our goal is to classify Twitter users based on their role identities. We first collect a coarse-grained public figure dataset automatically, then manually label a more fine-grained identity dataset. We propose a hierarchical self-attention neural network for Twitter user role identity classification. Our experiments demonstrate that the proposed model significantly outperforms multiple baselines. We further propose a transfer learning scheme that improves our model’s performance by a large margin. Such transfer learning also greatly reduces the need for a large amount of human labeled data.
Keywords: Social Identity, Twitter, User Profiling, Text Mining, Neural Network
An identity is a characterization of the role an individual takes on. It is often described as the social-context-specific personality of an individual actor or a group of people. Identities can be things like jobs (e.g., “lawyer”, “teacher”), gender (e.g., “man”, “woman”), or a distinguishing characteristic (e.g., “a shy boy”, “a kind man”). People with different identities tend to exhibit different behaviors in the social space. In this paper, we use role identity to refer to the roles individuals or groups play in society.
On social media platforms specifically, many different kinds of actors are active, e.g., people, organizations, and bots. Each type of actor has different motivations and different resources at its disposal, and may be under different internal policies or constraints on when it can use social media, how it can represent itself, and what it can communicate. If we want to understand who is controlling the conversation and who is being impacted, it is important to know what types of actors are doing what.
To date, most Twitter research has separated types of actors largely based on whether the accounts are verified by Twitter or not, or whether they are bots or not. A previous study has shown that separating Twitter users into bots and non-bots provides a better understanding of the online discussion of the U.S. presidential election. Bessi and Ferrara reveal that social bots distorted the online discussion of the 2016 U.S. presidential election and that about one-fifth of the entire conversation came from bots. However, a variety of different types of actors may be verified, e.g., news agencies, entertainment or sports teams, celebrities, and politicians. Similarly, bots can vary, e.g., news bots and non-news bots. If we could classify the role identities of actors on Twitter, we could gain an improved understanding of who is doing the influencing and who is being influenced. For example, knowing the social roles of bots would enable a more in-depth analysis of bot activities in the diffusion of disinformation, e.g., whether bots pretend to be news agencies to persuade regular users.
Understanding the sender’s role is critical for doing research on, and developing technologies to stop, disinformation [12, 30]. Research has shown that disinformation has a greater reach if it is spread by news agencies and celebrities. Disinformation is generally thought to be promoted by bots [4, 43]; however, most tools for identifying bots have relatively low accuracy when used in the wild. News reporters, news agencies, and celebrities often look like bots. Separating them out gives a better understanding of the role of bots in promoting disinformation. Assessing the extent to which official sites communicate in a way that effectively counters disinformation also requires identifying the sender’s role. Thus, role identification is foundational for disinformation research.
In this paper, the primary goal is to classify Twitter users based on their role identities on social media. First, we introduce two datasets for Twitter user identity classification. One is automatically collected from Twitter and aims at identifying public figures on social media. The other is a human-labeled dataset for more fine-grained Twitter user identity classification, which includes identities like government officials and news reporters. Second, we present a hierarchical self-attention neural network for Twitter user identity classification. In our experiments, our method outperforms several strong classification baselines. Finally, we propose a transfer learning scheme for fine-grained user identity classification that substantially improves our model’s performance.
2 Related Work
Sociologists have long been interested in the usage of identities across various social contexts. As summarized in the literature, three relatively distinct usages of identity exist. Some use identity to refer to the culture of a people. Some use it to refer to common identification with a social category. Others use identity to refer to the role a person plays in highly differentiated contemporary societies. In this paper, we use the third meaning. Our goal for identity classification is to separate actors with different roles in online social media.
Identity is the way that individuals and collectives are distinguished in their relations with others. Certain difficulties still exist in categorizing people into different groups based on their identities. Recasens et al. argue that identity should be considered to vary in granularity and that a purely categorical understanding would confine us to a fixed scope. While much work could be done along this line, at this time we adopt a coarse-grained labeling procedure that only looks at major identities in the social media space.
Twitter, a popular online news and social networking site, is also a site that affords interactive identity presentation to unknown audiences. As pointed out by Robinson et al., individuals form new cyber identities on the internet, which are not necessarily the way they would be perceived offline. A customized identity classifier is needed for online social media like Twitter.
A lot of research has tried to categorize Twitter users based on certain criteria, like gender, location [25, 24, 45], occupation [23, 33], and political orientation. Another similar research topic is bot detection, where the goal is to distinguish automated user accounts from normal Twitter accounts. In contrast, our work tries to categorize Twitter users based on their social identities or social roles. Priante et al. also study identity classification on Twitter. However, their approach is based purely on the profile description, while we combine the user's self-description and tweets. Additionally, our experiments demonstrate that tweets are more helpful for identity classification than personal descriptions.
In fact, learning Twitter users’ identities can benefit other related tasks. Twitter is a social media platform where individual user accounts and organization accounts co-exist. Many user classification methods, e.g., gender classification, may not work on these organization accounts. Another example is bot detection. In reality, accounts of news agencies and celebrities often look like bots, because these accounts often employ automated services or teams (so-called cyborgs), and they also share features with certain classes of bots; e.g., they may be followed more than they follow. Being able to classify actors’ roles on Twitter would improve our ability to automatically differentiate pure bots from celebrity accounts.
In this section, we describe the details of our hierarchical self-attention neural network. The overall architecture is shown in Figure 1. Our model first maps each word into a low-dimensional word embedding space, then uses a Bidirectional Long Short-Term Memory (Bi-LSTM) network to extract context-specific semantic representations for words. Using several layers of multi-head attention, it then generates a final classification feature vector. In the following parts, we describe these components in detail.
3.1 Word Embedding
Our model first maps each word in the description and tweets into a word embedding space $E \in \mathbb{R}^{V \times D}$ by a table lookup operation, where $V$ is the vocabulary size and $D$ is the embedding dimension.
Because of the noisy nature of tweet text, we further use a character-level convolutional neural network to generate character-level word embeddings, which are helpful for dealing with out-of-vocabulary tokens. More specifically, for each character $c_i$ in a word $w = (c_1, \dots, c_k)$, we first map it into a character embedding space and get $v_{c_i} \in \mathbb{R}^{d}$. Then a convolutional neural network is applied to generate features from characters. For a character window $v_{c_i:c_{i+h-1}} \in \mathbb{R}^{h \times d}$, a feature $\theta_i$ is generated by $\theta_i = f(W_c \cdot v_{c_i:c_{i+h-1}} + b)$, where $W_c \in \mathbb{R}^{h \times d}$ and $b$ are a convolution filter and a bias term respectively, and $f(\cdot)$ is a non-linear function. Sliding the filter from the beginning of the character embedding matrix to the end, we get a feature vector $\theta = [\theta_1, \theta_2, \dots, \theta_{k-h+1}]$. Then we apply max pooling over this vector to get the most representative feature. With $D_c$ such convolutional filters, we get the character-level word embedding for word $w$.
The final vector representation $v_w$ for word $w$ is the concatenation of its general word embedding vector and its character-level word embedding vector. Given one description with $M$ tokens and $T$ tweets each with $N$ tokens, we get two embedding matrices $X_d$ and $X_t$ for the description and tweets respectively.
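As a rough illustration of the character-level embedding described above, the following NumPy sketch slides each filter over the character windows of a word, applies a non-linearity, and max-pools over positions. The ReLU non-linearity, the zero-padding of words shorter than the filter width, and all sizes (16-dimensional character vectors, 50 filters of width 3) are assumptions made for the example, not settings reported in this paper.

```python
import numpy as np

def char_cnn_embedding(char_vecs, filters, biases):
    """Character-level word embedding via convolution and max pooling.

    char_vecs: (k, d) character embeddings of one word
    filters:   (m, h, d) m convolution filters of width h
    biases:    (m,) one bias per filter
    Returns an (m,) vector: one max-pooled feature per filter.
    """
    k, d = char_vecs.shape
    m, h, _ = filters.shape
    if k < h:
        # Assumption: zero-pad short words so at least one window exists.
        char_vecs = np.vstack([char_vecs, np.zeros((h - k, d))])
        k = h
    # All character windows of width h: shape (k - h + 1, h, d).
    windows = np.stack([char_vecs[i:i + h] for i in range(k - h + 1)])
    # theta[i, j] = relu(<filter_j, window_i> + b_j); ReLU is an assumption.
    theta = np.maximum(0.0, np.einsum("whd,fhd->wf", windows, filters) + biases)
    # Max pooling over window positions keeps the strongest response.
    return theta.max(axis=0)

rng = np.random.default_rng(0)
word_vec = rng.normal(size=300)          # general word embedding (300-d, as with GloVe)
chars = rng.normal(size=(7, 16))         # a 7-character word, hypothetical 16-d char vectors
filters = rng.normal(size=(50, 3, 16))   # hypothetical: 50 filters of width 3
biases = np.zeros(50)

char_vec = char_cnn_embedding(chars, filters, biases)
final_vec = np.concatenate([word_vec, char_vec])  # final word representation
```

The concatenated vector here has 350 dimensions only because of the sizes chosen for this example.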
After obtaining the embedding matrices for the tweets and the description, we use a bidirectional LSTM to extract context-specific features from each text. At each time step $i$, a forward LSTM takes the current word vector $v_{w_i}$ and the previous hidden state $\overrightarrow{h}_{w_{i-1}}$ to generate the hidden state $\overrightarrow{h}_{w_i}$ for word $w_i$. A backward LSTM generates another sequence of hidden states $\overleftarrow{h}_{w_i}$ in the reverse direction. We also tried a bidirectional GRU in our initial experiments, which yielded slightly worse performance.
The final hidden state $h_{w_i}$ for word $w_i$ is the concatenation of $\overrightarrow{h}_{w_i}$ and $\overleftarrow{h}_{w_i}$, i.e., $h_{w_i} = [\overrightarrow{h}_{w_i}; \overleftarrow{h}_{w_i}]$. With $T$ tweets and one description, we get two hidden state matrices $H_t$ and $H_d$.
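The bidirectional encoding can be sketched in plain NumPy as follows. This is a minimal, untrained illustration under assumed shapes and a standard LSTM gate layout (input, forget, output, candidate); the parameter names and sizes are hypothetical, and a real implementation would use a deep learning framework.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_pass(X, W, U, b):
    """Run one LSTM over the rows of X.

    X: (n, d_in) word vectors; W: (4u, d_in); U: (4u, u); b: (4u,)
    Returns the (n, u) sequence of hidden states.
    """
    u = U.shape[1]
    h = np.zeros(u)
    c = np.zeros(u)
    states = []
    for x in X:
        z = W @ x + U @ h + b
        i, f, o = sigmoid(z[:u]), sigmoid(z[u:2 * u]), sigmoid(z[2 * u:3 * u])
        g = np.tanh(z[3 * u:])
        c = f * c + i * g          # update the cell state
        h = o * np.tanh(c)         # new hidden state
        states.append(h)
    return np.stack(states)

def bilstm(X, fwd_params, bwd_params):
    """Concatenate forward and backward hidden states for every word."""
    h_fwd = lstm_pass(X, *fwd_params)
    # Process the reversed sequence, then flip back so rows align with words.
    h_bwd = lstm_pass(X[::-1], *bwd_params)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
d_in, u, n = 5, 4, 6  # hypothetical sizes

def random_params():
    return (rng.normal(scale=0.1, size=(4 * u, d_in)),
            rng.normal(scale=0.1, size=(4 * u, u)),
            np.zeros(4 * u))

fwd, bwd = random_params(), random_params()
X = rng.normal(size=(n, d_in))
H = bilstm(X, fwd, bwd)  # shape (n, 2u): one row per word
```

Each row of `H` carries context from both directions: the first half summarizes the words up to that position, the second half the words from that position onward.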
Following the Bi-LSTM layer, we use a word-level multi-head attention layer to find important words in a text.
Specifically, for an input matrix $H$, a multi-head attention is computed as follows:
$$\mathrm{MultiHead}(H) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$$
$$\mathrm{head}_i = \mathrm{softmax}\left(\frac{H W_i^Q \, (H W_i^K)^\top}{\sqrt{d_k}}\right) H W_i^V$$
where $d_k$ is the dimension of each head's key subspace, and $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are projection parameters for query, key, value, and output respectively.
Take a user description for example. Given the hidden state matrix $H_d$ of the description, each head first projects $H_d$ into three subspaces: query $H_d W_i^Q$, key $H_d W_i^K$, and value $H_d W_i^V$. The matrix product between query and key, after softmax normalization, is the self-attention, which indicates the important parts of the value matrix. The product of the self-attention and the value matrix is the output of this attention head. The final output of multi-head attention is the concatenation of the $h$ heads after projection by $W^O$.
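To make the projection, normalization, and concatenation steps concrete, here is a small NumPy sketch of multi-head self-attention over one hidden state matrix. The scaling by the square root of the per-head key dimension follows the standard scaled dot-product formulation and is an assumption here, as are all shapes and the random, untrained weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(H, Wq, Wk, Wv, Wo):
    """Multi-head self-attention over the rows of H.

    H: (n, d) hidden states; Wq, Wk, Wv: (heads, d, d_k); Wo: (heads * d_k, d_out)
    Returns the (n, d_out) output and the (heads, n, n) attention weights.
    """
    d_k = Wq.shape[-1]
    Q = np.einsum("nd,hdk->hnk", H, Wq)   # per-head queries
    K = np.einsum("nd,hdk->hnk", H, Wk)   # per-head keys
    V = np.einsum("nd,hdk->hnk", H, Wv)   # per-head values
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)    # each row sums to 1
    heads = weights @ V                   # (heads, n, d_k)
    # Concatenate the heads along the feature axis, then project.
    concat = heads.transpose(1, 0, 2).reshape(H.shape[0], -1)
    return concat @ Wo, weights

rng = np.random.default_rng(0)
n, d, n_heads, d_k = 5, 8, 2, 4  # hypothetical sizes
H_d = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(scale=0.3, size=(n_heads, d, d_k)) for _ in range(3))
Wo = rng.normal(scale=0.3, size=(n_heads * d_k, d))

output, attn = multi_head_self_attention(H_d, Wq, Wk, Wv, Wo)
```

Returning the attention weights alongside the output is a design choice of this sketch; it makes the weights available for later inspection.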
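After this word-level attention layer, we apply a row-wise average pooling to get a high-level representation vector $R_d$ for the description, where $H_d$ is the hidden state matrix of the description:
$$R_d = \mathrm{row\_avg}\big(\mathrm{MultiHead}_w(H_d)\big)$$
Similarly, we can get $T$ representation vectors from the $T$ tweets using the same word-level attention, which form $R_t = [R_{t_1}, R_{t_2}, \dots, R_{t_T}]$.
Further, a tweet-level multi-head attention layer computes the final tweets representation vector $\bar{R}_t$ as follows:
$$\bar{R}_t = \mathrm{row\_avg}\big(\mathrm{MultiHead}_t(R_t)\big)$$
In practice, we also tried using an additional Bi-LSTM layer to model the sequence of tweets, but we did not observe any significant performance gain.
Given the description representation $R_d$ and the tweets representation $\bar{R}_t$, a field attention generates the final classification feature vector
$$R_f = \mathrm{row\_avg}\big(\mathrm{MultiHead}_f([R_d; \bar{R}_t])\big)$$
where $[R_d; \bar{R}_t]$ means concatenating $R_d$ and $\bar{R}_t$ by row.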
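The three attention levels (word, tweet, field) compose as in the sketch below. For brevity it substitutes a single-head attention for the multi-head layers, so it only illustrates how the levels are stacked and pooled; the function names, shapes, and random weights are all hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend_and_pool(X, params):
    """Single-head self-attention over the rows of X (n, d), then row-wise average -> (d,)."""
    Wq, Wk, Wv = params
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(Wq.shape[1])
    return (softmax(scores) @ (X @ Wv)).mean(axis=0)

def user_feature(H_d, H_t, word_p, tweet_p, field_p):
    """Combine one description and a list of tweets into one feature vector."""
    R_d = attend_and_pool(H_d, word_p)                         # description vector
    # The same word-level attention is applied to every tweet.
    R_t = np.stack([attend_and_pool(H, word_p) for H in H_t])  # one row per tweet
    R_t_bar = attend_and_pool(R_t, tweet_p)                    # tweet-level attention
    # Field attention over the two rows [description; tweets].
    return attend_and_pool(np.stack([R_d, R_t_bar]), field_p)

rng = np.random.default_rng(0)
d = 8  # hypothetical hidden size

def random_params():
    return tuple(rng.normal(scale=0.3, size=(d, d)) for _ in range(3))

word_p, tweet_p, field_p = random_params(), random_params(), random_params()
H_d = rng.normal(size=(12, d))                                        # description: 12 tokens
H_t = [rng.normal(size=(rng.integers(5, 15), d)) for _ in range(20)]  # 20 tweets of varying length
feature = user_feature(H_d, H_t, word_p, tweet_p, field_p)
```

Because each level ends in an average over rows, texts and users with different numbers of tokens or tweets all map to a vector of the same size.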
3.4 Final Classification
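Finally, the probability for each identity is computed by a softmax function over the final classification feature vector $R_f$:
$$P = \mathrm{softmax}(W R_f + b)$$
where $W$ is the projection parameter, $b$ is the bias term, and $C$ is the set of identity classes, so that $P \in \mathbb{R}^{|C|}$. We minimize the cross-entropy loss function to train our model,
$$\mathrm{loss} = -\sum_{c \in C} Y_c \log P_c$$
where $Y_c$ equals 1 if the identity is of class $c$, and 0 otherwise.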
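A small numeric sketch of the classification layer and its loss follows. The class list matches the seven identity classes used later in the paper; the feature size and the zero-initialized weights are placeholders chosen so the result is easy to verify by hand.

```python
import numpy as np

CLASSES = ["news media", "news reporter", "government official",
           "celebrity", "company", "sport", "normal people"]

def class_probabilities(feature, W, b):
    """Softmax over the linear projection of the feature vector."""
    logits = W @ feature + b
    logits = logits - logits.max()   # numerical stability; does not change the result
    e = np.exp(logits)
    return e / e.sum()

def cross_entropy(probs, true_index):
    """With a one-hot label, the sum over classes reduces to a single term."""
    return -np.log(probs[true_index])

feature = np.ones(4)                  # hypothetical 4-d feature vector
W = np.zeros((len(CLASSES), 4))       # untrained weights: every class equally likely
b = np.zeros(len(CLASSES))

probs = class_probabilities(feature, W, b)
loss = cross_entropy(probs, CLASSES.index("government official"))
```

With all-zero weights the probabilities are uniform, so the loss equals the natural logarithm of the number of classes.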
To examine the effectiveness of our method, we collect two datasets from Twitter. The first is a public figure dataset. We use Twitter’s verification as a proxy for public figures. These verified accounts include users in music, government, sports, business, etc.
In addition, we introduce a human-labeled identity dataset for more fine-grained identity classification, which contains seven identity classes: “news media”, “news reporter”, “government official”, “celebrity”, “company”, “sport”, and “normal people”. For each identity, we manually labeled thousands of Twitter users and, in November 2018, collected their 20 most recent tweets for classification. We randomly sampled normal Twitter users from the Twitter sample stream. News media accounts are the official accounts of news websites like BBC. News reporters are mainly news editors or journalists. Government officials represent government offices or politicians. We collected these three types of accounts from the corresponding official websites. For the other three categories (celebrity, company, and sport), we first searched Twitter for each category and then downloaded the most recent tweets of the matching accounts using Twitter’s API. Two workers labeled these users independently, and we include only users on whose label both workers agreed. The inter-rater agreement measure is 0.96. In Table 2, we list several representative Twitter handles for each identity class except for normal users. Table 1 shows a summary of this dataset. We randomly select 500 and 1000 users for development and test respectively. Since normal users are the majority of Twitter users, about half of the users in this dataset are normal users.
| News Media | News Reporter | Celebrity | Government Official | Company | Sport |
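The double-annotation step can be sketched as below. The paper does not say which inter-rater agreement measure it reports; Cohen's kappa is used here only as one common choice, and the user names and labels are made up for the example.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (an assumed measure)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both annotators labeled at random with their own label rates.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

def keep_agreed(users, labels_a, labels_b):
    """Keep only the users on whom both annotators agree."""
    return [(u, a) for u, a, b in zip(users, labels_a, labels_b) if a == b]

users = ["user_1", "user_2", "user_3", "user_4"]          # hypothetical accounts
worker_a = ["celebrity", "company", "sport", "company"]
worker_b = ["celebrity", "company", "sport", "sport"]

kappa = cohen_kappa(worker_a, worker_b)
agreed = keep_agreed(users, worker_a, worker_b)
```

In this toy example the annotators disagree on one of four users, so three labeled users are kept.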
This paper focuses on a content-based approach for identity classification, so we use only the personal description and the text of each tweet for each user.
4.2 Hyperparameter Setting
In our experiments, we initialize the general word embeddings with the released 300-dimensional GloVe vectors.
MNB: Multinomial Naive Bayes classifier with unigrams and bigrams. The term features are weighted by their TF-IDF scores. The additive smoothing parameter is selected via a grid search on the development set of the identity dataset.
SVM: Support Vector Machine classifier with unigrams and a linear kernel. The term features are weighted by their TF-IDF scores. The penalty parameter is set to 100 via a grid search on the development set of the identity dataset.
CNN: Convolutional Neural Network with filter window sizes 3, 4, and 5 and 100 feature maps each. The initial learning rate is lowered during the last third of the training epochs.
Bi-LSTM: Bidirectional LSTM model with 300 hidden states in each direction. The average of the outputs at each step is used for the final classification.
Bi-LSTM-ATT: Bidirectional LSTM model enhanced with self-attention. We use multi-head attention with 6 heads.
fastText: We set the word embedding size to 300, use unigrams, and train for 10 epochs with an initial learning rate of 1.0.
For the methods above, we combine the personal description and tweets into a single document for each user.
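The input preparation shared by the MNB and SVM baselines can be sketched as follows: each user's description and tweets are joined into one document, and unigram and bigram counts are weighted by TF-IDF. The exact TF-IDF variant is not stated in the paper; raw term frequency times log(N / df) is assumed here, and the tokenizer (lowercasing and whitespace splitting) is likewise a simplification.

```python
import math
from collections import Counter

def user_document(description, tweets):
    """Combine the description and the tweets into one document per user."""
    return " ".join([description] + list(tweets))

def ngrams(tokens, max_n=2):
    """All n-grams up to max_n (unigrams and bigrams by default)."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf(documents, max_n=2):
    """Return one {term: weight} dict per document (assumed variant: tf * log(N / df))."""
    term_counts = [Counter(ngrams(doc.lower().split(), max_n)) for doc in documents]
    doc_freq = Counter(term for counts in term_counts for term in counts)
    n_docs = len(documents)
    return [{t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()}
            for counts in term_counts]

docs = [
    user_document("breaking news", ["tonight"]),          # hypothetical user 1
    user_document("breaking bills", ["in congress"]),     # hypothetical user 2
]
weights = tfidf(docs)
```

Under this variant a term that appears in every document, such as "breaking" above, receives weight zero, while terms unique to one user keep a positive weight.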
In Table 3, we show comparison results between our model and baselines. Generally, LSTM based methods work the best among all these baseline approaches. SVM has comparable performance to these neural network based methods on the identity dataset, but falls behind on the larger public figure dataset.
| Model | Public Figure Accuracy | Public Figure F1 | Identity Accuracy | Identity F1 |
| Ablated model: w/o attentions | 93.78 | 92.45 | 87.0 | 83.26 |
Our method outperforms these baselines on both datasets, especially for the more challenging fine-grained identity classification task. Our model can successfully identify public figures with accuracy 94.21% and classify identity with accuracy 89.5%. Compared to a strong baseline Bi-LSTM-ATT, our model achieves a 2.2% increase in accuracy, which shows that our model with structured input has better classification capability.
We further performed ablation studies to analyze the contribution of each model component, where we removed attention modules, character-level word embeddings, tweet texts, and user descriptions one at a time. As shown in Table 3, attention modules make a great contribution to the final classification performance, especially for the more fine-grained task. We present the performance breakdown for each attention module in Table 4. Each level of attention effectively improves the performance of our model. Recognizing important words, tweets, and feature fields at different levels is helpful for learning classification representations. According to Table 3, the character-level convolutional layer is also helpful for capturing some character-level patterns.
We also examined the impact of two different text fields: personal description and tweets. We found that what users tweet about is more important than how they describe themselves. On both datasets, users’ tweets provide more discriminative power than users’ personal descriptions.
| Model (identity dataset) | Accuracy | F1 |
| w/o word attention | 88.8 | 84.41 |
| w/o field attention | 88.5 | 85.24 |
| w/o tweet attention | 88.5 | 84.6 |
| w/o all attention | 87.0 | 83.26 |
4.5 Transfer Learning for Fine-grained Identity Classification
In reality, it is expensive to get a large-scale human-labeled dataset for training a fine-grained identity classifier. However, a well-known drawback of neural network based methods is that they require a lot of data for training. Recently, learning from massive data and transferring the learned knowledge to other tasks has attracted a lot of attention [32, 19]. Since it is relatively easy to get a coarse-grained identity dataset for classifying public figures, we explore how to use this coarse-grained public figure dataset to help train the fine-grained identity classifier.
Specifically, we first pretrain a binary classifier on the public figure dataset and save the model that performs best on its development set. To make a fair comparison, we excluded all users appearing in the identity dataset from the public figure dataset when we built our datasets. Then we initialize the parameters of the fine-grained identity classifier with this pretrained model, except for the final classification layer. After this initialization step, we first train the final classification layer for 3 epochs with learning rate 0.01, and then train our full identity classification model with the same procedure as before. We observe a big performance boost when we apply such pretraining, as shown in Table 3. The classification accuracy for the fine-grained task increases by 2.1% with transfer learning.
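The initialization and two-stage schedule described above can be sketched with plain parameter dictionaries. The parameter names, the `classifier/` prefix, and the array shapes are hypothetical; the sketch only shows which weights are copied and which are trained in each stage, not the training itself.

```python
import numpy as np

def init_from_pretrained(pretrained, target, skip_prefix="classifier/"):
    """Copy every pretrained array into the target model except the final classification layer."""
    initialized = dict(target)
    for name, value in pretrained.items():
        if not name.startswith(skip_prefix):
            initialized[name] = value.copy()
    return initialized

def trainable_names(params, stage, head_prefix="classifier/"):
    """Stage 'head_only': train just the new classification layer; otherwise train everything."""
    if stage == "head_only":
        return [name for name in params if name.startswith(head_prefix)]
    return list(params)

# Hypothetical parameter sets: a binary public figure model and a 7-class identity model.
pretrained = {
    "embedding/table": np.ones((10, 4)),
    "bilstm/kernel": np.full((4, 8), 2.0),
    "classifier/W": np.zeros((2, 8)),
}
target = {
    "embedding/table": np.zeros((10, 4)),
    "bilstm/kernel": np.zeros((4, 8)),
    "classifier/W": np.zeros((7, 8)),
}

params = init_from_pretrained(pretrained, target)
stage_one = trainable_names(params, "head_only")  # first: classification layer only
stage_two = trainable_names(params, "full")       # then: the full model
```

Skipping the classification layer is necessary in any case, since the binary and seven-class output layers have different shapes.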
We further examined the performance of our model with pretraining using various amounts of training data. As shown in Figure 2, with only 20%-30% of the labeled training data, our pretrained model reaches performance comparable to that of the model trained on the full identity dataset without pretraining. Using only 20% of the training data, we get accuracy 0.888 and F1 0.839. If we increase the data size to 30% of the training data, the accuracy and F1 increase to 0.905 and 0.863 respectively. Such pretraining greatly improves fine-grained identity classification, especially when labeled training data is scarce.
4.6 Case Study
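In this section, we present a case study from the test set of the identity dataset to show the effectiveness of our model. Because multi-head attention weights are difficult to visualize and interpret, we instead average the attention weights over the multiple heads, which gives an approximation of the importance of each word in a text. Taking the user description as an example, the approximate importance weight of each word in the description is given by
$$\alpha_d = \mathrm{row\_avg}\left(\frac{1}{h}\sum_{i=1}^{h}\mathrm{softmax}\left(\frac{H_d W_i^Q \, (H_d W_i^K)^\top}{\sqrt{d_k}}\right)\right)$$
where $H_d$ is the hidden state matrix of the description, $W_i^Q$ and $W_i^K$ are the query and key projections of head $i$, $h$ is the number of heads, and $d_k$ is the per-head key dimension.
Similarly, we can get the importance weights for tweets as well as for words in tweets.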
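The head-averaging step can be sketched as follows, assuming the per-head attention weights are available as an array of shape (heads, n, n) whose rows sum to 1. Averaging over query positions after averaging over heads is one reading of the approximation described above, not a confirmed detail.

```python
import numpy as np

def word_importance(attn_weights):
    """Approximate per-word importance from multi-head attention weights.

    attn_weights: (heads, n, n); row q of each head is the attention that
    query position q pays to every word.
    Returns an (n,) vector that sums to 1.
    """
    mean_over_heads = attn_weights.mean(axis=0)  # (n, n)
    # Average over query positions: how much attention each word receives overall.
    return mean_over_heads.mean(axis=0)

# Toy example: 3 heads, 4 words, every query attends only to the third word.
weights = np.zeros((3, 4, 4))
weights[:, :, 2] = 1.0
importance = word_importance(weights)
```

The resulting vector can be mapped directly to background color intensities, one per word.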
In Figure 3, we show twenty tweets and a description from a government official user. We use the background color to represent the importance weight of each word. The color depth denotes the degree of importance of a word within its tweet. We plot the tweet-level importance weights as the background color of the tweet index at the beginning of each tweet. As shown in this figure, words like “congressman” and “legislation” in this user’s description are important clues indicating his/her identity. From the tweet-level attention, we know that the 8th and 14th tweets are the most important tweets related to the identity because they include words like “legislation” and “bipartisan”. In contrast, the 5th tweet of this user only contains general words like “car”, which makes it less important than the other tweets.
4.7 Error Analysis
We perform an error analysis to investigate why our model fails in certain cases. Table 5 shows the confusion matrix generated from the prediction results on our identity dataset. As shown in this table, it is relatively hard for our model to distinguish between celebrities and regular users. We further looked at such errors made with high confidence and found that some celebrities simply have not posted any indicative words in their tweets or descriptions. For example, one celebrity account only uses “A Virgo” in the description without any other words, which leads to this account being predicted as a regular user. Including other features like the number of followers or network connections may overcome this issue, and we leave it for future work. Another common error occurs when dealing with non-English tweets. Even when enhanced with knowledge transferred from the large-scale public figure dataset, our model still cannot handle some rare languages in the data.
5 Discussion & Conclusion
As previously discussed, identities can vary in granularity. We examined two levels: coarse-grained (verified or not) and more fine-grained (news media, government officials, etc.). However, there could be more levels, and considering only two limits our understanding of the activities of online actors with those identities. A hierarchical approach to identity classification might be worth further research. Future research should take this into consideration and learn users’ identities in a more flexible way. In addition, because of the nature of social media, the content on Twitter evolves rapidly. In order to deploy our method in real time, we need to consider an online learning procedure that adapts our model to new data patterns. Since our method is purely content-based, potential improvements could be made using additional information like the number of users’ followers, users’ network connections, and even their profile images. We leave this as future work.
In the real world, people often have multiple identities, e.g., Serbian, Entrepreneur, Policewoman, Woman, Mother. The question is what the relation is between identities, users, and user accounts. Herein, we treat each account as a different user. However, in social media, some people use different accounts and/or different social media platforms for different identities, e.g., Facebook for Mother, Twitter for Entrepreneur, and a separate Twitter handle for an official policewoman account. In this paper, we made no effort to determine whether an individual had multiple accounts. Thus, the same user may get multiple classifications if that user has multiple accounts. Future work should explore how to link multiple identities to the same user. To this point, when there is either a hierarchy of identities or orthogonal identity categories, using identities at different levels of granularity, as we did herein, enables multiple identities to be assigned to the same account and so to the same user.
In conclusion, we introduce two datasets for online user identity classification. One is automatically extracted from Twitter; the other is a manually labeled dataset. We present a novel content-based method for classifying social media users into a set of identities (social roles) on Twitter. Our experiments on the two datasets show that our model significantly outperforms multiple baseline approaches. Using one personal description and up to twenty tweets for each user, we can identify public figures with accuracy 94.21% and classify more fine-grained identities with accuracy 89.5%. We proposed and tested a transfer learning scheme that further boosts the final identity classification accuracy by a large margin. Though the focus of this paper is learning users’ social identities, it is possible to extend this work to predict other demographics like gender and age.
References
- [1] (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
- [2] (1985) Organizational identity. Research in organizational behavior.
- [3] (1989) Social identity theory and the organization. Academy of management review 14 (1), pp. 20–39.
- [4] (2018) Beaten up on twitter? exploring fake news and satirical responses during the black panther movie event. In International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation, pp. 97–103.
- [5] (2019) Bot-ivistm: assessing information manipulation in social media using network analytics. In Emerging Research Challenges and Opportunities in Computational Social Network Analysis and Mining, pp. 19–42.
- [6] (2018) Mining online communities to inform strategic messaging: practical methods to identify community-level insights. Computational and Mathematical Organization Theory 24 (2), pp. 224–242.
- [7] (2018) Bot conversations are different: leveraging network metrics for bot detection in twitter. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 825–832.
- [8] (2016) Social bots distort the 2016 us presidential election online discussion. First Monday 21 (11-7).
- [9] (2011) Discriminating gender on twitter. In Proceedings of the conference on empirical methods in natural language processing, pp. 1301–1309.
- [10] (1994) Social theory and the politics of identity. Wiley-Blackwell.
- [11] (1985) Role-identity salience. Social psychology quarterly, pp. 203–215.
- [12] (2018) Social cyber-security. In International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation, pp. 389–394.
- [13] (2019) Pretending positive, pushing false: comparing captain marvel misinformation campaigns. Fake News, Disinformation, and Misinformation in Social Media-Emerging Research Challenges and Opportunities.
- [14] (2010) Measuring user influence in twitter: the million follower fallacy. ICWSM 10 (10-17), pp. 30.
- [15] (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
- [16] (2010) Who is tweeting on twitter: human, bot, or cyborg?. In Proceedings of the 26th annual computer security applications conference, pp. 21–30.
- [17] (2012) Detecting automation of twitter accounts: are you a human, bot, or cyborg?. IEEE Transactions on Dependable and Secure Computing 9 (6), pp. 811–824.
- [18] (2014) Echo chamber or public sphere? predicting political orientation and measuring political homophily in twitter using big data. Journal of Communication 64 (2), pp. 317–332.
- [19] (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- [20] (2010) Self, identity, and social institutions. Springer.
- [21] (2014) Finding users we trust: scaling up verified twitter users using their communication patterns. In ICWSM.
- [22] (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780.
- [23] (2016) What the language you tweet says about your occupation. In Tenth International AAAI Conference on Web and Social Media.
- [24] (2017) On predicting geolocation of tweets using convolutional neural networks. In International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation, pp. 281–291.
- [25] (2019) A hierarchical location prediction neural network for twitter user geolocation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4731–4741.
- [26] (2014) Social identity. Routledge.
- [27] (2016) Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
- [28] (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
- [29] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [30] (2019) A decadal survey of the social and behavioral sciences: a research agenda for advancing intelligence analysis. The National Academies Press, Washington, DC.
- [31] (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543.
- [32] (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
- [33] (2015) An analysis of the user occupational class through twitter content. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1, pp. 1754–1764.
- [34] (2016) # whoami in 160 characters? classifying social identities based on twitter profile descriptions. In Proceedings of the First Workshop on NLP and Computational Social Science, pp. 55–65.
- [35] (2015) Overview of the 3rd author profiling task at pan 2015. In CLEF 2015 Evaluation Labs and Workshop Working Notes Papers, pp. 1–8.
- [36] (2011) Identity, non-identity, and near-identity: addressing the complexity of coreference. Lingua 121 (6), pp. 1138–1152.
- [37] (2007) The cyberself: the self-ing project goes online, symbolic interaction in the digital age. New Media & Society 9 (1), pp. 93–110.
- [38] (2007) The strength of weak identities: social structural sources of self, situation and emotional experience. Social Psychology Quarterly 70 (2), pp. 106–124.
- [39] (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
- [40] (2000) The past, present, and future of an identity theory. Social psychology quarterly, pp. 284–297.
- [41] (1974) Social identity and intergroup behaviour. Information (International Social Science Council) 13 (2), pp. 65–93.
- [42] (1982) Social identity and intergroup relations. Cambridge University Press.
- [43] (2019) Characterizing bot networks on twitter: an empirical analysis of contentious issues in the asia-pacific. In International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation, pp. 153–162.
- [44] (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- [45] (2017) RATE: overcoming noise and sparsity of textual features in real-time location estimation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 2423–2426.