DeepStance at SemEval-2016 Task 6: Detecting Stance in Tweets Using Character and Word-Level CNNs
This paper describes our approach for the Detecting Stance in Tweets task (SemEval-2016 Task 6). We utilized recent advances in short text categorization using deep learning to create word-level and character-level models. The choice between word-level and character-level models in each particular case was informed through validation performance. Our final system is a combination of classifiers using word-level or character-level models. We also employed novel data augmentation techniques to expand and diversify our training dataset, thus making our system more robust. Our system achieved a macro-average precision, recall and F1-scores of , and respectively.
Stance detection is the task of automatically determining whether the authors of a text are against or in favour of a given target. For instance, take the following sentence: ”It has been such a cold April, so much for global warming.” This sentence’s author is most likely against the concept of global warming (i.e., does not believe in it). The work presented here is specifically targeted towards detecting stance in tweets. The noisy and idiosyncratic nature of tweets make this a particularly hard task.
Automatic identification of stance in tweets has practical applications for a range of domains. For instance, it can be used as a sensor to measure the attitude of Twitter users on various issues, such as: political issues, candidates, brand names, TV shows, etc.
There has been extensive research done on modelling and automatic detection of stance in political arenas (e.g., debates) [\citenameThomas et al.2006] and on online forums [\citenameSomasundaran and Wiebe2009, \citenameMurakami and Raymond2010]. However, as we alluded to earlier, the peculiar nature of tweets make techniques that have been developed for other platforms unsuitable. The field closest to this work is the field of Twitter sentiment classification, where the task is to detect the sentiment of a given tweet, usually as positive, negative, or neutral. Nonetheless, it is important to note that there are substantial differences between sentiment classification and stance detection. Sentiment classifiers determine the polarity of a given tweet, without considering any targets (see Vosoughi et al. [\citenameVosoughi et al.2015] for an example of a Twitter sentiment classifier). For instance, consider the tweet: ”I love Donald Trump”, this tweet has a positive sentiment, and the author of the tweet has a positive stance towards Donald Trump, but it can also be inferred that the author is most likely against or at best neutral towards Bernie Sanders. In this paper, we present a system for automatic detection of stance in Tweets.
2 Our Approach
We trained a different model for each of the five targets. Models for some of the targets used character-level convolutional neural networks(CNN), while other used word-level models. In one particular target (Hillary Clinton), a combination of character-level and word-level models was used. Though Character-level models are robust to the idiosyncratic and noisy nature of tweets, they require a larger dataset compared to word-level models. Our choice between the models was informed by validation performance (as explained in section 5). The character and word-level models are explained in the section below.
2.1 Character-Level CNN Tweet Model
Character-level CNN (CharCNN) is a slight variant of the deep character level convolutional neural network introduced by Zhang et al [\citenameZhang and LeCun2015], based on the success of CNNs in image recognition tasks [\citenameGirshick et al.2014] [\citenameHinton et al.2012]. In this model, we perform temporal (one-dimensional) convolutional and max-pooling operations. Given a discrete input function , a discrete kernel function and stride , the convolution between and and pooling operation of is calculated as:
where is an offset constant. In our implementation of the model, the stride is set to 1.
This model is illustrated in Figure 1. We adapted this model for the size limit of tweets (140 characters). The character set includes English alphabets, numbers, special characters and unknown character. There are 70 characters in total, given below:
Each character in the tweet can be encoded using one-hot vector . Hence, a tweet is represented as a binary matrix with padding wherever necessary, where 150 is the maximum number of characters in a tweet plus padding and 70 is the size of the character set.
Each tweet, in the form of a matrix, is now fed into a deep model consisting of four 1-d convolutional layers. A convolution operation employs a filter , to extract l-gram character feature from a sliding window of characters at the first layer and learns abstract textual features in the subsequent layers. This filter is applied across all possible windows of size to produce a feature map. A sufficient number () of such filters are used to model the rich structures in the composition of characters. Generally, with tweet s, each element of a feature map at the layer is generated by:
where is the filter associated with feature map F at layer ; denotes the segment of output of layer for convolution at location (where — one-hot vectors of characters from tweet s); is the bias associated with that filter at layer ; is a rectified linear unit and is element-wise multiplication. The output of the convolutional layer is a matrix, the columns of which are feature maps .
The output of the convolutional layer is followed by a 1-d max-overtime pooling operation [\citenameCollobert et al.2011] over the feature map and selects the maximum value as the prominent feature from the current filter. Pooling size may vary at each layer (given by at layer ). The pooling operation shrinks the size of the feature representation and filters out trivial features like unnecessary combination of characters (in the initial layer). The window length , number of filters , pooling size at each layer can vary for each classification task.
The output from the last convolutional layer is flattened and passed into a series of fully connected layers. The output of the final fully connected layer (sigmoid or softmax) gives a probability distribution over categories in our classification task. For regularization we apply a dropout [\citenameHinton et al.2012] mechanism after the first fully connected layer. This prevents co-adaptation of hidden units by randomly setting a proportion of the hidden units to zero (Generally, we set ). CharCNN can be robust to misspellings and noise, provided there is sufficiently large dataset to train the model.
2.2 Convolutional Word-Embedding Model
The convolutional embedding model (see Figure 2) assigns a dimensional vector to each of the words of an input tweet resulting in a matrix of size . Each of these vectors are initialized with uniformly distributed random numbers i.e. . The model, though randomly initialized, will eventually learn a look-up matrix where is the vocabulary size, which represents the word embedding for the words in the vocabulary.
A convolution layer is then applied to the input tweet matrix, which takes into consideration all the successive windows of size , sliding over the entire tweet. A filter operates on the tweet to give a feature map . We apply a max-pooling function [\citenameCollobert et al.2011] of size shrinking the size of the resultant matrix by . In this model, we do not have several hierarchical convolutional layers - instead we apply convolution and max-pooling operations with filters on the input tweet matrix for different window sizes ().
The vector representations derived from various window sizes can be interpreted as prominent n-gram word features for the tweets. These features are concatenated to give a vector of size , where L is the number of different values which is further compressed to a size before passing it to a fully connected softmax or a sigmoid layer whose output is the probability distribution over different categories of our classification task.
3 Model Training
We trained the CharCNN model and the Word-Embedding convolutional model for different targets and selected the best model for each of them. In our task, the tweets are classified into three categories: Favor, Against, and None. We defined the ground truth vector as a one-hot vector. The commonly used hyperparameters for the convolutional layers of our CharCNN are: , (first two layers) and (other 3 layers). The sizes of the fully connected layers in our CharCNN model are 1,024 and 512.
Similarly, the commonly used hyperparameters of the Convolutional Word-Embedding model are: , , , . Softmax layer takes the output from the penultimate layers of the corresponding models, thereby generating a distribution over the three classes in our task. The class with the maximum probability is the label for the given input tweet.
To learn the parameters of the model we minimize the cross-entropy loss as the training objective using the Adam Optimization algorithm [\citenameKingma and Ba2014]. It is given by
where p is the true distribution (1-of-C representation of ground truth) and q is the output of the softmax. This, in turn, corresponds to computing the negative log-probability of the true class. Each of the classifiers were trained for approximately 8-10 epochs.
In order to deal with the imbalance in the data, we used a simple balancing technique: to choose a sample on each training step, we randomly picked a class and then randomly selected a tweet associated with this class.
4 Training Set Expansion
We expanded the training set by collecting additional tweets for each target-stance pair from the Twitter historical archives. To form a query for the historical API, we automatically selected 40 representative hashtags for each target-stance pair and manually filtered the resulting hashtags lists. The total amount of additional tweets was 1.7 million. Since number of collected tweets vastly exceeded the size of the official dataset, we decided to abstain from using the latter for training and instead use it for validation purposes. For some targets (mentioned in the section 5.2), we augmented the collected set with tweets obtained by replacing some words and phrases with similar ones, using Word2Vec.
4.1 Identifying Representative Hashtags
We found hashtags well-suited for forming a data expansion query. Hashtags are commonly used to represent a “topic” or “theme” of a tweet and thus often convey information of both the target and the stance (e.g. #stophillary2016).
We measured the strength of association between a hashtag and a particular target-stance pair by computing mutual information between them. More precisely, we defined two indicator variables for hashtag occurrences in tweets:
Whether the current hashtag is equal to the hashtag of interest.
Whether the tweet has the target and the stance of interest.
The mutual information between two random variables is computed as:
We estimate mutual information between our indicator variables using a Bayesian approach. We find the expected value of mutual information, assuming an uninformed Dirichlet prior on the joint distribution of the two variables. It can be approximately computed using the formula provided in [\citenameHutter2002]:
Where is the count of samples with indicator variables assuming values and respectively, corrected by a pseudo-count of 0.5; and .
To get a more reliable estimation of hashtag frequencies for tweets unrelated to the targets, we collected a “background” sample of 1.2 million English-language tweets. We treated these tweets as having no stance in relation to any of the targets and used them in computation of the counts above.
For each target-stance pair, we selected 40 hashtags with highest mutual information for further manual filtering. Samples of selected hashtags can be seen in Table 1. The manual filtering step was necessary, since the statistical association with a target-stance pair could only serve as a proxy for the fact that the tag explicitly expresses the target and the stance. For example, #tcot (standing for “top conservatives on Twitter”) was highly associated with the stance “AGAINST” for the target “Climate Change is a Real Concern”, but not explicitly expressing this stance. Although we did not make the identification of representative hashtags completely automatic, we found that hashtag filtering is a very manageable task for the annotator, taking only an hour of time for all five targets, making it an ideal place to introduce minimal human input.
4.2 Collecting and Preprocessing Tweets
As can be seen in Table 2, the collected tweets are very unevenly distributed between target-stance pairs. There are two causes for this: the uneven distribution of different stances on Twitter in general, and uneven number of representative hashtags that we were able to associate with each target-stance pair. For instance, the pair “Climate Change is a Global Concern: AGAINST” was represented by only 15 tweets in the training data that was provided, limiting us to only two representative hashtags. Since our deep learning models require balanced amount of samples, we used the balancing technique described in the previous section.
To eliminate the possibility that resulting classifiers would only learn the hashtags in the query, we removed these hashtags from the majority of the collected tweets, keeping them only in 25% of the tweets.
4.3 Augmenting Data Using Word2Vec
Data augmentation techniques are widely used to enhance generalization of models with respect to input transformations that are known to not affect the output significantly. An example application of data augmentation in NLP can be found in [\citenameZhang and LeCun2015], where they used thesaurus-based synonym replacement (WordNet [\citenameFellbaum1998]) to generate additional training samples. We applied the technique used by Zhang et al [\citenameZhang and LeCun2015] to our task, with the difference that we used Word2Vec [\citenameMikolov et al.2013] instead of a thesaurus to find similar words. The underlying intuition was that Word2Vec can provide better coverage for phrases related to our targets.
The algorithm of the data augmentation is as follows. At every step, we randomly selected a tweet from the non-augmented training set. We sampled a number of words/phrases we would like to replace from a geometric distribution with parameter . We then randomly sampled words/phrases, that are part of the Word2Vec vocabulary from the current tweet. (if was larger than number of available words/phrases , we used .) For each of these words/phrases, we retrieved a list of most similar ones in terms of cosine similarity of Word2Vec vectors. We ordered the list in decreasing order of similarity and truncated it to not include items with similarity less than threshold . We then sampled index of selected replacement from another geometric distribution with parameter (again, we used modulo if was too big). The original words/phrases were then replaced, and the tweet was added to the augmented dataset. The particular values of , and were 0.5, 0.5 and 0.25 respectively. Using this method, we generated 500,000 extra tweets for each target-stance pair.
To have a better sense of our approach’s performance, we compared results against a simple baseline. We built a set of Naive Bayes classifiers using bag-of-word features and optimized their parameters using 20-fold cross validation on original training data. We experimented with different thresholds on word count for a word to be included into vocabulary. We also set up separate thresholds for hashtags and at-mentions. After selecting the most promising values of thresholds, priors and the smoothing parameter, we ran the Naive Bayes classifiers on the test data to obtain results shown in Table 3.
5.2 Validation Results
We trained the models using the collected dataset and validated them on the training set provided for the task. The validation results informed the choice between the word-level and the character-level classifiers for each target. Without Word2Vec augmentation, character-level classifier achieved the best performance only for the target “Feminist Movement”. When Word2Vec augmentation was introduced, the character-level model achieved the best performance for the target “Climate Change” and the stance ”FAVOR” of the target ”Hillary Clinton”. The word-level model performed better for the targets: “Legalization of Abortion”, “Atheism” and the stance ”AGAINST” of the target “Hillary Clinton”. We were able to achieve better average performance for the target ”Hillary Clinton” by combining character-level and word-level classifiers with a simple heuristic: whenever character-level model predicts “AGAINST”, use that decision, otherwise resort to the decision of word-level model.
Table 4 compares the performance of the character-level and word-level classifiers for the targets where character-level classifiers yielded an advantage. The macro-average F1 validation score was 0.65.
5.3 SemEval Competition Results
Our model was able to achieve a Macro F-score of 0.6354 (placing us eighth out of 19 teams), while the best performing model had a Macro F-score of 0.6782. Table 5 details results on the test data for each target and stance.
|H. Cl (9)||F|
|Cl. Ch. (12)||F|
6 Discussion and Future Work
An interesting result of our work was that given enough data, character-level models outperformed word-level models for tweet classification (with a dramatic improvement in case of ”Hillary Clinton: FAVOR”). Due to the lack of data, it was necessary to resort to a data augmentation technique to generate sufficient amount and diversity of data for character-level model to show its advantage. Another interesting finding from our work is the suitability of word2vec-based substitution as a data augmentation technique. As far as we know, word2vec has not previously been used for data augmentation in this manner.
As can be seen in Table 5, our system did not perform too well for certain target-stance pairs (e.g., Atheism-Against). We hypothesize that the reason for this is the noise and the limited size of the collected training data. Thus, we believe that the performance of the system can be improved through better data expansion and cleaning techniques.
We see several avenues for future improvements. First, it might be beneficial to use unsupervised pre-training for our models (e.g., using autoencoders for Twitter [\citenameVosoughi et al.2016]). Second, data cleaning can potentially be improved using bootstrapping. This would entail using our current models (optimized for high precision) to gather cleaner data for the second tier of models. It could be repeated while validation performance improves. Finally, because of the constraints of this SemEval task, we did not manually select hashtags or terms commonly associated with target-stance pairs. Inclusion of such hashtags can potentially boost the quality of the dataset, leading to better performance of our models.
- [\citenameCollobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.
- [\citenameFellbaum1998] Christiane Fellbaum. 1998. WordNet. Wiley Online Library.
- [\citenameGirshick et al.2014] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587.
- [\citenameHinton et al.2012] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
- [\citenameHutter2002] Marcus Hutter. 2002. Distribution of mutual information. Advances in neural information processing systems, 1:399–406.
- [\citenameKingma and Ba2014] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [\citenameMikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
- [\citenameMurakami and Raymond2010] Akiko Murakami and Rudy Raymond. 2010. Support or oppose?: classifying positions in online debates from reply activities and opinion expressions. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 869–875. Association for Computational Linguistics.
- [\citenameSomasundaran and Wiebe2009] Swapna Somasundaran and Janyce Wiebe. 2009. Recognizing stances in online debates. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1-Volume 1, pages 226–234. Association for Computational Linguistics.
- [\citenameThomas et al.2006] Matt Thomas, Bo Pang, and Lillian Lee. 2006. Get out the vote: Determining support or opposition from congressional floor-debate transcripts. In Proceedings of the 2006 conference on empirical methods in natural language processing, pages 327–335. Association for Computational Linguistics.
- [\citenameVosoughi et al.2015] Soroush Vosoughi, Helen Zhou, and Deb Roy. 2015. Enhanced twitter sentiment classification using contextual information. In 6TH Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA 2015), page 16.
- [\citenameVosoughi et al.2016] Soroush Vosoughi, Prashanth Vijayaraghavan, and Deb Roy. 2016. Tweet2vec: Learning tweet embeddings using character-level cnn-lstm encoder-decoder. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Informaion Retrieval. ACM.
- [\citenameZhang and LeCun2015] Xiang Zhang and Yann LeCun. 2015. Text understanding from scratch. arXiv preprint arXiv:1502.01710.