Neural Attention Models for Sequence Classification:
Analysis and Application to
Key Term Extraction and Dialogue Act Detection
Recurrent neural network architectures combined with an attention mechanism, or neural attention models, have recently shown promising performance on tasks including speech recognition, image caption generation, visual question answering and machine translation. In this paper, a neural attention model is applied to two sequence labeling tasks, dialogue act detection and key term extraction. In these sequence labeling tasks, the model input is a sequence, and the output is the label of the input sequence. The major difficulty of sequence labeling is that when the input sequence is long, it can include many noisy or irrelevant parts. If the information in the whole sequence is treated equally, the noisy or irrelevant parts may degrade the classification performance. The attention mechanism is helpful for sequence classification because it can highlight the important parts of the entire sequence. The experimental results show that with the attention mechanism, discernible improvements were achieved in the sequence labeling tasks considered here. The roles of the attention mechanism in the tasks are further analyzed and visualized in this paper.
Sheng-syun Shen, Hung-yi Lee
Graduate Institute of Communication Engineering
National Taiwan University
Index Terms: attention model, key term extraction, dialogue act detection, long short-term memory (LSTM)
1 Introduction
Recently, the attention mechanism has been incorporated into recurrent neural networks and has shown significant improvement on a great variety of tasks. The attention mechanism was first introduced by Bahdanau et al. [1] for machine translation. They proposed a recurrent neural network (RNN) [2, 3] encoder-decoder model for end-to-end translation, in which the mechanism is designed to attend to the positions of input elements according to the previous output. Inspired by this work, Chorowski et al. [4] then proposed attention-based models for speech recognition, which are claimed to be robust to long inputs. Kelvin Xu et al. [5] and Huijuan Xu et al. [6] also demonstrated how the attention mechanism works while reading a picture. The above works iteratively process their input by selecting relevant content at every step. Attention mechanisms are also useful for tasks other than sequence-to-sequence learning. Memory Neural Networks (MemNN), developed by Weston et al. [7] and Sukhbaatar et al. [8], can deal with question answering (QA) tasks [9, 7, 8], and the attention mechanism plays an important role in the model.
In this paper, a neural attention model is applied to sequence classification tasks. In a sequence classification task, the input of the model is a sequence, and the model output is the class of the sequence. Many common tasks can be formulated as sequence classification, including speaker recognition [10], audio emotion classification [11], spoken term detection (STD) [12, 13, 14], dialogue act detection [15, 16, 17], and key term extraction [18, 19, 20, 21]. One of the major difficulties for sequence classification is that when the input sequence is long, it can include many noisy or irrelevant parts, and without techniques to ignore these parts, they may degrade the classification performance. The attention mechanism shows the potential to automatically ignore the unimportant parts of the input sequence and highlight the important parts [9, 7, 8]. This inspires us to explore the use of the attention mechanism for sequence classification.
In this paper, we present a novel attention-based long short-term memory (LSTM) [22, 23] network architecture for sequence classification, in which the LSTM network reads the entire input, the attention mechanism highlights the important elements, and the sequence classes are predicted from the highlighted parts. This model is first tested on dialogue act detection, in which the model input is the transcription of one to several utterances, and the output is the dialogue act. It is shown that the attention mechanism is especially helpful with longer inputs. We further formulate key term extraction as a sequence classification task and apply the proposed model. This methodology shows promising results on key term extraction. Finally, visualization and analysis are performed to understand how the attention process works.
2 Neural Attention Model for Sequence Classification
The overall structure of the proposed method is shown in Figure 1. The model input is represented as a dense sequence vector, as described in Section 2.1. Given the sequence vector, the attention mechanism is applied to extract related information from the input sequence (Section 2.2). In Section 2.3, the model predicts the target from the selected feature vectors.
2.1 Sequence Representation
We use recurrent neural networks (RNNs) for encoding. RNNs are capable of handling sequence information over time, and they have demonstrated outstanding performance on natural language understanding tasks [24, 25, 26] in recent years. We select long short-term memory (LSTM) networks, a type of recurrent neural network with a more complex computational unit, to process the inputs sequentially. A brief introduction to LSTMs can be found in [22, 23].
In the upper part of Figure 1, we show the encoding procedure that transforms an input sequence into a fixed-length vector representation $V_S$. The set $X = \{x_1, x_2, \ldots, x_T\}$ denotes the input sequence, where $T$ is the sequence length. Each element in $X$ is a fixed-length feature vector. For example, it might be a high-dimensional 1-of-N encoded unigram vector for text classification. To reduce the model complexity, we add an embedding layer, a linear transformation matrix, that turns the inputs into low-dimensional dense vectors $V = \{v_1, v_2, \ldots, v_T\}$, which are then sent to the LSTM encoder. At each time step, the LSTM takes one element from the feature vector set, and after processing the last element, it generates an output vector $V_S$, which can be regarded as a summary of the preceding feature vectors.
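As a small illustration of why an embedding layer can be described as a linear transformation, the sketch below multiplies a 1-of-N vector by a matrix and checks that the result is simply one row of that matrix. The helper names (`one_hot`, `embed`) and the toy matrix are hypothetical and are not taken from the paper's implementation.

```python
def one_hot(index, size):
    """Return a 1-of-N vector with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def embed(one_hot_vec, embedding_matrix):
    """Linear transformation v = x W, with x of length N and W of shape N x d."""
    dim = len(embedding_matrix[0])
    return [
        sum(x * row[j] for x, row in zip(one_hot_vec, embedding_matrix))
        for j in range(dim)
    ]


# Toy embedding matrix: 3 vocabulary entries, 2 dimensions (illustrative values).
W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
dense = embed(one_hot(1, 3), W)  # selects the second row of W
```

In practice the multiplication is usually replaced by a direct row lookup, which gives the same vector without touching the zero entries.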
2.2 Attention Mechanism
When the input sequence is long, the summary vector $V_S$ is likely to contain noisy information from many irrelevant feature vectors. We thus apply an attention mechanism to select only the relevant frames in the entire sequence. The procedure is shown in the lower part of Figure 1. There is also an embedding layer that transforms the input sequence into dense vectors, and all of its parameters are shared with the previous embedding layer. We then calculate the cosine similarity between the sequence vector $V_S$ and each vector $v_t$ in the word embedding set $\{v_1, \ldots, v_T\}$, where $T$ is the sequence length:
$$e_t = \cos(V_S, v_t) \tag{1}$$
where $\cos(\cdot, \cdot)$ denotes the cosine similarity between two vectors. As a result, we have a list of scores $\{e_1, \ldots, e_T\}$. The attention weights $\{\alpha_1, \ldots, \alpha_T\}$ come from the normalized score list. We normalize the scores in two ways, inspired by Chorowski et al. [4]:
Sharpening: The score list is normalized using the softmax activation function:
$$\alpha_t = \frac{\exp(e_t)}{\sum_{i=1}^{T} \exp(e_i)} \tag{2}$$
Smoothing: Sharpening normalization tends to focus on only a single feature vector, which might negatively affect the model's performance. We therefore apply another normalization that lets the model aggregate selections from multiple top-scored frames. In this way, more input locations are considered, bringing more diversity to the model. We replace the exponential function in equation (2) with the logistic sigmoid function $\sigma$:
$$\alpha_t = \frac{\sigma(e_t)}{\sum_{i=1}^{T} \sigma(e_i)} \tag{3}$$
Visualization and analysis of both normalization functions are provided in the experiment section.
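The scoring and the two normalizations described in this section can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and it assumes the sequence vector and the embedded inputs have the same dimensionality so that cosine similarity is defined.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def attention_scores(sequence_vector, embeddings):
    """One cosine score per input position."""
    return [cosine_similarity(sequence_vector, v) for v in embeddings]


def sharpening(scores):
    """Normalize the scores with exponentials (softmax)."""
    exps = [math.exp(e) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]


def smoothing(scores):
    """Normalize the scores with the logistic sigmoid in place of the exponential."""
    sig = [1.0 / (1.0 + math.exp(-e)) for e in scores]
    total = sum(sig)
    return [x / total for x in sig]
```

Because cosine scores lie in [-1, 1], the sigmoid values stay within a narrow band, so the smoothing weights are spread more evenly over the positions than the sharpening weights for the same scores.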
2.3 Target Selection
The right part of Figure 1 illustrates the target selection procedure. We compute the sum of all the feature vectors weighted by the attention weights and send it to a fully connected layer. Usually, the neurons in this layer are activated by nonlinear functions. The last layer is for target prediction, and its dimension is set to the number of candidate targets.
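A minimal sketch of this target selection step is given below. The tanh activation and the softmax output layer are assumptions made for illustration (the text only says that nonlinear activations are used), and the function and parameter names are hypothetical.

```python
import math


def weighted_sum(weights, vectors):
    """Attention-weighted sum of equal-length feature vectors."""
    dim = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]


def dense(x, weight_rows, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [
        sum(xi * wi for xi, wi in zip(x, row)) + b
        for row, b in zip(weight_rows, biases)
    ]


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, vectors, hidden_w, hidden_b, out_w, out_b):
    """Weighted sum -> hidden layer (tanh, assumed) -> one score per candidate target."""
    context = weighted_sum(weights, vectors)
    hidden = [math.tanh(h) for h in dense(context, hidden_w, hidden_b)]
    return softmax(dense(hidden, out_w, out_b))
```

The length of the returned list equals the number of rows in `out_w`, which plays the role of the number of candidate targets.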
3 Experiments
We conducted experiments on two sequence classification tasks in this section. In Section 3.1, we define dialogue act detection and present the experimental results. In Section 3.2, we describe how to apply the proposed methodology to key term extraction. The role of the attention mechanism during classification is discussed in Section 3.3, where we also show the visualization results.
3.1 Dialogue Act Detection
Dialogue act (DA) detection [15, 16, 17] is about categorizing the intention behind a speaker's move in a conversation, and recognizing a speaker's act may help in reasoning about the entire dialogue. This prediction task is still challenging because there are many distinct ways of formulating an intention. In this work, each DA is labeled with one of a number of tags. For example, the tag <OFFER> covers the situation in which someone directs their partner to carry out actions, e.g., “You need to give me your ideas, and then I need to see whether that would sell in the market place.”
3.1.1 Experimental setup
We conducted experiments on the Switchboard Dialog Act (SwDA) Corpus [27], which is a corpus of telephone conversations on selected topics. It consists of about 2,500 conversations by 500 speakers from the U.S. The conversations in the corpus are labeled with 43 unique dialogue act tags and split into 1,115 training and 19 test conversations. The training and test sets contain 213,543 and 4,514 utterances respectively, with an average length of about 8 words.
3.1.2 Baselines
We compared the proposed model with the following baselines.
Support Vector Machines: SVMs are among the most commonly adopted methods for text classification. Silva et al. [28] chose sentence unigrams as the input feature vector and trained an SVM model. We extracted one-of-N encoded unigram features for every word in the dataset and aggregated them for each training example. To reduce the number of dimensions, we set the minimum word count to 5. The radial basis function (RBF) [29] kernel was also applied.
Multilayer Perceptron: The work introduced by Ries et al. [30] is the first approach that introduced artificial neural networks (ANN) for dialogue act detection. We also extracted unigram features as the model input for our experiments. We trained an MLP model with 3 hidden layers, each with 512 neurons. The activation function was applied on every hidden layer, and we set as the optimizer. The model was trained for 20 epochs.
Long Short-term Memory: To examine the effect of the attention mechanism, we also implemented the original LSTM network. The LSTM model takes one word from the input sequence at each time step. We applied word embedding to the unigram features, so the high-dimensional sparse vectors are transformed into dense vectors. The embedding size was 400, the dimension of the recurrent layers was 128, and the dimension of the fully connected layer before the output was 500. To avoid overfitting, we trained the LSTM network for only 10 epochs.
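The unigram features used by the SVM and MLP baselines can be sketched as follows. This is an illustrative reading, not the authors' preprocessing: it assumes that "minimum word count" means discarding words seen fewer than 5 times in the training data, and that the per-word one-of-N vectors are aggregated by summation (word counts). The function names are hypothetical.

```python
from collections import Counter


def build_vocabulary(tokenized_docs, min_count=5):
    """Map each word seen at least `min_count` times to a feature index."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    kept = sorted(word for word, c in counts.items() if c >= min_count)
    return {word: i for i, word in enumerate(kept)}


def unigram_features(tokens, vocab):
    """Sum the one-of-N vectors of all in-vocabulary words of one example."""
    vec = [0] * len(vocab)
    for word in tokens:
        if word in vocab:
            vec[vocab[word]] += 1
    return vec
```

Words below the threshold are simply ignored at feature extraction time, which keeps the input dimension equal to the vocabulary size.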
3.1.3 Experimental results
We implemented both the sharpening-attend and the smoothing-attend neural attention models in the experiments. The LSTM part of the proposed model is the same as the original LSTM described in the previous subsection, and the hyper-parameters for model training were also the same. As previous work stated [31], context information from previous utterances may help dialogue act prediction. Therefore, we also appended the previous utterances to the utterance being classified, with the number of previous utterances set to 3 in the experiments.
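Building the context-augmented input can be sketched as below. The sketch assumes that the previous utterances are concatenated in chronological order in front of the current one and that fewer are used near the start of a conversation; the function name and arguments are hypothetical.

```python
def with_context(utterances, index, num_previous=3):
    """Concatenate up to `num_previous` preceding utterances with utterance `index`.

    `utterances` is a list of token lists from a single conversation.
    """
    start = max(0, index - num_previous)
    tokens = []
    for utterance in utterances[start:index + 1]:
        tokens.extend(utterance)
    return tokens
```

For the first few utterances of a conversation, the slice simply contains fewer than `num_previous` context utterances.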
The results are reported in Table 1. Rows (a) to (d) are the baseline results, and the results of the proposed approaches are in rows (e) to (h). The LSTM networks already outperformed the other baselines (rows (c) vs (a), (b)) because they handle sequence information better than multilayer perceptrons and support vector machines. Moreover, with context information the LSTM achieves higher accuracy than without it (rows (d) vs (c)).
Considering the case without context information, the proposed approaches improve over all the baselines whether the attention is sharpening or smoothing (rows (e), (f) vs (a), (b), (c)). The neural attention model with sharpening attention is only slightly better than the original LSTM (rows (e) vs (c)), but smoothing attention shows significant improvement (rows (f) vs (c)). This suggests that the prediction in sequence classification cannot rely on the most relevant element alone; the rest of the relevant parts should also be considered. The neural attention model with sharpening attention does not show any improvement after context information is added (rows (g) vs (e)). This is because the sharpening-attend mechanism focuses only on the most relevant part of the input sequence, so adding more candidates is not helpful. On the other hand, with smoothing attention, context information became very helpful (rows (h) vs (f)). This shows that smoothing attention can exploit context information better than sharpening attention.
3.2 Key Term Extraction
The goal of key term extraction [18, 19, 20, 21] is to automatically extract relevant terms from a given document. Key terms may describe the core concept or summary of a document, which can help users understand, organize, and extract important information efficiently from documents. These terms are usually labeled manually by humans according to their cognition and domain knowledge, so automatic key term extraction is not an easy task.
Key term extraction can be regarded as a sequence classification problem. The model input is a document, and the model selects some terms as key terms from a set of candidates. Each term in the candidate set is considered a class, and documents containing the same key term belong to the same class. In our task, some terms may not appear in the document yet still represent its core concepts. These terms are also regarded as key terms here, which makes the task even more difficult. A document can have more than one key term; that is, a document can belong to multiple classes. However, the number of key terms in each testing document is unknown, so we treat this task as a ranking problem. That is, the model assigns a score to each candidate term, and the candidate terms are ranked according to their scores. The goal of the system is to rank the key terms above the non-key terms.
During training, each document with labeled key terms is mapped to a sparse vector, which serves as the probability training target. The dimension of this sparse vector is the number of candidate terms. Most of the values are zero; only the indices corresponding to the labeled terms are assigned the value $1/K$, where $K$ is the number of labeled key terms, so that the vector sums to 1. For example, if we have 1,000 term candidates and a document has 4 labeled key terms, we have a 1,000-dimensional sparse target vector in which only 4 elements are nonzero, each equal to $1/4$.
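The construction of the training target can be sketched as follows. The handling of labeled terms that fall outside the candidate set is an assumption (here they are dropped and the remaining ones are renormalized), and the names `target_vector` and `candidate_index` are hypothetical.

```python
def target_vector(labeled_terms, candidate_index):
    """Build a sparse probability target over the candidate terms.

    `candidate_index` maps each candidate term to its position in the vector.
    Each labeled term found among the candidates receives an equal share.
    """
    present = sorted({t for t in labeled_terms if t in candidate_index})
    vec = [0.0] * len(candidate_index)
    for term in present:
        vec[candidate_index[term]] = 1.0 / len(present)
    return vec
```

With 1,000 candidates and 4 labeled key terms that are all candidates, the result has exactly four entries equal to 0.25 and sums to 1.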
3.2.1 Experimental setup
We collected the data from Stack Overflow (http://stackoverflow.com/), a website that serves as a platform for users to ask and answer questions. When users of Stack Overflow post questions on the forum, they are asked to label 2 to 6 key terms for each post. The dataset we collected includes 290,000 examples in total (250,000 for training and 40,000 for testing), with about 24,000 distinct labeled key terms. Each example contains a post and 2 to 6 key term labels, and the average length of a post is about 120 words. The collected dataset is available for download (http://speech.ee.ntu.edu.tw/~sense/stackoverflow_pack.zip).
In practice, to reduce the training complexity, we only selected the 1,000 most frequent key terms in the training set as candidates. These top 1,000 candidates cover over 76% of the key term labels in the training set, so we can still expect to get reasonable results.
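Selecting the most frequent labels as candidates and measuring how much of the labeling they cover can be sketched as follows. The function name and the toy data in the usage note are hypothetical; only the procedure (take the top-k labels by frequency, then compute the fraction of label occurrences they account for) follows the description above.

```python
from collections import Counter


def select_candidates(label_lists, top_k=1000):
    """Return the `top_k` most frequent labels and the fraction of label
    occurrences in `label_lists` that they cover."""
    counts = Counter(term for labels in label_lists for term in labels)
    candidates = [term for term, _ in counts.most_common(top_k)]
    total = sum(counts.values())
    covered = sum(counts[term] for term in candidates)
    coverage = covered / total if total else 0.0
    return candidates, coverage
```

For example, with the label lists `[["python", "list"], ["python", "java"], ["python", "regex"]]` and `top_k=1`, the single candidate is `"python"` and it covers half of the six label occurrences.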
3.2.2 Baselines
We implemented multilayer perceptrons (MLP) and long short-term memory (LSTM) networks as the baseline models, which have already been described in Section 3.1.2.
Tf-idf Sorting is another baseline we applied. “Tf-idf” is the abbreviation of term frequency-inverse document frequency. It is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. A brief introduction to how Tf-idf Sorting extracts key terms can be found in [32]. We calculated the tf-idf values of a set of candidate key terms over the dataset, and sorted the candidates by their values. We then reported the resulting ranked list for evaluation.
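A tf-idf ranking of candidate terms can be sketched as below. The exact tf-idf variant used in the experiments is not specified here, so the formulas (relative term frequency, and a logarithmic inverse document frequency with add-one smoothing in the denominator) are an assumption chosen as one common formulation; the sketch also assumes single-token terms. All function names are hypothetical.

```python
import math


def tf(term, doc_tokens):
    """Relative frequency of `term` in one tokenized document."""
    if not doc_tokens:
        return 0.0
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus):
    """Log inverse document frequency with add-one smoothing in the denominator."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + containing))


def rank_candidates(doc_tokens, corpus, candidates):
    """Sort candidate terms by tf-idf score for one document, highest first."""
    scored = [(tf(t, doc_tokens) * idf(t, corpus), t) for t in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [term for _, term in scored]
```

A candidate that never occurs in the document has a term frequency of zero, so under this formulation it cannot be ranked above terms that do occur with a positive score; this is one way to see why such a baseline cannot recover key terms that are absent from the text.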
3.2.3 Experimental results
To examine the prediction results, we chose mean average precision (MAP) and P@R as the evaluation metrics. The MAP score for a set of documents is the mean of the average precision scores of the individual documents. P@R is defined as the precision after $R$ elements have been selected by the system, where $R$ is also the total number of judged relevant results for the given input. Precision is defined as the proportion of returned results that truly belong to the ground truth set.
The experimental results are shown in Table 2. Row (a) is the oracle score, given for reference. Since we selected only the 1,000 most frequent key terms in the training set as candidates, 100% performance cannot be achieved. The scores of the baseline approaches are in rows (b) to (d), and rows (e) and (f) give the performance of the proposed neural attention model. The supervised learning baselines outperformed the Tf-idf Sorting baseline (rows (c), (d) vs (b)). This is because without supervised learning the method cannot fit the dataset, and it also cannot predict key terms that do not appear in the document. As in the previous experiment, the LSTM handles sequence information better than the MLP (rows (d) vs (c)), so the LSTM network performs better under both MAP and P@R. The performance of our neural attention model with the sharpening-attend mechanism degraded compared to the original LSTM (rows (e) vs (d)), but the model with smoothing attention outperformed all the other approaches (rows (f) vs (b), (c), (d), (e)). This result indicates that taking more relevant elements into consideration can help in solving sequence classification problems.
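The two metrics can be sketched as follows. Average precision is normalized by the total number of relevant items, so relevant terms that never appear in the ranked list lower the score; this convention is an assumption, though it is consistent with an oracle that cannot reach 100% when some key terms are outside the candidate set. The function names are hypothetical.

```python
def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant item appears,
    divided by the total number of relevant items."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(rankings, relevant_sets):
    """Mean of the per-document average precision scores."""
    scores = [average_precision(r, rel) for r, rel in zip(rankings, relevant_sets)]
    return sum(scores) / len(scores) if scores else 0.0


def precision_at_r(ranked, relevant):
    """Precision of the top R results, where R is the number of relevant items."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for item in ranked[:r] if item in relevant) / r
```

For a ranking `["a", "b", "c", "d"]` with relevant set `{"a", "c"}`, the average precision is (1/1 + 2/3) / 2 and P@R is 1/2, since only one of the top two results is relevant.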
3.3 Visualization and Analysis
Figure 2 visualizes how the attention mechanism works in the sequence classification tasks. The upper row is for dialogue act detection and the lower row is for key term extraction. The darker the color, the higher the weight. We chose only the smoothing-attend mechanism for visualization because of its better performance. According to this figure, the attention weights are capable of reducing the effect of sentence disfluencies and filtering out most of the unimportant elements, such as function words.
4 Conclusions
In this paper, we proposed a neural attention model for sequence classification. In this kind of task, the input of the model is a sequence, and the output is the class of the sequence. The major difficulty is that when the input sequence is long, the noisy or irrelevant parts may degrade the classification performance. The proposed model can reduce this influence because it is able to highlight the important parts of the entire sequence. In the experiments, the neural attention model achieved 72.6% accuracy on the dialogue act detection task and a 50.5% MAP score on the key term extraction task, which are discernible improvements over the other approaches.
5 References
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[2] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
[3] Michael I Jordan. Serial order: A parallel distributed processing approach. Advances in psychology, 121:471–495, 1997.
[4] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pages 577–585, 2015.
[5] Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.
[6] Huijuan Xu and Kate Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. arXiv preprint arXiv:1511.05234, 2015.
[7] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.
[8] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2431–2439, 2015.
[9] Wei-Ning Hsu, Yu Zhang, and James Glass. Recurrent neural network encoder with attention for community question answering, 2016.
[10] Najim Dehak, Reda Dehak, Patrick Kenny, Niko Brummer, Pierre Ouellet, and Pierre Dumouchel. Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification. In INTERSPEECH, 2009.
[11] Bjorn Schuller, Stefan Steidl, and Anton Batliner. The INTERSPEECH 2009 emotion challenge. In INTERSPEECH, 2009.
[12] Hung-Yi Lee and Lin-Shan Lee. Enhanced spoken term detection using support vector machines and weighted pseudo examples. Audio, Speech, and Language Processing, IEEE Transactions on, 21(6):1272–1284, 2013.
[13] I.-F. Chen and C.-H. Lee. A hybrid HMM/DNN approach to keyword spotting of short words. In INTERSPEECH, 2013.
[14] A. Norouzian, A. Jansen, R. Rose, and S. Thomas. Exploiting discriminative point process models for spoken term detection. In INTERSPEECH, 2012.
[15] Max M Louwerse and Scott A Crossley. Dialog act classification using n-gram algorithms. In FLAIRS Conference, pages 758–763, 2006.
[16] Kristy Elizabeth Boyer, Joseph F Grafsgaard, Eun Young Ha, Robert Phillips, and James C Lester. An affect-enriched dialogue act classification model for task-oriented dialogue. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 1190–1199. Association for Computational Linguistics, 2011.
[17] Dinoj Surendran and Gina-Anne Levow. Dialog act tagging with support vector machines and hidden Markov models. In INTERSPEECH, 2006.
[18] Kamal Sarkar, Mita Nasipuri, and Suranjan Ghose. A new approach to keyphrase extraction using neural networks. arXiv preprint arXiv:1004.3274, 2010.
[19] Yun-Nung Chen, Yu Huang, Sheng-Yi Kong, and Lin-Shan Lee. Automatic key term extraction from spoken course lectures using branching entropy and prosodic/semantic features. In Spoken Language Technology Workshop (SLT), 2010 IEEE, pages 265–270. IEEE, 2010.
[20] Hiroshi Nakagawa and Tatsunori Mori. A simple but powerful automatic term extraction method. In COLING-02 on COMPUTERM 2002: second international workshop on computational terminology-Volume 14, pages 1–7. Association for Computational Linguistics, 2002.
[21] Yun-Nung Chen, Wei Yu Wang, and Alexander I Rudnicky. An empirical investigation of sparse log-linear models for improved dialogue act classification. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8317–8321. IEEE, 2013.
[22] Felix A Gers and Jürgen Schmidhuber. LSTM recurrent networks learn simple context-free and context-sensitive languages. Neural Networks, IEEE Transactions on, 12(6):1333–1340, 2001.
[23] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[24] Kaisheng Yao, Baolin Peng, Yu Zhang, Dong Yu, Geoffrey Zweig, and Yangyang Shi. Spoken language understanding using long short-term memory neural networks. In Spoken Language Technology Workshop (SLT), 2014 IEEE, pages 189–194. IEEE, 2014.
[25] Kaisheng Yao, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi, and Dong Yu. Recurrent neural networks for language understanding. In INTERSPEECH, pages 2524–2528, 2013.
[26] Grégoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xiaodong He, Larry Heck, Gokhan Tur, Dong Yu, et al. Using recurrent neural networks for slot filling in spoken language understanding. Audio, Speech, and Language Processing, IEEE/ACM Transactions on, 23(3):530–539, 2015.
[27] Dan Jurafsky, Elizabeth Shriberg, and Debra Biasca. Switchboard SWBD-DAMSL shallow-discourse-function annotation coders manual. Institute of Cognitive Science Technical Report, pages 97–102, 1997.
[28] Joao Silva, Luísa Coheur, Ana Cristina Mendes, and Andreas Wichert. From symbolic to sub-symbolic information in question classification. Artificial Intelligence Review, 35(2):137–154, 2011.
[29] Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, and Chih-Jen Lin. Training and testing low-degree polynomial data mappings via linear SVM. The Journal of Machine Learning Research, 11:1471–1490, 2010.
[30] Klaus Ries. HMM and neural network based speech act detection. In Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on, volume 1, pages 497–500. IEEE, 1999.
[31] Eugénio Ribeiro, Ricardo Ribeiro, and David Martins de Matos. The influence of context on dialogue act recognition. arXiv preprint arXiv:1506.00839, 2015.
[32] http://stevenloria.com/finding-important-words-in-a-document-using-tf-idf/.