A Dual-Attention Hierarchical Recurrent Neural Network for Dialogue Act Classification
Recognising dialogue acts (DA) is important for many natural language processing tasks such as dialogue generation and intention recognition. In this paper, we propose a dual-attention hierarchical recurrent neural network for dialogue act classification. Our model is partially inspired by the observation that conversational utterances are normally associated with both a dialogue act and a topic, where the former captures the social act and the latter describes the subject matter. However, such a dependency between dialogue acts and topics has not been utilised by most existing systems for DA classification. With a novel dual task-specific attention mechanism, our model is able, for utterances, to capture information about both dialogue acts and topics, as well as information about the interactions between them. We evaluate the performance of our model on two publicly available datasets, i.e., Switchboard and DailyDialog. Experimental results show that by modelling topic as an auxiliary task, our model can significantly improve DA classification.
1 Introduction
Dialogue Acts (DA) are semantic labels of utterances, which are crucial to understanding communication: much of a speaker’s intent is expressed, explicitly or implicitly, via social actions (e.g., questions or requests) associated with utterances [Searle 1969]. Recognising DA labels is important for many natural language processing tasks. For instance, in dialogue systems, knowing the DA label of an utterance supports its interpretation as well as the generation of an appropriate response. In the security domain, being able to detect intention in conversational texts can effectively support the recognition of sensitive information exchanged in email conversations within a company, which can be extremely valuable for IT managers or the security department [Verma, Shashidhar, and Hossain 2012].
A wide range of techniques have been investigated for DA classification. Early works on DA classification are mostly based on general machine learning techniques such as Hidden Markov Models (HMM) [Stolcke et al. 2000], dynamic Bayesian networks [Dielmann and Renals 2008], and Support Vector Machines (SVM) [Liu 2006]. Recent studies on the problem of DA classification have seen an increasing uptake of deep learning techniques, where promising results have been obtained. Kalchbrenner and Blunsom [2013] model a DA sequence with a recurrent neural network (RNN), where sentence representations are constructed by means of a convolutional neural network (CNN). Kumar et al. [2017] propose a hierarchical, bidirectional long short-term memory (Bi-LSTM) model with a conditional random field (CRF) for DA classification, achieving an overall accuracy of 79.2% on the SWDA dataset. There is also work exploring different deep learning architectures (e.g., hierarchical CNN or RNN/LSTM) to incorporate context information for DA classification, showing that incorporating context information improves DA classification [Liu et al. 2017].
Most of the deep learning approaches to DA classification utilise the dependencies from data, e.g., the dependency between adjacent utterances [Ji, Haffari, and Eisenstein 2016] as well as the implicit and intrinsic dependencies among DAs [Kumar et al. 2017]. It has been observed that conversational utterances are normally associated with both a dialogue act and a topic, where the former captures the social act (e.g., promising) and the latter describes the subject matter [Wallace et al. 2013]. In addition, the set of DAs associated with a conversation is likely to be affected by the topic of the conversation. For instance, DAs such as request and suggestion might appear more frequently in conversations relating to topics about work. However, such a reasonable source of information, surprisingly, has not been explored in the deep learning literature for DA classification. We hypothesize that modelling the topics of utterances as an auxiliary task may effectively support dialogue act classification.
In this paper, we propose a dual-attention hierarchical recurrent neural network for dialogue act classification. Our model is distinguished from existing methods in a few aspects. First, compared to the flat structure employed by existing models [Khanpour, Guntakandla, and Nielsen 2016; Ji, Haffari, and Eisenstein 2016; Tran, Zukerman, and Haffari 2017b], our hierarchical recurrent neural network can represent the input at the word, utterance, and conversation levels, preserving the natural hierarchical structure of a conversation. Second, our model is able to incorporate rich context information for DA classification with a novel task-specific dual-attention mechanism. Incorporating attention into our model is motivated by the observation that different dialogue acts are semantically related to different words in an utterance [Tran, Zukerman, and Haffari 2017a]. Third, apart from incorporating the commonly used dependencies between utterances, our dual-attention mechanism can further capture, for utterances, information about both dialogue acts and topics. This is a useful source of context information which has not previously been explored in existing deep learning models for DA classification.
We evaluate our model against several strong baselines [Kalchbrenner and Blunsom 2013; Lee and Dernoncourt 2016; Khanpour, Guntakandla, and Nielsen 2016; Ji, Haffari, and Eisenstein 2016; Kumar et al. 2017] on the task of dialogue act classification. Extensive experimentation conducted on two publicly available datasets (namely Switchboard [Jurafsky 1997] and DailyDialog [Li et al. 2017]) shows that by modelling the topic information of utterances as an auxiliary task, our model can significantly improve DA classification, yielding comparable performance to state-of-the-art deep learning methods [Kumar et al. 2017] in classification accuracy.
2 Related Work
Dialogue Act (DA) recognition is a supervised classification task, where each utterance in a conversation is assigned a DA label. Broadly speaking, methods for DA classification can be divided into two categories, i.e., instance-based methods and sequence labelling methods. Instance-based methods treat each utterance as an independent data point and predict the DA label for each utterance separately, e.g., naive Bayes [Grau et al. 2004] and maximum entropy [Ang, Liu, and Shriberg 2005]. In contrast, sequence labelling methods cast DA recognition as a sequence labelling task in which the dependency among consecutive utterances is taken into consideration; example methods include Hidden Markov Models (HMM) [Stolcke et al. 2000] and Conditional Random Fields (CRF) [Kim, Cavedon, and Baldwin 2010].
Recently, deep learning has been widely applied in many natural language processing tasks, including DA classification. Kalchbrenner and Blunsom [2013] proposed to model a DA sequence with a recurrent neural network (RNN) where sentence representations were constructed by means of a convolutional neural network (CNN). Lee and Dernoncourt [2016] tackled DA classification with a model built upon RNNs and CNNs. Specifically, their model can leverage the information of preceding texts, which can effectively help improve the DA classification accuracy. More recently, a latent variable recurrent neural network was developed for jointly modelling sequences of words and discourse relations between adjacent sentences [Ji, Haffari, and Eisenstein 2016]. In their work, the shallow discourse structure is represented as a latent variable and the contextual information from preceding utterances is modelled with an RNN. Kumar et al. [2017] proposed a hierarchical, bidirectional long short-term memory (Bi-LSTM) model with a CRF for DA classification, where the inter-utterance and intra-utterance information is encoded by a hierarchical Bi-LSTM and the dependency between DA labels is captured by a CRF.
In addition to modelling dependency between utterances, various contexts have also been explored for improving DA classification or modelling DA under multi-task learning. For instance, Wallace et al. [2013] proposed a generative joint sequential model to classify both DA and topics of patient-doctor conversations. Their model is similar to the factorial LDA model [Paul and Dredze 2012], which generalises LDA to assign each token a K-dimensional vector of latent variables. The model of Wallace et al. [2013] only assumes that each utterance is generated conditioned on the previous and current topic/DA pairs. In contrast, our model is able to model the dependencies of all utterances of a conversation, and hence can better capture the effect between DAs and topics. Qin, Wang, and Kim [2017] introduced a joint model for identifying salient discussion points in spoken meetings as well as labelling discourse relations. They assumed that the interaction between content and discourse relations might improve the classification performance on both phrase selection and DA classification. A tree-structured discourse was constructed to jointly model the content and discourse relations. Lexical and syntactic features were utilised for the two tasks, such as TF-IDF scores for words and part of speech (POS) tags. Shen and Lee [2016] proposed a neural attention model for DA detection and key term extraction, showing that the attention mechanism is effective for sequence classification.
We are given a training corpus of conversations, where each conversation consists of a sequence of utterances and is annotated with the corresponding sequences of dialogue act labels and topics. Each utterance of a conversation is a sequence of words. Our goal is to learn a model from the training corpus such that, given an unseen conversation, the model can predict the dialogue act labels of its utterances.
Figure 1 gives an overview of the proposed Dual-Attention Hierarchical recurrent neural network (DAH). We adopt a shared utterance encoder for the input, which encodes each word of an utterance into a vector. The dialogue act attention and topic attention mechanisms capture DA and topic information as well as the interactions between them. The outputs of the dual-attention are then encoded in the corresponding conversation-level sequence taggers (i.e., the dialogue act tagger and the topic tagger), based on the corresponding utterance representations and target labels.
3.1 Shared Utterance Encoder
In our model, we adopt a shared utterance encoder to encode the input utterances. Such a design is based on the rationale that the shared encoder can transfer knowledge between the two tasks and reduce the risk of overfitting. Specifically, the shared utterance encoder is implemented using a bidirectional gated recurrent unit (BiGRU) [Cho et al. 2014], which encodes each utterance of a conversation as a series of hidden states, one for each timestamp (i.e., word position) within the utterance. Each hidden state is the concatenation of the corresponding hidden states of the forward gated recurrent unit (GRU) [Cho et al. 2014] and the backward GRU.
The forward GRU computes the hidden state for each word from the word embedding of that word and the hidden state of the preceding word, using the reset, update, and new gates. The sigmoid and tanh functions are applied to each element of their vector arguments as pointwise operations, and the gated quantities are combined by element-wise multiplication. The weights of the gates are parameters that need to be estimated. Finally, the backward GRU encodes the utterance from the reverse direction (i.e., from the last word to the first) and generates its hidden states following the same formulation as the forward GRU.
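To make the encoder description concrete, the following is a minimal pure-Python sketch of one GRU step and of the bidirectional concatenation. It assumes the widely used gate formulation with reset (r), update (z), and new (n) gates; the parameter names `W_*`, `U_*`, `b_*` and the helper functions are hypothetical and chosen here for illustration, not taken from the authors' implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gate_inputs(p, name, x, h_prev):
    """Return (W x, U h_prev, b) for the gate called `name` (assumed parameter layout)."""
    return matvec(p["W_" + name], x), matvec(p["U_" + name], h_prev), p["b_" + name]


def gru_step(x, h_prev, p):
    """One GRU step: word embedding x and previous hidden state -> new hidden state."""
    wx, uh, b = gate_inputs(p, "r", x, h_prev)
    r = [sigmoid(a + c + d) for a, c, d in zip(wx, uh, b)]        # reset gate
    wx, uh, b = gate_inputs(p, "z", x, h_prev)
    z = [sigmoid(a + c + d) for a, c, d in zip(wx, uh, b)]        # update gate
    wx, uh, b = gate_inputs(p, "n", x, h_prev)
    n = [math.tanh(a + ri * c + d) for a, ri, c, d in zip(wx, r, uh, b)]  # new gate
    # Element-wise interpolation between the candidate and the previous state.
    return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h_prev)]


def run_gru(xs, p, hidden):
    """Run a GRU over a sequence of word embeddings, starting from a zero state."""
    h, states = [0.0] * hidden, []
    for x in xs:
        h = gru_step(x, h, p)
        states.append(h)
    return states


def bigru_encode(xs, p_fwd, p_bwd, hidden):
    """Concatenate forward and backward hidden states for every word position."""
    fwd = run_gru(xs, p_fwd, hidden)
    bwd = run_gru(xs[::-1], p_bwd, hidden)[::-1]  # re-align to the original order
    return [f + b for f, b in zip(fwd, bwd)]


def init_params(emb, hidden, rng):
    """Small random parameters for the three gates (illustrative initialisation)."""
    p = {}
    for g in "rzn":
        p["W_" + g] = [[rng.uniform(-0.1, 0.1) for _ in range(emb)] for _ in range(hidden)]
        p["U_" + g] = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(hidden)]
        p["b_" + g] = [0.0] * hidden
    return p
```

With a hidden size of 100 per direction, as in the implementation details, each word would be represented by a 200-dimensional vector under this sketch.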
3.2 Task-specific Attention
Recall that one of the key challenges of our model is to capture, for each utterance, information about both dialogue acts and topics, as well as information about the interactions between them. We address this challenge by incorporating into our model a novel task-specific dual-attention mechanism, which accounts for both DA and topic information extracted from utterances. In addition, DAs and topics are semantically relevant to different words in an utterance. With the proposed attention mechanism, our model can also assign different weights to the words of an utterance by learning the degree of importance of the words to the DA or topic labelling task, i.e., promoting the words which are important to the task and reducing the noise introduced by less important words.
For each utterance, the dialogue act attention calculates a weight vector over the hidden states that the shared encoder produces for the words of the utterance. The utterance can then be represented as a weighted combination of these hidden states.
In contrast to the traditional attention mechanism [Bahdanau, Cho, and Bengio 2014], which only depends on one set of hidden vectors from the Seq2Seq decoder, the dialogue act attention in DAH relies on two sets of hidden vectors, i.e., those of the conversation-level DA tagger and those of the conversation-level topic tagger. The interaction between DAs and topics in each task-specific attention mechanism allows the model to capture, for utterances, information about both DAs and topics.
The topic attention layer has a similar architecture to the dialogue act attention layer and likewise takes both sets of tagger hidden vectors as input; its weight vector is calculated analogously to that of the dialogue act attention.
Note that the weights of both attention layers are parameters that need to be learned during training.
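As an illustration of how an attention layer can depend on the hidden vectors of both conversation-level taggers, the sketch here uses an additive (Bahdanau-style) scoring function followed by a softmax over the words of an utterance. The scoring form and the parameter names `W`, `U_da`, `U_tp`, `b`, and `v` are assumptions made for this example; the exact parameterisation used in DAH may differ.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dual_attention(word_states, g_da, g_tp, W, U_da, U_tp, b, v):
    """Weight the encoder states of one utterance using both tagger states.

    word_states: encoder hidden states, one vector per word
    g_da, g_tp:  hidden vectors of the DA tagger and the topic tagger
    Returns (attention weights, weighted combination of word_states).
    """
    # Contribution of the two conversation-level taggers, shared by all words.
    query = [a + c + d for a, c, d in zip(matvec(U_da, g_da), matvec(U_tp, g_tp), b)]
    scores = []
    for h in word_states:
        hidden = [math.tanh(a + q) for a, q in zip(matvec(W, h), query)]
        scores.append(sum(vi * x for vi, x in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(word_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, word_states)) for k in range(dim)]
    return weights, context
```

Under this sketch, the dialogue act attention and the topic attention would each hold their own copy of the parameters while reading the same encoder states and the same pair of tagger vectors.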
3.3 Conversational Sequence Tagger
Dialogue act sequence tagger. The conversational dialogue act sequence tagger predicts the next DA conditioned on the attention vector of the current utterance and all previously predicted DAs (cf. Figure 1). Formally, this conditional probability is defined over the sequence of all utterances seen so far, for every position up to the length of the conversation. It is computed from the hidden state of the conversational DA tagger for the current utterance and the attention vector of that utterance, using a linear transformation function whose parameters need to be learned during training.
The hidden state of the tagger is calculated by a GRU.
In training, teacher forcing [Williams and Zipser 1989] with a ratio of 0.5 is used for the preceding DA label in order to avoid the accumulation of false predictions.
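A small sketch of teacher forcing may help here. It assumes that the value 0.5 is the probability of feeding the gold label of the preceding utterance to the tagger, with the model's own prediction used otherwise; the function names and the `predict_fn` callback are hypothetical.

```python
import random


def choose_previous_label(gold_prev, predicted_prev, ratio=0.5, rng=random):
    """Return the gold label with probability `ratio`, else the model's prediction."""
    return gold_prev if rng.random() < ratio else predicted_prev


def decode_with_teacher_forcing(gold_labels, predict_fn, ratio=0.5, rng=random, start="<s>"):
    """Tag a conversation left to right, mixing gold and predicted history.

    predict_fn(t, prev_label) stands in for the tagger and returns the
    predicted label for utterance t given the label fed in for utterance t-1.
    """
    prev, outputs = start, []
    for t, gold in enumerate(gold_labels):
        pred = predict_fn(t, prev)
        outputs.append(pred)
        # Decide what the next step will see as its preceding label.
        prev = choose_previous_label(gold, pred, ratio, rng)
    return outputs
```

Setting `ratio=1.0` always feeds gold labels and `ratio=0.0` always feeds predictions, which is the setting that applies at test time when gold labels are unavailable.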
Topic sequence tagger. The conversational topic sequence tagger is designed to predict the topic of the current utterance conditioned on the topic attention vector of that utterance and all previously predicted topics. The formulation mirrors that of the dialogue act tagger: the conditional probability is again defined over the sequence of all utterances seen so far, and it is computed from the hidden state of the conversational topic tagger for the current utterance and the topic attention vector of that utterance, with its own set of model parameters.
All the model parameters of DAH are estimated on the training corpus by minimising an objective function that seeks to jointly optimise the prediction of both dialogue acts and topics.
A hyper-parameter controls the contribution of the conversational topic tagger towards the objective function. In our experiments, this hyper-parameter is determined empirically.
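The joint objective can be sketched as a weighted sum of two losses. This example assumes that each task uses a negative log-likelihood (cross-entropy) loss and that the topic loss is scaled by a weight, here called `alpha`; both the loss choice and the name `alpha` are assumptions for illustration.

```python
import math


def nll(prob_rows, gold_indices):
    """Negative log-likelihood of the gold labels under per-utterance distributions."""
    return -sum(math.log(row[g]) for row, g in zip(prob_rows, gold_indices))


def joint_loss(da_probs, da_gold, topic_probs, topic_gold, alpha):
    """Dialogue act loss plus alpha times the auxiliary topic loss."""
    return nll(da_probs, da_gold) + alpha * nll(topic_probs, topic_gold)
```

With `alpha = 0` the topic tagger contributes nothing to the objective, while larger values give the auxiliary topic task more influence on the shared encoder.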
4 Experimental Settings
Switchboard Dialogue Act Corpus (SWDA).
The SWDA dataset
DailyDialog Corpus (DyDA).
The DyDA dataset
“Inform class contains all statements and questions by which the speaker is providing information”;
“Questions class is labelled when the speaker wants to know something and seeks for some information”;
“Directives class contains dialogue acts like request, instruct, suggest and accept/reject offer”;
“Commissive class is about accept/reject request or suggestion and offer”.
4.2 Implementation Details
For both experimental datasets (SWDA and DyDA), the top 15,000 words with the highest frequency are indexed. For SWDA, the standard split is adopted based on [Stolcke et al. 2000], utilising 1,115 conversations for training and 19 conversations for testing. We select 112 conversations from the training dataset as the validation dataset following [Lee and Dernoncourt 2016]. For DyDA, we also use the standard split from the original dataset, employing 11,118 conversations for training, 1,000 for validation, and 1,000 for testing. The statistics of the two datasets are summarised in Table 1.
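Indexing the most frequent words can be sketched as follows. The special `<pad>` and `<unk>` entries, and the choice to map out-of-vocabulary words to `<unk>`, are assumptions added for this example; the text only states that the 15,000 most frequent words are indexed.

```python
from collections import Counter


def build_vocab(tokenised_utterances, max_size=15000, unk="<unk>", pad="<pad>"):
    """Map the `max_size` most frequent words to integer ids."""
    counts = Counter(tok for utt in tokenised_utterances for tok in utt)
    vocab = {pad: 0, unk: 1}  # reserved ids (assumed convention)
    for word, _ in counts.most_common(max_size):
        vocab[word] = len(vocab)
    return vocab


def encode(utterance, vocab, unk="<unk>"):
    """Replace each token by its id, falling back to the unknown-word id."""
    return [vocab.get(tok, vocab[unk]) for tok in utterance]
```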
The input data is represented with 300-dimensional GloVe word embeddings [Pennington, Socher, and Manning 2014] in order to capture word similarity and accelerate model training. The shared encoder is a BiGRU with two layers, whereas the conversational sequence tagger is a GRU containing a single layer. We set the dimension of the hidden layers to 100 and apply a dropout layer [Srivastava et al. 2014] to both the shared encoder and the sequence tagger at a rate of 0.2. The Adam optimiser [Kingma and Ba 2014] is used for training with an initial learning rate of 0.001 and a weight decay of 0.0001. Each utterance in a mini-batch is padded to the maximum length for that batch, and the maximum batch size allowed is 10.
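The per-batch padding described above can be sketched as follows, assuming utterances have already been converted to lists of word ids and that id 0 is reserved for padding (an assumption of this example). The helper names are hypothetical.

```python
def pad_batch(batch, pad_id=0):
    """Pad every utterance in the batch to the length of the longest one."""
    max_len = max(len(u) for u in batch)
    padded = [u + [pad_id] * (max_len - len(u)) for u in batch]
    lengths = [len(u) for u in batch]  # true lengths, useful for masking
    return padded, lengths


def make_batches(utterances, batch_size=10):
    """Yield padded mini-batches of at most `batch_size` utterances."""
    for i in range(0, len(utterances), batch_size):
        yield pad_batch(utterances[i:i + batch_size])
```

Padding to the batch maximum rather than a global maximum keeps short batches compact, at the cost of batch shapes that vary from step to step.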
5 Experimental Results
5.1 Dialogue Acts Classification
We compare our Dual-Attention Hierarchical RNN model (DAH) against several state-of-the-art models for dialogue act classification. In order to show the effectiveness of DAH, we also report the performance of the Single-Attention Hierarchical RNN model (SAH), i.e., a simplified version of DAH that only models dialogue acts, with topical information omitted.
Results on the SWDA dataset. For the SWDA dataset, we compare our models against the following baselines:
HMM: A Hidden Markov Model for the discourse structure of a conversation [Stolcke et al. 2000];
JAS: A generative joint, additive, sequential model of topics and speech acts in patient-doctor communication [Wallace et al. 2013];
CNN: A CNN containing contextual information [Lee and Dernoncourt 2016];
RCNN: A hierarchical CNN modelling utterances followed by an RNN capturing contextual information [Kalchbrenner and Blunsom 2013];
LSTM-Softmax: A deep bidirectional LSTM to classify dialogue acts via a softmax classifier [Khanpour, Guntakandla, and Nielsen 2016];
DRLM-Cond: A latent variable recurrent neural network for dialogue act classification [Ji, Haffari, and Eisenstein 2016];
Bi-LSTM-CRF: A hierarchical bidirectional LSTM with a CRF as the top layer to classify dialogue acts [Kumar et al. 2017].
Note that while all the aforementioned baselines model the dependency between the dialogue acts of a sequence of utterances, only the JAS model has attempted to model both dialogue acts and topics. All baselines above use the same test dataset as our model.
Table 2 shows the dialogue act classification results of our model and the baselines on the SWDA dataset. Among the baseline models, Bi-LSTM-CRF achieved the best classification performance with 79.2% accuracy. It can also be observed that the deep learning models (e.g., Bi-LSTM-CRF, DRLM-Cond) in general give better performance than the traditional statistical models (i.e., HMM and JAS).
The SAH model, which only models dialogue acts, obtains 74.1% accuracy, which is better than JAS and RCNN. By jointly modelling dialogue acts and topics, the DAH model achieves an overall accuracy of 78.3%, which is a significant performance boost over SAH (i.e., 4.2 percentage points higher; paired t-test). This result shows that the performance of DA classification can be improved significantly by using topic information. When comparing DAH with the baseline models, we can see that DAH achieves comparable performance to the state-of-the-art model Bi-LSTM-CRF (i.e., 78.3% vs. 79.2%). Although Bi-LSTM-CRF outperforms DAH, the architecture of DAH is simpler: Bi-LSTM-CRF employs a bidirectional LSTM in the conversational layer, and its DA classifier is a CRF, which is more complicated than the softmax of DAH.
Results on the DyDA dataset. We also evaluated our models on the DyDA dataset. As for the baselines, we ran and report the results for JAS and DRLM-Cond, as only the source code for these two models is publicly available. Nevertheless, one should note that DRLM-Cond is the second-best performing baseline on the SWDA dataset. We fine-tuned the model parameters for both JAS and DRLM-Cond to make the comparison as fair as possible.
Table 3 (columns): Models, DA Type, P (%), R (%), F1 (%), Acc. (%).
As can be seen from Table 3, DRLM-Cond performs better than JAS and achieves an overall accuracy of 81.1%. Our DAH and SAH models, in contrast, give much better performance: both models outperform DRLM-Cond by more than 3.2% on utterance-level dialogue act classification. As with the SWDA dataset, DAH outperforms the SAH model on DyDA. By examining the classification performance of DAH and SAH on each dialogue act type, we see that both models achieve fairly similar performance on the Inform and Questions classes, but DAH outperforms SAH on Directives and Commissive by more than 4% in F1 scores. This again suggests that conversation-level topic information is helpful for dialogue act recognition.
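For reference, the per-class precision, recall, and F1 scores and the overall accuracy reported for each DA type can be computed from gold and predicted labels as sketched here. The function names are hypothetical, and assigning a score of 0 when a denominator is empty is a convention assumed for this example.

```python
def per_class_prf(gold, pred, labels):
    """Return {label: (precision, recall, f1)} for utterance-level predictions."""
    report = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = (precision, recall, f1)
    return report


def accuracy(gold, pred):
    """Fraction of utterances whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)
```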
To summarise, our DAH model achieves comparable performance to the state of the art for dialogue act classification on the SWDA dataset; it also gives the best classification performance on the DyDA dataset. Experimental results demonstrate that modelling conversational topic information as an auxiliary task does improve the classification of dialogue acts.
5.2 Analysing the Effectiveness of Joint Modelling Dialogue Act and Topic
In this section, we provide detailed analysis on why DAH can yield better performance than SAH by jointly modelling dialogue acts and topics.
Figure 2 shows the normalized confusion matrix derived from 10 DA classes of SWDA for both the DAH and SAH models. It can be observed that DAH yields improvement on recall for many DA classes compared to SAH, e.g., 17.8% improvement on bk and 7% on sv. For bk (Response Acknowledge), which has the highest improvement level, we see that the improvement largely comes from the reduction in misclassifying bk as b (Acknowledge Backchannel). The key difference between bk and b is that an utterance labelled with bk has to be produced within a question-answer context, whereas b is a “continuer” simply representing a response to the speaker. It is not surprising that SAH makes poor predictions on the utterances of these two DAs: they share many syntactic cues, e.g., indicator words such as ‘okay’, ‘oh’, and ‘uh-huh’, which can easily confuse the model. When comparing the topic distribution of the utterances under the bk and b categories (cf. Figure 4), we found that topics relating to personal leisure (e.g., music, exercise and fitness, pets, and gardening) are much more prominent in bk than in b. By leveraging the topic information, DAH can better handle the confusing cases and hence improve the prediction for bk significantly.
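A normalized confusion matrix of the kind discussed here can be computed as sketched in this example. It assumes row normalisation, i.e., each row corresponds to a gold class and sums to 1, so that the diagonal entries are per-class recall; whether Figure 2 uses exactly this normalisation is an assumption.

```python
def normalised_confusion(gold, pred, labels):
    """Row-normalised confusion matrix: rows are gold classes, columns predictions."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for g, p in zip(gold, pred):
        counts[index[g]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A class with no gold instances gets an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix
```

Under this convention, the off-diagonal entry in row bk and column b is the fraction of bk utterances misclassified as b, which is the quantity the comparison between SAH and DAH focuses on.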
There are also cases where DAH performs worse than SAH. Take the DA pair of qo (Open Question) and qw (wh-questions) as an example. qo refers to questions like ‘How about you?’ and its variations (e.g., ‘What do you think?’), whereas qw represents wh-questions, which are much more specific in general (e.g., ‘What other long range goals do you have?’). SAH gives quite decent performance in distinguishing the qw and qo classes. This is somewhat reasonable, as linguistically the utterances of these two classes are quite different, i.e., qw utterances express very specific questions and are relatively lengthy, whereas qo utterances tend to be very brief. We see that DAH performs worse than SAH, with quite a large percentage of qw utterances misclassified as qo. This is likely due to the fact that there is no significant difference between the topic distributions of qw and qo, as shown in Figure 4, and incorporating the topic information into DAH actually makes these two DAs less distinguishable for the model.
We also conducted a similar analysis on the DyDA dataset. As can be seen from the confusion matrices shown in Figure 3, DAH obtains improvement over SAH for all four DA classes of DyDA. In particular, Directives and Commissive achieve a higher improvement margin compared to the other two classes, where the improvement is largely attributed to fewer instances of the Directives and Commissive classes being misclassified as Inform and Questions.
Examining the topic distributions
Finally, we show in Figure 5 a DA attention visualisation example of SAH and DAH for an utterance from SWDA. It can be seen that SAH gives very high weight to the word “because” and de-emphasizes other words. By modelling both DAs and topics with the dual-attention mechanism, DAH can capture more important words for the task (e.g., “reasonable”, “ever”, etc.) and correctly predicts the DA label as sd.
6 Conclusion
In this paper, we developed a dual-attention hierarchical recurrent neural network for dialogue act classification. Compared to the flat structure employed by existing models, our hierarchical model can better preserve the hierarchical structure of natural language conversations. More importantly, with the proposed task-specific dual-attention mechanism, our model is able to capture information about both dialogue acts and topics, as well as information about the interactions between them. Experimental results based on two public benchmark datasets show that modelling conversational topic information as an auxiliary task can effectively improve dialogue act classification, and that our model is able to achieve comparable performance to the state-of-the-art deep learning methods for DA classification.
References
- Ang, J.; Liu, Y.; and Shriberg, E. 2005. Automatic dialog act segmentation and classification in multiparty meetings. In ICASSP, volume 1, I–1061.
- Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Cho, K.; van Merriënboer, B.; Bahdanau, D.; and Bengio, Y. 2014. On the properties of neural machine translation: Encoder–decoder approaches. Syntax, Semantics and Structure in Statistical Translation 103.
- Dielmann, A., and Renals, S. 2008. Recognition of dialogue acts in multiparty meetings using a switching DBN. IEEE Transactions on Audio, Speech, and Language Processing 16(7):1303–1314.
- Grau, S.; Sanchis, E.; Castro, M. J.; and Vilar, D. 2004. Dialogue act classification using a Bayesian approach. In 9th Conference Speech and Computer.
- Ji, Y.; Haffari, G.; and Eisenstein, J. 2016. A latent variable recurrent neural network for discourse relation language models. arXiv preprint arXiv:1603.01913.
- Jurafsky, D. 1997. Switchboard SWBD-DAMSL shallow-discourse-function annotation coders manual. Institute of Cognitive Science Technical Report.
- Kalchbrenner, N., and Blunsom, P. 2013. Recurrent convolutional neural networks for discourse compositionality. arXiv preprint arXiv:1306.3584.
- Khanpour, H.; Guntakandla, N.; and Nielsen, R. 2016. Dialogue act classification in domain-independent conversations using a deep recurrent neural network. In COLING, 2012–2021.
- Kim, S. N.; Cavedon, L.; and Baldwin, T. 2010. Classifying dialogue acts in one-on-one live chats. In EMNLP, 862–871.
- Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kumar, H.; Agarwal, A.; Dasgupta, R.; Joshi, S.; and Kumar, A. 2017. Dialogue act sequence labeling using hierarchical encoder with CRF. arXiv preprint arXiv:1709.04250.
- Lee, J. Y., and Dernoncourt, F. 2016. Sequential short-text classification with recurrent and convolutional neural networks. In NAACL-HLT, 515–520.
- Li, Y.; Su, H.; Shen, X.; Li, W.; Cao, Z.; and Niu, S. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In IJCNLP.
- Liu, Y.; Han, K.; Tan, Z.; and Lei, Y. 2017. Using context information for dialog act classification in DNN framework. In Proceedings of EMNLP, 2170–2178.
- Liu, Y. 2006. Using SVM and error-correcting codes for multiclass dialog act classification in meeting corpus. In Ninth International Conference on Spoken Language Processing.
- Paul, M., and Dredze, M. 2012. Factorial LDA: Sparse multi-dimensional text models. In NIPS, 2582–2590.
- Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In EMNLP, 1532–1543.
- Qin, K.; Wang, L.; and Kim, J. 2017. Joint modeling of content and discourse relations in dialogues. arXiv preprint arXiv:1705.05039.
- Searle, J. R. 1969. Speech acts: An essay in the philosophy of language, volume 626. Cambridge University Press.
- Shen, S.-s., and Lee, H.-y. 2016. Neural attention models for sequence classification: Analysis and application to key term extraction and dialogue act detection. arXiv preprint arXiv:1604.00077.
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. JMLR 15(1):1929–1958.
- Stolcke, A.; Ries, K.; Coccaro, N.; Shriberg, E.; Bates, R.; Jurafsky, D.; Taylor, P.; Martin, R.; Ess-Dykema, C. V.; and Meteer, M. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics 26(3):339–373.
- Tran, Q. H.; Zukerman, I.; and Haffari, G. 2017a. A hierarchical neural model for learning sequences of dialogue acts. In EACL, volume 1, 428–437.
- Tran, Q. H.; Zukerman, I.; and Haffari, G. 2017b. Preserving distributional information in dialogue act classification. In Proceedings of EMNLP, 2151–2156.
- Verma, R.; Shashidhar, N.; and Hossain, N. 2012. Detecting phishing emails the natural language way. In European Symposium on Research in Computer Security, 824–841.
- Wallace, B. C.; Trikalinos, T. A.; Laws, M. B.; Wilson, I. B.; and Charniak, E. 2013. A generative joint, additive, sequential model of topics and speech acts in patient-doctor communication. In EMNLP, 1765–1775.
- Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1(2):270–280.