CAiRE: An End-to-End Empathetic Chatbot
In this paper, we present an end-to-end empathetic conversation agent, CAiRE. Our system adapts the learning approach of TransferTransfo Wolf et al. (2019), which fine-tunes a large-scale pre-trained language model with multiple objectives: response language modeling, response prediction, and dialogue emotion detection. We evaluate our model on the recently proposed empathetic-dialogues dataset Rashkin et al. (2019). Our experimental results show that CAiRE achieves state-of-the-art performance on dialogue emotion detection and empathetic response generation.
Empathetic chatbots are conversational agents that can understand user emotions and respond appropriately, which is an essential step toward human-like conversation. In the early development of conversational systems such as ELIZA Weizenbaum (1966), PARRY Colby et al. (1971), and ALICE AbuShawar and Atwell (2015), most of the effort went into hand-crafting the rules of engagement. Recently, a modularized system, XiaoIce Zhou et al. (2018), achieved an impressive number of conversational turns per session, even higher than that of average human conversations. Despite these promising results, XiaoIce is built on a complex architecture with hundreds of independent components, such as Natural Language Understanding and Response Generation modules, each of which requires a tremendous amount of labeled data for training.
In contrast to modularized dialogue systems, end-to-end systems learn all components as a single model in a fully data-driven manner, and mitigate the lack of labeled data by sharing representations among different modules. Incorporating empathy into the dialogue system is essential to achieve human-like conversations because humans naturally express and perceive emotion in natural language to increase their sense of social bonding. In practice, a multi-task training strategy with an additional objective function that optimizes emotion label prediction for the conversation can produce more emotion-evoking responses Rashkin et al. (2019).
|Situation: Speaker felt this when … “I have had a great week!”|
|Speaker: I have had a great start to my week!|
|Listener: That’s great. Do you think the rest of the week will be as great?|
|Speaker: I hope so! It looks promising!!|
|Listener: Lucky you. Are you always a positive person or it’s just been an amazing week really?|
|Speaker: haha. Kind of both. And also probably too much coffee to start my shift tonight.|
However, data-driven end-to-end empathetic chatbots currently suffer from two limitations: 1) model capacity and 2) the sparsity of data for both emotion recognition and empathetic response generation Rashkin et al. (2019). Thanks to the recent success of large pre-trained language models Peters et al. (2018); Devlin et al. (2019), both problems can be mitigated.
In this paper, we extend the TransferTransfo learning approach Wolf et al. (2019) to an empathetic dialogue learning scenario Rashkin et al. (2019) by fine-tuning a large-scale pre-trained language model Radford et al. (2018) with an auxiliary dialogue emotion classification objective. The goal is to generate responses that are not only grammatical and coherent but also empathetic according to the context of the conversation. Our experimental results show that the model trained with this strategy outperforms existing models on the Empathetic Dialogues dataset in terms of response perplexity and BLEU score.
2 Related Work
Detecting sentiment and emotion Felbo et al. (2017); Xu et al. (2018); Fan et al. (2018a, b) has been shown to be indispensable for creating empathetic chatbots Fung et al. (2016); Bertero et al. (2016); Winata et al. (2017); Shin et al. (2019). Recently, Zhou et al. (2017); Hu et al. (2017); Wang and Wan (2018) introduced frameworks to control the sentiment and emotion of the generated response, while Zhou and Wang (2018) introduced a new Twitter conversation dataset and proposed leveraging the emoji labels of Twitter data to generate emotional responses. In addition, Rashkin et al. (2019) proposed a new benchmark for empathetic dialogue generation, which is grounded in situations prompted by specific emotion labels. Lin et al. (2019) improved on the initial baselines with a Mixture of Experts framework Shazeer et al. (2017). Meanwhile, personalized dialogue agents Li et al. (2016); Zhang et al. (2018b); Madotto et al. (2019) have been proposed to make conversations more consistent and engaging.
Previous work Peters et al. (2018); Radford et al. (2018); Devlin et al. (2019) showed that leveraging a large amount of data to learn context-sensitive features from a language model can create state-of-the-art models for a wide range of tasks. Taking this further, Radford et al. (2019); Yang et al. (2019) deployed higher-capacity models and improved the state-of-the-art results. In this paper, we build an empathetic chatbot on a pre-trained language model and achieve state-of-the-art results on dialogue emotion detection and empathetic response generation.
|Models|PPL|AVG BLEU|EMO ACC|
|Pretrained Rashkin et al. (2019)|27.96|5.01|-|
|Fine-Tuned Rashkin et al. (2019)|21.24|6.27|-|
|MULTITASK Rashkin et al. (2019)|24.07|5.42|-|
|EmoPrepend-1 Rashkin et al. (2019)|24.30|4.36|-|
|ENSEM-DM Rashkin et al. (2019)|19.05|6.83|-|
3.1 Language Model Pre-training
We apply the Generative Pre-trained Transformer (GPT) Radford et al. (2018) as our pre-trained language model. GPT is a multi-layer Transformer decoder with causal self-attention, pre-trained without supervision on the BooksCorpus dataset Zhu et al. (2015). BooksCorpus contains over 7,000 unique unpublished books from a variety of genres. Pre-training on such a large contiguous text corpus enables the model to capture long-range dialogue context information.
3.2 Persona Dialogue Pre-training
As the existing empathetic dialogue dataset Rashkin et al. (2019) is relatively small, fine-tuning only on this dataset would limit the chitchat topics of the model. To enhance the chitchat capability of CAiRE, we first pre-train the model on PersonaChat Zhang et al. (2018a), following the transfer learning strategy of Wolf et al. (2019). This pre-training procedure endows CAiRE with a persona, thus improving the engagement and consistency of the model. We refer interested readers to the code repository
3.3 Empathetic Dialogue Fine-tuning
To optimize the empathy of CAiRE, we fine-tune the pre-trained model on the empathetic dialogue dataset Rashkin et al. (2019) with a custom persona and three objectives: response language modeling, response prediction, and dialogue emotion detection.
Empathetic Dialogue Dataset
Rashkin et al. (2019) introduced a new empathetic dialogue dataset of 25k open-domain one-on-one conversations based on emotional scenarios triggered by specific emotion labels. The dataset provides 32 emotion labels, whose distribution is close to uniform. Table 1 shows an example from the training set. The speakers talk about their situations, and the listeners try to understand their feelings and reply accordingly. At training time, the emotion labels of the speakers are given, while at test time we hide the labels to evaluate the empathy of our model.
The whole fine-tuning schema for empathetic dialogues is illustrated in Figure 1. To fully leverage the pre-training on PersonaChat, we customize the persona of CAiRE with sentences such as “my name is caire”, “i want to help humans to make a better world”, “i am a good friend of humans”.
Following the fine-tuning schema of Wolf et al. (2019), we first concatenate the custom persona, dialogue history, and response (or distractor) with special separator tokens, and represent the whole input sequence as the sum of trainable positional embeddings, word embeddings, and dialogue state embeddings. Positional and word embeddings are required for Transformer input, while dialogue state embeddings are added to help CAiRE model the hierarchical dialogue structure and distinguish persona sentences from dialogue context and response. The input representation is fed into the causal attention Transformer decoder to obtain the contextualized representation. We denote the contextualized representation of the last special token as $\mathrm{SEN}$ and that of the special token before the reply (or distractor) as $\mathrm{EMO}$.
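The input layout described in this paragraph can be illustrated with a small sketch. It is not the authors' implementation: the token strings (`<bos>`, `<eos>`, `<persona>`, `<speaker>`, `<listener>`) and the function name `build_input` are assumptions made for illustration, since the actual special tokens are not listed in the text.

```python
# Hypothetical special tokens and state tags; the real vocabulary is not given in the text.
BOS, EOS = "<bos>", "<eos>"
PERSONA, SPEAKER, LISTENER = "<persona>", "<speaker>", "<listener>"


def build_input(persona, history, reply):
    """Join persona, history and reply (or distractor) into one sequence.

    persona: list of token lists, one per persona sentence
    history: list of token lists, one per turn, ending with the speaker's turn
    reply:   token list for the listener's response or a sampled distractor
    """
    segments = [(PERSONA, [BOS] + [tok for sent in persona for tok in sent])]
    for i, turn in enumerate(history):
        # Count back from the end so the final history turn is the speaker's.
        tag = SPEAKER if (len(history) - i) % 2 == 1 else LISTENER
        segments.append((tag, [tag] + list(turn)))
    segments.append((LISTENER, [LISTENER] + list(reply) + [EOS]))

    tokens, state_ids = [], []
    for state, seg in segments:
        tokens.extend(seg)
        # One dialogue state tag per token, shared by the whole segment.
        state_ids.extend([state] * len(seg))
    positions = list(range(len(tokens)))

    # Index of the special token that precedes the reply, and of the last token.
    reply_start = len(tokens) - len(segments[-1][1])
    last = len(tokens) - 1
    return tokens, state_ids, positions, reply_start, last


tokens, state_ids, positions, reply_start, last = build_input(
    persona=[["my", "name", "is", "caire"]],
    history=[["i", "had", "a", "great", "week"]],
    reply=["that", "is", "great"],
)
```

In a model, `tokens`, `state_ids`, and `positions` would each index a trainable embedding table, and the three vectors at every position would be summed before entering the decoder.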
To optimize the response prediction objective, at each training step we sample one distractor from another conversation to contrast with the gold response. The representation of the last special token, $\mathrm{SEN}$, is then passed to a linear classifier that identifies the correct response, yielding the cross-entropy loss $\mathcal{L}_S$.
To optimize the response language modeling objective, we use the contextualized representation of each gold reply token to predict the next reply token, and compute the language model loss $\mathcal{L}_L$ using cross-entropy.
To enable CAiRE to detect the conversational partner’s emotion, we add the dialogue emotion detection objective during training. We take $\mathrm{EMO}$, the representation of the special token before the reply, as a summary of the current dialogue state and pass it to a linear projection layer to predict scores for the 32 emotions. Cross-entropy is applied to obtain the emotion classification loss $\mathcal{L}_E$.
Our final fine-tuning loss function is the weighted sum of the aforementioned losses:
$$\mathcal{L} = \lambda_L \mathcal{L}_L + \lambda_S \mathcal{L}_S + \lambda_E \mathcal{L}_E,$$
where $\lambda_L$, $\lambda_S$, and $\lambda_E$ weight the three objectives.
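To make the weighted combination of the three objectives concrete, the following is a minimal sketch in plain Python. The function names, the argument layout, and the default weights of 1.0 are assumptions for illustration only; the weights actually used for fine-tuning are not stated here, and a real implementation would operate on batched tensors.

```python
import math


def cross_entropy(scores, target):
    """Negative log-probability of `target` under softmax(scores)."""
    m = max(scores)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]


def multitask_loss(lm_scores, lm_targets, cand_scores, gold_index,
                   emo_scores, emo_label,
                   w_lm=1.0, w_resp=1.0, w_emo=1.0):
    """Weighted sum of the three fine-tuning losses (weights are placeholders).

    lm_scores:   one vocabulary score list per gold reply position
    lm_targets:  index of the next gold token at each of those positions
    cand_scores: one score per candidate response (gold plus distractor)
    emo_scores:  one score per emotion label
    """
    # Response language modeling: average next-token cross-entropy.
    loss_lm = sum(cross_entropy(s, t)
                  for s, t in zip(lm_scores, lm_targets)) / len(lm_targets)
    # Response prediction: pick the gold response among the candidates.
    loss_resp = cross_entropy(cand_scores, gold_index)
    # Dialogue emotion detection: classify the emotion label.
    loss_emo = cross_entropy(emo_scores, emo_label)
    total = w_lm * loss_lm + w_resp * loss_resp + w_emo * loss_emo
    return total, (loss_lm, loss_resp, loss_emo)


# With uniform scores each loss equals the log of the number of classes.
total, parts = multitask_loss(
    lm_scores=[[0.0] * 4, [0.0] * 4], lm_targets=[1, 3],
    cand_scores=[0.0, 0.0], gold_index=0,
    emo_scores=[0.0] * 32, emo_label=7,
)
```

Keeping the three losses separate until the final sum makes it straightforward to log each objective and to tune its weight independently.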
4 Experiment and Result
We evaluate our model on the empathetic dialogue dataset against the following baselines:
Pretrained: This model is trained with the full Transformer network architecture Vaswani et al. (2017) on 1.7 billion REDDIT conversations.
Fine-Tuned: This model fine-tunes the Pretrained model on the empathetic dialogue dataset.
MULTITASK: This model is trained by adding another linear layer on top of the encoder of the Transformer to classify the emotion of the dialogue based on the context.
EmoPrepend-1: This model prepends the top-1 predicted emotion label to the beginning of the token sequence as encoder input.
ENSEM-DM: This model concatenates the encoded representations from the Transformer encoder with the representations from a pre-trained emotion classifier, and feeds the concatenated representations to the Transformer decoder.
We use perplexity (PPL), the average of BLEU-1, BLEU-2, BLEU-3, and BLEU-4 (AVG BLEU), and emotion classification accuracy (EMO ACC) as our evaluation metrics. As shown in Table 2, CAiRE outperforms all the baseline models on all metrics, which demonstrates its strong capacity for modeling empathetic responses and classifying dialogue emotion.
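As a reminder of how the first two metrics are defined, the sketch below computes perplexity from per-token log-probabilities and averages four BLEU-n scores. The function names are hypothetical, and the BLEU-n values are assumed to come from an existing BLEU implementation; the exact tokenization and BLEU tooling used for Table 2 are not specified in the text.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def average_bleu(bleu_scores):
    """Arithmetic mean of BLEU-1 to BLEU-4, given as a sequence of four scores."""
    if len(bleu_scores) != 4:
        raise ValueError("expected BLEU-1, BLEU-2, BLEU-3 and BLEU-4")
    return sum(bleu_scores) / 4


# A model that assigns probability 0.25 to every gold token has perplexity 4.
ppl = perplexity([math.log(0.25)] * 4)
avg = average_bleu([0.4, 0.2, 0.1, 0.1])
```

Lower perplexity means the model assigns higher probability to the gold responses, whereas higher AVG BLEU means more n-gram overlap with them.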
5 CAiRE System Demonstration
We establish a web-based user interface that allows multiple users to chat with CAiRE online asynchronously.
5.1 User Interface
As shown in Figure 2, our user interface is based solely on text inputs. Users can type anything in the input box and get a response immediately from the server. A report button at the bottom allows users to report unethical dialogues, which are then marked and saved separately on our back-end server. To help teach our chatbot how to respond properly, we add an edit button next to the response. When the user clicks it, a new input box appears, and the user can type in the response they think the chatbot should have given.
5.2 Scalable to Multiple Users
Because response generation places a high demand on GPU computation, the computation cost needs to be well distributed across different GPUs to support multiple users. We adopt several approaches to maximize GPU utilization without crashing the system. First, we set up two independent processes on each GTX 1080Ti, where we found the peak GPU utilization to be around 90% with both processes running stably. Second, we employ a load-balancing module that distributes requests to idle processes based on their workloads. During a stress test, we simulated users sending requests every 2 seconds, and with 8 GPUs we were able to support more than 50 concurrent requests.
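One simple way to realize such a load-balancing module is a least-loaded dispatcher, sketched below. This is an illustrative assumption rather than the deployed system: the class name `LoadBalancer`, its methods, and the choice of "fewest in-flight requests" as the measure of workload are all hypothetical.

```python
class LoadBalancer:
    """Route each request to the worker process with the fewest in-flight requests."""

    def __init__(self, num_gpus, procs_per_gpu=2):
        # One counter per (gpu, process) pair; two processes per GPU as in the text.
        self.load = {(gpu, proc): 0
                     for gpu in range(num_gpus)
                     for proc in range(procs_per_gpu)}

    def acquire(self):
        """Pick the least-loaded worker (ties broken by worker id) and mark it busy."""
        worker = min(self.load, key=lambda w: (self.load[w], w))
        self.load[worker] += 1
        return worker

    def release(self, worker):
        """Call when the worker has finished generating a response."""
        if self.load[worker] <= 0:
            raise ValueError("worker has no in-flight request")
        self.load[worker] -= 1


balancer = LoadBalancer(num_gpus=8)
first_wave = [balancer.acquire() for _ in range(16)]  # one request per process
overflow = balancer.acquire()                         # every process busy: queue on one
balancer.release(first_wave[5])
next_worker = balancer.acquire()                      # goes to the process just freed
```

Idle processes always win the selection because their counter is zero, and requests only stack up on a busy process once every process has work.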
5.3 Active Learning of Ethical Values
CAiRE was first presented in the ACL 2019 keynote talk “Loquentes Machinea: Technology, Applications, and Ethics of Conversational Systems”, after which we released the chatbot to the public. Within one week, we received traffic from more than 500 users, along with several reports of unethical dialogues. This feedback shows that CAiRE has no sense of ethical values, owing to the lack of training data that identifies inappropriate behavior. Thus, when users raise ethically concerning questions, CAiRE may respond without considering the ethical implications. For example, a user might ask “Would you kill a human?”, and CAiRE could respond “yes, I want!”. To mitigate this issue, we perform imitation learning on the collected user-revised responses. We observe that this approach can greatly reduce unethical responses. As CAiRE gathers more unethical dialogues and their revisions, its performance can be further improved.
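The data preparation step for this kind of imitation learning might look like the sketch below, which turns logged feedback into (history, target) pairs where the target is the user's revision. The record fields (`history`, `response`, `revised_response`) and the function name are hypothetical; the actual log format of the back-end server is not described in the text.

```python
def build_imitation_examples(feedback_log):
    """Keep only the dialogues for which a user supplied a revised response."""
    examples = []
    for record in feedback_log:
        revised = (record.get("revised_response") or "").strip()
        if not revised:
            continue  # no correction was provided, so there is nothing to imitate
        examples.append({"history": list(record["history"]), "target": revised})
    return examples


# Illustrative log entries; the revised text here is made up for the example.
log = [
    {"history": ["Would you kill a human?"], "response": "yes, I want!",
     "revised_response": "No, I would never hurt anyone."},
    {"history": ["hello"], "response": "hi there!", "revised_response": ""},
    {"history": ["how are you?"], "response": "i am good."},
]
examples = build_imitation_examples(log)
```

Each resulting pair can then be treated like a gold response in the fine-tuning data, so the model is trained to reproduce the revision in place of its original reply.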
In this paper, we introduce CAiRE, an end-to-end empathetic chatbot. Our system fine-tunes a large-scale pre-trained language model with three multi-task objectives: response language modeling, response prediction and dialogue emotion detection. The evaluation on the empathetic dialogue dataset shows that it achieves state-of-the-art performance on detecting dialogue emotion and generating empathetic responses. We built a web interface for our model and have made it accessible to multiple users via a web-link. By further collecting user feedback and improving our model, we can make CAiRE more empathetic in the future, which can be a forward step for end-to-end dialogue models.
References
- ALICE chatbot: trials and outputs. Computación y Sistemas 19 (4), pp. 625–632.
- Real-time speech emotion and sentiment recognition for interactive dialogue systems. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1042–1047.
- Artificial paranoia. Artificial Intelligence 2 (1), pp. 1–25.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
- Multi-region ensemble convolutional neural network for facial expression recognition. In International Conference on Artificial Neural Networks, pp. 84–94.
- Video-based emotion recognition using deeply-supervised neural networks. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, pp. 584–588.
- Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Zara the Supergirl: an empathetic personality recognition system. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, San Diego, California, pp. 87–91.
- Toward controlled generation of text. In International Conference on Machine Learning, pp. 1587–1596.
- A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 994–1003.
- MoEL: mixture of empathetic listeners. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 121–132.
- Personalizing dialogue agents via meta-learning. In Proceedings of the 57th Conference of the Association for Computational Linguistics, Florence, Italy, pp. 5454–5459.
- Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237.
- Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
- Language models are unsupervised multitask learners. OpenAI Blog 1 (8).
- Towards empathetic open-domain conversation models: a new benchmark and dataset. In Proceedings of the 57th Conference of the Association for Computational Linguistics, Florence, Italy, pp. 5370–5381.
- Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
- HappyBot: generating empathetic dialogue responses by improving user experience look-ahead. arXiv preprint arXiv:1906.08487.
- Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- SentiGAN: generating sentimental texts via mixture adversarial networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pp. 4446–4452.
- ELIZA—a computer program for the study of natural language communication between man and machine. Communications of the ACM 9 (1), pp. 36–45.
- Nora the empathetic psychologist. Proc. Interspeech 2017, pp. 3437–3438.
- Transfertransfo: a transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.
- Emo2Vec: learning generalized emotion representation by multi-task training. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis.
- XLNet: generalized autoregressive pretraining for language understanding.
- Personalizing dialogue agents: i have a dog, do you have pets too?. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2204–2213.
- Personalizing dialogue agents: i have a dog, do you have pets too?. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- Emotional chatting machine: emotional conversation generation with internal and external memory.
- The design and implementation of xiaoice, an empathetic social chatbot. arXiv preprint arXiv:1812.08989.
- MojiTalk: generating emotional responses at scale. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1128–1137.
- Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pp. 19–27.