Learning Personas from Dialogue with Attentive Memory Networks
The ability to infer persona from dialogue can have applications in areas ranging from computational narrative analysis to personalized dialogue generation. We introduce neural models to learn persona embeddings in a supervised character trope classification task. The models encode dialogue snippets from IMDB into representations that can capture the various categories of film characters. The best-performing models use a multi-level attention mechanism over a set of utterances. We also utilize prior knowledge in the form of textual descriptions of the different tropes. We apply the learned embeddings to find similar characters across different movies, and cluster movies according to the distribution of the embeddings. The use of short conversational text as input, and the ability to learn from prior knowledge using memory, suggest these methods could be applied to other domains.
1 Introduction
Individual personality plays a deep and pervasive role in shaping social life. Research indicates that it can relate to the professional and personal relationships we develop Barrick and Mount (1993), Shaver and Brennan (1992), the technological interfaces we prefer Nass and Lee (2000), the behavior we exhibit on social media networks Selfhout et al. (2010), and the political stances we take Jost et al. (2009).
With increasing advances in human-machine dialogue systems, and widespread use of social media in which people express themselves via short text messages, there is growing interest in systems that have an ability to understand different personality types. Automated personality analysis based on short text analysis could open up a range of potential applications, such as dialogue agents that sense personality in order to generate more interesting and varied conversations.
We define persona as a person’s social role, which can be categorized according to their conversations, beliefs, and actions. To learn personas, we start with the character tropes data provided in the CMU Movie Summary Corpus by Bamman et al. (2014). It consists of 72 manually identified commonly occurring character archetypes and examples of each. In the character trope classification task, we predict the character trope based on a batch of dialogue snippets.
In their original work, the authors use Wikipedia plot summaries to learn latent variable models that provide a clustering from words to topics and topics to personas – their persona clusterings were then evaluated by measuring similarity to the ground-truth character trope clusters. We asked the question – could personas also be inferred through dialogue? Because we use quotes as a primary input and not plot summaries, we believe our model is extensible to areas such as dialogue generation and conversational analysis.
Table 1: Character tropes and canonical examples.
|Character Trope|Character|Movie|
|---|---|---|
|Corrupt corporate executive|Les Grossman|Tropic Thunder|
|Retired outlaw|Butch Cassidy|Butch Cassidy and the Sundance Kid|
Our contributions are:
- Data collection of IMDB quotes and character trope descriptions for characters from the CMU Movie Summary Corpus.
- Models that greatly outperform the baseline model in the character trope classification task. Our experiments show the importance of multi-level attention over words in dialogue, and over a set of dialogue snippets.
- An examination of how prior knowledge in the form of textual descriptions of the persona categories may be used. We find that a ‘Knowledge-Store’ memory initialized with descriptions of the tropes is particularly useful. This ability may allow these models to be used more flexibly in new domains and with different persona categories.
2 Related Work
Prior to data-driven approaches, personalities were largely measured by asking people questions and assigning traits according to some fixed set of dimensions, such as the Big Five traits of openness, conscientiousness, extraversion, agreeableness, and neuroticism Tupes and Christal (1992). Computational approaches have since advanced to infer these personalities based on observable behaviors such as the actions people take and the language they use Golbeck et al. (2011).
Our work builds on recent advances in neural networks that have been used for natural language processing tasks such as reading comprehension Sukhbaatar et al. (2015) and dialogue modeling and generation Vinyals and Le (2015); Li et al. (2016); Shang et al. (2015). This includes the growing literature in attention mechanisms and memory networks Bahdanau et al. (2014); Sukhbaatar et al. (2015); Kumar et al. (2016).
The ability to infer and model personality has applications in storytelling agents, dialogue systems, and psychometric analysis. In particular, personality-infused agents can help “chit-chat” bots avoid repetitive and uninteresting utterances Walker et al. (1997); Mairesse and Walker (2007); Li et al. (2016); Zhang et al. (2018). The more recent neural models do so by conditioning on a ‘persona’ embedding – our model could help produce those embeddings.
Finally, in the field of literary analysis, graphical models have been proposed for learning character personas in novels Flekova and Gurevych (2015); Srivastava et al. (2016), folktales Valls-Vargas et al. (2014), and movies Bamman et al. (2014). However, these models often use more structured inputs than dialogue to learn personas.
3 Datasets
Characters in movies can often be categorized into archetypal roles and personalities. To understand the relationship between dialogue and personas, we utilized three different datasets for our models: (a) the Movie Character Trope dataset, (b) the IMDB Dialogue Dataset, and (c) the Character Trope Description Dataset. We collected the IMDB Dialogue and Trope Description datasets, and these datasets are made publicly available.
3.1 Character Tropes Dataset
The CMU Movie Summary dataset provides tropes commonly occurring in stories and media Bamman et al. (2014). There are a total of 72 tropes, which span 433 characters and 384 movies. Each trope contains between 1 and 25 characters, with a median of 6 characters per trope. Tropes and canonical examples are shown in Table 1.
3.2 IMDB Dialogue Snippet Dataset
To obtain the utterances spoken by the characters, we crawled the IMDB Quotes page for each movie. Though not every single utterance spoken by the character may be available, as the quotes are submitted by IMDB users, many quotes from most of the characters are typically found, especially for the famous characters found in the Character Tropes dataset. The distribution of quotes per trope is displayed in Figure 1. Our models were trained on 13,874 quotes and validated and tested on a set of 1,734 quotes each.
We refer to each IMDB quote as a (contextualized) dialogue snippet, as each quote can contain several lines between multiple characters, as well as italicized text giving context to what might be happening when the quote took place. Figure 2 shows a typical dialogue snippet. 70.3% of the quotes are multi-turn exchanges, with a mean of 3.34 turns per multi-turn exchange. While the character’s own lines alone can be highly indicative of the trope, our models show that accounting for the context and the other characters’ lines improves performance. The context, for instance, can give clues to typical scenes and actions that are associated with certain tropes, while the other characters’ lines give further detail into the relationship between the character and his or her environment.
3.3 Character Trope Description Dataset
We also incorporate descriptions of each of the character tropes by using the corresponding descriptions scraped from TVTropes.
4 Problem Formulation
Our goal is to train a model that can take a batch of dialogue snippets from the IMDB dataset and predict the character trope.
Formally, let $N_P$ be the total number of character tropes in the character tropes dataset. Each character $C$ is associated with a corresponding ground-truth trope category $P$. Let $S = (D, E, O)$ be a dialogue snippet associated with a character $C$, where $D$ refers to the character’s own lines, $E$ is the contextual information, and $O$ denotes the other characters’ lines. We define all three components of $S$ to have fixed sequence length $T$ and pad when necessary. Let $N_S$ be the total number of dialogue snippets for a trope. We sample a set of $N_{diag}$ (where $N_{diag} \le N_S$) snippets from the $N_S$ snippets related to the trope as inputs to our model.
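The sampling and padding step can be illustrated with a minimal sketch. The dictionary keys `character`, `context`, and `others` (standing in for the three snippet components), the padding token, and the toy snippets are all assumptions made for this example rather than details from the paper.

```python
import random

PAD = "<pad>"  # hypothetical padding token

def pad_tokens(tokens, seq_len, pad=PAD):
    """Truncate or right-pad a token list to exactly seq_len tokens."""
    return (tokens[:seq_len] + [pad] * seq_len)[:seq_len]

def sample_snippets(snippets, n_diag, seq_len, rng):
    """Sample n_diag snippets for one trope and pad every component.

    Each snippet is a dict of token lists: the character's own lines,
    the contextual information, and the other characters' lines.
    """
    chosen = rng.sample(snippets, n_diag)  # sampling without replacement
    parts = ("character", "context", "others")
    return [{p: pad_tokens(s[p], seq_len) for p in parts} for s in chosen]

rng = random.Random(0)
trope_snippets = [
    {"character": ["i", "ride", "alone"], "context": ["draws", "gun"],
     "others": ["who", "are", "you", "?"]},
    {"character": ["never", "again"], "context": [], "others": ["come", "back"]},
    {"character": ["one", "last", "job"], "context": ["sighs"], "others": []},
]
batch = sample_snippets(trope_snippets, n_diag=2, seq_len=4, rng=rng)
```

Every component in `batch` then has the same length, which is what lets the three inputs be encoded in fixed-size tensors.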
5 Attentive Memory Network
The Attentive Memory Network consists of two major components: (a) Attentive Encoders, and (b) a Knowledge-Store Memory Module. Figure 3 outlines the overall model. We describe the components in the following sections.
5.1 Attentive Encoders
Not every piece of dialogue may be reflective of a latent persona. In order to learn to ignore words and dialogue snippets that are not informative about the trope we use a multi-level attentive encoder that operates at (a) the individual snippet level, and (b) across multiple snippets.
Attentive Snippet Encoder
The snippet encoder extracts features from a single dialogue snippet $S$, with attention over the words in the snippet. A snippet is fed to the encoder to extract features from each of its textual inputs and encode them into an embedding space. We use a recurrent neural network as our encoder, explained in detail in Section 5.1.1. In order to capture the trope-reflective words from the input text, we augment our model with a self-attention layer which scores each word in the given text for its relevance. Section 5.1.2 explains how the attention weights are computed. The output of this encoder is an encoded snippet embedding.
Attentive Inter-Snippet Encoder
As shown in Figure 3, the snippet embeddings from the snippet encoder are fed to our inter-snippet encoder. This encoder captures inter-snippet relationships using recurrence over the snippet embeddings for a given trope and determines their importance. Some of the dialogue snippets may not be informative about the trope, and the model learns to assign low attention scores to such snippets. The resulting attended summary vector from this phase is the persona representation $z$, defined as:
$$z = \gamma_D D^s + \gamma_E E^s + \gamma_O O^s \tag{1}$$
where $\gamma_D$, $\gamma_E$, and $\gamma_O$ are learnable weight parameters, and $D^s$, $E^s$, and $O^s$ refer to the summary vectors of the character’s lines, contextual information, and other characters’ lines, respectively. In Section 7, we experiment with models that have $\gamma_E$ and $\gamma_O$ set to 0 to understand how the contextual information and other characters’ lines contribute to the overall performance.
5.1.1 Encoder
Given an input sequence $(x_1, \dots, x_T)$, we use a recurrent neural network to encode the sequence into hidden states $(h_1, \dots, h_T)$. In our experiments, we use a gated recurrent unit (GRU) Chung et al. (2014) rather than an LSTM Hochreiter and Schmidhuber (1997) because the latter is more computationally expensive. We use bidirectional GRUs and concatenate the forward and backward hidden states to get $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$ for $t = 1, \dots, T$.
5.1.2 Attention
We define an attention mechanism that computes a summary vector $s$ from the resultant hidden states $h_t$ of a GRU by learning to generate weights $\alpha_t$. Each weight can be interpreted as the relative importance given to a hidden state $h_t$ in forming an overall summary vector for the sequence. Formally, we define it as:
$$a_t = f_{attn}(h_t), \qquad \alpha_t = \frac{\exp(a_t)}{\sum_{t'=1}^{T} \exp(a_{t'})}, \qquad s = \sum_{t=1}^{T} \alpha_t h_t$$
where $f_{attn}$ is a two-layer fully connected network in which the first layer projects $h_t$ to an attention hidden space, and the second layer produces a relevance score $a_t$ for every hidden state at timestep $t$.
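The attention pooling described above can be sketched in plain Python. This is an illustration under assumptions: the `tanh` nonlinearity between the two layers, the absence of bias terms, and the toy weights are not specified in the text.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def attention_pool(hidden_states, w1, w2):
    """Score each hidden state with a two-layer network, then pool.

    Layer 1 projects h_t into an attention hidden space (tanh assumed);
    layer 2 maps that projection to a scalar relevance score.
    """
    scores = []
    for h in hidden_states:
        g = [math.tanh(v) for v in matvec(w1, h)]
        scores.append(sum(a * b for a, b in zip(w2, g)))
    weights = softmax(scores)
    dim = len(hidden_states[0])
    summary = [sum(w * h[i] for w, h in zip(weights, hidden_states))
               for i in range(dim)]
    return weights, summary

hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w1 = [[1.0, 0.0], [0.0, 1.0]]  # toy first-layer weights
w2 = [1.0, 1.0]                # toy second-layer weights
weights, summary = attention_pool(hidden_states, w1, w2)
```

The same pooling applies at both levels of the encoder: over word-level hidden states within a snippet, and over snippet-level hidden states within a sampled set.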
5.2 Memory Modules
Our model consists of a read-only ‘Knowledge-Store’ memory, and we also test a recent read-write memory. External memories have been shown to help on natural language processing tasks Sukhbaatar et al. (2015); Kumar et al. (2016); Kaiser and Nachum (2017), and we find similar improvements in learning capability.
The main motivation behind the Knowledge-Store memory module is to incorporate prior domain knowledge. In our work, this knowledge refers to the trope descriptions described in Section 3.3.
Related works have initialized their memory networks with positional encoding using word embeddings Sukhbaatar et al. (2015); Kumar et al. (2016); Miller et al. (2016). To incorporate the descriptions, we represent them with skip-thought vectors Kiros et al. (2015) and use them to initialize the memory keys $M_K \in \mathbb{R}^{N_P \times d_K}$, where $N_P$ is the number of tropes and $d_K$ is set to the size of the embedded trope description.
The values in the memory represent learnable embeddings of the corresponding trope categories, $M_V \in \mathbb{R}^{N_P \times d_V}$, where $d_V$ is the size of the trope category embeddings. The network learns to use the persona representation $z$ from the encoder phase to find relevant matches in the memory. This corresponds to calculating similarities between $z$ and the memory keys: a fully-connected layer first projects the persona representation into the space of the memory keys, and the similarities to the keys are turned into match probabilities. Based on the match probabilities, the values are weighted and cumulatively added to the original persona representation.
We iteratively combine the mapped persona representation with information from the memory, repeating the above process for a fixed number of hops. After each hop, the memory-mapped persona representation is updated through a fully-connected layer. Finally, we transform the resulting representation using another fully-connected layer.
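A rough sketch of a multi-hop key-value memory read is given below. It makes several simplifying assumptions that are not taken from the paper: similarities are dot products normalized by a softmax, the fully-connected layers are omitted (treated as identity), and keys and values share the same dimensionality so that the read vector can be added directly to the query.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def memory_hop(query, keys, values):
    """One hop: match the query to the keys, read values, add to the query."""
    probs = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    read = [sum(p * v[i] for p, v in zip(probs, values)) for i in range(dim)]
    updated = [q + r for q, r in zip(query, read)]
    return probs, updated

def knowledge_store_read(query, keys, values, n_hops):
    """Repeat the hop n_hops times; return the last match probabilities."""
    probs = None
    for _ in range(n_hops):
        probs, query = memory_hop(query, keys, values)
    return probs, query

keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]    # one key per trope (toy)
values = [[0.5, 0.0], [0.0, 0.5], [-0.5, 0.0]]  # one value per trope (toy)
query = [2.0, 0.1]
probs_1, _ = knowledge_store_read(query, keys, values, n_hops=1)
probs_2, _ = knowledge_store_read(query, keys, values, n_hops=2)
```

In this toy setup the second hop places more probability on the best-matching trope than the first, since each read pulls the query further toward the matched entry.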
We also tested a Read-Write Memory following Kaiser and Nachum (2017), which was originally designed to remember rare events. In our case, these ‘rare’ events might be key dialogue snippets that are particularly indicative of latent persona. It consists of keys, which are activations of a specific layer of the model, i.e. the persona representation $z$, and values, which are the ground-truth labels, i.e. the trope categories. Over time, it is able to facilitate predictions based on past data with similar activations stored in the memory. For every new example, the network writes to memory for future lookup. A memory $M$ with memory size $N_M$ is defined as a triple of keys $K$, values $V$, and item ages $A$, each with $N_M$ entries:
$$M = (K, V, A)$$
Memory Read We use the persona embedding $z$ as a query to the memory. We calculate the cosine similarities between $z$ and the keys in $M$, take the softmax on the top-$k$ neighbors, and compute a weighted embedding using those scores.
Memory Write We update the memory in a similar fashion to the original work by Kaiser and Nachum (2017), which takes into account the maximum age of items as stored in $A$.
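The read step can be sketched as follows. The text does not say which vectors are averaged to form the weighted embedding; this sketch assumes each stored label has an associated embedding (`value_embeddings`, a hypothetical name) and averages those. The toy keys and embeddings are likewise illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def read_memory(query, keys, value_embeddings, k):
    """Return the top-k slots, their softmax weights, and the weighted embedding."""
    sims = [cosine(query, key) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: sims[i], reverse=True)[:k]
    weights = softmax([sims[i] for i in top])
    dim = len(value_embeddings[0])
    out = [sum(w * value_embeddings[i][d] for w, i in zip(weights, top))
           for d in range(dim)]
    return top, weights, out

keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
value_embeddings = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
top, weights, out = read_memory([1.0, 0.05], keys, value_embeddings, k=2)
```

Restricting the softmax to the top-k neighbors keeps distant memory slots from contributing to the returned embedding.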
6 Objective Losses
To train our model, we utilize the different objective losses described below.
6.1 Classification Loss
We calculate the probability of a character belonging to a particular trope category through Equation 11, where $f_P$ is a fully-connected layer, and $z$ is the persona representation produced by the multi-level attentive encoders described in Equation 1. We then optimize the categorical cross-entropy loss between the predicted and true tropes as in Equation 12, where $N_P$ is the total number of tropes, $\hat{P}_j$ is the predicted probability that the input character falls under trope $j$, and $P_j$ denotes the ground truth of whether the input snippets come from characters from the $j$-th trope.
$$\hat{P} = \mathrm{softmax}(f_P(z)) \tag{11}$$
$$J_{CE} = -\sum_{j=1}^{N_P} P_j \log \hat{P}_j \tag{12}$$
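As a small worked example of the classification loss, the sketch below applies a softmax to the output scores of the final layer and computes the cross-entropy against a one-hot ground truth. The logits are made-up numbers; with a one-hot target, the sum over tropes reduces to the negative log-probability of the true trope.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_index):
    """Categorical cross-entropy for a one-hot target at true_index."""
    probs = softmax(logits)
    return -math.log(probs[true_index])

logits = [2.0, 0.5, -1.0]             # toy scores for three tropes
loss_correct = cross_entropy(logits, 0)  # true trope has the highest score
loss_wrong = cross_entropy(logits, 2)    # true trope has the lowest score
loss_uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], 1)
```

A uniform prediction over four tropes gives a loss of log 4, which is a useful reference point when reading training curves.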
6.2 Trope Description Triplet Loss
In addition to using trope descriptions to initialize the Knowledge-Store Memory, we also test learning from the trope descriptions through a triplet loss Hoffer and Ailon (2015). We again use the skip-thought vectors to represent the descriptions. Specifically, we want to maximize the similarity of representations obtained from dialogue snippets with their corresponding description, and minimize their similarity with negative examples. We implement this as:
$$R^z = f_R(z) \tag{13}$$
where $f_R$ is a fully-connected layer. The triplet ranking loss is then Equation 14, where $\alpha_T$ is a learnable margin parameter and $\mathrm{sim}(\cdot, \cdot)$ denotes the similarity between trope embeddings ($R^z$), positive ($R^D_p$) and negative ($R^D_n$) trope descriptions.
$$J_T = \max\left(0, \alpha_T + \mathrm{sim}(R^z, R^D_n) - \mathrm{sim}(R^z, R^D_p)\right) \tag{14}$$
Trope Description Triplet Loss with Memory Module
If a memory module is used, we compute a new triplet loss in place of the one described in Equation 14. Models that use a memory module should learn a representation, based on either the prior knowledge stored in the memory (as in Knowledge-Store memory) or the top-$k$ key matches (as in Read-Write memory), that is similar to the representation of the trope descriptions.
This is achieved by replacing the persona embedding $z$ in Equation 13 with the memory output as shown in Equation 15, where the projection is again a fully-connected layer. To compute the new loss, we combine the representations obtained from Equations 13 and 15 through a learnable parameter that determines the importance of each representation. Finally, we utilize this combined representation to calculate the loss as shown in Equation 17.
6.3 Read-Write Memory Losses
When the Read-Write memory is used, we use two extra loss functions. The first is a Memory Ranking Loss as done in Kaiser and Nachum (2017), which learns based on whether a query with the persona embedding returns nearest neighbors with the correct trope. The second is a Memory Classification Loss that uses the values returned by the memory to predict the trope. The full details for both are found in Supplementary Section A.
6.4 Overall Loss
We combine the above losses through:
$$J = \sum_i \omega_i J_i$$
where $\omega_i$ are learnable weights such that $\sum_i \omega_i = 1$. Depending on which variant of the model is being used, the list of losses $J_i$ is modified to contain only the relevant losses. For example, when the Knowledge-Store memory is used, the weights of the Read-Write memory losses are set to 0 and those losses are dropped from the list. We discuss different variants of our model in the next section.
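One way to keep learnable loss weights summing to 1 is to store unconstrained parameters and pass them through a softmax; the text does not state how the constraint is enforced, so this mechanism is an assumption. The loss names and values below are hypothetical placeholders.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_losses(losses, raw_weights):
    """Weighted sum of the given losses with weights normalized to sum to 1."""
    names = sorted(losses)
    weights = softmax([raw_weights[name] for name in names])
    total = sum(w * losses[name] for w, name in zip(weights, names))
    return total, dict(zip(names, weights))

all_losses = {"classification": 1.2, "triplet": 0.4, "memory_ranking": 0.9}
raw = {"classification": 0.0, "triplet": 0.0, "memory_ranking": 0.0}

# A variant without the Read-Write memory keeps only the relevant losses.
active = {n: v for n, v in all_losses.items() if n != "memory_ranking"}
total, weights = combine_losses(active, raw)
```

Filtering the dictionary before normalization mirrors the idea of modifying the list of losses per model variant, so that unused losses receive no weight.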
7 Experiments
We experimented with combinations of our various modules and losses. The experimental results and ablation studies are described in the following sections, and the experimental details are described in Supplementary Section B. The different model permutation names in Table 2, e.g. “attn_3_tropetrip_ks-mem_ndialog16”, are defined as follows:
baseline vs. attn: The ‘baseline’ model uses only one dialogue snippet to predict the trope, i.e. $N_{diag} = 1$. Hence, the inter-snippet encoder is not used. The ‘attn’ model operates on $N_{diag}$ dialogue snippets, using the inter-snippet encoder to assign an attention score to each snippet.
char vs. 3: To measure the importance of context and other characters’ lines, we have two variants: ‘char’ uses only the character’s lines, while ‘3’ uses the character’s lines, other characters’ lines, and all context lines. Formally, in ‘char’ mode, we set $\gamma_E$ and $\gamma_O$ to 0 in Equation 1. In ‘3’ mode, $\gamma_D$, $\gamma_E$, and $\gamma_O$ are learned by the model.
tropetrip: The presence of ‘tropetrip’ indicates that the triplet loss on the trope descriptions was used. If ‘-500’ is appended to ‘tropetrip’, then the 4800-dimensional skip-thought embeddings representing the descriptions in Equations 15 and 17 are projected to 500 dimensions using a fully connected layer.
ks-mem vs. rw-mem: ‘ks-mem’ refers to the Knowledge-Store memory, and ‘rw-mem’ refers to the Read-Write memory.
ndialog: The number of dialogue snippets $N_{diag}$ used as input for the attention models. Any attention model without an explicitly listed $N_{diag}$ uses the default setting.
7.1 Ablation Results
Baseline vs. Attention Model. The attention model shows a large improvement over the baseline models. This matches our intuition that not every quote is strongly indicative of character trope. Some may be largely expository or ‘chit-chat’ pieces of dialogue. Example attention scores are shown in Section 7.2.
Though our experiments showed marginal improvement between using the ‘char’ data and the ‘3’ data, we found that using all 3 inputs had greater performance for models with the triplet loss and read-only memory. This is likely because the others’ lines and context capture more of the social dynamics and situations that are described in the trope descriptions. Subsequent results are shown only for the ‘attn_3’ models.
Trope Description Triplet Loss. Adding the trope description loss alone provided relatively small gains in performance, though we see greater gains when it is combined with memory. While both use the descriptions, the difference may arise because the Knowledge-Store memory matches an embedding against all the tropes, whereas the trope triplet loss is only provided information from one positive and one negative example.
Memory Modules. The Knowledge-Store memory in particular was helpful. Initialized with the trope descriptions, this memory can ‘sharpen’ queries toward one of the tropes. The Read-Write memory had smaller gains in performance. It may be that more data is required to take advantage of the write capabilities.
Combined Trope Description Triplet Loss and Memory Modules. Using the triplet loss with memory modules led to greater performance when compared to the ‘attn_3’ model, but the performance is close to that of using either the triplet loss only or memory only. However, when we increase the number of dialogue snippets $N_{diag}$ to 16 or 32, we find a jump in performance. This is likely because the model has both increased learning capacity and a larger sample of data in every batch, which means at least some of the quotes should be informative about the trope.
7.2 Attention Scores
Because the inter-snippet encoder provides such a large gain in performance compared to the baseline model, we provide an example illustrating the weights placed on a batch of snippets. Figure 4 shows the attention scores for the character’s lines in the “byronic hero” trope. Matching what we might expect for an antihero personality, we find the top weighted line to be full of confidence and heroic bluster, while the middle lines hint at the character’s personal turmoil. We also find the low-weighted sixth and seventh lines to be largely uninformative (e.g. “I heard things.”), and the last line to be perhaps too pessimistic and negative for a hero, even a byronic one.
7.3 Purity scores of character clusters
Finally, we measure our ability to recover the trope ‘clusters’ (with one trope being a cluster of its characters) with our embeddings through the purity score used in Bamman et al. (2014). Equation 19 measures the amount of overlap between two clusterings, where $N$ is the total number of characters, $g_i$ is the $i$-th ground truth cluster, and $c_j$ is the $j$-th predicted cluster.
$$\mathrm{Purity} = \frac{1}{N} \sum_{i} \max_{j} |g_i \cap c_j| \tag{19}$$
We use a simple agglomerative clustering method on our embeddings with a parameter for the number of clusters. The methods in Bamman et al. (2014) contain a similar hyper-parameter for the number of persona clusters. We note that the metrics are not completely comparable because not every character in the original dataset was found on IMDB. The results are shown in Table 3. It might be expected that our model would perform better because we use the character tropes themselves as training data. However, dialogue may be noisier than the movie summary data; their better performing Persona Regression (PR) model also uses useful metadata features such as the movie genre and character gender. We simply note that our scores are comparable or higher.
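The purity computation can be sketched as follows, assuming the score sums, over ground-truth clusters, the largest overlap with any predicted cluster and divides by the number of characters. The character names and cluster labels are toy values.

```python
from collections import defaultdict

def purity(gold, predicted):
    """Purity between two clusterings given as character -> cluster id dicts."""
    gold_clusters = defaultdict(set)
    pred_clusters = defaultdict(set)
    for character, label in gold.items():
        gold_clusters[label].add(character)
    for character, label in predicted.items():
        pred_clusters[label].add(character)
    # For each ground-truth cluster, take its best-matching predicted cluster.
    overlap = sum(
        max(len(members & cluster) for cluster in pred_clusters.values())
        for members in gold_clusters.values()
    )
    return overlap / len(gold)

gold = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
predicted = {"a": "x", "b": "x", "c": "y", "d": "y", "e": "y"}
score = purity(gold, predicted)
perfect = purity(gold, {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2})
```

In the toy example, four of the five characters fall in the best-matching predicted cluster for their trope, so the purity is 0.8; identical clusterings up to relabeling give 1.0.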
8 Application: Narrative Analysis
We collected IMDB quotes for the top 250 movies on IMDB. For every character, we calculated a character embedding by taking the average embedding produced by passing all the dialogues through our model. We then calculated movie embeddings by taking the weighted sum of all the character embeddings in the movie, with the weight as the percentage of quotes they had in the movie. By computing distances between pairs of character or movie embeddings, we could potentially unearth notable similarities. We note some of the interesting clusters below.
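The aggregation described above can be sketched in a few lines. The per-quote embeddings are placeholders for model outputs, and using the number of embedded quotes as the quote count is an assumption of this example.

```python
def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def movie_embedding(character_embeddings, quote_counts):
    """Sum character embeddings weighted by each character's share of quotes."""
    total = sum(quote_counts.values())
    dim = len(next(iter(character_embeddings.values())))
    movie = [0.0] * dim
    for name, embedding in character_embeddings.items():
        share = quote_counts[name] / total
        for i, x in enumerate(embedding):
            movie[i] += share * x
    return movie

# Hypothetical per-quote embeddings for two characters in one movie.
quote_embeddings = {
    "A": [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
    "B": [[0.0, 1.0]],
}
characters = {name: mean_vector(vs) for name, vs in quote_embeddings.items()}
counts = {name: len(vs) for name, vs in quote_embeddings.items()}
movie = movie_embedding(characters, counts)
```

Weighting by quote share means a movie's embedding is dominated by its most-quoted characters, which fits the goal of grouping movies by their prominent personas.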
8.1 Clustering Characters
Grumpy old men: Carl Fredricksen (Up); Walt Kowalski (Gran Torino)
Shady tricksters, crooks, well versed in deceit: Ugarte (Casablanca); Eames (Inception)
Intrepid heroes, adventurers: Indiana Jones (Indiana Jones and the Last Crusade); Nemo (Finding Nemo); Murph (Interstellar)
8.2 Clustering Movies
Epics, historical tales: Amadeus, Ben-Hur
Tortured individuals, dark, violent: Donnie Darko, Taxi Driver, Inception, The Prestige
Gangsters, excess: Scarface, Goodfellas, Reservoir Dogs, The Departed, Wolf of Wall Street
9 Conclusion
We used the character trope classification task as a test bed for learning personas from dialogue. Our experiments demonstrate that the use of a multi-level attention mechanism greatly outperforms a baseline GRU model. We were also able to leverage prior knowledge in the form of textual descriptions of the trope. In particular, using these descriptions to initialize our Knowledge-Store memory helped improve performance. Because we use short text and can leverage domain knowledge, we believe future work could use our models for applications such as personalized dialogue systems.
Appendix A Read-Write Memory Losses
a.1 Memory Ranking Loss
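We want the model to learn to make efficient matches with the memory keys in order to facilitate lookup on past data. To do this, we find a positive and a negative neighbor after computing the $k$ nearest neighbors $(n_1, \dots, n_k)$ of the persona embedding $z$, by finding the smallest index $p$ such that $V[n_p] = P$ and the smallest index $b$ such that $V[n_b] \neq P$, respectively, where $P$ is the ground-truth trope. We define the memory ranking loss as:
$$J_{MR} = \max\left(0, \alpha_{MR} + \mathrm{sim}(z, K[n_b]) - \mathrm{sim}(z, K[n_p])\right)$$
where $\alpha_{MR}$ is a learnable margin parameter and $\mathrm{sim}(\cdot, \cdot)$ denotes the similarity between persona embeddings ($z$), key representations of positive ($K[n_p]$) and negative ($K[n_b]$) neighbors. The above equation is consistent with the memory loss defined in the original work Kaiser and Nachum (2017).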
a.2 Memory Classification Loss
The Read-Write memory returns the top-$k$ matching keys and their values. The probability of the given input dialogues belonging to a particular persona category is computed by passing the values returned from the memory through a fully-connected layer. We replace the predicted distribution in Equation 12 with this memory-based prediction and calculate the categorical cross-entropy to obtain the memory classification loss.
Appendix B Experimental details
The vocabulary size was set to 20000. We used a GRU hidden size of 200, a word embedding size of 300, and the word embedding lookup was initialized with GloVe Pennington et al. (2014). For the Read-Write memory module, we used k=8 when calculating the nearest neighbors and a memory size of 150. Our models were trained using Adam Kingma and Ba (2014).
- TVtropes.org defines a byronic hero as “Sometimes an Anti-Hero, others an Anti-Villain, or even Just a Villain, Byronic heroes are charismatic characters with strong passions and ideals, but who are nonetheless deeply flawed individuals….”
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- David Bamman, Brendan O’Connor, and Noah A Smith. 2014. Learning latent personas of film characters. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), page 352.
- Murray R Barrick and Michael K Mount. 1993. Autonomy as a moderator of the relationships between the big five personality dimensions and job performance. Journal of applied Psychology, 78(1):111.
- Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
- Lucie Flekova and Iryna Gurevych. 2015. Personality profiling of fictional characters using sense-level links between lexical resources. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1805–1816.
- Jennifer Golbeck, Cristina Robles, Michon Edmondson, and Karen Turner. 2011. Predicting personality from Twitter. In Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom), 2011 IEEE Third International Conference on, pages 149–156. IEEE.
- Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
- Elad Hoffer and Nir Ailon. 2015. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer.
- John T Jost, Tessa V West, and Samuel D Gosling. 2009. Personality and ideology as determinants of candidate preferences and “Obama conversion” in the 2008 US presidential election. Du Bois Review: Social Science Research on Race, 6(1):103–124.
- Łukasz Kaiser, Ofir Nachum, Aurko Roy, and Samy Bengio. 2017. Learning to remember rare events. arXiv preprint.
- Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302.
- Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In International Conference on Machine Learning, pages 1378–1387.
- Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155.
- François Mairesse and Marilyn Walker. 2007. Personage: Personality generation for dialogue. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 496–503.
- Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. arXiv preprint arXiv:1606.03126.
- Clifford Nass and Kwan Min Lee. 2000. Does computer-generated speech manifest personality? an experimental test of similarity-attraction. In Proceedings of the SIGCHI conference on Human Factors in Computing Systems, pages 329–336. ACM.
- Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
- Maarten Selfhout, William Burk, Susan Branje, Jaap Denissen, Marcel Van Aken, and Wim Meeus. 2010. Emerging late adolescent friendship networks and big five personality traits: A social network approach. Journal of personality, 78(2):509–538.
- Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. arXiv preprint arXiv:1503.02364.
- Phillip R Shaver and Kelly A Brennan. 1992. Attachment styles and the “big five” personality traits: Their connections with each other and with romantic relationship outcomes. Personality and Social Psychology Bulletin, 18(5):536–545.
- Shashank Srivastava, Snigdha Chaturvedi, and Tom M Mitchell. 2016. Inferring interpersonal relations in narrative summaries. In AAAI, pages 2807–2813.
- Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in neural information processing systems, pages 2440–2448.
- Ernest C Tupes and Raymond E Christal. 1992. Recurrent personality factors based on trait ratings. Journal of personality, 60(2):225–251.
- Josep Valls-Vargas, Jichen Zhu, and Santiago Ontanón. 2014. Toward automatic role identification in unannotated folk tales. In Tenth Artificial Intelligence and Interactive Digital Entertainment Conference.
- Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
- Marilyn A Walker, Janet E Cahn, and Stephen J Whittaker. 1997. Improvising linguistic style: Social and affective bases for agent personality. In Proceedings of the first international conference on Autonomous agents, pages 96–105. ACM.
- Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? arXiv preprint arXiv:1801.07243.