Unsupervised Learning of Style-sensitive Word Vectors



This paper presents the first study aimed at capturing stylistic similarity between words in an unsupervised manner. We propose extending the continuous bag of words (CBOW) model [Mikolov et al.(2013a)Mikolov, Chen, Corrado, and Dean] to learn style-sensitive word vectors using a wider context window, under the assumption that the style of all the words in an utterance is consistent. In addition, we introduce a novel task of predicting lexical stylistic similarity and create a benchmark dataset for this task. Our experiments with this dataset support our assumption and demonstrate that the proposed extensions contribute to the acquisition of style-sensitive word embeddings.


1 Introduction

Analyzing and generating natural language texts requires capturing two important aspects of language: what is said and how it is said. In the literature, much more attention has been paid to what is said. Recently, however, capturing how it is said, such as stylistic variations, has also proven useful for natural language processing tasks such as classification, analysis, and generation [Pavlick and Tetreault(2016), Niu and Carpuat(2017), Wang et al.(2017)Wang, Jojic, Brockett, and Nyberg].

This paper studies the stylistic variations of words in the context of the representation learning of words. The lack of subjective or objective definitions is a major difficulty in studying style [Xu(2017)]. Previous attempts have been made to define selected aspects of the notion of style (e.g., politeness) [Mairesse and Walker(2007), Pavlick and Nenkova(2015), Flekova et al.(2016)Flekova, PreoŢiuc-Pietro, and Ungar, Preotiuc-Pietro et al.(2016)Preotiuc-Pietro, Xu, and Ungar, Sennrich et al.(2016)Sennrich, Haddow, and Birch, Niu et al.(2017)Niu, Martindale, and Carpuat]; however, it is not straightforward to create strict guidelines for identifying the stylistic profile of a given text. This hampers both the systematic evaluation of style-sensitive word representations and their learning in a supervised manner. In addition, there is another line of research toward controlling style-sensitive utterance generation without defining style dimensions [Li et al.(2016)Li, Galley, Brockett, Spithourakis, Gao, and Dolan, Akama et al.(2017)Akama, Inada, Inoue, Kobayashi, and Inui]; however, this line of research considers style to be something associated with a given specific character, i.e., a persona, and does not aim to capture the space of stylistic variation.

The contributions of this paper are three-fold. (1) We propose a novel architecture that acquires style-sensitive word vectors (Figure 1) in an unsupervised manner. (2) We construct a novel dataset for style, which consists of pairs of style-sensitive words, with each pair scored according to its stylistic similarity. (3) We demonstrate that our word vectors successfully capture the stylistic similarity between two words. In addition, our training script and dataset are available at https://jqk09a.github.io/style-sensitive-word-vectors/.

Figure 1: Word vector capturing stylistic and syntactic/semantic similarity.

2 Style-sensitive Word Vector

The key idea is to extend the continuous bag of words (CBOW) model [Mikolov et al.(2013a)Mikolov, Chen, Corrado, and Dean] by distinguishing nearby contexts from wider contexts, under the assumption that a style persists throughout every single utterance in a dialog. We elaborate on this idea in this section.

2.1 Notation

Let $w_t$ denote the target word (token) in the corpora and $\mathcal{U}_t$ denote the utterance (word sequence) including $w_t$. Here, $w_{t+d}$ or $w_{t-d}$ is a context word of $w_t$ (e.g., $w_{t+1}$ is the context word next to $w_t$), where $d$ is the distance between the context word and the target word $w_t$.

For each word (token) $w$, boldface $\boldsymbol{v}_w$ and $\tilde{\boldsymbol{v}}_w$ denote the vector of $w$ and the vector predicting the word $w$, respectively. Let $\mathcal{V}$ denote the vocabulary.

2.2 Baseline Model (CBOW-near-ctx)

First, we give an overview of CBOW, which is our baseline model. CBOW predicts the target word $w_t$ given the nearby context words in a window of width $\delta$:

$$c_{\mathit{near}}(w_t) := \{ w_{t \pm d} : 1 \le d \le \delta \}.$$

The set $c_{\mathit{near}}(w_t)$ contains at most $2\delta$ words in total: $\delta$ words to the left and $\delta$ words to the right of the target word. Specifically, we train the word vectors $\tilde{\boldsymbol{v}}_{w_t}$ and $\boldsymbol{v}_c$ ($c \in c_{\mathit{near}}(w_t)$) by maximizing the following prediction probability:

$$P(w_t \mid c_{\mathit{near}}(w_t)) = \frac{\exp\bigl(\tilde{\boldsymbol{v}}_{w_t}^{\top} \bar{\boldsymbol{v}}\bigr)}{\sum_{w' \in \mathcal{V}} \exp\bigl(\tilde{\boldsymbol{v}}_{w'}^{\top} \bar{\boldsymbol{v}}\bigr)}, \qquad \bar{\boldsymbol{v}} = \frac{1}{|c_{\mathit{near}}(w_t)|} \sum_{c \in c_{\mathit{near}}(w_t)} \boldsymbol{v}_c.$$
CBOW captures both semantic and syntactic word similarity through training with nearby context words. We refer to this form of CBOW as CBOW-near-ctx. Note that, in the implementation of Mikolov et al. (2013b), the window width $\delta$ is sampled from a uniform distribution; however, in this work, we fixed $\delta$ for simplicity. Hereafter, throughout our experiments, we turn off the random resizing of $\delta$.
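As a concrete illustration, the near-context set $c_{\mathit{near}}(w_t)$ can be extracted from a tokenized utterance with a simple slice. This is our own sketch, not the authors' training code; the function name and example utterance are illustrative.

```python
def near_context(tokens, t, delta):
    """Return up to `delta` words on each side of position t (target excluded),
    i.e., c_near(w_t) = {w_{t±d} : 1 <= d <= delta}, truncated at utterance edges."""
    left = tokens[max(0, t - delta):t]
    right = tokens[t + 1:t + 1 + delta]
    return left + right

utterance = ["i", "am", "a", "ninja", "of", "the", "hidden", "village"]
print(near_context(utterance, 3, 2))  # ['am', 'a', 'of', 'the']
```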

2.3 Learning Style with Utterance-size Context Window (CBOW-all-ctx)

CBOW is designed to learn the semantic and syntactic aspects of words from their nearby context [Mikolov et al.(2013b)Mikolov, Sutskever, Chen, Corrado, and Dean]. However, an interesting question is where the stylistic aspects of words can be captured. To address this question, we start with the assumption that a style persists throughout each single utterance in a dialog, that is, the stylistic profile of a word in an utterance must be consistent with the other words in the same utterance. Based on this assumption, we propose extending CBOW to use all the words in the utterance as context,

$$c_{\mathit{all}}(w_t) := \{ w \in \mathcal{U}_t \} \setminus \{ w_t \},$$

instead of only the nearby words. Namely, we expand the context window from a fixed width to the entire utterance. This training strategy is expected to yield word vectors that are more sensitive to style than to other aspects. We refer to this version as CBOW-all-ctx.

2.4 Learning the Style and Syntactic/Semantic Separately

To learn the stylistic aspect more exclusively, we further extend the learning strategy.

Distant-context Model (CBOW-dist-ctx)

First, recall that using the nearby context is effective for learning word vectors that capture semantic and syntactic similarities. Conversely, this means that using the nearby context can lead the word vectors to capture aspects other than style. Therefore, as the first extension, we propose excluding the nearby context $c_{\mathit{near}}(w_t)$ from the full context $c_{\mathit{all}}(w_t)$. In other words, we use the distant context words only:

$$c_{\mathit{dist}}(w_t) := c_{\mathit{all}}(w_t) \setminus c_{\mathit{near}}(w_t) = \{ w_{t \pm d} : d > \delta \}.$$

We expect that training with this type of context will produce word vectors containing only style-sensitive information. We refer to this method as CBOW-dist-ctx.
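The three context definitions (near, all, and distant) reduce to simple list operations over a tokenized utterance. The following sketch is our own illustration under that reading; note that $c_{\mathit{near}}$ and $c_{\mathit{dist}}$ partition $c_{\mathit{all}}$.

```python
def near_context(tokens, t, delta):
    """c_near(w_t): up to `delta` words on each side of position t."""
    return tokens[max(0, t - delta):t] + tokens[t + 1:t + 1 + delta]

def all_context(tokens, t):
    """c_all(w_t): every word in the utterance except the target itself."""
    return tokens[:t] + tokens[t + 1:]

def dist_context(tokens, t, delta):
    """c_dist(w_t) = c_all(w_t) \\ c_near(w_t): words farther than `delta`."""
    return tokens[:max(0, t - delta)] + tokens[t + 1 + delta:]
```

By construction, concatenating the near and distant contexts recovers the full-utterance context, which is the partition the CBOW-sep-ctx model below relies on.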

Separate Subspace Model (CBOW-sep-ctx)

As the second extension, to distill away aspects other than style, we use both the nearby and the full contexts ($c_{\mathit{near}}(w_t)$ and $c_{\mathit{all}}(w_t)$). As Figure 2 shows, both vectors $\boldsymbol{v}_w$ and $\tilde{\boldsymbol{v}}_w$ of each word are divided into two parts:

$$\boldsymbol{v}_w = \boldsymbol{x}_w \oplus \boldsymbol{y}_w, \qquad \tilde{\boldsymbol{v}}_w = \tilde{\boldsymbol{x}}_w \oplus \tilde{\boldsymbol{y}}_w,$$

where $\oplus$ denotes vector concatenation. Vectors $\boldsymbol{x}_w$ and $\tilde{\boldsymbol{x}}_w$ are the style-sensitive parts of $\boldsymbol{v}_w$ and $\tilde{\boldsymbol{v}}_w$, respectively; vectors $\boldsymbol{y}_w$ and $\tilde{\boldsymbol{y}}_w$ are the syntactic/semantic-sensitive parts. For training, when the context words are near the target word ($d \le \delta$), we update both the style-sensitive vectors ($\boldsymbol{x}$, $\tilde{\boldsymbol{x}}$) and the syntactic/semantic-sensitive vectors ($\boldsymbol{y}$, $\tilde{\boldsymbol{y}}$), i.e., the full vectors $\boldsymbol{v}$ and $\tilde{\boldsymbol{v}}$. Conversely, when the context words are far from the target word ($d > \delta$), we update only the style-sensitive vectors ($\boldsymbol{x}$, $\tilde{\boldsymbol{x}}$). Formally, the prediction probabilities are calculated as follows:

$$P(w_t \mid c_{\mathit{near}}(w_t)) = \frac{\exp\bigl(\tilde{\boldsymbol{v}}_{w_t}^{\top} \bar{\boldsymbol{v}}\bigr)}{\sum_{w' \in \mathcal{V}} \exp\bigl(\tilde{\boldsymbol{v}}_{w'}^{\top} \bar{\boldsymbol{v}}\bigr)}, \qquad P(w_t \mid c_{\mathit{dist}}(w_t)) = \frac{\exp\bigl(\tilde{\boldsymbol{x}}_{w_t}^{\top} \bar{\boldsymbol{x}}\bigr)}{\sum_{w' \in \mathcal{V}} \exp\bigl(\tilde{\boldsymbol{x}}_{w'}^{\top} \bar{\boldsymbol{x}}\bigr)},$$

where $\bar{\boldsymbol{v}}$ and $\bar{\boldsymbol{x}}$ are the averages of the corresponding context vectors. At training time, the two prediction probabilities (loss functions) are computed alternately, and the word vectors are updated accordingly. We refer to this method, which uses the two-fold contexts separately, as CBOW-sep-ctx.
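The two alternating prediction probabilities can be sketched with toy NumPy vectors: the near context scores words with the full concatenated vectors, while the distant context uses only the style halves. This is our illustrative sketch (random toy parameters, no training loop), not the authors' implementation; all names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_style, d_synsem = 10, 4, 4                      # toy vocabulary and dims
x, y = rng.normal(size=(V, d_style)), rng.normal(size=(V, d_synsem))    # input halves
xt, yt = rng.normal(size=(V, d_style)), rng.normal(size=(V, d_synsem))  # output halves

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def p_near(target, ctx):
    """P(w_t | c_near): average the full context vectors [x; y], score all words."""
    v_bar = np.concatenate([x[ctx].mean(axis=0), y[ctx].mean(axis=0)])
    return softmax(np.concatenate([xt, yt], axis=1) @ v_bar)[target]

def p_dist(target, ctx):
    """P(w_t | c_dist): only the style halves x participate (and get updated)."""
    return softmax(xt @ x[ctx].mean(axis=0))[target]
```

During training, gradients of the near-context loss flow into both halves, while gradients of the distant-context loss touch only the style halves, which is what separates the two subspaces.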

Figure 2: The architecture of CBOW-sep-ctx.

3 Experiments

We investigated which word vectors capture the stylistic, syntactic, and semantic similarities.

3.1 Settings

Training and Test Corpus We collected Japanese fictional stories from the Web to construct the dataset. The dataset contains approximately M utterances of fictional characters. We separated the data into a %–% split for training and testing. In Japanese, the function words at the end of a sentence often exhibit style (e.g., desu+wa, desu+ze1); therefore, we used an existing lexicon of multi-word functional expressions [Miyazaki et al.(2015)Miyazaki, Hirano, Higashinaka, Makino, and Matsuo]. Overall, the vocabulary size was K.

Hyperparameters We chose the dimensions of both the style-sensitive and the syntactic/semantic-sensitive vectors to be , and the dimensions of the baseline CBOWs were . The learning rate was adjusted individually for each part such that "the product of the learning rate and the expected number of updates" was a fixed constant. We ran the optimizer with its default settings from the implementation of Mikolov et al. (2013a). The training stopped after 10 epochs. We fixed the nearby window width to .

3.2 Stylistic Similarity Evaluation

Data Construction

To verify that our models capture stylistic similarity, we evaluated our style-sensitive vectors by comparing them with other word vectors on a novel task matching human judgments of stylistic similarity. For this evaluation, we constructed a novel dataset with human judgments on the stylistic similarity between word pairs in two steps. First, we collected only style-sensitive words from the test corpus, because some words are strongly associated with stylistic aspects [Kinsui(2003), Teshigawara and Kinsui(2011)] and, therefore, annotating random words for stylistic similarity is inefficient. We asked crowdsourced workers to select style-sensitive words in utterances. Specifically, for this crowdsourced task, we provided workers with a word-segmented utterance and asked them to pick the words they expected to be altered in different situational contexts (e.g., characters, moods, purposes, and the background cultures of the speaker and listener). Then, we randomly sampled word pairs from the selected words and asked workers to rate each pair on a five-point scale (from 1: "The style of the pair is different" to 5: "The style of the pair is similar"), inspired by syntactic/semantic similarity datasets [Finkelstein et al.(2002)Finkelstein, Gabrilovich, Matias, Rivlin, Solan, Wolfman, and Ruppin, Gerz et al.(2016)Gerz, Vulić, Hill, Reichart, and Korhonen]. Finally, among the random pairs of highly agreeing style-sensitive words, we kept only the word pairs featuring clear worker agreement, in which more than annotators rated the pair with the same sign. Consequently, we obtained word pairs with similarity scores. To our knowledge, this is the first study to create an evaluation dataset measuring lexical stylistic similarity.

In the task of selecting style-sensitive words, the pairwise inter-annotator agreement was moderate (Cohen's kappa is ). In the rating task, the pairwise inter-annotator agreement for two classes was fair (Cohen's kappa is ). These statistics suggest that, at least in Japanese, native speakers share a sense of the style-sensitivity of words and of the stylistic similarity between style-sensitive words.

Stylistic Sensitivity

We used this evaluation dataset to compute the Spearman rank correlation ($\rho$) between the cosine similarities of the learned word vectors and the human judgments. The left side of Table 1 shows the results. First, our proposed model CBOW-all-ctx outperformed the baseline CBOW-near-ctx. Furthermore, CBOW-dist-ctx and the stylistic part of CBOW-sep-ctx demonstrated even better correlations with the stylistic similarity judgments. Even though the stylistic part of CBOW-sep-ctx was trained with the same context window as CBOW-all-ctx, its style-sensitivity was boosted by the joint training with the near context. CBOW-dist-ctx, which uses only the distant context, slightly outperforms CBOW-sep-ctx. These results indicate the effectiveness of training with a wider context window.
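The evaluation protocol above (Spearman's $\rho$ between model cosine similarities and human ratings) can be sketched as follows. The helper functions are our own minimal pure-Python implementations, not the authors' evaluation script.

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors (plain lists)."""
    dot = sum(p * q for p, q in zip(a, b))
    na = sum(p * p for p in a) ** 0.5
    nb = sum(q * q for q in b) ** 0.5
    return dot / (na * nb)

def spearman(a, b):
    """Spearman rank correlation with average ranks for ties."""
    def ranks(v):
        s = sorted(v)
        return [sum(i + 1 for i, u in enumerate(s) if u == val) / s.count(val)
                for val in v]
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    num = sum((p - ma) * (q - mb) for p, q in zip(ra, rb))
    den = (sum((p - ma) ** 2 for p in ra) * sum((q - mb) ** 2 for q in rb)) ** 0.5
    return num / den
```

In use, one would compute `cosine` for each annotated word pair from the learned vectors and correlate the resulting list against the human similarity scores with `spearman`.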

Model                                   | ρ (style) | ρ (JWSD) | SyntaxAcc@5 | SyntaxAcc@10
CBOW-near-ctx                           | 12.1      | 27.8     | 86.3        | 85.2
CBOW-all-ctx                            | 36.6      | 24.0     | 85.3        | 84.1
CBOW-dist-ctx                           | 56.1      | 15.9     | 59.4        | 58.8
CBOW-sep-ctx (stylistic part)           | 51.3      | 28.9     | 68.3        | 66.2
CBOW-sep-ctx (syntactic/semantic part)  |  9.6      | 18.1     | 88.0        | 87.0

Table 1: Results of the quantitative evaluations.
Word | Top similar words w.r.t. cosine similarity

俺 (I; male, colloquial)
  stylistic half: おまえ (you; colloquial, rough), あいつ (he/she; colloquial, rough), ねーよ (not; colloquial, rough, male)
  syntactic/semantic half: 僕 (I; male, colloquial, childish), あたし (I; female, childish), 私 (I; formal)

拙者 (I; classical, e.g., samurai, ninja)
  stylistic half: でござる (be; classical), ござる (be; classical), ござるよ (be; classical)
  syntactic/semantic half: 僕 (I; male, childish), 俺 (I; male, colloquial), 私 (I; formal)

かしら (wonder; female)
  stylistic half: わね (QUESTION; female), ないわね (not; female), わ (SENTENCE-FINAL; female)
  syntactic/semantic half: かな (wonder; childish), でしょうか (wonder; formal), かしらね (wonder; female)

サンタ (Santa Claus; shortened)
  stylistic half: サンタクロース (Santa Claus; -), トナカイ (reindeer; -), クリスマス (Christmas; -)
  syntactic/semantic half: お客 (customer; slightly polite), プロデューサー (producer; -), メイド (maid; shortened)

shit
  stylistic half: fuckin, fuck, goddamn
  syntactic/semantic half: shitty, crappy, sucky

hi
  stylistic half: hello, bye, hiya, meet
  syntactic/semantic half: goodbye, goodnight, good-bye

guys
  stylistic half: stuff, guy, bunch
  syntactic/semantic half: boys, humans, girls

ninja
  stylistic half: shinobi, genin, konoha
  syntactic/semantic half: shinobi, pirate, soldier

Table 2: The top similar words for the style-sensitive and syntactic/semantic vectors learned with the proposed model, CBOW-sep-ctx. Japanese words are translated into English by the authors. Legend: (translation; impression).

3.3 Syntactic and Semantic Evaluation

We further investigated the properties of each model using the following criteria: (1) the model's ability to capture the syntactic aspect was assessed through a task predicting part of speech (POS), and (2) the model's ability to capture the semantic aspect was assessed through the correlation with human judgments of semantic similarity.

Syntactic Sensitivity

First, we tested each model's ability to capture syntactic similarity by checking whether the POS of each word matched the POS of its neighboring words in the vector space. Specifically, we calculated SyntaxAcc@$k$, defined as follows:

$$\mathrm{SyntaxAcc@}k = \frac{1}{|\mathcal{V}|} \sum_{w \in \mathcal{V}} \frac{1}{k} \sum_{w' \in \mathcal{N}_k(w)} \mathbb{1}\bigl[\mathrm{POS}(w) = \mathrm{POS}(w')\bigr],$$

where $\mathbb{1}[\cdot]$ is $1$ if the condition inside is true and $0$ otherwise, the function $\mathrm{POS}(w)$ returns the actual POS tag of the word $w$, and $\mathcal{N}_k(w)$ denotes the set of the top-$k$ similar words to $w$ w.r.t. cosine similarity in each vector space.
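A minimal sketch of computing SyntaxAcc@$k$ from a vector matrix and gold POS tags follows; this is our illustration (a real run would use the trained vectors and a POS tagger), and the function name is our own.

```python
import numpy as np

def syntax_acc_at_k(vecs, pos_tags, k):
    """Average fraction of each word's k nearest neighbours (cosine) sharing its POS."""
    normed = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)          # exclude the word itself
    total = 0.0
    for i in range(len(vecs)):
        nn = np.argsort(-sims[i])[:k]        # indices of the k most similar words
        total += np.mean([pos_tags[i] == pos_tags[j] for j in nn])
    return total / len(vecs)
```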

Table 1 shows SyntaxAcc@$k$ with $k = 5$ and $k = 10$. For both $k$, the syntactic/semantic part $\boldsymbol{y}$ of CBOW-near-ctx, CBOW-all-ctx, and CBOW-sep-ctx achieved similarly high accuracy. Interestingly, even though the stylistic part $\boldsymbol{x}$ of CBOW-sep-ctx used the same context as CBOW-all-ctx, its syntactic sensitivity was suppressed. We speculate that the syntactic sensitivity was distilled off into the other part of the CBOW-sep-ctx vector, i.e., $\boldsymbol{y}$, which was learned using only the near context and thus captured more syntactic information. In the next section, we analyze the different characteristics of $\boldsymbol{x}$ and $\boldsymbol{y}$ in CBOW-sep-ctx.

Semantic and Topical Sensitivities

To test each model's ability to capture semantic similarity, we also measured correlations with the Japanese Word Similarity Dataset (JWSD) [Sakaizawa and Komachi(2018)], which consists of Japanese word pairs annotated with semantic similarity scores by human workers. For each model, we calculate the Spearman rank correlation ($\rho$) between the cosine similarity scores and the human judgments on JWSD; the results are shown in Table 1.2 CBOW-dist-ctx has the lowest score; however, surprisingly, the stylistic part of CBOW-sep-ctx has the highest score, while both of its parts correlate with JWSD. This result indicates that the proposed stylistic vector captures not only stylistic similarity but also semantic similarity, contrary to our expectations (ideally, we want the stylistic vector to capture only stylistic similarity). We speculate that this is because not only the style but also the topic is often consistent within a single utterance. For example, "サンタ (Santa Claus)" and "トナカイ (reindeer)" are topically related words, and such words tend to appear in the same utterance. Therefore, stylistic vectors trained on all the context words in an utterance also capture topic relatedness. In addition, JWSD contains topic-related word pairs and synonym pairs; therefore, word vectors that capture topic similarity achieve a higher $\rho$. We discuss this point further in the next section.

3.4 Analysis of Trained Word Vectors

Finally, to further understand what types of features our CBOW-sep-ctx model acquired, we show some words3 with their four most similar words in Table 2. For English readers, we also report results for English words.4 The English results also exemplify the performance of our model on another language. The left side of Table 2 (the stylistic part $\boldsymbol{x}$) shows that the Japanese word "拙者 (I; classical)" is similar to "ござる (be; classical)" or words containing it (the second row of Table 2). This result looks reasonable, because words such as "拙者 (I; classical)" and "ござる (be; classical)" are typically used by Japanese samurai or ninja. The vectors captured the similarity of these words, which are stylistically consistent across syntactic and semantic varieties. Conversely, the right side of the table (the syntactic/semantic part $\boldsymbol{y}$) shows that "拙者 (I; classical)" is similar to other personal pronouns (e.g., "僕 (I; male, childish)"). We further confirmed that the remaining top similar words are also personal pronouns (not shown due to space limitations). These results indicate that the proposed CBOW-sep-ctx model jointly learns two different types of lexical similarity, i.e., stylistic and syntactic/semantic similarity, in the different parts of its vectors. However, our stylistic vector also captured topic similarity, e.g., "サンタ (Santa Claus)" and "トナカイ (reindeer)" (the fourth row of Table 2). Therefore, there is still room for improvement in capturing stylistic similarity.
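Queries like those in Table 2 amount to a nearest-neighbour search restricted to one half of the concatenated CBOW-sep-ctx vector. The following sketch is our own illustration under that assumption (toy vectors, illustrative names; the real vectors would come from training).

```python
import numpy as np

def top_similar(word, vocab, vecs, half, k=4):
    """Rank words by cosine similarity computed on one half of each vector.

    half='style' uses the first d/2 dims, half='synsem' the last d/2,
    following the [x; y] concatenation layout described above.
    """
    d = vecs.shape[1] // 2
    sub = vecs[:, :d] if half == "style" else vecs[:, d:]
    sub = sub / np.linalg.norm(sub, axis=1, keepdims=True)
    sims = sub @ sub[vocab.index(word)]
    order = [i for i in np.argsort(-sims) if vocab[i] != word]
    return [vocab[i] for i in order[:k]]
```

With trained vectors, querying the style half should surface stylistically consistent neighbours, and the syntactic/semantic half should surface words of the same grammatical or topical type.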

4 Conclusions and Future Work

This paper presented the unsupervised learning of style-sensitive word vectors, which extends CBOW by distinguishing nearby contexts from wider contexts. We created a novel dataset for style, in which the stylistic similarity between word pairs was scored by humans. Our experiments demonstrated that our method leads word vectors to distinguish the stylistic aspect from other semantic or syntactic aspects. We also found that our training cannot avoid conflating some styles and topics. A future direction is to address this issue by introducing further contexts, such as document- or dialog-level context windows, where topics are often consistent but styles are not.


This work was supported by JSPS KAKENHI Grant Number 15H01702. We thank our anonymous reviewers for their helpful comments and suggestions.


  1. These words mean the verb be in English.
  2. Note that the low performance of our baseline ( for CBOW-near-ctx) is unsurprising compared to English baselines (cf. Taguchi et al. (2017)).
  3. We arbitrarily selected style-sensitive words from our stylistic similarity evaluation dataset.
  4. We trained another CBOW-sep-ctx model on an English fan-fiction dataset that was collected from the Web.


  1. Reina Akama, Kazuaki Inada, Naoya Inoue, Sosuke Kobayashi, and Kentaro Inui. 2017. Generating stylistically consistent dialog responses with transfer learning. In Proceedings of the Eighth International Joint Conference on Natural Language Processing. pages 408–412.
  2. Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems 20(1):116–131. https://doi.org/10.1145/503104.503110.
  3. Lucie Flekova, Daniel PreoŢiuc-Pietro, and Lyle Ungar. 2016. Exploring stylistic variation with age and income on twitter. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. pages 313–319. https://doi.org/10.18653/v1/P16-2051.
  4. Daniela Gerz, Ivan Vulić, Felix Hill, Roi Reichart, and Anna Korhonen. 2016. Simverb-3500: A large-scale evaluation set of verb similarity. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. pages 2173–2182. https://doi.org/10.18653/v1/D16-1235.
  5. Satoshi Kinsui. 2003. Vaacharu nihongo: yakuwari-go no nazo (In Japanese). Tokyo, Japan: Iwanami.
  6. Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. pages 994–1003. https://doi.org/10.18653/v1/P16-1094.
  7. Francois Mairesse and Marilyn Walker. 2007. Personage: Personality generation for dialogue. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. pages 496–503.
  8. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of Workshop at the International Conference on Learning Representations.
  9. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. In The 26th Annual Conference on Neural Information Processing Systems. pages 3111–3119.
  10. Chiaki Miyazaki, Toru Hirano, Ryuichiro Higashinaka, Toshiro Makino, and Yoshihiro Matsuo. 2015. Automatic conversion of sentence-end expressions for utterance characterization of dialogue systems. In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation. pages 307–314.
  11. Xing Niu and Marine Carpuat. 2017. Discovering stylistic variations in distributional vector space models via lexical paraphrases. In Proceedings of the Workshop on Stylistic Variation at the 2017 Conference on Empirical Methods in Natural Language Processing. pages 20–27. https://doi.org/10.18653/v1/W17-4903.
  12. Xing Niu, Marianna Martindale, and Marine Carpuat. 2017. A study of style in machine translation: Controlling the formality of machine translation output. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pages 2804–2809. https://doi.org/10.18653/v1/D17-1299.
  13. Ellie Pavlick and Ani Nenkova. 2015. Inducing lexical style properties for paraphrase and genre differentiation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 218–224. https://doi.org/10.3115/v1/N15-1023.
  14. Ellie Pavlick and Joel Tetreault. 2016. An empirical analysis of formality in online communication. Transactions of the Association of Computational Linguistics 4:61–74.
  15. Daniel Preotiuc-Pietro, Wei Xu, and Lyle H. Ungar. 2016. Discovering user attribute stylistic differences via paraphrasing. In Proceedings of the 30th AAAI Conference on Artificial Intelligence. pages 3030–3037.
  16. Yuya Sakaizawa and Mamoru Komachi. 2018. Construction of a japanese word similarity dataset. In Proceedings of the 11th International Conference on Language Resources and Evaluation. pages 948–951.
  17. Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 35–40. https://doi.org/10.18653/v1/N16-1005.
  18. Yuya Taguchi, Hideaki Tamori, Yuta Hitomi, Jiro Nishitoba, and Kou Kikuta. 2017. Learning Japanese word distributional representation considering of synonyms (in Japanese). Technical Report 17, The Asahi Shimbun Company, Retrieva Inc.
  19. Mihoko Teshigawara and Satoshi Kinsui. 2011. Modern Japanese ‘role language’ (yakuwarigo): fictionalised orality in Japanese literature and popular culture. Sociolinguistic Studies 5(1):37.
  20. Di Wang, Nebojsa Jojic, Chris Brockett, and Eric Nyberg. 2017. Steering output style and topic in neural response generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pages 2140–2150. https://doi.org/10.18653/v1/D17-1228.
  21. Wei Xu. 2017. From shakespeare to twitter: What are language styles all about? In Proceedings of the Workshop on Stylistic Variation at the 2017 Conference on Empirical Methods in Natural Language Processing. pages 1–9. https://doi.org/10.18653/v1/W17-4901.