The emotions that we perceive in music: the influence of language and lyrics comprehension on agreement


In the present study, we address the relationship between the emotions perceived in pop and rock music (mainly in Euro-American styles with English lyrics) and the language spoken by the listener. Our goal is to understand the influence of lyrics comprehension on the perception of emotions and use this information to improve Music Emotion Recognition (MER) models. Two main research questions are addressed:

  1. Are there differences and similarities between the emotions perceived in pop/rock music by listeners raised with different mother tongues?

  2. Do personal characteristics have an influence on the perceived emotions for listeners of a given language?

Personal characteristics include the listeners’ general demographics, familiarity with and preference for the fragments, and music sophistication [14]. Our hypothesis is that inter-rater agreement among subjects (as measured by Krippendorff’s alpha coefficient) is directly influenced by their comprehension of the lyrics.


Juan Sebastián Gómez Cañón, Perfecto Herrera, Emilia Gómez, Estefanía Cano
Music Technology Group, Universitat Pompeu Fabra, Spain
Social and Cognitive Computing Department, A*STAR, Singapore

1 Introduction

Computational models able to predict emotions in audio signals are designed to evaluate emotionally-relevant acoustic features from a particular data set. These features are used to narrow the so-called "semantic gap" between the physical properties of an audio signal (e.g., spectral flux, zero crossing rate) and the semantic concepts related to these properties (e.g., emotions, mood) [7, 3]. The field of MER has been evaluated consistently since 2007 in the Music Information Retrieval Evaluation eXchange (MIREX) Audio Mood Classification task. This task consists of classifying audio into five mood categories or clusters, each containing different emotional tags. However, the performance of these algorithms has reached a "glass ceiling" given the nature of the data and the annotations [3]. This is due to the low agreement in the annotations of the data sets and the widespread confusion between annotating the emotion perceived in or expressed by music and the emotion felt by or induced in the listener. The low agreement problem extends to other research problems in Music Information Research/Retrieval (MIR): music auto-tagging [2], music genre recognition [19, 20], and music similarity [5].

Concretely, algorithms in MIR have natural upper bounds in performance due to the quality of human annotations, including the lack of an agreed methodology for gathering these annotations and the degree of subjectivity in the annotation process. Given this subjective nature of the perception of emotions, low agreement will lead to limited performance and poor generalization to new data.

2 Related work

In this study we focus on the perception of emotions in music, which requires a brief conceptualization of the logical progression by which a "musical communication" is achieved from the musician to the listener. This process begins with the expression of an emotion by the performer or composer, which can be perceived by a listener and perhaps arouse a particular emotion. The listener can then regard the music as aesthetically valuable and finally like it [11]. However, this single-channel perspective is debatable, since perception and induction of emotions can operate in parallel (e.g., happy music can induce melancholy or sad music can induce excitement). Perception refers to the emotion that a listener identifies when listening to music, which can differ from the emotion the musician intended to express or convey and from the emotion aroused or induced. The reason to focus on the perception of emotions is twofold: it is relatively less influenced by situational factors of listening (environment, mood, etc.) [22], and listeners are generally consistent in their ratings of the emotional expression of music [11, 10, 9]. Additionally, certain musical features can be associated with particular emotions (e.g., happiness: fast mean tempo, small tempo variability, bright timbre, sharp duration contrasts; sadness: slow mean tempo, legato articulation, low sound level, dull timbre, slow vibrato; fear: staccato articulation, very low sound level, large sound level variability, large timing variations). The relation between features and the expression of emotion has been widely researched in the literature [11, 10].

Certain attributes and characteristics of emotion are important to continue formalizing the definition of emotions: 1) emotions appear to vary in intensity (e.g., irritation - rage), 2) emotions appear to involve distinct qualitative feelings (e.g., feeling afraid - feeling nostalgic), 3) some pairs of emotions appear more similar than others (e.g., joy and contentment versus joy and disgust), and 4) some emotions appear to have opposites (e.g., happiness - sadness, love - hate, calm - worry) [10]. These attributes have led to the definition of two dominant approaches to emotion representation [23]:

  • Categorical approach: emotions are represented as categories, distinct from each other, such as happiness, sadness, anger, surprise, and fear. Hevner wrote a seminal paper on finding and grouping 66 adjectives into 8 groups of emotions [8]. Ekman defined a set of basic emotions in terms of human goal-relevant events that occurred during evolution [4]. The discrete categories of emotions can also be regarded as colors: there are different shades of sadness, but there is an abrupt change from one category (sad - blue) to another (angry - red) [11]. In this case, the major drawbacks are: 1) the number of primary emotion categories is too small compared to the richness of music emotion perceived by humans (i.e., poor resolution) and 2) the high ambiguity of using language to describe rich human emotions [22].

  • Dimensional approach: emotions are conceptualized based on their positions on a small number of dimensions, mainly valence and arousal. Russell popularized the two-dimensional circumplex model, where the valence dimension describes the pleasantness or positiveness of the emotion and arousal describes the activation or energy (e.g., happiness would have positive arousal and valence) [16]. However, the major drawback is that categories are not mutually exclusive and tend to overlap (e.g., rage and anger), making the mapping of categories onto the dimensional space vague and unreliable [22].

In theory, certain categories of emotions could be directly mapped from a categorical to a dimensional approach [11]. However, this mapping is not necessarily straightforward for emotions such as nostalgia or transcendence. Although both approaches offer several variations and advantages over each other, some researchers in the field of psychology have chosen categorical models over dimensional models [11]. The main reasons for this are: 1) categories are needed to capture and do justice to complex human emotions experienced and perceived by listeners (i.e., complex emotions such as nostalgia, awe, and transcendence cannot be reduced to a deterministic value in the valence-arousal (VA) space), and 2) emotions tend to show discreteness and boundaries between them (e.g., content - happy), rather than continuity, when tested [13].

In the particular case of instrumental classical music, researchers have attempted to find the relationship between listeners’ characteristics and perceived emotions [17]. Web surveys were designed to collect information about personal characteristics and the perception of 15 segments of Beethoven’s 3rd Symphony (Eroica). The authors used a selection of the GEMS categorical model of emotions [23], including the following emotions: transcendence, peacefulness, power, joyful activation, tension, sadness, anger, disgust, fear, surprise, and tenderness. With respect to personal characteristics, the results suggest that: 1) the perception of transcendence and power correlates significantly with basic user characteristics; 2) participants trained in classical music tend to disagree more on perceived emotions of peacefulness, tension, sadness, anger, disgust, and fear; 3) the agreement among perceived emotions decreases with increasing familiarity with the piece. With respect to the correlation of perceived music characteristics irrespective of listener characteristics, they found that: 1) there are substantial correlations between pairs of anger, fear, and disgust; 2) peacefulness is negatively correlated with power and tension, but positively correlated with tenderness; 3) power is significantly correlated with tension and anger; 4) transcendence and surprise do not show significant correlation with other aspects.

The goal of this work is to study the relationship between listeners’ demographics, preference, familiarity, musical knowledge, and native language and the agreement of perceived emotions in music. We aim to characterize perceived emotion with respect to these factors, and attempt to replicate previous studies that show lower agreement in perceived emotions among subjects with more musical experience and knowledge. In order to achieve this, we analyze agreement of emotion ratings of musical fragments from different styles that have been previously annotated. The rest of the paper is structured as follows: in Section 3 we detail the methodology of our study, including the selected measures, musical excerpts, and annotation gathering scheme. Section 4 provides partial results of our study, which are discussed further in Section 5.

3 Methodology

3.1 Agreement metrics

Following [17], we use inter-rater reliability statistics (i.e., Krippendorff’s $\alpha$) to assess the agreement of the annotated data with respect to different individual characteristics and the understanding of the semantic content of music [12]. Krippendorff’s $\alpha$ coefficient is defined as: "$\alpha$ is the extent to which the proportion of the differences (amongst all observations) that are in the error deviates from perfect agreement. […] It is the proportion of the observed to expected above-chance agreement". In general, $\alpha$ is defined as:

\[ \alpha = 1 - \frac{D_o}{D_e} \]

where $D_o$ is the measure of observed disagreement:

\[ D_o = \frac{1}{n} \sum_{c} \sum_{k} o_{ck} \; {}_{\text{metric}}\delta^2_{ck} \]

and $D_e$ is a measure of the expected disagreement given chance:

\[ D_e = \frac{1}{n(n-1)} \sum_{c} \sum_{k} n_c \, n_k \; {}_{\text{metric}}\delta^2_{ck} \]

Here $o_{ck}$ is the frequency of observed coincidences of the values or ranks $c$ and $k$, $n_c$ and $n_k$ are the frequencies of the values or ranks $c$ and $k$, and $n$ is the total number of paired values or ranks. If $c = k$, the coincidences are matching ratings (i.e., all users have the same rating on a particular emotion). If $c \neq k$, the ratings are mismatched. Advantages of using $\alpha$ are that it is suitable for any number of observers and any type of metric (nominal, interval, ordinal), it can handle incomplete or missing data, and it does not require a minimum sample size. When disagreement is absent ($D_o = 0$), there is perfect reliability ($\alpha = 1$). Conversely, when agreement and disagreement are a matter of chance ($D_o = D_e$), there is absence of reliability ($\alpha = 0$). Nevertheless, $\alpha$ could be smaller than zero due to sampling errors (too small sample sizes) and systematic disagreements (agreement below what would be expected by chance).

The difference function ${}_{\text{metric}}\delta^2_{ck}$ is the squared difference between any two values or ranks $c$ and $k$, and its form depends on the type of metric. In the case of ordinal metrics (i.e., using Likert response formats for emotional ratings), it is standardized as:

\[ {}_{\text{ordinal}}\delta^2_{ck} = \left( \frac{\sum_{g=c}^{k} n_g - \frac{n_c + n_k}{2}}{n - \frac{n_{c_{\max}} + n_{c_{\min}}}{2}} \right)^2 \]

where $c_{\max}$ is the largest and $c_{\min}$ is the smallest rank among all ranks.
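The computation described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name `krippendorff_alpha_ordinal`, the data layout (rows are rated items, columns are raters, `np.nan` marks a missing rating), and the small example matrix are all assumptions. The sketch uses the unstandardized ordinal difference, because the standardizing denominator is the same constant for every pair of ranks and cancels in the ratio of observed to expected disagreement.

```python
import numpy as np

def krippendorff_alpha_ordinal(ratings):
    """Krippendorff's alpha with the ordinal difference function.

    ratings: 2-D array-like, rows = rated items, columns = raters,
    np.nan = missing rating (layout is an assumption of this sketch).
    """
    data = np.asarray(ratings, dtype=float)
    values = np.unique(data[~np.isnan(data)])      # sorted distinct ranks
    index = {v: i for i, v in enumerate(values)}
    size = len(values)

    # Coincidence matrix o_ck: each item with m >= 2 ratings contributes
    # every ordered pair of ratings from different raters, weighted 1/(m-1).
    o = np.zeros((size, size))
    for row in data:
        present = row[~np.isnan(row)]
        m = len(present)
        if m < 2:
            continue                               # unpaired ratings are ignored
        counts = np.zeros(size)
        for v in present:
            counts[index[v]] += 1
        o += (np.outer(counts, counts) - np.diag(counts)) / (m - 1)

    n_c = o.sum(axis=1)                            # marginal frequencies
    n = n_c.sum()                                  # total number of paired values

    # Ordinal difference: (sum_{g=c..k} n_g - (n_c + n_k) / 2) ** 2
    cum = np.cumsum(n_c)
    delta2 = np.zeros((size, size))
    for c in range(size):
        for k in range(c + 1, size):
            between = cum[k] - cum[c] + n_c[c]
            delta2[c, k] = delta2[k, c] = (between - (n_c[c] + n_c[k]) / 2) ** 2

    d_o = (o * delta2).sum() / n                           # observed disagreement
    d_e = (np.outer(n_c, n_c) * delta2).sum() / (n * (n - 1))  # expected by chance
    return 1.0 - d_o / d_e

nan = np.nan
# Illustrative ratings only: 12 items rated by up to 4 raters on ranks 1-5.
example = [
    [1, 1, nan, 1], [2, 2, 3, 2], [3, 3, 3, 3], [3, 3, 3, 3],
    [2, 2, 2, 2], [1, 2, 3, 4], [4, 4, 4, 4], [1, 1, 2, 1],
    [2, 2, 2, 2], [nan, 5, 5, 5], [nan, nan, 1, 1], [nan, 3, nan, nan],
]
alpha = krippendorff_alpha_ordinal(example)
print(round(alpha, 3))
```

The sketch does not guard against the degenerate case in which all paired ratings share a single rank, where the expected disagreement is zero and the coefficient is undefined.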

3.2 Music material

For our agreement study, we selected a set of 22 music fragments from the 4Q emotion data set [15], which has been previously annotated with categories in the four arousal-valence quadrants: Q1 (A+V+), Q2 (A+V-), Q3 (A-V-), Q4 (A-V+). We use fragments of pop and rock music since these musical styles can be considered neutral and homogeneous even when sung in different languages. The music fragments (30 seconds long) were collected from the AllMusic API, and 289 emotion tags were selected from the original AllMusic tags and intersected with Warriner’s list [21]. In this way, emotion tags were mapped to AV space. Finally, the authors of the data set conducted a manual blind validation to remove inadequate fragments (e.g., containing noise or speech). The data set is balanced, with 225 fragments per quadrant and a total of 900 clips.
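The quadrant labels follow directly from the signs of arousal and valence. As a small sketch, assuming both dimensions are centered at zero and that a value of exactly zero counts as positive (neither assumption is stated in the text, and the function name `quadrant` is hypothetical):

```python
def quadrant(arousal, valence):
    """Map a point in arousal-valence space to Q1 (A+V+), Q2 (A+V-),
    Q3 (A-V-) or Q4 (A-V+). Zero is treated as positive (assumption)."""
    if arousal >= 0:
        return "Q1" if valence >= 0 else "Q2"
    return "Q4" if valence >= 0 else "Q3"

print(quadrant(0.7, -0.6))
```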

We use the emotion tags of the Geneva Emotion Music Scale (GEMS) [18] and a subset of basic emotions [4] to rate the different fragments (see Table 1). In comparison to [17], we replaced Disgust with Bitterness in an attempt to improve the balance of the number of emotions per quadrant (see Table 1). Nonetheless, Q3 only contains two emotions, whereas the other quadrants contain three. To select the fragments for this experiment, we queried the data set for the selected emotions. Since some emotions were not found in the metadata, synonyms were used to select the songs, as seen in the third column of Table 1. After the automatic selection of songs by the query, we manually selected two fragments per emotion, preferring fragments with lyrics (only two songs are instrumental).

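| Quadrant | Emotion | Synonym |
|---|---|---|
| Q1 (A+V+) | Joyful activation | joy |
| Q1 (A+V+) | Power | - |
| Q1 (A+V+) | Surprise | - |
| Q2 (A+V-) | Anger | angry |
| Q2 (A+V-) | Fear | anguished |
| Q2 (A+V-) | Tension | tense |
| Q3 (A-V-) | Bitterness | bitter |
| Q3 (A-V-) | Sadness | sad |
| Q4 (A-V+) | Tenderness | gentle |
| Q4 (A-V+) | Peace | - |
| Q4 (A-V+) | Transcendence | spiritual |

Table 1: Selected emotions and query synonyms for song selection.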
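Table 1 can be read as a small lookup structure. The sketch below encodes it as a dictionary and derives the query term for each emotion. The names `EMOTIONS` and `query_term` are hypothetical, and using the emotion name itself when no synonym is listed is an assumption about how the query was run.

```python
from collections import Counter

# Emotion -> (quadrant, query synonym); None where Table 1 lists "-".
EMOTIONS = {
    "joyful activation": ("Q1", "joy"),
    "power": ("Q1", None),
    "surprise": ("Q1", None),
    "anger": ("Q2", "angry"),
    "fear": ("Q2", "anguished"),
    "tension": ("Q2", "tense"),
    "bitterness": ("Q3", "bitter"),
    "sadness": ("Q3", "sad"),
    "tenderness": ("Q4", "gentle"),
    "peace": ("Q4", None),
    "transcendence": ("Q4", "spiritual"),
}

def query_term(emotion):
    """Return the synonym if one is listed, else the emotion name (assumption)."""
    _, synonym = EMOTIONS[emotion]
    return synonym if synonym is not None else emotion

# Number of emotions per quadrant, showing that Q3 has one emotion fewer.
per_quadrant = Counter(q for q, _ in EMOTIONS.values())
print(dict(per_quadrant))
```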

Very few music emotion studies have explored different styles of music. Several studies also point to the WEIRDness of music psychology studies [6]: the participants in experiments usually come from Western, Educated, Industrialized, Rich, and Democratic countries. Since we base our approach on re-annotating previously annotated data, it is difficult to find annotated emotion data sets that contain music with lyrics in languages other than English. Nonetheless, we managed to include 3 songs in Spanish. The summary of previously annotated emotions, quadrants, language of lyrics, and song information can be seen in Table 2.

| Emotion | Q | Lang. | Artist - Song |
|---|---|---|---|
| Anger | Q2 | Eng. | Disincarnate - In Sufferance |
| Anger | Q2 | Inst. | Obituary - Redneck Stomp |
| Bitterness | Q3 | Eng. | Liz Phair - Divorce Song |
| Bitterness | Q3 | Eng. | Lou Reed - Heroine |
| Fear | Q2 | Inst. | Joe Henry - Nico Lost One Small Buddha |
| Fear | Q2 | Eng. | Silverstein - Worlds Apart |
| Joy | Q1 | Eng. | Taio Cruz - Dynamite |
| Joy | Q1 | Eng. | Miami Sound Machine - Conga |
| Peace | Q4 | Eng. | Jim Brickman - Simple Things |
| Peace | Q4 | Spa. | Gloria Estefan - Mi Buen Amor |
| Power | Q1 | Eng. | Ultra Montanes - Anyway |
| Power | Q1 | Eng. | Rose Tattoo - Rock n Roll Outlaw |
| Sadness | Q3 | Eng. | Motorhead - Dead and Gone |
| Sadness | Q3 | Spa. | Juan Luis Guerra - Sobremesa |
| Surprise | Q1 | Eng. | The Jordanaires - Hound Dog |
| Surprise | Q1 | Eng. | Shakira - Animal City |
| Tenderness | Q4 | Eng. | Celine Dion - Beautiful Boy |
| Tenderness | Q4 | Spa. | Beyonce - Amor Gitano |
| Tension | Q2 | Eng. | Pennywise - Pennywise |
| Tension | Q2 | Eng. | Squeeze - Here Comes That Feeling |
| Transcendence | Q4 | Eng. | Steven C. Chapman - Made for Worshipping |
| Transcendence | Q4 | Eng. | Matisyahu - On Nature |

Table 2: Song selection with emotion and quadrant information.

3.3 Annotation Methodology

| Configuration | Ratings | % | Q1 joy | Q1 surp. | Q1 pow. | Q2 ang. | Q2 fear | Q2 tens. | Q3 sad | Q3 bit. | Q4 peace | Q4 tend. | Q4 trans. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| All | 23562/23562 | 100.00% | 0.401 | 0.064 | 0.268 | 0.355 | 0.193 | 0.274 | 0.286 | 0.240 | 0.367 | 0.346 | 0.057 |
| By Preference (>3) | 9427/23562 | 40.01% | 0.407 | 0.075 | 0.267 | 0.282 | 0.165 | 0.185 | 0.320 | 0.228 | 0.368 | 0.350 | 0.064 |
| By Preference (<3) | 7755/23562 | 32.91% | 0.321 | 0.039 | 0.261 | 0.384 | 0.179 | 0.332 | 0.251 | 0.203 | 0.372 | 0.348 | 0.046 |
| By Familiarity (>3) | 3795/23562 | 16.11% | 0.456 | 0.052 | 0.153 | 0.332 | 0.262 | 0.157 | 0.329 | 0.318 | 0.224 | 0.194 | 0.034 |
| By Familiarity (<3) | 18183/23562 | 77.17% | 0.313 | 0.047 | 0.277 | 0.345 | 0.167 | 0.292 | 0.243 | 0.194 | 0.408 | 0.379 | 0.069 |
| By Understanding (>3) | 11616/23562 | 49.30% | 0.424 | 0.079 | 0.268 | 0.280 | 0.183 | 0.210 | 0.324 | 0.274 | 0.334 | 0.336 | 0.043 |
| By Understanding (<3) | 8261/23562 | 35.06% | 0.336 | 0.035 | 0.251 | 0.375 | 0.181 | 0.318 | 0.231 | 0.181 | 0.362 | 0.299 | 0.067 |

Table 3: Krippendorff’s $\alpha$ for each emotion for all participants, filtered by preference, familiarity, and lyrics comprehension (positive and negative). We use a 5-point Likert response format and consider ratings higher than 3 (neither agree nor disagree) as positive and ratings lower than 3 as negative.

Our study is structured as follows: we first provide a brief explanation to show participants the difference between induced and perceived emotions. Additionally, we use synonyms for each emotion to clarify the annotations (see Figure 1). To collect user ratings, we created online surveys in four languages (Spanish, English, German, and Mandarin), using two excerpts per emotion, for a total of 22 excerpts. Besides emotion ratings, we also collect personal information about musical knowledge, musical taste, listeners’ familiarity with the stimuli, listeners’ understanding of the lyrics, and demographics.

All audio excerpts were normalized to the range from -1 to 1, and every participant first had to complete a step to set the playback volume for the survey with respect to a 1 kHz sinusoid. Figure 1 shows an example of one of the questions from the survey. In order to measure the musical knowledge of the participants, we use the Music Sophistication Index [14] and make the results available to the participants at the end of the survey as a small thank you.
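The two audio preparation steps can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used for the survey: normalization is interpreted here as peak normalization, and the sampling rate, tone duration, and the names `peak_normalize` and `calibration_tone` are all hypothetical.

```python
import numpy as np

SAMPLE_RATE = 44100  # Hz; assumed, the text does not state a sampling rate

def peak_normalize(signal):
    """Scale a signal so that its largest absolute sample equals 1."""
    signal = np.asarray(signal, dtype=float)
    peak = np.max(np.abs(signal))
    if peak == 0:
        return signal              # silent input is returned unchanged
    return signal / peak

def calibration_tone(duration_s=1.0, frequency_hz=1000.0, sample_rate=SAMPLE_RATE):
    """Generate a full-scale sinusoid for setting the playback volume."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    return np.sin(2 * np.pi * frequency_hz * t)

excerpt = np.array([0.1, -0.4, 0.25, 0.05])   # toy samples, not real audio
normalized = peak_normalize(excerpt)
tone = calibration_tone()
```

Peak normalization equalizes the maximum sample value across excerpts but not their perceived loudness, so a loudness-based normalization would be a different design choice.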

4 Results and Discussion

Participation so far has been unbalanced, as Tables 3 and 4 show. In the color coding, a cell is green when the agreement for the emotion is more than 0.05 higher than the agreement measured across the 126 participants from all languages (coded in yellow). Conversely, the cell is red when the difference is less than -0.05. A first observation is that agreement on complex emotions, such as bitterness, fear, power, surprise, and transcendence, is very low (approximately 0.2). On the other hand, higher agreement is reached for more basic emotions, such as anger, joy, peace, sadness, and tenderness.

We use the word filter because we remove the ratings of fragments that were, or were not, preferred, familiar, or understood by the listeners. Inter-rater agreement can also be evaluated across every song with positive and negative filters. Since we use a 5-point Likert response format, we consider ratings higher than 3 (neither agree nor disagree) as positive and ratings lower than 3 as negative (see Figure 1). From now on we refer to each combination of an emotion and a filter as a pair, e.g., joy - preference refers to the positive and negative preference filters applied to the emotion joy. In the case of Q1 and Q3, agreement is higher with positive than with negative filtering for all the filters (with the exception of power - familiarity). Quadrants Q1 (A+V+) and Q3 (A-V-) can be considered very distinct and "universal". Conversely, quadrants Q2 and Q4 show the opposite behavior: agreement tends to be lower with positive than with negative filtering for all the filters (with the exceptions of fear - familiarity, fear - understanding, tenderness - preference, transcendence - preference, and tenderness - understanding). We call a pair an exception when its results do not follow the tendencies described above.
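The tendencies and exceptions described above can be checked mechanically against the values in Table 3. The sketch below copies the filtered rows of that table and flags every emotion - filter pair that departs from the expected direction (positive filtering higher for Q1 and Q3, lower for Q2 and Q4). All variable names are hypothetical.

```python
EMOTIONS = ["joy", "surprise", "power", "anger", "fear", "tension",
            "sadness", "bitterness", "peace", "tenderness", "transcendence"]

# Krippendorff's alpha per emotion, copied from Table 3 (ratings > 3).
POSITIVE = {
    "preference":    [0.407, 0.075, 0.267, 0.282, 0.165, 0.185, 0.320, 0.228, 0.368, 0.350, 0.064],
    "familiarity":   [0.456, 0.052, 0.153, 0.332, 0.262, 0.157, 0.329, 0.318, 0.224, 0.194, 0.034],
    "understanding": [0.424, 0.079, 0.268, 0.280, 0.183, 0.210, 0.324, 0.274, 0.334, 0.336, 0.043],
}
# Same layout for the negative filters (ratings < 3).
NEGATIVE = {
    "preference":    [0.321, 0.039, 0.261, 0.384, 0.179, 0.332, 0.251, 0.203, 0.372, 0.348, 0.046],
    "familiarity":   [0.313, 0.047, 0.277, 0.345, 0.167, 0.292, 0.243, 0.194, 0.408, 0.379, 0.069],
    "understanding": [0.336, 0.035, 0.251, 0.375, 0.181, 0.318, 0.231, 0.181, 0.362, 0.299, 0.067],
}
# Emotions in Q1 and Q3, where positive filtering is expected to be higher.
Q1_Q3 = {"joy", "surprise", "power", "sadness", "bitterness"}

exceptions = []
for filter_name in POSITIVE:
    for emotion, pos, neg in zip(EMOTIONS, POSITIVE[filter_name], NEGATIVE[filter_name]):
        expected_positive_higher = emotion in Q1_Q3
        if (pos > neg) != expected_positive_higher:
            exceptions.append((emotion, filter_name))

for emotion, filter_name in exceptions:
    print(f"{emotion} - {filter_name}")
```

With 11 emotions and 3 filters there are 33 pairs in total, so the number of pairs following the tendency is 33 minus the number of flagged exceptions.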

| Emotion | Eng. (26) | Spa. (56) | Man. (27) | Ger. (17) | All (126) |
|---|---|---|---|---|---|
| anger | 0.429 | 0.311 | 0.367 | 0.482 | 0.364 |
| bitter | 0.278 | 0.209 | 0.155 | 0.278 | 0.202 |
| fear | 0.241 | 0.175 | 0.091 | 0.207 | 0.171 |
| joy | 0.304 | 0.437 | 0.311 | 0.476 | 0.372 |
| peace | 0.401 | 0.332 | 0.401 | 0.438 | 0.371 |
| power | 0.379 | 0.287 | 0.296 | 0.325 | 0.289 |
| sad | 0.330 | 0.343 | 0.279 | 0.378 | 0.326 |
| surprise | 0.041 | 0.055 | 0.068 | 0.218 | 0.075 |
| tender | 0.444 | 0.314 | 0.452 | 0.581 | 0.396 |
| tension | 0.264 | 0.324 | 0.282 | 0.323 | 0.296 |
| transc. | 0.080 | 0.049 | 0.083 | -0.012 | 0.057 |

Table 4: Krippendorff’s $\alpha$ for each emotion and four questionnaires (number of participants in parentheses).
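The color-coding rule used for the tables compares each cell with the all-participants value and applies a threshold of 0.05 in either direction. A minimal sketch, assuming strict inequalities at the threshold and using the hypothetical function name `cell_color`:

```python
THRESHOLD = 0.05

def cell_color(alpha_group, alpha_all, threshold=THRESHOLD):
    """Return "green" if the group's agreement exceeds the overall agreement
    by more than the threshold, "red" if it falls short by more than the
    threshold, and "none" otherwise (strict inequalities are an assumption)."""
    difference = alpha_group - alpha_all
    if difference > threshold:
        return "green"
    if difference < -threshold:
        return "red"
    return "none"

# Values taken from Table 4: (language column, All column).
print(cell_color(0.429, 0.364))  # anger, English
print(cell_color(0.091, 0.171))  # fear, Mandarin
print(cell_color(0.332, 0.371))  # peace, Spanish
```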

Furthermore, unbalanced participation in the surveys forces us to evaluate all subjects together. It is important to note that Table 4 evaluates agreement over all subjects of each questionnaire, while Table 3 evaluates agreement only over the subjects that rated a song in line with a particular filter. For this reason, Table 3 contains information about the number of ratings taken into account when using the filters.
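To make the filtering step concrete, the sketch below splits a list of ratings into positive and negative subsets for one filter and reports the share of ratings kept, as in the Ratings and % columns of Table 3. The `Rating` record, its field names, and the toy data are assumptions made for illustration, not the data format of the study.

```python
from dataclasses import dataclass

@dataclass
class Rating:
    participant: str
    excerpt: str
    emotion_score: int   # 1-5 Likert rating of one emotion
    preference: int      # 1-5
    familiarity: int     # 1-5
    understanding: int   # 1-5

def split_by_filter(ratings, attribute):
    """Return (positive, negative) subsets: > 3 is positive, < 3 is negative.
    Ratings with the neutral value 3 fall into neither subset."""
    positive = [r for r in ratings if getattr(r, attribute) > 3]
    negative = [r for r in ratings if getattr(r, attribute) < 3]
    return positive, negative

def share(subset, ratings):
    """Percentage of all ratings that a subset retains."""
    return 100.0 * len(subset) / len(ratings)

# Toy data, for illustration only.
ratings = [
    Rating("p1", "e1", 4, 5, 5, 4),
    Rating("p2", "e1", 5, 4, 4, 2),
    Rating("p3", "e1", 2, 3, 3, 3),
    Rating("p4", "e1", 3, 2, 2, 5),
    Rating("p5", "e1", 4, 1, 1, 1),
]
familiar, unfamiliar = split_by_filter(ratings, "familiarity")
print(len(familiar), len(unfamiliar), share(familiar, ratings))
```

Agreement would then be computed separately on each subset, which is why the subsets can differ widely in size and in the number of raters per excerpt.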

It is important to acknowledge that, in the case of the familiarity filter, only 16% of the ratings were positive. This means that familiarity with the selected music (and the personal memories attached to it) should have a small influence on the experiment, and that this filter is unbalanced when evaluating agreement. Since agreement depends on the number of subjects and the number of ratings made, comparisons based on the familiarity filter are likely to be very noisy. The exceptions fear - familiarity and power - familiarity show a difference of 12% that could be explained by this. On the other hand, the remaining exceptions (fear - understanding, tenderness - preference, transcendence - preference, and tenderness - understanding) show a variability of 1-3% when using the filters. The remaining 27 pairs show the overall behavior described previously, with a variation of 1-18% depending on the emotion and filter.

5 Conclusions

Our results confirm that basic emotions have higher universal agreement, while complex ones show the opposite (as seen in Table 4). However, agreement in our experiment appears to be lower than the values reported in [1, 5]. Following [12], data should only be considered reliable when obtaining $\alpha \geq 0.8$.

Our initial results suggest that lyrics comprehension (LC) improves agreement for emotions in quadrants Q1 and Q3, and decreases it for quadrants Q2 and Q4. This relates to the type of emotions found in each quadrant and to the subjectivity of valence that has been previously researched. This has given us a new understanding of the effect of LC and its impact on different emotions: 1) in the case of Q1 and Q3, better understanding could lead to similar emotions being recognized and result in higher agreement in the intra-linguistic setting; 2) in the case of Q2 and Q4, better understanding could lead to finer criteria when judging and result in lower agreement in the intra-linguistic setting.

This study is part of ongoing research that now focuses on improving the present design. We are considering balancing the styles with respect to the different languages. This would mean including music related to each culture/language, with all languages in the study represented. Additionally, it is highly debatable whether pop and rock are in fact neutral and homogeneous, since several variations across the world show different ways to convey emotions (e.g., Hindi pop). Lastly, the experiment could have biased the responses, in the sense that asking about understanding forces the subjects to pay attention to the lyrics, which reduces ecological validity. Future studies will consider other types of tests to verify whether subjects effectively understood the lyrics (e.g., using a "fill in the lyrics" question after listening to each fragment).


  • [1] Anna Aljanaki, Yi-Hsuan Yang, and Mohammad Soleymani. Developing a benchmark for emotional analysis of music. PLoS ONE, pages 1–22, 2017.
  • [2] Emmanuel Bigand and Jean-Julien Aucouturier. Seven problems that keep MIR from attracting the interest of cognition and neuroscience. Journal of Intelligent Information Systems, 41(3):483–497, 2013.
  • [3] Òscar Celma, Perfecto Herrera, and Xavier Serra. Bridging the Music Semantic Gap. In Workshop on Mastering the Gap: From Information Extraction to Semantic Representation, volume 187, Budva, Montenegro, 2006. CEUR.
  • [4] Paul Ekman. An argument for basic emotions. Cognition and Emotion, 6(3-4):169–200, 1992.
  • [5] Arthur Flexer and Thomas Grill. The Problem of Limited Inter-rater Agreement in Modelling Music Similarity. Journal of New Music Research, 45(3):239–251, 2016.
  • [6] Joseph Henrich, Steven J. Heine, and Ara Norenzayan. The weirdest people in the world? Behavioral and Brain Sciences, 33(2-3):61–83, 2010.
  • [7] Perfecto Herrera. MIRages : an account of music audio extractors, semantic description and context-awareness, in the three ages of MIR. PhD thesis, Universitat Pompeu Fabra, 2018.
  • [8] Kate Hevner. Experimental studies of the elements of expression in music. American Journal of Psychology, 48(2):246–268, 1936.
  • [9] Patrik N Juslin. Perceived Emotional Expression in Synthesized Performances of a Short Melody: Capturing the Listener’s Judgment Policy. Musicae Scientiae, 1(2):225–256, 1997.
  • [10] Patrik N. Juslin. Handbook of Music and Emotion: Theory, Research, Applications. Oxford University Press, Oxford, 2010.
  • [11] Patrik N. Juslin. Musical Emotions Explained. Oxford University Press, Oxford, 1 edition, 2019.
  • [12] Klaus H. Krippendorff. Content Analysis: An Introduction to Its Methodology. SAGE Publications, 2 edition, 2004.
  • [13] Petri Laukka. Categorical perception of vocal emotion expressions. Emotion (Washington, D.C.), 5:277–295, 2005.
  • [14] Daniel Müllensiefen, Bruno Gingras, Jason Musil, and Lauren Stewart. The Musicality of Non-Musicians: An Index for Assessing Musical Sophistication in the General Population. PLoS ONE, 9(2):89642, 2014.
  • [15] Renato Panda, Ricardo Malheiro, and Rui Pedro Paiva. Musical texture and expressivity features for music emotion recognition. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, 2018.
  • [16] James A. Russell. A Circumplex Model of Affect. Journal of Personality and Social Psychology, 39(6):1161–1178, 1980.
  • [17] Markus Schedl, Emilia Gómez, Erika S. Trent, Marko Tkalčič, Hamid Eghbal-Zadeh, and Agustín Martorell. On the Interrelation Between Listener Characteristics and the Perception of Emotions in Classical Orchestra Music. IEEE Transactions on Affective Computing, 9(4):507–525, 2018.
  • [18] Klaus R Scherer. Expression of Emotion in Voice and Music. Journal of Voice, 9(3):235–248, 1995.
  • [19] Bob L. Sturm. Evaluating music emotion recognition: Lessons from music genre recognition? In Proceedings of the 2013 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pages 1–6, San Jose, USA, 2013.
  • [20] Bob L. Sturm. A Simple Method to Determine if a Music Information Retrieval System is a Horse. IEEE Transactions on Multimedia, 16(6), 2014.
  • [21] Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4):1191–1207, 2013.
  • [22] Yi-Hsuan Yang and Homer H. Chen. Music Emotion Recognition. CRC Press, 2011.
  • [23] Marcel Zentner, Didier Grandjean, and Klaus R Scherer. Emotions Evoked by the Sound of Music: Characterization, Classification, and Measurement. Emotion, 8(4):494–521, 2008.
Figure 1: Example of emotion rating survey in English.