WARNING: This paper contains words that people rated as humorous, including many that are offensive in nature.

Humor in Word Embeddings: Cockamamie Gobbledegook for Nincompoops

Limor Gultchin
University of Oxford
   Genevieve Patterson
   Nancy Baym
Microsoft Research
   Nathaniel Swinger
Lexington High School
   Adam Tauman Kalai
Microsoft Research

We study humor in Word Embeddings, a popular AI tool that associates each word with a Euclidean vector. We find that: (a) the word vectors capture multiple aspects of humor discussed in theories of humor; and (b) each individual’s sense of humor can be represented by a vector, and these sense-of-humor vectors accurately predict differences in people’s sense of humor on new, unrated words. The fact that single-word humor seems to be relatively easy for AI has implications for the study of humor in language. Humor ratings are taken from the work of Engelthaler and Hills (2017) as well as our own crowdsourcing study of 120,000 words.

1 Introduction

Why is humor so difficult for machine learning and AI systems to understand? In light of recent studies in psychology showing that individual words can be humorous Engelthaler & Hills (2017); Westbury et al. (2016), and in light of the fact that Word Embeddings (WEs) have been shown to capture numerous properties of words (e.g., Mikolov et al., 2013), it is natural to study if and how WEs capture humor.

First, we find that individual-word humor possesses many aspects of humor that have been discussed in general theories of humor, and that many of these aspects of humor are captured by WEs. To more deeply understand which features of humor WEs capture and to what extent, we draw on existing theories of humor to define a number of candidate features of word humor. Interestingly, many of these theories can be applied to word humor. For example, ‘incongruity theory,’ which we discuss shortly, can be found in words which juxtapose surprising combinations of words like hippo and campus in hippocampus. Incongruity can also be found in words that have surprising combinations of letters or sounds (e.g. in words borrowed from foreign languages). The ‘superiority theory’ can be clearly seen in insulting words such as twerp. The ‘relief theory’ explains why humor may be found in taboo subjects and is found in subtle and not-so-subtle sexual and scatological connotations such as cockamamie and nincompoop. The most humorous words are found to exhibit multiple humor features, i.e., funny words are funny in multiple ways. WEs are shown to capture these features to varying extents. We correlate these features with humor ratings in a crowdsourcing study.

Second, like previous studies of humor, we find single-word humor to be highly subjective. We show here that individual senses of humor can be well-represented by vectors. In particular, an embedding of each person’s sense of humor can be approximated by averaging the vectors of a handful of words they rate as funny, and this embedding successfully generalizes to predict their preferences on future unrated words.

WEs fit each word to a d-dimensional Euclidean vector based on text corpora of hundreds of billions of words. While we consider multiple embeddings, we focus mainly on the popular and comprehensive Google News WE Mikolov et al. (2013), referred to as GNEWS, which contains embeddings of three million tokens in 300 dimensions. Directions (i.e., vectors) in GNEWS and other WEs have been shown to capture numerous concepts. For example, the word pair vector difference dog:dogs captures the singular/plural distinction, and she:he captures binary gender stereotypes (e.g., Bolukbasi et al., 2016).
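As a minimal sketch of how such pair-difference directions are used, consider the standard vector-offset analogy method. The tiny four-dimensional embedding below is synthetic and purely illustrative (real WEs such as GNEWS have hundreds of dimensions learned from corpora), and the `analogy` helper is our own illustration, not code from the paper:

```python
import numpy as np

# Toy synthetic "embedding" for illustration only -- not real word vectors.
emb = {
    "dog":  np.array([1.0, 0.0, 1.0, 0.0]),
    "dogs": np.array([1.0, 1.0, 1.0, 0.0]),
    "cat":  np.array([0.0, 0.0, 1.0, 1.0]),
    "cats": np.array([0.0, 1.0, 1.0, 1.0]),
    "fish": np.array([1.0, 0.0, 0.0, 1.0]),
}

def normalize(v):
    return v / np.linalg.norm(v)

def analogy(a, b, c, emb):
    """Return the word (other than a, b, c) closest in cosine to b - a + c."""
    target = normalize(emb[b] - emb[a] + emb[c])
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: normalize(emb[w]) @ target)

# The dog:dogs offset transfers the singular/plural distinction to cat.
print(analogy("dog", "dogs", "cat", emb))  # -> cats
```

The same offset arithmetic underlies the humor directions examined later in the paper, where the difficulty is that no single word pair reliably captures the concept.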

We find humor in WE to be more slippery than the relationships above because: (a) it depends on multiple different dimensions relating to sound, topic, and internal incongruity, which cannot be described by a single pair of words; and (b) while it is well-known that humor is highly subjective, even humor ratings for individual words differ greatly across people.

Nonetheless, we find that aspects of humor and different senses of humor can be captured as vectors. The embedding of a person’s sense of humor is defined to be the mean of the words they identify as humorous. We show that WEs well-represent the differences between people’s senses of humor and generalize to new unrated words. When clustered, these vector groups differ significantly by gender and age. We introduce a “know-your-audience” test which shows that these sense-of-humor vectors can be meaningfully differentiated using only a handful of words per person.

To many readers, it may not be apparent that individual words can be amusing in and of themselves, devoid of context. However, Engelthaler & Hills (2017), henceforth referred to as EH, showed that many words reliably achieve higher average humor ratings than others through crowdsourced ratings of 4,997 nouns. Another study found consistent mean humor differences in non-word strings Westbury et al. (2016), and lists of inherently funny words have been published Beard (2009); (2017).

We show that GNEWS not only captures multiple features of aggregate humor ratings but also different senses of humor of different people. While it would be easy to use Netflix-style collaborative filtering to predict humor ratings, WEs are shown to generalize: given humor ratings on some words, they can be used to predict humor features and differences on test words that have no humor ratings. We focus on the funniest (top-rated) words, where we find the most interesting differences. Using crowdsourcing, we collected hundreds of thousands of judgments across 120,000 lower-case words and phrases from GNEWS, culminating in a study in which 1,659 people identified humorous words from the set of 216 words with highest mean ratings. These data, which will be made publicly available, may also prove useful in future research and applications.

Implications. What does this say about why humor is hard for AI systems to understand? Our study suggests that some of the fundamental features of humor are computationally simple (e.g., linear directions in WEs). This further suggests that the difficulty in understanding humor may lie in understanding rich domains such as language. If humor is simple, then humorous sentences may be hard for computers due to inadequate sentence embeddings: if you don’t speak Greek, you won’t understand a simple joke in Greek. Conversely, it also suggests that as language models improve, they may naturally come to represent humor (or at least some types of humor) without requiring significant innovation.

Furthermore, understanding single-word humor is a possible first step in understanding humor as it occurs in language. A natural next step would be to analyze humor in phrases and then short sentences; indeed, the 120,000 tokens rated include 45,353 multi-word tokens, and the set of the funniest 216 tokens includes 41 multi-word tokens.

Finally, some comedians recommend humorous word choice. For instance, Willie in the Neil Simon play and subsequent film The Sunshine Boys says:

Fifty-seven years in this business, you learn a few things. You know what words are funny and which words are not funny. Alka Seltzer is funny. You say “Alka Seltzer” you get a laugh…. Words with k in them are funny. Casey Stengel, that’s a funny name. Robert Taylor is not funny. Cupcake is funny. Tomato is not funny. Cookie is funny. Cucumber is funny. Car keys. Cleveland is funny. Maryland is not funny. Then, there’s chicken. Chicken is funny. Pickle is funny. All with a k. Ls are not funny. Ms are not funny. Simon (1974) (as cited in Kortum, 2013).

This suggests that understanding funny words in isolation may be helpful as a feature in identifying or generating longer humorous texts.

Organization. Section 2 defines humor features drawn from humor theories and briefly discusses related work. Section 3 describes our data. Section 4 covers the aggregate humor ratings and different features of humor. Section 5 analyzes how WEs capture differences in senses of humor.

2 Relevant features from humor theories

Numerous theories of humor exist, dating at least as far back as the philosophers of ancient Greece. Plato, followed by Hobbes and Descartes, contributed to a ‘superiority theory’ view of humor, which was formalized in the 20th century Morreall (2016). However, since the 18th century, two new theories of humor have surfaced and become much more widely accepted: the ‘relief theory’ and ‘incongruity theory.’

The relief theory offers a physiological explanation for the importance of laughter to our health: it argues that laughter functions as a pressure relief for our nervous system, sometimes as a means to address taboo subjects that cause such stress. Proponents of this theory were Lord Shaftesbury (1709), Herbert Spencer (1911) and Sigmund Freud (1905). Incongruity theory explains humor as a reaction to moments of uncertainty, in which two unrelated matters that seem not to fit together are juxtaposed in one sentence or event, but are then resolved. This theory can be seen as an explanation of the intuitive importance of punch lines in common jokes, where the “set up” builds an expectation that the “punch line” violates. Though logically confusing or disorienting, the revelation of the incongruity creates a humorous event when the contradiction violates our expectations in a harmless, benevolent way. Among the supporters of this theory were Kant (1790), Kierkegaard (1846) and Schopenhauer (1844).

These theories are also relevant to word humor. Based on these theories and other discussions of word humor Beard (2009); Bown (2015), we consider the following six features of word humor:

  1. Humorous sound (regardless of meaning): certain words such as bugaboo or razzmatazz are funny regardless of their meaning. This feature is related to the incongruity theory in that an unusual combination of sounds that normally do not go together can make a word sound funny. This feature is also consistent with the comedy intuition of Neil Simon, Monty Python, Dr. Seuss and the like, who discuss funny-sounding words (see, e.g. Bown, 2015).

  2. Juxtaposition/Unexpected incongruity: certain words create an association between words that are otherwise completely unrelated. For example, there is little relation between the hippocampus part of the brain and the words hippo and campus. This feature is clearly motivated by incongruity theory.

  3. Sexual connotation: some words are explicitly sexual such as sex or have sexual connotations such as thrust reverser. This can be explained by Freud’s view of humor, as a venting mechanism for social taboos. This was also discussed in the context of computational humor by Mihalcea et al. Mihalcea & Strapparava (2005).

  4. Scatological connotation: some words have connotations related to excrement to varying degrees such as nincompoop or apeshit. The justification of this feature is similar to the sexual connotation feature above.

  5. Insulting words: in the context of word humor, the superiority theory suggests that insulting words may be humorous to some people.

  6. Colloquial words: extremely informal words such as crapola can be surprising and provide relief in part because they are unusually informal.

It is interesting to study the extent to which these features correlate with humor and how well a popular WE (GNEWS) captures each one. Humor is known to vary by culture and gender, and EH focused on age and gender differences in word humor. They also found strong correlations between humor ratings and word frequency and word length, with shorter words and less frequent words tending to be rated higher. This is interesting because word length and word frequency are strongly anti-correlated. We also study word length and frequency to see how well GNEWS captures them.

2.1 Related work in computational humor

Many of these features of humor have been considered in prior work on computational humor (see, e.g. Mihalcea & Strapparava, 2005), including early work such as HAHAcronym Stock & Strapparava (2003), which focused on producing humorous acronyms. In later work, social media was also used to explore automatic-humor capabilities: Barbieri & Saggion (2014) utilized Twitter for the automatic detection of irony, while Raz (2012) focused on automatic classification of types of humor, such as anecdotes, fantasy, insult, irony, etc. Other work studies visual humor Chandrasekaran et al. (2016) or humorous image captions Shahaf et al. (2015); Radev et al. (2015). WEs have been used as features in a number of NLP humor systems (e.g. Chen & Soo, 2018; Hossain et al., 2017; Joshi et al., 2016), though the humor inherent in individual words was not studied.

3 Data

As is common, we will abuse terminology and refer to both the words and multi-word tokens from embeddings as words, for brevity. When we wish to distinguish one word from multiple words, we write single words or phrases.

3.1 The Engelthaler-Hill dataset

Our first source of data is the EH dataset, which is publicly available Engelthaler & Hills (2017). It provides mean ratings on 4,997 single words, each rated by approximately 35 raters on a scale of 1-5. The following words had the highest average rating according to EH: booty, tit, booby, hooter, nitwit, twit, waddle, tinkle, bebop, and twerp; the lowest-rated words were rape, torture, torment, gunshot, and death. They further break down the means by gender (binary, M and F) and age (over 32 and under 32). However, since the EH data is in aggregate form, it is not possible to study questions of individual humor senses beyond age and gender.

3.2 Crowdsourcing studies

The EH data serves the purpose of finding a humor direction correlating with mean humor ratings. However, in order to better understand the differences between people on the funniest of words, we sought out a smaller set of more humorous words that could be labeled by more people.

Eligible words were lower-case words or phrases, i.e., strings of the twenty-six lower-case Latin letters with at most one underscore representing a space. In our study, we omitted strings that contained digits or other non-alphabetic characters.

English-speaking labelers were recruited on Amazon’s Mechanical Turk platform. All labelers identified themselves as fluent in English, and 98% of them identified English as their native language. All workers were U.S.-based, unless otherwise mentioned. We study a subset of 120,000 words and phrases from GNEWS, chosen to be the most frequent alphabetic lower-case entries from the embedding. While our study included both words and phrases, for clarity, in the tables in this paper we often present only words. The list of 120,000 strings is included with the data.

Humor rating study. Our series of humor rating studies culminated in a set of 216 words with high mean humor ratings, which were judged by 1,678 U.S.-based raters. In each study, each participant was presented with random sets of six words and asked to select the one they found most humorous. In the first study only, the interface also enabled an individual to indicate that they found none of the six words humorous. We treat the selected word, if any, as being labeled positive and the words not selected as being labeled negative. Prior work on rating the humor in non-words found similar results for a forced-choice design and a design in which participants rated individual words on a Likert-scale Westbury et al. (2016). We also compare the results for our study and that of EH. To prevent fatigue, workers were shown words in daily-limited batches of up to fifty sextuple comparisons over the course of weeks. No worker labeled more than 16 batches.

We refer to the three humor-judging studies by the numbers of words used: 120k, 8k, and 216. In the 120k study, each string was shown to at least three different participants in three different sextuples. 80,062 strings were not selected by any participant, consistent with EH’s finding that the vast majority of words are not found to be funny. The 8k study applied the same procedure (except without a “none are humorous” option) to the 8,120 words that were chosen as the most humorous in a majority of the sextuples they were shown. Each word was shown to at least 15 different participants in random sextuples. To filter down to 216 words, we selected the words with the greatest fraction of positive labels. However, several near duplicate words appeared in this list. To avoid asking people to compare such similar words in the third stage, amongst words with the same stem, the word with the greatest score was selected. For instance, among the four words wank, wanker, wankers, and wanking, only wanker was selected for the third round as it had the greatest fraction of positive votes. The 216 words are shown in Appendix A with a sample in Table 1.

In the 216 study, each participant selected not only a set of 36 “humorous” words, but further sets of 6 “more humorous” words and a single most humorous word, as follows. The 216 words were first presented randomly in 36 sextuples comprising the 216 words. In the same task, the 36 chosen words from these sextuples were shown randomly in 6 sextuples. The selected words were presented in a final sextuple from which they selected a single word. We associate a rating of 3 with the single funniest word selected at the end, a rating of 2 with the 5 other words shown in the final sextuple, a rating of 1 with the 30 words selected that did not receive a rating of 2 or 3, and a rating of 0 with the 180 unselected words.

nut butters chits portaloos red wigglers nookie
batshit big honkin nut butters moobs moobs
bunkum yadda yadda tchotchkes crotches namby pamby
arseholes pussy willows dangly nookie foppish
wazoo nutjobs whizzy spermatogenesis dangly
cuckolded galumphing cockfights twerps ballyhoo
fusarium backhoes nutjobs boobies bitty
glockenspiel boondoggle crapola nut flush diktats
stinkbait skivvies nookie kerfuffle wazoo
razzmatazz shuttlecock ballyhoo whizzy batshit
Table 1: A random selection of 50 of the 216 words used in the main crowdsourcing study. These were selected from 120,000 tokens by multiple rounds of voting. The full list can be found in Appendix A.

Feature annotation studies. The feature annotation study drew on the same worker pool. 1,500 words were chosen from the original 120,000 words by including the 216 words discussed above plus random words (half from the 8k study and half from the 120k). We asked the workers to annotate six features discussed earlier, namely humorous sound, juxtaposition, colloquial, insulting, sexual connotation, and scatological connotation. Each feature was given a binary “yes/no” annotation, and results were averaged over at least 8 different raters per feature/word pair.

In each task, each rater was given the definition of a single feature and was presented with 30 words. They were asked to mark all words which possessed that feature by checking a box next to each word. A small number of workers were eliminated due to poor performance on gold-standard words. Further experimental details are in Appendix B.

4 Aggregate humor ratings and features

We start by studying the average humor ratings of a population. In our case, this group is U.S.-based English-speaking workers on Amazon’s Mechanical Turk platform. For this group, we can use the features of humor we identified from existing theories to see how well they correlate with humor ratings, and thus how well these theories are actually reflected in humor ratings. Equally interesting, we test how predictable the features are from the WE. Finally, we discuss how difficult it is to find a word pair that captures mean humor ratings in the same way that Paris-France aligns with country capitals or she-he captures binary gender stereotypes.

4.1 Humor features representation

We consider the six features described in Section 2 rated on 1,500 words that include many words rated highly for humor. For each of these features, we compute its predictability as follows. As in other studies, we set aside 10% of the 1,500 words as a test set. We then use ridge regression to fit a linear function to the feature values (with default parameters from scikit-learn Pedregosa et al. (2011)). Finally, we compute the correlation coefficient between the predictions and labels on the test set. This is repeated 1,000 times and the mean of the correlation coefficients is taken as the feature predictability. To compute feature correlation with humor, we use the mean ratings of the 216 words in the personalized data and assign a rating of -1 to any word not in that set.
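The predictability computation described above can be sketched as follows. The data here are synthetic stand-ins (random vectors and a noisy linear "feature score") rather than the real GNEWS vectors and crowdsourced labels, and we default to 100 repetitions instead of the paper's 1,000 to keep the sketch fast:

```python
import numpy as np
from sklearn.linear_model import Ridge

def predictability(X, y, trials=100, test_frac=0.1, seed=0):
    """Mean test-set Pearson correlation of ridge predictions over random 90/10 splits."""
    rng = np.random.default_rng(seed)
    n_test = int(len(y) * test_frac)
    corrs = []
    for _ in range(trials):
        idx = rng.permutation(len(y))
        test, train = idx[:n_test], idx[n_test:]
        model = Ridge().fit(X[train], y[train])  # default scikit-learn parameters
        corrs.append(np.corrcoef(model.predict(X[test]), y[test])[0, 1])
    return float(np.mean(corrs))

# Synthetic stand-ins for the real data: 1,500 "word vectors" (random 50-d
# instead of GNEWS embeddings) and feature scores partly linear in them.
rng = np.random.default_rng(1)
X = rng.normal(size=(1500, 50))
y = X @ rng.normal(size=50) + rng.normal(scale=2.0, size=1500)
print(round(predictability(X, y), 3))
```

With the real data, `X` would hold the embedding vectors of the 1,500 annotated words and `y` the averaged binary feature annotations.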

Figure 1 shows these correlations with humor (y-axis) versus predictability (x-axis). The WE representation is well-suited for predicting colloquial (informal) words and insults, with correlations greater than 0.5, while the feature that was most difficult to predict with GNEWS was the juxtaposition feature, with a correlation slightly greater than 0.2. Similarly, all the features had positive correlation with mean humor ratings, with funny sounding having the highest correlation.

We can easily evaluate predictability and correlations with humor for an automatic feature such as word length. For word length, we find a predictability correlation coefficient of 0.518 indicating a good ability of a linear direction in the WE to predict word length, and a correlation with mean humor ratings of -.126, consistent with EH’s findings that shorter words tend to be rated higher.

Figure 1: Feature correlation with mean humor ratings vs. feature predictability (word-embedding correlation) for the six features on the 1,500 words with feature labels.

4.2 Identifying an average humor direction

An approach that has been repeatedly used to find directions in WEs corresponding to cultural concepts is to choose vectors corresponding to the difference between embeddings of one or more word pairs. The pairs are chosen to convey the concept of interest, such as woman-man or she-he Bolukbasi et al. (2016); Caliskan et al. (2017); Garg et al. (2017), rich-poor, black-white, liberal-conservative Kozlowski et al. (2018), or sets of names that are statistically mostly Asian or White Garg et al. (2017).

While numerous words could be chosen to be associated with humor, such as humorous, funny or LOL, it is less clear what the counterpart would be. Thesauri commonly list words such as humorless, serious, and tragic as antonyms to humor words. Table 2 shows the correlation coefficient between different directions using different embeddings with the data provided by EH. Each direction was formed by taking the difference in vectors between a pair of words; a word’s score along a direction is then the inner product w · (a − b), where w is the word’s vector and a and b correspond to the vectors of the two words in the pair. In the second-to-last row the direction is the mean of all of these differences. The last row is a best-fit direction found by least-squares linear regression predicting the humor rating as a linear function of the embedding. To avoid overfitting, the data was split into 90% for training and 10% for estimating the correlation coefficient; this was repeated 1,000 times and results were averaged.

With all embeddings studied, we normalize all word vectors to unit length, as is standard (see, e.g., Bolukbasi et al., 2016). The embeddings used were the GNEWS embedding, trained with the word2vec algorithm, and two embeddings Pennington et al. (2014) trained with the GloVe algorithm: one also of 300 dimensions on a set of 6 billion tokens from Wikipedia and Gigaword (called WikiGiga), and one of 200 dimensions on 27 billion tokens from Twitter (called Twitter). The Pearson correlation coefficients were computed on the set of EH words restricted to the 4,719 words that were common to all three embeddings.

While there is a great deal of consistency in the best-fit direction, there is little consistency between the pair directions within and between embeddings. We conclude that a WE does generally have a direction that captures average humor, but it is difficult to “get one’s hands on it” using word pairs. This is unlike previous studies that found consistency between embeddings on other concepts Bolukbasi et al. (2016); Kozlowski et al. (2018). Perhaps there is just no reliable mirror for the word funny in the same way that he mirrors she. The hilarious-tragic direction achieves a correlation near 0.5 for the GNEWS and WikiGiga embeddings.
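The word-pair scoring behind Table 2 can be sketched on synthetic data. Everything below is constructed for illustration: the "word vectors" are random, the ratings are noisy projections onto a hidden humor direction, and the stand-ins for hilarious and tragic are built to roughly straddle that direction (nothing here uses the real embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    return v / np.linalg.norm(v)

# 500 normalized "word vectors" with ratings that are noisy projections onto
# a hidden humor direction (synthetic stand-ins for GNEWS and the EH data).
n, dim = 500, 50
vecs = np.array([normalize(v) for v in rng.normal(size=(n, dim))])
humor_dir = normalize(rng.normal(size=dim))
ratings = vecs @ humor_dir + 0.1 * rng.normal(size=n)

# A word-pair direction a - b, e.g. hilarious - tragic; these two vectors are
# constructed to straddle the hidden direction (hypothetical, not real words).
hilarious = normalize(humor_dir + 0.05 * rng.normal(size=dim))
tragic = normalize(-humor_dir + 0.05 * rng.normal(size=dim))
direction = hilarious - tragic

# Score each word by its inner product with the pair direction, then correlate
# the scores with the humor ratings, as in Table 2.
scores = vecs @ direction
r = np.corrcoef(scores, ratings)[0, 1]
print(round(r, 3))
```

In the real experiment the correlation depends heavily on which pair is chosen, which is exactly the inconsistency reported above.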

GNEWS WikiGiga Twitter
Direction corr. corr. corr.
humorous-humorless -0.141 -0.332 -0.183
funny-unfunny 0.005 -0.351 -0.205
hilarious-tragic 0.514 0.481 0.146
laughable-grave 0.297 0.473 0.214
LOL-serious 0.379 0.508 0.118
droll-dull 0.212 0.417 0.249
average of words 0.398 0.500 0.219
best fit 0.673 0.619 0.650
Table 2: Correlation coefficient of the EH ratings for different directions and embeddings.

In GNEWS with EH ratings, the hilarious-tragic pair was best. Figure 2 illustrates the correlation. Treated as a classification problem between the 100 highest-rated words (positives) and 100 lowest-rated words (negatives), 91.5% of them are classified correctly by the sign of their hilarious-tragic projection in GNEWS, i.e., by whether the embedding is closer to that of hilarious or tragic. Using a linear Support Vector Machine with default parameters in scikit-learn Pedregosa et al. (2011), we find a mean accuracy of 96.7% (std. dev. 0.0061) between the highest and lowest sets of 100 words, based upon 10-fold cross validation. Of Beard’s list of 100 words Beard (2009), 62 are present in GNEWS (with some form of capitalization), and all but one (brouhaha) are closer to hilarious than tragic. The words nincompoop, flibbertigibbet, and cantankerous are furthest along the hilarious-tragic axis. In a list of 1,678 words from a website entitled “Inherently funny words,” among the 1,210 appearing in GNEWS, 84.9% are closer to hilarious than to tragic. The words furthest towards tragic were die and dead. The inclusion of these words on the list illustrates the subjective nature of humor – at least one person found them humorous.
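The cross-validated SVM classification can be sketched like so, with synthetic unit vectors standing in for the embeddings of the 100 highest- and 100 lowest-rated words (the separation along a hidden direction is built in by construction, so the accuracy here only illustrates the procedure, not the paper's 96.7% result):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two sets of 100 words: 300-d vectors shifted
# along a hidden "humor" direction, then normalized to unit length.
dim = 300
humor_dir = rng.normal(size=dim)
humor_dir /= np.linalg.norm(humor_dir)
pos = rng.normal(size=(100, dim)) + 1.5 * humor_dir   # "funniest" words
neg = rng.normal(size=(100, dim)) - 1.5 * humor_dir   # "least funny" words
X = np.vstack([pos, neg])
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.array([1] * 100 + [0] * 100)

# Linear SVM with default parameters, 10-fold cross-validation.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=10)
print(round(scores.mean(), 3))
```

With the real data, `X` would hold the normalized GNEWS vectors of the 200 words and `y` their high/low rating labels.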

5 Individual differences in word humor

It is well known that humor differs across groups, cultures, and individuals. We hypothesize that an individual’s sense of humor can be successfully embedded as a vector as well. More specifically, if each person’s “sense-of-humor” embedding is taken to be the vector average of words they rate as funny, this may predict which new, unrated words different people would find funny.

Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5
word 1 gobbledegook tootsies clusterfuck whakapapa dickheads
word 2 kerfuffle fufu batshit codswallop twat
word 3 hullaballoo squeegee crapola dabbawalas cocking
word 4 razzmatazz doohickey apeshit pooja titties
word 5 gazumped weenies fugly spermatogenesis asshattery
word 6 boondoggle muumuu wanker diktats nutted
word 7 galumphing thingies schmuck annulus dong
word 8 skedaddle wigwams arseholes chokecherry wanker
word 9 guffawing weaner dickheads piggeries cockling
word 10 bamboozle peewee douchebaggery viagogo pussyfooting
sound 1.11 1.02 0.97 1.02 0.90
scatological 0.80 0.99 1.15 0.89 1.14
colloquial 0.95 1.00 1.14 0.87 1.02
insults 0.86 0.90 1.23 0.84 1.12
juxtaposition 0.89 0.86 0.99 1.10 1.13
sexual 0.81 0.91 0.99 1.00 1.25
female % 70.3% 57.5% 53.8% 52.4% 35.2%
mean age 38.5 37.4 42.3 37.2 34.8

statistically significant difference with p-value .

Table 3: Raters were clustered based on the sense-of-humor embeddings, the average vector of the words they rated funny. For each cluster, the single words that most differentiate that cluster are shown together with scores on six humor features and demographics. Note that the demographic differences emerged from clustering the sense-of-humor vectors; demographics were not used in forming the clusters.

To evaluate our hypothesis, we consider the following test inspired by the comedy maxim, “know your audience.” In this test, we take two random people p and q with funniest words a and b (rating of 3), respectively, where we require that p rated b as 0 and q rated a as 0. Note that this requirement is satisfied for 60% of pairs of people in our data. In other words, 60% of pairs of people had the property that neither one chose the other’s funniest word even in the first round. This reflects the fact that individual notions of humor indeed differ significantly.

Given the two sets of 35 other words each participant rated positively, which we call the training words, and given the two funniest words, which we call the test words, the goal is to predict which participant rated which test word funniest. Just as a good comedian would choose which joke to tell to which audience, we use the WE to predict which person rated which test word as funny, based solely on the training words. For example, the training sets might be {bamboozle, bejeezus, …, wigwams, wingnuts} and {batshit, boobies, …, weaner, whakapapa} and the test set to match might be the words {poppycock, lollygag}.

Know-your-audience test Success rate
Easy: disjoint sets of 35 training words. 78.1%
Normal: 35 training words. 68.2%
Hard: 5 training words. 65.0%
Table 4: Success rates at know-your-audience tests, which test the ability of sense-of-humor embeddings to distinguish different raters based on sense of humor.
Figure 2: Mean humor rating of EH words Engelthaler & Hills (2017) vs. GNEWS projection on the hilarious-tragic axis. Dots on the right (left) represent words closer to hilarious (tragic).

To test the embedding we simply average the training word vectors and see which average matches best to the test words. In particular, if u and v are the two training word vector averages and a and b are the corresponding test word vectors, we match a to u if and only if u · (a − b) > v · (a − b), and b to v if and only if v · (b − a) > u · (b − a).

Simple algebra shows the above two inequalities are equivalent: both reduce to (u − v) · (a − b) > 0. Thus, the success rate on the test is the fraction of eligible pairs for which (u − v) · (a − b) > 0. Note that this test is quite challenging as it involves prediction on two completely unrated words, the analog of what is referred to as a “cold start” in collaborative filtering (i.e., predicting ratings on a new unrated movie).
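The matching criterion can be sketched on synthetic data: two hypothetical raters whose positively rated words cluster around different random directions in a 20-d space (the vectors are illustrative stand-ins, not GNEWS embeddings):

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(v):
    return v / np.linalg.norm(v)

# Each rater's funny words cluster near a different random center.
dim = 20
center_p, center_q = rng.normal(size=dim), rng.normal(size=dim)
train_p = [normalize(center_p + 0.5 * rng.normal(size=dim)) for _ in range(35)]
train_q = [normalize(center_q + 0.5 * rng.normal(size=dim)) for _ in range(35)]
test_a = normalize(center_p + 0.5 * rng.normal(size=dim))  # p's funniest word
test_b = normalize(center_q + 0.5 * rng.normal(size=dim))  # q's funniest word

def know_your_audience(train_p, train_q, a, b):
    """Match test word a to rater p iff (u - v) . (a - b) > 0, where u and v
    are the mean training-word vectors (the sense-of-humor embeddings)."""
    u, v = np.mean(train_p, axis=0), np.mean(train_q, axis=0)
    return bool((u - v) @ (a - b) > 0)

print(know_your_audience(train_p, train_q, test_a, test_b))  # True = correct match
```

The success rates in Table 4 are the fraction of eligible rater pairs for which this criterion assigns the test words correctly.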

We also consider an easy and a hard version of this know-your-audience test. In the easy version, a pair of people is chosen who have disjoint sets of positively rated words, indicating distinct senses of humor. In the hard version, we use as training words only the five words each person rated 2. It is important to note that no ratings of the test words (or ratings from other participants) are used, and hence the test measures the WE’s ability to generalize (as opposed to collaborative filtering). The test results given in Table 4 were calculated by brute force over all eligible pairs in the data (1,004 for the easy test and 818,790 for the normal and hard tests).

Word rated funnier by F adjusted p-value Word rated funnier by M adjusted p-value
whakapapa 3.2e-04 sexual napalm 2.1e-11
doohickey 0.0011 poundage 1.3e-05
namby pamby 0.0014 titties 2.6e-05
hullaballoo 0.003 dong 3.2e-05
higgledy piggledy 0.0039 jerkbaits 7.4e-05
gobbledegook 0.0047 semen samples 1.8e-04
schlocky 0.008 nutted 0.0019
gazumped 0.014 cock ups 0.0021
kooky 0.026 boobies 0.0027
schmaltzy 0.033 nut butters 0.004
Table 5: Among our set of 216 words (including phrases), the ten with most confident differences in ratings across gender (using a two-sided t-test and Bonferroni correction for p-values, i.e., multiplying each p-value by 216).

5.1 Clustering

For exploratory purposes, we next clustered the raters based on their sense-of-humor embeddings, i.e., the normalized average vector of the 36 words they rated positively. We used k-means++ in scikit-learn with default parameters Pedregosa et al. (2011). For each cluster c and each word w, we define a relative mean of w for that cluster to be the difference between the fraction of raters in that cluster who rated the word positively and the overall fraction of raters who rated it positively. Table 3 shows, for each cluster, the ten single words with maximum relative mean for that cluster (phrases are not displayed to save space). Similarly, one may compute a relative vector for each cluster as the difference between its centroid and the overall centroid of all the sense vectors. A similar list is found if one sorts the words by their inner products with the relative vectors. Table 3 also shows, for each feature of humor, the mean feature score of words rated positively by the raters in that cluster, normalized by the overall mean. The number of clusters was chosen to be 5 using the elbow method Thorndike (1953), though different numbers of clusters were found to exhibit qualitatively similar findings.
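The clustering and relative-mean computation can be sketched as follows, with random binary ratings and random word vectors standing in for the real crowdsourced data and GNEWS embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 raters x 216 words of binary ratings (~36 positives
# per rater) and random 30-d "word vectors" instead of GNEWS embeddings.
n_raters, n_words, dim = 200, 216, 30
ratings = rng.random((n_raters, n_words)) < 0.17
word_vecs = rng.normal(size=(n_words, dim))

# Sense-of-humor embedding: normalized mean vector of each rater's funny words.
sense = np.array([word_vecs[r].mean(axis=0) for r in ratings])
sense /= np.linalg.norm(sense, axis=1, keepdims=True)

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(sense)

# Relative mean of word w in cluster c: positive-rating fraction within the
# cluster minus the overall positive-rating fraction.
overall = ratings.mean(axis=0)
relative_mean = np.array([ratings[labels == c].mean(axis=0) - overall
                          for c in range(5)])
top_word_per_cluster = relative_mean.argmax(axis=1)  # most distinctive word ids
print(relative_mean.shape)
```

With the real data, the rows of `relative_mean` sorted in decreasing order yield the per-cluster word lists shown in Table 3.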

The demographic differences between the clusters that emerged are perhaps surprising, since the clusters were formed using a small number of word ratings (not demographics). First, Cluster 1 is significantly female-skewed (compared to 53.4% female overall), Cluster 5 is male-skewed and younger, and Cluster 3 is older. How statistically significant are the age and gender differences between the clusters? We compute this by repeatedly shuffling and partitioning the users into groups with the same sizes as the clusters and computing statistics of the resulting group means, from which we obtain 95% confidence intervals for mean age and for percentage female. By this measure, the significant age differences were in clusters 3 and 5 and the significant gender differences were in clusters 1 and 5. Moreover, these four differences were greater than any observed in the shuffled samples.
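The shuffling procedure is a standard permutation test; a minimal sketch follows, with a hypothetical interface (per-rater statistic plus the real cluster sizes) and the number of shuffles as a free parameter, since the paper's exact count did not survive extraction.

```python
import numpy as np

def permutation_interval(values, cluster_sizes, n_iter=1000, seed=0):
    """Null 95% interval for cluster means under random assignment.

    values: per-rater statistic (e.g. age, or 1/0 for female).
    cluster_sizes: sizes of the real clusters.
    Returns (lo, hi): 2.5th and 97.5th percentiles of the means of
    randomly formed groups with the same sizes as the clusters.
    """
    rng = np.random.default_rng(seed)
    means = []
    for _ in range(n_iter):
        perm = rng.permutation(values)  # random shuffle of the raters
        start = 0
        for size in cluster_sizes:
            means.append(perm[start:start + size].mean())
            start += size
    return np.percentile(means, 2.5), np.percentile(means, 97.5)
```

A real cluster whose mean falls outside the returned interval differs significantly from chance at the 5% level.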

Cluster 1 appears to rate funny sounding words highly, while Cluster 5 highly rates sexual words and juxtapositions. The “older” cluster 3 rates scatological words higher as well as insults and informal words. While none of the humor features we mentioned stand out for clusters 2 or 4, the clustering suggests new features that may be worth examining in a further study. For instance, Cluster 4 appears to highly rate unfamiliar words that may be surprising merely by the fact that they are words at all. Cluster 2 seems to rate “random” nouns highly (see the analysis of concreteness in Engelthaler & Hills (2017)) as well as words like “doohickey” and “thingie” which can describe random nouns.

Table 5 reports the strings that males and females differed most on, sorted by confidence in the difference. The strings males found funnier appear to be more sexual, while the words females found funnier appear to be more “funny sounding.” A key difference between Tables 3 and 5 is whether or not the WE was used — in Table 3 the participants were clustered by their sense of humor embeddings without using their gender, while Table 5 presents differences in word ratings without using the WE at all. It is interesting that the WE recovers gender with a nontrivial degree of accuracy from the sense-of-humor embeddings.
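The claim that gender can be recovered from sense-of-humor embeddings can be probed with a simple classifier. The sketch below is a hypothetical probe, not the paper's reported analysis; the function name and the use of cross-validated logistic regression are our assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def gender_probe_accuracy(sense_vectors, is_female, folds=5):
    """Cross-validated accuracy of predicting gender from sense-of-humor
    embeddings. Accuracy well above the majority-class baseline would
    indicate the embeddings carry gender-correlated humor information."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, sense_vectors, is_female, cv=folds).mean()
```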

Also, humor is known to differ strongly across cultures. Mirroring this for word humor, we ran our study of 216 words with 312 raters from India. We find significant differences compared to the 1,659 raters from the U.S., with the raters from the U.S. finding longer words more humorous. The greatest differences are shown in Table 6 of the appendix.

6 Conclusions

We have shown that WEs capture aggregate word humor as rated on the EH dataset and also differences between humor ratings on a collection of about two hundred words. We have shown that each individual’s sense of humor can be easily embedded using a handful of ratings, and that differences in these embeddings generalize to predict different ratings on unrated words. We have shown that word humor possesses many features motivated by theories of humor more broadly, and that these features are represented in WEs to varying degrees.

The datasets we have collected will be made publicly available and may be useful for other projects. There are numerous possible applications of word humor to natural-language humor more generally. As discussed, comedians and writers are aware of word-level humor and indeed use word choice to amplify their humor. Similarly, humorous words may help in identifying and generating humorous text. Moreover, our ratings could be used by text-synthesis systems such as chat-bots that use WEs, to steer output towards or away from different types of humor (e.g., with or without sexual connotations), depending on the application at hand and the training data.

Finally, one approach to improving AI recognition and generation of humor is to start with humorous words, then move on to humorous phrases and sentences, and finally to humor in broader contexts. Our work may be viewed as a first step in this programme.


  • Barbieri & Saggion (2014) Barbieri, F. and Saggion, H. Modelling irony in twitter. In EACL, pp. 56–64, 2014.
  • Beard (2009) Beard, R. The 100 Funniest Words in English. Lexiteria, 2009. ISBN 9780615267043.
  • Bolukbasi et al. (2016) Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., and Kalai, A. T. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Advances in Neural Information Processing Systems, pp. 4349–4357, 2016.
  • Bown (2015) Bown, L. Writing Comedy: How to use funny plots and characters, wordplay and humour in your creative writing. Hodder & Stoughton, 2015. ISBN 9781473602205.
  • Caliskan et al. (2017) Caliskan, A., Bryson, J. J., and Narayanan, A. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186, 2017.
  • Chandrasekaran et al. (2016) Chandrasekaran, A., Vijayakumar, A. K., Antol, S., Bansal, M., Batra, D., Lawrence Zitnick, C., and Parikh, D. We are humor beings: Understanding and predicting visual humor. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4603–4612, 2016.
  • Chen & Soo (2018) Chen, P.-Y. and Soo, V.-W. Humor recognition using deep learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 2, pp. 113–117, 2018.
  • Cooper (1709) Cooper, A. A. Sensus communis: An essay on the freedom of wit and humour. Characteristicks of men, manners, opinions, times, 1, 1709.
  • Engelthaler & Hills (2017) Engelthaler, T. and Hills, T. T. Humor norms for 4,997 english words. Behavior Research Methods, Jul 2017. ISSN 1554-3528. doi: 10.3758/s13428-017-0930-6.
  • Freud (1905) Freud, S. Jokes and their relation to the unconscious (j. strachey, trans.). N. Y.: Holt, Rinehart, and Winston, 1905.
  • Garg et al. (2017) Garg, N., Schiebinger, L., Jurafsky, D., and Zou, J. Word embeddings quantify 100 years of gender and ethnic stereotypes. CoRR, abs/1711.08412, 2017.
  • Hossain et al. (2017) Hossain, N., Krumm, J., Vanderwende, L., Horvitz, E., and Kautz, H. Filling the blanks (hint: plural noun) for mad libs humor. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 638–647, 2017.
  • Joshi et al. (2016) Joshi, A., Tripathi, V., Patel, K., Bhattacharyya, P., and Carman, M. Are word embedding-based features useful for sarcasm detection? arXiv preprint arXiv:1610.00883, 2016.
  • Kant (1790) Kant, I. Critique of judgment. Hackett Publishing, 1790.
  • Kierkegaard (1846) Kierkegaard, S. Concluding unscientific postscript to philosophical fragments (translation by d. swenson, 1941), 1846.
  • Kortum (2013) Kortum, R. D. Funny words: The language of humor. In Varieties of Tone, pp. 134–142. Springer, 2013.
  • Kozlowski et al. (2018) Kozlowski, A. C., Taddy, M., and Evans, J. A. The geometry of culture: Analyzing meaning through word embeddings. CoRR, abs/1803.09288, 2018.
  • Mihalcea & Strapparava (2005) Mihalcea, R. and Strapparava, C. Making computers laugh: Investigations in automatic humor recognition. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT ’05, pp. 531–538, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics. doi: 10.3115/1220575.1220642.
  • Mikolov et al. (2013) Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • Morreall (2016) Morreall, J. Philosophy of humor. In Zalta, E. N. (ed.), The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, winter 2016 edition, 2016.
  • Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014.
  • Radev et al. (2015) Radev, D., Stent, A., Tetreault, J., Pappu, A., Iliakopoulou, A., Chanfreau, A., de Juan, P., Vallmitjana, J., Jaimes, A., Jha, R., et al. Humor in collective discourse: Unsupervised funniness detection in the new yorker cartoon caption contest. arXiv preprint arXiv:1506.08126, 2015.
  • Raz (2012) Raz, Y. Automatic humor classification on twitter. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop, pp. 66–70. Association for Computational Linguistics, 2012.
  • Schopenhauer (1844) Schopenhauer, A. Arthur Schopenhauer: The World as Will and Presentation: Volume I (translation by R. B. Haldane, 1888). Routledge, 1844.
  • Shahaf et al. (2015) Shahaf, D., Horvitz, E., and Mankoff, R. Inside jokes: Identifying humorous cartoon captions. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1065–1074. ACM, 2015.
  • Simon (1974) Simon, N. The Sunshine Boys: A Comedy in Two Acts. Samuel French, Inc., 1974.
  • Spencer (1911) Spencer, H. Essays on Education, Etc. London: Dent, 1911.
  • Stock & Strapparava (2003) Stock, O. and Strapparava, C. Hahacronym: Humorous agents for humorous acronyms. Humor, 16(3):297–314, 2003.
  • Thorndike (1953) Thorndike, R. L. Who belongs in the family? Psychometrika, 18(4):267–276, 1953.
  • Just For Fun (2017) Just For Fun / Inherently Funny Words, 2017. Accessed 2017-07-25.
  • Westbury et al. (2016) Westbury, C., Shaoul, C., Moroschan, G., and Ramscar, M. Telling the world’s least funny jokes: On the quantification of humor as entropy. Journal of Memory and Language, 86:141–156, 2016. ISSN 0749-596X.

Appendix A Funniest words

The 216 top-rated words, sorted in order of mean humor ratings (funniest first), used in the main crowdsourcing experiment:

asshattery, clusterfuck, douchebaggery, poppycock, craptacular, cockamamie, gobbledegook, nincompoops, wanker, kerfuffle, cockle pickers, pussyfooting, tiddlywinks, higgledy piggledy, kumquats, boondoggle, doohickey, annus horribilis, codswallop, shuttlecock, bejeezus, bamboozle, whakapapa, artsy fartsy, pooper scoopers, fugly, dunderheaded, dongles, didgeridoo, dickering, bacon butties, woolly buggers, pooch punt, twaddle, dabbawalas, goober, apeshit, nut butters, hoity toity, glockenspiel, diktats, mollycoddling, pussy willows, bupkis, tighty whities, nut flush, namby pamby, bugaboos, hullaballoo, hoo hah, crapola, jerkbaits, batshit, schnitzels, sexual napalm, arseholes, buffoonery, lollygag, weenies, twat, diddling, cockapoo, boob tube, galumphing, ramrodded, schlubby, poobahs, dickheads, fufu, nutjobs, skedaddle, crack whore, dingbat, bitch slap, razzmatazz, wazoo, schmuck, cock ups, boobies, cummerbunds, stinkbait, gazumped, moobs, bushwhacked, dong, pickleball, rat ass, bootlickers, skivvies, belly putter, spelunking, faffing, spermatogenesis, butt cheeks, blue tits, monkeypox, cuckolded, wingnuts, muffed punt, ballyhoo, niggly, cocksure, oompah, trillion dong, shiitake, cockling, schlocky, portaloos, pupusas, thrust reverser, pooja, schmaltzy, wet noodle, piggeries, weaner, chokecherry, tchotchkes, titties, doodad, troglodyte, nookie, annulus, poo poo, semen samples, nutted, foppish, muumuu, poundage, drunken yobs, yabbies, chub, butt whipping, noobs, ham fisted, pee pee, woo woo, squeegee, flabbergasted, yadda yadda, dangdut, coxless pairs, twerps, tootsies, big honkin, porgies, dangly, guffawing, wussies, thingies, bunkum, wedgie, kooky, knuckleheads, nuttin, mofo, fishmonger, thwack, teats, peewee, cocking, wigwams, red wigglers, priggish, hoopla, poo, twanged, snog, pissy, poofy, newshole, dugong, goop, whacking, viagogo, chuppah, fruitcakes, caboose, cockfights, hippocampus, vindaloo, holeshot, hoodoo, clickety clack, backhoes, loofah, skink, party poopers, civvies, quibble, whizzy, gigolo, bunged, whupping, weevil, spliffs, toonie, gobby, infarct, chuffed, gassy, crotches, chits, proggy, doncha, yodelling, snazzy, fusarium, bitty, warbled, guppies, noshes, dodgems, lard, meerkats, lambast, chawl

Table 7 shows the feature ratings for the ten funniest single words. Tables 5 and 6 present binary comparisons of the funniest words by gender (female vs. male) and nationality (India vs. U.S.).

Word rated funnier in India    adjusted p-value    Word rated funnier in U.S.    adjusted p-value
poo poo                        6.2e-14             codswallop                    4.5e-25
pissy                          2.4e-12             craptacular                   5.7e-22
woo woo                        5.4e-12             asshattery                    4.2e-20
poofy                          9.2e-11             kerfuffle                     1.3e-19
gigolo                         3.2e-10             gobbledegook                  3e-18
muumuu                         4.5e-10             glockenspiel                  2.9e-17
pee pee                        6.2e-10             clusterfuck                   2.4e-16
guppies                        2.4e-09             ramrodded                     1.8e-13
gassy                          1e-07               douchebaggery                 9.4e-13
boobies                        4.2e-07             twaddle                       4.7e-12
Table 6: Among our set of 216 words (including phrases), the ten with the most confident differences in ratings from people in India and the U.S. (again using a two-sided t-test and a Bonferroni correction). There is a strong (0.45) correlation between word length and the difference in rating between the U.S. and India.
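The length correlation reported in the Table 6 caption is an ordinary Pearson correlation between each string's character length and its U.S.-minus-India mean rating difference; a minimal sketch (with a hypothetical function name and input format) is:

```python
import numpy as np

def length_difference_correlation(words, diff_us_minus_india):
    """Pearson correlation between word length and the U.S.-minus-India
    difference in mean humor rating."""
    lengths = np.array([len(w) for w in words], dtype=float)
    diffs = np.asarray(diff_us_minus_india, dtype=float)
    return float(np.corrcoef(lengths, diffs)[0, 1])
```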
Word           F/M    sound    juxtaposition    colloquial    insulting    sexual    scatological
clusterfuck M
poppycock M
cockamamie F
gobbledegook F
nincompoops F
wanker M
kerfuffle F
Table 7: The ten words rated funniest in our study, their female/male mean rating discrepancy (if significant), and some features of these words.

Appendix B Further experiment details

We found many proper nouns and words that would normally be capitalized among the 120,000 most frequent lower-case words from GNEWS. To remove these, for each capitalized entry such as New_York, we removed it along with less frequent entries that match when spacing is ignored (such as newyork), whenever the lower-case form was less frequent than the capitalized form according to the WE frequency. (For example, the WE has 13 entries that are equivalent to New_York up to capitalization and punctuation.)
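This filter can be sketched as grouping vocabulary entries by a normalized key and keeping a group only when its most frequent member is all lower-case. The function name and the exact grouping rule are our reading of the description above, not the paper's code.

```python
import re
from collections import defaultdict

def filter_proper_nouns(vocab_freq):
    """Drop likely proper nouns from an embedding vocabulary (a sketch).

    vocab_freq: dict mapping entry (e.g. 'New_York', 'newyork') -> frequency.
    Entries are grouped by their lower-cased form with punctuation and
    spacing stripped; a group is kept only if its most frequent member
    is already lower-case.
    """
    groups = defaultdict(list)
    for entry, freq in vocab_freq.items():
        key = re.sub(r'[^a-z0-9]', '', entry.lower())
        groups[key].append((freq, entry))
    kept = []
    for members in groups.values():
        freq, best = max(members)          # most frequent spelling in the group
        if best == best.lower():           # lower-case dominant: not a proper noun
            kept.append(best)
    return kept
```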

To disincentivize people from randomly clicking, a slight pause (600 ms) was introduced between the presentation of each word and the interface required the person to wait until all six words had been presented before clicking. In each presentation, words were shuffled to randomize the effects of positional biases. The fractions of clicks on the different locations did suggest a slight positional bias with click percentages varying from 15.7% to 18.6%.

We refer to the three humor-judging experiments by the numbers of words used: 120k, 8k, and 216. In the 120k experiment, each string was shown to at least three different participants in three different sextuplets. 80,062 strings were not selected by any participant, consistent with EH’s finding that the vast majority of words are not found to be funny. The 8k experiment applied the same procedure (except without a “none are humorous” option) to the 8,120 words that were chosen as the most humorous in a majority (1/2 or more) of the sextuples in which they were shown. Each word was shown to at least 15 different participants in random sextuples. The list of 216 words is in the appendix. A slight inconsistency may be found in the published data, in that we have removed duplicates where one person voted on the same word more than once; however, in forming our sets of 8,120 and 216 words we did not remove duplicates.
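The majority-selection rule used to narrow 120k strings down to 8,120 can be sketched as follows; the function name and the list-of-pairs input format are illustrative assumptions.

```python
from collections import defaultdict

def majority_selected(showings):
    """Words chosen as most humorous in at least half of the sextuples shown.

    showings: list of (word, was_chosen) pairs, one per (word, sextuple)
    presentation, where was_chosen marks whether the word was picked as
    the most humorous in that sextuple.
    """
    shown = defaultdict(int)
    chosen = defaultdict(int)
    for word, was_chosen in showings:
        shown[word] += 1
        chosen[word] += bool(was_chosen)
    return {w for w in shown if chosen[w] / shown[w] >= 0.5}
```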
