Recognizing Film Entities in Podcasts
In this paper, we propose a Named Entity Recognition (NER) system to identify film titles in podcast audio. Taking inspiration from NER systems for noisy text in social media, we implement a two-stage approach that is robust to automatic transcription errors and does not require significant computational expense to accommodate new film titles/releases. Evaluating on a diverse set of podcasts, we demonstrate more than a 20% increase in F1 score over three baseline approaches when combining fuzzy-matching with a linear model aware of film-specific metadata.
Podcast audiences have more than doubled in size over the last decade, bringing with them demand for more frequent releases and an expanded scope of content (??, Win). In parallel, the barrier to entry for hosting a podcast has been significantly lowered, with reduced costs for audio recording technology allowing non-experts to engage in discussion of topics such as politics or entertainment that were previously reserved for more traditionally-accredited and financially-supported individuals (Baumgartner and Morris, 2010; Kang and Gretzel, 2012). In this combination, podcasts have demonstrated tremendous potential as a gauge for social opinion.
Nonetheless, the volume of podcast content produced today precludes manual large-scale topical analysis. In particular, it is currently difficult to track mentions of noteworthy properties across multiple podcast channels. As such, NER, the identification of qualitatively significant word phrases such as people and organizations (Chieu and Ng, 2002), is a much-needed focus in the area of podcast analysis. NER is a key step toward higher-level inferences such as measuring sentiment, identifying emotions associated with properties, or building predictive models for property-level response variables such as revenue based on podcast data.
Although NER systems for formal, written language perform accurately (Lin and Wu, 2009; Ratinov and Roth, 2009), there remains substantial room for improvement in evolving communication mediums where traditional linguistic structures are used less consistently (e.g. social media posts and informal conversations). Entity recognition faces further challenges in the domain of human speech, where an intermediary step to transform audio into computer-readable text entirely removes orthographic features, which are often used to highlight entities in writing (e.g. capitalization, punctuation).
To address the aforementioned challenges for entity detection in podcasts, we propose a two-stage NER system and evaluate it in the context of film title detection. We note that detection of film titles is one of the more niche and ambitious tasks within this research domain, given that new films are released each week and most NER systems rely on large volumes of training data that can be costly to obtain. Given this scope, we believe that a successful film title detection approach has promise to transfer to other entity classes which are traditionally more stable over time.
In this paper, we begin by briefly reviewing existing NER systems used in informal language domains such as human speech, highlighting their limitations in the context of film title detection. In Section 3, we discuss our data sources, external dependencies and our pre-processing approach. In Section 4, we propose and detail the two-step candidate identification and entity classification procedure that lies at the crux of this paper. Finally, in Section 5, we evaluate our proposed method and compare it against three baselines.
2. Related Work
Although there exists plenty of research on NER for traditional noun phrases such as people, locations, and organization names, little has been done for niche entities such as movies and books. The challenge in the latter is that these properties evolve in much shorter time intervals (i.e. new movies are released every week).
Prior research has used supervised machine learning to recognize entities from audio using acoustic features. For example, the speech recognition model proposed by (Hatmi et al., 2013) uses constrained maximum likelihood linear regression to simultaneously predict the most probable sequence of words and entity classes within an audio waveform. However, this approach requires a large set of token-labeled and time-synchronized training transcripts. Such a dependency is prohibitive in the dynamically changing landscape for film titles.
Focusing less on acoustic signal, (Chowdhury, 2013) feeds word tokens and linguistic features from audio transcripts into a Conditional Random Field (CRF) model to detect people, locations, organizations, and geo-political entities. While this approach works quite well, it is not suited for unstable entity classes and proves quite vulnerable to word/phrase transcription errors.
Given the limitations of previous research in the audio analysis domain, we look for inspiration in another challenging medium: social media. Notably, language in social media typically suffers from inconsistent spelling, poor grammar, and a dearth of orthographic features (Eisenstein, 2013). Given the current state of automatic audio transcription tools, the computer-readable text generated from podcast audio often exhibits the same inconsistencies. Moreover, language in both social media and human speech requires normalization, as people often use non-standard tokens and phrases within these domains (Liu et al., 2012).
In a relatively recent study, (Ashwini and Choi, 2014) propose a system that identifies entity candidates in social media by matching token sequences in tweets to phrases in a gazetteer. To address issues with precision, they train a classifier using a combination of orthographic, n-gram, and syntactic features to determine whether entity candidates are indeed true entities. Importantly, their system does not require constant re-training, as the gazetteer may be updated with new elements in real time. While their results demonstrate promise for a variety of entity types, their system still relies on capitalization, special characters, and syntactic features to achieve the desired performance. As mentioned above, these features are critically absent in transcripts of podcast audio and thus motivate our work.
3. Data and Pre-processing
We collect 20 film-related podcasts from various publicly available channels (see Table 1), including National Public Radio, SlashFilm, Screen Junkies and Looper. We listen to each podcast and manually note the film properties mentioned within it to serve as the ground truth. Label quality is measured by inter-annotator agreement (Cohen’s Kappa of 0.63). The podcasts are of similar length (10 minutes) except those from SlashFilm (100 minutes). The complete distribution of entities within our dataset and estimated transcription errors can be found in Table 1.
|All Things Consd.||4||2||-||-||6||26%|
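The inter-annotator agreement reported above can be illustrated with a minimal sketch of Cohen's Kappa for two annotators. The unit of annotation is an assumption here (one binary label per candidate mention); the function name and example labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: 1 = film mention, 0 = not a film mention.
annotator_a = [1, 1, 1, 0, 0, 0, 1, 0]
annotator_b = [1, 1, 0, 0, 0, 1, 1, 0]
kappa = cohens_kappa(annotator_a, annotator_b)
```

In this toy case the annotators agree on 6 of 8 items and chance agreement is 0.5, so Kappa is 0.5.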
Each podcast is subjected to the same set of pre-processing steps. First, raw podcast audio is transcribed using an open-source speech recognition framework from (Zhang, 2017). The output is a long sequence of lowercase words separated by whitespace; it contains no orthographic features or punctuation, both of which existing NER systems rely on heavily. Applying (Zhang, 2017)’s model to audio from a National Public Radio podcast for which a human-curated transcript is available, we empirically observe a 23% Word Error Rate (WER) on average. Given this relatively high WER, we reiterate the importance of an entity-detection system that is flexible enough to handle errors in the transcription procedure.
To aid in the inference of syntactic features, punctuation is inferred using a bidirectional recurrent neural network (with attention) that has been trained on European Parliament speech data (Tilk and Alumäe, 2016). To evaluate the quality of this inference, we apply it to the same podcast used to test WER. The punctuation model performs significantly better than chance and achieves a precision, recall, and F1 score of 0.78, 0.64, and 0.70, respectively.
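For reference, WER is conventionally computed as the word-level edit distance between the automatic and human transcripts, divided by the number of words in the human transcript. The sketch below follows that standard definition; the function name and example sentences are illustrative, and the text does not specify the exact scoring script used in the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
wer = word_error_rate("we watched coco last night", "we watched cocoa last night")
```

Here a single substitution ("coco" to "cocoa") against a five-word reference gives a WER of 0.2.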
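As a quick consistency check on the figures above, the F1 score is the harmonic mean of precision $P$ and recall $R$, and the reported values agree with that definition:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.78 \times 0.64}{0.78 + 0.64}
    = \frac{0.9984}{1.42}
    \approx 0.70
```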
After punctuation has been restored, all numeric values are converted to text (e.g. 1984 to nineteen eighty four). Finally, the processed text string is tokenized using the sentence and word tokenizers from NLTK (Bird et al., 2009).
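The number-to-text step can be sketched as follows. This is a simplified illustration, not the conversion routine used in the study: it assumes that numbers of at most two digits are spelled directly, that four-digit values are read as two pairs (as in the 1984 example), and that anything else is left unchanged. All function names are hypothetical.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits_to_words(n: int) -> str:
    """Spell out an integer between 0 and 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def spell_number(token: str) -> str:
    """Spell out a digit string; unsupported forms are returned unchanged."""
    if len(token) <= 2:
        return two_digits_to_words(int(token))
    if len(token) == 4:
        high, low = int(token[:2]), int(token[2:])
        # Read year-like values as two pairs, e.g. 19|84.
        if high >= 10 and low >= 10:
            return f"{two_digits_to_words(high)} {two_digits_to_words(low)}"
    return token  # outside the scope of this sketch


def normalize_numbers(text: str) -> str:
    """Replace every standalone run of digits with its spelled-out form."""
    return re.sub(r"\b\d+\b", lambda match: spell_number(match.group()), text)


normalized = normalize_numbers("we saw 1984 and 21 jump street")
```

A production system would also need rules for values such as 2001 or 300, which this sketch deliberately leaves untouched.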
4.1. Identifying Candidates
We use a proprietary database of 9000 films produced between 2000 and 2016 as our gazetteer. Each film has the following metadata available: production budget, keywords, plot summary, and logline. Notably, 70% of films are missing data from at least one of these fields.
4.1.2. Entity Lookup
(Ashwini and Choi, 2014) perform exact string matching using suffix trees to identify words and phrases which may be entities (a.k.a. entity candidates). However, exact string matching tends to miss titles that have minor word errors (e.g. film “Coco” transcribed as “cocoa”). In our approach, we use a Levenshtein ratio (Fuad, 2012) with a threshold that varies with the number of tokens present in the phrase to determine whether a match exists. The threshold for each n-gram length is determined using cross-validation within the training set. This similarity measure is implemented to account for the relatively high WER of the (Zhang, 2017) audio transcription model. Accordingly, long movie titles such as “Three Billboards Outside Ebbing, Missouri” can have minor transcription errors without being ignored during the entity candidate identification stage.
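The lookup described above can be sketched as a sliding-window comparison between transcript n-grams and gazetteer titles. Several details are assumptions made for illustration: the ratio is normalized as one minus the edit distance over the longer string length (the exact normalization is not given in the text), the threshold values are placeholders rather than the cross-validated ones, and the brute-force scan stands in for whatever indexing the actual system uses.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (char_a != char_b)))
        prev = curr
    return prev[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; an assumed normalization of the edit distance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical thresholds keyed by title length in tokens.
THRESHOLDS = {1: 0.75, 2: 0.80}
DEFAULT_THRESHOLD = 0.85


def normalize_title(title: str) -> list:
    """Lowercase a title and drop punctuation, mirroring the transcript format."""
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).split()


def find_candidates(tokens, gazetteer):
    """Return (start index, transcript phrase, matched title, ratio) tuples."""
    candidates = []
    for title in gazetteer:
        title_tokens = normalize_title(title)
        n = len(title_tokens)
        target = " ".join(title_tokens)
        threshold = THRESHOLDS.get(n, DEFAULT_THRESHOLD)
        for start in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[start:start + n])
            ratio = levenshtein_ratio(phrase, target)
            if ratio >= threshold:
                candidates.append((start, phrase, title, ratio))
    return candidates


transcript = "we loved cocoa and three billboards outside ebbing missouri".split()
films = ["Coco", "Three Billboards Outside Ebbing, Missouri"]
matches = find_candidates(transcript, films)
```

In this toy transcript, exact matching would miss "Coco" entirely, whereas the fuzzy lookup pairs "cocoa" with it (one edit over five characters) and still matches the long title exactly.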
4.1.3. Feature Extraction
Using the tokenized text output from the preprocessing stage in Section 3, we infer part of speech (POS) tags for each token in the audio transcript using the Stanford POS Tagger (Bird et al., 2009). Then, we add several features to each entity candidate based on metadata from their potential film match. All features can be found in Table 2.
Notably, we design a metric to capture the thematic relevance of the context around an entity candidate to its associated film. Informally, we define closeness to be a normalized value representing the number of words in the transcript which separate the entity candidate, $e$, from a relevant keyword, $k$, where $k \in K$ and $K$ denotes the keywords of the movie in our database. Their word indices within the transcript are denoted by $i_e$ and $i_k$, respectively. We define this metric mathematically in Table 1. After calculating each $k$’s closeness value to the corresponding entity candidate $e$, we extract summary statistics of these values as model features. For example, if the entity candidate is “Godzilla”, we may expect words like “monster”, “large”, or “japan” to appear in close proximity to the entity candidate’s position within the transcript.
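To make the idea of closeness concrete, the sketch below uses one plausible formulation. It is an assumption, not the definition from the paper's table: closeness is taken as one minus the word distance between candidate and keyword, divided by the transcript length minus one, and the nearest occurrence of each keyword is used. The minimum, maximum, and mean aggregates are likewise illustrative choices.

```python
def closeness(entity_index: int, keyword_index: int, transcript_length: int) -> float:
    """Assumed closeness: 1.0 when adjacent indices coincide, 0.0 at opposite ends."""
    if transcript_length < 2:
        return 1.0
    return 1.0 - abs(entity_index - keyword_index) / (transcript_length - 1)


def closeness_features(tokens, entity_index, keywords):
    """Aggregate closeness over every film keyword found in the transcript."""
    values = []
    for keyword in keywords:
        positions = [i for i, token in enumerate(tokens) if token == keyword]
        if positions:
            # Use the occurrence of the keyword nearest to the candidate.
            values.append(max(closeness(entity_index, i, len(tokens))
                              for i in positions))
    if not values:
        # No keyword appears in the transcript: treat context as irrelevant.
        return {"min": 0.0, "max": 0.0, "mean": 0.0}
    return {"min": min(values),
            "max": max(values),
            "mean": sum(values) / len(values)}


tokens = "the monster in godzilla destroys a large city in japan".split()
features = closeness_features(tokens, tokens.index("godzilla"),
                              ["monster", "large", "japan", "lizard"])
```

Keywords that never occur in the transcript ("lizard" here) simply contribute nothing, which is one way to cope with films whose keyword metadata is sparse.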
4.2. Classifying Entities
After identifying potential film mentions via our fuzzy matching algorithm, the entity candidates are subjected to binary classification via logistic regression. We evaluated baseline approaches and our system using 9-fold Leave One Channel Out (LOCO) cross-validation. Results from our approach are highlighted in the left-side of Figure 1. Hyperparameters (regularization and penalty) are selected to optimize F1 score within each training set.
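The Leave One Channel Out scheme can be sketched as a grouped split in which every fold holds out all podcasts from a single channel. The helper name and the toy channel list are hypothetical; nine folds would correspond to nine distinct channels.

```python
from collections import defaultdict


def leave_one_channel_out(channels):
    """Yield (held-out channel, train indices, test indices) for each channel."""
    by_channel = defaultdict(list)
    for index, channel in enumerate(channels):
        by_channel[channel].append(index)
    for held_out in sorted(by_channel):
        test = by_channel[held_out]
        train = [i for i, channel in enumerate(channels) if channel != held_out]
        yield held_out, train, test


# Hypothetical channel label for each podcast in the dataset.
podcast_channels = ["NPR", "NPR", "SlashFilm", "Looper", "Looper"]
folds = list(leave_one_channel_out(podcast_channels))
```

Splitting by channel rather than by episode keeps host- and format-specific vocabulary out of the training data for the held-out fold. Hyperparameter selection would then happen inside each training split only. scikit-learn's `LeaveOneGroupOut` provides an equivalent splitter.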
Since the model is agnostic to specific words or phrases, this NER system does not require any retraining when the gazetteer is updated to reflect new movie releases.
4.2.1. Feature Selection
We find that the most predictive features of true entity mentions include the following: n-gram levels, POS-tags, closeness, and Levenshtein ratio (using null hypothesis testing p-values < 0.05 for mentioned features). While n-gram levels and POS-tags provide adequate performance, the most significant performance gain comes from the addition of the metadata-based features: closeness and production budget.
5. Results and Discussion
We compare performance of our model-based entity recognition to three baseline approaches (right-side of Figure 1). Baseline 1 classifies all entity candidates as true film mentions. Baseline 2 is similar to Baseline 1, except that we limit candidates to those inferred to be a noun-phrase using the Stanford POS Tagger. For Baseline 3, we consider all candidates identified in the first stage of our process and then remove those whose closeness statistics are below thresholds determined via cross-validation over the training data.
Although (Ritter et al., 2011) has proven useful in the social media domain, we find it does not serve as an adequate baseline for our task. The pre-trained model from (Ritter et al., 2011) identifies zero film titles across our dataset of transcripts. To understand why, we applied this model to a human-annotated transcript from National Public Radio (NPR) with and without capitalization. While the model identified 16 out of 33 true film title mentions in the capitalized transcript, it did not identify any within the uncapitalized version. The lack of capitalization dramatically reduces performance and highlights the value of a gazetteer-based approach.
The rule-based approach (Baseline 3) demonstrates that the metadata adds most of the predictive power. The linear model is better at taking into account multiple features as compared to the rule-based approach. As shown in the right-side of Figure 1, our model achieves an average F1 score, precision and recall of 0.61, 0.67 and 0.65, respectively.
To estimate the effect that the high WER from (Zhang, 2017)’s speech transcription model has on our overall method, we also apply our two-stage NER system to the NPR podcast used to evaluate (Ritter et al., 2011)’s NER model. We find that our system correctly identifies 27 true film mentions in the human-curated transcript as opposed to 22 true mentions in the computer-generated transcript. As such, we believe our system has room to improve given access to more accurate speech transcription models.
Future research will explore two key directions. First, we plan to include additional film metadata fields such as release date, production studio, and cast members as features in the candidate classification model. We hypothesize that several of these fields can be represented within the model in a similar fashion to keyword mentions. Second, we plan to source a larger and more granularly labeled set of podcast transcript data to allow the use of data-greedy sequence learning models.
- ?? (Win) Audio podcast consumption in the U.S. 2018 — Statistic. https://www.statista.com/statistics/270365/audio-podcast-consumption-in-the-us/. (????). Retrieved May 7, 2018.
- Ashwini and Choi (2014) Sandeep Ashwini and Jinho D. Choi. 2014. Targetable Named Entity Recognition in Social Media. arXiv preprint arXiv:1408.0782 (2014).
- Baumgartner and Morris (2010) Jody C Baumgartner and Jonathan S Morris. 2010. MyFaceTube politics: Social networking web sites and political engagement of young adults. Social Science Computer Review 28, 1 (2010), 24–44.
- Bird et al. (2009) Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc.
- Chieu and Ng (2002) Hai Leong Chieu and Hwee Tou Ng. 2002. Named entity recognition: a maximum entropy approach using global information. In Proceedings of the 19th international conference on Computational linguistics-Volume 1. Association for Computational Linguistics, 1–7.
- Chowdhury (2013) Md Faisal Mahbub Chowdhury. 2013. A simple yet effective approach for named entity recognition from transcribed broadcast news. In Evaluation of Natural Language and Speech Tools for Italian. Springer, 98–106.
- Eisenstein (2013) Jacob Eisenstein. 2013. What to do about bad language on the internet. In Proceedings of the 2013 conference of the North American Chapter of the association for computational linguistics: Human language technologies. 359–369.
- Fuad (2012) Muhammad Marwan Muhammad Fuad. 2012. Towards Normalizing the Edit Distance Using a Genetic Algorithms–Based Scheme. In International Conference on Advanced Data Mining and Applications. Springer, 477–487.
- Hatmi et al. (2013) Mohamed Hatmi, Christine Jacquin, Emmanuel Morin, and Sylvain Meignier. 2013. Incorporating named entity recognition into the speech transcription process. In Proceedings of the 14th Annual Conference of the International Speech Communication Association (Interspeech’13). 3732–3736.
- Kang and Gretzel (2012) Myunghwa Kang and Ulrike Gretzel. 2012. Differences in social presence perceptions. In Information and Communication Technologies in Tourism 2012. Springer, 437–447.
- Lin and Wu (2009) Dekang Lin and Xiaoyun Wu. 2009. Phrase clustering for discriminative learning. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2. Association for Computational Linguistics, 1030–1038.
- Liu et al. (2012) Fei Liu, Fuliang Weng, and Xiao Jiang. 2012. A broad-coverage normalization system for social media language. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, 1035–1044.
- Ratinov and Roth (2009) Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 147–155.
- Ritter et al. (2011) Alan Ritter, Sam Clark, Oren Etzioni, et al. 2011. Named entity recognition in tweets: an experimental study. In Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, 1524–1534.
- Tilk and Alumäe (2016) Ottokar Tilk and Tanel Alumäe. 2016. Bidirectional Recurrent Neural Network with Attention Mechanism for Punctuation Restoration. In Interspeech. 3047–3051.
- Zhang (2017) Anthony Zhang. 2017. Speech Recognition. (2017). https://github.com/Uberi/speech_recognition#readme