Decoding visemes: improving machine lip-reading (PhD thesis)
Helen L. Bear
University of East Anglia
School of Computing Sciences
©This copy of the thesis has been supplied on condition that anyone who consults it is understood to recognise that its copyright rests with the author and that no quotation from the thesis, nor any information derived therefrom, may be published without the author’s prior written consent.
This thesis is about improving machine lip-reading, that is, the classification of speech from only the visual cues of a speaker. Machine lip-reading is a niche research problem in both speech processing and computer vision.
Current challenges for machine lip-reading fall into two groups: the content of the video, such as the rate at which a person is speaking; or the parameters of the video recording, for example, the video resolution. We begin our work with a literature review to understand the restrictions current technology places on machine lip-reading recognition, and conduct an experiment into the effects of resolution. We show that high definition video is not needed to successfully lip-read with a computer.
The term “viseme” is used in machine lip-reading to represent a visual cue or gesture which corresponds to a subgroup of phonemes where the phonemes are indistinguishable in the visual speech signal. Whilst a viseme is yet to be formally defined, we use the common working definition ‘a viseme is a group of phonemes with identical appearance on the lips’. A phoneme is the smallest acoustic unit a human can utter. Because there are more phonemes than visemes, and each viseme groups several phonemes, mapping between the units creates a many-to-one relationship. Many mappings have been presented, and we conduct an experiment to determine which mapping produces the most accurate classification. Our results show Lee’s map [lee2002audio] is best. Classification using Lee’s map also outperforms machine lip-reading systems which use the popular Fisher [fisher1968confusions] phoneme-to-viseme map.
Further to this, we propose three methods of deriving speaker-dependent phoneme-to-viseme maps and compare our new approaches to Lee’s. Our results show the sensitivity of phoneme clustering and we use our new knowledge for our first suggested augmentation to the conventional lip-reading system.
Speaker independence in machine lip-reading classification is another unsolved obstacle. It has been observed, in the visual domain, that classifiers need training on the test subject to achieve the best classification. Thus machine lip-reading is highly dependent upon the speaker. Speaker independence is the opposite of this, or in other words, is the classification of a speaker not present in the classifier’s training data. We investigate the dependence of phoneme-to-viseme maps between speakers. Our results show there is not a high variability of visual cues, but there is high variability in trajectory between visual cues of an individual speaker with the same ground truth. This implies a dependency upon the number of visemes within each set for each individual.
Finally, we investigate the optimum number of visemes within a set. We show the phoneme-to-viseme maps in literature rarely have enough visemes and the optimal number, which varies by speaker, ranges from 11 to 35. The last difficulty we address is decoding from visemes back to phonemes and into words. Traditionally this is completed using a language model. The language model unit is either the same as the classifier unit, e.g. visemes or phonemes, or it is words. In a novel approach we use these optimum range viseme sets within hierarchical training of phoneme labelled classifiers. This new method of classifier training demonstrates a significant increase in classification performance with a word language network.
This is my opportunity to say thank you to some extraordinary people without whom I really couldn’t have completed my PhD. My heartfelt thanks go to each and every one of you for so much more than I am capable of conveying in words.
Professor Richard Harvey (aka PhD supervisor extraordinaire), you are my dream supervisor and friend. Your intelligence, patience, support and humour have been invaluable and I have loved working with you these past four years. To the rest of my supervisory team: Dr Barry-John Theobald, Dr Yuxuan Lan, Professor Stephen Cox, and Dr Anthony Bagnall, you are amazing. Thank you all for your patient education, support and guidance. I am also grateful to my examiners Professor Andy Day and Dr Naomi Harte for assessing my viva performance.
To my lab colleagues Mr Thomas Le Cornu and Mr Danny Websdale - you guys are the best lab buddies I could ever have dreamed of. Thank you for making coming in to work every day so good.
Finally, thank you to my family, Barbara (aka Mum), Jeremy (aka Dad), Philip, Michelle, and Amelia, who know barely anything about what I’ve been doing for the past four years and understand it even less, but have supported me throughout this crazy endeavour, #proudtobeabear
Chapter 1 Introduction
Speech is bimodal. This means there are two modes of information: acoustic and visual. Humans use both signals to understand the speech of others [mcgurk]. Given that acoustic recognition has been studied for over fifty years [davis1952automatic], it is not surprising that acoustic recognition is far more mature than visual-only recognition. There have been significant increases in the performance of speech recognition systems, although they remain susceptible to noise [galatas2011audio]. Imagine trying to recognise a pilot’s speech over the background noise of the aeroplane engine in a cockpit. In this case, the audio signal is severely deteriorated by the noise of the environment. However, this noise does not affect the visual signal. Thus, a desire to recognise speech from the visual signal alone is born. The visual signal can be used in combination with the acoustic signal, which is audio-visual speech recognition (AVSR) [potamianos2004audio], or the visual signal can be used alone. This latter configuration is machine lip-reading, which is the topic of this thesis.
Lip-reading is a challenging task. When researchers investigate AVSR, it is common for audio recognition to dominate any benefit from lip-reading. Nevertheless, if we can make pure lip-reading successful there would be benefits for audio-visual recognition. Furthermore, there are a few scenarios where it is impractical or senseless to install a close microphone. An example might be an interactive booth in a busy station or airport where there is a poor signal-to-noise ratio (SNR) or some distance between the person and the screen. In practice, however, a major use of a good machine lip-reading system would be as part of an AVSR system.
1.1 Applications of machine lip-reading
There are a range of scenarios where a machine lip-reading system would be beneficial. We discuss a few examples here.
During sports events there are often headlines about arguments between players, referees and even supporters. In the 2006 football World Cup Final between France and Italy, it was 19 minutes into extra time when Zinedine Zidane, on the opposite end of the pitch to the football, head-butted an Italian player without apparent justification. This action earned him a red card and consequently France went on to lose both the match and the World Cup [zidane]. It later transpired, as admitted by Materazzi (the recipient of the head-butt), that Zidane was provoked by a targeted insult of a late family member. In this case, if a machine lip-reading system had been present to confirm the provocation, whilst Zidane would still have been red carded, so would Materazzi. Thus, playing ten men against ten, the outcome of the match, and the World Cup, could have been different.
In history there are a great number of silent videos. Common examples are silent entertainment films and historical documentaries. In [battlesomme] we see a professional human lip-reader help researchers comprehend soldiers’ conversations before they went into battle and during battle preparations. Similarly, in [hitler] we are shown how lip-readers used on the home movies of Hitler give historians an insight into an infamous figure of interest.
There has been long debate about whether silent entertainment films of the era 1895-1927 were ever scripted, as the audio could not be captured with the video channel. In [silentmovie] we learn that not only were these films in fact fully scripted, but in human lip-reading experiments, variations from the scripts were fully noticeable. Collectively, this human nature to be interested in history and learn from historical evidence is a further motivator for achieving robust automatic lip-reading systems.
Theobald et al. [theobald:640205] examine lip-reading for law enforcement. They note that in law enforcement there are many departments which would benefit from an automatic lip-reading system. They present a new technique for improved lip-reading whereby the extracted features are modified to increase the classification performance. The modification amplifies the feature parameters (they use Active Appearance Models, which we explain fully in Chapter LABEL:chap:featuretypes) to exaggerate the lip gestures recorded on camera. The technique was tested using a phonetically balanced corpus of syntactically correct sentences. The data set had very little contextual information [theobaldPHD] to remove effects of context network support. Machine lip-reading would help in law enforcement, as robust lip-reading of filmed conversations during criminal acts, e.g. on CCTV, could be evidence for the prosecution of offenders.
In the murder case of Arlene Fraser, Nat Fraser was caught and imprisoned. Evidence used by the prosecution included transcripts provided by professional lip-reader Jessica Rees [natfraser]. Whilst the perpetrator thought he had committed the ‘perfect’ murder, and took steps to avoid any conversations being overheard, he had not thought about those who could read lips. With the transcripts of Fraser’s conversations, prosecutors turned the co-conspirators into witnesses and Nat Fraser was prosecuted. Later, however, the reliability of lip-reading transcripts as evidence was successfully challenged, because human lip-readers are unreliable.
The reliability of human lip-readers is debatable. It has been said that this reliability varies not just between different pairings of speakers within a conversation, but also with the situation (context and environment) of the conversation, even with the same speakers [lott1960influence]. This means that a lip-reader who performs well on one day with a particular speaker could either misinterpret an alternative speaker or, if lip-reading the same person in another place, fail to comprehend the speech uttered. Furthermore, human lip-readers are expensive; for example, Consuelo Gonsales [consuelo] and Jessica Rees [rees] operate on an as-quoted basis. So we know that robust lip-readers are rare [lott1960influence], and often we have no way of verifying the accuracy of the lip-reading performance as a ground truth is rarely available. It is only in controlled experiments that a ground truth exists [Benoit1996, stork1996speechreading, comparHumMacLipRead].
In [lott1960influence] an investigation into the effect of likeability between individuals in a lip-read conversation, such as the status of their relationship, showed that a good relationship increases the accuracy of the lip-reading interpretation. Applying this observation to a real-world scenario in which a lip-reader is introduced to someone they do not know personally, such as in a video documentary, reduces our confidence that their lip-reading will be robust. This idea is supported in Nichie’s lip-reading and practice handbook [nrrchm1912lipreading], where in Chapter two it is suggested that the success of lip-reading practice is rightly attached to the teacher’s personality.
In [summerfield1992lipreading] Summerfield describes some reasons which can distinguish poor from good lip-readers. This list is deduced from the results of a series of experiments ([heider1940experimental, dodd1989teaching, macleod1987quantifying, lyxell1989beyond, woodward1960phoneme]) which show that the achievement rates in lip-reading tests can range from 10% to over 70%. These achievement rates vary due to the parameter selections for each experiment, which are chosen for the specific task being addressed. In particular, two choices have a significant effect on how one should compare such investigations. The first is the accuracy metric: some present word error rate, others present percent true positive matches, and others use alternative metrics such as the HTK correctness and accuracy scores (explained in full in section LABEL:sec:htk, Equations LABEL:eq:correctness & LABEL:eq:accuracy respectively). The second is the classification unit, where the options include matching on phonemes, visemes or words.
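To illustrate why the choice of metric matters when comparing studies, the short sketch below computes the two HTK-style scores from error counts. It assumes the standard HTK definitions, correctness C = (N - D - S)/N and accuracy A = (N - D - S - I)/N, where N is the number of labels in the reference transcription and D, S and I are the numbers of deletion, substitution and insertion errors. The function names and the example counts are hypothetical and are not taken from any experiment reported here.

```python
def htk_correctness(n_ref, deletions, substitutions):
    """Proportion of reference labels recognised correctly.

    Insertion errors are ignored by this score.
    """
    if n_ref <= 0:
        raise ValueError("n_ref must be a positive number of reference labels")
    return (n_ref - deletions - substitutions) / n_ref


def htk_accuracy(n_ref, deletions, substitutions, insertions):
    """Like correctness, but insertion errors are also penalised.

    The result can be negative when there are many insertions.
    """
    if n_ref <= 0:
        raise ValueError("n_ref must be a positive number of reference labels")
    return (n_ref - deletions - substitutions - insertions) / n_ref


# Hypothetical error counts for one recogniser output (illustration only).
n_ref, n_del, n_sub, n_ins = 100, 10, 20, 15

correctness = htk_correctness(n_ref, n_del, n_sub)
accuracy = htk_accuracy(n_ref, n_del, n_sub, n_ins)
print(f"correctness = {correctness:.2f}, accuracy = {accuracy:.2f}")
```

With these made-up counts the same recogniser output scores 0.70 on correctness but 0.55 on accuracy, so two studies reporting different metrics cannot be compared directly.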
Some effects on human lip-reading performance are:
Intelligence and verbal reasoning - McGrath [mcgrath1985examination] showed that a fundamental level of intelligence and verbal reasoning are essential to be able to lip-read at all, but beyond a limit these skills could not raise human comprehension further.
Training - human lip-readers who have either self-studied or have been trained in some manner to practice the skill of lip-reading are shown to be no better than those who have received no training [conrad1977lip, dodd1989teaching]. Also it has been shown that human lip-readers can actually get worse with training [binnie1976visual], and this effect is more present when humans lip-read from videos rather than in the presence of the speaker [lan2012insights].
Low-level visual-neural processing - Summerfield [summerfield1992lipreading] discusses the physiological matter of the processing speed of these neural processes in the human brain. The suggestion is that lip-reading is difficult to learn because it is dependent upon these low-level neural processes. This suggestion has not, however, been supported by reproducible results, which reassures us that human lip-reading is possible, however challenging.
Closeness between the conversation participants - studies show that a relationship of some description between those talking, or personal knowledge of the speaker by the interpreter, can improve human lip-reading [lott1960influence, ronnberg1998conceptual, ewing1944lipreading].
Knowledge of conversation context - without the constraint of the ‘rules’ of a language to limit what a probable utterance is, lip-reading becomes almost impossible, or akin to guessing [samuelsson1993implicit]. In [samuelsson1991script] experiments showed that recognising isolated sentences was as low scoring as simply guessing from the context alone.
In summary, the main application of a machine lip-reading system would be any situation where the audio signal in a video is either absent or too noisy to comprehend, or where the alternative, human lip-readers, are too expensive or too unreliable.
1.2 The research problem
A conventional lip-reading system consists of a sequence of tasks as shown in Figure 1.1. Our work focuses on the classification task. Currently we have to make some assumptions by tracking a face in a video in order to extract some features before we can undertake machine lip-reading.
The first task, on the left hand side of Figure 1.1, is face tracking. This means to locate a face in an image (one frame of a video) and track it throughout the whole video sequence. By the end of the tracking process, often completed by fitting a model to each frame, we have a data structure containing information about the face through time. Examples of work showing face finding and tracking are in [schwerdt2000robust] and [tomitaka1995human]. Example tracking methods use Active Appearance Models or Linear Predictors [ong2011robust]. We discuss these two methods in Chapter LABEL:chap:featuretypes. The second task, in the centre of Figure 1.1, is visual feature extraction. Using the fitted data parameters from task one, we can extract features which contain solely information pertaining to the speaker’s lips. The third and final task, on the right hand side of Figure 1.1, is classification. This is where we train some kind of classification model, using some visual features as training data, and use the classifiers to classify some unseen test data. Classification produces an output which can be compared with a ground truth to evaluate the accuracy of the classifiers.
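The data flow through these three tasks can be sketched in a few lines of code. The sketch below is only a toy illustration under strong assumptions: each "frame" is a hypothetical record that already holds lip coordinates (standing in for a real image), the tracker and feature extractor are trivial stand-ins rather than Active Appearance Models or Linear Predictors, and the classifier is a simple nearest-class-mean rule rather than any classifier used in this thesis. All names and data values are invented for illustration.

```python
from statistics import mean


def track_face(frame):
    """Task 1 (stand-in): return fitted landmark positions for one frame."""
    return {"upper_lip_y": frame["upper_lip_y"],
            "lower_lip_y": frame["lower_lip_y"]}


def extract_feature(landmarks):
    """Task 2 (stand-in): keep only lip information, here the mouth opening."""
    return landmarks["lower_lip_y"] - landmarks["upper_lip_y"]


def video_to_features(video):
    """Run tasks 1 and 2 on every frame of a video."""
    return [extract_feature(track_face(frame)) for frame in video]


def train_classifier(features, labels):
    """Task 3a: learn one mean feature value per class label."""
    return {label: mean(f for f, l in zip(features, labels) if l == label)
            for label in set(labels)}


def classify(model, feature):
    """Task 3b: assign the label whose class mean is nearest to the feature."""
    return min(model, key=lambda label: abs(model[label] - feature))


# Hypothetical training and test data (illustration only).
train_video = [{"upper_lip_y": 10, "lower_lip_y": 12},
               {"upper_lip_y": 10, "lower_lip_y": 20},
               {"upper_lip_y": 11, "lower_lip_y": 14},
               {"upper_lip_y": 11, "lower_lip_y": 20}]
train_labels = ["closed", "open", "closed", "open"]

test_video = [{"upper_lip_y": 10, "lower_lip_y": 11},
              {"upper_lip_y": 10, "lower_lip_y": 18},
              {"upper_lip_y": 12, "lower_lip_y": 16}]
ground_truth = ["closed", "open", "closed"]

model = train_classifier(video_to_features(train_video), train_labels)
predicted = [classify(model, f) for f in video_to_features(test_video)]

# Compare the classifier output with the ground truth to score it.
accuracy = sum(p == t for p, t in zip(predicted, ground_truth)) / len(ground_truth)
print(predicted, accuracy)
```

The point of the sketch is the separation of concerns: the classification stage only ever sees the features, so the classifier can be changed without altering the tracking or feature extraction stages.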
There is a lot of literature on feature extraction methods [ong2008robust, 4362878, hong2006pca, yang1996real, potamianos1998image, luettin1997speechreading] and on tracking faces through images [927467, 1027648, mckenna1996tracking, lerdsudwichai2003algorithm, crowley1997multi] for lip-reading. However, to date, no one method is accepted as the de facto method for extracting lip-reading features. In light of this, in [zhou2014review], Zhou et al. ask two questions about feature extraction, specifically for lip-reading: primarily, how to cope with the speaker identity dependency in visual data? But also, how to incorporate the temporal information of visual speech? The intent of this second question is to capture co-articulation effects in the features. Zhou et al. categorise a comprehensive range of feature extraction techniques into four groups: Image-based e.g. [gowdy2004dbn], Motion-based e.g. [mase1991automatic], Geometric-feature-based e.g. [nefian2002dynamic], or Model-based e.g. [eveno2004accurate].
This categorisation serves to show the breadth of current research into features. However, feature extraction is not the only challenge in machine lip-reading. Improvements can still be made in the classification stage of lip-reading. Therefore much of this thesis is focused on classification, rather than additional tasks such as tracking and feature extraction. That is not to say we are dismissive of the feature extraction and tracking requirements, rather that we wish to focus our work on improving the classification methods.
Figure 1.2 shows the situation in which we are trying to recreate the text in the mind of the speaker. Each speaker articulates differently, and so the identity of the individual speaker has a significant effect on the efficacy of lip-reading. The visual signal is also affected by the speaker’s pose, motion and expression. Cameras typically have many parameters that might affect lip-reading. Of these, we mention frame rate and resolution as highly likely to be significant.