On Learning Associations of Faces and Voices
In this paper, we study the associations between human faces and voices. Audiovisual integration (AVI), specifically the integration of facial and vocal information, is a well-researched area in neuroscience. It is well known that the overlapping information between the two modalities plays a significant role in perceptual tasks such as speaker identification. Through an online study, we confirm previous findings that people can associate faces with corresponding voices and vice versa with greater than chance accuracy. We show that machines can learn such associations and use the learned information to identify matching faces and voices, with close to human performance. We analyze our findings statistically and evaluate our learned representations.
Keywords: face-voice association, multi-modal learning, representation learning
“Can machines put a face to the voice?”
We humans often deduce various, albeit perhaps crude, information from the voice of others, such as gender, approximate age and even personality. We may even imagine the appearance of the person on the other end of the line when we phone a stranger. Unfortunately, we are also affected by prejudice, stereotypes, and unfounded or unjust biases. In this paper we pose questions about whether machines can put faces to voices, or vice versa, like humans presumably do, and if they can, how accurately they can do so.
To answer these questions, we need to define what “putting a face to a voice” means. We approximate this task by designing a simple discrete test: we judge whether a machine can choose the most plausible facial depiction of the voice it hears, given multiple candidates. This definition has a number of advantages: (1) it is easy to implement in machines, (2) it is possible to conduct the same test on human subjects, and (3) the performance can be quantitatively measured. We can imagine a more formidable task, for instance, where a machine would generate a facial image given a voice. However, we leave this to future work.
We hypothesize that we can train machines to perform the task at least as well as humans do. Neuroscientists have observed that the multimodal associations of faces and voices play a role in perceptual tasks such as speaker recognition [1, 2, 3]. However, it is yet to be studied in the computer vision community whether such human ability can be implemented by machine vision and intelligence. In this work, we attempt to lay the groundwork for this study by identifying the baseline and showing the feasibility of the task. Inspired by the findings of neuroscientists, we try to learn the overlapping information between the two modalities. We first perform an experimental study on human subjects through Amazon Mechanical Turk to show that humans are capable of matching faces to voices with higher than chance accuracy. We then use deep neural networks to learn the co-embedding of modal representations of faces and voices. The learned representations can then be used to retrieve the closest match in the target modality, given a sample in the reference modality. We further evaluate the multi-modal representations both qualitatively and quantitatively.
Specifically, our technical contributions include the following.
We verify and extend the existing findings which suggest that humans are capable of correctly matching unfamiliar face images to corresponding voice recordings and vice versa with greater than chance accuracy. When participants were presented with photographs of two native English speakers of the same gender, ethnicity, and age group and a voice recording of one of the models in the photographs, they were able to choose the correct match on average 58% of the time, far from perfect but statistically meaningful greater-than-chance accuracy.
We present a model to learn the multi-modal representations reflecting the overlapping information between human faces and voices, using existing techniques based on deep neural networks. We support our model by a set of experiments and evaluations.
We show that machines can indeed perform the matching task using the learned multi-modal representations. The performance we obtain is on a par with human performance.
2 Related Work
Studies on face-voice association span multiple disciplines. Among the most relevant to computer vision researchers are cognitive science and neuroscience which study human subjects, and machine learning, specifically, cross-modal modeling. To put our work into a proper context, we briefly review the related literature in these areas.
Human capability for face-voice association.
Behavioural and neuroimaging studies of face-voice integration showed clear evidence of early perceptual integrative mechanisms between face and voice processing pathways. The study of Campanella and Belin [4] reveals that humans leverage the interface between facial and vocal information for both person recognition and identity processing. This human capability is unconsciously learned by processing a tremendous number of auditory-visual examples throughout their whole life [5], and the ability to learn the associations between faces and voices starts to develop as early as three months of age [6], without deliberate training. (In machine learning terminology, this could be seen as natural supervision [7] or self-supervision [8] with unlabeled data.) This ability has also been observed in other primates [9].
These findings raised the question of how well people can match an unfamiliar voice and an unfamiliar face that belong to the same person [10, 11, 12, 13]. Early work [10, 11] argued that people could match voices to dynamically articulating faces but not to static photographs. More recent findings of Mavica and Barenholtz [12] and Smith et al. [13] contradicted these results, and presented evidence that humans can actually match static facial images to corresponding voice recordings with greater than chance accuracy. In a separate study, Smith et al. also showed that there is a strong agreement between the participants’ ratings of a model’s femininity, masculinity, age, health and weight made separately from faces and voices [14]. The discrepancy between these sets of studies was attributed to the different experimental procedures. For instance, Kamachi et al. [10] and Lachs and Pisoni [11] presented the stimuli sequentially (participants either heard a voice and then saw two faces or saw a face and then heard two voices), while the latter works presented faces and voices simultaneously. In addition, the particular stimuli used could also have led to a difference in performance. For example, Kamachi et al. experimented with Japanese models, whereas Mavica and Barenholtz used Caucasian models. Smith et al. showed that different models vary in the extent to which they look and sound similar, and performance could be highly dependent on the particular stimuli used.
The closest work to our human subject study is Mavica and Barenholtz’s experiment. We extend the previous work in several ways. First, we exploit crowdsourcing to collect a larger and more diverse dataset of models. We collected faces and voices of more than 200 models of different gender, ethnicity, age group and first language. This diversity allowed us to investigate a wider spectrum of task difficulties according to varying control factors in demographic parameters. Specifically, whereas previous work only tested on models from a homogeneous demographic group (same gender, ethnicity, age group), we vary the homogeneity of the sample group in each experiment and test models of the same gender (G), the same gender and ethnicity (G/E), and the same gender, ethnicity, first language and age group (G/E/F/A). By comparing the performances across experiments, we explicitly test the assumption, hitherto taken for granted, that people infer demographic information from both face and voice and use this to perform the matching task.
Visual-auditory cross-modal learning by machinery.
Inspired by the early findings from cognitive science and neuroscience that humans integrate audio-visual information for perception tasks [15, 16, 17], the machine learning community has also shown increased interest in visual-auditory cross-modal learning. The key motivation has been to understand whether machine learning models can reveal correlations across different modalities. With the recent advance of deep learning, multi-modal learning leverages neural networks as a universal approximator [18] to approximate arbitrary functions, so that common or complementary information can be effectively mined from large-scale paired data.
In the real world, the concurrency of visual and auditory information provides a natural supervision. The recent emergence of deep learning has advanced the understanding of the correlation between audio and visual signals in applications such as: improving sound classification by combining images and their concurrent sound signals in videos; scene and place recognition by transferring knowledge from visual to auditory information; vision-sound cross-modal retrieval [7, 19, 22]; and sound source localization in visual scenes. These works focus on the fact that visual events are often positively correlated with their concurrent sound signals. This fact is utilized to learn representations that are modality-invariant. We build on these advances and extend them to the face-voice pair.
Other closely related works include Ngiam et al. and Chung et al., which showed that the joint signals from face and audio help disambiguate voiced and unvoiced consonants. Similarly, Hoover et al. and Gebru et al. developed systems to identify active speakers from a video by jointly observing the audio and visual signals. Although the voice-speaker matching task seems similar, these works mainly focus on distinguishing active speakers from non-speakers at a given time, and they do not try to learn cross-modal representations. A different line of work has also shown that recorded or synthesized speech can be used to generate facial animations of animated characters [28, 29] or real persons.
Our interest is to investigate whether people look and sound similar, i.e., to explore the existence of a learnable relationship between the face and voice. To this end, we leverage the face-voice matching task. We examine whether faces and voices encode redundant identity information and measure to what extent.
3 Study on Human Performance
We conducted a series of experiments to test whether people can match a voice of an unfamiliar person to a static facial image of the same person. Participants were presented with photographs of two different models and a 10-second voice recording of one of the models. They were asked to choose one and only one of the two faces they thought would have a similar voice to the recorded voice (V → F). We hypothesized that people may rely on information such as gender, ethnicity and age inferred from both face and voice to perform the task. To test this possibility, in each experiment, we added additional constraints on the sample demography and only compared models of the same gender (G - Experiment 1), same gender and ethnicity (G/E - Experiment 2), and finally same gender, ethnicity, first language, and age group (G/E/F/A - Experiment 3), specifically male pairs and female pairs from non-Hispanic white, native speakers in their 20s. For the most constrained condition (G/E/F/A), we also performed a follow-up experiment, where participants were presented with a single facial image and two voice recordings and chose the recording they thought would be similar to the voice of the person in the image (F → V).
For the face-voice matching experiments, we conducted a separate study, also through Amazon Mechanical Turk. Before starting the experiment, participants filled out a questionnaire about their demographic information, identical to the one used for data collection. Following the questionnaire, they completed 16 matching tasks, along with 4 control tasks for quality control. Each task consists of comparing a pair of faces and selecting one of them as matching a voice recording (vice versa for Experiment 4). Two of the four control tasks check for consistency; we repeat the same pair of faces and voice. The other two control for correctness; we add two pairs with one male model and one female model. From preliminary studies we noticed that people are generally very good at identifying gender from face or voice, and indeed fewer than 4% of the participants incorrectly answered the correctness control questions (11 out of 301 participants). In the analysis, we discarded data from participants who failed two or more control questions (9/301).
The remaining 16 tasks consist of 16 different pairs. Each unique person in the dataset is paired with 8 other persons from the dataset, randomly selected within the experiment’s demographic constraint (Experiment 1: same gender, Experiment 2: same gender and ethnicity, Experiments 3 and 4: same gender, ethnicity, age group and first language). Each participant in the experiment was presented with 16 randomly selected pairs (8 male pairs and 8 female pairs). The pairs were presented sequentially. Participants had to listen to the audio recording(s) and choose an answer before they could move on to the next pair. No feedback was given on whether their choice was correct or not, precluding learning of face-voice pairings. We also discarded data from participants who had taken part in the data collection (4/301).
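The pairing procedure described above can be sketched as follows. This is only an illustration under stated assumptions, not the study's actual code: the record fields (`id`, `gender`, `ethnicity`) and both function names are hypothetical, and the constraint in the usage note corresponds to Experiment 2 (G/E).

```python
import random

def build_pairs(people, keys, partners=8, seed=0):
    """Pair each person with up to `partners` random others who share all `keys`.

    Pairs are stored unordered and de-duplicated, so a person can end up in
    more than `partners` pairs when other people also draw them as a partner.
    """
    rng = random.Random(seed)
    pairs = set()
    for person in people:
        pool = [q for q in people
                if q["id"] != person["id"] and all(q[k] == person[k] for k in keys)]
        for other in rng.sample(pool, min(partners, len(pool))):
            pairs.add(tuple(sorted((person["id"], other["id"]))))
    return sorted(pairs)

def draw_session(pairs, gender_of, per_gender=8, seed=0):
    """Draw one participant's session: 8 male pairs and 8 female pairs, shuffled."""
    rng = random.Random(seed)
    session = []
    for gender in ("male", "female"):
        pool = [pair for pair in pairs if gender_of[pair[0]] == gender]
        session += rng.sample(pool, per_gender)
    rng.shuffle(session)
    return session
```

For example, with `people` as a list of such records, `build_pairs(people, keys=("gender", "ethnicity"))` yields the candidate pairs and `draw_session(pairs, gender_of)` selects the 16 pairs shown to one participant.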
|G/E/F/A, F → V||55.2%||12.2%||3.69 (75)|
Table 1 shows the average performance across participants for each of the four experimental conditions. Individual t-tests found significantly better than chance performance (50%) for each of the four experimental conditions. An ANOVA comparing the four experiments found a significant difference in performance. Tukey’s HSD showed that performance in Experiment 1 (G) was significantly better than in Experiment 2 (G/E), and performance in Experiment 2 (G/E) was significantly better than in Experiment 3 (G/E/F/A). However, results from Experiment 3 (V → F) and Experiment 4 (F → V) were not significantly different from one another.
Similarly to Mavica and Barenholtz [12], in order to assess whether some models were more or less difficult to match, for Experiment 3, we also calculated the percentage of trials on which the participants chose the correct response whenever the model’s face was presented either as the correct match or as the incorrect comparison. In other words, a high score for a model means participants are able to correctly match the model’s face to its voice as well as reject matching that face to another person’s voice. The per-model performance figure shows the average performance for each of the 42 models (18 male and 24 female). Despite the wide variance in performance, we observe a clear trend toward better-than-chance performance, with 34 of the 42 models (80%) yielding a performance above 50%.
Overall, participants were able to match a voice of an unfamiliar person to a static facial image of the same person at better than chance levels. The performance drop across experimental conditions 1 to 3 supports the view that participants leverage demographic information inferred from the face and voice to perform the matching task. Hence, participants performed worse when comparing pairs of models from demographically more homogeneous groups. This was an assumption taken for granted in previous work, but not experimentally tested. More interestingly, even for the most constrained condition, where participants compared models of the same gender, ethnicity, age group and first language, their performance was better than chance. This result aligns with that of Mavica and Barenholtz [12] that humans can indeed perform the matching task with greater than chance accuracy even with static facial images. The direction of inference (F → V vs. V → F) did not affect the performance.
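As a worked check of the test against chance, the sketch below recomputes a one-sample t statistic from the summary values in the G/E/F/A, F → V row of Table 1 (mean 55.2%, SD 12.2%). It assumes that the parenthesized 75 in that row is the number of participants; the function name is hypothetical.

```python
import math

def one_sample_t(mean, sd, n, chance=0.5):
    """t statistic for testing a sample mean against a fixed chance level."""
    standard_error = sd / math.sqrt(n)
    return (mean - chance) / standard_error

# Assumption: 75 is the sample size for this condition.
t = one_sample_t(mean=0.552, sd=0.122, n=75)
print(f"t = {t:.2f}")  # prints "t = 3.69"
```

Under that assumption the recomputed value agrees with the 3.69 reported in the table row.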
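The per-model score described above can be computed as in the following sketch. The trial layout `(target, distractor, chosen)` is a hypothetical record format chosen for illustration, not the study's actual data schema.

```python
from collections import defaultdict

def per_model_accuracy(trials):
    """Fraction of correct responses over all trials in which a model's face appeared.

    Each trial is (target, distractor, chosen): the model whose voice was played,
    the other model shown, and the model the participant selected.
    """
    shown = defaultdict(int)
    correct = defaultdict(int)
    for target, distractor, chosen in trials:
        is_correct = chosen == target
        # The face counts for both roles: correct match and incorrect comparison.
        for model in (target, distractor):
            shown[model] += 1
            correct[model] += is_correct
    return {model: correct[model] / shown[model] for model in shown}

trials = [("a", "b", "a"), ("b", "a", "a"), ("a", "c", "a"), ("c", "b", "c")]
scores = per_model_accuracy(trials)
```

A model scores highly only if participants both accept its face for its own voice and reject it for other voices.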
4 Cross-modal Learning on Faces and Voices
Our attempt to learn cross-modal representations between faces and voices is inspired by the significance of the overlapping information in certain cognitive tasks like identity recognition, as discussed earlier. We use standard network architectures to learn the latent spaces that represent the visual and auditory modalities for human faces and voices and are compatible enough to grasp the associations between them. Analogous to human unconscious learning, we train the networks to learn the voice-face pairs from naturally paired face and voice data without other human supervision.
4.1 Network Architecture
We build our experimental model by choosing architectures that are widely used and proven to work in various contexts. The overall architecture is based on the triplet network, which is widely used for metric learning. As base networks for the two modalities, we use VGG16 and SoundNet, which have shown sufficient model capacities while allowing for stable training in a variety of applications. In particular, SoundNet was devised in the context of transfer learning between visual and auditory signals.
Unlike typical configurations where all three networks share the weights, in our model, two heterogeneous subnetworks are hooked up to the triplet loss (Figure 2). The face subnetwork is based on VGG16, where the conv5_3 layer is average-pooled globally, resulting in 512-d output. It is fed to a 128-d fully connected layer with the ReLU activation, followed by another 128-d fully connected layer but without ReLU, which yields the face representation. The voice subnetwork is based on SoundNet. The conv6 layer of SoundNet is similarly average-pooled globally, yielding 512-d output, which is fed to two fully-connected layers with the same dimensions as those in the face subnetwork one after another. In our experiments with the voice as the reference modality, for a single voice subnetwork, there are two face subnetworks, which share the weights.
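The layers added on top of each base network can be sketched as below. This is a dependency-free illustration of the dimensions only (512-d pooled features, then two 128-d fully connected layers); the pretrained VGG16 and SoundNet backbones are not included, and the function names and the random initialization are hypothetical.

```python
import random

def global_average_pool(feature_map):
    """Average each channel of a C x H x W feature map, giving a C-d vector."""
    return [sum(map(sum, channel)) / (len(channel) * len(channel[0]))
            for channel in feature_map]

def linear(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_linear(n_in, n_out, rng):
    """Small random weights and zero biases (an arbitrary choice for this sketch)."""
    weights = [[rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def embedding_head(feature_map, fc1, fc2):
    pooled = global_average_pool(feature_map)  # 512-d
    hidden = relu(linear(pooled, *fc1))        # 128-d, with ReLU
    return linear(hidden, *fc2)                # 128-d, no ReLU: the representation
```

The same head shape applies to both subnetworks; for audio features, a C x T x 1 layout can be passed as the feature map.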
During training, for each random voice sample, one positive face sample and one negative face sample are drawn, and the resulting tuple is fed forward to the triplet network. Optimizing for the triplet loss minimizes the distance between the representations of the voice and the positive face, while maximizing the distance between those of the voice and the negative face, pulling representations of the same identity closer and pushing those of different identities away.
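To make the objective concrete, the sketch below uses the common margin-based (hinge) form of the triplet loss. This is a stand-in chosen for illustration: the exact formulation used for training is not reproduced here, and the margin value and function names are assumptions.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss.

    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive; otherwise it grows with the violation.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

voice = [0.0, 0.0]
print(triplet_margin_loss(voice, [0.0, 0.0], [3.0, 4.0]))  # 0.0 (well separated)
print(triplet_margin_loss(voice, [3.0, 4.0], [0.0, 0.0]))  # 6.0 (violated)
```

With the voice as the reference modality, `anchor` is the voice representation and `positive` and `negative` are face representations; the roles are swapped for the face-referenced model.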
4.2 Implementation Details
All face images are scaled to 224×224 pixels. Audio clips are resampled at 22,050 Hz and trimmed to 10 seconds; those shorter than 10 seconds are tiled back to back before trimming. In addition, a standard data augmentation scheme was optionally used: face images are randomly cropped around the face region (a negative crop includes more background), rotated by a random angle, and randomly flipped horizontally. Image brightness and contrast as well as audio volume are randomly jittered. We trained our network both with and without data augmentation under the same setup outlined below, but did not find a significant difference in performance. Training tuples are randomly drawn from the pool of faces and voices: for a random voice, a random but distinct face of the same identity and a random face of a different identity are sampled.
We use Adam to optimize our network, with a batch size of 8. We use the pretrained models of VGG16 and SoundNet. The fully connected layers are trained from scratch, while the pretrained part of the network is fine-tuned at the same time with its own learning rate. Training continues for 240k iterations, with the learning rates decayed by a constant factor every 80k iterations. After 120k iterations, the network is trained with harder training samples: 16 tuples are sampled for each batch, of which only the 8 with the highest losses are used for back-propagation.
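The audio length normalization described above (tile short clips back to back, then trim to 10 seconds at 22,050 Hz) can be sketched as follows. The function name is hypothetical and the clip is assumed to be an already resampled sequence of samples.

```python
def tile_and_trim(samples, sample_rate=22050, seconds=10):
    """Return exactly `sample_rate * seconds` samples.

    Clips shorter than the target are repeated back to back before trimming;
    longer clips are simply cut at the target length.
    """
    target = sample_rate * seconds
    if not samples:
        raise ValueError("cannot tile an empty clip")
    repeats = -(-target // len(samples))  # ceiling division
    return (list(samples) * repeats)[:target]

print(tile_and_trim([1, 2, 3], sample_rate=1, seconds=7))  # [1, 2, 3, 1, 2, 3, 1]
```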
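The training schedule above can be sketched as below: hard-sample selection (keep the 8 highest-loss tuples out of 16) and a step decay every 80k iterations. The function names are hypothetical, and `base_rate` and `gamma` are placeholders to be supplied by the caller because the actual learning rates and decay factor are not given here.

```python
def select_hardest(losses, keep=8):
    """Indices of the `keep` tuples with the highest loss, hardest first."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:keep]

def tuples_to_sample(iteration, batch_size=8, mining_start=120_000):
    """Draw twice the batch size once hard mining starts, otherwise one batch."""
    return 2 * batch_size if iteration >= mining_start else batch_size

def decayed_rate(base_rate, iteration, gamma, step=80_000):
    """Step decay: multiply the rate by `gamma` once per completed `step`."""
    return base_rate * gamma ** (iteration // step)

print(select_hardest([0.1, 0.9, 0.5, 0.7], keep=2))  # [1, 3]
```

In a training loop, the per-tuple losses would be computed for all sampled tuples, and only the tuples returned by `select_hardest` would contribute to the gradient.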
In practice, we train two separate models for the two-choice experiments and cross-modal retrieval tests: the network with the voice as the reference modality (shown in Figure 2) and the network with the face as the reference (not shown in the figure, but defined analogously).
Our collected dataset of 239 samples was not large enough to train a large deep neural network. Thus, we turned to unconstrained, “in-the-wild” datasets, which provide a large amount of videos mined from video sharing services like YouTube and Flickr. We use the VoxCeleb dataset to train our network. From the available 21,063 videos, 114,109 video clips of 1,251 celebrities are cut and used. (The actual video files constituting the dataset are downloaded from YouTube, and not all videos were available at the time of submission.) We split these into two sets: 1,001 randomly chosen identities as the training set and the remaining 250 identities as the test set. The dataset comes with facial bounding boxes. We first filtered the bounding boxes temporally as there were fluctuations in their sizes and positions, and enlarged them by 1.5 times to ensure that the entire face is always visible. From each clip, the first frame and first 10 seconds of the audio are used, as the beginning of the clips is usually well aligned with the beginning of utterances. Figure 3 shows some face samples we used for training; more samples including voices are presented in our supplementary material. We manually annotated the samples in the test set with demographic attributes, which allowed us to conduct the experiments with the same controls as presented in Section 3 and to examine the clustering on such attributes naturally arising in the learned representations (Section 4.5). The demographic distributions of the annotated test set are illustrated in Table 2.
We conducted the same experiments introduced in Section 3 using our trained model. A voice recording is fed to the network along with two candidate face images, resulting in three representation vectors. The distances from the voice representation to the representations of the two face candidates are calculated, and the candidate with the shorter distance is picked as the matching face. The performance of our model measured on the test set is tabulated in Table 3.
Similarly to our user study, we measure the test accuracy under a number of different conditions. We replicate the conditions of Experiments 1 (G), 2 (G/E), and 3 (G/E/F/A) as before but in both directions (thus including Experiment 4), in addition to two more experiments where the accuracy is measured on the same ethnic group (E) and on the entire test set (–). For Experiment 3 (G/E/F/A), we show the accuracy on the largest and most homogeneous group of people in the test set (non-Hispanic white, male, native speakers in their 30s). We used the age group of 30s instead of 20s, as the test set demography includes more identities in their 30s, compared to the 20s in our user study dataset. We observe that the gender of the subject provides the strongest cue for our model to decide the matching, as we assume it does for human testers. (Gender is such a strong cue that we use it for control questions in our user study; see Section 3.) Unlike the experiments with human participants, conditioning on ethnicity lowers the accuracy only marginally. For the most constrained condition (G/E/F/A), the accuracy drops by about 20% from the uncontrolled experiment.
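Two of the data preparation steps above, the identity-level split and the 1.5× bounding box enlargement, can be sketched as follows. The function names, the `(x, y, w, h)` box layout, the clipping to the image, and the seed are assumptions made for illustration; the temporal filtering of the boxes is not shown.

```python
import random

def split_identities(identities, n_train=1001, seed=0):
    """Randomly split identities so that no person appears in both sets."""
    ids = sorted(identities)
    random.Random(seed).shuffle(ids)
    return ids[:n_train], ids[n_train:]

def enlarge_box(box, scale=1.5, image_size=None):
    """Scale an (x, y, w, h) box about its center; optionally clip to the image."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    new_w, new_h = w * scale, h * scale
    left, top = cx - new_w / 2, cy - new_h / 2
    right, bottom = left + new_w, top + new_h
    if image_size is not None:
        width, height = image_size
        left, top = max(0.0, left), max(0.0, top)
        right, bottom = min(float(width), right), min(float(height), bottom)
    return (left, top, right - left, bottom - top)

print(enlarge_box((100, 100, 40, 40)))  # (90.0, 90.0, 60.0, 60.0)
```

Splitting by identity rather than by clip keeps every test speaker unseen during training.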
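The two-choice decision rule described above can be sketched as follows, assuming the representations have already been computed by the trained subnetworks. Euclidean distance and the function names are assumptions for this illustration.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pick_match(reference, candidates):
    """Index of the candidate representation closest to the reference."""
    return min(range(len(candidates)), key=lambda i: distance(reference, candidates[i]))

def two_choice_accuracy(trials):
    """Each trial is (reference, [candidate_0, candidate_1], index_of_true_match)."""
    hits = sum(pick_match(ref, cands) == truth for ref, cands, truth in trials)
    return hits / len(trials)

trials = [
    ([0.0, 0.0], [[0.0, 1.0], [3.0, 3.0]], 0),  # closest is candidate 0: correct
    ([1.0, 1.0], [[5.0, 5.0], [1.0, 2.0]], 1),  # closest is candidate 1: correct
    ([0.0, 0.0], [[2.0, 2.0], [0.0, 1.0]], 0),  # closest is candidate 1: wrong
]
accuracy = two_choice_accuracy(trials)
```

Chance level for this test is 50%, which matches the baseline used in the human study.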
These results largely conform to our findings from the user study (Table 1). One noticeable difference is that the performance drop due to the demographic conditioning is less drastic in the machine experiments (4%) than in the human experiments (13%), while their accuracies on the most controlled group (G/E/F/A; i.e., the hardest experiment) are similar (59.0% and 58.4%, respectively). Note that the accuracy on the uncontrolled group was not measured on human participants, and the machine’s best accuracy should not erroneously be compared to the human best accuracy, which is already measured among same gender candidates.
|Modalities||Demographic grouping of test samples|
4.5 Evaluations on the Learned Representation
Figure 4 demonstrates the clustering that emerges in our learned representation using the t-SNE visualization. We drew 100 random samples for each of 10 unique identities from the VoxCeleb test set for the identity visualization in Figure 4a and 4d, which show per-identity clustering. Additionally, we drew 1,000 random samples and used our demographic attribute annotation to color-code the sample points according to their attributes (Figure 4b, c, e, f). Note that our network has not seen any of the demographic attributes during training. The learned representation forms the clearest clusters regarding gender (Figure 4b, e), which explains the performance drop when the experiment is constrained by gender. Also noticeable is the distribution by age (Figure 4c, f). While correlated with gender, it shows a grouping distinct from that of gender, in particular for face representations. The t-SNE visualization does not reveal such clustering regarding the first language or the ethnic group (shown in the supplementary material).
In Table 4, we further evaluate our representation by training a simple linear classifier on our representations. We inspect whether any additional interpretable information is encoded in the representations and how much discrepancy there exists between the representations from the two modalities. Similarly to the data-driven probing used in Bau et al., we use the demographic attributes as probing data to check the learnability of those attributes from given representations. Given the set of representations and their corresponding attributes, we train binary linear SVM classification models for each attribute class in a one-vs-all manner. The results further support that, while the attributes are never used for training, the learned representation encodes a significant amount of attribute information. They also demonstrate that our representation encodes additional information, which is more prominent in the face modality. The statistical insignificance of age group classification from voice representations aligns with the t-SNE visualization (Figure 4f), which shows less obvious patterns than those found in its face counterpart (Figure 4c). Please refer to our supplementary material for more visualizations and further evaluations.
Lastly, we use the learned representations for cross-modal retrieval. Given a face (voice) sample, we retrieve the voice (face) samples closest in the latent space. We report recall@$K$ in Table 5, which measures the portion of queries where the true identity is among the top $K$ retrievals, as in Vendrov et al., for varying $K$ and set sizes. The number of samples per identity was kept the same, while samples within the same identity were randomly chosen.
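The retrieval metric described above can be sketched as follows: a query counts as a hit when a gallery item of the same identity appears among its K nearest items in the other modality. Euclidean distance, the `(embedding, identity)` layout, and the function names are assumptions for this illustration.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recall_at_k(queries, gallery, k):
    """Fraction of queries whose true identity appears in the top-k retrievals.

    `queries` and `gallery` are lists of (embedding, identity) from the two
    modalities, for example voice queries against a gallery of faces.
    """
    hits = 0
    for query_embedding, query_identity in queries:
        ranked = sorted(gallery, key=lambda item: distance(query_embedding, item[0]))
        hits += any(identity == query_identity for _, identity in ranked[:k])
    return hits / len(queries)

gallery = [([0.0, 0.0], "a"), ([1.0, 0.0], "b"), ([5.0, 5.0], "c")]
queries = [([0.1, 0.0], "a"), ([0.9, 0.0], "a")]
print(recall_at_k(queries, gallery, k=1))  # 0.5
print(recall_at_k(queries, gallery, k=2))  # 1.0
```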
Comparisons and model parameter selection.
We conducted the same two-choice experiments on a network with a different architecture. The network was inspired by the “$L^3$ network” [20], an audio-visual correlation classifier, and trained to do binary classification: given a face and a voice, whether or not the two belong to the same identity. We trained and tested the network in a similar manner to the network presented in Section 4.1. Both networks showed similar performance in our experiments, which supports the learnability of the overlapping information between the two modalities regardless of the particular network architecture. We also measured the test accuracy with varying configurations of the presented network, e.g., the dimensions of the fully connected layers (and thus those of the representation vectors). While this did not influence the test accuracy much, smaller (narrower) fully connected layers generally resulted in better performance. We discuss the comparisons with different architectures and choices of hyper-parameters in more detail in our supplementary material.
We observed that face and voice representations are learned asymmetrically, depending on the modality used as the reference of the triplet. This was not observed in the classification network. We simply trained two networks, one with the voice as the reference for the voice-to-face retrieval experiment, and vice versa. A more sophisticated model, such as the quadruplet network [39, 40], could be used to alleviate this issue. In this work, however, we focus more on showing the learnability of the task using widely used models, minimizing the complexity and the dependency on a particular architecture.
We also tested our trained model on the dataset used for our user studies on human subjects. As summarized in Table 6, our model trained on the VoxCeleb dataset results in significantly degraded performance, failing to generalize to another dataset, or “domain”. This could be alleviated by additional fine-tuning on the new dataset or by domain adaptation (e.g., Tzeng et al.), which is left to future work.
|Modalities||Demographic grouping of test samples|
We studied the associations between human faces and voices through a series of experiments: first, with human subjects, showing the baseline for how well people perform such tasks, and then on machines using deep neural networks, demonstrating that machines perform on a par with humans. We expect that our study on these associations can provide insights into challenging tasks in a broader range of fields, pose fundamental research challenges, and lead to exciting applications.
For example, cross-modal representations such as ours could be used for finding the voice actor who sounds like how an animated character looks, manipulating synthesized facial animations [28, 29, 30] to harmonize with corresponding voices, or as an entertaining application to find the celebrity whose voice sounds like a user’s face or vice versa. While understanding of the face-voice association at this stage is far from perfect, its advance could potentially lead to additional means for criminal investigation, like lie detection, which is still debatable but used in practice. However, we emphasize that, similar to lie detectors, such associations should not be used for screening purposes or as hard evidence. Our work only suggests the possibility of learning potential associations by referring to a part of the human cognitive process, but not their definitive existence.
Changil Kim was supported by a Swiss National Science Foundation fellowship P2EZP2 168785. We thank Sung-Ho Bae for his help in this work during his appointment at MIT.
-  von Kriegstein, K., Kleinschmidt, A., Sterzer, P., Giraud, A.L.: Interaction of face and voice areas during speaker recognition. J. Cognitive Neuroscience 17(3) (April 2005) 367–376
-  Joassin, F., Pesenti, M., Maurage, P., Verreckt, E., Bruyer, R., Campanella, S.: Cross-modal interactions between human faces and voices involved in person recognition. Cortex 47(3) (2011) 367 – 376
-  Zweig, L.J., Suzuki, S., Grabowecky, M.: Learned face-voice pairings facilitate visual search. Psychonomic Bulletin & Review 22(2) (2015) 429â–436
-  Campanella, S., Belin, P.: Integrating face and voice in person perception. Trends in cognitive sciences 11(12) (2007) 535–543
-  Gaver, W.W.: What in the world do we hear?: An ecological approach to auditory event perception. Ecological psychology 5(1) (1993) 1–29
-  Brookes, H., Slater, A., Quinn, P.C., Lewkowicz, D.J., Hayes, R., Brown, E.: Three-month-old infants learn arbitrary auditory–visual pairings between voices and faces. Infant and Child Development 10(1-2) (2001) 75–82
-  Owens, A., Isola, P., McDermott, J.H., Torralba, A., Adelson, E.H., Freeman, W.T.: Visually indicated sounds. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. (2016) 2405–2413
-  Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 1422–1430
-  Sliwa, J., Duhamel, J.R., Pascalis, O., Wirth, S.: Spontaneous voice–face identity matching by rhesus monkeys for familiar conspecifics and humans. Proceedings of the National Academy of Sciences 108(4) (2011) 1735–1740
-  Kamachi, M., Hill, H., Lander, K., Vatikiotis-Bateson, E.: “Putting the face to the voice”: Matching identity across modality. Current Biology 13(19) (2003) 1709–1714
-  Lachs, L., Pisoni, D.B.: Crossmodal source identification in speech perception. Ecological Psychology 16(3) (2004) 159–187
-  Mavica, L.W., Barenholtz, E.: Matching voice and face identity from static images. Journal of Experimental Psychology: Human Perception and Performance 39(2) (2013) 307–312
-  Smith, H.M., Dunn, A.K., Baguley, T., Stacey, P.C.: Matching novel face and voice identity using static and dynamic facial images. Attention, Perception, & Psychophysics 78(3) (2016) 868–879
-  Smith, H.M., Dunn, A.K., Baguley, T., Stacey, P.C.: Concordant cues in faces and voices: Testing the backup signal hypothesis. Evolutionary Psychology 14(1) (2016) 1–10
-  McGurk, H., MacDonald, J.: Hearing lips and seeing voices. Nature 264(5588) (1976) 746–748
-  Jones, B., Kabanoff, B.: Eye movements in auditory space perception. Attention, Perception, & Psychophysics 17(3) (1975) 241–245
-  Shelton, B.R., Searle, C.L.: The influence of vision on the absolute identification of sound-source position. Perception & Psychophysics 28(6) (1980) 589–596
-  Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Networks 4(2) (1991) 251–257
-  Owens, A., Wu, J., McDermott, J.H., Freeman, W.T., Torralba, A.: Ambient sound provides supervision for visual learning. In: Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I. (2016) 801–816
-  Arandjelovic, R., Zisserman, A.: Look, listen and learn. In: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017. (2017) 609–617
-  Aytar, Y., Vondrick, C., Torralba, A.: Soundnet: Learning sound representations from unlabeled video. In: Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain. (2016) 892–900
-  Solèr, M., Bazin, J.C., Wang, O., Krause, A., Sorkine-Hornung, A.: Suggesting sounds for images from video collections. In: European Conference on Computer Vision Workshops. (2016) 900–917
-  Senocak, A., Oh, T.H., Kim, J., Yang, M.H., Kweon, I.S.: Learning to localize sound source in visual scenes. arXiv preprint arXiv:1803.03849 (2018)
-  Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning. In: Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011. (2011) 689–696
-  Chung, J.S., Senior, A.W., Vinyals, O., Zisserman, A.: Lip reading sentences in the wild. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. (2017) 3444–3453
-  Hoover, K., Chaudhuri, S., Pantofaru, C., Slaney, M., Sturdy, I.: Putting a face to the voice: Fusing audio and visual signals across a video to determine speakers. arXiv preprint arXiv:1706.00079 (2017)
-  Gebru, I.D., Ba, S., Evangelidis, G., Horaud, R.: Tracking the active speaker based on a joint audio-visual observation model. In: Proceedings of the IEEE International Conference on Computer Vision Workshops. (2015) 15–21
-  Taylor, S.L., Kim, T., Yue, Y., Mahler, M., Krahe, J., Rodriguez, A.G., Hodgins, J.K., Matthews, I.A.: A deep learning approach for generalized speech animation. ACM Trans. Graph. 36(4) (2017) 93:1–93:11
-  Karras, T., Aila, T., Laine, S., Herva, A., Lehtinen, J.: Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Trans. Graph. 36(4) (2017) 94:1–94:12
-  Suwajanakorn, S., Seitz, S.M., Kemelmacher-Shlizerman, I.: Synthesizing Obama: learning lip sync from audio. ACM Trans. Graph. 36(4) (2017) 95:1–95:13
-  Nagrani, A., Chung, J.S., Zisserman, A.: Voxceleb: A large-scale speaker identification dataset. In: Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017. (2017) 2616–2620
-  Saragih, J.M., Lucey, S., Cohn, J.F.: Face alignment through subspace constrained mean-shifts. In: Computer Vision, 2009 IEEE 12th International Conference on, IEEE (2009) 1034–1041
-  Hoffer, E., Ailon, N.: Deep metric learning using triplet network. In: Similarity-Based Pattern Recognition - Third International Workshop, SIMBAD 2015, Copenhagen, Denmark, October 12-14, 2015, Proceedings. (2015) 84–92
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
-  Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014)
-  van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov) (2008) 2579–2605
-  Bau, D., Zhou, B., Khosla, A., Oliva, A., Torralba, A.: Network dissection: Quantifying interpretability of deep visual representations. In: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, IEEE (2017) 3319–3327
-  Vendrov, I., Kiros, R., Fidler, S., Urtasun, R.: Order-embeddings of images and language. CoRR abs/1511.06361 (2015)
-  Huang, C., Loy, C.C., Tang, X.: Local similarity-aware deep feature embedding. In: Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain. (2016) 1262–1270
-  Ustinova, E., Lempitsky, V.S.: Learning deep embeddings with histogram loss. In: Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain. (2016) 4170–4178
-  Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20-25 June 2011. (2011) 1521–1528
-  Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. (2017) 2962–2971
-  Wu, Z., Singh, B., Davis, L.S., Subrahmanian, V.S.: Deception detection in videos. CoRR abs/1712.04415 (2017)
Appendix A.1 Data Collection for Human Performance
Both the data acquisition and user study were carried out through web applications deployed via Amazon Mechanical Turk. In the following, we present further details of the two tasks.
A.1.1 User Study
A.1.2 Dataset Acquisition
Figure A.3 shows the instructions for data collection. Every participant was asked to read the instructions carefully and to consent to the use of the collected data for research purposes. Figure A.4 shows the questionnaire for demographic information and an example recording session. Words in the text are highlighted sequentially to guide the participants to read at a constant pace. A visual guide is overlaid on the video playback to help the participants align their faces. These guides are meant to help participants complete the recording successfully and are not enforced as hard constraints; instead, we manually removed unsuitable samples afterwards. Participants could repeat the recording session until they were satisfied. From the collected video recordings of the participants speaking, we extract still face images and ten-second-long audio clips containing their voices. The text is chosen from the following pool:
“Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents.”
“That we are in the midst of crisis is now well understood. Our nation is at war, against a far-reaching network of violence and hatred. Our economy is badly weakened, a consequence of greed and irresponsibility on the part of some, but also our collective failure to make hard choices and prepare the nation for a new age. Homes have been lost; jobs shed; businesses shuttered. Our health care is too costly; our schools fail too many; and each day brings further evidence that the ways we use energy strengthen our adversaries and threaten our planet.”
“In reaffirming the greatness of our nation, we understand that greatness is never a given. It must be earned. Our journey has never been one of short-cuts or settling for less. It has not been the path for the faint-hearted - for those who prefer leisure over work, or seek only the pleasures of riches and fame. Rather, it has been the path for the risk-takers, the doers, the makers of things - some celebrated but more often men and women obscure in their labor, who have carried us up the long, rugged path towards prosperity and freedom.”
“For us, they fought and died, in places like Concord and Gettysburg; Normandy and Khe Sahn. Time and again these men and women struggled and sacrificed and worked till their hands were raw so that we might live a better life. They saw America as bigger than the sum of our individual ambitions; greater than all the differences of birth or wealth or faction.”
“To the Muslim world, we seek a new way forward, based on mutual interest and mutual respect. To those leaders around the globe who seek to sow conflict, or blame their society’s ills on the West - know that your people will judge you on what you can build, not what you destroy. To those who cling to power through corruption and deceit and the silencing of dissent, know that you are on the wrong side of history; but that we will extend a hand if you are willing to unclench your fist.”
“For as much as government can do and must do, it is ultimately the faith and determination of the American people upon which this nation relies. It is the kindness to take in a stranger when the levees break, the selflessness of workers who would rather cut their hours than see a friend lose their job which sees us through our darkest hours. It is the firefighter’s courage to storm a stairway filled with smoke, but also a parent’s willingness to nurture a child, that finally decides our fate.”
“So let us mark this day with remembrance, of who we are and how far we have traveled. In the year of America’s birth, in the coldest of months, a small band of patriots huddled by dying campfires on the shores of an icy river. The capital was abandoned. The enemy was advancing. The snow was stained with blood. At a moment when the outcome of our revolution was most in doubt, the father of our nation ordered these words be read to the people:”
“America. In the face of our common dangers, in this winter of our hardship, let us remember these timeless words. With hope and virtue, let us brave once more the icy currents, and endure what storms may come. Let it be said by our children’s children that when we were tested we refused to let this journey end, that we did not turn back nor did we falter; and with eyes fixed on the horizon and God’s grace upon us, we carried forth that great gift of freedom and delivered it safely to future generations.”
Appendix A.2 Evaluations on Machine Performance
In this section, further evaluations and visualizations of our learned representations omitted from Section 4.5 of the main paper are provided. We conclude this section with additional discussions.
A.2.1 Further Evaluations on the Learned Representation
The t-SNE visualizations.
Figures A.5 and A.6 show the t-SNE visualizations of our learned voice and face representations, respectively. We drew 1,000 random samples and used our annotations to color-code the sample points according to their four demographic attributes. See Figure 4 of the main paper for the t-SNE visualizations color-coded by face/voice identity. Note that our network did not see any of the demographic attributes during training.
As discussed in the main paper, the learned representation forms the clearest clusters with respect to gender (Figures A.5a and A.6a), which explains the performance drop when the samples are constrained by gender. Age, while correlated with gender, shows a grouping distinct from that of gender (Figures A.5c and A.6c). In the face representation in particular, Figure A.6c shows age distributed orthogonally to gender: age increases from bottom to top, while gender is split horizontally. The t-SNE visualization does not reveal such strong clustering for the first language or the ethnic group (Figures A.5bd and A.6bd) and presents only small clusters scattered across the projection. As also noted in the main paper, this absence of clustering does not rule out the existence of additional information encoded in our learned representations, for which we provide further evidence in the following.
More evaluations of linear classifiers on our representations.
Table A.1 summarizes the quality of linear classifiers for demographic attributes on our learned representations, similar to Table 4 of the main paper, to demonstrate what information our representation encodes.
As evidenced by the high classification precision, the representation provides the most distinctive information for gender classification, which is consistent with the distributions observed in Figures A.5a and A.6a. Age is a continuous attribute, as demonstrated in Figures A.5a and A.6a, and grouping it into a discrete set of ranges (as in Table 4) makes the classification results more conservative: that is, Figure A.6f shows an overall smooth transition in age, but the ordering is far from perfect, especially in the middle age ranges, which results in the less decisive age classification results shown in Table A.1. We note that the analyses of Table A.1 and Figures A.5 and A.6 (as well as Table 4 and Figure 4 in the main paper) are complementary to, and consistent with, each other. t-SNE is an unsupervised visualization method that typically reveals the dominant information encoded in a representation, while the experiment in Table A.1 (and Table 4 of the main paper) exploits supervised information to reveal information that is otherwise hidden in the representation.
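The idea of probing a fixed representation with a linear classifier can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used for Table A.1: the two-dimensional "embeddings" are synthetic Gaussian clusters standing in for learned features, the binary label stands in for a demographic attribute, and the probe is a logistic regression trained by plain gradient descent with hypothetical hyperparameters.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(feats, labels, lr=0.1, epochs=200):
    """Fit logistic regression on frozen features with full-batch gradient descent."""
    dim = len(feats[0])
    w, b, n = [0.0] * dim, 0.0, len(feats)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(feats, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, feats, labels):
    """Fraction of samples whose predicted class matches the label."""
    hits = 0
    for x, y in zip(feats, labels):
        predicted = sum(wi * xi for wi, xi in zip(w, x)) + b > 0
        hits += int(predicted == (y == 1))
    return hits / len(feats)


def toy_embeddings(rng, n_per_class):
    """Synthetic stand-in for learned features: two clusters, one per label (assumption)."""
    feats = [[rng.gauss(-2.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(n_per_class)]
    feats += [[rng.gauss(2.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(n_per_class)]
    return feats, [0] * n_per_class + [1] * n_per_class


rng = random.Random(0)
train_x, train_y = toy_embeddings(rng, 100)
test_x, test_y = toy_embeddings(rng, 100)
w, b = train_linear_probe(train_x, train_y)
probe_acc = accuracy(w, b, test_x, test_y)
```

Because the probe is linear and the features are kept fixed, a high held-out accuracy indicates that the attribute is linearly recoverable from the representation, even when a t-SNE plot shows no obvious cluster for it.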
A.2.2 Further Discussions
Comparisons to binary classification.
|Modalities||Demographic grouping of test samples|
We conducted the same two-choice experiments on another network with a different architecture. The network was inspired by the “ network”, an audio-visual correlation classifier, and trained to perform binary classification: given a face and a voice, it predicts whether or not the two belong to the same identity. This network shares the same subnetworks as our triplet-loss-based architecture, but the two 512-d feature vectors, average-pooled from the conv5_3 layer of VGG16 and the conv6 layer of SoundNet, respectively, are concatenated to form a 1024-d vector, which is then fed to two successive 128-d fully connected layers, followed by a 2-d fully connected layer and a softmax activation. The class probability of the positive association is used as a score measuring the similarity of the face and the voice, and hence for gauging the distances between a given voice (face) and two candidate faces (voices). The candidate with the higher similarity score is taken as the matching pair. Table A.2 shows the results controlled with the same constraints as in Table 3 of the main paper. Both networks show similar performance on the task, which supports the learnability of the task regardless of the particular network architecture.
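The classifier head and the two-choice decision described above can be sketched as follows. This is a minimal illustration, not our implementation: the weights are random rather than trained, the 512-d inputs are random stand-ins for the pooled VGG16 and SoundNet features, the ReLU activations on the hidden layers are an assumption (the text does not specify them), and treating index 1 of the softmax output as the positive class is a convention chosen here.

```python
import math
import random


def dense(x, weights, bias, relu=True):
    """Fully connected layer y = W x + b, optionally followed by ReLU."""
    out = []
    for row, b in zip(weights, bias):
        s = sum(w * v for w, v in zip(row, x)) + b
        out.append(max(s, 0.0) if relu else s)
    return out


def softmax(logits):
    """Softmax with the usual max-subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Random weights and zero bias; placeholders for trained parameters."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def make_head(rng):
    """Layer sizes from the text: 1024 -> 128 -> 128 -> 2."""
    return [init_layer(rng, 1024, 128), init_layer(rng, 128, 128), init_layer(rng, 128, 2)]


def match_score(head, face_feat, voice_feat):
    """Probability of the positive class for a (face, voice) pair."""
    x = face_feat + voice_feat  # concatenate two 512-d vectors into one 1024-d vector
    last = len(head) - 1
    for i, (weights, bias) in enumerate(head):
        x = dense(x, weights, bias, relu=(i < last))  # no ReLU on the 2-d logits
    return softmax(x)[1]  # index 1 = "same identity" (assumed convention)


def two_choice(head, voice_feat, face_a, face_b):
    """Return 0 if face_a scores at least as high as face_b for this voice, else 1."""
    score_a = match_score(head, face_a, voice_feat)
    score_b = match_score(head, face_b, voice_feat)
    return 0 if score_a >= score_b else 1


rng = random.Random(0)
head = make_head(rng)
voice = [rng.random() for _ in range(512)]
face_a = [rng.random() for _ in range(512)]
face_b = [rng.random() for _ in range(512)]
score = match_score(head, face_a, voice)
choice = two_choice(head, voice, face_a, face_b)
```

The face-to-voice direction works the same way, with one face scored against two candidate voices.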
Dimensions of fully connected layers.
|Experiments||Global spatial pooling||2 spatial pooling|
We measured the test accuracy with varying dimensions of the fully connected layers (and thus the representation vectors). While this did not have a significant influence on test accuracy, generally, narrower fully connected layers resulted in slightly better performance (Table A.3).
The batch size and the learning rate were chosen by grid search within the limits of our machine. Decaying the learning rate and mining hard negative samples helped stabilize training and prevent overfitting to the training data, but did not contribute much to accuracy. The timing and amount of the decay were set empirically. Our model was implemented in TensorFlow and trained on an NVIDIA Titan X (Pascal) with 12 GB of memory. Training typically takes less than a day on a single GPU.
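For readers unfamiliar with the two training aids mentioned above, the following sketch shows one common form of each. It is a generic illustration under stated assumptions, not our training code: the step-wise decay schedule, the squared Euclidean distance, the choice of the single closest negative, and all numeric values (base learning rate, decay interval, decay factor, margin) are hypothetical.

```python
def step_decay(base_lr, step, decay_every, factor):
    """Learning rate after `step` updates, scaled by `factor` every `decay_every` steps."""
    return base_lr * factor ** (step // decay_every)


def sq_dist(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hardest_negative(anchor, negatives):
    """Index of the negative sample closest to the anchor, i.e. the hardest one."""
    return min(range(len(negatives)), key=lambda i: sq_dist(anchor, negatives[i]))


def triplet_loss(anchor, positive, negative, margin):
    """Hinge-style triplet loss: push the negative at least `margin` farther than the positive."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


# Toy 2-d embeddings (hypothetical): one voice anchor, its matching face,
# and three non-matching faces.
anchor = [0.0, 0.0]
positive = [1.0, 0.0]
negatives = [[3.0, 0.0], [0.5, 0.5], [0.0, 2.0]]

hard_idx = hardest_negative(anchor, negatives)
loss = triplet_loss(anchor, positive, negatives[hard_idx], margin=1.0)
lr = step_decay(base_lr=1e-3, step=250, decay_every=100, factor=0.5)
```

Selecting the closest negative concentrates the gradient on the pairs the model currently confuses, while the decaying learning rate takes smaller steps as training progresses.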