Looking Enhances Listening: Recovering Missing Speech Using Images


Speech is understood better by using visual context; for this reason, there have been many attempts to use images to adapt automatic speech recognition (ASR) systems. Current work, however, has shown that visually adapted ASR models only use images as a regularization signal, while completely ignoring their semantic content. In this paper, we present a set of experiments where we show the utility of the visual modality under noisy conditions. Our results show that multimodal ASR models can recover words which are masked in the input acoustic signal, by grounding their transcriptions using the visual representations. We observe that integrating visual context can result in up to 35% relative improvement in masked word recovery. These results demonstrate that end-to-end multimodal ASR systems can become more robust to noise by leveraging the visual context.


Tejas Srinivasan, Ramon Sanabria, Florian Metze
Language Technologies Institute, Carnegie Mellon University, U.S.A.
CSTR and ILCC, University of Edinburgh, UK
tsriniva|fmetze@cs.cmu.edu, r.sanabria@ed.ac.uk

Keywords: Multimodal learning, noisy ASR, robustness

1 Introduction

Humans process and understand language better when integrating information from multiple modalities. More concretely, in conversations, we use the context that surrounds us to properly interpret what has been said. Consequently, visual modality integration has recently become a trend in the speech and natural language processing communities. Previous works show improvements in the domains of visual question-answering [2], multimodal machine translation [19], visual dialog [8], and automatic speech recognition (ASR) [16]. Although there are several visual adaptation approaches for ASR [16, 6, 15, 10, 14, 21], it is still unclear how the models leverage the visual modality.

Previous works have demonstrated that the visual modality can be used to individually adapt the language [10, 21, 15] and acoustic [14] model components, as well as end-to-end (e2e) models [16, 20]. In acoustic model adaptation [14], images can provide acoustic context (e.g. outdoors and indoors acoustics can be inferred from images). In a similar vein, breakthroughs in image captioning inspired language models contextualization using visual information [10, 21]. By inferring the domain of a conversation from images, we can bias the language model towards a desired space and improve the transcription. However, a lack of analysis and understanding of these models inhibits their widespread application.

In previous work, we analyzed the contribution of different visual representations in an end-to-end multimodal ASR system [6]. We observed that an ASR model trained with multimodal information is able to maintain its performance even when the visual modality is discarded during test time. This observation indicates that the model is using the visual modality as a regularization technique, and not using the semantics of the image. Similar findings in multimodal machine translation [9] led to an investigation of the utility of visual context [5], which concluded that the visual modality is helpful when the primary modality is degraded.

Figure 1: Our experiments demonstrate that multimodal speech recognition systems can recover masked words by using images as an auxiliary signal.

In this paper, inspired by [5], we hypothesize that multimodal ASR systems can take advantage of the visual modality under severely noisy conditions and be more robust than unimodal models. Such robustness to noise is particularly desirable in ASR, since speech is inherently a noisy medium.

The main contributions of this paper are as follows:

  • A set of noise-masking experiments, where we inject different kinds of noise into the acoustic signal in a structured manner (Section 2).

  • A comparison of the ability of unimodal and multimodal ASR models to recover the masked audio signal (Section 4.1).

  • Quantitative (Section 4.2) and qualitative analysis (Section 4.3) to investigate how exactly the visual modality is utilized.

Our observations on the Flickr 8k Audio Caption Corpus indicate that multimodal ASR models perform better at recovering masked words than unimodal models, thereby validating our hypothesis that multimodal models are more robust to noise.

Furthermore, our analysis suggests that recovery performance drops when we provide incorrect visual features during training and/or testing. This result indicates that the model is sensitive to the semantic content of the visual context.

2 Methodology

In this section, we introduce our multimodal architecture, similar to [6, 20], that we use to incorporate the image modality in an end-to-end ASR system. We also describe the different masking experiments that we design to test our hypothesis. Inspired by [5], we simulate different scenarios where the audio signal is corrupted by various masking techniques. This audio corruption is done in a controlled manner, by deterministically fixing a set of words and masking all occurrences in the dataset.

2.1 Models

To be consistent with previous work, we test our hypothesis with the best performing models and features according to [6, 20].

Unimodal ASR

Our baseline unimodal ASR model is a sequence-to-sequence model with attention [3, 7]. The encoder consists of 6 bidirectional LSTM layers with tanh activation, with sub-sampling at the 3rd and 4th layers [3]. The decoder is a two-layer GRU which computes attention over the encoder states.

Multimodal ASR

For our Multimodal ASR model, we use early decoder fusion. In this fusion technique, during decoding, a visual feature vector is projected down to the hidden state dimension, and further concatenated to the input embedding vector at each timestep before passing it through the GRU decoder.
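The fusion step described above can be illustrated with a minimal, framework-free sketch. It is not the authors' nmtpytorch implementation: the dimensions, the random projection matrix, and the helper names `linear` and `fuse_decoder_input` are all hypothetical, and a real model would learn the projection and feed the fused vector to a GRU cell.

```python
import random

def linear(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def fuse_decoder_input(embedding, visual_feature, projection):
    """Project the visual feature and concatenate it to the token embedding."""
    projected = linear(projection, visual_feature)
    return embedding + projected  # list concatenation, not element-wise addition

# Hypothetical sizes, chosen only to keep the example small.
VISUAL_DIM, HIDDEN_DIM, EMBED_DIM = 8, 4, 3
rng = random.Random(0)
projection = [[rng.uniform(-0.1, 0.1) for _ in range(VISUAL_DIM)]
              for _ in range(HIDDEN_DIM)]
visual_feature = [rng.random() for _ in range(VISUAL_DIM)]

# The same projected image vector is appended at every decoding timestep.
embeddings = [[rng.random() for _ in range(EMBED_DIM)] for _ in range(5)]
decoder_inputs = [fuse_decoder_input(e, visual_feature, projection)
                  for e in embeddings]
```

Each decoder input then has length `EMBED_DIM + HIDDEN_DIM`, with the visual part identical across timesteps.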

2.2 Audio Corruption

To create scenarios where information is missing from the primary modality, we perform masking on the audio data. Because we need to have control over the words being masked and recovered, we generate word-level masking. To do so, we first generate forced aligned timings to localize specific words in the transcriptions. After that, we experiment with two masking techniques:

  1. Silence Masking: We substitute the masked word with a specific value, silence, similar to the special token in [5]. In this non-realistic scenario, the model is trained to generate the missing word when a known signal is present in the audio.

  2. White Noise Masking: We substitute the masked word with white noise. This more realistic scenario is an approximation to a noisy-ASR problem where speech is corrupted by some stochastic signal. In this case, in contrast to a traditional noisy-ASR situation where the speech is overlapped with noise, the masked word is completely overwritten.

It is important to note that the masking is performed on all data splits. This set up differs from [20], where the masking was performed only in the test set and models were trained on unmasked data.

Masked Words

To gain clearer insight into how the model uses the image modality, we experiment by masking two different sets of words in the dataset: nouns and places. We analyze how effectively the model recovers these different categories and how the visual representation influences the result. We hypothesize that visual embeddings trained on place recognition should be more effective for place word recovery than objects, and vice versa.

2.3 Congruency Experiments

Similar to [9], we perform a set of experiments where we misalign images and utterances. The outcome of this set up quantifies the sensitivity of our model towards unaligned images. These results, therefore, can provide insights on our previous claim that multimodal models use the visual modality as a regularization signal [6].

  • Incongruent Decoding: We train multimodal models on the correct (congruent) image-utterance pairings, and deliberately misalign the images and utterances during testing. This is done to check that our multimodal models rely on the images - when presenting an unrelated image, we expect to see poor performance.

  • Incongruent Training: We both train and test the models on incongruent image-utterance alignments. In this experiment, we wish to see if providing an incorrect visual signal can still result in improvements in the model, similar to the regularization effect observed in [6].
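One way to build the incongruent pairings is sketched below. The text does not specify how images are misaligned, so the sampling scheme here (draw a random image that differs from the utterance's own) is an assumption, as are the function name `incongruent_pairs` and the toy identifiers.

```python
import random

def incongruent_pairs(utterance_ids, image_of, seed=0):
    """Return {utterance_id: image_id} where no utterance keeps its own image."""
    rng = random.Random(seed)
    images = sorted(set(image_of.values()))
    if len(images) < 2:
        raise ValueError("need at least two distinct images to misalign")
    mapping = {}
    for utt in utterance_ids:
        candidate = rng.choice(images)
        # Resample until the drawn image differs from the congruent one.
        while candidate == image_of[utt]:
            candidate = rng.choice(images)
        mapping[utt] = candidate
    return mapping

# Toy congruent alignment: several captions can share one image.
image_of = {"utt1": "imgA", "utt2": "imgA", "utt3": "imgB", "utt4": "imgC"}
mapping = incongruent_pairs(list(image_of), image_of)
```

For Incongruent Decoding, such a mapping would be applied only to the test split; for Incongruent Training, to all splits.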

3 Experimental Setup

| Masking Type | Masked Words | Model | Visual Features | Recovery Rate | Rel RR | WER | Rel WER |
|---|---|---|---|---|---|---|---|
| None | - | Unimodal | - | - | - | 16.4% | - |
| | | Multimodal | Object Ftrs | - | - | 14.8% | 10.0% |
| | | Multimodal | Place Ftrs | - | - | 15.6% | 4.7% |
| Silence | Nouns | Unimodal | - | 46.0% | - | 32.9% | - |
| | | Multimodal | Object Ftrs | 57.8% | 25.5% | 29.9% | 4.0% |
| | | Multimodal | Place Ftrs | 52.6% | 14.2% | 31.5% | 3.2% |
| Silence | Places | Unimodal | - | 42.7% | - | 22.6% | - |
| | | Multimodal | Object Ftrs | 57.0% | 33.5% | 19.3% | 14.6% |
| | | Multimodal | Place Ftrs | 53.7% | 25.7% | 22.0% | 2.7% |
| Whitenoise | Nouns | Unimodal | - | 43.3% | - | 43.5% | - |
| | | Multimodal | Object Ftrs | 57.3% | 32.3% | 36.5% | 16.2% |
| | | Multimodal | Place Ftrs | 50.4% | 16.3% | 43.1% | 0.9% |
| Whitenoise | Places | Unimodal | - | 42.2% | - | 25.2% | - |
| | | Multimodal | Object Ftrs | 57.8% | 37.1% | 19.3% | 23.6% |
| | | Multimodal | Place Ftrs | 53.7% | 27.4% | 23.5% | 7.0% |

Table 1: Word Error Rate (WER) and Recovery Rate (RR) scores for our unimodal and multimodal models in the different masking conditions. We also show the relative improvements (Rel RR, Rel WER) of our multimodal models over the unimodal model under the same masking conditions.

3.1 Dataset

We perform experiments on the Flickr 8k Audio Caption Corpus [11], which contains 40,000 spoken captions (total 65 hours of speech) corresponding to 8,000 natural scene images from the Flickr8k dataset [13] (footnote 1). We use the pre-defined training, development and test splits of 30k, 5k and 5k utterances respectively. The audio and text captions in this data are more structured and clean, compared to the ‘in-the-wild’ nature of How2 [18].

3.2 Implementation Details

Word Masking

  • Nouns: We use the Stanford POS tagger to tag all sentences in the dataset, and find the top 100 nouns (NN tag) in the entire dataset. All words among the top 100 nouns are then masked. This affects of all words in the corpus.

  • Places: We use the categories from the Places365 dataset, which describe 365 scene categories. All words in our dataset which correspond to the Places365 categories are subsequently masked. This affects of all words in the corpus.
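The noun selection step can be sketched as follows, assuming the sentences have already been POS-tagged (the paper uses the Stanford POS tagger; here the tags are supplied by hand). The function names, the lowercasing, and the toy sentences are illustrative assumptions.

```python
from collections import Counter

def top_nouns(tagged_sentences, k=100):
    """Return the k most frequent words tagged NN, lowercased."""
    counts = Counter(
        word.lower()
        for sentence in tagged_sentences
        for word, tag in sentence
        if tag == "NN"
    )
    return {word for word, _ in counts.most_common(k)}

def masked_fraction(tagged_sentences, masked_words):
    """Fraction of all tokens in the corpus that would be masked."""
    total = sum(len(sentence) for sentence in tagged_sentences)
    hits = sum(1 for sentence in tagged_sentences
               for word, _ in sentence if word.lower() in masked_words)
    return hits / total if total else 0.0

# Toy corpus of (word, tag) pairs standing in for tagger output.
tagged = [
    [("A", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("A", "DT"), ("dog", "NN"), ("chases", "VBZ"), ("a", "DT"), ("ball", "NN")],
    [("The", "DT"), ("boy", "NN"), ("holds", "VBZ"), ("a", "DT"), ("ball", "NN")],
]
masked = top_nouns(tagged, k=2)
fraction = masked_fraction(tagged, masked)
```

For place words, the same `masked_fraction` helper applies with `masked_words` set to the Places365 category names.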

Visual Features

We preprocess images as in [6]. We experiment with two visual feature types to capture different semantic information from the images:

  • Object Features: We use a ResNet-50 CNN [12] trained on ImageNet for recognizing 1000 object categories. We extract 2048-dim average pooled features from the penultimate CNN layer.

  • Place Features: A ResNet-50 trained on Places365 [23] for scene recognition with 365 categories. We extract the posterior class probabilities, which gives us 365-dimensional visual features.

Acoustic Features

We extract 40-dimensional filter bank features from 16kHz raw speech signals using a time window of 25ms and an overlap of 10ms, using Kaldi [17], and 3-dimensional pitch features. To corrupt the original audio signals, we first extract word-audio alignments from a Kaldi GMM-HMM model trained on the Wall Street Journal Corpus (footnote 2), and then we increase the beginning and end timing marks by 25% of the segment duration to account for possible misalignments. Finally, we use SoX (footnote 3) and the audio-word alignments to mask the words with white noise or silence. To remove a possible bias induced by the duration of the masked audio segment, the silence/white noise masks have a fixed duration of 0.5 seconds.
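The corruption procedure can be approximated directly on a list of samples, as in the sketch below. The paper performs this step with SoX; this stand-in makes several assumptions: the 25% adjustment is read as widening the segment on both sides, white noise is drawn from a Gaussian with an arbitrary amplitude, and alignments are given in samples, sorted by time. All function names are hypothetical.

```python
import random

SAMPLE_RATE = 16000   # 16 kHz, as stated in the text
MASK_SECONDS = 0.5    # fixed mask duration, as stated in the text
PAD_FRACTION = 0.25   # widening per boundary (our reading of the text)

def widen(start, end, total, pad=PAD_FRACTION):
    """Widen [start, end) by pad * duration on both sides, clipped to the signal."""
    margin = int((end - start) * pad)
    return max(0, start - margin), min(total, end + margin)

def make_mask(kind, rng, amplitude=0.1):
    """Build a fixed-length mask of silence or Gaussian white noise."""
    n = int(SAMPLE_RATE * MASK_SECONDS)
    if kind == "silence":
        return [0.0] * n
    if kind == "whitenoise":
        return [rng.gauss(0.0, amplitude) for _ in range(n)]
    raise ValueError(f"unknown mask kind: {kind}")

def mask_words(samples, alignments, masked_words, kind, seed=0):
    """Overwrite every aligned occurrence of a masked word with a fixed-length mask."""
    rng = random.Random(seed)
    out, cursor = [], 0
    for word, start, end in alignments:
        if word not in masked_words:
            continue
        start, end = widen(start, end, len(samples))
        start = max(start, cursor)        # do not re-copy audio already consumed
        out.extend(samples[cursor:start])  # keep the audio before the word
        out.extend(make_mask(kind, rng))   # replace the word itself
        cursor = max(cursor, end)
    out.extend(samples[cursor:])
    return out

samples = [1.0] * (2 * SAMPLE_RATE)  # two seconds of dummy audio
alignments = [("a", 0, 4000), ("dog", 8000, 16000), ("runs", 20000, 28000)]
silenced = mask_words(samples, alignments, {"dog"}, "silence")
noisy = mask_words(samples, alignments, {"dog"}, "whitenoise")
```

Because the mask length is fixed, the output duration generally differs from the input: the masked word is overwritten rather than mixed with noise.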

Model Implementation

The model hyperparameters are the same as in [20]. All models are implemented using the nmtpytorch framework [4]. For each experiment, we train three models with different random seeds and report the average results.

3.3 Evaluation Metrics

For consistency with previous ASR work, we report Word Error Rate (WER). Because WER is sensitive to variations in unmasked words, it is not a direct indicator of the ability to recover masked words. We propose a second metric, called Recovery Rate (RR), which evaluates the models’ ability to accurately recover masked-words.

To compute our metrics, we use sclite (footnote 4), which provides WER and hypothesis-reference word alignments for RR calculation.
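A minimal sketch of how RR might be computed from word alignments follows. The text does not give a formula, so the definition used here (the fraction of masked reference words whose aligned hypothesis word is identical) is an assumption, as is the convention for relative improvement. The aligned pairs stand in for sclite output, and the function names are hypothetical.

```python
def recovery_rate(aligned_pairs, masked_words):
    """Fraction of masked reference words reproduced exactly in the hypothesis.

    aligned_pairs holds one (reference_word, hypothesis_word) tuple per
    reference token; hypothesis_word is None when the word was deleted.
    """
    masked = [(ref, hyp) for ref, hyp in aligned_pairs if ref in masked_words]
    if not masked:
        return 0.0
    return sum(1 for ref, hyp in masked if ref == hyp) / len(masked)

def relative_improvement(baseline, system, lower_is_better=False):
    """Percent improvement of system over baseline (assumed convention)."""
    gain = (baseline - system) if lower_is_better else (system - baseline)
    return 100.0 * gain / baseline

pairs = [("a", "a"), ("dog", "dog"), ("jumps", "jumps"),
         ("over", "over"), ("a", "a"), ("puddle", "hurdle")]
rr = recovery_rate(pairs, {"dog", "puddle"})           # one of two masked words recovered
rel_rr = relative_improvement(46.0, 57.8)              # RR: higher is better
rel_wer = relative_improvement(16.4, 14.8, lower_is_better=True)  # WER: lower is better
```

Unlike WER, this quantity ignores errors on unmasked words, which is the property the metric is introduced for.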

4 Results & Analysis

4.1 Results

In Table 1, we summarize the results of our masking experiments. We see that multimodal ASR models consistently improve over unimodal ones on WER, under all masking conditions. We also note that Object Features consistently outperform the Place Features, even when place words are masked. We hypothesize that this might be due to a domain mismatch between the Places365 and Flickr8k datasets. It is important to note that due to an intrinsic imbalance between nouns and places in Flickr8k, the results on place and noun masking are not comparable. We leave a more accurate comparison for future work.

| Masked Words | Model | WER | RR |
|---|---|---|---|
| Nouns | Unimodal | 32.9% | 46.0% |
| | Multimodal (Object) | 29.9% | 57.8% |
| | Multimodal (GT) | 25.3% | 92.63% |
| Places | Unimodal | 22.6% | 42.7% |
| | Multimodal (Object) | 19.3% | 57.3% |
| | Multimodal (GT) | 20.5% | 92.66% |

Table 2: Comparison of multimodal improvements when we provide ground-truth (GT) information about which words have been masked.

However, previous work has suggested that visual modality is only being used as a regularization signal [6]. To investigate this behavior, we also measure the models’ ability to recover masked words - ensuring that the observed improvements come from the semantics of the visual context. We observe that multimodal models have a significant improvement in noun and place RR, with both Object and Place features. Once again, Object features seem to be more useful for recovery than Place features, even when place words are masked. However, we also note that Place features are better at recovering place words than nouns. These improvements in RR demonstrate that visual context is highly useful when it comes to recovering information which is missing in the primary modality.

| Model | Transcription |
|---|---|
| Reference | A dog wearing a collar jumping from a platform . |
| LM | A man is a red is over a blue |
| Unimodal ASR | A woman wearing a collar jumping into a hurdle . |
| Multimodal (object feats) | A dog wearing a collar jumping through a puddle . |
| Multimodal (place feats) | A dog wearing a collar jumping over a hurdle. |
| Reference | A boy with red shorts is holding a basketball in a basketball court . |
| LM | Man in a hair is jumping a stick . a field ball |
| Unimodal ASR | A boy with red shorts is holding a soccer ball . |
| Multimodal (object feats) | A man in red shorts is holding a basketball in a basketball court . |
| Multimodal (place feats) | A man with red shorts is holding a basketball in a basketball court . |

Table 3: Examples of noun recovery from the different models. Blue words indicate reference words (nouns or places) which are masked with silence in the input speech signal, red and green words indicate incorrectly and correctly substituted words in the hypotheses transcriptions. The language model (LM) outputs are generated by feeding in the ground-truth reference context at each timestep (not the generated context).

An interesting observation is that even with no visual information, a unimodal ASR model is able to successfully recover of the masked words. We hypothesize that this is due to the predictable nature of the captions in Flickr 8k (similar to observations in Multi30k [5]). Due to this structured nature of the text, the decoder’s implicit language model is able to predict the correct masked word using the decoded text context. To test this hypothesis, we train a GRU language model (LM) and calculate the masked word prediction accuracy when it is fed the ground-truth context. The LM correctly predicts the masked word 34% of the time. Because the decoder learns to sample from the set of masked words when it encounters silence, we only consider the instances where the LM makes a prediction from the set of masked words. We observe that the LM masked word accuracy jumps to 45%, matching the unimodal ASR RR. We conclude that the high RR for the unimodal system is a consequence of the structured captions in Flickr 8k, and the subsequent strong LM.

While our multimodal models show significant improvements in WER and RR, we believe that there is a lot of room for improvement. To set an upper bound, we provide a binary vector that encodes which words have been masked. In Table 2, we summarize the improvements in WER and RR obtained by feeding ground-truth masking information when nouns and places are masked with silence. These results encourage further research on more semantically meaningful features for this task. We leave this investigation for future work.

4.2 Congruency Analysis

In Table 4, we summarize the results of our congruency experiments (Section 2.3). When incorrect visual features are provided during inference time (Incongruent Decoding), the models’ performance drops significantly on both the WER and RR metrics. These results show that incorrect visual context is actively harming the model. We can conclude that the model learns some useful information during training which is not present during test time.

| Setup | Masked Words | Visual Features | Rel RR | Rel WER |
|---|---|---|---|---|
| Incongruent Decoding | Nouns | Congruent | 25.5% | 9.1% |
| | | Incongruent | -43.5% | -28.3% |
| | Places | Congruent | 33.5% | 14.6% |
| | | Incongruent | -52.7% | -17.4% |
| Incongruent Training | Nouns | Congruent | 25.5% | 9.1% |
| | | Incongruent | 2.8% | 6.2% |
| | Places | Congruent | 33.5% | 14.6% |
| | | Incongruent | 3.1% | 10.9% |

Table 4: Comparison of multimodal models’ performance under the incongruent decoding and training setups. Object features and silence masking were used for all experiments.

When we train the models using misaligned images (Incongruent Training), we do not observe a considerable improvement over unimodal ASR on the masked RR task. We believe that the slight RR gains are due to semantic overlap between images. We also obtain larger improvements in WER, though smaller than with congruent training, which is consistent with the regularization effect previously observed in [6].

4.3 Qualitative Analysis

In Table 3, we consider several examples in the test set, and compare how unimodal and multimodal models transcribe. In the first example, the LM and unimodal ASR substitute words that occur frequently in the dataset (i.e. man and woman, respectively). However, the multimodal models correctly recover the word “dog”.

In the second example, unlike the multimodal systems, the unimodal model recovers the word “boy” correctly. However, the multimodal models, relying on the image, identify the object (i.e. basketball) and scene (i.e. basketball court) and correctly recover the masked words. They are able to reason that the masked word is something the boy is holding, and localize the basketball correctly.

5 Conclusions and Future Work

We show that when we deterministically mask out words in the input speech signal using silence or white noise, auxiliary context from images improves the recovery of masked words, demonstrating that the model can make use of semantic information from the image.

For future work, we plan to extend the structured experiments (i.e. masking concrete words) performed in this paper into an unstructured regime (i.e. speech corrupted by overlapping noise), to see if multimodality can help with speech recognition under noisy conditions. If the results are successful, this work could potentially be tested in real-world multimodal environments such as [18].

6 Acknowledgments

This research was supported in part by DARPA grant FA8750-18-2-0018 funded under the AIDA program, along with faculty research grants from Amazon, AWS and Facebook. This work used the computational resources of the PSC Bridges cluster at Extreme Science and Engineering Discovery Environment (XSEDE) [22]. We would also like to thank the anonymous reviewers, along with Shikib Mehri for their valuable inputs.


  1. https://groups.csail.mit.edu/sls/downloads/flickraudio/
  2. By manually inspecting the alignments, we concluded that their quality was good enough for the purpose of this paper.
  3. http://sox.sourceforge.net/
  4. http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/sclite.htm


  1. (2016) 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016. IEEE. ISBN 978-1-4799-9988-0.
  2. S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick and D. Parikh (2015) VQA: Visual Question Answering. In Proc. of the International Conference on Computer Vision (ICCV).
  3. D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel and Y. Bengio (2016) End-to-end attention-based large vocabulary speech recognition. See [1], pp. 4945–4949.
  4. O. Caglayan, M. García-Martínez, A. Bardet, W. Aransa, F. Bougares and L. Barrault (2017) NMTPY: a flexible toolkit for advanced neural machine translation systems. Prague Bull. Math. Linguistics.
  5. O. Caglayan, P. Madhyastha, L. Specia and L. Barrault (2019) Probing the need for visual context in multimodal machine translation. In Proc. of the North American Chapter of the Association for Computational Linguistics (NAACL).
  6. O. Caglayan, R. Sanabria, S. Palaskar, L. Barrault and F. Metze (2019) Multimodal Grounding for Sequence-to-Sequence Speech Recognition. In Proc. of the International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  7. W. Chan, N. Jaitly, Q. V. Le and O. Vinyals (2016) Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Proc. of the International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  8. A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M.F. Moura, D. Parikh and D. Batra (2017) Visual Dialog. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR).
  9. D. Elliott (2018) Adversarial evaluation of multimodal machine translation. In Proc. of Empirical Methods in Natural Language Processing (EMNLP).
  10. A. Gupta, Y. Miao, L. Neves and F. Metze (2017) Visual features for context-aware speech recognition. In Proc. of the International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  11. D. Harwath and J. Glass (2015) Deep multimodal semantic embeddings for speech and images. In Proc. of the Workshop on Automatic Speech Recognition and Understanding (ASRU).
  12. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR).
  13. M. Hodosh, P. Young and J. Hockenmaier (2015) Framing image description as a ranking task: data, models and evaluation metrics (extended abstract). In Proc. of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI).
  14. Y. Miao and F. Metze (2016) Open-domain audio-visual speech recognition: a deep learning approach. In Proc. Interspeech.
  15. Y. Moriya and G. J. F. Jones (2018) LSTM language model adaptation with images and titles for multimedia automatic speech recognition. In Proc. of the Spoken Language Technology Workshop (SLT).
  16. S. Palaskar, R. Sanabria and F. Metze (2018) End-to-end multimodal speech recognition. In Proc. of the International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  17. D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, P. Qian, J. Silovsky, G. Stemmer and K. Vesely (2011) The Kaldi speech recognition toolkit. In Proc. of the Workshop on Automatic Speech Recognition and Understanding (ASRU).
  18. R. Sanabria, O. Caglayan, S. Palaskar, D. Elliott, L. Barrault, L. Specia and F. Metze (2018) How2: a large-scale dataset for multimodal language understanding. In Proc. of the Workshop on Visually Grounded Interaction and Language (ViGIL).
  19. L. Specia, S. Frank, K. Sima’an and D. Elliott (2016) A shared task on multimodal machine translation and crosslingual image description (WMT). In Proc. of the Conference on Machine Translation (WMT).
  20. T. Srinivasan, R. Sanabria and F. Metze (2019) Analyzing utility of visual context in multimodal speech recognition under noisy conditions. In Proc. of The How2 Challenge New Tasks for Vision and Language Workshop.
  21. F. Sun, D. Harwath and J. Glass (2016) Look, listen, and decode: multimodal speech recognition with images. In Proc. of the Spoken Language Technology Workshop (SLT).
  22. J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood, S. Lathrop, D. Lifka, G. D. Peterson, R. Roskies, J. Scott and N. Wilkins-Diehr (2014-09) XSEDE: accelerating scientific discovery. Computing in Science and Engineering 16 (05), pp. 62–74. ISSN 1558-366X.
  23. B. Zhou, A. Lapedriza, A. Khosla, A. Oliva and A. Torralba (2017) Places: a 10 million image database for scene recognition. Transactions on Pattern Analysis and Machine Intelligence (PAMI).