Speech-Based Visual Question Answering



This paper introduces speech-based visual question answering (VQA), the task of generating an answer given an image and a spoken question. Two methods are studied: an end-to-end deep neural network that directly uses audio waveforms as input, and a pipelined approach that performs automatic speech recognition (ASR) on the question, followed by text-based visual question answering. Furthermore, we investigate the robustness of both methods by injecting various levels of noise into the spoken question and find that both methods tolerate noise at similar levels.







Recent years have witnessed great advances in computer vision, natural language processing, and speech recognition thanks to advances in deep learning [?] and the abundance of data [?]. This is evidenced not only by the surge of academic papers, but also by worldwide industry interest. The convincing successes in these individual fields naturally raise the potential for further integration towards solutions to more general AI problems. Much work has been done to integrate vision and language, resulting in a wide collection of successful applications such as image/video captioning [?], movie-to-book alignment [?], and visual question answering (VQA) [?]. However, the integration of vision and speech has remained relatively unexplored.

In practical applications, voice-user interfaces (VUIs) have become more commonplace, and people are increasingly taking advantage of their characteristics: a VUI is natural, hands-free, eyes-free, far more mobile, and faster than typing on certain devices [?]. As many of our daily tasks are relevant to visual scenes, there is a strong need for a VUI that can talk to pictures or videos directly, be it for communication, cooperation, or guidance. Speech-based VQA can be used to assist blind people in performing ordinary tasks, and to direct robots in real visual scenes in a hands-free manner, for example in clinical robotic surgery.

Figure 1: An example of speech-based visual question answering and the two methods in this study. A spoken question, "what food is this?", is asked about the picture, and the system is expected to generate the answer "pizza".

This work investigates the potential of integrating vision and speech in the context of VQA. A spoken version of the VQA1.0 dataset is generated to study two different methods of speech-based question answering. One method is an end-to-end approach based on a deep neural network architecture, and the other uses an ASR to first transcribe the text from the spoken question, as shown in Figure 1. The former approach can be particularly useful for languages that are not serviced by popular ASR systems, i.e. minor languages that have scarce text-speech aligned training data.

The main contributions of this paper are three-fold: 1) We introduce an end-to-end model that directly produces answers from auditory input without transformations into intermediate pre-learned representations, and compare this with the pipelined approach. 2) We inspect the performance impact of having different levels of background noise mixed with the original utterances. 3) We release the speech dataset, roughly 200 hours of synthetic audio data and 1 hour of real speech data, to the public. The emphasis of this paper is not on achieving state-of-the-art numbers on VQA, but rather on exploring ways to address a new and challenging task.

2 Related Works

2.1 Visual Question Answering

The initial introduction of VQA into the AI community [?], [?] was motivated by a desire to build intelligent systems that can understand the world more holistically. To complete the task of VQA, it is necessary to understand both a textual question and a visual scene. However, it was not until the introduction of VQA1.0 [?] that the task became mainstream in the computer vision and natural language processing (NLP) communities.

Recently, a popular topic of exploration has been the development of attention models. Attention models were popularized by their success within the NLP community in machine translation [?], and quickly demonstrated their efficacy in computer vision [?]. Within the context of visual question answering, attention mechanisms 'show' a model where to look when answering a question. Stacked Attention Network [?] learns an attention mechanism based on the question's encoding to determine the salient regions in an image. More sophisticated attention-centric models such as [?] have since been developed.

Other points of research are based on the pooling mechanism that combines the language component with the vision components. Some use an element-wise multiplication [?] to pool these modalities, while [?] and [?] have shown much success in using more complex methods. Our work differs from theirs in that we aim not to improve the performance of VQA, but rather add a new modality of input and introduce appropriate new methods.

2.2 Integration of Speech and Vision

Also relevant to our work are efforts that integrate speech and vision. Pixeltone [?] and Image spirit [?] are examples that use voice commands to guide image processing and semantic segmentation. There is also academic work [?] and an app [?] that use speech to provide image descriptions. Both their tasks and their algorithms differ from ours: those approaches use speech recognition for data collection or result refinement, whereas we study the potential of integrating speech and vision in the context of VQA and aim to learn a joint understanding of speech and vision.

Our work also shares similarities with visually grounded speech understanding and recognition. The closest work in this vein is [?], in which a deep model is learned from spoken image captions for speech-based image retrieval. In the broader context of integrating sound and vision, Soundnet [?] transfers visual information into sound representations, but this differs from our work because its end goal is to label a sound, not to answer a question.

2.3 End-To-End Speech Recognition

In the past decade, deep learning has allowed many fields in artificial intelligence to replace traditional hand-crafted features and pipeline systems with end-to-end models. Since speech recognition is typically thought of as a sequence-to-sequence transduction problem, i.e. given an input sequence, predict an output sequence, the application of LSTM and CTC [?] promptly showed the success needed to justify its superiority over traditional methods. Current state-of-the-art ASR systems such as DeepSpeech2 [?] use stacked bi-directional recurrent neural networks in conjunction with convolutional neural networks. Our model is similar to theirs in that we use CNNs connected with an LSTM to process audio inputs; however, our goal is question answering and not speech recognition.

Table 1: Dimensions for the conv layers. Example shown with a 2-second audio waveform, sampled at 16 kHz. The final output dimensions are (3, 512).
Input Dim      32,000  16,000  4,000  2,000  500  250  62   31   7
# Filters      32      32      64     64     128  128  256  256  512
Filter Length  64      4       32     4      16   4    8    4    4
Stride         2       4       2      4      2    4    2    4    2
Output Dim     16,000  4,000   2,000  500    250  62   31   7    3
Figure 2: TextMod (left) and SpeechMod (right) architectures


Two models are employed in this work; they will be referred to henceforth as TextMod and SpeechMod. TextMod and SpeechMod differ only in their language components, keeping the rest of the architecture the same. On the language side, TextMod takes as input a series of one-hot encodings, followed by an embedding layer that is learned from scratch, an LSTM encoder, and a dense layer. It is similar to VQA1.0 with some minor adjustments.

The language side of SpeechMod takes as input the raw waveform, and pushes it through a series of 1D convolutions. The CNN layers are followed by an LSTM. The LSTM serves the same purpose as in TextMod, which is to interpret and encode the sequence meaningfully into a single vector.

Convolution layers are used to encode waveforms because they reduce the dimensionality of the data while finding salient patterns. The longest spoken question in our dataset is 6.7 seconds, corresponding to a waveform of 107,360 elements, while the shortest is 0.63 seconds, corresponding to 10,080 elements. One could feed the input waveform directly to an LSTM, but an LSTM is unable to learn from excessively long sequences, so dimensionality reduction is a necessity. Each consecutive convolution layer halves in filter length but doubles the number of filters. This is done for simplicity rather than for performance optimization. The main consideration in choosing the parameters is that the last convolution should output dimensions of (n, 512), where n must be a positive integer. Here n represents the length of a sequence of 512-dimensional vectors. The sequence is then fed into an LSTM, which outputs a single 512-dimensional vector. Thus, n should not be too big, and the CNN parameters are chosen to ensure a sensible sequence length. The exact dimensions of the convolution layers are shown in Table 1. The longest and shortest waveforms correspond to final convolution outputs of size (13, 512) and (1, 512) respectively. 512 is used as the dimension of the LSTM to be consistent with TextMod and the original VQA baseline.
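The sequence lengths quoted above can be checked with a short calculation. The sketch below is only illustrative: the strides are taken from Table 1, while the rule that each layer maps a length L to floor(L / stride) is an assumption inferred from the table values, not something the text states. The function name `output_lengths` is hypothetical.

```python
# Strides of the nine layers listed in Table 1.
STRIDES = [2, 4, 2, 4, 2, 4, 2, 4, 2]


def output_lengths(num_samples, strides=STRIDES):
    """Return the sequence length after each layer.

    Assumption: every layer maps a length L to floor(L / stride).
    This rule is inferred from Table 1 and may not match the exact
    padding used in the original implementation.
    """
    lengths = []
    length = num_samples
    for stride in strides:
        length = length // stride
        lengths.append(length)
    return lengths


if __name__ == "__main__":
    # 2-second example, longest question, shortest question.
    for samples in (32_000, 107_360, 10_080):
        print(samples, "->", output_lengths(samples)[-1])
```

Under this assumption the 2-second example ends at length 3, and the longest and shortest questions end at 13 and 1, matching the output sizes given in the text.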

On the visual side, both models take as input the 4,096-dimensional vector from the last layer of VGG19 [?], followed by a single dense layer. After both visual and linguistic representations are computed, they are merged using element-wise multiplication, a dense layer, and an output layer. The full architecture of both models is shown in Figure 2, which also marks the element-wise multiplication. After merging the language and visual components of each model, two dense layers are stacked. The last dense layer outputs a probability distribution over the output classes, and the answer corresponding to the element with the highest probability is selected.
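The merge and answer-selection steps can be illustrated with a toy sketch. It is not the authors' Keras code: the dense layers are omitted, the vectors are tiny stand-ins for the 512-dimensional representations, and the helper names (`fuse`, `softmax`, `pick_answer`) are hypothetical.

```python
import math


def fuse(question_vec, image_vec):
    """Element-wise product of two equally sized feature vectors."""
    if len(question_vec) != len(image_vec):
        raise ValueError("both representations must have the same size")
    return [q * v for q, v in zip(question_vec, image_vec)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_answer(scores, answers):
    """Return the answer with the highest class probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return answers[best], probs[best]


if __name__ == "__main__":
    merged = fuse([1.0, 2.0, 3.0], [2.0, 0.0, -1.0])
    print(merged)
    # In the real model, dense layers would map `merged` to class scores;
    # the scores below are made up for illustration.
    print(pick_answer([0.1, 2.0, -1.0], ["yes", "pizza", "2"]))
```

Because the fusion is a plain element-wise product, both branches must produce vectors of the same size, which is consistent with the shared 512-dimensional setting described above.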

The architectures presented in this chapter were chosen for two main reasons: simplicity and similarity. First, the intention is to keep the model complexity low. In order to establish a baseline for speech-based VQA, it is necessary to use only the bare minimum components. TextMod, as mentioned before, is similar to the original VQA baseline, which is well referenced and remains the simplest architecture on VQA1.0. Despite its many convolution layers, SpeechMod also uses minimal components. Second, it is important that TextMod and SpeechMod differ from each other as little as possible. Similarity between models allows one to locate the source of discrepancies and helps produce a more rigorous comparison. The only difference in the two models is replacing an embedding layer with a series of convolution layers. In our implementation, the layers that are common between the two models also have the same dimensions.


We chose to use the VQA1.0 Open-Ended dataset for its numerous training examples and its familiarity to those working in question answering. To avoid confusion, VQA1.0 henceforth refers to the dataset and the original paper, while VQA refers to the task of visual question answering. The dataset contains 248,349 questions in the training set, 121,512 in the validation set, and 60,864 in the test-dev set. The complete test set contains 244,302 questions, but because the evaluation server allows for only one submission, we instead evaluate on test-dev, which has no such limit. During training, questions whose answers are not among the 1,000 most common answers are filtered out.
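The answer-frequency filter can be sketched as follows. This is a simplified illustration that assumes one answer string per question; VQA1.0 actually provides several annotator answers per question, and how the original code handles that is not stated here. The function name `keep_common_answers` is hypothetical.

```python
from collections import Counter


def keep_common_answers(examples, top_k=1000):
    """Keep (question, answer) pairs whose answer is among the top_k
    most frequent answers.

    Assumption: each example carries a single answer string.
    """
    counts = Counter(answer for _, answer in examples)
    allowed = {answer for answer, _ in counts.most_common(top_k)}
    kept = [(q, a) for q, a in examples if a in allowed]
    return kept, allowed


if __name__ == "__main__":
    toy = [
        ("is it sunny", "yes"), ("is it raining", "no"),
        ("is it a dog", "no"), ("what food is this", "pizza"),
        ("is it red", "yes"), ("is it open", "no"),
    ]
    kept, allowed = keep_common_answers(toy, top_k=2)
    print(sorted(allowed), len(kept))
```

The set of allowed answers also defines the output classes of the classifier, which is why the filter is applied before training.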

The Amazon Polly API is used to generate audio files for each question. The generated speech is in mp3 format and is then sampled into waveform format at 16 kHz. 16 kHz was chosen because of its common usage in the speech community, and also because the model used to transcribe speech was trained on 16 kHz audio waveforms. It is worth noting that the audio generator uses a female voice, so the training and testing data all share the same voice, except for the examples we recorded ourselves, which are covered below. The full Amazon Polly speech dataset will be made available to the public.

Figure 3: Spectrograms for 3 example questions with corresponding transcribed text below. 3 synthetically generated and 1 human-recorded audio clips for each question.

The noise we mix with the original speech files is selected randomly from the UrbanSound8K dataset [?]. This dataset contains 10 categories: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music. Some clips are soft enough in volume to be considered background noise, while others are loud enough to be considered foreground noise. For each original audio file, a random noise file is selected and combined with it to produce a corrupted question file according to the weighting scheme:

$$ W_{\text{corrupted}} = (1 - NL)\, W_{\text{original}} + NL\, W_{\text{noise}} $$

where NL is the noise level, and W_original and W_noise denote the normalized speech and noise waveforms. The noise audio files are resampled to 16 kHz to match the sampling rate of the original audio file, and are clipped to match the length of the spoken question. When the spoken question is longer than the noise file, the noise file is repeated until its duration exceeds that of the spoken question. Both files are normalized before being combined so that contributions are strictly proportional to the noise level chosen. We choose 5 noise levels: 10% to 50%, at 10% intervals. Anything beyond 50% is unrealistic. A visualization of different noise levels can be seen in Figure 3, and the corresponding audio clips can be found online.
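The mixing procedure can be sketched in a few lines. Two points are assumptions rather than facts from the text: the weighting is taken to be the convex combination (1 - NL) * speech + NL * noise, and "normalized" is interpreted as peak normalization. The helper names are hypothetical, and waveforms are plain Python lists to keep the sketch dependency-free.

```python
def peak_normalize(wave):
    """Scale a waveform so its largest absolute value is 1 (assumed scheme)."""
    peak = max((abs(x) for x in wave), default=0.0)
    if peak == 0:
        return [0.0 for _ in wave]
    return [x / peak for x in wave]


def fit_noise(noise, target_len):
    """Repeat the noise clip until it covers target_len, then clip it."""
    if not noise:
        raise ValueError("noise clip is empty")
    repeats = -(-target_len // len(noise))  # ceiling division
    return (list(noise) * repeats)[:target_len]


def mix(speech, noise, noise_level):
    """Combine speech and noise with weights (1 - NL) and NL (assumed)."""
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must lie in [0, 1]")
    speech_n = peak_normalize(speech)
    noise_n = peak_normalize(fit_noise(noise, len(speech)))
    return [
        (1.0 - noise_level) * s + noise_level * n
        for s, n in zip(speech_n, noise_n)
    ]


if __name__ == "__main__":
    speech = [4.0, -4.0, 2.0, 0.0]
    noise = [2.0, -1.0]
    for level in (0.1, 0.3, 0.5):
        print(level, mix(speech, noise, level))
```

With this weighting, a noise level of 0 returns the normalized speech unchanged and a level of 1 returns pure noise, which matches the intuition that the question becomes inaudible at 100% noise.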

We also conduct an additional, supplementary study of the practicality of speech-based VQA with real data. 1000 questions from the val set were randomly selected and recorded by human speakers. Two speakers (one male and one female) participated in the recording task. In total, 1/3 of the data is from the male speaker, and the rest is from the female speaker. Both speakers are graduate students who are not native English speakers. The data was recorded in an office environment, and the audio clips contain various background noises as they naturally occurred.

Table 2: Word Error Rate from Kaldi speech recognition
Noise (%) WER (%)
0 8.46
10 12.37
20 17.77
30 25.41
40 35.15
50 47.90



For SpeechMod, the first preprocessing step is to scale each waveform to a range of [-256, 256], similar to the procedure from SoundNet [?]. There was no need to center each example around 0, as the waveforms are already centered. Next, each batch of waveforms was padded with 0 at the end so that all waveforms in the batch have the same length.
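A minimal sketch of these two preprocessing steps is given below. It assumes that scaling means multiplying each waveform so that its peak absolute value becomes 256; the text does not spell out the exact scaling rule, and the function names are hypothetical.

```python
def scale_waveform(wave, limit=256.0):
    """Scale a waveform so that its values lie in [-limit, limit].

    Assumption: the peak absolute value is mapped to `limit`.
    """
    peak = max((abs(x) for x in wave), default=0.0)
    if peak == 0:
        return [0.0] * len(wave)
    return [x * limit / peak for x in wave]


def pad_batch(waves):
    """Zero-pad every waveform at the end to the longest length in the batch."""
    longest = max((len(w) for w in waves), default=0)
    return [list(w) + [0.0] * (longest - len(w)) for w in waves]


if __name__ == "__main__":
    batch = [[0.5, -1.0, 0.25], [0.1, 0.2]]
    scaled = [scale_waveform(w) for w in batch]
    print(pad_batch(scaled))
```

Padding per batch rather than to a global maximum keeps short questions from being dominated by zeros.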

For TextMod, the standard preprocessing steps from VQA1.0 were followed. The procedure tokenizes each sentence and replaces each token with a number that corresponds to the word's index. These indices are used as input, since the question will be fed to the model as a sequence of one-hot encodings. Because questions have different lengths, the 0 index is used as padding for sequences that are too short. The 0 index essentially causes the model to skip that position. 0 is also used for unseen tokens, which is especially useful when dealing with out-of-vocabulary words during evaluation.
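The indexing scheme can be illustrated with a small sketch. The regular-expression tokenizer and the choice to pad at the end of the sequence are assumptions made for illustration; the VQA1.0 preprocessing may differ in both respects. All function names are hypothetical.

```python
import re


def tokenize(question):
    """Lowercase the question and split it into word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9']+", question.lower())


def build_vocab(questions):
    """Assign indices starting at 1; index 0 is reserved for padding/unknown."""
    vocab = {}
    for question in questions:
        for token in tokenize(question):
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


def encode(question, vocab, max_len):
    """Map tokens to indices, using 0 for unseen tokens and for padding."""
    ids = [vocab.get(token, 0) for token in tokenize(question)][:max_len]
    return ids + [0] * (max_len - len(ids))


if __name__ == "__main__":
    vocab = build_vocab(["what food is this?"])
    print(vocab)
    # "fruit" is out of vocabulary, so it is encoded as 0.
    print(encode("what fruit is this?", vocab, max_len=6))
```

Sharing index 0 between padding and unknown words means that a mistranscribed word is skipped by the model rather than mapped to a wrong known word.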


We use Kaldi [?] for ASR, due to its open-source codebase and popularity with the speech research community. The model used in this work is a DNN-HMM that has been pre-trained on assistant.ai logs (essentially short commands), making it suitable for transcribing short utterances such as the questions in VQA1.0. Other ASRs such as wit.ai from Facebook, Cloud Speech from Google, and Bing Speech from Microsoft were tested but not used in the final experiments because Kaldi achieved the lowest word error rates.

Word error rate (WER) is used to measure the accuracy of speech to text transcriptions. WER is defined as follows:

$$ \mathrm{WER} = \frac{S + D + I}{N} $$

where S is the number of substitutions, D is the number of deletions, and I is the number of insertions. N is the total number of words in the sentence being transcribed. Each transcribed question is compared with the original; the results are shown in Table 2. WER is not expected to be a perfect measure of transcription accuracy, since some words are more essential to the meaning of a sentence than others. For example, missing the word "dog" in the sentence "what is the dog eating" is more detrimental than missing the word "the". We nevertheless employ WER to convey a general notion of how many words are understood by the ASR. Naturally, the more noise there is, the higher the word error rate becomes. Due to transcription errors, some resulting questions contain words not seen in the original datasets. These words, as mentioned above, are indexed as 0 and are masked when fed into TextMod.
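For readers who want to reproduce the metric, WER can be computed as the word-level edit distance divided by the number of reference words. The sketch below is a standard dynamic-programming implementation written for illustration; it is not the scoring tool used in the experiments, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (S + D + I) divided by the reference length N."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: distance between the reference words processed so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("what is the dog eating", "what is dog eating"))
    print(word_error_rate("what food is this", "what foot is this now"))
```

Note that dropping "the" and dropping "dog" from the example sentence both cost exactly one deletion, which is the limitation of WER discussed above.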


Keras was used to run all experiments, with the Adam [?] optimizer for both architectures. No parameter tuning was done; the default Adam parameters are as follows: learning rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, learning rate decay=0.0. Training TextMod for 10 epochs on train + val takes roughly an hour on an Nvidia Titan X GPU, and our best model was taken at 30 epochs. Training SpeechMod for 10 epochs takes roughly 7 hours. The reported model is taken at 30 epochs. The code is available to the public.


The goal of the main experiments was to observe how each model performs with and without different levels of noise added. Results are reported on test-dev, which corresponds to training on train + val (Table 3). The standard format for reporting results from VQA1.0 is followed: All is the overall accuracy, Y/N is for questions with yes or no as answers, Number is for questions that are answered by counting, and Other covers the rest.

TextMod is trained on the original questions (OQ), with the best performing model being selected. ASR is used on the 0-50% variants to convert the audio question to text. Then, the selected model from OQ is used to evaluate on the transcribed text. Concretely, the best performing model obtained on test-dev is used to evaluate the transcribed variants of test-dev. Likewise, SpeechMod is first trained on audio data with 0% noise, with the strongest model being selected. The selected model is then evaluated on the 10-50% variants of the same data subset. Typically, the best model on val is used to evaluate on test or another 'unseen' portion of the dataset. In these experiments, however, the noisy variants of the same datasets are in fact unseen because the data on which the model is trained contains no noise. We show this in the zero-shot section of the paper.

Table 3: Accuracy on test-dev with different levels of noise added. (Higher is better)
Model      Variant  All    Y/N    Number  Other
Baseline   -        53.74  78.94  35.24   36.42
TextMod    Blind    48.76  78.20  35.68   26.59
TextMod    OQ       56.66  78.89  37.24   42.07
TextMod    0%       54.03  75.47  36.82   39.62
TextMod    10%      52.56  74.06  36.50   37.85
TextMod    20%      50.22  71.16  35.72   35.64
TextMod    30%      47.03  67.31  34.45   32.56
TextMod    40%      42.83  62.35  31.97   28.64
TextMod    50%      37.12  25.42  27.05   23.77
SpeechMod  Blind    42.05  70.85  31.62   19.84
SpeechMod  0%       46.99  67.87  30.84   32.82
SpeechMod  10%      45.81  67.29  30.13   31.03
SpeechMod  20%      43.33  65.88  29.24   27.28
SpeechMod  30%      40.07  64.15  27.82   22.28
SpeechMod  40%      35.85  61.47  24.68   16.52
SpeechMod  50%      32.14  59.33  20.84   11.50

Blind denotes no visual information, meaning that the visual components are removed while the rest of the model stays the same. TextMod Blind is trained and evaluated on the original questions. SpeechMod Blind is trained and evaluated on the 0% noise audio. Baseline is from VQA1.0 using the model 'LSTM Q+I'.

A graphical version of the table is shown in Figure 4. The constant values of SpeechMod Blind and TextMod Blind are included to show the noise level at which they perform better than their full model counterparts. Examples of the two models answering questions from the dataset are shown in .

One might imagine SpeechMod to perform better because of its direct optimization and end-to-end training solely for the task, yet this hypothesis does not hold true. At 0% noise, TextMod achieves roughly 7 percentage points higher accuracy than SpeechMod (54.03 vs. 46.99). As noise is added, both models initially falter at similar rates, although their trends seem to head towards convergence. This is expected, since at 100% noise the question would not be audible at all; answering would amount to random guessing, and both methods would perform exactly the same.

Figure 4: SpeechMod and TextMod performance with varying amounts of added noise on test-dev. Blind counterparts are not tested on different noise levels.

Next, we compare TextMod and SpeechMod against their respective Blind models. The bias of questions in VQA1.0 is well documented. Namely, if the question is understood, there is a good chance of answering correctly without looking at the image (i.e. blind guessing). For example, Y/N questions have the answer yes more commonly than no, so the system should guess yes if a question is identified as a Y/N type. As a reference, always answering yes yields a Y/N accuracy of 70.81% on test-dev. The bias is clearly evident in both test-dev and val for TextMod and SpeechMod; the Y/N section of Blind always performs better than that of the 0% data. Therefore, Blind tells us how many questions are understood by these two modes of linguistic input. When comparing the linguistic-only models with their full TextMod and SpeechMod counterparts, performance falling below that of the Blind model signifies that the model no longer understands the questions. In other words, perceiving the image together with a noisy question becomes less informative than understanding a clean question without an image.


In this section zero-shot (ZS) results are analyzed to further understand the behavior of both models. ZS in the context of VQA refers to questions that were never seen in training. To get ZS data, we discard questions in val subset that appeared in the train subset, which decreased the number of valid questions from 104,654 to 65,365. Put differently, ZS is simply a subset of val.
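The construction of the ZS subset can be sketched as a simple set-difference filter. How question equality was determined is not stated in the text; the sketch assumes a case-insensitive match after collapsing whitespace, and both function names are hypothetical.

```python
def normalize(question):
    """Lowercase and collapse whitespace before comparing (assumed rule)."""
    return " ".join(question.lower().split())


def zero_shot_subset(train_questions, val_examples):
    """Keep val examples whose question text never appears in train.

    val_examples is a list of (question, answer) pairs.
    """
    seen = {normalize(q) for q in train_questions}
    return [ex for ex in val_examples if normalize(ex[0]) not in seen]


if __name__ == "__main__":
    train = ["What food is this?", "Is it raining?"]
    val = [
        ("what food is this?", "pizza"),
        ("what sport is this?", "tennis"),
        ("Is it  raining?", "no"),
    ]
    print(zero_shot_subset(train, val))
```

A stricter or looser matching rule would change the size of the subset, so the 65,365 figure should be read as specific to the authors' own matching procedure.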

The models were trained on the complete train set, and the best performing models on the complete val set were selected. Next, the models were tested on the original and ZS datasets with noise injected (Table 4, Figure 5). These experiments were not performed on test-dev because the ground truth for test-dev and test is withheld, so a subset cannot be evaluated on the server.

As one would expect, ZS accuracies are worse than accuracies of the entire set, since models tend to perform more poorly on unseen data. TextMod performs 4% better on the complete dataset than on the ZS, and SpeechMod on the complete dataset performs better by 7%. The performance gap decreases as more noise is added. At 50% noise, the performance on ZS and the original data have practically converged for both models. To the models, questions seen during the training but with high amount of noise added are as foreign as unseen questions.

Table 4: Accuracy on zero-shot with different levels of noise added. (Higher is better)
Model      Variant  All    Y/N    Number  Other
TextMod    OQ       49.41  77.23  31.18   27.12
TextMod    0%       46.41  73.37  30.64   24.38
TextMod    10%      45.23  71.93  30.32   23.24
TextMod    20%      43.30  69.26  29.63   21.75
TextMod    30%      40.85  65.84  28.55   19.89
TextMod    40%      37.79  62.56  26.10   16.91
TextMod    50%      34.41  59.58  21.50   13.42
SpeechMod  0%       37.01  65.58  23.19   12.99
SpeechMod  10%      36.52  65.12  22.83   12.45
SpeechMod  20%      35.47  64.04  22.29   11.29
SpeechMod  30%      34.08  62.77  21.45   9.67
SpeechMod  40%      32.12  60.59  19.94   7.81
SpeechMod  50%      29.88  57.70  18.20   6.09

6.2 Human Recordings

Finally, a small, supplementary test is run on non-synthetic, human-recorded questions to see if the models would perform differently on real-world audio inputs. 1000 samples were randomly selected from val, and the best performing models from the ZS section were used for evaluation. Table 5 shows the performance on the synthetic and human-recorded versions of this subset.

Although it is clear that both models have difficulties handling recorded questions, SpeechMod performs especially poorly. On the recorded questions, TextMod achieves accuracy similar to what it achieves on val and test-dev with 40% noise. SpeechMod, however, matches its performance on synthetic data with 50% noise only on the Y/N questions, while it seems to understand none of the other question types.


As a modality, speech contains more information than text. In the process of reducing the high-dimensional audio inputs to the low-dimensional class output label (i.e. the answer), the best performing system must be that which extracts patterns most effectively.

TextMod relies heavily on the intermediate ASR system, which is more complicated than the entire architecture of SpeechMod, as the number of parameters one needs to learn for speech recognition is also much greater. The Kaldi model has also been trained on many times more data than contained in VQA1.0. The ASR serves to filter out noise in high dimensions and extract meaningful patterns in the form of text. In a sense, one can think of the ASR as a feature extractor, with text being the salient feature and an explicit intermediate standardization of data before the question answering module.

Conversely, the only audio data SpeechMod learns from are the questions in the dataset. It does not include any mechanisms that explicitly learn semantics in a language, nor does it have intermediate data standardization. Thus, the model may not extract the concept of words from audio sounds. Whether or not forcing the system to learn words (i.e. transcribing words in the question and answering simultaneously) will be beneficial is left to future research, but it is evident that data standardization is helpful for unseen data.

Figure 5: SpeechMod and TextMod performance with varying amounts of added noise for zero-shot subset in reference to the complete datasets.
Table 5: Performance on 1000 human-recorded questions.
All Y/N Number Other
SpeechMod (Recorded) 21.46 57.26 0.77 0.91
SpeechMod (Synthetic) 42.69 66.58 32.31 27.58
TextMod (Recorded) 41.66 66.33 35.69 25.37
TextMod (Original Text) 53.09 77.73 41.54 38.26

In ZS experiments, the gap in performance between the unseen and full dataset with TextMod is much smaller than in SpeechMod (4% vs 7%). A text-based system can still glimpse the meaning of a question even if a word has never been seen, but from the perspective of SpeechMod, new words represent entirely different signal trajectories. Furthermore, audio inputs are continuous streams, making it difficult to differentiate when the new words begin or end.

A similar effect is amplified in answering human-recorded questions. The synthetic audio sounds monotonous and disinterested, with little silence between words, while the human-recorded audio has inflections, emphasis, accents, and pauses. An inspection of the spectrograms (Figure 3) confirms this, as the synthetic and human-recorded waveforms have vastly different audio signatures. Because SpeechMod has no training data similar to the human-recorded samples, it is unable to extract salient patterns. In comparison, the ASR removes most of the variance in the input by standardizing the audio into a compact, salient textual representation. From the perspective of TextMod, the human-recorded questions are only slightly different from those provided in training.

It is evident from our experiments that text-based VQA performs better than speech-based VQA, but bearing in mind the simple architecture and limited amount of training data, we believe the results of SpeechMod merit further study into end-to-end methods.

6.4 Future Work

As alluded to in previous sections, there are a few research directions that may yield interesting results. One straightforward approach to improving the end-to-end model is data augmentation. It is widely accepted that the effectiveness of neural architectures is data driven, so training with noisy data and different speakers should make the model more robust to inputs at run time. Just as many possibilities exist for improving the architecture. One can add feature extractors, attention mechanisms, GAN training, or any amalgamation of the techniques in the deep learning mainstream. An interesting study would be to enforce the prediction of the question while simultaneously learning to answer the question. Doing so may improve performance, but more importantly it would allow us to interpret the concepts learned by the neural network.

Another direction is to restrict the amount of training data available to both approaches to observe their learning efficiency. For example, minor languages may not have a reliable ASR. One can simulate a minor language by training an ASR with only the data available in the training set, and comparing this approach with the end-to-end method trained on the same amount of data.


We have proposed speech-based visual question answering and introduced two approaches that tackle this problem, one of which can be trained end-to-end on audio inputs. Despite its simple architecture, the end-to-end method works well when the test data has audio signatures comparable to its training data. Both methods suffered performance decreases at similar rates when noise is introduced. A pipelined method using an ASR tolerates varied inputs much better because it normalizes the input variance into text before running the VQA module. We release the speech dataset and invite the multimedia research community to explore the intersection of speech and vision.
