Beyond Bilinear: Generalized Multi-modal Factorized High-order Pooling
for Visual Question Answering
Visual question answering (VQA) is challenging because it requires a simultaneous understanding of both the visual content of images and the textual content of questions. To support the VQA task, we need to find good solutions for the following three issues: 1) fine-grained feature representations for both the image and the question; 2) multi-modal feature fusion that is able to capture the complex interactions between multi-modal features; 3) automatic answer prediction that is able to consider the complex correlations between multiple diverse answers for the same question. For fine-grained image and question representations, a ‘co-attention’ mechanism is developed using a deep neural network architecture to jointly learn the attentions for both the image and the question, which allows us to reduce the irrelevant features effectively and obtain more discriminative features for image and question representations. For multi-modal feature fusion, a generalized Multi-modal Factorized High-order pooling approach (MFH) is developed to achieve more effective fusion of multi-modal features by sufficiently exploiting their correlations, which further results in superior VQA performance compared with the state-of-the-art approaches. For answer prediction, the KL (Kullback-Leibler) divergence is used as the loss function to achieve a more accurate characterization of the complex correlations between multiple diverse answers with the same or similar meaning, which allows us to achieve a faster convergence rate and obtain slightly better accuracy on answer prediction. A deep neural network architecture is designed to integrate all these aforementioned modules into one unified model for achieving superior VQA performance. With an ensemble of 9 models, we achieve the state-of-the-art performance on the large-scale VQA datasets and are the runner-up in the VQA Challenge 2017.
Thanks to recent advances in computer vision and natural language processing, computers are expected to be able to automatically understand the semantics of images and natural languages in the near future. Such advances have also stimulated new research topics like image-text retrieval [1, 2], image captioning [3, 4], and visual question answering [5, 6].
Compared with image-text retrieval and image captioning (which only require the underlying algorithms to search for or generate a free-form text description for a given image), visual question answering (VQA) is a more challenging task: it requires a fine-grained understanding of the semantics of both the images and the questions, as well as complex reasoning to predict the best-matching answer correctly. In some aspects, the VQA task can be treated as a generalization of image captioning and image-text retrieval. Thus, building effective VQA algorithms that approach human-level performance is an important step towards enabling artificial intelligence in general.
To support the VQA task, we need to address the following three issues effectively (see the example in Fig. 1): (1) extracting discriminative features for image and question representations; (2) combining the visual features from the image and the textual features from the question to generate the fused image-question features; and (3) using the fused image-question features to learn a multi-class classifier for predicting the best-matching answer correctly. Since deep neural networks (DNNs) are very effective and flexible, many existing approaches tackle these three issues in one single DNN model and train the model in an end-to-end fashion through back-propagation.
For feature-based image representation, directly using the global features extracted from the whole image may introduce noisy information that is irrelevant to the given question, e.g., the given question may strongly relate to only a small part of the image (i.e., the image attention region) rather than the whole image. Therefore, it is very intuitive to introduce a visual attention mechanism into the VQA task to adaptively learn the most relevant image regions for a given question, and modeling visual attention may significantly improve performance. On the other hand, the questions, expressed in natural language, may also contain colloquialisms that can be treated as noise; thus, it is very important to model the question attention simultaneously. Unfortunately, most existing approaches only model the image attention without considering the question attention. Motivated by these observations, we design a deep network architecture for the VQA task with a co-attention learning module that jointly learns the attentions for both the image and the question, which allows us to extract more discriminative features for image and question representations.
For multi-modal feature fusion, most existing approaches simply use linear models (e.g., concatenation or element-wise addition) to integrate the visual features from the image with the textual features from the question, even though their distributions may vary dramatically [8, 9]. Such linear models may not be able to generate expressive image-question features that fully capture the complex correlations between multi-modal features. In contrast to linear pooling, bilinear pooling has recently been used to integrate different CNN features for fine-grained image recognition. Unfortunately, such a bilinear pooling approach may output high-dimensional features for image-question representation, and the underlying deep networks for feature extraction may contain a huge number of model parameters, which seriously limits its applicability to VQA. To tackle these problems, Multi-modal Compact Bilinear (MCB) pooling and Multi-modal Low-rank Bilinear (MLB) pooling have been developed to reduce the computational complexity of the original bilinear pooling model and make it practicable for VQA. However, MCB needs very high-dimensional features to guarantee good performance, and MLB needs a great many training iterations to converge to a satisfactory solution. To tackle these problems, we propose a Multi-modal Factorized Bilinear pooling approach (MFB), which enjoys the dual benefits of the compact output features of MLB and the robust expressive capacity of MCB. Moreover, we extend the bilinear MFB model to a generalized high-order setting and propose a Multi-modal Factorized High-order pooling (MFH) method to achieve a more effective fusion of multi-modal features by sufficiently exploiting their complex correlations. By introducing more complex high-order interactions between multi-modal features, our MFH method achieves a more discriminative image-question representation, which further results in a significant improvement in VQA performance.
For answer prediction, some datasets provide multiple answers for each image-question pair, and such diverse answers are typically annotated by different users. As the answers are expressed in natural language, for a given question, different users may provide diverse answers or expressions that have the same or similar meaning; thus, such diverse answers may have strong correlations and are not independent at all. For example, both a little dog and a puppy could be correct answers for the same question. Motivated by this observation, it is very important to design an appropriate mechanism to model the complex correlations between multiple diverse answers for the same question. In previous work, an answer sampling strategy was proposed that randomly picks an answer from a set of candidates during training. In this way, the complex correlations between multiple diverse answers can eventually be learned by the model, given sufficient training iterations. In this paper, we formulate the problem of answer prediction as a label distribution learning problem: the answers for an image-question pair in the training dataset are converted to a probability distribution over all possible answers. We use the Kullback-Leibler divergence (KLD) as the loss function to achieve a more accurate characterization of the consistency between the probability distribution of the predicted answers and the probability distribution of the true answers (ground truth) given by the annotators. Compared with the answer sampling method, using the KLD loss achieves a faster convergence rate and obtains slightly better accuracy on answer prediction.
In summary, we have made the following contributions in this study:
A co-attention learning architecture is designed to jointly learn the attentions for both the image and the question, which can allow us to reduce the irrelevant features (i.e., noisy information) effectively and obtain more discriminative features for image and question representations.
A Multi-modal Factorized Bilinear Pooling (MFB) approach is developed to achieve more effective fusion (combination) of the visual features from the image and the textual features from the question. By supporting more effective exploitation of the complex correlations between multi-modal features, our MFB approach can significantly outperform the existing bilinear pooling approaches.
A generalized Multi-modal Factorized High-order pooling (MFH) approach is developed by cascading multiple MFB blocks. Compared with MFB, MFH captures more complex correlations of multi-modal features to achieve a more discriminative image-question representation, which further results in a significant improvement in VQA performance.
The KL divergence (KLD) is used as the loss function to achieve a more accurate characterization of the consistency between the predicted answers and the annotated answers, which allows us to achieve a faster convergence rate and obtain slightly better accuracy on answer prediction.
Extensive experiments over multiple VQA datasets are conducted to explain the reason why our approaches are effective. Our experimental results have demonstrated that: (a) our proposed approaches can achieve the state-of-the-art performance on the real-world VQA datasets; and (b) the normalization techniques are extremely important in bilinear pooling models.
The rest of the paper is organized as follows. In section II, we review the related work on VQA, especially the approaches that introduce bilinear pooling. In section III, we revisit the bilinear model and its factorized extension; we then propose the bilinear MFB model, reveal that MFB is a generalized form of MLB, and further propose its generalized high-order extension MFH. In section IV, we propose the co-attention learning network architecture for VQA based on MFB or MFH. In section V, we analyze the importance of modeling answer correlations in VQA and propose a solution with the KLD loss. In section VI, we present our extensive experimental results for algorithm evaluation, with multiple real-world VQA datasets used to evaluate the proposed approaches. Finally, we conclude the paper in section VII.
II Related Work
In this section, we briefly review the most relevant research on VQA, especially those studies that use multi-modal bilinear models.
II-A Visual Question Answering (VQA)
Malinowski et al.  made an early attempt at solving the VQA task. Since then, solving the VQA task has received increasing attention from the communities of computer vision and natural language processing. Most existing VQA approaches can be classified into the following three categories: (a) the coarse joint-embedding models [8, 5, 13]; (b) the fine-grained joint-embedding models with attention [14, 9, 7, 15, 16]; (c) the external knowledge based models [17, 18].
The coarse joint-embedding models are the most straightforward solution for VQA: the image and the question are first represented as global features and then integrated to predict the answer. Zhou et al. proposed a baseline approach for the VQA task that uses the concatenation of the image CNN features and the question BoW (bag-of-words) features, and learns a linear classifier to predict the answer. Some approaches introduce more complex deep models, e.g., LSTM networks or residual networks, to tackle the VQA task in an end-to-end fashion.
One limitation of coarse joint-embedding models is that their global features may contain noisy information (i.e., irrelevant features), and such noisy global features may not be able to answer the fine-grained questions correctly (e.g., “what color are the cat’s eyes?”). Therefore, recent VQA approaches introduce the visual attention mechanism into the VQA task by adaptively learning the local fine-grained image features for a given question. Chen et al. proposed a “question-guided attention map” that projects the question embeddings to the visual space and formulates a configurable convolutional kernel to search the image attention region. Yang et al. proposed a stacked attention network to learn the attention iteratively. Some approaches introduce off-the-shelf object detectors or object proposals as the candidates of the attention regions and then use the question to identify the relevant ones. Fukui et al. proposed multi-modal compact bilinear pooling to integrate the visual features from the image spatial grids with the textual features from the questions to predict the attention. In addition, some approaches perform attention learning on both the images and the questions. Lu et al. proposed a co-attention learning framework to alternately learn the image attention and the question attention. Nam et al. proposed a multi-stage co-attention learning framework that refines the attentions based on the memory of previous attentions.
Although the joint-embedding models can deliver impressive VQA performance, they are not good enough for answering questions that require complex reasoning or common-sense knowledge. Therefore, introducing external knowledge is beneficial for VQA. However, existing approaches have either only been applied to specific datasets, or have been ineffective on benchmark datasets. Thus, there is still room for further exploration and development.
II-B Multi-modal Bilinear Models for VQA
Multi-modal feature fusion plays a critical and fundamental role in VQA. After the image and question representations are obtained, concatenation or element-wise summation is most frequently used for multi-modal feature fusion. Since the distributions of the two feature sets in different modalities (i.e., the visual features from images and the textual features from questions) may vary significantly, the representation capacity of such simply-fused features may be insufficient, limiting the final prediction performance.
Fukui et al. first introduced the bilinear model to solve the problem of multi-modal feature fusion in VQA. In contrast to the aforementioned approaches, they proposed Multi-modal Compact Bilinear pooling (MCB), which uses the outer product of two feature vectors in different modalities to produce a very high-dimensional feature for quadratic expansion. To reduce the computational cost, they used a sampling-based approximation that exploits the property that the sketch projection of the outer product of two vectors can be represented as the convolution of their sketches. The MCB model outperformed the simple fusion approaches and demonstrated superior performance on the VQA dataset. Nevertheless, MCB usually needs high-dimensional features (e.g., 16,000-D) to guarantee robust performance, which may seriously limit its applicability to VQA due to limitations in GPU memory.
To overcome this problem, Kim et al. proposed the Multi-modal Low-rank Bilinear pooling (MLB) approach, which is based on the Hadamard product of two feature vectors (i.e., the image feature $x \in \mathbb{R}^m$ and the question feature $y \in \mathbb{R}^n$) projected into a common space with two low-rank projection matrices:

$$ z = (U^T x) \circ (V^T y) \quad (1) $$

where $U \in \mathbb{R}^{m \times o}$ and $V \in \mathbb{R}^{n \times o}$ are the projection matrices, $o$ is the dimensionality of the output feature $z$, and $\circ$ denotes the Hadamard product, i.e., the element-wise multiplication of two vectors. To further increase the model capacity, a nonlinear activation such as $\tanh$ is added after the projections. Since the MLB approach can generate feature vectors with low dimensions and deep networks with fewer model parameters, it has achieved performance very comparable to MCB. However, experimental results indicated that MLB may suffer from a slow convergence rate (the MLB model with attention takes 250k iterations with batch size 200, i.e., about 140 epochs, to converge).
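As a concrete illustration, the MLB fusion described above can be sketched in a few lines of NumPy. The dimensionalities and random features below are toy values for illustration, not the settings used in the paper:

```python
import numpy as np

def mlb_pool(x, y, U, V):
    """Sketch of MLB: project both modalities into a common o-dim space
    with low-rank matrices U, V, apply a tanh nonlinearity, and fuse
    with the element-wise (Hadamard) product."""
    return np.tanh(U.T @ x) * np.tanh(V.T @ y)

rng = np.random.default_rng(0)
m, n, o = 8, 6, 4                                  # toy dimensionalities
x, y = rng.normal(size=m), rng.normal(size=n)      # image / question features
U, V = rng.normal(size=(m, o)), rng.normal(size=(n, o))
z = mlb_pool(x, y, U, V)                           # fused o-dim feature
```

Note that the output dimensionality equals the rank of the projection, which is what keeps MLB compact.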
III Generalized Multi-modal Factorized High-order Pooling
In this section, we first revisit the multi-modal bilinear models and then introduce the Multi-modal Factorized Bilinear pooling (MFB) model. We give a detailed explanation of the implementation of our MFB model and further analyze its relationship with the existing MLB approach. By treating our MFB model as the basic building block, we extend the idea of bilinear pooling into generalized high-order pooling and propose a Multi-modal Factorized High-order pooling (MFH) model that simply cascades multiple MFB blocks to capture more complex high-order interactions between multi-modal features.
III-A Multi-modal Factorized Bilinear Pooling
Given two feature vectors in different modalities, e.g., the visual features $x \in \mathbb{R}^m$ for an image and the textual features $y \in \mathbb{R}^n$ for a question, the simplest multi-modal bilinear model is defined as follows:
$$ z_i = x^T W_i y \quad (2) $$

where $W_i \in \mathbb{R}^{m \times n}$ is a projection matrix and $z_i$ is the output of the bilinear model. The bias term is omitted here since it is implicit in $W_i$. To obtain an $o$-dimensional output $z = [z_1, \dots, z_o]$, we need to learn $W = [W_1, \dots, W_o] \in \mathbb{R}^{m \times n \times o}$. Although bilinear pooling can effectively capture the pairwise interactions between the feature dimensions, it also introduces a huge number of parameters that may lead to high computational cost and a risk of over-fitting.

Inspired by matrix factorization, each projection matrix $W_i$ can be factorized as two low-rank matrices:

$$ z_i = x^T U_i V_i^T y = \sum_{d=1}^{k} x^T u_d v_d^T y = \mathbb{1}^T (U_i^T x \circ V_i^T y) \quad (3) $$

where $k$ is the factor, i.e., the latent dimensionality of the factorized matrices $U_i = [u_1, \dots, u_k] \in \mathbb{R}^{m \times k}$ and $V_i = [v_1, \dots, v_k] \in \mathbb{R}^{n \times k}$, $\circ$ is the Hadamard product, i.e., the element-wise multiplication of two vectors, and $\mathbb{1} \in \mathbb{R}^k$ is an all-one vector.
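The factorization above can be checked numerically: computing the bilinear form through the full rank-$k$ matrix $U_i V_i^T$ and through the Hadamard-product form gives the same value. A minimal NumPy sketch with toy dimensionalities:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 8, 6, 5                          # toy feature and factor sizes
x, y = rng.normal(size=m), rng.normal(size=n)
U_i, V_i = rng.normal(size=(m, k)), rng.normal(size=(n, k))

# Full bilinear form with the rank-k matrix W_i = U_i V_i^T
z_full = x @ (U_i @ V_i.T) @ y

# Equivalent factorized form: sum over the Hadamard product
z_fact = np.sum((U_i.T @ x) * (V_i.T @ y))
```

The factorized form needs only $(m + n)k$ parameters per output dimension instead of $mn$, which is the source of the savings.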
To obtain the output feature $z \in \mathbb{R}^o$ by Eq.(3), the weights to be learned are two third-order tensors $U = [U_1, \dots, U_o] \in \mathbb{R}^{m \times k \times o}$ and $V = [V_1, \dots, V_o] \in \mathbb{R}^{n \times k \times o}$ accordingly. Without loss of generality, we can reformulate $U$ and $V$ as 2-D matrices $\tilde{U} \in \mathbb{R}^{m \times ko}$ and $\tilde{V} \in \mathbb{R}^{n \times ko}$ respectively with simple reshape operations. Accordingly, Eq.(3) is rewritten as follows:

$$ z = \mathrm{SumPool}(\tilde{U}^T x \circ \tilde{V}^T y,\ k) \quad (4) $$

where the function $\mathrm{SumPool}(x, k)$ means using a one-dimensional non-overlapped window with size $k$ to perform sum pooling over $x$. We name this model Multi-modal Factorized Bilinear pooling (MFB).
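Under the same toy setting, MFB's expand-then-squeeze computation can be sketched as follows; the sum pooling with window size $k$ is implemented with a reshape, and the dimensionalities are illustrative only:

```python
import numpy as np

def mfb_pool(x, y, U, V, k):
    """Sketch of MFB: expand both modalities to a (k*o)-dim space,
    fuse with the Hadamard product, then sum-pool with a
    non-overlapping window of size k to get an o-dim output."""
    joint = (U.T @ x) * (V.T @ y)            # expanded (k*o,) feature
    return joint.reshape(-1, k).sum(axis=1)  # sum pooling -> (o,)

rng = np.random.default_rng(0)
m, n, o, k = 8, 6, 4, 5
x, y = rng.normal(size=m), rng.normal(size=n)
U = rng.normal(size=(m, k * o))              # reshaped 2-D weight matrices
V = rng.normal(size=(n, k * o))
z = mfb_pool(x, y, U, V, k)
```

With $k = 1$ the sum pooling is the identity and the computation reduces to an MLB-style Hadamard fusion.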
The detailed procedures of MFB are illustrated in Fig. 2(a). The approach can be easily implemented by combining some commonly-used layers such as fully-connected, element-wise multiplication and pooling layers. Furthermore, to prevent over-fitting, a dropout layer is added after the element-wise multiplication layer. Since element-wise multiplication is introduced, the magnitude of the output neurons may vary dramatically, and the model might converge to an unsatisfactory local minimum. Therefore, the power normalization ($z \leftarrow \mathrm{sign}(z)\sqrt{|z|}$) and $\ell_2$ normalization ($z \leftarrow z / \|z\|$) layers are appended after the MFB output. The flowchart of the entire MFB module is illustrated in Fig. 2(b).
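The two normalization layers are simple element-wise operations; a minimal sketch, assuming the usual signed-square-root form of power normalization:

```python
import numpy as np

def power_l2_normalize(z, eps=1e-12):
    """Power normalization (signed square root) followed by L2
    normalization, as appended after the MFB output."""
    z = np.sign(z) * np.sqrt(np.abs(z))   # shrinks large magnitudes
    return z / (np.linalg.norm(z) + eps)  # puts z on the unit sphere

z = power_l2_normalize(np.array([4.0, -9.0, 0.0, 1.0]))
```

The signed square root damps the large values produced by element-wise multiplication while preserving the sign of each neuron.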
Relationship to MLB. Eq.(4) shows that the MLB in Eq.(1) is a special case of the proposed MFB with $k = 1$, which corresponds to the rank-1 factorization. Figuratively speaking, MFB can be decomposed into two stages (see Fig. 2(b)): first, the features from different modalities are expanded to a high-dimensional space and then integrated with element-wise multiplication. After that, sum pooling followed by the normalization layers is performed to squeeze the high-dimensional feature into the compact output feature, while MLB directly projects the features to the low-dimensional output space and performs element-wise multiplication. Therefore, with the same dimensionality for the output features, we can conjecture that MLB may suffer from insufficient representation capacity.
III-B From Bilinear Pooling to Generalized High-order Pooling
From previous work like [7, 13], we have witnessed that bilinear pooling models have superior representation capacity compared with traditional linear pooling models. This suggests that exploiting the complex interactions among the feature dimensions is beneficial for capturing the common semantics of multi-modal features. Therefore, a natural idea is to extend the second-order bilinear pooling to generalized high-order pooling to further enhance the representation capacity of the fused features. In this section, we introduce a generalized Multi-modal Factorized High-order pooling (MFH) model obtained by cascading multiple MFB blocks.
As shown in Fig. 2(b), the MFB module can be separated into the expand stage and the squeeze stage as follows:

$$ z_{\mathrm{exp}} = \mathrm{MFB_{exp}}(x, y) = \mathrm{Dropout}(\tilde{U}^T x \circ \tilde{V}^T y) \quad (5) $$

$$ z = \mathrm{MFB_{sqz}}(z_{\mathrm{exp}}) = \mathrm{Norm}(\mathrm{SumPool}(z_{\mathrm{exp}},\ k)) \quad (6) $$

where $\mathrm{Dropout}$, $\mathrm{SumPool}$ and $\mathrm{Norm}$ refer to the dropout, sum pooling and normalization (power and $\ell_2$) layers respectively. $z_{\mathrm{exp}} \in \mathbb{R}^{ko}$ and $z \in \mathbb{R}^o$ are the internal and the output feature of the MFB module respectively.
To make MFB blocks cascadable, we slightly modify the original expand stage in Eq.(5) as follows:

$$ z_{\mathrm{exp}}^i = \mathrm{MFB_{exp}}^i(x, y) = z_{\mathrm{exp}}^{i-1} \circ \mathrm{Dropout}(\tilde{U}_i^T x \circ \tilde{V}_i^T y) \quad (7) $$

where $i \in \{1, 2, \dots, p\}$ is the index of the MFB blocks. $\tilde{U}_i$, $\tilde{V}_i$ and $z_{\mathrm{exp}}^i$ are the weight matrices and the internal feature of the $i$-th MFB block respectively. $z_{\mathrm{exp}}^{i-1}$ is the internal feature of the $(i-1)$-th MFB block, and $z_{\mathrm{exp}}^0 \in \mathbb{R}^{ko}$ is an all-one vector.
After the internal feature $z_{\mathrm{exp}}^i$ is obtained for the $i$-th MFB block, the output feature $z^i$ for the $i$-th MFB block can be computed by Eq.(6). The final output feature of the high-order model is obtained by concatenating the output features of all $p$ MFB blocks as follows:

$$ z = \mathrm{MFH}(x, y) = [z^1, z^2, \dots, z^p] \quad (8) $$
The overall flowchart of the MFH approach is illustrated in Fig. 3. With the increase of $p$, the model size and the dimensionality of the output feature for MFH grow linearly. In order to control the model complexity and the training time that we can afford, we use $p = 2$ in our experiments. It is worth noting that the proposed MFB model in section III-A is a special case of our MFH model with $p = 1$.
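Putting the pieces together, the cascaded MFH computation can be sketched as below (dropout and normalization are omitted for brevity, and all sizes are toy values):

```python
import numpy as np

def mfh_pool(x, y, Us, Vs, k):
    """Sketch of MFH with p = len(Us) cascaded MFB blocks: the expanded
    feature of block i is multiplied element-wise with that of block i-1
    (an all-one vector for the first block), and the sum-pooled outputs
    of all blocks are concatenated."""
    z_exp_prev = np.ones(Us[0].shape[1])                  # all-one vector
    outputs = []
    for U, V in zip(Us, Vs):
        z_exp = z_exp_prev * (U.T @ x) * (V.T @ y)        # expand stage
        outputs.append(z_exp.reshape(-1, k).sum(axis=1))  # squeeze stage
        z_exp_prev = z_exp
    return np.concatenate(outputs)

rng = np.random.default_rng(0)
m, n, o, k, p = 8, 6, 4, 5, 2
x, y = rng.normal(size=m), rng.normal(size=n)
Us = [rng.normal(size=(m, k * o)) for _ in range(p)]
Vs = [rng.normal(size=(n, k * o)) for _ in range(p)]
z = mfh_pool(x, y, Us, Vs, k)                             # (p*o,)-dim output
```

Because block $i$ reuses the expanded feature of block $i-1$, its output involves products of up to $2i$ projected features, which is what makes the pooling high-order.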
IV Network Architectures for VQA
The goal of the VQA task is to answer a question about an image. The inputs to the model contain an image and a corresponding question about the image. Our model extracts the representations for both the image and the question, integrates multi-modal features by using the MFB or MFH module in Fig. 2(b), treats each individual answer as one class and performs multi-class classification to predict the correct answer. In this section, two network architectures are introduced. The first one is the baseline with one MFB or MFH module, which is used to perform ablation analysis with different hyper-parameters for comparison with other baseline approaches. The second one introduces co-attention learning (which jointly learns the attentions for both the image and the question) to achieve more effective characterization of the fine-grained correlations between multi-modal features, which may result in a model with better representation capability.
IV-A The Baseline Model
Similar to existing approaches, we extract the image features by using the 152-layer ResNet model, which is pre-trained on the ImageNet dataset. Images are resized to 448×448, and the 2048-D pool5 features (with $\ell_2$ normalization) are used for image representation. Questions are first tokenized into words and then transformed into one-hot feature vectors with a fixed maximum length. The one-hot vectors are passed through an embedding layer and fed into an LSTM network with 1024 hidden units. We extract the output feature of the last word from the LSTM network to form a vector for question representation. For predicting the answers, we simply use the top-$N$ most frequent answers as classes, since the answers follow a long-tail distribution.
The multi-modal features extracted from the image and the question are fed to the MFB or MFH module to generate the fused image-question feature $z$. Finally, $z$ is fed to an $N$-way classifier to predict the best-matching answer. All the weights, except for those of the ResNet (due to the limitation of GPU memory), are optimized jointly in an end-to-end manner. The whole network architecture is illustrated in Fig. 4.
IV-B The Co-Attention Model
For a given image, different questions could result in entirely different answers. Therefore, an image attention model, which can predict the relevance of each spatial grid of the image to the question, is beneficial for predicting the best-matching answer accurately. From previously reported results, one can see that incorporating such an image attention mechanism allows the model to effectively learn which image region is important for the question, clearly contributing to better performance than the models without attention. However, their attention model only focuses on learning the image attention while completely ignoring the question attention. Since the questions are expressed in natural language, the contribution of each word is definitely different. Therefore, we develop a co-attention learning approach named MFB+CoAtt or MFH+CoAtt (see Fig. 5) to jointly learn the attentions for both the question and the image.
Specifically, the 14×14 (i.e., 196) spatial grids of the image (the res5c feature maps in ResNet) are used to represent the input image, and the output features from the LSTM network are used to represent each word in the input question. After that, the question features are fed into a question attention module, which outputs an attentive question representation. This attentive question representation is fed into an image attention module (together with the 196 image features), and MFB or MFH is used to generate a fused image-question representation. The fused image-question representation is further used to learn a multi-class classifier for answer prediction. In our experiments, we find that using MFH rather than MFB in the image attention module does not improve the prediction accuracy significantly while inducing much higher computational cost. Therefore, in most of our experiments (except for the final model ensemble experiment), the MFH module is only used in the feature fusion stage for integrating the attentive features extracted from the image and the question.
Both the image attention module and the question attention module consist of sequential 1×1 convolutional layers and ReLU layers, followed by softmax normalization layers that predict the attention weight for each input feature. The attentive feature is obtained as the weighted sum of the input features. To further improve the representation capacity of the attentive feature, multiple attention maps are generated, and the attentive features obtained from these attention maps are concatenated to form the output attentive image features.
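A minimal sketch of a single attention map, with the 1×1 convolutions replaced by a plain dot-product scoring function for brevity (the scoring weights here are toy values, not the learned convolution filters):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())  # subtract max for stability
    return e / e.sum()

def attend(features, w_score):
    """Score each of the G input features, softmax-normalize the scores
    into attention weights, and return the weighted sum of the features."""
    scores = features @ w_score            # (G,) relevance scores
    weights = softmax(scores)              # attention weights, sum to 1
    return weights @ features, weights     # attentive (d,) feature

rng = np.random.default_rng(0)
G, d = 196, 16                             # 14x14 spatial grids, toy dim
feats = rng.normal(size=(G, d))
attentive, weights = attend(feats, rng.normal(size=d))
```

Concatenating the attentive features from several independently scored maps gives the multi-map variant described above.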
It is worth noting that the question attention in our network architecture is learned in a self-attentive manner by using the question feature itself. This is different from the image attention module which is learned by using both the image features and question features. The reason is that we assume that the question attention (i.e., the key words of the question) can be inferred without seeing the image, as humans do.
V Answer Correlation Modeling
In most existing VQA approaches, the answering stage is formulated as a multi-class classification problem and each answer refers to an individual class. In practice, this assumption may not hold for the VQA task because the answers with the same or similar meaning can be expressed diversely by different annotators. For example, both the answers ‘a little dog’ and ‘a puppy’ could be correct for a given image-question pair. Therefore, it is crucial to model the answer correlations in the VQA task so that the learned model could be more robust.
In some datasets like VQA, each question is annotated with multiple answers by different people. To exploit the answer correlations, an answer sampling strategy was used in previous work. Specifically, for each image-question pair in the training set, its answers form a repeatable list $A$ (i.e., a list with duplicates), reflecting the situation that multiple people annotate the sample with the same answer. In each epoch in which the sample is accessed, a single answer is randomly picked from $A$ as the label for this sample. In this way, the problem becomes a traditional multi-class classification problem, and the traditional softmax loss function can be used to train the model. Thanks to the repeatable property of $A$, the model can eventually learn the answer correlations, given a sufficient number of training iterations.
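The sampling strategy amounts to drawing uniformly from the repeatable list, so an answer's chance of being picked is proportional to how many annotators gave it. A sketch:

```python
import random

def sample_answer(answer_list, rng=random):
    """Pick one label from the repeatable answer list; duplicates make
    popular answers proportionally more likely to be sampled."""
    return rng.choice(answer_list)

# 'puppy' was given by 3 of 5 annotators, 'a little dog' by 2 of 5,
# so 'puppy' is sampled with probability 3/5 on each access
answers = ['puppy', 'puppy', 'puppy', 'a little dog', 'a little dog']
picked = sample_answer(answers)
```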
In practice, the answer sampling strategy may introduce uncertainty into the learned model and may take more iterations to converge. To overcome this problem, we transform the multi-class classification problem with sampled answers into a label distribution learning (LDL) problem with a fixed answer distribution. The answers for each sample are represented as a distribution vector $a = [a_1, a_2, \dots, a_N]$ over all the possible answers, where $N$ is the total number of answers for the whole training set, and $a_i$ indicates the occurrence probability of the $i$-th answer in $A$, with $\sum_{i=1}^{N} a_i = 1$. Accordingly, we use the KL-divergence loss function to penalize the prediction $s$ after the softmax activation of the last fully-connected layer:

$$ \mathcal{L}_{\mathrm{KLD}}(a, s) = \sum_{i=1}^{N} a_i \log \frac{a_i}{s_i} \quad (9) $$
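The KLD loss between the annotated answer distribution and the softmax prediction can be sketched as follows, using the standard convention that terms with $a_i = 0$ contribute zero:

```python
import numpy as np

def kld_loss(a, s, eps=1e-12):
    """KL(a || s): a is the annotated answer distribution, s is the
    softmax output of the classifier; terms with a_i = 0 contribute 0."""
    a, s = np.asarray(a, float), np.asarray(s, float)
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / (s[mask] + eps))))

# e.g. 6 of 10 annotators gave answer 0 and 4 gave answer 1
a = [0.6, 0.4, 0.0, 0.0]     # ground-truth distribution
s = [0.5, 0.3, 0.1, 0.1]     # predicted distribution
loss = kld_loss(a, s)
```

The loss is zero exactly when the predicted distribution matches the annotated one, so the model is trained towards the full answer distribution rather than a single sampled label.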
VI Experiments

We have conducted several experiments to evaluate the performance of our models for the VQA task, using the VQA dataset to verify our approach. We first perform an ablation analysis on the MFB and MFH baseline models to verify the superior performance of the proposed approaches over existing state-of-the-art methods such as MCB and MLB. We then provide detailed analyses of the reasons why our models outperform their counterparts. Finally, we choose the optimal hyper-parameters for the MFB or MFH module and train the models with co-attention for a fair comparison with the state-of-the-art approaches on the real-world VQA datasets. The corresponding source code and pre-trained models are released at https://github.com/yuzcccc/mfb.
VI-A Datasets and Evaluation Criteria
We have evaluated the performances of our proposed approaches over multiple VQA datasets. In addition, we have compared our proposed approaches with the state-of-the-art algorithms.
The VQA dataset (a.k.a. the VQA-1.0 dataset) consists of approximately 200,000 images from the MS-COCO dataset, with 3 questions per image and 10 answers per question. The dataset is split into three parts: train (80k images and 240k question-answer pairs), val (40k images and 120k question-answer pairs), and test (80k images and 240k question-answer pairs). Additionally, there is a 25% subset of the test set named test-dev. Two tasks are provided to evaluate performance: Open-Ended (OE) and Multiple-Choice (MC). We use the tools provided by Antol et al. to evaluate the accuracy on the two tasks. Specifically, the accuracy of a predicted answer $a$ is calculated as follows:

$$ \mathrm{Acc}(a) = \min\left\{ \frac{\mathrm{count}(a)}{3},\ 1 \right\} $$

where $\mathrm{count}(a)$ is the number of times the answer $a$ is voted for by different annotators.
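The evaluation metric above can be sketched directly; an answer is counted as fully correct when at least 3 of the 10 annotators gave it:

```python
def vqa_accuracy(predicted, annotated_answers):
    """VQA accuracy: an answer counts as 100% correct when at least 3
    of the annotators gave it, i.e. Acc(a) = min(count(a)/3, 1)."""
    count = annotated_answers.count(predicted)
    return min(count / 3.0, 1.0)

# a toy annotation list with 10 answers from different annotators
annotated = ['2', '2', '2', '2', 'two', 'two', '2', '2', '2', '2']
acc_full = vqa_accuracy('2', annotated)    # 8 votes
acc_part = vqa_accuracy('two', annotated)  # 2 votes
```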
The VQA-2.0 dataset is the updated version of the VQA dataset. Compared with the VQA dataset, it contains more training samples (440k question-answer pairs for training and 214k pairs for validation), and it is more balanced, reducing the chance that an overfitted model achieves good results. Specifically, for every question there are two images in the dataset that lead to two different answers to the question. At this point, only the train and validation sets are available. Therefore, we report the results of the Open-Ended task on the validation set with the model trained on the train set. The evaluation criterion on this dataset is the same as the one used in the VQA dataset.
VI-B Experimental Setup
For the VQA and VQA-2.0 datasets, we use the Adam solver. The base learning rate is set to 0.0007 and decays every 4 epochs using an exponential rate of 0.5 for MFB and 0.25 for MFH. All the models are trained for up to 10 epochs. Dropout is used after each LSTM layer and after the MFB and MFH modules. The answer set consists of the top-$N$ most frequent answers, as described in section IV-A. For all experiments (except for the ones shown in Table II, which use the train and val sets together as the training set, like the comparative approaches), we train on the train set, validate on the val set, and report the results on the test-dev and test-standard sets (the submission attempts for the test set are strictly limited; therefore, we report most of our results on the test-dev set and the best results on the test-standard set). The batch size is set to 200 for the models without the attention mechanism, and to 64 for the models with attention (due to GPU memory limitation).
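One plausible reading of this schedule is a step-wise exponential decay every 4 epochs; the exact decay form used in the solver configuration is an assumption here:

```python
def learning_rate(epoch, base_lr=0.0007, decay_rate=0.5, decay_every=4):
    """Step-wise exponential learning-rate decay (a sketch): the base
    rate is multiplied by decay_rate (0.5 for MFB, 0.25 for MFH) every
    decay_every epochs."""
    return base_lr * decay_rate ** (epoch // decay_every)

lr_start = learning_rate(0)   # epochs 0-3 keep the base rate
lr_mid = learning_rate(4)     # epochs 4-7 use half of it (for MFB)
```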
All experiments are implemented with the Caffe toolbox and performed on workstations with NVIDIA GTX-1080 and Titan X GPUs.
Vi-C Ablation Study on the VQA Dataset
We design the following ablation experiments to verify the efficacy of our MFB and MFH modules, as well as the advantage of the KLD loss in modeling answer correlations.
Vi-C1 Design of the MFB and MFH Module
In Table I, we compare the performance of MFB and MFH with other state-of-the-art bilinear pooling models, namely MCB and MLB. The models are trained on the train set and evaluated on the test-dev set. For a fair comparison, all the compared approaches use the power and ℓ2 normalizations, and none of them introduces the attention mechanism. We explore the different hyper-parameters and normalizations introduced in MFB to understand why MFB outperforms the compared bilinear models. Finally, we evaluate MFH with different orders p to explore the effect of high-order feature pooling.
From Table I, we can see that:
First, MFB significantly outperforms MCB and MLB. With 5/6 of the parameters, MFB achieves an improvement of about 1.0 points over MCB and MLB. Moreover, with only 1/3 of the parameters and 2/3 of the GPU memory usage, a smaller MFB configuration obtains results similar to those of MCB and MLB. These characteristics allow us to train our model on a memory-limited GPU with a larger batch size. In Fig. 8, we show the validation curves, from which it can be seen that MFB converges faster than MLB during training (MLB takes more than 80,000 iterations to achieve a good validation accuracy), and MFB significantly outperforms the two other methods in terms of accuracy on the validation set. Furthermore, it can be seen from the MCB curve that its performance gradually falls after 25,000 iterations, indicating that it suffers from overfitting with its high-dimensional output features. In comparison, the performance of our MFB model is relatively robust.
Second, when the output dimensionality o is fixed to a constant, e.g., 5000, the number of factors k affects the performance. Increasing k from 1 to 5 produces a 0.5-point performance gain; beyond k=5, the performance approaches saturation. This phenomenon can be explained by the fact that a large k corresponds to using a large window to sum-pool the features, which can be treated as a compressed representation and may lose some information. When k is fixed, increasing o does not produce any further improvement. This suggests that high-dimensional output features may be more prone to overfitting. Similar results can be seen in the literature. In summary, k=5 with a moderate output dimensionality o is a suitable combination for our MFB model on the VQA dataset, so we use these settings in our follow-up experiments.
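For illustration, the un-normalized core of MFB (with k the number of factors and o the output dimensionality) can be sketched with NumPy; the projection matrices below are random stand-ins for the learned parameters:

```python
import numpy as np

def mfb_fuse(x, y, U, V, k):
    """Un-normalized MFB core: project both modalities to a k*o-dim
    space, take the element-wise product, then sum-pool with a
    window of size k to obtain the o-dim fused feature."""
    z = (U.T @ x) * (V.T @ y)            # expanded feature, shape (k*o,)
    return z.reshape(-1, k).sum(axis=1)  # sum-pooling, shape (o,)

# toy example: an 8-dim image feature fused with a 6-dim question feature
rng = np.random.default_rng(0)
dx, dy, k, o = 8, 6, 5, 4
U = rng.standard_normal((dx, k * o))
V = rng.standard_normal((dy, k * o))
fused = mfb_fuse(rng.standard_normal(dx), rng.standard_normal(dy), U, V, k)
```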
|MFB w/o power norm.||60.4||-|
|MFB w/o power and ℓ2 norms.||57.3||-|
Third, both the power normalization and the ℓ2 normalization benefit MFB performance. Power normalization results in an improvement of 0.5 points, and ℓ2 normalization, perhaps surprisingly, results in an improvement of about 3 points. Results without the ℓ2 and power normalizations have also been reported in previous work and are similar to those reported here. To explain why ℓ2 normalization is so important, we randomly choose one typical neuron from the MFB output feature before ℓ2 normalization and illustrate how its distribution evolves over time in Fig. 6. It can be seen that the standard MFB model (with both normalizations) leads to the most stable neuron distribution (i.e., small neuron variance); without the power normalization, about 10,000 iterations are needed to achieve stabilization. Without the ℓ2 normalization, the distribution fluctuates severely over the entire training course. This observation is consistent with the results shown in Table I. The effects of power normalization and ℓ2 normalization have also been observed in prior work. Furthermore, although MLB does not use any normalization, it introduces the tanh activation after the fused feature, which regularizes the distribution of the output feature to some extent.
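The two normalizations are cheap to apply; a minimal sketch:

```python
import numpy as np

def normalize(z, eps=1e-12):
    """Power (signed square-root) normalization followed by l2
    normalization, as applied to the fused MFB/MFH feature."""
    z = np.sign(z) * np.sqrt(np.abs(z))   # power normalization
    return z / (np.linalg.norm(z) + eps)  # l2 normalization
```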
Finally, with a faster convergence rate, the high-order MFH models further outperform MFB by about 0.7 points on the test-dev set. This observation demonstrates the efficacy of the high-order pooling model for VQA. However, the third-order MFH performs slightly worse than the second-order MFH despite being a more complex model. This may be explained by the representation capacity of MFH being saturated at a low order for the VQA task. Therefore, in our following experiments, the second-order MFH is used and the order superscript is omitted for simplicity.
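A sketch of the high-order cascade, under the assumption (following the MFH formulation) that each block's expanded, pre-pooling feature element-wise modulates the next block and the pooled outputs of all p blocks are concatenated:

```python
import numpy as np

def mfh_fuse(x, y, Us, Vs, k):
    """High-order (MFH) pooling sketch: a cascade of MFB blocks whose
    expanded features are chained by element-wise products; the
    sum-pooled outputs of all blocks are concatenated."""
    prev, outs = 1.0, []
    for U, V in zip(Us, Vs):                # one (U, V) pair per order
        exp = (U.T @ x) * (V.T @ y) * prev  # expanded k*o feature
        prev = exp
        outs.append(exp.reshape(-1, k).sum(axis=1))  # sum-pool to o dims
    return np.concatenate(outs)
```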
Vi-C2 Answer Correlation Modeling Strategies
In Figs. 8(a) and 8(b), the validation accuracies of the MFB and MFB+CoAtt models w.r.t. different answer sampling strategies are shown, respectively. Max Prob means using the most frequent answer of the sample as the unique label, formulating VQA as a traditional multi-class problem with a single label. This strategy serves as the baseline that does not consider answer correlation. Answer Sampling, used in previous work, randomly samples an answer from the candidate answer set each time. KLD is the strategy proposed in Section V of this paper.
From the results, we have the following observations. First, modeling answer correlation brings a remarkable improvement on the VQA dataset: the Answer Sampling and KLD strategies, which model the answer correlation, significantly outperform the Max Prob strategy. Second, compared with the Answer Sampling strategy, the proposed KLD strategy has the merits of a faster convergence rate and slightly better accuracy, especially on the more complex MFB+CoAtt model.
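A minimal sketch of the KLD strategy (function and variable names are ours): the 10 annotator answers define an empirical target distribution over the answer vocabulary, and the loss is the KL divergence from the predicted distribution to this target, so that correlated answers all receive probability mass rather than a single hard label:

```python
import numpy as np

def kld_loss(annotator_answers, pred_probs, vocab):
    """KL(target || prediction) against the empirical answer
    distribution built from the annotators' answers."""
    target = np.zeros(len(vocab))
    for a in annotator_answers:
        if a in vocab:                       # out-of-vocabulary answers are skipped
            target[vocab[a]] += 1.0
    target /= max(target.sum(), 1.0)
    mask = target > 0                        # treat 0 * log(0) as 0
    return float(np.sum(target[mask] * np.log(target[mask] / pred_probs[mask])))
```

When the prediction matches the annotators' empirical distribution exactly, the loss is zero; any deviation yields a positive penalty.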
Vi-D Results on the VQA Dataset
|Method||test-dev OE: All||Y/N||Num||Other||MC: All||test-std OE: All||Y/N||Num||Other||MC: All|
|VQA team||57.8||80.5||36.8||43.1||62.7||58.2||80.6||36.5||43.7||63.1|
Table II compares our approaches with the current state-of-the-art. The table is split into four parts over the rows: the first summarizes the methods without the attention mechanism; the second includes the methods with attention; the third shows the results of approaches with external pre-trained word embedding models, e.g., GloVe or Skip-thought Vectors (StV); and the last includes the models additionally trained with the external large-scale Visual Genome dataset. To best utilize model capacity, the training set is augmented so that both the train and val sets are used for training. Also, to better capture the question semantics, pre-trained GloVe word vectors are concatenated with the learned word embedding. The MFB model corresponds to the MFB baseline model. The MFB+Att model replaces MCB with our MFB in the MCB+Att model. The MFB+CoAtt model represents the network shown in Fig. 5. The MFB+CoAtt+GloVe model additionally concatenates the learned word embedding with the pre-trained GloVe vectors. The MFB+CoAtt+GloVe+VG model further introduces the data from the Visual Genome dataset into the training set.
From Table II, we have the following observations.
First, the models with MFB outperform the comparative approaches significantly. The MFB baseline outperforms all other existing approaches without the attention mechanism on both the OE and MC tasks, and even surpasses some approaches with attention. When attention is introduced, MFB+Att consistently outperforms the next-best model, MCB+Att, highlighting the efficacy and robustness of the proposed MFB.
Second, the co-attention model further improves performance over the attention model that considers only image attention. By introducing co-attention learning, MFB+CoAtt delivers an improvement of 0.5 points in overall accuracy on the OE task compared to the MFB+Att model, indicating the additional benefit of the co-attention learning framework.
Third, by replacing MFB with MFH, all of our models steadily enjoy a further improvement of about 0.7–1.1 points. The performance of a single MFH+CoAtt+GloVe model even surpasses the best published results obtained with an ensemble of 7 MLB or MFB models shown in Table III on the test-standard set.
Finally, with the external pre-trained GloVe model and the Visual Genome dataset, the performance of our models is further improved. The MFH+CoAtt+GloVe+VG model significantly outperforms the best reported single-model results on both the OE and MC tasks.
In Table III, we compare our model with the state-of-the-art results with model ensembles. Similar to [7, 12], we train 7 individual MFB (or MFH)+CoAtt+GloVe models and average their prediction scores. 4 of the 7 models additionally introduce the Visual Genome dataset into the training set. All the reported results are fetched from the leaderboard of the VQA dataset (the Standard tab in http://www.visualqa.org/roe.html). For a fair comparison, only the published results are shown. From the results, the ensemble of MFB models outperforms the next best result by 1.5 points on the OE task and by 2.2 points on the MC task. Furthermore, the ensemble of MFH models obtains a further improvement of 0.8 points and achieves a new state-of-the-art. Finally, compared with the results obtained by humans, there is still considerable room for improvement before human-level performance is approached.
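Score averaging for the ensemble amounts to a few lines (a sketch; names are ours):

```python
import numpy as np

def ensemble_predict(per_model_scores):
    """Average the answer-score vectors predicted by the individual
    models and return the index of the best-scoring answer."""
    avg = np.mean(per_model_scores, axis=0)
    return int(np.argmax(avg))
```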
To demonstrate the effects of co-attention learning, we visualize the learned question and image attentions of some image-question pairs from the val set in Fig. 9. The examples are randomly picked from different question types. It can be seen that the learned question and image attentions usually focus closely on the key words and the most relevant image regions. From the incorrect examples, we can also draw conclusions about the weaknesses of our approach, which are perhaps common to all VQA approaches: 1) some key words in the question are neglected by the question attention module, which seriously affects the learned image attention and the final predictions (e.g., the word catcher in the first example and the word bottom in the third example); 2) even when the intention of the question is well understood, some visual contents are still unrecognized (e.g., the flags in the second example) or misclassified (the meat in the fourth example), leading to a wrong answer for the counting problem. These observations are useful for guiding further improvement on the VQA task in the future.
|Ensemble||All||Y/N||Num||Other||MC|
|7 MCB models||66.5||83.2||39.5||58.0||70.1|
|7 MLB models||66.9||84.6||39.1||57.8||70.3|
|7 MFB models||68.4||85.6||41.0||59.8||72.5|
|7 MFH models||69.2||86.2||41.8||60.7||73.4|
Vi-E Results on the VQA-2.0 Dataset
Table IV shows our results on the VQA-2.0 dataset (i.e., the VQA Challenge 2017). We compare our models with the results of the baseline models (including the MCB model, which was the champion of the VQA Challenge 2016) and the results of the top-ranked teams on the leaderboard. We use the aforementioned training strategies for this dataset.
From the results, our single MFB and MFH models (with CoAtt+GloVe but without the Visual Genome data augmentation) significantly surpass all the baseline approaches. If we neglect the tiny difference between the results on the test-dev and test-standard sets, MFB and MFH are about 2.7 points and 3.5 points higher than the MCB model, respectively. Finally, with an ensemble of 9 models, we report an accuracy of 68.02 on the test-dev set and 68.16 on the test-challenge set (http://visualqa.org/roe_2017.html), which ranked second (tied with another team) in the VQA Challenge 2017. The details of the 9 models are illustrated in Table V.
|Team / Model||test-dev||test-challenge|
|Adelaide-ACRV-MSR (1st place)||-||69.00|
|DLAIT (2nd place)||-||68.07|
|LVNUS (4th place)||-||67.62|
|1 MFB model||64.98||-|
|1 MFH model||65.80||-|
|7 MFB models||67.24||-|
|7 MFH models||67.96||-|
|9 MFH models (2nd place)||68.02||68.16|
|index||VG||MFH / MFB(I)||Q||I||Accuracy (%)|
In this paper, a network architecture with co-attention learning is designed to model both the image attention and the question attention simultaneously, so that we can reduce the irrelevant features effectively and extract more discriminative features for image and question representations. A Multi-modal Factorized Bilinear pooling (MFB) approach is developed to achieve more effective fusion of the visual features from the images and the textual features from the questions, and a generalized high-order model called MFH is developed to capture more complex interactions between multi-modal features. Compared with the existing bilinear pooling methods, our proposed MFB and MFH approaches achieve significant improvements in VQA performance because they exploit the complex correlations between multi-modal features more effectively. By using the KL divergence as the loss function, our proposed answer prediction approach achieves a faster convergence rate and better performance compared with the state-of-the-art strategies. Our experimental results demonstrate that our approaches achieve state-of-the-art or comparable performance on two large-scale real-world VQA datasets.
-  F. Wu, Z. Yu, Y. Yang, S. Tang, Y. Zhang, and Y. Zhuang, “Sparse multi-modal hashing,” IEEE Transactions on Multimedia, vol. 16, no. 2, pp. 427–439, 2014.
-  Z. Yu, F. Wu, Y. Yang, Q. Tian, J. Luo, and Y. Zhuang, “Discriminative coupled dictionary hashing for fast cross-media retrieval,” in ACM SIGIR, 2014, pp. 395–404.
-  J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in CVPR, 2015, pp. 2625–2634.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention.” in ICML, vol. 14, 2015, pp. 77–81.
-  S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh, “Vqa: Visual question answering,” in ICCV, 2015, pp. 2425–2433.
-  M. Malinowski and M. Fritz, “A multi-world approach to question answering about real-world scenes based on uncertain input,” in NIPS, 2014, pp. 1682–1690.
-  A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach, “Multimodal compact bilinear pooling for visual question answering and visual grounding,” arXiv preprint arXiv:1606.01847, 2016.
-  B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fergus, “Simple baseline for visual question answering,” arXiv preprint arXiv:1512.02167, 2015.
-  J. Lu, J. Yang, D. Batra, and D. Parikh, “Hierarchical question-image co-attention for visual question answering,” in NIPS, 2016, pp. 289–297.
-  J. B. Tenenbaum and W. T. Freeman, “Separating style and content,” NIPS, pp. 662–668, 1997.
-  T.-Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear cnn models for fine-grained visual recognition,” in ICCV, 2015, pp. 1449–1457.
-  J.-H. Kim, K. W. On, J. Kim, J.-W. Ha, and B.-T. Zhang, “Hadamard product for low-rank bilinear pooling,” arXiv preprint arXiv:1610.04325, 2016.
-  J.-H. Kim, S.-W. Lee, D. Kwak, M.-O. Heo, J. Kim, J.-W. Ha, and B.-T. Zhang, “Multimodal residual learning for visual qa,” in NIPS, 2016, pp. 361–369.
-  J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, “Learning to compose neural networks for question answering,” arXiv preprint arXiv:1601.01705, 2016.
-  I. Ilievski, S. Yan, and J. Feng, “A focused dynamic attention model for visual question answering,” arXiv preprint arXiv:1604.01485, 2016.
-  H. Nam, J.-W. Ha, and J. Kim, “Dual attention networks for multimodal reasoning and matching,” arXiv preprint arXiv:1611.00471, 2016.
-  P. Wang, Q. Wu, C. Shen, A. v. d. Hengel, and A. Dick, “Explicit knowledge-based reasoning for visual question answering,” arXiv preprint arXiv:1511.02570, 2015.
-  Q. Wu, P. Wang, C. Shen, A. Dick, and A. van den Hengel, “Ask me anything: Free-form visual question answering based on knowledge from external sources,” in CVPR, 2016, pp. 4622–4630.
-  K. Chen, J. Wang, L.-C. Chen, H. Gao, W. Xu, and R. Nevatia, “Abc-cnn: An attention based convolutional neural network for visual question answering,” arXiv preprint arXiv:1511.05960, 2015.
-  Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked attention networks for image question answering,” in CVPR, 2016, pp. 21–29.
-  K. J. Shih, S. Singh, and D. Hoiem, “Where to look: Focus regions for visual question answering,” in CVPR, 2016, pp. 4613–4621.
-  Y. Li, N. Wang, J. Liu, and X. Hou, “Factorized bilinear models for image recognition,” arXiv preprint arXiv:1611.05709, 2016.
-  S. Rendle, “Factorization machines,” in ICDM, 2010, pp. 995–1000.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  X. Geng, C. Yin, and Z.-H. Zhou, “Facial age estimation by learning from label distributions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 10, pp. 2401–2412, 2013.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, 2014, pp. 740–755.
-  Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, “Making the v in vqa matter: Elevating the role of image understanding in visual question answering,” arXiv preprint arXiv:1612.00837, 2016.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ACM Multimedia, 2014, pp. 675–678.
-  F. Perronnin, J. Sánchez, and T. Mensink, “Improving the fisher kernel for large-scale image classification,” ECCV, pp. 143–156, 2010.
-  H. Noh, P. Hongsuck Seo, and B. Han, “Image question answering using convolutional neural network with dynamic parameter prediction,” in CVPR, 2016, pp. 30–38.
-  M. Malinowski, M. Rohrbach, and M. Fritz, “Ask your neurons: A neural-based approach to answering questions about images,” in ICCV, 2015, pp. 1–9.
-  H. Xu and K. Saenko, “Ask, attend and answer: Exploring question-guided spatial attention for visual question answering,” in ECCV, 2016, pp. 451–466.
-  J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, “Neural module networks,” in CVPR, 2016, pp. 39–48.
-  H. Noh and B. Han, “Training recurrent answering units with joint loss minimization for vqa,” arXiv preprint arXiv:1606.03647, 2016.
-  J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation.” in EMNLP, vol. 14, 2014, pp. 1532–1543.
-  R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler, “Skip-thought vectors,” in NIPS, 2015, pp. 3294–3302.
-  R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” arXiv preprint arXiv:1602.07332, 2016.