Question-Guided Hybrid Convolution for Visual Question Answering
In this paper, we propose a novel Question-Guided Hybrid Convolution (QGHC) network for Visual Question Answering (VQA). Most state-of-the-art VQA methods fuse the high-level textual and visual features from the neural network and abandon the visual spatial information when learning multi-modal features. To address these problems, question-guided kernels generated from the input question are designed to convolve with visual features, capturing the textual and visual relationship at an early stage. The question-guided convolution can tightly couple the textual and visual information, but it also introduces more parameters when learning kernels. We apply group convolution, which consists of question-independent kernels and question-dependent kernels, to reduce the parameter size and alleviate over-fitting. The hybrid convolution can generate discriminative multi-modal features with fewer parameters. The proposed approach is also complementary to existing bilinear pooling fusion and attention-based VQA methods. By integrating with them, our method can further boost the performance. Experiments on VQA datasets validate the effectiveness of QGHC.
Keywords: VQA, Dynamic Parameter Prediction, Group Convolution
1 Introduction
Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) have shown great success in vision and language tasks. Recently, CNN and RNN have been jointly trained for learning feature representations for multi-modal tasks, including image captioning [3, 4], text-to-image retrieval [5, 33], and Visual Question Answering (VQA) [6, 11, 12, 38]. Among the vision-language tasks, VQA is one of the most challenging problems. Instead of embedding images and their textual descriptions into the same feature subspace as in the text-image matching problem [7, 8, 26], VQA requires algorithms to answer natural language questions about the visual contents. The methods are thus designed to understand both the questions and the image contents in order to reason about the underlying truth.
To infer the answer based on the input image and question, it is important to fuse the information from both modalities to create joint representations. Answers can then be predicted by learning classifiers on the joint features. Early VQA methods fuse textual and visual information by feature concatenation. State-of-the-art feature fusion methods, such as Multimodal Compact Bilinear pooling (MCB), utilize bilinear pooling to learn multi-modal features.
However, methods of this type have two main limitations. First, the multi-modal features are fused in a late model stage, and the spatial information from visual features gets lost before feature fusion. The visual features are usually obtained by averaging the output of the last pooling layer and are represented as 1-d vectors, an operation that abandons the spatial information of the input images. Second, the textual and visual relationship is modeled only on the topmost layers and misses details from the low-level and mid-level layers.
To solve these problems, we propose a feature fusion scheme that generates multi-modal features by applying question-guided convolutions on the visual features (see Figure 1). The mid-level visual features and language features are first learned independently using CNN and RNN. The visual features are designed to keep the spatial information. Then a series of kernels is generated based on the language features to convolve with the visual features. Our model tightly couples the multi-modal features at an early stage to better capture the spatial information before feature fusion. One problem induced by the question-guided kernels is that the large number of parameters makes the model hard to train. Directly predicting “full” convolutional filters requires estimating a very large number of parameters (e.g., for filters that convolve with a 256-channel input feature map). This is memory-inefficient and time-consuming, and does not result in satisfactory performance (as shown in our experiments).
Motivated by the group convolution [13, 1, 14], we decompose large convolution kernels into group kernels, each of which works on a small number of input feature maps. In addition, only a portion of such group convolution kernels (question-dependent kernels) are predicted by the RNN, and the remaining kernels (question-independent kernels) are freely learned via back-propagation. Both question-dependent and question-independent kernels are shown to be important, and we name the proposed operation Question-guided Hybrid Convolution (QGHC). The visual and language features are deeply fused to generate discriminative multi-modal features. The spatial relations between the input image and question can be well captured by the question-guided convolution. Our experiments on VQA datasets validate the effectiveness of our approach and show the advantages of the proposed feature fusion over state-of-the-art methods.
Our contributions are threefold. 1) We propose a novel multi-modal feature fusion method based on question-guided convolution kernels. The relevant visual regions have a high response to the input question, and spatial information can be well captured by encoding such connections in the QGHC model. The QGHC explores deep multi-modal relationships, which benefit visual question reasoning. 2) To achieve memory efficiency and robust performance in the question-guided convolution, we propose the group convolution to learn kernel parameters. The question-dependent kernels model the relationship of visual and textual information, while the question-independent kernels reduce the parameter size and alleviate over-fitting. 3) Extensive experiments and ablation studies on the public datasets show the effectiveness of the proposed QGHC and each individual component. Our approach outperforms the state-of-the-art methods using far fewer parameters.
2 Related work
Bilinear pooling for VQA. Solving the VQA problem requires the algorithms to understand the relation between images and questions. It is important to obtain discriminative multi-modal features for accurate answer prediction. Early methods utilize feature concatenation for multi-modal feature fusion [15, 26, 33]. Recently, bilinear pooling methods have been introduced for VQA to capture high-level interactions between visual and textual features. Multimodal Compact Bilinear Pooling (MCB) projects the language and visual features into a higher dimensional space and convolves them in the Fast Fourier Transform space. In Multimodal Low-rank Bilinear (MLB), the weighting tensor for bilinear pooling is approximated by three weight matrices, which constrains the weighting tensor to be low-rank. The multi-modal features are obtained as the Hadamard product of the linearly projected visual and language features. Ben-younes et al. propose the Multimodal Tucker Fusion (MUTAN), which unifies MCB and MLB into the same framework. The weights are decomposed according to the Tucker decomposition. MUTAN achieves better performance than MLB and MCB with fewer parameters.
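The low-rank idea behind MLB can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: the toy dimensions, the matrix names `U`, `V`, `P`, and the omission of non-linearities and biases are choices made here, not details taken from the MLB paper. It checks that the Hadamard product of two linear projections gives the same result as an explicit bilinear tensor built from the three matrices.

```python
import numpy as np

def low_rank_bilinear_fusion(v, q, U, V, P):
    """Fuse a visual vector v and a question vector q.

    U: (dim_v, rank), V: (dim_q, rank), P: (rank, dim_out).
    The element-wise (Hadamard) product of the two projections replaces
    a full bilinear weight tensor of shape (dim_v, dim_q, dim_out).
    """
    joint = (U.T @ v) * (V.T @ q)   # Hadamard product in the shared space
    return P.T @ joint

rng = np.random.default_rng(0)
dim_v, dim_q, rank, dim_out = 6, 5, 4, 3   # toy sizes (assumptions)
v = rng.standard_normal(dim_v)
q = rng.standard_normal(dim_q)
U = rng.standard_normal((dim_v, rank))
V = rng.standard_normal((dim_q, rank))
P = rng.standard_normal((rank, dim_out))

fused = low_rank_bilinear_fusion(v, q, U, V, P)

# Same result from the explicit low-rank bilinear tensor W[i, j, k].
W = np.einsum("ir,jr,rk->ijk", U, V, P)
fused_full = np.einsum("i,j,ijk->k", v, q, W)
```

The factorised form stores `(dim_v + dim_q + dim_out) * rank` weights instead of `dim_v * dim_q * dim_out`, which is the source of the parameter savings.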
Attention mechanisms in language and VQA tasks. Attention mechanisms [17, 39] were originally proposed for solving language-related tasks. Xu et al. introduce an attention mechanism for image captioning, which shows that the attention maps can be adaptively generated for predicting captioning words. Building on this, Yang et al. propose to stack multiple attention layers so that each layer can focus on different regions adaptively. A co-attention mechanism has also been proposed, in which the model generates question attention and spatial attention masks so that salient words and regions can be jointly selected for more effective feature fusion. Similarly, Lu et al. employ a co-attention mechanism to simultaneously learn free-form and detection-based image regions related to the input question. In MCB, MLB, and MUTAN, attention mechanisms are adopted to partially recover the spatial information from the input image. Question-guided attention methods [21, 17] are proposed to generate attention maps from the question.
Dynamic Network. Network parameters can be dynamically predicted across different modalities. Our approach is most closely related to methods in this direction. In DPPNet, language is used to predict the parameters of a fully-connected (FC) layer for learning visual features. However, the predicted fully-connected layer cannot capture the spatial information of the image. To avoid introducing too many parameters, only a small portion of the parameters is predicted, using a hashing function. However, this strategy introduces redundancy because the FC parameters are generated from only a small number of trainable parameters. In MODERN, language is used to modulate the mean and variance parameters of the Batch Normalization (BN) layers in the visual CNN. However, learning the interactions between two modalities by predicting the BN parameters has limited learning capacity. We conduct comparisons with DPPNet and MODERN, and our proposed method shows favorable performance. We notice that language-guided convolution has also been used for object tracking. However, that approach predicts all the parameters, which makes the model difficult to train.
Group convolution in deep neural networks. Recent research found that the combination of depth-wise convolution and channel shuffle with group convolution can reduce the number of parameters in a CNN without hindering the final performance. Motivated by Xception, ResNeXt, and ShuffleNet, we decompose the visual CNN kernels into several groups. By shuffling parameters among different groups, our model can reduce the number of predicted parameters and improve the answering accuracy simultaneously. Note that for existing CNN methods with group convolution, the convolutional parameters are solely learned via back-propagation. In contrast, our QGHC consists of question-dependent kernels that are predicted based on language features and question-independent kernels that are freely updated.
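For readers unfamiliar with channel shuffle, the following is a minimal NumPy sketch of the ShuffleNet-style operation on feature channels. The function name and the toy input are assumptions made for illustration; the paper's own shuffling layer may differ in detail.

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave channels across groups. x: (channels, height, width)."""
    c, h, w = x.shape
    assert c % groups == 0, "channels must be divisible by groups"
    x = x.reshape(groups, c // groups, h, w)
    x = x.transpose(1, 0, 2, 3)          # swap group and within-group axes
    return x.reshape(c, h, w)

x = np.arange(6).reshape(6, 1, 1)        # channel i holds the value i
shuffled = channel_shuffle(x, groups=2).ravel().tolist()
# Channels [0, 1, 2 | 3, 4, 5] become [0, 3, 1, 4, 2, 5].
```

After the shuffle, each group of the next convolution receives channels that originated in different groups, so information can flow between groups.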
3 Visual Question Answering with Question-guided Hybrid Convolution
ImageQA systems take an image and a question as inputs and output the predicted answer for the question. ImageQA algorithms mostly rely on deep learning models and design effective approaches to fuse the multi-modal features for answering questions. Instead of fusing the textual and visual information in high-level layers, such as feature concatenation in the last layer, we propose a novel multi-modal feature fusion method, named Question-guided Hybrid Convolution (QGHC). Our approach couples the textual-visual features in early layers to better capture textual-visual relationships. It learns question-guided convolution kernels and preserves the visual spatial information before feature fusion, and thus achieves accurate results. The overview of our method is illustrated in Figure 1. The network predicts convolution kernels based on the question features, and then convolves them with visual feature maps. We stack multiple question-guided hybrid convolution modules, an average pooling layer, and a classifier layer together. The output of the question-guided convolution is the fused textual-visual feature maps, which are used for answering questions. To improve the memory efficiency and experimental accuracy, we utilize the group convolution to predict a portion of the convolution kernels based on the question features.
3.1 Problem formulation
Most state-of-the-art VQA methods rely on deep neural networks for learning discriminative features of the input image and question. Usually, Convolutional Neural Networks (CNN) are adopted for learning visual features, while Recurrent Neural Networks (RNN) (e.g., Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU)) encode the input question, i.e.,
$$f_v = \mathrm{CNN}(I), \qquad f_q = \mathrm{RNN}(Q),$$
where $f_v$ and $f_q$ represent visual features and question features respectively, computed from the input image $I$ and the input question $Q$.
Conventional ImageQA systems focus on designing robust feature fusion functions to generate multi-modal image-question features for answer prediction. Most state-of-the-art feature fusion methods fuse 1-d visual and language feature vectors in a symmetric way to generate the multi-modal representations. The 1-d visual features are usually generated by deep neural networks (e.g., GoogLeNet and ResNet) with a global average pooling layer. Such visual features and the later fused textual-visual features abandon the spatial information of the input image and are thus less robust to spatial variations.
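A tiny NumPy example makes the loss of spatial information concrete. The two 2x2 feature maps below are made up for illustration: they respond at opposite corners, yet global average pooling maps both to the same 1-d vector.

```python
import numpy as np

def global_average_pool(feature_maps):
    """Collapse (channels, height, width) feature maps to a 1-d vector."""
    return feature_maps.mean(axis=(1, 2))

# One channel, 2x2 map: the response sits in different corners.
top_left = np.array([[[1.0, 0.0], [0.0, 0.0]]])
bottom_right = np.array([[[0.0, 0.0], [0.0, 1.0]]])

pooled_a = global_average_pool(top_left)
pooled_b = global_average_pool(bottom_right)
# Both pooled vectors are identical, so any fusion applied after pooling
# cannot tell where in the image the response occurred.
```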
3.2 Question-guided Hybrid Convolution (QGHC) for multi-modal feature fusion
To fully utilize the spatial information of the input image, we propose Question-guided Hybrid Convolution for feature fusion. Unlike bilinear pooling methods that treat visual and textual features in a symmetric way, our approach performs the convolution on visual feature maps, and the convolution kernels are predicted based on the question features, which can be formulated as:
$$f_{vq} = \mathrm{CNN}_p\big(I;\, \tilde{w}(f_q)\big), \tag{3}$$
where $\mathrm{CNN}_p$ is the output before the last pooling layer for the input image $I$, $\tilde{w}(f_q)$ denotes the convolutional kernels predicted based on the question feature $f_q$, and the convolution on visual feature maps with the predicted kernels results in the multi-modal feature maps $f_{vq}$.
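The question-guided convolution described above can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: the toy sizes, the single FC layer used as kernel predictor, the naive convolution loop, and the names `conv2d_same` and `predict_kernels` are all assumptions.

```python
import numpy as np

def conv2d_same(x, kernels):
    """Naive stride-1, zero-padded 2-d convolution (cross-correlation).

    x: (c_in, h, w); kernels: (c_out, c_in, k, k) with odd k.
    Returns (c_out, h, w).
    """
    c_out, c_in, k, _ = kernels.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, w = x.shape
    out = np.zeros((c_out, h, w))
    for i in range(h):
        for j in range(w):
            patch = xp[:, i:i + k, j:j + k]               # (c_in, k, k)
            out[:, i, j] = np.tensordot(kernels, patch, axes=3)
    return out

def predict_kernels(f_q, fc_weight, fc_bias, kernel_shape):
    """Map the question feature to convolution kernels with one FC layer."""
    return (fc_weight @ f_q + fc_bias).reshape(kernel_shape)

rng = np.random.default_rng(0)
c_in, c_out, k, dim_q, h, w = 4, 3, 3, 8, 5, 5    # toy sizes (assumptions)
kernel_shape = (c_out, c_in, k, k)
n_kernel_weights = int(np.prod(kernel_shape))
fc_weight = 0.1 * rng.standard_normal((n_kernel_weights, dim_q))
fc_bias = np.zeros(n_kernel_weights)

f_v = rng.standard_normal((c_in, h, w))           # visual feature maps
f_q1 = rng.standard_normal(dim_q)                 # two different questions
f_q2 = rng.standard_normal(dim_q)

f_vq1 = conv2d_same(f_v, predict_kernels(f_q1, fc_weight, fc_bias, kernel_shape))
f_vq2 = conv2d_same(f_v, predict_kernels(f_q2, fc_weight, fc_bias, kernel_shape))
```

The same image features yield different multi-modal feature maps for different questions, and the output keeps its spatial layout of `h` by `w` locations.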
However, the naive solution of directly predicting “full” convolutional kernels is memory-inefficient and time-consuming. Mapping the question features to full CNN kernels requires a huge number of learnable parameters. In our model, we use a fully-connected layer to learn the question-guided convolutional kernels. To predict a commonly used kernel from a 2000-d question feature vector, the FC layer for learning the mapping requires 117 million parameters, which are hard to learn and cause over-fitting on existing VQA datasets. In our experiments, we validate that the performance of the naive solution is even worse than that of simple feature concatenation.
To mitigate the problem, we propose to predict the parameters of group convolution kernels. The group convolution divides the input feature maps into several groups along the channel dimension, and thus each group has a reduced number of channels for convolution. The outputs of the convolution with each group are then concatenated in the channel dimension to produce the output feature maps. In addition, we classify the convolution kernels into dynamically-predicted kernels and freely-updated kernels. The dynamic kernels are question-dependent and are predicted based on the question feature vector. The freely-updated kernels are question-independent. They are trained as conventional convolution kernels via back-propagation. The dynamically-predicted kernels fuse the textual and visual information at an early model stage, which better captures the multi-modal relationships. The freely-updated kernels reduce the parameter size and ensure that the model can be trained efficiently. By shuffling parameters among these two kinds of kernels, our model can achieve both accuracy and efficiency. During the testing phase, the dynamic kernels are determined by the questions, while the freely-updated kernels are fixed for all input image-question pairs.
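The parameter savings can be checked with simple arithmetic. The helper below counts the weights of one FC layer that maps a question vector to convolution kernels (biases ignored). All sizes are illustrative assumptions and are not the paper's configuration.

```python
def fc_params_for_kernels(dim_q, c_in, c_out, k, groups=1, predicted_groups=None):
    """Weights of one FC layer mapping a dim_q question vector to kernels.

    With groups > 1, each group sees c_in/groups input channels and emits
    c_out/groups output channels. Only `predicted_groups` of the groups are
    generated from the question; the others are ordinary learned kernels.
    """
    if predicted_groups is None:
        predicted_groups = groups
    per_group = k * k * (c_in // groups) * (c_out // groups)
    return dim_q * per_group * predicted_groups

# Illustrative sizes only (assumptions).
full = fc_params_for_kernels(dim_q=1000, c_in=64, c_out=64, k=3)
grouped = fc_params_for_kernels(dim_q=1000, c_in=64, c_out=64, k=3, groups=8)
hybrid = fc_params_for_kernels(dim_q=1000, c_in=64, c_out=64, k=3,
                               groups=8, predicted_groups=4)
# full = 36,864,000; grouped = 4,608,000; hybrid = 2,304,000
```

Splitting into 8 groups cuts the predictor by a factor of 8, and predicting only half of the groups halves it again.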
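Formally, we substitute Eqn. (3) with the proposed QGHC for VQA,
$$f_{vq} = \mathrm{CNN}_g\big(I;\, \tilde{w}(f_q),\, \theta_v\big), \qquad \hat{a} = \mathrm{MLP}(f_{vq}),$$
where $\mathrm{CNN}_g$ denotes a group convolution network applied to the input image $I$, with dynamically-predicted kernels $\tilde{w}(f_q)$ generated from the question feature $f_q$ and freely-updated kernels $\theta_v$. The output $f_{vq}$ of the CNN fuses the textual and visual information and is used to infer the final answers. MLP is a multi-layer perceptron module and $\hat{a}$ is the predicted answer.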
The freely-updated kernels can capture pre-trained image patterns, and we fix them during the testing stage. The dynamically-predicted kernels depend on the input questions and capture the question-image relationships. Our model fuses the textual and visual information at an early model stage by the convolution operation. The spatial information between the two modalities is well preserved, which leads to more accurate results than previous feature concatenation strategies. The combination of the dynamic and freely-updated kernels is crucial for keeping both accuracy and efficiency, and shows promising results in our experiments.
3.3 QGHC module
We stack multiple QGHC modules to better capture the interactions between the input image and question. Inspired by ResNet and ResNeXt, our QGHC module consists of a series of three convolutions.
As shown in Figure 2, the module is designed similarly to the ShuffleNet module with group convolution and identity shortcuts. The input feature maps are first equally divided along the channel dimension into groups (paths). Each of the groups then goes through 3 stages of convolutions and outputs its own feature maps. We add a group shuffling layer inside the module to make features between different groups interact with each other and to keep the advantages of both the dynamically-predicted kernels and the freely-updated kernels. The output feature maps of the groups are then concatenated together along the channel dimension. For the shortcut connection, a convolution transforms the input feature maps to the same number of channels as the output feature maps, and the two are added. Batch Normalization and ReLU are performed after each convolutional operation except for the last one, where ReLU is performed after the addition with the shortcut.
The group convolution is guided by the input questions. We randomly select a subset of the group kernels, and their parameters are predicted based on the question features. Those kernel weights are question-dependent and are used to capture location-sensitive question-image interactions. The remaining group kernels are freely updated. They are updated via back-propagation in the training stage and are fixed for all images during testing. These kernels capture the pre-trained image patterns or image-question patterns, and they do not vary with the input questions and images.
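A hybrid group convolution of this kind can be sketched as below. The sketch makes several assumptions that are not taken from the paper: 1x1 kernels are used for brevity, half of the groups are chosen as question-dependent, a single linear layer predicts their kernels, and all names and sizes are hypothetical.

```python
import numpy as np

def hybrid_group_conv1x1(x, learned_kernels, predicted_kernels, dynamic_groups):
    """Group convolution with 1x1 kernels (for brevity).

    x: (c_in, h, w). learned_kernels: (groups, c_out_g, c_in_g), shared by
    all questions. predicted_kernels: dict mapping a group index to a
    (c_out_g, c_in_g) array generated from the question. Groups listed in
    `dynamic_groups` use the predicted kernel, the others the learned one.
    """
    groups, c_out_g, c_in_g = learned_kernels.shape
    outputs = []
    for g in range(groups):
        x_g = x[g * c_in_g:(g + 1) * c_in_g]              # this group's channels
        kernel = predicted_kernels[g] if g in dynamic_groups else learned_kernels[g]
        outputs.append(np.einsum("oc,chw->ohw", kernel, x_g))
    return np.concatenate(outputs, axis=0)                # concat along channels

rng = np.random.default_rng(0)
groups, c_in_g, c_out_g, dim_q, h, w = 4, 2, 2, 6, 3, 3   # toy sizes (assumptions)
x = rng.standard_normal((groups * c_in_g, h, w))
learned = rng.standard_normal((groups, c_out_g, c_in_g))

# Randomly pick half of the groups to be question-dependent (assumption).
dynamic_groups = set(rng.choice(groups, size=groups // 2, replace=False).tolist())
fc = rng.standard_normal((len(dynamic_groups) * c_out_g * c_in_g, dim_q))

def kernels_for(f_q):
    """Predict the kernels of the question-dependent groups from f_q."""
    flat = (fc @ f_q).reshape(len(dynamic_groups), c_out_g, c_in_g)
    return dict(zip(sorted(dynamic_groups), flat))

out_a = hybrid_group_conv1x1(x, learned, kernels_for(rng.standard_normal(dim_q)), dynamic_groups)
out_b = hybrid_group_conv1x1(x, learned, kernels_for(rng.standard_normal(dim_q)), dynamic_groups)
```

For two different questions, the output channels of the question-independent groups are identical, while those of the question-dependent groups change.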
3.4 QGHC network for visual question answering
The network structure of our QGHC network is illustrated in Figure 3. The ResNet is first pre-trained on ImageNet to extract mid-level visual features. The question features are generated by a language RNN model.
The visual feature maps are then sent to three QGHC modules. The output of the QGHC modules has the same spatial size as the input feature maps. A global average pooling is applied to the final feature maps to generate the final multi-modal feature representation for predicting the most likely answer.
To learn the dynamic convolution kernels in the QGHC modules, the question feature is transformed by two FC layers with a ReLU activation in between. The two FC layers project the question feature to a vector, and the question-dependent kernel weights of the three QGHC modules are obtained by reshaping this vector into the kernel shapes. However, directly training the proposed network with both dynamically-predicted kernels and freely-updated kernels is non-trivial. The dynamic kernel parameters are the output of the ReLU non-linear function and have different magnitudes compared with the freely-updated kernel parameters. We adopt Weight Normalization to balance the weights between the two types of kernels, which stabilizes the training of the network.
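The effect of normalizing kernel magnitudes can be sketched as follows. This is a simplified reading of weight normalization (each output filter is rescaled as w = g * v / ||v||) with a fixed gain; how the gain is learned and where exactly the normalization is applied in the QGHC network are not specified here and are left as assumptions.

```python
import numpy as np

def weight_normalize(kernels, gain=1.0, eps=1e-12):
    """Rescale each output filter to have L2 norm `gain`.

    kernels: (c_out, c_in, k, k). The norm is taken per output filter.
    """
    flat = kernels.reshape(kernels.shape[0], -1)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    return (gain * flat / (norms + eps)).reshape(kernels.shape)

rng = np.random.default_rng(0)
# Freely-updated kernels with small magnitude (made-up values).
learned = 0.01 * rng.standard_normal((4, 2, 3, 3))
# Predicted kernels: non-negative ReLU outputs with large magnitude.
predicted = np.maximum(10.0 * rng.standard_normal((4, 2, 3, 3)), 0.0)

norm_learned = np.linalg.norm(weight_normalize(learned).reshape(4, -1), axis=1)
norm_predicted = np.linalg.norm(weight_normalize(predicted).reshape(4, -1), axis=1)
```

After normalization both kinds of filters have the same norm, so neither type dominates the other purely because of its scale.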
3.5 QGHC network with bilinear pooling and attention
Our proposed QGHC network is also complementary to the existing bilinear pooling fusion methods and the attention mechanism.
To combine with the MLB fusion scheme , the multi-modal features extracted from the global average pooling layer could be fused with the RNN question features again using a MLB. The fused features could be used to predict the final answers. The second stage fusion of textual and visual features brings a further improvement on the answering accuracy in our experiments.
We also apply an attention model to better capture the spatial information. The original global average pooling layer is replaced by attention-weighted pooling. To put more weight on locations of interest, a weighting map is learned by the attention mechanism. The output feature maps from the last QGHC module are added with the linearly transformed question features. A convolution followed by a spatial Softmax function generates the attention weighting map. The final multi-modal features are the weighted summation of the features at all locations. The attention mechanism is shown as the green rectangles in Figure 3.
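A minimal sketch of attention-weighted pooling is given below. It assumes a 1x1 convolution with one output channel as the scoring layer (the kernel size is an assumption), and it omits the addition of the transformed question features for brevity.

```python
import numpy as np

def attention_pool(feature_maps, attn_weight, attn_bias=0.0):
    """Attention-weighted pooling.

    feature_maps: (c, h, w); attn_weight: (c,) acts as a 1x1 convolution
    with a single output channel. Returns the pooled (c,) vector and the
    (h, w) attention map.
    """
    scores = np.einsum("c,chw->hw", attn_weight, feature_maps) + attn_bias
    scores = scores - scores.max()                    # numerical stability
    attn = np.exp(scores) / np.exp(scores).sum()      # softmax over all locations
    pooled = np.einsum("hw,chw->c", attn, feature_maps)
    return pooled, attn

rng = np.random.default_rng(0)
fm = rng.standard_normal((3, 4, 4))
pooled, attn = attention_pool(fm, rng.standard_normal(3))

# With all-zero scoring weights the attention is uniform, and the result
# reduces to global average pooling.
uniform_pooled, uniform_attn = attention_pool(fm, np.zeros(3))
```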
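4 Experiments
Table 1. Ablation studies on the VQA dataset (QD: question-dependent, QI: question-independent).
|Model|QD Weights|QI Weights|All|
|QGHC-group 16|1.3M|0.15M|58.22|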
4.1 VQA Dataset
4.1.1 Data and experimental setup.
The VQA dataset is built from 204,721 MS-COCO images with human-annotated questions and answers. On average, each image has 3 questions, and each question has 10 answers. The dataset is divided into three splits: training (82,783 images), validation (40,504 images), and testing (81,434 images). A testing subset named test-dev, with 25% of the samples, can be evaluated multiple times a day. We follow the setup of previous methods and perform ablation studies on the testing subset. Our experiments focus on the open-ended task, which requires predicting the correct answer as a free-form language expression. If the predicted answer appears at least 3 times in the ground truth answers, the predicted answer is considered correct.
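The evaluation rule can be written as a short function. The sketch below uses the commonly quoted form of the VQA benchmark metric, min(number of matching human answers / 3, 1), which also gives partial credit for one or two matches. It is a simplification: the official evaluation additionally normalizes answer strings and averages over annotator subsets, and the example answers are made up.

```python
def vqa_accuracy(predicted, human_answers):
    """Simplified open-ended VQA accuracy: min(#matching humans / 3, 1)."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

# Ten hypothetical human answers for one question.
answers = ["red"] * 4 + ["dark red"] * 2 + ["maroon"] * 4
score_red = vqa_accuracy("red", answers)            # 4 matches
score_dark = vqa_accuracy("dark red", answers)      # 2 matches
score_blue = vqa_accuracy("blue", answers)          # 0 matches
```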
Our models have the same setting when comparing with the state-of-the-art methods. The compared methods follow their original setup. For the proposed approach, images are resized to a fixed size. The visual features are learned by an ImageNet pre-trained ResNet-152, and the question is encoded to a 2400-d feature vector by skip-thought using GRU. The candidate answers are selected as the most frequent 2,000 answers in the training and validation sets. The model is trained using the ADAM optimizer. For results on the validation set, only the training set is used for training. For results on test-dev, we follow the setup of previous methods and use both the training and validation data for training.
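Selecting the most frequent answers as classification targets is straightforward with the standard library. The helper name and the toy answer list below are hypothetical and only illustrate the selection step.

```python
from collections import Counter

def build_answer_vocabulary(answers, top_k):
    """Return the top_k most frequent answers, most frequent first."""
    return [answer for answer, _ in Counter(answers).most_common(top_k)]

# Toy list of ground-truth answers (made up for illustration).
train_answers = ["yes", "no", "yes", "2", "yes", "no", "red"]
vocab = build_answer_vocabulary(train_answers, top_k=2)
answer_to_index = {answer: i for i, answer in enumerate(vocab)}
```

Questions whose answers fall outside this vocabulary cannot be answered correctly by the classifier, which is the usual trade-off of treating VQA as classification.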
4.1.2 Ablation studies on the VQA dataset.
We conduct ablation studies to investigate factors that influence the final performance of our proposed QGHC network. The results are shown in Table 1. Our default QGHC network (denoted as QGHC) has a visual ResNet-152 followed by three consecutive QGHC modules. Each QGHC module has a stage-1 convolution with freely-updated kernels, a stage-2 convolution with both dynamically-predicted kernels and freely-updated kernels, and another convolution stage with freely-updated kernels (see Figure 2). Each of these three stage convolutions has 8 groups. They have 32, 32, and 64 output channels respectively.
We first investigate the influence of the number of QGHC modules and the number of convolution channels. We list the results for different numbers of QGHC modules in Table 1. QGHC-1, QGHC-2, and QGHC-4 represent 1, 2, and 4 QGHC modules respectively. As shown in Table 1, the parameter size grows as the number of QGHC modules increases, but there is no further improvement in accuracy when stacking more than 3 QGHC modules. We therefore keep 3 QGHC modules in our model. We also test halving the numbers of output channels of the three group convolutions to 16, 16, and 32 (denoted as QGHC-1/2). The results show that halving the number of channels only slightly decreases the final accuracy.
We then test different group numbers. We change the group number from 8 to 4 (QGHC-group 4) and 16 (QGHC-group 16). Our proposed method is not sensitive to the group number of the convolutions and the model with 8 groups achieves the best performance. We also investigate the influence of the group shuffling layer. Removing the group shuffling layer (denoted as QGHC-w/o shuffle) decreases the accuracy by 0.32% compared with our model. The shuffling layer makes features between different groups interact with each other and is helpful to the final results.
For different QGHC module structures, we first test a naive solution. The QGHC module is implemented as a single “full” convolution without groups. Its parameters are all dynamically predicted by question features (denoted as QGHC-1-naive). We then convert the single full convolution to a series of three full convolutions with a residual connection between the input and output feature maps (denoted as QGHC-1-full), where the convolution kernels are all dynamically predicted by the question features. The improvement of QGHC-1-full over QGHC-1-naive demonstrates the advantages of the residual structure. Based on QGHC-1-full, we convert all the full convolutions to group convolutions with 8 groups (denoted as QGHC-1-group). The results outperform QGHC-1-full, which shows the effectiveness of the group convolution. However, the accuracy is still inferior to that of our proposed QGHC-1 with hybrid convolution. The results demonstrate that the question-guided kernels can help better fuse the textual and visual features and achieve robust answering performance.
Finally, we test the combination of our method with different additional components. 1) The multi-modal features are concatenated with the question features, and then fed into the FC layer for answer prediction (denoted as QGHC+concat). It results in a marginal improvement in the final accuracy. 2) We use MUTAN to fuse our QGHC-generated multi-modal features with question features again for answer prediction (denoted as QGHC+MUTAN). It has better results than QGHC+concat. 3) The attention is also added to QGHC following the descriptions in Section 3.5 (denoted as QGHC+att.).
4.1.3 Comparison with state-of-the-art methods.
QGHC fuses multi-modal features in an efficient way. The output feature maps of our QGHC module utilize the textual information to guide the learning of visual features and outperform state-of-the-art feature fusion methods. In this section, we compare our proposed approach (without using the attention module) with state-of-the-art methods. The results on the VQA dataset are shown in Table 2. We compare our proposed approach with multi-modal feature fusion methods including MCB, MLB, and MUTAN. Our feature fusion is performed before the spatial pooling and can better capture the spatial information than previous methods. MUTAN can also be combined with MLB (denoted as MUTAN+MLB) to further improve the overall performance.
Attention mechanisms are widely utilized in VQA algorithms for associating words with image regions. Our method can be combined with attention models for predicting more accurate answers. In Section 3.5, we adopt a simple attention implementation. More complex attention mechanisms, such as hierarchical attention and stacked attention, can also be combined with our approach. Table 3 lists the answering accuracies on the VQA dataset of different state-of-the-art methods with attention mechanisms.
We also compare our method with dynamic parameter prediction methods. DPPNet  (Table 2) and MODERN  (Table 3) are two state-of-the-art dynamic learning methods. Compared with DPPNet(VGG) and MODERN(ResNet-152), QGHC improves the performance by 6.78% and 3.73% respectively on the test-dev subset, which demonstrates the effectiveness of our QGHC model.
4.2 CLEVR dataset
The CLEVR dataset is proposed to test the reasoning ability of VQA models on tasks such as counting, comparing, and logical reasoning. Questions and images from CLEVR are generated by a simulation engine that randomly combines 3D objects. This dataset contains 699,989 training questions, 149,991 validation questions, and 149,988 test questions.
4.2.1 Experimental setting.
In our proposed model, the image is resized to a fixed size. The question is first embedded to a 300-d vector through an FC layer followed by a ReLU non-linear function, and then input into a 2-layer LSTM with 256 hidden states to generate textual features. Our QGHC network contains three QGHC modules for fusing multi-modal information. All parameters are learned from scratch and trained in an end-to-end manner. The network is trained using the ADAM optimizer with batch size 64. All the results are reported on the validation subset.
4.2.2 Comparison with state-of-the-art methods.
We compare our model with the following methods. CNN-LSTM encodes images and questions using CNN and LSTM respectively. The encoded image features and question features are concatenated and then passed through an MLP to predict the final answers. Multimodal Compact Bilinear Pooling (MCB) fuses textual and visual features by compact bilinear pooling, which captures the high-level interaction between images and questions. Stacked Attention (SA) adopts multiple attention models to refine the fusion results and utilizes linear transformations to obtain the attention maps. MCB and SA can be combined with the above CNN-LSTM method. Neural Module Network (NMN) proposes a sentence parsing method and a dynamic neural network. However, sentence parsing might fail in practice and lead to a poor network structure. End-to-end Neural Module Network (N2NMN) learns to parse the question and predicts the answer distribution using a dynamic network structure.
The results of different methods on the CLEVR dataset are shown in Table 4. The multi-modal concatenation (CNN-LSTM) does not perform well, since it cannot model the complex interactions between images and questions. Stacked Attention (+SA) can improve the results since it utilizes the spatial information from input images. Our QGHC model still outperforms +SA by 17.40%. For the N2NMN, it parses the input question to dynamically predict the network structure. Our proposed method outperforms it by 2.20%.
Table 4. Results on the CLEVR dataset (question-type columns: Compare integers, Query attribute, Compare attribute).
4.3 Visualization of question-guided convolution
Motivated by the class activation mapping (CAM), we visualize the activation maps of the output feature maps generated by the QGHC modules. The weighted summation of the topmost feature maps can localize answer regions.
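The weighted summation used in CAM can be sketched in NumPy as follows. The sketch assumes the standard CAM setting of a linear classifier applied after global average pooling; the random feature maps and weights, as well as the function name, are illustrative assumptions.

```python
import numpy as np

def class_activation_map(feature_maps, classifier_weights, answer_index):
    """Weighted sum of feature maps for one answer.

    feature_maps: (c, h, w) before global average pooling.
    classifier_weights: (num_answers, c) of the linear layer after pooling.
    Returns an (h, w) map normalised to [0, 1].
    """
    cam = np.einsum("c,chw->hw", classifier_weights[answer_index], feature_maps)
    cam = cam - cam.min()
    return cam / cam.max() if cam.max() > 0 else cam

rng = np.random.default_rng(0)
feature_maps = rng.random((4, 3, 3))
classifier_weights = rng.standard_normal((5, 4))

# Answer scores from the pooled features, then the map for the top answer.
logits = classifier_weights @ feature_maps.mean(axis=(1, 2))
answer = int(np.argmax(logits))
cam = class_activation_map(feature_maps, classifier_weights, answer)
```

Before normalization, the spatial mean of the map equals the answer's score, so each location's value can be read as its contribution to that score.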
Convolution activation maps for our last QGHC module are shown in Figure 4. We can observe that the activation regions relate to the questions and that the answers are predicted correctly for different types of questions, including shape, color, and number. In addition, we also visualize the activation maps of different QGHC modules by training an answer prediction FC layer for each of them. As the examples in Figure 1 show, the QGHC gradually focuses on the correct regions.
5 Conclusion
In this paper, we propose a question-guided hybrid convolution for learning discriminative multi-modal feature representations. Our approach fully utilizes the spatial information and is able to capture complex relations between the image and question. By introducing the question-guided group convolution kernels with both dynamically-predicted and freely-updated kernels, the proposed QGHC network shows strong capability in solving the visual question answering problem. The proposed approach is complementary to existing feature fusion methods and attention mechanisms. Extensive experiments demonstrate the effectiveness of our QGHC network and its individual components.
Acknowledgement
This work is supported by SenseTime Group Limited, the General Research Fund sponsored by the Research Grants Council of Hong Kong (Nos. CUHK14213616, CUHK14206114, CUHK14205615, CUHK14203015, CUHK14239816, CUHK419412, CUHK14207814, CUHK14208417, CUHK14202217), the Hong Kong Innovation and Technology Support Program (No. ITS/121/15FX), and the National Research Foundation, Prime Minister’s Office, Singapore under its International Research Centres in Singapore Funding Initiative.
-  Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. (2012) 1097–1105
-  Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in neural information processing systems. (2014) 3104–3112
-  Li, Y., Ouyang, W., Zhou, B., Wang, K., Wang, X.: Scene graph generation from objects, phrases and region captions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017) 1261–1270
-  Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: International Conference on Machine Learning. (2015) 2048–2057
-  Hu, R., Xu, H., Rohrbach, M., Feng, J., Saenko, K., Darrell, T.: Natural language object retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 4555–4564
-  Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., Parikh, D.: Vqa: Visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 2425–2433
-  Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., et al.: Devise: A deep visual-semantic embedding model. In: Advances in neural information processing systems. (2013) 2121–2129
-  Reed, S., Akata, Z., Lee, H., Schiele, B.: Learning deep representations of fine-grained visual descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 49–58
-  Zhou, B., Tian, Y., Sukhbaatar, S., Szlam, A., Fergus, R.: Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167 (2015)
-  Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847 (2016)
-  Kim, J.H., On, K.W., Kim, J., Ha, J.W., Zhang, B.T.: Hadamard product for low-rank bilinear pooling. arXiv preprint arXiv:1610.04325 (2016)
-  Ben-younes, H., Cadene, R., Cord, M., Thome, N.: Mutan: Multimodal tucker fusion for visual question answering. arXiv preprint arXiv:1705.06676 (2017)
-  Chollet, F.: Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357 (2016)
-  Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431 (2016)
-  Lin, T.Y., RoyChowdhury, A., Maji, S.: Bilinear cnn models for fine-grained visual recognition. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 1449–1457
-  Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
-  Xu, H., Saenko, K.: Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In: European Conference on Computer Vision, Springer (2016) 451–466
-  Yang, Z., He, X., Gao, J., Deng, L., Smola, A.: Stacked attention networks for image question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 21–29
-  Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. In: Advances In Neural Information Processing Systems. (2016) 289–297
-  Lu, P., Li, H., Zhang, W., Wang, J., Wang, X.: Co-attending free-form regions and detections with multi-modal multiplicative feature embedding for visual question answering. In: Proceedings of AAAI. (2018) 7218–7225
-  Chen, K., Wang, J., Chen, L.C., Gao, H., Xu, W., Nevatia, R.: Abc-cnn: An attention based convolutional neural network for visual question answering. arXiv preprint arXiv:1511.05960 (2015)
-  Noh, H., Hongsuck Seo, P., Han, B.: Image question answering using convolutional neural network with dynamic parameter prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 30–38
-  de Vries, H., Strub, F., Mary, J., Larochelle, H., Pietquin, O., Courville, A.: Modulating early visual processing by language. arXiv preprint arXiv:1707.00683 (2017)
-  Li, Z., Tao, R., Gavves, E., Snoek, C.G., Smeulders, A., et al.: Tracking by natural language specification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017) 6495–6503
-  Zhang, X., Zhou, X., Lin, M., Sun, J.: Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083 (2017)
-  Li, S., Xiao, T., Li, H., Yang, W., Wang, X.: Identity-aware textual-visual matching with latent co-attention. In: IEEE International Conference on Computer Vision. (2017)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2016) 770–778
-  Salimans, T., Kingma, D.P.: Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In: Advances in Neural Information Processing Systems. (2016) 901–909
-  Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C.L., Girshick, R.: Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. arXiv preprint arXiv:1612.06890 (2016)
-  Kiros, R., Zhu, Y., Salakhutdinov, R.R., Zemel, R., Urtasun, R., Torralba, A., Fidler, S.: Skip-thought vectors. In: Advances in neural information processing systems. (2015) 3294–3302
-  Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 2921–2929
-  Yu, Z., Yu, J., Fan, J., Tao, D.: Multi-modal factorized bilinear pooling with co-attention learning for visual question answering
-  Li, S., Xiao, T., Li, H., Zhou, B., Yue, D., Wang, X.: Person search with natural language description. In: IEEE Conference on Computer Vision and Pattern Recognition. (2017)
-  Andreas, J., Rohrbach, M., Darrell, T., Klein, D.: Neural module networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 39–48
-  Kim, J.H., Lee, S.W., Kwak, D., Heo, M.O., Kim, J., Ha, J.W., Zhang, B.T.: Multimodal residual learning for visual qa. In: Advances in Neural Information Processing Systems. (2016) 361–369
-  Andreas, J., Rohrbach, M., Darrell, T., Klein, D.: Learning to compose neural networks for question answering. In: Proceedings of NAACL-HLT. (2016) 1545–1554
-  Noh, H., Han, B.: Training recurrent answering units with joint loss minimization for vqa. arXiv preprint arXiv:1606.03647 (2016)
-  Lu, P., Ji, L., Zhang, W., Duan, N., Zhou, M., Wang, J.: R-VQA: Learning Visual Relation Facts with Semantic Attention for Visual Question Answering. In: Proceedings of SIGKDD. (2018)
-  Li, S., Bak, S., Carr, P., Wang, X.: Diversity regularized spatiotemporal attention for video-based person re-identification. In: IEEE Conference on Computer Vision and Pattern Recognition. (2018)
-  Nam, H., Ha, J.W., Kim, J.: Dual attention networks for multimodal reasoning and matching. arXiv preprint arXiv:1611.00471 (2016)
-  Hu, R., Andreas, J., Rohrbach, M., Darrell, T., Saenko, K.: Learning to reason: End-to-end module networks for visual question answering. arXiv preprint arXiv:1704.05526 (2017)
-  Johnson, J., Hariharan, B., van der Maaten, L., Hoffman, J., Fei-Fei, L., Zitnick, C.L., Girshick, R.: Inferring and executing programs for visual reasoning. In: ICCV. (2017)
-  Perez, E., Strub, F., de Vries, H., Dumoulin, V., Courville, A.: Film: Visual reasoning with a general conditioning layer. arXiv preprint arXiv:1709.07871 (2017)