Modal-aware Features for Multimodal Hashing

Haien Zeng, Hanjiang Lai, Hanlu Chu, Yong Tang, Jian Yin

Email addresses: zenghen@mail2.sysu.edu.cn (H. Zeng), (laihanj3, issjyin)@mail.sysu.edu.cn (H. Lai, J. Yin), (hlchu, ytang)@m.scnu.edu.cn (H. Chu, Y. Tang). H. Zeng, H. Lai, and J. Yin are with the School of Data and Computer Science, Sun Yat-sen University, China. H. Chu and Y. Tang are with the School of Computer Science, South China Normal University, China. Hanjiang Lai is the corresponding author.
Abstract

Many retrieval applications can benefit from multiple modalities, e.g., Wikipedia articles that pair text with images, for which the representation of multimodal data is the critical component. Most deep multimodal learning methods typically involve two steps to construct the joint representations: 1) learning multiple intermediate features, with each intermediate feature corresponding to a modality, using separate and independent deep models; 2) merging the intermediate features into a joint representation using a fusion strategy. However, in the first step, these intermediate features have no knowledge of each other and cannot fully exploit the information contained in the other modalities. In this paper, we present a modal-aware operation as a generic building block to capture the non-linear dependences among the heterogeneous intermediate features, so that the underlying correlation structures in the other modalities can be learned at an early stage. The modal-aware operation consists of a kernel network and an attention network. The kernel network is utilized to learn the non-linear relationships with the other modalities. Then, to learn better representations for binary hash codes, we present an attention network that finds the informative regions of these modal-aware features that are favorable for retrieval. Experiments conducted on three public benchmark datasets demonstrate significant improvements in the performance of our method relative to state-of-the-art methods.

Multimodal Learning, Modal-aware Features, Information Retrieval, Hashing, Nearest Neighbor Search.

I Introduction

Multimodal hashing [36] is the task of embedding multimodal data into a single binary code, which aims to improve performance by using the complementary information provided by different types of data sources. Since good representations are important for multimodal hashing, in this paper we focus on developing a better feature learning approach.

To learn the representations, multimodal fusion [3] has been proposed, which aims to generate a joint representation from two or more modalities that is suited to the given task. Multimodal fusion can be mainly divided into two categories [3]: model-agnostic approaches [7] and model-based approaches [26]. The model-agnostic methods do not rely on a specific machine learning method. According to the data processing stage, model-agnostic methods can be further split into early and late fusion. Early fusion immediately combines multiple raw/preprocessed data sources into a joint representation. In contrast, late fusion performs integration after all of the modalities have made decisions. The model-based approaches fuse the heterogeneous data using different machine learning models, e.g., multiple kernel learning [12], graphical models [10] and neural networks [32].

Fig. 1: Illustration of two different feature extraction schemes for multimodal data: (A) each modality uses individual neural layers to learn an intermediate feature; (B) our proposed modal-aware feature learning, which can learn the non-linear dependences among the heterogeneous data.

Recently, deep multimodal fusion has attracted much attention because it is able to extract powerful feature representations from raw data. As shown in Figure 1 (A), the common practice for deep multimodal fusion is as follows [1, 29, 31]: 1) each modality starts with several individual neural layers to learn an intermediate feature; 2) these multiple intermediate features are merged into a joint representation via a fusion strategy. Such a fusion approach is referred to as intermediate fusion [20] because the powerful intermediate features obtained by deep neural networks (DNNs) are merged to construct the joint representation. Deep multimodal learning has been shown to achieve remarkable performance for many machine learning tasks, such as deep cross-modal hashing [17] and deep semantic multimodal hashing [18].

While they have achieved great success, most existing methods focus on designing better fusion strategies, e.g., gated multimodal units (GMUs) [2] and multimodal compact bilinear pooling (MCB) [11], and only limited attention has been paid to the intermediate features. The multiple intermediate features are learned separately and do not fully utilize the underlying correlation structures in the other modalities. Thus, a natural question arises: can we incorporate the information from other modalities to learn the intermediate features?

In this paper, we propose a modal-aware operation as a generic building block to learn the multiple intermediate features. Unlike other deep multimodal approaches, in which each intermediate feature is learned via several individual neural layers, our method learns dependent and joint intermediate features via the proposed modal-aware operation. The features are forwarded to the modal-aware operation to produce new intermediate features, which are learned jointly and dependently so that each intermediate feature incorporates information from the other modalities.

In the context of multimodal hashing, two factors are considered in the proposed modal-aware operation. The first consideration is how to learn the non-linear dependences from other modalities. Inspired by kernel methods [12], we present a kernel network to learn the underlying correlation structures in the other modalities. Given two intermediate features from two modalities, we first calculate the kernel similarities, i.e., dot-product similarities, between the two features. Then, the similarities are used as weights to reweight the original features. The second consideration is how to learn better intermediate features for binary hash codes. Binary representations always introduce information loss compared to the original real values, e.g., each bit has only two values: 0 or 1. To reduce the information loss, we further propose an attention network that focuses on selecting the informative parts of the multimodal data. The uninformative parts are removed and are not used to encode the binary codes. Thus, this method is able to alleviate the information loss to some extent because the binary codes are generated from the informative parts of the multimodal data that are favorable for retrieval. To fully utilize the modalities, all of the intermediate features are incorporated to learn the attention maps.

The main contributions of this paper can be summarized as follows.

  • We propose a modal-aware operation to learn the intermediate features. This operation can learn the information contained in other modalities prior to fusion, which is helpful for better capturing data correlations.

  • We propose a kernel network to capture the non-linear dependences and an attention network to find the informative regions. These two networks learn better intermediate features for generating binary hash codes.

  • We conduct extensive experiments on three multimodal databases to evaluate the usefulness of the proposed modal-aware operation. Our method yields better performance compared with several state-of-the-art baselines.

II Related Work

II-A Multimodal Fusion

Multimodal fusion is an important step in multimodal learning. A simple approach for multimodal fusion is to concatenate or sum the features to obtain a joint representation [19]. For instance, Hu et al. [15] concatenated text embeddings and visual features for image segmentation. Reconstruction methods have also been proposed to fuse the multimodal data. For example, autoencoders [30] and deep Boltzmann machines [34] were trained to reconstruct both modalities with only one modality as the input. Subsequently, inspired by the success of bilinear pooling and gated recurrent networks, Fukui et al. [11] proposed multimodal compact bilinear pooling to efficiently combine multimodal features, and Arevalo et al. [2] proposed a gated multimodal unit to determine how much each modality affects unit activation. Liu et al. [28] multiplicatively combined a set of mixed source modalities to capture cross-modal signal correlations. Although many approaches have been proposed for multimodal fusion, these deep learning methods do not fully explore the dependences among the modalities prior to the fusion operations. In this paper, we argue that capturing the dependences among the heterogeneous modalities will benefit multimodal fusion.

II-B Multimodal Retrieval

A closely related line of work is cross-modal hashing [37]. Given a query from one modality, the goal of cross-modal hashing is to retrieve the relevant data from another modality. For example, cross-view hashing (CVH) [35] and semantic correlation maximization (SCM) [43] use hand-crafted features. Deep cross-modal hashing (DCMH) [17] and pairwise relationship guided deep hashing (PRDH) [41] are deep-network-based methods. Attention-aware deep adversarial hashing [44] and self-supervised adversarial hashing (SSAH) [23] apply adversarial learning to generate better binary codes. Although many approaches have been proposed for cross-modal hashing, our multimodal hashing is different: it aims to learn joint representations rather than coordinated representations, where the joint approach combines multiple modalities into the same representation space, whereas the coordinated approach processes the modalities separately and enforces similarity preservation among them [3].

Other related works include multi-view hashing methods that leverage multiple views to learn better binary codes. Some representative studies include multiple feature hashing (MFH) [21], composite hashing with multiple information sources (CHMIS) [42], multi-view latent hashing (MVLH) [33], dynamic multi-view hashing (DMVH) [40] and so on. In this paper, we consider only multimodal data, not multiple views (e.g., SIFT and HOG features extracted from the same image modality).

Limited attention has been paid to multimodal hashing. Wang et al. [36] proposed deep multimodal hashing with orthogonal regularization to exploit the intra-modality and inter-modality correlations. Cao et al. [4] proposed an extended probabilistic latent semantic analysis (pLSA) to integrate visual and textual information. In this paper, we focus on learning better intermediate features for multimodal hashing.

Fig. 2: Overview of deep multimodal hashing. It consists of three sequential parts: (A) feature learning module; (B) fusion module; and (C) hashing module. Please note that the intermediate features are learned separately. In this paper, we focus on learning better intermediate features.

III Overview of Deep Multimodal Hashing

In this section, we briefly summarize the deep multimodal hashing framework.

Let $\mathcal{O} = \{o_i\}_{i=1}^{n}$ denote a set of $n$ instances, where each instance is represented in multiple modalities. For ease of presentation, we consider only two modalities, i.e., image and text, to explain our main idea. We denote an instance as $o_i = (x_i, y_i, l_i)$, where $x_i$ and $y_i$ are the image and text descriptions of the $i$-th instance, and $l_i$ is the corresponding ground-truth label. Let $B = \{b_i\}_{i=1}^{n}$ denote the binary codes, where $b_i \in \{0,1\}^{q}$ is the $q$-dimensional binary code associated with $o_i$. The aim of multimodal hashing is to learn hash functions that encode each instance $o_i$ into one binary code $b_i$ while preserving the similarities between the instances. For example, if $o_i$ and $o_j$ are similar, the Hamming distance between $b_i$ and $b_j$ should be small. When $o_i$ and $o_j$ are dissimilar, the Hamming distance should be large.
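As a concrete illustration of this retrieval objective, the following minimal sketch (using hypothetical 8-bit codes in $\{0,1\}$, not codes produced by the proposed model) computes Hamming distances between a query code and database codes and ranks the database accordingly:

```python
import numpy as np

# Hypothetical q=8-bit binary codes for a query and three database items.
query = np.array([1, 0, 1, 1, 0, 0, 1, 0])
database = np.array([
    [1, 0, 1, 1, 0, 0, 1, 1],   # differs in 1 bit  -> should be ranked first
    [1, 0, 1, 0, 0, 1, 1, 0],   # differs in 2 bits
    [0, 1, 0, 0, 1, 1, 0, 1],   # differs in 8 bits -> dissimilar
])

# Hamming distance = number of differing bits.
hamming = (database != query).sum(axis=1)
print(hamming)               # [1 2 8]
print(np.argsort(hamming))   # retrieval ranking: [0 1 2]
```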

Different from unimodal data, each instance consists of multiple unimodal signals, and combining these signals into a joint representation becomes a critical step. Deep multimodal learning (DML) approaches have been shown to achieve remarkable performance because they can learn powerful features from all of the modalities. Merging these powerful features into a joint representation leads to better and more flexible multimodal fusion.

An illustration of a deep network for multimodal hashing is shown in Figure 2. The network is divided into three sequential parts: 1) the feature learning module, which learns efficient intermediate features from the raw image and text data; 2) the multimodal fusion module, which merges the two intermediate features into a joint representation; and 3) the hashing module, which encodes the joint representations into binary codes, followed by a similarity-preserving loss.

In the feature learning module, the convolutional layers are applied to produce powerful feature maps for the image modality. The images go through several convolutional layers to obtain high-level intermediate feature maps. For the text modality, the feed-forward neural network with stacked fully-connected layers is utilized to encode the text into semantic text features.

In the fusion module, given the two intermediate features, a fusion strategy is utilized to obtain a joint representation. Many fusion methods have been proposed, e.g., concatenation, gated multimodal units (GMUs) [2] and multimodal compact bilinear pooling (MCB) [11].

In the hashing module, the joint representation is mapped into a feature vector with the desired length, e.g., a $q$-bit approximate binary code. Then, the similarity-preserving loss is used to preserve the relative similarities of the multimodal data.
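The following PyTorch sketch outlines this three-module pipeline schematically. All submodules here are placeholders (a tiny convolutional image branch, a single linear text layer, concatenation fusion, and a tanh relaxation of the binary code); they stand in for the concrete networks described later and are not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class DeepMultimodalHashing(nn.Module):
    """Schematic pipeline of Figure 2: feature learning -> fusion -> hashing."""
    def __init__(self, q: int = 32):
        super().__init__()
        self.image_net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                                       nn.AdaptiveAvgPool2d(1), nn.Flatten())   # feature learning (image)
        self.text_net = nn.Sequential(nn.Linear(1000, 64), nn.ReLU())           # feature learning (text)
        self.hash_fc = nn.Linear(128, q)                                        # hashing module

    def forward(self, image, bow):
        f_img = self.image_net(image)              # intermediate image feature
        f_txt = self.text_net(bow)                 # intermediate text feature
        joint = torch.cat([f_img, f_txt], dim=1)   # fusion module (concatenation here)
        return torch.tanh(self.hash_fc(joint))     # approximate q-bit binary code

net = DeepMultimodalHashing(q=32)
print(net(torch.randn(2, 3, 224, 224), torch.randn(2, 1000)).shape)  # torch.Size([2, 32])
```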

However, in the above deep multimodal hashing framework, the intermediate features are learned separately and have no prior knowledge of the other modalities before fusion. In this paper, we present a modal-aware operation that aims to learn better intermediate feature representations. It contains a kernel network that learns the correlations among different modalities and an attention network that finds the informative regions. These two components are described in detail in the next section.

Fig. 3: Illustration of the kernel network. The image feature maps have size $H \times W \times C$ and the text feature vector has feature length $d$. "$\otimes$" denotes matrix multiplication and "$\odot$" denotes element-wise multiplication. "conv" and "fc" denote the convolutional and fully connected layers, respectively. "GAP" represents the global average pooling layer.

IV Modal-aware Operation

In this section, we present a modal-aware operation that consists of two parts: a kernel network and an attention network.

IV-A Kernel Network

The kernel network takes two intermediate features as inputs: the image feature maps and the text feature vector. More specifically, suppose that $F_I \in \mathbb{R}^{H \times W \times C}$ represents the feature maps of the image modality, where $H$, $W$ and $C$ are the height, width and number of channels, respectively, and $F_T \in \mathbb{R}^{d}$ is the corresponding textual feature, where $d$ is the feature length.

Inspired by non-local features [38] and kernel methods, the outputs of the kernel network are defined as

$\tilde{F}_I = k_1(F_I, F_T) \odot F_I, \qquad \tilde{F}_T = k_2(F_I, F_T) \odot F_T, \qquad (1)$

where $k_1(\cdot,\cdot)$ and $k_2(\cdot,\cdot)$ are the kernel functions that measure the similarity between the inputs $F_I$ and $F_T$. We use the kernel methods to exploit the correlation structures obtained from the other modality. In Eq. (1), the intermediate features of the image modality are learned from both the textual and image features. First, the kernel similarity between the image feature and the textual feature is calculated. Then, this similarity is used to reweight the original feature. Thus, through these operations, textual information is embedded into the image feature. The same approach is used for the text modality. We note that we use different kernel functions because the textual feature is a one-dimensional vector while the image feature maps form a three-dimensional tensor.

To train the kernel network in an end-to-end manner, the kernel function is further expressed as an inner product in another space, which is reformulated as

$k(F_I, F_T) = \langle \phi(F_I), \psi(F_T) \rangle, \qquad (2)$

where $\phi$ and $\psi$ are two mapping functions that project the data into another space. Since we use deep networks to learn from the multimodal data, we also design two networks as these mapping functions. That is, a convolutional layer and a fully connected layer are utilized as the mapping functions: $\phi$ is a convolutional layer and $\psi$ is a fully connected layer.

Figure 3 shows the specific structure of the kernel network. For the image modality, the network takes the feature maps $F_I$ and the textual vector $F_T$ as inputs. The approach consists of three parts: 1) two mapping functions $\phi_1$ (a convolutional layer) and $\psi_1$ (a fully connected layer) are first learned; 2) the kernel similarity is calculated using an inner product layer; and 3) the original features are reweighted using the kernel similarity. In the first part, $\phi_1$ is a convolutional layer, and $\psi_1$ is a single-layer neural network with transformation matrix $W_1$ that maps the textual feature to the same dimension as the visual features, which is given by

$\psi_1(F_T) = W_1 F_T. \qquad (3)$

Since $\phi_1(F_I)$ is a tensor while $\psi_1(F_T)$ is a vector, we first reshape the feature maps by flattening the height and width of the original features: $\phi_1(F_I) \rightarrow [v_1, \dots, v_{HW}]$, where $v_j \in \mathbb{R}^{C}$ and $j = 1, \dots, HW$. The inner products between these features and the text feature can then be calculated. The output of $k_1$ can be defined as

$k_1(F_I, F_T)_j = \langle v_j, \psi_1(F_T) \rangle, \quad j = 1, \dots, HW, \qquad (4)$

where $v_j$ is the $j$-th vector, corresponding to the $j$-th spatial location.
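A minimal PyTorch sketch of this image-side kernel reweighting, under our reading of Eqs. (3)-(4): the mapping $\phi_1$ is assumed here to be a 1x1 convolution and $\psi_1$ a single fully connected layer projecting the text feature to $C$ dimensions; the layer sizes are illustrative, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ImageKernelBranch(nn.Module):
    """Reweights image feature maps by their kernel similarity to the text feature."""
    def __init__(self, channels: int, text_dim: int):
        super().__init__()
        self.phi = nn.Conv2d(channels, channels, kernel_size=1)  # phi_1 (assumed 1x1 conv)
        self.psi = nn.Linear(text_dim, channels)                 # psi_1 (Eq. 3)

    def forward(self, f_img: torch.Tensor, f_txt: torch.Tensor) -> torch.Tensor:
        # f_img: (B, C, H, W), f_txt: (B, d)
        B, C, H, W = f_img.shape
        v = self.phi(f_img).flatten(2)            # (B, C, HW): flattened spatial vectors v_j
        t = self.psi(f_txt).unsqueeze(1)          # (B, 1, C): projected text feature
        sim = torch.bmm(t, v).view(B, 1, H, W)    # inner products k_1 per location (Eq. 4)
        return sim * f_img                        # reweight the original feature (Eq. 1)

# Usage with dummy tensors:
branch = ImageKernelBranch(channels=256, text_dim=512)
out = branch(torch.randn(2, 256, 14, 14), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 256, 14, 14])
```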

A similar approach is used for the text modality. First, a global average pooling (GAP) layer reduces $F_I$ from dimensions $H \times W \times C$ to $C$ dimensions by taking the average of each feature map. Let $g_I$ denote the output vector of the GAP layer. Since $g_I$ is a vector, $\phi_2$ and $\psi_2$ are two fully connected layers:

$\phi_2(g_I) = W_2 g_I, \qquad \psi_2(F_T) = W_3 F_T, \qquad (5)$

where $\phi_2$ is parameterized by the transformation matrix $W_2$ and $\psi_2$ is parameterized by the transformation matrix $W_3$. Finally, the output for the text modality can be formulated as

$\tilde{F}_T = k_2(F_I, F_T)\, F_T = \langle \phi_2(g_I), \psi_2(F_T) \rangle\, F_T. \qquad (6)$
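A companion PyTorch sketch of the text-side branch under the same reading: $k_2$ is taken to be a scalar inner product between the two fully connected projections of Eqs. (5)-(6); the projection size is arbitrary and assumed for illustration.

```python
import torch
import torch.nn as nn

class TextKernelBranch(nn.Module):
    """Reweights the text feature by its kernel similarity to the pooled image feature."""
    def __init__(self, channels: int, text_dim: int, proj_dim: int = 256):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)           # GAP over H x W
        self.phi = nn.Linear(channels, proj_dim)     # phi_2 with matrix W_2
        self.psi = nn.Linear(text_dim, proj_dim)     # psi_2 with matrix W_3

    def forward(self, f_img: torch.Tensor, f_txt: torch.Tensor) -> torch.Tensor:
        # f_img: (B, C, H, W), f_txt: (B, d)
        g = self.gap(f_img).flatten(1)                       # (B, C): pooled image vector g_I
        sim = (self.phi(g) * self.psi(f_txt)).sum(dim=1)     # scalar kernel k_2 per sample
        return sim.unsqueeze(1) * f_txt                      # reweighted text feature (Eq. 6)

# Usage with dummy tensors:
branch = TextKernelBranch(channels=256, text_dim=512)
out = branch(torch.randn(2, 256, 14, 14), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 512])
```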
Fig. 4: Illustration of the attention network. "GAP" represents the global average pooling layer, and "fc" denotes the fully connected layer. "$\oplus$" denotes concatenation of two vectors and "$\odot$" denotes element-wise multiplication.

IV-B Attention Network

Inspired by how humans process information, we propose an attention network that adaptively focuses on salient parts to learn more powerful intermediate features. To compute the attention efficiently, we aggregate information from all intermediate features. That is, we exploit both features, rather than using each independently, to locate the informative regions. The detailed operations are described below.

Fig. 5: The proposed modal-aware feature learning for multimodal hashing. Two modal-aware operations are added to the feature learning module.

Figure 4 shows the specific structure of the attention network. First, the visual feature maps $\tilde{F}_I$ are forwarded to a global average pooling layer to produce a visual vector $\tilde{g}_I$. Then, we concatenate the visual and textual features as $z = [\tilde{g}_I; \tilde{F}_T]$, which contains information from both modalities. The feature $z$ goes through two different networks to separately produce attention maps for the image and textual features. Both networks are composed of a single-layer neural network followed by a softmax function to obtain the attention distributions:

$a_I = \mathrm{softmax}(W_4 z + c_4), \qquad a_T = \mathrm{softmax}(W_5 z + c_5), \qquad (7)$

where $W_4$ and $W_5$ are transformation matrices and $c_4$ and $c_5$ are model biases. Here, $a_I$ is also called the channel attention map [39], which exploits the inter-channel relationships of the features. The main difference is that our method uses both the visual and textual features from different modalities to find the salient channels. Then, element-wise multiplication is applied to obtain the final outputs $\hat{F}_I$ and $\hat{F}_T$, which are defined as

$\hat{F}_I^{(c)} = a_I^{(c)} \cdot \tilde{F}_I^{(c)}, \qquad \hat{F}_T^{(j)} = a_T^{(j)} \cdot \tilde{F}_T^{(j)}, \qquad (8)$

where $\tilde{F}_I^{(c)}$ is the $c$-th channel of $\tilde{F}_I$ with size $H \times W$ and $\tilde{F}_T^{(j)}$ is the $j$-th value in the vector $\tilde{F}_T$.
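The sketch below implements this attention step in PyTorch under the reconstruction above: channel attention for the image branch and per-dimension attention for the text branch, both driven by the concatenated multimodal vector. Layer sizes follow the feature dimensions; they are not a verified configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalAwareAttention(nn.Module):
    """Attention maps from the concatenated multimodal vector (Eqs. 7-8), sketch only."""
    def __init__(self, channels: int, text_dim: int):
        super().__init__()
        joint = channels + text_dim
        self.fc_img = nn.Linear(joint, channels)   # W_4, c_4
        self.fc_txt = nn.Linear(joint, text_dim)   # W_5, c_5

    def forward(self, f_img: torch.Tensor, f_txt: torch.Tensor):
        # f_img: (B, C, H, W) from the kernel network, f_txt: (B, d)
        g = f_img.mean(dim=(2, 3))                           # GAP -> (B, C)
        z = torch.cat([g, f_txt], dim=1)                     # concatenated vector (B, C + d)
        a_img = F.softmax(self.fc_img(z), dim=1)             # channel attention map a_I
        a_txt = F.softmax(self.fc_txt(z), dim=1)             # text attention map a_T
        out_img = a_img.unsqueeze(-1).unsqueeze(-1) * f_img  # scale each channel
        out_txt = a_txt * f_txt                              # scale each text dimension
        return out_img, out_txt

# Usage with dummy tensors:
att = ModalAwareAttention(channels=256, text_dim=512)
oi, ot = att(torch.randn(2, 256, 14, 14), torch.randn(2, 512))
print(oi.shape, ot.shape)  # torch.Size([2, 256, 14, 14]) torch.Size([2, 512])
```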

V Implementation Details

The proposed modal-aware feature learning scheme for multimodal hashing is shown in Figure 5. We apply the modal-aware operations in the earlier layers. Note that the text branch has only two fully-connected layers; hence, a modal-aware operation is applied after each of them.

V-A Network Architectures

For the image modality, ResNet-18 [14] is used as the basic architecture to learn powerful image features. ResNet is a residual learning framework that has shown great success in many machine learning tasks. In ResNet-18, the last global average pooling layer and the 1000-way fully connected layer are removed. The feature maps from Conv4_2 and Conv5_2 are used as the image intermediate features for the two modal-aware operations, respectively. For the text modality, the well-known bag-of-words (BoW) vectors are used as inputs. The vectors then go through a feed-forward neural network to learn the semantic text features.

After the modal-aware operations, we have two features: $\hat{F}_I$ and $\hat{F}_T$. Since $\hat{F}_I$ is a tensor, a global average pooling layer is used to map $\hat{F}_I$ into a vector. Then, a simple approach that concatenates these two features is applied to obtain a joint representation $u$. The joint representation is forwarded to a $q$-way fully connected layer to generate the $q$-bit binary codes.
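A minimal sketch of this fusion-and-hashing head: the pooled image feature and the text feature are concatenated and mapped to $q$ outputs. The tanh relaxation of the sign function used here during training is a common choice we assume; the paper does not specify the exact relaxation.

```python
import torch
import torch.nn as nn

class HashingHead(nn.Module):
    """Concatenation fusion followed by a q-way fully connected hashing layer (sketch)."""
    def __init__(self, channels: int, text_dim: int, q: int = 32):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.hash_fc = nn.Linear(channels + text_dim, q)   # q-way fully connected layer

    def forward(self, f_img: torch.Tensor, f_txt: torch.Tensor) -> torch.Tensor:
        g = self.gap(f_img).flatten(1)            # (B, C) pooled image feature
        u = torch.cat([g, f_txt], dim=1)          # joint representation by concatenation
        return torch.tanh(self.hash_fc(u))        # relaxed binary code in (-1, 1)

head = HashingHead(channels=512, text_dim=512, q=32)
codes = head(torch.randn(2, 512, 7, 7), torch.randn(2, 512))
binary = (codes > 0).int()     # discrete {0, 1} codes at retrieval time
print(codes.shape, binary.shape)
```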

V-B Training Objective

We use the triplet ranking loss [22] to train the deep network. We note that other losses, e.g., the contrastive loss [13], can also be used in our framework; the loss function is not the focus of this paper. Specifically, given a triplet of instances $(o, o^+, o^-)$, in which the instance $o$ is more similar to $o^+$ than to $o^-$, the three instances go through the deep multimodal network, and the outputs of the network are $b$, $b^+$ and $b^-$, which are respectively associated with the instances. The similarity-preserving loss function is defined by

$\ell_{triplet}(b, b^+, b^-) = \max\big(0,\; m + \|b - b^+\|_2^2 - \|b - b^-\|_2^2\big), \qquad (9)$

where $(b, b^+, b^-)$ is the triplet form and $m$ is the margin.
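A short sketch of this objective, assuming (as in the reconstruction above) that the Hamming distance is relaxed to the squared Euclidean distance during training; the margin value here is an arbitrary assumption.

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(b: torch.Tensor, b_pos: torch.Tensor,
                         b_neg: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Triplet ranking loss on relaxed hash codes, in the spirit of Eq. (9)."""
    d_pos = (b - b_pos).pow(2).sum(dim=1)   # distance to the similar instance
    d_neg = (b - b_neg).pow(2).sum(dim=1)   # distance to the dissimilar instance
    return F.relu(margin + d_pos - d_neg).mean()

# Usage with random relaxed codes of length q = 32:
b, b_pos, b_neg = torch.randn(4, 32), torch.randn(4, 32), torch.randn(4, 32)
print(triplet_ranking_loss(b, b_pos, b_neg).item())
```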

Method NUS-WIDE MIR-Flickr 25k IAPR TC-12
16bits 32bits 48bits 64bits 16bits 32bits 48bits 64bits 16bits 32bits 48bits 64bits
DPSH 0.7057 0.7216 0.7252 0.7298 0.8262 0.8316 0.8304 0.8301 0.5386 0.5448 0.5383 0.5355
DSH 0.5712 0.5952 0.5998 0.6039 0.7234 0.7312 0.7390 0.7403 0.4746 0.4851 0.4892 0.4926
HashNet 0.7115 0.7252 0.7286 0.7317 0.8297 0.8333 0.8331 0.8328 0.5391 0.5451 0.5379 0.5386
DTH 0.7096 0.7193 0.7267 0.7362 0.8251 0.8332 0.8418 0.8406 0.5662 0.5854 0.5920 0.6032
TextHash 0.6027 0.6037 0.6088 0.6104 0.7154 0.7142 0.7121 0.7065 0.5238 0.5487 0.5542 0.5623
Concat 0.7274 0.7391 0.7432 0.7495 0.8352 0.8453 0.8554 0.8508 0.5762 0.5993 0.6213 0.6206
GMU 0.7250 0.7416 0.7458 0.7569 0.8398 0.8465 0.8505 0.8552 0.5694 0.6006 0.6207 0.6241
MCB 0.7262 0.7421 0.7481 0.7510 0.8379 0.8444 0.8524 0.8528 0.5721 0.5975 0.6149 0.6151
Ours 0.7395 0.7563 0.7627 0.7639 0.8564 0.8658 0.8697 0.8723 0.5925 0.6194 0.6330 0.6384
TABLE I: Comparison with state-of-the-art methods on three datasets.

VI Experiments

In this section, we conduct extensive evaluations of the proposed method and compare it with several state-of-the-art algorithms.

VI-A Datasets

  • NUS-WIDE [6]: This dataset consists of 269,648 images and the associated tags from Flickr. Each image is associated with several textual tags. The text for each point is represented as a 1,000-dimensional bag-of-words vector.

  • MIR-Flickr 25k [16]: This dataset contains 25,000 images collected from Flickr. Each image has associated textual tags, which are represented as a 1,386-dimensional bag-of-words vector.

  • IAPR TC-12 [9]: This dataset consists of 20,000 still natural images. Each image is associated with a text caption, which is represented as a 2,912-dimensional bag-of-words vector.

For all of the experiments, we follow the experimental protocols of DCMH [17] to construct the query sets, retrieval databases and training sets. The NUS-WIDE dataset contains 81 ground-truth concepts. To prune the data without sufficient tag information, a subset of 195,834 image-text pairs that belong to the 21 most frequent concepts is selected, as suggested by [17]. A randomly sampled set of 2,100 image-text pairs (100 pairs per concept) is used as the query set, and the rest of the image-text pairs constitute the retrieval database. From the retrieval database, 10,000 image-text pairs are randomly selected to train the hash functions. For the MIR-Flickr 25k and IAPR TC-12 databases, 2,000 randomly sampled image-text pairs are used as the query set, and the rest of the pairs are used as the retrieval database. We randomly select 10,000 pairs from the retrieval database to form the training set.

VI-B Experimental Settings

We implement our code based on the open-source deep learning platform PyTorch (https://pytorch.org/). For the image modality, ResNet-18 is adopted as the basic architecture, with its weights initialized from a model pretrained on the ImageNet dataset. For the text modality, the weights of all fully connected layers are randomly initialized following a Gaussian distribution with a standard deviation of 0.01 and a mean of 0. We train the networks with the stochastic gradient solver ADAM with weight decay. The batch size is 100, and the base learning rate is 0.0001, which is reduced to one-tenth of its current value every 20 epochs. For a fair comparison, all of the deep learning methods are based on the same network architectures and the same experimental settings.
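The optimization setup above can be expressed with standard PyTorch utilities as follows. This is a sketch only: the weight-decay value is not given in the text, so a placeholder is used, and `model` stands in for the full multimodal network.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

model = torch.nn.Linear(1024, 32)          # placeholder for the multimodal hashing network
optimizer = Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)  # weight decay: assumed value
scheduler = StepLR(optimizer, step_size=20, gamma=0.1)            # lr x 0.1 every 20 epochs

for epoch in range(60):
    # for batch in train_loader:           # batch size 100 in the paper
    #     loss = ...                       # triplet ranking loss from Eq. (9)
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()
```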

Fig. 6: The comparison results of precision-recall curves with 32 bits on (a) NUS-WIDE, (b) MIR-Flickr 25k, and (c) IAPR TC-12.
Fig. 7: The comparison results of precision curves w.r.t. different numbers of top returned samples on (a) NUS-WIDE, (b) MIR-Flickr 25k, and (c) IAPR TC-12.

Evaluations: Following common practice, the mean average precision (MAP), precision-recall curves, and precision w.r.t. different numbers of top returned samples are used as the evaluation metrics. MAP measures the accuracy of the whole binary codes based on Hamming-distance ranking. The precision-recall curves measure the hash lookup protocol, and the precision curves consider only the top returned samples.
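For reference, a generic NumPy sketch of the MAP metric over Hamming ranking is given below; it is not the authors' evaluation script, and the relevance matrix and code sizes are toy placeholders.

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, relevance):
    """MAP over Hamming ranking. query_codes: (Q, q), db_codes: (N, q) in {0, 1};
    relevance[i, j] = 1 if database item j is relevant to query i."""
    aps = []
    for i in range(query_codes.shape[0]):
        dist = (db_codes != query_codes[i]).sum(axis=1)   # Hamming distances
        order = np.argsort(dist)                          # rank database by distance
        rel = relevance[i, order]
        if rel.sum() == 0:
            continue                                      # skip queries with no relevant items
        precision_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
        aps.append((precision_at_k * rel).sum() / rel.sum())
    return float(np.mean(aps))

# Toy usage: 2 queries, 5 database items, 8-bit codes.
rng = np.random.default_rng(0)
q, db = rng.integers(0, 2, (2, 8)), rng.integers(0, 2, (5, 8))
rel = rng.integers(0, 2, (2, 5))
print(mean_average_precision(q, db, rel))
```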

Method NUS-WIDE MIR-Flickr 25k IAPR TC-12
16bits 32bits 48bits 64bits 16bits 32bits 48bits 64bits 16bits 32bits 48bits 64bits
w/o KN 0.7295 0.7398 0.7467 0.7508 0.8349 0.8481 0.8564 0.8557 0.5796 0.6037 0.6228 0.6261
w/o AN 0.7339 0.7420 0.7519 0.7583 0.8430 0.8555 0.8625 0.8644 0.5839 0.6073 0.6242 0.6326
Ours 0.7395 0.7563 0.7627 0.7639 0.8564 0.8658 0.8697 0.8723 0.5925 0.6194 0.6330 0.6384
TABLE II: Ablation study on each component on three datasets.
Fig. 8: The comparison results of precision curves for the ablation study on (a) NUS-WIDE, (b) MIR-Flickr 25k, and (c) IAPR TC-12.

VI-C Comparison with State-of-the-art Methods

In the first set of experiments, we compare the performance of the proposed method with state-of-the-art baselines. We evaluate two different groups of baselines.

The first group of baselines consists of unimodal approaches, in which only one modality is used to train the hash functions. For the image modality, several state-of-the-art image hashing algorithms are selected: deep pairwise-supervised hashing (DPSH) [24], deep supervised hashing (DSH) [27], HashNet [5] and deep triplet hashing (DTH) [22]. DPSH and DSH are deep pairwise approaches, and DTH is a triplet-based approach. HashNet aims to minimize the quantization errors of the hash codes. For a fair comparison, the deep architectures of these four methods are the same as ours. For the text modality, we use the same text network as ours, referred to as TextHash, which uses only the text representations to learn the binary codes.

The second group of baselines consists of different fusion strategies used to combine multiple modalities. We note that only the fusion module in Figure 2 differs among these baselines; the other modules are the same.

  • Concat: We concatenate the intermediate features of the image and text modalities to train the hashing architecture.

  • GMU: A gated multimodal unit (GMU) [2] is an internal unit in a neural network for data fusion. The GMU uses multiplicative gates to determine how the modalities influence the activation of the unit.

  • MCB: Multimodal compact bilinear pooling (MCB) [11] uses bilinear pooling [25] to combine visual and text representations.

Table I shows the comparison of the MAP values obtained on the three multimodal datasets. Figure 6 and Figure 7 show the precision-recall and precision curves with 32 bits. Our proposed method yields the highest accuracy and outperforms all the baselines at most levels. Two observations can be made from the results.

1) Compared with the unimodal approaches, our method performs significantly better than all baselines. For instance, our method yields higher accuracy than TextHash, which uses only the text modality. For the image hashing methods, our method obtains a MAP of 0.7395 with 16 bits, compared with 0.7115 for HashNet on NUS-WIDE. On MIR-Flickr 25k, the MAP of DTH is 0.8332, while the proposed method achieves 0.8658 with 32 bits. The proposed method shows a relative increase of 4.6%–6.9% on IAPR TC-12 compared to the DTH algorithm. Note that DTH and our method use the same triplet ranking loss function and DTH achieves excellent performance; even so, our method performs better than DTH. These results indicate that multimodal approaches can improve the performance.

2) Compared with other deep fusion strategies, our method also yields the best performance on all databases. First, compared to the Concat approach, the only difference is the use of the modal-aware operations; hence, this comparison shows whether the modal-aware features contribute to the accuracy. The results indicate that our modal-aware features achieve better performance. For example, the MAP of our proposed method is 0.7395 when the bit length is 16, compared to 0.7274 for Concat on NUS-WIDE. Thus, it is desirable to learn powerful features for multimodal retrieval. Compared to the GMU and MCB baselines, which achieve excellent performance, our proposed method also yields better performance. The main reason is that our method incorporates information from other modalities to learn the intermediate features, while the intermediate features of GMU and MCB are learned via individual neural layers.

VI-D Ablation Study

In the second set of experiments, an ablation study was performed to elucidate the impact of each part of our method on the final performance.

The first baseline explores the effect of the kernel network. In this baseline, the attention network is retained and the kernel network is not used; that is, the features are directly forwarded to the attention network, and the only difference from our full method is the absence of the kernel network. This baseline is referred to as w/o KN.

The second baseline explores the effect of the attention network. In this baseline, the kernel network is first applied to obtain the two intermediate features. Then, we concatenate the two features to obtain the joint representation. We note that the only difference between this baseline and our method is the use or lack of use of the attention network. We use w/o AN to denote the baseline that does not use the attention network.

The comparison results are shown in Table II and Figure 8. The results show that our proposed method achieves better performance than the two baselines. For instance, on NUS-WIDE our method obtains a MAP of 0.7627 with 48 bits, compared to 0.7519 for w/o AN and 0.7467 for w/o KN. The results indicate that it is desirable to learn the intermediate features with both the kernel network and the attention network.

In this paper, the text is represented as a bag-of-words vector. Other text representations, e.g., Sent2Vec or BERT [8], can also be used in our framework. For example, in the IAPR TC-12 database, each image is associated with a text caption; thus Sent2Vec, computed via the pre-trained model (https://github.com/epfml/sent2vec), can be used as the text representation. Table III shows the comparison results with respect to MAP.

Method IAPR TC-12
16bits 32bits 48bits 64bits
BoW 0.5925 0.6194 0.6330 0.6384
Sent2Vec 0.5961 0.6232 0.6336 0.6357
TABLE III: The comparison results of different text representations.

VII Conclusion

In this paper, we proposed a modal-aware operation for learning good feature representations. The key to its success comes from designing a generic building block to capture the underlying correlation structures in heterogeneous multimodal data prior to multimodal fusion. First, we proposed a kernel network to learn the non-linear relationships, in which the kernel similarities between two modalities are learned to reweight the original features. Then, we proposed an attention network, which aims to select the informative parts of the intermediate features. Experiments were conducted on three benchmark datasets, and the results demonstrate the appealing performance of the proposed modal-aware operations.

References

  • [1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) VQA: visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2425–2433.
  • [2] J. Arevalo, T. Solorio, M. Montes-y-Gómez, and F. A. González (2017) Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992.
  • [3] T. Baltrušaitis, C. Ahuja, and L. Morency (2019) Multimodal machine learning: a survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2), pp. 423–443.
  • [4] Y. Cao, S. Steffey, J. He, D. Xiao, C. Tao, P. Chen, and H. Müller (2014) Medical image retrieval: a multimodal approach. Cancer Informatics 13, pp. CIN–S14053.
  • [5] Z. Cao, M. Long, J. Wang, and P. S. Yu (2017) HashNet: deep learning to hash by continuation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [6] T. S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng (2009) NUS-WIDE: a real-world web image database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, pp. 48.
  • [7] S. K. D'mello and J. Kory (2015) A review and meta-analysis of multimodal affect detection systems. ACM Computing Surveys 47 (3), pp. 43.
  • [8] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [9] H. J. Escalante, C. A. Hernández, J. A. Gonzalez, A. López-López, M. Montes, E. F. Morales, L. E. Sucar, L. Villaseñor, and M. Grubinger (2010) The segmented and annotated IAPR TC-12 benchmark. Computer Vision and Image Understanding 114 (4), pp. 419–428.
  • [10] S. Fidler, A. Sharma, and R. Urtasun (2013) A sentence is worth a thousand pixels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1995–2002.
  • [11] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 457–468.
  • [12] M. Gönen and E. Alpaydın (2011) Multiple kernel learning algorithms. Journal of Machine Learning Research 12 (Jul), pp. 2211–2268.
  • [13] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 1735–1742.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • [15] R. Hu, M. Rohrbach, and T. Darrell (2016) Segmentation from natural language expressions. In Proceedings of the European Conference on Computer Vision, pp. 108–124.
  • [16] M. J. Huiskes and M. S. Lew (2008) The MIR Flickr retrieval evaluation. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, pp. 39–43.
  • [17] Q. Y. Jiang and W. Li (2016) Deep cross-modal hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [18] L. Jin, J. Tang, Z. Li, G. Qi, and F. Xiao (2019) Deep semantic multimodal hashing network for scalable multimedia retrieval. arXiv preprint arXiv:1901.02662.
  • [19] D. Kiela and L. Bottou (2014) Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In EMNLP, pp. 36–45.
  • [20] J. Kim, J. Koh, Y. Kim, J. Choi, Y. Hwang, and J. W. Choi (2018) Robust deep multi-modal learning based on gated information fusion network. arXiv preprint arXiv:1807.06233.
  • [21] S. Kim, Y. Kang, and S. Choi (2012) Sequential spectral learning to hash with multiple representations. In Proceedings of the European Conference on Computer Vision, pp. 538–551.
  • [22] H. Lai, Y. Pan, Y. Liu, and S. Yan (2015) Simultaneous feature learning and hash coding with deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3270–3278.
  • [23] C. Li, C. Deng, N. Li, W. Liu, X. Gao, and D. Tao (2018) Self-supervised adversarial hashing networks for cross-modal retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4242–4251.
  • [24] W. Li (2016) Feature learning based deep supervised hashing with pairwise labels. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 3485–3492.
  • [25] T. Lin, A. RoyChowdhury, and S. Maji (2015) Bilinear CNN models for fine-grained visual recognition. In ICCV, pp. 1449–1457.
  • [26] F. Liu, L. Zhou, C. Shen, and J. Yin (2014) Multiple kernel learning in the primal for multimodal Alzheimer's disease classification. IEEE Journal of Biomedical and Health Informatics 18 (3), pp. 984–990.
  • [27] H. Liu, R. Wang, S. Shan, and X. Chen (2016) Deep supervised hashing for fast image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2064–2072.
  • [28] K. Liu, Y. Li, N. Xu, and P. Natarajan (2018) Learn to combine modalities in multimodal deep learning. arXiv preprint arXiv:1805.11730.
  • [29] Y. Mroueh, E. Marcheret, and V. Goel (2015) Deep multimodal learning for audio-visual speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2130–2134.
  • [30] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng (2011) Multimodal deep learning. In Proceedings of The International Conference on Machine Learning, pp. 689–696.
  • [31] W. Ouyang, X. Chu, and X. Wang (2014) Multi-source deep learning for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2329–2336.
  • [32] S. S. Rajagopalan, L. Morency, T. Baltrusaitis, and R. Goecke (2016) Extending long short-term memory for multi-view structured learning. In Proceedings of the European Conference on Computer Vision, pp. 338–353.
  • [33] X. Shen, F. Shen, Q. Sun, and Y. Yuan (2015) Multi-view latent hashing for efficient multimedia search. In ACM MM, pp. 831–834.
  • [34] N. Srivastava and R. R. Salakhutdinov (2012) Multimodal learning with deep Boltzmann machines. In Proceedings of the Neural Information Processing Systems, pp. 2222–2230.
  • [35] L. Sun, S. Ji, and J. Ye (2008) A least squares formulation for canonical correlation analysis. In Proceedings of The International Conference on Machine Learning, pp. 1024–1031.
  • [36] D. Wang, P. Cui, M. Ou, and W. Zhu (2015) Deep multimodal hashing with orthogonal regularization. In Proceedings of the International Joint Conference on Artificial Intelligence.
  • [37] K. Wang, Q. Yin, W. Wang, S. Wu, and L. Wang (2016) A comprehensive survey on cross-modal retrieval. arXiv preprint arXiv:1607.06215.
  • [38] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803.
  • [39] S. Woo, J. Park, J. Lee, and I. So Kweon (2018) CBAM: convolutional block attention module. In Proceedings of the European Conference on Computer Vision, pp. 3–19.
  • [40] L. Xie, J. Shen, J. Han, L. Zhu, and L. Shao (2017) Dynamic multi-view hashing for online image retrieval.
  • [41] E. Yang, C. Deng, W. Liu, X. Liu, D. Tao, and X. Gao (2017) Pairwise relationship guided deep hashing for cross-modal retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1618–1625.
  • [42] D. Zhang, F. Wang, and L. Si (2011) Composite hashing with multiple information sources. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 225–234.
  • [43] D. Zhang and W. Li (2014) Large-scale supervised multimodal hashing with semantic correlation maximization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 1, pp. 7.
  • [44] X. Zhang, H. Lai, and J. Feng (2018) Attention-aware deep adversarial hashing for cross-modal retrieval. In Proceedings of the European Conference on Computer Vision, pp. 591–606.