Textual Relationship Modeling for Cross-Modal Information Retrieval


Feature representation of different modalities is the main focus of current cross-modal information retrieval research. Existing models typically project texts and images into the same embedding space. In this paper, we explore the rich variety of textual relationships for text modeling. Specifically, texts are represented by a graph generated from multiple textual relationships, including semantic relations, statistical co-occurrence, and a predefined knowledge base. A joint neural model is proposed to learn feature representations individually in each modality. We use a Graph Convolutional Network (GCN) to capture relation-aware representations of texts and a Convolutional Neural Network (CNN) to learn image representations. Comprehensive experiments are conducted on two benchmark datasets. The results show that our model significantly outperforms state-of-the-art models, by 6.3% on the CMPlaces dataset and 3.4% on English Wikipedia, respectively.


Jing Yu  Chenghao Yang  Zengchang Qin  Zhuoqian Yang  Yue Hu  Yanbing Liu
Institute of Information Engineering, Chinese Academy of Sciences, China
Intelligent Computing and Machine Learning Lab, School of ASEE, Beihang University, China

Index Terms—  GCN, CNN, Knowledge graph, Cross-modal retrieval, word2vec

1 Introduction

Cross-modal information retrieval (CMIR), which enables queries from one modality to retrieve information in another, plays an increasingly important role in intelligent searching and recommendation systems. A typical practice of CMIR is to map features from different modalities into a common semantic space where the similarity between different entries can be directly measured. Thus, feature representation is the main focus of current CMIR research. Much effort has been devoted to using vector-space models to represent multi-modal data. For example, both unstructured text data and grid-structured image data (as well as video) are modeled as feature vectors projected into a semantic space.

However, such vector representations are seriously limited by their inability to capture the complex structures hidden in texts [1, 2]: there are many implicit and explicit textual relations that characterize syntactic rules in text modeling [3].

Fig. 1: (a) The original text and three kinds of textual relationships: (b) distributed semantic relationship in the embedding space, (c) word co-occurrence relationship and (d) general knowledge relationship defined by a knowledge graph.

Nevertheless, infusing a priori facts or relations (e.g., from a knowledge graph) into pre-trained models remains difficult. A recent work [2] models text as featured graphs with semantic relations. It achieves state-of-the-art performance on text-image retrieval by using the cosine distance between word embeddings to form k-nearest-neighbor relationships. However, the performance of this practice heavily relies on the generalization ability of the word embeddings [4] and may be biased under some circumstances. It also fails to incorporate general human knowledge and textual relations such as co-occurrence [5] and spatial relations [6] between words. To exemplify this point and suggest potential improvements, the modeling of a text specimen using different types of textual relationships is examined in Fig. 1. In the KNN graph (Fig. 1-b), Spielberg is located relatively far from Hollywood compared with how close director is to film, whereas in the commonsense knowledge graph (Fig. 1-d), these two words are closely related, as they should be. Fig. 1-c shows the less frequent subject-predicate relation pattern (e.g., Spielberg and E.T.), which is absent in the KNN-based graph. Consequently, a more sophisticated model should correlate Spielberg with all of the following words: {director, film, E.T., Hollywood, producer, sci-fi, screenwriter, U.S.}. This analysis indicates that graph construction can be improved by fusing different types of textual relations, which is the basic motivation of this work.

The mainstream solution for CMIR is to project data from different modalities into a common semantic space where their similarity can be measured directly. Several statistical methods study the correlation between mid-level features to maximize pairwise similarity, e.g., Canonical Correlation Analysis (CCA) [7] and the topic correlation model (TCM) [8, 9]. However, these methods ignore high-level semantic priority and are hard to extend to large-scale data [10]. With the development of deep learning, neural models can be trained on big data; deep learning based retrieval models [2, 11, 12, 13] have therefore drawn much attention for their superior performance. Instead of using a multilayer perceptron (MLP) for feature mapping, [12] proposes a two-stage CNN-LSTM network to generate and refine cross-modal features progressively. In [13], the authors leverage an attention mechanism to focus on essential image regions and words for correlation learning. Early works attempt to learn shallow statistical relationships, such as co-occurrence [5] or location [6]. Later, semantic relationships based on syntactic analysis [3] or semantic rules between conceptual terms were explored. Besides, semantic relationships derived from knowledge graphs (e.g., Wikidata [14], DBpedia [15]) have attracted increasing attention: words in a document are linked to real-world entities based on knowledge graphs, and relationships are inferred along the connected paths in the graphs [16].

Our work belongs to the deep learning methods. Similar to recent work [2], we utilize a GCN-CNN architecture to learn textual and visual features for similarity matching. The key idea is to explore the effects of multi-view relationships and propose a graph-based integration model that combines complementary information from different relationships. Finally, the similarity between the learned text features and the pre-trained image features is computed via distance metric learning.

Fig. 2: The semantic illustration of our proposed framework based on GCN and CNN.

2 Methodology

In this paper, we focus on text-image retrieval problems. The schematic of our approach is shown in Fig. 2. It consists of three parts: (1) Text modeling: all nouns extracted from the training corpus form a dictionary, and each text is represented by a featured graph. The graph structure is identical for all texts, while the graph features are content-specific to each text. For the graph structure, vertices correspond to the words in the dictionary, while edges are the integration of multi-level relationships based on semantic relations, statistical co-occurrence, and a predefined knowledge base. For graph features, we adopt a Bi-LSTM [17] to extract the word embeddings in a text as the word representation. We utilize two GCN layers to refine the graph-based textual representation progressively. More details about graph structures and features are given in Section 2.1 and Section 2.2, respectively. (2) Image modeling: we use a pre-trained Convolutional Neural Network (CNN), i.e., VGGNet [18], for visual feature learning. The last fully connected layer maps the visual features, along with the textual features, into the same common semantic space. (3) Distance metric learning [2]: the training objective is a pairwise similarity loss function with the same settings as [19]. It maximizes the similarity of positive text-image samples and minimizes the similarity of negative samples.
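To make the GCN refinement step concrete, the following is a minimal pure-Python sketch of one Kipf-style graph-convolution layer with symmetric normalization; the exact normalization and weights used in [2] may differ, and the toy adjacency, feature, and weight matrices are purely illustrative:

```python
def matmul(A, B):
    """Plain-Python matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def gcn_layer(A, H, W):
    """One GCN propagation step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(A)
    # add self-loops
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    d = [sum(row) for row in A_hat]
    # symmetric normalization by node degrees
    norm = [[A_hat[i][j] / (d[i] * d[j]) ** 0.5 for j in range(n)] for i in range(n)]
    Z = matmul(matmul(norm, H), W)
    return [[max(x, 0.0) for x in row] for row in Z]   # ReLU

# toy graph: 3 dictionary words, 2-dim input features, identity weights
A = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
H1 = gcn_layer(A, H, W)   # refined features for each word, shape 3 x 2
```

Stacking two such layers, as in our model, lets each word's feature aggregate information from its two-hop neighborhood in the relation graph.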

2.1 Fine-grained Textual Relationship

Distributed Semantic Relationship (SR) Following the distributional hypothesis [20], words appearing in similar contexts may share a semantic relationship, which is critical for relation modeling. To model such semantic relationships, we follow the previous work [2] and build a semantic graph G_SR. Each edge e_ij is defined as follows:

e_ij exists if w_j ∈ N_k(w_i),

where N_k(w_i) is the set of k-nearest neighbors of w_i, computed by the cosine similarity between words using word2vec embeddings, and k is the number of neighbors, which is set to 8 in our experimental studies.
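The k-nearest-neighbor construction above can be sketched in a few lines of pure Python; the toy embeddings below are hypothetical stand-ins for word2vec vectors (the real model uses k = 8):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn_edges(emb, k):
    """Connect each word to its k nearest neighbors by cosine similarity."""
    edges = set()
    for wi, vi in emb.items():
        sims = sorted(((cosine(vi, vj), wj) for wj, vj in emb.items() if wj != wi),
                      reverse=True)
        for _, wj in sims[:k]:
            edges.add((wi, wj))
    return edges

# toy 2-dim embeddings (hypothetical)
emb = {"film": [1.0, 0.1], "director": [0.9, 0.2], "sun": [0.0, 1.0]}
E_SR = knn_edges(emb, k=1)
```

As the introduction notes, edges built this way inherit any bias in the embedding space, which motivates the complementary relationships below.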
Word Co-occurrence Relationship (CR) Co-occurrence statistics have been widely used in many tasks such as keyword extraction [21] and web search [22]. They can serve as effective backup information to capture infrequent but important relations. Each edge e_ij in the graph G_CR indicates that the words w_i and w_j co-occur at least m times. We define m as a threshold to rule out noise, which aims to achieve better generalization and improve computational efficiency.
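A minimal sketch of the thresholded co-occurrence graph follows; note that the co-occurrence window (here, a sentence) and the toy corpus are assumptions for illustration, as the concrete window used in our experiments is an implementation detail:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(sentences, m):
    """Link word pairs that co-occur in the same window at least m times."""
    counts = Counter()
    for sent in sentences:
        # count each unordered pair once per window
        for wi, wj in combinations(sorted(set(sent)), 2):
            counts[(wi, wj)] += 1
    return {pair for pair, c in counts.items() if c >= m}

# toy tokenized corpus (hypothetical)
sents = [["spielberg", "film"], ["spielberg", "film"], ["spielberg", "et"]]
E_CR = cooccurrence_edges(sents, m=2)
```

Raising the threshold m prunes noisy one-off pairs, which is exactly the generalization/efficiency trade-off described above.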
General Knowledge Relationship (KR) General knowledge can effectively support decision-making and inference by providing high-level expert knowledge as complementary information to the training corpus. However, it is not fully covered by task-specific text. In this paper, we consider knowledge bases consisting of a large number of triples (Subject, Relation, Predicate), which represent various relationships in human commonsense knowledge. To incorporate such real-world relationships, we construct the graph G_KR, where each edge e_ij satisfies the following condition:

e_ij exists if a triple (w_i, rel, w_j) or (w_j, rel, w_i) exists in KG,

where KG refers to the knowledge graph. In this paper, we adopt Wikidata [14] because it is free and easy to use. Notice that, for simplification, we ignore the types of relations in KG and leave them for future work. The knowledge graph alone may be too general to fit specific tasks, and it may fail to capture semantic relationships in the text. In practice, we treat it as additional prior knowledge and combine it with the task-specific features.
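The knowledge-graph edges can be sketched as a lookup over a triple set; the triples below are hypothetical stand-ins for Wikidata facts, and relation types are discarded exactly as described above:

```python
def knowledge_edges(words, triples):
    """Link dictionary word pairs connected by any (subject, relation, predicate)
    triple in the knowledge graph; relation types are ignored."""
    vocab = set(words)
    edges = set()
    for subj, _rel, pred in triples:
        if subj in vocab and pred in vocab:
            edges.add((subj, pred))
            edges.add((pred, subj))   # treat the resulting graph as undirected
    return edges

# hypothetical triples standing in for Wikidata facts
triples = [("spielberg", "occupation", "director"),
           ("spielberg", "work_location", "hollywood")]
E_KR = knowledge_edges(["spielberg", "director", "film"], triples)
```

Only pairs whose endpoints both appear in the task dictionary produce edges, which is how the general knowledge stays anchored to the task-specific vocabulary.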
Relationship Integration Different textual relationships capture information from different angles. Despite the effectiveness of SR [2], CR can be viewed as a backoff for SR, while KR endows SR with commonsense knowledge. It is conceivable that integrating these relationships will yield a richer representation. Here we utilize the union operation to obtain the multi-view relationship graph G, whose edge set E satisfies:

E = E_SR ∪ E_CR ∪ E_KR.

An in-depth study of the integration effects on CMIR is detailed in Section 3, including Semantic and Co-occurrence Relationship (SCR), Semantic and Knowledge Relationship (SKR), as well as Semantic, Co-occurrence and Knowledge Relationship (SCKR).
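The integration step is a plain set union over the edge sets, so each ablation variant (SCR, SKR, SCKR) is just a different choice of operands; a one-function sketch with toy edge sets:

```python
def integrate(*edge_sets):
    """Union of relationship edge sets, e.g. SCKR = SR ∪ CR ∪ KR."""
    merged = set()
    for es in edge_sets:
        merged |= es
    return merged

# toy edge sets (hypothetical)
E_SR = {("a", "b")}
E_CR = {("b", "c")}
E_KR = {("a", "c")}
E_SCKR = integrate(E_SR, E_CR, E_KR)
E_SCR = integrate(E_SR, E_CR)
```

Because the union only ever adds edges, every variant's graph contains the SR graph, so the ablations in Section 3 isolate the contribution of each added relationship.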

2.2 Graph Feature Extraction

Previous work [2] adopts bag-of-words features, i.e., word frequencies, as the feature of each word in a document. We argue that this kind of feature is not informative enough to capture the essential semantic information. Instead, we first pre-train a Bi-LSTM [17] to predict the category label of the text, then sum the concatenated Bi-LSTM outputs over every mention of a word in the document to obtain that word's representation. Such a representation is context-relevant and better incorporates the semantic information in the document. Dictionary words that do not appear in a given text are represented with zeros.
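The feature-assembly step (summing per-mention contextual vectors, zeros for absent words) can be sketched as follows; the mention vectors here are toy stand-ins for the concatenated Bi-LSTM outputs:

```python
def graph_features(dictionary, mention_vecs, dim):
    """Per-word graph feature: sum of that word's contextual vectors over all
    its mentions in the document; zeros for dictionary words absent from the
    text. mention_vecs maps word -> list of per-mention output vectors."""
    feats = {}
    for w in dictionary:
        f = [0.0] * dim
        for v in mention_vecs.get(w, []):
            f = [a + b for a, b in zip(f, v)]
        feats[w] = f
    return feats

# toy 2-dim mention vectors (hypothetical Bi-LSTM outputs)
feats = graph_features(["sun", "sky", "car"],
                       {"sun": [[1.0, 2.0], [0.5, 0.5]], "sky": [[0.0, 1.0]]},
                       dim=2)
```

Words with multiple mentions accumulate evidence from every context they appear in, while unseen dictionary words contribute zero vectors to the graph.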

Method             Text query   Image query   Avg.    Dataset
CCA [7]               18.7          21.6       20.2    Eng-Wiki
SCM [7]               23.4          27.6       25.5    Eng-Wiki
LCFS [23]             20.4          27.1       23.8    Eng-Wiki
LGCFL [24]            31.6          37.8       34.7    Eng-Wiki
GMLDA [25]            28.9          31.6       30.2    Eng-Wiki
GMMFA [25]            29.6          31.6       30.6    Eng-Wiki
AUSL [26]             33.2          39.7       36.4    Eng-Wiki
JFSSL [27]            41.0          46.7       43.9    Eng-Wiki
GIN [2]               76.7          45.3       61.0    Eng-Wiki
SR [ours]             83.5          41.4       62.4    Eng-Wiki
SCR [ours]            84.3          42.6       63.4    Eng-Wiki
SKR [ours]            83.9          42.0       62.9    Eng-Wiki
SCKR [ours]           84.9          44.0       64.4    Eng-Wiki
BL-Ind [28]            0.6           0.8        0.7    CMPlaces
BL-ShFinal [28]        3.3          12.7        8.0    CMPlaces
BL-ShAll [28]          0.6           0.8        0.7    CMPlaces
Tune(Free) [28]        5.2          18.1       11.7    CMPlaces
TuneStatReg [28]      15.1          22.1       18.6    CMPlaces
GIN [2]               19.3          16.1       17.7    CMPlaces
SR [ours]             18.6          15.8       17.2    CMPlaces
SCR [ours]            25.4          20.3       22.8    CMPlaces
SKR [ours]            24.8          20.5       22.6    CMPlaces
SCKR [ours]           28.5          21.3       24.9    CMPlaces
Table 1: MAP score comparison on two benchmark datasets.

3 Experimental Studies

In this section, we test our models on two benchmark datasets: Cross-Modal Places [28] (CMPlaces) and English Wikipedia [7] (Eng-Wiki). CMPlaces is one of the largest cross-modal datasets, providing weakly aligned data in five modalities divided into 205 categories. In our experiments, we use the natural images (about 1.5 million) and text descriptions (11,802) for evaluation. We randomly sample 64 images from each category and split them into training, validation, and test sets in the proportion 50:7:7. We also randomly split the text descriptions into training, validation, and test sets in the proportion 14:3:3. Eng-Wiki, the most widely used dataset in cross-modal information retrieval, contains 2,866 image-text pairs divided into ten classes, of which 2,173 pairs are for training and 693 pairs for testing. We follow the standard data division for this dataset. For evaluation, MAP@100 is used to measure query performance.
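For reference, MAP@100 can be computed as below; this sketch uses one common definition of average precision (mean of precision at each relevant hit within the cutoff), and the toy ranking is illustrative:

```python
def average_precision(relevant, ranked, cutoff=100):
    """AP@cutoff for one query: mean of precision at each relevant hit."""
    hits, precisions = 0, []
    for i, doc in enumerate(ranked[:cutoff], start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_ap(queries, cutoff=100):
    """MAP@cutoff over (relevant_set, ranked_list) pairs."""
    return sum(average_precision(r, rk, cutoff) for r, rk in queries) / len(queries)

# toy query: d1 and d3 are relevant (e.g., same category as the query)
ap = average_precision({"d1", "d3"}, ["d1", "d2", "d3", "d4"])
```

In our category-based setting, a retrieved item counts as relevant when it shares the query's category label.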

Implementation Details. We randomly selected 204,800 positive and 204,800 negative samples for training. We set the dropout ratio to 0.2 at the input of the last fully connected layer, the learning rate to 0.001 with Adam optimization, and the regularization weight to 0.005. The parameters of the loss function are set the same as in [2]. In the last semantic mapping layers of both the text path and the image path, the reduced dimension is set to 1,024 for both datasets.

3.1 Comparison to State-of-the-Art Methods

On the Eng-Wiki dataset, we compare our model with a series of SOTA models, listed in Table 1. Our SCKR model achieves the best average MAP score and is only slightly inferior to JFSSL on the image query, which confirms that our relation-aware model brings an overall improvement over existing CMIR models. In particular, the text query gains a remarkable 8.2% increase over the SOTA model GIN, which shows that our model yields better representation and generalization ability for text queries. On the large CMPlaces dataset, SCKR achieves a 6.3% improvement over the previous SOTA model TuneStatReg [28].

Fig. 3: Samples of text query results using four of our models on the CMPlaces dataset. The corresponding textual relation networks are shown in the 2nd column.

3.2 Comparisons to Baseline Models

We compare our proposed SCKR model to three baseline models; the retrieval performance is also listed in Table 1. Compared to SR, both SCR and SKR achieve a significant improvement of up to 5% in average MAP score on CMPlaces and a slight improvement on Eng-Wiki. This indicates that either low-frequency co-occurrence or the general knowledge graph can provide information complementary to the distributed semantic relationship and improve retrieval performance. By integrating all kinds of textual relationships (SCKR), we obtain a further improvement in MAP scores, especially on the relation-rich CMPlaces dataset. This verifies that SR, CR, and KR each focus on different views of relationships, and that their incorporation brings more informative connections to the relational graph, effectively facilitating information reasoning and generalization.

Fig. 3 gives an example of the text-query task on SCKR and the three baseline models. We show the corresponding relation graphs and the retrieved results. We observe that SR captures the fewest relationships and its retrieval results are far from satisfactory, which necessitates the exploration of richer textual relationships. SCR can effectively emphasize descriptive textual relationships (e.g., "sun-ball" and "sun-bright"), which are infrequent in the data but informative for better understanding the text. Notice that only SKR incorporates the relationship between "overhead" and "airplane" through the "sky-overhead-airplane" inference path, which indicates that general knowledge is beneficial for relation inference and information propagation. The SCKR model leverages the advantages of SCR and SKR and achieves the best performance.

4 Conclusions

In this paper, we proposed a graph-based neural model to integrate multi-level textual relationships, including semantic relations, statistical co-occurrence, and a predefined knowledge graph, for text modeling in CMIR tasks. The model uses a GCN-CNN framework for feature learning and cross-modal semantic correlation modeling. Experimental results on both large-scale and widely used benchmark datasets show that our model significantly outperforms state-of-the-art models, especially for text queries. In future work, we plan to extend this model to other cross-modal areas such as automatic image captioning and video captioning.


  • [1] Zengchang Qin, Jing Yu, Yonghui Cong, and Tao Wan, “Topic correlation model for cross-modal multimedia information retrieval,” Pattern Analysis & Applications, vol. 19, no. 4, pp. 1007–1022, 2016.
  • [2] Jing Yu, Yuhang Lu, Zengchang Qin, Weifeng Zhang, Yanbing Liu, Jianlong Tan, and Li Guo, “Modeling text with graph convolutional network for cross-modal information retrieval,” in PCM. Springer, 2018, pp. 223–234.
  • [3] Chuntao Jiang, Frans Coenen, Robert Sanderson, and Michele Zito, “Text classification using graph mining-based feature extraction,” Knowledge-Based Systems, vol. 23, no. 4, pp. 302–308, 2010.
  • [4] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
  • [5] François Rousseau and Michalis Vazirgiannis, “Graph-of-word and tw-idf: New approach to ad hoc ir,” in CIKM, 2013, pp. 59–68.
  • [6] Rada Mihalcea and Paul Tarau, “Textrank: Bringing order into text,” in EMNLP, 2004, pp. 404–411.
  • [7] Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert RG Lanckriet, Roger Levy, and Nuno Vasconcelos, “A new approach to cross-modal multimedia retrieval,” in ACMMM. ACM, 2010, pp. 251–260.
  • [8] Jing Yu, Yonghui Cong, Zengchang Qin, and Tao Wan, “Cross-modal topic correlations for multimedia retrieval,” in ICPR. IEEE, 2012, pp. 246–249.
  • [9] Weifeng Zhang, Zengchang Qin, and Tao Wan, “Semi-automatic image annotation using sparse coding,” in Machine Learning and Cybernetics (ICMLC). IEEE, 2012, vol. 2, pp. 720–724.
  • [10] Zhuang Ma, Yichao Lu, and Dean Foster, “Finding linear structure in large datasets with scalable canonical correlation analysis,” in ICML, 2015, pp. 169–178.
  • [11] Liwei Wang, Yin Li, and Svetlana Lazebnik, “Learning deep structure-preserving image-text embeddings,” in CVPR, 2016, pp. 5005–5013.
  • [12] Shuang Li, Tong Xiao, Hongsheng Li, Wei Yang, and Xiaogang Wang, “Identity-aware textual-visual matching with latent co-attention,” in ECCV, 2017, pp. 1908–1917.
  • [13] Kuang-huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He, “Stacked cross attention for image-text matching,” in ECCV, 2018.
  • [14] Denny Vrandečić and Markus Krötzsch, “Wikidata: a free collaborative knowledgebase,” Communications of the ACM, vol. 57, no. 10, pp. 78–85, 2014.
  • [15] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives, “Dbpedia: A nucleus for a web of open data,” in The semantic web, 2007, pp. 722–735.
  • [16] Michael Schuhmacher and Simone Paolo Ponzetto, “Knowledge-based graph document modeling,” in WSDM, 2014, pp. 543–552.
  • [17] Alex Graves and Jürgen Schmidhuber, “Framewise phoneme classification with bidirectional lstm and other neural network architectures,” Neural Networks, vol. 18, no. 5-6, pp. 602–610, 2005.
  • [18] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
  • [19] Vijaya B G Kumar, Gustavo Carneiro, and Ian Reid, “Learning local image descriptors with deep siamese and triplet convolutional networks by minimizing global loss functions,” in CVPR, 2016, p. 5385–5394.
  • [20] Zellig S Harris, “Distributional structure,” Word, vol. 10, no. 2-3, pp. 146–162, 1954.
  • [21] Yutaka Matsuo and Mitsuru Ishizuka, “Keyword extraction from a single document using word co-occurrence statistical information,” International Journal on Artificial Intelligence Tools, vol. 13, no. 01, pp. 157–169, 2004.
  • [22] Yutaka Matsuo, Takeshi Sakaki, Kôki Uchiyama, and Mitsuru Ishizuka, “Graph-based word clustering using a web search engine,” in EMNLP. Association for Computational Linguistics, 2006, pp. 542–550.
  • [23] Kaiye Wang, Ran He, Wei Wang, and Liang Wang, “Learning coupled feature spaces for cross-modal matching,” in ICCV, 2013, pp. 2088–2095.
  • [24] Cuicui Kang, Shiming Xiang, Shengcai Liao, Changsheng Xu, and Chunhong Pan, “Learning consistent feature representation for cross-modal multimedia retrieval,” TMM, vol. 17, no. 3, pp. 370–381, 2015.
  • [25] Abhishek Sharma, Abhishek Kumar, Hal Daume, and David W. Jacobs, “Generalized multiview analysis: A discriminative latent space,” in CVPR, 2012, pp. 2160–2167.
  • [26] Liang Zhang, Bingpeng Ma, Jianfeng He, Guorong Li, Qingming Huang, and Qi Tian, “Adaptively unified semi-supervised learning for cross-modal retrieval,” in IJCAI, 2017, pp. 3406–3412.
  • [27] Kaiye Wang, Ran He, Liang Wang, Wei Wang, and Tieniu Tan, “Joint feature selection and subspace learning for cross-modal retrieval,” PAMI, vol. 38, no. 10, pp. 2010–2023, 2016.
  • [28] Lluis Castrejon, Yusuf Aytar, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba, “Learning aligned cross-modal representations from weakly aligned data,” in CVPR. IEEE, 2016.