Interactive Attention for Semantic Text Matching

Sendong Zhao1, Yong Huang2, Chang Su1, Yuantong Li3, Fei Wang1
1 Weill Cornell Medical College, Cornell University, USA
2 Cornell Tech, Cornell University, USA
3 Department of Statistics, Purdue University, USA
{sez4001, csu4001, few2001}@med.cornell.edu, yh849@cornell.edu, li3551@purdue.edu
Abstract

Semantic text matching, which matches a target text to a source text, is a general problem in many domains, such as information retrieval, question answering, and recommendation. This problem presents several challenges, such as semantic gaps between words, implicit matching, and mismatches caused by out-of-vocabulary or low-frequency words. Most existing studies tackle these challenges by learning good representations for the different text pieces or by operating on global matching signals to obtain the matching score. However, they do not learn the local fine-grained interactive information for a specific source-target pair. In this paper, we propose a novel interactive attention model for semantic text matching, which learns new representations for the source and target texts through interactive attention over a global matching matrix and updates the local fine-grained relevance between source and target. Our model enriches the representations of the source and target objects by adopting both global relevance and learned local fine-grained relevance. The enriched representations of source and target encode the global and local relevance of each other and can therefore empower semantic text matching. We conduct empirical evaluations of our model on three applications: biomedical literature retrieval, tweet and news linking, and factoid question answering. Experimental results on three data sets demonstrate that our model significantly outperforms competitive baseline methods.

Introduction

Semantic text matching, which matches a target text to a source text, is one of the most important research problems in many domains, such as information retrieval, question answering, and recommendation. It estimates the semantic similarity between the source and target text pieces. The difficulties of semantic text matching are three-fold. First, semantically similar words can have different surface forms, such as “cancer” and “tumor” or “diabetes” and “hyperglycemia”. Second, words can be implicitly related, such as “diabetes” and “metformin”. Third, out-of-vocabulary or low-frequency words in the source text often cause a mismatch between the source and target texts (e.g., the low-frequency word “pneumothorax” in the source text but its synonym “collapsed lung” in the target text). All these difficulties make semantic text matching a challenging problem across domains.

One of the most conventional approaches to semantic text matching is to compute a vector as the representation of each text piece, using models such as bag-of-words and latent Dirichlet allocation [1], and then apply typical similarity metrics to compute the matching score. Unfortunately, the performance of these traditional approaches is unsatisfactory, as they often fail to identify semantically similar source-target pairs without an exact expression match. Deep learning approaches such as recurrent neural networks [19], long short-term memory networks [8], and convolutional neural networks [13] have been firmly established as the state of the art for understanding the complex semantic correlations within and between texts. Existing deep neural approaches for semantic text matching can be divided into two main categories: 1) representation-focused models and 2) interaction-focused models. Several studies have demonstrated that the latter is more reasonable and promising for semantic text matching [29, 5, 14]. Generally, an interaction-focused model builds a matching matrix between the source and target texts, then takes the matching matrix as input and uses deep neural networks to learn the overall matching score. This type of model can exploit global word-level relevance through pre-trained word embeddings. Therefore, it can bridge the semantic gaps between words across texts and capture some global implicit relations between texts.

In addition to the global word-level relevance from external resources, we can consider the target text as the neighbor information or the context of the corresponding source text and vice versa, which is referred to as local relevance. However, the local relevance for a specific source and target pair is generally ignored by existing interaction-focused models. In other words, existing interaction-focused models cannot update the word-level relevance between the source and target texts by operating only on the pre-defined matching matrix. This can lead to a mismatch of source and target, especially when there are out-of-vocabulary or low-frequency keywords in the source or target text. In this paper, we propose a new interaction-focused model called interactive attention for semantic matching (IASM) to incorporate both global and local word-level relevance between source and target texts. In particular, the proposed model builds a matching matrix between the source and target texts from external resources and learns new representations of the source and target texts through interactive attention upon the matching matrix. On the one hand, IASM exploits global word-level relevance via external knowledge (pre-trained word embeddings from knowledge bases or large corpora). On the other hand, it enriches the representations of the source and target objects through local message passing between them via interactive attention on the matching matrix. Both aspects can enhance the semantic match between the source and target texts. The mutual differences between the original and the new enriched representations are measured as the matching score. Our contributions in this paper are summarized as follows:

  • To the best of our knowledge, this paper is the first work to leverage the matching matrix to enrich the representations of source and target texts.

  • The IASM model takes advantage of both global and local word-level relevance between the source and target texts, which helps alleviate mismatches between these two objects, especially for poorly learned words such as out-of-vocabulary or low-frequency words.

  • Experiments are conducted on three data sets covering three different semantic text matching applications to demonstrate the effectiveness of our proposed method.

Related Work

Semantic text matching estimates the semantic similarity between source and target text pieces (e.g., query-to-document matching, question-to-answer matching, etc.). To measure the similarity between texts, traditional approaches intuitively compare the words in the texts. For example, Mihalcea et al. [18] computed word similarity, while Wu et al. [26] exploited the vector space model with term frequency-inverse document frequency (TF-IDF). However, bag-of-words representations of texts are extremely sparse. In addition, the semantic relations between individual words cannot be captured, so these approaches usually obtain unsatisfactory results. Although some studies attempted to leverage the semantics in external resources such as knowledge bases [24] or to alleviate the data sparseness with Latent Semantic Analysis (LSA) [28], traditional approaches are still limited by discrete word representations.

The recent development of deep learning provides a new opportunity for semantic text matching. Deep learning based semantic text matching models, such as DeepMatch [16], MV-LSTM [25], PACRR [11], Delta [21], Conv-KNRM [2], and MASH RNN [12], have dominated semantic text matching research. In particular, these models focus on 1) flexible representation learning for the source and target texts and 2) measuring the similarity between the source and target texts at different levels. Correspondingly, there are two main categories of deep neural semantic text matching models. One is the representation-focused model, which tries to learn good representations for both source and target with deep neural networks and then conducts matching between the learned representations. Examples include DSSM [10], C-DSSM [4], ARC-I [9], siamese RNN [22], and MASH RNN [12]. The other is the interaction-focused model, which first builds interactions (i.e., a matching matrix) between the source and target texts and then uses deep neural networks to learn the overall matching score based on the interactions. Examples include DeepMatch [16], ARC-II [9], DRMM [5], ESR [27], PACRR [11], Conv-KNRM [2], and Delta [21].

The interaction-focused models can alleviate the semantic gaps between words in the source and target texts via the interaction between pre-trained words, incorporating the global relevance between words or n-grams. By contrast, representation-focused models cannot directly use the global word-level or n-gram-level similarities between the source and target texts. Therefore, interaction-focused models usually perform better. However, existing interaction-focused models neglect the local relevance of words for a specific source and target pair. The target text can be considered the neighbor or the context of the source text and vice versa. Therefore, the local interaction between the source and target texts can be very useful for enriching the representations of the two texts, especially for poorly learned words in the source or target text. This paper is the first to exploit local interactions to enrich the representations of source and target texts and empower semantic text matching.

Our Method

Interactive Attention on Local Interaction

The input of our model is a pair of source and target texts $(S, T)$. The source text $S$ is composed of a sequence of $m$ words and the target text $T$ is composed of a sequence of $n$ words. The pre-trained word embedding for each word $w^s_i$ in $S$ and $w^t_j$ in $T$ can be obtained via representation learning on external resources such as knowledge bases or large corpora. Therefore, we can get the representation $\mathbf{S} \in \mathbb{R}^{m \times d}$ of the source text and the representation $\mathbf{T} \in \mathbb{R}^{n \times d}$ of the target text, where $d$ is the embedding dimension.

Here, we can compute a matching matrix $M \in \mathbb{R}^{m \times n}$ through the word-level similarity between the source text $S$ and the target text $T$, based on their pre-trained representations $\mathbf{S}$ and $\mathbf{T}$:

$$M_{ij} = \mathrm{sim}(\mathbf{w}^s_i, \mathbf{w}^t_j) \qquad (1)$$

where we exploit cosine similarity as the function $\mathrm{sim}(\cdot, \cdot)$. This matching matrix specifies the space of element-wise interaction between $S$ and $T$. In addition, the matching matrix can be an adjacency matrix of a bipartite graph extracted from external knowledge graphs.
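As a concrete illustration, the following is a minimal PyTorch sketch of how such a cosine-similarity matching matrix can be computed from pre-trained embeddings; the function name and tensor layout are our own assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def matching_matrix(S, T):
    """Word-level cosine-similarity matching matrix (cf. Eq. 1).

    S: (m, d) tensor of pre-trained embeddings for the m source words.
    T: (n, d) tensor of pre-trained embeddings for the n target words.
    Returns M: (m, n) with M[i, j] = cos(S[i], T[j]).
    """
    S_unit = F.normalize(S, p=2, dim=-1)  # unit-length rows
    T_unit = F.normalize(T, p=2, dim=-1)
    return S_unit @ T_unit.t()
```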

Figure 1: The framework of one-layer 2-channel interactive attention for text matching.

The architecture of interactive attention on a matching matrix is composed of three components.

  • Interactive attention-based learning for the source text.

  • Interactive attention-based learning for the target text.

  • Matching score computation.

The first two components, referred to as 2-channel interactive attention, are illustrated in Figure 1. They are designed to learn new representations for the source and target texts that leverage the global and local relevance between these two via message passing through the matching matrix and its transpose. Specifically, we adopt an interactive attention mechanism with local interaction to achieve this goal.

Interactive attention learning for source text

Given the matching matrix $M$, which represents the word-level similarity between $S$ and $T$, we take it as the relational matrix between the target and source texts. For the representation $\mathbf{S}^{(0)}$ of the source text, which is composed of the pre-trained word embeddings $\{\mathbf{w}^s_1, \dots, \mathbf{w}^s_m\}$, we conduct interactive attention on the matching matrix and its transpose to learn a new representation for each word in the source text. More concretely, interactive attention is conducted over all words in the target text via the matching matrix. In this way, it can integrate relevance information from all words $\{\mathbf{w}^t_1, \dots, \mathbf{w}^t_n\}$ in the target text to enrich the representation of the source text. For the 1st layer, we have

$$\mathbf{S}^{(1)} = f\left(M^{\top} \mathbf{S}^{(0)} W^{(1)}\right) \qquad (2)$$

where the incoming interactive information is accumulated and passed through a neural network-like function $f$, such as a linear transformation plus a ReLU, and $\mathbf{S}^{(0)}$ is the original representation of the source text composed of the pre-trained word embeddings $\{\mathbf{w}^s_1, \dots, \mathbf{w}^s_m\}$. This process can be stacked into multiple layers. For the $(2k)$th layer, $k \geq 1$, we have

$$\mathbf{S}^{(2k)} = f\left(M \mathbf{S}^{(2k-1)} W^{(2k)}\right) \qquad (3)$$

and for the $(2k+1)$th layer, $k \geq 1$,

$$\mathbf{S}^{(2k+1)} = f\left(M^{\top} \mathbf{S}^{(2k)} W^{(2k+1)}\right) \qquad (4)$$

where $W^{(2k)}$ and $W^{(2k+1)}$ are weight matrices. The weight matrices of different odd layers can either be shared or kept separate, and the same choice applies to the even layers.
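The alternating updates in Eqs. (2)-(4) can be sketched as a small PyTorch module. This is a hedged reconstruction, not the authors' released code: odd layers aggregate through $M^{\top}$ and even layers through $M$, each followed by a linear transformation plus ReLU as the function $f$; the class and parameter names are ours. The target channel below is the mirror image, so the same module can serve both.

```python
import torch
import torch.nn as nn

class InteractiveAttentionChannel(nn.Module):
    """One channel of interactive attention (sketch of Eqs. 2-4).

    Starting from an (m, d) representation, odd layers pass messages
    through M^T (yielding an (n, d), target-shaped representation) and
    even layers pass them back through M (source-shaped again).
    """

    def __init__(self, dim, num_layers):
        super().__init__()
        self.linears = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(self, X, M):
        # X: (m, d) initial representation; M: (m, n) matching matrix.
        H = X
        for layer_idx, linear in enumerate(self.linears, start=1):
            A = M.t() if layer_idx % 2 == 1 else M  # alternate M^T / M
            H = torch.relu(linear(A @ H))           # f: linear + ReLU
        return H  # target-shaped after an odd number of layers
```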

Interactive attention learning for target text

Given the matching matrix $M$, which represents the word-level similarity between $S$ and $T$, we take it as the relational matrix between the source and target texts. For the representation $\mathbf{T}^{(0)}$ of the target text, which is composed of the pre-trained word embeddings $\{\mathbf{w}^t_1, \dots, \mathbf{w}^t_n\}$, we conduct interactive attention on the matching matrix and its transpose to learn a new representation for each word in the target text. More concretely, interactive attention is conducted over all words in the source text via the matching matrix. Hence, it can integrate relevance information from all words $\{\mathbf{w}^s_1, \dots, \mathbf{w}^s_m\}$ in the source text to enrich the representation of the target text. For the 1st layer, we have

$$\mathbf{T}^{(1)} = f\left(M \mathbf{T}^{(0)} V^{(1)}\right) \qquad (5)$$

where the incoming relevance information is accumulated and passed through a neural network-like function $f$, such as a linear transformation plus a ReLU, and $\mathbf{T}^{(0)}$ is the original representation of the target text composed of the pre-trained word embeddings $\{\mathbf{w}^t_1, \dots, \mathbf{w}^t_n\}$. This process can be stacked into multiple layers. For the $(2k)$th layer, $k \geq 1$, we have

$$\mathbf{T}^{(2k)} = f\left(M^{\top} \mathbf{T}^{(2k-1)} V^{(2k)}\right) \qquad (6)$$

and for the $(2k+1)$th layer, $k \geq 1$,

$$\mathbf{T}^{(2k+1)} = f\left(M \mathbf{T}^{(2k)} V^{(2k+1)}\right) \qquad (7)$$

where $V^{(2k)}$ and $V^{(2k+1)}$ are weight matrices.
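Since the two channels differ only in which matrix is applied first, the module sketched above can serve both by transposing the matching matrix for the target channel. A hypothetical usage with random stand-in embeddings, reusing the matching_matrix helper and the module sketched earlier:

```python
import torch

m, n, d = 12, 30, 100
S = torch.randn(m, d)             # stand-ins for pre-trained source embeddings
T = torch.randn(n, d)             # stand-ins for pre-trained target embeddings
M = matching_matrix(S, T)         # (m, n), cf. Eq. 1

source_channel = InteractiveAttentionChannel(dim=d, num_layers=1)
target_channel = InteractiveAttentionChannel(dim=d, num_layers=1)

S_new = source_channel(S, M)      # (n, d): target-shaped, built from S (Eqs. 2-4)
T_new = target_channel(T, M.t())  # (m, d): source-shaped, built from T (Eqs. 5-7)
```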

Matching score computation.

To get the matching score between the source and target texts, we conduct two comparisons on the two channels, i.e., the source channel and the target channel. We compare the original source text representation $\mathbf{S}$ with the new learned target representation $\mathbf{T}^{(2k+1)}$ obtained after an odd number of interactive attention layers; the distance between these two is computed as $d_1 = d(\mathbf{S}, \mathbf{T}^{(2k+1)})$. Likewise, we compare the original target text representation $\mathbf{T}$ with the new learned source representation $\mathbf{S}^{(2k+1)}$ obtained after an odd number of interactive attention layers; the distance between these two is computed as $d_2 = d(\mathbf{T}, \mathbf{S}^{(2k+1)})$. Each representation should be normalized before computing the distance $d(\cdot, \cdot)$. An appropriate distance metric can be chosen, e.g., Euclidean or cosine; here, we exploit the Euclidean distance. Therefore, the scoring function is defined as follows:

$$\mathrm{score}(S, T) = \alpha \, d_1 + \beta \, d_2 \qquad (8)$$

where $\alpha$ and $\beta$ are hyper-parameters.
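A minimal sketch of the score computation, assuming the Euclidean distance is taken after row-normalizing the representations and that the two hyper-parameters simply weight the two channel distances:

```python
import torch
import torch.nn.functional as F

def matching_score(S, T, S_new, T_new, alpha=1.0, beta=1.0):
    """Sketch of Eq. 8: weighted sum of the two channel distances.

    S (m, d), T (n, d): original representations.
    T_new (m, d): target-channel output after an odd number of layers.
    S_new (n, d): source-channel output after an odd number of layers.
    alpha, beta: the two hyper-parameters of Eq. 8 (assumed to act as weights).
    """
    def dist(A, B):
        # normalize rows before taking the Euclidean distance
        return torch.norm(F.normalize(A, dim=-1) - F.normalize(B, dim=-1))

    d1 = dist(S, T_new)            # source-channel comparison
    d2 = dist(T, S_new)            # target-channel comparison
    return alpha * d1 + beta * d2  # a distance: lower means a better match
```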

Learning of IASM

To learn the parameters of our IASM, we consider a ranking criterion. Intuitively, given a true pair $(S, T)$, if the target text were missing, we would like the model to be able to predict the correct target text. For each true pair of source and target texts $(S, T)$, we sample several negative examples. The objective of training is to learn the proposed model so that it successfully ranks the true pair ahead of all possible negative samples. Therefore, we define a loss to formalize this intuition:

$$L = \sum_{(S, T) \in \mathcal{D}} \; \sum_{(S', T') \in \mathcal{D}'_{(S,T)}} \left[ \gamma + \mathrm{score}(S, T) - \mathrm{score}(S', T') \right]_{+} \qquad (9)$$

where $\mathcal{D}$ is the set of true source and target text pairs, $\mathcal{D}'_{(S,T)}$ contains corrupted pairs constructed by negative sampling, which replaces the source text or the target text in the true pair $(S, T)$, $\gamma$ is a margin separating true pairs from corrupted pairs, and $[x]_{+}$ denotes the positive part of $x$.
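This margin-based objective can be implemented in a few lines; a sketch assuming the score of Eq. (8) behaves as a distance (lower is better for true pairs):

```python
import torch

def ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Margin ranking loss (sketch of Eq. 9).

    pos_scores: scores (distances) of true (source, target) pairs.
    neg_scores: scores of the corresponding corrupted pairs.
    A true pair is penalized whenever it fails to beat its corrupted
    pair by at least the margin gamma.
    """
    return torch.clamp(margin + pos_scores - neg_scores, min=0).sum()
```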

Experiments

Experimental Settings

In the experiments, we verify the performance of IASM in three real-world applications of semantic text matching, including (1) biomedical literature retrieval, (2) tweet and news linking, and (3) factoid question answering. The experimental settings about the datasets employed in each task are described in the following corresponding subsections.

Evaluation

For the three different applications of semantic text matching, we conduct ranking experiments, which are evaluated with three standard information retrieval metrics: precision at N (such as P@1 and P@10), mean reciprocal rank (MRR), and mean average precision (MAP). Precision at N is the retrieval precision among the top N ranked results. MRR is the mean of the multiplicative inverse of the rank of the first correct answer. MAP is the mean of the average precision across the source texts in our test datasets.
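For reference, the three metrics can be computed as follows; this is a generic sketch over per-query lists of 0/1 relevance labels sorted by predicted rank, not the paper's evaluation script:

```python
def precision_at_n(ranked, n):
    """ranked: 0/1 relevance labels of one query, sorted by predicted rank."""
    return sum(ranked[:n]) / n

def mean_reciprocal_rank(all_ranked):
    """all_ranked: one ranked 0/1 relevance list per query."""
    total = 0.0
    for ranked in all_ranked:
        for rank, rel in enumerate(ranked, start=1):
            if rel:                      # first correct answer found
                total += 1.0 / rank
                break
    return total / len(all_ranked)

def mean_average_precision(all_ranked):
    ap_sum = 0.0
    for ranked in all_ranked:
        hits, precisions = 0, []
        for rank, rel in enumerate(ranked, start=1):
            if rel:
                hits += 1
                precisions.append(hits / rank)  # precision at each hit
        ap_sum += sum(precisions) / max(hits, 1)
    return ap_sum / len(all_ranked)
```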

Implementation Details

The model is implemented in PyTorch. We follow the training procedure outlined in Section “Learning of IASM” with the word embedding setup of Section “Pre-trained word embeddings”. We train the model using Adam for 50 epochs. The best values of the hyper-parameters $\alpha$ and $\beta$ in Eq. (8) are selected per task, and the best setting for the number of layers is one layer for tasks 1 and 2 and multiple layers for task 3. For the baseline deep neural models, we use the same settings reported in their corresponding papers.
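Putting the pieces together, a minimal end-to-end training sketch with synthetic data, reusing the hypothetical matching_matrix, InteractiveAttentionChannel, matching_score, and ranking_loss sketches above; the learning rate and margin here are illustrative placeholders, not the paper's values:

```python
import torch

d = 100
src_ch = InteractiveAttentionChannel(dim=d, num_layers=1)
tgt_ch = InteractiveAttentionChannel(dim=d, num_layers=1)
params = list(src_ch.parameters()) + list(tgt_ch.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)  # placeholder learning rate

def pair_score(S, T):
    M = matching_matrix(S, T)                  # Eq. 1
    return matching_score(S, T, src_ch(S, M), tgt_ch(T, M.t()))

for epoch in range(50):
    # one synthetic (true, corrupted) pair per step; real training would
    # iterate over the dataset with several negative samples per true pair
    S, T_pos, T_neg = torch.randn(12, d), torch.randn(30, d), torch.randn(30, d)
    loss = ranking_loss(pair_score(S, T_pos), pair_score(S, T_neg), margin=1.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```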

Pre-trained word embeddings

We initialized the word embedding matrix with three types of pre-trained word embeddings, respectively. The first is trained by Word2Vec [20]. For biomedical literature retrieval, we choose pre-trained Word2Vec 100-dimensional embeddings trained on all 27.5 million MEDLINE biomedical articles. For tweet-to-news linking and factoid QA, we choose Google's pre-trained Word2Vec 300-dimensional embeddings trained on roughly 100 billion words from a Google News dataset. The second is trained by GloVe [23]. For biomedical literature retrieval, we choose GloVe 100-dimensional embeddings trained on 27.5 million MEDLINE biomedical articles. For tweet-to-news linking, we choose Stanford's pre-trained GloVe 100-dimensional embeddings trained on 2 billion tweets. For factoid QA, we use Stanford's pre-trained GloVe 100-dimensional embeddings trained on Wikipedia 2014 + Gigaword 5. The third is randomly initialized 100-dimensional embeddings that are uniformly sampled from a range scaled by $d$, where $d$ is the dimension of the embeddings [7].
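A sketch of the initialization described above; the uniform bound scaled by the dimension is an assumption in the spirit of [7], and a GloVe-style text format is assumed for the pre-trained vectors:

```python
import numpy as np

def init_embeddings(vocab, dim=100, vectors_path=None):
    """Build the embedding matrix for a vocab dict {word: row index}.

    Rows default to random vectors sampled uniformly from a range that
    shrinks with the dimension (assumed bound, cf. [7]); rows found in
    the optional pre-trained vector file are overwritten.
    """
    bound = np.sqrt(6.0 / dim)  # assumed dimension-scaled bound
    emb = np.random.uniform(-bound, bound, size=(len(vocab), dim))
    if vectors_path is not None:
        with open(vectors_path, encoding="utf-8") as f:
            for line in f:  # GloVe-style line: "word v1 v2 ... vd"
                word, *values = line.rstrip().split(" ")
                if word in vocab and len(values) == dim:
                    emb[vocab[word]] = np.asarray(values, dtype=np.float64)
    return emb
```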

Baseline Methods

For all of the tasks, we compare with the following baseline methods to evaluate the IASM.

  • RNN-based approach (RNN) [22] exploits an RNN to model each document as a word sequence. A siamese structure is then applied to measure the relations between documents. It is representative of approaches based on word-level RNNs.

  • CNN-based approach (CNN) integrates the word embeddings into an embedding matrix and applies several convolutional filters to extract representative features with a max-pooling layer that covers the whole document [13]. The cosine similarity is then applied to measure the relations between documents.

  • BERT-based approach (BERT) exploits pre-trained BERT [3] to model each document as a word sequence and output its vector representation. A multi-layer perceptron is utilized on the concatenation of source and target representations to get the matching score.

  • ARC-I [9] finds the representation of the source and target texts and then compares the representation for the two with a multi-layer perceptron. ARC-I is a representation-focused model. It enjoys the flexibility brought by the convolutional sentence model.

  • DeepMatch [16] is a deep learning architecture aiming at capturing the complicated matching relations between two objects from heterogeneous domains more effectively. DeepMatch is an interaction-focused model. It directly models the object-object interactions with a CNN-based architecture. In particular, it convolves over the object-object interaction matrix and predicts whether two objects are related. Related studies, such as MV-LSTM [25], share similar structures with this model.

  • Delta [21] first constructs a “modified” document matrix by replacing the words in the documents with the closest words in the query. Convolutions are then performed on this matrix to obtain a final relevance score.

Task 1: Biomedical Literature Retrieval

The first application in the experiments is biomedical literature retrieval, which plays a central role in biomedical informatics. Given a source text as a query, the goal is to find the biomedical articles most relevant to the query. This application can contribute to many real biomedical scenarios. For example, when a user types in some disease-related information (e.g., symptoms, disease name, genetic information, personal characteristics, or history), the system can automatically provide the most relevant articles on the treatment, prevention, or prognosis of the corresponding disease.

Experimental Dataset

The PubMed articles are adopted for the experiments on biomedical literature retrieval. The dataset is collected by sampling PMIDs on PubMed (a PMID is the unique identifier number used in PubMed for each article). Each biomedical article includes a very brief title and an abstract that summarizes the article. We exploit the title of each biomedical article as the source text and take the corresponding abstract as the unique target text to be retrieved. The biomedical article corpus is composed of 40,000 biomedical abstracts with titles. With the ratio 8:1:1, we partition the biomedical articles into training, validation, and testing sets. For the title and abstract of each article, we conduct tokenization of the content and stemming of each word via NLTK [15] and Stanford CoreNLP [17]. The statistics of the dataset are shown in Table 1.
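A minimal sketch of such preprocessing with NLTK; the Porter stemmer and lowercasing are our assumptions, since the paper names the toolkits but not the exact pipeline:

```python
import nltk
from nltk.stem.porter import PorterStemmer

# nltk.download("punkt")  # one-time download of the tokenizer model

stemmer = PorterStemmer()

def preprocess(text):
    """Tokenize and stem a title or abstract (assumed pipeline)."""
    return [stemmer.stem(tok) for tok in nltk.word_tokenize(text.lower())]

# e.g. preprocess("Neonatal chest drain insertion - an animal model.")
```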

      | Query  | Document | Query Running Words
Train | 31,984 | 31,984   | 25,042
Valid | 3,997  | 3,997    | 7,999
Test  | 4,020  | 4,020    | 8,008
Table 1: The statistics of the dataset for task 1.

Experimental Results

Table 2 shows the ranking performance of the baseline methods and the proposed IASM. Among the baselines, the representation-focused models based on representation learning, i.e., RNN, CNN, ARC-I, and BERT, have similar performance. The interaction-focused models that operate on interactions, i.e., DeepMatch, Delta, and our proposed IASM, perform better than the representation-focused models, which is consistent with previous studies [29, 5, 14]. Most importantly, our proposed IASM achieves the best performance on all criteria.

Comparing the three interaction-focused models, our proposed IASM outperforms both DeepMatch and Delta by large margins. All three methods operate on the interaction between the source and target texts. The difference is that DeepMatch and Delta take the interaction as the only input to deep neural networks, while IASM considers the document words as the neighbors and context of the query words and vice versa. In particular, IASM exploits both global information from pre-trained word embeddings and local information from the context of the specific query and document pair to implement interactive attention, making it possible to learn more similar representations for the true source and target pair.

Model P@1 P@10 MRR MAP
CNN 0.0005 0.0017 0.0013 0.0013
RNN 0.0005 0.0029 0.0021 0.0021
ARC-I 0.0008 0.0075 0.0015 0.0015
BERT 0.0050 0.046 0.0175 0.0175
DeepMatch 0.1474 0.3738 0.2634 0.2634
Delta 0.1527 0.4213 0.2724 0.2724
IASM 0.2189 0.6206 0.3439 0.3439
Table 2: The ranking performance of biomedical literature retrieval.

We also test the effects of initializing the word embeddings with the different pre-training strategies described in Section “Pre-trained word embeddings”. The results are shown in Table 3. From this table, we observe that:

  • Models using pre-trained word embeddings achieve a significant improvement over those using random embeddings.

  • Models using GloVe embeddings consistently outperform those using Word2Vec across the different interaction-focused models.

  • DeepMatch and Delta rely more heavily on pre-trained word embeddings compared to our proposed IASM.

  • IASM outperforms DeepMatch and Delta even with randomly initialized word embeddings, which indicates the advantage of encoding local relevance between query and document for this task.

The reason why pre-trained word embeddings affect our IASM less is probably that our model learns new representations for the source and target texts based on the interaction matrix, which provides a training signal for the word-level relevance between the source and target texts.

Word Embedding Model P@1 MAP
Random DeepMatch 0.0005 0.0009
Delta 0.0945 0.1871
IASM 0.1073 0.1935
Word2Vec DeepMatch 0.1023 0.2243
Delta 0.1465 0.2631
IASM 0.1832 0.3166
GloVe DeepMatch 0.1474 0.2634
Delta 0.1527 0.2724
IASM 0.2189 0.3439
Table 3: Performance of task 1 with different pre-trained word embeddings on interaction-focused models

Task 2: Tweet to News Linking

The second application in the experiments is tweet-to-news linking. The task of linking a tweet to a relevant news article was proposed by Guo et al. [6]. More specifically, given the text of a tweet, a system aims to find the most relevant news article. It is therefore natural to take the tweet as the source text and the most relevant news article as the target text in the framework of semantic text matching. Linking news to tweets can enrich the context of tweets, which are usually short and informal, and can benefit tweet analysis as well as topic and event discovery from tweets. There are many similar real-world scenarios because people tend to discuss the same events and topics in different web spaces. For example, the reporting of the same event differs across news media, and individuals comment on the same event with different expressions or even in different languages. Discovering these different versions of the same event or topic could be very useful for providing diverse views and opinions.

Experimental Dataset

The dataset we used here is provided by Guo et al. [6] and contains explicit URL links from each tweet to a related news article. It consists of all tweets with a single URL link to a CNN or NYTIMES news article, dated from January 11 to January 27, 2013. The goal is to predict the URL-referred news article based on the text of each tweet. For the tweets and the news, tokenization is conducted via Stanford CoreNLP [17]. The statistics of the dataset are shown in Table 4.

      | Tweet  | News  | Tweet Running Words
Train | 24,881 | 3,191 | 27,905
Valid | 3,110  | 1,231 | 8,884
Test  | 6,897  | 1,890 | 13,940
Table 4: The statistics of the dataset for task 2.

Experimental Results

We trained our model with 24,881 original (tweet, news) pairs. Compared to Task 1, there are two extra challenges. First, the writing style of tweets is more free and informal. Second, the same or similar content may be expressed with different wording and styles, causing semantic mismatch. Both challenges make tweet-to-news linking a more difficult task than Task 1. Therefore, this task places greater demands on learning implicit relations (i.e., relevant or similar meanings with different expressions, such as “happy” and “thrilled”, or “niiiiiiice” and “nice”) between tweets and news. As shown in Table 5, we compare different baseline models, including representation-focused and interaction-focused models, for linking news to tweets. Again, IASM beats the other models by large margins, with the two other interaction-focused models, DeepMatch and Delta, coming next.

Model P@1 MRR MAP
RNN 0.0011 0.0052 0.0052
CNN 0.0017 0.0041 0.0041
ARC-I 0.0021 0.0047 0.0047
BERT 0.0231 0.0340 0.0340
DeepMatch 0.1385 0.2332 0.2332
Delta 0.1503 0.2643 0.2643
IASM 0.2534 0.3494 0.3494
Table 5: The performance of linking news to tweets.

As mentioned in Section “Pre-trained word embeddings”, in order to test the importance of pre-trained word embeddings for linking news to tweets, we performed experiments with two different sets of word embeddings, as well as a random sampling method, to initialize our model. The results are shown in Table 6. Again, the GloVe embeddings show their superiority compared to the other two embedding models. However, two observations differ from the Task 1 experiments.

  • For IASM, the difference in results between GloVe and Word2Vec is smaller.

  • The IASM beats DeepMatch and Delta by large margins even using randomly initialized embeddings.

These two observations show that IASM is more stable than the other two interaction-focused models for linking news to tweets. The reason is that IASM is good at alleviating the semantic mismatch between tweets and news by considering the global and local word-level relevance between the two.

Word Embedding Model P@1 MAP
Random DeepMatch 0.0004 0.0007
Delta 0.0877 0.1603
IASM 0.1336 0.2433
Word2Vec DeepMatch 0.1184 0.2047
Delta 0.1433 0.2579
IASM 0.2521 0.3432
GloVe DeepMatch 0.1385 0.2332
Delta 0.1503 0.2643
IASM 0.2534 0.3494
Table 6: Performance of task 2 with different pre-trained word embeddings of interaction-focused models.

Task 3: Factoid Question Answering

The third application in the experiments is factoid question answering. Given a question and a pool of candidate passages, the task is to select the passages that contain the correct answer. The performance of this passage selection task is crucial not only to non-factoid QA systems, where a question is expected to be answered with a sequence of descriptive text, but also to factoid QA systems, where the answer passage selection step is also known as passage scoring. Here, we focus on factoid QA, which takes the question as the source text and the sentences carrying the corresponding answer as the target text.

Experimental Dataset

The dataset we used here is YodaQA. YodaQA is an open-source factoid question answering system that can produce answers from both databases and text corpora using on-the-fly information extraction. By default, open-domain question answering is performed on top of the Freebase and DBpedia knowledge bases as well as the text of English Wikipedia (enwiki) articles.

      | Question | Answer | Question Running Words
Train | 343      | 56,042 | 970
Valid | 341      | 7,541  | 964
Test  | 429      | 73,380 | 1,166
Table 7: The statistics of the dataset for task 3.

In this dataset, the questions come from the YodaQA dataset, and the YodaQA system generated the candidate sentences from enwiki using its sentence-selection branch. Sentences were generated by running a full-text Solr search on enwiki for keywords extracted from the question and then taking all sentences from the top N results that contain at least one such keyword. Sentences that match the gold-standard answer regex are labeled 1; the rest are labeled 0. The statistics of the dataset are shown in Table 7.
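The labeling step can be sketched as a simple regex filter; the case-insensitive flag is our assumption:

```python
import re

def label_candidates(sentences, answer_regex):
    """Label candidate sentences against the gold-standard answer regex.

    Sentences matching the regex get label 1, the rest 0, mirroring the
    sentence-labeling rule described above.
    """
    pattern = re.compile(answer_regex, re.IGNORECASE)  # assumed flag
    return [1 if pattern.search(s) else 0 for s in sentences]

# e.g. label_candidates(["Taps is played at dusk.", "No bugle here."], r"\bTaps\b")
```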

Experimental Results

Table 8 reports the results of searching for correct answers to questions. We trained our model with 68,880 true and fake (question, answer) pairs. Compared to Tasks 1 and 2, this task is slightly different: for each question, there might be multiple true answers. However, the many fake but similar answers make this task much more difficult than the above two, especially for ranking all true answers at the top. From Table 8, we again see that interaction-focused models outperform most representation-focused models, with the exception of the BERT-based model. It is interesting that the BERT-based model outperforms the other representation-focused models by large margins and outperforms DeepMatch as well, which shows the strength of BERT in encoding factoid relevance through pre-training on very large corpora such as Wikipedia. Again, IASM beats the other interaction-focused models, though by smaller margins than in Tasks 1 and 2. The reason is that the many semantically related false answers make it hard to find the factually true ones; semantic relevance alone is not sufficient to give the right answers.

Model P@1 MRR MAP
RNN 0.0004 0.0018 0.0074
CNN 0.0004 0.0035 0.0052
ARC-I 0.0004 0.0034 0.0053
BERT 0.2279 0.1334 0.4310
DeepMatch 0.2246 0.1430 0.4451
Delta 0.2423 0.1665 0.4738
IASM 0.2924 0.2345 0.5778
Table 8: The ranking performance of factoid QA.

We also check the impact of different pre-trained word embeddings, as in the above two tasks; the results are shown in Table 9. Again, the GloVe embeddings show their superiority compared to the other two embedding models, and the two pre-trained embeddings outperform the randomly initialized ones. IASM still outperforms DeepMatch and Delta even with randomly initialized word embeddings, indicating the advantage of encoding local relevance between question and answer for this task.

Word Embedding Model P@1 MAP
Random DeepMatch 0.0003 0.0012
Delta 0.0513 0.1129
IASM 0.1242 0.3117
Word2Vec DeepMatch 0.1764 0.3851
Delta 0.1932 0.4156
IASM 0.2876 0.5721
GloVe DeepMatch 0.2246 0.4451
Delta 0.2423 0.4738
IASM 0.2924 0.5778
Table 9: Performance of factoid QA with different pre-trained word embeddings on interaction-focused models.

Case Study

Last but not least, we performed three case studies to better understand the power of IASM for semantic text matching. By modeling the global and local relevance between source and target texts, IASM can find target texts that are semantically and implicitly related to the corresponding source text. Here are three examples in which the target text is placed at rank 1 by IASM but at rank n (n > 1) by the baseline models. The bold words are keywords that are either low-frequency, out-of-vocabulary, or implicitly related to the matched objects in their specific pairwise contexts.

  • Biomedical literature retrieval. Query: “Neonatal chest drain insertion–an animal model.” Document: “Trainees rarely see a pneumothorax in the newborn because of the combination of decreased doctors’ hours, the use of surfactant, and modern ventilator techniques ….”

  • Tweet to news linking. Tweet: “A modicum of progress: RT @cnnbrk: Saudi King Abdullah decrees currently male-dominated Council be at least 20% women.” News: “Saudi Arabia’s King Abdullah has appointed 30 women to the previously all-male consultative Shura Council.”

  • Question answering. Question: “What does the bugler play at the end of the day on a US military base?” Answer: “The most widely circulated one states that a Union Army infantry officer, whose name often is given as Captain Robert Ellicombe, first ordered “Taps” performed at the funeral of his son, a Confederate soldier killed during the Peninsula Campaign.”

From these examples, it is clear that IASM is very good at modeling the semantic and implicit relevance between source and target texts.

Conclusion

In this paper, we proposed a novel deep neural interactive attention architecture, IASM, for semantic text matching. The model takes pre-trained representations as input and learns new representations for the source and target texts through interactive attention over the local interaction between the two texts. Therefore, IASM not only utilizes the global relevance from pre-trained word embeddings but also incorporates local relevance by passing messages between the two sides through interactive attention over the local interaction. We conducted empirical evaluations of our model in three applications: biomedical literature retrieval, tweet-to-news linking, and factoid question answering. The experimental results show that our model: (i) improves on previous state-of-the-art models by large margins on all three tasks; (ii) is much more stable across different pre-trained word embedding models; and (iii) is better at placing semantically relevant target objects at rank 1.

References

  • [1] D. M. Blei, A. Y. Ng, and M. I. Jordan (2003) Latent Dirichlet allocation. Journal of Machine Learning Research 3 (Jan), pp. 993–1022.
  • [2] Z. Dai, C. Xiong, J. Callan, and Z. Liu (2018) Convolutional neural networks for soft-matching n-grams in ad-hoc search. In Proceedings of WSDM, pp. 126–134.
  • [3] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [4] J. Gao, P. Pantel, M. Gamon, X. He, and L. Deng (2014) Modeling interestingness with deep neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2–13.
  • [5] J. Guo, Y. Fan, Q. Ai, and W. B. Croft (2016) A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pp. 55–64.
  • [6] W. Guo, H. Li, H. Ji, and M. Diab (2013) Linking tweets to news: a framework to enrich short text data in social media. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 239–249.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of ICCV-2015, Washington, DC, USA, pp. 1026–1034.
  • [8] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • [9] B. Hu, Z. Lu, H. Li, and Q. Chen (2014) Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems, pp. 2042–2050.
  • [10] P. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck (2013) Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pp. 2333–2338.
  • [11] K. Hui, A. Yates, K. Berberich, and G. de Melo (2017) PACRR: a position-aware neural IR model for relevance matching. In Proceedings of EMNLP, Copenhagen, Denmark, pp. 1049–1058.
  • [12] J. Jiang, M. Zhang, C. Li, M. Bendersky, N. Golbandi, and M. Najork (2019) Semantic text matching for long-form documents. In Proceedings of the World Wide Web Conference (WWW).
  • [13] Y. Kim (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
  • [14] B. Liu, T. Zhang, D. Niu, J. Lin, K. Lai, and Y. Xu (2018) Matching long text documents via graph convolutional networks. arXiv preprint arXiv:1802.07459.
  • [15] E. Loper and S. Bird (2002) NLTK: the Natural Language Toolkit. arXiv preprint cs/0205028.
  • [16] Z. Lu and H. Li (2013) A deep architecture for matching short texts. In Advances in Neural Information Processing Systems, pp. 1367–1375.
  • [17] C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky (2014) The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55–60.
  • [18] R. Mihalcea, C. Corley, C. Strapparava, et al. (2006) Corpus-based and knowledge-based measures of text semantic similarity. In AAAI, Vol. 6, pp. 775–780.
  • [19] T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur (2010) Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
  • [20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
  • [21] S. Mohan, N. Fiorini, S. Kim, and Z. Lu (2018) A fast deep learning model for textual relevance in biomedical information retrieval. In Proceedings of WWW, pp. 77–86.
  • [22] J. Mueller and A. Thyagarajan (2016) Siamese recurrent architectures for learning sentence similarity. In Thirtieth AAAI Conference on Artificial Intelligence.
  • [23] J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of EMNLP, pp. 1532–1543.
  • [24] G. Tsatsaronis, I. Varlamis, and M. Vazirgiannis (2010) Text relatedness based on a word thesaurus. Journal of Artificial Intelligence Research 37, pp. 1–39.
  • [25] S. Wan, Y. Lan, J. Guo, J. Xu, L. Pang, and X. Cheng (2016) A deep architecture for semantic matching with multiple positional sentence representations. In Thirtieth AAAI Conference on Artificial Intelligence.
  • [26] H. C. Wu, R. W. P. Luk, K. F. Wong, and K. L. Kwok (2008) Interpreting TF-IDF term weights as making relevance decisions. ACM Transactions on Information Systems (TOIS) 26 (3), pp. 13.
  • [27] C. Xiong, R. Power, and J. Callan (2017) Explicit semantic ranking for academic search via knowledge graph embedding. In Proceedings of the 26th International Conference on World Wide Web, pp. 1271–1279.
  • [28] W. Yih, K. Toutanova, J. C. Platt, and C. Meek (2011) Learning discriminative projections for text similarity measures. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pp. 247–256.
  • [29] Y. Zhang, M. M. Rahman, A. Braylan, B. Dang, H. Chang, H. Kim, Q. McNamara, A. Angert, E. Banner, V. Khetan, et al. (2016) Neural information retrieval: a literature review. arXiv preprint arXiv:1611.06792.