Multi-Perspective Relevance Matching with Hierarchical ConvNets for Social Media Search

Jinfeng Rao, Wei Yang, Yuhao Zhang, Ferhan Ture, and Jimmy Lin
Department of Computer Science, University of Maryland
David R. Cheriton School of Computer Science, University of Waterloo
Department of Computer Science, Stanford University
Comcast Applied AI Research
jinfeng@cs.umd.edu, {w85yang, jimmylin}@uwaterloo.ca, yuhao.zhang@stanford.edu, ferhan_ture@comcast.com
Abstract.

Despite substantial interest in applications of neural networks to information retrieval, neural ranking models have so far only been applied to "standard" ad hoc retrieval tasks over web pages and newswire documents. This paper proposes MP-HCNN (Multi-Perspective Hierarchical Convolutional Neural Network), a novel neural ranking model specifically designed for ranking short social media posts. We identify document length, informal language, and heterogeneous relevance signals as features that distinguish documents in our domain, and present a model specifically designed with these characteristics in mind. Our model uses hierarchical convolutional layers to learn latent semantic soft-match relevance signals at the character, word, and phrase levels. A pooling-based similarity measurement layer integrates evidence from multiple types of matches between the query and the social media post, as well as URLs contained in the post. Extensive experiments using Twitter data from the TREC Microblog Tracks 2011–2014 show that our model significantly outperforms prior feature-based as well as existing neural ranking models. To the best of our knowledge, this paper presents the first substantial work tackling search over social media posts using neural ranking models.


1. Introduction

Neural networks have achieved great success in many natural language processing (NLP) tasks, such as question answering (Rao et al., 2016; Severyn and Moschitti, 2015), paraphrase detection (Socher et al., 2011), and textual semantic similarity modeling (He and Lin, 2016). Many of these tasks can be treated as variants of a semantic matching problem, where two pieces of text are jointly modeled through distributed representations of sentences for similarity learning. Various neural network architectures, e.g., Siamese networks (He et al., 2016), sequence-to-sequence models (Sutskever et al., 2014), and attention mechanisms (Yin et al., 2015), have been proposed to model the semantic similarity of a text pair using diverse modeling techniques.

Techniques based on deep learning and neural networks offer exciting opportunities for the information retrieval community. For example, distributed word representations (e.g., word2vec (Mikolov et al., 2013)) provide a promising solution to the vocabulary mismatch problem in ranking (Ganguly et al., 2015). However, fundamental challenges remain. Guo et al. (2016) pointed out that relevance matching, the core problem in IR, has different characteristics from the semantic matching problem that many NLP models are designed for. In particular, exact match signals still play a critical role in ranking, more so than term matching in, for example, paraphrase detection. Furthermore, in document ranking there is an asymmetry between queries and documents in terms of length and the richness of signals that can be extracted; thus, symmetric models such as Siamese architectures may not be entirely appropriate. Nevertheless, significant progress has been made, and many recently proposed neural ranking models (Xiong et al., 2017; Mitra et al., 2017; Shen et al., 2014; Huang et al., 2013; Pang et al., 2016) have been shown to be effective for ad hoc retrieval.

Despite much progress, it remains unclear how neural ranking models designed for "traditional" ad hoc retrieval tasks perform when searching social media posts such as tweets on Twitter. We can identify several important differences:

  • Document length. Social media posts are much shorter than web or newswire documents. For example, tweets are limited to 280 characters. Thus, ad hoc retrieval in this domain contains elements of semantic matching because queries and posts are much closer in length. In particular, neural models that rely on sentence-level or paragraph-level interactions and global matching mechanisms (Pang et al., 2016) are unlikely to be effective.

  • Informality. Idiosyncratic conventions (e.g., hashtags), abbreviations (“Happy Birthday” as “HBD”), typos, intentional misspellings, and emojis are prevalent in social media posts. An effective ranking model should account for such language variations and term mismatches due to the informality of posts.

  • Heterogeneous relevance signals. The nature of social media platforms drives users to engage actively with real-world news and events; users frequently take advantage of URLs or hashtags to gain exposure for their posts. Such heterogeneous signals, which can potentially boost ranking effectiveness when modeled together with the textual content, are not well exploited by existing models.

To this end, we present a novel neural ranking model for ad hoc retrieval over short social media posts that is specifically designed with the above characteristics in mind. Our model, MP-HCNN (Multi-Perspective Hierarchical Convolutional Neural Network), aims to model the relevance of a social media post to a query in a multi-perspective manner, and has three key features:

  1. To cope with the informality of social media and to support more robust matching, we apply word-level as well as character-level modeling, with URL-specific matching. This allows us to exploit noisy relevance signals at different granularities.

  2. Our model consists of stacked convolutional neural network layers to capture latent semantic soft-match signals between query and post contents. By gradually expanding the convolutional window in a hierarchical manner, increasingly larger contexts can be leveraged for modeling relevance, starting from character-level and word-level to phrase-level, and finally to sentence-level.

  3. Matching of learned representations between query and posts as well as URLs is accomplished with a pooling-based similarity measurement layer where term importance weights are injected at each convolutional layer as priors.

Finally, all relevance signals are integrated using a fully-connected layer to yield the final relevance ranking. Optionally, the neural matching score can be combined with lexical matching via linear interpolation to further enhance effectiveness.

Contributions. We view our contributions as follows:

  • We highlight three important characteristics of social media posts that make ad hoc retrieval over such collections different from searching web pages and newswire documents. Starting from these insights, we develop MP-HCNN, a novel neural ranking model specifically designed to address these characteristics. To the best of our knowledge, ours is the first neural ranking model developed specifically for ad hoc retrieval over social media posts.

  • We evaluate the effectiveness of our MP-HCNN model on four Twitter benchmark collections from the TREC Microblog Tracks 2011–2014. Our model is compared to learning-to-rank approaches as well as many recent state-of-the-art neural ranking models designed for web search and "traditional" ad hoc retrieval. Extensive experiments show that our model significantly outperforms previous approaches, advancing the state of the art. Ablation studies further confirm that these improvements come from specific components of our model designed to tackle the characteristics of social media posts identified above.

2. Background and Related Work

2.1. Learning to Rank

Ranking is the core problem in many information retrieval and natural language processing tasks, e.g., ad hoc retrieval (Joachims, 2002; Cao et al., 2007; Rao et al., 2017d) and question answering (Severyn and Moschitti, 2015; Rao et al., 2016, 2017a; Sequiera et al., 2017). Learning to rank (L2R) takes advantage of recent advances in machine learning to improve ranking effectiveness. Existing work on L2R falls into three main categories: pointwise, pairwise, and listwise. The main differences lie in the problem formulations, with different assumptions, input/output spaces, and loss functions. Pointwise methods, such as logistic regression (Gey, 1994), learn a relevance score for each query-document pair represented in a feature space, while pairwise approaches, such as LambdaMART (Burges, 2010) and RankSVM (Joachims, 2002), learn the relative preference between a pair of documents for a query; the sketch after this paragraph contrasts the two formulations. Listwise approaches, such as ListNet (Cao et al., 2007), directly optimize over the entire ranked list of documents for a query. The major drawback of L2R is that it requires effective hand-crafted feature engineering, which can be time-consuming, incomplete, and difficult to generalize to other problems.
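To make the pointwise/pairwise distinction concrete, the sketch below contrasts a logistic-regression-style pointwise loss with a RankSVM-style pairwise hinge loss on toy feature vectors. This is an illustrative sketch of the two objectives, not the implementation of any cited system, and all function names are our own.

```python
import numpy as np

def pointwise_loss(w, x, y):
    """Pointwise objective: score one query-document feature vector x
    against a binary relevance label y via logistic regression."""
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))  # predicted relevance probability
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def pairwise_hinge_loss(w, x_pos, x_neg, margin=1.0):
    """Pairwise objective: the relevant document x_pos should outscore
    the non-relevant document x_neg for the same query by a margin."""
    return max(0.0, margin - np.dot(w, x_pos) + np.dot(w, x_neg))
```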

2.2. Neural Information Retrieval

Recently, deep learning has achieved great success in many natural language and information retrieval applications (He and Lin, 2016; Sutskever et al., 2014; Yin et al., 2015; Yu et al., 2017; Rao et al., 2017b; Li et al., 2017; Rao et al., 2017c). Current neural approaches to IR can be divided into representation-based (Huang et al., 2013; Shen et al., 2014; Severyn and Moschitti, 2015) and interaction-based (Guo et al., 2016; Xiong et al., 2017; Dai et al., 2018; Mitra et al., 2017) approaches. Early attempts at neural IR mainly focused on representation-based modeling of the query and document, such as DSSM (Huang et al., 2013), C-DSSM (Shen et al., 2014), and SM-CNN (Severyn and Moschitti, 2015). DSSM (Huang et al., 2013) is a classical neural architecture for web search that maps word sequences to character-level trigrams using a word hashing layer, and then feeds the dense hashed features to a multi-layer perceptron (MLP) for similarity learning. C-DSSM (Shen et al., 2014) extends this idea by replacing the MLP in DSSM with a convolutional neural network (CNN) layer to capture local contextual signals from neighboring character trigrams. SM-CNN (Severyn and Moschitti, 2015) can be viewed as a hybrid approach, with a convolutional layer as its main component for learning discriminative representations of the query and document, plus a feature layer that exploits hand-crafted features.

Interaction-based approaches (Guo et al., 2016; Xiong et al., 2017; Dai et al., 2018; Mitra et al., 2017) operate on the similarity matrix of word pairs from the query and document. The similarity matrix is usually computed from word embeddings, such as word2vec (Mikolov et al., 2013), which mitigates the sparsity issues of count-based approaches. The DRMM approach (Guo et al., 2016) introduces a pyramid pooling technique to convert the similarity matrix into histogram representations, on top of which a term gating network aggregates weighted matching signals from different query terms. Inspired by DRMM, Xiong et al. (2017) propose K-NRM, which introduces a differentiable kernel-based pooling technique to capture matching signals at different strength levels. Dai et al. (2018) extend this idea to model soft-match signals for n-grams with an additional convolutional layer. The DUET model (Mitra et al., 2017) combines the representation-based and interaction-based ideas, with a global component for semantic matching and a local component for exact matching. Our model differs from previous work in a number of ways, as described in the introduction; detailed ablation experiments verify the contributions of the various components of our architecture.

3. Multi-Perspective Model

The core contribution of this paper is a novel neural ranking model specifically designed for ad hoc retrieval over short social media posts. As discussed in the introduction, our model, MP-HCNN (Multi-Perspective Hierarchical Convolutional Neural Network), has three key features: First, we apply word-level as well as character-level modeling of queries, posts, and URLs to cope with the informality of social media posts (Section 3.1). Second, we exploit stacked convolutional layers to learn soft-match relevance at multiple granularities (Section 3.2). Finally, we match the learned representations via pooling, with external term-importance weights injected at each layer (Section 3.3). Our overall model architecture is shown in Figure 1, and each of the above key features is described in detail below.

Figure 1. Overview of our Multi-Perspective Hierarchical Convolutional Neural Network, which consists of two parallel components for word-level and character-level modeling between queries, social media posts, and URLs. The two parallel components share the same architecture (with different parameters), which comprises hierarchical convolutional layers for representation learning and a semantic similarity layer for multi-level matching. Finally, all relevance signals are integrated using a fully-connected layer to produce the final relevance score.

3.1. Multi-Perspective Input-level Modeling

A standard approach in neural text processing is to take advantage of word embeddings (e.g., word2vec (Mikolov et al., 2013)) to encode each word. However, in the social media domain, informal post content produces a large number of out-of-vocabulary (OOV) words that cannot be found in pre-trained word embeddings; the embeddings of OOV words are randomly initialized by default. In fact, we observe that about 50%–60% of words are OOV in the TREC Microblog datasets (details in Table 2). This greatly complicates any matching process that relies solely on word-level semantics, motivating the need for character-level input modeling to cope with noisy text.
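As a concrete illustration of the problem, the snippet below computes the OOV rate of a token sequence against a pre-trained embedding vocabulary; the function name is ours and `vocab` is a stand-in for the word2vec vocabulary, not the actual data used in the paper.

```python
def oov_rate(tokens, vocab):
    """Fraction of tokens missing from a pre-trained embedding vocabulary."""
    oov = [t for t in tokens if t not in vocab]
    return len(oov) / max(len(tokens), 1)

# Example: informal tweet text with an abbreviation and a typo.
vocab = {"happy", "birthday", "transport"}
print(oov_rate(["hbd", "tansport", "happy"], vocab))  # 2 of 3 tokens are OOV
```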

To better understand the origins of OOV words, we randomly select 500 OOV words from the vocabulary and summarize the major sources of OOV occurrences in the social media domain, with a few examples of each:

  1. Compounds (42.4%): chrome-os, actor-director, earlystage

  2. Non-English words (29.2%): emociones (Spanish, "emotions"), desgostosa (Portuguese, "disgusted"), hayatım (Turkish, "sweetheart")

  3. Typos (17.1%): begngen (beggen), yawnn (yawn), tansport (transport), afternoo (afternoon), foreverrrr (forever)

  4. Abbreviations (5.6%): EASP (European Association of Social Psychology), b-day (birthday)

  5. Domain-specific words (5.7%): utf-8, vlookup

As we can see above, compounds, non-English words, and typos are the three major sources of OOV words. Character-level modeling is beneficial for both the compound and typo cases.

In addition, social media posts often contain heterogeneous elements that can carry valuable relevance signals, such as mentions, hashtags, and external URLs. An analysis of the TREC Microblog Track 2011–2014 datasets shows that around 50% of tweets contain one or more URLs; more detailed statistics can be found in Table 2. In fact, a closer look at the data reveals that many URLs can be fuzzily matched to query text; we provide one example in Table 1. For posts without URLs, we add a placeholder symbol "<URL>". Note that we do not model the documents referenced by the URLs, since many links become inaccessible over time and the HTML of many web documents is quite noisy, making text extraction difficult.

Query: MB001: BBC world service cuts
Tweet: BBC news - BBC world service cuts to be outlined to staff #bbcworldservice.
URL: http://bbc-world-service-to-cut-staff.html?spref=tw
Table 1. Example query-post pair retrieved for topic MB001 from the TREC Microblog 2011 dataset.

To tackle these language variation issues and exploit the URL information, we consider multiple inputs for relevance modeling: (1) query and post at the word level; (2) query and post at the character level; (3) query and URL at the character level. For character-level modeling, we partition the query, the post content, and the URL into sequences of character trigrams (e.g., "hello" becomes {#he, hel, ell, llo, lo#}), which has been shown to be effective at capturing morphological variations and reducing the vocabulary size for efficient learning (Huang et al., 2013). We then adopt the same architecture as the word-level semantic modeling to capture matching evidence at the character level, as discussed in the following section.
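A minimal sketch of the boundary-marked character trigram partitioning described above; the function name and the "#" boundary marker convention follow the "hello" example, and are our own rendering rather than code from the paper.

```python
def char_trigrams(text):
    """Split a token into boundary-marked character trigrams,
    e.g., 'hello' -> ['#he', 'hel', 'ell', 'llo', 'lo#']."""
    padded = "#" + text + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(char_trigrams("hello"))  # ['#he', 'hel', 'ell', 'llo', 'lo#']
```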

3.2. Hierarchical Representation Learning

Given a query $q$ and a document $d$, the textual matching component aims to learn a relevance score from the query terms $\{q_1, \ldots, q_n\}$ and document terms $\{d_1, \ldots, d_m\}$, where $n$ and $m$ are the number of terms in $q$ and $d$, respectively. To be clear, "document" can refer to either a social media post or a URL, and "term" refers to either words or character trigrams. One important novel aspect of our model is relevance modeling from multiple perspectives, and our architecture exhibits symmetry in the word- and character-level modeling (see Figure 1); thus, for expository convenience, we use "document" and "term" in the generic sense above. We first employ an embedding layer to convert each term into a $K$-dimensional vector representation, generating matrix representations $\mathbf{Q} \in \mathbb{R}^{n \times K}$ for the query and $\mathbf{D} \in \mathbb{R}^{m \times K}$ for the document. In the following, we introduce our representation learning method with hierarchical convolutional neural networks.

A convolutional layer applies convolutional filters to the text, which is represented by an embedding matrix $\mathbf{M}$, such as $\mathbf{Q}$ or $\mathbf{D}$. Let $\mathbf{W} \in \mathbb{R}^{w \times K}$ denote a convolutional filter with a window size of $w$ ($K$ is the size of the embeddings). We slide this filter over the input text, and at each step we sum up the term embeddings from the input matrix slice, weighted by the filter parameters $\mathbf{W}$. More formally, we obtain a vector representation $\mathbf{o}$ of the input, with the $t$-th dimension of $\mathbf{o}$ calculated as:

$$o_t = \sum \big( \mathbf{W} \odot \mathbf{M}[t : t+w-1] \big) + b,$$

where $b$ is a bias value added to the weighted sum and $\odot$ denotes the element-wise product (summed over all entries). Intuitively, $o_t$ can be regarded as a weighted average of the $t$-th $w$-gram in the input sentence, learned by the filter $\mathbf{W}$. To ensure a fixed-size output vector $\mathbf{o} \in \mathbb{R}^L$, we pad the input matrix $\mathbf{M}$ with zero columns such that $\mathbf{o}$ has a size of $L$, where $L$ equals $n$ for $\mathbf{Q}$ and $m$ for $\mathbf{D}$. To increase the modeling capacity, each convolutional layer applies $F$ different filters to the input, and therefore produces $F$ output vectors $\mathbf{o}^1, \ldots, \mathbf{o}^F$. Lastly, we concatenate all output vectors and apply the non-linear activation function ReLU element-wise to obtain the output representation matrix $\mathbf{M}_o \in \mathbb{R}^{L \times F}$ for this CNN layer:

$$\mathbf{M}_o = \mathrm{ReLU}\big( [\mathbf{o}^1; \mathbf{o}^2; \ldots; \mathbf{o}^F] \big).$$

This CNN layer with $F$ filters comprises $(w \cdot K + 1) \cdot F$ parameters: $w \cdot K \cdot F$ from the filters and $F$ from the bias terms.
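To make the computation concrete, here is a minimal NumPy sketch of one such convolutional layer. The function names and the choice to zero-pad on the right are our own assumptions, not code from the paper.

```python
import numpy as np

def conv_filter_pass(M, W, b):
    """Apply one convolutional filter W (w x K) over an embedding
    matrix M (L x K), returning an L-dimensional output vector o.
    M is zero-padded so that every window position t is valid."""
    L, K = M.shape
    w = W.shape[0]
    M_pad = np.vstack([M, np.zeros((w - 1, K))])  # pad with zero rows
    return np.array([np.sum(W * M_pad[t:t + w]) + b for t in range(L)])

def conv_layer(M, filters, biases):
    """Stack F filter outputs column-wise and apply ReLU element-wise,
    yielding the L x F output representation matrix."""
    cols = [conv_filter_pass(M, W, b) for W, b in zip(filters, biases)]
    return np.maximum(0.0, np.stack(cols, axis=1))
```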

We then stack multiple convolutional layers in a hierarchical manner to obtain higher-level $n$-gram representations. For notational simplicity, we drop the subscript $o$ from all output matrices and add a superscript $h$ to denote the output of the $h$-th convolutional layer. Stacking CNN layers therefore corresponds to obtaining the output matrix of the $h$-th layer via:

$$\mathbf{M}^h = \mathrm{CNN}^h\big( \mathbf{M}^{h-1} \big),$$

where $\mathbf{M}^{h-1}$ is the output matrix of the $(h-1)$-th convolutional layer. Note that $\mathbf{M}^0$ denotes the matrices $\mathbf{Q}$ and $\mathbf{D}$ obtained directly from the word embedding layer, and the parameters of each CNN layer are shared by the query and document inputs.

Intuitively, consecutive convolutional layers allow us to obtain higher-level abstractions of the texts, starting from the character or word level to the phrase level and eventually to the sentence level. A single CNN layer captures the $w$-gram semantics of the input embeddings, and two CNN layers together expand the context window to up to $2w - 1$ terms. Generally speaking, the deeper the convolutional layers, the wider the context considered for relevance matching. Empirically, we found that a filter width of 2 for word-level inputs and 4 for character-level inputs worked well, with the number of convolutional layers set to 4. This setting is reasonable, as it enables us to gradually learn representations of word-level and character-level $n$-grams of up to length $N(w-1)+1$ with $N$ layers (i.e., 5 words or 13 character trigrams). Since most queries and documents in the social media domain are shorter than or close to this length, we can treat the outputs of the last convolutional layer as an approximation of sentence representations.

An alternative to our deep hierarchical design is a wide architecture, which reduces the depth but expands the width of the network by applying multiple convolutional layers with different filter sizes in parallel to learn variable-sized phrase representations. However, such a design requires quadratically more parameters and is less efficient than our approach. More specifically, our deep model comprises on the order of $N \cdot w \cdot K \cdot F$ parameters with $N$ CNN layers, while a wide architecture covering the same maximum representation window would need filters of every size from $w$ up to $N(w-1)+1$, on the order of $N^2 \cdot w \cdot K \cdot F$ parameters. The saved parameters mainly come from reusing representations at each CNN layer, which also generalizes the learning process by sharing representations between successive layers.
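As a back-of-the-envelope check of this deep-versus-wide argument, the sketch below counts parameters under our stated assumptions (the first layer reads $K$-dimensional embeddings, subsequent layers read $F$-dimensional feature maps); the exact constants for the paper's configuration may differ.

```python
def deep_params(N, w, K, F):
    """Approximate parameter count for N stacked CNN layers with
    window w, embedding size K, and F filters per layer."""
    first = (w * K + 1) * F          # first layer reads K-dim embeddings
    rest = (N - 1) * (w * F + 1) * F  # later layers read F-dim feature maps
    return first + rest

def wide_params(N, w, K, F):
    """A wide alternative needs parallel filters of every effective
    width w, 2w-1, ..., N(w-1)+1 applied directly to the embeddings."""
    widths = [h * (w - 1) + 1 for h in range(1, N + 1)]
    return sum((width * K + 1) * F for width in widths)

# Example with the word-level setting from the paper (w=2, N=4)
# and assumed sizes K=300, F=256.
print(deep_params(4, 2, 300, 256), wide_params(4, 2, 300, 256))
```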

3.3. Similarity Measurement and Weighting

To measure the similarity between the query and the document, we match the query with the document at each convolutional layer by taking the dot product between the query representation matrix $\mathbf{Q}^h$ and the document representation matrix $\mathbf{D}^h$:

$$\mathbf{S}^h = \mathbf{Q}^h \cdot (\mathbf{D}^h)^\top, \quad \mathbf{S}^h \in \mathbb{R}^{n \times m},$$

where $S^h_{ij}$ can be considered the similarity score obtained by matching the query phrase vector $\mathbf{Q}^h_i$ with the document phrase vector $\mathbf{D}^h_j$. Since the query and document share the same convolutional layers, similar phrases are placed closer together in the high-dimensional embedding space, and their product yields larger scores. Next, we obtain a normalized similarity matrix $\tilde{\mathbf{S}}^h$ by applying a softmax function over each row of $\mathbf{S}^h$ to map the similarity scores into the $[0, 1]$ range:

$$\tilde{\mathbf{S}}^h_{i,:} = \mathrm{softmax}\big( \mathbf{S}^h_{i,:} \big), \quad i = 1, \ldots, n.$$

For each query phrase $i$, this softmax normalizes its matching scores over all phrases in the document and helps discriminate matches with significantly higher scores; an exact match will dominate the others and contribute a similarity score close to 1.0. We then apply max and mean pooling to the normalized similarity matrix to obtain discriminative feature vectors:

$$\mathbf{o}^h_{\max} = \max_{\mathrm{row}}\big( \tilde{\mathbf{S}}^h \big) \in \mathbb{R}^n, \qquad \mathbf{o}^h_{\mathrm{mean}} = \mathrm{mean}_{\mathrm{row}}\big( \tilde{\mathbf{S}}^h \big) \in \mathbb{R}^n.$$

Each score generated by pooling can be viewed as matching evidence for a specific query phrase against the document, and its value denotes the significance of the relevance signal. Compared to max pooling, mean pooling is beneficial when a query phrase matches multiple relevant terms in the document.

To measure the relative importance of different query terms and phrases, we inject external weights as prior information by multiplying each score after pooling by the weight of the corresponding query term/phrase. These are provided as feature inputs to the subsequent learning-to-rank layer, denoted by $\boldsymbol{\phi}^h$:

$$\boldsymbol{\phi}^h = \big[\, \mathbf{w}^h \odot \mathbf{o}^h_{\max} \,;\; \mathbf{w}^h \odot \mathbf{o}^h_{\mathrm{mean}} \,\big] \qquad (1)$$

where $\odot$ is the element-wise product between the weights of query terms/phrases and the pooling scores, and $w^h_i$ denotes the weight of the $i$-th term or phrase in the query. These weights change across the intermediate CNN layers, since deeper CNN layers represent longer phrases. Note that the weights of long phrases become sparse as the depth of the CNN layers increases; therefore we only use weights for the first two CNN layers ($h \leq 2$) for word-level inputs, and likewise only for the lower layers for character-level inputs, with the weights of the upper layers assigned a default value of 1.0. We choose the classical inverse document frequency (IDF) as our weighting measure: a higher IDF weight implies a rarer occurrence in the collection and thus greater discriminative power. This weighting also allows us to reduce the impact of high matching scores from common words such as stopwords. Other weighting mechanisms are possible, such as weights generated by a pseudo-relevance feedback method (Lavrenko and Croft, 2001) or a sequential dependence model (Metzler and Croft, 2005); we leave these as future directions.
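A minimal NumPy sketch of this parameter-free similarity layer for one CNN depth $h$, assuming representations `Q_h` ($n \times F$) and `D_h` ($m \times F$) produced by a shared convolutional layer, plus per-query-phrase IDF weights; the function names are ours.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def similarity_features(Q_h, D_h, idf_weights):
    """Parameter-free similarity layer: dot-product matching,
    row-wise softmax normalization, max/mean pooling over document
    positions, then IDF weighting of the pooled scores (Equation 1)."""
    S = Q_h @ D_h.T                # (n x m) phrase-phrase similarity scores
    S_norm = softmax(S, axis=1)    # normalize over document phrases
    o_max = S_norm.max(axis=1)     # strongest match per query phrase
    o_mean = S_norm.mean(axis=1)   # average match per query phrase
    return np.concatenate([idf_weights * o_max, idf_weights * o_mean])
```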

Our similarity measurement layer has two important properties. First, all of the operations here (matching, softmax, pooling, and weighting) have no learnable parameters; second, this parameter-free nature makes our model highly interpretable and more robust to overfitting. By matching query phrases with document phrases jointly, we can easily track which phrase matches contribute the most relevance signal to the final prediction. This greatly boosts the interpretability of our model, which has become a prevalent concern with complicated neural models for IR and NLP applications (Li et al., 2015).

3.4. Evidence Integration

Given the similarity features $\boldsymbol{\phi}_w$ learned from the word-level inputs (from Equation 1) and $\boldsymbol{\phi}_c$ from the character-level inputs, we employ a simple fully-connected module with two linear layers and a non-linear ReLU activation in between as our learning-to-rank component:

$$\mathrm{pred} = \mathrm{softmax}\big( \mathbf{W}_2 \cdot \mathrm{ReLU}( \mathbf{W}_1 \cdot [\boldsymbol{\phi}_w; \boldsymbol{\phi}_c] + \mathbf{b}_1 ) + \mathbf{b}_2 \big),$$

where $\mathbf{W}_1, \mathbf{W}_2$ and $\mathbf{b}_1, \mathbf{b}_2$ are the weight matrices and bias vectors of the two linear layers, and $[\cdot\,;\cdot]$ denotes concatenation. The outer softmax function normalizes the final prediction to values between 0 and 1. The training goal is to minimize the negative log-likelihood loss summed over all samples:

$$\mathcal{L} = - \sum_i \log p(y_i \mid q_i, d_i),$$

where $y_i$ is the annotation label of sample $i$.
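An illustrative Keras sketch of this learning-to-rank module (Keras is the framework the model is implemented in, per Section 4.3). The input feature sizes here are placeholders of our choosing, and the categorical cross-entropy loss corresponds to the negative log-likelihood objective above; this is a sketch under those assumptions, not the paper's released code.

```python
from tensorflow import keras
from tensorflow.keras import layers

phi_word = keras.Input(shape=(64,))  # word-level similarity features (size assumed)
phi_char = keras.Input(shape=(64,))  # character-level similarity features (size assumed)

x = layers.Concatenate()([phi_word, phi_char])
x = layers.Dense(150, activation="relu")(x)      # first linear layer + ReLU (size 150, Sec. 4.3)
pred = layers.Dense(2, activation="softmax")(x)  # relevant vs. non-relevant

model = keras.Model([phi_word, phi_char], pred)
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.05),
              loss="sparse_categorical_crossentropy")  # negative log-likelihood
```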

3.5. Interpolation with Language Model

Various studies have shown that neural network-based models are good at capturing soft-match signals (Guo et al., 2016; Xiong et al., 2017). But are exact match signals still useful to neural network-based methods? We examine this question by adopting a commonly-used linear interpolation method to combine the ranking score of the NN-based model with that of a language model for a (query, document) pair:

$$\mathrm{Score}(q, d) = \lambda \cdot \mathrm{Score}_{\mathrm{NN}}(q, d) + (1 - \lambda) \cdot \mathrm{Score}_{\mathrm{LM}}(q, d) \qquad (2)$$

The best hyper-parameter $\lambda$ is tuned on the training and validation sets, and the interpolated scores are used for re-ranking. We choose the query-likelihood (QL) method (Ponte and Croft, 1998) as the language model. The interpolation technique is applied to our multi-perspective model as well as to the other NN-based baselines in this paper; we report effectiveness both with and without interpolation in the experimental section.
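A small sketch of the interpolation in Equation 2 and of tuning $\lambda$ on held-out queries. It assumes per-document scores from both rankers have already been normalized to comparable scales, and all function names (including the `evaluate` callback, e.g., MAP against relevance judgments) are ours.

```python
def interpolate(nn_scores, lm_scores, lam):
    """Equation 2: linear interpolation of neural and language model scores."""
    return {d: lam * nn_scores[d] + (1 - lam) * lm_scores[d] for d in nn_scores}

def tune_lambda(nn_scores, lm_scores, qrels, evaluate, grid=None):
    """Pick the lambda maximizing a validation metric such as MAP;
    `evaluate` scores an interpolated ranking against judgments `qrels`."""
    grid = grid or [i / 10 for i in range(11)]
    return max(grid, key=lambda lam: evaluate(
        interpolate(nn_scores, lm_scores, lam), qrels))
```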

4. Experimental Setup

4.1. Dataset

To evaluate our proposed model for social media search, we use four Twitter test collections from the TREC Microblog Tracks in 2011, 2012, 2013, and 2014; each contains about 50 queries. We use the open-source tweet search implementation provided by the TREC Microblog API (https://github.com/lintool/twitter-tools) to retrieve up to 1000 tweets per query using the query likelihood (QL) method. This helps us rule out the effects of different preprocessing strategies in collection preparation (e.g., tokenization, stemming). The statistics of the four datasets are shown in Table 2. Since most URLs in tweet content are masked and shortened (for example, http://zdxabf), we recover the original URL addresses by following redirects for character-level modeling. The recovered URLs are truncated to a maximum of 120 characters.

Following standard experimental procedures, we evaluate our models on a reranking task, using as input the top 1000 documents (tweets) retrieved by a bag-of-words QL ranking. We use the Stanford Tokenizer (https://nlp.stanford.edu/software/tokenizer.shtml) to divide the retrieved tweets into token sequences that serve as model input. Non-ASCII characters are removed and no stemming is performed. We run four sets of experiments in which each of the four datasets is used for evaluation, with the other three used for training (e.g., train on TREC 2011–2013, test on TREC 2014). In each experiment, we sample 10% of the training queries as the validation set. Following the official track guidelines (Ounis et al., 2011), we adopt mean average precision (MAP) and precision at 30 (P@30) as our evaluation metrics; minimal reference implementations of both are sketched below. Relevance judgments are made on a three-point scale ("not relevant", "relevant", "highly relevant"), and we treat both higher grades as relevant, also per Ounis et al. (2011). All data used in this paper are publicly available (https://github.com/Jeffyrao/TREC-Microblog-Datasets).
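For reference, minimal implementations of the two metrics, written by us rather than taken from the official trec_eval tooling:

```python
def average_precision(ranked_docs, relevant):
    """AP for one query: mean of the precision values at each relevant hit.
    MAP is the mean of this value over all query topics."""
    hits, precisions = 0, []
    for i, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / max(len(relevant), 1)

def precision_at_k(ranked_docs, relevant, k=30):
    """P@k: fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked_docs[:k] if doc in relevant) / k
```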

Test Set 2011 2012 2013 2014
# of query topics 49 60 60 55
# of query-doc pairs 39,780 49,879 46,192 41,579
# of relevant docs 1,940 4,298 3,405 6,812
# of unique words 21,649 27,470 24,546 22,099
# of unique OOV words 13,067 17,190 15,724 14,331
# of URLs 20,351 25,405 23,100 20,885
# of hashtags 6,784 8,019 7,869 7,346
Table 2. Statistics of the TREC Microblog Track datasets.

4.2. Baselines

We compare our model to a number of non-neural baselines as well as recent neural ranking models designed for “standard” ad hoc retrieval tasks on web and newswire documents (we call these the neural baselines). The non-neural baselines are as follows:

  1. Query Likelihood (QL) (Ponte and Croft, 1998) is the most widely-used language modeling baseline.

  2. RM3 (Lavrenko and Croft, 2001) is an interpolation model combining the QL score with a relevance model using pseudo-relevance feedback.

  3. LambdaMART (Burges, 2010) is a competitive ranking algorithm that won the Yahoo! Learning to Rank Challenge (Burges et al., 2011). We designed three sets of features: (a) text-based: in addition to QL, we compute four overlap-based measures between each query-tweet pair (word overlap and IDF-weighted word overlap, computed between all words and between non-stopwords only, from Severyn and Moschitti (2015)); (b) URL-based: whether the tweet contains URLs and the fraction of query terms that match parts of URLs; (c) hashtag-based: whether the tweet contains hashtags and the fraction of query terms that match hashtags.

The neural baselines are as follows:

  1. DSSM (2013) (Huang et al., 2013) is one of the earliest neural architectures for web search; it uses word hashing to model interactions between queries and documents at the level of character 3-grams.

  2. C-DSSM (2014) (Shen et al., 2014) is a variant of DSSM that replaces the fully-connected layer in DSSM with a CNN-based model to capture local contextual signals from neighboring n-grams.

  3. MatchPyramid (2016) (Pang et al., 2016) uses a CNN-based model to extract matching patterns at the word, phrase, and sentence levels from a similarity matrix.

  4. DRMM (2016) (Guo et al., 2016) is an interaction-based approach that converts the similarity matrix of query and document to a histogram representation for relevance prediction.

  5. DUET (2017) (Mitra et al., 2017) is a document ranking model that combines a local component for exact matching with a global component for semantic matching between the query and document.

  6. K-NRM (2017) (Xiong et al., 2017) introduces a differentiable kernel-based layer to capture multi-level granularities of soft match signals from the input similarity matrix.

4.3. Implementation Details

Dataset Preprocessing. The same padding strategy is used across the four datasets, setting the padded length to the largest query/document/URL length: each query is padded to 10 words and 51 characters, each tweet to 68 words and 140 characters, and each URL to 120 characters. Mentions are removed, and hashtags are treated as normal words (e.g., "#bbc" becomes "bbc"). The IDF weights of word and character n-grams are computed from the Tweets2013 collection (Lin and Efron, 2013), which consists of 243 million tweets crawled from Twitter's public sample stream between February 1 and March 31, 2013 (inclusive).
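A minimal sketch of the IDF weighting, assuming document frequencies precomputed from the background collection (here, Tweets2013); the add-one smoothing in the denominator is our assumption, since the exact IDF variant is not specified here.

```python
import math

def idf_weights(terms, doc_freq, num_docs):
    """Classical IDF: terms that are rarer in the background collection
    receive higher weights (and stopwords receive low weights)."""
    return [math.log(num_docs / (doc_freq.get(t, 0) + 1)) for t in terms]
```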

Model Training. To enable fair comparisons with the baselines, we adopt the same tuning strategies (e.g., embeddings, optimizer, and hyper-parameter tuning) in our experiments. We use the 300-dimensional word2vec (Mikolov et al., 2013) word vectors pre-trained on the Google News dataset of 100B tokens. From Table 2, more than 50% of words are out of the word2vec vocabulary (OOV words) across all datasets. This could have a negative impact on model effectiveness, since the embeddings of those OOV words and of the character trigrams are both randomly initialized by uniform sampling. All embeddings are updated during training. Stochastic gradient descent (SGD) with a learning rate of 0.05 and a batch size of 256 is used for training. The linear layer size in the learning-to-rank component is set to 150. The convolutional filter widths are set to 2 for words and 4 for characters, and the maximum number of convolutional layers is set to 4. The number of convolutional filters and the dropout rate are tuned on the validation set. At test time, we select the model that obtains the lowest validation loss. The interpolation parameter $\lambda$ is tuned after the neural network model converges. Our model is implemented using the Keras framework, while the other neural baselines are open-sourced in the MatchZoo library (https://github.com/faneshion/MatchZoo).

Model Size. The total number of parameters in our proposed model is about 71M, of which 48% come from the learnable word embeddings and another 47% from the character trigram embeddings; only about 5% (3.5M) come from the convolutional layers and the learning-to-rank layer. It is worth noting that although the word-level and character-level inputs share the same architecture, they have different parameters; for character-level inputs, query-post and query-URL modeling share the same parameters. Training takes about 3 minutes per epoch on a GPU machine (GeForce GTX 1080) with 8 GB of memory and usually converges within 10 epochs.

ID Model 2011 2012 2013 2014
Metric MAP P@30 MAP P@30 MAP P@30 MAP P@30
Non-Neural Baselines
1 QL (Ponte and Croft, 1998) 0.3576 0.4000 0.2091 0.3311 0.2532 0.4450 0.3924 0.6182
2 RM3 (Lavrenko and Croft, 2001) 0.3824 0.4211 0.2342 0.3452 0.2766 0.4733 0.4480 0.6339
3 LambdaMART (Burges, 2010) (all) 0.3845 0.4279 0.2291 0.3559 0.2477 0.4617 0.3943 0.6200
(text) 0.3547 0.4027 0.2072 0.3294 0.2394 0.4456 0.3824 0.6091
(text+URL) 0.3816 0.4272 0.2317 0.3667 0.2489 0.4506 0.3974 0.6206
(text+hashtag) 0.3473 0.4020 0.2039 0.3175 0.2447 0.4533 0.3815 0.5939
Neural Baselines
4 DSSM (Huang et al., 2013) (2013) 0.1742 0.2340 0.1087 0.1791 0.1434 0.2772 0.2566 0.4261
5 C-DSSM (Shen et al., 2014) (2014) 0.0887 0.1122 0.0803 0.1525 0.0892 0.1717 0.1884 0.2752
6 DUET (Mitra et al., 2017) (2017) 0.1533 0.2109 0.1325 0.2356 0.1380 0.2528 0.2680 0.4091
7 MatchPyramid (Pang et al., 2016) (2016) 0.1967 0.2259 0.1334 0.2390 0.1378 0.2561 0.2722 0.4491
8 DRMM (Guo et al., 2016) (2016) 0.2635 0.3095 0.1777 0.3169 0.2102 0.4061 0.3440 0.5424
9 K-NRM (Xiong et al., 2017) (2017) 0.2519 0.3034 0.1607 0.2966 0.1750 0.3178 0.3472 0.5388
Neural Baselines with Interpolation
10 DRMM+ 0.3477 0.4034 0.2213 0.3537 0.2639 0.4772 0.4042 0.6139
11 DUET+ 0.3576 0.4000 0.2243 0.3644 0.2779 0.4878 0.4219 0.6467
12 K-NRM+ 0.3576 0.4000 0.2277 0.3520 0.2721 0.4756 0.4137 0.6358
Our Model
13 MP-HCNN 0.3940 0.4306 0.2313 0.3757 0.2856 0.5211 0.4178 0.6279
14 MP-HCNN+ 0.4040 0.4435 0.2482 0.3915 0.2937 0.5250 0.4403 0.6455
(+12.9%) (+10.8%) (+18.6%) (+18.2%) (+15.9%) (+17.9%) (+12.2%) (+4.4%)
Table 3. Main results on the TREC Microblog 2011–2014 datasets. Rows are numbered in the first column for convenience, and each row represents a model or a contrastive condition. Superscripts indicate the row indexes for which a metric difference is statistically significant at $p < 0.05$.

5. Results

Our main results are shown in Table 3, with rows numbered in the first column for convenience. We compare our model to three sets of baselines: non-neural, neural, and interpolated. We run statistical significance tests using Fisher's two-sided, paired randomization test (Smucker et al., 2007), sketched below, against the three non-neural baselines: QL, RM3, and LambdaMART (with all features). Superscripts indicate the row indexes for which a metric difference is statistically significant at $p < 0.05$.
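For clarity, a compact sketch of the paired randomization test, following the general procedure described by Smucker et al. (2007); the number of trials, the coin-flip sign permutation, and the function name are our assumptions, not the exact implementation used in these experiments.

```python
import random

def randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired randomization test over per-query metric scores:
    randomly swap the two systems' scores for each query and count how
    often the absolute mean difference meets or exceeds the observed one."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    count = 0
    for _ in range(trials):
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(permuted) / len(permuted)) >= observed:
            count += 1
    return count / trials  # two-sided p-value
```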

From the first block, "Non-Neural Baselines," in Table 3, we can see that RM3 significantly outperforms QL on all datasets, demonstrating its superior effectiveness. However, RM3 requires an extra round of retrieval to select terms for query expansion, which is substantially slower. LambdaMART achieves effectiveness on par with RM3 when using all the hand-crafted features. From its contrastive variant with only text-based features, we can see that the overlap-based features provide little gain over QL. Comparing the rows "(text+URL)" and "(text+hashtag)" to the row "(text)", adding URL-based features leads to a significant improvement over text-based features alone, while hashtag-based features bring less benefit. This confirms our observation from Table 2 that URLs appear frequently in tweets and contain meaningful relevance signals.

Looking at the second block, "Neural Baselines," we find that all the neural methods perform worse than the QL baseline. In fact, the character-based approaches (DSSM, C-DSSM, and DUET) are consistently worse than the word-based approaches (MatchPyramid, DRMM, K-NRM). This is likely attributable to the fact that all word-based models use pre-trained word vectors that encode more semantics than randomly initialized character trigram embeddings, suggesting that the Twitter datasets are not sufficient to support learning character-based representations from scratch. In particular, C-DSSM suffers more than DSSM, showing that a more complex model leads to lower effectiveness in a data-poor setting. Comparing the three word-based models, DRMM appears most effective while MatchPyramid is the worst. Considering that the three models share the same embedding-based similarity matrix as input, the large effectiveness differences between DRMM/K-NRM and MatchPyramid suggest that term weighting is crucial for tweet search. In addition, the small parameter space of DRMM (161 parameters in total) affirms that low effectiveness is not simply due to a shortage of data. In comparison, our MP-HCNN achieves high effectiveness on all datasets across both metrics, significantly beating all baselines in most settings.

Setting 2011 2012 2013 2014
Metric MAP P@30 MAP P@30 MAP P@30 MAP P@30
QL 0.3576 0.4000 0.2091 0.3311 0.2532 0.4450 0.3924 0.6182
Full MP-HCNN 0.3940 0.4306 0.2313 0.3757 0.2856 0.5211 0.4178 0.6279
w/o mean pooling 0.3687 0.4054 0.2251 0.3480 0.2766 0.5000 0.3907 0.5897
w/o max pooling 0.0982 0.1320 0.0767 0.1243 0.0920 0.1706 0.1934 0.2176
w/o IDF weighting 0.3511 0.3714 0.2119 0.3452 0.2717 0.4967 0.3992 0.6097
w/o word module 0.1651 0.1293 0.0762 0.1119 0.0987 0.1517 0.1849 0.2048
w/o URL char rep. 0.3594 0.3707 0.2131 0.3333 0.2797 0.4989 0.4037 0.6085
w/o doc char rep. 0.3603 0.3721 0.2188 0.3537 0.2757 0.5122 0.4012 0.6103
w/o char module 0.3528 0.3709 0.2087 0.3271 0.2718 0.5011 0.4050 0.6091
Table 4. Ablation study, where each row removes one component from the full MP-HCNN model. ∗ denotes a score significantly lower than the base MP-HCNN model at $p < 0.05$.

In the third block, "Neural Baselines with Interpolation," we observe that simple interpolation with QL dramatically boosts the effectiveness of all neural baselines, showing that the exact match signal is complementary to the soft-match signals captured by neural methods. This observation also holds for our MP-HCNN, only with a smaller margin of improvement (owing to the effectiveness of MP-HCNN alone). The best results on the TREC Microblog 2011–2013 datasets are all achieved by MP-HCNN+, with an average relative improvement of 15% over QL (shown in the last row). A minor exception is TREC 2014, where the QL baseline already achieves fairly high absolute numbers, limiting the room for improvement.

Overall, our findings are consistent in the base model and interpolation setups: (1) existing NN models do not appear to provide effective rankings alone, while some are marginally effective with interpolation, showing that these ranking models fail to adapt to tweet search; (2) our MP-HCNN is more effective than the neural and non-neural baselines we examined, suggesting that our customized design is necessary to capture domain-specific characteristics and challenges.

5.1. Ablation Study

To better understand the contribution of each module in our proposed model, we perform an ablation study on the base MP-HCNN model, removing one component at a time. Here, we aim to study how the pooling, weighting, word-level, and character-level modules contribute to model effectiveness. The results are shown in Table 4, with each row denoting the removal of a specific module; for example, the row "w/o URL char rep." represents removing the URL modeling module. The symbol ∗ denotes that a model's effectiveness in an ablation setting is significantly lower than that of the base MP-HCNN model at $p < 0.05$. We also include QL in the table as a reference.

From the first two rows, "w/o mean pooling" and "w/o max pooling," we can see that removing max pooling leads to a significant performance drop, while taking out mean pooling results in only a minor reduction. This matches our observation that most query terms receive at most one exact or relevant match in a short tweet; the mean-pooled matching features are largely dominated by max pooling, which selects the largest matching score for each query term. Also, removing the IDF weights makes the results consistently and significantly worse across the four datasets, confirming that injecting external weights is important for tweet search. It is also no surprise that the complete word-level module is essential to model effectiveness, as shown in the table.

Turning our attention to the last three rows, we observe that removing the character representations of URLs or documents each leads to significant drops across all datasets, with larger drops when the URL representations are removed. This suggests that URLs provide more relevance signal than character-level document modeling. Taking away the entire character-level module causes slightly more effectiveness loss. To conclude, the word-level matching module contributes the most effectiveness, but the character-level matching module still provides complementary and significantly useful signals. However, recalling the low effectiveness of the character-based methods in Table 3, we add a caveat: with more training data or pre-trained character trigram embeddings, we would expect the benefits of the character-level matching module to grow.

Additionally, we examine how the depth of the hierarchical convolutional layers affects model effectiveness. Figure 2 shows the MAP scores obtained with different convolutional depths $N$ on the TREC 2011–2014 datasets. A setting of $N = 0$ means there are no convolutional layers on top of the embedding layer, so the prediction is based purely on word-level matching evidence; larger values of $N$ indicate that wider ranges of phrases are represented and modeled. We can clearly see a consistent climbing pattern with increasing depth on all datasets, with a minor exception on TREC 2011. For the 2011, 2012, and 2014 datasets, the improvements at small depths are already quite close to the upper bound at $N = 4$. This implies that modeling short phrases brings immediate effectiveness gains, while the inclusion of longer phrases further boosts overall effectiveness. We do not explore values beyond $N = 4$, as that depth already enables us to model a window of five consecutive words, which is longer than most queries and close to the length of many tweets. Overall, this ablation experiment clearly shows the value of our hierarchical convolutional layers for semantic modeling at the phrasal level.

Figure 2. MAP scores with different convolutional depths on the TREC 2011–2014 datasets.

5.2. Error Analysis

Figure 3. Per-query MAP differences of MP-HCNN and MP-HCNN+ vs. QL on TREC 2011.
Figure 4. Per-query MAP differences of MP-HCNN and MP-HCNN+ vs. QL on TREC 2012.

So far, we have shown that our weighted similarity measurement component, as well as URL matching and phrase matching (enabled by the hierarchical architecture), are crucial to our model's effectiveness. However, two questions remain: (1) What are the common characteristics of well-performing queries, and how do the different components contribute to their effectiveness? (2) When does our model fail, and how can it be further improved? We therefore provide additional qualitative and quantitative analysis of sample tweets from well-performing and poorly-performing queries.

In Figures 3 and 4, we visualize the per-query improvements in MAP of MP-HCNN and MP-HCNN+ over the QL baseline on the TREC 2011 and 2012 datasets, respectively; since the TREC 2013 and 2014 datasets exhibit similar trends, we omit their figures. Overall, the base MP-HCNN model shows improvements for the majority of queries in both datasets: in 2011, MP-HCNN wins on 26 topics and loses on 13 out of 49 topics; in 2012, it wins on 35 topics and loses on 19 out of 60. The average margin of improvement is also greater than that of the losses. With the interpolation technique, MP-HCNN+ is able to smooth out the errors on many poorly-performing topics, such as topic 5 ("nist computer security"), resulting in more stable improvements.

For the five best-performing queries (15, 17, 39, 91, 105), we select the top 20 tweets per query, sorted by MP-HCNN prediction score, for analysis. We manually classify the matching evidence of the selected 100 tweets into the following categories (a tweet can satisfy multiple categories):

  • Exact word match: the tweet has exact word matches with the query.

  • Exact phrase match: the tweet has exact phrase matches with the query.

  • Partial paraphrase match: the tweet contains a partial paraphrase of the query. For example, the phrase "the white stripes call it quits" partially matches query 17 ("white stripes breakup").

  • Partial URL match: the query is contained in or partially matched to the URL in the tweet.

Category Percentage (%)
Exact word match 100
Exact phrase match 44
Partial paraphrase match 59
Partial URL match 29
Table 5. Matching evidence breakdown by category based on manual analysis of the top 100 tweets for the five best-performing topics.
ID | Query | Sample Tweet | Label | QL Score (Rank) | MP-HCNN Score (Rank)
1 | 2: 2022 fifa soccer | #ps3 best sellers: fifa soccer 11 ps3 #cheaptweet https://www.amazon.com/fifa-soccer-11-playstation-3 | I | 7.33 (#54) | 0.85 (#1)
2 | | qatar's 2022 fifa world cup stadiums: https://wordlesstech.com/qatars-2022-fifa-world-cup-stadiums/ | R | 10.58 (#2) | 0.41 (#105)
3 | | 2022 world cup could be held at end of year: fifa: lausanne switzerland the 2022 world cup in qatar: http://www.reuters.com/article/us-soccer-world-blatter | R | 11.25 (#1) | 0.31 (#127)
4 | 5: nist computer security | cybersecurity: nist provides advice on securing full virtualization technologies: the national #security #hacker https://www.infosecurity.com/news/nist-provides-advice-on-securing-full/ | R | 9.79 (#6) | 0.39 (#1)
5 | | photo: abdul buvar (computer security expert) malware expert and consultant for network security as a http://krr48.tumblr.com/post/abdul-buvar-computer-security-expert-malware | I | 5.40 (#45) | 0.28 (#2)
6 | | new nist guidance tackles public cloud security: 2 other special pubs on cloud defs virtualization http://www.govinfosecurity.com/articles.php?art_id=3321 | R | 9.79 (#5) | 0.24 (#5)
Table 6. Sample analysis of the poorly-performing topic 2 ("2022 fifa soccer") and topic 5 ("nist computer security"). I denotes irrelevant and R denotes relevant.

Table 5 provides a breakdown of matching evidence by category. All tweets have exact word matches with the queries, and partial paraphrase matches occur more frequently than exact phrase matches, suggesting that our hierarchical architecture with embedding inputs is able to capture soft semantic match signals. In addition, partial URL matches make up another large portion, affirming the need for character-level URL modeling.

To gain additional insight into how our model fails, we select sample tweets for the worst-performing queries, 2 ("2022 fifa soccer") and 5 ("nist computer security"); some of these are shown in Table 6. The column "Label" indicates whether the tweet is relevant to the query: "R" denotes relevant and "I" denotes irrelevant. The column "Score/Rank" shows the prediction score and rank position of each sample tweet under each method (QL or MP-HCNN). In addition, we visualize the matching scores produced by the similarity measurement layer: the scores, normalized to the range [0, 1] by the softmax function, are shown with a pink background, where a brighter color indicates a higher score. For example, in the second tweet, the word "fifa" has a matching score of 0.99 against the query, while "2022" has a matching score of 0.22.

The first tweet obtains the highest MP-HCNN score due to the phrase match "fifa soccer" (matching score 0.89) in both the content and the URL. However, MP-HCNN fails to recognize that "fifa soccer 11" refers to a PS3 video game, showing the limits of a matching-based algorithm for entity disambiguation. In contrast, though the second and third tweets look more relevant to the query, they are assigned much lower scores by MP-HCNN. This is because the query word "2022" is out of vocabulary, so the impact of its matching evidence is greatly reduced by the random initialization of OOV word embeddings. The second and third tweets share similar matching evidence in their content, but the third tweet receives a higher MP-HCNN score due to the character n-gram match "2022-fifa" in its URL. It is also worth noting that terms such as "qatar" and "world cup" frequently co-occur with "fifa soccer" in relevant tweets, suggesting that neural networks for term expansion could be promising. Since tweets 4–6 show similar patterns, we omit detailed discussion here.

In summary, these manual analyses confirm the quantitative results of the previous sections. Exact term matching remains critical to relevance modeling, while soft matches that incorporate phrases and semantic similarities make substantial contributions as well. Furthermore, although URLs play a smaller role in matching, they provide complementary signals. Though soft-match signals can be led astray, as our failure analysis shows, overall they help more than they hurt.

6. Conclusions

To conclude, this paper presents, to our knowledge, the first substantial work on neural ranking models for ad hoc retrieval over social media posts. We identified three main characteristics of social media posts that distinguish our problem from "standard" document ranking over web and newswire documents. Our model is specifically designed to cope with each of these issues, capturing multiple signals from queries, social media posts, and URLs contained in the posts, at the character, word, and phrase levels. Extensive experiments demonstrate the effectiveness of our model, and ablation studies verify the importance of each model component, suggesting that our customized architecture indeed captures the characteristics of our domain-specific ranking challenge.

References

  • Burges et al. (2011) C. Burges, K. Svore, P. Bennett, A. Pastusiak, and Q. Wu. 2011. Learning to Rank using an Ensemble of Lambda-Gradient Models. In Proceedings of the Learning to Rank Challenge. 25–35.
  • Burges (2010) C. J. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An Overview. Learning 11, 23-581 (2010).
  • Cao et al. (2007) Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li. 2007. Learning to Rank: from Pairwise Approach to Listwise Approach. In ICML. 129–136.
  • Dai et al. (2018) Z. Dai, C. Xiong, J. Callan, and Z. Liu. 2018. Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search. In WSDM. 126–134.
  • Ganguly et al. (2015) D. Ganguly, D. Roy, M. Mitra, and G. J. F. Jones. 2015. Word Embedding based Generalized Language Model for Information Retrieval. In SIGIR. 795–798.
  • Gey (1994) F. C. Gey. 1994. Inferring Probability of Relevance using the Method of Logistic Regression. In SIGIR. 222–231.
  • Guo et al. (2016) J. Guo, Y. Fan, Q. Ai, and W. B. Croft. 2016. A Deep Relevance Matching Model for Ad-hoc Retrieval. In CIKM. 55–64.
  • He and Lin (2016) H. He and J. Lin. 2016. Pairwise Word Interaction Modeling with Deep Neural Networks for Semantic Similarity Measurement. In NAACL-HLT. 937–948.
  • He et al. (2016) H. He, J. Wieting, K. Gimpel, J. Rao, and J. Lin. 2016. UMD-TTIC-UW at SemEval-2016 Task 1: Attention-Based Multi-Perspective Convolutional Neural Networks for Textual Similarity Measurement. In SemEval. 1103–1108.
  • Hoffer and Ailon (2015) E. Hoffer and N. Ailon. 2015. Deep metric learning using triplet network. In SIMBAD. 84–92.
  • Huang et al. (2013) P. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. 2013. Learning Deep Structured Semantic Models for Web Search using Clickthrough Data. In CIKM. 2333–2338.
  • Joachims (2002) T. Joachims. 2002. Optimizing Search Engines using Clickthrough Data. In SIGKDD. 133–142.
  • Lavrenko and Croft (2001) V. Lavrenko and W. B. Croft. 2001. Relevance Based Language Models. In SIGIR. 120–127.
  • Li et al. (2017) A. Li, J. Sun, J. Yue-Hei Ng, R. Yu, V. I. Morariu, and L. S. Davis. 2017. Generating Holistic 3D Scene Abstractions for Text-based Image Retrieval. CVPR.
  • Li et al. (2015) J. Li, X. Chen, E. Hovy, and D. Jurafsky. 2015. Visualizing and Understanding Neural Models in NLP. arXiv:1506.01066 (2015).
  • Lin and Efron (2013) J. Lin and M. Efron. 2013. Overview of the TREC-2013 Microblog Track. In TREC.
  • Metzler and Croft (2005) D. Metzler and W. B. Croft. 2005. A Markov Random Field Model for Term Dependencies. In SIGIR. 472–479.
  • Mikolov et al. (2013) T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS. 3111–3119.
  • Mitra et al. (2017) B. Mitra, F. Diaz, and N. Craswell. 2017. Learning to Match using Local and Distributed Representations of Text for Web Search. In WWW. 1291–1299.
  • Ounis et al. (2011) I. Ounis, C. Macdonald, J. Lin, and I. Soboroff. 2011. Overview of the TREC-2011 Microblog Track. In TREC, Vol. 32.
  • Pang et al. (2016) L. Pang, Y. Lan, J. Guo, J. Xu, S. Wan, and X. Cheng. 2016. Text Matching as Image Recognition.. In AAAI. 2793–2799.
  • Ponte and Croft (1998) J. M. Ponte and W. B. Croft. 1998. A Language Modeling Approach to Information Retrieval. In SIGIR. 275–281.
  • Rao et al. (2016) J. Rao, H. He, and J. Lin. 2016. Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks. In CIKM. 1913–1916.
  • Rao et al. (2017a) J. Rao, H. He, and J. Lin. 2017a. Experiments with Convolutional Neural Network Models for Answer Selection. In SIGIR. 1217–1220.
  • Rao et al. (2017b) J. Rao, H. He, H. Zhang, F. Ture, R. Sequiera, S. Mohammed, and J. Lin. 2017b. Integrating Lexical and Temporal Signals in Neural Ranking Models for Social Media Search. In SIGIR Workshop on Neural Information Retrieval (Neu-IR).
  • Rao et al. (2017c) J. Rao, F. Ture, H. He, O. Jojic, and J. Lin. 2017c. Talking to Your TV: Context-Aware Voice Search with Hierarchical Recurrent Neural Networks. In CIKM. 557–566.
  • Rao et al. (2017d) J. Rao, F. Ture, X. Niu, and J. Lin. 2017d. Mining the Temporal Statistics of Query Terms for Searching Social Media Posts. In ICTIR. 133–140.
  • Sequiera et al. (2017) R. Sequiera, G. Baruah, Z. Tu, S. Mohammed, J. Rao, H. Zhang, and J. Lin. 2017. Exploring the Effectiveness of Convolutional Neural Networks for Answer Selection in End-to-End Question Answering. arXiv:1707.07804 (2017).
  • Severyn and Moschitti (2015) A. Severyn and A. Moschitti. 2015. Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks. In SIGIR. 373–382.
  • Shen et al. (2014) Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil. 2014. Learning Semantic Representations using Convolutional Neural Networks for Web Search. In WWW. 373–374.
  • Smucker et al. (2007) M. D. Smucker, J. Allan, and B. Carterette. 2007. A Comparison of Statistical Significance Tests for Information Retrieval Evaluation. In CIKM. 623–632.
  • Socher et al. (2011) R. Socher, E. Huang, J. Pennin, C. Manning, and A. Ng. 2011. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In NIPS. 801–809.
  • Sutskever et al. (2014) I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In NIPS. 3104–3112.
  • Xiong et al. (2017) C. Xiong, Z. Dai, J. Callan, Z. Liu, and R. Power. 2017. End-to-end Neural Ad-hoc Ranking with Kernel Pooling. In SIGIR. 55–64.
  • Yin et al. (2015) W. Yin, H. Schütze, B. Xiang, and B. Zhou. 2015. ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs. arXiv:1512.05193 (2015).
  • Yu et al. (2017) R. Yu, A. Li, V. I. Morariu, and L. S. Davis. 2017. Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation. ICCV.