On the Effect of Low-Frequency Terms on Neural-IR Models
Low-frequency terms are a recurring challenge for information retrieval models; neural IR frameworks in particular struggle to adequately capture infrequently observed words. While these terms are often removed from neural models – mainly as a concession to efficiency demands – they traditionally play an important role in the performance of IR models. In this paper, we analyze the effects of low-frequency terms on the performance and robustness of neural IR models. We conduct controlled experiments on three recent neural IR models, trained on a large-scale passage retrieval collection. We evaluate the neural IR models with various vocabulary sizes for their respective word embeddings, considering different levels of constraints on the available GPU memory.
We observe that despite the significant benefits of using larger vocabularies, the performance gap between the vocabularies can be, to a great extent, mitigated by extensive tuning of a related parameter: the number of documents to re-rank. We further investigate the use of subword-token embedding models, and in particular FastText, for neural IR models. Our experiments show that using FastText brings slight improvements to the overall performance of the neural IR models in comparison to models trained on the full vocabulary, while the improvement becomes much more pronounced for queries containing low-frequency terms.
1. Introduction
Neural network approaches for Information Retrieval have been showing promising performance in a wide range of document retrieval tasks. Various studies apply neural methods by introducing pre-trained word embeddings into classical IR models (Rekabsaz et al., 2016, 2017), adapting word embeddings to retrieval tasks (Hofstätter et al., 2019; Diaz et al., 2016), or proposing altogether novel neural IR models (Dai et al., 2018; Xiong et al., 2017; Pang et al., 2016, 2017). An essential core building block of all these approaches is the word embedding model, which defines the semantic relations between the terms.
Typically, word embeddings are defined on a fixed vocabulary. As a common practice in neural network approaches, terms with very low collection frequencies are pruned from the vocabulary, becoming Out-Of-Vocabulary (OOV) terms. The reason for limiting the vocabulary in this way often stems from (GPU) memory constraints, efficiency considerations, or noise reduction efforts.
However, in the context of retrieval modeling, low-frequency terms are known to bear high degrees of informativeness or salience, and therefore play an important role in identifying relevant documents. In classical IR, the importance of such terms is quantified by term salience measures, such as the Inverse Document Frequency. Removing low-frequency terms in the training stage of neural IR models potentially harms the effectiveness and robustness of the derived models, especially for queries containing the affected terms. Even if neural IR models cover the full collection vocabulary, two issues remain: (1) The model performs poorly on previously unseen terms appearing at retrieval time (OOV terms). (2) Due to the lack of training data for low-frequency terms, the learned vectors may not be semantically robust.
In this study, we explore the effect of low-frequency terms on the effectiveness and robustness of neural IR models. We conduct an extensive range of controlled experiments on three recent neural IR models, namely KNRM (Xiong et al., 2017), CONV-KNRM (Dai et al., 2018), and MatchPyramid (Pang et al., 2016), evaluated on the MS MARCO (Bajaj et al., 2016) passage ranking collection, and finally propose potential solutions to the general underlying vocabulary issue in neural IR models.
The novel contributions of this paper are two-fold: We begin by exploring the performance of neural IR models trained on different vocabularies (Section 4). We observe that despite the significant benefits of using larger vocabularies, model performance is highly sensitive to another essential parameter common to virtually all neural IR models: the re-ranking threshold, which defines how many of the initially retrieved documents are re-ranked by a neural IR model. We investigate the relationship between vocabulary size and re-ranking threshold, noting the sensitivity of the models to the latter, especially for models with smaller vocabularies. Our results suggest that a well-tuned re-ranking threshold can largely mitigate the negative effect of pruned vocabularies.
Secondly, we study the effect of embedding sub-word tokens in comparison to using the full vocabulary of word-level tokens (Section 5). In particular, we investigate the use of FastText (Bojanowski et al., 2017), a model based on the composition of character n-gram vector representations, designed to address OOV issues. Our results suggest that the overall performance of the model with FastText remains close to the results of using the full vocabulary. However, character-level models achieve significantly better performance on queries containing low-frequency terms. We argue that this is due to better generalization of the character-level model that benefits from other words with similar n-gram contexts. This early-stage study therefore recommends the use of sub-word token embeddings as a strategy for retaining the effectiveness and robustness of neural IR models, especially with regard to low-frequency query terms.
2. Background and Related Work
In this section, we briefly explain sub-word embeddings and then discuss work related to our study.
Sub-word embedding models produce the vector representation of a word by composing the embeddings of its character n-grams. In this way, the models can provide a semantically meaningful embedding vector even for unseen terms by exploiting the contexts of observed terms with similar character n-grams, so that there are virtually no out-of-vocabulary terms. The FastText model (Bojanowski et al., 2017), an effective and efficient sub-word embedding model, simply sums up the character n-gram vectors to build the word embedding. For highly frequent terms, FastText directly assigns a vector per word. ELMo (Peters et al., 2018) is another well-known character-based embedding model, which in addition takes into account the context around the word. In this work, we use FastText due to its direct comparability to traditional word embeddings.
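The composition step described above can be illustrated with a minimal sketch. It assumes trigram sub-words and word-boundary markers `<` and `>`; the function names and the toy n-gram table are hypothetical and chosen only for illustration. The actual FastText implementation additionally hashes n-grams into buckets and handles frequent words separately, which this sketch omits.

```python
from typing import Dict, List


def char_ngrams(word: str, n: int = 3) -> List[str]:
    """Return the character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def compose_vector(word: str,
                   ngram_vectors: Dict[str, List[float]],
                   dim: int,
                   n: int = 3) -> List[float]:
    """Build a word vector by summing the vectors of its known n-grams."""
    total = [0.0] * dim
    for gram in char_ngrams(word, n):
        vec = ngram_vectors.get(gram)
        if vec is None:
            continue  # an n-gram never seen in training contributes nothing
        for i, value in enumerate(vec):
            total[i] += value
    return total


# Toy n-gram table (hypothetical values, two dimensions for readability).
toy_ngram_vectors = {
    "<re": [1.0, 0.0],
    "ran": [0.0, 1.0],
    "nk>": [0.5, 0.5],
}

print(char_ngrams("rank"))
print(compose_vector("rerank", toy_ngram_vectors, dim=2))
```

Even if "rerank" never occurred in the training data, it still receives a non-zero vector because it shares n-grams with observed words, which is the property the paragraph above relies on.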
In more traditional retrieval models, Woodland et al. (2000) explore the role of OOV terms for spoken document retrieval, proposing query and document expansion approaches. To the best of our knowledge, there is no existing research on the effect of low-frequency terms on neural IR models.
Other studies explore related aspects of neural IR models. Pyreddy et al. (2018) investigate the variance and consistency of kernel-based neural models over various parameter initializations. Zamani et al. (2018) propose a method to skip the re-ranking step, and directly retrieve documents from an index of sparse representations. In contrast, in this paper, we analyze the sensitivity of the neural IR models to the re-ranking threshold parameter, since most recently proposed neural models are based on the re-ranking mechanism.
3. Experiment Design
We conduct our experiments on the MS MARCO (Bajaj et al., 2016) passage re-ranking collection. The collection provides a large set of informational question-style queries from Bing’s search logs, accompanied by human-annotated relevant/non-relevant passages. Besides training data, MS MARCO provides a development set – containing queries and relevance data for evaluation – in two sizes: sample (provided in the form of evaluation tuples: top1000.dev.tsv) and full. In our experiments, we use the queries from the sample as our validation set and the rest of the full development set as our test set. In total, the collection consists of 8,841,822 documents, 6,980 queries for validation, and 48,598 queries for test purposes.
We use GloVe (Pennington et al., 2014) word embeddings with 300 dimensions (42B lower-cased, CommonCrawl, from https://nlp.stanford.edu/projects/glove/), and the FastText model, trained on the Wikipedia corpus of August 2015, with trigram-character subwords in 200 dimensions.
We create several vocabularies by applying varying thresholds to the collection frequency of terms. In our experiments, we refer to the set of terms with a collection frequency greater than or equal to n as Voc-n (for example, Voc-5 and Voc-10). Voc-Full uses all the terms in the collection. The details of the resulting vocabularies, as well as the corresponding statistics of OOV terms, are shown in Table 1.
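As an illustration of this vocabulary construction, the following minimal sketch counts collection frequencies, keeps the terms at or above a threshold, and reports the share of queries that contain at least one OOV term. The function names and the toy documents and queries are hypothetical; the statistics reported in the paper come from the full MS MARCO collection, not from this sketch.

```python
from collections import Counter
from typing import Iterable, List, Set


def collection_frequencies(docs: Iterable[List[str]]) -> Counter:
    """Count how often each term occurs over all tokenized documents."""
    counts: Counter = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts


def build_vocabulary(counts: Counter, min_freq: int) -> Set[str]:
    """Keep the terms whose collection frequency is at least min_freq."""
    return {term for term, count in counts.items() if count >= min_freq}


def oov_query_ratio(queries: List[List[str]], vocab: Set[str]) -> float:
    """Fraction of queries that contain at least one term outside vocab."""
    if not queries:
        return 0.0
    affected = sum(1 for q in queries if any(t not in vocab for t in q))
    return affected / len(queries)


# Toy collection (hypothetical data).
docs = [
    ["neural", "ranking", "model"],
    ["neural", "retrieval"],
    ["ranking", "neural", "passage"],
]
queries = [["neural", "ranking"], ["passage", "retrieval"]]

counts = collection_frequencies(docs)
voc_2 = build_vocabulary(counts, min_freq=2)     # analogous to a "Voc-2"
voc_full = build_vocabulary(counts, min_freq=1)  # analogous to Voc-Full

print(sorted(voc_2))
print(oov_query_ratio(queries, voc_2), oov_query_ratio(queries, voc_full))
```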
We evaluate our models with the main metric of the MS MARCO ranking challenge, the Mean Reciprocal Rank (MRR), as well as Recall, both at rank 10. Statistical significance tests are done using a two-sided paired t-test.
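For reference, the two measures can be computed as in the following minimal sketch. The function names and the toy rankings are hypothetical; queries without a relevant passage in the top 10 contribute a reciprocal rank of zero, which is the usual convention and is assumed here.

```python
from typing import Callable, Dict, List, Set


def reciprocal_rank_at_k(ranking: List[str], relevant: Set[str], k: int = 10) -> float:
    """1 / rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranking[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranking: List[str], relevant: Set[str], k: int = 10) -> float:
    """Share of the relevant documents that appear within the top k."""
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & relevant) / len(relevant)


def mean_over_queries(metric: Callable[[List[str], Set[str], int], float],
                      rankings: Dict[str, List[str]],
                      qrels: Dict[str, Set[str]],
                      k: int = 10) -> float:
    """Average a per-query metric over all queries with relevance labels."""
    if not qrels:
        return 0.0
    total = sum(metric(rankings.get(qid, []), rel, k) for qid, rel in qrels.items())
    return total / len(qrels)


# Toy run (hypothetical data).
rankings = {"q1": ["d3", "d1", "d7"], "q2": ["d5", "d2"]}
qrels = {"q1": {"d1"}, "q2": {"d9", "d5"}}

print(mean_over_queries(reciprocal_rank_at_k, rankings, qrels))  # MRR@10
print(mean_over_queries(recall_at_k, rankings, qrels))           # Recall@10
```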
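Table 1: Details of the vocabularies and the corresponding OOV statistics. Recoverable column headers: Name, Min #, # Terms, % Covered, Size, OOV Queries (Terms, %, #).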
Neural Retrieval Models.
KNRM (Xiong et al., 2017) establishes a similarity matching matrix using the embeddings of query and document terms. The model then estimates the relevance score based on the outputs of a set of Gaussian kernel functions, applied on the matching matrix.
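The kernel-pooling idea can be sketched in a few lines of plain Python. This is a simplified illustration rather than the implementation used in the experiments: the function names are hypothetical, and the kernel means, standard deviations, and ranking-layer weights are supplied by the caller as illustrative values, not the settings used in this paper. The sketch follows the usual formulation in which each kernel yields a soft term-frequency per query term, the log of which is summed over query terms, and a linear layer with tanh produces the score.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_matrix(query_vecs: List[Sequence[float]],
                 doc_vecs: List[Sequence[float]]) -> List[List[float]]:
    """Similarity of every query term (rows) to every document term (columns)."""
    return [[cosine(q, d) for d in doc_vecs] for q in query_vecs]


def kernel_features(matrix: List[List[float]],
                    mus: Sequence[float],
                    sigmas: Sequence[float]) -> List[float]:
    """One feature per Gaussian kernel: sum over query terms of log soft-TF."""
    features = []
    for mu, sigma in zip(mus, sigmas):
        log_sum = 0.0
        for row in matrix:
            soft_tf = sum(math.exp(-((s - mu) ** 2) / (2.0 * sigma ** 2)) for s in row)
            log_sum += math.log(max(soft_tf, 1e-10))  # clamp to avoid log(0)
        features.append(log_sum)
    return features


def knrm_score(features: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    """Linear ranking layer followed by tanh."""
    return math.tanh(sum(w * f for w, f in zip(weights, features)) + bias)


# Toy example with two-dimensional embeddings (all values are illustrative).
query_vecs = [[1.0, 0.0]]
doc_vecs = [[1.0, 0.0], [0.0, 1.0]]
matrix = match_matrix(query_vecs, doc_vecs)
features = kernel_features(matrix, mus=[1.0, 0.5, 0.0], sigmas=[0.1, 0.1, 0.1])
print(matrix)
print(knrm_score(features, weights=[0.5, 0.3, -0.2]))
```

Because the score depends entirely on embedding similarities, a query term that is mapped to a shared OOV vector or to a poorly trained vector degrades every entry in its row of the matching matrix.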
CONV-KNRM (Dai et al., 2018) extends KNRM by adding a Convolutional Neural Network (CNN) layer on top of the word embedding matrix, which enables the model to learn word-level n-gram representations.
The MatchPyramid (Pang et al., 2016) ranking model is inspired by deep neural image processing architectures. Similar to KNRM, the model first computes the similarity matching matrix, which is then passed through several stacked CNN layers with dynamic max-pooling. Like the two previous models, this architecture facilitates end-to-end training.
Implementation and Parameter Setting
We use the Anserini (Yang et al., 2017) toolkit to compute the BM25, and RM3 models. The model parameters are tuned on the validation set, resulting in , for BM25. We observe no significant performance increase for RM3, therefore we only report BM25 baseline results. The BM25 rankings are used as a starting point for the neural re-ranking models.
We implement the neural models in PyTorch (Paszke et al., 2017); our code is available at https://github.com/sebastian-hofstaetter/sigir19-neural-ir. We project all characters to lower case and apply tokenization using the WordTokenizer provided by AllenNLP (Gardner et al., 2017). We use the Adam optimizer with learning rate 0.001, 1 epoch, and early stopping. We use a batch size of 64, and the maximum word length of queries and documents is set to 30 and 180, respectively. In all models, the pre-trained word embeddings are updated during training.
Regarding model-specific parameters, for KNRM and CONV-KNRM, we set the number of kernels to with the mean values of the Gaussian kernels varying from to in steps of (one extra kernel is added for exact matching), and standard deviation of for all kernels. The dimension of the CNN vectors in CONV-KNRM is set to . In the MatchPyramid model, we set the number of convolution layers to , each with kernel size and 16 convolution channels. For each model, we find the best re-ranking threshold parameter by extensively tuning it on a range from to , based on the MRR results of the validation set.
Table 2: Top: best performance on the validation set. Bottom: results on the test set. The best results per model are shown in bold.
4. Effect of the Vocabulary Size
The performance of the neural ranking models, trained on various vocabularies as well as on FastText embeddings, on both validation and test sets are shown in Table 2. We calculate tests of significance between the pairs of rankings, and mention the results in the following. We also evaluate BM25, achieving an MRR of and Recall of on the test set. Consistent with previous studies, the BM25 model is outperformed by all neural ranking models, and CONV-KNRM shows the best overall performance (Dai et al., 2018; Xiong et al., 2017; Pang et al., 2016).
Comparing the results over each model, in two out of the three models, the FastText embedding significantly outperforms Voc-Full, while FastText only requires 55% of the memory needed by Voc-Full (based on the statistics in Table 1). Looking at the results of the models with various vocabulary sizes, using Voc-Full brings significant advantages in comparison to using smaller vocabularies. However, their differences become marginal, especially for the models with Voc-5 and Voc-10 vocabulary sets, taking into account that the embeddings of the Voc-5 and Voc-10 vocabularies require much less memory space, namely only 15% (Voc-5) and 8% (Voc-10) of the memory used by the Voc-Full embeddings.
While the reported results are based on an exhaustive tuning of hyper-parameters on the validation set, in the following we study the sensitivity of the models to the re-ranking threshold, an important – but not well studied – hyper-parameter of neural IR models. Figure 1 shows the sensitivity of the three neural IR models to changes in the re-ranking threshold. Looking at the trends in the plots, as the performance improves, either through a better-performing model or a larger vocabulary, the models become less sensitive to the re-ranking threshold. The optimal re-ranking thresholds also become larger; since increasing the re-ranking threshold mostly adds non-relevant documents, this indicates that the model is able to effectively generalize over a larger set of non-relevant documents. On the other hand, models with lower performance (MatchPyramid and KNRM), especially with smaller vocabularies, are highly sensitive to the re-ranking threshold. For such models, an exhaustive parameter search provides a significant improvement. This indicates the importance of carefully tuning the re-ranking threshold, especially in scenarios with constrained memory resources.
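The role of the re-ranking threshold can be made concrete with a minimal sketch. It assumes that only the top-k documents of the initial (e.g., BM25) ranking are re-ordered by the neural score and that the remaining documents keep their initial order; this tail handling, the function names, and the toy data are assumptions made for illustration. The threshold is then selected by MRR at rank 10 on a validation set.

```python
from typing import Dict, Iterable, List, Optional, Set


def rerank_top_k(initial_ranking: List[str],
                 neural_scores: Dict[str, float],
                 k: int) -> List[str]:
    """Re-order the first k documents by neural score; keep the tail as is."""
    head = sorted(initial_ranking[:k],
                  key=lambda doc_id: neural_scores.get(doc_id, float("-inf")),
                  reverse=True)
    return head + initial_ranking[k:]


def mrr_at_10(rankings: Dict[str, List[str]], qrels: Dict[str, Set[str]]) -> float:
    """Mean reciprocal rank of the first relevant document in the top 10."""
    if not qrels:
        return 0.0
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, doc_id in enumerate(rankings.get(qid, [])[:10], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(qrels)


def tune_threshold(candidates: Iterable[int],
                   initial: Dict[str, List[str]],
                   scores: Dict[str, Dict[str, float]],
                   qrels: Dict[str, Set[str]]) -> Optional[int]:
    """Return the candidate threshold with the highest validation MRR@10."""
    best_k, best_mrr = None, -1.0
    for k in candidates:
        reranked = {qid: rerank_top_k(docs, scores.get(qid, {}), k)
                    for qid, docs in initial.items()}
        mrr = mrr_at_10(reranked, qrels)
        if mrr > best_mrr:  # ties keep the earlier candidate
            best_k, best_mrr = k, mrr
    return best_k


# Toy validation data (hypothetical): d4 is non-relevant but scored highly.
initial = {"q1": ["d1", "d2", "d3", "d4"]}
scores = {"q1": {"d1": 0.1, "d2": 0.9, "d3": 0.5, "d4": 2.0}}
qrels = {"q1": {"d2"}}

print(tune_threshold([1, 2, 4], initial, scores, qrels))
```

In the toy data, a larger threshold pulls in a non-relevant document that the model scores too highly, which mirrors the sensitivity discussed above for weaker models.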
Finally, to confirm whether the effects of tuning the re-ranking threshold on the validation set also transfer to the test set, we compare the results on the validation and test sets in Table 2. As shown, even for models with high sensitivity to the re-ranking threshold, the results are highly similar, indicating the effectiveness of extensive tuning of the re-ranking threshold. We should note that this effect might be due to high underlying similarities between the validation and test sets of this particular collection; further investigation of this aspect is out of the scope of this paper.
5. Queries with Low-Frequency Terms
In this section, we take a closer look at the differences between the models trained on the traditional embeddings (GloVe in our experiments) using different vocabularies, and the ones trained on the FastText embeddings.
Figure 2 shows the MRR differences between the neural ranking models using the traditional embeddings with the Voc-Full vocabulary and the ones using FastText, over the range of collection frequencies. For each point on the x-axis, we calculate the MRR values for the queries that have at least one term with a collection frequency equal to or smaller than the corresponding value of that point. The figure reveals a strong contrast between the area related to the queries with very low-frequency terms and the rest, indicating higher performance of the models with FastText for these queries.
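The query selection behind such a curve can be sketched as follows. The function names and the toy values are hypothetical, terms missing from the frequency table are treated as having frequency zero (an assumption), and the per-query reciprocal ranks are taken as given inputs rather than computed from real runs.

```python
from typing import Dict, List, Sequence, Tuple


def queries_up_to_frequency(queries: Dict[str, List[str]],
                            collection_freq: Dict[str, int],
                            max_freq: int) -> List[str]:
    """Queries with at least one term whose collection frequency <= max_freq."""
    selected = []
    for qid, terms in queries.items():
        if terms and min(collection_freq.get(t, 0) for t in terms) <= max_freq:
            selected.append(qid)
    return selected


def mean_rr(rr_per_query: Dict[str, float], query_ids: Sequence[str]) -> float:
    """MRR over a non-empty subset of queries."""
    return sum(rr_per_query.get(qid, 0.0) for qid in query_ids) / len(query_ids)


def mrr_difference_curve(queries: Dict[str, List[str]],
                         collection_freq: Dict[str, int],
                         rr_a: Dict[str, float],
                         rr_b: Dict[str, float],
                         thresholds: Sequence[int]) -> List[Tuple[int, float]]:
    """MRR(a) - MRR(b) per frequency threshold; empty subsets are skipped."""
    curve = []
    for threshold in thresholds:
        subset = queries_up_to_frequency(queries, collection_freq, threshold)
        if not subset:
            continue
        curve.append((threshold, mean_rr(rr_a, subset) - mean_rr(rr_b, subset)))
    return curve


# Toy data (hypothetical): run "a" stands for FastText, run "b" for Voc-Full.
collection_freq = {"neural": 1000, "zyzzyva": 2, "ranking": 500, "passage": 40}
queries = {
    "q1": ["neural", "zyzzyva"],
    "q2": ["ranking", "passage"],
    "q3": ["neural", "ranking"],
}
rr_fasttext = {"q1": 1.0, "q2": 0.5, "q3": 0.5}
rr_voc_full = {"q1": 0.25, "q2": 0.5, "q3": 0.5}

print(mrr_difference_curve(queries, collection_freq, rr_fasttext, rr_voc_full, [5, 50, 1000]))
```

As the threshold grows, more queries enter the subset and the difference shrinks toward the overall MRR difference, which is the shape the figure describes.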
Let us have a closer look at this area. Figure 3 shows the MRR of the CONV-KNRM models, using the traditional word embeddings with different vocabularies, as well as the one using the FastText embeddings, for queries with very infrequent terms. The MRR values are calculated in the same fashion as in Figure 2.
As shown, the model with FastText outperforms all other models by a large margin, especially up to a collection frequency of around 10 to 15. Interestingly, BM25, as an exact term matching model, shows better performance than neural IR models with traditional embeddings, especially for very low frequency values. We argue that the low performance of the models with traditional embeddings is due to the lack of sufficient contexts for learning meaningful representations, which causes ineffective semantic similarity estimations. On the other hand, the subword embeddings exploit the contexts of other observed terms in the collection with similar character n-grams. Therefore, the neural ranking models with subword embeddings still benefit from meaningful semantic relations between very infrequent words, outperforming the ranking models based on traditional embeddings as well as exact matching.
6. Conclusion
Our work takes a first step towards understanding the effects of infrequent terms in neural ranking models, and exploits novel representation learning approaches to address them. We first study the sensitivity of the neural IR models to their vocabulary size, pointing out the importance of fine-grained tuning of the re-ranking threshold. We then investigate the effects of using subword embeddings in neural IR models, showing that these embeddings bring remarkable improvements in particular to the performance on queries containing very low-frequency terms. As future work, we aim to extend the investigations of this study to the area of query performance prediction for neural IR models.
- Bajaj et al. (2016) Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew Mcnamara, Bhaskar Mitra, and Tri Nguyen. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. In Proc. of NIPS.
- Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Tr. of the ACL 5 (2017).
- Dai et al. (2018) Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. 2018. Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search. In Proc. of WSDM.
- Diaz et al. (2016) Fernando Diaz, Bhaskar Mitra, and Nick Craswell. 2016. Query Expansion with Locally-Trained Word Embeddings. In Proc. of ACL.
- Gardner et al. (2017) Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A Deep Semantic Natural Language Processing Platform.
- Hofstätter et al. (2019) Sebastian Hofstätter, Navid Rekabsaz, Mihai Lupu, Carsten Eickhoff, and Allan Hanbury. 2019. Enriching Word Embeddings for Patent Retrieval with Global Context. In Proc. of ECIR.
- Pang et al. (2017) Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, and Xueqi Cheng. 2017. A Deep Investigation of Deep IR Models. In Proc. of SIGIR Neu-IR’17.
- Pang et al. (2016) Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2016. Text Matching as Image Recognition. In Proc. of AAAI.
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proc. of EMNLP.
- Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.
- Pyreddy et al. (2018) Mary Arpita Pyreddy, Varshini Ramaseshan, Narendra Nath Joshi, Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. 2018. Consistency and Variation in Kernel Neural Ranking Model. In Proc. of SIGIR.
- Rekabsaz et al. (2017) Navid Rekabsaz, Mihai Lupu, Allan Hanbury, and Hamed Zamani. 2017. Word Embedding Causes Topic Shifting; Exploit Global Context!. In Proc. of SIGIR.
- Rekabsaz et al. (2016) Navid Rekabsaz, Mihai Lupu, Allan Hanbury, and Guido Zuccon. 2016. Generalizing translation models in the probabilistic relevance framework. In Proc. of CIKM.
- Woodland et al. (2000) Philip Woodland, Sue Johnson, Pierre Jourlin, and Karen Spärck Jones. 2000. Effects of out of vocab. words in spoken document retrieval. In Proc. of SIGIR.
- Xiong et al. (2017) Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. 2017. End-to-End Neural Ad-hoc Ranking with Kernel Pooling. In Proc. of SIGIR.
- Yang et al. (2017) Peilin Yang, Hui Fang, and Jimmy Lin. 2017. Anserini: Enabling the use of Lucene for information retrieval research. In Proc. of SIGIR.
- Zamani et al. (2018) Hamed Zamani, Mostafa Dehghani, W Bruce Croft, Erik Learned-Miller, and Jaap Kamps. 2018. From Neural Re-Ranking to Neural Ranking: Learning a Sparse Representation for Inverted Indexing. In Proc. of CIKM.