Curriculum Learning Strategies for IR

An Empirical Study on Conversation Response Ranking

Abstract

Neural ranking models are traditionally trained on a series of random batches, sampled uniformly from the entire training set. Curriculum learning has recently been shown to improve neural models' effectiveness by sampling batches non-uniformly, going from easy to difficult instances during training. In the context of neural Information Retrieval (IR), curriculum learning has not been explored yet, and it remains unclear (1) how to measure the difficulty of training instances and (2) how to transition from easy to difficult instances during training. To address both challenges and determine whether curriculum learning is beneficial for neural ranking models, we need large-scale datasets and a retrieval task that allows us to conduct a wide range of experiments. For this purpose, we resort to the task of conversation response ranking: ranking responses given the conversation history. In order to deal with challenge (1), we explore scoring functions that measure the difficulty of conversations based on different input spaces. To address challenge (2), we evaluate different pacing functions, which determine the pace at which we go from easy to difficult instances. We find that, overall, by just intelligently sorting the training data (i.e., by performing curriculum learning) we can improve retrieval effectiveness by up to 2%.

Keywords:
curriculum learning, conversation response ranking

1 Introduction

Curriculum Learning (CL) is motivated by the way humans teach complex concepts: teachers impose a certain order of the material during students’ education. Following this guidance, students can exploit previously learned concepts to more easily learn new ones. This idea was initially applied to machine learning over two decades ago [elman1993learning] as an attempt to use a similar strategy in the training of a recurrent network by starting small and gradually learning more difficult examples. More recently, Bengio et al. [bengio2009curriculum] provided additional evidence that curriculum strategies can benefit neural network training with experimental results on different tasks such as shape recognition and language modelling. Since then, empirical successes were observed for several computer vision [hacohen2019power, weinshall2018curriculum] and natural language processing (NLP) tasks [sachan2016easy, rajeswar2017adversarial, zhang2018].

In supervised machine learning, a function is learnt by the learning algorithm (the student) based on inputs and labels provided by the teacher. The teacher typically samples randomly from the entire training set. In contrast, CL imposes a structure on the training set based on a notion of difficulty of instances, presenting to the student easy instances before difficult ones. When defining a CL strategy we face two challenges that are specific to the domain and task at hand [hacohen2019power]: (1) arranging the training instances by a sensible measure of difficulty, and (2) determining the pace at which to present instances; going over easy instances too fast or too slow might lead to ineffective learning.

We conduct here an empirical investigation into those two challenges in the context of IR. Estimating relevance—a notion based on human cognitive processes—is a complex and difficult task at the core of IR, and it is still unknown to what extent CL strategies are beneficial for neural ranking models. This is the question we aim to answer in our work.

Given a set of queries—for instance user utterances, search queries or questions in natural language—and a set of documents—for instance responses, web documents or passages—neural ranking models learn to distinguish relevant from non-relevant query-document pairs by training on a large number of labeled training pairs. Neural models have for some time struggled to display significant and additive gains in IR [Yang:2019:CEH:3331184.3331340]. In a short time though, BERT [devlin2019bert] (released in late 2018) and its derivatives (e.g. XLNet [yang2019xlnet], RoBERTa [liu2019roberta]) have proven to be remarkably effective for a range of NLP tasks. The recent breakthroughs of these large and heavily pre-trained language models have also benefited IR [yang2019simple, yang2019end, yilmaz2019cross].

In our work we focus on the challenging IR task of conversation response ranking [wu2017sequential], where the query is the dialogue history and the documents are the candidate responses of the agent. The responses are not generated on the fly; they must be retrieved from a comprehensive dialogue corpus. A number of deep neural ranking models have recently been proposed for this task [tao2019one, yang2018response, zhang2018modeling, wu2017sequential, zhou2018multi], which is more complex than retrieval for single-turn interactions, as the ranking model has to determine where the important information is in the previous user utterances (dialogue history) and how it relates to the current information need of the user. Due to the complexity of the relevance estimation problem displayed in this task, we argue that it is a good test case for curriculum learning in IR.

In order to tackle the first challenge of CL (determining what makes an instance difficult) we study different scoring functions that measure the difficulty of query-document pairs based on four different input spaces: the conversation history $\mathcal{U}$, the candidate responses $\mathcal{R}$, both $\{\mathcal{U}, \mathcal{R}\}$, and $\{\mathcal{U}, \mathcal{R}, \mathcal{Y}\}$, where $\mathcal{Y}$ are the relevance labels of the responses. To address the second challenge (determining the pace at which to move from easy to difficult instances) we explore different pacing functions that serve easy instances to the learner for more or less time during the training procedure. We empirically explore how the curriculum strategies perform on two different response ranking datasets when compared against vanilla (no curriculum) fine-tuning of BERT for the task. Our main findings are that (i) CL improves retrieval effectiveness when we use a difficulty criterion based on a supervised model that uses all the available information $\{\mathcal{U}, \mathcal{R}, \mathcal{Y}\}$, (ii) it is best to give the model more time to assimilate harder instances during training by introducing difficult instances in earlier iterations, and (iii) the CL gains over the no-curriculum baseline are spread over different conversation domains, lengths of conversations and measures of conversation difficulty.

2 Related Work

Neural Ranking Models

Over the past few years, the IR community has seen a great uptake of the many flavours of deep learning for all kinds of IR tasks such as ad-hoc retrieval, question answering and conversation response ranking. Unlike traditional learning to rank (LTR) [liu2009learning] approaches, in which we manually define features for queries, documents and their interactions, neural ranking models learn features directly from the raw textual data. Neural ranking approaches can be roughly categorized into representation-focused [huang2013learning, shen2014latent, wan2016deep] and interaction-focused [guo2016deep, wan2016match] models. The former learn query and document representations separately and then compute the similarity between the representations. In the latter approach, a query-document interaction matrix is first built and then fed to neural network layers. Estimating relevance directly based on interactions, i.e. interaction-focused models, has been shown to outperform representation-based approaches on several tasks [nie2018empirical, hu2014convolutional].

Transfer learning via large pre-trained Transformers [vaswani2017attention], the prominent case being BERT [devlin2019bert], has led to remarkable empirical successes on a range of NLP problems. The BERT approach to learning textual representations has also significantly improved the performance of neural models for several IR tasks [yang2019simple, yang2019end, sakata2019faq, Qu:2019:BHA:3331184.3331341, yilmaz2019cross] that for a long time struggled to outperform classic IR models [Yang:2019:CEH:3331184.3331340]. In this work we use the no-CL BERT as a strong baseline for the conversation response ranking task.

Curriculum Learning

Following a curriculum that dictates the ordering and content of the education material is prevalent in the context of human learning. With such guidance, students can exploit previously learned concepts to ease the learning of new and more complex ones. Inspired by cognitive science research [rohde1999language], researchers posed the question of whether a machine learning algorithm could benefit, in terms of learning speed and effectiveness, from a similar curriculum strategy [elman1993learning, bengio2009curriculum]. Since then, positive evidence for the benefits of curriculum training, i.e. training the model using easy instances first and increasing the difficulty during the training procedure, has been empirically demonstrated in different machine learning problems, e.g. image classification [hacohen2019power, gong2016multi], machine translation [platanios2019competence, kocmi2017curriculum, zhang2018] and answer generation [liu2018curriculum].

Processing training instances in a meaningful order is not unique to CL. Another related branch of research focuses on dynamic sampling strategies [kumar2010self, chang2017active, shrivastava2016training, breiman1998arcing], which, unlike CL, do not require a definition of what is easy and difficult before training starts, but instead estimate the importance of instances during the training procedure. Self-paced learning [kumar2010self] simultaneously selects easy instances to focus on and updates the model parameters by solving a biconvex optimization problem. A seemingly contradictory set of approaches gives more focus to difficult or more uncertain instances. In active learning [cohn1996active, tong2001support, chang2017active], the most uncertain instances with respect to the current classifier are employed for training. Similarly, hard example mining [shrivastava2016training] focuses on difficult instances, measured for example by the model loss or the magnitude of the gradients. Boosting [breiman1998arcing, zhang2017boosting] techniques give more weight to difficult instances as training progresses. In this work we focus on CL, which has been more successful for neural models, and leave the study of dynamic sampling strategies in neural IR as future work.

The most critical part of using a CL strategy is defining the difficulty metric to sort instances by. The estimation of instance difficulty is often based on our prior knowledge of what makes each instance difficult for a certain task and is thus domain dependent (cf. Table 1 for curriculum examples). CL strategies have not been studied yet in neural ranking models. To our knowledge, CL has only recently been employed in IR, within the LTR framework using LambdaMART [burges2010ranknet], for ad-hoc retrieval by Ferro et al. [ferro2018continuation]. However, no effectiveness improvements over randomly sampling training data were observed. The representation of the query, document and their interactions in the traditional LTR framework is dictated by the manually engineered input features. We argue that neural ranking models, which learn how to represent the input, are better suited for applying CL in order to learn increasingly more complex concepts.

Difficulty criterion | Tasks
sentence length | machine translation [platanios2019competence], language generation [rajeswar2017adversarial], reading comprehension [yu2016end]
word rarity | machine translation [platanios2019competence, zhang2018], language modeling [bengio2009curriculum]
external model confidence | machine translation [zhang2018], image classification [weinshall2018curriculum, hacohen2019power], ad-hoc retrieval [ferro2018continuation]
supervision signal intensity | facial expression recognition [gui2017curriculum], ad-hoc retrieval [ferro2018continuation]
noise estimate | speaker identification [ranjan2018curriculum], image classification [chen2015webly]
human annotation | image classification [tudor2016hard] (through weak supervision)
Table 1: Difficulty measures used in the curriculum learning literature.

3 Curriculum Learning

Before introducing our experimental framework (i.e., the scoring functions and the pacing functions we investigate), let us first formally introduce the specific IR task we explore—a choice dictated by the complex nature of the task (compared to e.g. ad-hoc retrieval) as well as the availability of large-scale training resources such as MSDialog [qu2018analyzing] and UDC [lowe2015ubuntu].

Conversation Response Ranking

Given a historical dialogue corpus and a conversation (i.e., the user's current utterance and the conversation history), the task of conversation response ranking [wu2017sequential, yang2018response, tao2019one] is defined as ranking the most relevant response available in the corpus. This setup relies on the fact that a large corpus of historical conversation data exists and that adequate replies (that are coherent, well-formulated and informative) to user utterances can be found in it [yang2019hybrid]. Formally, let $\mathcal{D} = \{(\mathcal{U}_i, \mathcal{R}_i, \mathcal{Y}_i)\}_{i=1}^{N}$ be an information-seeking conversation dataset consisting of $N$ triplets: dialogue context, response candidates and response labels. The dialogue context $\mathcal{U}_i$ is composed of the previous utterances $\{u^1, u^2, \dots, u^{\tau}\}$ at turn $\tau$ of the dialogue. The candidate responses $\mathcal{R}_i$ are either the true response ($r^{+}$) or negative sampled candidates ($r^{-}$). The relevance labels $\mathcal{Y}_i$ indicate the responses' binary relevance scores, 1 for $r^{+}$ and 0 otherwise. The task is then to learn a ranking function $f(\cdot)$ that is able to generate a ranked list for the set of candidate responses $\mathcal{R}_i$ based on their predicted relevance scores $f(\mathcal{U}_i, r)$.
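
To make the setup concrete, the following minimal Python sketch shows one way to represent a training triplet and to rank candidate responses with an arbitrary relevance scorer. The class and function names are illustrative only and are not taken from the paper's code.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Instance:
    """One training triplet: dialogue context U, candidate responses R, labels Y."""
    context: List[str]      # previous utterances u^1 ... u^tau
    candidates: List[str]   # the true response r+ plus sampled negative candidates r-
    labels: List[int]       # 1 for the true response, 0 otherwise

def rank_candidates(instance: Instance,
                    relevance: Callable[[List[str], str], float]) -> List[int]:
    """Return candidate indices sorted by predicted relevance, highest first."""
    scores = [relevance(instance.context, r) for r in instance.candidates]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)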

Figure 1: Our curriculum learning framework is defined by two functions. The scoring function $f_{score}$ defines the instances' difficulty (darker/lighter blue indicates higher/lower difficulty). The pacing function $f_{pace}(s)$ indicates the percentage of the dataset available for sampling at training step $s$.

Curriculum Framework

When training neural networks, the common procedure is to divide the training set $\mathcal{D}_{train}$ into mini-batches $\{B_1, \dots, B_M\}$ that are sampled randomly (i.e., uniformly, every instance has the same likelihood of being sampled), with $|B_j| \ll |\mathcal{D}_{train}|$, and to perform one optimization step for each mini-batch in sequence. The CL framework employed here is inspired by previous work [weinshall2018curriculum, platanios2019competence]. It is defined by two functions: the scoring function $f_{score}$, which determines the difficulty of instances, and the pacing function $f_{pace}(s)$, which controls the pace of the transition from easy to hard instances during training. More specifically, the scoring function $f_{score}$ is used to sort the training dataset from easy to hard. The pacing function $f_{pace}(s)$ determines the percentage of the sorted dataset available for sampling at the current training step $s$ (one forward pass plus one backward pass of a batch is considered to be one step). The neural ranking model samples uniformly from the first $f_{pace}(s) \cdot |\mathcal{D}_{train}|$ instances sorted by $f_{score}$, while the rest of the dataset is not available for sampling. During training, $f_{pace}(s)$ goes from an initial value $\delta$ (the percentage of training data available at the start) to 1 when $s = T$. Both $\delta$ and $T$ are hyperparameters. We provide an illustration of the training process in Figure 1.
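
A minimal sketch of this training scheme, assuming a generic difficulty scorer f_score and pacing function f_pace (the names and structure are ours, not the authors' implementation):

import random
from typing import Callable, Iterator, List, Sequence

def curriculum_batches(train_set: Sequence,
                       f_score: Callable[[object], float],
                       f_pace: Callable[[int], float],
                       total_steps: int,
                       batch_size: int) -> Iterator[List]:
    """Yield one mini-batch per training step, sampled uniformly from the
    easiest f_pace(s) fraction of the data, sorted once by f_score."""
    ordered = sorted(train_set, key=f_score)               # easy -> hard
    for step in range(total_steps):
        available = max(1, int(f_pace(step) * len(ordered)))
        pool = ordered[:available]                          # harder instances stay hidden
        yield random.sample(pool, min(batch_size, len(pool)))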

Scoring Functions

Input space | Name | Difficulty notion
(none) | random | baseline (no curriculum)
$\mathcal{U}$ | $\#_{turns}(\mathcal{U})$, $\#_{words}(\mathcal{U})$ | information spread
$\mathcal{R}$ | $\#_{words}(r)$ | distraction in responses
$\mathcal{U}, \mathcal{R}$ | $\sigma_{BM25}(\mathcal{R})$, $\sigma_{SM}(\mathcal{R})$ | responses heterogeneity
$\mathcal{U}, \mathcal{R}, \mathcal{Y}$ | $BERT_{pred}$, $BERT_{loss}$ | model confidence
Table 2: Overview of our curriculum learning scoring functions; the functions are described in the text below.

In order to measure the difficulty of a training triplet composed of $(\mathcal{U}, \mathcal{R}, \mathcal{Y})$, we define scoring functions that use different parts of the input space: functions that leverage (i) the text in the dialogue history $\mathcal{U}$, (ii) the text in the candidate responses $\mathcal{R}$, (iii) interactions between them, i.e., $\{\mathcal{U}, \mathcal{R}\}$, and (iv) all available information, including the labels for the training set, i.e., $\{\mathcal{U}, \mathcal{R}, \mathcal{Y}\}$. The seven scoring functions we propose are listed in Table 2; we now provide intuitions for why we believe each function captures some notion of instance difficulty.

  • $\#_{turns}(\mathcal{U})$ and $\#_{words}(\mathcal{U})$: The important information in the context can be spread over different utterances and words. Larger dialogue contexts mean there are more places over which the important parts of the user's information need can be spread. $\#_{words}(r)$: Longer responses can distract the model as to which set of words or sentences is more important for matching. Previous work shows that it is possible to fool machine reading models by creating longer documents with additional distracting sentences [jia2017adversarial].

  • $\sigma_{BM25}(\mathcal{R})$ and $\sigma_{SM}(\mathcal{R})$: Inspired by the query performance prediction literature [shtok2009predicting], we use the variance of retrieval scores over the candidate responses to estimate the amount of heterogeneity of information, i.e. diversity, in the candidate list. Homogeneous ranked lists are considered to be easy. We deploy a semantic matching model (SM) and BM25 to capture both semantic correspondences and keyword matching [jinfeng2019bridging]. SM is the average cosine similarity between the first words of $\mathcal{U}$ (concatenated utterances) and the first words of $r$, using pre-trained word embeddings (see the code sketch after this list).

  • $BERT_{pred}$ and $BERT_{loss}$: Inspired by the CL literature [hacohen2019power], we use the prediction confidence of an external model as a measure of difficulty. We fine-tune BERT [devlin2019bert] on $\mathcal{D}_{train}$ for the conversation response ranking task. For $BERT_{pred}$, easy dialogue contexts are those for which the BERT confidence score for the positive response candidate $r^{+}$ is higher than the confidence for the negative response candidates $r^{-}$; the higher the difference, the easier the instance. For $BERT_{loss}$ we consider the loss of the model on an instance to be an indicator of its difficulty.
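
The heuristic scoring functions above could be implemented roughly as follows. The embedding lookup, the word cut-off n=20 and the pairwise averaging inside the SM score are our assumptions for illustration, not the paper's exact implementation.

import numpy as np

def turns_difficulty(context):
    """#turns(U): contexts with more utterances spread the information need wider."""
    return len(context)

def words_difficulty(texts):
    """#words: total word count of the context utterances
    (pass a one-element list for a single response)."""
    return sum(len(t.split()) for t in texts)

def sm_score(context, response, embed, n=20):
    """Semantic matching (SM): average cosine similarity between the embeddings of
    the first n words of the concatenated context and of the response.
    `embed` maps a word to a vector."""
    u_vecs = [embed(w) for w in " ".join(context).split()[:n]]
    r_vecs = [embed(w) for w in response.split()[:n]]
    sims = [float(np.dot(u, r) / (np.linalg.norm(u) * np.linalg.norm(r) + 1e-9))
            for u in u_vecs for r in r_vecs]
    return float(np.mean(sims)) if sims else 0.0

def variance_difficulty(context, candidates, score_fn):
    """sigma(BM25) / sigma(SM): variance of the candidates' scores for one context;
    homogeneous (low-variance) candidate lists are treated as easy."""
    return float(np.var([score_fn(context, r) for r in candidates]))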

Pacing Functions

Pacing function | Description
baseline_training | standard training: all instances are available from the first step
step | a new fixed-size group of instances is added every $T_{step}$ iterations
root | square-root growth of the available fraction (root_n with $n = 2$)
linear | the available fraction grows linearly from $\delta$ to 1 over $T$ iterations
root_n | the available fraction grows with the $n$-th root of the training step, so growth slows as the training set gets larger
geom_progression | the available fraction grows geometrically from $\delta$ to 1, adding difficult instances only late in training
Figure 2: Overview of our curriculum learning pacing functions. $\delta$ (initial fraction of the data) and $T$ (the step at which all instances become available) are hyperparameters; the functions are described in the text below.
Figure 3: Example pacing functions for a fixed initial fraction $\delta$ and total number of CL iterations $T$.

Assuming that we know the difficulty of each instance in our training set, we still need to define how we are going to transition from easy to hard instances. We use the concept of pacing functions $f_{pace}(s)$; each should have the following properties [platanios2019competence, weinshall2018curriculum]: (i) it starts at an initial value $f_{pace}(0) = \delta$ with $\delta > 0$, so that the model has a number of instances to train on in the first iteration; (ii) it is non-decreasing, so that harder instances are gradually added to the training set; and (iii) it eventually reaches 1, so that all instances are available for sampling after $T$ iterations, i.e. $f_{pace}(T) = 1$.

As intuitively visible in the example in Figure 3, we opted for pacing functions that introduce more difficult instances at different paces: while the root functions introduce difficult instances very early (after 125 iterations, 80% of all training data is available), geom_progression introduces them only towards the end of training. We consider four different types of pacing functions, defined in Figure 2. The step function [bengio2009curriculum, hacohen2019power, soviany2019image] divides the data into fixed-sized groups, and after $T_{step}$ iterations a new group of instances is added, where $T_{step}$ is a hyperparameter. A more gradual transition was proposed by Platanios et al. [platanios2019competence]: a fraction of the training dataset is added linearly with respect to the total number of CL iterations $T$, so that the slope of the function is $(1-\delta)/T$ (linear function). They also proposed root functions, motivated by the fact that difficult instances will be sampled less often as the training data grows in size during training; by making the slope inversely proportional to the current training data size, the model has more time to assimilate difficult instances. Finally, we propose the use of a geometric progression that, instead of quickly adding difficult examples, gives easier instances more training time.
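
In code, the pacing functions could look as follows. The linear and root_n forms follow Platanios et al.'s competence function; the step and geometric-progression forms are plausible readings of the descriptions above rather than the paper's exact formulas.

def pace_baseline(s, T, delta):
    """Standard training: the whole training set is available from the start."""
    return 1.0

def pace_step(s, T, delta, t_step=50):
    """Add one fixed-size group every t_step iterations (group size = delta here,
    which is one possible reading of the description)."""
    return min(1.0, delta * (1 + s // t_step))

def pace_linear(s, T, delta):
    """Fraction of data grows linearly from delta to 1 with slope (1 - delta) / T."""
    return min(1.0, delta + s * (1.0 - delta) / T)

def pace_root_n(s, T, delta, n=2):
    """Root pacing in the style of Platanios et al.: fast growth early on, then
    more time to assimilate the harder instances added later."""
    return min(1.0, (s * (1.0 - delta**n) / T + delta**n) ** (1.0 / n))

def pace_geom(s, T, delta):
    """Geometric progression from delta to 1: difficult instances appear only late."""
    return min(1.0, delta * (1.0 / delta) ** (s / T))

With a fixed delta and T, pace_root_n makes most of the data available early in training, while pace_geom keeps the sampling pool small until late, which matches the contrast illustrated in Figure 3.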

4 Experimental Setup

Datasets

We consider two large-scale information-seeking conversation datasets (cf. Table 3) that allow the training of neural ranking models for conversation response ranking. MSDialog [qu2018analyzing] contains 246K context-response pairs, built from 35.5K information-seeking conversations from the Microsoft Answer community, a question-answer forum for several Microsoft products. MANtIS [penha2019introducing] was created by us and contains 1.3 million context-response pairs built from conversations of 14 different Stack Exchange sites. Each MANtIS conversation fulfills the following conditions: (i) it takes place between exactly two users (the information seeker who starts the conversation and the information provider); (ii) it consists of at least 2 utterances per user; (iii) one of the provider's utterances contains a hyperlink, providing grounding; (iv) if the final utterance belongs to the seeker, it contains positive feedback. We created MANtIS to cover diverse conversations from different domains besides technical ones, and include MSDialog [qu2018analyzing, yang2018response, InforSeek_User_Intent_Pred] here as a widely used benchmark.

 | MSDialog (Train / Valid / Test) | MANtIS (Train / Valid / Test)
Number of domains | 75 | 14
Number of pairs | 173k / 37k / 35k | 904k / 199k / 197k
Number of candidates per context $\mathcal{U}$ | 10 / 10 / 10 | 11 / 11 / 11
Average number of turns | 5.0 / 4.8 / 4.4 | 4.0 / 4.1 / 4.1
Average number of words per utterance $u$ | 55.8 / 55.8 / 52.7 | 98.2 / 107.2 / 110.4
Average number of words per response $r$ | 67.3 / 68.8 / 67.7 | 91.0 / 100.1 / 94.6
Table 3: Datasets used, where $\mathcal{U}$ is the dialogue context, $r$ a response and $u$ an utterance.

Implementation Details

As a strong neural ranking model for our experiments, we employ BERT [devlin2019bert] for the conversation response ranking task. We follow recent research in IR that employs fine-tuned BERT for retrieval tasks [nogueira2019passage, yang2019simple] and obtain strong baseline (i.e., no CL) results for our task: the best model by Yang et al. [yang2018response], which relies on external knowledge sources for MSDialog, achieves a MAP of 0.68 whereas our BERT baseline reaches a MAP of 0.71 (cf. Table 4). We fine-tune BERT for sentence classification using the CLS token; the input is the concatenation of the dialogue context and the candidate response, separated by SEP tokens. When training BERT we employ a balanced number of relevant and non-relevant context-response pairs. We use cross entropy loss and the Adam optimizer [kingma2014adam].
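
Concretely, the fine-tuning input can be built as a BERT sentence pair. This sketch uses the current Hugging Face transformers API (the original work used the older pytorch-transformers package), and joining all context utterances with spaces before a single SEP is our simplification.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

def encode_pair(context_utterances, response, max_length=512):
    """Encode (dialogue context, candidate response) as a BERT sentence pair;
    the CLS representation is used for the relevance classification."""
    context = " ".join(context_utterances)  # utterances joined with spaces (simplification)
    return tokenizer(context, response, truncation=True,
                     max_length=max_length, padding="max_length",
                     return_tensors="pt")

inputs = encode_pair(["How do I reset my password?", "Which product is this about?"],
                     "You can reset it from the account settings page.")
outputs = model(**inputs, labels=torch.tensor([1]))  # 1 = relevant response
loss, logits = outputs.loss, outputs.logits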

For $\sigma_{SM}$, we use pre-trained fastText embeddings with 300 dimensions as word embeddings, and consider only a fixed maximum number of words of the dialogue contexts and responses. For $\sigma_{BM25}$, we use the default values of $k_1$, $b$ and $\epsilon$. For CL, we fix $T$ at 90% of the total training iterations, which means that we continue training for the final 10% of iterations after introducing all samples, and we set the initial fraction of instances $\delta$ to 33% of the data to avoid sampling the same instances too many times.

Evaluation

To compare our strategies with the baseline where no CL is employed, we fine-tune BERT five times for each approach with different random seeds, to rule out that the results are observed only for certain random weight initializations, and for each run we select the model with the best observed effectiveness on the development set. The best model of each run is then applied to the test set. We report effectiveness in terms of Mean Average Precision (MAP), as in prior work [wu2017sequential, yang2018response]. We perform paired Student's t-tests between each scoring/pacing-function variant and the baseline run without CL.
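
For reference, MAP over binary-labeled candidate lists and the paired t-test can be computed as follows; pairing per test instance is our assumption about how the test is applied.

import numpy as np
from scipy.stats import ttest_rel

def average_precision(ranked_labels):
    """AP for one ranked candidate list of binary labels (1 = relevant)."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(ranked_label_lists):
    return float(np.mean([average_precision(l) for l in ranked_label_lists]))

# Paired t-test between per-instance AP values of a CL run and the no-CL baseline
# (the arrays below are illustrative, not real results).
ap_curriculum = [1.0, 0.5, 1.0, 1 / 3]
ap_baseline = [1.0, 0.5, 0.5, 1 / 3]
t_stat, p_value = ttest_rel(ap_curriculum, ap_baseline)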

5 Results

We first report the results for the pacing functions (Figure 4), followed by the main results (Table 4) comparing the different scoring functions. We finish with an error analysis to understand when CL outperforms the no-curriculum baseline.

Figure 4: Average development MAP over 5 runs for different curriculum learning pacing functions; the maximum observed MAP is marked in the plot.
Figure 5: MSDialog test set MAP of curriculum learning and baseline by number of turns.

Pacing Functions

In order to understand how CL results are impacted by the pace at which we go from easy to hard instances, we evaluate the different proposed pacing functions. We display the evolution of the development set MAP (average of 5 runs) during training in Figure 4 (we use development MAP to track effectiveness during training). We fix the scoring function to $BERT_{pred}$, the best performing scoring function (more details in the next section). We see that the pacing functions with the maximum observed average MAP are root_2 and root_5 for MSDialog and MANtIS respectively. The other pacing functions, linear, geom_progression and step, also outperform the standard training baseline with statistical significance on the test set and yield similar results to the root_2 and root_5 functions.

Our results are aligned with previous research on CL [platanios2019competence]: giving the model more time to assimilate harder instances (by using a root pacing function) is beneficial to the curriculum strategy and is better than no CL, with statistical significance on both the development and test sets. For the rest of our experiments we fix the pacing function to root_2, the best pacing function for MSDialog. Let us now turn to the impact of the scoring functions.

Scoring Functions

MSDialog
run  random  #turns(U)  #words(U)  #words(r)  σ_BM25(R)  σ_SM(R)  BERT_pred  BERT_loss
1 0.7142 0.7220 0.7229 0.7182 0.7239 0.7175 0.7272 0.7244
2 0.7044 0.7060 0.7053 0.6968 0.7032 0.7003 0.7159 0.7194
3 0.7126 0.7215 0.7163 0.7171 0.7174 0.7159 0.7296 0.7225
4 0.7031 0.7065 0.7043 0.6993 0.7026 0.6949 0.7154 0.7204
5 0.7148 0.7225 0.7203 0.7169 0.7171 0.7134 0.7322 0.7331
AVG 0.7098 0.7157 0.7138 0.7097 0.7128 0.7084 0.7241 0.7240
SD 0.0056 0.0086 0.0086 0.0106 0.0095 0.0101 0.0079 0.0055
MANtIS
run  random  #turns(U)  #words(U)  #words(r)  σ_BM25(R)  σ_SM(R)  BERT_pred  BERT_loss
1 0.7203 0.7192 0.7198 0.7194 0.7166 0.7200 0.7257 0.7268
2 0.6984 0.6993 0.6989 0.6996 0.6964 0.7009 0.7067 0.7051
3 0.7200 0.7197 0.7134 0.7206 0.7153 0.7153 0.7282 0.7221
4 0.7114 0.7117 0.7002 0.6978 0.7140 0.7084 0.7240 0.7184
5 0.7156 0.7174 0.7193 0.7162 0.7147 0.7185 0.7264 0.7258
AVG 0.7131 0.7135 0.7103 0.7107 0.7114 0.7126 0.7222 0.7196
SD 0.0090 0.0085 0.0102 0.0111 0.0084 0.0079 0.0088 0.0088
Table 4: Test set MAP of 5 runs using different curriculum learning scoring functions; columns follow the order of the scoring functions in Table 2. Superscripts denote statistically significant improvements over the baseline without curriculum learning at 95%/99% confidence levels. Bold indicates the highest MAP in each row.

The most critical challenge of CL is defining a measure of the difficulty of instances. In order to evaluate the effectiveness of our scoring functions we report the test set results on both datasets in Table 4. We observe that the scoring functions which do not use the relevance labels are not able to outperform the no-CL baseline (random scoring function); they are based on features of the dialogue context and responses that we hypothesized make instances difficult for a model to learn. In contrast, for $BERT_{pred}$ and $BERT_{loss}$ we observe statistically significant improvements on both datasets across different runs. These functions differ in two ways from the unsuccessful scoring functions: they have access to the training labels, and the difficulty of an instance is based on what a previously trained model determines to be hard, rather than on our intuition.

Our results bear resemblance to Born Again Networks [furlanello2018born], where a student model which is identical in parameters and architecture to the teacher model outperforms the teacher when trained with knowledge distillation [hinton2015distilling], i.e., using the predictions of the teacher model as labels for the student model. The difference here is that instead of transferring the knowledge from the teacher to the student through the labels, we transfer the knowledge by imposing a structure/order on the training set, i.e. curriculum learning.

Error Analysis

In order to understand when CL performs better than random training samples, we fix the scoring function ($BERT_{pred}$) and the pacing function (root_2) and explore the test set effectiveness along several dimensions (cf. Figures 5 and 6). We report the results only for MSDialog, but the trends hold for MANtIS as well.

Figure 6: Test set MAP for MSDialog across different domains (left) and instances' difficulty according to $BERT_{pred}$ (right), for curriculum learning and the baseline.

We first consider the number of turns in the conversation in Figure 5. CL outperforms the baseline approach for the conversation lengths appearing most frequently (2-5 turns in MSDialog). Both the CL-based and the baseline effectiveness drop for conversations with a large number of turns. This can be attributed to two factors: (1) employing pre-trained BERT in practice allows only a certain maximum number of tokens as input, so longer conversations can lose important information due to truncation; (2) for longer conversations it is harder to identify the important information to match in the history, i.e. information spread.

Next, we look at different conversation domains in Figure 6 (left), such as physics and askubuntu: are the gains in effectiveness limited to particular domains? The error bars indicate 95% confidence intervals. We list only the most common domains in the test set. The gains of CL are spread over different domains, as opposed to being concentrated in a single domain.

Lastly, using our scoring functions we sort the test instances and divide them into three buckets: the first 33% of instances, 33%-66% and 66%-100%. In Figure 6 (right), we show the effectiveness of CL against the baseline for each bucket using $BERT_{pred}$ (the same trend holds for the other scoring functions). As expected, the bucket with the most difficult instances according to the scoring function is the one with the lowest MAP values. The improvements of CL over the baseline are again spread across the buckets, showing that CL is able to improve over the baseline for different levels of difficulty.
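
A small sketch of this bucketing analysis, assuming per-instance average precision values have already been computed (the function name and interface are ours):

import numpy as np

def map_per_difficulty_bucket(test_instances, f_score, ap_per_instance, n_buckets=3):
    """Sort test instances by a difficulty score and report MAP per bucket
    (easiest third, middle third, hardest third)."""
    order = np.argsort([f_score(x) for x in test_instances])
    buckets = np.array_split(order, n_buckets)
    return [float(np.mean([ap_per_instance[i] for i in bucket])) for bucket in buckets]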

6 Conclusions

In this work we studied whether CL strategies are beneficial for neural ranking models. We find supporting evidence for curriculum learning in IR: simply reordering the instances in the training set using a difficulty criterion leads to effectiveness improvements, requiring no changes to the model architecture. A similar relative improvement in MAP has justified novel neural architectures in the past [wu2017sequential, zhang2018modeling, zhou2018multi, tao2019one]. Our experimental results on two conversation response ranking datasets reveal (as one might expect) that it is best to use all available information ($\mathcal{U}$, $\mathcal{R}$, $\mathcal{Y}$) as evidence for instance difficulty. Future work directions include considering other retrieval tasks, different neural architectures and an investigation of the underlying reasons for why CL works.

Acknowledgements

This research has been supported by NWO projects SearchX (639.022.722) and NWO Aspasia (015.013.027).

References

Footnotes

  1. email: {g.penha-1,c.hauff}@tudelft.nl
  3. The source code is available at https://github.com/Guzpenha/transformers_cl.
  4. In a production setup the ranker would either retrieve responses from the entire corpus or re-rank the responses retrieved by a recall-oriented retrieval method.
  5. The function random is the baseline—instances are sampled uniformly (no CL).
  6. We note that using BM25 average precision as a scoring function failed to outperform the baseline.
  7. MSDialog is available at https://ciir.cs.umass.edu/downloads/msdialog/
  8. MANtIS is available at https://guzpenha.github.io/MANtIS/
  9. We use the PyTorch-Transformers implementation https://github.com/huggingface/pytorch-transformers and resort to bert-base-uncased with default settings.
  10. The BERT authors suggest CLS as a starting point for sentence classification tasks [devlin2019bert].
  11. We observed similar results to training with 1 to 10 ratio in initial experiments.
  12. https://fasttext.cc/docs/en/crawl-vectors.html
  13. https://radimrehurek.com/gensim/summarization/bm25.html
  14. If we increase the $n$ of the root function to larger values, the results drop and get closer to not using CL. This is due to the fact that a higher $n$ generates a root function with a shape similar to standard training, giving the same amount of time to easy and hard instances (cf. Figure 3).