Continual Learning for Neural Semantic Parsing


Abstract

A semantic parsing model is crucial to natural language processing applications such as goal-oriented dialogue systems. Such models can have hundreds of classes with a highly non-uniform distribution. In this work, we show how to efficiently (in terms of computational budget) improve model performance given a new portion of labeled data for a specific low-resource class or set of classes. We demonstrate that a simple approach with a specific fine-tuning procedure for the old model can reduce computational costs by 90% compared to training a new model. The resulting performance is on par with a model trained from scratch on the full dataset. We showcase the efficacy of our approach on two popular semantic parsing datasets, Facebook TOP and SNIPS.


1 Introduction

Semantic parsing is the task of mapping a natural language query into a formal language; it is extensively used in goal-oriented dialogue systems. For a given query, such a model should identify the requested action (intent) and the associated values specifying the parameters of the action (slots). For example, if the query is Call Mary, the action is call and the value of the slot contact is Mary.

The number of different intents and slots in publicly available datasets [gupta2018top, coucke2018snips] can be close to a hundred, and it may be orders of magnitude larger in real-world systems. Such a large number of classes usually causes a long tail in the class frequency distribution (Figure 1). These tail classes can be significantly improved with small quantities of additional labeled data.

Figure 1: Class distribution in the TOP dataset. The most frequent class (on the left) is SL:DATE_TIME with 13214 examples and the least frequent are IN:NEGATION and SL:GROUP (on the right) with just one example available.

However, training a neural semantic parsing model from scratch can take hours even on a relatively small public dataset (e.g., the TOP dataset [gupta2018top] consists of 30k examples). Real-world datasets can contain millions of examples [damonte2019practical], which can change the time scale to weeks. In this work, we propose to fine-tune a model that has already been trained on the old dataset (e.g., the model that already works in production) instead of training a new model, to significantly speed up the incorporation of a new portion of data. We call this setting Incremental training, as new portions of data can be added incrementally.

We focus on semantic parsing for our case studies for the following reasons. Semantic parsing is a more complex NLP task compared to classification or NER, and we hope that the lessons learned here will be more widely applicable. Task-oriented semantic parsing models tend to have a large output vocabulary that is frequently updated, and thus benefit most from the Incremental setting.

The main contributions of this work are:

  • We formulate the Incremental Training task, which is a special kind of Continual Learning task.

  • We propose new metrics: tree-path F1 and relative improvement/degradation, that are specific to Incremental training and semantic parsing.

  • We experiment with three different approaches: data mixing, layer freezing, and move norm - a regularization method specific to Incremental training.

  • We showcase a simple retraining recipe that reduces computational costs by a factor of ten (in several cases) compared to training from scratch.

2 Related work

Fine-tuning of pre-trained deep learning models is an active area of research, but most recent work [Peters2019, devlin2018bert] focuses on new tasks or new labels.  \citeauthorHoulsby2019 \shortciteHoulsby2019 introduce adapters, which modify the pre-trained representations for new tasks by adding per-task parameters.  \citeauthorPeters2019 \shortcitePeters2019 compare two common fine-tuning strategies, freezing the encoder (feature extraction) and fine-tuning the whole model, on a variety of downstream tasks.  \citeauthorHoward2018 \shortciteHoward2018 propose techniques for effective fine-tuning of language models for classification tasks, including gradual unfreezing while training. All of these works focus on taking a pre-trained model and fine-tuning it on a different downstream task.

In contrast, there is not much recent work in the NLP community on the "data-patch" use-case. In this setting, we fine-tune on the same task but under a different data distribution. This has been commonly referred to as continual learning (CL) in the broader ML community. Continual learning has been a long-studied problem [Hassabis2017]. One of its main challenges is catastrophic forgetting [Thrun1995]: the network forgets existing knowledge when learning from novel observations. Due to the interleaving of data, training from scratch usually does not suffer from catastrophic forgetting, as the network is jointly optimized for all classes.

We draw inspiration from the work on lifelong learning [Kirkpatrick2017, de2019episodic, Sun2020LAMOLLM]. We also survey popular fine-tuning strategies, including gradual unfreezing and discriminative fine-tuning [Howard2018].  \citeauthorKirkpatrick2017 \shortciteKirkpatrick2017 introduce Elastic Weight Consolidation (EWC), a regularization approach that adds a penalty on the distance between the weights of the original and the fine-tuned model.  \citeauthorrobins1995catastrophic \shortciterobins1995catastrophic showed that interleaving information about new experiences with previous experiences can help overcome catastrophic forgetting.  \citeauthorde2019episodic \shortcitede2019episodic propose sparse experience replay for continual language learning.  \citeauthorParisi2019 \shortciteParisi2019 provide a comprehensive review of continual lifelong learning techniques in neural networks.

Our proposed approaches are a combination of interleaving old and new information, selective layer freezing, and simple regularization methods between the pre-trained and fine-tuned models.

3 Incremental training

Here, we simplify the training setting as follows. Suppose we have a network $M_{\text{pretrain}}$ trained on the dataset $D_{\text{pretrain}}$ as our base model. Now we are presented with a dataset $D_{\text{finetune}}$, whose data distribution differs from that of $D_{\text{pretrain}}$, as our fine-tuning dataset. If $T$ is the time to train from scratch on $D_{\text{pretrain}} \cup D_{\text{finetune}}$, our goal is to fine-tune $M_{\text{pretrain}}$ and produce $M_{\text{finetune}}$ in time $t \ll T$, with performance comparable to training from scratch. In this work, we study the case where $D_{\text{finetune}}$ has a skewed data distribution, significantly different from that of $D_{\text{pretrain}}$.
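The setting can be summarized compactly (a sketch of the objective using the notation above; $\mathrm{train}$ and $\mathrm{finetune}$ denote the respective training procedures):

\begin{align*}
M_{\text{pretrain}} &= \mathrm{train}(D_{\text{pretrain}}), \\
M_{\text{finetune}} &= \mathrm{finetune}(M_{\text{pretrain}}, D_{\text{finetune}}), \qquad t_{\text{finetune}} \ll T, \\
\mathrm{score}(M_{\text{finetune}}) &\approx \mathrm{score}\!\left(\mathrm{train}(D_{\text{pretrain}} \cup D_{\text{finetune}})\right).
\end{align*}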

3.1 Splits: i.i.d vs non-i.i.d

While an identical distribution of classes in $D_{\text{pretrain}}$ and $D_{\text{finetune}}$ may be a natural assumption, such a setup may not be very common in practice.

In many cases, to improve an existing system, one may want to collect and label either more challenging examples (i.e., active learning) or a specific small subset of classes. For example, to improve the user experience with navigation-related queries, $D_{\text{finetune}}$ can mainly consist of the corresponding intents and slots and contain no queries related to music or news. Note that such a dataset will have a significantly skewed distribution. Such a discrepancy can cause additional issues for incremental training, as the train set distribution will be additionally distanced from the test set.

3.2 Incremental Training for Semantic Parsing

Several particularities make semantic parsing an interesting task for incremental training. While standard classification tasks assume a single label for each training example, the output for semantic parsing is a tree with multiple nodes. Generally, this does not allow $D_{\text{finetune}}$ to contain only a single class, which increases the diversity of $D_{\text{finetune}}$ and should limit (at least a bit) the train/test discrepancy.

Another characteristic of semantic parsing for dialogue is the number of classes. For example, the publicly available TOP dataset [gupta2018top] uses 25 intents and 36 slots for just two domains, navigation and events. Real production systems can contain hundreds of classes for a single domain [rongali2020don]. This makes Incremental training particularly interesting for semantic parsing, as the long tail of the class distribution calls for frequent model updates.

4 Fine-tuning approaches

One of the most important issues to address when fine-tuning a model on new data is the performance on the classes underrepresented in $D_{\text{finetune}}$.

4.1 Old data sampling

\citeauthorde2019episodic \shortcitede2019episodic proposed to use sparse experience replay to mitigate catastrophic forgetting in lifelong language learning. We follow their idea, but instead of implementing an experience replay, we sample directly from the pretrain data during fine-tuning.

We study two setups: static and dynamic sampling. Each of them has its pros and cons.

Static sampling takes a proportion $p$ of the pretrain data before the fine-tuning procedure. Then, fine-tuning happens on the combined dataset of sampled data and the fine-tune subset, i.e., on $S_p(D_{\text{pretrain}}) \cup D_{\text{finetune}}$, where $S_p$ is a random sampler that selects each example with probability $p$. In this case, the fine-tuning method only uses a limited amount of pretrain data. This may be beneficial in terms of privacy and in the federated training setup. As this kind of sampling can be considered the simplest way to avoid catastrophic forgetting, we call it baseline in our plots.

Dynamic sampling samples the same proportion $p$ of the pretraining data at the beginning of each epoch. In this case, more of the old data gets reused, yielding a more diverse dataset, while the epoch length stays the same compared to static sampling.
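A minimal sketch of the two sampling strategies, assuming in-memory lists of examples; the function and variable names are illustrative, not the ones used in our implementation:

```python
import random

def static_mix(pretrain_data, finetune_data, p, seed=0):
    """Sample a fixed p-fraction of the pretrain data once, before
    fine-tuning starts, and merge it with the new data."""
    rng = random.Random(seed)
    k = int(p * len(pretrain_data))
    sampled = rng.sample(pretrain_data, k)
    return sampled + finetune_data  # reused unchanged every epoch

def dynamic_mix(pretrain_data, finetune_data, p, epoch):
    """Resample a p-fraction of the pretrain data at the start of each
    epoch; the epoch length matches static sampling, but more of the
    old data is seen over the whole fine-tuning run."""
    rng = random.Random(epoch)  # a different subset every epoch
    k = int(p * len(pretrain_data))
    sampled = rng.sample(pretrain_data, k)
    return sampled + finetune_data
```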

4.2 Regularization methods

As the fine-tuning set is relatively small and its class distribution is very different from the pretraining set, it is natural to expect overfitting and poor generalization performance. Dropout is a standard deep learning tool that helps to deal with this problem.

We also introduce a simple regularization method, similar to weight decay but targeted specifically at Incremental training / Continual learning. Move norm is a regularizer that prevents the model from diverging from the pre-trained weights (Equation 1). It is added to the loss function analogously to weight decay and is parametrized by $\lambda$:

$$L = L_{\text{task}} + \lambda \, d(\theta, \theta_{\text{pretrain}}) \qquad (1)$$

For the distance $d(\theta, \theta_{\text{pretrain}})$ between the current and the pre-trained weights, we experimented with both euclidean (L2) and manhattan (L1) distances and found euclidean to be more effective.
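A minimal PyTorch-style sketch of the move-norm penalty from Equation 1, assuming a snapshot of the pre-trained parameters is kept in memory; the names are illustrative:

```python
import torch

# snapshot taken right after loading the pre-trained checkpoint:
# pretrained_params = {n: p.detach().clone() for n, p in model.named_parameters()}

def move_norm(model, pretrained_params, lam, p=2):
    """Penalize the distance between current and pre-trained weights.
    p=2 gives the euclidean (L2) variant, p=1 the manhattan (L1) one."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, param in model.named_parameters():
        if param.requires_grad:
            penalty = penalty + torch.norm(param - pretrained_params[name], p=p)
    return lam * penalty

# usage inside the training loop, with task_loss computed as usual:
# loss = task_loss + move_norm(model, pretrained_params, lam=0.01)
```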

4.3 Layer Freezing

One of the methods to limit how much the fine-tuned model diverges from the pre-trained one is layer freezing, i.e., not updating the weights in the chosen layers. \citeauthorHoward2018 \shortciteHoward2018 showed that freezing can reduce test error by about 10% (relative change). Freezing can also be viewed as a "hard" form of the move norm regularization.
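In practice this amounts to disabling gradients for the chosen sub-modules; a minimal sketch, assuming the model exposes encoder and decoder attributes (hypothetical names):

```python
def freeze(module):
    """Exclude a sub-module from gradient updates during fine-tuning."""
    for param in module.parameters():
        param.requires_grad = False

freeze(model.encoder)    # the "freeze encoder" setting
# freeze(model.decoder)  # additionally freeze the decoder: only the output
#                        # projection and the pointer network are updated
```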

5 Experimental setup

Our experiments aim to model realistic scenarios in which retraining can be useful. The model used in this study is a sequence-to-sequence transformer model with a pointer network and a BERT encoder, as in \citeauthorrongali2020don \shortciterongali2020don. Training a model on the full TOP dataset using a single V100 GPU with early stopping patience of 10 (monitoring exact match accuracy) took 1.5 hours in our experiments.

To mimic an incremental learning setup, we first split the training set into two parts: pretrain¹ (old data) $D_{\text{pretrain}}$ and fine-tune (new data) $D_{\text{finetune}}$. First, a semantic parsing model ($M_{\text{pretrain}}$) is trained on $D_{\text{pretrain}}$.² $D_{\text{finetune}}$ represents the new data that we want to include into the pre-trained model. This data is used to fine-tune the model.

In our freezing experiments, we freeze either an encoder or encoder with the decoder for the whole fine-tuning procedure. In the second case, only the final projection and the pointer network are updated.

We use exact match (EM) to evaluate our model and a Tree Path F1 Score (TP-F1, Section 5.2) to evaluate performance on a specific class or a subset of classes. TP-F1 is used instead of EM as a more fine-grained metric that is better suitable for per-class evaluation.

In addition to this, we also use Relative Improvement (RI) and Relative Degradation (RD) scores that are computed over classes that changed significantly. We describe these metrics in Section 5.3.

5.1 Data splits

We experiment with multiple data splits, which are summarized in Table 1. Every split is constructed as follows: choose a split class $c$ and a split percentage $p$, then randomly move $p$ percent of the training examples containing class $c$ to the fine-tune subset. This splits the original training set into fine-tune (new) and pretrain (old) parts.

For our experiments, we selected mid-frequency classes for splitting. In this case, a 90%+ split leaves most of the training data in the old subset, while the amount of new data is enough to improve the model significantly.

The splitting procedure aims to mimic the real-world iterative setup where a model trained on the old data already exists and we want to incorporate new data into it. To ensure that the resulting pretraining subset contains all possible classes, before the splitting procedure we include in the pretrain subset a small set of training examples that covers all of the classes.
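A sketch of the splitting procedure; the example representation and the seed-set step are simplified, and the `classes` field is assumed to hold the intent and slot labels present in an example:

```python
import random

def split_dataset(train_set, split_class, split_pct, seed=1):
    """Move split_pct of the examples containing split_class into the
    fine-tune (new) subset; everything else stays in the pretrain (old)
    subset. The small seed set covering all classes is assumed to have
    been set aside in the pretrain subset beforehand."""
    rng = random.Random(seed)
    with_class = [ex for ex in train_set if split_class in ex["classes"]]
    without_class = [ex for ex in train_set if split_class not in ex["classes"]]

    rng.shuffle(with_class)
    n_new = int(split_pct * len(with_class))
    finetune = with_class[:n_new]
    pretrain = with_class[n_new:] + without_class
    return pretrain, finetune
```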

Split name Dataset # pretrain # fine-tune
PATH 99 TOP 15 1.5k
PATH 90 TOP 150 1.3k
NAME EVENT 95 TOP 60 1.1k
GETWEATHER 99 SNIPS 20 1.9k
GETWEATHER 90 SNIPS 200 1.7k
Table 1: Summary of the data splits used in the experiments. # pretrain is the number of examples with the split class (e.g. PATH) in the pretrain set, note that it is not equal to the size of the pretrain set as it also contains other classes. # fine-tune is the size of the fine-tune set (every example in the fine-tune set contains the split class).

5.2 Tree Path Score

To compute the Tree Path F1 Score (TP-F1), the parsing tree is flattened into tree paths. Intent-related text tokens are ignored, and slot values (or terminating intents) finish each tree path as a single token. This procedure is performed for both the correct and the predicted trees, and then the F1 score is computed on the paths as

$$\text{TP-F1} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},$$

where, following the standard definition, $\text{precision} = \frac{\#\,\text{correct paths}}{\#\,\text{predicted paths}}$ and $\text{recall} = \frac{\#\,\text{correct paths}}{\#\,\text{expected paths}}$.

Figure 2: Semantic parsing example. Each tree path starts at the IN:GET_DEPARTURE node and finishes with a slot value (the values are in the boxes). Tree paths are colored.

For example, if the query is ”When should I leave for my dentist appointment at 4 pm”, and the parsing tree looks like the Figure 2, then it has 4 tree paths. Every path starts at the root (IN:GET_DEPARTURE) and goes to the slot value. For the SL:DESTINATION slot, the value is compositional and equal to the string [IN:GET_EVENT [SL:NAME_EVENT dentist] [SL:CATEGORY_EVENT appointment]] .

Say the predicted tree has a different value for the slot name event; it would then have two paths that differ from the correct ones: a path to the value of the name event slot and a path to the value of the destination slot, because the destination value is compositional and contains name event as a part of it. In this case, the number of correctly predicted paths would be 2 (the time arrival and category event slots), the number of predicted paths would be 4, and the number of expected paths also 4. TP-F1 in this case equals 1/2.

To compute a per-class score, only the paths containing the class are considered. In the example above, name event's TP-F1 is zero, as there are no correctly predicted paths for it.
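A sketch of the path-level F1 computation, assuming both trees have already been flattened into collections of root-to-slot-value path strings (the flattening step is omitted):

```python
def tree_path_f1(expected_paths, predicted_paths):
    """Tree Path F1 over two collections of tree paths (Section 5.2)."""
    expected, predicted = set(expected_paths), set(predicted_paths)
    correct = len(expected & predicted)
    if correct == 0 or not expected or not predicted:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(expected)
    return 2 * precision * recall / (precision + recall)

# worked example from the text: 2 of 4 predicted paths are correct and
# 4 paths are expected, so precision = recall = 1/2 and TP-F1 = 1/2
```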

5.3 RI and RD scores

Dialogue datasets for semantic parsing can contain dozens of classes. In our initial experiments, we saw significant model retrain jitter which made it harder to compare different retraining setups.

For this reason, we use Relative Improvement (RI) and Relative Degradation (RD) scores. To compute them, we

  1. before and after fine-tuning, bootstrap the test set 5 times and estimate the mean and standard deviation of each per-class metric;

  2. split all classes into three categories: the metric degraded by more than two standard deviations³, improved by more than two standard deviations, or did not change significantly;

  3. compute the relative value change for each class;

  4. sum metrics for improved and degraded classes with the weights corresponding to the class frequencies;

which gives RI and RD, respectively. In comparison to TP-F1, which is used to compute a per-class score, RI and RD consider the metrics of all classes that changed significantly. A good Incremental training procedure should have near-zero RD and high RI.
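For concreteness, a sketch of how RI and RD could be computed from per-class estimates; the data layout and the exact thresholding are our illustrative reading of the procedure above:

```python
def ri_rd(before, after, freq):
    """before/after: per-class dicts of (mean, std) metric estimates from
    bootstrapping; freq: per-class frequency weights. Returns (RI, RD)."""
    ri, rd = 0.0, 0.0
    for cls in before:
        mean_b, std_b = before[cls]
        mean_a, std_a = after[cls]
        threshold = 2 * (std_b + std_a)  # summed stds, see footnote 3
        if mean_b == 0:
            continue  # skip classes with a zero baseline to avoid division by zero
        change = (mean_a - mean_b) / mean_b  # relative value change
        if mean_a - mean_b > threshold:
            ri += freq[cls] * change
        elif mean_b - mean_a > threshold:
            rd += freq[cls] * change  # non-positive by construction
    return ri, rd
```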

A Note on Bootstrapping.

To estimate the uncertainty of our metrics, we evaluate the model on different parts of the test set. First, we randomly divide the test set into 5 folds. Then, we evaluate the model 5 times, each time using all but the i-th fold. Finally, the mean and standard deviation are computed on these 5 points.
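A sketch of this leave-one-fold-out evaluation scheme; metric_fn stands for any of the metrics above:

```python
import numpy as np

def bootstrap_metric(test_set, metric_fn, n_folds=5, seed=0):
    """Evaluate on n_folds subsets, each leaving out one fold, and
    return the mean and standard deviation of the metric."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(test_set))
    folds = np.array_split(order, n_folds)
    scores = []
    for i in range(n_folds):
        keep = np.concatenate([f for j, f in enumerate(folds) if j != i])
        subset = [test_set[k] for k in keep]
        scores.append(metric_fn(subset))
    return float(np.mean(scores)), float(np.std(scores))
```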

Figure 3: BERT encoder (solid line) vs best randomly initialized transformer (no-BERT, dashed line).
Figure 4: PATH 90 experiments. The model is fine-tuned on 90% of the examples containing the PATH class, with different amounts of pretrain data mixed with the fine-tuning data. Legend: initial – train on the pretrain set; from scratch – train on all data; baseline – merge a part of the pretrain set with the fine-tune set; dynamic sampling – sample a different subset from the pretrain set every epoch; move norm – regularize the distance to the initial model; freeze encoder – do not update encoder weights. The training time estimates are approximate and depend slightly on the early stopping parameter and the fine-tuning method.

5.4 Datasets

We use 2 popular task-oriented semantic parsing datasets: Facebook TOP [gupta2018top] and SNIPS [coucke2018snips]. Unlike SNIPS, which consists of flat queries containing a single intent and simple slots (no slot nesting), the TOP format allows complex tree structures. More specifically, each node of the parsing tree is either an intent name (e.g., IN:GET_WEATHER), a slot name (SL:DATE), or a piece of the input query text.

The tree structure is represented as a string of labeled brackets with text, and the semantic parsing model should predict the structure using the input query in a natural language. For example, What’s the weather like today in Boston should become [IN:GET_WEATHER what’s the weather like [SL:DATE today] in [SL:LOC Boston]].

The TOP dataset consists of 45k hierarchical queries; about 35% of the queries are complex (tree depth > 2). The SNIPS dataset has a simpler structure (only a single intent per query), and its train part consists of 15k examples.

To unify the datasets, SNIPS is reformatted to fit the TOP structure.⁴ More precisely, an example from the reformatted SNIPS dataset could look like this: [IN:INTENT some text [SL:SLOT1 value1]].
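To illustrate the kind of conversion involved, here is a sketch assuming a SNIPS example is given as an intent name plus per-token BIO slot tags; the actual released scripts may differ:

```python
def snips_to_top(intent, tokens, tags):
    """Convert a flat SNIPS example (intent name + per-token BIO slot tags)
    into a TOP-style bracketed string, e.g.
    [IN:GETWEATHER what is the weather in [SL:CITY boston]]."""
    pieces, slot_name, slot_tokens = [], None, []

    def close_slot():
        nonlocal slot_name, slot_tokens
        if slot_name is not None:
            pieces.append(f"[SL:{slot_name} {' '.join(slot_tokens)}]")
            slot_name, slot_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            close_slot()
            slot_name, slot_tokens = tag[2:].upper(), [token]
        elif tag.startswith("I-") and slot_name is not None:
            slot_tokens.append(token)
        else:
            close_slot()
            pieces.append(token)
    close_slot()
    return f"[IN:{intent.upper()} {' '.join(pieces)}]"
```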

6 Results

We first measured the effect of the BERT encoder on task performance. The results are shown in Figure 3. To make the comparison fairer, we performed a hyperparameter search over the number of layers and hidden sizes of the no-BERT model. The best no-BERT model had 4 layers, 4 heads, a model hidden size of 256, and an FFN hidden size of 1024. As expected, we see better sample efficiency with BERT: even with 20% of the original dataset, the BERT model matches the performance of the full no-BERT model. Since BERT could have a profound effect on the final results, we use BERT in all our experiments.

6.1 Baselines

We start our experiments by establishing two important baselines. The first baseline, no fine-tuning, shows whether the fine-tuning procedure is useful at all: it is simply the performance of the model trained on the pretrain (old) subset. In our figures, we call this baseline initial. Table 2 contains the metrics (on the validation set) of this baseline for the models used in our experiments.

Figure 5: PATH 99 and GETWEATHER 99 experiments. Compared to PATH 90, degradation decreases more significantly; however, performance on the PATH class increases less significantly when more of the old data is used. The behavior of GETWEATHER 99 is similar to PATH 99.
Split TP-F1 EM
PATH 99 0.13 ± 0.01 0.801 ± 0.003
PATH 90 0.51 ± 0.02 0.815 ± 0.002
NAME EVENT 95 0.61 ± 0.02 0.809 ± 0.002
GETWEATHER 90 0.93 ± 0.01 0.911 ± 0.004
GETWEATHER 99 0.77 ± 0.01 0.892 ± 0.004
Table 2: Performance of the pre-trained models. TP-F1 is reported for the split class. Means and standard deviations are estimated via bootstrapping.

The second baseline is an upper bound for the fine-tuning procedure and also the standard way to incorporate more data into a model: retraining the whole model from scratch on the merged pretrain+fine-tune dataset. This baseline yields high final performance and no catastrophic forgetting, as all of the data is used, but the method is inherently slow and computationally expensive. In the figures, we call this baseline from scratch. EM values for this baseline can be found in Table 3 and TP-F1 values in Table 4.

To estimate this important baseline with high precision, we trained 5 models with different random seeds. The resulting mean and standard deviation were compared with those from the bootstrapping method. Both bootstrapping and training with multiple random seeds produced similar results; hence, we chose to use only bootstrapping for the fine-tuning experiments to significantly reduce computational requirements.

Dataset EM
TOP 0.819 ± 0.003
SNIPS 0.949 ± 0.002
Table 3: Exact match for the models trained on the whole dataset. Mean and standard deviation are estimated via training 5 models with different random seeds.
Dataset Class TP-F1
TOP PATH 0.66 ± 0.01
TOP NAME EVENT 95 0.84 ± 0.01
SNIPS GETWEATHER 90 0.97 ± 0.01
Table 4: Tree-Path F1 score for selected classes for the models trained on the whole dataset. Mean and standard deviation are estimated via training 5 models with different random seeds.

6.2 TOP dataset results

PATH 90 split.

Figure 4 presents the experimental results on the PATH 90 split. In this split, the fine-tuning part contains 90% of the examples with the PATH class, and the left plot shows its TP-F1 score. Surprisingly, even this class degrades if the pretraining data is not used along with the fine-tuning data (notice that all of the plots at old data = 0 are below the initial baseline). There is a 3.5% relative degradation at this point (old data = 0). However, starting from 5% of pretraining data, the performance on the PATH class beats the no-fine-tuning baseline. The dynamic sampling method, which selects a new subset of the pretraining data every epoch, is able to reach the performance of the model trained from scratch in 10 minutes, which is about 10% of the training-from-scratch time (training from scratch takes 1.5 hours).

The exact match metric differs from TP-F1 and RD in that it is computed over all classes. As expected, we can see that the model improves significantly when RD is low and TP-F1 is high (EM plots are available in Appendix B).

PATH 99 split.

This split is more extreme than PATH 90: the fine-tuning subset contains 99% of the PATH data, and the pretraining subset contains only 15 PATH examples (Table 1). Our experiments (Figure 5) show that this case benefits from fine-tuning both sooner (in terms of the amount of old data needed to exceed the initial performance baseline) and more significantly. However, 50% of the old data is needed to reach the performance of retraining from scratch (Figure 5).

6.3 SNIPS dataset results

GETWEATHER 99 split.

Along with the TOP dataset, we also experimented with SNIPS. Figure 5 presents the experimental results for this split. One extra experiment that we tried with this split is freezing both the encoder and the decoder. In this setup, we only updated the weights of the output projection and the pointer network. The results show that this method performs substantially worse than the others. This may happen because the features the model has learned on the pretrain subset are insufficient to correctly classify examples from the new data; thus, the model needs to change its representations and not only reweight the existing ones. Another factor that can contribute to this behavior is the distribution shift between the pretrain and fine-tune sets, as the logit network mainly learns the simplest features such as class frequency.

More experimental data can be found in the Appendix B.

6.4 Parametric regularization methods

In our experiments, we tried multiple methods to further mitigate catastrophic forgetting and overfitting. While some of the methods are non-parametric, others, like dropout or move norm, require parameter tuning. To study the parameter dependence, we selected the most challenging split, NAME EVENT 95, and fine-tuned multiple models with different regularization parameter values.

Dropout: Our initial results showed that the model easily fits the whole fine-tuning set, so we decided that a slight increase in dropout probability might be beneficial.⁵ Nonetheless, our results (Figure 6) suggest that increasing dropout does not improve fine-tuning performance and that large dropout probabilities can significantly hurt classes underrepresented in the fine-tuning data.

Move norm: This method, unlike dropout, specifically targets catastrophic forgetting. However, unlike the freezing or sampling methods, it requires careful parameter tuning, as we demonstrate in Figure 6. While move norm did not improve the results significantly on the NAME EVENT 95 split, we can see positive examples on PATH 99 (Figure 5). Another advantage of move norm is that high regularization strengths do not degrade performance on the classes underrepresented in the fine-tuning data.

Figure 6: Dropout probability and move norm regularization strength variation experiments. NAME EVENT 95 split. The high move norm, unlike dropout, does not increase RD.

7 Practical Recommendations

In summary, here is a short list of practical recommendations that can be helpful for real-world applications:

  • Sample from the old data during fine-tuning; use 20-30% for dynamic sampling and 30-50% for the other methods.

  • If only a part of the old data is accessible, additionally regularize the model via encoder freezing or move norm (the latter requires hyperparameter tuning).

  • If the application allows, use the whole old subset for sampling (dynamic sampling).

8 Discussion

In this work, we consider a practical side of CL that has previously been overlooked by NLP researchers: the ability to quickly update an existing model with new data. Nowadays, with the performance of models scaling superlinearly with their size, training time becomes a more pressing issue every year. We anticipate that in the near future of billion-parameter models, incremental and continual learning settings will not only lead to a significant advantage in terms of resource efficiency but also become a necessity.

Our experimental results show that a simple incremental setup can reduce computational costs by up to 90%. This is beneficial both in terms of speeding up the development cycle and in terms of the environmental impact, which is becoming more and more significant in the field [Strubell_2019].

We also want to note some of the negative results we discovered. Training only the top layer (including the pointer network) is a surprisingly bad way to include more data, possibly because some of the feature engineering should happen in the lower layers of the model. Also, even though the model quickly fits the fine-tuning data, increasing regularization does not seem to improve the final performance. Finally, a naïve combination of successful methods such as dynamic sampling, freezing, and move norm does not seem to help either.

The Continual Learning community has made incredible progress using methods such as EWC [Kirkpatrick2017] and LAMOL [Sun2020LAMOLLM]. Many of these approaches are applicable in a real-world scenario and should be tested in practical applications. In our future work, we want to consider such models and evaluate them in terms of both performance (e.g., accuracy) and computational costs. Another important direction is to study how the model changes after multiple iterative updates.

References

Appendix A Hyperparameters and training setup

Table 5 contains hyperparameters used for pretraining. We used the Noam schedule [Vaswani2017] for the learning rate. Note that it involves learning rate scaling.
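For reference, a sketch of the Noam schedule; our reading is that the decoder and encoder learning rates in Table 5 act as multiplicative factors on top of this scaled value:

```python
def noam_lr(step, model_size=256, warmup_steps=1500, factor=1.0):
    """Noam learning-rate schedule (Vaswani et al., 2017): linear warmup
    followed by inverse-square-root decay, scaled by model_size ** -0.5.
    `factor` stands for the per-module multiplier (e.g. 0.2 for the
    decoder and 0.02 for the encoder in Table 5)."""
    step = max(step, 1)
    return factor * model_size ** -0.5 * min(step ** -0.5,
                                             step * warmup_steps ** -1.5)
```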

For fine-tuning, the same parameters were used unless stated otherwise in the experiment description. The only exception to this rule is the batch size, which we set to 128 for all fine-tuning experiments. The model, optimizer, and learning rate scheduler states were restored from the checkpoint with the best EM.

encoder model BERT-Base cased
decoder layers 4
decoder n heads 4
decoder model size 256
decoder ffn hidden size 1024
label smoothing 0.1
batch size 192
optimizer ADAM
ADAM betas ,
ADAM
decoder lr 0.2
encoder lr 0.02
warmup steps 1500
frozen encoder steps 500
dropout 0.2
early stopping 10
random seed 1
Table 5: Training and model hyperparameters

Appendix B Additional Experimental Results

I.I.D. Splits Fine-Tuning

Figure 10 presents fine-tuning experiments for the i.i.d. cases. Each model is trained on a fraction (train amount) of the original train set and fine-tuned on the rest. We do not plot TP-F1 because with an i.i.d. split there is no specific split class that we expect to improve; instead, we monitor the performance of the model on all classes. Also, the initial performance baseline is not present on the plot, as every model has a different initial performance. Table 6 presents initial values for every model.

From Figure 10 we can see that the models with smaller pretrain sets (bigger fine-tuning sets) require almost no data sampling to achieve from-scratch performance, while the models trained on a larger subset need up to 50%. We can also see that the models pretrained on larger amounts of data tend to have higher RD. This can be explained by the fact that these models have better initial performance and thus more potential for degradation.

Train amount 0.1 0.3 0.7 0.9
EM 0.74 0.78 0.81 0.81
Table 6: Exact match for models pretrained on the given train amount of data from i.i.d. splits. The standard deviation (estimated via bootstrapping) does not exceed 0.003 for each split.

Varying the Amount of New Data

Figure 7: PATH 99 new data results.

An important question regarding the Incremental training procedure is how much additional data labeling is needed to improve the model. To answer it, we studied how the amount of fine-tuning data affects the final performance on the PATH 99 split.

A random subset, ranging in size from 10% to 90% of the fine-tuning data, was sampled from the fine-tuning data and used as the fine-tuning set in this experiment. The results are depicted in Figure 7; the dynamic sampling method was used to mitigate catastrophic forgetting. Each model was fine-tuned with early stopping patience of 10.

In general, we can say that catastrophic forgetting affects the model more when more new data is present. This can be explained by the fact that, to finish a single epoch, a model with a bigger new-data subset is updated more times, so the network can drift further from the pretrained one. This effect is especially visible in the old data = 0 fine-tuning setting, which changes the relative degradation value by an order of magnitude (Figure 8). The results suggest that even an extra hundred labeled examples can yield a significant improvement over the initial model if the target class is very rare in the pretraining data.

Figure 8: PATH 99 new data RD without y-axis truncation.

RI vs RD

In Section 5.3 we presented two metrics, relative improvement (RI) and relative degradation (RD). Here we demonstrate them on a single plot to compare their scales for a typical task. We can see that even without old data sampling, the model still significantly improves on some classes, and the associated relative degradation is significantly lower than the improvement (Figure 9). This can be specific to semantic parsing and caused by the fact that every example contains multiple classes, including the frequent ones. As RD is weighted according to class frequency, it can only be large (in absolute value) if the frequent classes are degraded.

Figure 9: Relative improvement and degradation metrics for the PATH 99 split. No old data sampling.

Additional plots

In this section we present additional experimental data, analogous to that presented in the main paper. Figures 11 and 12 demonstrate fine-tuning results for the NAME EVENT 95 and GETWEATHER 90 splits, respectively. Figures 13 and 14 show exact match accuracy for the PATH 90 and GETWEATHER 99 splits, respectively.

These plots exhibit behavior similar to their counterparts in the main text.

Figure 10: I.I.D. splits. Each model is pretrained on the train amount of the data and then fine-tuned on the rest of the data with different amounts of old data sampling.
Figure 11: TOP NAME EVENT 95 results.
Figure 12: SNIPS GETWEATHER 90 results.
Figure 13: Exact match for PATH 90 experiments.
Figure 14: SNIPS GETWEATHER 99 exact match.

Footnotes

  1. Note that we use the word pretrain for the part of the supervised data that the initial model was trained on, or for the process of training on it. By this term we do not mean the BERT-like pretraining procedure or the data used for it.
  2. Hyperparameters and training details can be found in the Appendix A
  3. We sum the standard deviations before and after retraining to account for the different amounts of uncertainty.
  4. We are planning to release reformatting scripts along with the rest of the source code.
  5. We used 0.2 dropout during pretraining.