Continual Learning for Neural Semantic Parsing
A semantic parsing model is crucial to natural language processing applications such as goal-oriented dialogue systems. Such models can have hundreds of classes with a highly non-uniform class distribution. In this work, we show how to efficiently (in terms of computational budget) improve model performance given a new portion of labeled data for a specific low-resource class or set of classes. We demonstrate that a simple approach with a specific fine-tuning procedure for the old model can reduce computational costs by 90% compared to training a new model. The resulting performance is on par with a model trained from scratch on the full dataset. We showcase the efficacy of our approach on two popular semantic parsing datasets, Facebook TOP and SNIPS.
Semantic parsing is the task of mapping a natural language query into a formal language; it is extensively used in goal-oriented dialogue systems. For a given query, such a model should identify the requested action (intent) and the associated values specifying the parameters of the action (slots). For example, if the query is Call Mary, the action is call and the value of the slot contact is Mary.
The number of different intents and slots in publicly available datasets [gupta2018top, coucke2018snips] can be close to a hundred, and it may be orders of magnitude larger in real-world systems. Such a large number of classes usually causes a long tail in the class frequency distribution (Figure 1). Performance on these tail classes can be significantly improved with small quantities of additional labeled data.
However, training a neural semantic parsing model from scratch can take hours even on a relatively small public dataset (e.g., the TOP dataset [gupta2018top] consists of 30k examples). Real-world datasets can contain millions of examples [damonte2019practical], which can stretch the time scale to weeks. In this work, we propose to fine-tune a model that has already been trained on the old dataset (e.g., the model that already works in production) instead of training a new model, to significantly speed up the incorporation of a new portion of data. We call this setting Incremental training, as new portions of data can be added incrementally.
We focus on semantic parsing for our case studies for the following reasons. Semantic parsing is a more complex NLP task than classification or NER, so we hope that the lessons learned here are more widely applicable. Task-oriented semantic parsers tend to have a large output vocabulary that is frequently updated and thus benefit most from the Incremental setting.
The main contributions of this work are:
We formulate the Incremental training task, a special kind of Continual Learning task.
We propose new metrics: tree-path F1 and relative improvement/degradation, that are specific to Incremental training and semantic parsing.
We experiment with three different approaches: data mixing, layer freezing, and move norm - a regularization method specific to Incremental training.
We showcase a simple retraining recipe that reduces computational costs by a factor of ten (in several cases) compared to training from scratch.
2 Related work
Fine-tuning of pre-trained deep learning models is an active area of research, but most recent work [Peters2019, devlin2018bert] focuses on new tasks or new labels. \citeauthorHoulsby2019 \shortciteHoulsby2019 introduce adapters, which modify the pre-trained representations for new tasks by adding per-task parameters. \citeauthorPeters2019 \shortcitePeters2019 compare two common fine-tuning strategies, freezing the encoder (feature extraction) and fine-tuning the whole model, on a variety of downstream tasks. \citeauthorHoward2018 \shortciteHoward2018 propose techniques for effective fine-tuning of language models for classification tasks, including gradual unfreezing during training. All of these works focus on taking a pre-trained model and fine-tuning it on a different downstream task.
In contrast, there is not much recent work in the NLP community on the "data-patch" use case, in which we fine-tune on the same task but under a different data distribution. This setting has been commonly referred to as continual learning (CL) in the broader ML community and has been studied for a long time [Hassabis2017]. One of its main challenges is catastrophic forgetting [Thrun1995]: the network forgets existing knowledge when learning from novel observations. Due to the interleaving of data, training from scratch usually does not suffer from catastrophic forgetting, as the network is jointly optimized for all classes.
We draw inspiration from the work on lifelong learning [Kirkpatrick2017, de2019episodic, Sun2020LAMOLLM]. We also survey popular fine-tuning strategies, including gradual unfreezing and discriminative fine-tuning [Howard2018]. \citeauthorKirkpatrick2017 \shortciteKirkpatrick2017 introduce Elastic Weight Consolidation (EWC), a regularization approach which adds a penalty between the weights of the original and the fine-tuned model. \citeauthorrobins1995catastrophic \shortciterobins1995catastrophic showed that interleaving information about new experiences with previous experiences can help overcome catastrophic forgetting. \citeauthorde2019episodic \shortcitede2019episodic propose sparse experience replay for continual language learning. \citeauthorParisi2019 \shortciteParisi2019 provide a comprehensive review of continual lifelong learning techniques in neural networks.
Our proposed approaches are a combination of interleaving old and new information, selective layer freezing, and simple regularization methods between the pre-trained and fine-tuned models.
3 Incremental training
Here, we simplify the training setting as follows. Suppose we have a network trained on an old dataset as our base model. We are then presented with a new dataset, with a data distribution different from the old one, as our fine-tuning dataset. Our goal is to fine-tune the base model so that its performance is comparable to that of a model trained from scratch on the combined data, in a small fraction of the from-scratch training time. In this work, we study the case where the fine-tuning dataset is skewed and its data distribution differs significantly from the old one.
3.1 Splits: i.i.d vs non-i.i.d
While an identical distribution of classes in the old and new datasets may be a natural assumption, such a setup may not be very common in practice.
In many cases, to improve an existing system, one may want to collect and label either more challenging examples (i.e., active learning) or a specific small subset of classes. For example, to improve the user experience with navigation-related queries, the new dataset can mainly consist of the corresponding intents and slots and contain no queries related to music or news. Note that such a dataset will have a significantly skewed class distribution. This discrepancy can cause additional issues for incremental training, as the train set distribution is additionally distanced from the test set.
3.2 Incremental Training for Semantic Parsing
Several particularities make semantic parsing an interesting task for incremental training. While standard classification tasks assume a single label for each training example, the output for semantic parsing is a tree with multiple nodes. As a result, a fine-tuning example rarely contains just a single class, which increases the diversity of the fine-tuning set and should limit (at least somewhat) the train/test discrepancy.
Another characteristic of semantic parsing for dialogue is the number of classes. For example, the publicly available TOP dataset [gupta2018top] uses 25 intents and 36 slots for just two domains, navigation and events. Real production systems can contain hundreds of classes for a single domain [rongali2020don]. This makes Incremental training particularly interesting for semantic parsing, as the long tail of the class distribution calls for frequent model updates.
4 Fine-tuning approaches
One of the most important issues to address when fine-tuning a model on new data is the performance on the classes underrepresented in the fine-tuning set.
4.1 Old data sampling
\citeauthorde2019episodic \shortcitede2019episodic proposed to use sparse experience replay to mitigate catastrophic forgetting in lifelong language learning. We follow their idea, but instead of implementing an experience replay buffer, we sample directly from the pretrain data during fine-tuning.
We study two setups: static and dynamic sampling. Each of them has its pros and cons.
Static sampling takes a fixed random sample of the pretrain data, of a chosen proportion, before the fine-tuning procedure. Fine-tuning then happens on the combined dataset of the sampled old data and the fine-tune subset. In this case, the fine-tuning method only uses a limited amount of pretrain data, which may be beneficial in terms of privacy and in the federated training setup. As this kind of sampling can be considered the simplest way to avoid catastrophic forgetting, we call it baseline in our plots.
Dynamic sampling draws the same proportion of pretraining data anew at the beginning of each epoch. In this case, more of the old data gets reused, yielding a more diverse dataset, while the epoch length stays the same as with static sampling.
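The two sampling schemes above can be sketched as follows (a minimal illustration with list-like datasets; the function names and the per-example Bernoulli sampling are our assumptions, not the paper's implementation):

```python
import random

def static_sample(pretrain_data, finetune_data, p, seed=0):
    """Baseline: sample a fixed fraction p of the pretrain data once,
    before fine-tuning; the same mix is reused every epoch."""
    rng = random.Random(seed)
    sampled = [x for x in pretrain_data if rng.random() < p]
    return sampled + finetune_data

def dynamic_epoch(pretrain_data, finetune_data, p, epoch):
    """Dynamic sampling: draw a fresh fraction p of the pretrain data
    at the start of each epoch, keeping the epoch length comparable."""
    rng = random.Random(epoch)  # new subset each epoch
    sampled = [x for x in pretrain_data if rng.random() < p]
    return sampled + finetune_data
```

Over many epochs, dynamic sampling exposes the model to most of the old data, while each epoch stays roughly as short as in the static setup.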
4.2 Regularization methods
As the fine-tuning set is relatively small and its class distribution is very different from the pretraining set, it is natural to expect overfitting and poor generalization performance. Dropout is a standard deep learning tool that helps to deal with this problem.
We also introduce a simple regularization method, similar to weight decay but targeted specifically at Incremental training / Continual learning. Move norm is a regularizer that prevents the model from diverging from the pre-trained weights (Equation 1). It is added to the loss function analogously to weight decay and is parametrized by a regularization strength.
For the distance between the current and the pre-trained weights, we experimented with both Euclidean (L2) and Manhattan (L1) distances and found Euclidean to be more effective.
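A minimal sketch of the move norm on flat weight vectors (the squared-L2 variant, analogous to weight decay, is our assumption about Equation 1; in a real model the sum runs over all parameter tensors):

```python
def move_norm_penalty(weights, pretrained_weights, lam):
    """Penalty that keeps the fine-tuned weights close to the
    pre-trained ones: lam times the squared Euclidean distance."""
    squared_dist = sum((w - w0) ** 2
                       for w, w0 in zip(weights, pretrained_weights))
    return lam * squared_dist
```

The penalty is added to the task loss; at the pre-trained point it is exactly zero, so the regularizer only acts once the model starts to diverge.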
4.3 Layer Freezing
One of the methods to limit how much the fine-tuned model diverges from the pre-trained one is layer freezing: not updating the weights in chosen layers. \citeauthorHoward2018 \shortciteHoward2018 showed that freezing can reduce test error by about 10% (relative change). Freezing can also be viewed as a "hard" form of the move norm regularization.
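In a framework like PyTorch, freezing amounts to setting `requires_grad = False` on the chosen parameters. A framework-agnostic sketch (the parameter dicts and the name-prefix convention are hypothetical):

```python
def freeze_layers(model_params, frozen_prefixes):
    """Mark parameters whose names start with one of the given
    prefixes (e.g. 'encoder.') as frozen; return trainable names."""
    trainable = []
    for name, param in model_params.items():
        param["requires_grad"] = not name.startswith(tuple(frozen_prefixes))
        if param["requires_grad"]:
            trainable.append(name)
    return trainable
```

Freezing both encoder and decoder then reduces to passing both prefixes, leaving only the final projection and the pointer network trainable.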
5 Experimental setup
Our experiments aim to model realistic scenarios in which retraining can be useful. The model used in this study is a sequence-to-sequence transformer with a pointer network and a BERT encoder, as in \citeauthorrongali2020don \shortciterongali2020don. Training a model on the full TOP dataset using a single V100 GPU with early stopping patience 10 (monitoring exact match accuracy) took 1.5 hours in our experiments.
To mimic an incremental learning setup, we first split the training set into two parts: a pretrain (old) subset and a fine-tune (new) subset.
In our freezing experiments, we freeze either the encoder or the encoder together with the decoder for the whole fine-tuning procedure. In the latter case, only the final projection and the pointer network are updated.
We use exact match (EM) to evaluate our model and the Tree Path F1 Score (TP-F1, Section 5.2) to evaluate performance on a specific class or subset of classes. TP-F1 is used instead of EM as a more fine-grained metric better suited for per-class evaluation.
In addition to this, we also use Relative Improvement (RI) and Relative Degradation (RD) scores that are computed over classes that changed significantly. We describe these metrics in Section 5.3.
5.1 Data splits
We experiment with multiple data splits that are summarized in Table 1. Every split is constructed as follows: choose a split class and a split percentage, then randomly move that percentage of the training examples containing the class to the fine-tune subset. This splits the original training set into fine-tune (new) and pretrain (old) parts.
For our experiments, we selected the mid-frequent classes for splitting. In this case, a 90%+ split leaves most of the training data in the old subset and the amount of new data is enough to improve the model significantly.
The splitting procedure aims to mimic the real-world iterative setup when a trained (on the old data) model already exists and we want to incorporate new data into this model. To ensure that the resulting pretraining subset contains all the possible classes, before the splitting procedure, we include a small set of training examples that contain all of the classes.
|Split name||Dataset||# pretrain||# fine-tune|
|NAME EVENT 95||TOP||60||1.1k|
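The splitting procedure above can be sketched as follows (representing each training example as a pair of its class set and payload is our simplification):

```python
import random

def make_split(train_set, split_class, p, seed=0):
    """Move roughly p percent of the examples containing split_class
    into the fine-tune (new) subset; everything else stays in the
    pretrain (old) subset."""
    rng = random.Random(seed)
    pretrain, finetune = [], []
    for classes, example in train_set:
        if split_class in classes and rng.random() < p / 100:
            finetune.append(example)
        else:
            pretrain.append(example)
    return pretrain, finetune
```

As described above, a small set of examples covering all classes is additionally kept in the pretrain subset, so that no class disappears from it entirely.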
5.2 Tree Path Score
To compute the Tree Path F1 Score (TP-F1), the parsing tree is flattened into tree paths. Intent-related text tokens are ignored, and slot values (or terminating intents) finish each tree path as a single token. This procedure is performed for both the correct and the predicted tree, and the F1 score is then computed on the paths as F1 = 2 * precision * recall / (precision + recall), where, following the standard definition, precision = (# correctly predicted paths) / (# predicted paths) and recall = (# correctly predicted paths) / (# expected paths).
For example, if the query is "When should I leave for my dentist appointment at 4 pm" and the parsing tree looks like Figure 2, then it has 4 tree paths. Every path starts at the root (IN:GET_DEPARTURE) and goes to a slot value. For the SL:DESTINATION slot, the value is compositional and equal to the string [IN:GET_EVENT [SL:NAME_EVENT dentist] [SL:CATEGORY_EVENT appointment]] .
Say the predicted tree has a different value for the name event slot; it would then have two paths different from the correct ones: the path to the value of the name event slot and the path to the value of the destination slot, because the destination value is compositional and contains name event as a part of it. In this case, the number of correctly predicted paths would be 2 (the time arrival and category event slots), the number of predicted paths would be 4, and the number of expected paths also 4. TP-F1 in this case equals 1/2.
To compute a per-class score, only the paths containing the class are considered. In the example above, name event's TP-F1 is zero, as there are no correctly predicted paths for it.
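Given the flattened path sets, the score itself is straightforward (a sketch with paths represented as hashable strings or tuples):

```python
def tree_path_f1(gold_paths, pred_paths):
    """F1 over tree paths: precision = correct / predicted,
    recall = correct / expected."""
    gold, pred = set(gold_paths), set(pred_paths)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)
```

On the example above (4 expected paths, 2 of the 4 predicted paths correct), this yields precision = recall = 1/2 and TP-F1 = 1/2.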
5.3 RI and RD scores
Dialogue datasets for semantic parsing can contain dozens of classes. In our initial experiments, we saw significant model retrain jitter which made it harder to compare different retraining setups.
For this reason, we use Relative Improvement (RI) and Relative Degradation (RD) scores. To compute them, we:
before and after fine-tuning, bootstrap the test set 5 times and estimate the mean and standard deviation of each per-class metric;
split all classes into three categories: the metric degraded by more than two standard deviations, improved by more than two standard deviations, or did not change significantly;
compute the relative value change for each class;
sum the metrics for the improved and the degraded classes with weights corresponding to the class frequencies,
which gives RI and RD, respectively. In comparison to TP-F1, which computes a per-class score, RI and RD aggregate the metrics over all significantly changed classes. A good Incremental training procedure should have near-zero RD and high RI.
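The computation above can be sketched as follows (the per-class input format is our assumption; following the footnote, the standard deviations before and after retraining are summed for the significance threshold):

```python
def ri_rd(per_class, freqs):
    """per_class maps a class to (mean_before, std_before,
    mean_after, std_after); freqs maps a class to its frequency.
    Returns frequency-weighted relative improvement and degradation."""
    ri, rd = 0.0, 0.0
    for cls, (mb, sb, ma, sa) in per_class.items():
        delta = ma - mb
        # skip insignificant changes, or classes with no baseline value
        if abs(delta) <= 2 * (sb + sa) or mb == 0:
            continue
        relative = delta / mb
        if relative > 0:
            ri += freqs[cls] * relative
        else:
            rd += freqs[cls] * relative
    return ri, rd
```

RD comes out non-positive; a good incremental procedure keeps it near zero while RI stays high.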
A Note on Bootstrapping.
To estimate the uncertainty of our metrics, we evaluate the model on different parts of the test set. First, we randomly divide the test set into 5 folds. Then, we evaluate the model 5 times, each time using all but the i-th fold. Finally, the mean and standard deviation are computed over these 5 points.
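A sketch of this leave-one-fold-out evaluation (the fold construction is ours; `metric_fn` stands for any scalar metric such as EM or TP-F1):

```python
import random
import statistics

def fold_bootstrap(test_set, metric_fn, n_folds=5, seed=0):
    """Evaluate n_folds times, each time on all but the i-th fold,
    and report the mean and standard deviation of the metric."""
    items = list(test_set)
    random.Random(seed).shuffle(items)
    folds = [items[i::n_folds] for i in range(n_folds)]
    scores = []
    for i in range(n_folds):
        subset = [x for j, fold in enumerate(folds) if j != i
                  for x in fold]
        scores.append(metric_fn(subset))
    return statistics.mean(scores), statistics.stdev(scores)
```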
We use 2 popular task-oriented semantic parsing datasets: Facebook TOP [gupta2018top] and SNIPS [coucke2018snips]. Unlike SNIPS, which consists of flat queries containing a single intent and simple slots (no slot nesting), the TOP format allows complex tree structures. More specifically, each node of the parsing tree is either an intent name (e.g., IN:GET_WEATHER), a slot name (SL:DATE), or a piece of the input query text.
The tree structure is represented as a string of labeled brackets with text, and the semantic parsing model should predict the structure using the input query in a natural language. For example, What’s the weather like today in Boston should become [IN:GET_WEATHER what’s the weather like [SL:DATE today] in [SL:LOC Boston]].
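For flat queries (a single intent, no nested slots), extracting the tree paths from this linearized form takes only a few lines (a sketch; compositional slot values like the TOP example in Section 5.2 would need a full bracket parser):

```python
import re

def flat_tree_paths(linearized):
    """Root-to-value paths for a flat TOP-format string, returned as
    (intent, slot, slot value) triples; intent-level text is ignored."""
    intent = re.match(r"\[(\S+)", linearized).group(1)
    return [(intent, slot, value.strip())
            for slot, value in re.findall(r"\[(SL:\S+) ([^\[\]]+)\]",
                                          linearized)]
```

For the weather query above, this yields the two paths (IN:GET_WEATHER, SL:DATE, today) and (IN:GET_WEATHER, SL:LOC, Boston).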
The TOP dataset consists of 45k hierarchical queries; about 35% of the queries are complex (tree depth of 2 or more). The SNIPS dataset has a simpler structure (only a single intent per query), and its train part consists of 15k examples.
To unify the datasets, SNIPS is reformatted to fit the TOP structure.
6 Results
We first measured the effect of the BERT encoder on task performance. The results are shown in Figure 3. For a fairer comparison, we performed a hyperparameter search over the number of layers and hidden sizes of the no-BERT model. The best no-BERT model had 4 layers, 4 heads, model hidden size 256, and FFN hidden size 1024. As expected, we see better sample efficiency with BERT: even with 20% of the original dataset, BERT matches the performance of the full no-BERT model. Since BERT could have a profound effect on the final results, we use it in all our experiments.
We start our experiments by establishing two important baselines. The first baseline, no fine-tuning, shows whether the fine-tuning procedure is useful at all: it is simply the performance of the model trained on the pretrain (old) subset. In our figures, we call this baseline initial. Table 2 contains the metrics (on the validation set) of this baseline for the models used in our experiments.
|Split||TP-F1||EM|
|PATH 99||0.13 ± 0.01||0.801 ± 0.003|
|PATH 90||0.51 ± 0.02||0.815 ± 0.002|
|NAME EVENT 95||0.61 ± 0.02||0.809 ± 0.002|
|GETWEATHER 90||0.93 ± 0.01||0.911 ± 0.004|
|GETWEATHER 99||0.77 ± 0.01||0.892 ± 0.004|
The second baseline is an upper bound for the fine-tuning procedure. It is also the standard way to incorporate more data into a model: retrain the whole model from scratch on the merged pretrain + fine-tune dataset. This baseline yields high final performance and no catastrophic forgetting, as all of the data is used, but the method is inherently slow and computationally expensive. In the figures, we call this baseline from scratch. EM values for this baseline can be found in Table 3 and TP-F1 values in Table 4.
To estimate this important baseline with high precision, we trained 5 models with different random seeds. The resulting mean and standard deviation were compared with those from the bootstrapping method. Both bootstrapping and training on multiple random seeds provided similar results, so we chose to use only bootstrapping for the fine-tuning experiments, significantly reducing computational requirements.
|Dataset||Split||EM|
|TOP||NAME EVENT 95||0.84 ± 0.01|
|SNIPS||GETWEATHER 90||0.97 ± 0.01|
6.2 TOP dataset results
PATH 90 split.
Figure 4 presents the experimental results on the PATH 90 split. In this split, the fine-tuning part contains 90% of the examples with the PATH class, and the left plot shows its TP-F1 score. Surprisingly, even this class degrades if the pretraining data is not used along with the fine-tuning data (notice that all of the curves at old data = 0 are below the initial baseline): there is a 3.5% relative degradation at this point. However, starting from 5% of pretraining data, the performance on the PATH class beats the no-fine-tuning baseline. The dynamic sampling method, which selects a new subset of the pretraining data every epoch, is able to reach the performance of the model trained from scratch in 10 minutes, about 10% of the from-scratch training time (1.5 hours).
The exact match metric differs from TP-F1 and RD in that it is computed over all classes. We can see that the model improves significantly when RD is low and TP-F1 is high, as expected (EM plots are available in Appendix B).
PATH 99 split.
This split is more extreme than PATH 90: the fine-tuning subset contains 99% of the PATH data, and the pretraining subset contains only 15 PATH examples (Table 1). Our experiments (Figure 5) show that this case benefits from fine-tuning faster (in terms of the amount of old data needed to beat the initial performance baseline) and more significantly. However, 50% of the old data is needed to reach the from-scratch performance (Figure 5).
6.3 SNIPS dataset results
GETWEATHER 99 split.
Along with the TOP dataset, we also experimented with SNIPS. Figure 5 presents the experimental results for this split. One extra experiment we tried on this split is freezing both the encoder and the decoder. In this setup, we only updated the weights of the output projection and the pointer network. The results show that this method performs substantially worse than the others. This may happen because the features the model has learned on the pretrain subset are insufficient to correctly classify examples from the new data; thus, the model needs to change its representations and not only reweight the existing ones. Another possible contributor to this behavior is the distribution shift between the pretrain and fine-tune sets, as the logit network mainly learns the simplest features, such as class frequency.
More experimental data can be found in the Appendix B.
6.4 Parametric regularization methods
In our experiments, we tried multiple methods to further mitigate catastrophic forgetting and overfitting. While some of the methods are non-parametric, others, like dropout or move norm, require parameter tuning. To study this parameter dependence, we selected the most challenging split, NAME EVENT 95, and fine-tuned multiple models with different regularization parameter values.
Dropout: Our initial results showed that the model easily fits the whole fine-tuning set, so we reasoned that a slight increase in dropout probability may be beneficial.
Move norm: This method, unlike dropout, is specifically targeted at mitigating catastrophic forgetting. However, unlike the freezing or sampling methods, it requires careful parameter tuning, as we demonstrate in Figure 6. While move norm did not improve the results significantly on the NAME EVENT 95 split, we see positive examples on PATH 99 (Figure 5). Another advantage of move norm is that high regularization strengths do not cause the model to degrade on the classes underrepresented in the fine-tuning data.
7 Practical Recommendations
In summary, here is a short list of practical recommendations that can be helpful for real-world applications:
Sample from the old data during fine-tuning; use 20-30% for dynamic sampling and 30-50% for other methods.
If only a part of the old data is accessible, additionally regularize the model via encoder freezing or move norm (the latter requires hyperparameter tuning).
If the application allows, use the whole old subset for sampling (dynamic sampling).
In this work, we consider a practical side of CL that has been previously overlooked by NLP researchers: the ability to quickly update an existing model with new data. Nowadays, with the costs of models growing rapidly with their size, training time becomes a more challenging issue every year. We anticipate that in the near future of billion-parameter models, incremental and continual learning settings will not only bring a significant advantage in terms of resource efficiency but also become a necessity.
Our experimental results show that a simple incremental setup can reduce computational costs up to 90%. It is both beneficial in terms of increasing the speed of the development cycle and in terms of the environmental impact that is becoming more and more significant in the field [Strubell_2019].
We also want to note some of the negative results we discovered. Training only the top layer (including the pointer network) is a surprisingly bad way to incorporate more data, possibly because some of the necessary feature learning has to happen in the lower layers of the model. Also, even though the model quickly fits the fine-tuning data, increasing regularization does not seem to improve the final performance. Finally, a naïve combination of successful methods such as dynamic sampling, freezing, and move norm does not seem to help either.
The Continual Learning community has made incredible progress using methods such as EWC [Kirkpatrick2017] and LAMOL [Sun2020LAMOLLM]. Many of these approaches are applicable in a real-world scenario and should be tested in practical applications. In our future work, we want to consider such models and evaluate them in terms of both performance (e.g., accuracy) and computational costs. Another important direction is to study how the model changes after multiple iterative updates.
Appendix A Hyperparameters and training setup
Table 5 contains hyperparameters used for pretraining. We used the Noam schedule [Vaswani2017] for the learning rate. Note that it involves learning rate scaling.
For fine-tuning, the same parameters were used unless otherwise stated in the experiment description. The only exception to this rule is the batch size that we set to 128 during all fine-tuning experiments. Model, optimizer, and learning rate scheduler states were restored from the checkpoint with the best EM.
|encoder model||BERT-Base cased|
|decoder n heads||4|
|decoder model size||256|
|decoder ffn hidden size||1024|
|frozen encoder steps||500|
Appendix B Additional Experimental Results
I.I.D. Splits Fine-Tuning
Figure 10 presents fine-tuning experiments for the i.i.d. cases. Each model is trained on a fraction (train amount) of the original train set and fine-tuned on the rest. We do not plot TP-F1 because with an i.i.d. split there is no specific split class that we expect to improve; instead, we monitor the performance of the model on all classes. Also, the initial performance baseline is not present in the plot, as every model has a different initial performance. Table 6 presents the initial values for every model.
From Figure 10 we can see that the models with smaller pretrain sets (bigger fine-tuning sets) require almost no data sampling to achieve the from-scratch performance, while the models trained on a larger subset need up to 50%. We can also see that the models pretrained on larger amounts of data tend to have higher RD. This can be explained by the fact that these models have better initial performance and, thus, more potential to degrade.
Varying the Amount of New Data
An important question regarding the Incremental training procedure is how much additional data labeling is needed to improve the model. To answer it, we studied how the amount of fine-tuning data affects the final performance on the PATH 99 split.
Random subsets of size 10% to 90% of the fine-tuning data were sampled and used as the fine-tuning set in this experiment. The results are depicted in Figure 7; the dynamic sampling method was used to mitigate catastrophic forgetting. Each model was fine-tuned with early stopping patience 10.
In general, catastrophic forgetting affects the model more when more new data is present. This can be explained by the fact that, to finish a single epoch, the model with a bigger new-data subset is updated more times, so the network can diverge further from the pretrained weights. This effect is especially visible in the old data = 0 fine-tuning setup, which changes the relative degradation value by an order of magnitude (Figure 8). The results suggest that even an extra hundred labeled examples can bring a significant improvement over the initial model if the target class is very rare in the pretraining data.
RI vs RD
In Section 5.3 we presented two metrics, relative improvement (RI) and relative degradation (RD). Here we demonstrate them on a single plot to compare their scales for a typical task. We can see that even without old data sampling, the model still significantly improves on some classes, and the associated relative degradation is significantly lower than the improvement (Figure 9). This can be specific to semantic parsing and caused by the fact that every example contains multiple classes, including the frequent ones. As RD is weighted by class frequency, it can only be large (in absolute value) if the frequent classes degrade.
In this section, we present additional experimental data, analogous to that presented in the main paper. Figures 11 and 12 show fine-tuning results for the NAME EVENT 95 and GET WEATHER 90 splits, respectively. Figures 13 and 14 show exact match accuracy for the PATH 90 and GET WEATHER 99 splits, respectively.
These plots exhibit behavior similar to their counterparts from Section 8.
- Note that we use the word pretrain for the part of the supervised data that the initial model was trained on, or for the process of training on it. By this term, we do not mean the BERT-like pretraining procedure or the data used for that procedure.
- Hyperparameters and training details can be found in Appendix A.
- We sum the standard deviations before and after retraining to account for different amounts of uncertainty.
- We are planning to release reformatting scripts along with the rest of the source code.
- We used 0.2 dropout during pretraining.