Bootstrapping NLU Models with Multi-task Learning
Bootstrapping natural language understanding (NLU) systems with minimal training data is a fundamental challenge in extending digital assistants like Alexa and Siri to a new language. A common approach adopted in digital assistants when responding to a user query is to process the input in a pipeline, where the first task is to predict the domain, followed by the inference of intent and slots. However, this cascaded approach causes error propagation and prevents information sharing among these tasks. Further, using words as the atomic units of meaning, as done in many studies, can lead to coverage problems for morphologically rich languages such as German and French when data is limited. We address these issues by introducing a character-level unified neural architecture for joint modeling of domain, intent, and slot classification. We compose word embeddings from characters and jointly optimize all classification tasks via multi-task learning. Our results show that the proposed architecture is an effective choice for bootstrapping NLU systems in low-resource settings, thus saving time, cost and human effort.
1 Introduction
Digital assistants like Amazon Alexa help users with everyday tasks such as setting an alarm, booking a taxi, adding events to their calendar or making a dinner reservation. Since these digital assistants support only a limited set of languages, rapidly expanding them to new languages is a priority for companies like Amazon, Google and Apple that want to grow their user base. However, language expansion is not trivial since it requires large amounts of annotated training data, which translates to additional cost, effort and time. Hence, developing efficient techniques for training spoken language understanding (SLU) systems in low-resource settings is an active research topic. One of the open challenges in accelerating language expansion is to bootstrap an accurate natural language understanding (NLU) module for new languages with minimal training data. The NLU module, which is our main focus, is a crucial component of an SLU system and is responsible for deriving the semantic interpretation of a spoken utterance or query.
The problem of obtaining data for low-resource languages is further amplified when bootstrapping a morphologically rich language such as German, French, Turkish, Hungarian, etc. Such languages can have extensive vocabularies as various forms of the same word can be generated through inflectional and derivational suffixation or compounding. Hence, this results in high lexical sparseness and leads to significant coverage issues for parametric models if words are modeled as the smallest units of meaning.
Apart from the scarcity of annotated training data, another problem is how the NLU modules are designed. The NLU module is generally designed using a pipeline approach (Sarikaya et al., 2016; Gupta et al., 2006) where a standard way of interpreting an input utterance starts with predicting its domain followed by domain-specific intent and slots. Since the NLU module of digital assistants like Alexa processes multiple domains, such a cascaded model prevents learning any domain-invariant features. Moreover, this pipeline approach in the NLU component propagates the errors from an upstream to a downstream task, which degrades the system’s overall performance. Further, the pipeline approach also prohibits any knowledge sharing between the closely related subtasks of domain, intent and slot classification. Such knowledge sharing among the NLU tasks has the potential to enhance the performance of each task in a low-resource setting, thereby boosting the overall performance of the NLU module.
In this work, we present an end-to-end unified neural architecture for bootstrapping NLU models in low-resource settings for domain, intent and slot classification. The task involves taking an utterance such as “play frozen from madonna” and classifying its domain, intent and slots as “Music”, “Play Music” and “Other Songname Other ArtistName”, respectively. Firstly, our approach uses character-level modeling to deal with the extensive vocabularies of morphologically rich languages. This helps with language coverage by efficiently modeling the out-of-vocabulary (OOV) words encountered during inference. Further, it facilitates implicit parameter sharing between words with similar subword units, like rain and raining, thus inducing a robust representation of words in low-resource settings. Secondly, our approach uses multi-task learning to enable a unified architecture, which jointly infers the output of all of the subtasks and overcomes the problem of error propagation. Since joint optimization facilitates information sharing, our architecture is able to achieve notable accuracy improvements across NLU tasks. Thirdly, we analyze the use of pre-trained word embeddings for initializing our model to further improve its performance in low-resource settings. Finally, we evaluate the proposed architecture on real-world data in a large-scale setting and provide a detailed analysis of the design choices.
2 Related Work
Various architectures have been proposed for joint modelling of intent and slot classification to reduce error propagation within a particular domain. Probabilistic models such as a triangular conditional random field (CRF) (Jeong and Lee, 2008) and a convolutional neural network based CRF (Xu and Sarikaya, 2013) report significant performance gains by using a unified model. Since the intent and slots of an utterance are highly correlated, Zhang and Wang (2016) propose a recurrent neural network (RNN) based architecture which operates on a shared embedding and is optimized for both tasks together. Another RNN-based model (Liu and Lane, 2016) uses an auto-regressive decoder and an attention mechanism to learn both tasks together. The approach most similar to ours is proposed by Kim et al. (2017), where the authors jointly optimize the domain, intent and slot classification tasks. Compared to their work, we enable direct information flow from an upstream task to its downstream task and show that this improves model performance.
In addition to jointly modeling the intent and slot tasks, there exists prior work on domain adaptation that focuses on sharing features among multiple domains. Jaech et al. (2016) model multiple domain-specific slot-filling layers together, with their roots in a single RNN-based encoder, thus jointly training the encoding layers. Kim et al. (2016b) present another approach which uses slot-filling layers where, in addition to the domain-specific layer for each of the domains, there exists an additional layer, shared by all domains, for inducing feature augmentation.
Distributed word embeddings have improved the performance of many NLP tasks like sentiment analysis (Maas et al., 2011), language modelling (Bengio et al., 2003) and named entity recognition (NER) (Turian et al., 2010). Most of these existing approaches treat words as individual atomic units, thus completely ignoring their internal structure. Furthermore, the quality of word-based embedding models deteriorates for rare and unseen words (Bojanowski et al., 2017), since there might not be enough instances of a rare word to learn its representation. Recently, compositional word embeddings have been applied successfully to a variety of NLP tasks like language modelling (Vania and Lopez, 2017), NER (Lample et al., 2016) and neural machine translation (NMT) (Ataman and Federico, 2018). Further, it has been shown that morphological and semantic information can be exploited by composing the representations of subwords at different granularity levels, i.e. characters (Kim et al., 2016a), character n-grams (Bojanowski et al., 2017) and bytes (Gillick et al., 2016).
As shown in Fig. 1, our model is composed of five major building blocks: (a) the compositional CNN layer (Sec. 3.1) that derives word features from character n-grams, (b) highway layers (Sec. 3.2) that model interactions among their inputs and facilitate information flow, (c) a multi-layer stacked CNN (Sec. 3.3) that generates contextual vectors for the words in a given utterance, (d) three individual output layers (Sec. 3.4) performing domain, intent and slot predictions based on the context vectors from the Stacked CNN, and (e) two hierarchical link layers (Sec. 3.5) that transfer the posterior distribution of an upstream task to a downstream task. We discuss each building block in detail in the following sections.
3.1 Compositional CNN
The first layer of our network, named Compositional CNN (CompCNN), is used to create word representations using character embeddings. Let $\mathcal{C}$ be the set of characters and $d$ be the dimensionality of character embeddings. Let $Q \in \mathbb{R}^{d \times |\mathcal{C}|}$ be the matrix of character embeddings. Consider a word, $w$, that is made up of a sequence of characters $[c_1, \ldots, c_l]$, where $l$ is the length of the word. For each character of word $w$, we obtain the corresponding character embedding from the embedding matrix $Q$ and then concatenate them to create a character-level representation $C^w \in \mathbb{R}^{d \times l}$ of the word.
In order to create feature maps from character-level representations we apply a convolution operation on windows of $n$ characters and obtain a vector of features $f^w \in \mathbb{R}^{l-n+1}$. Eq. 1 describes this operation as follows:
$$f^w = g(C^w * H + b) \tag{1}$$
where $*$ is the convolution operation; $H \in \mathbb{R}^{d \times n}$ is the convolution filter with width $n$; $b$ is the bias and $g$ is the nonlinear activation function.
To obtain the most important n-gram captured by a given filter, we apply max-over-time pooling as illustrated in Eq. 2. This operation returns the maximum valued feature in the feature vector $f^w$. This pooling scheme naturally allows the model to deal with variable word lengths.
$$y^w = \max_i f^w[i] \tag{2}$$
The operations defined in Eq. 1 and Eq. 2 contribute only one n-gram feature per convolution filter. Therefore, we use multiple filters and obtain, for the input word, a feature vector whose length equals the number of applied convolution filters. Further, we vary the filter width $n$ from 3 to 6 to extract n-gram features of the corresponding sizes.
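As an illustration of the CompCNN computation described above, the following is a minimal sketch in plain Python rather than the MXNet implementation used in this work. It assumes tanh as the activation, zero-pads words shorter than the filter width, and uses two filters per width instead of the counts in Sec. 4.2; the name `comp_cnn` and all toy values are hypothetical.

```python
import math
import random

def comp_cnn(word, char_emb, filters, biases):
    """Return one max-pooled n-gram feature per filter for `word`.

    char_emb: dict mapping a character to its d-dimensional embedding
    filters:  list of filters; a filter of width n is a list of n columns of size d
    biases:   one bias per filter
    """
    d = len(next(iter(char_emb.values())))
    cols = [char_emb[c] for c in word]  # character-level representation of the word
    feats = []
    for H, b in zip(filters, biases):
        n = len(H)
        # Assumption: zero-pad short words so that at least one window exists.
        padded = cols + [[0.0] * d for _ in range(max(0, n - len(cols)))]
        scores = []
        for i in range(len(padded) - n + 1):
            # Inner product between the filter and a window of n characters.
            s = b + sum(x * h for j in range(n) for x, h in zip(padded[i + j], H[j]))
            scores.append(math.tanh(s))  # assumed activation
        feats.append(max(scores))  # max-over-time pooling
    return feats

random.seed(0)
d = 15  # character embedding size from Sec. 4.2
alphabet = "abcdefghijklmnopqrstuvwxyz"
char_emb = {c: [random.uniform(-0.5, 0.5) for _ in range(d)] for c in alphabet}
filters = [[[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(n)]
           for n in (3, 4, 5, 6) for _ in range(2)]
biases = [0.0] * len(filters)

short = comp_cnn("an", char_emb, filters, biases)
long = comp_cnn("donaudampfschiff", char_emb, filters, biases)
print(len(short), len(long))
```

Both words yield a vector with one entry per filter, which is why pooling makes the representation independent of word length.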
3.2 Highway Network
We use a highway network (Srivastava et al., 2015), which consists of a transform and a carry gate, at two locations of our network: after the CompCNN layer and after the stacked CNN layer. Highway networks not only enable us to apply non-linearity to the inputs but also facilitate training by carrying some of the input directly to the output.
Formally, highway networks are defined as follows:
$$y = T \odot g(W_H x + b_H) + C \odot x, \qquad T = \sigma(W_T x + b_T), \qquad C = 1 - T \tag{3}$$
where $\odot$ is element-wise multiplication; $x$ and $y$ are the input and output vectors; $T$ and $C$ denote the transform and the carry gate respectively; $g$ is the non-linearity used in this work and $\sigma$ is the sigmoid function; $W_H, W_T \in \mathbb{R}^{m \times m}$ are the weights of the feature transform layer and the transform gate, where $m$ is the size of the input vector; $b_H$, $b_T$ are the corresponding bias vectors.
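A highway layer of this form can be sketched in a few lines of plain Python. The sketch assumes the common coupling in which the carry gate is one minus the transform gate (as in Srivastava et al., 2015) and uses tanh as a stand-in for the feature non-linearity; the function names and toy weights are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, x, b):
    """Compute W x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def highway(x, W_H, b_H, W_T, b_T, g=math.tanh):
    """Blend a transformed input with the raw input using a learned gate."""
    h = [g(a) for a in affine(W_H, x, b_H)]        # feature transform
    t = [sigmoid(a) for a in affine(W_T, x, b_T)]  # transform gate
    # Assumed carry gate: 1 - t, so each output is a convex mix of h and x.
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]

x = [0.5, -1.0]
W_H = [[0.3, -0.2], [0.1, 0.4]]
W_T = [[0.0, 0.0], [0.0, 0.0]]
b_H = [0.0, 0.0]

carried = highway(x, W_H, b_H, W_T, [-50.0, -50.0])     # gate closed: output is x
transformed = highway(x, W_H, b_H, W_T, [50.0, 50.0])   # gate open: output is g(W_H x + b_H)
print(carried, transformed)
```

A strongly negative gate bias makes the layer behave like an identity mapping, which is the property that eases training of deeper stacks.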
3.3 Stacked CNN
In order to model interactions between the words of a sentence we employ a multi-layer CNN, which we call Stacked CNN, similar to Collobert et al. (2011) and Kim (2014), who report competitive results for various NLP tasks. We stack multiple convolution layers to increase the receptive field of the model, which enables learning longer-range dependencies among the input words.
Consider an input utterance as a sequence of words $[w_1, \ldots, w_N]$, where the vector representation $x_i \in \mathbb{R}^{m}$ of word $w_i$ is the output of the highway network. We concatenate these vectors to obtain a matrix representation of the input utterance $X \in \mathbb{R}^{m \times N}$. We pass $X$ through the cascade of convolution layers to obtain the contextual vectors of the words of an utterance. We present a visual example for the process in Fig. 2. The contextual vector of a word encodes information about the word itself and its adjacent words. A single layer of the Stacked CNN is defined as follows:
$$F = g(X * K + b) \tag{4}$$
where $*$ is the convolution operation; $K \in \mathbb{R}^{m \times k}$ is the convolution filter with width $k$ and $m$ is the dimensionality of temporal vectors; $b$ is the bias; $g$ is the nonlinear activation; $F$ is the feature map capturing the local contexts. In order to obtain fixed length vectors for each sentence, we pad each sentence to be of the same length. As per Eq. 4, each convolution filter results in one feature for every context vector, so we use multiple filters per CNN.
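To make the effect of stacking concrete, the sketch below propagates a single active position through two width-3 convolutions and counts how many positions it reaches. It assumes stride 1, no dilation, zero "same" padding and scalar features per position, none of which is stated above; `conv_same` and `receptive_field` are hypothetical helper names.

```python
def conv_same(seq, kernel):
    """1-D convolution with zero padding so the output keeps the input length.

    Assumes an odd kernel width and one scalar feature per position.
    """
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k)) for i in range(len(seq))]

def receptive_field(num_layers, kernel_size):
    """Number of input positions seen by one output of a stride-1 stack."""
    return 1 + num_layers * (kernel_size - 1)

# One active word in an 11-word utterance, passed through two width-3 layers.
impulse = [0.0] * 11
impulse[5] = 1.0
once = conv_same(impulse, [1.0, 1.0, 1.0])
twice = conv_same(once, [1.0, 1.0, 1.0])
reached = sum(1 for v in twice if v != 0.0)
print(reached, receptive_field(2, 3))
```

Under these assumptions, the two-layer, kernel-size-3 configuration of Sec. 4.2 lets each context vector draw on a window of five words.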
3.4 Output Layers
Our architecture models three classification tasks namely domain, intent and slot classification, where the first two tasks operate on the utterance level while the third task assigns a label to each word of the input utterance. We categorize domain and intent classification as global-context tasks and slot classification as a local-context task. We discuss these tasks below.
3.4.1 Local-context Tasks
The local-context task takes as input the context vector associated with the target word. That is to say, the output of the Stacked CNN that is associated with the target word is used as input in order to find the label of the word.
During processing, we apply a highway layer to the input of the local-context task, and then use a fully connected layer followed by softmax to produce normalized classification scores.
3.4.2 Global-context Tasks
Processing of word context vectors for global-context tasks is visualized in Fig. 3. We obtain the vector representation of an utterance by applying a max-over-time pooling operation on the matrix of word context vectors, such that its dimensionality equals the number of feature maps produced by the last layer of the Stacked CNN. Subsequently, we apply a highway network whose output is fed to a fully-connected layer followed by softmax to obtain task-specific normalized classification scores. Since we have two global-context tasks, our architecture contains two such output layers.
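The pooling and scoring steps for a global-context task can be sketched as follows. For brevity the sketch omits the highway layer between pooling and the fully-connected layer; the function names and the toy numbers are hypothetical.

```python
import math

def max_over_time(context_vectors):
    """Pool a list of per-word context vectors into one utterance vector."""
    return [max(column) for column in zip(*context_vectors)]

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_utterance(context_vectors, W, b):
    """Pool over words, apply a fully-connected layer, then normalize."""
    u = max_over_time(context_vectors)
    logits = [sum(w * x for w, x in zip(row, u)) + bi for row, bi in zip(W, b)]
    return softmax(logits)

# Three words, two feature maps each; two output classes.
contexts = [[1.0, 5.0], [3.0, 2.0], [0.0, 4.0]]
pooled = max_over_time(contexts)
probs = classify_utterance(contexts, W=[[0.5, -0.5], [-0.5, 0.5]], b=[0.0, 0.0])
print(pooled, probs)
```

The pooled vector has one entry per feature map regardless of how many words the utterance contains.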
3.5 Hierarchical Link Layers
Although the joint optimization of the NLU tasks improves their individual performance, further gains are plausible when each downstream task can access the posterior distribution of its upstream task. The motivation is analogous to human learning, where one can learn complex ideas more easily given prior knowledge of related simpler ideas.
Therefore, our architecture includes two hierarchical link layers, (a) domain to intent and (b) intent to slot, through which we transfer information between the semantically hierarchical NLU tasks. Eq. 5 and 6 describe the domain-intent link layer and the summation of its output with the input of the intent classifier to obtain the new input vector, respectively. Formally we define:
$$l^{DI} = g(W^{DI} z^{D} + b^{DI}) \tag{5}$$
$$\hat{x}^{I} = x^{I} + l^{DI} \tag{6}$$
where $z^{D}$ is the input to the domain classifier softmax layer; $W^{DI}$ and $b^{DI}$ are the weights and bias of the domain-intent link layer respectively; $g$ is a non-linear activation function; $x^{I}$ and $\hat{x}^{I}$ are the original and the new input vector of the intent classifier.
We implement the intent-slot link layer similarly to the domain-intent link layer, as described in Eq. 7. Although directly adding the link layer's output to the context vector of each word would be a natural choice, we use a gating mechanism to control the amount of link layer information added to the inputs of the slot classifier. Such a scheme ensures that the model furnishes the intent information to some particular words only and avoids connecting common words to any single intent. For instance, consider “play Mozart from Spotify”, where only “Mozart” (Artist name) and “Spotify” (Media service) are coupled strongly to the “Play music” intent, whereas words like “play” and “from” can occur in an utterance like “play Hobbit from Audible”, which belongs to an intent from another domain. We implement this gating mechanism using a feedforward layer as defined in Eq. 8, which transforms the input vectors of the slot classifier to determine their corresponding gating vectors. As described in Eq. 9, these gating vectors are element-wise multiplied with the output vector of the intent-slot link layer and then added to the input vector to create a semantically rich input for the slot classifier. Formally we define:
$$l^{IS} = g(W^{IS} z^{I} + b^{IS}) \tag{7}$$
$$r_t = \sigma(W^{G} x^{S}_t + b^{G}) \tag{8}$$
$$\hat{x}^{S}_t = x^{S}_t + r_t \odot l^{IS} \tag{9}$$
where $z^{I}$ is the input to the intent classifier softmax layer; $W^{IS}$ and $b^{IS}$ are the weights and bias of the intent-slot link layer and $g$ is a non-linear activation function; $x^{S}_t$ and $\hat{x}^{S}_t$ are the original and the new input vector of the slot classifier for the $t$-th word; $\odot$ denotes element-wise multiplication; $W^{G}$ and $b^{G}$ are the weight matrix and bias vector of the gating layer; $\sigma$ is the sigmoid non-linearity.
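The gated intent-slot link can be illustrated with a small plain-Python sketch. It assumes tanh as the link layer's activation and uses toy dimensions (three intent scores, two-dimensional word context vectors); all names and values are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, x, b):
    """Compute W x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def link_vector(z_up, W_link, b_link):
    """Project the upstream classifier's pre-softmax input (assumed tanh activation)."""
    return [math.tanh(a) for a in affine(W_link, z_up, b_link)]

def gated_slot_input(x_t, link_vec, W_gate, b_gate):
    """Add the link vector to one word's slot input, scaled by a per-word gate."""
    gate = [sigmoid(a) for a in affine(W_gate, x_t, b_gate)]
    return [x + r * l for x, r, l in zip(x_t, gate, link_vec)]

z_intent = [2.0, -1.0, 0.5]
W_link = [[0.1, 0.2, -0.3], [0.4, -0.1, 0.2]]
link_vec = link_vector(z_intent, W_link, [0.0, 0.0])

x_word = [0.3, -0.7]
W_gate = [[0.0, 0.0], [0.0, 0.0]]
closed = gated_slot_input(x_word, link_vec, W_gate, [-50.0, -50.0])  # e.g. a common word
opened = gated_slot_input(x_word, link_vec, W_gate, [50.0, 50.0])    # e.g. an entity word
print(closed, opened)
```

Because the gate is computed from each word's own context vector, the same utterance-level intent signal can be passed through for some words and suppressed for others.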
In this work, we use user requests in German to voice-controlled devices, drawn from two datasets: (a) in-house collected data and (b) beta data. Beta data is collected from users in an open-ended environment and has a higher variance than the in-house collected data, thus simulating the actual distribution of open-ended queries expected from live users. We train our models using the in-house collected data and evaluate their performance on the beta data. This simulates the real-life situation where an initial model needs to be provided to the beta user population before beta data becomes available. The initial model is expected to have a respectable accuracy in order to facilitate data collection and improve the beta user experience. Note that the results presented in this paper are not on production or live user traffic; rather, they reflect NLU bootstrap performance, which is limited to the data obtained through in-house data collection. Additional significant improvements in accuracy can be obtained by training on the larger volumes of data available once the system is released to larger sets of users. We train our models on various amounts of data to study the effect of the dataset size on the testing performance. The in-house and beta datasets comprise 180,000 and 12,000,000 data samples respectively. Collectively, the dataset contains 21 domains, 194 intents and 163 slots.
4.2 Experimental Setup
In this work, we implement all the models using MXNet (Chen et al., 2015). We represent each character by a 15-dimensional embedding vector. For CompCNN, we use four CNNs with kernel size 3, 4, 5 and 6 having 50, 75, 75 and 150 output channels respectively. Further, we use a two-layered CNN for the word-level Stacked CNN, where each layer contains 100 convolution filters with kernel size 3. To facilitate the residual connections in highway layers and avoid any dimensionality conflicts, we ensure that the inferred embedding vectors of words, their contexts, and sentence, must be of the same size, which is 100 in this work. We use Adam (Kingma and Ba, 2014) optimizer with learning rate and Xavier (Glorot and Bengio, 2010) initialization for training our models.
4.3 Effect of Character-level Modelling
To study the benefits of character-level modeling, we compared the performance of character-level and word-level models. In this experiment, we implement the word-level model by replacing the character embedding and CompCNN layers of our architecture with a word embedding layer where each column vector of the embedding matrix represents a word in the input vocabulary.
Table 1: F1 scores of the character-level and word-level models for domain, intent and slot classification, by training set size.
Tab. 1 shows that the character-level model performs better than the word-level model for all classification tasks and all sizes of training data. For example, when using 10K training utterances, the character-level model achieves a relative improvement of 6.62%, 11.14% and 11.95% in F1 scores of domain, intent and slot classification respectively compared to the word-level model.
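For reference, the relative improvements quoted here are presumably computed with the usual definition, which is an assumption since the formula is not stated in the text. Writing $F1_{\text{char}}$ and $F1_{\text{word}}$ for the F1 scores of the two models on the same task and training size:

```latex
\Delta_{\text{rel}} = \frac{F1_{\text{char}} - F1_{\text{word}}}{F1_{\text{word}}} \times 100\%
```

Under this definition, a relative gain is measured against the word-level score, so it is larger than the absolute difference in F1 points whenever the baseline score is below 100.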
Since limited amounts of training data increase the number of out-of-vocabulary (OOV) words, the word-level model fails to capture the semantic information of an utterance reliably. On the other hand, the character-level model handles OOV words by modeling each word as a sequence of characters, which alleviates the lack of training data to a certain extent. Moreover, the implicit parameter sharing between subword units helps the character-level model learn semantically rich word embeddings.
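The parameter sharing argument can be made concrete by listing the character n-grams that two related words have in common, since filters that respond to those n-grams fire for both words. The sketch below uses the rain/raining pair from the introduction and the filter widths 3 to 6 from Sec. 4.2; `char_ngrams` is a hypothetical helper, not part of the model.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of a word for widths n_min..n_max."""
    return {word[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)}

seen = char_ngrams("rain")        # word observed during training
unseen = char_ngrams("raining")   # word first met at inference time
shared = seen & unseen
print(sorted(shared))
```

Every n-gram of the shorter word also occurs in the longer one, so an unseen inflected form still activates features that were learned from its stem.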
Another point of note is that using word embeddings significantly increases the number of free parameters of the model and thus a larger training set is required to be able to improve accuracy.
4.4 Effect of Multi-task Learning
Multi-task Learning (MTL) facilitates implicit knowledge transfer among the jointly-modeled tasks via parameter sharing, which is especially beneficial in low-resource settings. Such joint modeling techniques generally improve the performance of each learning task of the model, especially the ones with higher complexity. In this experiment, we created single-task models by separating domain, intent and slot classification tasks and training on the task specific data. We then compared single-task models with the model where MTL is used. We present the experiment results where we compare single and multi task models in Tab. 2.
Table 2: F1 scores of the multi-task model and the single-task models for domain, intent and slot classification, by training set size.
For the domain classification, we observe that the single-task model achieves better performance most of the time. The domain classification task is the most basic task in our NLU system, and experiments show that a dedicated domain classifier might be enough to get a decent performance. This is the only task where we observe little to no benefit of using MTL.
On the other hand, the multi-task model performs better for intent classification for all sizes of training data. Due to the large number of intent labels, the complexity of intent classification is much higher than that of domain classification. Additionally, some intents, like the “Play” intent, occur in multiple domains such as “Books” and “Music”, which makes the intent classifier prone to ambiguity. Since MTL enables information sharing, the input vectors of the intent classifier also include information about the domain, which helps the intent classifier focus more on the semantics of the “Play” command rather than on domain-related information, thus improving its classification accuracy.
Slot classification performs fine-grained semantic analysis of the input utterance by assigning a named entity to every word. Since it is a local-context task, a single-task model infers the slot of a word merely based on its neighboring words. On the other hand, the multi-task model takes advantage of the supplementary information provided by the domain and intent of the input utterance, which helps to infer the correct slots. In utterances like “Play Beatles from Spotify” and “Play Avengers on Netflix”, accurate predictions of named entities, i.e. “Band name” for “Beatles” and “Movie name” for “Avengers”, are crucial for executing the user command correctly. As with intent classification, the multi-task model achieves better F1 scores than the single-task model for slot classification at all data sizes.
4.5 Effect of Pre-trained Embeddings
Since pre-trained word embeddings are trained on a large corpus of text like Wikipedia, they cover a significant amount of structured language which enables them to encode highly robust semantic information for each word. For transferring the semantic knowledge of the pre-trained word embeddings to our character-level architecture, we train only the first three layers of our model using L2 loss such that the word-embeddings produced by the network are similar to the pre-trained embeddings. Subsequently, we use these trained parameters to initialize the first three layers of the model while randomly initializing the other layers. We use German word-embeddings obtained from fastText (Grave et al., 2018) and call the model where the first three layers are pre-trained as the FastText-init model.
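The pre-training objective described above can be sketched as a squared-error loss between the word vector composed by the character-level layers and the pre-trained vector for the same word. In the sketch the gradient step is applied directly to the composed vector for brevity, whereas in the actual model the gradient would flow back into the parameters of the first three layers; the names and numbers are hypothetical.

```python
def l2_loss(composed, target):
    """Squared L2 distance between a composed and a pre-trained word vector."""
    return sum((c - t) ** 2 for c, t in zip(composed, target))

def l2_grad(composed, target):
    """Gradient of the loss with respect to the composed vector."""
    return [2.0 * (c - t) for c, t in zip(composed, target)]

target = [0.2, -0.4, 0.1]    # stand-in for a pre-trained embedding
composed = [0.0, 0.0, 0.0]   # stand-in for the character model's output

before = l2_loss(composed, target)
step = 0.25
composed = [c - step * g for c, g in zip(composed, l2_grad(composed, target))]
after = l2_loss(composed, target)
print(before, after)
```

Minimizing this loss over the vocabulary of the pre-trained embeddings pulls the composed representations towards the pre-trained ones before the task-specific layers are trained.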
Table 3: Performance of the FastText-init model and the randomly initialized model on the NLU tasks, by training set size.
We present the comparison of the random vs FastText-init models in Tab. 3. The FastText-init model performs better than the randomly initialized model for all NLU tasks and on all sizes of training data. We observe the biggest relative gains in the slot classification task which relies on robust word-embeddings, and benefits the most from pre-training even when trained with large amounts of data. This also shows that using character embeddings does not prevent us from taking advantage of the rich literature on word embeddings and using existing word embeddings to improve model performance.
4.6 Effect of Hierarchical Link Layers
In this section, we discuss the training strategy we employ when using hierarchical link layers in our proposed multi-task architecture. Recall that the hierarchical link layers are the information pathways which transfer the output distribution from an upstream task's classifier to the input of its immediate downstream task, as defined in the semantic hierarchy of the NLU system. To assess the effect of these link layers, we train two separate models: one with hierarchical link layers, called the with-link model, and another without them, called the no-link model.
In our experiments, we observed that naively optimizing a randomly initialized with-link model deteriorates the performance of the domain and intent tasks. We believe this is because each task tries to optimize its own output, so the tasks work antagonistically and block training progress. We overcome this problem by training the with-link model in two steps. In the first step, we start training with the link layers disabled. This step not only partially optimizes the classifier of each task but also initializes the shared layers of our model and tunes the tasks to work together rather than against each other. In the second step, we enable the hierarchical link layers and continue training from this partly optimized model, which ensures stable training dynamics and balances the performance of each NLU task.
We visualize the slot classification scores of the model on the validation set during training in Fig. 4. In the first 100 epochs we disable the links between the classification tasks and train the remaining layers of our model. After that point, we enable the domain-intent and intent-slot links. As presented in the figure, while this causes a small drop in the validation accuracy at the first epoch the links are introduced, the model quickly recovers and achieves a significant jump in the slot classification validation accuracy in the following epochs.
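The two-step schedule can be expressed as a switch on the epoch counter. The sketch below uses the 100-epoch warm-up mentioned above with zero-based epochs and shows its effect on the intent classifier input; the function names and toy vectors are hypothetical, and the real training loop is omitted.

```python
def links_enabled(epoch, warmup_epochs=100):
    """Link layers stay disabled for the first `warmup_epochs` epochs."""
    return epoch >= warmup_epochs

def intent_input(x_intent, link_vec, epoch, warmup_epochs=100):
    """Return the intent classifier input with or without the link contribution."""
    if not links_enabled(epoch, warmup_epochs):
        return list(x_intent)  # step one: shared layers and classifiers only
    return [x + l for x, l in zip(x_intent, link_vec)]  # step two: links active

x_intent = [0.5, -0.2]
link_vec = [0.1, 0.3]
step_one = intent_input(x_intent, link_vec, epoch=10)
step_two = intent_input(x_intent, link_vec, epoch=150)
print(step_one, step_two)
```

Switching the links on changes the classifier inputs abruptly, which is consistent with the brief drop in validation accuracy observed at the first epoch after the switch.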
5 Conclusion
In this work, we presented an end-to-end neural architecture for the joint modeling of the NLU tasks namely domain, intent and slot classification.
For modeling the extensive vocabularies of morphologically rich languages like German and French, we used character-level modeling to compose word embeddings, which enabled parameter sharing among words with similar subword units. In our architecture we did not rely on a morphological analyzer or disambiguator but rather adopted a data-driven approach to extract useful information from the subwords. We showed that the proposed character-level model performs notably better than the corresponding word-level model for all NLU tasks and for all data sizes.
We overcame the problem of error propagation from an upstream to a downstream task by joint modeling of all NLU tasks via MTL. We followed a hard parameter sharing approach to introduce shared hidden layers in our architecture, which permits information sharing among NLU tasks to learn domain-invariant features. We showed that the multi-task model obtains higher F1 scores than the single-task models for intent and slot classification, while domain classification showed little to no benefit from MTL. The performance gains were not limited to models trained with small datasets, indicating that information sharing among related tasks is valuable irrespective of the amount of available training data.
Additionally, we utilized pre-trained word embeddings to initialize our model for analyzing the impact of external knowledge sources on the performance of the model. We confirmed that the models initialized with pre-trained embeddings performed better than the randomly initialized models for all classification tasks. Further, we presented a simple technique to transfer the encoded semantic knowledge of pre-trained embeddings to the compositional layers of our character-level model.
We conclude that the proposed neural architecture is an efficient approach for bootstrapping the NLU module with minimal training data. The experiments in this work show that character-level models combined with multi-task learning considerably improve the overall performance of the NLU module, thus saving time, cost and effort when extending current systems to new languages.
References
- Ataman and Federico (2018). Compositional representation of morphologically-rich input for neural machine translation. arXiv preprint arXiv:1805.02036.
- Bengio et al. (2003). A neural probabilistic language model. Journal of Machine Learning Research 3 (Feb), pp. 1137–1155.
- Bojanowski et al. (2017). Enriching word vectors with subword information. Transactions of the Association of Computational Linguistics 5 (1), pp. 135–146.
- Chen et al. (2015). MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274.
- Collobert et al. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research 12 (Aug), pp. 2493–2537.
- Gillick et al. (2016). Multilingual language processing from bytes. In Proceedings of NAACL-HLT, pp. 1296–1306.
- Glorot and Bengio (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
- Grave et al. (2018). Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
- Gupta et al. (2006). The AT&T spoken language understanding system. IEEE Transactions on Audio, Speech, and Language Processing 14 (1), pp. 213–222.
- Jaech et al. (2016). Domain adaptation of recurrent neural networks for natural language understanding. arXiv preprint arXiv:1604.00117.
- Jeong and Lee (2008). Triangular-chain conditional random fields. IEEE Transactions on Audio, Speech, and Language Processing 16 (7), pp. 1287–1302.
- Kim et al. (2016a). Character-aware neural language models. In AAAI, pp. 2741–2749.
- Kim (2014). Convolutional neural networks for sentence classification. In Conference on Empirical Methods in Natural Language Processing.
- Kim et al. (2017). OneNet: joint domain, intent, slot prediction for spoken language understanding. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 547–553.
- Kim et al. (2016b). Frustratingly easy neural domain adaptation. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 387–396.
- Kingma and Ba (2014). Adam: a method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR).
- Lample et al. (2016). Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270.
- Liu and Lane (2016). Attention-based recurrent neural network models for joint intent detection and slot filling. arXiv preprint arXiv:1609.01454.
- Maas et al. (2011). Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pp. 142–150.
- Sarikaya et al. (2016). An overview of end-to-end language understanding and dialog management for personal digital assistants. In Spoken Language Technology Workshop (SLT), 2016 IEEE, pp. 391–397.
- Srivastava et al. (2015). Training very deep networks. In Advances in Neural Information Processing Systems, pp. 2377–2385.
- Turian et al. (2010). Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 384–394.
- Vania and Lopez (2017). From characters to words to in between: do we capture morphology? arXiv preprint arXiv:1704.08352.
- Xu and Sarikaya (2013). Convolutional neural network based triangular CRF for joint intent detection and slot filling. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pp. 78–83.
- Zhang and Wang (2016). A joint model of intent determination and slot filling for spoken language understanding. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 2993–2999.