Deep Cascade Multi-task Learning for Slot Filling in Chinese
E-commerce Shopping Guide Assistant
Slot filling is a critical task in natural language understanding (NLU) for dialog systems. State-of-the-art solutions regard it as a sequence labeling task and adopt BiLSTM-CRF models. While BiLSTM-CRF models work relatively well on standard datasets, they face challenges in Chinese E-commerce slot filling due to more informative slot labels and richer expressions. In this paper, we propose a deep multi-task learning model with cascade and residual connections. Experimental results show that our framework not only achieves performance competitive with state-of-the-art models on a standard dataset, but also significantly outperforms strong baselines by a substantial gain of 14.6% on a Chinese E-commerce dataset.
An intelligent E-commerce online shopping guide assistant is a comprehensive human-like system providing various services such as pre-sale and after-sale inquiries, product recommendations, and user complaint processing, all of which seek to give customers a better shopping experience. The core of such an assistant is a dialog system that can understand natural language utterances from a user and then give natural language responses. The architecture of a task-oriented dialog system for an online shopping guide assistant is illustrated in Figure 1. Natural Language Understanding (NLU), which aims to interpret the semantic meanings conveyed by input utterances, is a main component of task-oriented dialog systems. Slot filling is a subproblem of NLU that identifies the properties, and their values, of the task to be performed in the dialog.
Slot filling extracts semantic constituents by using the words of the input text to fill in pre-defined slots in a semantic frame [\citeauthoryearMesnil et al.2015]. It can be regarded as a sequence labeling task, which assigns an appropriate semantic label to each word in the given input utterance. In the case of E-commerce shopping, there are three named entity types: Category, Property Key and Property Value. We show a real example in Table 1 with the In/Out/Begin (IOB) scheme. At the named entity level, “连衣裙”(dress) is a Category (B-CG/I-CG), while “品牌”(brand) is labeled as a Property Key (B-PK/I-PK), the name of one product property. “耐克”(Nike) and “黑色”(black) are labeled as Property Values (B-PV/I-PV) since they are concrete property values. However, labeling a word as a Property Value is not fine-grained enough for NLU. Thus, at the slot filling level, we further label “耐克”(Nike) as a Brand property (B-Brand/I-Brand) and “黑色”(black) as a Color property (B-Color/I-Color). Meanwhile, words in the example utterance that carry no semantic meaning are assigned the O label.
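The two-level IOB labeling described above can be sketched in a few lines of Python; the character spans and indices below are illustrative, not taken from the released data.

```python
# Sketch of the IOB (In/Out/Begin) scheme at the two label levels described
# above; span boundaries and labels are illustrative.

def iob_tags(chars, spans):
    """spans: list of (start, end_exclusive, label) over the char sequence."""
    tags = ["O"] * len(chars)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

chars = list("耐克黑色连衣裙")  # "Nike black dress", one tag per character
# Slot-filling level: Brand / Color / Category spans
slot = iob_tags(chars, [(0, 2, "Brand"), (2, 4, "Color"), (4, 7, "CG")])
# Named-entity level: the same spans collapse to coarser PV/PV/CG types
ner = iob_tags(chars, [(0, 2, "PV"), (2, 4, "PV"), (4, 7, "CG")])
```

Note how the same spans yield coarser labels at the named entity level: the lower-level task has far fewer distinct labels to predict.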
State-of-the-art sequence labeling models are typically based on BiLSTM and CRF [\citeauthoryearHuang et al.2015, \citeauthoryearReimers and Gurevych2017] and evaluated on ATIS [\citeauthoryearPrice1990], the commonly used standard dataset in the slot filling area. This dataset is in the domain of American airline travel, and Table 2 shows an example utterance. However, the vocabulary of ATIS is small (only 572 words) and its slot labels are not diverse enough, since airline travel is a relatively small and specific domain; as a result, recent deep learning models can achieve very high F1 scores (nearly 0.96) on it.
| Utterance | flights from Dallas to New York |
| Slot Label | O O B-fromloc.city_name O B-toloc.city_name I-toloc.city_name |
Compared to ATIS, our E-commerce shopping guide assistant dataset is more complex (the dataset is available at http://Anonymized.for.blind.review). It comes from a real-world application, and its semantic slots are more diverse and informal than those of ATIS, which increases the difficulty of the task. For example, to describe the different properties of a product for utterance understanding and query rewriting, we must define a large number of informative slot labels such as color, brand, style, season and gender, while most semantic labels of ATIS relate only to time and location. On the other hand, spoken E-commerce Chinese is more complex, and its rich expressions make it harder to understand. For example, “红色” and “红” both mean red, “品牌” and “牌子” both mean brand, and “耐克”, “Nike” and “Niky” all mean Nike. In ATIS, expressions are simpler, and most of them are standard locations or times.
Besides, Chinese, like many other Asian languages, is not word-segmented by nature, and word segmentation is a difficult first step in many NLP tasks. Without proper word segmentation, sequence labeling becomes very challenging, as errors from segmentation propagate. In ATIS, by contrast, more than 97% of the chunks have only one or two words, so segmentation (or chunking) is not a serious problem.
In this paper, we are the first to employ a multi-task sequence labeling model to tackle slot filling in a novel Chinese E-commerce dialog system. We divide the slot filling task into two lower-level tasks: named entity tagging and segment tagging. Example labels of these two tasks are shown in the bottom two rows of Table 1. Segment tagging and named entity tagging can be regarded as syntactic labeling, while slot filling is more like semantic labeling. Once we know the syntactic structure of an input sentence, filling in the semantic labels becomes easier. Compared to attacking slot filling directly, these two low-level tasks are much easier to solve because they have fewer labels. To this end, we propose a Deep Cascade Multi-task Learning model and co-train the three tasks in the same framework, with the goal of optimizing the target slot filling task.
The contributions of this paper are summarized below:
This is the first attempt to attack the real-world problem of slot filling for a Chinese E-commerce online shopping guide assistant system (Section 2).
We are the first to propose a Chinese spoken dialog dataset, ECSGA (Section 4.1). Its domain differs substantially from that of the common ATIS dataset, and it contains much more data. We believe this dataset will benefit future research on dialog natural language understanding.
Given an utterance containing a sequence of words w = (w_1, ..., w_n) (we use “word” in the problem and model description, but a “word” is actually a Chinese character in our problem), the goal is to find a sequence of slot labels y = (y_1, ..., y_n), one for each word in the utterance, such that: y = argmax_{y'} p(y' | w).
In this paper we only define the slot filling problem in the Dress category domain for simplicity. That means we know the category classification (intent) in advance. There are thousands of category classifications in the E-commerce domain, and each category can have dozens of properties that are totally different; performing slot filling with tens of thousands of slot labels across domains is impractical at this point. A joint model for category classification and slot filling is left for future research.
Along with Property Key (PK), Category (CG) and O, there are altogether 29 (57 in the IOB scheme) slot labels in our problem. Examples are listed in Table 3. Notice that terms such as “brand” and “color” appearing in an utterance are labeled as PK, while Color and Brand are pre-defined slot labels assigned to terms like “black” and “Nike”.
| Slot Label | Color, Brand, … | PK | CG |
| Example Term | black, Nike, … | brand, color, … | dress, t-shirt, … |
In this section we describe our approach in detail. Figure 2 gives an overview of the proposed architectures. First we introduce the common and popular BiLSTM-CRF model (Figure 2(a)) for sequence labeling tasks. Then we move on to the multi-task learning perspective (Figures 2(b) and 2(c)). Finally we propose our new method, Deep Cascade Multi-task Learning (Figure 2(d)).
3.1 RNN Sequence Labeling
Figure 2(a) shows the principal architecture of a BiLSTM-CRF model, which is the state-of-the-art model for various sequence labeling tasks [\citeauthoryearHuang et al.2015, \citeauthoryearReimers and Gurevych2017]. A BiLSTM-CRF model consists of a BiLSTM layer and a CRF layer.
Bidirectional LSTMs enable the hidden states to capture both historical and future context information of the words. Mathematically, the input of this BiLSTM layer is a sequence of input vectors, denoted as x = (x_1, ..., x_n). The output of the BiLSTM layer is a sequence of hidden states, one for each input word, denoted as h = (h_1, ..., h_n). Each final hidden state is the concatenation of the forward and backward hidden states. We view the BiLSTM as a function: h = BiLSTM(x).
Most of the time we stack multiple BiLSTMs to make the model deeper, in which case the output of layer l becomes the input of layer l+1, e.g. h^(l+1) = BiLSTM^(l+1)(h^(l)).
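The bidirectional concatenation and layer stacking described above can be sketched as follows. A plain tanh RNN cell stands in for the LSTM cell, so this illustrates only the shape bookkeeping, not the gating; all names and sizes are illustrative.

```python
import numpy as np

# Minimal sketch of stacked bidirectional recurrence. A tanh RNN cell stands
# in for the LSTM cell; the stacking mirrors the text: layer l's hidden
# sequence is layer l+1's input.

def rnn_pass(xs, W, U, reverse=False):
    h = np.zeros(U.shape[0])
    out, order = [], range(len(xs))
    for t in (reversed(order) if reverse else order):
        h = np.tanh(W @ xs[t] + U @ h)
        out.append(h)
    return out[::-1] if reverse else out  # keep forward time order

def bi_layer(xs, params):
    Wf, Uf, Wb, Ub = params
    fwd = rnn_pass(xs, Wf, Uf)
    bwd = rnn_pass(xs, Wb, Ub, reverse=True)
    # each final hidden state is the concatenation of forward and backward
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
d, h, T = 4, 3, 5
xs = [rng.normal(size=d) for _ in range(T)]
layer1 = (rng.normal(size=(h, d)), rng.normal(size=(h, h)),
          rng.normal(size=(h, d)), rng.normal(size=(h, h)))
layer2 = (rng.normal(size=(h, 2 * h)), rng.normal(size=(h, h)),
          rng.normal(size=(h, 2 * h)), rng.normal(size=(h, h)))
hs = bi_layer(bi_layer(xs, layer1), layer2)  # stacked: h^(2) = BiLSTM(h^(1))
```

Note that layer 2's input weight matrices take 2h-dimensional inputs, since the lower layer's output concatenates both directions.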
It is always beneficial to consider the correlations between the current label and neighboring labels, since there are many syntactical constraints in natural language sentences. If we simply feed the above mentioned hidden states independently to a softmax layer to predict the labels [\citeauthoryearHakkani-Tür et al.2016], such constraints are more likely to be violated. Linear-chain Conditional Random Field (CRF) [\citeauthoryearLafferty et al.2001] is the most popular way to control the structure prediction and its basic idea is to use a series of potential functions to approximate the conditional probability of the output label sequence given the input word sequence.
Formally, we take the above sequence of hidden states h as the input to a CRF layer, and the output of the CRF is the final predicted label sequence y = (y_1, ..., y_n), where each y_i is in the set of pre-defined target labels. We denote Y(w) as the set of all possible label sequences. Then the conditional probability of the output sequence, given the input hidden state sequence h, is:

p(y | h; W, b) = ∏_{i=1}^{n} ψ_i(y_{i-1}, y_i, h) / Σ_{y' ∈ Y(w)} ∏_{i=1}^{n} ψ_i(y'_{i-1}, y'_i, h)
where ψ_i(y', y, h) = exp(W_{y',y}^T h_i + b_{y',y}) are potential functions, and W_{y',y} and b_{y',y} are the weight vector and bias of the label pair (y', y). To train the CRF layer, we use the classic maximum conditional likelihood estimate and gradient ascent. For a training dataset D, the final log-likelihood is:

L(W, b) = Σ_{(h, y) ∈ D} log p(y | h; W, b)
Finally, the Viterbi algorithm is adopted to decode the optimal output sequence y*:

y* = argmax_{y ∈ Y(w)} p(y | h; W, b)
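The decoding step can be illustrated with a minimal NumPy sketch. The emission and transition scores below are toy values, not trained CRF parameters; `emit[t, y]` plays the role of the unary score from the hidden state and `trans[y, y']` the score of the label pair.

```python
import numpy as np

# Sketch of Viterbi decoding over CRF scores (toy values, not trained ones).

def viterbi(emit, trans):
    T, L = emit.shape
    score = emit[0].copy()          # best score of paths ending in each label
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        # total[i, j]: best path ending in label i at t-1, then label j at t
        total = score[:, None] + trans + emit[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):   # follow back-pointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

emit = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 0.5]])
trans = np.array([[0.0, -2.0], [-2.0, 0.0]])  # discourage label switches
best = viterbi(emit, trans)
```

With the strong switching penalty, the decoder keeps label 0 throughout even though position 1 locally prefers label 1, which is exactly the label-dependency effect the CRF layer adds over an independent softmax.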
3.2 Multi-task Learning
While directly attacking the slot filling task is hard, low-level tasks with fewer labels are much easier to solve. Once we know the syntactic structure of a sentence, filling in semantic labels becomes easier accordingly. Thus, it is reasonable to solve the problem in a multi-task learning framework. In our problem, we devise three individual tasks: slot filling, named entity tagging and segment tagging. Slot filling is our target task; named entity tagging classifies which named entity type (PV/PK/CG) a word belongs to; and segment tagging judges whether a word is at the beginning (B), inside (I) or outside (O) of a chunk.
In a multi-task learning (MTL) setting, we have several prediction tasks over the same input sequence, where each task has its own output vocabulary (a set of task specified labels). Intuitively, the three tasks do share a lot of information. Consider the example in Table 1 again. Knowing the named entity type of “黑色”(black) being B-PV can definitely help determine its slot label, which is B-Color. Similarly, knowing its segment type (B) also helps with both named entity tagging and slot filling. Thus it is reasonable for these tasks to share parameters and learn in the same framework cooperatively.
3.2.1 Vanilla Multi-task Learning
The general idea of multi-task learning is to share the parameters of the encoding part of the network. As Figure 2(b) shows, this is naturally achieved by sharing the k-layer BiLSTM part of the network across the three tasks. On top of that, we use a separate CRF decoder for each task t: p(y^t | h; W^t, b^t), where W^t and b^t are task-specific parameters. This encourages the deep BiLSTM network to learn a hidden representation that benefits all three tasks.
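Structurally, the sharing can be sketched as below. A plain linear scorer stands in for each task-specific CRF decoder, the hidden states are random stand-ins for the shared BiLSTM output, and the label-set sizes follow Section 4.2; everything else is illustrative.

```python
import numpy as np

# Structural sketch of vanilla MTL: ONE encoder output shared by all tasks,
# plus a separate decoder (linear scorer standing in for the CRF layer) per
# task, each with its own parameters W^t, b^t.

rng = np.random.default_rng(0)
T, d = 5, 8                       # sequence length, hidden size (toy values)
hs = rng.normal(size=(T, d))      # stand-in for shared BiLSTM hidden states

label_sizes = {"segment": 3, "ner": 7, "slot": 57}
decoders = {t: (rng.normal(size=(d, L)), np.zeros(L))
            for t, L in label_sizes.items()}

# each task decodes the SAME hidden states with its own parameters
predictions = {t: (hs @ W + b).argmax(axis=1)
               for t, (W, b) in decoders.items()}
```

The key property is that gradients from all three losses flow into the shared encoder, while each decoder is updated only by its own task.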
3.2.2 Hierarchy Multi-task Learning
The previous discussion indicates that there is a natural order among the different tasks: slot filling may benefit more from named entity tagging than the other way around. This motivates us to supervise low-level tasks at lower BiLSTM layers, while high-level tasks are trained at higher layers. This idea was first proposed by Søgaard and Goldberg [\citeauthoryearSøgaard and Goldberg2016]. As shown in Figure 2(c), instead of decoding all tasks at the outermost BiLSTM layer, we associate each BiLSTM layer l with one task. Then the conditional probabilities of the output sequence for each task are:

p(y^seg | h^(1); W^seg, b^seg),  p(y^ne | h^(2); W^ne, b^ne),  p(y^slot | h^(3); W^slot, b^slot)
Here seg, ne and slot represent the tasks of segment tagging, named entity tagging and slot filling, respectively; h^(l) = BiLSTM^(l)(h^(l-1)) and h^(0) = x, the word embeddings of the input sequence w. We call this model hierarchy multi-task learning, since some layers are shared by all tasks while the others relate only to specific tasks.
3.3 Deep Cascade Multi-task Learning
Hierarchy multi-task learning shares parameters among different tasks and allows low-level tasks to help adjust the results of the high-level target task. It is effective for tasks that are only weakly correlated, such as POS tagging, syntactic chunking and CCG supertagging [\citeauthoryearSøgaard and Goldberg2016]. However, for problems where the tasks maintain a strict order, in other words, where the performance of the high-level task depends dramatically on the low-level tasks, the hierarchy structure is not compact and effective enough. Therefore, we propose cascade and residual connections that let high-level tasks take the tagging results and hidden states of low-level tasks as additional input. These connections serve as “shortcuts” that create a more closely coupled and efficient model. We call it deep cascade multi-task learning; the framework is shown in Figure 2(d).
3.3.1 Cascade Connection
Here we feed the tagging output of the task at the lower layer, e.g. y^seg or y^ne, to the upper BiLSTM layer as additional input. The hidden states of each task layer then become:

h^(2) = BiLSTM^(2)([h^(1); W_cas y^seg]),  h^(3) = BiLSTM^(3)([h^(2); W_cas y^ne])

where W_cas is the weight parameter for the cascade connection.
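A minimal sketch of building the cascade input, assuming the lower task's predicted tags are embedded and concatenated onto the upper layer's input; the names `cascade_input` and `tag_emb` and all sizes are illustrative.

```python
import numpy as np

# Sketch of the cascade connection: the lower task's predicted tags are
# embedded and concatenated onto the upper BiLSTM layer's input. `tag_emb`
# plays the role of the cascade weight described in the text.

def cascade_input(h_prev, tags_prev, tag_emb):
    # h_prev: [T, d] hidden states from the lower task's layer
    # tags_prev: length-T predicted tag ids from the lower task
    return np.concatenate([h_prev, tag_emb[tags_prev]], axis=1)

rng = np.random.default_rng(0)
h_prev = rng.normal(size=(5, 6))         # lower layer's hidden states
tag_emb = rng.normal(size=(3, 2))        # 3 segment tags -> 2-dim embeddings
x_next = cascade_input(h_prev, [0, 1, 1, 2, 0], tag_emb)
```

At training time the tag ids can be the gold tags; at inference time they are the Viterbi outputs of the lower task, as described below.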
At training time, y^seg and y^ne can be the true tagging outputs. At inference time, we simply take the greedy path of our cascade model without search: the model emits the best y^seg and y^ne via the Viterbi inference algorithm. Alternatively, one could do beam search [\citeauthoryearSutskever et al.2014, \citeauthoryearVinyals et al.2015] by maintaining a set of the best partial hypotheses at each cascade layer. However, unlike in traditional seq2seq models, e.g. in machine translation, where each inference step is based only on the probability of a discrete variable (via a softmax function), our inference for the tagging output is a structured probability distribution defined by the CRF output. An efficient beam search method for this structured cascade model is left to future work.
3.3.2 Residual Connection
To encourage information sharing among the different tasks, we also introduce a residual connection, where we add the input of the previous layer to the current input:

x^(l+1) = h^(l) + x^(l)
Deep residual learning [\citeauthoryearHe et al.2016] was first introduced to ease the vanishing gradient problem when training very deep neural networks. Here we propose that residual connections between different layers can also help multi-task learning.
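As a tiny sketch, the residual shortcut just adds the previous layer's input to the current layer's input (shapes must match); the function name is illustrative.

```python
import numpy as np

# Sketch of the residual connection between task layers: the previous
# layer's input is added to its output to form the next layer's input.

def residual_input(x_prev, h_prev):
    # x_prev: input that went INTO the previous layer
    # h_prev: hidden states that came OUT of the previous layer
    return h_prev + x_prev

x = np.ones((4, 3))
h = np.full((4, 3), 2.0)
x_next = residual_input(x, h)
```

The shortcut gives the upper task direct access to the lower layer's input features, not just its transformed output.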
For our multi-task setting, we define three loss functions (refer to Section 3.1): L_seg, L_ne and L_slot for the tasks of segment tagging, named entity tagging and slot filling, respectively. We construct three training sets, D_seg, D_ne and D_slot, where each of them (generically called D_t) contains a set of input-output sequence pairs (w, y^t). The input utterance w is shared across tasks, but the output y^t is task dependent.
For vanilla multi-task learning, we define the loss function L = L_slot + λ_1 L_ne + λ_2 L_seg, where λ_1 and λ_2 are hyper-parameters.
For hierarchy multi-task learning and cascade multi-task learning, we choose a random task t at each training step, followed by a random training batch B_t ⊆ D_t. We then update the model parameters by back-propagating the corresponding loss L_t.
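The alternating schedule above can be sketched as follows; `update` is a placeholder for the optimizer step on the chosen task's loss, and the datasets are toy lists.

```python
import random

# Sketch of the training schedule for the hierarchy/cascade models: at each
# step pick a random task t, draw a random batch B_t from that task's
# dataset, and back-propagate only that task's loss.

def train(datasets, update, steps=1000, seed=0):
    rng = random.Random(seed)
    tasks = list(datasets)
    for _ in range(steps):
        task = rng.choice(tasks)              # random task t
        batch = rng.choice(datasets[task])    # random batch B_t from D_t
        update(task, batch)                   # placeholder: backprop L_t

history = []
train({"seg": [1, 2], "ner": [3], "slot": [4, 5]},
      lambda task, batch: history.append(task), steps=50)
```

Because each step touches one task, low-level and high-level supervision interleave throughout training rather than being weighted into a single summed loss.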
In this section we first introduce the popular ATIS dataset and describe how we collect our E-commerce Shopping Guide Assistant (ECSGA) dataset. Then we give the implementation details for our model. Finally we present the evaluation results on both the ATIS and ECSGA datasets and discuss them. In the following experiments, we refer to our proposed Deep Cascade Multi-task Learning method as DCMTL.
4.1 Datasets

ATIS Dataset: The ATIS corpus (available at https://github.com/yvchen/JointSLU/tree/master/data) is the most commonly used dataset for slot filling research; it consists of reservation requests from the air travel domain. There are 84 different slot labels (127 with the IOB prefix). We randomly selected 80% of the training data for model training and used the remaining 20% as the validation set [\citeauthoryearMesnil et al.2015]. Apart from the ground-truth slot labels, we also generate the corresponding segment labels for our multi-task model setting.
ECSGA Dataset: To create large amounts of gold-standard data for training our model, we adopt an unsupervised method to automatically tag the input utterances. All utterances are extracted from the user input logs (either text or voice) of an online shopping guide assistant system. Our E-commerce knowledge base is a dictionary consisting of pairs of word terms and their ground-truth slot labels, such as “红色-颜色”(red-color). We use a dynamic programming algorithm to match terms in the utterances and then assign each word its slot label in the IOB scheme. We filter out utterances whose matching result is ambiguous and retain only those that can be perfectly matched (every word tagged by exactly one label) as our training and testing data. From the slot labels of each word, we can induce the named entity labels and segment labels straightforwardly. Our goal is to develop a sequence labeling algorithm that generalizes to vocabulary outside of the dictionary.
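The automatic tagging step can be sketched as below. A greedy longest-match stands in for the dynamic-programming matcher described above, and the dictionary entries and utterance are illustrative.

```python
# Sketch of dictionary-based automatic tagging: matched terms get IOB slot
# labels, unmatched characters get O. A greedy longest-match stands in for
# the dynamic-programming matcher used in the paper.

def auto_tag(utterance, dictionary, max_len=8):
    tags, i = [], 0
    while i < len(utterance):
        for l in range(min(max_len, len(utterance) - i), 0, -1):
            term = utterance[i:i + l]
            if term in dictionary:          # longest dictionary term wins
                label = dictionary[term]
                tags += ["B-" + label] + ["I-" + label] * (l - 1)
                i += l
                break
        else:
            tags.append("O")                # no term starts here
            i += 1
    return tags

dictionary = {"红色": "Color", "连衣裙": "CG"}
tags = auto_tag("有红色连衣裙吗", dictionary)  # "do you have a red dress?"
```

A real matcher would also have to detect ambiguous matches (overlapping terms) so those utterances can be filtered out, as described above.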
To evaluate the model's ability to generalize, we randomly split the dictionary into three parts: one part is used to generate the testing data and the other two to generate the training data. If we did not split the dictionary and used the whole of it to generate both training and testing data, the trained model could simply memorize the dictionary, and the results would not reflect the true performance of the models.
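A sketch of this split, assuming a one-third/two-thirds partition of the term dictionary; the example terms are illustrative.

```python
import random

# Sketch of the dictionary split used for evaluation: one third of the term
# dictionary generates test data, the rest generates training data, so the
# model is scored on vocabulary it never saw during training.

def split_dict(dictionary, seed=0):
    terms = sorted(dictionary)              # deterministic base order
    random.Random(seed).shuffle(terms)
    cut = len(terms) // 3
    test_terms = set(terms[:cut])
    train_terms = set(terms[cut:])
    return train_terms, test_terms

train_terms, test_terms = split_dict(
    {"红色": 1, "黑色": 2, "耐克": 3, "连衣裙": 4, "品牌": 5, "颜色": 6})
```

Utterances are then generated (tagged) only with the terms of the corresponding part, which keeps train and test vocabularies disjoint.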
The following experiments use a dataset of 24,892 training pairs and 2,723 testing pairs. Each pair contains an input utterance w and its corresponding gold sequences of slot labels y^slot, named entity labels y^ne and segment labels y^seg.
4.2 Implementation Details
For the RNN component of our system, we use a 3-layer LSTM network for ECSGA and a 2-layer LSTM network for ATIS, both with unit size 100. All input sentences are padded to a maximum sequence length of 21 for the ECSGA dataset and 46 for the ATIS dataset. The input in ECSGA is a sequence of Chinese characters rather than words, since there is no segmentation. The dimensions of the embedding layer and of the BiLSTM output hidden states are set to 200.
For experiments on the ECSGA dataset, the label-set sizes for slot filling, named entity tagging and segment tagging are 57, 7 and 3, respectively. For experiments on the ATIS dataset, the label-set sizes for slot filling and segment tagging are 127 and 3 (there is no named entity tagging in this case).
We perform mini-batch training with the log-likelihood loss, using a batch size of 32 sentences for 10 training epochs. We use the Adam optimizer with the learning rate initialized to 0.001.
4.3 Results and Discussions
Eval on ATIS: We compare the ATIS results of our DCMTL model with currently published results in Table 4. Almost all methods (including ours) reach a very high F1 score of around 0.96. This makes us question whether it is still meaningful to evaluate on this dataset, since minor differences between results may arise from model or data variance.
| simple RNN [\citeauthoryearYao et al.2013] | 0.9411 |
| CNN-CRF [\citeauthoryearXu and Sarikaya2013] | 0.9435 |
| LSTM [\citeauthoryearYao et al.2014] | 0.9485 |
| RNN-SOP [\citeauthoryearLiu and Lane2015] | 0.9489 |
| Deep LSTM [\citeauthoryearYao et al.2014] | 0.9508 |
| RNN-EM [\citeauthoryearPeng and Yao2015] | 0.9525 |
| Bi-RNN with ranking loss [\citeauthoryearVu et al.2016] | 0.9547 |
| Sequential CNN [\citeauthoryearVu2016] | 0.9561 |
| Encoder-labeler Deep LSTM [\citeauthoryearKurata et al.2016] | 0.9566 |
| BiLSTM-LSTM (focus) [\citeauthoryearZhu and Yu2017] | 0.9579 |
| Neural Sequence Chunking [\citeauthoryearZhai et al.2017] | 0.9586 |
Eval on ECSGA: On the ECSGA dataset, we evaluate different models, including Basic BiLSTM-CRF, Vanilla Multi-task, Hierarchy Multi-task and Deep Cascade Multi-task, on the testing data, with slot filling as the target task. We report Precision, Recall and F1 in Table 5.
The Basic BiLSTM-CRF model achieves an F1 score of 0.43. To show the impact of the lower-level tasks on slot filling, we “cheated” by using the ground-truth segment type (cond. SEG) or named entity type (cond. NE) as extra features for each word in the Basic BiLSTM-CRF model. Rows 3 and 4 (marked with *) in Table 5 show that slot filling performance improves by 85% and 109% when the correct segment type or named entity type is known in advance. Of course, in practice the model does not know the true values of these types.
Our further experiments show that DCMTL outperforms the baselines on both precision and recall. DCMTL achieves the best F1 score of 0.5105, which improves over the strongest baseline method by a relative margin of 14.6% (see Table 5). Multi-task models generally perform better than the Basic BiLSTM-CRF with a single-task target. The exception is the vanilla multi-task setting, mainly because vanilla multi-task learning shares parameters across all layers, which are easily disturbed by the interaction of the three tasks. It is preferable to let the target task dominate the weights at the high-level layers.
| * Basic BiLSTM-CRF (cond. SEG) | 0.7948 | 0.7953 | 0.7950 |
| * Basic BiLSTM-CRF (cond. NE) | 0.8985 | 0.8986 | 0.8985 |
| ** DCMTL (- cascade) | 0.4654 | 0.4613 | 0.4633 |
| ** DCMTL (- residual) | 0.4923 | 0.4760 | 0.4840 |
We further investigate the learning trend of our proposed approach against the baseline methods. Figures 3(a), (b) and (c) show the typical learning curves of performance measured by Precision, Recall and F1. We observe that DCMTL performs worse than the baseline methods during the early batch steps. After that, the other methods converge quickly, while DCMTL keeps improving, performs much better in later batch steps, and finally converges to the best F1 score. We believe that in the beginning, the high-level task in DCMTL is affected more by the noise of the low-level tasks than the other methods are, but as training goes on, the high-level slot filling task gradually reaps the benefits of the low-level tasks.
Ablation Test: We also investigate how DCMTL performs with and without the cascade and residual connections. As shown in Table 5, the F1 score increases from 0.4840 to 0.5105 when the residual connection is applied, which verifies its benefit. If we remove the cascade connection from DCMTL, the model degenerates into the hierarchy multi-task model with a residual connection; the table shows that this variant performs better than the basic hierarchy multi-task model. Meanwhile, we can conclude that the cascade connection plays a more important role than the residual connection in our DCMTL model.
Furthermore, we explore how DCMTL performs with different cascade connection methods. We compare the three types of cascade connection illustrated in Figure 4(a):
Segment labeling connected to slot filling by a skip connection (SLOT+SEG).
Named entity labeling directly connected to slot filling (SLOT+NE).
Segment labeling, named entity labeling and slot filling in sequence (SLOT+NE+SEG).
From Figure 3(d), we find that cascade connection type 3 performs best, followed by type 2, while the skip-connection variant (type 1) performs worst. Therefore, we design the network with cascade connections in a hierarchical fashion and do not apply skip connections for the cascade inputs (Figure 4(b)). This observation is also supported by the conditioned experiments above: slot filling with a pre-known named entity type performs much better than with a pre-known segment type (rows with * in Table 5).
5 Related Work
Slot filling is considered a sequence labeling problem that is traditionally solved by generative models such as Hidden Markov Models (HMMs) [\citeauthoryearWang et al.2005] and the hidden vector state model [\citeauthoryearHe and Young2003], and by discriminative models such as conditional random fields (CRFs) [\citeauthoryearRaymond and Riccardi2007, \citeauthoryearLafferty et al.2001] and Support Vector Machines (SVMs) [\citeauthoryearKudo and Matsumoto2001]. In recent years, deep learning approaches have been explored due to their successful application in many NLP tasks. Many neural network architectures have been used, such as simple RNNs [\citeauthoryearYao et al.2013, \citeauthoryearMesnil et al.2015], convolutional neural networks (CNNs) [\citeauthoryearXu and Sarikaya2013], LSTMs [\citeauthoryearYao et al.2014] and variations like encoder-decoders [\citeauthoryearZhu and Yu2017, \citeauthoryearZhai et al.2017] and external memory [\citeauthoryearPeng and Yao2015]. In general, these works adopt a BiLSTM as the major labeling architecture to extract various features, then use a CRF layer [\citeauthoryearHuang et al.2015] to model the label dependency. We also adopt a BiLSTM-CRF model as a baseline and claim that a multi-task learning framework works better than applying it directly to a Chinese E-commerce dataset. Previous works only apply joint models of slot filling and intent detection [\citeauthoryearZhang and Wang2016, \citeauthoryearLiu and Lane2016]. Our work is the first to propose a multi-task sequence labeling model with deep neural networks for the slot filling problem.
Multi-task learning (MTL) has attracted increasing attention in both academia and industry recently. By jointly learning across multiple tasks [\citeauthoryearCaruana1998], we can improve performance on each task and reduce the need for labeled data. There have been several attempts to use multi-task learning for sequence labeling [\citeauthoryearPeng and Dredze2016b, \citeauthoryearPeng and Dredze2016a, \citeauthoryearYang et al.2017], where most of these works learn all tasks at the outermost layer. Søgaard and Goldberg \shortcitesogaard2016deep were the first to assume a hierarchy between the different tasks in a stacked BiRNN model. Compared to these works, our DCMTL model develops this idea further with cascade and residual connections.
In this paper, we attempt to solve the real-world slot filling task for a novel Chinese E-commerce shopping guide assistant. We proposed a deep multi-task sequence learning framework with cascade and residual connections. Our model achieves results comparable with several state-of-the-art models on the common slot filling dataset ATIS. On our released real-world Chinese E-commerce dataset ECSGA, our proposed model DCMTL also achieves the best F1 score compared to several strong baselines.
- [\citeauthoryearCaruana1998] Rich Caruana. Multitask learning. In Learning to learn. Springer, 1998.
- [\citeauthoryearHakkani-Tür et al.2016] Dilek Hakkani-Tür, Gokhan Tur, Asli Celikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. Multi-domain joint semantic frame parsing using bi-directional rnn-lstm. In INTERSPEECH, 2016.
- [\citeauthoryearHe and Young2003] Yulan He and Steve Young. A data-driven spoken language understanding system. In ASRU, 2003.
- [\citeauthoryearHe et al.2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- [\citeauthoryearHuang et al.2015] Zhiheng Huang, Wei Xu, and Kai Yu. Bidirectional lstm-crf models for sequence tagging. arXiv, 2015.
- [\citeauthoryearKudo and Matsumoto2001] Taku Kudo and Yuji Matsumoto. Chunking with support vector machines. In NAACL, 2001.
- [\citeauthoryearKurata et al.2016] Gakuto Kurata, Bing Xiang, Bowen Zhou, and Mo Yu. Leveraging sentence-level information with encoder lstm for semantic slot filling. arXiv, 2016.
- [\citeauthoryearLafferty et al.2001] John Lafferty, Andrew McCallum, and Fernando CN Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
- [\citeauthoryearLiu and Lane2015] Bing Liu and Ian Lane. Recurrent neural network structured output prediction for spoken language understanding. In NIPS Workshop, 2015.
- [\citeauthoryearLiu and Lane2016] Bing Liu and Ian Lane. Joint online spoken language understanding and language modeling with recurrent neural networks. arXiv, 2016.
- [\citeauthoryearMesnil et al.2015] Grégoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xiaodong He, Larry Heck, Gokhan Tur, Dong Yu, et al. Using recurrent neural networks for slot filling in spoken language understanding. TASLP, 23(3), 2015.
- [\citeauthoryearPeng and Dredze2016a] Nanyun Peng and Mark Dredze. Improving named entity recognition for chinese social media with word segmentation representation learning. In ACL, volume 2, 2016.
- [\citeauthoryearPeng and Dredze2016b] Nanyun Peng and Mark Dredze. Multi-task multi-domain representation learning for sequence tagging. arXiv, 2016.
- [\citeauthoryearPeng and Yao2015] Baolin Peng and Kaisheng Yao. Recurrent neural networks with external memory for language understanding. arXiv, 2015.
- [\citeauthoryearPrice1990] Patti J Price. Evaluation of spoken language systems: The atis domain. In Speech and Natural Language, 1990.
- [\citeauthoryearRaymond and Riccardi2007] Christian Raymond and Giuseppe Riccardi. Generative and discriminative algorithms for spoken language understanding. In INTERSPEECH, 2007.
- [\citeauthoryearReimers and Gurevych2017] Nils Reimers and Iryna Gurevych. Optimal hyperparameters for deep lstm-networks for sequence labeling tasks. arXiv, 2017.
- [\citeauthoryearSøgaard and Goldberg2016] Anders Søgaard and Yoav Goldberg. Deep multi-task learning with low level tasks supervised at lower layers. In ACL, volume 2, 2016.
- [\citeauthoryearSutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
- [\citeauthoryearVinyals et al.2015] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
- [\citeauthoryearVu et al.2016] Ngoc Thang Vu, Pankaj Gupta, Heike Adel, and Hinrich Schütze. Bi-directional recurrent neural network with ranking loss for spoken language understanding. In ICASSP, 2016.
- [\citeauthoryearVu2016] Ngoc Thang Vu. Sequential convolutional neural networks for slot filling in spoken language understanding. arXiv, 2016.
- [\citeauthoryearWang et al.2005] Ye-Yi Wang, Li Deng, and Alex Acero. Spoken language understanding. IEEE Signal Processing Magazine, 22(5), 2005.
- [\citeauthoryearXu and Sarikaya2013] Puyang Xu and Ruhi Sarikaya. Convolutional neural network based triangular crf for joint intent detection and slot filling. In ASRU, 2013.
- [\citeauthoryearYang et al.2017] Zhilin Yang, Ruslan Salakhutdinov, and William W Cohen. Transfer learning for sequence tagging with hierarchical recurrent networks. arXiv, 2017.
- [\citeauthoryearYao et al.2013] Kaisheng Yao, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi, and Dong Yu. Recurrent neural networks for language understanding. In Interspeech, 2013.
- [\citeauthoryearYao et al.2014] Kaisheng Yao, Baolin Peng, Yu Zhang, Dong Yu, Geoffrey Zweig, and Yangyang Shi. Spoken language understanding using long short-term memory neural networks. In SLT Workshop, 2014.
- [\citeauthoryearZhai et al.2017] Feifei Zhai, Saloni Potdar, Bing Xiang, and Bowen Zhou. Neural models for sequence chunking. In AAAI, 2017.
- [\citeauthoryearZhang and Wang2016] Xiaodong Zhang and Houfeng Wang. A joint model of intent determination and slot filling for spoken language understanding. In IJCAI, 2016.
- [\citeauthoryearZhu and Yu2017] Su Zhu and Kai Yu. Encoder-decoder with focus-mechanism for sequence labelling based spoken language understanding. In ICASSP, 2017.