Investigating how well contextual features are captured by bi-directional recurrent neural network models
Learning algorithms for natural language processing (NLP) tasks traditionally rely on manually defined contextual features. On the other hand, neural network models that use only distributional representations of words have been successfully applied to several NLP tasks. Such models learn features automatically and avoid explicit feature engineering. Across several domains, neural models become a natural choice, specifically when little is known about the characteristics of the data. However, this flexibility comes at the cost of interpretability. In this paper, we define three different methods to investigate the ability of bi-directional recurrent neural networks (RNNs) to capture contextual features. In particular, we analyze RNNs for sequence tagging tasks. We perform a comprehensive analysis on general as well as biomedical domain datasets. Our experiments focus on important contextual words as features, and the methods can easily be extended to analyze various other feature types. We also investigate positional effects of context words and show how the developed methods can be used for error analysis.
Learning approaches for NLP tasks can be broadly put into two categories based on the way features are obtained or defined. The traditional way is to design features according to a specific problem setting and then use an appropriate learning approach. Examples of such methods include classification algorithms like SVM [Hong, 2005] and CRF [Lafferty et al., 2001], among others, for several NLP tasks. A significant proportion of the overall effort is spent on feature engineering itself. The desire to obtain better performance on a particular problem leads researchers to come up with domain- and task-specific sets of features. The primary advantage of using these models is their interpretability. However, dependence on handcrafted features limits their applicability in low-resource domains, where obtaining a rich set of features is difficult.
On the other hand, neural network models provide a more general way of approaching problems in the NLP domain. The models can learn relevant features with minimal effort in explicit feature engineering. This ability allows the use of such models for problems in low-resource domains.
The primary drawback of neural network models is that they are difficult to interpret, as the features are not manually defined. Neural networks have been applied widely to various tasks without much insight into what the underlying structural properties are and how the models learn to classify the inputs correctly. Mostly inspired by computer vision [Simonyan et al., 2013, Nguyen et al., 2015], several mathematical and visual techniques have been developed in this direction [Elman, 1989, Karpathy et al., 2015, Li et al., 2016].
In contrast to the existing works, this study aims to investigate the ability of recurrent neural models to capture important context words. Towards this goal, we define multiple measures based on the word erasure technique [Li et al., 2016]. We do a comprehensive analysis of the performance of bi-directional recurrent neural network models for sequence tagging tasks using these measures. The analysis is focused on understanding how well the relevant contextual words are captured by different neural models in different settings. It provides a general tool to compare different models, to show how neural networks follow our intuition by giving importance to more relevant words, to study positional effects of context words and to carry out error analysis for improving the results.
2 Proposed Methods
A sequence tagging task involves assigning a tag (from a predefined set) to each element present in a given sequence. We model Named Entity Recognition (NER) as a sequence tagging task. We follow the BIO-tagging scheme, where each named entity type is associated with two labels, B (standing for Beginning) and I (standing for Intermediate). The BIO scheme uses another label, O (standing for Other), for all the context or non-entity words.
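The BIO scheme can be illustrated with a short sketch. The helper name `spans_to_bio`, the span format (start, exclusive end, entity type) and the example sentence are assumptions made for illustration; they are not taken from the datasets.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans into BIO tags.

    spans: list of (start, end, entity_type) tuples, with `end` exclusive.
    Every token outside a span keeps the tag "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        tags[start] = "B-" + entity_type          # first word of the phrase
        for i in range(start + 1, end):
            tags[i] = "I-" + entity_type          # remaining words of the phrase
    return tags


# Illustrative sentence (hypothetical, not from the test sets).
tokens = ["Josef", "Zieleniec", "visited", "Rangoon", "."]
spans = [(0, 2, "PER"), (3, 4, "LOC")]
bio_tags = spans_to_bio(tokens, spans)
print(list(zip(tokens, bio_tags)))
```

With 4 entity types this yields the 2 x 4 entity labels plus O that the tagger has to choose from for every word.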
In this section, we discuss three methods to calculate the importance score of context words. Each method creates a different ranking of context words corresponding to each entity type for a given dataset. The methods range from a simple frequency-based score to scores that consider sentence level or individual word level effects. We assume that we have a model pretrained on the given dataset.
2.1 Based on word frequency
For a given sentence $S$ in the test set $D$, consider a window of a particular size around each entity phrase (single or multi word, defined by true tags) in $S$. We increment the score (corresponding to that phrase's entity type only) for each of the context words present in this window by one. For instance, the CoNLL-2003 shared task data (described in section 3.2) has 4 entity types, namely, organization (ORG), location (LOC), person (PER) and miscellaneous (MISC). The corresponding labels under the BIO-tagging scheme are B-ORG, I-ORG, B-LOC, I-LOC and so on. For a 2-word phrase with true tags (B-LOC, I-LOC), the score corresponding to LOC for each context word (with true tag O) in the window is incremented by one. Let the score for a context word $w_c$ corresponding to entity type $e$ in one sentence $S$ be $A(w_c, e, S)$.
Hence the relevance score is calculated as follows:
Using inverse frequency to account for irrelevant, too frequent words, the score can be calculated as follows:
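$$I(w_c, e) = \left( \frac{\sum_{S \in D} A(w_c, e, S)}{\sum_{w_c} \sum_{S \in D} A(w_c, e, S)} \right) \left( \frac{\sum_{e'} \sum_{w_c} \sum_{S \in D} A(w_c, e', S)}{\sum_{e'} \sum_{S \in D} A(w_c, e', S) + k} \right)$$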
where $k$ accounts for 0 counts and the sum over $e'$ means summing over all the remaining entity types. In our experiments, we use $k = 1$ and a window size of 11 (5 words on each side). We refer to these methods collectively as M_WF in the rest of the paper.
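The frequency-based score with the inverse-frequency factor can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the toy dataset, and the choice to count 5 words on each side of the whole entity phrase are assumptions.

```python
from collections import defaultdict


def entity_phrases(tags):
    """Yield (start, end, entity_type) for each entity phrase; `end` is exclusive."""
    i = 0
    while i < len(tags):
        if tags[i] == "O":
            i += 1
            continue
        entity_type = tags[i].split("-", 1)[1]
        j = i + 1
        while j < len(tags) and tags[j] == "I-" + entity_type:
            j += 1
        yield i, j, entity_type
        i = j


def window_counts(dataset, half_window=5):
    """Return counts[e][w]: the sum over sentences S of A(w, e, S)."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens, tags in dataset:
        for start, end, entity_type in entity_phrases(tags):
            lo = max(0, start - half_window)
            hi = min(len(tokens), end + half_window)
            for pos in list(range(lo, start)) + list(range(end, hi)):
                if tags[pos] == "O":              # only context words are scored
                    counts[entity_type][tokens[pos]] += 1
    return counts


def relevance(counts, word, entity_type, k=1):
    """Normalized frequency of `word` for `entity_type` times the inverse-frequency factor."""
    own = counts[entity_type]
    term_frequency = own.get(word, 0) / sum(own.values())
    others = [c for e, c in counts.items() if e != entity_type]
    total_other = sum(sum(c.values()) for c in others)   # all words, remaining entity types
    word_other = sum(c.get(word, 0) for c in others)     # this word, remaining entity types
    return term_frequency * total_other / (word_other + k)


# Toy data (hypothetical sentences, for illustration only).
dataset = [
    (["midfielder", "Josef", "scored", "in", "Rangoon"], ["O", "B-PER", "O", "O", "B-LOC"]),
    (["the", "city", "of", "Rangoon"], ["O", "O", "O", "B-LOC"]),
]
counts = window_counts(dataset)
print(relevance(counts, "midfielder", "PER"), relevance(counts, "midfielder", "LOC"))
```

A word that appears near many entity types gets a large denominator in the second factor, which pushes its score down for every type.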
2.2 Using sentence level log likelihood
In the M_WF method, the relevance of each context word is calculated irrespective of its dependence on other words in the sentence. We define another measure, using sentence level log likelihood, to take into account the dependency between words in a sentence. We refer to this method as M_SLL in the rest of the paper.
Consider the set of all context words and the set of all entity types. For a context word $w_c$ and an entity type $e$, consider the set of all sentences in which both $w_c$ and $e$ are present. We say that an entity type $e$ is present in a sentence $S$ if there exists a word in $S$ whose true tag corresponds to entity type $e$. We also record the size of this set of sentences.
Now, consider the true tag sequence of a sentence $S$. For a context word $w_c$ in $S$, we compute the negative log likelihood of this true tag sequence under the pretrained model and call it the original negative log likelihood. Note that since we are working at a sentence level, this value is the same for all the context words and entities present in $S$.
We adapt the erasure method of Li et al. [2016]. Here, we replace the representation of word $w_c$ with a random word representation having the same number of dimensions and recalculate the negative log likelihood for the true tag sequence of $S$. We call this value the erased negative log likelihood. Intuitively, if $w_c$ is present in $S$ and is relevant for the entity type $e$, the probability of the true sequence should decrease when the word is removed from the sentence. Correspondingly, its negative log likelihood value should increase. Hence, the score for a given word $w_c$ corresponding to the entity type $e$ can be calculated in the following manner:
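The erasure step can be sketched as follows. This is a minimal illustration, not the authors' implementation: `nll_fn` stands for any function that returns the negative log likelihood of a tag sequence under a pretrained model, `toy_nll` is a hypothetical stand-in so that the example runs, drawing the random vector uniformly from [-1, 1] is an assumption, and the raw change in negative log likelihood is used here as the quantity of interest.

```python
import random


def erasure_delta(embeddings, true_tags, position, nll_fn, rng):
    """Change in negative log likelihood after erasing the word at `position`.

    A positive value means the true tag sequence became less likely
    once the word's representation was replaced by a random vector.
    """
    base = nll_fn(embeddings, true_tags)
    dim = len(embeddings[position])
    erased = list(embeddings)                     # shallow copy; one row is replaced
    erased[position] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return nll_fn(erased, true_tags) - base


def toy_nll(embeddings, true_tags):
    # Hypothetical stand-in for a trained tagger: it is "happiest" (loss 0)
    # when the first component of every word vector equals 1.
    return sum((1.0 - vec[0]) ** 2 for vec in embeddings)


rng = random.Random(0)
embeddings = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
true_tags = ["O", "B-PER", "O"]
delta = erasure_delta(embeddings, true_tags, 0, toy_nll, rng)
print(delta > 0)
```

Averaging such changes over all sentences that contain both the word and the entity type gives one score per (word, entity type) pair; the exact normalization is left open here.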
2.3 Considering left and right word contexts separately
The relevance scoring method M_SLL does not distinguish between words present in the same sentence. The third method, referred to as M_LRC, works at the word level and calculates the relevance score of each word by distinguishing its presence on the left or right side of the entity word. The measure is defined in such a way that it still takes into account the dependency between words in the sentence. In a bi-directional setting, the hidden layer representation for any word in a sentence is a concatenation of two representations: one which combines the words to the left, and the other which combines the words to the right.
In the output layer, we combine the weight parameters and the hidden layer representation by a dot product. We divide this dot product into two parts as discussed below. Say the hidden representation is $h$ and the weight parameters corresponding to a tag $t \in T$ (the set of all possible tags) are represented by $p_t$. We can write the dot product $p_t^{T} \cdot h$ as a sum of two dot products, $p_{t,l}^{T} \cdot h_l$ and $p_{t,r}^{T} \cdot h_r$, representing the contribution from the left and right parts separately. In our experiments, we also include the bias term as a weight parameter.
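The split of the output-layer dot product into a left and a right contribution can be checked numerically. The vectors below are made-up values, and the assumption that the first half of the concatenated hidden vector is the left-context part is for illustration only; the bias term is not shown.

```python
def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


n_hidden = 3                                   # hidden units per direction
hidden = [0.2, -0.5, 0.9, 0.4, 0.1, -0.3]      # [left part | right part]
weights = [1.0, 0.5, -0.2, 0.3, -0.7, 0.8]     # output weights for one tag

left = dot(weights[:n_hidden], hidden[:n_hidden])    # contribution of left context
right = dot(weights[n_hidden:], hidden[n_hidden:])   # contribution of right context
full = dot(weights, hidden)

print(left, right, full)
```

Because the hidden vector is a concatenation, the two partial dot products always add up to the full score for the tag.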
Now, take a sentence $S$, a context word $w_c$ in $S$, and an entity word $w_e$ in $S$ with true tag $t$ corresponding to entity type $e$. Define $\mathrm{AvgSum}(w_c, w_e, S)$ as follows:
where $|T|$ is the size of the set $T$ and $K$ is either $l$ or $r$ depending on whether the word $w_c$ lies to the left or right of $w_e$ respectively. Notice that this sum is over all the false tags in set $T$ for the word $w_e$.
With the intuition that the important word should have higher dot product corresponding to true tag than to false tags, we define the score as follows:
$$L_1(w_c, w_e, S) = \frac{p_{t,K}^{T} \cdot h_K - \mathrm{AvgSum}(w_c, w_e, S)}{\mathrm{AvgSum}(w_c, w_e, S)}$$
We again employ the word erasure technique and recompute the above score after replacing the representation of word $w_c$ with a random word representation. We call it $L_2(w_c, w_e, S)$. Now, we can compute the final score for this instance as:
$$L(w_c, w_e, S) = \frac{L_1(w_c, w_e, S) - L_2(w_c, w_e, S)}{L_2(w_c, w_e, S)}$$
The relevance score is then computed by taking the average of $L(w_c, w_e, S)$ over all instances.
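A single M_LRC instance score can be sketched as below. The sketch assumes that AvgSum is the mean of the dot products over the false tags, which is one reading of the description above; the weights and hidden vectors are made-up numbers, and `hidden_erased` plays the role of the hidden representation recomputed after the context word has been replaced by a random word.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l1_score(weights, hidden_half, true_tag):
    """(true-tag dot product - AvgSum) / AvgSum for one half (left or right) of h."""
    false_tags = [t for t in weights if t != true_tag]
    avg_sum = sum(dot(weights[t], hidden_half) for t in false_tags) / len(false_tags)
    return (dot(weights[true_tag], hidden_half) - avg_sum) / avg_sum


def instance_score(weights, hidden_half, hidden_half_erased, true_tag):
    """Relative drop of the L1 score when the context word is erased."""
    l1 = l1_score(weights, hidden_half, true_tag)
    l2 = l1_score(weights, hidden_half_erased, true_tag)
    return (l1 - l2) / l2


# Per-tag weights for the relevant half of the hidden vector (illustrative values).
weights = {"I-LOC": [1.0, 2.0], "I-MISC": [0.5, 0.5], "O": [0.5, 1.0]}
hidden = [0.5, 2.0]           # with the real context word
hidden_erased = [2.0, 0.5]    # after replacing it by a random word

score = instance_score(weights, hidden, hidden_erased, "I-LOC")
print(score)
```

A positive score means the true tag stood out more from the false tags while the context word was present; a negative score means the model did better without it. Averaging these values over all instances of a (word, entity type) pair gives the relevance score.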
We use sequence tagging as the task for evaluating and analyzing the proposed methods for interpreting neural network models. In particular, we choose three variants of recurrent neural network models for the Named Entity Recognition (NER) task.
3.1 Model architecture
The generic RNN model architecture used for this work is given in figure 1.
The input layer contains all the words in the sentence. In the embedding layer, each word is represented by its $d$-dimensional vector representation. The hidden layer contains a bi-directional recurrent neural network which outputs, for every word, a representation whose dimensionality is determined by the number of hidden layer units in the recurrent neural network. In bi-directional models, both the past and future contexts are used to represent the words in a given sentence. Finally, a fully connected network connects the hidden layer to the output layer, which contains scores for each possible tag corresponding to every word in the sentence. A sentence level log likelihood loss function [Collobert et al., 2011] is used in the training process.
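The way a bi-directional hidden layer builds one representation per word can be sketched with a plain tanh recurrence. This is an illustrative forward pass only, with random untrained weights, no bias terms and no output layer; the function names and the sizes `d` and `h` are assumptions, and the gated LSTM and GRU cells are not shown.

```python
import math
import random


def rnn_pass(inputs, w_in, w_rec):
    """Run a simple tanh RNN over `inputs`; return one hidden vector per step."""
    n_hidden = len(w_rec)
    state = [0.0] * n_hidden
    states = []
    for x in inputs:
        state = [
            math.tanh(
                sum(w_in[j][i] * x[i] for i in range(len(x)))
                + sum(w_rec[j][i] * state[i] for i in range(n_hidden))
            )
            for j in range(n_hidden)
        ]
        states.append(state)
    return states


def bi_rnn(inputs, fwd, bwd):
    """Concatenate a left-to-right pass with a right-to-left pass, word by word."""
    left = rnn_pass(inputs, *fwd)
    right = rnn_pass(inputs[::-1], *bwd)[::-1]    # re-align to the word order
    return [l + r for l, r in zip(left, right)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, h = 4, 3                                       # embedding size, hidden units per direction
sentence = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(5)]
fwd = (random_matrix(h, d, rng), random_matrix(h, h, rng))
bwd = (random_matrix(h, d, rng), random_matrix(h, h, rng))
hidden = bi_rnn(sentence, fwd, bwd)
print(len(hidden), len(hidden[0]))
```

In this sketch the first `h` components of each word's vector depend only on the words up to and including that word, and the last `h` components only on that word and the words after it.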
For this work, we experiment with the standard bi-directional Recurrent Neural Network (Bi-RNN), the bi-directional Long Short Term Memory Network (Bi-LSTM) [Graves, 2013, Huang et al., 2015] and the bi-directional Gated Recurrent Unit Network (Bi-GRU) [Chung et al., 2014]. For simplicity, we refer to these bi-directional models as RNN, LSTM and GRU in the rest of the paper.
3.2 Datasets
In this work, we use two NER datasets from diverse domains. One is from the generic domain whereas the other is from the biomedical domain. Statistics of both datasets are given in Table 1.
CoNLL, 2003: This dataset was released as a part of the CoNLL-2003 language independent named entity recognition task [Tjong Kim Sang and De Meulder, 2003]. Four named entity types have been used: location, person, organization and miscellaneous. For this work, we have used the original split of the English dataset. There were 8 tags used: I-PER, B-LOC, I-LOC, B-ORG, I-ORG, B-MISC, I-MISC and O. We focus on three entity types, namely, location (LOC), person (PER) and organization (ORG) in our analysis. For this dataset, we use pretrained GloVe 50 dimensional word vectors [Pennington et al., 2014].
JNLPBA, 2004: Released as a part of the Bio-Entity recognition task [Kim et al., 2004] at JNLPBA in 2004, this dataset is from the GENIA version 3.02 corpus [Kim et al., 2003]. There are 5 classes in total: DNA, RNA, Cell_line, Cell_type and Protein. We use all the classes in our analysis. There are 11 tags, 2 (for the beginning and intermediate words) for each class and O for other context words. We use 50 dimensional word vectors trained using the skip-gram method on a biomedical corpus [Mikolov et al., 2013a, Mikolov et al., 2013b]. For this work, we calculate the relevance scores for all the words which have O as their true tag in any test instance in the two datasets.
3.3 Correlation measures
In the output (last) layer we take the dot product between the weight parameters and the hidden layer outputs and expect that this value (normalized) would be highest for the true tag. To measure the similarity between the hidden layer outputs and the weight parameters, we consider two other measures apart from the dot product:
Kullback-Leibler Divergence: Given two discrete probability distributions A and B, the Kullback-Leibler Divergence (or KL Divergence) from B to A is computed in the following manner:
$$D_{KL}(A \,\|\, B) = \sum_i A(i) \log \frac{A(i)}{B(i)}$$
$D_{KL}(A \,\|\, B)$ may be interpreted as a measure of how well the distribution B approximates the distribution A. For our experiments, we take the normalized weight parameters as A and the hidden representations as B. The lower this KL divergence is, the higher is the correlation between A and B.
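Pearson Correlation Coefficient: Given two variables X and Y, the Pearson Correlation Coefficient (PCC) is defined as:
$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$
where $\mathrm{cov}(X, Y)$ is the covariance, and $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y respectively. $\rho_{X,Y}$ takes values between -1 and 1.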
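The two additional measures can be sketched with the standard library alone. The softmax used to turn raw vectors into probability distributions is an assumption (the text only says the values are normalized), and the example vectors are made up.

```python
import math


def softmax(values):
    """Turn a real-valued vector into a probability distribution (assumed normalization)."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(a, b):
    """KL divergence from B to A: sum_i A(i) * log(A(i) / B(i))."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


def pearson(x, y):
    """Pearson correlation coefficient: cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sigma_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sigma_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sigma_x * sigma_y)


weights = [0.9, -0.4, 0.3, 1.2]     # illustrative weight parameters for one tag
hidden = [0.8, -0.5, 0.1, 0.9]      # illustrative hidden layer outputs

kl = kl_divergence(softmax(weights), softmax(hidden))
pcc = pearson(weights, hidden)
print(kl, pcc)
```

For a well-matched tag one would expect a small KL divergence together with a PCC close to 1, which is the pattern reported for the true tag in the correlation analysis.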
4 Results and Discussion
Throughout our experiments, we use 50 dimensional word vectors, 50 hidden layer units, a learning rate of 0.05, 21 epochs and a batch size of 1. The performance of the various models on both datasets is summarized in Table 1. Among the three bi-directional models, LSTM performs the best.
4.1 Correlation Analysis
We analyze the correlation between the hidden layer representations and the weight parameters connecting the hidden and output layers. Meeting our expectation, the correlation of the hidden layer values is found to be higher with the weight parameters corresponding to the true tag for a given input word. For instance, take a sentence from the CoNLL dataset: “The students, who had staged an 11-hour protest at the junction in northern Rangoon, were taken away in three vehicles.”. Here, the word “Rangoon” has I-LOC as its true tag and all the rest are context words. Figure 2 plots the normalized values for the left side part of the hidden representation for “Rangoon”, along with the corresponding weight parameters for the I-LOC and I-MISC tags.
I-MISC has been chosen as its corresponding dot product is the maximum among all the false tags. The high correlation between the hidden representation and the weight parameters for the true tag can be clearly observed from the figure.
Table 2 gives the correlation values for above three measures corresponding to the “Rangoon” instance.
| Tag | Dot Product | KL Divergence | PCC |
|---|---|---|---|
| I-LOC (true tag) | 7.27 | 0.15 | 0.62 |
| I-MISC (false tag) | 1.76 | 0.48 | 0.17 |
4.2 Analysis of Relevance Scores
In order to evaluate the ability of RNN models to capture important contextual words, we do a qualitative analysis at both the word and sentence levels. This section provides instances from both the CoNLL and JNLPBA datasets to illustrate how the three measures can be used to identify salient words with respect to a bi-directional model. Although we compute word rankings using the three measures described above, our demonstrations in the paper primarily focus on the M_LRC method. M_LRC is able to treat each word individually, with due attention to its dependency on other words in a given sentence.
At the word level, we further breakdown the visualizations into three types:
Fixing a word and a method: In this case, we fix a particular word and use the M_LRC method. We analyze how the importance scores change with various models, entities and correlation measures. Figures 2(a), 2(b) and 2(c) show heatmaps obtained by fixing the word “midfielder” and the M_LRC method for the CoNLL dataset. Based on our intuition, the word “midfielder” should have higher importance scores for the PER entity. This is clearly visible in the illustrations. All three correlation measures are able to capture this intuition to a reasonable extent. Similarly, figures 2(d), 2(e) and 2(f) show heatmaps for “apoptosis” on the JNLPBA dataset. The higher scores given to the cell_type class are in agreement with the results of the M_WF method as well as with our intuition, as “apoptosis” indicates cell death. It can also be observed that all the bi-directional models do quite well in both these cases.
Fixing a model and a method: In this case, we fix a particular model and visualize how it scores different contextual words for different entity types. Figure 4 shows the heatmaps obtained by fixing RNN, LSTM and GRU respectively with the M_LRC method (using dot product). Our intuition that “captain”, “city” and “agency” would be relevant for PER, LOC and ORG entities respectively is borne out in all of the cases. However, the neural models are unable to associate “agency” with ORG as distinctively as in the case of “captain” and “city”. This can be attributed to the frequent occurrence of the word “agency” in the context of words belonging to other entity types, which confuses the models.
Fixing an entity and a method: Now, we fix a particular entity to analyze which model gives higher importance to different contextual words for that entity. Figure 5 shows the heatmaps obtained by fixing three different entity types in turn with the M_LRC method. “protein”, “sequences” and “kinetics” have high frequency scores for these three entity types respectively. The models capture this well in all the cases.
At a sentence level, we only consider our best performing model, LSTM. Table 3 gives entity wise word relevance scores for two individual sentences. It uses a sentence from the CoNLL dataset - “Saturday ’s national congress of the ruling Czech (I-ORG) Civic (I-ORG) Democratic (I-ORG) Party (I-ORG) ( ODS (I-ORG) ) will discuss making the party more efficient and transparent , Foreign Minister and ODS (I-ORG) vice-chairman Josef (I-PER) Zieleniec (I-PER), said on Friday .”. The tags for all entity words are mentioned alongside each word. Notice that the high scores for “vice-chairman”, “ruling”, “congress” and “minister” match the intuitive understanding of these words. Interestingly, round brackets get the maximum scores for the M_SLL method, which may be attributed to their frequent use with entity words. Similarly, the sentence taken from the JNLPBA dataset is: “the number of glucocorticoid (B-protein) receptor (I-protein) sites in lymphocytes (B-cell_type) and plasma cortisol concentrations were measured in dgdg patients who had recovered from major depressive disorder and dgdg healthy control subjects .”. Again, the higher scores for “sites” and “plasma” for the relevant entity type are in agreement with the overall scores given to them.
4.3 Positional effects of context words
In this section, we analyze how the position of context words affects their scores obtained by the M_LRC method. We do this analysis for real sentences present in the test sets as well as for artificial sentences. We achieve this by applying the proposed techniques at an individual sentence level. For instance, Table 4 shows the relevance scores of the word “minister” for the PER entity obtained by the three models, in three test sentences taken from the CoNLL dataset. The M_WF method indicates that “minister” has high importance for the PER entity type, matching our intuition. However, “minister” is likely to appear in different sentences with different contexts and may not have equal relevance in each, as is also indicated in Table 4. In the first sentence, there is no entity word for PER; hence, the score for “minister” corresponding to the PER entity is zero. In the second sentence, the score is higher, though not too high, as the word is relatively far from the relevant entity word. However, the score is much higher in the third sentence, where “minister” is right before the entity words “Margaret Thatcher”. The relative scores obtained by using different neural models also match the general notion that RNN tends to forget long range context (second sentence) compared to LSTM and GRU, and is quite good for short distance context (third sentence).
We further validate the above observation on artificial examples. Figure 5(a) gives the position versus score plot for the word “chairman” with respect to the entity word “Josef”. The position indicates how far to the left “chairman” is from the entity word. We create sentences as follows - “chairman Josef .”, “chairman R Josef .”, “chairman R R Josef .” and so on. Here, R represents a random word. It can be observed that LSTM and GRU assign a higher score to far off words compared to RNN, which is consistent with their ability to include such words in making the final decision.
Figure 5(b) shows a similar plot for the word “cytokines” and the entity word “erythropoietin”, using the same way of creating artificial sentences. Interestingly, GRU assigns higher relevance scores than LSTM and RNN, which is in accordance with the high overall score it gives to “cytokines” compared to the other two models.
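The construction of the artificial sentences can be sketched as follows. The function name, the filler vocabulary and the maximum distance are assumptions chosen for illustration; each generated sentence would then be scored with M_LRC for the context word.

```python
import random


def artificial_sentences(context_word, entity_word, max_distance, vocabulary, rng):
    """Build sentences with 0, 1, ..., max_distance - 1 random words
    between the context word and the entity word."""
    sentences = []
    for n_fillers in range(max_distance):
        fillers = [rng.choice(vocabulary) for _ in range(n_fillers)]
        sentences.append([context_word] + fillers + [entity_word, "."])
    return sentences


rng = random.Random(0)
vocabulary = ["the", "on", "was", "table", "yesterday"]   # hypothetical filler words
sentences = artificial_sentences("chairman", "Josef", 4, vocabulary, rng)
for sentence in sentences:
    print(" ".join(sentence))
```

Plotting the score of the context word against the number of filler words then gives a position versus score curve per model.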
4.4 Error Analysis
The proposed methods can be effectively used to conduct error analysis on bi-directional recurrent neural network models. For a given sentence, a negative score for a particular word means that the model is able to make a better decision when the word is removed from the sentence. Relevance scores can therefore be used to find out which words confuse the model. Knowing what those words are is crucial to understanding why the model makes a mistake in a particular instance. For example, Table 5 shows the word importances for the sentence - “the degeneracy in sequences recognized by the otfs (B-Protein) may be important in widening the range over which gene expression can be modulated and in establishing cell type specificity .” The LSTM model makes a mistake here by tagging “otfs” with the tag B-DNA. The words “degeneracy”, “sequences”, “widening”, “recognized” and “modulated” all have a higher overall score for the DNA entity class than for Protein. Hence, the presence of these words in the sentence fools the model into making a wrong decision.
In general, we observe that the presence of words which have high scores for false entity types tends to confuse the model. The position of words also plays a vital role. Words which appear far off, or in a different position from where they generally appear in the training dataset, tend to receive negative or low scores even if they are important. For instance, “minister” mostly appears to the left of an entity word in the training dataset. If, in a test case, it appears to the right, it ends up receiving a low score.
5 Related Work
Various attempts have been made to understand neural models in the context of natural language processing. Research in this direction can be traced back to Elman [1989], which gains insight into connectionist models. This work uses principal component analysis (PCA) to visualize the hidden unit vectors in lower dimensions. Recurrent neural networks have been addressed in recent works such as Karpathy et al. [2015]. Instead of a sequence tagging task, they use character level language models as a testbed to study long range dependencies in LSTM networks.
Li et al. [2015] build methods to visualize recurrent neural networks in two settings: sentiment prediction in sentences using models trained on the Stanford Sentiment Treebank, and sequence-to-sequence models obtained by training an autoencoder on a subset of the WMT’14 corpus. In order to quantify a word’s salience, they approximate the output score as a linear combination of input features and then make use of first order derivatives. The erasure technique helps us to do away with such assumptions and find word importances in sequence labeling tasks for individual entities.
Similar to the present work, Kádár et al. [2016] analyze word saliency by defining an omission score from the deviations in sentence representations caused by removing words from the sentence. This work, however, targets a different, multi-task GRU framework, learning visual representations of images and a language model simultaneously.
Another closely related work is Li et al. [2016]. They use the erasure technique to understand the saliency of input dimensions in several sequence labeling and word ontological classification tasks. The same technique is used to find salient words in a sentiment prediction setting. Our work, which focuses on the sequence labeling task, differs from Li et al. [2016] in several ways. Firstly, in the case of sequence labeling, Li et al. [2016] only focus on feed forward neural networks, while our work trains three different recurrent neural networks on general and domain specific datasets. Secondly, their analysis of the sequence labeling task is limited to important input dimensions. Instead, our work focuses on finding salient words, which are the basic units for most NLP tasks. Lastly, our M_SLL method is an adaptation of their method for finding salient words in the sentiment prediction task. Unfortunately, this method is not very suitable for a sequence labeling task. Since it only considers sentence level log likelihood, it makes no distinction between various possible entities such as person or organization. Our M_LRC method, which takes individual word level effects into account, is more suitable.
A significant amount of work has been done in Computer Vision to interpret and visualize neural network models [Simonyan et al., 2013, Mahendran and Vedaldi, 2015, Nguyen et al., 2015, Szegedy et al., 2013, Girshick et al., 2014, Zeiler and Fergus, 2014, Erhan et al., 2009]. Attention can also be useful in explaining neural models [Bahdanau et al., 2014, Luong et al., 2015, Sukhbaatar et al., 2015, Rush et al., 2015, Xu and Saenko, 2016].
6 Conclusions and Future Work
In this paper, we propose techniques using word erasure to investigate bi-directional recurrent neural networks for their ability to capture relevant context words. We do a comprehensive analysis of these methods across various bi-directional models on the sequence tagging task in the generic and biomedical domains. We show how the proposed techniques can be used to understand various aspects of neural networks at the word and sentence levels. These methods also allow us to study positional effects of context words and visualize how models like LSTM and GRU are able to incorporate far off words into decision making. They also act as a tool for error analysis in general by detecting words which confuse the model. This work paves the way for further analysis of bi-directional recurrent neural networks, in turn helping to come up with better models in the future. We plan to take our analysis further by taking other aspects, such as character and word level embeddings, into account.
- [Bahdanau et al., 2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- [Chung et al., 2014] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
- [Collobert et al., 2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.
- [Elman, 1989] Jeffrey L Elman. 1989. Representation and structure in connectionist models. Technical report, DTIC Document.
- [Erhan et al., 2009] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2009. Visualizing higher-layer features of a deep network. University of Montreal, 1341:3.
- [Girshick et al., 2014] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587.
- [Graves, 2013] Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
- [Hong, 2005] Gumwon Hong. 2005. Relation extraction using support vector machine. In International Conference on Natural Language Processing, pages 366–377. Springer.
- [Huang et al., 2015] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
- [Kádár et al., 2016] Akos Kádár, Grzegorz Chrupała, and Afra Alishahi. 2016. Representation of linguistic form and function in recurrent neural networks. arXiv preprint arXiv:1602.08952.
- [Karpathy et al., 2015] Andrej Karpathy, Justin Johnson, and Li Fei-Fei. 2015. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078.
- [Kim et al., 2003] J-D Kim, Tomoko Ohta, Yuka Tateisi, and Jun'ichi Tsujii. 2003. GENIA corpus - a semantically annotated corpus for bio-textmining. Bioinformatics, 19(suppl 1):i180–i182.
- [Kim et al., 2004] Jin-Dong Kim, Tomoko Ohta, Yoshimasa Tsuruoka, Yuka Tateisi, and Nigel Collier. 2004. Introduction to the bio-entity recognition task at JNLPBA. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pages 70–75. Association for Computational Linguistics.
- [Lafferty et al., 2001] John Lafferty, Andrew McCallum, Fernando Pereira, et al. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML, volume 1, pages 282–289.
- [Li et al., 2015] Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. 2015. Visualizing and understanding neural models in NLP. arXiv preprint arXiv:1506.01066.
- [Li et al., 2016] Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220.
- [Luong et al., 2015] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
- [Mahendran and Vedaldi, 2015] Aravindh Mahendran and Andrea Vedaldi. 2015. Understanding deep image representations by inverting them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5188–5196.
- [Mikolov et al., 2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- [Mikolov et al., 2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
- [Nguyen et al., 2015] Anh Nguyen, Jason Yosinski, and Jeff Clune. 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436.
- [Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
- [Rush et al., 2015] Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
- [Simonyan et al., 2013] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
- [Sukhbaatar et al., 2015] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448.
- [Szegedy et al., 2013] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
- [Tjong Kim Sang and De Meulder, 2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Walter Daelemans and Miles Osborne, editors, Proceedings of CoNLL-2003, pages 142–147. Edmonton, Canada.
- [Xu and Saenko, 2016] Huijuan Xu and Kate Saenko. 2016. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pages 451–466. Springer.
- [Zeiler and Fergus, 2014] Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer.