Generating Textual Adversarial Examples for Deep Learning Models: A Survey

Wei Emma Zhang, Quan Z. Sheng, and Ahoud Abdulrahmn F Alhazmi

Wei Emma Zhang, Quan Z. Sheng and Ahoud Abdulrahmn F Alhazmi are with the Department of Computing, Macquarie University, Sydney, NSW 2109, Australia.
E-mail: {w.zhang,michael.sheng};

With the development of high-performance computing devices, deep neural networks (DNNs) have in recent years gained significant popularity in many Artificial Intelligence (AI) applications. However, previous efforts have shown that DNNs are vulnerable to strategically modified samples, named adversarial examples. These samples are generated with imperceptible perturbations, yet can fool DNNs into giving false predictions. Inspired by the popularity of generating adversarial examples for image DNNs, research efforts on attacking DNNs for textual applications have emerged in recent years. However, existing perturbation methods for images cannot be directly applied to texts, as text data is discrete. In this article, we review research works that address this difference and generate textual adversarial examples on DNNs. We collect, select, summarize, discuss and analyze these works in a comprehensive way and cover all the related information to make the article self-contained. Finally, drawing on the reviewed literature, we provide further discussions and suggestions on this topic.

Deep neural networks, adversarial examples, textual data, natural language processing

1 Introduction

Deep neural networks are large neural networks organized into layers of neurons, the individual computing units. Neurons are connected by links with different weights and biases, and each neuron transmits the result of its activation function on its inputs to other neurons. Deep neural networks try to mimic the biological neural networks of human brains to learn and build knowledge from examples. They have therefore shown strength in dealing with complicated tasks that cannot easily be modelled as linear or non-linear problems. Furthermore, they are good at handling data of various modalities, e.g., image, text, video and audio.

With the development of high-performance computing devices, deep neural networks have in recent years gained significant popularity in many Artificial Intelligence (AI) communities such as Computer Vision, Natural Language Processing, Web mining and Game theory. However, the interpretability of deep neural networks is still unsatisfactory as they work as black boxes, which means it is difficult to get intuitions about what exactly each neuron has learnt. One consequence of this poor interpretability is that evaluating the robustness of deep neural networks is difficult.

In recent years, research works [1, 2] used small imperceptible perturbations to evaluate the robustness of deep neural networks and found that they are not robust to these perturbations. Szegedy et al. [1] first evaluated state-of-the-art deep neural networks used for image classification with small generated perturbations on the input images. They found that the image classifiers were fooled with high probability, while human judgment was not affected. The perturbed images were named adversarial examples, and this term was later used to denote all kinds of perturbed samples in general. As the generation of adversarial examples in [1] is costly and impractical, Goodfellow et al. [2] proposed a fast generation method, which popularized this research topic (these works are discussed further in Section 3.2). Following their works, many research efforts have been proposed, and their purposes can be summarized as: i) evaluating deep neural networks by fooling them with imperceptible perturbations; ii) intentionally changing the output of deep neural networks; and iii) detecting the over-sensitivity and over-stability points of deep neural networks and finding solutions to defend against the attack.

Jia and Liang [3] were the first to consider adversarial example generation (or adversarial attack; we use these two terms interchangeably hereafter) on textual deep neural networks. Their work quickly gained research attention in the Natural Language Processing community. However, due to several differences between images and textual data, the adversarial attack methods for images cannot be directly applied to texts. First of all, image data (e.g., pixel values) is continuous, but textual data is discrete. Usually, we vectorize the texts before inputting them into deep neural networks. Traditional vectorizing methods include leveraging term frequency and inverse document frequency, and one-hot representation (details in Section 4.2). When gradient-based adversarial attacks adopted from images are applied to these representations, the generated adversarial examples are invalid character or word sequences [4]. One solution is to use word embeddings, which are continuous and dense representations of words, as the input of deep neural networks. However, this can also generate vectors that do not correspond to any word in the embedding space [5]. Secondly, the perturbations of images are small changes of pixel values that are hard for human eyes to perceive; humans can thus still classify the images correctly, which exposes the poor robustness of deep neural models. For adversarial attacks on texts, however, small perturbations are easily perceptible. For example, replacing characters or words can generate invalid words or syntactically incorrect sentences. Further, it can alter the semantics of the sentence drastically. The perturbations are therefore easy to perceive, and in such cases even human beings cannot provide correct predictions.

To address the aforementioned differences and challenges, many research works have been proposed since adversarial attacks first emerged in the textual domain in 2017. In this paper, we review the research works on generating adversarial examples on textual data that fool deep neural networks. The work is motivated by the rapidly increasing attention to this topic. We hope this work can provide a comprehensive review and help researchers understand the current research status of the topic. Although there exist some works surveying attacks on deep learning models [6, 7], this work is the first to address adversarial attacks on deep neural networks in the textual domain. We expect readers to have some basic knowledge of deep neural network architectures, which are not the focus of this article. Our focus is to introduce, discuss and analyze the attack methods proposed for textual deep neural networks, mainly covering the following aspects:

  • Black-box or White-box. A black-box attack is performed when the architectures, parameters, loss functions and activation functions of the deep neural networks are not known. The adversarial examples are generated either according to the input and output of the deep neural networks, or on the test dataset only. In contrast, a white-box attack is based on full knowledge of the deep neural networks and usually leverages the loss functions.

  • Untargeted or Targeted. An untargeted attack only aims to change the output of the model, while a targeted attack aims at generating specific outputs. For binary tasks, e.g., binary classification, the untargeted attack is equivalent to the targeted attack. In other cases, the targeted attack requires more careful design.

  • Granularity. The attack granularity on texts corresponds to the granularity of the attacked neural networks, in which the operations are performed on characters, words, sentences or word embeddings. Some works provide hybrid methods that combine several levels of attacks. For some specific applications in software engineering, the attacks are at the application level.

  • Adversarial training. Adversarial training uses the generated adversarial examples to make deep neural networks more robust. Most research works considered in this article provide adversarial training strategies. However, this is not the focus of this article; the focus is the methods for generating adversarial examples. Therefore, we do not cover papers that aim to provide more robust systems using adversarial examples generated by simply adopting existing works.

The papers reviewed in this article are high-quality papers selected from top NLP and AI conferences, including ACL (Annual Meeting of the Association for Computational Linguistics), COLING (International Conference on Computational Linguistics), NAACL (Annual Conference of the North American Chapter of the Association for Computational Linguistics), EMNLP (Empirical Methods in Natural Language Processing), ICLR (International Conference on Learning Representations), AAAI (AAAI Conference on Artificial Intelligence) and IJCAI (International Joint Conference on Artificial Intelligence). Besides papers accepted at these conferences, we also consider good papers in e-Print, as they reflect the latest research works. We selected papers from the archive using three metrics: paper quality, method novelty and citation count (optional). As the research topic emerged in 2017, we relax the citation requirement to more than five citations for papers published more than one year ago; a paper with fewer than five citations that is very recent and satisfies the other two metrics is also considered.

The remainder of this paper is organized as follows. We first review related survey works in Section 2. The basic technical background is presented in Section 3. In Section 4, we address the differences between attacking image data and textual data and briefly describe the deep neural networks attacked in the reviewed papers. We then summarize the works according to different strategies in Section 5. In Section 6, we collect the benchmark datasets used in the reviewed works and group them according to their applications. We discuss adversarial training in Section 7 and open issues in Section 8. Finally, the article is concluded in Section 9.

2 Related Works

In [8], the authors present a comprehensive review of different classes of attacks and defenses against machine learning systems. Specifically, they propose a taxonomy for identifying and analyzing these attacks, and apply the attacks to a machine-learning-based application, i.e., a statistical spam filter, to illustrate the effectiveness of the attack and defense. This work targets machine learning algorithms rather than neural models.

Inspired by [8], the work in [9] reviews defenses against adversarial attacks from the security point of view. The work is not limited to machine learning algorithms or neural models, but is a generic report on adversarial defenses in security-related applications. The authors find that existing security-related defense works lack clear motivations and explanations of how the attacks relate to real security problems and how the attack and defense are meaningfully evaluated. They therefore establish a taxonomy of motivations, constraints and abilities for more plausible adversaries, and provide a series of recommendations for future works.

The work in [10] provides a thorough overview of the evolution of adversarial attack research over the last ten years, focusing on research works from computer vision and cyber security. The paper covers works from pioneering non-deep learning algorithms to recent deep learning algorithms. It also takes the security point of view to provide a detailed analysis of the effect of the attacks and defenses. The authors of [11] review the same problem from a data-driven perspective. They analyze the attacks and defenses according to the learning phases, i.e., the training phase and the test phase.

Unlike previous works that discuss attack methods on machine learning algorithms in general, [6] focuses on adversarial examples for deep learning models. It reviews current research efforts on attacking various deep neural networks in different applications. The defense methods are also extensively surveyed. However, the authors mainly discuss adversarial examples for image classification and object recognition tasks.

The work in [7] provides a comprehensive review of adversarial attacks on deep learning models used in computer vision tasks. It is an application-driven survey that groups the attack methods according to the sub-tasks of the computer vision area. The article also comprehensively reports works on defending against the attacks, the methods of which are mainly grouped into three categories.

All the mentioned works either give a general overview of attacks and defenses or focus on specific domains such as computer vision and cyber security. Our work specifically focuses on attacks and defenses on deep learning models for textual applications.

3 Problem Definition and Backgrounds

In this section, we first formalize the definition of generating adversarial examples (Section 3.1), then we review the representative works in computer vision, which inspired the majority of the research efforts in NLP (Section 3.2).

3.1 Problem Definition

Deep neural networks are large neural networks organized into layers of neurons, the individual computing units. Neurons are connected by links with different weights and biases, and each neuron transmits the result of its activation function on its inputs to other neurons. A deep neural network can be simply presented as a function $F: X \rightarrow Y$, where $X$ is the input, i.e., the vectorized features; $Y$ is the output, which can be a discrete set of classes (for classification problems) or a sequence of objects (for sequence labeling problems); and $\theta$ represents the parameters of the NN model, which are learned automatically during training. For supervised learning tasks, the aim of the learning process is to find the parameters $\theta$ that minimize the gap between the NN's prediction $F(x)$ and the correct label $y$. Usually, the gap is measured by a loss function $L$ appropriate for the specific task.

Adversarial examples are data samples with small perturbations, applied in the test phase, that make the NN give incorrect predictions and thus dramatically reduce the overall model accuracy. For image adversarial examples, the changes are imperceptible to human eyes, so that humans can make the correct judgement while NN models are fooled. An adversarial example $x'$ of a sample $x$ with correct label $y$ can be formalized as:

$$ x' = x + \eta, \quad F(x') \neq y, \quad \|\eta\| \leq \epsilon \tag{1} $$

where $\eta$ is the perturbation added to $x$ and $\epsilon$ is the allowed perturbation. $\|\eta\| = \|x' - x\|$ is the distance between the original data and the perturbed sample. The distance should be small enough that the perturbation is hard to observe or detect.

To solve Equation 1, the task is formulated as an optimization problem. For an untargeted attack, the adversary is interested in any output that is different from the correct one. The problem is formulated as:

$$ x' = \arg\max_{x' :\, \|x' - x\| \leq \epsilon} L(F(x'), y) \tag{2} $$

Maximizing the loss function drives the prediction away from the correct label. For a targeted attack, the adversary has a target output $y' \neq y$ and the optimization is:

$$ x' = \arg\min_{x' :\, \|x' - x\| \leq \epsilon} L(F(x'), y') \tag{3} $$

Minimizing the loss function given the target output forces the model to give the target output. A targeted attack is thus harder than an untargeted attack, except when the problem is binary classification, for which the targeted and untargeted tasks are the same.
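
As a small illustration of the two objectives, the following sketch compares an untargeted and a targeted search on a toy 3-class linear softmax model. The weights `W`, the sample `x`, the budget `eps` and the brute-force search over the corners of the perturbation box are all assumptions made for this example; they are not taken from any of the reviewed works.

```python
import itertools
import math

# Hypothetical 3-class linear model; the weights are made up for illustration.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

def logits(x):
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

def loss(x, label):
    """Cross-entropy of the toy model's softmax output against `label`."""
    z = logits(x)
    return math.log(sum(math.exp(v) for v in z)) - z[label]

def predict(x):
    z = logits(x)
    return z.index(max(z))

def candidates(x, eps):
    """All x' = x + eta with every eta_i in {-eps, 0, +eps}."""
    for eta in itertools.product((-eps, 0.0, eps), repeat=len(x)):
        yield [x_i + e_i for x_i, e_i in zip(x, eta)]

def untargeted_attack(x, y, eps):
    # maximise the loss on the correct label y
    return max(candidates(x, eps), key=lambda xp: loss(xp, y))

def targeted_attack(x, y_target, eps):
    # minimise the loss on the attacker-chosen label y_target
    return min(candidates(x, eps), key=lambda xp: loss(xp, y_target))

x, y_true, y_target, eps = [0.6, 0.5], 0, 2, 0.2
x_untargeted = untargeted_attack(x, y_true, eps)
x_targeted = targeted_attack(x, y_target, eps)
```

With these assumed values, the untargeted search already changes the predicted class within the budget, whereas the targeted search only lowers the loss on the target class without reaching it, which is one way to see why the targeted setting is the harder one.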

3.2 Representative Works in Computer Vision

Due to the non-convexity and non-linearity of NNs, solving Eq. (2) and Eq. (3) exactly is not always possible [12]. Current methods therefore use approximations. We highlight some representative solutions that inspired many research efforts on generating textual adversarial examples. For a comprehensive review of attack works in computer vision, please refer to [7].

3.2.1 L-BFGS

Szegedy et al. [1] first considered the adversarial attack for deep neural networks on the image classification task. They proposed an optimization procedure to find adversarial examples, which are obtained by applying imperceptibly small perturbations to a correctly classified input image. They suggested that adversarial examples are rarely seen examples in the test datasets. Specifically, they generated adversarial examples using the following minimization and solved it using a box-constrained Limited memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm:

$$ \min_{\eta} \; c\,\|\eta\|_2 + L(F(x + \eta), y') \quad \text{s.t.} \quad x + \eta \in [0, 1]^n $$

where $\eta$ is the perturbation, $y'$ is the target label and $n$ is the input dimension. They perform a line search to find the minimum $c > 0$ for which the minimizer yields an adversarial example.

3.2.2 Fast Gradient Sign Method (FGSM)

As the computation of L-BFGS is time-consuming and sometimes impractical, Goodfellow et al. [2] proposed a fast solution called the Fast Gradient Sign Method, which generates adversarial examples by linearizing the neural model's loss function around the input. Specifically, to generate a perturbation, the method differentiates the loss function with respect to the input and takes the sign of this gradient as the direction of the input perturbation; this direction indicates the sensitivity of the neural model's label assignment to the input. As the gradient can be computed by steps similar to those used for back-propagation during training, the adversarial perturbations can be generated efficiently. The difference from the gradient in training is that here the gradient is computed with respect to the input, rather than the model parameters. This adversarial generation can be formulated as:

$$ x' = x + \epsilon \cdot \mathrm{sign}\left( \nabla_x L(F(x), y) \right) $$

where $L$ is the cost function associated with model $F$ and $\epsilon$ is a parameter controlling the perturbation's magnitude.
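
The update can be sketched in a few lines for a model whose input gradient has a closed form. The example below is a minimal sketch on a logistic-regression classifier rather than a deep network; the weights `w`, bias `b`, sample `x` and step size `eps` are assumed values chosen for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)

def input_gradient(w, b, x, y):
    """Gradient of the binary cross-entropy w.r.t. the input x: (p - y) * w."""
    p = predict_proba(w, b, x)
    return [(p - y) * w_i for w_i in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """One fast-gradient-sign step: x' = x + eps * sign(grad_x loss)."""
    grad = input_gradient(w, b, x, y)
    return [x_i + eps * sign(g_i) for x_i, g_i in zip(x, grad)]

# Assumed toy model and sample (label y = 1).
w, b = [2.0, -1.0, 0.5], 0.0
x, y, eps = [0.3, 0.2, 0.1], 1, 0.2
x_adv = fgsm(w, b, x, y, eps)
```

Every input component moves by exactly `eps`, so the perturbation is bounded in the max norm, and for this toy model a single step is enough to push the predicted probability of the correct class below 0.5.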

3.2.3 Jacobian Saliency Map Adversary (JSMA)

Unlike FGSM, which computes gradients using back-propagation, Papernot et al. [13] generated adversarial examples using forward derivatives (i.e., the model Jacobian). This method evaluates the neural model's output sensitivity to each input component using its Jacobian matrix and gives adversaries greater control over the perturbations. Jacobian matrices form the adversarial saliency maps that rank each input component's contribution to the adversarial target. A perturbation is then selected from the maps. The method is therefore named the Jacobian-based Saliency Map Attack. The Jacobian matrix of a given sample $x$ is given by:

$$ J_F(x) = \frac{\partial F(x)}{\partial x} = \left[ \frac{\partial F_j(x)}{\partial x_i} \right]_{ij} $$

where $x_i$ is the $i$-th component of the input and $F_j$ is the $j$-th component of the output. Here $F$ denotes the logits (i.e., the second-to-last) layer. In other words, JSMA does not use the output of the softmax layer but the output before the softmax layer. $J_F[i, j]$ measures the sensitivity of $F_j$ with respect to $x_i$.
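
To make the role of the Jacobian concrete, the following sketch estimates it by finite differences for a toy two-class logit function and derives a saliency score per input component for a chosen target class. The function `model_logits`, its weights and the sample `x` are hypothetical; for a real network the Jacobian would normally come from automatic differentiation. The scoring rule follows the commonly cited form for increasing features: a component scores zero if it lowers the target logit or raises the other logits, and otherwise scores the product of the two sensitivities.

```python
def model_logits(x):
    """Hypothetical 2-class logit function (pre-softmax); weights are illustrative."""
    W = [[1.0, -2.0, 0.5], [-1.0, 1.0, 2.0]]
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def jacobian(f, x, h=1e-5):
    """Central finite-difference estimate of J[j][i] = dF_j / dx_i."""
    n_out = len(f(x))
    J = [[0.0] * len(x) for _ in range(n_out)]
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        f_plus, f_minus = f(x_plus), f(x_minus)
        for j in range(n_out):
            J[j][i] = (f_plus[j] - f_minus[j]) / (2 * h)
    return J

def saliency_map(J, target):
    """Score each input component by how much increasing it favours `target`."""
    scores = []
    for i in range(len(J[0])):
        d_target = J[target][i]
        d_others = sum(J[j][i] for j in range(len(J)) if j != target)
        if d_target < 0 or d_others > 0:
            scores.append(0.0)
        else:
            scores.append(d_target * abs(d_others))
    return scores

x = [0.2, 0.4, 0.1]
scores = saliency_map(jacobian(model_logits, x), target=1)
best_component = scores.index(max(scores))
```

In this toy case only the second input component both raises the target logit and lowers the competing one, so it is the component the attack would perturb first.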

3.2.4 C&W Attack

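Carlini and Wagner [14] formulated the targeted adversarial attack as:

$$ \min_{\eta} \; \|\eta\|_p + c \cdot g(x + \eta) \quad \text{s.t.} \quad x + \eta \in [0, 1]^n $$

where $\eta$ is the perturbation, $g$ is a function such that $g(x') \leq 0$ if and only if $F(x') = y'$ for the target label $y'$, and $c$ is a constant. They proposed seven versions of $g$ and considered three norms (i.e., distances) $L_p$, where $p$ equals $0$, $2$ and $\infty$. The $L_2$ attack is described as:

$$ \min_{w} \; \left\| \tfrac{1}{2}\left(\tanh(w) + 1\right) - x \right\|_2^2 + c \cdot g\left( \tfrac{1}{2}\left(\tanh(w) + 1\right) \right) $$

with $g$ defined as

$$ g(x') = \max\left( \max_{i \neq y'} Z(x')_i - Z(x')_{y'}, \; -\kappa \right) $$

where $Z$ denotes the output before the softmax layer and $\kappa$ is a constant that controls the confidence. The $L_0$ attack was solved by an iterative algorithm, as the $L_0$ distance metric is non-differentiable. In each iteration, the $L_2$ attack is used to identify unimportant pixels, whose values are then fixed and never changed again. The remaining pixels go to the next iteration, until no adversarial example can be generated. As the $L_\infty$ distance metric is not fully differentiable, the authors conducted the $L_\infty$ attack using an iterative algorithm that solves the following optimization task in each iteration:

$$ \min_{\eta} \; c \cdot g(x + \eta) + \sum_i \left[ (\eta_i - \tau)^+ \right] $$

where $\tau$ is initialized as 1 and then decreases across iterations for as long as every $\eta_i$ stays below $\tau$.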
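
The two ingredients of the $L_2$ variant, the margin-style objective on the pre-softmax outputs and the tanh change of variable that keeps pixels in the valid range, can be written down directly. The sketch below is an assumption-laden illustration: the function names, the toy `logit_fn` passed in by the caller and the omission of the optimizer loop are choices made here, not details of [14].

```python
import math

def cw_g(logits, target, kappa=0.0):
    """Margin objective: max(max_{i != target} Z_i - Z_target, -kappa)."""
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return max(best_other - logits[target], -kappa)

def to_image(w):
    """Change of variable x' = (tanh(w) + 1) / 2, so every value lies in [0, 1]."""
    return [0.5 * (math.tanh(w_i) + 1.0) for w_i in w]

def cw_l2_objective(w, x, logit_fn, target, c, kappa=0.0):
    """Squared L2 distance to x plus c times the margin objective at x'."""
    x_adv = to_image(w)
    dist = sum((a - b) ** 2 for a, b in zip(x_adv, x))
    return dist + c * cw_g(logit_fn(x_adv), target, kappa)

# The objective is positive while another class still beats the target ...
g_not_yet = cw_g([2.0, 1.0, 0.5], target=1)
# ... and is clipped at -kappa once the target wins by a margin of kappa.
g_confident = cw_g([0.0, 3.0, 1.0], target=1, kappa=0.5)
box = to_image([-100.0, 0.0, 100.0])
```

Because `to_image` maps any real-valued `w` into the unit box, an unconstrained optimizer can be run on `w` without an explicit box constraint, which is the point of the reparameterization.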

3.2.5 DeepFool

DeepFool [15] is an iterative $L_2$-regularized algorithm. The authors first assumed that the neural network is linear, so that the classes can be separated by a hyperplane. Under this assumption, they simplified the problem, found the optimal solution and constructed adversarial examples from it. To address the non-linearity of real neural networks, they repeated the process until a true adversarial example was found.
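
For the linear case that DeepFool starts from, the smallest $L_2$ perturbation has a closed form: the sample is projected onto the separating hyperplane. The following is a minimal sketch of that single step for a binary linear classifier; the weights, the sample and the small `overshoot` factor are assumed values, and the iteration over successive linearizations used for non-linear networks is omitted.

```python
def decision(w, b, x):
    """Signed score of a linear binary classifier; the sign is the predicted class."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

def deepfool_linear(w, b, x, overshoot=0.02):
    """Move x just across the hyperplane w.x + b = 0 along the shortest path."""
    f_x = decision(w, b, x)
    norm_sq = sum(w_i * w_i for w_i in w)
    # Orthogonal projection onto the boundary: r = -f(x) / ||w||^2 * w
    r = [-f_x / norm_sq * w_i for w_i in w]
    # The overshoot pushes the point slightly past the boundary.
    return [x_i + (1.0 + overshoot) * r_i for x_i, r_i in zip(x, r)]

w, b = [3.0, 4.0], -5.0
x = [3.0, 4.0]
x_adv = deepfool_linear(w, b, x)
```

For a non-linear network the same step would be applied to a local linear approximation and repeated, as the paragraph above describes.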

3.2.6 Substitute Attack

The above-mentioned representative works are all white-box methods, which require full knowledge of the neural model's parameters and structure. In practice, however, it is not always possible for attackers to craft adversarial examples in a white-box manner due to limited access to the model. This limitation was addressed by Papernot et al. [16], who introduced a black-box attack strategy: they trained a substitute model to approximate the decision boundaries of the target model, using labels obtained by querying the target model. They then conducted a white-box attack on this substitute and used the adversarial examples generated on the substitute against the target model.
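
The overall workflow (query the target, fit a substitute, attack the substitute, transfer) can be illustrated on a toy problem. In the sketch below, the `oracle` stands in for the inaccessible target model, and a nearest-centroid linear classifier is swapped in as the substitute purely to keep the example short; it is not the neural substitute or the training procedure of [16]. All names and values are assumptions.

```python
def oracle(x):
    """Black-box target model: the attacker may call it but not inspect it."""
    return 1 if x[0] > x[1] else 0

def train_substitute(points, labels):
    """Fit a nearest-centroid linear classifier w.x + b on oracle-labelled points."""
    def centroid(cls):
        members = [p for p, l in zip(points, labels) if l == cls]
        return [sum(c) / len(members) for c in zip(*members)]
    mu_pos, mu_neg = centroid(1), centroid(0)
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    mid = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    b = -sum(w_i * m_i for w_i, m_i in zip(w, mid))
    return w, b

def substitute_predict(w, b, x):
    return 1 if sum(w_i * x_i for w_i, x_i in zip(w, x)) + b > 0 else 0

def attack_substitute(w, x, label, eps):
    """White-box sign step on the substitute, away from the current label."""
    direction = -1 if label == 1 else 1
    sign = lambda v: (v > 0) - (v < 0)
    return [x_i + direction * eps * sign(w_i) for x_i, w_i in zip(x, w)]

# 1) query the oracle on a grid, 2) fit the substitute,
# 3) attack the substitute, 4) test the result on the oracle.
grid = [[i / 4, j / 4] for i in range(5) for j in range(5)]
w, b = train_substitute(grid, [oracle(p) for p in grid])
x = [0.6, 0.5]
x_adv = attack_substitute(w, x, oracle(x), eps=0.1)
```

The attacker only ever uses the oracle's labels, yet the example crafted on the substitute also flips the oracle's decision, which is the transferability property the strategy relies on.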

3.2.7 GAN-like Attack

Another branch of black-box attacks leverages Generative Adversarial Network (GAN) models. Zhao et al. [4] first trained a generative model, WGAN, on the training dataset $X$. WGAN can generate data points that follow the same distribution as $X$. They then separately trained an inverter to map a data sample $x$ to $z$ in the latent dense space by minimizing the reconstruction error. Instead of perturbing $x$, they searched for adversaries $z^*$ in the neighbourhood of $z$ in the latent space. They then mapped $z^*$ back to $x^*$ and checked whether $x^*$ would change the prediction. They introduced two search algorithms: iterative stochastic search and hybrid shrinking search. The former uses an expanding strategy that gradually enlarges the search space, while the latter uses a shrinking strategy that starts from a wide range and recursively tightens the upper bound of the search range.

4 From Image to Text

Due to the popularity of attacking neural networks in the computer vision community, many research efforts adopt the idea to evaluate the robustness of neural models in textual applications. In this section, we summarize the differences between attacking neural models for computer vision tasks and for NLP tasks (Section 4.1). We also introduce the methods for mapping textual values to numerical values (vectorizing) that are used for neural models (Section 4.2). Finally, we briefly discuss the attacked neural models for textual tasks (Section 4.3).

4.1 Difference between attacking NN for text or image

To attack a textual DNN model, we cannot directly apply the approaches from image DNN attackers, as there are three differences between them:

  • Discrete vs Continuous Inputs. Image inputs are continuous, and the methods typically use an $L_p$ norm to measure the distance between a clean data point and the perturbed data point. Textual data, however, is discrete and is only valid at certain values, so the distance measurements used in computer vision cannot be directly applied to textual tasks. Carefully designed variants or distance measurements for textual perturbations are required. Another choice is to first map the textual data to continuous data and then adopt the attack methods from computer vision.

  • Perceivable vs Unperceivable. Small changes to image pixels usually cannot be easily perceived by human beings, so the adversarial examples do not change the human judgment but only fool the DNN models. Small changes to texts, e.g., character or word changes, are easily perceived, which can make the attack fail. For example, the changes could be identified or corrected by spelling and grammar checks before the text is input into textual DNN models. It is therefore nontrivial to find unperceivable textual adversaries.

  • Semantic vs Semantic-less. In the case of images, small changes usually do not change the semantics of the image, as they are trivial and unperceivable. However, perturbations on texts can easily change the semantics of a word or a sentence, which would also change the task output. For example, deleting a negation word changes the sentiment of a sentence. This is not the case in computer vision, where perturbing individual pixels does not turn the image of a cat into another animal. The purpose of a perturbation, however, is to leave the correct prediction (usually made by a human) unchanged while fooling the DNN model into an incorrect prediction.

Due to these differences, current state-of-the-art textual DNN attackers either carefully adjust the methods from image DNN attackers by enforcing additional constraints, or propose novel methods using different techniques.
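
The first difference can be seen on a toy vocabulary: a small continuous perturbation of a word vector lands on a point that is not a word at all, so it has to be mapped back to a valid token before it can be used as text. In the sketch below, the two-dimensional embeddings, the perturbation and the Euclidean nearest-neighbour projection are all assumptions chosen for illustration.

```python
# Hypothetical 2-d word embeddings; the vectors are made up for illustration.
EMBEDDINGS = {
    "good": [1.0, 0.0],
    "great": [0.9, 0.2],
    "bad": [-1.0, 0.0],
}

def perturb(vector, eta):
    """Image-style continuous perturbation applied to a word vector."""
    return [v + e for v, e in zip(vector, eta)]

def nearest_word(vector):
    """Project a continuous vector back onto the discrete vocabulary."""
    def sq_dist(word):
        return sum((a - b) ** 2 for a, b in zip(EMBEDDINGS[word], vector))
    return min(EMBEDDINGS, key=sq_dist)

perturbed = perturb(EMBEDDINGS["good"], [-0.05, 0.15])
is_valid_word = perturbed in EMBEDDINGS.values()  # no word has this exact vector
projected = nearest_word(perturbed)               # closest real word
```

The projection step is where the other two differences come in: the replacement word is visible to a reader, and nothing in the distance computation guarantees that it preserves the meaning of the sentence.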

4.2 Vectorizing Textual Inputs

DNN models require vectors as input. For image tasks, the usual way is to use the pixel values to form the vectors/matrices used as DNN input. For textual models, however, special operations are needed to transform the text into vectors. There are three main branches of methods: word-count based encoding, one-hot encoding and dense encoding (or feature embedding); the latter two are the most commonly used in DNN models.

4.2.1 Word-Count Based Encoding

The bag-of-words (BOW) method has the longest history in vectorizing text. In the BOW model, there is a vocabulary containing all words that appear in a corpus. Given a sentence, a zero vector whose length equals the vocabulary size is first initialized. Then, for each word in the vocabulary, the method checks whether it exists in the given sentence; if so, the corresponding value in the vector is set to 1, and otherwise it is left as zero. When a word appears in the sentence multiple times, its corresponding value in the feature vector is set to its number of occurrences in the sentence.
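
A minimal sketch of this count-based encoding follows. The whitespace tokenization, lower-casing and the two-sentence corpus are simplifying assumptions made for the example.

```python
from collections import Counter

def build_vocabulary(corpus):
    """Sorted list of all distinct (lower-cased, whitespace-split) words."""
    return sorted({tok for doc in corpus for tok in doc.lower().split()})

def bow_vector(sentence, vocab):
    """Count of each vocabulary word in the sentence, in vocabulary order."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]

corpus = ["the movie is good", "the plot is bad"]
vocab = build_vocabulary(corpus)
vector = bow_vector("the movie the plot", vocab)
```

Note that the vector has one entry per vocabulary word and carries no information about word order.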

Another word-count based encoding utilizes the term frequency-inverse document frequency (TF-IDF) of a word (term). TF-IDF is a statistical measure used to evaluate how important a word is to a document in a corpus. TF measures how frequently a term occurs in a document. Since documents have different lengths, we usually use the normalized TF, which is calculated as (number of occurrences of the term in the document) / (total number of terms in the document). IDF measures the importance of a term. It weights down the frequent terms while scaling up the rare ones. This is because frequently appearing terms tend to be unimportant; for example, "is" and "the" carry little meaning. IDF is computed from the ratio (total number of documents) / (number of documents with term t in it), commonly after taking the logarithm of this ratio.

Similar to the BOW method, vectorizing textual data using TF-IDF requires the vector to have the same length as the vocabulary of the corpus. Given a sentence, the method checks whether each word in the vocabulary exists in the sentence. If it does, the corresponding value is set to the TF-IDF weight of this word; otherwise, it is set to 0.
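
The following sketch computes such a vector with the normalized TF and a logarithmic IDF. The three toy documents and the plain `log(N / df)` variant without smoothing are assumptions; libraries often use smoothed variants of the IDF term.

```python
import math

def tf(term, doc_tokens):
    """Normalized term frequency: occurrences divided by document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, docs_tokens):
    """Log of (number of documents / number of documents containing the term)."""
    df = sum(1 for d in docs_tokens if term in d)
    return math.log(len(docs_tokens) / df) if df else 0.0

def tfidf_vector(doc_tokens, docs_tokens, vocab):
    return [tf(t, doc_tokens) * idf(t, docs_tokens) for t in vocab]

docs = [
    ["the", "movie", "is", "good"],
    ["the", "plot", "is", "bad"],
    ["the", "cast", "is", "good"],
]
vocab = sorted({t for d in docs for t in d})
vec = tfidf_vector(docs[0], docs, vocab)
```

In the first document, "the" occurs in every document and therefore receives weight zero, while "movie", which is unique to that document, receives the largest weight.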

4.2.2 One-hot Encoding

In one-hot encoding, a feature vector represents a token, which could be a character (character-level model) or a word (word-level model).

For character-level one-hot encoding, the representation can be formulated as [17]:

$$ x = \left[ (x_{11}, \dots, x_{1n}); \dots; (x_{m1}, \dots, x_{mn}) \right] $$

where $x$ is a text of $L$ characters, $x_{ij} \in \{0, 1\}^{|A|}$ and $A$ is the alphabet (in some works, $A$ also includes symbols). $m$ is the number of words and $n$ is the maximum number of characters for a word. Each word thus has a vector representation of the same fixed length, which is determined by the maximum number of characters for a word.
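
A minimal sketch of the character-level representation is given below. The lower-case Latin alphabet, whitespace word splitting and zero vectors for padding and out-of-alphabet characters are assumptions made for this example.

```python
import string

ALPHABET = string.ascii_lowercase  # assumed alphabet: 26 lower-case letters

def one_hot(index, size):
    """Vector of zeros with a single 1 at `index`; all zeros if index is -1."""
    return [1 if k == index else 0 for k in range(size)]

def encode_chars(text, max_chars):
    """Return a nested list of shape (words, max_chars, alphabet size)."""
    encoded = []
    for word in text.lower().split():
        row = []
        for pos in range(max_chars):
            # str.find returns -1 for unknown characters; padding also uses -1
            idx = ALPHABET.find(word[pos]) if pos < len(word) else -1
            row.append(one_hot(idx, len(ALPHABET)))
        encoded.append(row)
    return encoded

x = encode_chars("bad cab", max_chars=4)
```

Each of the two three-letter words occupies four character slots, the last of which is an all-zero padding vector.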

For word-level one-hot encoding, following the above notations, the text $x$ can be represented as:

$$ x = (x_1, \dots, x_m, x_{m+1}, \dots, x_k) $$

where $x_i \in \{0, 1\}^{|V|}$ and $V$ is the vocabulary, which contains all words in a corpus. $k$ is the maximum number of words allowed for a text, so that $(x_{m+1}, \dots, x_k)$ is zero-padding if $m < k$.
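
The word-level variant, including the zero-padding up to a maximum text length, can be sketched as follows. The five-word vocabulary, whitespace tokenization and the treatment of out-of-vocabulary words as zero vectors are assumptions.

```python
def encode_words(text, vocab, max_words):
    """One one-hot row per word position, zero-padded to max_words rows."""
    index = {word: i for i, word in enumerate(vocab)}
    tokens = text.lower().split()[:max_words]
    rows = []
    for k in range(max_words):
        row = [0] * len(vocab)
        if k < len(tokens) and tokens[k] in index:
            row[index[tokens[k]]] = 1
        rows.append(row)
    return rows

vocab = ["bad", "good", "is", "movie", "the"]
x = encode_words("the movie is good", vocab, max_words=6)
```

The four words fill the first four rows and the remaining two rows are the zero-padding.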

One-hot encoding produces vectors with only 0 and 1 values, where 1 indicates that the corresponding character/word appears in the sentence/paragraph, while 0 indicates that it does not appear. One-hot encoding thus usually generates very sparse vectors/matrices. DNNs have proven to be very successful in learning from such sparse representations, as they can learn denser distributed representations from the one-hot vectors during the training procedure.

4.2.3 Word Embeddings

Compared to one-hot encoding, word embeddings generate low-dimensional and distributed (dense) representations for textual data. Word embeddings are based on the distributional assumption that words appearing within similar contexts possess similar meanings. Mikolov et al. [18] proposed the Word2Vec method, which uses continuous bag-of-words (CBOW) and skip-gram models to generate word embeddings and made distributed representations more popular. CBOW uses a shallow neural network to compute the conditional probability of a target word given the context words in a given window size. Skip-gram also uses a shallow neural network, but it predicts the conditional probability of the surrounding context words given a central target word [19]. Word2Vec tends to embed both syntactic and semantic information and is very effective for compositionality. For example, $vec(\text{king}) - vec(\text{man}) + vec(\text{woman}) \approx vec(\text{queen})$. Here $vec(\cdot)$ denotes the embedding of a word; the arithmetic is performed on the embedding vectors and closeness is measured by the cosine similarity of the vectors. Word embedding, to some extent, alleviates the discreteness and data-sparsity problems of vectorizing textual data [20].
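
The compositionality property can be illustrated with hand-made vectors. In the sketch below the two-dimensional embeddings are hypothetical (one axis loosely standing for royalty, the other for gender) and are not trained Word2Vec vectors; only the vector arithmetic and the cosine-similarity lookup are the point.

```python
import math

# Hypothetical embeddings, hand-crafted for illustration only.
EMB = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Word closest (by cosine similarity) to vec(a) - vec(b) + vec(c)."""
    query = [x - y + z for x, y, z in zip(EMB[a], EMB[b], EMB[c])]
    candidates = [w for w in EMB if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(EMB[w], query))

answer = analogy("king", "man", "woman")
```

As is common practice in such lookups, the three query words themselves are excluded from the candidate set.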

4.3 Attacked Neural Networks in Textual Applications

Neural networks have been gaining increasing popularity in the NLP community in recent years, and various DNN models have been adopted in different NLP tasks. Apart from feed-forward neural networks and Convolutional Neural Networks (CNN), Recurrent/Recursive Neural Networks (RNN) and their variants are the most common neural networks used in NLP because of their natural ability to handle sequences (all texts can be regarded as word sequences). In recent years, two important breakthroughs in deep learning have been brought into NLP: sequence-to-sequence learning [21] and attention modeling [22]. Reinforcement learning and generative models have also gained much popularity. In this section, we briefly overview the representative deep neural networks applied in NLP. We focus on those that have received research efforts on generating adversarial examples.

4.3.1 Feed Forward Networks

The feed-forward network, in particular the multi-layer perceptron (MLP), is the simplest neural network. It has several forward layers, and each node in a layer connects to each node in the following layer, making the network fully connected. MLP utilizes nonlinear activation functions to distinguish data that is not linearly separable. MLP works with fixed-sized inputs and does not record the order of the elements. It is thus mostly used in tasks that can be formed as supervised learning problems. In NLP, it can be used in applications such as text classification, speech recognition and machine translation. The major drawback of feed-forward networks in NLP is that they cannot handle well text sequences in which the word order matters.
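
A classic way to see why the nonlinear activation matters is the XOR function, which no single linear layer can represent. The sketch below is a hand-weighted two-layer perceptron with a ReLU hidden layer; the weights are chosen by hand for illustration rather than learned.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def dense(W, b, x):
    """Fully connected layer: every output is connected to every input."""
    return [sum(w * v for w, v in zip(row, x)) + b_j for row, b_j in zip(W, b)]

def mlp(x):
    """Two-layer perceptron with hand-set weights that computes XOR."""
    hidden = relu(dense([[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0], x))
    return dense([[1.0, -2.0]], [0.0], hidden)[0]

inputs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
outputs = [mlp(x) for x in inputs]
```

Without the ReLU the two layers would collapse into a single linear map and the four points could not be separated.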

As the feed-forward network is easy to implement, there are various implementations and no standard or general benchmark architecture worth examining. To evaluate the robustness of feed-forward networks in NLP, researchers often work on specific architectures in real applications. For example, the authors of [23, 24, 25] worked on specific malware detection models and [26] worked on question answering.

4.3.2 Convolutional Neural Network (CNN)

A Convolutional Neural Network contains convolutional layers, pooling (down-sampling) layers and a final fully-connected layer. Activation functions are used to connect the down-sampled layer to the next convolutional layer or fully-connected layer. CNN allows arbitrarily-sized inputs. The convolutional layer uses the convolution operation to extract meaningful local patterns from the input. The pooling layer reduces the number of parameters and allows the network to be deeper and less prone to overfitting. Overall, CNN identifies local predictors and combines them to generate a fixed-sized vector for the input, which contains the most informative aspects for the application task. In addition, it is order-sensitive. It therefore excels in computer vision tasks and was later adopted in NLP applications.

Yoon Kim [27] adopted CNN for sentence classification. He used Word2Vec to represent words as input. The convolution operation is limited to the direction of the word sequence, rather than the dimensions of the word embeddings. Multiple filters followed by pooling layers deal with the variable length of sentences. The model demonstrated excellent performance on several benchmark datasets against multiple state-of-the-art works. This work became a benchmark for adopting CNN in NLP applications. Zhang et al. [28] presented a CNN for text classification at the character level. They used a one-hot representation over the alphabet for each character. To control the generalization error of the proposed CNN, they additionally performed data augmentation by replacing words and phrases with their synonyms. Evaluations on eight datasets showed that the character-level CNN works better than the word-level CNN in [27] for less curated user-generated texts. These two representative textual CNNs have been evaluated on adversarial examples in many applications [29, 30, 17, 31, 32, 33].
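
The following sketch shows how a convolution along the word axis combined with pooling over all positions turns sentences of different lengths into a vector of fixed size. The two-dimensional word vectors, the two filters and the use of max pooling over positions are assumptions made for illustration and do not reproduce the configuration of [27].

```python
def conv_over_words(embeddings, kernel, bias=0.0):
    """Slide a (window x dim) filter along the word axis; one ReLU feature per window."""
    window = len(kernel)
    features = []
    for start in range(len(embeddings) - window + 1):
        s = bias
        for k in range(window):
            s += sum(w * e for w, e in zip(kernel[k], embeddings[start + k]))
        features.append(max(0.0, s))
    return features

def sentence_vector(embeddings, kernels):
    """Max pooling over positions: one value per filter, whatever the sentence length."""
    return [max(conv_over_words(embeddings, k)) for k in kernels]

# Two hypothetical filters spanning two words each, over 2-d word vectors.
kernels = [[[1.0, 0.0], [0.0, 1.0]], [[-1.0, 0.0], [0.0, -1.0]]]
short_sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
long_sentence = short_sentence + [[0.5, 0.5], [0.0, 1.0]]
v_short = sentence_vector(short_sentence, kernels)
v_long = sentence_vector(long_sentence, kernels)
```

Each filter covers the full embedding dimension and moves only along the word sequence, so the number of windows depends on the sentence length while the pooled output does not.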

4.3.3 Recurrent Neural Networks/ Recursive Neural Networks

Recurrent Neural Networks are neural models adapted from feed-forward neural networks for learning mappings between sequential inputs and outputs [34]. RNNs allow data of arbitrary length and introduce cycles in their computational graph to efficiently model the influence of time [35]. By design, the model does not suffer from the statistical estimation problems stemming from data sparsity, which leads to impressive performance on sequential data [20]. Recursive neural networks [36] extend recurrent neural networks from sequences to trees, which respects the hierarchy of the language. In some situations backward dependencies exist, which call for backward analysis. The bi-directional RNN was thus proposed to look at sentences in both directions, forwards and backwards, using two RNN cells and combining their outputs. Bengio et al. [37] were among the first to apply RNNs in NLP. Specifically, they utilized an RNN as a language model, where the probability of a sequence of words is computed by the RNN. The input to the RNN is the feature vectors of all the preceding words, and the output is the conditional probability distribution over the output vocabulary. The RNN designed in this work is a general one that can also be applied to other NLP tasks, and it has thus been examined with adversarial examples [12].

RNN has many variants, among which the Long Short-Term Memory (LSTM) network [38] is the most widely used. LSTM is a specific RNN designed to capture long-term dependencies. In LSTM, the hidden state is computed through a combination of three “gates” (input, forget and output), which control the information flow using the logistic function. LSTM networks have subsequently proved to be more effective than conventional RNNs [39]. The Gated Recurrent Unit (GRU) is a simplified version of LSTM that consists of only two gates, and it is thus more efficient in terms of training and prediction.

[30, 40, 41] attacked self-implemented vanilla LSTMs. [12, 32, 42] attacked the LSTMs proposed in [38, 38, 43], respectively. [3] attacked Match-LSTM [44]. [4] attacked TreeLSTM [45] and the Google Translation system [46]. [47] attacked the conditional Bidirectional LSTM (cBiLSTM) [48] and the Enhanced LSTM model (ESIM) [45].

4.3.4 Sequence-to-Sequence Learning (Seq2Seq) Models

Sequence-to-sequence learning (Seq2Seq) [21] is one of the technological breakthroughs in deep learning and is now widely used in NLP applications. A Seq2Seq model uses recurrent nets to carry out both encoding and decoding in an end-to-end manner [49]. Usually, a Seq2Seq model consists of two recurrent neural networks: an encoder that processes the input and compresses it into a vector representation, and a decoder that predicts the output. [50] attacked the Latent Variable Hierarchical Recurrent Encoder-Decoder (VHRED) model [51] via the attention mechanism [52]. [17, 53, SinghGR18] all worked on attacking seq2seq (OpenNMT) [54]. [50] attacked DynoNet [55], which contains a Seq2Seq-based utterance generator.

4.3.5 Attention Models

The attention mechanism [52] is another breakthrough in deep learning. It was initially developed to overcome the difficulty of encoding a long sequence, as required in Seq2Seq models [49]. Attention allows the decoder to look back at the hidden states of the source sequence. The hidden states then provide a weighted average as additional input to the decoder. In this way, the mechanism pays “attention” to the informative parts of the sequence. Rather than looking at the input sequence as in vanilla attention models, self-attention [56] looks at the surrounding words in a sequence to obtain more contextually sensitive word representations [19]. [3, 57] attacked BiDAF [58], which is a bidirectional attention flow mechanism for machine comprehension. [33] attacked an attention-based Bi-RNN [59]. [47] attacked the Decomposable Attention Model (DAM) [60].
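The weighted average over the source hidden states can be made concrete with a small sketch. It assumes dot-product scoring between the decoder state and each encoder state, which is only one of several scoring functions used in practice; the function names and toy vectors are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state: List[float],
           encoder_states: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Return attention weights and the context vector (weighted average of states)."""
    # Assumption: alignment score = dot product of decoder state and encoder state.
    scores = [sum(d * h for d, h in zip(decoder_state, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one hidden state per source word
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
print(weights)   # the first and third source positions receive the most weight
print(context)   # extra input passed to the decoder
```

The weights sum to one, so the context vector always lies within the range spanned by the source hidden states.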

4.3.6 Reinforcement Learning Models

Reinforcement learning trains an agent by giving a reward after the agent performs discrete actions. In NLP, a reinforcement learning framework usually consists of an agent (an RNN-based model), a policy (guiding actions) and a reward. The agent picks an action (e.g., predicting the next word in a sequence) based on a policy, then updates its internal state accordingly, until it arrives at the end of the sequence, where a reward is calculated. Reinforcement learning requires proper handling of the actions and the states, which may limit the expressive power and learning capacity of the models [19]. Not many works have attacked reinforcement learning models in NLP. We found only [50], which attacked the deep reinforcement learning model in [61].

4.3.7 Deep Generative Models

In recent years, two powerful deep generative models, Generative Adversarial Networks (GANs) [62] and Variational Auto-Encoders (VAEs) [63], have been proposed and have gained much research attention. Generative models are able to generate realistic data from data in a latent space. In NLP, they are used to generate texts. GANs [62] consist of two adversarial networks: a generator and a discriminator. The discriminator discriminates between real and generated samples, while the generator generates realistic samples that aim to fool the discriminator. A GAN uses a min-max loss function to train the two neural network models simultaneously. VAEs consist of encoder and generator networks. The encoder encodes an input into a latent space and the generator generates samples from the latent space. Deep generative models are not easy to train and evaluate, and thus there are no standard solutions in NLP [19]. That is probably the reason why there is no work so far that attacks deep generative models in NLP.
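For reference, the min-max loss of GANs is commonly written in the following form, which is the standard formulation from [62]. The symbols $G$ (generator), $D$ (discriminator), $p_{\text{data}}$ (data distribution) and $p_z$ (prior over the latent space) are notation introduced here for illustration:

$$\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]$$

The discriminator maximizes the objective by assigning high probability to real samples and low probability to generated ones, while the generator minimizes it by producing samples that the discriminator scores as real.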

5 Attacking Neural Models in NLP

Adversarial attacks can be grouped by the degree of knowledge they have of the attacked deep neural network, namely into white-box attacks and black-box attacks. A white-box attack requires access to the model’s full information, including architecture, parameters, loss functions, activation functions, and input and output data. White-box attacks typically approximate the worst-case attack for a particular model and input, incorporating a set of perturbations. A black-box attack does not require the details of the neural network, but can access its input and output. Black-box attacks often rely on heuristics to generate adversarial examples. Figure 1 summarizes the main methods of the black-box and white-box attacks.

Fig. 1: Classification of Textual Adversary Methods via Model Access

Besides the discussion of white-box (Section 5.1) and black-box (Section 5.2) attacks on neural networks for NLP tasks, we further describe some representative works that attack multi-modal applications (Section 5.3), e.g., image-to-text models.

5.1 White-Box Attack

White-box adversarial attack methods are usually optimization-based, as they can access the parameters, loss functions and structures of the neural networks. Most of the white-box attacks adapt methods from attacks in computer vision.

5.1.1 FGSM-based

FGSM is one of the first attack methods on images (Section 3.2.2). Many textual attack works adapted this method. The work [29] adopted the idea of FGSM, but considered the magnitude of the loss gradient to identify the most impactful characters, named hot characters, and used the hot characters to find hot phrases. The authors then proposed three strategies to manipulate the text with regard to the hot phrases: insertion, modification and removal. However, as mentioned by the authors, these three strategies are performed manually.
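The two ingredients mentioned above, a signed-gradient step and a ranking of input positions by gradient magnitude, can be sketched on a toy model. The sketch assumes a logistic-regression classifier over a continuous feature vector with made-up weights; it illustrates the general idea only and is not the implementation used in [29]. All function names are hypothetical.

```python
import math
from typing import List


def predict_proba(x: List[float], weights: List[float], bias: float) -> float:
    """Probability of the positive class under logistic regression."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(x: List[float], label: int, weights: List[float], bias: float) -> List[float]:
    """Gradient of the binary cross-entropy loss with respect to the input x."""
    error = predict_proba(x, weights, bias) - label
    return [error * w for w in weights]


def fgsm(x: List[float], label: int, weights: List[float], bias: float,
         epsilon: float) -> List[float]:
    """One FGSM step: move every input dimension by epsilon along the gradient sign."""
    grad = input_gradient(x, label, weights, bias)
    return [xi + epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]


def hot_positions(x: List[float], label: int, weights: List[float], bias: float,
                  top_k: int) -> List[int]:
    """Rank input positions by the magnitude of the loss gradient."""
    grad = input_gradient(x, label, weights, bias)
    ranked = sorted(range(len(x)), key=lambda i: abs(grad[i]), reverse=True)
    return ranked[:top_k]


weights, bias = [2.0, -1.0, 0.5], 0.0      # assumed toy model
x, label = [1.0, 1.0, 1.0], 1              # input classified as positive
adversarial = fgsm(x, label, weights, bias, epsilon=0.5)

print(predict_proba(x, weights, bias))            # above 0.5
print(predict_proba(adversarial, weights, bias))  # below 0.5 after the perturbation
print(hot_positions(x, label, weights, bias, top_k=2))
```

For text, such a continuous step cannot be applied to the characters directly, which is why the surveyed works use the gradient to decide where to edit and then apply discrete insertions, modifications or removals.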

The works [64] and [65] leveraged FGSM to identify the words that contribute most to a label. The authors then manipulated the original text by removing, adding and replacing words accordingly to change the label. For added and replaced words, the method builds a candidate pool in which synonyms, typos and genre-specific keywords (identified via term frequency) are the candidate words. As the method orders the words by their contribution ranking and crafts adversarial samples according to this order, it is a greedy method that applies the minimum manipulation needed until the label changes. To avoid detection by human readers, the authors constrained the replaced or added words to not disturb the grammar and part-of-speech (POS) of the original text.

The work [5] performed attacks on word embeddings by leveraging FGSM and DeepFool. For the generated adversarial examples, the authors then found the valid nearest neighbour using Word Mover’s Distance (WMD), a distance measure between texts computed over word embeddings.

The purpose of [66] is adversarial training, which makes the relation extraction DNN model more robust. The authors adopted the FGSM algorithm on the pre-trained word embeddings to generate adversarial examples.

The work [25] represents an executable by a binary vector $x \in \{0, 1\}^m$, where $m$ is the number of features and each entry uses 1 or 0 to indicate whether the feature is present or not. The authors investigated four methods to generate binary-encoded adversarial examples. The first two methods adopt FGSM, but are restricted to the binary domain by introducing deterministic rounding (dFGSM) and randomized rounding (rFGSM). The third method, multi-step Bit Gradient Ascent (BGA), sets the bit of the $j$-th feature if the corresponding partial derivative of the loss is greater than or equal to the loss gradient’s $\ell_2$-norm divided by $\sqrt{m}$. The fourth method, multi-step Bit Coordinate Ascent (BCA), updates one bit in each step by considering the feature with the maximum corresponding partial derivative of the loss.
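A single step of each binary-domain rule can be sketched as follows. The sketch takes the loss gradient as a given list of numbers, assumes the BGA threshold is the gradient's L2 norm divided by the square root of the number of features, and assumes that bits are only ever set (0 to 1) and never cleared; these are simplifying assumptions for illustration and not the exact procedures of [25]. All function names are hypothetical.

```python
import math
import random
from typing import List


def round_deterministic(values: List[float], threshold: float = 0.5) -> List[int]:
    """dFGSM-style rounding: map a continuous vector back to bits with a fixed threshold."""
    return [1 if v >= threshold else 0 for v in values]


def round_randomized(values: List[float], rng: random.Random) -> List[int]:
    """rFGSM-style rounding: each entry is compared with its own random threshold."""
    return [1 if v >= rng.random() else 0 for v in values]


def bga_step(bits: List[int], grad: List[float]) -> List[int]:
    """Set every bit whose partial derivative reaches ||grad||_2 / sqrt(m)."""
    cutoff = math.sqrt(sum(g * g for g in grad)) / math.sqrt(len(bits))
    return [1 if bit == 1 or g >= cutoff else 0 for bit, g in zip(bits, grad)]


def bca_step(bits: List[int], grad: List[float]) -> List[int]:
    """Set only the unset bit with the largest partial derivative."""
    candidates = [j for j, bit in enumerate(bits) if bit == 0]
    if not candidates:
        return list(bits)
    best = max(candidates, key=lambda j: grad[j])
    updated = list(bits)
    updated[best] = 1
    return updated


bits = [1, 0, 0, 0]                 # toy feature vector of an executable
grad = [0.1, 0.9, -0.2, 0.3]        # assumed loss gradient with respect to the features

print(round_deterministic([0.2, 0.5, 0.9]))
print(bga_step(bits, grad))
print(bca_step(bits, grad))
```

Only adding features, as assumed here, mirrors the practical requirement that the perturbed executable must keep its original functionality.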

The authors in [67] developed a new surrogate loss function (based on FGSM) to find attacks on malware detection models. They then injected a sequence of bytes (payload) into the binary files so as to preserve the original functionality of the malware. Finally, they reconstructed the adversarial embedding into a valid binary file.

5.1.2 JSMA-based

JSMA is another pioneering work on attacking neural models for image applications (refer to Section 3.2.3). The work [12] used the forward derivative, as in JSMA, to find the words that contribute most towards the adversarial direction. The method then modifies these words along the direction of the targeted class and projects them onto the closest vector in the embedding space. The works in [23] and [24] are the first to attack malware classification deep neural networks. They adopted JSMA to generate adversaries. In addition, these works enforced some constraints, as the input is discrete and the perturbed applications are required to remain valid. The authors then provided three methods to defend against the attacks, namely feature reduction, distillation and retraining, and found retraining to be the most effective defense.
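The projection step, mapping a perturbed embedding back to the closest real word vector, can be sketched as below. The three-word embedding table, the perturbation direction and the step sizes are hypothetical toy values, and squared Euclidean distance is an assumed choice of metric.

```python
from typing import Dict, List


def nearest_word(vector: List[float], embeddings: Dict[str, List[float]]) -> str:
    """Return the vocabulary word whose embedding is closest to the given vector."""
    def squared_distance(a: List[float], b: List[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(embeddings, key=lambda word: squared_distance(vector, embeddings[word]))


def perturb_and_project(word: str, direction: List[float], step: float,
                        embeddings: Dict[str, List[float]]) -> str:
    """Move a word's embedding along a direction, then snap to the nearest real word."""
    moved = [value + step * d for value, d in zip(embeddings[word], direction)]
    return nearest_word(moved, embeddings)


embeddings = {            # assumed toy embedding table
    "good": [1.0, 0.0],
    "fine": [0.8, 0.2],
    "bad": [-1.0, 0.0],
}
toward_negative = [-1.0, 0.0]   # assumed direction toward the targeted class

print(perturb_and_project("good", toward_negative, 0.1, embeddings))  # small step
print(perturb_and_project("good", toward_negative, 1.7, embeddings))  # large step
```

A small step leaves the word unchanged because the original embedding is still the nearest one, which is why such attacks need either larger steps or several modified words before the discrete text actually changes.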

5.1.3 C&W-based

The C&W method is an optimization-based method that turns adversarial example generation into an optimization problem. Generally, the problem can be formulated as:

$$\min_{\delta} \; L(x + \delta) + \lambda \cdot R(\delta)$$

where $L$ is the loss function that penalizes an unsuccessful attack, $R$ is the regularization function that measures the magnitude of the distortion $\delta$ added to the input $x$, and $\lambda$ is the regularization parameter that controls the trade-off between attack success and distortion.

The work in [40] adopted the C&W method for attacking predictive models of medical records. The aim is to identify susceptible events and measurements in each patient’s medical records, which provides guidance for clinical usage. The authors used a standard LSTM as the predictive model. Given patient EHR data represented by a matrix $X \in \mathbb{R}^{d \times t}$ ($d$ is the number of medical features and $t$ is the time index of medical checks), the generation of the adversarial example is formulated as:

$$\min_{\hat{X}} \; \max\left\{ [\mathrm{Logit}(\hat{X})]_{y_s} - [\mathrm{Logit}(\hat{X})]_{y_t}, \; -\kappa \right\} + \lambda \, \| \hat{X} - X \|_1$$

where $\mathrm{Logit}(\cdot)$ denotes the logit layer output, $\kappa \geq 0$ is a confidence margin, and $\lambda$ is the regularization parameter that controls the $\ell_1$-norm regularization. $\hat{X}$ is the adversarial example of $X$, $y_t$ is the targeted label and $y_s$ is the original label. After generating adversarial examples, the authors picked the optimal one according to their proposed evaluation scheme, which considers both the perturbation magnitude and the structure of the attacks. Finally, they used the adversarial example to compute the susceptibility score for the EHR as well as the cumulative susceptibility score for different measurements.

Seq2Sick [53] attacked seq2seq models using two targeted attacks: the non-overlapping attack and the keyword attack. For the non-overlapping attack, the authors aimed to generate adversarial sequences whose outputs are entirely different from the original outputs. They proposed a hinge-like loss function that is optimized on the logit layer of the neural network:

$$L_{\text{non-overlapping}} = \sum_{t=1}^{M} \max\left\{ -\epsilon, \; z_t^{(s_t)} - \max_{y \neq s_t} z_t^{(y)} \right\}$$

where $\{s_t\}$ is the original output sequence of length $M$, $\{z_t\}$ are the logit layer outputs of the adversarial example and $\epsilon \geq 0$ is a confidence margin. For the keyword attack, targeted keywords are expected to appear in the output sequence. The authors also placed the optimization on the logit layer and tried to ensure that the targeted keyword’s logit is the largest among all words. Furthermore, they defined a mask function $m_t$ to solve the keyword collision problem. The loss function then becomes:

$$L_{\text{keywords}} = \sum_{i=1}^{|K|} \min_{t \in [M]} \left\{ m_t\left( \max\left\{ -\epsilon, \; \max_{y \neq k_i} z_t^{(y)} - z_t^{(k_i)} \right\} \right) \right\}$$

where $k_i$ denotes the $i$-th word in the output vocabulary. To ensure that the generated word embeddings are valid, this work also considered two regularizations: group lasso regularization to enforce group sparsity, and gradient regularization to keep the adversaries in the permissible region of the embedding space.

5.1.4 Operation-based

HotFlip [17] represented character-level operations, i.e., swap, insert and delete, as vectors in the input space and estimated the change in loss by directional derivatives with respect to these vectors. Specifically, given a one-hot representation of the input, a character flip in the $j$-th character of the $i$-th word ($a \rightarrow b$) can be represented by the vector:

$$\vec{v}_{ijb} = \left( \vec{0}, \ldots; \left( \vec{0}, \ldots, (0, \ldots, -1, 0, \ldots, 1, 0)_j, \ldots, \vec{0} \right)_i; \vec{0}, \ldots \right)$$

where -1 and 1 are in the corresponding positions for the $a$-th and $b$-th characters of the alphabet, respectively. The best character swap can then be found by maximizing a first-order approximation of the loss change via the directional derivative along the operation vector:

$$\max \nabla_x J(x, y)^{T} \cdot \vec{v}_{ijb} = \max_{ijb} \; \frac{\partial J}{\partial x_{ij}}^{(b)} - \frac{\partial J}{\partial x_{ij}}^{(a)}$$

where $J(x, y)$ is the model’s loss function with input $x$ and true output $y$. Similarly, an insertion at the $j$-th position of the $i$-th word can be treated as a character flip followed by more flips, as characters are shifted to the right until the end of the word. A character deletion is a number of character flips as characters are shifted to the left. Using beam search, HotFlip efficiently finds the best directions for multiple flips.
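The first-order estimate of the best single flip can be sketched in a few lines. The sketch assumes the gradient of the loss with respect to the one-hot input is already available as a table with one row per character position and one column per alphabet symbol, and it flattens the word index away by treating the input as a single word. The gradient values, the three-letter alphabet and the function name are hypothetical.

```python
from typing import List, Tuple


def best_flip(gradient: List[List[float]], current: List[int]) -> Tuple[float, int, int]:
    """Pick the flip a -> b that maximizes grad[pos][b] - grad[pos][a].

    gradient[pos][c] is the partial derivative of the loss with respect to the
    one-hot entry for symbol c at position pos; current[pos] is the index of the
    symbol that is present now.
    """
    best_gain, best_pos, best_char = float("-inf"), -1, -1
    for pos, present in enumerate(current):
        for candidate, partial in enumerate(gradient[pos]):
            if candidate == present:
                continue
            gain = partial - gradient[pos][present]   # estimated increase in loss
            if gain > best_gain:
                best_gain, best_pos, best_char = gain, pos, candidate
    return best_gain, best_pos, best_char


alphabet = "abc"
word = "cab"
current = [alphabet.index(ch) for ch in word]
gradient = [                      # assumed gradient values
    [0.1, 0.0, 0.2],
    [0.5, -0.1, 0.3],
    [0.0, 0.4, 0.9],
]

gain, pos, new_char = best_flip(gradient, current)
flipped = word[:pos] + alphabet[new_char] + word[pos + 1:]
print(pos, alphabet[new_char], flipped)
```

One backward pass yields the whole gradient table, so every candidate flip is scored without further queries to the model; a beam search over several flips would repeat this scoring on the best partial edits.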

The work [33] extended HotFlip by adding targeted attacks. Besides the swap, insertion and deletion provided in HotFlip, the authors proposed a controlled attack, which removes a specific word from the output, and a targeted attack, which replaces a specific word with a chosen one. To achieve these attacks, they maximized the loss function $J(x, t)$, where $t$ is the target word, for the controlled attack, and minimized $J(x, t')$, where $t'$ is the word to replace $t$, for the targeted attack.

5.1.5 Others

The work [68] generated both white-box and black-box adversarial examples, with the aim of evaluating the robustness of their reading comprehension model. For the white-box attacks, they leveraged the model’s internal attention distribution to find the plot sentence to which the model gives the largest weight conditioned on the correct answer. They then exchanged the words that received the most attention with randomly chosen words from a known vocabulary. They also performed another white-box attack by removing the whole sentence that receives the highest attention.

| Method | Works | Granularity | Control |
|---|---|---|---|
| FGSM-based | [29] | character, word | targeted |
| | [64] | word | untargeted |
| | [65] | word | untargeted |
| | [5] | embedding | untargeted |
| | [67] | application | targeted |
| | [66] | sentence | untargeted |
| | [25] | application | targeted |
| JSMA-based | [12] | word | targeted |
| | [23] | application | targeted |
| | [24] | application | targeted |
| C&W-based | [40] | medical feature | targeted |
| | [53] | embedding | targeted |
| Operation-based | [17] | character | untargeted |
| | [33] | character | targeted, untargeted |
| Others | [68] | word, sentence | untargeted |

TABLE I: Summary of white-box attack works. Granularity describes whether the attack is performed at the level of character, word, sentence, etc. Control indicates whether the attack is targeted or untargeted.

5.2 Black-box Attack

Black-box attacks are more practical, as in many applications the attacker cannot fully know the details of the neural network. However, generating black-box adversaries is more computationally expensive.

5.2.1 Concatenation Adversaries

[3] is the first work to attack reading comprehension systems. The authors proposed concatenation adversaries, which append distracting but meaningless sentences to the end of the paragraph. These sentences do not change the semantics of the paragraph or the answer to the question, but fool the neural model. The distracting sentences are either carefully generated grammatical sentences or arbitrary sequences of words drawn from a pool of 20 random common words. Both perturbations are obtained by iteratively querying the neural network until the output changes. The authors of [69] improved on this work by varying the locations where the distracting sentences are placed and by expanding the set of fake answers used for generating the distracting sentences, yielding new adversarial examples that can help train more robust neural models. The work [68] adapted the attack method of [3] to evaluate the robustness of their reading comprehension model. Specifically, they generated distracting sentences using a pool of ten random common words in conjunction with all question words or, additionally, the words from all incorrect answer candidates.

5.2.2 Edit Adversaries

DeepWordBug [30] is a simple method that uses character transformations to generate adversarial examples. The authors first identified the important “tokens”, i.e., the words or characters that affect the model prediction the most, using scoring functions developed specifically for the attacked neural models. They then modified the identified tokens using four strategies: replace, delete, add and swap. The work in [31] perturbs the input data by changing the character order: swap, middle random (i.e., randomly changing the order of all characters except the first and the last), fully random (i.e., randomly changing the order of all characters) and keyboard typos. The authors also collected typos and misspellings as adversaries. [50] attacked neural models for dialogue generation. They applied various perturbations to the dialogue context, namely Random Swap (randomly transposing neighboring tokens), Stopword Dropout (randomly removing stopwords), Paraphrasing (replacing words with their paraphrases) and Grammar Errors (e.g., changing a verb to the wrong tense) for the Should-Not-Change attacks, and the Add Negation strategy (negating the root verb of the source input) and the Antonym strategy (changing verbs, adjectives or adverbs to their antonyms) for the Should-Change attacks.
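The character-level edits named above are simple string transformations. The following sketch shows one possible form of each; the function names, the choice of positions and the seeded random generator are assumptions for illustration, and the token scoring step that decides which word to edit is omitted.

```python
import random


def swap(word: str, i: int) -> str:
    """Swap the characters at positions i and i + 1."""
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def delete(word: str, i: int) -> str:
    """Remove the character at position i."""
    return word[:i] + word[i + 1:]


def insert(word: str, i: int, ch: str) -> str:
    """Insert character ch before position i."""
    return word[:i] + ch + word[i:]


def replace(word: str, i: int, ch: str) -> str:
    """Replace the character at position i with ch."""
    return word[:i] + ch + word[i + 1:]


def middle_random(word: str, rng: random.Random) -> str:
    """Shuffle all characters except the first and the last."""
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]


rng = random.Random(0)   # seeded only to make the demonstration repeatable
print(swap("noise", 1), delete("noise", 2), insert("nose", 2, "i"), replace("noise", 0, "p"))
print(middle_random("adversarial", rng))
```

Each edit changes only one or a few characters, so the perturbed word usually stays readable to a human while mapping to a different or unknown token for the model.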

The work in [57] proposed two methods for generating adversaries. The first uses universal replacement rules to produce adversaries on the questions, aiming to obtain incorrect answers. The authors used carefully chosen handcrafted rules to perturb the original questions by replacing part of the question with another form, for example, “What NOUN” → “Which NOUN”. They filtered all the candidate replacement rules by querying the attacked neural model: if a rule changes the answer, it is kept; otherwise, it is dropped. Afterwards, the rules that generate questions with low semantic equivalence are also dropped. They checked the degree of semantic equivalence using a self-defined score, which is the ratio between the probability of a paraphrase and the probability of the question itself. The second method will be discussed in the next section.

The authors in [47] proposed a method for automatically generating adversarial examples that violate a set of given First-Order Logic constraints in natural language inference (NLI). They proposed an inconsistency loss to measure the degree to which a set of sentences causes a model to violate a rule. Adversarial example generation is then the process of finding a mapping from the variables in the rules to sentences that maximize the inconsistency loss and that have a low perplexity (as defined by a language model). To generate low-perplexity adversarial sentence examples, they used three edit perturbations: i) change one word in one of the input sentences; ii) remove one parse sub-tree from one of the input sentences; iii) insert one parse sub-tree from one sentence in the corpus into the parse tree of one of the input sentences.

5.2.3 Paraphrase-based Adversaries

The second method in [57] is a paraphrase-based method that adopts machine translation (OpenNMT) to generate paraphrases of the original question. These paraphrases are regarded as adversarial examples. The purpose of using paraphrases is to preserve the semantics of the adversaries. Then the adversaries are used to query the attacked neural model until the output changes.

SCPNs [42] is an encoder-decoder architecture for generating paraphrase-like adversaries. The method first encodes the original sentence, then feeds the paraphrases (generated by back-translation) and the target syntactic tree into the decoder, whose output is the target adversary. This method is able to control the syntax of the adversaries.

5.2.4 Generative-based Adversaries

Some works propose to leverage Generative Adversarial Networks (GANs) [62] to generate adversaries [4]. In [4], the model for generating adversarial examples consists of two key components: a GAN (generating fake data samples) and an inverter (mapping an input $x$ to its representation $z'$ in the latent dense space). The two components are trained on the original input by minimizing the reconstruction error between the original input and the adversarial examples. Perturbation is performed in the latent dense space by identifying a perturbed sample $\hat{z}$ in the neighborhood of $z'$. Two search approaches, namely iterative stochastic search and hybrid shrinking search, are proposed to identify the proper $\hat{z}$. However, the method requires querying the attacked model each time to find the $\hat{z}$ that makes the model give an incorrect prediction. Therefore, this method is quite time-consuming.

5.2.5 Substitution

The work in [70] proposes a black-box attack framework that targets RNN models used for detecting malware. The framework consists of two models: a generative RNN and a substitute RNN. The generative RNN aims to generate an adversarial API sequence from the malware’s API sequence. It is based on the sequence-to-sequence model in [21]. The substitute RNN, which is a bi-directional RNN with an attention mechanism, mimics the behavior of the attacked RNN. It is trained on both malware and benign sequences, as well as on the Gumbel-Softmax outputs of the generative RNN. Gumbel-Softmax is used to enable the joint training of the two RNN models, because the original output of the generative RNN is discrete. Specifically, it allows gradients to be back-propagated through the otherwise discrete outputs, from the substitute RNN back to the generative RNN.
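How Gumbel-Softmax replaces a discrete choice with a continuous, differentiable vector can be sketched as follows. The sketch covers only the sampling step in plain Python (no gradients are computed), and the logits, temperature and function names are assumed toy values rather than settings from [70].

```python
import math
import random
from typing import List


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_gumbel(rng: random.Random) -> float:
    """Draw standard Gumbel noise; the clamp avoids log(0)."""
    u = max(rng.random(), 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax(logits: List[float], temperature: float, rng: random.Random) -> List[float]:
    """Soft one-hot sample: softmax of (logits + Gumbel noise) / temperature."""
    noisy = [(logit + sample_gumbel(rng)) / temperature for logit in logits]
    return softmax(noisy)


rng = random.Random(0)
probs = [0.7, 0.2, 0.1]                       # assumed distribution over 3 API calls
logits = [math.log(p) for p in probs]

sample = gumbel_softmax(logits, temperature=0.5, rng=rng)
print(sample)   # a probability vector that is typically dominated by one entry

# The position of the largest entry follows the original categorical distribution.
counts = [0, 0, 0]
for _ in range(5000):
    y = gumbel_softmax(logits, 0.5, rng)
    counts[y.index(max(y))] += 1
print([c / 5000 for c in counts])
```

Lower temperatures push the sample closer to a one-hot vector, while the output stays a smooth function of the logits, which is what lets a downstream network pass gradients back to the generator.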

| Method | Works | Granularity | Control |
|---|---|---|---|
| Concatenation | [3] | word | untargeted |
| | [69] | sentence | untargeted |
| Edit | [30] | character, word | untargeted |
| | [31] | character, word | untargeted |
| | [50] | word, phrase | untargeted |
| | [57] | word | untargeted |
| | [47] | word, phrase | untargeted |
| Paraphrase-based | [57] | word | untargeted |
| | [42] | word | targeted |
| | [4] | word (latent) | untargeted |
| Substitute Model | [70] | application | untargeted |

TABLE II: Summary of black-box attack methods. Granularity describes whether the attack is performed at the level of character, word, sentence, etc. Control indicates whether the attack is targeted or untargeted.

5.3 Cross-modal Attacks

Many works attack neural models for cross-modal applications, in which the input and output are in different modalities, e.g., image, audio and text. For example, such applications require image-to-text or speech-to-text neural models. Although current attacks are performed on images or audio, rather than on text, we introduce these works in order to cover all text-related applications.

5.3.1 Image-to-Text

Image-to-text models output textual data according to input images.

Optical Character Recognition (OCR). The work [71] attacked a neural system for recognizing characters from images, a problem known as Optical Character Recognition. Similar to image captioning, the input is an image and the output is the recognized characters as text. The authors proposed two attacks. The first attacks the image-to-text model by optimizing a proposed loss function that linearly combines the Connectionist Temporal Classification (CTC) loss function for sequential labelling with the norm distance between clean and perturbed images. The second attack is performed on the text by replacing words with their antonyms, rendering the modified text as lines in an image, and finally replacing the images of the corresponding lines in the document image.

Scene Text Recognition. [72] evaluated models for scene text recognition, which is also an image-to-text application. The difference from the OCR task is that in scene text recognition the entire image is mapped to word strings directly, whereas in OCR the recognition is a pipeline process that first segments the words into characters and then performs recognition on single characters. To attack the neural models, the authors proposed a loss function that linearly combines the CTC loss and the Euclidean loss, and treated the optimization problem as a multi-task learning problem. The generation of the proposed attack is much faster than that of the state-of-the-art attack methods.

Image Captioning. The work [73] is developed for image captioning, which is a multimodal learning task that takes an image as input and generates a language caption that best describes its visual contents. The authors specifically attacked the CNN+RNN model, which uses a CNN as the encoder for image feature extraction and an RNN as the decoder for caption generation. The adversarial examples are generated by maximizing the logit of the targeted caption. To achieve the attack, the authors developed hinge-like optimization equations. They provided two targeted attacks: targeted caption (the generated caption matches the target caption) and targeted keywords (the generated caption contains the targeted keywords).

Visual Question Answering (VQA). [74] attacked one image captioning neural model and two visual question answering (VQA) models. All of the attacked models are complex and contain a language generation component, localization and an attention mechanism. The image captioning model DenseCap consists of a localization network (RNN) to predict regions and a CNN+RNN to generate textual output from images. This work attacked the CNN+RNN module of DenseCap by using targeted region-caption pairs as ground truth and optimizing the DenseCap loss on the adversarial examples together with the distance between perturbed and original images. For the VQA attack, the work proposed a loss function that maximizes the probability of the target answer and removes the preference for adversarial examples with a smaller distance to the original image when the distance is below a threshold. In the evaluation, the attacks achieved better success rates than previous attacks.

Visual-Semantic Embeddings (VSE). VSE aims to build a bridge between natural language and the underlying visual world. In VSE, the embedding spaces of both images and descriptive texts (captions) are jointly optimized and aligned. [75] attacked the latest VSE models by generating adversarial examples in the test set to evaluate the robustness of the VSE models. They performed the attack on the textual part by introducing three methods: i) replace nouns in the image captions utilizing the hypernymy/hyponymy relations in WordNet; ii) change the numerals to different ones and singularize or pluralize the corresponding nouns when necessary; iii) detect the relations and shuffle the non-interchangeable noun phrases or replace the prepositions. The proposed methods are black-box edit adversaries.

5.3.2 Speech-to-Text

Speech-to-text models output textual data according to input audio. [76] attacked a speech-to-text model in the speech recognition task. The attacked neural model is a state-of-the-art speech-to-text transcription neural network (LSTM) named DeepSpeech. Given a natural waveform $x$, the authors constructed an audio perturbation $\delta$ that is almost inaudible, yet changes how the audio is recognized when added to the original waveform, i.e., $x + \delta$. The perturbation is constructed by adopting the idea of the C&W method (refer to Section 3.2.4), which measures the image distortion by the maximum amount of changed pixels. Adapting this idea, they measured the audio distortion with a metric used for measuring the relative loudness of an audio signal and proposed to use the Connectionist Temporal Classification (CTC) loss for the optimization task. They then solved this task with a gradient method using the Adam optimizer.

| Cross-Modal | Application | Works | Control |
|---|---|---|---|
| Image-to-Text | Optical Character Recognition | [71] | targeted |
| | Scene Text Recognition | [72] | targeted |
| | Image Captioning | [73] | targeted |
| | Visual Question Answering | [74] | targeted |
| | Visual-Semantic Embeddings | [75] | untargeted |
| Speech-to-Text | Speech Recognition | [76] | targeted |

TABLE III: Summary of Cross-Modal Attacking Works

6 Attacked Applications and Benchmark Datasets

In recent years, neural networks have gained success in different NLP domains, and the main branches of applications include text classification, reading comprehension, machine translation, text summarization, question answering and dialogue generation, to name a few. In this section, we review the current works on generating adversarial examples for neural networks from the perspective of NLP applications. Table IV summarizes the works we reviewed in this article according to their applications. We also list the benchmark datasets used in these works in the table. Note that the auxiliary datasets which help to generate adversarial examples are not included; we only present the datasets used to evaluate the attacked neural networks.

| Applications | Representative Works | Benchmark Datasets |
|---|---|---|
| Classification: Text Classification | [29, 5, 30, 17, 41, 32, 77] | DBpedia, Reuters Newswires, AG’s news, Sogou News, Yahoo! Answers, RCV1 |
| Classification: Sentiment Analysis | [12, 65, 30, 17, 41, 32, 77, 42, 57] | SST, IMDB Rev, Yelp Rev, Elec, Rotten Tomatoes Rev, Amazon Rev |
| Classification: Spam Detection | [30, 78] | Enron Spam |
| Classification: Gender Identification | [65] | Twitter Gender |
| Classification: Grammar Error Detection | [41] | FCE-public |
| Classification: Medical Status Prediction | [40] | Electronic Health Records (EHR) |
| Classification: Malware Detection | [23, 24, 67, 70, 79] | DREBIN, Microsoft Kaggle |
| Classification: Relation Extraction | [66] | NYT Relation, UW Relation |
| Machine Comprehension | [3, 69, 26, 68] | SQuAD, 2017 NIPS Human-Computer QA, MovieQA |
| Machine Translation | [31, 33, 80, 53, 4] | TED Talks, Logical QA, WMT’16 Multimodal Translation Task |
| Text Summarization | [53] | DUC2003, DUC2004, Gigaword |
| Text Entailment | [81, 47, 42, 4] | SNLI, SciTail, MultiNLI, SICK |
| Dialogue Generation | [50, 82] | Ubuntu Dialogue, CoCoA, Switchboard Dialogue Act, OpenSubtitles |

Rev stands for Reviews.

TABLE IV: Attacked Applications and Benchmark Datasets

Text Classification. The majority of the surveyed works attack text classification neural models, as many tasks can be framed as classification problems. Sentiment analysis aims to classify the sentiment into three groups: neutral, positive and negative. Gender identification, grammatical error detection and malware detection can be framed as binary classification problems. Medical status prediction is a multi-class problem in which the classes are defined by medical experts. Many of the works do not use only one dataset to evaluate their attack strategies. Instead, they test on various benchmark datasets to show the generality and robustness of their attacks. [29] used the DBpedia ontology dataset [83] to classify the document samples into 14 high-level classes. [5] used IMDB movie reviews [84] for sentiment analysis, and the Reuters-2 and Reuters-5 newswires datasets provided by the NLTK package for categorization. [12] used an unspecified movie review dataset for sentiment analysis. [65] also used the IMDB movie review dataset for sentiment analysis, and performed gender detection on the CrowdFlower Twitter gender classification dataset. [30] performed spam detection on the Enron Spam Dataset [85] and adopted six large datasets from [28], i.e., AG’s news, Sogou news [86], the DBpedia ontology dataset and Yahoo! Answers (the Yahoo! Answers Comprehensive Questions and Answers version 1.0 dataset, obtained through the Yahoo! Webscope program) for text categorization, and Yelp reviews (Yelp Dataset Challenge in 2015) and Amazon reviews [87] for sentiment analysis. [17] also used AG’s news for text classification. Further, they used the Stanford Sentiment Treebank (SST) dataset [88] for sentiment analysis.
[41] conducted evaluation on three tasks: sentiment analysis (IMDB movie review, Elec [89], Rotten Tomatoes [90]), text categorization (DBpedia Ontology dataset and RCV1 [91]) and grammatical error detection (FCE-public [92]). [32] used IMDB movie reviews for sentiment analysis, and AG’s news and Yahoo! Answers for text categorization. [78] used the Enron Spam dataset for spam detection. [40] generated adversarial examples against a neural medical status prediction system using real-world electronic health records data. Many works target malware detection models. [23, 24] attacked neural malware detection systems using the DREBIN dataset, which contains both benign and malicious Android applications [93]. [67] collected benign Windows application files and used the Microsoft Malware Classification Challenge dataset [94] as the malicious part. [70] crawled 180 programs with corresponding behavior reports from a website for malware analysis; 70% of the crawled programs are malware. In [66], the authors modelled relation extraction as a classification problem, where the goal is to predict the relations that exist between entity pairs given text mentions. They used two relation datasets, the NYT dataset [95] and the UW dataset [96]. [77] proposed another kind of attack, called reprogramming. They specifically targeted text classification neural models and used four datasets to evaluate their attack: the Surname Classification Dataset (from the PyTorch tutorial “Classifying names with a character-level RNN”), Experimental Data for Question Classification [97], the Arabic Tweets Sentiment Classification Dataset [98] and the IMDB movie review dataset.

Machine Comprehension. Machine comprehension usually provides context documents or paragraphs to the machine, which must then answer a question based on its comprehension of the contexts. Jia and Liang were among the first to consider textual adversaries, and they targeted neural machine comprehension models [3]. They used the Stanford Question Answering Dataset (SQuAD) to evaluate the impact of their attack on neural machine comprehension models. [69] followed the previous work and also worked on the SQuAD dataset. [26] evaluated their attacks on the 2017 NIPS Human-Computer Question Answering competition [99]. Although the focus of [68] is to develop a robust machine comprehension model, the authors used adversarial examples to evaluate their proposed system. They used the MovieQA multiple choice question answering dataset [100] for the evaluation.

Machine Translation. Machine translation works on parallel datasets, one part of which is in the source language and the other in the translated language. [31] used the TED talks parallel corpus prepared for IWSLT 2016 [101] for testing all of the NMT systems. They also collected French, German and Czech corpora for generating natural noise, building a look-up table of possible lexical replacements that is used for generating adversarial examples. [33] used the same TED talks corpus with the German to English, Czech to English, and French to English pairs. [80] targeted the differentiable neural computer (DNC), a novel computing machine with a DNN. They evaluated the attacks on logical question answering using the bAbI tasks.

Text Entailment. The fundamental task of text entailment is to decide whether a premise text entails a hypothesis, i.e., whether the truth of one text fragment follows from another. [81] assessed various models on two entailment datasets: Stanford Natural Language Inference (SNLI) [102] and SciTail [103]. [47] also used the SNLI dataset, together with the MultiNLI [104] dataset.

Text Summarization. The goal of text summarization is to summarize the core meaning of a given document or paragraph with succinct expressions. None of the surveyed papers targets text summarization alone. [53] evaluated their attack on multiple applications including text summarization, using DUC2003, DUC2004, and Gigaword for evaluation.

Dialogue Generation. Dialogue generation is a fundamental component of real-world virtual assistants such as Siri and Alexa. It is a generative task that produces responses given a conversation. [50] is one of the first works to attack generative dialogue models. They used the Ubuntu Dialogue Corpus [105] and the Dynamic Knowledge Graph Network with the Collaborative Communicating Agents (CoCoA) dataset [55] to evaluate their two attack strategies. [82] also used the Ubuntu Dialogue Corpus. In addition, they used the Switchboard Dialogue Act Corpus, which is a collection of two-sided telephone conversations annotated with utterance-level dialogue acts. Another dataset, OpenSubtitles [106], was also used for evaluation; the conversations in this dataset contain a large number of egregious sentences.

Multi-Applications. Some works adapt their attack methods to different applications, i.e., their evaluation is not limited to a single application. [53] attacked sequence-to-sequence models. Specifically, they evaluated their attack on two applications: text summarization and machine translation. For text summarization, as mentioned before, they used three datasets: DUC2003, DUC2004, and Gigaword. For machine translation, they used a sample dataset from the WMT’16 Multimodal Translation task. [42] proposed syntactically controlled adversarial paraphrases and evaluated the attack on sentiment analysis and text entailment. They used SST for sentiment analysis and SICK [107] for text entailment. [4] is a generic approach for generating adversarial examples on neural models, evaluated on image classification (the MNIST digit image dataset), textual entailment (SNLI), and machine translation (only some examples are provided). [108] evaluated their attacks on five datasets covering sentiment analysis (IMDB movie reviews, Elec product reviews, Rotten Tomatoes movie reviews) and text categorization (DBpedia Ontology, RCV1 news articles). [57] targeted two applications. For sentiment analysis, they used the Rotten Tomatoes and IMDB movie review datasets. For visual question answering, they tested on the dataset provided by Zhu et al. [109]. Although visual QA involves visual content, their attack is performed only on the questions and does not manipulate the images. Thus this work is not regarded as a multi-modal attack.

7 Robustifying Models with Adversarial Examples

An important purpose of generating adversarial examples for neural networks is to use these examples to improve the model’s robustness [2]. There are two common ways to achieve this goal: data augmentation and adversarial training. Data augmentation adds the generated adversarial examples back into the training data so that the model sees more data outside the original data distribution. Adversarial training modifies the model’s loss function by adding a term on the adversarial examples as a regularizer. We introduce here some representative studies that use adversarial examples to robustify neural networks.

7.1 Data Augmentation

The authors of [3] try to improve the reading comprehension model by training on an augmented dataset that includes the adversarial examples. They showed that this data augmentation is effective against the attack that uses the same adversarial examples, but less effective against different attacks. [69] shares the idea of augmenting the training dataset, but uses the improved adversarial examples introduced in Section 5.2.1.
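To make the augmentation step concrete, the following is a minimal sketch under stated assumptions, not the code of [3] or [69]. The function `append_distractor` is a hypothetical, toy stand-in for an attack that appends an unrelated sentence to the text, and `augment_with_adversarial` and its `ratio` parameter are names introduced here for illustration only.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (text, label)


def append_distractor(text: str, rng: random.Random) -> str:
    """Toy label-preserving perturbation (assumption): append an unrelated sentence."""
    distractors = [
        "The weather in a distant city was mild that year.",
        "A committee met on Tuesday to discuss other matters.",
    ]
    return text + " " + rng.choice(distractors)


def augment_with_adversarial(
    train: List[Example],
    perturb: Callable[[str, random.Random], str],
    ratio: float = 0.5,
    seed: int = 0,
) -> List[Example]:
    """Return the original examples plus perturbed copies of a random subset."""
    rng = random.Random(seed)
    n_adv = int(len(train) * ratio)
    chosen = rng.sample(train, n_adv)
    # Each perturbed copy keeps the label of the example it was derived from.
    adversarial = [(perturb(text, rng), label) for text, label in chosen]
    augmented = train + adversarial  # new list, the input is not modified
    rng.shuffle(augmented)
    return augmented
```

The model is then retrained on the returned list. As noted above, robustness gained this way may be limited to the attack that produced the added examples.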

The work in [81] trains a text entailment system augmented with adversarial examples, with the aim of making the system more robust. They propose three methods to generate more varied data: knowledge-based, which replaces words with their hypernyms/hyponyms provided in several given knowledge bases; hand-crafted, which adds negations to the existing entailment; and neural-based, which leverages a seq2seq model to generate entailment examples, with a loss function that measures the cross-entropy between the original hypothesis and the predicted hypothesis. During training, they adopt the idea of generative adversarial networks to train a discriminator and a generator, incorporating the adversarial examples in the discriminator’s optimization step.
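The knowledge-based and hand-crafted generators can be sketched as follows. This is an illustrative simplification, not the implementation of [81]: the `HYPERNYMS` dictionary is a hypothetical stand-in for a real knowledge base, the negation rule is deliberately naive, and the label names are assumptions.

```python
from typing import Dict, List, Tuple

# Toy stand-in for a lexical knowledge base (hypothetical entries).
HYPERNYMS: Dict[str, str] = {"dog": "animal", "guitar": "instrument", "car": "vehicle"}


def hypernym_variants(hypothesis: str, lexicon: Dict[str, str]) -> List[str]:
    """Replace one word at a time with its hypernym (a more general term)."""
    tokens = hypothesis.split()
    variants = []
    for i, tok in enumerate(tokens):
        if tok in lexicon:
            variants.append(" ".join(tokens[:i] + [lexicon[tok]] + tokens[i + 1:]))
    return variants


def negate(hypothesis: str) -> str:
    """Naive hand-crafted rule: insert 'not' after the first 'is' or 'are'."""
    tokens = hypothesis.split()
    for i, tok in enumerate(tokens):
        if tok in ("is", "are"):
            return " ".join(tokens[:i + 1] + ["not"] + tokens[i + 1:])
    return hypothesis


def generate_examples(premise: str, hypothesis: str) -> List[Tuple[str, str, str]]:
    """Derive new (premise, hypothesis, label) triples from an entailed pair."""
    examples = [(premise, h, "entails") for h in hypernym_variants(hypothesis, HYPERNYMS)]
    negated = negate(hypothesis)
    if negated != hypothesis:
        examples.append((premise, negated, "contradicts"))
    return examples
```

Generalizing a word in an entailed hypothesis keeps the pair entailed, whereas negating it flips the label, so both rules yield labelled training pairs without manual annotation.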

7.2 Adversarial Training

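Szegedy et al. [1] invented adversarial training, a strategy that consists of training a neural network to correctly classify both normal examples and adversarial examples. Adversarial training enforces the generated adversarial examples as the regularizer and follows the form of:

$$\tilde{J}(\theta, x, y) = \alpha J(\theta, x, y) + (1-\alpha)\, J\big(\theta,\; x + \epsilon\,\mathrm{sign}(\nabla_x J(\theta, x, y)),\; y\big)$$

where $J(\theta, x, y)$ is the loss of the model with parameters $\theta$ on input $x$ with label $y$, $\alpha$ balances the two terms, and $x + \epsilon\,\mathrm{sign}(\nabla_x J(\theta, x, y))$ represents the adversarial examples generation [2]. Following [2], the work [108] forms the adversarial training with a linear approximation and a norm constraint as minimizing the following loss function: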


$$-\log p\big(y \mid x + r_{adv};\, \theta\big), \qquad r_{adv} = -\epsilon\, g / \|g\|_2$$

given $g = \nabla_x \log p(y \mid x; \hat{\theta})$, where $\theta$ is the parameter of the neural model and $\hat{\theta}$ is a constant copy of $\theta$. The difference from [2] is that the authors perform the adversarial generation and training on the word embeddings. Further, they extend their previous work on attacking image deep neural models [110], where they define local distribution smoothness (LDS) as the negative of the KL divergence of two distributions (for the original data and for the adversaries). LDS measures the robustness of the model against perturbation in the local and ‘virtual’ adversarial direction. Thus, the adversary is defined as the direction to which the model distribution is most sensitive in the sense of KL divergence. They also apply this attack on word embeddings and provide adversarial training by adding adversarial examples as a regularizer.
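As a worked illustration of the norm-constrained, linearized perturbation described above, the sketch below uses a logistic model on a single continuous input vector (standing in for a word embedding). The model, the function names and the numbers are assumptions chosen for clarity and are not taken from [108].

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def log_prob(w: List[float], b: float, x: List[float], y: int) -> float:
    """log p(y | x; theta) for a logistic model with y in {0, 1}."""
    p1 = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return math.log(p1 if y == 1 else 1.0 - p1)


def grad_x_log_prob(w: List[float], b: float, x: List[float], y: int) -> List[float]:
    """Gradient of log p(y | x; theta) with respect to the input x."""
    p1 = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(y - p1) * wi for wi in w]


def adversarial_perturbation(w, b, x, y, eps: float) -> List[float]:
    """r_adv = -eps * g / ||g||_2, the linearized worst case within an L2 ball."""
    g = grad_x_log_prob(w, b, x, y)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return [0.0] * len(x)
    return [-eps * gi / norm for gi in g]


def adversarial_loss(w, b, x, y, eps: float) -> float:
    """-log p(y | x + r_adv; theta), with r_adv treated as a constant."""
    r = adversarial_perturbation(w, b, x, y, eps)
    x_adv = [xi + ri for xi, ri in zip(x, r)]
    return -log_prob(w, b, x_adv, y)
```

In a training loop, this adversarial loss would be added to the ordinary loss before each parameter update. Treating the perturbation as a constant corresponds to computing it from a fixed copy of the parameters, so no gradient flows through the attack itself.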

The work [41] follows the idea from [108] and extends adversarial training to LSTM. The authors followed the FGSM work to add adversarial training as a regularizer. However, in order to make the adversarial examples interpretable, i.e., to constrain the word embeddings of the adversaries to valid word embeddings in the provided vocabulary, they introduce a direction vector from the perturbed embedding to a valid word embedding, and put the direction constraint in the regularizer. This work also extended the adversarial training to semi-supervised learning by adopting the method from [108]. [66] simply adopts the regularizer of [108], but applies it to a different application, relation extraction.
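The idea of restricting a perturbation to directions that point toward real words can be sketched as below. This is a simplified reading rather than the implementation of [41]: the weighting rule follows from the chain rule at zero perturbation, and the toy embeddings, the function names and the supplied `loss_grad` are assumptions.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def direction_vectors(word: str, emb: Dict[str, Vector]) -> Dict[str, Vector]:
    """Unit vectors pointing from `word` to every other vocabulary word."""
    src = emb[word]
    dirs: Dict[str, Vector] = {}
    for other, vec in emb.items():
        if other == word:
            continue
        diff = [v - s for v, s in zip(vec, src)]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm > 0.0:
            dirs[other] = [d / norm for d in diff]
    return dirs


def interpretable_perturbation(
    word: str, emb: Dict[str, Vector], loss_grad: Vector, eps: float
) -> Tuple[Vector, str]:
    """Build r = sum_k alpha_k * d_k and report the word it mostly moves toward."""
    dirs = direction_vectors(word, emb)
    # Chain rule at alpha = 0: d(loss)/d(alpha_k) = loss_grad . d_k
    scores = {k: sum(g * d for g, d in zip(loss_grad, dk)) for k, dk in dirs.items()}
    norm = math.sqrt(sum(s * s for s in scores.values()))
    dim = len(emb[word])
    if norm == 0.0:
        return [0.0] * dim, word
    alphas = {k: eps * s / norm for k, s in scores.items()}
    r = [sum(alphas[k] * dirs[k][i] for k in dirs) for i in range(dim)]
    target = max(alphas, key=lambda k: alphas[k])
    return r, target
```

Because each weight is tied to a vocabulary word, the word with the largest weight indicates which substitution the perturbation approximates, and this is what makes the perturbation readable.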

We have discussed the attack method of [25] in Section 5.1.1. They incorporate adversarial training as a regularizer but, unlike the aforementioned works, they use a saddle-point formulation that involves an inner non-concave maximization problem and an outer non-convex minimization problem. Their learning objective can be formulated as:

$$\theta^{*} = \arg\min_{\theta}\; \mathbb{E}_{(x,y)\sim \mathcal{D}} \Big[ \max_{\bar{x} \in \mathcal{S}(x)} L(\theta, \bar{x}, y) \Big]$$

where $\mathcal{D}$ is the data distribution, $\mathcal{S}(x)$ is the set of binary indicator vectors that preserve the functionality of malware $x$, $L$ is the loss function of the original classification model, $y$ is the correct label, and $\theta$ denotes the parameters to be found.
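The saddle-point objective can be illustrated with a small sketch under strong simplifying assumptions: a logistic model on binary feature vectors, a greedy inner maximizer, and the rule that functionality is preserved by only adding features (0 to 1 flips). None of this is the implementation of [25], and all function names are hypothetical.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def loss(w: List[float], b: float, x: List[int], y: int) -> float:
    """Cross-entropy loss of a logistic model on a binary feature vector."""
    p1 = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    p1 = min(max(p1, 1e-12), 1.0 - 1e-12)
    return -math.log(p1) if y == 1 else -math.log(1.0 - p1)


def inner_max(w: List[float], b: float, x: List[int], y: int, budget: int) -> List[int]:
    """Greedy approximate maximizer; only 0 -> 1 flips are allowed (assumption)."""
    x_adv = list(x)
    for _ in range(budget):
        base = loss(w, b, x_adv, y)
        best_gain, best_i = 0.0, None
        for i, bit in enumerate(x_adv):
            if bit == 1:
                continue
            x_adv[i] = 1
            gain = loss(w, b, x_adv, y) - base
            x_adv[i] = 0
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None:  # no remaining flip increases the loss
            break
        x_adv[best_i] = 1
    return x_adv


def outer_step(
    w: List[float], b: float, data: List[Tuple[List[int], int]], lr: float, budget: int
) -> Tuple[List[float], float]:
    """One SGD pass that minimizes the loss at each sample's inner maximizer."""
    for x, y in data:
        # Assumption: only malicious samples (y == 1) are perturbed.
        x_adv = inner_max(w, b, x, y, budget) if y == 1 else list(x)
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)) + b) - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x_adv)]
        b = b - lr * err
    return w, b
```

Repeating `outer_step` alternates the two problems: the inner search finds a hard variant of each malicious sample under the current parameters, and the outer update lowers the loss on that variant.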

8 Discussions and Open Issues

Generating textual adversarial examples has a relatively shorter history than generating image adversarial examples on DNNs. This is because it is more challenging to perturb discrete data while preserving valid syntax, grammar and semantics. We discuss some of the issues in this section and provide suggestions on future directions. [6] and [7] have discussed some issues in a general way and specifically for computer vision applications. In this section, we discuss the issues in the following aspects: i) perceivability; ii) semantics; iii) transferability; iv) automation; and v) un-attacked textual neural networks.

8.1 Perceivability

Perturbations in image pixels are usually hard to perceive, so they do not affect human judgment and fool only the deep neural networks. Perturbations on text, however, are obvious, whether they flip characters or change words. Invalid words and syntactic errors are easily identified by humans and detected by grammar-checking software, hence such perturbations can hardly attack a real NLP system. Nevertheless, many research works generate this type of adversarial example. This is acceptable only if the purpose is purely to propose research-oriented methods for generating adversarial examples. For practical attacks and for examining the victim neural networks, we need to turn to methods that make the perturbations not easily perceivable, i.e., syntactically correct and preferably semantically similar to the original.

8.2 Semantics

Changing a word in a sentence sometimes changes its semantics drastically. For NLP applications (e.g., reading comprehension, sentiment analysis), the adversarial examples need to be carefully designed so as not to change the should-be output. Otherwise, both the correct output and the perturbed output change, which violates the purpose of generating adversarial examples. This is challenging, and only a few works reviewed in this article achieved it. We encourage more work on semantics-preserving attacks.

8.3 Transferability

Transferability is a common property of adversarial examples. It reflects the generalization of the attack methods. Transferability means that adversarial examples generated for one deep neural network on a dataset can also effectively attack another deep neural network and/or dataset. This property is more often exploited in black-box attacks, as the details of the deep neural networks do not affect the attack method. It has also been shown that untargeted adversarial examples are much more transferable than targeted ones [111]. Transferability can occur at three levels in deep neural networks: same architecture with different data, different architectures with same application, and different architectures [6]. Although current works on textual attacks cover all three levels, the effectiveness of the transferred attacks still decreases drastically compared to that on the original architecture and data, indicating poor generalization ability. More general methods need to be proposed to tackle this issue.
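One simple way to quantify transferability is the fraction of adversarial examples that fool the model they were crafted on and also fool a second model. The sketch below assumes classifiers are plain functions from text to a label; the name `transfer_rate` is introduced here for illustration, and the surveyed works may use other metrics.

```python
from typing import Callable, List, Tuple

Classifier = Callable[[str], int]


def transfer_rate(
    adv_examples: List[Tuple[str, int]], source: Classifier, target: Classifier
) -> float:
    """Fraction of examples that fool `source` which also fool `target`.

    Each example is (adversarial_text, true_label).
    """
    fooled_source = [(x, y) for x, y in adv_examples if source(x) != y]
    if not fooled_source:
        return 0.0
    fooled_both = sum(1 for x, y in fooled_source if target(x) != y)
    return fooled_both / len(fooled_source)
```

A rate close to 1 suggests the attack generalizes across models, while a sharp drop relative to the success rate on the source model reflects the poor generalization noted above.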

8.4 Automation

Although most research works generate adversarial examples automatically, many still attack textual DNNs manually. In white-box attacks, leveraging the loss function of the DNN can identify the most affected points (e.g., characters, words) in a text automatically; the attacks are then performed on these points by automatically modifying the corresponding text. In black-box attacks, some works automatically query the DNNs and identify the most affected points by examining the outputs. But many other works use manual attacks. For example, [3] concatenated manually chosen meaningless paragraphs to fool reading comprehension systems, in order to show the vulnerability of the victim DNNs. Many research works followed this approach, aiming not at practical attacks but at research problems. Such manual work is time-consuming and impractical, and we suggest avoiding it in future attack works.
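The automatic black-box procedure described above, querying the model and locating the most affected points from its outputs, can be sketched as follows. The `toy_sentiment_score` function is a hypothetical stand-in for a real black-box model, and word deletion is only one of several possible ways to probe importance.

```python
from typing import Callable, List, Tuple


def toy_sentiment_score(text: str) -> float:
    """Stand-in for a black-box model: returns a positive-class score in [0, 1]."""
    weights = {"great": 0.3, "good": 0.1, "boring": -0.2, "bad": -0.3}
    score = 0.5 + sum(weights.get(tok, 0.0) for tok in text.lower().split())
    return min(1.0, max(0.0, score))


def rank_words_by_importance(
    text: str, query: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Score each word by how much deleting it changes the model output."""
    tokens = text.split()
    base = query(text)
    scored = []
    for i, tok in enumerate(tokens):
        reduced = " ".join(tokens[:i] + tokens[i + 1:])
        scored.append((tok, base - query(reduced)))
    # Words whose removal changes the score most come first.
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)
```

An attack would then modify the top-ranked words first, which requires one query per word but no access to gradients or parameters.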

8.5 Un-attacked textual neural networks

Although most of the common textual DNNs have received attention from adversarial attacks (Section 4.3), some DNNs have not been attacked so far, for example, the generative neural models: Generative Adversarial Networks (GANs) and Variational Auto-Encoders (VAEs). In NLP, they are used to generate texts. Deep generative models are not easy to train and evaluate, and there are no standard solutions in NLP so far. That is probably the reason why there is no work so far that attacks deep generative models in NLP. Future works may consider generating adversarial examples for text generation DNNs.

9 Conclusion

This article presented the first comprehensive survey on generating textual adversarial examples for deep neural networks. We reviewed recent findings, and summarized and analyzed them from different aspects. We attempted to provide a good reference for researchers to gain insight into the challenges, methods and issues in this research direction, and we hope that more robust deep neural networks will be proposed based on the knowledge of adversarial attacks.


  • [1] C. Szegedy, W. Zaremba, I. Sutskever, and J. Bruna, “Intriguing properties of neural networks,” in Proc. of the 2nd International Conference on Learning Representations (ICLR 2014), 2014.
  • [2] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and Harnessing Adversarial Examples,” in Proc. of the 3rd International Conference on Learning Representations (ICLR 2015), 2015.
  • [3] R. Jia and P. Liang, “Adversarial examples for evaluating reading comprehension systems,” in Proc. of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), Copenhagen, Denmark, September 2017, pp. 2021–2031.
  • [4] Z. Zhao, D. Dua, and S. Singh, “Generating natural adversarial examples,” arXiv preprint arXiv:1710.11342, 2017.
  • [5] Z. Gong, W. Wang, B. Li, D. Song, and W.-S. Ku, “Adversarial texts with gradient methods,” arXiv preprint arXiv:1801.07175, 2018.
  • [6] X. Yuan, P. He, Q. Zhu, R. R. Bhat, and X. Li, “Adversarial Examples: Attacks and Defenses for Deep Learning,” CoRR, vol. abs/1712.07107, 2017.
  • [7] N. Akhtar and A. S. Mian, “Threat of Adversarial Attacks on Deep Learning in Computer Vision: A Survey,” IEEE Access, vol. 6, pp. 14 410–14 430, 2018.
  • [8] M. Barreno, B. Nelson, A. D. Joseph, and J. D. Tygar, “The security of machine learning,” Machine Learning, vol. 81, no. 2, pp. 121–148, 2010.
  • [9] J. Gilmer, R. P. Adams, I. J. Goodfellow, D. Andersen, and G. E. Dahl, “Motivating the Rules of the Game for Adversarial Example Research,” CoRR, vol. abs/1807.06732, 2018.
  • [10] B. Biggio and F. Roli, “Wild patterns: Ten years after the rise of adversarial machine learning,” Pattern Recognition, vol. 84, pp. 317–331, 2018.
  • [11] Q. Liu, P. Li, W. Zhao, W. Cai, S. Yu, and V. C. M. Leung, “A Survey on Security Threats and Defensive Techniques of Machine Learning: A Data Driven View,” IEEE Access, vol. 6, pp. 12 103–12 117, 2018.
  • [12] N. Papernot, P. McDaniel, A. Swami, and R. Harang, “Crafting Adversarial Input Sequences for Recurrent Neural Networks,” in Military Communications Conference, MILCOM 2016-2016 IEEE.   IEEE, 2016, pp. 49–54.
  • [13] N. Papernot, P. D. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, “The Limitations of Deep Learning in Adversarial Settings,” in IEEE European Symposium on Security and Privacy (EuroS&P 2016), Saarbrücken, Germany, March 2016, pp. 372–387.
  • [14] N. Carlini and D. A. Wagner, “Towards Evaluating the Robustness of Neural Networks,” in Proc. of the 2017 IEEE Symposium on Security and Privacy (SP 2017), San Jose, CA, USA, May 2017, pp. 39–57.
  • [15] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, “Deepfool: A simple and accurate method to fool deep neural networks,” in Proc. of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, June 2016, pp. 2574–2582.
  • [16] N. Papernot, P. D. McDaniel, I. J. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, “Practical Black-Box Attacks against Machine Learning,” in Proc. of the 2017 ACM on Asia Conference on Computer and Communications Security (AsiaCCS 2017), Abu Dhabi, United Arab Emirates.
  • [17] J. Ebrahimi, A. Rao, D. Lowd, and D. Dou, “HotFlip: White-Box Adversarial Examples for Text Classification,” in Proc. of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), Melbourne, Australia, July 2018, pp. 31–36.
  • [18] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed Representations of Words and Phrases and their Compositionality,” in Proc. of the 27th Annual Conference on Neural Information Processing Systems (NIPS 2013), Lake Tahoe, Nevada, United States, December 2013, pp. 3111–3119.
  • [19] T. Young, D. Hazarika, S. Poria, and E. Cambria, “Recent Trends in Deep Learning Based Natural Language Processing,” IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.
  • [20] Y. Goldberg, Neural Network Methods for Natural Language Processing, ser. Synthesis Lectures on Human Language Technologies.   Morgan & Claypool Publishers, 2017.
  • [21] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to Sequence Learning with Neural Networks,” in Proc. of the Annual Conference on Neural Information Processing Systems 2014 (NIPS 2014), Montreal, Quebec, Canada, December 2014, pp. 3104–3112.
  • [22] D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” CoRR, vol. abs/1409.0473, 2014.
  • [23] K. Grosse, N. Papernot, P. Manoharan, M. Backes, and P. McDaniel, “Adversarial perturbations against deep neural networks for malware classification,” arXiv preprint arXiv:1606.04435, 2016.
  • [24] K. Grosse, N. Papernot, P. Manoharan, M. Backes, and P. D. McDaniel, “Adversarial Examples for Malware Detection,” in Proc. of the 22nd European Symposium on Research in Computer Security (ESORICS 2017), Oslo, Norway, September 2017, pp. 62–79.
  • [25] A. Al-Dujaili, A. Huang, E. Hemberg, and U. O’Reilly, “Adversarial Deep Learning for Robust Detection of Binary Encoded Malware,” in Proc. of the 2018 IEEE Security and Privacy Workshops (SPW 2018), San Francisco, CA, USA, May 2018, pp. 76–82.
  • [26] E. Wallace and J. L. Boyd-Graber, “Trick Me If You Can: Adversarial Writing of Trivia Challenge Questions,” in Proc. of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), Melbourne, Australia, July 2018, pp. 127–133.
  • [27] Y. Kim, “Convolutional Neural Networks for Sentence Classification,” in Proc. of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha, Qatar, October 2014, pp. 1746–1751.
  • [28] X. Zhang, J. J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” in Proc. of the Annual Conference on Neural Information Processing Systems 2015 (NIPS 2015), Montreal, Quebec, Canada, December 2015, pp. 649–657.
  • [29] B. Liang, H. Li, M. Su, P. Bian, X. Li, and W. Shi, “Deep Text Classification Can be Fooled,” arXiv preprint arXiv:1704.08006, 2017.
  • [30] J. Gao, J. Lanchantin, M. L. Soffa, and Y. Qi, “Black-box generation of adversarial text sequences to evade deep learning classifiers,” arXiv preprint arXiv:1801.04354, 2018.
  • [31] Y. Belinkov and Y. Bisk, “Synthetic and natural noise both break neural machine translation,” arXiv preprint arXiv:1711.02173. ICLR, 2018.
  • [32] P. Yang, J. Chen, C.-J. Hsieh, J.-L. Wang, and M. I. Jordan, “Greedy attack and gumbel attack: Generating adversarial examples for discrete data,” arXiv preprint arXiv:1805.12316, 2018.
  • [33] J. Ebrahimi, D. Lowd, and D. Dou, “On Adversarial Examples for Character-Level Neural Machine Translation,” in Proc. of the 27th International Conference on Computational Linguistics (COLING 2018), Santa Fe, New Mexico, USA, August 2018, pp. 653–663.
  • [34] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, p. 533, 1986.
  • [35] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning, 2016, vol. 1.
  • [36] C. Goller and A. Kuchler, “Learning task-dependent distributed representations by backpropagation through structure,” Neural Networks, vol. 1, pp. 347–352, 1996.
  • [37] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A Neural Probabilistic Language Model,” Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003.
  • [38] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [39] A. Graves, A. Mohamed, and G. E. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. of IEEE 2013 International Conference on Acoustics, Speech and Signal Processing (ICASSP 2013), Vancouver, BC, Canada, May 2013, pp. 6645–6649.
  • [40] M. Sun, F. Tang, J. Yi, F. Wang, and J. Zhou, “Identify Susceptible Locations in Medical Records via Adversarial Attacks on Deep Predictive Models,” in Proc. of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD 2018), London, UK, August 2018, pp. 793–801.
  • [41] M. Sato, J. Suzuki, H. Shindo, and Y. Matsumoto, “Interpretable adversarial perturbation in input embedding space for text,” arXiv preprint arXiv:1805.02917, 2018.
  • [42] M. Iyyer, J. Wieting, K. Gimpel, and L. Zettlemoyer, “Adversarial Example Generation with Syntactically Controlled Paraphrase Networks,” in Proc. of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), New Orleans, Louisiana, USA, June 2018, pp. 1875–1885.
  • [43] K. S. Tai, R. Socher, and C. D. Manning, “Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks,” in Proc. of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015), Beijing, China, July 2015, pp. 1556–1566.
  • [44] S. Wang and J. Jiang, “Machine Comprehension Using Match-LSTM and Answer Pointer,” CoRR, vol. abs/1608.07905, 2016.
  • [45] Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen, “Enhanced LSTM for Natural Language Inference,” in Proc. of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vancouver, BC, Canada, July 2017, pp. 1657–1668.
  • [46] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean, “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation,” CoRR, vol. abs/1609.08144, 2016.
  • [47] P. Minervini and S. Riedel, “Adversarially Regularising Neural NLI Models to Integrate Logical Background Knowledge,” in Proc. of the 22nd Conference on Computational Natural Language Learning (CoNLL 2018), Brussels, Belgium, October 2018, pp. 65–74.
  • [48] T. Rocktäschel, E. Grefenstette, K. M. Hermann, T. Kociský, and P. Blunsom, “Reasoning about entailment with neural attention,” in Proc. of the 2016 International Conference on Learning Representations (ICLR 2016), 2016.
  • [49] L. Deng and Y. Liu, Deep Learning in Natural Language Processing.   Springer Singapore, 2018.
  • [50] T. Niu and M. Bansal, “Adversarial Over-Sensitivity and Over-Stability Strategies for Dialogue Models,” in Proc. of the 22nd Conference on Computational Natural Language Learning (CoNLL 2018), Brussels, Belgium, October 2018, pp. 486–496.
  • [51] I. V. Serban, A. Sordoni, Y. Bengio, A. C. Courville, and J. Pineau, “Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models,” in Proc. of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI 2016), Phoenix, Arizona, USA, February 2016, pp. 3776–3784.
  • [52] D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” 2015.
  • [53] M. Cheng, J. Yi, H. Zhang, P.-Y. Chen, and C.-J. Hsieh, “Seq2Sick: Evaluating the Robustness of Sequence-to-Sequence Models with Adversarial Examples,” arXiv preprint arXiv:1803.01128, 2018.
  • [54] G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush, “OpenNMT: Open-Source Toolkit for Neural Machine Translation,” in Proc. of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vancouver, BC, Canada, July 2017, pp. 67–72.
  • [55] H. He, A. Balakrishnan, M. Eric, and P. Liang, “Learning Symmetric Collaborative Dialogue Agents with Dynamic Knowledge Graph Embeddings,” in Proc. of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vancouver, BC, Canada, July 2017, pp. 1766–1776.
  • [56] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is All you Need,” in Proc. of the Annual Conference on Neural Information Processing Systems 2017 (NIPS 2017), Long Beach, CA, USA, December 2017, pp. 6000–6010.
  • [57] S. Singh, C. Guestrin, and M. T. Ribeiro, “Semantically Equivalent Adversarial Rules for Debugging NLP models,” in Proc. of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), Melbourne, Australia, July 2018, pp. 856–865.
  • [58] M. J. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, “Bidirectional Attention Flow for Machine Comprehension,” CoRR, vol. abs/1611.01603, 2016.
  • [59] M. R. Costa-Jussà and J. A. R. Fonollosa, “Character-based Neural Machine Translation,” in Proc. of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany, August 2016.
  • [60] A. P. Parikh, O. Täckström, D. Das, and J. Uszkoreit, “A Decomposable Attention Model for Natural Language Inference,” in Proc. of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), Austin, Texas, USA, November 2016, pp. 2249–2255.
  • [61] J. Li, W. Monroe, A. Ritter, D. Jurafsky, M. Galley, and J. Gao, “Deep Reinforcement Learning for Dialogue Generation,” in Proc. of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), Austin, Texas, USA, November 2016, pp. 1192–1202.
  • [62] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, “Generative Adversarial Nets,” in Proc. of the Annual Conference on Neural Information Processing Systems 2014 (NIPS 2014), Montreal, Quebec, Canada, December 2014, pp. 2672–2680.
  • [63] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” in Proc. of the 2014 International Conference on Learning Representations (ICLR 2014), 2014.
  • [64] S. Samanta and S. Mehta, “Towards crafting text adversarial samples,” arXiv preprint arXiv:1707.02812, 2017.
  • [65] ——, “Generating Adversarial Text Samples,” in Proc. of the 40th European Conference on IR Research (ECIR 2018), Grenoble, France, March 2018, pp. 744–749.
  • [66] Y. Wu, D. Bamman, and S. Russell, “Adversarial training for relation extraction,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 1778–1783.
  • [67] I. Rosenberg, A. Shabtai, L. Rokach, and Y. Elovici, “Generic Black-Box End-to-End Attack against RNNs and Other API Calls Based Malware Classifiers,” arXiv preprint arXiv:1707.05970, 2017.
  • [68] M. Blohm, G. Jagfeld, E. Sood, X. Yu, and N. T. Vu, “Comparing Attention-Based Convolutional and Recurrent Neural Networks: Success and Limitations in Machine Reading Comprehension,” in Proc. of the 22nd Conference on Computational Natural Language Learning (CoNLL 2018), Brussels, Belgium, October 2018, pp. 108–118.
  • [69] Y. Wang and M. Bansal, “Robust Machine Comprehension Models via Adversarial Training,” in Proc. of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), New Orleans, Louisiana, June 2018, pp. 575–581.
  • [70] W. Hu and Y. Tan, “Black-Box Attacks against RNN based Malware Detection Algorithms,” arXiv preprint arXiv:1705.08131, 2017.
  • [71] C. Song and V. Shmatikov, “Fooling OCR Systems with Adversarial Text Images,” CoRR, vol. abs/1802.05385, 2018.
  • [72] X. Yuan, P. He, and X. A. Li, “Adaptive adversarial attack on scene text recognition,” CoRR, vol. abs/1807.03326, 2018.
  • [73] H. Chen, H. Zhang, P.-Y. Chen, J. Yi, and C.-J. Hsieh, “Attacking visual language grounding with adversarial examples: A case study on neural image captioning,” in Proceedings of ACL 2018, 2018.
  • [74] X. Xu, X. Chen, C. Liu, A. Rohrbach, T. Darrell, and D. Song, “Fooling Vision and Language Models Despite Localization and Attention Mechanism,” in Proc. of 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, June 2018, pp. 4951–4961.
  • [75] H. Shi, J. Mao, T. Xiao, Y. Jiang, and J. Sun, “Learning Visually-Grounded Semantics from Contrastive Adversarial Samples,” in Proc. of the 27th International Conference on Computational Linguistics (COLING 2018), Santa Fe, New Mexico, USA, August 2018, pp. 3715–3727.
  • [76] N. Carlini and D. A. Wagner, “Audio adversarial examples: Targeted attacks on speech-to-text.”
  • [77] P. Neekhara, S. Hussain, S. Dubnov, and F. Koushanfar, “Adversarial Reprogramming of Sequence Classification Neural Networks,” CoRR, vol. abs/1809.01829, 2018.
  • [78] C. Wong, “Dancin seq2seq: Fooling text classifiers with adversarial text example generation,” arXiv preprint arXiv:1712.05419, 2017.
  • [79] A. Al-Dujaili, A. Huang, E. Hemberg, and U. O’Reilly, “Adversarial Deep Learning for Robust Detection of Binary Encoded Malware,” in Proc. of the 2018 IEEE Security and Privacy Workshops (SP Workshops 2018), San Francisco, CA, USA, May 2018, pp. 76–82.
  • [80] A. Chan, L. Ma, F. Juefei-Xu, X. Xie, Y. Liu, and Y. S. Ong, “Metamorphic Relation Based Adversarial Attacks on Differentiable Neural Computer,” CoRR, vol. abs/1809.02444, 2018.
  • [81] D. Kang, T. Khot, A. Sabharwal, and E. Hovy, “Adventure: Adversarial training for textual entailment with knowledge-guided examples,” in Proceedings of ACL 2018, 2018.
  • [82] T. He and J. Glass, “Detecting egregious responses in neural sequence-to-sequence models,” CoRR, vol. abs/1809.04113, 2018.
  • [83] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer, “DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia,” Semantic Web, vol. 6, no. 2, pp. 167–195, 2015.
  • [84] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, “Learning Word Vectors for Sentiment Analysis,” in Proc. of the 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011), Portland, Oregon, USA, June 2011, pp. 142–150.
  • [85] V. Metsis, I. Androutsopoulos, and G. Paliouras, “Spam Filtering with Naive Bayes - Which Naive Bayes?” in Proc. of the Third Conference on Email and Anti-Spam (CEAS 2006), Mountain View, California, USA, July 2006.
  • [86] C. Wang, M. Zhang, S. Ma, and L. Ru, “Automatic online news issue construction in web environment,” in Proc. of the 17th International Conference on World Wide Web (WWW 2008), Beijing, China, April 2008, pp. 457–466.
  • [87] J. J. McAuley and J. Leskovec, “Hidden factors and hidden topics: understanding rating dimensions with review text,” in Proc. of the 7th ACM Conference on Recommender Systems (RecSys 2013), Hong Kong, China, October 2013, pp. 165–172.
  • [88] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts, “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank,” in Proc. of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), Seattle, Washington, USA, October 2013, pp. 1631–1642.
  • [89] R. Johnson and T. Zhang, “Semi-supervised Convolutional Neural Networks for Text Categorization via Region Embedding,” in Proc. of the Annual Conference on Neural Information Processing Systems 2015 (NIPS 2015), Montreal, Quebec, Canada, December 2015, pp. 919–927.
  • [90] B. Pang and L. Lee, “Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales,” in Proc. of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), Michigan, USA, June 2005, pp. 115–124.
  • [91] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, “RCV1: A New Benchmark Collection for Text Categorization Research,” Journal of Machine Learning Research, vol. 5, pp. 361–397, 2004.
  • [92] H. Yannakoudakis, T. Briscoe, and B. Medlock, “A New Dataset and Method for Automatically Grading ESOL Texts,” in Proc. of the 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011), Portland, Oregon, USA, June 2011, pp. 180–189.
  • [93] D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, and K. Rieck, “DREBIN: Effective and Explainable Detection of Android Malware in Your Pocket,” in Proc. of the 21st Annual Network and Distributed System Security Symposium (NDSS 2014), San Diego, California, USA, February 2014.
  • [94] R. Ronen, M. Radu, C. Feuerstein, E. Yom-Tov, and M. Ahmadi, “Microsoft Malware Classification Challenge,” CoRR, vol. abs/1802.10135, 2018.
  • [95] S. Riedel, L. Yao, and A. McCallum, “Modeling Relations and Their Mentions without Labeled Text,” in Proc. of 2010 European Conference on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD 2010), Barcelona, Spain, September 2010, pp. 148–163.
  • [96] A. Liu, S. Soderland, J. Bragg, C. H. Lin, X. Ling, and D. S. Weld, “Effective Crowd Annotation for Relation Extraction,” in Proc. of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2016), San Diego, California, USA, June 2016, pp. 897–906.
  • [97] X. Li and D. Roth, “Learning Question Classifiers,” in Proc. of the 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan, August 2002.
  • [98] N. A. Abdulla, N. A. Ahmed, M. A. Shehab, and M. Al-Ayyoub, “Arabic sentiment analysis: Lexicon-based and corpus-based,” in Proc. of the 2013 IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies (AEECT 2013). IEEE, 2013, pp. 1–6.
  • [99] J. Boyd-Graber, S. Feng, and P. Rodriguez, “Human-Computer Question Answering: The Case for Quizbowl,” Springer, 2018.
  • [100] M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler, “MovieQA: Understanding Stories in Movies through Question-Answering,” in Proc. of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, June 2016, pp. 4631–4640.
  • [101] M. Cettolo, C. Girardi, and M. Federico, “WIT3: Web Inventory of Transcribed and Translated Talks,” in Conference of European Association for Machine Translation, 2012, pp. 261–268.
  • [102] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, “A large annotated corpus for learning natural language inference,” in Proc. of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), Lisbon, Portugal, September 2015, pp. 632–642.
  • [103] T. Khot, A. Sabharwal, and P. Clark, “SciTaiL: A Textual Entailment Dataset from Science Question Answering,” in Proc. of the 32nd AAAI Conference on Artificial Intelligence (AAAI 2018), New Orleans, Louisiana, USA, February 2018, pp. 5189–5197.
  • [104] A. Williams, N. Nangia, and S. R. Bowman, “A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference,” in Proc. of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2018), New Orleans, Louisiana, USA, June 2018, pp. 1112–1122.
  • [105] R. Lowe, N. Pow, I. Serban, and J. Pineau, “The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems,” in Proc. of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2015), Prague, Czech Republic, September 2015, pp. 285–294.
  • [106] J. Tiedemann, “News from OPUS - A collection of multilingual parallel corpora with tools and interfaces,” 2009.
  • [107] M. Marelli, L. Bentivogli, M. Baroni, R. Bernardi, S. Menini, and R. Zamparelli, “SemEval-2014 Task 1: Evaluation of Compositional Distributional Semantic Models on Full Sentences through Semantic Relatedness and Textual Entailment,” in Proc. of the 8th International Workshop on Semantic Evaluation (SemEval@COLING 2014), Dublin, Ireland, August 2014, pp. 1–8.
  • [108] T. Miyato, A. M. Dai, and I. Goodfellow, “Adversarial training methods for semi-supervised text classification,” arXiv preprint arXiv:1605.07725, 2016.
  • [109] Y. Zhu, O. Groth, M. S. Bernstein, and L. Fei-Fei, “Visual7W: Grounded Question Answering in Images,” in Proc. of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, June 2016, pp. 4995–5004.
  • [110] T. Miyato, S.-i. Maeda, M. Koyama, K. Nakae, and S. Ishii, “Distributional smoothing with virtual adversarial training,” in Proc. of the 4th International Conference on Learning Representations (ICLR 2016), 2016.
  • [111] Y. Liu, X. Chen, C. Liu, and D. Song, “Delving into Transferable Adversarial Examples and Black-box Attacks,” in Proc. of the 2017 International Conference on Learning Representations (ICLR 2017), 2017.