TextDecepter: Hard Label Black Box Attack on Text Classification


Abstract

Machine learning has been proven to be susceptible to carefully crafted samples, known as adversarial examples. The generation of these adversarial examples helps make models more robust and gives us insight into the underlying decision-making of these models. Over the years, researchers have successfully attacked image classifiers in both white-box and black-box settings. However, these methods are not directly applicable to texts, as text data is discrete in nature. In recent years, research on crafting adversarial examples against textual applications has been on the rise. In this paper, we present a novel approach for hard-label black-box attacks against Natural Language Processing (NLP) classifiers, where no model information is disclosed, and an attacker can only query the model to get the final decision of the classifier, without the confidence scores of the classes involved. Such an attack scenario applies to real-world black-box models used for security-sensitive applications such as sentiment analysis and toxic content detection.


I Introduction

Machine learning has shown superiority over humans for tasks like image recognition and speech recognition, as well as security-critical applications like bot, malware, or spam detection. However, machine learning has been proven to be susceptible to carefully crafted adversarial examples. In recent years, research on the generation of such adversarial examples and on the development of defenses against them has been on the rise. These adversarial examples help make models more robust by highlighting the gap between sensory information processing in humans and the decisions made by machines. Attack algorithms have been formulated for image classification problems by [2], [7]. A classic example of an adversarial attack is that of a self-driving car crashing into another car because it ignores the stop sign, the stop sign being an adversarial example that an adversary intentionally placed in place of the original sign. An example in the textual domain is a spam detector that fails to detect a spam email, where the attacker has intentionally changed a few words or characters in the email to deceive the detector.

Attacks can be broadly classified, based on the amount of information available to the attacker, as black-box or white-box attacks. White-box attacks are those in which the attacker has full knowledge of the model's architecture, its weights, and the examples it was trained on. Black-box attacks refer to those in which only the output of the model is accessible to the attacker. Black-box attacks can be further classified into three types based on the information exposed. In the first type, the probability scores of the outputs are accessible to the attacker; these are referred to as score-based black-box attacks. In the second type, information about the training data is known to the attacker. In the third type, only the final decision of the classifier is accessible to the attacker, with no access to the confidence scores of the classes. We refer to this third type as the hard-label black-box attack.

In the NLP domain, researchers have mainly formulated attacks in the white-box setting [4], [12], with complete knowledge of gradients, or in the black-box setting with confidence scores accessible to the attacker [6], [12], [1], [10]. To the best of our knowledge, no prior work has formulated adversarial attacks against NLP classifiers in the hard-label black-box setting. We emphasize that hard-label black-box attacks are an important category of attacks, highly relevant to real-world applications, since confidence scores can easily be hidden to prevent easy attacks.

Adversarial attacks on natural language models have to fulfill certain criteria to qualify as successful: (1) semantic similarity: the meaning of the crafted example should be the same as that of the original text, as judged by humans; (2) syntactic correctness: the crafted example should be grammatically correct; (3) language fluency: the generated example should look natural. [10] proposed TextFooler, which aims to preserve these three properties while generating adversarial examples.

We focus on the text classification task, which underlies applications such as sentiment analysis, spam detection, and topic modeling. Sentiment analysis is widely used in online recommendation systems, where reviews and comments are classified into a set of categories that are useful for ranking products or movies [13]. Text classification is also used in applications critical for online safety, like toxic content detection [15]. Such applications involve classifying comments or reviews into classes like irony, sarcasm, harassment, and abusive content.

Adversarial attacks on text classifiers consist of two main steps: first, identifying the important words in the text; second, introducing perturbations in those words. To find the important words, gradients are used in the white-box setting and confidence scores in the black-box setting. In the absence of gradients or confidence scores in the hard-label black-box setting, it is non-trivial to locate important words. In the second step, the perturbation can be either character-level, like introducing a space or replacing characters with visually similar ones, or word-level, by replacing a word with its synonym. The selection of a character-level perturbation or a synonym for replacement is typically based on the decrease in the confidence value of the original class when the perturbation is introduced. In the absence of confidence scores, there are no direct indicators to guide the selection of a perturbation, unless the replacement of a single word leads to misclassification of the entire text, which is not always the case.

In our work, we come up with heuristics to determine the important sentences and words in the first step and to select an appropriate perturbation in the second step. The heuristics help us select a synonym replacement for each important word in such a way that each successive replacement moves the text towards the decision boundary.

Our main contributions are as follows:

  1. We propose a novel approach to formulate natural adversarial examples against NLP classifiers in the hard label black-box setting.

  2. We test our attack algorithm on three state-of-the-art classification models over two popular text classification datasets.

  3. We improve upon the grammatical correctness of the generated adversarial examples.

  4. We also decrease the memory requirement for the attack when compared to published attack systems involving word-level perturbations.

II Attack Design

II-A Problem Formulation

Given a set $\mathcal{X}$ consisting of all texts and a set of labels $\mathcal{Y} = \{Y_1, \dots, Y_k\}$, we have a text classification model $F : \mathcal{X} \rightarrow \mathcal{Y}$ which maps from the input space to the set of labels. Let there be a text $X \in \mathcal{X}$ whose ground truth label is $Y$, i.e. $F(X) = Y$. We also have a semantic similarity function $\mathrm{Sim} : \mathcal{X} \times \mathcal{X} \rightarrow (0,1)$. Then, a successful adversarial attack changes the text $X$ to $X_{adv}$, such that

$$F(X_{adv}) \neq Y, \qquad \mathrm{Sim}(X, X_{adv}) \geq \epsilon,$$

where $\epsilon$ is the minimum similarity between the original and adversarial text.

Let us consider a binary classification model $F$ with labels $\{Y, Y'\}$ which we want to attack. We are given a piece of text $X$ with $n$ sentences, $X = \{s_1, s_2, \dots, s_n\}$. Let the actual label of $X$ be $Y$, i.e. $F(X) = Y$. We input each of the sentences to $F$ and get their individual labels. Let $S_{orig}$ and $S_{other}$ be sets of sentences such that $S_{orig} = \{s_i : F(s_i) = Y\}$ and $S_{other} = \{s_i : F(s_i) = Y'\}$. Further, $S_{orig} \cup S_{other} = X$ and $S_{orig} \cap S_{other} = \emptyset$. Let $A_k$ be a set of sets, each containing $k$ sentences drawn from $S_{orig}$. Then, we define another set $AGG_k = \{\,a \cup S_{other} : a \in A_k\,\}$. We hereby refer to each element of $AGG_k$ as an 'aggregate'.
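These sets can be constructed with hard-label queries alone. The following Python sketch illustrates the construction for a fixed $k$; the query_label function is a hypothetical stand-in for the target classifier $F$, and sentence segmentation is assumed to have already been done. Algorithm 1 below iterates this construction over increasing $k$.

from itertools import combinations

def build_sets(sentences, orig_label, k, query_label):
    """Partition sentences by their individual hard labels and form size-k aggregates.

    `query_label(text)` is assumed to return only the predicted class of `text`
    (a hard-label query against the target model or API).
    """
    s_orig = [s for s in sentences if query_label(s) == orig_label]
    s_other = [s for s in sentences if query_label(s) != orig_label]
    # A_k: all k-sentence subsets of S_orig; AGG_k: each subset joined with S_other
    # into a single piece of text that can be sent back to the classifier.
    a_k = [list(c) for c in combinations(s_orig, k)]
    agg_k = [" ".join(c + s_other) for c in a_k]
    return s_orig, s_other, agg_k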

II-B Threat Model

We consider the attack in the black-box setting, where an attacker does not have any information about the model weights or architecture and is only allowed to query the model with specific inputs and obtain the final decision of the classifier as output. Further, the class confidence scores are not provided to the attacker, making it a hard-label black-box attack. Although NLP APIs provided by Google, AWS, and Azure do provide confidence scores for the classes, in a real-world application setting, like toxic content detection on a social media platform, the confidence scores are not exposed, making it a hard-label black-box setting. Such an attack scenario also helps to gauge model robustness.

II-C Methodology

The proposed methodology for generating adversarial text has three main steps:

Step 1: Sentence Importance Ranking: We observe that when people convey opinions or emotions, not all sentences convey the same emotion; some sentences are just facts without any emotion or sentiment. The remaining sentences can be stratified by their varying levels of intensity. This forms the basis of our sentence ranking algorithm, which helps prioritize our attack on specific portions of the text in order of importance.

We assume that different sentences in a text contribute to the overall class decision with varying levels of intensity. Each sentence can either support or oppose the final decision of the classifier, and the intensities with which they do so are additive. Consider, for example, sentiment analysis, where the labels are positive and negative.

The assumption of additivity of class intensities of sentences also helps us infer that the sentences in set $S_{other}$, when joined together to form a text, will belong to class $Y'$. We hereby refer to this as the class of set $S_{other}$, or the classifier's decision on set $S_{other}$.

We define the importance of a sentence in set $S_{orig}$ by its ability to change the classifier's decision on set $S_{other}$ from $Y'$ to $Y$. If a sentence alone is able to change the class of set $S_{other}$ from $Y'$ to $Y$, then we consider the sentence to belong to level 1 importance. More generally, if a sentence is able to change the classifier's decision on set $S_{other}$ only when it is put together with some subset of $S_{orig}$ containing $k-1$ other sentences, then the sentence belongs to level $k$ importance. The other sentences in all such subsets also belong to level $k$ importance. Once the importance of a sentence is fixed at a level, we do not consider it at subsequent levels.

Input: Original sentence set S, ground truth label Y, classifier F
Output: Sentence importance ranking SentImp, Aggregates
SentsSentiment ← Find original labels of the sentences in S ;
OrigLabelSents ← Sentences with label Y ;
OtherLabelSents ← Sentences with label Y' ;
Let OrigLabelSents_P represent the set of all P-sentence combinations from OrigLabelSents ;
P ← 1 ;
while OrigLabelSents ≠ ∅ do
    TopSentImp ← ∅ ;
    for Comb in OrigLabelSents_P do
        AGG ← Add OtherLabelSents to Comb and join it to form a string ;
        if F(AGG) = Y then
            Aggregates ← Add AGG ;
            for sent in Comb do
                SentImp[sent] ← P ;
                TopSentImp ← Add sent
            end for
        end if
    end for
    Delete sentences in TopSentImp from OrigLabelSents ;
    P ← P + 1
end while
return SentImp, Aggregates
Algorithm 1 Sentence Importance Ranking
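Below is a minimal Python sketch of Algorithm 1, assuming a hard-label query function query_label(text) that returns only the predicted class; the max_level cap on the combination size, added to bound the number of queries, is our assumption and is not part of the pseudocode above. For simplicity, the sketch does not preserve the original sentence order when joining an aggregate. Algorithm 2, which consumes this ranking, follows.

from itertools import combinations

def sentence_importance(sentences, orig_label, query_label, max_level=3):
    """Level-k sentence importance ranking using only hard-label queries."""
    orig_sents = [s for s in sentences if query_label(s) == orig_label]
    other_sents = [s for s in sentences if query_label(s) != orig_label]

    sent_imp, aggregates = {}, []
    level = 1
    while orig_sents and level <= max_level:
        top_sents = set()
        for comb in combinations(orig_sents, level):
            # Join the candidate combination with the opposite-label sentences.
            agg = " ".join(list(comb) + other_sents)
            if query_label(agg) == orig_label:
                aggregates.append(agg)
                for sent in comb:
                    sent_imp.setdefault(sent, level)
                    top_sents.add(sent)
        # Sentences whose importance is fixed are not considered at later levels.
        orig_sents = [s for s in orig_sents if s not in top_sents]
        level += 1
    return sent_imp, aggregates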
Input: Original text X, ground truth label Y, classifier F, semantic similarity threshold ε, cosine similarity matrix
Output: Adversarial example X_adv
Initialization: X_adv ← X ;
Segment X into sentences to get set S ;
SentImp, Aggregates ← GetSentenceImp(S) ;
WordImpScores ← GetWordImp(SentImp) ;
Create a set W of all words sorted in descending order of their WordImpScores ;
MISCLASSIFIED ← False ;
for w in W do
    if MISCLASSIFIED then
        break ;
    end if
    CANDIDATES ← GetSynonyms(w) ;
    CANDIDATES ← POSFilter(CANDIDATES) ;
    CANDIDATES ← SEMANTICSIMFilter(CANDIDATES) ;
    FINCANDIDATES ← Sort CANDIDATES by semantic similarity ;
    CHANGED ← False ;
    for c in FINCANDIDATES do
        X' ← Replace w with c in X_adv ;
        if F(X') ≠ Y then
            X_adv ← X' ;
            CHANGED ← True ;
            MISCLASSIFIED ← True ;
        end if
        if NOT CHANGED then
            SENT ← Get the sentence to which w belongs s.t. F(SENT) = Y ;
            SENT' ← Replace w with c in SENT ;
            if F(SENT') ≠ Y then
                X_adv ← Replace w with c in X_adv ;
                CHANGED ← True ;
            end if
        end if
        if NOT CHANGED then
            AGG ← Get the aggregate to which w belongs s.t. F(AGG) = Y ;
            AGG' ← Replace w with c in AGG ;
            if F(AGG') ≠ Y then
                X_adv ← Replace w with c in X_adv ;
                CHANGED ← True ;
            end if
        end if
    end for
end for
Algorithm 2 TextDecepter

Step 2: Word importance ranking:

After finding the importance of sentences in step 1, we need to find the importance ranking of the words to be attacked in these sentences. We observe that words with certain Part-of-Speech (POS) tags are more important than others. For example, for a sentiment classification task, adjectives, verbs, and adverbs are more important than nouns, pronouns, conjunctions, or prepositions. Further, we consider adjectives to be more important than adverbs. Consider the sentence, "The movie was very bad". In this sentence, "bad" is the adjective and shapes the sentiment of the sentence. The adverb "very" increases the intensity of the adjective, further increasing the confidence score of the predicted class.
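A small Python sketch of this POS-based word ranking is shown below. It uses NLTK's off-the-shelf tokenizer and tagger, and the numeric priorities assigned to each tag family are illustrative assumptions rather than values specified by the paper.

import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' resources are downloaded

# Illustrative priorities (assumed): lower number = attacked earlier.
POS_PRIORITY = {
    "JJ": 0, "JJR": 0, "JJS": 0,                               # adjectives
    "RB": 1, "RBR": 1, "RBS": 1,                               # adverbs
    "VB": 2, "VBD": 2, "VBG": 2, "VBN": 2, "VBP": 2, "VBZ": 2, # verbs
}

def rank_words(sentence_importance):
    """Order candidate words by (sentence importance level, POS priority).

    `sentence_importance` maps each important sentence to its level from step 1.
    """
    ranked = []
    for sentence, level in sorted(sentence_importance.items(), key=lambda kv: kv[1]):
        for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
            if tag in POS_PRIORITY:
                ranked.append((level, POS_PRIORITY[tag], word))
    ranked.sort()
    return [word for _, _, word in ranked]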

Step 3: Attack: We apply word-level perturbations in the order of word importance obtained from the previous step. We select synonyms to replace the original words using the cosine similarity between word vectors. Further, to maintain the syntax of the language, only those synonyms having the same POS tag as the original word are considered for further evaluation. Experiments are done using both coarse and fine POS tag masks.

Let us look at the details of each of these steps:

  1. Synonym extraction: We use the word embeddings by [14], which obtained state-of-the-art performance on SimLex-999, a dataset designed by [8] to measure how well different models judge the semantic similarity between words.

  2. POS checking: In order to maintain the syntax of the generated adversarial example, we filter out the synonyms that have a different POS tag than the original word. We experiment with both coarse and fine POS tagging (a sketch of these two steps follows this list).
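As an illustration, the sketch below extracts nearest-neighbour candidates from a word-vector table by cosine similarity and applies a POS filter. The paper uses the counter-fitted vectors of [14]; the embedding file name, the use of NLTK for tagging, and the two-character coarse mask are assumptions of this sketch.

import numpy as np
import nltk

def load_vectors(path="counter-fitted-vectors.txt"):
    """Load a plain-text embedding file of 'word v1 v2 ...' lines (file name assumed)."""
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vecs

def get_synonyms(word, vecs, top_k=50):
    """Return the top_k words most cosine-similar to `word` in the embedding table."""
    if word not in vecs:
        return []
    v = vecs[word]
    sims = {w: float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u)))
            for w, u in vecs.items() if w != word}
    return sorted(sims, key=sims.get, reverse=True)[:top_k]

def pos_filter(word, candidates, coarse=True):
    """Keep candidates whose POS tag matches the original word's tag.

    With coarse=True, only the first two characters of the Penn tag are compared
    (e.g. VB vs VBD); this is one simple way to realise a coarse mask.
    """
    def tag(w):
        t = nltk.pos_tag([w])[0][1]
        return t[:2] if coarse else t
    return [c for c in candidates if tag(c) == tag(word)]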

We select a synonym to replace a word based on the following rules, in order of preference:

  1. Replacing it misclassifies the review.

  2. Replacing it misclassifies the sentence to which it belongs (for words in sentences whose class is the original label).

  3. Replacing it misclassifies any of the aggregates to which it belonged while finding the sentence importance ranking and which is classified as the original class $Y$.

If multiple synonyms fulfill the rules, the one that fulfills the rule of highest preference is selected. If multiple synonyms fulfill the same highest-preference rule, the synonym whose replacement keeps the review semantically nearest to the original review is selected. We terminate the algorithm once the text is misclassified, or when all the important words have been iterated over.
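The following Python fragment sketches this three-tier preference check for a single candidate replacement. The helper arguments full_text_with, sentence_of, and aggregates_of are hypothetical stand-ins for the replacement and lookup logic of the attack, and query_label is the hard-label interface to the classifier; only the preference logic itself is taken from the rules above.

def preference_rank(word, candidate, orig_label, query_label,
                    full_text_with, sentence_of, aggregates_of):
    """Return 1, 2 or 3 for the highest-preference rule the candidate satisfies, else None.

    Hypothetical helpers: full_text_with(w, c) returns the whole review with w
    replaced by c; sentence_of(w) returns the original-label sentence containing w
    (or None); aggregates_of(w) returns the original-label aggregates containing w.
    Naive str.replace is used here purely for brevity.
    """
    # Rule 1: the whole review is misclassified after the replacement.
    if query_label(full_text_with(word, candidate)) != orig_label:
        return 1
    # Rule 2: the sentence containing the word flips away from the original label.
    sent = sentence_of(word)
    if sent is not None and query_label(sent.replace(word, candidate)) != orig_label:
        return 2
    # Rule 3: any qualifying aggregate flips away from the original label.
    for agg in aggregates_of(word):
        if query_label(agg.replace(word, candidate)) != orig_label:
            return 3
    return None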

The justification for preferring the sentence over the aggregate to which it belongs comes from the additivity assumption. Let us take a sentence $s$ with $F(s) = Y$. Now, add it to set $S_{other}$ to form an 'aggregate', i.e. $AGG = \{s\} \cup S_{other}$. Assuming the additivity of class intensities of sentences, we can easily see that when the sentences in $AGG$ are joined to form a piece of text, the text either belongs to class $Y'$ or, in case it belongs to class $Y$, the intensity of class $Y$ is lower than for $s$ alone. In other words, an 'aggregate' belonging to class $Y$ has a lower intensity of class $Y$ than the individual sentences belonging to class $Y$ which are part of that aggregate. Hence, a synonym which is able to flip the decision of the classifier on both the sentence and the aggregate (initially classified as $Y$) to which the sentence belongs is preferred over a synonym which is only able to change the classifier's decision on the aggregate.

III Attack Evaluation: Sentiment Analysis

We evaluate our attack methodology by generating adversarial texts for the sentiment analysis task. Sentiment analysis is a text classification task that identifies and characterizes the sentiment of a given text. It is widely used by businesses to gauge customer sentiment towards their products or services by analysing reviews or survey responses.

III-A Datasets and Models

We study the effectiveness of our attack methodology on sentiment classification with the IMDB and Movie Review (MR) datasets. We target three models: word-based convolutional neural network (WordCNN) [11], word-based long short-term memory (WordLSTM) [9], and Bidirectional Encoder Representations from Transformers (BERT) [3]. We attack the pretrained models open-sourced by [10] and evaluate our attack algorithm on the same set of 1000 examples that the authors used in their work. We also run the attack algorithm against the Google Cloud NLP API. A summary of the datasets used by [10] for training the models is given in Table I, and the original accuracies of the models are given in Table II.

Task Dataset Train Test Avg Len
Classification MR 9K 1K 20
IMDB 25K 25K 215
TABLE I: Overview of the datasets used by [10] for training the models
wordCNN wordLSTM BERT
MR 79.9 82.2 85.8
IMDB 89.7 91.2 92.2
TABLE II: Original accuracy of the target models on standard test sets
wordCNN wordLSTM BERT GCP NLP API
MR IMDB MR IMDB MR IMDB MR
Original Accuracy 78 89.4 80.7 90.3 90.4 88.3 76.4
After-attack accuracy 18.9 17.3 18.9 32.5 42.3 30.9 16.6
Attack Success rate 75.8 80.6 76.6 64.0 53.2 65.0 78.3
% Perturbed Words 12.1 3.1 12.2 2.8 15.6 2.1 11.8
Query number 133.2 1368.6 123.2 1918.1 189.5 1719.7 126.8
Average Text Length 20 215 20 215 20 215 20
TABLE III: Automatic evaluation results on text classification datasets (using coarse POS mask)
wordCNN wordLSTM BERT GCP NLP API
MR IMDB MR IMDB MR IMDB MR
Original Accuracy 78 89.4 80.7 90.3 90.4 88.3 76.4
After-attack accuracy 20.7 18.9 21.2 34.4 45.9 33.3 16.6
Attack Success rate 73.5 78.9 73.7 61.9 49.2 62.3 78.3
% Perturbed Words 12.2 3.1 12.0 2.6 14.6 2.2 10.2
Query number 112.5 1230.0 107.0 1650.5 159.0 1507.4 109.6
Average Text Length 20 215 20 215 20 215 20
TABLE IV: Automatic evaluation results on text classification datasets (using fine POS mask)

III-B Evaluation Metrics

  1. Gap between original and after-attack accuracy: We first measure the accuracy of the target model on the 1000 test samples and call it the original accuracy. Then, we measure the accuracy of the target model on the adversarial samples generated from the same test samples and call it the after-attack accuracy. The greater the gap between the original accuracy and the after-attack accuracy, the more successful the attack (a computation sketch for these metrics follows this list).

  2. Percentage of perturbed words: The average percentage of words replaced by their synonyms gives us a metric to quantify the change made to a given text.

  3. Semantic similarity: It tells us the degree to which two given texts carry similar meaning. We use the Universal Sentence Encoder to measure the semantic similarity between the original and adversarial texts. Since our main aim is to generate adversarial texts, we only control the semantic similarity to remain above a certain threshold.

  4. Number of queries: The average number of queries made to the target model tells us the efficiency of the attack.
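A minimal sketch of the first two metrics is given below, assuming a list of (original text, adversarial text, gold label) triples and a hard-label query_label function; the Universal Sentence Encoder check is assumed to be applied separately and is omitted here. The perturbation count assumes a word-for-word substitution attack that preserves text length.

def evaluate_attack(examples, query_label):
    """Compute original accuracy, after-attack accuracy, and mean % perturbed words.

    `examples` is a list of (original_text, adversarial_text, gold_label) tuples.
    """
    orig_correct = attack_correct = 0
    perturb_rates = []
    for orig, adv, gold in examples:
        if query_label(orig) == gold:
            orig_correct += 1
        if query_label(adv) == gold:
            attack_correct += 1
        o_words, a_words = orig.split(), adv.split()
        changed = sum(1 for ow, aw in zip(o_words, a_words) if ow != aw)
        perturb_rates.append(100.0 * changed / max(len(o_words), 1))
    n = len(examples)
    return {
        "original_accuracy": 100.0 * orig_correct / n,
        "after_attack_accuracy": 100.0 * attack_correct / n,
        "avg_perturbed_words_pct": sum(perturb_rates) / n,
    }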

Attack System Attack Success Rate %Perturbed Words
Li et al. [12] 86.7 6.9
Alzantot et al. [1] 97.0 14.7
Jin et al. [10] 99.7 10.0
Ours 64.0 2.4
TABLE V: Comparison of our attack system against other published systems with wordLSTM as the target model (Dataset: IMDB)
Attack System Original Accuracy Attack Success Rate %Perturbed Words
Gao et al. [6] 76.7 67.3 10
Li et al. [12] 76.7 86.9 3.8
Ours 76.4 78.1 10.2
TABLE VI: Comparison of our attack system against other published systems with Google Cloud NLP API as the target model (Dataset: MR)

IV Results

IV-A Automatic Evaluation

We report the results of our hard-label black-box attacks in terms of automatic evaluation on two text classification datasets, using coarse and fine POS masks. The main results are summarized in Tables III and IV. Our attack algorithm is able to bring down the accuracy of all the targeted text classification models, with an attack success rate greater than 50% for every model. Further, the percentage of perturbed words is nearly 3% for all models on the IMDB dataset and between 10% and 16% for all models on the MR dataset. On the IMDB dataset, which has an average length of 215 words, our attack succeeds by perturbing fewer than 7 words on average. This indicates that our attack identifies the important words in the text and makes subtle manipulations to mislead the classifiers. Overall, our algorithm attacks sentiment analysis models with an attack success rate greater than 50%, no matter how long the text sequence or how accurate the target model. Further, our attack requires the least amount of information among all the systems it is compared with.

The attack is also able to fool the GCP NLP API, bringing its accuracy down from 76.4% to 16.6% on the MR dataset while changing only 10.2% of the words in the text to generate the adversary. These results are unprecedented, as they are achieved without any information about the confidence scores of the classes involved. The attack algorithm and the carefully crafted adversarial texts can also be utilized for the study of the interpretability of the BERT model [5].

The number of queries is almost linear in the text length, with a ratio in the range (6, 10), which is on par with [10] and [12].

Movie Review (Positive (POS) Negative (NEG))
Original (Label: NEG) i firmly believe that a good video game movie is going to show up soon i also believe that resident evil is not it
Attack (Label: POS) i firmly feel that a good video game movie is going to show up soon i also believe that resident evil is not it
Original (Label: POS) strange and beautiful film
Attack (Label: NEG) strange and resplendent film
Original (Label: POS) the lion king was a roaring success when it was released eight years ago , but on imax it seems better, not just bigger
Attack (Label: NEG) the lion king was a roaring attainment when it was released eight years ago , but on imax it transpires better , not just bigger
TABLE VII: Examples of original and adversarial sentences from MR (GCP NLP API)
Movie Review (Positive (POS) Negative (NEG))
Original (Label: NEG) after the book i became very sad when i was watching the movie . i am agree that sometimes a film should be different from the original novel but in this case it was more than acceptable . some examples: 1 ) why the ranks are different ( e.g. lt . diestl instead of sergeant etc.) 2 ) the final screen is very poor and makes diestl as a soldier who feds up himself and wants to die . but it is not true in 100 % . just read the book . he was a bull - dog in the last seconds as well . he did not want to die by wrecking his gun and walking simply towards to michael & noah . so this is some kind of a happy end which does not fit at all for this movie .
Attack (Label: POS) after the book i became very bleak when i was watching the movie . i am agree that sometimes a film should be different from the original novel but in this case it was more than acceptable . some examples:1 ) why the ranks are different ( e.g. lt . diestl instead of sergeant etc.) 2 ) the final screen is very flawed and makes diestl as a soldier who feds up himself and wants to die . but it is not true in 100 % . just read the book . he was a bull - dog in the last seconds as well . he did not want to die by wrecking his gun and walking simply towards to michael & noah . so this is some kind of a happy end which does not fit at all for this movie .
Original (Label: NEG) seriously , i do n’t understand how justin long is becoming increasingly popular . he either has the best agent in hollywood , or recently sold his soul to satan . he is almost unbearable to watch on screen , he has little to no charisma , and terrible comedic timing . the only film that he has attempted to anchor that i ’ve remotely enjoyed was waiting … and that is almost solely because i ’ve worked in a restaurant . but i digress . aside from it ’s terrible lead , this film has loads of other debits . i understand that it ’s supposed to be a cheap popcorn comedy , but that does n’t mean that it has to completely insult our intelligence , and have writing so incredibly hackneyed that it borders on offensive . lewis black ’s considerable talent is wasted here too , as he is at his most incendiary when he is unrestrained , which the pg-13 rating certainly wo n’t allow . the film ’s sole bright spot was jonah hill ( who will look almost unrecognizable to fans of the recent superbad due to the amount of weight he lost in the interim ) . his one liners were funny on occasion , but were certainly not enough to make this anywhere close to bearable . if you just want to completely turn your brain off ( or better yet , do n’t have one ) then maybe you ’d enjoy this , but i ca n’t recommend it at all .
Attack (Label: POS) seriously , i do n’t understand how justin long is becoming increasingly popular . he either has the best agent in hollywood , or recently sold his soul to satan . he is almost terrible to watch on screen , he has little to no charisma , and spooky comedic timing . the only film that he has attempted to anchor that i ’ ve remotely enjoyed was waiting … and that is almost solely because i ’ ve worked in a restaurant . but i digress . aside from it ’s spooky lead , this film has loads of other debits . i understand that it ’s supposed to be a miserly popcorn comedy , but that does n’t mean that it has to completely insult our intelligence , and have writing so incredibly hackneyed that it borders on offensive . lewis black ’s considerable talent is wasted here too , as he is at his most incendiary when he is unrestrained , which the pg-13 rating certainly wo n’t allow . the film ’s sole bright spot was jonah hill ( who will look almost unrecognizable to fans of the recent superbad due to the amount of weight he lost in the interim ) . his one liners were funny on occasion , but were certainly not enough to make this anywhere close to bearable . if you just want to completely turn your brain off ( or better yet , do n’t have one ) then maybe you ’d enjoy this , but i ca n’t recommend it at all .
TABLE VIII: Examples of original and adversarial sentences from IMDB (BERT)
Movie Review (Positive (POS) Negative (NEG))
Original (Label: POS) she may not be real , but the laughs are
Attack (Label: NEG) using coarse P.O.S tags she may not be real , but the kidding are
Attack (Label: NEG) using fine P.O.S tags she may not be real , but the chuckles are
Original (Label: NEG) falsehoods pile up , undermining the movie ’s reality and stifling its creator ’s comic voice
Attack (Label: POS) using coarse POS tags falsehoods heaps up , jeopardizes the movie ’s reality and stifle its creator ’s comic voice
Attack (Label: POS) using fine POS tags falsehoods heaps up , jeopardizing the movie ’s reality and stifle its creator ’s comic voice
TABLE IX: Qualitative comparison of adversarial attacks with coarse and fine POS tagging for synonym selection. Target model is wordLSTM

IV-B Benchmark Comparison

We compare our attack against state-of-the-art adversarial attack systems on the same target model and dataset. For the GCP NLP API, we compare our results against [12] and [6] on the MR dataset. With wordLSTM as the target model on IMDB, the comparison is against [12], [1], and [10]. The results of the comparison are summarized in Tables V and VI. The lower attack success rates compared to the other attack systems can be attributed to the fact that our attack does not make use of the class confidence scores, which the other published systems do.

IV-C Human Evaluation

Following the practice of [10], we perform human evaluation by sampling 100 adversarial examples from the MR dataset with wordLSTM as the target model. We perform three experiments to verify the quality of our adversarial examples. First, human judges are asked to give a grammaticality score for a shuffled mix of original and adversarial texts on a scale of 1-5. As shown in Table X, the grammaticality of the adversarial texts with the fine POS tag mask is closer to that of the original texts than with the coarse POS tag mask. Both scores are above 4, meaning that both coarse and fine POS tags result in smooth adversarial texts.

Second, the judges assign classification labels to a shuffled set of original and adversarial texts, for both coarse and fine POS masks. The results show that the overall agreement between the labels of the original and adversarial texts is quite high in both cases, 92% and 93% respectively. This suggests that improving the grammaticality of the adversarial texts using the fine POS mask does not contribute much to how humans perceive the overall meaning of the texts.

Third, the judges determine whether the adversarial texts retain the meaning of the original texts. The judges are given three options: 1 for similar, 0.5 for ambiguous, and 0 for dissimilar. The average sentence similarity score is 0.88 when the fine POS mask is used, compared to 0.86 with the coarse POS mask, suggesting a marginal improvement in sentence similarity with the former.

Fine POS filter Coarse POS filter
Original 4.5 4.5
Adversarial 4.3 4.1
TABLE X: Grammaticality of original and adversarial examples for MR (BERT) on a 1-5 scale

V Discussion

V-A Ablation Study

Aggregates. The most critical step of our algorithm is the use of aggregates, which belong to the original class, to select or reject synonyms for replacement. To validate the effectiveness of this step, we remove the usage of aggregates and select a synonym for replacement only when its presence misclassifies the original text. The results for the BERT model are shown in Table XI. After removing the use of aggregates, we see that the after-attack accuracy increases by 32% for the IMDB dataset and 35% for the MR dataset, respectively. This suggests the importance of aggregates for selecting synonyms for replacement, the removal of which renders the attack ineffective. The aggregates generated in the sentence importance ranking step help us select those synonyms which move the original text towards misclassification.

Orig Acc. After-Attack accuracy % Perturbed words
w/ agg. w/o agg. w/ agg. w/o agg.
IMDB 88.3 30.9 63.2 2.1 0.6
MR 90.4 42.4 75 13.6 10.1
TABLE XI: Comparison of the after-attack accuracies of the BERT model with and without using aggregates for synonym selection

VI Conclusion

We propose a hard-label black-box attack strategy for the text classification task and conduct extensive experiments on sentiment analysis datasets to validate our attack system. The attack algorithm and the carefully crafted adversarial texts can be utilized for the study of the interpretability of NLP models. Finally, we conduct a human evaluation to validate the grammatical and semantic correctness of the generated adversarial examples.

Acknowledgment

I thank Prof. Sunil Shende for insightful discussions. I especially appreciate Kalyan Alapati and Dheenadhyalan Kumaraswamy for helping with human evaluation.

Footnotes

  1. The code and links to the pretrained models are available at https://github.com/SachJbp/TextDecepter

References

  1. M. Alzantot, Y. Sharma, A. Elgohary, B. Ho, M. Srivastava and K. Chang (2018) Generating natural language adversarial examples. arXiv preprint arXiv:1804.07998.
  2. N. Carlini and D. Wagner (2017) MagNet and "Efficient Defenses Against Adversarial Attacks" are not robust to adversarial examples. arXiv preprint arXiv:1711.08478.
  3. J. Devlin, M. Chang, K. Lee and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  4. J. Ebrahimi, A. Rao, D. Lowd and D. Dou (2017) HotFlip: white-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751.
  5. S. Feng, E. Wallace, A. Grissom II, M. Iyyer, P. Rodriguez and J. Boyd-Graber (2018) Pathologies of neural models make interpretations difficult. arXiv preprint arXiv:1804.07781.
  6. J. Gao, J. Lanchantin, M. L. Soffa and Y. Qi (2018) Black-box generation of adversarial text sequences to evade deep learning classifiers. In 2018 IEEE Security and Privacy Workshops (SPW), pp. 50–56.
  7. I. J. Goodfellow, J. Shlens and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
  8. F. Hill, R. Reichart and A. Korhonen (2015) SimLex-999: evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41 (4), pp. 665–695.
  9. S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  10. D. Jin, Z. Jin, J. T. Zhou and P. Szolovits (2019) Is BERT really robust? Natural language attack on text classification and entailment. arXiv preprint arXiv:1907.11932.
  11. Y. Kim (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
  12. J. Li, S. Ji, T. Du, B. Li and T. Wang (2018) TextBugger: generating adversarial text against real-world applications. arXiv preprint arXiv:1812.05271.
  13. W. Medhat, A. Hassan and H. Korashy (2014) Sentiment analysis algorithms and applications: a survey. Ain Shams Engineering Journal 5 (4), pp. 1093–1113.
  14. N. Mrkšić, D. O. Séaghdha, B. Thomson, M. Gašić, L. Rojas-Barahona, P. Su, D. Vandyke, T. Wen and S. Young (2016) Counter-fitting word vectors to linguistic constraints. arXiv preprint arXiv:1603.00892.
  15. C. Nobata, J. Tetreault, A. Thomas, Y. Mehdad and Y. Chang (2016) Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, pp. 145–153.