TextDecepter: Hard Label Black Box Attack on Text Classification
Machine learning has been proven to be susceptible to carefully crafted samples, known as adversarial examples. Generating these adversarial examples helps to make models more robust and gives us insight into their underlying decision making. Over the years, researchers have successfully attacked image classifiers in both white-box and black-box settings. However, these methods are not directly applicable to text, as text data is discrete in nature. In recent years, research on crafting adversarial examples against textual applications has been on the rise. In this paper, we present a novel approach for hard label black-box attacks against Natural Language Processing (NLP) classifiers, where no model information is disclosed and an attacker can only query the model to get the final decision of the classifier, without confidence scores of the classes involved. Such an attack scenario applies to real-world black-box models used for security-sensitive applications such as sentiment analysis and toxic content detection.
Machine learning has shown superiority over humans for tasks like image recognition and speech recognition, and for security-critical applications like bot, malware, or spam detection. However, machine learning has been proven to be susceptible to carefully crafted adversarial examples. In recent years, research on the generation of such adversarial examples, and on defenses against them, has been on the rise. These adversarial examples help to make models more robust by highlighting the gap between sensory information processing in humans and the decisions made by machines. Attack algorithms have been formulated for image classification problems. A classic example of an adversarial attack is that of a self-driving car crashing into another car because it ignores a stop sign, the stop sign being an adversarial example which an adversary intentionally placed in place of the original stop sign. An example in the textual domain is a spam detector which fails to detect a spam email: the spam email is an adversarial example in which the attacker has intentionally changed a few words or characters to deceive the spam detector.
Attacks can be broadly classified, based on the amount of information available to the attacker, as black-box or white-box attacks. White-box attacks are those in which the attacker has full information about the model's architecture, model weights, and the examples it was trained on. Black-box attacks refer to those attacks in which only the final output of the model is accessible to the attacker. Black-box attacks can be further classified into three types. The first type involves attacks in which the probability scores of the outputs are accessible to the attacker, referred to as "score-based black-box attacks". The second type involves the case where information about the training data is known to the attacker. The third type is the one in which only the final decision of the classifier is accessible to the attacker, with no access to the confidence scores of the various classes. We shall refer to the third type as the hard label black-box attack.
In the NLP domain, researchers have mainly formulated attacks in the white-box setting, with complete knowledge of gradients, or in the black-box setting, with confidence scores accessible to the attacker. To the best of our knowledge, there has been no prior work formulating adversarial attacks against NLP classifiers in the hard label black-box setting. We emphasize that hard label black-box attacks are an important category of attacks, highly relevant to real-world applications, since confidence scores can easily be hidden to prevent easier attacks.
Adversarial attacks on natural language models have to satisfy certain rules to qualify as successful: (1) Semantic similarity: the meaning of the crafted example should be the same as that of the original text, as judged by humans. (2) Syntactic correctness: the crafted example should be grammatically correct. (3) Language fluency: the generated example should look natural. Jin et al. proposed TextFooler, which aimed at preserving these three properties while generating adversarial examples.
We focus on the text classification task, which is used for sentiment analysis, spam detection, and topic modeling. Sentiment analysis is widely used in online recommendation systems, where reviews or comments are classified into a set of categories that are useful when ranking products or movies. Text classification is also used in applications critical for online safety, like online toxic content detection. Such applications involve classifying comments or reviews into classes like irony, sarcasm, harassment, and abusive content.
Adversarial attacks on text classifiers consist of two main steps: first, identifying the important words in the text; second, introducing perturbations in those words. For finding the important words, gradients are used in the white-box setting and confidence scores in the black-box setting. In the absence of gradients or confidence scores in the hard label black-box setting, it is non-trivial to locate important words. In the second step, there can be either character-level perturbation, such as inserting a space or replacing characters with visually similar ones, or word-level perturbation, by replacing a word with its synonym. The selection of a character-level perturbation or a synonym for replacement is based on the decrease in the confidence value of the original class on introducing the perturbation. In the absence of confidence scores, there are no direct indicators to help select any type of perturbation unless replacing a single word leads to misclassification of the entire text, which is not always the case.
In our work, we come up with heuristics to help determine the important sentences and words in the first step and to select an appropriate perturbation in the second step. The heuristics help us select a synonym for replacement for each of the important words in such a way that, with each successive replacement, we move towards the decision boundary.
Our main contributions are as follows:
We propose a novel approach to formulate natural adversarial examples against NLP classifiers in the hard label black-box setting.
We test our attack algorithm on three state-of-the-art classification models over two popular text classification tasks.
We improve upon the grammatical correctness of the generated adversarial examples.
We also decrease the memory requirement for the attack when compared to published attack systems involving word-level perturbations.
II Attack Design
II-A Problem Formulation
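Given a set $\mathcal{X}$ consisting of all texts and a set of labels $\mathcal{Y}$, we have a text classification model $F: \mathcal{X} \rightarrow \mathcal{Y}$ which maps from the input space $\mathcal{X}$ to the set of labels $\mathcal{Y}$. Let there be a text $x \in \mathcal{X}$ whose ground truth label is $y \in \mathcal{Y}$ and $F(x) = y$. We also have a semantic similarity function $Sim: \mathcal{X} \times \mathcal{X} \rightarrow [0, 1]$. Then, a successful adversarial attack changes the text $x$ to $x_{adv}$, such that
$$F(x_{adv}) \neq F(x), \quad Sim(x, x_{adv}) \geq \epsilon,$$
where $\epsilon$ is the minimum similarity between original and adversarial text.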
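The success condition above (the predicted label changes while the similarity to the original stays above a minimum) can be sketched in a few lines. This is only an illustration: `toy_classifier` and `toy_similarity` are hypothetical stand-ins, and the word-overlap similarity is an assumption used for brevity, not the sentence encoder used in the paper.

```python
from typing import Callable


def is_successful_attack(
    classify: Callable[[str], str],
    similarity: Callable[[str, str], float],
    original: str,
    adversarial: str,
    min_similarity: float,
) -> bool:
    """True if the label flips while similarity stays at or above the threshold."""
    label_changed = classify(adversarial) != classify(original)
    similar_enough = similarity(original, adversarial) >= min_similarity
    return label_changed and similar_enough


# Hypothetical stand-ins, for illustration only.
def toy_classifier(text: str) -> str:
    return "NEG" if "bad" in text.split() else "POS"


def toy_similarity(a: str, b: str) -> float:
    # Jaccard overlap of word sets (an assumption; not a semantic encoder).
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)


print(is_successful_attack(toy_classifier, toy_similarity,
                           "the movie was bad", "the movie was poor", 0.5))
```

Raising `min_similarity` makes the check stricter: the same pair of texts no longer counts as a successful attack once the threshold exceeds their overlap.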
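Let us consider a binary classification model $F$ with labels $\{l_0, l_1\}$ which we want to attack. We are given a piece of text $x$ with $n$ sentences, $S = \{s_1, s_2, \ldots, s_n\}$. Let the actual label of $x$ be $l_0$ and $F(x) = l_0$. We input each of the sentences to $F$ and get their individual labels. Let $A \subseteq S$ and $B \subseteq S$, such that $F(a) = l_0$ for all $a \in A$ and $F(b) = l_1$ for all $b \in B$. Further, $A \cup B = S$ and $A \cap B = \emptyset$. Let $P_k$ be a set of sets containing $k$ sentences of $A$ each. Then, we define another set, $G_k = \{B \cup p : p \in P_k\}$. We hereby refer to each element of $G_k$ as an 'aggregate'.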
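As a rough sketch of this construction, the sentences can be split by their individual labels, and each aggregate formed by joining the opposite-label sentences with a size-k subset of the original-label sentences. The function names, the toy classifier, and the joining order are assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable, List, Tuple


def split_by_label(
    sentences: List[str], classify: Callable[[str], str], original_label: str
) -> Tuple[List[str], List[str]]:
    """Split sentences into those labeled like the full text and the rest."""
    same = [s for s in sentences if classify(s) == original_label]
    other = [s for s in sentences if classify(s) != original_label]
    return same, other


def build_aggregates(same: List[str], other: List[str], k: int) -> List[List[str]]:
    """Each aggregate: all opposite-label sentences plus k original-label ones."""
    return [list(other) + list(subset) for subset in combinations(same, k)]


# Hypothetical classifier, for illustration only.
def toy_classifier(text: str) -> str:
    return "NEG" if {"bad", "poor"} & set(text.split()) else "POS"


sentences = ["the plot was bad", "the acting was poor",
             "i saw it on friday", "the music was nice"]
same, other = split_by_label(sentences, toy_classifier, "NEG")
for aggregate in build_aggregates(same, other, 1):
    print(aggregate)
```

The number of aggregates at level k grows combinatorially with the number of original-label sentences, which is worth keeping in mind when every aggregate costs one query to the target model.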
II-B Threat Model
We consider the attack in the black-box setting, where an attacker does not have any information about the model weights or architecture and is only allowed to query the model with specific inputs and get the final decision of the classifier as output. Further, the class confidence scores are not provided to the attacker in the output, making it a hard label black-box attack. Although NLP APIs provided by Google, AWS, and Azure do provide confidence scores for the classes, in a real-world application setting, like toxic content detection on a social media platform, the confidence scores are not provided, making it a hard label black-box setting. Such an attack scenario also helps to gauge model robustness.
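The threat model can be illustrated with a small wrapper that hides the scores of an underlying model and exposes only the top label, while counting queries. `HardLabelOracle` and `toy_scores` are hypothetical names introduced for this sketch; they do not correspond to any real API.

```python
from typing import Callable, Dict


class HardLabelOracle:
    """Wraps a scoring model so that callers only ever see the top label."""

    def __init__(self, score_fn: Callable[[str], Dict[str, float]]):
        self._score_fn = score_fn  # hidden from the attacker
        self.num_queries = 0       # tracked to measure attack cost

    def query(self, text: str) -> str:
        self.num_queries += 1
        scores = self._score_fn(text)
        # Only the arg-max label is returned; the scores are discarded.
        return max(scores, key=scores.get)


# Hypothetical scoring model, for illustration only.
def toy_scores(text: str) -> Dict[str, float]:
    negative_hits = sum(word in {"bad", "poor", "dull"} for word in text.split())
    p_neg = min(0.9, 0.2 + 0.4 * negative_hits)
    return {"NEG": p_neg, "POS": 1.0 - p_neg}


oracle = HardLabelOracle(toy_scores)
print(oracle.query("the movie was bad"), oracle.query("a nice film"))
print("queries used:", oracle.num_queries)
```

Keeping the query counter inside the wrapper makes it easy to report the number of queries an attack needs.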
The proposed methodology for generating adversarial text has three main steps:
Step 1: Sentence importance ranking: We observe that when people convey opinions or emotions, not all sentences convey the same emotion; some sentences are just facts without any emotion or sentiment. The other sentences can also be stratified based on their varying levels of intensity. This forms the basis of our sentence ranking algorithm, which helps to prioritize our attack on specific portions of the text in order of importance.
We assume that different sentences in a text contribute to the overall class decision with varying levels of intensity. Each sentence can either support or oppose the final decision of the classifier, and the intensities with which they do so are additive. Let us take the example of sentiment analysis, where the labels are positive and negative.
Let $A$ denote the set of sentences that the classifier individually assigns the original label $l_0$ of the text, and $B$ the set of sentences it individually assigns the other label $l_1$. The assumption of additivity of class intensities of sentences also helps us to infer that the sentences in set $B$, when joined together to form a text, will belong to class $l_1$. Hereafter, we refer to this as the class of set $B$, or the classifier's decision on set $B$.
We define the importance of a sentence in set $A$ by its ability to change the classifier's decision on set $B$ from $l_1$ to $l_0$. If a sentence alone is able to change the class of set $B$ from $l_1$ to $l_0$, then we consider the sentence to belong to level 1 importance. More generally, if a sentence is able to change the classifier's decision on set $B$ only when it is put together with some subset of $A$ with $k-1$ sentences, then the sentence belongs to level $k$ importance. The other sentences in all such subsets also belong to level $k$ importance. Also, once the importance of a sentence is fixed at the $k$-th level, we do not consider it in subsequent levels.
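The level-wise ranking described above might be sketched as follows. This is one possible reading rather than the author's implementation: it assumes that sentences already assigned a level are excluded from later subsets, and the additive toy classifier with hand-picked word weights is purely hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, List


def rank_sentences(
    same: List[str],
    other: List[str],
    classify: Callable[[str], str],
    original_label: str,
) -> Dict[int, int]:
    """Map each index of `same` to its importance level (1 = most important)."""
    levels: Dict[int, int] = {}
    remaining = list(range(len(same)))
    for k in range(1, len(same) + 1):
        if len(remaining) < k:
            break
        found = set()
        for subset in combinations(remaining, k):
            # Aggregate: opposite-label sentences plus k original-label ones.
            aggregate = " ".join(other + [same[i] for i in subset])
            if classify(aggregate) == original_label:
                found.update(subset)
        for i in found:
            levels[i] = k
        # Sentences with a fixed level are not considered again.
        remaining = [i for i in remaining if i not in found]
    return levels


# Hypothetical additive classifier: sums word weights, negative total => NEG.
WEIGHTS = {"terrible": -3.0, "dull": -1.0, "slow": -1.0, "great": 1.0}


def toy_classifier(text: str) -> str:
    score = sum(WEIGHTS.get(word, 0.0) for word in text.split())
    return "NEG" if score < 0 else "POS"


same = ["it was terrible", "it felt dull", "it was slow"]
other = ["the cast is great"]
print(rank_sentences(same, other, toy_classifier, "NEG"))
```

In this toy case the first sentence flips the decision on its own, while the other two only do so together, so they share the next level.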
Step 2: Word importance ranking:
After finding the importance of sentences in step 1, we need to find the importance ranking of the words to be attacked in these sentences. We observe that words with certain Part of Speech (POS) tags are more important than others. For example, for a sentiment classification task, adjectives, verbs, and adverbs are more important than nouns, pronouns, conjunctions, or prepositions. Further, we consider adjectives to be more important than adverbs. Consider the sentence "The movie was very bad". In this sentence, "bad" is the adjective and shapes the sentiment of the sentence. The adverb "very" increases the intensity of the adjective, further increasing the confidence score of the predicted class.
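A POS-based word ranking of this kind could look like the sketch below, which takes already tagged words as input. The text only states that adjectives, verbs, and adverbs outrank other tags and that adjectives outrank adverbs; placing verbs between the two is an assumption of this sketch, as are the tag names and the `rank_words` helper.

```python
from typing import List, Tuple

# Lower number = attacked earlier. The position of VERB is an assumption.
POS_PRIORITY = {"ADJ": 0, "VERB": 1, "ADV": 2}
OTHER_PRIORITY = 3  # nouns, pronouns, conjunctions, prepositions, ...


def rank_words(tagged_words: List[Tuple[str, str]]) -> List[Tuple[int, str]]:
    """Return (position, word) pairs sorted from most to least important."""
    indexed = list(enumerate(tagged_words))
    # list.sort is stable, so ties keep their original text order.
    indexed.sort(key=lambda item: POS_PRIORITY.get(item[1][1], OTHER_PRIORITY))
    return [(position, word) for position, (word, _tag) in indexed]


tagged = [("The", "DET"), ("movie", "NOUN"), ("was", "AUX"),
          ("very", "ADV"), ("bad", "ADJ")]
print(rank_words(tagged)[:2])
```

For the example sentence, the adjective "bad" comes first and the adverb "very" second, matching the ordering argued for in the text.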
Step 3: Attack: We apply word-level perturbations in the order of word importance obtained from the previous step. We select synonyms to replace the original words using the cosine distance between word vectors. Further, to maintain the syntax of the language, only those synonyms having the same POS tag as the original word are considered for further evaluation. Experiments are done using both coarse and fine POS tag masks.
Let us look at the details of each of these steps:
POS checking: In order to maintain the syntax of the generated adversarial example, we filter out the synonyms which have a different POS tag than the original word. We experiment with both coarse and fine POS tagging.
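Candidate generation with a POS filter can be sketched as below. The two-dimensional vectors, the static per-word POS dictionary, and the `min_sim` threshold are all assumptions for illustration; in practice the tag of a replacement would be determined in context by a tagger, and the vectors would come from a trained embedding.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def candidate_synonyms(
    word: str,
    vectors: Dict[str, Sequence[float]],
    pos_tags: Dict[str, str],
    top_n: int = 50,
    min_sim: float = 0.5,
) -> List[str]:
    """Nearest neighbours of `word` that share its POS tag, most similar first."""
    scored = []
    for other, vec in vectors.items():
        if other == word or pos_tags.get(other) != pos_tags[word]:
            continue  # POS filter: keep only same-tag candidates
        sim = cosine_similarity(vectors[word], vec)
        if sim >= min_sim:
            scored.append((sim, other))
    scored.sort(reverse=True)
    return [other for _sim, other in scored[:top_n]]


# Toy embedding and tags (hypothetical values).
vectors = {"bad": [1.0, 0.1], "poor": [0.9, 0.2],
           "badly": [0.95, 0.1], "good": [-1.0, 0.1]}
pos_tags = {"bad": "ADJ", "poor": "ADJ", "badly": "ADV", "good": "ADJ"}
print(candidate_synonyms("bad", vectors, pos_tags))
```

Here "badly" is close in vector space but is rejected by the POS filter, and "good" shares the tag but falls below the similarity threshold.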
We select a synonym to replace a word based on the following rules, in order of preference:
Replacing it misclassifies the review.
Replacing it misclassifies the sentence to which it belongs (for words in sentences which have the original label as their class).
Replacing it misclassifies any of the aggregates to which it belonged while finding the sentence importance ranking and which belong to the original class.
If multiple synonyms fulfill the rules, then the one which fulfills the rule of higher preference is selected. If multiple synonyms fulfill the highest preference rule, then the synonym whose placement in the review yields the text semantically nearest to the original review is selected. We terminate the algorithm once the text is misclassified, or when all the important words have been iterated over.
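The preference rules above can be sketched as a single selection function. This is a simplified reading under stated assumptions: aggregates are represented as lists of sentence indices, ties are broken by similarity within whichever rule wins (the text specifies this only for the highest rule), and the classifier, similarity function, and word weights are hypothetical toys.

```python
from typing import Callable, List, Optional


def replace_word(sentence: str, word_idx: int, new_word: str) -> str:
    words = sentence.split()
    words[word_idx] = new_word
    return " ".join(words)


def select_synonym(
    sentences: List[str], sent_idx: int, word_idx: int, candidates: List[str],
    aggregates: List[List[int]], classify: Callable[[str], str],
    similarity: Callable[[str, str], float], original_label: str,
) -> Optional[str]:
    """Pick the candidate satisfying the most preferred rule, or None."""
    original_review = " ".join(sentences)
    sentence_is_original = classify(sentences[sent_idx]) == original_label
    best = None  # (rule number, negative similarity, synonym)
    for synonym in candidates:
        new_sentences = list(sentences)
        new_sentences[sent_idx] = replace_word(sentences[sent_idx], word_idx, synonym)
        new_review = " ".join(new_sentences)

        def aggregate_flips() -> bool:
            texts = (" ".join(new_sentences[i] for i in agg)
                     for agg in aggregates if sent_idx in agg)
            return any(classify(t) != original_label for t in texts)

        if classify(new_review) != original_label:
            rule = 1  # whole review is misclassified
        elif sentence_is_original and classify(new_sentences[sent_idx]) != original_label:
            rule = 2  # the containing sentence is misclassified
        elif aggregate_flips():
            rule = 3  # some aggregate containing the sentence is misclassified
        else:
            continue
        key = (rule, -similarity(original_review, new_review), synonym)
        if best is None or key < best:
            best = key
    return None if best is None else best[2]


# Hypothetical additive classifier and word-overlap similarity.
WEIGHTS = {"terrible": -3.0, "dull": -1.0, "boring": -0.4,
           "flat": 0.0, "dreary": -1.0, "great": 0.5}


def toy_classifier(text: str) -> str:
    return "NEG" if sum(WEIGHTS.get(w, 0.0) for w in text.split()) < 0 else "POS"


def toy_similarity(a: str, b: str) -> float:
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)


review = ["it was terrible", "it felt dull", "the cast is great"]
print(select_synonym(review, 1, 2, ["boring", "flat"], [[1, 2]],
                     toy_classifier, toy_similarity, "NEG"))
```

In the toy run neither candidate flips the whole review, but "flat" flips its sentence (rule 2) while "boring" only flips the aggregate (rule 3), so "flat" is preferred.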
The justification for the higher preference of the sentence with respect to the aggregate to which it belongs comes from the additivity assumption. Let $A$ be the set of sentences individually classified as the original label $l_0$ and $B$ the set of sentences individually classified as the other label $l_1$. Let us take a sentence $s$, $s \in A$. Now, add it to set $B$ to form an 'aggregate', i.e. $B \cup \{s\}$. Assuming the additivity of class intensities of sentences, we can easily see that when the sentences in $B \cup \{s\}$ are joined to form a piece of text, it either belongs to class $l_1$ or, in case it belongs to class $l_0$, the intensity of class $l_0$ is lesser when compared to $s$ alone. In other words, an 'aggregate' belonging to class $l_0$ has lesser intensity of class $l_0$ when compared with the individual sentences belonging to class $l_0$ which are part of that aggregate. Hence, a synonym which is able to flip the decision of the classifier on both the sentence and the aggregate (initially classified as $l_0$) to which the sentence belongs would be preferred over a synonym which is only able to change the classifier's decision on the aggregate.
III Attack Evaluation: Sentiment Analysis
We evaluate our attack methodology by generating adversarial texts for the sentiment analysis task. Sentiment analysis is a text classification task which identifies and characterizes the sentiment of a given text. It is widely used by businesses to gauge the sentiment of customers towards their products or services, by analysing reviews or survey responses.
III-A Datasets and Models
We study the effectiveness of our attack methodology on sentiment classification on the IMDB and Movie Review (MR) datasets. We target three models: a word-based convolutional neural network (WordCNN), a word-based long short-term memory network (WordLSTM), and Bidirectional Encoder Representations from Transformers (BERT). We attack the pretrained models open sourced by Jin et al. and evaluate our attack algorithm on the same set of 1000 examples that the authors used in their work. We also run the attack algorithm against the Google Cloud NLP API. A summary of the datasets used by Jin et al. for training the models is given in Table I, and the models' original accuracies are given in Table II.
|Metric||wordCNN (MR)||wordCNN (IMDB)||wordLSTM (MR)||wordLSTM (IMDB)||BERT (MR)||BERT (IMDB)||GCP NLP API (MR)|
|Attack Success Rate (%)||75.8||80.6||76.6||64.0||53.2||65.0||78.3|
|% Perturbed Words||12.1||3.1||12.2||2.8||15.6||2.1||11.8|
|Average Text Length (words)||20||215||20||215||20||215||20|
|Metric||wordCNN (MR)||wordCNN (IMDB)||wordLSTM (MR)||wordLSTM (IMDB)||BERT (MR)||BERT (IMDB)||GCP NLP API (MR)|
|Attack Success Rate (%)||73.5||78.9||73.7||61.9||49.2||62.3||78.3|
|% Perturbed Words||12.2||3.1||12.0||2.6||14.6||2.2||10.2|
|Average Text Length (words)||20||215||20||215||20||215||20|
III-B Evaluation Metrics
Gap between original and after-attack accuracy: We first measure the accuracy of the target model on the 1000 test samples and call it the original accuracy. Then, we measure the accuracy of the target model on the adversarial samples generated from the same test samples and call it the after-attack accuracy. The greater the gap between the original accuracy and the after-attack accuracy, the more successful the attack.
Percentage of perturbed words: The average percentage of words replaced by their synonyms gives us a metric to quantify the change made to a given text.
Semantic similarity: It tells us the degree to which two given texts carry a similar meaning. We use the Universal Sentence Encoder to measure the semantic similarity between the original and adversarial text. Since our main aim is to generate adversarial texts, we just control the semantic similarity to be above a certain threshold.
Number of queries: The average number of queries made to the target model tells us the efficiency of the attack model.
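Two of these metrics are simple enough to sketch directly. The helper names and the small label lists are illustrative assumptions; the example text pair is taken from the adversarial examples shown later in the paper. The word-percentage helper assumes pure word substitution, so both texts have the same number of tokens.

```python
from typing import List


def accuracy(predictions: List[str], labels: List[str]) -> float:
    """Percentage of predictions that match the ground-truth labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def perturbed_word_percentage(original: str, adversarial: str) -> float:
    """Percentage of word positions that differ between the two texts."""
    orig_words, adv_words = original.split(), adversarial.split()
    if len(orig_words) != len(adv_words):
        raise ValueError("word substitution should preserve the token count")
    changed = sum(o != a for o, a in zip(orig_words, adv_words))
    return 100.0 * changed / len(orig_words)


# Made-up predictions, for illustration only.
labels = ["NEG", "POS", "NEG", "POS"]
original_preds = ["NEG", "POS", "NEG", "NEG"]
adversarial_preds = ["POS", "POS", "POS", "NEG"]

original_acc = accuracy(original_preds, labels)
after_attack_acc = accuracy(adversarial_preds, labels)
print("accuracy gap:", original_acc - after_attack_acc)
print(perturbed_word_percentage("strange and beautiful film",
                                "strange and resplendent film"))
```

The query count, the fourth metric, is most easily collected by counting calls inside whatever wrapper is used to access the target model.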
|Attack System||Attack Success Rate||% Perturbed Words|
|Li et al.||86.7||6.9|
|Alzantot et al.||97.0||14.7|
|Jin et al.||99.7||10.0|
IV-A Automatic Evaluation
We report the results of our hard label black-box attacks in terms of automatic evaluation on two text classification tasks using coarse and fine POS masks. The main results are summarized in Tables III and IV. Our attack algorithm is able to bring down the accuracy of all the major text classification models, with an attack success rate greater than 50% for nearly all model and dataset combinations (the lowest is 49.2%, for BERT on MR in one of the two settings). Further, the percentage of perturbed words is roughly 2-3% for all the models on the IMDB dataset and between 10% and 16% for all the models on the MR dataset. On the IMDB dataset, which has an average text length of 215 words, our attack model is able to conduct successful attacks by perturbing fewer than 7 words on average. This means that our attack model is able to identify the important words in the text and make subtle manipulations to mislead the classifiers. Overall, our algorithm is able to attack text classification models for sentiment analysis with an attack success rate of about 50% or more, regardless of the length of the text sequence or the accuracy of the target model. Further, our model requires the least amount of information among all the attack systems it is compared with.
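As a quick check of the "fewer than 7 words" figure, take the largest IMDB perturbation rate reported in the results tables (3.1%) together with the average IMDB text length of 215 words:

```latex
\[
215 \times \frac{3.1}{100} \approx 6.7 < 7
\]
```

The other IMDB columns have lower perturbation rates, so they correspond to even fewer replaced words on average.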
The attack model is also able to attack the GCP NLP API and brings its accuracy down from 76.4% to 16.7% on the MR dataset. Further, it changes only 10.2% of the words in the text to generate the adversarial example. These results are notable because we achieve them without any information about the confidence scores of the classes involved. The attack algorithm and the carefully crafted adversarial texts can be utilized for the study of interpretability of the BERT model.
|Movie Review (Positive (POS) Negative (NEG))|
|Original (Label: NEG)||i firmly believe that a good video game movie is going to show up soon i also believe that resident evil is not it|
|Attack (Label: POS)||i firmly feel that a good video game movie is going to show up soon i also believe that resident evil is not it|
|Original (Label: POS)||strange and beautiful film|
|Attack (Label: NEG)||strange and resplendent film|
|Original (Label: POS)||the lion king was a roaring success when it was released eight years ago , but on imax it seems better, not just bigger|
|Attack (Label: NEG)||the lion king was a roaring attainment when it was released eight years ago , but on imax it transpires better , not just bigger|
|Movie Review (Positive (POS) Negative (NEG))|
|Original (Label: NEG)||after the book i became very sad when i was watching the movie . i am agree that sometimes a film should be different from the original novel but in this case it was more than acceptable . some examples: 1 ) why the ranks are different ( e.g. lt . diestl instead of sergeant etc.) 2 ) the final screen is very poor and makes diestl as a soldier who feds up himself and wants to die . but it is not true in 100 % . just read the book . he was a bull - dog in the last seconds as well . he did not want to die by wrecking his gun and walking simply towards to michael & noah . so this is some kind of a happy end which does not fit at all for this movie .|
|Attack (Label: POS)||after the book i became very bleak when i was watching the movie . i am agree that sometimes a film should be different from the original novel but in this case it was more than acceptable . some examples:1 ) why the ranks are different ( e.g. lt . diestl instead of sergeant etc.) 2 ) the final screen is very flawed and makes diestl as a soldier who feds up himself and wants to die . but it is not true in 100 % . just read the book . he was a bull - dog in the last seconds as well . he did not want to die by wrecking his gun and walking simply towards to michael & noah . so this is some kind of a happy end which does not fit at all for this movie .|
|Original (Label: NEG)||seriously , i do n’t understand how justin long is becoming increasingly popular . he either has the best agent in hollywood , or recently sold his soul to satan . he is almost unbearable to watch on screen , he has little to no charisma , and terrible comedic timing . the only film that he has attempted to anchor that i ’ve remotely enjoyed was waiting … and that is almost solely because i ’ve worked in a restaurant . but i digress . aside from it ’s terrible lead , this film has loads of other debits . i understand that it ’s supposed to be a cheap popcorn comedy , but that does n’t mean that it has to completely insult our intelligence , and have writing so incredibly hackneyed that it borders on offensive . lewis black ’s considerable talent is wasted here too , as he is at his most incendiary when he is unrestrained , which the pg-13 rating certainly wo n’t allow . the film ’s sole bright spot was jonah hill ( who will look almost unrecognizable to fans of the recent superbad due to the amount of weight he lost in the interim ) . his one liners were funny on occasion , but were certainly not enough to make this anywhere close to bearable . if you just want to completely turn your brain off ( or better yet , do n’t have one ) then maybe you ’d enjoy this , but i ca n’t recommend it at all .|
|Attack (Label: POS)||seriously , i do n’t understand how justin long is becoming increasingly popular . he either has the best agent in hollywood , or recently sold his soul to satan . he is almost terrible to watch on screen , he has little to no charisma , and spooky comedic timing . the only film that he has attempted to anchor that i ’ ve remotely enjoyed was waiting … and that is almost solely because i ’ ve worked in a restaurant . but i digress . aside from it ’s spooky lead , this film has loads of other debits . i understand that it ’s supposed to be a miserly popcorn comedy , but that does n’t mean that it has to completely insult our intelligence , and have writing so incredibly hackneyed that it borders on offensive . lewis black ’s considerable talent is wasted here too , as he is at his most incendiary when he is unrestrained , which the pg-13 rating certainly wo n’t allow . the film ’s sole bright spot was jonah hill ( who will look almost unrecognizable to fans of the recent superbad due to the amount of weight he lost in the interim ) . his one liners were funny on occasion , but were certainly not enough to make this anywhere close to bearable . if you just want to completely turn your brain off ( or better yet , do n’t have one ) then maybe you ’d enjoy this , but i ca n’t recommend it at all .|
|Movie Review (Positive (POS) Negative (NEG))|
|Original (Label: POS)||she may not be real , but the laughs are|
|Attack (Label: NEG) using coarse POS tags||she may not be real , but the kidding are|
|Attack (Label: NEG) using fine POS tags||she may not be real , but the chuckles are|
|Original (Label: NEG)||falsehoods pile up , undermining the movie ’s reality and stifling its creator ’s comic voice|
|Attack (Label: POS) using coarse POS tags||falsehoods heaps up , jeopardizes the movie ’s reality and stifle its creator ’s comic voice|
|Attack (Label: POS) using fine POS tags||falsehoods heaps up , jeopardizing the movie ’s reality and stifle its creator ’s comic voice|
IV-B Benchmark Comparison
We compare our attack against state-of-the-art adversarial attack systems on the same target model and dataset. For the GCP NLP API, we compare our attack results against previously published attack systems on the MR dataset. With wordCNN and wordLSTM as the target models, the comparison is against Li et al., Alzantot et al., and Jin et al. The results of the comparison are summarised in Tables V and VI. The lower attack success rates when compared to the other attack systems can be attributed to the fact that our attack system does not make use of the confidence scores of the classes, which the other published systems do.
IV-C Human Evaluation
Following the practice of Jin et al., we perform human evaluation by sampling 100 adversarial examples from the MR dataset with the WordLSTM model. We perform three experiments to verify the quality of our adversarial examples. First, the human judges are asked to give a grammaticality score to a shuffled mix of original and adversarial texts on a scale of 1-5. As shown in the human evaluation table, the grammaticality of the adversarial texts with the fine POS tag mask is closer to that of the original texts than with the coarse POS tag mask. However, both scores are above 4, meaning that both coarse and fine POS tags result in smooth adversarial texts.
Second, the judges assign classification labels to a shuffled set of original and adversarial texts, for both coarse and fine POS masks. The results show that the overall agreement between the labels of the original and adversarial texts is quite high in both cases, 92% and 93% respectively. This suggests that improving the grammaticality of the adversarial texts using the fine POS mask does not contribute much to the overall meaning of the texts to humans.
Third, the judges determine whether the adversarial texts retain the meaning of the original text. The judges are given three options: 1 for similar, 0.5 for ambiguous, and 0 for dissimilar. The average sentence similarity score is 0.88 when the fine POS mask is used, compared to 0.86 when the coarse POS mask is used for synonym selection, suggesting a marginal improvement in sentence similarity scores in the former.
|Fine POS filter||Coarse POS filter|
V-A Ablation Study
Aggregates: The most critical step of our algorithm is the use of aggregates, which belong to the original class, to select or reject synonyms for replacement. To validate the effectiveness of this step, we remove the usage of aggregates and select a synonym for replacement only when its presence is able to misclassify the original text. The results for the BERT model are shown in the table below. After removing the sentence importance ranking step, we see that the after-attack accuracy increases by 32% for the IMDB dataset and 35% for the MR dataset. This shows the importance of aggregates for selecting a synonym for replacement, the removal of which renders the attack ineffective. The aggregates generated in the sentence importance ranking step help us to select those synonyms which can take the original text towards misclassification.
|Orig Acc.||After-Attack accuracy||% Perturbed words|
|w/ agg.||w/o agg.||w/ agg.||w/o agg.|
We propose a hard label black-box attack strategy for the text classification task. We also conduct extensive experimentation on sentiment analysis datasets to validate our attack system. The attack algorithm and the carefully crafted adversarial texts can be utilized for the study of interpretability of NLP models. Finally, we also conduct human evaluation to validate the grammatical and semantic correctness of the generated adversarial examples.
I thank Prof. Sunil Shende for insightful discussions. I especially appreciate Kalyan Alapati and Dheenadhyalan Kumaraswamy for helping with human evaluation.
The code and links to the pretrained models are available at https://github.com/SachJbp/TextDecepter
- (2018) Generating natural language adversarial examples. arXiv preprint arXiv:1804.07998.
- (2017) MagNet and "Efficient defenses against adversarial attacks" are not robust to adversarial examples. arXiv preprint arXiv:1711.08478.
- (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- (2017) HotFlip: white-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751.
- (2018) Pathologies of neural models make interpretations difficult. arXiv preprint arXiv:1804.07781.
- (2018) Black-box generation of adversarial text sequences to evade deep learning classifiers. In 2018 IEEE Security and Privacy Workshops (SPW), pp. 50–56.
- (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
- (2015) SimLex-999: evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41 (4), pp. 665–695.
- (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- (2019) Is BERT really robust? Natural language attack on text classification and entailment. arXiv preprint arXiv:1907.11932.
- (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
- (2018) TextBugger: generating adversarial text against real-world applications. arXiv preprint arXiv:1812.05271.
- (2014) Sentiment analysis algorithms and applications: a survey. Ain Shams Engineering Journal 5 (4), pp. 1093–1113.
- (2016) Counter-fitting word vectors to linguistic constraints. arXiv preprint arXiv:1603.00892.
- (2016) Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, pp. 145–153.