AdvDetPatch: Attacking Object Detectors
with Adversarial Patches
Object detectors have witnessed great progress in recent years and have been widely deployed in various important real-world scenarios, such as autonomous driving and face recognition. Therefore, it is increasingly vital to investigate the vulnerability of modern object detectors to different types of attacks. In this work, we demonstrate that many mainstream detectors (e.g. Faster R-CNN) can in fact be attacked by a tiny adversarial patch. This is a non-trivial task, since the original adversarial patch method applies only to image-level classifiers and cannot deal with the region proposals involved in modern detectors. Instead, we iteratively evolve a tiny patch inside the input image so that it invalidates both the proposal generation and the subsequent region classification of Faster R-CNN, resulting in a successful attack. Specifically, the proposed adversarial patch (namely, AdvDetPatch) can be trained toward any targeted class, so that all the objects in any region of the scene will be classified as that targeted class. One interesting observation is that the effectiveness of AdvDetPatch is not influenced by its location: no matter where it resides, the patch can always invalidate RCNN after the same number of training iterations. Furthermore, we find that different target classes have different degrees of vulnerability, and that an AdvDetPatch with a larger size performs the attack more effectively. Extensive experiments show that our AdvDetPatch can reduce the mAP of a state-of-the-art detector on PASCAL VOC 2012 from 71% to 25% and below.
Keywords: Adversarial Patch, Object Detector
While deep learning algorithms achieve excellent performance in various applications, the security and robustness of these algorithms have been raised as important concerns. Previous studies show that the convolutional neural network (CNN), the most widely used model for image recognition, is vulnerable to carefully designed small perturbations that can change the network output with no change noticeable to the human eye. These perturbations are called "adversarial attacks": they modify each pixel of the input image by a very small amount, and can be found by gradient-based algorithms like the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD), or iterative methods like DeepFool and the Carlini-Wagner (CW) attack. Research also shows that these kinds of adversarial attacks can be effective against CNNs with different structures, even if the network structure and parameters are unknown to the attacker.
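As a minimal illustration of the gradient-based attacks mentioned above, the sketch below applies one FGSM-style step in numpy. The logistic-regression "model", its weights `w`, and the loss are toy stand-ins for illustration only, not any network attacked in this paper:

```python
import numpy as np

def fgsm_perturb(x, grad_x, epsilon):
    """FGSM step: shift each pixel by epsilon in the sign direction of the
    loss gradient, then clip back to the valid pixel range [0, 1]."""
    return np.clip(x + epsilon * np.sign(grad_x), 0.0, 1.0)

# Toy stand-in "network": logistic regression on a flattened image.
rng = np.random.default_rng(0)
w = rng.normal(size=16)            # fixed model weights
x = rng.uniform(size=16)           # input "image" in [0, 1]
y = 1.0                            # true label

def loss(x):
    """Cross-entropy loss for the true class: L = -log sigmoid(w.x)."""
    return -np.log(1.0 / (1.0 + np.exp(-w @ x)))

# Gradient of the loss w.r.t. the input: (sigmoid(w.x) - y) * w.
p = 1.0 / (1.0 + np.exp(-w @ x))
grad_x = (p - y) * w

x_adv = fgsm_perturb(x, grad_x, epsilon=0.1)
```

A single step already moves the input in the direction that increases the loss, which is why iterating such steps (as PGD does) yields stronger attacks.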
Although these pixel-wise adversarial attacks are very effective on digital images spread across the internet or a data center, they cannot be directly applied to real-world scenes. For fooling deep learning systems in the real world, such as an intelligent security camera, previous research has proposed "adversarial glasses", which can mislead a state-of-the-art face recognition system into recognizing the wearer as someone else. More recently, researchers from Google proposed a training method for an adversarial patch that can be attached to a scene at any scale, position, and orientation, and make a state-of-the-art CNN classifier predict the same targeted class no matter what is originally in the scene.
As mentioned above, previous works on adversarial attacks mainly target CNN classification models, which can only recognize one object in the scene and give one prediction result at a time. There is, however, another family of deep learning models, Region-based Convolutional Neural Networks (RCNNs), Fast RCNN, and Faster RCNN, that can simultaneously detect, propose bounding boxes for, and recognize multiple objects in one scene.
RCNN is a region-based algorithm that proposes thousands of potential regions in a scene through selective search; the algorithm then uses a CNN to extract the features of each region and classify which class it belongs to. Faster RCNN introduces a region proposal network (RPN) that shares full-image convolutional features with the downstream detection network, bringing the frame rate of RCNN models up to the needs of real-world applications. As a result, these detectors can withstand most existing adversarial attacks in real scenes.
These algorithms have been widely deployed as detectors in the real world, including in autonomous driving and surveillance footage analysis, where security is definitely a major issue; yet related research on attacking such object detectors, like Faster RCNN models, is very limited. Compared with attacking classifiers, attacking object detectors is much more difficult. We find that when an adversarial patch designed to attack classifiers, e.g. Google's adversarial patch, is applied onto an image, object detectors like Faster RCNN cannot be fooled. The difficulty of attacking such region-based detectors is that they use object proposals at different scales and positions before classifying, so the number of targets is orders of magnitude larger than in classification models. In such a case, Google's patch is unable to attack every target. In this paper, we propose an iterative training method to produce an adversarial patch, AdvDetPatch, that is able to attack region-based detectors and can be attached anywhere in a scene. We select a specific feature based on the targeted class and fix this feature before it is processed by the CNN, so that both the region proposal stage and the classifier of Faster RCNN can only recognize the targeted class. Our experiments show that our AdvDetPatch makes Faster RCNN recognize every detected region as the patch's targeted class and give wrong detection results. The attack effect is independent of the location of the patch, so we can randomly place the AdvDetPatch and still get similar attack results. The attack effect does vary across AdvDetPatches of different targeted classes: in our experiments, a bike AdvDetPatch lowers the average mAP from 70% to 25%, while a tv AdvDetPatch lowers it to 0.98%.
We also find that a larger AdvDetPatch creates a stronger attack after the same number of training iterations: a 20-by-20 patch decreases mAP to 0.38%, and an 80-by-80 AdvDetPatch decreases mAP to 0.29%. Some classes can be fully misclassified using a small AdvDetPatch, while others can still be recognized even with a large AdvDetPatch. Hence, we can determine the most efficient patch size based on the classes we want to attack.
2.1 Region-Based Convolutional Neural Networks
CNN models, which have been widely studied by the computer vision community, are able to recognize one object in an image precisely, but struggle with the localization and recognition of multiple objects in one scene, as required in applications such as autonomous driving and surveillance footage analysis. To tackle this challenge, Girshick et al. proposed the Region-based Convolutional Neural Network (RCNN) model, which constructs an object detection system by combining two modules: region proposals and feature extraction. RCNN models generate thousands of anchors for an image, and each anchor has several region proposals at various scales and ratios, so the number of targets is orders of magnitude larger than in classification models. RCNN achieved a mean average precision (mAP) of 0.58 on the Pascal VOC2007 test set.
Although it provides promising accuracy, the computation of RCNN systems is time-consuming because of a large amount of repeated work: there exist thousands of proposal regions, most of which overlap, and the features of these overlapped regions are extracted repetitively. Fast RCNN introduces spatial pyramid pooling networks (SPPnet) to speed up RCNN by sharing computation; moreover, it increases mAP to about 0.68. The bottleneck of Fast RCNN is its use of selective search to generate region proposals, which is also a time-consuming process. Faster RCNN utilizes a Region Proposal Network (RPN) to efficiently generate region proposals with a wide range of scales and aspect ratios, raising mAP to 0.73 on the VOC2007 test set.
2.2 Adversarial Attack to CNN
Most of the previous research on adversarial attacks concerns pixel-wise additive noise [2, 4, 5]. These attacks change all the pixels of the input image by a small amount, which tends not to be feasible against real-world vision systems such as web cameras and self-driving cars. To achieve a universal attack on real-world vision systems, Google recently designed a universal, robust adversarial patch that can be applied in any scene and can cause any classifier to output any targeted class. Based on the principle that classification networks tend to detect the most "salient" object in an image, the adversarial patch can be trained to be more "salient" than the other objects in the scene. Specifically, the patch is trained to optimize an objective function. During the training process, the patch is applied to the scene at a random position with random scale and rotation, which makes the patch robust to shifting, scaling, and rotation. Finally, the true object that is originally in the scene will be classified as the targeted class of the patch.
3 Motivation & Our Design Principle
Google's patch, which has successfully attacked CNN classifiers, can only be applied in restricted scenes. Usually there exists only one object in such a scene, so the patch can "win" against the correct object; that is, the classifier's output score for the patch is much higher than that for the object. For multi-object detection networks like Faster RCNN, the adversarial patch and the object will both be detected, recognized, and correctly classified. The main cause is that Faster RCNN frameworks extract about 2000 bottom-up region proposals and compute CNN features on every extracted region proposal. Therefore, even if a patch exists in one specific region, it cannot influence the detection of other region proposals that contain other objects.
As Fig. 1 shows, when we apply a trained adversarial patch whose targeted class is set to toaster to an image, the pretrained Faster RCNN model still works with high precision. That is to say, previously proposed adversarial patches attacking CNN classifiers are ineffective against Faster RCNN. Hence, such adversarial patches are not applicable in real scenes where more than one object can be recognized.
In order to enable the adversarial patch to attack real scenes, it is essential to fool detectors like Faster RCNN networks. To the authors' best knowledge, however, there has been no previous work focusing on attack methods against these RCNN-like deep learning models. Our objective in this work is to generate a universal adversarial patch, the AdvDetPatch. Even when there are multiple objects in a scene, the AdvDetPatch is able to "win" against all the other objects: the multi-object classifier will yield just one result, the targeted class of the AdvDetPatch.
Based on our analysis of Faster RCNN networks, we find that although thousands of region proposals are extracted, the process of computing CNN features is the same across all of them. A region proposal is selected only if it contains the feature we want to find; otherwise it is ignored. That is to say, if we select a specific feature and feed it to all regions before they enter the CNN, all the region classifiers will be trained to recognize only that one feature. Therefore, when we apply the trained patch in a real scene, the thousands of extracted region proposals will try to detect the feature given by the targeted class of the patch while ignoring other features. The function of the region proposals is thus invalidated by the adversarial patch.
4.1 Training System Design
We construct a system, as Fig. 2 demonstrates, to train the adversarial patch on the basis of the common adversarial patch and Faster RCNN networks. When an input image enters this training system, more than 2k region proposals are extracted. In the original Faster RCNN models, features are computed for each region proposal using a large CNN. Here, we select one feature according to the targeted class and feed this feature to all the regions, instead of directly computing the CNN features of each region. That is, we let the CNN train on the same feature, determined by the targeted class, in all these region proposals. Hence, after iterative training, the region classifier can only recognize one feature. If the feature of the targeted class does not exist in a region, the region classifier will ignore that region and try to find the feature in other regions. By this means, there still exist thousands of region proposals in the scene, but these region proposals can no longer detect any object other than the patch of the targeted class.
In particular, the region proposal process is supervised by the ground-truth box labels in Faster RCNN networks: a region proposal is valid only when its overlap with the ground-truth box exceeds a threshold value, e.g. 0.5. Hence, we design the algorithm to minimize the difference from the ground-truth box (see Section 4.2).
4.2 AdvDetPatch Design
The training process of the Faster RCNN algorithm simultaneously optimizes two objectives, the classification loss $L_{cls}$ and the bounding-box regression loss $L_{reg}$:

$$L(p, u, t, v) = L_{cls}(p, u) + \lambda L_{reg}(t, v) \quad (1)$$

where the classification loss $L_{cls}$ denotes the difference between the recognized label and the ground-truth label, and the bounding-box regression loss $L_{reg}$ is the difference between the detected bounding box and the ground-truth box. In the equation, $p$ is the predicted label, $u$ is the ground-truth class, $t$ is a vector representing the position of the predicted bounding box, and $v$ is that of the ground-truth bounding box. When training the AdvDetPatch, we design both untargeted and targeted attacks. In an untargeted attack, we would like to find a patch pattern $\delta$ that maximizes both the classification loss and the regression loss of the Faster RCNN model $f$ when applied to the input scene $x$:

$$\delta^{*} = \arg\max_{\delta} L\big(f(A(x, \delta, s)), u, t, v\big) \quad (2)$$

In a targeted attack, we would like to find a patch that minimizes the loss with respect to the class $u'$ and bounding box $v'$ that our attack targets:

$$\delta^{*} = \arg\min_{\delta} L\big(f(A(x, \delta, s)), u', t, v'\big) \quad (3)$$
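The multi-task loss of Eq. (1) can be sketched for a single region proposal as follows. The cross-entropy classification term and the smooth-L1 regression term follow the standard Fast/Faster RCNN formulation; the concrete scores and box offsets below are hypothetical numbers for illustration:

```python
import numpy as np

def smooth_l1(t, v):
    """Smooth-L1 box regression loss used in Fast/Faster RCNN."""
    d = np.abs(t - v)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum()

def detector_loss(scores, u, t, v, lam=1.0):
    """Eq. (1): classification cross-entropy plus weighted box regression.
    scores: raw class scores for one region proposal; u: ground-truth class id;
    t, v: predicted and ground-truth box vectors (x, y, w, h offsets)."""
    p = np.exp(scores - scores.max())
    p /= p.sum()                      # softmax over classes
    l_cls = -np.log(p[u])             # cross-entropy for the true class
    l_reg = smooth_l1(t, v)
    return l_cls + lam * l_reg

scores = np.array([2.0, 0.5, -1.0])   # hypothetical 3-class scores
t = np.array([0.1, 0.0, 0.2, -0.1])   # predicted box offsets
v = np.zeros(4)                       # ground-truth offsets
loss = detector_loss(scores, u=0, t=t, v=v)
```

The untargeted attack of Eq. (2) ascends this quantity with respect to the patch pixels, while the targeted attack of Eq. (3) descends it after substituting the targeted class and box.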
Note that to apply the patch to the scene, we use an "apply" function $A(x, \delta, s)$ which adds the patch $\delta$ onto the input scene $x$ at position $s$. During training we uniformly sample the position $s$ within the area of the input scene to make our patch shift-invariant.
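A minimal numpy sketch of such an apply function is given below; the (H, W, C) shapes and the paste-style composition are our assumptions for illustration:

```python
import numpy as np

def apply_patch(scene, patch, rng):
    """'Apply' function A(x, patch, s): paste the patch at a uniformly
    sampled position s so the trained patch becomes shift invariant.
    Both arrays have shape (H, W, C)."""
    H, W, _ = scene.shape
    h, w, _ = patch.shape
    top = rng.integers(0, H - h + 1)    # uniform over all valid positions
    left = rng.integers(0, W - w + 1)
    out = scene.copy()                  # leave the original scene untouched
    out[top:top + h, left:left + w] = patch
    return out, (top, left)

rng = np.random.default_rng(0)
scene = np.zeros((300, 300, 3))
patch = np.ones((20, 20, 3))            # 20-by-20 patch, as in the paper
attacked, pos = apply_patch(scene, patch, rng)
```

Sampling a fresh position at every iteration is what later lets the randomly-located patch match the fixed-location one (Section 6.3).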
4.3 AdvDetPatch Training
A Basic AdvDetPatch. Our purpose is to train an AdvDetPatch that can "attract" all the region proposals. The training procedure is summarized in Algorithm 1. The basic AdvDetPatch training process starts with training a Faster RCNN network on the Pascal VOC dataset without introducing any noise. After that, we fix the weights obtained from the pretrained network and train the AdvDetPatch on this network with the objectives mentioned in Section 4.2; that is, nothing is optimized except the AdvDetPatch. Considering the supervision of the ground-truth box over the region proposals, we set the label of the ground-truth box, the label of the region proposals, and the label of the Faster RCNN classifier to the targeted class of the AdvDetPatch.
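The loop described above can be sketched as follows. This is only a hedged outline of Algorithm 1: the detector weights stay frozen and only the masked pad pixels are updated; `loss_grad_wrt_pad` is a hypothetical stand-in for backpropagating the patch objective of Section 4.2 through the frozen Faster RCNN, and the toy quadratic "gradient" below exists solely to exercise the loop:

```python
import numpy as np

def train_advdetpatch(images, pad, mask, loss_grad_wrt_pad, lr=0.01, iters=1000):
    """Sketch of Algorithm 1: iterate over training images and apply
    gradient steps to the pad, masked so only the working region changes."""
    for it in range(iters):
        image = images[it % len(images)]
        grad = loss_grad_wrt_pad(image, pad)
        pad = pad - lr * grad * mask        # detector weights stay fixed
        pad = np.clip(pad, 0.0, 1.0)        # keep valid pixel values
    return pad

# Toy run: the stand-in "gradient" pulls the masked pixels toward 0.25.
mask = np.zeros((8, 8, 3))
mask[:2, :2] = 1.0                          # 2-by-2 working region
pad0 = np.zeros((8, 8, 3))
images = [np.zeros((8, 8, 3))]
toy_grad = lambda img, p: p - 0.25          # hypothetical gradient oracle
patch_pad = train_advdetpatch(images, pad0, mask, toy_grad)
```

Only the masked region moves toward the objective; everything outside the working region stays exactly as initialized, mirroring the frozen-detector setup.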
Fig. 3 illustrates the entire procedure of adding the AdvDetPatch. After setting up the networks, we create a "pad" of the same size as the original image and randomly initialize its pixels. During training, all the pixels in this "pad" are updated iteratively to optimize the training objective. A mask of the same size is created to select the working region of the "pad": the points inside the working region are set to 1, and the points outside are set to 0. That working region will eventually be extracted as the AdvDetPatch. Our default setting chooses a 20-by-20 square at the top left corner of the image as the working region of the AdvDetPatch. In order to "stick" the AdvDetPatch into the original image, an "anti-mask" is also produced: the value of each point inside the mask region of the "anti-mask" is 0, which means the original pixels of the image there will be fully covered. After processing the pad and the image, the adversarial patch can be "stuck" onto the image by point-to-point addition of these two matrices.
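The mask / anti-mask compositing described above can be sketched in numpy as follows; the image content and sizes are illustrative:

```python
import numpy as np

H, W = 300, 300
ph, pw = 20, 20                            # default working region: 20-by-20

pad = np.random.default_rng(0).uniform(size=(H, W, 3))  # trainable full-size "pad"
mask = np.zeros((H, W, 1))
mask[:ph, :pw] = 1.0                       # 1 inside the working region, 0 outside
anti_mask = 1.0 - mask                     # 0 inside the patch region, 1 outside

image = np.full((H, W, 3), 0.5)            # stand-in input image
patched = image * anti_mask + pad * mask   # point-to-point "stick" operation
```

Because the anti-mask zeroes the image inside the working region, the addition fully replaces those pixels with the pad while leaving the rest of the image untouched.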
Randomly Shifting the Location of the AdvDetPatch. The default setting places the AdvDetPatch at the top left corner of the image. In order to analyze the influence of different locations, and to make the AdvDetPatch shift-invariant, we randomly shift the location of our AdvDetPatch during the training process. Specifically, we randomly initialize the shift value when preparing the AdvDetPatch at each training iteration, but do not modify the pixels in it. In such a case, each image in the dataset is attached with the same AdvDetPatch but at a different location. The size of the AdvDetPatch is still 20 by 20, the same as the default setting.
AdvDetPatch targeting different classes. As there exist more than 10 object classes in the datasets used to train Faster RCNN, e.g. 20 classes in Pascal VOC 0712, it is intriguing to see whether mAP falls to a similar value when different object classes are used as the attack target. In our experiment, we randomly select four classes from Pascal VOC 0712 and test their attack effects.
AdvDetPatch with different sizes. The AdvDetPatch size is another significant predetermined factor that can affect the effectiveness of our attack. There is a tradeoff between smaller patches, which are harder to detect and defend against, and larger patches, which provide a stronger attack. In our experiment, we produce AdvDetPatches of three different sizes, namely 20 by 20, 40 by 40, and 80 by 80, to test the efficiency of their attacks. In this way, we can better understand the relationship between AdvDetPatch size and attack effect, and find the minimal possible patch size for a successful attack.
We first train a Faster RCNN model using ResNet-101 as the backbone network; ResNet-101 is a 101-layer residual network which guarantees sufficient depth. Pascal VOC 0712 [12, 16] is utilized as the training dataset. After training, the mean AP over the 20 classes in Pascal VOC 0712 is about 0.7 (Table 1). Since most evaluations of Faster RCNN on Pascal VOC 0712 report mAP similar to our training results, we believe it can be utilized as a pretrained model. We first set tvmonitor (tv), the last class in the dataset, as our targeted class, and set the labels of the ground-truth boxes to tv. It is expected that the average precision (AP) of the other classes will fall quickly, because objects of those classes will be misclassified as tv. We also anticipate that the final mAP should decrease to about 0, because all other objects should be recognized as tv, which greatly increases the number of incorrect recognitions and decreases the ratio of correct ones. In addition, we expect Faster RCNN to still extract multiple region proposals, but only the regions (possibly more than one) containing the patch will be kept in the end. Specifically, if an image contains a bus, a tree, and other objects, the region proposals containing these objects should be ignored. Note that the trained AdvDetPatch in our experiment is a pixel matrix; it is first transformed into image format and then attached to the original image. In our experiment, we use the AdvDetPatch matrix to replace part of the pixel values in the original image, as Fig. 3 shows.
After creating the tv adversarial patch, we continue to create three other AdvDetPatches of different targeted classes. All the AdvDetPatches have the same size, 20 by 20. After evaluating the attack results of the different classes, we resize the patch to 40 by 40 and 80 by 80. Larger AdvDetPatches are expected to decrease the average mAP to a lower value, but might be unnecessary for some classes.
6 Experiment Results
6.1 Fixed-Sized and Fixed-Located AdvDetPatch
Fig. 4 shows a 20-by-20 AdvDetPatch whose targeted class is tv. This AdvDetPatch disturbs the identification of region proposals and makes the region proposal network able to recognize only tv. In Fig. 5, our AdvDetPatch covers the top left corner of each image; though the patch size (20-by-20) is small compared to the other objects in the scene, it can still fool the Faster RCNN classifier and make it yield a single result: the targeted class of the AdvDetPatch. Hence, the multi-object detection and recognition function of Faster RCNN models has been successfully invalidated. The predicted probability is 0.997 for the first image of Fig. 5 and 1.000 for the other three images. These predictions make sense because only the region containing the AdvDetPatch is kept; all other regions are ignored.
Specifically, the main purpose of applying such an AdvDetPatch is to make the mean average precision (mAP) drop to a lower value: the more the mAP decreases, the more successful the AdvDetPatch is.
Table 1 demonstrates that after approximately 200k training iterations, this patch can fool almost all of the 20 classes in Pascal VOC 0712: the mean AP falls from 70.01% to 0.98%. We notice that at the start of the training period (when the training iteration count is less than 40k), the mean AP falls fastest (see Fig. 6). As training continues, the marginal gain of the attack gradually diminishes. Therefore, we can conclude that there exists a saturation point for training an AdvDetPatch; beyond that point, additional training iterations no longer improve the attack effect. For the tv patch, the saturation point is about 180k training iterations.
Fig. 6(b) shows another AdvDetPatch whose targeted class is bike. When we apply this patch onto an image, all the bounding boxes determined by the region proposals are disturbed, as Fig. 7(b) shows. We infer that the different attack results of the tv patch and the bike patch are due to their different feature maps. The outline of a tv resembles a rectangle, and so resembles the shape of the region proposals; therefore the region detector can quickly select the patch while ignoring other objects. The outline of a bike is irregular, so the region proposal detector cannot find it easily; it tries to locate the patch at multiple points of the image, sets multiple region proposals, and draws multiple bounding boxes. To verify this inference, we let Faster RCNN recognize an individual patch, as Fig. 7(a) shows; there again exist multiple bounding boxes trying to locate the bike. The predicted probability is 1.000 or close to it (0.997 in the first subfigure of Fig. 5), because the AdvDetPatch attack misleads all the region proposals into finding the targeted class, and they recognize all objects as that class.
We observe that AdvDetPatches of different targeted classes can all disable Faster RCNN networks and cause the recognition result to be the targeted class, but the attack results may differ from each other. Hence, we compare two more targeted classes in order to explore more deeply the attack effects induced by different targeted classes (see Section 6.2).
6.2 Training with Different Targeted Classes
Since it has been found that the determination of region proposals differs among targeted classes, we would like to explore whether AdvDetPatches of different targeted classes cause mAP to drop to a similar value after the same number of training iterations.
We randomly select two more classes to compare with the previous two. Fig. 8 shows that after 200k training iterations, these targeted classes cause mAP to fall to different levels: two of them decline mAP to almost 0, while the other two shrink mAP only to 24.72% and 33.50%.
Based on this finding, we conclude that the classes that reduce mAP to almost 0 are more efficient for attacking Faster RCNN networks, and it is better to set the targeted class to one of them. Hence, we can select the most efficient targeted class to train the AdvDetPatch if the dataset is known.
6.3 Fixed-Sized and Randomly-Located Patch
After experimenting with the fixed-location AdvDetPatch, our next trial is to randomly shift the patch within a scene, meaning the same AdvDetPatch can appear at any location in the original image. The main purpose of this practice is to evaluate the attack efficiency at different patch locations. If the attack efficiency does not differ, e.g. the same number of training iterations yields a similar mAP, it is unnecessary to design a specific attack region, and attackers can place the adversarial patch in any area. We keep the targeted class from Section 6.1 and the 20-by-20 patch size.
Table 2 shows the decreased mAP when Faster RCNN is attacked by the randomly-located and the fixed-location AdvDetPatch. Notably, the randomly-located one does not improve the attack result: the recognition accuracy (mAP) of all classes declines to a similar value no matter where the patch is located. This result makes sense because the Faster RCNN detector first extracts thousands of region proposals all over the image, rather than examining a specific area; it thus searches the whole image and detects the patch regardless of its position. After this detection process, all the detected objects are misclassified as the targeted class, which is likewise unrelated to the location of the AdvDetPatch. Therefore, we can place the AdvDetPatch in an image without deliberately designing its location, which increases the feasibility of the attack.
6.4 Multiple-Sized AdvDetPatch
Since all previous attacks set the AdvDetPatch size to 20-by-20, we now observe the impact of patch size on the Faster RCNN model. We add two more sizes in this test: 40-by-40 and 80-by-80. It is expected that a larger patch can decrease the mean AP to a lower value, and Table 3 validates this expectation. In order to eliminate the influence of the number of training iterations, we train all three AdvDetPatches for 200k iterations so that they approach their saturation points. We observe that the smallest size for a valid attack differs among individual classes: for some classes a 20-by-20 patch is already enough for the attack, while for others even an 80-by-80 patch cannot thoroughly cause misclassification. Therefore, we can set the patch size according to the classes we mainly want to attack.
The previous adversarial patch, which disables common CNN classifiers, is proven useless against Faster RCNN networks, making it hardly applicable in real-world scenes. Although Faster RCNN networks extract thousands of region proposals in an image, we select the specific feature of the targeted class and fix the labels before the regions are processed by the CNN, so that both the region proposal stage and the classifier of Faster RCNN can only recognize this targeted class.
We create an AdvDetPatch that can attack Faster RCNN networks targeting any class. In our experiments, the mAP of a pretrained Faster RCNN network falls from 70% to 0.98% when an AdvDetPatch is applied, and all the objects in a scene are misclassified as the targeted class. We thus efficiently construct a powerful attack on the multi-object detection and recognition system.
We evaluate AdvDetPatches of different targeted classes and find that the targeted class influences the result of the adversarial attack. We also find that a randomly-located AdvDetPatch is similarly efficient to a fixed-location one, so the attack location in the scene can be chosen arbitrarily. Furthermore, a larger AdvDetPatch is proven more effective at decreasing the average mAP: a smaller AdvDetPatch can fully misclassify some classes, while other classes remain robust even under a larger AdvDetPatch. We can therefore select the most efficient AdvDetPatch size based on the classes to attack.
-  Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013)
-  Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014)
-  Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017)
-  Moosavi-Dezfooli, S.M., Fawzi, A., Frossard, P.: DeepFool: a simple and accurate method to fool deep neural networks. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
-  Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. arXiv preprint arXiv:1608.04644 (2016)
-  Brown, T.B., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. arXiv preprint arXiv:1712.09665 (2017)
-  Sharif, M., Bhagavatula, S., Bauer, L., Reiter, M.: Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. (2016) 1528–1540
-  Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Computer Vision and Pattern Recognition. (2014)
-  Girshick, R.: Fast R-CNN. In: Computer Vision and Pattern Recognition (CVPR) (2015)
-  Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems (NIPS). (2015)
-  Bose, A.J., Aarabi, P.: Adversarial attacks on face detectors using neural net based constrained optimization. In: Computer Vision and Pattern Recognition (CVPR) (2018)
-  Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html (2007)
-  Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
-  Yuan, X., He, P., Zhu, Q., Li, X.: Adversarial examples: Attacks and defenses for deep learning. arXiv preprint arXiv:1712.07107 (2017)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition (CVPR) (2016)
-  Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html (2012)