# Deep Detector Health Management under Adversarial Campaigns

###### Abstract

Machine learning models are vulnerable to adversarial inputs that induce seemingly unjustifiable errors. As automated classifiers are increasingly used in industrial control systems and machinery, these adversarial errors could grow to be a serious problem. Despite numerous studies over the past few years, the field of adversarial ML is still considered alchemy, with no practical unbroken defenses demonstrated to date, leaving PHM practitioners with few meaningful ways of addressing the problem. We introduce turbidity detection as a practical superset of the adversarial input detection problem, coping with adversarial campaigns rather than statistically invisible one-offs. This perspective is coupled with ROC-theoretic design guidance that prescribes an inexpensive domain adaptation layer at the output of a deep learning model during an attack campaign. The result aims to approximate the Bayes optimal mitigation that ameliorates the detection model’s degraded health. A proactively reactive type of prognostics is achieved via Monte Carlo simulation of various adversarial campaign scenarios, by sampling from the model’s own turbidity distribution to quickly deploy the correct mitigation during a real-world campaign.

Javier Echauz

## 1 Introduction

A machine learning application often begins with a dataset of examples and the task is to find a classification model that will turn inputs into class-label predictions, while preserving some sense of minimum expected error. The learning problem is often unrealizable, so no perfect model exists that will have 0 generalization error [shalev2014understanding]. But less obviously, it is often possible to deterministically find input examples that force the model to misclassify [szegedy2014intriguing]. Machine learning (ML) models can be subjected to adversarially crafted small perturbations that purposely induce these errors, and they can seem unjustified or surprising to a human observer (e.g., a digital image of a school bus mistaken for a bird). As automated ML-based classifiers pervade across applications in transportation, medicine, finance, and cybersecurity, adversarial errors could grow to be a very serious problem. The danger is particularly acute in industrial control systems (ICS), industrial Internet of Things (IIoT), automation equipment, and factory robotics, where malfunctions can be life-threatening (e.g., steel mill furnace explosions, power grid crashes, etc.). Unfortunately, ICS attacks are on the rise, with increased vectors for malicious party access to critical infrastructure [icscert17]. Detection of attacks to cyberphysical systems [yan2018cyberattack], and particularly as it relates to adversarial ML, is a growing area of concern that has been underserved in PHM literature.

Despite vigorous study over the past few years (see review in [gilmer2018motivating]), the field of adversarial ML is considered by researchers to be at a nascent stage [evanstalk], with no practical unbroken defenses demonstrated to date (attacks succeed with ) [carlini2017adversarial], and there is still talk of an “arms race” between attackers and defenders [goodfellow2018making]. This leaves PHM practitioners with few meaningful ways of addressing the problem. We have identified a fundamental flaw in the current interpretation of adversarial defenses, and offer an alternative practical reformulation of the problem that copes with population-level campaign mitigation as opposed to individual input, case-by-case protection.

The defense side of adversarial ML has tried to answer a blend of two questions: (a) How to robustify a model (make it harder for an attacker to fool)? This has led to adversarial training, defensive distillation, feature squeezing, architecture modification, and minimax optimization [madry2017towards]; and (b) What can be measured about adversarial inputs that is different from regular ones? This has led to input validators and adversarial detectors [goodfellow2018making]. Generative adversarial networks (GANs) can synthesize adversarial examples which can then be used to retrain the classifier; however, this only helps insofar as it gets a classifier closer to Bayes optimality (it can also make things worse). By omissions in the current discourse, these methods have created the illusion that we could one day prebake a solution at training time that will protect a model against one-or-few-off adversarial inputs at deployment time. Our work suggests that the latter goal comes at a disproportionate price in expected error. Intuitively, if there were a way to accurately detect error-inducing inputs at runtime, then that same detector would have been used to augment or improve the training to begin with.

In the following sections, we will introduce turbidity detection as a different, ROC-centric way of thinking about adversarial example detection that fixes current widespread misinterpretations and leads to a practical mitigation. Our theory yields 3 previously unreported results: (i) unqualified use of an adversarial detector inverts ROC (harms); (ii) adversarial campaign pinches down ROC (harms); and (iii) conditions exist where the ROC can be repaired to at least a gracefully degraded state during the campaign. We propose a methodology for putting that into practice and show experimental results using image (digit recognition) and IIoT security (malware detection) data.

## 2 Turbidity Detection Theory

Our first aim is to show that the unqualified use of an adversarial detector has deleterious effect on ROC. To that end, we will start in a seemingly restrictive setting: 1-dimensional input, uniform distribution over –10 to 10, binary output from binomial discrete-choice theory with logistic noise, equiprobable classes, and Bayes decision rule. However, our main results (ROC inversion, pinching, and repair) will not critically depend on these specific choices, retaining clarity of illustration without loss of generality.

Instead of asking where adversarial examples are “hiding” in high-dimensional input space, we focus on the scalar decision score output axis (preactivation/logit or post-activation/pseudo-probability), where model-processed samples have to end up anyway, and where any decision confusion actually occurs. Figure 1 shows a deep neural network taking an input array through convolutional and nonlinear activation layers, then dense layers reducing the output to a scalar decision score $s$ (here logit). Consider a data-generating process (DGP) such that ground-truth bipolar labels $y \in \{-1,+1\}$ come from adding the score to a symmetric noise $\epsilon$ (whose scale and bias control class separability and class imbalance respectively), and taking sign:

$$y = \operatorname{sign}(s + \epsilon) \tag{1}$$

The symmetry of the noise about its 0 mean implies our DGP emits equiprobable labels: $P(y=+1) = P(y=-1) = 1/2$. In order to output a monotonic and correctly calibrated posterior probability $P(y=+1 \mid s)$, what the last-layer activation of the deep net “wants to be” is the CDF $F_\epsilon$ of the discrete-choice noise:

$$P(y=+1 \mid s) = P(\epsilon > -s) = 1 - F_\epsilon(-s) = F_\epsilon(s) \tag{2}$$

The logistic-distributed noise has logistic sigmoid CDF, agreeing with an output neuron (here with unit scale and zero bias):

$$F_\epsilon(s) = \sigma(s) = \frac{1}{1 + e^{-s}} \tag{3}$$

The Bayes-optimal decision rule (one yielding least probability of misclassification in our DGP) corresponds to the homogeneous halfspace (here semiaxis) obtained by thresholding the above posterior probability at 0.5, or directly thresholding the preactivation score at 0 (with class 1 standing for $y=+1$ and class 0 for $y=-1$):

$$\hat{y}(s) = \begin{cases} 1, & s > 0 \\ 0, & s \le 0 \end{cases} \tag{4}$$
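As a quick numerical check of the reference setting, the following is a minimal sketch that simulates the 1D DGP (uniform scores on –10 to 10, logistic label noise) and compares the Monte Carlo accuracy of the zero-threshold rule against a closed-form value. The function names, the unit noise scale, and the sample size are our illustrative assumptions, not part of the original study.

```python
import numpy as np

def simulate_dgp(n, rng, scale=1.0, lo=-10.0, hi=10.0):
    """Draw (score, label) pairs: s ~ Uniform(lo, hi), label 1 if s + eps > 0."""
    s = rng.uniform(lo, hi, size=n)
    eps = rng.logistic(loc=0.0, scale=scale, size=n)  # symmetric label noise
    y = (s + eps > 0).astype(int)                      # class 1 <-> bipolar +1
    return s, y

def bayes_accuracy_closed_form(hi=10.0):
    """E[max(sigma(s), 1 - sigma(s))] for s ~ Uniform(-hi, hi), unit noise scale."""
    softplus = lambda x: np.logaddexp(0.0, x)          # antiderivative of sigma
    return (softplus(hi) - softplus(0.0)) / hi

rng = np.random.default_rng(0)
s, y = simulate_dgp(200_000, rng)
y_hat = (s > 0).astype(int)                            # threshold the score at 0
mc_accuracy = float(np.mean(y_hat == y))
exact_accuracy = float(bayes_accuracy_closed_form())
print(f"Monte Carlo accuracy: {mc_accuracy:.4f}, closed form: {exact_accuracy:.4f}")
```

Under these assumptions both values come out near 0.93, which is the regular-environment accuracy quoted later for the original detector.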

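Now we derive the ROC for this ideal detector in its regular environment. From Bayes theorem, the 0 (“clean”)-vs-1 (“mal”) class-conditionals of the score are

$$p(s \mid 0) = 2\,[1 - F_\epsilon(s)]\;U(s;-10,10), \qquad p(s \mid 1) = 2\,F_\epsilon(s)\;U(s;-10,10) \tag{5}$$

(see Figure 2(a)), where $F$ denotes CDF from now on ($F_\epsilon$ being that of the label noise), $U(s;-10,10) = \tfrac{1}{20}\,[-10 \le s \le 10]$ is the uniform PDF, and that last Iverson bracket $[\cdot]$ means indicator function: valued 1 when the event $s$-within-the-interval is true and 0 otherwise. In our 1D mathematical reference figures, the score is directly equal to the input: $s = x$ (while in higher dimensions, it will be an inner product $s = \mathbf{w}^\top\mathbf{x}$ where coordinates can be explanatory variables, features, previous neural layers, etc.).

The marginal of scores is the uniform PDF $p(s) = U(s;-10,10)$ (Figure 2(b)), while the malicious class-posterior is (Figure 2(c)):

$$P(1 \mid s) = \frac{p(s \mid 1)\,P(1)}{p(s)} = F_\epsilon(s), \qquad -10 \le s \le 10 \tag{6}$$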

Finally, monotonicity of the class-1 posterior allows us to obtain the ROC from a single sweep on the s-axis, yielding the parametric curve (Figure 2(d)):

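$$\big(\mathrm{FPr}(t),\ \mathrm{TPr}(t)\big) = \big(1 - F_{s\mid 0}(t),\ 1 - F_{s\mid 1}(t)\big), \qquad -10 \le t \le 10 \tag{7}$$

where $t$ is the swept score threshold and

$$F_{s\mid c}(t) = \int_{-10}^{t} p(s \mid c)\,ds, \qquad c \in \{0, 1\} \tag{8}$$

are the class-conditional CDFs of the score.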
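The single-sweep construction can be sketched numerically. The snippet below assumes the 1D reference setting with unit-scale logistic noise, for which the class-conditional CDFs have closed forms in terms of the softplus function; the helper names are ours.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)   # log(1 + exp(x)), antiderivative of the sigmoid

def cdf_class0(t, hi=10.0):
    """CDF of p(s|0) = sigma(-s)/hi on [-hi, hi]."""
    return (softplus(hi) - softplus(-t)) / hi

def cdf_class1(t, hi=10.0):
    """CDF of p(s|1) = sigma(s)/hi on [-hi, hi]."""
    return (softplus(t) - softplus(-hi)) / hi

t = np.linspace(10.0, -10.0, 2001)    # descending threshold sweep on the s-axis
fpr = 1.0 - cdf_class0(t)
tpr = 1.0 - cdf_class1(t)

# Trapezoidal area under the parametric curve, written out by hand.
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
i0 = int(np.argmin(np.abs(t)))        # index of the threshold t = 0
print(f"AUC = {auc:.3f}; at t = 0: FPr = {fpr[i0]:.3f}, TPr = {tpr[i0]:.3f}")
```

The operating point at the zero threshold has equal false-positive and false-negative rates (about 7% each under these assumptions), consistent with the symmetry of the DGP.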

### 2.1 Clarity and Turbidity Distributions

Unless classes are 100% separable in a generalization preserving way relative to the DGP (input features, label noise, and their statistical relation), every model, including the Bayes-optimal one, experiences difficulty whenever it makes the wrong class prediction. We say that samples that confuse the model, i.e., FPs and FNs, are turbid from the model’s point of view, whereas all the other correctly-classified TNs and TPs are clear. We can think of every model that tackles the original 0-vs-1 problem as having an inherent dual problem: separating clear-vs-turbid (denoted e-vs-d as mnemonic for “easy”-vs-“difficult”), for which a different detector can be built. Since the model’s confusion depends on its output threshold, by default we peg the associated turbidity detection concept to the maximum balanced-accuracy/Youden index threshold in the original detector

$$t^{*} = \arg\max_{t}\,\big[\mathrm{TPr}(t) - \mathrm{FPr}(t)\big] \tag{9}$$

i.e., the ROC operating point closest to upper-left corner.

Next we present the clarity and turbidity distributions for a DGP where there is 50-50% proportion of clear vs turbid samples (something that we will characterize as a toxic environment compared to the regular one where mistakes should be rare), and 50-50% proportion of clean vs mal within each. Obtain each conditional as a mixture of the truncated class-0 plus the truncated class-1 PDFs. For example, the left half of turbidity consists of the left tail of $p(s\mid 1)$ (= FNs) normalized by the area under it up to 0 (= $F_{s\mid 1}(0)$, with $F_{s\mid c}$ the CDF of $p(s\mid c)$), while the right half has the right tail of $p(s\mid 0)$ (= FPs) normalized by the area under it from 0 onward (= $1 - F_{s\mid 0}(0)$). The mixture of these 2 densities then gives the inflexed arch shape (purple in Figure 2(e)), similar to a truncated Laplace distribution:

$$p(s\mid d) = \frac{1}{2}\,\frac{p(s\mid 1)\,[s<0]}{F_{s\mid 1}(0)} + \frac{1}{2}\,\frac{p(s\mid 0)\,[s \ge 0]}{1 - F_{s\mid 0}(0)}, \qquad p(s\mid e) = \frac{1}{2}\,\frac{p(s\mid 0)\,[s<0]}{F_{s\mid 0}(0)} + \frac{1}{2}\,\frac{p(s\mid 1)\,[s \ge 0]}{1 - F_{s\mid 1}(0)} \tag{10}$$

Obtain the marginal of scores $p(s)$ from the equiprobable mixture of the clear and turbid distributions (or from total probability theorem; Figure 2(f)), and the turbidity class-posterior as (Figure 2(g)):

$$P(d \mid s) = \frac{p(s\mid d)\,P(d)}{p(s)} = \frac{p(s\mid d)}{p(s\mid e) + p(s\mid d)} \tag{11}$$

Note that this symmetric reverse-ogee arch is nonmonotonic. This implies that the theoretical ROC curve can no longer be obtained simply by sweeping a single threshold over the domain; doing so would result in a suboptimal improper curve (under diagonal chance line). The most general method is to sweep a descending threshold on the vertical axis of the class-posterior, nonlinearly solve/root-find all critical values where posterior intersects the threshold, then calculate area under class-conditionals over regions so as to obtain the pair (FPr, TPr). In effect, the ROC curve computation becomes multibranched, with number of connected segments dependent on number of intersections encountered during the sweep. A general multibranched algorithm is given in Appendix A1. Figure 2(h) shows the exact ROC, using either the multibranched algorithm just described or an alternative monotonic version afforded by symmetry in this case. Luckily, when data scientists compute an empirical ROC (i.e., from a data sample), they automatically obtain a Monte Carlo estimate, so theoretical complications like the nonmonotonicity above are never encountered. However, the scores should be presented as the possibly nonmonotonic posteriors instead of as preactivations.
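The last point can be illustrated with a small Monte Carlo sketch. It assumes the 1D reference DGP with unit-scale logistic noise, labels each sample as turbid when the zero-threshold rule misclassifies it, and compares the empirical AUC of the clear-vs-turbid problem under two score presentations: the raw preactivation $s$, and $-|s|$, which in this symmetric case orders samples the same way as the turbidity posterior. The rank-based AUC helper is ours.

```python
import numpy as np

def auc_from_scores(scores, labels):
    """Rank-based (Mann-Whitney) AUC; assumes continuous scores without ties."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return float((ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

rng = np.random.default_rng(1)
n = 200_000
s = rng.uniform(-10.0, 10.0, size=n)
y = (s + rng.logistic(size=n) > 0).astype(int)
turbid = ((s > 0).astype(int) != y).astype(int)    # 1 = FP or FN of the Bayes rule

auc_raw = auc_from_scores(s, turbid)               # single sweep on the raw score
auc_post = auc_from_scores(-np.abs(s), turbid)     # sweep ordered like P(d|s)
print(f"raw-score AUC = {auc_raw:.3f}, posterior-ordered AUC = {auc_post:.3f}")
```

Sweeping the raw score leaves the turbidity detector at chance, whereas the posterior-ordered score recovers a clearly above-chance ROC.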

### 2.2 Relation to Adversarial Detection

The widely accepted oracle definition of adversarial examples [evanstalk] states that: (i) they are created with intent to deceive, (ii) they start from a seed example of say class A, correctly seen as class A by the model, and (iii) after perturbation they still behave like class A according to the oracle/ground-truth, yet they are now incorrectly seen as class B by the model. However, the goal of “adversarial example detection” (accurately determining at runtime whether an input is adversarial) has been widely misconstrued, leading to overfitting and/or invalidly-dichotomized detectors. If we insist we can detect a particular set of adversarial samples, then that same detector is bound to fail on a freshly created one operating in a regular environment. It will work if operated in a toxic environment, but then for a whole different reason as we’ll see below.

A typical adversarial detection experiment starts from a dataset of regular samples, takes each instance in the dataset as a seed to which a transformation (e.g., from CleverHans library [papernot2016technical]) is applied in order to create an adversarial counterpart, and then sees if the “regular-vs-adversarial” examples are discernible in some way (e.g., by showing differences in distributions or by building adversarial detectors and measuring their above-chance discrimination). By definition, all adversarial examples are turbid. Further, they can exist with “high model confidence” (with $P(1\mid s)$ near 0 or near 1). But exactly the same is true of natural, unforced errors. All regular FPs and FNs are turbid, and while most are associated with low confidence ($P(1\mid s) \approx 0.5$) near the model’s decision boundary, high-confidence ones also arise. They happen as predicted even by the 1D DGP, just less frequently, consistent with the tapered-but-still-nonzero tails of the turbidity distribution. Thus, regular and adversarial samples can share the same domain.

We don’t believe human intent is a distinguishing feature that can be measured either (a view hinted at in [carlini2017adversarial]), any more than one could tell whether the person who made the samples was left-handed from looking at the numerical input coordinates. So what is it in the standard adversarial detection experiment that is being detected? Answer: the observed difference between regular and adversarial conditions stems from the fact that all adversarial samples are turbid/difficult by definition, whereas in the regular environment turbid samples are rare. Turbid samples tend to concentrate while regular clear tend to spread, thus second moments separate. A scientific animation illustrating this point can be seen in [javier10ml].

These issues can be fixed by moving from a fortuitous “regular-vs-adversarial” dichotomy to the principled “clear-vs-turbid,” and by not blurring the line between detector and its intended deployment environment [gilmer2018motivating]. Dropping intent and seed-of-origin out of the adversarial character makes the problem realistic and applicable to campaign mitigation.

### 2.3 ROC Inversion

We now show that a realistic adversarial (i.e., e-vs-d) detector cannot actionably help 0-vs-1 decision-making in a regular environment as it leads to ROC inversion. In the spirit of reductio ad absurdum, let the theoretical 1D adversarial detector $D_{ed}$ in Figures 2(g,h), Eq. (11) augment the probabilistic 0-vs-1 detector $D_{01}$ in Figures 2(c,d), Eq. (6). Given any input $x$ at test time, if $D_{ed}$ is accurately declaring that $x$ is adversarial then we would want to contradict the decision from $D_{01}$. From the point of view of $D_{ed}$, the Bayes trigger to declare adversariness is $P(d \mid s) > 0.5$, which is equivalent to checking if input magnitude is within a critical cutoff: $|x| < s_c$ with $s_c \approx 2.6$ (also equal to the crossover points in Figure 2(e)). The augmented detector becomes

$$\hat{y}_{\mathrm{aug}}(s) = \begin{cases} 1 - \hat{y}(s), & |s| < s_c \\ \hat{y}(s), & \text{otherwise} \end{cases} \tag{12}$$

where $\hat{y}(s) \in \{0,1\}$ is the original decision of $D_{01}$, and the augmented-system posterior probability is:

$$P_{\mathrm{aug}}(1 \mid s) = \begin{cases} 1 - P(1 \mid s), & |s| < s_c \\ P(1 \mid s), & \text{otherwise} \end{cases} \tag{13}$$

(Any function that reverses the decision within that interval will work.) Figure 3(a) shows this nonmonotonic posterior. Figure 3(b) shows the exact ROC using multibranched algorithm. The original accuracy of $D_{01}$, 0.93, goes down to 0.79. The original detector was already optimal for its intended regular environment, and overriding its decisions only makes it worse. Thus, protection against one-off adversarial examples is a misguided design goal.
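The size of this accuracy drop can be reproduced in closed form. The sketch below assumes the 1D reference setting with unit-scale logistic noise and a 50-50 clear-vs-turbid prior for the adversarial detector; the cutoff is obtained by equating the clear and turbid densities, and all names are ours.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)   # antiderivative of the logistic sigmoid

HI = 10.0
# Error rate of the Bayes rule at threshold 0 (FNr = FPr by symmetry).
err = (softplus(0.0) - softplus(-HI)) / HI

# Crossover of clear and turbid densities for s < 0:
# sigma(s)/err = sigma(-s)/(1 - err)  =>  exp(s) = err / (1 - err).
s_c = float(-np.log(err / (1.0 - err)))

# Accuracy = average over s of P(decision is correct | s), using symmetry in s.
acc_original = float((softplus(HI) - softplus(0.0)) / HI)
acc_augmented = float(((softplus(0.0) - softplus(-s_c))        # reversed: 0 <= s < s_c
                       + (softplus(HI) - softplus(s_c))) / HI)  # untouched: s >= s_c
print(f"s_c = {s_c:.2f}, accuracy {acc_original:.3f} -> {acc_augmented:.3f}")
```

Under these assumptions the augmented accuracy lands at roughly 0.80, matching the reported drop to within rounding.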

### 2.4 ROC Pinch-Down

We now show that operating the original 0-vs-1 detector in a toxic environment leads to an ROC pinch-down. Under adversarial campaign, class-conditionals can display abnormal concentrations around the original decision score threshold, a single crossover (as in Figure 2(a)) can become multiple, class-posterior can turn nonmonotonic, and errors become frequent, making model performance plummet. Continuing with the balanced proportions of 50-50% e-vs-d and 50%-50% 0-vs-1 of Section 2.1, the class-conditional likelihoods that the original detector now has to confront are (Figure 4(a)):

$$\tilde{p}(s\mid 0) = p(s\mid e)\,[s<0] + p(s\mid d)\,[s \ge 0], \qquad \tilde{p}(s\mid 1) = p(s\mid d)\,[s<0] + p(s\mid e)\,[s \ge 0] \tag{14}$$

where $p(s\mid e)$ and $p(s\mid d)$ are the conditionals in Eq. 10.

The marginal of inputs is identical to Figure 2(f), just composed differently from the average of the above toxic conditionals. The posterior (not shown) is the same one in Figure 2(c) since the model remained naively unchanged during this toxic campaign. The exact ROC (Figure 4(b)) can be obtained here from the monotonic sweep form in Eq. 7, except with the class-conditional CDFs replaced by those of the toxic conditionals $\tilde{p}(s\mid 0)$ and $\tilde{p}(s\mid 1)$. So while augmenting the detector during normal operation was harmful, ignoring the problem during abnormal operation is potentially even worse. Thus, protection against adversarial campaigns (not one-offs) is needed.
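A minimal Monte Carlo sketch of the pinch-down follows. It assumes the 1D reference DGP with unit-scale logistic noise and emulates the toxic environment by resampling each class so that half of its scores are misclassified by the original zero-threshold rule; the resampling helper is ours and stands in for an actual attack.

```python
import numpy as np

def resample_toxic(s, y, turbid_frac, n_out, rng):
    """Resample one class so that `turbid_frac` of its scores are misclassified."""
    wrong = (s > 0).astype(int) != y
    n_turbid = int(round(turbid_frac * n_out))
    turbid = rng.choice(s[wrong], size=n_turbid, replace=True)
    clear = rng.choice(s[~wrong], size=n_out - n_turbid, replace=True)
    return np.concatenate([clear, turbid])

rng = np.random.default_rng(2)
n = 200_000
s = rng.uniform(-10.0, 10.0, size=n)
y = (s + rng.logistic(size=n) > 0).astype(int)

s0 = resample_toxic(s[y == 0], y[y == 0], 0.5, 50_000, rng)   # toxic class 0
s1 = resample_toxic(s[y == 1], y[y == 1], 0.5, 50_000, rng)   # toxic class 1

# Operating points of the unmodified detector at three thresholds.
for t in (-5.0, 0.0, 5.0):
    fpr, tpr = float(np.mean(s0 > t)), float(np.mean(s1 > t))
    print(f"t = {t:+.0f}: FPr = {fpr:.3f}, TPr = {tpr:.3f}")
pinch = (float(np.mean(s0 > 0.0)), float(np.mean(s1 > 0.0)))
```

The operating points away from the original threshold stay above the chance line, while the point at the original threshold is pulled onto it, which gives the curve its pinched shape.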

#### 2.4.1 Asymmetric Toxic Environments

At the 1:1 ratio of e-vs-d samples under both 0 and 1 classes, the characteristic ROC “seagull” (Figure 4(b)) has curve pinned at the chance line (50% accuracy). However, at other ratios of turbidity proportions under each class, conditionals become asymmetric and the pinch point moves somewhere else. Class-0 e:d ratio controls the horizontal axis (FPr), while class-1 e:d ratio independently controls the vertical axis (TPr). This means that if adversarial campaign actors could not only add samples but also subtract from the environment seen by the model, they would be able to place the pinch-down point anywhere on the ROC plane! But they would have to be oracles themselves; for example, forcing the model to be always wrong in the future would pin the operating point at the bottom-right corner (something we can’t do ourselves with imperfect knowledge). Figure 5 shows a toxic formulation where class-0 samples are regular (i.e., no adversarial FPs), with their natural proportion of clear to turbid, whereas class-1 samples have an unnatural 37.5%-62.5% proportion.
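As a worked instance of this statement, write $\rho_c$ for the fraction of class-$c$ samples that are turbid at the original threshold (our notation, introduced only for this illustration). Since class-0 samples scoring above the threshold are exactly the FPs and class-1 samples scoring above it are exactly the clear TPs, the pinch point sits at:

```latex
\bigl(\mathrm{FPr},\ \mathrm{TPr}\bigr)\Big|_{t = 0} = \bigl(\rho_0,\ 1 - \rho_1\bigr)
% symmetric 1:1 campaign:  \rho_0 = \rho_1 = 0.5  =>  (0.5, 0.5), on the chance line
% always-wrong oracle:      \rho_0 = \rho_1 = 1    =>  (1, 0), the bottom-right corner
```

If the 37.5%-62.5% proportion of Figure 5 is read as clear-to-turbid (an assumption on our part) and class 0 keeps its natural turbid fraction of about 7% from the 1D reference setting, the pinch point would land near (0.07, 0.375).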

### 2.5 Mitigation/Repair of the Degraded ROC

In its intended regular environment, the original model can adapt to changes in maliciousness imbalance (0-vs-1 prevalences) by simply sliding its operating point along the intact, class prevalence-agnostic ROC curve. However, in the adversarially toxic environment it is no longer enough to simply adjust a threshold to match the environment; a fundamentally different detection problem must be solved. In order to “unpinch” the ROC to the best available shape given the adversarial campaign, we should obey the new posterior:

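$$\tilde{P}(1 \mid s) = \frac{\tilde{p}(s\mid 1)\,\tilde{P}(1)}{\tilde{p}(s\mid 0)\,\tilde{P}(0) + \tilde{p}(s\mid 1)\,\tilde{P}(1)} \tag{15}$$

where tildes denote the class-conditionals and priors observed in the toxic environment.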

This will typically be nonmonotonic (Figure 6(a)). The exactly repaired ROC is obtained from the multibranched algorithm as shown in Figure 6(b).

The new optimal maximum-a-posteriori Bayes classifier implements decision reversals relative to the original one. Reversals occur only within the decision score intervals where the new heights of 0-vs-1 conditional likelihoods have swapped their dominance, due to the new concentration of turbid/difficult samples in the environment. Thus, the mitigated detector is gracefully (rather than catastrophically) degraded, restoring acceptable error rates and adaptability to maliciousness imbalance.
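The repair can be sketched numerically on a grid. The snippet assumes the symmetric 1:1 toxic environment built from the 1D reference DGP (unit-scale logistic noise, equiprobable classes), forms the toxic class-conditionals by renormalizing and mixing the clear and turbid parts of each class, and compares the accuracy of the original rule with that of the new MAP rule. Grid spacing and names are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

HI = 10.0
ds = 0.001
s = np.arange(-HI + ds / 2, HI, ds)          # midpoint grid on [-10, 10]
p0 = sigmoid(-s) / HI                        # regular p(s|0)
p1 = sigmoid(s) / HI                         # regular p(s|1)
neg, pos = s < 0, s >= 0

def toxic(p, clear_side, turbid_side, turbid_frac=0.5):
    """Mix the renormalized clear and turbid parts of one class-conditional."""
    clear = np.where(clear_side, p, 0.0)
    turbid = np.where(turbid_side, p, 0.0)
    return ((1 - turbid_frac) * clear / (clear.sum() * ds)
            + turbid_frac * turbid / (turbid.sum() * ds))

q0 = toxic(p0, neg, pos)   # class 0: clear when s < 0, turbid (FP) when s >= 0
q1 = toxic(p1, pos, neg)   # class 1: clear when s >= 0, turbid (FN) when s < 0

posterior = q1 / (q0 + q1)                   # equiprobable 0-vs-1 priors assumed
repaired = posterior > 0.5                   # new MAP decision
original = s > 0                             # unmitigated decision

def accuracy(decide_1):
    """P(correct) under equal priors: integrate the density of the chosen class."""
    return float(0.5 * np.sum(np.where(decide_1, q1, q0)) * ds)

reversed_scores = s[repaired != original]
print(f"accuracy: original {accuracy(original):.3f}, repaired {accuracy(repaired):.3f}")
print(f"decisions reversed for {reversed_scores.min():.2f} < s < {reversed_scores.max():.2f}")
```

In this sketch the unmitigated detector sits at chance in the toxic environment, while the repaired rule recovers an accuracy of roughly 0.84 by reversing decisions only inside a band around the original threshold.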

#### 2.5.1 When Repair Isn’t Possible

In some cases it isn’t really possible to “unpinch” the ROC because the curve morphs into a seamless one with no dent (as if in Figure 5(b) the pinched point fused into the left branch), e.g., with ratio of e-vs-d samples still at 1:1 but with malicious class prevalences falling outside of the interval . The curve is still depressed compared to the original regular one due to 50% of samples being turbid, and only detection threshold remains as a potential adjustment.

We have also uncovered adversarial covariate shift as another condition where ROC repair isn’t possible. This would make score class-conditionals and marginal more turbid while keeping the posterior intact. For example, for a shifted marginal $q(s)$ that is symmetric about 0 and concentrates more mass near the decision boundary than the regular uniform,

$$\tilde{p}(s\mid 0) = 2\,[1 - F_\epsilon(s)]\,q(s), \qquad \tilde{p}(s\mid 1) = 2\,F_\epsilon(s)\,q(s) \tag{16}$$

The reader can verify that the posterior is exactly recovered as $F_\epsilon(s)$, the CDF of the label noise (also true for other imbalanced 0-vs-1 priors). However, it seems unrealistic that an adversary could shape the conditionals in this fashion as it would require omnipotent control of the environment beyond merely adding adversarial samples to the regular one seen by the model.

### 2.6 Generalization to Suboptimal Models and Higher Dimensions

We have systematically charted an atlas documenting how the above 1D reference theory is impacted when the model is suboptimal instead of Bayes-optimal (via mistuned bias and/or misaligned weights), and in higher-dimensional input space, where the decision score is taken to be the preactivation $s = \mathbf{w}^\top\mathbf{x} + b$, i.e., the (possibly augmented) dot product at the output layer of a probabilistic binary classifier. Due to space restrictions, we only mention that all ROC inversion, pinch-down, and mitigation repair results remain qualitatively identical. The score is still 1D; the only difference is that with independent components of $\mathbf{x}$, all distributions become windowed/tapered, and ROCs get “dumber”/shallower from the CLT centrality effect of the marginal $p(s)$, which makes samples appear close to decision boundary more frequently. Injecting correlation structure in components of $\mathbf{x}$ also weakens separability, but the main ROC results hold. Further, nothing above prevents decision scores from being computed by non-neural network models. Thus, the described campaign effects and mitigation apply to any decision-making component that exposes its scores to attackers, including ensembles of decision trees widely prevalent in industrial settings.

## 3 Preemptive Domain Adaptation

The theoretical results in the previous section can be put into PHM practice by monitoring estimates of the decision score class-conditional distributions in order to declare if and when an adversarial campaign is in effect, repair the degraded ROC during campaign, restore the original model after campaign subsides, and improve readiness for future attacks via simulation. Assume a well-trained classifier has been deployed in its originally-intended threat environment where errors are rare (e.g., % hit at % FP rates). A health management methodology can track 0-vs-1 conditional score histograms (and optionally error rates), from which class-conditional curves are kernel-density estimated (KDE) as smooth functions. This still requires ground-truth label estimates; in cybersecurity they are obtained after some lag ranging from sub-seconds (with access to cloud-based reputation, etc.) to days (offline endpoints with sporadic live updates, air-gapped IIoT devices, etc.). There is also a way to detect without ground truth, by introspectively looking at whether too many decision scores are falling in a low-confidence interval, but this is bypassed if attackers actively inject only high-confidence samples.
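The label-free, introspective check mentioned last can be sketched as follows. This is only an illustration: the band half-width, the one-sided binomial z-test, its critical value, and the toy score distributions are all our assumptions rather than settings from the deployed system.

```python
import numpy as np

def low_confidence_rate(scores, threshold=0.0, half_width=1.0):
    """Fraction of decision scores landing within `half_width` of the threshold."""
    scores = np.asarray(scores, dtype=float)
    return float(np.mean(np.abs(scores - threshold) < half_width))

def campaign_suspected(scores, baseline_rate, threshold=0.0, half_width=1.0, z_crit=4.0):
    """Flag a window whose low-confidence rate is far above the historical baseline.

    Uses a one-sided binomial z-score; `z_crit` is an illustrative choice and
    `baseline_rate` is assumed to lie strictly between 0 and 1.
    """
    n = len(scores)
    rate = low_confidence_rate(scores, threshold, half_width)
    std_err = np.sqrt(baseline_rate * (1.0 - baseline_rate) / n)
    return bool((rate - baseline_rate) / std_err > z_crit)

rng = np.random.default_rng(3)
regular = rng.uniform(-10.0, 10.0, size=5_000)            # toy regular scores
baseline = low_confidence_rate(regular)                   # historical rate
attack = rng.laplace(loc=0.0, scale=1.0, size=1_000)      # scores piled near 0
window = np.concatenate([rng.uniform(-10.0, 10.0, size=4_000), attack])
print(campaign_suspected(regular, baseline), campaign_suspected(window, baseline))
```

As the text notes, a monitor of this kind is blind to campaigns that inject only high-confidence samples, so it complements rather than replaces label-based tracking.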

Under adversarial campaign, class-conditionals may develop multimodality, with multiple crossover points that misfit the original decision rules, making model performance plummet for some period of time. This condition can be declared from observation of empirical ROC pinch-down or abnormally large error rates. (In the low-confidence campaign case, the model can track its own decision scores falling in an interval near decision threshold at higher-than-historical rates, suggesting adversarial manipulation since it would be rare to see that in the regular environment.) At that time, an equiprobable class-posterior function implementing Eq. (15) is transmitted to the endpoint model to be used as a post-transformation layer of the original decision scores. The output of this function can then be thresholded to obtain a desired mitigated ROC operating point. In effect, this is an inexpensive statistical domain adaptation that reverses decisions when it makes sense to do so. The same logic can be applied in reverse to restore the original model when campaign has subsided.
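A post-transformation layer of this kind can be sketched with a plain Gaussian KDE. The code below is a minimal illustration with a fixed, hand-picked bandwidth and synthetic campaign scores (each class has a regular mode plus an adversarial mode on the wrong side of the threshold); none of these values come from the experiments reported here.

```python
import numpy as np

def gaussian_kde(samples, h):
    """Return a function evaluating a Gaussian KDE with bandwidth h."""
    samples = np.asarray(samples, dtype=float)
    def pdf(x):
        z = (np.atleast_1d(x)[:, None] - samples[None, :]) / h
        return np.exp(-0.5 * z**2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))
    return pdf

def make_posterior_layer(scores0, scores1, h, prior1=0.5):
    """Build the map s -> P(1|s) from labeled campaign scores."""
    p0, p1 = gaussian_kde(scores0, h), gaussian_kde(scores1, h)
    def layer(s):
        num = prior1 * p1(s)
        den = num + (1.0 - prior1) * p0(s)
        # Fall back to 0.5 where both densities underflow to zero.
        return np.where(den > 0, num / np.where(den > 0, den, 1.0), 0.5)
    return layer

# Toy campaign: each class has a regular mode plus an adversarial mode near 0.
rng = np.random.default_rng(4)
scores0 = np.concatenate([rng.normal(-5, 1.5, 500), rng.normal(1, 0.4, 500)])
scores1 = np.concatenate([rng.normal(5, 1.5, 500), rng.normal(-1, 0.4, 500)])
layer = make_posterior_layer(scores0, scores1, h=0.3)

test_scores = np.array([-5.0, -1.0, 1.0, 5.0])
mitigated = layer(test_scores) > 0.5           # threshold the layer output
print(dict(zip(test_scores.tolist(), mitigated.tolist())))
```

In this toy campaign the layer leaves the far scores alone and reverses the decisions at the two adversarial modes, which is the behavior the mitigation relies on. Restoring the original model amounts to removing the layer again.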

The methodology above is still 100% reactive defense. Our investigation suggests that a 100% proactive defense (where the model is hardened at training time against all future one-off adversarial samples in the regular environment) is mathematically impossible. Thus, prognostics in the usual sense of predicting remaining life until failure, to do something about it before it occurs, is outside the scope of our work. However, we introduce a proactively reactive compromise. It precomputes the optimal response to each of several plausible adversarial attack scenarios, via Monte Carlo simulation drawing from the model’s own turbidity distribution, and stores that information as a look-up table to quickly deploy the correct mitigation during a real-world campaign. A final health maintenance modality is to put the decision modification layer into effect continuously/prophylactically without waiting to detect that a campaign has begun (in which case the last layer calculation automatically yields simply an identity function). This way, as machine operating conditions change even gradually, the method is already there to mitigate possibly harmful effects while any persistent shifting is investigated.
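One way to picture the look-up table is sketched below for the 1D reference DGP. Samples are drawn once, the model's own clear and turbid scores are separated per class, and for each hypothetical campaign scenario (here parameterized only by a common turbid fraction, which is our simplification) histogram densities are mixed and the score range where the MAP decision contradicts the original threshold is stored. Bin width, scenario grid, and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400_000
s = rng.uniform(-10.0, 10.0, size=n)
y = (s + rng.logistic(size=n) > 0).astype(int)
wrong = (s > 0).astype(int) != y                 # the model's own turbid samples
edges = np.linspace(-10.0, 10.0, 81)             # 0.25-wide score bins
centers = 0.5 * (edges[:-1] + edges[1:])

def class_density(cls, turbid_frac):
    """Histogram density of one class with a prescribed turbid fraction."""
    clear, _ = np.histogram(s[(y == cls) & ~wrong], bins=edges, density=True)
    turbid, _ = np.histogram(s[(y == cls) & wrong], bins=edges, density=True)
    return (1.0 - turbid_frac) * clear + turbid_frac * turbid

def reversal_interval(turbid_frac):
    """Score range where the scenario's MAP decision contradicts thresholding at 0."""
    q0, q1 = class_density(0, turbid_frac), class_density(1, turbid_frac)
    flipped = (q1 > q0) != (centers > 0)
    if not flipped.any():
        return None                               # identity layer: nothing to reverse
    return float(edges[:-1][flipped].min()), float(edges[1:][flipped].max())

natural = float(wrong.mean())                     # regular-environment turbid fraction
lookup = {rho: reversal_interval(rho) for rho in (natural, 0.25, 0.5)}
for rho, interval in lookup.items():
    print(f"turbid fraction {rho:.3f}: reverse decisions on {interval}")
```

At the natural turbid fraction the table entry is empty, which corresponds to the identity behavior of the prophylactic layer, and the reversal band widens as the simulated campaign becomes more toxic.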

## 4 Experiments with Real-World Data

This section verifies the main ROC inversion, pinching, and repair results using real-world data with corresponding attacks against a deep neural network classifier in two application areas: digit recognition and IIoT malware detection.

### 4.1 Digit Recognition

The standard MNIST benchmark dataset was used, containing 60,000 grayscale 28×28 px images of handwritten digits. The deep convolutional neural network trained in [dhaliwal2018gradient], whose first layers are visualized in Figure 1, achieved over 99% accuracy on a holdout split of the data. A stratified random sample of 2400 images was taken to equally represent all digits. We adversarially generated 2400 FPs and 2400 FNs using the Carlini-Wagner algorithm [carlini2017towards]. The 10-class problem was dichotomized into classes ‘not-1’ vs ‘1’ by unfolding the preactivation decision scores as

(17) |

where $z_1$ is the preactivation score at the neuron for class ‘1’ (2nd indicator in softmax layer). That leaves 2160 regular instances of class ‘not-1’ and 240 of ‘1’, a class-prior imbalance of 9:1. Figure 7 shows the pdf-normalized regular conditional histograms (top) and the turbidity distribution from the adversarial FNs and FPs (bottom; additionally color-coded by not-1 vs 1 classes). As predicted by the theory, the latter distribution has a Laplace-like inflex concentration around the score decision-crossing point (cf. purple in Figure 2(e)). For clarity, it is shown with balanced not-1s vs 1s within the turbid condition; the regular environment would have 9 times more FPs than FNs while toxic ones can be manipulated. In this potentially overfit “99%” accuracy case, we cannot display an empirical turbidity distribution with only the natural FPs and FNs because there were only 2 and 0 cases, respectively.

Aided by 2160 of the adversarially discovered FPs and 240 of the FNs to mimic the natural 9:1 class lopsidedness in the regular environment, we estimated decision-reversal interval as [–1.6,1.6] (graphically from intersection of empirical conditional histograms, much like green vs purple curves in Figure 2(e)), and generated the empirical ROC over the 2400 regular samples. Figure 9(b) inset confirms that the resulting ROC is inverted. It appears to be small harm but there is actually almost an order-of-magnitude larger FP rate; this difference is critical in the field.

The two class-conditional PDFs were estimated using a Gaussian kernel with bandwidth $h$. This is equivalent to fitting a Gaussian mixture distribution where means equal all individual data points, and variances equal the shared constant $h^2$. During a very toxic adversarial campaign where half of all samples become turbid, KDE-smoothed PDFs from aggregated data at the desired e-vs-d proportions reveal 3 crossovers (Figure 8(a)). The corresponding equiprobable posterior crosses the 0.5 threshold 3 times (Figure 8(b)). Operating the original unmitigated detector yields the pinched-down ROC (Figure 9(c)). In contrast, passing the scores through the mitigated posterior function yields the repaired ROC (Figure 9(d)). Figure 9 confirms the inversion, pinch-down, and repair as predicted by the theory. We note, however, that the empirical nature of the construction makes ROCs look like staircases, and it is impossible to discern convexity of the inversion from concavity of the pinch-down (something we know only from the theory).

### 4.2 Malware Detection from Raw Bytes

Hand-crafted feature engineering for static malware detection takes substantial expertise and years to develop. Increasingly, deep learning alternatives are showing promise as end-to-end feature learners-plus-classifiers, trained from raw binary file examples [Raff2017, krcal2018]. We now verify the adversarial campaign health management framework using a pre-production model intended for an IIoT “ICSP Neural” USB scanning device. The model was made purposely suboptimal (with regular ROC curve far from upper-left corner) in order to better observe the manifestations of the theory. It contains a deep convolutional neural network with {embedding, 4 convolutional, 3 dense, softmax} layers summarized in Appendix A4, trained on half a million raw executable files (originally aggregated from a mix of clean and malicious customer submissions and vendor feeds), XORed with a common byte for inoculation at rest, and spanning at least 6 months of age to encourage learning ‘invariant’ features. This type of network is fed integers in [0,255] representing bytes of a file zero-padded or cropped to length 700,000 (as if it were a wide image that is only 1 pixel tall). The test dataset consisted of 2000 clean and 2000 malicious files sampled from a time split spanning one month after the model’s training date.

One of the simplest adversarial attacks for binaries to circumvent malware protection is to append a crafted payload at the end of the file [trustcom18dl]. These methods can append a binary string, backdoor legitimate files by adding a new section to the executable (either as data or code), or use the resource part of the file when modifying already compiled code. Many far more sophisticated attacks are available [anderson2018learning, suciu2019].

In the present focus of research, the quality of the attack is less important than just finding misclassification-inducing perturbations, so we used the brute-force algorithm in Appendix A3 that appends fixed or random chunks until the model flips its decision (to within a 1000-trial count tolerance). A high-confidence campaign was defined as a set of new binaries bypassing the model with pseudo-probability output above 0.97. Drawing seeds from the size-4000 test set, 512 adversarial FPs and 524 adversarial FNs were created this way.
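Appendix A3 is not reproduced in this section, so the following is only a hedged sketch of what a brute-force append loop of this kind could look like. The `score_fn` argument is a hypothetical stand-in for the deployed model, the toy scorer and chunk length are invented for illustration, and the choice to restart from the original seed on every trial (rather than accumulating chunks) is our assumption.

```python
import numpy as np

def append_attack(file_bytes, score_fn, threshold, rng, chunk_len=256, max_trials=1000):
    """Append random chunks until the decision flips or the trial budget runs out.

    `score_fn` is a hypothetical stand-in for the deployed model: it maps a
    uint8 array to a scalar decision score (malicious when score > threshold).
    Each trial appends a fresh random chunk to the original seed.
    """
    seed_decision = score_fn(file_bytes) > threshold
    for trial in range(1, max_trials + 1):
        chunk = rng.integers(0, 256, size=chunk_len, dtype=np.uint8)
        candidate = np.concatenate([file_bytes, chunk])
        if (score_fn(candidate) > threshold) != seed_decision:
            return candidate, trial
    return None, max_trials

def toy_score(b):
    """Toy model for illustration only: mean byte value, rescaled around 0."""
    return (float(np.mean(b)) - 127.5) / 10.0

rng = np.random.default_rng(6)
seed = np.full(2_000, 127, dtype=np.uint8)       # toy "file" just below the boundary
adversarial, trials = append_attack(seed, toy_score, threshold=0.0, rng=rng)
print("flipped" if adversarial is not None else "not flipped", "after", trials, "trials")
```

Because the appended bytes never alter the original content, the seed's behavior is preserved by construction, which is the property the text relies on when calling the result an adversarial FP or FN.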

Figure 10 shows the class-conditional likelihoods in the toxic environment, which are characterized (when pegged to the regular minimum balanced error score threshold 2.2) by 357 unforced + 512 forced FPs, plus 498 unforced + 524 forced FNs, totaling 1891 errors and thus a 62.5%-37.5% clear-to-turbid ratio. The clean (class 0) conditional is estimated from

$$\hat{p}(s \mid 0) = \frac{1}{n_0 h_0} \sum_{i=1}^{n_0} K\!\left(\frac{s - s_i}{h_0}\right) \tag{18}$$

where $K$ is the standard Gaussian kernel, $s_i$ are the $n_0$ unfolded decision scores (1-unit softmax preactivation minus 0-unit) under class 0, and $h_0$ is the bandwidth from Silverman’s estimate. The malware (class 1) conditional was similarly obtained with its own $n_1$ and $h_1$. Unlike in previous situations, this special high-confidence campaign has adversarial scores dominating at one tail of each distribution, with no adversarial scores in the interval [–4,4]. That creates a complex posterior with 4 crossovers with respect to 0.5, to be used for repair (Eq. (15), Figure 10(b)).
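For concreteness, a Gaussian KDE with a rule-of-thumb bandwidth can be sketched as below. We assume the common form of Silverman's rule, 0.9 · min(σ, IQR/1.34) · n^(−1/5); the exact variant used in the experiments is not stated here, and the synthetic scores (a regular mode plus a high-confidence adversarial tail) are purely illustrative.

```python
import numpy as np

def silverman_bandwidth(x):
    """Silverman's rule of thumb: 0.9 * min(std, IQR/1.34) * n^(-1/5)."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    spread = min(np.std(x, ddof=1), (q75 - q25) / 1.34)
    return float(0.9 * spread * len(x) ** (-0.2))

def kde(grid, samples, h):
    """Gaussian kernel density estimate evaluated on `grid`."""
    z = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(samples) * h * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(7)
# Toy class-0 scores: a regular mode plus a high-confidence adversarial tail.
scores0 = np.concatenate([rng.normal(-3.0, 2.0, 2000), rng.normal(6.0, 0.5, 512)])
h0 = silverman_bandwidth(scores0)
grid = np.linspace(-15.0, 12.0, 1081)
p0_hat = kde(grid, scores0, h0)
area = float(np.sum(p0_hat) * (grid[1] - grid[0]))   # Riemann-sum sanity check
print(f"h0 = {h0:.3f}, area under KDE = {area:.3f}")
```

A single global bandwidth tends to oversmooth narrow adversarial modes in multimodal data like this, so the number and location of posterior crossovers can be sensitive to the bandwidth choice.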

Figure 11 shows the devastating effect of this campaign on ROC and how much could be mitigated. Instead of a single pinch-down somewhere along the midsection of the curve, a composite of pinch-down and inversion brings the whole curve down around the chance line. Passing the original decision scores through the KDE-formed posterior brings the whole curve back to at least a gracefully degraded state.

We have seen that the health of both image-recognition and malware-detection components of industrial systems could be managed using our ROC-centric methodology, but this requires some “server side,” even if lagged, for label estimates. A subtle implication is that the introspective “client-side” monitoring alternative in Section 3 (where the device itself could declare an adversarial campaign if too many decision scores are landing in an uncertain band) wouldn’t work with the high-confidence adversarial campaign here. Adversarial actors in security aren’t bound by the small-perturbation rule to the extent that they are with natural images [gilmer2018motivating]. Semantic proximity between a regular image x and its adversarial counterpart means that humans wouldn’t perceive them as belonging to different classes; thus perturbations tend to be small, placing the counterpart near the model’s uncertain boundaries. For malware, semantic proximity only means that the adversarial file will still behave maliciously (or a clean file will stay clean), not that it has to closely resemble the input x. This is manifested as adversarial distribution modes that are central in Figure 8 vs at the extreme ends in Figure 10. Knowledge of this asymmetry can help guide simulations for preemptive domain adaptation.

## 5 Conclusions

The common misunderstanding surrounding what to do about adversarial inputs that fool detectors can be cleared up by fixing the “regular-vs-adversarial” dichotomy and by recognizing the difference between one-off, per-trial protection and adversarial campaign mitigation. Our investigation suggests that universal pre-hardening defenses are impossible without paying a price in accuracy of the original model operating in its regular environment.

We introduced turbidity detection, campaign mitigation, and preemptive domain adaptation as conceptual frameworks leading to practicable detector health management solutions. The theory yielded previously unreported results about ROC inversion, pinch-down, and repair in the context of adversarial threats to deep neural networks increasingly used in industry. Though not tested here, results should generalize to non-neural detectors such as ensembles of decision trees, as long as there is access to an internal score.

It should be understood that our method is not a panacea to shield or empower a model; what it does is optimally mitigate the damage (dramatically so for some ROC operating points) caused by adversarial toxicity that the original model wasn’t designed to tackle on its own.

## References

## Appendix

### A.1 Exact ROC for Nonmonotonic Posterior

Pseudocode for generating the exact multi-branched ROC curve without data, given a possibly nonmonotonic class-t posterior and the non-target vs target CDFs.
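As a rough numerical stand-in for that procedure, the sketch below thresholds the posterior, finds the score intervals where it exceeds the threshold, and sums CDF differences over those intervals to get one ROC point per threshold. It is a minimal illustration under stated assumptions, not the original pseudocode: the function names are hypothetical, crossings are located on a grid and refined with `brentq`, and the CDFs are the closed forms of the reference DGP in A.2.

```python
import numpy as np
from scipy.optimize import brentq

def superlevel_intervals(post, tau, lo, hi, n_grid=4001):
    """Intervals [a, b] inside [lo, hi] on which post(s) > tau."""
    grid = np.linspace(lo, hi, n_grid)
    above = post(grid) > tau
    edges = [lo] if above[0] else []
    for i in np.flatnonzero(above[:-1] != above[1:]):
        # Refine each bracketed crossing of the threshold.
        edges.append(brentq(lambda s: post(s) - tau, grid[i], grid[i + 1]))
    if above[-1]:
        edges.append(hi)
    return list(zip(edges[0::2], edges[1::2]))

def exact_roc(post, cdf0, cdf1, lo, hi, taus):
    """One (FPR, TPR) point per posterior threshold, summed over branches."""
    fpr, tpr = [], []
    for tau in taus:
        ivs = superlevel_intervals(post, tau, lo, hi)
        fpr.append(sum(cdf0(b) - cdf0(a) for a, b in ivs))
        tpr.append(sum(cdf1(b) - cdf1(a) for a, b in ivs))
    return np.array(fpr), np.array(tpr)

# Closed-form CDFs of the reference DGP in A.2 (scores on [-10, 10]).
def cdf1(s):
    s = np.clip(s, -10.0, 10.0)
    return (np.logaddexp(0.0, s) - np.logaddexp(0.0, -10.0)) / 10.0

def cdf0(s):
    s = np.clip(s, -10.0, 10.0)
    return (s + 10.0) / 10.0 - cdf1(s)

def logistic_post(s):
    return 1.0 / (1.0 + np.exp(-s))

def wavy_post(s):
    # Hypothetical nonmonotonic posterior, used only to show multiple branches.
    return 0.5 + 0.4 * np.sin(s)

taus = np.linspace(0.05, 0.95, 19)
fpr_mono, tpr_mono = exact_roc(logistic_post, cdf0, cdf1, -10.0, 10.0, taus)
fpr_wavy, tpr_wavy = exact_roc(wavy_post, cdf0, cdf1, -10.0, 10.0, taus)
```

With a monotonic posterior each threshold yields a single interval, so the result reduces to the usual ROC; the nonmonotonic case simply contributes several intervals per threshold.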

### A.2 Methods to Aid Replication

The reader can quickly verify shapes of distributions and ROCs in this paper (even if empirically without the benefit of A.1) via Monte Carlo methods. The reference regular DGP in Section 2 can be functional-programmed (in MATLAB/Mathematica/R style) directly as

    pd   = makedist('Logistic','mu',0,'sigma',1);    % standard logistic
    pdf0 = @(s) pd.cdf(-s).*unifpdf(s,-10,10)/.5;    % class-0 conditional density
    pdf1 = @(s) pd.cdf(s).*unifpdf(s,-10,10)/.5;     % class-1 conditional density
    cdf0 = @(s) integral(@(q) pdf0(q),-Inf,s);       % class-0 CDF
    cdf1 = @(s) integral(@(q) pdf1(q),-Inf,s);       % class-1 CDF

Then a functional plotter with adaptively sampled domain and parametric option will graph the ROC directly, e.g., Figure 2(d) is fplot(@(s)cdf1(-s), @(s)1-cdf1(s)). An empirical version of this, e.g., in Python, can generate randomly sampled scores emitted by the DGP. (As of this writing, there is a deprecated 1D-only plotter in scipy, and a sympy approach that is limited to its known set of functions.) A 1-D dataset X,y consists of X being a length-$n$ array of scores sampled uniformly from [–10,10], with corresponding labels y = sign(X+$\varepsilon$), where the noise array $\varepsilon$ is sampled from the standard logistic distribution (mu = 0, sigma = 1). Now index into the data to obtain the corresponding class-conditional histograms of X[y==–1] vs X[y==1] (Figure 2(a)), and an empirical ROC from roc_curve(y,X).
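A minimal Python sketch of this empirical version follows. It assumes scikit-learn's `roc_curve` is the function meant above, uses uniform scores on [-10, 10] with standard logistic noise to match the MATLAB definitions, and picks the sample size, bin count, and seed arbitrarily.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
n = 10_000
X = rng.uniform(-10.0, 10.0, size=n)             # scores, uniform on [-10, 10]
eps = rng.logistic(loc=0.0, scale=1.0, size=n)   # standard logistic noise
y = np.where(X + eps >= 0, 1, -1)                # labels in {-1, +1}

# Class-conditional histograms, cf. Figure 2(a).
bins = np.linspace(-10.0, 10.0, 41)
hist0, _ = np.histogram(X[y == -1], bins=bins, density=True)
hist1, _ = np.histogram(X[y == 1], bins=bins, density=True)

# Empirical ROC of the raw score against the labels, cf. Figure 2(d).
fpr, tpr, thresholds = roc_curve(y, X, pos_label=1)
auc = roc_auc_score(y, X)
print(f"empirical AUC = {auc:.3f}")
```

Plotting `tpr` against `fpr` gives the empirical counterpart of the parametric fplot call.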

The same regular dataset X,y can be indexed to obtain turbidity e-vs-d conditionals, with histograms of X[y==ŷ] vs X[y!=ŷ], where ŷ = sign(X) is the model's decision (Figure 2(e)). In the regular environment, this will yield approximately $0.93n$ samples (e.g., 9303 when $n$ = 10,000) of ‘clear’ vs only $0.07n$ samples (e.g., 693) of ‘turbid’. In adversarially altered DGP environments, aspects like accuracy calculation and the marginal histogram (Figure 2(f)) seen by the model can be simulated by rebalancing the data via over- and/or under-sampling by a rational factor that approximates the clear-to-turbid ratio of 13.4, e.g., oversampling the minority class by 13 or 14. The corresponding y and ŷ maliciousness labels can be used to unit-test the resulting accuracy (1 group all correct, 2 all wrong).
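The clear-vs-turbid split and the rebalancing step can be sketched as below. The sketch assumes that 'turbid' means the label disagrees with the sign of the score, and it oversamples the turbid group by the rounded clear-to-turbid ratio; the seed and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
X = rng.uniform(-10.0, 10.0, size=n)
y = np.where(X + rng.logistic(size=n) >= 0, 1, -1)
y_hat = np.where(X >= 0, 1, -1)          # model decision: sign of the score

clear = y == y_hat                       # correctly classified samples
turbid = ~clear                          # misclassified samples
ratio = clear.sum() / turbid.sum()       # expected to land near 13.4

# Rebalance by repeating every turbid sample k times (k = 13 or 14).
k = int(round(ratio))
idx = np.concatenate([np.flatnonzero(clear),
                      np.repeat(np.flatnonzero(turbid), k)])
X_tox, y_tox, y_hat_tox = X[idx], y[idx], y_hat[idx]

accuracy_regular = np.mean(y == y_hat)
accuracy_toxic = np.mean(y_tox == y_hat_tox)
print(int(clear.sum()), int(turbid.sum()), round(float(ratio), 1))
print(round(float(accuracy_regular), 3), round(float(accuracy_toxic), 3))
```

With this particular factor the two groups end up roughly balanced, so the simulated accuracy falls to about one half; other rational factors give other accuracies.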

Regarding turbidity detection, to avoid confusion with y maliciousness labels –1s vs +1s (or 0s vs 1s), we could assign 2s to the ‘clear’ class and 3s to the ‘turbid’ class. Then the e-vs-d ROC (Figure 2(h)) can be empirically obtained from true labels [2, 2, … (one per clear sample), 3, 3, … (one per turbid sample)], predicted soft labels from Eq. (10), and ‘3’ as the target class in the function roc_curve.

The ROC pinch-down (Figure 4(b)) can be empirically verified by sending preactivation scores to the function roc_curve when the DGP is adversarially toxic, e.g., [X, repeat(X_turbid 14 times)], where X_turbid holds the turbid samples. To verify ROC inversion (Figure 3(b)) or repair (Figure 6(b)), the scores are first passed through a possibly nonmonotonic posterior function before being sent to roc_curve. For inversion, the posterior is the one in Section 2.3 under a regular DGP. For repair, the posterior is Eq. (15) under an adversarially toxic DGP.
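The pinch-down and repair checks can be sketched as follows, using AUC as a one-number summary of each ROC. Two assumptions are made loudly here: the toxic DGP is the regular data plus 14 extra copies of each misclassified sample, and the piecewise posterior `toxic_posterior` is derived for that specific oversampling scheme as a stand-in for Eq. (15), which is not reproduced in this appendix.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, reps = 10_000, 14
X = rng.uniform(-10.0, 10.0, size=n)
y = np.where(X + rng.logistic(size=n) >= 0, 1, -1)
turbid = y != np.where(X >= 0, 1, -1)    # samples the model gets wrong

# Toxic DGP: regular data plus `reps` extra copies of every turbid sample.
X_tox = np.concatenate([X, np.repeat(X[turbid], reps)])
y_tox = np.concatenate([y, np.repeat(y[turbid], reps)])

def toxic_posterior(s, k=reps + 1):
    """P(y=+1 | s) when errors carry k times their regular weight.

    Assumption of this sketch: for s >= 0 the +1 label is the correct one,
    so the odds sigma(s)/sigma(-s) = exp(s) are divided by k; for s < 0
    the +1 label is the error, so the odds are multiplied by k.
    """
    s = np.asarray(s, dtype=float)
    return np.where(s >= 0,
                    1.0 / (1.0 + k * np.exp(-s)),
                    1.0 / (1.0 + np.exp(-s) / k))

auc_regular = roc_auc_score(y, X)                             # healthy ROC
auc_toxic = roc_auc_score(y_tox, X_tox)                       # pinched-down ROC
auc_repaired = roc_auc_score(y_tox, toxic_posterior(X_tox))   # after repair
print(round(auc_regular, 3), round(auc_toxic, 3), round(auc_repaired, 3))
```

Replacing `roc_auc_score` with `roc_curve` on the same arguments returns the full curves for plotting.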

### A.3 High-Confidence Adversarial Attack

A brute-force append attack to generate high-confidence adversarial FPs or FNs from raw binaries.
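A minimal sketch of such an append attack is shown below. It is not the original listing: the function `append_attack`, its parameters, and the toy scorer `toy_prob_malicious` are hypothetical stand-ins, and only the 0.97 confidence level and the 1000-trial limit come from Section 4.

```python
import math
import random

def append_attack(data, prob_malicious, want_malicious, conf=0.97,
                  max_trials=1000, chunk=None, chunk_size=256, seed=0):
    """Append chunks until the model is confidently wrong, or give up.

    prob_malicious: callable bytes -> pseudo-probability of 'malicious'.
    want_malicious: True to force an FP on a clean file, False for an FN.
    chunk: fixed payload to append; if None, random bytes are used.
    Returns (adversarial_bytes, trials) or (None, max_trials) on failure.
    """
    rng = random.Random(seed)
    candidate = bytes(data)
    for trial in range(1, max_trials + 1):
        if chunk is not None:
            piece = chunk
        else:
            piece = bytes(rng.getrandbits(8) for _ in range(chunk_size))
        candidate += piece               # the original bytes stay untouched
        p = prob_malicious(candidate)
        p_target = p if want_malicious else 1.0 - p
        if p_target > conf:
            return candidate, trial
    return None, max_trials

def toy_prob_malicious(data):
    # Hypothetical scorer for illustration only: 0xCC bytes look bad,
    # 0x00 bytes look good. A real attack would query the deep model.
    z = (data.count(b"\xcc") - data.count(b"\x00")) / 64.0
    return 1.0 / (1.0 + math.exp(-z))

seed_file = b"\xcc" * 2000               # toy 'malicious' seed
adv, trials = append_attack(seed_file, toy_prob_malicious,
                            want_malicious=False, chunk=b"\x00" * 256)
```

Because bytes are only appended, the seed remains a prefix of the adversarial file; whether a real executable keeps its behavior after such an append depends on the file format and is outside this sketch.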

### A.4 Malware Detector Model Summary

The deep neural network investigated was a sequential Keras-wrapped TensorFlow model with 840,882 parameters as summarized below.

    ___________________________________________________________
    Layer (type)                 Output Shape           Param #
    ===========================================================
    input (InputLayer)           (None, 700000, 1)      0
    reshape_1 (Reshape)          (None, 700000)         0
    embedding (Embedding)        (None, 700000, 8)      2048
    conv1 (Conv1D)               (None, 175000, 48)     12336
    relu1 (Activation)           (None, 175000, 48)     0
    conv2 (Conv1D)               (None, 43750, 96)      147552
    relu2 (Activation)           (None, 43750, 96)      0
    temporal_max_pooling (MaxP   (None, 10938, 96)      0
    conv3 (Conv1D)               (None, 1368, 128)      196736
    relu3 (Activation)           (None, 1368, 128)      0
    conv4 (Conv1D)               (None, 171, 192)       393408
    relu4 (Activation)           (None, 171, 192)       0
    global_temporal_avg_poolin   (None, 1, 192)         0
    flatten (Flatten)            (None, 192)            0
    fc1 (Dense)                  (None, 192)            37056
    selu1 (Activation)           (None, 192)            0
    fc2 (Dense)                  (None, 160)            30880
    selu2 (Activation)           (None, 160)            0
    fc3 (Dense)                  (None, 128)            20608
    selu3 (Activation)           (None, 128)            0
    logits (Dense)               (None, 2)              258
    output (Activation)          (None, 2)              0
    ===========================================================
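The parameter counts in the summary can be cross-checked with plain arithmetic. In the sketch below the Conv1D kernel sizes (32, 32, 16, 16) are not stated in the summary; they are inferred as the values that reproduce the listed counts under the standard weights-plus-biases formula, and should be read as an assumption.

```python
def conv1d_params(c_in, c_out, kernel):
    # Weights (kernel x c_in x c_out) plus one bias per output channel.
    return kernel * c_in * c_out + c_out

def dense_params(n_in, n_out):
    # Weight matrix plus one bias per output unit.
    return n_in * n_out + n_out

layer_params = {
    "embedding": 256 * 8,                 # 256 byte values, 8-dim embedding
    "conv1": conv1d_params(8, 48, 32),    # kernel size inferred, not stated
    "conv2": conv1d_params(48, 96, 32),   # kernel size inferred, not stated
    "conv3": conv1d_params(96, 128, 16),  # kernel size inferred, not stated
    "conv4": conv1d_params(128, 192, 16), # kernel size inferred, not stated
    "fc1": dense_params(192, 192),
    "fc2": dense_params(192, 160),
    "fc3": dense_params(160, 128),
    "logits": dense_params(128, 2),
}
total_params = sum(layer_params.values())
print(total_params)
```

The total agrees with the 840,882 parameters reported for the model.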