# Subspace Methods That Are Resistant to a Limited Number of Features Corrupted by an Adversary

###### Abstract

In this paper, we consider batch supervised learning where an adversary is allowed to corrupt instances with arbitrarily large noise. The adversary is allowed to corrupt any features in each instance and the adversary can change their values in any way. This noise is introduced on test instances and the algorithm receives no label feedback for these instances. We provide several subspace voting techniques that can be used to transform existing algorithms and prove data-dependent performance bounds in this setting. The key insight to our results is that we set our parameters so that a significant fraction of the voting hypotheses do not contain corrupt features and, for many real world problems, these uncorrupt hypotheses are sufficient to achieve high accuracy. We empirically validate our approach on several datasets including two new datasets that deal with side channel electromagnetic information.

## 1 Introduction

In this paper, we consider standard batch supervised learning with the additional assumption that an adversary can corrupt a subset of features during the evaluation/use of the algorithm’s hypothesis. We give techniques to make any machine learning algorithm resistant to this corruption and give data dependent upper bounds on the error rate. This type of analysis is useful since it gives a way to weaken the independent and identically distributed (iid) assumption of many machine learning results. Many practical problems cannot be accurately modeled with the iid assumption.

The feature corruption we consider is related to the popular adversarial research that attempts to change the predicted label of an image without significantly changing the image or its true label as confirmed by a human [21]. While our techniques can be applied to that setting, we do not focus on cases where the adversary cannot change the true label. If the structure of the problem is such that the adversary can corrupt the instance and change the label then our bounds will reflect the increase in error rate. While this issue is unavoidable for general machine learning problems, we find that many realistic machine learning problems have a large amount of redundancy which can be exploited to help tolerate this type of corruption.

Our corruption model allows the adversary to change up to features in any possible way. This includes picking different features to modify for each instance. This is an established adversarial corruption model [4] and is useful for more than just images. For example, any problem that has categorical features should allow arbitrary changes to the features since there is no natural metric on features to control the amount of change. Another example is missing values. While some missing values might occur during training, an increase during production use can be modeled by an adversary and our techniques can give performance bounds in this setting.

One thing we wish to stress is that while our techniques can work against a strong adversary, we feel their primary benefit is in situations that are not adversarial but also not iid. For example, consider the popular example of autonomous driving. Even if there are no adversarial instances on the road, we still want a system that has provable guarantees that are stronger than traditional iid assumptions. If the car is driven in a non-training environment or a sensor starts to malfunction, we want the system to still make accurate predictions or at least report that something is wrong. This is related to redundant and error correcting systems, but in this context, we are building it into the machine learning classifier.

The main intuition of our technique is to use majority vote where each hypothesis only uses a fraction of the features such that we can guarantee that more than half of the hypotheses do not use any corrupt features. We give a range of methods and show that we can tolerate corrupt features where is the total number of features and is the number of features in each hypothesis. Our main result is to give a data dependent bound for our techniques. This works in the same way as a standard test set and can be used to generate upper bounds on the error rate for different amounts of corruption. The technique is broadly applicable as one can use validation sets to pick an effective value against a worst case adversary. While in most cases, it is not possible to know how many features the adversary can corrupt, we find that the validation set has a sweet spot where the accuracy gains from a higher value are offset by the fact that the adversary can corrupt more hypotheses as increases.

Our techniques can be used with any machine learning algorithm. After picking subsets of features one just applies the machine learning algorithm(s) to each subset to create the majority vote classifier. One advantage of this approach is that our method is able to inherit many of the benefits of the basic algorithm used to learn the hypotheses. In some cases, this could include resistance to different types of adversarial attacks. For example, if there is an algorithm that can tolerate the popular adversarial image attack where each pixel can change a small amount, we can use that algorithm as the basic algorithm to generate a classifier that can tolerate examples generated by a more powerful adversary that can arbitrarily modify features and make minor changes to all features.
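The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `fit(X_sub, y)` interface, and the toy learner `fit_sum_threshold` are hypothetical stand-ins for whatever basic learning algorithm is actually used.

```python
from collections import Counter


def project(row, subset):
    """Keep only the feature values whose indexes appear in `subset`."""
    return [row[i] for i in subset]


def train_subspace_ensemble(X, y, subsets, fit):
    """Train one basic hypothesis per feature subset.

    `fit(X_sub, y)` is any learning routine that returns a callable
    mapping a projected row to a label (an assumed interface).
    """
    ensemble = []
    for subset in subsets:
        X_sub = [project(row, subset) for row in X]
        ensemble.append((list(subset), fit(X_sub, y)))
    return ensemble


def vote_counts(ensemble, row):
    """Label counts produced by the basic hypotheses on a single row."""
    return Counter(h(project(row, subset)) for subset, h in ensemble)


def fit_sum_threshold(X_sub, y):
    """Toy binary learner for labels 0/1: threshold the row sum halfway
    between the two class means. It stands in for a real algorithm."""
    sums = {0: [], 1: []}
    for row, label in zip(X_sub, y):
        sums[label].append(sum(row))
    mean0 = sum(sums[0]) / len(sums[0])
    mean1 = sum(sums[1]) / len(sums[1])
    cut = (mean0 + mean1) / 2
    high_label = 1 if mean1 >= mean0 else 0
    return lambda row: high_label if sum(row) > cut else 1 - high_label
```

Because each hypothesis only reads its own subset, an arbitrary change to one feature can alter the vote of only those hypotheses that contain it.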

There has been a significant amount of theoretical research in concept drift [20] and adversarial learning [15, 7]. Our techniques are strongly related to the random subspace method [9]. In fact, this is one of the algorithms we use in our experiments. Our main contribution over the existing research is to analyze the random subspace method in the adversarial setting and to give new subspace methods that have stronger adversarial performance bounds. In [19], an SVM-based optimization problem is given based on the assumption that features are removed by an adversary giving them a value of zero. Our adversary is more general in that it can generate any value for the corrupt features allowing it to greatly distort the prediction or hide the modifications. In [2], a similar strategy of using few features per basic hypothesis is presented, but the technique uses a stacking approach that can give larger influence to hypotheses that might have been corrupted. They show experimentally that their technique is effective with their weaker adversarial assumptions. In [3], intuitive arguments are given for ensemble methods, including the random subspace method, but again the results are experimental.

There has also been a large amount of research on this problem in the online setting where constant label feedback is available and can be used to adapt the hypotheses to changes in the target function [14, 8, 17]. Our analysis is for the more difficult problem where no label feedback is available after training and our approach is to construct a fixed hypothesis that is robust to the adversarial changes.

## 2 Adversarial Learning Problem

Let be the instance space and be a set of discrete labels. Let Train be a sequence of training instances selected independently from distribution and let be a sequence of test instances selected independently from the same distribution. The adversary is allowed to arbitrarily corrupt different features on every test instance. More formally, let be this corrupted sequence of instances where for all , .

To help describe our algorithms we use the term corrupt feature to refer to any feature that has been modified by the adversary and corrupt hypothesis to refer to any hypothesis that contains a corrupt feature. For voting, we refer to any hypothesis used in the vote as a basic hypothesis. As mentioned, the testing has a potentially infinite number of instances and we refer to each one as a trial. While the term trial is often used in online learning [13], we should stress our results are not for the online model as we do not receive label feedback after making a prediction. Instead, we learn a static classifier that is robust to adversarial changes in the instances.

## 3 Majority Vote Data Dependent Bounds

The key component of our technique is majority vote where each basic hypothesis predicts a single label and the voting prediction is the label that occurs most frequently, with ties broken randomly. The main intuition is that we generate basic hypotheses for the voting such that a majority of them do not contain features that are corrupt. If these uncorrupt hypotheses predict the correct label then the predicted label will be correct. Of course, it is unlikely that these uncorrupt hypotheses will be perfect, so in this section we give a data dependent bound on the error rate using uncorrupt test data. In the next section, we will show how to control the number of corrupt hypotheses as a function of the number of corrupt features, .
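As a minimal sketch of the voting rule just described, the following Python function returns the most frequent label and breaks ties at random. The function name and the optional `rng` argument are illustrative choices; the tie is broken among the labels that share the top count, which is one reading of the random tie rule.

```python
import random
from collections import Counter


def majority_vote(predictions, rng=random):
    """Return the most frequent label; break ties at random."""
    counts = Counter(predictions)
    top = max(counts.values())
    tied = [label for label, count in counts.items() if count == top]
    if len(tied) == 1:
        return tied[0]
    # Assumption: the random choice is restricted to the tied labels.
    return rng.choice(tied)
```

Passing a seeded `random.Random` instance as `rng` makes the tie-breaking reproducible in experiments.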

To state our main result, we start with some definitions. Let be the vector of label counts made by the majority vote on instance . Define where is an integer vector with elements; one element for each label value. Notice that corresponds to the margin of instance with respect to the majority vote. Define the loss as where is the indicator function and will depend on properties of the adversary. We will show this loss function can be used to upper bound the error rate on instances that have been corrupted by the adversary.

###### Theorem 1

Assume that an adversary can corrupt at most majority vote hypotheses on any instance. Let be a sequence of uncorrupted test instances sampled from distribution . The error rate of the majority vote on corrupt instances is greater than with probability at most .

Proof:

Let be an instance generated by the adversary on trial and let be the vector of label counts. The margin for this instance is . We can decompose this margin into two components as where is the margin that results from selecting the instance from distribution and is based on the adversary corrupting hypotheses. Since every corrupt hypothesis can decrease the margin by at most 2 by shifting one vote away from the correct label and moving it to the incorrect label with the highest vote, we have . Since is a random variable, we can apply a Hoeffding bound to the Bernoulli variables . This bound will also be valid for , since . A direct application of Theorem 1 in [10] proves that the empirical mean of has a probability of at most of exceeding the true mean by .
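For readers who want a number, the standard one-sided Hoeffding inequality for m independent variables in [0, 1] states that the true mean exceeds the empirical mean by ε with probability at most exp(−2mε²). Assuming the slack in Theorem 1 takes this standard form, the following Python sketch solves for ε at a chosen failure probability; the names `m` and `delta` are illustrative.

```python
import math


def hoeffding_slack(m, delta):
    """Slack eps such that, for m independent [0, 1] variables, the true
    mean exceeds the empirical mean by eps with probability at most delta
    (one-sided Hoeffding: delta = exp(-2 * m * eps**2))."""
    if m <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need m > 0 and 0 < delta < 1")
    return math.sqrt(math.log(1.0 / delta) / (2.0 * m))
```

For instance, `hoeffding_slack(5000, 0.01)` is a little over 0.02, so a test set of that size adds roughly two percentage points to the empirical loss at 99% confidence.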

It is easiest to understand this bound with a plot. In Figure 1, we give a histogram of the Score function over all the test data. Each corrupt hypothesis will shift this histogram by at most 2 to the left. By counting all the values that are still above 0, we get a bound on the error rate of majority vote. In this case, the test error rate is around 0.06, but the numbers in this plot can be used to show that in the worst case the error is at most 0.1 when 5 features are corrupt.
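The histogram-shifting argument can be sketched as follows. This is an illustration rather than the exact loss used in Theorem 1: the margin is taken to be the votes for the true label minus the largest vote for any other label, each of `q` corrupt hypotheses is assumed to cost at most 2, and a shifted margin of exactly 0 is conservatively counted as an error even though ties are broken at random.

```python
def margin(counts, true_label):
    """Votes for the true label minus the largest vote for another label."""
    correct = counts.get(true_label, 0)
    others = [c for label, c in counts.items() if label != true_label]
    return correct - max(others, default=0)


def worst_case_error(margins, q):
    """Fraction of clean test instances whose margin is not above 0 after
    subtracting 2 for each of q corrupt hypotheses."""
    at_risk = sum(1 for s in margins if s - 2 * q <= 0)
    return at_risk / len(margins)
```

Evaluating `worst_case_error` on clean test margins for increasing `q` gives the empirical part of the bound; the sampling slack from the concentration inequality still has to be added on top.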

Notice that the histogram does not show the somewhat idealized case of independence of errors, but for our purpose, dependence is fine. Intuitively, what we are exploiting is redundancy of features. While the different hypotheses learned with these redundant features might be highly correlated, we assume the corruption of one feature has no effect on any related redundant features. Also, while it might seem restrictive to have a hard limit on , and therefore , it is straightforward to assume comes from a distribution and use this distribution to create an upper-bound on the error.

The bound in Theorem 1 can be improved by incorporating information about the variance of the random variables, but we omit the details for clarity and space. In addition, the result can be improved slightly by taking into account that the majority vote makes a random prediction on ties. In practice, just as for normal test set bounds, a computer algorithm should be used to get precise bounds on the binomial distribution [12]. Also notice that this bound is worst case in that it assumes a corrupt hypothesis always makes the wrong prediction in the worst way possible. This is a reasonable assumption as it is true for certain types of machine learning problems and classifiers. For example, hyperplane classifiers with . As mentioned in the introduction, the bound can be improved by making more assumptions about the basic learning algorithms used to generate the ensemble.

## 4 Majority Vote Hypotheses Generation

In this section, we give four techniques to generate hypotheses for the majority vote. They are all subspace methods since they only select a subset of features for each hypothesis. Our goal is to bound how many of these hypotheses can be corrupted by an adversary that is allowed to corrupt at most features on each instance. To help explain our results, let be the total number of features and let be the percentage of hypotheses that are corrupt. For all methods, we will show that in order to have more than half the hypotheses uncorrupted, in the worst case, they can at most tolerate corrupt features. Roughly speaking, if we want to double the number of corrupt features the algorithm can tolerate, we need to halve the number of features in each hypothesis.

For all methods, we suggest, at a minimum, to randomize the initial ordering of the features, since feature values and relevance might be correlated in the ordering. Another option is to use some type of feature selection procedure, such as mutual information or domain knowledge, in an attempt to equalize the quality of the features across the hypotheses. This includes using validation sets to evaluate possible feature orderings or parameter settings of the methods.

### 4.1 Fixed Feature Split

We call our first method fixed-split, as it partitions the features into approximately equal sized disjoint groups. Our only parameter is the number of hypotheses, , and we attempt to partition the features as evenly as possible. If there is a remainder when dividing by , the remainder will be split by adding one feature to some of the hypotheses. This means each hypothesis will have either or . To simplify the analysis, we will assume as this does not significantly change the results.

Notice that each corrupt feature will corrupt at most one hypothesis. For example, with 900 features, we could learn 9 hypotheses that each have 100 features. In terms of our variables, we have . This is not a strict equality since multiple corrupt features might occur in a single hypothesis. It is useful for comparison with the other methods to rearrange this as which shows how the tolerance to corrupt features is proportional to .
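A minimal Python sketch of the fixed-split partition is given below. The function names are illustrative, features are assumed to be indexed from 0, and the feature order is assumed to have been shuffled beforehand as suggested earlier in this section.

```python
def fixed_split(n_features, n_hypotheses):
    """Partition feature indexes 0..n_features-1 into disjoint groups whose
    sizes differ by at most one."""
    base, extra = divmod(n_features, n_hypotheses)
    groups, start = [], 0
    for i in range(n_hypotheses):
        size = base + (1 if i < extra else 0)  # spread the remainder
        groups.append(list(range(start, start + size)))
        start += size
    return groups


def max_tolerated_corrupt_features(n_hypotheses):
    """Largest number of corrupt features that still leaves a strict
    majority of hypotheses uncorrupted, given that each corrupt feature
    touches at most one group."""
    return (n_hypotheses - 1) // 2


groups = fixed_split(900, 9)  # 9 groups of 100 features each
```

With the 900-feature example from the text, `max_tolerated_corrupt_features(9)` returns 4: five corrupt features could each land in a different group and corrupt a majority of the nine hypotheses.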

### 4.2 All Size Feature Subsets

In this section, we consider the technique of generating every possible feature hypothesis. We call this the -subset method. While this can be intractable, it is a useful comparison case as all of the remaining methods are related to this technique.

Since we are considering every way to select features, there are a total of hypotheses. Since a corrupt hypothesis has one or more corrupt features, there are a total of uncorrupt hypotheses, and therefore,

To make this result easier to interpret, we can use the fact that for , [22] to prove that for , . Using our previous equality on , we get . What we really want is a lower bound on , but we found it difficult to achieve a tight and simple lower bound. However, given that this upper bound is tight for , we can use this as an effective approximation to better understand the result. If needed, one can always use the exact formula. As can be seen, there is a slight advantage to this bound over the bounds of the other methods as increases in size. This is due to the fact that this bound takes into account that, for this method, as one increases the number of corrupt features some of these features must occur in hypotheses that have already been corrupted.
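The exact count for this method is easy to evaluate numerically. In the sketch below the names `n`, `k`, and `r` (total features, features per hypothesis, corrupt features) are chosen for illustration; the uncorrupt hypotheses are exactly the size-k subsets drawn from the n − r clean features. The linear value r·k/n is a simple union bound (each feature appears in a k/n fraction of all subsets) and is included only as a comparison point, not as the exact expression derived in the text.

```python
from math import comb


def corrupt_fraction_all_subsets(n, k, r):
    """Exact fraction of size-k feature subsets that contain at least one
    of r corrupt features, out of n features in total."""
    return 1 - comb(n - r, k) / comb(n, k)


def union_bound(n, k, r):
    """Linear upper bound: each corrupt feature hits a k/n fraction."""
    return min(1.0, r * k / n)


for r in range(1, 5):
    exact = corrupt_fraction_all_subsets(100, 10, r)
    print(r, round(exact, 4), union_bound(100, 10, r))
```

The two values coincide for a single corrupt feature and the exact fraction falls below the linear one as more features are corrupted, which reflects the overlap effect described above.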

Next we show that the -subset method is optimal for the case where all hypotheses have features. As explained above, this shows that is a tight bound when .

###### Theorem 2

If there are hypotheses where each hypothesis uses at least features from and features are corrupt, then the fraction of corrupt hypotheses is at least .

Proof:

Let be the number of non-corrupt hypotheses after features have been corrupted. Let be the fraction of hypotheses that are corrupted when feature is corrupted. We can use the pigeonhole principle where we consider the holes to be the remaining features and every hypothesis consists of at least pigeons. Therefore, with hypotheses, there must exist a feature that is used by at least hypotheses. This allows the adversary to always corrupt at least hypotheses. This shows that

Given that , this proves the theorem.

While it is possible to have a mixture strategy that uses different numbers of features in the hypotheses, we currently do not see a compelling reason to use a wide range of sizes for feature subsets. If one can generate a significant number of hypotheses with few features that also have good accuracy, then there is no reason to generate hypotheses using significantly more features that have worse guarantees. The only exception is having a difference of one feature between subsets. This is close enough to often give a good bound and is helpful for the fixed-split method and the -modulus method that will be explained in Section 4.4.

### 4.3 Random Subspace Method

The random subspace method is a well studied method [9] that takes a random sample of hypotheses from the set of hypotheses and is therefore an approximation of the -subset algorithm. It has been shown to give good performance on many types of uncorrupt learning problems, but this research is the first to give provable guarantees in the adversarial setting. As an approximation of the -subset method, it is possible to use a concentration result to show that as increases with high probability the method will not select many more than the average number of corrupt hypotheses from the -subset algorithm. However, formally this only holds for a single trial and a strong adversary will be able to learn which features to corrupt to maximize the error. This maximum error can be bounded by explicitly computing the number of hypotheses that can be corrupted as increases. While one can randomly search the space of hypotheses to minimize this number, in cases of a strong adversary, we recommend using one of the other algorithms.
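The explicit computation mentioned above can be illustrated with a brute-force sketch. The function names and the seed parameter are hypothetical, and the search enumerates every choice of `r` features, so it is only practical for small problems; it is meant to show what is being computed, not how to compute it at scale.

```python
import random
from itertools import combinations


def random_subspaces(n, k, h, seed=0):
    """Sample h feature subsets of size k from n features."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n), k)) for _ in range(h)]


def worst_case_corrupted(subsets, n, r):
    """Maximum number of sampled hypotheses that contain at least one
    corrupt feature, over every choice of r features (exponential cost)."""
    sets = [set(s) for s in subsets]
    best = 0
    for attack in combinations(range(n), r):
        chosen = set(attack)
        hit = sum(1 for s in sets if s & chosen)
        best = max(best, hit)
    return best
```

Comparing `worst_case_corrupted(subsets, n, r)` with half the number of hypotheses shows, for a particular sample, the largest `r` for which a majority is guaranteed to stay uncorrupted.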

### 4.4 Modulus Subspace Method

We call this the -modulus method as it uses the modulus function to build a deterministic group of feature subsets where each subset has elements. The -modulus method can provably tolerate as many corrupt features as the fixed-split method but allows more control over the parameter . The technique starts by indexing the features as to . Next it creates feature sequences by modifying the indexes. Given a sequence of indexes, we add one to each index using modulo arithmetic. For example, if , and we start with then the next two sequences would be and . We repeat this procedure times to generate a group containing at most sequences of features. This procedure can be used on all length index sequences to create a partition of all feature subsets. Using every feature subset would make it equivalent to the full -subset method. Instead, we propose to use a small number of groups from the partition. As we will show, this will control the cost while still giving us the bound on the number of corrupt features.

More formally, let be defined as applying this plus-one modulus operation to a sequence of feature indexes. We define to be the set of sequences generated by applying , times on a feature subset . Let . Notice that so any further applications of operator will repeat previous sequences.

Define to be the partition of all feature subsets that is generated by applying to all possible feature size sequences. One way to generate is to iteratively build the partition elements by applying to any feature sequence that is not already part of the incrementally built partition.
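The group and partition construction can be sketched in Python as follows. This is an illustration under assumptions: features are indexed 0 to n − 1, index sequences are represented as sets (the order inside a sequence does not affect which features a hypothesis uses), and the function names are hypothetical.

```python
from itertools import combinations


def shift(subset, n):
    """Add one to every index, wrapping around with modulo n."""
    return frozenset((i + 1) % n for i in subset)


def modulus_group(subset, n):
    """All distinct subsets reached from `subset` by repeated shifting."""
    start = frozenset(subset)
    group, current = [start], shift(start, n)
    while current != start:  # shifting n times always returns to start
        group.append(current)
        current = shift(current, n)
    return group


def modulus_partition(n, k):
    """Partition all size-k subsets of n features into modulus groups by
    growing a group from any subset not yet covered."""
    seen, partition = set(), []
    for combo in combinations(range(n), k):
        s = frozenset(combo)
        if s not in seen:
            group = modulus_group(s, n)
            seen.update(group)
            partition.append(group)
    return partition
```

For n = 6 and k = 2 this yields three groups, two of size 6 and one of size 3 (the one generated by {0, 3}), which together cover all 15 subsets. In practice only a few groups would be built with `modulus_group`; enumerating the full partition is shown only to make the structure concrete.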

To connect the corrupt features to the corrupt hypotheses, we need some new definitions. Let be the binary list that has value 1 if the corresponding indexed element is in , otherwise it has value 0. For example, has . Let be a circular right shift by of feature sequence binary list . For example, . Finally let be the minimal shift in the binary sequence such that .

###### Lemma 1

Given a sequence of features from , the number of elements in partition group is equal to .

Proof:

Apply operation a total of times on . Given the definition of , it must be the case that . Therefore the number of elements in the partition group must be less than or equal to .

Assume we apply operation a total of times on and that . Based on the definition of this is a contradiction since this implies that . This shows that which proves the lemma.
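The quantities used in Lemma 1 can be computed directly. The sketch below builds the binary list for a feature subset, applies a circular right shift, and finds the minimal shift that maps the list onto itself; by the lemma this value equals the size of the group generated by the subset. Features are assumed to be indexed from 0 and the function names are illustrative.

```python
def indicator(subset, n):
    """Binary list with a 1 at every index contained in `subset`."""
    members = set(subset)
    return [1 if i in members else 0 for i in range(n)]


def circular_right_shift(bits, j):
    """Rotate the list j positions to the right."""
    j %= len(bits)
    return bits[-j:] + bits[:-j] if j else list(bits)


def minimal_period(bits):
    """Smallest positive shift that leaves the binary list unchanged."""
    for j in range(1, len(bits) + 1):
        if circular_right_shift(bits, j) == bits:
            return j
```

For example, with six features the subset {0, 3} has the binary list `[1, 0, 0, 1, 0, 0]`, whose minimal period is 3, while {0, 1} has period 6.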

We can use this result to relate the number of corrupt features to the number of corrupt hypotheses.

###### Theorem 3

Assume the majority vote uses a subset of groups from the partition . If an adversary corrupts features then where is the fraction of hypotheses that are corrupt.

Proof:

Let be any of the groups and look at an index set that generates this group. Create a list of all index sets in this set by applying operator a total of times; this will include any repeated index sets. Given that we start with features and cycle each index through all values then each feature index occurs times. Therefore any corrupt feature will corrupt at most hypotheses. Given that there are (potentially non-unique) hypotheses in the group, if the adversary corrupts features then at most hypotheses are corrupt. This means that . Based on Lemma 1, each hypothesis is repeated the same number of times. This shows that even after we remove repeats and therefore for group . Given that this result holds for any group , it also holds for the union of these disjoint groups which proves the theorem.
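The counting step in this proof can be checked numerically on small cases. The sketch below rebuilds a modulus group (features indexed from 0, subsets as sets, names hypothetical) and counts how many hypotheses in the group use each feature. If every feature is used by the same fraction k/n of the group, then r corrupt features can corrupt at most a fraction r·k/n of it.

```python
def modulus_group(subset, n):
    """Distinct subsets obtained by repeatedly adding one modulo n."""
    start = frozenset(subset)
    group, current = [start], frozenset((i + 1) % n for i in start)
    while current != start:
        group.append(current)
        current = frozenset((i + 1) % n for i in current)
    return group


def feature_load(group, n):
    """Number of hypotheses in the group that use each feature."""
    load = [0] * n
    for subset in group:
        for i in subset:
            load[i] += 1
    return load


group = modulus_group([0, 1, 3], 7)
load = feature_load(group, 7)  # every feature is used by 3 of 7 hypotheses
```

The same check on a shortened group, such as the one generated by {0, 3} with six features, gives a load of 1 out of 3 hypotheses per feature, which is again k/n.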

This is the same bound as the fixed-split method. The main issue with the -modulus method is that it can be expensive to work with hypotheses. This problem can be partially addressed by manipulating the structure of the indexing to generate groups that are smaller than .

###### Theorem 4

For every group in , there must exist a common factor of both and where the group has size . Furthermore, for every that is a factor of both and , there must exist at least one group with unique elements.

Proof:

Let be an index sequence. Based on Lemma 1, must have a structure where the length of the repetition is . Given this structure, we can conclude that there exists an integer such that where is the number of times the pattern repeats. Also, we know that binary sequence only has values that are 1. Therefore where is the number of values that are 1 in one of the repeats. For example, if and then and . Based on Lemma 1, we can conclude that , which proves the first half of the result.

For the second half, assume is a common factor of both and . It must always be the case that one can construct a non-repeating binary pattern of size where of the elements are 1. For example, we can make the first elements 1 and the remaining elements 0. This pattern directly maps to an index sequence that, based on Lemma 1, generates a group of size .

###### Corollary 1

All partition groups in have size iff and are relatively prime.

Proof:

Assume all partition groups in have size and that . Based on Theorem 4 there must exist a group of size which is a contradiction.

Assume and that there is a partition group with . Based on Theorem 4, this is a contradiction.
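Theorem 4 and Corollary 1 together describe which group sizes can occur: one size for each common factor of the number of features and the subset size, obtained by dividing the number of features by that factor. A short Python sketch, with `n` and `k` as illustrative names for those two quantities:

```python
from math import gcd


def possible_group_sizes(n, k):
    """Group sizes n // c for every common factor c of n and k."""
    g = gcd(n, k)
    factors = [c for c in range(1, g + 1) if g % c == 0]
    return sorted(n // c for c in factors)
```

When `n` and `k` are relatively prime the only common factor is 1, so the function returns a single size equal to `n`, matching the corollary.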

Various group sizes are relatively easy to generate by creating binary strings with repeating patterns. Unfortunately, it is not always possible to get the exact size we want, but with a slight modification, we can create a group size that is at most . We accomplish this by adding dummy features (a dummy feature is used to generate the groups but is not used with the learning algorithm). This adds at most one dummy feature per hypothesis. This is similar to the technique used in the fixed-split method when a perfect split is not possible. In that case, we interpreted the split as some hypotheses having one less feature, but that is equivalent to adding dummy features. In fact, when the group size is , this is the fixed-split method, which shows that the modulus method is a generalization of that simpler technique.

## 5 Decreasing Majority Vote Cost

One issue with using a large ensemble of hypotheses is prediction time. We recommend speeding up prediction time by using sequential sampling techniques that randomly sample the voting hypotheses until a prediction can be made with a controllable probability of correctness [24]. This is most beneficial when the adversary only occasionally corrupts an instance since, on many problems, uncorrupt instances are often predicted correctly for a large fraction of the hypotheses and sequential sampling can quickly and accurately estimate the majority label. In addition, while most sequential sampling techniques assume that sampling is done with replacement, it should be possible to get better bounds since sampling in this case is done without replacement. Another refinement is to take advantage of a multi-core computer architecture and evaluate several hypotheses in each step of the sequential sampling.

## 6 Experiments

In this section, we give experiments on a range of datasets which include worst case bounds when features are corrupt and actual results against a simple weak adversary. The weak adversary is not supposed to model a worst case adversary, but is primarily used to show how traditional algorithms degrade with a simple change in the distribution.

To generate our weak adversary, we want to corrupt different relevant features on each trial. We do this by creating a distribution on the features using mutual information (MI) [11]. For each instance, after selecting features to corrupt from the created MI distribution, we change the feature value to be the maximum value on the opposite side of the mean. For example, if a feature has a range of [-2,3] and a mean of 1 in the training data then during testing we corrupt a feature with value -1 by changing it to 3.
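The value-flipping rule of the weak adversary can be written as a one-line function. This sketch covers only the replacement value; sampling which features to corrupt from the MI distribution is omitted. The function name is hypothetical, and the treatment of a value exactly equal to the mean is an assumption, since the text does not specify that case.

```python
def corrupt_value(value, low, high, mean):
    """Replace a feature value with the extreme of the training range that
    lies on the opposite side of the training mean.

    Assumption: a value exactly equal to the mean is treated as lying
    below it and is therefore moved to `high`.
    """
    return high if value <= mean else low
```

With the example from the text, a feature with range [-2, 3] and mean 1 has the value -1 replaced by 3, and a value of 2 would be replaced by -2.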

The eight datasets we used are described in Table 1. All our voting techniques use random forest as the basic classifier as it is easy to tune to give high accuracy on most machine learning problems [5]. All graphs include a 99% confidence interval based on the exact binomial distribution [12]. We used the scikit-learn Python library [18] for all our code and used the standard cross validation library for parameter selection where the number of random features selected was from and the number of trees was selected from . For all experiments, we used an 80/20 train/test split. The only exception is that we set 100,000 as the maximum number of instances for training or testing. For all datasets, we combined and permuted the existing data. This was to ensure that all the data was iid before the adversary corrupts the instances. For the data dependent bounds, the results for the random subspace method are the expected bounds given that the 500 hypotheses are sampled from all possible subsets. On all experiments we also report the error rate of predicting with the majority label. This is a poor but reasonable default classifier since it gives information about label skew and is unaffected by feature corruption. All experiments were performed on a 44 core Intel Linux server.

Given the large number of experiments and independent variables, we limit the presented results to a single number of hypotheses for both the fixed-split technique and the random subspace technique. For the fixed-split method we picked the best result from . For feature corruption, we ran experiments from to corrupt features. We sometimes stopped the experiments early if the error rate degraded to the majority label baseline. For the random subspace method, given the expense of running the algorithm, we only used a single set of parameters. To control cost, we used and set based on the value that gave the best results for the fixed-split method. (When focused on a single problem with sufficient validation data, we recommend testing more parameters.) Other values of and give qualitatively similar results when the features are significantly corrupted. We do not report results for the modulus subspace method as we are still in the preliminary stage of evaluating this method.

Table 1: Datasets used in the experiments.

| Data set | Features | Label | Classes | Train | Test | Access |
|---|---|---|---|---|---|---|
| UNO | 1024 | Device Mode | 11 | 24413 | 6104 | non-public |
| Pi | 1024 | Device Mode | 3 | 6067 | 1517 | non-public |
| Smart | 512 | Device Mode | 4 | 10000 | 2000 | non-public |
| Character Font | 409 | Italic Arial | 2 | 20989 | 5248 | UCI |
| IoT Botnet | 115 | ACK Attack | 2 | 100000 | 42591 | UCI |
| UJIIndoorLoc | 520 | Building Floor | 5 | 16838 | 4210 | UCI |
| US Census Data | 68 | Marital Status | 5 | 100000 | 100000 | UCI |
| CoverType | 54 | Forest Covers | 7 | 100000 | 100000 | UCI |

### 6.1 Electromagnetic Side Channel Data

We are currently working on a project that determines the computational mode of a device based on its unintended electromagnetic (EM) emissions [1]. The goal of this project is to determine if unauthorized code is running on the device. While we do not have space to give all the details on this problem, we have captured EM data of two devices while they execute authorized code. The data is captured using an antenna and a software defined radio sampling at 25 MHz at a specified central frequency. We then processed that data using the fast Fourier transform into 1024 frequency bins and used those energy levels as our features. Through various techniques, we have labeled this data into device modes.

One convenient property of this data is the presence of harmonics where information is correlated over the frequency spectrum. This along with strong differences between the device modes, makes the learning problem fairly easy for standard machine learning techniques. The difficulties arise when the system is used to make label prediction at a different time and/or location. Changing, intermittent EM noise can be present and can corrupt different features over a sequence of predictions. This was our motivation to develop these techniques. While this EM noise problem is not a worst case adversary, it has properties that make it difficult to analyze with traditional train/test assumptions.

We ran experiments for three devices: an Arduino UNO, a Raspberry Pi, and a smart meter. It was difficult to do controlled experiments with real environmental noise, so we used the adversarial noise model explained at the start of this section.

The left side of Figure 2 gives the result for an Arduino UNO running a simple program consisting of loops of NOP statements. A simple loop that repeats a specific number of clock cycles causes a repeated behavior that is picked up at a specific frequency as an amplitude modulation of the CPU clock. As can be seen, the random forest eventually gets close to half the predictions wrong. The fixed-split method does much better, but the best results are for the random subspace method. This is reasonable as it has the best bound against a weak adversary as explained in Section 4.3.

We also give graphs of the worst case adversarial bounds. It is not surprising that these bounds are much worse since they assume the adversary can maximize the errors by controlling the prediction of any corrupt hypothesis. Still the bounds are somewhat positive given that the ratio is roughly 13. The voting hypotheses must have high accuracy to be able to tolerate corruption of almost half the hypotheses.

The right side of Figure 2 tells a similar story. It is based on a Raspberry Pi running Linux with a simple program that loops over SHA, string search, and sleep. While hard to see, the random forest classifier does slightly better at the start; however, it quickly decays as . Again, the random subspace method gives the best performance as the weak adversary corruption increases.

On the left side of Figure 3, we give the results of the smart meter experiments based on unmodified firmware running on the device. We are unsure why the results are so much worse for random forest; perhaps it is related to the label skew in this problem. However, both fixed-split and random subspace are largely resistant to the corruption, with the random subspace method having the lowest error.

### 6.2 UCI Data

We selected five UCI datasets [6], described in Table 1, by sorting based on number of instances and choosing problems that fit certain criteria. In particular, we selected classification problems but avoided any problems with fewer than fifty features or that required extensive feature processing. We also avoided problems that contained features that were clearly a function of some set of original features. The motivation for our problem is that the basic features are independently susceptible to noise, and having features that are functionally related to each other would break that assumption and spread the corruption. We suggest that any derived, functionally related features be placed in the same voting hypothesis to avoid spreading corruption.
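The suggestion to keep functionally related features together can be sketched as a grouped split. This is a minimal illustration, assuming a fixed-split style scheme in which each feature is assigned to exactly one voting hypothesis; the function name `split_feature_groups`, its parameters, and the greedy balancing rule are hypothetical and are not taken from the paper.

```python
import random


def split_feature_groups(groups, n_hypotheses, seed=0):
    """Assign whole feature groups to hypotheses, keeping sizes roughly balanced.

    groups is a list of lists of feature indices; features in the same inner
    list are functionally related and must share a hypothesis.
    """
    rng = random.Random(seed)
    ordered = list(groups)
    rng.shuffle(ordered)                    # randomize the order among equal-size groups
    ordered.sort(key=len, reverse=True)     # place the largest groups first (stable sort)
    subsets = [[] for _ in range(n_hypotheses)]
    for group in ordered:
        target = min(subsets, key=len)      # currently smallest hypothesis
        target.extend(group)
    return subsets
```

With this layout, corrupting one raw feature and everything derived from it still damages only a single hypothesis.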

Our first UCI problem is to determine whether a bit-mapped Arial font character is italic. The results here are similar to those of the previous section, except that the classifiers do not start at zero error. As can be seen on the right side of Figure 3, the error increase for both subspace classifiers is very slow. At both classifiers are doing much better than random forest, with a slight advantage to the random subspace classifier.

On the left side of Figure 4, we give the results for the binary label problem of determining whether a botnet attack is occurring [16]. Again, the results are positive, with the subspace classifiers tolerating roughly twice as many corrupt features before making a significant number of errors. After , both subspace classifiers start to rapidly decay. While it is possible that increasing and reducing could decrease the error rate with these large amounts of corruption, the classifiers are already at the point of having only six features per hypothesis, which seems extreme. It is likely this problem has a large amount of relevant feature redundancy.

The results on the right side of Figure 4 are interesting and are based on predicting which floor of a building the user is on from data collected from Wi-Fi access points [23]. The adversary has only a minimal effect on all the classifiers. We are unsure whether this is caused by the feature sparsity of this problem combined with our choice of adversary. We plan to study the issue of instance sparsity in the future. However, even in this case, we do see a decrease in error rate for the random subspace algorithm.

Our next dataset is based on Census data. Here we defined the label based on marital status, since it has five label values and reasonable class balance. As can be seen on the left side of Figure 5, we cannot tolerate as many corrupt features, but we also have significantly fewer total features. We also suffer some loss of accuracy in the non-corrupt case, but as soon as the corruption starts, the subspace techniques have lower error rates than random forest. In addition, we use the small value of as the non-corrupt error rate rises quickly as increases. This shows the difficulty of the non-corrupt form of this problem for the subspace techniques. In principle, subspace methods will not work with all machine learning problems. We will address this in the next section, but in general, multiple algorithms should be tested with validation datasets.

On the right side of Figure 5, the learning problem is to label different types of forest cover. The results are similar to those for the previous Census data; however, in this case the problem is even more difficult to learn. Still, we do see improvement for and . For higher values of all the algorithms are doing worse than the majority-label prediction algorithm.

## 7 Difficult Target Functions

An alternative way to interpret our results is as a way to quantify what types of learning problems cannot be solved by subspace methods. For example, a conjunction with no redundant attributes will need at least half the majority-vote hypotheses to contain every relevant variable. Our subspace methods, however, are designed to make sure any given features are missing from more than half of the hypotheses. For a conjunction with three terms, the chance that all three will appear in more than half the hypotheses will be small even when is fairly large. For example, with one would need to select features for the subspace method to have a 0.5 chance of working even without corruption. It is interesting how well many of our UCI experiments in Section 6.2 perform with much bigger ratios. While not definitive, this suggests there are large amounts of relevant feature redundancy in many practical problems. This is related to the fact that a sufficiently strong adversary can make learning impossible by corrupting instances to a part of the space with a different label. On concepts like the previously mentioned conjunction, which have no redundancy, high accuracy can be impossible to achieve against a zero-norm adversary even when not using subspaces, since the adversary can change one feature to turn a true conjunction into a false one. This suggests a connection between the performance of subspace methods on standard iid batch learning and the performance of machine learning against zero-norm adversaries, since both cases exploit redundancy. It also suggests that algorithms that attempt to minimize the number of relevant features are more susceptible to zero-norm adversaries, since they might learn functions that remove redundancy.
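The conjunction argument above can be checked numerically under a simple model. The sketch below assumes each hypothesis draws its feature subset uniformly at random and independently of the others, in the style of a random subspace method; the function names and the symbols `n` (total features), `k` (features per hypothesis), `r` (relevant features), and `m` (number of hypotheses) are introduced here for illustration and are not the paper's notation.

```python
from math import comb


def prob_hypothesis_has_all(n, k, r):
    """P(a uniformly random k-subset of n features contains r given features)."""
    if k < r:
        return 0.0
    return comb(n - r, k - r) / comb(n, k)


def prob_majority_has_all(n, k, r, m):
    """P(more than half of m independent hypotheses contain all r features)."""
    p = prob_hypothesis_has_all(n, k, r)
    need = m // 2 + 1
    return sum(comb(m, j) * p ** j * (1 - p) ** (m - j)
               for j in range(need, m + 1))
```

Under this model, and for `n` much larger than `r`, the per-hypothesis probability is roughly `(k / n) ** r`, so reaching 0.5 with `r = 3` requires `k / n` near 0.79.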

## 8 Future Work

One interesting extension of this work is to apply it to regression problems, where the prediction is a real number. In this setting, we replace majority vote with the median of the ensemble predictions. Surprisingly, all of our results carry over to this setting with only a small increase in the computational complexity of computing the data-dependent bounds. The key insight is that by using the robust median statistic, the damage an adversary can do is limited. The worst an adversary can do is to corrupt hypotheses on one side of the median and shift them to an extreme value on the other side of the median. This will maximize how much the median changes. The actual amount of change depends on the number of corrupted hypotheses and the non-corrupt empirical distribution of predictions. At a minimum, no more than 50% of the hypotheses can be corrupt; otherwise the median can be changed to any value. This is equivalent to the situation with majority vote classification. The main difference from our results on classification is the extra difficulty in generating the bound. Since we no longer have a binomial distribution for the loss function, we need to make assumptions about the distribution of errors for the uncorrupt hypotheses in order to generate bounds. This is typical for the regression setting and not an additional weakness of the adversarial analysis. In practice, one can use an uncorrupt test set to estimate the distribution of uncorrupt error and to bound the error as a function of the number of corrupt hypotheses.
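The median argument above can be sketched for a single instance. This is a minimal illustration under stated assumptions, not the bound computation itself: the function name `median_range` is hypothetical, and the sketch assumes an odd number of hypotheses so that the median is a single order statistic.

```python
def median_range(predictions, n_corrupt):
    """Interval the ensemble median can be pushed into when n_corrupt of the
    predictions are replaced by arbitrary values.

    predictions holds the uncorrupt outputs of the hypotheses on one instance.
    """
    m = len(predictions)
    if m % 2 == 0:
        raise ValueError("sketch assumes an odd number of hypotheses")
    if 2 * n_corrupt >= m:
        # A corrupt majority can move the median anywhere.
        return (float("-inf"), float("inf"))
    s = sorted(predictions)
    mid = m // 2
    # Moving n_corrupt values from one side of the median to an extreme on the
    # other side shifts the median by n_corrupt order statistics.
    return (s[mid - n_corrupt], s[mid + n_corrupt])
```

Comparing this interval with the true target on uncorrupt test instances gives an empirical picture of how the error can grow with the number of corrupt hypotheses.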

## 9 Conclusion

This paper presents new subspace methods along with a new analysis that shows that, with appropriate parameters, subspace methods can tolerate arbitrary corruption of a limited number of features. While the amount of corruption that can be tolerated depends on unknown details of the problem, we give a statistic that can be used to estimate the worst-case performance using uncorrupt test data. This is similar to the traditional test bound used in iid supervised learning but allows us to extend that framework to handle adversarial changes in the instances. While adversaries are not typically encountered in learning problems, the proofs also apply to other situations that include various types of instance distribution drift. We give experiments to show these algorithms perform well on a range of realistic problems including five UCI datasets and three new datasets based on electromagnetic side channel information.

## References

- [1] S. Alexander, H. Agrawal, R. Chen, J. Hollingsworth, C. Hung, R. Izmailov, J. Koshy, J. Liberti, C. Mesterharm, J. Morman, T. Panagos, M. Pucci, I. Sebuktekin, and S. Tsang. Casper: an efficient approach to detect anomalous code execution from unintended electronic device emissions. In Cyber Sensing 2018, 2018.
- [2] A. Bifet, E. Frank, G. Holmes, and B. Pfahringer. Ensembles of restricted Hoeffding trees. ACM Transactions on Intelligent Systems and Technology, 3, 2012.
- [3] B. Biggio, G. Fumera, and F. Roli. Multiple classifier systems for robust classifier design in adversarial environments. International Journal of Machine Learning and Cybernetics, 1(1-4):27–41, 2010.
- [4] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy 2017, pages 39–57, May 2017.
- [5] R. Caruana and A. Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
- [6] D. Dheeru and E. Karra Taniskidou. UCI machine learning repository, 2018.
- [7] A. Fawzi, O. Fawzi, and P. Frossard. Analysis of classifiers’ robustness to adversarial perturbations. Machine Learning, 107(3):481–508, 2018.
- [8] M. Herbster and M. K. Warmuth. Tracking the best linear predictor. Journal of Machine Learning Research, 1:281–309, 2001.
- [9] T. K. Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, Aug 1998.
- [10] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
- [11] S. Kullback. Information Theory and Statistics. Wiley, New York, 1959.
- [12] J. Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6:273–306, 2005.
- [13] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, 1988.
- [14] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.
- [15] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.
- [16] Y. Meidan, M. Bohadana, Y. Mathov, Y. Mirsky, A. Shabtai, D. Breitenbacher, and Y. Elovici. N-BaIoT—network-based detection of IoT botnet attacks using deep autoencoders. IEEE Pervasive Computing, 17:12–22, 2018.
- [17] C. Mesterharm. Tracking linear-threshold concepts with Winnow. Journal of Machine Learning Research, 4:819–838, 2003.
- [18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- [19] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset Shift in Machine Learning. The MIT Press, Cambridge, Massachusetts, 2009.
- [20] M. Sugiyama and M. Kawanabe. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. The MIT Press, Cambridge, Massachusetts, 2012.
- [21] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.
- [22] F. Topsøe. Some bounds for the logarithmic function. Inequality Theory and Applications, 3, 2003.
- [23] J. Torres-Sospedra, R. Montoliu, A. Martínez-Usó, J. Avariento, T. J. Arnau, M. Benedito-Bordonau, and J. Huerta. UJIIndoorLoc: A new multi-building and multi-floor database for WLAN fingerprint-based indoor localization problems. In International Conference on Indoor Positioning and Indoor Navigation, October 2014.
- [24] A. Wald. Sequential Analysis. John Wiley and Sons, Cambridge, Massachusetts, 1947.