Robust Mean Estimation under Coordinate-level Corruption

# Robust Mean Estimation under Coordinate-level Corruption

## Abstract

Data corruption, systematic or adversarial, may skew statistical estimation severely. Recent work provides computationally efficient estimators that nearly match the information-theoretic optimal statistic. Yet the corruption model they consider measures sample-level corruption and is not fine-grained enough for many real-world applications.

In this paper, we propose a coordinate-level metric of distribution shift over high-dimensional settings with coordinates. We introduce and analyze robust mean estimation techniques against an adversary who may hide individual coordinates of samples while being bounded by that metric. We show that for structured distribution settings, methods that leverage structure to fill in missing entries before mean estimation can improve the estimation accuracy by a factor of approximately compared to structure-agnostic methods.

We also leverage recent progress in matrix completion to obtain estimators for recovering the true mean of the samples in settings of unknown structure. We demonstrate with real-world data that our methods can capture the dependencies across attributes and provide accurate mean estimation even in high-magnitude corruption settings.

## 1 Introduction

Data corruption is an impediment to modern machine learning deployments. Corrupted data entries, i.e., data vectors with either noisy or with missing values, can severely skew the statistical properties of a sample, and can therefore bias statistical estimates and lead to invalid inferences. To alleviate the impact of corrupted data on statistical estimation, recent robust estimators can be used in settings where less than half the data entries are corrupted [9]. This assumption is closely connected to the Huber contamination model [20] that underlies much of the existing robust estimation literature.

The Huber contamination model allows an adversary to introduce an -fraction of outliers or deletions where each data vector is either completely clean or completely corrupted. As a result, distributional discrepancies for this corruption model are closely related to the total variation distance (). Under the Huber contamination model, estimators with robustness certificates rely either on filtering or down-weighting corrupted data vectors to reduce their influence [8, 6]. In many applications, however, we can have partially corrupted data entries and even all data vectors can be partially corrupted. For example, in DNA microarrays measurement errors can occur for batches of genes [35]. Filtering or down-weighting an entire data vector can waste the information contained in the clean coordinates of the vector. These limitations motivate the study of robust statistics under corruption models that are associated with more fine-grained measures of distance between distributions.

Problem Summary   We focus on robust estimation under coordinate-level corruption per sample. We propose a measure of distribution shift, referred to as , which models adversaries that can be stronger than those associated with . Metric enables us to capture adversaries that can strategically corrupt specific coordinates per sample, and thus, for a fixed budget, they can corrupt more samples than adversaries that corrupt samples completely.

We study adversaries that introduce missing values and consider a worst-case or adversarial framework. This choice is motivated by cases where the corruption is systematic rather than easily modeled as random noise. Systematically missing data is arguably a common issue encountered by machine learning practitioners when analyzing real-world data. For example, it is prevalent in gene expression analysis [35] and a core challenge related to dropout measurements in sensor arrays [32]. Under this corruption model, we study the problem of robust mean estimation. Given a corrupted sample from a distribution with unknown mean vector we want to find a vector that closely approximates . We focus on estimators with small Mahalanobis distance, i.e., .

Main Results   We first compare adversaries bounded by against those bounded by . We show that, given a data set with coordinates, -bounded adversaries can introduce more corrupted samples than -bounded adversaries by a multiplicative factor of . The reason is that under the corruption model associated with , adversaries can corrupt specific coordinates per data sample, and thus, can allocate their budget more effectively.

We also study settings where the observed data set exhibits redundancy due to correlations among the coordinates of the observed data vectors. This analysis is motivated by recent applied results in data cleaning under systematic corruption [30, 37] that exploit structure in a corrupted data set to learn statistical measures of the clean data distribution, and use those measures to achieve state-of-the-art performance in tasks such as error detection [16] and missing data imputation [37]. We examine cases where the observed data samples are generated via a linear transformation from a lower-dimensional space to a higher-dimensional space.

We consider the case where the data sample , before corruption is introduced, corresponds to where is a lower-dimensional vector drawn from an unknown distribution . For such linear transformations, we show that the key quantity that determines the power of a -bounded adversary is the minimum number of rows that one needs to remove from to reduce its row space by one. This quantity, denoted , can take values in , where is the number of coordinates in the observed data. We show that in the presence of redundancy, we can devise robust estimators that leverage the structure of the data to reduce the strength of adversaries to that of the standard adversaries, thus, closing the aforementioned multiplicative gap. The above result holds for adversaries that are not restricted to deletion but can also perform value replacements in the observed samples.

We further focus on the problem of robust mean estimation in the presence of adversaries that are limited to deletions. We show that for Gaussian distributions where the covariance is full rank, the problem of robust mean estimation the multiplicative gap of becomes . This result introduces an information-theoretic limit on robust mean estimation under the contamination model associated with . In this setup, we find that standard robust mean estimation methods that filter corrupted data samples achieve this performance.

For the case of a known matrix , we show that a two-step procedure where we first use structure-aware data imputation methods to recover missing entries and then apply empirical mean estimation reduces the aforementioned multiplicative gap from to . On the other hand, standard filtering-based robust estimators retain the initial penalty. This result implies that in the presence of highly structured matrices , i.e., when is close to , the stronger adversary reduces to a adversary, since we can leverage redundancy due to structure to restrict the adversary’s power. We also extend these results to the case of unknown matrices . We show that under bounded budget, we can leverage matrix completion methods designed for deterministic missing patterns [28] to recover the missing entries and obtain a robust mean estimator that achieves the same performance as in the case with known structure.

Finally, we present an experimental evaluation of the proposed robust estimation methods on synthetic setups as well as five real-world data sets. Our experiments on synthetic data provide empirical validation to our results, and show that leveraging structure for robust mean estimation leads to significant accuracy improvements over prior efficient robust estimators even for corrupted samples that do not follow a Gaussian distribution. In addition, our experiments on real-world data demonstrate that leveraging structure yields similar improvements even in data sets that do not conform to the aforementioned linear transformation model.

## 2 Related Work

The literature of robust statistics is growing in many directions. A significant range of problems is studied, such as robust mean and covariance estimation [24, 5, 6, 3, 23, 39], robust optimization [2, 7, 12, 29], robust regression [19, 21, 10, 13], and computational hardness of robustness [15, 22, 10, 18]. We describe works related to our paper.

Robust mean estimation   There has been a lot of progress in robust mean estimation, especially to find computationally efficient algorithms that scale well in high dimensions [24, 6]. These algorithms are based on Huber contamination model [20] or similar models with a sample-level adversary bounded by . We present a detailed comparison with the estimators considered in those works under coordinate-level corruption.

Entry-level Corruption   Zhu et al. [39] defined a family of resilient distributions in which robust estimation is possible and proposed a framework of perturbations under any Wasserstein distance, which can be seen as a generalization of the metric we define. We expand the notion of new perturbation further in robust mean estimation of structured distributions. On the other hand, Loh and Tan  [25] studied learning the sparse precision matrix of the data in the presence of cell-level noise but under a -style adversary. We expand on more fine-grained metrics in our work, but extending our analysis to the robust estimation of the precision matrix would be an interesting extension.

Missing Data Imputation   State-of-the-art methods for data imputation demonstrate that leveraging the redundancy in the observed data yields high accuracy, even in the presence of systematic noise. Singular value decomposition (SVD) based imputation methods [34, 27] assume linear relations across coordinates. Bertsimas et al. [1] uses K-nearest neighbors, support vector machines, and tree based methods to learn the structure and impute missing values. AimNet [37] discovers more complex non-linear structures using attention-based mechanisms. Our theoretical analysis provides intuition as to why these methods outperform solutions that only rely on coordinate-wise statistics.

## 3 Coordinate-level Corruption

In this section, we define the coordinate-level measures of distribution shift () and the adversaries considered in this work. We also compare the proposed measures and adversaries with total variation distance and the standard -fraction adversary considered in the robustness literature. All results presented in this section apply to corruption that corresponds to both missing values and value replacements. All proofs are deferred to the Appendix.

### 3.1 Corruption Models

We consider that the observed data follows an adversarially corrupted distribution that lies within some -ball from the true distribution according to a discrepancy . We consider that corruption occurs to samples first drawn from the true distribution and then corrupted. Given samples from distribution on , the adversary is allowed to inspect the samples and perform one of the following corruptions:

• Sample-level adversary (): The adversary is allowed to remove up to of them and replace them with arbitrary vectors.

• Value-fraction adversary (): The adversary is allowed to remove up to values in each coordinate and replace them with arbitrary values.

• Coordinate-fraction adversary (): The adversary is allowed to remove -fraction of the coordinates in expectation over the samples (up to values in the data set) and replace them with arbitrary values.

Adversary corresponds to the standard adversary associated with Huber’s contamination model, which either corrupts a sample completely or leaves it intact. Adversaries and are more fine-grained and can corrupt only part of the entries of a sample. The difference between and is that the fraction being hidden per coordinate is restricted for , while is able to focus all corruption on a single coordinate.

For sufficiently large , we can use the total variation distance to quantify the shift of the distribution introduced by , which is , where is the distribution after corruption. However, for and , would be misleading because a sample with one missing coordinate and a sample with many missing coordinates would be considered equally corrupted under it. It is more appropriate to differentiate between these cases and model corruption based on each coordinate than whole samples. Thus, we propose a new type of metric, referred to as , which can be used to measure coordinate-level corruption.

###### Definition 1 (dENTRY).

For , define , an indicator for each coordinate. For distributions on ,

 d1ENTRY(D1,D2) =infγ∈Γ(D1,D2)1n∥E(x,y)∼γ[I(x,y)]∥1 d∞ENTRY(D1,D2) =infγ∈Γ(D1,D2)∥E(x,y)∼γ[I(x,y)]∥∞

where is the set of all couplings of and .

These metrics can describe the distribution shift due to corruptions introduced by adversaries and . Specifically, the maximum distance that can shift by is . Similarly, the maximum distance that can shift by is . Next, we analyze the strength of adversaries , , and in more detail.

### 3.2 Adversary Analysis and Comparison

Adversaries and are more fine-grained than the sample-level adversary since they have the freedom to corrupt individual entries of the samples. Similarly, is more fine-grained than . As a result, the strength of and , i.e., the magnitude of the distribution shift they can introduce, can vary in a wider range than that of . We next provide a more detailed comparison of these three adversaries in terms of strength (1) for arbitrary distributions and (2) in the presence of highly structured data.

Arbitrary distributions   We first consider the case when the coordinate-level adversaries and have limited budget and can corrupt a smaller number of samples than . Formally, we have the next proposition.

###### Proposition 1.

If , then can simulate and , and thus is stronger.

In the same way, when , and can simulate sample-level corruptions. Furthermore, we show how the strength of and compare.

###### Proposition 2.

If , then and are stronger than . If , is stronger than . If , is stronger than .

Hence when , i.e. when the total number of missing entries are the same, both and are stronger . Directly from the definitions of and and Proposition 2, we have that when , there are distributions for which the coordinate-level adversary can corrupt -times more samples than adversary .

Structured data   The previous analysis shows that is significantly stronger for arbitrary distributions. However, the characterization due to Propositions 1 and 2 is quite loose as the gap between and is large. This gap raises the natural question: Are there distributions for which this gap is more tight, and hence, we can restrict the power of the coordinate-level adversary to be comparable to that of the sample-level adversary ?

We show that for distributions that exhibit structure, i.e., redundancy across the coordinates of the observed samples, we can reduce the strength of the coordinate-level adversary to roughly that of the sample-level adversary. This means that although the coordinate-level corruption model is more fine-grained and stronger, by using structure to our advantage, we can robustly estimate the mean roughly as accurately as that for sample-level corruption. We formalize the notion of structure next.

###### Definition 2 (Structure).

Assume samples lie in a low-dimensional space such that , where generates samples and represents the underlying structure. We consider the case where the structure is linear, i.e., where .

We consider that the data sample before corruption corresponds to and assume that corruption is introduced in . Matrix introduces redundancy in the corrupted, observed data samples. This allows us to detect corruption that may be introduced in the data and in many cases allows us to recover the true value of corrupted entries, e.g., by solving a system of linear equations (see Section 4.2). In fact, we measure the strength of the redundancy that introduces by considering its row space. Note that the coordinates of , and hence, the corrupted data, will exhibit high redundancy when many rows of span a small subspace. Following this intuition, we define the following quantity to describe how tolerant is to retaining its row space.

###### Definition 3 (mA).

Given a structured sample with we define to be the minimum number of rows one needs to remove from to reduce the dimension of its row space by one.

As we show below, is the key quantity that quantifies how robust estimation can be against coordinate-level corruptions. For example, when is the identity matrix we have that and hence we can remove any row and reduce its row space, thus, we have low redundancy. On the other hand, for the matrix , we have that because we have to remove all ’s in to reduce the dimension of its row space. Overall, it is not hard to see that .

We next show that the higher the value that takes for a matrix , the weaker a coordinate-level adversary becomes due to the increased redundancy. Intuitively, the coordinate-level adversary has to spend more budget per sample to introduce corruptions that will counteract the redundancy introduced by matrix . As a consequence, the coordinate-level adversary cannot alter the original distribution too far in total variation distance, leading to more accurate robust estimation techniques. We have the following theorem.

###### Theorem 1.

Given two probability distributions on , both distributions on transformed linearly by ,

 mAn⋅dTV(D1,D2)≤dENTRY(D1,D2)≤dTV(D1,D2)

This theorem states that as , metric , which bounds the power of the coordinate-level adversaries, is almost equal to , which bounds the sample-level adversary. We derive the following corollary.

###### Corollary 1.

Suppose that is the original distribution on with structure determined by matrix . Let be the observed distribution under with budget (or under with budget of ) and let . We have that:

 mAϵn ≤α,ρ≤ϵ

Therefore, when we have structure, we can extend our results in Proposition 1 to more accurately characterize the relative strength of the two adversaries. The next proposition follows directly from Corollary 1 and shows when the coordinate-level adversary is unable to replicate the corruption caused by the sample-level adversary.

###### Proposition 3.

In summary, we have shown that the presence of redundancy in the observed data samples can limit the power of the more flexible coordinate-level adversary. In the next sections, we will show that this redundancy allows us to devise robust estimators that enjoy better estimation accuracy guarantees than the standard filtering-based robust estimation methods inspired by sample-level corruption schemes.

## 4 Robust Mean Estimation and Structure

We consider the problem of mean estimation for Gaussian distributions and cases where data corruption corresponds to deletion. We show that the redundancy (structure) in the observed data can reduce the power of a coordinate-level adversary to approximately that of a sample-level adversary via structure-aware robust estimation techniques. In Sections 4.1 and 4.2, we provide tight bounds for mean estimation under sample-level and coordinate-level corruption given no structure and the exact structure of . We highlight that is the key quantity that determines these information-theoretic limits. Then, we show how to leverage structure when is unknown in Section 4.3. Table 1 summarizes all results presented in this section. All results hold for adversary budgets bounded away from and for a sufficiently large sample size . All proofs are deferred to the Appendix.

### 4.1 Mean Estimation with No Structure

When there is no structure to the data, the samples do not lie in a lower-dimensional space, i.e., we have that where is full rank. This problem then reduces to mean estimation of a non-degenerate Gaussian, which has been studied extensively in variants of the -corruption models. Here, we prove new tight bounds for mean estimation under adversaries and . Specifically, we show that robust mean estimation is much harder under coordinate-level corruption when we do not have structure.

Under sample-level corruption () where -fraction of samples is corrupted, it is well-known that any mean estimator has to information-theoretically be -far from the true mean in Mahalanobis distance with . On the other hand, the Tukey median [36] is a mean estimator where . Although finding the Tukey median is computationally intractable in high dimensions, we know that it is an optimal estimator of .

Under , who can corrupt -fraction of the coordinates in expectation, hidden coordinates cannot be recovered exactly using other visible coordinates because each dimension is independent of each other. Then, if , can concentrate all corruption in the first coordinate of all samples and, without recovery, we cannot get any estimate of the mean for that coordinate. Thus, the following theorem suggests that coordinate-level corruption is more difficult to tolerate than sample-level corruption.

###### Theorem 2.

Let and be a full rank covariance matrix. Given a set of i.i.d. samples from where the set is corrupted by of budget , an algorithm that outputs a mean estimator must satisfy . Furthermore, there exists an algorithm that obtains error .

Under corruption models and , there are optimal algorithms that achieve error guarantees of and , respectively. If , we know that is a stronger adversary than , and in terms of error guarantee, can corrupt the mean with a multiplicative factor of more than . Therefore, when no structure is present in the data, there is a significant estimation gap between sample-level and coordinate-level corruption.

Under , who can corrupt at most -fraction of each coordinate, the information theoretic optimal estimation depends on the structure of the covariance matrix. The exact characterization requires some definitions. For a positive semi-definite (PSD) matrix , we define by to be a matrix with entries , i.e. a rescaled version of so that the diagonal is . We also let .

###### Theorem 3.

Let be a full rank covariance matrix. Given a set of i.i.d. samples from where the set is corrupted by of budget , an algorithm that outputs a mean estimator must satisfy .

###### Lemma 1.

For any PSD matrix ,

Using these results, we have the following corollary.

###### Corollary 2.

Under , an algorithm that outputs a mean estimator must satisfy .

### 4.2 Mean Estimation with Known Structure

We consider the case when we have full information of matrix . Under sample-level corruption, structure does not help recover missing samples so the previous result of holds for . We now study how much can structure help with mean estimation under and corruption.

With known matrix , we can recover missing coordinates by solving a linear system of equations using the remaining coordinates in the sample and . This process is computationally efficient to recover missing coordinates. If we can recover missing coordinates using structure, the best strategy of or is to corrupt coordinates so that recovery is computationally difficult or even impossible. To this end, the adversary must corrupt at least coordinates for that sample to make coordinate recovery impossible. In fact, the following theorem shows that is the key structural quantity that captures the information-theoretic lower bound in robust estimation.

###### Theorem 4.

Assume the setup from Definition 2 where and let be the budget for . Given a set of samples corrupted by , an algorithm that outputs a mean estimator must satisfy . Furthermore, there exists an algorithm that obtains error .

Theorem 4 shows that for a matrix where , the multiplicative gap between mean estimation under and becomes nearly . This bridged gap implies that leveraging structure by imputation reduces the strength of (and also ) to roughly that of a sample-level adversary.

Consider a linear structure such that each row is sampled uniformly from the unit sphere . Then has rank and with probability . Similarly, for the matrices considered in [28], for almost every with respect to the Lebesgue measure on . Naturally , so the estimation error for is approximately , which is the bound for when . Our main result shows that we can restrict the strength of a coordinate-level adversary to that of a sample-level adversary and is information-theoretically optimal.

### 4.3 Mean Estimation with Unknown Structure

In the real world, we may not know the structure beforehand. Hence, we need to first learn the dependencies that exist between coordinates from the visible entries before we use them to impute the missing ones. When the dependencies are linear, there is a natural connection between matrix completion and missing entry imputation. In this section, we show how matrix completion can help robust mean estimation in the setting of Definitions 2 and 3 and when is unknown but has . We make the assumption of infinite samples in this section.

We focus on unique recovery as it has the same effects as exact imputation. Corollary 1 in [28] gives the conditions in which we can uniquely recover a low rank matrix with missing entries. This result goes beyond random missing values and considers deterministic missing-value patterns. We state it here as the following lemma, with small changes made to adapt to our context where each sample can be viewed as one column of a matrix.

###### Lemma 2.

Assume the setting in Definitions 2 and 3 plus the condition that is unknown but has . If there exist disjoint groups of samples, and in each group, any samples have at least dimensions which are not completely hidden, all the samples with at least visible entries can be uniquely recovered.

The key to matrix completion is learning the -dimensional subspace spanned by the samples. The samples with more than visible entries provide some information to identify the subspace. Once the subspace is known, any sample with at least visible entries can be uniquely recovered.

If we consider the case where the data is corrupted by , we have the following result.

###### Theorem 5.

Assume the setting in Definitions 2 and 3 where is unknown. Under corruption with budget , there exists an algorithm that obtains error .

Theorem 5 is based on Theorem 4 and the following lemma.

###### Lemma 3.

Assume the setting in Definition 3 plus and is unknown. When the data are corrupted by with budget , if , we cannot recover any of the samples in the worst case, otherwise we can recover all the samples with less than missing entries.

If the data is corrupted by , we have the following result.

###### Theorem 6.

Assume the setting in Definition 3 plus the condition that , and is unknown but has . Under corruption with budget , there exists an algorithm that obtains error .

Note that the threshold of the budget in Theorem 6 is greater than or equal to the threshold in Theorem 5. Similar to Theorem 5, Theorem 6 can be obtained from Theorem 4 and the following lemma. Theorem 4 can be applied here because can simulate .

###### Lemma 4.

When the data are corrupted by with budget , if , we cannot recover any of the samples in the worst case; if , we can recover all the samples with at least visible entries.

In summary, the above show that when the budget of the coordinate-level adversaries is bounded we can design structure-aware robust mean estimation methods even when the structure of is unknown.

## 5 Experiments

We present an empirical comparison of robust mean estimation methods on both synthetic and real-world data. We focus on the task of mean estimation and corruptions that correspond to missing values. We seek to empirically validate two points: (1) the accuracy improvements that structure-aware robust estimators yield against standard robust estimators when applied to distributions beyond the Gaussian setting considered in Section 4, and (2) the effect of structure-aware mean estimation on real-world data that might not exhibit dependencies due to linear structure.

Methods and Experimental Setup   We consider the following mean estimation methods:

• Empirical Mean: Take the mean for each coordinate, ignoring all missing entries.

• Data Sanitization: Remove any samples with missing entries, and then take the mean of the rest of the data.

• Coordinate-wise Median: Take the median for each coordinate, ignoring all missing entries.

• Matrix Completion: Use iterative hard-thresholded SVD (ITHSVD) [4] to impute the missing entries. Take the mean afterwards. We use randomized SVD [14] to accelerate.

• Exact Imputation: For each sample, build a linear system based on the structure and solve it. If the linear system is under-determined, do nothing. Then, take the mean while ignoring the remaining missing values.

The methods can be classified into three categories, based on the amount of structural information they leverage: (1) Empirical Mean, Data Sanitization, and Coordinate-wise Median ignore the structure information; (2) Matrix Completion assumes there exists some unknown structure but it can be inferred from the visible data; (3) Exact Imputation knows exactly what the structure is and uses it to impute the missing values.

In each experiment presented below, we inject missing values by hiding the smallest fraction of each dimension. For synthetic data sets, the true mean is derived from the data generation procedure. For real-world data sets, we use the empirical mean of the samples before corruption approximate the true mean. For synthetic data sets, we consider the and Mahalanobis distances to measure the estimation accuracy of different methods. For real-world data, we only consider the distance between the estimated mean and the true empirical mean of the data before corruption.

Mean Estimation on Synthetic Data   We show that redundancy in the corrupted data can help improve the robustness of mean estimation. We test all the methods on synthetic data sets with linear structure () and three kinds of latent variables (): Gaussian, Uniform, and Exponential. Each sample is generated by , where is sampled from the distribution describing the latent variable . We set to be a diagonal block matrix with two blocks generated randomly and fixed through the experiments. In every experiment, we consider a sample with 1,000 data vectors. To reduce the effect of random fluctuations, we repeat our experiments for five random instances of the latent distribution for each type of latent distribution and take the average error.

The results for the above experiments are shown in Figure 1 and Figure 2. Figure  1 shows the mean estimation error of different methods measured using the Mahalanobis distance, and Figure 2 shows that in distance. We see that estimators that leverage the redundancy in the observed data to counteract corruption yield more accurate mean estimates. This behavior is consistent across all types of distributions and not only for the case of Gaussian distributions that the theoretical analysis in Section 4 focuses on. We see that the performance of Matrix Completion (when the structure of is considered unknown) is the same as that of Exact Imputation (when the structure of is known) when the fraction of missing entries is below a certain threshold. Following our analysis in Section 4.3, this threshold corresponds to the conditions for which the subpace spanned by the samples can be learned from the visible data. Finally, we point out that we do not report results for Data Sanitization when the missing fraction is high because all samples get filtered.

Mean Estimation on Real-world Data   We turn our attention to settings with real-world data with unknown structure. We use five data sets from the UCI repository [11] for the experiments in this section. Specifically, we consider: Leaf [31], Breast Cancer Wisconsin [26], Blood Transfusion [38], Wearable Sensor [33], and Mice Protein Expression [17]. For each data set, we consider the numeric features; all of these features are also standardized. For all the data sets, we report the error. We summarize the size of the data sets along with the rank used for Matrix Completion in Table 2. As the structure is unknown, we omit exact imputation.

We show the results in Figure 3. We find that Matrix Completion always outperforms Empirical Mean and Coordinate-wise Median on three data sets (Breast Cancer Wisconsin, Wearable Sensor, Mice Protein Expression). For the other two (Leaf, Blood Transfusion), the error of Matrix Completion can be as much as two-times lower than the error of the other two methods for small ’s ( for Leaf and for Blood Transfusion). We also see that the estimation error becomes very high only for large values of . It is also interesting to observe that Data Sanitization performs worse than the Empirical Mean and the Coordinate-wise Median for real-world data. Recall that the opposite behavior was recorded for the synthetic data setups in the previous section. Overall, these results demonstrate that structure-aware robust estimators can outperform the standard filtering-based robust mean estimators even in setups that do not follow the linear structure setup in Section 4.

## 6 Discussion and Extensions

We have shown that utilizing redundancy in the observed samples to counteract corruption in both real-world and synthetic data sets can substantially improve estimates of the mean of a distribution; when the level of corruption is below a threshold, these techniques do not require knowing the structure beforehand.

Our results could be extended by considering other forms of structure; our results here have been exclusive to distributions with linear structure. Though approximate linear structure is present in many data sets, more complex forms of structure may result in more accurate recovery of missing entries in real-world data as they could more accurately approximate the true structure of the distribution.

Our model of adversarial corruption could also be extended to allow both hidden and otherwise modified entries as well as random errors in the data. Our experimental results show that leveraging redundancy is beneficial when the data is only approximately structured, making these promising avenues of future investigation. A complete model would also take the computational budget of the defender into account. Finally, our results could be extended to statistics other than the mean and to models of the effect of structure-based imputation as preprocessing for learning algorithms.

## 7 Conclusion

We have proposed a new model of data corruption that better represents fine-grained modifications, and allows for stronger and more flexible adversaries. Given such an adversary with a limited budget, we have shown that, given almost any linear structure , robust mean estimation techniques that leverage structure outperform those that do not. Even when the structure is unknown, we have shown that matrix completion techniques can still make recovery possible; we have demonstrated that such techniques are effective on real-world data.

Real-world data can include missing entries, and rejecting all partially missing samples via data sanitization techniques can result in substantially higher error than techniques which are able to make use of the entries that remain, particularly when compared to those which leverage the structure of the data. Given the redundancy that is often present between features of a sample in many data sets, and as, in real-world settings, these correlations often reflect an underlying structure in the data, using structure-learning techniques and imputation to recover missing or incorrect entries can result in significant improvements being realized in practice.

## 8 Appendix

We now provide the proofs for all results stated in the main body of our work. We also provide additional figures from our experimental evaluation that were omitted due to space constraints. Finally, we discuss extensions and future directions related to our work.

### 8.1 Proof of Proposition 1

###### Proof.

Suppose that affects at least one entry in a subset of all samples. As at least one coordinate per sample is corrupted, must be at most an -fraction of all samples; since the sample-level adversary can corrupt the entirety of every sample partially corrupted by the coordinate-level adversary, and thus, it is a stronger adversary given this condition. The proof for is similar. ∎

### 8.2 Proof of Proposition 2

###### Proof.

If , similarly to the proof of Proposition 1, and can simulate by placing all its corruptions on the coordinates corrupted by . If , can simulate by corrupting the coordinates corrupted by since can never corrupt more than -fraction of coordinates in expectation. On the other hand, if , can corrupt whatever coordinates decides to corrupt since cannot corrupt more than -fraction of one coordinate. Thus, the three statements hold. ∎

### 8.3 Proof of Theorem 1

###### Proof.

First, we will show the case for .

 d1ENTRY(D1,D2) =infγ∈Γ(D1,D2)∥E(x,y)∼γ[I(x,y)]∥1n =infγ∈Γ(D1,D2)E(x,y)∼γ[||x−y||0n] =dTV(D1,D2)

Now, we can write and for some distributions and on .

Suppose that . Now, suppose by way of contradiction that is nonzero for fewer than values of . Call the rows of and let be the subspace of spanned by the . As , is nonzero. Hence, is nonzero for some so is nonzero.

Now, let be a basis for containing . Consider the subspace of spanned by . As , cannot be an element of and so is not a basis for . Thus, the dimension of is less than that of ; as we have a contradiction of the definition of . Thus, if , must be nonzero for at least values of , and hence .

Now, suppose that for some . Then, and for some . If , then so . Thus

 ||A(x′−y′)||0≥mA

by the above, and so

 E(x,y)∼γ[||x−y||0]≥mAPr(x,y)∼γ[x≠y]

Therefore, we have that

 d1ENTRY(D1,D2) =infγ∈Γ(D1,D2)∥E(x,y)∼γ[I(x,y)]∥1n =infγ∈Γ(D1,D2)E(x,y)∼γ[||x−y||0n] ≥infγ∈Γ(D1,D2)mAnPr(x,y)∼γ[x≠y] =mAndTV(D1,D2)

In the case of , the left hand side follows from above by using the fact that for . Then,

 d∞ENTRY(D1,D2) =infγ∈Γ(D1,D2)∥E(x,y)∼γ[I(x,y)]∥∞ =infγ∈Γ(D1,D2)maxiPr(x,y)∼γ[xi≠yi] =dTV(D1,D2)

Therefore, the theorem holds for the metric. ∎

### 8.4 Proof of Theorem 2

###### Proof.

With a budget of , can concentrate its corruption on one particular coordinate, say the first coordinate. If , we will lose all information for the first coordinate, making mean estimation impossible. Since , can corrupt -fraction of first coordinates of all samples. Since the marginal distribution with respect to the first dimension is a univariate Gaussian, information-theoretically any mean estimator of the first coordinate must be -far from the true mean of the first coordinate.

Now with a corrupted set of samples, only at most -fraction of the samples will be corrupted on one or more coordinates. Then the corrupted distribution will have . Therefore, the Tukey median for the corrupted set of samples achieves error . ∎

### 8.5 Proof of Theorem 3

###### Proof.

Let be the maximizer of and let be the vector with entries . To complete the proof, we will show that . To do this, we are going to use a hybrid argument showing that by only hiding fraction of the entries in the -th coordinate, and become indistinguishable. This is because, . By applying this argument sequentially for every coordinate, and are indistinguishable under an adversary. Since the total distance between and in Mahalanobis distance is at least , the theorem follows. ∎

### 8.6 Proof of Lemma 1

###### Proof.

We have that is a PSD matrix with diagonal elements equal to 1. Consider a random with uniformly random coordinates in . Then, . Thus, . This lower bound is tight for .

For the upper-bound, we notice that since is PSD, it holds that . To see this notice that for both and .

Given this, we have that . This gives the required upper-bound. Notice that the upper-bound is tight for the matrix consisting entirely of ’s. ∎

### 8.7 Proof of Theorem 4

###### Proof.

Define be the fraction of samples that has at least one corrupted coordinate. Note that must corrupt at least coordinates of a sample to make his corruptions non-recoverable. Given that we can recover any sample with less than corrupted coordinates, we have that . If is the original distribution on and is the observed distribution, then . Since between the two Gaussians is less than or equal to , the Tukey median algorithm achieves .

For the lower bound, can corrupt -fraction of the samples so that the coordinates are non-recoverable and shift part of the original distribution to anywhere along the axes of missing coordinates. Then, the proof follows the lower bound proof for estimating the mean of a Gaussian corrupted by . Hence, since we cannot distinguish between two Gaussians that share of mass,

### 8.8 Proof of Lemma 3

###### Proof.

The adversary can simply hide one coordinate completely to prevent us from recovering that coordinate if at least one missing coordinate per sample in expectation is allowed. If the expected number of missing coordinates per sample is less than one, there must then be some positive fraction of samples with no missing coordinates; as we have infinite samples, we can select any disjoint sets of such samples to satisfy the conditions in Lemma 2. ∎

### 8.9 Proof of Lemma 4

###### Proof.

First, We introduce the concept of hidden patterns. The set of coordinates missing from a sample forms its hidden pattern. We only consider the hidden patterns which have been applied to infinitely many samples as if only finitely many samples share a pattern, the adversary could hide those samples completely with budget.

When , the adversary is able to hide entries for every sample, and we cannot learn the structure from samples with only visible entries.

When , the adversary does not have enough budget to hide entries of all the samples, so there exist some patterns with at least coordinates visible. We use to denote the number of such patterns, and to denote the probability of the pattern , .

Next, we show a necessary condition for the adversary to prevent us from learning the structure. Since we have infinitely many samples, one group of samples satisfying the conditions in Lemma 2 is enough to learn the structure since we can find another groups by choosing samples with the same hidden patterns.

It is obvious that the adversary has to hide at least one coordinate per pattern, otherwise we have infinitely many samples without corruption. No matter what the patterns the adversary provides, we try to get the samples satisfies the conditions in Lemma 2 by the following sampling procedure.

1. Start with one of the patterns, pick visible coordinates of it to form the initial visible set . Take one sample from this pattern to form the initial sample group . Mark this pattern as checked.

2. For , take one of the unchecked patterns and check if it contains at least one visible coordinate not in , the current visible set. If so, take one sample from it and pick any one of its visible coordinates . Add to the sample group and to the visible set: , . If not, skip it. Mark the pattern as checked.

We show by induction that any different samples in have at least coordinates in not completely hidden. It is trivial that the property holds for . Assume that the property holds for . According to the sampling procedure, when a new sample comes, it has at least one visible coordinate not in . Consider any different samples in . If the samples don’t include the new sample , by the induction assumption they have at least coordinates in not completely hidden. If is one of the samples, again by the induction assumption the other samples have at least coordinates in not completely hidden, plus of is also not hidden, so there are at least coordinates in not completely hidden. Thus, the property also holds for . By induction, any distinct samples from the group we get at the end of step 2 have at least coordinates not completely hidden, which means if the group has at least samples, the conditions in Lemma 2 can be satisfied.

Denote the set of the patterns being picked as and the set of the patterns being skipped as . Based on the previous analysis, the adversary has to manipulate the patterns so that , in which case the visible set cannot cover all the coordinates, which means there exists at least one common hidden coordinate for the patterns in (otherwise the pattern where that coordinate is visible should have been picked). Since the fraction of hidden entries in that common coordinate is less than or equal to , the sum of the probabilities of the patterns in satisfies . Since all the patterns have at least one missing coordinate, we also have . Thus, we have . In such a case, the overall fraction of missing entries satisfies

 η ≥M∑l=1pl1n+(1−M∑l=1pl)n−rn ≥(n−r)ρ1n+(1−(n−r)ρ)n−rn

The first inequality holds because for the samples with at least visible entries, there are at least missing entries per sample, and for the samples with less than visible entries, there are at least missing entries per sample. In addition, also satisfies , so we have , which is a necessary condition for the adversary. Thus, if , we can learn the structure and impute all the samples with at least visible entries. ∎

### References

1. D. Bertsimas, C. Pawlowski and Y. D. Zhuo (2017) From predictive methods to missing data imputation: an optimization approach. The Journal of Machine Learning Research 18 (1), pp. 7133–7171. Cited by: §2.
2. M. Charikar, J. Steinhardt and G. Valiant (2017) Learning from untrusted data. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 47–60. Cited by: §2.
3. Y. Cheng, I. Diakonikolas, R. Ge and D. Woodruff (2019) Faster algorithms for high-dimensional robust covariance estimation. arXiv preprint arXiv:1906.04661. Cited by: §2.
4. E. Chunikhina, R. Raich and T. Nguyen (2014) Performance analysis for matrix completion via iterative hard-thresholded svd. In 2014 IEEE Workshop on Statistical Signal Processing (SSP), pp. 392–395. Cited by: 4th item.
5. C. Daskalakis, T. Gouleakis, C. Tzamos and M. Zampetakis (2018) Efficient statistics, in high dimensions, from truncated samples. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pp. 639–649. Cited by: §2.
6. I. Diakonikolas, G. Kamath, D. Kane, J. Li, A. Moitra and A. Stewart (2019) Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing 48 (2), pp. 742–864. Cited by: §1, §2, §2.
7. I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, J. Steinhardt and A. Stewart (2018) Sever: a robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815. Cited by: §2.
8. I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra and A. Stewart (2017) Being robust (in high dimensions) can be practical. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICMLâ17, pp. 999â1008. Cited by: §1.
9. I. Diakonikolas and D. M. Kane (2019) Recent advances in algorithmic high-dimensional robust statistics. External Links: 1911.05911 Cited by: §1.
10. I. Diakonikolas, W. Kong and A. Stewart (2019) Efficient algorithms and lower bounds for robust linear regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 2745–2754. Cited by: §2.
11. D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §5.
12. J. Duchi and H. Namkoong (2018) Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750. Cited by: §2.
13. C. Gao (2020) Robust regression via mutivariate regression depth. Bernoulli 26 (2), pp. 1139–1170. Cited by: §2.
14. N. Halko, P. Martinsson and J. A. Tropp (2011) Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM review 53 (2), pp. 217–288. Cited by: 4th item.
15. M. Hardt and A. Moitra (2013) Algorithms and hardness for robust subspace recovery. In Conference on Learning Theory, pp. 354–375. Cited by: §2.
16. A. Heidari, J. McGrath, I. F. Ilyas and T. Rekatsinas (2019) HoloDetect: few-shot learning for error detection. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD â19, New York, NY, USA, pp. 829â846. External Links: ISBN 9781450356435, Link, Document Cited by: §1.
17. C. Higuera, K. J. Gardiner and K. J. Cios (2015) Self-organizing feature maps identify proteins critical to learning in a mouse model of down syndrome. PloS one 10 (6). Cited by: §5.
18. S. B. Hopkins and J. Li (2019) How hard is robust mean estimation?. arXiv preprint arXiv:1903.07870. Cited by: §2.
19. P. J. Huber (1973) Robust regression: asymptotics, conjectures and monte carlo. The Annals of Statistics 1 (5), pp. 799–821. Cited by: §2.
20. P. J. Huber (1992) Robust estimation of a location parameter. In Breakthroughs in statistics, pp. 492–518. Cited by: §1, §2.
21. A. Klivans, P. K. Kothari and R. Meka (2018) Efficient algorithms for outlier-robust regression. arXiv preprint arXiv:1803.03241. Cited by: §2.
22. A. Klivans and P. Kothari (2014) Embedding hard learning problems into gaussian space. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2014), Cited by: §2.
23. V. Kontonis, C. Tzamos and M. Zampetakis (2019) Efficient truncated statistics with unknown truncation. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pp. 1578–1595. Cited by: §2.
24. K. A. Lai, A. B. Rao and S. Vempala (2016) Agnostic estimation of mean and covariance. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pp. 665–674. Cited by: §2, §2.
25. P. Loh and X. L. Tan (2018) High-dimensional robust precision matrix estimation: cellwise corruption under -contamination. Electron. J. Statist. 12 (1), pp. 1429–1467. External Links: Cited by: §2.
26. O. L. Mangasarian and W. H. Wolberg (1990) Cancer diagnosis via linear programming. Technical report University of Wisconsin-Madison Department of Computer Sciences. Cited by: §5.
27. R. Mazumder, T. Hastie and R. Tibshirani (2010) Spectral regularization algorithms for learning large incomplete matrices. Journal of machine learning research 11 (Aug), pp. 2287–2322. Cited by: §2.
28. D. L. Pimentel-Alarcón, N. Boston and R. D. Nowak (2016) A characterization of deterministic sampling patterns for low-rank matrix completion. IEEE Journal of Selected Topics in Signal Processing 10 (4), pp. 623–636. Cited by: §1, §4.2, §4.3.
29. A. Prasad, A. S. Suggala, S. Balakrishnan and P. Ravikumar (2018) Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485. Cited by: §2.
30. T. Rekatsinas, X. Chu, I. F. Ilyas and C. Ré (2017) HoloClean: holistic data repairs with probabilistic inference. PVLDB 10 (11), pp. 1190–1201. External Links: Cited by: §1.
31. P. F. Silva, A. R. Marcal and R. M. A. da Silva (2013) Evaluation of features for leaf discrimination. In International Conference Image Analysis and Recognition, pp. 197–204. Cited by: §5.
32. D. C. Swanson (2001) Signal processing for intelligent sensor systems. Acoustical Society of America. Cited by: §1.
33. R. L. S. Torres, D. C. Ranasinghe, Q. Shi and A. P. Sample (2013) Sensor enabled wearable rfid technology for mitigating the risk of falls near beds. In 2013 IEEE International Conference on RFID (RFID), pp. 191–198. Cited by: §5.
34. O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein and R. B. Altman (2001) Missing value estimation methods for dna microarrays. Bioinformatics 17 (6), pp. 520–525. Cited by: §2.
35. O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein and R. B. Altman (2001) Missing value estimation methods for dna microarrays. Bioinformatics 17 (6), pp. 520–525. Cited by: §1, §1.
36. J. W. Tukey (1975) Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975, Vol. 2, pp. 523–531. Cited by: §4.1.
37. R. Wu, A. Zhang, I. Ilyas and T. Rekatsinas (2020) AimNet: attention-based learning for missing data imputation. MLsysâ20. Cited by: §1, §2.
38. I. Yeh, K. Yang and T. Ting (2009) Knowledge discovery on rfm model using bernoulli sequence. Expert Systems with Applications 36 (3), pp. 5866–5871. Cited by: §5.
39. B. Zhu, J. Jiao and J. Steinhardt (2019) Generalized resilience and robust statistics. arXiv preprint arXiv:1909.08755. Cited by: §2, §2.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters