Fast and Robust Least Squares Estimation in Corrupted Linear Models
Abstract
Subsampling methods have been recently proposed to speed up least squares estimation in large scale settings. However, these algorithms are typically not robust to outliers or corruptions in the observed covariates.
The concept of influence that was developed for regression diagnostics can be used to detect such corrupted observations as shown in this paper. This property of influence – for which we also develop a randomized approximation – motivates our proposed subsampling algorithm for large scale corrupted linear regression which limits the influence of data points since highly influential points contribute most to the residual error. Under a general model of corrupted observations, we show theoretically and empirically on a variety of simulated and real datasets that our algorithm improves over the current stateoftheart approximation schemes for ordinary least squares.
Fast and Robust Least Squares Estimation in Corrupted Linear Models
Brian McWilliams Gabriel Krummenacher^{†}^{†}thanks: Joint first author Mario Lučić Joachim M. Buhmann Department of Computer Science ETH Zürich, Switzerland {mcbrian,gabriel.krummenacher,mario.lucic,jbuhmann}@inf.ethz.ch
1 Introduction
To improve scalability of the widely used ordinary least squares algorithm, a number of randomized approximation algorithms have recently been proposed. These methods, based on subsampling the dataset, reduce the computational time from to ^{1}^{1}1Informally: means grows more slowly than . [14]. Most of these algorithms are concerned with the classical fixed design setting or the case where the data is assumed to be sampled i.i.d. typically from a subGaussian distribution [7]. This is known to be an unrealistic modelling assumption since realworld data are rarely wellbehaved in the sense of the underlying distributions.
We relax this limiting assumption by considering the setting where with some probability, the observed covariates are corrupted with additive noise. This scenario corresponds to a generalised version of the classical problem of “errorsinvariables” in regression analysis which has recently been considered in the context of sparse estimation [12]. This corrupted observation model poses a more more realistic model of real data which may be subject to many different sources of measurement noise or heterogeneity in the dataset.
A key consideration for sampling is to ensure that the points used for estimation are typical of the full dataset. Typicality requires the sampling distribution to be robust against outliers and corrupted points. In the i.i.d. subGaussian setting, outliers are rare and can often easily be identified by examining the statistical leverage scores of the datapoints.
Crucially, in the corrupted observation setting described in §2, the concept of an outlying point concerns the relationship between the observed predictors and the response. Now, leverage alone cannot detect the presence of corruptions. Consequently, without using additional knowledge about the corrupted points, the OLS estimator (and its subsampled approximations) are biased. This also rules out stochastic gradient descent (SGD) – which is often used for large scale regression – since convex cost functions and regularizers which are typically used for noisy data are not robust with respect to measurement corruptions.
This setting motivates our use of influence – the effective impact of an individual datapoint exerts on the overall estimate – in order to detect and therefore avoid sampling corrupted points. We propose an algorithm which is robust to corrupted observations and exhibits reduced bias compared with other subsampling estimators.
Outline and Contributions.
In §2 we introduce our corrupted observation model before reviewing the basic concepts of statistical leverage and influence in §3. In §4 we briefly review two subsampling approaches to approximating least squares based on structured random projections and leverage weighted importance sampling. Based on these ideas we present influence weighted subsampling (IWSLS), a novel randomized least squares algorithm based on subsampling points with small influence in §5.
In §6 we analyse IWSLS in the general setting where the observed predictors can be corrupted with additive subGaussian noise. Comparing the IWSLS estimate with that of OLS and other randomized least squares approaches we show a reduction in both bias and variance. It is important to note that the simultaneous reduction in bias and variance is relative to OLS and randomized approximations which are only unbiased in the noncorrupted setting. Our results rely on novel finite sample characteristics of leverage and influence which we defer to §SI.3. Additionally, in §SI.4 we prove an estimation error bound for IWSLS in the standard subGaussian model.
Computing influence exactly is not practical in largescale applications and so we propose two randomized approximation algorithms based on the randomized leverage approximation of [8]. Both of these algorithms run in time which improve scalability in large problems. Finally, in §7 we present extensive experimental evaluation which compares the performance of our algorithms against several randomized least squares methods on a variety of simulated and real datasets.
2 Statistical model
In this work we consider a variant of the standard linear model
(1) 
where is a noise term independent of . However, rather than directly observing we instead observe where
(2) 
and is a Bernoulli random variable with probability of being 1. is a matrix of measurement corruptions. The rows of therefore are corrupted with probability and not corrupted with probability .
Definition 1 (Subgaussian matrix).
A zeromean matrix is called subGaussian with parameter if (a) Each row is sampled independently and has . (b) For any unit vector , is a subGaussian random variable with parameter at most .
We consider the specific instance of the linear corrupted observation model in Eqs. (1), (2) where

are subGaussian with parameters and respectively,

is subGaussian with parameters ,
and all are independent of each other.
The key challenge is that even when and the magnitude of the corruptions, are relatively small, the standard linear regression estimate is biased and can perform poorly (see §6). Sampling methods which are not sensitive to corruptions in the observations can perform even worse if they somehow subsample a proportion of corrupted points. Furthermore, the corruptions may not be large enough to be detected via leverage based techniques alone.
The model described in this section generalises the “errorsinvariables” model from classical least squares modelling. Recently, similar models have been studied in the high dimensional () setting in [5, 4, 6, 12] in the context of robust sparse estimation. The “lowdimensional” () setting is investigated in [5], but the “big data” setting () has not been considered so far.^{2}^{2}2Unlike [6, 12] and others we do not consider sparsity in our solution since .
In the highdimensional problem, knowledge of the corruption covariance, [12], or the data covariance [6], is required to obtain a consistent estimate. This assumption may be unrealistic in many settings. We aim to reduce the bias in our estimates without requiring knowledge of the true covariance of the data or the corruptions, and instead subsample only noncorrupted points.
3 Diagnostics for linear regression
In practice, the subGaussian linear model assumption is often violated either by heterogeneous noise or by a corruption model as in §2. In such scenarios, fitting a least squares model to the full dataset is unwise since the outlying or corrupted points can have a large adverse effect on the model fit. Regression diagnostics have been developed in the statistics literature to detect such points (see e.g. [2] for a comprehensive overview). Recently, [14] proposed subsampling points for least squares based on their leverage scores. Other recent works suggest related influence measures that identify subspace [16] and multiview [15] clusters in high dimensional data.
3.1 Statistical leverage
For the standard linear model in Eq. (1), the well known least squares solution is
(3) 
The projection matrix with specifies the subspace in which the residual lies. The diagonal elements of the “hat matrix” , , are the statistical leverage scores of the sample. Leverage scores quantify to what extent a particular sample is an outlier with respect to the distribution of .
An equivalent definition from [14] which will be useful later concerns any matrix which spans the column space of (for example, the matrix whose columns are the left singular vectors of ). The statistical leverage scores of the rows of are the squared row norms of , i.e. .
Although the use of leverage can be motivated from the least squares solution in Eq. (3), the leverage scores do not take into account the relationship between the predictor variables and the response variable . Therefore, lowleverage points may have a weak predictive relationship with the response and viceversa. In other words, it is possible for such points to be outliers with respect to the conditional distribution but not the marginal distribution on .
3.2 Influence
A concept that captures the predictive relationship between covariates and response is influence. Influential points are those that might not be outliers in the geometric sense, but instead adversely affect the estimated coefficients. One way to assess the influence of a point is to compute the change in the learned model when the point is removed from the estimation step. [2].
We can compute a leaveoneout least squares estimator by straightforward application of the ShermanMorrisonWoodbury formula (see Prop. 3 in §SI.3):
where . Defining the influence^{3}^{3}3The expression we use is also called Cook’s distance [2]., as the change in expected mean squared error we have
Points with large values of are those which, if added to the model, have the largest adverse effect on the resulting estimate. Since influence only depends on the OLS residual error and the leverage scores, it can be seen that the influence of every point can be computed at the cost of a least squares fit. In the next section we will see how to approximate both quantities using random projections.
4 Fast randomized least squares algorithms
We briefly review two randomized approaches to least squares approximation: the importance weighted subsampling approach of [9] and the dimensionality reduction approach [14]. The former proposes an importance sampling probability distribution according to which, a small number of rows of and are drawn and used to compute the regression coefficients. If the sampling probabilities are proportional to the statistical leverages, the resulting estimator is close to the optimal estimator [9]. We refer to this as LEVLS.
The dimensionality reduction approach can be viewed as a random projection step followed by a uniform subsampling. The class of JohnsonLindenstrauss projections – e.g. the SRHT – has been shown to approximately uniformize leverage scores in the projected space. Uniformly subsampling the rows of the projected matrix proves to be equivalent to leverage weighted sampling on the original dataset [14]. We refer to this as SRHTLS. It is analysed in the statistical setting by [7] who also propose ULURU, a two step fitting procedure which aims to correct for the subsampling bias and consequently converges to the OLS estimate at a rate independent of the number of subsamples [7].
Subsampled Randomized Hadamard Transform (SRHT)
The SHRT consists of a preconditioning step after which rows of the new matrix are subsampled uniformly at random in the following way with the definitions [3]:

is a subsampling matrix.

is a diagonal matrix whose entries are drawn independently from .

is a normalized WalshHadamard matrix^{4}^{4}4For the Hadamard transform, must be a power of two but other transforms exist (e.g. DCT, DFT) for which similar theoretical guarantees hold and there is no restriction on . which is defined recursively as
We set so it has orthonormal columns.
As a result, the rows of the transformed matrix have approximately uniform leverage scores. (see [17] for detailed analysis of the SRHT). Due to the recursive nature of , the cost of applying the SRHT is operations, where is the number of rows sampled from [1].
The SRHTLS algorithm solves which for an appropriate subsampling ratio, results in a residual error, which satisfies
(4) 
where is the vector of OLS residual errors [14].
Randomized leverage computation
Recently, a method based on random projections has been proposed to approximate the leverage scores based on first reducing the dimensionality of the data using the SRHT followed by computing the leverage scores using this lowdimensional approximation [8, 13, 10, 9].
The leverage approximation algorithm of [8] uses a SRHT, to first compute the approximate SVD of , Followed by a second SHRT to compute an approximate orthogonal basis for
(5) 
The approximate leverage scores are now the squared row norms of , From [14] we derive the following result relating to randomized approximation of the leverage
(6) 
where the approximation error, depends on the choice of projection dimensions and .
The leverage weighted least squares (LEVLS) algorithm samples rows of and with probability proportional to (or in the approximate case) and performs least squares on this subsample. The residual error resulting from the leverage weighted least squares is bounded by Eq. (4) implying that LEVLS and SRHTLS are equivalent [14]. It is important to note that under the corrupted observation model these approximations will be biased.
5 Influence weighted subsampling
In the corrupted observation model, OLS and therefore the random approximations to OLS described in §4 obtain poor predictions. To remedy this, we propose influence weighted subsampling (IWSLS) which is described in Algorithm 1. IWSLS subsamples points according to the distribution, where is a normalizing constant so that . OLS is then estimated on the subsampled points. The sampling procedure ensures that points with high influence are selected infrequently and so the resulting estimate is less biased than the full OLS solution.
Obviously, IWSLS is impractical in the scenarios we consider since it requires the OLS residuals and full leverage scores. However, we use this as a baseline and to simplify the analysis. In the next section, we propose an approximate influence weighted subsampling algorithm which combines the approximate leverage computation of [8] and the randomized least squares approach of [14].
Randomized approximation algorithms.
Using the ideas from §4 and §4 we obtain the following randomized approximation to the influence scores
(7) 
where is the residual error computed using the SRHTLS estimator. Since the approximation errors of and are bounded (inequalities (4) and (6)), this suggests that our randomized approximation to influence is close to the true influence.
Basic approximation.
Residual Weighted Sampling.
Leverage scores are typically uniform [13, 7] for subGaussian data. Even in the corrupted setting, the difference in leverage scores between corrupted and noncorrupted points is small (see §6). Therefore, the main contribution to the influence for each point will originate from the residual error, . Consequently, we propose sampling with probability inversely proportional to the approximate residual, . The resulting algorithm Residual Weighted Subsampling (aRWSLS) is detailed in Algorithm 2. Although aRWSLS is not guaranteed to be a good approximation to IWSLS, empirical results suggests that it works well in practise and is faster to compute than aIWSLS.
Computational complexity.
Clearly, the computational complexity of IWSLS is . The computation complexity of aIWSLS is , where the first term is the cost of SRHTLS, the second term is the cost of approximate leverage computation and the last term solves OLS on the subsampled dataset. Here, is the dimension of the random projection detailed in Eq. (5). The cost of aRWSLS is where the first term is the cost of SRHTLS, the second term is the cost of computing the residuals , and the last term solves OLS on the subsampled dataset. This computation can be reduced to . Therefore the cost of both aIWSLS and aRWSLS is .
6 Estimation error
In this section we will prove an upper bound on the estimation error of IWSLS in the corrupted model. First, we show that the OLS error consists of two additional variance terms that depend on the size and proportion of the corruptions and an additional bias term. We then show that IWSLS can significantly reduce the relative variance and bias in this setting, so that it no longer depends on the magnitude of the corruptions but only on their proportion. We compare these results to recent results from [12, 5] suggesting that consistent estimation requires knowledge about . More recently, [6] show that incomplete knowledge about this quantity results in a biased estimator where the bias is proportional to the uncertainty about . We see that the form of our bound matches these results.
Inequalities are said to hold with high probability (w.h.p.) if the probability of failure is not more than where are positive constants that do not depend on the scaling quantities . The symbol means that we ignore constants that do not depend on these scaling quantities. Proofs are provided in the supplement. Unless otherwise stated, denotes the norm for vectors and the spectral norm for matrices.
Corrupted observation model.
As a baseline, we first investigate the behaviour of the OLS estimator in the corrupted model.
Theorem 1 (A bound on ).
If then w.h.p.
(8) 
where .
Remark 1 (No corruptions case).
Notice for a fixed , taking or for a fixed taking (i.e. there are no corruptions) the above error reduces to the least squares result (see for example [5]).
Remark 2 (Variance and Bias).
The first three terms in (8) scale with so as , these terms tend towards 0. The last term does not depend on and so for some nonzero the least squares estimate will incur some bias depending on the fraction and magnitude of corruptions.
We are now ready to state our theorem characterising the mean squared error of the influence weighted subsampling estimator.
Theorem 2 (Influence sampling in the corrupted model).
For we have
where and is the covariance of the influence weighted subsampled data.
Remark 3.
Theorem 2 states that the influence weighted subsampling estimator removes the proportional dependance of the error on so the additional variance terms scale as and . The relative contribution of the bias term is compared with for the OLS or noninfluencebased subsampling methods.
Comparison with fully corrupted setting.
We note that the bound in Theorem 1 is similar to the bound in [6] for an estimator where all data points are corrupted (i.e. ) and where incomplete knowledge of the covariance matrix of the corruptions, is used. The additional bias in the estimator is proportional to the uncertainty in the estimate of – in Theorem 1 this corresponds to . Unbiased estimation is possible if is known. See the Supplementary Information for further discussion, where the relevant results from [6] are provided in Section SI.6.1 as Lemma 16.
7 Experimental results
We compare IWSLS against the methods SRHTLS [14], ULURU [7]. These competing methods represent current stateoftheart in fast randomized least squares. Since SRHTLS is equivalent to LEVLS [9] the comparison will highlight the difference between importance sampling according to the two difference types of regression diagnostic in the corrupted model. Similar to IWSLS, ULURU is also a twostep procedure where the first is equivalent to SRHTLS. The second reduces bias by subtracting the result of regressing onto the residual. The experiments with the corrupted data model will demonstrate the difference in robustness of IWSLS and ULURU to corruptions in the observations. Note that we do not compare with SGD. Although SGD has excellent properties for largescale linear regression, we are not aware of a convex loss function which is robust to the corruption model we propose.
We assess the empirical performance of our method compared with standard and stateoftheart randomized approaches to linear regression in several difference scenarios. We evaluate these methods on the basis of the estimation error: the norm of the difference between the true weights and the learned weights, . We present additional results for root mean squared prediction error (RMSE) on the test set in §SI.7.
For all the experiments on simulated data sets we use , , . For datasets of this size, computing exact leverage is impractical and so we report on results for IWSLS in §SI.7. For aIWSLS and aRWSLS we used the same number of subsamples to approximate the leverage scores and residuals as for solving the regression. For aIWSLS we set (see Eq. (5)). The results are averaged over 100 runs.
Corrupted data.
We investigate the corrupted data noise model described in Eqs. (1)(2). We show three scenarios where . and were sampled from independent, zeromean Gaussians with standard deviation and respectively. The true regression coefficients, were sampled from a standard Gaussian. We added i.i.d. zeromean Gaussian noise with standard deviation .
Figure 1 shows the difference in distribution of influence and leverage between noncorrupted points (top) and corrupted points (bottom) for a dataset with 30% corrupted points. The distribution of leverage is very similar between the corrupted and noncorrupted points, as quantified by the difference. This suggests that leverage alone cannot be used to identify corrupted points. On the other hand, although there are some corrupted points with small influence, they typically have a much larger influence than noncorrupted points. We give a theoretical explanation of this phenomenon in §SI.3 (remarks 4 and 5).
Figure 2LABEL:sub@fig:c1 and LABEL:sub@fig:c2 shows the estimation error and the mean squared prediction error for different subsample sizes. In this setting, computing IWSLS is impractical (due to the exact leverage computation) so we omit the results but we notice that aIWSLS and aRWSLS quickly improve over the full least squares solution and the other randomized approximations in all simulation settings. In all cases, influence based methods also achieve lowervariance estimates.
For corruptions for a small number of samples ULURU outperforms the other subsampling methods. However, as the number of samples increases, influence based methods start to outperform OLS. Here, ULURU converges quickly to the OLS solution but is not able to overcome the bias introduced by the corrupted datapoints. Results for corruptions are shown in Figs. 5 and 6 and we provide results on smaller corrupted datasets (to show the performance of IWSLS) as well as noncorrupted data simulated according to [13] in §SI.7.
Airline delay dataset
The dataset consists of details of all commercial flights in the USA over 20
years.^{5}^{5}5Dataset along with visualisations available from
http://statcomputing.org/dataexpo/2009/ Selecting the first US Airways flights from January 2000 (corresponding to approximately 1.5 weeks) our goal is to predict the
delay time of the next US Airways flights.
The features in this dataset consist of a binary vector representing origindestination pairs and a real value representing distance ().
The dataset might be expected to violate the usual i.i.d. subGaussian design assumption of standard linear regression since the length of delays are often very different depending on the day. For example, delays may be longer due to public holidays or on weekends. Of course, such regular events could be accounted for in the modelling step, but some unpredictable outliers such as weather delay may also occur. Results are presented in Figure 2LABEL:sub@fig:real1, the RMSE is the error in predicted delay time in minutes. Since the dataset is smaller, we can run IWSLS to observe the accuracy of aIWSLS and aRWSLS in comparison. For more than samples, these algorithm outperform OLS and quickly approach IWSLS. The result suggests that the corrupted observation model is a good model for this dataset. Furthermore, ULURU is unable to achieve the full accuracy of the OLS solution.
8 Conclusions
We have demonstrated theoretically and empirically under the generalised corrupted observation model that influence weighted subsampling is able to significantly reduce both the bias and variance compared with the OLS estimator and other randomized approximations which do not take influence into account. Importantly our fast approximation, aRWSLS performs similarly to IWSLS. We find ULURU quickly converges to the OLS estimate, although it is not able to overcome the bias induced by the corrupted datapoints despite its twostep procedure. The performance of IWSLS relative to OLS in the airline delay problem suggests that the corrupted observation model is a more realistic modelling scenario than the standard subGaussian design model for some tasks.
Software is available at http://people.inf.ethz.ch/kgabriel/software.html.
Acknowledgements.
We thank David Balduzzi for invaluable discussions, suggestions and comments.
References
 [1] Nir Ailon and Edo Liberty. Fast dimension reduction using rademacher series on dual bch codes. In 19th Annual ACMSIAM Symposium on Discrete Algorithms, pages 1–9, 2008.
 [2] David A Belsley, Edwin Kuh, and Roy E Welsch. Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. Wiley, 1981.
 [3] Christos Boutsidis and Alex Gittens. Improved matrix algorithms via the Subsampled Randomized Hadamard Transform. 2012. arXiv:1204.0062v4 [cs.DS].
 [4] Y Chen, C Caramanis, and S Mannor. Robust Sparse Regression under Adversarial Corruption. In International Conference on Machine Learning, 2013.
 [5] Yudong Chen and Constantine Caramanis. Orthogonal Matching Pursuit with Noisy and Missing Data: Low and High Dimensional Results. June 2012. arXiv:1206.0823.
 [6] Yudong Chen and Constantine Caramanis. Noisy and Missing Data Regression: DistributionOblivious Support Recovery. In International Conference on Machine Learning, 2013.
 [7] P Dhillon, Y Lu, D P Foster, and L Ungar. New Subsampling Algorithms for Fast Least Squares Regression. In Advances in Neural Information Processing Systems, 2013.
 [8] Petros Drineas, Malik MagdonIsmail, Michael W Mahoney, and David P Woodruff. Fast approximation of matrix coherence and statistical leverage. September 2011. arXiv:1109.3843v2 [cs.DS].
 [9] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Sampling algorithms for l2 regression and applications. In Proceedings of the Seventeenth Annual ACMSIAM Symposium on Discrete Algorithm, SODA ’06, pages 1127–1136, New York, NY, USA, 2006. ACM.
 [10] Petros Drineas, Michael W Mahoney, S Muthukrishnan, and Tamás Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2011.
 [11] Daniel Hsu, Sham Kakade, and Tong Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab., 17:no. 52, 1–6, 2012.
 [12] PoLing Loh and Martin J Wainwright. Highdimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of Statistics, 40(3):1637–1664, June 2012.
 [13] Ping Ma, Michael W Mahoney, and Bin Yu. A Statistical Perspective on Algorithmic Leveraging. In proceedings of the International Conference on Machine Learning, 2014.
 [14] Michael W Mahoney. Randomized algorithms for matrices and data. April 2011. arXiv:1104.5557v3 [cs.DS].
 [15] Brian McWilliams and Giovanni Montana. Multiview predictive partitioning in high dimensions. Statistical Analysis and Data Mining, 5(4):304–321, 2012.
 [16] Brian McWilliams and Giovanni Montana. Subspace clustering of highdimensional data: a predictive approach. Data Mining and Knowledge Discovery, 28:736â–772, 2014.
 [17] Joel A Tropp. Improved analysis of the subsampled randomized Hadamard transform. November 2010. arXiv:1011.1595v4 [math.NA].
 [18] Roman Vershynin. Introduction to the nonasymptotic analysis of random matrices. November 2010. arXiv:1011.3027.
Supplementary Information for Fast and Robust Least Squares Estimation in Corrupted Linear Models
Here we collect supplementary technical details, discussion and empirical results which support the results presented in the main text.
Appendix SI.1 Software
We have made available a software package available for Python which implements

IWSLS,

aIWSLS and

aRWSLS,
along with the methods we compare against

SRHTLS and

ULURU.
The software is available from: http://bit.ly/1kifYBU
Appendix SI.2 Approximate Influence Weighted Algorithm
Here we present a detailed description of the approximate influence weighted subsampling (aIWSLS) algorithm. Steps 2, 3 and 4 are required for the approximate leverage computation. Step 3 could be replaced with the QR decomposition.
Appendix SI.3 Leverage and Influence
Here we provide detailed derivations of leverage and influence terms as well as the full statement and proofs of finite sample bounds under the subGaussian design and corrupted design models which are abbreviated in the main text as Lemmas 5, 6, 7, and 8.
Here we provide a full derivation of the leaveoneout estimator of which appears in less detail in [2].
Proposition 3 (Derivation of ).
Defining and
Where the first equality comes from a straightforward application of the Sherman Morrison formula.
Here we provide a derivation of the leaveoneout estimator in the corrupted model where the point we removed is corrupted.
Proposition 4 (Derivation of ).
si.3.1 Results for SubGaussian random design
Lemma 5 (Leverage).
The leverage of a noncorrupted point is bounded by
(9) 
where the exact form of the term is given in the supplementary material.
Lemma 6 (Influence).
Defining , the influence of a noncorrupted point is
(10) 
The term is proportional to .
Proof of Lemma 5.
Lemma 5 states
From the Eigendecomposition, . Define such that . We have
Since and are subGaussian random vectors so the above quadratic form is bounded by Lemma 14, setting the parameter . We combine this with the following inequalities
and
which relate the Frobenius norm with the spectral norm. We also make use of the relationship where to obtain
which holds with high probability.
In order for this bound to not be vacuous in our application, it must be smaller than 1. In order to ensure this, we need bound using Lemma 15 and setting to obtain the following which holds with high probability
∎
Proof of Lemma 6.
si.3.2 Results for corrupted observations
Lemma 7 (Leverage of corrupted point).
The leverage of a corrupted point is bounded by
(11) 
Remark 4 (Comparison of leverage).
Comparing this with Eq. (9), when is large, the dominant term is which implies that the difference in leverage between a corrupted and noncorrupted point – particularly when the magnitude of corruptions is not large – is small. This suggests that it may not be possible to distinguish between the corrupted and noncorrupted points by only comparing leverage scores.
Lemma 8 (Influence of corrupted point).
Defining , the influence of a corrupted point is
(12) 
Remark 5 (Comparison of influence).
Here, differs from in Lemma 6 only in its dependence on the leverage of a corrupted instead of noncorrupted point and so for large , . It can be seen that the influence of the corrupted point includes a bias term similar to the one which appears in Eq. (8). This suggests that the relative difference between the influence of a noncorrupted and corrupted point will be larger than the respective relative difference in leverage. All of the information relating to the proportion of corrupted points is contained within .
Proof of Lemma 7.
Proof of Lemma 8.
Lemma 8 states
Appendix SI.4 Estimation error in subGaussian model
Using the definition of influence above, we can state the following theorem characterising the error of the influence weighted subsampling estimator in the subGaussian design setting.
Theorem 9 (Subgaussian design influence weighted subsampling).
Defining for we have
where and is the covariance of the influence weighted subsampled data and .
Remark 6.
Theorem 9 states that in the noncorrupted subGaussian model, the influence weighted subsampling estimator is consistent. Furthermore, if we set the sampling proportion, , the error scales as . Therefore, similar to ULURU there is no dependence on the subsampling proportion.
Appendix SI.5 Proof of main theorems
In this section we provide proofs of our main theorems which describe the properties of the influence weighted subsampling estimator in the subGaussian random design case, the OLS estimator in the corrupted setting and finally our influence weighted subsampling estimator in the corrupted setting.
In order to prove our results we require the following lemma
Lemma 10 (A general bound on from [6]).
Suppose the following strong convexity condition holds: . Then the estimation error satisfies
Where , are estimators for and respectively
To obtain the results for our method in the noncorrupted and corrupted setting we can simply plug in our specific estimates for and .
Proof of Theorem 9.
Through subsampling according to influence, we solve the problem
where . is a subsampling matrix, is a diagonal matrix whose entries are where is a constant which ensures .
(13) 
Setting , , by Lemma 10 the error of the influence weighted subsampling estimator is given by
(14) 
Define . Now, when using Lemma 13 with we have w.h.p. . It follows that
(16) 
∎
Remark 7 (Scaling by ).
In the following, with some abuse of notation we will write as . Now,
Proof of Theorem 1.
Setting , we have