Fast and Robust Least Squares Estimation in Corrupted Linear Models

# Fast and Robust Least Squares Estimation in Corrupted Linear Models

Brian McWilliamsGabriel KrummenacherMario LučićJoachim M. Buhmann
Department of Computer Science
ETH Zürich, Switzerland
{mcbrian,gabriel.krummenacher,mario.lucic,jbuhmann}@inf.ethz.ch
Joint first author
###### Abstract

Subsampling methods have been recently proposed to speed up least squares estimation in large scale settings. However, these algorithms are typically not robust to outliers or corruptions in the observed covariates.

The concept of influence that was developed for regression diagnostics can be used to detect such corrupted observations as shown in this paper. This property of influence – for which we also develop a randomized approximation – motivates our proposed subsampling algorithm for large scale corrupted linear regression which limits the influence of data points since highly influential points contribute most to the residual error. Under a general model of corrupted observations, we show theoretically and empirically on a variety of simulated and real datasets that our algorithm improves over the current state-of-the-art approximation schemes for ordinary least squares.

Fast and Robust Least Squares Estimation in Corrupted Linear Models

Brian McWilliams  Gabriel Krummenacherthanks: Joint first author  Mario Lučić  Joachim M. Buhmann Department of Computer Science ETH Zürich, Switzerland {mcbrian,gabriel.krummenacher,mario.lucic,jbuhmann}@inf.ethz.ch

## 1 Introduction

To improve scalability of the widely used ordinary least squares algorithm, a number of randomized approximation algorithms have recently been proposed. These methods, based on subsampling the dataset, reduce the computational time from to 111Informally: means grows more slowly than . [14]. Most of these algorithms are concerned with the classical fixed design setting or the case where the data is assumed to be sampled i.i.d. typically from a sub-Gaussian distribution [7]. This is known to be an unrealistic modelling assumption since real-world data are rarely well-behaved in the sense of the underlying distributions.

We relax this limiting assumption by considering the setting where with some probability, the observed covariates are corrupted with additive noise. This scenario corresponds to a generalised version of the classical problem of “errors-in-variables” in regression analysis which has recently been considered in the context of sparse estimation [12]. This corrupted observation model poses a more more realistic model of real data which may be subject to many different sources of measurement noise or heterogeneity in the dataset.

A key consideration for sampling is to ensure that the points used for estimation are typical of the full dataset. Typicality requires the sampling distribution to be robust against outliers and corrupted points. In the i.i.d. sub-Gaussian setting, outliers are rare and can often easily be identified by examining the statistical leverage scores of the datapoints.

Crucially, in the corrupted observation setting described in §2, the concept of an outlying point concerns the relationship between the observed predictors and the response. Now, leverage alone cannot detect the presence of corruptions. Consequently, without using additional knowledge about the corrupted points, the OLS estimator (and its subsampled approximations) are biased. This also rules out stochastic gradient descent (SGD) – which is often used for large scale regression – since convex cost functions and regularizers which are typically used for noisy data are not robust with respect to measurement corruptions.

This setting motivates our use of influence – the effective impact of an individual datapoint exerts on the overall estimate – in order to detect and therefore avoid sampling corrupted points. We propose an algorithm which is robust to corrupted observations and exhibits reduced bias compared with other subsampling estimators.

#### Outline and Contributions.

In §2 we introduce our corrupted observation model before reviewing the basic concepts of statistical leverage and influence in §3. In §4 we briefly review two subsampling approaches to approximating least squares based on structured random projections and leverage weighted importance sampling. Based on these ideas we present influence weighted subsampling (IWS-LS), a novel randomized least squares algorithm based on subsampling points with small influence in §5.

In §6 we analyse IWS-LS in the general setting where the observed predictors can be corrupted with additive sub-Gaussian noise. Comparing the IWS-LS estimate with that of OLS and other randomized least squares approaches we show a reduction in both bias and variance. It is important to note that the simultaneous reduction in bias and variance is relative to OLS and randomized approximations which are only unbiased in the non-corrupted setting. Our results rely on novel finite sample characteristics of leverage and influence which we defer to §SI.3. Additionally, in §SI.4 we prove an estimation error bound for IWS-LS in the standard sub-Gaussian model.

Computing influence exactly is not practical in large-scale applications and so we propose two randomized approximation algorithms based on the randomized leverage approximation of [8]. Both of these algorithms run in time which improve scalability in large problems. Finally, in §7 we present extensive experimental evaluation which compares the performance of our algorithms against several randomized least squares methods on a variety of simulated and real datasets.

## 2 Statistical model

In this work we consider a variant of the standard linear model

 y=Xβ+ϵ, (1)

where is a noise term independent of . However, rather than directly observing we instead observe where

 Z=X+UW. (2)

and is a Bernoulli random variable with probability of being 1. is a matrix of measurement corruptions. The rows of therefore are corrupted with probability and not corrupted with probability .

###### Definition 1 (Sub-gaussian matrix).

A zero-mean matrix is called sub-Gaussian with parameter if (a) Each row is sampled independently and has . (b) For any unit vector , is a sub-Gaussian random variable with parameter at most .

We consider the specific instance of the linear corrupted observation model in Eqs. (1), (2) where

• are sub-Gaussian with parameters and respectively,

• is sub-Gaussian with parameters ,

and all are independent of each other.

The key challenge is that even when and the magnitude of the corruptions, are relatively small, the standard linear regression estimate is biased and can perform poorly (see §6). Sampling methods which are not sensitive to corruptions in the observations can perform even worse if they somehow subsample a proportion of corrupted points. Furthermore, the corruptions may not be large enough to be detected via leverage based techniques alone.

The model described in this section generalises the “errors-in-variables” model from classical least squares modelling. Recently, similar models have been studied in the high dimensional () setting in [5, 4, 6, 12] in the context of robust sparse estimation. The “low-dimensional” () setting is investigated in [5], but the “big data” setting () has not been considered so far.222Unlike [6, 12] and others we do not consider sparsity in our solution since .

In the high-dimensional problem, knowledge of the corruption covariance, [12], or the data covariance [6], is required to obtain a consistent estimate. This assumption may be unrealistic in many settings. We aim to reduce the bias in our estimates without requiring knowledge of the true covariance of the data or the corruptions, and instead sub-sample only non-corrupted points.

## 3 Diagnostics for linear regression

In practice, the sub-Gaussian linear model assumption is often violated either by heterogeneous noise or by a corruption model as in §2. In such scenarios, fitting a least squares model to the full dataset is unwise since the outlying or corrupted points can have a large adverse effect on the model fit. Regression diagnostics have been developed in the statistics literature to detect such points (see e.g. [2] for a comprehensive overview). Recently, [14] proposed subsampling points for least squares based on their leverage scores. Other recent works suggest related influence measures that identify subspace [16] and multi-view [15] clusters in high dimensional data.

### 3.1 Statistical leverage

For the standard linear model in Eq. (1), the well known least squares solution is

 ˆβ=argminβ∥y−Xβ∥2=(X⊤X)−1X⊤y. (3)

The projection matrix with specifies the subspace in which the residual lies. The diagonal elements of the “hat matrix” , , are the statistical leverage scores of the sample. Leverage scores quantify to what extent a particular sample is an outlier with respect to the distribution of .

An equivalent definition from [14] which will be useful later concerns any matrix which spans the column space of (for example, the matrix whose columns are the left singular vectors of ). The statistical leverage scores of the rows of are the squared row norms of , i.e. .

Although the use of leverage can be motivated from the least squares solution in Eq. (3), the leverage scores do not take into account the relationship between the predictor variables and the response variable . Therefore, low-leverage points may have a weak predictive relationship with the response and vice-versa. In other words, it is possible for such points to be outliers with respect to the conditional distribution but not the marginal distribution on .

### 3.2 Influence

A concept that captures the predictive relationship between covariates and response is influence. Influential points are those that might not be outliers in the geometric sense, but instead adversely affect the estimated coefficients. One way to assess the influence of a point is to compute the change in the learned model when the point is removed from the estimation step. [2].

We can compute a leave-one-out least squares estimator by straightforward application of the Sherman-Morrison-Woodbury formula (see Prop. 3 in §SI.3):

 ˆβ−i =(X⊤X−x⊤ixi)−1(X⊤y−x⊤iyi)=ˆβ−Σ−1x⊤iei1−li

where . Defining the influence333The expression we use is also called Cook’s distance [2]., as the change in expected mean squared error we have

 di

Points with large values of are those which, if added to the model, have the largest adverse effect on the resulting estimate. Since influence only depends on the OLS residual error and the leverage scores, it can be seen that the influence of every point can be computed at the cost of a least squares fit. In the next section we will see how to approximate both quantities using random projections.

## 4 Fast randomized least squares algorithms

We briefly review two randomized approaches to least squares approximation: the importance weighted subsampling approach of [9] and the dimensionality reduction approach [14]. The former proposes an importance sampling probability distribution according to which, a small number of rows of and are drawn and used to compute the regression coefficients. If the sampling probabilities are proportional to the statistical leverages, the resulting estimator is close to the optimal estimator [9]. We refer to this as LEV-LS.

The dimensionality reduction approach can be viewed as a random projection step followed by a uniform subsampling. The class of Johnson-Lindenstrauss projections – e.g. the SRHT – has been shown to approximately uniformize leverage scores in the projected space. Uniformly subsampling the rows of the projected matrix proves to be equivalent to leverage weighted sampling on the original dataset [14]. We refer to this as SRHT-LS. It is analysed in the statistical setting by [7] who also propose ULURU, a two step fitting procedure which aims to correct for the subsampling bias and consequently converges to the OLS estimate at a rate independent of the number of subsamples [7].

#### Subsampled Randomized Hadamard Transform (SRHT)

The SHRT consists of a preconditioning step after which rows of the new matrix are subsampled uniformly at random in the following way with the definitions [3]:

• is a subsampling matrix.

• is a diagonal matrix whose entries are drawn independently from .

• is a normalized Walsh-Hadamard matrix444For the Hadamard transform, must be a power of two but other transforms exist (e.g. DCT, DFT) for which similar theoretical guarantees hold and there is no restriction on . which is defined recursively as

 Hn=[Hn/2Hn/2Hn/2−Hn/2],  H2=[+1+1+1−1].

We set so it has orthonormal columns.

As a result, the rows of the transformed matrix have approximately uniform leverage scores. (see [17] for detailed analysis of the SRHT). Due to the recursive nature of , the cost of applying the SRHT is operations, where is the number of rows sampled from [1].

The SRHT-LS algorithm solves which for an appropriate subsampling ratio, results in a residual error, which satisfies

 ∥~e∥≤(1+ρ)∥e∥ (4)

where is the vector of OLS residual errors [14].

#### Randomized leverage computation

Recently, a method based on random projections has been proposed to approximate the leverage scores based on first reducing the dimensionality of the data using the SRHT followed by computing the leverage scores using this low-dimensional approximation [8, 13, 10, 9].

The leverage approximation algorithm of [8] uses a SRHT, to first compute the approximate SVD of , Followed by a second SHRT to compute an approximate orthogonal basis for

 R−1=VΠXΣ−1ΠX∈Rp×p,  ~U=XR−1Π2∈Rn×r2. (5)

The approximate leverage scores are now the squared row norms of , From [14] we derive the following result relating to randomized approximation of the leverage

 ~li≤(1+ρl)li, (6)

where the approximation error, depends on the choice of projection dimensions and .

The leverage weighted least squares (LEV-LS) algorithm samples rows of and with probability proportional to (or in the approximate case) and performs least squares on this subsample. The residual error resulting from the leverage weighted least squares is bounded by Eq. (4) implying that LEV-LS and SRHT-LS are equivalent [14]. It is important to note that under the corrupted observation model these approximations will be biased.

## 5 Influence weighted subsampling

In the corrupted observation model, OLS and therefore the random approximations to OLS described in §4 obtain poor predictions. To remedy this, we propose influence weighted subsampling (IWS-LS) which is described in Algorithm 1. IWS-LS subsamples points according to the distribution, where is a normalizing constant so that . OLS is then estimated on the subsampled points. The sampling procedure ensures that points with high influence are selected infrequently and so the resulting estimate is less biased than the full OLS solution.

Obviously, IWS-LS is impractical in the scenarios we consider since it requires the OLS residuals and full leverage scores. However, we use this as a baseline and to simplify the analysis. In the next section, we propose an approximate influence weighted subsampling algorithm which combines the approximate leverage computation of [8] and the randomized least squares approach of [14].

#### Randomized approximation algorithms.

Using the ideas from §4 and §4 we obtain the following randomized approximation to the influence scores

 ~di=~e2i~li(1−~li)2, (7)

where is the residual error computed using the SRHT-LS estimator. Since the approximation errors of and are bounded (inequalities (4) and (6)), this suggests that our randomized approximation to influence is close to the true influence.

#### Basic approximation.

The first approximation algorithm is identical to Algorithm 1 except that leverage and residuals are replaced by their randomized approximations as in Eq. (7). We refer to this algorithm as Approximate influence weighted subsampling (aIWS-LS). Full details are given in Algorithm 3 in §SI.2.

#### Residual Weighted Sampling.

Leverage scores are typically uniform [13, 7] for sub-Gaussian data. Even in the corrupted setting, the difference in leverage scores between corrupted and non-corrupted points is small (see §6). Therefore, the main contribution to the influence for each point will originate from the residual error, . Consequently, we propose sampling with probability inversely proportional to the approximate residual, . The resulting algorithm Residual Weighted Subsampling (aRWS-LS) is detailed in Algorithm 2. Although aRWS-LS is not guaranteed to be a good approximation to IWS-LS, empirical results suggests that it works well in practise and is faster to compute than aIWS-LS.

#### Computational complexity.

Clearly, the computational complexity of IWS-LS is . The computation complexity of aIWS-LS is , where the first term is the cost of SRHT-LS, the second term is the cost of approximate leverage computation and the last term solves OLS on the subsampled dataset. Here, is the dimension of the random projection detailed in Eq. (5). The cost of aRWS-LS is where the first term is the cost of SRHT-LS, the second term is the cost of computing the residuals , and the last term solves OLS on the subsampled dataset. This computation can be reduced to . Therefore the cost of both aIWS-LS and aRWS-LS is .

## 6 Estimation error

In this section we will prove an upper bound on the estimation error of IWS-LS in the corrupted model. First, we show that the OLS error consists of two additional variance terms that depend on the size and proportion of the corruptions and an additional bias term. We then show that IWS-LS can significantly reduce the relative variance and bias in this setting, so that it no longer depends on the magnitude of the corruptions but only on their proportion. We compare these results to recent results from [12, 5] suggesting that consistent estimation requires knowledge about . More recently, [6] show that incomplete knowledge about this quantity results in a biased estimator where the bias is proportional to the uncertainty about . We see that the form of our bound matches these results.

Inequalities are said to hold with high probability (w.h.p.) if the probability of failure is not more than where are positive constants that do not depend on the scaling quantities . The symbol means that we ignore constants that do not depend on these scaling quantities. Proofs are provided in the supplement. Unless otherwise stated, denotes the norm for vectors and the spectral norm for matrices.

#### Corrupted observation model.

As a baseline, we first investigate the behaviour of the OLS estimator in the corrupted model.

###### Theorem 1 (A bound on ∥ˆβOLS−β∥).

If then w.h.p.

 ∥ˆβOLS−β∥≲ ((σϵσx+πσϵσw+π(σ2w+σwσx)∥β∥)√plogpn+πσ2w√p∥β∥)⋅1λ (8)

where .

###### Remark 1 (No corruptions case).

Notice for a fixed , taking or for a fixed taking (i.e. there are no corruptions) the above error reduces to the least squares result (see for example [5]).

###### Remark 2 (Variance and Bias).

The first three terms in (8) scale with so as , these terms tend towards 0. The last term does not depend on and so for some non-zero the least squares estimate will incur some bias depending on the fraction and magnitude of corruptions.

We are now ready to state our theorem characterising the mean squared error of the influence weighted subsampling estimator.

###### Theorem 2 (Influence sampling in the corrupted model).

For we have

 ∥ˆβIWS−β∥≲ ((σϵσx+πσϵ(σw+1)+π∥β∥)√plogpnsubs+π√p∥β∥).1λ

where and is the covariance of the influence weighted subsampled data.

###### Remark 3.

Theorem 2 states that the influence weighted subsampling estimator removes the proportional dependance of the error on so the additional variance terms scale as and . The relative contribution of the bias term is compared with for the OLS or non-influence-based subsampling methods.

#### Comparison with fully corrupted setting.

We note that the bound in Theorem 1 is similar to the bound in [6] for an estimator where all data points are corrupted (i.e. ) and where incomplete knowledge of the covariance matrix of the corruptions, is used. The additional bias in the estimator is proportional to the uncertainty in the estimate of – in Theorem 1 this corresponds to . Unbiased estimation is possible if is known. See the Supplementary Information for further discussion, where the relevant results from [6] are provided in Section SI.6.1 as Lemma 16.

## 7 Experimental results

We compare IWS-LS against the methods SRHT-LS [14], ULURU [7]. These competing methods represent current state-of-the-art in fast randomized least squares. Since SRHT-LS is equivalent to LEV-LS [9] the comparison will highlight the difference between importance sampling according to the two difference types of regression diagnostic in the corrupted model. Similar to IWS-LS, ULURU is also a two-step procedure where the first is equivalent to SRHT-LS. The second reduces bias by subtracting the result of regressing onto the residual. The experiments with the corrupted data model will demonstrate the difference in robustness of IWS-LS and ULURU to corruptions in the observations. Note that we do not compare with SGD. Although SGD has excellent properties for large-scale linear regression, we are not aware of a convex loss function which is robust to the corruption model we propose.

We assess the empirical performance of our method compared with standard and state-of-the-art randomized approaches to linear regression in several difference scenarios. We evaluate these methods on the basis of the estimation error: the norm of the difference between the true weights and the learned weights, . We present additional results for root mean squared prediction error (RMSE) on the test set in §SI.7.

For all the experiments on simulated data sets we use , , . For datasets of this size, computing exact leverage is impractical and so we report on results for IWS-LS in §SI.7. For aIWS-LS and aRWS-LS we used the same number of sub-samples to approximate the leverage scores and residuals as for solving the regression. For aIWS-LS we set (see Eq. (5)). The results are averaged over 100 runs.

#### Corrupted data.

We investigate the corrupted data noise model described in Eqs. (1)-(2). We show three scenarios where . and were sampled from independent, zero-mean Gaussians with standard deviation and respectively. The true regression coefficients, were sampled from a standard Gaussian. We added i.i.d. zero-mean Gaussian noise with standard deviation .

Figure 1 shows the difference in distribution of influence and leverage between non-corrupted points (top) and corrupted points (bottom) for a dataset with 30% corrupted points. The distribution of leverage is very similar between the corrupted and non-corrupted points, as quantified by the difference. This suggests that leverage alone cannot be used to identify corrupted points. On the other hand, although there are some corrupted points with small influence, they typically have a much larger influence than non-corrupted points. We give a theoretical explanation of this phenomenon in §SI.3 (remarks 4 and 5).

Figure 2LABEL:sub@fig:c1 and LABEL:sub@fig:c2 shows the estimation error and the mean squared prediction error for different subsample sizes. In this setting, computing IWS-LS is impractical (due to the exact leverage computation) so we omit the results but we notice that aIWS-LS and aRWS-LS quickly improve over the full least squares solution and the other randomized approximations in all simulation settings. In all cases, influence based methods also achieve lower-variance estimates.

For corruptions for a small number of samples ULURU outperforms the other subsampling methods. However, as the number of samples increases, influence based methods start to outperform OLS. Here, ULURU converges quickly to the OLS solution but is not able to overcome the bias introduced by the corrupted datapoints. Results for corruptions are shown in Figs. 5 and 6 and we provide results on smaller corrupted datasets (to show the performance of IWS-LS) as well as non-corrupted data simulated according to [13] in §SI.7.

#### Airline delay dataset

The dataset consists of details of all commercial flights in the USA over 20 years.555Dataset along with visualisations available from
http://stat-computing.org/dataexpo/2009/
Selecting the first US Airways flights from January 2000 (corresponding to approximately 1.5 weeks) our goal is to predict the delay time of the next US Airways flights. The features in this dataset consist of a binary vector representing origin-destination pairs and a real value representing distance ().

The dataset might be expected to violate the usual i.i.d. sub-Gaussian design assumption of standard linear regression since the length of delays are often very different depending on the day. For example, delays may be longer due to public holidays or on weekends. Of course, such regular events could be accounted for in the modelling step, but some unpredictable outliers such as weather delay may also occur. Results are presented in Figure 2LABEL:sub@fig:real1, the RMSE is the error in predicted delay time in minutes. Since the dataset is smaller, we can run IWS-LS to observe the accuracy of aIWS-LS and aRWS-LS in comparison. For more than samples, these algorithm outperform OLS and quickly approach IWS-LS. The result suggests that the corrupted observation model is a good model for this dataset. Furthermore, ULURU is unable to achieve the full accuracy of the OLS solution.

## 8 Conclusions

We have demonstrated theoretically and empirically under the generalised corrupted observation model that influence weighted subsampling is able to significantly reduce both the bias and variance compared with the OLS estimator and other randomized approximations which do not take influence into account. Importantly our fast approximation, aRWS-LS performs similarly to IWS-LS. We find ULURU quickly converges to the OLS estimate, although it is not able to overcome the bias induced by the corrupted datapoints despite its two-step procedure. The performance of IWS-LS relative to OLS in the airline delay problem suggests that the corrupted observation model is a more realistic modelling scenario than the standard sub-Gaussian design model for some tasks.

#### Acknowledgements.

We thank David Balduzzi for invaluable discussions, suggestions and comments.

## References

• [1] Nir Ailon and Edo Liberty. Fast dimension reduction using rademacher series on dual bch codes. In 19th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1–9, 2008.
• [2] David A Belsley, Edwin Kuh, and Roy E Welsch. Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. Wiley, 1981.
• [3] Christos Boutsidis and Alex Gittens. Improved matrix algorithms via the Subsampled Randomized Hadamard Transform. 2012. arXiv:1204.0062v4 [cs.DS].
• [4] Y Chen, C Caramanis, and S Mannor. Robust Sparse Regression under Adversarial Corruption. In International Conference on Machine Learning, 2013.
• [5] Yudong Chen and Constantine Caramanis. Orthogonal Matching Pursuit with Noisy and Missing Data: Low and High Dimensional Results. June 2012. arXiv:1206.0823.
• [6] Yudong Chen and Constantine Caramanis. Noisy and Missing Data Regression: Distribution-Oblivious Support Recovery. In International Conference on Machine Learning, 2013.
• [7] P Dhillon, Y Lu, D P Foster, and L Ungar. New Subsampling Algorithms for Fast Least Squares Regression. In Advances in Neural Information Processing Systems, 2013.
• [8] Petros Drineas, Malik Magdon-Ismail, Michael W Mahoney, and David P Woodruff. Fast approximation of matrix coherence and statistical leverage. September 2011. arXiv:1109.3843v2 [cs.DS].
• [9] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Sampling algorithms for l2 regression and applications. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm, SODA ’06, pages 1127–1136, New York, NY, USA, 2006. ACM.
• [10] Petros Drineas, Michael W Mahoney, S Muthukrishnan, and Tamás Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2011.
• [11] Daniel Hsu, Sham Kakade, and Tong Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab., 17:no. 52, 1–6, 2012.
• [12] Po-Ling Loh and Martin J Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of Statistics, 40(3):1637–1664, June 2012.
• [13] Ping Ma, Michael W Mahoney, and Bin Yu. A Statistical Perspective on Algorithmic Leveraging. In proceedings of the International Conference on Machine Learning, 2014.
• [14] Michael W Mahoney. Randomized algorithms for matrices and data. April 2011. arXiv:1104.5557v3 [cs.DS].
• [15] Brian McWilliams and Giovanni Montana. Multi-view predictive partitioning in high dimensions. Statistical Analysis and Data Mining, 5(4):304–321, 2012.
• [16] Brian McWilliams and Giovanni Montana. Subspace clustering of high-dimensional data: a predictive approach. Data Mining and Knowledge Discovery, 28:736â–772, 2014.
• [17] Joel A Tropp. Improved analysis of the subsampled randomized Hadamard transform. November 2010. arXiv:1011.1595v4 [math.NA].
• [18] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. November 2010. arXiv:1011.3027.

Supplementary Information for Fast and Robust Least Squares Estimation in Corrupted Linear Models

Here we collect supplementary technical details, discussion and empirical results which support the results presented in the main text.

## Appendix SI.1 Software

We have made available a software package available for Python which implements

• IWS-LS,

• aIWS-LS and

• aRWS-LS,

along with the methods we compare against

• SRHT-LS and

• ULURU.

The software is available from: http://bit.ly/1kifYBU

## Appendix SI.2 Approximate Influence Weighted Algorithm

Here we present a detailed description of the approximate influence weighted subsampling (aIWS-LS) algorithm. Steps 2, 3 and 4 are required for the approximate leverage computation. Step 3 could be replaced with the QR decomposition.

## Appendix SI.3 Leverage and Influence

Here we provide detailed derivations of leverage and influence terms as well as the full statement and proofs of finite sample bounds under the sub-Gaussian design and corrupted design models which are abbreviated in the main text as Lemmas 5, 6, 7, and 8.

Here we provide a full derivation of the leave-one-out estimator of which appears in less detail in [2].

###### Proposition 3 (Derivation of ˆβ−i ).

Defining and

 ˆβ−i =(Σ−x⊤ixi)−1(X⊤y−x⊤iyi) =(Σ−1+Σ−1xix⊤iΣ−11−li)(X⊤y−x⊤iyi) =ˆβ−Σ−1x⊤i(yi+xiΣ−1Xy−xiΣ−1x⊤iyi1−li) =ˆβ−Σ−1x⊤i(yi+^yi−liyi1−li) =ˆβ−Σ−1x⊤i(yi+ei1−li−yi(1−li)1−li) =ˆβ−Σ−1x⊤iei1−li

Where the first equality comes from a straightforward application of the Sherman Morrison formula.

Here we provide a derivation of the leave-one-out estimator in the corrupted model where the point we removed is corrupted.

###### Proposition 4 (Derivation of ˆβ−m).

By proposition 3. Defining

 em=^ym−ym=(xm+wm)ˆβ−ym   and
 lm=(xm+wm)Σ−1(xm+wm)⊤

where , we have that

 ˆβ−m=ˆβ−Σ−1(xm+wm)⊤em1−lm.

### si.3.1 Results for Sub-Gaussian random design

###### Lemma 5 (Leverage).

The leverage of a non-corrupted point is bounded by

 li≤σ2x⋅O((p/√n)2) (9)

where the exact form of the term is given in the supplementary material.

###### Lemma 6 (Influence).

Defining , the influence of a non-corrupted point is

 di≤Ci(σxσϵ+σ2xE). (10)

The term is proportional to .

#### Proof of Lemma 5.

Lemma 5 states

 li≤σ2x⋅(p+2logp+2√plogp√n−C√p−√logp)2.

From the Eigen-decomposition, . Define such that . We have

 li =xiΣ−1x⊤i =∥Ax⊤i∥2

Since and are sub-Gaussian random vectors so the above quadratic form is bounded by Lemma 14, setting the parameter . We combine this with the following inequalities

 √tr(Σ−2)=∥Σ−1∥F≤√p∥Σ−1∥=√pσ1(A)2

and

 tr(Σ−1)=∥A∥2F≤(√p∥A∥)2=pσ1(A)2

which relate the Frobenius norm with the spectral norm. We also make use of the relationship where to obtain

 ∥Ax∥2 ≤σ2xσn(Z)−2(p+2logp+2√plogp)

which holds with high probability.

In order for this bound to not be vacuous in our application, it must be smaller than 1. In order to ensure this, we need bound using Lemma 15 and setting to obtain the following which holds with high probability

 ∥Ax∥2 ≤σ2x(p+2logp+2√plogp√n−C√p−τ)2 ≤σ2x(p+2logp+2√plogp√n−C√p−√logp)2.

#### Proof of Lemma 6.

Defining , Lemma 6 states

 ∥ˆβ−i−ˆβ∥≤ ∥Σ−1∥1−li(σxσϵ+2σ2x∥β−ˆβ∥)√plogp.

Using Proposition 3 we have

 ∥ˆβ−i−ˆβ∥= 11−li∥Σ−1x⊤iei∥ = 11−li∥Σ−1x⊤i(ϵ+xi(β−ˆβ))∥ ≤ 11−li∥Σ−1∥∥x⊤iϵ+x⊤ixi(β−ˆβ)∥ ≤

Using Corollary 12 to bound and (since for these terms and so Lemma 11 does not immediately apply) completes the proof. ∎

### si.3.2 Results for corrupted observations

###### Lemma 7 (Leverage of corrupted point).

The leverage of a corrupted point is bounded by

 lm≤(σ2x+σ2w)⋅O((p/√n)2). (11)
###### Remark 4 (Comparison of leverage).

Comparing this with Eq. (9), when is large, the dominant term is which implies that the difference in leverage between a corrupted and non-corrupted point – particularly when the magnitude of corruptions is not large – is small. This suggests that it may not be possible to distinguish between the corrupted and non-corrupted points by only comparing leverage scores.

###### Lemma 8 (Influence of corrupted point).

Defining , the influence of a corrupted point is

 dm≤ Cm(σxσw+σ2w)∥β∥+(σ2x+2σxσw+σ2w)E +(σx+σw)σϵ. (12)
###### Remark 5 (Comparison of influence).

Here, differs from in Lemma 6 only in its dependence on the leverage of a corrupted instead of non-corrupted point and so for large , . It can be seen that the influence of the corrupted point includes a bias term similar to the one which appears in Eq. (8). This suggests that the relative difference between the influence of a non-corrupted and corrupted point will be larger than the respective relative difference in leverage. All of the information relating to the proportion of corrupted points is contained within .

#### Proof of Lemma 7.

Lemma 7 states

 lm≤(σ2x+σ2w)⋅(p+2logp+2√plogp√n−C√p−√logp)2.

The proof follows from rewriting and following the same steps as the proof of Lemma 5 above. ∎

#### Proof of Lemma 8.

Lemma 8 states

 ∥ˆβ−m−ˆβ∥≤ ∥Σ−1∥1−lm(2(σxσw+σ2w)∥β∥+2(σ2x+σxσw+σ2w)⋅∥β−ˆβ∥+2(σx+σw)σϵ) ⋅√plogp.

From Proposition 4 and following the same argument as Lemma 6 we have

 ∥ˆβ−m−ˆβ∥= 11−lm∥Σ−1(xm+wm)⊤em∥ ≤ 11−lm∥Σ−1(xm+wm)⊤((xm+wm)(β−ˆβ)+wmβ+ϵ)∥ ≤ 11−lm∥Σ−1∥(∥x⊤mwmβ∥+∥w⊤mwmβ∥+∥x⊤mϵ∥+∥w⊤mϵ∥ +∥(x⊤mxm+w⊤mwm+2x⊤mwm)(β−ˆβ)∥).

Applying the triangle inequality followed by Corollary 12 and noting that completes the proof. ∎

## Appendix SI.4 Estimation error in sub-Gaussian model

Using the definition of influence above, we can state the following theorem characterising the error of the influence weighted subsampling estimator in the sub-Gaussian design setting.

Defining for we have

 ∥ˆβIWS−β∥≲ 1λ⋅σϵλmin(Σx)(σϵ+2σxE)⋅√1rn

where and is the covariance of the influence weighted subsampled data and .

###### Remark 6.

Theorem 9 states that in the non-corrupted sub-Gaussian model, the influence weighted subsampling estimator is consistent. Furthermore, if we set the sampling proportion, , the error scales as . Therefore, similar to ULURU there is no dependence on the subsampling proportion.

## Appendix SI.5 Proof of main theorems

In this section we provide proofs of our main theorems which describe the properties of the influence weighted subsampling estimator in the sub-Gaussian random design case, the OLS estimator in the corrupted setting and finally our influence weighted subsampling estimator in the corrupted setting.

In order to prove our results we require the following lemma

###### Lemma 10 (A general bound on ∥ˆβ−β∥ from [6]).

Suppose the following strong convexity condition holds: . Then the estimation error satisfies

 ∥ˆβ−β∥≲1λ∥^γ−ˆΣβ∥.

Where , are estimators for and respectively

To obtain the results for our method in the non-corrupted and corrupted setting we can simply plug in our specific estimates for and .

#### Proof of Theorem 9.

Through subsampling according to influence, we solve the problem

 ˆβIWS=argminβ∥Θy−ΘXβ∥2

where . is a subsampling matrix, is a diagonal matrix whose entries are where is a constant which ensures .

 D2ii∝(∥Σ−1∥1−li(σxσϵ+2σ2x∥β−ˆβ∥)√plogp)−1. (13)

Setting , , by Lemma 10 the error of the influence weighted subsampling estimator is given by

 1λ∥^γ−ˆΣβ∥= ∥(ΘX)⊤(Θy)−(ΘX)⊤(ΘX)β∥ = 1λ∥(ΘX)⊤(Θϵ)+(ΘX)⊤(ΘX)β−(ΘX)⊤(ΘX)β∥ = 1λ∥(ΘX)⊤(Θϵ)∥ (14)

Now, by Lemma 11 we have

 ∥X⊤ϵ∥≤σxσϵ√plogpn

and so defining ,

 ∥(ΘX)⊤Θϵ∥≤ ∥1rnrn∑i=1pi∥⋅∥(SX)⊤Sϵ∥ ≤ ∥1rnrn∑i=1(1−li)∥σxσϵ√plogp/rn∥Σ−1∥(σxσϵ+2σ2xE)√plogp ≤ σxσϵ√plogp/rn∥Σ−1∥(σxσϵ+2σ2xE)√plogp ≤ σϵ√1/rnλmin(Σx)(σϵ+2σxE) (15)

where the third inequaltiy uses the fact that .

Define . Now, when using Lemma 13 with we have w.h.p. . It follows that

 λmin((ΘX)⊤(ΘX)) =inf∥v∥=1v⊤(ΣΘx+(ΘX)⊤(ΘX)−ΣΘx))v ≥12λmin(ΣΘx). (16)

Using (15) and (15) in Eq. (14) completes the proof.

###### Remark 7 (Scaling by π).

In the following, with some abuse of notation we will write as . Now,

 ∥W∥ :=∥UW∥ ≤π∥W∥.

#### Proof of Theorem 1.

Setting , we have

 ∥^γ−ˆΣβ∥ =∥(X+W)⊤y−(X+W)⊤(X+W)β∥ =∥X⊤(Xβ+ϵ)+W⊤(Xβ+ϵ)−X⊤Xβ−W⊤Wβ−X⊤Wβ−W⊤Xβ∥ =∥X⊤ϵ+W⊤ϵ−X⊤Wβ−W⊤Wβ∥ ≤∥X⊤ϵ∥+∥W⊤ϵ∥+∥X⊤Wβ∥+∥W⊤Wβ∥.

From Lemma 11 and Remark 7 we have w.h.p.

 ∥X⊤ϵ∥ ≤σxσϵ√plogpn (17) ∥W⊤ϵ∥ ≤πσwσϵ√plogpn (18) ∥X⊤Wβ∥ ≤πσxσw∥β∥√plogpn (19) ∥W⊤Wβ∥ ≤∥(W⊤W−