Faster Anomaly Detection via Matrix Sketching

Parikshit Gopalan
VMware Research
pgopalan@vmware.com
   Vatsal Sharan
Stanford University
vsharan@stanford.edu
   Udi Wieder
VMware Research
uwieder@vmware.com
Abstract

We present efficient streaming algorithms to compute two commonly used anomaly measures: the rank-$k$ leverage scores (aka Mahalanobis distance) and the rank-$k$ projection distance, in the row-streaming model. We show that commonly used matrix sketching techniques such as the Frequent Directions sketch and random projections can be used to approximate these measures. Our main technical contribution is to prove matrix perturbation inequalities for operators arising in the computation of these measures.

1 Introduction

The goal of this paper is to provide efficient algorithms for anomaly detection for extremely high dimensional data. Anomaly detection in high-dimensional numeric data is a ubiquitous problem in machine learning (Aggarwal, 2013; Chandola et al., 2009). A typical scenario, which is the main motivation for this work, is where we have a constant stream of measurements (say various parameters regarding the health of machines in a data-center), and our goal is to detect any unusual behavior. In such settings, labels are usually hard to come by and unsupervised learning methods are preferred: the algorithm needs to learn what the bulk of the data look like and then detect any deviations from this.

Any algorithm to detect anomalies in this setting faces computational challenges. First, the data may be very large, both in the number of data points and in their dimensionality: in a data center there could be a large number of measurements per machine and thousands of machines, and in the context of IoT the number of collecting devices may explode. Second, the distributed nature of this setting suggests that storing the entire data set, let alone transmitting it in its entirety, may be infeasible.

There is no single definition of an anomaly. For instance, a data point which is unusual in magnitude should clearly be considered anomalous. A more subtle notion of anomaly stems from changes in the correlation across dimensions. Consider for instance a rise in the CPU usage of a single machine. That by itself may not constitute an anomaly; however, when a large fraction of machines in the data center exhibit this behavior simultaneously, an alarm should be issued. Another subtle scenario is when, say, a group of servers that are typically correlated suddenly demonstrate uncorrelated changes in load. Detecting these types of behaviors requires a more sophisticated way of assigning an anomaly score to a data point. We focus on two commonly used measures.

The Mahalanobis distance (Mahalanobis, 1936) is a well-known and heavily used way to measure this type of anomaly (see Wilks (1963) and Hawkins (1974) for classic uses of the Mahalanobis distance in statistical outlier detection). It relies on the singular value decomposition (SVD) of the data matrix, whose rows are data points and whose columns are types of measurements.

To define it formally, we need some notation. Given a matrix $A \in \mathbb{R}^{n \times d}$, we let $a^{(i)}$ denote its $i$-th row and $a_{(j)}$ denote its $j$-th column. We consider all vectors as column vectors (including the rows $a^{(i)}$).

Definition 1.

Let $A \in \mathbb{R}^{n \times d}$. Let $A = U \Sigma V^T$ be its SVD, where $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_d)$ with $\sigma_j \geq \sigma_{j+1}$ for all $j < d$. The Mahalanobis distance of the $i$-th row is denoted $L(i)$ and is defined as

$$L(i) = \sum_{j=1}^{d} \frac{\langle a^{(i)}, v^{(j)} \rangle^2}{\sigma_j^2},$$

where $a^{(i)}$ and $v^{(j)}$ are the $i$-th row of $A$ and the $j$-th column of $V$ respectively. Equivalently, $L(i) = \|u^{(i)}\|_2^2$, where $u^{(i)}$ is the $i$-th row of $U$.

$L(i)$ is also known as the leverage score (Drineas et al., 2012; Mahoney, 2011). $L(i)$ depends on the entire spectrum of singular values and is clearly very sensitive to the smaller singular values. Many high-dimensional data sets arising in practice are close to being low-dimensional, and the low part of the spectrum might come from noise. In this case, the high sensitivity to small singular values is undesirable. A common approach to circumvent this is to restrict the above sum to the $k$ largest singular values (for some appropriately chosen $k$) (Aggarwal, 2013; Holgersson and Karlsson, 2012). This measure is called the rank-$k$ leverage score, and is denoted by $L^k(i)$.

The rank-$k$ leverage score is concerned with the mass that lies within the principal subspace. The principal subspace represents what is “normal”, and so we also need to catch the anomalies that are far from it. To that end, a second measure of anomaly is the rank-$k$ projection distance of a point, which is simply the distance of the data point to the rank-$k$ principal subspace; we denote it by $T^k(i)$. It is defined as

$$T^k(i) = \sum_{j=k+1}^{d} \langle a^{(i)}, v^{(j)} \rangle^2.$$

This measure has been studied extensively and shown to be effective in practice; see Huang and Kasiviswanathan (2015) and references therein. It presents a more stable alternative to taking the tail of the Mahalanobis distance, $\sum_{j=k+1}^{d} \langle a^{(i)}, v^{(j)} \rangle^2 / \sigma_j^2$.
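To make these definitions concrete, the following minimal numpy sketch (ours, not the paper's code) computes the batch scores $L^k(i)$ and $T^k(i)$ exactly from a full SVD; it is the baseline that the streaming algorithms below approximate.

```python
import numpy as np

def batch_scores(A, k):
    """Exact rank-k leverage scores L^k(i) and projection distances T^k(i)
    for every row of A, computed from a thin SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt
    coords = A @ Vt[:k].T                              # <a^(i), v^(j)> for j = 1..k
    Lk = np.sum((coords / s[:k]) ** 2, axis=1)         # rank-k leverage scores
    Tk = np.sum(A ** 2, axis=1) - np.sum(coords ** 2, axis=1)  # squared distance to top-k subspace
    return Lk, Tk
```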

These two measures, the rank-$k$ leverage scores and projection distances, are the focus of our work. We should note that many other measures are studied in the literature, such as the regularized (or soft) Mahalanobis distance (Aggarwal, 2013), also known as ridge leverage (Holgersson and Karlsson, 2012; Alaoui and Mahoney, 2015; Cohen et al., 2017). We discuss ridge leverage scores in greater detail in Section B, and show that in some settings the ridge leverage score can be viewed as a combination of $L^k$ and $T^k$.

The separation assumption

Intuitively, once $k$ is chosen it should determine the subspace which approximates the data. However, if there is degeneracy in the spectrum of the matrix, namely $\sigma_k = \sigma_{k+1}$, then the $k$-dimensional principal subspace is not unique, and the quantities $L^k$ and $T^k$ are not well defined, since their values depend on the choice of principal subspace. This suggests that we are using the wrong value of $k$: the choice of $k$ ought to be such that the directions orthogonal to the principal subspace have markedly less variance than those within it. Hence we require that $k$ is such that there is a gap in the spectrum at $\sigma_k$, which ensures that the principal subspace is well defined.

Definition 2.

A matrix $A$ is $(k, \Delta)$-separated if

$$\sigma_k^2 - \sigma_{k+1}^2 \;\geq\; \Delta\, \sigma_1^2. \qquad (1)$$

All our results assume that the data matrix is $(k, \Delta)$-separated for some $\Delta > 0$; this assumption manifests itself as an inverse polynomial dependence on $\Delta$ in our bounds. We stress that this assumption is very reasonable for real-life data sets.

The low stable rank assumption

We denote by $\|A\|_F$ the Frobenius norm of the matrix, and by $\|A\|$ the operator norm (which is equal to the largest singular value $\sigma_1$). Define the stable rank $sr(A) = \|A\|_F^2 / \|A\|^2$. While it is not strictly needed, the reader should have in mind the scenario where the top-$k$ principal subspace captures a constant fraction (say at least half) of the total variance in the data. This is the setting where the measures $L^k$ and $T^k$ are meaningful. Note that this assumption implies that the stable rank is $O(k)$, since $\sum_{j \leq k} \sigma_j^2 \leq k\sigma_1^2$ is then a constant fraction of the total squared Frobenius norm.
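As a quick illustration (ours, purely for concreteness), the stable rank can be computed directly from the singular values:

```python
import numpy as np

def stable_rank(A):
    """Stable rank sr(A) = ||A||_F^2 / ||A||_2^2; always at most rank(A)."""
    s = np.linalg.svd(A, compute_uv=False)
    return np.sum(s ** 2) / s[0] ** 2
```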

Online vs batch anomaly detection

In our definition of leverage scores, we have assumed that the entire matrix $A$ is given to us as input. We refer to this as the batch scenario. In practice, this might not always be a reasonable assumption. Consider again the motivating example where each machine in a data center produces streams of measurements. It is desirable to determine the anomaly score of a data point online as it arrives. The anomaly measure should be calculated with respect to the data produced so far, and by a streaming algorithm. We refer to this as the online scenario. Scores computed in this setting could be very different from the batch setting. In this work, we consider both kinds of measures. Note that the measures $L^k$ and $T^k$ defined in the introduction are batch measures, since they assume we know the entire matrix $A$. The online measures $L^k_{ol}(i)$ and $T^k_{ol}(i)$ are defined analogously to $L^k(i)$ and $T^k(i)$, except that rather than using the entire matrix $A$, we use only the rows seen so far for the SVD.

For both the online and the batch setting, we will work in the row-streaming model, where the rows arrive one at a time. It is easy to see that there is a trivial $O(d^2)$-memory, one-pass algorithm in the online setting: we maintain the $d \times d$ covariance matrix of the rows seen so far, and use its SVD to compute $L^k_{ol}$ and $T^k_{ol}$.
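A minimal sketch of this trivial baseline (ours; variable names and the small eigenvalue floor are our choices): maintain the running covariance and score each arriving row against the top-$k$ eigenvectors of the data seen before it.

```python
import numpy as np

def online_scores_exact(rows, k, eps=1e-12):
    """Trivial O(d^2)-memory, one-pass online baseline.

    Yields (leverage, projection_distance) for every row after the first,
    scored against the covariance of the previously seen rows.
    """
    it = iter(rows)
    a0 = np.asarray(next(it), dtype=float)
    C = np.outer(a0, a0)                      # running covariance A^T A
    for a in it:
        a = np.asarray(a, dtype=float)
        w, V = np.linalg.eigh(C)              # eigenvalues in ascending order
        w, V = w[::-1], V[:, ::-1]            # descending: w[j] ~ sigma_j^2
        coords = V[:, :k].T @ a               # coordinates in top-k eigenbasis
        Lk = np.sum(coords ** 2 / np.maximum(w[:k], eps))  # online leverage score
        Tk = a @ a - coords @ coords          # squared distance to top-k subspace
        yield Lk, Tk
        C += np.outer(a, a)                   # include the new row afterwards
```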

Computing the batch scores in the streaming model requires care, since even if the rows are seen in streaming order, when row $a^{(i)}$ arrives, we cannot compute its leverage score without seeing the rest of the input. Indeed, one-pass algorithms are not possible (unless they output the entire matrix of scores at the end of the pass, which clearly requires a lot of memory). On the other hand, there is a simple two-pass, $O(d^2)$-memory algorithm: compute the covariance matrix $A^TA$ in one pass, compute its SVD, and use this to compute $L^k(i)$ and $T^k(i)$ in a second pass. So we could hope for better two-pass algorithms.

1.1 Our Results

Our main results say that given $k$ and a $(k, \Delta)$-separated matrix $A$, any sketch $\tilde{A}$ of the matrix satisfying

$$\|A^T A - \tilde{A}^T \tilde{A}\| \;\leq\; \epsilon\, \sigma_1^2 \qquad (2)$$

or

$$\|A A^T - \tilde{A} \tilde{A}^T\| \;\leq\; \epsilon\, \sigma_1^2 \qquad (3)$$

can be used to efficiently approximate the rank-$k$ leverage scores and the projection distance from the principal $k$-dimensional subspace. We show how to use such sketches to design efficient algorithms for finding anomalies in a streaming fashion using small space and with fast running time, for both the online and offline setting. The actual guarantees (and the proofs) for the two cases are different and incomparable. This is to be expected, as the approximations we start with are very different in the two cases: Equation (2) can be viewed as an approximation to the covariance matrix of the row vectors, whereas Equation (3) gives an approximation to the covariance matrix of the column vectors. Hence the corresponding sketches can be viewed as preserving the row space and column space of $A$ respectively; we will refer to them as row space and column space approximations.

Pointwise guarantees from row space approximations

Sketches which satisfy Equation (2) can be computed in the row-streaming model using random projections of the columns, or deterministically by using the Frequent Directions algorithm (Ghashami et al., 2016). Our streaming algorithm in this case is very simple: we compute the top-$k$ right singular values and vectors of the sketch $\tilde{A}$, then use these estimates (instead of the true top-$k$ right singular values and vectors of $A$) to compute the leverage score of every row of $A$. The algorithm applies to both the online and batch settings.

We state our result for the online setting here; see Section 3.1 for the precise statements. Define the condition number $\kappa = \sigma_1/\sigma_k$. Recall that for outlier detection using the principal $k$-subspace to be meaningful, the principal $k$-subspace should explain a large fraction of the variance of the data, and hence the stable rank is small, usually $O(k)$.

Theorem 1.

Assume that $A$ is $(k, \Delta)$-separated. There exists $\ell$ (independent of $n$), such that there is a one-pass streaming algorithm that computes a sketch $\tilde{A} \in \mathbb{R}^{\ell \times d}$ and uses it to produce estimates $\tilde{L}^k(i)$ and $\tilde{T}^k(i)$ where

The algorithm uses memory $O(\ell d)$ and has running time $O(\ell d)$ per row.

The key point is that while $\ell$ depends on $k$, $\Delta$, $\epsilon$ and other parameters, it is independent of $n$. In the setting where all these parameters are constants independent of $n$ and $d$, our memory requirement is $O(d)$, improving on the trivial $O(d^2)$ bound.

Our approximations are additive rather than multiplicative. But for outlier detection, the candidate outliers are ones where $L^k(i)$ or $T^k(i)$ are large, and in this regime, we argue below that our additive bounds also translate to good multiplicative approximations. The additive error in computing $L^k(i)$ is small compared to $k/n$ when all the rows have roughly equal norm. Note that the average rank-$k$ leverage score of the rows of any matrix with $n$ rows is $k/n$, hence a reasonable threshold for regarding a point as an outlier is that $L^k(i)$ exceeds $k/n$ by a constant factor. The guarantee for $\tilde{L}^k$ in Theorem 1 is therefore a meaningful guarantee which preserves outlier scores up to a small multiplicative error for candidate outliers, and ensures that points which were not outliers before are not mistakenly classified as outliers.
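The $k/n$ figure follows from a one-line calculation: since $\langle a^{(i)}, v^{(j)} \rangle = \sigma_j U_{ij}$, the rank-$k$ leverage score equals $L^k(i) = \sum_{j \leq k} U_{ij}^2$, and because the columns of $U$ are unit vectors,

$$\sum_{i=1}^{n} L^k(i) \;=\; \sum_{j=1}^{k} \sum_{i=1}^{n} U_{ij}^2 \;=\; k,$$

so the average over the $n$ rows is exactly $k/n$.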

For $\tilde{T}^k$, the additive error for row $i$ is proportional to $\|a^{(i)}\|^2$. Again, for points that are outliers, $T^k(i)$ is a constant fraction of $\|a^{(i)}\|^2$, so this guarantee is meaningful.

Can one reduce the memory requirement further, to $o(d)$? In Section 5, we show that substantial savings are unlikely for any algorithm with strong pointwise guarantees: there is an $\Omega(d)$ lower bound for any approximation that can distinguish a zero outlier score from a large one up to any constant factor, when all the rows have equal length. This lower bound clearly applies to any algorithm with guarantees as in Theorem 1.

Our next result shows that by settling for a weaker, average-case guarantee, we can indeed reduce the memory required below $d$.

Average-case guarantees from column space approximations

We now state our results for sketches that give column space approximations. Our guarantees in this case hold in the batch setting, and hence they need two passes over the data.

A low-dimensional projection $\tilde{A} = AR$ by a random matrix $R \in \mathbb{R}^{d \times m}$ (e.g., each entry of $R$ could be a scaled i.i.d. uniform random variable) satisfies Equation (3). But the usual bounds for random projections rely on properties such as the Johnson-Lindenstrauss property and subspace embeddings, which require the projection dimension to depend on the dimensions $n$ and $d$ of the input matrix $A$. However, we show, using powerful results on covariance estimation (Koltchinskii and Lounici, 2017), that random projections can satisfy Equation (3) with a projection dimension depending on the stable rank of $A$ instead of on its dimensions, which is significantly better since the stable rank is often small for real data.

However, there is still a problem: the covariance matrix of the columns, $AA^T$, is an $n \times n$ matrix (and so is $\tilde{A}\tilde{A}^T$), and typically $n \gg d$. Our solution is to compute $\tilde{A}^T\tilde{A} = R^T A^T A R$ instead, which is an $m \times m$ matrix; the larger matrix is only used in the analysis. Our streaming algorithm works as follows (a code sketch is given after the list):

  • In the first pass, we project the rows onto a random $m$-dimensional subspace using a random matrix $R \in \mathbb{R}^{d \times m}$, and maintain the covariance matrix $\tilde{A}^T\tilde{A} = R^T A^T A R$ of the projected rows.

  • At the end of the first pass, we compute the SVD of $\tilde{A}^T\tilde{A}$.

  • We use this SVD and $R$ to compute approximations $\tilde{L}^k(i)$ and $\tilde{T}^k(i)$ in the second pass.
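A minimal sketch of this two-pass procedure (ours; we use Gaussian entries for $R$ for concreteness, and the second-pass plug-in formulas below are our reading of the natural estimators, whose precise form is given in Section 4):

```python
import numpy as np

def two_pass_scores(A, k, m, rng=np.random.default_rng(0)):
    """Two-pass average-case algorithm (sketch): random projection of the rows,
    then plug-in estimates of L^k(i) and T^k(i) from the small m x m covariance
    of the projected data."""
    n, d = A.shape
    R = rng.standard_normal((d, m)) / np.sqrt(m)   # random projection, E[R R^T] = I_d

    # First pass: maintain the m x m covariance of the projected rows.
    C = np.zeros((m, m))
    for a in A:                                    # one row at a time
        y = a @ R
        C += np.outer(y, y)                        # C = R^T A^T A R

    # Eigenvalues of C are squared singular values of the sketch A R.
    lam, Z = np.linalg.eigh(C)
    lam, Z = lam[::-1], Z[:, ::-1]
    s = np.sqrt(np.maximum(lam[:k], 1e-12))        # top-k singular values of A R

    # Second pass: per-row plug-in estimates.
    L_est, T_est = np.zeros(n), np.zeros(n)
    for i, a in enumerate(A):
        y = (a @ R) @ Z[:, :k]                     # ~ sigma_j * U_ij for j <= k
        L_est[i] = np.sum((y / s) ** 2)            # estimated rank-k leverage
        T_est[i] = a @ a - y @ y                   # estimated projection distance
    return L_est, T_est
```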

Theorem 2.

For $\epsilon$ sufficiently small, there exists $m$ (polynomial in $k$, $1/\epsilon$, $1/\Delta$ and the other parameters, but independent of $n$ and $d$) such that the algorithm above produces estimates $\tilde{L}^k(i)$ and $\tilde{T}^k(i)$ in the second pass, such that with high probability,

The algorithm uses work space $O(m^2)$, needs to store the $d \times m$ random matrix $R$ (or the random bits generating it), and has running time $O(dm)$ per row.

This gives an average-case guarantee. For $T^k$, this guarantee is good when $\sum_{j \leq k} \sigma_j^2$ is a constant fraction of $\|A\|_F^2$, in other words when the top-$k$ principal subspace captures a constant fraction of the variance of the data, which is the natural setting of parameters for our problem. We note that Theorem 2 shows a new property of random projections: that on average they can preserve leverage scores and distances from the principal subspace, with the projection dimension being independent of both $n$ and $d$ (we have not aimed to optimize the polynomial dependence on the other parameters).

The space required in total for this algorithm consists of the work space and (non read-once) access to the random bits specifying $R$. The lower bound in Section 5 applies even to the working space alone, where the algorithm has access to a separate random string that is not counted against its storage. So already, we have managed to beat that lower bound by relaxing the guarantee.

However, the natural streaming implementation will store the matrix $R$, which means the total space used is $O(dm)$. If there were a pseudorandom distribution of matrices that were easier to store, but which still gave comparable guarantees, then the space used would be much smaller, possibly even growing only logarithmically with $d$. So the relaxed guarantees could enable significant savings in total space usage. We pose the question of finding a pseudorandom family of projection matrices that match the covariance estimation bounds of Koltchinskii and Lounici (2017) as an intriguing derandomization problem.

Our result is also meaningful in the IoT regime where the rows correspond to data collected by different sensors, and the communication cost of transmitting the data from the sensors to a central server is a bottleneck. In such a scenario, the server first samples $R$ and sends it to each sensor. This is a one-time setup. Subsequently, the sensors compute the covariance matrices of their projected data and send these back to the server. The server would reply with (the SVD of) the aggregate covariance, which the sensors could then use to compute the scores locally.
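A schematic of this protocol (ours, purely illustrative; the function names and message format are made up, and the local scoring mirrors the plug-in estimates sketched above):

```python
import numpy as np

def sensor_summary(A_local, R):
    """Each sensor projects its own rows and returns a small m x m covariance."""
    Y = A_local @ R
    return Y.T @ Y                       # m x m summary sent to the server

def server_aggregate(summaries):
    """Server sums the m x m summaries and returns the SVD of the aggregate."""
    C = sum(summaries)                   # equals R^T A^T A R for the stacked data
    lam, Z = np.linalg.eigh(C)
    return lam[::-1], Z[:, ::-1]         # eigenvalues (descending) and eigenvectors

def sensor_scores(A_local, R, lam, Z, k, eps=1e-12):
    """Each sensor scores its own rows locally using the broadcast SVD."""
    s = np.sqrt(np.maximum(lam[:k], eps))
    Y = (A_local @ R) @ Z[:, :k]
    L = np.sum((Y / s) ** 2, axis=1)                           # estimated leverage scores
    T = np.sum(A_local ** 2, axis=1) - np.sum(Y ** 2, axis=1)  # estimated distances
    return L, T
```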

1.2 Technical Contributions

Our main technical contribution is to prove perturbation bounds for various matrix operators that arise in the computation of outlier scores. Perturbation bounds describe the effect of adding a small noise matrix to a base matrix on various quantities and operators associated with the matrix. The basic results here include Weyl's inequality (c.f. Horn and Johnson (1994), Theorem 3.3.16) and Wedin's theorem (Wedin, 1972), which give such bounds for eigenvalues and eigenvectors respectively. We use these results to derive perturbation bounds on the projection operators that are used to compute outlier scores; these typically involve projecting onto the top-$k$ principal subspace and rescaling each co-ordinate by some function of the corresponding singular value.

We derive our results by combining these perturbation bounds with powerful results on matrix sketching, such as the covariance estimation results of Koltchinskii and Lounici (2017) and the Frequent Directions sketch (Liberty, 2013; Ghashami et al., 2016). The sketches can be viewed as computing a noisy version of the true covariance matrices, hence our perturbation results can be applied.

Outline

We state and prove our matrix perturbation bounds in Section 2. We use these bounds to obtain pointwise approximations for outlier scores in Section 3, and our average-case approximations in Section 4. We state our lower bound in Section 5. Missing proofs are deferred to the Appendix.

2 Matrix Perturbation Bounds

In this section, we establish perturbation bounds for various operators needed for computing outlier scores. We first set up some notation and state some results we need.

2.1 Preliminaries

We work with the following setup throughout this section. Let $A \in \mathbb{R}^{n \times d}$. Assume that $A$ is $(k, \Delta)$-separated as in Definition 2. We use $sr(A)$ to denote the stable rank of $A$, and $\kappa = \sigma_1/\sigma_k$ for the condition number of $A$.

Let $\tilde{A}$ be a sketch/noisy version of $A$ satisfying

$$\|A A^T - \tilde{A} \tilde{A}^T\| \;\leq\; \epsilon\, \sigma_1^2, \qquad (4)$$

and let $\tilde{A} = \tilde{U}\tilde{\Sigma}\tilde{V}^T$ denote its SVD. While we did not assume that $\tilde{A}$ is $(k, \Delta)$-separated, this will follow from Weyl's inequality for $\epsilon$ sufficiently small compared to $\Delta$. It helps to think of $\Delta$ as a property of the input $A$, and of $\epsilon$ as an accuracy parameter that we control.

In this section we prove perturbation bounds for the following three operators derived from $\tilde{A}$, showing their closeness to the corresponding operators derived from $A$:

  1. The operator that projects onto the principal $k$-dimensional column subspace (Lemma 3).

  2. The operator that projects onto the principal $k$-dimensional column subspace and rescales the $j$-th co-ordinate by $1/\sigma_j$ (Theorem 3).

  3. The operator that projects onto the principal $k$-dimensional column subspace and rescales the $j$-th co-ordinate by $\sigma_j$ (Theorem 5).

To do so, we will extensively use two classical results about matrix perturbation: Weyl’s inequality (c.f. Horn and Johnson (1994) Theorem 3.3.16) and Wedin’s theorem (Wedin, 1972), which respectively quantify how the eigenvalues and eigenvectors of a matrix change under perturbations.

Lemma 1.

(Weyl’s inequality) Let . Then for all ,

Wedin’s theorem requires a sufficiently large separation in the eigenvalue spectrum for the bound to be meaningful.

Lemma 2.

(Wedin’s theorem) Let . Let and respectively denote the projection matrix onto the space spanned by the top singular vectors of and . Then,

2.2 Matrix Perturbation Bounds

We now use these results to derive the perturbation bounds enumerated above. The first bound is a direct consequence of Wedin’s theorem.

Lemma 3.

If is -separated, satisfies (4) with , then

Proof.

Let and . Since we have , () is the projection operator onto the column space of (and similarly for and ). Since and for any orthogonal projection matrix, we can write,

(5)

Since is -separated,

(6)

So applying Wedin’s theorem to ,

(7)

We next show that the spectrum of also has a gap at . Using Weyl’s inequality,

Hence using Equation (6)

So we apply Wedin’s theorem to to get

(8)

Plugging (7) and (8) into (5),

where the last inequality holds because we assumed $\epsilon$ is sufficiently small compared to $\Delta$. The claim follows by taking square roots on both sides. ∎

We now move on to proving bounds for items (2) and (3), which are given by Theorems 3 and 5 respectively.


Theorem 3.

The full proof of the theorem is quite technical and is deferred to the Appendix. Here we prove a special case that captures the main idea. The simplifying assumption we make is that the values of the diagonal matrix in the operator are distinct and well separated. In the full proof we decompose the diagonal matrix into a well-separated component and a small residual component.

Definition 4.

A diagonal matrix is monotone and well separated if

  • its diagonal entries are sorted in non-increasing order, and

  • the entries can be partitioned into buckets so that all entries in the same bucket are equal and entries across buckets are well separated: if two indices fall in the same bucket then the corresponding entries are equal, while if they fall in different buckets the entries differ by at least the separation parameter.

The idea is to decompose the diagonal matrix into a sum of a monotone, well-separated component and a component of small norm. The next two lemmas handle these two components. We first state a simple lemma which handles the component of small norm.

Lemma 4.

For any diagonal matrix ,

Proof.

By the triangle inequality, . The bound follows as and are orthonormal matrices. ∎

We next state the result for the case where is monotone and well separated. In order to prove this result, we need the following direct corollary of Lemma 3:

Lemma 5.

For all ,

(9)

Using this, we now proceed as follows.

Lemma 6.

Let . Let be a monotone and well separated diagonal matrix. Then

Proof.

We denote by the largest index that falls in bucket . Let us set for convenience. Since

we can write,

and similarly for . So we can write

Therefore by using Lemma 5,

where the second inequality is by the triangle inequality and by applying Lemma 2.2. Thus proving that would imply the claim. Indeed

Note that though Lemma 6 assumes that the eigenvalues in each bucket are equal, the key step where we apply Wedin's theorem (Lemma 3) only uses the fact that there is some separation between the eigenvalues in different buckets. Lemma 10 in the appendix carries out this generalization by relaxing the assumption that the eigenvalues in each bucket are equal. The final proof of Theorem 3 works by splitting the eigenvalues into buckets or intervals such that all the eigenvalues in the same interval have small separation, while eigenvalues in different intervals have large separation. We then use Lemma 10 and Lemma 4, along with the triangle inequality, to bound the perturbation due to the well-separated part and the residual part respectively.

The bound corresponding to item (3) is given in the following theorem, which involves the condition number $\kappa = \sigma_1/\sigma_k$:

Theorem 5.

The proof uses ideas similar to those of Theorem 3 and is deferred to the Appendix.

3 Pointwise guarantees for Anomaly Detection

3.1 The Online Setting

We consider the scenario where the rows arrive in streaming order. Let $A_{i-1}$ denote the matrix of points that have arrived so far (excluding the current point $a^{(i)}$) and let $A_{i-1} = U\Sigma V^T$ be its SVD. Write $a^{(i)}$ in the basis of right singular vectors as $a^{(i)} = \sum_{j} \langle a^{(i)}, v^{(j)} \rangle\, v^{(j)}$. We define its online rank-$k$ leverage score and projection distance respectively as

$$L^k_{ol}(i) = \sum_{j=1}^{k} \frac{\langle a^{(i)}, v^{(j)} \rangle^2}{\sigma_j^2}, \qquad (10)$$
$$T^k_{ol}(i) = \|a^{(i)}\|^2 - \sum_{j=1}^{k} \langle a^{(i)}, v^{(j)} \rangle^2, \qquad (11)$$

where the $\sigma_j$ and $v^{(j)}$ are the singular values and right singular vectors of $A_{i-1}$.

There is a one-pass streaming algorithm that can compute both these scores exactly using $O(d^2)$ memory, at the cost of a $d \times d$ SVD per row: it maintains the covariance matrix $A_{i-1}^T A_{i-1}$ and computes its SVD to get $\Sigma$ and $V$. From these, it is easy to compute both $L^k_{ol}(i)$ and $T^k_{ol}(i)$.

To approximate these scores, it is natural to use a row space approximation, or rather a sketch $\tilde{A}$ that approximates the covariance matrix as below:

$$\|A_{i-1}^T A_{i-1} - \tilde{A}^T \tilde{A}\| \;\leq\; \epsilon\, \sigma_1^2. \qquad (12)$$

Given such a sketch, our approximation is the following: compute the SVD $\tilde{A} = \tilde{U}\tilde{\Sigma}\tilde{V}^T$. The estimates for $L^k_{ol}(i)$ and $T^k_{ol}(i)$ respectively are

$$\tilde{L}^k(i) = \sum_{j=1}^{k} \frac{\langle a^{(i)}, \tilde{v}^{(j)} \rangle^2}{\tilde{\sigma}_j^2}, \qquad (13)$$
$$\tilde{T}^k(i) = \|a^{(i)}\|^2 - \sum_{j=1}^{k} \langle a^{(i)}, \tilde{v}^{(j)} \rangle^2. \qquad (14)$$
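In code, the plug-in estimates (13) and (14) look as follows (our illustration; `sketch` is any matrix satisfying (12), for example a Frequent Directions sketch):

```python
import numpy as np

def scores_from_sketch(a, sketch, k, eps=1e-12):
    """Plug-in estimates (13)-(14) of the online scores of row `a`,
    using the top-k right singular values/vectors of `sketch`."""
    _, s, Vt = np.linalg.svd(sketch, full_matrices=False)
    coords = Vt[:k] @ a                                  # <a, v~_j> for j <= k
    L_est = np.sum(coords ** 2 / np.maximum(s[:k] ** 2, eps))
    T_est = a @ a - coords @ coords
    return L_est, T_est
```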

Given the sketch $\tilde{A}$, we expect that its row space is a good approximation to the row space spanned by $A_{i-1}$, since the covariance matrices of the rows are close. In contrast, the column spaces of $A_{i-1}$ and $\tilde{A}$ are hard to compare, since they lie in spaces of different dimensions. The closeness of the row spaces follows from the results of Section 2, applied to $A_{i-1}^T$ rather than $A_{i-1}$ itself. The results there require that the perturbation is small relative to the spectral gap, and Equation (12) implies that this assumption holds for $\epsilon$ small compared to $\Delta$.

We first state our approximation guarantee for the projection distance estimate $\tilde{T}^k$.

Theorem 6.

Assume that $A_{i-1}$ is $(k, \Delta)$-separated, and let $\tilde{A}$ satisfy Equation (12) with $\epsilon$ sufficiently small compared to $\Delta$. Then for every $i$,

$$\big|\tilde{T}^k(i) - T^k_{ol}(i)\big| \;\leq\; c\,\sqrt{\epsilon/\Delta}\;\|a^{(i)}\|^2$$

for an absolute constant $c$.

Proof.

We have

where in the last line we use Lemma 3, applied to the projection onto the column space of $A_{i-1}^T$ (whose columns are the rows of $A_{i-1}$). The required condition on $\epsilon$ holds by our choice of parameters. ∎

How meaningful is the above additive approximation guarantee? For each row, the additive error is proportional to $\|a^{(i)}\|^2$. It might be that $T^k_{ol}(i) \ll \|a^{(i)}\|^2$, which happens when the row lies almost entirely within the principal subspace. But in this case the point is not an anomaly, and the estimate $\tilde{T}^k(i)$ is also small compared to $\|a^{(i)}\|^2$, so these points will not seem anomalous from the approximation either. The interesting case is when $T^k_{ol}(i) \geq c\,\|a^{(i)}\|^2$ for some constant $c$. For such points, the additive error is a small fraction of $T^k_{ol}(i)$, so we indeed get multiplicative approximations.

Next we give our approximation guarantee for the leverage score estimate $\tilde{L}^k$, which relies on the perturbation bound in Theorem 3.

Theorem 7.

Assume that is -separated. Let be as in Equation (12). Let

Then for every ,

(15)
Proof.

Since , , it will suffice to show that

(16)

To prove inequality (16), we will bound the LHS as

(17)

For the first term, we apply Theorem 3 to get

(18)

We bound the second term as

(19)

where we use Weyl’s inequality to bound . Using Weyl’s inequality and the fact that ,

(20)

Plugging Equations (18) and (20) into Equation (17) gives

(21)

Equation (21) follows by Theorem 3. The conditions needed to apply it are guaranteed by our choice of parameters. ∎

To interpret this guarantee, consider the setting when all the points have roughly the same -norm. More precisely, if for some constant

then Equation (15) gives

Note that is a constant whereas grows as more points come in. As mentioned in the discussion following Theorem 1, the points which are considered outliers are those where . For the parameters setting, if we let and , then our bound on reduces to , and our choice of reduces to .

To efficiently compute a sketch that satisfies (12), we can use Frequent Directions (Liberty, 2013). We use the improved analysis of Frequent Directions in Ghashami et al. (2016):

Theorem 8.

(Ghashami et al., 2016) There is an algorithm that takes the rows of $A$ in streaming fashion and computes a sketch $\tilde{A} \in \mathbb{R}^{\ell \times d}$ satisfying Equation (12), where $\ell = O(sr(A)/\epsilon)$.

The algorithm maintains a sketch $\tilde{A}$ with $\ell$ rows. It uses memory $O(\ell d)$ and requires amortized time at most $O(\ell d)$ per row, so the total time for processing $n$ rows is $O(n\ell d)$. If $\ell \ll d$, this is a significant improvement over the naive one-pass algorithm in both memory and time. If we use Frequent Directions, we set

$$\ell = O\!\left(\frac{sr(A)}{\epsilon}\right), \qquad (22)$$

where $\epsilon$ is set according to Theorems 6 and 7. Note that $\ell$ does not depend on $n$ or $d$, and hence is considerably smaller than $d$ for our parameter settings of interest.
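For completeness, here is a compact, textbook-style Frequent Directions implementation (ours, following Liberty (2013) and Ghashami et al. (2016); not the paper's code). Feeding its output to a plug-in estimator like `scores_from_sketch` above gives the one-pass algorithm described in this section.

```python
import numpy as np

def frequent_directions(rows, ell, d):
    """Streaming Frequent Directions sketch (doubling-buffer variant).

    Returns a (2*ell) x d matrix B with ||A^T A - B^T B|| <= ||A||_F^2 / ell,
    where A stacks the input rows. Assumes ell <= d.
    """
    B = np.zeros((2 * ell, d))
    nxt = 0                                        # next free row in the buffer
    for a in rows:
        if nxt == 2 * ell:                         # buffer full: shrink
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[ell - 1] ** 2                # ell-th largest squared singular value
            s_shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B = np.zeros((2 * ell, d))
            B[: len(s_shrunk)] = s_shrunk[:, None] * Vt
            nxt = ell                              # rows ell..2*ell-1 are zero again
        B[nxt] = a
        nxt += 1
    return B
```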

3.2 The batch setting

In the batch setting, we consider the scenario where the entire matrix $A$ is given as input. Let $A = U\Sigma V^T$ be its SVD. Write $a^{(i)}$ in the basis of right singular vectors as $a^{(i)} = \sum_{j} \langle a^{(i)}, v^{(j)} \rangle\, v^{(j)}$. Recall that we defined its rank-$k$ leverage score and projection distance respectively as

$$L^k(i) = \sum_{j=1}^{k} \frac{\langle a^{(i)}, v^{(j)} \rangle^2}{\sigma_j^2}, \qquad (23)$$
$$T^k(i) = \|a^{(i)}\|^2 - \sum_{j=1}^{k} \langle a^{(i)}, v^{(j)} \rangle^2. \qquad (24)$$

Using the same approach as in the previous section, we can get a more efficient algorithm by using a sketch of the matrix $A$. In the first pass, we compute the sketch $\tilde{A}$ using Frequent Directions, and at the end of the pass we use it to compute the top-$k$ singular values $\tilde{\sigma}_j$ and right singular vectors $\tilde{v}^{(j)}$. In the second pass, we use these to compute $\tilde{L}^k(i)$ and $\tilde{T}^k(i)$. The space used is $O(\ell d)$, where $\ell$ is given by Equation (22), while the amortized update time per row is $O(\ell d)$.

The lower bound in Section 5 shows that one cannot hope for a sublinear dependence on $d$ for pointwise estimates. In the next section, we show how to eliminate the dependence on $d$ in the working space of the algorithm, in exchange for weaker guarantees. The algorithm still uses randomness for the random projection, which would be eliminated if the random projection were derandomized.

4 Average-case guarantees for Anomaly Detection

In this section, we present an approach which circumvents the lower bounds by relaxing the pointwise approximation guarantee. The algorithm is designed for the offline case.

Let $A = U\Sigma V^T$ be the SVD of $A$. The outlier scores we wish to compute are

$$L^k(i) = \sum_{j=1}^{k} U_{ij}^2, \qquad (25)$$
$$T^k(i) = \|a^{(i)}\|^2 - \sum_{j=1}^{k} \sigma_j^2\, U_{ij}^2. \qquad (26)$$

Note that these scores are defined with respect to the principal subspace of the entire matrix. We present a guarantee for any sketch that approximates the column space of $A$, or equivalently the matrix $AA^T$. We can work with any sketch $\tilde{A}$ where

$$\|AA^T - \tilde{A}\tilde{A}^T\| \;\leq\; \epsilon\, \sigma_1^2. \qquad (27)$$

Theorem 9 stated in Section 4.2 shows that such a sketch can be obtained for instance by a random projection onto for