Large-scale subspace clustering using sketching and validation


Panagiotis A. Traganitis, Konstantinos Slavakis, and Georgios B. Giannakis

Panagiotis A. Traganitis and Georgios B. Giannakis are with the Dept. of Electrical and Computer Engineering and the Digital Technology Center, University of Minnesota, Minneapolis, MN 55455, USA.
Konstantinos Slavakis is with the Dept. of Electrical Engineering, University at Buffalo, State University of New York, Buffalo, NY 14260, USA.
This work was supported by the NSF grants 1442686, 1500713, Eager 1343860 and the AFOSR grant MURI-FA9550-10-1-0567.
E-mails: traga003@umn.edu, kslavaki@buffalo.edu, georgios@umn.edu
Abstract

The massive amounts of data generated and communicated nowadays present major challenges in their processing. While capable of successfully classifying nonlinearly separable objects in various settings, subspace clustering (SC) methods incur prohibitively high computational complexity when processing large-scale data. Inspired by the random sampling and consensus (RANSAC) approach to robust regression, the present paper introduces a randomized scheme for SC, termed sketching and validation (SkeVa-)SC, tailored for large-scale data. At the heart of SkeVa-SC lies a randomized scheme for approximating the underlying probability density function of the observed data by kernel smoothing arguments. Sparsity in data representations is also exploited to reduce the computational burden of SC, while achieving high clustering accuracy. Performance analysis as well as extensive numerical tests on synthetic and real data corroborate the potential of SkeVa-SC and its competitive performance relative to state-of-the-art scalable SC approaches.
Keywords. Subspace clustering, big data, kernel smoothing, randomization, sketching, validation, sparsity.

1 Introduction

The turn of the decade has trademarked society and computing research with a “data deluge” [1]. As the number of smart and internet-capable devices increases, so does the amount of data that is generated and collected. While it is desirable to mine information from these data, their sheer volume and dimensionality introduce numerous challenges in processing and pattern analysis, since available statistical inference and machine learning approaches do not necessarily scale well with the number of data and their dimensionality. In addition, as the cost of cloud computing rapidly declines [2], there is a need to redesign those traditional approaches to take advantage of the flexibility offered by distributing the required computations across multiple nodes, as well as to reduce the per-node computational burden.

Clustering (unsupervised classification) is a method of grouping unlabeled data, with widespread applications across the fields of data mining, signal processing, and machine learning. $K$-means is one of the most successful clustering algorithms due to its simplicity [3]. However, $K$-means, as well as its kernel-based variants, provides meaningful clustering results only when data, after being mapped to an appropriate feature space, form “tight” groups that can be separated by hyperplanes [4].

Subspace clustering (SC) is a popular method that can group non-linearly separable data which are generated by a union of (affine) subspaces in a high-dimensional Euclidean space [5]. SC has well-documented impact in various applications, as diverse as image and video segmentation and identification of switching linear systems in controls [5]. Recent advances advocate SC algorithms with high clustering performance at the price of high computational complexity [5].

The goal of this paper is to introduce a randomized framework for reducing the computational burden of SC algorithms when the number of available data becomes prohibitively large, while maintaining high levels of clustering accuracy. The starting point is our sketching and validation (SkeVa) approach in [6]. SkeVa offers a low-complexity, randomized scheme for multimodal probability density function (pdf) estimation: it draws random, computationally affordable sub-populations of the data to obtain a crude sketch of the underlying pdf of the massive data population. A validation step, based on divergences between pdfs, follows to assess the quality of the crude sketch. These sketching and validation phases are repeated independently for a prescribed number of times, and the draw achieving the “best score” is finally utilized to cluster the whole data population. SkeVa is inspired by the random sampling and consensus (RANSAC) method [7]. Although RANSAC’s principal field of application is parametric regression in the presence of outliers, it has also been employed in SC [5].

To achieve the goal of devising an SC scheme with a low computational complexity footprint, the present paper broadens the scope of SkeVa to the SC context. Moreover, to support the state-of-the-art performance of SkeVa on real data, this contribution provides a rigorous performance analysis of the lower bound on the number of independent draws needed for SkeVa to identify a draw that “represents well” the underlying data pdf, with high probability. The analysis is facilitated by the non-parametric estimation framework of kernel smoothing [8], and models the underlying data pdf as a mixture of Gaussians, which are known to offer universal pdf approximations [9, 10]. To assess the proposed SkeVa-SC, extensive numerical tests on synthetic and real data are presented to underline the competitive performance of SkeVa-SC relative to state-of-the-art scalable SC approaches.

The rest of the paper is organized as follows. Section 2 provides SC preliminaries along with notation and tools for kernel density estimation. Section 3 introduces the proposed algorithm for large-scale SC, while Section 4 provides performance bounds for SkeVa-SC. Section 5 presents numerical tests conducted to evaluate the performance of SkeVa-SC in comparison with state-of-the-art SC algorithms. Finally, concluding remarks and future research directions are given in Section 6.

Notation: Unless otherwise noted, lowercase bold letters ($\mathbf{x}$) denote vectors, uppercase bold letters ($\mathbf{X}$) represent matrices, and calligraphic uppercase letters ($\mathcal{X}$) stand for sets. The $(i,j)$th entry of a matrix $\mathbf{X}$ is denoted by $[\mathbf{X}]_{ij}$; $\mathbb{R}^{D}$ stands for the $D$-dimensional real Euclidean space, $\mathbb{R}_{+}$ for the set of positive real numbers, $\mathbb{E}[\cdot]$ for expectation, and $\|\cdot\|_2$ for the $\ell_2$-norm.

2 Preliminaries

2.1 SC problem statement

Consider data $\mathcal{X} := \{\mathbf{x}_i\}_{i=1}^{N} \subset \mathbb{R}^{D}$ drawn from a union of $K$ affine subspaces, each denoted by $\mathcal{S}_k$, adhering to the model:

$\mathbf{x}_i = \mathbf{C}_k \mathbf{y}_i + \mathbf{m}_k + \mathbf{v}_i, \quad \forall \mathbf{x}_i \in \mathcal{S}_k \qquad (1)$

where $d_k$ (possibly with $d_k \ll D$) is the dimensionality of $\mathcal{S}_k$; $\mathbf{C}_k \in \mathbb{R}^{D \times d_k}$ is a matrix whose columns form a basis of $\mathcal{S}_k$; $\mathbf{y}_i \in \mathbb{R}^{d_k}$ is the low-dimensional representation of $\mathbf{x}_i$ in $\mathcal{S}_k$ with respect to (w.r.t.) $\mathbf{C}_k$; $\mathbf{m}_k \in \mathbb{R}^{D}$ is the “centroid” or intercept of $\mathcal{S}_k$; and $\mathbf{v}_i$ denotes the noise vector capturing unmodeled effects. If $\mathcal{S}_k$ is linear, then $\mathbf{m}_k = \mathbf{0}$.

$\mathbf{x}_i = \sum_{k=1}^{K} [\boldsymbol{\pi}_i]_k \left(\mathbf{C}_k \mathbf{y}_i + \mathbf{m}_k\right) + \mathbf{v}_i \qquad (2)$

where $\boldsymbol{\pi}_i \in \mathbb{R}^{K}$ is the cluster assignment vector for $\mathbf{x}_i$ and $[\boldsymbol{\pi}_i]_k$ denotes the $k$th entry of $\boldsymbol{\pi}_i$, under the constraints $[\boldsymbol{\pi}_i]_k \geq 0$ and $\sum_{k=1}^{K} [\boldsymbol{\pi}_i]_k = 1$. If $[\boldsymbol{\pi}_i]_k \in \{0, 1\}$, then datum $\mathbf{x}_i$ lies in only one subspace (hard clustering), while if $[\boldsymbol{\pi}_i]_k \in [0, 1]$, then $\mathbf{x}_i$ can belong to multiple clusters (soft clustering). In the latter case, $[\boldsymbol{\pi}_i]_k$ can be thought of as the probability that datum $\mathbf{x}_i$ belongs to $\mathcal{S}_k$.

Given the data matrix $\mathbf{X} := [\mathbf{x}_1, \ldots, \mathbf{x}_N] \in \mathbb{R}^{D \times N}$ and the number of subspaces $K$, SC involves finding the data-to-subspace assignment vectors $\{\boldsymbol{\pi}_i\}_{i=1}^{N}$, the subspace bases $\{\mathbf{C}_k\}_{k=1}^{K}$, their dimensions $\{d_k\}_{k=1}^{K}$, the low-dimensional representations $\{\mathbf{y}_i\}_{i=1}^{N}$, as well as the centroids $\{\mathbf{m}_k\}_{k=1}^{K}$ [5]. SC can be formulated as follows

(3)
subject to (s.to)

where , , and denotes the all-ones vector of matching dimensions.

The problem in (3) is non-convex, since all of its variables are unknown. We outline next a popular alternating way of solving (3). For given assignments and centroids, bases of the subspaces can be recovered using the singular value decomposition (SVD) on the data associated with each subspace. Indeed, given the sub-matrix $\mathbf{X}_k$ of data associated with $\mathcal{S}_k$, a basis $\mathbf{C}_k$ can be obtained from the first $d_k$ (from the left) singular vectors of the centered data $\mathbf{X}_k - \mathbf{m}_k\mathbf{1}^{\top}$, where $\mathbf{m}_k$ is the sample mean of the columns of $\mathbf{X}_k$. On the other hand, when the bases and centroids are given, the assignments can be recovered in the case of hard clustering by finding the closest subspace to each data point; that is, for every $\mathbf{x}_i$ we obtain

$k_i^{*} = \arg\min_{k \in \{1,\ldots,K\}} \, \mathrm{dist}\!\left(\mathbf{x}_i, \mathcal{S}_k\right) \qquad (4)$

where $\mathrm{dist}(\mathbf{x}_i, \mathcal{S}_k) = \left\| \left(\mathbf{I}_D - \mathbf{C}_k\mathbf{C}_k^{\top}\right)\!\left(\mathbf{x}_i - \mathbf{m}_k\right) \right\|_2$ is the distance of $\mathbf{x}_i$ from $\mathcal{S}_k$ (assuming orthonormal $\mathbf{C}_k$).

The $K$-subspaces algorithm [11], which generalizes the ubiquitous $K$-means algorithm [12] to SC, builds upon this alternating minimization of (3): (i) first, fixing the subspace bases and centroids and solving for the assignments; and (ii) fixing the assignments and solving for the remaining unknowns. Since the SVD is involved, SC entails high computational complexity whenever the number of data and/or their dimensionality is massive; specifically, the SVD of a $D \times N$ matrix incurs a computational complexity of $\mathcal{O}(DN\min\{D, N\})$, up to algorithm-dependent constants.
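To make the alternating minimization concrete, the following Python/NumPy sketch implements one possible $K$-subspaces iteration. It assumes a common, known subspace dimension d and hard assignments; variable names and initialization choices are illustrative and not taken from [11].

import numpy as np

def k_subspaces(X, K, d, n_iter=20, seed=0):
    # X is D x N; K subspaces of common dimension d; hard clustering assumed.
    rng = np.random.default_rng(seed)
    D, N = X.shape
    labels = rng.integers(0, K, size=N)                  # random initial assignments
    for _ in range(n_iter):
        bases, centroids = [], []
        for k in range(K):
            Xk = X[:, labels == k]
            if Xk.shape[1] <= d:                         # guard against (near-)empty clusters
                Xk = X[:, rng.choice(N, size=d + 1, replace=False)]
            mk = Xk.mean(axis=1, keepdims=True)          # centroid (intercept) of S_k
            U, _, _ = np.linalg.svd(Xk - mk, full_matrices=False)
            bases.append(U[:, :d])                       # d leading left singular vectors
            centroids.append(mk)
        # re-assign each datum to its closest affine subspace, as in (4)
        dists = np.empty((K, N))
        for k in range(K):
            R = X - centroids[k]
            dists[k] = np.linalg.norm(R - bases[k] @ (bases[k].T @ R), axis=0)
        labels = dists.argmin(axis=0)
    return labels, bases, centroids

As in any alternating scheme of this kind, the cost is dominated by the per-cluster SVDs, which is precisely what renders the approach expensive for massive data.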

It is known that when $K = 1$ and the basis matrix is constrained to be orthonormal, (3) boils down to PCA [13]

(5)
s.to

where () stands for the identity matrix. Notice that for , it holds that . Moreover, if , , with looking for amounts to ordinary clustering

(6)
s.to

Finally, it is also well known that (6) with soft clustering amounts to pdf estimation [3]. This fact will be exploited by our novel approach to large-scale SC, linking the benefits of pdf estimation with those of high-performance SC algorithms.

2.2 Kernel density estimation

Kernel smoothing or kernel density estimation [8] is a non-parametric approach to estimating pdfs. Kernel density estimators, similar to the histogram and unlike parametric estimators, make minimal assumptions about the unknown pdf $f$. Because they employ general kernel functions rather than rectangular bins, kernel smoothers are flexible enough to attain faster convergence rates than the histogram estimator as the number $N$ of observed data tends to infinity [8]. However, their convergence rate is slower than that of parametric estimators. For data $\{\mathbf{x}_i\}_{i=1}^{N}$ drawn from $f$, the kernel density estimator of $f$ is given by [8]

$\hat f(\mathbf{x}) = \frac{1}{N} \sum_{i=1}^{N} K_{\mathbf{H}}(\mathbf{x} - \mathbf{x}_i) \qquad (7)$

where $K_{\mathbf{H}}(\mathbf{x} - \mathbf{x}_i)$ denotes a pre-defined kernel function centered at $\mathbf{x}_i$, with a positive-definite bandwidth matrix $\mathbf{H} \in \mathbb{R}^{D \times D}$ that controls the amount of smoothing across dimensions. Typically, $K_{\mathbf{H}}$ is chosen to be a density, so that $\hat f$ is also a pdf. The role of the bandwidth can then be understood clearly, as $\mathbf{H}$ corresponds to the covariance matrix of the pdf $K_{\mathbf{H}}$ [8].
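For concreteness, a minimal Python/NumPy sketch of evaluating (7) with a Gaussian kernel and isotropic bandwidth $\mathbf{H} = h^2\mathbf{I}_D$ follows; the column-wise data layout is an illustrative choice made here, not part of the original formulation.

import numpy as np

def kde_gaussian(X, x_eval, h):
    # Evaluate the kernel density estimate (7) with a Gaussian kernel and H = h^2 I.
    # X is D x N (data); x_eval is D x M (query points); returns a length-M array.
    D, N = X.shape
    d2 = ((x_eval[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)    # M x N squared distances
    norm_const = (2.0 * np.pi * h ** 2) ** (-D / 2.0)               # Gaussian normalization
    return norm_const * np.exp(-0.5 * d2 / h ** 2).mean(axis=1)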

The performance of kernel-based estimators is usually assessed using the mean-integrated square error (MISE)

$\mathrm{MISE}(\hat f) := \mathbb{E}\left[\int_{\mathbb{R}^{D}} \left(\hat f(\mathbf{x}) - f(\mathbf{x})\right)^{2} d\mathbf{x}\right] \qquad (8)$

where the integral denotes $D$-dimensional integration, and $\hat f$ stands for the estimator in (7) with bandwidth $\mathbf{H}$. The choice of $\mathbf{H}$ has a notable effect on the MISE. As a result, it makes sense to select $\mathbf{H}$ such that the MISE is low.

Letting $\mathbf{H} = h^{2}\mathbf{I}_D$ with $h > 0$ facilitates the minimization of (8) w.r.t. the bandwidth, especially if $h$ is regarded as a function of $N$, i.e., $h = h(N)$, satisfying

$\lim_{N \to \infty} h(N) = 0, \qquad \lim_{N \to \infty} N h^{D}(N) = \infty. \qquad (9)$

Using the $D$-variate Taylor expansion of $f$, the asymptotically MISE-optimal bandwidth is [8]:

$h_{\mathrm{AMISE}} = c\, N^{-1/(D+4)} \qquad (10)$

where the constant $c$ is not a function of $N$, when the kernel is a spherically symmetric, compactly supported density. In addition, when the density to be estimated is a mixture of Gaussians [9, 10], and the kernel is chosen to be a Gaussian density $\phi(\mathbf{x}; \mathbf{m}, \boldsymbol{\Sigma})$, where

$\phi(\mathbf{x}; \mathbf{m}, \boldsymbol{\Sigma}) := (2\pi)^{-D/2}\, |\boldsymbol{\Sigma}|^{-1/2} \exp\!\left[-\tfrac{1}{2}(\mathbf{x} - \mathbf{m})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \mathbf{m})\right] \qquad (11)$

with mean $\mathbf{m}$ and covariance matrix $\boldsymbol{\Sigma}$, it is possible to express (8) in closed form. This closed form is due to the fact that the convolution of Gaussians is also a Gaussian; that is,

$\int_{\mathbb{R}^{D}} \phi(\mathbf{x}; \mathbf{m}_1, \boldsymbol{\Sigma}_1)\, \phi(\mathbf{x}; \mathbf{m}_2, \boldsymbol{\Sigma}_2)\, d\mathbf{x} = \phi(\mathbf{m}_1; \mathbf{m}_2, \boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2). \qquad (12)$

If , the optimal bandwidth of (10) becomes

(13)

The proposed SC algorithm of Section 3 will draw ideas from kernel density estimation, and a proper choice of the bandwidth $\mathbf{H}$ will prove instrumental in optimizing SC performance.
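As an illustration of such a choice, the sketch below computes the classical normal-reference (Silverman-type) rule of thumb for an isotropic Gaussian kernel; it is consistent in spirit with the discussion around (13), but is not claimed to be the exact rule adopted here.

import numpy as np

def normal_reference_bandwidth(X):
    # Rule-of-thumb bandwidth h for a Gaussian kernel with H = h^2 I,
    # based on the classical normal-reference rule; X is D x N.
    D, N = X.shape
    sigma = np.sqrt(np.var(X, axis=1).mean())            # average per-dimension variance
    return sigma * (4.0 / ((D + 2) * N)) ** (1.0 / (D + 4))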

2.3 Prior work

Besides the $K$-subspaces solver outlined in Sec. 2.1, various algorithms have been developed by the machine learning [5] and data-mining [14] communities to solve (3). A probabilistic (soft) counterpart of $K$-subspaces is the mixture of probabilistic PCA [15], which assumes that data are drawn from a mixture of degenerate (having zero variance in some dimensions) Gaussians. Building on the same assumption, agglomerative lossy compression (ALC) [16] utilizes ideas from rate-distortion theory [17] to minimize the number of bits required to “encode” each cluster, up to a certain distortion level. Algebraic schemes, such as the Costeira-Kanade algorithm [18] and generalized PCA (GPCA) [19], aim to tackle SC from a linear-algebraic point of view, but their performance is generally guaranteed only for independent and noiseless subspaces. Other methods recover subspaces by finding local linear subspace approximations [20]. See also [21, 22] for online clustering approaches to handling streaming data.

Arguably the most successful class of solvers for (3) relies on spectral clustering [23] to find the data-to-subspace assignments. Algorithms in this class first generate an affinity matrix $\mathbf{W} \in \mathbb{R}^{N \times N}$ to capture the similarity between data, and then perform spectral clustering on $\mathbf{W}$. Matrix $\mathbf{W}$ implies a graph whose vertices correspond to data and whose edge weights are given by its entries. Spectral clustering algorithms form the graph Laplacian matrix

$\mathbf{L} := \mathbf{D} - \mathbf{W} \qquad (14)$

where $\mathbf{D}$ is a diagonal matrix such that (s.t.) $[\mathbf{D}]_{ii} = \sum_{j=1}^{N} [\mathbf{W}]_{ij}$. The algebraic multiplicity of the zero eigenvalue of $\mathbf{L}$ yields the number of connected components of the graph [23], while the corresponding eigenvectors are indicator vectors of the connected components. For the sake of completeness, Alg. 1 summarizes how the trailing eigenvectors of $\mathbf{L}$ are used to obtain cluster assignments in spectral clustering.

1:Data affinity matrix $\mathbf{W}$; number of clusters $K$.
2:Data-cluster associations.
3: Form the diagonal matrix $\mathbf{D}$, with entries $[\mathbf{D}]_{ii} = \sum_{j} [\mathbf{W}]_{ij}$.
4: Laplacian matrix: $\mathbf{L} := \mathbf{D} - \mathbf{W}$.
5: Extract the $K$ trailing eigenvectors of $\mathbf{L}$. Let $\mathbf{V} \in \mathbb{R}^{N \times K}$ collect them as columns.
6: Let $\{\mathbf{g}_i\}_{i=1}^{N}$ be the rows of $\mathbf{V}$; $\mathbf{g}_i$ corresponds to the $i$th vertex ($i$th datapoint).
7: Group $\{\mathbf{g}_i\}_{i=1}^{N}$ into $K$ clusters using $K$-means.
Algorithm 1 Unnormalized spectral clustering [23]
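A compact Python sketch of Alg. 1 is given below; it uses NumPy for the eigendecomposition and scikit-learn's K-means for step 7, and assumes a symmetric, nonnegative affinity matrix.

import numpy as np
from sklearn.cluster import KMeans

def unnormalized_spectral_clustering(W, K):
    # W: N x N symmetric affinity matrix; K: number of clusters.
    Dmat = np.diag(W.sum(axis=1))            # degree matrix (step 3)
    L = Dmat - W                             # unnormalized Laplacian (14) (step 4)
    _, eigvecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    V = eigvecs[:, :K]                       # K trailing eigenvectors (step 5)
    return KMeans(n_clusters=K, n_init=10).fit_predict(V)   # cluster the rows (steps 6-7)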

Sparse subspace clustering (SSC) [24] exploits the fact that under the union-of-subspaces model of (3), only a small percentage of the data suffice to provide a low-dimensional affine representation of any $\mathbf{x}_i$, i.e., each datum can be written as a linear combination of a few other data points. Specifically, SSC solves the following sparsity-imposing optimization problem

$\min_{\mathbf{Z}} \ \|\mathbf{Z}\|_1 + \frac{\lambda}{2}\,\|\mathbf{X} - \mathbf{X}\mathbf{Z}\|_F^2 \qquad (15)$
s.to $\ \mathrm{diag}(\mathbf{Z}) = \mathbf{0}, \quad \mathbf{1}^{\top}\mathbf{Z} = \mathbf{1}^{\top}$

where $\mathbf{Z} := [\mathbf{z}_1, \ldots, \mathbf{z}_N] \in \mathbb{R}^{N \times N}$; column $\mathbf{z}_i$ is sparse and contains the coefficients for the representation of $\mathbf{x}_i$; $\lambda > 0$ is the regularization coefficient; and $\|\mathbf{Z}\|_1 := \sum_{i,j} |[\mathbf{Z}]_{ij}|$. Matrix $\mathbf{Z}$ is used to create the affinity matrix $\mathbf{W} := |\mathbf{Z}| + |\mathbf{Z}|^{\top}$. Finally, spectral clustering, e.g., Alg. 1, is performed on $\mathbf{W}$ and cluster assignments are identified. Using those assignments, the centroids $\{\mathbf{m}_k\}$ are found by taking sample means per cluster, and the bases $\{\mathbf{C}_k\}$ together with the low-dimensional representations $\{\mathbf{y}_i\}$ are obtained by applying SVD on the data of each cluster. For future use, SSC is summarized in Alg. 2.

The low-rank representation algorithm (LRR) [25] is similar to SSC, but replaces the $\ell_1$-norm in (15) with the nuclear one, $\|\mathbf{Z}\|_{*} := \sum_{i=1}^{\mathrm{rank}(\mathbf{Z})} \sigma_i(\mathbf{Z})$, where $\mathrm{rank}(\mathbf{Z})$ stands for the rank and $\sigma_i(\mathbf{Z})$ for the $i$th singular value of $\mathbf{Z}$. The high clustering accuracy achieved by both SSC and LRR comes at the price of high complexity: solving (15) scales quadratically with the number of data $N$, on top of performing spectral clustering, which renders SSC computationally prohibitive for large-scale SC. When data are high-dimensional (large $D$), methods based on (statistical) leverage scores, random projections [26], or our recent sketching and validation (SkeVa) [6] approach can be employed to reduce complexity to an affordable level. When the number of data is large (large $N$), the current state-of-the-art approach, scalable sparse subspace clustering (SSSC) [27], involves randomly drawing $n \ll N$ data, performing SSC on them, and expressing the rest of the data according to the clusters identified by that single random draw of $n$ samples. While SSSC clearly reduces complexity, performance can potentially suffer as the random sample may not be representative of the entire dataset, especially when $n \ll N$ and clusters are unequally populated. To alleviate this issue, the present paper introduces a structured trial-and-error approach to identify a “representative” $n$-point sample from a dataset with $N \gg n$, while maintaining low computational complexity.

1:Data $\mathbf{X}$; number of clusters $K$.
2:Data-cluster associations.
3: Solve (15) for $\mathbf{Z}$.
4: $\mathbf{W} := |\mathbf{Z}| + |\mathbf{Z}|^{\top}$.
5: Perform spectral clustering, namely, Alg. 1, on $\mathbf{W}$.
6: Identify point-to-subspace associations.
Algorithm 2 Sparse Subspace Clustering (SSC) [24]
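To illustrate the self-representation step, the sketch below solves a column-wise Lasso surrogate of (15) with scikit-learn and symmetrizes the result into an affinity matrix. It omits the affine constraint and uses scikit-learn's Lasso scaling, so it approximates rather than reproduces (15).

import numpy as np
from sklearn.linear_model import Lasso

def ssc_affinity(X, lam=0.01):
    # X is D x N. For each datum, regress it on all other data with an l1 penalty
    # (a surrogate of (15)), then form the affinity W = |Z| + |Z|^T for Alg. 1.
    D, N = X.shape
    Z = np.zeros((N, N))
    for i in range(N):
        idx = np.delete(np.arange(N), i)                     # enforce zero self-representation
        lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
        lasso.fit(X[:, idx], X[:, i])
        Z[idx, i] = lasso.coef_
    return np.abs(Z) + np.abs(Z).T

The N independent regressions make explicit why the cost grows rapidly with N, which motivates the sketching strategy of the next section.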

Regarding kernel smoothing, most of the available algorithms address the important issue of bandwidth selection to achieve desirable convergence rate properties in the approximation of the unknown pdf [28, 29, 30, 31]. The present paper, however, pioneers a framework to randomly choose the subset of kernel functions yielding a small error when estimating a multi-modal pdf.

3 The SkeVa-SC algorithm

The proposed algorithm, named SkeVa-SC, is listed in Alg. 3. It aims at iteratively finding a representative subset of $n \ll N$ randomly drawn data, running subspace clustering on it, and associating the remaining data with the subspaces extracted from it. SkeVa-SC draws a prescribed number $R_{\max}$ of realizations, over which two phases are performed: a sketching and a validation phase. The structure and philosophy of SkeVa-SC are based on the notion of divergences between pdfs.

Per realization $r$, the sketching stage proceeds as follows: A sub-population $\hat{\mathbf{X}}_r$ of $n$ data is randomly drawn, and a pdf estimate $\hat f_r$ is obtained from this sample using kernel smoothing. As the clusters are assumed to be sufficiently separated, $\hat f_r$ is expected to be multimodal. To confirm this, $\hat f_r$ is compared with a unimodal pdf $\hat f_r'$, using a measure $D(\cdot,\cdot)$ of pdf discrepancy. If $\hat f_r$ is sufficiently different from $\hat f_r'$, that is, $D(\hat f_r, \hat f_r') \geq \delta$, where $\delta$ is some pre-selected threshold, SkeVa-SC proceeds to the validation stage; otherwise, SkeVa-SC deems this draw uninformative and advances to the next realization $r + 1$, without performing the validation step.

At the validation stage of SkeVa-SC, another random sample of $n$ data, different from the one in $\hat{\mathbf{X}}_r$, is drawn, forming $\check{\mathbf{X}}_r$. The purpose of this stage is to evaluate how well $\hat{\mathbf{X}}_r$ represents the whole dataset. The pdf $\check f_r$ of $\check{\mathbf{X}}_r$ is estimated and compared to $\hat f_r$ using $D(\cdot,\cdot)$. A score is assigned to the $r$th realization using a non-increasing function of $D(\hat f_r, \check f_r)$, so that the score grows as the two draws come closer.

Finally, after $R_{\max}$ realizations, the set of $n$ data that received the highest score is selected, and SC (SSC or any other algorithm) is performed on it; that is,

(16)
s.to

The remaining data are associated with the clusters defined by the winning draw. This association can be performed either by using the residual-minimization method described in SSSC [27], or, if subspace dimensions are known, by identifying the subspace that is closest to each datum, as in (4).
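A sketch of the second option, i.e., assigning each remaining datum to its closest affine subspace as in (4), is given below; bases are assumed to have orthonormal columns.

import numpy as np

def assign_to_closest_subspace(X_rest, bases, centroids):
    # X_rest is D x M; bases[k] is D x d_k (orthonormal columns); centroids[k] is D x 1.
    K, M = len(bases), X_rest.shape[1]
    res = np.empty((K, M))
    for k in range(K):
        R = X_rest - centroids[k]
        res[k] = np.linalg.norm(R - bases[k] @ (bases[k].T @ R), axis=0)   # residual to S_k
    return res.argmin(axis=0)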

Remark 1.

The threshold $\delta$ can be updated across iterations: given the initial threshold value and the current maximum score as of iteration $r$, the threshold at iteration $r + 1$ is updated as

(17)
Remark 2.

As SkeVa-SC realizations are drawn independently, they can be readily parallelized using schemes such as MapReduce [32].

Remark 3.

The sketching and validation scheme does not target the SC goal explicitly. Rather, it decides whether a sampled subset of data is informative or not.

To estimate the densities involved at each step of SkeVa-SC, kernel density estimators, or kernel smoothers [cf. Section 2.2], are employed. Specifically, the pdf estimates per realization are formed from the sampled columns of the data matrix [cf. (20)]. As the pdf to be estimated is generally unknown, a prudent choice for the bandwidth matrix is $\mathbf{H} = h^{2}\mathbf{I}_D$ with $h > 0$, as it provides isotropic smoothing across all dimensions and greatly simplifies the analysis.

To assess performance in closed form, one can start with the integrated square error (ISE)

$D_{\mathrm{ISE}}(f_1, f_2) := \int_{\mathbb{R}^{D}} \left[f_1(\mathbf{x}) - f_2(\mathbf{x})\right]^{2} d\mathbf{x} \qquad (18)$

or, as in [6], the Cauchy-Schwarz divergence [33]

$D_{\mathrm{CS}}(f_1, f_2) := -\log \frac{\left[\int f_1(\mathbf{x})\, f_2(\mathbf{x})\, d\mathbf{x}\right]^{2}}{\int f_1^{2}(\mathbf{x})\, d\mathbf{x} \int f_2^{2}(\mathbf{x})\, d\mathbf{x}} \qquad (19)$

which will henceforth be adopted as our dissimilarity metric $D(\cdot,\cdot)$. Moreover, we will choose the Gaussian multivariate kernel, that is, $K_{\mathbf{H}}(\mathbf{x} - \mathbf{x}_i) = \phi(\mathbf{x}; \mathbf{x}_i, \mathbf{H})$, with $\phi$ defined as in (11).
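Since a Gaussian-kernel density estimate is itself a Gaussian mixture, (19) can be evaluated in closed form through pairwise Gaussian overlap integrals [cf. (12)]. The sketch below does exactly that with SciPy; the scaling of the Cauchy-Schwarz divergence used here, with the squared cross-term, is one of two common conventions and is assumed rather than taken from [33].

import numpy as np
from scipy.stats import multivariate_normal

def gmm_overlap(w1, M1, S1, w2, M2, S2):
    # Closed-form integral of the product of two Gaussian mixtures, via (12).
    # w: weights (n,), M: means (n, D), S: covariances (n, D, D).
    val = 0.0
    for a, ma, Sa in zip(w1, M1, S1):
        for b, mb, Sb in zip(w2, M2, S2):
            val += a * b * multivariate_normal.pdf(ma, mean=mb, cov=Sa + Sb)
    return val

def cauchy_schwarz_divergence(w1, M1, S1, w2, M2, S2):
    pq = gmm_overlap(w1, M1, S1, w2, M2, S2)
    pp = gmm_overlap(w1, M1, S1, w1, M1, S1)
    qq = gmm_overlap(w2, M2, S2, w2, M2, S2)
    return -np.log(pq ** 2 / (pp * qq))

For kernel density estimates with n kernels each, the three overlap terms involve pairwise Gaussian evaluations only, which is the source of the closed-form dissimilarities in (21).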

1:Data $\mathbf{X}$; max. no. of iterations $R_{\max}$; bandwidth $h$.
2:Clustered data; bases of subspaces.
3:for $r = 1$ to $R_{\max}$ do
4:    Sample $n$ columns of $\mathbf{X}$ to form $\hat{\mathbf{X}}_r$.
5:    Estimate the pdf $\hat f_r$ of $\hat{\mathbf{X}}_r$; evaluate its dissimilarity $D(\hat f_r, \hat f_r')$ from a unimodal pdf $\hat f_r'$ [cf. (20)].
6:   if $D(\hat f_r, \hat f_r') \geq \delta$ then
7:       Sample $n$ new columns to form $\check{\mathbf{X}}_r$; evaluate the pdf $\check f_r$ of $\check{\mathbf{X}}_r$.
8:       Find $D(\hat f_r, \check f_r)$ and score realization $r$.
9:   end if
10:end for
11:Select the winner draw $\hat{\mathbf{X}}_{r^{*}}$ as the one with the highest score.
12:Perform SSC (Alg. 2) on $\hat{\mathbf{X}}_{r^{*}}$ to find the subspaces and data-to-subspace assignments.
13:Associate the remaining data to the clusters defined in step 12.
Algorithm 3 Sketching and validation SC (SkeVa-SC)
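The following Python sketch mirrors the control flow of Alg. 3 under simplifying assumptions: the threshold is kept fixed (the update of Remark 1 is omitted), the score is taken as the negative dissimilarity, and the dissimilarity, SC, and association routines are user-supplied callables rather than the paper's exact functions.

import numpy as np

def skeva_sc(X, K, n, R_max, h, delta, d_unimodal, d_pdfs, sc_solver, assign, seed=0):
    # X: D x N data; n: sketch size; h: kernel bandwidth; delta: fixed threshold.
    # User-supplied callables (placeholders, not part of the paper's notation):
    #   d_unimodal(Xs, h)  -> dissimilarity of the sketch's pdf estimate from a unimodal fit
    #   d_pdfs(Xs, Xv, h)  -> dissimilarity between sketch and validation pdf estimates
    #   sc_solver(Xs, K)   -> (labels, bases, centroids), e.g., SSC (Alg. 2)
    #   assign(Xr, bases, centroids) -> labels for the remaining data, e.g., via (4)
    rng = np.random.default_rng(seed)
    N = X.shape[1]
    best_score, best_idx = -np.inf, None
    for _ in range(R_max):
        idx = rng.choice(N, size=n, replace=False)              # sketching draw (step 4)
        if d_unimodal(X[:, idx], h) < delta:                    # too close to unimodal: skip
            continue
        idx_val = rng.choice(np.setdiff1d(np.arange(N), idx), size=n, replace=False)
        score = -d_pdfs(X[:, idx], X[:, idx_val], h)            # higher when the pdfs agree
        if score > best_score:
            best_score, best_idx = score, idx
    if best_idx is None:                                        # no draw passed the threshold
        best_idx = rng.choice(N, size=n, replace=False)
    labels = np.empty(N, dtype=int)
    labels_sub, bases, centroids = sc_solver(X[:, best_idx], K)  # SC on the winning draw
    labels[best_idx] = labels_sub
    rest = np.setdiff1d(np.arange(N), best_idx)
    labels[rest] = assign(X[:, rest], bases, centroids)
    return labels

Passing the dissimilarities and the SC solver as arguments keeps the sketch agnostic to the specific choices of (18)-(19) and of the SC routine, which is in line with the modular structure of Alg. 3.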

The estimated pdfs that are used by SkeVa-SC are listed as follows:

(20a)
(20b)
(20c)

where $\hat{\mathbf{x}}_i$ is the $i$th column of the sampled sub-matrix, and the bandwidth matrices are appropriately defined. It is then easy to show that, for a measure such as (18) or (19), several of the pdf dissimilarities can be found in closed form as [cf. (12)]

(21a)
(21b)
(21c)
(21d)

Since finding the Gaussian dissimilarities between $D$-dimensional data incurs complexity $\mathcal{O}(D)$ per pair of kernel centers, the complexity of Alg. 3 is $\mathcal{O}(n^2 D)$ per iteration if the Gaussian kernel is utilized. As with any algorithm that involves kernel smoothing, the choice of bandwidth matrices critically affects the performance of Alg. 3.

4 Performance Analysis

The crux of Alg. 3 is the number $R_{\max}$ of independent random draws needed to identify a subset of data that “represents well” the whole data population. Since SkeVa-SC aims at large-scale clustering, it is natural to ask whether $R_{\max}$ iterations suffice to mine a subset of data whose pdf approximates well the unknown $f$. It is therefore crucial, prior to any implementation of SkeVa-SC, to have an estimate of the minimum $R_{\max}$ that ensures an “informative draw” with high probability. Such concerns are not taken into account in SSSC [27], where only a single draw is performed prior to applying SC. This section provides analysis to establish such a lower bound on $R_{\max}$, when the underlying pdf $f$ obeys a Gaussian mixture model (GMM). Due to the universal approximation properties of GMMs [3, 9, 10], such a generic assumption on $f$ is also employed by the mixture of probabilistic PCA [15] as well as by ALC [16].

Performance analysis will be based on the premise that the multimodal data pdf is modeled by a mixture of Gaussian densities. This assumption seems appropriate, as any pdf, or more generally any multivariate function satisfying mild integrability conditions, can be approximated by a mixture of appropriately many Gaussians [9, 8].

Assumption 1.

Data are generated according to the GMM

$f(\mathbf{x}) = \sum_{k=1}^{K} w_k\, \phi(\mathbf{x}; \mathbf{m}_k, \boldsymbol{\Sigma}_k) \qquad (22)$

where $\mathbf{m}_k$ and $\boldsymbol{\Sigma}_k$ stand for the mean vector and the covariance matrix of the $k$th Gaussian pdf, respectively, and $\{w_k\}_{k=1}^{K}$, with $w_k \geq 0$ and $\sum_{k=1}^{K} w_k = 1$, are the mixing coefficients.

Under (As. 1), the mean of the entire dataset is $\sum_{k=1}^{K} w_k \mathbf{m}_k$, and can be estimated by the sample mean of all data drawn from $f$.
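A quick numerical check of this statement follows; the GMM parameters below are illustrative only.

import numpy as np

# Sample from a small GMM as in (22) and compare the sample mean with the
# mixture mean sum_k w_k m_k.
rng = np.random.default_rng(0)
w = np.array([0.5, 0.3, 0.2])                                 # mixing coefficients
means = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
covs = [np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)]
counts = rng.multinomial(100_000, w)                          # draws per component
X = np.vstack([rng.multivariate_normal(means[k], covs[k], size=counts[k])
               for k in range(len(w))])
print(X.mean(axis=0), "vs", w @ means)                        # both approx. [1.5, 1.0]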

Definition 1.

A “dissimilarity” function $D(\cdot,\cdot)$ is a metric or a distance if the following properties hold for all points $p$, $q$, $s$ in its domain:

(P1) $D(p, q) \geq 0$;

(P2) $D(p, q) = 0$ if and only if $p = q$;

(P3) $D(p, q) = D(q, p)$;

(P4) $D(p, q) \leq D(p, s) + D(s, q)$. Property (P4) (depicted in Fig. 1) is widely known as the triangle inequality. A semi-distance is a function for which (P1), (P3), (P4), and the relaxed version of (P2), namely $D(p, p) = 0$, hold.

A divergence is a function $D: \mathcal{P} \times \mathcal{P} \to \mathbb{R}$, where $\mathcal{P}$ is the space of pdfs, for which only (P1) and (P2) hold. The class of Bregman divergences are generalizations of the squared Euclidean distance and include the Kullback-Leibler [34] as well as the Itakura-Saito one [35], among others. Furthermore, generalized symmetric Bregman divergences, such as the Jensen-Bregman one, satisfy the triangle inequality [36]. Although $D_{\mathrm{ISE}}$ in (18) is not a distance, since it does not satisfy the triangle inequality, its square root (the $\mathcal{L}_2$-norm of the difference of its arguments) is a semi-distance function because it satisfies (P1), (P3), (P4), and $D(p, p) = 0$.

If $\hat f$ denotes an estimate of the data pdf $f$, and $f'$ stands for a reference pdf (a rigorous definition will follow), Fig. 1 depicts $f$, $\hat f$, and $f'$ as points in the statistical manifold $\mathcal{P}$ [37], namely the space of probability distributions. The triangle inequality suggests that

$D(f, f') \leq D(f, \hat f) + D(\hat f, f') \ \Longleftrightarrow\ D(\hat f, f) \geq D(f, f') - D(\hat f, f'). \qquad (23)$

Figure 1: The triangle inequality in the statistical manifold $\mathcal{P}$.
Definition 2.

Given a dissimilarity function $D(\cdot,\cdot)$ which satisfies the triangle inequality in Def. 1, an event $B$ per realization of Alg. 3 is deemed “bad” if the dissimilarity between the kernel-based estimate $\hat f$ and the true $f$ is larger than or equal to some prescribed value $\epsilon > 0$, i.e.,

$B := \left\{ D(\hat f, f) \geq \epsilon \right\} \qquad (24)$

where $B$ denotes the event of the statement in braces being true. Naturally, an $\hat f$ for which the complement of $B$ holds true is deemed a “good” estimate of the underlying $f$. Given $f$, $\hat f$, and the reference pdf $f'$, the triangle inequality dictates that $D(\hat f, f) \geq D(f, f') - D(\hat f, f')$ [cf. (23)]. This implies that, for a fixed $D(f, f')$, the smaller $D(\hat f, f')$ is, the larger this lower bound becomes. For any arbitrarily fixed $\delta \leq D(f, f') - \epsilon$, any $\hat f$ for which $D(\hat f, f') \leq \delta$ implies that $D(\hat f, f) \geq \epsilon$, i.e., $B$ occurs [cf. (24)]. In other words,

$\Pr[B] \geq \Pr\left[ D(\hat f, f') \leq \delta \right]. \qquad (25)$

Here, $f'$ is chosen as the unimodal Gaussian pdf that is centered around the mean of the data. This is in contrast with the multimodal nature of the true $f$ for a number of well-separated clusters. According to (25), an estimate $\hat f$ that is “close” to the unimodal $f'$ will give rise to a “bad” event $B$.

Remark 4.

In this section, $f'$ is centered at the sample mean of the entire dataset, while in Alg. 3 the unimodal reference pdf per realization $r$ is centered around the sample mean of the sampled data. This is to avoid a step that incurs complexity $\mathcal{O}(ND)$ in SkeVa-SC. If the dataset mean is available (via a preprocessing step), then the unimodal pdf per iteration can be replaced by the one induced by the mean of the entire dataset [cf. (20b)].

The maximum required number of iterations $R_{\max}$ can now be lower-bounded, as summarized in the following theorem.

Theorem 1.

  1. Given a distance function $D(\cdot,\cdot)$ [cf. Def. 2], a threshold $\epsilon$ [cf. (24)], and a “success” probability $P$ of Alg. 3, i.e., the probability that after $R_{\max}$ realizations a random draw of $n$ data points yields an estimate $\hat f$ satisfying $D(\hat f, f) < \epsilon$ [cf. (24)], Alg. 3 requires

    (26)

    where the expectation is taken w.r.t. the data pdf $f$; that is,

  2. Under (As. 1) and with $D := D_{\mathrm{ISE}}$ [cf. (18)], an overestimate, with prescribed probability, of the lower bound in (26) is given by

    (27)

    with the expected value expressed in closed form as

    (28a)
    (28b)
    (28c)

    where $\mathbf{w} := [w_1, \ldots, w_K]^{\top}$ is the vector of mixing coefficients of (22), and the matrix involved has $(k, k')$th entry

    (29)
Proof.

By definition, $1 - P$ is the probability that all $R_{\max}$ draws of Alg. 3 are “bad.” Since iterations are independent, it holds that

$1 - P = \left(\Pr[B]\right)^{R_{\max}}. \qquad (30)$

The number of draws $R_{\max}$ can be lower-bounded as

(31)

where the last inequality follows from the trivial fact that .
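In practice, (30) and (31) translate into a simple RANSAC-style count: given an estimate (or bound) of the per-draw “bad” probability, the smallest number of draws achieving a target success probability can be computed as in the sketch below.

import numpy as np

def min_draws(p_bad, p_success):
    # Smallest R such that 1 - p_bad**R >= p_success, i.e., at least one
    # "good" draw occurs with probability p_success [cf. (30), (31)].
    return int(np.ceil(np.log(1.0 - p_success) / np.log(p_bad)))

# e.g., if a single draw is "bad" with probability 0.9, about 44 draws give a
# 99% chance of at least one "good" draw.
print(min_draws(0.9, 0.99))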

Using (25), $\Pr[B]$ is lower-bounded as

(32)

where ; thus, when is fixed, so is . Furthermore, Markov’s inequality implies that

$\Pr\left[D(\hat f, f') \leq \delta\right] \geq 1 - \frac{\mathbb{E}\left[D(\hat f, f')\right]}{\delta}. \qquad (33)$

Combining (32) with (33) yields

(34)

Using now (34) and that , it is easy to establish (26) via (31). In the case where