Nonlinear Unsupervised Clustering of Hyperspectral Images with Applications to Anomaly Detection and Active Learning


James M. Murphy, Mauro Maggioni
J. M. Murphy is with the Department of Mathematics at Johns Hopkins University; email: jmurphy@math.jhu.edu.
M. Maggioni is with the Department of Mathematics, the Department of Applied Mathematics and Statistics, and the Institute of Data Intensive Engineering and Science at Johns Hopkins University; email: mauro.maggioni@jhu.edu.
Abstract

The problem of unsupervised learning and segmentation of hyperspectral images is a significant challenge in remote sensing. The high dimensionality of hyperspectral data, the presence of substantial noise, and the overlap of classes all contribute to the difficulty of automatically clustering and segmenting hyperspectral images. In this article, we propose an unsupervised learning technique that combines a geometric estimation of class modes with a diffusion-inspired labeling that incorporates both spatial and spectral information. The mode estimation incorporates the geometry of the hyperspectral data by using diffusion distance to promote learning a unique mode from each class. These class modes are then used to label all points by a joint spatial-spectral nonlinear diffusion process. The proposed method, called spatial-spectral diffusion learning (DLSS), is shown to perform competitively against benchmark and state-of-the-art hyperspectral clustering methods on a variety of synthetic and real datasets. The proposed methods are shown to enjoy low computational complexity and fast empirical runtime. Two variations of the proposed method are also discussed. The first variation combines the proposed method of mode estimation with partial least squares regression (PLSR) to efficiently segment chemical plumes in hyperspectral images for anomaly detection. The second variation incorporates active learning to allow the user to request labels for a very small number of pixels, which can dramatically improve the overall clustering results. Extensive experimental analysis demonstrates the efficacy of the proposed methods and their robustness to choices of parameters.

I Introduction

I-A Hyperspectral Learning and Previous Work

Hyperspectral imagery (HSI) has emerged as a significant data source in a variety of scientific fields, including medical imaging [1], chemical analysis [2], and remote sensing [3]. Hyperspectral sensors capture reflectance at a sequence of localized electromagnetic ranges, allowing for precise differentiation of materials according to their spectral signatures. Indeed, the power of hyperspectral imagery for material discrimination has led to its proliferation, making manual analysis of hyperspectral data infeasible in many cases. The large size of HSI data, combined with its high dimensionality, demands innovative methods for storage and analysis. In particular, efficient machine learning algorithms are needed to automatically process and glean insight from the deluge of hyperspectral data now available.

A variety of hyperspectral learning problems have been investigated with gusto in recent years. The problem of HSI classification, or supervised segmentation, asks for an algorithm to label each pixel in a given HSI as belonging to a particular class, where randomly chosen training samples from each class are provided together with their labels, to help the algorithm build an effective discrimination tool. A variety of statistical and machine-learning techniques have been used for HSI classification, including nearest-neighbor and nearest subspace methods [4, 5], support vector machines [6, 7], neural networks [8, 9, 10] and regression methods [11, 12].

The problem of HSI unmixing [13] assumes that the pixels in an HSI scene are constituted from a set of basic materials, sometimes called endmembers. It is then of interest to determine quantities related to those materials, such as number of endmembers, their spectral signatures, and their abundances in a given pixel. Techniques in matrix factorization [14, 15] and sparse regression [16, 17] have been applied successfully to the problem of hyperspectral unmixing.

The problem of hyperspectral clustering, or unsupervised segmentation, is similar to the problem of hyperspectral classification, except that no training data is available, i.e. it is an unsupervised learning problem. Hence, decisions about class labels must be made without access to any labeled data. This increases the challenge considerably, and makes the problem ill-posed unless further assumptions, for example about the distribution of the data and how it relates to the labels, are made. Recent techniques for hyperspectral clustering include those based on particle swarm optimization [18], Gaussian mixture models (GMM) [19], nearest neighbor clustering [20], density analysis [21], sparse manifold models [22, 23], hierarchical nonnegative matrix factorization (HNMF) [24], graph-based segmentation [25], and fast search and find of density peaks clustering (FSFDPC) [21, 26, 27].

Anomaly detection in hyperspectral images is another variety of unsupervised problem [28], in which an algorithm attempts to identify pixels that deviate from the background of the image (or from some other reference set of spectra). The anomalies are often targets of interest in various applications of HSI, but labeled examples of the targets are typically unavailable. Anomaly detection methods often construct a model of the background, and identify pixels that are not well represented by, or have small likelihood according to, the background model. Recent methods in anomaly detection for HSI include those based on support vector machines [29], metric learning [30], collaborative representation [31], high-dimensional density estimators [32] and partial least squares regression [2].

Active learning for hyperspectral image classification allows for the labeling of a small, particularly chosen set of pixels, which can lead to high quality classification results with fewer labels than in the standard supervised learning setting, in which the labels are randomly selected [12, 33, 34]. Active learning differs from the typical supervised setting in that the selection of labels is principled rather than random, and the number of labeled pixels provided is typically very small. In this sense, active learning may be understood as a semisupervised learning technique.

This article addresses the problems of hyperspectral clustering, and relatedly anomaly detection and active learning. The methods proposed in the present article combine density-based methods with geometric learning through diffusion geometry [35, 36] in order to identify class modes. This information is then used to label all data points through a nonlinear process that incorporates both spatial and spectral information. The use of data-dependent diffusion maps for mode detection significantly improves over current state-of-the-art methods experimentally, and also enjoys robust theoretical performance guarantees [37]. Moreover, the spatial-spectral labeling scheme takes advantage of the geometric properties of the data, and greatly improves the empirical performance of clustering when compared to labeling based on spectral information alone. The empirical results of the application of our method to hyperspectral clustering, anomaly detection, and active learning are in Sections III-E, III-F, and III-G, respectively.

I-B Overview of Proposed Method

The proposed unsupervised clustering method is provided with the data $X$ and the number of classes $K$, and outputs a label for each data point, by proceeding in two steps:

  • Mode Identification: This step consists first in performing density estimation and analyzing the geometry of the data to find $K$ modes, one for each class.

  • Labeling Points: Once the modes are learned, they are assigned a unique label. Remaining points are labeled in a manner that preserves spectral and spatial proximity.

By a mode, we mean a point of high density within a class that is representative of the entire class. We assume the number of classes $K$ is known, but otherwise we have no access to labeled data; in Section III-F, we will discuss a method for learning $K$ automatically.

A crucial ingredient in many machine learning algorithms is a notion of distance, or similarity, between data points. In our case a data point corresponds to the sequence of spectral reflectances at a single spatial location.

One of the key contributions of this article is to measure similarities in the spectral domain not with the widely used Euclidean distances, or distances based on angles (correlations) between points, but with diffusion distances [35, 36], a data-dependent notion of distance that accounts for the geometry, linear or nonlinear, of the distribution of the data. The motivation for this approach is to attain robustness with respect to the shape of the distributions corresponding to the different classes, as well as to high-dimensional noise. The modes, suitably defined via density estimation, are robust to noise, and the process we use to pick only one mode per class is based on diffusion distances, which are robust to the shape of the support of the density of a class. The labeling of the points from the modes respects the geometry of the data, by incorporating proximity in both the spectral and spatial domains.

I-B1 Mode Identification

The computation of the modes is a significant aspect of the proposed method, which we now summarize for a general dataset $X$ consisting of $K$ classes. We make the assumption that the modes of the constituent classes can be characterized as a set of points $\{x_{m_k}\}_{k=1}^{K}$ such that

  1. The empirical density of each mode $x_{m_k}$ is relatively high.

  2. The diffusion distance between pairs of modes $x_{m_k}, x_{m_{k'}}$, for $k \neq k'$, is relatively large.

The first assumption is motivated by the fact that points of high density ought to have nearest neighbors corresponding to a single class; the modes should thus produce neighborhoods of points that with high confidence belong to a single class. However, there is no guarantee that the densest points will correspond to unique classes in the case when some classes have a multi-modal distribution. The second assumption addresses this issue, requiring that points far away in diffusion distance belong to different distributions. Note that making this requirement in terms of Euclidean distance would be more stringent and easily violated when the data distribution is nonlinear. Diffusion distances are data-dependent and adapt to the geometry of the dataset. Figure 1 illustrates this phenomenon.

(a) Nonlinear, multimodal example data.
(b) Euclidean distance from the marked point.
(c) Diffusion distance from the marked point.
Fig. 1: In this two-dimensional example, data is drawn from two distributions. One is a mixture of two isotropic Gaussians with distinct means, connected by a set of points sampled uniformly from a nonlinear, parabolic shape. The second distribution is a third, well-separated isotropic Gaussian. Samples from uniform noise are added and labeled as their nearest neighbor among the two clusters. The data is plotted and colored by cluster in subfigure (a). We plot the Euclidean and diffusion distances from a fixed point in subfigures (b), (c), respectively. The parabolic region acts as a “bridge” between the two Gaussians and causes the high-density regions near its two ends to be closer in diffusion distance than they are in the usual Euclidean distance. The bridge is traversed efficiently by diffusion distance, because there are many paths with short edges connecting the high-density regions across this bridge.

We model $X$ as samples from a mixture $\mu = \sum_{k=1}^{K} w_k \mu_k$, where each $\mu_k$ corresponds to the probability distribution of the spectra in class $k$, and the weights $\{w_k\}_{k=1}^{K}$ correspond to how often each class is sampled, and satisfy $w_k > 0$, $\sum_{k=1}^{K} w_k = 1$. More precisely, sampling from $\mu$ means first sampling a class index $Z$ with $\mathbb{P}(Z = k) = w_k$, then sampling from $\mu_k$ conditioned on the event $Z = k$.

The mode identification algorithm outputs a point (“mode”) for each class. Enforcing that these modes are far apart in diffusion distance has several advantages over enforcing that they are far apart in Euclidean distance. It typically leads to a unique mode from each class, even in the case that certain classes are multimodal, with several modes having higher density than the densest point in another class. Diffusion distances are also robust with respect to the shape of the distribution, and are thus suitable for identifying nonlinear clusters. An instance of these advantages of diffusion distance is illustrated in the toy example of Figure 1, with the results of the proposed mode detection algorithm in Figure 3. We postpone the mathematical and algorithmic details to Section II-B.

I-C Labeling Points

At this stage we assume that we have found exactly one mode for each class, to which a unique class label is assigned. The remaining points are subsequently labeled in a two-stage scheme, which takes into account both spectral and spatial information. Spatial information is well-known to improve hyperspectral learning [7, 12, 27, 38, 39, 40]. As in many of these methods, spatial information is incorporated for each pixel by constructing a neighborhood of some fixed radius around it in the spatial domain, and considering the labels within this neighborhood. For a given point, let spectral neighbor refer to a near neighbor with distances measured in the spectral domain, and let spatial neighbor refer to a near neighbor with distances measured in the spatial domain.

In the first stage, a point is given the same label as its nearest spectral neighbor of higher density, unless that label is sufficiently different from the labels of the point's nearest spatial neighbors. This produces an incomplete labeling in which we expect the labeled points to be far from the spatial and spectral boundaries of the classes, since these are points that are unlikely to have conflicting spatial and spectral labels. Pixels whose spectral label differs strongly from the labels of nearby spatial pixels are left unlabeled.

In the second stage of the labeling process we label the points left unlabeled in the first stage. Each such point is assigned the consensus label of its nearest spatial neighbors (see Section II-C), if it exists; otherwise, it is given the label of its nearest spectral neighbor of higher density. In this way the yet unlabeled points, typically near the spatial and spectral boundaries of the classes, benefit from the spatial information in the already labeled points, which are closer to the centers of the classes. In this sense, information diffuses from the modes towards the boundaries of the classes. An instance of the two-stage, spatial-spectral labeling appears in Figure 2.

This method of clustering, combining the diffusion-based learning of modes with the joint spatial-spectral labeling of pixels, is called spatial-spectral diffusion learning, and is abbreviated DLSS. This method is contrasted with another novel method we propose, called diffusion learning (DL), in which modes are learned as in DLSS, but the labeling proceeds simply by enforcing that each point has the same label as its nearest spectral neighbor of higher density. DL therefore disregards spatial information, while DLSS makes significant use of it. Our experiments show that while both DL and DLSS perform very well, DLSS is generally superior.

(a) First stage labeling
(b) Second stage, final labeling
Fig. 2: An example of the two-stage spatial-spectral labeling process, performed on the Indian Pines dataset used for experiments in Section III-E1. In subfigure (a), the partial labeling from the first stage is shown. After mode identification, points are labeled with the same label as their nearest spectral neighbor of higher density, unless that label is different from the consensus label in the spatial domain, in which case a point is left unlabeled. This leads to points far from the centers of the classes staying unlabeled after the first stage. In the second stage, unlabeled points are assigned labels by the same rule, unless there is a clear consensus in the spatial domain, in which case the unlabeled point is given the consensus spatial label; the results of this second stage appear in subfigure (b). For visual clarity, here and throughout the paper, pixels without ground truth (GT) labels are masked out in these pictures.

DLSS is a joint spatial-spectral diffusion clustering technique. Each class mode distributes its label throughout the class distribution, until the boundary of the distribution is encountered in either the spatial or spectral domain. The first-stage labeling in the spectral domain is somewhat similar to mean-shift clustering [41] and is the same procedure used in FSFDPC [26] for labeling points once the modes have been determined. The process of halting the labeling of a point if the spectral label differs strongly from nearby spatial labels enforces smooth diffusion in the spatial domain, and may be understood as a form of spatial regularization. Details on how the spatial-spectral diffusion learning is performed appear in Section II-C. This approach may also be connected to semisupervised learning techniques on graphs, where initial given labels are propagated by a diffusion process to other vertices (points): see [42] and references therein. Here we proceed in an unsupervised fashion, replacing initial given labels by estimated modes of the clusters.

I-D Major Contributions

We propose a clustering algorithm for HSI with several significant innovations. First, diffusion distance is proposed to measure distance between high-density regions in hyperspectral data, in order to determine class modes. Our experiments show that this distance efficiently differentiates between points belonging to the same cluster and points in different clusters. This correct identification of modes from each cluster is essential to any clustering algorithm incorporating an analysis of modes. Compared to state-of-the-art fast mode detection algorithms, the proposed method enjoys excellent empirical performance and theoretical performance guarantees, which are beyond the scope of the present article, and will be discussed in the forthcoming article [37].

A second major contribution of the proposed HSI clustering algorithm is the incorporation of spatial information through the labeling process. Labels for points are determined by diffusing in the spectral domain from the labeled modes, unless spatial proximity is violated. By not labeling points that would violate spatial regularity, the proposed algorithm first labels points that, with high confidence, are close to the modes of the distributions. Only once all of these points are labeled are the remaining points, further from the modes, labeled. This enforces a spatial regularity which is natural for HSI, because a pixel in an HSI is likely to have the same label as the most common label among its nearest spatial neighbors. Difficulties tend to occur on spatial and spectral boundaries, and these points are likely to be labeled last in our scheme. In both stages DLSS takes advantage of the geometry of the dataset by using data-adaptive diffusion processes, greatly improving empirical performance. The proposed methods enjoy low computational complexity and relatively fast runtimes. Indeed, the computational cost of the proposed methods scales well in the number of points $n$, making them suitable for the big data setting when $n$ is very large.

A third major contribution is the use of the learned modes for the segmentation of chemical plumes in HSI. This may be considered a form of anomaly detection, in which the goal is not clustering, but the correct identification of a special cluster of interest (“anomaly”). Motivated by recent work in the problem of anomaly detection and background separation for HSI [2], the proposed method of mode estimation is combined with partial least squares regression (PLSR) to linearly embed the dataset in lower dimensions in a maximally discriminatory manner. This step enhances the contrast between the clusters, crucial for anomaly detection, while reducing the amount of noise, thanks to the projection onto a small number of discriminating directions. The proposed anomaly detection method transforms supervised PLSR into an unsupervised method, which we show to be highly efficient for the detection of chemical plumes. This work is detailed in Section III-F.

The fourth major contribution is the introduction of an active learning scheme based on distances of points to the computed modes. In the context of active learning, the user is allowed to label only a very small number of points, to be chosen parsimoniously. We propose an unsupervised method for determining which points to label in the active learning setting. Points whose distance to their nearest mode is much smaller than their distance to the other modes are likely to be easy to label, and hence should not be given labels in the active learning setting. Conversely, pixels that are nearly equally far from their two nearest modes are likely to be near class boundaries, and hence to be the most challenging pixels to label by the proposed unsupervised method. We therefore propose an active learning method that labels only the pixels whose distances to their nearest two modes are closest. Our experiments show that this method can dramatically improve labeling accuracy with a number of labels amounting to a very small fraction of the total pixels. This work is detailed in Section III-G.

II Unsupervised Learning Algorithm and Variations

II-A Motivating Example and Approach

A key aspect of our algorithm is the method for identifying the modes of the underlying classes in the HSI data. This is challenging because of the high ambient dimension of the data and potential overlaps between distributions at their tails, along with differing densities, sampling rates, and distribution shapes across classes. Consider the simplified example in Figure 3. This data set is the same as that appearing in Figure 1.

(a) Euclidean modes
(b) Diffusion modes
Fig. 3: Learned modes with Euclidean distances and diffusion distances. The Euclidean and diffusion distances from a fixed point are shown in subfigures (b), (c) of Figure 1, while the corresponding learned modes appear as red circles in subfigures (a), (b) of the present figure. Notice that the proposed geometric learning method, using diffusion distances, correctly identifies a unique mode from each cluster, while using Euclidean distances leads to one cluster being given two modes. This is a fatal error, and will lead to poor clustering results.

The points of high density lie close to the center of the first distribution, and close to the two ends of the second distribution. After computing an empirical density estimate, the distance between high density points is computed. If the Euclidean distance between high density points is used to remove spurious modes, i.e. modes corresponding to the same distribution, then the two modes on opposite sides of the second distribution are returned, which incorrectly assigns two modes to the same distribution. However, if diffusion distance is used, rather than Euclidean distance, then the two modes learned correspond to two different classes. This is because the modes on the opposite ends of the second distribution are far in Euclidean distance but closer in diffusion distance. Indeed, the points between the two modes act as a bridge, which is incorporated into the geometry learned by diffusion distances. On the other hand, there is a substantial region of low density between the two distributions, making the diffusion distance between them relatively large. This suggests that diffusion distance is more useful than Euclidean distance for comparing high density points for the determination of modes, under the assumption that multimodal regions have modes that are connected by regions of not too low density. From the standpoint of Euclidean distance, the bridge is insignificant. From the standpoint of diffusion distance, it is critical.
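To make the toy example concrete, the following short Python sketch generates data of the kind shown in Figures 1 and 3; the specific means, scales, and sample sizes are illustrative assumptions, not the values used to produce the original figures.

import numpy as np

rng = np.random.default_rng(0)

# Class 1: two isotropic Gaussians joined by a low-density parabolic "bridge".
g1 = rng.normal(loc=[-2.0, 0.0], scale=0.25, size=(200, 2))
g2 = rng.normal(loc=[2.0, 0.0], scale=0.25, size=(200, 2))
s = rng.uniform(-2.0, 2.0, size=100)
bridge = np.column_stack([s, 0.5 * (s ** 2 - 4.0)])   # parabola through both Gaussian means
class1 = np.vstack([g1, g2, bridge])

# Class 2: a single, well-separated isotropic Gaussian.
class2 = rng.normal(loc=[0.0, 3.0], scale=0.25, size=(200, 2))

X = np.vstack([class1, class2])
labels = np.concatenate([np.zeros(len(class1), dtype=int), np.ones(len(class2), dtype=int)])
# Euclidean distances from a reference point can then be compared with the
# diffusion distances computed by the sketch in Section II-B.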

II-B Diffusion Distance

We now present a detailed overview of the construction of diffusion distances. Additional analysis and comments on implementation appear in [43, 35, 36]. Diffusion processes on graphs lead to a data-dependent notion of distance, known as diffusion distance [43, 35, 36]. This notion of distance has been applied to a variety of problems, including analysis of stochastic and dynamical systems [35, 44, 45, 46], semisupervised learning [47, 42], data fusion [48, 49], latent variable separation [50, 51], and molecular dynamics [52, 53]. Diffusion maps provide a way of computing and visualizing diffusion distances, and may be understood as a type of nonlinear dimension reduction, in which data in a high number of dimensions is embedded in a low-dimensional space by a nonlinear coordinate transformation. In this regard, diffusion maps are related to nonlinear dimension reduction techniques such as Isomap [54], Laplacian eigenmaps [47], and locally linear embedding [55], among several others.

Consider a discrete set $X = \{x_i\}_{i=1}^{n}$ with $n$ elements. The diffusion distance [35, 36] between $x_i, x_j \in X$, denoted $D_t(x_i, x_j)$, is a notion of distance that incorporates and is uniquely determined by the underlying geometry of $X$. The distance depends on a time parameter $t$, which enjoys an interpretation in terms of diffusion on the data. The computation of $D_t$ involves constructing a weighted, undirected graph $G$ that encodes the geometry of $X$, and the distance approximates a suitable notion of geometry of the underlying distribution generating the data. The vertices of $G$ correspond to the points of $X$ and the edges are given by a weight matrix $W$ defined as

$$W_{ij} = \begin{cases} \exp\left(-\|x_i - x_j\|_2^2 / \sigma^2\right), & x_j \in NN_k(x_i), \\ 0, & \text{otherwise}, \end{cases} \qquad (1)$$

for some suitable choice of $\sigma > 0$, and with $NN_k(x_i)$ the set of $k$-nearest neighbors of $x_i$ in $X$ with respect to Euclidean distance. The weight matrix is normalized to be row-stochastic, yielding a Markov diffusion with transition matrix $P$ given by $P_{ij} = W_{ij}/d_i$, where $d_i = \sum_{j=1}^{n} W_{ij}$ is the degree of $x_i$.

Given some initial distribution $\mu_0$ on the state space, the vector $\mu_0 P^t$ is the state of the Markov chain at time $t$. As $t$ increases, the diffusion process on $X$ evolves according to the connections between the points, as encoded by $P$. This Markov chain has a stationary distribution $\pi$ s.t. $\pi P = \pi$, given by

$$\pi_i = \frac{d_i}{\sum_{j=1}^{n} d_j}.$$

The diffusion distance at time $t$ is defined as

$$D_t(x_i, x_j)^2 = \sum_{u \in X} \left( P^t(x_i, u) - P^t(x_j, u) \right)^2 / \pi(u).$$

The computation of $D_t(x_i, x_j)$ involves summing over all paths of length $t$ connecting $x_i$ to $x_j$, so $D_t(x_i, x_j)$ is small if $x_i, x_j$ are strongly connected in the graph corresponding to $P$, and large if $x_i, x_j$ are weakly connected in the graph.

From a computational standpoint, an eigendecomposition of $P$ allows to derive fast algorithms involving $D_t$. Indeed, under mild conditions [36], the matrix $P$ admits a spectral decomposition with eigenvectors $\{\Psi_m\}_{m=1}^{n}$ and eigenvalues $\{\lambda_m\}_{m=1}^{n}$, where $1 = \lambda_1 \geq |\lambda_2| \geq \dots \geq |\lambda_n|$. The diffusion distance can be written in terms of this spectral decomposition as

$$D_t(x_i, x_j) = \left( \sum_{m=1}^{n} \lambda_m^{2t} \left( \Psi_m(x_i) - \Psi_m(x_j) \right)^2 \right)^{1/2}. \qquad (2)$$

One can think of the weighted eigenvectors $\{\lambda_m^t \Psi_m\}_{m=1}^{n}$ as encoding new data-dependent coordinates of $X$. These coordinates are close to being geometrically intrinsic [35, 43]. Under this interpretation, diffusion distance is simply the usual Euclidean distance, computed in these new coordinates rather than in the original coordinates.

The diffusion distance is parametrized by $t$, which encodes how long the diffusion process on the graph has run when the distances are computed. Small values of $t$ allow only a small amount of diffusion, which may prevent the interesting geometry of $X$ from being discovered, but provide detailed, fine-scale information. Large values of $t$ allow the diffusion process to run for so long that the interesting geometry may be washed out. In this work, the regime in between these two extremes is typically when the geometry of the data is revealed; to balance between the short- and long-term behaviors, we set $t$ to a fixed, moderate value for all experiments. The impact of $t$ may be studied mathematically with spectral graph theory and analyses of Markov mixing times, which allows for estimates of the diffusion distances for certain data [37]. In general, the choices of the parameters in the construction of diffusion distances may be important; the choice of these parameters is discussed in Section III-H.

Note that since $|\lambda_m| \leq 1$ for all $m$, $\lambda_m^{2t} \approx 0$ for large $t$ whenever $|\lambda_m| < 1$, and the sum (2) may be approximated by its truncation at index $M \ll n$, for some suitable $M$. In all experiments, $M$ was set to be the value at which the decay of the eigenvalues begins to taper off, i.e. the “kink” in the eigenvalue plot; this is a standard heuristic. The subset $\{\lambda_m^t \Psi_m\}_{m=1}^{M}$ used in the computation of $D_t$ is a dimension-reduced set of diffusion coordinates. Intuitively, the eigenvectors with larger eigenvalues contain significant information, while those with smaller eigenvalues are mostly noise and are of little value. The truncation improves the computational complexity of the proposed method by requiring the computation of fewer eigenvectors, which are costly to compute even in the case of sparse matrices. A discussion of the computational complexity of diffusion distances appears in Section IV-A.
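The construction above is summarized in the following Python sketch (an illustrative re-implementation, not the authors' MATLAB code); the kernel scale sigma, number of nearest neighbors k_nn, time t, and truncation level M are assumed parameters. Euclidean distances between the returned truncated diffusion coordinates approximate the diffusion distances of Equation (2).

import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import eigs

def diffusion_coordinates(X, k_nn=20, sigma=None, t=30, M=10):
    """Truncated diffusion map: row i holds lambda_m^t * Psi_m(x_i) for the leading nontrivial eigenpairs."""
    n = X.shape[0]
    dist, idx = cKDTree(X).query(X, k=k_nn + 1)          # nearest neighbors (first is the point itself)
    dist, idx = dist[:, 1:], idx[:, 1:]
    if sigma is None:
        sigma = 0.5 * dist.mean()                        # data-adaptive kernel scale (heuristic)
    rows = np.repeat(np.arange(n), k_nn)
    W = csr_matrix((np.exp(-dist.ravel() ** 2 / sigma ** 2), (rows, idx.ravel())), shape=(n, n))
    W = 0.5 * (W + W.T)                                  # symmetrize the kNN graph
    d = np.asarray(W.sum(axis=1)).ravel()
    P = diags(1.0 / d) @ W                               # row-stochastic Markov transition matrix
    vals, vecs = eigs(P, k=M + 1, which='LM')            # leading eigenpairs of P
    order = np.argsort(-np.abs(vals))
    vals, vecs = np.real(vals[order]), np.real(vecs[:, order])
    return (vals[1:] ** t) * vecs[:, 1:]                 # drop the trivial constant eigenvector

def diffusion_distance_from(coords, i):
    """Approximate diffusion distance from point i to every point."""
    return np.linalg.norm(coords - coords[i], axis=1)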

II-C Unsupervised HSI Clustering Algorithm Description

We now discuss the proposed algorithm in detail. A flowchart summarizing the proposed clustering method appears in Figure 4.

[Flowchart in Figure 4: Input $X$, $K$ → compute the empirical density and diffusion distances (Equations (3), (4), (5)) → compute and label the $K$ modes (Algorithm 1) → Stage 1: order the unlabeled points by decreasing density; for each, compute its spectral label and its spatial consensus label (Equation (6)); if no consensus label exists or the two agree, assign the spectral label, otherwise leave the point unlabeled → Stage 2: for each still-unlabeled point, recompute the spatial consensus label (Equation (6)) and assign it if it exists, otherwise assign the spectral label.]
Fig. 4: Diagram of proposed unsupervised clustering algorithm. First, modes are computed. Second, points are labeled in the two stage algorithm.

Let $X$ be the HSI, reshaped into an $n \times D$ matrix, where $n$ is the number of pixels and $D$ is the number of spectral bands. We may consider $X$ as a collection of points $\{x_i\}_{i=1}^{n} \subset \mathbb{R}^D$. As described in Section I-B, our algorithm proceeds in two major steps: mode identification and labeling of points.

The algorithm for learning the modes of the classes is summarized in Algorithm 1. It first computes an empirical density for each point with a kernel density estimator. This computes, for each $x_i \in X$,

$$\tilde{p}(x_i) = \sum_{x_j \in NN_k(x_i)} \exp\left( -\|x_i - x_j\|_2^2 / \sigma^2 \right),$$

where $\|x_i - x_j\|_2^2$ is the squared Euclidean distance in $\mathbb{R}^D$, and $NN_k(x_i)$ is the set of $k$-nearest neighbors of $x_i$, in Euclidean distance. The use of the Gaussian kernel density estimator is rather standard in statistics and machine learning, and it enjoys strong theoretical guarantees [56, 57]; other estimators may be used. In Section III, a fixed, modest number of nearest neighbors is used, though our method is robust to choosing larger values of $k$. The parameter $\sigma$ in the exponential kernel is data-adaptive, and set to be half the mean distance between all points, which makes this choice relatively robust across different data sets (one could use the median instead in the presence of outliers). The empirical density $p$ is then computed by normalizing $\tilde{p}$ so that $\sum_{i=1}^{n} p(x_i) = 1$:

$$p(x_i) = \frac{\tilde{p}(x_i)}{\sum_{j=1}^{n} \tilde{p}(x_j)}. \qquad (3)$$

Once this empirical density is computed, the modes of the HSI classes are computed in a manner similar in spirit to [26]. We compute the time-dependent quantity $\rho_t$ that assigns, to each pixel, the minimum diffusion distance between the pixel and a point of higher empirical density. The point with highest empirical density is instead assigned the maximum diffusion distance between it and any other point as its value. More precisely, we compute

$$\rho_t(x_i) = \begin{cases} \min\limits_{p(x_j) \ge p(x_i),\, x_j \neq x_i} D_t(x_i, x_j), & x_i \neq \arg\max_{x \in X} p(x), \\ \max\limits_{x_j \in X} D_t(x_i, x_j), & x_i = \arg\max_{x \in X} p(x), \end{cases} \qquad (4)$$

where $D_t(x_i, x_j)$ is the diffusion distance between $x_i, x_j$ at time $t$. In the following we will use the normalized quantity $\bar{\rho}_t(x_i) = \rho_t(x_i) / \max_j \rho_t(x_j)$, which has maximum value $1$.

The $K$ modes of the HSI are computed as the points $\{x_{m_k}\}_{k=1}^{K}$ yielding the $K$ largest values of the quantity

$$\mathcal{D}_t(x_i) = p(x_i)\, \bar{\rho}_t(x_i). \qquad (5)$$

Intuitively, such points should both have high density and be far in diffusion distance from any other higher-density points, and can therefore be expected to be modes of different distributions. This method provably detects modes correctly under the assumption that the data is drawn from certain nonparametric distributions [37].

1 Compute the empirical densities $p(x_i)$ for all elements of $X$.
2 Compute $\bar{\rho}_t(x_i)$, the normalized diffusion distance from each point to its nearest neighbor of higher empirical density, measured in diffusion distance.
3 Set the learned modes $\{x_{m_k}\}_{k=1}^{K}$ to be the $K$ maximizers of $\mathcal{D}_t(x_i) = p(x_i)\,\bar{\rho}_t(x_i)$.
Algorithm 1 Geometric Mode Detection Algorithm
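A compact Python sketch of Algorithm 1 follows; it reuses the diffusion_coordinates helper from the previous sketch, and the kernel-density parameters are illustrative assumptions rather than the values used in the paper.

import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist

def empirical_density(X, k_de=20, sigma=None):
    """Gaussian kernel density estimate over the k nearest neighbors, normalized to sum to 1 (Eq. (3))."""
    dist, _ = cKDTree(X).query(X, k=k_de + 1)
    dist = dist[:, 1:]
    if sigma is None:
        sigma = 0.5 * dist.mean()
    p = np.exp(-dist ** 2 / sigma ** 2).sum(axis=1)
    return p / p.sum()

def detect_modes(X, K, diff_coords):
    """Return indices of the K modes: the K maximizers of p(x) * rho_t(x) (Eqs. (4)-(5))."""
    p = empirical_density(X)
    D = cdist(diff_coords, diff_coords)                  # pairwise (approximate) diffusion distances
    order = np.argsort(-p)                               # indices in decreasing density
    rho = np.empty(len(p))
    for rank, i in enumerate(order):
        if rank == 0:
            rho[i] = D[i].max()                          # densest point: max distance to any point
        else:
            rho[i] = D[i, order[:rank]].min()            # min distance to a denser point
    rho /= rho.max()
    return np.argsort(-(p * rho))[:K]

# Usage sketch: coords = diffusion_coordinates(X); modes = detect_modes(X, K, coords)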

Once the modes are detected, they are each given a unique label. The rest of the points are labeled from these mode labels in the following two-stage process, which is summarized in Algorithm 2. In the first stage, the points are sorted in order of decreasing empirical density. In order of decreasing density, the spatial consensus label of each point is computed by finding all labeled points within a fixed radius of the pixel in question in the spatial domain; call the multiset of their labels $L_s(x_i)$. If one label in $L_s(x_i)$ occurs with relative frequency exceeding a fixed consensus threshold $\tau$, that label is the spatial consensus label. Otherwise, no spatial consensus label is given. In detail, the spatial consensus label of $x_i$ is

$$L_{\text{consensus}}(x_i) = \begin{cases} \ell, & \text{if } \ell \text{ appears in } L_s(x_i) \text{ with relative frequency} > \tau, \\ \text{none}, & \text{otherwise}. \end{cases} \qquad (6)$$

After a point's spatial consensus label is computed, its spectral label is computed as the label of its nearest neighbor of higher density in the spectral domain, measured in diffusion distance. The point is then given the spectral label as its overall label, unless the spatial consensus label exists and differs from the spectral label. In that case, the point in question remains unlabeled in the first stage. Note that points that are unlabeled are considered to have label 0 for the purposes of computing the spatial consensus label, so in the case that most pixels in the spatial neighborhood are unlabeled, the spatial consensus label will be 0. Hence, only pixels with many labeled pixels in their spatial neighborhood can hope to have a consensus spatial label.

Upon completion of the first stage, the dataset will be partially labeled; see Figure 2. Indeed, only points with no spatial consensus label and points whose spatial consensus label exists and agrees with the spectral label will have been labeled. In the second stage, an unlabeled point is given the label of its spatial consensus label, if it exists. Otherwise, it is given the label of its nearest spectral neighbor of higher density.

1 Assign each mode a unique label.
2 Stage 1: Iterating in order of decreasing density among unlabeled points, assign each point the same label as its nearest spectral neighbor (in diffusion distance) of higher density, unless the spatial consensus label exists and differs, in which case the point is left unlabeled.
Stage 2: Iterating in order of decreasing density among unlabeled points, assign each point the consensus spatial label, if it exists, otherwise the same label as its nearest spectral neighbor of higher density.
Algorithm 2 Spatial-Spectral Labeling Algorithm
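A Python sketch of the two-stage labeling in Algorithm 2 is given below. The spatial neighborhoods, the consensus threshold (a simple majority here), and the choice to propagate labels from the nearest already-labeled point of higher density are simplifying assumptions made for illustration.

import numpy as np

def consensus_label(labels, nbrs):
    """Label with relative frequency above 1/2 among spatial neighbors (cf. Eq. (6)); 0 means no consensus.
    Unlabeled neighbors (label 0) count toward the total."""
    vals, counts = np.unique(labels[nbrs], return_counts=True)
    j = counts.argmax()
    return int(vals[j]) if vals[j] != 0 and counts[j] > 0.5 * len(nbrs) else 0

def two_stage_labeling(p, D, mode_idx, spatial_nbrs):
    """p: densities; D: pairwise diffusion distances; spatial_nbrs[i]: pixel indices within the spatial radius of i."""
    n = len(p)
    labels = np.zeros(n, dtype=int)                      # 0 denotes "unlabeled"
    for k, m in enumerate(mode_idx):
        labels[m] = k + 1
    order = np.argsort(-p)                               # decreasing density
    denser = {int(i): order[:r] for r, i in enumerate(order)}

    def spectral_label(i):
        """Label of the nearest labeled point of higher density, in diffusion distance."""
        cand = [j for j in denser[i] if labels[j] > 0]
        return int(labels[cand[int(np.argmin(D[i, cand]))]]) if cand else 0

    # Stage 1: spectral label, unless an existing spatial consensus disagrees.
    for i in order:
        if labels[i] == 0:
            spec, cons = spectral_label(i), consensus_label(labels, spatial_nbrs[i])
            if cons == 0 or cons == spec:
                labels[i] = spec
    # Stage 2: remaining points take the spatial consensus label if it exists, else the spectral label.
    for i in order:
        if labels[i] == 0:
            cons = consensus_label(labels, spatial_nbrs[i])
            labels[i] = cons if cons else spectral_label(i)
    return labels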

Points of high density are likely to be labeled according to their spectral properties. The reasons for this are twofold. First, these points are likely to be near the centers of distributions, and hence are likely to be in spatially homogeneous regions. Moreover, points of high density are labeled before points of low density, so when a high-density point is considered for labeling, few of its spatial neighbors are labeled yet. This means that the spatial consensus label is unlikely to exist for these points. Conversely, points of low density may be at the boundaries of the classes, and are hence more likely to be labeled by their spatial properties. In this way, the proposed labeling scheme benefits from both the spectral and spatial regularity of HSI.

The proposed method, combining Algorithms 1 and 2, is called spatial-spectral diffusion learning (DLSS). In our experimental analysis, the significance of the spatial-spectral labeling scheme is validated by comparing DLSS against a simpler method, called diffusion learning (DL). This method learns class modes as in Algorithm 1, but labels all pixels simply by requiring that each point have the same label as its nearest spectral neighbor of higher density. It is expected that DLSS will generally outperform DL, due to the former's incorporation of spatial information.

II-D Anomaly Detection Algorithm

Motivated by the success of partial least squares regression (PLSR) for anomaly detection in hyperspectral movies [2], we propose to incorporate the learned diffusion modes into PLSR. PLSR is a supervised linear dimension reduction technique that reduces the dimension in such a way that not just the variance, but also the discrimination between classes, is maximized in the low-dimensional embedding. PLSR requires labeled data, whose classes are then kept apart in the low-dimensional embedding. To incorporate the proposed unsupervised method into PLSR, we associate to each mode learned with diffusion distances a core of points, consisting of the nearest neighbors of that mode in diffusion distance. These cores are points that, with high confidence, have the same class label. The cores are then used as labeled training data for PLSR.

Instead of arbitrarily assigning numbers to each of the $K$ classes and regressing these, we assign the elements of the $k^{th}$ core to the $k^{th}$ vertex of a simplex in $\mathbb{R}^K$, in order to perform vector-valued regression with PLSR without imposing a one-dimensional structure on the space of class labels. More precisely, the points in the $k^{th}$ core are labeled with the standard basis vector $e_k \in \mathbb{R}^K$, with 1 in the $k^{th}$ coordinate and 0 elsewhere. This generates a set of training pairs $\{(x, e_k) : x \in C_k\}$ for each class.

In this algorithm, the data are assumed to be centered, i.e. the columns of $X$ have mean 0. This training data is used in PLSR to acquire a regression coefficient $\beta$; details of the PLSR construction appear in Section A-A. The regression coefficient $\beta$ is a $D \times K$ matrix that is derived only from the learned cores. Multivariate regression on all of $X$ is performed by computing $Y = X\beta$; $Y$ is an $n \times K$ matrix with rows corresponding to points and columns corresponding to the responses of each point. The columns may be interpreted as affinities for each class. Class labels are learned as $\hat{y}_i = \arg\max_{1 \le k \le K} Y_{ik}$. Other methods for computing class labels from the responses were considered, including $K$-means on $Y$, yielding similar results. The overall algorithm is summarized in Algorithm 3.

1 Learn the modes of $X$ according to Algorithm 1; call them $\{x_{m_k}\}_{k=1}^{K}$.
2 Generate, from each mode $x_{m_k}$, a core of points by computing the nearest neighbors of $x_{m_k}$ in diffusion distance. Call these cores $\{C_k\}_{k=1}^{K}$.
3 Learn the PLSR coefficient $\beta$ using the learned cores as training data (e.g., using the SIMPLS algorithm; see Section A for details).
4 Regress on $X$ as $Y = X\beta$.
Assign labels to all data points as $\hat{y}_i = \arg\max_{1 \le k \le K} Y_{ik}$.
Algorithm 3 Anomaly Detection With Unsupervised PLSR Algorithm
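A minimal Python sketch of Algorithm 3 using scikit-learn's PLSRegression (the paper's construction uses SIMPLS in MATLAB; the core size n_core and the number of PLS components are illustrative assumptions).

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cross_decomposition import PLSRegression

def plsr_segmentation(X, diff_coords, mode_idx, n_core=50, n_components=5):
    """Label every pixel by regressing one-hot core labels onto the spectra (cf. Algorithm 3)."""
    K = len(mode_idx)
    D_to_modes = cdist(diff_coords, diff_coords[mode_idx])        # diffusion distance to each mode
    cores = [np.argsort(D_to_modes[:, k])[:n_core] for k in range(K)]

    X_train = np.vstack([X[c] for c in cores])
    Y_train = np.vstack([np.tile(np.eye(K)[k], (n_core, 1)) for k in range(K)])  # simplex-vertex labels

    pls = PLSRegression(n_components=n_components, scale=False)   # centers X and Y internally
    pls.fit(X_train, Y_train)
    Y_hat = pls.predict(X)                                        # n x K matrix of class responses
    return Y_hat.argmax(axis=1)                                   # label = class with the largest response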

II-E Active Learning

Thus far, all proposed methods have been unsupervised. We now present a variation of the proposed unsupervised method for active learning of hyperspectral images. In the active learning regime, the cost of labeling pixels is at a premium, so a small number of well-chosen pixels must be selected for labeling. The proposed unsupervised method labels points beginning with the learned class modes, and mistakes tend to be made on points that are near the class boundaries. To help clarify the class boundaries, we consider an active learning scheme that labels points whose distances from their nearest two modes are similar. This quantifies the ambiguity of not having a clear nearest mode.

More precisely, we fix a time $t$, and for each pixel $x_i$, let $m_1(i), m_2(i)$ be the two modes closest to $x_i$ in diffusion distance. We compute the quantity

$$\Delta_t(x_i) = \left| D_t(x_i, m_1(i)) - D_t(x_i, m_2(i)) \right|. \qquad (7)$$

If $\Delta_t(x_i)$ is close to 0, then there is substantial ambiguity as to which of the two modes $x_i$ is closer to. Suppose the user is afforded the labels of exactly $B$ points. Then the labels requested in our active learning regime are the $B$ minimizers of $\Delta_t$. We consider a range of label budgets for all real HSI datasets with ground truth. It is desirable to have the parameter dictating the number of labels for active learning be independent of image size. Hence, we parametrize the number of labels available to the active learning algorithm by a percentage $p$, and for a given $p$ we consider $\lceil p\,n/100 \rceil$ labels, where $n$ is the number of pixels in the image. That is, $p$ is the percentage of pixels we can access. The active learning setting is most interesting when $p$ can be taken very small; hence $p$ is taken to be very small in our experiments. The proposed active learning scheme is summarized in Algorithm 4.

1 Compute the $K$ modes of the data using Algorithm 1.
2 Give each mode a unique label.
3 Compute, for each point $x_i$, the quantity $\Delta_t(x_i)$ of Equation (7), where $m_1(i), m_2(i)$ are the two modes closest to $x_i$ in diffusion distance.
4 Label the $B$ minimizers of $\Delta_t$ with ground truth labels.
Label the remaining, unlabeled points as in Stages 1 and 2 of Algorithm 2.
Algorithm 4 Spatial Spectral Diffusion Learning with Active Learning
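The query-selection step of Algorithm 4 can be sketched as follows in Python; the budget computation from the percentage p and the helper names are assumptions.

import numpy as np
from scipy.spatial.distance import cdist

def active_queries(diff_coords, mode_idx, budget):
    """Indices of the `budget` pixels whose two nearest modes are most ambiguous (cf. Eq. (7))."""
    D = cdist(diff_coords, diff_coords[mode_idx])        # diffusion distance from each pixel to each mode
    two_nearest = np.sort(D, axis=1)[:, :2]
    delta = np.abs(two_nearest[:, 0] - two_nearest[:, 1])
    return np.argsort(delta)[:budget]

# Example usage: with p the percentage of pixels the user may label and n pixels in total,
# budget = int(np.ceil(p / 100.0 * n)); the queried pixels receive ground truth labels
# before the remaining points are labeled as in Algorithm 2.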

Note that the active learning algorithm can be iterated, by labeling points then recomputing the quantity (7) to determine the most challenging points after some labels have been introduced. For simplicity of presentation, we do not consider such an iterated algorithm in the present article.

III Experiments

III-A Algorithm Evaluation Methods and Experimental Data

We consider several HSI datasets to evaluate the proposed unsupervised algorithm (Algorithms 1, 2) and the anomaly detection (Algorithm 3) and active learning (Algorithm 4) variants. For evaluation in the presence of ground truth (GT), we consider three quantitative measures, as well as visual performance. The quantitative measures of performance are:

  1. Overall Accuracy (OA): Total number of correctly labeled pixels divided by the total number of pixels. This method values large classes more than small classes.

  2. Average Accuracy (AA): For each class, the overall accuracy of that class is computed, as described above. These class-wise performances are then averaged. This method values small classes and large classes equally.

  3. Cohen's $\kappa$-statistic ($\kappa$): A measure of interrater agreement between two labelings, corrected for random agreement [58]. Letting $p_o$ be the observed agreement between the labeling and the ground truth and $p_e$ the expected agreement between a uniformly random labeling and the ground truth, $\kappa = (p_o - p_e)/(1 - p_e)$. A value of $\kappa = 1$ corresponds to perfect overall accuracy, while a labeling that does not improve on what is expected from random guessing would have $\kappa \approx 0$.

In order to perform quantitative analysis with these metrics and make consistent visual comparisons, the learned clusters are aligned with the ground truth, when available. After performing our unsupervised clustering, we perform an a posteriori alignment of the labels with the ground truth that maximizes the overall accuracy of the labeling. More precisely, let $\Pi_K$ be the set of permutations of $\{1, \dots, K\}$. Let $\{\hat{C}_k\}_{k=1}^{K}$ be the clusters learned from one of the clustering methods, and let $\{C_k\}_{k=1}^{K}$ be the ground truth clusters. Cluster $\hat{C}_k$ is assigned label $\hat{\pi}(k)$, with $\hat{\pi}$ defined as

$$\hat{\pi} = \arg\max_{\pi \in \Pi_K} \sum_{k=1}^{K} \left| \hat{C}_k \cap C_{\pi(k)} \right|.$$

We remark that this alignment method maximizes the overall accuracy of the labeling and is most useful for visualization, but better alignments for maximizing AA and $\kappa$ may exist.
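The three metrics and the a posteriori alignment can be computed as follows (a Python sketch); the Hungarian algorithm applied to the confusion matrix yields exactly the permutation maximizing overall accuracy, and scikit-learn's cohen_kappa_score uses the standard marginal chance correction as a stand-in for the definition above.

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix, cohen_kappa_score

def align_and_score(gt, pred, K):
    """Permute predicted cluster labels to maximize OA, then report OA, AA, and kappa."""
    C = confusion_matrix(gt, pred, labels=np.arange(K))  # C[i, j] = #{gt == i and pred == j}
    _, perm = linear_sum_assignment(-C)                  # assignment maximizing total agreement
    aligned = np.empty_like(pred)
    for k in range(K):
        aligned[pred == perm[k]] = k                     # predicted cluster perm[k] -> ground truth class k
    Ca = confusion_matrix(gt, aligned, labels=np.arange(K))
    oa = np.trace(Ca) / Ca.sum()
    aa = np.mean(np.diag(Ca) / np.maximum(Ca.sum(axis=1), 1))
    kappa = cohen_kappa_score(gt, aligned)
    return oa, aa, kappa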

For experimental analysis, we first consider synthetic data where classes are generated randomly according to a specified distribution. Next, we consider five real HSI datasets which reflect strengths and weaknesses of the proposed algorithm. Four of the datasets are standard and have ground truth; they are publicly available at http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes. We also consider a dataset without ground truth, for which the proposed quantitative measures do not apply. This image contains a subtle chemical plume, and the goal for this dataset is not clustering of the whole dataset, but rather the segmentation of this chemical plume. For this dataset, we propose the modified unsupervised mode estimation algorithm combined with partial least squares regression (Algorithm 3) to help with anomaly detection. Experiments with active learning are performed on the four real HSI datasets with ground truth, using Algorithm 4.

In order to reduce the number of classes to be suitable for our unsupervised method, some images are restricted to subsets in the spatial domain, which is noted in their respective subsections. The Salinas A dataset is considered in its entirety. For the data with ground truth, not all pixels are labeled in the ground truth. For these datasets, labels are computed for all pixels, then the pixels with ground truth labels are used for quantitative and visual analysis. The number of class labels in the ground truth images was used as a parameter for all clustering algorithms, though the proposed method can automatically estimate the number of clusters; see Section III-F for details. Images of ground truth and the sum of all bands for the Indian Pines, Pavia University, Pavia, and Salinas A datasets are in Figures 5, 22, 7, and 8, respectively. The sum of all bands is presented as a simple visual summary of the data, though it washes out the subtle information present in individual bands. Finally, we consider an HSI image containing a chemical plume. This image does not have ground truth, so the goal is to cluster it so that the chemical plume is clearly segmented. For clarity of exposition, experiments for synthetic HSI data and Pavia University appear in the Appendix.

Note that additional experiments were performed, using only the data with ground truth labels. These experiments consisted in restricting the HSI to the pixels with labels, which makes the clustering problem significantly easier. Quantitative results were uniformly better for all datasets and methods in these cases, though the relative performances of the algorithms on a given dataset remained the same. The results on the restricted data are not shown.

III-B Comparison Methods

We consider a variety of benchmark and state-of-the-art methods of HSI clustering for comparison. First, we consider the classic $K$-means algorithm [57] applied directly to $X$. This method is not expected to generate especially good results for HSI data, in part due to the non-spherical shape of clusters in HSI data. Non-spherical classes are a well-known problem for $K$-means clustering. High dimensionality is also a problem for $K$-means. Indeed, due to the well-known curse of dimensionality, distances between uniformly sampled points are large in high dimensions. This makes it difficult to distinguish between clusters and noise in high dimensions with $K$-means.

Several methods to reduce the dimensionality of the data, while preserving important discriminatory properties of the classes and increasing the signal-to-noise ratio in the projected space, are used for comparison. These dimension reduction techniques are used as benchmarks for comparison with the proposed method. All of these methods involve first reducing the dimension of the data, then running $K$-means on the reduced data. The datasets are reduced to $K$ dimensions, where $K$ is the number of classes, if known. In the case of chemical plume detection, when the number of classes is unknown, $K$ is estimated; see Section III-F for details. We first consider linear dimension reduction via principal component analysis (PCA), which seeks a low-dimensional representation of the dataset that retains maximum variance. We then consider independent component analysis (ICA) [59, 60], which seeks to decompose the data into maximally independent components; a standard, fast implementation was used (https://www.cs.helsinki.fi/u/ahyvarin/papers/fastica.shtml). In addition, we consider linear dimension reduction by a random projection via a Gaussian random matrix, which has been shown to be efficient for highly simplified data models [61, 62].
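For reference, these dimension-reduction benchmarks amount to the following scikit-learn pipeline (a sketch with illustrative defaults, not the exact implementations compared in the experiments).

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA, FastICA
from sklearn.random_projection import GaussianRandomProjection

def benchmark_clusterings(X, K, seed=0):
    """K-means on the raw data and on PCA-, ICA-, and random-projection-reduced data."""
    km = lambda Z: KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(Z)
    return {
        'kmeans': km(X),
        'pca+kmeans': km(PCA(n_components=K, random_state=seed).fit_transform(X)),
        'ica+kmeans': km(FastICA(n_components=K, random_state=seed).fit_transform(X)),
        'rp+kmeans': km(GaussianRandomProjection(n_components=K, random_state=seed).fit_transform(X)),
    }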

We also consider more computationally intensive methods for benchmarking. Spectral clustering (SC) [63, 64] has had success in classifying and clustering HSI [39]. The spectral embedding is computed using the normalized graph Laplacian; the top $K$ eigenvectors of the Laplacian are then normalized and used as features for $K$-means clustering. Details of this construction, including parameters of the graph Laplacian, appear in Section III-H. We also cluster with Gaussian mixture models (GMM) [19, 65, 66]. This method attempts to fit a mixture of Gaussians to the dataset, with means and covariance matrices determined by expectation maximization (EM).

In addition to these benchmark algorithms, we consider several recent, state-of-the-art clustering methods. The method of sparse manifold clustering and embedding (SMCE) [22, 23] (code available at http://vision.jhu.edu/code/) is considered. This method attempts to fit the data to low-dimensional, sparse structures, then applies spectral clustering. The method of hierarchical clustering with non-negative matrix factorization (HNMF) [24] (code available at https://sites.google.com/site/nicolasgillis/code) is also considered. This method has shown excellent performance for HSI clustering, especially in the simplified case when the clusters are generated from a single endmember. Another graph-based method [25], based on the Mumford-Shah segmentation [67], is also considered. This method is related to spectral clustering, and is called fast Mumford-Shah (FMS) in this article. Its implementation (http://www.ipol.im/pub/art/2017/204/?utm_source=doi) is highly parallelized, and is the only comparison method implemented in C, rather than MATLAB. Moreover, we consider the fast search and find of density peaks clustering (FSFDPC) algorithm [26], which has been shown effective in clustering certain data.

III-C Relationship Between Proposed Method and Comparison Methods

The FSFDPC method has similarities with the mode estimation aspect of the present work, in that both algorithms attempt to learn the modes of the data classes via a density based analysis, as described in, for example, [21, 26]. Our method is quite different. The proposed measure of distance between high density points is not Euclidean distance, but diffusion distance [35, 36], which is more adept at removing spurious modes, due to its incorporation of the geometry of the data. This phenomenon is illustrated in Figure 1 and 3. The assignment of labels from the modes is also quite different, as diffusion distances are used to determine spectral nearest neighbors, and spatial information is accounted for in the proposed DLSS algorithm. The FSFDPC algorithm, in contrast, assigns to each of the modes its own class label, and to the remaining points a label by requiring that each point has the same label as its Euclidean nearest neighbor of higher density. This means that FSFDPC only incorporates spectral information measured in Euclidean distance, disregarding spatial information. The benefits of both using diffusion distances to learn modes, and incorporating spatial proximities into the clustering process are shown in the experiments.

The proposed DLSS and DL algorithms also share commonalities with spectral clustering, SMCE, and FMS in that these comparison methods compute eigenvectors of a graph Laplacian in order to develop a nonlinear notion of distance. This may be seen as similar to computing the eigenvectors of the Markov transition matrix in the computation of diffusion maps. The proposed method, however, directly incorporates density into the detection of modes, which allows for more robust clustering compared to these comparison methods, which work by applying $K$-means to the eigenvectors of the graph Laplacian. Our technique also does not rely on any assumption about sparsity, and is completely invariant under distance-preserving transformations (it shares this property with SMCE), which could be useful if different imaging modalities (e.g. compressed modalities) are used.

III-D Summary of Proposed and Comparison Methods

The experimental methods are summarized in Table I. We propose two novel methods: the full spatial-spectral diffusion learning method (DLSS), as well as a simplified diffusion learning method (DL). The DLSS method consists in learning the class modes with diffusion distances, then assigning labels in the joint spatial-spectral way, as illustrated in Figure 4. The simplified DL method learns the class modes with diffusion distances, but assigns labels simply by enforcing that a point has the same label as its nearest neighbor of higher density in diffusion distance. In this way, DL does not incorporate spatial information. We note that several algorithms were not implemented by the authors of this article. Instead, publicly available libraries were used; links to these libraries are noted where appropriate.

Method | Dimension Reduction | Distance Metric
$K$-means on full dataset with random initializations | No | Euclidean
$K$-means on PCA-reduced dataset | Yes | Euclidean
$K$-means on ICA-reduced dataset | Yes | Euclidean
$K$-means on data reduced by random Gaussian projection | Yes | Euclidean
Spectral clustering (SC) [64] | Yes | Euclidean
Gaussian mixture models (GMM) | No | Euclidean
Sparse manifold clustering and embedding (SMCE) [22, 23] | Yes | Euclidean
Hierarchical NMF (HNMF) [24] | No | Euclidean
Fast Mumford-Shah (FMS) [25] | No | Euclidean
FSFDPC [26] | No | Euclidean
Diffusion learning (DL) | Yes | Diffusion
Spatial-spectral diffusion learning (DLSS) | Yes | Diffusion

TABLE I: Methods used for experimental analysis. The methods proposed in this article appear in bold.

All experiments and subsequent analyses are done in MATLAB, with the exception of FMS, which was implemented in C++. Code to reproduce all results will be made publicly available on the authors' website (http://www.math.jhu.edu/~jmurphy/).

III-E Unsupervised HSI Clustering Experiments

Experimental results for synthetic data appear in Section A-B1.

III-E1 Indian Pines Dataset

The Indian Pines dataset used for experiments is a subset of the full Indian Pines datasets, consisting of three classes that are difficult to distinguish visually; see Figure 5. It is expected that this dataset will be challenging for all algorithms, due to the lack of clear separation between the classes. Results for Indian Pines appear in Figure 6 and Table II. The highest quantitative score for a given metric is listed in bold in all tables, and the second highest score is underlined.

Fig. 5: The Indian Pines data is a subset of the full Indian Pines dataset. It contains 3 classes, one of which is not well-localized spatially. The dataset was captured in 1992 over Northwest Indiana, USA, by the AVIRIS sensor. The spatial resolution is 20 m/pixel.
(a) K-means
(b) PCA+M
(c) ICA+M
(d) RP+M
(e) SC
(f) GMM
(g) SMCE
(h) HNMF
(i) FMS
(j) FSFDPC
(k) DL
(l) DLSS
(m) GT
Fig. 6: Clustering results for Indian Pines dataset. The impact of the spatial-spectral labeling scheme is apparent, as the labels for the DLSS method are more spatially regular than those of the DL method. Note that the regions of difference between DL and DLSS are primarily near boundaries of classes and in very small interior regions. Near the boundaries of classes, pixels are likely to be far from the spectral class cores, and hence are more likely to be labeled based on spatial properties. The small interior regions are unlikely to be formed under the DLSS labeling regime, since these regions consist of points whose spectral label differs from their spatial consensus label. The simplified DL method performs second best, and in particular outperforms FSFDPC, which performs best among the comparison methods.

Metric | K-means | PCA+K-means | ICA+K-means | RP+K-means | SC | GMM | SMCE | HNMF | FMS | FSFDPC | DL | DLSS
OA | 0.4303 | 0.4316 | 0.4077 | 0.5060 | 0.5418 | 0.4396 | 0.5179 | 0.4077 | 0.4183 | 0.5750 | 0.6680 | 0.8486
AA | 0.3770 | 0.3779 | 0.3631 | 0.5147 | 0.454 | 0.3458 | 0.4457 | 0.3215 | 0.3359 | 0.5097 | 0.6154 | 0.8161
κ | 0.0932 | 0.0950 | 0.0636 | 0.2583 | 0.2445 | 0.0226 | 0.2244 | -0.0218 | 0.0001 | 0.2578 | 0.4415 | 0.7534

TABLE II: Quantitative analysis for Indian Pines dataset.

The proposed methods, DL and DLSS, perform the best, with DLSS strongly outperforming the rest. The use of diffusion distances for mode detection and determination of spectral neighbors is evidently useful, as DL significantly outperforms FSFDPC, which has the best quantitative performance among the comparison methods. Moreover, the use of the proposed spatial-spectral labeling scheme clearly improves over spectral-only labeling. As seen in Figure 6, the spatial-spectral labeling scheme correctly labels many small interior regions that diffusion learning alone labels incorrectly.

III-E2 Pavia Dataset

The Pavia dataset used for experiments consists of a subset of the original dataset, and contains six classes, with one of them spread out across the image. As can be seen in Figure 7, the yellow class is small and diffuse, which is expected to add challenge to this example. Results appear in Figure 20 and Table III.

Fig. 7: The Pavia data is a subset of the full Pavia dataset. It contains 6 classes, some of which are not well-localized spatially. The dataset was captured by the ROSIS sensor during a flight over Pavia, Italy. The spatial resolution is 1.3 m/pixel. There are 102 spectral bands.

Metric | K-means | PCA+K-means | ICA+K-means | RP+K-means | SC | GMM | SMCE | HNMF | FMS | FSFDPC | DL | DLSS
OA | 0.7765 | 0.7761 | 0.6683 | 0.7593 | 0.8154 | 0.6431 | 0.8308 | 0.7217 | 0.7456 | 0.7783 | 0.8494 | 0.9461
AA | 0.6239 | 0.6237 | 0.5539 | 0.6147 | 0.7619 | 0.5871 | 0.7708 | 0.7422 | 0.6020 | 0.7465 | 0.7787 | 0.8260
κ | 0.7169 | 0.7164 | 0.5833 | 0.6983 | 0.7743 | 0.5469 | 0.7919 | 0.6568 | 0.6599 | 0.7304 | 0.8137 | 0.9309

TABLE III: Quantitative analysis for Pavia dataset.

The proposed methods give the best results, and the value of both the diffusion learning stage and the spatial-spectral labeling scheme is evident. Indeed, the proposed DLSS algorithm makes essentially only two errors: the yellow-green class is slightly mislabeled, and the blue-green class in the bottom right is labeled completely incorrectly. However, both of these errors are made by all algorithms, often to an even greater degree. Among the comparison methods, SMCE performs best, and classical spectral clustering also performs well.

III-E3 Salinas A Dataset

The Salinas A dataset consists of 6 classes arranged in diagonal spatial rows. The dataset appears in Figure 8.

Fig. 8: The Salinas A data consists of the full Salinas A HSI. It contains 6 classes, all of which are well-localized spatially. The dataset was captured over Salinas Valley, CA, by the AVIRIS sensor. The spatial resolution is 3.7 m/pixel. The image contains 224 spectral bands.
[Figure panels (a)–(m): k-means, PCA+M, ICA+M, RP+M, SC, GMM, SMCE, HNMF, FMS, FSFDPC, DL, DLSS, GT.]
Fig. 9: Clustering results for SalinasA dataset. The proposed method, DLSS, performs best, with the simplified DL method and benchmark spectral clustering also performing well. Notice that the spatial-spectral labeling scheme removes some of the mistakes in the yellow cluster, and also improves the labeling near some class boundaries. However, it is not able to fix the mislabeling of the light blue cluster in the lower right. Indeed, all methods split the cluster in the lower right of the image, indicating the challenging aspects of this dataset for unsupervised learning.

Metric | k-means | PCA+k-means | ICA+k-means | RP+k-means | SC | GMM | SMCE | HNMF | FMS | FSFDPC | DL | DLSS
OA | 0.6260 | 0.6260 | 0.5688 | 0.6268 | 0.8345 | 0.6442 | 0.4688 | 0.6316 | 0.7031 | 0.6322 | 0.8313 | 0.8495
AA | 0.6577 | 0.6577 | 0.5567 | 0.6587 | 0.8828 | 0.6056 | 0.4209 | 0.6638 | 0.8053 | 0.6055 | 0.8790 | 0.8987
κ | 0.5246 | 0.5246 | 0.4444 | 0.5254 | 0.7978 | 0.5530 | 0.3039 | 0.5315 | 0.6466 | 0.5377 | 0.7939 | 0.8104

TABLE IV: Quantitative analysis for Salinas A dataset.

For the full Salinas A dataset, the proposed DLSS method performs best, with the only error made in splitting the bottom right cluster into two pieces; this error was made by all algorithms, however. The simplified DL method also performs well, as does the benchmark spectral clustering algorithm. Comparing the labelings of DL and DLSS, the small regions of mislabeled pixels in DL are correctly labeled in DLSS, because these pixels are likely of low empirical density, and hence are labeled based not on spectral similarity but on spatial similarity. However, some pixels that were labeled correctly by DL were changed to incorrect labels by DLSS, indicating that the spatial proximity condition enforced in DLSS may not lead to improved results for every pixel. Details on this, and on how to tune the size of the neighborhood in which spatial consensus labels are computed, are given in Section III-H2.

III-F Anomaly Detection for Chemical Plume Segmentation

We now consider the problem of anomaly detection in HSI by segmenting a chemical plume. Chemicals invisible to the naked eye can be captured in certain frequency bands of an HSI; the problem is to identify the chemical region. An example, courtesy of the Johns Hopkins Applied Physics Lab, appears in Figure 10. Several pixels are corrupted, making this problem even more difficult. No labeled ground truth is available for this image, so we evaluate only the visual quality of the image segmentation. In order to determine the number of clusters used by our clustering algorithms, we examine the plot of sorted mode-detection values. We look for “kinks” in this plot, which correspond to a major drop in values; this heuristically indicates that, up to a certain granularity, all clusters have been located. The plots of the sorted values and of the finite differences between successive values appear in Figure 12.
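To make the heuristic concrete, the following is a minimal sketch of the kink-detection rule, assuming the per-point mode-detection scores plotted in Figure 12 are available as a one-dimensional array; the function name estimate_num_clusters and the toy scores are ours for illustration, not part of the original pipeline.

```python
import numpy as np

def estimate_num_clusters(scores, max_k=10):
    """Estimate the number of clusters from the largest drop ("kink")
    in the sorted mode-detection scores.

    scores : 1-D array of per-point scores (larger means more mode-like).
    """
    top = np.sort(scores)[::-1][:max_k]    # largest scores, in decreasing order
    drops = top[:-1] - top[1:]             # finite differences between successive values
    return int(np.argmax(drops)) + 1       # number of clusters = position of largest drop

# Toy example: three clearly mode-like points followed by background points
scores = np.array([5.0, 4.7, 4.5, 0.9, 0.8, 0.7, 0.6, 0.5])
print(estimate_num_clusters(scores))       # -> 3
```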

[Figure panels: Band 11, Band 16, Band 18, Band 29.]
Fig. 10: A chemical plume HSI dataset of spatial dimensions . There are 129 spectral bands. The bands shown indicate the visibility of the chemical plume in certain bands, but not in others.
[Figure panels (a)–(n): k-means, PCA+M, ICA+M, RP+M, SC, GMM, SMCE, HNMF, FMS, FSFDPC, DL, DLSS, PLSR+E.C., PLSR+D.C.]
Fig. 11: Results for APL chemical plume dataset. Only the PLSR methods are able to achieve a reasonable plume segmentation. PLSR with Euclidean cores (E.C.), however, is unable to segment correctly in the horizontal direction, resulting in a plume that is spread too far; it does correctly ascertain that the plume diffuses somewhat far in the vertical direction. PLSR with diffusion cores (D.C.) segments correctly in both the horizontal and vertical directions, and gives the best visual result. Several of the other methods were unable to detect anomalous pixels effectively, resulting in clusters of very small sizes. Spectral clustering and SMCE reasonably segment the bottom half of the plume, but fail completely for the top half.
Fig. 12: Plot of sorted values for APL chemical plume HSI and a plot of the finite differences between successive values. The extrema of the difference curve were used to estimate the number of clusters to use for plume segmentation. The value was used in all clustering algorithms after an automatic analysis of these plots.

In addition to the algorithms considered thus far, we propose to use Algorithm 3, which combines the proposed mode detection algorithm with PLSR. Mode detection with both Euclidean and diffusion distances is considered.

We see from Figure 11 that PLSR with diffusion cores segments the plume best, in particular giving the cleanest horizontal segmentation. PLSR with Euclidean cores performs second best, giving a good vertical segmentation, but with several errors in the horizontal direction. The other methods give very poor segmentation of the chemical plume. We remark that PLSR is well-suited for this problem, which requires the segmentation of only a single anomalous class. For this problem, the class of interest is relatively concentrated, and performance in clustering the rest of the image is not considered as important.

III-G Active Learning

To evaluate our proposed active learning method, Algorithm 4, the same four labeled HSI datasets were clustered with increasing amounts of available training data. The number of points used is parametrized by the proportion of pixels in the whole image available for labeling; a proportion of zero corresponds to the unsupervised DLSS algorithm. The empirical results for this active learning scheme appear in Figure 13.

[Figure panels: Indian Pines, Pavia University, Pavia, SalinasA.]
Fig. 13: Active learning parameter analysis. The horizontal axis denotes the proportion of labeled pixels; as it increases, more labeled pixels are introduced. The accuracy is monotonically increasing in this proportion, and a small increase can lead to a huge jump in accuracy, as seen in the Indian Pines and Salinas A datasets.

The empirical results indicate that the proposed active learning can produce dramatic improvements in labeling with very few training labels. Indeed, a large improvement in overall accuracy for Indian Pines can be achieved with only 3 labels. Even more dramatic is the SalinasA dataset, in which 14 labeled points substantially improve the overall accuracy. The other datasets see less dramatic improvement; Pavia University improves slightly as more labels are introduced, while the Pavia dataset labeling is not affected by the small collection of labeled points. In the case of Pavia, however, the overall accuracy was already very high, so active learning would not necessarily be ideal for this setting.

Iii-H Parameter Analysis

We now discuss the parameters used in the comparison and proposed methods. We begin by stating the parameters used for all comparison methods, before analyzing the two key parameters of the proposed method: the diffusion time and the radius used for the computation of the spatial consensus label. For all experimental parameters except these two, a range of values was considered, and those with the best empirical results were used.

The k-means algorithm is used in conjunction with several benchmark methods, and the parameters for k-means are the same for all of them. The algorithm is run for 100 iterations, with random initializations each time, and with the number of clusters equal to the known number of classes in the ground truth. Each of the linear dimension reduction techniques, PCA, ICA, and random projection, embeds the data into a space whose dimension equals the number of clusters. Spectral clustering is run by computing a weight matrix on the data; the top eigenvectors are then normalized to have unit Euclidean norm, and used as features with k-means. The GMM labels are computed with expectation maximization, using a full covariance matrix and a small regularization parameter.
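For concreteness, the following is a minimal sketch of one common form of this spectral clustering baseline (in the style of [63]), using a dense Gaussian weight matrix; the kernel width sigma, the dense eigendecomposition, and the function name are illustrative simplifications rather than the exact parameter choices used in the experiments.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, K, sigma=1.0):
    """Sketch of a spectral clustering baseline: Gaussian weight matrix,
    top eigenvectors of the normalized affinity, row normalization, k-means."""
    W = np.exp(-cdist(X, X, "sqeuclidean") / sigma**2)    # dense Gaussian weight matrix
    d = W.sum(axis=1)
    A = W / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]     # D^{-1/2} W D^{-1/2}
    _, vecs = np.linalg.eigh(A)                           # eigenvalues in ascending order
    U = vecs[:, -K:]                                      # top K eigenvectors as features
    U = U / np.linalg.norm(U, axis=1, keepdims=True)      # normalize rows to unit Euclidean norm
    return KMeans(n_clusters=K, n_init=10).fit_predict(U)
```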

Among the state-of-the-art methods, HNMF uses the recommended settings listed in the available online toolbox (https://sites.google.com/site/nicolasgillis/code); there are no significant parameters to tune. For FSFDPC, the empirical density estimate is computed as described in Section II-C, with a Gaussian kernel and 20 nearest neighbors. For SMCE, the sparsity parameter was set to 10, as suggested in the online toolbox (http://www.vision.jhu.edu/code/). The FMS algorithm depends on several key parameters; we implemented a parameter sweep over them, and ran our experiments with the empirically optimal choice of parameters for each dataset.

For the proposed algorithm, the same parameters for the density estimator as described above are used, in order to make a fair comparison with FSFDPC. Moreover, in the construction of the graph used to compute diffusion distances, the same construction as used for spectral clustering and SMCE is employed, again to make fair comparisons. The remaining parameters, namely the diffusion time and the spatial radius, were set to 30 and 3, respectively, for all experiments. We justify these choices and explain their significance in the following subsections.

III-H1 Diffusion Time

The most important parameter in using diffusion distances is the time parameter t. Recall that the diffusion distance at time t may be computed from the eigenvalues and eigenvectors of the diffusion operator. Under mild assumptions that are almost always satisfied for real data, it is known that 1 = λ_1 > |λ_k| for all k ≥ 2, so the parameter t determines the decay of the lower eigenvalues: the larger t is, the less contribution the smaller eigenvalues have. Allowing t to vary, connections in the dataset are explored by allowing the diffusion process to evolve. For small values of t, all points appear far apart because the diffusion process has not spread far, while for large values of t, all points appear close together because the diffusion process has run for so long that it has dissipated over the entire state space. In general, the interesting choice of t is moderate, which allows the interesting geometry to be discovered, but not washed out in the long term.
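The following is a minimal sketch of how a diffusion embedding at integer time t can be computed from a weight matrix, so that Euclidean distances between the returned coordinates approximate diffusion distances; the dense eigendecomposition and the number of retained eigenvectors are illustrative simplifications (in practice a sparse graph and a truncated eigensolver would be used).

```python
import numpy as np

def diffusion_embedding(W, t, n_eigs=100):
    """Diffusion map at (integer) time t from a symmetric, nonnegative weight matrix W.
    Euclidean distance between rows of the output approximates the diffusion
    distance at time t between the corresponding points."""
    d = W.sum(axis=1)
    A = W / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]   # symmetric conjugate of P = D^{-1} W
    vals, vecs = np.linalg.eigh(A)                      # shares eigenvalues with P
    vals = vals[::-1][:n_eigs]                          # largest eigenvalues first
    vecs = vecs[:, ::-1][:, :n_eigs]
    psi = vecs / np.sqrt(d)[:, None]                    # right eigenvectors of P
    return (vals ** t) * psi                            # coordinates lambda_k^t * psi_k(x)
```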

In Figure 14, the results for a range of diffusion times, running from 0 to 100, are displayed. This range was chosen empirically, rather than being motivated by statistics of the data.

[Figure panels: Indian Pines, Pavia University, Pavia, SalinasA.]
Fig. 14: Time parameter analysis for the four real datasets. In general, the time parameter has little impact on the performance of the proposed algorithm. The notable exception is the Indian Pines dataset, for which there is a small time window in which empirical performance declines. As suggested by these plots, a diffusion time of 30 is used for all experiments.

We see that, in general, the behavior is largely invariant with respect to time. For Indian Pines, performance is largely constant, except for a dip over a small range of times. For the remaining examples, the performance is robust with respect to the diffusion time over the range analyzed. However, for the Pavia University dataset, performance appears to begin eroding at large times, indicating that the diffusion process has mixed for too long, and differences between classes have been washed out.

III-H2 Spatial Diffusion Radius

The width of the spatial consensus radius can also impact the performance of the proposed algorithm. Recall that this is the distance in the spatial domain used to compute the spatial consensus label: the spatial consensus label is computed by finding all labeled points within the spatial consensus radius of the pixel in question, then finding the majority label in this neighborhood. More precisely, for a pixel, the set of its spatial nearest neighbors within this radius is collected. The consensus label is set to be the most common label in this set, provided that label is shared by more than half of the set's points; otherwise, no spatial consensus label is given. The precise definition was given in (6). If the radius is too small, insufficient spatial information may be incorporated; if it is too large, the spectral information becomes drowned out. Results for each dataset over a range of radius values, running from 0 to 10, appear in Figure 15.
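A minimal sketch of this consensus rule as just described follows; the square window, the zero-as-unlabeled convention, and the function name are our illustrative choices, and the precise definition (6) may differ in details.

```python
import numpy as np

def spatial_consensus_label(labels, row, col, radius):
    """Majority label in the spatial window around (row, col), returned only if
    it is shared by more than half of the labeled neighbors; 0 otherwise.

    labels : (H, W) array of current labels, with 0 meaning 'not yet labeled'.
    """
    H, W = labels.shape
    r0, r1 = max(0, row - radius), min(H, row + radius + 1)
    c0, c1 = max(0, col - radius), min(W, col + radius + 1)
    window = labels[r0:r1, c0:c1]
    neighbors = window[window > 0]                      # only already-labeled pixels vote
    if neighbors.size == 0:
        return 0
    vals, counts = np.unique(neighbors, return_counts=True)
    best = np.argmax(counts)
    return int(vals[best]) if counts[best] > neighbors.size / 2 else 0
```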

[Figure panels: Indian Pines, Pavia University, Pavia, SalinasA.]
Fig. 15: Spatial parameter analysis for DLSS. For each curve, increasing the radius of the neighborhood in which the spatial consensus label is computed improves quantitative performance up to a certain point, after which performance decays. The point at which the decay sets in differs for each example. The spatial-spectral trade-off exhibited in these plots indicates that the spatial and spectral information must be balanced to achieve empirically optimal clustering.

We see that there is generally a trade-off between spatial and spectral information, so that it makes sense to take the radius to be a moderate value, sufficiently greater than 0 but less than 10. This trade-off can be interpreted in the following way: empirically optimal results are achieved when both spatial and spectral information contribute harmoniously, and results deteriorate when one or the other dominates. We choose a radius of 3 for all experiments, though other choices would give comparable or better quantitative results for the datasets considered. We remark that in the case of the Pavia University dataset, the results for average accuracy decrease monotonically with increasing window size, and do not exhibit the trade-off. This is because there are several very small classes in this dataset that can be overwhelmed by overzealous spatial regularization. In particular, the average accuracy statistic is acutely sensitive to these small classes.

We note that the role of the spatial radius is analogous to that of a regularization parameter in a general statistical learning problem. Taken too small, the problem is insufficiently regularized, leading to poor results. Taken too large, the regularization term dominates the fidelity term, also leading to poor results.

IV Overall Comments on the Experiments and Conclusion

This article proposes a novel unsupervised method for clustering HSI, using data-dependent diffusion distances to learn modes of clusters, followed by a spatial-spectral labeling scheme based on diffusion in both the spectral and spatial domains. It was expected that the proposed DL algorithm would perform well, particularly compared to FSFDPC, and that the DLSS algorithm would outperform DL, due to DLSS's incorporation of spatial information.

Data | Metric | k-means | PCA+k-means | ICA+k-means | RP+k-means | SC | GMM | SMCE | HNMF | FSFDPC | DL | DLSS
IP | OA | 0.4303 | 0.4316 | 0.4077 | 0.5060 | 0.5418 | 0.4396 | 0.5179 | 0.4077 | 0.5750 | 0.6680 | 0.8375
IP | AA | 0.3770 | 0.3779 | 0.3631 | 0.5147 | 0.4540 | 0.3458 | 0.4457 | 0.3215 | 0.5097 | 0.6154 | 0.8161
IP | κ | 0.0932 | 0.0950 | 0.0636 | 0.2583 | 0.2445 | 0.0226 | 0.2244 | -0.0218 | 0.2578 | 0.4415 | 0.7534
PaviaU | OA | 0.7146 | 0.7143 | 0.6657 | 0.6974 | 0.8096 | 0.4413 | 0.7923 | 0.5472 | 0.5245 | 0.8205 | 0.8421
PaviaU | AA | 0.7186 | 0.7184 | 0.6824 | 0.7128 | 0.7577 | 0.3683 | 0.7555 | 0.5083 | 0.4000 | 0.7355 | 0.6885
PaviaU | κ | 0.5836 | 0.5832 | 0.5078 | 0.5605 | 0.7126 | 0.0416 | 0.6909 | 0.2418 | 0.1708 | 0.7246 | 0.7595
Pavia | OA | 0.7765 | 0.7761 | 0.6683 | 0.7593 | 0.8154 | 0.6431 | 0.8308 | 0.7217 | 0.7783 | 0.8494 | 0.9461
Pavia | AA | 0.6239 | 0.6237 | 0.5539 | 0.6147 | 0.7619 | 0.5871 | 0.7708 | 0.7422 | 0.7465 | 0.7787 | 0.8260
Pavia | κ | 0.7169 | 0.7164 | 0.5833 | 0.6983 | 0.7743 | 0.5469 | 0.7919 | 0.6568 | 0.7304 | 0.8137 | 0.9309
SalinasA | OA | 0.6260 | 0.6260 | 0.5688 | 0.6268 | 0.8345 | 0.6442 | 0.4688 | 0.6316 | 0.6322 | 0.8313 | 0.8495
SalinasA | AA | 0.6577 | 0.6577 | 0.5567 | 0.6587 | 0.8828 | 0.6056 | 0.4209 | 0.6638 | 0.6055 | 0.8790 | 0.8987
SalinasA | κ | 0.5246 | 0.5246 | 0.4444 | 0.5254 | 0.7978 | 0.5530 | 0.3039 | 0.5315 | 0.5377 | 0.7939 | 0.8104

TABLE V: Summary of quantitative analyses of real HSI clustering. Generally the proposed diffusion methods offer the strongest overall performance, particularly DLSS. In all cases, DL outperforms FSFDPC, indicating the importance of using diffusion distances over Euclidean distances for HSI clustering.

The proposed DL method, consisting of the geometric learning of modes but only spectral assignment of labels, generally outperforms all comparison methods. In particular, it outperforms the very popular and recent FSFDPC algorithm, which uses the same technique to assign class labels once modes have been learned. Indeed, DL outperforms FSFDPC in all examples, indicating that Euclidean distance is insufficient for learning the modes of complex HSI data. The quantitative analyses of clustering real HSI experiments with ground truth are summarized in Table V.

Moreover, the joint spatial-spectral labeling scheme improves over using only spectral information to label points. This is indicated by DLSS outperforming DL in all instances but one. The exceptional dataset is Pavia University (see Appendix), and the exceptional statistic is average accuracy. As remarked in Section A-B3, the spatial information appears to wash out the small green class in this example. This phenomenon is further illustrated by subfigure (b) in Figure 15, in which it is shown that increasing the spatial radius in which the spatial consensus label is computed has a uniformly detrimental impact on average accuracy, even as overall accuracy and κ show the typical trade-off between spatial and spectral information. This indicates a potential drawback of the proposed spatial-spectral labeling scheme: small classes may be washed out, even for small spatial radii. However, the overall improved performance suggests that this risk is generally acceptable.

The two variations of the proposed algorithm for anomaly detection and active learning also show great promise. The variation in which PLSR is combined with diffusion learning successfully segments the gas plume in the APL hyperspectral scene, while all other methods fail. This indicates that for the anomaly detection problem, geometric mode detection combined with PLSR is a useful tool. The incorporation of active learning in the DLSS algorithm dramatically improves the accuracy of labeling of the Indian Pines and Salinas A datasets with only 3 and 14 labeled pixels respectively, and offers more modest improvement for the Pavia University dataset. This parsimonious use of training labels has the potential to greatly improve the efficiency of HSI learning tasks, in which the volume of labels necessary to label a significant proportion of the image can be very high.

IV-A Computational Complexity

The proposed DLSS method consists of two major steps: first, the modes of the data are estimated; then, labels are assigned. Suppose the data consists of n points in D dimensions.

To compute the modes, both the empirical density and each point's distance to its nearest diffusion-distance neighbor of higher density must be computed. To compute each point's empirical density, its nearest neighbors must be computed. To do so efficiently, the cover trees algorithm [68] is used. The cover trees algorithm computes all points' nearest neighbors in O(n log n), with a constant that depends exponentially on the intrinsic dimension d of the data. Note that the exponential dependence is not on D, the ambient dimensionality of the data. In the case of HSI, D is quite large, but it is reasonable to assume that the data lies near a low-dimensional structure, so that d ≪ D. The computation of the density given these nearest neighbors is constant time per point for a fixed number of neighbors, so that the computation of the empirical density for all points is dominated by the nearest neighbor search. Assuming the number of neighbors is O(1) with respect to n, this reduces to O(n log n).

Next, each point's nearest neighbor in diffusion distance of higher density must be computed. Under the mild assumption that, aside from the class modes, each point has a point of higher density among its k nearest neighbors, for some k that is O(1) with respect to n, finding each point's nearest neighbor of higher density is tantamount to computing each point's k nearest neighbors in diffusion distance. The worst-case computational complexity of computing each point's k nearest neighbors in diffusion distance is, using cover trees as analyzed above, O(mn log n), where m is the number of eigenvectors computed for diffusion distances. For all examples presented in this paper, m can be assumed independent of n, since it depends only on the intrinsic geometry of the data. Moreover, m is O(1) with respect to n, so that this cost reduces to O(n log n).

Once both the empirical density and the distance of each point to its nearest neighbor of higher density in diffusion distance are computed, the product of these values must be sorted, which can be done in O(n log n). The maximizers of this quantity are the modes. Hence, the total cost of Algorithm 1 is O(n log n).
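A compact sketch of this mode-estimation step is given below; the kNN Gaussian-kernel density (with an illustrative bandwidth) and the brute-force O(n²) search for the nearest point of higher density are simplifications of the density estimator of Section II-C and the cover-tree acceleration discussed above, and the function name is ours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def learn_modes(embedding, K, k_density=20, sigma=1.0):
    """Score each point by (empirical density) x (distance to the nearest point
    of higher density) in the given embedding; the K largest scores are modes.

    embedding : (n, d) diffusion-map coordinates, so Euclidean distances here
                approximate diffusion distances.
    """
    n = embedding.shape[0]
    nn = NearestNeighbors(n_neighbors=k_density + 1).fit(embedding)
    dists, _ = nn.kneighbors(embedding)
    density = np.exp(-(dists[:, 1:] ** 2).sum(axis=1) / sigma**2)   # simple kNN Gaussian-kernel density
    # distance to the nearest neighbor of strictly higher density (brute force, for clarity)
    pairwise = np.linalg.norm(embedding[:, None, :] - embedding[None, :, :], axis=2)
    rho = np.empty(n)
    for i in range(n):
        higher = density > density[i]
        rho[i] = pairwise[i, higher].min() if higher.any() else pairwise[i].max()
    modes = np.argsort(density * rho)[::-1][:K]                     # maximizers of the product
    return modes, density, rho
```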

The labeling step requires computing, for each point, its nearest spectral neighbor of higher density, as well as its spatial consensus label. We have already established that, under the mild assumption above, each point's nearest spectral neighbor of higher density can be computed in O(n log n); computing the spatial consensus labels is O(nr²), where r is the spatial consensus radius. It is reasonable to assume that r is O(1), independent of n, giving Algorithm 2 computational complexity O(n log n). Hence, the overall complexity of the unsupervised clustering algorithm is O(n log n). This is the same as fast spectral clustering, which has the same eigenvector bottleneck as diffusion maps, and as the FSFDPC algorithm. It is significantly faster than SMCE.

Experiments on the empirical runtime of each algorithm were also performed. The time each algorithm takes to cluster each of the four labeled datasets appears in Table VI.

Dataset | k-means | PCA+k-means | ICA+k-means | RP+k-means | SC | GMM | SMCE | HNMF | FMS | FSFDPC | DL | DLSS
IP | 0.44 | 0.01 | 0.22 | 0.11 | 0.77 | 0.40 | 11.74 | 0.52 | 0.15 | 1.03 | 1.46 | 2.75
PaviaU | 8.39 | 0.05 | 1.37 | 0.79 | 119.20 | 4.44 | 486.40 | 0.71 | 0.61 | 65.72 | 70.42 | 112.40
Pavia | 3.89 | 0.01 | 0.87 | 0.79 | 101.20 | 3.07 | 466.90 | 0.93 | 0.73 | 64.12 | 62.36 | 100.70
Salinas | 1.26 | 0.01 | 0.30 | 0.13 | 13.87 | 1.91 | 222.40 | 0.74 | 0.29 | 12.31 | 20.66 | 27.42

TABLE VI: Run times for each method and each dataset, measured in seconds. The linear dimension reduction methods are extremely fast, as are HNMF and GMM. The spectral clustering, FSFDPC, and DL algorithms are comparable, with DLSS slightly slower. The SMCE algorithm is substantially slower. Note that although FMS is quite fast, it is implemented in parallelized C++ code and was run on a machine with 48 cores; this makes its runtime not directly comparable to the other methods, which could be similarly parallelized.

For the additional PLSR step used in Algorithm 3 for the segmentation of the chemical plumes, we discuss the complexity of PLSR in Appendix A. Suffice it to say, it does not add to the overall complexity.

The active learning stage additionally requires determining which points are to be labeled, which takes constant time per point, followed by the assignment of the labels, which also takes constant time per point. Hence, the active labeling stage, namely Algorithm 4, is O(n), and does not add to the complexity of the proposed algorithm.

V Future Research Directions

A drawback of many clustering algorithms, including the one presented in this paper, is the assumption that the number of clusters is known a priori. Initial investigations suggest that looking for the “kink” in the plot of sorted mode-detection values could be used to detect the number of clusters automatically. Indeed, this is true for most of the examples presented in this article, and this method was used in the analysis of chemical plumes in Section III-F. See Figure 16 for another example, this time from the Salinas A dataset.

(a) Plot of the 10 largest sorted mode-detection values, from largest to smallest. It may be possible in some cases to use the largest drop-off as an estimate of the number of clusters; the largest drop-off is indicated with a red star.
(b) The differences between successive sorted values. The largest drop corresponds to the correct number of classes, as determined by the ground truth, and is labeled with a red star.
Fig. 16: Estimating the number of clusters for the Salinas A dataset.

This approach was suggested in [26] in the case of Euclidean distances; however, Euclidean distance imposes unrealistic constraints on the shapes and relative positions of the mixture components. It is of interest to prove under what assumptions on the distributions and mixture model the plot correctly determines the number of clusters. Moreover, in the case that one cluster is noticeably smaller or harder to detect than the others, as in the case of the APL chemical plume dataset, it may be advantageous to use a different local maximum of the finite difference curve, rather than the global maximum. Initial mathematical results and more subtle conjectures are proposed in an upcoming article [37].

Relatedly, the present work is essentially heuristic; it is not known mathematically under what constraints on the mixture model the proposed method for learning modes succeeds with high probability. Besides being of mathematical interest, this would be useful for understanding the limitations of the proposed method for HSI. Indeed, the experimental results of this paper indicate that diffusion distance is more effective than Euclidean distance for discriminating between high-density points for use in the FSFDPC algorithm. To understand this phenomenon rigorously, a careful analysis of diffusion distances for data drawn from a non-parametric mixture model is required. This line of research is related to recent research on performance guarantees for spectral clustering and mode detection [69, 70].

V-A Acknowledgments

This research was partially supported by NSF-ATD-1222567, AFOSR FA9550-14-1-0033, and NSF-IIS-1546392.

References

  • [1] G. Lu and B. Fei. Medical hyperspectral imaging: a review. Journal of Biomedical Optics, 19(1):010901–010901, 2014.
  • [2] Y. Wang, G. Chen, and M. Maggioni. High-dimensional data modeling techniques for detection of chemical plumes and anomalies in hyperspectral images and movies. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 9(9):4316–4324, 2016.
  • [3] M.T. Eismann. Hyperspectral remote sensing. SPIE Bellingham, 2012.
  • [4] L. Ma, M.M. Crawford, and J. Tian. Local manifold learning-based k-nearest-neighbor for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 48(11):4099–4109, 2010.
  • [5] W. Li, E.W. Tramel, S. Prasad, and J.E. Fowler. Nearest regularized subspace for hyperspectral classification. IEEE Transactions on Geoscience and Remote Sensing, 52(1):477–489, 2014.
  • [6] F. Melgani and L. Bruzzone. Classification of hyperspectral remote sensing images with support vector machines. IEEE Transactions on Geoscience and Remote Sensing, 42(8):1778–1790, 2004.
  • [7] M. Fauvel, J.A. Benediktsson, J. Chanussot, and J.R. Sveinsson. Spectral and spatial classification of hyperspectral data using SVMs and morphological profiles. IEEE Transactions on Geoscience and Remote Sensing, 46(11):3804–3814, 2008.
  • [8] F. Ratle, G. Camps-Valls, and J. Weston. Semisupervised neural networks for efficient hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 48(5):2271–2282, 2010.
  • [9] Y. Chen, H. Jiang, C. Li, X. Jia, and P. Ghamisi. Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing, 54(10):6232–6251, 2016.
  • [10] H. Liang and Q. Li. Hyperspectral imagery classification using sparse representations of convolutional neural network features. Remote Sensing, 8(2):99, 2016.
  • [11] Y. Qian, M. Ye, and J. Zhou. Hyperspectral image classification based on structured sparse logistic regression and three-dimensional wavelet texture features. IEEE Transactions on Geoscience and Remote Sensing, 51(4):2276–2291, 2013.
  • [12] J. Li, J.M. Bioucas-Dias, and A. Plaza. Semisupervised hyperspectral image classification using soft sparse multinomial logistic regression. IEEE Geoscience and Remote Sensing Letters, 10(2):318–322, 2013.
  • [13] J.M. Bioucas-Dias, A. Plaza, N. Dobigeon, M. Parente, Q. Du, P. Gader, and J. Chanussot. Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5(2):354–379, 2012.
  • [14] S. Jia and Y. Qian. Constrained nonnegative matrix factorization for hyperspectral unmixing. IEEE Transactions on Geoscience and Remote Sensing, 47(1):161–173, 2009.
  • [15] C. Févotte and N. Dobigeon. Nonlinear hyperspectral unmixing with robust nonnegative matrix factorization. IEEE Transactions on Image Processing, 24(12):4810–4819, 2015.
  • [16] J.M. Bioucas-Dias and M.A.T. Figueiredo. Alternating direction algorithms for constrained sparse regression: Application to hyperspectral unmixing. In 2nd Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), pages 1–4. IEEE, 2010.
  • [17] C. Li, Y. Ma, X. Mei, C. Liu, and J. Ma. Hyperspectral unmixing with robust collaborative sparse regression. Remote Sensing, 8(7):588, 2016.
  • [18] A. Paoli, F. Melgani, and E. Pasolli. Clustering of hyperspectral images based on multiobjective particle swarm optimization. IEEE Transactions on Geoscience and Remote Sensing, 47(12):4175–4188, 2009.
  • [19] N. Acito, G. Corsini, and M. Diani. An unsupervised algorithm for hyperspectral image segmentation based on the gaussian mixture model. In IEEE International Geoscience and Remote Sensing Symposium (IGARSS), volume 6, pages 3745–3747, 2003.
  • [20] C. Cariou and K. Chehdi. Unsupervised nearest neighbors clustering with application to hyperspectral images. IEEE Journal of Selected Topics in Signal Processing, 9(6):1105–1116, 2015.
  • [21] Y. Chen, S. Ma, X. Chen, and P. Ghamisi. Hyperspectral data clustering based on density analysis ensemble. Remote Sensing Letters, 8(2):194–203, 2017.
  • [22] E. Elhamifar and R. Vidal. Sparse manifold clustering and embedding. In Advances in Neural Information Processing Systems, pages 55–63, 2011.
  • [23] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765–2781, 2013.
  • [24] N. Gillis, D. Kuang, and H. Park. Hierarchical clustering of hyperspectral images using rank-two nonnegative matrix factorization. IEEE Transactions on Geoscience and Remote Sensing, 53(4):2066–2078, 2015.
  • [25] Z. Meng, E. Merkurjev, A. Koniges, and A. Bertozzi. Hyperspectral video analysis using graph clustering methods. Image Processing On Line, 7:218–245, 2017.
  • [26] A. Rodriguez and A. Laio. Clustering by fast search and find of density peaks. Science, 344(6191):1492–1496, 2014.
  • [27] H. Zhang, H. Zhai, L. Zhang, and P. Li. Spectral–spatial sparse subspace clustering for hyperspectral remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 54(6):3672–3684, 2016.
  • [28] D.W.J. Stein, S.G. Beaven, L.E. Hoff, E.M. Winter, A.P. Schaum, and A.D. Stocker. Anomaly detection from hyperspectral imagery. IEEE signal processing magazine, 19(1):58–69, 2002.
  • [29] A. Banerjee, P. Burlina, and C. Diehl. A support vector method for anomaly detection in hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing, 44(8):2282–2291, 2006.
  • [30] B. Du and L. Zhang. A discriminative metric learning based anomaly detection method. IEEE Transactions on Geoscience and Remote Sensing, 52(11):6844–6857, 2014.
  • [31] W. Li and Q. Du. Collaborative representation for hyperspectral anomaly detection. IEEE Transactions on Geoscience and Remote Sensing, 53(3):1463–1474, 2015.
  • [32] G. Chen, M. Iwen, S. Chin, and M. Maggioni. A fast multiscale framework for data in high dimensions: Measure estimation, anomaly detection, and compressive measurements. In Visual Communications and Image Processing (VCIP), 2012 IEEE, pages 1–6, 2012.
  • [33] J. Li, J.M. Bioucas-Dias, and A. Plaza. Semisupervised hyperspectral image segmentation using multinomial logistic regression with active learning. IEEE Transactions on Geoscience and Remote Sensing, 48(11):4085–4098, 2010.
  • [34] D. Tuia, M. Volpi, L. Copa, M. Kanevski, and J. Munoz-Mari. A survey of active learning algorithms for supervised remote sensing image classification. IEEE Journal of Selected Topics in Signal Processing, 5(3):606–617, 2011.
  • [35] R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, and S.W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings of the National Academy of Sciences of the United States of America, 102(21):7426–7431, 2005.
  • [36] R.R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.
  • [37] M. Maggioni and J.M. Murphy. Clustering by unsupervised geometric learning of modes. 2017.
  • [38] Y. Tarabalka, J.A. Benediktsson, and J. Chanussot. Spectral–spatial classification of hyperspectral imagery based on partitional clustering techniques. IEEE Transactions on Geoscience and Remote Sensing, 47(8):2973–2987, 2009.
  • [39] M. Fauvel, Y. Tarabalka, J.A. Benediktsson, J. Chanussot, and J.C. Tilton. Advances in spectral-spatial classification of hyperspectral images. Proceedings of the IEEE, 101(3):652–675, 2013.
  • [40] Z. Wang, N.M. Nasrabadi, and T.S. Huang. Spatial–spectral classification of hyperspectral images using discriminative dictionary designed by learning vector quantization. IEEE Transactions on Geoscience and Remote Sensing, 52(8):4808–4822, 2014.
  • [41] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790–799, 1995.
  • [42] A.D. Szlam, M. Maggioni, and R.R. Coifman. Regularization on graphs with function-adapted diffusion processes. Journal of Machine Learning Research, 9:1711–1739, 2008.
  • [43] R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings of the National Academy of Sciences of the United States of America, 102(21):7426–7431, 2005.
  • [44] B. Nadler, S. Lafon, R.R. Coifman, and I.G. Kevrekidis. Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. Applied and Computational Harmonic Analysis, 21(1):113–127, 2006.
  • [45] R.R. Coifman, I.G. Kevrekidis, S. Lafon, M. Maggioni, and B. Nadler. Diffusion maps, reduction coordinates, and low dimensional representation of stochastic systems. Multiscale Modeling & Simulation, 7(2):842–864, 2008.
  • [46] A. Singer and R.R. Coifman. Non-linear independent component analysis with diffusion maps. Applied and Computational Harmonic Analysis, 25(2):226–239, 2008.
  • [47] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
  • [48] S. Lafon, Y. Keller, and R.R. Coifman. Data fusion and multicue data matching by diffusion maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11):1784–1797, 2006.
  • [49] W. Czaja, B. Manning, L. McLean, and J.M. Murphy. Fusion of aerial gamma-ray survey and remote sensing data for a deeper understanding of radionuclide fate after radiological incidents: examples from the fukushima dai-ichi response. Journal of Radioanalytical and Nuclear Chemistry, 307(3):2397–2401, 2016.
  • [50] R.R. Lederman and R. Talmon. Learning the geometry of common latent variables using alternating-diffusion. Applied and Computational Harmonic Analysis, 2015.
  • [51] R.R. Lederman, R. Talmon, H. Wu, Y.-L. Lo, and R.R. Coifman. Alternating diffusion for common manifold learning with application to sleep stage assessment. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 5758–5762. IEEE, 2015.
  • [52] M.A. Rohrdanz, W. Zheng, M. Maggioni, and C. Clementi. Determination of reaction coordinates via locally scaled diffusion map. The Journal of Chemical Physics, 134(12):03B624, 2011.
  • [53] W. Zheng, M.A. Rohrdanz, M. Maggioni, and C. Clementi. Polymer reversal rate calculated via locally scaled diffusion map. The Journal of Chemical Physics, 134(14):144109, 2011.
  • [54] J.B. Tenenbaum, V. De Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
  • [55] S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
  • [56] S.J. Sheather and M.C. Jones. A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society. Series B (Methodological), pages 683–690, 1991.
  • [57] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer series in statistics Springer, Berlin, 2001.
  • [58] M. Banerjee, M. Capozzoli, L. McSweeney, and D. Sinha. Beyond kappa: A review of interrater agreement measures. Canadian journal of statistics, 27(1):3–23, 1999.
  • [59] P. Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.
  • [60] A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4):411–430, 2000.
  • [61] S. Dasgupta. Experiments with random projection. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 143–151. Morgan Kaufmann Publishers Inc., 2000.
  • [62] E.J. Candes and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory, 52(12):5406–5425, 2006.
  • [63] A.Y. Ng, M.I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, volume 14, pages 849–856, 2001.
  • [64] U. Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
  • [65] D. Manolakis, C. Siracusa, and G. Shaw. Hyperspectral subpixel target detection using the linear mixing model. IEEE Transactions on Geoscience and Remote Sensing, 39(7):1392–1409, 2001.
  • [66] S. Kraut, L.L. Scharf, and L.T. McWhorter. Adaptive subspace detectors. IEEE Transactions on Signal Processing, 49(1):1–16, 2001.
  • [67] D. Mumford and J. Shah. Optimal approximations by piecewise smooth functions and associated variational problems. Communications on pure and applied mathematics, 42(5):577–685, 1989.
  • [68] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In International Conference on Machine Learning, pages 97–104, 2006.
  • [69] C.R. Genovese, M. Perone-Pacifico, I. Verdinelli, and L. Wasserman. Non-parametric inference for density modes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(1):99–126, 2016.
  • [70] G. Schiebinger, M.J. Wainwright, and B. Yu. The geometry of kernelized spectral clustering. The Annals of Statistics, 43(2):819–846, 2015.
  • [71] H. Wold. Partial least squares. Encyclopedia of statistical sciences, 1985.
  • [72] P. Geladi and B.R. Kowalski. Partial least-squares regression: a tutorial. Analytica chimica acta, 185:1–17, 1986.
  • [73] S. De Jong. Simpls: an alternative approach to partial least squares regression. Chemometrics and intelligent laboratory systems, 18(3):251–263, 1993.

Appendix A Appendix

A-a Partial Least Squares Regression

Partial least squares [71, 72, 73, 57] is a linear regression method that attempts to maximize variation in both the predictor and response variables; this is in contrast to principal component regression, which attempts to maximize variation in only the predictor variables. Assume some normalized and centered training data X is given, along with labels y. PLS starts by computing the univariate regression coefficient of y on each column of X. Summing the columns weighted by these coefficients, we acquire the first partial least squares direction. For the remaining PLS directions, the inputs are weighted according to the strength of their univariate impact on y: the outcome y is regressed on the current direction to obtain its coefficient, the columns of X are then orthogonalized with respect to that direction, and we repeat. We stop when a sufficient number of directions have been obtained. In our experiments, we take the number of directions equal to the number of classes when it is known, or estimate it by the kink in the sorted curve as above.
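The following is a minimal sketch of this iteration, essentially the textbook PLS construction described in [57]; the predictors are assumed standardized, the response centered, and the function name is ours.

```python
import numpy as np

def pls_directions(X, y, M):
    """Extract M partial least squares score vectors.
    X : (n, p) standardized predictors;  y : (n,) centered response.
    Returns the score vectors z_1, ..., z_M and the coefficients theta_m
    obtained by regressing y on each z_m."""
    Xj = X.copy()
    scores, thetas = [], []
    for _ in range(M):
        phi = Xj.T @ y                               # (proportional to) univariate coefficients of y on each column
        z = Xj @ phi                                 # weighted sum of the inputs: next PLS direction
        theta = (z @ y) / (z @ z)                    # regress y on z
        scores.append(z)
        thetas.append(theta)
        Xj = Xj - np.outer(z, (z @ Xj) / (z @ z))    # orthogonalize the columns of X with respect to z
    return np.column_stack(scores), np.array(thetas)
```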

It is helpful to compare PLS to PCA. Let S be the sample covariance matrix of X. In PCA, a representation of X is computed in which the m-th principal component direction v_m is given by

v_m = argmax Var(Xα), subject to ‖α‖ = 1 and α′ S v_ℓ = 0 for ℓ = 1, …, m−1.

The orthogonality conditions ensure that Xv_m is uncorrelated with all lower components Xv_ℓ.

For partial least squares, the m-th PLS direction φ_m is given by

φ_m = argmax Corr²(y, Xα) Var(Xα), subject to ‖α‖ = 1 and α′ S φ_ℓ = 0 for ℓ = 1, …, m−1,

so that PLS seeks directions of high variance that are also highly correlated with the response [57]. When the number of directions equals the number of predictor variables, the PLSR coefficient coincides with the least-squares solution of the overdetermined system Xβ = y.

A-A1 PLSR Computational Complexity

PLSR is implemented with the SIMPLS algorithm [73], whose computational complexity is linear in the number of data points for a fixed number of partial least squares components. In our algorithm, the number of components is equal to the number of classes, either known or determined by studying the decay of the sorted curve, and may be assumed to be O(1), independent of n.

A-B Additional Experiments

A-B1 Synthetic Dataset

The synthetic HSI dataset has 3 classes, generated by drawing samples from a random distribution. The classes constitute spatial rectangles of differing sizes; some sample bands appear in Figure 17. We fix three pure spectra for these three classes, then randomly perturb the locations and values of their maxima, randomly dilate the spectra, and add Gaussian noise. Some typical spectra for each class appear in Figure 18.

[Figure panels: GT, Band 10, Band 50, Band 150.]
Fig. 17: Synthetic HSI data. In certain bands, some of the classes are indistinguishable; in others, they are well separated.
[Figure panels: Class 1, Class 2, Class 3.]
Fig. 18: For each class, the pure spectrum, together with 10 realizations from the corresponding random distribution, is shown. To generate the random data, the pure spectra have the locations of their maxima perturbed, are scaled by a constant factor, and are corrupted by Gaussian noise. The extent of the perturbation of the extrema is controlled by the parameters described below.

More precisely, an element of a class is generated from its pure spectrum, which has five local maxima; the coordinates and values of these maxima are perturbed according to fixed parameters. A sample from the class is generated as follows (a loose code sketch appears after the list):

  1. Randomly translate the locations of the maxima independently according to a uniform distribution, then smooth.

  2. Randomly shift the values of the maxima independently according to a uniform distribution, then smooth.

  3. Randomly perturb the entire spectrum by pointwise multiplication by a random factor.

  4. Add mean 0, variance .01 Gaussian random noise componentwise.
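Below is a loose sketch of these steps. Since the pure spectra and the perturbation parameters are not reproduced here, the pure spectrum is modeled as a sum of Gaussian bumps with five maxima, and the eps_* values and bump width are placeholders; only the additive noise level (mean 0, variance .01) comes from the list above.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_spectrum(n_bands, peak_locs, peak_amps,
                       width=8.0, eps_loc=3.0, eps_amp=0.1, eps_scale=0.1):
    """Generate one noisy sample from a 'pure' spectrum modeled as a sum of
    Gaussian bumps centered at peak_locs with heights peak_amps."""
    bands = np.arange(n_bands)
    locs = peak_locs + rng.uniform(-eps_loc, eps_loc, size=len(peak_locs))        # 1. translate the maxima
    amps = peak_amps * (1 + rng.uniform(-eps_amp, eps_amp, size=len(peak_amps)))  # 2. shift their values
    s = sum(a * np.exp(-(bands - l) ** 2 / (2 * width ** 2)) for a, l in zip(amps, locs))
    s = s * (1 + rng.uniform(-eps_scale, eps_scale))   # 3. rescale (a global factor stands in for the pointwise perturbation)
    return s + rng.normal(0.0, 0.1, size=n_bands)      # 4. mean-0, variance .01 Gaussian noise

# One sample from a class whose pure spectrum has five local maxima
sample = synthetic_spectrum(200, np.array([20, 60, 100, 140, 180]),
                            np.array([1.0, 0.8, 1.2, 0.9, 1.1]))
```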

Fig. 19: Synthetic data numerical results. The displayed images correspond to fixed and varying ; similar results hold for fixed and varying . As the varying parameter increases, the classes become harder to distinguish, reducing accuracy with respect to and . The proposed methods perform well, with the largest range of high accuracy aside from HNMF. The FSFDPC algorithm performs poorly, due to its inability to correctly learn the class modes. The HNMF algorithm performs best, which is reasonable given that the classes in these synthetic examples are generated from perturbations of a single endmember. As argued in [24], this is precisely the regime in which HNMF is expected to perform well. The two proposed algorithms perform about equally, and have the largest radius of good performance, except for HNMF.

Results for fixed values of the perturbation parameters appear in Figure 19; a range of other values was also considered, with similar results. As the varying parameter increases, the clustering problem becomes more challenging. The proposed methods, DL and DLSS, perform well up to a certain point, then rapidly decay. The observed phase transition is due to the failure to identify modes from distinct classes, caused by increased class overlap. In this synthetic example with very simple spatial boundaries, there is little benefit to using the spatial-spectral labeling scheme.

After the parameter is sufficiently large and the classes are very mixed, all algorithms converge to rather poor accuracies. These plots illustrate an important drawback of the proposed method: when the algorithm fails to find distinct modes of the distributions, results may be rather poor. This is true of most clustering methods, however, and the proposed method of mode detection using diffusion distances is evidently more robust than using Euclidean distances.

A-B2 Pavia Visual Clustering Results

Visual results for the Pavia dataset appear in Figure 20.

[Figure panels (a)–(m): k-means, PCA+M, ICA+M, RP+M, SC, GMM, SMCE, HNMF, FMS, FSFDPC, DL, DLSS, GT.]
Fig. 20: Clustering results for Pavia dataset. The proposed methods perform well, with DLSS in particular labeling most pixels correctly. Of the comparison methods, SMCE also performs well, though this method makes major errors on the yellow-green class. These errors are present for every method except the proposed diffusion-based methods; indeed, the proposed DLSS method labels the yellow-green class essentially correctly. All methods incorrectly label the blue-green region in the bottom right, however, indicating that this is a challenging class to capture.

A-B3 Pavia University Dataset

The Pavia University dataset used for experiments consists of a subset of the original Pavia University dataset, and contains five classes of wide-ranging size, many of which are spread throughout the image spatially. Figure 22 shows that the yellow, teal, and green classes are relatively small and quite well distributed throughout the image. These classes are expected to be hard to cluster correctly.

[Figure panels (a)–(m): k-means, PCA+M, ICA+M, RP+M, SC, GMM, SMCE, HNMF, FMS, FSFDPC, DL, DLSS, GT.]
Fig. 21: Clustering results for the Pavia University dataset. The proposed DLSS method performs best in an aggregate sense, giving a correct clustering of the orange region on the right, as well as correctly clustering most of the light blue region. However, the use of the spatial-spectral labeling regime appears detrimental to the labeling of the green class, which is less accurately labeled by DLSS than by DL. This indicates that this small class is overwhelmed when using spatial information.
Fig. 22: The Pavia University data is a subset of the full Pavia University dataset. It contains 5 classes, most of which are not well-localized spatially. The dataset was captured by the ROSIS sensor during a flight over Pavia, Italy. The spatial resolution is 1.3 m/pixel. There are 103 spectral bands.

Metric | k-means | PCA+k-means | ICA+k-means | RP+k-means | SC | GMM | SMCE | HNMF | FMS | FSFDPC | DL | DLSS
OA | 0.7146 | 0.7143 | 0.6657 | 0.6974 | 0.8096 | 0.4413 | 0.7923 | 0.5472 | 0.5774 | 0.5245 | 0.8205 | 0.8375
AA | 0.7186 | 0.7184 | 0.6824 | 0.7128 | 0.7577 | 0.3683 | 0.7555 | 0.5083 | 0.5800 | 0.4000 | 0.7355 | 0.6885
κ | 0.5836 | 0.5832 | 0.5078 | 0.5605 | 0.7126 | 0.0416 | 0.6909 | 0.2418 | 0.2898 | 0.1708 | 0.7246 | 0.7522

TABLE VII: Quantitative analysis for Pavia University dataset.

The proposed diffusion learning methods give the best results for overall accuracy and κ, though for average accuracy, DL and DLSS drop in performance. This can be understood by analyzing the visual results in Figure 21, where it is seen that the green class is rather poorly labeled by DLSS, compared to DL for example. This indicates that the incorporation of spatial information was perhaps overzealous and washed out a small class, leading to a significant drop in average accuracy. We remark that FSFDPC is unable to learn correct modes from each class, leading to some classes being totally mislabeled. Spectral clustering and SMCE both perform rather well, though there are some problems with the dark blue, light blue, and orange classes for both of those methods.
