Nonlinear Unsupervised Clustering of Hyperspectral Images with Applications to Anomaly Detection and Active Learning
Abstract
The problem of unsupervised learning and segmentation of hyperspectral images is a significant challenge in remote sensing. The high dimensionality of hyperspectral data, presence of substantial noise, and overlap of classes all contribute to the difficulty of automatically clustering and segmenting hyperspectral images. In this article, we propose an unsupervised learning technique that combines a geometric estimation of class modes with a diffusion-inspired labeling that incorporates both spatial and spectral information. The mode estimation incorporates the geometry of the hyperspectral data by using diffusion distance to promote learning a unique mode from each class. These class modes are then used to label all points by a joint spatial-spectral nonlinear diffusion process. The proposed method, called spatial-spectral diffusion learning (DLSS), is shown to perform competitively against benchmark and state-of-the-art hyperspectral clustering methods on a variety of synthetic and real datasets. The proposed methods are shown to enjoy low computational complexity and fast empirical runtime. Two variations of the proposed method are also discussed. The first variation combines the proposed method of mode estimation with partial least squares regression (PLSR) to efficiently segment chemical plumes in hyperspectral images for anomaly detection. The second variation incorporates active learning to allow the user to request labels for a very small number of pixels, which can dramatically improve overall clustering results. Extensive experimental analysis demonstrates the efficacy of the proposed methods and their robustness to choices of parameters.
I Introduction
I-A Hyperspectral Learning and Previous Work
Hyperspectral imagery (HSI) has emerged as a significant data source in a variety of scientific fields, including medical imaging [1], chemical analysis [2], and remote sensing [3]. Hyperspectral sensors capture reflectance at a sequence of localized electromagnetic ranges, allowing for precise differentiation of materials according to their spectral signatures. Indeed, the power of hyperspectral imagery for material discrimination has led to its proliferation, making manual analysis of hyperspectral data infeasible in many cases. The large data size of HSI, combined with their high dimensionality, demands innovative methods for storage and analysis. In particular, efficient machine learning algorithms are needed to automatically process and glean insight from the deluge of hyperspectral data now available.
A variety of hyperspectral learning problems have been investigated intensively in recent years. The problem of HSI classification, or supervised segmentation, asks for an algorithm to label each pixel in a given HSI as belonging to a particular class, where randomly chosen training samples from each class are provided together with their labels, to help the algorithm build an effective discrimination tool. A variety of statistical and machine-learning techniques have been used for HSI classification, including nearest-neighbor and nearest subspace methods [4, 5], support vector machines [6, 7], neural networks [8, 9, 10], and regression methods [11, 12].
The problem of HSI unmixing [13] assumes that the pixels in an HSI scene are constituted from a set of basic materials, sometimes called endmembers. It is then of interest to determine quantities related to those materials, such as number of endmembers, their spectral signatures, and their abundances in a given pixel. Techniques in matrix factorization [14, 15] and sparse regression [16, 17] have been applied successfully to the problem of hyperspectral unmixing.
The problem of hyperspectral clustering, or unsupervised segmentation, is similar to the problem of hyperspectral classification, except that no training data is available, i.e. it is an unsupervised learning problem. Hence, decisions about class labels must be made without access to any labeled data. This increases the challenge considerably, and makes the problem ill-posed unless further assumptions are made, for example about the distribution of the data and how it relates to the labels. Recent techniques for hyperspectral clustering include those based on particle swarm optimization [18], Gaussian mixture models (GMM) [19], nearest neighbor clustering [20], density analysis [21], sparse manifold models [22, 23], hierarchical nonnegative matrix factorization (HNMF) [24], graph-based segmentation [25], and fast search and find of density peaks clustering (FSFDPC) [21, 26, 27].
Anomaly detection in hyperspectral images is another variety of unsupervised problem [28], in which an algorithm attempts to identify pixels that deviate from the background of the image (or from some other reference set of spectra). The anomalies are often targets of interest in various applications of HSI, but labeled examples of the targets are typically unavailable. Anomaly detection methods often construct a model of the background, and identify pixels that are not well represented by, or have small likelihood under, the background model. Recent methods in anomaly detection for HSI include those based on support vector machines [29], metric learning [30], collaborative representation [31], high-dimensional density estimators [32], and partial least squares regression [2].
Active learning for hyperspectral image classification allows for the labeling of a small, judiciously chosen set of pixels, which can lead to high quality classification results with fewer labels than in the standard supervised learning setting, in which the labeled pixels are randomly selected [12, 33, 34]. Active learning differs from the typical supervised setting in that the selection of labels is principled rather than random, and the number of labeled pixels provided is typically very small. In this sense, active learning may be understood as a semi-supervised learning technique.
This article addresses the problems of hyperspectral clustering and, relatedly, anomaly detection and active learning. The methods proposed in the present article combine density-based methods with geometric learning through diffusion geometry [35, 36] in order to identify class modes. This information is then used to label all data points through a nonlinear process that incorporates both spatial and spectral information. The use of data-dependent diffusion maps for mode detection significantly improves over current state-of-the-art methods experimentally, and also enjoys robust theoretical performance guarantees [37]. Moreover, the spatial-spectral labeling scheme takes advantage of the geometric properties of the data, and greatly improves the empirical performance of clustering when compared to labeling based on spectral information alone. The empirical results of the application of our method to hyperspectral clustering, anomaly detection, and active learning are in Sections III-E, III-F, and III-G, respectively.
I-B Overview of Proposed Method
The proposed unsupervised clustering method is provided with the data X = {x_n}_{n=1}^N and the number K of classes, and outputs labels {y_n}_{n=1}^N, each y_n ∈ {1, ..., K}, by proceeding in two steps:

Mode Identification: This step consists first in performing density estimation and analyzing the geometry of the data to find modes {x_1*, ..., x_K*}, one for each class.

Labeling Points: Once the modes are learned, they are assigned a unique label. Remaining points are labeled in a manner that preserves spectral and spatial proximity.
By a mode, we mean a point of high density within a class that is representative of the entire class. We assume K is known, but otherwise we have no access to labeled data; in Section III-F, we will discuss a method for learning K automatically.
A crucial ingredient in many machine learning algorithms is a notion of distance, or similarity, between data points. In our case a data point corresponds to the sequence of spectral reflectances at a single spatial location.
One of the key contributions of this article is to measure similarities in the spectral domain not with the widely used Euclidean distances, or distances based on angles (correlations) between points, but with diffusion distances [35, 36], a data-dependent notion of distance that accounts for the geometry, linear or nonlinear, of the distribution of the data. The motivation for this approach is to attain robustness with respect to the shape of the distributions corresponding to the different classes, as well as to high-dimensional noise. The modes, suitably defined via density estimation, are robust to noise, and the process we use to pick only one mode per class is based on diffusion distances, which are robust to the shape of the support of the density of a class. The labeling of the points from the modes respects the geometry of the data, by incorporating proximity in both the spectral and spatial domains.
I-B1 Mode Identification
The computation of the modes is a significant aspect of the proposed method, which we now summarize for a general dataset X = {x_n}_{n=1}^N consisting of K classes. We make the assumption that the modes of the constituent classes can be characterized as a set of points {x_1*, ..., x_K*} such that

The empirical density of each mode x_i* is relatively high.

The diffusion distance between pairs of modes x_i*, x_j*, for i ≠ j, is relatively large.
The first assumption is motivated by the fact that points of high density ought to have nearest neighbors corresponding to a single class; the modes should thus produce neighborhoods of points that with high confidence belong to a single class. However, there is no guarantee that the densest points will correspond to unique classes in the case when some classes have a multimodal distribution. The second assumption addresses this issue, requiring that points far away in diffusion distance belong to different distributions. Note that making this requirement in terms of Euclidean distance would be more stringent and easily violated when the data distribution is nonlinear. Diffusion distances are datadependent and adapt to the geometry of the dataset. Figure 1 illustrates this phenomenon.
We model X as samples from a mixture distribution μ = Σ_{i=1}^K w_i μ_i, where each μ_i corresponds to the probability distribution of the spectra in class i, and the weights w_i correspond to how often each class is sampled and satisfy Σ_{i=1}^K w_i = 1. More precisely, sampling x ∼ μ means first sampling a class index Z ∈ {1, ..., K} with P(Z = i) = w_i, then sampling x from μ_i conditioned on the event Z = i.
The mode identification algorithm outputs a point (“mode”) x_i* for each class. Enforcing that these modes are far apart in diffusion distance has several advantages over enforcing that they are far apart in Euclidean distance. It typically leads to a unique mode from each class, even in the case that certain classes are multimodal, with several modes having higher density than the densest point in another class. Diffusion distances are also robust with respect to the shape of the distribution, and are thus suitable for identifying nonlinear clusters. An instance of these advantages of diffusion distance is illustrated in the toy example of Figure 1, with the results of the proposed mode detection algorithm in Figure 3. We postpone the mathematical and algorithmic details to Section II-B.
I-C Labeling Points
At this stage we assume that we have found exactly one mode for each class, to which a unique class label is assigned. The remaining points are subsequently labeled in a two-stage scheme, which takes into account both spectral and spatial information. Spatial information is well-known to improve hyperspectral learning [7, 12, 27, 38, 39, 40]. Like many of these methods, spatial information is computed for each pixel by constructing a neighborhood of some fixed radius in the spatial domain, and considering the labels within this neighborhood. For a given point, let spectral neighbor refer to a near neighbor with distances measured in the spectral domain, and let spatial neighbor refer to a near neighbor with distances measured in the spatial domain.
In the first stage, a point is given the same label as its nearest spectral neighbor of higher density, unless that label is sufficiently different from the labels of the point’s nearest spatial neighbors. This produces an incomplete labeling in which we expect the labeled points to be far from the spatial and spectral boundaries of the classes, since these are points that are unlikely to have conflicting spatial and spectral labels. Pixels whose spectral label differs strongly from the labels of nearby spatial pixels are left unlabeled.
In the second stage of the labeling process we label the points left unlabeled in the first stage. Each such point is assigned the consensus label of its nearest spatial neighbors (see Section II-C), if it exists; otherwise, it is given the label of its nearest spectral neighbor of higher density. In this way the yet-unlabeled points, typically near the spatial and spectral boundaries of the classes, benefit from the spatial information in the already-labeled points, which are closer to the centers of the classes. In this sense, information diffuses from the modes towards the boundaries of the classes. An instance of the two-stage, spatial-spectral labeling appears in Figure 2.
This method of clustering, combining the diffusion-based learning of modes with the joint spatial-spectral labeling of pixels, is called spatial-spectral diffusion learning, and is abbreviated DLSS. This method is contrasted with another novel method we propose, called diffusion learning (DL), in which modes are learned as in DLSS, but the labeling proceeds simply by enforcing that each point has the same label as its nearest spectral neighbor of higher density. DL therefore disregards spatial information, while DLSS makes significant use of it. Our experiments show that while both DL and DLSS perform very well, DLSS is generally superior.
DLSS is a joint spatial-spectral diffusion clustering technique. Each class mode distributes its label throughout the class distribution, until the boundary of the distribution is encountered in either the spatial or spectral domain. The first-stage labeling in the spectral domain is somewhat similar to mean-shift clustering [41], and is the same procedure used in FSFDPC [26] for labeling points once the modes have been determined. The process of halting the labeling of a point if the spectral label differs strongly from nearby spatial labels enforces smooth diffusion in the spatial domain, and may be understood as a form of spatial regularization. Details on how the spatial-spectral diffusion learning is performed appear in Section II-C. This approach may also be connected to semi-supervised learning techniques on graphs, where initially given labels are propagated by a diffusion process to other vertices (points); see [42] and references therein. Here we proceed in an unsupervised fashion, replacing the initially given labels with the estimated modes of the clusters.
I-D Major Contributions
We propose a clustering algorithm for HSI with several significant innovations. First, diffusion distance is proposed to measure distance between high-density regions in hyperspectral data, in order to determine class modes. Our experiments show that this distance efficiently differentiates between points belonging to the same cluster and points in different clusters. This correct identification of modes from each cluster is essential to any clustering algorithm incorporating an analysis of modes. Compared to state-of-the-art fast mode detection algorithms, the proposed method enjoys excellent empirical performance and theoretical performance guarantees, which are beyond the scope of the present article and will be discussed in the forthcoming article [37].
A second major contribution of the proposed HSI clustering algorithm is the incorporation of spatial information through the labeling process. Labels for points are determined by diffusing in the spectral domain from the labeled modes, unless spatial proximity is violated. By not labeling points that would violate spatial regularity, the proposed algorithm first labels points that, with high confidence, are close to the modes of the distributions. Only once all of these points are labeled are the remaining points, further from the modes, labeled. This enforces a spatial regularity which is natural for HSI, because a pixel in an HSI is likely to have the same label as the most common label among its nearest spatial neighbors. Difficulties tend to occur on spatial and spectral boundaries, and these points are likely to be labeled last in our scheme. In both stages DLSS takes advantage of the geometry of the dataset by using data-adaptive diffusion processes, greatly improving empirical performance. The proposed methods enjoy low computational complexity and relatively fast runtimes. Indeed, their complexity scales well in the number N of points (see Section IV-A), and they are thus suitable for the big data setting when N is very large.
A third major contribution is the use of the learned modes for the segmentation of chemical plumes in HSI. This may be considered a form of anomaly detection, in which the goal is not clustering, but the correct identification of a special cluster of interest (“anomaly”). Motivated by recent work in the problem of anomaly detection and background separation for HSI [2], the proposed method of mode estimation is combined with partial least squares regression (PLSR) to linearly embed the dataset in lower dimensions in a maximally discriminatory manner. This step enhances the contrast between the clusters, crucial for anomaly detection, while reducing the amount of noise, thanks to the projection onto a small number of discriminating directions. The proposed anomaly detection method transforms supervised PLSR into an unsupervised method, which we show to be highly efficient for the detection of chemical plumes. This work is detailed in Section III-F.
The fourth major contribution is the introduction of an active learning scheme based on the distances of points to the computed modes. In the context of active learning, the user is allowed to label only a very small number of points, to be chosen parsimoniously. We propose an unsupervised method for determining which points to label in the active learning setting. Points whose distance to their nearest mode is much smaller than their distance to all other modes are likely to be easy to label, and hence should not be given labels in the active learning setting. Conversely, pixels that are nearly equally far from their nearest two modes are likely to be near class boundaries, and hence to be the most challenging pixels to label by the proposed unsupervised method. We therefore propose an active learning method that labels only the pixels whose distances to their nearest two modes are closest. Our experiments show that this method can dramatically improve labeling accuracy with a number of labels amounting to only a tiny fraction of the total pixels. This work is detailed in Section III-G.
II Unsupervised Learning Algorithm and Variations
II-A Motivating Example and Approach
A key aspect of our algorithm is the method for identifying the modes of the underlying classes in the HSI data. This is challenging because of the high ambient dimension of the data, potential overlaps between the distributions at their tails, and differing densities, sampling rates, and distribution shapes across classes. Consider the simplified example in Figure 3. This data set is the same as that appearing in Figure 1.
The points of high density lie close to the center of the first distribution, and close to the two ends of the second distribution. After computing an empirical density estimate, the distance between high density points is computed. If the Euclidean distance between high density points is used to remove spurious modes, i.e. modes corresponding to the same distribution, then the two modes on opposite sides of the second distribution are returned, which incorrectly assigns two modes to the same distribution. However, if diffusion distance is used, rather than Euclidean distance, then the two modes learned correspond to two different classes. This is because the modes on the opposite ends of the second distribution are far in Euclidean distance but closer in diffusion distance. Indeed, the points between the two modes act as a bridge, which is incorporated into the geometry learned by diffusion distances. On the other hand, there is a substantial region of low density between the two distributions, making the diffusion distance between them relatively large. This suggests that diffusion distance is more useful than Euclidean distance for comparing high density points for the determination of modes, under the assumption that multimodal regions have modes that are connected by regions of not too low density. From the standpoint of Euclidean distance, the bridge is insignificant. From the standpoint of diffusion distance, it is critical.
II-B Diffusion Distance
We now present a detailed overview of the construction of diffusion distances. Additional analysis and comments on implementation appear in [43, 35, 36]. Diffusion processes on graphs lead to a data-dependent notion of distance, known as diffusion distance [43, 35, 36]. This notion of distance has been applied to a variety of problems, including the analysis of stochastic and dynamical systems [35, 44, 45, 46], semi-supervised learning [47, 42], data fusion [48, 49], latent variable separation [50, 51], and molecular dynamics [52, 53]. Diffusion maps provide a way of computing and visualizing diffusion distances, and may be understood as a type of nonlinear dimension reduction, in which data in a high number of dimensions may be embedded in a low-dimensional space by a nonlinear coordinate transformation. In this regard, diffusion maps are related to nonlinear dimension reduction techniques such as Isomap [54], Laplacian eigenmaps [47], and locally linear embedding [55], among several others.
Consider a discrete set X with N elements. The diffusion distance [35, 36] between x, y ∈ X, denoted d_t(x, y), is a notion of distance that incorporates, and is uniquely determined by, the underlying geometry of X. The distance depends on a time parameter t, which enjoys an interpretation in terms of diffusion on the data. The computation of d_t involves constructing a weighted, undirected graph G that encodes the geometry of X, and the distance approximates a suitable notion of geometry of the underlying distribution generating the data. The vertices of G correspond to the points of X, and the edges are given by a weight matrix W defined as
W_ij = exp(−‖x_i − x_j‖₂² / σ²) if x_j ∈ NN_k(x_i), and W_ij = 0 otherwise,   (1)
for some suitable choice of the scale parameter σ > 0, and with NN_k(x_i) the set of k nearest neighbors of x_i in X with respect to Euclidean distance. The weight matrix W is normalized to be row-stochastic, yielding a Markov diffusion with transition matrix P given by P_ij = W_ij / Σ_{l=1}^N W_il.
Given some initial distribution μ₀ on the state space, the vector μ₀P^t is the state of the Markov chain at time t. As t increases, the diffusion process on X evolves according to the connections between the points, as encoded by P. This Markov chain has a stationary distribution π s.t. πP = π, given by π_i = deg(x_i) / Σ_{j=1}^N deg(x_j), where deg(x_i) = Σ_{j=1}^N W_ij is the degree of x_i.
The diffusion distance at time t is defined as

d_t(x_i, x_j)² = Σ_{l=1}^{N} ((P^t)_{il} − (P^t)_{jl})² / π_l.
The computation of d_t(x_i, x_j) involves summing over all paths of length t connecting x_i to x_j, so d_t(x_i, x_j) is small if x_i and x_j are strongly connected in the graph corresponding to P, and large if they are weakly connected.
From a computational standpoint, an eigendecomposition of P allows one to derive fast algorithms involving d_t. Indeed, under mild conditions [36], the matrix P admits a spectral decomposition with eigenvectors {Ψ_l}_{l=1}^N and eigenvalues {λ_l}_{l=1}^N, where 1 = λ_1 ≥ |λ_2| ≥ ... ≥ |λ_N| ≥ 0. The diffusion distance can be written in terms of this spectral decomposition as
d_t(x_i, x_j) = ( Σ_{l=1}^{N} λ_l^{2t} (Ψ_l(x_i) − Ψ_l(x_j))² )^{1/2}.   (2)
One can think of the weighted eigenvectors {λ_l^t Ψ_l}_{l=1}^N as encoding new data-dependent coordinates of X. These coordinates are close to being geometrically intrinsic [35, 43]. Under this interpretation, diffusion distance is simply the ordinary Euclidean distance, but computed in these new coordinates rather than in the original Euclidean coordinates.
The diffusion distance is parametrized by t, which encodes how long the diffusion process on the graph has run when the distances are computed. Small values of t allow only a small amount of diffusion, which may prevent the interesting geometry of X from being discovered, but provide detailed information. Large values of t allow the diffusion process to run for so long that the interesting geometry may be washed out. In this work the regime in between these two extremes is typically when the geometry of the data is revealed. To balance between the short- and long-term behaviors, we fix a single moderate value of t for all experiments. The impact of t may be studied mathematically with spectral graph theory and analyses of Markov mixing times, which allows for estimates of the diffusion distances for certain data [37]. In general, the choices of the graph parameters k and σ in the construction of diffusion distances may be important; the choice of these parameters is discussed in Section III-H.
Note that since |λ_l| ≤ 1 for all l, λ_l^{2t} is negligible for large t except for the leading eigenvalues, and the sum (2) may be approximated by its truncation at index M, for some suitable M ≪ N. In all experiments, M was set to be the index at which the decay of the eigenvalues begins to taper off, i.e. the “kink” in the eigenvalue plot. This is a standard heuristic for such spectral methods. The subset {λ_l^t Ψ_l}_{l=1}^M used in the computation of d_t is a dimension-reduced set of diffusion coordinates. Intuitively, the eigenvectors with large eigenvalues contain significant information, while those with small eigenvalues are mostly noise and of little value. The truncation improves the computational complexity of the proposed method by requiring the computation of fewer eigenvectors, which are costly to compute even in the case of sparse matrices. A discussion of the computational complexity of diffusion distances appears in Section IV-A.
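As a concrete illustration, the graph construction (1), the row-stochastic normalization, and the truncated diffusion coordinates underlying (2) can be sketched in a few lines of NumPy. This is a minimal sketch under our own assumptions (dense N × N matrices, a simple symmetrized k-NN graph, and illustrative function names and defaults), not the paper's implementation:

```python
import numpy as np
from scipy.spatial.distance import cdist

def diffusion_coordinates(X, k=20, t=30, M=10):
    """Truncated diffusion coordinates lambda_l^t * Psi_l for a point cloud X (N x D)."""
    D2 = cdist(X, X, "sqeuclidean")
    sigma = 0.5 * np.sqrt(D2).mean()                 # illustrative data-adaptive scale
    W = np.exp(-D2 / sigma**2)
    # keep only each point's k nearest neighbors, as in Eq. (1); symmetrize
    far = np.argsort(D2, axis=1)[:, k + 1:]
    for i, row in enumerate(far):
        W[i, row] = 0.0
    W = np.maximum(W, W.T)
    P = W / W.sum(axis=1, keepdims=True)             # row-stochastic Markov matrix
    evals, evecs = np.linalg.eig(P)
    order = np.argsort(-np.abs(evals))[:M]           # M leading eigenpairs (the "kink" rule)
    lam, Psi = np.real(evals[order]), np.real(evecs[:, order])
    return (lam ** t) * Psi                          # rows are diffusion coordinates

def diffusion_distances(coords):
    """Diffusion distance is Euclidean distance in the diffusion coordinates, cf. (2)."""
    return cdist(coords, coords)
```

Euclidean distances between rows of the returned coordinate matrix then approximate the truncated sum (2).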
II-C Unsupervised HSI Clustering Algorithm Description
We now discuss the proposed algorithm in detail. A flowchart summarizing the proposed clustering method appears in Figure 4.
Let X be the HSI, reshaped into an N × D matrix, where N is the number of pixels and D is the number of spectral bands. We may consider X as a collection of points {x_n}_{n=1}^N ⊂ R^D. As described in Section I-B, our algorithm proceeds in two major steps: mode identification and labeling of points.
The algorithm for learning the modes of the classes is summarized in Algorithm 1. It first computes an empirical density for each point with a kernel density estimator, computing for each x_n the unnormalized density

p₀(x_n) = Σ_{x_m ∈ NN_k(x_n)} exp(−‖x_n − x_m‖₂² / σ₁²),

where ‖·‖₂² is the squared Euclidean distance in R^D, and NN_k(x_n) is the set of k nearest neighbors of x_n in Euclidean distance. The use of the Gaussian kernel density estimator is rather standard in statistics and machine learning, and it enjoys strong theoretical guarantees [56, 57]; other estimators may be used. In Section III a fixed number k of nearest neighbors is used, though our method is robust to choosing larger k. The parameter σ₁ in the exponential kernel is data-adaptive, and set to be half the mean distance between all points, which makes this choice relatively robust across different data sets (one could use the median instead in the presence of outliers). The empirical density p is then computed by normalizing p₀ so that Σ_n p(x_n) = 1:
p(x_n) = p₀(x_n) / Σ_{m=1}^{N} p₀(x_m).   (3)
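A minimal sketch of this kernel density estimator and the normalization (3), with σ₁ set to half the mean pairwise distance as described (the function name and defaults are illustrative assumptions):

```python
import numpy as np
from scipy.spatial.distance import cdist

def empirical_density(X, k=100):
    """Normalized Gaussian kernel density estimate p(x_n), as in Eq. (3)."""
    D = cdist(X, X)                              # pairwise Euclidean distances
    sigma1 = 0.5 * D.mean()                      # data-adaptive: half the mean distance
    nn = np.sort(D, axis=1)[:, 1:k + 1]          # distances to the k nearest neighbors
    p0 = np.exp(-nn**2 / sigma1**2).sum(axis=1)  # unnormalized density p_0
    return p0 / p0.sum()                         # normalize so the densities sum to 1
```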
Once this empirical density is computed, the modes of the HSI classes are computed in a manner similar in spirit to [26]. We compute the time-dependent quantity ρ_t(x_n) that assigns, to each pixel, the minimum diffusion distance between the pixel and a point of higher empirical density. The point with highest empirical density is instead assigned the maximum diffusion distance between it and any other point as its value of ρ_t. More precisely, we compute
ρ_t(x_n) = min { d_t(x_n, x_m) : p(x_m) ≥ p(x_n), x_m ≠ x_n }, unless x_n maximizes the empirical density, in which case ρ_t(x_n) = max_m d_t(x_n, x_m),   (4)
where d_t(x_n, x_m) is the diffusion distance between x_n and x_m at time t. In the following we will use the normalized quantity ρ̂_t(x_n) = ρ_t(x_n) / max_m ρ_t(x_m), which has maximum value 1.
The modes of the HSI are computed as the K points yielding the largest values of the quantity
D_t(x_n) = p(x_n) ρ̂_t(x_n).   (5)
Intuitively, such points should be both of high density and far in diffusion distance from any other points of higher density, and can therefore be expected to be modes of different distributions. This method provably detects modes correctly under the assumption that the data is drawn from certain nonparametric distributions [37].
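Given precomputed densities and pairwise diffusion distances, the selection rule (4)-(5) can be sketched as follows (a simplified quadratic-time illustration with our own names, not an optimized implementation):

```python
import numpy as np

def detect_modes(p, Dt, K):
    """Return indices of K class modes: the maximizers of D_t(x) = p(x) * rho_t(x).

    p  : (N,) empirical densities; Dt : (N, N) pairwise diffusion distances.
    """
    N = len(p)
    rho = np.empty(N)
    for n in range(N):
        higher = np.flatnonzero(p >= p[n])
        higher = higher[higher != n]             # points of higher (or equal) density
        # nearest higher-density point in diffusion distance, Eq. (4); the global
        # density maximum instead receives its largest distance to any point
        rho[n] = Dt[n, higher].min() if higher.size else Dt[n].max()
    rho /= rho.max()                             # normalized rho-hat, max value 1
    return np.argsort(-p * rho)[:K]              # K largest values of Eq. (5)
```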
Once the modes are detected, they are each given a unique label. The rest of the points are labeled from these mode labels in the following two-stage process, which is summarized in Algorithm 2. In the first stage, the points are sorted in order of decreasing empirical density. In order of decreasing density, the spatial consensus label of each point is computed by finding all labeled points within distance r in the spatial domain of the pixel in question; call this set N_r(x_n). If one label among the labels of these points occurs with relative frequency greater than 1/2, that label is the spatial consensus label. Otherwise, no spatial consensus label is given. In detail, let L(N_r(x_n)) denote the labels of the spatial neighbors within radius r. Then the spatial consensus label of x_n is
L_spatial(x_n) = l, if the label l occurs in L(N_r(x_n)) with relative frequency greater than 1/2, and L_spatial(x_n) = 0 (no consensus) otherwise.   (6)
After a point’s spatial consensus label is computed, its spectral label is computed as the label of its nearest neighbor of higher density in the spectral domain, measured in diffusion distance. The point is then given the spectral label as its overall label, unless the spatial consensus label exists and differs from the spectral label. In that case, the point in question remains unlabeled in the first stage. Note that unlabeled points are considered to have label 0 for the purposes of computing the spatial consensus label, so in the case that most pixels in the spatial neighborhood are unlabeled, the spatial consensus label will be 0. Hence, only pixels with many labeled pixels in their spatial neighborhood can hope to have a spatial consensus label.
Upon completion of the first stage, the dataset will be partially labeled; see Figure 2. Indeed, only points with no spatial consensus label and points whose spatial consensus label exists and agrees with the spectral label will have been labeled. In the second stage, an unlabeled point is given its spatial consensus label, if it exists. Otherwise, it is given the label of its nearest spectral neighbor of higher density.
Points of high density are likely to be labeled according to their spectral properties. The reasons for this are twofold. First, these points are likely to be near the centers of distributions, and hence are likely to be in spatially homogeneous regions. Moreover, points of high density are labeled before points of low density, so it is not likely for high density points to have many labeled points in their spatial neighborhoods. This means that the spatial consensus label is unlikely to exist for these points. Conversely, points of low density may be at the boundaries of the classes, and are hence more likely to be labeled by their spatial properties. In this way, the proposed labeling scheme benefits from both the spectral and spatial regularity of HSI.
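The two-stage labeling just described can be sketched as follows. The spatial-neighbor lists, the majority consensus threshold, and all names are illustrative assumptions rather than the paper's exact implementation of Algorithm 2:

```python
import numpy as np

def spatial_spectral_labels(p, Dt, modes, nbrs, tau=0.5):
    """Two-stage DLSS-style labeling sketch.

    p: (N,) densities; Dt: (N, N) diffusion distances; modes: K mode indices;
    nbrs: per-pixel arrays of spatial-neighbor indices; tau: consensus threshold.
    """
    N = len(p)
    y = np.zeros(N, dtype=int)                   # 0 = unlabeled
    for k, m in enumerate(modes):
        y[m] = k + 1                             # each mode gets a unique label
    order = np.argsort(-p)                       # process in decreasing density

    def consensus(n):                            # spatial consensus label, cf. (6)
        lab = y[nbrs[n]]                         # unlabeled neighbors count as 0
        if lab.size == 0:
            return 0
        vals, cts = np.unique(lab, return_counts=True)
        top = vals[cts.argmax()]
        return top if top != 0 and cts.max() > tau * lab.size else 0

    def spectral(n):                             # label of nearest labeled
        higher = np.flatnonzero((p > p[n]) & (y > 0))  # higher-density point
        return y[higher[Dt[n, higher].argmin()]] if higher.size else 0

    # stage 1: spectral label, unless it conflicts with the spatial consensus
    for n in order:
        if y[n] == 0:
            c, s = consensus(n), spectral(n)
            if c == 0 or c == s:
                y[n] = s
    # stage 2: fill remaining points, preferring the spatial consensus
    for n in order:
        if y[n] == 0:
            c = consensus(n)
            y[n] = c if c else spectral(n)
    return y
```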
The proposed method, combining Algorithms 1 and 2, is called spatial-spectral diffusion learning (DLSS). In our experimental analysis, the significance of the spatial-spectral labeling scheme is validated by comparing DLSS against a simpler method, called diffusion learning (DL). This method learns class modes as in Algorithm 1, but labels all pixels simply by requiring that each point have the same label as its nearest spectral neighbor of higher density. It is expected that DLSS will generally outperform DL, due to the former’s incorporation of spatial information.
II-D Anomaly Detection Algorithm
Motivated by the success of partial least squares regression (PLSR) for anomaly detection in hyperspectral movies [2], we propose to incorporate the learned diffusion modes into PLSR. PLSR is a supervised linear dimension reduction technique that reduces the dimension in such a way that not just the variance of, but also the discrimination between, the classes in the low-dimensional embedding is maximized. PLSR requires labeled data, which will be kept well separated in the low-dimensional embedding. To incorporate the proposed unsupervised method into PLSR, we associate each mode learned with diffusion distances to a core of points, consisting of the nearest neighbors of the mode in diffusion distance. These cores are points that, with high confidence, have the same class label. The cores are then used as labeled training data for PLSR.
Instead of arbitrarily assigning numbers to each of the classes and regressing against them, we assign the elements of each core to a vertex of a simplex, in order to perform vector-valued regression with PLSR without imposing a one-dimensional structure on the space of class labels. More precisely, the points in the kth core are labeled with the standard basis vector e_k, the vector with 1 in the kth coordinate and 0 elsewhere. This generates the set of training pairs {(x, e_k) : x in the kth core} for the kth class.
In this algorithm, the data are assumed to be centered, i.e., the columns of the data matrix X have mean 0. These training data are used in PLSR to acquire a regression coefficient B; details of the PLSR construction appear in Section A-A. The regression coefficient B is a matrix derived only from the learned cores. Multivariate regression on all of X is performed by computing R = XB; R is an n x K matrix with rows corresponding to points and columns corresponding to the responses of each point. The columns may be interpreted as affinities for each class, and class labels are assigned by taking, for each point, the class of maximal response. Other methods for computing class labels from the responses were considered, including K-means on the rows of R, yielding similar results. The overall algorithm is summarized in Algorithm 3.
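The simplex labeling and response-based assignment can be sketched as follows. This is a hedged illustration in which ordinary least squares stands in for PLSR (the actual PLSR construction appears in Section A-A), and the function and argument names are hypothetical.

```python
import numpy as np

def core_regression_labels(X, core_idx, core_class, K):
    """Cores get one-hot (simplex-vertex) targets e_k; a linear regression
    coefficient B is fit on the centered cores; every pixel is labeled by
    the argmax of its K responses. Ordinary least squares stands in for
    PLSR in this sketch."""
    Y = np.zeros((len(core_idx), K))
    Y[np.arange(len(core_idx)), core_class] = 1.0   # simplex vertices
    mu = X[core_idx].mean(axis=0)                   # center, as assumed in the text
    B, *_ = np.linalg.lstsq(X[core_idx] - mu, Y - Y.mean(axis=0), rcond=None)
    R = (X - mu) @ B + Y.mean(axis=0)               # n x K response matrix
    return np.argmax(R, axis=1)                     # class of maximal response
```

The one-hot targets implement the simplex-vertex labeling described above; replacing the least-squares fit with PLSR changes only how B is obtained.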
II-E Active Learning
Thus far, all proposed methods have been unsupervised. We now present a variation of the proposed unsupervised method for active learning of hyperspectral images. In the active learning regime, the cost of labeling pixels is at a premium, so a few well-chosen pixels must be selected for labeling. The proposed unsupervised method labels points beginning with the learned class modes, and mistakes tend to be made on points near the class boundaries. To help clarify the class boundaries, we consider an active learning scheme that requests labels for points whose distances from their nearest two modes are similar; this quantifies the ambiguity of not having a clear nearest mode.
More precisely, we fix a time t and, for each pixel x_n, let m_{n,1}, m_{n,2} be the two modes closest to x_n in diffusion distance. We compute the quantity
q(x_n) = |D_t(x_n, m_{n,1}) − D_t(x_n, m_{n,2})|.   (7)
If this quantity is close to 0, then there is substantial ambiguity as to which of the two modes the pixel is closest. Suppose the user is afforded the labels of exactly B points. Then the labels requested in our active learning regime are the B minimizers of (7). We consider a range of values of B for all real HSI datasets with ground truth. It is desirable to have the parameter dictating the number of labels for active learning be independent of image size. Hence, we parametrize the number of labels available to the active learning algorithm by a parameter β, considering a number of labels proportional to βn, where n is the number of pixels in the image; that is, β is the percentage of pixels whose labels we may access. The active learning setting is most interesting when β can be taken very small, hence β is taken close to 0 in our experiments. The proposed active learning scheme is summarized in Algorithm 4.
Note that the active learning algorithm can be iterated, by labeling points then recomputing the quantity (7) to determine the most challenging points after some labels have been introduced. For simplicity of presentation, we do not consider such an iterated algorithm in the present article.
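As a sketch of this selection rule, assuming the pixel-to-mode diffusion distances have already been computed into an n x K matrix (names and the matrix layout are our own illustrative choices):

```python
import numpy as np

def active_query_idx(D_modes, B):
    """Request labels for the B most ambiguous pixels: those whose diffusion
    distances to their two nearest modes are most similar. D_modes is an
    n x K matrix of precomputed pixel-to-mode diffusion distances."""
    two_nearest = np.sort(D_modes, axis=1)[:, :2]   # two smallest per pixel
    margin = two_nearest[:, 1] - two_nearest[:, 0]  # near 0 => ambiguous
    return np.argsort(margin)[:B]
```

An iterated version would relabel, recompute the margins, and query again.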
III Experiments
III-A Algorithm Evaluation Methods and Experimental Data
We consider several HSI datasets to evaluate the proposed unsupervised algorithm (Algorithms 1, 2) and the anomaly detection (Algorithm 3) and active learning (Algorithm 4) variants. For evaluation in the presence of ground truth (GT), we consider three quantitative measures, as well as visual performance. The quantitative measures of performance are:

Overall Accuracy (OA): Total number of correctly labeled pixels divided by the total number of pixels. This method values large classes more than small classes.

Average Accuracy (AA): For each class, the overall accuracy of that class is computed, as described above. These classwise performances are then averaged. This method values small classes and large classes equally.

Cohen's κ statistic (κ): A measurement of inter-rater agreement between two labelings, corrected for chance agreement [58]. Letting p_o be the observed agreement between the labeling and the ground truth and p_e the expected agreement between a uniformly random labeling and the ground truth, κ = (p_o − p_e)/(1 − p_e). A value of κ = 1 corresponds to perfect overall accuracy, while a labeling that does not improve on what is expected from random guessing would have κ ≤ 0.
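The three metrics above can be computed as in the following NumPy sketch, which uses the standard marginal-product form of the expected agreement p_e for κ:

```python
import numpy as np

def oa_aa_kappa(y_true, y_pred, K):
    """Overall accuracy, average (per-class) accuracy, and Cohen's kappa
    for aligned label vectors with classes 0, ..., K-1."""
    oa = np.mean(y_true == y_pred)
    aa = np.mean([np.mean(y_pred[y_true == k] == k) for k in range(K)])
    # expected chance agreement from the marginal label frequencies
    p_e = sum(np.mean(y_true == k) * np.mean(y_pred == k) for k in range(K))
    return oa, aa, (oa - p_e) / (1 - p_e)
```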
In order to perform quantitative analysis with these metrics and make consistent visual comparisons, the learned clusters are aligned with the ground truth, when available. After performing our unsupervised clustering, we perform an a posteriori alignment of the labels with the ground truth that maximizes the overall accuracy of the labeling. More precisely, let S_K be the set of permutations of {1, ..., K}. Let C_1, ..., C_K be the clusters learned from one of the clustering methods, and let G_1, ..., G_K be the ground truth clusters. Cluster C_k is assigned label σ̂(k), with σ̂ the permutation maximizing the total overlap with the ground truth:
σ̂ = argmax_{σ ∈ S_K} Σ_{k=1}^{K} |C_k ∩ G_{σ(k)}|.
We remark that this alignment method maximizes the overall accuracy of the labeling and is most useful for visualization; better alignments for maximizing AA and κ may exist.
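A brute-force sketch of this alignment, which is feasible because the number of classes K is small in our experiments:

```python
import numpy as np
from itertools import permutations

def align_labels(pred, gt, K):
    """A posteriori alignment: search the permutations of {0, ..., K-1} for
    the relabeling of the learned clusters that maximizes overall accuracy
    against the ground truth (feasible for small K)."""
    best_acc, best_pi = -1.0, None
    for pi in permutations(range(K)):
        acc = np.mean(np.asarray(pi)[pred] == gt)
        if acc > best_acc:
            best_acc, best_pi = acc, pi
    return np.asarray(best_pi)[pred]
```

For larger K, the same maximization can be solved in polynomial time with the Hungarian algorithm on the cluster/ground-truth overlap matrix.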
For experimental analysis, we first consider synthetic data where classes are generated randomly according to specified distributions. Next, we consider five real HSI datasets which illustrate strengths and weaknesses of the proposed algorithm. Four of the datasets are standard and have ground truth; they are publicly available^{1}^{1}1http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes. We also consider a dataset without ground truth, for which the proposed quantitative measures do not apply. This image contains a subtle chemical plume, and the goal for this dataset is not clustering of the whole dataset, but rather the segmentation of the chemical plume. For this dataset, we propose the modified unsupervised mode estimation algorithm combined with partial least squares regression (Algorithm 3) to aid anomaly detection. Experiments with active learning (Algorithm 4) are performed for the four real HSI datasets with ground truth.
In order to reduce the number of classes to be suitable for our unsupervised method, some images are restricted to subsets in the spatial domain, as noted in their respective subsections. The Salinas A dataset is considered in its entirety. For the data with ground truth, not all pixels are labeled in the ground truth. For these datasets, labels are computed for all data, then the pixels with ground truth labels are used for quantitative and visual analysis. The number of class labels in the ground truth images was used as a parameter for all clustering algorithms, though the proposed method can automatically estimate the number of clusters; see Section III-F for details. Images of ground truth and the sum of all bands for the Indian Pines, Pavia University, Pavia, and Salinas A datasets are in Figures 5, 22, 7, and 8, respectively. The sum of all bands is presented as a simple visual summary of the data, though it washes out the subtle information present in individual bands. Finally, we consider an HSI image containing a chemical plume. This image does not have ground truth, so the goal is to cluster it so that the chemical plume is clearly segmented. For clarity of exposition, experiments for synthetic HSI data and Pavia University appear in the Appendix.
Note that additional experiments were performed using only the data with ground truth labels. These experiments consisted of restricting the HSI to the pixels with labels, which makes the clustering problem significantly easier. Quantitative results were uniformly better for all datasets and methods in these cases, though the relative performances of the algorithms on a given dataset remained the same. The results on the restricted data are not shown.
III-B Comparison Methods
We consider a variety of benchmark and state-of-the-art methods of HSI clustering for comparison. First, we consider the classic K-means algorithm [57] applied directly to the raw spectra. This method is not expected to generate especially good results for HSI data, in part due to the non-spherical shape of clusters in HSI data; non-spherical classes are a well-known problem for K-means clustering. High dimensionality is also a problem for K-means: due to the well-known curse of dimensionality, distances between uniformly sampled points are uniformly large in high dimensions, which makes it difficult to distinguish between clusters and noise with K-means.
Several methods that reduce the dimensionality of the data, while preserving important discriminatory properties of the classes and increasing the signal-to-noise ratio in the projected space, are used as benchmarks for comparison with the proposed method. All of these methods involve first reducing the dimension of the data, then running K-means on the reduced data. The datasets are reduced to K dimensions, where K is the number of classes, if known. In the case of chemical plume detection, when the number of classes is unknown, K is estimated; see Section III-F for details. We first consider linear dimension reduction via principal component analysis (PCA), which seeks a low-dimensional representation of the dataset that retains maximum variance. We then consider independent component analysis (ICA) [59, 60], which seeks to decompose the data into maximally independent components. To implement ICA, a standard, fast implementation was used^{2}^{2}2https://www.cs.helsinki.fi/u/ahyvarin/papers/fastica.shtml. In addition, we consider linear dimension reduction by random projection via a Gaussian random matrix, which has been shown to be efficient for suitably simple data models [61, 62].
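As an illustration of the simplest of these benchmarks, the following sketches a PCA reduction to K dimensions via the SVD of the centered data; K-means would then be run on the returned scores. The function name is our own.

```python
import numpy as np

def pca_reduce(X, K):
    """Linear dimension reduction to K dimensions via PCA: project the
    centered data onto its top K right singular vectors. K-means is then
    run on the returned scores in the benchmark pipeline."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:K].T
```

The ICA and random-projection benchmarks follow the same reduce-then-cluster pattern, differing only in how the projection is chosen.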
We also consider more computationally intensive methods for benchmarking. Spectral clustering (SC) [63, 64] has had success in classifying and clustering HSI [39]. The spectral embedding is computed using the normalized graph Laplacian; the K leading eigenvectors are then row-normalized and used as features for K-means clustering. Details of this construction, including parameters for the graph Laplacian, appear in Section III-H. We also cluster with Gaussian mixture models (GMM) [19, 65, 66], which fit a mixture of Gaussians to the dataset, with means and covariance matrices determined by expectation maximization (EM).
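A minimal sketch of the spectral embedding step, using one common construction (Gaussian weights, symmetric normalized Laplacian, row-normalized eigenvectors); the exact graph construction and parameters used in the experiments are those given in Section III-H, which may differ in detail:

```python
import numpy as np

def spectral_embedding(X, sigma, K):
    """Spectral embedding from a Gaussian weight matrix: form the symmetric
    normalized Laplacian L = I - D^{-1/2} W D^{-1/2}, take the K eigenvectors
    of smallest eigenvalue, and normalize each row to unit length. These
    rows are the features fed to K-means."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    dm = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - dm[:, None] * W * dm[None, :]
    _, vecs = np.linalg.eigh(L)               # eigenvalues in ascending order
    V = vecs[:, :K]
    return V / np.linalg.norm(V, axis=1, keepdims=True)
```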
In addition to these benchmark algorithms, we consider several recent, stateoftheart clustering methods. The method of sparse manifold clustering and embedding (SMCE)^{3}^{3}3http://vision.jhu.edu/code/ is considered [22, 23]. This method attempts to fit the data to lowdimensional, sparse structures, then applies spectral clustering. The method of hierarchical clustering with nonnegative matrix factorization (HNMF)^{4}^{4}4https://sites.google.com/site/nicolasgillis/code [24] is also considered. This method has shown excellent performance for HSI clustering, especially in the simplified case when the clusters are generated from a single endmember. Another graphbased method [25], based on the MumfordShah segmentation [67], is also considered. This method is related to spectral clustering, and is called fast MumfordShah (FMS) in this article. This implementation is highly parallelized ^{5}^{5}5http://www.ipol.im/pub/art/2017/204/?utm_source=doi, and is the only comparison method implemented in C, rather than MATLAB. Moreover, we consider the fast search and find of density peaks clustering (FSFDPC) algorithm [26], which has been shown effective in clustering certain data.
III-C Relationship Between Proposed Method and Comparison Methods
The FSFDPC method has similarities with the mode estimation aspect of the present work, in that both algorithms attempt to learn the modes of the data classes via a density-based analysis, as described in, for example, [21, 26]. Our method is quite different: the proposed measure of distance between high-density points is not Euclidean distance, but diffusion distance [35, 36], which is more adept at removing spurious modes, due to its incorporation of the geometry of the data. This phenomenon is illustrated in Figures 1 and 3. The assignment of labels from the modes is also quite different, as diffusion distances are used to determine spectral nearest neighbors, and spatial information is accounted for in the proposed DLSS algorithm. The FSFDPC algorithm, in contrast, assigns to each of the modes its own class label, and to the remaining points a label by requiring that each point have the same label as its Euclidean nearest neighbor of higher density. This means that FSFDPC only incorporates spectral information measured in Euclidean distance, disregarding spatial information. The benefits of both using diffusion distances to learn modes and incorporating spatial proximities into the clustering process are shown in the experiments.
The proposed DLSS and DL algorithms also share commonalities with spectral clustering, SMCE, and FMS, in that these comparison methods compute eigenvectors of a graph Laplacian in order to develop a nonlinear notion of distance; this may be seen as similar to computing the eigenvectors of the Markov transition matrix in the computation of diffusion maps. The proposed method, however, directly incorporates density into the detection of modes, which allows for more robust clustering than these comparison methods, which apply K-means to the eigenvectors of the graph Laplacian. Our technique also does not rely on any sparsity assumption, and is completely invariant under distance-preserving transformations (a property it shares with SMCE), which could be useful if different imaging modalities (e.g., compressed modalities) are used.
III-D Summary of Proposed and Comparison Methods
The experimental methods are summarized in Table I. We propose two novel methods: the full spatial-spectral diffusion learning method (DLSS), as well as a simplified diffusion learning method (DL). The DLSS method consists of learning the class modes with diffusion distances, then assigning labels in the joint spatial-spectral manner illustrated in Figure 4. The simplified DL method learns the class modes with diffusion distances, but assigns labels simply by enforcing that a point has the same label as its nearest neighbor of higher density in diffusion distance. In this way, DL does not incorporate spatial information. We note that several algorithms were not implemented by the authors of this article; instead, publicly available libraries were used, with links noted where appropriate.
All experiments and subsequent analyses are done in MATLAB, with the exception of FMS, which was implemented in C++. Code to reproduce all results will be made publicly available on the authors’ website^{6}^{6}6http://www.math.jhu.edu/~jmurphy/.
III-E Unsupervised HSI Clustering Experiments
Experimental results for synthetic data appear in Section A-B1.
III-E1 Indian Pines Dataset
The Indian Pines dataset used for experiments is a subset of the full Indian Pines dataset, consisting of three classes that are difficult to distinguish visually; see Figure 5. It is expected that this dataset will be challenging for all algorithms, due to the lack of clear separation between the classes. Results for Indian Pines appear in Figure 6 and Table II. In all tables, the highest quantitative score for a given metric is listed in bold, and the second highest is underlined.
The proposed methods, DL and DLSS, perform best, with DLSS strongly outperforming the rest. The use of diffusion distances for mode detection and determination of spectral neighbors is evidently useful, as DL significantly outperforms FSFDPC, which has the best quantitative performance among the comparison methods. Moreover, the proposed spatial-spectral labeling scheme clearly improves over spectral-only labeling: as seen in Figure 6, it correctly labels many small interior regions that diffusion learning alone labels incorrectly.
III-E2 Pavia Dataset
The Pavia dataset used for experiments consists of a subset of the original dataset, and contains six classes, with one of them spread out across the image. As can be seen in Figure 7, the yellow class is small and diffuse, which is expected to add challenge to this example. Results appear in Figure 20 and Table III.
The proposed methods give the best results, and the value of both the diffusion learning stage and the spatial-spectral labeling scheme is evident. Indeed, the proposed DLSS algorithm makes essentially only two errors: the yellow-green class is slightly mislabeled, and the blue-green class in the bottom right is labeled completely incorrectly. However, both of these errors are made by all algorithms, often to an even greater degree. Among the comparison methods, SMCE performs best, and classical spectral clustering also performs well.
III-E3 Salinas A Dataset
The Salinas A dataset consists of six classes arrayed in diagonal spatial rows; the dataset appears in Figure 8.
For the full Salinas A dataset, the proposed DLSS method performs best, its only error being the splitting of the bottom right cluster into two pieces; this error was made by all algorithms, however. The simplified DL method also performs well, as does the benchmark spectral clustering algorithm. Comparing the labelings of DL and DLSS, the small regions of mislabeled pixels in DL are correctly labeled in DLSS, because these pixels are likely of low empirical density, and hence are labeled based not on spectral similarity, but on spatial similarity. However, some pixels labeled correctly by DL are labeled incorrectly by DLSS, indicating that the spatial proximity condition enforced in DLSS may not lead to improved results for every pixel. Details on this, and on how to tune the size of the neighborhood in which spatial consensus labels are computed, are given in Section III-H2.
III-F Anomaly Detection for Chemical Plume Segmentation
We now consider the problem of anomaly detection in HSI by segmenting a chemical plume. Chemicals invisible to the naked eye can be captured in certain frequency bands of an HSI; the problem is to identify the chemical region. An example, courtesy of the Johns Hopkins Applied Physics Lab, appears in Figure 10. Several pixels are corrupted, making this problem even more difficult. No labeled ground truth is available for this image, so we evaluate only the visual quality of the image segmentation. In order to determine the number of clusters K used by our clustering algorithms, we examine the plot of the sorted mode-detection values from Algorithm 1, the product of each point's empirical density and its diffusion distance to the nearest point of higher density. We look for "kinks" in this plot, which correspond to a major drop in these values and heuristically indicate that, to a certain granularity, all clusters have been located. The plots of the sorted values and of the finite differences between successive values appear in Figure 12.
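The "kink" heuristic can be sketched as follows: sort the mode-detection values in decreasing order and take the position of the largest finite difference as the estimate of K. This sketch uses the global maximum gap, though a different local maximum may sometimes be preferable.

```python
import numpy as np

def estimate_K(values):
    """Estimate the number of clusters from the mode-detection values
    (density times diffusion distance to the nearest higher-density point):
    sort in decreasing order and return the position of the largest finite
    difference, i.e., the 'kink'."""
    s = np.sort(np.asarray(values))[::-1]
    gaps = s[:-1] - s[1:]
    return int(np.argmax(gaps)) + 1
```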
In addition to algorithms considered thus far, we propose to use Algorithm 3, which combines the proposed mode detection algorithm with PLSR. Mode detection with both Euclidean and diffusion distances are considered.
We see from Figure 11 that PLSR with diffusion cores segments the plume best, in particular giving the cleanest horizontal segmentation. PLSR with Euclidean cores performs second best, giving a good vertical segmentation but several errors in the horizontal direction. The other methods give very poor segmentations of the chemical plume. We remark that PLSR is well-suited for this problem, which requires the segmentation of only a single anomalous class: the class of interest is relatively concentrated, and performance in clustering the rest of the image is not considered as important.
III-G Active Learning
To evaluate the proposed active learning method, Algorithm 4, the same four labeled HSI datasets were clustered with increasing amounts of available training data. The number of training points used is parametrized by β, which denotes the percentage of pixels in the whole image available for labeling. Note that β = 0 corresponds to the unsupervised DLSS algorithm. The empirical results for this active learning scheme appear in Figure 13.
The empirical results indicate that the proposed active learning can produce dramatic improvements in labeling with very few training labels. Indeed, a substantial improvement in overall accuracy for Indian Pines can be achieved with only 3 labels. Even more dramatic is the Salinas A dataset, in which 14 labeled points yields a large jump in overall accuracy. The other datasets see less dramatic improvement; Pavia University improves slightly with increasing β, while the Pavia labeling is not affected by the small collection of labeled points. In the case of Pavia, however, the overall accuracy was already very high, so active learning is less valuable in this setting.
III-H Parameter Analysis
We now discuss the parameters used in the comparison and proposed methods. We begin by stating the parameters used for all comparison methods, before analyzing the two key parameters of the proposed method: the diffusion time and the radius used to compute the spatial consensus label. For all experimental parameters except these two, a range of values was considered, and those with the best empirical results are used.
The K-means algorithm is used in conjunction with several benchmark methods, and its parameters are the same for all of them: the algorithm is run with 100 iterations, with random initializations each time, and with the number of clusters equal to the known number of classes in the ground truth. Each of the linear dimension reduction techniques, PCA, ICA, and random projection, embeds the data into K dimensions, where K is the number of clusters. Spectral clustering is run by computing a Gaussian weight matrix; the K leading eigenvectors are then normalized to have unit Euclidean norm and used as features with K-means. The GMM labels are computed with expectation maximization using full covariance matrices and a small regularization parameter.
Among the state-of-the-art methods, HNMF uses the recommended settings listed in the available online toolbox^{7}^{7}7https://sites.google.com/site/nicolasgillis/code; there are no significant parameters to tune. For FSFDPC, the empirical density estimate is computed as described in Section II-C, with a Gaussian kernel and 20 nearest neighbors. For SMCE, the sparsity parameter was set to 10, as suggested in the online toolbox^{8}^{8}8http://www.vision.jhu.edu/code/. The FMS algorithm depends on several key parameters; we implemented a parameter sweep over them, and ran our experiments with the empirically optimal choice of parameters for a given dataset.
For the proposed algorithm, the same parameters for the density estimator as described above are used, in order to make a fair comparison with FSFDPC. Moreover, in the construction of the graph used to compute diffusion distances, the same construction as used for spectral clustering and SMCE is employed, again for fair comparison. The remaining parameters, namely the diffusion time and spatial radius, were set to 30 and 3, respectively, for all experiments. We justify these choices and explain their significance in the following subsections.
III-H1 Diffusion Time
The most important parameter in using diffusion distances is the time parameter t. Recall that the diffusion distance is computed from the eigenvalues and eigenvectors of the Markov transition matrix. Under mild assumptions that are almost always satisfied for real data, the eigenvalues satisfy 1 = λ_1 > |λ_2| ≥ |λ_3| ≥ ..., so the parameter t determines the decay of the lower eigenvalues: the larger t is, the less the smaller eigenvalues contribute. Allowing t to vary, connections in the dataset are explored by allowing the diffusion process to evolve. For small values of t, all points appear far apart because the diffusion process has not spread far, while for large values of t, all points appear close together because the diffusion process has run for so long that it has dissipated over the entire state space. In general, the interesting choice of t is moderate, which allows the interesting geometry to be discovered but not washed out in the long term.
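For concreteness, the following is a NumPy sketch of diffusion distances at integer time t via the spectral formula, using a Gaussian-kernel Markov matrix and its symmetrized eigendecomposition. This is a standard construction, not necessarily the exact implementation used in the experiments, and the kernel bandwidth sigma is illustrative.

```python
import numpy as np

def diffusion_distances(X, sigma, t):
    """All pairwise diffusion distances at integer time t via the spectral
    formula D_t(x_i, x_j)^2 = sum_k lambda_k^(2t) (psi_k(i) - psi_k(j))^2,
    where (lambda_k, psi_k) are eigenpairs of the Markov matrix P = D^-1 W,
    obtained from its symmetric conjugate."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    deg = W.sum(axis=1)
    S = W / np.sqrt(np.outer(deg, deg))       # symmetric conjugate of P
    lam, phi = np.linalg.eigh(S)
    psi = phi / np.sqrt(deg)[:, None]         # right eigenvectors of P
    emb = psi * lam ** t                      # diffusion map at time t
    return np.sqrt(((emb[:, None, :] - emb[None, :, :]) ** 2).sum(-1))
```

The factor lam ** t makes the time-dependent eigenvalue decay discussed above explicit: raising t suppresses the contributions of the smaller |λ_k|.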
In Figure 14, the results for a range of t values, running from 0 to 100, are displayed. These results are empirical, and not motivated by statistics of the data.
We see that, in general, the behavior is largely invariant with respect to time. For Indian Pines, performance is largely constant, except for a dip at intermediate times. For the remaining examples, the performance is robust with respect to the diffusion time in the range analyzed. However, for the Pavia University dataset, performance appears to begin eroding at the largest times considered, indicating that the diffusion process has mixed for too long, and differences between classes have been washed out.
III-H2 Spatial Diffusion Radius
The spatial consensus radius r can also impact the performance of the proposed algorithm. Recall that this is the distance in the spatial domain used to compute the spatial consensus label: for the pixel in question, all labeled points within spatial distance r are found, and the consensus label is the most common label among them, provided it accounts for more than half of these points. Otherwise, no spatial consensus label is given; the precise definition was given in (6). If r is too small, insufficient spatial information is incorporated; if r is too large, the spectral information becomes drowned out. Results for each dataset over a range of r values, running from 0 to 10, appear in Figure 15.
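A sketch of the spatial consensus rule for a 2-D array of labels, where -1 marks pixels not yet assigned; a square window stands in for the spatial neighborhood, and the names and window shape are illustrative:

```python
import numpy as np

def spatial_consensus(labels, r):
    """Spatial consensus label for each pixel of a 2-D label image, where
    -1 marks not-yet-labeled pixels: take the most common label among
    labeled pixels in a (2r+1) x (2r+1) window, and accept it only if it
    holds a strict majority there; otherwise return -1 (no consensus)."""
    rows, cols = labels.shape
    out = -np.ones_like(labels)
    for i in range(rows):
        for j in range(cols):
            win = labels[max(i - r, 0):i + r + 1, max(j - r, 0):j + r + 1]
            vals = win[win >= 0]
            if vals.size == 0:
                continue
            mode = np.bincount(vals).argmax()
            if 2 * np.sum(vals == mode) > vals.size:  # strict majority
                out[i, j] = mode
    return out
```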
We see that there is generally a tradeoff between spatial and spectral information, so that it makes sense to take r to be a moderate value sufficiently greater than 0 but less than 10. This tradeoff can be interpreted as follows: empirically optimal results are achieved when spatial and spectral information contribute harmoniously, and results deteriorate when one or the other dominates. We choose r = 3 for all experiments, though other choices would give comparable or better quantitative results for the datasets considered. We remark that in the case of the Pavia University dataset, the average accuracy decreases monotonically with increasing window size, and does not exhibit the tradeoff. This is because there are several very small classes in this dataset that can be overwhelmed by overzealous spatial regularization; the average accuracy statistic is acutely sensitive to these small classes.
We note that the role of the spatial radius is analogous to that of a regularization parameter in a general statistical learning problem. Taken too small, the problem is insufficiently regularized, leading to poor results. Taken too large, the regularization term dominates the fidelity term, also leading to poor results.
IV Overall Comments on the Experiments and Conclusion
This article proposes a novel unsupervised method for clustering HSI, using data-dependent diffusion distances to learn modes of clusters, followed by a spatial-spectral labeling scheme based on diffusion in both the spectral and spatial domains. It was expected that the proposed DL algorithm would perform well, particularly compared to FSFDPC, and that the DLSS algorithm would outperform DL, due to the former's incorporation of spatial information.
The proposed DL method, consisting of the geometric learning of modes but only spectral assignment of labels, generally outperforms all comparison methods. In particular, it outperforms the popular and recent FSFDPC algorithm, which uses the same technique to assign class labels once modes have been learned. Indeed, DL outperforms FSFDPC in all examples, indicating that Euclidean distance is insufficient for learning the modes of complex HSI data. The quantitative analyses of the real HSI clustering experiments with ground truth are summarized in Table V.
Moreover, the joint spatial-spectral labeling scheme improves over using only spectral information to label points, as indicated by DLSS outperforming DL in all instances but one. The exceptional dataset is Pavia University (see Appendix), and the exceptional statistic is average accuracy. As remarked in Section A-B3, the spatial information appears to wash out the small green class in this example. This phenomenon is further illustrated by subfigure (b) of Figure 15, which shows that increasing the spatial radius in which the spatial consensus label is computed has a uniformly detrimental impact on average accuracy, even as overall accuracy and κ show the typical tradeoff between spatial and spectral information. This indicates a potential drawback of the proposed spatial-spectral labeling scheme: small classes may be washed out, even for small spatial radii. However, the overall improved performance suggests that this risk is generally acceptable.
The two variations of the proposed algorithm, for anomaly detection and active learning, also show great promise. The variation in which PLSR is combined with diffusion learning successfully segments the gas plume in the APL hyperspectral scene, while all other methods fail; this indicates that for the anomaly detection problem, geometric mode detection combined with PLSR is a useful tool. The incorporation of active learning into the DLSS algorithm dramatically improves the accuracy of labeling for the Indian Pines and Salinas A datasets with only 3 and 14 labeled pixels, respectively, and offers more modest improvement for the Pavia University dataset. This parsimonious use of training labels has the potential to greatly improve the efficiency of HSI learning tasks, in which the volume of labels necessary to label a significant proportion of the image can be very high.
IV-A Computational Complexity
The proposed DLSS method consists of two major steps: first, the modes of the data are estimated, then labels are assigned. Suppose the data consists of n points in R^D.
To compute the modes, both the empirical density of each point and its distance to its nearest diffusion-distance neighbor of higher density must be computed. To compute each point's empirical density, its k nearest neighbors must be found. To do so efficiently, the cover trees algorithm [68] is used, which computes all points' nearest neighbors in O(n log n), with a constant that depends exponentially on the intrinsic dimension d of the data. Note that the exponential dependence is not on D, the ambient dimension of the data. In the case of HSI, D is quite large, but it is reasonable to assume that the data lies near a low-dimensional structure, so that d ≪ D. The computation of the density given these nearest neighbors is O(k) per point, so that, assuming k is small with respect to n, the computation of empirical density for all points is O(n log n).
Next, each point's nearest neighbor in diffusion distance of higher density must be computed. Under the mild assumption that, aside from the class modes, each point has a point of higher density among its k nearest neighbors, for some k small with respect to n, finding each point's nearest neighbor of higher density is tantamount to computing each point's k nearest neighbors in diffusion distance. The worst-case computational complexity of computing each point's nearest neighbors in diffusion distance is, using cover trees as analyzed above, O(Mn log n), where M is the number of eigenvectors computed for the diffusion distances. In all examples presented in this paper M is modest, and can in general be assumed independent of n, since it depends only on the intrinsic geometry of the data. Hence this cost also reduces to O(n log n).
Once both the empirical density and the distance of each point to its nearest neighbor of higher density in diffusion distance are computed, the product of these values must be sorted, which can be done in O(n log n). The K maximizers of this quantity are the modes. Hence, the total cost of Algorithm 1 is O(n log n).
The labeling step requires computing, for each point, its nearest spectral neighbor of higher density as well as its spatial consensus label. We have already established that under the mild assumption above, each point's nearest spectral neighbor of higher density can be computed in O(C^d M N log(N)), and computing the spatial consensus label is O(r^2) per point, where r is the spatial consensus radius. It is reasonable to take r = O(1) independent of N, giving Algorithm 2 computational complexity O(C^d M N log(N)). Hence, the overall complexity of the unsupervised clustering algorithm is O(C^d M N log(N)). This matches fast spectral clustering, which has the same eigenvector bottleneck as diffusion maps, and the FSFDPC algorithm; it is significantly faster than SMCE.
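Algorithm 2's joint labeling can be sketched as follows. Here `nhd[i]` is assumed to hold the index of point i's nearest spectral neighbor of higher density, `modes` the detected mode indices, and `rows`/`cols` the pixel coordinates; the handling of spectral/spatial disagreements is a simplification of the paper's two-pass scheme:

```python
import numpy as np

def spatial_spectral_label(p, nhd, modes, rows, cols, shape, r=1):
    """Label points in order of decreasing density: each point takes the
    label of its nearest spectral neighbor of higher density when that
    label agrees with the majority label in its (2r+1)x(2r+1) spatial
    window; otherwise the spatial consensus label is used."""
    N = len(p)
    labels = -np.ones(N, dtype=int)
    labels[modes] = np.arange(len(modes))            # modes seed the classes
    grid = -np.ones(shape, dtype=int)                # labels on the image grid
    grid[rows[modes], cols[modes]] = labels[modes]
    for i in np.argsort(-p):                         # decreasing density
        if labels[i] >= 0:
            continue
        spectral = labels[nhd[i]]                    # higher-density nbr is labeled
        window = grid[max(rows[i] - r, 0):rows[i] + r + 1,
                      max(cols[i] - r, 0):cols[i] + r + 1]
        found = window[window >= 0]
        if found.size:
            vals, counts = np.unique(found, return_counts=True)
            consensus = vals[np.argmax(counts)]
        else:
            consensus = spectral
        labels[i] = spectral if spectral == consensus else consensus
        grid[rows[i], cols[i]] = labels[i]
    return labels
```

Processing in decreasing density order guarantees that each point's nearest higher-density neighbor has already been labeled when the point is reached.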
Experiments on the empirical runtime of each algorithm were also performed; the time each algorithm takes to cluster each of the four labeled datasets appears in Table VI.
For the additional PLSR step used in Algorithm 3 for the segmentation of the chemical plumes, we discuss the complexity of PLSR in Appendix A. Suffice it to say, it does not add to the overall complexity.
The active learning stage additionally requires determining which points are to be labeled, which is constant time per point, followed by the assignment of the labels, which is also constant time per point. Hence, the active labeling stage, namely Algorithm 4, is O(N), and does not add to the complexity of the proposed algorithm.
V Future Research Directions
A drawback of many clustering algorithms, including the one presented in this paper, is the assumption that the number of clusters, K, is known a priori. Initial investigations suggest that looking for the "kink" in the sorted plot of the mode-detection quantity (the product of each point's density and its diffusion distance to its nearest neighbor of higher density) could be used to detect K automatically. Indeed, this is true for most of the examples presented in this article, and this method was used in the analysis of chemical plumes in Section IIIF. See Figure 16 for another example, this time from the Salinas A dataset.
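A sketch of the kink heuristic, assuming `scores` holds the mode-detection quantities (density times diffusion distance to the nearest higher-density neighbor) for all points:

```python
import numpy as np

def estimate_num_clusters(scores, kmax=10):
    """Estimate K from the sorted mode-detection scores: K is the index
    at which the sorted (decreasing) curve has its largest drop, i.e.
    the largest finite difference among the top kmax candidates."""
    s = np.sort(scores)[::-1][:kmax + 1]
    gaps = s[:-1] - s[1:]            # finite differences of the sorted curve
    return int(np.argmax(gaps)) + 1
```

As noted below for small or hard-to-detect clusters, a different local maximum of the finite-difference curve may be preferable to the global maximum returned here.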
This was suggested in [26] in the case of using Euclidean distances to compute the distance to the nearest neighbor of higher density; however, this imposes unrealistic constraints on the shapes and relative positions of the mixture components. It is of interest to prove under what assumptions on the distributions and the mixture model the sorted plot correctly determines K. Moreover, in the case that one cluster is noticeably smaller or harder to detect than the others, as for the APL chemical plume dataset, it may be advantageous to use a different local maximum of the finite difference curve, rather than the global maximum. Initial mathematical results and more subtle conjectures are proposed in an upcoming article [37].
Relatedly, the present work is essentially heuristic: it is not known mathematically under what constraints on the mixture model the proposed method for learning modes succeeds with high probability. Besides being of mathematical interest, such results would be useful for understanding the limitations of the proposed method for HSI. Indeed, the experimental results of this paper indicate that diffusion distance is more effective than Euclidean distance for discriminating between high-density points for use in the FSFDPC algorithm. To understand this phenomenon rigorously, a careful analysis of diffusion distances for data drawn from a nonparametric mixture model is required. This line of research is related to recent work on performance guarantees for spectral clustering and mode detection [69, 70].
Va Acknowledgments
This research was partially supported by NSF ATD-1222567, AFOSR FA9550-14-1-0033, and NSF IIS-1546392.
References
 [1] G. Lu and B. Fei. Medical hyperspectral imaging: a review. Journal of Biomedical Optics, 19(1):010901–010901, 2014.
 [2] Y. Wang, G. Chen, and M. Maggioni. Highdimensional data modeling techniques for detection of chemical plumes and anomalies in hyperspectral images and movies. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 9(9):4316–4324, 2016.
 [3] M.T. Eismann. Hyperspectral remote sensing. SPIE Press, Bellingham, 2012.
 [4] L. Ma, M.M. Crawford, and J. Tian. Local manifold learning-based k-nearest-neighbor for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 48(11):4099–4109, 2010.
 [5] W. Li, E.W. Tramel, S. Prasad, and J.E. Fowler. Nearest regularized subspace for hyperspectral classification. IEEE Transactions on Geoscience and Remote Sensing, 52(1):477–489, 2014.
 [6] F. Melgani and L. Bruzzone. Classification of hyperspectral remote sensing images with support vector machines. IEEE Transactions on Geoscience and Remote Sensing, 42(8):1778–1790, 2004.
 [7] M. Fauvel, J.A. Benediktsson, J. Chanussot, and J.R. Sveinsson. Spectral and spatial classification of hyperspectral data using SVMs and morphological profiles. IEEE Transactions on Geoscience and Remote Sensing, 46(11):3804–3814, 2008.
 [8] F. Ratle, G. Camps-Valls, and J. Weston. Semisupervised neural networks for efficient hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 48(5):2271–2282, 2010.
 [9] Y. Chen, H. Jiang, C. Li, X. Jia, and P. Ghamisi. Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing, 54(10):6232–6251, 2016.
 [10] H. Liang and Q. Li. Hyperspectral imagery classification using sparse representations of convolutional neural network features. Remote Sensing, 8(2):99, 2016.
 [11] Y. Qian, M. Ye, and J. Zhou. Hyperspectral image classification based on structured sparse logistic regression and threedimensional wavelet texture features. IEEE Transactions on Geoscience and Remote Sensing, 51(4):2276–2291, 2013.
 [12] J. Li, J.M. Bioucas-Dias, and A. Plaza. Semisupervised hyperspectral image classification using soft sparse multinomial logistic regression. IEEE Geoscience and Remote Sensing Letters, 10(2):318–322, 2013.
 [13] J.M. Bioucas-Dias, A. Plaza, N. Dobigeon, M. Parente, Q. Du, P. Gader, and J. Chanussot. Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5(2):354–379, 2012.
 [14] S. Jia and Y. Qian. Constrained nonnegative matrix factorization for hyperspectral unmixing. IEEE Transactions on Geoscience and Remote Sensing, 47(1):161–173, 2009.
 [15] C. Févotte and N. Dobigeon. Nonlinear hyperspectral unmixing with robust nonnegative matrix factorization. IEEE Transactions on Image Processing, 24(12):4810–4819, 2015.
 [16] J.M. Bioucas-Dias and M.A.T. Figueiredo. Alternating direction algorithms for constrained sparse regression: Application to hyperspectral unmixing. In 2nd Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), pages 1–4. IEEE, 2010.
 [17] C. Li, Y. Ma, X. Mei, C. Liu, and J. Ma. Hyperspectral unmixing with robust collaborative sparse regression. Remote Sensing, 8(7):588, 2016.
 [18] A. Paoli, F. Melgani, and E. Pasolli. Clustering of hyperspectral images based on multiobjective particle swarm optimization. IEEE Transactions on Geoscience and Remote Sensing, 47(12):4175–4188, 2009.
 [19] N. Acito, G. Corsini, and M. Diani. An unsupervised algorithm for hyperspectral image segmentation based on the gaussian mixture model. In IEEE International Geoscience and Remote Sensing Symposium (IGARSS), volume 6, pages 3745–3747, 2003.
 [20] C. Cariou and K. Chehdi. Unsupervised nearest neighbors clustering with application to hyperspectral images. IEEE Journal of Selected Topics in Signal Processing, 9(6):1105–1116, 2015.
 [21] Y. Chen, S. Ma, X. Chen, and P. Ghamisi. Hyperspectral data clustering based on density analysis ensemble. Remote Sensing Letters, 8(2):194–203, 2017.
 [22] E. Elhamifar and R. Vidal. Sparse manifold clustering and embedding. In Advances in Neural Information Processing Systems, pages 55–63, 2011.
 [23] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765–2781, 2013.
 [24] N. Gillis, D. Kuang, and H. Park. Hierarchical clustering of hyperspectral images using ranktwo nonnegative matrix factorization. IEEE Transactions on Geoscience and Remote Sensing, 53(4):2066–2078, 2015.
 [25] Z. Meng, E. Merkurjev, A. Koniges, and A. Bertozzi. Hyperspectral video analysis using graph clustering methods. Image Processing On Line, 7:218–245, 2017.
 [26] A. Rodriguez and A. Laio. Clustering by fast search and find of density peaks. Science, 344(6191):1492–1496, 2014.
 [27] H. Zhang, H. Zhai, L. Zhang, and P. Li. Spectral–spatial sparse subspace clustering for hyperspectral remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 54(6):3672–3684, 2016.
 [28] D.W.J. Stein, S.G. Beaven, L.E. Hoff, E.M. Winter, A.P. Schaum, and A.D. Stocker. Anomaly detection from hyperspectral imagery. IEEE Signal Processing Magazine, 19(1):58–69, 2002.
 [29] A. Banerjee, P. Burlina, and C. Diehl. A support vector method for anomaly detection in hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing, 44(8):2282–2291, 2006.
 [30] B. Du and L. Zhang. A discriminative metric learning based anomaly detection method. IEEE Transactions on Geoscience and Remote Sensing, 52(11):6844–6857, 2014.
 [31] W. Li and Q. Du. Collaborative representation for hyperspectral anomaly detection. IEEE Transactions on Geoscience and Remote Sensing, 53(3):1463–1474, 2015.
 [32] G. Chen, M. Iwen, S. Chin, and M. Maggioni. A fast multiscale framework for data in high dimensions: Measure estimation, anomaly detection, and compressive measurements. In Visual Communications and Image Processing (VCIP), 2012 IEEE, pages 1–6, 2012.
 [33] J. Li, J.M. Bioucas-Dias, and A. Plaza. Semisupervised hyperspectral image segmentation using multinomial logistic regression with active learning. IEEE Transactions on Geoscience and Remote Sensing, 48(11):4085–4098, 2010.
 [34] D. Tuia, M. Volpi, L. Copa, M. Kanevski, and J. Munoz-Mari. A survey of active learning algorithms for supervised remote sensing image classification. IEEE Journal of Selected Topics in Signal Processing, 5(3):606–617, 2011.
 [35] R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, and S.W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings of the National Academy of Sciences of the United States of America, 102(21):7426–7431, 2005.
 [36] R.R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.
 [37] M. Maggioni and J.M. Murphy. Clustering by unsupervised geometric learning of modes. 2017.
 [38] Y. Tarabalka, J.A. Benediktsson, and J. Chanussot. Spectral–spatial classification of hyperspectral imagery based on partitional clustering techniques. IEEE Transactions on Geoscience and Remote Sensing, 47(8):2973–2987, 2009.
 [39] M. Fauvel, Y. Tarabalka, J.A. Benediktsson, J. Chanussot, and J.C. Tilton. Advances in spectralspatial classification of hyperspectral images. Proceedings of the IEEE, 101(3):652–675, 2013.
 [40] Z. Wang, N.M. Nasrabadi, and T.S. Huang. Spatial–spectral classification of hyperspectral images using discriminative dictionary designed by learning vector quantization. IEEE Transactions on Geoscience and Remote Sensing, 52(8):4808–4822, 2014.
 [41] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790–799, 1995.
 [42] A.D. Szlam, M. Maggioni, and R.R. Coifman. Regularization on graphs with functionadapted diffusion processes. Jour. Mach. Learn. Res., (9):1711–1739, Aug 2008. (YALE/DCS/TR1365, Yale Univ, July 2006).
 [43] R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings of the National Academy of Sciences of the United States of America, 102(21):7426–7431, 2005.
 [44] B. Nadler, S. Lafon, R.R. Coifman, and I.G. Kevrekidis. Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. Applied and Computational Harmonic Analysis, 21(1):113–127, 2006.
 [45] R.R. Coifman, I.G. Kevrekidis, S. Lafon, M. Maggioni, and B. Nadler. Diffusion maps, reduction coordinates, and low dimensional representation of stochastic systems. Multiscale Modeling & Simulation, 7(2):842–864, 2008.
 [46] A. Singer and R.R. Coifman. Nonlinear independent component analysis with diffusion maps. Applied and Computational Harmonic Analysis, 25(2):226–239, 2008.
 [47] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
 [48] S. Lafon, Y. Keller, and R.R. Coifman. Data fusion and multicue data matching by diffusion maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11):1784–1797, 2006.
 [49] W. Czaja, B. Manning, L. McLean, and J.M. Murphy. Fusion of aerial gammaray survey and remote sensing data for a deeper understanding of radionuclide fate after radiological incidents: examples from the fukushima daiichi response. Journal of Radioanalytical and Nuclear Chemistry, 307(3):2397–2401, 2016.
 [50] R.R. Lederman and R. Talmon. Learning the geometry of common latent variables using alternating-diffusion. Applied and Computational Harmonic Analysis, 2015.
 [51] R.R. Lederman, R. Talmon, H. Wu, Y.L. Lo, and R.R. Coifman. Alternating diffusion for common manifold learning with application to sleep stage assessment. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 5758–5762. IEEE, 2015.
 [52] M.A. Rohrdanz, W. Zheng, M. Maggioni, and C. Clementi. Determination of reaction coordinates via locally scaled diffusion map. The Journal of Chemical Physics, 134(12):03B624, 2011.
 [53] W. Zheng, M.A. Rohrdanz, M. Maggioni, and C. Clementi. Polymer reversal rate calculated via locally scaled diffusion map. The Journal of Chemical Physics, 134(14):144109, 2011.
 [54] J.B. Tenenbaum, V. De Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
 [55] S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
 [56] S.J. Sheather and M.C. Jones. A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society. Series B (Methodological), pages 683–690, 1991.
 [57] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning. Springer Series in Statistics. Springer, 2001.
 [58] M. Banerjee, M. Capozzoli, L. McSweeney, and D. Sinha. Beyond kappa: A review of interrater agreement measures. Canadian journal of statistics, 27(1):3–23, 1999.
 [59] P. Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.
 [60] A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4):411–430, 2000.
 [61] S. Dasgupta. Experiments with random projection. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 143–151. Morgan Kaufmann Publishers Inc., 2000.
 [62] E.J. Candes and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory, 52(12):5406–5425, 2006.
 [63] A.Y. Ng, M.I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, volume 14, pages 849–856, 2001.
 [64] U. Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
 [65] D. Manolakis, C. Siracusa, and G. Shaw. Hyperspectral subpixel target detection using the linear mixing model. IEEE Transactions on Geoscience and Remote Sensing, 39(7):1392–1409, 2001.
 [66] S. Kraut, L.L. Scharf, and L.T. McWhorter. Adaptive subspace detectors. IEEE Transactions on Signal Processing, 49(1):1–16, 2001.
 [67] D. Mumford and J. Shah. Optimal approximations by piecewise smooth functions and associated variational problems. Communications on pure and applied mathematics, 42(5):577–685, 1989.
 [68] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In International Conference on Machine Learning, pages 97–104, 2006.
 [69] C.R. Genovese, M. PeronePacifico, I. Verdinelli, and L. Wasserman. Nonparametric inference for density modes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(1):99–126, 2016.
 [70] G. Schiebinger, M.J. Wainwright, and B. Yu. The geometry of kernelized spectral clustering. The Annals of Statistics, 43(2):819–846, 2015.
 [71] H. Wold. Partial least squares. Encyclopedia of statistical sciences, 1985.
 [72] P. Geladi and B.R. Kowalski. Partial least-squares regression: a tutorial. Analytica Chimica Acta, 185:1–17, 1986.
 [73] S. de Jong. SIMPLS: an alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18(3):251–263, 1993.
Appendix A Appendix
Aa Partial Least Squares Regression
Partial least squares [71, 72, 73, 57] is a linear regression method that attempts to maximize variation in both the predictor and response variables; this is in contrast to principal component regression, which attempts to maximize variation in only the predictor variables. Assume normalized and centered training data X ∈ R^{N×D} is given, along with labels y ∈ R^N. PLS starts by computing the univariate regression coefficient of y on each column of X. Summing the columns weighted by these coefficients, we acquire the first partial least squares direction z_1. For the remaining PLS directions, the inputs are weighted according to the strength of their univariate impact on y. The outcome y is regressed on z_1 to obtain a coefficient, the columns of X are then orthogonalized with respect to z_1, and we repeat. We stop when a sufficient number of directions are obtained, yielding directions z_1, ..., z_M. In our experiments, we take M = K when the number of classes K is known, or estimate K by the kink in the sorted mode-detection curve as above.
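The PLS iteration described above can be sketched as follows (notation loosely follows [57]; the columns of X are assumed standardized):

```python
import numpy as np

def pls_directions(X, y, M):
    """Compute M partial least squares directions.  Each direction sums
    the current input columns weighted by their univariate regression
    coefficients on y; the remaining columns are then orthogonalized
    with respect to the direction just found."""
    Xm = X.copy()
    Z = np.empty((X.shape[0], M))
    for m in range(M):
        phi = Xm.T @ y                  # univariate coefficients <x_j, y>
        z = Xm @ phi                    # m-th PLS direction
        Z[:, m] = z
        # orthogonalize the remaining columns with respect to z
        Xm = Xm - np.outer(z, z @ Xm) / (z @ z)
    return Z
```

The orthogonalization makes the directions mutually orthogonal, and taking M = D recovers the ordinary least-squares fit, while small M gives a regularized fit.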
It is helpful to compare PLS to PCA. Let S be the sample covariance matrix of X. In PCA, a representation of X is computed in which the m-th principal component direction v_m solves

max_v Var(Xv) subject to ||v|| = 1 and v^T S v_l = 0 for l = 1, ..., m-1.

The conditions v^T S v_l = 0 ensure that the component Xv_m is uncorrelated with all lower components Xv_l.
For partial least squares, the m-th PLS direction φ_m instead solves

max_v Corr^2(y, Xv) Var(Xv) subject to ||v|| = 1 and v^T S φ_l = 0 for l = 1, ..., m-1.

Letting M = D, the resulting PLSR coefficient coincides with ordinary least squares: the PLSR solution computed with all D directions satisfies β_PLS = β_LS, where β_LS is the least-squares solution to the overdetermined system Xβ = y.
AA1 PLSR Computational Complexity
PLSR is implemented with the SIMPLS algorithm [73], which has computational complexity O(NDK), where K is the number of partial least squares components used. In our algorithm, this is equal to the number of classes, either known or determined by studying the decay of the sorted mode-detection curve, and may be assumed O(1), independent of N.
AB Additional Experiments
AB1 Synthetic Dataset
The synthetic HSI dataset contains 3 classes generated by drawing samples from a random distribution. The classes constitute spatial rectangles of differing sizes; some sample bands appear in Figure 17. We fix three pure spectra for these three classes, then randomly perturb the locations and values of their maxima, randomly dilate the spectra, and add Gaussian noise. Some typical spectra for each class appear in Figure 18.



More precisely, an element of the i-th class is generated from a pure spectrum u_i as follows. Each pure spectrum has five local maxima. For each i, let m_1^i, ..., m_5^i be the coordinates of the spectral maxima of the pure spectrum u_i, and let ε_1, ε_2 > 0 be perturbation parameters. A sample from class i is generated as follows:

Randomly translate the maxima locations m_j^i independently according to a uniform distribution Unif([-ε_1, ε_1]), then smooth.

Randomly shift the maxima values independently according to a uniform distribution Unif([-ε_2, ε_2]), then smooth.

Randomly perturb the entire spectrum by a pointwise multiplication by a random dilation factor.

Add mean 0, variance .01 Gaussian random noise componentwise.
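The generation pipeline above can be sketched as follows; the pure spectrum, the whole-spectrum shift in place of per-maximum perturbations, and the moving-average smoother are illustrative simplifications, not the paper's exact construction:

```python
import numpy as np

def perturbed_spectrum(pure, eps_loc=3, eps_amp=0.1, rng=None):
    """Generate one synthetic sample from a pure spectrum: jitter along
    the wavelength axis, randomly dilate the amplitude, smooth, and add
    Gaussian noise.  eps_loc and eps_amp are illustrative parameters."""
    rng = np.random.default_rng() if rng is None else rng
    # translate along the spectral axis by a uniform random integer shift
    shift = int(rng.integers(-eps_loc, eps_loc + 1))
    s = np.roll(pure, shift)
    # random amplitude dilation of the whole spectrum
    s = s * (1 + rng.uniform(-eps_amp, eps_amp))
    # smooth with a small moving-average kernel
    s = np.convolve(s, np.ones(3) / 3, mode="same")
    # additive Gaussian noise with mean 0 and variance .01
    return s + rng.normal(0, 0.1, size=s.shape)
```

Larger values of the perturbation parameters produce heavier class overlap, which is exactly the regime explored in Figure 19.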
Results for each perturbation parameter, with the other held fixed, appear in Figure 19; a range of other values were also considered, with similar results. As the varying parameter increases, the clustering problem becomes more challenging. The proposed methods, DL and DLSS, perform well up to a certain point, then rapidly decay. The observed phase transition is due to the failure to identify modes from distinct classes, caused by increased class overlap. In this synthetic example with very simple spatial boundaries, there is little benefit to using the spatial-spectral labeling scheme.
After the parameter is sufficiently large and the classes are very mixed, all algorithms converge to rather poor accuracies. These plots illustrate an important drawback of the proposed method: when the algorithm fails to find distinct modes of the distributions, results may be rather poor. This is true of most clustering methods, however, and the proposed method of mode detection using diffusion distances is evidently more robust than using Euclidean distances.
AB2 Pavia Visual Clustering Results
Visual results for the Pavia dataset appear in Figure 20.
AB3 Pavia University Dataset
The Pavia University dataset used for experiments consists of a subset of the original Pavia University dataset, and contains five classes of wide-ranging size, many of which are spread throughout the image spatially. Figure 22 shows that the yellow, teal, and green classes are relatively small and quite well distributed throughout the image. These classes are expected to be hard to cluster correctly.
The proposed diffusion learning methods give the best results for overall accuracy and κ, though for average accuracy, DL and DLSS drop in performance. This can be understood by analyzing the visual results in Figure 21, where it is seen that the green class is rather poorly labeled by DLSS, compared to DL for example. This indicates that the incorporation of spatial information was perhaps overzealous, washing out a small class and leading to a significant drop in average accuracy. We remark that FSFDPC is unable to learn correct modes from each class, leading to some classes being totally mislabeled. Spectral clustering and SMCE both perform rather well, though there are some problems with the dark blue, light blue, and orange classes for both of those methods.