Deep Manifold Embedding for Hyperspectral Image Classification
Abstract
Deep learning methods play an increasingly important role in hyperspectral image classification. However, general deep learning methods mainly exploit the information of each sample itself or the pairwise information between samples, while ignoring the intrinsic data structure within the whole data. To tackle this problem, this work develops a novel deep manifold embedding method (DMEM) for hyperspectral image classification. First, each class in the image is modelled as a specific nonlinear manifold, and the geodesic distance is used to measure the correlation between samples. Then, based on hierarchical clustering, the manifold structure of the data is captured and each nonlinear data manifold is divided into several subclasses. Finally, considering the distribution of each subclass and the correlation between different subclasses, the DMEM is constructed to preserve the estimated geodesic distances on the data manifold between the learned low-dimensional features of different samples. Experiments over three real-world hyperspectral image datasets demonstrate the effectiveness of the proposed method.
I Introduction
Hyperspectral images, which contain hundreds of spectral bands to characterize different materials, make it possible to discriminate different objects with plentiful spectral information and have proven their importance in remote sensing and computer vision [7, 51, 52]. As an important hyperspectral data task, hyperspectral image classification aims to assign a unique land-cover label to each pixel and is a key technique in many real-world applications, such as urban planning [15], military applications [8], and others. However, hyperspectral image classification remains a challenging task. The samples within each class are usually highly nonlinear, so effectively modelling and representing the samples of each class is difficult. Besides, the large overlap between the spectral channels of different classes makes it even harder to obtain discriminative features from the samples.
Deep models have demonstrated their potential to model the nonlinearity of samples [23, 47, 13]. They can learn the model adaptively from the training samples and extract the differences between classes. Owing to this good performance, this work uses a deep model to extract features from the hyperspectral image. However, a deep model requires a large number of training samples to perform well, while only a limited number of training samples is usually available in many computer vision tasks, especially in hyperspectral image classification. Therefore, how to construct the training loss and fully utilize the data information with a limited number of training samples becomes the key problem for effective deep learning.
The softmax loss, namely the softmax cross-entropy loss, is widely applied in prior works. It is formulated as the cross entropy between the posterior probability and the class label of each sample [38], and thus mainly takes advantage of the point-to-point information of each sample itself. Several variants that utilize the distance information between each sample pair or within each triplet have been proposed. These losses, such as the contrastive loss [9] and the triplet loss [32], have made great strides in improving the representational ability of the CNN model. However, these prior losses, which we call samples-wise methods, mainly utilize the information of each sample itself or between samples and ignore the intrinsic data structure. In other words, samples-wise methods only consider common, simple information and ignore the special intrinsic data structure of the hyperspectral image for the task at hand.
Establishing a good model for the hyperspectral image is the premise of making use of the intrinsic data structure in deep learning. Generally, the ways to model the hyperspectral image can be broadly divided into two classes: parametric models and non-parametric models. Typical parametric models for hyperspectral images are probabilistic models, such as the multivariate Gaussian distribution. This class of models has been successfully applied in hyperspectral target detection [57] and anomaly detection [49]. Parameter estimation with the training data is essential under these parametric models [43]. The other class of models makes use of the information provided by the training data directly, without modelling the class data [18]. These non-parametric models are usually based on mutual information and are suitable for general cases since they do not assume anything about the shape of the class data density functions. In this work, the manifold model, which plays an important role among non-parametric models and can better fit the high dimensionality of the hyperspectral image, is applied to model the image for the current task.
Manifold learning has been widely applied in many computer vision tasks, such as face recognition [43, 44] and image classification [28], as well as in hyperspectral image processing [42, 29]. Generally, a data manifold follows the law of manifold distribution: in real-world applications, high-dimensional data of the same class usually lies close to a low-dimensional manifold [21]. Therefore, hyperspectral images, which provide a dense spectral sampling at each pixel, possess a good intrinsic manifold structure. This work aims to develop a novel deep manifold embedding method (DMEM) for hyperspectral image classification that makes use of the data manifold structure and preserves the intrinsic data structure in the obtained low-dimensional features.
In addition to the law of manifold distribution, a data manifold usually follows another property, namely the law of cluster distribution: the different subclasses of a certain class in the high-dimensional data correspond to different probability distributions on the manifold [22]. Furthermore, these probability distributions are far enough apart to distinguish these subclasses. Therefore, based on the geodesic distances between the samples, we divide each class in the hyperspectral image into several subclasses. Then, we develop the DMEM according to the following two principles.

Based on multi-statistical analysis, the deep manifold embedding is constructed to encourage the features from each subclass to follow a certain distribution and thus preserve the intrinsic structure in the low-dimensional feature space.

Motivated by the idea of maximizing the “manifold margin” in manifold discriminant analysis [43], an additional diversity-promoting term is developed to increase the margin between subclasses from different data manifolds.
Overall, the main contributions of this work are threefold. First, this work models the hyperspectral image with nonlinear manifolds and takes advantage of the intrinsic manifold structure of the hyperspectral image in the deep learning process. Second, this work formulates a novel training loss based on manifold embedding in deep learning for hyperspectral image classification, so that the intrinsic manifold structure is preserved in the low-dimensional features. Finally, a thorough comparison with different samples-based embeddings and losses is provided.
The rest of this paper is arranged as follows. Section II briefly reviews existing works on manifold learning and general deep learning. Section III gives a detailed description of the proposed method, which embeds the manifold model in deep learning for hyperspectral image classification. Section IV presents the experimental results and comparisons to validate the effectiveness of the proposed method. Finally, we conclude this work with some discussions in Section V.
II Related Work
In this section, we review two topics closely related to this paper. First, deep learning methods, which motivate this work, are briefly introduced. Then, prior work on manifold learning, which is directly related to the proposed method, is reviewed.
II-A Deep Learning
Deep learning methods capture the data information from the training samples under a fixed criterion given by the loss function. These loss functions are mainly based on samples-wise information and can be divided into two classes according to different criteria.
The first one is the one-to-one correspondence criterion, which measures the difference between the prediction and the corresponding label of each sample. The typical representative is the widely used softmax loss. Several variants have been developed to boost the performance of the general softmax loss. For example, Liu et al. [27] develop the large-margin softmax loss (L-Softmax), which uses a simple angular margin regularization to achieve a classification angle margin between different classes. The work of Liu et al. [26] is also of this type and improves L-Softmax by normalizing the weights. Wang et al. [39] rethink the softmax loss from the cosine perspective and construct the large-margin cosine loss. Wan et al. [38] introduce a distribution prior on the learned features and construct the Gaussian Mixture (GM) loss. A classification margin and likelihood regularization can also be imposed on the GM loss to accurately model the features. All these works utilize the information from different samples independently.
The second one uses the inter-sample information. The principle of these works is to decrease the Euclidean distances between samples with the same class label and increase the distances between samples from different classes. Hadsell et al. [9] first develop the contrastive loss to utilize the information of image pairs. Schroff et al. [32] construct triplet data and formulate the triplet loss. Sohn [34] further considers N-pair sampling rather than triplet sampling. Wang et al. [41] reformulate the correlation of triplet data from the angular perspective and develop the angular loss with triplet sampling. As variants of the contrastive loss, Song et al. [35] propose the structured loss by taking advantage of the intrinsic structure within the mini-batch, and Zhang et al. [48] make use of the harmonic range within each class to handle imbalanced data through the developed range loss. As a further improvement, the center loss developed by Wen et al. [46] uses the center point of each class to form image pairs within the class. By adding more image pairs, Zhe et al. [50] extend the contrastive loss to the class-wise loss.
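The pairwise and triplet criteria described above can be illustrated with a minimal Python sketch. The function names, the default margins, and the plain-list feature vectors are illustrative assumptions rather than the implementations used in [9] or [32]; the formulas follow the commonly cited forms of the contrastive loss and the triplet loss.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors given as lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(f1, f2, same_class, margin=1.0):
    """Pull same-class pairs together, push different-class pairs
    apart until they are at least `margin` away (margin is assumed)."""
    d = euclidean(f1, f2)
    if same_class:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Require the anchor to be closer (in squared distance) to the
    positive than to the negative by at least `margin` (assumed)."""
    d_ap = euclidean(anchor, positive) ** 2
    d_an = euclidean(anchor, negative) ** 2
    return max(0.0, d_ap - d_an + margin)
```

Both losses depend only on distances within a pair or a triplet, which is why the text groups them as samples-wise criteria.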
The losses reviewed above only consider common, simple information from the training samples and ignore the intrinsic information within the data. For the task at hand in particular, the high-dimensional hyperspectral data exhibits high nonlinearity and large overlap between classes. Under these circumstances, using the special intrinsic manifold structure of the hyperspectral image is particularly important and makes the learned model fit the image better. This is the direct motivation of the method developed in this work.
II-B Manifold Learning
Manifold learning aims to learn from the data a latent space that represents the input space. It can not only grasp the hidden structure of the data, but also generate low-dimensional features by nonlinear mapping. A large number of manifold learning methods have been proposed, such as Isometric Feature Mapping (ISOMAP) [40, 37], Laplacian Eigenmaps [2, 37], Locally Linear Embedding (LLE) [31], Semidefinite Embedding [45], Manifold Discriminant Analysis [43, 44], and RSRML [10]. With the development of deep learning, some works have incorporated manifolds into deep models [1, 56, 28, 17]. Zhu et al. [56] develop the automated transform by manifold approximation (AUTOMAP), which learns a near-optimal reconstruction mapping through manifold learning. Lu et al. [28] and Aziere et al. [1] mainly apply manifold learning in deep ensembles and consider the manifold similarity relationships between different CNNs. Iscen et al. [17] use manifolds to implement metric learning without labels.
These manifold learning methods are mainly applied in natural image processing tasks, such as face recognition [10], natural image classification [44], and image retrieval [1]. Only a few works, such as [29] and [42], focus on the hyperspectral image classification task. Among these works, Ma et al. [29] only combine local manifold learning with the nearest-neighbor classifier, and Wang et al. [42] use manifold ranking for salient band selection. None of these works considers the intrinsic manifold structure of the hyperspectral image in the training process. For the current task, this work develops a novel deep manifold embedding that encourages the learned deep model to capture the intrinsic data structure of the hyperspectral image and to preserve the manifold structure in the low-dimensional features. In the following, we introduce the developed deep manifold embedding in detail.
III Manifold Embedding in Deep Learning
Denote as the training samples of the hyperspectral image and is the corresponding class label of where defines the number of the training samples. where stands for the set of class labels and represents the number of the class of the image.
III-A Manifold Structure within the Hyperspectral Image
Denote where represents the set of samples from the th class and is the number of samples from the th class.
Following the law of manifold distribution, the samples of each class in the hyperspectral image are assumed to lie on a certain nonlinear manifold. As introduced above, the nonlinear manifold obeys the law of cluster distribution. Therefore, each class can be divided into several subclasses, and each subclass is supposed to follow a certain probability distribution. Generally, samples that are closer on the manifold are supposed to belong to the same subclass, namely the same probability distribution. This work therefore measures the distance between samples on the manifold with the geodesic distance instead of the Euclidean distance.
Given the th class in the image. To separate the samples of each class into different subclasses, all the samples of each class is used to formulate an undirected graph. Let denote the graph over the th class, where is the set of nodes in the graph and is the set of edges in the graph.
The distance between the sample and its nearest neighbors is assumed to distribute on a certain linear manifold and can be calculated under the Euclidean distance,
(1) 
Then, the weights of the edges on the undirected graph on the th class are defined as follows:
(2) 
In the data manifold, the geodesic distance [33] can be used to measure the distance between different samples on the manifold. The geodesic distance on the manifold can be transformed by the shortest path on the graph . Then, the distance between the sample and on the manifold can be calculated by
(3) 
where .
This work uses the Dijkstra algorithm [4] to solve the optimization in Eq. 3. Then, the distance matrix over the data manifold of the th class can be formulated by the pairwise distance between different samples.
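The graph construction and the shortest-path step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: Eq. 1 to Eq. 3 are not reproduced here, so the sketch assumes that edge weights are Euclidean distances between each sample and its k nearest neighbors and that the graph is symmetrized. The function names `knn_graph`, `dijkstra`, and `geodesic_matrix` are hypothetical.

```python
import heapq
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_graph(samples, k):
    """Undirected graph: each sample is linked to its k nearest
    neighbors, with the Euclidean distance as the edge weight."""
    n = len(samples)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        dists = sorted(
            (euclidean(samples[i], samples[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d  # symmetrize (assumption)
    return graph


def dijkstra(graph, source):
    """Shortest-path length from `source` to every node."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def geodesic_matrix(samples, k):
    """Pairwise geodesic distances estimated on the k-NN graph.
    Unreachable pairs keep the value math.inf."""
    graph = knn_graph(samples, k)
    n = len(samples)
    rows = []
    for i in range(n):
        d = dijkstra(graph, i)
        rows.append([d[j] for j in range(n)])
    return rows
```

For five points arranged along an L shape, `[(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]` with `k=1`, the estimated geodesic distance between the two end points is 4.0, whereas their Euclidean distance is about 2.83, which illustrates why the two measures can disagree on a curved manifold.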
Here, for each class, we divide the whole training samples of the class into subclasses. Denote as the subclasses of the th class. The samples in each subclass are supposed to be close enough. Then, these subclasses are constructed under the following optimization:
(4) 
Under the optimization in Eq. 4, we can obtain the subclasses with the smallest geodesic distances between the samples in each subclass. Hierarchical clustering can be used to solve the optimization. The whole procedure is outlined in Algorithm 1.
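The clustering step can be sketched as a simple agglomerative procedure on a precomputed distance matrix. Since Eq. 4 and Algorithm 1 are not reproduced here, the linkage rule is an assumption: the sketch uses average linkage, and the function name `agglomerative_clusters` is hypothetical.

```python
def agglomerative_clusters(dist, num_clusters):
    """Bottom-up hierarchical clustering on a distance matrix.

    dist: square matrix (list of lists) of pairwise distances,
          for example the geodesic distances of one class.
    num_clusters: number of subclasses to keep.
    Returns a list of clusters, each a list of sample indices.
    Average linkage is an assumption of this sketch.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Mean pairwise distance between the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > num_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        # Merge the closest pair of clusters.
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters
```

Feeding the geodesic distance matrix of one class into such a routine yields subclasses whose members are close on the manifold; the naive pair search is cubic in the number of samples and is meant for illustration only.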
III-B Deep Manifold Embedding
This work selects the CNN model as the feature extraction model for the hyperspectral image. Denote as the features of sample extracted by the CNN model. Then, the obtained features can be regarded as the global low-dimensional coordinates under the nonlinear CNN mapping. Besides, as Fig. 1 shows, the deep manifold embedding constructs the global low-dimensional coordinates to preserve the estimated distances on the manifold.
From the law of cluster distribution, we know that different subclasses correspond to different probability distributions over the manifold. To preserve the estimated geodesic distances, the features extracted from the samples of each subclass are also expected to follow the same distribution in the low-dimensional coordinates.
As processed in former subsection, suppose as the subclasses from the th class. Given where is the number of samples in the subclass . If not specified, in the following, we use to represent the . Then, is the set of the learned features. The problem to promote the features in to follow the same distribution can be transformed to the one that , follows the distributions constructed by all the other features in under a certain degree of confidence.
Therefore, Given . Suppose all the other features in follow the multivariant Gaussian distribution where is the dimension of the learned features. Then,
(5) 
Under the confidence , when
(6) 
can be seen as the sample from distribution . For simplicity, we assume that different dimensions in the feature are independent and have the same variance, namely the covariance where represents the identity matrix. Besides, the unbiased estimation of the mean value is
(7) 
Then, the penalization from can be formulated by
(8) 
where is the constant term. Since
(9)  
Ignore the constant term and we use the following penalization to replace that in Eq. 8:
(10) 
Then, the loss for deep manifold embedding can be written as
(11) 
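One plausible reading of the penalization above can be sketched in Python. Because Eq. 5 to Eq. 11 are not reproduced here, the exact form is an assumption: the sketch penalizes the squared Euclidean distance between each feature and the mean of the other features in the same subclass, drops the variance scaling and the confidence threshold, and averages over the subclass. All function names are hypothetical.

```python
def leave_one_out_mean(features, index):
    """Mean of all features in the subclass except features[index]."""
    others = [f for i, f in enumerate(features) if i != index]
    dim = len(features[0])
    return [sum(f[d] for f in others) / len(others) for d in range(dim)]


def subclass_penalty(features):
    """Average squared distance of each feature to the mean of the
    remaining features of its subclass (assumed form of the penalty)."""
    if len(features) < 2:
        return 0.0  # no other features to estimate a mean from
    total = 0.0
    for i, f in enumerate(features):
        mu = leave_one_out_mean(features, i)
        total += sum((x - m) ** 2 for x, m in zip(f, mu))
    return total / len(features)


def manifold_embedding_loss(subclasses):
    """Sum of the subclass penalties over all subclasses.

    subclasses: list of subclasses, each a list of feature vectors.
    """
    return sum(subclass_penalty(s) for s in subclasses)
```

Under this reading, the loss is zero when all features of a subclass coincide and grows as the features of a subclass spread out, which matches the stated goal of encouraging each subclass to follow a single compact distribution.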
To further improve the performance of the manifold embedding, we introduce a diversity-promoting term to enlarge the distance between subclasses from different classes. The distance between subclasses can be measured by a set-to-set distance. This work uses the Hausdorff distance, which is the maximum distance from a point in one set to the nearest point in the other set [30], since this measurement considers the whole shape of the data set as well as the position of the samples in the set.
Suppose as the subclass from th class and as the subclass from th class, then the Hausdorff distance between the two subclasses can be calculated by
(12) 
Then, the diversity-promoting term [6] can be formulated as
(13) 
where is a positive value which represents the margin.
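The set-to-set distance and a margin-based diversity term can be sketched as follows. Eq. 12 and Eq. 13 are not reproduced here, so two choices in this sketch are assumptions: the symmetric form of the Hausdorff distance, and a hinge of the form max(0, margin - distance) summed over all pairs of subclasses that come from different classes. The function names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def directed_hausdorff(set_a, set_b):
    """Largest distance from a point of set_a to its nearest point in set_b."""
    return max(min(euclidean(a, b) for b in set_b) for a in set_a)


def hausdorff(set_a, set_b):
    """Symmetric Hausdorff distance (assumed variant)."""
    return max(directed_hausdorff(set_a, set_b), directed_hausdorff(set_b, set_a))


def diversity_term(subclasses_by_class, margin):
    """Hinge penalty on subclass pairs from different classes whose
    Hausdorff distance is smaller than `margin` (assumed form).

    subclasses_by_class: list over classes; each entry is a list of
    subclasses; each subclass is a list of feature vectors.
    """
    total = 0.0
    num_classes = len(subclasses_by_class)
    for p in range(num_classes):
        for q in range(p + 1, num_classes):
            for sub_p in subclasses_by_class[p]:
                for sub_q in subclasses_by_class[q]:
                    total += max(0.0, margin - hausdorff(sub_p, sub_q))
    return total
```

With this form, subclass pairs that are already farther apart than the margin contribute nothing, so the term only acts on subclasses of different classes that lie too close to each other.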
III-C Optimization
As in general deep learning methods, stochastic gradient descent (SGD) and back propagation (BP) are used to train the developed deep manifold embedding [11]. The key step is to calculate the derivative of the loss with respect to (w.r.t.) the features .
Based on the chain rule, gradients of w.r.t. can be calculated as
(15) 
Then, we have
(16) 
where represents the indicative function.
(17)  
We summarize the computation of loss functions and gradients in Algorithm 2.
IV Experimental Results
In this section, extensive experiments are conducted to demonstrate the effectiveness of the proposed method. First, the datasets used in this work are introduced. Then, the experimental setups are detailed, and the experimental results are shown and analyzed.
IV-A Datasets
To validate the effectiveness of the proposed method, this work conducts experiments over three real-world hyperspectral images [16], namely the Pavia University, the Indian Pines, and the Salinas Scene data.

The Pavia University data was acquired by the reflective optics system imaging spectrometer (ROSIS-3) sensor during a flight campaign over Pavia, Northern Italy. The image consists of 610 × 340 pixels with a geometric resolution of 1.3 m/pixel. A total of 42,776 labelled samples divided into 9 land-cover classes are used for experiments, and each sample has 103 spectral bands ranging from 0.43 to 0.86 μm.

The Indian Pines data was gathered by the 224-band AVIRIS sensor, ranging from 0.4 to 2.5 μm, over the Indian Pines test site in Northwestern Indiana. It consists of 145 × 145 pixels with a spatial resolution of 20 m/pixel. After removing the 24 water-absorption bands, 200 bands are retained. 16 classes of agriculture, forests, and vegetation with a total of 10,249 labelled samples are included for experiments.

The Salinas Scene data was also collected by the 224-band AVIRIS sensor with a spectral coverage from 0.4 to 2.5 μm, but over Salinas Valley, California. The image size is 512 × 217 pixels with a spatial resolution of 3.7 m/pixel. As with the Indian Pines scene, 20 water-absorption bands are discarded. 16 classes of interest, including vegetables, bare soils, and vineyard fields, with a total of 54,129 labelled samples are chosen for experiments.
IV-B Experimental Setups
There are four parameters in the experiments to be determined, namely the balance between the optimization term and the diversity-promoting term , the balance between the manifold embedding term and the softmax loss , the number of subclasses , and the number of neighbors . The first two are empirically set as . As for and , extensive experiments have been conducted to choose the best values: we set the two variables to different values and check the performance under various and .
Caffe [19] is chosen as the deep learning framework to implement the developed method for hyperspectral image classification. This work adopts the simple CNN architecture shown in Fig. 2 to provide the nonlinear mapping for the low-dimensional features of the data manifold. The learning rate, number of training iterations, and training batch size are set to 0.001, 60000, and 84, respectively. The tradeoff parameter in the deep manifold embedding is set to 0.0001. As in Fig. 2, this work takes advantage of the neighbors to extract both the spatial and the spectral information from the image.
In the experiments, we choose 200 samples per class for training and the remainder for testing over the Pavia University and Salinas scene data, while over the Indian Pines data, we select 20 percent of the samples per class for training and the others for testing. To objectively evaluate the classification performance, the overall accuracy (OA), average accuracy (AA), and Kappa coefficient are adopted as metrics. All the results are reported as the average value and standard deviation over ten runs of training and testing. The code for the implementation of the proposed method will be released soon at http:/github.com/shendusw/deepmanifoldembedding.
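For reference, the three metrics can be computed from a confusion matrix as in the following minimal Python sketch. The function names are hypothetical and class labels are assumed to be integers starting at 0; the formulas are the standard definitions of overall accuracy, mean per-class accuracy, and Cohen's kappa.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts samples of true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def overall_accuracy(cm):
    """Fraction of all samples that are correctly classified (OA)."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def average_accuracy(cm):
    """Mean of the per-class accuracies (AA); empty classes are skipped."""
    accs = [cm[i][i] / sum(cm[i]) for i in range(len(cm)) if sum(cm[i]) > 0]
    return sum(accs) / len(accs)


def kappa(cm):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    po = overall_accuracy(cm)
    pe = sum(
        sum(cm[i]) * sum(row[i] for row in cm) for i in range(n)
    ) / (total ** 2)
    if pe == 1.0:
        return float("nan")  # kappa is undefined in this degenerate case
    return (po - pe) / (1.0 - pe)
```

AA weights every class equally, so it is more sensitive than OA to classes with few samples, which is relevant for the small classes of the Indian Pines data.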
IV-C General Performance
First, we present the general performance of the developed manifold embedding for hyperspectral image classification. In this set of experiments, the number of subclasses is set to 5 and the number of neighbors is set to 5. A common machine with a 3.6 GHz Intel Core i7 CPU, 64 GB of memory, and an NVIDIA GeForce GTX 1080 GPU was used to test the performance of the proposed method. The proposed method took about 2196 s over the Pavia University data, 2314 s over the Indian Pines data, and 2965 s over the Salinas scene data. It should be noted that the developed manifold embedding is implemented on the CPU, and the computational performance could be remarkably improved by modifying the code to run on GPUs.
Tables I, II, and III show the general performance over the Pavia University, Indian Pines, and Salinas scene data, respectively. These tables list the classification accuracy of each class as well as the OA, AA, and Kappa obtained by SVM-POLY, the CNN trained with the softmax loss, and the CNN trained with the proposed method. From these tables, we can see that the CNN model provides a more discriminative representation of the hyperspectral image than handcrafted features. Furthermore, the performance of the CNN model is significantly improved when it is trained with the proposed method rather than with the softmax loss only. Over the Pavia University data, the CNN model with the manifold embedding obtains an accuracy of 99.52%, which is higher than the 98.61% obtained by the CNN with the softmax loss only. Over the Indian Pines and the Salinas scene data, the proposed method achieves 99.51% and 97.80%, respectively, and also outperforms the CNN with the general softmax loss.
It should be noted that constructing the data manifold structure requires a certain number of samples. Over the Salinas scene and the Pavia University data, the classification accuracy of each class is improved by the proposed method. However, over the Indian Pines data, the classification accuracies of some classes, such as alfalfa, corn, grass pasture mowed, oats, and wheat, are decreased by the developed method. The reason is that these classes have very few training samples, while the other classes have far more. In particular, only four samples from the oats class are used for training. So few training samples cannot model the data manifold structure and may even have negative effects on the classification performance.
To further validate the effectiveness of the developed method, this work uses McNemar’s test [5], which is based on the standardized normal test statistic, for a deeper comparison in the statistical sense. The statistic can be computed by
(18) 
where describes the number of correctly classified samples by the th method but wrongly by the th method. Therefore, measures the pairwise statistical significance between the th and th methods. At the widely used level of confidence, the difference of accuracies between different methods is statistically significant if .
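The test statistic can be sketched in Python as follows. Since Eq. 18 is not reproduced here, the sketch assumes the standardized form commonly used for comparing two classifiers, z = (f_ab - f_ba) / sqrt(f_ab + f_ba), together with the 1.96 threshold mentioned later in the text; the function name `mcnemar_z` is hypothetical.

```python
import math


def mcnemar_z(y_true, pred_a, pred_b):
    """Standardized McNemar statistic between methods A and B.

    f_ab: samples classified correctly by A but wrongly by B.
    f_ba: samples classified correctly by B but wrongly by A.
    Returns 0.0 when the two methods never disagree in correctness.
    """
    f_ab = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
    f_ba = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a != t and b == t)
    if f_ab + f_ba == 0:
        return 0.0
    return (f_ab - f_ba) / math.sqrt(f_ab + f_ba)


def is_significant(z, threshold=1.96):
    """True if |z| exceeds the threshold (1.96 for the 95% level)."""
    return abs(z) > threshold
```

A positive value indicates that method A is correct more often than method B on the samples where the two disagree, and a magnitude above 1.96 marks the difference as statistically significant at the 95% level.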
Table I: Classification accuracies (%) of each class (C1 to C9), OA (%), AA (%), and Kappa (%) obtained by SVM-POLY, the CNN, and the proposed method over the Pavia University data.
From these tables, it can also be noted that when the proposed method is compared with the CNN trained with the general softmax loss, the McNemar’s test value reaches 20.80, 4.48, and 12.67 over the Pavia University, Indian Pines, and Salinas scene data, respectively. This indicates that the improvement brought by the developed deep manifold embedding to the performance of the CNN is statistically significant.
Table II: Classification accuracies (%) of each class (C1 to C16), OA (%), AA (%), and Kappa (%) obtained by SVM-POLY, the CNN, and the proposed method over the Indian Pines data.
Table III: Classification accuracies (%) of each class (C1 to C16), OA (%), AA (%), and Kappa (%) obtained by SVM-POLY, the CNN, and the proposed method over the Salinas scene data.
IV-D Effects of Different Number of Training Samples
Since the number of training samples can significantly affect the construction of the data manifold, this subsection further validates the performance of the developed deep manifold embedding under different numbers of training samples. For the Pavia University and the Salinas Scene data, the number of training samples per class is selected from . For the Indian Pines data, we choose 1%, 2%, 5%, 10%, and 20% of samples for training, respectively. In this set of experiments, the number of subclasses and the number of neighbors are both set to 5.
Fig. 3 shows the classification performance of the developed method with different numbers of training samples, and Fig. 4 shows the corresponding McNemar’s test value between the CNN trained with the proposed method and the CNN trained with the softmax loss only. From the figures, we can draw the following conclusions.

The developed manifold embedding method takes advantage of the data manifold property within the hyperspectral image and preserves the manifold structure in the low-dimensional features, which improves the representational ability of the CNN model. Fig. 3 shows that the proposed method obtains better performance over all three datasets under different numbers of training samples. Moreover, Fig. 4 shows that the corresponding McNemar’s test value over the three datasets is higher than 1.96, which means that the improvement of the proposed method is statistically significant.

As the number of training samples decreases, the effectiveness of the developed method becomes limited. Fig. 3 shows that the curves of the classification accuracy over each data set tend to be close to each other. Besides, from Fig. 4, it can be found that when the samples are limited, the value fluctuates, which indicates that the effectiveness is negatively affected by the limited number of training samples. As the former subsection shows, this is because constructing the data manifold requires a certain number of training samples. Too few samples may instead construct a false data manifold and have negative effects on the performance.
IV-E Effects of the Number of Sub-Classes
This subsection will show the performance of the developed method under different . In the experiments, the is chosen from . The parameter is set to 5. Fig. 5 presents the experimental results over the three data sets, respectively.
From the figure, we can find that a proper can guarantee good performance of the developed manifold embedding method. From Fig. 5, it can be found that when , the classification accuracy over Pavia University data reaches 99.52% OA, while can only lead to an accuracy of OA. For Indian Pines data, as Fig. 5 shows, can make the classification accuracy reach OA, while when , the accuracy only reaches 99.31% OA. Besides, as Fig. 5 shows, for Salinas Scene data, the proposed method performs the best when . Generally, cross validation can be applied to select a proper in real-world applications.
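The cross-validation step mentioned above can be sketched generically in Python. Everything in this sketch is an assumption for illustration: the fold construction, the function names, and in particular the callable `train_and_score`, which stands for a user-supplied routine that trains a model with a given number of subclasses on the training indices and returns a validation score.

```python
def k_fold_indices(num_samples, num_folds):
    """Split sample indices into `num_folds` interleaved folds."""
    return [list(range(i, num_samples, num_folds)) for i in range(num_folds)]


def select_by_cross_validation(candidates, num_samples, num_folds, train_and_score):
    """Return the candidate value with the best mean validation score.

    candidates: values to try, e.g. candidate numbers of subclasses.
    train_and_score(value, train_idx, val_idx) -> float is a
    hypothetical user-supplied callable (higher is better).
    """
    folds = k_fold_indices(num_samples, num_folds)
    best_value, best_score = None, float("-inf")
    for value in candidates:
        scores = []
        for held_out in folds:
            held = set(held_out)
            train_idx = [i for i in range(num_samples) if i not in held]
            scores.append(train_and_score(value, train_idx, held_out))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_value, best_score = value, mean_score
    return best_value, best_score
```

In practice the folds would be drawn from the labelled training samples only, so that the test samples play no part in the parameter selection.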
IV-F Effects of the Number of Neighbors
Like the parameter , the number of neighbors also plays an important role in the developed method. Generally, an extremely small , such as , would make the constructed data manifold extremely “steep”, while an extremely large would make the data manifold overly smooth. This subsection discusses the performance of the developed method under different numbers of neighbors . In the experiments, is chosen from . We also present the results when approaches infinity, namely when all the samples are measured by the Euclidean distance. In this set of experiments, the parameter is set to 5. Fig. 6 shows the classification results of the proposed method under different choices of over the three data sets, respectively. Inspecting the tendencies in Fig. 6, we can note the following.
First, different can also significantly affect the performance of the developed method. Coincidentally, the proposed method achieves the best performance when is set to 5. Besides, using the geodesic distance rather than the Euclidean distance improves the performance of the deep manifold embedding method. As Fig. 6 shows, the proposed method achieves 99.52% over Pavia University data under the geodesic distance, which is higher than the 99.35% under the Euclidean distance. Over Indian Pines, the proposed method under the geodesic distance obtains an accuracy of 99.51%, which outperforms that under the Euclidean distance (99.32%). Over Salinas scene data, the proposed method under the geodesic distance achieves 97.80%, which is better than the 97.51% under the Euclidean distance.
IV-G Comparisons with the Samples-based Loss
This work also compares the developed deep manifold embedding with other recent samples-based losses. Here, we choose three representative losses from prior works, namely the softmax loss, the center loss [46], and the structured loss [35]. Table IV lists the comparison results over the three data sets.
From the table, we can find that the proposed deep manifold embedding, which takes advantage of the data manifold property within the hyperspectral image and preserves the manifold structure in the low-dimensional features, fits the classification task better than these samples-based losses. Over the Pavia University data, the proposed method obtains an accuracy of 99.52%, outperforming the CNN trained with the softmax loss (98.61%), the center loss (99.28%), and the structured loss (99.27%). Over the Salinas Scene and Indian Pines data, the proposed method also outperforms these prior samples-based losses (see the table for details).
TABLE IV
Data  Methods  OA(%)  AA(%)  KAPPA(%)
PU  Softmax Loss  15.77
PU  Center Loss  6.03
PU  Structured Loss  6.22
PU  Proposed Method
IP  Softmax Loss  4.48
IP  Center Loss  3.01
IP  Structured Loss  3.83
IP  Proposed Method
SA  Softmax Loss  12.67
SA  Center Loss  6.42
SA  Structured Loss  7.05
SA  Proposed Method
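The OA, AA, and KAPPA columns are not defined in this section. As a minimal sketch, assuming the standard definitions (overall accuracy, mean per-class accuracy, and Cohen's kappa coefficient), they can be computed from the true and predicted labels as follows. The function name and the toy labels are hypothetical, and the values are returned as fractions rather than percentages.

```python
from collections import Counter


def classification_scores(y_true, y_pred):
    """Return (OA, AA, kappa) as fractions in [0, 1] for two label sequences."""
    n = len(y_true)
    classes = sorted(set(y_true) | set(y_pred))
    confusion = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    true_totals = Counter(y_true)
    pred_totals = Counter(y_pred)

    # Overall accuracy: share of all samples that are labeled correctly.
    oa = sum(confusion[(c, c)] for c in classes) / n

    # Average accuracy: mean of the per-class accuracies (recalls),
    # taken over the classes that actually occur in y_true.
    present = [c for c in classes if true_totals[c] > 0]
    aa = sum(confusion[(c, c)] / true_totals[c] for c in present) / len(present)

    # Kappa: observed agreement corrected for the agreement expected by chance.
    # (Undefined when the chance agreement equals 1, e.g. a single class.)
    expected = sum(true_totals[c] * pred_totals[c] for c in classes) / (n * n)
    kappa = (oa - expected) / (1 - expected)
    return oa, aa, kappa


# Toy labels (an assumption), three classes and ten samples.
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 1, 2, 2, 0]
oa, aa, kappa = classification_scores(y_true, y_pred)
```

Because AA weights every class equally, it is more sensitive than OA to errors on small classes, which is why the two are usually reported together for imbalanced hyperspectral scenes.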

Furthermore, we present the classification maps obtained by the different methods in Fig. 7, 8, and 9 over the Pavia University, Indian Pines, and Salinas Scene data, respectively. In each figure, comparing the map of the baseline CNN with that of the proposed method shows that training with the deep manifold embedding improves the performance of the CNN model. Besides, comparing the map of the center loss with that of the proposed method shows that the deep manifold embedding, which takes advantage of the manifold structure, models the hyperspectral image better than the center loss. Comparing the map of the structured loss with that of the proposed method, we can also note that the proposed method significantly decreases the classification errors made by the structured loss.
IV-H Comparisons with the State-of-the-Art Methods
To further validate the effectiveness of the proposed manifold embedding method for hyperspectral image classification, we also compare it with a number of state-of-the-art methods. Tables V, VI, and VII list the comparison results under the same experimental setups over the three data sets, respectively. It should be noted that the results in these tables are taken from the literature where each method was first developed.
Over the Pavia University data, the developed method obtains 99.52% OA, outperforming D-DBN-PF (93.11% OA) [51], CNN-PPF (96.48% OA) [24], Contextual DCNN (97.31% OA) [20], SSN (99.36% OA) [55], ML-based Spec-Spat (99.34% OA) [3], and DPP-DML-MS-CNN (99.46% OA) [7]. Besides, over the Salinas Scene and Indian Pines data, the developed method also provides competitive results (see Tables VI and VII for details). To sum up, in these experiments the joint supervision of the developed manifold embedding loss and the softmax loss enhances the deep model's ability to extract discriminative representations and obtains comparable or even better results than other state-of-the-art methods.
TABLE V
Methods  OA(%)  AA(%)  KAPPA(%)
SVM-POLY
D-DBN-PF [51]
CNN-PPF [24]
Contextual DCNN [20]
SSN [55]
ML-based Spec-Spat [3]
DPP-DML-MS-CNN [7]
Proposed Method

TABLE VI
Methods  OA(%)  AA(%)  KAPPA(%)
RELM [25]
DEFN [36]
DRN [12]
MCMs+2DCNN [14]
Proposed Method (10%)
SVM-POLY
SSRN [53]
MCMs+2DCNN [14]
Proposed Method (20%)

V Conclusion and Discussion
The data structure is a critical factor that influences the performance of deep learning. In this paper, we take advantage of the data manifold to model the intrinsic data structure within the hyperspectral image and develop a novel manifold embedding method in deep learning (DMEM) to preserve the manifold structure in the low dimensional features. Using the intrinsic data structure does help to improve the performance of the deep model, and the experimental results have validated the effectiveness of the developed DMEM.
As future work, it would be interesting to investigate the effectiveness of the manifold embedding on other hyperspectral imaging tasks, such as hyperspectral target detection. Besides, further consideration should be given to embedding the manifold structure in other forms. Finally, exploring other data structures that can significantly affect deep learning performance is another important future topic.
References
[1] (2019) Ensemble deep manifold similarity learning using hard proxies. In CVPR, pp. 7299–7307.
[2] (2002) Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, pp. 585–591.
[3] (2018) Exploring hierarchical convolutional features for hyperspectral image classification. IEEE TGRS 56 (11), pp. 6712–6722.
[4] (1959) A note on two problems in connexion with graphs. Numerische Mathematik 1 (1), pp. 269–271.
[5] (2004) Thematic map comparison: evaluating the statistical significance of differences in classification accuracy. Photogrammetric Engineering and Remote Sensing 70 (5), pp. 627–633.
[6] (2019) Diversity in machine learning. IEEE Access 7 (1), pp. 64323–64350.
[7] (2019) A CNN with multiscale convolution and diversified metric for hyperspectral image classification. IEEE TGRS 57 (6), pp. 3599–3618.
[8] (2017) Multiple kernel learning for hyperspectral image classification: a review. IEEE TGRS 55 (11), pp. 6547–6565.
[9] (2006) Dimensionality reduction by learning an invariant mapping. In CVPR, pp. 1735–1742.
[10] (2014) From manifold to manifold: geometry-aware dimensionality reduction for SPD matrices. In ECCV, pp. 17–32.
[11] (2009) Neural networks and learning machines. Prentice Hall, New York.
[12] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
[13] (2019) Skip-connected covariance network for remote sensing scene classification. IEEE TNNLS.
[14] (2018) Feature extraction with multiscale covariance maps for hyperspectral image classification. IEEE TGRS 57 (2), pp. 755–769.
[15] (2019) Urban structure type characterization using hyperspectral remote sensing and height information. Landscape and Urban Planning 105 (4), pp. 361–375.
[16] Hyperspectral data, accessed on Aug. 18, 2019. https://www.ehu.ews/ ccwintoco/index.php?title=Hyperspectral_Remote_Sensing_Scenes
[17] (2018) Mining on manifolds: metric learning without labels. In CVPR, pp. 7642–7651.
[18] (2013) Feature mining for hyperspectral image classification. Proceedings of the IEEE 101 (3), pp. 676–697.
[19] (2014) Caffe: convolutional architecture for fast feature embedding. In ACM MM, pp. 675–678.
[20] (2017) Going deeper with contextual CNN for hyperspectral image classification. IEEE TIP 26 (10), pp. 4843–4855.
[21] (2018) Geometric understanding of deep learning. arXiv preprint arXiv:1805.10451.
[22] (2019) A geometric view of optimal transportation and generative model. Computer Aided Geometric Design 68, pp. 1–21.
[23] (2019) Deep learning for hyperspectral image classification: an overview. IEEE TGRS.
[24] (2016) Hyperspectral image classification using deep pixel-pair features. IEEE TGRS 55 (2), pp. 844–853.
[25] (2017) Hyperspectral image reconstruction by deep convolutional neural network for classification. Pattern Recognition 63, pp. 371–383.
[26] (2017) SphereFace: deep hypersphere embedding for face recognition. In CVPR, pp. 212–220.
[27] (2016) Large-margin softmax loss for convolutional neural networks. In ICCV.
[28] (2015) Multi-manifold deep metric learning for image set classification. In CVPR, pp. 1137–1145.
[29] (2010) Local manifold learning-based nearest-neighbor for hyperspectral image classification. IEEE TGRS 48 (11), pp. 4099–4109.
[30] (2019) Computing the minimum Hausdorff distance between two point sets on a line under translation. Information Processing Letters 38 (3), pp. 123–127.
[31] (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), pp. 2323–2326.
[32] (2015) FaceNet: a unified embedding for face recognition and clustering. In CVPR, pp. 815–823.
[33] (2017) Geodesic distance descriptors. In CVPR, pp. 6410–6418.
[34] (2016) Improved deep metric learning with multi-class n-pair loss objective. In NIPS, pp. 1857–1865.
[35] (2016) Deep metric learning via lifted structured feature embedding. In CVPR, pp. 4004–4012.
[36] (2018) Hyperspectral image classification with deep feature fusion network. IEEE TGRS 56 (6), pp. 3173–3184.
[37] (2008) Large-scale manifold learning. In CVPR, pp. 1–8.
[38] (2018) Rethinking feature distribution for loss functions in image classification. In CVPR, pp. 9117–9126.
[39] (2018) CosFace: large margin cosine loss for deep face recognition. In CVPR, pp. 5265–5274.
[40] (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), pp. 2319–2323.
[41] (2017) Deep metric learning with angular loss. In ICCV, pp. 2593–2601.
[42] (2016) Salient band selection for hyperspectral image classification via manifold ranking. IEEE TNNLS 27 (6), pp. 1279–1289.
[43] (2009) Manifold discriminant analysis. In CVPR, pp. 429–436.
[44] (2012) Manifold-manifold distance and its application to face recognition with image sets. IEEE TIP 21 (10), pp. 4466–4479.
[45] (2006) Unsupervised learning of image manifolds by semidefinite programming. IJCV 70 (1), pp. 77–90.
[46] (2016) A discriminative feature learning approach for deep face recognition. In ECCV, pp. 499–515.
[47] (2015) Scene recognition by manifold regularized deep learning architecture. IEEE TNNLS 26 (10), pp. 2222–2233.
[48] (2017) Range loss for deep face recognition with long-tailed training data. In ICCV, pp. 5409–5418.
[49] (2017) Hyperspectral anomaly detection via a sparsity score estimation framework. IEEE TGRS 55 (6), pp. 3208–3222.
[50] (2019) Deep class-wise hashing: semantics-preserving hashing via class-wise loss. IEEE TNNLS.
[51] (2017) Learning to diversify deep belief networks for hyperspectral image classification. IEEE TGRS 55 (6), pp. 3516–3530.
[52] (2019) Multiple instance learning for multiple diverse hyperspectral target characterizations. IEEE TNNLS.
[53] (2018) Spectral-spatial residual network for hyperspectral image classification: a 3-D deep learning framework. IEEE TGRS 56 (2), pp. 847–858.
[54] (2019) Learning compact and discriminative stacked autoencoder for hyperspectral image classification. IEEE TGRS.
[55] (2016) Learning hierarchical spectral-spatial features for hyperspectral image classification. IEEE CYB 46 (7), pp. 1667–1678.
[56] (2018) Image reconstruction by domain-transform manifold learning. Nature 555 (7697), pp. 487–487.
[57] (2015) Hierarchical suppression method for hyperspectral target detection. IEEE TGRS 54 (1), pp. 330–342.