Deep Generative Endmember Modeling:An Application to Unsupervised Spectral Unmixing

Deep Generative Endmember Modeling:
An Application to Unsupervised Spectral Unmixing

Ricardo Augusto Borsoi , Tales Imbiriba,  José Carlos Moreira Bermudez,  This work has been supported by the National Council for Scientific and Technological Development (CNPq) under grants 304250/2017-1, 409044/2018-0, 141271/2017-5 and 204991/2018-8, and by the Brazilian Education Ministry (CAPES) under grant PNPD/1811213.R.A. Borsoi is with the Department of Electrical Engineering, Federal University of Santa Catarina (DEE–UFSC), Florianópolis, SC, Brazil, and with the Lagrange Laboratory, Université Côte d’Azur, Nice, France. e-mail: T. Imbiriba was with the DEE–UFSC, Florianópolis, SC, Brazil, and is with the ECE department of the Northeastern University, Boston, MA, USA. e-mail: J.C.M. Bermudez is with the DEE–UFSC, Florianópolis, SC, Brazil, and with the Graduate Program on Electronic Engineering and Computing, Catholic University of Pelotas (UCPel) Pelotas, Brazil. e-mail: received Month day, year; revised Month day, year.

Endmember (EM) spectral variability can greatly impact the performance of standard hyperspectral image analysis algorithms. Extended parametric models have been successfully applied to account for the EM spectral variability. However, these models still lack the compromise between flexibility and low-dimensional representation that is necessary to properly explore the fact that spectral variability is often confined to a low-dimensional manifold in real scenes. In this paper we propose to learn a spectral variability model directly from the observed data, instead of imposing it a priori. This is achieved through a deep generative EM model, which is estimated using a variational autoencoder (VAE). The encoder and decoder that compose the generative model are trained using pure pixel information extracted directly from the observed image, what allows for an unsupervised formulation. The proposed EM model is applied to the solution of a spectral unmixing problem, which we cast as an alternating nonlinear least-squares problem that is solved iteratively with respect to the abundances and to the low-dimensional representations of the EMs in the latent space of the deep generative model. Simulations using both synthetic and real data indicate that the proposed strategy can outperform the competing state-of-the-art algorithms.

Hyperspectral data, endmember variability, generative models, deep neural networks, variational autoencoders, spectral unmixing.

I Introduction

Hyperspectral image analysis consists in a vast collection of algorithms and methods used to retrieve vital information from hyperspectral images (HI) in a increasing number of applications. Common applications include [kouyama2016development, Bioucas-Dias-2013-ID307] space exploration, remote sensing, surveillance, and, more recently, medical applications such as disease diagnosis and image-guided surgery [lu2014medical]. One analysis methodology of particular interest is spectral unmixing (SU), which aims at retrieving sub-pixel information concerning the spectra of materials present in the scene, as well as estimating the proportions in which they contribute to each HI pixel [Keshava:2002p5667, imbiriba2018ULTRA].

Many parametric models have been proposed to describe the interaction between light and the target surface [Keshava:2002p5667, Dobigeon-2014-ID322]. The simplest of such models is the Linear Mixing Model (LMM), which considers that the observed reflectance of an HI pixel is obtained from a convex combination of the spectral signatures of pure materials. This model imposes a convex geometry to the SU problem, where all HI pixels are confined to a simplex whose vertices are the pure material reflectances, usually termed endmembers (EMs). The linearity and convexity of the LMM model lead to an interpretation of its coefficients as the relative abundances of each pure material in the HI. Nevertheless, some characteristics of practical HIs cannot be modeled by the standard LMM, such as nonlinearities [Dobigeon-2014-ID322, Imbiriba2016_tip, Imbiriba2017_bs_tip, Imbiriba2014] or variations of the EMs along the image [Thouvenin_IEEE_TSP_2016_PLMM, drumetz2016blindUnmixingELMM, imbiriba2018glmm]. More sophisticated models are required when such nonidealities have important impact on the formation of the HI.

I-a EM variability and learning-based SU methods

The variation of the endmembers across an HI (also called EM variability) is a very common effect since we can often associate multiple, different spectral signatures to each pure underlying material in a scene. EM variability can originate from environmental conditions, illumination, or atmospheric or temporal changes [Zare-2014-ID324-variabilityReview]. Its occurrence may incur the propagation of significant estimation errors throughout the unmixing process [Thouvenin_IEEE_TSP_2016_PLMM]. Different strategies have been proposed to cope with EM variability in SU. They can be classified in methods that represent EMs as sets, methods that model EMs as statistical distributions, and methods that incorporate parametric representations of EM variability in the mixing model [drumetz2016variabilityReviewRecent].

Parametric models are raising considerable interest since they lead to good unmixing results and avoid the main drawbacks of the other groups of SU methods that address EM variability, namely the dependence on a priori knowledge of libraries of material spectra or the need for strong assumptions on the statistical distribution of the EMs for mathematical tractability [Zare-2014-ID324-variabilityReview, drumetz2016variabilityReviewRecent]. Recently proposed parametric models attempt to capture spectral variability by extending the LMM using either additive [Thouvenin_IEEE_TSP_2016_PLMM] or multiplicative [drumetz2016blindUnmixingELMM, imbiriba2018glmm, Borsoi_multiscaleVar_2018, Borsoi_2018_Fusion] scaling factors, or by considering tensor-based formulations [imbiriba2018ULTRA_V, borsoi2019icassp].

Although SU methods based on extended parametric models offer different trade-offs between representation capacity and model complexity, they still fail to achieve a desirable balance between a low-dimensional representation and enough flexibility to represent complex EM variability. Specifically, they fail to properly explore the fact that, although being very complex and spectrally non-homogeneous, spectral variability in real scenes is often confined to low-dimensional manifolds [Hapke1981, jacquemoud2001leafOpticalPropertiesReview, lobell2002moistureEffectsOnReflectance]. This property is due to the fact that the spectral signature of many materials is a function of only a few photometric or chemical properties of the medium. Prominent examples include packed particle spectra as a function of its roughness, size and density [Hapke1981], leaf reflectance spectra as a function of various biophysical parameters [jacquemoud2001leafOpticalPropertiesReview], and soil reflectance as a function of moisture conditions [lobell2002moistureEffectsOnReflectance]. Thus, existing models tend to be either too restrictive in their modeling capability or to lead to severely ill-posed estimation problems.

SU considering EM variability has also been formulated as a supervised learning problem, which is then solved without the need for an accurate physical model using neural networks (NNs) or support vector machines (SVMs) [okujeni2013learningSVRunmixingUrban, wang2013unmixingSVMregressionResidueConstr, mianji2011unmixingVariabilitySVMclassification, parente2017unmixingGenModels]. However, these strategies depend on the availability of vast amounts of training data to adequately capture the spectral diversity of real scenes. This makes the training process computationally intensive and often intractable for large EM libraries, which must also be known a priori. Some works attempt to reduce the computational cost of these solutions by modifying learning algorithms to use hybrid soft-hard classification [wang2009unmixingExtendedSVMpurePixels, gu2013unmixingExtendedSVMpurePixelsMultiKernel, li2015unmixingExtendedSVMgeometricAnalysis]. However, the resulting reconstructed abundance fractions do not have a clear physical interpretation due to the lack of a direct relationship to a physically motivated mixing model.

More recently, unsupervised SU approaches have also emerged by using autoencoders (AEC), which consist of encoder-decoder structured NNs originally devised for nonlinear dimensionality reduction [van2009dimensionalityReductionRev]. These methods attempt to associate the decoder structure of the network with the LMM and the low-dimension representation of the input spectral vectors to the fractional abundances [palsson2018autoencoderUnmixing_IEEEaccess]. Different variations have been proposed, using pre-processing steps to reduce noise and outliers [guo2015autoencodersUnmixing, su2018autoencodersUnmixing], untiying the decoder from the encoder weights [qu2018udas_autoencoderUnmixing], using spectral angle distances to address nonlinear SU [ozkan2018endnet_autoencoderUnmixing], or using denoising autoencoders to generate a robust initialization to matrix factorization-based SU strategies [su2019deepAutoencoderUnmixing]. In [ozkan2018autoencoderUnmixingVariability] the authors proposed a nonlinear encoder-decoder structure to address the unmixing problem considering spectral variability. The proposed solution involves the simultaneous training of six neural networks to optimize a very large number of parameters. An autoencoder structure is employed to estimate the parameters of a spectral model of a hyperspectral image by minimizing the image reconstruction error while limiting the energy of some of the model parameters. No regularization strategy connecting the different pixels of the image is employed.

Despite their popularity, supervised learning-based SU algorithms are still not able to properly address the spectral variability problem, as they depend on extremely large amounts of labeled training data, leading to a computationally unfeasible learning process. Furthermore, the lack of a clear connection between AEC-based strategies and the physical mixing process makes one skeptic when concerning the robustness of AEC-based SU in face of more complex phenomena such as spectral variability.

I-B Proposed methodology

In this work, we propose a novel SU formulation that leverages the advantages of deep learning methods to address EM variability while still maintaining a strong connection to the physical mixing process, and using limited amounts of training data. Specifically, we adopt a deep generative NN to represent the manifold of EM spectra, which is then incorporated within the LMM. Generative models such as variational autoencoders (VAE) [kingma2013AEC_varBayes] and generative adversarial networks (GAN) [goodfellow2014GANs] have recently obtained excellent performance at learning the probability distribution of complex data sets in very high dimensional spaces (e.g. natural images) from relatively small amounts of training data. The structure of generative models allow one to find a low-dimensional latent representation that parsimonously describes the variability of complex high-dimensional data sets. This leads to a low-dimensional parametrization of the training data distribution.

We formulate a novel unmixing strategy that can be cast as the problem of estimating the latent representations of the generative endmember models and the corresponding fractional abundances for each pixel in the HI. Specifically, we break down the SU problem in two steps. In the first step, we learn the latent EM variability manifold for each material in the scene using a deep generative EM model. The learning process uses pure pixel information directly extracted from the observed HI, which makes the proposed strategy suitable for unsupervised SU. In the second step, an alternating least-squares strategy is employed to estimate the parameters of an extended version of the LMM parametrized using the generative EM models obtained in the first step. The corresponding optimization problem is solved iteratively with respect to the abundances and to the low-dimensional representations of the EMs in the latent space of the deep generative models.

As a result, the proposed approach benefits from the reduced dimension of the latent space. Moreover, unlike current approaches, the new method does not depend on the careful selection of regularization parameters to yield a good performance. The resulting algorithm is named Deep Generative Unmixing algorithm (DeepGUn). The proposed method is strongly related to parametric models and leverages the learning and generalization capability of deep neural networks to properly represent the manifold of EM variability. Hence, DeepGUn leads to a model that is both low-dimensional and physically accurate, better describing the variability actually present in the scene.

Experimental results performed with both synthetic and real data indicate that the proposed strategy leads to more accurate abundance estimations than standard state-of-art SU methods accounting for EM variability. Qualitative analysis of the estimated abundance maps confirms these results. The improved accuracy comes at the expense of a small increase in the computational cost when compared to the best competing strategies.

This paper is organized as follows. Section II briefly reviews the LMM and its parametric extended versions. Section III discusses the basic properties of generative models in the context of VAE and GAN. Section IV introduces the proposed generative EM model and its learning strategy. In Section V we formulate the resulting SU problem, present the DeepGUn algorithm, and discuss aspects of the proposed optimization strategy. The neural network architecture is discussed in Section VI. The performance of the proposed method is compared with that of competing algorithms in Section VII. Finally, the conclusions are presented in Section VIII.

Ii Linear Mixing Models

The Linear Mixing Model (LMM) [Keshava:2002p5667] assumes that a given -th pixel , with bands, is represented as


where is a matrix whose columns are the EM spectral signatures , is the abundance vector and is an additive white Gaussian noise (WGN). The LMM assumes that the EM spectra are fixed for all HI pixels , . This assumption jeopardizes the accuracy of estimated abundances in many circumstances due to the spectral variability existing in a typical scene.

Different parametric models have been recently proposed to account for variable EM spectra within a given scene [drumetz2016blindUnmixingELMM, Thouvenin_IEEE_TSP_2016_PLMM, imbiriba2018glmm]. These models can be generically described as


where is a parametric function, is a reference EM matrix, and is a vector of parameters describing the manifold of EM variability.

Different functional forms have been proposed for to account for EM variability in this framework, such as additive [Thouvenin_IEEE_TSP_2016_PLMM] or multiplicative [drumetz2016blindUnmixingELMM, imbiriba2018glmm] variability factors acting upon the reference EM matrix . However, these models fail to achieve a desirable balance between a low-dimensional representation and enough flexibility to represent complex variability patterns. They tend to be either too restrictive in their modeling capability, or to lead to ill-posed optimization problems [borsoi2019icassp]. Instead of using a pre-defined parametric model, we propose to address this issue by learning a parametric function using a generative model.

Iii Generative Models

Generative models attempt to estimate the probability distribution of a random variable based on a set of observations , in such a way that allows one to generate new samples that look similar to new realizations of . The main characteristic of this problem, which sets it apart of other unsupervised learning methods such as density estimation, is the fact that we must to be able to sample from the estimated model .

In many practical applications of interest, the dimensionality  of the variable of interest is very high. This makes the general problem very difficult, as it amounts to estimating and sampling from an arbitrary high-dimensional probability density function [neal2001annealedDensityEstimation, arjovsky2017wassersteinGANs]. Nonetheless, the distributions of interest are often supported at a low-dimensional manifold of a set of so-called latent variables, and this fact can be explored to make the problem more tractable. A convenient way to address this problem is to define a new random variable with a known distribution in a low-dimensional space (e.g. an isotropic Gaussian distribution with ), and a parametric function (e.g. a neural network) mapping such that the image of by is a random variable whose distribution is very close to . In other words, the goal becomes to learn the parameters of such that the distribution of is as close to as possible. Then, samples of can be generated by sampling from and using the mapping .

Although estimating may still seem difficult at first, recent advances in machine learning such as VAEs [kingma2013AEC_varBayes] and GANs [goodfellow2014GANs] have shown formidable performance at learning complex distributions such as those of natural images.

VAEs address this problem by assuming that the distribution of the observed data follows a directed graphical model , which is represented by the function . The parameters of are learned by maximizing a lower bound on the log-likelihood of  [kingma2013AEC_varBayes]:


where is the Kullback-Leibler divergence between two distributions, is the expected value operator with respect to the distribution and is a variational approximation to the intractable posterior , which is modeled by a function (e.g. another neural network) parameterized in . Note that must be a high-capacity distribution111Capacity of a distribution is a generic term to describe how complex a relationship it can model., so that it can provide a good approximation of the posterior , which then allows the lower bound in (III) to be close to the true value of  [doersch2016tutorialVAE].

GANs, on the other hand, attempt to learn the distribution by searching for the Nash equilibrium of a two-player adversarial game [goodfellow2014GANs]. A generator network tries map the distribution of the latent variables into the data distribution of , and a discriminator network tries to predict the probability of a random sample coming from the true distribution instead of being generated through . The generator is trained by maximizing the probability of the discriminator making a mistake. This is formulated as the minimax optimization problem


GANs are more flexible and have shown better performance at approximating complex distributions such as natural images (leading to sharper results) when compared to VAEs [goodfellow2014GANs]. However, GANs are also much harder to train [arjovsky2017wassersteinGANs]. Moreover, VAEs naturally offer a way to obtain the latent representations corresponding to samples by mapping using the function , which is also called an encoder model. This property and their stable training have motivated us to use VAEs in this work.

Figure 1: Illustration of the proposed Deep Generative Endmember Model.

Iv A Deep Generative Endmember Model

In this section, we propose to model the distribution of EM spectral variability using a deep generative model. By doing so, we can explicitly explore a common property of spectral variability: the EM spectra are usually confined to a low-dimensional manifold. This property is due to the fact that the spectral signature of many materials is a function of a few photometric or chemical properties of the medium. Prominent examples include packed particle spectra as a function of its roughness, size and density [Hapke1981], leaf reflectance spectra as a function of various biophysical parameters [jacquemoud2001leafOpticalPropertiesReview], and soil reflectance as a function of moisture conditions [lobell2002moistureEffectsOnReflectance].

Iv-a The steps of the proposed SU method

We assume the existence of nonlinear functions , (the generative model) that map latent representations into their corresponding spectral signatures . We assume also the existence of encoder models that map spectral signatures into their latent representations. In other words, we assume that any arbitrary observation of a spectral signature of a material belongs to the set


and thus can be equivalently represented by a corresponding low-dimensional vector in the latent space of the generative model . This reasoning is illustrated in Fig. 1, where the encoder function maps the input EM signature to the low-dimensional manifold. Reciprocally, low-dimensional vectors in the latent space can be mapped (decoded) to their corresponding spectral signatures using .

As such, we can formulate EM estimation in the SU problem in the latent domain (as opposed to the input spectral space), which is of a much lower order. Moreover, this approach will keep the physical interpretation of the model, provided that we have relevant training data to learn the generative models [shah2018invGANpriors, bora2017compressedInvGANs, anirudh2018unsupervisedInverseProblemsGANs, asim2018GANsDeblurring]. This strategy relies on the existence of a priori training data for each material in the image, which might come in the form of, e.g., spectral libraries of laboratory measurements [iordache2011sunsal]. Nevertheless, we propose a more practical and effective approach to train the generators  by exploring information contained in multiple pure pixels extracted from the observed HI. The presence of multiple pure pixels in an observed HI is a characteristic of many real scenes, and can be leveraged to help in estimating the EM models, thus, reducing the ill-posedness of the SU problem222Pure pixels are defined here as a set of pixels whose spectral distance relative to the reference EMs in is less than a specified threshold..

Therefore, we propose to break the unmixing problem into a sequence of two problems:

  • Using pure pixel information extracted from the HI by standard EM extraction methods, learn the generative and encoder models, and , for all EMs in the scene ().

  • Using the learned generative models, solve the SU problem by estimating the latent EM representations and the fractional abundance vectors that can best represent the observed hyperspectral data, for all pixels in the scene ().

Iv-B Learning the generative and encoder models  and 

The objective of this first problem is to estimate the generative and encoding models and , for . We assume the knowledge of a set of pure pixels for the -th EM, for all . Multiple pure pixels exist in many scenes, and can be directly extracted from the observed HI using automated EM extraction techniques [somers2012automatedBundlesRansomSampling, somers2016endmemberLibrariesChapter]. The sets of pure pixels , which can be seen as observations from the statistical distribution of each EM, are then used in the form of training data to learn the models and using a VAE [kingma2013AEC_varBayes]. If the set is representative of the variability of the -th material, the learned generative model will be able to accurately describe the manifold of the -th EM variability. Doing the same for all yields a set of variability models for all the EM spectra.

Although the extraction of multiple pure pixels from observed HIs is a well-established technique used to produce EM libraries [somers2012automatedBundlesRansomSampling], mixed pixels can sometimes be mistakenly identified as a pure pixel of some of the EMs. This constitutes a problem for library-based SU applications (e.g. MESMA and sparse SU) since some of the library spectra may end up not being representative of their EM class (material).

The smooth nature of the latent representation of VAEs allows the mitigation of this problem in the proposed approach. Assuming the availability of a reference EM matrix of correctly identified signatures (which can be obtained using any EM extraction method) and of a set of encoder models , we can compute the latent representation of these reference signatures of each EM as


where is the -th column of . The latent representation can be used as a reference latent code for the -th material. Thus, we can measure how close an estimated EM latent representation  is to the latent representation of a pure pixel by evaluating its Euclidean distance to . This can be performed since the output of VAEs have been shown to vary smoothly with changes of the latent variable [kingma2013AEC_varBayes]. Thus, we can use to regularize the SU problem to prevent from representing mixed pixels. This increases the robustness of the proposed approach.

Iv-C Extracting sets of pure pixels from the observed HI

An important part of the proposed methodology consists in the extraction of the sets of pure pixels , from the observed hyperspectral image . Although different strategies have been proposed for image-based library construction (see e.g. [uezato2016novelBundlesExtractionClustering, andreou2016bundlesExtractionVariabilityMultiscaleBand]), these techniques depend on multiple parameters that must be carefully adjusted in order to obtain good results. Instead of these approaches, we adopt a very simple strategy to select pure pixels from an HI that makes use of the reference matrix  extracted from the image using a pure-pixel-based endmember extraction algorithm (e.g. VCA [Nascimento2005]), which will also later be used to construct in (IV-B). We simply select as the elements of  the image pixels that have the smallest spectral angle to the reference signature in the -th column of , where is the cardinality of  for . Although the success of this strategy depends on having a reasonably accurate estimation of , we experimentally found it to be more robust and easier to adjust than, for instance, the one in [somers2012automatedBundlesRansomSampling].

V The Unmixing Algorithm

Given a set of generative models , for each EM in the scene, a latent space representation of a reference EM matrix , and an HI , the SU problem can be cast as the minimization of a risk functional of the form


where is the abundance matrix, is a 3-D tensor obtained by stacking all pixel-dependent latent EM representations , such that , and are regularization terms to improve the problem conditioning, and the matrix-valued function defined as

is the concatenation of the generative functions for each EM.

The term is a regularization functional that aims to provide spatial smoothness and to enforce positivity and sum-to-one constraints to the abundances. It is given by [imbiriba2018glmm]


where parameter controls the contribution of this term to the cost function. The first two terms are a spatial regularizers over , where and are linear operators that compute the first-order horizontal and vertical gradients of a bidimensional signal, acting separately for each material of , and is the norm, defined as . The term is the indicator function of the unity simplex, i.e. if and otherwise, where


The term constrains the EM latent representations to be close to the latent representation of the reference EM matrix . It is given by


where parameter controls the contribution of this term to the cost function. This regularization makes the estimation problem more robust to the selection of the training data  by assuring the closeness of the estimated latent codes and the representations of pure pixels of each class. However, it relies indirectly on the reference EM signatures  (which are extracted from the observed HI with endmember extraction algorithms) being adequate representatives of their material classes in order to provide a good performance.

The optimization problem then becomes


The problem defined in (11) is non-smooth and non-convex if solved simultaneously with respect to both variables , and . However, an approximate solution can be found by minimizing (11) iteratively with respect to each variable, leading to the Deep Generative Unmixing (DeepGUn) method described in Algorithm 1. The DeepGUn algorithm consists of two distinctive steps. First, the generative endmember models generative and encoder models , , are trained based on the pure pixels , extracted from the observed HI and is computed. Afterwards, the alternating minimization approach is applied to compute the abundance maps and the latent representations of the endmembers for each pixel. We next describe the details of each optimization step. Implementation details are described in Sections VI and VII.

Input : , , and .
Output :  and .
1 Estimate the reference EM signatures using an EM extraction method (e.g. VCA);
2 Estimate using a standard LMM-based SU method;
3 Extract sets of pure pixels , from the HI using a bundle extraction strategy;
4 Train the generative and encoder models , , ;
5 Compute the latent representation of as ;
6 Set ;
7 while stopping criterion is not satisfied do
8       ;
9       ;
10       ;
12 end while
13for , do, , end;
14 return ;
Algorithm 1 DeepGUn algorithm for solving (11)

V-a Optimization with respect to

Rewriting (11) considering only the terms in (7) that depend on , the problem becomes


This is a regularized nonlinear least squares problem, which can be solved individually for each pixel . Thus, (12) can be decomposed into non-convex, nonlinear optimization problems with dimensionality by denoting each summand in (12) by , . We solve each of those problems  using a quasi-Newton algorithm, described in Algorithm 2, which provides an efficient solution for high-dimensional functions  [nocedal2006optimizationBook].

Although problem (12) is generally non-convex, recent research [hand2018convexityInverseOptGANs] has proven that, under suitable assumptions on the generator network , the problem of recovering the latent variable does not have any stationary point (e.g. local minima or saddle points) outside a small neighborhood of the desired solution and its negative scalar multiple. This indicates the existence of a favorable global geometry of (12).

Note that is not necessarily differentiable with respect to the latent representations , which can make the optimization problem more challenging. Nonetheless, quasi-Newton algorithms show excellent performance at non-smooth problems [lewis2013nonsmoothQuasiNewton], where convergence is generally observed as long as the line search procedure does not return a point at which the objective function is non-differentiable. This allows quasi-Newton algorithms to be directly applied to obtain approximate solutions to non-smooth problems with good computational efficiency [lewis2013nonsmoothQuasiNewton, curtis2015quasiNewtonNonsmooth].

Input : , , , and .
Output : .
1 Set and ;
2 while stopping criterion is not satisfied do
3       ;
4       Compute search direction ;
5       Set , where is computed using a line search procedure to satisfy the Wolfe conditions;
6       Define and ;
7       ;
9 end while
10Reorder as a matrix ;
11 return ;
Algorithm 2 Quasi-Newton algorithm for solving (12

V-B Optimization with respect to

Restating (11) considering only the terms in (7) that depend on leads to


Since the latent variables are fixed, (13) consists of a SU problem with a pixel-dependent EM matrix and an edge-preserving spatial regularization. Although this problem is not separable with respect to each pixel in the image, the Alternating Direction Method of the Multipliers (ADMM) can be used to obtain an efficient solution [Boyd_admm_2011]. The solution of (13) using the ADMM is well described elsewhere (e.g. [drumetz2016blindUnmixingELMM]) and will thus be omitted here for conciseness.

Vi Neural Network Architecture

As discussed before, we used a VAE [kingma2013AEC_varBayes] to learn the generative and encoder models and from the sets of pure pixels . Compared to GANs, the training of VAEs is much simpler and more stable [arjovsky2017wassersteinGANs]. Moreover, VAEs naturally return the encoder model as an approximation to the posterior distribution when learning . We have selected a dimension for the latent space, as it was experimentally verified to be sufficient to adequately capture the variability of each single material in a scene.

For the network architectures, we selected the number of layers and neurons according to the autoencoder implementation in [van2008matlabToolbox, van2009dimensionalityReductionRev], with three hidden layers using ReLU activations (defined as ) in the hidden layers, which are described in more detail in Tables I and II.

We found that this configuration led to spectrally smooth generated signatures, and was effective at generalizing well with small training sample sizes. We trained the network for 50 epochs with the Adam optimizer [kingma2014adam] in TensorFlow, using a batch optimization with mini-batch size equal to one third of the total amount of training data for each EM.

Layer Activation Function Number of units
Hidden # 1 ReLU
Hidden # 2 ReLU
Hidden # 3 ReLU
Table I: Encoder network architecture.
Layer Activation Function Number of units
Hidden # 1 ReLU
Hidden # 2 ReLU
Hidden # 3 ReLU
Output Sigmoid
Table II: Decoder network architecture.

Vii Experimental Results

In this section, simulation results using both synthetic and real data illustrate the performance of the proposed method. We compare the proposed DeepGUn method with the fully constrained least squares (FCLS), the PLMM [Thouvenin_IEEE_TSP_2016_PLMM], the ELMM [drumetz2016blindUnmixingELMM], and the GLMM [imbiriba2018glmm]. In all experiments, the VCA algorithm [Nascimento2005] was used to extract the reference EM matrix from the observed HI and to initialize the different methods. The abundance maps of all methods were initialized using the results obtained by the FCLS algorithm. The sets of pure pixels were constructed by selecting the  image pixels with the smallest spectral angles relative to the reference EMs in . We ran the alternating optimization process in Algorithm 1 for at most  iterations or until the relative change of  and  was less than . The iterative procedure in Algorithm 2 was run until the relative change of was less than . The performances were evaluated using the Normalized Root Means Squared Error (NRMSE) between the estimated abundance maps (), between the EM matrices () and between the reconstructed images (). The NRMSE between a true, generic tensor  and its estimate  is defined as


Note that for the case of , the reconstructed image is given by , .

We consider also the Spectral Angle Mapper (SAM) to evaluate the estimated EMs


where and are the true and the estimated signatures of the -th endmember in the -th pixel, respectively.

Data Cube 1 – DC1
Time [s]
FCLS 0.2854 0.0350 0.71
PLMM 0.2604 0.1075 0.0440 0.0007 122.09
ELMM 0.2554 0.1032 0.0398 0.0321 8.82
GLMM 0.2480 0.1036 0.0355 0.0235 23.74
DeepGUn 0.0566 0.0944 0.0233 0.0448 75.20
Data Cube 2 – DC2
FCLS 0.1294 0.0393 0.38
PLMM 0.1197 0.0481 0.0378 0.0336 41.31
ELMM 0.1110 0.0566 0.0382 0.0231 20.25
GLMM 0.1146 0.0534 0.0367 0.0226 17.03
DeepGUn 0.0969 0.0463 0.0323 0.0384 36.40
Data Cube 3 – DC3
FCLS 0.2606 0.0542 0.34
PLMM 0.2028 0.0928 0.0385 0.0302 59.88
ELMM 0.1997 0.0640 0.0188 0.0238 17.99
GLMM 0.1841 0.0638 0.0185 0.0226 25.74
DeepGUn 0.1613 0.0600 0.0172 0.0457 48.96
Data Cube 4 – DC4
FCLS 0.5109 0.1712 0.50
PLMM 0.5066 0.6245 0.4874 0.0320 269.48
ELMM 0.4385 0.4712 0.1451 0.0106 18.36
GLMM 0.4371 0.4855 0.1972 0.0108 21.15
DeepGUn 0.2550 0.2918 0.0873 0.1403 99.94
Table III: Simulation results using synthetic data.
Figure 2: Abundance maps of DC1 (top), DC2 (middle up), DC3 (middle down) and DC4 (bottom) for all tested algorithms. Abundance values represented by colors ranging from blue () to red ().

Vii-a Synthetic data

To quantitatively compare the different algorithms, four synthetic datasets were created, namely Data Cubes 1–4 (DC1–DC4), with 7070 pixels (DC1) and 5050 pixels (DC2, DC3 and DC4). These datasets were built using three (DC1, DC2 and DC3) and five (DC4) 224-band EMs extracted from the USGS Spectral Library [clark2003imaging] and spatially correlated abundance maps, as depicted in the first row of Fig. 2.

Spectral variability of the EMs was imposed using four different models. For the DC1 datacube, we adopted the variability model used in [Thouvenin_IEEE_TSP_2016_PLMM], consisting of pixelwise multiplicative spectral factors given by random piecewise-linear functions. For DC2, the variability model of [imbiriba2018glmm] was used, consisting of band dependent scaling factors that varied smoothly in both the spatial and spectral dimensions. For DC3, we considered a simple model introduced in [uezato2016novelBundlesExtractionClustering, Section IV-A-1] to emulate errors in atmospheric compensation as a function of the viewing geometry given the direct and diffuse light on the scene, and the solar path transmittance. For datacube DC4, we used as endmembers pure pixels of five materials (asphalt, tree, roof, metal and dirt) which were manually extracted from a real hyperspectral image, thus depicting realistic spectral variability. White Gaussian noise was finally added to all datasets to yield a 30dB SNR.

The optimal parameters for each algorithm were selected by performing grid searches for each dataset. The ranges in which the parameters were searched were selected according to those discussed by the authors in the original publications. For the PLMM we searched for , and in the ranges , and , respectively. For both ELMM and GLMM, the parameters were selected among the following values: , , and . For the proposed DeepGUn algorithm, we fixed and selected among the values . For the proposed method, the sets of pure pixels for each EM were constructed by selecting the  pixels closest to the reference materials .

The quantitative results are shown in Table III, with the best results for each metric marked in bold. The proposed method clearly outperformed the competing algorithms in terms of for all four datasets. Qualitatively, the abundance maps provided by DeepGUn, displayed in Fig. 2, are clearly much closer to the true abundance maps than those provided by the other methods. These are important results, as accuracy in abundance estimation is the main objective of SU.

For the EM reconstruction metrics and , DeepGUn gave the best results for all data cubes. This indicates that the proposed endmember model used by DeepGUn allows for precise material identification from the observed hyperspectral scenes.

The reconstruction error of the DeepGUn algorithm was comparable to the FCLS and significantly larger than that of the GLMM. This is natural since the GLMM has more degrees of freedom. However, the connection between and the abundance reconstruction error is far from being direct, as can be seen in Table III.

The execution times, at the rightmost column of Table III, indicate that the computational complexity of DeepGUn is somewhere between the complexities of GLMM and PLMM, the two major competing algorithms. Hence, the DeepGUn method yielded superior SU performance, with easier parameter tuning, and at a reasonable computational cost.

Vii-B Influence of the latent dimension 

An important parameter in the design of the DeepGUn method is the dimensionality of the latent space of the generative endmember models , . The individual dimensions of the latent space are used to represent changes in the endmember signatures due to spectral variability. Since in a given scene the endmembers are likely to be affected only by a small number of effects (hence the hypothesis that they are supported at a low-dimensional manifold),  should be small in order to avoid introducing spurious effects and increasing the computational complexity of the solution.

To illustrate this, we performed a simulation with DeepGUn where we varied the dimensionality of the latent space  and measured the normalized abundance reconstruction error . For this, we considered the data cubes DC1 and DC2 described above in Section VII-A. The results are depicted in Figure 3, and show that the abundance estimation error tends to increase with . It can also be seen that there is a sharper increase in  for DC1 when compared to DC2. This is likely due to the fact that the dataset DC2 uses a more complex model to generate endmember variability. This indicates that the selection of a small value for  is important to obtain good unmixing results.

Figure 3: Abundance NRMSE as a function of the latent space dimension  for datacubes DC1 (left) and DC2 (right).
Houston HI Samson HI Jasper Ridge HI

Time [s]

Time [s]

Time [s]

 FCLS 0.2470 2.56 0.0545 1.38 0.2057 1.52
 PLMM 0.0713 663.25 0.0239 103.84 0.0553 220.84
 ELMM 0.0171 38.30 0.0119 14.76 0.0278 27.08
 GLMM 0.0016 48.53 0.0006 46.69 0.0019 86.33
 DeepGUn 0.2355 259.61 0.0862 121.88 0.1094 209.64
Table IV: Simulation results using real data.
Figure 4: Abundance maps of the Houston dataset for all tested algorithms. Abundance values represented by colors ranging from blue () to red ().
Figure 5: Abundance maps of the Samson dataset for all tested algorithms. Abundance values represented by colors ranging from blue () to red ().
Figure 6: Abundance maps of the Jasper Ridge dataset for all tested algorithms. Abundance values represented by colors ranging from blue () to red ().

Vii-C Real data

We considered the Houston, Samson and the Jasper Ridge datasets for the simulations with real data. These datasets were captured by the AVIRIS instrument, and originally had 224 bands. The spectral bands corresponding to water absorption and low SNR regions were removed, resulting in 188 bands for the Houston image, 156 bands for the Samson image and 198 bands for the Jasper Ridge image. Previous studies indicate that the Houston HI has four predominant EMs [drumetz2016blindUnmixingELMM], while the Samson and Jasper Ridge HIs are known to have three and four EMs, respectively [Borsoi2017_multiscale].

The reconstructed abundance maps for both datasets and all algorithms are shown in Figs. 45 and 6. For the Houston dataset, the last row of Fig. 4 shows that the abundance maps provided by the DeepGUn method better evidence the strong vegetation and concrete abundances at the stadium field and stands, respectively, as well as the stronger asphalt abundances in the parking lot. For the Samson and Jasper Ridge images, a clear performance improvement can be seen for the DeepGUn algorithm. Note, for instance, a smaller confusion between the Water and Soil EMs in the Samson HI when compared to the other methods. Similarly, for the Jasper Ridge HI, the DeepGUn method leads to considerably stronger Water abundances in the region containing the river. Moreover, although the ELMM provided a better estimation of the road in the scene when compared to the remaining methods, it also resulted in a greater confusion between the Vegetation and Soil EMs, especially in the right part of the scene.

The quantitative results for all algorithms and datasets are shown in Table IV. Since the correct abundance values (the ground truth) are not available for most real images, the reconstruction error has been used as a sort of quality verification. As was the case for the synthetic data, the DeepGUn reconstruction errors are higher than those yielded by other methods that address spectral variability. However, the reconstruction error is definitely not a good performance measure for abundance estimation in real images, which is the main objective of unmixing algorithms. The higher reconstruction errors of DeepGUn in this case are just due the fact that DeepGUn has much fewer degrees of freedom than the ELMM, PLMM and GLMM algorithms. In fact, the DeepGUn has only degrees of freedom for each pixel, which is comparable to the FCLS () much smaller than the ELMM, GLMM and PLMM methods (). Although this means that the ELMM, GLMM and PLMM can achieve arbitrarily small reconstruction errors , this is not necessarily reflected as good abundance estimation results. The execution times of the proposed DeepGUn method were again comparable to those of the other algorithms addressing spectral variability, which indicates that it scales well with larger image sizes.

Viii Conclusions

In this paper, a deep generative EM model was proposed to address spectral variability in SU of HIs. Instead of relying on user-defined parametric EM models which have shown to be very hard to estimate in practical scenes, the proposed methodology leveraged the generalization capability of deep neural networks to accurately model EM spectra while still maintaining a strong connection to the physical mixing process. A deep generative model for each EM was trained prior to unmixing by using pure pixel information extracted directly from the observed HI, which allowed for an unsupervised formulation. The proposed EM model was then applied to solve the SU problem, which was cast as the estimation of the low-dimensional representations of the EMs in the latent space of the deep generative models and their corresponding fractional abundances, for each pixel. The resulting DeepGUn algorithm presented excellent performance despite the simple strategy used for selecting the training data for learning the generative model. Simulations using synthetic and real data indicate that the proposed method can lead to significant improvements in abundance estimation accuracy.


Ricardo Augusto Borsoi (S’18) received the MSc degree in electrical engineering from Federal University of Santa Catarina (UFSC), Florianópolis, Brazil, in 2016. He is currently working towards his doctoral degree at Université Côte d’Azur (OCA) and at UFSC. His research interests include image processing, tensor decomposition, and hyperspectral image analysis.

Tales Imbiriba (S’14, M’17) received his Doctorate degree from the Department of Electrical Engineering (DEE) of the Federal University of Santa Catarina (UFSC), Florianópolis, Brazil, in 2016. He served as a Postdoctoral Researcher at the DEE–UFSC and is currently a Postdoctoral Researcher at the ECE dept. of the Northeastern University, Boston, MA, USA. His research interests include audio and image processing, pattern recognition, kernel methods, adaptive filtering, and Bayesian Inference.

José Carlos M. Bermudez (S’78,M’85,SM’02) received the B.E.E. degree from the Federal University of Rio de Janeiro (UFRJ), Rio de Janeiro, Brazil, the M.Sc. degree in electrical engineering from COPPE/UFRJ, and the Ph.D. degree in electrical engineering from Concordia University, Montreal, Canada, in 1978, 1981, and 1985, respectively. He joined the Department of Electrical Engineering, Federal University of Santa Catarina (UFSC), Florianopolis, Brazil, in 1985. He is currently a Professor of Electrical Engineering at UFSC and a Professor at Catholic University of Pelotas (UCPel), Pelotas, Brazil. He has held the position of Visiting Researcher several times for periods of one month at the Institut National Polytechnique de Toulouse, France, and at Université Nice Sophia-Antipolis, France. He spent sabbatical years at the Department of Electrical Engineering and Computer Science, University of California, Irvine (UCI), USA, in 1994, and at the Institut National Polytechnique de Toulouse, France, in 2012. His recent research interests are in statistical signal processing, including linear and nonlinear adaptive filtering, image processing, hyperspectral image processing and machine learning. Prof. Bermudez served as an Associate Editor of the IEEE TRANSACTIONS ON SIGNAL PROCESSING in the area of adaptive filtering from 1994 to 1996 and from 1999 to 2001. He also served as an Associate Editor of the EURASIP Journal of Advances on Signal Processing from 2006 to 2010, and as a Senior Area Editor of the IEEE TRANSACTIONS ON SIGNAL PROCESSING from 2015 to 2019. He is the Chair of the Signal Processing Theory and Methods Technical Committee of the IEEE Signal Processing Society. Prof. Bermudez is a Senior Member of the IEEE.

Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description