Deep Generative Endmember Modeling:
An Application to Unsupervised Spectral Unmixing
Abstract
Endmember (EM) spectral variability can greatly impact the performance of standard hyperspectral image analysis algorithms. Extended parametric models have been successfully applied to account for EM spectral variability. However, these models still lack the compromise between flexibility and low-dimensional representation that is necessary to properly exploit the fact that spectral variability is often confined to a low-dimensional manifold in real scenes. In this paper we propose to learn a spectral variability model directly from the observed data, instead of imposing it a priori. This is achieved through a deep generative EM model, which is estimated using a variational autoencoder (VAE). The encoder and decoder that compose the generative model are trained using pure pixel information extracted directly from the observed image, which allows for an unsupervised formulation. The proposed EM model is applied to the solution of a spectral unmixing problem, which we cast as an alternating nonlinear least-squares problem that is solved iteratively with respect to the abundances and to the low-dimensional representations of the EMs in the latent space of the deep generative model. Simulations using both synthetic and real data indicate that the proposed strategy can outperform the competing state-of-the-art algorithms.
I Introduction
Hyperspectral image analysis consists of a vast collection of algorithms and methods used to retrieve vital information from hyperspectral images (HI) in an increasing number of applications. Common applications include [kouyama2016development, BioucasDias2013ID307] space exploration, remote sensing, surveillance, and, more recently, medical applications such as disease diagnosis and image-guided surgery [lu2014medical]. One analysis methodology of particular interest is spectral unmixing (SU), which aims at retrieving subpixel information concerning the spectra of materials present in the scene, as well as estimating the proportions in which they contribute to each HI pixel [Keshava:2002p5667, imbiriba2018ULTRA].
Many parametric models have been proposed to describe the interaction between light and the target surface [Keshava:2002p5667, Dobigeon2014ID322]. The simplest of such models is the Linear Mixing Model (LMM), which considers that the observed reflectance of an HI pixel is obtained from a convex combination of the spectral signatures of pure materials. This model imposes a convex geometry to the SU problem, where all HI pixels are confined to a simplex whose vertices are the pure material reflectances, usually termed endmembers (EMs). The linearity and convexity of the LMM lead to an interpretation of its coefficients as the relative abundances of each pure material in the HI. Nevertheless, some characteristics of practical HIs cannot be modeled by the standard LMM, such as nonlinearities [Dobigeon2014ID322, Imbiriba2016_tip, Imbiriba2017_bs_tip, Imbiriba2014] or variations of the EMs along the image [Thouvenin_IEEE_TSP_2016_PLMM, drumetz2016blindUnmixingELMM, imbiriba2018glmm]. More sophisticated models are required when such nonidealities have an important impact on the formation of the HI.
I-A EM variability and learning-based SU methods
The variation of the endmembers across an HI (also called EM variability) is a very common effect since we can often associate multiple, different spectral signatures to each pure underlying material in a scene. EM variability can originate from environmental conditions, illumination, or atmospheric or temporal changes [Zare2014ID324variabilityReview]. Its occurrence may incur the propagation of significant estimation errors throughout the unmixing process [Thouvenin_IEEE_TSP_2016_PLMM]. Different strategies have been proposed to cope with EM variability in SU. They can be classified into methods that represent EMs as sets, methods that model EMs as statistical distributions, and methods that incorporate parametric representations of EM variability in the mixing model [drumetz2016variabilityReviewRecent].
Parametric models are attracting considerable interest since they lead to good unmixing results and avoid the main drawbacks of the other groups of SU methods that address EM variability, namely the dependence on a priori knowledge of libraries of material spectra or the need for strong assumptions on the statistical distribution of the EMs for mathematical tractability [Zare2014ID324variabilityReview, drumetz2016variabilityReviewRecent]. Recently proposed parametric models attempt to capture spectral variability by extending the LMM using either additive [Thouvenin_IEEE_TSP_2016_PLMM] or multiplicative [drumetz2016blindUnmixingELMM, imbiriba2018glmm, Borsoi_multiscaleVar_2018, Borsoi_2018_Fusion] scaling factors, or by considering tensor-based formulations [imbiriba2018ULTRA_V, borsoi2019icassp].
Although SU methods based on extended parametric models offer different tradeoffs between representation capacity and model complexity, they still fail to achieve a desirable balance between a low-dimensional representation and enough flexibility to represent complex EM variability. Specifically, they fail to properly exploit the fact that, although being very complex and spectrally nonhomogeneous, spectral variability in real scenes is often confined to low-dimensional manifolds [Hapke1981, jacquemoud2001leafOpticalPropertiesReview, lobell2002moistureEffectsOnReflectance]. This property is due to the fact that the spectral signature of many materials is a function of only a few photometric or chemical properties of the medium. Prominent examples include packed particle spectra as a function of their roughness, size and density [Hapke1981], leaf reflectance spectra as a function of various biophysical parameters [jacquemoud2001leafOpticalPropertiesReview], and soil reflectance as a function of moisture conditions [lobell2002moistureEffectsOnReflectance]. Thus, existing models tend to be either too restrictive in their modeling capability or to lead to severely ill-posed estimation problems.
SU considering EM variability has also been formulated as a supervised learning problem, which is then solved without the need for an accurate physical model using neural networks (NNs) or support vector machines (SVMs) [okujeni2013learningSVRunmixingUrban, wang2013unmixingSVMregressionResidueConstr, mianji2011unmixingVariabilitySVMclassification, parente2017unmixingGenModels]. However, these strategies depend on the availability of vast amounts of training data to adequately capture the spectral diversity of real scenes. This makes the training process computationally intensive and often intractable for large EM libraries, which must also be known a priori. Some works attempt to reduce the computational cost of these solutions by modifying learning algorithms to use hybrid soft-hard classification [wang2009unmixingExtendedSVMpurePixels, gu2013unmixingExtendedSVMpurePixelsMultiKernel, li2015unmixingExtendedSVMgeometricAnalysis]. However, the resulting reconstructed abundance fractions do not have a clear physical interpretation due to the lack of a direct relationship to a physically motivated mixing model.
More recently, unsupervised SU approaches have also emerged by using autoencoders (AEC), which consist of encoder-decoder structured NNs originally devised for nonlinear dimensionality reduction [van2009dimensionalityReductionRev]. These methods attempt to associate the decoder structure of the network with the LMM and the low-dimensional representation of the input spectral vectors with the fractional abundances [palsson2018autoencoderUnmixing_IEEEaccess]. Different variations have been proposed, using preprocessing steps to reduce noise and outliers [guo2015autoencodersUnmixing, su2018autoencodersUnmixing], untying the decoder from the encoder weights [qu2018udas_autoencoderUnmixing], using spectral angle distances to address nonlinear SU [ozkan2018endnet_autoencoderUnmixing], or using denoising autoencoders to generate a robust initialization for matrix factorization-based SU strategies [su2019deepAutoencoderUnmixing]. In [ozkan2018autoencoderUnmixingVariability] the authors proposed a nonlinear encoder-decoder structure to address the unmixing problem considering spectral variability. The proposed solution involves the simultaneous training of six neural networks to optimize a very large number of parameters. An autoencoder structure is employed to estimate the parameters of a spectral model of a hyperspectral image by minimizing the image reconstruction error while limiting the energy of some of the model parameters. No regularization strategy connecting the different pixels of the image is employed.
Despite their popularity, supervised learning-based SU algorithms are still not able to properly address the spectral variability problem, as they depend on extremely large amounts of labeled training data, leading to a computationally unfeasible learning process. Furthermore, the lack of a clear connection between AEC-based strategies and the physical mixing process raises skepticism about the robustness of AEC-based SU in the face of more complex phenomena such as spectral variability.
I-B Proposed methodology
In this work, we propose a novel SU formulation that leverages the advantages of deep learning methods to address EM variability while still maintaining a strong connection to the physical mixing process, and using limited amounts of training data. Specifically, we adopt a deep generative NN to represent the manifold of EM spectra, which is then incorporated within the LMM. Generative models such as variational autoencoders (VAE) [kingma2013AEC_varBayes] and generative adversarial networks (GAN) [goodfellow2014GANs] have recently obtained excellent performance at learning the probability distribution of complex data sets in very high dimensional spaces (e.g. natural images) from relatively small amounts of training data. The structure of generative models allows one to find a low-dimensional latent representation that parsimoniously describes the variability of complex high-dimensional data sets. This leads to a low-dimensional parametrization of the training data distribution.
We formulate a novel unmixing strategy that can be cast as the problem of estimating the latent representations of the generative endmember models and the corresponding fractional abundances for each pixel in the HI. Specifically, we break down the SU problem into two steps. In the first step, we learn the latent EM variability manifold for each material in the scene using a deep generative EM model. The learning process uses pure pixel information directly extracted from the observed HI, which makes the proposed strategy suitable for unsupervised SU. In the second step, an alternating least-squares strategy is employed to estimate the parameters of an extended version of the LMM parametrized using the generative EM models obtained in the first step. The corresponding optimization problem is solved iteratively with respect to the abundances and to the low-dimensional representations of the EMs in the latent space of the deep generative models.
As a result, the proposed approach benefits from the reduced dimension of the latent space. Moreover, unlike current approaches, the new method does not depend on the careful selection of regularization parameters to yield a good performance. The resulting algorithm is named Deep Generative Unmixing algorithm (DeepGUn). The proposed method is strongly related to parametric models and leverages the learning and generalization capability of deep neural networks to properly represent the manifold of EM variability. Hence, DeepGUn leads to a model that is both lowdimensional and physically accurate, better describing the variability actually present in the scene.
Experimental results obtained with both synthetic and real data indicate that the proposed strategy leads to more accurate abundance estimations than standard state-of-the-art SU methods accounting for EM variability. Qualitative analysis of the estimated abundance maps confirms these results. The improved accuracy comes at the expense of a small increase in the computational cost when compared to the best competing strategies.
This paper is organized as follows. Section II briefly reviews the LMM and its parametric extended versions. Section III discusses the basic properties of generative models in the context of VAE and GAN. Section IV introduces the proposed generative EM model and its learning strategy. In Section V we formulate the resulting SU problem, present the DeepGUn algorithm, and discuss aspects of the proposed optimization strategy. The neural network architecture is discussed in Section VI. The performance of the proposed method is compared with that of competing algorithms in Section VII. Finally, the conclusions are presented in Section VIII.
II Linear Mixing Models
The Linear Mixing Model (LMM) [Keshava:2002p5667] assumes that a given $n$-th pixel $\mathbf{y}_n \in \mathbb{R}^L$, with $L$ bands, is represented as

$$\mathbf{y}_n = \mathbf{M}\mathbf{a}_n + \mathbf{e}_n \qquad (1)$$

where $\mathbf{M} \in \mathbb{R}^{L \times P}$ is a matrix whose columns are the $P$ EM spectral signatures $\mathbf{m}_k$, $\mathbf{a}_n$ is the abundance vector and $\mathbf{e}_n$ is an additive white Gaussian noise (WGN). The LMM assumes that the EM spectra $\mathbf{M}$ are fixed for all HI pixels $n = 1, \ldots, N$. This assumption jeopardizes the accuracy of estimated abundances in many circumstances due to the spectral variability existing in a typical scene.
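As an illustrative sketch (not part of the original formulation), the LMM in (1) can be simulated in a few lines of numpy; the band and endmember counts and the noise level below are arbitrary placeholders.

```python
import numpy as np

# Minimal sketch of the LMM: one pixel as a convex combination of
# endmember signatures plus additive white Gaussian noise.
rng = np.random.default_rng(0)
L, P = 224, 3                       # number of bands and endmembers (illustrative)
M = rng.uniform(0.0, 1.0, (L, P))   # columns = endmember spectral signatures
a = np.array([0.5, 0.3, 0.2])       # abundances: nonnegative and summing to one
e = 0.01 * rng.standard_normal(L)   # additive WGN
y = M @ a + e                       # observed pixel reflectance
```

Because the abundances are nonnegative and sum to one, every noiseless pixel lies inside the simplex whose vertices are the columns of the EM matrix.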
Different parametric models have been recently proposed to account for variable EM spectra within a given scene [drumetz2016blindUnmixingELMM, Thouvenin_IEEE_TSP_2016_PLMM, imbiriba2018glmm]. These models can be generically described as
$$\mathbf{y}_n = f(\mathbf{M}_0, \boldsymbol{\theta}_n)\,\mathbf{a}_n + \mathbf{e}_n \qquad (2)$$

where $f(\cdot)$ is a parametric function, $\mathbf{M}_0$ is a reference EM matrix, and $\boldsymbol{\theta}_n$ is a vector of parameters describing the manifold of EM variability.
Different functional forms have been proposed for $f$ to account for EM variability in this framework, such as additive [Thouvenin_IEEE_TSP_2016_PLMM] or multiplicative [drumetz2016blindUnmixingELMM, imbiriba2018glmm] variability factors acting upon the reference EM matrix $\mathbf{M}_0$. However, these models fail to achieve a desirable balance between a low-dimensional representation and enough flexibility to represent complex variability patterns. They tend to be either too restrictive in their modeling capability, or to lead to ill-posed optimization problems [borsoi2019icassp]. Instead of using a predefined parametric model, we propose to address this issue by learning the parametric function using a generative model.
III Generative Models
Generative models attempt to estimate the probability distribution of a random variable based on a set of observations, in such a way that one can generate new samples that look similar to new realizations of that variable. The main characteristic of this problem, which sets it apart from other unsupervised learning methods such as density estimation, is the fact that we must be able to sample from the estimated model.
In many practical applications of interest, the dimensionality of the variable of interest is very high. This makes the general problem very difficult, as it amounts to estimating and sampling from an arbitrary high-dimensional probability density function [neal2001annealedDensityEstimation, arjovsky2017wassersteinGANs]. Nonetheless, the distributions of interest are often supported on a low-dimensional manifold parameterized by a set of so-called latent variables, and this fact can be exploited to make the problem more tractable. A convenient way to address this problem is to define a new random variable $\mathbf{z}$ with a known distribution $p(\mathbf{z})$ in a low-dimensional space (e.g. an isotropic Gaussian distribution), and a parametric function $G_\theta$ (e.g. a neural network) mapping the latent space to the data space such that the image of $\mathbf{z}$ by $G_\theta$ is a random variable whose distribution is very close to that of the data $\mathbf{x}$. In other words, the goal becomes to learn the parameters $\theta$ such that the distribution of $G_\theta(\mathbf{z})$ is as close to the data distribution as possible. Then, new samples can be generated by sampling $\mathbf{z}$ from $p(\mathbf{z})$ and applying the mapping $G_\theta$.
Although estimating $G_\theta$ may still seem difficult at first, recent advances in machine learning such as VAEs [kingma2013AEC_varBayes] and GANs [goodfellow2014GANs] have shown formidable performance at learning complex distributions such as those of natural images.
VAEs address this problem by assuming that the observed data are generated by a directed graphical model $p_\theta(\mathbf{x}|\mathbf{z})$, which is represented by the function $G_\theta$. The parameters of $G_\theta$ are learned by maximizing a lower bound on the log-likelihood of the data [kingma2013AEC_varBayes]:

$$\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\big[\log p_\theta(\mathbf{x}|\mathbf{z})\big] - D_{\mathrm{KL}}\big(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\big) \qquad (3)$$

where $D_{\mathrm{KL}}(\cdot\|\cdot)$ is the Kullback-Leibler divergence between two distributions, $\mathbb{E}_{q}[\cdot]$ is the expected value operator with respect to the distribution $q$, and $q_\phi(\mathbf{z}|\mathbf{x})$ is a variational approximation to the intractable posterior $p_\theta(\mathbf{z}|\mathbf{x})$, modeled by a function (e.g. another neural network) parameterized in $\phi$. Note that $q_\phi$ must be a high-capacity distribution (capacity being a generic term describing how complex a relationship it can model), so that it can provide a good approximation of the posterior, which then allows the lower bound in (3) to be close to the true value of $\log p_\theta(\mathbf{x})$ [doersch2016tutorialVAE].
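To make the bound in (3) concrete, the following minimal numpy sketch evaluates it for a standard Gaussian prior, a diagonal-Gaussian encoder, and a Gaussian likelihood approximated by a single Monte Carlo sample. The closed-form KL term for diagonal Gaussians is standard [kingma2013AEC_varBayes]; the function names and the unit noise variance are our own illustrative choices.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def elbo(x, x_rec, mu, log_var, noise_var=1.0):
    """Evidence lower bound of (3): reconstruction log-likelihood
    (Gaussian, one Monte Carlo sample x_rec) minus the KL term."""
    log_lik = -0.5 * np.sum((x - x_rec) ** 2) / noise_var
    return log_lik - gaussian_kl(mu, log_var)
```

Maximizing this quantity over the encoder and decoder parameters simultaneously improves reconstruction quality and keeps the approximate posterior close to the prior.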
GANs, on the other hand, attempt to learn the data distribution by searching for the Nash equilibrium of a two-player adversarial game [goodfellow2014GANs]. A generator network $G$ tries to map the distribution of the latent variables into the data distribution, and a discriminator network $D$ tries to predict the probability of a random sample coming from the true distribution instead of being generated through $G$. The generator is trained by maximizing the probability of the discriminator making a mistake. This is formulated as the minimax optimization problem

$$\min_G \max_D \; \mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}}}\big[\log D(\mathbf{x})\big] + \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}\big[\log\big(1 - D(G(\mathbf{z}))\big)\big] \qquad (4)$$
GANs are more flexible and have shown better performance than VAEs at approximating complex distributions such as those of natural images (leading to sharper results) [goodfellow2014GANs]. However, GANs are also much harder to train [arjovsky2017wassersteinGANs]. Moreover, VAEs naturally offer a way to obtain the latent representations corresponding to data samples by mapping them through the encoder model. This property and their stable training have motivated us to use VAEs in this work.
IV A Deep Generative Endmember Model
In this section, we propose to model the distribution of EM spectral variability using a deep generative model. By doing so, we can explicitly exploit a common property of spectral variability: the EM spectra are usually confined to a low-dimensional manifold. This property is due to the fact that the spectral signature of many materials is a function of a few photometric or chemical properties of the medium. Prominent examples include packed particle spectra as a function of their roughness, size and density [Hapke1981], leaf reflectance spectra as a function of various biophysical parameters [jacquemoud2001leafOpticalPropertiesReview], and soil reflectance as a function of moisture conditions [lobell2002moistureEffectsOnReflectance].
IV-A The steps of the proposed SU method
We assume the existence of nonlinear functions $\mathcal{G}_k$, $k = 1, \ldots, P$ (the generative models), that map latent representations $\mathbf{z}_k$ into their corresponding spectral signatures $\mathbf{m}_k = \mathcal{G}_k(\mathbf{z}_k)$. We also assume the existence of encoder models $\mathcal{E}_k$ that map spectral signatures into their latent representations. In other words, we assume that any arbitrary observation of a spectral signature of the $k$-th material belongs to the set

$$\mathcal{M}_k = \big\{ \mathbf{m} : \mathbf{m} = \mathcal{G}_k(\mathbf{z}),\; \mathbf{z} \in \mathbb{R}^K \big\} \qquad (5)$$

and thus can be equivalently represented by a corresponding low-dimensional vector in the latent space of the generative model $\mathcal{G}_k$. This reasoning is illustrated in Fig. 1, where the encoder function maps the input EM signature to the low-dimensional manifold. Reciprocally, low-dimensional vectors in the latent space can be mapped (decoded) to their corresponding spectral signatures using $\mathcal{G}_k$.
As such, we can formulate EM estimation in the SU problem in the latent domain (as opposed to the input spectral space), which is of a much lower dimension. Moreover, this approach keeps the physical interpretation of the model, provided that we have relevant training data to learn the generative models [shah2018invGANpriors, bora2017compressedInvGANs, anirudh2018unsupervisedInverseProblemsGANs, asim2018GANsDeblurring]. This strategy relies on the existence of a priori training data for each material in the image, which might come in the form of, e.g., spectral libraries of laboratory measurements [iordache2011sunsal]. Nevertheless, we propose a more practical and effective approach to train the generators by exploring information contained in multiple pure pixels extracted from the observed HI. The presence of multiple pure pixels in an observed HI is a characteristic of many real scenes, and can be leveraged to help in estimating the EM models, thus reducing the ill-posedness of the SU problem (pure pixels are defined here as pixels whose spectral distance relative to the reference EMs in $\mathbf{M}_0$ is less than a specified threshold).
Therefore, we propose to break the unmixing problem into a sequence of two problems:

Using pure pixel information extracted from the HI by standard EM extraction methods, learn the generative and encoder models, $\mathcal{G}_k$ and $\mathcal{E}_k$, for all EMs in the scene ($k = 1, \ldots, P$).

Using the learned generative models, solve the SU problem by estimating the latent EM representations and the fractional abundance vectors that can best represent the observed hyperspectral data, for all pixels in the scene ($n = 1, \ldots, N$).
IV-B Learning the generative and encoder models $\mathcal{G}_k$ and $\mathcal{E}_k$
The objective of this first problem is to estimate the generative and encoding models $\mathcal{G}_k$ and $\mathcal{E}_k$, for $k = 1, \ldots, P$. We assume the knowledge of a set of pure pixels $\mathcal{P}_k$ for the $k$-th EM, for all $k$. Multiple pure pixels exist in many scenes, and can be directly extracted from the observed HI using automated EM extraction techniques [somers2012automatedBundlesRansomSampling, somers2016endmemberLibrariesChapter]. The sets of pure pixels $\mathcal{P}_k$, which can be seen as observations from the statistical distribution of each EM, are then used as training data to learn the models $\mathcal{G}_k$ and $\mathcal{E}_k$ using a VAE [kingma2013AEC_varBayes]. If the set $\mathcal{P}_k$ is representative of the variability of the $k$-th material, the learned generative model will be able to accurately describe the manifold of the $k$-th EM variability. Doing the same for all $k$ yields a set of variability models for all the EM spectra.
Although the extraction of multiple pure pixels from observed HIs is a wellestablished technique used to produce EM libraries [somers2012automatedBundlesRansomSampling], mixed pixels can sometimes be mistakenly identified as a pure pixel of some of the EMs. This constitutes a problem for librarybased SU applications (e.g. MESMA and sparse SU) since some of the library spectra may end up not being representative of their EM class (material).
The smooth nature of the latent representation of VAEs allows the mitigation of this problem in the proposed approach. Assuming the availability of a reference EM matrix $\mathbf{M}_0$ of correctly identified signatures (which can be obtained using any EM extraction method) and of a set of encoder models $\mathcal{E}_k$, we can compute the latent representation of the reference signature of each EM as

$$\mathbf{z}_k^{\mathrm{ref}} = \mathcal{E}_k(\mathbf{m}_{0,k}) \qquad (6)$$

where $\mathbf{m}_{0,k}$ is the $k$-th column of $\mathbf{M}_0$. The latent representation $\mathbf{z}_k^{\mathrm{ref}}$ can be used as a reference latent code for the $k$-th material. Thus, we can measure how close an estimated EM latent representation is to the latent representation of a pure pixel by evaluating its Euclidean distance to $\mathbf{z}_k^{\mathrm{ref}}$. This can be done since the output of VAEs has been shown to vary smoothly with changes of the latent variable [kingma2013AEC_varBayes]. Thus, we can use $\mathbf{z}_k^{\mathrm{ref}}$ to regularize the SU problem and prevent the estimated latent codes from representing mixed pixels. This increases the robustness of the proposed approach.
IV-C Extracting sets of pure pixels from the observed HI
An important part of the proposed methodology consists in the extraction of the sets of pure pixels $\mathcal{P}_k$ from the observed hyperspectral image. Although different strategies have been proposed for image-based library construction (see e.g. [uezato2016novelBundlesExtractionClustering, andreou2016bundlesExtractionVariabilityMultiscaleBand]), these techniques depend on multiple parameters that must be carefully adjusted in order to obtain good results. Instead of these approaches, we adopt a very simple strategy to select pure pixels from an HI that makes use of the reference matrix $\mathbf{M}_0$ extracted from the image using a pure-pixel-based endmember extraction algorithm (e.g. VCA [Nascimento2005]), which will also later be used to construct $\mathbf{z}_k^{\mathrm{ref}}$ in (6). We simply select as the elements of $\mathcal{P}_k$ the image pixels that have the smallest spectral angle to the reference signature in the $k$-th column of $\mathbf{M}_0$, up to the desired cardinality of $\mathcal{P}_k$. Although the success of this strategy depends on having a reasonably accurate estimation of $\mathbf{M}_0$, we experimentally found it to be more robust and easier to adjust than, for instance, the one in [somers2012automatedBundlesRansomSampling].
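A minimal sketch of this pure-pixel selection rule, assuming the HI pixels are the columns of a matrix `Y` and the reference EMs the columns of `M0` (the function names are ours, not the paper's):

```python
import numpy as np

def spectral_angle(y, m):
    """Spectral angle (in radians) between a pixel y and a signature m."""
    cosang = y @ m / (np.linalg.norm(y) * np.linalg.norm(m))
    return np.arccos(np.clip(cosang, -1.0, 1.0))

def extract_pure_pixels(Y, M0, n_pure):
    """For each reference endmember (column of M0), return the indices of
    the n_pure pixels of Y (columns) with the smallest spectral angle."""
    pure = []
    for k in range(M0.shape[1]):
        angles = np.array([spectral_angle(Y[:, n], M0[:, k])
                           for n in range(Y.shape[1])])
        pure.append(np.argsort(angles)[:n_pure])
    return pure
```

The only tuning parameter here is the number of pixels retained per endmember, which is simpler to adjust than multi-parameter bundle-extraction schemes.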
V The Unmixing Algorithm
Given a set of generative models $\mathcal{G}_k$, $k = 1, \ldots, P$, for each EM in the scene, a latent-space representation $\mathbf{z}_k^{\mathrm{ref}}$ of a reference EM matrix $\mathbf{M}_0$, and an HI, the SU problem can be cast as the minimization of a risk functional of the form

$$\mathcal{J}(\mathbf{A}, \mathcal{Z}) = \frac{1}{2}\sum_{n=1}^{N} \big\|\mathbf{y}_n - \mathcal{G}(\mathbf{Z}_n)\,\mathbf{a}_n\big\|^2 + \mathcal{R}(\mathbf{A}) + \mathcal{R}(\mathcal{Z}) \qquad (7)$$

where $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_N]$ is the abundance matrix, $\mathcal{Z}$ is a 3D tensor obtained by stacking all pixel-dependent latent EM representations $\mathbf{Z}_n = [\mathbf{z}_{n,1}, \ldots, \mathbf{z}_{n,P}]$, $\mathcal{R}(\mathbf{A})$ and $\mathcal{R}(\mathcal{Z})$ are regularization terms to improve the problem conditioning, and the matrix-valued function defined as

$$\mathcal{G}(\mathbf{Z}_n) = \big[\mathcal{G}_1(\mathbf{z}_{n,1}), \ldots, \mathcal{G}_P(\mathbf{z}_{n,P})\big]$$

is the concatenation of the generative functions for each EM.
The term $\mathcal{R}(\mathbf{A})$ is a regularization functional that aims to provide spatial smoothness and to enforce positivity and sum-to-one constraints on the abundances. It is given by [imbiriba2018glmm]

$$\mathcal{R}(\mathbf{A}) = \lambda_A \big( \|\mathcal{H}_h(\mathbf{A})\|_{1,1} + \|\mathcal{H}_v(\mathbf{A})\|_{1,1} \big) + \iota_{\mathcal{S}}(\mathbf{A}) \qquad (8)$$

where the parameter $\lambda_A$ controls the contribution of this term to the cost function. The first two terms are spatial regularizers over $\mathbf{A}$, where $\mathcal{H}_h$ and $\mathcal{H}_v$ are linear operators that compute the first-order horizontal and vertical gradients of a bidimensional signal, acting separately for each material of $\mathbf{A}$, and $\|\cdot\|_{1,1}$ is the norm defined as $\|\mathbf{X}\|_{1,1} = \sum_j \|\mathbf{x}_j\|_1$. The term $\iota_{\mathcal{S}}(\mathbf{A})$ is the indicator function of the unit simplex, i.e. $\iota_{\mathcal{S}}(\mathbf{A}) = 0$ if $\mathbf{a}_n \in \mathcal{S}$ for all $n$ and $\iota_{\mathcal{S}}(\mathbf{A}) = +\infty$ otherwise, where

$$\mathcal{S} = \big\{ \mathbf{a} : \mathbf{a} \succeq \mathbf{0},\; \mathbf{1}^\top \mathbf{a} = 1 \big\} \qquad (9)$$
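The indicator function in (8) is typically handled in practice by projecting candidate abundances onto the simplex in (9). A minimal sketch of this Euclidean projection, using the standard sorting-based algorithm (an illustrative helper of ours, not a step taken verbatim from the paper's algorithm):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the unit simplex
    {a : a >= 0, sum(a) = 1}, via the classical sorting algorithm.
    This is how the simplex indicator constraint is enforced in practice."""
    u = np.sort(v)[::-1]                      # sort in decreasing order
    css = np.cumsum(u) - 1.0                  # cumulative sums minus the target sum
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)            # shift that makes the result sum to 1
    return np.maximum(v - theta, 0.0)         # clip negatives after shifting
```

Vectors already on the simplex are left unchanged, which is consistent with the indicator function being zero there.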
The term $\mathcal{R}(\mathcal{Z})$ constrains the EM latent representations to be close to the latent representation of the reference EM matrix $\mathbf{M}_0$. It is given by

$$\mathcal{R}(\mathcal{Z}) = \lambda_Z \sum_{n=1}^{N}\sum_{k=1}^{P} \big\|\mathbf{z}_{n,k} - \mathbf{z}_k^{\mathrm{ref}}\big\|^2 \qquad (10)$$

where the parameter $\lambda_Z$ controls the contribution of this term to the cost function. This regularization makes the estimation problem more robust to the selection of the training data by assuring the closeness of the estimated latent codes to the representations of pure pixels of each class. However, it relies indirectly on the reference EM signatures (which are extracted from the observed HI with endmember extraction algorithms) being adequate representatives of their material classes in order to provide a good performance.
The optimization problem then becomes

$$\min_{\mathbf{A}, \mathcal{Z}} \; \mathcal{J}(\mathbf{A}, \mathcal{Z}) \qquad (11)$$

The problem defined in (11) is nonsmooth and nonconvex if solved simultaneously with respect to both variables $\mathbf{A}$ and $\mathcal{Z}$. However, an approximate solution can be found by minimizing (11) iteratively with respect to each variable, leading to the Deep Generative Unmixing (DeepGUn) method described in Algorithm 1. The DeepGUn algorithm consists of two distinct steps. First, the generative and encoder models $\mathcal{G}_k$ and $\mathcal{E}_k$, $k = 1, \ldots, P$, are trained based on the pure pixels $\mathcal{P}_k$ extracted from the observed HI, and $\mathbf{z}_k^{\mathrm{ref}}$ is computed. Afterwards, the alternating minimization approach is applied to compute the abundance maps and the latent representations of the endmembers for each pixel. We next describe the details of each optimization step. Implementation details are described in Sections VI and VII.
V-A Optimization with respect to $\mathcal{Z}$
Rewriting (11) considering only the terms in (7) that depend on $\mathcal{Z}$, the problem becomes

$$\min_{\mathcal{Z}} \; \frac{1}{2}\sum_{n=1}^{N} \big\|\mathbf{y}_n - \mathcal{G}(\mathbf{Z}_n)\,\mathbf{a}_n\big\|^2 + \lambda_Z \sum_{n=1}^{N}\sum_{k=1}^{P} \big\|\mathbf{z}_{n,k} - \mathbf{z}_k^{\mathrm{ref}}\big\|^2 \qquad (12)$$

This is a regularized nonlinear least squares problem, which can be solved individually for each pixel. Thus, (12) can be decomposed into $N$ nonconvex, nonlinear optimization problems of dimensionality $PK$ (the number of EMs times the latent dimension) by denoting each summand in (12) by $\mathcal{J}_n(\mathbf{Z}_n)$, $n = 1, \ldots, N$. We solve each of those problems using a quasi-Newton algorithm, described in Algorithm 2, which provides an efficient solution for high-dimensional functions [nocedal2006optimizationBook].
Although problem (12) is generally nonconvex, recent research [hand2018convexityInverseOptGANs] has proven that, under suitable assumptions on the generator network, the problem of recovering the latent variable does not have any stationary points (e.g. local minima or saddle points) outside a small neighborhood of the desired solution and its negative scalar multiple. This indicates that (12) has a favorable global geometry.
Note that $\mathcal{G}$ is not necessarily differentiable with respect to the latent representations, which can make the optimization problem more challenging. Nonetheless, quasi-Newton algorithms show excellent performance on nonsmooth problems [lewis2013nonsmoothQuasiNewton], where convergence is generally observed as long as the line search procedure does not return a point at which the objective function is nondifferentiable. This allows quasi-Newton algorithms to be directly applied to obtain approximate solutions to nonsmooth problems with good computational efficiency [lewis2013nonsmoothQuasiNewton, curtis2015quasiNewtonNonsmooth].
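As a toy illustration of the per-pixel latent update, the sketch below minimizes a single-endmember instance of (12) with numerical gradients and plain gradient descent, a simple stand-in for the quasi-Newton solver of Algorithm 2; all names, step sizes, and iteration counts are illustrative assumptions.

```python
import numpy as np

def latent_update(y, a_k, decode, z0, z_ref, lam_z, n_iter=200, lr=0.05):
    """Sketch of the per-pixel latent update: minimize
    ||y - a_k * decode(z)||^2 + lam_z * ||z - z_ref||^2
    over the latent code z of one endmember, using forward-difference
    gradients (a stand-in for the quasi-Newton step of Algorithm 2)."""
    def cost(z):
        return (np.sum((y - a_k * decode(z)) ** 2)
                + lam_z * np.sum((z - z_ref) ** 2))

    z, eps = z0.astype(float), 1e-6
    for _ in range(n_iter):
        # forward-difference approximation of the gradient of the cost
        g = np.array([(cost(z + eps * np.eye(len(z))[i]) - cost(z)) / eps
                      for i in range(len(z))])
        z = z - lr * g
    return z
```

With `decode` taken as the identity map and no regularization, the update recovers the data term's minimizer, which is a quick sanity check of the cost and gradient.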
V-B Optimization with respect to $\mathbf{A}$
Restating (11) considering only the terms in (7) that depend on $\mathbf{A}$ leads to

$$\min_{\mathbf{A}} \; \frac{1}{2}\sum_{n=1}^{N} \big\|\mathbf{y}_n - \mathcal{G}(\mathbf{Z}_n)\,\mathbf{a}_n\big\|^2 + \lambda_A \big( \|\mathcal{H}_h(\mathbf{A})\|_{1,1} + \|\mathcal{H}_v(\mathbf{A})\|_{1,1} \big) + \iota_{\mathcal{S}}(\mathbf{A}) \qquad (13)$$
Since the latent variables are fixed, (13) consists of an SU problem with a pixel-dependent EM matrix and an edge-preserving spatial regularization. Although this problem is not separable with respect to each pixel in the image, the Alternating Direction Method of Multipliers (ADMM) can be used to obtain an efficient solution [Boyd_admm_2011]. The solution of (13) using the ADMM is well described elsewhere (e.g. [drumetz2016blindUnmixingELMM]) and is thus omitted here for conciseness.
VI Neural Network Architecture
As discussed before, we use a VAE [kingma2013AEC_varBayes] to learn the generative and encoder models $\mathcal{G}_k$ and $\mathcal{E}_k$ from the sets of pure pixels $\mathcal{P}_k$. Compared to GANs, the training of VAEs is much simpler and more stable [arjovsky2017wassersteinGANs]. Moreover, VAEs naturally return the encoder model as an approximation to the posterior distribution when learning $\mathcal{G}_k$. We selected a low dimension for the latent space, as it was experimentally verified to be sufficient to adequately capture the variability of each single material in a scene.
For the network architectures, we selected the number of layers and neurons according to the autoencoder implementation in [van2008matlabToolbox, van2009dimensionalityReductionRev], with three hidden layers using ReLU activations (defined as $\mathrm{ReLU}(x) = \max(0, x)$); the architectures are described in more detail in Tables I and II.
We found that this configuration led to spectrally smooth generated signatures and generalized well with small training sample sizes. We trained the network for 50 epochs with the Adam optimizer [kingma2014adam] in TensorFlow, using mini-batch optimization with a batch size equal to one third of the total amount of training data for each EM.
TABLE I
Layer       Activation Function   Number of units
Input       —
Hidden #1   ReLU
Hidden #2   ReLU
Hidden #3   ReLU

TABLE II
Layer       Activation Function   Number of units
Hidden #1   ReLU
Hidden #2   ReLU
Hidden #3   ReLU
Output      Sigmoid
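As a concrete (hypothetical) illustration of the network ending in a sigmoid output layer (Table II), which we take to be the generative (decoder) part since its outputs must be valid reflectances in [0, 1], the following numpy sketch implements its forward pass; the layer widths and weights are placeholders, not the trained values.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decoder_forward(z, weights):
    """Forward pass of a decoder as in Table II: three ReLU hidden
    layers followed by a sigmoid output layer, so that the generated
    reflectances lie in (0, 1). `weights` is a list of (W, b) pairs,
    one per layer, ordered from the latent input to the output."""
    h = z
    for W, b in weights[:-1]:
        h = relu(W @ h + b)       # hidden layers with ReLU activations
    W, b = weights[-1]
    return sigmoid(W @ h + b)     # sigmoid output maps to (0, 1)
```

In the actual method this map plays the role of one generative model, taking a low-dimensional latent code to a full spectral signature.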
VII Experimental Results
In this section, simulation results using both synthetic and real data illustrate the performance of the proposed method. We compare the proposed DeepGUn method with the fully constrained least squares (FCLS), the PLMM [Thouvenin_IEEE_TSP_2016_PLMM], the ELMM [drumetz2016blindUnmixingELMM], and the GLMM [imbiriba2018glmm]. In all experiments, the VCA algorithm [Nascimento2005] was used to extract the reference EM matrix from the observed HI and to initialize the different methods. The abundance maps of all methods were initialized using the results obtained by the FCLS algorithm. The sets of pure pixels were constructed by selecting the image pixels with the smallest spectral angles relative to the reference EMs in . We ran the alternating optimization process in Algorithm 1 for at most iterations or until the relative change of and was less than . The iterative procedure in Algorithm 2 was run until the relative change of was less than . The performances were evaluated using the Normalized Root Means Squared Error (NRMSE) between the estimated abundance maps (), between the EM matrices () and between the reconstructed images (). The NRMSE between a true, generic tensor and its estimate is defined as
NRMSE_X = ‖X − X̂‖_F / ‖X‖_F    (14)
Note that for the case of NRMSE_Y, each pixel of the reconstructed image is given by the product of the estimated EM matrix and the estimated abundance vector for that pixel.
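The NRMSE of Eq. (14) can be sketched as follows, assuming the standard normalization of the error norm by the norm of the true tensor; the helper name `nrmse` is illustrative.

```python
import numpy as np

def nrmse(x_true, x_est):
    """NRMSE as in Eq. (14): Euclidean (Frobenius) norm of the estimation
    error over all entries, normalized by the norm of the true tensor."""
    x_true = np.asarray(x_true, dtype=float)
    x_est = np.asarray(x_est, dtype=float)
    err = np.sqrt(np.sum((x_est - x_true) ** 2))
    ref = np.sqrt(np.sum(x_true ** 2))
    return err / ref
```

The same function applies to abundance maps, EM tensors, or reconstructed image cubes, since the sums run over all tensor entries.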
We also consider the spectral angle mapper (SAM) to evaluate the estimated EMs,

SAM_M = (1 / (PN)) Σ_{p=1}^{P} Σ_{n=1}^{N} arccos( (m_{p,n}ᵀ m̂_{p,n}) / (‖m_{p,n}‖ ‖m̂_{p,n}‖) )    (15)

where m_{p,n} and m̂_{p,n} are the true and the estimated signatures of the p-th endmember in the n-th pixel, respectively.
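The SAM metric of Eq. (15) can be sketched as below, with signatures stored as matrix columns; the helper name `sam` and the column layout are assumptions for illustration.

```python
import numpy as np

def sam(M_true, M_est):
    """Average spectral angle (in radians) between true and estimated
    signatures, as in Eq. (15). Rows index spectral bands; each column
    holds one signature (one endmember/pixel pair)."""
    num = np.sum(M_true * M_est, axis=0)
    den = np.linalg.norm(M_true, axis=0) * np.linalg.norm(M_est, axis=0)
    # Clip to guard against round-off pushing cosines outside [-1, 1].
    cos = np.clip(num / den, -1.0, 1.0)
    return float(np.mean(np.arccos(cos)))
```

Unlike the NRMSE, the SAM is insensitive to per-signature scaling, so it isolates shape errors from amplitude errors.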
Table III: Quantitative results on the synthetic data cubes.

Method   | NRMSE_A | NRMSE_M | SAM_M  | NRMSE_Y | Time [s]

Data Cube 1 – DC1
FCLS     | 0.2854  | —       | —      | 0.0350  | 0.71
PLMM     | 0.2604  | 0.1075  | 0.0440 | 0.0007  | 122.09
ELMM     | 0.2554  | 0.1032  | 0.0398 | 0.0321  | 8.82
GLMM     | 0.2480  | 0.1036  | 0.0355 | 0.0235  | 23.74
DeepGUn  | 0.0566  | 0.0944  | 0.0233 | 0.0448  | 75.20

Data Cube 2 – DC2
FCLS     | 0.1294  | —       | —      | 0.0393  | 0.38
PLMM     | 0.1197  | 0.0481  | 0.0378 | 0.0336  | 41.31
ELMM     | 0.1110  | 0.0566  | 0.0382 | 0.0231  | 20.25
GLMM     | 0.1146  | 0.0534  | 0.0367 | 0.0226  | 17.03
DeepGUn  | 0.0969  | 0.0463  | 0.0323 | 0.0384  | 36.40

Data Cube 3 – DC3
FCLS     | 0.2606  | —       | —      | 0.0542  | 0.34
PLMM     | 0.2028  | 0.0928  | 0.0385 | 0.0302  | 59.88
ELMM     | 0.1997  | 0.0640  | 0.0188 | 0.0238  | 17.99
GLMM     | 0.1841  | 0.0638  | 0.0185 | 0.0226  | 25.74
DeepGUn  | 0.1613  | 0.0600  | 0.0172 | 0.0457  | 48.96

Data Cube 4 – DC4
FCLS     | 0.5109  | —       | —      | 0.1712  | 0.50
PLMM     | 0.5066  | 0.6245  | 0.4874 | 0.0320  | 269.48
ELMM     | 0.4385  | 0.4712  | 0.1451 | 0.0106  | 18.36
GLMM     | 0.4371  | 0.4855  | 0.1972 | 0.0108  | 21.15
DeepGUn  | 0.2550  | 0.2918  | 0.0873 | 0.1403  | 99.94
VII-A Synthetic data
To quantitatively compare the different algorithms, four synthetic datasets were created, namely Data Cubes 1–4 (DC1–DC4), with 70×70 pixels (DC1) and 50×50 pixels (DC2, DC3 and DC4). These datasets were built using three (DC1, DC2 and DC3) and five (DC4) 224-band EMs extracted from the USGS Spectral Library [clark2003imaging] and spatially correlated abundance maps, as depicted in the first row of Fig. 2.
Spectral variability of the EMs was imposed using four different models. For the DC1 data cube, we adopted the variability model used in [Thouvenin_IEEE_TSP_2016_PLMM], consisting of pixelwise multiplicative spectral factors given by random piecewise-linear functions. For DC2, the variability model of [imbiriba2018glmm] was used, consisting of band-dependent scaling factors that varied smoothly in both the spatial and spectral dimensions. For DC3, we considered a simple model introduced in [uezato2016novelBundlesExtractionClustering, Section IV-A1] to emulate errors in atmospheric compensation as a function of the viewing geometry, given the direct and diffuse light on the scene and the solar path transmittance. For data cube DC4, we used as endmembers pure pixels of five materials (asphalt, tree, roof, metal and dirt) that were manually extracted from a real hyperspectral image, thus depicting realistic spectral variability. Finally, white Gaussian noise was added to all datasets to yield a 30 dB SNR.
The optimal parameters for each algorithm were selected by performing a grid search for each dataset, with search ranges chosen according to those discussed by the authors in the original publications. For the PLMM we searched for , and in the ranges , and , respectively. For both the ELMM and the GLMM, the parameters were selected among the following values: , , and . For the proposed DeepGUn algorithm, we fixed and selected among the values . For the proposed method, the sets of pure pixels for each EM were constructed by selecting the pixels closest to the reference materials.
The quantitative results are shown in Table III, with the best results for each metric marked in bold. The proposed method clearly outperformed the competing algorithms in terms of for all four datasets. Qualitatively, the abundance maps provided by DeepGUn, displayed in Fig. 2, are clearly much closer to the true abundance maps than those provided by the other methods. These are important results, as accuracy in abundance estimation is the main objective of SU.
For the EM reconstruction metrics and , DeepGUn gave the best results for all data cubes. This indicates that the proposed endmember model used by DeepGUn allows for precise material identification from the observed hyperspectral scenes.
The reconstruction error NRMSE_Y of the DeepGUn algorithm was comparable to that of the FCLS and significantly larger than that of the GLMM. This is natural, since the GLMM has more degrees of freedom. However, the connection between NRMSE_Y and the abundance estimation error is far from direct, as can be seen in Table III.
The execution times, in the rightmost column of Table III, indicate that the computational complexity of DeepGUn lies between those of the GLMM and the PLMM, its two major competitors. Hence, the DeepGUn method yielded superior SU performance with easier parameter tuning and at a reasonable computational cost.
VII-B Influence of the latent dimension
An important parameter in the design of the DeepGUn method is the dimensionality of the latent space of the generative endmember models. The individual dimensions of the latent space are used to represent changes in the endmember signatures due to spectral variability. Since in a given scene the endmembers are likely to be affected by only a small number of effects (hence the hypothesis that they are supported on a low-dimensional manifold), the latent dimension should be small in order to avoid introducing spurious effects and increasing the computational complexity of the solution.
To illustrate this, we performed a simulation with DeepGUn in which we varied the dimensionality of the latent space and measured the normalized abundance reconstruction error NRMSE_A. For this, we considered the data cubes DC1 and DC2 described in Section VII-A. The results are depicted in Fig. 3, and show that the abundance estimation error tends to increase with the latent dimension. A sharper increase in NRMSE_A can also be seen for DC1 when compared to DC2, likely because dataset DC2 uses a more complex model to generate endmember variability. This indicates that selecting a small latent dimension is important to obtain good unmixing results.
Table IV: Results on the real datasets.

         |    Houston HI      |     Samson HI      |  Jasper Ridge HI
Method   | NRMSE_Y | Time [s] | NRMSE_Y | Time [s] | NRMSE_Y | Time [s]
FCLS     | 0.2470  | 2.56     | 0.0545  | 1.38     | 0.2057  | 1.52
PLMM     | 0.0713  | 663.25   | 0.0239  | 103.84   | 0.0553  | 220.84
ELMM     | 0.0171  | 38.30    | 0.0119  | 14.76    | 0.0278  | 27.08
GLMM     | 0.0016  | 48.53    | 0.0006  | 46.69    | 0.0019  | 86.33
DeepGUn  | 0.2355  | 259.61   | 0.0862  | 121.88   | 0.1094  | 209.64
VII-C Real data
We considered the Houston, Samson and the Jasper Ridge datasets for the simulations with real data. These datasets were captured by the AVIRIS instrument, and originally had 224 bands. The spectral bands corresponding to water absorption and low SNR regions were removed, resulting in 188 bands for the Houston image, 156 bands for the Samson image and 198 bands for the Jasper Ridge image. Previous studies indicate that the Houston HI has four predominant EMs [drumetz2016blindUnmixingELMM], while the Samson and Jasper Ridge HIs are known to have three and four EMs, respectively [Borsoi2017_multiscale].
The reconstructed abundance maps for all three datasets and all algorithms are shown in Figs. 4, 5 and 6. For the Houston dataset, the last row of Fig. 4 shows that the abundance maps provided by the DeepGUn method better evidence the strong vegetation and concrete abundances at the stadium field and stands, respectively, as well as the stronger asphalt abundances in the parking lot. For the Samson and Jasper Ridge images, a clear performance improvement can be seen for the DeepGUn algorithm. Note, for instance, the smaller confusion between the Water and Soil EMs in the Samson HI when compared to the other methods. Similarly, for the Jasper Ridge HI, the DeepGUn method leads to considerably stronger Water abundances in the region containing the river. Moreover, although the ELMM provided a better estimation of the road in the scene when compared to the remaining methods, it also resulted in a greater confusion between the Vegetation and Soil EMs, especially in the right part of the scene.
The quantitative results for all algorithms and datasets are shown in Table IV. Since the true abundance values (the ground truth) are not available for most real images, the reconstruction error has been used as a proxy for quality verification. As was the case for the synthetic data, the DeepGUn reconstruction errors are higher than those yielded by the other methods that address spectral variability. However, the reconstruction error is definitely not a good performance measure for abundance estimation in real images, which is the main objective of unmixing algorithms. The higher reconstruction errors of DeepGUn in this case are due to the fact that DeepGUn has much fewer degrees of freedom than the ELMM, PLMM and GLMM algorithms. In fact, DeepGUn has only a few degrees of freedom per pixel, which is comparable to the FCLS and much smaller than the ELMM, GLMM and PLMM methods. Although this means that the ELMM, GLMM and PLMM can achieve arbitrarily small reconstruction errors, this is not necessarily reflected in good abundance estimation results. The execution times of the proposed DeepGUn method were again comparable to those of the other algorithms addressing spectral variability, which indicates that it scales well with larger image sizes.
VIII Conclusions
In this paper, a deep generative EM model was proposed to address spectral variability in SU of HIs. Instead of relying on user-defined parametric EM models, which have been shown to be very hard to estimate in practical scenes, the proposed methodology leveraged the generalization capability of deep neural networks to accurately model EM spectra while still maintaining a strong connection to the physical mixing process. A deep generative model for each EM was trained prior to unmixing using pure pixel information extracted directly from the observed HI, which allowed for an unsupervised formulation. The proposed EM model was then applied to solve the SU problem, which was cast as the estimation of the low-dimensional representations of the EMs in the latent space of the deep generative models and their corresponding fractional abundances, for each pixel. The resulting DeepGUn algorithm presented excellent performance despite the simple strategy used for selecting the training data for learning the generative model. Simulations using synthetic and real data indicate that the proposed method can lead to significant improvements in abundance estimation accuracy.
References
Ricardo Augusto Borsoi (S’18) received the MSc degree in electrical engineering from Federal University of Santa Catarina (UFSC), Florianópolis, Brazil, in 2016. He is currently working towards his doctoral degree at Université Côte d’Azur (OCA) and at UFSC. His research interests include image processing, tensor decomposition, and hyperspectral image analysis. 
Tales Imbiriba (S’14, M’17) received his Doctorate degree from the Department of Electrical Engineering (DEE) of the Federal University of Santa Catarina (UFSC), Florianópolis, Brazil, in 2016. He served as a Postdoctoral Researcher at the DEE–UFSC and is currently a Postdoctoral Researcher at the ECE dept. of the Northeastern University, Boston, MA, USA. His research interests include audio and image processing, pattern recognition, kernel methods, adaptive filtering, and Bayesian Inference. 
José Carlos M. Bermudez (S’78, M’85, SM’02) received the B.E.E. degree from the Federal University of Rio de Janeiro (UFRJ), Rio de Janeiro, Brazil, the M.Sc. degree in electrical engineering from COPPE/UFRJ, and the Ph.D. degree in electrical engineering from Concordia University, Montreal, Canada, in 1978, 1981, and 1985, respectively. He joined the Department of Electrical Engineering, Federal University of Santa Catarina (UFSC), Florianópolis, Brazil, in 1985. He is currently a Professor of Electrical Engineering at UFSC and a Professor at the Catholic University of Pelotas (UCPel), Pelotas, Brazil. He has held the position of Visiting Researcher several times for periods of one month at the Institut National Polytechnique de Toulouse, France, and at Université Nice Sophia-Antipolis, France. He spent sabbatical years at the Department of Electrical Engineering and Computer Science, University of California, Irvine (UCI), USA, in 1994, and at the Institut National Polytechnique de Toulouse, France, in 2012. His recent research interests are in statistical signal processing, including linear and nonlinear adaptive filtering, image processing, hyperspectral image processing and machine learning. Prof. Bermudez served as an Associate Editor of the IEEE TRANSACTIONS ON SIGNAL PROCESSING in the area of adaptive filtering from 1994 to 1996 and from 1999 to 2001. He also served as an Associate Editor of the EURASIP Journal of Advances on Signal Processing from 2006 to 2010, and as a Senior Area Editor of the IEEE TRANSACTIONS ON SIGNAL PROCESSING from 2015 to 2019. He is the Chair of the Signal Processing Theory and Methods Technical Committee of the IEEE Signal Processing Society. Prof. Bermudez is a Senior Member of the IEEE.