Dimensionality Reduction via Regression in Hyperspectral Imagery

Dimensionality Reduction via Regression in Hyperspectral Imagery

Valero Laparra, Jesús Malo and Gustau Camps-Valls Copyright (c) 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other users, including reprinting/ republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works. DOI: 10.1109/JSTSP.2015.2417833.Image Processing Laboratory (IPL), Universitat de València, Catedrático A. Escardino - 46980 Paterna, València (Spain). E-mail: {valero.laparra, jesus.malo, gcamps}@uv.esThis work was partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under project TIN2012-38102-C03-01, and under a EUMETSAT contract.

This paper introduces a new unsupervised method for dimensionality reduction via regression (DRR). The algorithm belongs to the family of invertible transforms that generalize Principal Component Analysis (PCA) by using curvilinear instead of linear features. DRR identifies the nonlinear features through multivariate regression to ensure the reduction in redundancy between the PCA coefficients, the reduction of the variance of the scores, and the reduction in the reconstruction error. More importantly, unlike other nonlinear dimensionality reduction methods, the invertibility, volume-preservation, and straightforward out-of-sample extension, makes DRR interpretable and easy to apply. The properties of DRR enable learning a more broader class of data manifolds than the recently proposed Non-linear Principal Components Analysis (NLPCA) and Principal Polynomial Analysis (PPA). We illustrate the performance of the representation in reducing the dimensionality of remote sensing data. In particular, we tackle two common problems: processing very high dimensional spectral information such as in hyperspectral image sounding data, and dealing with spatial-spectral image patches of multispectral images. Both settings pose collinearity and ill-determination problems. Evaluation of the expressive power of the features is assessed in terms of truncation error, estimating atmospheric variables, and surface land cover classification error. Results show that DRR outperforms linear PCA and recently proposed invertible extensions based on neural networks (NLPCA) and univariate regressions (PPA).

1 Introduction

In the last decades, the technological evolution of optical sensors has provided remote sensing analysts with rich spatial, spectral, and temporal information. In particular, the increase in spectral resolution of hyperspectral sensors in general, and of infrared sounders in particular, opens the doors to new application domains and poses new methodological challenges in data analysis. The distinct highly-resolved spectra offered by hyperspectral images (HSI) allow us to characterize land-cover classes with unprecedented accuracy. For instance, hyperspectral instruments such as NASA’s Airborne Visible Infra-Red Imaging Spectrometer (AVIRIS) covers the wavelength region from 0.4 to 2.5m using more than 200 spectral channels, at a nominal spectral resolution of 10 nm. The MetOp/IASI infrared sounder poses even more complex image processing problems, as it acquires more than 8000 channels per iFOV. Actually, such improvements in spectral resolution have called for advances in signal processing and exploitation algorithms capable of summarizing the information content in as few components as possible [1, 2, 3, 4].

In addition to its eventual high dimensionality, the complex interaction between radiation, atmosphere, and objects in the surface leads to irradiance manifolds which consist of non-aligned clusters that may change nonlinearly in different acquisition conditions [5, 6]. Fortunately, it has been shown that, given the spatial-spectral smoothness of the signal, the intrinsic dimensionality of the data is small, and this can be used both for efficient signal coding [7, 3], and for knowledge extraction from a reduced set of features [8, 9]. The high dimensionality problem is not only affecting the hyperspectral data: very often, multispectral data processing applications involve using spatial, multi-temporal or multi-angular features that are combined with the spectral features [10, 11]. In such cases, the representation space becomes more redundant and pose challenging problems of collinearity for the algorithms. In both cases, the key in coding, classification, and in bio-geo-physical parameter retrieval applications reduces to finding the appropriate set of features, that should be necessarily flexible and nonlinear.

In order to find these features, in recent years a number of feature extraction and dimensionality reduction methods have been presented. Most of them are based on nonlinear functions to allow describing data manifolds that exhibit nonlinear relations (see [12] for a comprehensive review). Approaches range from local methods [13, 14, 15, 16, 17], kernel-based and spectral decompositions [18, 19, 20, 9], neural networks [21, 22, 23], or projection pursuit formulations [24, 25]. Despite the theoretical advantages of nonlinear methods, the fact is that classical principal component analysis (PCA) [26] is still the most widely used dimensionality reduction technique in real remote sensing applications [27, 28, 3, 29]. This is mainly because PCA has different properties that make it useful in real examples: it is easy to apply since it involves solving a linear and convex problem, and it has a straightforward out-of-sample extension. Moreover, the PCA transformation is invertible and, as a result, the features extracted can be easily interpreted.

The new dimensionality reduction algorithms that involve nonlinearities rarely fulfill the above properties. Nonlinear models usually have complex formulations, which introduce a number of non-intuitive free parameters. Tuning these parameters implies strong assumptions about the manifold characteristics (e.g. local Gaussianity or special symmetries), or a high computational cost training. This complexity reduces the applicability of nonlinear feature extraction to specific data, i.e. the performance of these methods do not significantly improve that of PCA on many remote sensing problems [3, 27, 9]. Moreover, these methods have problems to obtain out-of-sample predictions, which is mandatory in most of the real applications. Another critical point is that the transform involved by the nonlinear models is hard to interpret. This problem could be alleviated if the methods were invertible since then one could get the data back to the input domain (where units are meaningful) and understand the results therein. Invertibility allows to characterize the transformed domain, and to evaluate its quality. However, invertibility is scarcely achieved in the manifold learning literature. For instance, spectral and kernel methods involve implicit mappings between the original and the curvilinear coordinates, but these implicit features are not easily invertible nor interpretable [30].

The desirable properties of PCA are straightforward in methods that find projections onto explicit features in the input domain. These explicit features may be either straight lines or curves. This family of projection methods may be understood as a generalization of linear transforms extending linear components to curvilinear components. This family ranges between two extreme cases: (1) rigid approaches where features are straight lines in the input space (e.g. conventional PCA, Independent Components Analysis -ICA- [31]), and (2) flexible non-parametric techniques that closely follow the data, such as Self-Organizing Maps (SOM) [32], or the related Sequential Principal Curves Analysis (SPCA) [6, 33]. This family is discussed in Section 2 below. Both extreme cases are undesirable because of different reasons: limited performance (in too rigid methods), and complex tuning of free parameters and/or unaffordable computational cost (in too flexible methods). In this projection-onto-explicit-features context, autoencoders such as Nonlinear-PCA (NLPCA) [23], and approaches based on fitting functional curves, such as Principal Polynomial Analysis (PPA) [34, 35], represent convenient intermediate points between the extreme cases in the family. Note that these methods have shown better performance than PCA on a variety of real data [35, 36]. Actually, in the case of PPA, it is theoretically ensured to obtain better results than PCA. The method proposed here, Dimensionality Reduction via Regression (DRR), represents a qualitative step towards the flexible end in this family because of the multivariate nature of the regression (as opposed to the univariate regressions done in PPA) while keeping the convenient properties of PPA and PCA which make it suitable for practical high dimensional problems (as opposed to SPCA and SOM). Therefore, it extends the applicability of PPA to more general manifolds, such as those encountered in remote sensing data analysis.

Following the taxonomy in [35] these three methods (NLPCA, PPA and DRR) could be included in the Principal Curves Analysis framework [37]. This framework includes both parametric (fitting analytic curves) [26, 38, 39], and non-parametric  [40, 41, 42, 6, 33] methods. NLPCA, PPA and DRR exploit the idea behind this framework to define generalizations of PCA of controlled flexibility.

Preliminary results of DRR were presented in [43]. Here we extend the theoretical analysis of the method and the experimental confirmation of the performance in hyperspectral images. The remainder of the paper is organized as follows. Section 2 reviews the properties and shortcomings of the projection-onto-explicit-features family pointing out the qualitative advantages of the proposed DRR. Section 3 introduces the mathematical details of DRR. We describe the DRR transform and the key differences with PPA. We derive an explicit expression for the inverse and we prove the volume preservation property of DRR. The theoretical properties of DRR are demonstrated and illustrated in controlled toy examples of different complexity. In Section 4, we address two important high dimensional problems in remote sensing: the estimation of atmospheric state vectors from Infrared Atmospheric Sounding Interferometer (IASI) hyperspectral sounding data, and the dimensionality reduction and classification of spatio-spectral Landsat image patches. In the experiments, DRR is compared with conventional PCA [26], and with recent fast nonlinear generalizations that belong to the same class of invertible transforms, PPA [34, 35] and NLPCA [23]. Comparisons are made both in terms of reconstruction error and of expressive power of the extracted features. We end the paper with some concluding remarks in Section 5.

2 From rigid to flexible features

Here we illustrate how DRR represents a step forward with regard to NLPCA and PPA in the family of projections onto explicit curvilinear features ranging from rigid to flexible extremes. First, we review the basic details of previous projection methods, and then we illustrate the advantages of the proposed method in an easy to visualize example.

2.1 Representative projections onto lines and curves

Classical techniques such as PCA [26] or ICA [31] represent the rigid extreme of the family, where, zero-mean samples are projected onto rectilinear features through the projection matrix, :

where are the Principal Components (PC scores for PCA) or the Independent Components (for ICA), and the linear features in the input space are the column vectors (straight directions) in . These rigid techniques use a single set of global features regardless of the input.

On the contrary, flexible techniques adapt the set of features to the local properties of the input. Examples include SOM [32] where a flexible grid is adapted to the data and samples can be represented by projections onto the local axes defined by the edges of the parallelepiped corresponding to the closest node. Similarly, local-PCA [44] and local-ICA [45] project the data onto local axes corresponding to the closest code vector. More generally, local-to-global methods integrate these local-linear representations into a single global curvilinear representation [46]. In particular, using the fact that local eigenvectors are tangent to first and secondary principal curves [47], Sequential Principal Curves Analysis (SPCA) [6, 33] integrates local PCAs, , along a sequence of principal curves to get a curvilinear representation

where the local metric, , sets the line element along the curves. SPCA is inverted by taking the lengths, , along the sequence of principal curves drawn from the origin, . Similarly to SOM, SPCA assumes a grid of curves adapted to the data. However, as opposed to SOM, SPCA does not learn the whole grid, but only segments of principal curves per sample.

The above methods identify explicit curves/features that follow the data, but they are hard to train (e.g. parameters to control their flexibility depend on the problem) and require many samples to be reliable, which make them hard to use in high-dimensional scenarios. Other methods have been proposed to generalize the rigid representations by considering curvilinear features instead of straight lines [26]. For instance, in NLPCA [21, 23] an invertible internal representation is computed through a two stage neural network,

where the matrices represent sets of linear receptive fields, and is a set of fixed point-wise nonlinearities. The inverse of this autoencoder [22] can be used to make the curvilinear coordinates explicit.

Fitting general parametric curves in , as done in [38, 39], is difficult because of the unconstrained nature of the problem [26, 35]. Alternatively, PPA [35] follows a deflationary sequence in which a single polynomial depending on a straight line (univariate fit) is computed at a time. Specifically, the -th stage of PPA accounts for the -th curvilinear dimension by using two elements: (1) one-dimensional projection onto the leading vector , and (2) polynomial prediction of the average at the orthogonal subspace,


where the polynomial prediction, , is removed from the data in the orthogonal subspace. Superindices in the above formula represent the stage. As a result, data at the -th stage is represented by and by the -dimensional residual that cannot be predicted from that projection. Prediction using this univariate polynomial is a way to remove possible nonlinear dependencies between the linear subspace of and its orthogonal complement. Despite its convenience, the univariate nature of the fits restricts the kind of dependencies that can be taken into account since more information about the orthogonal subspace (better predictions) could be obtained if more dimensions were used in the prediction. Moreover, using a single parameter, , to build the -th polynomial implies that the -th curvilinear feature has the same shape along the -th curve.

DRR addresses these limitations by using multivariate instead of univariate regressions in the nonlinear predictions. As a result, DRR improves energy compaction and extends the applicability of PPA to more general manifolds while keeping its simplicity, which make it suitable in high dimensional problems (as opposed to SPCA and SOM).

2.2 Qualitative advantages of DRR

The advantages of DRR are illustrated in Fig. 1 where we compare representative invertible representations of this family on two curved and noisy manifolds of the class introduced by Delicado [47] (in red and blue). This class of manifolds, originally presented to illustrate the concept of secondary principal curves [47], is convenient since one can easily control the complexity of the problem by introducing tilt (non-stationarity) on the secondary principal curves (dark color) along the first principal curve (light color). This controlled complexity is useful to point out the limitations of previous techniques (e.g. required symmetry in the manifold) and how these limitations are alleviated by using the (more general) DRR model. The performance is compared in the input domain through the dimensionality reduction error and through the accuracy of the identified curvilinear features. These manifolds come from a two-dimensional space of latent variables (positions along the first and secondary curves). As a result, the dimensionality reduction error depends on the unfolding ability of the forward transform: the closer the transformed data fit a flat rectangle, the smaller the error when truncating the representation. On the other hand, the identified features depend on how the inverse transform bends a cartesian grid in the latent space: the better the model represents the curvature of data, the bigger the fidelity of the identified features.

Let us start by considering the performance on the easy case: manifold in red with no tilt along the second principal curve. The previously reported techniques perform as expected: on the one hand, progressively more flexible techniques (from PCA to SPCA) reduce the distortion after dimensionality reduction (in MSE terms) because they better unfold test data. As a result, removing the third dimension in the rigid-to-flexible family progressively introduces less error. On the other hand, the identified features in the input domain are progressively more similar to the actual curvilinear latent variables when going from the rigid to the flexible extremes. In this specific easy example the proposed DRR outperforms even the flexible SPCA in MSE terms. Moreover, since this particular manifold may not require increased flexibility (and hence may be better suited to the PPA model), PPA slightly outperforms DRR in MSE terms.

Manifold 1 Manifold 2
(easy: no tilt) (difficult: tilt)
Input Domain
Rigid Extreme Flexible Extreme
Identified Features
MSE         100 48 2 23.1 0.6 3.3 0.3 17 3
MSE         100 66.4 0.7 42.1 0.4 49 1 12 1
Identified Features
MSE       269 4 80 40 97 2 26.5 0.3 19 3
MSE       209 2 85 30 85.6 0.4 69 2 15 4
Figure 1: Performance of the family of invertible representations on illustrative manifolds of different complexity. Complexity of the considered manifolds (top panel) depends on the tilt in secondary principal curves along the first principal curve [47]. Sample data are shown together with the first and secondary principal curves generated by the latent variables (angle and radius) in the input domain. Results of the different techniques for the considered manifolds are depicted in the same color as the input data (red for the no-tilt manifold, and blue for the tilted manifold). Previously reported representations range from rigid schemes such as PCA [26] to flexible schemes such as SPCA [6, 33], including practical nonlinear generalizations of PCA such as NLPCA [23] and PPA [35] which are examples of intermediate flexibility between the extreme cases. Performance is compared in terms of reconstruction error when removing the third dimension (dimensionality reduction MSE numbers are relative to the PCA error in the easy case), and in terms of the mean squared distance between the identified and the actual curvilinear features (MSE numbers are relative to the PCA error in the easy case). MSE is related to the unfolding ability of the model (see the Transform rows), and MSE is related to its ability to get appropriate curvilinear features from an assumed latent cartesian grid (see the Identified Features rows). We used training samples and optimal settings in all methods. Dimensionality reduction was tested on the highlighted curvilinear grid sampled from the true latent variables. The features in the input space were identified by inverting a 2- cartesian grid in the transform domain. This (assumed) latent grid was scaled in every representation to minimize MSE. Standard deviations in errors come from models trained on 10 different data set realizations.

Results for the more complex manifold (tilted secondary curves, in blue) provide more insight into the techniques since it pushes them (specifically PPA) to their limits. Firstly, note that the increase in complexity is illustrated by an increase in the errors in all methods. For instance, linear PCA, that identifies the same features in both cases, doubles the normalized MSEs. While the reduction in performance is not that relevant in SPCA (remember these flexible techniques cannot be applied in high dimensional scenarios), this twisted manifold certainly challenges fast generalizations of PCA: the MSEs dramatically increase for NLPCA and PPA. Even though NLPCA identifies certain tilt in the secondary feature along the first curve, it seems too rigid to follow the data structure. PPA displays a different problem: as stated above, by construction, the -th curvilinear feature in PPA cannot handle relations with the -th curve beyond the prediction of the mean. This is because the data in all orthogonal subspaces along the -th curve collapse, and are described by a single curve depending on a single parameter (univariate regression). This leads to using the same -th curve all along the -th feature (note the repeated secondary curves along the first curve in both, red and blue, cases). This is good enough when data manifolds have the required symmetry (PPA performance is over NLPCA in the first case), but leads to dramatic errors when the method have to deal with relations between three or more variables, as for the manifold in blue, where PPA performance is below NLPCA. This latter effect frequently appears in hyperspectral images, as different (non-stationary) nonlinear relations between spectral channels occur for different objects [48, 49, 3].

Finally, note that DRR clearly improves PPA in the challenging example in blue, mainly because it uses multiple dimensions (instead of a single one) to predict each lower variance dimension in the data. As a result, it can handle non-stationarity along the principal curves leading to better unfolding (lower MSE) and tilted secondary features (lower MSE). This removes the symmetry requirement in PPA and broadens the class of manifolds suited to DRR.

3 Dimensionality Reduction via Regression

PCA removes the second order dependencies between the signal components, i.e. PCA scores are decorrelated [26]. Equivalently, PCA can be casted as the linear transform that minimizes reconstruction error when a fixed number of features are neglected. However, for general non-Gaussian sources, and in particular for hyperspectral images, PCA scores still display significant statistical relations, see [3][ch. 2]. The scheme presented here tries to nonlinearly remove the information still shared by different PCA components.

3.1 DRR scheme

It is well known that, even though PCA leads to a domain with decorrelated dimensions, complete independence (or non redundant coefficients) is guaranteed only if the signal has a Gaussian probability density function (PDF). Here, we propose a scheme to remove this redundancy (or uninformative data). The idea is simple: just predict the redundant information in each coefficient that can be extracted from the others. Only the non-predictable information (the residual prediction error) should be retained for data representation. Specifically, we start from the linear PCA representation outlined above: . Then, we propose to predict each coefficient, , through a multivariate regression function, , that takes the higher variance components as inputs for prediction. The non-predictable information is:


and this residual, , is the -th dimension of the DRR domain. This prediction+substraction is applied times, , where is the dimension of the input. As a result, the DRR representation of each input vector , is:

3.2 Properties of DRR

DRR generalizes PCA

In the particular case of using linear regressions in , i.e. linear-DRR, the prediction would be equal to zero. Note that this is the result when using classical (least squares) solution since is uncorrelated with each . Therefore , and then , i.e. linear-DRR reduces to PCA.

As a result, if the employed nonlinear functions generalize the linear functions, DRR will obtain at least as good results as PCA. The flexibility of these functions with regard to the linear case will reduce the variance of the residuals, and hence the reconstruction error in dimensionality reduction.

DRR is invertible

Given the DRR transformed vector, , and knowing the functions of model , the inverse is straightforward since it reduces to sequentially undo the forward transformation: we first predict coefficient -th from previous (known) coefficients using the known regression function, and then, we use the known residual to correct the prediction:

DRR has an easy out-of-sample extension

Note that forward and inverse DRR transforms can be applied to new data (not in the training set) since there is no restriction to apply the prediction functions . See Sec. 3.3 for a discussion on the selected regression functions in this work.

DRR is a volume preserving transform

A nonlinear transform preserves the volume of the input space if the determinant of its Jacobian is one for all [50]. Here we prove that the nature of DRR ensures that its Jacobian fulfills this property.

DDR can be thought of as a sequential algorithm in which only one dimension is addressed at a time through the elementary transform consisting of prediction and substraction (Eq. (2)). Yet mathematically convenient to formulate the Jacobian, this sequential view is does not prevent the parallelization discussed later. Hence, the (global) Jacobian of DRR, , is the product of the Jacobians of the elementary transforms in this sequence times the matrix of the initial PCA as follows:

The -th elementary transform leaves all, but the -th dimension, unaltered. Therefore, each elementary Jacobian is the identity matrix except for the -th row, which depends on through the derivatives of the -th regression function with regard to each component in the previous stage:

Whatever these derivatives are (whatever regression function is selected), the determinant of such a simple matrix is always one at every point . Therefore, the determinant of the global Jacobian is guaranteed to be one including the PCA transform, , which is orthonormal.

Parallelization of DRR

Multivariate regression in DRR is a qualitative advantage over other methods (as discussed in Section 2). However it is computationally expensive. Fortunately, the proposed DRR allows trivial parallel implementation of the forward transform. Note that the prediction of each component is actually done from a subset of the original PC scores. Therefore, all the prediction functions, , can be applied at the same time after the initial PCA step, and sequential implementation is not necessary. This is an obvious computational advantage over PPA, which necessarily requires a sequential implementation, but it represents a qualitative advantage too, since in PPA each feature depends on the previous nonlinear features (see Eq. 1), while in DRR nonlinear regressions only depend on linear features, but not on previous curvilinear coefficients. As opposed to the forward transform, the inverse is not parallelizable since, in order to predict the -th PCA coefficient, we need the previous linear PCs, which implies operating sequentially from .

3.3 Selecting the class of regression functions

In practice, the prediction functions reduce to training a set of nonlinear regression models. In our experiments, we used the kernel ridge regression (KRR) [51] to implement the prediction functions , although any alternative regression method could be also applied. Notationally, given data points, the prediction for all of them is estimated as:

where is a kernel (similarity) function reproducing a dot product in Hilbert space, , is the matrix containing all the training samples in rows, is the -th column of to be estimated, denotes a submatrix containing columns of used as input data to fit the prediction model, and represents the feature vector in row of . In this prediction function, is the dual weight vector obtained by solving the least squares linear regression problem in Hilbert space:

where is the kernel matrix with entries , being . Two parameters must be tuned for the regression: the regularization parameter and the kernel parameters. In our experiments we used the standard squared exponential kernel function, , as it is a universal kernel which involves estimating only the length-scale . Both and can be estimated by standard cross-validation.

KRR can be quite convenient in the DRR scheme because it implements flexible nonlinear regression functions, and reduces to solving a matrix inversion problem with unique solution. KRR offers a moderate training and testing computational cost111While naive implementations scale as for training, recent sparse and low-rank approximations [52, 53] along with divide-and-conquer schemes [54] can make KRR very efficient., includes regularization in a natural way, and also offers the possibility to generate multi-output nonlinear regression. The latter is an important feature to extend the DRR scheme to multiple outputs approximation. Finally, KRR has been successfully used in many real applications [55, 51] including remote sensing data analysis involving hyperspectral data [27]. However, it should be noted that, even in such cases, a previous feature extraction was mandatory to attain significant results [56, 57, 27, 53].

4 Experimental results

In this section, we give experimental evidence of the performance of the proposed algorithm in two illustrative settings. First, we show results on the truncation error in a multispectral image classification problem including spatial context. Then we evaluate the performance of DRR in terms of both the reconstruction error and the expressive power of the features to perform multi-output regression of a challenging problem involving hyperspectral infrared sounding data222Reproduction of the experiments in this work is possible using the generic DRR toolbox at, http://isp.uv.es/drr.html.

Focusing in these two experiments is not arbitrary. The two applications imply challenging high dimensional data: (1) multispectral image classification in which contextual information is stacked to the spectral information highly increases the dimensionality, and (2) hyperspectral infrared sounding data used to estimate atmospheric state vectors is densely sampled. In both cases the input space may become redundant because of the collinearity introduced either by the (locally stationary) spatial features or by the spectral continuity of natural sources. In these experiments, in which , we compare DRR with members of the invertible projection family described in Section 2 suited to high dimensional scenarios. This implies focusing on PCA, NLPCA and PPA, excluding SPCA and SOM because of their prohibitive cost.

Figure 2: Reconstruction error results on the contextual multispectral image classification. Comparison between PCA, PPA, NLPCA and DRR for different number of extracted features, in both mean absolute reconstruction error (MAE) (left) and relative MAE with respect to PCA error (right), for which going below the PCA means better results (less error).
Figure 3: Classification results on the contextual multispectral image classification. Comparison between PCA, PPA, NLPCA and DRR for different number of extracted features, in both classification error (left) and relative classification error with respect to PCA accuracy (right), for which going below the PCA means better results (less error).

4.1 Experiment 1: Multispectral image classification

For our first set of experiments, we considered a Landsat MSS image consisting of pixels with a spatial resolution of 80m 80m (all data acquired from a rectangular area approximately km wide) 333Image available at http://www.ics.uci.edu/ mlearn/MLRepository.html. Six classes are identified in the image, namely red soil, cotton crop, grey soil, damp grey soil, soil with vegetation stubble and very damp grey soil. A total of labeled samples are available. Contextual information was included stacking neighboring pixels in 33 windows. Therefore, -dimensional input samples were generated, with a high degree of redundancy and collinearity. We address two problems with this dataset: a pure spatio-spectral dimensionality reduction problem, and the effect of the reduced dimension in image classification.

4.1.1 Reconstruction accuracy

In the first problem, we compare the dimensionality reduction performance in terms of Mean Absolute Error (MAE) in the original domain. Note that this kind of evaluation can be used only with invertible methods. For each method, the data are transformed and then inverted using less dimensions. This is equivalent to truncate dimensions in PCA. In order to illustrate the advantage of using a given method instead of PCA, results are shown in percentage with regard to the PCA performance: where and refer to the MAE obtained with the considered method and PCA, respectively.

Figure 2 shows the results of the experiment. We divided the available labeled data into two sets (training and test) with equal number of samples. The samples of each set have been randomly selected from the original image dataset. The MAE of reconstruction in the test set averaged over ten independent realizations is shown. Several conclusions can be obtained: Specifically, NLPCA obtains good results when a few number of extracted features are obtained, but rapidly degrades its performance with more than 10 extracted features, revealing a clear inability to handle high-dimensional problems. Note that the available implementation of NLPCA444http://www.nlpca.org/ is restricted to extract at most features. For a given number of extracted features, the reconstruction error increases substantially with regard to PCA (Fig. 2 right). PPA shows better results than NLPCA, and it is better suited than PCA in all the number of extracted features. Nevertheless, it is noticeable that DRR is in all cases better than all the other methods, revealing a maximum gain of +25% over PCA for very few features.

4.1.2 Classification accuracy

The second problem with this dataset shows the classification results using the inverted data into the original input space of the different methods. We used the standard linear discriminant analysis on top of the inverted data44footnotetext: While other more sophisticated nonlinear classifiers could be used here, we are actually interested in this setting that allows us to study the expressive power of the extracted features. An homologous setting will be also used in the regression experiments of next subsection.. In all cases, we used 3200 randomly selected examples for training and the same amount for testing. Test results are averaged over five realizations, and are shown in Fig. 3. The performance results indicate similar trends observed in the reconstruction error in Fig. 2. Essentially, DRR outperforms the other methods, especially noticeable when a few number of components are used for reconstruction and classification. As the number of components increase, DRR and PPA show similar results. These results suggest that DRR better compacts the information in a lower number of components, which is useful for both reconstruction and data classification.

4.1.3 Computational load

Table 1 shows the computation cost for all considered methods for training and testing555Experiments were performed using Matlab on an Intel 3.3 GHz processor with 48 GB RAM memory. No parallelization was applied on DRR in this experiment.. The experiments used training and test samples, with . Two main conclusions can be extracted: NLPCA is the most computationally costly algorithm for training and DRR for testing.

Training time (sec) 0.05 0.6 7944 1920
Testing time (sec) 0.007 0.16 0.05 35
Table 1: Computational cost Landsat Dataset

4.2 Experiment 2: Regression from infrared sounding data

We here analyze the benefits of using DRR for the estimation of atmospheric parameters from hyperspectral infrared sounding data with a reduced dimensionality. We first motivate the problem, and then describe the considered dataset. Again, we are interested in analyzing the impact of the reduced dimensionality both in the reconstruction error and in a different task, in this case, the retrieval of geophysical parameters.

Temperature and water vapor are atmospheric parameters of high importance for weather forecast and atmospheric chemistry studies [58, 59]. Observations from spaceborne high spectral resolution infrared sounding instruments can be used to calculate the profiles of such atmospheric parameters with unprecedented accuracy and vertical resolution [60]. In this work we focus on the data coming from the Infrared Atmospheric Sounding Interferometer (IASI), the Microwave Humidity Sensor (MHS) and the Advanced Microwave Sensor Unit (AMSU) onboard of the MetOp-A satellite666https://directory.eoportal.org/web/eoportal/satellite-missions/m/metop. The IASI instrument is the one that poses the major dimensionality challenge due to its dense spectrum sampling: while MHS and AMSU spectra consist of about twenty values together, IASI spectra consist of spectral channels, between and  m, with a spectral resolution of  cm after apodization [61, 62]. Its spatial resolution is  km at nadir with an Instantaneous Field of View (IFOV) size of  km at an altitude of  km. This huge data dimensionality typically requires simple and computationally efficient processing techniques.

One of the retrieval techniques available in the MetOp-IASI Level 2 Product Processing Facility (L2 PPF) is a computationally inexpensive method based on linear regression from the principal components of the measured brightness spectra and the atmospheric state parameters. We aim to introduce DRR in such scheme as an alternative to PCA. In this application it is important that dimensionality reduction minimizes the reconstruction error and that the identified features are useful in the retrieval stage.

Figure 4: Surface temperature [in K] world map provided by the official ECMWF model, http://www.ecmwf.int/.

We used a collection of datasets of input data from the different sensors: IASI, MHS and AMSU. The considered output atmospheric variables are diverse, e.g. temperature, moisture, and surface pressure. In each dataset provided by EUMETSAT, the preprocessed input data were -dimensional. Each input vector consisted of the following: one scalar indicating the secant of satellite zenith angle, radiance values from the AMSU and MHS sensors, and values from the IASI sensor. The data from IASI were actually three separate sets of PC scores each, from three different IASI bands. Note that, despite intra-band decorrelation, the vector elements may still exhibit statistical dependency, which may be significant even at a second order level, among different bands and instruments.

The data to be predicted (or output data) is -dimensional. Each output vector consists of the following: data corresponding to the surface temperature and moisture, the skin temperature, and the surface pressure; and data corresponding to altitude profiles of temperature, moisture, and ozone at model levels each. An example of surface temperature is shown in Fig. 4. Data was provided by the official European Center for Medium-range Weather Forecasting (ECMWF) model, http://www.ecmwf.int/, on March 4th, 2008.

Figure 5: Reconstruction error. Left: Absolute reconstruction error for different number of retained features obtained when using different DR methods on the first (just one) dataset. Right: Relative error (percentage) with regard to the error in PCA, mean and standard deviation have been obtained over the (all) datasets.

4.2.1 Reconstruction accuracy

In this experiment, we study the representation power of a small number of features extracted by DRR. The 110 input features are processed with PCA [26], PPA [34, 35], NLPCA [21, 23] and the presented DRR method. Here, the quality of the transformation is evaluated solely with the mean absolute error (MAE) in the input space between the original signal and the reconstructed with the most relevant coefficients retained. Figure 5 illustrates the effect of reconstructing the input data when using PCA, PPA, NLPCA and DRR for different numbers of components. On the one hand, as reported in [35], the performance in PPA is similar or better than in NLPCA in reconstruction error. On the other hand, it is important to note that results in absolute and relative terms show that DRR clearly obtains less reconstruction error than PCA and PPA for an arbitrary number of features.

4.2.2 Retrieval accuracy

Figure 6 illustrates the effect of using the features either from PCA, PPA or DRR for the retrieval of the physical parameters described before. We used linear regression in the features-to-parameters estimation. We plotted the mean absolute error (MAE) for different number of features. These plots show the effect of using different (linear and non-linear) dimensionality reduction methods for retrieval. Figure 6 shows the results for the first dataset for illustration purposes (similar results were obtained for the remainder datasets). Note that using DRR features to estimate the features has clear benefits. For instance, using just the of the DRR features obtains the same accuracy as PCA when using all the components.

Figure 6: Retrieval performance. Accuracy of the parameter retrieval (MAE) with regard to the number of retained features. Results are given for different feature extraction (PCA, PPA, NLPCA, DRR) methods. Left: Absolute MAE for the first dataset. Right: Relative (to the PCA MAE in each dimension) results. Results for the remainder are similar.

4.2.3 Computational load

Times for training and testing are shown in Table 2 (same computer resources as before). In this experiment, we took training and test samples, and . As in the previous experiment, NLPCA and DRR are the most expensive in training and test, respectively. In this experiment, however, times for DRR are notably higher due to the increase in dimensionality but mostly to the bigger training set.

Training time (sec) 0.13 16 65389 14424
Testing time (sec) 0.01 0.3 1.3 1112
Table 2: Computational cost IASI Dataset

5 Conclusions

We introduced a novel unsupervised method for dimensionality reduction via the application of a multivariate nonlinear regression to approximate each projection from the higher variance scores. The method is shown to generalize PCA and to achieve more data compression (smaller MSE for a fixed number of retained components) and better features for prediction (less error in classification and regression problems) than competitive nonlinear methods like NLPCA and PPA. Besides, unlike other nonlinear dimensionality reduction methods, DRR is easy to apply, it has out-of-sample extension, it is invertible, and the learned transformation is volume-preserving. We focused on the challenging problems of spatial-spectral multispectral land cover classification, and atmospheric parameter retrieval from hyperspectral infrared sounding data. Extension of DRR to cope with multiset/output regression, as well as impact of the data dimensionality and noise sources, will be explored in the future.

6 Acknowledgments

The authors wish to thank Tim Hultberg from the European Organisation for the Exploitation of Meteorological Satellites (EUMETSAT) in Darmstadt, Germany, for kindly providing the IASI datasets used in this paper.


  • [1] G. Camps-Valls and L. Bruzzone, “Kernel-based methods for hyperspectral image classification,” Geoscience and Remote Sensing, IEEE Transactions on, vol. 43, no. 6, pp. 1351–1362, June 2005.
  • [2] A. Plaza, J. A. Benediktsson, J. Boardman, J. Brazile, L. Bruzzone, G. Camps-Valls, J. Chanussot, M. Fauvel, P. Gamba, A. Gualtieri, and J. Tilton, “Recent advances in techniques for hyperspectral image processing,” Remote Sensing of Environment, vol. 113, no. S1, pp. 110–122, Sept 2009.
  • [3] G. Camps-Valls, D. Tuia, L. Gómez, S. Jiménez, and J. Malo, Remote Sensing Image Processing, ser. Synthesis Lectures on Image, Video and Multimedia Processing, A. Bovik, Ed.   Morgan & Claypool Publishers, 2011.
  • [4] G. Camps-Valls, D. Tuia, L. Bruzzone, and J. Atli Benediktsson, “Advances in hyperspectral image classification: Earth monitoring with statistical learning methods,” Signal Processing Magazine, IEEE, vol. 31, no. 1, pp. 45–54, Jan 2014.
  • [5] D. Tuia, J. Muñoz Marí, L. Gómez-Chova, and J. Malo, “Graph matching for adaptation in remote sensing,” Geoscience and Remote Sensing, IEEE Transactions on, vol. 51, no. 1, pp. 329–341, Jan 2013.
  • [6] V. Laparra, S. Jiménez, G. Camps-Valls, and J. Malo, “Nonlinearities and adaptation of color vision from sequential principal curves analysis,” Neural Comp., vol. 24, no. 10, pp. 2751–2788, 2012.
  • [7] B. Penna, T. Tillo, E. Magli, and G. Olmo, “Transform coding techniques for lossy hyperspectral data compression,” IEEE Trans. Geosci. Rem. Sens., vol. 45, no. 5, pp. 1408–1421, 2007.
  • [8] S. Jiménez and J. Malo, “The role of spatial information in disentangling the irradiance-reflectance-transmittance ambiguity,” IEEE Trans. Geosci. Rem. Sens., vol. 52, no. 8, pp. 4881–4894, 2014.
  • [9] J. Arenas-García, K. Petersen, G. Camps-Valls, and L. Hansen, “Kernel multivariate analysis framework for supervised subspace learning: A tutorial on linear and kernel multivariate methods,” Signal Processing Magazine, IEEE, vol. 30, no. 4, pp. 16–29, 2013.
  • [10] M. Fauvel, J. A. Benediktsson, J. Chanussot, and J. R. Sveinsson, “Spectral and spatial classification of hyperspectral data using SVMs and morphological profiles,” IEEE Trans. Geosci. Remote Sens., vol. 46, no. 11, pp. 3804 – 3814, 2008.
  • [11] D. Tuia, F. Pacifici, M. Kanevski, and W. Emery, “Classification of very high spatial resolution imagery using mathematical morphology and support vector machines,” IEEE Trans. Geosci. Remote Sens., vol. 47, no. 11, pp. 3866–3879, 2009.
  • [12] J. A. Lee and M. Verleysen, Nonlinear dimensionality reduction.   Springer, 2007.
  • [13] J. B. Tenenbaum, V. Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, December 2000.
  • [14] S. T. Roweis, L. K. Saul, and G. E. Hinton, “Global coordination of local linear models,” in Advances in Neural Information Processing Systems 14.   MIT Press, 2002, pp. 889–896.
  • [15] J. J. Verbeek, N. Vlassis, and B. Krose, “Coordinating principal component analyzers,” in In Proc. International Conference on Artificial Neural Networks.   Springer, 2002, pp. 914–919.
  • [16] Y. W. Teh and S. Roweis, “Automatic alignment of local representations,” in NIPS 15.   MIT Press, 2003, pp. 841–848.
  • [17] M. Brand, “Charting a manifold,” in NIPS 15.   MIT Press, 2003, pp. 961–968.
  • [18] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, December 2000.
  • [19] B. Schölkopf, A. J. Smola, and K.-R. Müller, “Nonlinear component analysis as a kernel eigenvalue problem,” Neural Comp., vol. 10, no. 5, pp. 1299–1319, 1998.
  • [20] K. Q. Weinberger and L. K. Saul, “Unsupervised learning of image manifolds by semidefinite programming,” in Proc. IEEE CVPR, 2004, pp. 988–995.
  • [21] M. A. Kramer, “Nonlinear principal component analysis using autoassociative neural networks,” AIChE Journal, vol. 37, no. 2, pp. 233–243, 1991.
  • [22] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, July 2006.
  • [23] M. Scholz, M. Fraunholz, and J. Selbig, Nonlinear principal component analysis: neural networks models and applications.   Springer, 2007, ch. 2, pp. 44–67.
  • [24] P. Huber, “Projection pursuit,” Annals of Statistics, vol. 13, no. 2, pp. 435–475, 1985.
  • [25] V. Laparra, G. Camps-Valls, and J. Malo, “Iterative gaussianization: from ICA to random rotations,” IEEE Trans. Neur. Net., vol. 22, 2011.
  • [26] I. Jolliffe, Principal component analysis.   Springer, 2002.
  • [27] G. Camps-Valls, J. Muñoz and, L. Gómez, L. Guanter, and X. Calbet, “Nonlinear statistical retrieval of atmospheric profiles from MetOp-IASI and MTG-IRS infrared sounding data,” IEEE Trans. Geosci. Rem. Sens., vol. 50, no. 5, pp. 1759–1769, 2012.
  • [28] T. M. Lillesand, R. W. Kiefer, and J. Chipman, Remote Sensing and Image Interpretation.   New York: John Wiley & Sons, 2008.
  • [29] C.-I. Chang, Hyperspectral Data Exploitation: Theory and Applications.   NJ, USA: Wiley-Interscience, 2008.
  • [30] P. Honeine and C. Richard, “The pre-image problem in kernel-based machine learning,” IEEE Signal Processing Magazine, vol. 28, no. 2, pp. 77–88, 2011.
  • [31] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis.   New York, USA: John Wiley & Sons, 2001.
  • [32] T. Kohonen, “Self-organized formation of topologically correct feature maps,” Biological Cybernetics, vol. 43, no. 1, pp. 59–69, January 1982. [Online]. Available: http://dx.doi.org/10.1007/BF00337288
  • [33] V. Laparra and J. Malo, “Visual aftereffects and nonlinearities from a single statistical framework,” Submitted to Front. Human Neurosci., 2014.
  • [34] V. Laparra, D. Tuia, S. Jiménez, G. Camps-Valls, and J. Malo, “Nonlinear data description with principal polynomial analysis,” in IEEE Workshop on Machine Learning for Signal Processing, 2012, pp. 1–6.
  • [35] V. Laparra, S. Jiménez, D. Tuia, G. Camps-Valls, and J. Malo, “Principal polynomial analysis,” Int. J. Neur. Syst., vol. 26, no. 7, 2014.
  • [36] M. Scholz, “Validation of nonlinear PCA,” Neural processing letters, pp. 1–10, 2012.
  • [37] T. Hastie, “Principal curves and surfaces,” Ph.D. dissertation, Stanford University, 1984.
  • [38] D. Donnell, A. Buja, and W. Stuetzle, “Analysis of additive dependencies and concurvities using smallest additive principal components,” The Annals of Statistics, vol. 22, no. 4, pp. 1635–1668, 1994.
  • [39] P. C. Besse and F. Ferraty, “Curvilinear fixed effect model,” Computational Statistics, vol. 10, pp. 339–351, 1995.
  • [40] J. Einbeck, G. Tutz, and L. Evers, “Local principal curves,” Statistics and Computing, vol. 15, pp. 301–313, 2005.
  • [41] J. Einbeck, L. Evers, and B. Powell, “Data compression and regression through local principal curves and surfaces,” Int. J. Neur. Syst., vol. 20, no. 03, pp. 177–192, 2010.
  • [42] U. Ozertem and D. Erdogmus, “Locally defined principal curves and surfaces,” JMLR, vol. 12, pp. 1249–1286, 2011.
  • [43] V. Laparra, J. Malo, and G. Camps-Valls, “Dimensionality reduction via regression on hyperspectral infrared sounding data,” in IEEE Workshop on Hyperspectral Image and Signal Processing, 2014.
  • [44] N. Kambhatla and T. Leen, “Dimension reduction by local PCA,” Neural Computation, vol. 9, no. 7, pp. 1493–1500, 1997.
  • [45] J. Karhunen, S. Malaroiu, and M. Ilmoniemi, “Local linear independent component analysis based on clustering,” Intl. J. Neur. Syst., vol. 10, no. 6, December 2000.
  • [46] J. Malo and J. Gutiérrez, “V1 non-linear properties emerge from local-to-global non-linear ICA,” Network: Comp. Neur. Syst., vol. 17, no. 1, pp. 85–102, 2006.
  • [47] P. Delicado, “Another look at principal curves and surfaces,” J. Multivar. Anal., vol. 77, pp. 84–116, 2001.
  • [48] C. Bachmann, T. Ainsworth, and R. Fusina, “Exploiting manifold geometry in hyperspectral imagery,” IEEE Trans. Geosc. Rem. Sens., vol. 43, no. 3, pp. 441–454, Mar 2005.
  • [49] ——, “Improved manifold coordinate representations of large-scale hyperspectral scenes,” IEEE Trans. Geosc. Rem. Sens., vol. 44, no. 10, pp. 2786–2803, Oct 2006.
  • [50] B. Dubrovin, S. Novikov, and A. Fomenko, Modern Geometry: Methods and Applications.   New York: Springer Verlag, 1982.
  • [51] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis.   Cambridge University Press, 2004.
  • [52] M. Lázaro-Gredilla, J. Q. Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal, “Sparse spectrum gaussian process regression.” Journal of Machine Learning Research, vol. 11, pp. 1865–1881, 2010.
  • [53] J. Arenas-García, K. B. Petersen, G. Camps-Valls, and L. K. Hansen, “Kernel multivariate analysis framework for supervised subspace learning,” IEEE Signal Processing Magazine, p. 1, 2013.
  • [54] Y. Zhang, J. C. Duchi, and M. J. Wainwright, “Divide and conquer kernel ridge regression,” in COLT, 2013, pp. 592–617.
  • [55] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle, Eds., Least Squares Support Vector Machines,.   Singapore: World Scientific Pub. Co., 2002.
  • [56] G. Camps-Valls, L. Guanter, J. Muñoz, L. Gómez, and X. Calbet, “Nonlinear retrieval of atmospheric profiles from MetOp-IASI and MTG-IRS data,” in Proc. Im. Sig. Proc. Rem. Sens. XVI, vol. 7830.   SPIE, 2010, p. 78300Z.
  • [57] G. Camps-Valls, V. Laparra, J. Muñoz, L. Gómez, and X. Calbet, “Kernel-based retrieval of atmospheric profiles from IASI data,” in IEEE Proc. IGARSS 11, Jul 2011, pp. 2813–2816.
  • [58] K. N. Liou, An Introduction to Atmospheric Radiation, 2nd ed.   Hampton, USA: Academic Press, 2002.
  • [59] F. Hilton, N. C. Atkinson, S. J. English, and J. R. Eyre, “Assimilation of IASI at the Met Office and assessment of its impact through observing system experiments,” Q. J. R. Meteorol. Soc., vol. 135, pp. 495–505, 2009.
  • [60] H. L. Huang, W. L. Smith, and H. M. Woolf, “Vertical resolution and accuracy of atmospheric infrared sounding spectrometers,” J. Appl. Meteor., vol. 31, pp. 265–274, 1992.
  • [61] G. Chalon, F. Cayla, and D. Diebel, “IASI: an advanced sounder for operational meteorology,” in Proceedings of the 52nd Congress of IAF, Toulouse, France, 2001.
  • [62] C. G. Siméoni D., Singer C., “Infrared atmospheric sounding interferometer,” Acta Astronautica, vol. 40, pp. 113–118, 1997.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description