Hyperspectral Data Augmentation
Abstract
Data augmentation is a popular technique which helps improve generalization capabilities of deep neural networks. It plays a pivotal role in remote-sensing scenarios in which the amount of high-quality ground truth data is limited, and acquiring new examples is costly or impossible. This is a common problem in hyperspectral imaging, where manual annotation of image data is difficult, expensive, and prone to human bias. In this letter, we propose online data augmentation of hyperspectral data which is executed during inference rather than before the training of deep networks. This is in contrast to all other state-of-the-art hyperspectral augmentation algorithms, which increase the size (and representativeness) of training sets. Additionally, we introduce a new principal component analysis (PCA)-based augmentation. The experiments revealed that our data augmentation algorithms improve generalization of deep networks, work in real time, and that the online approach can be effectively combined with offline techniques to enhance the classification accuracy.
I. Introduction
Hyperspectral satellite imaging (HSI) captures a wide spectrum (commonly more than a hundred bands) of light per pixel (forming an array of reflectance values). Such detailed information is being exploited by the remote sensing, pattern recognition, and machine learning communities in the context of accurate HSI classification (elaborating a class label of an HSI pixel) and segmentation (finding the boundaries of objects) in many fields [1]. Although the segmentation techniques include conventional machine learning algorithms (both unsupervised [2] and supervised [1, 3]), deep learning-based techniques have become the mainstream [4, 5, 6, 7, 8, 9, 10, 11]. They encompass deep belief networks [4, 7], recurrent neural networks [8], and convolutional neural networks (CNNs) [5, 6, 9, 10, 11].
Deep neural networks discover the underlying data representation, hence they do not require feature engineering and can potentially capture features which are unknown to humans. However, to train such large-capacity learners (and to avoid overfitting), we need a huge amount of ground truth data. Acquiring such training sets is often expensive, time-consuming, and human-dependent. These problems are very important real-world obstacles in training well-generalizing models (and validating such learners) faced by the remote sensing community—they are manifested by a very small number of ground truth benchmark HSI sets (there are approx. 10 widely used benchmarks, with the Salinas Valley, Pavia University, and Indian Pines scenes being the most popular).
To combat the problem of limited, non-representative, and imbalanced training sets, data augmentation can be employed. It is a process of synthesizing new examples following the original data distribution. Since such enhanced training sets can improve generalization of the learners, data augmentation may be perceived as implicit regularization. In computer vision tasks, data augmentation often involves simple affine (e.g., rotation or scaling) and elastic transforms of the image data [12]. These techniques, albeit applicable to HSI, do not benefit from all available information to model useful data.
I-A. Related Literature
The literature on HSI data augmentation is fairly limited (notably, only one of the deep-learning HSI segmentation methods discussed earlier in this section used augmentation—simple mirroring of training samples—for improved classification [10]). In [13], the authors calculated the per-spectral-band standard deviation (for each class) in the training set. The augmented samples are later drawn from a zero-mean multivariate normal distribution N(0, αΣ), where Σ is a diagonal matrix with the per-band standard deviations (for all classes) along its main diagonal, and α is a hyper-parameter (scale factor) of this technique. Despite its simplicity, this augmentation was shown to be able to help improve generalization.
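As a rough sketch (not the reference implementation from [13]), this noise-injection scheme can be written in a few lines of numpy; the function name and the way α scales the per-band deviations are our assumptions:

```python
import numpy as np

def noise_injection_augment(X, y, alpha=0.25, n_new=1, rng=None):
    """Synthesize new samples by adding zero-mean Gaussian noise whose
    per-spectral-band scale is the class-wise standard deviation,
    multiplied by the scale factor alpha (in the spirit of [13])."""
    rng = np.random.default_rng(rng)
    new_X, new_y = [], []
    for c in np.unique(y):
        Xc = X[y == c]                 # training pixels of class c (N_c x b)
        std = Xc.std(axis=0)           # per-band standard deviation for c
        for _ in range(n_new):
            base = Xc[rng.integers(len(Xc))]   # random original sample
            new_X.append(base + rng.normal(0.0, alpha * std))
            new_y.append(c)
    return np.array(new_X), np.array(new_y)
```

The synthesized pixels stay close to real samples of the same class, with per-band spread controlled by α.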
Li et al. utilized both spectral and spatial information to synthesize new samples in their pixel-block augmentation [14]. Two data-generation approaches, (i) Gaussian smoothing filtering alongside (ii) label-based augmentation, were exploited in [15]. The latter technique resembles weak labeling [16], and builds on an assumption that neighboring HSI pixels should share the same class (the label of a pixel propagates to its neighbors, and these generated examples are inserted into the training set). Thus, it may introduce wrongly labeled samples.
Generative adversarial networks (GANs) have already attracted research attention in the context of data augmentation due to their ability to introduce invariance of models with respect to affine and appearance variations. GANs model an unknown data distribution based on the provided samples, and they are composed of a generator and a discriminator. The generator should produce data samples which follow the underlying data distribution and are indistinguishable from the original data by the discriminator (hence, the two compete with each other). In a recent work [17], Audebert et al. applied GAN conditioning to ensure that the synthesized HSI examples (drawn from a random distribution) belong to the specified class. Overall, all of the state-of-the-art HSI augmentation methods are aimed at increasing the size and representativeness of the training sets which are later used to train the deep learners.
I-B. Contribution
In this letter, we propose a novel online augmentation technique for hyperspectral data (Section II-A)—instead of synthesizing samples and adding them to the training set (hence increasing its size, which adversely affects the training time of deep learners), we generate new examples during inference. These examples (both original and artificial) are classified using a deep net trained over the original set, and we apply a voting scheme to elaborate the final label. To our knowledge, such online augmentation has not been exploited in HSI analysis so far (test-time augmentation was utilized in medical imaging [18], where the authors applied affine transforms and noise injection into brain-tumor images for better segmentation). Also, we introduce a principal component analysis (PCA)-based augmentation (Section II-B) which may be used both offline (before the training) and online. This PCA-based augmentation simulates data variability, yet follows the original data distribution (which GANs are intended to learn [17], but they are not applicable at test time).
Our rigorous experiments performed over HSI benchmarks revealed that the online approach is very flexible—different augmentation techniques can be exploited in this setting (Section III). The results obtained using a spectral CNN indicated that the test-time augmentation significantly improves the abilities of the models when compared with those trained using the original sets and augmented using other algorithms (also, we compared our CNN with a spectral-spatial CNN from the literature whose capacity is much larger [11]). Our online approach does not sacrifice the inference speed and allows for real-time classification. We showed that the proposed PCA augmentation is extremely fast, and ensures the highest-quality generalization of the deep models for all data-split scenarios. Finally, we demonstrated that the offline and online augmentations can be effectively combined for better classification.
II. Proposed Hyperspectral Data Augmentation
II-A. Online Hyperspectral Data Augmentation
Our online (test-time) data augmentation involves synthesizing artificial samples for each incoming example during inference. We traverse the neighborhood of the original example and try to mitigate potential input-dependent uncertainty of the deep model. In contrast to the offline augmentation techniques, the test-time augmentation does not increase the training time of the network, and it does not require defining the number of synthesized samples beforehand (also, for the majority of specific augmentation algorithms, the operation time of a trained learner is not significantly affected, since the inference is fast). We build upon the theory of ensemble learning, where elaborating a combined classifier (encompassing several weak learners) delivers high-quality generalization (it is an efficient regularizer). Here, by creating A artificial data points, we implicitly form a homogeneous ensemble of deep models (trained over the same training set). The final class label is elaborated using majority voting (with equal weights) over all A + 1 samples (for low ensemble confidence, when two or more classes receive the same number of votes, we perform soft voting—we average all class probabilities, and the final class label corresponds to the class with the highest average probability).
The proposed online HSI augmentation may be considered a meta-algorithm, in which a specific data augmentation method is applied to synthesize samples on the fly. Although we exploited the noise-injection-based approach [13] and our PCA-based technique (see Section II-B) in this work, we anticipate that other augmentations which modify an existing sample can be straightforwardly utilized here. Finally, the online augmentation may be coupled with any offline technique (Section III).
II-B. Principal Component Analysis-Based Data Augmentation
In this work, we propose a new augmentation method based on PCA. Let us consider a training set T of N HSI pixels, T = {x_1, x_2, ..., x_N}, where each x_i is b-dimensional (b denotes the number of bands in this HSI). PCA extracts a set of r (r ≤ b) projection directions (vectors) by maximizing the projected variance of a given b-dimensional dataset—the first principal component accounts for as much of the data variability as possible, and so forth. First, we center the data at the origin (hence, we subtract the average sample m from each x_i), and form the data matrix D (of size b × N), whose i-th column is (x_i − m). The covariance matrix becomes Σ = (1/N) D D^T, and it undergoes eigendecomposition Σ = W Λ W^T, where Λ is the matrix with the non-increasingly ordered eigenvalues along its main diagonal, and W is the matrix with the corresponding eigenvectors (principal components) as its columns. Finally, the principal components form an orthogonal base, and each sample can be projected onto the new feature space: y_i = W^T (x_i − m). Importantly, each sample can be projected back to its original space, x_i' = W y_i + m, with the reconstruction error ||x_i − x_i'|| (the PCA-training procedure minimizes this error—it is non-zero if r < b; otherwise, if r = b, there is no reconstruction error).
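These steps can be condensed into a short numpy sketch (the function names are ours; with all b components retained, the back-projection is exact up to floating-point error):

```python
import numpy as np

def fit_pca(X):
    """Fit PCA via eigendecomposition of the covariance matrix.
    X holds one b-dimensional sample per row (N x b)."""
    m = X.mean(axis=0)                      # average sample
    D = (X - m).T                           # b x N data matrix
    cov = D @ D.T / X.shape[0]              # covariance (1/N) D D^T
    eigvals, W = np.linalg.eigh(cov)        # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]       # sort non-increasingly
    return m, eigvals[order], W[:, order]   # columns of W: principal comps

def project(X, m, W):
    return (X - m) @ W                      # y_i = W^T (x_i - m)

def reconstruct(Y, m, W):
    return Y @ W.T + m                      # x_i' = W y_i + m
```

Keeping only the first r columns of W gives the truncated projection with a non-zero reconstruction error, as discussed above.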
[Fig. 1 (a)–(c): visualization of the PCA-based augmentation; panels (b) and (c) show the original and the synthesized samples projected onto the reduced feature space.]
The first step of our PCA-based data augmentation involves transforming all training samples using PCA (trained over the original training set). Afterwards, the first principal component (of each sample) is multiplied by a random value α drawn from a uniform distribution U(α_min, α_max), where α_min and α_max are the hyper-parameters of our method (α is drawn independently for all original examples; note that more principal components could be exploited here as well). This process is visualized in Fig. 1—we can observe that the synthesized examples (Fig. 1c) preserve the original data distribution (Fig. 1b) projected onto a reduced feature space, and preserve inter-class relationships. Finally, these samples are projected back onto the original space (using all principal components to ensure correct mapping), and they are added to the augmented training set (if executed offline). This PCA-based augmentation can be applied in both offline and online settings (in both cases, PCA is trained over the original training set).
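For completeness, a self-contained sketch of the offline variant of this augmentation (the α range below is illustrative, not the value used in our experiments):

```python
import numpy as np

def pca_augment(X, a_min=0.9, a_max=1.1, rng=None):
    """PCA-based augmentation: scale the first principal component of
    each sample by alpha ~ U(a_min, a_max), then project back to the
    original space using all components."""
    rng = np.random.default_rng(rng)
    m = X.mean(axis=0)
    cov = np.cov(X - m, rowvar=False, bias=True)
    eigvals, W = np.linalg.eigh(cov)
    W = W[:, np.argsort(eigvals)[::-1]]     # columns: PCs, variance-ordered
    Y = (X - m) @ W                         # project all samples
    Y[:, 0] *= rng.uniform(a_min, a_max, size=len(X))  # perturb 1st PC only
    return Y @ W.T + m                      # back-project with all PCs
```

Because only the first component is rescaled, each synthesized pixel slides along the direction of maximal data variability, which keeps the perturbation consistent with the original data distribution.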
III. Experiments
The experimental objective was to verify the impact of data augmentation on the deep-model generalization. For online augmentation, we applied our PCA augmentation (PCA-ON) and noise injection (Noise-ON) [13], whereas for the offline setting, we used our PCA-based method (PCA), generative adversarial nets (GAN) [17], and noise injection (Noise) [13]. GAN cannot be used online, since it does not modify an incoming example, but rather synthesizes samples which follow an approximated distribution. Finally, we combined online and offline augmentation in PCA/PCA-ON (PCA augmentation is used both to augment the set beforehand and to generate new samples at test time), and GAN/PCA-ON. For each offline technique, we at most doubled the number of original samples (unless that number would exceed the size of the most numerous class—in such a case, we augment only by the missing difference). For online augmentation, we synthesize a fixed number of artificial samples for each incoming example, and we used the same (α_min, α_max) range for PCA and PCA-ON.
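The per-class augmentation budget described above (at most double each class, capped at the size of the most numerous class) boils down to one line per class; `offline_budget` is a hypothetical helper name:

```python
import numpy as np

def offline_budget(y):
    """Number of samples to synthesize per class: double each class,
    but never let it grow beyond the most numerous class."""
    classes, counts = np.unique(y, return_counts=True)
    cap = counts.max()
    return {c: min(n, cap - n) for c, n in zip(classes, counts)}
```

For example, with class counts {10, 3, 8}, the budgets are 0, 3, and 2 synthesized samples, respectively.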
We exploit our shallow (thus resource-frugal) 1D spectral CNN (Fig. 2), coded in Python 3.6 [19]. Larger-capacity CNNs require longer training and are slower at inference, hence they are less likely to be deployed for Earth observation, especially on board a satellite. The training (using the ADAM optimizer) stops if, after 15 epochs, the accuracy over the validation set (a random subset of the training set) plateaus.
We train and validate the deep models using: (1) balanced sets with randomly selected pixels (B), (2) imbalanced sets with randomly selected pixels (IB), and (3) our patched sets (P) [19] (for a fair comparison, the numbers of training and validation pixels for B and IB are close to those reported in [11]). We also report the results obtained using a spectral-spatial CNN (3D-CNN) [11], trained over the original training sets (3D-CNN—in contrast to our CNN—does suffer from the training-test information-leak problem, so the 3D-CNN results over B and IB are over-optimistic [19]). For each fold in (3), we repeat the experiments multiple times, and for (1) and (2), we perform Monte Carlo cross-validation with the same total number of runs. We report the per-class, average (AA), and overall accuracy (OA), averaged across all runs.
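OA and AA can be computed as follows (a standard formulation of the two metrics, not code from [19]):

```python
import numpy as np

def overall_and_average_accuracy(y_true, y_pred):
    """OA: fraction of correctly classified pixels.
    AA: mean of per-class accuracies (robust to class imbalance)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    oa = (y_true == y_pred).mean()
    per_class = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(oa), float(np.mean(per_class))
```

Reporting both matters here: for imbalanced sets, OA can stay high even when small classes are misclassified, which AA exposes.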
We focused on three HSI benchmarks (see their class-distribution characteristics at: http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes). Salinas Valley, USA (512 × 217 pixels, NASA AVIRIS sensor) presents different sorts of vegetation (16 classes, 224 bands, 3.7 m spatial resolution). Indian Pines (145 × 145 pixels, AVIRIS) covers north-western Indiana, USA (agriculture and forest, 16 classes, 200 channels, 20 m). Pavia University (610 × 340 pixels, ROSIS) was captured over Lombardy, Italy (urban scenery, 9 classes, 103 channels, 1.3 m).
The results (over the test sets) obtained for the Salinas Valley, Indian Pines, and Pavia University datasets are gathered in Table I. Introducing augmented samples (in both online and offline settings) helped boost the generalization abilities of the deep models in the majority of cases (even up to more than 8% of OA for GAN, PCA, and PCA/PCA-ON in Salinas, B; only the Noise offline augmentation deteriorated both OA and AA for P). Interestingly, exploring the local neighborhood randomly (Noise and Noise-ON) can notably deteriorate OA and AA. This usually occurs for underrepresented classes (e.g., C7 in Pavia), since their examples lie close to other-class examples in the discovered feature space (therefore, they can be easily “confused” with each other). This problem is addressed by the data-distribution analysis in our PCA-based augmentations. Coupling offline and online augmentation (PCA/PCA-ON and GAN/PCA-ON) gave consistently high-quality results over all sets and all training-test splits, and dealt well with the HSI imbalance (in P, we did not ensure that examples of all classes are included in the original training set, thus P is very challenging [19]).
Table II. The p-values of two-tailed Wilcoxon tests over the per-class AA.

(I)     (a)    (b)    (c)    (d)    (e)    (f)    (g)    (h)
(a)      —    0.05   0.001  0.001  0.2    0.001  0.001  0.001
(b)     0.2    —     0.05   0.005  0.05   0.2    0.001  0.02
(c)     0.05  0.02    —     0.05   0.001  0.2    0.2    0.2
(d)     0.001 0.001  0.05    —     0.001  0.001  0.2    0.1
(e)     0.2   0.2    0.2    0.001   —     0.001  0.001  0.001
(f)     0.001 0.01   0.1    0.05   0.001   —     0.02   0.2
(g)     0.01  0.001  0.01   0.2    0.001  0.02    —     0.05
(h)     0.01  0.02   0.1    0.05   0.01   0.2    0.01    —

(II)    (a)    (b)    (c)    (d)    (e)    (f)    (g)    (h)
(a)      —    0.05   0.2    0.05   0.2    0.2    0.02   0.2
(b)            —     0.05   0.001  0.2    0.05   0.001  0.1
(c)                   —     0.05   0.1    0.2    0.001  0.001
(d)                          —     0.05   0.05   0.2    0.02
(e)                                 —     0.05   0.005  0.2
(f)                                        —     0.005  0.2
(g)                                               —     0.001

(a) Without, (b) Noise, (c) GAN, (d) PCA, (e) Noise-ON, (f) PCA-ON,
(g) PCA/PCA-ON, (h) GAN/PCA-ON.
To verify the statistical significance of the results (and see if the differences in the average per-class accuracy are important in the statistical sense), we executed two-tailed Wilcoxon tests for each dataset split (B, IB, and P) over the per-class AA for all HSI. The results reported in Table II show that applying HSI data augmentation is beneficial in most cases and delivers significant improvements in accuracy. GAN did equally well as, e.g., PCA, PCA-ON, and our combined PCA/PCA-ON and GAN/PCA-ON for B, as Noise-ON, PCA-ON, and GAN/PCA-ON for IB, and as Noise-ON and PCA-ON for P. This indicates that employing time-consuming and complex deep-learning engines for data augmentation does not necessarily bring larger improvements in the performance of the deep models.
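Such a paired test can be run directly with SciPy; the accuracy values below are hypothetical and serve only to show the call:

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired per-class AA values [%] for two augmentation variants
# (hypothetical numbers for illustration only):
aa_without = np.array([81.2, 74.5, 90.1, 66.3, 88.7, 79.9, 70.2, 85.0])
aa_pca     = np.array([84.0, 76.1, 91.5, 70.8, 89.2, 83.3, 74.6, 86.1])

stat, p = wilcoxon(aa_without, aa_pca)  # two-sided by default
significant = p < 0.05                  # reject H0: equal median accuracies
```

The test is non-parametric and paired (class by class), which suits per-class accuracies that are neither independent nor normally distributed.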
Our combined approaches (PCA/PCA-ON and GAN/PCA-ON) were stable and consistently ensured high-quality generalization (as shown in Table I) of the deep models over all splits. This stability is also manifested in Table III, where we summarize the results across all sets (although PCA gave the best accuracy for B, the differences between PCA and PCA/PCA-ON and GAN/PCA-ON are not statistically significant). We can appreciate that our PCA-based augmentation (offline, online, or combined) allowed us to obtain the best generalization—the very intuitive PCA-based data-distribution analysis for synthesizing samples outperformed or worked on par with GAN in the case of difficult (small and imbalanced) sets. Finally, our CNN surpassed the accuracy elaborated using a significantly larger 3D-CNN from the literature (with a bigger capacity) for P (note that the results obtained using 3D-CNN for B and IB are over-optimistic due to the intrinsic training-test information-leak problem, hence they cannot be considered reliable [11]).
Table III. OA and AA [%] averaged over all three datasets.

Augmentation      B              IB             P
                OA     AA      OA     AA      OA     AA
Without        81.00  85.66   83.62  81.28   71.22  64.33
Noise          81.20  87.30   75.83  82.89   69.98  63.45
GAN            85.31  88.88   86.25  84.49   71.28  64.66
PCA            85.52  89.51   87.37  86.27   72.62  65.81
Noise-ON       78.59  81.99   82.85  80.02   70.24  63.21
PCA-ON         83.75  88.23   86.86  85.26   71.94  64.77
PCA/PCA-ON     84.89  89.48   87.56  86.47   73.03  65.92
GAN/PCA-ON     83.29  89.09   86.32  84.92   71.10  64.78
3D-CNN [11]    87.77* 92.71*  91.21* 91.03*  62.89  55.87

*Over-optimistic due to the training-test information leak; see [19].
To gain better insights into the augmentation performance (and its potential overhead imposed on the deep models in terms of training and/or test times), we collected the average execution times of the most important steps of the investigated methods in Table IV. It can be observed that training of GANs is very time-consuming (it was executed using an NVIDIA GeForce GTX 1060), and is orders of magnitude longer than the preprocessing in the other offline techniques (PCA and Noise). Although all offline augmentations affect the training time of the deep networks, these differences are not dramatic. Finally, the online augmentation allowed us to classify test pixels in real time (note that we report the inference time in ms).
Table IV. Average execution times of (a) offline data augmentation, (b) network training, and (c) inference (the latter in ms).

             B                          IB                         P
Aug.          Sa       IP      PU       Sa       IP      PU       Sa      IP       PU
(a) Noise      0.14    0.02    0.05     0.10    0.06    0.03     0.08    0.05     0.03
    GAN      529.10 2048.47  241.18   555.77 1938.28  617.40   594.28  960.14  1151.71
    PCA        0.10    0.05    0.02     0.09    0.06    0.02     0.08    0.05     0.02
(b) Without  103.15   64.01   14.97   116.40   63.51   16.05   118.98   64.46    26.08
    Noise    176.20   56.91   32.60   127.31   76.85   28.38   104.44   67.21    31.47
    GAN      146.21  146.22   48.11   192.36   92.87   29.33   102.40   67.97    25.32
    PCA      167.83   91.14   26.14   156.30   84.28   23.53   183.89   81.13    27.40
(c) Without    0.09    0.11    0.07     0.09    0.11    0.07     0.10    0.15     0.09
    Noise      0.09    0.11    0.07     0.09    0.11    0.07     0.09    0.15     0.09
    GAN        0.09    0.11    0.07     0.09    0.12    0.07     0.10    0.14     0.10
    PCA        0.09    0.11    0.07     0.09    0.11    0.07     0.10    0.15     0.09
    Noise-ON   1.65    1.90    1.46     1.75    1.84    1.44     2.03    2.23     1.79
    PCA-ON     1.97    1.96    1.61     2.00    1.99    1.55     2.18    2.40     2.01

Sa—Salinas Valley, IP—Indian Pines, PU—Pavia University.
IV. Conclusions
In this letter, we introduced a new online HSI data augmentation approach which synthesizes examples at test time. It is in contrast to other state-of-the-art hyperspectral data augmentation techniques that work offline (i.e., before the deep-network training, to increase the training-set cardinality and representativeness). Our experimental study, performed over three HSI benchmark sets (with different training-test data splits) and coupled with statistical tests, revealed that our online augmentation is very flexible (different augmentations can be applied here), improves the generalization abilities of deep neural networks, and works in real time. Also, we showed that combining online and offline augmentation leads to consistently well-performing models. Finally, we proposed a principal component analysis-based augmentation which operates extremely fast, synthesizes high-quality data, outperforms other augmentations for small and imbalanced sets, and is applicable in both online and offline settings.
References
 [1] T. Dundar and T. Ince, “Sparse representationbased hyperspectral image classification using multiscale superpixels and guided filter,” IEEE GRSL, pp. 1–5, 2018.
 [2] G. Bilgin, S. Erturk, and T. Yildirim, “Segmentation of hyperspectral images via subtractive clustering and cluster validation using oneclass SVMs,” IEEE TGRS, vol. 49, no. 8, pp. 2936–2944, 2011.
 [3] F. Li, D. Clausi, L. Xu et al., “STIRGS: A regionbased selftraining algorithm applied to hyperspectral image classification and segmentation,” IEEE TGRS, vol. 56, no. 1, pp. 3–16, 2018.
 [4] Y. Chen, X. Zhao, and X. Jia, “Spectral-spatial classification of hyperspectral data based on deep belief network,” IEEE JSTARS, vol. 8, no. 6, pp. 2381–2392, 2015.
 [5] W. Zhao and S. Du, “Spectral-spatial feature extraction for hyperspectral image classification,” IEEE TGRS, vol. 54, no. 8, pp. 4544–4554, 2016.
 [6] Y. Chen, H. Jiang, C. Li et al., “Deep feature extraction and classification of hyperspectral images based on convolutional neural networks,” IEEE TGRS, vol. 54, no. 10, pp. 6232–6251, 2016.
 [7] P. Zhong, Z. Gong, S. Li et al., “Learning to diversify deep belief networks for hyperspectral image classification,” IEEE TGRS, vol. 55, no. 6, pp. 3516–3530, 2017.
 [8] L. Mou, P. Ghamisi, and X. X. Zhu, “Deep recurrent nets for hyperspectral classification,” IEEE TGRS, vol. 55, no. 7, pp. 3639–3655, 2017.
 [9] A. Santara, K. Mani, P. Hatwar et al., “BASS Net: Band-adaptive spectral-spatial feature learning neural network for hyperspectral image classification,” IEEE TGRS, vol. 55, no. 9, pp. 5293–5301, 2017.
 [10] H. Lee and H. Kwon, “Going deeper with contextual CNN for hyperspectral classification,” IEEE TIP, vol. 26, no. 10, pp. 4843–4855, 2017.
 [11] Q. Gao, S. Lim, and X. Jia, “Hyperspectral image classification using convolutional neural networks and multiple feature learning,” Rem. Sens., vol. 10, no. 2, p. 299, 2018.
 [12] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” Proc. NIPS, pp. 1097–1105, 2012.
 [13] V. Slavkovikj, S. Verstockt, W. De Neve et al., “Hyperspectral image classification with CNNs,” in Proc. ICM. ACM, 2015, pp. 1159–1162.
 [14] W. Li, C. Chen, M. Zhang et al., “Data augmentation for hyperspectral classification with deep CNN,” IEEE GRSL, pp. 1–5, 2018.
 [15] J. Acquarelli, E. Marchiori, L. M. Buydens et al., “Spectral-spatial classification of hyperspectral images,” Rem. Sens., vol. 10, no. 7, 2018.
 [16] Y.Y. Sun, Y. Zhang, and Z.H. Zhou, “Multilabel learning with weak label,” in Proc. AAAI. AAAI Press, 2010, pp. 593–598.
 [17] N. Audebert, B. L. Saux, and S. Lefèvre, “Generative adversarial networks for realistic synthesis of hyperspectral samples,” CoRR, vol. abs/1806.02583, pp. 1–4, 2018.
 [18] G. Wang, W. Li, S. Ourselin et al., “Automatic brain tumor segmentation using convolutional neural networks with test-time augmentation,” CoRR, vol. abs/1810.07884, pp. 1–12, 2018.
 [19] J. Nalepa, M. Myller, and M. Kawulok, “Validating hyperspectral image segmentation,” IEEE GRSL, pp. 1–5, 2019, in press, DOI:10.1109/LGRS.2019.2895697 (preprint: https://arxiv.org/abs/1811.03707).