A Comparative Study on Unsupervised Domain Adaptation Approaches for Coffee Crop Mapping
In this work, we investigate the application of existing unsupervised domain adaptation (UDA) approaches to the task of transferring knowledge between crop regions having different coffee patterns. Given a geographical region with fully mapped coffee plantations, we observe that this knowledge can be used to train a classifier and to map a new county without requiring labeled samples from the target region. Experimental results show that transferring knowledge via some UDA strategies performs better than simply applying a classifier trained in one region to predict coffee crops in a new one. However, UDA methods may lead to negative transfer, which may indicate that the domains are so different that transferring knowledge is not appropriate. We also verify that normalization significantly affects some UDA methods; we observe a meaningful complementary contribution between coffee crop datasets; and a visual analysis suggests the existence of clusters of samples that are more likely to be drawn from a specific dataset.
Accessibility to large sets of remote sensing images (RSIs) has increased over the years, and RSIs are currently a common source of information in many agribusiness applications. Identifying crops is essential for knowing and monitoring land use, defining new land expansion strategies, and estimating viable production value. In this work we focus on the use of RSIs for a crucial agro-economic activity in Brazil and, in particular, in the state of Minas Gerais: coffee crop mapping.
Automatic recognition of coffee plantations using RSIs is typically modeled as a supervised classification problem. However, this task is rather challenging, mainly because the relief and age of the crop may hinder the recognition process. Indeed, different spectral responses and texture patterns can be observed for different regions. For instance, coffee may grow in mountainous or in flat regions (as in Brazilian cerrado-savannas), and mountains may introduce shadows and distortions in the spectral information, making the corresponding patterns appear very different from those of coffee grown in flat geographic regions. Because of this, spectral information may be significantly reduced or even totally lost. Moreover, since the growing of coffee is not a seasonal activity, there may be coffee plantations of different ages in different regions, which also affects the observed spectral patterns.
Although several approaches have advanced the state of the art in mapping coffee in recent years [1, 2, 3], one problem still remains: how to obtain representative samples for the classification of new geographic areas of study? The first possible strategy is the labeling of new samples, but it usually depends on experts, and even for them it is not always possible to visually identify the patterns in the images. Thus, this process often requires visits to the study site, which can add extra costs to the analysis. An alternative strategy to obtain extra data for training models is to transfer knowledge from already mapped regions. However, as Nogueira et al. have shown, due to the aforementioned differences that may exist in the coffee patterns between different geographic regions, direct transfer does not yield satisfactory quality.
In this work we investigate the application of existing unsupervised domain adaptation (UDA) approaches to the task of transferring knowledge between crop regions having different coffee patterns. Our intent is to evaluate the effectiveness of UDA approaches for mapping new coffee crop areas. UDA methods allow labeled data from one or more prior datasets to be employed with the aim of creating a learning model for unseen or unlabeled data. UDA assumes that the source (prior labeled datasets) and target (new unlabeled data) domains have related but different probability distributions; the divergence between these distributions is called domain shift. This phenomenon can arise in several visual applications, caused, for instance, by human pose changes in estimation tasks, luminosity variations in photos, differences in acquisition sensors, or the use of multi-view descriptions of the same object (drawing, sketch, photo, textual description).
Since supervised learning methods typically expect both source and target data to follow the same distribution, the presence of domain shift can degrade the accuracy on target data if training occurs directly on a source domain without proper domain adaptation, i.e., a correction of the difference between source and target distributions. Ideally, we would like to learn a proper domain adaptation in an unsupervised manner. The task, however, is rather challenging, and its relation to realistic applications has been drawing attention in recent years. Encouraged by these challenges, in this paper we perform a comparative experimental study of various UDA methods in the specific context of remote sensing data. We use a dataset composed of four remote sensing images of coffee crops in scenarios with different plant and terrain conditions.
This paper is organized as follows: Section II presents an overview of UDA techniques. Sections III and IV present, respectively, the evaluation protocol and the experimental results of our analysis. We conclude this work in Section V with some remarks and future research directions.
II Unsupervised Domain Adaptation Approaches
UDA methods can be organized into three main categories: (1) instance-based, (2) feature-based, and (3) classifier-based.
Instance-based adaptation re-weights data in the source domain or in both domains to reduce domain divergence. Feature-based adaptation attempts to learn a new feature representation that minimizes both the domain shift and the error of the learning task. Classifier-based adaptation learns a new model that minimizes the generalization error in the target domain using training instances from both domains.
In this work, we focus on feature-based UDA methods. Seven approaches were selected from the literature and are summarized in Table I according to their main properties. Note that they are grouped into three branches: data-centric, subspace-centric, and hybrid methods. We briefly introduce each UDA method, grouped by branch, in the next subsections.
|Data Centric||TCA ||✓||✓|
|Subspace Centric||SA ||✓|
II-A Data Centric Approaches
To align source and target data, data-centric methods attempt to find a specific transformation that can project both domains into a domain-invariant space. The distributional divergence between domains is reduced, while preserving the data properties from the original spaces [5, 6, 7].
Transfer Component Analysis (TCA): its goal is to learn a set of transfer components in a Reproducing Kernel Hilbert Space. When projecting domain data onto the latent space spanned by the transfer components, the distance between different distributions across domains is reduced while variance is preserved.
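As a rough illustration of this idea, the sketch below follows the usual closed-form presentation of TCA: a kernel matrix K is built over source and target samples, and the leading eigenvectors of (KLK + μI)⁻¹KHK are kept as transfer components, where L encodes the Maximum Mean Discrepancy and H is the centering matrix. The linear kernel, the function name `tca`, the parameters `dim` and `mu`, and the toy data are assumptions made for this example; they are not the settings used in the experiments.

```python
import numpy as np

def tca(Xs, Xt, dim=2, mu=1.0):
    """Minimal TCA sketch with a linear kernel (an illustrative assumption)."""
    ns, nt = len(Xs), len(Xt)
    n = ns + nt
    X = np.vstack([Xs, Xt])
    K = X @ X.T                                   # linear kernel over both domains
    # MMD coefficients: 1/ns^2 (source block), 1/nt^2 (target block),
    # and -1/(ns*nt) (cross blocks)
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    L = np.outer(e, e)
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    # Transfer components: leading eigenvectors of (K L K + mu I)^-1 K H K
    A = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    eigvals, eigvecs = np.linalg.eig(A)
    top = np.argsort(-eigvals.real)[:dim]
    W = eigvecs[:, top].real
    Z = K @ W                                     # embed source and target samples
    return Z[:ns], Z[ns:]

# Toy data (hypothetical): a shifted target distribution
rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(30, 5))
Xt = rng.normal(0.5, 1.0, size=(20, 5))
Zs, Zt = tca(Xs, Xt, dim=2)
```

A classifier would then be trained on `Zs` with the source labels and applied to `Zt`.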
Joint Distribution Adaptation (JDA): it extends the Maximum Mean Discrepancy to measure the difference in both marginal and conditional distributions. Although TCA minimizes the difference between the marginal distributions of the source and target data, its formulation does not guarantee that the difference between the conditional distributions is also reduced, which may lead to poor adaptation. JDA improves on TCA by integrating Maximum Mean Discrepancy with Principal Component Analysis to create a feature representation that is effective and robust under large domain shifts.
Transfer Joint Matching (TJM): it aims at minimizing the distribution distance between domains while reweighting the instances, so that those most valuable to the classification task carry more weight in the final adaptation. This is important because some source instances can be more relevant to the classification task than others, owing to the large difference between the initial data representations.
TCA, JDA, and TJM rely on the assumption that there always exists a transformation function that can project the source and target data into a common subspace which, at the same time, reduces the distribution difference and preserves most of the original information. This assumption, however, is not always realistic: known problems arising from strong domain shifts suggest that such a space may not always exist.
II-B Subspace Centric Approaches
In contrast to data-centric methods, subspace-centric methods do not assume the existence of a unified transformation. They rely on manipulating the subspaces of the source and target domains, or the subspaces between them, on the grounds that separate subspaces have very particular features to be exploited.
Subspace Alignment (SA): it projects source and target data onto different subspaces using PCA as a robust representation. The method then learns a linear transformation matrix that aligns the source subspace to the target one by minimizing the Frobenius norm of their difference. In this way, the distance between the distributions of the two domains is reduced by bringing the source and target subspaces closer together, exploiting the global covariance structure of the two domains.
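The alignment step described above can be sketched in a few lines. Because the PCA bases have orthonormal columns, the matrix M minimizing the Frobenius norm ||Ps M − Pt|| has the closed form M = Psᵀ Pt. The function names, the per-domain centering, and the subspace dimension `d` are assumptions for illustration only.

```python
import numpy as np

def pca_basis(X, d):
    """Top-d principal directions of X, returned as a (features x d) matrix."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:d].T

def subspace_alignment(Xs, Xt, d):
    """Minimal SA sketch: align the source PCA basis to the target one."""
    Ps = pca_basis(Xs, d)
    Pt = pca_basis(Xt, d)
    M = Ps.T @ Pt                                  # minimizer of ||Ps M - Pt||_F
    Zs = (Xs - Xs.mean(axis=0)) @ Ps @ M           # source in the aligned subspace
    Zt = (Xt - Xt.mean(axis=0)) @ Pt               # target in its own subspace
    return Zs, Zt

# Toy data (hypothetical)
rng = np.random.default_rng(0)
Xs = rng.normal(size=(40, 6))
Xt = rng.normal(size=(25, 6)) * np.array([1, 2, 3, 1, 2, 3])
Zs, Zt = subspace_alignment(Xs, Xt, d=3)
```

When both domains are identical, M reduces to the identity and the two projections coincide, which is a convenient sanity check.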
Geodesic Flow Kernel (GFK): it is an elegant approach that, using the kernel trick, integrates an infinite number of subspaces lying on the geodesic flow from the source subspace to the target one.
The main drawback of subspace-centric methods is that, while they focus on reducing the geometrical shift between subspaces, the distribution shift between the projected data of the two domains is not explicitly treated as in data-centric methods. Moreover, the dimension of the subspace onto which the data are projected normally requires some parameter tuning or preprocessing, which can be computationally expensive.
II-C Hybrid Approaches
CORAL: it was proposed to tackle the drawbacks of data-centric and subspace-centric methods. In this approach, the domain shift is minimized by aligning the covariances of the source and target distributions in the original feature space. In contrast to subspace-centric methods, CORAL performs the alignment without the need for subspace projection, which would require intense computation and complex hyper-parameter tuning. In addition, CORAL does not assume a unified transformation like data-centric methods; instead, it applies an asymmetric transformation only to the source data.
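A minimal sketch of this covariance alignment is given below: the source features are whitened with the inverse square root of the (regularized) source covariance and then re-colored with the square root of the target covariance. The helper `sym_power`, the regularization parameter `eps`, and the toy data are assumptions introduced for this example.

```python
import numpy as np

def sym_power(C, p):
    """Raise a symmetric positive-definite matrix to the power p."""
    w, V = np.linalg.eigh(C)
    return (V * w**p) @ V.T

def coral(Xs, Xt, eps=1.0):
    """Minimal CORAL sketch: only the source data are transformed."""
    D = Xs.shape[1]
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(D)   # regularized source covariance
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(D)   # regularized target covariance
    A = sym_power(Cs, -0.5) @ sym_power(Ct, 0.5)      # whiten, then re-color
    return Xs @ A

# Toy data (hypothetical): target features have different scales
rng = np.random.default_rng(0)
Xs = rng.normal(size=(200, 4))
Xt = rng.normal(size=(150, 4)) * np.array([0.5, 1.0, 2.0, 3.0])
Xs_aligned = coral(Xs, Xt, eps=1e-8)
```

With a very small `eps`, the covariance of `Xs_aligned` matches that of `Xt` up to numerical error; larger values of `eps` trade exact alignment for numerical stability.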
Joint Geometrical and Statistical Alignment (JGSA): it aims to reduce the statistical and geometrical divergence between domains using common and specific properties of the source and target data. To achieve that, an overall objective function is built from five terms: target variance, between-class variance, within-class variance, distribution shift, and subspace shift.
We carried out an extensive set of experiments on a Brazilian Coffee Crops dataset in order to evaluate the robustness of UDA methods in a remote-sensing agriculture scenario. The experiments were designed to answer the following research questions:
(1) Are UDA methods more effective than transferring without adaptation for coffee crop mapping? Which UDA approach is the most effective? When applied as a pre-processing step, how much does data normalization affect the quality of knowledge transfer?
(2) Can knowledge transfer between coffee plantation datasets from different geographic regions yield complementary results?
(3) Is it possible to infer a spatial relationship between coffee samples correctly predicted by learning models trained on different data sources?
To answer question (1), we compare the selected UDA methods against a classifier with no adaptation strategy. We also compare the approaches under four common data normalization schemes: L1-Norm, L2-Norm, L1-Norm followed by Z-score standardization, and L2-Norm followed by Z-score standardization. Although data pre-processing is not always treated as a main topic of analysis, the way the data are normalized before the adaptation phase can have a great impact on knowledge transfer. Concerning question (2), we use Venn diagrams of predictions to analyze the complementarity among different coffee datasets for a given tuple (normalization method, UDA approach). We select the tuple that proved most suitable in the experiments conducted to answer question (1). This experiment aims at investigating the contribution of knowledge from different sources to the same target data. Since domain shift in remote sensing can be caused by different latent aspects, different sources are expected to make complementary contributions to the same target data. Finally, to answer question (3), we perform a visual analysis of samples that are correctly predicted by specific models, using two different methods to project the original data representation into 2D space: Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (TSNE).
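The four normalization schemes can be sketched as follows. The text does not specify the axes involved, so this example assumes that the L1/L2 norms are applied per sample (row) and that the Z-score statistics are computed per feature (column) over the array passed in; the function names and scheme labels are likewise hypothetical.

```python
import numpy as np

def lp_normalize(X, p):
    """Scale each sample (row) to unit L1 (p=1) or L2 (p=2) norm."""
    norms = np.linalg.norm(X, ord=p, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)

def zscore(X):
    """Standardize each feature (column) to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / np.where(std == 0, 1.0, std)

def normalize(X, scheme):
    """Apply one of the four schemes: 'l1', 'l2', 'l1-zscore', 'l2-zscore'."""
    p = 1 if scheme.startswith("l1") else 2
    Xn = lp_normalize(X, p)
    return zscore(Xn) if scheme.endswith("zscore") else Xn

SCHEMES = ["l1", "l2", "l1-zscore", "l2-zscore"]

# Toy non-negative features (hypothetical), e.g. histogram-like descriptors
rng = np.random.default_rng(0)
X = rng.random((10, 4)) + 0.1
normalized = {s: normalize(X, s) for s in SCHEMES}
```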
The Brazilian Coffee Scenes dataset (http://www.patreo.dcc.ufmg.br/2017/11/12/brazilian-coffee-scenes-dataset/) consists of four remote sensing images composed of multi-spectral scenes taken by the SPOT sensor in 2005, covering regions of coffee cultivation in four counties in the state of Minas Gerais, Brazil: Arceburgo (AR), Guaxupé (GX), Guaranésia (GA) and Monte Santo (MS). Each county is partitioned into multiple tiles of 64 x 64 pixels, which are divided into two classes (coffee and non-coffee). To mitigate class imbalance (a significant difference in the number of samples between classes), we applied random under-sampling, balancing the data by randomly selecting a subset of the samples of the over-represented class. In our analysis, we consider each county as a different domain; thus we have four domains (AR, GX, GA and MS), leading to 12 possible domain adaptation combinations. We use a low-level descriptor named Border/Interior Pixel Classification (BIC) for feature extraction from the coffee scenes. BIC is a very effective descriptor for coffee crops, as shown in [3, 1].
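The two bookkeeping steps above, enumerating the ordered (source, target) pairs and balancing the classes, can be sketched as below. The function `random_undersample`, its `seed` parameter, and the toy labels are assumptions for illustration; the exact sampling procedure used in the experiments is not specified in the text.

```python
import itertools
import numpy as np

DOMAINS = ["AR", "GX", "GA", "MS"]
# Ordered (source, target) pairs: 4 * 3 = 12 adaptation combinations
PAIRS = list(itertools.permutations(DOMAINS, 2))

def random_undersample(X, y, seed=0):
    """Keep as many samples of every class as the rarest class has."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()                                   # preserve the original ordering
    return X[keep], y[keep]

# Toy imbalanced data (hypothetical): 30 non-coffee tiles, 10 coffee tiles
X = np.arange(40 * 3, dtype=float).reshape(40, 3)
y = np.array([0] * 30 + [1] * 10)
Xb, yb = random_undersample(X, y)
```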
III-B Setup and Implementation Details
We compare seven state-of-the-art methods: Transfer Component Analysis (TCA), Geodesic Flow Kernel (GFK), Subspace Alignment (SA), Joint Distribution Adaptation (JDA), Transfer Joint Matching (TJM), CORAL, and Joint Geometrical and Statistical Alignment (JGSA), as well as transfer with no adaptation (NA). A brief description of each method is given in Section II (for more details we recommend the original papers). We follow a full training evaluation protocol, where a Support Vector Machine (SVM) is trained on the labeled source data and tested on the unlabeled target data. In our experimental setup, parameters are always tuned on the source data, since cross-validation is impossible without labeled samples from the target domain. We evaluate all methods by empirically searching the parameter space for the settings that give the highest average kappa over all datasets, and we report the best accuracy results of each method.
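For reference, the two scores mentioned above can be computed as in the following sketch. It assumes that "kappa" refers to Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement (the accuracy) and p_e the agreement expected by chance; the function names and the toy labels are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(y_true)
    po = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    pe = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n**2
    if pe == 1.0:
        return 0.0    # kappa is undefined here; 0.0 is an arbitrary convention
    return (po - pe) / (1 - pe)

# Toy labels (hypothetical): 1 = coffee, 0 = non-coffee
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
acc = accuracy(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
```

In this toy case the accuracy is 0.75 while kappa is 0.5, since half of the agreement would be expected by chance.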
IV Experimental Results and Discussion
IV-A UDA Approaches Comparison
In this section, we compare different strategies for transferring knowledge between geographic domains in order to map coffee crops. We compare SVM classifiers with no adaptation (NA) against the seven selected UDA approaches. We also evaluate four different ways of normalizing the data across the eight transfer strategies (the seven UDA approaches and NA). To evaluate a normalization method, the mean accuracy over all 12 domain adaptation combinations is computed and reported. The mean accuracy results for each pair of counties from the Brazilian Coffee Crops dataset are shown in Table II.
The results show that, in general, it is better to use some UDA strategy than to transfer knowledge with no adaptation. We can also observe that TCA achieved the best results among the UDA approaches when using the L2-Norm followed by Z-score standardization. Beyond the comparison of overall mean results, several remarkable points can be observed: 1) even though TCA achieved the best mean accuracy, it was outperformed in 9 of the 12 combinations under the L2-Norm-Z-score setup, which could be the cost of an adaptation that does not take the conditional distribution information into account; 2) the knowledge transferred between two domains is not always reciprocal, as in the case of JGSA, which obtained the best results when trained on AR and tested on MS but the worst results when trained on MS and evaluated on AR; 3) under the same setup, in 3 of the 12 combinations the best results were obtained when no domain adaptation method was used, i.e., 25% of the combinations exhibited negative transfer. Table III shows the cases of positive and negative transfer: UDA approaches that performed better than the no-adaptation approach are marked in blue (positive transfer); otherwise they are marked in red (negative transfer).
IV-B Complementariness of Cross-Domain Predictions
In this subsection, we select the pair (L2-Norm-Z-score/TCA) to analyze the complementarity of predictions obtained from different source data for the same target data. Given a target dataset, each set in the diagram represents a source dataset on which the pair (L2-Norm-Z-score/TCA) was trained before being tested on the target data. The intersections between sets contain samples that were correctly predicted by all the corresponding source models. The results are represented as Venn diagrams, shown in Figure 1.
As expected, most of the samples in all diagrams lie at the intersection of the three sets, i.e., the easiest samples are correctly predicted regardless of which available source dataset is used for training. However, in all cases a considerable number of samples were correctly predicted only from a single source; e.g., in Figure 1(b), a model trained on Monte Santo correctly predicts 110 samples that are not correctly predicted by the models trained on Arceburgo or Guaranésia. This suggests the existence of complementary information that can be exploited to build a more reliable learning model. A relationship of “similarity” between domains is also noticeable. That is, some pairs of domains perform better than others, for instance Guaxupé and Monte Santo: a model trained on Monte Santo in Figure 1(b) and a model trained on Guaxupé in Figure 1(d). However, this relationship is not always bidirectional. Consider the case of Arceburgo and Guaranésia: in Figure 1(a), Guaranésia performs well as a source domain, correctly predicting 140 of 182 (76.92%) samples, but in Figure 1(c) Monte Santo is more useful than Arceburgo, correctly predicting 429 of 552 (77.71%) samples compared with 416 (75.36%) for Arceburgo.
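The regions of such a Venn diagram reduce to plain set operations over the indices of correctly predicted target samples. The sketch below shows one way to compute the shared region, the samples exclusive to each source, and the union covered by at least one source. The function name and the sample ids are hypothetical toy values, not results from the experiments.

```python
def complementarity(correct_by_source):
    """correct_by_source maps a source name to the set of target sample ids
    that the model trained on that source predicted correctly."""
    sources = list(correct_by_source)
    shared = set.intersection(*correct_by_source.values())
    exclusive = {}
    for s in sources:
        others = set().union(*(correct_by_source[o] for o in sources if o != s))
        exclusive[s] = correct_by_source[s] - others
    covered = set().union(*correct_by_source.values())
    return shared, exclusive, covered

# Hypothetical toy ids for a single target domain (not results from the paper)
correct = {
    "AR": {0, 1, 2, 3},
    "GA": {0, 1, 4},
    "MS": {0, 1, 2, 5, 6},
}
shared, exclusive, covered = complementarity(correct)
```

The size of `covered` relative to the best single source gives a simple upper bound on what an ideal combination of the source models could gain.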
IV-C Visual Analysis
In this section we investigate the spatial relationship among the samples. Given a fixed adaptation approach, we focus on the target samples that were correctly predicted exclusively from one specific source dataset. For this purpose, we perform a visual analysis of these samples using two different methods to project the original data representation into 2D space: Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (TSNE). The PCA and TSNE projections are shown in Figures 2 and 3, respectively.
A visual analysis of the projections reveals important aspects of the data and of the complementarity between source datasets. First, the PCA projections give little insight into the spatial relationship of correctly predicted samples; instead, they show sparsity over the feature space. PCA is a powerful dimensionality reduction technique, since it projects the original high-dimensional data onto a low-dimensional space while preserving as much variance as possible. However, PCA does not preserve the local structure of the original data, i.e., points that are close, under some metric, in the original high-dimensional space do not necessarily remain close in the new low-dimensional space. Second, in contrast with PCA, TSNE uses a non-linear manifold approach and can successfully create a low-dimensional representation that preserves local structures, as shown in Figure 3. In addition, we can notice a tendency toward complementarity between learning models, since the samples correctly predicted from different sources tend to form clusters. This behavior in the projections suggests shared properties between the source and target data, where the clusters contain samples that are more likely to be drawn from a specific source dataset. Another way of seeing this interpretation is to consider that remote sensing images can present high intra-class variance due to the large spatial extent they cover. An entire image can be seen as a composition of several probability distributions, some of which are better explained by different sources of data.
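For the linear case, the 2D projection used in this kind of inspection can be sketched with a singular value decomposition, as below. The function name and the toy data are assumptions for illustration. The non-linear TSNE projection is not sketched here, since it would normally rely on a library implementation (for instance, scikit-learn provides a `TSNE` class).

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project X onto its first principal components (maximum-variance axes)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Toy high-dimensional features (hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8)) * np.arange(1, 9)
P = pca_project(X, n_components=2)
```

Each row of `P` gives the 2D coordinates of one sample, which can then be colored by the source dataset from which it was correctly predicted.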
In this paper, we have conducted a comparative experimental analysis of seven UDA approaches to perform automatic coffee crop mapping. We conducted three sets of experiments with the intent of verifying whether existing approaches to unsupervised domain adaptation can assist in the transfer of knowledge between datasets of different geographic domains.
The main conclusion is that employing a UDA strategy is more effective than transferring knowledge without any adaptation. Experimental results also showed that the compared methods are highly sensitive to the normalization applied as a pre-processing step. In terms of mean accuracy, Transfer Component Analysis (TCA) presents the most suitable results. In addition, the negative transfer phenomenon was observed in several adaptation combinations, supporting the importance of an effective adaptation. The analysis of the complementarity of predictions showed the existence of additional information that could be exploited from multiple source datasets to build a more reliable learning model. Finally, the visual analysis made it possible to identify the formation of clusters among samples correctly predicted using different source data. This observation suggests that some samples from the target data are likely to be drawn from a specific source. This inspection indicates that a robust UDA approach needs to recognize the importance of multiple sources, considering that each source dataset makes a different contribution for distinct samples from the target.
As future work, we intend to investigate ways of avoiding negative transfer and to employ UDA strategies in other vegetation mapping applications.
[1] J. A. dos Santos, O. A. B. Penatti, and R. da S. Torres, “Evaluating the potential of texture and color descriptors for remote sensing image retrieval and classification,” in VISAPP, Angers, France, May 2010.
[2] K. Nogueira, W. R. Schwartz, and J. A. dos Santos, “Coffee crop recognition using multi-scale convolutional neural networks,” in CIARP, 2015, pp. 67–74.
[3] K. Nogueira, O. A. Penatti, and J. A. dos Santos, “Towards better exploiting convolutional neural networks for remote sensing scene classification,” Pattern Recognition, vol. 61, pp. 539–556, 2017.
[4] J. Zhang, W. Li, and P. Ogunbona, “Transfer learning for cross-dataset recognition: A survey,” arXiv preprint arXiv:1705.04396, 2017.
[5] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Transactions on Neural Networks, vol. 22, no. 2, pp. 199–210, 2011.
[6] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, “Transfer feature learning with joint distribution adaptation,” in ICCV, 2013, pp. 2200–2207.
[7] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, “Transfer joint matching for unsupervised domain adaptation,” in CVPR, 2014, pp. 1410–1417.
[8] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, “Unsupervised visual domain adaptation using subspace alignment,” in ICCV, 2013, pp. 2960–2967.
[9] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel for unsupervised domain adaptation,” in CVPR. IEEE, 2012, pp. 2066–2073.
[10] B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy domain adaptation,” in AAAI, vol. 6, no. 7, 2016, p. 8.
[11] J. Zhang, W. Li, and P. Ogunbona, “Joint geometrical and statistical alignment for visual domain adaptation,” in CVPR, 2017, pp. 1859–1867.
[12] R. O. Stehling, M. A. Nascimento, and A. X. Falcão, “A compact and efficient image retrieval approach based on border/interior pixel classification,” in CIKM. ACM, 2002, pp. 102–109.