DASA: Domain Adaptation in Stacked Autoencoders using Systematic Dropout
Domain adaptation deals with adapting the behaviour of machine learning based systems trained on samples in a source domain to deployment in a target domain where the statistics of the samples in the two domains are dissimilar. Directly training or adapting a learner in the target domain is challenged by the lack of abundant labeled samples. In this paper we propose a technique for domain adaptation in stacked autoencoder (SAE) based deep neural networks (DNN) performed in two stages: (i) unsupervised weight adaptation using systematic dropouts in mini-batch training, and (ii) supervised fine-tuning with a limited number of labeled samples in the target domain. We experimentally evaluate performance on the problem of retinal vessel segmentation, where the SAE-DNN is trained using a large number of labeled samples in the source domain (DRIVE dataset) and adapted using fewer labeled samples in the target domain (STARE dataset). We report the performance of the SAE-DNN, together with the area under the ROC curve, in the source domain, in the target domain without and with adaptation, and when trained exclusively with the limited samples in the target domain. The high efficiency of vessel segmentation with DASA strongly substantiates our claim.
The under-performance of learning based systems during the deployment stage can be attributed to dissimilarity in the distribution of samples between the source domain, on which the system is initially trained, and the target domain, on which it is deployed. Transfer learning is an active field of research which deals with transfer of knowledge between the source and target domains to address this challenge and enhance the performance of learning based systems when it is difficult to train a system exclusively in the target domain due to the unavailability of sufficient labeled samples. While domain adaptation (DA) has primarily been developed for simple reasoning and shallow network architectures, there exist few techniques for adapting deep networks with complex reasoning. In this paper we propose a systematic dropout based technique for adapting a stacked autoencoder (SAE) based deep neural network (DNN) for the purpose of vessel segmentation in retinal images. Here the SAE-DNN is initially trained using an ample number of samples in the source domain (DRIVE dataset, http://www.isi.uu.nl/Research/Databases/DRIVE) to evaluate the efficacy of DA during deployment in the target domain (STARE dataset, http://www.ces.clemson.edu/~ahoover/stare), where an insufficient number of labeled samples are available for reliable training exclusively in the target domain.
Related Work: The autoencoder (AE) is a type of neural network which learns compressed representations inherent in the training samples without labels. A stacked AE (SAE) is created by hierarchically connecting hidden layers to learn hierarchical embeddings in compressed representations. An SAE-DNN consists of the encoding layers of an SAE followed by a target prediction layer for the purpose of regression or classification. With the increase in demand for DA in SAE-DNNs, different techniques have been proposed, including marginalized training, graph regularization, and structured dropouts, across applications ranging from speech emotion recognition to font recognition.
Challenge: The challenge of DA is to retain nodes common across the source and target domains, while adapting the domain specific nodes using a smaller number of labeled samples. Earlier methods [3, 7, 10] are primarily challenged by their inability to re-tune nodes specific to the source domain into nodes specific to the target domain to achieve the desired performance; they are able only to retain nodes, or a thinned network, which encode domain invariant hierarchical embeddings.
Approach: Here we propose a method for DA in SAE (DASA) using systematic dropout. The two stage method adapts an SAE-DNN trained in the source domain through (i) unsupervised weight adaptation using systematic dropouts in mini-batch training with abundant unlabeled samples in the target domain, and (ii) supervised fine-tuning with a limited number of labeled samples in the target domain. The systematic dropout per mini-batch is introduced only in the representation encoding (hidden) layers and is guided by a saliency map defined by the response of the neurons to the mini-batch under consideration. Error backpropagation and weight updates, however, span all nodes and are not restricted to the post dropout activated nodes, contrary to classical randomized dropout approaches. Thus, having different dropout nodes across different mini-batches while updating weights across all nodes in the network ensures refinement of domain specific hierarchical embeddings while preserving domain invariant ones.
2 Problem Statement
Let us consider a retinal image represented in the RGB color space as $I$, such that the pixel location $\mathbf{x}$ has the color vector $\mathbf{c}(\mathbf{x})$, and let $N(\mathbf{x})$ be a neighborhood of pixels centered at $\mathbf{x}$. The task of retinal vessel segmentation can be formally defined as assigning a class label $y \in \{\mathrm{vessel}, \mathrm{background}\}$ using a hypothesis model $h(I, \mathbf{x}, N(\mathbf{x}))$. When the statistics of the samples encountered during deployment are significantly dissimilar from those used for training, the performance of $h(\cdot)$ is severely affected. Generally the training data is referred to as the source domain, and the set of samples used during deployment belongs to the target domain. The hypotheses which optimally define the source and target domains are also referred to as $h_{\mathrm{source}}$ and $h_{\mathrm{target}}$. DA is formally defined as a transformation $h_{\mathrm{source}} \rightarrow h_{\mathrm{target}}$, as detailed in Fig. 1.
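The patch-based formulation above can be sketched in code: each training sample is the flattened RGB neighborhood around a pixel, which the hypothesis model then maps to a vessel/background label. This is a minimal illustration; the patch size shown here is an arbitrary choice, not the paper's setting.

```python
import numpy as np

def extract_patch(image, row, col, size=15):
    """Flatten the size x size RGB neighborhood centered at (row, col).

    `size=15` is an illustrative value, not the configuration used in
    the paper. `image` is an (H, W, 3) array; the returned vector has
    length size * size * 3 and serves as one input sample x.
    """
    r = size // 2
    patch = image[row - r: row + r + 1, col - r: col + r + 1, :]
    return patch.reshape(-1)

# Toy usage on a random "retinal image".
rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))
x = extract_patch(img, 32, 32)
assert x.shape == (15 * 15 * 3,)
```

Border pixels would additionally need padding or exclusion, which is omitted here for brevity.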
3 Exposition to the Solution
Let us consider the source domain $\mathcal{D}_{\mathrm{source}}$ with abundant labeled samples to train an SAE-DNN for the task of retinal vessel segmentation, and a target domain $\mathcal{D}_{\mathrm{target}}$ with a limited number of labeled samples and ample unlabeled samples, insufficient to learn $h_{\mathrm{target}}$ reliably, as illustrated in Fig. 1. $\mathcal{D}_{\mathrm{source}}$ and $\mathcal{D}_{\mathrm{target}}$ are closely related, but exhibit distribution shifts between their samples, resulting in the under-performance of $h_{\mathrm{source}}$ in $\mathcal{D}_{\mathrm{target}}$, as also illustrated in Fig. 1. The technique of generating $h_{\mathrm{source}}$ using $\mathcal{D}_{\mathrm{source}}$, and subsequently adapting it to $h_{\mathrm{target}}$ via systematic dropout using $\mathcal{D}_{\mathrm{target}}$, is explained in the following sections.
3.1 SAE-DNN learning in the source domain
The AE is a single layer neural network that encodes the cardinal representations of a pattern $\mathbf{x} \in \mathbb{R}^n$ onto a transformed space $\mathbf{h} \in \mathbb{R}^d$, with $\mathbf{W}$ denoting the connection weights between neurons, such that

$\mathbf{h} = f(\mathbf{W}\mathbf{x} + \mathbf{b})$

where the cardinality of $\mathbf{h}$ is $d$, $\mathbf{W} \in \mathbb{R}^{d \times n}$, and $\mathbf{b} \in \mathbb{R}^d$ is termed the bias connection. We choose $f(\cdot)$ to be a sigmoid function defined as $f(z) = 1/(1 + e^{-z})$. The AE is characterized by another associated function, generally termed the decoder unit, such that

$\hat{\mathbf{x}} = f(\mathbf{W}'\mathbf{h} + \mathbf{b}')$

where $\mathbf{W}' \in \mathbb{R}^{n \times d}$ and $\mathbf{b}' \in \mathbb{R}^n$. When $d < n$, this network acts to store compressed representations of the pattern $\mathbf{x}$ encoded through the weights $\{\mathbf{W}, \mathbf{b}, \mathbf{W}', \mathbf{b}'\}$. However, the values of the elements of these weight matrices are obtained through learning, and since class labels of the patterns $\mathbf{x}$ are not needed, it follows unsupervised learning using some optimization algorithm, viz. stochastic gradient descent,

$\{\mathbf{W}, \mathbf{b}, \mathbf{W}', \mathbf{b}'\} = \arg\min J(\mathbf{W}, \mathbf{b}, \mathbf{W}', \mathbf{b}')$

such that $J(\cdot)$ is the cost function used for optimization over all available patterns $\mathbf{x}_m$

$J(\cdot) = \sum_m \|\mathbf{x}_m - \hat{\mathbf{x}}_m\|^2 + \beta \sum_j \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$

where $\beta$ regularizes the sparsity penalty, $\rho$ is the imposed sparsity, and $\hat{\rho}_j$ is the sparsity observed at the $j$th hidden node, measured as its average activation over the patterns in the mini-batch.
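The encoder, decoder, and sparsity-regularized cost described above can be sketched as follows. The concrete values of `rho` and `beta` are illustrative defaults, not the paper's settings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ae_cost(W, b, W2, b2, X, rho=0.05, beta=0.1):
    """Reconstruction + KL-sparsity cost of a single autoencoder.

    X:  (m, n) mini-batch of patterns.
    W:  (d, n) encoder weights, b: (d,) encoder bias.
    W2: (n, d) decoder weights, b2: (n,) decoder bias.
    rho (imposed sparsity) and beta (sparsity weight) are
    illustrative values, not the paper's configuration.
    """
    H = sigmoid(X @ W.T + b)        # encoder responses, (m, d)
    Xhat = sigmoid(H @ W2.T + b2)   # decoder reconstruction, (m, n)
    recon = np.mean(np.sum((X - Xhat) ** 2, axis=1))
    # Observed sparsity: mean activation of each hidden node over the batch.
    rho_hat = np.clip(H.mean(axis=0), 1e-8, 1 - 1e-8)
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return recon + beta * kl
```

In practice the gradients of this cost with respect to all four parameter blocks drive the stochastic gradient descent updates.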
The SAE-DNN consists of $L$ cascade connected AEs followed by a softmax regression layer, known as the target layer, with $\hat{\mathbf{y}}$ as its output. The number of output nodes in this layer is equal to the number of class labels, and the complete DNN is represented as

$\hat{\mathbf{y}} = \mathrm{softmax}\big(\mathbf{W}_{\mathrm{out}}\, f(\mathbf{W}_L \cdots f(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) \cdots + \mathbf{b}_L) + \mathbf{b}_{\mathrm{out}}\big)$

where $\{\mathbf{W}_1, \ldots, \mathbf{W}_L\}$ are the pre-trained weights of the network obtained as in the earlier section. The weights $\{\mathbf{W}_{\mathrm{out}}, \mathbf{b}_{\mathrm{out}}\}$ are randomly initialized, and convergence of the DNN is achieved through supervised learning with the cost function

$J(\cdot) = -\sum_m \mathbf{y}_m \cdot \log \hat{\mathbf{y}}_m$

during which all the weights are updated to completely tune the DNN.
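A minimal sketch of this forward pass, stacking sigmoid encoder layers under a softmax target layer with a cross-entropy cost, might look as follows (layer sizes are placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # shift for stability
    return e / e.sum(axis=1, keepdims=True)

def sae_dnn_forward(X, encoder_weights, W_out, b_out):
    """Pass X through the stacked encoder layers, then the target layer.

    encoder_weights: list of (W, b) pairs, the pre-trained AE encoders.
    W_out, b_out: randomly initialized softmax target-layer parameters.
    """
    H = X
    for W, b in encoder_weights:
        H = sigmoid(H @ W.T + b)
    return softmax(H @ W_out.T + b_out)

def cross_entropy(Y, Yhat):
    """Supervised cost J over one-hot labels Y, minimized in fine-tuning."""
    return -np.mean(np.sum(Y * np.log(Yhat + 1e-12), axis=1))
```

Fine-tuning then backpropagates the cross-entropy gradient through every layer, updating pre-trained and target-layer weights alike.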
3.2 SAE-DNN adaptation in the target domain
Unsupervised adaptation of SAE weights using systematic dropouts: The first stage of DA utilizes the abundant unlabeled samples available in the target domain to retain nodes which encode domain invariant hierarchical embeddings, while re-tuning the nodes specific to the source domain into nodes specific to the target domain. We follow the concept of systematic node dropouts during training. The number of layers and the number of nodes in the SAE-DNN, however, remain unchanged during domain adaptation. Fig. 2 illustrates the concept.
The weights connecting each of the hidden layers, imported from the SAE-DNN trained in the source domain, are updated in this stage using an auto-encoding mechanism. When each mini-batch in the target domain is fed to the AE built with one of the hidden layers from the SAE-DNN, some of the nodes in the hidden layer exhibit a high response for most of the samples in the mini-batch, while others exhibit a low response. The nodes which exhibit a high response in the mini-batch are representative of domain invariant embeddings which need to be preserved, while those which exhibit a low response are specific to the source domain and need to be adapted to the target domain. We set a free parameter $\tau$, defined as the transfer coefficient, used for defining the saliency metric for the $j$th node in the $l$th layer in terms of the response of the node over the mini-batch under consideration.
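One plausible reading of this mechanism is sketched below: a node's saliency is taken as its mean activation over the mini-batch, and the transfer coefficient fixes the fraction of lowest-saliency nodes dropped in that mini-batch. Both choices are assumptions for illustration, not the paper's exact definition.

```python
import numpy as np

def systematic_dropout_mask(H, tau):
    """Per-mini-batch dropout mask guided by node saliency.

    H:   (m, d) hidden-layer activations for one mini-batch.
    tau: transfer coefficient in [0, 1]; here assumed to be the
         fraction of low-saliency nodes dropped (an interpretation,
         not the paper's stated formula).
    Returns a (d,) 0/1 mask: 1 = node kept (domain invariant,
    high response), 0 = node dropped for this mini-batch.
    """
    saliency = H.mean(axis=0)               # mean response per node
    n_drop = int(round(tau * H.shape[1]))   # how many nodes to drop
    mask = np.ones(H.shape[1])
    if n_drop > 0:
        drop_idx = np.argsort(saliency)[:n_drop]  # lowest-saliency nodes
        mask[drop_idx] = 0.0
    return mask
```

Note that, per the method described above, this mask only gates activations in the forward pass; the subsequent weight update is still applied to all nodes, unlike classical randomized dropout.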
4 Experiments

SAE-DNN architecture: We have a two-layered architecture, where AE1 constitutes the first hidden layer and AE2 the second. The number of nodes at the input corresponds to the chosen patch size in the color retinal images in RGB space. The AEs are unsupervised pre-trained with a fixed learning rate, number of epochs, and sparsity settings, and supervised weight refinement of the SAE-DNN is subsequently performed with its own learning rate and number of epochs. The training configuration of learning rate and epochs was the same in the source and target domains.
Source and target domains: The SAE-DNN is trained in the source domain using patches sampled from the images in the training set of the DRIVE dataset. DA is performed in the target domain using (i) patches from unlabeled images for unsupervised adaptation using systematic dropout, and (ii) patches from labeled images for fine tuning.
Baselines and comparison: We have experimented with the following SAE-DNN baseline (BL) configurations and training mechanisms to comparatively evaluate the efficacy of DA. BL1: SAE-DNN trained in the source domain and deployed in the target domain without DA; BL2: SAE-DNN trained in the target domain with limited samples and deployed in the target domain.
5 Results and Discussion
[Table: Area under ROC curve]
Hierarchical embedding in representations learned across domains: AEs are typically characterized by learning hierarchically embedded representations. The first level of embedding, shown in Fig. 3(g), is over-complete in nature, exhibiting substantial similarity between multiple sets of weights, which promotes sparsity in the second-level embedding in Fig. 3(h). Some of these weight kernels are domain invariant, and as such remain preserved after DA, as observed in Fig. 3(i) and Fig. 3(j). Some of the kernels, which are domain specific, exhibit significant dissimilarity between the source domain in Figs. 3(g) and 3(h) vs. the target domain in Figs. 3(i) and 3(j). These differences arise from the dissimilarity of sample statistics between the domains, as illustrated earlier in Fig. 1, and substantiate the ability of DASA to retain nodes common across the source and target domains while re-tuning domain specific nodes.
Accelerated learning with DA: The advantage of DA is the ability to transfer knowledge from the source domain so as to learn with fewer labeled samples and an ample number of unlabeled samples in the target domain, when directly learning in the target domain does not yield the desired performance. Figs. 3(k) and 3(l) compare learning using ample unlabeled data exclusively in the source and target domains vs. with DA. Fig. 3(m) presents the acceleration of learning with DA in the target domain vs. learning exclusively with an insufficient number of labeled samples.
Importance of the transfer coefficient: The transfer coefficient $\tau$ governs the quantum of knowledge transfer from the source to the target domain by deciding the number of nodes to be dropped while adapting with ample unlabeled samples. This makes it a critical parameter to set in DASA in order to avoid over-fitting and negative transfer, as illustrated in Table 2, which reports the optimal $\tau$. Generally, one range of $\tau$ is associated with large margin transfer between domains when they are not very dissimilar, and a different range otherwise.
We have presented DASA, a method for knowledge transfer from an SAE-DNN trained with ample labeled samples in the source domain to a target domain where the number of labeled samples is insufficient to directly train a model for the task at hand. DASA is based on systematic dropout for adaptation, and is able to utilize (i) ample unlabeled samples and (ii) a limited number of labeled samples in the target domain. We experimentally demonstrate its efficacy in solving the problem of vessel segmentation when trained on the DRIVE dataset (source domain) and adapted for deployment on the STARE dataset (target domain). We observe that DASA outperforms the different baselines and also exhibits accelerated learning due to knowledge transfer. While systematic dropout is demonstrated on an SAE-DNN in DASA, it can be extended to other deep architectures as well.
We acknowledge NVIDIA for partially supporting this work through GPU Education Center at IIT Kharagpur.
-  M. D. Abràmoff, M. K. Garvin, and M. Sonka. Retinal imaging and image analysis. IEEE Rev. Biomed. Eng., 3:169–208, 2010.
-  Y. Bengio. Learning deep architectures for AI. Found. Trends Mach. Learn., 2(1):1–127, 2009.
-  M. Chen, Z. Xu, K. Weinberger, and F. Sha. Marginalized denoising autoencoders for domain adaptation. In Proc. Int. Conf. Mach. Learn., pages 767–774, 2012.
-  J. Deng, Z. Zhang, F. Eyben, and B. Schuller. Autoencoder-based unsupervised domain adaptation for speech emotion recognition. IEEE Signal Process. Lett., 21(9):1068–1072, 2014.
-  S. Haykin. Neural Networks and Learning Machines. Pearson Education, 2011.
-  S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Trans. Knowl. Data Eng., 22(10):1345–1359, 2010.
-  Y. Peng, S. Wang, and B.-L. Lu. Marginalized denoising autoencoder via graph regularization for domain adaptation. In Proc. Neural Inf. Process. Sys., pages 156–163, 2013.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, 2014.
-  Z. Wang, J. Yang, H. Jin, E. Shechtman, A. Agarwala, J. Brandt, and T. S. Huang. Real-world font recognition using deep network and domain adaptation. In Proc. Int. Conf. Learning Representations, page arXiv:1504.00028, 2015.
-  Y. Yang and J. Eisenstein. Fast easy unsupervised domain adaptation with marginalized structured dropout. In Proc. Assoc. Comput. Linguistics, 2014.