Adversarial Transfer Learning for Cross-domain Visual Recognition
In many practical visual recognition scenarios, the feature distribution of the source domain differs from that of the target domain, which gives rise to cross-domain visual recognition problems. To address this visual domain mismatch, we propose a novel semi-supervised adversarial transfer learning approach, called Coupled adversarial transfer Domain Adaptation (CatDA), for distribution alignment between two domains. CatDA is inspired by cycleGAN but uses multiple shallow multilayer perceptrons (MLPs) instead of deep networks. Specifically, CatDA comprises two symmetric and slim sub-networks that together form a coupled adversarial learning framework. Owing to the symmetry of the two generators, data from the source (target) domain can be fed into the MLP network to generate target (source) domain data, supervised by two coupled, mutually adversarial discriminators. Notably, to avoid the critical flaw of a high-capacity feature extraction function during domain adversarial training, a domain specific loss and a domain knowledge fidelity loss are introduced in each generator to safeguard the effectiveness of the transfer network. The essential difference from cycleGAN is that our method aims to generate domain-agnostic, aligned features for domain adaptation and transfer learning rather than to synthesize realistic images. Experiments on a number of benchmark datasets show that the proposed approach achieves competitive performance compared with state-of-the-art domain adaptation and transfer learning approaches.
In cross-domain visual recognition systems, traditional (task-specific) classifiers usually do not work well on tasks that are semantically related but differently distributed. A typical cross-domain problem is presented in Fig. 1, which shows example images with similar semantics but different distributions. From Fig. 1, it is easy to see that a classifier trained on the images in the first row cannot work well when classifying the remaining images, owing to the explicit heterogeneity of the visual tasks. Mathematically, the reason is that the training data and test samples have different feature distributions (i.e., data bias) and do not satisfy the independent and identically distributed (i.i.d.) condition [38, 32, 57, 45, 53]. Additionally, according to statistical machine learning theory, training an accurate classification model requires sufficient labeled samples. However, collecting data is an expensive and time-consuming process. Generally, this data problem can be relieved by exploiting label information from related source domains and leveraging a number of unlabeled instances from the target domain for recognition model training. Although the scarcity of data and labels can be partially solved by fusing data drawn from multiple domains, this introduces another dilemma: domain discrepancy. Recently, domain adaptation (DA) [6, 23, 58] techniques, which can effectively ease the domain shift problem, have been proposed and have demonstrated great success on various cross-domain visual datasets.
It is of great practical importance to explore DA methods. DA allows a machine learning model to adapt itself across multiple visual knowledge domains, i.e., a model trained on one data domain can be adapted to another domain, which is the key objective of this paper. The fundamental underlying assumption is that, although the domains differ, there is sufficient commonality to support such adaptation.
Most existing DA algorithms seek to bridge the gap between domains by reconstructing a common feature subspace for general feature representation. In this paper, we reformulate DA as a conditional image generation problem, which bridges the gap by generating domain specific data. The mapping function from one domain to another can be viewed as the modeling process of a generator, which achieves automatic domain shift alignment during data sampling and generation. Recently, the generative adversarial network (GAN), which generates user-defined images through an adversarial mechanism between a generator and a discriminator, has become a mainstream DA approach. This is usually modeled by minimizing the approximate domain discrepancy via an adversarial objective function. A GAN generally carries two networks, a generator and a discriminator, which work against each other. The generator is trained to produce images that can confuse the discriminator, while the discriminator tries to distinguish the generated images from real images. This adversarial strategy is well suited to the DA problem [47, 41], because domain discrepancy can be reduced by adversarial generation. Therefore, this confrontation principle is exploited to ensure that the discriminator cannot distinguish the source domain from the generated target domain. DANN is one of the first works on deep domain adaptation, in which the adversarial mechanism of GAN was used to bridge the gap between the source and target domains. Similarly, the GAN-inspired domain adaptation method ADDA, built on a convolutional neural network (CNN) architecture, has also achieved surprisingly good performance. Motivated by the success of GAN in domain adaptation, this work studies an adversarial domain adaptation framework with domain generators and domain discriminators for cross-domain visual recognition.
It is worth noting that, in GANs [35, 11], the realism of the generated images is important. However, the purpose of DA methods is to reduce the domain discrepancy, and the realism of the generated images matters much less. Therefore, our focus lies in domain distribution alignment for cross-domain visual recognition rather than pure image generation as in GAN. To this end, we propose a simple, slim but effective Coupled adversarial transfer based domain adaptation (CatDA) algorithm. Specifically, CatDA is formulated with a slim and symmetric multilayer perceptron (MLP) structure instead of a deep structure for generative adversarial adaptation. CatDA comprises two symmetric and coupled sub-networks, each of which is formulated with a generator, a discriminator, a domain specific term and a domain knowledge fidelity term. CatDA is then implemented by coupled learning of the two sub-networks. Owing to the symmetry, each domain can be generated from the other with an adversarial mechanism supervised by the coupled discriminators, so that the network remains compatible with generation in either direction.
To avoid the critical flaw of a high-capacity network mapping function during domain adversarial training, a semi-supervised mode is adopted in our method. In addition, a content fidelity term and a domain loss term are introduced in the generators to jointly preserve domain knowledge in the source and target domains. The structure of CatDA can be simply described as two generators and two discriminators. Specifically, the main contributions and novelty of this work are fourfold:
In order to reduce the distribution discrepancy between domains, we propose a simple but effective coupled adversarial transfer network (CatDA), which is a slim and symmetric adversarial domain adaptation network structured by shallow multilayer perceptrons (MLPs). Through the proposed network, source and target domains can be generated against each other with an adversarial mechanism supervised by the coupled discriminators.
Inspired by cycleGAN, CatDA adopts a cycling structure and formulates a generative adversarial domain adaptation framework comprising two sub-networks, each of which carries a generator and a discriminator. The coupled learning algorithm follows a two-way strategy, such that arbitrary domain generation can be achieved without constraining the input to be source or target.
To avoid the limitation of domain adversarial training that the feature extraction function has high capacity, a semi-supervised domain knowledge fidelity loss and a domain specific loss are designed for domain content self-preservation and domain realism during domain alignment. In this way, domain distortion in domain generation is avoided and the domain-agnostic feature representation becomes more stable and discriminative.
A simple but effective MLP network is tailored to handle small-scale cross-domain visual recognition datasets. In this way, both the requirement of large-scale samples and the need for pre-training are avoided. Extensive experiments and comparisons on a number of benchmark datasets demonstrate the effectiveness and competitiveness of the proposed CatDA over state-of-the-art methods.
The rest of this paper is organized as follows. In Section II, the related work in domain adaptation is reviewed. In Section III, we present the proposed CatDA method in detail. In Section IV, the experiments and comparisons on a number of common datasets are presented. The discussion is presented in Section V, and finally Section VI concludes this paper.
II Related Work
Our approach involves domain adaptation (DA) and generative adversarial methods. Therefore, in this section, shallow domain adaptation, deep domain adaptation, and generative adversarial networks are briefly introduced in turn.
II-A Shallow Domain Adaptation
In recent years, a number of shallow learning approaches have been proposed in domain adaptation. Generally, these shallow domain adaptation methods can be divided into three categories.
Classifier based approaches. A generic way in this category is to learn a common classifier on the source domain by leveraging source data and a few labeled target data. Duan et al. proposed an adaptive multiple kernel learning (AMKL) method for video event recognition. A domain transfer MKL (DTMKL) method was also proposed, which jointly learns an SVM and a kernel function for classifier adaptation. Li et al. proposed the cross-domain extreme learning machine for visual domain adaptation [18, 56], in which MMD was formulated to characterize and match the marginal and conditional distributions between domains. Zhang et al. proposed a robust extreme domain adaptation (EDA) based classifier with manifold regularization for cross-domain visual recognition.
Feature augmentation/transformation based approaches. Hoffman et al. proposed the Max-Margin Domain Transforms (MMDT) method, in which a category specific transformation was optimized for domain transfer. Long et al. proposed a Transfer Sparse Coding (TSC) method to learn robust sparse representations, in which the empirical Maximum Mean Discrepancy (MMD) is used as the distance measure. Long et al. also proposed a Transfer Joint Matching (TJM) approach, which aims to learn a non-linear transformation across domains by minimizing the MMD-based distribution discrepancy. Zhang et al. proposed a regularized subspace alignment method for cross-domain odor recognition with signal shift.
Feature reconstruction based approaches. Different from the methods above, these approaches achieve domain adaptation by feature reconstruction between domains. Jhuo et al. proposed the RDALR method, in which the source samples are reconstructed by the target domain under a low-rank constraint. Similarly, Shao et al. proposed the LTSL method, which pre-learns a subspace through principal component analysis (PCA) or linear discriminant analysis (LDA) and then models low-rank regularization across domains. Zhang et al. proposed a Latent Sparse Domain Transfer (LSDT) approach that jointly learns a common subspace and a sparse reconstruction matrix across domains, and reported good results.
II-B Deep Domain Adaptation
The other category of data-driven domain adaptation methods is based on deep learning, which has witnessed a series of great successes [46, 36, 51, 34, 33]. However, for small-data tasks, deep learning methods may not improve performance much. Thus, deep domain adaptation methods for small-scale tasks have recently emerged.
Donahue et al. proposed a deep transfer strategy that leverages a CNN trained on the large-scale ImageNet for small-scale object recognition tasks. Similarly, Razavian et al. proposed to train a deep network on ImageNet as a high-level domain feature extractor. Tzeng et al. proposed a CNN based DDC method that aligns both domains and tasks. Long et al. proposed DAN, a deep transfer network that applies an MMD loss to the high-level features between the two-streamed fully-connected layers from two domains. Long et al. also proposed two other well-known methods: the residual transfer network (RTN), which aims to learn a residual function between the source and target domains, and the joint adaptation network (JAN), which aligns the joint distributions across domains with a joint maximum mean discrepancy (JMMD) criterion. Hu et al. proposed a deep transfer metric learning (DTML) method that leverages MLPs instead of a CNN; its novelty is to learn a set of hierarchical nonlinear transformations. The autoencoder is an unsupervised feature representation method, and Wen et al. proposed a deep autoencoder based feature reconstruction for domain adaptation, which aims to share the feature representation parameters between source and target domains. Recently, Chen et al. proposed a broad learning system, as an alternative to deep systems, which can also be considered for transfer learning.
II-C Generative Adversarial Networks
The generative adversarial network (GAN) was first proposed by Goodfellow et al. to generate images and has had a high impact on deep learning. A GAN generally comprises two operators: a generator (G) and a discriminator (D). The discriminator judges whether a sample is fake or real, while the generator produces samples that are as realistic as possible to fool the discriminator. Mirza et al. proposed a conditional generative adversarial net (CGAN) in which both networks G and D receive an additional information vector as input. Salimans et al. achieved state-of-the-art results in semi-supervised classification and improved the visual realism and image quality compared with GAN. Zhu et al. proposed cycleGAN for discovering cross-domain relations and transferring style from one domain to another, which is similar to DiscoGAN and DualGAN. The key attributes such as orientation and face identity are preserved.
DANN is one of the first works in deep domain adaptation, in which the adversarial mechanism of GAN was used to bridge the gap between the source and target domains. The ADDA method was proposed for adversarial domain adaptation, in which a convolutional neural network (CNN) was used for adversarial discriminative feature learning. A GAN based model that adapts source domain images to appear as if drawn from the target domain was also proposed, with a focus on domain image generation. These three works have shown the potential of adversarial learning in domain adaptation. Additionally, the CyCADA method was proposed for cycle generation, in which the representations at both the pixel level and the feature level are adapted by enforcing cycle-consistency.
III The Proposed CatDA
In our method, the source domain is defined by subscript “” and the target domain is defined by subscript “”, respectively. The training sets of the source and target domains are defined as and , respectively. denotes the data labels. A generator network is denoted as : , that embeds data from source domain to its co-domain . The discriminator network is denoted as , which tries to discriminate the real samples in target domain and the generated samples in co-domain . Similarly, aims to map the data from target domain to its co-domain , and tries to discriminate the real samples in source domain and the generated samples in co-domain . “ ” represents the generated target data from ; “ ” represents the generated source data from . represents the unlabeled target test data.
Direct supervised learning of an effective classifier on the target domain is not feasible because of label scarcity. Therefore, in this paper, we ask whether target data can be generated from the labeled source data, such that a classifier can be trained on the generated target data with labels. Our key idea is to learn a “source-to-target” generative feature representation through an adversarial domain generator, so that a domain-agnostic classifier can be learned on the generated features for cross-domain applications. Notably, our aim is to minimize the feature discrepancy between domains by generating similarly distributed features rather than generating a vivid target image. Therefore, a simple and flexible network is preferable for homogeneous image feature generation, instead of a very complicated structure (encoder vs. decoder) for realistic image rendering.
The structures of the standard GAN and the conditional GAN are shown in Fig. 2 (a) and Fig. 2 (b), respectively. Both models have limitations. In the standard GAN, explicitly supervised data is seldom available, and the randomly generated samples can become unreliable if the content information is lost, so the trained classifier may not work well. In the conditional GAN, although a label constraint is imposed, the cross-domain relation is not guaranteed because of the one-directional domain mapping. Since the conditional GAN architecture only learns a mapping from one domain to another (one-way), a two-way coupled adversarial domain generation method with more freedom is proposed in this paper, as shown in Fig. 2 (c). The core of our CatDA model is a pair of symmetric GANs, which result in a pair of symmetric generative and discriminative functions. The two-way generator function can be regarded as a bijective mapping, and the workflow of the proposed CatDA in implementation is described as follows.
First, different from GAN, an image or feature instead of noise is fed as input into the model. Way-1 of CatDA comprises the generator and the discriminator . Way-2 comprises the generator and the discriminator . For way-1, the source data is fed into the generator, and the co-target data is generated. Then the generated target data and the real target data are fed into the discriminator network for adversarial training. For way-2, an operation similar to way-1 is conducted. In order to achieve the bijective mapping, we expect that the real source data can be recovered by feeding the generated into the generator for progressive learning supervised by . Similarly, is also fine-tuned by feeding the supervised by to recover the real target training data.
III-C Model Formulation
To avoid the limitation of a high-capacity feature extraction function in domain adversarial training, a semi-supervised strategy is used in CatDA to preserve the content information of the generated samples. We process the samples class by class, which results in a semi-supervised structure. The training data of each class in the source and target domains are preprocessed independently, so the number of networks to be trained equals the number of classes. The pipeline of the class-wise CatDA is shown in Fig. 3, in which the class conditional information is used. For each CatDA model, the generator is a two-layered perceptron and the discriminator is a three-layered perceptron. The sigmoid function is selected as the activation function in the hidden layers. The network structure of the joint generator and discriminator is described in Fig. 4.
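The class-wise processing described above can be illustrated with a minimal sketch that pairs the source and target samples of each class, so that one model can be trained per pair. The function names `group_by_class` and `pair_class_subsets` are hypothetical, and restricting the pairs to classes that appear in both domains is an assumption made here for illustration.

```python
from collections import defaultdict


def group_by_class(features, labels):
    """Collect the feature vectors that belong to each class label."""
    groups = defaultdict(list)
    for x, y in zip(features, labels):
        groups[y].append(x)
    return dict(groups)


def pair_class_subsets(src_x, src_y, tgt_x, tgt_y):
    """Return {label: (source_subset, target_subset)}.

    Assumption: only classes with labeled samples in both domains are paired.
    One class-wise model would then be trained on each pair.
    """
    src = group_by_class(src_x, src_y)
    tgt = group_by_class(tgt_x, tgt_y)
    return {c: (src[c], tgt[c]) for c in sorted(src) if c in tgt}


# Toy data: two classes, two-dimensional features.
src_x = [[0.1, 0.2], [0.3, 0.1], [0.9, 0.8]]
src_y = [0, 0, 1]
tgt_x = [[0.2, 0.3], [0.8, 0.7]]
tgt_y = [0, 1]

pairs = pair_class_subsets(src_x, src_y, tgt_x, tgt_y)
print(len(pairs))  # number of class-wise models to train
```

With this grouping, the number of (source, target) pairs equals the number of shared classes, which matches the statement that the number of networks equals the number of classes.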
The proposed CatDA model has a symmetric structure comprising two generators and two discriminators, which are described as two ways across domains ( and ). We first describe the model of way-1 (), which shares the same model and algorithm as way-2 ().
A target domain discriminator is formulated to classify whether a sample is drawn from the target domain (real) or generated. Thus, the discriminator loss is formulated as
where . is the generator for generating realistic data similar to target domain. Therefore, the supervised generator loss is formulated as
The two losses in Eq. (1) and (2) are the inherent loss functions in traditional GAN models. The focus of CatDA is to reduce the distribution difference across domains. Therefore, in the proposed CatDA model, two novel loss functions are proposed, which are the domain specific loss (domain loss) and the domain knowledge fidelity loss (content loss).
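As a minimal sketch of the two traditional GAN losses referred to above, the following snippet computes a binary cross-entropy discriminator loss and a generator loss from discriminator outputs. It assumes the textbook GAN form with the commonly used non-saturating generator loss, which is not necessarily the exact form of Eq. (1) and (2); the function names and the constant `eps` are illustrative.

```python
import math


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy with real samples labeled 1 and generated samples labeled 0.

    d_real / d_fake: discriminator outputs in (0, 1) on real / generated samples.
    eps is an illustrative constant that guards against log(0).
    """
    real_term = -sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss (an assumption): push D's output on
    generated samples toward 1."""
    return -sum(math.log(p + eps) for p in d_fake) / len(d_fake)


# A confident discriminator yields a low discriminator loss and a high generator loss.
print(discriminator_loss([0.9, 0.8], [0.2, 0.1]))
print(generator_loss([0.2, 0.1]))
```

The generator loss decreases as the discriminator assigns higher scores to generated samples, which is the adversarial pressure that drives the generated data toward the target domain.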
One feasible strategy for reducing the domain discrepancy is to find an abstract feature representation under which the source and target domains are similar. This idea has been explored in [26, 30, 29] by leveraging the Maximum Mean Discrepancy (MMD) criterion, which is used when the source and target data distributions differ. Therefore, in this paper, we build the domain specific loss on a simple and approximate MMD, which is formalized to maximize the two-sample test power and minimize the Type II error (the failure to reject a false null hypothesis). For convenience, we define the proposed domain loss as , which is minimized to help the learning of the generator as shown in Fig. 2 (c). Specifically, in order to reduce the distribution mismatch between the generated target data and the original target data , the domain specific loss (domain loss) can be formulated as
where denotes the -norm, and represent the centers of the co-target data and target data, respectively. Notably, during the network training phase, the sigmoid function is imposed on the domain loss so that the output is normalized as a probability in . Therefore, the target domain loss shown in Eq. (3) can be further written as
|Algorithm 1 The Proposed CatDA|
|Input: Source data ,|
|target training data ,|
|iterations , , the number “ ” of classes;|
|1. Initialize and using traditional GAN:|
|Step1: Train the generator and discriminator by solving|
|problem (1) and (2) using back-propagation (BP) algorithm;|
|Step2: Compute ;|
|Step3: Train the generator and discriminator by solving|
|the problem (7) using BP algorithm;|
|Step4: Compute ;|
|2. Update and using the proposed model:|
|Step1: Train the generator and the discriminator , by|
|solving the problem (1) and (6) using BP algorithm;|
|Step2: Update and ;|
|Step3: Train the generator and the discriminator , by|
|solving the problem (7) and (8) using BP algorithm;|
|Step4: Update and ;|
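The domain loss described before Algorithm 1 compares the center of the generated target data with the center of the real target data and passes the result through a sigmoid. A minimal sketch under stated assumptions is given below: the squared Euclidean norm and the direct application of the sigmoid to the distance are assumptions, and all function names are illustrative.

```python
import math


def column_means(rows):
    """Center (per-dimension mean) of a set of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def center_distance(generated, real):
    """Squared Euclidean distance between the two centers (norm choice assumed)."""
    mu_g = column_means(generated)
    mu_r = column_means(real)
    return sum((a - b) ** 2 for a, b in zip(mu_g, mu_r))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def domain_loss(generated, real):
    """Center distance squashed by a sigmoid (assumed form).

    Because the distance is non-negative, the value is at least 0.5 and
    equals 0.5 when the two centers coincide.
    """
    return sigmoid(center_distance(generated, real))


generated = [[0.0, 0.0], [2.0, 2.0]]  # center is (1, 1)
print(domain_loss(generated, [[1.0, 1.0]]))  # centers coincide
print(domain_loss(generated, [[3.0, 3.0]]))  # centers differ
```

Minimizing this quantity with respect to the generator pulls the center of the generated data toward the center of the real target data, which is the first-order (mean-matching) view of MMD.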
Additionally, to preserve the content of the source data, we establish a content fidelity term in our model. Ideally, the equality should be satisfied, that is, the generation is reversible. However, this hard constraint is difficult to guarantee, and a relaxed soft constraint is more desirable. To this end, we minimize the distance between and and formulate a content loss function , i.e., the source content loss, as follows
where , and is a generator of way-2 (). Finally, the objective function of the way-1 generator is composed of 3 parts:
For way-2, models similar to those of way-1 are formulated, including the source discriminator loss , the source data generator loss , the source domain loss , and the target content loss . Specifically, the loss functions of way-2 can be formulated as follows
where and . and are the centers of the generated source data and the real source data. Similar to Eq. (6), the objective function of the way-2 generator can be formulated as
Complete CatDA Model:
The proposed CatDA model is a coupled net of way-1 and way-2, each of which learns the bijective mapping from one domain to another. The two ways in CatDA are jointly trained in an alternating manner. The generated data and are fed into the discriminators and , respectively.
By joint learning of way-1 and way-2, the complete model of CatDA, including the generator and the discriminator, can be formulated as follows.
In detail, the complete training procedure of the proposed CatDA approach is summarized in Algorithm 1.
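To make the three-part way-1 generator objective described above more concrete, the following sketch evaluates an adversarial term, a domain term and a content term for given generator and discriminator callables. The equal weighting of the three terms, the squared Euclidean distances, the unsquashed domain term and all names (`way1_generator_terms`, `g_st`, `g_ts`, `d_t`) are assumptions for illustration, not the exact formulation of the method.

```python
import math


def center(rows):
    """Per-dimension mean of a set of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def way1_generator_terms(x_s, x_t, g_st, g_ts, d_t, eps=1e-12):
    """Return (adversarial, domain, content) terms for the way-1 generator.

    g_st maps source features toward the target domain, g_ts maps them back,
    and d_t scores how target-like a sample is (output in (0, 1)).
    """
    x_st = [g_st(x) for x in x_s]    # generated target data
    x_sts = [g_ts(x) for x in x_st]  # source data recovered through way-2
    # Adversarial term: generated data should be scored as real by d_t.
    adv = -sum(math.log(d_t(x) + eps) for x in x_st) / len(x_st)
    # Domain term: distance between generated and real target centers.
    dom = sum((a - b) ** 2 for a, b in zip(center(x_st), center(x_t)))
    # Content term: recovered source data should stay close to the original.
    con = sum(
        sum((a - b) ** 2 for a, b in zip(rec, orig))
        for rec, orig in zip(x_sts, x_s)
    ) / len(x_s)
    return adv, dom, con


# Toy stand-ins for trained networks (purely illustrative).
def shift_up(x):
    return [v + 1.0 for v in x]


def shift_down(x):
    return [v - 1.0 for v in x]


def undecided(x):
    return 0.5


adv, dom, con = way1_generator_terms(
    [[0.0, 0.0], [2.0, 2.0]], [[2.0, 2.0]], shift_up, shift_down, undecided
)
total = adv + dom + con  # equal weighting is an assumption
print(total)
```

In the toy example the generators are exact inverses and the generated center matches the target center, so only the adversarial term contributes; in training, all three terms would be minimized jointly with respect to the generator parameters.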
For classification, general classifiers can be trained on the domain aligned and augmented training samples with label . Note that is the output of Algorithm 1. Finally, the recognition accuracy on the unlabeled target test data is reported and compared.
The whole procedure of the proposed CatDA for cross-domain visual recognition is described in Algorithm 2, following which the experiments are conducted to verify the effectiveness and superiority of the proposed method.
|Algorithm 2 The Proposed Cross-domain Visual Recognition Method|
|Input: Source data , a few target training data ,|
|source label , target label , target test data ;|
|1. Step1: Compute by CatDA method using Algorithm 1;|
|2. Step2: Train the classifier on augmented training data|
|with label ;|
|3. Step3: Predict the label by the classifier, i.e. .|
|Output: Predicted label .|
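Steps 2 and 3 of Algorithm 2 can be sketched as follows, assuming that the augmented training set consists of the generated target features (which inherit the source labels) together with the few labeled target samples. A nearest-centroid classifier is used here only as a simple stand-in for the general classifier; it and the function names are assumptions for illustration.

```python
def fit_centroids(features, labels):
    """Train a nearest-centroid classifier (stand-in for a general classifier)."""
    grouped = {}
    for x, y in zip(features, labels):
        grouped.setdefault(y, []).append(x)
    return {
        y: [sum(col) / len(rows) for col in zip(*rows)]
        for y, rows in grouped.items()
    }


def predict_one(centroids, x):
    """Return the label of the closest class centroid."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
    )


def recognize(x_generated, y_source, x_target, y_target, x_test):
    """Train on the augmented data, then predict labels of target test data."""
    train_x = x_generated + x_target  # assumed composition of the augmented set
    train_y = y_source + y_target
    centroids = fit_centroids(train_x, train_y)
    return [predict_one(centroids, x) for x in x_test]


predicted = recognize(
    x_generated=[[0.0, 0.1], [1.0, 0.9]],
    y_source=[0, 1],
    x_target=[[0.1, 0.0], [0.9, 1.0]],
    y_target=[0, 1],
    x_test=[[0.05, 0.05], [0.95, 0.95]],
)
print(predicted)
```

Any standard classifier could replace the nearest-centroid rule here, since the adaptation itself happens in the generated features rather than in the classifier.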
In CatDA, one key difference from previous GAN models is the two-way coupled architecture, in each way of which a domain loss and a content loss are designed for domain alignment and content fidelity. Note that the proposed CatDA has a structure similar to cycleGAN, but differs essentially in the following aspects.
CatDA aims to achieve domain adaptation in feature representation for cross-domain applications through domain alignment and content preservation, rather than to generate realistic pictures. Therefore, a simple yet effective shallow multilayer perceptron model, instead of a deep model, is proposed in our approach.
To avoid the limitation of domain adversarial training that the feature extraction function has high capacity, a semi-supervised domain adaptation model is adopted in this paper to help preserve the rich content information.
To minimize the domain discrepancy while preserving the content information, a novel domain loss and a content loss are designed. In contrast, cycleGAN focuses on image generation rather than cross-domain visual recognition.
IV-A Experimental Protocol
In our method, the total number of layers is set to 3. The number of neurons in the output layer of the generator network is the same as the number of input neurons (i.e., the feature dimensionality). The output is then fed into the discriminator network, where the number of neurons in the hidden layer is set to 100. The number of neurons in the output layer of the discriminator is set to 1. The network weights are optimized by the gradient descent based back-propagation algorithm.
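The layer sizes stated above can be illustrated with a small forward-pass sketch of one way of the network: a generator layer whose output has the input dimensionality, followed by a discriminator with 100 hidden neurons and a single output. The weight initialization range, the sigmoid on the generator output and the names `build_way` and `forward` are assumptions; training by back-propagation is omitted.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Dense layer with small random weights (range assumed) and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense_sigmoid(layer, x):
    """Apply a dense layer followed by the sigmoid activation."""
    weights, biases = layer
    return [
        sigmoid(sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def build_way(dim, hidden=100, seed=0):
    """One way: generator (dim -> dim), discriminator (dim -> hidden -> 1)."""
    rng = random.Random(seed)
    return {
        "generator": init_layer(dim, dim, rng),
        "disc_hidden": init_layer(dim, hidden, rng),
        "disc_out": init_layer(hidden, 1, rng),
    }


def forward(net, x):
    """Generate a feature vector and score it with the discriminator."""
    generated = dense_sigmoid(net["generator"], x)
    hidden = dense_sigmoid(net["disc_hidden"], generated)
    score = dense_sigmoid(net["disc_out"], hidden)[0]
    return generated, score


net = build_way(dim=8)
generated, score = forward(net, [0.5] * 8)
print(len(generated), score)
```

Because the generator preserves the feature dimensionality, its output can be fed either to the discriminator or to the generator of the opposite way without any reshaping.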
IV-B Compared Methods
The proposed model is flexible and simple, and is therefore regarded as a shallow domain adaptation approach. Accordingly, both shallow features (e.g., pixel-level and hand-crafted features) and deep features can be fed as input.
In the shallow protocol, the following typical DA approaches are exploited for performance comparison.
No Adaptation (NA): a naive baseline that learns a linear SVM classifier on the source domain and then applies it to the target domain, which could be regarded as a lower bound;
Geodesic Flow Kernel (GFK): a classic cross-domain subspace learning approach that aligns two domains with a geodesic flow on a Grassmann manifold;
Subspace Alignment (SA): a widely-used feature alignment approach that projects the source subspace to the target subspace computed by PCA;
Robust Domain Adaptation via Low Rank (RDALR): a transfer approach that reconstructs the rotated source data with the target data via low-rank representation;
Discriminative Transfer Subspace Learning (DTSL): a subspace-based reconstruction approach that aligns the source to the target by joint low-rank and sparse representation;
Latent Sparse Domain Transfer (LSDT): a reconstruction-based latent sparse domain transfer method that jointly learns the subspace and the sparse representation.
In Office-31 recognition, we follow ADGANet and compare with some recent deep DA approaches, such as DDC, DAN, RTN, DANN, ADDA, JAN and ADGANet. For fair comparison, the deep features of the Office-31 dataset are extracted by ResNet-50 with a convolutional neural network architecture. Additionally, we also compare with some shallow semi-supervised methods.
In the deep protocol for cross-domain handwritten digit recognition, we compare with recent deep DA approaches, such as DANN, Domain confusion, CoGAN, ADDA and ADGANet. Note that, for fair comparison, the deep features of the handwritten digit datasets, extracted using LeNet with a convolutional neural network architecture, are fed as the input of our CatDA model.
IV-C Comparison with Shallow Domain Adaptation
In this section, five benchmark datasets are used for cross-domain visual recognition: (1) the 4DA office dataset, (2) the Office-31 dataset, (3) the COIL-20 object dataset, (4) the MSRC-VOC 2007 datasets and (5) the cross-domain handwritten digits datasets. Some samples from the 4DA office dataset are shown in Fig. 5. Several example images from the COIL-20 object dataset are shown in Fig. 6, several example images from the MSRC and VOC 2007 datasets are shown in Fig. 7, and several example images from the handwritten digits datasets are shown in Fig. 8. The cross-domain images exhibit significant visual heterogeneity, from which multiple cross-domain tasks are naturally formulated.
Office 4DA (Amazon, Webcam, DSLR and Caltech 256) datasets:
There are four domains in Office 4DA datasets, which are Amazon (A), Webcam (W)
The asterisk (*) indicates that our method is used in an unsupervised manner, in which case the results are not as good.
Office 3DA (Office-31 dataset):
This dataset contains three domains: Amazon (A), Webcam (W) and Dslr (D). It contains 4,652 images from 31 object classes. With each domain serving as source and target alternately, 6 cross-domain tasks are formed, e.g., , , etc. In the experiments, we follow an existing experimental protocol for the semi-supervised strategy. In our method, 3 images per class are selected as target training data, while the remaining samples in the target domain are used for testing. The recognition accuracy is reported in Table II, where our method achieves competitive results. Notably, we do not use any auxiliary or discriminative methods, only the simple MLP network.
Columbia Object Image Library (COIL-20) dataset:
The COIL-20 dataset
The MSRC dataset contains 4323 images from 18 classes, and the VOC 2007 dataset contains 5011 images with 20 concepts. These two datasets share 6 semantic categories: airplane, bicycle, bird, car, cow and sheep. In this way, the two domains are constructed to share the same label set. We follow an existing cross-domain experimental protocol. We select 1269 images from MSRC as the source domain and 1530 images from VOC 2007 as the target domain to construct the cross-domain task MSRC vs. VOC 2007. Then we switch the two datasets (VOC 2007 vs. MSRC) to construct the other task. For feature extraction, all images are uniformly re-scaled to 256 pixels, and the VLFeat open source package is used to extract 128-dimensional dense SIFT (DSIFT) features. Then k-means clustering is leveraged to obtain a 240-dimensional codebook.
Following prior work, the source training set contains all the labeled samples in the source domain, 4 labeled target samples per class randomly selected from the target domain form the labeled target training data, and the remaining unlabeled examples are used as the target testing data. The experimental results of different domain adaptation methods are shown in Table IV. From the results, we observe that the proposed CatDA outperforms the other DA methods.
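The semi-supervised split used in these experiments (a fixed number of randomly selected labeled target samples per class for training, the rest for testing) can be sketched as follows. The function name `split_target` and the fixed seed are illustrative assumptions; the sketch requires each class to contain at least `per_class` samples.

```python
import random
from collections import defaultdict


def split_target(labels, per_class, seed=0):
    """Pick `per_class` random indices per class for training; the rest are test.

    Raises ValueError (from random.sample) if a class has fewer than
    `per_class` samples.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train_idx = []
    for y in sorted(by_class):
        train_idx.extend(rng.sample(by_class[y], per_class))
    chosen = set(train_idx)
    test_idx = [i for i in range(len(labels)) if i not in chosen]
    return sorted(train_idx), test_idx


# Toy target domain: two classes with ten samples each, 4 labeled samples per class.
labels = [0] * 10 + [1] * 10
train_idx, test_idx = split_target(labels, per_class=4)
print(len(train_idx), len(test_idx))
```

Repeating the call with different seeds gives the multiple random splits over which average accuracies are typically reported.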
Cross-domain handwritten digits datasets:
Three handwritten digits datasets: MNIST
IV-D Comparison with Some Deep Transfer Learning
In this section, experiments are conducted on the Office-31 dataset and the handwritten digits datasets for comparison with state-of-the-art deep transfer learning approaches.
Deep features of Office-31 dataset:
For fair comparison, we extract the deep features of the Office-31 dataset with the ResNet-50 architecture. From the results, we observe that our method ( on average) outperforms state-of-the-art deep domain adaptation methods. In particular, compared with ADGANet , which is a generative method, our accuracy is higher by 4.8%. This demonstrates that our model can effectively alleviate the model bias problem. As with the 4DA dataset, we also report the unsupervised version of our method on this dataset. The asterisk (*) in Table VI indicates that our method is used in an unsupervised manner, for which the results are not as good.
Deep features of handwritten digits datasets:
In this section, a new handwritten digits dataset SVHN
We have experimentally validated the proposed method on the MNIST, USPS and SVHN datasets, which share 10 classes of digits. In our method, the deep features of the three datasets are extracted using the LeNet model provided in the Caffe source code package. For the adaptation tasks between MNIST and USPS, we follow the training protocol established in , where 2,000 images from MNIST and 1,800 images from USPS are randomly sampled. For the adaptation task between SVHN and MNIST, we use the full training sets for comparison against . All the domain adaptation tasks are conducted by following the experimental protocol in . The key difference between ADDA  and our method is that ADDA is a convolutional neural network structured method, while ours is a multilayer perceptron structured method. In ADDA, an essential problem is that the generated samples may be randomly changed (e.g., instead of ). In our CatDA, this problem is handled by establishing multiple class-wise models. In our setting, 10 samples per class from the target domain are randomly selected as target training data, and 5 random splits are considered in total. The average classification accuracies are reported in Table VII. From the cross-domain recognition results, we observe that our CatDA model outperforms most state-of-the-art methods with improvement on average, and is lower only than the very recent ADGANet  method.
V-A Computational Efficiency
Because the proposed CatDA model is simple, a CPU is sufficient for model optimization and training, and no GPU is needed. The time cost is also low, as shown in Table VIII, from which we can observe that computation is quite fast for the three-layer MLP model. Considering that the three-layer shallow network achieves competitive performance with fast computation, we choose the three-layer model in our method. Our experiments are run on a PC with an Intel i7-4790K CPU at 4.00 GHz and 32 GB RAM. Note that the time for data preprocessing and classification is excluded.
V-B Evaluation of Layer Number
To examine the impact of the number of layers, we show the results for different numbers of layers in Table IX, from which we can observe that the performance does not always improve as layers are added and that the three-layer shallow network achieves competitive performance.
| Tasks | 3 layers | 4 layers | 5 layers | 6 layers |
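To make the cost side of the depth comparison concrete, the following sketch counts the trainable parameters of a feature-to-feature MLP as layers are added. The sizes are assumptions chosen for illustration only: a 240-dimensional input and output (the codebook size used for the MSRC and VOC 2007 features), a hypothetical hidden width of 100, and the convention that the layer count includes the input and output layers. None of these values is confirmed as the paper's configuration.

```python
def layer_sizes(feature_dim, hidden_dim, num_layers):
    """Unit counts per layer; `num_layers` includes input and output layers."""
    if num_layers < 2:
        raise ValueError("an MLP needs at least an input and an output layer")
    return [feature_dim] + [hidden_dim] * (num_layers - 2) + [feature_dim]


def count_parameters(sizes):
    """Weights plus biases of all fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


# Hypothetical configuration: 240-d features, hidden width 100.
param_counts = {
    depth: count_parameters(layer_sizes(240, 100, depth)) for depth in range(3, 7)
}
```

Under these assumptions each additional hidden layer adds a fixed number of parameters, so the cost grows steadily with depth, while the accuracy does not necessarily follow.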
V-C Model Visualization
In this section, to gain better insight into the CatDA model, we explore the visualization of the class distribution, which further validates the effectiveness of our model. The t-SNE  visualization method is applied to the source and target domains in the handwritten digits task (shallow domain adaptation) and the 3DA Office task (deep domain adaptation). From (b) and (e) in Fig. 9, it is clear that better clustering is achieved and the discriminative power of the features is improved in the generated data. As a result, the cross-domain visual recognition performance is improved. It is worth noting that, as shown in Fig. 9, the generated features are more discriminative than the raw features. The reason is that, in our semi-supervised domain adaptation method, partial target label information is used for feature generation, which improves the discrimination of the generated features.
The proposed method is an adversarial feature adaptation model, which can be used for semi-supervised and unsupervised domain adaptation. We note two points. 1) The proposed CatDA differs from GAN in that, in its current form, it cannot be used for image synthesis, but only for domain adaptive feature representation. This is because the inputs fed into CatDA are features, such as low-level hand-crafted features or deep features from an off-the-shelf CNN model, rather than image pixels. 2) The proposed model is not CNN based, so it may not be directly comparable with deep networks. However, deep features can be fed into our model for fair comparison, as presented in Table VI.
In this paper, we propose a new transfer learning method from the perspective of feature generation for cross-domain visual recognition. Specifically, a coupled adversarial transfer domain adaptation (CatDA) framework comprising two generators, two discriminators, two domain specific loss terms and two content fidelity loss terms is proposed in a semi-supervised mode for domain and intra-class discrepancy reduction. This symmetric model can achieve a bijective mapping, such that domain features can be generated alternately owing to the reversible characteristic of the proposed model. Considering that our focus is domain feature generation with distribution disparity removed for cross-domain applications, rather than realistic image generation, a shallow yet effective MLP transfer network is adopted. Extensive experiments on several benchmark datasets demonstrate the superiority of the proposed method over some other state-of-the-art DA methods.
In future work, to benefit from the strong feature learning capability of deep neural networks (e.g., convolutional neural networks), a deep adversarial domain adaptation framework with a structure similar to CatDA should be investigated. Compared with CatDA, deep methods extract high-dimensional semantic features to achieve good performance, but they need large-scale samples to train the network. Hence, fine-tuning based parameter transfer from big data to small data can be leveraged to improve network adaptation.
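The feature-level data flow discussed above can be pictured with a small sketch. It is an untrained toy under stated assumptions: the names (`init_mlp`, `forward`, `gen_s2t`, `gen_t2s`, `disc_t`, `disc_s`), the sizes and the tanh activation are hypothetical, and the mean absolute reconstruction gap is a cycleGAN-style stand-in used only for illustration. It does not implement the paper's domain specific or fidelity losses, nor any training.

```python
import math
import random


def init_mlp(sizes, rng):
    """Random weights and zero biases for a fully connected network."""
    return [
        ([[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes, sizes[1:])
    ]


def forward(mlp, x):
    """tanh on hidden layers, linear output layer."""
    for index, (weights, bias) in enumerate(mlp):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
        if index < len(mlp) - 1:
            x = [math.tanh(v) for v in x]
    return x


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)
feature_dim, hidden_dim = 8, 16  # hypothetical sizes

# Two symmetric three-layer generators and two discriminators, all on features.
gen_s2t = init_mlp([feature_dim, hidden_dim, feature_dim], rng)
gen_t2s = init_mlp([feature_dim, hidden_dim, feature_dim], rng)
disc_t = init_mlp([feature_dim, hidden_dim, 1], rng)
disc_s = init_mlp([feature_dim, hidden_dim, 1], rng)

source_feature = [rng.gauss(0.0, 1.0) for _ in range(feature_dim)]
fake_target = forward(gen_s2t, source_feature)   # source -> target-like feature
reconstructed = forward(gen_t2s, fake_target)    # mapped back to the source side
prob_target = sigmoid(forward(disc_t, fake_target)[0])
prob_source = sigmoid(forward(disc_s, reconstructed)[0])
cycle_gap = sum(abs(a - b) for a, b in zip(source_feature, reconstructed)) / feature_dim
```

The point of the sketch is only that every input and output is a feature vector of the same dimension, so nothing in this pipeline produces image pixels.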
The authors are grateful to the AE and anonymous reviewers for their valuable comments on our work.
Shanshan Wang received the B.E. and M.E. degrees from Chongqing University in 2010 and 2013, respectively. She is currently pursuing the Ph.D. degree at Chongqing University. Her current research interests include machine learning, pattern recognition and computer vision.
Lei Zhang (M’14-SM’18) received his Ph.D. degree in Circuits and Systems from the College of Communication Engineering, Chongqing University, Chongqing, China, in 2013. He worked as a Post-Doctoral Fellow with The Hong Kong Polytechnic University, Hong Kong, from 2013 to 2015. He is currently a Professor/Distinguished Research Fellow with Chongqing University. He has authored more than 90 scientific papers in top journals, such as IEEE T-NNLS, IEEE T-IP, IEEE T-MM, IEEE T-IM, IEEE T-SMCA, and top conferences such as ICCV, AAAI, ACM MM, ACCV, etc. His current research interests include machine learning, pattern recognition, computer vision and intelligent systems. Dr. Zhang was a recipient of the Best Paper Award of CCBR2017, the Outstanding Reviewer Award of many journals such as Pattern Recognition, Neurocomputing, Information Sciences, etc., Outstanding Doctoral Dissertation Award of Chongqing, China, in 2015, Hong Kong Scholar Award in 2014, Academy Award for Youth Innovation in 2013 and the New Academic Researcher Award for Doctoral Candidates from the Ministry of Education, China, in 2012.
Jingru Fu received the B.S. degree from Fuzhou University, Fuzhou, China, in 2017. She is currently working towards the M.S. degree in the Learning Intelligence and Vision Essential (LiVE) group at Chongqing University, Chongqing, China. Her current research interests include machine learning, transfer learning and computer vision.
- (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR.
- (2013) Robust visual domain adaptation with low-rank reconstruction. In CVPR, pp. 2168–2175.
- (2018) Broad learning system: an effective and efficient incremental learning system without the need for deep architecture. IEEE Trans. Neural Networks and Learning Systems 29 (1), pp. 10–24.
- (2014) DeCAF: a deep convolutional activation feature for generic visual recognition. In ICML, pp. 647–655.
- (2012) Domain transfer multiple kernel learning. IEEE Trans. Pattern Analysis and Machine Intelligence 34 (3), pp. 465–479.
- (2010) Visual event recognition in videos by learning from web data. In CVPR, pp. 1959–1966.
- (2012) Learning with augmented features for heterogeneous domain adaptation. arXiv.
- (2014) Unsupervised visual domain adaptation using subspace alignment. In ICCV, pp. 2960–2967.
- (2015) Domain-adversarial training of neural networks. JMLR.
- (2012) Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pp. 2066–2073.
- (2014) Generative adversarial nets. In NIPS, pp. 2672–2680.
- (2014) Unsupervised adaptation across domain shifts by generating intermediate data representations. IEEE Trans. Pattern Analysis and Machine Intelligence 36 (11), pp. 2288–2302.
- (2011) Domain adaptation for object recognition: an unsupervised approach. In ICCV, pp. 999–1006.
- (2014) Asymmetric and category invariant feature transformations for domain adaptation. IJCV 109 (1-2), pp. 28–41.
- (2016) FCNs in the wild: pixel-level adversarial and constraint-based adaptation. arXiv.
- (2017) CyCADA: cycle-consistent adversarial domain adaptation. arXiv.
- (2015) Deep transfer metric learning. In ICCV, pp. 325–333.
- (2012) Extreme learning machine for regression and multiclass classification. IEEE Trans. Systems, Man, and Cybernetics: Systems 42 (2), pp. 513–529.
- (2006) Correcting sample selection bias by unlabeled data. In NIPS, pp. 601–608.
- (2014) Maximum mean discrepancy for class ratio estimation: convergence bounds and kernel selection. In ICML, pp. 530–538.
- (2012) Robust visual domain adaptation with low-rank reconstruction. In CVPR, pp. 2168–2175.
- (2017) Learning to discover cross-domain relations with generative adversarial networks. In ICML.
- (2011) What you saw is not what you get: domain adaptation using asymmetric kernel transforms. In CVPR, pp. 1785–1792.
- (2018) Cross-domain extreme learning machine for domain adaptation. IEEE Trans. Systems, Man, and Cybernetics: Systems.
- (2016) Coupled generative adversarial networks.
- (2015) Learning transferable features with deep adaptation networks. In ICML, pp. 97–105.
- (2013) Transfer sparse coding for robust image representation. In ICCV, pp. 407–414.
- (2014) Transfer joint matching for unsupervised domain adaptation. In CVPR, pp. 1410–1417.
- (2016) Deep transfer learning with joint adaptation networks. In ICML.
- (2016) Unsupervised domain adaptation with residual transfer networks. In NIPS, pp. 136–144.
- (2017) Deep transfer learning with joint adaptation networks. In ICML, pp. 2208–2217.
- (2017) When unsupervised domain adaptation meets tensor representations. In ICCV.
- (2018) Early fault detection approach with deep architectures. IEEE Trans. Instrumentation and Measurement 67 (7), pp. 1679–1689.
- (2018) Binary volumetric convolutional neural networks for 3-d object recognition. IEEE Trans. Instrumentation and Measurement, pp. 1–11.
- (2014) Conditional generative adversarial nets. Computer Science, pp. 2672–2680.
- (2014) Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, pp. 1717–1724.
- (2011) Domain adaptation via transfer component analysis. IEEE Trans. Neural Networks 22 (2), pp. 199–210.
- (2010) A survey on transfer learning. IEEE Trans. Knowledge and Data Engineering 22 (10), pp. 1345–1359.
- (2011) Columbia object image library (COIL-20). Computer.
- (2010) Adapting visual category models to new domains. In ECCV, pp. 213–226.
- (2016) Improved techniques for training GANs. In NIPS, pp. 2234–2242.
- (2018) Generate to adapt: aligning domains using generative adversarial networks. In CVPR.
- (2014) Generalized transfer subspace learning through low-rank constraint. IJCV 109 (1-2), pp. 74–93.
- (2014) CNN features off-the-shelf: an astounding baseline for recognition. In CVPR, pp. 806–813.
- (2017) Training a new instrument to measure cotton fiber maturity using transfer learning. IEEE Trans. Instrumentation and Measurement 66 (7), pp. 1668–1678.
- (2015) Simultaneous deep transfer across domains and tasks. In ICCV, pp. 4068–4076.
- (2017) Adversarial discriminative domain adaptation. In CVPR.
- (2014) Deep domain confusion: maximizing for domain invariance. Computer Science.
- (2017) Fredholm multiple kernel learning for semi-supervised domain adaptation. In AAAI.
- (2018) A new deep transfer learning based on sparse auto-encoder for fault diagnosis. IEEE Trans. Systems, Man, and Cybernetics: Systems 49 (1), pp. 136–144.
- (2015) Transfer learning from deep features for remote sensing and poverty mapping. arXiv.
- (2015) Discriminative transfer subspace learning via low-rank and sparse representation. IEEE Trans. Image Processing 25 (2), pp. 850–863.
- (2017) Correcting instrumental variation and time-varying drift using parallel and serial multitask learning. IEEE Trans. Instrumentation and Measurement 66 (9), pp. 2306–2316.
- (2016) Autoencoder with invertible functions for dimension reduction and image reconstruction. IEEE Trans. Systems, Man, and Cybernetics: Systems 48 (7), pp. 1065–1079.
- (2017) DualGAN: unsupervised dual learning for image-to-image translation. In ICCV, pp. 2868–2876.
- (2018) Abnormal odor detection in electronic nose via self-expression inspired extreme learning machine. IEEE Trans. Systems, Man, and Cybernetics: Systems.
- (2017) Odor recognition in multiple e-nose systems with cross-domain discriminative subspace learning. IEEE Trans. Instrumentation and Measurement 66 (2), pp. 198–211.
- (2015) Domain adaptation extreme learning machines for drift compensation in e-nose systems. IEEE Trans. Instrumentation and Measurement 64 (7), pp. 1790–1801.
- (2016) Robust visual knowledge transfer via extreme learning machine-based domain adaptation. IEEE Trans. Image Processing 25 (3), pp. 4959–4973.
- (2018) Efficient solutions for discreteness, drift, and disturbance (3d) in electronic olfaction. IEEE Trans. Systems, Man, and Cybernetics: Systems 48 (2), pp. 242–254.
- (2016) LSDT: latent sparse domain transfer learning for visual adaptation. IEEE Trans. Image Processing 25 (3), pp. 1177–1191.
- (2012) A grassmann manifold-based domain adaptation approach. In Pattern Recognition (ICPR), 2012 21st International Conference on.
- (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks.